Summary: Here are five additional factors you’ll want to consider when putting together your short list.
If you stuck with us through the first eight lessons, you now know most of the major considerations in selecting the right NOSQL or NewSQL database to support your Big Data project. However, as you work with your IT staff or consultants to put together your short list, a few additional considerations remain.
1. The Distinctions among Vendors of the Four Types of NOSQL Databases are Becoming Less Clear as Vendors Add Features and Move to Support Multiple Database Types.
If you read the previous lessons carefully, you may already have matched your project with a particular NOSQL type. But be aware that many vendors now offer capabilities that cross the traditional boundaries among the four types. This graphic by 451 Research shows the state of the market in 2012, and you can already see the crossovers emerging between Key Value and Document, and between Key Value and Column Oriented (shown here as Big Table). The lesson here is that you will need to cast a wider net to ensure you have all the qualified suppliers, and of course, things have changed quite a bit even since 2012. Remember also that providers of Big Data suites may package multiple database types within the same suite. You are likely to find that you will need more than one type as your competence with Big Data grows.
2. The Best Sellers of the Last Few Years May or May Not Be Right for You.
As young as this market is, there are already some market leaders. Good data on market share is hard to come by because many sales come from the major vendors, which lump SQL and NOSQL sales together. Leaving aside majors such as IBM, Oracle, Microsoft, and SAP, the adoption of NOSQL databases looks a little like this graphic, also from 451 Research (January 2012).
This graph shows the responses of 165 users of MySQL, the most popular SQL RDBMS but not the only one. It’s doubly interesting because it shows not only those who have deployed NOSQL but also those who were considering each one. Mongo and Couch are Document Oriented DBs. HBase and Cassandra are Column Oriented DBs. And Redis is a pure Key-Value DB. Note that the data is old enough that Couchbase and MapR are not shown, though both belong in the mix of suppliers with high numbers of installs. But it is clear from the chart that Mongo had, and probably still has, a sizable lead in its installed base. Does this mean you should give more weight to considering Mongo? Not necessarily. Features continue to evolve, and other vendors may now surpass Mongo in the features you need for your specific project. However, it is generally true that because of its larger installed base it is easier to find developers experienced in Mongo. Given that qualified and experienced labor for each type of DB is in short supply, you might conclude that the higher availability of experienced developers adds points to Mongo’s evaluation, all other factors being equal.
3. The Performance of Your System Will Vary a Lot Depending on How It’s Used.
If your interest is primarily analytical, you may be less concerned with the responsiveness of your system (latency), since your data scientists can organize their work to accommodate batch runs that may take a few hours or even overnight. But if your system is to be customer-facing, you should be very concerned about latency.
As it turns out, the performance of different NOSQL types and even among vendors of the same type can vary widely. A system that typically operates on 95% reads/5% writes will perform much differently from a system used for 50% reads/50% writes. A system asked for 1,000 simple queries per second will perform differently when that same system is asked for 100 complex queries per second. Here are two representative charts from Altoros, a company that specializes in integrating different DB types on the web and conducting performance testing.
The Y axis is latency (the delay your users will experience), while the X axis is volume. Note in this case that HBase and Cassandra are Column Oriented DBs and Mongo is a Document DB. Some of the tested DBs were not able to perform at all past certain volume levels. The purpose here is to illustrate how markedly system performance may vary based on volume, activity types, and complexity of search.
There are two lessons here. First, if you have a high volume customer-facing system you must test, test, test. Second, always start with a proof-of-concept project before committing major resources to ensure you understand what the required performance parameters will be and whether a satisfactory balance can be achieved.
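As a starting point for that kind of testing, here is a minimal sketch of a harness that replays a chosen read/write mix and reports latency. It is an illustration under stated assumptions, not a benchmark of any real product: `InMemoryStore` is a hypothetical stand-in for whatever database client you are evaluating, and the function names, key format, and operation counts are invented for the example.

```python
import random
import statistics
import time


class InMemoryStore:
    """Hypothetical stand-in for a real database client (assumption)."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data.get(key)

    def write(self, key, value):
        self._data[key] = value


def run_workload(store, operations, read_ratio, key_space=1000, seed=0):
    """Issue `operations` requests, a `read_ratio` fraction of them reads.

    Returns two lists of per-request latencies in seconds: (reads, writes).
    """
    rng = random.Random(seed)
    read_latencies, write_latencies = [], []
    for _ in range(operations):
        key = f"user:{rng.randrange(key_space)}"
        is_read = rng.random() < read_ratio
        start = time.perf_counter()
        if is_read:
            store.read(key)
        else:
            store.write(key, {"visits": rng.randrange(100)})
        elapsed = time.perf_counter() - start
        (read_latencies if is_read else write_latencies).append(elapsed)
    return read_latencies, write_latencies


def summarize(latencies):
    """Return count, mean, and nearest-rank 95th percentile in milliseconds."""
    if not latencies:
        return {"count": 0, "mean_ms": None, "p95_ms": None}
    ordered = sorted(latencies)
    # Nearest-rank percentile using integer arithmetic: ceil(0.95 * n) - 1.
    p95_index = (95 * len(ordered) + 99) // 100 - 1
    return {
        "count": len(ordered),
        "mean_ms": statistics.mean(ordered) * 1000,
        "p95_ms": ordered[p95_index] * 1000,
    }


if __name__ == "__main__":
    # Compare the two mixes mentioned above: 95/5 and 50/50 reads/writes.
    for ratio in (0.95, 0.50):
        reads, writes = run_workload(InMemoryStore(), 10_000, ratio)
        print(f"read ratio {ratio:.2f}")
        print("  reads :", summarize(reads))
        print("  writes:", summarize(writes))
```

To use it in a proof-of-concept, you would replace `InMemoryStore` with a thin wrapper around each candidate database and run the same mixes at increasing volumes. The 95th percentile is reported alongside the mean because a customer-facing system is usually judged by its slow requests, not its average ones.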
4. Look to the Future. How Will Big Data Integrate with Your Overall Data Strategy?
It is possible to do stand-alone Big Data projects. If you are just now experimenting with the value of unstructured or semi-structured text analysis, for example to gauge customer sentiment about your products and services in the blogosphere, you can run that as a completely separate project using a Key-Value or Document Oriented DB. Another stand-alone example would be starting to capture and analyze log data or click stream data. However, let’s be frank: real value is going to come when you are able to combine your traditional RDBMS data with new NOSQL stores to gain even deeper insight.
This means that you need to be thinking from the outset about a data infrastructure that allows you to capture these different types in the most effective DB type (RDBMS or NOSQL), combine them in a common platform, and export them to all manner of dashboards, reports, visualizations, custom queries, predictive models, and optimizations. Remember from our introduction, the path from data to value must pass through predictive analytics.
Our assumption for this series has been that you are not necessarily an experienced IT manager or executive. You are likely a data user, or more probably a leader of data users. So the whole topic of a data infrastructure may seem a little overwhelming. Clearly you will need the support of your CIO and other IT leaders. The graphic here is intended only to illustrate the major components of such a system. This type of depiction is known as a ‘reference architecture’ and there are lots of ways to draw the picture. The important thing is that the components are present and work together.
On the left are your existing data inputs. Currently you have your transactional systems and your data warehouse that are RDBMS and hold your traditional structured data.
Also on the left will be any NOSQL databases you create to handle new unstructured or semi-structured data.
Also on the left will be inputs from any other third party ‘upstream’ data sources. These could come from your supply chain, or they could be outside append data.
The point is that these separate databases can’t be integrated in their native form. That big blue block in the center called the ‘big data platform’ is also sometimes called a Data Lake. Data from the different systems can be brought into this central repository, probably built on a NOSQL Document Oriented DB or perhaps a Column Oriented DB, so it can be queried, exported, and analyzed together.
Across the top in the block labeled ‘big data applications’ are all the elements of predictive analytics ranging from reports and dashboards to predictive models. Only after passing through this analysis layer can value be moved to your business processes.
There is one essential element: a common data store that can hold both your traditional structured data and your new unstructured Big Data, so the different sources can be analyzed together. Don’t leave your Big Data projects out there as separate islands. Bring the data together as part of your overall data strategy.
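To make the idea of a common store concrete, here is a small sketch of bringing structured rows and semi-structured documents together on a shared key. Everything in it is assumed for illustration: an in-memory SQLite database stands in for your transactional RDBMS, a plain list of dictionaries stands in for a NOSQL document store, and the table, field names, and sample values are invented.

```python
import sqlite3


def build_demo_sources():
    """Create the two hypothetical sources: an RDBMS table and a document list."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "customer_id INTEGER PRIMARY KEY, name TEXT, lifetime_value REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Acme", 1200.0), (2, "Globex", 450.0)],
    )
    # Semi-structured documents: fields vary from one record to the next.
    documents = [
        {"customer_id": 1, "source": "blog", "sentiment": "positive"},
        {"customer_id": 1, "source": "forum", "sentiment": "negative",
         "tags": ["shipping"]},
        {"customer_id": 3, "source": "blog", "sentiment": "neutral"},
    ]
    return conn, documents


def load_structured(conn):
    """Read the structured customer rows into a dict keyed by customer_id."""
    rows = conn.execute(
        "SELECT customer_id, name, lifetime_value FROM customers"
    )
    return {
        cid: {"name": name, "lifetime_value": value}
        for cid, name, value in rows
    }


def combine(customers, documents):
    """Attach each document to its customer record in one combined view.

    Documents whose customer_id is unknown to the RDBMS are skipped here;
    a real platform would more likely keep them for later matching.
    """
    combined = {
        cid: {**fields, "mentions": []} for cid, fields in customers.items()
    }
    for doc in documents:
        record = combined.get(doc.get("customer_id"))
        if record is not None:
            record["mentions"].append(doc)
    return combined


if __name__ == "__main__":
    conn, documents = build_demo_sources()
    for cid, record in combine(load_structured(conn), documents).items():
        print(cid, record["name"], len(record["mentions"]), "mentions")
```

The point of the sketch is the shape of the result: once both sources sit in one keyed view, a report or predictive model can use lifetime value and sentiment in the same query. At real scale that combined view would live in the central platform rather than in application memory.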
5. It’s Very Likely That You’ll Want More Than One Type of NOSQL / NewSQL DB.
Here’s a semi-technical phrase you can drop when you have this conversation with your tech support team: “Polyglot Persistence”. Polyglot means speaking many different languages. In the context of NOSQL/NewSQL it has come to mean that you should use the DB type (KV, Column, Document, Graph) that has the best features for the opportunity at hand. By extension, it means that you will want to use several different types of DBs (including RDBMS) depending on the application, and that any good-sized organization will have a variety of DBs for different types of data and different types of processing of that data.
History is bearing this out. In addition to RDBMS, Disney uses Cassandra, Hadoop, and Mongo. Netflix uses Cassandra, HBase, and SimpleDB. Twitter uses Cassandra, FlockDB, HBase, and MySQL. Mendeley uses HBase, Mongo, Solr, and Voldemort. This graphic will give you a little better idea.
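One way to picture polyglot persistence is as a catalog that records which type of store each dataset lives in. The sketch below is only an illustration: the `PolyglotCatalog` class, the dataset names, and the particular assignments are hypothetical, and the right assignment for your own data depends on the considerations covered in the earlier lessons.

```python
from enum import Enum


class StoreType(Enum):
    """The database types discussed in this series."""
    RDBMS = "rdbms"
    KEY_VALUE = "key-value"
    COLUMN = "column"
    DOCUMENT = "document"
    GRAPH = "graph"


class PolyglotCatalog:
    """Hypothetical registry mapping each dataset to the store type it uses."""

    def __init__(self):
        self._assignments = {}

    def assign(self, dataset, store_type):
        if not isinstance(store_type, StoreType):
            raise TypeError("store_type must be a StoreType member")
        self._assignments[dataset] = store_type

    def store_for(self, dataset):
        # Raises KeyError for a dataset that was never assigned.
        return self._assignments[dataset]

    def datasets_by_store(self):
        """Group dataset names under the store type that holds them."""
        grouped = {}
        for dataset, store_type in sorted(self._assignments.items()):
            grouped.setdefault(store_type, []).append(dataset)
        return grouped


def build_example_catalog():
    """Illustrative assignments only; not a recommendation."""
    catalog = PolyglotCatalog()
    catalog.assign("orders", StoreType.RDBMS)
    catalog.assign("session_cache", StoreType.KEY_VALUE)
    catalog.assign("clickstream_events", StoreType.COLUMN)
    catalog.assign("customer_reviews", StoreType.DOCUMENT)
    catalog.assign("referral_network", StoreType.GRAPH)
    return catalog


if __name__ == "__main__":
    for store_type, datasets in build_example_catalog().datasets_by_store().items():
        print(f"{store_type.value}: {', '.join(datasets)}")
```

Even a simple inventory like this is useful in planning conversations, because it makes visible how many different store types a single organization ends up running.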
So in the conversation above about planning for the future and thinking in terms of an overall data strategy and overall data platform, that decision needs to incorporate the idea that there will be several different types of NOSQL DBs in your arsenal. Then you’ll be prepared for any opportunity.
July 23, 2014
Bill Vorhies, President & Chief Data Scientist – Data-Magnum