Thursday, November 6, 2014

OLTP vs MPP vs Hadoop


Some of my friends asked me about OLTP, MPP and Hadoop. I tried to expain them as below.
This is related to the time writing this post. Things are changing so fast :).

OLTP Databases (Oracle,DB2) vs MPP (Netezza, Teradata, Vertica et.):

                1. - DB Oracle or DB2 needs to read data from disk to memory before start processing, so very fast in memory calculation.
                   - MPP takes the processing as close possible to the data, so less data movement
             
                2. - DB Oracle or DB2 is good for smaller OLTP (transaction) operations. It also maintains very high level of data intigrity.
                   - MPP is good for batch processing. Some of the MPP(Netezza, Vertica) overlooks intigrity like enforcing unique key for the sake of batch performance.


Hadoop(without impala or EMC HAWQ) vs MPP:

                1. - Conventional MPP database stores data in a matured internal structure. So data loading and data processing with SQL is efficient.
                   - There are no such structured architecture for data stored on hadoop. So, accessing and loading data is not as efficient as conventional MPP systems.
                2. - With conventional MPP, it support only relational models(row-column)
                   - hadoop support virtually any kind of data.
             
                * However the main objective of MPP and hadoop is same, process data parallely near storage.
             

Cloudera impala(or pivotal HAWQ) vs MPP:

                1. - MPP supports advanced indatabase analytics
                   - Till now (impala 2.0) started supporting "SQL 2003" which may lead them to intruduce indatabase analytics.
                2. - MPP databases have industry standard security features and well defined user schema structure.
                   - Impala has very immatured security system and virtually no user schema.
                3. - MPP only supports only vendor specific filesystem and need to load data using specific loading tool.
                   - impala supports most open file formats (text, parquate)
             
                * However impala seems to become a MPP & Columnar like Vertica but cheap & open database system in near future. Just need to implement security and advance indatabase analytics.


How to choice what (in general and my personal opinion):


1. OLTP Databases (Oracle,DB2, MySQL, MS SQL, Exadata):
                - Transaction based application
                - Smaller DWH
                * However Exadata is a hybrid system and I have experience to handle DWH with ~20TB data.

2. MPP (Netezza, Teradata, Vertica)
                - Bigger Data warehouse (may be having tables with size more than 4-5 TB)  
                - Needs no or little pre-processing
                - Needs faster batch processing speed
                - In database analytics

3. Only Hadoop:
                - All data as heavily unstructured (documents, audio, video etc)
                - need to process in batch

4. Hadoop and using mainly Impala (or EMC HAWQ)
                - Need to have a DWH with low cost
                - No need to have advance analytics features
                - Can utilize open source tools
                - Not concern about security or limited number of users
             
5. Hadoop(with impala or HAWQ) + MPP:
                - Some Data need heavy pre-processing before ready to advance analytics.
                - Need cheaper query able archive or backup for older data


Referencs:
http://www.quora.com/Is-Impala-aiming-to-be-an-open-source-alternative-to-existing-MPP-solutions
http://www.quora.com/What-features-of-a-relational-database-are-most-useful-when-building-a-data-warehouse
http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduc