Showing posts with label HAWQ. Show all posts
Showing posts with label HAWQ. Show all posts

Thursday, November 6, 2014

OLTP vs MPP vs Hadoop


Some of my friends asked me about OLTP, MPP and Hadoop. I tried to expain them as below.
This is related to the time writing this post. Things are changing so fast :).

OLTP Databases (Oracle,DB2) vs MPP (Netezza, Teradata, Vertica et.):

                1. - DB Oracle or DB2 needs to read data from disk to memory before start processing, so very fast in memory calculation.
                   - MPP takes the processing as close possible to the data, so less data movement
             
                2. - DB Oracle or DB2 is good for smaller OLTP (transaction) operations. It also maintains very high level of data intigrity.
                   - MPP is good for batch processing. Some of the MPP(Netezza, Vertica) overlooks intigrity like enforcing unique key for the sake of batch performance.


Hadoop(without impala or EMC HAWQ) vs MPP:

                1. - Conventional MPP database stores data in a matured internal structure. So data loading and data processing with SQL is efficient.
                   - There are no such structured architecture for data stored on hadoop. So, accessing and loading data is not as efficient as conventional MPP systems.
                2. - With conventional MPP, it support only relational models(row-column)
                   - hadoop support virtually any kind of data.
             
                * However the main objective of MPP and hadoop is same, process data parallely near storage.
             

Cloudera impala(or pivotal HAWQ) vs MPP:

                1. - MPP supports advanced indatabase analytics
                   - Till now (impala 2.0) started supporting "SQL 2003" which may lead them to intruduce indatabase analytics.
                2. - MPP databases have industry standard security features and well defined user schema structure.
                   - Impala has very immatured security system and virtually no user schema.
                3. - MPP only supports only vendor specific filesystem and need to load data using specific loading tool.
                   - impala supports most open file formats (text, parquate)
             
                * However impala seems to become a MPP & Columnar like Vertica but cheap & open database system in near future. Just need to implement security and advance indatabase analytics.


How to choice what (in general and my personal opinion):


1. OLTP Databases (Oracle,DB2, MySQL, MS SQL, Exadata):
                - Transaction based application
                - Smaller DWH
                * However Exadata is a hybrid system and I have experience to handle DWH with ~20TB data.

2. MPP (Netezza, Teradata, Vertica)
                - Bigger Data warehouse (may be having tables with size more than 4-5 TB)  
                - Needs no or little pre-processing
                - Needs faster batch processing speed
                - In database analytics

3. Only Hadoop:
                - All data as heavily unstructured (documents, audio, video etc)
                - need to process in batch

4. Hadoop and using mainly Impala (or EMC HAWQ)
                - Need to have a DWH with low cost
                - No need to have advance analytics features
                - Can utilize open source tools
                - Not concern about security or limited number of users
             
5. Hadoop(with impala or HAWQ) + MPP:
                - Some Data need heavy pre-processing before ready to advance analytics.
                - Need cheaper query able archive or backup for older data


Referencs:
http://www.quora.com/Is-Impala-aiming-to-be-an-open-source-alternative-to-existing-MPP-solutions
http://www.quora.com/What-features-of-a-relational-database-are-most-useful-when-building-a-data-warehouse
http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduc