toddler DBA: HAWQ

Some of my friends asked me about OLTP, MPP and Hadoop. I tried to expain them as below.
This is related to the time writing this post. Things are changing so fast :).

OLTP Databases (Oracle,DB2) vs MPP (Netezza, Teradata, Vertica et.):

1. - DB Oracle or DB2 needs to read data from disk to memory before start processing, so very fast in memory calculation.
- MPP takes the processing as close possible to the data, so less data movement

2. - DB Oracle or DB2 is good for smaller OLTP (transaction) operations. It also maintains very high level of data intigrity.
- MPP is good for batch processing. Some of the MPP(Netezza, Vertica) overlooks intigrity like enforcing unique key for the sake of batch performance.

Hadoop(without impala or EMC HAWQ) vs MPP:

1. - Conventional MPP database stores data in a matured internal structure. So data loading and data processing with SQL is efficient.
- There are no such structured architecture for data stored on hadoop. So, accessing and loading data is not as efficient as conventional MPP systems.
2. - With conventional MPP, it support only relational models(row-column)
- hadoop support virtually any kind of data.

* However the main objective of MPP and hadoop is same, process data parallely near storage.

Cloudera impala(or pivotal HAWQ) vs MPP:

1. - MPP supports advanced indatabase analytics
- Till now (impala 2.0) started supporting "SQL 2003" which may lead them to intruduce indatabase analytics.
2. - MPP databases have industry standard security features and well defined user schema structure.
- Impala has very immatured security system and virtually no user schema.
3. - MPP only supports only vendor specific filesystem and need to load data using specific loading tool.
- impala supports most open file formats (text, parquate)

* However impala seems to become a MPP & Columnar like Vertica but cheap & open database system in near future. Just need to implement security and advance indatabase analytics.

How to choice what (in general and my personal opinion):

1. OLTP Databases (Oracle,DB2, MySQL, MS SQL, Exadata):
- Transaction based application
- Smaller DWH
* However Exadata is a hybrid system and I have experience to handle DWH with ~20TB data.

2. MPP (Netezza, Teradata, Vertica)
- Bigger Data warehouse (may be having tables with size more than 4-5 TB)
- Needs no or little pre-processing
- Needs faster batch processing speed
- In database analytics

3. Only Hadoop:
- All data as heavily unstructured (documents, audio, video etc)
- need to process in batch

4. Hadoop and using mainly Impala (or EMC HAWQ)
- Need to have a DWH with low cost
- No need to have advance analytics features
- Can utilize open source tools
- Not concern about security or limited number of users

5. Hadoop(with impala or HAWQ) + MPP:
- Some Data need heavy pre-processing before ready to advance analytics.
- Need cheaper query able archive or backup for older data

Referencs:
http://www.quora.com/Is-Impala-aiming-to-be-an-open-source-alternative-to-existing-MPP-solutions
http://www.quora.com/What-features-of-a-relational-database-are-most-useful-when-building-a-data-warehouse
http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduc

toddler DBA

Thursday, November 6, 2014

OLTP vs MPP vs Hadoop

OLTP Databases (Oracle,DB2) vs MPP (Netezza, Teradata, Vertica et.):

Hadoop(without impala or EMC HAWQ) vs MPP:

Cloudera impala(or pivotal HAWQ) vs MPP:

How to choice what (in general and my personal opinion):