Some of my friends asked me about OLTP, MPP and Hadoop. I tried to expain them as below.
This is related to the time writing this post. Things are changing so fast :).
OLTP Databases (Oracle,DB2) vs MPP (Netezza, Teradata, Vertica et.):
1. - DB Oracle or DB2 needs to read data from disk to memory before start processing, so very fast in memory calculation.- MPP takes the processing as close possible to the data, so less data movement
2. - DB Oracle or DB2 is good for smaller OLTP (transaction) operations. It also maintains very high level of data intigrity.
- MPP is good for batch processing. Some of the MPP(Netezza, Vertica) overlooks intigrity like enforcing unique key for the sake of batch performance.
Hadoop(without impala or EMC HAWQ) vs MPP:
1. - Conventional MPP database stores data in a matured internal structure. So data loading and data processing with SQL is efficient.- There are no such structured architecture for data stored on hadoop. So, accessing and loading data is not as efficient as conventional MPP systems.
2. - With conventional MPP, it support only relational models(row-column)
- hadoop support virtually any kind of data.
* However the main objective of MPP and hadoop is same, process data parallely near storage.
Cloudera impala(or pivotal HAWQ) vs MPP:
1. - MPP supports advanced indatabase analytics- Till now (impala 2.0) started supporting "SQL 2003" which may lead them to intruduce indatabase analytics.
2. - MPP databases have industry standard security features and well defined user schema structure.
- Impala has very immatured security system and virtually no user schema.
3. - MPP only supports only vendor specific filesystem and need to load data using specific loading tool.
- impala supports most open file formats (text, parquate)
* However impala seems to become a MPP & Columnar like Vertica but cheap & open database system in near future. Just need to implement security and advance indatabase analytics.
How to choice what (in general and my personal opinion):
1. OLTP Databases (Oracle,DB2, MySQL, MS SQL, Exadata):
- Transaction based application
- Smaller DWH
* However Exadata is a hybrid system and I have experience to handle DWH with ~20TB data.
2. MPP (Netezza, Teradata, Vertica)
- Bigger Data warehouse (may be having tables with size more than 4-5 TB)
- Needs no or little pre-processing
- Needs faster batch processing speed
- In database analytics
3. Only Hadoop:
- All data as heavily unstructured (documents, audio, video etc)
- need to process in batch
4. Hadoop and using mainly Impala (or EMC HAWQ)
- Need to have a DWH with low cost
- No need to have advance analytics features
- Can utilize open source tools
- Not concern about security or limited number of users
5. Hadoop(with impala or HAWQ) + MPP:
- Some Data need heavy pre-processing before ready to advance analytics.
- Need cheaper query able archive or backup for older data
Referencs:
http://www.quora.com/Is-Impala-aiming-to-be-an-open-source-alternative-to-existing-MPP-solutions
http://www.quora.com/What-features-of-a-relational-database-are-most-useful-when-building-a-data-warehouse
http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduc
