Designing data-intensive applications has gained momentum in the IT industry, to the point that a separate term - Data Engineering - has been defined to cover the disciplines around collecting, transforming and using data for analysis, data science (including but not limited to machine learning), and other ways of making data usable.
Over the last few decades, Data Warehouses have offered a unified approach to data within large organizations, typically referred to as enterprises. However well an integrated data model might serve its users, the processes to ingest, prepare, transform and make such data ready for usage are always a meticulous and lengthy journey to embark on.
With so-called Big Data tools emerging in the last 10-15 years, enterprises started building Data Lakes, which enabled easy ingestion of data in different shapes and through different means, including metric streams, semi-structured or even largely schema-less files, and so on. Smart companies soon figured out that while the data was easy to ingest, data in Data Lakes was too difficult to access and use, so the lakes effectively became large Data Swamps.
Then, in 2021, we started hearing a lot about the Data Lakehouse - a concept that aims to combine the best capabilities of both the Warehouse and the Lake. Databricks, the company behind the commercial data platform of the same name, presented the idea at their annual Data + AI Summit and invited Bill Inmon himself, who, together with Mary Levins and Ranjeet Srivastava, published the book "Building the Data Lakehouse" that very same year.
Just a year later I started working with the Data Vault system of business intelligence - a very strict way of approaching data in an enterprise which, in return, offers easy extension and consistent usage of the data model and, when enhanced with automation, can speed up deliveries multiple times.
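To make this a bit more concrete, here is a minimal sketch of the two most basic Data Vault constructs: a Hub carrying only the hashed business key, and a Satellite carrying descriptive attributes plus a hash diff for change detection. The entity, the column names and the MD5 hashing are illustrative assumptions for this example, not a prescription from the Data Vault standard or from any specific tool.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(*business_key_parts: str) -> str:
    """Deterministic hash of a business key - the join key shared by Hub, Link and Satellite rows."""
    normalized = "||".join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def hash_diff(attributes: dict) -> str:
    """Hash of all descriptive attributes - lets a loader detect changes with a single comparison."""
    payload = "||".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Hypothetical source record for a customer entity.
record = {"customer_id": "C-1001", "name": "Acme Ltd", "country": "NL"}
load_ts = datetime.now(timezone.utc)

# Hub row: one row per business key, nothing else.
hub_row = {
    "customer_hash_key": hash_key(record["customer_id"]),
    "customer_id": record["customer_id"],
    "load_ts": load_ts,
    "record_source": "crm",
}

# Satellite row: descriptive attributes, inserted only when the hash diff changes.
sat_row = {
    "customer_hash_key": hub_row["customer_hash_key"],
    "hash_diff": hash_diff({"name": record["name"], "country": record["country"]}),
    "name": record["name"],
    "country": record["country"],
    "load_ts": load_ts,
    "record_source": "crm",
}

print(hub_row)
print(sat_row)
```

Because every entity follows the same repeating Hub/Link/Satellite pattern, loading code like this can be generated from metadata rather than hand-written, which is where much of the automation speed-up comes from.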
Now, inspired by all of this and by 14+ years of experience in the field, I'm offering an overview of an Enterprise Data Platform architecture built from Layers (or Stages) of data refinement, along with an attempt to suggest technologies and approaches that meet the exact purpose of each Layer.
Layer | Source Layer | Landing Layer | Staging Layer | Integration Layer | Delivery Layer |
---|---|---|---|---|---|
Purpose | Source systems (real-time, semi-real-time and batch) | Temporary storage that ensures data can be retrieved while it is still available from the source. Data structure and granularity are exactly as in the source (a landing sketch follows the table). | Persistent storage of data as-is. Structure and granularity are the same as in the source, with technical attributes added and a certain type of Slowly Changing Dimension (SCD) applied to maintain history (an SCD2 sketch follows the table). | General, enterprise-wide data storage with entities unified across the whole enterprise. | Consumer-accessible area for data consumption. Data at this level may be interpreted, derived, combined, aggregated, transposed, shaped, etc. in any way consumers desire. Several stages of data preparation may apply, therefore this layer also provides persisted storage to house intermediary results or cater for complex intermediary calculations until the data is prepared. |
Content | | | Landed data with history | | |
Technology stack | Any | Big data (Data Lake) | Big data (Data Lake / Data Lakehouse) | Relational Databases (Data Warehouse) | Big data (Data Lakehouse), Relational Databases (Data Warehouse) |
Storage | Depends on source | | Static, persisted storage. | Relational Database Management System or other row-level storage with JOIN capabilities. | Row or Column based storage engines, depending on consumer needs and usage patterns. |
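As a minimal illustration of the Landing Layer described above, the sketch below stores a hypothetical batch extract exactly as received, partitioned only by load date in the path. The ./landing location, the file naming and the pandas/pyarrow choice are assumptions for the example; in a real platform this would typically be object storage fed by an ingestion tool.

```python
import os
from datetime import date

import pandas as pd  # requires pyarrow (or fastparquet) for Parquet output

# Hypothetical batch extract, kept exactly as delivered by the source system -
# same structure, same granularity, no transformations.
extract = pd.DataFrame(
    {
        "order_id": [101, 102],
        "customer_id": ["C-1001", "C-1002"],
        "amount": [250.0, 99.5],
    }
)

# Partition by load date in the path only, so the data itself stays untouched.
target_dir = f"./landing/orders/load_date={date.today().isoformat()}"
os.makedirs(target_dir, exist_ok=True)
extract.to_parquet(os.path.join(target_dir, "part-0000.parquet"), index=False)
```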
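The Staging Layer's history keeping can be illustrated with a minimal SCD Type 2 step: compare the newly landed snapshot against the current staged rows, close the rows that changed and insert new current versions. Column names (customer_id, valid_from, valid_to, is_current) and the single descriptive attribute are simplifying assumptions, and new or deleted keys are left out for brevity; a real implementation would usually run on the lakehouse engine itself rather than in pandas.

```python
import pandas as pd

HIGH_DATE = pd.Timestamp("9999-12-31")

# Existing staged history (SCD2): each key has exactly one current row.
staged = pd.DataFrame(
    {
        "customer_id": ["C-1001", "C-1002"],
        "country": ["NL", "DE"],
        "valid_from": pd.to_datetime(["2024-01-01", "2024-01-01"]),
        "valid_to": [HIGH_DATE, HIGH_DATE],
        "is_current": [True, True],
    }
)

# Newly landed snapshot of the same source table.
landed = pd.DataFrame({"customer_id": ["C-1001", "C-1002"], "country": ["BE", "DE"]})
load_ts = pd.Timestamp("2024-06-01")

# Compare current rows with the new snapshot to find changed business keys.
current = staged[staged["is_current"]]
merged = current.merge(landed, on="customer_id", suffixes=("_old", "_new"))
changed_keys = merged.loc[merged["country_old"] != merged["country_new"], "customer_id"]

# Close the outdated rows ...
to_close = staged["customer_id"].isin(changed_keys) & staged["is_current"]
staged.loc[to_close, "valid_to"] = load_ts
staged.loc[to_close, "is_current"] = False

# ... and append a new current row for every changed key.
new_rows = landed[landed["customer_id"].isin(changed_keys)].assign(
    valid_from=load_ts, valid_to=HIGH_DATE, is_current=True
)
staged = pd.concat([staged, new_rows], ignore_index=True)

print(staged.sort_values(["customer_id", "valid_from"]))
```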
More aspects will be added to describe each Layer in more detail - stay tuned for updates.
Books and materials for inspiration:
- Inmon, William H. (2005) Building the Data Warehouse, Fourth Edition. Wiley Publishing, Inc. ISBN 978-0-7645-9944-6
- Inmon, W.H.; Linstedt, Daniel; Levins, Mary (2019) Data Architecture. A Primer for the Data Scientist. Second Edition. Elsevier. ISBN 978-0-12-816916-2
- Inmon, Bill; Levins, Mary; Srivastava, Ranjeet (2021) Building the Data Lakehouse. First printing. Technics Publications. ISBN 978-1-63462-966-9
- Srivastava, Ranjeet; Inmon, Bill (2022) The Data Lakehouse Architecture. Technics Publications. ISBN 978-1-63462-278-3
- Linstedt, Dan (2023) DRAFT: Data Vault 2.1 Data Architecture Specification v2.1.0.