How to Build a Modern Cloud Data Warehouse

Things to consider when building a cloud data warehouse

Vis - The Data Enthusiast
6 min read · Nov 11, 2021

To compete in a data-driven world, it is essential to have a multi-functional MPP system in place for in-depth analytics, so that data and insights can be derived from it.

The data strategy should be designed around not only the present requirements but also how the data can be stored, processed, and retrieved for any new data requirements and volumes, and should be able to withstand future loads.


It really depends on which architecture the organisation is adhering to, Lambda or Kappa, but irrespective of the chosen architecture there are some mandatory things that need to be followed.

Here I have curated as many things as possible to know when trying to build a modern data warehouse in the cloud.

Problem statement - Data Silos

One of the most common challenges organisations face is data silos, where data held by one group is not accessible, or only partially accessible, to other groups, which often leads to overlapping-but-inconsistent data. As the data grows, silos continue to grow as well.

Data silos

Unified Data Platform

A company may have multiple source systems corresponding to its internal teams and business requirements. To eliminate data silos, having a unified/centralised version of the enterprise data in the cloud is the optimal solution for a holistic view of hidden opportunities and insights, and for analysing any possible threats.

There are still companies with only OLTP systems for the operational transactions happening in the front end, but in order to derive any analytics and insights we need a strong OLAP system built on an MPP engine, particularly in the cloud.

Data Engineering Layers

As depicted in the diagram above, these are the must-have engineering layers with regard to data, which most of us are already aware of, but in addition to these I want to bring in other major considerations. Continue reading.

Compute & Storage

As part of running any business, larger volumes of data will have been handled over the years. In order to handle and satisfy future increases in data volume, choosing the right compute and storage enables us to meet that need, especially in the cloud.

The proposal recommends the cloud data warehouse as a PaaS solution instead of IaaS, as the latter would need more dedicated, hands-on infra support and would not handle and satisfy the requirements.

One of the key benefits of a cloud data warehouse is the separation of storage and compute. This is important because storage cost in the cloud is much lower than compute cost, so you can keep all your data cheaply and pay for compute only while it runs.
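
To make this concrete, here is a back-of-envelope sketch in Python. The prices and sizing are invented assumptions purely for illustration, not any provider's actual rates:

```python
# Back-of-envelope comparison: why separating storage from compute pays off.
# All prices are illustrative assumptions, not any provider's actual rates.

STORAGE_PER_TB_MONTH = 20.0   # assumed $/TB/month for cloud object storage
COMPUTE_PER_HOUR = 5.0        # assumed $/hour for a warehouse compute cluster

def monthly_cost(data_tb: float, compute_hours: float) -> float:
    """Storage is billed continuously; compute only while it runs."""
    return data_tb * STORAGE_PER_TB_MONTH + compute_hours * COMPUTE_PER_HOUR

# Coupled model: compute sized for peak and left running 24x7.
always_on = monthly_cost(data_tb=100, compute_hours=24 * 30)

# Decoupled model: same data, but compute paused outside an 8-hour workday.
pay_per_use = monthly_cost(data_tb=100, compute_hours=8 * 22)

print(f"Always-on compute : ${always_on:,.0f}/month")
print(f"Pause-when-idle   : ${pay_per_use:,.0f}/month")
```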

MPP

Having a Massively Parallel Processing (MPP) engine to handle larger datasets is ideal for storing and processing data in the cloud. This mostly fits cloud data warehouses like Amazon Redshift, Google BigQuery, Snowflake, and Azure Synapse.

Just imagine the Control Node as the master and the Compute Nodes as the workers.

MPP Engine

The master knows how many workers to assign to the job and splits the work across the workers into small chunks. These cloud data warehouses are built on top of this MPP architecture; a compute node basically means a VM that is being used under the hood.
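
A rough way to picture this is the sketch below, where plain Python worker processes stand in for compute nodes and a parallel partial SUM stands in for a distributed query:

```python
# A minimal sketch of the MPP idea: a "control node" splits a query's work
# into chunks and "compute nodes" (worker processes) scan them in parallel.
from multiprocessing import Pool

def scan_chunk(chunk: list) -> int:
    """Each worker aggregates its own slice of the data (a partial SUM)."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))          # stand-in for a large table
    n_workers = 4                          # stand-in for compute nodes
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(n_workers) as pool:          # the control node fans work out...
        partials = pool.map(scan_chunk, chunks)

    print(sum(partials))                   # ...and merges the partial results
```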

The important thing is not only how you store the data but also how efficiently you retrieve it.

To know more on “What it takes to run petabyte-scale data?”, follow my article here

To know more on “Why should you consider a Data Lake in your Cloud Architecture?”, follow my article here

Decoupled Systems

Not everything can be done in a single system; we need to identify the right resource to do the right job wherever it fits, and keep it aligned to the process.

Event Journal Design Pattern

As the data volume increases, it is essential to design an event-driven model that allows us to store and process the data efficiently, i.e. it enables us to understand the journey of the process and where the current process stands, both for tracking and for any diagnostic analysis.
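
A minimal sketch of the pattern, assuming a hypothetical JSON-lines journal file: every state change is appended as an immutable event, and the current state is recovered by replaying the journal.

```python
# Event-journal sketch: state changes are appended as immutable events,
# so the full journey of a pipeline run can be replayed at any time.
import json, time

JOURNAL = "pipeline_events.jsonl"   # hypothetical append-only journal file

def record_event(run_id: str, event: str, detail: str = "") -> None:
    """Append one immutable event; never update or delete past entries."""
    with open(JOURNAL, "a") as f:
        f.write(json.dumps({"run_id": run_id, "event": event,
                            "detail": detail, "ts": time.time()}) + "\n")

def current_state(run_id: str) -> str:
    """Replay the journal to find where the process currently stands."""
    state = "UNKNOWN"
    with open(JOURNAL) as f:
        for line in f:
            e = json.loads(line)
            if e["run_id"] == run_id:
                state = e["event"]
    return state

record_event("run-42", "INGEST_STARTED")
record_event("run-42", "INGEST_COMPLETED", "1.2M rows landed")
print(current_state("run-42"))   # -> INGEST_COMPLETED
```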

Hot, Warm & Cold Data

It is important to understand hot, warm, and cold data. Hot refers to data that is accessed frequently, almost always; warm refers to data that is accessed, but not frequently; and cold refers to data that has not been accessed or processed over a period of time, which you may consider keeping as an archived version.

Tip: Having your data archived and left undisturbed for a period of time drastically reduces the storage cost.

In Azure this is referred to as access tiers (Hot, Cool, Archive); in Google Cloud Storage the equivalent is storage classes (Standard, Nearline, Coldline, Archive), with API calls billed separately as Class A and Class B operations.
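
As a sketch, a tiering policy can be as simple as classifying each dataset by the days since it was last accessed. The 30- and 180-day thresholds below are assumptions to tune for your own workload:

```python
# Tiering-policy sketch: classify datasets by days since last access.
# The 30/180-day thresholds are assumptions, not provider defaults.
from datetime import date, timedelta

def tier(last_accessed: date, today: date = date.today()) -> str:
    age = (today - last_accessed).days
    if age <= 30:
        return "HOT"        # frequently accessed: keep on fast storage
    if age <= 180:
        return "WARM"       # occasionally accessed: cheaper tier
    return "COLD"           # untouched for months: archive it

print(tier(date.today() - timedelta(days=7)))     # HOT
print(tier(date.today() - timedelta(days=90)))    # WARM
print(tier(date.today() - timedelta(days=400)))   # COLD
```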

Partitioning & Distributions

To process larger datasets, the data needs to be segregated into different partitions to avoid a heavy load on the data warehouse tables, which would otherwise cause locks and delayed runtimes.

Partitions play a major role in how the data is stored and how efficiently it can be retrieved. In the event of any ETL processing, the partitions can be defined either at the source or at the sink level.

Also, identifying the right distribution for the DW tables is important, whether Round Robin, Hash, Replicate, etc., distributions that are widely used in Azure Synapse, as in the sketch below.
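
For illustration, here is a hedged sketch of a hash-distributed, date-partitioned fact table in an Azure Synapse dedicated SQL pool; the table name, columns, and connection DSN are placeholders, not anything from a real system:

```python
# Hedged sketch: a hash-distributed, date-partitioned fact table in an
# Azure Synapse dedicated SQL pool. Names and the DSN are placeholders.
import pyodbc  # assumes the pyodbc package and an ODBC driver are installed

DDL = """
CREATE TABLE dbo.FactSales
(
    sale_id     BIGINT        NOT NULL,
    customer_id INT           NOT NULL,
    sale_date   DATE          NOT NULL,
    amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(customer_id),   -- co-locate each customer's rows
    CLUSTERED COLUMNSTORE INDEX,        -- Synapse's default for fact tables
    PARTITION (sale_date RANGE RIGHT FOR VALUES ('2021-01-01', '2022-01-01'))
);
"""

with pyodbc.connect("DSN=my_synapse_pool") as conn:   # placeholder DSN
    conn.execute(DDL)
```

As a rule of thumb in Synapse, Round Robin suits staging tables, Hash suits large fact tables joined on the distribution key, and Replicate suits small dimension tables.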

Availability & Disaster Recovery

Making systems highly available is what the cloud providers promise, 99.99% in most cases, but on top of that it is essential to decide how the systems will be made available to end users, such as via geo or zone replication, and, in the event of a disaster, how we get our systems back up and running.
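
One conceptual illustration of the failover side, sketched in Python: prefer the primary region and fall back to a geo-replicated secondary. The endpoints are hypothetical, and real deployments lean on the provider's replication and DNS-level failover rather than application code:

```python
# Conceptual failover sketch: try the primary region's endpoint first,
# then a geo-replicated secondary. Endpoints below are hypothetical.
import urllib.request

ENDPOINTS = [
    "https://dw.primary.example.com/health",    # primary region
    "https://dw.secondary.example.com/health",  # geo-replicated secondary
]

def pick_endpoint() -> str:
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue                  # region unreachable: try the next one
    raise RuntimeError("No region available; invoke the DR runbook")

print(pick_endpoint())
```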

Security

In terms of security, it should be a layered approach, relying not only on data-access-level controls but also on the application, compute, network, etc.; the more layers of security, the harder the wall and the lower the chance of breaking it.

  • Authentication & Authorisation (e.g., OAuth 2.0)
  • Role-Based Access Control (RBAC)
  • Secret Manager/Key Vault (see the sketch after this list)
  • Encryption: Data at Rest & Data in Motion
  • Dynamic Data Masking for sensitive tables/columns (providing unmasked view access only to privileged users)
  • Row-Level Security
  • Policy Tags for PII data
  • Creating Resource/Network Groups
  • Logging and Monitoring
  • Firewall
  • VPN
  • DDoS prevention
  • Front Door
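
As a hedged sketch of the Secret Manager/Key Vault item above, here is how a pipeline might fetch a database password from Azure Key Vault instead of hard-coding it; the vault URL and secret name are placeholders for your own resources:

```python
# Fetch a warehouse password from Azure Key Vault instead of hard-coding it.
# The vault URL and secret name are placeholders, not real resources.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()   # uses managed identity or CLI login
client = SecretClient(
    vault_url="https://my-vault.vault.azure.net",  # placeholder vault
    credential=credential,
)

db_password = client.get_secret("dw-db-password").value  # placeholder name
# Pass db_password to your warehouse connection rather than embedding it.
```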

Conclusion

So, if you are trying to set foot into the cloud to build a modern cloud data warehouse, or are already in the cloud, try considering everything shared above.

I will cover more on these related concepts and bring in best practices in my future blogs.

#AzureSynapseAnalytics #AzureSynapseInfluencer #BigQuery

If you find this useful, please give a clap and share. If you have any queries, just comment. Thanks for reading.

Explore ~Learn ~Do

Vis
