Why should you consider a Data Lake in your Cloud Architecture?

Vis - The Data Enthusiast
3 min read · Apr 12, 2021

The Importance of the Data Lake in a Modern Cloud Data Platform

In the modern data world, as companies move towards a data-driven approach, it is essential to store data rather than throw it away over time.

Recent surveys show that nearly one-third of data is left unattended and never considered for deriving insights. Companies that make use of every bit of their data are more successful and able to deliver efficiently.

If you are considering migrating to the cloud, you should find a place in your data architecture for the Data Lake.

What is a Data Lake?

It is a scalable and secure data platform that allows enterprises to ingest, store, process and analyse any type of data regardless of its volume.

A Data Lake is the very first thing to have when building a Cloud Data Warehouse; it acts as a staging environment for data ingested from any source.

As we move towards a cloud data architecture, we do not always deal with structured data. It is really important to identify the nature of the data we ingest, process and store.

Identifying whether the data is structured, semi-structured or unstructured enables us to do the right job with it.

Basically, a Data Lake is a distributed file system, like a managed Hadoop Distributed File System (HDFS) in the cloud, which can be used for both batch and stream processing.
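For instance, here is a minimal sketch of treating the lake as an HDFS-like file system from Spark. It assumes a Spark session already configured with the relevant cloud storage connector; the bucket and paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-batch-read").getOrCreate()

# Batch: read raw JSON events landed in the lake, just like an HDFS path.
events = spark.read.json("gs://my-data-lake/raw/events/2021/04/")
events.printSchema()

# Streaming: the same lake location can feed a streaming query instead.
stream = spark.readStream.schema(events.schema).json("gs://my-data-lake/raw/events/")
```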

ELT vs. ETL

When it comes to the cloud, the best practice is usually ELT (Extract, Load, Transform) instead of ETL (Extract, Transform, Load): you bring everything into your cloud storage first, and then apply transformations on top of it.

The reason is that the source you are extracting from may sit on-premises or in an external cloud; a connection to it cannot stay open long enough to transform in flight, so performing ETL introduces delays and slows processing down.

In a multi-cloud architecture, ELT is really the only practical way to handle data coming from other clouds: land it in your storage first, then transform it.

(Figure: Data Lake — ELT pipeline)
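As a minimal sketch of the ELT pattern on Google Cloud (the bucket, dataset and table names below are hypothetical), the raw files are loaded into the warehouse exactly as they sit in the lake, and the transformation happens afterwards as SQL inside the warehouse:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Extract + Load: copy raw files from the data lake into a staging table,
# with no transformation applied in flight.
load_job = client.load_table_from_uri(
    "gs://my-data-lake/raw/orders/*.csv",   # hypothetical lake path
    "my_project.staging.orders_raw",        # hypothetical staging table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to finish

# Transform: reshape the raw data inside the warehouse, where compute lives.
client.query("""
    CREATE OR REPLACE TABLE my_project.analytics.orders AS
    SELECT order_id, CAST(amount AS NUMERIC) AS amount, DATE(created_at) AS day
    FROM my_project.staging.orders_raw
""").result()
```

Notice that no connection to the original source stays open during the transform; the source is touched only during the initial extract.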

A Data Lake enables us to handle any type of data, irrespective of its size, as it acts like HDFS in the cloud.

In the cloud you get charged/billed mainly for two components: compute and storage. Used wisely, this means you store the data in the data lake cheaply and process it only when needed.
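One way to see this separation, sketched below with a BigQuery external table (all names are hypothetical): the files stay on cheap lake storage, and compute is billed only for the moments a query actually runs over them.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Describe files that stay in the lake; nothing is loaded or copied.
external = bigquery.ExternalConfig("PARQUET")
external.source_uris = ["gs://my-data-lake/curated/sales/*.parquet"]

# Compute is billed only while this query runs; the data itself
# remains on inexpensive lake storage.
job_config = bigquery.QueryJobConfig(table_definitions={"sales": external})
rows = client.query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
    job_config=job_config,
)
for row in rows.result():
    print(row.region, row.total)
```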

Below are some possible use cases where a Data Lake fits in:

  1. Migrating from on-premises to the cloud
  2. Moving big data workloads to the cloud
  3. Storing semi-structured data (JSON, XML, etc.) and unstructured data (media files such as images and videos); see the sketch after this list
  4. Storing terabytes or petabytes of data over time
  5. Multi-cloud architectures
  6. Machine learning
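For use case 3, here is a minimal sketch with the google-cloud-storage client (the bucket and object names are hypothetical): both semi-structured and unstructured data land in the lake as plain objects, with no schema required up front.

```python
import json
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-lake")  # hypothetical bucket

# Semi-structured: land a JSON event exactly as received.
event = {"user": 42, "action": "click", "ts": "2021-04-12T10:00:00Z"}
bucket.blob("raw/events/evt-0001.json").upload_from_string(
    json.dumps(event), content_type="application/json"
)

# Unstructured: media files are stored the same way, as plain objects.
bucket.blob("raw/images/cat.jpg").upload_from_filename("cat.jpg")
```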

Cost Management

On the other hand, when it comes to cost management, keeping unused data in the Cloud Data Warehouse over a long period adds significant overhead.

This can be eliminated by moving older data into the Data Lake, where storage is cheaper; it can later be retrieved and processed from there on demand.

Choosing the access tier (Azure) or storage class (GCP) of the Data Lake wisely reduces the cost enormously.
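For example, on GCP a lifecycle rule can demote objects to colder, cheaper storage classes as they age. A minimal sketch with the google-cloud-storage client (the bucket name and age thresholds are hypothetical):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-data-lake")  # hypothetical bucket

# Demote objects to colder, cheaper storage classes as they age.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)   # after 30 days
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)  # after 1 year
bucket.patch()  # apply the updated lifecycle configuration
```

Azure offers the analogous control through blob lifecycle management policies that move blobs from the Hot to the Cool or Archive tier.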

Conclusion

A Data Lake is essential to every data strategy; however, as the data grows, maintaining the file system structure can become challenging.

This can be overcome by having Delta Lake on top of it. We shall discuss this in more detail in future blogs.

If you find this useful, please give a clap and share. If you have any queries, just comment. Thanks for reading.

Explore ~Learn ~Do

Vis
