What Does It Take to Run Terabytes or Petabytes of Data in the Cloud?

Vis - The Data Enthusiast
Mar 23, 2021 · 5 min read


If any of your friends or colleagues says they've processed terabytes or petabytes of data in a matter of minutes or seconds, ask them a few questions first.

There has always been a debate, and a lot of interest, around how fast your query runs and how long it takes to process.

Is it hours? Minutes? Seconds? Microseconds?

This question comes up whether you are on premises or on a modern cloud data platform.

A Story to start

Before diving in deep, let's start with a story. Alex, a software engineer, lives in London and wants to move home to Glasgow, which is 650 km away.

He wants to move as quickly as possible because of his work assignment in Glasgow. How many people should Alex recruit from the options below?

Fig 1.1 - No. of people & days to move home

Hold on to your answer for a couple of minutes.

Data Driven World

Running large datasets on premises can be cumbersome, even on an HDFS ecosystem, because you have to focus on the infrastructure setup and the ability to scale.

In a data-driven world, companies not only want to move to the cloud to eliminate infrastructure overhead, they also want their applications and data to be processed and loaded faster than before, so everything is available to end users sooner.

But how do you achieve this?

The answer largely depends on the kind of data architecture you adhere to. Whatever your data strategy may be, make sure you address the following:

  1. Identify the data sources: RDBMS, NoSQL, unstructured, etc.
  2. How you ingest and process the data: ELT/ETL, streaming, batch
  3. How you store the types of data: data warehouse (MPP engine), data lake
  4. How efficiently you retrieve the data: the serving layer

Also focus on how your archival strategy works: choosing it wisely, and pushing data into the data lake over time, can drastically reduce cost.

I don't want to go into detail on the above; here I'll focus on what you should ask.

Asking How..?

So, on a modern cloud data warehouse: if any of your friends or colleagues says they've processed terabytes or petabytes of data in a matter of minutes or seconds, ask the following.

  1. Who is the cloud provider?
  2. What is their data warehouse?
  3. Is it serverless, with auto-scaling enabled?
  4. The most important question: read on, and see the conclusion.

You might get answers like Amazon Redshift, Azure Synapse Analytics, or Google BigQuery, which are the major players among modern data warehouses.

Massively Parallel Processing (MPP)

Irrespective of which cloud data warehouse you've chosen, it will be built on massively parallel processing (MPP) concepts.

In simple terms, this is about how the data and its metadata are stored, and how they are retrieved efficiently.

Does it sound like a master-slave architecture? Yes, it is! However, every cloud provider has its own name for it, such as leader-worker or control-compute nodes.

Just think of the control node as the master and the compute nodes as the workers.

The master knows how many workers to assign to the job and splits the work across the workers in chunks. All these data warehouses are built on top of MPP, and a compute node is basically a VM doing the work.

More nodes, faster query processing!
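
To make the master/worker idea concrete, here is a minimal Python sketch. It is purely illustrative, not how any particular warehouse is implemented: a "control node" function splits a scan into chunks, fans them out to worker processes, and merges the partial results. The row count and node count are made-up values.

```python
from concurrent.futures import ProcessPoolExecutor

def scan_chunk(chunk):
    """One 'compute node': aggregate its own slice of the data."""
    return sum(chunk)

def run_query(data, num_nodes):
    """The 'control node': split the work, fan it out, merge the results."""
    chunk_size = max(1, len(data) // num_nodes)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ProcessPoolExecutor(max_workers=num_nodes) as pool:
        partials = pool.map(scan_chunk, chunks)
    return sum(partials)  # final aggregation happens back on the control node

if __name__ == "__main__":
    rows = list(range(1_000_000))          # pretend this is a big table
    print(run_query(rows, num_nodes=8))    # more nodes => more chunks in parallel
```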

What's Happening Behind the Scenes?

Basically, here is what happens when you choose on-demand, serverless, flat-rate, or pay-as-you-go pricing.

In Google Cloud, BigQuery automatically calculates how many slots each query requires, depending on its size and complexity. If you choose flat-rate pricing, you can choose the number of slots you reserve.
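
If you want to see this for yourself, the BigQuery Python client exposes job statistics once a query finishes. A rough sketch, assuming the google-cloud-bigquery library and default credentials for a project you can bill to; the table here is just a public sample, so substitute your own:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # uses your default GCP credentials and project

sql = """
    SELECT COUNT(*) AS trips
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
"""

job = client.query(sql)
job.result()  # wait for the query to finish

print("Bytes processed  :", job.total_bytes_processed)
print("Slot milliseconds:", job.slot_millis)  # how much slot time the query consumed
```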

The same happens in Azure:

A Synapse SQL pool is a set of analytic resources defined as a combination of CPU, memory, and IO. These three resources are bundled into units of compute scale called Data Warehouse Units (DWUs).
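
Because the DWU level is just a property of the pool, you can change it when you need more horsepower. Below is a rough Python sketch using pyodbc; the server, pool name, and credentials are placeholders to replace with your own, and you should double-check the exact ALTER DATABASE syntax and the DWU tiers available in your region against the Azure docs:

```python
import pyodbc  # pip install pyodbc; also needs the Microsoft ODBC driver installed

# Hypothetical connection details: replace server, credentials, and pool name.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;"
    "DATABASE=master;"
    "UID=sqladminuser;PWD=<your-password>",
    autocommit=True,  # ALTER DATABASE cannot run inside a transaction
)

# Scale the dedicated SQL pool 'mypool' to DW1000c (more compute, higher cost).
conn.cursor().execute(
    "ALTER DATABASE mypool MODIFY (SERVICE_OBJECTIVE = 'DW1000c');"
)
```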

Behind the scenes, the cloud providers scale horizontally, in simple terms by adding more VMs (nodes), and at times vertically for serverless, based on the nature of the query.

If the capacity is fixed, e.g. DWU 1000 or 1,000 slots, then the query gets split across those units/slots to process the data.

TIP: In Azure, you can choose how a table is distributed across the compute layer: hash, round-robin, or replicated.
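
To build intuition for those distribution choices, here is a small Python sketch, purely illustrative and not Synapse's actual implementation: round-robin deals rows out evenly, hash distribution keys each row on a column so rows with the same key land on the same node, and replicated copies the whole table to every node. The node count and sample rows are made up.

```python
from collections import defaultdict

NUM_NODES = 4  # illustrative only; a Synapse dedicated pool actually shards into 60 distributions

def round_robin(rows):
    nodes = defaultdict(list)
    for i, row in enumerate(rows):
        nodes[i % NUM_NODES].append(row)               # deal rows out evenly
    return nodes

def hash_distribute(rows, key):
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key]) % NUM_NODES].append(row)  # same key -> same node
    return nodes

def replicate(rows):
    return {node: list(rows) for node in range(NUM_NODES)}  # full copy on every node

orders = [{"order_id": i, "customer": f"c{i % 3}"} for i in range(10)]
print(hash_distribute(orders, key="customer"))  # joins on 'customer' then avoid data movement
```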

Now, going back to the story of Alex: if you have your answer to the question, you can correlate it with the MPP concepts.

Yes! He chooses 8 people, since he wants to move home faster.

More workers, faster home move! But that comes at a price!

The same applies in the MPP data warehouse world:

More nodes, faster query processing! But a bigger bill!
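
A back-of-the-envelope sketch of that trade-off, with every number below made up for illustration: doubling the nodes roughly halves the elapsed time, but you are billed for every node for the whole run, and because parallelism is never perfect the total bill creeps up with the node count.

```python
TOTAL_WORK_HOURS = 8.0      # hypothetical total work, like Alex's home move
SERIAL_FRACTION = 0.1       # hypothetical share of the work that cannot be parallelised
PRICE_PER_NODE_HOUR = 5.0   # hypothetical price per node per hour

for nodes in (1, 2, 4, 8):
    # Amdahl-style estimate: the serial part always runs on a single node.
    elapsed = TOTAL_WORK_HOURS * (SERIAL_FRACTION + (1 - SERIAL_FRACTION) / nodes)
    cost = nodes * elapsed * PRICE_PER_NODE_HOUR  # every node is billed for the whole run
    print(f"{nodes} node(s): {elapsed:.2f} h elapsed, cost {cost:.2f}")
```

With these made-up numbers, 8 nodes finish in about 1.7 hours instead of 8, but the bill is 68 instead of 40.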

Conclusion

So now you know what you should be asking when someone says they processed petabytes of data in minutes:

How many DWUs (Azure) or slots (GCP) were used?

How much did it cost them?

Mostly, people are so impressed by the processing speed that they forget to look at how many slots/DWUs it really consumed and, most importantly, the billing cost, until they hit their budget limits.
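
If you pulled the job statistics as in the BigQuery sketch earlier, turning them into those two answers is simple arithmetic. The values and the price below are illustrative assumptions only (on-demand BigQuery was around $5 per TB scanned when this was written); always check your provider's current price sheet.

```python
# Figures you would read off a finished job (hypothetical values here).
slot_millis = 7_200_000          # e.g. job.slot_millis from the earlier sketch
bytes_processed = 2 * 10**12     # e.g. job.total_bytes_processed (2 TB)

slot_hours = slot_millis / 1000 / 3600
on_demand_cost = bytes_processed / 10**12 * 5.0   # assumed ~$5 per TB scanned

print(f"Slot hours consumed     : {slot_hours:.2f}")
print(f"Estimated on-demand cost: ${on_demand_cost:.2f}")
```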

I agree that today we are processing huge datasets in minutes, faster than ever before, but we should be aware of what it takes to make that happen.

I will cover more of these concepts, and best practices on how to tackle them, in a future blog. Thanks.

If you find this useful, please give a clap and share. If you have any queries, just comment. Thanks for reading.

Explore ~Learn ~Do

Vis
