How to Handle Data in the Cloud
Following the right principles & standards
There is no single universal standard for handling data in the cloud; much depends on which cloud provider and which services you choose. However, adopting the right standards in architecture, design, implementation, and so on can create an ideal, robust environment for development and, most importantly, deliver cost savings.
Here, I have highlighted a few cloud data standards that you could adopt, or use to self-evaluate if you are already in the cloud.
These principles are a set of practices that we can follow to write better code as a team and adhere to the best standards.
1. Subscribe to the right cloud services.
Once you have chosen your cloud provider and decided to move your workloads into the cloud, choosing the right cloud services plays a huge role in successful adoption. Because the cloud market is so competitive, each provider offers multiple services, and many of them overlap with one another.
So, before adopting any cloud data service, understand its purpose and make sure it suits your needs and does exactly what you require.
Example: In Azure, you don't need Azure Databricks for your ETL needs when your workload is small and you mostly deal with structured or RDBMS data. Azure Data Factory would be sufficient for your ETL/ELT and data wrangling requirements.
If you subscribe to data services that don't match your requirements, then, as in the example above, it is like driving a Lamborghini in heavy traffic: you are not getting the best out of it, and in heavy traffic even walking would be sufficient.
2. Should be easy to understand.
Putting everything into a single data pipeline or block of code makes it difficult to understand and can lead to slower data processing. Instead, split larger data pipelines and SQL code into smaller chunks and process them separately.
We will achieve this by:
- Commenting our code.
- Providing descriptions for tables & columns (see the sketch after this list).
- Documenting our code where appropriate.
- Using consistent naming conventions, followed throughout development.
- Using consistent formatting: stick to one SQL formatter.
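As a small illustration, here is a minimal sketch of what table & column descriptions and a consistent naming convention might look like, written in Snowflake-style SQL. The schema, table, and column names are purely hypothetical, and the exact comment syntax varies by warehouse (BigQuery, for instance, uses `OPTIONS(description = ...)` instead).

```sql
-- Hypothetical staging table, for illustration only.
-- Naming convention: <layer>_<domain>_<entity>, snake_case throughout.
CREATE TABLE stg_sales_order (
    order_id     INTEGER       NOT NULL COMMENT 'Business key from the source system',
    customer_id  INTEGER       NOT NULL COMMENT 'Reference to stg_sales_customer',
    order_date   DATE          NOT NULL COMMENT 'Date the order was placed (UTC)',
    order_amount NUMERIC(12,2)          COMMENT 'Gross amount in the source currency'
)
COMMENT = 'Staging copy of sales orders, loaded daily by the ingestion pipeline';
```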
3. Should perform to a high standard.
Keeping up with the right standards keeps the data organized for storage and timely retrieval.
We will achieve this by:
- Sticking to coding standards.
- Ensuring our code is reusable wherever possible.
- Making sure we use the best approach to code.
- Keeping our code minimally sufficient.
- Using cloud resources optimally to reduce cost.
- Self-evaluating our code against the coding standards.
- Always peer reviewing our code.
- Understanding the differences between cloud data warehouses: in BigQuery, the code & modelling are entirely different from traditional data warehouses/databases.
- In Azure, the data warehouse, Synapse Analytics (dedicated SQL pool), is built on the traditional SQL Server data warehouse framework with a few additions (MPP), so the modelling is almost the same as traditional modelling; this is entirely different in BigQuery, Snowflake & Databricks SQL (see the sketch after this list).
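To make that difference concrete, here is a minimal sketch contrasting the physical design of the same hypothetical fact table in a Synapse dedicated SQL pool and in BigQuery. The table and column names are invented for illustration, and the options shown are common starting points rather than recommendations for any specific workload.

```sql
-- Azure Synapse dedicated SQL pool: distribution is part of the table design (MPP).
CREATE TABLE dbo.fact_sales
(
    sale_id     BIGINT NOT NULL,
    customer_id INT    NOT NULL,
    sale_date   DATE   NOT NULL,
    amount      DECIMAL(12,2)
)
WITH
(
    DISTRIBUTION = HASH(customer_id),   -- spread rows across compute nodes
    CLUSTERED COLUMNSTORE INDEX         -- typical storage for large fact tables
);

-- BigQuery: no distribution keys; you tune with partitioning and clustering instead.
CREATE TABLE sales.fact_sales
(
    sale_id     INT64,
    customer_id INT64,
    sale_date   DATE,
    amount      NUMERIC
)
PARTITION BY sale_date
CLUSTER BY customer_id;
```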
4. Should be easy to maintain.
When dealing with data in the cloud, services and data get added day by day; handling them becomes cumbersome, and at some point it becomes tedious to understand what each of them is for.
We will achieve this by:
- Keeping a ledger of all data pipelines, integrations, migrations, and services, and even lower-level details such as tables, in a dashboard (Power BI, Tableau, etc.) that becomes a centralized go-to hub for monitoring and maintenance.
- Automating tests for our code where possible (a simple example follows this list).
- Revisiting the data pipelines & code, clearing out unused components, and making them run on their own.
- Using version control such as GitHub or similar.
- Making useful objects and examples discoverable.
- Providing clarity on our preferences.
- Removing barriers to writing good code (if needed, get help from teammates, Slack channels, etc.).
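As a minimal example of the kind of automated test that can run after each load, here is a sketch of two SQL data-quality checks. The table and column names reuse the hypothetical staging table from earlier, and the assumption is that a non-empty result fails the pipeline run.

```sql
-- Hypothetical data-quality checks, run automatically after each load.
-- Each query should return zero rows; any returned row fails the run.

-- 1. Primary-key uniqueness on the staging orders table.
SELECT order_id, COUNT(*) AS duplicate_count
FROM stg_sales_order
GROUP BY order_id
HAVING COUNT(*) > 1;

-- 2. No orders dated in the future.
SELECT order_id, order_date
FROM stg_sales_order
WHERE order_date > CURRENT_DATE;
```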
5. Should be harder to crack.
Security is the most important aspect of the cloud, especially if you choose to keep your data there. Cloud providers usually have default security in place for most or all of their services. However, it is important to design extra layers of security on top of the defaults: the more layers of security, the harder it is to crack.
We will achieve this by:
- Applying layers of security.
- Applying tags and encryption to PII data; this helps segregate normal vs. sensitive data.
- Avoiding the use of SAS keys/tokens for data movement and using OAuth 2.0 authentication for API access.
- Establishing protocols for merging & publishing code for deployment (e.g., approval from at least two reviewers is mandatory).
- Creating a set of firewall rules to control who can access the data warehouse.
- Providing end-users access only to the reporting layer, not to the enterprise layer.
- Applying row-level/column-level security based on user groups (see the sketch after this list).
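To close this section, here is a minimal sketch of row-level security in SQL Server / Synapse dedicated SQL pool style, followed by a BigQuery equivalent. All object names, the `sales_rep` and `region` columns, and the group address are hypothetical, and the exact features available depend on your warehouse.

```sql
-- SQL Server / Synapse style: each user only sees their own rows.
CREATE SCHEMA security;
GO

CREATE FUNCTION security.fn_sales_rep_filter (@sales_rep AS NVARCHAR(128))
    RETURNS TABLE
    WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS allowed
           WHERE @sales_rep = USER_NAME();   -- row owner matches the connected user
GO

CREATE SECURITY POLICY security.sales_rep_policy
    ADD FILTER PREDICATE security.fn_sales_rep_filter(sales_rep)
    ON dbo.fact_sales
    WITH (STATE = ON);
GO

-- BigQuery equivalent: a row access policy granted to a user group.
CREATE ROW ACCESS POLICY emea_only
ON sales.fact_sales
GRANT TO ('group:emea-analysts@example.com')
FILTER USING (region = 'EMEA');
```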
Conclusion:
There is still a lot to cover in each of these sections, and more could be added, but overall I have touched on what is most important and how we should act when handling data in the cloud.
If you find this useful, please give a clap and share. If you have any queries, just comment. Thanks for reading.
Explore ~Learn ~Do
Vis