Implementing Medallion Architecture in Microsoft Fabric with Lakehouse and Data Warehousing
Hello Folks, and welcome back to our study time!
In our last article, we learned how to easily fetch cryptocurrency data from CoinMarketCap and integrate it into a Fabric Lakehouse in a notebook environment. Today, we’re continuing our data journey by delving into the exciting world of medallion architecture.
Imagine it as giving our data engine a turbo boost! We’ll be using Microsoft Fabric Warehouses to create our Data Warehouse and Pipelines to move data between our three layers.
This journey unfolds across three posts, beginning right here. We’ll dive into the concept of a medallion architecture, examine how it can improve our data operations, and finally load data into the bronze layer of our model.
I hope you guys enjoy these three articles, and if you have any suggestions, questions, or comments, please feel free to share your opinions.
So, let’s dig in and understand what medallion architecture is all about.
Medallion Architecture
We can define a medallion architecture as a data management framework that consists of three main layers: bronze, silver, and gold. Each layer has a specific role in the data processing and analytics pipeline:
- Bronze Layer: The first step in the data processing journey. Here, data from different sources is brought into the system in its original form, without any changes. This allows for a complete and untouched look at the organization’s data. The bronze layer serves as a starting point for new information, making it easy to quickly access and use raw data.
- Silver Layer: Here the data from the bronze layer goes through processes of transformation and cleansing to improve its quality and usability. Tasks such as data cleaning, normalization, and enrichment are carried out in this layer, preparing the data for further analysis and insights generation. The silver layer acts as an intermediary stage where data is refined and standardized before being used for business intelligence and reporting purposes.
- Gold Layer: At the final stage of the data lifecycle, the gold layer holds carefully curated data that is optimized for advanced analytics and decision-making. This structured and aggregated data is stored for efficient querying and analysis, making it easy for business users and data analysts to explore and visualize. Typically stored in data warehouses or data lakes, the gold layer contains datasets that are ready for consumption.
Overall, medallion architecture provides a structured and scalable framework for managing data across its lifecycle, from ingestion to consumption. By incorporating multiple layers of processing and transformation, organizations can ensure the reliability, quality, and usability of their data assets, empowering informed decision-making and driving business value.
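To make the three layers concrete before we start building, here is a minimal, purely illustrative sketch using pandas, with local folders standing in for a real Lakehouse. The folder names mirror the layers, but the file names and toy data are made up for the example:

```python
import pandas as pd
from pathlib import Path

# Hypothetical local folders standing in for the three layers
for layer in ("Bronze", "Silver", "Gold"):
    Path(layer).mkdir(exist_ok=True)

# Bronze: land raw data exactly as received, warts and all
raw = pd.DataFrame({
    "symbol": ["BTC", "eth", "BTC", None],
    "price_usd": ["65000.1", "3200.5", "64900.0", "1.0"],
})
raw.to_csv("Bronze/listings_raw.csv", index=False)

# Silver: clean and standardize (drop bad rows, normalize types and case)
silver = (
    pd.read_csv("Bronze/listings_raw.csv")
    .dropna(subset=["symbol"])
    .assign(
        symbol=lambda d: d["symbol"].str.upper(),
        price_usd=lambda d: d["price_usd"].astype(float),
    )
)
silver.to_csv("Silver/listings_clean.csv", index=False)

# Gold: aggregate into an analytics-ready table for reporting
gold = (
    pd.read_csv("Silver/listings_clean.csv")
    .groupby("symbol", as_index=False)["price_usd"]
    .mean()
)
gold.to_csv("Gold/avg_price_by_symbol.csv", index=False)
```

Each layer only ever reads from the layer before it, which is the core discipline the architecture enforces.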
Now that we have a grasp on what a medallion architecture is, let’s get ready to jump in and start working. It’s time to turn our knowledge into action and start constructing our bronze layer.
The Bronze Layer
Let’s start coding our Bronze Layer in a new Notebook. If you’re unsure about how Fabric works or how to create a new Notebook, just refer to my last article here.
This code snippet demonstrates the process of making an API request to fetch the latest cryptocurrency listings from CoinMarketCap and handling the response using Python.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Session").getOrCreate()

# This example uses Python 3 and the requests library.
from requests import Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
import json

url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
    'start': '1',
    'convert': 'USD'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': 'YOUR_API_KEY',  # replace with your own CoinMarketCap key
}

session = Session()
session.headers.update(headers)

try:
    response = session.get(url, params=parameters)
    data = json.loads(response.text)  # equivalently: data = response.json()
    print("JSON Loaded")
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)
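If you want to see what comes back before normalizing it, the snippet below walks a trimmed sample dictionary that mimics the general shape of the listings/latest response; the values and the exact fields shown are illustrative, so always check the payload you actually receive:

```python
import json

# A trimmed sample payload mimicking the shape of the
# /v1/cryptocurrency/listings/latest response (illustrative values)
sample = json.loads("""
{
  "status": {"error_code": 0, "error_message": null},
  "data": [
    {"id": 1, "name": "Bitcoin", "symbol": "BTC",
     "quote": {"USD": {"price": 65000.12, "market_cap": 1280000000000}}}
  ]
}
""")

# Check the API-level status before trusting the payload
assert sample["status"]["error_code"] == 0

# 'data' is a list of coins; quotes are nested per fiat currency
first = sample["data"][0]
print(first["symbol"], first["quote"]["USD"]["price"])
```

Because the quote fields are nested per currency, flattening them is exactly what `pd.json_normalize` will do for us in the next step.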
Now it’s time to normalize our JSON data, add a timestamp column, and save the resulting DataFrame as a CSV file in OneLake, Fabric’s data lake, via an abfss:// path.
import pandas as pd
import pytz
from datetime import datetime

# Normalize the nested JSON into a flat DataFrame
df = pd.json_normalize(data['data'])

# Cast list/object columns to strings so they serialize cleanly to CSV
# (other columns such as name, symbol, or platform.* can be cast the same way)
df['tags'] = df['tags'].astype("string")
df['tvl_ratio'] = df['tvl_ratio'].astype("string")
df['quote.USD.tvl'] = df['quote.USD.tvl'].astype("string")

# Define New York timezone and get the current time there
ny_tz = pytz.timezone('America/New_York')
current_time_ny = datetime.now(ny_tz)

# Add the load timestamp as a new column
local_datetime = current_time_ny.strftime('%Y-%m-%d %H:%M:%S')
local_date = current_time_ny.strftime('%Y-%m-%d')
df['log_date'] = local_datetime

display("JSON Normalized")

# Save the current CSV file to the Bronze folder in OneLake
file_name = "CoinMarketCap" + local_date + ".csv"
abfs = "abfss://a20e1f0c-e94e-4349-b995-c2479037a2e9@onelake.dfs.fabric.microsoft.com/156c7e07-d0d0-4cf1-8208-5b1dcb5dc729/Files/Bronze/"
file_path = abfs + file_name

# to_csv overwrites an existing file with the same name, so no explicit
# existence check is needed (os.path.exists does not work against
# abfss:// paths anyway)
df.to_csv(file_path)
display("CSV Saved")
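As a footnote on the save step, here is a small local sketch, with a plain Bronze folder as a stand-in for the Lakehouse path, showing that pandas’ `to_csv` overwrites a same-named file on its own. Re-running the notebook on the same day therefore just replaces that day’s snapshot:

```python
import pandas as pd
from datetime import datetime
from pathlib import Path

# Hypothetical local folder standing in for Files/Bronze in the Lakehouse
Path("Bronze").mkdir(exist_ok=True)

# Same date-stamped naming pattern as the notebook code
local_date = datetime.now().strftime("%Y-%m-%d")
file_path = "Bronze/CoinMarketCap" + local_date + ".csv"

# First write of the day
pd.DataFrame({"symbol": ["BTC"], "price": [65000.0]}).to_csv(file_path, index=False)

# A later run on the same day produces the same file name, and
# to_csv simply overwrites it, so no explicit delete is needed
pd.DataFrame({"symbol": ["BTC", "ETH"],
              "price": [65100.0, 3200.0]}).to_csv(file_path, index=False)

print(pd.read_csv(file_path).shape)  # only the latest snapshot remains
```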
After executing both code cells, you’ll find your CSV file stored in a folder named Bronze, located within the Files directory of your Lakehouse.
Conclusion
In this first installment of our series on incorporating a medallion architecture in Microsoft Fabric, we’ve delved into the fundamental principles and actively coded the bronze layer. We’ve explored how Microsoft defines medallion architecture as a structured system for enhancing data processing and analytics.
Using Notebooks, we’ve walked through the steps to retrieve data from an API, normalize it, add a timestamp, and save it to OneLake, Fabric’s data lake. This practical approach has offered a valuable first look at executing a medallion architecture.
In the next posts, we’ll move on to creating the silver and gold layers, improving our data pipeline, and extracting more valuable insights from our data. Keep an eye out for more intriguing insights and hands-on demonstrations as we progress in our quest to master medallion architecture in Microsoft Fabric.