Creating a Data Pipeline using Google Cloud Dataprep
Purpose of the article:
This article uses to create a data pipeline using cloud dataprep to load from google cloud storage bucket to google bigquery by performing cleaning and transformation of data.
Tools and Technology:
Technology- Google cloud platform
BigQuery, GCS, Dataprep, Dataflow recipe, Data quality, Dataset, Cleaning, enrich, transformation.
Cloud Dataprep isa google cloud service for visually exploring, cleaning, and transforming structured and unstructured data for analytics, reporting, and machine learning. This tool dynamically scales the resources. Let’s create a data pipeline to transform data from google cloud storage to google BigQuery using Cloud Dataprep.
In the Navigation menu, go to the BigQuery and Create a dataset in BigQuery and load the sales_data file into the source, i.e., google cloud storage bucket.
Navigate to BigQuery and create a dataset, enter the dataset_id and click on create a dataset to create a dataset in BigQuery.
- Now navigate to the google cloud storage bucket and load the data into the GCS bucket.
Transforming the data from GCS to BigQuery using cloud data prep below are the steps to be followed.
- Click on create a new flow.
- Click on the Add Dataset to get connected to your data.
- Click on import datasets, Here select GCS as our source data is google cloud storage bucket.
- Click on IMPORT & ADD TO FLOW to add your data.
Step 4: Imported data added to the recipe. You can even add one or more datasets to the recipe according to your requirement
- Click on edit recipe. Once you click on edit recipe, you can view the details of the data in the recipe.
You can clean, enrich, and transform the data in the recipe.
Step 6 :
- Click on a single column. It gives us all the details of a column like the Quality of the data, Unique values, and the data pattern if it is a string.
- Now click on functions to add a new column total_profit to the sales_data added to the dataset.
- Here total_profit is the difference between total_revenue and total_cost
- You can even perform operations like merging the data, perform aggregations, etc.
- A new column total_profit is added to the recipe now. After completing the transformation of data, click on Run.
- Now add the dataflow execution details and add the output file location.
- Now click on RUN. A dataflow job will create along with the data prep workflow.
Dataflow Job uses to extract, transform, and load the data and execute the details of the job flow.
- Successfully, the output table developed in BigQuery along with the changes.
- Now, please navigate to the options of the dataprep workflow to schedule it.
- Select the scheduling options and click on save to schedule the job.
References / Sources of the information referred:
Google cloud documents
Database Migration Service for MySQL documentation | Google Cloud
Which MOURI Tech service, this article relates to (please refer to 5 website service section) – Data and Analytics.
Cloud Dataprep isa google cloud service used for transforming structured and unstructured data for analytics. We have created a Data pipeline by loading from google cloud storage to BigQuery using Dataprep using recipes for cleansing and transforming the data.
Mahima Reddy INAGANTI
Associate Software Engineer, Analytics (DWH & Data lakes).