Introduction
This project demonstrates how to deploy a Multi-Cloud Data Pipeline leveraging fully managed services across different cloud providers. The goal is to stream data from a source (Amazon Kinesis or Azure Event Hubs) to a target storage solution (Google Cloud Storage or AWS S3), and optionally load it into an analytics platform such as BigQuery or store it long-term in S3.
Why Is It Useful?
Building this pipeline demonstrates how to move streaming data between cloud providers using only serverless, fully managed components, and how to automate the whole flow from the CLI, all while staying within each provider's free tier, so the same pattern can later be adapted or scaled for production workloads.
Prerequisites
Required Tools & Accounts
AWS Account with Free Tier access (no credits required if staying within free usage limits).
Google Cloud Account (Free Tier available, no credits required if usage stays within free thresholds).
(Optional) Azure Account with Free Tier to use Azure Event Hubs or Azure Functions if choosing the Azure path.
Installed AWS CLI and Google Cloud SDK (and Azure CLI if using Azure) on the local machine.
Proper IAM Permissions:
AWS: “AdministratorAccess” (or equivalent administrative) permissions on the account to create and configure Kinesis, Lambda, and IAM roles.
Google Cloud: “Owner” or “Editor” permissions on the project to create Google Cloud Storage buckets and BigQuery datasets.
(Optional) Azure: Owner or Contributor role to create Event Hubs and Functions.
Required Services / APIs
AWS: Kinesis Data Streams, Lambda, IAM, and CloudWatch Logs.
Google Cloud: Cloud Storage and, optionally, BigQuery.
(Optional) Azure: Event Hubs and Azure Functions.
Important: Ensure that each of these services is enabled in your cloud console (for example, enable the Cloud Storage and BigQuery APIs in GCP, and confirm Kinesis and Lambda are available in your chosen AWS region).
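On the GCP side, for instance, the required APIs can be enabled once from the CLI (the service names below are the standard identifiers for Cloud Storage and BigQuery):
gcloud services enable storage.googleapis.com bigquery.googleapis.com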
Step-by-Step Implementation
Below are two approaches for each major step: one through the Graphical User Interface (GUI) and one through the Command-Line Interface (CLI). These steps can be adapted based on your preferred cloud providers, but the example here focuses on AWS + Google Cloud.
Create a Kinesis Stream (AWS)
We will set up a Kinesis Data Stream to receive incoming data.
Manual Steps (GUI)
AWS Management Console:
Sign in to the AWS Management Console.
Navigate to Kinesis under “Services.”
Click Create data stream.
Enter a Stream Name (for example, multi-cloud-stream).
Set Number of shards to 1 (the minimum, which keeps throughput capacity and cost as low as possible).
Click Create stream to finalize.
Confirmation: You should see a new Kinesis data stream with the status “Active.”
Command-Line Interface (CLI):
aws kinesis create-stream \
--stream-name multi-cloud-stream \
--shard-count 1
What does this command do? Creates a Kinesis data stream named multi-cloud-stream with 1 shard.
To check if it’s active:
aws kinesis describe-stream --stream-name multi-cloud-stream
What does this command do?
Displays details about the stream, including its status and the number of shards.
Create a Google Cloud Storage (GCS) Bucket
We will use Google Cloud Storage as an intermediate or final destination for the streaming data.
Manual Steps (GUI)
Google Cloud Console:
Go to the Google Cloud Console and verify you are in the correct project.
Select Storage → Browser → Create Bucket.
Enter a unique bucket name such as multi-cloud-pipeline-bucket-123.
Choose a Region (e.g., us-central1) to minimize latency between the AWS region and the GCP region if possible.
Leave other settings as defaults for the free tier.
Click Create.
Command-Line Interface (CLI)
gcloud config set project [PROJECT_ID]
gsutil mb -l us-central1 gs://multi-cloud-pipeline-bucket-123/
What do these commands do?
The first line sets your active project in Google Cloud to [PROJECT_ID].
The second line creates a new bucket named multi-cloud-pipeline-bucket-123 in the us-central1 region.
Configure AWS Lambda Function to Pull from Kinesis and Push to GCS
We will create a Lambda function triggered by new records in Kinesis. Whenever data arrives, it will batch-process and upload the data as files into the GCS bucket.
Manual Steps (GUI)
IAM Role (if not existing): In the IAM console, create a role that Lambda can assume, with permissions to read from Kinesis (e.g., AmazonKinesisFullAccess) and to write logs to CloudWatch (e.g., CloudWatchLogsFullAccess).
Create Lambda: In the Lambda console, click Create function, choose a Python runtime (e.g., Python 3.9), name it kinesis-to-gcs, and attach the IAM role created above.
Add Kinesis Trigger: In the function’s configuration, add a trigger of type Kinesis, select the multi-cloud-stream stream, and set a batch size (e.g., 100 records).
Function Code: Paste or upload the handler code (lambda_function.py) that decodes the incoming Kinesis records and uploads them to the GCS bucket.
Note: Direct cross-cloud writes might require extra steps, like creating a service account in GCP and generating short-lived credentials from AWS. For the sake of simplicity, assume we are using a public API or an already available service account key.
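If you go the service-account-key route, the GCP side can be prepared with commands along the following lines (a sketch; the kinesis-writer account name and key.json file name are placeholders):
gcloud iam service-accounts create kinesis-writer
gcloud iam service-accounts keys create key.json \
--iam-account=kinesis-writer@[PROJECT_ID].iam.gserviceaccount.com
gsutil iam ch serviceAccount:kinesis-writer@[PROJECT_ID].iam.gserviceaccount.com:objectCreator gs://multi-cloud-pipeline-bucket-123
The resulting key.json can then be packaged with the Lambda deployment (or stored in AWS Secrets Manager) so the function can authenticate to GCS.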
Command-Line Interface (CLI)
Create the IAM Role for Lambda
aws iam create-role \
--role-name LambdaKinesisRole \
--assume-role-policy-document file://TrustPolicyForLambda.json
What does this command do?
Creates a role named LambdaKinesisRole that Lambda can assume. TrustPolicyForLambda.json should specify "Service": "lambda.amazonaws.com" as trusted entity.
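For reference, a minimal TrustPolicyForLambda.json that satisfies this could look like:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}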
Attach policies to the role
aws iam attach-role-policy \
--role-name LambdaKinesisRole \
--policy-arn arn:aws:iam::aws:policy/AmazonKinesisFullAccess
aws iam attach-role-policy \
--role-name LambdaKinesisRole \
--policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
What do these commands do?
Grant LambdaKinesisRole permissions to fully access Kinesis and to write logs to CloudWatch.
Create Lambda function
zip function.zip lambda_function.py # Package code into a zip
aws lambda create-function \
--function-name kinesis-to-gcs \
--runtime python3.9 \
--zip-file fileb://function.zip \
--handler lambda_function.lambda_handler \
--role arn:aws:iam::[ACCOUNT_ID]:role/LambdaKinesisRole
What does this command do?
Creates a Lambda function named kinesis-to-gcs using the packaged code (lambda_function.py) and associates it with the specified IAM role.
Create the Kinesis trigger
aws lambda create-event-source-mapping \
--function-name kinesis-to-gcs \
--event-source-arn arn:aws:kinesis:[REGION]:[ACCOUNT_ID]:stream/multi-cloud-stream \
--batch-size 100 \
--starting-position LATEST
What does this command do?
Links the Lambda function to the Kinesis stream, specifying how many records to process per batch (--batch-size 100).
Note: The code inside lambda_function.py should handle sending data to GCS. This typically involves generating a signed URL from GCP (via a separate script/credentials) or calling a secure endpoint that writes to GCS.
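For reference, a minimal sketch of lambda_function.py is shown below. It assumes the google-cloud-storage client library is bundled in the deployment package (e.g., pip install google-cloud-storage -t . before zipping, or a Lambda layer) and that the GOOGLE_APPLICATION_CREDENTIALS environment variable points to a service account key with write access to the bucket; the GCS_BUCKET variable and object naming are illustrative placeholders, not the article’s exact code.
# lambda_function.py -- illustrative sketch of a Kinesis-to-GCS handler
import base64
import os
import time

from google.cloud import storage  # must be bundled with the deployment package

BUCKET_NAME = os.environ.get("GCS_BUCKET", "multi-cloud-pipeline-bucket-123")

def lambda_handler(event, context):
    # Each Kinesis record arrives base64-encoded; decode the whole batch.
    records = [
        base64.b64decode(r["kinesis"]["data"]).decode("utf-8")
        for r in event.get("Records", [])
    ]
    if not records:
        return {"status": "empty batch"}

    # Write the batch as a single newline-delimited object in the GCS bucket.
    client = storage.Client()
    object_name = f"kinesis-batch-{int(time.time() * 1000)}.json"
    blob = client.bucket(BUCKET_NAME).blob(object_name)
    blob.upload_from_string("\n".join(records), content_type="application/json")

    return {"status": "ok", "object": object_name, "records": len(records)}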
Optional: Load Data from GCS into BigQuery
If a final analytics layer is desired, data from GCS can be periodically loaded into BigQuery.
Manual Steps (GUI)
BigQuery Console:
In Google Cloud Console, go to BigQuery.
Create a Dataset (e.g., multicloud_dataset).
Within the dataset, click Create table.
Select Source = Google Cloud Storage and choose the bucket/folder where Lambda is storing files.
Specify File format (e.g., JSON or CSV).
Click Create table to load data.
Command-Line Interface (CLI)
bq --location=us-central1 mk --dataset [PROJECT_ID]:multicloud_dataset
bq load \
--autodetect \
--source_format=CSV \
[PROJECT_ID]:multicloud_dataset.multicloud_table \
gs://multi-cloud-pipeline-bucket-123/*.csv
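What do these commands do?
The first creates a BigQuery dataset named multicloud_dataset in the us-central1 location. The second loads every CSV object matching the wildcard from the GCS bucket into a table named multicloud_table, with --autodetect letting BigQuery infer the schema.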
Verifying and Testing the Project
Data Ingestion Test: Put sample data into the Kinesis stream:
aws kinesis put-record \
--stream-name multi-cloud-stream \
--partition-key testKey \
--data "HelloMultiCloud"
(With AWS CLI v2, add --cli-binary-format raw-in-base64-out so the plain-text payload is accepted without base64-encoding it first.)
Verify that the Lambda function triggers by checking CloudWatch Logs (AWS console → CloudWatch → Logs).
GCS Object Check:
Go to Cloud Console → Storage → Browser → open the bucket.
Verify an object/file with your sample data is created.
(Optional) BigQuery Load Test:
If loading to BigQuery, run a query on the newly created table to see if data arrived correctly.
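For a quick command-line spot check (using the names from the earlier steps), something along these lines should work:
gsutil ls gs://multi-cloud-pipeline-bucket-123/
bq query --use_legacy_sql=false \
'SELECT COUNT(*) AS row_count FROM `[PROJECT_ID].multicloud_dataset.multicloud_table`'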
Common Issues and Troubleshooting
Lambda does not trigger: confirm the event source mapping exists (aws lambda list-event-source-mappings) and check CloudWatch Logs for errors.
No objects appear in GCS: verify the GCP credentials used by lambda_function.py (service account key or signed-URL endpoint) and the bucket name.
Permission errors: re-check the policies attached to LambdaKinesisRole on the AWS side and the service account’s role on the GCS bucket.
Service/API not enabled: make sure Kinesis, Lambda, Cloud Storage, and BigQuery are enabled, as listed under Prerequisites.
Conclusion
We have successfully built a Multi-Cloud Data Pipeline that streams data from AWS (via Kinesis) to Google Cloud Storage, with an optional loading process into BigQuery. Along the way, we have acquired skills in configuring serverless functions (Lambda), managing cross-cloud resources (AWS → GCP), and automating workflows via the CLI. This approach can be adapted to other cloud provider combinations (e.g., Azure Event Hubs → AWS S3, Azure Functions → Google Cloud Storage) and scaled for real-world production scenarios without incurring credit charges, provided the usage stays within free tier limits.
What is Cloud Computing?
Cloud computing delivers computing resources (servers, storage, databases, networking, and software) over the internet, allowing businesses to scale and pay only for what they use, eliminating the need for physical infrastructure.