Multi-Cloud Data Pipeline & Replication

Introduction

This project demonstrates how to deploy a Multi-Cloud Data Pipeline leveraging fully managed services across different cloud providers. The goal is to stream data from a source (Amazon Kinesis or Azure Event Hubs) to a target storage solution (Google Cloud Storage or AWS S3), and optionally load it into an analytics platform such as BigQuery or store it long-term in S3.


Why Is It Useful?

  • Scalability: The streaming services (Kinesis/Event Hubs) and functions (Lambda/Azure Functions) auto-scale according to demand.
  • Cost-Effective: When properly configured with free tiers, the pipeline can handle moderate workloads without incurring extra charges.
  • Flexibility: It allows data replication across different cloud platforms, reducing vendor lock-in and increasing resiliency.

Prerequisites

Required Tools & Accounts

AWS Account with Free Tier access (no credits required if staying within free usage limits).

Google Cloud Account (Free Tier available, no credits required if usage stays within free thresholds).

(Optional) Azure Account with Free Tier to use Azure Event Hubs or Azure Functions if choosing the Azure path.

Installed AWS CLI and Google Cloud SDK (and Azure CLI if using Azure) on the local machine.

Proper IAM Permissions:
AWS: “AdministratorAccess” (or equivalent administrative) permissions on the account to create and configure Kinesis, Lambda, and IAM roles.
Google Cloud: “Owner” or “Editor” permissions on the project to create Google Cloud Storage buckets and BigQuery datasets.
(Optional) Azure: Owner or Contributor role to create Event Hubs and Functions.

Required Services / APIs

  • AWS: Kinesis, Lambda, and (if final storage on AWS) S3.
  • Google Cloud: Google Cloud Storage, BigQuery (optional).
  • Azure: Event Hubs, Azure Functions (optional).

Important: Ensure that each of these services is enabled in your cloud console (for Google Cloud, enable the Cloud Storage and BigQuery APIs for the project; in AWS, Kinesis and Lambda are available by default, but confirm you can access them in your target region).

Step-by-Step Implementation

Below are two approaches for each major step: one through the Graphical User Interface (GUI) and one through the Command-Line Interface (CLI). These steps can be adapted based on your preferred cloud providers, but the example here focuses on AWS + Google Cloud.

Create a Kinesis Stream (AWS)

We will set up a Kinesis Data Stream to receive incoming data.

Manual Steps (GUI)

AWS Management Console:
Sign in to the AWS Management Console.
Navigate to Kinesis under “Services.”
Click Create data stream.
Enter a Stream Name (for example, multi-cloud-stream).
Set Number of shards to 1 (the minimum, which keeps costs as low as possible).
Click Create stream to finalize.

Confirmation: You should see a new Kinesis data stream with the status “Active.”


Command-Line Interface (CLI):

aws kinesis create-stream \
  --stream-name multi-cloud-stream \
  --shard-count 1

  • What does this command do? Creates a Kinesis data stream named multi-cloud-stream with 1 shard.

To check whether it is active:

aws kinesis describe-stream --stream-name multi-cloud-stream

  • What does this command do? Displays details about the stream, including its status and the number of shards.
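
If you prefer to script this step, here is a minimal Python sketch using boto3 (assuming boto3 is installed and AWS credentials are configured locally; the region is a placeholder):

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # replace with your region

# Create the stream (equivalent to the create-stream CLI command above).
kinesis.create_stream(StreamName="multi-cloud-stream", ShardCount=1)

# Block until the stream exists, then print its status (should be ACTIVE).
kinesis.get_waiter("stream_exists").wait(StreamName="multi-cloud-stream")
status = kinesis.describe_stream(StreamName="multi-cloud-stream")
print(status["StreamDescription"]["StreamStatus"])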

Create a Google Cloud Storage (GCS) Bucket

We will use Google Cloud Storage as an intermediate or final destination for the streaming data.

Manual Steps (GUI)

Google Cloud Console:
Go to the Google Cloud Console and verify you are in the correct project.
Select Storage → Browser → Create Bucket.
Enter a unique bucket name such as multi-cloud-pipeline-bucket-123.
Choose a Region (e.g., us-central1), ideally close to your AWS region to minimize cross-cloud latency.
Leave the other settings at their defaults for the free tier.
Click Create.

Command-Line Interface (CLI)

gcloud config set project [PROJECT_ID]
gsutil mb -l us-central1 gs://multi-cloud-pipeline-bucket-123/

What do these commands do?

  • The first line sets your active project in Google Cloud to [PROJECT_ID].

  • The second line creates a new bucket named multi-cloud-pipeline-bucket-123 in the us-central1 region.
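
The same bucket can also be created from Python with the google-cloud-storage client. A minimal sketch, assuming the library is installed and application-default credentials are configured (for example via gcloud auth application-default login):

from google.cloud import storage

# [PROJECT_ID] is a placeholder for your GCP project ID.
client = storage.Client(project="[PROJECT_ID]")
bucket = client.create_bucket("multi-cloud-pipeline-bucket-123", location="us-central1")
print(f"Created bucket {bucket.name} in {bucket.location}")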

Configure AWS Lambda Function to Pull from Kinesis and Push to GCS

We will create a Lambda function triggered by new records in Kinesis. Whenever data arrives, it will batch-process and upload the data as files into the GCS bucket.


Manual Steps (GUI)

IAM Role (if not existing):

  • Navigate to IAM in AWS Console → Roles.
  • Click Create role.
  • Choose Lambda as the trusted entity.
  • Attach a policy granting access to read from Kinesis and write logs to CloudWatch.
  • Name the role (e.g., LambdaKinesisRole).


Create Lambda:

  • Go to AWS Lambda → Create function.
  • Choose Author from scratch.
  • Function name: kinesis-to-gcs.
  • Runtime: Choose a supported runtime, for example, Python 3.9.
  • Execution role: Select LambdaKinesisRole or your newly created role.
  • Click Create function.


Add Kinesis Trigger:

  • In the Function overview panel, click Add trigger.
  • Select Kinesis.
  • Choose the stream multi-cloud-stream.
  • Batch size: Keep default or specify 100.
  • Click Add.


Function Code:

  • In the code editor, add logic to send data to GCS. Because the function must talk to GCS across clouds, this is typically done with a signed URL or an authenticated HTTP request (for cross-cloud access, the public GCS XML/JSON APIs work well); a minimal sketch is shown after the note below.
  • Click Deploy.


Note: Direct cross-cloud writes might require extra steps, like creating a service account in GCP and generating short-lived credentials from AWS. For the sake of simplicity, assume we are using a public API or an already available service account key.
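
A minimal sketch of such a handler follows. It assumes a GCP service account key bundled with the deployment package as gcp-key.json (a hypothetical file name) and that the google-cloud-storage library is packaged with the function or provided via a Lambda layer, since it is not part of the Lambda runtime:

import base64
import time

from google.cloud import storage

BUCKET_NAME = "multi-cloud-pipeline-bucket-123"
# gcp-key.json is a hypothetical service account key shipped inside the deployment package.
gcs_client = storage.Client.from_service_account_json("gcp-key.json")

def lambda_handler(event, context):
    # Kinesis delivers records base64-encoded under event["Records"][i]["kinesis"]["data"].
    payloads = [
        base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        for record in event.get("Records", [])
    ]
    if not payloads:
        return {"written": 0}

    # Write the whole batch as a single newline-delimited object in GCS.
    object_name = f"kinesis-batch-{int(time.time() * 1000)}.json"
    bucket = gcs_client.bucket(BUCKET_NAME)
    bucket.blob(object_name).upload_from_string(
        "\n".join(payloads), content_type="application/json"
    )
    return {"written": len(payloads), "object": object_name}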


Command-Line Interface (CLI)

Create the IAM Role for Lambda

aws iam create-role \
  --role-name LambdaKinesisRole \
  --assume-role-policy-document file://TrustPolicyForLambda.json


What does this command do?

Creates a role named LambdaKinesisRole that Lambda can assume. TrustPolicyForLambda.json should specify "Service": "lambda.amazonaws.com" as the trusted entity.
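
For reference, a minimal trust policy along those lines could look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}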

Attach policies to the role

aws iam attach-role-policy \
  --role-name LambdaKinesisRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonKinesisFullAccess

aws iam attach-role-policy \
  --role-name LambdaKinesisRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess


What do these commands do?

Grant LambdaKinesisRole permissions to fully access Kinesis and to write logs to CloudWatch.


Create Lambda function

zip function.zip lambda_function.py   # Package the code into a zip archive

aws lambda create-function \
  --function-name kinesis-to-gcs \
  --runtime python3.9 \
  --zip-file fileb://function.zip \
  --handler lambda_function.lambda_handler \
  --role arn:aws:iam::[ACCOUNT_ID]:role/LambdaKinesisRole


What does this command do?

Creates a Lambda function named kinesis-to-gcs using the packaged code (lambda_function.py) and associates it with the specified IAM role.


Create the Kinesis trigger

aws lambda create-event-source-mapping \
  --function-name kinesis-to-gcs \
  --event-source-arn arn:aws:kinesis:[REGION]:[ACCOUNT_ID]:stream/multi-cloud-stream \
  --batch-size 100 \
  --starting-position LATEST


What does this command do?

Links the Lambda function to the Kinesis stream, specifying how many records to process per batch (--batch-size 100).


Note: The code inside lambda_function.py should handle sending data to GCS. This typically involves generating a signed URL from GCP (via a separate script/credentials) or calling a secure endpoint that writes to GCS.
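
If you take the signed-URL route, a short Python sketch run on the GCP side (with a service account that can write to the bucket) could generate an upload URL for the Lambda function to use; the key file name is a hypothetical placeholder:

from datetime import timedelta

from google.cloud import storage

# gcp-key.json is a hypothetical service account key with write access to the bucket.
client = storage.Client.from_service_account_json("gcp-key.json")
blob = client.bucket("multi-cloud-pipeline-bucket-123").blob("kinesis-batch-example.json")

# Generate a V4 signed URL that allows an HTTP PUT of this object for 15 minutes.
url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),
    method="PUT",
    content_type="application/json",
)
print(url)  # pass this URL to the Lambda function, e.g. via an environment variable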

Optional: Load Data from GCS into BigQuery

If a final analytics layer is desired, data from GCS can be periodically loaded into BigQuery.


Manual Steps (GUI)

BigQuery Console:
In Google Cloud Console, go to BigQuery.
Create a Dataset (e.g., multicloud_dataset).
Within the dataset, click Create table.
Select Source = Google Cloud Storage and choose the bucket/folder where Lambda is storing files.
Specify File format (e.g., JSON or CSV).
Click Create table to load data.

Command-Line Interface (CLI)

bq --location=us-central1 mk --dataset [PROJECT_ID]:multicloud_dataset
bq load \
  --source_format=CSV \
  --autodetect \
  [PROJECT_ID]:multicloud_dataset.multicloud_table \
  gs://multi-cloud-pipeline-bucket-123/*.csv


  • What do these commands do?
    The first command creates a dataset named multicloud_dataset in the us-central1 location.
    The second command loads all CSV files from the specified GCS bucket into a table named multicloud_table; --autodetect tells BigQuery to infer the table schema from the files.
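
The same load can be triggered from Python with the google-cloud-bigquery client. A minimal sketch, assuming the library is installed, default credentials are configured, and [PROJECT_ID] is a placeholder:

from google.cloud import bigquery

client = bigquery.Client(project="[PROJECT_ID]")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # infer the schema from the CSV files
)
job = client.load_table_from_uri(
    "gs://multi-cloud-pipeline-bucket-123/*.csv",
    "[PROJECT_ID].multicloud_dataset.multicloud_table",
    job_config=job_config,
)
job.result()  # wait for the load job to finish
print(f"Loaded {job.output_rows} rows")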

Verifying and Testing the Project

Data Ingestion Test: Put sample data into the Kinesis stream (a Python script that automates this end-to-end check is sketched after these steps):

aws kinesis put-record \
  --stream-name multi-cloud-stream \
  --partition-key testKey \
  --data "HelloMultiCloud"

Note: with AWS CLI v2, add --cli-binary-format raw-in-base64-out so the data string is not interpreted as base64.

Verify that the Lambda function triggers by checking CloudWatch Logs (AWS console → CloudWatch → Logs).

GCS Object Check:

  • Go to Cloud Console → Storage → Browser → open the bucket.

  • Verify an object/file with your sample data is created.

(Optional) BigQuery Load Test:

  • If loading to BigQuery, run a query on the newly created table to see if data arrived correctly.
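
For a scripted version of this verification, the following Python sketch puts a test record into Kinesis and then lists the objects that appear in the GCS bucket (it assumes boto3 and google-cloud-storage are installed and that credentials for both clouds are configured locally; the region and project ID are placeholders):

import time

import boto3
from google.cloud import storage

# 1. Put a sample record into the Kinesis stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")
kinesis.put_record(
    StreamName="multi-cloud-stream",
    PartitionKey="testKey",
    Data=b"HelloMultiCloud",
)

# 2. Give the Lambda trigger a moment to run, then look for new objects in GCS.
time.sleep(60)
gcs = storage.Client(project="[PROJECT_ID]")
for blob in gcs.list_blobs("multi-cloud-pipeline-bucket-123"):
    print(blob.name, blob.updated)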

Common Issues and Troubleshooting

  • IAM Permissions: If Lambda fails to read from Kinesis or write logs, ensure the attached policies are correct.
  • Cross-Cloud Credentials: Writing from AWS to GCS may require a service account key or public endpoint. Double-check the authentication method.
  • Region Mismatch: Placing Kinesis and GCS in geographically distant regions can cause delays or higher latency.
  • Quota Limits: Even on free tiers, each cloud has rate limits. If data volume is large, you may exceed the free allowance.
  • File Format Incompatibility: When loading data into BigQuery, ensure the data format (CSV, JSON) matches your table schema.

Conclusion

We have successfully built a Multi-Cloud Data Pipeline that streams data from AWS (via Kinesis) to Google Cloud Storage, with an optional loading process into BigQuery. Along the way, we have acquired skills in configuring serverless functions (Lambda), managing cross-cloud resources (AWS → GCP), and automating workflows via the CLI. This approach can be adapted to other cloud provider combinations (e.g., Azure Event Hubs → AWS S3, Azure Functions → Google Cloud Storage) and scaled for real-world production scenarios without incurring credit charges, provided the usage stays within free tier limits.

What is Cloud Computing?

Cloud computing delivers computing resources (servers, storage, databases, networking, and software) over the internet, allowing businesses to scale and pay only for what they use, eliminating the need for physical infrastructure.


  • AWS: The most popular cloud platform, offering scalable compute, storage, AI/ML, and networking services.
  • Azure: A strong enterprise cloud with hybrid capabilities and deep Microsoft product integration.
  • Google Cloud (GCP): Known for data analytics, machine learning, and open-source support.