Introduction #
In the modern cloud-native world, observability, scalability, and security are not optional; they are architectural requirements. This guide provides a full end-to-end journey: from containerizing microservices locally with Docker, to deploying them on Amazon Elastic Kubernetes Service (EKS) with cost-optimized infrastructure, observability (OpenTelemetry, Prometheus, Grafana), automated CI/CD pipelines, and DevSecOps practices.
Objective #
- Test & Production Pipelines – Validate microservices locally (EC2 + Docker) before migrating to EKS, ensuring cost efficiency during development.
- Secure & Scalable Infrastructure – Deploy on Amazon EKS with IAM roles, KMS encryption, pod identity, VPC segmentation, and autoscaling.
- Full Observability – Implement OpenTelemetry Collector, Prometheus, Grafana, and CloudWatch for tracing, logging, and metrics.
- Easy Version Management – Use Helm charts for versioned deployments, upgrades, and rollbacks.
- DevSecOps CI/CD Automation – Build a GitHub Actions pipeline with Trivy, FOSSA, and OSSF Scorecard integrations.
- Real-Time Alerting – Configure Prometheus + Alertmanager with email notifications for pod restarts and cluster health issues.
Phase 1: Local Docker Deployment & Foundational EKS Setup #
Phase 1 laid the groundwork for deploying the OpenTelemetry microservices demo application, moving from a local Docker test environment to a production-grade Kubernetes cluster on Amazon EKS. This phase focused on validating service functionality, optimizing infrastructure, and building a secure and scalable cloud environment.
1.1 Objectives #
- Validate microservices locally on EC2 + Docker Compose.
- Transition to a production-grade Amazon EKS cluster with security, cost optimization, and scalability.
1.2 Implementation #
1.2.1 EC2 Test Environment #
- To run the OpenTelemetry demo locally using Docker Compose, the team provisioned an EC2 instance with the following specifications:
  - Instance Type: t2.xlarge
  - vCPUs: 4
  - RAM: 16 GB
  - Storage: 30 GB General Purpose SSD (gp2)
  - Class: On-demand
- The test environment helped simulate microservices behavior and understand performance thresholds. Key takeaways:
  - The application performed well on a t2.xlarge instance.
  - Smaller instance types led to performance degradation and service failures.
  - On-demand pricing was chosen over spot instances for stability during evaluation.
- The microservices were deployed with Docker Compose using:
cd opentelemetry-demo/
docker compose up
- Verification steps included:
  - Confirming all services were up with docker ps.
  - Checking individual service logs with docker logs to detect misconfigurations or startup errors.
  - Exposing the EC2 instance to the internet and accessing the application via its public IP and configured port (e.g., 8080).
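A hedged sketch of those verification commands; the container name and address are placeholders (actual names come from the Compose file), not values from the project:

```shell
docker ps                              # every service should report "Up"
docker logs frontend-proxy --tail 50   # inspect one service for startup errors
curl -I http://<ec2-public-ip>:8080    # confirm the frontend answers externally
```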
1.2.2 EKS Production Environment #
After successful validation, the team transitioned to the production deployment on Amazon EKS with the following components:
- Networking Design:
  - Public Subnets: Hosted ingress components such as the ALB Ingress Controller.
  - Private Subnets: Hosted the actual worker nodes running Kubernetes workloads.
- EKS Node Group Configuration:
  - Instance Type: t2.xlarge
  - Auto Scaling Setup:
    - Minimum Nodes: 1
    - Desired Nodes: 2
    - Maximum Nodes: 3
  - Scaling Behavior: The cluster auto-scales up to three nodes under heavy load and always maintains at least one active node.
- Use of Spot Instances: Stateless or loosely coupled services were deployed on spot instances to optimize cost without affecting stability.
- Add-ons and Integrations:
  - VPC-CNI Addon: Provides pod networking by assigning VPC IPs directly to pods and allowing integration with security groups and AWS PrivateLink.
  - EBS-CSI Addon: Allows dynamic provisioning of EBS volumes for stateful workloads, supporting snapshot-based backup and recovery.
  - CloudWatch Agent Addon: Enables Container Insights, collecting metrics and logs from containers and sending them to CloudWatch for centralized observability.
  - EKS Pod Identity: Pods request temporary IAM credentials from a DaemonSet-based Pod Identity Agent, eliminating the need for long-term secrets or manual AWS credential management.
  - ALB Ingress Controller Support: A dedicated pod identity association was created and annotated to allow the ALB controller to deploy ingress resources using proper IAM permissions.
- KMS Integration: Secrets in Kubernetes and data in EBS volumes are encrypted using AWS Key Management Service (KMS) for enhanced data security.
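The ALB controller's pod identity setup described above can be sketched with the AWS CLI; the cluster name, role ARN, and service account name below are assumptions, not the project's actual values:

```shell
# Associate an IAM role with the ALB controller's service account
# via EKS Pod Identity (requires the eks-pod-identity-agent add-on).
aws eks create-pod-identity-association \
  --cluster-name my-eks-cluster \
  --namespace kube-system \
  --service-account aws-load-balancer-controller \
  --role-arn arn:aws:iam::123456789012:role/alb-controller-role
```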
1.3 Verification #
1.3.1 EC2 Instance #
- Confirmed that the Docker Compose environment functioned as expected:
  - Services started successfully
  - Logs showed proper service-to-service communication
  - Application accessible via public IP
1.3.2 EKS Cluster Setup #
- Used aws eks update-kubeconfig to configure kubectl.
- Cloned and deployed the infrastructure from GitHub: https://github.com/arbaaz29/eks_terraform
- Deployed all Kubernetes manifests using:
cd eks_terraform/k8s
kubectl apply -f .
- Verified the following in both the default and otel-demo namespaces:
- Running Pods
- Services
- Deployments
- Logs for key microservices (e.g., frontend-proxy, Grafana, Jaeger, OpenSearch)
1.4 Highlights #
- EC2 (t2.xlarge) used for local validation.
- EKS cluster with private/public subnets, autoscaling, pod identity, ALB ingress controller, CloudWatch integration, and KMS encryption.
- Spot instances used for stateless workloads, cutting AWS costs.
1.5 Challenges and Resolutions #
| Challenge | Resolution |
|---|---|
| EC2 access to frontend-proxy blocked | Updated security group to allow traffic on port 8080 |
| Pods could not use AWS services (e.g., EBS, ALB) | Created and annotated IAM policies with Pod Identity for appropriate service accounts |
| Deployment manifest misconfiguration (e.g., product-catalog, Grafana) | Fixed configMaps and applied corrections in the deployment manifests |
| Subnets not recognized for Kubernetes resources | Added proper Kubernetes resource tags to the subnets |
| GitHub user lacked EKS access | Added user to access entries with eksclusteradmin permissions |
Notes #
- Ensure IAM roles and policies follow the principle of least privilege.
- Check if service accounts have proper annotations so that respective IAM roles can be associated with them.
- Confirm subnet tagging aligns with the EKS requirements for cluster and load balancer integration.
- Maintain an access control list for GitHub users with justifications for elevated permissions.
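The subnet tagging mentioned in the notes follows the standard EKS load balancer discovery tags; a hedged AWS CLI sketch (subnet IDs and cluster name are placeholders):

```shell
# Public subnets: allow internet-facing load balancers.
aws ec2 create-tags --resources subnet-0aaa1111 \
  --tags Key=kubernetes.io/role/elb,Value=1 \
         Key=kubernetes.io/cluster/my-eks-cluster,Value=shared

# Private subnets: internal load balancers and worker nodes.
aws ec2 create-tags --resources subnet-0bbb2222 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1 \
         Key=kubernetes.io/cluster/my-eks-cluster,Value=shared
```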
Phase 2: Integrating Helm for Kubernetes Deployment #
- After successfully deploying and verifying the OpenTelemetry microservices on Amazon EKS using raw Kubernetes manifests, the next logical step was to streamline and simplify the deployment process. Phase 2 focused on using Helm, the package manager for Kubernetes, to manage and automate deployments, upgrades, and rollbacks.
2.1 Objective #
- Simplified deployments using Helm charts for the OpenTelemetry demo.
- Enabled seamless upgrades, rollbacks, and environment isolation.
2.2 Implementation #
2.2.1 Adding Helm Repository #
- To begin, the team added the official OpenTelemetry Helm chart repository:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
- This pulled in the latest Helm charts for OpenTelemetry components, including:
  - Frontend, backend, and telemetry services
  - OpenTelemetry Collector
  - Jaeger, Kafka, Prometheus exporters, etc.
- The use of official charts ensured that best practices were followed and configurations remained compatible with Kubernetes standards.
2.2.2 Deploying the Application Using Helm #
To isolate the Helm-based deployment from the manually deployed environment, a new namespace was created:
kubectl create namespace otel-helm-demo
- Then the OpenTelemetry demo was deployed using the Helm chart:
helm install otel-demo open-telemetry/opentelemetry-demo -n otel-helm-demo
- Verification steps included:
kubectl get pods -n otel-helm-demo
kubectl get service -n otel-helm-demo
- This confirmed that:
  - All Kubernetes resources (pods, services, deployments) were created correctly.
  - The Helm chart encapsulated all necessary microservices in a single, consistent deployment process.
2.2.3 Upgrade and Rollback #
To simulate real-world usage and verify Helm’s lifecycle management features, the team tested an upgrade and rollback scenario.
- Upgrade Scenario:
- The replica count of the frontend-proxy component was increased from the default to 3:
helm upgrade otel-demo open-telemetry/opentelemetry-demo \
-n otel-helm-demo \
--set components.frontend-proxy.replicas=3
- Verification:
kubectl get pods -n otel-helm-demo
helm history otel-demo -n otel-helm-demo
- This confirmed the increased number of frontend-proxy pods and a new revision entry in Helm’s release history.
- Rollback Scenario:
- To test rollback capability, the team reverted to the previous revision:
helm rollback otel-demo 1 -n otel-helm-demo
kubectl get pods -n otel-helm-demo
helm history otel-demo -n otel-helm-demo
kubectl describe deployment frontend-proxy -n otel-helm-demo
- This successfully returned the deployment to its initial configuration without any manual cleanup or reconfiguration.
2.3 Challenges and Solutions #
| Challenge | Resolution |
|---|---|
| Helm values misconfiguration (e.g., product-catalog, Grafana) | Fixed configMaps and applied corrections in the Helm values |
| Used incorrect override path: frontend-proxy.replicaCount | Corrected to components.frontend-proxy.replicas by consulting the chart’s documentation |
| Pods crashed post-upgrade due to missing configuration values | Reviewed and updated values.yaml structure, and used --set flags to apply overrides inline during upgrade |
Notes #
- Always verify Helm override paths against the chart’s structure and documentation.
- Use --dry-run and helm template to preview changes before applying them.
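Those preview commands can be combined with the upgrade from section 2.2.3; a minimal sketch using the release and chart names from the examples above:

```shell
# Server-side dry run of the exact upgrade before applying it.
helm upgrade otel-demo open-telemetry/opentelemetry-demo \
  -n otel-helm-demo \
  --set components.frontend-proxy.replicas=3 \
  --dry-run

# Render the chart locally to inspect the manifests Helm would generate.
helm template otel-demo open-telemetry/opentelemetry-demo \
  --set components.frontend-proxy.replicas=3
```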
2.4 Highlights #
- One-command installation with helm install.
- Replica scaling, upgrades, and rollback tested successfully.
- Reduced manual kubectl apply overhead.
Phase 3: Alerting Service and Notifications #
- After establishing deployment and observability foundations, Phase 3 focused on real-time alerting to detect application health issues, particularly around pod restarts. This phase introduced monitoring and alerting mechanisms using the Prometheus Stack, Alertmanager, and Kubernetes ConfigMaps, with email notifications configured via SMTP.
3.1 Objective #
- Implemented Prometheus Stack + Alertmanager with custom alert rules.
- Configured Gmail SMTP for real-time email alerts.
3.2 Implementation #
3.2.1 Deploying the Prometheus Stack with Helm #
- The monitoring stack included:
  - Prometheus for metrics collection
  - Alertmanager for sending alerts
  - kube-state-metrics for Kubernetes state data
- To deploy these components, the official Helm chart from the prometheus-community repository was used:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
- This deployed all necessary resources into a dedicated monitoring namespace, helping with logical separation and resource governance.
3.2.2 Creating the Alerting Rule for Pod Restarts #
- To detect frequent container restarts, a custom Prometheus alert rule was defined in a file named alerts.yaml:
groups:
  - name: pod-restarts
    rules:
      - alert: PodRestartTooHigh
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High restart count detected"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} restarted more than 3 times in the last 5 minutes."
- This rule triggers an alert when any container in a pod restarts more than 3 times within a 5-minute window, persisting for at least one minute.
- The alert rule was applied using a Kubernetes ConfigMap:
kubectl create configmap prometheus-alerts --from-file=alerts.yaml -n monitoring
- This ConfigMap could then be mounted into the Prometheus deployment via Helm values (if dynamic configuration reload was enabled).
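As an alternative to a separately mounted ConfigMap, the kube-prometheus-stack chart accepts extra rule groups directly through its Helm values; a minimal sketch (the `pod-restart-rules` map key is an arbitrary name):

```yaml
# values override for kube-prometheus-stack
additionalPrometheusRulesMap:
  pod-restart-rules:
    groups:
      - name: pod-restarts
        rules:
          - alert: PodRestartTooHigh
            expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
            for: 1m
            labels:
              severity: critical
```

Applied with `helm upgrade prometheus prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`, the operator reloads the rules without a separate mount.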
3.2.3 Configuring Alertmanager for Email Notifications #
- To route alerts via email, Alertmanager was configured to use Gmail’s SMTP service. The configuration was defined in alertmanager.yaml:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'sender@gmail.com'
  smtp_auth_username: 'sender@gmail.com'
  smtp_auth_password_file: '/etc/secrets/smtp_password'
route:
  receiver: 'Mail Alert'
  repeat_interval: 30s
  group_wait: 15s
  group_interval: 15s
receivers:
  - name: 'Mail Alert'
    email_configs:
      - to: 'receiver@gmail.com'
        headers:
          subject: 'Pod stuck in restart state'
This file was converted to a Kubernetes Secret:
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
- The password for Gmail SMTP was provided via a mounted file (smtp_password) for secure authentication.
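One way to provide that file is a dedicated Secret mounted into the Alertmanager pod; the secret and key names below are assumptions, and the mount itself is wired through the chart's Alertmanager values:

```shell
# Store the Gmail app password as a Secret; mounted, it appears
# at /etc/secrets/smtp_password as referenced in alertmanager.yaml.
kubectl create secret generic smtp-password \
  --from-literal=smtp_password='your-gmail-app-password' \
  -n monitoring
```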
3.2.4 Testing the Alert #
- To validate the alerting system, a crash-looping pod was manually created using:
kubectl run crashloop-demo --image=busybox --restart=Always -- /bin/sh -c "exit 1"
- This caused the pod to continuously restart, increasing the restart counter.
- Prometheus, using the alert rule defined earlier, detected this behavior. Once the increase() function’s threshold was crossed and persisted for one minute, an alert was triggered.
- The alert appeared in the Alertmanager UI under the configured route.
- An email notification was sent to the specified address with full metadata, including:
  - Alert name
  - Namespace
  - Affected pod
  - Restart count
  - Timestamps and severity
3.3 Highlights #
- Alert on pods restarting more than 3 times in 5 minutes.
- ConfigMaps + Secrets used for secure Prometheus/Alertmanager configuration.
- Tested with a CrashLoopBackOff pod; the alert fired successfully and the email notification was delivered.
3.4 Summary and Impact #
- The implementation of real-time alerting brought several operational benefits:
  - Early Detection: Crash-looping pods and other anomalies are flagged almost instantly.
  - Rapid Response: Email alerts reach stakeholders without requiring constant dashboard monitoring.
  - Production Readiness: The system now includes observability not only through dashboards, but through active notifications.
- This phase added an essential layer of resilience, helping the team respond to failures before they escalate into service outages.
Phase 4: CI/CD Integration with DevSecOps Enhancements #
- With the infrastructure, observability, and alerting systems in place, Phase 4 of the project focused on automating the software delivery pipeline using GitHub Actions. The goal was to implement a robust Continuous Integration and Continuous Deployment (CI/CD) system, bolstered by DevSecOps best practices such as automated vulnerability scanning, license checks, rollback mechanisms, and secure secret management.
4.1 CI/CD Pipeline Overview #
- Built GitHub Actions pipeline automating build → scan → push → deploy → monitor.
| Step | Description |
|---|---|
| Checkout Code | Pull the latest source code from GitHub |
| Configure AWS Credentials | Authenticate GitHub runner to access AWS using GitHub Secrets |
| Login to Amazon ECR | Use Docker CLI to log in to Elastic Container Registry |
| Set Environment Variables | Dynamically generate .env file with image tags and ECR URIs |
| Build Docker Images | Build all microservices using docker-compose |
| Push Images to ECR | Upload container images to AWS ECR |
| Install Trivy | Install Trivy CLI for vulnerability scanning |
| Scan Images | Run scans on each image and fail on HIGH/CRITICAL CVEs |
| Update kubeconfig | Authenticate kubectl to the target EKS cluster |
| Patch YAML Manifests | Automatically update Kubernetes manifests with new image tags |
| Commit Updated YAMLs | Push the updated manifests back to the GitHub repository |
| Deploy to EKS | Apply all manifests using kubectl apply |
| Deploy Monitoring Configs | Apply configurations for kube-state-metrics and alerting rules |
| Rollback on Failure | If any step fails, trigger kubectl rollout undo for all deployments |
- Integrated rollback mechanisms for production safety.
- Pipeline Steps: Checkout code → Build Docker images → Push to ECR → Scan with Trivy → License check with FOSSA → Security posture check with OSSF Scorecard → Deploy with kubectl apply → Rollback on failure.
4.2 Rollback Mechanism #
One of the key production-readiness features in this phase was automated rollback. The workflow used GitHub Actions’ if: failure() condition to trigger:
kubectl rollout undo deployment/<service-name> -n <namespace>
- This command restored each service to its previously stable replica set.
- A failure was simulated during testing by applying an invalid image tag, which correctly triggered the rollback behavior, ensuring that no broken deployments reached users.
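In a GitHub Actions workflow, that condition and command could look like the following sketch (the deployment and namespace names are illustrative, not taken from the project's actual workflow):

```yaml
- name: Rollback on failure
  if: failure()
  run: |
    kubectl rollout undo deployment/frontend -n otel-demo
    kubectl rollout undo deployment/frontend-proxy -n otel-demo
```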
4.3 Secret Management #
- Sensitive data such as AWS credentials, API keys, and cluster names were stored securely in GitHub Secrets, and accessed in the workflow using:
${{ secrets.<KEY_NAME> }}
- Examples of secrets used:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - EKS_CLUSTER_NAME
  - FOSSA_API_KEY
- This practice eliminated the need for storing plaintext credentials in code or configuration files, aligning with industry security best practices.
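A typical workflow step consuming these secrets might look like this sketch (the region value is an assumption):

```yaml
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1
```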
4.4 DevSecOps Integrations #
- Best security practices:
  - GitHub Secrets for AWS credentials and API keys.
  - OSSF Scorecard for repo security posture.
  - FOSSA & Trivy scans embedded in the pipeline.
- To ensure code quality, security, and license compliance, several DevSecOps tools were integrated directly into the CI pipeline:
4.4.1 FOSSA #
- Purpose: Scan for license violations and known open-source vulnerabilities.
- Integration: Triggered via the FOSSA GitHub Action.
- Outcome: Completed successfully with no issues detected.
4.4.2 Gradle Wrapper Validation #
- Purpose: Check that gradle-wrapper.jar and gradle-wrapper.properties are valid and not tampered with.
- Trigger: PR or push events.
- Outcome: Successfully validated using a test commit under the correct path.
4.4.3 OSSF Scorecard #
- Purpose: Assess the security posture of the GitHub repository.
- Features Checked: Branch protection, dependency update automation, token permissions, and more.
- Integration: Results uploaded to GitHub’s Code Scanning dashboard.
- Schedule: Triggered on push and weekly.
4.5 Challenges and Solutions #
| Challenge | Resolution |
|---|---|
| Inconsistent Docker image tagging | Used .env file with dynamic GitHub Actions variables to standardize tags |
| FOSSA action failed due to team misconfiguration | Removed team parameter and used auto-detection |
| Trivy scan failed due to bad image reference | Corrected the image tagging format |
| Gradle wrapper validation didn’t trigger | Created a dummy commit in the monitored path to validate integration |
| Kubernetes YAMLs not updated for each image | Used sed to auto-update image tags in all deployment files |
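The sed-based tag update from the table can be sketched as follows; the file path, registry URI, and tags are illustrative, not the project's actual values:

```shell
# Create a sample manifest to patch (illustrative content).
cat > /tmp/frontend-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  template:
    spec:
      containers:
        - name: frontend
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/frontend:old-tag
EOF

# Rewrite the image tag in place, as the pipeline does for each service.
NEW_TAG="sha-abc1234"
sed -i "s|\(image: .*frontend:\).*|\1${NEW_TAG}|" /tmp/frontend-deployment.yaml

grep 'image:' /tmp/frontend-deployment.yaml
```

After the rewrite, `kubectl apply` picks up the patched manifest, and committing the change back to the repository keeps Git as the source of truth for deployed tags.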
4.6 Execution Results #
- Artifacts and verifications from successful pipeline executions included:
  - GitHub Actions workflow logs showing successful build, scan, and deployment
  - Trivy scan logs showing no high/critical vulnerabilities
  - Confirmation of image push to ECR
  - Visual confirmation of rollback behavior (if triggered)
  - Updated deployment manifests committed to GitHub
  - Running pods confirmed via kubectl get pods
  - Live application access via ALB Ingress
Conclusion #
This project demonstrates a production-ready DevSecOps architecture on Amazon EKS, combining:
- Cost optimization with spot instances + minimal node scaling
- Security best practices with IAM, KMS, Secrets Manager, and IRSA
- Full observability with OpenTelemetry, Prometheus, Grafana, CloudWatch
- Automated CI/CD pipeline with vulnerability scanning and rollback
- Real-time alerting & notifications for rapid incident response