Introduction #
In the modern cloud-native world, observability, scalability, and security are not optional; they are architectural requirements. This guide provides a full end-to-end journey: from containerizing microservices locally with Docker, to deploying them on Amazon Elastic Kubernetes Service (EKS) with cost-optimized infrastructure, observability (OpenTelemetry, Prometheus, Grafana), automated CI/CD pipelines, and DevSecOps practices.
Objective #
- Test & Production Pipelines – Validate microservices locally (EC2 + Docker) before migrating to EKS, ensuring cost efficiency during development.
- Secure & Scalable Infrastructure – Deploy on Amazon EKS with IAM roles, KMS encryption, pod identity, VPC segmentation, and autoscaling.
- Full Observability – Implement OpenTelemetry Collector, Prometheus, Grafana, and CloudWatch for tracing, logging, and metrics.
- Easy Version Management – Use Helm charts for versioned deployments, upgrades, and rollbacks.
- DevSecOps CI/CD Automation – Build a GitHub Actions pipeline with Trivy, FOSSA, and OSSF Scorecard integrations.
- Real-Time Alerting – Configure Prometheus + Alertmanager with email notifications for pod restarts and cluster health issues.
Phase 1: Local Docker Deployment & Foundational EKS Setup #
Phase 1 laid the groundwork for deploying the OpenTelemetry microservices demo application, moving from a local Docker test environment to a production-grade Kubernetes cluster on Amazon EKS. This phase focused on validating service functionality, optimizing infrastructure, and building a secure and scalable cloud environment.
1.1 Objectives #
- Validate microservices locally on EC2 + Docker Compose.
- Transition to a production-grade Amazon EKS cluster with security, cost optimization, and scalability.
1.2 Implementation #
1.2.1 EC2 Test Environment #
- To run the OpenTelemetry demo locally using Docker Compose, the team provisioned an EC2 instance with the following specifications:
  - Instance Type: t2.xlarge
  - vCPUs: 4
  - RAM: 16 GB
  - Storage: 30 GB General Purpose SSD (gp2)
  - Class: On-demand
- The test environment helped simulate microservices behavior and understand performance thresholds. Key takeaways:
  - The application performed well on a t2.xlarge instance.
  - Smaller instance types led to performance degradation and service failures.
  - On-demand pricing was chosen over spot instances for stability during evaluation.
- The microservices were deployed with Docker Compose using:
cd opentelemetry-demo/
docker compose up
- Verification steps included:
  - Confirming all services were up with docker ps.
  - Checking individual service logs with docker logs to detect misconfigurations or startup errors.
  - Exposing the EC2 instance to the internet and accessing the application via its public IP and configured port (e.g., 8080).
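A hedged sketch of those verification commands; the container name and address are placeholders (actual names come from the Compose file), not values from the project:

```shell
docker ps                              # every service should report "Up"
docker logs frontend-proxy --tail 50   # inspect one service for startup errors
curl -I http://<ec2-public-ip>:8080    # confirm the frontend answers externally
```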
1.2.2 EKS Production Environment #
After successful validation, the team transitioned to the production deployment on Amazon EKS with the following components:
- Networking Design:
  - Public Subnets: Hosted ingress components such as the ALB Ingress Controller.
  - Private Subnets: Hosted the actual worker nodes running Kubernetes workloads.
- EKS Node Group Configuration:
  - Instance Type: t2.xlarge
  - Auto Scaling Setup:
    - Minimum Nodes: 1
    - Desired Nodes: 2
    - Maximum Nodes: 3
  - Scaling Behavior: The cluster auto-scales up to three nodes under heavy load and always maintains at least one active node.
- Use of Spot Instances: Stateless or loosely coupled services were deployed on spot instances to optimize cost without affecting stability.
- Add-ons and Integrations:
  - VPC-CNI Addon: Provides pod networking by assigning VPC IPs directly to pods and allowing integration with security groups and AWS PrivateLink.
  - EBS-CSI Addon: Allows dynamic provisioning of EBS volumes for stateful workloads, supporting snapshot-based backup and recovery.
  - CloudWatch Agent Addon: Enables Container Insights, collecting metrics and logs from containers and sending them to CloudWatch for centralized observability.
  - EKS Pod Identity: Pods request temporary IAM credentials from a DaemonSet-based Pod Identity Agent, eliminating the need for long-term secrets or manual AWS credential management.
  - ALB Ingress Controller Support: A dedicated pod identity association was created and annotated to allow the ALB controller to deploy ingress resources using proper IAM permissions.
- KMS Integration: Secrets in Kubernetes and data in EBS volumes are encrypted using AWS Key Management Service (KMS) for enhanced data security.
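The ALB controller's pod identity setup described above can be sketched with the AWS CLI; the cluster name, role ARN, and service account name below are assumptions, not the project's actual values:

```shell
# Associate an IAM role with the ALB controller's service account
# via EKS Pod Identity (requires the eks-pod-identity-agent add-on).
aws eks create-pod-identity-association \
  --cluster-name my-eks-cluster \
  --namespace kube-system \
  --service-account aws-load-balancer-controller \
  --role-arn arn:aws:iam::123456789012:role/alb-controller-role
```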
1.3 Verification #
1.3.1 EC2 Instance #
- Confirmed that the Docker Compose environment functioned as expected:
  - Services started successfully
  - Logs showed proper service-to-service communication
  - Application accessible via public IP
1.3.2 EKS Cluster Setup #
- Used aws eks update-kubeconfig to configure kubectl.
- Cloned and deployed the infrastructure from GitHub: https://github.com/arbaaz29/eks_terraform
- Deployed all Kubernetes manifests using:
cd eks_terraform/k8s
kubectl apply -f .
- Verified the following in both the default and otel-demo namespaces:
- Running Pods
- Services
- Deployments
- Logs for key microservices (e.g., frontend-proxy, Grafana, Jaeger, OpenSearch)
1.4 Highlights #
- EC2 (t2.xlarge) used for local validation.
- EKS cluster with private/public subnets, autoscaling, pod identity, ALB ingress controller, CloudWatch integration, and KMS encryption.
- Spot instances used for stateless workloads, cutting AWS costs.
1.5 Challenges and Resolutions #
| Challenge | Resolution |
|---|---|
| EC2 access to frontend-proxy blocked | Updated security group to allow traffic on port 8080 |
| Pods could not use AWS services (e.g., EBS, ALB) | Created and annotated IAM policies with Pod Identity for appropriate service accounts |
| Deployment manifest misconfiguration (e.g., product-catalog, Grafana) | Fixed configMaps and applied corrections in the deployment manifests |
| Subnets not recognized for Kubernetes resources | Added proper Kubernetes resource tags to the subnets |
| GitHub user lacked EKS access | Added user to access entries with eksclusteradmin permissions |
Notes #
- Ensure IAM roles and policies follow the principle of least privilege.
- Check if service accounts have proper annotations so that respective IAM roles can be associated with them.
- Confirm subnet tagging aligns with the EKS requirements for cluster and load balancer integration.
- Maintain an access control list for GitHub users with justifications for elevated permissions.
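The subnet tagging mentioned in the notes follows the standard EKS load balancer discovery tags; a hedged AWS CLI sketch (subnet IDs and cluster name are placeholders):

```shell
# Public subnets: allow internet-facing load balancers.
aws ec2 create-tags --resources subnet-0aaa1111 \
  --tags Key=kubernetes.io/role/elb,Value=1 \
         Key=kubernetes.io/cluster/my-eks-cluster,Value=shared

# Private subnets: internal load balancers and worker nodes.
aws ec2 create-tags --resources subnet-0bbb2222 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1 \
         Key=kubernetes.io/cluster/my-eks-cluster,Value=shared
```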
Phase 2: Integrating Helm for Kubernetes Deployment #
- After successfully deploying and verifying the OpenTelemetry microservices on Amazon EKS using raw Kubernetes manifests, the next logical step was to streamline and simplify the deployment process. Phase 2 focused on using Helm, the package manager for Kubernetes, to manage and automate deployments, upgrades, and rollbacks.
2.1 Objective #
- Simplified deployments using Helm charts for the OpenTelemetry demo.
- Enabled seamless upgrades, rollbacks, and environment isolation.
2.2 Implementation #
2.2.1 Adding Helm Repository #
- To begin, the team added the official OpenTelemetry Helm chart repository:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
- This pulled in the latest Helm charts for OpenTelemetry components, including:
  - Frontend, backend, and telemetry services
  - OpenTelemetry Collector
  - Jaeger, Kafka, Prometheus exporters, etc.
- The use of official charts ensured that best practices were followed and configurations remained compatible with Kubernetes standards.
2.2.2 Deploying the Application Using Helm #
To isolate the Helm-based deployment from the manually deployed environment, a new namespace was created:
kubectl create namespace otel-helm-demo
- Then the OpenTelemetry demo was deployed using the Helm chart:
helm install otel-demo open-telemetry/opentelemetry-demo -n otel-helm-demo
- Verification steps included:
kubectl get pods -n otel-helm-demo
kubectl get service -n otel-helm-demo
- This confirmed that:
  - All Kubernetes resources (pods, services, deployments) were created correctly.
  - The Helm chart encapsulated all necessary microservices in a single, consistent deployment process.
2.2.3 Upgrade and Rollback #
To simulate real-world usage and verify Helm’s lifecycle management features, the team tested an upgrade and rollback scenario.
- Upgrade Scenario:
- The replica count of the frontend-proxy component was increased from the default to 3:
helm upgrade otel-demo open-telemetry/opentelemetry-demo \
-n otel-helm-demo \
--set components.frontend-proxy.replicas=3
- Verification:
kubectl get pods -n otel-helm-demo
helm history otel-demo -n otel-helm-demo
- This confirmed the increased number of frontend-proxy pods and a new revision entry in Helm’s release history.
- Rollback Scenario:
- To test rollback capability, the team reverted to the previous revision:
helm rollback otel-demo 1 -n otel-helm-demo
kubectl get pods -n otel-helm-demo
helm history otel-demo -n otel-helm-demo
kubectl describe deployment frontend-proxy -n otel-helm-demo
- This successfully returned the deployment to its initial configuration without any manual cleanup or reconfiguration.
2.3 Challenges and Solutions #
| Challenge | Resolution |
|---|---|
| Helm values misconfiguration (e.g., product-catalog, Grafana) | Fixed configMaps and applied corrections in the Helm values |
| Used incorrect override path: frontend-proxy.replicaCount | Corrected to components.frontend-proxy.replicas by consulting the chart’s documentation |
| Pods crashed post-upgrade due to missing configuration values | Reviewed and updated values.yaml structure, and used --set flags to apply overrides inline during upgrade |
Notes #
- Always verify Helm override paths against the chart’s structure and documentation.
- Use --dry-run and helm template to preview changes before applying them.
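Those preview commands can be combined with the upgrade from section 2.2.3; a minimal sketch using the release and chart names from the examples above:

```shell
# Server-side dry run of the exact upgrade before applying it.
helm upgrade otel-demo open-telemetry/opentelemetry-demo \
  -n otel-helm-demo \
  --set components.frontend-proxy.replicas=3 \
  --dry-run

# Render the chart locally to inspect the manifests Helm would generate.
helm template otel-demo open-telemetry/opentelemetry-demo \
  --set components.frontend-proxy.replicas=3
```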
2.4 Highlights #
- One-command installation with helm install.
- Replica scaling, upgrades, and rollback tested successfully.
- Reduced manual kubectl apply overhead.
Phase 3: Alerting Service and Notifications #
- After establishing deployment and observability foundations, Phase 3 focused on real-time alerting to detect application health issues, particularly around pod restarts. This phase introduced monitoring and alerting mechanisms using the Prometheus Stack, Alertmanager, and Kubernetes ConfigMaps, with email notifications configured via SMTP.
3.1 Objective #
- Implemented Prometheus Stack + Alertmanager with custom alert rules.
- Configured Gmail SMTP for real-time email alerts.
3.2 Implementation #
3.2.1 Deploying the Prometheus Stack with Helm #
- The monitoring stack included:
  - Prometheus for metrics collection
  - Alertmanager for sending alerts
  - kube-state-metrics for Kubernetes state data
- To deploy these components, the official Helm chart from the prometheus-community repository was used:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
- This deployed all necessary resources into a dedicated monitoring namespace, helping with logical separation and resource governance.
3.2.2 Creating the Alerting Rule for Pod Restarts #
- To detect frequent container restarts, a custom Prometheus alert rule was defined in a file named alerts.yaml:
groups:
  - name: pod-restarts
    rules:
      - alert: PodRestartTooHigh
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High restart count detected"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} restarted more than 3 times in the last 5 minutes."
- This rule triggers an alert when any container in a pod restarts more than 3 times within a 5-minute window, persisting for at least one minute.
- The alert rule was applied using a Kubernetes ConfigMap:
kubectl create configmap prometheus-alerts --from-file=alerts.yaml -n monitoring
- This ConfigMap could then be mounted into the Prometheus deployment via Helm values (if dynamic configuration reload was enabled).
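As an alternative to a separately mounted ConfigMap, the kube-prometheus-stack chart accepts extra rule groups directly through its Helm values; a minimal sketch (the `pod-restart-rules` map key is an arbitrary name):

```yaml
# values override for kube-prometheus-stack
additionalPrometheusRulesMap:
  pod-restart-rules:
    groups:
      - name: pod-restarts
        rules:
          - alert: PodRestartTooHigh
            expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
            for: 1m
            labels:
              severity: critical
```

Applied with `helm upgrade prometheus prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`, the operator reloads the rules without a separate mount.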
3.2.3 Configuring Alertmanager for Email Notifications #
- To route alerts via email, Alertmanager was configured to use Gmail’s SMTP service. The configuration was defined in alertmanager.yaml:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'sender@gmail.com'
  smtp_auth_username: 'sender@gmail.com'
  smtp_auth_password_file: '/etc/secrets/smtp_password'
route:
  receiver: 'Mail Alert'
  repeat_interval: 30s
  group_wait: 15s
  group_interval: 15s
receivers:
  - name: 'Mail Alert'
    email_configs:
      - to: 'receiver@gmail.com'
        headers:
          subject: 'Pod stuck in restart state'
This file was converted to a Kubernetes Secret:
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
- The password for Gmail SMTP was provided via a mounted file (smtp_password) for secure authentication.
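One way to provide that file is a dedicated Secret mounted into the Alertmanager pod; the secret and key names below are assumptions, and the mount itself is wired through the chart's Alertmanager values:

```shell
# Store the Gmail app password as a Secret; mounted, it appears
# at /etc/secrets/smtp_password as referenced in alertmanager.yaml.
kubectl create secret generic smtp-password \
  --from-literal=smtp_password='your-gmail-app-password' \
  -n monitoring
```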
3.2.4 Testing the Alert #
- To validate the alerting system, a crash-looping pod was manually created using:
kubectl run crashloop-demo --image=busybox --restart=Always -- /bin/sh -c "exit 1"
- This caused the pod to continuously restart, increasing the restart counter.
- Prometheus, using the alert rule defined earlier, detected this behavior. Once the increase() function’s threshold was crossed and persisted for one minute, an alert was triggered.
- The alert appeared in the Alertmanager UI under the configured route.
- An email notification was sent to the specified address with full metadata, including:
  - Alert name
  - Namespace
  - Affected pod
  - Restart count
  - Timestamps and severity
3.3 Highlights #
- Alert on pods restarting more than 3 times in 5 minutes.
- ConfigMaps + Secrets used for secure Prometheus/Alertmanager configuration.
- Tested with a CrashLoopBackOff pod; the alert fired successfully and the email notification was delivered.
3.4 Summary and Impact #
- The implementation of real-time alerting brought several operational benefits:
  - Early Detection: Crash-looping pods and other anomalies are flagged almost instantly.
  - Rapid Response: Email alerts reach stakeholders without requiring constant dashboard monitoring.
  - Production Readiness: The system now includes observability not only through dashboards, but through active notifications.
- This phase added an essential layer of resilience, helping the team respond to failures before they escalate into service outages.
Phase 4: CI/CD Integration with DevSecOps Enhancements #
- With the infrastructure, observability, and alerting systems in place, Phase 4 of the project focused on automating the software delivery pipeline using GitHub Actions. The goal was to implement a robust Continuous Integration and Continuous Deployment (CI/CD) system, bolstered by DevSecOps best practices such as automated vulnerability scanning, license checks, rollback mechanisms, and secure secret management.
4.1 CI/CD Pipeline Overview #
- Built GitHub Actions pipeline automating build → scan → push → deploy → monitor.
| Step | Description |
|---|---|
| Checkout Code | Pull the latest source code from GitHub |
| Configure AWS Credentials | Authenticate GitHub runner to access AWS using GitHub Secrets |
| Login to Amazon ECR | Use Docker CLI to log in to Elastic Container Registry |
| Set Environment Variables | Dynamically generate .env file with image tags and ECR URIs |
| Build Docker Images | Build all microservices using docker-compose |
| Push Images to ECR | Upload container images to AWS ECR |
| Install Trivy | Install Trivy CLI for vulnerability scanning |
| Scan Images | Run scans on each image and fail on HIGH/CRITICAL CVEs |
| Update kubeconfig | Authenticate kubectl to the target EKS cluster |
| Patch YAML Manifests | Automatically update Kubernetes manifests with new image tags |
| Commit Updated YAMLs | Push the updated manifests back to the GitHub repository |
| Deploy to EKS | Apply all manifests using kubectl apply |
| Deploy Monitoring Configs | Apply configurations for kube-state-metrics and alerting rules |
| Rollback on Failure | If any step fails, trigger kubectl rollout undo for all deployments |
- Integrated rollback mechanisms for production safety.
- Pipeline Steps: Checkout code → Build Docker images → Push to ECR → Scan with Trivy → License check with FOSSA → Security posture check with OSSF Scorecard → Deploy with kubectl apply → Rollback on failure.
4.2 Rollback Mechanism #
One of the key production-readiness features in this phase was automated rollback. The workflow used GitHub Actions’ if: failure() condition to trigger:
kubectl rollout undo deployment/<service-name> -n <namespace>
- This command restored each service to its previously stable replica set.
- A failure was simulated during testing by applying an invalid image tag, which correctly triggered the rollback behavior, ensuring that no broken deployments reached users.
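In a GitHub Actions workflow, that condition and command could look like the following sketch (the deployment and namespace names are illustrative, not taken from the project's actual workflow):

```yaml
- name: Rollback on failure
  if: failure()
  run: |
    kubectl rollout undo deployment/frontend -n otel-demo
    kubectl rollout undo deployment/frontend-proxy -n otel-demo
```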
4.3 Secret Management #
- Sensitive data such as AWS credentials, API keys, and cluster names were stored securely in GitHub Secrets, and accessed in the workflow using:
${{ secrets.<KEY_NAME> }}
- Examples of secrets used:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - EKS_CLUSTER_NAME
  - FOSSA_API_KEY
- This practice eliminated the need for storing plaintext credentials in code or configuration files, aligning with industry security best practices.
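A typical workflow step consuming these secrets might look like this sketch (the region value is an assumption):

```yaml
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1
```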
4.4 DevSecOps Integrations #
- Best security practices:
  - GitHub Secrets for AWS credentials and API keys.
  - OSSF Scorecard for repo security posture.
  - FOSSA & Trivy scans embedded in the pipeline.
- To ensure code quality, security, and license compliance, several DevSecOps tools were integrated directly into the CI pipeline:
4.4.1 FOSSA #
- Purpose: Scan for license violations and known open-source vulnerabilities.
- Integration: Triggered via the FOSSA GitHub Action.
- Outcome: Completed successfully with no issues detected.
4.4.2 Gradle Wrapper Validation #
- Purpose: Check that gradle-wrapper.jar and gradle-wrapper.properties are valid and not tampered with.
- Trigger: PR or push events.
- Outcome: Successfully validated using a test commit under the correct path.
4.4.3 OSSF Scorecard #
- Purpose: Assess the security posture of the GitHub repository.
- Features Checked: Branch protection, dependency update automation, token permissions, and more.
- Integration: Results uploaded to GitHub’s Code Scanning dashboard.
- Schedule: Triggered on push and weekly.
4.5 Challenges and Solutions #
| Challenge | Resolution |
|---|---|
| Inconsistent Docker image tagging | Used .env file with dynamic GitHub Actions variables to standardize tags |
| FOSSA action failed due to team misconfiguration | Removed team parameter and used auto-detection |
| Trivy scan failed due to bad image reference | Corrected the image tagging format |
| Gradle wrapper validation didn’t trigger | Created a dummy commit in the monitored path to validate integration |
| Kubernetes YAMLs not updated for each image | Used sed to auto-update image tags in all deployment files |
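The sed-based tag update from the table can be sketched as follows; the file path, registry URI, and tags are illustrative, not the project's actual values:

```shell
# Create a sample manifest to patch (illustrative content).
cat > /tmp/frontend-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  template:
    spec:
      containers:
        - name: frontend
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/frontend:old-tag
EOF

# Rewrite the image tag in place, as the pipeline does for each service.
NEW_TAG="sha-abc1234"
sed -i "s|\(image: .*frontend:\).*|\1${NEW_TAG}|" /tmp/frontend-deployment.yaml

grep 'image:' /tmp/frontend-deployment.yaml
```

After the rewrite, `kubectl apply` picks up the patched manifest, and committing the change back to the repository keeps Git as the source of truth for deployed tags.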
4.6 Execution Results #
- Artifacts and verifications from successful pipeline executions included:
  - GitHub Actions workflow logs showing successful build, scan, and deployment
  - Trivy scan logs showing no high/critical vulnerabilities
  - Confirmation of image push to ECR
  - Visual confirmation of rollback behavior (if triggered)
  - Updated deployment manifests committed to GitHub
  - Running pods confirmed via kubectl get pods
  - Live application access via ALB Ingress
Conclusion #
This project demonstrates a production-ready DevSecOps architecture on Amazon EKS, combining:
- Cost optimization with spot instances + minimal node scaling
- Security best practices with IAM, KMS, Secrets Manager, and IRSA
- Full observability with OpenTelemetry, Prometheus, Grafana, CloudWatch
- Automated CI/CD pipeline with vulnerability scanning and rollback
- Real-time alerting & notifications for rapid incident response