Negotiable Salary
Qode
New York, NY, USA
We are looking for a Cloud Infra Architect with AWS experience to lead the architecture and implementation of LaunchDarkly. The candidate must have prior experience leading such an implementation and rolling it out to application teams at large scale.

LaunchDarkly Feature Flags with Progressive Rollouts, A/B Testing with Zero Downtime & Full Automation

This document outlines the requirements for vendor partners supporting initiatives aimed at achieving zero downtime, reducing production incidents, improving change failure rate metrics, and enabling full automation. The scope includes feature flags, A/B testing, progressive rollout, and support for deployment patterns across APIs, EKS, OnPrem, Lambdas, and other AWS services.

Functional Scope

- Zero Downtime Deployments: Implement blue/green or canary deployment models with seamless traffic switching, rollback capability, and session persistence.
- Change Failure Rate Reduction: Integrate root cause tracking, automated rollback, and pre-deployment validation pipelines.
- Feature Flags: Enable real-time toggling, secure access control, and auditability. Must support both server-side and client-side toggles.
- A/B and Blue/Green (B/G) Testing: Support traffic segmentation, real-time metrics, rollback, and privacy compliance.
- Progressive Rollouts: Automate staged rollouts by region, user cohort, or environment. Include rollback triggers based on metrics.
- Automation & CI/CD: Full GitHub Actions integration, dynamic runners, and golden path patterns for EKS, Lambda, and OnPrem.
- Environment Patterns: Support for APIs, EKS, OnPrem, Lambdas, Kafka, Glue, RDS, S3, and other AWS services.
- Observability & Metrics: Integrate with Grafana and Splunk, and report DORA metrics (lead time, deployment frequency, change failure rate, MTTR).
- Self-Service Enablement & Onboarding/Migration Support for Feature Flags: Empower teams with Express Lane-style pipelines, role-based access, and audit trails.
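As an illustration of the Progressive Rollouts requirement above (staged rollout with rollback triggers based on metrics), the control loop might look roughly like the following Python sketch. It is not tied to LaunchDarkly or any vendor SDK: the function `progressive_rollout`, the `error_rate_at` callback, and the threshold value are hypothetical names and numbers chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RolloutResult:
    final_percent: int                 # traffic share left on the new version
    rolled_back: bool                  # True if a metric breach triggered rollback
    history: List[Tuple[int, float]]   # (stage percent, observed error rate)


def progressive_rollout(
    stages: List[int],
    error_rate_at: Callable[[int], float],  # hypothetical metrics lookup
    max_error_rate: float,
) -> RolloutResult:
    """Advance through rollout stages; roll back to 0% on a metric breach."""
    history: List[Tuple[int, float]] = []
    for percent in stages:
        rate = error_rate_at(percent)
        history.append((percent, rate))
        if rate > max_error_rate:
            # Rollback trigger: stop advancing and return all traffic
            # to the previous version.
            return RolloutResult(0, True, history)
    final = stages[-1] if stages else 0
    return RolloutResult(final, False, history)


# Example: one rollout that stays healthy and one that degrades at 50%.
healthy = progressive_rollout([5, 25, 50, 100], lambda p: 0.001, 0.01)
failing = progressive_rollout(
    [5, 25, 50, 100], lambda p: 0.05 if p >= 50 else 0.001, 0.01
)
```

In practice the error-rate callback would presumably be backed by the observability stack named in the scope (for example Grafana or Splunk queries), and stages could be keyed by region or user cohort rather than a plain percentage.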
Expected Outcomes

- Pilot with at least 5 teams by Nov 2025.
- Enterprise adoption readiness by Nov, with at least 5 patterns inclusive of Cloud and OnPrem.
- 99.9%+ availability during deployments.
- 99%+ reduction in change failure rate.
- Full automation of provisioning, testing, and deployment pipelines.
- Full automation and governance for end-to-end (E2E) feature flag lifecycle management.

Non-Functional Requirements (NFRs)

- Performance: Low-latency toggling, fast rollback.
- Security: Secure artifact storage, RBAC, audit logging, vulnerability scanning.
- Scalability: Support for multi-region, multi-tenant deployments; dynamic scaling of runners.
- Resilience: Chaos testing, fault injection, recovery time objectives (RTOs).
- Compliance: Tagging enforcement, cost visibility, and privacy compliance for A/B testing, blue/green deployments, and feature flags.