
Migration at Scale: Moving Marketing Cloud Caching from Memcached to Redis at 1.5M RPS Without Downtime

By Paladi Sandhya Madhuri · Salesforce Engineering Blog · Advanced · Developer · 16 min read
Summary

This technical case study details how Salesforce’s Marketing Cloud team migrated its core caching infrastructure from Memcached to Redis Cluster while serving 1.5 million requests per second, without downtime. The migration preserved application behavior and performance, and it improved high availability and security by using Redis features such as primary-replica replication and sharding. Key challenges included cutting over live traffic with zero downtime, managing hot-key performance bottlenecks, and ensuring functional parity despite differences in TTL and key handling. The case study offers strategies for migrating a cache layer under heavy production load while keeping latency stable and cache hit rates consistent.
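One concrete instance of the TTL differences mentioned above: Memcached treats an expiration value larger than 30 days as an absolute Unix timestamp, while the Redis `EXPIRE` command takes a relative number of seconds. The following is a minimal sketch of how a compatibility layer might translate one into the other. The function name `memcached_ttl_to_redis` and its return conventions are assumptions made for illustration, not the team's actual implementation.

```python
import time

# Memcached interprets expiration values above 30 days as absolute Unix time.
MEMCACHED_RELATIVE_TTL_LIMIT = 60 * 60 * 24 * 30  # 2,592,000 seconds


def memcached_ttl_to_redis(exptime, now=None):
    """Translate a Memcached exptime into a relative TTL in seconds.

    Hypothetical helper. Returns None when the item should never expire,
    0 when it is already expired, and a positive number of seconds otherwise.
    """
    if now is None:
        now = int(time.time())
    if exptime == 0:
        return None  # Memcached: 0 means the item has no expiry
    if exptime < 0:
        return 0  # Memcached: negative values expire the item immediately
    if exptime <= MEMCACHED_RELATIVE_TTL_LIMIT:
        return exptime  # already a relative offset in seconds
    # Above the limit the value is an absolute timestamp; convert to an offset.
    return max(exptime - now, 0)
```

A caller would then store the key without an expiry when the result is `None`, skip the write or delete the key when it is `0`, and otherwise pass the value through as the relative TTL.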

Takeaways
  • Use a Dynamic Cache Router with percentage-based traffic routing for live cutover without downtime.
  • Implement a compatibility layer to handle TTL and key format differences, ensuring behavioral parity.
  • Monitor and mitigate hot keys using probabilistic models and hybrid caching patterns.
  • Group services by cache key ownership to avoid split-brain and stale reads during migration.
  • Validate cache performance with production-faithful load and soak testing, including tail-latency metrics.
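The first and fourth takeaways can be combined in a small sketch of percentage-based routing. The class name `PercentageCacheRouter`, the use of CRC32 for bucketing, and the choice to route by ownership group rather than by individual request are all assumptions made for illustration. The source does not describe how the Dynamic Cache Router is implemented.

```python
import zlib


class PercentageCacheRouter:
    """Hypothetical router that sends a configurable share of groups to Redis."""

    def __init__(self, redis_percent=0):
        self.set_redis_percent(redis_percent)

    def set_redis_percent(self, percent):
        # The rollout percentage can be changed at runtime, without a deploy.
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.redis_percent = percent

    @staticmethod
    def bucket(group):
        # Stable bucket in [0, 100): the same group always maps to the same slot.
        return zlib.crc32(group.encode("utf-8")) % 100

    def backend_for(self, group):
        # Groups whose bucket falls below the threshold are served by Redis.
        if self.bucket(group) < self.redis_percent:
            return "redis"
        return "memcached"
```

Because each group's bucket is fixed, raising the percentage only ever moves more groups to Redis and never sends a migrated group back to Memcached. Routing by group rather than by request keeps every service that shares a set of keys on the same backend, which is one way to avoid the split-brain and stale reads the takeaways warn about.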

By Paladi Sandhya Madhuri, Rakesh Chhabra, Piyush Pruthi, Sumit Sahrawat, Ankit Jain, and Basaveshwar Hiremath.

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today’s edition features Paladi Sandhya Madhuri, a Senior Software Engineer on the Marketing Cloud Caching team. Her work involves evolving the platform’s core caching infrastructure to support high-volume, latency-sensitive workloads, including a live migration handling approximately 1.5 million cache events per second across more than 50 applications.

Explore how the team executed a zero-downtime migration under live production traffic. They preserved application behavior while changing the underlying cache engine, managed hot-key pressure in Redis at scale, and validated stable performance and reliability by sustaining end-to-end P50 latency near 1 millisecond and P99 latency around 20 milliseconds throughout the transition.

Marketing Cloud · Architecture