
The Pain of Distributed Cache: Ideal vs. Reality

· 6 min read

After six months of our open-source journey and conversations with many users, we now have a clear picture of the pros and cons of the distributed cache model. Here is a brief reflection on the limits of its application scenarios.

In today's booming era of big data and artificial intelligence, "storage-compute separation" has become the mainstream paradigm of cloud-native data architecture. Computing resources scale elastically, while data is consolidated in low-cost object storage (such as S3 or OSS). However, this architecture introduces a critical pain point: the high latency and low throughput of object storage severely drag down computing performance. The distributed cache layer emerged to fill this gap; it was hoped to become a high-speed bridge between "elastic computing" and "cheap storage."

Distributed file cache systems can "transparently accelerate" access to remote storage and provide a unified namespace. However, when enterprises eagerly introduce them into production environments, they often run into underwhelming performance, operational complexity, and semantic mismatches. This article analyzes in depth the gap between the technical ideal of distributed cache and the reality of deploying it, revealing its structural limitations in general-purpose scenarios.

I. The Ideal of Distributed Cache: Unified, Transparent, High-Performance

The original design intent was highly attractive:

  • Unified Namespace: Mount heterogeneous storage such as HDFS, S3, and GCS into a single directory tree; applications only need to access xx://;
  • Transparent Cache: After the first read of remote data, it is automatically cached to memory/SSD, and subsequent reads respond in milliseconds;
  • Ecosystem Compatibility: Seamlessly integrate with mainstream computing engines such as Spark, Presto, and PyTorch, with no code changes;
  • Tiered Storage: Support memory → SSD → HDD multi-level caching, balancing performance and cost.

In demonstration environments, distributed cache can indeed significantly improve access performance over object storage, especially in scenarios where model training repeatedly reads the same input data.
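To make the "transparent cache" promise concrete, here is a minimal sketch of what the application side looks like. The mount point and file path are hypothetical, and the timings depend entirely on the deployment:

```python
import time

# Hypothetical FUSE mount of a distributed cache backed by object storage,
# e.g. /mnt/dcache <- s3://bucket/train. Paths below are illustrative.
PATH = "/mnt/dcache/train/shard-0001.tfrecord"

def read_all(path: str) -> int:
    """Plain POSIX read; the cache layer is invisible to the application."""
    with open(path, "rb") as f:
        return len(f.read())

t0 = time.perf_counter()
size = read_all(PATH)            # cold read: fetched from S3, then cached
cold = time.perf_counter() - t0

t0 = time.perf_counter()
read_all(PATH)                   # warm read: served from memory/SSD
warm = time.perf_counter() - t0

print(f"{size} bytes: cold {cold:.2f}s -> warm {warm:.3f}s")
```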

II. Reality's Pain: Three Structural Defects

However, while the ideal is appealing, the reality is harsh. Distributed cache exposes three unavoidable defects in real business scenarios.

Pain 1: Incomplete POSIX Semantics, Limited Versatility

Distributed cache provides a POSIX-like interface through FUSE, enabling traditional applications to read remote data as if it were local files. However, its support for POSIX semantics is highly incomplete:

  • โŒ No random write support: Cannot modify bytes in the middle of a file, only allows creating new files or full overwrites;
  • โŒ No truncate, hard links, or file locks;
  • โš ๏ธ Strong consistency missing: Multiple clients may read expired cache, requiring manual metadata refresh.

This means distributed cache simply cannot run databases, log systems, or any program requiring in-place updates. It is essentially designed for WORM (Write Once, Read Many) workloads, not as a general-purpose file system. Many teams attempt to "seamlessly migrate" existing workloads to distributed cache, only to watch their applications crash on failed write operations.
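A minimal sketch of this failure mode, assuming a hypothetical cache mount at /mnt/dcache; the exact errno varies by implementation:

```python
import errno
import os

PATH = "/mnt/dcache/dataset/part-00000.parquet"  # illustrative path

# Creating a new file with a sequential write usually works: the cache
# uploads the whole object to the underlying storage on close (WORM model).
with open(PATH + ".new", "wb") as f:
    f.write(b"full rewrite is fine")

# Patching bytes in the middle of an existing file is what WORM-oriented
# caches reject.
try:
    fd = os.open(PATH, os.O_WRONLY)
    os.lseek(fd, 128, os.SEEK_SET)   # seek into the middle of the file...
    os.write(fd, b"patch")           # ...and attempt an in-place update
    os.close(fd)
except OSError as e:
    print(f"random write rejected: {errno.errorcode.get(e.errno, e.errno)}")
```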

Truth: Distributed cache is not a "distributed POSIX file system," but a "data orchestration layer optimized for batch processing."

Pain 2: High Resource Consumption

Distributed cache systems implemented in Java or Go generally suffer from high resource consumption.

For example, Java processes often occupy tens of GB of memory; for a system whose purpose is to use memory as cache, that overhead is largely wasted capacity.
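To see how much a node actually gives up, one can sum the resident memory of the cache service's processes. A rough sketch using the third-party psutil package; the "worker" name filter is an assumption to adapt to the actual service:

```python
import psutil  # third-party: pip install psutil

# Sum the resident set size (RSS) of cache worker processes on this node.
# The "worker" name filter is illustrative; match your actual process name.
total_rss = sum(
    p.info["memory_info"].rss
    for p in psutil.process_iter(["name", "memory_info"])
    if "worker" in (p.info["name"] or "").lower()
)
print(f"cache worker RSS on this node: {total_rss / 2**30:.1f} GiB")
```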

Pain 3: Operational Complexity, ROI Hard to Deliver

Deploying a distributed cache involves multiple components (Master, Worker, Journal, UFS connectors, etc.), and resource tuning (memory allocation, cache policies, network configuration) is extremely complex. Worse still:

  • Cache hit rate depends on data access patterns: If jobs are one-time scans (such as ETL), caching has no value;
  • Resource competition: Memory/SSD occupied by Workers competes with Spark Executors for node resources;
  • Troubleshooting difficulties: Issues like cache inconsistency, block loss, and UFS synchronization failures require deep source code analysis to locate.

Many teams invest months in building and tuning cache clusters, only to find limited performance improvement and a doubled operational burden, forcing them to abandon the effort.
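The first point above can be made concrete with a back-of-envelope latency model; all the numbers below are illustrative assumptions, not measurements:

```python
# Expected read latency through a cache layer:
#   avg = h * t_cache + (1 - h) * (t_remote + t_miss)
t_remote = 40.0  # ms, direct object-storage read (assumed)
t_cache = 1.0    # ms, local memory/SSD cache hit (assumed)
t_miss = 5.0     # ms, extra cost a miss pays for the cache detour (assumed)

def avg_latency(h: float) -> float:
    return h * t_cache + (1 - h) * (t_remote + t_miss)

for h in (0.0, 0.5, 0.9, 0.99):
    print(f"hit rate {h:4.0%}: {avg_latency(h):6.2f} ms  (direct: {t_remote:.1f} ms)")
```

At a hit rate of zero (a one-time ETL scan), every read pays the miss overhead on top of the remote read, so the cache makes things strictly worse; it only pays off once data reuse pushes the hit rate up.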

III. Reflection: Can Distributed Cache Replace File Systems?

The dilemma of distributed cache reflects a deeper issue: attempting to use a general-purpose intermediate layer to solve all I/O problems is itself a form of technical dogmatism.

Distributed cache delivers significant benefits in large-scale data I/O scenarios such as big data and AI training. For a company, however, purchasing or deploying a distributed cache cluster does not make it universally applicable across scenarios, nor does it fully exploit the cluster's capabilities.

Trend: "Specialized beats generalized." Rather than maintaining a heavyweight distributed cluster, it is better to build a more versatile tiered file system with a cache acceleration layer in the middle, providing broader support with genuine file system semantics.

IV. The Way Out: Rational Choice, Scenario-Driven

Distributed cache is not without merit. It still has value in the following scenarios:

  • ✅ Hybrid cloud/multi-cloud architecture: Unified access to object storage from different cloud providers;
  • ✅ High-reuse read-only datasets: Such as benchmark datasets read repeatedly in AI training;
  • ✅ A dedicated platform team: One that can bear the operational and tuning costs.

However, for most enterprises, a more pragmatic path is:

  1. First evaluate whether I/O is truly the bottleneck: Confirm through profiling (see the sketch after this list);
  2. Prioritize optimizing data formats and query logic: Use Iceberg/Lance instead of raw files;
  3. Avoid "using cache for the sake of using cache": Caching is a means, not an end;
  4. Build general-purpose file system capabilities: Give the cache layer full file system semantics so it can serve a broad range of workloads.
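For step 1, even a crude timing split between input I/O and compute is often enough to decide. A minimal sketch, with placeholder paths and a stand-in compute function:

```python
import time

def read_batch(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()

def compute_step(data: bytes) -> int:
    return sum(data[::4096])  # stand-in for the real computation

io_time = cpu_time = 0.0
for path in ("/data/part-0000", "/data/part-0001"):  # placeholder inputs
    t0 = time.perf_counter()
    data = read_batch(path)
    t1 = time.perf_counter()
    compute_step(data)
    t2 = time.perf_counter()
    io_time += t1 - t0
    cpu_time += t2 - t1

share = io_time / (io_time + cpu_time)
print(f"I/O {io_time:.2f}s vs compute {cpu_time:.2f}s ({share:.0%} of step time in I/O)")
```

If the I/O share is small, no cache layer will meaningfully speed up the job, and the investment is better spent on step 2.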

Conclusion

Distributed cache is a phased technical experiment that has promoted the development of data orchestration concepts. However, its "pain" also warns us: there is no silver bullet, only trade-offs. On the road to pursuing high performance, blindly introducing general-purpose middleware often backfires. True engineering wisdom lies in understanding the essence of business and choosing the most matching tool, even if it's not "cool" enough.

Distributed cache should not be a standard configuration of architecture, but rather a precise scalpel for specific scenarios. Only by building more general, lightweight, and efficient tiered file systems can we avoid falling into the "cache pain" and let data truly flow, rather than being trapped in layers of abstraction.

