Earlier this year, Starfish Storage and WEKA set out to push the boundaries of file system scanning performance in a lab environment. By significantly increasing the scanner thread count and targeting WEKA’s high-performance file system, the goal was to measure how fast metadata operations—specifically lstat() calls—could be executed at scale.
While the exercise wasn’t a full-blown scientific benchmark (we didn’t vary hardware resources to find optimal configurations), the results still speak volumes.
1.6 billion files scanned per hour in the lab
During our tests, the WEKA-Starfish combination achieved an average of 450,000 lstat() operations per second, equivalent to about 1.6 billion scanned files per hour. That means a file system containing 1.6 billion objects could be fully scanned for changes in about 60 minutes.
For modern environments with file systems in the 5–6 billion object range, this throughput would allow complete scans roughly 6 to 8 times per day, a pace previously considered impractical in large-scale environments. In other words, the Starfish Universal Data Catalog (UDC) would be current to within about 3 to 4 hours. Extrapolating further, a mammoth file system containing 35 billion files could be scanned in under 24 hours.
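The arithmetic behind these figures can be checked with a short sketch. It assumes one lstat() call per file and a constant scan rate, which is a simplification; the function names are illustrative and not part of any Starfish or WEKA tool.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24

LAB_RATE = 450_000  # average lstat() calls per second reported for the lab test


def files_per_hour(lstat_per_second):
    """Files scanned per hour, assuming one lstat() call per file."""
    return lstat_per_second * SECONDS_PER_HOUR


def hours_to_scan(file_count, lstat_per_second):
    """Hours needed for one full scan at a constant rate."""
    return file_count / files_per_hour(lstat_per_second)


def scans_per_day(file_count, lstat_per_second):
    """How many back-to-back full scans fit into 24 hours."""
    return HOURS_PER_DAY / hours_to_scan(file_count, lstat_per_second)


if __name__ == "__main__":
    print(f"{files_per_hour(LAB_RATE) / 1e9:.2f} billion files per hour")
    for count in (1.6e9, 5e9, 6e9, 35e9):
        hours = hours_to_scan(count, LAB_RATE)
        per_day = scans_per_day(count, LAB_RATE)
        print(f"{count / 1e9:>5.1f}B files: {hours:5.1f} h per scan, "
              f"{per_day:4.1f} scans per day")
```

At the lab rate this works out to about 1.62 billion files per hour, roughly 3.1 hours for 5 billion files, and roughly 21.6 hours for 35 billion files.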
About 3.25 billion files scanned per hour in production
Then came the real proof point: A leading semiconductor manufacturer deployed Starfish on a particularly large WEKA file system in a live production environment and shattered our lab benchmark.
Their scanning throughput hit 903,668 lstat() calls per second, roughly double the lab rate. At that rate, a 3+ PB file system with billions of files can be scanned in just one hour, making high-resolution file system observability at scale not just possible but practical.
Why the WEKA + Starfish combination performs so fast
WEKA is known for its blazing-fast throughput and low latency, but what’s less well known is its exceptional metadata performance. This makes it a perfect match for Starfish, which is architected to scan as fast as the underlying storage will allow. Starfish can dial up multiple threads across multiple scan servers and leverage NFS nConnect where appropriate.
Unlike many file systems that suffer significant performance hits during scans (disrupting listings, deletes, and file creation), WEKA handles Starfish scans with virtually no impact on production workloads. The result is a system capable of extremely high-speed, non-disruptive metadata extraction.
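To illustrate the general idea of spreading lstat() calls across threads, here is a minimal sketch using only the Python standard library. It is not Starfish's implementation: the function names, the default thread count, and the choice to record only size and modification time are all assumptions made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def lstat_entry(path):
    """Return (path, size, mtime_ns), or None if the entry cannot be read."""
    try:
        st = os.lstat(path)  # does not follow symlinks
    except OSError:
        return None  # entry vanished or is inaccessible; skip it
    return path, st.st_size, st.st_mtime_ns


def iter_paths(root):
    """Yield every directory and file path below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def scan(root, threads=16):
    """Collect lstat() metadata for a tree using a pool of worker threads.

    The thread count is a hypothetical tuning knob, not a recommended value.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lstat_entry, iter_paths(root))
        return [entry for entry in results if entry is not None]
```

Calling `scan("/some/mount", threads=32)` returns a list of `(path, size, mtime_ns)` tuples. This sketch holds all results in memory and walks directories on a single thread, so it shows the technique only; a scanner built for billions of files would need to stream results and parallelize directory traversal as well.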
Where does scanning speed matter?
There are a number of scenarios where scanning speed will make a difference in your environment:
- Malware Detection: Faster scans mean quicker detection of anomalous changes, allowing near real-time response and damage containment.
- Workflow Automation: In environments driven by automated pipelines, fast scans reduce latency between event and action.
- Backup and Replication: Change detection speed determines how aggressively you can set your RPOs. Traditional tools like rsync or legacy backup software can’t keep pace with billions of files; Starfish can.
- Archiving, ROT Cleanup, Chargeback, and Curation: Even for “slower” use cases like archiving or compliance, Starfish on WEKA can scan and catalog 35+ billion files in a single day, offering unmatched scale and flexibility.
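Several of these use cases come down to detecting what changed between two scans. The sketch below compares two scan snapshots, assuming each one is a mapping from path to a `(size, mtime_ns)` pair. That snapshot shape and the function name are hypothetical; this is not the Starfish catalog format.

```python
def diff_snapshots(old, new):
    """Compare two scan snapshots and return (added, removed, changed) paths.

    Each snapshot maps path -> (size, mtime_ns). A path counts as changed
    when it exists in both snapshots but its recorded metadata differs.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(
        path for path in old.keys() & new.keys() if old[path] != new[path]
    )
    return added, removed, changed


previous = {"/data/a": (10, 100), "/data/b": (20, 200), "/data/c": (30, 300)}
current = {"/data/a": (10, 100), "/data/b": (25, 250), "/data/d": (40, 400)}

added, removed, changed = diff_snapshots(previous, current)
```

Here `/data/d` is reported as added, `/data/c` as removed, and `/data/b` as changed. The faster a full scan completes, the shorter the interval between snapshots, and the sooner a difference like this can trigger a backup, an alert, or a workflow step.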
Faster scanning means better data management decisions
File system scanning is often the overlooked bottleneck in modern data management pipelines. As data volumes and file counts continue to explode, the ability to scan and act on changed data in near real-time becomes critical.
Together, Starfish and WEKA are not just keeping pace—they’re helping customers redefine what’s possible.
WEKA and Starfish will continue to scale
We’re continuing to work closely with WEKA to meet the scanning needs of tomorrow’s data-intensive environments. Expect more milestones in the near future. Until then, rest assured: no matter how large your file system grows, Starfish and WEKA give you the visibility to drive actions within minutes.