High-Performance Computing (HPC) environments often face a chronic problem: scratch storage misuse. This was the challenge facing Arizona State University’s Research Technology Office (RTO). Researchers were using the scratch storage on their 4-petabyte (PB) Sol supercomputer, designed for temporary, high-speed work, as a permanent archive, and the system was running out of space. Scratch storage is essential to HPC: it provides high-speed storage optimized for maximum input/output (I/O) performance during job execution, and if it fills up, jobs fail.
Scratch storage suffered from chronic instability
This fundamental misuse led to chronic filesystem instability. The system repeatedly reached 99% capacity, causing crashes that stalled the workflows of thousands of active researchers. Senior HPC Systems Analyst Josh Burks noted, “The scratch system kept filling, and we were stuck begging users to clean up their data.” The traditional remedy, emails containing long lists of untouched files to delete, was often ignored because the messages were too detailed or were flagged as spam. Something had to be done to return the system to a functional state.
An existing data management tool, Starfish, provided the answer
ASU found the solution by turning to Starfish, which they were already using in their environment for data reporting and management. In addition to identifying aging data at the file level (which can yield space savings by archiving or deleting unused data), Starfish can also easily identify entire directories whose files haven’t been accessed, modified, or changed. ASU simplified reporting to researchers by listing only the aging directories rather than every file, leaving them with far less information to review and making the decision of what to remove or archive less overwhelming. ASU then set Starfish up to categorize scratch directories into three time-based buckets, each with its own follow-up action (a minimal sketch of the classification logic follows the list):
- Phase 1 (45-75 Days): Initial alert and advisory.
- Phase 2 (76-89 Days): Heightened critical notification.
- Phase 3 (90+ Days): Automatic archive tag, triggering automated archival of the directory.
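Conceptually, the classification is straightforward: find the most recent access time anywhere under each top-level scratch directory and bucket it by age. The Python sketch below illustrates the idea with placeholder paths; ASU’s production workflow runs inside Starfish itself, not as a standalone script.

```python
import os
import time
from pathlib import Path

SCRATCH_ROOT = Path("/scratch")  # hypothetical scratch mount point
DAY = 86400  # seconds per day

def newest_atime(directory: Path) -> float:
    """Return the most recent access time of any file under directory."""
    newest = directory.stat().st_atime
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                newest = max(newest, (Path(root) / name).stat().st_atime)
            except OSError:
                continue  # file vanished mid-scan; ignore it
    return newest

def classify(directory: Path, now: float) -> str:
    """Bucket a directory into one of the three aging phases."""
    age_days = (now - newest_atime(directory)) / DAY
    if age_days >= 90:
        return "phase3-archive"   # archive tag: automated archival
    if age_days >= 76:
        return "phase2-critical"  # heightened critical notification
    if age_days >= 45:
        return "phase1-advisory"  # initial alert and advisory
    return "active"

now = time.time()
for d in sorted(p for p in SCRATCH_ROOT.iterdir() if p.is_dir()):
    print(f"{classify(d, now):16} {d}")
```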
Starfish was also used to generate regular user-specific reports with each researcher’s Principal Investigator (PI) copied, reinforcing accountability for data management.
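The report itself can be as simple as a per-user summary of aging directories with the PI on copy. Here is a hypothetical sketch using Python’s standard-library email tooling; the addresses, SMTP relay, and directory list are placeholders, and ASU’s actual reports come from Starfish’s reporting engine.

```python
import smtplib
from email.message import EmailMessage

def send_aging_report(user_email: str, pi_email: str,
                      aging_dirs: list[tuple[str, int]]) -> None:
    """Email a user their aging scratch directories, cc'ing their PI."""
    lines = [f"{days:>4} days idle  {path}" for path, days in aging_dirs]
    msg = EmailMessage()
    msg["Subject"] = "Scratch cleanup: directories approaching archival"
    msg["From"] = "hpc-admin@example.edu"  # placeholder sender
    msg["To"] = user_email
    msg["Cc"] = pi_email                   # PI copied for accountability
    msg.set_content(
        "The following scratch directories have not been accessed recently\n"
        "and will be archived after 90 days of inactivity:\n\n"
        + "\n".join(lines)
    )
    with smtplib.SMTP("smtp.example.edu") as smtp:  # placeholder relay
        smtp.send_message(msg)

# Example call (requires a real SMTP relay):
# send_aging_report("researcher@example.edu", "pi@example.edu",
#                   [("/scratch/researcher/old_runs", 62)])
```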
The solution built trust with researchers
To avoid losing important data that researchers hadn’t yet reviewed, ASU adopted a policy of “archive first, delete later.” Under this policy, data flagged for automated archive is first migrated to a temporary archive location. Researchers are then given clear, non-disruptive pathways to move their data to persistent, paid storage (the Horizon and Canyon storage systems) for long-term retention. This “soft approach” builds trust: users always know where their data is and know it hasn’t been deleted out from under them. Using Starfish Zones, users can also restore their own data if needed. Both Horizon and Canyon are integrated with ASU’s storage chargeback system, so the solution has also increased chargeback revenue for the department.
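“Archive first, delete later” amounts to relocating a flagged directory to a staging area while preserving its scratch-relative path, so that restoring it is just the reverse move. Below is a minimal illustration of that idea with placeholder paths; in practice ASU drives this through Starfish tagging and Zones rather than plain filesystem moves.

```python
import shutil
from pathlib import Path

SCRATCH_ROOT = Path("/scratch")          # placeholder paths
ARCHIVE_ROOT = Path("/archive/staging")

def archive_directory(scratch_dir: Path) -> Path:
    """Move a flagged directory to staging, mirroring its scratch-relative path."""
    dest = ARCHIVE_ROOT / scratch_dir.relative_to(SCRATCH_ROOT)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(scratch_dir), str(dest))
    return dest

def restore_directory(archived_dir: Path) -> Path:
    """Reverse the move: put an archived directory back under scratch."""
    dest = SCRATCH_ROOT / archived_dir.relative_to(ARCHIVE_ROOT)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(archived_dir), str(dest))
    return dest
```

Because the relative path is preserved, users always know where their data went, and restoration is a single reverse operation.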
Scratch storage is fast again
The impact of implementing the Starfish-powered workflow has been transformative:
- 3 Petabytes of Data Freed Up in the First Few Months: Early in the clean-up of scratch storage, ASU archived or moved approximately 3 PB of data.
- System Stability: With the new automated archive solution in place, the scratch file system is now stable, averaging a healthy 70% capacity utilization, thus eliminating previous system slowdowns.
- New Revenue: Increased adoption of paid storage options created new chargeback revenue streams.
This new process has also driven a vital cultural shift. Faculty and students now understand and follow the policy that scratch is temporary storage, helping them become better data citizens. As Burks notes, “The warnings, the archive step, and the ability to recover data if needed made all the difference.”
ASU’s success provides a sustainable, repeatable blueprint for any research computing facility facing similar challenges.

For a copy of the full story on how Starfish helps leading organizations solve their toughest data management challenges, visit the Starfish resources page.
Big thanks to Senior HPC Systems Analyst Josh Burks at the Arizona State University Research Technology Office for sharing this story!
