At Clemson University, research is always evolving, and so is the need for secure, efficient data management. Becky Ligon, Storage Systems Architect for Clemson’s Research Computing and Data Infrastructure (RCDI) team, recently shared how Starfish Storage helps protect and organize data across Indigo, Clemson’s primary research data lake, and Palmetto 2, its high-performance computing (HPC) system. She also explained how rapidly changing data governance and compliance requirements underscore Starfish’s central role in the team’s data security strategy.
How do you locate and manage sensitive data quickly when you have 2.5 PB?
For over three years, RCDI has used Starfish to safeguard and manage data for researchers, faculty, and students across disciplines, from AI and genetics to biology and public health. With data this diverse, security and compliance requirements can change rapidly and often. Becky recently described how this affects her and her team:
“Security requirements can change. We have a security team who flag files that might be affected by changing governance or compliance requirements. When they identify files whose requirements have changed, they need to know every location the file might exist on our system. And, more importantly, that data can be stored on a variety of systems.”
That’s when Becky turns to Starfish.
Instead of spending days manually searching Palmetto’s 2.5 petabytes of storage with Unix find commands, Becky uses Starfish’s indexing and search capabilities to quickly scan user directories, scratch space, and backed-up project storage.
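To see why the manual approach is slow, consider a minimal Python sketch of the "find every copy" task done by walking the file system. It is an illustration only, not how Starfish or RCDI performs the search. The function names and the choice of SHA-256 content matching are assumptions made for this example.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_copies(flagged: str, roots: list[str]) -> list[str]:
    """Return every path under the given roots whose content matches `flagged`."""
    target_size = os.path.getsize(flagged)
    target_hash = sha256_of(flagged)
    matches = []
    for root in roots:
        # Every directory under every root has to be visited on each search.
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                candidate = os.path.join(dirpath, name)
                try:
                    if os.path.getsize(candidate) != target_size:
                        continue  # cheap size check before reading the file
                    if sha256_of(candidate) == target_hash:
                        matches.append(candidate)
                except OSError:
                    continue  # unreadable or vanished file; skip it
    return sorted(matches)
```

Each search repeats the full walk across home directories, scratch, and project space, which is the cost a prebuilt index avoids.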
The Starfish Universal Data Catalog (UDC) includes a Postgres database and maintains a comprehensive index of files across all file systems, along with a record of how they’ve changed over time. Starfish collects various types of metadata, including:
- File system metadata: Standard file attributes within the storage systems.
- File tags: User-defined labels or categories applied to individual files.
- Directory tags: Labels or categories applied to directories, which can be set explicitly or inherited by files within a directory.
- Key-value pairs: More detailed, customizable descriptions associated with files.
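The four metadata types above can be pictured with a small in-memory model. The Python sketch below is a toy illustration, not Starfish’s schema or API. The class names, fields, and inheritance rule (a file picks up the tags of every ancestor directory) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class FileEntry:
    """One indexed file: file system attributes plus user-supplied metadata."""
    path: str
    size_bytes: int
    owner: str
    tags: set[str] = field(default_factory=set)               # file tags
    attributes: dict[str, str] = field(default_factory=dict)  # key-value pairs


class Catalog:
    """Toy in-memory index; a real catalog would live in a database."""

    def __init__(self) -> None:
        self.files: dict[str, FileEntry] = {}
        self.dir_tags: dict[str, set[str]] = {}

    def add_file(self, entry: FileEntry) -> None:
        self.files[entry.path] = entry

    def tag_directory(self, directory: str, tag: str) -> None:
        self.dir_tags.setdefault(directory, set()).add(tag)

    def effective_tags(self, path: str) -> set[str]:
        """A file's own tags plus tags inherited from every ancestor directory."""
        tags = set(self.files[path].tags)
        for parent in PurePosixPath(path).parents:
            tags |= self.dir_tags.get(str(parent), set())
        return tags

    def find_by_tag(self, tag: str) -> list[str]:
        """Answer a tag query from the index, without touching the file system."""
        return sorted(p for p in self.files if tag in self.effective_tags(p))
```

In this model, tagging a single project directory as restricted makes every file beneath it show up in a `find_by_tag` query, regardless of which storage area the file sits in.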
Managers can now search across ALL files
This metadata helps Starfish track crucial information such as the origin date of each file, the type of data it contains, and its owner. “We don’t know where the files might be,” explains Becky. “They could be anywhere: home directories, project space, even in backup. Starfish lets us search across all of it quickly. Without it, we’d have to do everything the old-fashioned way.”
Starfish data protection shrinks the backup window from a week down to less than 24 hours
Indigo now relies on VAST storage to house its ever-expanding volume of research data. Starfish plays a crucial role in managing daily backups from VAST to Clemson’s ZFS backup system. This marks a significant leap forward from earlier methods, when backing up user directories could take over a week. Thanks to enhanced Starfish agent capacity, backups are now completed in under 24 hours—even when processing the millions of small files commonly generated by AI and genomics research.
“We’ve worked closely with Starfish’s support team to tune things for our environment,” said Becky. “Our researchers generate massive amounts of data—some have 300 to 400 terabytes out there. Thanks to Starfish we’re able to keep pace.”
The next phase is helping researchers manage their own data
Although Starfish is currently managed centrally by RCDI, there are plans to expand access directly to researchers. The goal is to allow users to assess their own data footprint, identify what can be archived, and structure their storage more effectively.
Becky shared a recent real-world example: A researcher needed to move data into Box but faced file count limits. “We let her into a dedicated Starfish Zone for her project, and she could instantly see how to restructure her data so it would upload to Box. Otherwise, she would’ve had to run tons of du and find commands, and figure it out by trial and error. She could do it quickly with Starfish.” The RCDI team hopes to expand this kind of researcher access in the future—providing transparency and giving scientists greater control over their storage use.
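For a sense of what the manual du and find route involves, here is a minimal Python sketch that counts files under each top-level folder of a project and lists the folders that exceed a chosen limit. It is not a Starfish feature or API. The function names are hypothetical, and `max_files` is a placeholder to be set to whatever limit applies to the destination.

```python
import os
from pathlib import Path


def file_counts_by_subdir(root: str) -> dict[str, int]:
    """Count files, recursively, under each immediate child directory of root.

    Files sitting directly in root are not counted.
    """
    counts: dict[str, int] = {}
    for child in sorted(Path(root).iterdir()):
        if child.is_dir():
            counts[child.name] = sum(len(files) for _, _, files in os.walk(child))
    return counts


def over_limit(counts: dict[str, int], max_files: int) -> list[str]:
    """Names of subdirectories whose file count exceeds max_files."""
    return [name for name, n in counts.items() if n > max_files]
```

Running `over_limit(file_counts_by_subdir(project_root), max_files)` points to the folders that would need to be split or bundled before upload, though on a large project each run means another full walk of the tree.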
Starfish has also helped researchers better prepare for the financial reality of data management. “The transparency helps researchers, especially the newer ones, understand that you have to plan for storage in your grants. It’s part of the research lifecycle now.”
Starfish makes proactive data management feasible
While Starfish is a critical tool for backup and rapid response, Becky and her team are now exploring proactive uses—like file content analysis to detect sensitive data such as Personally Identifiable Information (PII) or protected health data.
With a proven track record of delivering fast, reliable backups and enabling smarter data discovery, Starfish has become an integral part of RCDI’s operations at Clemson University. As data governance becomes more complex and critical, Becky and her team are ready—with Starfish in their corner.
“Starfish lets us do in minutes what used to take days. And when you’re running a rapidly growing research infrastructure for a major university, that makes all the difference.”
Many thanks to Becky Ligon, Storage Systems Architect for Clemson’s Research Computing and Data Infrastructure (RCDI) team, for collaborating with Starfish on this story.