Dark Data: Should You Be Scared?

Dark data is a dramatic term that describes a real problem: massive numbers of unstructured files that no one knows exist. In the corporation, this dark data clogs storage media ranging from production arrays to secondary disk, tape, and cloud.

That dark data exists is no surprise to IT. The surprise is two-fold: 1) that dark data costs the corporation a lot of money and exposes it to risk, and 2) that there are technologies that do something about it.

How Dark Data Costs You Money, Time and Risk

  • Large amounts of unclaimed data waste storage resources. A stray file here and there makes little practical difference. But when these files multiply, they take up growing amounts of storage capacity. This costs money and, in the case of production storage, hurts performance. With storage clocking in at 40% of the compute infrastructure, this is a big loss of resources.
  • Unclaimed data wastes business value. Big data analytics needs historical data to build business intelligence. When analytics cannot locate valuable data, all bets are off. That data’s value drops to zero.
  • Manually managing dark data takes resources. When dark data grows to the point that it impacts storage resources, IT attempts remediation. Typical measures are to move or delete data that is past a certain age or that belonged to a long-gone employee. This must be done storage system by storage system, making it a time-consuming operation that IT puts off. Sometimes forever.
  • Dark data threatens compliance and security. Mass deletion can be a good strategic move, but it can also be risky when you do not know what you are deleting. Mass migration by simple age is a poor strategic move too if you have no way to capture the data’s value, not to mention that it uses yet more storage resources to house old data.

What to Do about It

There are technologies out there that help with managing dark data. In order of priority, the key features are 1) discovery, 2) classification, and 3) action. All three make up the ideal solution, but not every situation requires all of them.

  • Discovery is the bare minimum for dark data management software. Locating aging data on a single storage system is not a big deal, and your existing storage management tools can probably do it for you. Cross-repository discovery adds a great deal more value, so a better plan is to invest in a product that federates discovery across multiple repositories. Automated scheduling is a very helpful feature, but manual runs will work as long as you set and stick to your own schedule.
  • Classification is the next priority. A multi-repository discovery tool will report findings by basic metadata classifications, and this may be all you need. However, more sophisticated classification features will yield results by rich metadata and additional characteristics. This enables you to understand the data that your tool is uncovering and to report accordingly.
  • Action lets you run policy-driven operations on classified data. For example, you can automate file movement to secondary storage based on matching date ranges, creators, and file types. You can do the same to discover and move files onto analysis platforms, or you can use actionable classification to delete low-value files at the end of their lifecycles. Action features should also include reports and alerts for governance and compliance.
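
To make the three features concrete, here is a minimal Python sketch of a discover, classify, and act pipeline. It is an illustration under stated assumptions, not how any vendor's product works: local directories stand in for separate repositories, classification uses only file age and extension, and the "action" is a dry-run plan that moves and deletes nothing. The names FileRecord, discover, classify, and plan_actions, the extension sets, and the 365-day threshold are all hypothetical.

```python
import os
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterable, Iterator, List

# Hypothetical policy inputs; tune these to your own retention rules.
ANALYTICS_EXTENSIONS = {".csv", ".json", ".log"}
LOW_VALUE_EXTENSIONS = {".tmp", ".bak"}
SECONDS_PER_DAY = 86400


@dataclass
class FileRecord:
    path: Path
    size_bytes: int
    mtime: float          # last-modified time, seconds since the epoch
    extension: str
    category: str = "unclassified"


def discover(roots: Iterable[Path]) -> Iterator[FileRecord]:
    """Discovery: walk every root and yield basic metadata for each file."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = Path(dirpath) / name
                try:
                    info = path.stat()
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                yield FileRecord(path, info.st_size, info.st_mtime,
                                 path.suffix.lower())


def classify(record: FileRecord, now: float,
             stale_after_days: int = 365) -> FileRecord:
    """Classification: label a record using only age and file type."""
    age_days = (now - record.mtime) / SECONDS_PER_DAY
    if age_days < stale_after_days:
        record.category = "active"
    elif record.extension in ANALYTICS_EXTENSIONS:
        record.category = "analyze"
    elif record.extension in LOW_VALUE_EXTENSIONS:
        record.category = "delete-candidate"
    else:
        record.category = "archive"
    return record


def plan_actions(records: Iterable[FileRecord]) -> Dict[str, List[Path]]:
    """Action (dry run): group paths by category; nothing is moved or deleted."""
    plan: Dict[str, List[Path]] = {}
    for record in records:
        plan.setdefault(record.category, []).append(record.path)
    return plan
```

A run over two hypothetical mount points would look like plan_actions(classify(r, time.time()) for r in discover([Path("/mnt/share_a"), Path("/mnt/share_b")])). Keeping the last step as a plan rather than an immediate move or delete leaves room for the review, reporting, and alerting that governance and compliance call for.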

With corporate data growing at a conservative 55% year over year, more and more data is flying under the radar – lowering business value, raising compliance risks, costing storage management time, and increasing capital and energy costs. So look for products that bring unstructured files into the light with federated discovery and classification, and ideally policy-driven action as well.
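
To put the 55% growth figure in perspective, the short Python sketch below compounds it over several years. The 100 TB starting point and the function name project_capacity are hypothetical, chosen only for illustration, and the sketch assumes the growth rate stays constant.

```python
def project_capacity(current_tb: float, annual_growth: float, years: int) -> float:
    """Compound growth: capacity after `years` at a fixed annual growth rate."""
    return current_tb * (1 + annual_growth) ** years


# Hypothetical starting point of 100 TB, growing 55% per year.
for year in range(6):
    print(year, round(project_capacity(100, 0.55, year), 1))
```

At that rate the hypothetical 100 TB grows to roughly 895 TB after five years, close to a nine-fold increase, which is why data that goes unmanaged for a few years becomes hard to ignore.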

If most of your storage is on a particular array, then array-based tools may work for you. All of the big storage vendors offer some level of discovery, classification and action, including EMC with its SourceOne portfolio, IBM with InfoSphere, and HP with its content management suite. Newer entrant Acaveo discovers, classifies and acts across multiple repositories. For litigation-heavy environments, eDiscovery software with robust collection and analysis capabilities, such as IBM StoredIQ, can serve double duty.