Data Superabundance part 2: The Long Tail of Data

Time for a longer discussion of our thinking about data superabundance, data management and what we’re up to…

Findability is driven by the frequency with which data gets used.  The more you use something, the easier it becomes to find.  Even sophisticated search engines like Google follow this model.  The legendary PageRank algorithm primarily looks at who links to you.  The more links a site attracts, the more findable Google makes it.  So over time, findable data becomes more findable (or if you prefer, "the rich get richer"; thanks Todd!).

And don’t get me wrong, Google works great.

But here’s the thing.  Thanks to the crazy year-over-year increase in data (zettabytes by 2010), more and more data is being used less and less.  "Huh?" I hear you cry…

Look at it this way.  The amount of data that any individual can use frequently is pretty fixed — there are only so many hours in the day.  Time for a flashback to High School with a scary Venn diagram:

So the green dot represents the amount of data you use frequently.  The red circle is all the data you ever use.  Now let the calendar roll forward a bit.  The overall amount of data has increased significantly, but the amount of data you can use frequently is about the same.  And that picture looks like this:

So as a percentage of all the data, the stuff you use frequently is now a much tinier piece.  In other words, more data is now used less.  As data superabundance continues its merry march, frequently used data will continue to be an ever smaller piece of all the data that exists.  And that’s going to cause all kinds of problems…
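If you'd rather see the arithmetic than squint at circles, here is a minimal sketch of the same idea. The numbers are made up for illustration (10 units of frequently used data, 100 units of total data, 50% growth per year); none of them come from real measurements.

```python
# Illustrative assumptions only: the amount of data you can use frequently
# stays fixed, while the total amount of data grows every year.
FREQUENT_UNITS = 10.0      # hypothetical: data one person uses frequently
STARTING_TOTAL = 100.0     # hypothetical: all data at year 0
ANNUAL_GROWTH = 1.5        # hypothetical: total grows 50% per year


def frequent_share_by_year(years):
    """Return the fraction of all data that is frequently used, per year."""
    shares = []
    total = STARTING_TOTAL
    for _ in range(years):
        shares.append(FREQUENT_UNITS / total)
        total *= ANNUAL_GROWTH
    return shares


shares = frequent_share_by_year(6)
for year, share in enumerate(shares):
    print(f"year {year}: {share:.1%} of all data is frequently used")
```

Whatever growth rate you plug in, as long as it is above 1 the share shrinks every year, which is the whole point of the two pictures.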

And now for part 2 of PTOTD (pet theory of the day)…  The long tail.

If you take all the data inventory contained within a company, assess the frequency with which each piece of data is used, rank the pieces from most used to least used and plot a graph, I think you’d see some kind of power-law curve.  This is also known as a "Long Tail" graph.  If you haven’t read the excellent book "The Long Tail" by Chris Anderson, you really should.
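Here is a rough sketch of that ranking exercise. The inventory is entirely hypothetical: 200 made-up documents whose access counts fall off roughly as 1000 / rank, which is an assumption chosen to mimic a power-law shape, not data from any real company.

```python
def rank_by_usage(usage_counts):
    """Return (rank, name, count) tuples, most frequently used first."""
    ordered = sorted(usage_counts.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, count)
            for rank, (name, count) in enumerate(ordered, start=1)]


# Hypothetical inventory: access counts assumed to decay as 1000 / rank.
inventory = {f"doc_{i:03d}": 1000 // i for i in range(1, 201)}
ranked = rank_by_usage(inventory)

# Crude text plot of the top and bottom of the ranking.
for rank, name, count in ranked[:5] + ranked[-5:]:
    bar = "#" * (count // 20)
    print(f"{rank:>4}  {name}  {count:>5}  {bar}")
```

The first few rows get long bars and everything after that barely registers, which is the shape the graph is meant to show.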

So what?  Well here’s a graph to stare at:


Chris’s book focuses on the long tail for goods.  Most retailers focus on the head of the curve: they stock only the most popular items and they sell a lot of each one.  He makes a great argument for focusing on a huge number of less popular items and selling just a few of each one.

The big guns in data management (ECM, CMS, data warehousing, etc.) are like retailers.  They focus on the head of the graph: the top 5% or so of a company’s data.  But 80-85% of a corporation’s data is in the "Long Tail" part of the chart.  That data is hard to search and mostly unmanaged.
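If you wanted to check where your own data falls, a head-versus-tail split might look something like the sketch below. The 5% cutoff echoes the figure above, but the access counts are hypothetical (the same 1000 / rank assumption), so the output says nothing about any real corporation.

```python
def head_tail_split(counts, head_fraction=0.05):
    """Split usage counts into the most-used head and the remaining tail."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if not ordered or total == 0:
        raise ValueError("need at least one item with non-zero usage")
    head_size = max(1, round(len(ordered) * head_fraction))
    head, tail = ordered[:head_size], ordered[head_size:]
    return {
        "head_items": len(head),
        "tail_items": len(tail),
        "head_usage_share": sum(head) / total,
        "tail_usage_share": sum(tail) / total,
    }


# Hypothetical access counts for 200 items, assumed to decay as 1000 / rank.
counts = [1000 // rank for rank in range(1, 201)]
split = head_tail_split(counts)

print(f"head: {split['head_items']} items, "
      f"{split['head_usage_share']:.0%} of all accesses")
print(f"tail: {split['tail_items']} items, "
      f"{split['tail_usage_share']:.0%} of all accesses")
```

Note that this counts items and accesses, not value; a rarely touched item in the tail can still be the one you desperately need tomorrow.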

Now maybe your inner skeptic is thinking that less frequently used data probably doesn’t have much value.  But there’s not much correlation between usage and value.  Less frequently used data just tends to have variable value driven by circumstance — it’s data that may not be needed today but will be vital tomorrow.

disruptorMonkey is building tools to manage the "Long Tail" of data. 

Based on the response from those who get what we’re doing, this is going to be an interesting ride!