Archive for the 'Data Organization' category

Feldian Dark Matter and Superabundant data…

NickN| November 29, 2007 5:16 pm

Back in July, Brad Feld wrote a post titled "The Dark Matter of the Blogosphere".  I’m not sure if he coined the term or not, but I like its meaning.

For those of you who are less of a physics nerd than I am, dark matter is something astrophysicists have been struggling with for a while.  Simply put, the Universe doesn’t have enough stuff in it to work the way it does.  The most viable explanation is that there is a _lot_ of stuff we can’t see or detect easily, a.k.a. Dark Matter.

In the case of the blogosphere, Brad was referring specifically to reader comments.  There’s a huge volume of user generated content out there in the form of blog comments, and for the most part it is unsearchable and effectively invisible.  Folks like Disqus and Intense Debate are working hard to resolve this.

But I think the concept of Dark Matter is very applicable to data in general. 

Think about all of the data in your life.  How much useful information do you have that is effectively hidden and invisible?  This is as true for an individual as it is for a corporation.  Some of this information is hidden by virtue of being hard to search or hard to access… and some is hidden because it isn’t explicit — it’s "implied" by the way things have been collected, organized, or used.

So let’s take a quick look at each case…

Hard to search:

The original idea for disruptorMonkey stemmed from a personal problem…  Like many of you, I have the "big box of crap" that I’ve accumulated from many different jobs.  It includes CD-ROMs of data, printed stuff, handwritten notes and numerous other treasures.  About 18 months ago, I needed to put together some sales training materials for someone.  I dug into the big box and it took me 4+ days to organize, recreate and assemble what I needed.  It was a nightmare.  Incensed at the stupidity of the process, I started looking for a better way, which quickly led me to set up a wiki.  Wikis can be great, but they’re mostly hopeless with existing data unless you reformat it for the wiki…which is a huge pain.

The underlying issue was the fact that the data was hard to search, which made it difficult to organize and repurpose.

Hard to access:

Last week I was talking to a banker, who happened to have majored in IT systems.  I was explaining some of what we do, and he started telling me about some of his data woes.  The biggest one stemmed from the fact that some banking systems are built on fairly old databases.  You’ve probably seen the horrible green-screen terminal-window interfaces in use at your local bank.  These UIs have zero flexibility and are the result of many years of development, much of it seemingly without input from the people using the product.

Even though the whole thing is just a database, he has no way whatsoever to run unique queries.  For example, he would love to be able to search for customers with a $5,000-$10,000 personal line of credit.  The data he needs is in the database, but he has no way to access it, so from a practical perspective it doesn’t exist in any meaningful way.
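To make the banker's wish concrete, here is a minimal sketch of the kind of ad hoc query he was describing.  The table name, column names and sample rows are all made up for illustration (the post says nothing about the bank's real schema), and an in-memory SQLite database stands in for whatever the bank actually runs.

```python
import sqlite3

# Hypothetical schema and data: the bank's real tables are not described anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credit_lines (customer TEXT, product TEXT, credit_limit INTEGER)"
)
conn.executemany(
    "INSERT INTO credit_lines VALUES (?, ?, ?)",
    [
        ("Avery", "personal", 4000),
        ("Blake", "personal", 5000),
        ("Casey", "personal", 7500),
        ("Devon", "business", 8000),
        ("Ellis", "personal", 12000),
    ],
)


def customers_in_range(conn, low, high):
    """Return customers whose personal line of credit is between low and high, inclusive."""
    rows = conn.execute(
        "SELECT customer FROM credit_lines "
        "WHERE product = 'personal' AND credit_limit BETWEEN ? AND ? "
        "ORDER BY customer",
        (low, high),
    )
    return [name for (name,) in rows]


matches = customers_in_range(conn, 5000, 10000)
print(matches)
```

The point is less the SQL itself than the fact that a one-line question like this is trivial once the data is reachable, and impossible when the only door into the database is a fixed set of terminal screens.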

Implied Data:

The discussion I had with Brad before Thanksgiving was about how Exchange server contains a lot of interesting "implied" data, above and beyond the obvious email & social network info.  Your Outlook/Exchange account says an awful lot about you and the things you’re interested in… along with who you talk to and what you talk about.

That’s not data that is readily exposed in any useful way, although companies like Xobni are making some headway on that front.
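As a rough illustration of what "implied" data can look like, the sketch below counts how often each correspondent shows up in the headers of a few messages.  Nobody typed that ranking in anywhere; it falls out of how the mailbox has been used.  The sample messages and addresses are invented, the function name is hypothetical, and this uses Python's standard email parsing rather than anything Exchange-specific.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Made-up messages standing in for a mailbox export.
RAW_MESSAGES = [
    "From: me@example.com\nTo: alice@example.com\nSubject: dark matter\n\nHi",
    "From: me@example.com\nTo: alice@example.com, bob@example.com\nSubject: video\n\nHi",
    "From: bob@example.com\nTo: me@example.com\nSubject: Re: video\n\nHi",
]


def correspondent_counts(raw_messages, owner):
    """Count how often each address other than the owner appears in From/To/Cc."""
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        headers = (
            msg.get_all("From", []) + msg.get_all("To", []) + msg.get_all("Cc", [])
        )
        for _name, address in getaddresses(headers):
            address = address.lower()
            if address and address != owner:
                counts[address] += 1
    return counts


counts = correspondent_counts(RAW_MESSAGES, "me@example.com")
print(counts.most_common())
```

The same idea extends to subjects, attachments and reply timing, which is where the "things you're interested in" part of the picture would come from.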

All three of these scenarios are about "dark matter" data.  There’s a lot of incredibly important information that’s there, waiting to be mined, but today’s tools mostly can’t see or use it.

One of our longer term goals at disruptorMonkey is to build a tool that not only captures all that dark matter, but also puts it to work and makes it useful.

There’s much to do, but we’re excited with the progress we’ve made so far…

 

Information R/evolution…

NickN| October 18, 2007 12:14 pm

I guess I’m late to the blog-party on this one, so you may already have seen it.  But this video is a fantastic backgrounder on much of how we see the world of data and information at disruptorMonkey.

The video was created by Michael Wesch, an Assistant Professor of Cultural Anthropology at Kansas State University.  You can find his spot on the web here.

He also created the excellent "The Machine is Us/ing Us" which you can see here, along with a bunch of other thought provoking videos about the impact of information on our lives.

One of Prof. Wesch’s other videos has a great quote by Marshall McLuhan from 1967:

"Today’s child is bewildered when he enters the 19th century environment that still characterizes the educational establishment where information is scarce but ordered and structured by fragmented, classified patterns, subjects and schedules."

1967!!!  And look at us now.

Thanks to Zack for the heads up — I’m behind on my blog reading and had not seen this yet.

Semantic, schmantic…

NickN| October 8, 2007 6:34 pm

There’s been some increased buzz lately about the semantic web and what it all means.  Alex Iskold, CEO of AdaptiveBlue, has a great piece on SemanticWeb.com, titled "The Semantic Curmudgeon".

AdaptiveBlue make an interesting browser plugin that understands context and applies that understanding to generate useful shortcuts on pages, links and text.  So if you’re browsing music on the web, their plugin "understands" that and suggests useful contextually-related links.  Visit the site and take a tour — it will explain it much better than I just did (sorry Alex!).

Part of what we do here in MonkeyVille is semantic-technology related, in that some of our code understands words and attempts to infer context.  But by no means are we a semantic web application (and no, I will not get drawn into the idiotic "I’m a web 3.0 app" discussions that are bouncing around the blogosphere).

However, our choice not to be a fully semantic app was deliberate, and Alex’s article hits the proverbial nail on the noggin as to why:

1. It lacks memory and is not iterative in nature.
2. Its ultimate goal is to deliver perfect answers, which are unattainable.
3. It is technologically impractical to achieve.

Back in the day, I was involved with a company that did a lot of research into symbol recognition for engineering drawings.  The idea is just like OCR — scan a page of text and get real words — but for engineering symbols.  It is a tough problem to solve, arguably worse than handwriting recognition because symbols can be anywhere in a drawing and can be drawn on top of other lines and features.

We had some clever engineers who spent a lot of time trying to solve the problem.  Using AI, fuzzy this and neural that, they boosted recognition rates from ~65% to (I think) 80+%.  We were proud, and very condescending toward our competitor with their stubby and sad 65%.  But the competition were smart, as well as clever.  They responded not by developing even better technology, but by creating a better workflow.  They redefined the real problem: customers wanted to quickly convert hand-drawn squiggles to symbols within a CAD system.  Customers really didn’t care how they got the end result.  So the competitor took their oh-so-sad 65% algorithm, used it to identify everything in the drawing that might possibly be a symbol, and developed some very quick tools to tab around the drawing and manually replace all the squiggles with symbols.

Using their system, you could convert an entire drawing in ~30 minutes.  Using ours, initial processing would only take 10 or 15 minutes, but the cleanup (finding the 20% that wasn’t recognized correctly) took an hour or more.  So their dopey oh-so-stupid technology kicked our asses by 2x or more every time.

Doh!

Smart almost always beats clever.
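For anyone who wants to check the "2x or more" claim, here is the arithmetic as a small sketch.  The numbers are the rough ones from the story; treating "an hour or more" as exactly 60 minutes is my assumption, which makes the result a lower bound.

```python
# Figures from the story, in minutes.  "An hour or more" is taken as exactly 60
# (an assumption), so the ratios below are lower bounds.
THEIR_TOTAL = 30             # competitor: whole drawing converted
OUR_PROCESSING = (10, 15)    # our automated pass, low and high estimate
OUR_CLEANUP = 60             # our manual cleanup of misrecognized symbols


def slowdown(processing, cleanup, competitor_total):
    """How many times longer our workflow takes than the competitor's."""
    return (processing + cleanup) / competitor_total


low = slowdown(OUR_PROCESSING[0], OUR_CLEANUP, THEIR_TOTAL)
high = slowdown(OUR_PROCESSING[1], OUR_CLEANUP, THEIR_TOTAL)
print(f"{low:.2f}x to {high:.2f}x slower")
```

Even with the better recognition rate, the cleanup step dominates the total, which is why the workflow mattered more than the algorithm.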

And the situation with the purely semantic web is almost identical.  The idea that code will ever be able to identify context and meaning with 100% certainty for every individual is absurd.  It might hit 85% for most people, or more.  But it will never, ever, hit perfect accuracy.

The second issue I see with "pure" semantic web plays is that they expect authors of web pages to add additional markup that conveys contextual meaning.  Given how long it has taken for CSS to be adopted (something obviously useful at the individual level), I question how readily authors will adapt to adding contextual markup.  Not to mention that there is an awful lot of data already out there.  IDC reckon that “The digital universe in 2006 could be likened to 12 stacks of books extending from the Earth to the sun. By 2010 the stack of books could reach from the sun to Pluto and back…”.

So while semantic tools absolutely have their use, they are just another component of an overall solution, not a universal magic bullet.

More on what we’re doing on Wednesday!

How many Black Holes have you helped create today?

NickN| July 17, 2007 1:47 pm

We see a pattern in Data Organization, and it’s one we hope to change.

It goes something like this…  New tools for organizing data come online.  People start using the tool.  For the first few months or so, everything is wonderful.  And then the volume of data reaches some kind of critical mass, and the efficiency of the tool starts to decrease rapidly.

In the days of 16k RAM packs and tape backups, I used to think it was just a matter of storage space.  But space is the least of our problems today…

Take wikis, for example.  I love wikis.  I’ve set up and used a bunch of them.  But once they get to a certain size, Wiki Fatigue sets in.  It becomes hard to find the data you’re looking for.  You can’t figure out where someone on your team has put something.  The organizational utility of the tool breaks down…  In the worst cases, you’ve created a data Black Hole that sucks up information…  And almost nothing escapes from a Black Hole…

And then there’s email.  Almost no one on the planet has an organized inbox.  We all lose stuff in email all the time, from messages to attachments to updates we’ve made to attachments.  Another Black Hole.

Local hard drives?  Network folders?  The list goes on.  Any typical corporation is littered with disparate, disconnected data silos, many of which turn into Black Holes of one kind or another.  And the pattern just keeps repeating.  After you’ve used a tool for long enough and added enough data, it inevitably gets less and less efficient.  And that usually leads to a new tool, which is great for a while until…

It’s time for that to stop.