Data is widely available, no question. Everywhere we hear that the big new trend is data crunching and that the great thing about the 2000s is the abundance of freely available data sets. Just recently the US Government launched data.gov, and the event has been acclaimed by many people in visualization and data analysis as a big step toward a better world. OK, data is good. All of it is good. But I think we are getting overly excited about it.
I see a dangerous trend here: thinking that data is the only thing we need and that having large amounts of it at hand will solve our problems. But data per se has no real value if it is not related to problems and people! I see many interesting new web sites (and tools!) popping up on the Internet, but I don't see any guidance about what to do with them.
In realistic settings, where visualization and data analysis tools are really needed, people are not enthusiastic about data but about the impact data can have in their context once it is analyzed. And this context is rich in background knowledge and business goals (in the broadest sense). I don't see the same kind of richness around the data being published.
Tasks and people are the scarce resource
Basic economics teaches us that value is generated from scarce resources, not from what is broadly and readily available. What we really need, therefore, is not only access to data sets but also access to real problems and to the people who care about them.
Unfortunately, while data can today be readily collected and transferred, problems and people cannot. If we don't realize this, we risk running millions of useless studies, building thousands of useless tools, and wasting enormous amounts of energy and resources.
If tasks were at least attached to data, the situation would improve considerably. Instead of guessing what the data is for, we could try to solve some real problems connected to it. Take the data.gov website, for instance: what if people could post interesting questions there? Or what if there were another website where people post data AND tasks?
Interesting good examples
Some great examples come from research areas like knowledge discovery and visual analytics.
Take the KDD Cup. It is organized every year at KDD, the premier international conference on knowledge discovery. Real-world data is published to let researchers compete on a series of pre-defined TASKS. The KDD Cup 2009 is an excellent example. After a few lines of description, the web page has a large title: "Task Description". The competition is based on a real data set provided by Orange, the French telecommunications company, and the goal is to do better than a data mining tool developed internally. The task is to: "estimate the churn, appetency and up-selling probability of customers" ... "churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time" ... "the appetency is the propensity to buy a service or a product." ... "Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale." Better than just data, isn't it?
The VAST Challenge is another excellent example. It is organized every year within the Symposium on Visual Analytics Science and Technology. As with the KDD Cup, a new data set with tasks is published every year. The problems are selected so that visual analytics technology is needed to solve them; that is, plain automatic methods without iterative user sense-making are unlikely to be the solution. Another great thing about this challenge is that the data is synthetically generated, so ground truth is available. In practice, this means that different solutions can be compared in terms of their ability to discover what there is to be discovered. So the VAST Challenge provides complex data AND tasks. But even better, it also provides people! Since the 2007 edition the contest has also included a session where the winners have the opportunity to run a small contest live together with real analysts. This is exactly the direction to take, in my humble opinion.
Conclusion
I don't want to give the impression that data is not important or that its wide availability is not a great thing. It is! But as data turns into a commodity, other factors become relevant. Having meaningful tasks and access to real people trying to solve problems is a lot more important, and a lot less likely to become a commodity. What will count in the future (present?), for both researchers and practitioners, is not data but people.
I think it is important to recognize these limits and opportunities and to start behaving accordingly. It is for this reason that I am not overly enthusiastic about having a lot of data. And I think that the sooner we start differentiating between just data and data + problems, the better it will be for all of us.