Data is widely available, no question. Everywhere we hear that the new big trend is data crunching and that the great thing about the 2000s is the abundance of freely available data sets. Just recently the US Government launched data.gov, and the event has been acclaimed by many people in visualization and data analysis as a big step toward a better world. OK, data is good. All of it is good. But I think we are getting overly excited about it.
I see a dangerous trend here: thinking that data is the only thing we need and that having large amounts of data at hand will by itself solve problems. But data per se has no real value if it is not related to problems and people! I see many interesting new web sites (and tools!) popping up on the Internet, but I don't see any guidance about what to do with them.
In realistic settings, where visualization and data analysis tools are really needed, people are not enthusiastic about data but about the impact data can have in their context once it is analyzed. And this context is rich in background knowledge and business goals (in the broadest sense). I don't see the same kind of richness around the data being published today.
Tasks and people are the scarce resource
Basic economics teaches us that value is generated from scarce resources and not from what is broadly and readily available. What we really need, therefore, is not only access to data sets but also to real problems and to the people who care about them. Unfortunately, while today data can be readily collected and transferred, problems and people cannot. If we don't realize this, we risk running millions of useless studies, building thousands of useless tools, and wasting enormous amounts of energy and resources.
If tasks were at least attached to the data, the situation would improve considerably. Instead of guessing what these data are for, we could try to solve some real problems connected to them. Take, for instance, the data.gov website: what if people could post interesting questions there? Or what if another website were available where people post data AND tasks?
Some good examples
Some great examples come from research areas like knowledge discovery and visual analytics. Take the KDD Cup. It is organized every year at KDD, the premier international conference on knowledge discovery. Real-world data is published to let researchers compete on a series of pre-defined TASKS. The latest edition, KDD Cup 2009, is an excellent example. After a few lines of description the web page has a large title: "Task Description". The competition is based on a real data set provided by Orange, the French telecommunications company, and the goal is to do better than a data mining tool developed internally. The task is to: "estimate the churn, appetency and up-selling probability of customers" ... "churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time" ... "the appetency is the propensity to buy a service or a product." ... "Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale." Better than just data, isn't it?
The VAST Challenge is another excellent example. It is organized every year within the Symposium on Visual Analytics Science and Technology. As with the KDD Cup, a new data set with tasks is published every year. The problems are selected so that visual analytics technology is needed to solve them; that is, plain automatic methods without iterative user sense-making are unlikely to be the solution. Another great thing about this challenge is that the data is synthetically generated, so ground truth is available. In practice, this means that different solutions can be compared in terms of their ability to discover what there is to be discovered. So the VAST Challenge provides complex data AND tasks. But even better, it also provides people! Since the 2007 edition the contest has also included a session where the winners have the opportunity to run a small contest live together with real analysts. This is exactly the direction to take, in my humble opinion.
Conclusion
I don't want to give the impression that data is not important or that its wide availability is not a great thing. It is! But as data turns into a commodity, other factors become relevant. Having meaningful tasks and access to real people trying to solve problems is a lot more important, and a lot less likely to become a commodity. What will count in the future (present?), both for researchers and practitioners, is not data but people. I think it is important to recognize these limits and opportunities and start behaving accordingly. It is for this reason that I am not overly enthusiastic about having a lot of data. And I think the sooner we start differentiating between just data and data + problems, the better it will be for all of us.
Comments (3)
True, having access to lots of data right now is exciting, and to an extent it's just data for data's sake. But this is just the beginning.
A lot of apps are built around the data for now, and over time, people will figure out connections that will lead to more complex and more interesting tools.
Also, once we have a better idea of what data is out there, it can be used to enrich all kinds of applications. These apps will not be centered on particular data, but simply pull in background data to aid in whatever task they're good for.
The idea for a site to post questions is a good one, but it's completely separate from the data being available. Some of those questions might be answered with public data, others might require data from other sources.
Posted by Robert Kosara | September 1, 2009 4:22 AM
Yeah, you're right.
We should talk about a whole user-centered scenario in terms of "visualization" goals.
A small step would be to take into account some practices from User Experience Design patterns.
Regarding what Robert wrote, I also agree with the need for several high-quality data sources, because public data is not always available in some specific fields.
I think that an appropriate use of social media could give a hand in finding some smart logic for data visualization and data analysis.
Posted by Daniele Galiffa | October 17, 2009 9:01 AM
I agree completely. I really don't care about how much data is out there.
We need to be answering questions and solving problems.
And that is the crux. Most often people form their questions around what data they have. When in fact they should simply state the problem. Then we can try to solve it by looking at the data, and if we can't, we need better data.
Posted by krees | November 12, 2009 7:27 PM