
January 2009 Archives

January 14, 2009

Book Review: The Numerati

The Numerati are the statisticians, computer scientists, and analysts around the world who analyze tons of data to understand "us". They are the subject of this wonderful book by Stephen Baker, a BusinessWeek journalist.

The book is an easy read, written in a simple style that makes it accessible to everybody, and yet it is incredibly intriguing and informative for the knowledgeable reader.

Stephen interviewed dozens of researchers and entrepreneurs around the US and brought into focus one of the major trends of our days: not only has an incredible amount of data been collected, and continues to be collected every day around the world, but we are also finally starting to "use" these data to understand relevant aspects of human beings. Health, finance, marketing, and policy are only a few examples of areas where data is collected and deeply analyzed every day.

Content

The book is organized around seven chapters: Worker, Shopper, Voter, Blogger, Terrorist, Patient, and Lover. In each of them people are modeled through the lens of a specific stereotype.

In Worker we are modeled according to our skills and the way we work. We meet people like Samer Takriti at IBM, who is modeling about 300,000 IBM workers to understand the relationship between their skills and their performance, and how to better allocate these skills within the company, the same way we have always done with any physical company asset.

In Shopper we are modeled according to the things we buy. Researchers are analyzing the millions of transactions we make every day in stores to understand what "type" of buyers we are. Rayid Ghani, for instance, analyzes grocery store transactions with his group at Accenture Technology Labs to provide personalized suggestions to shoppers through carts equipped with personal assistants.

In Voter we are modeled according to ... to what? This is an impressive chapter because it demonstrates that we can be modeled in a given domain indirectly, using data that apparently has no connection with the subject matter. This is what Josh Gotbaum does with his political firm Spotlight Analysis. They provide detailed indications on swing voters based on data taken from large data companies like ChoicePoint and Acxiom, which collect an incredible amount of data about almost every aspect of our lives (scary?! :-)).

In Blogger we are modeled according to our opinions. Yes, our opinions. There are companies like Umbria Communications that analyze the blogosphere to understand the opinion trends of millions of bloggers on whatever interests a given company. If I want to track how people react to a new product put on the market, Umbria can tell me.

In Terrorist we are modeled as potential terrorists or thieves. Here we meet people like Jeff Jonas, now at IBM, who helped casinos in Las Vegas sift through millions of internal records to single out suspect customers. The same technology attracted investment from In-Q-Tel, the venture capital arm of the CIA, for use in national security and counterterrorism.

In Patient we are modeled according to our body signals and medical records. This is the chapter I loved most, not only for its humanitarian applications, but also for the cleverness of some solutions. Eric Dishman launched the home health division at Intel, where they design smart sensors like the "magic carpet", which tracks weight and movements to monitor the health of patients, and where they try to predict the onset of diseases like Parkinson's and Alzheimer's by detecting suspect variations in the stream of data.

Finally, in Lover we are modeled according to our profiles to find matches among us as potential lovers. We meet Helen Fisher, a Rutgers University anthropologist, who devised an innovative method for matching people that is the basis of the Chemistry.com dating website. Her method goes well beyond simple matching of demographic data: it is based on her theory that we can be split into four groups, in each of which a specific hormone is predominant, and that the best matches come from complementary hormones.


Reflections

The first issue the book raises is obviously privacy. I really liked Stephen Baker's approach, which keeps an equal distance from the excitement about the new opportunities brought by innovation and from the fear of a super-controlled society where drawing a full profile of each of us is becoming worryingly easy. Every other technological shift in history, however, has come with the promise of human advancement together with novel problems (think of cars and pollution). Stephen asks the right questions of some of the researchers he met. The most interesting exchange in terms of privacy is the one with Jeff Jonas, who is "vehemently opposed to the use of statistical data mining to predict the next terrorist attack" because of the high risk of intrusion and false alarms. And yet he believes that this technology can protect both our freedom and our privacy at the same time. I think this is one of the biggest challenges of our time: finding the right balance between the opportunities for increased freedom and security and the risks of intrusion, control, and faulty conclusions in the analysis of our own data.

From a more scientific and technological point of view, what strikes me is the relevance prediction has in all the application areas described in the book. In traditional data analysis, especially for those with a visualization background, the focus is on "understanding" what is in the data to build a mental model out of it and on "discovering" some special gems in the chaos. Real-world applications, however, are more concerned with producing actionable solutions to run and test, and I have the impression that "prediction" lends itself better to this goal. Think about it through the book's examples: in Worker the company wants to predict performance to put people in the right place; in Shopper the grocery store wants to predict which product can be sold to one specific customer to provide timely suggestions; in Voter a political party wants to predict which population segment should be addressed with a targeted message to increase the chances of reaching a group of swing voters; and so on. How do we, visual information designers and analysts, cope with this fact? Are we able to provide the same level of actionable knowledge with our tools, or are we condemned to just describe things and hope that this information will be useful in some way?


Implications for Visualization

What is the role of visualization in the world of the Numerati? I think it is huge!

First of all, the technologies used by the Numerati are all to some extent prone to errors, and they are always the result of continued refinement of the underlying model. Visualization can play a significant role in helping modelers understand and test their models and explore their implications as they are applied to new data. Without such a level of interaction, the risk is that we build monstrous black boxes that spit out oracles we all have to follow without really knowing why.

Another area where I see a large role for visualization is when mining is used in monitoring environments, where the timely detection and comprehension of the situation (more technically known as the situational awareness problem) is important. We have a long and respected tradition of research on what works best in terms of visual representation when visual saliency, detection, and contextual information are at stake. Well-designed visualizations that let people get the most out of a screen in a matter of seconds are of paramount importance here, from the analyst studying terrorist attacks to the doctor monitoring a patient.

A third opportunity I see is personal data visualization. As these technologies develop and the results of data analysis become more pervasive, I expect to see an increase in end users' need to manage their personal data and the results of these analyses. And how are we going to provide this information to the average person? Visualization can play a big role here, and again it will need to reinvent itself a bit. In this domain extremely simple and useful visualizations will be needed, and some of them will be delivered on non-standard devices like TVs, cell phones, and public displays. We need flexible and simple solutions to offer to the general public.

So, in summary, the explosion of data analysis is good news for us! We have plenty of novel challenges to address. A somewhat silent mind shift is already underway ... I expect to see in the future an ever tighter integration of automatic mining technologies and visualization, as the recent Visual Analytics trend demonstrates.

January 20, 2009

Charting the Index of Economic Freedom: a call to action

The Heritage Foundation, together with the Wall Street Journal, has recently published the latest report on the Index of Economic Freedom. The index captures the degree of economic freedom of a given country through a series of factors such as "freedoms of movement for labor, capital, and goods, and an absolute absence of coercion or constraint of economic liberty, etc."

Apart from the intrinsic interest such data has, in that it measures freedom, the case is very challenging in terms of graphical representation. Here is just an example of what I've been able to do so far, honestly with little success.

index-economic-freedom_s.png

Data

The dataset (which can be downloaded directly from the website) contains, for each country and year from 1995 to 2009, the overall score and the individual factors (e.g., business freedom, trade freedom, fiscal freedom, etc.) that compose the score. Technically speaking it is a multivariate time series, quite a tough object to handle.

Chart

In my proposed solution I focus on the countries that experienced the largest positive or negative changes over the whole time range. Beyond the obvious reading of the best and worst countries in the overall score, which can be easily obtained from the website, I think representing measures of change is a lot more interesting.
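For readers who want to reproduce this kind of selection, here is a minimal sketch in Python. It is not the procedure I followed in Excel: it assumes that "change" means the last available score minus the first available one, and the names `net_change`, `top_movers`, and `toy_scores` are hypothetical, as are the made-up scores.

```python
def net_change(series):
    """Last minus first among the non-missing scores of one country.

    Assumption: "change" is measured end to end, not as max minus min.
    Returns None when fewer than two scores are available.
    """
    present = [s for s in series if s is not None]
    if len(present) < 2:
        return None
    return present[-1] - present[0]


def top_movers(scores, k):
    """Return (k largest gainers, k largest losers) as (country, change) pairs."""
    changes = {c: net_change(s) for c, s in scores.items()}
    changes = {c: d for c, d in changes.items() if d is not None}
    ranked = sorted(changes.items(), key=lambda item: item[1])
    gainers = ranked[-k:][::-1]  # biggest improvement first
    losers = ranked[:k]          # biggest decline first
    return gainers, losers


# Made-up scores for illustration only, not values from the real index.
toy_scores = {
    "Country A": [40.0, 45.0, 60.0],
    "Country B": [70.0, 65.0, 50.0],
    "Country C": [55.0, None, 56.0],
    "Country D": [None, 30.0, None],  # too few points, skipped
}

gainers, losers = top_movers(toy_scores, 1)
print(gainers, losers)
```

Measuring the gap between each country's maximum and minimum instead would favor countries with a single spike, so the choice of definition changes which countries end up in the chart.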

I've created the chart with MicroCharts, a wonderful little Excel add-on. Each sparkline represents the variation of the overall score over time, so that it is possible to see ups and downs in the considered time span. Since each sparkline is scaled to its own minimum and maximum values, the timelines cannot be compared in terms of their absolute values. But this is OK as long as the main goal is to convey messages like: "hey, this country has significantly and steadily improved its index over the course of the years!". The absolute values can be read on the right side, where min and max are color-coded the same way the small dots are coded in the sparkline. The size of the dot represents the value, and the bar chart represents the amount of variation.
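To see why sparklines scaled to their own range cannot be compared in absolute terms, consider this small Python sketch. It illustrates generic min-max scaling and is not MicroCharts' actual implementation; the function name `scale_to_own_range` and the numbers are hypothetical.

```python
def scale_to_own_range(series):
    """Map a series onto [0, 1] using its own minimum and maximum."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # A flat series has no range; draw it in the middle of the band.
        return [0.5 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


# Two made-up countries with very different score levels.
low_country = [20.0, 25.0, 30.0]
high_country = [70.0, 75.0, 80.0]

print(scale_to_own_range(low_country))
print(scale_to_own_range(high_country))
```

Both series produce exactly the same scaled shape, so the two sparklines would look identical even though one country scores 50 points higher. This is why the absolute minimum and maximum need to be shown next to each line.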

Trends

I am by no means satisfied with my design, but I think it sheds some interesting light on the data. We can see that Armenia had an impressive improvement from 42.2 to 70.6. We can also see that many Eastern European countries, like Moldova, Bosnia and Herzegovina, Lithuania, and Romania, improved greatly as well, as highlighted in the report. Sad examples are Argentina, which experienced a sudden decrease, probably concomitant with the country's economic breakdown, and Zimbabwe, which went from an already low 48.7 to 22.7.

A call to action!

The real challenge with these data is to represent the individual factors together with the overall score, and to represent the whole dataset, which I have not done. These factors can help explain whether any major variation is due to a specific sector or to an overall change. I'm also convinced the same data can be seen through a myriad of lenses other than mine. It is for this reason that I propose a "call to action", inviting you to create a chart of this intriguing dataset.

To facilitate your task I have attached here a processed version of the file that contains the overall score organized by time in a single Excel sheet (the original data has one sheet for each year). If you do some preprocessing of your own, pay attention to some inconsistencies the original file may have. In particular, note that Somalia is missing from the dataset in some years.
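If you prefer to redo the reshaping yourself, the idea can be sketched in a few lines of Python. This is only an illustration of the general approach, not the exact steps I used: it assumes each yearly sheet has already been loaded as a dictionary from country to overall score, and `pivot_by_year` and `toy_yearly` are hypothetical names with made-up values.

```python
def pivot_by_year(yearly):
    """Turn {year: {country: score}} into one row of scores per country.

    A country absent from a given year gets None for that year, so gaps
    stay visible instead of silently shifting the series.
    """
    years = sorted(yearly)
    countries = sorted({c for table in yearly.values() for c in table})
    wide = {c: [yearly[y].get(c) for y in years] for c in countries}
    return years, wide


# Made-up values for illustration; Country B is missing in 1996.
toy_yearly = {
    1995: {"Country A": 50.0, "Country B": 60.0},
    1996: {"Country A": 52.0},
    1997: {"Country A": 53.0, "Country B": 58.0},
}

years, wide = pivot_by_year(toy_yearly)
for country, row in wide.items():
    print(country, row)
```

Keeping an explicit None for a missing year makes it easy to decide later whether to drop the country, leave a gap in the line, or interpolate.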

Good Luck!

About January 2009

This page contains all entries posted to Visuale in January 2009. They are listed from oldest to newest.
