The golden age of data journalism?

Computer-assisted reporting, or CAR, has been around, well — ever since there were computers. Even when I was in journalism school (which was longer ago than I care to remember), we learned about databases we could search, and so on. But the explosion of Web-based tools for sifting through and sharing data has created something approaching a revolution, and the potential benefits for journalism are only just beginning to reveal themselves. If this movement has a patron saint, it is probably Adrian Holovaty, who gained renown while working on data-driven features at the Washington Post, then built one of the first and most admired Google Maps mashups, and followed that with his fellowship-financed EveryBlock, which aggregates local data about an area.

Another recent example of how data can drive reporting, and how Web-based tools can extend and enhance that reporting, comes from several British newspapers — primarily The Guardian — and their coverage of an emerging expenses scandal involving British politicians. One of the really interesting things The Guardian has done is to publish all of the expense data it has in a laboriously detailed and publicly accessible Google spreadsheet. As Paul Bradshaw points out at the Online Journalism Blog, this structure allows reporters (or anyone else who is interested in the info) to extract useful data simply by changing the URL. Someone has even created a page where you can run queries on the database with a simple click.
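To make the URL-tweaking trick concrete, here is a rough sketch in Python of how a published Google spreadsheet of that era could be pulled down in different formats just by changing a query parameter. The spreadsheet key and column names below are hypothetical, not the Guardian's actual sheet, and the `spreadsheets.google.com/pub` URL pattern is an assumption based on how published sheets worked at the time.

```python
import csv
import io
import urllib.request


def sheet_url(key: str, fmt: str = "csv") -> str:
    """Build the URL for a published spreadsheet. Swapping the `output`
    parameter (csv, html, ...) changes the format you get back -- this
    is the "change the URL" trick Bradshaw describes."""
    return f"https://spreadsheets.google.com/pub?key={key}&output={fmt}"


def parse_csv(text: str) -> list:
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_rows(key: str) -> list:
    """Download a published sheet as CSV and return its rows.
    (Makes a live network request, so it is not called here.)"""
    with urllib.request.urlopen(sheet_url(key)) as resp:
        return parse_csv(resp.read().decode("utf-8"))
```

Once the rows are plain dicts, anyone can filter or total them, which is exactly why publishing the raw spreadsheet is so much more useful than publishing a static story.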

(please read the rest of this post at the Nieman Journalism Lab)

Bonus link:

See Adrian Holovaty’s definitive, two-part answer to the question “is data journalism?”

Journalism, data and community

I apologize in advance — this post is really just some links that I came across that have to do with the media, the “data-fication” of journalism, and community. Maybe when I have more time I will try to find the connections that pull these things together, but until then I will just present them as they are, in part to help myself remember and think about them:

— The Los Angeles Times has a “data desk,” which includes links to all of their data-driven projects (link via, who found it via Ben Fry’s blog, who got it from Casey). Some interesting stuff in there, including a database of fatalities from a train crash in September, along with personal information about the deceased, and a list of L.A.’s dirtiest pools.

Continue reading

Waxy digs into Girl Talk data

If you are the kind of data geek who loves to accumulate numbers about things and then slice and dice them to see what emerges, then Andy “Waxy” Baio is your kind of guy. An independent journalist and programmer whose blog is a treasure trove of such things, Andy recently spent some time analyzing the latest album from DJ mashup artist Girl Talk (which I wrote about here). Using data from Wikipedia — as well as some he gathered through Amazon’s “crowd-sourcing” engine, Mechanical Turk — he came up with a spreadsheet listing all the samples that Gregg “Girl Talk” Gillis used on the album (264 in all) and how many samples each song contained.

Then he created a visual timeline of where the samples appear in each song, and a bar graph showing the age of each song used as a sample (median age: 13 years), as well as the same data laid out another way, showing that Gillis uses a lot of recent hits and a lot of 80s tunes, but not many in between. What does any of this mean? Who knows. But it’s a tour de force of data porn. As always, Waxy gives a full breakdown of his methodology, and all of the data can be downloaded as a CSV file if you want to run your own analysis.
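If you did download the CSV, a few lines of Python would get you most of the way to Waxy's tallies. The rows and column names below are invented stand-ins for the shape of his spreadsheet — the real file's columns may differ — but the counting-and-median approach is the same.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical rows in roughly the shape of Waxy's sample spreadsheet;
# the real data has 264 samples and different column names may apply.
SAMPLE_CSV = """track,sampled_artist,sample_year
Play Your Part (Pt. 1),Ludacris,2008
Play Your Part (Pt. 1),UGK,2007
Still Here,Procol Harum,1967
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# How many samples each song contains -- the per-track count Waxy tallied.
per_track = Counter(r["track"] for r in rows)

# Age of each sampled song relative to the album's 2008 release,
# from which the median age falls out directly.
ages = [2008 - int(r["sample_year"]) for r in rows]
median_age = statistics.median(ages)
```

Swap in the real CSV and the same two lines give you the samples-per-song counts and the median sample age Waxy reported.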

Data flow and creating electricity

One of the difficult parts about constantly having about 35 tabs open in Firefox is that I can never remember how I got to a particular page; was it from a Google Reader shared item? From a Twitter post? From email? My regular RSS reader? It’s hard to say. Which explains why I have no idea how I came across this post from Mark Ury, an “experience architect” at Blast Radius. I’m glad I did, however, since Mark does a really nice job of looking at how focusing on data “ownership” in social networks kind of misses the point — the real value is in data flow.

This is a point that Fred Wilson of A VC and others have also made, and one that Fred says was originally brought home to him by a comment from Umair Haque of Bubblegeneration. “I don’t think it’s the data that’s so valuable,” he said. “It’s the flow of the data through the service.” In his post, Mark Ury compares this to an electric-power generation system, which uses dams to take advantage of water flow in order to generate power. The water never stops, it’s only momentarily delayed — and while it’s being delayed, you can make use of it. As he puts it:

The real opportunity in flow constraint, though, is putting capacity to use and amplifying the effect. Data is like a river: you can dam it and generate electricity. That’s what Google did with search. They created a machine that, as we pass through it on our way to find something, harnesses our collective energy and turns our data flow into the most powerful asset of this generation.

As Mark notes, services that try to restrict the flow of data too much wind up mired in debates over control and ownership, and in many cases the data — just like water — routes itself around the obstruction and finds a new path (i.e., a new service that isn’t as restrictive). That’s a balance that a site like Facebook is continually trying to strike: not so strict that people take their data flow elsewhere, but just restrictive enough to let Facebook make use of the data before letting it move on. Tim O’Reilly has described Web 2.0 as any application or service that tends to get better the more people use it, and that only happens if the data keeps flowing through it.


If you’re like me and have a hard time remembering how you got to a certain page, Gabe “Techmeme” Rivera has posted a comment with a tip: right-click the page, choose “View Page Info,” and you can see the referring page (unfortunately it doesn’t help me in this case, because I’ve already closed the tab).

MySpace: We still control your data

I can appreciate that there’s a good reason for all the buzz on Techmeme about MySpace hooking up with Yahoo, eBay and Twitter as part of the Data Portability project. Data portability and open standards are a great thing, and it’s nice to see some movement on that front after all of the announcements and back-slapping that went on about it last year — followed by very little movement on anyone’s part. But after all the party favours are handed out and everyone’s finished their MySpace punch, it might be worth noting that this “data portability” initiative still keeps the power very much in MySpace’s hands.

It’s true that the site has agreed to open up its API and allow other providers such as Yahoo and Twitter to extract user data using the OAuth standard. But we’re still talking about data that resides on MySpace’s servers and therefore effectively — according to the terms-of-use agreement that members sign when they register — belongs to the social network. It’s nice that MySpace is letting you use your data elsewhere, but as Stacey Higginbotham at GigaOm points out, the company still gets to choose which services can play, since those services have to agree to MySpace’s terms of service in order to get access to the API. And what happens if your account gets deleted for some reason?
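For the curious, the OAuth piece of this works by having a partner service sign each API request with shared secrets instead of ever handling the user's password. Below is a minimal sketch of the OAuth 1.0a HMAC-SHA1 signing step; the URL, parameters, and keys are illustrative, not MySpace's real endpoints or credentials.

```python
import base64
import hashlib
import hmac
import urllib.parse


def sign_request(method: str, url: str, params: dict,
                 consumer_secret: str, token_secret: str = "") -> str:
    """Compute an OAuth 1.0a HMAC-SHA1 signature for one request.

    A partner like Twitter or Yahoo would attach this signature (plus
    nonce, timestamp, and token parameters, omitted here for brevity)
    to each call, proving it was authorized without sharing passwords.
    """
    def enc(s):
        # OAuth requires strict percent-encoding of every component.
        return urllib.parse.quote(str(s), safe="")

    # Canonical parameter string: sorted, percent-encoded key=value pairs.
    param_str = "&".join(f"{enc(k)}={enc(v)}" for k, v in sorted(params.items()))

    # Signature base string: METHOD & encoded URL & encoded params.
    base = "&".join([method.upper(), enc(url), enc(param_str)])

    # Signing key combines the consumer secret and (optional) token secret.
    key = f"{enc(consumer_secret)}&{enc(token_secret)}".encode()

    digest = hmac.new(key, base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()
```

The point of the design, and the reason it still leaves MySpace in charge, is that the consumer secret is only issued to services MySpace has approved: no secret, no valid signature, no data.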

Don’t get me wrong — it’s good that MySpace is opening up. And I think it’s great that being the first to adopt any kind of open standard or interoperability seems to be turning into a competitive advantage. But this is very much about MySpace wanting to become the central storage point for people’s data, and then doling out whatever information it wants to the services it wants to play ball with. Even the praise from the Data Portability Project seems rather faint: it says it hopes MySpace will someday “evolve toward becoming a compliant implementation” of the project’s best practices. I hope so too.


Ben Metcalfe, who acted as an advisor to MySpace and is also a co-founder of the Data Portability group, has posted a comment here in which he corrects some misunderstandings of mine about the nature of what MySpace is doing. In particular, he says that the launch partners are not getting any kind of special deal, but were only chosen in order to “have someone to test and debug the implementation with and also have the ability to demonstrate the complete value proposition end-to-end.” Thanks for clarifying things, Ben.