Tag Archives: #datajournalism

Government spending: Who’s doing what with the new data?

Today sees the biggest release of government spending data in history. Government departments have published details of all spending over £25,000 for the past six months and, according to this morning’s announcement, will continue to publish this expenditure data on a monthly basis.

According to minister for the Cabinet Office and paymaster general Francis Maude, it is part of a drive “to make the UK the most transparent and accountable government in the world”.

We’ve already released a revolutionary amount of data over the last six months, from the salaries of the highest earning civil servants to organisation structure charts which give people a real insight into the workings of government and is already being used in new and innovative ways.

A huge amount of public spending data has indeed been published under the current government, and today’s release is a significant addition to that. So who is doing what with the vast amount of new data? And who is making it easier for others to crunch the numbers?

The Guardian is usually streets ahead of other newspapers in processing large datasets and today’s coverage is no exception:

Who else?

There are, of course, different ways of looking at the numbers, as one Guardian commenter, LudwigsLughole, highlights:

There are 90,000 HMRC staff. They spent £164,000 in six months on bottled spring water. That equates to an annual spend per head of only £3.64. So the FT are seriously suggesting that £3.64 per head to give staff fresh bottled water is excessive? Pathetic journalism.

Exploring the data yourself

“The biggest issue with all these numbers is, how do you use them? If people don’t have the tools to interrogate the spreadsheets, they may as well be written in Latin.” – Simon Rogers, Guardian Data Blog editor.

“Releasing data is all well and good, but to encourage the nation’s ‘armchair auditors’, it must be readily usable.” – Martin Stabe, FT.

Here are some of the places you can go, along with the Guardian, to have a crack at the numbers yourself. Please add your own suggestions in the comments below.

Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:

Guardian/gov datastore explorer
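The "quick and dirty query interface" idea above boils down to loading a spending spreadsheet and filtering or summing it on demand. Here is a rough, minimal sketch of that kind of query in Python; the column names, suppliers and figures are invented stand-ins, not the real data.gov.uk schema:

```python
import csv
import io

# Tiny inline sample standing in for one department's over-£25,000 spending
# release. Real files have more columns; these headers are assumptions.
SAMPLE = """Department,Supplier,Amount
HMRC,AquaCo,164000
HMRC,OfficeSupplies Ltd,48000
Cabinet Office,ConsultCo,250000
"""

def spend_by_supplier(csv_text, department):
    """Total spending per supplier for one department."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Department"] == department:
            totals[row["Supplier"]] = totals.get(row["Supplier"], 0) + float(row["Amount"])
    return totals

print(spend_by_supplier(SAMPLE, "HMRC"))
# {'AquaCo': 164000.0, 'OfficeSupplies Ltd': 48000.0}
```

Pointing the same loop at a downloaded departmental CSV instead of the inline sample gives an armchair auditor a one-function query tool.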

[T]he real power of this data will become clear in the months to come, as developers and researchers – you? – start to link it to other information, like the magisterial OpenlyLocal and the exciting WhosLobbying. Please make use of our API and loading scripts to do so.

Also see the good suggestions on Where Does My Money Go? for how government data publishing might be improved in the future.

So, coming full circle I return to the Guardian, and to the data-minded Simon Rogers, who asks: Will the government spending data really change the world?

A big question. Feel free to add your opinion below and any other data projects you have seen today or that pop up in the future.

#iweu: The web data revolution – a new future for journalism?

David McCandless, excited about data

Rounding off Internet Week Europe on Friday afternoon, the Guardian put on a panel discussion in its Scott Room on journalism and data: ‘The web data revolution – a new future for journalism’.

Taking part were Simon Rogers, David McCandless, Heather Brooke, Simon Jeffery and Richard Pope, with Dr Aleks Krotoski moderating.

McCandless, a leading designer and author of data visuals book Information is Beautiful, made three concise, important points about data visualisations:

  • They are relatively easy to process;
  • They can have a high and fast cognitive impact;
  • They often circulate widely online.

Large, unwieldy datasets share none of those traits: they are extremely difficult and slow to process, and pretty unlikely to go viral. So, as McCandless’ various graphics showed – from a light-hearted graph charting when couples are most likely to break up to a powerful demonstration of the extent to which the US military budget dwarfs health and aid spending – visualisations are an excellent way to make information accessible and understandable. Not a new way, as the Guardian’s data blog editor Simon Rogers demonstrated with a graphically-assisted report by Florence Nightingale, but one that is proving more and more popular as a means of telling a story.

David McCandless: Peak break-up times, according to Facebook status updates

But, as one audience member pointed out, large datasets are vulnerable to very selective interpretation. As McCandless’ own analysis showed, there are several different ways to measure and compare the world’s armies, with dramatically different results. So, Aleks Krotoski asked the panel, how can we guard against confusion, or our own prejudices interfering, or, worse, wilful misrepresentation of the facts?

McCandless’ solution is three-pronged: firstly, he publishes drafts and works-in-progress; secondly, he keeps himself accountable by test-driving his latest visualisations on a 25-strong group he created from his strongest online critics; thirdly, and most importantly, he publishes all the raw data behind his work using Google Docs.

Access to raw data was the driving force behind Heather Brooke’s first foray into FOI requests and data, she told the Scott Room audience. Distressed at the time it took her local police force to respond to 999 calls, she began examining the stats in order to build up a better picture of response times. She said the discrepancy between the facts and the police claims emphasised the importance of access to government data.

Prior to the Afghanistan and Iraq war logs release that catapulted WikiLeaks into the headlines – and undoubtedly saw the Guardian data team come on in leaps and bounds – founder Julian Assange called for the publishing of all raw data alongside stories to be standard journalistic practice.

You can’t publish a paper on physics without the full experimental data and results, that should be the standard in journalism. You can’t do it in newspapers because there isn’t enough space, but now with the internet there is.

As Simon Rogers pointed out, the journalistic process can no longer afford to be about simply “chucking it out there” to “a grateful public”. There will inevitably be people out there able to bring greater expertise to bear on a particular dataset than you.

But opening up access to vast swathes of data is one thing; knowing how to interpret that data is another. In all likelihood, simple, accessible interfaces for organising and analysing data will become more and more commonplace. For the release of the 400,000-document Iraq war logs, OWNI.fr worked with the Bureau of Investigative Journalism to create a program to help people analyse the extraordinary amount of data available.

Simply knowing where to look and what to trust is perhaps the first problem for amateurs. Looking forward, Brooke suggested aggregating some data about data: a resource, for example, that could tell people where to look for certain information, which datasets are relevant and up to date, and how to interpret the numbers properly.
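Brooke’s “data about data” suggestion can be pictured as a small searchable catalogue of dataset records. The sketch below is purely illustrative: every field name and entry is an invented assumption, not a real resource:

```python
# A minimal catalogue: one record per dataset, saying where it lives,
# how fresh it is, and what caveats apply. All entries are made up.
CATALOGUE = [
    {
        "topic": "government spending",
        "where": "data.gov.uk departmental releases",
        "updated": "monthly",
        "caveats": "only transactions over £25,000",
    },
    {
        "topic": "police response times",
        "where": "local force FOI disclosure logs",
        "updated": "ad hoc",
        "caveats": "definitions vary between forces",
    },
]

def lookup(topic):
    """Return catalogue entries whose topic mentions the search term."""
    return [r for r in CATALOGUE if topic.lower() in r["topic"]]

for r in lookup("spending"):
    print(r["where"], "-", r["caveats"])
```

Even a flat list like this answers the amateur’s first three questions: where to look, how current the numbers are, and what to watch out for.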

So does data – ‘the new oil’ – signal a “revolution” or a “new future” for journalism? I am inclined to agree with Brooke’s remark that data will become simply another tool in the journalist’s armoury, rather than reshape things entirely. As she said, nobody talks about ‘telephone-assisted reporting’: it was completely new once upon a time, but now it’s just called reporting. Soon enough, the ‘computer-assisted reporting’ course she teaches at City University will simply be ‘reporting’ too.

See also:

Guardian information architect Martin Belam has a post up about the event on his blog, currybetdotnet

Digital journalist Sarah Booker liveblogged presentations by Heather Brooke, David McCandless and Simon Rogers.

#ddj: Reasons to cheer from Amsterdam’s Data-Driven Journalism conference

When the European Journalism Centre first thought of organising a round-table on data-driven journalism, they were afraid they wouldn’t find 12 people to attend, said EJC director Wilfried Rütten. In the end, about 60 enthusiastic participants showed up and the EJC had to turn down some requests.

Here’s the first reason to rejoice: data is attractive enough to get scores of journalists from across Europe and the US to gather in Amsterdam in the midst of the summer holidays! What’s more, most of the participants came to talk about their work, not about what they should be doing. We’ve come a long way since the 2008 Future of Journalism conference, for instance, where Adrian Holovaty and Hans Rosling were the only two speakers to make the case for data. And neither of them was a journalist.

The second reason to cheer: theory and reality are walking hand in hand. Deutsche Welle’s Mirko Lorenz, organiser for the EJC, shared his vision of a newsroom where journalists work together with designers and developers. As it happens, that is already how the newsrooms with dedicated data staff represented at the conference operate. The NYT’s Alan McLean explained that the key to a successful data project had been having journalists work together with developers: not only on the same projects, but reorganising the office so that they actually sit next to one another. At that point, journalists and developers would high-five each other after a successful project, wittily exclaiming “journalism saved!”

Eric Ulken, founder of the LA Times’ Datadesk, reinforced this point of view by giving 10 tips to would-be datajournalists, number eight being simply to cohabit. Going further, he talked of integration and of finding the believers within the organization, further highlighting that data-driven journalism is about willpower more than technical obstacles, for the technologies used are usually far from cutting-edge computer science.

OWNI, probably the youngest operation represented at the conference (it started in the second quarter of 2010), works in the same way. Designers, coders and journalists share the same room in a totally horizontal hierarchy, with two project managers, skilled in both journalism and code, coordinating operations.

In other words, data-driven operations are more than buzzwords. They set up processes through which several professions work together to produce new journalistic products.

Journalists need not be passively integrated in data teams, however. Several presenters gave advice and demonstrated tools that will enable journalists to play around with data without the need for coding skills. The endless debate about whether or not journalists should learn programming languages was not heard during the conference; I had the feeling that everybody agreed that these were two different jobs and that no one could excel in both.

Tony Hirst showed what one could do without any programming skills. His blog, OUseful, provides tutorials on how to use mashups, from Yahoo! Pipes to Google Spreadsheets to RDF databases. His presentation was about publishing dynamic data on a Google map. He used Google Spreadsheets’ ability to scrape HTML pages for data, then processed it in Yahoo! Pipes and re-plugged it onto a Google Map. Most of the audience was astonished by what they could do using tools they knew about but had not used in a mashed-up way.
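Hirst’s pipeline – scrape an HTML table, reshape the rows, hand them to a map – can be mimicked in a few lines of code. This is only a sketch: the inline HTML and place names are invented, the regex stands in for Google Spreadsheets’ import step, and the reshaping stands in for Yahoo! Pipes (a real page would call for a proper HTML parser):

```python
import re

# Inline HTML standing in for a scraped web page table of places
# with latitude/longitude columns. The rows are made-up examples.
HTML = """
<table>
<tr><td>Town Hall</td><td>51.5074</td><td>-0.1278</td></tr>
<tr><td>Library</td><td>52.4862</td><td>-1.8904</td></tr>
</table>
"""

def scrape_points(html):
    """Pull (name, lat, lon) rows out of a simple three-column HTML table."""
    rows = re.findall(r"<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>", html)
    return [{"name": n, "lat": float(la), "lon": float(lo)} for n, la, lo in rows]

# The final "re-plug onto a Google Map" step just means serialising these
# dicts to whatever the mapping layer expects (KML, GeoJSON, ...).
for point in scrape_points(HTML):
    print(point["name"], point["lat"], point["lon"])
```

The point of the talk stands either way: each stage is simple on its own, and the power comes from chaining them.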

We all agreed that storytelling was at the heart of our efforts. A dataset in itself brings nothing and is often ‘bland’, in the words of Alan McLean. Some governments will even happily dump large amounts of data online to brag about their transparency efforts, but if the data cannot be easily remixed, letting journalists search through it, its value decreases dramatically. The Financial Times’ Cynthia O’Murchu even stated that she felt more like a ‘pdf cleaner’ than a journalist when confronted with government data.

The value of data-driven journalism does not come from the ability to process a large database and spit it back at the user. Data architects have been doing that for the last 40 years, to organise Social Security figures, for instance. The data, and the computing power we use to process it, should never be an end in itself, but must be thought of as a means to tell a story.

The one point that was overlooked was finance. The issue was addressed only three times during the whole day, showing that datajournalism still hasn’t reached a maturity where it can sustain itself. Mirko Lorenz reminded the audience that data is a fundamental part of many media outlets’ business models, from Thomson Reuters to The Economist, with its Intelligence Unit. That said, trying to copy their model would take datajournalists away from storytelling and bring them closer to database managers – an arena in which they have little edge over established actors used to processing and selling data.

OWNI presented its model of co-producing applications with other media and selling some of them as white-label products. Although OWNI’s parent company 22mars is one of the only profitable media outlets in France, and its datajournalism activities are breaking even, the business model was not the point that attracted most attention from the audience.

Finally, Andrew Lyons of Ultra Knowledge talked about his model of tagging archives and presenting them as a NewsWall. Although his solution does not help storytelling per se, it is a welcome way of monetising archives, as it allows newspapers to sponsor archives or events – a path that needs to be explored as CPMs continue to fall.

His ideas were less than warmly received by the audience, showing that although the entrepreneurial spirit has reached journalism when it comes to shaking up processes and habits, we still have a long way to go before we see ground-breaking innovation in business models.

Nicolas Kayser-Bril is a datajournalist at OWNI.fr

See tweets from the conference on the Journalism.co.uk Editors’ Blog

OUseful: Gripes with Guardian’s DataStore #datajourn

Here are thoughts from Tony Hirst, one of the first adopters and success stories for the Guardian’s Open Platform, on what the OP’s DataStore is and is not doing, in terms of data curation (or gardening). He asks:

“Is the Guardian DataStore adding value to the data in the data store in an accessibility sense: by reducing the need for data mungers to have to process the data, so that it can be used in a plug’n’play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?”

Hirst has a number of queries in regards to data quality and ‘misleading’ linking on the Guardian DataBlog. In a later comment, he wonders whether there is a ‘data style guide’ available yet.

If you’re not all that au fait with the data lingo, this post might be a bit indigestible, so we’ll follow with a translation in coming days.

Related on Journalism.co.uk: Q&A with Hirst, April 8, 2009.

Martin Belam: “Introducing Information Architecture at the Guardian”

As Journalism.co.uk reported last month the first London Information Architecture mini-conference raised immediate online interest, and ‘sold’ out fast. Here Martin Belam shares his notes from the event on his blog.

Full post at this link…