Tag Archives: data

Government spending: Who’s doing what with the new data?

Today sees the biggest release of government spending data in history. Government departments have published details of all spending over £25,000 for the past six months and, according to this morning’s announcement, will continue to publish this expenditure data on a monthly basis.

According to minister for the Cabinet Office and paymaster general Francis Maude, it is part of a drive “to make the UK the most transparent and accountable government in the world”.

We’ve already released a revolutionary amount of data over the last six months, from the salaries of the highest earning civil servants to organisation structure charts, which give people a real insight into the workings of government and are already being used in new and innovative ways.

A huge amount of public spending data has indeed been published under the current government, and today’s release is a significant addition to that. So who is doing what with the vast amount of new data? And who is making it easier for others to crunch the numbers?

The Guardian is usually streets ahead of other newspapers in processing large datasets and today’s coverage is no exception:

Who else?

There are, of course, different ways of looking at the numbers, as one Guardian commenter, LudwigsLughole, highlights:

There are 90,000 HMRC staff. They spent £164,000 in six months on bottled spring water. That equates to an annual spend per head of only £3.64. So the FT are seriously suggesting that £3.64 per head to give staff fresh bottled water is excessive? Pathetic journalism.
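The arithmetic does hold up: £164,000 spread across 90,000 staff works out at about £1.82 a head over six months, or roughly £3.64 a year.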

Exploring the data yourself

“The biggest issue with all these numbers is, how do you use them? If people don’t have the tools to interrogate the spreadsheets, they may as well be written in Latin.” – Simon Rogers, Guardian Data Blog editor.

“Releasing data is all well and good, but to encourage the nation’s ‘armchair auditors’, it must be readily usable.” – Martin Stabe, FT.

Here are some of the places you can go, along with the Guardian, to have a crack at the numbers yourself. Please add your own suggestions in the comments below.

Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:

Guardian/gov datastore explorer
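In the same quick-and-dirty spirit, here is a minimal sketch of what querying one of the new spending files might look like in Python. The file name and column headers are assumptions for illustration only; the CSVs published today vary between departments.

    import pandas as pd

    # Load one department's over-£25,000 spending release (file name
    # and column headers are assumed; real releases vary by department).
    spend = pd.read_csv("dept_spending_over_25k.csv")

    # A query interface in miniature: filter by supplier keyword,
    # then total the payments by date.
    water = spend[spend["Supplier"].str.contains("water", case=False, na=False)]
    print(water.groupby("Date")["Amount"].sum())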

[T]he real power of this data will become clear in the months to come, as developers and researchers – you? – start to link it to other information, like the magisterial OpenlyLocal and the exciting WhosLobbying. Please make use of our API and loading scripts to do so.

Also see the good suggestions on Where Does My Money Go? for how government data publishing might be improved in the future.

So, coming full circle I return to the Guardian, and to the data-minded Simon Rogers, who asks: Will the government spending data really change the world?

A big question. Feel free to add your opinion below and any other data projects you have seen today or that pop up in the future.

#iweu: The web data revolution – a new future for journalism?

David McCandless, excited about data

Rounding off Internet Week Europe on Friday afternoon, the Guardian put on a panel discussion in its Scott Room on journalism and data: ‘The web data revolution – a new future for journalism’.

Taking part were Simon Rogers, David McCandless, Heather Brooke, Simon Jeffery and Richard Pope, with Dr Aleks Krotoski moderating.

McCandless, a leading designer and author of data visuals book Information is Beautiful, made three concise, important points about data visualisations:

  • They are relatively easy to process;
  • They can have a high and fast cognitive impact;
  • They often circulate widely online.

Large, unwieldy datasets share none of those traits: they are extremely difficult and slow to process, and pretty unlikely to go viral. So, as McCandless’ various graphics showed – from a light-hearted graph charting when couples are most likely to break up to a powerful demonstration of the extent to which the US military budget dwarfs health and aid spending – visualisations are an excellent way to make information accessible and understandable. Not a new way, as the Guardian’s data blog editor Simon Rogers demonstrated with a graphically-assisted report by Florence Nightingale, but one that is proving more and more popular as a means to tell a story.

David McCandless: Peak break-up times, according to Facebook status updates

But, as one audience member pointed out, large datasets are vulnerable to very selective interpretation. As McCandless’ own analysis showed, there are several different ways to measure and compare the world’s armies, with dramatically different results. So, Aleks Krotoski asked the panel, how can we guard against confusion, or our own prejudices interfering, or, worse, wilful misrepresentation of the facts?

McCandless’ solution is three-pronged: firstly, he publishes drafts and works-in-progress; secondly, he keeps himself accountable by test-driving his latest visualisations on a 25-strong group he created from his strongest online critics; thirdly, and most importantly, he publishes all the raw data behind his work using Google Docs.

Access to raw data was the driving force behind Heather Brooke’s first foray into FOI requests and data, she told the Scott Room audience. Distressed at the time it took her local police force to respond to 999 calls, she began examining the stats in order to build up a better picture of response times. She said the discrepancy between the facts and the police claims emphasised the importance of access to government data.

Prior to the Afghanistan and Iraq war logs release that catapulted WikiLeaks into the headlines – and undoubtedly saw the Guardian data team come on in leaps and bounds – founder Julian Assange called for the publishing of all raw data alongside stories to be standard journalistic practice.

You can’t publish a paper on physics without the full experimental data and results; that should be the standard in journalism. You can’t do it in newspapers because there isn’t enough space, but now with the internet there is.

As Simon Rogers pointed out, the journalistic process can no longer afford to be about simply “chucking it out there” to “a grateful public”. There will inevitably be people out there able to bring greater expertise to bear on a particular dataset than you.

But opening up access to vast swathes of data is one thing; knowing how to interpret that data is another. In all likelihood, simple, accessible interfaces for organising and analysing data will become more and more commonplace. For the release of the 400,000-document Iraq war logs, OWNI.fr worked with the Bureau of Investigative Journalism to create a program to help people analyse the extraordinary amount of data available.

Simply knowing where to look and what to trust is perhaps the first problem for amateurs. Looking forward, Brooke suggested aggregating some data about data: a resource that could tell people where to look for certain information, what data is relevant and up to date, and how to interpret the numbers properly.

So does data – ‘the new oil’ – signal a “revolution” or a “new future” for journalism? I am inclined to agree with Brooke’s remark that data will become simply another tool in the journalist’s armoury, rather than reshape things entirely. As she said, nobody talks about ‘telephone-assisted reporting’: completely new once upon a time, it is now just called reporting. Soon enough, the ‘computer-assisted reporting’ course she teaches at City University will just be ‘reporting’ too.

See also:

Guardian information architect Martin Belam has a post up about the event on his blog, currybetdotnet

Digital journalist Sarah Booker liveblogged presentations by Heather Brooke, David McCandless and Simon Rogers.

ProPublica: How we got the government’s secret dialysis data

Today, US non-profit ProPublica begins publishing the findings of a long-term investigation into the provision of dialysis in the US, which will also be published by the Atlantic magazine. In an editors’ note on the site, Paul Steiger and Stephen Engelberg explain how reporter Robin Fields spent two years pressing officials from the Centers for Medicare and Medicaid Services (CMS) to release a huge dataset detailing the performance of various dialysis facilities.

Initially, she was told by the agency that the data was not in its “possession, custody and control.” After state officials denied similar requests for the data, saying it belonged to CMS, the agency agreed to reconsider. For more than a year after that, officials neither provided the data nor indicated whether they would.

ProPublica finally got its hands on the data, after the Atlantic story had gone to print, but plans “to make it available on our website as soon as possible in a form that will allow patients to compare local dialysis centers.”

Full story at this link.

RBI to host hacks/hackers day in November

Reed Business Information (RBI) is hosting an event for journalists and programmers interested in working together on data visualisation. The one-day “hack day”, which will take place on 29 November, will be run with the help of data scraping project ScraperWiki.

Speaking on the ScraperWiki blog, Karl Schneider, editorial development director at RBI, explains the thinking behind the event:

Data journalism is an important area of development for our editorial teams in RBI.

It’s a hot topic for all journalists, but it’s particularly relevant in the B2B sector. B2B journalism is focused on delivering information that its audience can act on, supporting important business decisions.

Often a well-thought-out visualisation of data can be the most effective way of delivering critical information and helping users to understand key trends.

We’re already having some successes with this kind of journalism, and we think we can do a lot more. So building up the skills of our editorial teams in this area is very important.

You can register for the event at this link.

Making data work for you: one week till media140’s dataconomy event

There’s just one week to go before media140’s event on data and how journalists and media can make better use of it. Featuring the Guardian’s news editor for data Simon Rogers and Information is Beautiful author David McCandless, the event will discuss the commercial, ethical and technological issues of making data work for you.

Rufus Pollock, director of the Open Knowledge Foundation, and Andrew Lyons, commercial director of UltraKnowledge will also be speaking. Full details are available at this link.

Journalism.co.uk is proud to be a media partner for media140 dataconomy. Readers of Journalism.co.uk can sign-up for tickets to the event at this link using the promotional code “journalist”. Tickets are currently available for £25, which includes drinks.

The event on Thursday 21 October will be held at the HUB, King’s Cross, from 6:30-9:30pm.

Nick Davies: Data, crowdsourcing and the ‘immeasurable confusion’ around Julian Assange

Investigative journalist Nick Davies chipped in with his thoughts on crowdsourcing data analysis by news organisations at this week’s Frontline Club event. (You can listen to a podcast featuring the panellists at this link.)

For Davies, who brokered the Guardian’s involvement in the WikiLeaks Afghanistan War Logs, such stories suggest that asking readers to trawl through data for stories doesn’t work:

I haven’t seen any significant analysis of that raw material (…) There were all sorts of angles that we never got to because there was so much of it. For example, there was a category of material that was recorded by the US military as being likely to create negative publicity. You would think somebody would search all those entries and put them together and compare them with what actually was put out in press releases.

I haven’t seen anyone do anything about the treatment of detainees, which is recorded in there.

We got six or seven good thematic stories out of it. I would think there are dozens of others there. There’s some kind of flaw in the theory that crowdsourcing is a realistic way of converting data into information and stories, because it doesn’t seem to be happening.

And Davies had the following to say about WikiLeaks head Julian Assange:

We warned him that he must not put this material unredacted onto the WikiLeaks website because it was highly likely to get people killed. And he never really got his head around that. But at the last moment he did a kind of word search through these 92,000 documents looking for words like ‘source’ or ‘human intelligence’ and withdrew 15,000 docs that had those kinds of words in. It’s a very inefficient way of making those documents safe and I’m worried about what’s been put up on there.

He then kind of presented the withholding of these 15,000 documents as some kind of super-secret, but it’s already been released (…) The amount of confusion around Julian is just immeasurable. In general terms you could say he’s got other kinds of material coming through WikiLeaks and there’s all sorts of possibilities about who might get involved in processing it. Personally I feel much happier pursuing the phone hacking, which is a relatively clean story that Julian’s not involved in.
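To see why Davies calls the approach inefficient, consider what such a word search amounts to in code. This is an illustrative sketch only, not WikiLeaks’ actual process, and it assumes the logs sit in a folder of plain-text files:

    import os

    # Keywords whose presence flags a document for withholding --
    # Davies mentions terms like "source" and "human intelligence".
    KEYWORDS = ("source", "human intelligence")

    withheld = []
    for name in os.listdir("warlogs"):
        with open(os.path.join("warlogs", name),
                  encoding="utf-8", errors="ignore") as f:
            text = f.read().lower()
        if any(keyword in text for keyword in KEYWORDS):
            withheld.append(name)

    print(len(withheld), "documents flagged for withholding")

The weakness is plain: “source” matches “water source” or “open source” as readily as a reference to a human informant, so a bare keyword filter over-withholds while missing documents that identify people without using the flagged words.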

#ddj: Reasons to cheer from Amsterdam’s Data-Driven Journalism conference

When the European Journalism Centre first thought of organising a round-table on data-driven journalism, they were afraid they wouldn’t find 12 people to attend, said EJC director Wilfried Rütten. In the end, about 60 enthusiastic participants showed up and the EJC had to turn down some requests.

Here’s the first reason to rejoice: data is attractive enough to get scores of journalists from all across Europe and the US to gather in Amsterdam in the midst of the summer holidays! What’s more, most of the participants came to talk about their work, not about what they should be doing. We’ve come a long way from the 2008 Future of Journalism conference, for instance, where Adrian Holovaty and Hans Rosling were the only two to make the case for data. And neither of them was a journalist.

The second reason to cheer: theory and reality are walking hand-in-hand. Deutsche Welle’s Mirko Lorenz, organiser for the EJC, shared his vision of a newsroom where journalists would work together with designers and developers. As it happens, that’s already the case in the newsrooms with dedicated data staff that were represented at the conference. The NYT’s Alan McLean explained that the key to a successful data project had been having journalists work together with developers – not only on the same projects, but reorganising the office so that they would actually sit next to one another. At that point, journalists and developers would high-five each other after a successful project, wittily exclaiming “journalism saved!”

Eric Ulken, founder of the LA Times’ Datadesk, reinforced this point of view by giving 10 tips to would-be datajournalists, number eight being simply to cohabit. Going further, he talked of integration and of finding the believers within the organisation, further highlighting that data-driven journalism is more about willpower than technical obstacles, for the technologies used are usually far from cutting-edge computer science.

OWNI, probably the youngest operation represented at the conference (it started in the second quarter of 2010), works in the same way. Designers, coders and journalists work in the same room following a totally horizontal hierarchy, with two project managers, skilled in journalism and code, coordinating the operations.

In other words, data-driven operations are more than buzzwords. They set up processes through which several professions work together to produce new journalistic products.

Journalists need not be passively integrated in data teams, however. Several presenters gave advice and demonstrated tools that will enable journalists to play around with data without the need for coding skills. The endless debate about whether or not journalists should learn programming languages was not heard during the conference; I had the feeling that everybody agreed that these were two different jobs and that no one could excel in both.

Tony Hirst showed what one could do without any programming skills. His blog, OUseful, provides tutorials on how to use mashups, from Yahoo! Pipes to Google Spreadsheets to RDF databases. His presentation was about publishing dynamic data on a Google map. He used Google Spreadsheets’ ability to scrape HTML pages for data, then processed it in Yahoo! Pipes and re-plugged it into a Google Map. Most of the audience was astonished by what they could do using tools they knew about but had not used in a mashed-up way.
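For those who would rather script the same pipeline, here is a rough Python sketch of the scrape-and-map idea. It is an analogy to Hirst’s spreadsheet workflow, not his actual method; the URL and the assumption that the page carries an HTML table are placeholders:

    import pandas as pd

    # Scrape the first HTML table found on a page (placeholder URL).
    tables = pd.read_html("http://example.gov.uk/spending/table.html")
    data = tables[0]

    # Write it out as CSV; a mapping tool such as Google Maps can
    # import this directly if the rows carry place names or coordinates.
    data.to_csv("mapped_data.csv", index=False)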

We all agreed that storytelling was at the heart of our efforts. A dataset in itself brings nothing and is often ‘bland’, in the words of Alan McLean. Some governments will even be happy to dump large amounts of data online to brag about their transparency efforts, but if the data cannot easily be remixed and searched by journalists, its value decreases dramatically. The Financial Times’ Cynthia O’Murchu even stated that she felt more like a ‘pdf cleaner’ than a journalist when confronted with government data.

The value of data-driven journalism comes not from the ability to process a large database and spit it back out at the user. Data architects have been doing that for the last 40 years to organise Social Security figures, for instance. The data and the computing power we use to process it should never be an end in itself, but must be thought of as a means to tell a story.

The one point that was overlooked was finance. The issue was addressed only three times during the whole day, showing that datajournalism still hasn’t reached a maturity where it can sustain itself. Mirko Lorenz reminded the audience that data is a fundamental part of many media outlets’ business models, from Thomson Reuters to the Economist, with its Intelligence Unit. That said, trying to copy their model would take datajournalists away from storytelling and bring them closer to database managers – an arena in which they have little edge over established actors used to processing and selling data.

OWNI presented its model of co-producing applications with other media and of selling some of them as white-label products. Although OWNI’s parent company 22mars is one of the few profitable media outlets in France and its datajournalism activities are breaking even, the business model was not the point that attracted most attention from the audience.

Finally, Andrew Lyons of UltraKnowledge talked about his model of tagging archives and presenting them as a NewsWall. Although his solution does not help storytelling per se, it is a welcome way of monetising archives, as it allows newspapers to sponsor archives or events – a path that needs to be explored as CPMs continue to fall.

His ideas were less than warmly received by the audience, showing that although the entrepreneurial spirit has reached journalism’s processes and habits, we still have a long way to go before we see ground-breaking innovation in business models.

Nicolas Kayser-Bril is a datajournalist at OWNI.fr

See tweets from the conference on the Journalism.co.uk Editors’ Blog

#ddj: Follow the Data Driven Journalism conference

Today in Amsterdam the great and good of data journalism are gathering to discuss the tools, techniques and opportunities for journalists using and visualising data in stories.

Full details are on the event site, which explains:

Developing the know-how to use the available data more effectively, to understand it, communicate and generate stories based on it, could be a huge opportunity to breathe new life into journalism. Journalists can find new roles as “sense-makers” digging deep into data, thus making reporting more socially relevant. If done well, delivering credible information and advice could even generate revenues, opening up new perspectives on business models, aside from subscriptions and advertising.

OWNI.fr‘s Nicolas Kayser-Bril will be blogging about the day for Journalism.co.uk. To keep up with what’s being said, you can follow the Twitter hashtag #ddj below.

CJR and the Texas Tribune: Is data both journalism and a business?

The Columbia Journalism Review takes an in-depth look at news start-up the Texas Tribune, which launched in November last year “billing itself not only as an antidote to the dwindling capitol press corps but also as a new force in Texas political life”. CJR considers how sustainable the venture is editorially and commercially:

The Tribune’s biggest magnet by far has been its more than three dozen interactive databases, which collectively have drawn three times as many page views as the site’s stories (…) The Tribune publishes or updates at least one database per week, and readers e-mail these database links to each other or share them on Facebook, scouring their neighborhood’s school rankings or their state rep’s spending habits. Through May, the databases had generated more than 2.3 million page views since the site’s launch

Full story on CJR…

Datablog: What data releases by the UK government could mean for journalists

The Guardian’s Simon Rogers writes a timely post on the potential of data for journalism ahead of a series of anticipated announcements from Downing Street, likely to start this week, that could give journalists access to more public data from local and national government.

Of all the datasets that will be released, possibly the most significant is something called the Combined Online Information System (Coins). This is basically a list of everything spent at every level of government in the UK. The Treasury has refused FoI [Freedom of Information] requests for it in the past (it is 24 million items long). Now its release is imminent, according to Downing Street sources.
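To give a sense of the scale: a 24-million-row dataset will not open in a spreadsheet, but it can be streamed. Here is a minimal sketch of aggregating such a file in Python, assuming a hypothetical coins.csv with Department and Amount columns (the real COINS schema is considerably messier):

    import csv
    from collections import defaultdict

    # Total spending by department without loading 24 million rows
    # into memory (file name and column headers are assumptions).
    totals = defaultdict(float)
    with open("coins.csv", newline="") as f:
        for row in csv.DictReader(f):
            totals[row["Department"]] += float(row["Amount"].replace(",", "") or 0)

    # Print the ten biggest-spending departments.
    for dept, total in sorted(totals.items(), key=lambda x: -x[1])[:10]:
        print(dept, round(total))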

Rogers looks at how this could change the way local government in particular is reported, by local media and by journalists and non-journalists working a hyperlocal beat.

Full post at this link…