Tag Archives: #datajourn

Telegraph.co.uk: Guide to the full MP expenses database

Telegraph.co.uk has now published a searchable database of all MPs’ expenses. It reports:

“The searchable database will include exclusive documentary evidence as well as detailed figures assembled over recent weeks as part of an exhaustive investigation into Parliament’s expenses claims system.”

“In the coming weeks, the database will be extended to include the uncensored documentation for the claims, including receipts and correspondence with the Parliamentary authorities, of all MPs.”

Full guide at this link…

MPs’ expenses database at this link…

More to follow from Journalism.co.uk on users’ experiences later today. How have you found using the data provided by the Commons, the Guardian and the Telegraph? Drop judith at journalism.co.uk a line, or send a message via Twitter to @jtownend.

Let the expenses data war commence: Telegraph begins its document drip feed

Andy Dickinson from the Department of Journalism at UCLan sums up today’s announcement in this tweet: ‘Telegraph to drip-publish MP expenses online’.

[Update #1: Editor of Telegraph.co.uk, Marcus Warren, responded like this: ‘Drip-publish? The whole cabinet at once… that’s a minor flood, I think’]

Yes, let the data war commence. The Guardian yesterday released its ‘major crowdsourcing tool’ as reported by Journalism.co.uk at this link. As described by one of its developers, Simon Willison, on his own blog, the Guardian is ‘crowdsourcing the analysis of the 700,000+ scanned [official] MP expenses documents’. It’s the Guardian’s ‘first live Django-powered application’. It’s also the first time the news site has hosted something on Amazon EC2, he says. Within 90 minutes of launch, 1700 users had ‘audited’ its data, reported the editor of Guardian.co.uk, Janine Gibson.

The Telegraph had been keeping mum, save for a few teasing tweets from Telegraph.co.uk editor Marcus Warren: a version of its ‘uncensored’ data was coming, but the paper would not say what, or how much.

Now we know a bit more. As well as publishing its data in a supplement with Saturday’s newspaper, the Telegraph will gradually release the information online. So far, copies of claim forms have been published using Issuu software, underneath each cabinet member’s name. See David Miliband’s 2005-06 expenses here, for example. From the Telegraph’s announcement:

  • “Complete records of expense claims made by every Cabinet minister have been published by The Telegraph for the first time.”
  • “In the coming weeks the expense claims of every MP, searchable by name and constituency, will be published on this website.”
  • “There will be weekly releases region by region and a full schedule will be published on Tuesday.”
  • “Tomorrow [Saturday], the Daily Telegraph will publish a comprehensive 68-page supplement setting out a summary of the claims of every sitting MP.”

Details of what is and is not included in the official data at this link. “Sensitive information, such as precise home addresses, phone numbers and bank account details, has been removed from the files by the Telegraph’s expenses investigation team,” the Telegraph reports.

So who is winning in the data wars? Here’s what Paul Bradshaw had to say earlier this morning:

“We may see more stories, we may see interesting mashups, and this will give The Guardian an edge over the newspaper that bought the unredacted data – The Telegraph. When – or if – they release their data online, you can only hope the two sets of data will be easy to merge.”

Update #2: Finally, Martin Belam’s post on open and closed journalism (published Thursday 18th) ended like this:

“I think the Telegraph’s bunkered attitude to their scoop, and their insistence that they alone determined what was ‘in the public interest’ from the documents, is a marked contrast to the approach taken by The Guardian. The Telegraph are physically publishing a selection of their data on Saturday, but there is, as yet, no sign of it being made available online in machine-readable format.

“Both are news organisations passionately committed to what they do, and both have a strategy that they believe will deliver their digital future. As I say, I have a massive admiration for the scoop that The Telegraph pulled off, and I’m a strong believer in media plurality. As we endlessly debate ‘the future of news™’ I think both approaches have a role to play in our media landscape. I don’t expect this to be the last time we end up debating the pros and cons of the ‘closed’ and ‘open’ approaches to data driven journalism.”

Belam’s post has provoked an interesting comment from Ian Douglas, the Telegraph’s head of digital production:

“I think you’re missing the fundamental difference in source material. No publisher would have released the completely unredacted scans for crowdsourced investigation, there was far too much on there that could never be considered as being in the public interest and could be damaging to private individuals (contact details of people who work for the MPs, for example, or suppliers). The Guardian, good as their project is, is working solely with government-approved information.”

“Perhaps you’ll change your mind when you see the cabinet expenses in full on the Telegraph website today [Friday], and other resources to come.”


Guardian.co.uk: Crowd-sourced experiment – ‘Investigate your MP’s expenses’

The Guardian has launched a new crowd-sourced experiment: ‘Investigate your MP’s expenses’. More to follow from Journalism.co.uk soon.

Extracts from the Guardian press release:

“The Guardian has today launched a major experiment in crowdsourcing following the publication of thousands of MPs’ receipts by the House of Commons.

“The Guardian has uploaded all of these documents to its own microsite, Investigate your MP’s expenses, allowing members of the public to interact with and analyse the data – an impossibility on the government’s website.

“For every document for every MP, users of the site will be able to: add narrative on individual expenses; highlight documents of interest; tell us how interesting that receipt is and provide a context for each receipt; help us by entering the relevant expenses figures and dates on each page.”

OUseful: Gripes with Guardian’s DataStore #datajourn

Here are some thoughts from Tony Hirst, one of the first adopters of – and success stories for – the Guardian’s Open Platform, on what its DataStore is and is not doing in terms of data curation (or gardening). He asks:

“Is the Guardian DataStore adding value to the data in the data store in an accessibility sense: by reducing the need for data mungers to have to process the data, so that it can be used in a plug’n’play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?”

Hirst raises a number of queries regarding data quality and ‘misleading’ linking on the Guardian DataBlog. In a later comment, he wonders whether a ‘data style guide’ is available yet.

If you’re not all that au fait with the data lingo, this post might be a bit indigestible, so we’ll follow up with a translation in the coming days.

Related on Journalism.co.uk: Q&A with Hirst, April 8, 2009.

WindowOnTheMedia: Database journalism defined

An interesting day to flag this one up (given that the Guardian is actively calling for people to play with the Swine Flu data today): Nicolas Kayser-Brill has written an entry on Wikipedia for ‘database journalism’.

Full story at this link…

Also see: #DataJourn Part 1: a new conversation (Journalism.co.uk Editors’ Blog)

#DataJourn part 3: Useful and recent links looking at use of data in journalism

Perhaps we’ll expand this to a Dipity timeline at some point (other ideas?), but in the meantime, here’s a list of a few recent and relevant links relating to CAR and the use of data in journalism, to get the conversation on Twitter – via #datajourn – going. NB: These are not necessarily in chronological order. The next logical step would be to start looking at examples of where data has been used for specific journalism projects.

#DataJourn part 2: Q&A with ‘data juggler’ Tony Hirst

As explained in part one of today’s #datajourn conversation, Tony Hirst is the ‘data juggler’ (as titled by Guardian tech editor Charles Arthur) behind some of the most interesting uses of the Guardian’s Open Platform (unless swear words are your thing – in which case check out Tom Hume’s work).

Journalism.co.uk sent OU academic, mashup artist and Isle of Wight resident Tony Hirst some questions. Here are his very comprehensive answers.

What’s your primary interest in – and motivation for – playing with the Guardian’s Open Platform?
TH: Open Platform is a combination of two things – the Guardian API, and the Guardian Data store. My interest in the API is twofold: first, at the technical level, does it play nicely with ‘mashup tools’ such as yahoo pipes, Google spreadsheet’s =importXML formula, and so on; secondly, what sort of content does it expose that might support a ‘news and learning’ mashup site where we can automatically pull in related open educational resources around a news story to help people learn more about the issues involved with that story?

One of the things I’ve been idling about lately is what a ‘university API’ might look like, so the architecture of the Guardian API – in particular, the way the URIs that call on the API are structured – is of interest in that regard (along with other APIs, such as the New York Times’ APIs, the BBC Programmes API, and so on).

The Data Blog resources – which are currently being posted on Google spreadsheets – are a handy source of data in a convenient form that I can use to try out various ‘mashup recipes’. I’m not so interested in the data as is, more in the ways in which it can be combined with other data sets (for example, in Dabble DB) and/or displayed using third-party visualisation tools. What inspires me is trying to find ‘mashup patterns’ that other people can use with other data sets. I’ve written several blog posts showing how to pull data from Google spreadsheets into IBM’s Many Eyes Wikified visualisation tool: it’d be great if other people realised they could use a similar approach to visualise sets of data I haven’t looked at.
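By way of illustration – a minimal sketch of ours, not Hirst’s own code – a published Google spreadsheet can be pulled into a script as CSV and passed on to other tools. The spreadsheet key below is hypothetical:

    import csv
    import urllib.request

    # Hypothetical spreadsheet key; Google Spreadsheets exposes a
    # CSV export URL of roughly this form for published sheets.
    SHEET_CSV_URL = ("https://docs.google.com/spreadsheets/d/"
                     "SPREADSHEET_KEY/export?format=csv")

    with urllib.request.urlopen(SHEET_CSV_URL) as response:
        text = response.read().decode("utf-8")

    # Each row becomes a plain dict, ready to combine with other
    # data sets or hand on to a visualisation tool.
    for row in csv.DictReader(text.splitlines()):
        print(row)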

Playing with the actual data also turns up practical ‘issues’ about how easy it is to create mashups with public data. For example, one silly niggle I had with the MPs’ expenses data was that pound signs appeared in many of the data cells, which meant that Many Eyes Wikified, for example, couldn’t read the amounts as numbers, and so couldn’t chart them. (In fact, I don’t think it likes pound signs at all because of the character encoding!) Which meant I had to clean the data, which introduced another step in the chain where errors could be introduced, and which also raised the barrier to entry for people wanting to use the data directly from the data store spreadsheet. If I can help find some of the obstacles to effective data reuse, then maybe I can help people publish their data in a way that makes it easier for other people to reuse (including myself!).
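As a concrete illustration of the clean-up step Hirst describes – again a minimal sketch of ours, with a hypothetical file name and column names – stripping pound signs so that a value such as ‘£1,500.00’ parses as a number might look like this:

    import csv

    def parse_amount(value):
        # Strip the pound sign (U+00A3) and thousands separators
        # so that '£1,500.00' becomes the number 1500.0.
        return float(value.replace("\u00a3", "").replace(",", "").strip())

    # 'expenses.csv' and the column names are hypothetical.
    with open("expenses.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            amount = parse_amount(row["Amount"])
            print(row["Name"], amount)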

Do you feel content with the way journalists present data in news stories, or could we learn from developers and designers?
TH: There’s a problem here in that journalists have to present stories that are: a) subject to space and layout considerations beyond their control; and b) suited to their audience. Just publishing tabulated data is good in the sense that it provides the reader with evidence for claims made in a story (as well as potentially allowing other people to interrogate the data and maybe look for other interpretations of it), but I suspect it is meaningless, or at least of no real interest, to most people. For large data sets, you wouldn’t want to publish them within a story anyway.

An important thing to remember about data is that it can be used to tell stories, and that it may hide a great many patterns. Some of these patterns are self-evident if the data is visualised appropriately. ‘Geo-data’ is a fine example of this. Its natural home is on a map – as long as the geocoding works properly, that is (i.e. the mapping from location names, for example, to latitude/longitude co-ordinates that can be plotted on a map).
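For readers unfamiliar with that step, geocoding simply turns a place name into co-ordinates. Here is a minimal sketch of ours using the third-party geopy library (the place names are examples, and a production script would need to respect the geocoding service’s usage limits):

    import time
    from geopy.geocoders import Nominatim  # pip install geopy

    geolocator = Nominatim(user_agent="datajourn-example")

    for place in ["Isle of Wight", "Westminster, London"]:
        location = geolocator.geocode(place)
        if location is not None:
            print(place, location.latitude, location.longitude)
        else:
            print(place, "could not be geocoded")
        time.sleep(1)  # be polite to the free geocoding service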

Finding ways of visualising and interacting with data is getting easier all the time. I try to find mashup patterns that don’t require much, if any, writing of computer program code, and so in theory should be accessible to many non-developers. But it’s a confidence thing: at the moment, I suspect that it is the developers who are more likely to feel confident taking data from one source, putting it into an application, and then providing the user with a simple user interface that they can ‘just use’.

You mentioned ‘lowering barriers to entry’ – what do you mean by that, and how is it useful?

TH: Do you write SQL code to query databases? Do you write PHP code to parse RSS feeds and filter out items of interest? Are you happy writing JavaScript to parse a JSON feed, or would you rather use XMLHttpRequest and a server-side proxy to pull an XML feed into a web page and get around the domain security model?

Probably none of the above.

On the other hand, could you copy and paste the URL of a data set into a ‘fetch’ block in a Yahoo pipe, identify which data element relates to a place name so that you could geocode the data, and then take the URL of the data coming out of the pipe and paste it into the Google Maps search box to get a map-based view of your data? Possibly…

Or how about taking a spreadsheet URL, pasting it into Many Eyes Wikified, choosing the chart type you wanted based on icons depicting those chart types, and then selecting the data elements you wanted to plot on each axis from a drop down menu? Probably…
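For comparison, here is the kind of ‘developer route’ Hirst contrasts with those point-and-click tools – a minimal sketch of ours in Python (standing in for the PHP he mentions) that fetches an RSS feed and filters out items of interest; the feed URL and keyword are hypothetical:

    import feedparser  # third-party library: pip install feedparser

    FEED_URL = "https://www.example.com/news.rss"  # hypothetical feed

    # Fetch the feed and keep only the items whose titles
    # mention the (hypothetical) topic we care about.
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        if "expenses" in entry.get("title", "").lower():
            print(entry.get("title"), entry.get("link"))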

What kind of recognition/reward would you like for helping a journalist produce a news story?
TH: A mention for my employer, The Open University, and a link to my personal blog, OUseful.info. If I’d written a ‘How To’ explanation describing how a mashup or visualisation was put together, a link to that would be nice too. And if I ever met the journalist concerned, a coffee would be appreciated! I also find it valuable knowing what sorts of things journalists would like to be able to do with the technology but can’t work out how to do. This can feed into our course development process, identifying the skills requirements that are out there and then potentially servicing those needs through our course provision. There’s also the potential for us to offer consultancy services to journalists, producing tools and visualisations as part of a commercial agreement.

One of the things my department is looking at at the moment is a revamped website. It’s a possibility that I’ll start posting stories there about any news-related mashups I put together and, if that is the case, then links to that content would be appropriate. This isn’t unlike the relationship we have with the BBC, where we co-produce television and radio programmes and get links back to supporting content on OU websites from the BBC website, as well as programme credits. For example, I help pull together the website around the BBC World Service programme Digital Planet, which we co-produce every so often; the site gets a link from the World Service website (as well as from the programme’s Facebook group!), and the OU gets a mention in the closing credits. The rationale behind this approach is getting traffic to OU sites, of course, where we can then start to try to persuade people to sign up for related courses!

#DataJourn part 1: a new conversation (please re-tweet)

Had it not been published at the end of the workday on a Friday, Journalism.co.uk would have made a bit more of a song and dance of this story; instead it got reduced to a quick blog post. In short: OU academic Tony Hirst produced a rather lovely map, on the suggestion (taunt?) of the Guardian’s technology editor, Charles Arthur, and the result? A brand new politics story for the Guardian on MPs’ expenses.

Computer-assisted reporting (CAR) is nothing new, but innovations such as the Guardian’s launch of Open Platform are leading to new relationships and conversations between data/stats experts, programmers and developers (including the rarer breed of information architects), designers and journalists – bringing with them new opportunities, but also new questions. Some that immediately spring to mind:

  • How do both parties (data and interactive gurus and the journalists) benefit?
  • Who should get credit for new news stories produced, and how should developers be rewarded?
  • Will newsrooms invest in training journalists to understand and present data better?
  • What problems are presented by non-journalists playing with data, if any?
  • What other questions should we be asking?

The hashtag #datajourn seems a good one with which to kickstart this discussion on Twitter (using #CAR, for example, could lead to confusion…).

So, to get us started, two offerings coming your way in #datajourn parts 2 and 3.

Please add your thoughts below the posts, and get in touch with judith@journalism.co.uk (@jtownend on Twitter) with your own ideas and suggestions for ways Journalism.co.uk can report on, participate in and debate the use of CAR and data tools for good-quality, ethical journalism.