Tag Archives: documentcloud

Tool of the week for journalists – DocumentCloud, to analyse documents as data

Tool of the week: DocumentCloud

What is it? A platform to allow you to search and analyse documents as data.

DocumentCloud works by encouraging users to upload documents, which it then pushes through the Thomson Reuters-powered OpenCalais, a “toolkit of capabilities” that news sites can use for semantic analysis. Document sharing is good practice that many news desks have adopted, and something all journalists should consider so that data can be shared and searched.

How is it of use to journalists? Journalists can search for keywords and analyse documents as data.

For example, try searching for “phone hacking” and you are presented with a series of parliamentary reports, the text of speeches and letters contributed by the Guardian, New York Times, the Lens and the Telegraph.

You can then dig deeper, view the documents on a timeline and find related documents.

New York Times/ProPublica’s DocumentCloud makes newspaper debut

DocumentCloud, a technology aimed at making data more accessible and helping journalists and news organisations deal with large volumes of documents, has made its debut on the Chicago Tribune’s website.

The Tribune has used DocumentCloud to publish the source documents of a news story and allow readers to browse by section and receive additional information around highlighted annotations.

The Tribune is one of more than 70 news partners beta testing the technology, which is funded by a 2009 grant from the Knight News Challenge. The New York Times and ProPublica have allowed several staff to “moonlight” on the project, but it is an independent organisation led by Eric Umansky, senior editor at ProPublica, Scott Klein, news applications editor at ProPublica, and Aron Pilhofer, the New York Times newsroom’s interactive technologies editor.

Update: Despite this tweet from Aron Pilhofer, I am reliably informed by Amanda Hickman, programme director of DocumentCloud, in the comment below that the technology has also appeared alongside Newshour and ProPublica stories.

DocumentCloud aims to release a public beta in March 2010

Last year we reported several times on Knight News Challenge 2009 winner DocumentCloud. It’s a non-profit project, initiated by a small team from ProPublica and the New York Times, to build an open-source platform to make data more easily accessible.

It will point users to documents hosted elsewhere, similar to a card cataloguing system or search engine. Only in rare circumstances will DocumentCloud serve the documents itself. Partnered by Thomson Reuters, its 20+ beta testers include Talking Points Memo, the US National Security Archive, the Gotham Gazette and the UK’s Centre for Investigative Journalism.

Yesterday, the Gotham Gazette’s former technology director and now DocumentCloud program director, Amanda Hickman, said that a beta should be released in March 2010.

“The very most frequent question we get is ‘When can I try it?’ The answer to that one is: we’re committed to releasing a public beta in March,” she said, in an email update to the project’s followers.

Hickman said that the project still welcomes new contributors: “If you’re part of a news organisation that is planning to use DocumentCloud, take a look at the list of document contributors on our site (http://www.documentcloud.org/document-contributors/) and make sure your organisation is listed there. If you aren’t listed, let me know so that we can fix that!”

DocumentCloud still looking for more collaborators; will build on Amazon Web Services

Last week we reported on DocumentCloud’s new partner, Thomson Reuters, and its long list of ‘beta-testers’, including one from the UK – the Centre for Investigative Journalism (CIJ) based at City University, London.

To recap, DocumentCloud is an open-source platform to make data more easily accessible, pointing users to documents hosted elsewhere, similar to a card cataloguing system or search engine. Only in rare circumstances will DocumentCloud serve the documents itself.

We asked one of its founders, Scott Klein, about the next steps for the project, a winner of the Knight News Challenge 2009.

So why use Thomson Reuters’ OpenCalais?

[SK] “OpenCalais will, as documents are entering our system, find ‘entities’ (people, places, organisations) in them and hand them back to our servers as machine-readable swath of information, which we’ll store and index, and make available for people to query. The process will happen in real-time, and will be a big part of how we relate documents to each other.”
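To make the idea concrete, here is a minimal sketch of how extracted entities might be stored, indexed and used to relate documents to each other. It does not call OpenCalais or reflect DocumentCloud’s actual code: the document names, the entities and both function names are invented for illustration, and the extraction step is replaced by a hard-coded dictionary.

```python
from collections import defaultdict

# Hypothetical output of an entity-extraction step: document id -> entities found.
# In a real system this would be handed back by a service such as OpenCalais;
# every value here is invented purely for illustration.
EXTRACTED = {
    "report-a": {"Jane Doe", "London", "Example Corp"},
    "letter-b": {"Jane Doe", "Chicago"},
    "speech-c": {"Example Corp", "London"},
    "memo-d": {"Springfield"},
}


def build_entity_index(extracted):
    """Map each entity to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, entities in extracted.items():
        for entity in entities:
            index[entity].add(doc_id)
    return dict(index)


def related_documents(extracted, doc_id):
    """Rank other documents by how many entities they share with doc_id."""
    own = extracted[doc_id]
    scores = {
        other: len(own & entities)
        for other, entities in extracted.items()
        if other != doc_id and own & entities
    }
    # Most shared entities first; ties broken alphabetically by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_entity_index(EXTRACTED)
print(sorted(index["Jane Doe"]))                  # documents mentioning one person
print(related_documents(EXTRACTED, "report-a"))   # documents related to report-a
```

Counting shared entities is only one plausible way to relate documents; a production system would likely weight entities by type or rarity.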

Will you look to partner other large organisations like Thomson Reuters?

“Yes, definitely. We intend to rely heavily on Amazon’s Web Services infrastructure – namely, their Elastic Computing Cloud and Elastic Block Store services, and Amazon has been very enthusiastic about working with us.

“As for other partners, we have a wish list of companies and technologies we think would work well with DocumentCloud. But we’re also happy to talk to anybody who is interested in contributing technology. We don’t imagine that we have all the answers or that we have to invent everything that goes into this.”

What’s next in the development / collaboration pipeline?

“[As reported by Journalism.co.uk] A few weeks ago, we released under an open-source license a major component of our document processing system, an easy-to-use parallel-processing framework for Ruby on Rails called CloudCrowd. Next we’ll start tackling other big components, such as the hosting infrastructure and user interface.”

Will you be hiring any more staff – we see you’ve appointed your lead programmer?

“Yes, we’re on the hunt for some contract staff to work on building out our infrastructure, and on our visual design/user experience.”

Knight News Challenge winner DocumentCloud releases ‘CloudCrowd’ system

DocumentCloud, the New York Times and ProPublica-backed project, has released its first open-source code since its launch.

The project, which won funding from the 2009 Knight News Challenge, was created to make documents and data useable for anyone. It will include software, a website and a set of open standards to make original source documents easy to find, share, read and collaborate on. From its site:

“Users will be able to search for documents by date, topic, person, location, etc. and will be able to do ‘document dives’, collaboratively examining large sets of documents. Organisations will be able to do all this while keeping the documents – and readers – on their own sites. Think of it as a card catalogue for primary source documents.”

DocumentCloud is not a collection of documents but rather software to support documents hosted elsewhere, two of the team – Eric Umansky, senior editor at ProPublica, and Aron Pilhofer, the New York Times newsroom interactive technologies editor – explained to Journalism.co.uk in June.

The new system announced this week – CloudCrowd – will work as ‘a heavy-duty system for document processing’, in particular for importing large documents for use with DocumentCloud, the project’s lead programmer Jeremy Ashkenas said.

“Our PDFs need to have their text extracted, their images scaled and converted, and their entities extracted for later cataloguing,” he explained, adding more detail about the process, which is called ‘parallel processing’ on its site.

“All of these things are computationally expensive, keeping your laptop hot and busy for minutes, especially when the documents run into the hundreds or thousands of pages.”

CloudCrowd will power DocumentCloud’s document import, a process Ashkenas describes in detail on the project’s site.
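The general pattern behind this kind of import (split a large document into independent pieces, process the pieces in parallel, then merge the results) can be sketched in a few lines. This is not CloudCrowd’s API, which is written in Ruby: the function names are hypothetical, a word count stands in for the expensive per-page work, and a thread pool on one machine stands in for a pool of workers. For genuinely CPU-bound work in Python, a process pool or separate machines would be the more appropriate choice.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_text):
    """Stand-in for an expensive per-page step such as text extraction."""
    return len(page_text.split())


def process_document(pages, workers=4):
    """Split into one task per page, process in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input pages.
        word_counts = list(pool.map(process_page, pages))
    return {
        "pages": len(pages),
        "words": sum(word_counts),
        "per_page": word_counts,
    }


# Invented sample input: each string represents the text of one page.
pages = ["first page of text", "second page", "third and final page here"]
summary = process_document(pages)
print(summary)
```

Because each page is handled independently, adding workers shortens the wall-clock time for long documents without changing the merged result.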

Ashkenas encouraged other users with ‘batch-processing needs’, such as processing a large number of documents, to try the system. It fits into the project’s community ethos; the aim is to invite participation and feedback ‘from scaffold to deploy’.
