DocumentCloud, the New York Times and ProPublica-backed project, has released its first open-source code since its launch.
The project, which won funding from the 2009 Knight News Challenge, was created to make documents and data useable for anyone. It will include software, a website and a set of open standards to make original source documents easy to find, share, read and collaborate on. From its site:
“Users will be able to search for documents by date, topic, person, location, etc. and will be able to do ‘document dives’, collaboratively examining large sets of documents. Organisations will be able to do all this while keeping the documents – and readers – on their own sites. Think of it as a card catalogue for primary source documents.”
DocumentCloud is not a collection of documents but software to support documents hosted elsewhere, two of the team – Eric Umansky, senior editor at ProPublica, and Aron Pilhofer, the New York Times newsroom interactive technologies editor – explained to Journalism.co.uk in June.
The new system announced this week – CloudCrowd – will work as ‘a heavy-duty system for document processing’, in particular for importing large documents for use with DocumentCloud, the project’s lead programmer Jeremy Ashkenas said.
“Our PDFs need to have their text extracted, their images scaled and converted, and their entities extracted for later cataloguing,” he explained, adding more detail about the process, which is called ‘parallel processing’ on its site.
“All of these things are computationally expensive, keeping your laptop hot and busy for minutes, especially when the documents run into the hundreds or thousands of pages.”
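The general idea behind this kind of parallel processing can be sketched in a few lines: split a large document into chunks, process the chunks concurrently, then merge the results back in page order. The Python below is only an illustration under stated assumptions and is not CloudCrowd's code or API. The function names (`split`, `process`, `merge`, `run_job`) are hypothetical, and a simple word count stands in for the expensive text, image and entity extraction Ashkenas describes.

```python
from concurrent.futures import ThreadPoolExecutor


def split(pages, chunk_size):
    """Split a list of pages into consecutive chunks of at most chunk_size."""
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


def process(chunk):
    """Hypothetical stand-in for the expensive per-chunk work.

    A real system would extract text, scale images and pull out entities;
    this sketch only counts the words on each page.
    """
    return [len(page.split()) for page in chunk]


def merge(results):
    """Flatten the per-chunk results back into one list, in page order."""
    return [count for chunk_result in results for count in chunk_result]


def run_job(pages, chunk_size=2, workers=4):
    """Split the pages, process the chunks concurrently, merge the results."""
    chunks = split(pages, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so page order is preserved.
        results = list(pool.map(process, chunks))
    return merge(results)


if __name__ == "__main__":
    sample_pages = ["first page text", "second", "a third page here"]
    print(run_job(sample_pages))
```

A thread pool keeps the sketch short. For genuinely CPU-heavy work on one machine, `ProcessPoolExecutor` from the same module would be the more likely choice, and a system built for thousands of pages would spread the chunks across several machines.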
CloudCrowd will power DocumentCloud’s document import, a process Ashkenas describes in detail on the project’s site.
Ashkenas encouraged other users with ‘batch-processing needs’, who have large numbers of documents to process, to try the system. This fits the project’s community ethos: the aim is to invite participation and feedback ‘from scaffold to deploy’.
CloudCrowd links: