Here at Journalism.co.uk we understand that data is one of the buzzwords in journalism at the moment, which is why we have built our news:rewired conference around the topic. Its popularity was certainly clear from the packed room at Media140 today, where journalist and online communications specialist Carlos Alonso spoke on the subject.
Alonso first explained that the use of data is not new in itself. He illustrated this with the use of data in the 1800s to pinpoint deaths from cholera geographically, which led to the finding that many occurred close to a specific well, and with the mapping of revolutions in Scotland or England in 1786 to show where conflict was taking place.
The golden age of data mining was in the 1700s and 1800s, he said. It died out in the 20th century but is coming back again. It is now really strong, but it is nothing new.
The talk focused on the first parts of the journalistic process: sourcing and processing data to find stories. First you need to start with a question, he said. Think about what you’re interested in finding out, and from this you’ll know what data you need.
Once you have the data you must first clean it and figure out which parts are important, because what we’re looking for is what lies behind it. So then you need to treat the data, process the data … Now, with the computer, you can make the data interactive, so readers can go into greater depth and read behind the story if they want to. The end product can be very different from what you start with.
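As a rough illustration of the cleaning step described above, here is a minimal Python sketch. It is not taken from the talk: the sample data, the column names (`area`, `count`) and the helper `clean_rows` are all hypothetical, and real data sets will usually need more careful handling.

```python
import csv
import io

# Hypothetical raw export (not from the talk): inconsistent casing,
# stray spaces, a missing value and a duplicated row.
RAW = """area,count
 North ,12
north,12
East,
South, 7
"""

def clean_rows(text):
    """Return de-duplicated (area, count) pairs with tidy names and integer counts."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        area = row["area"].strip().title()   # normalise spacing and casing
        count = row["count"].strip()
        if not count:                        # skip rows with no usable figure
            continue
        record = (area, int(count))
        if record not in seen:               # drop duplicates after normalising
            seen.add(record)
            cleaned.append(record)
    return cleaned

CLEANED = clean_rows(RAW)
```

Here the two spellings of "North" collapse into one row and the row with no figure is dropped, leaving only the records that can actually be analysed.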
So where can you find data?
- Public institutions, open data and government data sets. Also private initiatives such as the Open Knowledge Foundation or opengovernmentdata.org. This is verifiable data from a reliable source, he added. Telecommunications agencies also publish a huge amount of information that isn’t released as open data but is available on their webpages.
- Commercial platforms, e.g. Infochimps, Timetric, Google Public Data Explorer, Amazon Web Services Public Data, Many Eyes by IBM.
- Advanced searching, e.g. using Google to search for specific file types, or performing site searches.
- Scraping and APIs, e.g. ScraperWiki, OutWit, scripts, Yahoo Pipes, Google Spreadsheets. APIs offer “an entry portal to their server so that you can look for the data that you want”, he said.
- Direct requests.
- Creating your own databases, although this is “a huge amount of work and requires a lot of resources, but you can use the community to help you”, he added.
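The advanced searching mentioned in the list above can be sketched in a few lines of Python. This is an illustration only, not something shown in the talk: it assumes Google's `site:` and `filetype:` search operators, and the function name `build_search_url` and the example domain are hypothetical.

```python
from urllib.parse import urlencode

def build_search_url(terms, site=None, filetype=None):
    """Build a Google search URL restricted to one site and/or one file type."""
    query = terms
    if site:
        query += f" site:{site}"          # limit results to a single domain
    if filetype:
        query += f" filetype:{filetype}"  # limit results to one file type, e.g. csv or xls
    return "https://www.google.com/search?" + urlencode({"q": query})

# Hypothetical example: spreadsheets about school budgets on one site.
EXAMPLE_URL = build_search_url("school budgets", site="example.gov", filetype="csv")
```

Opening the resulting URL in a browser runs the restricted search, which is often the quickest way to find spreadsheets that an institution has published but not promoted.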
Alonso also offered a useful list of what news outlets often look for, and then display, in data: trends, patterns, anomalies, connections, correlations (although it is important not to assume a causal effect), comparisons, hierarchy, localisation and processes.
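Two of the items in that list, correlations and anomalies, lend themselves to a quick check in code. The Python sketch below is an assumption-laden illustration rather than Alonso's method: the functions `pearson` and `anomalies`, the two-standard-deviation threshold and the sample figures are all hypothetical.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

def anomalies(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) > threshold * s]

# Hypothetical figures, for illustration only.
R = pearson([1, 2, 3, 4], [2, 4, 6, 8])
ODD = anomalies([10, 11, 9, 10, 10, 11, 9, 10, 50])
```

A coefficient close to 1 or -1 only says that two series move together. As the list above warns, it does not show that one causes the other, so a flagged value or a strong correlation is a prompt for further reporting rather than a finding.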