Data journalism
Data journalism is a specialized form of journalism that focuses on the use, analysis, and presentation of data sets to report news stories. It merges traditional reporting methods with advanced data analysis techniques, often incorporating elements such as infographics and data visualization. The field has evolved significantly since the late 1980s, when notable investigative works, such as Bill Dedman's Pulitzer Prize-winning series exposing racial discrimination in mortgage lending, highlighted its potential.
At its core, data journalism aims to extract meaningful insights from large datasets, often requiring proficiency with tools such as MySQL and Python alongside data visualization software. Its significance has surged with the advent of open data and large-scale data leaks, exemplified by projects like the Panama Papers, which revealed extensive financial misconduct through detailed data analysis. Media organizations are increasingly forming dedicated data teams to enhance their reporting capabilities. While data journalism offers new avenues for storytelling, it also presents challenges, particularly regarding data completeness and interpretation, underscoring the importance of critical analysis in understanding and presenting information.
Overview
Data journalism is a journalistic specialty that draws on the use, analysis, and presentation of data sets in the reporting of news stories. A relatively recent term and discipline, it is sometimes used synonymously with data-driven journalism and sometimes treated as an umbrella term that encompasses data-driven journalism along with infographics, data visualization, and database journalism. Database journalism organizes its information into a database instead of the traditional story or narrative structure, while data-driven journalism is a specialty that uses computers for data analysis in order to find or inform stories.
While the use of data in stories dates back to the heralded use of UNIVAC to predict the election results for CBS News in 1952 (which it did successfully), data journalism as a recognized specialty began to emerge at the end of the 1980s, especially after Atlanta journalist Bill Dedman was awarded the Pulitzer Prize for his 1988 series "The Color of Money," which used computer-based data analysis to uncover and discuss patterns of racial discrimination by mortgage lenders. Shortly thereafter, the Missouri School of Journalism and the nonprofit organization Investigative Reporters and Editors formed the National Institute for Computer Assisted Reporting (NICAR), which held its first conference in 1990 at Indiana University. NICAR remains the largest conference for data journalism, but "data journalism" has displaced "computer-assisted reporting," or CAR, for the simple reason that it is more specific: in the twenty-first century, all reporting is, in the plainest sense, computer-assisted. The Guardian either coined or popularized "data journalism" in 2009 with the launch of its Datablog, which made liberal use of data visualization tools. In one story, an interactive map detailed improvised explosive device (IED) attacks in Afghanistan according to their type and number of casualties. Over 16,000 attacks were listed, which would have been impossible to present meaningfully in a conventional text-only narrative.
Computer-assisted reporting relied on database software to collect information, statistical programs to analyze information, online access to public records and other research sources, and geographic information system mapping to study demographic changes. Its techniques were informed by both computer science and the social sciences, borrowing, for example, the analytical techniques used in sociology, political science, and economics. Even before the NICAR era, there were a number of notable stories by CAR reporters; both the Miami Herald's Clarence Jones (in 1969) and The New York Times' David Burnham (in 1972) used CAR techniques in their crime reporting, for instance.
Data journalism has a larger toolbox. Much of it depends on open data, or on leaks of very large data sets, and especially on the analysis of data sets too large to be handled manually in any practical way. More and more media organizations have established data teams to work as or with data journalists, including The New York Times and ProPublica. The basic endeavor of data journalism is to obtain useful data, mine it for meaningful information, construct data visualizations that convey that information, and assemble the results into a comprehensible story. MySQL and Python competencies are usually necessary, and open-source software tools are relied on heavily. For data visualization, Yahoo! Pipes, Open Heat Map, and Many Eyes are common tools.
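A minimal sketch of that workflow in Python, using the widely available open-source pandas and matplotlib libraries, might look like the following; the file name and column names are hypothetical placeholders rather than a reference to any particular dataset.

```python
# A minimal sketch of a common data-journalism workflow:
# obtain a dataset, mine it, and visualize the result.
import pandas as pd
import matplotlib.pyplot as plt

# 1. Obtain: load a public dataset (file and column names are hypothetical).
incidents = pd.read_csv("city_incidents.csv", parse_dates=["date"])

# 2. Mine: count incidents per neighborhood.
counts = incidents["neighborhood"].value_counts().head(10)

# 3. Visualize: a simple bar chart of the ten busiest neighborhoods.
counts.plot(kind="barh", title="Incidents by neighborhood")
plt.tight_layout()
plt.savefig("incidents_by_neighborhood.png")

# 4. Story: the chart and the underlying numbers become the backbone of the article.
```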
One of the pitfalls of data journalism is that the data that can be found is not always complete. In the Panama Papers case discussed below, full analysis was never completed, because while the leak was enormous, it was nevertheless not a leak of all relevant documentation; in some cases missing information could be inferred, while in other cases there was not enough to connect the dots. Recognizing whether or not one possesses all of the necessary information is an important part of all journalism, but is especially key when dealing with large sets of data.


Further Insights
Data journalism is an extension of the visual presentation of information that has always accompanied the modern world's capacity for the collection and storage of large sets of data. In 1858, for example, famed nurse and medical reformer Florence Nightingale presented a report on the medical and health conditions faced by British soldiers during the Crimean War. It was heavy on diagrams and charts, including pie charts, which had been developed at the beginning of the century, and turned seemingly endless lists of figures into easy-to-digest diagrams comparing the frequency of causes of mortality and the efficacy of various interventions. It was a work filled with what are now called infographics but that were at the time still novel outside of specific academic contexts (such as mathematics or statistics), and several of them were of her own invention, including the Coxcomb chart, a variation on the pie chart (representing different proportions with changes to the radius rather than the angle) that is still in use in the twenty-first century.
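The Coxcomb idea can be illustrated with a short sketch in Python's matplotlib, which supports polar bar charts; the months and values below are invented purely for demonstration.

```python
# A rough sketch of a Coxcomb-style (polar area) chart: each wedge spans an
# equal angle, and the value is encoded in the wedge's radius.
# The categories and values are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

values = np.array([120, 80, 45, 30, 15, 10])   # hypothetical monthly counts
labels = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]

angles = np.linspace(0.0, 2 * np.pi, len(values), endpoint=False)
width = 2 * np.pi / len(values)

ax = plt.subplot(projection="polar")
# A wedge's area grows with the square of its radius, so take the square root
# if the area (rather than the radius) should be proportional to the value.
ax.bar(angles, np.sqrt(values), width=width, bottom=0.0, align="edge")
ax.set_xticks(angles + width / 2)
ax.set_xticklabels(labels)
plt.savefig("coxcomb_sketch.png")
```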
Web developer and journalist Adrian Holovaty of Chicago was an early advocate of data journalism in the form of software development aimed at journalistic applications. In 2005, he created the open-source web application framework Django, which has been used by the Washington Times and PBS. That same year, he launched chicagocrime.org, one of the first and most noteworthy Google Maps mashups (inspiring Google to develop an official Google Maps API). In web applications, mashups combine content from multiple sources into a single interface. In Google Maps mashups, relevant subject-specific information or content is overlaid on an interactive Google Map of a given area. For example, chicagocrime combined a Google Map of Chicago with crime data obtained from the Chicago Police Department; Holovaty was awarded a Batten Award for Innovations in Journalism and inspired countless similar projects. Holovaty himself later expanded the chicagocrime idea into EveryBlock, which presented civic information (not only crime, but also public health inspections, construction, road work, and so on), posts from residents, and content from across the web pertinent to a given location. EveryBlock set a goal of being available in 500 communities in all 50 states by 2025.
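The mashup pattern itself is straightforward: subject-specific records are overlaid on an interactive base map. The sketch below illustrates the idea using the open-source folium library rather than the Google Maps API, with hypothetical incident records.

```python
# A minimal sketch of the map-mashup pattern: overlay subject-specific records
# on an interactive base map. Uses the open-source folium library (a Leaflet
# wrapper), not the Google Maps API; the records below are hypothetical.
import folium

# Hypothetical incident records: (description, latitude, longitude)
incidents = [
    ("Burglary reported", 41.8781, -87.6298),
    ("Vehicle theft reported", 41.8827, -87.6233),
]

chicago_map = folium.Map(location=[41.8781, -87.6298], zoom_start=12)
for description, lat, lon in incidents:
    folium.CircleMarker(
        location=[lat, lon],
        radius=6,
        popup=description,
    ).add_to(chicago_map)

chicago_map.save("incident_map.html")  # a self-contained interactive HTML map
```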
In 2008, in anticipation of the presidential election, Nate Silver founded the website FiveThirtyEight, named for the number of electors in the Electoral College. For a time, Silver's blog was licensed to The New York Times Online; in 2013, the site was acquired by ESPN and expanded its coverage from politics to sports, science, popular culture, and other areas. Silver was a statistician with experience in sabermetrics, the data-intensive study of baseball, before applying his statistical knowledge to politics. FiveThirtyEight successfully predicted not just the outcome of the 2008 election but the outcomes of 49 of the 50 states, and in 2012 bested this by predicting which candidate would carry each state as well as the District of Columbia. Silver had conceived of the idea of a data-driven, hard-science approach to political analysis, like the sabermetrics that had changed the discussions surrounding baseball over the previous decade, while waiting for a flight in New Orleans' Louis Armstrong Airport.
Silver's forecasting has not been perfect, of course. Up until the last minute, he had picked the wrong result in the Massachusetts special Senate election of 2010. In the 2016 presidential election, he was criticized during the campaign for giving Trump too high a chance of winning, and criticized after the election for not having given him enough of a chance. Silver's 2016 forecast involved "unskewing" polls he thought had been conducted poorly; as a result, FiveThirtyEight projected a 29 percent chance of a Trump victory, which was higher than most other forecasters' estimates but, in hindsight, seemed to many people too low, given Trump's actual victory. Silver contended that this criticism misunderstands forecasting: with a 29 percent chance of victory, Trump would win in nearly one out of three scenarios, which is not an especially rare event. Forecasting deals with probability; the most probable outcome is not a guaranteed outcome, nor does the actual occurrence of a less probable outcome indicate a faulty forecast. FiveThirtyEight's projections in the 2020 presidential election were more accurate. It projected that Biden would win the election, that Democrats had a three-in-four chance of winning the Senate, and that the House would retain its Democratic majority.
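The point about probability can be illustrated with a few lines of Python: an outcome given a 29 percent chance still occurs in roughly three of every ten simulated runs. This is a toy simulation built around a single made-up probability, not a reconstruction of FiveThirtyEight's model.

```python
# A toy illustration of probabilistic forecasting: an outcome with a
# 29 percent forecast probability is not an especially rare event.
import random

random.seed(0)
forecast_probability = 0.29
trials = 100_000

# Count how often the "upset" outcome occurs across many simulated elections.
upsets = sum(random.random() < forecast_probability for _ in range(trials))
print(f"Upset occurred in {upsets / trials:.1%} of {trials} simulated elections")
# Prints a value close to 29%, i.e., roughly three wins in every ten runs.
```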
The reception of Silver and especially his election forecasts points to the ways that the public, and other journalists, have not fully adjusted to data journalism. Traditional journalistic approaches to election forecasting might include polling data, certainly, but would be primarily narrative, perhaps relying on "conventional wisdom" or pointing to specific policy or character issues that could have an election impact. Silver's approach did not just include polling data; it analyzed large numbers of polls, weighted by their historical performance, and then supplemented that analysis with voter demographics and past voting patterns. Silver's basic approach remained essentially the same after his work on the 2008 elections. It was his reliance on past voting patterns that led him to correctly predict the 2008 results in North Carolina and Indiana when most pollsters forecast those results incorrectly. By 2010, FiveThirtyEight's database of pollster rankings included nearly five thousand election polls.
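In simplified form, the core aggregation step is a weighted average: each poll's result counts more or less depending on its sample size and the pollster's track record. The sketch below uses invented numbers and weights; FiveThirtyEight's actual model is considerably more elaborate.

```python
# A simplified sketch of poll aggregation: average poll results weighted by a
# pollster quality rating and sample size. The ratings and results are invented.
polls = [
    # (candidate_share, sample_size, pollster_rating between 0 and 1)
    (0.51, 1200, 0.90),
    (0.48,  800, 0.60),
    (0.53,  600, 0.75),
]

# Each poll's weight combines its sample size and its pollster's rating.
weights = [size * rating for _, size, rating in polls]
weighted_average = sum(
    share * w for (share, _, _), w in zip(polls, weights)
) / sum(weights)

print(f"Weighted polling average: {weighted_average:.1%}")
```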
During his affiliation with The New York Times, Silver's data journalism focused on non-election issues as well. For example, he devoted one column to the media coverage of 2011's Hurricane Irene and another to comparing the growth curves of media coverage of protests by the Tea Party and Occupy Wall Street, further analyzing that coverage by geography and finding that coverage of protests rose with the frequency of conflicts between protesters and police. Overall, Silver's work helped bring new attention and respect both to blogging and to data journalism.
The nonprofit investigative journalism organization ProPublica organized a data journalism project in 2017 called Documenting Hate. Founded in response to the rise of hate crimes in the United States, Documenting Hate facilitates the reporting of hate crimes and bias incidents, while using machine learning and other data analysis techniques to collect news stories about hate crimes. The project originated in part because of the lack of hard data on hate crimes, whether from private or public sources.
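Collecting news stories at that scale typically relies on supervised text classification: a model is trained on examples of relevant and irrelevant articles and then used to flag new candidates for human review. The sketch below shows that general technique using scikit-learn; the training examples and labels are invented, and this is not ProPublica's actual pipeline.

```python
# A minimal sketch of supervised text classification of the kind a project
# like Documenting Hate might use to flag candidate news stories for review.
# The training examples and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Vandals spray-painted slurs on a local synagogue",         # relevant
    "Police investigate assault described as bias-motivated",   # relevant
    "City council approves new parking garage downtown",        # not relevant
    "High school team wins regional championship",              # not relevant
]
train_labels = [1, 1, 0, 0]  # 1 = possible bias incident, 0 = other news

# TF-IDF features feeding a simple logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

new_story = "Residents report threatening graffiti targeting immigrants"
print(model.predict_proba([new_story])[0][1])  # probability the story is relevant
```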
In 2018, The New York Times published a story resulting from both on-the-ground and data journalism. Over five trips to Iraq, NYT reporters gathered files from abandoned offices of the Islamic State. Analysis of the 15,000 pages of documents revealed the workings of the Islamic State's government (Callimachi, 2018), from the kinds of offenses ISIS officers arrested people for to the workings of its marriage office and Department of Motor Vehicles. One of the revelations of the NYT's work was the extent to which ISIS seemed to have learned from the mistakes made after the American invasion. When the Americans purged members of Saddam Hussein's Baath Party from the Iraqi government, for example, the purge removed the party from power but also created a shortage of competence and skill in the civil institutions that kept the country working. ISIS was able to come to power in no small part by filling that vacuum, and when it seized land in Iraq and Syria, it absorbed the existing administrative infrastructure rather than attempting to rebuild it from scratch.
Data journalism played an important role in conveying information to Americans during the COVID-19 pandemic. Graphic concepts such as "flattening the curve" became commonplace. Data visualizations were used to display many kinds of information, such as the number of daily cases, the number of hospitalizations by age group, and even how to properly wear a mask.
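A typical example of such a visualization is a chart of daily counts smoothed with a seven-day rolling average. The sketch below uses pandas and matplotlib with synthetic numbers rather than real case data.

```python
# A minimal sketch of a pandemic-style chart: daily counts with a seven-day
# rolling average. The numbers below are synthetic, not real case data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
dates = pd.date_range("2020-03-01", periods=120, freq="D")
daily_cases = pd.Series(
    (np.sin(np.linspace(0, 3, 120)) * 400 + 500 + rng.normal(0, 50, 120)).clip(0),
    index=dates,
)

plt.plot(daily_cases.index, daily_cases, alpha=0.4, label="Daily cases")
plt.plot(daily_cases.index, daily_cases.rolling(7).mean(), label="7-day average")
plt.legend()
plt.title("Daily reported cases (synthetic data)")
plt.tight_layout()
plt.savefig("daily_cases.png")
```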
Issues
Data journalism has taken on particular significance as large data dumps of leaked documents and other information have become part of the media landscape. One of the best-known examples is a collection of documents known as the Panama Papers. Consisting of 11.5 million documents, the Panama Papers detailed financial information for more than 200,000 entities, in many cases revealing money laundering and tax evasion. The Panama Papers were anonymously leaked to German journalist Bastian Obermayer in 2015. The sheer size of the leak (2.6 terabytes, including e-mails, PDFs, databases, text files, and over one million image files) would have made manually sorting through it a task for hundreds of people over a considerable amount of time. Even with the use of software, the effort to authenticate and analyze the Panama Papers involved one hundred different news organizations working in twenty-five languages. The files were indexed using software packages like Apache Tika and Apache Solr, and a custom interface was built for accessing them.
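In outline, that kind of indexing pipeline extracts text from mixed document formats with Tika and loads it into Solr for full-text search. The sketch below uses the tika and pysolr Python packages; the Solr URL, core name, field names, and file paths are hypothetical, and this is not the consortium's actual tooling.

```python
# A minimal sketch of an extract-and-index pipeline: pull text out of mixed
# document formats with Apache Tika, then index it into Apache Solr for
# full-text search. URL, core name, field names, and paths are hypothetical.
import glob

import pysolr                 # Python client for Apache Solr
from tika import parser       # Python binding that calls a Tika server

solr = pysolr.Solr("http://localhost:8983/solr/leak_documents", always_commit=True)

documents = []
for i, path in enumerate(glob.glob("leak/**/*.pdf", recursive=True)):
    parsed = parser.from_file(path)   # returns a dict with "content" and "metadata"
    documents.append({
        "id": str(i),
        "path_s": path,
        "content_txt": parsed.get("content") or "",
    })

solr.add(documents)  # the documents are now searchable through Solr queries
```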
The Panama Papers are so named because of the use of Panama as a tax haven. Data from the leaked documents revealed the extent to which numerous wealthy individuals and celebrities had relied on offshore accounts to avoid taxes, as well as the money laundering efforts of various criminal or terrorist organizations. Among the stories that came out of the Panama Papers were revelations that five then-current heads of state, about a dozen former heads of state, and dozens of family members of heads of state had used offshore accounts or shell companies. While in many of these cases the activities revealed were not actually illegal, many ethical questions were raised about government officials intentionally avoiding their tax burden, as well as the potential for such accounts to be used to conceal illicit sources of income such as bribes. Numerous wealthy celebrities were revealed to be involved as well. In many cases, it is not clear to what extent these private individuals were aware of the specifics of their finances, or whether such accounts and shell companies were set up without their explicit knowledge by Mossack Fonseca and other financial firms.
Edward Snowden called the Panama Papers the biggest leak in the history of data journalism. It was a leak that demonstrated the possibilities of data journalism, turning massive amounts of data into real reporting in ways that would have been impractical two decades earlier and impossible a decade before that. Two years after the Panama Papers, another 13.4 million documents, known as the Paradise Papers, were leaked to Obermayer. Again dealing with offshore investments that were legal, illegal, or of questionable ethics, the Paradise Papers named 120,000 individuals and companies, including the United Kingdom's Prince Charles and Queen Elizabeth II and corporations such as Facebook, Twitter, Disney, Apple, Walmart, and McDonald's. Together with the Panama Papers, the Paradise Papers have offered a more informed look at the mechanics and issues of offshore investments and their use.
Bibliography
Appelgren, E. (2016). Data journalists using Facebook. Nordicom Review, 37(1), 88–101. Retrieved March 15, 2018, from EBSCO Academic Search Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=asn&AN=115897192&site=ehost-live
Callimachi, R. (2018, April 8). The ISIS files. The New York Times, 1–12. Retrieved May 23, 2018, from EBSCO Academic Search Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=asn&AN=128938650&site=ehost-live
Fairfield, J., & Shtein, H. (2014). Big data, big problems: Emerging issues in the ethics of data science and journalism. Journal of Mass Media Ethics, 29(1), 38–51. Retrieved March 15, 2018, from EBSCO Business Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=93798187&site=ehost-live
Fink, K., & Anderson, C. W. (2015). Data journalism in the United States. Journalism Studies, 16(4), 467–481. Retrieved March 15, 2018, from EBSCO Academic Search Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=asn&AN=103640198&site=ehost-live
Guo, L., Vargo, C. J., Pan, Z., Ding, W., & Ishwar, P. (2016). Big social data analytics in journalism and mass communication. Journalism & Mass Communication Quarterly, 93(2), 332–359. Retrieved March 15, 2018, from EBSCO Business Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=115311608&site=ehost-live
H&R Block and Nextdoor announce 11 new community-led projects to benefit 1.6 million people. (2022, May 20). Nextdoor, about.nextdoor.com/press-releases/hr-block-and-nextdoor-announce-11-new-community-led-projects-to-benefit-1-6-million-people/
Kim, D. E., & Kim, S. H. (2018). Newspaper journalists' attitudes towards robot journalism. Telematics & Informatics, 35(2), 340–357. Retrieved March 15, 2018, from EBSCO Education Source. http://search.ebscohost.com/login.aspx?direct=true&db=eue&AN=128127833&site=ehost-live
Kirkpatrick, K. (2015). Putting the data science into journalism. Communications of the ACM, 58(5), 15–17. Retrieved March 15, 2018, from EBSCO Business Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=102392545&site=ehost-live
Knight, M. (2015). Data journalism in the UK: A preliminary analysis of form and content. Journal of Media Practice, 16(1), 55–72. Retrieved March 15, 2018, from EBSCO Academic Search Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=asn&AN=101869664&site=ehost-live
Morini, F. (2023, Mar. 14). Data journalism as "terra incognita": Newcomers' tensions in shifting towards data journalism epistemology. Journalism Practice, doi.org/10.1080/17512786.2023.2185656
Rooney, S. (2018). Interactive journalism: Hackers, data, and code. New Media & Society, 20(2), 837–839. Retrieved March 15, 2018, from EBSCO Academic Search Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=asn&AN=127838701&site=ehost-live
Sidik, S. (2021, Apr. 13). How the COVID-19 pandemic has shaped data journalism. The Global Investigative Journalism Network, gijn.org/2021/04/13/how-the-covid-19-pandemic-has-shaped-data-journalism/.
Stalzer, M., & Mentzel, C. (2016). A preliminary review of influential works in data-driven discovery. Springerplus, 5(1), 1–17. Retrieved March 15, 2018, from EBSCO Academic Search Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=asn&AN=117264910&site=ehost-live