Data Analytics in the Social Sciences
Data Analytics in the Social Sciences involves the application of statistical techniques to extract meaningful insights from various types of data, including large-scale datasets known as Big Data. Traditionally, social science research has relied on qualitative methods such as observations and interviews to gather data, but the emergence of digital information sources has prompted researchers to adapt to more quantitative approaches. Unlike the hard sciences, where data tends to be homogeneous, social science datasets are often heterogeneous, comprising diverse types of information such as demographics, opinions, and behaviors.
A significant area of focus is sentiment analysis, where researchers use sophisticated software to assess the emotional tone of large bodies of text, often sourced from social media platforms. This analysis enables insights into public attitudes towards various issues, with researchers mining data from social networks to uncover connections among demographic characteristics. Collaborative efforts between social scientists and data scientists have become increasingly essential, as many social researchers may lack technical skills for data manipulation. Despite the challenges of integrating new technologies and methods, the potential for innovative discoveries remains high, prompting calls for enhanced training in data analytics within social science education. Overall, the integration of data analytics in social sciences is reshaping research methodologies and expanding the scope of inquiry into societal phenomena.
Abstract
Data analytics is the practice of using statistics to extract meaningful information from raw data. In the social sciences, research has traditionally involved small-scale projects in which researchers gather data through observations, interviews, or other methods, then code and organize that data before analyzing it to determine whether it supports any conclusions or offers insights relevant to the research questions being asked. The advent of the Internet and the proliferation of large amounts of information, so-called Big Data, are now forcing researchers in the social sciences to consider other approaches to their work.
Overview
Researchers in the hard sciences have for many years been accustomed to working with large amounts of data, whether it is telemetry from an interstellar spacecraft or weather information being used to predict storm patterns. Analyzing and manipulating large data sets requires different tools and a different mindset than that necessary for simpler projects, but it also makes it possible to ask new types of questions and open up new fields of discovery (Hesse, Moser & Riley, 2015). Over the last few years, these changes have begun to spread into the social sciences as well. This presents special challenges for social scientists.
The social sciences differ from the hard sciences in many ways, but an especially relevant factor is that while big data is used in both fields, in the hard sciences the large data sets are often homogeneous, meaning that they are all of the same type. For example, a set of data containing blood pressure measurements for all hospital patients over a ten-year period would be quite large, but all of the data would be in the same format and of the same type. In the social sciences, there is a greater chance that research will involve large sets of data that are heterogeneous, meaning that they are made up of different types of data. An example of this might be sets of data about the population of a particular state, including age, educational attainment, and political affiliation. All of this is useful information that can help answer interesting questions, but the data will have to be extensively manipulated before it can do so. Heterogeneity of data thus presents special challenges in the social sciences (Lin, 2015).
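As a minimal sketch of the manipulation that heterogeneous data requires, the example below converts three different kinds of fields (a number, an ordered category, and an unordered category) into a single numeric row. The records, the field names, the ordering of education levels, and the `encode` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical population records mixing three kinds of data.
records = [
    {"age": 34, "education": "bachelor", "party": "independent"},
    {"age": 61, "education": "high school", "party": "democrat"},
    {"age": 47, "education": "graduate", "party": "republican"},
]

# Ordered category: map each level to its rank (assumed ordering).
EDUCATION_RANK = {"high school": 0, "bachelor": 1, "graduate": 2}
# Unordered category: one indicator column per value (assumed value list).
PARTIES = ["democrat", "independent", "republican"]


def encode(record):
    """Turn one mixed-type record into a flat list of numbers."""
    row = [float(record["age"]), float(EDUCATION_RANK[record["education"]])]
    row += [1.0 if record["party"] == p else 0.0 for p in PARTIES]
    return row


encoded = [encode(r) for r in records]
```

Each field type needs its own treatment: age is already numeric, education has a natural order, and political affiliation has none, so it is spread across indicator columns rather than ranked.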
To date, the most common type of assignment using data analytics in the social sciences is processing and classifying large amounts of text. Put more simply, this type of project tries to look at a body of text to determine what it says about a particular subject; not necessarily what is said word for word, but whether the predominant view in the text is positive, negative, or neutral toward an issue. This seems like a simple enough task for a human being to accomplish, but it is important to remember that the computers and software used in data analytics process information very differently from the way that human beings do it, and they also work with much larger amounts of information.
A human being might be able to, during the course of an afternoon, read ten newspaper editorials about a given political issue, and then explain what side of the issue the majority of the editorials took (Tay et al., 2016). A research project using data analytics would be more likely to analyze the full text of every article that appeared in a newspaper during the last five years, to determine the societal attitude toward an issue. This type of work is called sentiment analysis.
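One simple way to approximate this kind of classification is to count words drawn from lists of positive and negative terms. The sketch below assumes two tiny, hypothetical word lists and hypothetical function names; it illustrates the idea only and is not the method used in any of the studies cited here.

```python
import re

# Tiny hypothetical word lists; real lexicons contain thousands of entries.
POSITIVE = {"good", "benefit", "support", "improve", "success"}
NEGATIVE = {"bad", "harm", "oppose", "fail", "crisis"}


def sentiment(text):
    """Label a text positive, negative, or neutral by counting lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def predominant_view(texts):
    """Return the most common label across a collection of texts."""
    labels = [sentiment(t) for t in texts]
    return max(("positive", "negative", "neutral"), key=labels.count)
```

Calling `predominant_view` on a list of article texts returns the label that occurs most often. A word-counting approach like this misses sarcasm, negation, and context, which is part of why more sophisticated software is usually needed.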
Sentiment analysis, like other types of data analytics, requires sophisticated software and considerable computing power, since machines do not intuitively detect attitudes and tones in text in the way that human beings can. These types of data analytics can become so resource intensive that it is necessary to divide the analysis into smaller subroutines that can be spread out over multiple computers and run in parallel. In addition, there is usually a significant amount of "front-end" work that must be done for text analysis to prove fruitful. This can involve providing the analysis software with sample texts that have been analyzed by human beings, so that these can be used as models to demonstrate what the software should be looking for (Souto-Otero & Beneito-Montagut, 2016).
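The role of human-labeled sample texts can be illustrated with a small naive Bayes classifier, one common technique for learning from labeled examples. This is a sketch under stated assumptions: the four sample sentences, their labels, and the function names are hypothetical, and real projects typically rely on far larger training sets and established libraries.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a text into lowercase words."""
    return text.lower().split()


def train(labeled_texts):
    """Count words per label from human-labeled (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical sample texts labeled by a human reader.
samples = [
    ("the new law is a great success", "positive"),
    ("voters welcome this helpful reform", "positive"),
    ("the new law is a terrible failure", "negative"),
    ("voters reject this harmful reform", "negative"),
]
model = train(samples)
```

After training, `classify("a great and helpful law", model)` labels a new text by comparing how often its words appeared in each group of human-labeled samples.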
Further Insights
One of the richest sources of data now being explored by social scientists is social networks such as Facebook, Twitter, and LinkedIn. These sites allow users to create personal profiles describing themselves, their interests, and related information. Users can connect with one another and share updates about their own activities with their connections. The information that people share can take the form of text, audio, photos, and just about any other format that can be digitally transmitted. Social scientists can take sets of data from social networks and use them to explore what people are thinking and saying about various issues, and this analysis can then be combined with the demographic data that people share about themselves on the social network.
Mining social media data allows one not only to see what percentage of the population is in favor of an issue, but also to study what other characteristics that issue's supporters may share. For example, researchers might use this type of data analysis to determine that 37 percent of users of a social network are opposed to mandatory military service, and furthermore, that 82 percent of those opposed are parents to one or more children (Chan & Bennett Moses, 2016).
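The arithmetic behind figures of this kind is a pair of proportions: the share of all users who hold a view, and the share of that subgroup who have some other characteristic. The sketch below uses eight hypothetical user records and hypothetical field names, so its percentages differ from those in the example above.

```python
# Hypothetical user records; a real study would draw these from platform data.
users = [
    {"opposes_service": True, "is_parent": True},
    {"opposes_service": True, "is_parent": True},
    {"opposes_service": True, "is_parent": False},
    {"opposes_service": False, "is_parent": True},
    {"opposes_service": False, "is_parent": False},
    {"opposes_service": False, "is_parent": False},
    {"opposes_service": False, "is_parent": True},
    {"opposes_service": False, "is_parent": False},
]


def percent(part, whole):
    """Return part as a percentage of whole, or 0.0 if whole is zero."""
    return 100.0 * part / whole if whole else 0.0


# First filter to the subgroup, then measure a trait within that subgroup.
opposed = [u for u in users if u["opposes_service"]]
opposed_parents = [u for u in opposed if u["is_parent"]]

share_opposed = percent(len(opposed), len(users))
share_parents_among_opposed = percent(len(opposed_parents), len(opposed))
```

The second percentage is computed against the number of opponents, not the total number of users, which is what makes it a statement about the subgroup.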
Discovering connections like this is sometimes known as exploratory data analysis, because it allows researchers to try out combinations of different types of data to see if there might be a connection. Of course, as with any research, one must be careful to keep in mind the maxim that "Correlation does not equal causation." This means that just because two phenomena appear to be connected to one another (i.e., correlated) in some way, for instance because they always or frequently occur together, one cannot assume that one phenomenon is causing the other. This is clear from the example, where being against mandatory military service obviously does not cause one to be a parent, just as being a parent does not cause one to oppose mandatory military service.
As long as one does not jump to conclusions, however, exploratory data analysis can be a powerful means of uncovering relationships between different pieces of information. This can be as simple as checking for a correlation between various pieces of information in a data set, such as calculating whether there is any sign of a relationship between people's gender and their favorite flavor of ice cream. As whimsical as this may seem, explorations of this sort are the very essence of what it means to be a scientist and to conduct scientific research. Certainly it is possible that such innocent questions will lead nowhere, yet the history of scientific discovery is full of examples where pure chance, idle curiosity, or both were instrumental in the discovery of important new information.
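For two categorical variables such as gender and favorite flavor, one standard way to check for a relationship is the Pearson chi-square statistic, which compares observed counts in a cross-tabulation with the counts expected if the two variables were unrelated. The sketch below computes only the statistic, not a significance level, and the survey responses and function name are hypothetical.

```python
from collections import Counter


def chi_square(pairs):
    """Pearson chi-square statistic for a list of (category_a, category_b) pairs."""
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    stat = 0.0
    for a in row_totals:
        for b in col_totals:
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[a] * col_totals[b] / n
            stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat


# Hypothetical responses with no relationship between the two variables.
no_link = (
    [("woman", "vanilla")] * 10 + [("woman", "chocolate")] * 10
    + [("man", "vanilla")] * 10 + [("man", "chocolate")] * 10
)
# Hypothetical responses in which flavor lines up perfectly with gender.
strong_link = [("woman", "vanilla")] * 20 + [("man", "chocolate")] * 20
```

A statistic of zero means the observed counts match what independence would predict, and larger values indicate a stronger departure from it. Even a large value shows only an association, not that one variable causes the other.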
Some features of working with big data seem to be a natural fit with the social sciences, such as the use of mixed methods to assess data using a variety of techniques and perspectives. Social scientists have employed this approach for many years, and it has become evident that it is highly appropriate for working with large data sets, given their heterogeneous composition. Using multiple methods with big data is often necessary because a data set can contain so many different types and formats of information that analyzing the data with a single method would be tremendously limiting at the very least, and profoundly misleading at the worst.
Issues
For some social science researchers, new technologies and techniques can interfere with their ability to adopt a data analytics approach to their research. This can occur for a number of reasons. The most basic type of difficulty may be with the researcher attempting to understand what social networks are, why people participate in them, and what kind of information they may contain. This type of information is especially difficult to grasp for older researchers, since it is not something they grew up with, and it is quite different from other types of data they may have encountered. In the past, social scientists needing to collect data for their research might have needed to construct surveys, conduct interviews, or go out into the field to observe people's behavior. It can be a major adjustment to simply download a set of data from the cloud and begin manipulating it; some are slow to learn new techniques, and a few refuse to try at all (Chang, Kauffman & Kwon, 2014).
Once data has been collected, another hurdle can be the technical skill required to manipulate it. Because many researchers in the social sciences do not possess these skills, the trend has been toward more frequent and more in-depth collaboration between social scientists and data scientists, who know how to work with large amounts of digital information but may not possess the theoretical background needed to frame relevant research questions. By working together, researchers from these different disciplines can draw on each other's strengths in order to compensate for gaps in their own abilities.
The social scientist and the data analyst can confer about what areas of interest are going to be explored, and what sources of data are available for those areas. Once this type of general direction has been established, the social scientist is well-equipped to frame the guiding questions of the research, and the data scientist can then determine how the available data can be used to explore those questions. An example might be a social scientist interested in finding out how users of Twitter influence one another: whose opinions are most influential, how does influence travel and express itself, and so forth. Unless she had an understanding of how Twitter works and how to access its data, she would be stuck at this point, so she might confer with her colleague, the data analyst, about how to proceed. The data analyst might then offer to construct a visual representation of where certain words appear in the Twitter stream, allowing the social scientist to see clusters of Twitter users around the world who are using the words. By combining their areas of expertise, researchers can achieve surprising results (Gil de Zúñiga & Diehl, 2017).
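The counting step that would sit underneath such a visual representation can be sketched as follows. The posts, the `region` and `text` fields, and the function name are hypothetical and chosen for illustration; this is not a description of how any platform actually delivers its data.

```python
from collections import Counter, defaultdict

# Hypothetical posts; the "region" and "text" fields are assumed for illustration.
posts = [
    {"region": "Europe", "text": "The election debate was heated"},
    {"region": "Europe", "text": "Election turnout and taxes dominate the news"},
    {"region": "Asia", "text": "New taxes announced today"},
    {"region": "Asia", "text": "A quiet weekend in the city"},
]


def word_counts_by_region(posts, tracked_words):
    """Count, per region, how many posts mention each tracked word."""
    counts = defaultdict(Counter)
    for post in posts:
        words_in_post = set(post["text"].lower().split())
        for word in tracked_words & words_in_post:
            counts[post["region"]][word] += 1
    return counts


clusters = word_counts_by_region(posts, {"election", "taxes"})
```

The resulting per-region counts are the kind of table a data analyst could then plot on a map to reveal where particular words cluster.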
Ideally, these types of collaborations will have benefits for both parties; otherwise, a dynamic can develop in which one party feels used by the other, or at the very least unmotivated to continue the project. When a big data collaboration does work, it is usually because the social scientist has found a helpful expert in technology, and the data analyst has found a useful set of data to use in the testing of some new software or algorithm. Without such a balance of incentives, it is usually the data specialist who becomes disenchanted with the collaboration, either because other projects are more pressing, or because the subject matter fails to hold the analyst's interest (Conte & Giardini, 2016).
Owing to the frequency with which this happens, calls for graduate programs in the social sciences to incorporate more training in programming and analyzing big data have been growing more urgent. In the past, students in these programs have often sought out courses in other departments, such as computer science, in the hope of gaining the expertise needed to take on the kind of innovative work that makes one's reputation. More often than not, these efforts are unsuccessful, either because the coursework is too technical or because the applications emphasized in the courses are completely unrelated to any type of work the social science students would be likely to undertake. This has led many to conclude that graduate programs in the social sciences need to develop their own coursework in this area, to prepare students to work with big data on their own terms rather than through the lens of some other discipline (Mosco, 2017).
Similarly, there have been suggestions that an area of growth for computer scientists and statisticians would be the design of new kinds of statistical analysis software that are powerful enough to be used in highly complex research projects, yet easy enough to use that one does not need a doctorate in computer programming to operate them. There is also a need for this next-generation social science data analysis software to be designed to run not on the personal computers that social scientists are accustomed to using, but on large-scale computing platforms that have the processing power needed to manipulate large-scale datasets efficiently. Social scientists who continue to think of research in terms of what they can manage to do on their personal computers will limit themselves to laboring under paradigms of the past. If social scientists are, as a discipline, going to join their colleagues in the hard sciences in exploring what big data has to offer, then they must conquer their reluctance and find or develop the right tools (Guo & Vargo, 2015).
Terms & Concepts
Exploratory Data Analysis: A type of research that seeks to discover possible trends and connections within large amounts of data, as a means of opening up new possibilities for future research.
Heterogeneous: The property of being composed of different types of elements or substances, mixed together.
Homogeneous: The property of being composed of a single type of material throughout all regions.
Sentiment Analysis: Using software to evaluate the communicative tone in a body of text.
Social Network: An online community in which users create profiles, connect with associates, and share information.
Bibliography
Chan, J., & Bennett Moses, L. (2016). Is Big Data challenging criminology? Theoretical Criminology, 20(1), 21–39.
Chang, R. M., Kauffman, R. J., & Kwon, Y. (2014). Understanding the paradigm shift to computational social science in the presence of big data. Decision Support Systems, 63, 67–80.
Conte, R., & Giardini, F. (2016). Towards computational and behavioral social science. European Psychologist, 21(2), 131–140.
Gil de Zúñiga, H., & Diehl, T. (2017). Citizenship, social media, and big data: Current and future research in the social sciences. Social Science Computer Review, 35(1), 3–9.
Guo, L., & Vargo, C. (2015). The power of message networks: A Big-Data analysis of the network agenda setting model and issue ownership. Mass Communication & Society, 18(5), 557–576. Retrieved January 1, 2018 from EBSCO Online Database Sociology Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=sxi&AN=109420870&site=ehost-live
Hesse, B. W., Moser, R. P., & Riley, W. T. (2015). From big data to knowledge in the social sciences. Annals of the American Academy of Political and Social Science, 659(1), 16–32.
Lin, J. (2015). On building better mousetraps and understanding the human condition: Reflections on big data in the social sciences. Annals of the American Academy of Political and Social Science, 659(1), 33–47.
Mosco, V. (2017). After the Internet: New technologies, social issues, and public policies. Fudan Journal of the Humanities & Social Sciences, 10(3), 297–313. Retrieved January 1, 2018 from EBSCO Online Database Sociology Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=sxi&AN=124485778&site=ehost-live
Souto-Otero, M., & Beneito-Montagut, R. (2016). From governing through data to governmentality through data: Artefacts, strategies and the digital turn. European Educational Research Journal, 15(1), 14–33.
Tay, L., Parrigon, S., Huang, Q., & LeBreton, J. M. (2016). Graphical descriptives: A way to improve data transparency and methodological rigor in psychology. Perspectives on Psychological Science, 11(5), 692–701.
Suggested Reading
Bail, C. A. (2017). Taming Big Data. Sociological Methods & Research, 46(2), 189–217. Retrieved January 1, 2018 from EBSCO Online Database Sociology Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=sxi&AN=121334798&site=ehost-live
DeHart, D. (2017). Team science: A qualitative study of benefits, challenges, and lessons learned. Social Science Journal, 54(4), 458–467. Retrieved January 1, 2018 from EBSCO Online Database Sociology Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=sxi&AN=126166246&site=ehost-live
Halford, S., & Savage, M. (2017). Speaking sociologically with Big Data: Symphonic social science and the future for Big Data research. Sociology, 51(6), 1132–1148. Retrieved January 1, 2018 from EBSCO Online Database Sociology Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=sxi&AN=126598017&site=ehost-live
Kosinski, M., Wang, Y., Lakkaraju, H., & Leskovec, J. (2016). Mining big data to extract patterns and predict real-life outcomes. Psychological Methods, 21(4), 493–506.
McFarland, D., Lewis, K., & Goldberg, A. (2016). Sociology in the era of Big Data: The ascent of forensic social science. American Sociologist, 47(1), 12–35. Retrieved January 1, 2018 from EBSCO Online Database Sociology Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=sxi&AN=113251477&site=ehost-live
Tomescu-Dubrow, I., & Slomczynski, K. M. (2016). Harmonization of cross-national survey projects on political behavior: Developing the analytic framework of survey data recycling. International Journal of Sociology, 46(1), 58–72. Retrieved January 1, 2018 from EBSCO Online Database Sociology Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=sxi&AN=113744425&site=ehost-live