Data Harmonization
Data harmonization is the process of integrating multiple datasets into a cohesive framework, enabling researchers, analysts, and policymakers to extract meaningful insights from diverse sources. This practice is essential when data originates from various contexts, such as different years, countries, languages, or studies, as inconsistencies in measurement units, classification methods, and terminology can hinder comparative analysis. To effectively harmonize data, analysts must ensure that all datasets use consistent units and formats, allowing for clear and comprehensive analyses.
The harmonization process can be manual, involving meticulous sorting and standardizing by researchers, or automated through specialized software programs designed to manage large datasets. Such programs can significantly enhance the efficiency of data integration, accommodating complex historical datasets and diverse variables.
Data harmonization is particularly valuable in fields like public health, sociology, and business, where understanding trends over time or across different populations is crucial. However, the success of data harmonization relies heavily on the quality and completeness of the original data. Ethical considerations are also paramount, especially regarding personal data privacy and consent, which must be navigated carefully throughout the research process. Ultimately, effective data harmonization can lead to valuable insights and predictive analytics that inform decision-making across various sectors.
Abstract
Data harmonization is used by researchers, analysts, students, and policy makers from many fields to bring together and make sense of multiple data sets. This process may include combining multiple years of data, as in census data. It might also involve much more complex tasks such as bringing together data collected in multiple nations, languages, time periods, or by many different researchers. To complete a data harmonization project, analysts must be sure that the final product uses the same units, scales, and terminology. Large data sets can enable understandings and conclusions that are not clear from smaller, more exclusive data sets.
Overview
Data harmonization combines multiple datasets into one larger set that allows researchers to better understand specific phenomena. This combination can be carried out by human researchers who carefully combine the data, or it can be facilitated by computer programs that have been specifically designed to create large datasets. Data harmonization is necessary because researchers frequently work independently, using their own variables, recording methods, and notes. The collected information is valuable for the researcher's original project, but it is also valuable for future researchers, who may want to compare multiple studies that have been conducted over a long period of time, within a diversity of communities, or across vast geographic spaces. Because the original researchers are focusing on one specific question, however, they often use their own units of measurement, methods of classification, and ways of analyzing the findings. This can cause problems when later researchers want to compare data sets. For example, if one researcher recorded data in ounces and another in pounds, the units of measure will need to be reconciled, such as by converting all findings into ounces, before both data sets can be used together.
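The ounces-and-pounds reconciliation described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names, and the two sample studies are hypothetical, and the only fixed fact is that one avoirdupois pound equals 16 ounces.

```python
OUNCES_PER_POUND = 16  # avoirdupois: 1 lb = 16 oz


def to_ounces(value, unit):
    """Convert a weight to ounces; `unit` must be 'oz' or 'lb'."""
    if unit == "oz":
        return value
    if unit == "lb":
        return value * OUNCES_PER_POUND
    raise ValueError(f"unsupported unit: {unit!r}")


def harmonize_weights(*datasets):
    """Merge datasets into one list whose weights are all in ounces."""
    combined = []
    for dataset in datasets:
        for record in dataset:
            combined.append(
                {
                    "id": record["id"],
                    "weight_oz": to_ounces(record["weight"], record["unit"]),
                }
            )
    return combined


# Hypothetical records: one study recorded ounces, the other pounds.
study_a = [{"id": "a1", "weight": 24.0, "unit": "oz"}]
study_b = [{"id": "b1", "weight": 1.5, "unit": "lb"}]

combined = harmonize_weights(study_a, study_b)
```

Raising an error for an unknown unit, rather than passing the value through, keeps an unreconciled measurement from slipping silently into the combined set.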
Converting data from pounds to ounces is easy, but not all data harmonization projects are so simple. Sometimes historic data has been recorded in ways that are confusing to modern researchers, or does not fit well into modern classifications. For example, a researcher studying population change over a long period of time might need to use documents that classified some people as slaves or as the property of another person. To compare population numbers, the researcher would need to check the census data to ensure that these people were properly counted. The CEDAR dataset in the Netherlands, for example, was designed to draw together census data from 1795 to 1971 (Meroño-Peñuela, Ashkpour, Guéret, & Schlobach, 2017).
Researchers might also have difficulty combining data sets that recorded the same location under different names, such as the Russian city of Saint Petersburg, which was renamed Petrograd, then Leningrad, and then Saint Petersburg again. To ensure that a study was conducted correctly, a historical researcher would need to make sure that her searches for information about the city included all possible city names. Data harmonization would make this search easier by including notes in the data sets, for example, indicating that a census from Leningrad applies to modern-day Saint Petersburg.
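One way such notes could be attached is with a lookup table of historical names. The following is a minimal sketch, not a description of how any particular archive works: the alias table contains only the names mentioned above, and the function names, field names, and sample records are hypothetical.

```python
# Historical names mapped to one canonical modern name.
PLACE_ALIASES = {
    "petrograd": "Saint Petersburg",
    "leningrad": "Saint Petersburg",
    "saint petersburg": "Saint Petersburg",
}


def canonical_place(name):
    """Return (canonical_name, note); unknown names are returned unchanged."""
    key = " ".join(name.lower().split())  # normalize case and spacing
    canonical = PLACE_ALIASES.get(key)
    if canonical is None:
        return name, None
    if canonical.lower() == key:
        return canonical, None  # already the modern name, no note needed
    return canonical, f"recorded as {name.strip()}"


def annotate(records):
    """Copy each record, adding the canonical city and a provenance note."""
    annotated = []
    for record in records:
        city, note = canonical_place(record["city"])
        annotated.append({**record, "city": city, "city_note": note})
    return annotated


# Hypothetical source records that name the same city differently.
records = [
    {"source": "census_1", "city": "Leningrad"},
    {"source": "census_2", "city": "Saint Petersburg"},
]
annotated = annotate(records)
```

Keeping the original name in a note, instead of overwriting it, preserves what the source actually said while still letting one search find every record for the city.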
Some libraries and archives have manually added these notes. However, access to online databases and the digitization of historic records allow researchers to access many more data sets than they could ever manually code or recode. For this reason, computer-based data harmonization programs have been developed. For example, the Research Institute of the McGill University Health Centre developed the DataSHaPER (Data Schema and Harmonization Platform for Epidemiological Research) program. This program is designed to bring together biodata collected throughout Europe and North America. Similar projects include the BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) program. This project has been used to combine data about obesity from multiple countries and projects to study the concept of "healthy obesity," which is rare but can be found in large datasets (Doiron et al., 2013).
When the DataSHaPER program was first developed, researchers had an idea of what they were searching for. However, as time goes on and more data becomes available, those researchers will have to revise and update their program to ensure that it adapts to the expanding body of available data sets. Developing programs such as DataSHaPER is more efficient than requiring that all researchers use the same methods for finding and recording data. Researchers have valid reasons for recording data in various specific ways. For example, researchers studying soda consumption would want to record data in the same form that their informants provide information. They might record the number of ounces of soda that each informant drank over a period of time. This information is useful for the researcher's project, but would require data harmonization to compare with a study that recorded consumption in liters.
The process of data harmonization begins similarly for both manual and electronic data. First, researchers find the most important data sources that will address a specific research question or problem. Some of these data sets are held in libraries, others are kept by individual researchers in their offices or on their computers. Other data sets are kept as raw data, such as a set of tweets that have not yet been organized or classified by any researcher. The size of the collected data sets, as well as the amount of funding available for the project and time that the researcher can dedicate to the project will determine if the collected data is harmonized manually or electronically.
After the data sets have been collected, researchers must sort through the collected data to ensure that nothing has been incorrectly recorded or entered. The errors that the researcher is searching for include incomplete entries that did not matter to the original research project but can produce errors if included in a computerized data harmonization program. For example, a researcher might have asked about a participant's gender but allowed participants to decline to answer. The field for participant gender might then contain a number of blank answers, which could produce error reports if electronically harmonized. Before harmonizing this dataset, the researcher would need to decide whether all surveys in which the participant declined to indicate gender should be removed from the data set or, if gender is not important, whether the question should be eliminated from the original data set before harmonization. A third option would be to leave the original data as is, with its missing answers, but program the electronic data harmonization program to treat the blank entries as valid, thus preventing error reports. The analyst should also check that the data set contains no duplicates, that is, information that has been recorded twice. Finally, the analyst will ensure that all data is relevant to the specific project.
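The three ways of handling blank gender answers, together with the check for duplicate records, might look like the following sketch. The `clean` function, its `missing_gender` option names, the "not reported" code, and the sample survey are all assumptions made for illustration, not part of any harmonization program named in this article.

```python
VALID_MODES = {"drop_records", "drop_field", "keep"}


def clean(records, missing_gender="keep"):
    """Remove exact duplicates, then handle blank gender answers.

    missing_gender:
      'drop_records' - discard records with no gender answer
      'drop_field'   - remove the gender question from every record
      'keep'         - keep the record and code the blank as 'not reported'
    """
    if missing_gender not in VALID_MODES:
        raise ValueError(f"unknown mode: {missing_gender!r}")
    seen = set()
    cleaned = []
    for record in records:
        key = tuple(sorted(record.items()))  # identical records share a key
        if key in seen:
            continue  # the same information was recorded twice
        seen.add(key)
        row = dict(record)
        gender = row.get("gender")
        is_missing = gender is None or str(gender).strip() == ""
        if missing_gender == "drop_records" and is_missing:
            continue
        if missing_gender == "drop_field":
            row.pop("gender", None)
        elif is_missing:
            row["gender"] = "not reported"  # explicit, valid code for a blank
        cleaned.append(row)
    return cleaned


# Hypothetical survey with one blank answer and one duplicated record.
survey = [
    {"id": 1, "gender": "F", "age": 30},
    {"id": 2, "gender": "", "age": 41},
    {"id": 1, "gender": "F", "age": 30},
]
kept = clean(survey, missing_gender="keep")
```

Which option is appropriate depends on the research question; the sketch only makes the choice explicit so that it is applied consistently to every record.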
After the data has been checked, the analyst will begin to harmonize the data. First the researcher or electronic program will look for information that has been recorded differently but fits well together into larger units or classifiers. This level of sorting is called "practicality" harmonization. For example, if one data set records participants as "women" and the other records participants as "girls," the data harmonizer could easily combine the two into one larger "female" unit. The analyst will also look for data that is recorded in the exact same way in two different data sets. For example, the analyst might find that both datasets use the same age ranges, such as 18–24 years old. Finding this similarity between datasets is called "purity."
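Both kinds of matching can be illustrated with a short sketch. The mapping table and the label lists below are hypothetical; the code simply shows labels being recoded into a broader category (the "practicality" case) and labels that already match across data sets being identified (the "purity" case).

```python
# Hypothetical mapping from source labels to a broader shared category.
BROADER_CATEGORY = {"women": "female", "girls": "female"}


def recode(labels, mapping):
    """Replace each label with its broader category when one is defined."""
    return [mapping.get(label, label) for label in labels]


def shared_labels(labels_a, labels_b):
    """Return the labels that appear identically in both data sets."""
    return sorted(set(labels_a) & set(labels_b))


# Labels recorded differently but belonging to one larger category.
gender_a = ["women", "women"]
gender_b = ["girls"]
recoded = recode(gender_a + gender_b, BROADER_CATEGORY)

# Age ranges recorded the same way in both data sets need no recoding.
ages_a = ["18-24", "25-34"]
ages_b = ["18-24", "35-44"]
common = shared_labels(ages_a, ages_b)
```

Labels without an entry in the mapping are passed through unchanged, so an analyst can review them before deciding where they belong.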
Finally, the researcher will design a common format, such as recording participant age in the same ranges, and apply that format to all data sets. Once the data sets have been combined, the analyst can move forward with the analysis and can also make the combined data accessible to other researchers who may be interested in similar topics or questions.
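Applying a common age format could be sketched as follows, assuming one study recorded exact ages and another recorded age ranges. The age bands, field names, and sample records are hypothetical choices made for the example.

```python
# Hypothetical common format: inclusive age bands shared by all data sets.
AGE_BANDS = [(18, 24), (25, 34), (35, 44)]


def age_band(age):
    """Return the band label for an exact age, or 'other' if none fits."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"


def to_common_format(record):
    """Return a record that stores age only as an 'age_band' label."""
    if "age_band" in record:
        band = record["age_band"]  # already in the common format
    else:
        band = age_band(record["age"])
    return {"id": record["id"], "age_band": band}


# Study A recorded exact ages; study B recorded ranges.
study_a = [{"id": "a1", "age": 19}, {"id": "a2", "age": 30}]
study_b = [{"id": "b1", "age_band": "18-24"}]

combined = [to_common_format(r) for r in study_a + study_b]
```

Converting exact ages to ranges loses detail, but the reverse conversion is not possible, so the coarser format is the one both data sets can share.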
Further Insights
Programs such as the University of Minnesota's Population Center are working to harmonize census data over long periods of time, such as between 1850 and the present in the United States. The data harmonized from this project have contributed to insights in multiple fields, including studies of household water use in Puerto Rico (Yu et al., 2015) and changes in the lives of working mothers over 200 years (Aaronson et al., 2017). This data is important for businesses that need to know about new demographic trends, want to predict migration patterns, or are looking to better understand global trends as opposed to those affecting only one nation.
In business settings, data harmonization enables decision makers to access complex sets of information to quickly make decisions or mark trends. For example, by using data harmonization decision makers can quickly make sense of large sets of social media data to understand emergent trends. Data harmonization is also important for fields that have many employees moving between many jobs. For example, in health care settings, data harmonization has proved to be very useful for understanding the efficiency and safety standards of offices, clinics, nurses, doctors, and other employees. However, these organizations and employees earn their credentials from a variety of sources and organizations, many of which use different measurements to determine if an individual or group has achieved a specific qualification or rank. Using data harmonization would enable each organization to continue using its own scoring system while allowing for comparisons between much larger groups and fields (Hughes, Beene & Dykes, 2014).
Data harmonization is also used by providers of health care services. For example, the CHANCES (Consortium on Health and Ageing: Network of Cohorts in Europe and the United States) project has brought together 287 different variables on diverse health and aging issues. These include demographic information, such as economic and social data, and medical data, such as genetic markers for diabetes. From the study of these combined information sources, the CHANCES project could inform new medical practices about where to open their offices or government organizations about what kind of programs are needed in specific areas (Boffetta et al., 2014). This data could also be used for targeted marketing campaigns that are designed for consumers of a specific age and health status.
Issues
Data harmonization is a successful method of combining multiple data sets, but it is only as successful as the original data. It cannot overcome missing data points, faulty recording, or other types of incomplete data. Researchers are also cautioned that they should pay attention to the scalability of their data—meaning if the data was intended to represent a large population or if it has always been collected from small communities. The risk here is that a researcher might think that a large data set, resulting from data harmonization, can speak to the experiences of many different individuals when in reality it only represents many members of a small group.
Data harmonization has also become a useful predictive tool. Using a large amount of knowledge about the past, researchers are able to predict how communities or individuals will respond in the future. For example, the National Institute on Drug Abuse frequently uses data harmonization to understand large social events, such as the spread of HIV/AIDS. However, those researchers warn that harmonization works only when studies have asked similar questions and shared similar goals. Additionally, they warn that it can be hard for researchers to gain access to data sets controlled or funded by other, at times competing agencies. Researchers might also have difficulty combining data sets collected by government or private research organizations which have vested interests in refusing to share their data (Chandler et al., 2015).
Even when data sets are made available, researchers need to pay careful attention to the ways that personal data is shared. Many informants, survey takers, and interviewees agree to share their personal data with a specific person or group, but they have not necessarily agreed to share their data with future researchers or to contribute to a data harmonization project. These concerns are particularly important for health data, which is valuable for researchers but the sharing of which could violate patient expectations and agreements (Auffray et al., 2016). Other data that may cause problems if shared include purchasing patterns, political beliefs, and religious convictions. It is the responsibility of researchers, at each stage of a project, to ensure that the data is being collected, shared, and analyzed in ethical ways. When these assurances are made and guaranteed, data harmonization promises to provide new insights, predictions, and information across a wide diversity of fields.
Terms & Concepts
Big Data: Data sets that are so large that they must be processed and analyzed using a computer program. These data sets may be collected all at once by a single researcher or they may be created by joining together many data sets through data harmonization.
Data Quality: Data quality is judged by characteristics of the data set. While some projects will judge all characteristics, others will only focus on the most important characteristics as judged by the researchers who need to use the data. Some standard characteristics used to judge data quality are the age of the data set, the completeness of the information presented, and the number of data points included in a data set.
Data Schema: The way in which a database is organized. Schema are the different groups that are used in the organization. This might be the same groupings that were used in a research project, or it might be a new set of classifications that are designed during data harmonization to ensure clarity and proper reporting of the collected data.
Data Wrangling: The transformation of data from a collected form—such as converting rows of data into more meaningful groupings. This process is also sometimes called data munging.
Meta-analysis: A form of statistical analysis that combines and examines the results of multiple studies or data sets.
Predictive Analytics: Utilizing historic data to predict future trends and actions. Predictive analytics can be formed based on many different data sources, but the most successful come from large data sets. This typically requires data harmonization that provides the largest possible view of past activities on which to create predictions.
Bibliography
Aaronson, D., Dehejia, R., Jordan, A., Pop-Eleches, C., Samii, C., & Schulze, K. (2017). The effect of fertility on mothers' labor supply over the last two centuries (No. w23717). Cambridge, MA: National Bureau of Economic Research.
Auffray, C., Balling, R., Barroso, I., Bencze, L., Benson, M., Bergeron, J., Bernal-Delgado, E., Blomberg, N., Bock, C., Conesa, A., & Del Signore, S. (2016). Making sense of big data in health research: Towards an EU action plan. Genome Medicine, 8(1), 71.
Boffetta, P., Bobak, M., Borsch-Supan, A., Brenner, H., Eriksson, S., Grodstein, F., Jansen, E., Jenab, M., Juerges, H., Kampman, E., & Kee, F. (2014). The consortium on health and ageing: Network of cohorts in Europe and the United States (CHANCES) project—design, population and data harmonization of a large-scale, international study. European Journal of Epidemiology, 29(12), 929–936.
Chandler, R. K., Kahana, S. Y., Fletcher, B., Jones, D., Finger, M. S., Aklin, W. M., … Webb, C. (2015). Data collection and harmonization in HIV research: The seek, test, treat, and retain initiative at the National Institute on Drug Abuse. American Journal of Public Health, 105(12), 2416–2422. Retrieved January 1, 2018 from EBSCO Online Database Business Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=110787196&site=ehost-live
Doiron, D., Burton, P., Marcon, Y., Gaye, A., Wolffenbuttel, B. H., Perola, M., … Holle, R. (2013). Data harmonization and federated analysis of population-based studies: The BioSHaRE project. Emerging Themes in Epidemiology, 10(1), 12.
Hughes, R., Beene, M., & Dykes, P. C. (2014). The significance of data harmonization for credentialing research. Washington, DC: Institute of Medicine of the National Academies.
Meroño-Peñuela, A., Ashkpour, A., Guéret, C., & Schlobach, S. (2017). CEDAR: The Dutch historical censuses as linked open data. Semantic Web, 8(2), 297–310.
Yu, X., Ghasemizadeh, R., Padilla, I., Meeker, J. D., Cordero, J. F., & Alshawabkeh, A. (2015). Sociodemographic patterns of household water-use costs in Puerto Rico. Science of the Total Environment, 524, 300–309.
Suggested Reading
Daly, H. (2017). Reaping strategic data benefits from mandatory trade reporting projects. Journal of Securities Operations & Custody, 10(1), 38–44. Retrieved January 1, 2018 from EBSCO Online Database Business Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=127849910&site=ehost-live
Esposito, C., Castiglione, A., Tudorica, C., & Pop, F. (2017). Big Data orchestration as a service network. IEEE Communications Magazine, 55(9), 102–108. Retrieved January 1, 2018 from EBSCO Online Database Business Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=125187243&site=ehost-live
Fortier, I., Raina, P., Van den Heuvel, E. R., Griffith, L. E., Craig, C., Saliba, M., Doiron, D., … Granda, P. (2017). Maelstrom Research guidelines for rigorous retrospective data harmonization. International Journal of Epidemiology, 46(1), 103–105.
Loebbecke, C., & Picot, A. (2015). Reflections on societal and business model transformation arising from digitization and big data analytics: A research agenda. The Journal of Strategic Information Systems, 24(3), 149–157.
Murtagh, M. J., Turner, A., Minion, J. T., Fay, M., & Burton, P. R. (2016). International data sharing in practice: New technologies meet old governance. Biopreservation and Biobanking, 14(3), 231–240.
Ratajczak-Mrozek, M. (2017). Interorganizational network embeddedness and performance of companies active on foreign markets. Journal of Management & Business Administration. Central Europe, 25(4), 144–157. Retrieved January 1, 2018 from EBSCO Online Database Business Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=127333632&site=ehost-live