Thammasat University students interested in social science, political theory, research methods in politics, statistical theory and methods, politics and international relations, and related subjects may find a new book useful.
Data Management for Social Scientists: From Files to Databases is an Open Access book available for free download at this link:
Its author is Professor Nils B. Weidmann, who teaches political science at the University of Konstanz, Germany.
The Thammasat University Library collection includes several books about different aspects of data management.
The publisher’s description of the book follows:
The ‘data revolution’ offers many new opportunities for research in the social sciences. Increasingly, social and political interactions can be recorded digitally, leading to vast amounts of new data available for research. This poses new challenges for organizing and processing research data. This comprehensive introduction covers the entire range of data management techniques, from flat files to database management systems. It demonstrates how established techniques and technologies from computer science can be applied in social science projects, drawing on a wide range of different applied examples. This book covers simple tools such as spreadsheets and file-based data storage and processing, as well as more powerful data management software like relational databases. It goes on to address advanced topics such as spatial data, text as data, and network data. This book is one of the first to discuss questions of practical data management specifically for social science projects.
An introduction to the book notes:
The way in which we conduct empirical social science has changed tremendously over the last few decades. Lewis Fry Richardson, for example, was one of the first researchers to study wars with scientific methods in the first half of the twentieth century. Among many other projects, he put together a dataset on violent conflicts between 1815 and 1945, which he used in his Statistics of Deadly Quarrels. Richardson collected this information on paper, calculating all of the statistics used for his book manually. Today, fortunately, empirical social science leverages the power of modern digital technology for research, and data collection and analysis are typically done using computers.
Most of us are perfectly familiar with the benefits of digital technology for empirical social science research. Many social science curricula – for example, in political science, economics, or sociology – include courses on quantitative methods. Most of the readers of this book are trained to use software packages such as SPSS, Stata, or R for statistical analysis, which relieve us of most of the cumbersome mathematical operations required for this. However, in my experience, there is little emphasis on how to prepare data for analysis. Many analyses require that data from different sources and in potentially different formats be imported, checked, and combined. In the age of “Big Data,” this has become even more difficult due to the larger, and more complex, datasets we typically work with in the social sciences. I wrote this book to close this gap in social science training, and to prepare my readers better for new challenges arising in empirical work in the social sciences. It is a course in data processing and data management, going through a series of tools and software packages that can assist researchers in getting their empirical data ready for analysis. Before we discuss what this book does and who should read it, let us start with a short description of the research cycle and where this book fits in.
Most scientific fields aim to better understand the phenomena they study through the documentation, analysis, and explanation of empirical patterns. This is no different for the social sciences, which are the focus of this book. I fully acknowledge that there is considerable variation in the extent to which social scientists rely on empirical evidence – I certainly do not argue that they necessarily should. However, this book is written for those who routinely use empirical data in their work, and who are looking for ways to improve the processing of these data. How does the typical research workflow operate, and where does the processing of data fit in? We can distinguish three stages of an empirical research project in the social sciences:
- Data collection
- Data processing
- Data analysis
The first stage, data collection, is the collection or acquisition of the data necessary to conduct an empirical analysis. In its simplest form, researchers can rely on data collected and published by someone else. For example, if you conduct a cross-national analysis of economic outcomes, you can obtain data from the comprehensive World Development Indicators database maintained by the World Bank (2021). Here, acquisition is easy for end users of the data and takes just a few mouse clicks. Similarly, excellent survey data can be obtained from large survey projects such as the Demographic and Health Surveys (US Agency for International Development, 2021) or the Afrobarometer (2021). In other cases, data gathering for a research project is more difficult. Researchers oftentimes collect data themselves, for example by coding information from news reports or other sources, or by conducting surveys. In these cases, data collection is a fundamental part of the contribution a research project aims to make, and requires considerable resources.
The output of the first stage is typically a (set of) raw dataset(s). Before the raw data can be used in the analysis, it needs to be processed in different ways. This data processing can include different operations. For example, we may have to adjust text-based codings in our data, since our statistical package can only deal with numbers. In many cases, we need to aggregate information in our dataset; for example, if our original raw data contains survey results at the level of households, but we conduct our analysis at the level of villages, we have to compute the sum or the average over all households in a village. In other cases, we have to combine our original dataset with others. For instance, if we study the relationship between the level of economic development and the level of democracy, we may have to combine information from the World Development Indicators database with data on regime type, for example from the Varieties of Democracy project.
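The processing steps described above – recoding text-based codings into numbers and aggregating household records up to the village level – can be sketched in a few lines. The book itself works with a range of tools; the sketch below uses Python with the pandas library, and the column names and values are purely hypothetical illustrations, not data from the book.

```python
import pandas as pd

# Hypothetical household-level survey data: a village identifier and a
# text-based coding that a statistical package may not accept directly
households = pd.DataFrame({
    "village": ["A", "A", "B", "B", "B"],
    "income": [120, 180, 90, 110, 100],
    "electricity": ["yes", "no", "yes", "yes", "no"],
})

# Recode the text-based variable to numbers
households["electricity"] = households["electricity"].map({"yes": 1, "no": 0})

# Aggregate to the village level: average income and the share of
# households with electricity in each village
villages = households.groupby("village").agg(
    mean_income=("income", "mean"),
    electricity_share=("electricity", "mean"),
).reset_index()

print(villages)
```

The same pattern applies whether the aggregation is a sum, a mean, or a count – only the aggregation functions passed to `agg` change.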
The third stage in our simple research workflow is data analysis. “Analysis” refers to any kind of pattern we derive from the data prepared in the previous stage, such as a correlation table, a graphical visualization of the distribution of a particular variable, or the coefficients from a regression model estimated on the data. Data analysis – whether it is descriptive, graphical, or statistical – requires that our data be provided in a particular format, a format that is not necessarily the most convenient one for data collection or data storage. For example, if we analyze the relationship between development and regime type as mentioned earlier, it is necessary to combine data from different sources into a single dataset that is ultimately used for the analysis. Hence, separating data processing from data analysis – as we do in this book – is not simply a convenient choice in the research workflow, but rather a necessity.
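The combination step mentioned above – joining development indicators with regime-type data into a single analysis dataset – amounts to merging two tables on a shared country-year key. A minimal sketch in Python with pandas follows; the country names and values are invented for illustration and do not come from the World Development Indicators or the Varieties of Democracy project.

```python
import pandas as pd

# Hypothetical country-year extracts from two different sources
development = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Ghana"],
    "year": [2019, 2020, 2020],
    "gdp_pc": [1800, 1850, 2200],
})
regimes = pd.DataFrame({
    "country": ["Kenya", "Ghana", "Ghana"],
    "year": [2020, 2019, 2020],
    "polyarchy": [0.45, 0.70, 0.71],
})

# Combine the two sources on the shared country-year key; an inner
# join keeps only the observations present in both datasets
analysis = development.merge(regimes, on=["country", "year"], how="inner")
print(analysis)
```

Choosing the join type matters for the analysis: an inner join silently drops country-years missing from either source, whereas an outer or left join keeps them with missing values, making gaps in coverage visible.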
(All images courtesy of Wikimedia Commons)