January 14, 2025

Outline

Lecture:

  • Housekeeping
  • Types of quantitative data
  • Data quality standards

Lab:

  • Download R & RStudio
  • Basics of RStudio
  • What is an object?
  • What is a function?
  • What is a package?

Office Hours

  • Shanaya Vanhooren
    • Office hours: Tuesdays 3:30-4:30pm in SSC 7317
  • Noah Vanderhoeven
    • Office hours: Thursdays 1:00-2:00pm in SSC 7330
  • Jan Eckhardt
    • Office hours: Mondays 1:00-2:00pm in SSC 7328
  • all offices are located on the 7th floor of the Social Science Center in the Political Science Department
  • please make every effort to attend one of our office hours to discuss any questions or concerns before emailing your question or requesting a meeting outside of regularly scheduled office hours

Understanding quantitative data types

Quantitative data

  • Quantitative data is “any type of data that is numeric in form” or assigned a numeric value (Brancati 2018, p.231).
    • Varies in type (the process by which it is collected), size, structure, and quality
    • Today we will focus on type and quality

Observational data

  • Observational data is “collected without researchers interacting with their subjects or their environment” (Brancati 2018, p.231).
    • e.g., socio-economic data, transactional data, various forms of “big data”

  • New technologies are increasingly used to collect observational data, and the resulting data can be useful in research on politics and other areas of social science (especially economics).
  • In their article “The View from Above: Applications of Satellite Data in Economics”, Donaldson and Storeygard (2016) review how satellite imagery data is used in economics.

Source: Donaldson, D., & Storeygard, A. (2016). The view from above: Applications of satellite data in economics. Journal of Economic Perspectives, 30(4), 171–198. https://doi.org/10.1257/jep.30.4.171

  • Other examples of observational data
    • reading assigned for last week’s class: Casaburi & Troiano (2016)
    • in the study of urban politics: measures of urban land supply in U.S. cities using topographical data
    • written text, spoken word, and other visual materials (e.g., blog posts, social media posts, Hansard or parliament records, including video records)
    • most forms of “big data”, which we will discuss in more detail later on in the term
    • measurement of event occurrence (e.g., treaties, wars)

Non-observational data

  • Non-observational data “is collected through researchers either interacting with their subjects or intervening in their subjects’ environments” (Brancati 2018, p.234).
    • e.g., surveys and polling data, experiments

Observational data - pros and cons

  • often more widely available and easier to collect, since it typically does not require ethics clearance
  • not subject to observer bias (bias introduced when the researcher’s words or actions influence subjects’ responses)
  • sometimes very “messy” (lacks structure) and takes significant effort to re-structure into a usable format (e.g., social media data, data scraped from the web)
  • over time, some observational data may be subject to guinea pig or measurement effects (people change their behaviours because they are aware that it is being measured, tracked etc.)

Non-observational data - pros and cons

  • may not represent the real world well
  • subject to potential observer bias
  • guinea pig or measurement effects
    • e.g., social desirability bias: the tendency of participants to over-report socially “desirable” behaviour and under-report “undesirable” behaviour
  • challenges with gathering a large, representative sample
  • can be used to study outcomes associated with policies, institutions, or practices that do not exist in the real world (yet)
  • tends to exist in a cleaner format because it was collected and documented by the researcher(s) for a specific purpose

Evaluating the quality of our data

When we collect data or deal with off-the-shelf data, we can use the following criteria to evaluate data quality:

  • Accuracy
  • Data Validity
  • Precision
  • Completeness
  • Consistency

  • Accuracy: is the data reflective of real-world values? Is it correct?
    • A dataset that records the number of federal by-elections held every year in Canada is accurate if the number reported is the same as the number that actually occurred.
    • E.g., if the dataset only recorded by-elections for 2024, it would be accurate if it recorded 5 by-elections.
      • (By-elections were held in the ridings of Durham, Ontario (March 4), Toronto-St. Paul’s, Ontario (June 24), Elmwood-Transcona, Manitoba (September 16), LaSalle-Émard-Verdun, Quebec (September 16), and Cloverdale-Langley City, British Columbia (December 16)).

  • Data Validity: do the scores of a variable accurately capture what the variable is said to represent or indicate?

    • “describes the extent to which data depicts the measures they claim to represent” (Brancati 2018, p.235)

    • e.g., commonly used measures of prison overcrowding or voter turnout may not fully capture the concepts they claim to represent

  • Precision: increases as we measure data in smaller units or intervals.
    • We should measure our data as precisely as is feasible without sacrificing accuracy or validity.
    • A more precise measure may actually be less accurate; with sensitive survey questions, for instance, respondents may answer a bracketed question (e.g., an income range) more honestly than a request for an exact value (see the short R sketch below).
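
A minimal sketch in R of the precision trade-off, using made-up income values and illustrative bracket cut-points (none of these numbers come from the lecture):

    # Hypothetical exact incomes (high precision)
    income_exact <- c(18500, 42300, 57800, 61200, 104000)

    # cut() collapses the precise values into broader brackets (lower precision);
    # respondents may be more willing to answer a bracketed question honestly
    income_bracket <- cut(income_exact,
                          breaks = c(0, 25000, 50000, 75000, 100000, Inf),
                          labels = c("<25k", "25-50k", "50-75k", "75-100k", "100k+"))

    income_bracket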

  • Completeness: a dataset is complete if it (1) includes values for the whole universe of relevant cases and (2) includes observations for all of the relevant measures or variables in the data.

    • e.g., (1) includes the whole universe of relevant cases: a survey of provincial voters shouldn’t exclude voters located in Manitoba (unless there is some theoretically driven reason to do so).

    • e.g., (2) includes observations for all relevant measures or variables: if the survey of provincial voters includes a variable indicating respondents’ ages, it shouldn’t be missing the ages of the respondents in Manitoba while recording the ages of all other voters (a short R sketch of a completeness check follows below).
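
A minimal sketch in R of a completeness check on a small made-up data frame (the column names and values are illustrative only):

    # Hypothetical survey extract: the Manitoba respondent is missing an age
    voters <- data.frame(
      province = c("Ontario", "Ontario", "Manitoba", "Quebec"),
      age      = c(34, 51, NA, 29)
    )

    # Count missing values in each column; non-zero counts flag incomplete variables
    colSums(is.na(voters))

    # Check that a relevant case (here, Manitoba) appears at least once
    "Manitoba" %in% voters$province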

  • Consistency: Data consistency “refers to the absence of contradictions in the data” (Brancati 2018, p.238). For example, data is consistent when cases are coded according to the same rules and the data are collected using the same types of sources (a short R sketch of a consistency check follows below).

    • Inconsistent data lacks validity, but consistency does not guarantee validity.
    • Consistency cannot make up for low levels of validity.
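
A minimal sketch in R of a consistency check, assuming a made-up variable in which the same provinces were coded under two different rules:

    # Hypothetical province codes mixing two labelling schemes
    province <- c("ON", "Ontario", "MB", "Manitoba", "ON")

    # table() shows "ON" and "Ontario" (and "MB" and "Manitoba") counted as
    # separate categories, i.e., the cases were not coded according to the same rules
    table(province)

    # One possible fix: recode everything to a single labelling scheme
    province_clean <- ifelse(province %in% c("ON", "Ontario"), "Ontario",
                             ifelse(province %in% c("MB", "Manitoba"), "Manitoba", province))
    table(province_clean)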

  • 15-minute break and attendance sign-in

  • return for the hands-on component of the class (a brief R preview of the lab topics follows below)
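
A minimal R preview of the three lab topics listed in the outline (objects, functions, packages); the object names are made up and the package shown is just one common example, not necessarily the one used in lab:

    # An object stores a value (or data) under a name you choose
    turnout <- c(62.5, 58.9, 67.0)    # a numeric vector object

    # A function takes inputs and returns an output
    mean(turnout)                     # built-in function: average of the vector

    # A package is a bundle of functions (and sometimes data) written by others
    # install.packages("tidyverse")   # download a package once (left commented out)
    # library(tidyverse)              # load it at the start of each R session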