January 14, 2025

Outline

Lecture:

  • Housekeeping
  • Types of quantitative data
  • Data quality standards

Lab:

  • Download R & RStudio
  • Basics of RStudio
  • What is an object?
  • What is a function?
  • What is a package?

Office Hours

  • Shanaya Vanhooren
    • Office hours: Tuesdays 3:30-4:30pm in SSC 7317
  • Noah Vanderhoeven
    • Office hours: Thursdays 1:00-2:00pm in SSC 7330
  • Jan Eckhardt
    • Office hours: Mondays 1:00-2:00pm in SSC 7328
  • all offices are located on the 7th floor of the Social Science Center in the Political Science Department
  • please make every effort to attend one of our office hours to discuss any questions or concerns before emailing your question or requesting a meeting outside of regularly scheduled office hours

Understanding quantitative data types

Quantitative data

  • Quantitative data is “any type of data that is numeric in form” or assigned a numeric value (Brancati 2018, p.231).
    • Varies in type (the process by which it is collected), size, structure, and quality
    • Today we will focus on type and quality

Observational data

  • Observational data is “collected without researchers interacting with their subjects or their environment” (Brancati 2018, p.231).
    • e.g., socio-economic data, transactional data, various forms of “big data”

  • New technologies are increasingly used to collect observational data, and the resulting data can be useful in research on politics and other areas of social science (especially economics).
  • In their article “The View from Above: Applications of Satellite Data in Economics”, Donaldson and Storeygard (2016) review how satellite imagery data is used in economics.

Source: Donaldson, D., & Storeygard, A. (2016). The view from above: Applications of satellite data in economics. Journal of Economic Perspectives, 30(4), 171–198. https://doi.org/10.1257/jep.30.4.171

  • Other examples of observational data
    • reading assigned for last week’s class: Casaburi & Troiano (2016)
    • in the study of urban politics: measures of urban land supply in U.S. cities using topographical data
    • written text, spoken word, and other visual materials (e.g., blog posts, social media posts, Hansard or parliament records, including video records)
    • most forms of “big data”, which we will discuss in more detail later on in the term
    • measurement of event occurrence (e.g., treaties, wars)

Non-observational data

  • Non-observational data “is collected through researchers either interacting with their subjects or intervening in their subjects’ environments” (Brancati 2018, p.234).
    • e.g., surveys and polling data, experiments

Observational data - pros and cons

  • often more widely available and easier to collect, since it typically does not require ethics clearance
  • not subject to observer bias (bias introduced when the researcher’s words or actions influence subjects’ responses)
  • sometimes very “messy” (lacks structure) and takes significant effort to re-structure into a usable format (e.g., social media data, data scraped from the web)
  • over time, some observational data may be subject to guinea pig or measurement effects (people change their behaviours because they are aware that it is being measured, tracked etc.)

Non-observational data - pros and cons

  • may not represent the real world well
  • subject to potential observer bias
  • guinea pig or measurement effects
    • e.g., social desirability bias: the tendency of participants to over-report socially “desirable” behaviour and under-report “undesirable” behaviour
  • challenges with gathering a large, representative sample
  • can be used to study outcomes associated with policies, institutions, or practices that do not exist in the real world (yet)
  • tends to exist in a cleaner format because it was collected and documented by the researcher(s) for a specific purpose

Evaluating the quality of our data

When we collect data or deal with off-the-shelf data, we can use the following criteria to evaluate data quality:

  • Accuracy
  • Data Validity
  • Precision
  • Completeness
  • Consistency

  • Accuracy: is the data reflective of real-world values? Is it correct?
    • A dataset that records the number of federal by-elections held every year in Canada is accurate if the number reported is the same as the number that actually occurred.
    • E.g., if the dataset only recorded by-elections for 2024, it would be accurate if it recorded 5 by-elections.
      • (By-elections were held in the ridings of Durham, Ontario (March 4), Toronto-St. Paul’s, Ontario (June 24), Elmwood-Transcona, Manitoba (September 16), LaSalle-Émard-Verdun, Quebec (September 16), and Cloverdale-Langley City, British Columbia (December 16)).

  • Data Validity: do the scores of a variable accurately capture what the variable is said to represent or indicate?

    • “describes the extent to which data depicts the measures they claim to represent” (Brancati 2018, p.235)

    • e.g., commonly used measures of prison overcrowding or voter turnout may not fully capture the concepts they claim to represent

  • Precision: increases as we measure data in smaller units or intervals.
    • We should measure our data as precisely as is feasible without sacrificing accuracy or validity.
    • A more precise measure may actually be less accurate; with sensitive survey questions, for instance, respondents may answer a bracketed question (e.g., an income range) more honestly than a request for an exact value (see the short R sketch below).
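
A minimal sketch in R of the precision trade-off, using made-up income values and illustrative bracket cut-points (none of these numbers come from the lecture):

    # Hypothetical exact incomes (high precision)
    income_exact <- c(18500, 42300, 57800, 61200, 104000)

    # cut() collapses the precise values into broader brackets (lower precision);
    # respondents may be more willing to answer a bracketed question honestly
    income_bracket <- cut(income_exact,
                          breaks = c(0, 25000, 50000, 75000, 100000, Inf),
                          labels = c("<25k", "25-50k", "50-75k", "75-100k", "100k+"))

    income_bracket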

  • Completeness: a dataset is complete if it (1) includes values for the whole universe of relevant cases and (2) includes observations for all of the relevant measures or variables in the data.

    • e.g., (1) includes the whole universe of relevant cases: a survey of provincial voters shouldn’t exclude voters located in Manitoba (unless there is some theoretically driven reason to do so).

    • e.g., (2) includes observations for all relevant measures or variables: if the survey of provincial voters includes a variable indicating respondents’ ages, it shouldn’t be missing the ages of the respondents in Manitoba while recording the ages of all other voters (a short R sketch of a completeness check follows below).
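
A minimal sketch in R of a completeness check on a small made-up data frame (the column names and values are illustrative only):

    # Hypothetical survey extract: the Manitoba respondent is missing an age
    voters <- data.frame(
      province = c("Ontario", "Ontario", "Manitoba", "Quebec"),
      age      = c(34, 51, NA, 29)
    )

    # Count missing values in each column; non-zero counts flag incomplete variables
    colSums(is.na(voters))

    # Check that a relevant case (here, Manitoba) appears at least once
    "Manitoba" %in% voters$province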

  • Consistency: Data consistency “refers to the absence of contradictions in the data” (Brancati 2018, p.238). For example, data is consistent when cases are coded according to the same rules and the data are collected using the same types of sources (a short R sketch of a consistency check follows below).

    • Inconsistent data lacks validity, but consistency does not guarantee validity.
    • Consistency cannot make up for low levels of validity.
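
A minimal sketch in R of a consistency check, assuming a made-up variable in which the same provinces were coded under two different rules:

    # Hypothetical province codes mixing two labelling schemes
    province <- c("ON", "Ontario", "MB", "Manitoba", "ON")

    # table() shows "ON" and "Ontario" (and "MB" and "Manitoba") counted as
    # separate categories, i.e., the cases were not coded according to the same rules
    table(province)

    # One possible fix: recode everything to a single labelling scheme
    province_clean <- ifelse(province %in% c("ON", "Ontario"), "Ontario",
                             ifelse(province %in% c("MB", "Manitoba"), "Manitoba", province))
    table(province_clean)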

  • 15-minute break and attendance sign-in

  • return for the hands-on component of the class (a brief R preview of the lab topics follows below)
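
A minimal R preview of the three lab topics listed in the outline (objects, functions, packages); the object names are made up and the package shown is just one common example, not necessarily the one used in lab:

    # An object stores a value (or data) under a name you choose
    turnout <- c(62.5, 58.9, 67.0)    # a numeric vector object

    # A function takes inputs and returns an output
    mean(turnout)                     # built-in function: average of the vector

    # A package is a bundle of functions (and sometimes data) written by others
    # install.packages("tidyverse")   # download a package once (left commented out)
    # library(tidyverse)              # load it at the start of each R session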