# List of packages
packages <- c("tidyverse", "srvyr", "srvyrexploR", "broom", "gt", "modelsummary", "gapminder") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
invisible(lapply(packages, library, character.only = TRUE)) # invisible() suppresses lapply's returned list
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'srvyr'
##
##
## The following object is masked from 'package:stats':
##
## filter
##
##
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
## backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
##
## Revert to `kableExtra` for one session:
##
## options(modelsummary_factory_default = 'kableExtra')
## options(modelsummary_factory_latex = 'kableExtra')
## options(modelsummary_factory_html = 'kableExtra')
##
## Silence this message forever:
##
## config_modelsummary(startup_message = FALSE)
In our previous session, we worked with the gapminder dataset to learn basic data manipulation skills. While gapminder is excellent for learning data manipulation, it differs from the data we typically encounter in social science research:
Gapminder (Teaching Dataset):
Clean and well-organized
No missing values
Small number of variables (6)
Clear relationships
No complex survey design
Real Social Science Data:
Missing values and inconsistencies
Complex survey designs with weights
Much larger number of variables
Different types of measurements
Data quality issues
Today, we’ll learn to describe and understand real survey data using a major social science dataset:
American National Election Studies (ANES):
Measures political attitudes and behaviors
Uses complex survey design to represent U.S. population
Mix of categorical and ordinal variables
Conducted biennially since 1948
Before we analyze data, we need to understand what kinds of variables we have. Each type requires different descriptive statistics:
We could do the following:
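The output below comes from dplyr's glimpse(), which prints each column with its type and first few values (the anes_2020 data frame ships with the srvyrexploR package):

glimpse(anes_2020)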
## Rows: 7,453
## Columns: 65
## $ V200001 <dbl> 200015, 200022, 200039, 200046, 200053, 200060…
## $ CaseID <dbl> 200015, 200022, 200039, 200046, 200053, 200060…
## $ V200002 <hvn_lbll> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ InterviewMode <fct> Web, Web, Web, Web, Web, Web, Web, Web, Web, W…
## $ V200010b <dbl> 1.0057375, 1.1634731, 0.7686811, 0.5210195, 0.…
## $ Weight <dbl> 1.0057375, 1.1634731, 0.7686811, 0.5210195, 0.…
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2…
## $ VarUnit <fct> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 3…
## $ Stratum <fct> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 3…
## $ V201006 <hvn_lbll> 2, 3, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 1,…
## $ CampaignInterest <fct> Somewhat interested, Not much interested, Some…
## $ V201023 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1…
## $ EarlyVote2020 <fct> NA, NA, NA, NA, NA, NA, NA, NA, Yes, NA, NA, N…
## $ V201024 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 2, -1, -1…
## $ V201025x <hvn_lbll> 3, 3, 3, 3, 3, 3, 3, 2, 4, 3, 3, 3, 2, 4,…
## $ V201028 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1…
## $ V201029 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1…
## $ V201101 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 2, …
## $ V201102 <hvn_lbll> 1, 1, 1, 1, 1, 2, 1, 2, -1, -1, -1, 1, 2,…
## $ VotedPres2016 <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes…
## $ V201103 <hvn_lbll> 2, 5, 1, 1, 2, -1, 5, -1, 1, 1, -1, 1, -1…
## $ VotedPres2016_selection <fct> Trump, Other, Clinton, Clinton, Trump, NA, Oth…
## $ V201228 <hvn_lbll> 2, 5, 3, 2, 3, 3, 2, 2, 3, 1, 1, 1, 2, 1,…
## $ V201229 <hvn_lbll> 1, -1, -1, 2, -1, -1, 2, 2, -1, 2, 1, 2, …
## $ V201230 <hvn_lbll> -1, 2, 3, -1, 2, 3, -1, -1, 2, -1, -1, -1…
## $ V201231x <hvn_lbll> 7, 4, 3, 6, 4, 3, 6, 6, 4, 2, 1, 2, 7, 2,…
## $ PartyID <fct> Strong republican, Independent, Independent-de…
## $ V201233 <hvn_lbll> 5, 5, 4, 3, 5, 4, 4, 1, 3, 3, 2, 3, 4, 5,…
## $ TrustGovernment <fct> Never, Never, Some of the time, About half the…
## $ V201237 <hvn_lbll> 3, 4, 4, 2, 4, 2, 4, 1, 3, 2, 4, 3, 4, 3,…
## $ TrustPeople <fct> About half the time, Some of the time, Some of…
## $ V201507x <hvn_lbll> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43, 3…
## $ Age <dbl> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43, 37, 55…
## $ AgeGroup <fct> 40-49, 30-39, 40-49, 40-49, 70 or older, 70 or…
## $ V201510 <hvn_lbll> 6, 3, 2, 4, 8, 3, 4, 2, 2, 4, 2, 2, 2, 7,…
## $ Education <fct> Bachelor's, Post HS, High school, Post HS, Gra…
## $ V201546 <hvn_lbll> 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2,…
## $ V201547a <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547b <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547c <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547d <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547e <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547z <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201549x <hvn_lbll> 3, 4, 1, 4, 5, 1, 1, 1, 1, 3, 3, 1, 1, 4,…
## $ RaceEth <fct> "Hispanic", "Asian, NH/PI", "White", "Asian, N…
## $ V201600 <hvn_lbll> 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1,…
## $ Gender <fct> Male, Female, Female, Male, Male, Female, Fema…
## $ V201607 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201610 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201611 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201613 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201615 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201616 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201617x <hvn_lbll> 21, 13, 17, 7, 22, 3, 4, 3, 10, 11, 9, 18…
## $ Income <fct> "$175,000-249,999", "$70,000-74,999", "$100,00…
## $ Income7 <fct> $125k or more, $60k to < 80k, $100k to < 125k,…
## $ V202051 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1…
## $ V202066 <hvn_lbll> 1, 4, 4, 4, 4, 4, 4, 1, -1, 4, 4, 4, 4, -…
## $ V202072 <hvn_lbll> -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1,…
## $ VotedPres2020 <fct> NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, Yes, Yes…
## $ V202073 <hvn_lbll> -1, 3, 1, 1, 2, 1, 2, -1, -1, 1, 1, 1, 2,…
## $ V202109x <hvn_lbll> 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ V202110x <hvn_lbll> -1, 3, 1, 1, 2, 1, 2, -1, 1, 1, 1, 1, 2, …
## $ VotedPres2020_selection <fct> NA, Other, Biden, Biden, Trump, Biden, Trump, …
Understanding check: how many variables and observations are there?
But that is not very effective: there is too much to look at at once, and many variables carry cryptic codes (e.g., V200001).
Our course textbook provides some key information:
https://tidy-survey-r.github.io/tidy-survey-book/anes-cb.html
But you can also look directly at the dataset website, or consult the full PDF documentation (which is 796 pages!):
According to the ANES documentation, one of our key variables is trust in government, which asks:
“How often can you trust the federal government in Washington to do what is right?”
Response options:
Always
Most of the time
About half the time
Some of the time
Never
What we want to know: How many people responded and what did they say?
Suppose you just wanted to do a quick check:
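One quick way to produce the counts below is base R's table(), which tabulates each factor level (and silently drops NAs by default):

table(anes_2020$TrustGovernment)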
##
## Always Most of the time About half the time Some of the time
## 80 1016 2313 3313
## Never
## 702
Now suppose you wanted to both arrange the responses and find out the most "frequent" or "typical" response/value, which is called the mode.
How we’ll find out: Let’s count responses by category and arrange
trust_counts <- anes_2020 %>%
group_by(TrustGovernment) %>% # Organize by response option
summarize(count = n()) %>% # Count responses in each group
filter(!is.na(TrustGovernment)) %>% # Remove missing values
arrange(desc(count)) # Sort by frequency
trust_counts
## # A tibble: 5 × 2
## TrustGovernment count
## <fct> <int>
## 1 Some of the time 3313
## 2 About half the time 2313
## 3 Most of the time 1016
## 4 Never 702
## 5 Always 80
What we found:
Most common response was “Some of the time” (3,313 people)
Followed by “About half the time” (2,313 people)
Only 80 people said they “Always” trust government
Therefore, the mode is “Some of the time”.
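As an aside, dplyr's count() collapses the group_by()/summarize()/arrange() pattern above into a single call; a minimal sketch:

anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%     # remove missing values
  count(TrustGovernment, sort = TRUE) %>% # counts, sorted descending
  slice(1)                                # the top row is the mode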
Now suppose you wanted the relative proportion or percentage (the proportion multiplied by 100), giving you the distribution of responses.
What we want to know: What percentage of people gave each response?
How we’ll find out: Let’s convert our counts into proportions
trust_props <- anes_2020 %>%
filter(!is.na(TrustGovernment)) %>% # Remove missing values first
group_by(TrustGovernment) %>% # Group by trust response
summarize( # Calculate counts and percentages
count = n(),
percentage = round(100 * n() / sum(n()), 1) # Percentage with 1 decimal
) %>%
arrange(desc(count)) # Sort by frequency
trust_props
## # A tibble: 5 × 3
## TrustGovernment count percentage
## <fct> <int> <dbl>
## 1 Some of the time 3313 100
## 2 About half the time 2313 100
## 3 Most of the time 1016 100
## 4 Never 702 100
## 5 Always 80 100
Uh oh, what happened? Every count shows 100%, which is definitely not right! These things happen when we go too quickly. That's why it's important to check, at every stage, what you are doing and what it led to.
Understanding the Issue:
Inside summarize(), sum(n()) is calculated within each group (because of our group_by())
This means each group's count is being divided by its own sum
Of course dividing anything by itself gives 100%!
Let’s fix this by calculating the total before grouping:
# First, let's store the total valid responses
total_valid <- anes_2020 %>%
filter(!is.na(TrustGovernment)) %>%
nrow()
# Now calculate proportions using this total
trust_props <- anes_2020 %>%
filter(!is.na(TrustGovernment)) %>% # Remove missing values first
group_by(TrustGovernment) %>% # Group by trust response
summarize( # Calculate counts and percentages
count = n(),
percentage = round(100 * count / total_valid, 1) # Use total_valid instead of sum(n())
) %>%
arrange(desc(count)) # Sort by frequency
trust_props
## # A tibble: 5 × 3
## TrustGovernment count percentage
## <fct> <int> <dbl>
## 1 Some of the time 3313 44.6
## 2 About half the time 2313 31.2
## 3 Most of the time 1016 13.7
## 4 Never 702 9.5
## 5 Always 80 1.1
Now we get sensible percentages that add up to 100%!
Key Lesson: Always check the results of what you did.
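An equivalent, slightly more compact fix is to compute the counts first and take percentages in a separate mutate() step, where sum(count) sees all rows because the grouping has already been dropped; a sketch:

anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%        # remove missing values first
  count(TrustGovernment, name = "count") %>% # counts per category, ungrouped result
  mutate(percentage = round(100 * count / sum(count), 1)) %>% # sum(count) is the grand total
  arrange(desc(count))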
Next, let's describe our respondents' ages. We'll explore different ways of finding the "middle" or "center" of our data; each method tells us something different about our respondents' ages.
What we want to know: What age splits our sample in half?
How we’ll find out: Let’s find the middle value when ages are ordered
age_median <- anes_2020 %>%
filter(!is.na(Age)) %>% # Remove missing ages
summarize(
median_age = median(Age),
n_valid = n() # Count valid responses
)
age_median
## # A tibble: 1 × 2
## median_age n_valid
## <dbl> <int>
## 1 53 7159
What we found:
Median age is 53 years
Half of respondents are younger than 53
Half are older than 53
Based on valid responses (after removing NAs)
What we want to know: What’s the mathematical average age?
How we’ll find out: Let’s add all ages and divide by the count
age_mean <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
mean_age = mean(Age),
n_valid = n()
)
age_mean
## # A tibble: 1 × 2
## mean_age n_valid
## <dbl> <int>
## 1 51.8 7159
What we found:
Average age is 51.8 years
Mean is slightly lower than median (51.8 vs 53)
This small difference tells us something about our age distribution (we will come back to this during distribution week!)
What we want to know: How wide is the range of ages and where do most respondents fall?
How we’ll find out: Let’s explore different ways to measure spread in our data
# Calculate the range of ages
age_range <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
min_age = min(Age),
max_age = max(Age),
age_range = max_age - min_age
)
age_range
## # A tibble: 1 × 3
## min_age max_age age_range
## <dbl> <dbl> <dbl>
## 1 18 80 62
What we found:
Youngest respondent: 18 years old
Oldest respondent: 80 years old
Total range: 62 years
Note: Range is sensitive to extreme values (consider, e.g., income in the US from lowest to highest)
What we want to know: Where do the middle 50% of ages fall?
How we’ll find out: Let’s divide our data into quarters
# Calculate quartiles
age_quartiles <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
q1 = quantile(Age, 0.25), # First quartile (25th percentile)
median = median(Age), # Second quartile (50th percentile)
q3 = quantile(Age, 0.75), # Third quartile (75th percentile)
iqr = q3 - q1 # Interquartile range
)
age_quartiles
## # A tibble: 1 × 4
## q1 median q3 iqr
## <dbl> <dbl> <dbl> <dbl>
## 1 37 53 66 29
Understanding the Quartiles:
Q1 (25th percentile): 37 years - 25% are younger
Median (50th percentile): 53 years - middle value
Q3 (75th percentile): 66 years - 75% are younger
IQR: 29 years (i.e., 37 to 66) - where the middle 50% of ages fall
What we want to know: Can we get all key statistics at once?
How we’ll find out: Let’s use R’s summary() function
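Applied to our age variable, that is a single line:

summary(anes_2020$Age)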
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 37.00 53.00 51.83 66.00 80.00 294
Understanding the Output:
Minimum: 18 years (youngest respondent)
1st Quartile: 37 years (25% are younger)
Median: 53 years (middle value)
Mean: 51.83 years (average)
3rd Quartile: 66 years (75% are younger)
Maximum: 80 years (oldest respondent)
Missing Values: 294 NAs
The relationship between mean and median can tell us about the shape of our distribution:
When mean ≈ median: Suggests symmetrical distribution
When mean < median: Suggests negative skew (tail to left)
When mean > median: Suggests positive skew (tail to right)
We will come back to this during the distribution week and visualize!
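For quick intuition, here is a tiny made-up example in which one extreme value drags the mean far above the median (positive skew):

x <- c(1, 2, 3, 4, 100)
mean(x)   # 22 - pulled up by the extreme value
median(x) # 3  - unaffected by it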
Standard deviation measures the typical distance from the average. The formula is:
SD = sqrt( Σ(x - μ)² / n )
Where:
- SD is the standard deviation
- x is each value in our data
- μ (mu) is the mean
- n is the number of observations
- Σ means "sum up"
Why do we calculate it this way?
First, we find each value’s distance from the mean (x - μ). This tells us how far each observation is from average.
We square these differences for two key reasons: squaring makes every distance positive (so deviations above and below the mean don’t cancel out), and it gives extra weight to observations far from the mean.
We average these squared differences (÷ n) to get a typical squared distance.
Finally, we take the square root to get back to the original units (years in our case).
Let’s use age in our ANES survey as an example:
# Calculate mean and standard deviation of age
age_stats <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
mean_age = round(mean(Age), 1),
sd_age = round(sd(Age), 1)
)
age_stats
## # A tibble: 1 × 2
## mean_age sd_age
## <dbl> <dbl>
## 1 51.8 17.1
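As a sanity check, we can apply the formula by hand (a minimal sketch; note that R's sd() actually divides by n - 1, the sample standard deviation, so it differs slightly from the divide-by-n formula above):

ages <- na.omit(anes_2020$Age)                        # drop missing ages
sqrt(sum((ages - mean(ages))^2) / (length(ages) - 1)) # reproduces sd(ages)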
In our data:
The average age is about 52 years
The standard deviation is about 17 years
What does this mean in plain language?
Most respondents’ ages fall within 17 years of 52 years old
So most people are between 35 and 69 years old
This gives us a sense of how “spread out” the ages are
There are some helpful guidelines about standard deviation:
The 68-95-99.7 Rule:
About 68% of people fall within 1 standard deviation of the mean
About 95% fall within 2 standard deviations
Almost everyone (99.7%) falls within 3 standard deviations
Let’s see what this means for our age data:
# Calculate the intervals
age_ranges <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
mean = round(mean(Age), 1),
sd = round(sd(Age), 1),
# One SD range (about 68% of people)
one_sd_low = round(mean - sd, 0),
one_sd_high = round(mean + sd, 0),
# Two SD range (about 95% of people)
two_sd_low = round(mean - (2 * sd), 0),
two_sd_high = round(mean + (2 * sd), 0)
)
age_ranges
## # A tibble: 1 × 6
## mean sd one_sd_low one_sd_high two_sd_low two_sd_high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 51.8 17.1 35 69 18 86
What This Tells Us:
Most respondents (68%) are between 35 and 69 years old
Almost all (95%) are between 18 and 86 years old
This makes sense given the voting age minimum of 18 (we verify the rule directly in the sketch below)
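Because the 68-95-99.7 rule strictly describes normal distributions, and ages here are bounded at 18 and 80, it is worth checking how well it holds empirically; a quick sketch:

anes_2020 %>%
  filter(!is.na(Age)) %>%
  summarize(
    within_1sd = mean(abs(Age - mean(Age)) <= sd(Age)),    # share within 1 SD
    within_2sd = mean(abs(Age - mean(Age)) <= 2 * sd(Age)) # share within 2 SDs
  )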
Identifying Unusual Cases
Values more than 2 standard deviations from the mean are unusual
Values more than 3 standard deviations are very rare
This helps us spot interesting patterns or potential data-quality issues
Use Range When:
You need a quick overview
Extreme values matter
Explaining to non-technical audiences
Use Quartiles When:
You want to know about typical “segments”
Outliers might distort your picture
You need to identify “middle” or “quarter” groups
Use Standard Deviation When:
You need precise measures of spread
You want to report both the average and typical variation around a key variable (typical in a descriptive table)
A fundamental skill we will practice is leveraging key descriptive statistics and turning them into clear tables that describe our sample or provide an overview of its key characteristics. Such a table (in some variant) is a standard element of quantitative research articles, often referred to as a "Table 1".
While learning to create customized tables is important, sometimes we need quick descriptive statistics during our exploratory phase. The datasummary_skim() function from the modelsummary package offers a simple way to generate descriptive statistics.
Let’s look at a few numeric variables using this simpler approach:
# Create a simpler dataset with just numeric variables of interest
demo_vars <- anes_2020 %>%
select(Age, Income7) %>%
  # Ensure variables are numeric (as.numeric() on a factor like Income7 returns its underlying integer codes, here 1-7)
mutate(across(everything(), as.numeric))
# Quick summary
datasummary_skim(demo_vars)
|         | Unique | Missing Pct. | Mean | SD   | Min  | Median | Max  | Histogram |
|---------|--------|--------------|------|------|------|--------|------|-----------|
| Age     | 64     | 4            | 51.8 | 17.1 | 18.0 | 53.0   | 80.0 |           |
| Income7 | 8      | 7            | 4.0  | 2.1  | 1.0  | 4.0    | 7.0  |           |
For categorical variables like trust and education, we can use the same function with a different type:
# Create a dataset with categorical variables
cat_vars <- anes_2020 %>%
select(TrustGovernment, Education) %>%
# Ensure variables are factors
mutate(across(everything(), as.factor))
# Quick summary of categorical variables
datasummary_skim(cat_vars, type = "categorical")
|                 |                     | N    | %    |
|-----------------|---------------------|------|------|
| TrustGovernment | Always              | 80   | 1.1  |
|                 | Most of the time    | 1016 | 13.6 |
|                 | About half the time | 2313 | 31.0 |
|                 | Some of the time    | 3313 | 44.5 |
|                 | Never               | 702  | 9.4  |
| Education       | Less than HS        | 312  | 4.2  |
|                 | High school         | 1160 | 15.6 |
|                 | Post HS             | 2514 | 33.7 |
|                 | Bachelor's          | 1877 | 25.2 |
|                 | Graduate            | 1474 | 19.8 |
Advantages of this approach:
Quick and easy to use
Standardized output format
Good for initial data exploration
Requires less coding than custom tables
While datasummary_skim() is great for quick exploration, it offers less customization than the approach based on the gt package that we'll learn next. Think of it as another useful tool in your toolkit.
Before creating tables, we need to understand what we want to show and check our data. Let’s work through this step by step.
First, let’s look at key variables we might want to include:
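The three checks below can be produced along these lines (a sketch):

anes_2020 %>% count(Education, sort = TRUE) # counts per education level, including NA
table(anes_2020$TrustGovernment)            # trust responses (NAs dropped)
summary(anes_2020$Age)                      # five-number summary, mean, and NA count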
## # A tibble: 6 × 2
## Education n
## <fct> <int>
## 1 Post HS 2514
## 2 Bachelor's 1877
## 3 Graduate 1474
## 4 High school 1160
## 5 Less than HS 312
## 6 <NA> 116
##
## Always Most of the time About half the time Some of the time
## 80 1016 2313 3313
## Never
## 702
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 37.00 53.00 51.83 66.00 80.00 294
What we learned:
Education has several categories we might want to combine
Trust has 5 categories from Never –> Always
Age ranges from 18-80
Some missing values to consider
We’ll create two key tables:
Sample characteristics (demographics)
Distribution of trust in government (outcome of interest)
Let’s start with sample characteristics.
First, let’s calculate our basic statistics:
# Calculate basic statistics
sample_stats <- anes_2020 %>% # Start with our dataset
summarize( # Create summary statistics
# Count total number of respondents
n_total = n(),
# Calculate mean and standard deviation of Age
age_mean = mean(Age, na.rm = TRUE), # na.rm = TRUE tells R to ignore missing values
age_sd = sd(Age, na.rm = TRUE), # sd() calculates standard deviation
# Calculate percentage female
# mean() of TRUE/FALSE gives us proportion (0-1), multiply by 100 for percentage
pct_female = mean(Gender == "Female", na.rm = TRUE) * 100,
# Calculate percentage with high school or less
# %in% checks if Education is either "Less than HS" OR "High school"
pct_hs_or_less = mean(Education %in% c("Less than HS", "High school"),
na.rm = TRUE) * 100
)
# Look at our calculations
sample_stats
## # A tibble: 1 × 5
## n_total age_mean age_sd pct_female pct_hs_or_less
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 7453 51.8 17.1 54.4 19.8
Understanding each part of the code:
summarize(): Creates summary statistics for your data
n(): Counts number of rows (total respondents)
mean(): Calculates average
sd(): Calculates standard deviation
na.rm = TRUE: Tells R to remove missing values before calculating
Education %in% c("Less than HS", "High school"): Checks if education falls into either category
* 100: Converts proportion to percentage
Now let’s create a nicely formatted table:
# Create basic table structure
basic_table <- data.frame( # Create a new data frame
# First column: names of our statistics
characteristic = c(
"Sample size (N)",
"Age, mean (SD)", # (SD) = Standard Deviation
"Female (%)",
"High school or less (%)" # Changed from college degree
),
# Second column: actual values
value = c(
sample_stats$n_total, # Sample size
# paste0() combines text - here combining mean and SD with formatting
paste0(round(sample_stats$age_mean, 1), " (",
round(sample_stats$age_sd, 1), ")"), # round(x, 1) rounds to 1 decimal
round(sample_stats$pct_female, 1), # Female percentage
round(sample_stats$pct_hs_or_less, 1) # Education percentage
)
)
basic_table
## characteristic value
## 1 Sample size (N) 7453
## 2 Age, mean (SD) 51.8 (17.1)
## 3 Female (%) 54.4
## 4 High school or less (%) 19.8
Some functions explained:
paste0(): Combines text and numbers (e.g., combining mean and SD)
round(): Rounds numbers (first argument is the number, second is decimal places)
%in%: Checks if values are in a set of options
c(): Combines values into a vector
# Create formatted table using gt package
basic_table %>% # Take our table and
gt() %>% # Convert to gt format
cols_label( # Remove column headers
characteristic = "", # First column blank header
value = "" # Second column blank header
) %>%
tab_header( # Add title
title = "Sample Characteristics"
)
Sample Characteristics

|                         |             |
|-------------------------|-------------|
| Sample size (N)         | 7453        |
| Age, mean (SD)          | 51.8 (17.1) |
| Female (%)              | 54.4        |
| High school or less (%) | 19.8        |
Let’s build a more detailed table in three parts:
Calculate our statistics
Structure our table
Add professional formatting
First, let’s calculate all the statistics we want to show:
# Calculate all our statistics at once
detailed_stats <- anes_2020 %>%
summarize(
# Basic counts
n_total = n(), # Total sample size
# Age statistics
age_mean = mean(Age, na.rm = TRUE), # Average age
age_sd = sd(Age, na.rm = TRUE), # Standard deviation of age
age_min = min(Age, na.rm = TRUE), # Youngest age
age_max = max(Age, na.rm = TRUE), # Oldest age
# Gender percentages
pct_female = mean(Gender == "Female", na.rm = TRUE) * 100,
pct_male = mean(Gender == "Male", na.rm = TRUE) * 100,
# Education percentages - three levels
n_valid_edu = sum(!is.na(Education)),
pct_hs = sum(Education %in% c("Less than HS", "High school"), na.rm = TRUE) /
n_valid_edu * 100, # High school or less
pct_some_ps = sum(Education == "Post HS", na.rm = TRUE) /
n_valid_edu * 100, # Some post-secondary
pct_college = sum(Education %in% c("Bachelor's", "Graduate"), na.rm = TRUE) /
n_valid_edu * 100 # Bachelor's or higher
)
detailed_stats
## # A tibble: 1 × 11
## n_total age_mean age_sd age_min age_max pct_female pct_male n_valid_edu pct_hs
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 7453 51.8 17.1 18 80 54.4 45.6 7337 20.1
## # ℹ 2 more variables: pct_some_ps <dbl>, pct_college <dbl>
Understanding this code:
We use summarize() to calculate all statistics at once
na.rm = TRUE appears often because it tells R to ignore missing values
For education, we use %in% to combine categories (e.g., "Less than HS" and "High school")
The gender percentages take the mean of TRUE/FALSE (0/1) values and multiply by 100; the education percentages instead divide each count by the number of valid education responses (n_valid_edu)
Now we’ll organize these statistics into a table format:
# Create the structure for our enhanced table
enhanced_table <- data.frame(
# First column: Labels for each row (14 elements)
characteristic = c(
"Sample size (N)", # 1
"", # 2 (spacing)
"Age", # 3
" Mean (SD)", # 4
" Range", # 5
"", # 6 (spacing)
"Gender (%)", # 7
" Female", # 8
" Male", # 9
"", # 10 (spacing)
"Education (%)", # 11
" High school or less", # 12
" Some post-secondary", # 13
" Bachelor's or higher" # 14
),
# Second column: The values (14 elements to match)
value = c(
detailed_stats$n_total, # 1
"", # 2
"", # 3
paste0(round(detailed_stats$age_mean, 1), " (",
round(detailed_stats$age_sd, 1), ")"), # 4
paste0(detailed_stats$age_min, "–", detailed_stats$age_max), # 5
"", # 6
"", # 7
round(detailed_stats$pct_female, 1), # 8
round(detailed_stats$pct_male, 1), # 9
"", # 10
"", # 11
round(detailed_stats$pct_hs, 1), # 12
round(detailed_stats$pct_some_ps, 1), # 13
round(detailed_stats$pct_college, 1) # 14
)
)
enhanced_table
## characteristic value
## 1 Sample size (N) 7453
## 2
## 3 Age
## 4 Mean (SD) 51.8 (17.1)
## 5 Range 18–80
## 6
## 7 Gender (%)
## 8 Female 54.4
## 9 Male 45.6
## 10
## 11 Education (%)
## 12 High school or less 20.1
## 13 Some post-secondary 34.3
## 14 Bachelor's or higher 45.7
Key points about table structure:
We create a two-column table using data.frame()
Empty rows ("") create spacing between sections
Indentation (spaces before names) creates hierarchy
paste0() combines numbers and text (for the age statistics)
round() controls decimal places
Now we’ll use the gt package to create a professional-looking table:
# Create our final formatted table
enhanced_table %>%
gt() %>% # Convert to gt table
# Step 1: Remove column headers
cols_label(
characteristic = "", # Make both column
value = "" # headers blank
) %>%
# Step 2: Make category headers bold
tab_style(
style = cell_text(weight = "bold"), # Bold text
locations = cells_body( # Apply to rows where
rows = characteristic %in% # characteristic matches
c("Sample size (N)", "Age", "Gender (%)", "Education (%)")
)
) %>%
# Step 3: Format the sample size with commas
fmt_number(
columns = value, # In the value column
rows = characteristic == "Sample size (N)", # For sample size only
decimals = 0, # No decimal places
use_seps = TRUE # Use comma separators
) %>%
# Step 4: Format all percentages
fmt_percent(
columns = value, # In the value column
rows = characteristic %in% # For rows that match
c(" Female", " Male",
" High school or less",
" Some post-secondary",
" Bachelor's or higher"),
scale_values = FALSE, # Values already in percentage form
decimals = 1 # One decimal place
) %>%
# Step 5: Add title and notes
tab_header(
title = md("**Table 1. Sample Characteristics**") # Bold title
) %>%
tab_source_note( # Add footnote
source_note = "Note: Percentages based on total valid responses. SD = standard deviation."
) %>%
# Step 6: Add borders
tab_options(
table.border.top.width = 2, # Thick top border
table.border.bottom.width = 2 # Thick bottom border
)
Table 1. Sample Characteristics

|                        |             |
|------------------------|-------------|
| Sample size (N)        | 7453        |
| Age                    |             |
|   Mean (SD)            | 51.8 (17.1) |
|   Range                | 18–80       |
| Gender (%)             |             |
|   Female               | 54.4        |
|   Male                 | 45.6        |
| Education (%)          |             |
|   High school or less  | 20.1        |
|   Some post-secondary  | 34.3        |
|   Bachelor's or higher | 45.7        |

Note: Percentages based on total valid responses. SD = standard deviation.
How to modify this code for your needs:
Change statistics: Modify the summarize() section to include different calculations
Change categories: Update the 'characteristic' column in data.frame()
Modify formatting:
Change decimal places in round() calls
Adjust borders in tab_options()
Modify title in tab_header()
Add new variables:
Add calculations in the first code block
Add rows to the table structure
Add formatting rules as needed
First, let’s look at the trust variable to understand what we’re working with:
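The two quick checks below can be produced with, for example:

table(anes_2020$TrustGovernment)      # counts per category (NAs dropped)
sum(is.na(anes_2020$TrustGovernment)) # number of missing responses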
##
## Always Most of the time About half the time Some of the time
## 80 1016 2313 3313
## Never
## 702
## [1] 29
What we learn:
We have five ordered categories from "Never" to "Always"
In the raw data the categories run from "Always" down to "Never" (the codebook's coding order); for our table we want the reverse, ascending order
We have some missing values to handle (29)
Let’s calculate our counts and percentages:
trust_dist <- anes_2020 %>%
# Calculate total responses including missing before filtering
mutate(total_responses = n()) %>%
# Remove missing values for the distribution analysis
filter(!is.na(TrustGovernment)) %>%
# Set proper ordering of trust levels
mutate(trust_level = factor(TrustGovernment,
levels = c("Never",
"Some of the time",
"About half the time",
"Most of the time",
"Always"))) %>%
# Group by trust level and get counts
group_by(trust_level) %>%
summarize(
count = n(),
.groups = 'drop'
) %>%
# Calculate total and percentages
mutate(
n_total = sum(count),
percentage = round(count / n_total * 100, 1)
)
trust_dist
## # A tibble: 5 × 4
## trust_level count n_total percentage
## <fct> <int> <int> <dbl>
## 1 Never 702 7424 9.5
## 2 Some of the time 3313 7424 44.6
## 3 About half the time 2313 7424 31.2
## 4 Most of the time 1016 7424 13.7
## 5 Always 80 7424 1.1
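Since the raw levels already run from "Always" down to "Never", forcats (loaded with the tidyverse) offers a shortcut: fct_rev() reverses the existing level order, which here is equivalent to the explicit factor() call above; a sketch:

anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%
  mutate(trust_level = fct_rev(TrustGovernment)) %>% # "Never" first, "Always" last
  count(trust_level, name = "count")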
Now let’s create a professional-looking table:
# Create formatted table
trust_dist %>%
# Select only the columns we want to display
select(trust_level, count, percentage) %>%
gt() %>%
# Format columns
cols_label(
trust_level = "Level of Trust",
count = "Count",
percentage = "Percent"
) %>%
# Format numbers with commas
fmt_number(
columns = count,
decimals = 0,
use_seps = TRUE
) %>%
# Format percentages
fmt_number(
columns = percentage,
decimals = 1
) %>%
# Add title and notes
tab_header(
title = md("**Table 2. Trust in Government Distribution**"),
subtitle = "How often can you trust the federal government to do what is right?"
) %>%
# Add table notes
tab_source_note(
source_note = sprintf(
"Note: Based on %d valid responses.",
first(trust_dist$n_total)
)
) %>%
# Add borders
tab_options(
table.border.top.width = 2,
table.border.bottom.width = 2
)
Table 2. Trust in Government Distribution
How often can you trust the federal government to do what is right?

| Level of Trust      | Count | Percent |
|---------------------|-------|---------|
| Never               | 702   | 9.5     |
| Some of the time    | 3,313 | 44.6    |
| About half the time | 2,313 | 31.2    |
| Most of the time    | 1,016 | 13.7    |
| Always              | 80    | 1.1     |

Note: Based on 7424 valid responses.
Understanding the code:
1. Data preparation:
We manually specify the order of the trust levels
Calculate percentages based on valid responses only
Store information about missing data
2. Table formatting:
Clear column headers
Numbers formatted with commas
Percentages to one decimal place
Informative title and subtitle
Note showing the number of valid responses
Key functions:
gt(): Creates the formatted table
cols_label(): Sets column headers
fmt_number(): Controls number formatting
tab_header(): Adds title and subtitle
tab_source_note(): Adds footnote
To modify this code:
For different variables:
Change the category names in trust_level
Update the counts
Modify title and subtitle
For different formatting:
Adjust decimals in fmt_number()
Change column labels in cols_label()
Modify borders in tab_options()
This approach ensures:
Logical ordering of categories
Correct percentage calculations
Professional formatting
Clear documentation of missing data
Consistent style with Table 1
Let’s enhance our trust table by adding subtle color to highlight response patterns:
# Create enhanced trust table
trust_dist %>%
# Select only the columns we want to display
select(trust_level, count, percentage) %>%
gt() %>%
# Basic formatting (same as before)
cols_label(
trust_level = "Level of Trust",
count = "Count",
percentage = "Percent"
) %>%
fmt_number(
columns = count,
decimals = 0,
use_seps = TRUE
) %>%
fmt_number(
columns = percentage,
decimals = 1
) %>%
# Add subtle background color based on percentage
data_color(
columns = percentage,
colors = scales::col_numeric(
palette = c("#ffffff", "#e6f3ff"), # White to light blue
domain = NULL
)
) %>%
# Add light gray header background
tab_style(
style = cell_fill(color = "#f6f6f6"),
locations = cells_column_labels()
) %>%
# Title and notes
tab_header(
title = md("**Table 2. Trust in Government**"),
) %>%
tab_source_note(
source_note = sprintf(
"Note: Based on %d valid responses. Color intensity indicates response frequency.",
first(trust_dist$n_total)
)
) %>%
# Borders
tab_options(
table.border.top.width = 2,
table.border.bottom.width = 2
)
Table 2. Trust in Government

| Level of Trust      | Count | Percent |
|---------------------|-------|---------|
| Never               | 702   | 9.5     |
| Some of the time    | 3,313 | 44.6    |
| About half the time | 2,313 | 31.2    |
| Most of the time    | 1,016 | 13.7    |
| Always              | 80    | 1.1     |

Note: Based on 7424 valid responses. Color intensity indicates response frequency.
Understanding the enhancements:
data_color():
Adds color based on percentage values
Uses a white to light blue scale
Higher percentages show darker blue
cell_fill():
Adds light gray to column headers
Creates subtle visual separation
Tips for using color:
Keep it subtle (light colors)
Use color meaningfully (to show patterns)
Include explanation in table note
Consider color-blind friendly options
You can modify this by:
Changing colors in col_numeric()
Adjusting header color in cell_fill()
Adding color to different columns
Using different color scales for different purposes
Here are 5 exercises to help you practice what we covered in today’s session about descriptive statistics. These exercises use the same ANES dataset but explore different variables.
First, examine the survey question for TrustPeople:
Look up the exact question wording in the codebook
Identify all possible response categories
Then analyze the responses:
Check the data quality by calculating:
Total number of respondents
Number of valid responses
Number of missing responses
Response rate
Create a distribution table showing counts and percentages for each response category
Hint: Follow the same steps we used for TrustGovernment, but apply them to TrustPeople. Remember to examine the question context first
Start by understanding the variable:
Look up the exact question text for CampaignInterest in the codebook
Note how the question was framed to respondents
Then create a summary analysis:
Calculate data quality metrics (response rates, missing data)
Create a formatted table showing:
Response counts
Percentages
Valid percentages (excluding missing data)
Hint: Think about how the question wording might affect response patterns
Create a professional table that combines:
Race/ethnicity distribution using RaceEth:
Counts and percentages for each category
Note any missing data
Education levels using Education:
Grouped into meaningful categories
Percentages for each level
Grouped into meaningful categories
Percentages for each level
Format using gt with appropriate styling and clear labels.
Hint: This is different from our tutorial example as it combines two categorical demographic variables
Create a descriptive analysis that examines:
Age statistics by party identification:
Mean age for each party ID group
Standard deviation within groups
Compare age distributions across different party affiliations
Present results in a professional table using gt
Hint: This combines continuous and categorical variables in a way different from our tutorial examples
Create a summary table showing the relationship between education level and party identification:
Calculate the percent of each education level that identifies as Democrat (including independent-democrat), Republican (including independent-republican), or Independent
Create a professional formatted table that shows:
Education levels
Distribution across these three party groups
Include row totals to verify percentages sum to 100%
Use proper formatting for numbers and percentages
Hint: Remember to combine party categories first, then calculate percentages within education levels
For your weekly diary reflection, consider:
How does understanding the exact survey questions help in analyzing and interpreting the data?
What challenges did you face in combining different types of variables in a single table?
How would you explain these demographic patterns to different audiences?
What additional context would be helpful when presenting these results?