Setting Up the Environment

# List of packages
packages <- c("tidyverse", "srvyr", "srvyrexploR", "broom","gt", "modelsummary", "gapminder") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
invisible(lapply(packages, library, character.only = TRUE))  # invisible() suppresses the long list lapply() would otherwise print
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'srvyr'
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
##   backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
## 
## Revert to `kableExtra` for one session:
## 
##   options(modelsummary_factory_default = 'kableExtra')
##   options(modelsummary_factory_latex = 'kableExtra')
##   options(modelsummary_factory_html = 'kableExtra')
## 
## Silence this message forever:
## 
##   config_modelsummary(startup_message = FALSE)

From Toy Data to Real-World Research

In our previous session, we worked with the gapminder dataset to learn basic data manipulation skills. While gapminder is excellent for learning, it differs from the data we typically encounter in social science research:

Gapminder (Teaching Dataset):

  • Clean and well-organized

  • No missing values

  • Small number of variables (6)

  • Clear relationships

  • No complex survey design

Real Social Science Data:

  • Missing values and inconsistencies

  • Complex survey designs with weights

  • Much larger number of variables

  • Different types of measurements

  • Data quality issues

Today, we’ll learn to describe and understand real survey data using a major social science dataset:

American National Election Studies (ANES):

  • Measures political attitudes and behaviors

  • Uses complex survey design to represent U.S. population

  • Mix of categorical and ordinal variables

  • Conducted biennially since 1948

Understanding Different Types of Variables

Before we analyze data, we need to understand what kinds of variables we have. Each type requires different descriptive statistics:

1. Categorical (Nominal) Variables

  • Categories with no natural order
  • Examples from our data:
    • Region (in the ANES dataset: Northeast, South, etc.)
    • Political party identification (e.g., Republican, Democrat; in Canada this would be Liberal, Conservative, Green, and so on)

2. Ordinal Variables

  • Categories with meaningful order
  • Examples from our data:
    • Trust in government (Never → Always)
    • Education levels (e.g., no HS completion, HS, college/university, and so on)

3. Count (Discrete) Variables

  • Whole numbers from counting
  • Examples from our data:
    • Number of household members (0, 1, 2, 3,…)
    • Times voted in past elections (0, 1, 2, 3,…)

4. Continuous Variables

  • Can take any value within range
  • Example from our data:
    • Income (e.g., 87,896.05)
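
As a minimal sketch with made-up toy values, here is how these four types are typically represented in R:

# Toy (hypothetical) examples of the four variable types
region <- factor(c("Northeast", "South", "South"))         # nominal: unordered factor
trust  <- factor(c("Never", "Always", "Some of the time"),
                 levels = c("Never", "Some of the time", "About half the time",
                            "Most of the time", "Always"),
                 ordered = TRUE)                           # ordinal: ordered factor
n_household <- c(1L, 4L, 2L)                               # count: whole numbers (integers)
income <- c(87896.05, 23410.00, 156000.75)                 # continuous: decimal values

str(list(region = region, trust = trust,
         n_household = n_household, income = income))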

Understanding Our Survey Data

What’s In Our Survey?

We could do the following:

glimpse(anes_2020)
## Rows: 7,453
## Columns: 65
## $ V200001                 <dbl> 200015, 200022, 200039, 200046, 200053, 200060…
## $ CaseID                  <dbl> 200015, 200022, 200039, 200046, 200053, 200060…
## $ V200002                 <hvn_lbll> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ InterviewMode           <fct> Web, Web, Web, Web, Web, Web, Web, Web, Web, W…
## $ V200010b                <dbl> 1.0057375, 1.1634731, 0.7686811, 0.5210195, 0.…
## $ Weight                  <dbl> 1.0057375, 1.1634731, 0.7686811, 0.5210195, 0.…
## $ V200010c                <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2…
## $ VarUnit                 <fct> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2…
## $ V200010d                <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 3…
## $ Stratum                 <fct> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 3…
## $ V201006                 <hvn_lbll> 2, 3, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 1,…
## $ CampaignInterest        <fct> Somewhat interested, Not much interested, Some…
## $ V201023                 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1…
## $ EarlyVote2020           <fct> NA, NA, NA, NA, NA, NA, NA, NA, Yes, NA, NA, N…
## $ V201024                 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 2, -1, -1…
## $ V201025x                <hvn_lbll> 3, 3, 3, 3, 3, 3, 3, 2, 4, 3, 3, 3, 2, 4,…
## $ V201028                 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1…
## $ V201029                 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1…
## $ V201101                 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 2, …
## $ V201102                 <hvn_lbll> 1, 1, 1, 1, 1, 2, 1, 2, -1, -1, -1, 1, 2,…
## $ VotedPres2016           <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes…
## $ V201103                 <hvn_lbll> 2, 5, 1, 1, 2, -1, 5, -1, 1, 1, -1, 1, -1…
## $ VotedPres2016_selection <fct> Trump, Other, Clinton, Clinton, Trump, NA, Oth…
## $ V201228                 <hvn_lbll> 2, 5, 3, 2, 3, 3, 2, 2, 3, 1, 1, 1, 2, 1,…
## $ V201229                 <hvn_lbll> 1, -1, -1, 2, -1, -1, 2, 2, -1, 2, 1, 2, …
## $ V201230                 <hvn_lbll> -1, 2, 3, -1, 2, 3, -1, -1, 2, -1, -1, -1…
## $ V201231x                <hvn_lbll> 7, 4, 3, 6, 4, 3, 6, 6, 4, 2, 1, 2, 7, 2,…
## $ PartyID                 <fct> Strong republican, Independent, Independent-de…
## $ V201233                 <hvn_lbll> 5, 5, 4, 3, 5, 4, 4, 1, 3, 3, 2, 3, 4, 5,…
## $ TrustGovernment         <fct> Never, Never, Some of the time, About half the…
## $ V201237                 <hvn_lbll> 3, 4, 4, 2, 4, 2, 4, 1, 3, 2, 4, 3, 4, 3,…
## $ TrustPeople             <fct> About half the time, Some of the time, Some of…
## $ V201507x                <hvn_lbll> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43, 3…
## $ Age                     <dbl> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43, 37, 55…
## $ AgeGroup                <fct> 40-49, 30-39, 40-49, 40-49, 70 or older, 70 or…
## $ V201510                 <hvn_lbll> 6, 3, 2, 4, 8, 3, 4, 2, 2, 4, 2, 2, 2, 7,…
## $ Education               <fct> Bachelor's, Post HS, High school, Post HS, Gra…
## $ V201546                 <hvn_lbll> 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2,…
## $ V201547a                <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547b                <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547c                <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547d                <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547e                <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547z                <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201549x                <hvn_lbll> 3, 4, 1, 4, 5, 1, 1, 1, 1, 3, 3, 1, 1, 4,…
## $ RaceEth                 <fct> "Hispanic", "Asian, NH/PI", "White", "Asian, N…
## $ V201600                 <hvn_lbll> 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1,…
## $ Gender                  <fct> Male, Female, Female, Male, Male, Female, Fema…
## $ V201607                 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201610                 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201611                 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201613                 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201615                 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201616                 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201617x                <hvn_lbll> 21, 13, 17, 7, 22, 3, 4, 3, 10, 11, 9, 18…
## $ Income                  <fct> "$175,000-249,999", "$70,000-74,999", "$100,00…
## $ Income7                 <fct> $125k or more, $60k to < 80k, $100k to < 125k,…
## $ V202051                 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1…
## $ V202066                 <hvn_lbll> 1, 4, 4, 4, 4, 4, 4, 1, -1, 4, 4, 4, 4, -…
## $ V202072                 <hvn_lbll> -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1,…
## $ VotedPres2020           <fct> NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, Yes, Yes…
## $ V202073                 <hvn_lbll> -1, 3, 1, 1, 2, 1, 2, -1, -1, 1, 1, 1, 2,…
## $ V202109x                <hvn_lbll> 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ V202110x                <hvn_lbll> -1, 3, 1, 1, 2, 1, 2, -1, 1, 1, 1, 1, 2, …
## $ VotedPres2020_selection <fct> NA, Other, Biden, Biden, Trump, Biden, Trump, …

Understanding check: how many variables and observations are there?

But that is not very effective: there is too much to look at at once, and many variables have cryptic codes rather than descriptive names.

Our course textbook provides some key information:

https://tidy-survey-r.github.io/tidy-survey-book/anes-cb.html

But you can also look directly at the dataset's website, or consult the full PDF documentation (which is 796 pages!):

https://electionstudies.org/wp-content/uploads/2022/02/anes_timeseries_2020_userguidecodebook_20220210.pdf

According to the ANES documentation, one of our key variables is trust in government, which asks:

“How often can you trust the federal government in Washington to do what is right?”

Response options:

  • Always

  • Most of the time

  • About half the time

  • Some of the time

  • Never
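
Assuming the anes_2020 data is loaded (it ships with the srvyrexploR package attached above), you can verify these response options directly from the stored factor levels:

levels(anes_2020$TrustGovernment)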

Alternative loading method

Assuming you have the dataset saved in your working directory (the same folder as your script or project):

load("anes_2020.rda")

Exploring Trust in Government

What we want to know: How many people responded and what did they say?

Suppose you just wanted to do a quick check:

table(anes_2020$TrustGovernment)
## 
##              Always    Most of the time About half the time    Some of the time 
##                  80                1016                2313                3313 
##               Never 
##                 702

Now suppose you wanted to both arrange the responses and find out what the most "frequent" or "typical" response/value is, which is called the mode.

How we’ll find out: Let’s count responses by category and arrange

trust_counts <- anes_2020 %>%
  group_by(TrustGovernment) %>%      # Organize by response option
  summarize(count = n()) %>%         # Count responses in each group
  filter(!is.na(TrustGovernment)) %>% # Remove missing values
  arrange(desc(count))               # Sort by frequency

trust_counts
## # A tibble: 5 × 2
##   TrustGovernment     count
##   <fct>               <int>
## 1 Some of the time     3313
## 2 About half the time  2313
## 3 Most of the time     1016
## 4 Never                 702
## 5 Always                 80

What we found:

  • Most common response was “Some of the time” (3,313 people)

  • Followed by “About half the time” (2,313 people)

  • Only 80 people said they “Always” trust government

Therefore, the mode is “Some of the time”.
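
As an aside, here is a compact base-R sketch that pulls out just the mode (note that table() drops NAs by default):

names(which.max(table(anes_2020$TrustGovernment)))  # label of the most frequent response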

Now suppose you wanted the relative proportions, or percentages (the proportion × 100), giving you the distribution of responses.

What we want to know: What percentage of people gave each response?

How we’ll find out: Let’s convert our counts into proportions

trust_props <- anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%   # Remove missing values first
  group_by(TrustGovernment) %>%         # Group by trust response
  summarize(                            # Calculate counts and percentages
    count = n(),
    percentage = round(100 * n() / sum(n()), 1)  # Percentage with 1 decimal
  ) %>%
  arrange(desc(count))                  # Sort by frequency

trust_props
## # A tibble: 5 × 3
##   TrustGovernment     count percentage
##   <fct>               <int>      <dbl>
## 1 Some of the time     3313        100
## 2 About half the time  2313        100
## 3 Most of the time     1016        100
## 4 Never                 702        100
## 5 Always                 80        100

Uh oh, what happened? Every count shows 100%; that is definitely not right! These things happen when we go too quickly. That's why it's important to check, at every stage, what you did and what it produced.

Understanding the Issue:

  • Inside summarize(), sum(n()) is calculated within each group (because of our group_by())

  • This means each group’s count is being divided by its own sum

  • Of course dividing anything by itself gives 100%!

Let’s fix this by calculating the total before grouping:

# First, let's store the total valid responses
total_valid <- anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%
  nrow()

# Now calculate proportions using this total
trust_props <- anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%   # Remove missing values first
  group_by(TrustGovernment) %>%         # Group by trust response
  summarize(                            # Calculate counts and percentages
    count = n(),
    percentage = round(100 * count / total_valid, 1)  # Use total_valid instead of sum(n())
  ) %>%
  arrange(desc(count))                  # Sort by frequency

trust_props
## # A tibble: 5 × 3
##   TrustGovernment     count percentage
##   <fct>               <int>      <dbl>
## 1 Some of the time     3313       44.6
## 2 About half the time  2313       31.2
## 3 Most of the time     1016       13.7
## 4 Never                 702        9.5
## 5 Always                 80        1.1

Now we get sensible percentages that add up to 100%!
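
An alternative fix, as a minimal sketch: compute the counts first with count(), then take the percentages in an ungrouped mutate(), where sum(count) spans all rows:

trust_props_alt <- anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%                 # Remove missing values first
  count(TrustGovernment, name = "count") %>%          # count() = group_by() + summarize(n())
  mutate(percentage = round(100 * count / sum(count), 1)) %>%  # ungrouped: sum(count) = total
  arrange(desc(count))                                # Sort by frequency

trust_props_alt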

Key Lesson: Always check the results of what you did.

Understanding Central Tendency

Descriptive Question Example: What’s a “typical” age in our sample?

To answer this, we’ll explore different ways of finding the “middle” or “center” of our data. Each method tells us something different about our respondents’ ages.

The Median: Finding the Middle Position

What we want to know: What age splits our sample in half?

How we’ll find out: Let’s find the middle value when ages are ordered

age_median <- anes_2020 %>%
  filter(!is.na(Age)) %>%     # Remove missing ages
  summarize(
    median_age = median(Age),
    n_valid = n()             # Count valid responses
  )

age_median
## # A tibble: 1 × 2
##   median_age n_valid
##        <dbl>   <int>
## 1         53    7159

What we found:

  • Median age is 53 years

  • Half of respondents are younger than 53

  • Half are older than 53

  • Based on valid responses (after removing NAs)

The Mean: Finding the Mathematical Average

What we want to know: What’s the mathematical average age?

How we’ll find out: Let’s add all ages and divide by the count

age_mean <- anes_2020 %>%
  filter(!is.na(Age)) %>%
  summarize(
    mean_age = mean(Age),
    n_valid = n()
  )

age_mean
## # A tibble: 1 × 2
##   mean_age n_valid
##      <dbl>   <int>
## 1     51.8    7159

What we found:

  • Average age is 51.8 years

  • Mean is slightly lower than median (51.8 vs 53)

  • This small difference tells us something about our age distribution (we will come back to this during distribution week!)

More Tools for Describing Data

What’s the spread of ages in our sample?

What we want to know: How wide is the range of ages and where do most respondents fall?

How we’ll find out: Let’s explore different ways to measure spread in our data

The Range

# Calculate the range of ages
age_range <- anes_2020 %>%
  filter(!is.na(Age)) %>%
  summarize(
    min_age = min(Age),
    max_age = max(Age),
    age_range = max_age - min_age
  )

age_range
## # A tibble: 1 × 3
##   min_age max_age age_range
##     <dbl>   <dbl>     <dbl>
## 1      18      80        62

What we found:

  • Youngest respondent: 18 years old

  • Oldest respondent: 80 years old

  • Total range: 62 years

  • Note: Range is sensitive to extreme values (consider, e.g., income in US from lowest to highest)

Quartiles

What we want to know: Where do the middle 50% of ages fall?

How we’ll find out: Let’s divide our data into quarters

# Calculate quartiles
age_quartiles <- anes_2020 %>%
  filter(!is.na(Age)) %>%
  summarize(
    q1 = quantile(Age, 0.25),  # First quartile (25th percentile)
    median = median(Age),       # Second quartile (50th percentile)
    q3 = quantile(Age, 0.75),  # Third quartile (75th percentile)
    iqr = q3 - q1              # Interquartile range
  )

age_quartiles
## # A tibble: 1 × 4
##      q1 median    q3   iqr
##   <dbl>  <dbl> <dbl> <dbl>
## 1    37     53    66    29

Understanding the Quartiles:

  • Q1 (25th percentile): 37 years - 25% are younger

  • Median (50th percentile): 53 years - middle value

  • Q3 (75th percentile): 66 years - 75% are younger

  • IQR: 29 years (i.e., 37 to 66) - where the middle 50% of ages fall

A Complete Summary

What we want to know: Can we get all key statistics at once?

How we’ll find out: Let’s use R’s summary() function

summary(anes_2020$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   37.00   53.00   51.83   66.00   80.00     294

Understanding the Output:

  • Minimum: 18 years (youngest respondent)

  • 1st Quartile: 37 years (25% are younger)

  • Median: 53 years (middle value)

  • Mean: 51.83 years (average)

  • 3rd Quartile: 66 years (75% are younger)

  • Maximum: 80 years (oldest respondent)

  • Missing Values: 294 NAs

What This Tells Us About Our Sample

  1. Age Range
    • Survey includes adults 18-80 years old
    • Range of 62 years (80 - 18)
  2. Central Tendency
    • Median (53) slightly higher than mean (51.8)
    • Suggests a slight skew in our age distribution
    • Most respondents are middle-aged
  3. Data Quality
    • 294 missing ages
    • Need to consider impact on analysis

The relationship between mean and median can tell us about the shape of our distribution:

  • When mean ≈ median: Suggests symmetrical distribution

  • When mean < median: Suggests negative skew (tail to left)

  • When mean > median: Suggests positive skew (tail to right)

We will come back to this during the distribution week and visualize!
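
As a quick preview, here is a toy simulation (made-up data, not ANES) showing how a long right tail pulls the mean above the median:

set.seed(42)                          # for reproducibility
right_skewed <- rexp(1000)            # simulated data with a long right tail
c(mean = mean(right_skewed),
  median = median(right_skewed))      # mean > median for right-skewed data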

Understanding How Spread Out Our Data Is

A fundamental tool: Standard Deviation

Standard deviation measures the typical distance from the average. The formula is:

SD = sqrt( Σ(x - μ)² / n )

Where:
- SD is the standard deviation
- x is each value in our data
- μ (mu) is the mean
- n is the number of observations
- Σ means "sum up"

Why do we calculate it this way?

  1. First, we find each value’s distance from the mean (x - μ). This tells us how far each observation is from average.

  2. We square these differences for two key reasons:

    • It makes negative differences positive (being 5 years younger or 5 years older than average both count as a 5-year difference)
    • It gives more weight to larger differences, which makes sense when measuring spread (being 20 years from average is more than twice as extreme as being 10 years from average)
  3. We average these squared differences (÷ n) to get a typical squared distance. (Strictly speaking, R's sd() divides by n − 1, the sample standard deviation, but with over 7,000 observations the difference is negligible.)

  4. Finally, we take the square root to get back to the original units (years in our case)

Let’s use age in our ANES survey as an example:

# Calculate mean and standard deviation of age
age_stats <- anes_2020 %>%
  filter(!is.na(Age)) %>%
  summarize(
    mean_age = round(mean(Age), 1),
    sd_age = round(sd(Age), 1)
  )
age_stats
## # A tibble: 1 × 2
##   mean_age sd_age
##      <dbl>  <dbl>
## 1     51.8   17.1

In our data:

  • The average age is about 52 years

  • The standard deviation is about 17 years

What does this mean in plain language?

  • Most respondents’ ages fall within 17 years of 52 years old

  • So most people are between 35 and 69 years old

  • This gives us a sense of how “spread out” the ages are
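
To demystify where the 17.1 comes from, here is a minimal base-R sketch (assuming anes_2020 is loaded) that reproduces sd() by hand:

ages <- anes_2020$Age[!is.na(anes_2020$Age)]             # drop missing ages
n <- length(ages)
manual_sd <- sqrt(sum((ages - mean(ages))^2) / (n - 1))  # n - 1: the sample formula sd() uses
all.equal(manual_sd, sd(ages))                           # should return TRUE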

The “Rules of Thumb” for Standard Deviation

There are some helpful guidelines about standard deviation:

  1. The 68-95-99.7 Rule (for roughly bell-shaped distributions):

    • About 68% of people fall within 1 standard deviation of the mean

    • About 95% fall within 2 standard deviations

    • Almost everyone (99.7%) falls within 3 standard deviations

Let’s see what this means for our age data:

# Calculate the intervals
age_ranges <- anes_2020 %>%
  filter(!is.na(Age)) %>%
  summarize(
    mean = round(mean(Age), 1),
    sd = round(sd(Age), 1),
    
    # One SD range (about 68% of people)
    one_sd_low = round(mean - sd, 0),
    one_sd_high = round(mean + sd, 0),
    
    # Two SD range (about 95% of people)
    two_sd_low = round(mean - (2 * sd), 0),
    two_sd_high = round(mean + (2 * sd), 0)
  )

age_ranges
## # A tibble: 1 × 6
##    mean    sd one_sd_low one_sd_high two_sd_low two_sd_high
##   <dbl> <dbl>      <dbl>       <dbl>      <dbl>       <dbl>
## 1  51.8  17.1         35          69         18          86

What This Tells Us:

  • Most respondents (68%) are between 35 and 69 years old

  • Almost all (95%) are between 18 and 86 years old

  • This makes sense given the voting age minimum of 18

Identifying Unusual Cases

  • Values more than 2 standard deviations from the mean are unusual

  • Values more than 3 standard deviations are very rare

  • This helps us spot interesting patterns or potential data issues (see the sketch below)
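
As a sketch of how you might flag such cases using z-scores (standardized distances from the mean); note that with ages bounded between 18 and 80, you may well find none here:

# Flag respondents more than 2 SDs from the mean age
age_outliers <- anes_2020 %>%
  filter(!is.na(Age)) %>%
  mutate(z_age = (Age - mean(Age)) / sd(Age)) %>%   # standardize age
  filter(abs(z_age) > 2)                            # keep cases beyond 2 SDs

nrow(age_outliers)   # count of unusual cases (possibly zero for this variable)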

When to Use Each Measure?

  1. Use Range When:

    • You need a quick overview

    • Extreme values matter

    • Explaining to non-technical audiences

  2. Use Quartiles When:

    • You want to know about typical “segments”

    • Outliers might distort your picture

    • You need to identify “middle” or “quarter” groups

  3. Use Standard Deviation When:

    • You need precise measures of spread

    • You want to report both the average and typical variation around a key variable (typical in a descriptive table)

Producing Descriptive Tables

A fundamental skill we will practice is turning key descriptive statistics into clear tables that describe our sample or provide an overview of key characteristics. This is a standard element of quantitative research articles, often referred to (in some variant) as a "Table 1".

Quick Data Summaries

While learning to create customized tables is important, sometimes we need quick descriptive statistics during our exploratory phase. The datasummary_skim() function from the modelsummary package offers a simple way to generate descriptive statistics.

Exploring Numeric Variables

Let’s look at a few numeric variables using this simpler approach:

# Create a simpler dataset with just numeric variables of interest
demo_vars <- anes_2020 %>%
  select(Age, Income7) %>%
  # Convert to numeric (note: factor levels such as Income7's become codes 1-7)
  mutate(across(everything(), as.numeric))

# Quick summary
datasummary_skim(demo_vars)
         Unique  Missing Pct.  Mean    SD   Min  Median   Max
Age          64             4  51.8  17.1  18.0    53.0  80.0
Income7       8             7   4.0   2.1   1.0     4.0   7.0

(The Histogram column renders as an inline plot and is omitted here.)

Exploring Categorical Variables

For categorical variables like trust and education, we can use the same function with a different type:

# Create a dataset with categorical variables
cat_vars <- anes_2020 %>%
  select(TrustGovernment, Education) %>%
  # Ensure variables are factors
  mutate(across(everything(), as.factor))

# Quick summary of categorical variables
datasummary_skim(cat_vars, type = "categorical")
                                          N     %
TrustGovernment  Always                  80   1.1
                 Most of the time      1016  13.6
                 About half the time   2313  31.0
                 Some of the time      3313  44.5
                 Never                  702   9.4
Education        Less than HS           312   4.2
                 High school           1160  15.6
                 Post HS               2514  33.7
                 Bachelor's            1877  25.2
                 Graduate              1474  19.8

Advantages of this approach:

  • Quick and easy to use

  • Standardized output format

  • Good for initial data exploration

  • Requires less coding than custom tables

While datasummary_skim() is great for quick exploration, it offers less customization than the gt package approach we'll learn next. Think of it as another useful tool in your toolkit.

Building Professional Tables: A Step-by-Step Approach

Creating Professional Tables for Research

Before creating tables, we need to understand what we want to show and check our data. Let’s work through this step by step.

Step 1: Exploring Our Variables

First, let’s look at key variables we might want to include:

# Check education categories
anes_2020 %>%
  count(Education) %>%
  arrange(desc(n))
## # A tibble: 6 × 2
##   Education        n
##   <fct>        <int>
## 1 Post HS       2514
## 2 Bachelor's    1877
## 3 Graduate      1474
## 4 High school   1160
## 5 Less than HS   312
## 6 <NA>           116
# Check trust categories
table(anes_2020$TrustGovernment)
## 
##              Always    Most of the time About half the time    Some of the time 
##                  80                1016                2313                3313 
##               Never 
##                 702
# Quick summary of age
summary(anes_2020$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   37.00   53.00   51.83   66.00   80.00     294

What we learned:

  • Education has several categories we might want to combine

  • Trust has 5 ordered categories from "Never" to "Always"

  • Age ranges from 18 to 80

  • There are some missing values to consider

Step 2: Planning Our Tables

We’ll create two key tables:

  1. Sample characteristics (demographics)

  2. Distribution of trust in government (outcome of interest)

Let’s start with sample characteristics.

Step 3: Building Our First Table - Sample Demographics

First, let’s calculate our basic statistics:

# Calculate basic statistics
sample_stats <- anes_2020 %>%          # Start with our dataset
  summarize(                           # Create summary statistics
    # Count total number of respondents
    n_total = n(),                     
    
    # Calculate mean and standard deviation of Age
    age_mean = mean(Age, na.rm = TRUE),    # na.rm = TRUE tells R to ignore missing values
    age_sd = sd(Age, na.rm = TRUE),        # sd() calculates standard deviation
    
    # Calculate percentage female
    # mean() of TRUE/FALSE gives us proportion (0-1), multiply by 100 for percentage
    pct_female = mean(Gender == "Female", na.rm = TRUE) * 100,
    
    # Calculate percentage with high school or less
    # %in% checks if Education is either "Less than HS" OR "High school"
    pct_hs_or_less = mean(Education %in% c("Less than HS", "High school"), 
                         na.rm = TRUE) * 100
  )

# Look at our calculations
sample_stats
## # A tibble: 1 × 5
##   n_total age_mean age_sd pct_female pct_hs_or_less
##     <int>    <dbl>  <dbl>      <dbl>          <dbl>
## 1    7453     51.8   17.1       54.4           19.8

Understanding each part of the code:

  1. summarize(): Creates summary statistics for your data

  2. n(): Counts number of rows (total respondents)

  3. mean(): Calculates average

  4. sd(): Calculates standard deviation

  5. na.rm = TRUE: Tells R to remove missing values before calculating

  6. Education %in% c("Less than HS", "High school"): Checks if education falls into either category

  7. * 100: Converts proportion to percentage

Now let’s create a nicely formatted table:

# Create basic table structure
basic_table <- data.frame(           # Create a new data frame
  # First column: names of our statistics
  characteristic = c(
    "Sample size (N)",
    "Age, mean (SD)",                # (SD) = Standard Deviation
    "Female (%)",
    "High school or less (%)"        # Changed from college degree
  ),
  # Second column: actual values
  value = c(
    sample_stats$n_total,            # Sample size
    # paste0() combines text - here combining mean and SD with formatting
    paste0(round(sample_stats$age_mean, 1), " (", 
           round(sample_stats$age_sd, 1), ")"),   # round(x, 1) rounds to 1 decimal
    round(sample_stats$pct_female, 1),            # Female percentage
    round(sample_stats$pct_hs_or_less, 1)         # Education percentage
  )
)

basic_table
##            characteristic       value
## 1         Sample size (N)        7453
## 2          Age, mean (SD) 51.8 (17.1)
## 3              Female (%)        54.4
## 4 High school or less (%)        19.8

Some functions explained:

  • paste0(): Combines text and numbers (e.g., combining mean and SD)

  • round(): Rounds numbers (first argument is number, second is decimal places)

  • %in%: Checks if values are in a set of options

  • c(): Combines values into a vector
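
Before building the formatted table, here are quick console demonstrations of these helpers (the numbers are just the rounded statistics from above, used for illustration):

paste0(round(51.83, 1), " (", round(17.06, 1), ")")   # returns "51.8 (17.1)"
"Post HS" %in% c("Less than HS", "High school")       # returns FALSE
c("a", "b", "c")                                      # a character vector of three values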

# Create formatted table using gt package
basic_table %>%                # Take our table and
  gt() %>%                    # Convert to gt format
  cols_label(                 # Remove column headers
    characteristic = "",      # First column blank header
    value = ""               # Second column blank header
  ) %>%
  tab_header(                # Add title
    title = "Sample Characteristics"
  )
Sample Characteristics
Sample size (N) 7453
Age, mean (SD) 51.8 (17.1)
Female (%) 54.4
High school or less (%) 19.8

Step 4: Creating an Enhanced Professional Table

Let’s build a more detailed table in three parts:

  1. Calculate our statistics

  2. Structure our table

  3. Add professional formatting

Part 1: Calculating Detailed Statistics

First, let’s calculate all the statistics we want to show:

# Calculate all our statistics at once
detailed_stats <- anes_2020 %>%
  summarize(
    # Basic counts
    n_total = n(),   # Total sample size
    
    # Age statistics
    age_mean = mean(Age, na.rm = TRUE),    # Average age
    age_sd = sd(Age, na.rm = TRUE),        # Standard deviation of age
    age_min = min(Age, na.rm = TRUE),      # Youngest age
    age_max = max(Age, na.rm = TRUE),      # Oldest age
    
    # Gender percentages
    pct_female = mean(Gender == "Female", na.rm = TRUE) * 100,
    pct_male = mean(Gender == "Male", na.rm = TRUE) * 100,
    
    # Education percentages - three levels
    n_valid_edu = sum(!is.na(Education)),  # Count of non-missing education responses
    pct_hs = sum(Education %in% c("Less than HS", "High school"), na.rm = TRUE) / 
         n_valid_edu * 100,               # High school or less
    pct_some_ps = sum(Education == "Post HS", na.rm = TRUE) / 
                  n_valid_edu * 100,      # Some post-secondary
    pct_college = sum(Education %in% c("Bachelor's", "Graduate"), na.rm = TRUE) / 
             n_valid_edu * 100            # Bachelor's or higher
  )

detailed_stats
## # A tibble: 1 × 11
##   n_total age_mean age_sd age_min age_max pct_female pct_male n_valid_edu pct_hs
##     <int>    <dbl>  <dbl>   <dbl>   <dbl>      <dbl>    <dbl>       <int>  <dbl>
## 1    7453     51.8   17.1      18      80       54.4     45.6        7337   20.1
## # ℹ 2 more variables: pct_some_ps <dbl>, pct_college <dbl>

Understanding this code:

  • We use summarize() to calculate all statistics at once

  • na.rm = TRUE appears often because it tells R to ignore missing values

  • For education, we use %in% to combine categories (e.g., “Less than HS” and “High school”)

  • Gender percentages take the mean of TRUE/FALSE (0/1) and multiply by 100; the education percentages instead divide counts by n_valid_edu, so missing values are excluded from the denominator

Part 2: Creating Table Structure

Now we’ll organize these statistics into a table format:

# Create the structure for our enhanced table
enhanced_table <- data.frame(
  # First column: Labels for each row (14 elements)
  characteristic = c(
    "Sample size (N)",           # 1
    "",                          # 2 (spacing)
    "Age",                       # 3
    "    Mean (SD)",            # 4
    "    Range",                # 5
    "",                          # 6 (spacing)
    "Gender (%)",               # 7
    "    Female",               # 8
    "    Male",                 # 9
    "",                          # 10 (spacing)
    "Education (%)",            # 11
    "    High school or less",  # 12
    "    Some post-secondary",  # 13
    "    Bachelor's or higher"  # 14
  ),
  # Second column: The values (14 elements to match)
  value = c(
    detailed_stats$n_total,                                      # 1
    "",                                                          # 2
    "",                                                          # 3
    paste0(round(detailed_stats$age_mean, 1), " (", 
           round(detailed_stats$age_sd, 1), ")"),                # 4
    paste0(detailed_stats$age_min, "–", detailed_stats$age_max), # 5
    "",                                                          # 6
    "",                                                          # 7
    round(detailed_stats$pct_female, 1),                        # 8
    round(detailed_stats$pct_male, 1),                          # 9
    "",                                                          # 10
    "",                                                          # 11
    round(detailed_stats$pct_hs, 1),                            # 12
    round(detailed_stats$pct_some_ps, 1),                       # 13
    round(detailed_stats$pct_college, 1)                        # 14
  )
)

enhanced_table
##              characteristic       value
## 1           Sample size (N)        7453
## 2                                      
## 3                       Age            
## 4                 Mean (SD) 51.8 (17.1)
## 5                     Range       18–80
## 6                                      
## 7                Gender (%)            
## 8                    Female        54.4
## 9                      Male        45.6
## 10                                     
## 11            Education (%)            
## 12      High school or less        20.1
## 13      Some post-secondary        34.3
## 14     Bachelor's or higher        45.7

Key points about table structure:

  • We create a two-column table using data.frame()

  • Empty rows ("") create spacing between sections

  • Indentation (spaces before names) creates hierarchy

  • paste0() combines numbers and text (for age statistics)

  • round() controls decimal places

Part 3: Creating the Professional Table

Now we’ll use the gt package to create a professional-looking table:

# Create our final formatted table
enhanced_table %>%
 gt() %>%                                # Convert to gt table
 
 # Step 1: Remove column headers
 cols_label(
   characteristic = "",                  # Make both column
   value = ""                           # headers blank
 ) %>%
 
 # Step 2: Make category headers bold
 tab_style(
   style = cell_text(weight = "bold"),  # Bold text
   locations = cells_body(              # Apply to rows where
     rows = characteristic %in%          # characteristic matches
       c("Sample size (N)", "Age", "Gender (%)", "Education (%)")
   )
 ) %>%
 
 # Step 3: Format the sample size with commas
 # (Caveat: `value` is a character column here, so this fmt_number() and the
 #  fmt_percent() below may not change the display; the values were already
 #  formatted with round() and paste0())
 fmt_number(
   columns = value,                     # In the value column
   rows = characteristic == "Sample size (N)",  # For sample size only
   decimals = 0,                        # No decimal places
   use_seps = TRUE                      # Use comma separators
 ) %>%
 
 # Step 4: Format all percentages
 fmt_percent(
   columns = value,                     # In the value column
   rows = characteristic %in%           # For rows that match
     c("    Female", "    Male",
       "    High school or less",  
       "    Some post-secondary",
       "    Bachelor's or higher"),
   scale_values = FALSE,                # Values already in percentage form
   decimals = 1                         # One decimal place
 ) %>%
 
 # Step 5: Add title and notes
 tab_header(
   title = md("**Table 1. Sample Characteristics**")  # Bold title
 ) %>%
 tab_source_note(                       # Add footnote
   source_note = "Note: Percentages based on total valid responses. SD = standard deviation."
 ) %>%
 
 # Step 6: Add borders
 tab_options(
   table.border.top.width = 2,          # Thick top border
   table.border.bottom.width = 2         # Thick bottom border
 )
Table 1. Sample Characteristics
Sample size (N) 7453
Age
Mean (SD) 51.8 (17.1)
Range 18–80
Gender (%)
Female 54.4
Male 45.6
Education (%)
High school or less 20.1
Some post-secondary 34.3
Bachelor's or higher 45.7
Note: Percentages based on total valid responses. SD = standard deviation.

How to modify this code for your needs:

  1. Change statistics: Modify the summarize() section to include different calculations

  2. Change categories: Update the ‘characteristic’ column in data.frame()

  3. Modify formatting:

    • Change decimal places in round() calls

    • Adjust borders in tab_options()

    • Modify title in tab_header()

  4. Add new variables:

    • Add calculations in first code block

    • Add rows to table structure

    • Add formatting rules as needed

Table 2: Trust in Government Distribution

Step 1: Examining Our Trust Variable

First, let’s look at the trust variable to understand what we’re working with:

# Look at distribution of trust responses
table(anes_2020$TrustGovernment)
## 
##              Always    Most of the time About half the time    Some of the time 
##                  80                1016                2313                3313 
##               Never 
##                 702
# Check for missing values
sum(is.na(anes_2020$TrustGovernment))
## [1] 29

What we learn:

  • We have five ordered categories from "Never" to "Always"

  • The raw factor levels run from "Always" down to "Never", the reverse of the low-to-high order we want

  • We have some missing values to handle (29)

Step 2: Creating Initial Distribution

Let’s calculate our counts and percentages:

trust_dist <- anes_2020 %>%
  # Remove missing values for the distribution analysis
  filter(!is.na(TrustGovernment)) %>%
  # Set proper ordering of trust levels
  mutate(trust_level = factor(TrustGovernment, 
    levels = c("Never", 
               "Some of the time",
               "About half the time", 
               "Most of the time",
               "Always"))) %>%
  # Group by trust level and get counts
  group_by(trust_level) %>%
  summarize(
    count = n(),
    .groups = 'drop'
  ) %>%
  # Calculate total and percentages
  mutate(
    n_total = sum(count),
    percentage = round(count / n_total * 100, 1)
  )

trust_dist
## # A tibble: 5 × 4
##   trust_level         count n_total percentage
##   <fct>               <int>   <int>      <dbl>
## 1 Never                 702    7424        9.5
## 2 Some of the time     3313    7424       44.6
## 3 About half the time  2313    7424       31.2
## 4 Most of the time     1016    7424       13.7
## 5 Always                 80    7424        1.1

Step 3: Creating the Formatted Table

Now let’s create a professional-looking table:

# Create formatted table
trust_dist %>%
  # Select only the columns we want to display
  select(trust_level, count, percentage) %>%
  gt() %>%
  # Format columns
  cols_label(
    trust_level = "Level of Trust",
    count = "Count",
    percentage = "Percent"
  ) %>%
  # Format numbers with commas
  fmt_number(
    columns = count,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  # Format percentages
  fmt_number(
    columns = percentage,
    decimals = 1
  ) %>%
  # Add title and notes
  tab_header(
    title = md("**Table 2. Trust in Government Distribution**"),
    subtitle = "How often can you trust the federal government to do what is right?"
  ) %>%
  # Add table notes
  tab_source_note(
    source_note = sprintf(
      "Note: Based on %d valid responses.",
      first(trust_dist$n_total)
    )
  ) %>%
  # Add borders
  tab_options(
    table.border.top.width = 2,
    table.border.bottom.width = 2
  )
Table 2. Trust in Government Distribution
How often can you trust the federal government to do what is right?
Level of Trust Count Percent
Never 702 9.5
Some of the time 3,313 44.6
About half the time 2,313 31.2
Most of the time 1,016 13.7
Always 80 1.1
Note: Based on 7424 valid responses.

Understanding the code:

  1. Data preparation:

  • We manually specify the order of trust levels

  • Calculate percentages based on valid responses only

  • Store information about missing data

  2. Table formatting:

    • Clear column headers

    • Numbers formatted with commas

    • Percentages to one decimal place

    • Informative title and subtitle

    • Note showing response rates

  3. Key functions:

    • gt(): Creates the formatted table

    • cols_label(): Sets column headers

    • fmt_number(): Controls number formatting

    • tab_header(): Adds title and subtitle

    • tab_source_note(): Adds footnote

To modify this code:

  1. For different variables:

    • Change the category names in trust_level

    • Update the counts

    • Modify title and subtitle

  2. For different formatting:

    • Adjust decimals in fmt_number()

    • Change column labels in cols_label()

    • Modify borders in tab_options()

This approach ensures:

  • Logical ordering of categories

  • Correct percentage calculations

  • Professional formatting

  • Clear documentation of missing data

  • Consistent style with Table 1

Adding Professional Enhancements

One option: Using Color to Highlight Patterns

Let’s enhance our trust table by adding subtle color to highlight response patterns:

# Create enhanced trust table
trust_dist %>%
  # Select only the columns we want to display
  select(trust_level, count, percentage) %>%
  gt() %>%
  # Basic formatting (same as before)
  cols_label(
    trust_level = "Level of Trust",
    count = "Count",
    percentage = "Percent"
  ) %>%
  fmt_number(
    columns = count,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  fmt_number(
    columns = percentage,
    decimals = 1
  ) %>%
  # Add subtle background color based on percentage
  data_color(
    columns = percentage,
    colors = scales::col_numeric(
      palette = c("#ffffff", "#e6f3ff"),  # White to light blue
      domain = NULL
    )
  ) %>%
  # Add light gray header background
  tab_style(
    style = cell_fill(color = "#f6f6f6"),
    locations = cells_column_labels()
  ) %>%
  # Title and notes
  tab_header(
    title = md("**Table 2. Trust in Government**"),
  ) %>%
  tab_source_note(
    source_note = sprintf(
      "Note: Based on %d valid responses. Color intensity indicates response frequency.",
      first(trust_dist$n_total)
    )
  ) %>%
  # Borders
  tab_options(
    table.border.top.width = 2,
    table.border.bottom.width = 2
  )
Table 2. Trust in Government
Level of Trust Count Percent
Never 702 9.5
Some of the time 3,313 44.6
About half the time 2,313 31.2
Most of the time 1,016 13.7
Always 80 1.1
Note: Based on 7424 valid responses. Color intensity indicates response frequency.

Understanding the enhancements:

  1. data_color():

    • Adds color based on percentage values

    • Uses white to light blue scale

    • Higher percentages show darker blue

  2. cell_fill():

    • Adds light gray to column headers

    • Creates subtle visual separation

Tips for using color:

  • Keep it subtle (light colors)

  • Use color meaningfully (to show patterns)

  • Include explanation in table note

  • Consider color-blind friendly options (see the sketch after the next list)

You can modify this by:

  1. Changing colors in col_numeric()

  2. Adjusting header color in cell_fill()

  3. Adding color to different columns

  4. Using different color scales for different purposes
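
For example, here is a sketch of the same table with a color-blind friendlier single-hue scale (the teal endpoint is an illustrative choice; any light-to-dark sequential palette works):

trust_dist %>%
  select(trust_level, count, percentage) %>%
  gt() %>%
  data_color(
    columns = percentage,
    colors = scales::col_numeric(
      palette = c("#ffffff", "#21918c"),  # white to teal; distinguishable under common color-vision deficiencies
      domain = NULL
    )
  )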

Practice Exercises

Here are 5 exercises to help you practice what we covered in today’s session about descriptive statistics. These exercises use the same ANES dataset but explore different variables.

Exercise 1: Trust in People

First, examine the survey question for TrustPeople:

  1. Look up the exact question wording in the codebook

  2. Identify all possible response categories

Then analyze the responses:

  1. Check the data quality by calculating:

    • Total number of respondents

    • Number of valid responses

    • Number of missing responses

    • Response rate

  2. Create a distribution table showing counts and percentages for each response category

Hint: Follow the same steps we used for TrustGovernment, but apply them to TrustPeople. Remember to examine the question context first.

Exercise 2: Campaign Interest Analysis

Start by understanding the variable:

  1. Look up the exact question text for CampaignInterest in the codebook

  2. Note how the question was framed to respondents

Then create a summary analysis:

  1. Calculate data quality metrics (response rates, missing data)

  2. Create a formatted table showing:

    • Response counts

    • Percentages

    • Valid percentages (excluding missing data)

Hint: Think about how the question wording might affect response patterns

Exercise 3: Race and Education Demographics Table

Create a professional table that combines:

  1. Race/ethnicity distribution using RaceEth:

    • Counts and percentages for each category

    • Note any missing data

  2. Education levels using Education:

    • Grouped into meaningful categories

    • Percentages for each level

Format using gt with appropriate styling and clear labels.

Hint: This is different from our tutorial example as it combines two categorical demographic variables

Exercise 4: Age and Party ID

Create a descriptive analysis that examines:

  1. Age statistics by party identification:

    • Mean age for each party ID group

    • Standard deviation within groups

    • Compare age distributions across different party affiliations

  2. Present results in a professional table using gt

Hint: This combines continuous and categorical variables in a way different from our tutorial examples

Exercise 5:

Create a summary table showing the relationship between education level and party identification:

  1. Calculate the percent of each education level that identifies as Democrat (including independent-democrat), Republican (including independent-republican), or Independent

  2. Create a professional formatted table that shows:

    • Education levels

    • Distribution across these three party groups

    • Include row totals to verify percentages sum to 100%

    • Use proper formatting for numbers and percentages

Hint: Remember to combine party categories first, then calculate percentages within education levels


For your weekly diary reflection, consider:

  • How does understanding the exact survey questions help in analyzing and interpreting the data?

  • What challenges did you face in combining different types of variables in a single table?

  • How would you explain these demographic patterns to different audiences?

  • What additional context would be helpful when presenting these results?