# List of packages
packages <- c("tidyverse", "srvyr", "srvyrexploR", "broom", "gt", "modelsummary", "gapminder") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
invisible(lapply(packages, library, character.only = TRUE)) # invisible() suppresses lapply's returned list
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'srvyr'
##
##
## The following object is masked from 'package:stats':
##
## filter
##
##
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
## backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
##
## Revert to `kableExtra` for one session:
##
## options(modelsummary_factory_default = 'kableExtra')
## options(modelsummary_factory_latex = 'kableExtra')
## options(modelsummary_factory_html = 'kableExtra')
##
## Silence this message forever:
##
## config_modelsummary(startup_message = FALSE)
In our previous session, we worked with the gapminder dataset to learn basic data manipulation skills. While gapminder is excellent for learning data manipulation, it differs from the data we typically encounter in social science research:
Gapminder (Teaching Dataset):
Clean and well-organized
No missing values
Small number of variables (6)
Clear relationships
No complex survey design
Real Social Science Data:
Missing values and inconsistencies
Complex survey designs with weights
Much larger number of variables
Different types of measurements
Data quality issues
Today, we’ll learn to describe and understand real survey data using a major social science dataset:
American National Election Studies (ANES):
Measures political attitudes and behaviors
Uses complex survey design to represent U.S. population
Mix of categorical and ordinal variables
Conducted biennially since 1948
Before we analyze data, we need to understand what kinds of variables we have. Each type requires different descriptive statistics:
We could do the following:
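The output below comes from dplyr's glimpse(), which prints each column with its type and first few values (the anes_2020 data frame ships with the srvyrexploR package):

glimpse(anes_2020)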
## Rows: 7,453
## Columns: 65
## $ V200001 <dbl> 200015, 200022, 200039, 200046, 200053, 200060…
## $ CaseID <dbl> 200015, 200022, 200039, 200046, 200053, 200060…
## $ V200002 <hvn_lbll> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ InterviewMode <fct> Web, Web, Web, Web, Web, Web, Web, Web, Web, W…
## $ V200010b <dbl> 1.0057375, 1.1634731, 0.7686811, 0.5210195, 0.…
## $ Weight <dbl> 1.0057375, 1.1634731, 0.7686811, 0.5210195, 0.…
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2…
## $ VarUnit <fct> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 3…
## $ Stratum <fct> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 3…
## $ V201006 <hvn_lbll> 2, 3, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 1,…
## $ CampaignInterest <fct> Somewhat interested, Not much interested, Some…
## $ V201023 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1…
## $ EarlyVote2020 <fct> NA, NA, NA, NA, NA, NA, NA, NA, Yes, NA, NA, N…
## $ V201024 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 2, -1, -1…
## $ V201025x <hvn_lbll> 3, 3, 3, 3, 3, 3, 3, 2, 4, 3, 3, 3, 2, 4,…
## $ V201028 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1…
## $ V201029 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1…
## $ V201101 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 2, …
## $ V201102 <hvn_lbll> 1, 1, 1, 1, 1, 2, 1, 2, -1, -1, -1, 1, 2,…
## $ VotedPres2016 <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes…
## $ V201103 <hvn_lbll> 2, 5, 1, 1, 2, -1, 5, -1, 1, 1, -1, 1, -1…
## $ VotedPres2016_selection <fct> Trump, Other, Clinton, Clinton, Trump, NA, Oth…
## $ V201228 <hvn_lbll> 2, 5, 3, 2, 3, 3, 2, 2, 3, 1, 1, 1, 2, 1,…
## $ V201229 <hvn_lbll> 1, -1, -1, 2, -1, -1, 2, 2, -1, 2, 1, 2, …
## $ V201230 <hvn_lbll> -1, 2, 3, -1, 2, 3, -1, -1, 2, -1, -1, -1…
## $ V201231x <hvn_lbll> 7, 4, 3, 6, 4, 3, 6, 6, 4, 2, 1, 2, 7, 2,…
## $ PartyID <fct> Strong republican, Independent, Independent-de…
## $ V201233 <hvn_lbll> 5, 5, 4, 3, 5, 4, 4, 1, 3, 3, 2, 3, 4, 5,…
## $ TrustGovernment <fct> Never, Never, Some of the time, About half the…
## $ V201237 <hvn_lbll> 3, 4, 4, 2, 4, 2, 4, 1, 3, 2, 4, 3, 4, 3,…
## $ TrustPeople <fct> About half the time, Some of the time, Some of…
## $ V201507x <hvn_lbll> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43, 3…
## $ Age <dbl> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43, 37, 55…
## $ AgeGroup <fct> 40-49, 30-39, 40-49, 40-49, 70 or older, 70 or…
## $ V201510 <hvn_lbll> 6, 3, 2, 4, 8, 3, 4, 2, 2, 4, 2, 2, 2, 7,…
## $ Education <fct> Bachelor's, Post HS, High school, Post HS, Gra…
## $ V201546 <hvn_lbll> 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2,…
## $ V201547a <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547b <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547c <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547d <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547e <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201547z <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201549x <hvn_lbll> 3, 4, 1, 4, 5, 1, 1, 1, 1, 3, 3, 1, 1, 4,…
## $ RaceEth <fct> "Hispanic", "Asian, NH/PI", "White", "Asian, N…
## $ V201600 <hvn_lbll> 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1,…
## $ Gender <fct> Male, Female, Female, Male, Male, Female, Fema…
## $ V201607 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201610 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201611 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201613 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201615 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201616 <hvn_lbll> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
## $ V201617x <hvn_lbll> 21, 13, 17, 7, 22, 3, 4, 3, 10, 11, 9, 18…
## $ Income <fct> "$175,000-249,999", "$70,000-74,999", "$100,00…
## $ Income7 <fct> $125k or more, $60k to < 80k, $100k to < 125k,…
## $ V202051 <hvn_lbll> -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1…
## $ V202066 <hvn_lbll> 1, 4, 4, 4, 4, 4, 4, 1, -1, 4, 4, 4, 4, -…
## $ V202072 <hvn_lbll> -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1,…
## $ VotedPres2020 <fct> NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, Yes, Yes…
## $ V202073 <hvn_lbll> -1, 3, 1, 1, 2, 1, 2, -1, -1, 1, 1, 1, 2,…
## $ V202109x <hvn_lbll> 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ V202110x <hvn_lbll> -1, 3, 1, 1, 2, 1, 2, -1, 1, 1, 1, 1, 2, …
## $ VotedPres2020_selection <fct> NA, Other, Biden, Biden, Trump, Biden, Trump, …
Understanding check: how many variables and observations are there?
But that is not very effective: there is too much to look at at once, and many variables carry cryptic codes (e.g., V200001).
Our course textbook provides some key information:
https://tidy-survey-r.github.io/tidy-survey-book/anes-cb.html
But you can also look directly at the dataset website, or consult the full PDF documentation (which is 796 pages!):
According to the ANES documentation, one of our key variables is trust in government, which asks:
“How often can you trust the federal government in Washington to do what is right?”
Response options:
Always
Most of the time
About half the time
Some of the time
Never
What we want to know: How many people responded and what did they say?
Suppose you just wanted to do a quick check:
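One quick way to produce the counts below is base R's table(), which tabulates each factor level (and silently drops NAs by default):

table(anes_2020$TrustGovernment)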
##
## Always Most of the time About half the time Some of the time
## 80 1016 2313 3313
## Never
## 702
Now suppose you wanted to both arrange the responses and find out the most "frequent" or "typical" response/value, which is called the mode.
How we’ll find out: Let’s count responses by category and arrange
trust_counts <- anes_2020 %>%
group_by(TrustGovernment) %>% # Organize by response option
summarize(count = n()) %>% # Count responses in each group
filter(!is.na(TrustGovernment)) %>% # Remove missing values
arrange(desc(count)) # Sort by frequency
trust_counts
## # A tibble: 5 × 2
## TrustGovernment count
## <fct> <int>
## 1 Some of the time 3313
## 2 About half the time 2313
## 3 Most of the time 1016
## 4 Never 702
## 5 Always 80
What we found:
Most common response was “Some of the time” (3,313 people)
Followed by “About half the time” (2,313 people)
Only 80 people said they “Always” trust government
Therefore, the mode is “Some of the time”.
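As an aside, dplyr's count() collapses the group_by()/summarize()/arrange() pattern above into a single call; a minimal sketch:

anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%     # remove missing values
  count(TrustGovernment, sort = TRUE) %>% # counts, sorted descending
  slice(1)                                # the top row is the mode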
Now suppose you wanted the relative proportion or percentage (the proportion multiplied by 100), giving you the distribution of responses.
What we want to know: What percentage of people gave each response?
How we’ll find out: Let’s convert our counts into proportions
trust_props <- anes_2020 %>%
filter(!is.na(TrustGovernment)) %>% # Remove missing values first
group_by(TrustGovernment) %>% # Group by trust response
summarize( # Calculate counts and percentages
count = n(),
percentage = round(100 * n() / sum(n()), 1) # Percentage with 1 decimal
) %>%
arrange(desc(count)) # Sort by frequency
trust_props
## # A tibble: 5 × 3
## TrustGovernment count percentage
## <fct> <int> <dbl>
## 1 Some of the time 3313 100
## 2 About half the time 2313 100
## 3 Most of the time 1016 100
## 4 Never 702 100
## 5 Always 80 100
Uh oh, what happened? Every count shows 100%, which is definitely not right! These things happen when we go too quickly. That's why it's important to check, at every stage, what you are doing and what it led to.
Understanding the Issue:
Inside summarize(), sum(n()) is calculated within each group (because of our group_by())
This means each group's count is being divided by its own sum
Of course dividing anything by itself gives 100%!
Let’s fix this by calculating the total before grouping:
# First, let's store the total valid responses
total_valid <- anes_2020 %>%
filter(!is.na(TrustGovernment)) %>%
nrow()
# Now calculate proportions using this total
trust_props <- anes_2020 %>%
filter(!is.na(TrustGovernment)) %>% # Remove missing values first
group_by(TrustGovernment) %>% # Group by trust response
summarize( # Calculate counts and percentages
count = n(),
percentage = round(100 * count / total_valid, 1) # Use total_valid instead of sum(n())
) %>%
arrange(desc(count)) # Sort by frequency
trust_props
## # A tibble: 5 × 3
## TrustGovernment count percentage
## <fct> <int> <dbl>
## 1 Some of the time 3313 44.6
## 2 About half the time 2313 31.2
## 3 Most of the time 1016 13.7
## 4 Never 702 9.5
## 5 Always 80 1.1
Now we get sensible percentages that add up to 100%!
Key Lesson: Always check the results of what you did.
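An equivalent, slightly more compact fix is to compute the counts first and take percentages in a separate mutate() step, where sum(count) sees all rows because the grouping has already been dropped; a sketch:

anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%        # remove missing values first
  count(TrustGovernment, name = "count") %>% # counts per category, ungrouped result
  mutate(percentage = round(100 * count / sum(count), 1)) %>% # sum(count) is the grand total
  arrange(desc(count))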
Next, let's describe our respondents' ages. We'll explore different ways of finding the "middle" or "center" of our data; each method tells us something different about our respondents' ages.
What we want to know: What age splits our sample in half?
How we’ll find out: Let’s find the middle value when ages are ordered
age_median <- anes_2020 %>%
filter(!is.na(Age)) %>% # Remove missing ages
summarize(
median_age = median(Age),
n_valid = n() # Count valid responses
)
age_median
## # A tibble: 1 × 2
## median_age n_valid
## <dbl> <int>
## 1 53 7159
What we found:
Median age is 53 years
Half of respondents are younger than 53
Half are older than 53
Based on valid responses (after removing NAs)
What we want to know: What’s the mathematical average age?
How we’ll find out: Let’s add all ages and divide by the count
age_mean <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
mean_age = mean(Age),
n_valid = n()
)
age_mean
## # A tibble: 1 × 2
## mean_age n_valid
## <dbl> <int>
## 1 51.8 7159
What we found:
Average age is 51.8 years
Mean is slightly lower than median (51.8 vs 53)
This small difference tells us something about our age distribution (we will come back to this during distribution week!)
What we want to know: How wide is the range of ages and where do most respondents fall?
How we’ll find out: Let’s explore different ways to measure spread in our data
# Calculate the range of ages
age_range <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
min_age = min(Age),
max_age = max(Age),
age_range = max_age - min_age
)
age_range
## # A tibble: 1 × 3
## min_age max_age age_range
## <dbl> <dbl> <dbl>
## 1 18 80 62
What we found:
Youngest respondent: 18 years old
Oldest respondent: 80 years old
Total range: 62 years
Note: Range is sensitive to extreme values (consider, e.g., income in the US from lowest to highest)
What we want to know: Where do the middle 50% of ages fall?
How we’ll find out: Let’s divide our data into quarters
# Calculate quartiles
age_quartiles <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
q1 = quantile(Age, 0.25), # First quartile (25th percentile)
median = median(Age), # Second quartile (50th percentile)
q3 = quantile(Age, 0.75), # Third quartile (75th percentile)
iqr = q3 - q1 # Interquartile range
)
age_quartiles
## # A tibble: 1 × 4
## q1 median q3 iqr
## <dbl> <dbl> <dbl> <dbl>
## 1 37 53 66 29
Understanding the Quartiles:
Q1 (25th percentile): 37 years - 25% are younger
Median (50th percentile): 53 years - middle value
Q3 (75th percentile): 66 years - 75% are younger
IQR: 29 years (i.e., 37 to 66) - where the middle 50% of ages fall
What we want to know: Can we get all key statistics at once?
How we’ll find out: Let’s use R’s summary() function
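Applied to our age variable, that is a single line:

summary(anes_2020$Age)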
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 37.00 53.00 51.83 66.00 80.00 294
Understanding the Output:
Minimum: 18 years (youngest respondent)
1st Quartile: 37 years (25% are younger)
Median: 53 years (middle value)
Mean: 51.83 years (average)
3rd Quartile: 66 years (75% are younger)
Maximum: 80 years (oldest respondent)
Missing Values: 294 NAs
The relationship between mean and median can tell us about the shape of our distribution:
When mean ≈ median: Suggests symmetrical distribution
When mean < median: Suggests negative skew (tail to left)
When mean > median: Suggests positive skew (tail to right)
We will come back to this during the distribution week and visualize!
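For quick intuition, here is a tiny made-up example in which one extreme value drags the mean far above the median (positive skew):

x <- c(1, 2, 3, 4, 100)
mean(x)   # 22 - pulled up by the extreme value
median(x) # 3  - unaffected by it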
Standard deviation measures the typical distance from the average. The formula is:
SD = sqrt( Σ(x - μ)² / n )
Where:
- SD is the standard deviation
- x is each value in our data
- μ (mu) is the mean
- n is the number of observations
- Σ means "sum up"
Why do we calculate it this way?
First, we find each value’s distance from the mean (x - μ). This tells us how far each observation is from average.
We square these differences for two key reasons: squaring makes every distance positive (so deviations above and below the mean don’t cancel out), and it gives extra weight to observations far from the mean.
We average these squared differences (÷ n) to get a typical squared distance.
Finally, we take the square root to get back to the original units (years in our case).
Let’s use age in our ANES survey as an example:
# Calculate mean and standard deviation of age
age_stats <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
mean_age = round(mean(Age), 1),
sd_age = round(sd(Age), 1)
)
age_stats
## # A tibble: 1 × 2
## mean_age sd_age
## <dbl> <dbl>
## 1 51.8 17.1
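As a sanity check, we can apply the formula by hand (a minimal sketch; note that R's sd() actually divides by n - 1, the sample standard deviation, so it differs slightly from the divide-by-n formula above):

ages <- na.omit(anes_2020$Age)                        # drop missing ages
sqrt(sum((ages - mean(ages))^2) / (length(ages) - 1)) # reproduces sd(ages)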
In our data:
The average age is about 52 years
The standard deviation is about 17 years
What does this mean in plain language?
Most respondents’ ages fall within 17 years of 52 years old
So most people are between 35 and 69 years old
This gives us a sense of how “spread out” the ages are
There are some helpful guidelines about standard deviation:
The 68-95-99.7 Rule:
About 68% of people fall within 1 standard deviation of the mean
About 95% fall within 2 standard deviations
Almost everyone (99.7%) falls within 3 standard deviations
Let’s see what this means for our age data:
# Calculate the intervals
age_ranges <- anes_2020 %>%
filter(!is.na(Age)) %>%
summarize(
mean = round(mean(Age), 1),
sd = round(sd(Age), 1),
# One SD range (about 68% of people)
one_sd_low = round(mean - sd, 0),
one_sd_high = round(mean + sd, 0),
# Two SD range (about 95% of people)
two_sd_low = round(mean - (2 * sd), 0),
two_sd_high = round(mean + (2 * sd), 0)
)
age_ranges
## # A tibble: 1 × 6
## mean sd one_sd_low one_sd_high two_sd_low two_sd_high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 51.8 17.1 35 69 18 86
What This Tells Us:
Most respondents (68%) are between 35 and 69 years old
Almost all (95%) are between 18 and 86 years old
This makes sense given the voting age minimum of 18 (we verify the rule directly in the sketch below)
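Because the 68-95-99.7 rule strictly describes normal distributions, and ages here are bounded at 18 and 80, it is worth checking how well it holds empirically; a quick sketch:

anes_2020 %>%
  filter(!is.na(Age)) %>%
  summarize(
    within_1sd = mean(abs(Age - mean(Age)) <= sd(Age)),    # share within 1 SD
    within_2sd = mean(abs(Age - mean(Age)) <= 2 * sd(Age)) # share within 2 SDs
  )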
Identifying Unusual Cases
Values more than 2 standard deviations from the mean are unusual
Values more than 3 standard deviations are very rare
This helps us spot interesting patterns or potential data-quality issues
Use Range When:
You need a quick overview
Extreme values matter
Explaining to non-technical audiences
Use Quartiles When:
You want to know about typical “segments”
Outliers might distort your picture
You need to identify “middle” or “quarter” groups
Use Standard Deviation When:
You need precise measures of spread
You want to report both the average and typical variation around a key variable (typical in a descriptive table)
A fundamental skill we will practice is leveraging key descriptive statistics and turning them into clear tables that describe our sample or provide an overview of its key characteristics. Such a table (in some variant) is a standard element of quantitative research articles, often referred to as a "Table 1".
While learning to create customized tables is important, sometimes we need quick descriptive statistics during our exploratory phase. The datasummary_skim() function from the modelsummary package offers a simple way to generate descriptive statistics.
Let’s look at a few numeric variables using this simpler approach:
# Create a simpler dataset with just numeric variables of interest
demo_vars <- anes_2020 %>%
select(Age, Income7) %>%
  # Ensure variables are numeric (as.numeric() on a factor like Income7 returns its underlying integer codes, here 1-7)
mutate(across(everything(), as.numeric))
# Quick summary
datasummary_skim(demo_vars)
|         | Unique | Missing Pct. | Mean | SD   | Min  | Median | Max  | Histogram |
|---------|--------|--------------|------|------|------|--------|------|-----------|
| Age     | 64     | 4            | 51.8 | 17.1 | 18.0 | 53.0   | 80.0 |           |
| Income7 | 8      | 7            | 4.0  | 2.1  | 1.0  | 4.0    | 7.0  |           |
For categorical variables like trust and education, we can use the same function with a different type:
# Create a dataset with categorical variables
cat_vars <- anes_2020 %>%
select(TrustGovernment, Education) %>%
# Ensure variables are factors
mutate(across(everything(), as.factor))
# Quick summary of categorical variables
datasummary_skim(cat_vars, type = "categorical")
|                 |                     | N    | %    |
|-----------------|---------------------|------|------|
| TrustGovernment | Always              | 80   | 1.1  |
|                 | Most of the time    | 1016 | 13.6 |
|                 | About half the time | 2313 | 31.0 |
|                 | Some of the time    | 3313 | 44.5 |
|                 | Never               | 702  | 9.4  |
| Education       | Less than HS        | 312  | 4.2  |
|                 | High school         | 1160 | 15.6 |
|                 | Post HS             | 2514 | 33.7 |
|                 | Bachelor's          | 1877 | 25.2 |
|                 | Graduate            | 1474 | 19.8 |
Advantages of this approach:
Quick and easy to use
Standardized output format
Good for initial data exploration
Requires less coding than custom tables
While datasummary_skim() is great for quick exploration, it offers less customization than the approach based on the gt package that we'll learn next. Think of it as another useful tool in your toolkit.
Before creating tables, we need to understand what we want to show and check our data. Let’s work through this step by step.
First, let’s look at key variables we might want to include:
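The three checks below can be produced along these lines (a sketch):

anes_2020 %>% count(Education, sort = TRUE) # counts per education level, including NA
table(anes_2020$TrustGovernment)            # trust responses (NAs dropped)
summary(anes_2020$Age)                      # five-number summary, mean, and NA count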
## # A tibble: 6 × 2
## Education n
## <fct> <int>
## 1 Post HS 2514
## 2 Bachelor's 1877
## 3 Graduate 1474
## 4 High school 1160
## 5 Less than HS 312
## 6 <NA> 116
##
## Always Most of the time About half the time Some of the time
## 80 1016 2313 3313
## Never
## 702
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 37.00 53.00 51.83 66.00 80.00 294
What we learned:
Education has several categories we might want to combine
Trust has 5 categories from Never –> Always
Age ranges from 18-80
Some missing values to consider
We’ll create two key tables:
Sample characteristics (demographics)
Distribution of trust in government (outcome of interest)
Let’s start with sample characteristics.
First, let’s calculate our basic statistics:
# Calculate basic statistics
sample_stats <- anes_2020 %>% # Start with our dataset
summarize( # Create summary statistics
# Count total number of respondents
n_total = n(),
# Calculate mean and standard deviation of Age
age_mean = mean(Age, na.rm = TRUE), # na.rm = TRUE tells R to ignore missing values
age_sd = sd(Age, na.rm = TRUE), # sd() calculates standard deviation
# Calculate percentage female
# mean() of TRUE/FALSE gives us proportion (0-1), multiply by 100 for percentage
pct_female = mean(Gender == "Female", na.rm = TRUE) * 100,
# Calculate percentage with high school or less
# %in% checks if Education is either "Less than HS" OR "High school"
pct_hs_or_less = mean(Education %in% c("Less than HS", "High school"),
na.rm = TRUE) * 100
)
# Look at our calculations
sample_stats
## # A tibble: 1 × 5
## n_total age_mean age_sd pct_female pct_hs_or_less
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 7453 51.8 17.1 54.4 19.8
Understanding each part of the code:
summarize(): Creates summary statistics for your data
n(): Counts number of rows (total respondents)
mean(): Calculates average
sd(): Calculates standard deviation
na.rm = TRUE: Tells R to remove missing values before calculating
Education %in% c("Less than HS", "High school"): Checks if education falls into either category
* 100: Converts proportion to percentage
Now let’s create a nicely formatted table:
# Create basic table structure
basic_table <- data.frame( # Create a new data frame
# First column: names of our statistics
characteristic = c(
"Sample size (N)",
"Age, mean (SD)", # (SD) = Standard Deviation
"Female (%)",
"High school or less (%)" # Changed from college degree
),
# Second column: actual values
value = c(
sample_stats$n_total, # Sample size
# paste0() combines text - here combining mean and SD with formatting
paste0(round(sample_stats$age_mean, 1), " (",
round(sample_stats$age_sd, 1), ")"), # round(x, 1) rounds to 1 decimal
round(sample_stats$pct_female, 1), # Female percentage
round(sample_stats$pct_hs_or_less, 1) # Education percentage
)
)
basic_table
## characteristic value
## 1 Sample size (N) 7453
## 2 Age, mean (SD) 51.8 (17.1)
## 3 Female (%) 54.4
## 4 High school or less (%) 19.8
Some functions explained:
paste0(): Combines text and numbers (e.g., combining mean and SD)
round(): Rounds numbers (first argument is the number, second is decimal places)
%in%: Checks if values are in a set of options
c(): Combines values into a vector
# Create formatted table using gt package
basic_table %>% # Take our table and
gt() %>% # Convert to gt format
cols_label( # Remove column headers
characteristic = "", # First column blank header
value = "" # Second column blank header
) %>%
tab_header( # Add title
title = "Sample Characteristics"
)
Sample Characteristics

|                         |             |
|-------------------------|-------------|
| Sample size (N)         | 7453        |
| Age, mean (SD)          | 51.8 (17.1) |
| Female (%)              | 54.4        |
| High school or less (%) | 19.8        |
Let’s build a more detailed table in three parts:
Calculate our statistics
Structure our table
Add professional formatting
First, let’s calculate all the statistics we want to show:
# Calculate all our statistics at once
detailed_stats <- anes_2020 %>%
summarize(
# Basic counts
n_total = n(), # Total sample size
# Age statistics
age_mean = mean(Age, na.rm = TRUE), # Average age
age_sd = sd(Age, na.rm = TRUE), # Standard deviation of age
age_min = min(Age, na.rm = TRUE), # Youngest age
age_max = max(Age, na.rm = TRUE), # Oldest age
# Gender percentages
pct_female = mean(Gender == "Female", na.rm = TRUE) * 100,
pct_male = mean(Gender == "Male", na.rm = TRUE) * 100,
# Education percentages - three levels
n_valid_edu = sum(!is.na(Education)),
pct_hs = sum(Education %in% c("Less than HS", "High school"), na.rm = TRUE) /
n_valid_edu * 100, # High school or less
pct_some_ps = sum(Education == "Post HS", na.rm = TRUE) /
n_valid_edu * 100, # Some post-secondary
pct_college = sum(Education %in% c("Bachelor's", "Graduate"), na.rm = TRUE) /
n_valid_edu * 100 # Bachelor's or higher
)
detailed_stats
## # A tibble: 1 × 11
## n_total age_mean age_sd age_min age_max pct_female pct_male n_valid_edu pct_hs
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 7453 51.8 17.1 18 80 54.4 45.6 7337 20.1
## # ℹ 2 more variables: pct_some_ps <dbl>, pct_college <dbl>
Understanding this code:
We use summarize() to calculate all statistics at once
na.rm = TRUE appears often because it tells R to ignore missing values
For education, we use %in% to combine categories (e.g., "Less than HS" and "High school")
The gender percentages take the mean of TRUE/FALSE (0/1) values and multiply by 100; the education percentages instead divide each count by the number of valid education responses (n_valid_edu)
Now we’ll organize these statistics into a table format:
# Create the structure for our enhanced table
enhanced_table <- data.frame(
# First column: Labels for each row (14 elements)
characteristic = c(
"Sample size (N)", # 1
"", # 2 (spacing)
"Age", # 3
" Mean (SD)", # 4
" Range", # 5
"", # 6 (spacing)
"Gender (%)", # 7
" Female", # 8
" Male", # 9
"", # 10 (spacing)
"Education (%)", # 11
" High school or less", # 12
" Some post-secondary", # 13
" Bachelor's or higher" # 14
),
# Second column: The values (14 elements to match)
value = c(
detailed_stats$n_total, # 1
"", # 2
"", # 3
paste0(round(detailed_stats$age_mean, 1), " (",
round(detailed_stats$age_sd, 1), ")"), # 4
paste0(detailed_stats$age_min, "–", detailed_stats$age_max), # 5
"", # 6
"", # 7
round(detailed_stats$pct_female, 1), # 8
round(detailed_stats$pct_male, 1), # 9
"", # 10
"", # 11
round(detailed_stats$pct_hs, 1), # 12
round(detailed_stats$pct_some_ps, 1), # 13
round(detailed_stats$pct_college, 1) # 14
)
)
enhanced_table
## characteristic value
## 1 Sample size (N) 7453
## 2
## 3 Age
## 4 Mean (SD) 51.8 (17.1)
## 5 Range 18–80
## 6
## 7 Gender (%)
## 8 Female 54.4
## 9 Male 45.6
## 10
## 11 Education (%)
## 12 High school or less 20.1
## 13 Some post-secondary 34.3
## 14 Bachelor's or higher 45.7
Key points about table structure:
We create a two-column table using data.frame()
Empty rows ("") create spacing between sections
Indentation (spaces before names) creates hierarchy
paste0() combines numbers and text (for the age statistics)
round() controls decimal places
Now we’ll use the gt package to create a professional-looking table:
# Create our final formatted table
enhanced_table %>%
gt() %>% # Convert to gt table
# Step 1: Remove column headers
cols_label(
characteristic = "", # Make both column
value = "" # headers blank
) %>%
# Step 2: Make category headers bold
tab_style(
style = cell_text(weight = "bold"), # Bold text
locations = cells_body( # Apply to rows where
rows = characteristic %in% # characteristic matches
c("Sample size (N)", "Age", "Gender (%)", "Education (%)")
)
) %>%
# Step 3: Format the sample size with commas
fmt_number(
columns = value, # In the value column
rows = characteristic == "Sample size (N)", # For sample size only
decimals = 0, # No decimal places
use_seps = TRUE # Use comma separators
) %>%
# Step 4: Format all percentages
fmt_percent(
columns = value, # In the value column
rows = characteristic %in% # For rows that match
c(" Female", " Male",
" High school or less",
" Some post-secondary",
" Bachelor's or higher"),
scale_values = FALSE, # Values already in percentage form
decimals = 1 # One decimal place
) %>%
# Step 5: Add title and notes
tab_header(
title = md("**Table 1. Sample Characteristics**") # Bold title
) %>%
tab_source_note( # Add footnote
source_note = "Note: Percentages based on total valid responses. SD = standard deviation."
) %>%
# Step 6: Add borders
tab_options(
table.border.top.width = 2, # Thick top border
table.border.bottom.width = 2 # Thick bottom border
)
Table 1. Sample Characteristics

|                        |             |
|------------------------|-------------|
| Sample size (N)        | 7453        |
| Age                    |             |
|   Mean (SD)            | 51.8 (17.1) |
|   Range                | 18–80       |
| Gender (%)             |             |
|   Female               | 54.4        |
|   Male                 | 45.6        |
| Education (%)          |             |
|   High school or less  | 20.1        |
|   Some post-secondary  | 34.3        |
|   Bachelor's or higher | 45.7        |

Note: Percentages based on total valid responses. SD = standard deviation.
How to modify this code for your needs:
Change statistics: Modify the summarize() section to include different calculations
Change categories: Update the 'characteristic' column in data.frame()
Modify formatting:
Change decimal places in round() calls
Adjust borders in tab_options()
Modify title in tab_header()
Add new variables:
Add calculations in the first code block
Add rows to the table structure
Add formatting rules as needed
First, let’s look at the trust variable to understand what we’re working with:
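The two quick checks below can be produced with, for example:

table(anes_2020$TrustGovernment)      # counts per category (NAs dropped)
sum(is.na(anes_2020$TrustGovernment)) # number of missing responses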
##
## Always Most of the time About half the time Some of the time
## 80 1016 2313 3313
## Never
## 702
## [1] 29
What we learn:
We have five ordered categories from "Never" to "Always"
In the raw data the categories run from "Always" down to "Never" (the codebook's coding order); for our table we want the reverse, ascending order
We have some missing values to handle (29)
Let’s calculate our counts and percentages:
trust_dist <- anes_2020 %>%
# Calculate total responses including missing before filtering
mutate(total_responses = n()) %>%
# Remove missing values for the distribution analysis
filter(!is.na(TrustGovernment)) %>%
# Set proper ordering of trust levels
mutate(trust_level = factor(TrustGovernment,
levels = c("Never",
"Some of the time",
"About half the time",
"Most of the time",
"Always"))) %>%
# Group by trust level and get counts
group_by(trust_level) %>%
summarize(
count = n(),
.groups = 'drop'
) %>%
# Calculate total and percentages
mutate(
n_total = sum(count),
percentage = round(count / n_total * 100, 1)
)
trust_dist
## # A tibble: 5 × 4
## trust_level count n_total percentage
## <fct> <int> <int> <dbl>
## 1 Never 702 7424 9.5
## 2 Some of the time 3313 7424 44.6
## 3 About half the time 2313 7424 31.2
## 4 Most of the time 1016 7424 13.7
## 5 Always 80 7424 1.1
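Since the raw levels already run from "Always" down to "Never", forcats (loaded with the tidyverse) offers a shortcut: fct_rev() reverses the existing level order, which here is equivalent to the explicit factor() call above; a sketch:

anes_2020 %>%
  filter(!is.na(TrustGovernment)) %>%
  mutate(trust_level = fct_rev(TrustGovernment)) %>% # "Never" first, "Always" last
  count(trust_level, name = "count")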
Now let’s create a professional-looking table:
# Create formatted table
trust_dist %>%
# Select only the columns we want to display
select(trust_level, count, percentage) %>%
gt() %>%
# Format columns
cols_label(
trust_level = "Level of Trust",
count = "Count",
percentage = "Percent"
) %>%
# Format numbers with commas
fmt_number(
columns = count,
decimals = 0,
use_seps = TRUE
) %>%
# Format percentages
fmt_number(
columns = percentage,
decimals = 1
) %>%
# Add title and notes
tab_header(
title = md("**Table 2. Trust in Government Distribution**"),
subtitle = "How often can you trust the federal government to do what is right?"
) %>%
# Add table notes
tab_source_note(
source_note = sprintf(
"Note: Based on %d valid responses.",
first(trust_dist$n_total)
)
) %>%
# Add borders
tab_options(
table.border.top.width = 2,
table.border.bottom.width = 2
)
Table 2. Trust in Government Distribution
How often can you trust the federal government to do what is right?

| Level of Trust      | Count | Percent |
|---------------------|-------|---------|
| Never               | 702   | 9.5     |
| Some of the time    | 3,313 | 44.6    |
| About half the time | 2,313 | 31.2    |
| Most of the time    | 1,016 | 13.7    |
| Always              | 80    | 1.1     |

Note: Based on 7424 valid responses.
Understanding the code:
1. Data preparation:
We manually specify the order of the trust levels
Calculate percentages based on valid responses only
Store information about missing data
2. Table formatting:
Clear column headers
Numbers formatted with commas
Percentages to one decimal place
Informative title and subtitle
Note showing the number of valid responses
Key functions:
gt(): Creates the formatted table
cols_label(): Sets column headers
fmt_number(): Controls number formatting
tab_header(): Adds title and subtitle
tab_source_note(): Adds footnote
To modify this code:
For different variables:
Change the category names in trust_level
Update the counts
Modify title and subtitle
For different formatting:
Adjust decimals in fmt_number()
Change column labels in cols_label()
Modify borders in tab_options()
This approach ensures:
Logical ordering of categories
Correct percentage calculations
Professional formatting
Clear documentation of missing data
Consistent style with Table 1
Let’s enhance our trust table by adding subtle color to highlight response patterns:
# Create enhanced trust table
trust_dist %>%
# Select only the columns we want to display
select(trust_level, count, percentage) %>%
gt() %>%
# Basic formatting (same as before)
cols_label(
trust_level = "Level of Trust",
count = "Count",
percentage = "Percent"
) %>%
fmt_number(
columns = count,
decimals = 0,
use_seps = TRUE
) %>%
fmt_number(
columns = percentage,
decimals = 1
) %>%
# Add subtle background color based on percentage
data_color(
columns = percentage,
colors = scales::col_numeric(
palette = c("#ffffff", "#e6f3ff"), # White to light blue
domain = NULL
)
) %>%
# Add light gray header background
tab_style(
style = cell_fill(color = "#f6f6f6"),
locations = cells_column_labels()
) %>%
# Title and notes
tab_header(
title = md("**Table 2. Trust in Government**"),
) %>%
tab_source_note(
source_note = sprintf(
"Note: Based on %d valid responses. Color intensity indicates response frequency.",
first(trust_dist$n_total)
)
) %>%
# Borders
tab_options(
table.border.top.width = 2,
table.border.bottom.width = 2
)
Table 2. Trust in Government

| Level of Trust      | Count | Percent |
|---------------------|-------|---------|
| Never               | 702   | 9.5     |
| Some of the time    | 3,313 | 44.6    |
| About half the time | 2,313 | 31.2    |
| Most of the time    | 1,016 | 13.7    |
| Always              | 80    | 1.1     |

Note: Based on 7424 valid responses. Color intensity indicates response frequency.
Understanding the enhancements:
data_color():
Adds color based on percentage values
Uses a white to light blue scale
Higher percentages show darker blue
cell_fill():
Adds light gray to column headers
Creates subtle visual separation
Tips for using color:
Keep it subtle (light colors)
Use color meaningfully (to show patterns)
Include explanation in table note
Consider color-blind friendly options
You can modify this by:
Changing colors in col_numeric()
Adjusting header color in cell_fill()
Adding color to different columns
Using different color scales for different purposes
Here are 5 exercises to help you practice what we covered in today’s session about descriptive statistics. These exercises use the same ANES dataset but explore different variables.
First, examine the survey question for TrustPeople:
Look up the exact question wording in the codebook
Identify all possible response categories
Then analyze the responses:
Check the data quality by calculating:
Total number of respondents
Number of valid responses
Number of missing responses
Response rate
Create a distribution table showing counts and percentages for each response category
Hint: Follow the same steps we used for TrustGovernment, but apply them to TrustPeople. Remember to examine the question context first
Start by understanding the variable:
Look up the exact question text for CampaignInterest in the codebook
Note how the question was framed to respondents
Then create a summary analysis:
Calculate data quality metrics (response rates, missing data)
Create a formatted table showing:
Response counts
Percentages
Valid percentages (excluding missing data)
Hint: Think about how the question wording might affect response patterns
Create a professional table that combines:
Race/ethnicity distribution using RaceEth:
Counts and percentages for each category
Note any missing data
Education levels using Education:
Grouped into meaningful categories
Percentages for each level
Grouped into meaningful categories
Percentages for each level
Format using gt with appropriate styling and clear labels.
Hint: This is different from our tutorial example as it combines two categorical demographic variables
Create a descriptive analysis that examines:
Age statistics by party identification:
Mean age for each party ID group
Standard deviation within groups
Compare age distributions across different party affiliations
Present results in a professional table using gt
Hint: This combines continuous and categorical variables in a way different from our tutorial examples
Create a summary table showing the relationship between education level and party identification:
Calculate the percent of each education level that identifies as Democrat (including independent-democrat), Republican (including independent-republican), or Independent
Create a professional formatted table that shows:
Education levels
Distribution across these three party groups
Include row totals to verify percentages sum to 100%
Use proper formatting for numbers and percentages
Hint: Remember to combine party categories first, then calculate percentages within education levels
For your weekly diary reflection, consider:
How does understanding the exact survey questions help in analyzing and interpreting the data?
What challenges did you face in combining different types of variables in a single table?
How would you explain these demographic patterns to different audiences?
What additional context would be helpful when presenting these results?