Tutorial 3 (252)

# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "viridis", "kableExtra", "flextable", "officer") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

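# Load the packages (the list printed below is lapply()'s return value;
# wrapping the call in invisible() would suppress it)
lapply(packages, library, character.only = TRUE)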

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
##   backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
## 
## Revert to `kableExtra` for one session:
## 
##   options(modelsummary_factory_default = 'kableExtra')
##   options(modelsummary_factory_latex = 'kableExtra')
##   options(modelsummary_factory_html = 'kableExtra')
## 
## Silence this message forever:
## 
##   config_modelsummary(startup_message = FALSE)
## 
## Loading required package: viridisLite
## 
## 
## Attaching package: 'kableExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
## 
## 
## 
## Attaching package: 'flextable'
## 
## 
## The following objects are masked from 'package:kableExtra':
## 
##     as_image, footnote
## 
## 
## The following object is masked from 'package:purrr':
## 
##     compose

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "viridis"      "viridisLite"  "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[5]]
##  [1] "kableExtra"   "viridis"      "viridisLite"  "modelsummary" "fst"         
##  [6] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [11] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [16] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [21] "methods"      "base"        
## 
## [[6]]
##  [1] "flextable"    "kableExtra"   "viridis"      "viridisLite"  "modelsummary"
##  [6] "fst"          "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[7]]
##  [1] "officer"      "flextable"    "kableExtra"   "viridis"      "viridisLite" 
##  [6] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"

Visualizing and looking into the GSS

The General Social Survey (GSS) is one of the most widely used and longest-running surveys in the US.

Full details can be found here: https://gssdataexplorer.norc.org/variables/vfilter

First, we will use a special version with data running from 1972-2018 that already has the EGP occupational class coding embedded.

load("gss2018_egp.RData") # restores the saved data frame, named df, into the workspace
gss <- df
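
One detail worth knowing here: `load()` restores objects under the names they were saved with and returns only a character vector of those names, which is why the data frame arrives as `df` and is then copied to `gss`. A minimal sketch with a temporary file and toy data (the file contents below are invented for illustration and are not the GSS):

```r
# Toy stand-in for the real data, written to a temporary .RData file
df <- data.frame(year = c(1972, 1974), egp = c("I", "II"))
tmp <- tempfile(fileext = ".RData")
save(df, file = tmp)
rm(df)

# load() returns the NAMES of the restored objects, not the objects themselves
loaded_names <- load(tmp)
print(loaded_names)

# The data frame is now available under its saved name
gss_toy <- df
print(dim(gss_toy))
```

Checking the returned names is a quick way to find out what an unfamiliar `.RData` file contains.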

An important step is to always check how your variables of interest are currently coded before you do anything with them. The best way to explore a dataset like the GSS is to consult the survey website alongside the data.

str(gss$egp)

##  Factor w/ 10 levels "I","II","IIIa",..: 1 4 3 7 1 3 8 6 4 8 ...
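
Beyond `str()`, a few base R calls give a fuller picture of a factor. The sketch below uses a toy vector (values invented for illustration) rather than `gss$egp`:

```r
# Toy factor standing in for gss$egp; "VIIb" is declared but never observed
egp_toy <- factor(c("I", "II", "I", NA, "IIIa"),
                  levels = c("I", "II", "IIIa", "VIIb"))

print(levels(egp_toy))             # all declared levels, including unused ones
print(nlevels(egp_toy))            # number of declared levels
n_missing <- sum(is.na(egp_toy))   # number of missing observations
print(n_missing)
print(summary(egp_toy))            # counts per level, plus the NA count
```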

Here is what the full EGP class scheme captures and what the Roman numerals mean:

Class I

Higher-grade professionals, administrators, and officials; managers in large industrial establishments; large proprietors

Class II

Lower-grade professionals, administrators, and officials; higher-grade technicians; managers in small industrial establishments; supervisors of non-manual employees

Class IIIa

Routine non-manual employees, higher-grade (administration and commerce)

Class IIIb

Routine non-manual employees, lower-grade (sales and service)

Class IVa

Small proprietors, artisans, etc., with employees

Class IVb

Small proprietors, artisans, etc., without employees

Class IVc

Farmers and small-holders; other self-employed workers in primary production

Class V

Lower-grade technicians; supervisors of manual workers

Class VI

Skilled manual workers

Class VIIa

Semi- and unskilled manual workers (not in agriculture)

Class VIIb

Agricultural and other workers in primary production

Cleaning and labeling

In this chunk of code, we use the mutate() function from the dplyr package to convert the egp variable into an ordered factor with meaningful labels. This makes the categories easier to interpret in the subsequent analysis and visualization.

gss <- gss %>%
  mutate(egp = factor(egp, levels = c(
    "I",
    "II",
    "IIIa",
    "IIIb",
    "IVa",
    "IVb",
    "IVc",
    "V",
    "VI",
    "VIIa",
    "VIIb"
  ), labels = c(
    "Higher-grade professionals, managers, large proprietors",
    "Lower-grade professionals, technicians, non-manual supervisors",
    "Higher-grade routine non-manual employees (admin/commerce)",
    "Lower-grade routine non-manual employees (sales/service)",
    "Small proprietors, artisans with employees",
    "Small proprietors, artisans without employees",
    "Farmers, small-holders, self-employed in primary production",
    "Lower-grade technicians, manual supervisors",
    "Skilled manual workers",
    "Semi- and unskilled manual workers (not agriculture)",
    "Agricultural and primary production workers"
  ), ordered = TRUE))
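
To see what `levels`, `labels`, and `ordered` each contribute, here is a small sketch on toy codes (invented, with shortened labels). Two behaviors are worth noticing: any value not listed in `levels` becomes NA without a warning, and once labels are applied, later code has to match on the labels rather than the original codes.

```r
# Toy codes; "zz" is not a declared level and "III" is declared but unobserved
codes <- c("II", "I", "II", "zz")

cls <- factor(codes,
              levels = c("I", "II", "III"),
              labels = c("Upper", "Middle", "Lower"),
              ordered = TRUE)

print(cls)              # "zz" has become NA
print(levels(cls))      # the labels have replaced the original codes
print(cls < "Middle")   # ordered factors support comparisons
```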

Now, we will filter out any missing values in the egp column and then summarize the data by year and EGP class. Finally, we will create a stacked bar chart to visualize the distribution of EGP classes over the survey years. This process involves several steps: filtering data, summarizing it, and plotting the results using ggplot2.

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Filter out NAs in the egp column
gss_filtered <- gss %>%
  filter(!is.na(egp)) # Remove rows where egp is NA

# Summarize the data by year and EGP class
egp_summary <- gss_filtered %>%
  count(year, egp) %>%          # Count the number of occurrences of each egp class per year
  group_by(year) %>%            # Group the data by year
  mutate(total = sum(n),        # Calculate the total number of observations per year
         proportion = n / total) # Calculate the proportion of each egp class within the year

# Create the stacked bar chart
ggplot(egp_summary, aes(x = factor(year), y = proportion, fill = egp)) + # Create the plot with year on x-axis and proportion on y-axis
  geom_bar(stat = "identity", position = "fill") + # Create a stacked bar chart with bars filled proportionally
  scale_y_continuous(labels = scales::percent_format()) + # Convert y-axis to percentage format
  scale_fill_brewer(palette = "Paired") + # Change the color palette for better differentiation
  labs(title = "Distribution of EGP Class Scheme Over Survey Years", # Add plot title
       x = "Year", # Label x-axis
       y = "Proportion", # Label y-axis
       fill = "EGP Class") + # Label the legend
  theme_minimal() + # Apply a minimal theme to the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text for better readability
        legend.position = "bottom", # Position the legend at the bottom
        legend.title = element_text(size = 10), # Set legend title text size
        legend.text = element_text(size = 8)) + # Set legend text size
  guides(fill = guide_legend(nrow = 3, byrow = TRUE, title.position = "top")) # Customize legend guide
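
As a sanity check on the summarizing step, the same within-year proportions can be reproduced in base R with `prop.table()`. This is a sketch on toy data (invented values), not a replacement for the dplyr pipeline above:

```r
# Toy data: two survey years, two classes
toy <- data.frame(
  year = c(2000, 2000, 2000, 2002, 2002),
  egp  = c("I", "I", "II", "I", "II")
)

counts <- table(toy$year, toy$egp)         # rows = years, columns = classes
shares <- prop.table(counts, margin = 1)   # divide each row by its row total

print(round(shares, 3))
print(rowSums(shares))                     # each year should sum to 1
```

Because the proportions already sum to 1 within each year, `position = "fill"` in the plot should leave the bar heights unchanged.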

A key takeaway here concerns researcher degrees of freedom. Essentially, we can recode variables in any way we want: we can collapse a 10-class scheme into a binary one, or subdivide it further into 15-20 classes. The problem arises when researchers recode only to obtain the result they want to see. As we see below, recoding the same class scheme can make a substantial difference to what a variable does and does not capture. There may be substantive reasons for the binary coding below, but look at how much the distribution changes.

What we want is to make informed coding choices. If you want to code a class scheme, turn to the literature to justify and explain why your scheme emphasizes particular distinctions (e.g., manual/non-manual). All three recodings below, which vary in the distinctions they draw (and thus in conceptualization) and in granularity, are based on: https://inequality.stanford.edu/sites/default/files/Mitnik_Cumberworth_2016_1.pdf
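
To make the point concrete before turning to the GSS, here is a toy illustration of how much a recode can change the picture. The counts are invented and are not GSS results:

```r
# Toy class variable with invented counts
cls <- c(rep("I", 2), rep("II", 3), rep("VI", 4), rep("VIIa", 11))

# Detailed distribution across four classes
detailed <- prop.table(table(cls))

# One possible binary recode: classes I and II versus everyone else
binary <- ifelse(cls %in% c("I", "II"),
                 "Professionals/Managers", "All Other Classes")
collapsed <- prop.table(table(binary))

print(round(detailed, 2))
print(round(collapsed, 2))
```

The underlying data are identical in both tables; only the coding decision differs, which is why that decision needs a justification.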

Class Scheme 1: Intermediate (stressing manual / nonmanual divide)

# Intermediate (stressing manual / nonmanual divide)
gss_intermediate_manual <- gss %>%
  mutate(class_scheme = case_when(
    egp %in% c("Higher-grade professionals, managers, large proprietors", 
               "Lower-grade professionals, technicians, non-manual supervisors") ~ "Higher Non-Manual",
    egp %in% c("Higher-grade routine non-manual employees (admin/commerce)", 
               "Lower-grade routine non-manual employees (sales/service)") ~ "Routine Non-Manual",
    egp %in% c("Small proprietors, artisans with employees", 
               "Small proprietors, artisans without employees") ~ "Small Proprietors",
    egp %in% c("Lower-grade technicians, manual supervisors", 
               "Skilled manual workers") ~ "Skilled Manual",
    egp == "Semi- and unskilled manual workers (not agriculture)" ~ "Semi/Unskilled Manual",
    egp %in% c("Farmers, small-holders, self-employed in primary production", 
               "Agricultural and primary production workers") ~ "Agricultural Workers",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(class_scheme))

# Summarize the data by year and class scheme
intermediate_manual_summary <- gss_intermediate_manual %>%
  count(year, class_scheme) %>%          # Count the number of occurrences of each class scheme per year
  group_by(year) %>%                     # Group the data by year
  mutate(total = sum(n),                 # Calculate the total number of observations per year
         proportion = n / total)         # Calculate the proportion of each class scheme within the year

# Create the stacked bar chart
ggplot(intermediate_manual_summary, aes(x = factor(year), y = proportion, fill = class_scheme)) + # Create the plot with year on x-axis and proportion on y-axis
  geom_bar(stat = "identity", position = "fill") + # Create a stacked bar chart with bars filled proportionally
  scale_y_continuous(labels = scales::percent_format()) + # Convert y-axis to percentage format
  scale_fill_brewer(palette = "Paired") + # Change the color palette for better differentiation
  labs(title = "Class Distribution (Manual/Non-Manual Divide) Over Survey Years", # Add plot title
       x = "Year", # Label x-axis
       y = "Proportion", # Label y-axis
       fill = "Class Scheme") + # Label the legend
  theme_minimal() + # Apply a minimal theme to the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text for better readability
        legend.position = "bottom", # Position the legend at the bottom
        legend.title = element_text(size = 10), # Set legend title text size
        legend.text = element_text(size = 8)) + # Set legend text size
  guides(fill = guide_legend(nrow = 3, byrow = TRUE, title.position = "top")) # Customize legend guide

Class Scheme 2: Intermediate (stressing similarity in income)

# Intermediate (stressing similarity in income)
gss_intermediate_income <- gss %>%
  mutate(class_scheme = case_when(
    egp %in% c("Higher-grade professionals, managers, large proprietors", 
               "Lower-grade professionals, technicians, non-manual supervisors") ~ "Higher Professionals", # Classify as Higher Professionals
    egp == "Higher-grade routine non-manual employees (admin/commerce)" ~ "Higher Routine Non-Manual", # Classify as Higher Routine Non-Manual
    egp %in% c("Small proprietors, artisans with employees", 
               "Small proprietors, artisans without employees") ~ "Small Proprietors", # Classify as Small Proprietors
    egp %in% c("Lower-grade technicians, manual supervisors", 
               "Skilled manual workers") ~ "Skilled Manual", # Classify as Skilled Manual
    egp %in% c("Lower-grade routine non-manual employees (sales/service)",
               "Semi- and unskilled manual workers (not agriculture)") ~ "Routine/Semi-Unskilled Manual", # Classify as Routine/Semi-Unskilled Manual
    egp %in% c("Farmers, small-holders, self-employed in primary production", 
               "Agricultural and primary production workers") ~ "Agricultural Workers", # Classify as Agricultural Workers
    TRUE ~ NA_character_ # Assign NA for any other value
  )) %>%
  filter(!is.na(class_scheme)) # Filter out rows where class_scheme is NA

# Summarize the data by year and class scheme
intermediate_income_summary <- gss_intermediate_income %>%
  count(year, class_scheme) %>%          # Count the number of occurrences of each class scheme per year
  group_by(year) %>%                     # Group the data by year
  mutate(total = sum(n),                 # Calculate the total number of observations per year
         proportion = n / total)         # Calculate the proportion of each class scheme within the year

# Create the stacked bar chart
ggplot(intermediate_income_summary, aes(x = factor(year), y = proportion, fill = class_scheme)) + # Create the plot with year on x-axis and proportion on y-axis
  geom_bar(stat = "identity", position = "fill") + # Create a stacked bar chart with bars filled proportionally
  scale_y_continuous(labels = scales::percent_format()) + # Convert y-axis to percentage format
  scale_fill_brewer(palette = "Set2") + # Change the color palette for better differentiation
  labs(title = "Class Distribution (Income Similarity) Over Survey Years", # Add plot title
       x = "Year", # Label x-axis
       y = "Proportion", # Label y-axis
       fill = "Class Scheme") + # Label the legend
  theme_minimal() + # Apply a minimal theme to the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text for better readability
        legend.position = "bottom", # Position the legend at the bottom
        legend.title = element_text(size = 10), # Set legend title text size
        legend.text = element_text(size = 8)) + # Set legend text size
  guides(fill = guide_legend(nrow = 3, byrow = TRUE, title.position = "top")) # Customize legend guide

Class Scheme 3: Low (Professional/Managers vs. all other classes)

# Low (Professional/Managers vs. all other classes)
gss_low <- gss %>%
  mutate(class_scheme = case_when(
    egp %in% c("Higher-grade professionals, managers, large proprietors", 
               "Lower-grade professionals, technicians, non-manual supervisors") ~ "Professionals/Managers", # Classify as Professionals/Managers
    egp %in% c("Higher-grade routine non-manual employees (admin/commerce)", 
               "Lower-grade routine non-manual employees (sales/service)",
               "Small proprietors, artisans with employees", 
               "Small proprietors, artisans without employees", 
               "Farmers, small-holders, self-employed in primary production", 
               "Lower-grade technicians, manual supervisors", 
               "Skilled manual workers", 
               "Semi- and unskilled manual workers (not agriculture)", 
               "Agricultural and primary production workers") ~ "All Other Classes", # Classify as All Other Classes
    TRUE ~ NA_character_ # Assign NA for any other value
  )) %>%
  filter(!is.na(class_scheme)) # Filter out rows where class_scheme is NA

# Summarize the data by year and class scheme
low_summary <- gss_low %>%
  count(year, class_scheme) %>%          # Count the number of occurrences of each class scheme per year
  group_by(year) %>%                     # Group the data by year
  mutate(total = sum(n),                 # Calculate the total number of observations per year
         proportion = n / total)         # Calculate the proportion of each class scheme within the year

# Create the stacked bar chart
ggplot(low_summary, aes(x = factor(year), y = proportion, fill = class_scheme)) + # Create the plot with year on x-axis and proportion on y-axis
  geom_bar(stat = "identity", position = "fill") + # Create a stacked bar chart with bars filled proportionally
  scale_y_continuous(labels = scales::percent_format()) + # Convert y-axis to percentage format
  scale_fill_brewer(palette = "Dark2") + # Change the color palette for better differentiation
  labs(title = "Class Distribution (Professional/Managers vs. All Other Classes)", # Add plot title
       x = "Year", # Label x-axis
       y = "Proportion", # Label y-axis
       fill = "Class Scheme") + # Label the legend
  theme_minimal() + # Apply a minimal theme to the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text for better readability
        legend.position = "bottom", # Position the legend at the bottom
        legend.title = element_text(size = 10), # Set legend title text size
        legend.text = element_text(size = 8)) + # Set legend text size
  guides(fill = guide_legend(nrow = 3, byrow = TRUE, title.position = "top")) # Customize legend guide

Now we are going to work with the full GSS (1972-2022).

load("gss2022.RData") # restores the saved data frame, named df, into the workspace
gss <- df

Datasummary skim

A first fundamental skill we will cover, which is a requirement for Stepping Stone 1, is producing summary table(s) of your main variables of interest with a useful function from the modelsummary package. We will also show some slight aesthetic improvements.

Note: you can produce a combination of numeric and categorical summary tables, or just one of the two, depending on your variables and what you deem the right choice(s) for your Table 1 (or 1a, 1b).

table(gss$natcrime)

## 
##                    too little                   about right 
##                         26474                         10282 
##                      too much                    don't know 
##                          2486                             0 
##                           iap            I don't have a job 
##                             0                             0 
##                   dk, na, iap                     no answer 
##                             0                             0 
##    not imputable_(2147483637)    not imputable_(2147483638) 
##                             0                             0 
##                       refused                skipped on web 
##                             0                             0 
##                    uncodeable not available in this release 
##                             0                             0 
##    not available in this year                  see codebook 
##                             0                             0

unique(gss$natcrime)

## [1] <NA>        too much    too little  about right
## 16 Levels: too little about right too much don't know ... see codebook

The above should be a habit. Always look into your variables before doing anything else. I show only one but I checked all of them before recoding below.
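
Two small additions make this check more informative: `useNA = "ifany"` shows the missing values that `table()` hides by default, and `droplevels()` removes the long tail of empty missing-data codes. A sketch on a toy factor (values invented to mimic the structure above):

```r
# Toy factor with declared-but-unused levels, like the GSS missing-data codes
nat <- factor(c("too little", "too much", NA, "too little"),
              levels = c("too little", "about right", "too much",
                         "don't know", "iap"))

# Include NA in the tabulation
print(table(nat, useNA = "ifany"))

# Drop levels that have zero observations
nat_trim <- droplevels(nat)
print(levels(nat_trim))
```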

# Recode and clean variables
gss <- gss %>%
  mutate(
    natcrime = case_when(
      natcrime %in% c("too little", "about right", "too much") ~ natcrime,
      TRUE ~ NA_character_
    ),
    natenvir = case_when(
      natenvir %in% c("too little", "about right", "too much") ~ natenvir,
      TRUE ~ NA_character_
    ),
    natdrug = case_when(
      natdrug %in% c("too little", "about right", "too much") ~ natdrug,
      TRUE ~ NA_character_
    ),
    race = case_when(
      race %in% c("white", "black", "other") ~ race,
      TRUE ~ NA_character_
    ),
    sex = case_when(
      sex %in% c("male", "female") ~ sex,
      TRUE ~ NA_character_
    ),
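    # Note: the summary table below shows no "junior college" or "bachelor" rows,
    # which suggests these two labels do not match the raw factor levels exactly;
    # compare them against levels(df$degree)
    degree = case_when(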
      degree %in% c("less than high school", "high school", "junior college", "bachelor", "graduate") ~ degree,
      TRUE ~ NA_character_
    ),
    wrkstat = case_when(
      wrkstat %in% c("working full time", "working part time", "with a job, but not at work because of temporary illness, vacation, strike", "unemployed, laid off, looking for work", "retired", "in school", "keeping house", "other") ~ wrkstat,
      TRUE ~ NA_character_
    )
  )
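
The seven `case_when()` calls above all follow one pattern: keep the listed values and set everything else to NA. One way to cut the repetition is a small helper. `keep_valid()` below is a hypothetical function written for this illustration (it is not part of dplyr), shown on toy input:

```r
# Hypothetical helper: keep values listed in `valid`, set all others to NA.
# Returns a character vector, as the case_when() calls above do.
keep_valid <- function(x, valid) {
  x <- as.character(x)
  x[!(x %in% valid)] <- NA_character_
  x
}

# Toy input (invented values)
sex_raw <- factor(c("male", "female", "refused", NA))
sex_clean <- keep_valid(sex_raw, c("male", "female"))
print(sex_clean)
```

Inside `mutate()`, this could be used as `sex = keep_valid(sex, c("male", "female"))`, with one such line per variable.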

Filtering & Cleaning

Now we will select only our variables of interest; otherwise the function will try to summarize the entire dataset.

# Select the variables of interest
gss_filtered <- gss %>%
  dplyr::select(natcrime, natenvir, natdrug, race, sex, degree, wrkstat)
categorical_summary <- datasummary_skim(gss_filtered, type = "categorical")
categorical_summary

		N	%
natcrime	about right	10282	14.2
	too little	26474	36.6
	too much	2486	3.4
	NA	33148	45.8
natenvir	about right	11388	15.7
	too little	24244	33.5
	too much	3515	4.9
	NA	33243	45.9
natdrug	about right	11120	15.4
	too little	24442	33.8
	too much	3318	4.6
	NA	33510	46.3
race	black	10215	14.1
	other	4411	6.1
	white	57657	79.6
	NA	107	0.1
sex	female	40301	55.7
	male	31977	44.2
	NA	112	0.2
degree	graduate	5953	8.2
	high school	36446	50.3
	less than high school	14192	19.6
	NA	15799	21.8
wrkstat	in school	2187	3.0
	keeping house	10764	14.9
	other	1643	2.3
	retired	10886	15.0
	unemployed, laid off, looking for work	2621	3.6
	with a job, but not at work because of temporary illness, vacation, strike	1556	2.1
	working full time	35267	48.7
	working part time	7430	10.3
	NA	36	0.0

Next, we will remove rows with NA values and then recode and relabel the variables.

gss_cleaned <- gss %>%
  filter(!is.na(natcrime), !is.na(natenvir), !is.na(natdrug),
         !is.na(race), !is.na(sex), !is.na(degree), !is.na(fefam),
         !is.na(libhomo), !is.na(attend), !is.na(wrkstat)) %>%
  mutate(
    natcrime = recode(natcrime, "too little" = "Too Little", "about right" = "About Right", "too much" = "Too Much"),
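    # Note: recode() matches keys exactly (case-sensitive). The keys on the next two
    # lines do not match the lowercase values in the data, so natenvir and natdrug
    # keep their lowercase labels (see the table below); lowercase keys, as used for
    # natcrime, would relabel them.
    natenvir = recode(natenvir, "too Little" = "Too Little", "about Right" = "About Right", "too Much" = "Too Much"),
    natdrug = recode(natdrug, "too Little" = "Too Little", "About Right" = "About Right", "too Much" = "Too Much"),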
    race = recode(race, "white" = "White", "black" = "Black", "other" = "Other"),
    sex = recode(sex, "male" = "Male", "female" = "Female"),
    degree = recode(degree, "less than high school" = "Less than High School", "high school" = "High School", "junior college" = "Junior College", "bachelor" = "Bachelor", "graduate" = "Graduate"),
    wrkstat = recode(wrkstat, "working full time" = "Working Full Time", "working part time" = "Working Part Time", "with a job, but not at work because of temporary illness, vacation, strike" = "Job but Not at Work", "unemployed, laid off, looking for work" = "Unemployed", "retired" = "Retired", "in school" = "In School", "keeping house" = "Keeping House", "other" = "Other")
  )
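
Because relabeling matches on exact, case-sensitive strings, a mistyped key is skipped without any warning. A cheap guard is to compare the keys against the values actually present before recoding. The sketch below uses a base R lookup vector on toy data; `label_map` and `unmatched` are illustrative names, not part of the tutorial's pipeline:

```r
# Toy values as they appear in the data (all lowercase)
natenvir_toy <- c("too little", "about right", "too much", "too little")

# A lookup with one mistyped key ("too Much" does not match "too much")
label_map <- c("too little"  = "Too Little",
               "about right" = "About Right",
               "too Much"    = "Too Much")

# Keys that never occur in the data would be skipped silently
unmatched <- setdiff(names(label_map), unique(natenvir_toy))
print(unmatched)

# Corrected lookup, applied by name; unname() drops the carried-over names
label_map_ok <- c("too little"  = "Too Little",
                  "about right" = "About Right",
                  "too much"    = "Too Much")
relabelled <- unname(label_map_ok[natenvir_toy])
print(relabelled)
```

An empty `unmatched` vector is a reasonable precondition to check before trusting any relabeled table.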

# Rename the variables to be more informative
gss_cleaned <- gss_cleaned %>%
  rename(
    "Spending on Crime" = natcrime,
    "Spending on Environment" = natenvir,
    "Spending on Drugs" = natdrug,
    "Spending on Blacks" = natrace,
    "Respondent Race" = race,
    "Respondent Sex" = sex,
    "Highest Degree" = degree,
    "Labor Force Status" = wrkstat
  )

# Create summary for relabeled categorical variables
categorical_summary_relabelled <- datasummary_skim(
  gss_cleaned %>%
    dplyr::select(`Spending on Crime`, `Spending on Environment`, `Spending on Drugs`, `Respondent Race`, `Respondent Sex`, `Highest Degree`, `Labor Force Status`), # Select categorical variables
  type = "categorical", # Specify the type of variables to summarize
  output = "kableExtra" # Specify the output format
)

## Warning: Inline histograms in `datasummary_skim()` are only supported for tables
##   produced by the `tinytable` backend.

# Customize the table appearance
categorical_summary_relabelled %>%
  kableExtra::kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% # Apply table styling options
  kableExtra::row_spec(0, bold = TRUE, color = "white", background = "#4CAF50") %>% # Customize the header row
  kableExtra::column_spec(1, bold = TRUE) %>% # Make the first column bold
  kableExtra::add_header_above(c(" " = 1, "Summary Statistics for Categorical Variables" = 3)) # Add a header above the table

	Summary Statistics for Categorical Variables
		N	%
Spending on Crime	About Right	1625	25.7
	Too Little	4317	68.4
	Too Much	371	5.9
Spending on Environment	about right	1925	30.5
	too little	3855	61.1
	too much	533	8.4
Spending on Drugs	about right	1790	28.4
	too little	4006	63.5
	too much	517	8.2
Respondent Race	Black	890	14.1
	Other	373	5.9
	White	5050	80.0
Respondent Sex	Female	3446	54.6
	Male	2867	45.4
Highest Degree	Graduate	710	11.2
	High School	4224	66.9
	Less than High School	1379	21.8
Labor Force Status	In School	203	3.2
	Job but Not at Work	117	1.9
	Keeping House	917	14.5
	Other	137	2.2
	Retired	972	15.4
	Unemployed	229	3.6
	Working Full Time	3081	48.8
	Working Part Time	657	10.4

Let’s try a different table aesthetic.

# Create summary for relabeled categorical variables with flextable
categorical_summary_flextable <- datasummary_skim(
  gss_cleaned %>%
    dplyr::select(`Spending on Crime`, `Spending on Environment`, `Spending on Drugs`, `Spending on Blacks`, `Respondent Race`, `Respondent Sex`, `Highest Degree`, `Labor Force Status`),
  type = "categorical",
  output = "flextable"
)

## Warning: Inline histograms in `datasummary_skim()` are only supported for tables
##   produced by the `tinytable` backend.

# Customize the table appearance with flextable
categorical_summary_flextable <- categorical_summary_flextable %>%
  set_header_labels(N = "Frequency") %>% # labels are matched to the table's col_keys (here N and %); names such as Variable or Freq match no column and are ignored
  theme_box() %>%
  bold(part = "header") %>%
  bg(part = "header", bg = "#4CAF50") %>%
  color(part = "header", color = "white") %>%
  border_remove() %>%
  border_inner_v(border = fp_border(color = "black", width = 1)) %>%
  autofit()

print(categorical_summary_flextable)

## a flextable object.
## col_keys: ` `, `  `, `N`, `%` 
## header has 1 row(s) 
## body has 41 row(s) 
## original dataset sample: 
##                                          N    %
## 1       Spending on Crime About Right 1625 25.7
## 2                          Too Little 4317 68.4
## 3                            Too Much  371  5.9
## 4 Spending on Environment about right 1925 30.5
## 5                          too little 3855 61.1

Now, let’s do the numeric version.

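# Explicitly convert categorical variables to numeric
# Note: values that do not match the listed levels become NA, so a variable whose
# labels differ from these levels (e.g., in case) is converted entirely to NA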
gss_numeric <- gss_cleaned %>%
  mutate(
    `Spending on Crime` = as.numeric(factor(`Spending on Crime`, levels = c("Too Little", "About Right", "Too Much"))),
    `Spending on Environment` = as.numeric(factor(`Spending on Environment`, levels = c("Too Little", "About Right", "Too Much"))),
    `Spending on Drugs` = as.numeric(factor(`Spending on Drugs`, levels = c("Too Little", "About Right", "Too Much"))),
    `Spending on Blacks` = as.numeric(factor(`Spending on Blacks`, levels = c("Too Little", "About Right", "Too Much"))),
    `Respondent Race` = as.numeric(factor(`Respondent Race`, levels = c("White", "Black", "Other"))),
    `Respondent Sex` = as.numeric(factor(`Respondent Sex`, levels = c("Male", "Female"))),
    `Highest Degree` = as.numeric(factor(`Highest Degree`, levels = c("Less than High School", "High School", "Junior College", "Bachelor", "Graduate"))),
    `Labor Force Status` = as.numeric(factor(`Labor Force Status`, levels = c("Working Full Time", "Working Part Time", "Job but Not at Work", "Unemployed", "Retired", "In School", "Keeping House", "Other")))
  )
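
The conversion above assigns each category the position of its label in `levels`, and any value missing from `levels` becomes NA without a warning. A sketch on toy values (invented) with a simple guard that counts what was lost:

```r
# Toy values; the second one does not match the declared levels (lowercase)
x <- c("Too Little", "about right", "Too Much")

codes <- as.numeric(factor(x, levels = c("Too Little", "About Right", "Too Much")))
print(codes)   # the unmatched value becomes NA

# Guard: count values that were present before the conversion but are NA after
n_lost <- sum(is.na(codes) & !is.na(x))
print(n_lost)
```

Keep in mind that these codes only reflect level order, so a mean or standard deviation is meaningful for ordered categories (such as the spending items) but not for nominal ones such as race.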

Selecting only the variables of interest:

# Select only the variables of interest
variables_of_interest <- c("Spending on Crime", "Spending on Environment", "Spending on Drugs", 
                           "Spending on Blacks", "Respondent Race", "Respondent Sex", 
                           "Highest Degree", "Labor Force Status")
gss_numeric_selected <- gss_numeric %>% dplyr::select(all_of(variables_of_interest))

Creating summary for numeric variables with flextable:

numeric_summary_flextable <- datasummary_skim(
  gss_numeric_selected,
  type = "numeric",
  output = "flextable"
)

## Warning: Inline histograms in `datasummary_skim()` are only supported for tables
##   produced by the `tinytable` backend.

Customizing appearance:

# Customize the numeric table appearance with flextable
numeric_summary_flextable <- numeric_summary_flextable %>%
  set_header_labels(Variable = "Variable", Mean = "Mean", SD = "Standard Deviation", 
                    Min = "Minimum", Median = "Median", Max = "Maximum") %>%
  theme_box() %>%
  bold(part = "header") %>%
  bg(part = "header", bg = "black") %>%
  color(part = "header", color = "white") %>%
  border_remove() %>%
  border_inner_v(border = fp_border(color = "black", width = 1)) %>%
  autofit()

print(numeric_summary_flextable)

## a flextable object.
## col_keys: ` `, `Unique`, `Missing Pct.`, `Mean`, `SD`, `Min`, `Median`, `Max` 
## header has 1 row(s) 
## body has 5 row(s) 
## original dataset sample: 
##                      Unique Missing Pct. Mean  SD Min Median Max
## 1  Spending on Crime      3            0  1.4 0.6 1.0    1.0 3.0
## 2    Respondent Race      3            0  1.3 0.6 1.0    1.0 3.0
## 3     Respondent Sex      2            0  1.5 0.5 1.0    2.0 2.0
## 4     Highest Degree      3            0  2.1 1.1 1.0    2.0 5.0
## 5 Labor Force Status      8            0  3.1 2.4 1.0    2.0 8.0

To save your favorite version to a Word document, you can use the officer package as follows:

# Create a Word document
doc <- read_docx()

# Add the categorical summary flextable to the Word document
doc <- doc %>%
  body_add_flextable(value = categorical_summary_flextable, align = "center") %>%
  body_add_par(" ", style = "Normal") 

# Save the Word document
print(doc, target = "summary_table.docx")

Let’s now explore and visualize one specific variable, while highlighting a few important takeaways.

Here is the variable info: https://gssdataexplorer.norc.org/variables/287/vshow

table(gss$relig)

## 
##                    protestant                      catholic 
##                         38707                         16498 
##                        jewish                          none 
##                          1360                          8918 
##                         other                      buddhism 
##                          1135                           245 
##                      hinduism       other eastern religions 
##                           130                            41 
##                  muslim/islam            orthodox-christian 
##                           178                           155 
##                     christian               native american 
##                           915                            34 
##       inter-nondenominational                    don't know 
##                           154                             0 
##                           iap            I don't have a job 
##                             0                             0 
##                   dk, na, iap                     no answer 
##                             0                             0 
##    not imputable_(2147483637)    not imputable_(2147483638) 
##                             0                             0 
##                       refused                skipped on web 
##                             0                             0 
##                    uncodeable not available in this release 
##                             0                             0 
##    not available in this year                  see codebook 
##                             0                             0
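Many categories above show a count of zero because `relig` is stored as a factor, and `table()` reports every declared level, including missing-value codes that no respondent falls into. A minimal sketch on a made-up factor (the object `toy_relig` is hypothetical, not part of the GSS data) shows how base R's `droplevels()` removes such unused levels:

```r
# Toy factor: the levels "iap" and "refused" are declared but never observed
toy_relig <- factor(
  c("protestant", "catholic", "none", "protestant"),
  levels = c("protestant", "catholic", "none", "iap", "refused")
)

# table() lists every declared level, so unused ones appear with a count of 0
counts_all <- table(toy_relig)

# droplevels() keeps only the levels that actually occur in the data
counts_used <- table(droplevels(toy_relig))

print(counts_all)
print(counts_used)
```

Whether to drop unused levels depends on the goal: keeping them documents the full coding scheme, while dropping them gives tidier tables and plot legends.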

Let’s summarize and plot the distribution of this variable:

# Create summary for religious preferences
relig_summary <- gss %>%
  count(relig) %>% # Count the occurrences of each religious preference
  mutate(pct = n / sum(n) * 100) # Calculate the percentage of each preference

# Create a bar plot for the distribution of religious preferences
ggplot(gss, aes(x = relig)) +
  geom_bar(fill = "lightblue", color = "black") + # Create a bar plot with light blue fill and black borders
  labs(title = "Distribution of Religious Preferences", x = "Religious Preference", y = "Count") + # Add title and axis labels
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis text for better readability

# Prepare the data
gss_yearly <- gss %>%
  group_by(year, relig) %>% # Group by year and religious preference
  summarize(count = n(), .groups = 'drop') %>% # Calculate the count for each group
  group_by(year) %>% # Group by year
  mutate(total = sum(count), # Calculate the total count per year
         proportion = count / total) # Calculate the proportion of each religious preference per year

# Create a line plot to visualize the evolution of religious preferences over time
ggplot(gss_yearly, aes(x = year, y = proportion, color = relig, group = relig)) +
  geom_line(size = 1.2) + # Create lines for each religious preference with increased line size
  scale_color_brewer(palette = "Set3") + # Use a color palette for better differentiation
  labs(title = "Evolution of Religious Preferences Over Time", # Add plot title
       x = "Year", # Label x-axis
       y = "Proportion", # Label y-axis
       color = "Religious Preference") + # Label the legend
  theme_minimal() + # Apply a minimal theme to the plot
  theme(legend.position = "bottom") # Position the legend at the bottom

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set3 is 12
## Returning the palette you asked for with that many colors

## Warning: Removed 46 rows containing missing values or values outside the scale range
## (`geom_line()`).
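The first warning above is triggered by `size = 1.2` inside `geom_line()`; as the message says, line thickness is set with `linewidth` from ggplot2 3.4.0 onward. A small sketch on toy data (the data frame `toy_trend` and its values are made up for illustration) shows the replacement argument:

```r
library(ggplot2)

# Made-up proportions for two categories across three survey years
toy_trend <- data.frame(
  year = rep(c(2000, 2010, 2020), times = 2),
  relig = rep(c("protestant", "none"), each = 3),
  proportion = c(0.60, 0.52, 0.45, 0.14, 0.20, 0.28)
)

# linewidth replaces size for line geoms
p_linewidth <- ggplot(toy_trend, aes(x = year, y = proportion, color = relig, group = relig)) +
  geom_line(linewidth = 1.2) +
  theme_minimal()

print(p_linewidth)
```

The same substitution should apply to the other `geom_line()` and `geom_smooth()` calls in this tutorial; `size` remains the right argument for points.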

Line Chart of Selected Religious Preferences

We will create a line chart to visualize how the shares of selected religious preferences (Protestant, Catholic, Jewish, None, and Other) change over time. We’ll also try a different color palette so that we can compare palettes and pick the one we prefer.

# Filter the dataset to include only the specified categories
gss_filtered <- gss %>%
  filter(relig %in% c("protestant", "catholic", "jewish", "none", "other"))

# Summarize the data by year and religious preference
gss_yearly <- gss_filtered %>%
  group_by(year, relig) %>%
  summarize(count = n(), .groups = 'drop') %>%
  group_by(year) %>%
  mutate(total = sum(count),
         proportion = count / total)

# Plot the evolution of religious preferences over time
ggplot(gss_yearly, aes(x = year, y = proportion, color = relig, group = relig)) +
  geom_line(size = 1.2) +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Evolution of Selected Religious Preferences Over Time",
       x = "Year",
       y = "Proportion",
       color = "Religious Preference") +
  theme_minimal() +
  theme(legend.position = "bottom")
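Because the data are filtered before the proportions are computed, each proportion here is a share of the respondents in the selected categories for that year, not of all respondents. A minimal sketch on toy data (the data frame `toy_gss` is hypothetical) walks through the same count, total, and proportion steps and checks that the shares sum to 1 within each year:

```r
library(dplyr)

# Toy data: four respondents in each of two survey years
toy_gss <- data.frame(
  year = c(2000, 2000, 2000, 2000, 2010, 2010, 2010, 2010),
  relig = c("protestant", "protestant", "catholic", "none",
            "protestant", "catholic", "none", "none")
)

toy_yearly <- toy_gss %>%
  group_by(year, relig) %>%
  summarize(count = n(), .groups = "drop") %>% # one row per year-category pair
  group_by(year) %>%
  mutate(total = sum(count),                   # respondents per year
         proportion = count / total) %>%       # share within that year
  ungroup()

print(toy_yearly)

# Sanity check: proportions within each year should add up to 1
toy_yearly %>%
  group_by(year) %>%
  summarize(sum_prop = sum(proportion))
```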

What we see thus far is the rise of the so-called religious ‘nones’ over time. To examine this further, let’s put birth cohort on the x-axis, which requires constructing a cohort variable.

# Create the cohort variable
gss_filtered$cohort <- gss_filtered$year - gss_filtered$age

# Summarize the data by cohort and religious preference
gss_cohort <- gss_filtered %>%
  group_by(cohort, relig) %>%
  summarize(count = n(), .groups = 'drop') %>%
  group_by(cohort) %>%
  mutate(total = sum(count),
         proportion = count / total)

# Plot the smoothed cohort trends
ggplot(gss_cohort, aes(x = cohort, y = proportion, color = relig, group = relig)) +
  geom_smooth(se = FALSE, size = 1.2, method = "loess") + # Smoothed lines without standard error
  scale_color_brewer(palette = "Set1") + # Use Set1 color palette
  labs(title = "Smoothed Cohort Trends of Selected Religious Preferences",
       x = "Cohort",
       y = "Proportion",
       color = "Religious Preference") +
  theme_minimal() + # Apply minimal theme
  theme(legend.position = "bottom") # Position legend at the bottom

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_smooth()`).
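The "Removed 5 rows" warning most likely comes from respondents with a missing `age`: `year - age` is then `NA`, and the `NA` cohort forms its own group with one row per selected religious category. This is an inference from the code rather than something the output confirms. A short sketch on toy data (the data frame `toy` is hypothetical) illustrates the mechanism:

```r
# Toy data: the second respondent has no recorded age
toy <- data.frame(
  year = c(1990, 1990, 2010, 2010),
  age  = c(30, NA, 45, 20)
)

# Birth cohort is approximated as survey year minus age at interview
toy$cohort <- toy$year - toy$age

# Arithmetic with NA returns NA, so a missing age yields a missing cohort
print(toy)

# Dropping rows with a missing cohort before summarizing avoids an NA group
toy_complete <- toy[!is.na(toy$cohort), ]
print(toy_complete)
```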

# Plot the smoothed cohort trends with emphasis on "None" using geom_area and geom_smooth
ggplot(gss_cohort, aes(x = cohort, y = proportion, fill = relig, group = relig)) +
  geom_area(position = 'identity', alpha = 0.3) + # Semi-transparent areas
  geom_smooth(se = FALSE, size = 1.2, method = "loess", aes(color = relig)) + # Smoothed lines
  scale_fill_brewer(palette = "Set1") + # Use Set1 color palette for fill
  scale_color_brewer(palette = "Set1") + # Use Set1 color palette for lines
  labs(title = "The 'Rise' of the Religious Nones",
       subtitle = "General Social Survey, 1972-2022",
       x = "Cohort",
       y = "Proportion",
       color = "Religious Preference",
       fill = "Religious Preference") +
  theme_minimal() + # Apply minimal theme
  theme(legend.position = "bottom") + # Position legend at the bottom
  geom_point(data = gss_cohort %>% filter(relig == "none"),
             aes(x = cohort, y = proportion), color = "red", size = 2, alpha = 0.6) # Highlight points for "none" (the factor levels are lowercase)

## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_align()`).

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_smooth()`).

Again, the rise of the nones, and the corresponding decline among Protestants, is quite striking.

For a final visual option, let’s look at a different geom() and see what we think.

# Plot the cohort trends with emphasis on "None" using geom_ribbon and geom_line
ggplot(gss_cohort, aes(x = cohort, y = proportion, fill = relig)) +
  geom_ribbon(aes(ymin = 0, ymax = proportion), alpha = 0.4) +
  geom_line(aes(color = relig), size = 1.2) +
  scale_fill_brewer(palette = "Paired") +
  scale_color_brewer(palette = "Paired") +
  labs(title = "The 'Rise' of the Religious Nones",
       subtitle = "General Social Survey, 1972-2022",
       x = "Cohort",
       y = "Proportion",
       color = "Religious Preference",
       fill = "Religious Preference") +
  theme_minimal() +
  theme(legend.position = "bottom")

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

Now, let’s turn to our final outcome example for today – support for legal abortion for any reason.

Let’s look into the variable.

unique(gss$abany)

## [1] <NA> yes  no  
## 15 Levels: yes no don't know iap I don't have a job dk, na, iap ... see codebook

We will clean the variable so that we can use it.

# Recode abany explicitly
gss <- gss %>%
  mutate(abany = case_when(
    abany == "yes" ~ 1,
    abany == "no" ~ 2,
    TRUE ~ NA_real_
  ))

# Filter out non-relevant responses in abany
gss_filtered <- gss %>%
  filter(abany %in% c(1, 2)) %>%
  mutate(abany = factor(abany, levels = c(1, 2), labels = c("Yes", "No")))

# Check the cleaned data
table(gss_filtered$abany)

## 
##   Yes    No 
## 16626 22628
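To see what the `case_when()` recode does with responses other than "yes" and "no", here is a minimal sketch on a toy factor (the data frame `toy` and the column `abany_num` are hypothetical names). Any value that matches neither condition, including an existing `NA`, falls through to the final `TRUE` branch and becomes missing:

```r
library(dplyr)

# Toy responses, including a non-substantive answer and a missing value
toy <- data.frame(
  abany = factor(c("yes", "no", "don't know", NA, "yes"),
                 levels = c("yes", "no", "don't know"))
)

toy_recoded <- toy %>%
  mutate(abany_num = case_when(
    abany == "yes" ~ 1,
    abany == "no" ~ 2,
    TRUE ~ NA_real_ # everything else, including NA, becomes missing
  ))

print(toy_recoded)

# Count the recoded values, showing missing values explicitly
table(toy_recoded$abany_num, useNA = "ifany")
```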

To consider how it might relate to a potential explanatory variable, let’s do a cross-tab of abany with party identification.

# Create cross-tabulation of partyid and abany
cross_tab_full <- gss_filtered %>%
  count(partyid, abany) %>%
  spread(key = abany, value = n, fill = 0)

# Create and style the table with a footnote
cross_tab_full %>%
  kable(col.names = c("Party ID", "Yes", "No"), align = 'c') %>% # Set column names and align columns centrally
  kable_styling(bootstrap_options = "striped", full_width = F) %>% # Apply table styling
  add_footnote(label = "Data: General Social Survey (1972-2022)") # Add a footnote with the data source

Party ID	Yes	No
strong democrat	2967	3429
not very strong democrat	3537	4345
independent, close to democrat	2522	2246
independent (neither, no response)	2400	3475
independent, close to republican	1365	2175
not very strong republican	2270	3745
strong republican	1138	2782
other party	342	291
NA	85	140
^a Data: General Social Survey (1972-2022)
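`spread()` still works, but the tidyr documentation marks it as superseded by `pivot_wider()`. The sketch below reshapes two rows of the table above with `pivot_wider()`; the data frame `long_counts` is typed in by hand from those counts for illustration rather than computed from the GSS file:

```r
library(tidyr)

# Long format: one row per party-response combination (counts from the table above)
long_counts <- data.frame(
  partyid = c("strong democrat", "strong democrat",
              "strong republican", "strong republican"),
  abany = c("Yes", "No", "Yes", "No"),
  n = c(2967, 3429, 1138, 2782)
)

# Wide format: one row per party, one column per response
wide_counts <- pivot_wider(
  long_counts,
  names_from = abany, # values of abany become column names
  values_from = n,    # cell contents come from n
  values_fill = 0     # combinations absent from the data get 0
)

print(wide_counts)
```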

Let’s make a basic bar chart of all Party ID categories to see how each splits between ‘yes’ and ‘no’.

gss_filtered %>%
  count(partyid, abany) %>%
  ggplot(aes(x = partyid, y = n, fill = abany)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1", name = "Support for Legal Abortion") +
  labs(title = "Abortion Attitudes by Party ID",
       subtitle = "General Social Survey, 1972-2022",
       x = "Party ID",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "bottom")

To probe the relationship further, let’s recode the detailed party categories into Democrat, Republican, and Independent. (Note: if you save as PDF or look in the Viewer, the category labels display properly, but not on RPubs.)

# Recode partyid into broader categories based on detailed levels
gss_filtered <- gss_filtered %>%
  mutate(partyid_recoded = case_when(
    partyid %in% c("strong democrat", "not very strong democrat", "independent, close to democrat") ~ "Democrat",
    partyid %in% c("independent (neither, no response)", "other party") ~ "Independent",
    partyid %in% c("independent, close to republican", "not very strong republican", "strong republican") ~ "Republican",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(partyid_recoded))

# Check the recoding
gss_filtered %>%
  count(partyid_recoded)

##   partyid_recoded     n
## 1        Democrat 19046
## 2     Independent  6508
## 3      Republican 13475

# Filter out rows with NA values in partyid_recoded and abany
gss_filtered <- gss_filtered %>%
  filter(!is.na(partyid_recoded) & !is.na(abany))

# Plot the proportion plot
gss_filtered %>% 
  count(partyid_recoded, abany) %>%  # Count occurrences of each combination of partyid_recoded and abany
  group_by(partyid_recoded) %>%  # Group data by political identity
  mutate(proportion = n / sum(n)) %>%  # Calculate the proportion of each abortion attitude within each political identity
  ggplot(aes(x = partyid_recoded, y = proportion, fill = abany)) +  # Initialize ggplot with political identity on x-axis, proportion on y-axis, and fill based on abortion attitude
  geom_bar(stat = "identity", position = "fill") +  # Create a stacked bar plot with bars filled proportionally
  scale_fill_brewer(palette = "Set2", name = "Support for Legal Abortion") +  # Use a color palette for fill and set legend title
  labs(title = "Support for Legal Abortion by Political Identification",  # Add plot title
       subtitle = "General Social Survey, 1972-2022",  # Add plot subtitle
       x = "",  # Leave the x-axis label blank
       y = "Proportion") +  # Label y-axis
  theme_minimal() +  # Apply a minimal theme to the plot
  theme(legend.position = "bottom")  # Position the legend at the bottom
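The within-party proportions behind this plot can be cross-checked in base R with `prop.table()`. In the sketch below, the counts are obtained by summing the detailed categories of the earlier cross-tab according to the recode (for example, 2967 + 3537 + 2522 = 9026 Democrats answering yes), leaving out the rows with missing party ID; the matrix `counts` is typed in by hand for illustration:

```r
# Rows: recoded party ID; columns: response to abany
counts <- matrix(
  c(9026, 10020,
    2742,  3766,
    4773,  8702),
  nrow = 3, byrow = TRUE,
  dimnames = list(partyid_recoded = c("Democrat", "Independent", "Republican"),
                  abany = c("Yes", "No"))
)

# margin = 1 divides each cell by its row total, giving within-party shares
props <- prop.table(counts, margin = 1)

print(round(props, 3))
```

Pooled across all years, roughly 47% of Democrats, 42% of Independents, and 35% of Republicans answer yes.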

This graph is not very informative, especially since it pools every survey year in the dataset. Let’s disaggregate and look at trends over time.

table(gss_filtered$abany)

## 
##   Yes    No 
## 16541 22488

First, line plots for survey year:

# Filter out rows with NA values in year, partyid_recoded, and abany
gss_filtered_clean <- gss_filtered %>%
  filter(!is.na(year) & !is.na(partyid_recoded) & !is.na(abany))

# Create a proportion dataset by year
gss_yearly <- gss_filtered_clean %>%
  count(year, partyid_recoded, abany) %>%  # Count occurrences for each combination of year, partyid_recoded, and abany
  group_by(year, partyid_recoded) %>%  # Group by year and political identity
  mutate(total = sum(n),  # Calculate the total count per year and political identity
         proportion = n / total) %>%  # Calculate the proportion of each response within each year and political identity
  filter(abany == "Yes")  # Filter to keep only 'Yes' responses

# Check the resulting dataset
print(head(gss_yearly))

## # A tibble: 6 × 6
## # Groups:   year, partyid_recoded [6]
##    year partyid_recoded abany     n total proportion
##   <int> <chr>           <fct> <int> <int>      <dbl>
## 1  1977 Democrat        Yes     309   852      0.363
## 2  1977 Independent     Yes      60   170      0.353
## 3  1977 Republican      Yes     181   447      0.405
## 4  1978 Democrat        Yes     255   781      0.327
## 5  1978 Independent     Yes      80   217      0.369
## 6  1978 Republican      Yes     159   482      0.330

# Line plot by year
ggplot(gss_yearly, aes(x = year, y = proportion, color = partyid_recoded)) +
  geom_line(size = 1.2) +  # Create line plot with increased line size
  scale_color_brewer(palette = "Dark2", name = "Political Identification") +  # Use Dark2 color palette for lines and set legend title
  labs(title = "Support for Legal Abortion by Year and Political Identification",
       subtitle = "General Social Survey, 1972-2022",
       x = "Year",
       y = "Pr(Yes)") +  # Add title, subtitle, and axis labels
  theme_minimal() +  # Apply minimal theme
  theme(legend.position = "bottom")  # Position legend at the bottom

Now, let’s look at trends across birth cohorts:

# Calculate the cohort variable
gss_filtered <- gss_filtered %>%
  mutate(cohort = year - age)
# Keep Democrats and Republicans in cohorts born between 1900 and 2000
gss_cohort_filtered <- gss_filtered %>%
  filter(partyid_recoded %in% c("Democrat", "Republican") & 
         cohort >= 1900 & cohort <= 2000)
# Create a proportion dataset by cohort
gss_cohort <- gss_cohort_filtered %>%
  count(cohort, partyid_recoded, abany) %>%
  group_by(cohort, partyid_recoded) %>%
  mutate(proportion = n / sum(n)) %>%
  filter(abany == "Yes")

Using geom_area specifically, with separate graphs for Democrats and Republicans:

# Separate data for Democrats and Republicans
gss_cohort_democrat <- gss_cohort %>% filter(partyid_recoded == "Democrat")
gss_cohort_republican <- gss_cohort %>% filter(partyid_recoded == "Republican")

# Area plot for Democrats
plot_democrat <- ggplot(gss_cohort_democrat, aes(x = cohort, y = proportion)) +
  geom_area(fill = "blue", alpha = 0.6) +
  labs(title = "Support for Legal Abortion by Cohort (Democrats)",
       x = "Cohort",
       y = "Pr(Yes)") +
  theme_minimal()

# Area plot for Republicans
plot_republican <- ggplot(gss_cohort_republican, aes(x = cohort, y = proportion)) +
  geom_area(fill = "red", alpha = 0.6) +
  labs(title = "Support for Legal Abortion by Cohort (Republicans)",
       x = "Cohort",
       y = "Pr(Yes)") +
  theme_minimal()

# Use patchwork to combine the plots
library(patchwork)
combined_plot <- plot_democrat + plot_republican + plot_layout(ncol = 1)

# Display the combined plot
print(combined_plot)

What is misleading above is that the two are on different scales. Let’s overlay to compare more directly:

# Overlay area plot for Democrats and Republicans
overlay_plot <- ggplot() +
  geom_area(data = gss_cohort_democrat, aes(x = cohort, y = proportion, fill = "Democrats"), alpha = 0.4) +
  geom_area(data = gss_cohort_republican, aes(x = cohort, y = proportion, fill = "Republicans"), alpha = 0.4) +
  labs(title = "Support for Legal Abortion by Cohort (Democrats vs. Republicans)",
       x = "Cohort",
       y = "Pr(Yes)",
       fill = "Political Identity") +
  scale_fill_manual(values = c("Democrats" = "blue", "Republicans" = "red")) + # Map each legend label to its color
  theme_minimal()

# Display the overlay plot
print(overlay_plot)

There seems to be an increasing cohort gap, notably led by a Democrat uptick and no corresponding Republican increase. Let’s explicitly plot the difference between Dem and Rep for those that support legal abortion for any reason – comparing cohorts over time.

# Spread the data for difference calculation
gss_cohort_wide <- gss_cohort %>%
  dplyr::select(cohort, partyid_recoded, proportion) %>%
  spread(key = partyid_recoded, value = proportion, fill = 0) %>%
  mutate(difference = Democrat - Republican)

# Check the wide dataset
head(gss_cohort_wide)

## # A tibble: 6 × 4
## # Groups:   cohort [6]
##   cohort Democrat Republican difference
##    <int>    <dbl>      <dbl>      <dbl>
## 1   1900    0.286      0.412    -0.126 
## 2   1901    0.25       0.379    -0.129 
## 3   1902    0.2        0.371    -0.171 
## 4   1903    0.244      0.132     0.112 
## 5   1904    0.302      0.425    -0.123 
## 6   1905    0.263      0.294    -0.0310

# Plot the differences with a baseline of 0
ggplot(gss_cohort_wide, aes(x = cohort, y = difference)) +
  geom_line(color = "blue", size = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  scale_y_continuous(limits = c(-1, 1), labels = scales::percent) +
  labs(title = "Difference in Support for Legal Abortion by Cohort",
       subtitle = "Comparing Democrats vs. Republicans (1900-2000)",
       x = "Cohort",
       y = "Difference in Proportion (Democrat - Republican)") +
  theme_minimal() +
  theme(legend.position = "none")

Clearly, the gap is substantial and widening across cohorts. The graph even shows a flip: among the earliest cohorts, Republicans were more likely than Democrats to support legal abortion.

To end, let’s do the same for the period (or ‘year’) trend.

# Spread the yearly data so that each party gets its own column
gss_yearly_wide <- gss_yearly %>%
  dplyr::select(year, partyid_recoded, proportion) %>%
  spread(key = partyid_recoded, value = proportion, fill = 0) %>%
  mutate(difference = Democrat - Republican)

# Check the wide dataset
head(gss_yearly_wide)

## # A tibble: 6 × 5
## # Groups:   year [6]
##    year Democrat Independent Republican difference
##   <int>    <dbl>       <dbl>      <dbl>      <dbl>
## 1  1977    0.363       0.353      0.405  -0.0422  
## 2  1978    0.327       0.369      0.330  -0.00337 
## 3  1980    0.406       0.464      0.393   0.0123  
## 4  1982    0.375       0.457      0.375  -0.000475
## 5  1983    0.336       0.305      0.367  -0.0314  
## 6  1984    0.361       0.426      0.409  -0.0474
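As a quick check on the `difference` column, the 1977 row can be reproduced by hand from the counts printed in the earlier `gss_yearly` output (309 of 852 Democrats and 181 of 447 Republicans answered yes). The variable names below are chosen for illustration only:

```r
# Counts for 1977 taken from the gss_yearly output shown earlier
yes_dem <- 309
total_dem <- 852
yes_rep <- 181
total_rep <- 447

# Proportion answering yes within each party
prop_dem <- yes_dem / total_dem
prop_rep <- yes_rep / total_rep

# Positive values mean more support among Democrats, negative among Republicans
difference_1977 <- prop_dem - prop_rep

print(signif(c(Democrat = prop_dem, Republican = prop_rep,
               difference = difference_1977), 3))
```

The result agrees with the first row of the table: about 0.363 for Democrats, 0.405 for Republicans, and a difference of about -0.0422.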

# Plot the differences with a baseline of 0
ggplot(gss_yearly_wide, aes(x = year, y = difference)) +
  geom_line(color = "blue", size = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  scale_y_continuous(limits = c(-1, 1), labels = scales::percent) +
  labs(title = "Difference in Support for Legal Abortion by Year",
       subtitle = "Comparing Democrats vs. Republicans",
       x = "Year",
       y = "Difference in Proportion (Democrat - Republican)") +
  theme_minimal() +
  theme(legend.position = "none")

Problem Set 2

Due: July 15, 2024 by lecture time

You must submit both the .Rmd (R Markdown) file and the published RPubs version (with the link provided). Copy and paste each task as text and provide the relevant code as the answer to that task.

Task 1: Data Cleaning and Recoding

Objective: Clean and recode the variables to ensure they are ready for analysis.

Recode polviews into three categories: “Liberal”, “Moderate”, and “Conservative”. Clean sex, degree, and race but retain the relevant categories.

Task 2: Data Summary

Objective: Generate a summary table for selected variables using the datasummary_skim function from the modelsummary package.

Select the variables of interest: polviews, sex, degree, and race.

Generate a categorical summary table for these variables, clean the labels, and display it using the flextable package for styling.

Task 3: Visualization of Political Views by Gender

Objective: Create a bar chart showing the distribution of political views by gender.

Create a bar chart showing the distribution of political views by gender. Use a color palette that clearly differentiates the categories.

Task 4: Trends Over Time

Objective: Visualize trends in religious attendance over time.

Select the year and attend variables from the GSS dataset.

Create a line plot showing the proportion of each category of religious attendance over time.

Task 5: Comparison Trends

Objective: Create a stacked bar chart showing the distribution of fejobaff (preferential hiring) across different age groups.

Create an age group variable by categorizing age into “18-29”, “30-44”, “45-59”, “60+”. Create a stacked bar chart showing the distribution of the fejobaff response categories for each age group.