HW 08 - Exploring the GSS

Brady May 5/20/26

Load packages and data

library(tidyverse)
library(dsbox)

Exercises

What are the possible responses to this question and how many respondents chose each of these answers?

gss16 %>%
  count(harass5)

## # A tibble: 4 × 2
##   harass5                                                     n
##   <chr>                                                   <int>
## 1 Does not apply (i do not have a job/superior/co-worker)    96
## 2 No                                                       1136
## 3 Yes                                                       237
## 4 <NA>                                                     1398

The possible responses are "Does not apply (i do not have a job/superior/co-worker)", "No", and "Yes". Of the respondents, 96 chose "Does not apply", 1136 chose "No", and 237 chose "Yes". The remaining 1398 responses are recorded as NA.

What percent of the respondents for whom this question is applicable (i.e. excluding NAs and Does not applys) have been harassed by their superiors or co-workers at their job?

gss16 %>%
  filter(!is.na(harass5), harass5 != "Does not apply (i do not have a job/superior/co-worker)") %>%
  count(harass5) %>%
  mutate(percent = n / sum(n) * 100)

## # A tibble: 2 × 3
##   harass5     n percent
##   <chr>   <int>   <dbl>
## 1 No       1136    82.7
## 2 Yes       237    17.3

Approximately 17.26% of the respondents for whom the question is applicable have been harassed by their superiors or co-workers at their job.
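
A small sketch of the same calculation on made-up answers may help show why missing values are dropped with is.na() rather than by comparing to the string "NA". The answers vector and its values are hypothetical, not GSS data.

```r
# Hypothetical answers, not taken from gss16
answers <- c("Yes", "No", "No", NA, "Does not apply", "No", NA)

# Comparing anything with NA gives NA, not TRUE or FALSE
na_comparison <- NA != "NA"
is.na(na_comparison)

# Keep only the applicable answers: not missing and not "Does not apply"
applicable <- answers[!is.na(answers) & answers != "Does not apply"]

# Percent of each remaining answer
percent <- prop.table(table(applicable)) * 100
percent
```

With one "Yes" out of four applicable answers, the "Yes" share in this toy example is 25 percent.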

Create a new variable called email that combines these two variables to report the number of minutes the respondents spend on email weekly.

gss16 <- gss16 %>%
  mutate(email = emailhr * 60 + emailmin)
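
As a quick check of the conversion, here is a minimal sketch on made-up values (the hours and minutes vectors are hypothetical). It also shows that a missing value in either part makes the total missing, which would explain why some rows have no email value.

```r
# Hypothetical hours and minutes for three respondents
hours   <- c(2, 0, NA)
minutes <- c(30, 45, 10)

# Convert hours to minutes and add the leftover minutes
total_minutes <- hours * 60 + minutes
total_minutes
```

The first respondent, with 2 hours and 30 minutes, ends up with 150 minutes, and the third is NA because the hours are missing.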

Visualize the distribution of this new variable. Find the mean and the median number of minutes respondents spend on email weekly. Is the mean or the median a better measure of the typical amount of time Americans spend on email weekly? Why?

ggplot(gss16, aes(x = email)) +
  geom_histogram(binwidth = 60) +
  labs(
    title = "Distribution of Weekly Email Time",
    x = "Minutes per week",
    y = "# of respondents"
  )

## Warning: Removed 1218 rows containing non-finite outside the scale range
## (`stat_bin()`).

gss16 %>%
  summarise(
    mean_email = mean(email, na.rm = TRUE),
    median_email = median(email, na.rm = TRUE)
  )

## # A tibble: 1 × 2
##   mean_email median_email
##        <dbl>        <dbl>
## 1       417.          120

The distribution is right-skewed: most respondents spend little time on email, while a few report very large values. The mean is about 416.84 minutes, while the median is 120 minutes. The median is the better measure of the typical time spent on email because the distribution is skewed, and the mean is pulled upward by the large values in the right tail.
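
To illustrate how a single large value moves the mean but not the median, here is a small sketch with made-up weekly minutes (these numbers are hypothetical, not from gss16).

```r
# Hypothetical weekly email minutes, with one very heavy user
weekly_minutes <- c(0, 30, 60, 120, 120, 180, 240, 3000)

mean_all   <- mean(weekly_minutes)
median_all <- median(weekly_minutes)

# Same summaries after dropping the heavy user
trimmed        <- weekly_minutes[weekly_minutes < 3000]
mean_trimmed   <- mean(trimmed)
median_trimmed <- median(trimmed)

c(mean_all = mean_all, median_all = median_all,
  mean_trimmed = mean_trimmed, median_trimmed = median_trimmed)
```

In this toy example the mean drops sharply once the heavy user is removed, while the median stays at 120.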

Create another new variable, snap_insta that is coded as “Yes” if the respondent reported using any of Snapchat (snapchat) or Instagram (instagrm), and “No” if not. If the recorded value was NA for both of these questions, the value in your new variable should also be NA.

gss16 <- gss16 %>%
  mutate(
    snap_insta = case_when(
      is.na(snapchat) & is.na(instagrm) ~ NA_character_,
      snapchat == "Yes" | instagrm == "Yes" ~ "Yes",
      TRUE ~ "No"
    )
  )
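
The order of the conditions matters here, so a sketch on a made-up tibble may help. The toy data below is hypothetical and only meant to show how each combination of answers is coded, assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical answers covering each combination of Yes, No and NA
toy <- tibble(
  snapchat = c("Yes", "No", NA, NA, "No"),
  instagrm = c("No", "No", "Yes", NA, NA)
)

toy <- toy %>%
  mutate(
    snap_insta = case_when(
      # Both missing: keep the result missing
      is.na(snapchat) & is.na(instagrm) ~ NA_character_,
      # At least one "Yes" (NA | TRUE evaluates to TRUE)
      snapchat == "Yes" | instagrm == "Yes" ~ "Yes",
      # Everything else, including one "No" with one NA
      TRUE ~ "No"
    )
  )

toy
```

Note that a respondent with one "No" and one NA is coded "No", since only rows where both answers are missing are set to NA.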

Calculate the percentage of Yes’s for snap_insta among those who answered the question, i.e. excluding NAs.

gss16 %>%
  filter(!is.na(snap_insta)) %>%
  count(snap_insta) %>%
  mutate(percent = n / sum(n) * 100)

## # A tibble: 2 × 3
##   snap_insta     n percent
##   <chr>      <int>   <dbl>
## 1 No           858    62.5
## 2 Yes          514    37.5

What are the possible responses to the question Last week were you working full time, part time, going to school, keeping house, or what? and how many respondents chose each of these answers? Note that this information is stored in the wrkstat variable.

gss16 %>%
  count(wrkstat)

## # A tibble: 9 × 2
##   wrkstat              n
##   <chr>            <int>
## 1 Keeping house      284
## 2 Other               89
## 3 Retired            574
## 4 School              76
## 5 Temp not working    57
## 6 Unempl, laid off   118
## 7 Working fulltime  1321
## 8 Working parttime   345
## 9 <NA>                 3

Fit a model predicting email (number of minutes per week spent on email) from educ (number of years of education), wrkstat, and snap_insta. Interpret the slopes for each of these variables.

email_model <- lm(email ~ educ + wrkstat + snap_insta,
                  data = gss16)

summary(email_model)

## 
## Call:
## lm(formula = email ~ educ + wrkstat + snap_insta, data = gss16)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -760.5 -372.7 -161.2   95.4 3355.6 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -229.736    149.837  -1.533  0.12569    
## educ                      29.632      9.601   3.087  0.00211 ** 
## wrkstatOther              33.057    209.470   0.158  0.87465    
## wrkstatRetired            68.279    111.051   0.615  0.53887    
## wrkstatSchool           -123.812    143.981  -0.860  0.39014    
## wrkstatTemp not working  -73.709    153.948  -0.479  0.63225    
## wrkstatUnempl, laid off  118.349    151.242   0.783  0.43419    
## wrkstatWorking fulltime  366.840     87.690   4.183 3.26e-05 ***
## wrkstatWorking parttime   18.900    101.632   0.186  0.85253    
## snap_instaYes            149.961     52.745   2.843  0.00460 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 642.2 on 669 degrees of freedom
##   (2188 observations deleted due to missingness)
## Multiple R-squared:  0.1043, Adjusted R-squared:  0.09227 
## F-statistic: 8.657 on 9 and 669 DF,  p-value: 2.395e-12
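
The coefficient table above has no row for Keeping house, which suggests R is using it as the reference level for wrkstat. Read that way, each wrkstat estimate is a difference from Keeping house with educ and snap_insta held fixed (for example, about 367 more minutes per week for Working fulltime), the educ estimate is about 30 more minutes per week for each additional year of education, and snap_instaYes is about 150 more minutes per week than No. The sketch below uses made-up data (the toy data frame and its values are hypothetical) to show how a categorical predictor's coefficients relate to group means.

```r
# Hypothetical data: two respondents per work status
toy <- data.frame(
  minutes = c(100, 120, 300, 320, 200, 220),
  status  = c("Keeping house", "Keeping house",
              "Working fulltime", "Working fulltime",
              "Retired", "Retired")
)

# The first level alphabetically becomes the reference level
fit <- lm(minutes ~ status, data = toy)
coefs <- coef(fit)
coefs

# Group means, for comparison with the coefficients
group_means <- tapply(toy$minutes, toy$status, mean)
group_means
```

In this toy fit the intercept equals the Keeping house mean, and each other coefficient equals that group's mean minus the Keeping house mean.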

Create a predicted values vs. residuals plot for this model. Are there any issues with the model? If yes, describe them.

model_data <- na.omit(gss16[, c("email", "educ", "wrkstat", "snap_insta")])

email_model <- lm(email ~ educ + wrkstat + snap_insta, data = model_data)

model_data$predicted <- predict(email_model)
model_data$residuals <- resid(email_model)

ggplot(model_data, aes(predicted, residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed")
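
The residual summary in the model output (a median of -161.2 against a maximum of 3355.6) hints that the residuals may be right-skewed, which is one thing to look for in this plot. The sketch below uses simulated data, so every name and number in it is an assumption, to show how skewed noise shows up in the residuals of a linear model.

```r
set.seed(1)

# Simulated data: x stands in for years of education,
# and the noise is right-skewed (exponential)
n <- 500
x <- runif(n, min = 8, max = 20)
y <- 100 + 20 * x + rexp(n, rate = 1 / 300)

fit <- lm(y ~ x)
res <- resid(fit)

# Residuals average to zero, but the median sits below zero
# and the largest positive residual is far from the largest negative one
c(mean = mean(res), median = median(res), min = min(res), max = max(res))
```

A plot of these residuals against the fitted values would show most points just below the zero line and a scattering of points far above it.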

In a new variable, recode advfront such that Strongly Agree and Agree are mapped to “Yes”, and Disagree and Strongly disagree are mapped to “No”. The remaining levels can be left as is. Don’t overwrite the existing advfront, instead pick a different, informative name for your new variable.

gss16 <- gss16 %>%
  mutate(
    science_support = case_when(
      advfront %in% c("Strongly agree", "Agree") ~ "Yes",
      # Level spelling follows the question text ("Strongly disagree")
      advfront %in% c("Disagree", "Strongly disagree") ~ "No",
      # Leave the remaining levels as is
      TRUE ~ advfront
    )
  )
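
A minimal sketch of how case_when() treats values that match none of the conditions, assuming dplyr is installed. The toy vector is made up, and the "Dont know" level in particular is a hypothetical stand-in for whatever other levels advfront has.

```r
library(dplyr)

# Hypothetical responses; "Dont know" is an assumed level name
responses <- c("Strongly agree", "Agree", "Disagree",
               "Strongly disagree", "Dont know", NA)

# Without a fallback, unmatched values become NA
without_fallback <- case_when(
  responses %in% c("Strongly agree", "Agree") ~ "Yes",
  responses %in% c("Disagree", "Strongly disagree") ~ "No"
)

# With a fallback, unmatched values keep their original label
with_fallback <- case_when(
  responses %in% c("Strongly agree", "Agree") ~ "Yes",
  responses %in% c("Disagree", "Strongly disagree") ~ "No",
  TRUE ~ responses
)

data.frame(responses, without_fallback, with_fallback)
```

Matching is exact and case-sensitive, so a label written as "Strongly Disagree" would not match "Strongly disagree".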

In a new variable, recode polviews such that Extremely liberal, Liberal, and Slightly liberal are mapped to “Liberal”, and Slghtly conservative, Conservative, and Extrmly conservative are mapped to “Conservative”. The remaining levels can be left as is. Make sure that the levels are in a reasonable order. Don’t overwrite the existing polviews, instead pick a different, informative name for your new variable.

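gss16 <- gss16 %>%
  mutate(
    political_group = case_when(
      polviews %in% c("Extremely liberal", "Liberal", "Slightly liberal") ~ "Liberal",
      # Includes the abbreviated spellings quoted in the question alongside the full ones
      polviews %in% c("Slghtly conservative", "Slightly conservative", "Conservative",
                      "Extrmly conservative", "Extremely conservative") ~ "Conservative",
      # Leave the remaining levels (e.g. "Moderate") as is
      TRUE ~ polviews
    ),
    # Put the levels in a reasonable order
    political_group = factor(
      political_group,
      levels = c("Liberal", "Moderate", "Conservative")
    )
  )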
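
A short sketch of how the levels argument of factor() controls the order and what happens to values that are not listed. The labels vector is made up for illustration.

```r
# Hypothetical labels, including one that is not in the level list
labels <- c("Conservative", "Liberal", "Moderate", "Other")

ordered_group <- factor(
  labels,
  levels = c("Liberal", "Moderate", "Conservative")
)

# Levels follow the order given, not alphabetical order
levels(ordered_group)

# Values not listed in levels are turned into NA
is.na(ordered_group)
```

This is why a level that should appear in the plot needs to survive the recoding step before the factor is built.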

Create a visualization that displays the relationship between these two new variables and interpret it.

ggplot(gss16,
       aes(x = political_group,
           fill = science_support)) +
  geom_bar(position = "fill") +
  labs(
    title = "Political Views and Support for Science Research",
    x = "Political Group",
    y = "Proportion",
    fill = "Support Science Research"
  )
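
The bars drawn with position = "fill" show, within each political group, the proportion of respondents giving each answer. A minimal sketch of that calculation on made-up data (the group and support vectors are hypothetical) could look like this:

```r
# Hypothetical respondents: four liberal, two conservative
group   <- c("Liberal", "Liberal", "Liberal", "Liberal",
             "Conservative", "Conservative")
support <- c("Yes", "Yes", "Yes", "No", "Yes", "No")

# Counts of each answer within each group
counts <- table(group, support)

# Row proportions: each row sums to 1, like one filled bar
props <- prop.table(counts, margin = 1)
props
```

Comparing the "Yes" column across rows gives the same comparison the filled bar chart shows visually.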