Overview

This file will help you subset data from the gssr package. The gssr R package provides a useful set of tools to efficiently access, manipulate, and analyze the General Social Survey (GSS) dataset. The package simplifies the process of working with complex survey data by offering a set of functions for subsetting, filtering, and transforming your desired GSS data.

The package gssr is located here.

Set up your work enviornment

Open up a new .Rmd file.

Use {r setup, include=F} in your first code chunk.

## Warning: package 'gssr' was built under R version 4.4.2

## Package loaded. To attach the GSS data, type data(gss_all) at the console.
## For the codebook, type data(gss_dict).
## For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc).
## For help on a specific GSS variable, type ?varname at the console.

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

We will run through a sample analysis using the gssr package.

Explore

We explore variables in the GSSR documentation file

# load the master documentation files
data(gss_all) # note that this is a large file of all GSS data
data(gssrdoc) # this is the documentation for the GSS data

Examine

We examine the years available for any given variable.

# check which years your variables are available
gss_which_years(gss_all, race)

## # A tibble: 34 × 2
##    year      race 
##    <dbl+lbl> <lgl>
##  1 1972      TRUE 
##  2 1973      TRUE 
##  3 1974      TRUE 
##  4 1975      TRUE 
##  5 1976      TRUE 
##  6 1977      TRUE 
##  7 1978      TRUE 
##  8 1980      TRUE 
##  9 1982      TRUE 
## 10 1983      TRUE 
## # ℹ 24 more rows

gss_which_years(gss_all, sex)

## # A tibble: 34 × 2
##    year      sex  
##    <dbl+lbl> <lgl>
##  1 1972      TRUE 
##  2 1973      TRUE 
##  3 1974      TRUE 
##  4 1975      TRUE 
##  5 1976      TRUE 
##  6 1977      TRUE 
##  7 1978      TRUE 
##  8 1980      TRUE 
##  9 1982      TRUE 
## 10 1983      TRUE 
## # ℹ 24 more rows

gss_which_years(gss_all, hrs2)

## # A tibble: 34 × 2
##    year      hrs2 
##    <dbl+lbl> <lgl>
##  1 1972      FALSE
##  2 1973      TRUE 
##  3 1974      TRUE 
##  4 1975      TRUE 
##  5 1976      TRUE 
##  6 1977      TRUE 
##  7 1978      TRUE 
##  8 1980      TRUE 
##  9 1982      TRUE 
## 10 1983      TRUE 
## # ℹ 24 more rows

gss_which_years(gss_all, educ)

## # A tibble: 34 × 2
##    year      educ 
##    <dbl+lbl> <lgl>
##  1 1972      TRUE 
##  2 1973      TRUE 
##  3 1974      TRUE 
##  4 1975      TRUE 
##  5 1976      TRUE 
##  6 1977      TRUE 
##  7 1978      TRUE 
##  8 1980      TRUE 
##  9 1982      TRUE 
## 10 1983      TRUE 
## # ℹ 24 more rows

# when you want to return information for multiple variables
gss_all %>%
  gss_which_years(c(race, sex, hrs2, educ)) %>%
  print(n = Inf)

## # A tibble: 34 × 5
##    year      race  sex   hrs2  educ 
##    <dbl+lbl> <lgl> <lgl> <lgl> <lgl>
##  1 1972      TRUE  TRUE  FALSE TRUE 
##  2 1973      TRUE  TRUE  TRUE  TRUE 
##  3 1974      TRUE  TRUE  TRUE  TRUE 
##  4 1975      TRUE  TRUE  TRUE  TRUE 
##  5 1976      TRUE  TRUE  TRUE  TRUE 
##  6 1977      TRUE  TRUE  TRUE  TRUE 
##  7 1978      TRUE  TRUE  TRUE  TRUE 
##  8 1980      TRUE  TRUE  TRUE  TRUE 
##  9 1982      TRUE  TRUE  TRUE  TRUE 
## 10 1983      TRUE  TRUE  TRUE  TRUE 
## 11 1984      TRUE  TRUE  TRUE  TRUE 
## 12 1985      TRUE  TRUE  TRUE  TRUE 
## 13 1986      TRUE  TRUE  TRUE  TRUE 
## 14 1987      TRUE  TRUE  TRUE  TRUE 
## 15 1988      TRUE  TRUE  TRUE  TRUE 
## 16 1989      TRUE  TRUE  TRUE  TRUE 
## 17 1990      TRUE  TRUE  TRUE  TRUE 
## 18 1991      TRUE  TRUE  TRUE  TRUE 
## 19 1993      TRUE  TRUE  TRUE  TRUE 
## 20 1994      TRUE  TRUE  TRUE  TRUE 
## 21 1996      TRUE  TRUE  TRUE  TRUE 
## 22 1998      TRUE  TRUE  TRUE  TRUE 
## 23 2000      TRUE  TRUE  TRUE  TRUE 
## 24 2002      TRUE  TRUE  TRUE  TRUE 
## 25 2004      TRUE  TRUE  TRUE  TRUE 
## 26 2006      TRUE  TRUE  TRUE  TRUE 
## 27 2008      TRUE  TRUE  TRUE  TRUE 
## 28 2010      TRUE  TRUE  TRUE  TRUE 
## 29 2012      TRUE  TRUE  TRUE  TRUE 
## 30 2014      TRUE  TRUE  TRUE  TRUE 
## 31 2016      TRUE  TRUE  TRUE  TRUE 
## 32 2018      TRUE  TRUE  TRUE  TRUE 
## 33 2021      TRUE  TRUE  TRUE  TRUE 
## 34 2022      TRUE  TRUE  TRUE  TRUE

# loading single year data
gss12 <- gss_get_yr(2012)

## Fetching: https://gss.norc.org/documents/stata/2012_stata.zip

gss22 <- gss_get_yr(2022)

## Fetching: https://gss.norc.org/documents/stata/2022_stata.zip

# View(gss22)

# select your variables for a single year using cross sectional file
gss12 %>% 
  select(race, sex, hrs2, educ, wtssall)

## # A tibble: 1,974 × 5
##    race      sex        hrs2        educ                    wtssall  
##    <dbl+lbl> <dbl+lbl>  <dbl+lbl>   <dbl+lbl>               <dbl+lbl>
##  1 1 [white] 1 [male]   NA(i) [iap] 16 [4 years of college] 2.62     
##  2 1 [white] 1 [male]   NA(i) [iap] 12 [12th grade]         3.50     
##  3 3 [other] 1 [male]   NA(i) [iap] 12 [12th grade]         1.75     
##  4 1 [white] 2 [female] NA(i) [iap] 13 [1 year of college]  1.24     
##  5 2 [black] 2 [female] NA(i) [iap] 16 [4 years of college] 0.874    
##  6 1 [white] 2 [female] NA(i) [iap] 19 [7 years of college] 0.824    
##  7 1 [white] 2 [female] NA(i) [iap] 15 [3 years of college] 0.824    
##  8 3 [other] 2 [female] NA(i) [iap] 11 [11th grade]         0.412    
##  9 2 [black] 2 [female] NA(i) [iap]  9 [9th grade]          0.412    
## 10 1 [white] 2 [female] NA(i) [iap] 17 [5 years of college] 0.412    
## # ℹ 1,964 more rows

gss12 %>% 
  select(race, sex, hrs2, educ, wtssall) %>% 
  count(race, sex, wtssall)

## # A tibble: 66 × 4
##    race      sex       wtssall       n
##    <dbl+lbl> <dbl+lbl> <dbl+lbl> <int>
##  1 1 [white] 1 [male]  0.412       167
##  2 1 [white] 1 [male]  0.824       280
##  3 1 [white] 1 [male]  0.874        34
##  4 1 [white] 1 [male]  1.00          1
##  5 1 [white] 1 [male]  1.19          1
##  6 1 [white] 1 [male]  1.24         46
##  7 1 [white] 1 [male]  1.65         22
##  8 1 [white] 1 [male]  1.75         95
##  9 1 [white] 1 [male]  2.06          7
## 10 1 [white] 1 [male]  2.62         11
## # ℹ 56 more rows

Get data

We use the gss_all to get all of the data set. We can use a more direct method but we must be careful about our outputs.

gss_all %>% 
  select(year, race, sex, hrs2, educ, wtssall, wtss, wtssnr) %>% 
  filter(year == 2014) %>% 
  drop_na() -> df

sapply(df, function(x) sum(is.na(x))) # take note that the hrs2 variable will cause issue

##    year    race     sex    hrs2    educ wtssall    wtss  wtssnr 
##       0       0       0       0       0       0       0       0

df

## # A tibble: 40 × 8
##    year      race      sex        hrs2      educ      wtssall   wtss      wtssnr
##    <dbl+lbl> <dbl+lbl> <dbl+lbl>  <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+>
##  1 2014      1 [white] 2 [female] 40        16        0.448     0.448     0.500 
##  2 2014      2 [black] 2 [female] 50        14        1.38      1.38      1.54  
##  3 2014      3 [other] 2 [female] 50        13        0.448     0.448     0.500 
##  4 2014      2 [black] 2 [female] 45        18        0.690     0.690     0.770 
##  5 2014      1 [white] 2 [female] 45        17        0.896     0.896     1.00  
##  6 2014      2 [black] 2 [female] 32        12        0.690     0.690     0.735 
##  7 2014      3 [other] 2 [female] 40        12        2.07      2.07      2.21  
##  8 2014      1 [white] 2 [female] 15        12        0.896     0.896     0.954 
##  9 2014      1 [white] 2 [female] 60        20        0.896     0.896     0.888 
## 10 2014      1 [white] 1 [male]   40        16        0.896     0.896     0.923 
## # ℹ 30 more rows

Subset data

# select your variables for a single year using gss_all
gss_all %>% 
  filter(year == 2012) %>% 
  select(race, sex, hrs2, educ, wtssall, wtss, wtssnr)

## # A tibble: 1,974 × 7
##    race      sex        hrs2        educ      wtssall   wtss      wtssnr   
##    <dbl+lbl> <dbl+lbl>  <dbl+lbl>   <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
##  1 1 [white] 1 [male]   NA(i) [iap] 16        2.62      2.62      2.87     
##  2 1 [white] 1 [male]   NA(i) [iap] 12        3.50      3.50      3.83     
##  3 3 [other] 1 [male]   NA(i) [iap] 12        1.75      1.75      1.91     
##  4 1 [white] 2 [female] NA(i) [iap] 13        1.24      1.24      1.35     
##  5 2 [black] 2 [female] NA(i) [iap] 16        0.874     0.874     0.957    
##  6 1 [white] 2 [female] NA(i) [iap] 19        0.824     0.824     0.902    
##  7 1 [white] 2 [female] NA(i) [iap] 15        0.824     0.824     0.902    
##  8 3 [other] 2 [female] NA(i) [iap] 11        0.412     0.412     0.451    
##  9 2 [black] 2 [female] NA(i) [iap]  9        0.412     0.412     0.451    
## 10 1 [white] 2 [female] NA(i) [iap] 17        0.412     0.412     0.451    
## # ℹ 1,964 more rows

gss_all %>% 
  filter(year == 2018) %>% 
  select(race, sex, hrs2, educ, wtssall, wtss, wtssnr)

## # A tibble: 2,348 × 7
##    race      sex        hrs2        educ      wtssall   wtss      wtssnr   
##    <dbl+lbl> <dbl+lbl>  <dbl+lbl>   <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
##  1 1 [white] 1 [male]      41       14        2.36      2.36      2.75     
##  2 1 [white] 2 [female] NA(i) [iap] 10        0.943     0.943     1.10     
##  3 1 [white] 1 [male]   NA(i) [iap] 16        0.943     0.943     1.10     
##  4 1 [white] 2 [female] NA(i) [iap] 16        0.943     0.943     1.10     
##  5 2 [black] 1 [male]   NA(i) [iap] 18        0.943     0.943     1.10     
##  6 1 [white] 2 [female] NA(i) [iap] 16        0.943     0.943     1.10     
##  7 2 [black] 2 [female] NA(i) [iap] 13        0.943     0.943     1.10     
##  8 1 [white] 1 [male]   NA(i) [iap] 12        0.471     0.471     0.551    
##  9 1 [white] 2 [female] NA(i) [iap]  8        0.471     0.471     0.551    
## 10 1 [white] 1 [male]   NA(i) [iap] 12        0.943     0.943     1.10     
## # ℹ 2,338 more rows

Clean data

# clean data for analysis
gss12 %>% 
  select(race, sex, hrs2, educ, wtssall, wtss, wtssnr) %>% 
  drop_na() -> gss12_analysis

# check for missing values
sapply(gss12_analysis, function(x) sum(is.na(x)))

##    race     sex    hrs2    educ wtssall    wtss  wtssnr 
##       0       0       0       0       0       0       0

str(gss12_analysis)

## tibble [35 × 7] (S3: tbl_df/tbl/data.frame)
##  $ race   : dbl+lbl [1:35] 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1...
##    ..@ label       : chr "race of respondent"
##    ..@ format.stata: chr "%29.0g"
##    ..@ labels      : Named num [1:16] 1 2 3 NA NA NA NA NA NA NA ...
##    .. ..- attr(*, "names")= chr [1:16] "white" "black" "other" "don't know" ...
##  $ sex    : dbl+lbl [1:35] 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1...
##    ..@ label       : chr "respondents sex"
##    ..@ format.stata: chr "%29.0g"
##    ..@ labels      : Named num [1:15] 1 2 NA NA NA NA NA NA NA NA ...
##    .. ..- attr(*, "names")= chr [1:15] "male" "female" "don't know" "iap" ...
##  $ hrs2   : dbl+lbl [1:35] 25, 40, 40, 50, 45, 45, 38, 32, 30, 40, 35, 35, 16, 35,...
##    ..@ label       : chr "number of hours usually work a week"
##    ..@ format.stata: chr "%29.0g"
##    ..@ labels      : Named num [1:14] 89 NA NA NA NA NA NA NA NA NA ...
##    .. ..- attr(*, "names")= chr [1:14] "89+ hours" "don't know" "iap" "I don't have a job" ...
##  $ educ   : dbl+lbl [1:35] 16, 16, 12, 16, 12, 14, 14, 12, 14, 14,  9, 10, 14, 18,...
##    ..@ label       : chr "highest year of school completed"
##    ..@ format.stata: chr "%29.0g"
##    ..@ labels      : Named num [1:34] 0 1 2 3 4 5 6 7 8 9 ...
##    .. ..- attr(*, "names")= chr [1:34] "no formal schooling" "1st grade" "2nd grade" "3rd grade" ...
##  $ wtssall: dbl+lbl [1:35] 1.748, 0.824, 3.496, 0.874, 0.824, 0.874, 0.874, 0.412,...
##    ..@ label       : chr "weight variable"
##    ..@ format.stata: chr "%29.0g"
##    ..@ labels      : Named num [1:13] NA NA NA NA NA NA NA NA NA NA ...
##    .. ..- attr(*, "names")= chr [1:13] "don't know" "iap" "I don't have a job" "dk, na, iap" ...
##  $ wtss   : dbl+lbl [1:35] 1.748, 0.824, 3.496, 0.874, 0.824, 0.874, 0.874, 0.412,...
##    ..@ label       : chr "weight variable"
##    ..@ format.stata: chr "%29.0g"
##    ..@ labels      : Named num [1:13] NA NA NA NA NA NA NA NA NA NA ...
##    .. ..- attr(*, "names")= chr [1:13] "don't know" "iap" "I don't have a job" "dk, na, iap" ...
##  $ wtssnr : dbl+lbl [1:35] 1.761, 0.845, 3.740, 0.807, 0.887, 0.895, 1.110, 0.408,...
##    ..@ label       : chr "weight variable"
##    ..@ format.stata: chr "%29.0g"
##    ..@ labels      : Named num [1:13] NA NA NA NA NA NA NA NA NA NA ...
##    .. ..- attr(*, "names")= chr [1:13] "don't know" "iap" "I don't have a job" "dk, na, iap" ...

# exploratory analysis
gss12 %>% 
  select(race, sex, hrs2, educ, wtssall, wtss, wtssnr) %>% 
  count(race, sex)

## # A tibble: 6 × 3
##   race      sex            n
##   <dbl+lbl> <dbl+lbl>  <int>
## 1 1 [white] 1 [male]     673
## 2 1 [white] 2 [female]   804
## 3 2 [black] 1 [male]     114
## 4 2 [black] 2 [female]   187
## 5 3 [other] 1 [male]      98
## 6 3 [other] 2 [female]    98

gss12_analysis %>% 
  count(race, sex)

## # A tibble: 5 × 3
##   race      sex            n
##   <dbl+lbl> <dbl+lbl>  <int>
## 1 1 [white] 1 [male]       9
## 2 1 [white] 2 [female]    19
## 3 2 [black] 2 [female]     5
## 4 3 [other] 1 [male]       1
## 5 3 [other] 2 [female]     1

Model 1

model <- lm(gss12_analysis$hrs2 ~ gss12_analysis$educ, weight = gss12_analysis$wtssall)
summary(model)

## 
## Call:
## lm(formula = gss12_analysis$hrs2 ~ gss12_analysis$educ, weights = gss12_analysis$wtssall)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.953  -5.699   1.695   5.578  17.203 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          41.8008     8.0522   5.191 1.05e-05 ***
## gss12_analysis$educ  -0.4010     0.6014  -0.667    0.509    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.01 on 33 degrees of freedom
## Multiple R-squared:  0.0133, Adjusted R-squared:  -0.0166 
## F-statistic: 0.4447 on 1 and 33 DF,  p-value: 0.5095

Plot the variable relationships

Here, I use some dated code, namely the base R abline() function discussed in class.

# Plot the regression line
plot(gss12_analysis$educ, gss12_analysis$hrs2)

# Some alternate code
plot(gss12_analysis$educ, gss12_analysis$hrs2, 
     main="Education vs Hours Worked",
     xlab="Years of Education", 
     ylab="Hours Worked")

# Add the regression line
abline(lm(hrs2 ~ educ, data = gss12_analysis), col = "red")

Here, I use some more frequent code with additional parameters given that we’ve made a ggplot object. The code also provides some labels to our plot.

Plot the variable relationships

# Create a basic scatter plot
plot(gss12_analysis$educ, gss12_analysis$hrs2, 
     main="Education vs Hours Worked",
     xlab="Years of Education", 
     ylab="Hours Worked",
     pch=16,  # Use filled circles for points
     col="blue",  # Make the points blue
     cex=0.8)  # Slightly smaller point size

# Add the regression line
abline(model, col = "red", lwd = 2)

# Add a legend
legend("topleft", legend=c("Data points", "Regression line"),
       col=c("blue", "red"), pch=c(16, NA), lty=c(NA, 1))

# Add text for R-squared value
r_squared <- summary(model)$r.squared
text(x = max(gss12_analysis$educ), y = min(gss12_analysis$hrs2), 
     labels = paste("R-squared =", round(r_squared, 3)), 
     pos = 2)

Examine residuals

We now examine the residuals.

# Check residuals
resids <- residuals(model)
plot(gss12_analysis$hrs2, resids)

And then view the diagnostic plots.

# Diagnostic plots
par(mfrow=c(2,2))
plot(model)

# Check significance of predictor
anova(model)

## Analysis of Variance Table
## 
## Response: gss12_analysis$hrs2
##                     Df Sum Sq Mean Sq F value Pr(>F)
## gss12_analysis$educ  1   44.6  44.575  0.4447 0.5095
## Residuals           33 3307.8 100.236

Model 2

# run your model and diagnostic plots
model2 <- lm(df$educ ~ df$hrs2)
summary(model2)

## 
## Call:
## lm(formula = df$educ ~ df$hrs2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2179 -2.8213 -0.1369  1.7238  5.1510 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.11117    1.68095   8.395 3.48e-10 ***
## df$hrs2      0.01845    0.04141   0.445    0.659    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.211 on 38 degrees of freedom
## Multiple R-squared:  0.005195,   Adjusted R-squared:  -0.02098 
## F-statistic: 0.1984 on 1 and 38 DF,  p-value: 0.6585

pred <- predict(model2) 

# Plot the scatterplot
plot(df$educ, df$hrs2)

# Check residuals
resids2 <- residuals(model2)
plot(df$educ, resids2)

# Diagnostic plots
par(mfrow=c(2,2))
plot(model2)

# Check significance of predictor
anova(model2)

## Analysis of Variance Table
## 
## Response: df$educ
##           Df Sum Sq Mean Sq F value Pr(>F)
## df$hrs2    1   2.05  2.0456  0.1984 0.6585
## Residuals 38 391.73 10.3087

Multivariate analysis

Try to examine the assumptions of the multivariate analysis and model below.

## conducting a multivariate linear analysis
mlm1 <- lm(cbind(hrs2, educ) ~ race + sex, data = df, weight = wtssall)
summary(mlm1)

## Response hrs2 :
## 
## Call:
## lm(formula = hrs2 ~ race + sex, data = df, weights = wtssall)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.903  -5.609   2.901   7.820  26.521 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.37616    7.68946   6.291 2.54e-07 ***
## race        -0.01122    3.47654  -0.003    0.997    
## sex         -6.38269    4.44618  -1.436    0.160    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.62 on 37 degrees of freedom
## Multiple R-squared:  0.05918,    Adjusted R-squared:  0.008322 
## F-statistic: 1.164 on 2 and 37 DF,  p-value: 0.3235
## 
## 
## Response educ :
## 
## Call:
## lm(formula = educ ~ race + sex, data = df, weights = wtssall)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4129 -2.5731  0.0395  1.4778  5.1925 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.6865     1.8605   8.431 3.87e-10 ***
## race         -2.2122     0.8412  -2.630   0.0124 *  
## sex           1.2441     1.0758   1.156   0.2549    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.054 on 37 degrees of freedom
## Multiple R-squared:  0.1591, Adjusted R-squared:  0.1136 
## F-statistic:   3.5 on 2 and 37 DF,  p-value: 0.04054

Subsetting General Social Survey (GSS) data

Nathan Alexander

Center for Applied Data Science and Analytics (CADSA)

Howard University

2024-11-17

Overview

Set up your work enviornment

Explore

Examine

Get data

Subset data

Clean data

Model 1

Plot the variable relationships

Plot the variable relationships

Examine residuals

Model 2

Multivariate analysis