Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(purrr)

Load data

load("gss.Rdata")

Part 1: Data

The General Social Survey (GSS) has been a significant source of sociological data since its inception in 1972. Conducted by NORC at the University of Chicago and funded by the National Science Foundation, the GSS collects data to monitor societal changes and study the growing complexity of American society.

Data Collection Methodology

The GSS is designed to produce high-quality data with as little bias as possible. Administered almost every year from 1972 and biennially since 1994, the survey interviews a representative sample of American adults on a wide range of topics, from beliefs and attitudes to behaviors and societal trends. Its broad array of questions, covering belief in God, confidence in government, race relations, and more, allows for a comprehensive analysis of American society.

Implications for Generalizability and Causality

The GSS’s sampling approach supports generalizing its findings to the broader American adult population: because respondents are drawn as a representative sample rather than self-selected, researchers and policymakers can draw conclusions about societal trends and attitudes with reasonable confidence. However, the GSS is an observational study with no random assignment of respondents to conditions, so it can identify associations but cannot by itself establish causality.

Accessibility and Use

A key feature of the GSS is its commitment to making data accessible to a wide audience. The GSS Data Explorer, an online tool launched in 2015 and updated in 2021, allows users to search and analyze data, making the GSS a valuable resource for educators, policymakers, journalists, and students. The survey’s adaptability and incorporation of new collection methods, like the web mode starting in 2022, ensure that the GSS remains a vital tool for understanding American society.

Cross-Country Comparisons

The GSS participates in the International Social Survey Programme (ISSP), which facilitates cross-country comparisons. Researchers can therefore compare responses from the United States with those from other countries, which extends the survey’s utility to comparative sociological research.

Conclusion

The General Social Survey’s methodological rigor and broad scope make it an indispensable tool for understanding societal trends and attitudes in the United States. While it offers excellent generalizability, the nature of its data collection primarily allows for observational insights, limiting the ability to infer causality.

For more detailed information on the GSS methodology, you can visit NORC’s GSS website and the GSS Methodological Reports.

Part 2: Research question

Reflection of Sociopolitical Dynamics

Understanding Sociopolitical Polarization: In recent years, the United States has witnessed increasing sociopolitical polarization. Analyzing how political views correlate with opinions on social welfare spending can shed light on the ideological divides that characterize contemporary American politics. People’s attitudes towards social welfare spending are a key indicator of their broader socio-economic policy preferences. Understanding these attitudes in the context of political orientation can provide insights into public support or opposition to welfare policies, which is crucial for policymakers and political analysts.

Impact on Policy Development

Asking how income, education, and political views relate to opinions on welfare spending helps reveal public opinion trends that matter for developing responsive and representative welfare policies. Recognizing how different demographic groups view welfare spending, in relation to their political views, allows for more targeted and effective policy development. Attitudes toward welfare spending also reflect broader opinions about the role of government in addressing social inequalities and economic challenges such as poverty, healthcare, and education.

Formulating the Hypotheses

In this analysis we will investigate three different demographic factors and their relationship to opinions on welfare spending: income level, education, and political views; therefore, we will be testing three hypotheses.

Income Level

Null Hypothesis (H0): Income level and opinions on social welfare spending are independent in the U.S. adult population sampled by the GSS.

Alternative Hypothesis (H1): Income level and opinions on social welfare spending are associated in the U.S. adult population sampled by the GSS.

Education

Null Hypothesis (H0): Education and opinions on social welfare spending are independent in the U.S. adult population sampled by the GSS.

Alternative Hypothesis (H1): Education and opinions on social welfare spending are associated in the U.S. adult population sampled by the GSS.

Political Views

Null Hypothesis (H0): Political views and opinions on social welfare spending are independent in the U.S. adult population sampled by the GSS.

Alternative Hypothesis (H1): Political views and opinions on social welfare spending are associated in the U.S. adult population sampled by the GSS.
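
Each pair of hypotheses corresponds to a chi-square test of independence on a two-way table of counts. As a reference sketch in standard notation (the symbols \(O_{ij}\), \(E_{ij}\), \(R_i\), \(C_j\), and \(n\) are introduced here for illustration and do not appear elsewhere in this analysis), the test statistic compares the observed count in each cell with the count expected if the two variables were independent:

\[
E_{ij} = \frac{R_i \, C_j}{n}, \qquad \chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]

Here \(R_i\) is the total of row \(i\), \(C_j\) is the total of column \(j\), and \(n\) is the number of respondents in the table. Large values of \(\chi^2\) count as evidence against \(H_0\).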

Part 3: Exploratory data analysis

# Select relevant variables
gss_selected <- gss %>%
  select(year, polviews, natfare, age, sex, race, educ, income06) %>%
  filter(!is.na(natfare), !is.na(polviews))  # Ensure key variables are not missing

# Check for missing values and data structure
summary(gss_selected)

##       year                       polviews            natfare     
##  Min.   :1974   Extremely Liberal    :  762   Too Little : 5367  
##  1st Qu.:1978   Liberal              : 3114   About Right: 8462  
##  Median :1988   Slightly Liberal     : 3564   Too Much   :13189  
##  Mean   :1990   Moderate             :10465                      
##  3rd Qu.:1998   Slightly Conservative: 4415                      
##  Max.   :2012   Conservative         : 3863                      
##                 Extrmly Conservative :  835                      
##       age            sex           race            educ      
##  Min.   :18.00   Male  :12376   White:22438   Min.   : 0.00  
##  1st Qu.:31.00   Female:14642   Black: 3505   1st Qu.:12.00  
##  Median :42.00                  Other: 1075   Median :12.00  
##  Mean   :45.15                                Mean   :12.77  
##  3rd Qu.:58.00                                3rd Qu.:15.00  
##  Max.   :89.00                                Max.   :20.00  
##  NA's   :85                                   NA's   :51     
##              income06    
##  $60000 To 74999 :  387  
##  $40000 To 49999 :  325  
##  Refused         :  309  
##  $50000 To 59999 :  302  
##  $75000 To $89999:  295  
##  (Other)         : 2413  
##  NA's            :22987

str(gss_selected)

## 'data.frame':    27018 obs. of  8 variables:
##  $ year    : int  1974 1974 1974 1974 1974 1974 1974 1974 1974 1974 ...
##  $ polviews: Factor w/ 7 levels "Extremely Liberal",..: 4 5 6 6 6 5 5 5 6 4 ...
##  $ natfare : Factor w/ 3 levels "Too Little","About Right",..: 3 3 3 2 2 2 3 2 3 1 ...
##  $ age     : int  21 41 83 69 58 30 48 67 54 89 ...
##  $ sex     : Factor w/ 2 levels "Male","Female": 1 1 2 2 2 1 1 1 2 1 ...
##  $ race    : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ educ    : int  14 16 10 10 12 16 17 10 11 6 ...
##  $ income06: Factor w/ 26 levels "Under $1 000",..: NA NA NA NA NA NA NA NA NA NA ...

# Preliminary exploration
gss_selected %>%
  group_by(year) %>%
  summarise(
    count = n(),
    avg_educ = mean(educ, na.rm = TRUE),
    mode_income = names(which.max(table(income06)))
  )

## # A tibble: 27 × 4
##     year count avg_educ mode_income 
##    <int> <int>    <dbl> <chr>       
##  1  1974  1359     11.9 Under $1 000
##  2  1975  1326     11.9 Under $1 000
##  3  1976  1344     12.0 Under $1 000
##  4  1977  1390     11.8 Under $1 000
##  5  1978  1392     12.1 Under $1 000
##  6  1980  1367     12.2 Under $1 000
##  7  1982  1668     12.1 Under $1 000
##  8  1983   747     12.6 Under $1 000
##  9  1984   456     12.5 Under $1 000
## 10  1985   691     12.7 Under $1 000
## # ℹ 17 more rows
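
Every year shown above reports "Under $1 000" as the modal income. A likely reason, judging from the NA count for income06 in the summary, is that income06 is entirely missing in those years: table() then returns all-zero counts and which.max() falls back to the first level. The helper below is a hypothetical sketch (the name safe_mode is not part of the original analysis) that returns NA in that case. It is shown on toy factors rather than the GSS data.

```r
# Hypothetical helper: most frequent level of a factor,
# or NA when every value is missing.
safe_mode <- function(x) {
  counts <- table(x)                       # NA values are dropped by default
  if (sum(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

# Toy factors standing in for income06 within a single survey year
lv <- c("Under $1 000", "$1 000 To 2 999", "$3 000 To 3 999")
all_missing <- factor(rep(NA_character_, 3), levels = lv)
observed <- factor(c("$3 000 To 3 999", "$1 000 To 2 999", "$3 000 To 3 999"),
                   levels = lv)

naive_mode    <- names(which.max(table(all_missing)))  # first level, despite no data
mode_missing  <- safe_mode(all_missing)                # NA
mode_observed <- safe_mode(observed)                   # most frequent observed level
```

Inside summarise(), the same helper could be used as mode_income = safe_mode(income06), assuming it has been defined beforehand.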

# Visualization of opinions on welfare spending over years
gss %>%
  filter(!is.na(natfare)) %>%
  count(year, natfare) %>%
  ggplot(aes(x = factor(year), y = n, fill = natfare)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Opinions on Welfare Spending Over Time",
       x = "Year",
       y = "Percentage",
       fill = "Opinion on Welfare") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# Convert to factor if it's not already
gss$income06 <- as.factor(gss$income06)
income_levels <- levels(gss$income06)
income_levels

##  [1] "Under $1 000"       "$1 000 To 2 999"    "$3 000 To 3 999"   
##  [4] "$4 000 To 4 999"    "$5 000 To 5 999"    "$6 000 To 6 999"   
##  [7] "$7 000 To 7 999"    "$8 000 To 9 999"    "$10000 To 12499"   
## [10] "$12500 To 14999"    "$15000 To 17499"    "$17500 To 19999"   
## [13] "$20000 To 22499"    "$22500 To 24999"    "$25000 To 29999"   
## [16] "$30000 To 34999"    "$35000 To 39999"    "$40000 To 49999"   
## [19] "$50000 To 59999"    "$60000 To 74999"    "$75000 To $89999"  
## [22] "$90000 To $109999"  "$110000 To $129999" "$130000 To $149999"
## [25] "$150000 Or Over"    "Refused"

# Create a named vector that maps each original income level to the new category
income_map <- setNames(
  c(rep("Under $20,000", 13),   # For income levels up to "$17500 To 19999"
    rep("$20,000 to $49,999", 5),   # For the next five income levels
    rep("$50,000 to $74,999", 2),   # For the next two income levels
    rep("$75,000 to $99,999", 2),   # For the next two income levels
    "Unknown/Refused",              # For the "Refused" category
    rep("$100,000 and over", 5)),   # For the top five income levels
  unique(gss$income06)
)

# Map the 'income06' variable to the new categories
gss$income_category <- map_chr(gss$income06, ~ income_map[.])

# Handle the NAs if any exist after mapping
gss$income_category[is.na(gss$income_category)] <- "Unknown/Refused"

# Check the new income category variable
table(gss$income_category)

## 
##  $100,000 and over $20,000 to $49,999 $50,000 to $74,999 $75,000 to $99,999 
##               1692               2661               1625               1257 
##      Under $20,000    Unknown/Refused 
##               2456              47370
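
The mapping above assigns bins by position: it depends on the order in which unique() returns the values and on how a factor behaves when used as an index. Its group sizes (13, 5, 2, 2, 1, 5) sum to 28, while income06 has 26 levels, so bins may not match their labels. A name-based recode is one way to make the binning explicit. The sketch below is an alternative under stated assumptions, not the code that produced the table above: the names income_labels, income_bins, income_map_by_name, and recode_income are hypothetical, and the level "$90000 To $109999" straddles $100,000, so placing it in the lower bin is purely an illustrative choice. It works from the level labels printed earlier, so it runs without the GSS file.

```r
# Level labels of income06, copied from the printed output
income_labels <- c(
  "Under $1 000", "$1 000 To 2 999", "$3 000 To 3 999", "$4 000 To 4 999",
  "$5 000 To 5 999", "$6 000 To 6 999", "$7 000 To 7 999", "$8 000 To 9 999",
  "$10000 To 12499", "$12500 To 14999", "$15000 To 17499", "$17500 To 19999",
  "$20000 To 22499", "$22500 To 24999", "$25000 To 29999", "$30000 To 34999",
  "$35000 To 39999", "$40000 To 49999", "$50000 To 59999", "$60000 To 74999",
  "$75000 To $89999", "$90000 To $109999", "$110000 To $129999",
  "$130000 To $149999", "$150000 Or Over", "Refused"
)

# One bin per label, in the same order as income_labels.
# Assumption: "$90000 To $109999" is kept in the "$75,000 to $99,999" bin.
income_bins <- c(
  rep("Under $20,000", 12),
  rep("$20,000 to $49,999", 6),
  rep("$50,000 to $74,999", 2),
  rep("$75,000 to $99,999", 2),
  rep("$100,000 and over", 3),
  "Unknown/Refused"
)
stopifnot(length(income_bins) == length(income_labels))
income_map_by_name <- setNames(income_bins, income_labels)

# Hypothetical helper: look bins up by label (not by factor code);
# missing values become "Unknown/Refused".
recode_income <- function(x) {
  out <- unname(income_map_by_name[as.character(x)])
  out[is.na(out)] <- "Unknown/Refused"
  out
}

toy <- factor(c("$17500 To 19999", "$20000 To 22499", "Refused", NA,
                "$150000 Or Over"),
              levels = income_labels)
recoded <- recode_income(toy)
```

Because the lookup is by label, the result does not change if the levels are reordered or if some levels never occur in the data.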

# Visualization of opinions on welfare spending by income level
# Make income_category a factor with its levels in display order
gss$income_category <- factor(gss$income_category, levels = c(
  "Under $20,000", 
  "$20,000 to $49,999", 
  "$50,000 to $74,999", 
  "$75,000 to $99,999", 
  "$100,000 and over", 
  "Unknown/Refused"
))

# Keep only respondents who answered the welfare question
gss_filtered <- gss %>%
  filter(!is.na(natfare))

# Creating a plot with the ordered income categories and excluding NA opinions
ggplot(gss_filtered, aes(x = income_category, fill = natfare)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Opinions on Welfare Spending by Income Category",
       x = "Income Category",
       y = "Percentage",
       fill = "Opinion on Welfare") +
  theme_minimal() +
  # Rotate x labels for readability
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# First, create and store the filtered and transformed dataset
gss_education_filtered <- gss %>%
  filter(!is.na(natfare)) %>%
  mutate(education_group = cut(educ, 
                               breaks = c(-Inf, 11, 12, 14, 16, Inf),
                               labels = c("Less than High School", "High School Graduate",
                                          "Some College", "Bachelor's Degree", "Graduate Degree"),
                               include.lowest = TRUE))
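
To see which years of schooling fall into each group, the same cut() call can be applied to a handful of values. The sketch below uses a toy vector (years and bin_lookup are hypothetical names). With the default right = TRUE, each interval includes its upper break, so 12 years maps to "High School Graduate" and both 15 and 16 years map to "Bachelor's Degree". Whether 15 years should count as a bachelor's degree is a judgment call that these breaks settle implicitly.

```r
# Toy values covering every bin boundary
years <- c(0, 11, 12, 13, 14, 15, 16, 17, 20)

groups <- cut(years,
              breaks = c(-Inf, 11, 12, 14, 16, Inf),
              labels = c("Less than High School", "High School Graduate",
                         "Some College", "Bachelor's Degree", "Graduate Degree"),
              include.lowest = TRUE)

# Side-by-side view of years of schooling and the assigned group
bin_lookup <- data.frame(years = years, group = as.character(groups))
bin_lookup
```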

# Visualization of opinions on welfare spending by education level
gss_education_filtered %>%
  count(education_group, natfare) %>%
  ggplot(aes(x = education_group, y = n, fill = natfare)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Opinions on Welfare Spending by Education Level",
       x = "Education Level",
       y = "Percentage",
       fill = "Opinion on Welfare") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x labels for readability

# Visualization of opinions on welfare spending by political views
# Filtering to remove rows with NA in either 'polviews' or 'natfare'
gss_clean <- gss %>%
  filter(!is.na(polviews) & !is.na(natfare)) %>%
  count(polviews, natfare)

# Creating the plot with filtered data
gss_clean %>%
  ggplot(aes(x = polviews, y = n, fill = natfare)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Opinions on Welfare Spending by Political Views",
       x = "Political Views",
       y = "Percentage",
       fill = "Opinion on Welfare") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x labels for readability
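
The filled bar charts show, for each group, the share of respondents giving each answer. The same shares can be read off as numbers with prop.table(). The sketch below uses made-up counts (toy_counts is a hypothetical object, not a GSS result) to show the row-wise calculation.

```r
# Made-up counts for illustration only (not GSS results)
toy_counts <- matrix(c(30, 50, 20,
                       10, 40, 50),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(group = c("A", "B"),
                                     opinion = c("Too Little", "About Right",
                                                 "Too Much")))

# margin = 1 divides each cell by its row total, so every row sums to 1
row_shares <- prop.table(toy_counts, margin = 1)
round(row_shares, 2)
```

For the survey data, prop.table(table(gss$polviews, gss$natfare), margin = 1) should give the analogous shares by political view, since table() drops missing values by default.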

Part 4: Inference

# Chi-squared test for income level and opinions on welfare spending

# Create a contingency table of counts for 'income_category' and 'natfare'
contingency_table_income <- table(gss_filtered$income_category, gss_filtered$natfare)

# Perform the chi-square test
chi_square_test_income <- chisq.test(contingency_table_income)

# Output the result of the chi-square test
print(chi_square_test_income)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table_income
## X-squared = 258.16, df = 10, p-value < 2.2e-16
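
To make clear what chisq.test() computes, the statistic can be rebuilt by hand from observed and expected counts. The following is a minimal sketch on a made-up 2 x 3 table (obs, expected, chi_sq, deg_free, and p_value are illustrative names, and the counts are not GSS results).

```r
# Made-up 2 x 3 table of counts (illustration only)
obs <- matrix(c(30, 50, 20,
                10, 40, 50),
              nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic, degrees of freedom, and upper-tail p-value
chi_sq   <- sum((obs - expected)^2 / expected)
deg_free <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value  <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

# Cross-check against the built-in test
builtin <- chisq.test(obs)
all.equal(unname(builtin$statistic), chi_sq)
```

The continuity correction in chisq.test() applies only to 2 x 2 tables, so the hand calculation and the built-in result agree for tables of the sizes used in this analysis.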

# Chi-squared test for education level and opinions on welfare spending

# Create a contingency table of counts for 'education_group' and 'natfare'
contingency_table_education <- table(gss_education_filtered$education_group,
                                     gss_education_filtered$natfare)

# Perform the chi-square test
chi_square_test_education <- chisq.test(contingency_table_education)

# Output the result of the chi-square test
print(chi_square_test_education)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table_education
## X-squared = 347.21, df = 8, p-value < 2.2e-16

# Chi-squared test for political views and opinions on welfare spending

# Create a dataset for the chi-square test without summarizing
gss_chi_test <- gss %>%
  filter(!is.na(polviews) & !is.na(natfare))

# Create a contingency table of counts for 'polviews' and 'natfare'
contingency_table <- table(gss_chi_test$polviews, gss_chi_test$natfare)

# Perform the chi-square test
chi_square_test <- chisq.test(contingency_table)
print(chi_square_test)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 1051.9, df = 12, p-value < 2.2e-16
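
As a quick consistency check, the degrees of freedom reported by the three tests can be recovered from the table dimensions. For a table with \(r\) rows and \(c\) columns (symbols introduced here for illustration), \(df = (r - 1)(c - 1)\). The natfare variable has three levels, so every table has \(c = 3\):

\[
\begin{aligned}
\text{Income (6 categories):} \quad & (6 - 1)(3 - 1) = 10 \\
\text{Education (5 groups):} \quad & (5 - 1)(3 - 1) = 8 \\
\text{Political views (7 levels):} \quad & (7 - 1)(3 - 1) = 12
\end{aligned}
\]

These values match the df reported in the three outputs above.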

Chi-Square Test Results Interpretation

Income Category: The chi-square test (\(\chi^2 = 258.16\), \(df = 10\), p-value < \(2.2 \times 10^{-16}\)) indicates a statistically significant association between income level and opinions on welfare spending. The table includes the "Unknown/Refused" category, which contains most respondents, so part of this association may reflect who has a recorded income rather than income itself.
Education Level: The chi-square test (\(\chi^2 = 347.21\), \(df = 8\), p-value < \(2.2 \times 10^{-16}\)) indicates a statistically significant association between education level and opinions on welfare spending.
Political Views: The chi-square test (\(\chi^2 = 1051.9\), \(df = 12\), p-value < \(2.2 \times 10^{-16}\)) indicates a statistically significant association between political views and opinions on welfare spending.

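These results indicate statistically significant associations for all three factors, so each null hypothesis is rejected. The p-values are not necessarily identical: R prints any p-value below its reporting threshold (about \(2.2 \times 10^{-16}\)) in the same way, and with tens of thousands of respondents even modest associations produce values this small. The tests indicate association, not causality, and they do not measure how strong each association is, so further analysis is needed to understand the underlying reasons.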
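
Because the sample is large, a very small p-value says little about the strength of an association. One common effect-size summary for a contingency table is Cramér's V, which is not part of the original analysis and is sketched here only as an optional follow-up. The function name cramers_v is hypothetical and the toy table is made up. The political views figure uses the statistic (1051.9) and the number of complete cases (27018) reported earlier.

```r
# Hypothetical helper: Cramér's V for a two-way table of counts
cramers_v <- function(tab) {
  test <- chisq.test(tab, correct = FALSE)
  n <- sum(tab)                              # total number of observations
  k <- min(nrow(tab), ncol(tab)) - 1         # smaller table dimension minus one
  sqrt(unname(test$statistic) / (n * k))
}

# Made-up 2 x 3 table (illustration only)
toy <- matrix(c(30, 50, 20,
                10, 40, 50),
              nrow = 2, byrow = TRUE)
v_toy <- cramers_v(toy)

# Political views (7 levels) by natfare (3 levels), from the reported output
v_polviews <- sqrt(1051.9 / (27018 * min(7 - 1, 3 - 1)))
round(v_polviews, 2)
```

For the political views table this gives roughly 0.14, which by common rules of thumb is a modest association despite the very small p-value.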