Statistical inference with the GSS data

The project starts with a discussion of the generalizability and causality of the General Social Survey (GSS) data, which is used to monitor societal change and study the growing complexity of American society. Two research questions of interest follow. The main body is the exploration and inference based on those research questions, and a summary of the results is given at the end.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

① Generalizability

From GSS’s website, we read: “In 1985 the GSS co-founded the International Social Survey Program (ISSP). The ISSP has conducted an annual cross-national survey each year since then and has involved 60 countries and interviewed over one million respondents. The ISSP asks an identical battery of questions in all countries; the U.S. version of these questions is incorporated into the GSS. The 2016 ISSP topics are work orientation and role of government.”

From Wikipedia, we learn about the methodology: The target population of the GSS is adults (18+) living in households in the United States. The GSS sample is drawn using an area probability design that randomly selects respondents in households across the United States to take part in the survey. Respondents that become part of the GSS sample are from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary. However, because only a few thousand respondents are interviewed in the main study, every respondent selected is very important to the results.

The survey is conducted face-to-face with an in-person interview by NORC at the University of Chicago. The survey was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992). Since 1994, it has been conducted every other year. The survey takes about 90 minutes to administer.

Since random sampling is used in the GSS, the results are generalizable to the target population (adults living in households in the United States). However, there are a few concerns. For example, the survey lasts 90 minutes, which is a long time, so some respondents might answer perfunctorily to speed up the process rather than giving each question careful thought, which could make the survey less credible.

② Causality

Random assignment was not used in this survey, so we cannot establish any causal relationship between variables based on these data.

Part 2: Research question

The research questions of interest are:

Q1: Over the decade between 2000 and 2010, is there a change in how people feel about whether it is easy to find an equally good job? If there is, is the change statistically significant at the 5% significance level (95% confidence level)?

Q2: Using the latest (2012) data, do people in different regions feel differently about this question?

In order to answer these questions, we need the following variables from the data:

year: 2000 and 2010 (Q1), 2012 (Q2)

jobfind: 1 Very Easy, 2 Somewhat Easy, 3 Not Easy

region: 1 NEW ENGLAND, 2 MIDDLE ATLANTIC, 3 E. NOR. CENTRAL, 4 W. NOR. CENTRAL, 5 SOUTH ATLANTIC, 6 E. SOU. CENTRAL, 7 W. SOU. CENTRAL, 8 MOUNTAIN, 9 PACIFIC

yearstudy <- c(2000, 2010)   # year is stored as an integer in gss

jobs <- gss%>% select(year,region,jobfind) %>%
    filter(year %in% yearstudy) %>% na.omit() 

# Define easy ("Very Easy" or "Somewhat Easy") and not easy ("Not Easy")
jobs_re <- jobs %>% mutate(easy_or_not = ifelse(jobfind == "Not Easy", "not easy", "easy")) %>%
  select(year, region, easy_or_not)

Part 3: Exploratory data analysis

Q1

First, we look at the general trend of the variable:

jobs_re  %>%
  group_by(year) %>%
  summarise( easy = sum(easy_or_not =="easy"), not_easy = sum(easy_or_not == "not easy"))

## # A tibble: 2 x 3
##    year  easy not_easy
##   <int> <int>    <int>
## 1  2000   863      360
## 2  2010   372      432

year_table <- table(jobs_re$year,jobs_re$easy_or_not)
prop.table(year_table,1)

##       
##             easy  not easy
##   2000 0.7056419 0.2943581
##   2010 0.4626866 0.5373134
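As a quick check, the row proportions above can be reproduced from the printed counts alone. This is a minimal sketch that types the four counts in by hand instead of reading them from gss; the object names `counts`, `row_props` and `change_not_easy` are illustrative only.

```r
# Counts copied from the summarise() output above
# (rows: survey year, columns: response)
counts <- matrix(c(863, 360,
                   372, 432),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(year = c("2000", "2010"),
                                 response = c("easy", "not easy")))

# Proportions within each year (each row sums to 1)
row_props <- prop.table(counts, margin = 1)
round(row_props, 4)

# Change in the "not easy" share between the two years
change_not_easy <- row_props["2010", "not easy"] - row_props["2000", "not easy"]
change_not_easy
```

The difference in the "not easy" share computed here is the quantity that the inference for Q1 is about.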

From the table we can see that the share of respondents who feel it is not easy to find an equally good job rose from about 29% in 2000 to about 54% in 2010. Now let’s see the change in different regions:

ggplot(data = jobs_re, aes(x = factor(year), fill = easy_or_not)) +   # treat year as a category
  facet_grid(.~region) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle('Do you feel it is easy to find an equally good job?')  +    #add title for graph
  labs(x = "year", y = "proportion")

The bars show that the proportion of respondents who feel it is not easy to find an equally good job increased in every region over this decade, with the change more pronounced in some regions than in others.

With this rough exploration of the data, we can now restate the first question as: is the difference between the proportions of respondents who feel it is not easy to find an equally good job in 2000 and in 2010 statistically significant?

Q2

As for the second question, we want to know whether respondents’ attitudes about how easy it is to find an equally good job are independent of their region of residence, so we use a chi-square test of independence.

Since our interest lies in those who find it “not easy”, we tabulate the “not easy” proportion in each region to get a general sense of its distribution:

Q2 <- gss %>% select(region, year, jobfind) %>% filter(year ==  "2012") %>% na.omit()
Q2 <- Q2 %>%
  mutate(easy_or_not = ifelse(jobfind == "Not Easy", "not easy", "easy")) %>%
  select(year, region, easy_or_not)

table_q2 <- table(Q2$region,Q2$easy_or_not)
table_q2 <- prop.table(table_q2,1)
table_q2

##                  
##                        easy  not easy
##   New England     0.6000000 0.4000000
##   Middle Atlantic 0.5227273 0.4772727
##   E. Nor. Central 0.4881890 0.5118110
##   W. Nor. Central 0.5555556 0.4444444
##   South Atlantic  0.5103448 0.4896552
##   E. Sou. Central 0.4772727 0.5227273
##   W. Sou. Central 0.5903614 0.4096386
##   Mountain        0.4626866 0.5373134
##   Pacific         0.5436893 0.4563107

From the table we see that the proportion of “not easy” varies somewhat across regions, from about 0.40 in New England to about 0.54 in Mountain.

Part 4: Inference

Q1

State hypotheses

H0: p2010 = p2000 (nothing is going on)

H1: p2010 > p2000

Here p is the proportion of respondents in the given year who answer “not easy”.

Check conditions

1) Independence: Respondents are randomly sampled and are considered independent of each other.

2) Sample size: There are more than 10 respondents in both the “easy” and the “not easy” category in 2000 and in 2010.

The sample meets the conditions for inference, so we continue with our inference.
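For the hypothesis test, a common refinement of the sample size condition is to check expected rather than observed counts, using the pooled proportion under H0. The sketch below does this with the counts printed in Part 3. It assumes the usual threshold of 10 expected successes and failures per group, and the object names are illustrative.

```r
# Group sizes and "not easy" counts, copied from the output in Part 3
n        <- c("2010" = 804, "2000" = 1223)
not_easy <- c("2010" = 432, "2000" = 360)

# Pooled proportion of "not easy" under H0: p2010 = p2000
p_pool <- sum(not_easy) / sum(n)

# Expected successes ("not easy") and failures ("easy") in each year
expected <- rbind(successes = n * p_pool,
                  failures  = n * (1 - p_pool))
round(expected, 1)

# TRUE if every expected count is at least 10
all(expected >= 10)
```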

State the method(s) to be used and why and how

We can calculate a 95% confidence interval for the difference p2010 - p2000 and check whether 0 belongs to it. In addition, we run a Z test for the difference in proportions to see whether the two approaches agree.
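For reference, the standard textbook formulas behind these two theoretical methods are sketched here. The notation \(\hat{p}\) for a sample proportion, \(n\) for a group size, \(z^{*}\) for the critical value (about 1.96 for 95%) and \(\hat{p}_{pool}\) for the pooled proportion is introduced for illustration and does not come from the statsr output.

\[(\hat{p}_{2010}-\hat{p}_{2000}) \pm z^{*}\sqrt{\frac{\hat{p}_{2010}(1-\hat{p}_{2010})}{n_{2010}}+\frac{\hat{p}_{2000}(1-\hat{p}_{2000})}{n_{2000}}}\]

\[z = \frac{\hat{p}_{2010}-\hat{p}_{2000}}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{n_{2010}}+\frac{1}{n_{2000}}\right)}}, \qquad \hat{p}_{pool} = \frac{\text{total "not easy" in both years}}{n_{2010}+n_{2000}}\]

The interval uses the two sample proportions separately, while the test statistic uses the pooled proportion because H0 assumes the two proportions are equal.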

Perform inference

inference(data = jobs_re, y = easy_or_not , x = as.factor(year), order = c("2010","2000"),statistic = "proportion", success = "not easy", type = "ci", method = "theoretical")

## Response variable: categorical (2 levels, success: not easy)
## Explanatory variable: categorical (2 levels) 
## n_2010 = 804, p_hat_2010 = 0.5373
## n_2000 = 1223, p_hat_2000 = 0.2944
## 95% CI (2010 - 2000): (0.2001 , 0.2859)

inference(data = jobs_re, y = easy_or_not , x = as.factor(year), order = c("2010","2000"),statistic = "proportion", success = "not easy", type = "ht", null = 0, alternative = "greater", method = "theoretical")

## Response variable: categorical (2 levels, success: not easy)
## Explanatory variable: categorical (2 levels) 
## n_2010 = 804, p_hat_2010 = 0.5373
## n_2000 = 1223, p_hat_2000 = 0.2944
## H0: p_2010 =  p_2000
## HA: p_2010 > p_2000
## z = 10.9673
## p_value = < 0.0001
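The two results above can be cross-checked by hand in base R. This is a minimal sketch using only the counts printed in Part 3 and the standard unpooled standard error for the interval and pooled standard error for the test; it is not a description of how statsr computes them internally, and the variable names are illustrative.

```r
# Counts from Part 3: group sizes and number of "not easy" answers
n_2010 <- 804;  x_2010 <- 432
n_2000 <- 1223; x_2000 <- 360

p_2010   <- x_2010 / n_2010
p_2000   <- x_2000 / n_2000
diff_hat <- p_2010 - p_2000

# 95% confidence interval: unpooled standard error
se_ci <- sqrt(p_2010 * (1 - p_2010) / n_2010 +
              p_2000 * (1 - p_2000) / n_2000)
ci <- diff_hat + c(-1, 1) * qnorm(0.975) * se_ci
round(ci, 4)

# One-sided Z test: pooled standard error under H0
p_pool  <- (x_2010 + x_2000) / (n_2010 + n_2000)
se_ht   <- sqrt(p_pool * (1 - p_pool) * (1 / n_2010 + 1 / n_2000))
z       <- diff_hat / se_ht
p_value <- pnorm(z, lower.tail = FALSE)   # alternative: p2010 > p2000
round(z, 4)
p_value
```

Both quantities should agree with the inference() output to the printed precision.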

Interpret results

The 95% confidence interval is (0.2001, 0.2859). Clearly 0 is excluded, which means we should reject the null hypothesis. In the Z test, the p-value < 0.0001 < alpha (0.05), so again we should reject the null hypothesis. The two results agree with each other: the change in the proportion is statistically significant. The data provide convincing evidence that a larger share of people in 2010 than in 2000 felt it was not easy to find an equally good job.

Q2

State hypotheses

H0: Proportion of “not easy” has nothing to do with region.

H1: The proportion of “not easy” varies by region.

State the method(s) to be used and why and how

Since the “region” includes more than two levels, we choose to do an independence test with chi-square to answer the question.

Check conditions

1) Independence: Respondents are considered independent of each other.

2) Sample size:

Q2_2 <- gss %>% select(region, year, jobfind) %>% filter(year ==  "2012") %>% na.omit()
Q2_2 <- Q2_2 %>%
  mutate(easy_or_not = ifelse(jobfind == "Not Easy", "not easy", "easy")) %>%
  select(region, easy_or_not) %>% filter(easy_or_not == "not easy")

no.sum <- table(Q2_2)
no.sum

##                  easy_or_not
## region            not easy
##   New England           18
##   Middle Atlantic       42
##   E. Nor. Central       65
##   W. Nor. Central       28
##   South Atlantic        71
##   E. Sou. Central       23
##   W. Sou. Central       34
##   Mountain              36
##   Pacific               47

# all() requires every region to reach the threshold; sum() was TRUE if any region did
ifelse(all(no.sum >= 5),"Conditions met","Conditions not met")

## [1] "Conditions met"

In all regions, the number of “not easy” responses is at least five (the smallest count is 18). The sample meets the sample size condition for the chi-square test, so we continue with our inference.

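The test consists of calculating expected counts assuming that the null hypothesis is true. Because chisq.test() is called on a single vector of “not easy” counts, R runs it as a test for given probabilities with equal probabilities by default, so each of the k = 9 regions gets the same expected count:

\[E = \frac{\text{total "not easy"}}{k}\]

Once this is done, the chi-squared statistic is computed as:

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i-E)^2}{E}\]

where \(O_i\) is the observed “not easy” count in region i.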

The chi-square test does not define confidence intervals, so these were not included in this analysis.
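As a worked example, here is the arithmetic for the nine “not easy” counts listed above (18, 42, 65, 28, 71, 23, 34, 36, 47, which sum to 364). It assumes the equal expected count per region that chisq.test() uses by default when it is given a single vector of counts:

\[E = \frac{364}{9} \approx 40.44, \qquad \chi^2 = \sum_{i=1}^{9} \frac{(O_i - 40.44)^2}{40.44} \approx 64.44, \qquad df = 9 - 1 = 8\]

The largest contributions come from the regions whose counts are furthest from 40.44, namely South Atlantic (71), E. Nor. Central (65) and New England (18).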

Perform inference

We can calculate the chi-square statistic, as well as the corresponding p-value, via:

no.sum <- c(18,42,65,28,71,23,34,36,47)
chisq.test(no.sum)

## 
##  Chi-squared test for given probabilities
## 
## data:  no.sum
## X-squared = 64.44, df = 8, p-value = 6.228e-11

Interpret results

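The large X-squared statistic (64.44 with 8 degrees of freedom) leads to a very low p-value. Since the p-value is below alpha (0.05), there is sufficient evidence to reject the null hypothesis of this test. It matters what that null hypothesis is, though: because chisq.test() was given a single vector of counts, it tested whether the “not easy” respondents are spread evenly across the nine regions. The regions contribute different numbers of respondents to the sample, so uneven counts are expected even if the proportion of “not easy” were the same everywhere. The result therefore shows that the number of “not easy” respondents differs by region; concluding that respondents’ attitudes about how easy it is to find an equally good job vary by region requires a test on the full region-by-response table. In either case, the result cannot be used to determine causality, because the GSS is an observational study and not an experiment with random assignment to treatment.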
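As a cross-check, a chi-square test of independence can be run on the full region-by-response table. The sketch below is built under a stated assumption: the “easy” counts are not printed anywhere in this report, so they are reconstructed here from the “not easy” counts and the row proportions in Part 3 (for example 18 / 0.4 = 45 respondents in New England, hence 27 “easy”). Rerunning the test on the original gss data is the reliable way to confirm it.

```r
regions <- c("New England", "Middle Atlantic", "E. Nor. Central",
             "W. Nor. Central", "South Atlantic", "E. Sou. Central",
             "W. Sou. Central", "Mountain", "Pacific")

# "not easy" counts as printed in Part 4
not_easy <- c(18, 42, 65, 28, 71, 23, 34, 36, 47)

# ASSUMPTION: "easy" counts reconstructed from the 2012 row proportions
easy <- c(27, 46, 62, 35, 74, 21, 49, 31, 56)

region_table <- cbind(easy = easy, "not easy" = not_easy)
rownames(region_table) <- regions

# The test reported above: "not easy" counts against equal shares per region
gof <- chisq.test(not_easy)

# Test of independence between region and response (df = (9 - 1) * (2 - 1) = 8)
indep <- chisq.test(region_table)

round(indep$expected, 1)   # expected counts, all should be at least 5
indep
```

On these reconstructed counts the independence statistic comes out well below the 5% critical value for 8 degrees of freedom (about 15.5), so this check would not by itself support a regional difference in the “not easy” proportion, in contrast to the test on the raw counts.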