The project begins by discussing the generalizability and causality of the General Social Survey (GSS) data, which is used to monitor societal change and study the growing complexity of American society. Two research questions of interest follow. The main body covers the exploration and inference based on these research questions, and a summary is presented at the end.
library(ggplot2)
library(dplyr)
library(statsr)
load("gss.Rdata")
① Generalizability:
From the GSS website, we read: “In 1985 the GSS co-founded the International Social Survey Program (ISSP). The ISSP has conducted an annual cross-national survey each year since then and has involved 60 countries and interviewed over one million respondents. The ISSP asks an identical battery of questions in all countries; the U.S. version of these questions is incorporated into the GSS. The 2016 ISSP topics are work orientation and role of government.”
From Wikipedia, we learn about the methodology: The target population of the GSS is adults (18+) living in households in the United States. The GSS sample is drawn using an area probability design that randomly selects respondents in households across the United States to take part in the survey. Respondents that become part of the GSS sample are from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary. However, because only a few thousand respondents are interviewed in the main study, every respondent selected is very important to the results.
The survey is conducted face-to-face with an in-person interview by NORC at the University of Chicago. The survey was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992). Since 1994, it has been conducted every other year. The survey takes about 90 minutes to administer.
Since the GSS uses random sampling, its results are generalizable to the broader population of adults living in U.S. households. There are a few concerns, however. For example, the survey takes about 90 minutes, which is a long time, so some respondents might answer perfunctorily to speed up the process rather than considering each question carefully, which could make the responses less credible.
② Causality
Random assignment was not used in this survey, so we cannot establish any causal relationship between variables from these data.
The research questions of interest are:
Q1: Over the decade between 2000 and 2010, did people’s feelings about whether it is easy to find an equally good job change? If so, is the change statistically significant at the 5% significance level?
Q2: Using the latest (2012) data, do people’s answers to this question differ by region of residence?
In order to answer these questions, we need the following variables from the data:
year: 2000, 2010 (Q1); 2012 (Q2)
jobfind: 1 Very Easy, 2 Somewhat Easy, 3 Not Easy
region: 1 NEW ENGLAND, 2 MIDDLE ATLANTIC, 3 E. NOR. CENTRAL, 4 W. NOR. CENTRAL, 5 SOUTH ATLANTIC, 6 E. SOU. CENTRAL, 7 W. SOU. CENTRAL, 8 MOUNTAIN, 9 PACIFIC
yearstudy <- c(2000, 2010) # years compared in Q1; year is stored as an integer
jobs <- gss %>% select(year, region, jobfind) %>%
  filter(year %in% yearstudy) %>% na.omit()
# Collapse "Very Easy" and "Somewhat Easy" into "easy"; "Not Easy" becomes "not easy"
jobs_re <- jobs %>%
  mutate(easy_or_not = ifelse(jobfind == "Not Easy", "not easy", "easy")) %>%
  select(year, region, easy_or_not)
First, we look at the general trend of the variable:
jobs_re %>%
group_by(year) %>%
summarise( easy = sum(easy_or_not =="easy"), not_easy = sum(easy_or_not == "not easy"))
## # A tibble: 2 x 3
## year easy not_easy
## <int> <int> <int>
## 1 2000 863 360
## 2 2010 372 432
year_table <- table(jobs_re$year,jobs_re$easy_or_not)
prop.table(year_table,1)
##
## easy not easy
## 2000 0.7056419 0.2943581
## 2010 0.4626866 0.5373134
From the table we can see that the share of respondents who feel it is not easy to find an equally good job rose from about 29% in 2000 to about 54% in 2010. Now let’s look at the change in different regions:
ggplot(data = jobs_re, aes(x = factor(year), fill = easy_or_not)) + # treat year as a category
  facet_grid(. ~ region) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Do you feel it is easy to find an equally good job?") + # add title for graph
  labs(x = "year", y = "proportion")
The bar chart shows that the proportion of respondents who feel it is not easy to find an equally good job increased in every region over this decade, more noticeably in some regions than in others.
With this rough exploration of the data, we can restate the first question as: is the difference between the proportions of respondents who feel it is not easy to find an equally good job in 2000 and 2010 statistically significant?
As for the second question, we want to know whether respondents’ attitudes about how easy it is to find an equally good job are independent of their region of residence, so we use a chi-square test of independence.
Since our interest lies in those who find it “not easy”, we tabulate the “not easy” proportion by region to get a general sense of the distribution:
Q2 <- gss %>% select(region, year, jobfind) %>% filter(year == "2012") %>% na.omit()
Q2 <- Q2 %>% mutate( easy_or_not = ifelse( jobfind == "Not Easy","not easy","easy")) %>% select(year,region,easy_or_not)
table_q2 <- table(Q2$region,Q2$easy_or_not)
table_q2 <- prop.table(table_q2,1)
table_q2
##
## easy not easy
## New England 0.6000000 0.4000000
## Middle Atlantic 0.5227273 0.4772727
## E. Nor. Central 0.4881890 0.5118110
## W. Nor. Central 0.5555556 0.4444444
## South Atlantic 0.5103448 0.4896552
## E. Sou. Central 0.4772727 0.5227273
## W. Sou. Central 0.5903614 0.4096386
## Mountain 0.4626866 0.5373134
## Pacific 0.5436893 0.4563107
From the table we see that the proportion of “not easy” varies somewhat across regions, from 0.40 in New England to about 0.54 in Mountain.
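For Q1, let p2000 and p2010 be the proportions of respondents who feel it is not easy to find an equally good job in 2000 and 2010.
H0: p2010 = p2000 (nothing is going on)
H1: p2010 > p2000
1) Independence: Respondents are randomly sampled, so they are considered independent of each other, both within and between the two years. 2) Sample size: There are more than 10 respondents in both the “easy” and “not easy” categories in 2000 and in 2010. The sample meets the conditions for inference, so we continue with our inference.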
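The sample-size condition can be checked directly from the counts in the summary table printed earlier (863 and 360 in 2000, 372 and 432 in 2010). The following is a minimal base R sketch for illustration, not part of the original analysis; the variable names are invented for this example. For the hypothesis test the usual check uses the pooled proportion under H0, so the code computes that as well.

```r
# Counts copied from the summary table (rows: 2000, 2010)
not_easy <- c(`2000` = 360, `2010` = 432)
easy     <- c(`2000` = 863, `2010` = 372)
n        <- not_easy + easy

# Confidence interval: observed successes and failures should each be >= 10
ci_ok <- all(not_easy >= 10, easy >= 10)

# Hypothesis test: expected successes and failures under the pooled proportion
p_pool <- sum(not_easy) / sum(n)
ht_ok  <- all(n * p_pool >= 10, n * (1 - p_pool) >= 10)

ci_ok
ht_ok
```

Both checks use only the four reported counts, so they can be rerun without loading the GSS data.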
We can calculate a confidence interval for the difference in proportions and check whether it contains 0. In addition, we run a Z test for the difference in proportions to see whether the two approaches agree.
inference(data = jobs_re, y = easy_or_not , x = as.factor(year), order = c("2010","2000"),statistic = "proportion", success = "not easy", type = "ci", method = "theoretical")
## Response variable: categorical (2 levels, success: not easy)
## Explanatory variable: categorical (2 levels)
## n_2010 = 804, p_hat_2010 = 0.5373
## n_2000 = 1223, p_hat_2000 = 0.2944
## 95% CI (2010 - 2000): (0.2001 , 0.2859)
inference(data = jobs_re, y = easy_or_not , x = as.factor(year), order = c("2010","2000"),statistic = "proportion", success = "not easy", type = "ht", null = 0, alternative = "greater", method = "theoretical")
## Response variable: categorical (2 levels, success: not easy)
## Explanatory variable: categorical (2 levels)
## n_2010 = 804, p_hat_2010 = 0.5373
## n_2000 = 1223, p_hat_2000 = 0.2944
## H0: p_2010 = p_2000
## HA: p_2010 > p_2000
## z = 10.9673
## p_value = < 0.0001
The 95% confidence interval for p2010 - p2000 is (0.2001, 0.2859). Clearly 0 is excluded, which means we should reject the null hypothesis. In the Z test, the p-value is below 0.0001 and therefore below alpha (0.05), so again we should reject the null hypothesis. The two results agree: the change in the proportion is statistically significant. The data provide convincing evidence that a larger share of people found it not easy to find an equally good job in 2010 than in 2000.
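The interval and the test statistic can also be worked out by hand from the four counts in the summary table. The sketch below is an illustration under two assumptions that are not stated in the report: that the confidence interval uses the unpooled standard error and that the Z test uses the pooled standard error. The variable names are invented for this example.

```r
# Counts from the summary table: "not easy" responses and sample sizes
x_2010 <- 432; n_2010 <- 804
x_2000 <- 360; n_2000 <- 1223

p_2010 <- x_2010 / n_2010
p_2000 <- x_2000 / n_2000
d_hat  <- p_2010 - p_2000   # observed difference in proportions

# 95% confidence interval with the unpooled standard error
se_ci <- sqrt(p_2010 * (1 - p_2010) / n_2010 + p_2000 * (1 - p_2000) / n_2000)
ci    <- d_hat + c(-1, 1) * qnorm(0.975) * se_ci

# One-sided Z test of H0: p_2010 = p_2000 with the pooled standard error
p_pool  <- (x_2010 + x_2000) / (n_2010 + n_2000)
se_ht   <- sqrt(p_pool * (1 - p_pool) * (1 / n_2010 + 1 / n_2000))
z       <- d_hat / se_ht
p_value <- pnorm(z, lower.tail = FALSE)

round(ci, 4)
round(z, 4)
```

Under these assumptions the results should come out close to the interval (0.2001, 0.2859) and z = 10.9673 printed by `inference()`.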
For Q2:
H0: The proportion of “not easy” is independent of region.
H1: The proportion of “not easy” varies by region.
Since “region” has more than two levels, we choose a chi-square test of independence to answer the question.
1) Independence: Respondents are considered independent of each other. 2) Sample size: each cell needs at least 5 cases. We count the “not easy” responses in each region:
Q2_2 <- gss %>% select(region, year, jobfind) %>% filter(year == "2012") %>% na.omit()
Q2_2 <- Q2_2 %>% mutate( easy_or_not = ifelse( jobfind == "Not Easy","not easy","easy")) %>% select(region,easy_or_not) %>% filter(easy_or_not == "not easy")
no.sum <- table(Q2_2)
no.sum
## easy_or_not
## region not easy
## New England 18
## Middle Atlantic 42
## E. Nor. Central 65
## W. Nor. Central 28
## South Atlantic 71
## E. Sou. Central 23
## W. Sou. Central 34
## Mountain 36
## Pacific 47
ifelse(all(no.sum >= 5), "Conditions met", "Conditions not met") # all() requires every region to pass
## [1] "Conditions met"
In every region, the number of “not easy” responses is well above five (the smallest count is 18). The sample meets the sample-size condition for a chi-square test, so we continue with our inference.
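Strictly, the sample-size condition for a chi-square test of independence applies to the expected count in every cell of the region-by-response table, not only to the observed “not easy” counts. The sketch below illustrates that check in base R. The “easy” counts are not printed in the report; they are reconstructed here, as an assumption, from the “not easy” counts and the row proportions in the 2012 table.

```r
regions <- c("New England", "Middle Atlantic", "E. Nor. Central",
             "W. Nor. Central", "South Atlantic", "E. Sou. Central",
             "W. Sou. Central", "Mountain", "Pacific")

# "not easy" counts are printed in the report; "easy" counts are reconstructed
# (an assumption) so that the row proportions match the 2012 proportion table
not_easy <- c(18, 42, 65, 28, 71, 23, 34, 36, 47)
easy     <- c(27, 46, 62, 35, 74, 21, 49, 31, 56)
tab <- cbind(easy = easy, `not easy` = not_easy)
rownames(tab) <- regions

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

round(expected, 1)
all(expected >= 5)
```

With the actual data, the same table can be built with `table(Q2$region, Q2$easy_or_not)` instead of the reconstructed counts.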
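The test compares the observed counts with the counts we would expect if the null hypothesis were true. The expected proportion is calculated as:
\(\text{expected proportion} = \text{total “not easy”} / \text{sample size}\)
Once this is done, the chi-squared statistic is computed as:
\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]
where \(O_i\) is the observed count in cell \(i\), \(E_i\) is the expected count in that cell under the null hypothesis, and \(k\) is the number of cells.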
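As a worked instance of this formula, the sketch below evaluates it by hand on the nine “not easy” counts. It assumes every region has the same expected count, which is what `chisq.test()` does when it is given a single vector of counts and no `p` argument. The variable names are invented for this example.

```r
# Observed "not easy" counts by region (2012), as printed in the report
observed <- c(18, 42, 65, 28, 71, 23, 34, 36, 47)

# With no probabilities supplied, each region gets the same expected count
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-squared statistic: sum over cells of (O - E)^2 / E
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(chi_sq, 2)
df
p_value
```

Under this equal-counts assumption the statistic should match the X-squared = 64.44 with 8 degrees of freedom that `chisq.test(no.sum)` reports.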
The chi-square test does not define confidence intervals, so these were not included in this analysis.
We can calculate the chi-square statistic, as well as the corresponding p-value, via:
no.sum <- c(18,42,65,28,71,23,34,36,47)
chisq.test(no.sum)
##
## Chi-squared test for given probabilities
##
## data: no.sum
## X-squared = 64.44, df = 8, p-value = 6.228e-11
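The large X-squared statistic with 8 degrees of freedom leads to a very low p-value. Since the p-value is below alpha (0.05), there is sufficient evidence to reject the null hypothesis of this test. It is worth being precise about what that null hypothesis is. Applied to a single vector of counts, chisq.test() runs a goodness-of-fit test (“Chi-squared test for given probabilities”) against equal counts in every region. The result therefore shows that the “not easy” responses are not spread evenly across the nine regions, which partly reflects the different numbers of respondents in each region. It does not by itself show that respondents’ attitudes about how easy it is to find an equally good job vary by region; that conclusion requires a test of independence on the full region-by-response table. In either case the result cannot be used to determine causality, because the GSS is an observational study and not an experiment with random assignment to treatment.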
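A test of independence between region and response needs both columns of the table. The sketch below shows one way to run it in base R. The “easy” counts are reconstructed, as an assumption, from the “not easy” counts and the row proportions in the 2012 table, so the output should be read as an illustration rather than as a result of the original analysis.

```r
# "not easy" counts are printed in the report; "easy" counts are reconstructed
# (an assumption) so that the row proportions match the 2012 proportion table
not_easy <- c(18, 42, 65, 28, 71, 23, 34, 36, 47)
easy     <- c(27, 46, 62, 35, 74, 21, 49, 31, 56)
tab <- cbind(easy = easy, `not easy` = not_easy)
rownames(tab) <- c("New England", "Middle Atlantic", "E. Nor. Central",
                   "W. Nor. Central", "South Atlantic", "E. Sou. Central",
                   "W. Sou. Central", "Mountain", "Pacific")

# On a two-way table, chisq.test() performs Pearson's test of independence
# with (rows - 1) * (columns - 1) degrees of freedom
indep <- chisq.test(tab)

indep$expected   # expected counts under independence
indep            # statistic, degrees of freedom, and p-value
```

With the GSS data loaded, `chisq.test(table(Q2$region, Q2$easy_or_not))` runs the same test on the actual 2012 responses.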