Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.
The National Data Program for the Social Sciences has been conducted since 1972 by NORC at the University of Chicago, with the support of the National Science Foundation.
For more information about the GSS, see the About GSS page.
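# packages assumed by the code below: dplyr (%>%, filter, select), ggplot2 (plots), statsr (inference)
library(dplyr); library(ggplot2); library(statsr)
load("gss.Rdata")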
dim(gss)
## [1] 57061 114
The data provided by Coursera have been cleaned: missing values were removed and variables were modified to make them easier to work with in R. The extract is based on the surveys conducted from 1972 to 2012 and consists of 57061 observations of 114 variables, each corresponding to a specific survey question. The survey was conducted through voluntary face-to-face interviews with adults (18+) living in households in the United States. According to Wikipedia, the interview lasts about 90 minutes, so some questions may have been skipped over such a long interview.
According to Wikipedia:
Respondents that become part of the GSS sample are from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary. However, because only about a few thousand respondents are interviewed in the main study, every respondent selected is very important to the results.
With thousands of observations drawn by random sampling, and a sample size below 10% of the US population, the results can be generalized to the population. However, because this is an observational survey with no random assignment, it can only establish correlations between variables, not causal relationships.
Although slavery was abolished in the US more than 150 years ago, most US residents say that its impact continues. Some people argue that the country is unlikely to treat different races equally. Also, Blacks, Hispanics and Asians are more likely to believe that being white helps a person's ability to get ahead. In this research question I am interested in whether race is associated with people's work status.
race
Race of respondent.
wrkstat
Labor force status.
jobfind
Could R find equally good job
class
Subjective class identification
People's views change over time. To capture opinions in the 21st century, the dataset is filtered to keep only responses from 2000 onward.
study <- gss %>%
  filter(year >= 2000) %>%
  select(race, wrkstat, jobfind, class, year, rank)
# keep only respondents who answered every selected question
study <- study[complete.cases(study), ]
study <- study %>%
  mutate(rank = as.character(rank))
str(study)
## 'data.frame': 2275 obs. of 6 variables:
## $ race : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 2 1 1 1 1 ...
## $ wrkstat: Factor w/ 8 levels "Working Fulltime",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ jobfind: Factor w/ 3 levels "Very Easy","Somewhat Easy",..: 3 1 1 3 2 2 1 2 2 3 ...
## $ class : Factor w/ 5 levels "Lower Class",..: 3 3 3 4 3 3 4 4 2 4 ...
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ rank : chr "5" "3" "5" "3" ...
table(study$race, useNA = "ifany")
##
## White Black Other
## 1743 331 201
table(study$wrkstat, useNA = "ifany")
##
## Working Fulltime Working Parttime Temp Not Working Unempl, Laid Off
## 1775 435 65 0
## Retired School Keeping House Other
## 0 0 0 0
table(study$jobfind, useNA = "ifany")
##
## Very Easy Somewhat Easy Not Easy
## 468 801 1006
table(study$class, useNA = "ifany")
##
## Lower Class Working Class Middle Class Upper Class No Class
## 107 1149 959 60 0
fj <- ggplot(data = study, aes(x = race))
fj <- fj+ geom_bar(aes(fill = jobfind), position = "dodge")
fj + theme(axis.text.x = element_text(angle = 60, hjust = 1))
Observations:
White respondents make up the largest share of the sample.
Overall, "Not Easy" is the most common answer: more respondents say it is not easy to find an equally good job than give either of the other two answers.
sclass <- ggplot(data = study, aes(x = class))
sclass <- sclass+ geom_bar(aes(fill = race), position = "dodge")
sclass + theme(axis.text.x = element_text(angle = 60, hjust = 1))
Observations:
According to the graph, most respondents identify themselves as working class or middle class.
Black respondents and respondents of other races make up a smaller proportion of the upper class.
Null hypothesis: Race and the difficulty of finding a job are independent among people in America. The difficulty of finding a job does not vary by race.
Alternative hypothesis: Race and the difficulty of finding a job are dependent among people in America. The difficulty of finding a job does vary by race.
chisq.test(study$race, study$jobfind)$expected
## study$jobfind
## study$race Very Easy Somewhat Easy Not Easy
## White 358.56000 613.68923 770.75077
## Black 68.09143 116.54110 146.36747
## Other 41.34857 70.76967 88.88176
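Since we are evaluating the relationship between two categorical variables, the method used here is the chi-square test of independence. The test evaluates the difference between the observed counts and the counts expected under independence. Every expected count in the table above is greater than 5, so the sample-size condition is met.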
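As a minimal sketch of where the expected counts come from: under independence, each expected count is the row total times the column total, divided by the sample size. The marginal totals below are copied from the frequency tables earlier in this section, and the variable names are illustrative only.

```r
# Marginal totals copied from the frequency tables above
race_totals    <- c(White = 1743, Black = 331, Other = 201)
jobfind_totals <- c("Very Easy" = 468, "Somewhat Easy" = 801, "Not Easy" = 1006)
n <- sum(race_totals)  # number of complete cases

# Expected count under independence: (row total * column total) / n
expected <- outer(race_totals, jobfind_totals) / n
round(expected, 2)

# Sample-size condition for the chi-square test: every expected count >= 5
all(expected >= 5)
```

The rounded matrix should agree with the expected counts that chisq.test() reports.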
chisq.test(study$race, study$jobfind)
##
## Pearson's Chi-squared test
##
## data: study$race and study$jobfind
## X-squared = 2.3823, df = 4, p-value = 0.6658
The chi-square statistic is 2.3823 with 4 degrees of freedom, and the p-value (0.6658) is greater than 0.05. Therefore, we fail to reject the null hypothesis: the data do not provide convincing evidence that race and the difficulty of finding a job are associated.
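The degrees of freedom and the p-value can be checked by hand. The sketch below takes the X-squared value from the chisq.test() output and uses base R's pchisq() for the upper-tail probability; the variable names are illustrative.

```r
# Value copied from the chisq.test() output above
x_squared <- 2.3823
n_rows <- 3  # race: White, Black, Other
n_cols <- 3  # jobfind: Very Easy, Somewhat Easy, Not Easy

# Degrees of freedom for a test of independence
df <- (n_rows - 1) * (n_cols - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
round(p_value, 4)
```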
Null hypothesis: Race and self-identified social class are independent among people in America. Self-identified class does not vary by race.
Alternative hypothesis: Race and self-identified social class are dependent among people in America. Self-identified class does vary by race.
chisq.test(study$race, study$class)$expected
## study$class
## study$race Lower Class Working Class Middle Class Upper Class
## White 81.978462 880.3108 734.74154 45.969231
## Black 15.567912 167.1732 139.52923 8.729670
## Other 9.453626 101.5160 84.72923 5.301099
The conditions are met as in the previous inference: every cell has an expected count of more than 5.
chisq.test(study$race, study$class)
##
## Pearson's Chi-squared test
##
## data: study$race and study$class
## X-squared = 82.149, df = 6, p-value = 1.285e-15
The chi-square statistic is 82.149 with 6 degrees of freedom, and the p-value (1.285e-15) is lower than 0.05. Therefore, we reject the null hypothesis and have convincing evidence that race and self-identified social class are associated.
In this analysis, I am interested in whether gender is associated with the decision to consider suicide when facing an incurable disease or a serious financial disaster for the family.
sex
Sex of R
suicide1
Suicide if incurable disease
suicide2
Suicide if bankrupt
study2 <- gss %>%
  filter(year >= 2000) %>%
  select(sex, suicide1, suicide2, year)
study2 <- study2[complete.cases(study2), ]
str(study2)
## 'data.frame': 9280 obs. of 4 variables:
## $ sex : Factor w/ 2 levels "Male","Female": 1 2 2 2 1 1 1 2 1 2 ...
## $ suicide1: Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 2 1 1 1 ...
## $ suicide2: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
table(study2$sex)
##
## Male Female
## 4125 5155
table(study2$suicide1)
##
## Yes No
## 5517 3763
table(study2$suicide2)
##
## Yes No
## 925 8355
table(study2$year)
##
## 2000 2002 2004 2006 2008 2010 2012
## 1749 880 868 1914 1266 1365 1238
sui1 <- ggplot(data = study2, aes(x = sex))
sui1 <- sui1 + geom_bar(aes(fill = suicide1))
sui1 + theme(axis.text.x = element_text(angle = 60, hjust = 1))
sui2 <- ggplot(data = study2, aes(x = sex))
sui2 <- sui2 + geom_bar(aes(fill = suicide2))
sui2 + theme(axis.text.x = element_text(angle = 60, hjust = 1))
Observations:
People are more likely to consider suicide when facing an incurable disease than when facing a serious financial situation.
Null hypothesis: The difference between genders in the proportion considering suicide when facing an incurable disease is 0.
Alternative hypothesis: The difference between genders in the proportion considering suicide when facing an incurable disease is not 0.
Since we are testing whether two proportions differ, a z-test is used in this study.
study2%>%
summarise(p_p = sum(suicide1 == "Yes")/n(),
Female = sum(sex == "Female"),
Male = sum(sex == "Male"),
Fs = Female*p_p,
Ff = Female*(1-p_p),
Ms = Male*p_p,
Mf = Male*(1-p_p),
SE = sqrt((p_p*(1-p_p))/Female + (p_p*(1-p_p))/Male))
## p_p Female Male Fs Ff Ms Mf SE
## 1 0.5945043 5155 4125 3064.67 2090.33 2452.33 1672.67 0.01025695
As shown in the table, every expected count of successes and failures is well above the minimum of 10. Therefore, the sampling distribution of the difference in sample proportions will be nearly normal, centered at the true difference in population proportions.
inference(y = suicide1, x= sex, data = study2, statistic = "proportion", type = "ht", null = 0, success = "Yes", alternative = "twosided", method = "theoretical")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels)
## n_Male = 4125, p_hat_Male = 0.6422
## n_Female = 5155, p_hat_Female = 0.5564
## H0: p_Male = p_Female
## HA: p_Male != p_Female
## z = 8.3679
## p_value = < 0.0001
inference(y = suicide1, x= sex, data = study2, statistic = "proportion", type = "ci", null = 0, success = "Yes", alternative = "twosided", method = "theoretical")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels)
## n_Male = 4125, p_hat_Male = 0.6422
## n_Female = 5155, p_hat_Female = 0.5564
## 95% CI (Male - Female): (0.0659 , 0.1058)
From the test result above, the z-score is 8.3679 and the p-value is smaller than 0.05. We therefore reject the null hypothesis that there is no association between sex and the intention to commit suicide when facing an incurable disease. We have convincing evidence that the proportion of males who consider suicide when seriously sick is greater than the proportion of females in the same situation. We are 95% confident that the difference between males and females in the proportion considering suicide when facing an incurable disease (male minus female) lies between 0.0659 and 0.1058.
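The z-score and the confidence interval can be approximately reproduced by hand. The sketch below uses the rounded proportions printed by inference() and the pooled proportion from the summary table, so the results differ slightly from the reported values because of rounding; the variable names are illustrative.

```r
# Rounded values copied from the inference() output and the summary table above
n_m <- 4125; p_m <- 0.6422   # males
n_f <- 5155; p_f <- 0.5564   # females
p_pool <- 0.5945043          # pooled proportion of "Yes"

# Hypothesis test: the standard error uses the pooled proportion
se_pool <- sqrt(p_pool * (1 - p_pool) / n_m + p_pool * (1 - p_pool) / n_f)
z <- (p_m - p_f) / se_pool
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Confidence interval: the standard error uses each group's own proportion
se_ci <- sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)
ci <- (p_m - p_f) + c(-1, 1) * qnorm(0.975) * se_ci

round(c(z = z, lower = ci[1], upper = ci[2]), 4)
```

The test uses the pooled proportion because the null hypothesis assumes both groups share one proportion, while the interval makes no such assumption.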
Given the recent controversy about whether schooling is worth it, I want to check whether people's income is independent of their degree level.
degree
RS highest degree
coninc
Total family income in constant dollars
study3 <- gss %>%
  filter(year >= 2000) %>%
  select(degree, coninc, year)
study3 <- study3[complete.cases(study3), ]
str(study3)
## 'data.frame': 16510 obs. of 3 variables:
## $ degree: Factor w/ 5 levels "Lt High School",..: 2 2 2 3 2 3 5 4 5 5 ...
## $ coninc: int 9300 19376 46502 38752 28418 21959 166419 103338 103338 166419 ...
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
table(study3$degree)
##
## Lt High School High School Junior College Bachelor Graduate
## 2189 8512 1304 2915 1590
ggplot(data = study3, aes(x = coninc)) + geom_histogram() + facet_wrap(~degree)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Observation:
The income distributions are right-skewed, more strongly so for groups with lower degree levels.
The largest group consists of respondents whose highest degree is high school, during 2000-2012 in America.
Null hypothesis: The population mean of inflation-adjusted total family income is the same for all degree levels.
Alternative hypothesis: The population mean of inflation-adjusted total family income is not the same for all degree levels.
Since we are comparing means across more than 2 groups, analysis of variance (ANOVA) with an F test is applied in this study.
The respondents are randomly sampled across America, so observations are independent within each group. The groups are independent of each other.
Within each group, the distribution should be nearly normal.
par(mfrow = c(2,3))
de_groups = c("Lt High School","High School", "Junior College", "Bachelor", "Graduate")
for (i in 1:5){
df = study3 %>% filter(degree == de_groups[i])
qqnorm(df$coninc, main = de_groups[i])
qqline(df$coninc)
}
The deviations from the line are substantial in the upper tails of all groups, and the previous histograms show that the distributions of all five groups are right-skewed. However, with a total sample size of 16510 and more than 1300 respondents in every group, the sampling distributions of the group means can be considered approximately normal.
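The appeal to large samples can be illustrated with a small simulation. This is only a sketch: it uses an exponential distribution as a stand-in for a right-skewed variable (an assumption, not GSS data), and skewness() is a helper defined here, not a base R function.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Helper defined for this sketch: sample skewness (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Stand-in for a right-skewed variable (NOT GSS data)
raw_draws <- rexp(10000, rate = 1)

# Means of 2000 simulated samples, each of size 1000
sample_means <- replicate(2000, mean(rexp(1000, rate = 1)))

skewness(raw_draws)     # strongly right-skewed
skewness(sample_means)  # close to 0, i.e. roughly symmetric
```

Even though individual draws are heavily skewed, the means of large samples are close to symmetric, which is what the ANOVA normality condition relies on here.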
ggplot(data = study3, aes(x = degree, y = coninc)) + geom_boxplot(aes(fill = degree))
The variability is consistent across groups, and income is higher for those with higher degree levels.
Based on the observations above, the conditions for ANOVA are not “fully” satisfied, so the result must be interpreted cautiously. Other factors also contribute to a high income level, so no causal relationship between the two variables can be concluded.
anova(lm(coninc ~ degree, data = study3))
## Analysis of Variance Table
##
## Response: coninc
## Df Sum Sq Mean Sq F value Pr(>F)
## degree 4 5.4142e+12 1.3535e+12 887.86 < 2.2e-16 ***
## Residuals 16505 2.5162e+13 1.5245e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
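The F value in the table above is the ratio of the two mean squares, each of which is a sum of squares divided by its degrees of freedom. A minimal sketch using the rounded table values (variable names are illustrative):

```r
# Values copied from the ANOVA table above
ss_group <- 5.4142e12; df_group <- 4
ss_resid <- 2.5162e13; df_resid <- 16505

# Mean squares are sums of squares divided by their degrees of freedom
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value
f_stat <- ms_group / ms_resid
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```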
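With an F statistic of 887.86 and a p-value of almost zero, we have strong evidence that at least one pair of group means differs. To test the multiple comparisons we use the Bonferroni correction: there are 10 pairwise comparisons, so the per-comparison significance level is 0.05/10 = 0.005. Note that pairwise.t.test with p.adj = "bonferroni" reports p-values that are already multiplied by the number of comparisons, so the adjusted p-values can be compared against 0.05 directly.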
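The Bonferroni arithmetic can be sketched as follows. Base R's p.adjust() with method = "bonferroni" multiplies each p-value by the number of comparisons (capped at 1), so comparing adjusted p-values with 0.05 is equivalent to comparing unadjusted p-values with 0.05/10. The p-values in p_raw are hypothetical and chosen only for illustration.

```r
k <- 5                                # number of degree groups
n_comparisons <- choose(k, 2)         # number of distinct pairs
alpha <- 0.05
alpha_star <- alpha / n_comparisons   # per-comparison significance level

# Hypothetical unadjusted p-values, for illustration only
p_raw <- c(0.001, 0.004, 0.02)
p_adj <- p.adjust(p_raw, method = "bonferroni", n = n_comparisons)

# The two decision rules give the same answers
p_raw < alpha_star
p_adj < alpha
```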
pairwise.t.test(study3$coninc, study3$degree, p.adj = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: study3$coninc and study3$degree
##
## Lt High School High School Junior College Bachelor
## High School <2e-16 - - -
## Junior College <2e-16 <2e-16 - -
## Bachelor <2e-16 <2e-16 <2e-16 -
## Graduate <2e-16 <2e-16 <2e-16 <2e-16
##
## P value adjustment method: bonferroni
As shown above, the Bonferroni-adjusted p-values for all pairwise comparisons are below 2e-16, far smaller than 0.05 (and also smaller than 0.005). For every pair, we reject the null hypothesis that the two group means are equal.
This project was a good chance to learn what is happening to American people and some details that people tend to neglect. It was also a good opportunity to practice R, from importing data to inference tests. Thanks to Coursera and anyone who reads this. :)
Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut / Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1. Persistent URL: http://doi.org/10.3886/ICPSR34802.v1.