1. Introduction

1.1 The Data

Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.

The National Data Program for the Social Sciences has been conducted since 1972 by NORC at the University of Chicago, with the support of the National Science Foundation.

For more information about the GSS, see "About the GSS".

library(dplyr); library(ggplot2); library(statsr)  # data wrangling, plots, inference()
load("gss.Rdata")
dim(gss)
## [1] 57061   114

The data provided by Coursera have been cleaned, with missing values removed and variables modified to facilitate analysis in R. The extract is based on the surveys conducted from 1972 to 2012 and consists of 57061 observations of 114 variables, each corresponding to a specific survey question. The survey is conducted in person, face to face, with adults (18+) living in households in the United States, and participation is voluntary. According to Wikipedia, the interview lasts about 90 minutes, so some questions may have been skipped over such a long interview.

1.2 The scope of inference

According to Wikipedia:

Respondents that become part of the GSS sample are from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary. However, because only about a few thousand respondents are interviewed in the main study, every respondent selected is very important to the results.

With several thousand respondents, the sample allows us to generalize to the population. The survey is based on random sampling, and the sample size is less than 10% of the population of America. However, because this is an observational study, causal relationships between variables cannot be tested; only associations between variables can be identified.

2. Research Questions

2.1 Research question 1

Motivation:

Even though slavery was abolished in the US more than 150 years ago, most US residents claim that its impact continues. People argue that the country is unlikely to treat different races equally. Also, Blacks, Hispanics, and Asians are more likely to believe that being white helps a person’s ability to get ahead. In this research, I am interested in whether race affects people’s work status.

The variables analyzed in the study are:

race: Race of respondent

wrkstat: Labor force status

jobfind: Could R find equally good job

class: Subjective class identification

Filter and select data.

As time passes, people’s views change as well. To obtain opinions that better reflect the 21st century, observations from before 2000 are filtered out.

study <- gss %>%
  filter(year >= 2000) %>%
  select(race, wrkstat, jobfind, class, year, rank)
study <- study[complete.cases(study), ]
study <- study %>%
  mutate(rank = as.character(rank))

str(study)
## 'data.frame':    2275 obs. of  6 variables:
##  $ race   : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ wrkstat: Factor w/ 8 levels "Working Fulltime",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ jobfind: Factor w/ 3 levels "Very Easy","Somewhat Easy",..: 3 1 1 3 2 2 1 2 2 3 ...
##  $ class  : Factor w/ 5 levels "Lower Class",..: 3 3 3 4 3 3 4 4 2 4 ...
##  $ year   : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ rank   : chr  "5" "3" "5" "3" ...
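
To make the effect of the two filtering steps concrete, here is a minimal base-R sketch on a toy data frame. The values in `toy` are invented for illustration and are not GSS records, and `subset()` is used as a base-R stand-in for `filter()` and `select()`.

```r
# Toy data (hypothetical values, not GSS records)
toy <- data.frame(
  year    = c(1998, 2000, 2004, 2010),
  race    = factor(c("White", "Black", NA, "Other")),
  jobfind = factor(c("Very Easy", "Not Easy", "Not Easy", NA))
)

# Step 1: keep surveys from 2000 onward and the columns of interest
recent <- subset(toy, year >= 2000, select = c(race, jobfind, year))

# Step 2: drop every row that still has a missing value
kept <- recent[complete.cases(recent), ]

nrow(recent)  # rows from 2000 onward
nrow(kept)    # rows with no missing values
```

Only rows that are complete on every selected variable survive the second step, so adding a sparsely answered variable to `select()` can shrink the analysis sample considerably.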

Exploratory data analysis

table(study$race, useNA = "ifany")
## 
## White Black Other 
##  1743   331   201
table(study$wrkstat, useNA = "ifany")
## 
## Working Fulltime Working Parttime Temp Not Working Unempl, Laid Off 
##             1775              435               65                0 
##          Retired           School    Keeping House            Other 
##                0                0                0                0
In the dataset, all respondents in the study are employed: working full time, working part time, or temporarily not working.
table(study$jobfind, useNA = "ifany")
## 
##     Very Easy Somewhat Easy      Not Easy 
##           468           801          1006
table(study$class, useNA = "ifany")
## 
##   Lower Class Working Class  Middle Class   Upper Class      No Class 
##           107          1149           959            60             0
fj <- ggplot(data = study, aes(x = race))
fj <- fj+ geom_bar(aes(fill = jobfind), position = "dodge")
fj + theme(axis.text.x = element_text(angle = 60, hjust = 1))

Observations:

  1. White respondents make up the largest part of the sample.

  2. Respondents who feel it is not easy to find a job outnumber each of the other two groups, those who feel it is very easy and those who feel it is somewhat easy.

sclass <- ggplot(data = study, aes(x = class))
sclass <- sclass+ geom_bar(aes(fill = race), position = "dodge")
sclass + theme(axis.text.x = element_text(angle = 60, hjust = 1))

Observations:

  1. From the graph, most people classify themselves as working class or middle class.

  2. Black respondents and respondents of other races make up a lower proportion of the upper class.

Inference

State Hypothesis

Null Hypothesis: Race and the difficulty of finding a job are independent in America. The difficulty of finding a job does not vary by race.

Alternative hypothesis: Race and the difficulty of finding a job are dependent in America. The difficulty of finding a job does vary by race.

Check Conditions

  1. The respondents are randomly sampled across America.
  2. Sampling is without replacement and n < 10% of the population. The analysis sample contains 2275 observations (see the str(study) output above), which is far less than 10% of the population of America from 2000 to 2012.
  3. Each case contributes to only one cell. Because the GSS observations are independent, we consider this requirement met as well.
  4. Each cell must have at least 5 expected cases, which is clearly met, as the table below shows.
chisq.test(study$race, study$jobfind)$expected
##           study$jobfind
## study$race Very Easy Somewhat Easy  Not Easy
##      White 358.56000     613.68923 770.75077
##      Black  68.09143     116.54110 146.36747
##      Other  41.34857      70.76967  88.88176
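
As a cross-check, the expected counts can be rebuilt from the marginal totals printed in the exploratory tables above, since under independence each expected count is (row total × column total) / n. This is a sketch that retypes those totals by hand rather than reading them from `study`.

```r
# Marginal totals copied from the table() output above
race_totals    <- c(White = 1743, Black = 331, Other = 201)
jobfind_totals <- c("Very Easy" = 468, "Somewhat Easy" = 801, "Not Easy" = 1006)
n <- sum(race_totals)

# Expected count under independence: row total * column total / n
expected <- outer(race_totals, jobfind_totals) / n
round(expected, 2)

# Smallest expected count, to compare with the "at least 5" condition
min(expected)
```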

Method to check the inference

Since we are evaluating the relationship between two categorical variables, the method used here is the chi-square test of independence. The test evaluates the difference between the observed counts and the expected counts.

chisq.test(study$race, study$jobfind)
## 
##  Pearson's Chi-squared test
## 
## data:  study$race and study$jobfind
## X-squared = 2.3823, df = 4, p-value = 0.6658

Inference results

The chi-square statistic is 2.3823, the degrees of freedom are 4, and the p-value (0.6658) is greater than 0.05. Therefore, we cannot reject the null hypothesis: the data do not provide convincing evidence that race and the difficulty of finding a job are associated.

At the same time, we check whether race and social class are associated.

State Hypothesis

Null Hypothesis: Race and self-identified social class are independent in America. Self-identified social class does not vary by race.

Alternative hypothesis: Race and self-identified social class are dependent in America. Self-identified social class does vary by race.

chisq.test(study$race, study$class)$expected
##           study$class
## study$race Lower Class Working Class Middle Class Upper Class
##      White   81.978462      880.3108    734.74154   45.969231
##      Black   15.567912      167.1732    139.52923    8.729670
##      Other    9.453626      101.5160     84.72923    5.301099

The conditions are met as in the previous inference, and every cell has an expected count of more than 5.

chisq.test(study$race, study$class)
## 
##  Pearson's Chi-squared test
## 
## data:  study$race and study$class
## X-squared = 82.149, df = 6, p-value = 1.285e-15

Inference results

The chi-square statistic is 82.149, the degrees of freedom are 6, and the p-value (1.285e-15) is lower than 0.05. Therefore, we have convincing evidence that race and self-identified social class are associated.
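
The degrees of freedom and p-values in the two test outputs above can be reproduced from the printed statistics alone. For a table with r rows and c columns, df = (r - 1)(c - 1), and the p-value is the upper tail of the chi-square distribution. The sketch below retypes the statistics from the output; `chisq_p` is a helper written here for illustration. The class table has four columns because the empty "No Class" level does not appear in the expected counts.

```r
# Upper-tail p-value for a chi-square statistic from an r x c table
chisq_p <- function(stat, n_rows, n_cols) {
  df <- (n_rows - 1) * (n_cols - 1)
  c(df = df, p_value = pchisq(stat, df = df, lower.tail = FALSE))
}

# race (3 levels) x jobfind (3 levels), X-squared = 2.3823
res_jobfind <- chisq_p(2.3823, 3, 3)

# race (3 levels) x class (4 non-empty levels), X-squared = 82.149
res_class <- chisq_p(82.149, 3, 4)

res_jobfind
res_class
```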

2.2 Research question 2

Motivation:

In this analysis, I am interested in whether gender is associated with the choice of suicide when a person faces an incurable disease or a serious financial disaster for the family.

The variables analyzed in the study are:

sex: Sex of R

suicide1: Suicide if incurable disease

suicide2: Suicide if bankrupt

study2 <- gss%>% 
  filter(year >= 2000) %>%
  select(sex, suicide1, suicide2, year)
study2 <- study2[complete.cases(study2),]
str(study2)
## 'data.frame':    9280 obs. of  4 variables:
##  $ sex     : Factor w/ 2 levels "Male","Female": 1 2 2 2 1 1 1 2 1 2 ...
##  $ suicide1: Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 2 1 1 1 ...
##  $ suicide2: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ year    : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...

Exploratory data analysis

table(study2$sex)
## 
##   Male Female 
##   4125   5155
table(study2$suicide1)
## 
##  Yes   No 
## 5517 3763
table(study2$suicide2)
## 
##  Yes   No 
##  925 8355
table(study2$year)
## 
## 2000 2002 2004 2006 2008 2010 2012 
## 1749  880  868 1914 1266 1365 1238
sui1 <- ggplot(data = study2, aes(x = sex)) 
sui1 <- sui1 + geom_bar(aes(fill = suicide1))
sui1 + theme(axis.text.x = element_text(angle = 60, hjust = 1))

sui2 <- ggplot(data = study2, aes(x = sex)) 
sui2 <- sui2 + geom_bar(aes(fill = suicide2))
sui2 + theme(axis.text.x = element_text(angle = 60, hjust = 1))

Observations:

  1. People are more likely to answer "Yes" to suicide when facing an incurable disease than when facing a serious financial situation.

  2. When facing an incurable disease, males are more likely than females to answer "Yes" to suicide.

Inference

State Hypothesis

Null Hypothesis: The difference between males and females in the proportion with suicide intention when facing an incurable disease is 0.

Alternative hypothesis: The difference between males and females in the proportion with suicide intention when facing an incurable disease is not 0.

Method to check the inference

Since we are testing whether two proportions differ, a z-test is used in this study.

Check Conditions

  1. The respondents are randomly sampled across America, and observations are independent within each group in the GSS survey.
  2. Between groups: because respondents are randomly selected, the male and female groups are independent of each other.
  3. Sampling is without replacement and n < 10% of the population. There are 9280 observations in the sample, which is less than 10% of the population of America from 2000 to 2012.
  4. Success-failure condition: first, calculate the pooled proportion.
study2 %>%
  summarise(p_p = sum(suicide1 == "Yes")/n(),
             Female = sum(sex == "Female"),
             Male = sum(sex == "Male"),
             Fs = Female*p_p,
             Ff = Female*(1-p_p),
             Ms = Male*p_p,
             Mf = Male*(1-p_p),
             SE = sqrt((p_p*(1-p_p))/Female + (p_p*(1-p_p))/Male))
##         p_p Female Male      Fs      Ff      Ms      Mf         SE
## 1 0.5945043   5155 4125 3064.67 2090.33 2452.33 1672.67 0.01025695

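As shown in the table, the success-failure condition (at least 10 expected successes and 10 expected failures in each group) is met. Therefore, the sampling distribution of the difference in sample proportions will be nearly normal, centered at the true difference in population proportions.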

Two-sided z-test for the difference between two independent sample proportions

inference(y = suicide1, x= sex, data = study2, statistic = "proportion", type = "ht", null = 0, success = "Yes", alternative = "twosided", method = "theoretical")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_Male = 4125, p_hat_Male = 0.6422
## n_Female = 5155, p_hat_Female = 0.5564
## H0: p_Male =  p_Female
## HA: p_Male != p_Female
## z = 8.3679
## p_value = < 0.0001
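
The z statistic reported by `inference()` can be approximated by hand from the group sizes, the printed sample proportions, and the pooled proportion. This sketch uses the rounded proportions from the output, so it agrees with the reported value only to about two decimal places.

```r
# Values copied from the output above
n_male   <- 4125; p_male   <- 0.6422
n_female <- 5155; p_female <- 0.5564
p_pool   <- 5517 / 9280   # "Yes" answers to suicide1 over all respondents

# Standard error under H0 uses the pooled proportion
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_male + 1 / n_female))

# Test statistic and two-sided p-value
z <- (p_male - p_female) / se_pool
p_value <- 2 * pnorm(-abs(z))

z
p_value
```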

inference(y = suicide1, x= sex, data = study2, statistic = "proportion", type = "ci", null = 0, success = "Yes", alternative = "twosided", method = "theoretical")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_Male = 4125, p_hat_Male = 0.6422
## n_Female = 5155, p_hat_Female = 0.5564
## 95% CI (Male - Female): (0.0659 , 0.1058)
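
The confidence interval uses the unpooled standard error, since no null value is assumed when estimating the difference. A hand calculation from the rounded proportions, sketched below, lands within rounding error of the interval reported above.

```r
# Values copied from the output above
n_male   <- 4125; p_male   <- 0.6422
n_female <- 5155; p_female <- 0.5564

# Unpooled standard error of the difference in proportions
se_diff <- sqrt(p_male * (1 - p_male) / n_male +
                p_female * (1 - p_female) / n_female)

# 95% interval: point estimate +/- critical value * SE
z_star <- qnorm(0.975)
ci <- (p_male - p_female) + c(-1, 1) * z_star * se_diff
round(ci, 4)
```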

Inference result

From the test result above, the z-score is 8.3679 and the p-value is smaller than 0.05. We therefore reject the null hypothesis that males and females have the same proportion with suicide intention when facing an incurable disease. We have convincing evidence that the proportion of males who consider suicide when seriously ill is greater than that of females in the same situation. We are 95% confident that the difference (male minus female) in the proportion considering suicide when facing an incurable disease is between 0.0659 and 0.1058.

2.3 Research question 3

Motivation

Given the recent controversy over whether school education is worthwhile, I want to check whether people’s income is independent of their degree level.

The variables analyzed in the study are:

degree: RS highest degree

coninc: Total family income in constant dollars

study3 <- gss%>%
  filter(year >= 2000) %>%
  select(degree, coninc, year) 
study3 <- study3[complete.cases(study3),]
str(study3)
## 'data.frame':    16510 obs. of  3 variables:
##  $ degree: Factor w/ 5 levels "Lt High School",..: 2 2 2 3 2 3 5 4 5 5 ...
##  $ coninc: int  9300 19376 46502 38752 28418 21959 166419 103338 103338 166419 ...
##  $ year  : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...

Exploratory data analysis

table(study3$degree)
## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##           2189           8512           1304           2915           1590
ggplot(data = study3, aes(x = coninc)) + geom_histogram() + facet_wrap(~degree)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Observation:

  1. For groups with lower degree levels, the income distributions are more right-skewed.

  2. The largest group consists of respondents whose highest degree is high school, during 2000-2012 in America.

Inference

State Hypothesis

Null Hypothesis: The population mean of total family income, corrected for inflation, is the same for all groups with different degree levels.

Alternative hypothesis: The population mean of total family income, corrected for inflation, is not the same for all groups; at least one pair of degree levels differs.

Method to check the inference

Since we are comparing the means of more than 2 groups, analysis of variance (ANOVA) with an F-test is applied in this study.

Check Conditions

  1. The respondents are randomly sampled across America, and observations are independent within each group. The groups are independent of each other.

  2. Within each group, the distribution should be nearly normal.

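par(mfrow = c(2, 3))  # 2 x 3 grid of panels for the five Q-Q plots
de_groups <- c("Lt High School", "High School", "Junior College", "Bachelor", "Graduate")

# One normal Q-Q plot of income per degree level
for (i in seq_along(de_groups)) {
  df <- study3 %>% filter(degree == de_groups[i])
  qqnorm(df$coninc, main = de_groups[i])
  qqline(df$coninc)
}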

The deviations from the line are substantial in the upper tails of all groups, and the earlier histograms show that the distributions of the five groups are right-skewed. However, with a total sample size of 16510 and more than 1300 observations in every group, the sampling distributions of the group means can be considered approximately normal.

  3. Across groups, the variability should be roughly equal.
ggplot(data = study3, aes(x = degree, y = coninc)) + geom_boxplot(aes(fill = degree))

The variability appears roughly consistent across groups, and income is higher for those with higher degree levels.

Based on the observations above, the conditions for ANOVA are not “fully” satisfied, so the results must be interpreted cautiously. Other factors also contribute to a high income level, and no causal relationship between the two variables can be inferred from this observational data.
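
The large-sample argument behind the normality condition can be illustrated with a small simulation. The sketch below draws from a log-normal distribution as a stand-in for right-skewed income (an assumption for illustration, not a model fitted to `coninc`) and compares the skewness of raw draws with the skewness of sample means. The `skewness` helper is defined here and is not part of base R.

```r
set.seed(42)

# Simple moment-based skewness (helper defined here for illustration)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Right-skewed "incomes": hypothetical log-normal draws
raw <- rlnorm(10000, meanlog = 10, sdlog = 1)

# Means of 2000 samples of size 1000 each
sample_means <- replicate(2000, mean(rlnorm(1000, meanlog = 10, sdlog = 1)))

skewness(raw)           # strongly positive
skewness(sample_means)  # much closer to 0
```

The sample size of 1000 is smaller than the smallest degree group in `study3`, so the simulation is, if anything, conservative on that point. It says nothing about the equal-variance condition.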

ANOVA

anova(lm(coninc ~ degree, data = study3))
## Analysis of Variance Table
## 
## Response: coninc
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## degree        4 5.4142e+12 1.3535e+12  887.86 < 2.2e-16 ***
## Residuals 16505 2.5162e+13 1.5245e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
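
The F statistic in the table is the ratio of the between-group mean square to the residual mean square. Recomputing it from the rounded sums of squares printed above gives the same value up to rounding. This is a sketch using retyped numbers, not the fitted model object.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table
ss_group <- 5.4142e12; df_group <- 4
ss_resid <- 2.5162e13; df_resid <- 16505

# Mean squares and F ratio
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid
f_stat   <- ms_group / ms_resid

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)

f_stat
p_value
```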

Inference result

With an F-statistic of 887.86 and a p-value of almost zero, we have strong evidence that at least one pair of group means differs. To handle the multiple comparisons we use the Bonferroni correction: there are 10 pairwise comparisons, so the significance level is 0.05/10 = 0.005.

pairwise.t.test(study3$coninc, study3$degree, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  study3$coninc and study3$degree 
## 
##                Lt High School High School Junior College Bachelor
## High School    <2e-16         -           -              -       
## Junior College <2e-16         <2e-16      -              -       
## Bachelor       <2e-16         <2e-16      <2e-16         -       
## Graduate       <2e-16         <2e-16      <2e-16         <2e-16  
## 
## P value adjustment method: bonferroni

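As shown above, the p-values for all pairwise comparisons are below 2e-16, far smaller than 0.005. (Because these p-values are already Bonferroni-adjusted, they could also be compared with 0.05 directly.) For every pair of degree levels we therefore reject the null hypothesis that the two group means are equal: mean income differs between each pair of groups.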
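
The Bonferroni arithmetic can be checked directly. With five degree groups there are choose(5, 2) pairwise comparisons; the correction either divides the significance level by that number or, equivalently, multiplies each raw p-value by it (capped at 1), which is what `p.adjust()` does. The raw p-values below are made up for illustration.

```r
n_groups      <- 5
n_comparisons <- choose(n_groups, 2)   # number of distinct pairs
alpha_star    <- 0.05 / n_comparisons  # corrected significance level

# Hypothetical raw p-values, adjusted for all comparisons
raw_p <- c(0.001, 0.004, 0.02)
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)

n_comparisons
alpha_star
adj_p

# Both decision rules lead to the same rejections
identical(raw_p < alpha_star, adj_p < 0.05)
```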

Conclusion

This project was a good chance to learn what is happening to American people and to notice some details that people tend to neglect. It was also a good opportunity to practice R, from importing data to running inference tests. Thanks to Coursera and to anyone who reads this. :)

Reference

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut / Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1. Persistent URL: http://doi.org/10.3886/ICPSR34802.v1.