library(ggplot2)
library(dplyr)
library(tidyr)
library(colorspace)
library(statsr)
load("gss.Rdata")
dim(gss)
## [1] 57061 114
The GSS data has 57,061 observations and 114 variables. However, we will extract only the variables we want to explore in this study. To simplify the analysis, we create a function called “prepdata” that selects the requested columns and drops every row with an NA in any of them.
prepdata <- function(...) {
study <- gss %>%
select(...)
return(study[complete.cases(study),])
}
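As a minimal sketch of what prepdata does, the following applies the same complete.cases() filter to a tiny stand-in data frame. The toy data frame and its values are hypothetical and are not taken from the GSS.

```r
# Toy stand-in for gss (hypothetical values, for illustration only)
toy <- data.frame(
  degree  = c("Bachelor", NA, "Graduate", "High School"),
  finrela = c("Average", "Above Average", NA, "Below Average"),
  year    = c(2010, 2012, 2012, 2008)
)

# complete.cases() is TRUE only for rows with no NA in the given columns
keep <- complete.cases(toy[, c("degree", "finrela")])
toy_clean <- toy[keep, ]

nrow(toy_clean)
```

Because rows are dropped when any selected column is missing, the number of records that survive depends on which columns are passed to the function.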
The General Social Survey (GSS) is a sociological survey created by the National Opinion Research Center (NORC) at the University of Chicago and funded by the National Science Foundation (NSF). The GSS collects data on demographic, behavioral, and attitudinal questions, plus topics of special interest. Its purpose is to build a reliable dataset for researching, monitoring, and explaining trends, changes, and constants in attitudes, behaviors, and attributes, as well as examining the structure, development, and functioning of society in general and developing cross-national models of human society.
About 5,000 Americans are invited to respond to the survey. All households across the country have an equal chance of being selected. The GSS then randomly selects an adult member of each chosen household to complete the interview. Selected people are asked for their opinions on a variety of topics.
Scope of Inference: Because a large, representative random sample was drawn, the results are generalizable to the adult population of the United States. However, respondents were not randomly assigned to any treatment, so the study is observational and can show only association, not causation.
People often ask whether education has any impact on income. It has long been said that the more education we have, the higher the salary we will receive, so many people invest a lot of effort and money in education to earn a degree. Is it true that education is strongly associated with higher income after graduation? We examine that question below.
The variables analyzed in this study are degree, finrela, and year.
To reflect the latest trend, we restrict the analysis to the years 2010 and later.
degree_income <- prepdata(degree, finrela, year) %>%
  # keep survey years 2010 and later (numeric comparison)
  filter(year >= 2010)
dim(degree_income)
## [1] 3952 3
Filtering reduces the dataset from 57,061 records to 3,952 complete records from 2010 onward.
ggplot(degree_income, aes(degree)) +
geom_bar(position = 'dodge', aes(fill = finrela)) +
labs(title = 'Having A Degree In Association With Income', fill = 'Income Level')
Observations:
To better illustrate the relationship between the two categorical variables, we use a mosaic plot:
plot(table(degree_income$finrela, degree_income$degree))
In the above mosaic plot, the reported income level tends to rise with the level of degree, especially for the college and graduate groups.
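The pattern read off a mosaic plot can also be checked numerically with conditional proportions. The following is a minimal sketch on a small table of made-up counts; tab and its values are hypothetical and are not GSS data. prop.table() with margin = 1 converts each row into shares that sum to 1.

```r
# Hypothetical counts (not GSS data): rows are degree, columns are income level
tab <- as.table(matrix(
  c(30, 50, 20,
    10, 40, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(degree = c("High School", "Graduate"),
                  income = c("Below", "Average", "Above"))
))

# margin = 1 divides each row by its row total,
# giving the income shares within each degree group
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```

For a table built as table(finrela, degree), where degree is the column variable, margin = 2 would condition on degree instead.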
The gss data was generated from a random sample, so we can assume that the records are independent.
chisq.test(degree_income$degree, degree_income$finrela)$expected
## degree_income$finrela
## degree_income$degree Far Below Average Below Average Average
## Lt High School 43.67004 162.26417 242.8968
## High School 150.98684 561.01974 839.8026
## Junior College 22.84160 84.87222 127.0471
## Bachelor 55.82642 207.43345 310.5116
## Graduate 32.67510 121.41043 181.7419
## degree_income$finrela
## degree_income$degree Above Average Far Above Average
## Lt High School 100.04150 15.127530
## High School 345.88816 52.302632
## Junior College 52.32667 7.912449
## Bachelor 127.88993 19.338563
## Graduate 74.85374 11.318826
From the above table, every cell has an expected count of at least 5 (the smallest is about 7.9). Thus, all conditions for performing a chi-squared test are met. We use a 5% significance level.
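For reference, each expected count under independence is the row total times the column total divided by the grand total. The sketch below reproduces the $expected component of chisq.test() by hand on a small table of made-up counts; counts and its values are hypothetical, not GSS data.

```r
# Hypothetical 2 x 3 table of counts (not GSS data)
counts <- matrix(c(20, 30, 50,
                   10, 40, 50),
                 nrow = 2, byrow = TRUE)

# Expected count under independence: row total * column total / grand total
expected_manual <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# chisq.test() reports the same matrix in its $expected component
expected_builtin <- chisq.test(counts)$expected

# Do the two agree, and does every cell reach the usual threshold of 5?
all.equal(expected_manual, expected_builtin, check.attributes = FALSE)
all(expected_builtin >= 5)
```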
chisq.test(degree_income$degree, degree_income$finrela)
##
## Pearson's Chi-squared test
##
## data: degree_income$degree and degree_income$finrela
## X-squared = 632.33, df = 16, p-value < 2.2e-16
The X-squared statistic is 632.33 on 16 degrees of freedom, and the p-value is far below the significance level. Thus, we have convincing evidence to reject the null hypothesis that education and income level are independent, in favor of the alternative hypothesis that they are dependent. Because the study is observational, this establishes only an association between the two variables, not a causal relationship.
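The degrees of freedom follow from the table dimensions: with 5 degree levels and 5 finrela levels, df = (5 - 1) × (5 - 1) = 16. As a rough check, the p-value can be recomputed from the reported statistic with pchisq(). The helper names n_degree and n_finrela are introduced here only for illustration.

```r
n_degree  <- 5  # degree levels shown in the expected-count table
n_finrela <- 5  # finrela levels shown in the expected-count table

# Degrees of freedom for a test of independence: (rows - 1) * (columns - 1)
df <- (n_degree - 1) * (n_finrela - 1)

# Upper-tail probability of the reported statistic under the null hypothesis
p_value <- pchisq(632.33, df = df, lower.tail = FALSE)

# Compare against the 5% significance level
p_value < 0.05
```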
Are immigrants really taking American jobs? This has been a hot topic recently, and President Donald Trump has introduced many policies to limit opportunities for foreign workers. We would like to know whether foreign workers are paid less than U.S. citizens or about the same.
The variables analyzed for this study are uscitzn and coninc.
citizen_income <- prepdata(uscitzn, coninc)
citizen_income$uscitzn <- plyr::revalue(citizen_income$uscitzn, c(
  'A U.S. Citizen Born In Puerto Rico, The U.S. Virgin Islands, Or The Northern Marianas Islands' = 'Island',
  'Born Outside Of The United States To Parents Who Were U.S Citizens At That Time (If Volunteered)' = 'Born Outside',
  'A U.S. Citizen' = 'Citizen',
  'Not A U.S. Citizen' = 'Not Citizen'
))
table(citizen_income$uscitzn)
##
## Citizen Not Citizen Island Born Outside
## 315 329 5 10
We focus only on the ‘Citizen’ and ‘Not Citizen’ groups, since the other two groups are too small to analyze.
citizen_income <- citizen_income%>%
filter(uscitzn == "Citizen" | uscitzn == "Not Citizen")
dim(citizen_income)
## [1] 644 2
ggplot(data=citizen_income,aes(x = coninc))+
geom_histogram(bins = 30) +
facet_wrap(~uscitzn)
Observations
The gss data was generated from a random sample, so we can assume that the records are independent.
# 2 graphs in 1 row
par(mfrow = c(1, 2))
citzn_groups <- c("Citizen", "Not Citizen")
for (grp in citzn_groups) {
  # subset to one group so each panel shows only that group's incomes
  grp_income <- citizen_income$coninc[citizen_income$uscitzn == grp]
  qqnorm(grp_income, main = grp)
  qqline(grp_income)
}
Both the Citizen and Not Citizen groups deviate noticeably from a normal distribution, especially in the upper quantiles. This mirrors the right-skewed distributions we observed in the histograms.
ggplot(citizen_income, aes(x = uscitzn, y = coninc)) +
geom_boxplot(aes(fill = uscitzn)) +
labs(title = 'Variability between two groups', fill = 'Citizenship status')
We see that the median and IQR of the ‘Citizen’ group are much higher than those of the ‘Not Citizen’ group.
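Not all conditions are fully met: the income distributions are right-skewed and the variability differs between the two groups. We must therefore interpret the results of the analysis with caution.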
anova(lm(coninc~uscitzn, data = citizen_income))
## Analysis of Variance Table
##
## Response: coninc
## Df Sum Sq Mean Sq F value Pr(>F)
## uscitzn 1 5.2530e+10 5.2530e+10 26.55 3.423e-07 ***
## Residuals 642 1.2702e+12 1.9785e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is far below the significance level, so we reject the null hypothesis that mean income is the same in both groups, in favor of the alternative hypothesis that mean income differs between U.S. citizens and non-citizens.
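With only two groups, a one-way ANOVA is equivalent to a pooled-variance two-sample t-test, with F equal to t squared. Because the income distributions are right-skewed, a rank-based test is one possible robustness check. The sketch below illustrates both on simulated data; the data frame sim, its group labels, and the log-normal parameters are assumptions made for illustration and are not GSS values.

```r
set.seed(1)

# Simulated right-skewed incomes for two hypothetical groups (not GSS data)
sim <- data.frame(
  group  = factor(rep(c("A", "B"), each = 200)),
  income = c(rlnorm(200, meanlog = 10.5, sdlog = 0.8),
             rlnorm(200, meanlog = 10.2, sdlog = 0.8))
)

# One-way ANOVA with two groups: extract the F statistic
f_stat <- anova(lm(income ~ group, data = sim))[["F value"]][1]

# Pooled-variance two-sample t-test on the same data
t_stat <- unname(t.test(income ~ group, data = sim, var.equal = TRUE)$statistic)

# With two groups, F equals t squared
all.equal(f_stat, t_stat^2)

# Rank-based alternative that does not assume normality
wilcox.test(income ~ group, data = sim)$p.value
```

If the rank-based test and the ANOVA lead to the same decision, the conclusion is less likely to be an artifact of the skewed distributions.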
The exploratory data analysis gave useful information about potential associations between the predictor variables and the chosen response variables. Where the exploratory plots alone did not show an association clearly, the inference analysis made it clear.
We acknowledge that the GSS is an observational survey, not an experiment, so confounding factors may affect the associations found in this analysis. Hence, the conclusions must be interpreted carefully.