library(ggplot2)
library(dplyr)
library(statsr)Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called gss. Delete this note when before you submit your work.
load("gss.Rdata")Searching from website, we could know that the data come from the General Social Surveys, interviews administered to NORC national samples using a standard questionnaire which means it used a normal distribution method to collect the data.
I’ve searched for the types of this sureys and found out there’re three ways to use. Firstly, Permanent questions that occur on each survey, which means they will collect the fair data even through the time goes by, if they kept changing the question likes asking if people thinks 3000 usd annual paid are good enough in 30 years ago and asking people now if 6000 usd annual paid are good enough now. They will get almost the same consequence but it’s depedns on the different cost value changes by days.
Secondly, rotating questions that appear on two out of every three surveys (1973, 1974, and 1976, or 1973, 1975, and 1976), which means if they chose a same person to be the sample, he or she wouldn’t choose anwers without thinking since he or she had already done the same questions. Different from permanent questions, those kinds of questions need to be dependent from condition, which like asking about health statues. Since, health statues don’t change good or bad which just depends on the time, they still change by the interviewer’s personal situatino. Thus, the questionnaire could reveal real situation and relationship between those problems.
Last but not least, a few occasional questions such as split ballot experiments that occur in a single survey. Split ballot technique is a procedure where a sample is divided into two halves and each receives a slightly different questionnaire. The split-ballot technique consists of giving different forms of the questionnaire to equivalent portions of the sample. It reduces the effect of position bias when using multiple-choice questions in a questionnaire.
How to how the causality? I think we should analzye if the questions are designed to avoid leading interviewers to a specficist way and whether the each variabilities are independent to each other. The Interviewer Instructions could tell us something.The quotas call for approximately equal numbers of men and women with the exact proportion in each segment determined by the 1970 Census tract data. For women, the additional requirement is imposed that there be the proper proportion of employed and unemployed women in the location. Again, these quotas are based on the 1970 Census tract data. For men, the added requirement is that there be the proper proportion of men over and under 35 in the location. Which means the consequence won’t be influenced by the number of women and men, it still show the real situation of public. If they found 80% of men, maybe, the consequence would show more men-related data and it showed that the data is independent.
Firstly, I would like to know if those data could provide convincing evidence that American people study more than 6 years in average and if men have a higher educatino year than women. So, the first thing I need to do is to filter the ‘educ’ variability which represents the highest year of school complete and to see the distribution of the years. And then, I???ll make a hypothesis test and confidence interval about 12 years in average for population to see the consequence. ( 95% confidence level )
-> The consequence show that male in the US have a little bit more years in education than women, on average.
Secondly, I would like to use ANOVA test to see if there???s a difference from average family income in each race. So, I need to filter ???race??? and ???coninc??? and stack the data frame to use ???avo??? function.
-> Black people have less family income than White people. There’s a difference between different races.
I’ll figure out the family income situation’s defference between black and white women who are under 30 ,are divorced and had more than 3 kids. And usd t-distributino to see if it porviding convincing evidence to prove there’s a difference.
-> There’s no difference between black and white women who are under 30 ,are divorced and had more than 3 kids in their family’s income. It means that family income isn’t the reason to force them suffer in those bad situation.
Finally, I would like to use Chi-square test to see is the sex be the reason makes an difference to human’s class in society.
-> Yes. There’s a differenc betweem male and female.
Filter the sex and educ, create a new dataset “ggs_q1” and see the distribution of male and female. We could see the mean is 12.75.
gss_q1 <- gss %>% select(sex,educ)
summary(gss_q1)## sex educ
## Male :25146 Min. : 0.00
## Female:31915 1st Qu.:12.00
## Median :12.00
## Mean :12.75
## 3rd Qu.:15.00
## Max. :20.00
## NA's :164
ggplot(data=gss_q1,aes(x=sex,y=educ))+ geom_boxplot()## Warning: Removed 164 rows containing non-finite values (stat_boxplot).
Filter the race and coninc, create a new dataset “ggs_q2”.
gss_q2 <- gss %>% select(race,coninc)
gss_q2 <- na.omit(gss_q2)
summary(gss_q2)## race coninc
## White:41824 Min. : 383
## Black: 6956 1st Qu.: 18445
## Other: 2452 Median : 35602
## Mean : 44503
## 3rd Qu.: 59542
## Max. :180386
ggplot(data=gss_q2,aes(x=race,y=coninc))+ geom_boxplot()gss_q2 %>% group_by(race) %>% summarise(m_income=mean(coninc))## # A tibble: 3 <U+00D7> 2
## race m_income
## <fctr> <dbl>
## 1 White 47006.74
## 2 Black 30185.02
## 3 Other 42415.43
White 47006.74
Black 30185.02
Other 42415.43
We can see Black people have lower family income.
Filter data to the situation I set.
gss_q3 <- gss %>% filter(sex=="Female",race!="Other",marital=="Divorced",age < 30, childs >3 )
ggplot(data=gss_q3,aes(x=race,y=coninc,group=race))+geom_boxplot()summary(gss_q3$coninc)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3619 9630 15740 19920 24880 68690
gss_q3_summary <- gss_q3 %>% group_by(race)%>% summarise(m_qe=mean(coninc),sd_qe=sd(coninc))
gss_q3_summary## # A tibble: 2 <U+00D7> 3
## race m_qe sd_qe
## <fctr> <dbl> <dbl>
## 1 White 21526.77 17102.012
## 2 Black 14690.75 9896.172
I’ll try to calculate the number of each class.
gss_q4 <- gss %>% select(sex,class)
gss_q4 <- na.omit(gss_q4)
nrow(gss_q4[which(gss_q4$sex == "Male" & gss_q4$class == "Lower Class"),])## [1] 1206
nrow(gss_q4[which(gss_q4$sex == "Male" & gss_q4$class == "Working Class"),])## [1] 11053
nrow(gss_q4[which(gss_q4$sex == "Male" & gss_q4$class == "Middle Class"),])## [1] 10661
nrow(gss_q4[which(gss_q4$sex == "Female" & gss_q4$class == "Upper Class"),])## [1] 927
nrow(gss_q4[which(gss_q4$sex == "Female" & gss_q4$class == "Lower Class"),])## [1] 1941
nrow(gss_q4[which(gss_q4$sex == "Female" & gss_q4$class == "Working Class"),])## [1] 13405
nrow(gss_q4[which(gss_q4$sex == "Female" & gss_q4$class == "Middle Class"),])## [1] 13628
nrow(gss_q4[which(gss_q4$sex == "Female" & gss_q4$class == "Upper Class"),])## [1] 927
Make a data frame to show the relationship between gender, class and counting number
gss_q4_s <- data.frame(gender = c("male","male","male","male","female","female","female","female"),
class = c("Lower Class","Working Class","Middle Class","Upper Class","Lower Class","Working Class","Middle Class","Upper Class"),
frequency = c(1206,11053,10661,927,1941,13405,13628,927))
gss_q4_s## gender class frequency
## 1 male Lower Class 1206
## 2 male Working Class 11053
## 3 male Middle Class 10661
## 4 male Upper Class 927
## 5 female Lower Class 1941
## 6 female Working Class 13405
## 7 female Middle Class 13628
## 8 female Upper Class 927
H0: u=12 H1: u<= 12 (u = population mean of the highest year of school complete) n = 25146 + 31915 = 57061 < 10% of american, so it’s met the cnodition
I need to use hyothesis test and confident interval to calculate the area of rejection and with theoretical metod about Z distribution.( 95% confidence level )
inference(y=educ,data=gss_q1,statistic = "mean", type = "ht", null = 12, alternative = "less" ,method="theoretical")## Single numerical variable
## n = 56897, y-bar = 12.7536, s = 3.1816
## H0: mu = 12
## HA: mu < 12
## t = 56.4974, df = 56896
## p_value = 1
it prints out that Single numerical variable n = 56897, y-bar = 12.7536, s = 3.1816 H0: mu = 12 HA: mu < 12 t = 56.4974, df = 56896 p_value = 1
which P-value > 0.05, we couldn’t reject the H0, so it means there has 95% of confidence says that American has average 12 years highest year of school completem.
inference(y=educ,x=sex,data = gss_q1, statistic = "mean" , type = "ci", conf_level = 0.95, method = "theoretical" )## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Male = 25078, y_bar_Male = 12.8953, s_Male = 3.3694
## n_Female = 31819, y_bar_Female = 12.6419, s_Female = 3.0208
## 95% CI (Male - Female): (0.2001 , 0.3067)
Confidence Interval shows that male have 0.2001 - 0.3067 more highest year complete education than women. Then, I’ll chek if this’s true:
Ho :uman - uwoman = 0 H1 :uman - uwoman =/= 0
inference(y=educ,x=sex, data=gss_q1,statistic = "mean", type = "ht", null = 0, alternative = "twosided" ,method="theoretical")## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Male = 25078, y_bar_Male = 12.8953, s_Male = 3.3694
## n_Female = 31819, y_bar_Female = 12.6419, s_Female = 3.0208
## H0: mu_Male = mu_Female
## HA: mu_Male != mu_Female
## t = 9.3201, df = 25077
## p_value = < 0.0001
Response variable: numerical Explanatory variable: categorical (2 levels) n_Male = 25078, y_bar_Male = 12.8953, s_Male = 3.3694 n_Female = 31819, y_bar_Female = 12.6419, s_Female = 3.0208 H0: mu_Male = mu_Female HA: mu_Male != mu_Female t = 9.3201, df = 25077 p_value = < 0.0001
P_value < 0.05, reject H0, there’s a difference between men and women and we all konw men have more years than women through last analysis.
ANOVA testing to see if there’s difference between White, Black and Other.
Ho : ??1 = ??2 = … = ??k H1 : At least one mean is different
Aov_t <- aov (coninc~race,data=gss_q2)
summary(Aov_t)## Df Sum Sq Mean Sq F value Pr(>F)
## race 2 1.699e+12 8.494e+11 675.1 <2e-16 ***
## Residuals 51229 6.446e+13 1.258e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F equal almost zero, reject Ho, and we can know the difference of average family income is Black people.
Use T distribution since the sample size is smaller than 30
White sample mean= 21526.77 sd= 17102.012 n=13
Black sample mean= 14690.75 sd= 9896.172 n=4
H0: u white - u black = 0 H1: u white - u black =/= 0
t = (21526.77-14690.75)/((((17102.012^2)/13)+((9896.172^2)/4))^(1/2))
pt( t , df=15,lower.tail = FALSE)## [1] 0.1672115
p-value = 0.1672115 > 0.05 , We couldn’t reject H0, there’s no difference between Black and White female in this situation.
Ho : fellow the same distrubition in population H1 : Didn’t fellow the same distribution in population
I want to know if make a cross table
xtabs ( frequency ~ gender + class, data=gss_q4_s) -> cross.table
cross.table## class
## gender Lower Class Middle Class Upper Class Working Class
## female 1941 13628 927 13405
## male 1206 10661 927 11053
chisq.test(cross.table)##
## Pearson's Chi-squared test
##
## data: cross.table
## X-squared = 79.379, df = 3, p-value < 2.2e-16
class
gender Lower Class Middle Class Upper Class Working Class female 1941 13628 927 13405 male 1206 10661 927 11053
Pearson's Chi-squared test
data: cross.table X-squared = 79.379, df = 3, p-value < 2.2e-16
p-value is too small, male and female didn’t fellow their pupolation distribution.