library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.1
library(statsr)
## Warning: package 'BayesFactor' was built under R version 4.1.2
## Warning: package 'coda' was built under R version 4.1.2
## Warning: package 'Matrix' was built under R version 4.1.1
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.1.1
load("gss.Rdata")
The General Social Survey (GSS) is a data set that contains information on the general trends in opinions, attitudes and behaviors of adults in the United States of America since 1972. The data was collected through personal interviews in order to study the growing complexity of American society.
To answer the above questions, we first select the variables of interest. For the first question, we select the case ID (caseid), survey year (year), gender (sex), total family income in constant dollars (coninc), and work situation (wrkslf: whether the respondent is self-employed or works for someone else). We then keep only the 2010 responses and drop rows with a missing coninc or wrkslf.
# Select data for the first question
gss2010pro <- gss %>%
  select(caseid, year, sex, coninc, wrkslf) %>%
  filter(year == 2010) %>%
  drop_na(coninc, wrkslf)
We create a box plot to see whether family income differs between the two work situations in the sample.
# Box plot of total family income (constant dollars) by work situation
ggplot(gss2010pro, aes(x = wrkslf, y = coninc, fill = wrkslf)) +
  geom_boxplot() +
  labs(title = "Boxplot of family income vs working situation in 2010",
       y = "Total family income", x = "Work situation")
The two box plots look almost identical.
A bar graph of total family income by work situation.
ggplot(gss2010pro, aes(x = wrkslf, y = coninc / 1000, fill = wrkslf)) +
  geom_col() +
  labs(title = "Barplot of family income vs working situation",
       y = "Total family income (thousands)", x = "Work situation")
Because geom_col() stacks the incomes within each group, bar height mainly reflects group size. In this sample, far fewer respondents are self-employed (204) than work for someone else (1,526).
Create a histogram to see the distribution of family income for those working for someone else versus the self-employed.
# Histogram of family income, faceted by work situation
ggplot(gss2010pro, aes(x = coninc, fill = wrkslf)) +
  geom_histogram(binwidth = 25000, col = I("black")) +
  facet_wrap(~wrkslf) +
  labs(title = "Histogram of family income vs working situation",
       x = "Total family income", y = "Count")
The two histograms have similar shapes and differ mainly in the height of their peaks.
To answer the second question, we select the variables that have the data we require.
# The variables and filter for the data required
gss2010 <- gss %>%
  select(caseid, year, sex, coninc) %>%
  filter(year == 2010) %>%
  drop_na(coninc)
Sample statistics for the data above:
gss2010 %>%
  group_by(sex) %>%
  summarise(count = n(), mean = mean(coninc), s = sd(coninc))
## # A tibble: 2 x 4
##   sex    count   mean      s
##   <fct>  <int>  <dbl>  <dbl>
## 1 Male     809 50231. 41175.
## 2 Female   996 42015. 37766.
A box plot to visualize how total family income differs by gender.
# Box plot of total family income (constant dollars) by gender
ggplot(gss2010, aes(x = sex, y = coninc, fill = sex)) +
  geom_boxplot() +
  labs(title = "Boxplot of family income vs Gender",
       y = "Total family income", x = "Gender")
In the box plots, the median family income of male respondents appears higher than that of female respondents.
A histogram to see the distribution of total family income by gender.
# Histogram of family income, faceted by gender
ggplot(gss2010, aes(x = coninc, fill = sex)) +
  geom_histogram(binwidth = 25000, col = I("black")) +
  facet_wrap(~sex) +
  labs(title = "Histogram of family income vs Gender",
       x = "Total family income", y = "Count")
The distributions for male and female respondents are both right-skewed and almost identical apart from the heights of the first few bins.
To answer the third question, we use a chi-square test of independence to check whether a person's work situation is associated with gender.
table(gss2010pro$wrkslf, gss2010pro$sex)
##
##                 Male Female
##   Self-Employed  125     79
##   Someone Else   659    867
The table above cross-tabulates work situation by gender.
First, we find the 95% confidence interval for the difference in mean family income between respondents who work for someone else and those who are self-employed.
inference(y = coninc, x = wrkslf, data = gss2010pro, statistic = "mean",
          type = "ci", method = "theoretical",
          order = c("Someone Else", "Self-Employed"))
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Someone Else = 1526, y_bar_Someone Else = 46168, s_Someone Else = 38562.5589
## n_Self-Employed = 204, y_bar_Self-Employed = 52326.299, s_Self-Employed = 47107.8911
## 95% CI (Someone Else - Self-Employed): (-12946.4813 , 629.8833)
We are 95% confident that the difference in mean family income (Someone Else minus Self-Employed) lies between -12,946.48 and 629.88. Because the interval contains 0, a difference of zero between the two groups is plausible.
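As a rough check, the interval above can be reproduced from the summary statistics printed in the output. This sketch assumes the conservative degrees of freedom min(n1, n2) - 1 = 203, which is consistent with the df reported by the hypothesis test in this section; the variable names are illustrative and not part of statsr. The printed mean for the Someone Else group is rounded, so the result should agree with the reported interval only approximately.

```r
# Summary statistics copied from the inference() output above
n_se <- 1526; ybar_se <- 46168;     s_se <- 38562.5589   # Someone Else
n_sf <- 204;  ybar_sf <- 52326.299; s_sf <- 47107.8911   # Self-Employed

# Point estimate and standard error of the difference in means
diff_means <- ybar_se - ybar_sf
se_diff <- sqrt(s_se^2 / n_se + s_sf^2 / n_sf)

# Assumption: conservative degrees of freedom, min(n1, n2) - 1
df <- min(n_se, n_sf) - 1
t_star <- qt(0.975, df)

# 95% confidence interval: estimate plus or minus margin of error
ci <- diff_means + c(-1, 1) * t_star * se_diff
print(round(ci, 1))
```

The margin of error is dominated by the self-employed group, whose smaller sample size contributes most of the standard error.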
Now we conduct a hypothesis test of whether mean family income differs between the self-employed and those working for someone else.
inference(y = coninc, x = wrkslf, data = gss2010pro, statistic = "mean",
          type = "ht", null = 0, alternative = "twosided",
          method = "theoretical",
          order = c("Someone Else", "Self-Employed"))
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Someone Else = 1526, y_bar_Someone Else = 46168, s_Someone Else = 38562.5589
## n_Self-Employed = 204, y_bar_Self-Employed = 52326.299, s_Self-Employed = 47107.8911
## H0: mu_Someone Else = mu_Self-Employed
## HA: mu_Someone Else != mu_Self-Employed
## t = -1.7888, df = 203
## p_value = 0.0751
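The p-value is 0.0751, which is greater than 0.05, so we fail to reject the null hypothesis at the 5% significance level. The data do not provide sufficient evidence of a difference in mean family income between those who work for someone else and those who are self-employed.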
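The test statistic and p-value can also be recomputed by hand from the printed group summaries. The sketch below is only a cross-check under the same assumption of conservative degrees of freedom (df = 203, as shown in the output); the variable names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_se <- 1526; ybar_se <- 46168;     s_se <- 38562.5589   # Someone Else
n_sf <- 204;  ybar_sf <- 52326.299; s_sf <- 47107.8911   # Self-Employed

# Standard error of the difference in means (unpooled variances)
se_diff <- sqrt(s_se^2 / n_se + s_sf^2 / n_sf)

# t statistic under H0: mu_Someone Else - mu_Self-Employed = 0
t_stat <- (ybar_se - ybar_sf - 0) / se_diff

# Assumption: conservative degrees of freedom, min(n1, n2) - 1
df <- min(n_se, n_sf) - 1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)
print(round(c(t = t_stat, p = p_value), 4))
```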
For the second question, we first construct a 95% confidence interval for the difference in mean family income between male and female respondents.
inference(y = coninc, x = sex, data = gss2010pro, statistic = "mean",
          type = "ci", method = "theoretical",
          order = c("Male", "Female"))
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Male = 784, y_bar_Male = 50903.3431, s_Male = 41356.9643
## n_Female = 946, y_bar_Female = 43571.5772, s_Female = 37975.3272
## 95% CI (Male - Female): (3552.7641 , 11110.7678)
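We are 95% confident that the mean family income of males exceeds that of females by between 3,552.76 and 11,110.77. The interval lies entirely above 0. Note that this interval was computed on gss2010pro, which also drops respondents with a missing wrkslf, so its group sizes (784 and 946) differ from those in the hypothesis test that follows (809 and 996).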
We then conduct a hypothesis test to check whether mean family income differs significantly between the genders.
inference(y = coninc, x = sex, data = gss2010, statistic = "mean",
          type = "ht", null = 0, alternative = "twosided",
          method = "theoretical", order = c("Male", "Female"))
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Male = 809, y_bar_Male = 50230.5921, s_Male = 41175.0819
## n_Female = 996, y_bar_Female = 42014.7299, s_Female = 37766.3401
## H0: mu_Male = mu_Female
## HA: mu_Male != mu_Female
## t = 4.3743, df = 808
## p_value = < 0.0001
The p-value from the test above is less than 0.0001, so we reject the null hypothesis. There is a statistically significant difference in mean family income between the genders.
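The reported test statistic can be checked by hand from the printed group summaries, using the standard error for two independent samples with unpooled variances. The degrees of freedom in the output (808) are consistent with the conservative choice min(809, 996) - 1.

```latex
\[
t = \frac{(\bar{y}_{\text{Male}} - \bar{y}_{\text{Female}}) - 0}
         {\sqrt{\dfrac{s_{\text{Male}}^2}{n_{\text{Male}}} + \dfrac{s_{\text{Female}}^2}{n_{\text{Female}}}}}
  = \frac{50230.5921 - 42014.7299}
         {\sqrt{\dfrac{41175.0819^2}{809} + \dfrac{37766.3401^2}{996}}}
  \approx \frac{8215.86}{1878.21}
  \approx 4.374
\]
```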
Conduct a chi-square test of independence between gender and work situation.
chisq.test(gss2010pro$sex, gss2010pro$wrkslf)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: gss2010pro$sex and gss2010pro$wrkslf
## X-squared = 23.038, df = 1, p-value = 1.588e-06
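The p-value is well below 0.05, so we reject the null hypothesis that gender and work situation are independent. The data provide evidence of an association: in this sample, a larger share of men (125 of 784) than women (79 of 946) are self-employed.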
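The same test can be rerun from the contingency table alone, which also makes it easy to inspect the expected counts under independence and the share of self-employed respondents in each gender. This is a minimal sketch that re-enters the counts printed earlier by hand; the object names are illustrative.

```r
# Counts copied from the table() output (rows: work situation, columns: gender)
counts <- matrix(c(125, 659, 79, 867), nrow = 2,
                 dimnames = list(wrkslf = c("Self-Employed", "Someone Else"),
                                 sex = c("Male", "Female")))

# Pearson chi-square test; Yates' correction is applied by default for 2 x 2 tables
res <- chisq.test(counts)

# Expected counts if work situation and gender were independent
print(round(res$expected, 1))

# Column proportions: share of each work situation within each gender
print(round(prop.table(counts, margin = 2), 3))

print(res)
```

Transposing the table does not change the statistic, so this gives the same result as passing sex first, as the call above does.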