library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
library(statsr)
Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called gss. Delete this note before you submit your work.
setwd("D:/Git/StatsR/Inference")
load("gss.Rdata")
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
The data appear to have been gathered using random sampling, which means that we can generalize our findings to the population.
However, this is random sampling and not random assignment. Hence, irrespective of the findings of the analysis, we cannot make any causal statements from the data.
Number of columns in which at least 50% of the observations are NA
# Share of missing values in each column of gss
na_share <- colMeans(is.na(gss))
# Count the columns where at least half of the observations are missing
sum(na_share >= 0.5)
## [1] 28
There are 28 variables in which at least 50% of the observations are missing.
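To see which columns are affected rather than only how many, the same missing-share calculation can be kept as a named vector. The sketch below uses a small made-up data frame (`toy` and its columns are hypothetical, not taken from gss) and base R only.

```r
# Toy data frame with hypothetical values (not taken from gss)
toy <- data.frame(
  v1 = c(1, NA, NA, NA),  # 75% missing
  v2 = c(1, 2, NA, 4),    # 25% missing
  v3 = c(NA, NA, 3, 4)    # 50% missing
)

# Share of missing values per column, as a named numeric vector
na_share <- colMeans(is.na(toy))

# Names of the columns with at least 50% missing values
mostly_missing <- names(na_share)[na_share >= 0.5]
mostly_missing
```

Applied to gss, the same two lines would list the variable names behind the count reported above.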
I want to understand whether, over the years, there has been a significant difference in inflation-adjusted income between men and women.
I feel very strongly about gender discrimination. While we have been talking about it, I believe that in reality the income disparity exists and that we should take corrective measures immediately.
The first step is to create a data set containing the variables that we will use to answer our research question.
Since we are working with income, year and sex, the subset of the data contains these 3 variables.
bd1 <- data.frame(gss$sex, gss$coninc,gss$year)
bd2 <- subset(bd1, (complete.cases(bd1)))
colnames(bd2) <- c("Sex","Income","Year")
head(bd2)
## Sex Income Year
## 1 Female 25926 1972
## 2 Male 33333 1972
## 3 Female 33333 1972
## 4 Female 41667 1972
## 5 Female 69444 1972
## 6 Male 60185 1972
data <- aggregate(bd2$Income, by = list(bd2$Sex), FUN = mean)
data
## Group.1 x
## 1 Male 48763.65
## 2 Female 41020.22
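As a side note, the formula interface of `aggregate` keeps the original column names instead of `Group.1` and `x`, which makes the result easier to read. This is a minimal sketch on made-up numbers (the `toy` data frame is hypothetical), not a replacement for the output shown above.

```r
# Hypothetical incomes, not GSS values
toy <- data.frame(
  Sex    = factor(c("Male", "Female", "Male", "Female"),
                  levels = c("Male", "Female")),
  Income = c(50000, 40000, 52000, 44000)
)

# Mean income per sex; the result has columns named Sex and Income
group_means <- aggregate(Income ~ Sex, data = toy, FUN = mean)
group_means

# Difference between the two group means (Male minus Female)
gap <- group_means$Income[group_means$Sex == "Male"] -
  group_means$Income[group_means$Sex == "Female"]
gap
```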
While the overall averages differ, I want to check whether the same difference holds from year to year.
Creating a chart
g <- ggplot(data=bd2, aes(x=Year,y=Income))
g <- g + geom_smooth(aes(colour = Sex))
g
## `geom_smooth()` using method = 'gam'
The chart shows that, in every year, men have had a higher income than women.
In the next section we explore whether this difference is statistically significant.
In this section, we check whether the difference is statistically significant.
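Because the chart shows smoothed curves, the year-by-year claim can also be checked directly on the yearly means. The sketch below does this with `tapply` on a small made-up data set (all values are hypothetical); the column names mirror those of bd2.

```r
# Hypothetical observations with the same columns as bd2
toy <- data.frame(
  Year   = c(1972, 1972, 1972, 1973, 1973, 1973),
  Sex    = c("Male", "Female", "Female", "Male", "Male", "Female"),
  Income = c(42000, 34000, 36000, 43000, 45000, 40000)
)

# Matrix of mean income with one row per year and one column per sex
yearly <- tapply(toy$Income, list(toy$Year, toy$Sex), mean)

# Male minus female mean income in each year
yearly_gap <- yearly[, "Male"] - yearly[, "Female"]
yearly_gap

# TRUE only if men's mean income is higher in every year
all(yearly_gap > 0)
```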
To do this, we use a t test. Since we have observations for men and women in each year, we pair the yearly mean incomes of men and women by year and use a paired t test.
H0: The null hypothesis is that there is no difference in income levels between men and women, and that the differences we observe could be due to chance.
HA: The alternative hypothesis is that there is indeed a difference in income between men and women.
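In symbols, the hypotheses and the paired test statistic can be written as follows. The notation is an assumption introduced here for illustration: $d_i$ is the difference between the mean income of men and the mean income of women in year $i$, $\mu_d$ is the true mean of these yearly differences, $\bar{d}$ and $s_d$ are their sample mean and standard deviation, and $n$ is the number of years.

```latex
H_0 : \mu_d = 0
\qquad
H_A : \mu_d \neq 0
\qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}},
\quad
df = n - 1
```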
To run the test, we first create a data frame that summarises the observations in bd2 as the mean income for each sex in each year.
Randomness: the data have been randomly sampled, hence we assume the observations to be independent.
The total data set contains about 57K observations, which is less than 10% of the US population.
Hence the conditions for conducting the t test have been met.
bd3 <- aggregate(bd2$Income, by = list(bd2$Sex, bd2$Year), FUN = mean)
colnames(bd3) <- c("Sex", "Year", "Income")
head(bd3)
## Sex Year Income
## 1 Male 1972 41953.86
## 2 Female 1972 34696.02
## 3 Male 1973 42864.77
## 4 Female 1973 40253.10
## 5 Male 1974 44253.19
## 6 Female 1974 40062.34
Based on this data set, we conduct a t test. Since I want to be very sure about my findings, I use a 99% confidence level.
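# Paired t test on the yearly mean incomes of men and women (paired by year).
# var.equal is dropped because it has no effect on a paired test; the result is
# stored as income_test so that base R's t() function is not masked.
income_test <- t.test(bd3$Income ~ bd3$Sex, paired = TRUE, conf.level = 0.99)
income_test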
##
## Paired t-test
##
## data: bd3$Income by bd3$Sex
## t = 19.245, df = 28, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## 6480.740 8653.855
## sample estimates:
## mean of the differences
## 7567.298
The p value is very low (less than 2.2e-16), and the 99% confidence interval for the mean difference is [6481, 8654].
With such a low p value we reject the null hypothesis and conclude that there is indeed a statistically significant difference between the incomes of men and women.
The 99% confidence interval does not contain zero, which reinforces our rejection of the null hypothesis that the difference between the incomes is 0.
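The t statistic, p value and confidence interval of a paired test can be reproduced by hand from the yearly differences. The sketch below does so on made-up yearly means (the `male` and `female` vectors are hypothetical, not the GSS values) and compares the result with `t.test` called on two vectors. Passing the two vectors explicitly also avoids relying on the formula interface, which newer R releases may not accept together with `paired = TRUE`.

```r
# Hypothetical yearly mean incomes, one value per year (not the GSS values)
male   <- c(42000, 43000, 44500, 46000, 47000)
female <- c(35000, 39000, 40000, 40500, 41000)

d  <- male - female        # yearly differences (male minus female)
n  <- length(d)            # number of paired years
se <- sd(d) / sqrt(n)      # standard error of the mean difference

# Test statistic, two-sided p value and 99% confidence interval
t_stat  <- mean(d) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
ci      <- mean(d) + c(-1, 1) * qt(0.995, df = n - 1) * se

# The same test computed by t.test on the two vectors
check <- t.test(male, female, paired = TRUE, conf.level = 0.99)
check
```

With bd3, the two vectors would be the Income values for each sex, ordered by Year so that the same years are paired.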
In conclusion, we can infer that there is a statistically significant difference between the yearly mean incomes of men and women across the survey years from 1972 to 2012.