Setup

Load packages

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
library(statsr)

Load data

The data file and the R Markdown file must be in the same directory. Loading the data file creates a data frame called gss.

setwd("D:/Git/StatsR/Inference")
load("gss.Rdata")

Part 1: Data

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

The data appear to have been gathered using random sampling, which means that we can generalize our findings to the population.

However, the study uses random sampling, not random assignment. Hence, irrespective of the findings of the analysis, we cannot make any causal statements from the data.

Data Evaluation and Preprocessing

Number of columns in which 50% or more of the observations are NA

# Share of missing values in each column of gss
na_share <- colMeans(is.na(gss))

# Count the columns where at least half of the observations are missing
sum(na_share >= 0.5)
## [1] 28

There are 28 variables in which at least 50% of the observations are missing.
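
To see which variables are affected rather than only how many, the missing-value shares can be kept as a named vector. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in for `gss`), so the column names and values are illustrative only.

```r
# Toy stand-in for gss; values are made up for illustration
toy <- data.frame(
  a = c(1, NA, NA, NA),   # 75% missing
  b = c(1, 2, 3, NA),     # 25% missing
  c = c(NA, NA, 3, 4)     # 50% missing
)

# Share of missing values per column, as a named vector
na_share <- colMeans(is.na(toy))

# Names of the columns where at least half of the values are missing
mostly_missing <- names(na_share)[na_share >= 0.5]
mostly_missing
```

Applied to `gss`, the same two lines would list the variables that are too sparse to rely on.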


Part 2: Research question

I want to understand whether, over the years, there has been a significant difference in inflation-adjusted income between men and women.

I feel very strongly about gender discrimination. While the issue is widely discussed, I believe that the income disparity still exists in reality and that we should take corrective measures immediately.


Part 3: Exploratory data analysis

The first step is to create a data set containing the variables that we need to answer our research question.

Since we are working with income, year and sex, the subset of the data contains these three variables.

bd1 <- data.frame(gss$sex, gss$coninc,gss$year)
bd2 <- subset(bd1, (complete.cases(bd1)))
colnames(bd2) <- c("Sex","Income","Year")
head(bd2)
##      Sex Income Year
## 1 Female  25926 1972
## 2   Male  33333 1972
## 3 Female  33333 1972
## 4 Female  41667 1972
## 5 Female  69444 1972
## 6   Male  60185 1972
data <- aggregate(bd2$Income, by = list(bd2$Sex), FUN = mean)
data
##   Group.1        x
## 1    Male 48763.65
## 2  Female 41020.22

While the overall averages differ, I want to check whether the same difference has held over the years, so I create a chart.

g <- ggplot(data=bd2, aes(x=Year,y=Income))
g <- g + geom_smooth(aes(colour = Sex))
g
## `geom_smooth()` using method = 'gam'

The smoothed curves show that men's income has been higher than women's throughout the period.

In the next section, we explore whether the difference is statistically significant.
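
The smoothed curves show the overall trend, but they do not by themselves confirm the ordering in each individual year. One way to check this directly is to tabulate the yearly means by sex and look at the gap. The following is a minimal sketch on made-up numbers (`toy` is a hypothetical stand-in for `bd2`, and the incomes are not GSS values).

```r
# Toy stand-in for bd2; incomes are made up for illustration
toy <- data.frame(
  Sex    = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Income = c(40000, 35000, 42000, 39000, 45000, 41000),
  Year   = c(1972, 1972, 1973, 1973, 1974, 1974)
)

# Mean income per year (rows) and sex (columns)
yearly_means <- tapply(toy$Income, list(toy$Year, toy$Sex), mean)

# Male minus female mean income for each year
yearly_gap <- yearly_means[, "Male"] - yearly_means[, "Female"]
yearly_gap

# TRUE only if men have the higher mean income in every year
all(yearly_gap > 0)
```

Running the same steps on `bd2` would show whether the gap is positive in every survey year or only on average.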


Part 4: Inference

In this section, we check whether the difference is statistically significant.

To do this, we use a t test. Since the data give a mean income for men and for women in each year, we pair the two means by year and use a paired t test.

Defining the Hypothesis

  • H0 : The null hypothesis is that there is no difference in income levels between men and women, and that the differences we observe are due to chance.

  • HA : The alternative hypothesis is that there is a difference in income between men and women.

Method

For the test, we first create a data frame that summarises the observations in bd2 as the mean income for each sex in each year.

Checking conditions

  • Randomness : The data were randomly sampled, so we assume the observations are independent.

  • Sample size : The full data set contains about 57K observations, which is less than 10% of the US population.

Hence the conditions for conducting the t test have been met.

bd3 <- aggregate(bd2$Income, by = list(bd2$Sex, bd2$Year), FUN = mean)
colnames(bd3) <- c("Sex", "Year", "Income")
head(bd3)
##      Sex Year   Income
## 1   Male 1972 41953.86
## 2 Female 1972 34696.02
## 3   Male 1973 42864.77
## 4 Female 1973 40253.10
## 5   Male 1974 44253.19
## 6 Female 1974 40062.34
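
A paired test matches the two groups by position, so it is worth confirming that each year contributes exactly one row per sex and that the two groups are in the same year order. The sketch below illustrates that check on made-up yearly means (`toy` is a hypothetical stand-in for `bd3`), and also shows that a paired t test is the same as a one-sample t test on the yearly differences.

```r
# Toy stand-in for bd3; yearly mean incomes are made up for illustration
toy <- data.frame(
  Sex    = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  Year   = c(1972, 1972, 1973, 1973, 1974, 1974, 1975, 1975),
  Income = c(42000, 35000, 43000, 40000, 44000, 40500, 45500, 39000)
)

# Every year should contribute exactly one row per sex
stopifnot(all(table(toy$Year, toy$Sex) == 1))

# Sort by year within each sex so that position i refers to the same year
male   <- toy[toy$Sex == "Male", ]
female <- toy[toy$Sex == "Female", ]
male   <- male[order(male$Year), ]
female <- female[order(female$Year), ]
stopifnot(identical(male$Year, female$Year))

# Paired test on the two aligned vectors
paired_test <- t.test(male$Income, female$Income, paired = TRUE)

# Equivalent one-sample test on the yearly differences
diff_test <- t.test(male$Income - female$Income, mu = 0)

all.equal(unname(paired_test$statistic), unname(diff_test$statistic))
```

Passing two aligned vectors, as here, makes the pairing explicit instead of relying on the row order of the data frame.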

Based on this data set, we conduct a t test. Since I want to be very sure about my findings, I use a 99% confidence level.

# Paired t test on the yearly mean incomes (var.equal has no effect when paired = TRUE)
t_result <- t.test(bd3$Income ~ bd3$Sex, paired = TRUE, conf.level = 0.99)
t_result
## 
##  Paired t-test
## 
## data:  bd3$Income by bd3$Sex
## t = 19.245, df = 28, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
##  6480.740 8653.855
## sample estimates:
## mean of the differences 
##                7567.298

The p value is very low (p < 2.2e-16), and the 99% confidence interval is [6481, 8654].

With such a low p value, we reject the null hypothesis and conclude that there is indeed a statistically significant difference between the incomes of men and women.

We also see that the 99% confidence interval, [6481, 8654], does not contain zero. This reinforces our rejection of the null hypothesis, which stated that the difference between the incomes is 0.
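
As a sanity check, the confidence interval can be rebuilt from the numbers printed in the test output. This is a minimal sketch that assumes only the reported mean difference, t statistic and degrees of freedom; the variable names are illustrative.

```r
# Values copied from the t.test output above
mean_diff <- 7567.298   # mean of the yearly differences
t_stat    <- 19.245     # reported t statistic
df        <- 28         # degrees of freedom (29 yearly pairs - 1)

# Standard error implied by the reported mean and t statistic
se <- mean_diff / t_stat

# Critical t value for a two-sided 99% interval
t_crit <- qt(1 - 0.01 / 2, df)

# Reconstructed 99% confidence interval
ci <- mean_diff + c(-1, 1) * t_crit * se
round(ci)
```

The result should agree with the reported interval up to the rounding of the printed t statistic.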

In conclusion, we have observed a statistically significant difference between the yearly mean incomes of men and women over the period from 1972 to 2012.