Maaz Shaikh - S3795603
Vaishnavi Narayana Naik - S3797442
Last updated: 26 October, 2019
RPubs link: http://rpubs.com/vaishnavi-naik/542466
We investigate whether men and women in the United States receive equal income across a range of occupations. We use the open dataset published on Kaggle, Incomes by Career and Gender, which was retrieved from the Bureau of Labor Statistics, and look at median weekly incomes for 558 different occupations.
The problem statement is to find out whether male and female employees have equal income across many occupations in the United States of America. We use descriptive statistics, boxplots and Q-Q plots to summarise and visualise the data and to examine how weekly income differs between males and females across occupations. Following this, a hypothesis test (a two-sample t-test) is performed to find whether there is a statistically significant difference in weekly income between the two genders.
We use the open data from Kaggle, U.S. Incomes by Occupation and Gender. This dataset was retrieved from the Bureau of Labor Statistics and contains 558 observations of 7 variables.
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 558 obs. of 7 variables:
## $ Occupation : chr "ALL OCCUPATIONS" "MANAGEMENT" "Chief executives" "General and operations managers" ...
## $ All_workers: num 109080 12480 1046 823 8 ...
## $ All_weekly : chr "809" "1351" "2041" "1260" ...
## $ M_workers : num 60746 7332 763 621 5 ...
## $ M_weekly : chr "895" "1486" "2251" "1347" ...
## $ F_workers : num 48334 5147 283 202 4 ...
## $ F_weekly : chr "726" "1139" "1836" "1002" ...
On examining the structure of the data frame, we find that All_weekly, M_weekly and F_weekly are of character type, which is incorrect. These variables need to be numeric (double) because they hold the income values. Therefore, in the next slide, we cast these variables to numeric type.
statsdata$All_weekly <- as.double(statsdata$All_weekly)
statsdata$M_weekly <- as.double(statsdata$M_weekly)
statsdata$F_weekly <- as.double(statsdata$F_weekly)
#Examine the data frame after type conversions.
str(statsdata, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 558 obs. of 7 variables:
## $ Occupation : chr "ALL OCCUPATIONS" "MANAGEMENT" "Chief executives" "General and operations managers" ...
## $ All_workers: num 109080 12480 1046 823 8 ...
## $ All_weekly : num 809 1351 2041 1260 NA ...
## $ M_workers : num 60746 7332 763 621 5 ...
## $ M_weekly : num 895 1486 2251 1347 NA ...
## $ F_workers : num 48334 5147 283 202 4 ...
## $ F_weekly : num 726 1139 1836 1002 NA ...
In this dataset, missing values are stored as the string “Na”, which R does not recognise as missing. We therefore recode them to NA; the missing values themselves are dealt with later.
## Occupation All_workers All_weekly M_workers M_weekly F_workers
## 0 0 236 0 326 0
## F_weekly
## 366
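The slides do not show the recoding step itself, so the following is only a minimal sketch of how it could be done. The data frame `toy` and its values are hypothetical stand-ins for `statsdata`; the column names follow the dataset.

```r
# Toy data frame standing in for the real data (values are illustrative only)
toy <- data.frame(
  Occupation = c("A", "B", "C"),
  M_weekly   = c("895", "Na", "1486"),
  F_weekly   = c("726", "Na", "Na"),
  stringsAsFactors = FALSE
)

# Recode the literal string "Na" to a real missing value
toy[toy == "Na"] <- NA

# Convert the income columns to numeric now that "Na" has been recoded
toy$M_weekly <- as.double(toy$M_weekly)
toy$F_weekly <- as.double(toy$F_weekly)

# Count the missing values in each column
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

Recoding before the conversion avoids the coercion warning that as.double() raises when it meets a non-numeric string.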
Since our objective is to find whether there is a statistically significant difference between the weekly income of male and female employees, we look at two main variables: ‘M_weekly’, the male weekly income, and ‘F_weekly’, the female weekly income. We first find the descriptive statistics of these variables.
M_weekly data:
statsdata %>% summarise(Min = min(M_weekly, na.rm = TRUE), Q1 = quantile(M_weekly, probs = .25, na.rm = TRUE), Median = median(M_weekly, na.rm = TRUE),
  Q3 = quantile(M_weekly, probs = .75, na.rm = TRUE), Max = max(M_weekly, na.rm = TRUE), Mean = mean(M_weekly, na.rm = TRUE),
  SD = sd(M_weekly, na.rm = TRUE), Interquartile = IQR(M_weekly, na.rm = TRUE), n = n(), Missing = sum(is.na(M_weekly)))
F_weekly data:
statsdata %>% summarise(Min = min(F_weekly, na.rm = TRUE), Q1 = quantile(F_weekly, probs = .25, na.rm = TRUE), Median = median(F_weekly, na.rm = TRUE),
  Q3 = quantile(F_weekly, probs = .75, na.rm = TRUE), Max = max(F_weekly, na.rm = TRUE), Mean = mean(F_weekly, na.rm = TRUE),
  SD = sd(F_weekly, na.rm = TRUE), Interquartile = IQR(F_weekly, na.rm = TRUE), n = n(), Missing = sum(is.na(F_weekly)))
From the descriptive statistics we can see that male weekly income has a higher mean, along with higher Q1 and Q3 values. Males also have a higher median weekly income of $915.5, compared with $736 for females.
These initial observations may lead us to believe that males do receive higher income than females. However, we found many missing values in the previous slides, so we first have to deal with those.
Since there are missing values in both the male and female weekly income variables, we have decided to remove the affected observations to give us an appropriate dataset for our investigation.
## Occupation All_workers All_weekly M_workers M_weekly F_workers
## 0 0 0 0 0 0
## F_weekly
## 0
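The removal code is not shown in the slides. One common way to drop incomplete rows is na.omit(); the sketch below assumes that approach and uses a hypothetical data frame `toy` rather than the real data.

```r
# Toy data with missing incomes (values are illustrative only)
toy <- data.frame(
  Occupation = c("A", "B", "C", "D"),
  M_weekly   = c(895, NA, 1486, 1200),
  F_weekly   = c(726, NA, NA, 1002)
)

# Keep only the rows where every variable is observed
complete <- na.omit(toy)

# Confirm that no missing values remain and see how many rows survive
remaining_missing <- colSums(is.na(complete))
print(remaining_missing)
print(nrow(complete))
```

Note that a row is dropped if either income is missing, so an occupation stays in the analysis only when both the male and the female figure are available.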
From the boxplots we can clearly observe that male weekly income has a much higher median. The two groups have a comparable Q1, but male employees have a noticeably higher Q3, so male weekly income also has a larger IQR.
par(mfrow = c(1,2))
male_outliers <- boxplot(statsdata$M_weekly, main="Box Plot for Male weekly income", ylab="Salary", ylim = c(0,2500))$out
female_outliers <- boxplot(statsdata$F_weekly, main="Box Plot for Female weekly income", ylab="Salary", ylim = c(0,2500))$out
We find 3 outliers in the female weekly income variable. Since these outliers are high salaries for females, we cap them rather than delete them, because deleting them could have an impact on our findings.
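The capping code is not included in the slides, so the exact rule used is unknown. A minimal sketch of one common variant follows: values beyond the 1.5 × IQR fences are replaced by the 5th or 95th percentile. The function name `cap_outliers`, the percentile choice and the toy vector are all assumptions made for illustration.

```r
# Cap values outside the Tukey fences at the 5th / 95th percentiles.
# The percentile thresholds are an assumption, not taken from the slides.
cap_outliers <- function(x) {
  quartiles <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  caps      <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
  fence     <- 1.5 * IQR(x, na.rm = TRUE)

  # which() drops NA comparisons, so missing values are left untouched
  low  <- which(x < quartiles[1] - fence)
  high <- which(x > quartiles[2] + fence)

  x[low]  <- caps[1]
  x[high] <- caps[2]
  x
}

# Toy incomes with one unusually high value
incomes <- c(700, 720, 750, 780, 800, 820, 850, 2000)
capped  <- cap_outliers(incomes)
print(capped)
```

Capping keeps the observation in the sample while limiting its influence on the mean and standard deviation, which matters here because the t-test is based on both.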
We use qqPlot() to examine the normality of the sample distribution.
We find that for both variables most data points fall within the 95% CI for a normal distribution, except for a few points that fall outside it in the left tail. Both variables appear to be left skewed. This tells us that neither male nor female weekly income can be assumed to be normally distributed. However, since the sample size is greater than 30 (n > 30), the Central Limit Theorem allows us to proceed with the two-sample t-test even if the normality assumption is violated. So in this case, we can effectively ignore the non-normality of the data.
par(mfrow = c(1,2))
dat <- statsdata$M_weekly %>% qqPlot(dist="norm", main = "QQPlot for Male Weekly income", ylim = c(0,2500))
dat <- statsdata$F_weekly %>% qqPlot(dist="norm", main = "QQPlot for Female Weekly income", ylim = c(0,2500))
We use a two-sample t-test for this dataset as the male and female weekly incomes are independent of each other. The two-sample t-test has the following statistical hypotheses:
\[H_0 : \mu_1 - \mu_2 = 0\]
\[H_A : \mu_1 - \mu_2 \neq 0\]
where \(\mu_1\) and \(\mu_2\) refer to the population means of Male and Female weekly income respectively.
We first need to examine the homogeneity of variance, i.e. to check the assumption that the variances of the two populations are equal. We use Levene’s test to test this assumption.
The statistical hypotheses for Levene’s test are:
\[H_0 : \sigma_1^2 = \sigma_2^2\]
\[H_A : \sigma_1^2 \neq \sigma_2^2\]
where \(\sigma_1^2\) and \(\sigma_2^2\) refer to the population variance of Male and Female weekly incomes respectively.
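The slides do not show which function was used to run Levene’s test. As a sketch of what the test computes, the median-centred version can be written in base R as a one-way ANOVA on the absolute deviations from each group’s median. The simulated data frame `toy` is hypothetical and only mimics the Gender / WeeklyIncome layout.

```r
# Simulated incomes for two groups with different spreads (illustrative only)
set.seed(1)
toy <- data.frame(
  Gender       = factor(rep(c("Female", "Male"), each = 30)),
  WeeklyIncome = c(rnorm(30, mean = 800, sd = 100),
                   rnorm(30, mean = 1000, sd = 300))
)

# Absolute deviation of each observation from its own group's median
group_median <- ave(toy$WeeklyIncome, toy$Gender, FUN = median)
toy$AbsDev   <- abs(toy$WeeklyIncome - group_median)

# Levene's test (median-centred): one-way ANOVA on the absolute deviations
levene   <- anova(lm(AbsDev ~ Gender, data = toy))
levene_p <- levene[["Pr(>F)"]][1]
print(levene)
```

A small p-value here indicates that the spread differs between the groups, which is the situation in which the unequal-variance form of the t-test is preferred.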
Our data is in the following format:
However, to perform the t-test we need a factor variable that defines the groups. We reshape our data into a suitable form and keep only the variables we need for our statistical analysis: Occupation, Gender and WeeklyIncome, where Gender serves as the factor variable that defines the groups Male and Female.
Reshaping our data using gather():
y <- gather(statsdata,'M_weekly','F_weekly', key = 'Gender', value = "WeeklyIncome") %>% select(c(Occupation, Gender, WeeklyIncome))
y[y == "M_weekly"] <- "Male"
y[y == "F_weekly"] <- "Female"
y[1:4,]
Levene’s test gives a p-value of 0.0003953 < 0.05, hence we reject the null hypothesis of equal variances; in other words, it is not safe to assume equal variance.
So we perform the two-sample t-test assuming unequal variance.
##
## Welch Two Sample t-test
##
## data: WeeklyIncome by Gender
## t = -4.6592, df = 257.95, p-value = 5.091e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -283.6525 -115.1144
## sample estimates:
## mean in group Female mean in group Male
## 818.1518 1017.5352
## [1] -1.968412
From the two-sample t-test we get the test statistic \(t = -4.6592\), which falls outside the interval between the lower critical value, -1.968412, and the upper critical value, 1.968412. According to the critical value method, there is a statistically significant difference between the mean weekly incomes for Males and Females. Hence, we reject the Null Hypothesis H0.
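The critical value comparison can be sketched as follows. This is an illustration on simulated data, not the authors’ code: the data frame `toy` is hypothetical, and the degrees of freedom are taken from the Welch test result, which is an assumption about how the critical value is obtained.

```r
# Simulated long-format data with a Gender grouping column (illustrative only)
set.seed(42)
toy <- data.frame(
  Gender       = rep(c("Female", "Male"), each = 40),
  WeeklyIncome = c(rnorm(40, mean = 800, sd = 150),
                   rnorm(40, mean = 1000, sd = 300))
)

# Welch two-sample t-test: var.equal = FALSE does not pool the variances
welch <- t.test(WeeklyIncome ~ Gender, data = toy,
                var.equal = FALSE, alternative = "two.sided")

# Two-sided critical value at the 5% level, using the Welch degrees of freedom
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = welch$parameter)

# The critical value rule and the p-value rule lead to the same decision
reject_by_critical_value <- unname(abs(welch$statistic) > t_crit)
reject_by_p_value        <- welch$p.value < alpha
print(c(critical = reject_by_critical_value, p_value = reject_by_p_value))
```

Because the p-value and the critical value are derived from the same t distribution, the two decision rules cannot disagree when they use the same degrees of freedom.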
The p-value of the two-sample t-test tells us the probability of observing a difference between the sample means at least as extreme as \(-199.3834\), assuming the difference is 0 in the population. From the two-sample t-test, we get the p-value \(p = 5.091e-06\).
According to the p-value approach, since \(p\) = 5.091e-06 < \(α\) = 0.05, we reject the Null Hypothesis H0. Therefore there is a statistically significant difference between the means.
We calculate the difference between the mean weekly incomes of Males and Females:
## [1] -199.3834
The two-sample t-test gives us the 95% CI: \([ -283.6525, -115.1144]\). The hypothesised difference of 0 is not captured by the 95% CI, hence we reject the Null Hypothesis H0. There is a statistically significant difference between the means.
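As a quick arithmetic check using only the figures reported in the test output, the difference in means and the position of 0 relative to the confidence interval can be verified directly. The variable names are chosen here for illustration.

```r
# Figures copied from the Welch two-sample t-test output
mean_female <- 818.1518
mean_male   <- 1017.5352
ci          <- c(-283.6525, -115.1144)

# Difference in sample means (Female minus Male, matching the test output)
diff_means <- mean_female - mean_male

# Does the 95% CI contain the hypothesised difference of 0?
ci_contains_zero <- ci[1] <= 0 && 0 <= ci[2]

print(round(diff_means, 4))
print(ci_contains_zero)
```

Both limits of the interval are negative, so 0 lies outside it, which agrees with the critical value and p-value approaches.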
After performing the hypothesis test, we found a statistically significant difference between the mean weekly incomes of male and female employees, suggesting that there is an inequality in the income received by males and females. There are, however, some limitations to the data.
While the sample size is greater than 30 (statistically a large sample), we omitted many observations because of missing values, and a final sample size of 142 observations may not be enough to conclude our investigation.
Not all occupations were taken into consideration, especially after omitting observations because of missing values.
Bonuses and length of stay at one company are not considered in this dataset, and these could have had an impact on our findings.
The dataset was collected in 2015, so it does not give us current figures.
To improve the accuracy of the investigation, we could use a larger sample and try to include all possible occupations. Missing values were a major factor, so data entry errors should be limited to increase accuracy.
Overall, the report concludes that the investigation found statistical evidence to support that Male employees do receive a higher income than female employees.