Introduction
The goal of this report is to show certain insights on wage data consisting of a random sample taken from the U.S. Current Population Survey for the year 1976. Also, the following report will include estimation of the mean wage of white female and male professionals living in the north region. Furthermore, the proportion of white female professionals will be estimated.
Libraries and data setup
#install.packages("np")
#install.packages("tidyverse")
#install.packages("kableExtra")
#install.packages("psych")
#install.packages("moments")
#install.packages("stats")
#install.packages("MASS")
#install.packages("plyr")
#install.packages("classInt")
library(np)
library(tidyverse)
library(kableExtra)
library(classInt)There are 526 observations on 24 variables in total.
summary(wages)## wage educ exper tenure
## Min. : 0.530 Min. : 0.00 Min. : 1.00 Min. : 0.000
## 1st Qu.: 3.330 1st Qu.:12.00 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 4.650 Median :12.00 Median :13.50 Median : 2.000
## Mean : 5.896 Mean :12.56 Mean :17.02 Mean : 5.105
## 3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.:26.00 3rd Qu.: 7.000
## Max. :24.980 Max. :18.00 Max. :51.00 Max. :44.000
## nonwhite female married numdep smsa
## Nonwhite: 54 Female:252 Married :320 Min. :0.000 Min. :0.0000
## White :472 Male :274 Notmarried:206 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.000 Median :1.0000
## Mean :1.044 Mean :0.7224
## 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :6.000 Max. :1.0000
## northcen south west construc
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.251 Mean :0.3555 Mean :0.1692 Mean :0.04563
## 3rd Qu.:0.750 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## ndurman trcommpu trade services
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.1141 Mean :0.04373 Mean :0.2871 Mean :0.1008
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## profserv profocc clerocc servocc
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2586 Mean :0.3669 Mean :0.1673 Mean :0.1407
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## lwage expersq tenursq
## Min. :-0.6349 Min. : 1.0 Min. : 0.00
## 1st Qu.: 1.2030 1st Qu.: 25.0 1st Qu.: 0.00
## Median : 1.5369 Median : 182.5 Median : 4.00
## Mean : 1.6233 Mean : 473.4 Mean : 78.15
## 3rd Qu.: 1.9286 3rd Qu.: 676.0 3rd Qu.: 49.00
## Max. : 3.2181 Max. :2601.0 Max. :1936.00
For solving the major concerns of the analysis variables: wage, nonwhite, female, northcen will be used.
New variable for education analysis
wages <- wages %>% mutate(educ_level=cut(educ, breaks=c(min(educ), 11, 15, max(educ)), labels=c("low","middle","high")))There will be four subsets distinguished from the dataset: white_north - all white people from the sample living in the north region male_white - all white men from the sample living in the north region female_white - all white women from the sample living in the north region north - people living on the north
white_north <- wages %>% filter(nonwhite == "White", northcen == 1)
male_white <- wages %>% filter(nonwhite == "White", female == "Male", northcen == 1)
female_white <- wages %>% filter(nonwhite == "White", female == "Female", northcen == 1)
north <- wages %>% filter(northcen == 1)
summary(white_north)## wage educ exper tenure
## Min. : 1.500 Min. : 8.00 Min. : 1.00 Min. : 0.000
## 1st Qu.: 3.350 1st Qu.:12.00 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 4.685 Median :12.00 Median :11.00 Median : 2.000
## Mean : 5.729 Mean :12.78 Mean :17.50 Mean : 5.347
## 3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.:29.25 3rd Qu.: 7.250
## Max. :21.860 Max. :18.00 Max. :48.00 Max. :33.000
## nonwhite female married numdep smsa
## Nonwhite: 0 Female:63 Married :75 Min. :0.0000 Min. :0.0000
## White :124 Male :61 Notmarried:49 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :1.0000
## Mean :0.8871 Mean :0.7742
## 3rd Qu.:1.2500 3rd Qu.:1.0000
## Max. :5.0000 Max. :1.0000
## northcen south west construc ndurman
## Min. :1 Min. :0 Min. :0 Min. :0.00000 Min. :0.0000
## 1st Qu.:1 1st Qu.:0 1st Qu.:0 1st Qu.:0.00000 1st Qu.:0.0000
## Median :1 Median :0 Median :0 Median :0.00000 Median :0.0000
## Mean :1 Mean :0 Mean :0 Mean :0.04032 Mean :0.1129
## 3rd Qu.:1 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1 Max. :0 Max. :0 Max. :1.00000 Max. :1.0000
## trcommpu trade services profserv
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.02419 Mean :0.3145 Mean :0.06452 Mean :0.3145
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## profocc clerocc servocc lwage
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.4055
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.2090
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.5444
## Mean :0.4194 Mean :0.1532 Mean :0.1774 Mean :1.6108
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.9286
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :3.0847
## expersq tenursq educ_level
## Min. : 1.0 Min. : 0.00 low :22
## 1st Qu.: 25.0 1st Qu.: 0.00 middle:79
## Median : 121.0 Median : 4.00 high :23
## Mean : 518.3 Mean : 81.90
## 3rd Qu.: 855.8 3rd Qu.: 52.75
## Max. :2304.0 Max. :1089.00
white_north subset contains of 124 observations.
Wages frequency table
Following frequency table and histogram show the distribution of wages across white_north subset. The number of classes is selected to be 20. The calculated interval is ~ 1.018 [$/hour]
attach(white_north)
hist(wage,breaks=seq(round(min(wage), 2),floor(max(wage)) + round(min(wage), 2) ,by=1), main="Wages frequency in white_north", xlab = "wage [$/h]",col = "darkgoldenrod1")n_classes <- 20
interval <- (max(wage)-min(wage))/n_classes
wage_limits <- cut(wage, seq(min(wage), max(wage), by = interval))
wage_tab <- table(wage_limits)
wage_tab <- as.data.frame(wage_tab)
kable(wage_tab, format = "html", col.names=c("Wage range", "Frequency"), caption="Wages of white from north frequency table") %>%
column_spec(2, color = "white", background = spec_color(wage_tab$Freq[1:10],begin=0.3, end=0.8, option="E")) %>%
kable_styling(bootstrap_options = "bordered", "hover")| Wage range | Frequency |
|---|---|
| (1.5,2.52] | 5 |
| (2.52,3.54] | 31 |
| (3.54,4.55] | 24 |
| (4.55,5.57] | 11 |
| (5.57,6.59] | 20 |
| (6.59,7.61] | 7 |
| (7.61,8.63] | 5 |
| (8.63,9.64] | 6 |
| (9.64,10.7] | 5 |
| (10.7,11.7] | 1 |
| (11.7,12.7] | 3 |
| (12.7,13.7] | 0 |
| (13.7,14.7] | 2 |
| (14.7,15.8] | 1 |
| (15.8,16.8] | 0 |
| (16.8,17.8] | 0 |
| (17.8,18.8] | 1 |
| (18.8,19.8] | 0 |
| (19.8,20.8] | 0 |
| (20.8,21.9] | 1 |
Tabular accuracy index
wage_tai <- classIntervals(wage, n = n_classes, style="fixed", fixedBreaks=seq(min(wage), max(wage), by = interval) )
wage_tai <-jenks.tests(wage_tai)
kable(wage_tai) %>% kable_material_dark() %>% kable_styling(font_size = 24)| x | |
|---|---|
| # classes | 20.0000000 |
| Goodness of fit | 0.9948936 |
| Tabular accuracy | 0.9220722 |
Conclusion:
Division for 20 classes caused really high tabular accuracy index of 0.922.
Wages distribution
This part will focus on the visual representation of wages distribution within the sample. Firstly let’s have a close look at the histogram with a density line for the whole white_north part of the sample. Below, the histogram shows wages distribution separately for female and male.
qplot(wage, data = white_north, geom = "histogram", color = female) + theme_bw() + labs(x = "Wage [$/h]", y ="Frequency")Density plot for each gender with marked mean.
library(plyr)
mu <- ddply(white_north, "female", summarise, grp.mean=mean(wage))
p <- ggplot(white_north, aes(x=wage, fill=female)) +
geom_density(alpha = 0.2)+
geom_vline(data=mu, aes(xintercept=grp.mean, color=female),
linetype="dashed", size = 1)
pShape measures
Skewness
Positive value of skewness tells us, that we can observe asymmetry towards the left side of the histogram. And in fact the histogram is positively skewed. It can be interpreted as follows: median > mean. We should be aware that skewness result is influenced by the large tail on the left side of the plot. In order to evaluate the shape and tailedness of the distribution there comes handy kurtosis.| Skewness | value |
|---|---|
| Female | 2.154914 |
| Male | 1.485754 |
Kurtosis
Negative value of this measure tells us, that the distribution is leptokurtic.
kurtosis_tab <- data.frame(Kurtosis = c("Female", "Male"),value = c( kurtosi(female_white$wage), kurtosi(male_white$wage)))
kbl(kurtosis_tab) %>% kable_styling(bootstrap_options = "bordered", "hover")| Kurtosis | value |
|---|---|
| Female | 6.939859 |
| Male | 2.671281 |
Central tendency measures
mean_female <- mean(female_white$wage)
mean_male <- mean(male_white$wage)
median_female <- median(female_white$wage)
median_male <- median(male_white$wage)
mean_tab <- data.frame(Mean = c("Female", "Male"),value = c(mean_female, mean_male))
kbl(mean_tab) %>% kable_styling(bootstrap_options = "bordered", "hover")| Mean | value |
|---|---|
| Female | 4.395714 |
| Male | 7.106885 |
median_tab <- data.frame(Median = c("Female", "Male"),value = c(median_female, median_male))
kbl(median_tab) %>% kable_styling(bootstrap_options = "bordered", "hover")| Median | value |
|---|---|
| Female | 3.75 |
| Male | 6.25 |
Conclusion:
Clearly white male living on north subset is characterized by higher mean and median wages than female subset.
Positional measures
quant_female <- quantile(female_white$wage, probs = seq(0,1,0.25))
iqr_female <- quant_female[4]-quant_female[2];
boxplot(quant_female, horizontal=T, main="Quantiles of female wages in $/h", sub=paste("Interquartille range = ", round(iqr_female, 2), " $/h"), medcol="firebrick3")quant_male <- quantile(male_white$wage, probs = seq(0,1,0.25))
iqr_male <- quant_male[4]-quant_male[2];
boxplot(quant_male, horizontal=T, main="Quantiles of male wages in $/h", sub=paste("Interquartille range = ", round(iqr_male, 2), " $/h"), medcol="cyan3") First quartile is the value which divides observations into two parts: 25% of smaller values and 75% larger values. Second quartile is equal to the median - marked with the red line on the box plot. Third quartile is the value for which only 25% of observations are larger. Left border indicates 5% percentile, and right border indicates 95%. Interquartile range is calculated in such a way: Q3 - Q1. IQR contains 50% of all observations, so the wider IQR is, the more differentiation of the variable.
Conclusion:
It is visible that male have twice as wide IQR as female. It means that female wages are more concentrated around one value while male wages are more varied. It can be confirmed by the following boxplot:
box1 <- ggplot(white_north, aes(female, wage)) + geom_boxplot() + stat_summary(fun.args = mean, geom="point", shape=20, size=5, color="deeppink4") + labs(x="gender", y = "wage [$/h]")
box1Education level impact on wages
educ1 <- ggplot(white_north, aes(wage, female, color = female)) +
geom_jitter(width = 0.1, height = 0.1)
educ1 + facet_wrap(~educ_level) + ggtitle("Wage dependant on a gender and a education level") + labs(x="wage [$/h]", y = "gender")Conclusion:
Obviously the higher education level, the higher wages are observed both for male and female.
Following boxplot represents wages of withe professionals’ living on the north quantiles with median and mean marked. Considered factors to have an impact on the wage are the education level and gender.
wage_by_sex_educ <- ggplot(white_north, aes(female, wage)) + geom_boxplot() +
stat_summary(fun.args = mean, geom="point", shape=20, size=5, color="deeppink4")
wage_by_sex_educ + facet_wrap(~educ_level) +ggtitle("Wages by gender and education level", subtitle = "with marked mean wage") + theme(plot.subtitle = element_text(colour = "deeppink4", face="bold")) + labs(x="gender", "wage [$/h]")Race impact on wages
mu1 <- ddply(north, "nonwhite", summarise, grp.mean=mean(wage))
p1 <- ggplot(north, aes(x=wage, fill=nonwhite)) +
geom_density(alpha = 0.2)+
geom_vline(data=mu1, aes(xintercept=grp.mean, color=nonwhite),
linetype="dashed", size = 1) + scale_fill_manual(values=c("#70201F", "beige")) + scale_color_manual(values=c("#70201F", "beige")) + theme_dark() + ggtitle("Wages distribution by the race")
p1Conclusions:
From the plot it can be concluded that white people have higher wages, however, there is a very limited number of observations characterized by the nonwhite variable to be “Nonwhite”.
nrow(filter(north, nonwhite == "Nonwhite"))## [1] 8
nrow(filter(north, nonwhite == "White"))## [1] 124
There are just 8 “Nonwhite” observations versus 124 “White” observations. For that reason no further analysis on the race will be conducted.
Mean wage estimation
Stats package estiamtion
Automatic t-tests from the stats package for female:
library(stats)
white_north_t_test_f <- t.test(female_white$wage)
white_north_t_test_f##
## One Sample t-test
##
## data: female_white$wage
## t = 16.601, df = 62, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 3.866426 4.925003
## sample estimates:
## mean of x
## 4.395714
Automatic t-tests from the stats package for male:
white_north_t_test_m <- t.test(male_white$wage)
white_north_t_test_m##
## One Sample t-test
##
## data: male_white$wage
## t = 14.419, df = 60, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 6.120981 8.092790
## sample estimates:
## mean of x
## 7.106885
Manual approach:
Firstly, the standard deviation for female and male is calculated.
sd_f <- sd(female_white$wage)
sd_m <- sd(male_white$wage)Then the standard error for female and male is calculated by the formula: \[ SE = \frac{σ}{\sqrt{n}} \] z is read from the statistical table for the 95% confidence level. Lower and upper bound of the interval are calculated by the formula: \[ L = μ - Z*SE \\ U = μ + Z*SE \]
n_f <- nrow(female_white)
n_m <- nrow(male_white)
se_f <- sd_f / sqrt(n_f)
se_m <- sd_m / sqrt(n_m)
z95 <- 1.96Interval estimation for female:
lower_f <- mean_female - z95 * se_f
upper_f <- mean_female + z95 * se_f
c(lower_f, upper_f)## [1] 3.876745 4.914684
ci_f <- z95 * se_fInterval estimation for male:
lower_m <- mean_male - z95 * se_m
upper_m <- mean_male + z95 * se_m
c(lower_m, upper_m)## [1] 6.140843 8.072928
ci_m <- z95 * se_fPlot of wage estimations with marked conficence intervals:
library(Rmisc)## Warning: package 'Rmisc' was built under R version 4.0.5
## Loading required package: lattice
summ <- summarySE(white_north, measurevar = "wage", groupvars = c("female"))
ggplot(summ, aes(x=female, y=wage, fill=female)) + geom_bar(position = position_dodge(), stat = "identity") +
geom_errorbar(aes(ymin = wage - ci, ymax = wage + ci), width = 0.45) We’re 95% confident that the true average wage of white professionals living on the north by genders lies between the values 3.8767448 and 4.9146838 for female, and 6.1408428 and 8.0729278 for male.
Gender proportion
The proportion of white female to white people in the population will be estimated. First of all p-hat must be calculated using the formula: \[ p̂ = \frac{k}{n} \] where k denotes the group which proportion we calculate and n denotes the size of the whole sample
library(MASS)
n <- nrow(white_north)
k <- nrow(female_white)
p_hat <- k/n
p_hat## [1] 0.5080645
Then the standard error for proportion is calculated using the formula: \[ SE = \sqrt{\frac{p̂(1-p̂)}{n}} \]
se_prop <- sqrt(p_hat*(1-p_hat)/n)With 95% confidence we can say, that proportion of female in the selected population will be within the interval:
E = qnorm(.975)*se_prop
#confidence interval
p_hat + c(-E, E)## [1] 0.420071 0.596058
Which is approved by the test form the ‘stats’ package:
# check by function
prop.test(k,n)##
## 1-sample proportions test with continuity correction
##
## data: k out of n, null probability 0.5
## X-squared = 0.0080645, df = 1, p-value = 0.9284
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4172411 0.5983814
## sample estimates:
## p
## 0.5080645
High education level estimate
high_educ <- white_north %>% filter(educ_level == "high")
k_educ <- nrow(high_educ)
n <- nrow(white_north)
p_hat_educ <- k_educ/n
se_prop_educ <- sqrt(p_hat_educ*(1-p_hat_educ)/n)
E_educ = qnorm(.975)*se_prop
interval_vect <- p_hat_educ + c(-E_educ, E_educ)
interval_vect## [1] 0.09749034 0.27347740
With 95% confidence it can be stated that the proportion of people with high education level within th population of white people living in the north will be within the interval from 0.0974903 to 0.2734774