Wages - Point Interval Estimation

Jakub Sochacki

2021-11-14

Introduction

The goal of this report is to show certain insights on wage data consisting of a random sample taken from the U.S. Current Population Survey for the year 1976. Also, the following report will include estimation of the mean wage of white female and male professionals living in the north region. Furthermore, the proportion of white female professionals will be estimated.

Libraries and data setup

#install.packages("np")
#install.packages("tidyverse")
#install.packages("kableExtra")
#install.packages("psych")
#install.packages("moments")
#install.packages("stats")
#install.packages("MASS")
#install.packages("plyr")
#install.packages("classInt")


library(np)
library(tidyverse)
library(kableExtra)
library(classInt)

There are 526 observations on 24 variables in total.

summary(wages)
##       wage             educ           exper           tenure      
##  Min.   : 0.530   Min.   : 0.00   Min.   : 1.00   Min.   : 0.000  
##  1st Qu.: 3.330   1st Qu.:12.00   1st Qu.: 5.00   1st Qu.: 0.000  
##  Median : 4.650   Median :12.00   Median :13.50   Median : 2.000  
##  Mean   : 5.896   Mean   :12.56   Mean   :17.02   Mean   : 5.105  
##  3rd Qu.: 6.880   3rd Qu.:14.00   3rd Qu.:26.00   3rd Qu.: 7.000  
##  Max.   :24.980   Max.   :18.00   Max.   :51.00   Max.   :44.000  
##      nonwhite      female          married        numdep           smsa       
##  Nonwhite: 54   Female:252   Married   :320   Min.   :0.000   Min.   :0.0000  
##  White   :472   Male  :274   Notmarried:206   1st Qu.:0.000   1st Qu.:0.0000  
##                                               Median :1.000   Median :1.0000  
##                                               Mean   :1.044   Mean   :0.7224  
##                                               3rd Qu.:2.000   3rd Qu.:1.0000  
##                                               Max.   :6.000   Max.   :1.0000  
##     northcen         south             west           construc      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.251   Mean   :0.3555   Mean   :0.1692   Mean   :0.04563  
##  3rd Qu.:0.750   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##     ndurman          trcommpu           trade           services     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.1141   Mean   :0.04373   Mean   :0.2871   Mean   :0.1008  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##     profserv         profocc          clerocc          servocc      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.2586   Mean   :0.3669   Mean   :0.1673   Mean   :0.1407  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      lwage            expersq          tenursq       
##  Min.   :-0.6349   Min.   :   1.0   Min.   :   0.00  
##  1st Qu.: 1.2030   1st Qu.:  25.0   1st Qu.:   0.00  
##  Median : 1.5369   Median : 182.5   Median :   4.00  
##  Mean   : 1.6233   Mean   : 473.4   Mean   :  78.15  
##  3rd Qu.: 1.9286   3rd Qu.: 676.0   3rd Qu.:  49.00  
##  Max.   : 3.2181   Max.   :2601.0   Max.   :1936.00

For solving the major concerns of the analysis variables: wage, nonwhite, female, northcen will be used.

New variable for education analysis

wages <- wages %>% mutate(educ_level=cut(educ, breaks=c(min(educ), 11, 15, max(educ)), labels=c("low","middle","high")))

There will be four subsets distinguished from the dataset: white_north - all white people from the sample living in the north region male_white - all white men from the sample living in the north region female_white - all white women from the sample living in the north region north - people living on the north

white_north <- wages %>% filter(nonwhite == "White", northcen == 1)
male_white <- wages %>% filter(nonwhite == "White", female == "Male", northcen == 1)
female_white <- wages %>% filter(nonwhite == "White", female == "Female", northcen == 1)
north <- wages %>% filter(northcen == 1)

summary(white_north)
##       wage             educ           exper           tenure      
##  Min.   : 1.500   Min.   : 8.00   Min.   : 1.00   Min.   : 0.000  
##  1st Qu.: 3.350   1st Qu.:12.00   1st Qu.: 5.00   1st Qu.: 0.000  
##  Median : 4.685   Median :12.00   Median :11.00   Median : 2.000  
##  Mean   : 5.729   Mean   :12.78   Mean   :17.50   Mean   : 5.347  
##  3rd Qu.: 6.880   3rd Qu.:14.00   3rd Qu.:29.25   3rd Qu.: 7.250  
##  Max.   :21.860   Max.   :18.00   Max.   :48.00   Max.   :33.000  
##      nonwhite      female         married       numdep            smsa       
##  Nonwhite:  0   Female:63   Married   :75   Min.   :0.0000   Min.   :0.0000  
##  White   :124   Male  :61   Notmarried:49   1st Qu.:0.0000   1st Qu.:1.0000  
##                                             Median :0.0000   Median :1.0000  
##                                             Mean   :0.8871   Mean   :0.7742  
##                                             3rd Qu.:1.2500   3rd Qu.:1.0000  
##                                             Max.   :5.0000   Max.   :1.0000  
##     northcen     south        west      construc          ndurman      
##  Min.   :1   Min.   :0   Min.   :0   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:1   1st Qu.:0   1st Qu.:0   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :1   Median :0   Median :0   Median :0.00000   Median :0.0000  
##  Mean   :1   Mean   :0   Mean   :0   Mean   :0.04032   Mean   :0.1129  
##  3rd Qu.:1   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1   Max.   :0   Max.   :0   Max.   :1.00000   Max.   :1.0000  
##     trcommpu           trade           services          profserv     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.02419   Mean   :0.3145   Mean   :0.06452   Mean   :0.3145  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##     profocc          clerocc          servocc           lwage       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.4055  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.2090  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :1.5444  
##  Mean   :0.4194   Mean   :0.1532   Mean   :0.1774   Mean   :1.6108  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.9286  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :3.0847  
##     expersq          tenursq         educ_level
##  Min.   :   1.0   Min.   :   0.00   low   :22  
##  1st Qu.:  25.0   1st Qu.:   0.00   middle:79  
##  Median : 121.0   Median :   4.00   high  :23  
##  Mean   : 518.3   Mean   :  81.90              
##  3rd Qu.: 855.8   3rd Qu.:  52.75              
##  Max.   :2304.0   Max.   :1089.00

white_north subset contains of 124 observations.

Wages frequency table

Following frequency table and histogram show the distribution of wages across white_north subset. The number of classes is selected to be 20. The calculated interval is ~ 1.018 [$/hour]

attach(white_north)
hist(wage,breaks=seq(round(min(wage), 2),floor(max(wage)) + round(min(wage), 2) ,by=1), main="Wages frequency in white_north", xlab = "wage [$/h]",col = "darkgoldenrod1")

n_classes <- 20
interval <- (max(wage)-min(wage))/n_classes
wage_limits <- cut(wage, seq(min(wage), max(wage), by = interval))
wage_tab <- table(wage_limits)
wage_tab <- as.data.frame(wage_tab)
kable(wage_tab, format = "html", col.names=c("Wage range", "Frequency"), caption="Wages of white from north frequency table") %>% 
  column_spec(2, color = "white", background = spec_color(wage_tab$Freq[1:10],begin=0.3, end=0.8, option="E")) %>%
  kable_styling(bootstrap_options = "bordered", "hover")
Wages of white from north frequency table
Wage range Frequency
(1.5,2.52] 5
(2.52,3.54] 31
(3.54,4.55] 24
(4.55,5.57] 11
(5.57,6.59] 20
(6.59,7.61] 7
(7.61,8.63] 5
(8.63,9.64] 6
(9.64,10.7] 5
(10.7,11.7] 1
(11.7,12.7] 3
(12.7,13.7] 0
(13.7,14.7] 2
(14.7,15.8] 1
(15.8,16.8] 0
(16.8,17.8] 0
(17.8,18.8] 1
(18.8,19.8] 0
(19.8,20.8] 0
(20.8,21.9] 1

Tabular accuracy index

wage_tai <- classIntervals(wage, n = n_classes, style="fixed", fixedBreaks=seq(min(wage), max(wage), by = interval) )
wage_tai <-jenks.tests(wage_tai)
kable(wage_tai) %>% kable_material_dark() %>% kable_styling(font_size = 24)
x
# classes 20.0000000
Goodness of fit 0.9948936
Tabular accuracy 0.9220722

Conclusion:

Division for 20 classes caused really high tabular accuracy index of 0.922.

Wages distribution

This part will focus on the visual representation of wages distribution within the sample. Firstly let’s have a close look at the histogram with a density line for the whole white_north part of the sample. Below, the histogram shows wages distribution separately for female and male.

qplot(wage, data = white_north, geom = "histogram", color = female) + theme_bw() + labs(x = "Wage [$/h]", y ="Frequency")

Density plot for each gender with marked mean.

library(plyr)
mu <- ddply(white_north, "female", summarise, grp.mean=mean(wage))
p <- ggplot(white_north, aes(x=wage, fill=female)) +
  geom_density(alpha = 0.2)+
  geom_vline(data=mu, aes(xintercept=grp.mean, color=female),
             linetype="dashed", size = 1)
p

Shape measures

Skewness

Positive value of skewness tells us, that we can observe asymmetry towards the left side of the histogram. And in fact the histogram is positively skewed. It can be interpreted as follows: median > mean. We should be aware that skewness result is influenced by the large tail on the left side of the plot. In order to evaluate the shape and tailedness of the distribution there comes handy kurtosis.
Skewness value
Female 2.154914
Male 1.485754

Kurtosis

Negative value of this measure tells us, that the distribution is leptokurtic.

kurtosis_tab <- data.frame(Kurtosis = c("Female", "Male"),value = c( kurtosi(female_white$wage),  kurtosi(male_white$wage)))
kbl(kurtosis_tab) %>% kable_styling(bootstrap_options = "bordered", "hover")
Kurtosis value
Female 6.939859
Male 2.671281

Central tendency measures

mean_female <- mean(female_white$wage)
mean_male <- mean(male_white$wage)

median_female <- median(female_white$wage)
median_male <- median(male_white$wage)

mean_tab <- data.frame(Mean = c("Female", "Male"),value = c(mean_female, mean_male))
kbl(mean_tab) %>% kable_styling(bootstrap_options = "bordered", "hover")
Mean value
Female 4.395714
Male 7.106885
median_tab <- data.frame(Median = c("Female", "Male"),value = c(median_female, median_male))
kbl(median_tab) %>% kable_styling(bootstrap_options = "bordered", "hover")
Median value
Female 3.75
Male 6.25

Conclusion:

Clearly white male living on north subset is characterized by higher mean and median wages than female subset.

Positional measures

quant_female <- quantile(female_white$wage, probs = seq(0,1,0.25))
iqr_female <- quant_female[4]-quant_female[2];
boxplot(quant_female, horizontal=T, main="Quantiles of female wages in $/h", sub=paste("Interquartille range = ", round(iqr_female, 2), " $/h"), medcol="firebrick3")

quant_male <- quantile(male_white$wage, probs = seq(0,1,0.25))
iqr_male <- quant_male[4]-quant_male[2];
boxplot(quant_male, horizontal=T, main="Quantiles of male wages in $/h", sub=paste("Interquartille range = ", round(iqr_male, 2), " $/h"), medcol="cyan3")

First quartile is the value which divides observations into two parts: 25% of smaller values and 75% larger values. Second quartile is equal to the median - marked with the red line on the box plot. Third quartile is the value for which only 25% of observations are larger. Left border indicates 5% percentile, and right border indicates 95%. Interquartile range is calculated in such a way: Q3 - Q1. IQR contains 50% of all observations, so the wider IQR is, the more differentiation of the variable.

Conclusion:

It is visible that male have twice as wide IQR as female. It means that female wages are more concentrated around one value while male wages are more varied. It can be confirmed by the following boxplot:

box1 <- ggplot(white_north, aes(female, wage)) + geom_boxplot() + stat_summary(fun.args = mean, geom="point", shape=20, size=5, color="deeppink4") + labs(x="gender", y = "wage [$/h]")
box1

Education level impact on wages

educ1 <- ggplot(white_north, aes(wage, female, color = female)) + 
  geom_jitter(width = 0.1, height = 0.1) 
educ1 + facet_wrap(~educ_level) + ggtitle("Wage dependant on a gender and a education level") + labs(x="wage [$/h]", y = "gender")

Conclusion:

Obviously the higher education level, the higher wages are observed both for male and female.

Following boxplot represents wages of withe professionals’ living on the north quantiles with median and mean marked. Considered factors to have an impact on the wage are the education level and gender.

wage_by_sex_educ <- ggplot(white_north, aes(female, wage)) + geom_boxplot() + 
  stat_summary(fun.args = mean, geom="point", shape=20, size=5, color="deeppink4")
wage_by_sex_educ + facet_wrap(~educ_level) +ggtitle("Wages by gender and education level", subtitle = "with marked mean wage") + theme(plot.subtitle = element_text(colour = "deeppink4", face="bold")) + labs(x="gender", "wage [$/h]")

Race impact on wages

mu1 <- ddply(north, "nonwhite", summarise, grp.mean=mean(wage))
p1 <- ggplot(north, aes(x=wage, fill=nonwhite)) +
  geom_density(alpha = 0.2)+
  geom_vline(data=mu1, aes(xintercept=grp.mean, color=nonwhite),
             linetype="dashed", size = 1)  + scale_fill_manual(values=c("#70201F", "beige")) + scale_color_manual(values=c("#70201F", "beige")) + theme_dark() + ggtitle("Wages distribution by the race")
p1

Conclusions:

From the plot it can be concluded that white people have higher wages, however, there is a very limited number of observations characterized by the nonwhite variable to be “Nonwhite”.

nrow(filter(north, nonwhite == "Nonwhite"))
## [1] 8
nrow(filter(north, nonwhite == "White"))
## [1] 124

There are just 8 “Nonwhite” observations versus 124 “White” observations. For that reason no further analysis on the race will be conducted.

Mean wage estimation

Stats package estiamtion

Automatic t-tests from the stats package for female:

library(stats)
white_north_t_test_f <- t.test(female_white$wage)
white_north_t_test_f
## 
##  One Sample t-test
## 
## data:  female_white$wage
## t = 16.601, df = 62, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.866426 4.925003
## sample estimates:
## mean of x 
##  4.395714

Automatic t-tests from the stats package for male:

white_north_t_test_m <- t.test(male_white$wage)
white_north_t_test_m
## 
##  One Sample t-test
## 
## data:  male_white$wage
## t = 14.419, df = 60, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  6.120981 8.092790
## sample estimates:
## mean of x 
##  7.106885

Manual approach:

Firstly, the standard deviation for female and male is calculated.

sd_f <- sd(female_white$wage)
sd_m <- sd(male_white$wage)

Then the standard error for female and male is calculated by the formula: \[ SE = \frac{σ}{\sqrt{n}} \] z is read from the statistical table for the 95% confidence level. Lower and upper bound of the interval are calculated by the formula: \[ L = μ - Z*SE \\ U = μ + Z*SE \]

n_f <- nrow(female_white)
n_m <- nrow(male_white)

se_f <- sd_f / sqrt(n_f)
se_m <- sd_m / sqrt(n_m)

z95 <- 1.96

Interval estimation for female:

lower_f <- mean_female - z95 * se_f
upper_f <- mean_female + z95 * se_f
c(lower_f, upper_f)
## [1] 3.876745 4.914684
ci_f <- z95 * se_f

Interval estimation for male:

lower_m <- mean_male - z95 * se_m
upper_m <- mean_male + z95 * se_m
c(lower_m, upper_m)
## [1] 6.140843 8.072928
ci_m <- z95 * se_f

Plot of wage estimations with marked conficence intervals:

library(Rmisc)
## Warning: package 'Rmisc' was built under R version 4.0.5
## Loading required package: lattice
summ <- summarySE(white_north, measurevar = "wage", groupvars = c("female"))

ggplot(summ, aes(x=female, y=wage, fill=female)) + geom_bar(position = position_dodge(), stat = "identity") + 
   geom_errorbar(aes(ymin = wage - ci, ymax = wage + ci), width = 0.45)

We’re 95% confident that the true average wage of white professionals living on the north by genders lies between the values 3.8767448 and 4.9146838 for female, and 6.1408428 and 8.0729278 for male.

Gender proportion

The proportion of white female to white people in the population will be estimated. First of all p-hat must be calculated using the formula: \[ p̂ = \frac{k}{n} \] where k denotes the group which proportion we calculate and n denotes the size of the whole sample

library(MASS)
n <- nrow(white_north)
k <- nrow(female_white)
p_hat <- k/n
p_hat
## [1] 0.5080645

Then the standard error for proportion is calculated using the formula: \[ SE = \sqrt{\frac{p̂(1-p̂)}{n}} \]

se_prop <- sqrt(p_hat*(1-p_hat)/n)

With 95% confidence we can say, that proportion of female in the selected population will be within the interval:

E = qnorm(.975)*se_prop

#confidence interval
p_hat + c(-E, E)
## [1] 0.420071 0.596058

Which is approved by the test form the ‘stats’ package:

# check by function
prop.test(k,n)
## 
##  1-sample proportions test with continuity correction
## 
## data:  k out of n, null probability 0.5
## X-squared = 0.0080645, df = 1, p-value = 0.9284
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4172411 0.5983814
## sample estimates:
##         p 
## 0.5080645

High education level estimate

high_educ <- white_north %>% filter(educ_level == "high")
k_educ <- nrow(high_educ)
n <- nrow(white_north)
p_hat_educ <- k_educ/n

se_prop_educ <- sqrt(p_hat_educ*(1-p_hat_educ)/n)

E_educ = qnorm(.975)*se_prop

interval_vect <- p_hat_educ + c(-E_educ, E_educ)
interval_vect
## [1] 0.09749034 0.27347740

With 95% confidence it can be stated that the proportion of people with high education level within th population of white people living in the north will be within the interval from 0.0974903 to 0.2734774