In this report we look at white people living in the north (northcen == 1, the north-central region) and discuss their wages. We use the wage1 dataset from the np package.
library(np)
library(tidyverse)
library(kableExtra)
library(moments)
library(MASS)
library(stats)
library(distributions3)
library(Rmisc)
# load the wage1 data frame (shipped with np) into the workspace
data(wage1)
female_white <- wage1 %>% filter(nonwhite == "White", female == "Female", northcen == 1)
male_white <- wage1 %>% filter(nonwhite == "White", female == "Male", northcen == 1)
white_north <- wage1 %>% filter(nonwhite == "White", northcen == 1)
female_white_box <- ggplot(white_north, aes(female, wage, fill=female)) + geom_boxplot(outlier.size = 3) + labs(title="Wages per gender", x="Gender", y="Wage") + theme_bw() + stat_summary(fun=mean, geom="point", shape=1, size=3)
female_white_box
# qplot() is deprecated since ggplot2 3.4.0, so the density plot is built with ggplot()
female_white_density <- ggplot(white_north, aes(wage, color = female, linetype = female)) + geom_density() + labs(title = "Density of wages per gender", x = "Wage", y = "Density") + theme_bw()
female_white_density
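# a histogram shows counts, so the title and y-axis label say so; bins = 30 is the ggplot2 default
female_white_histogram <- ggplot(white_north, aes(wage, color = female, fill = female)) + geom_histogram(bins = 30) + labs(title = "Histogram of wages per gender", x = "Wage", y = "Count") + theme_bw()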
female_white_histogram
These simple plots show that, among white inhabitants of the north, men earn higher wages on average. The outliers differ as well: the highest wage among men exceeds the highest wage among women by more than 5. Most women are concentrated in the lower part of the wage range, while men are spread more evenly, although the majority of them also sit at the lower end.
attach(white_north)
# split the wage range into 8 equal-width intervals
interval <- (max(wage) - min(wage)) / 8
# include.lowest = TRUE keeps the observation equal to min(wage) in the first interval;
# without it that observation becomes NA and the counts sum to 123 instead of 124
wage_limits <- cut(wage, seq(min(wage), max(wage), by = interval), include.lowest = TRUE)
wage_tab <- as.data.frame(table(wage_limits))
kable(wage_tab, format = "html", col.names = c("Wage", "Frequency")) %>% kable_material(c("hover"))
| Wage | Frequency |
|---|---|
| [1.5,4.05] | 49 |
| (4.05,6.59] | 43 |
| (6.59,9.14] | 18 |
| (9.14,11.7] | 6 |
| (11.7,14.2] | 4 |
| (14.2,16.8] | 2 |
| (16.8,19.3] | 1 |
| (19.3,21.9] | 1 |
The frequency table shows that almost the whole sample lies in the 3 lowest intervals: more than 100 people fall there, and only 14 earn enough to land in the 5 higher intervals. This suggests that white people living in the north are not a rich group overall, although the 43 people in the second interval earn enough to get by.
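The binning above relies on cut(). By default cut() builds intervals that are open on the left, so a value equal to the lowest break lands in no interval unless include.lowest = TRUE is set. A minimal sketch of that behaviour, using a hypothetical toy vector x rather than the wage data:

```r
# Hypothetical toy vector, not taken from wage1
x <- c(1, 2, 3, 4, 5)
breaks <- seq(min(x), max(x), by = 2)  # breaks at 1, 3, 5

# Default: intervals are (1,3] and (3,5], so x = 1 falls in no interval (NA)
default_bins <- table(cut(x, breaks))

# include.lowest = TRUE closes the first interval: [1,3] and (3,5]
closed_bins <- table(cut(x, breaks, include.lowest = TRUE))

sum(default_bins)  # one observation fewer than length(x)
sum(closed_bins)   # equals length(x)
```

Comparing the sum of a frequency table with the sample size is a quick way to notice a dropped observation.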
# wage is still available from the earlier attach(white_north)
white_north_mean_wage <- mean(wage)
white_north_var_wage <- var(wage)
white_north_sd_wage <- sd(wage)
white_north_iqr_wage <- IQR(wage)
range_ <- max(wage) - min(wage)
white_north_median_wage <- median(wage)
# moments::kurtosis() returns Pearson's kurtosis (a normal distribution gives 3)
white_north_kurtosis_wage <- kurtosis(wage)
white_north_skewness_wage <- skewness(wage)
# named summary_table so that base::table() is not masked
summary_table <- data.frame(
  White_north_wages = c("Mean", "Variance", "Standard deviation", "Inter quartile range", "Range", "Median", "Kurtosis", "Skewness"),
  Value = c(white_north_mean_wage, white_north_var_wage, white_north_sd_wage, white_north_iqr_wage, range_, white_north_median_wage, white_north_kurtosis_wage, white_north_skewness_wage))
kbl(summary_table) %>% kable_material(c("striped"))
| White_north_wages | Value |
|---|---|
| Mean | 5.729436 |
| Variance | 11.307082 |
| Standard deviation | 3.362601 |
| Inter quartile range | 3.530000 |
| Range | 20.360001 |
| Median | 4.685000 |
| Kurtosis | 7.911594 |
| Skewness | 1.936282 |
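The mean wage among north whites is 5.73, and the standard deviation of 3.36 is large relative to that mean (about 59% of it). The wages span a range of 20.36, so the gap between the lowest and the highest earner is wide. The kurtosis of 7.91 is well above the value of 3 for a normal distribution, so the distribution is leptokurtic. The skewness of 1.94 is positive, so the distribution is skewed to the right, which agrees with the mean (5.73) being higher than the median (4.69).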
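To make the last two rows of the table easier to read, here is a minimal sketch of the moment-based skewness and kurtosis, assuming the usual definitions with divisor n (the ones used by the moments package). The vector x and the helper central_moment() are hypothetical and only serve as an illustration:

```r
# Hypothetical toy sample, not taken from wage1
x <- c(1, 2, 3, 4, 10)

# k-th central moment with divisor n (illustrative helper)
central_moment <- function(x, k) mean((x - mean(x))^k)

m2 <- central_moment(x, 2)
m3 <- central_moment(x, 3)
m4 <- central_moment(x, 4)

skew <- m3 / m2^(3 / 2)  # positive when the right tail is longer
kurt <- m4 / m2^2        # Pearson's kurtosis; 3 for a normal distribution

c(skewness = skew, kurtosis = kurt)
```

Because the kurtosis here is not the excess kurtosis, it should be compared with 3 rather than with 0.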
n_obs <- nrow(white_north)  # 124 observations
T_123 <- StudentsT(df = n_obs - 1)
standard_error <- white_north_sd_wage / sqrt(n_obs)
# 95% t-based confidence interval for the mean wage
LCI <- mean(wage) + quantile(T_123, 0.05 / 2) * standard_error
UCI <- mean(wage) + quantile(T_123, 1 - 0.05 / 2) * standard_error
# summarySE() returns per-gender N, mean wage, sd, se and ci (95% half-width)
tgc <- summarySE(white_north, measurevar = "wage", groupvars = "female")
# error bars use the per-gender se from tgc, not the standard error of the whole sample
ggplot(tgc, aes(x = female, y = wage, colour = female, fill = female)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = wage - se, ymax = wage + se), width = .2, position = position_dodge(.9)) +
  labs(title = "Mean wage by gender with standard-error bars", x = "Gender", y = "Wage")
# ci is the per-gender column of tgc
ggplot(tgc, aes(x = female, y = wage, colour = female, fill = female)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = wage - ci, ymax = wage + ci), width = .2, position = position_dodge(.9)) +
  labs(title = "Mean wage by gender with 95% confidence intervals", x = "Gender", y = "Wage")
table2 <- data.frame(Tests = c("Lower confidence interval", "Upper confidence interval", "Standard error"), Value = c(LCI, UCI, standard_error))
kbl(table2) %>% kable_material(c("striped"))
| Tests | Value |
|---|---|
| Lower confidence interval | 5.1317035 |
| Upper confidence interval | 6.3271675 |
| Standard error | 0.3019704 |
With 124 observations, the t distribution used here has 123 degrees of freedom (124 - 1). The 95% confidence interval for the population mean wage runs from 5.13 to 6.33, and the standard error of the mean is 0.30.
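The interval can be checked by hand from the summary statistics alone. The sketch below uses base R's qt() in place of the distributions3 object; the variable names are illustrative, and the three input numbers are copied from the tables in this report:

```r
# Summary statistics copied from the tables in this report
x_bar <- 5.729436  # sample mean of wage
s     <- 3.362601  # sample standard deviation of wage
n     <- 124       # sample size

se     <- s / sqrt(n)                   # standard error of the mean
t_crit <- qt(1 - 0.05 / 2, df = n - 1)  # two-sided 95% critical value, 123 df

ci_mean <- c(lower = x_bar - t_crit * se, upper = x_bar + t_crit * se)
round(ci_mean, 4)
```

The critical value is slightly larger than the normal 1.96, so the t interval is a little wider than a normal-approximation interval would be.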
n <- nrow(white_north)   # all white inhabitants of the north
k <- nrow(female_white)  # women among them
p_hat <- k / n           # sample proportion of women
se_prop <- sqrt(p_hat * (1 - p_hat) / n)
# 95% normal-approximation confidence interval for the proportion
LCI_prop <- p_hat + qnorm(0.05 / 2) * se_prop
UCI_prop <- p_hat + qnorm(1 - 0.05 / 2) * se_prop
table2 <- data.frame(Proportion_tests = c("Lower confidence interval", "Upper confidence interval", "Standard error"), Value = c(LCI_prop, UCI_prop, se_prop))
kbl(table2) %>% kable_material(c("striped"))
| Proportion_tests | Value |
|---|---|
| Lower confidence interval | 0.4200710 |
| Upper confidence interval | 0.5960580 |
| Standard error | 0.0448955 |
For the proportion of women among all white inhabitants of the north, the 95% confidence interval runs from 0.42 to 0.60 and the standard error is 0.045. The interval contains 0.5, so the sample does not show that either gender is in the majority.
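The same normal-approximation interval can be wrapped in a small function. This is only a sketch: wald_ci() is a hypothetical helper, n = 124 comes from the text, and k = 63 is an assumption inferred from the midpoint of the reported interval (0.508 * 124), not a count printed in the report:

```r
# n is taken from the text; k = 63 is inferred from the interval midpoint (assumption)
k <- 63
n <- 124

# Normal-approximation (Wald) confidence interval for a proportion
wald_ci <- function(k, n, conf_level = 0.95) {
  p_hat <- k / n
  se <- sqrt(p_hat * (1 - p_hat) / n)
  z <- qnorm(1 - (1 - conf_level) / 2)
  c(estimate = p_hat, se = se, lower = p_hat - z * se, upper = p_hat + z * se)
}

prop_ci <- wald_ci(k, n)
round(prop_ci, 4)
```

With these counts the function reproduces the limits in the table. For proportions close to 0 or 1, or for small samples, an interval such as the one returned by prop.test() may be a safer choice.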
That is all, thank you for reading. I really don't want to write the midterm :(