Just some simple plots and calculations

In this report we take a look at white people living in the north (the north-central region, northcen == 1) and talk a bit about their wages. We use the wage1 dataset from the np package.

library(np)
library(tidyverse)
library(kableExtra)
library(moments)
library(MASS)
library(stats)
library(distributions3)
library(Rmisc)

# data() loads wage1 into the workspace; its return value is only the dataset name, so we do not assign it
data(wage1)
female_white <- wage1 %>% filter(nonwhite == "White", female == "Female", northcen == 1)

male_white <- wage1 %>% filter(nonwhite == "White", female == "Male", northcen == 1)

white_north <- wage1 %>% filter(nonwhite == "White", northcen == 1)

female_white_box <- ggplot(white_north, aes(female, wage, fill=female)) + geom_boxplot(outlier.size = 3) + labs(title="Wages per gender", x="Gender", y="Wage") + theme_bw() + stat_summary(fun=mean, geom="point", shape=1, size=3)
female_white_box

female_white_density <- qplot(wage, data = white_north, geom = "density", color = female, linetype = female, main = "Density of wages per gender", xlab = "Wage", ylab = "Density") + theme_bw()
female_white_density

female_white_histogram <- qplot(wage, data = white_north, geom = "histogram", color = female, fill=female, main = "Histogram of wages per gender", xlab = "Wage", ylab = "Count" ) + theme_bw()
female_white_histogram

All these nice and simple plots tell us that, among white inhabitants of the north, men earn higher wages on average. Even the outliers show a big difference: the highest male wage exceeds the highest female wage by more than 5. Most women are concentrated in the lower part of the wage range, while men are spread more evenly, although the majority of men also sit at the lower end.

attach(white_north)
interval <- (max(wage)-min(wage))/8
wage_limits <- cut(wage, seq(min(wage), max(wage), by = interval))
wage_tab <- table(wage_limits)
wage_tab <- as.data.frame(wage_tab)
kable(wage_tab, format = "html", col.names=c("Wage", "Frequency")) %>% kable_material(c("hover"))

Wage	Frequency
(1.5,4.05]	48
(4.05,6.59]	43
(6.59,9.14]	18
(9.14,11.7]	6
(11.7,14.2]	4
(14.2,16.8]	2
(16.8,19.3]	1
(19.3,21.9]	1

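The frequency table shows that almost the whole sample lies in the 3 lowest intervals: 109 people are there, and just 14 people earn enough to land in the 5 other intervals. The counts add up to 123 rather than 124 because cut() builds intervals that are open on the left by default, so the single lowest wage, which equals the first break, is left out. All in all, white people living in the north are mostly low earners, although the 43 people in the second interval show that a good part of them earn more than the very lowest wages.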
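One detail of cut() is worth a quick check when building such a table: its intervals are open on the left by default, so a value equal to the lowest break is not counted unless include.lowest = TRUE is passed. The sketch below shows this on a small made-up vector (wage_demo and its values are hypothetical, not taken from wage1).

```r
# Hypothetical wages, only for illustration (not from wage1)
wage_demo <- c(1.5, 2.0, 3.5, 6.0, 9.5)

# Four equal-width intervals between the minimum and the maximum
breaks <- seq(min(wage_demo), max(wage_demo), length.out = 5)

# Default: intervals look like (a, b], so the minimum itself becomes NA
open_left <- cut(wage_demo, breaks)

# include.lowest = TRUE closes the first interval on the left: [a, b]
closed_left <- cut(wage_demo, breaks, include.lowest = TRUE)

n_dropped_default <- sum(is.na(open_left))
n_dropped_closed <- sum(is.na(closed_left))

print(table(open_left))
print(table(closed_left))
```

With include.lowest = TRUE every observation lands in some interval, so the frequencies add up to the sample size.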

# white_north is already attached above, so wage can be used directly
white_north_mean_wage <- mean(wage)
white_north_var_wage <- var(wage)
white_north_sd_wage <- sd(wage)
white_north_iqr_wage <- IQR(wage)
range_ <- max(wage) - min(wage)
white_north_median_wage <- median(wage)
white_north_kurtosis_wage <- kurtosis(wage)
white_north_skewness_wage <- skewness(wage)

table <- data.frame(White_north_wages = c("Mean", "Variance", "Standard deviation", "Inter quartile range", "Range", "Median", "Kurtosis", "Skewness"),Value = c(white_north_mean_wage,white_north_var_wage,white_north_sd_wage,white_north_iqr_wage,range_,white_north_median_wage,white_north_kurtosis_wage,white_north_skewness_wage))
kbl(table) %>% kable_material(c("striped"))

White_north_wages	Value
Mean	5.729436
Variance	11.307082
Standard deviation	3.362601
Inter quartile range	3.530000
Range	20.360001
Median	4.685000
Kurtosis	7.911594
Skewness	1.936282

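The mean wage among north whites (about 5.73) isn’t very low, but a standard deviation of about 3.36 is a pretty high value compared to it. The range is about 20.4, so the difference between the lowest and the highest wage is big. The kurtosis of about 7.9 is well above 3, the value for a normal distribution, so the distribution is leptokurtic. The skewness of about 1.94 is positive and the mean is higher than the median (5.73 against 4.69), so the distribution is skewed to the right.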
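To make the two shape measures less of a black box, here is a minimal sketch of how moment-based skewness and kurtosis can be computed in base R. The vector x is made up for illustration, and the formulas are the plain moment ratios (kurtosis without subtracting 3), which we assume match what the moments package reports.

```r
# Hypothetical right-skewed sample (not from wage1)
x <- c(1, 2, 3, 4, 10)

# Central moments of order 2, 3 and 4 (dividing by n, not n - 1)
d <- x - mean(x)
m2 <- mean(d^2)
m3 <- mean(d^3)
m4 <- mean(d^4)

# Moment ratios: skewness > 0 means a longer right tail,
# kurtosis is compared against 3, the value for a normal distribution
skew <- m3 / m2^1.5
kurt <- m4 / m2^2

print(c(skewness = skew, kurtosis = kurt))
```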

T_123 <- StudentsT(df=123)
LCI <- mean(wage) + quantile(T_123, 0.05 / 2) * sd(wage) / sqrt(124)
UCI <- mean(wage) + quantile(T_123, 1-0.05 / 2) * sd(wage) / sqrt(124)
standard_error <- white_north_sd_wage/sqrt(124)

# summarySE() returns, for each gender, N, the mean wage, sd, se and ci (the 95% half-width)
tgc <- summarySE(white_north, measurevar="wage", groupvars="female")

# error bars: mean plus/minus the standard error of each gender's own mean
ggplot(tgc, aes(x=female, y=wage, colour=female, fill=female)) +
  geom_bar(position=position_dodge(), stat="identity")+
    geom_errorbar(aes(ymin=wage-se, ymax=wage+se), width=.2, position=position_dodge(.9)) + labs(title="Standard error gender vs wage", x="Gender", y="Wage")

# error bars: mean plus/minus the ci column of tgc, i.e. each gender's 95% confidence interval
ggplot(tgc, aes(x=female, y=wage, colour=female, fill=female)) +
  geom_bar(position=position_dodge(), stat="identity")+
    geom_errorbar(aes(ymin=wage-ci, ymax=wage+ci), width=.2, position=position_dodge(.9)) + labs(title="Confidence interval gender vs wage", x="Gender", y="Wage")

table2 <- data.frame(Tests = c("Lower confidence interval", "Upper confidence interval", "Standard error"), Value = c(LCI, UCI, standard_error))
kbl(table2) %>% kable_material(c("striped"))

Tests	Value
Lower confidence interval	5.1317035
Upper confidence interval	6.3271675
Standard error	0.3019704

The sample has 124 observations, so the t distribution has 123 degrees of freedom (124-1). With 95% confidence the mean wage of the population lies between about 5.13 and 6.33. The standard error of the mean is about 0.30.
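The interval can be rebuilt from the summary numbers reported in the tables alone, which is a handy sanity check. This sketch uses base R's qt() instead of distributions3; the variable names are ours, and the mean, standard deviation and n are copied from the report.

```r
# Summary statistics copied from the report
n <- 124
mean_wage <- 5.729436
sd_wage <- 3.362601

# Standard error of the mean
se_mean <- sd_wage / sqrt(n)

# Two-sided 95% t quantile with n - 1 degrees of freedom
t_crit <- qt(1 - 0.05 / 2, df = n - 1)

lci <- mean_wage - t_crit * se_mean
uci <- mean_wage + t_crit * se_mean

print(c(lower = lci, upper = uci, se = se_mean))
```

The result should agree with the table above up to rounding of the inputs.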

n <- nrow(white_north)
k <- nrow(female_white)
p_hat <- k/n

se_prop <- sqrt(p_hat*(1-p_hat)/n)

LCI_prop <- p_hat + qnorm(0.05/2)*se_prop
UCI_prop <- p_hat + qnorm(1-0.05/2)*se_prop

table2 <- data.frame(Proportion_tests = c("Lower confidence interval", "Upper confidence interval", "Standard error"), Value = c(LCI_prop, UCI_prop, se_prop))
kbl(table2) %>% kable_material(c("striped"))

Proportion_tests	Value
Lower confidence interval	0.4200710
Upper confidence interval	0.5960580
Standard error	0.0448955

For the proportion of white women among all whites living in the north, the 95% confidence interval is between about 0.42 and 0.60, and the standard error is about 0.045.
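The same normal-approximation interval can be sketched with plain numbers. Here n = 124 comes from the report, while k = 63 women is an assumption: it is not printed anywhere, we only infer it from the midpoint of the reported interval (about 0.508 times 124).

```r
# n is stated in the report; k = 63 is inferred, not reported (assumption)
n <- 124
k <- 63

p_hat <- k / n

# Standard error of a sample proportion
se_prop <- sqrt(p_hat * (1 - p_hat) / n)

# Normal-approximation 95% interval
z <- qnorm(1 - 0.05 / 2)
lci_prop <- p_hat - z * se_prop
uci_prop <- p_hat + z * se_prop

print(c(lower = lci_prop, upper = uci_prop, se = se_prop))
```

Since the interval contains 0.5, these numbers alone do not show that women and men differ in share among north whites.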

That is all, thank you for reading. I really don’t want to write the midterm :(

Maciej Świetlik

14 11 2021