Introduction to data
The data from wage1 consist of random sample of U.S. population from year 1976.
Each respondent has given information i.a. about hourly wage, years of education and experience, gender, part of the U.S. he/she lives in and type of profession.
Because we do not have access to the population data, we have to use given sample data to estimate population parameters.
Estimating mean
Female professionals
At first, we want to look at data about female professionals (profserv == 1) living in north of U.S.
data1 <- as_tibble(wage1)
female_data1 <- data1 %>% filter(female == "Female", northcen == 1, profserv == 1)
summary(female_data1$wage) #MEAN == 5.147
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.500 3.480 4.715 5.147 6.228 14.580
Let’s look at the data using kableExtra package. Because we are interested only in few variables, we will select only those of interest. We will look at first 10 rows (out of 28).
female_data2 <- female_data1 %>% select(wage, educ, exper, nonwhite, married)
female_data2 <- female_data2 %>% add_column(id = 1:nrow(female_data2), .before = "wage")
kbl(head(female_data2, 10), caption = "Sample data about female professionals from north of U.S") %>%
kable_minimal() %>%
column_spec(2, color = "white", background = spec_color(female_data2$wage[1:10], end = 0.8))
| id | wage | educ | exper | nonwhite | married |
|---|---|---|---|---|---|
| 1 | 4.75 | 12 | 48 | White | Married |
| 2 | 6.25 | 13 | 6 | White | Married |
| 3 | 14.58 | 18 | 13 | White | Notmarried |
| 4 | 6.22 | 12 | 19 | White | Married |
| 5 | 9.00 | 12 | 11 | White | Married |
| 6 | 6.53 | 16 | 3 | White | Notmarried |
| 7 | 7.60 | 12 | 41 | White | Notmarried |
| 8 | 5.00 | 12 | 14 | White | Notmarried |
| 9 | 6.18 | 14 | 9 | White | Married |
| 10 | 6.25 | 12 | 13 | White | Married |
Now, lets look at the distribution of wages in given sample. Using summary() function we can see range of values: from 1.5 to 14.580, mean = 5.147 as well as median and first and third quantile.
To better visualize, we can plot histogram of wages distribution. By red line we show the mean of sample.
hist(female_data1$wage, main = "Wage distribution of female professionals", xlab = "Hourly wage")
abline(v = mean(female_data1$wage), col = "red", lwd = 3)
Now to estimate the real mean of population we will use 95% confidence interval.
To find 95% confidence interval, we need following values:
- z score
- standard deviation
- standard error
- mean of sample
The we will use following formula to calculate the confidence interval.
\[\mu = \overline{x} \pm (z * SE)\]
Where standard error (SE) is equal to:
\[se = \frac {sd}{\sqrt{n}}\]
z_fem1 <- qnorm(0.975) # value ~ 1.96
n_fem1 <- nrow(female_data1)
sd_fem1 <- sd(female_data1$wage)
se_fem1 <- sd_fem1 / sqrt(n_fem1)
mean_samp_fem1 <- mean(female_data1$wage)
Putting that values to given formula gives us lower and upper limits to confidence interval:
lower <- mean_samp_fem1 - z_fem1 * se_fem1
upper <- mean_samp_fem1 + z_fem1 * se_fem1
z_confidence95_fem1 <- c(lower, upper)
z_confidence95_fem1
## [1] 4.205662 6.089338
This confidence interval shows that we are 95% confident true mean of hourly wage for female professionals falls between 4.206 and 6.089.
However, looking at the size of the sample: n = 28 we should realize that this sample size is not big. n < 30 signalizes we should rather use t - distribution.
For t - distribution we need following values:
- mean of sample
- t-score
- degrees of freedom
- standard deviation
- standard error
Most of these values are already calculated, the only ones remaining are degrees of freedom (n - 1) and t - score.
df <- n_fem1 - 1
t_fem1 <- qt(0.975, df = df)
lower <- mean_samp_fem1 - t_fem1 * se_fem1
upper <- mean_samp_fem1 + t_fem1 * se_fem1
t_confidence95_fem1 <- c(lower, upper)
t_confidence95_fem1
## [1] 4.161517 6.133483
We can clearly see that this interval is wider. That is because t - distribution protects us from using a small sample and getting not proper result.
The real 95% confidence interval for our sample of female professionals is then:
Male professionals
Suppose that now we want to see if male’s hourly wages are giving similar values to the female’s.
To do that, we also need to calculate 95% confidence interval for male professionals. First, lets look at first 10 rows of the data and distribution of male’s wages for the same parameters.
male_data1 <- data1 %>% filter(female == "Male", northcen == 1, profserv == 1)
summary(male_data1$wage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.950 5.770 6.500 7.518 10.000 15.380
male_data2 <- male_data1 %>% select(wage, educ, exper, nonwhite, married)
male_data2 <- male_data2 %>% add_column(id = 1:nrow(male_data2), .before = "wage")
kbl(head(male_data2, 10), caption = "Sample data about male professionals from north of U.S") %>%
kable_minimal() %>%
column_spec(2, color = "white", background = spec_color(male_data2$wage[1:10], end = 0.8))
| id | wage | educ | exper | nonwhite | married |
|---|---|---|---|---|---|
| 1 | 6.88 | 18 | 4 | White | Notmarried |
| 2 | 6.00 | 13 | 8 | White | Married |
| 3 | 15.38 | 13 | 25 | White | Married |
| 4 | 10.00 | 14 | 43 | White | Married |
| 5 | 5.77 | 10 | 44 | White | Married |
| 6 | 6.25 | 16 | 9 | White | Married |
| 7 | 6.88 | 12 | 31 | White | Married |
| 8 | 3.75 | 18 | 1 | White | Notmarried |
| 9 | 2.95 | 14 | 41 | Nonwhite | Married |
| 10 | 6.50 | 17 | 3 | White | Married |
hist(male_data1$wage, main = "Wage distribution for male professionals", xlab = "Hourly wage")
abline(v = mean(male_data1$wage), col = "red", lwd = 3)
Even now we can see that male’s wages distribution differs from female’s. Later, we will see a better comparison on a violin plot.
Now calculating 95% confidence interval:
mean_samp_men1 <- mean(male_data1$wage)
n_men1 <- nrow(male_data1)
sd_men1 <- sd(male_data1$wage)
se_men1 <- sd_men1 / sqrt(n_men1)
z_men1 <- qnorm(0.975)
lower <- mean_samp_men1 - z_men1 * se_men1
upper <- mean_samp_men1 + z_men1 * se_men1
z_confidence95_men1 <- c(lower, upper)
z_confidence95_men1
## [1] 5.571269 9.465654
t_men1 <- qt(0.975, df = n_men1 - 1)
lower <- mean_samp_men1 - t_men1 * se_men1
upper <- mean_samp_men1 + t_men1 * se_men1
t_confidence95_men1 <- c(lower, upper)
t_confidence95_men1
## [1] 5.353846 9.683077
Here we also should rater use STUDENT t - distribution, as the sample size (13) is rather small.
Now seeing the result we can see that the given confidence interval: from 5.35 to 9.68 is much higher than the one computed before - for female professionals.
Lets visualize this confidence intervals on a boxplot:
boxplot(t_confidence95_fem1, t_confidence95_men1, names = c("Female", "Male"),
main = "95% confidence interval of wages")
Here we clearly see the difference between estimated population mean - it is much higher for male professionals.
To see how wages are distributed in the sample, we can use a violin plot.
data2 <- data1
data2$female <- as.factor(data1$female)
ggplot(data2, aes(x = female, y = wage, fill = female)) + geom_violin() +
geom_boxplot(width = 0.1, fill = "white") + theme_minimal() +
labs(title = "Distribution of wages in sample of male and female professionals", x = "", y = "Hourly wage") +
theme(legend.position = "none")
This diagram allows us to better visualize the sample distribution of wages. Now we can see why we got such result with confidence intervals. Female wages are focused on lower values and don’t reach as high outliers as male’s professionals.
Comparing populations
Lastly, having two estimated means for male and female professionals, we can try to compare populations. We will also use 95% confidence interval.
We do not assume we are dealing with populations that have the same variance:
\[ \delta_1 \ne \delta_2\] and the size of samples is different (one almost twice the size of another).
Hence, having two independent samples with less than 30 observations we will use following formula:
\[(\overline{x}_1 - \overline{x}_2) \pm t * \sqrt{(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2})}\]
Lets apply the formula using our samples:
difference_means <- mean_samp_men1 - mean_samp_fem1
difference_means
## [1] 2.370962
t <- qt(0.975, n_fem1 + n_men1 - 2)
lower <- difference_means - t * sqrt(sd_men1^2 / n_men1 + sd_fem1^2 / n_fem1)
upper <- difference_means + t * sqrt(sd_men1^2 / n_men1 + sd_fem1^2 / n_fem1)
difference_interval95 <- c(lower, upper)
difference_interval95
## [1] 0.1387252 4.6031980
We got an interval from 0.139 to 4.603. Hence, we can be 95% confident that the true difference of means of population (male - female) professionals lies between those values.
Estimating proportion
Female data
Now, lets take a look at different part of data. We want to estimate proportion of female population from north part of U.S. that is not married. For that, we will use the data given in the column ‘married’.
married_female1 <- data1 %>% filter(female == "Female", northcen == 1)
We calculate the proportion of not married females:
n_total_fem1 <- nrow(married_female1)
n_nonmarried_fem1 <- nrow((filter(married_female1, married == "Notmarried")))
phat_nonmarried_fem1 <- n_nonmarried_fem1 / n_total_fem1
Now, we have to check if phat * n > 5 as well as (1 - phat) * n > 5 - whether the sample and calculated proportion fall into the rule of thumb:
n_total_fem1 * phat_nonmarried_fem1
## [1] 30
n_total_fem1 * (1 - phat_nonmarried_fem1)
## [1] 35
Now, having the proportion of non married females from the sample we can use that to estimate a population parameter. Here, we will again use 95% confidence interval.
Because we are dealing with small sample (n < 100), we will use the following formula:
\[phat \pm z * \sqrt{\frac{phat(1-phat)}{n}}\]
z <- qnorm(0.975)
lower <- phat_nonmarried_fem1 - z * sqrt(phat_nonmarried_fem1 * (1 - phat_nonmarried_fem1) / n_total_fem1)
upper <- phat_nonmarried_fem1 + z * sqrt(phat_nonmarried_fem1 * (1 - phat_nonmarried_fem1) / n_total_fem1)
prop_nonmarried_fem1 <- c(lower, upper)
prop_nonmarried_fem1
## [1] 0.3403468 0.5827301
Using this formula gives us a 95% confidence that real proportion of not married females in north U.S. is between
Male data
Now we want to see the same in male’s population.
married_male1 <- data1 %>% filter(female == "Male", northcen == 1)
n_total_men1 <- nrow(married_male1)
n_nonmarried_men1 <- nrow((filter(married_male1, married == "Notmarried")))
phat_nonmarried_men1 <- n_nonmarried_men1 / n_total_men1
We have to check the rule of thumb:
n_total_men1 * phat_nonmarried_men1
## [1] 23
n_total_men1 * (1 - phat_nonmarried_men1)
## [1] 44
z <- qnorm(0.975)
lower <- phat_nonmarried_men1 - z * sqrt(phat_nonmarried_men1 * (1 - phat_nonmarried_men1) / n_total_men1)
upper <- phat_nonmarried_men1 + z * sqrt(phat_nonmarried_men1 * (1 - phat_nonmarried_men1) / n_total_men1)
prop_nonmarried_men1 <- c(lower, upper)
prop_nonmarried_men1
## [1] 0.2295926 0.4569746
Here we got a 95% confidence interval between values 0.2295926 and 0.4569746.
Now we want to compare these intervals and visualize that on a boxplot:
boxplot(prop_nonmarried_fem1, prop_nonmarried_men1, names = c("Female", "Male"))
We got a result stating that based on our sample we estimate on average around 45% of females to be not married and only around 35% of males to be unmarried. At first, we might find this result surprising but we have to take into consideration that a lot of married women did not have a job in 1970’s. However, married men were even more likely to have a job because of their family.
Comparing populations
Now, we want to compare the difference between males and females population proportion of non married individuals. We will use again 95% confidence interval.
To compare independent samples that approves the rule of thump (np >= 5 and n(1-p) >= 5) we will use following formula:
\[(\hat{p}_1 - \hat{p}_2) \pm z * \sqrt{\frac{\hat{p}_1*(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2*(1-\hat{p}_2)}{n_2}}\]
difference_prop <- phat_nonmarried_fem1 - phat_nonmarried_men1
z <- qnorm(0.975)
lower <- difference_prop - z * sqrt((phat_nonmarried_fem1 * (1 - phat_nonmarried_fem1) / n_nonmarried_fem1) + (phat_nonmarried_men1 * (1 - phat_nonmarried_men1) / n_nonmarried_men1))
upper <- difference_prop + z * sqrt((phat_nonmarried_fem1 * (1 - phat_nonmarried_fem1) / n_nonmarried_fem1) + (phat_nonmarried_men1 * (1 - phat_nonmarried_men1) / n_nonmarried_men1))
difference_prop95 <- c(lower, upper)
difference_prop95
## [1] -0.1453276 0.3818373
Here we see that our 95% confidence interval gives us values from -0.1453276 to 0.3818373. Hence, we are 95% confident that true difference in the population proportion lies in the given interval. Because the difference in sample proportions was not that high, the estimated difference in population proportion also includes negative value - we are not certain which proportion will be higher, then.