Report_01

Daria Skarbek, 184869

9.11.2021

Introduction to data

The data from wage1 consist of random sample of U.S. population from year 1976.
Each respondent has given information i.a. about hourly wage, years of education and experience, gender, part of the U.S. he/she lives in and type of profession.

Because we do not have access to the population data, we have to use given sample data to estimate population parameters.

Estimating mean

Female professionals

At first, we want to look at data about female professionals (profserv == 1) living in north of U.S.

data1 <- as_tibble(wage1)

female_data1 <- data1 %>% filter(female == "Female", northcen == 1, profserv == 1)

summary(female_data1$wage) #MEAN == 5.147

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   3.480   4.715   5.147   6.228  14.580

Let’s look at the data using kableExtra package. Because we are interested only in few variables, we will select only those of interest. We will look at first 10 rows (out of 28).

female_data2 <- female_data1 %>% select(wage, educ, exper, nonwhite, married)
female_data2 <- female_data2 %>% add_column(id = 1:nrow(female_data2), .before = "wage")

kbl(head(female_data2, 10), caption = "Sample data about female professionals from north of U.S") %>%
  kable_minimal() %>%
  column_spec(2, color = "white", background = spec_color(female_data2$wage[1:10], end = 0.8))

Sample data about female professionals from north of U.S
id	wage	educ	exper	nonwhite	married
1	4.75	12	48	White	Married
2	6.25	13	6	White	Married
3	14.58	18	13	White	Notmarried
4	6.22	12	19	White	Married
5	9.00	12	11	White	Married
6	6.53	16	3	White	Notmarried
7	7.60	12	41	White	Notmarried
8	5.00	12	14	White	Notmarried
9	6.18	14	9	White	Married
10	6.25	12	13	White	Married

Now, lets look at the distribution of wages in given sample. Using summary() function we can see range of values: from 1.5 to 14.580, mean = 5.147 as well as median and first and third quantile.
To better visualize, we can plot histogram of wages distribution. By red line we show the mean of sample.

hist(female_data1$wage, main = "Wage distribution of female professionals", xlab = "Hourly wage")
abline(v = mean(female_data1$wage), col = "red", lwd = 3)

Now to estimate the real mean of population we will use 95% confidence interval.

To find 95% confidence interval, we need following values:
- z score
- standard deviation
- standard error
- mean of sample

The we will use following formula to calculate the confidence interval.

\[\mu = \overline{x} \pm (z * SE)\]

Where standard error (SE) is equal to:

\[se = \frac {sd}{\sqrt{n}}\]

z_fem1 <- qnorm(0.975) # value ~ 1.96

n_fem1 <- nrow(female_data1)

sd_fem1 <- sd(female_data1$wage)

se_fem1 <- sd_fem1 / sqrt(n_fem1)

mean_samp_fem1 <- mean(female_data1$wage)

Putting that values to given formula gives us lower and upper limits to confidence interval:

lower <- mean_samp_fem1 - z_fem1 * se_fem1

upper <- mean_samp_fem1 + z_fem1 * se_fem1

z_confidence95_fem1 <- c(lower, upper)
z_confidence95_fem1

## [1] 4.205662 6.089338

This confidence interval shows that we are 95% confident true mean of hourly wage for female professionals falls between 4.206 and 6.089.

However, looking at the size of the sample: n = 28 we should realize that this sample size is not big. n < 30 signalizes we should rather use t - distribution.

For t - distribution we need following values:
- mean of sample
- t-score
- degrees of freedom
- standard deviation
- standard error

Most of these values are already calculated, the only ones remaining are degrees of freedom (n - 1) and t - score.

df <- n_fem1 - 1

t_fem1 <- qt(0.975, df = df)

lower <- mean_samp_fem1 - t_fem1 * se_fem1

upper <- mean_samp_fem1 + t_fem1 * se_fem1

t_confidence95_fem1 <- c(lower, upper)
t_confidence95_fem1

## [1] 4.161517 6.133483

We can clearly see that this interval is wider. That is because t - distribution protects us from using a small sample and getting not proper result.

The real 95% confidence interval for our sample of female professionals is then:

from 4.16152 to 6.13348

Male professionals

Suppose that now we want to see if male’s hourly wages are giving similar values to the female’s.
To do that, we also need to calculate 95% confidence interval for male professionals. First, lets look at first 10 rows of the data and distribution of male’s wages for the same parameters.

male_data1 <- data1 %>% filter(female == "Male", northcen == 1, profserv == 1)

summary(male_data1$wage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.950   5.770   6.500   7.518  10.000  15.380

male_data2 <- male_data1 %>% select(wage, educ, exper, nonwhite, married)
male_data2 <- male_data2 %>% add_column(id = 1:nrow(male_data2), .before = "wage")

kbl(head(male_data2, 10), caption = "Sample data about male professionals from north of U.S") %>%
  kable_minimal() %>%
  column_spec(2, color = "white", background = spec_color(male_data2$wage[1:10], end = 0.8))

Sample data about male professionals from north of U.S
id	wage	educ	exper	nonwhite	married
1	6.88	18	4	White	Notmarried
2	6.00	13	8	White	Married
3	15.38	13	25	White	Married
4	10.00	14	43	White	Married
5	5.77	10	44	White	Married
6	6.25	16	9	White	Married
7	6.88	12	31	White	Married
8	3.75	18	1	White	Notmarried
9	2.95	14	41	Nonwhite	Married
10	6.50	17	3	White	Married

hist(male_data1$wage, main = "Wage distribution for male professionals", xlab = "Hourly wage")
abline(v = mean(male_data1$wage), col = "red", lwd = 3)

Even now we can see that male’s wages distribution differs from female’s. Later, we will see a better comparison on a violin plot.
Now calculating 95% confidence interval:

mean_samp_men1 <- mean(male_data1$wage)

n_men1 <- nrow(male_data1)

sd_men1 <- sd(male_data1$wage)

se_men1 <- sd_men1 / sqrt(n_men1)

z_men1 <- qnorm(0.975)

lower <- mean_samp_men1 - z_men1 * se_men1
upper <- mean_samp_men1 + z_men1 * se_men1

z_confidence95_men1 <- c(lower, upper)
z_confidence95_men1

## [1] 5.571269 9.465654

t_men1 <- qt(0.975, df = n_men1 - 1)

lower <- mean_samp_men1 - t_men1 * se_men1
upper <- mean_samp_men1 + t_men1 * se_men1

t_confidence95_men1 <- c(lower, upper)
t_confidence95_men1

## [1] 5.353846 9.683077

Here we also should rater use STUDENT t - distribution, as the sample size (13) is rather small.

Now seeing the result we can see that the given confidence interval: from 5.35 to 9.68 is much higher than the one computed before - for female professionals.

Lets visualize this confidence intervals on a boxplot:

boxplot(t_confidence95_fem1, t_confidence95_men1, names = c("Female", "Male"), 
        main = "95% confidence interval of wages")

Here we clearly see the difference between estimated population mean - it is much higher for male professionals.

To see how wages are distributed in the sample, we can use a violin plot.

data2 <- data1

data2$female <- as.factor(data1$female)

ggplot(data2, aes(x = female, y = wage, fill = female)) + geom_violin() + 
  geom_boxplot(width = 0.1, fill = "white") + theme_minimal() + 
  labs(title = "Distribution of wages in sample of male and female professionals", x = "", y = "Hourly wage") +
  theme(legend.position = "none")

This diagram allows us to better visualize the sample distribution of wages. Now we can see why we got such result with confidence intervals. Female wages are focused on lower values and don’t reach as high outliers as male’s professionals.

Comparing populations

Lastly, having two estimated means for male and female professionals, we can try to compare populations. We will also use 95% confidence interval.

We do not assume we are dealing with populations that have the same variance:

\[ \delta_1 \ne \delta_2\] and the size of samples is different (one almost twice the size of another).

Hence, having two independent samples with less than 30 observations we will use following formula:

\[(\overline{x}_1 - \overline{x}_2) \pm t * \sqrt{(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2})}\]

Lets apply the formula using our samples:

difference_means <- mean_samp_men1 - mean_samp_fem1
difference_means

## [1] 2.370962

t <- qt(0.975, n_fem1 + n_men1 - 2)

lower <- difference_means - t * sqrt(sd_men1^2 / n_men1 + sd_fem1^2 / n_fem1)

upper <- difference_means + t * sqrt(sd_men1^2 / n_men1 + sd_fem1^2 / n_fem1)

difference_interval95 <- c(lower, upper)
difference_interval95

## [1] 0.1387252 4.6031980

We got an interval from 0.139 to 4.603. Hence, we can be 95% confident that the true difference of means of population (male - female) professionals lies between those values.

Estimating proportion

Female data

Now, lets take a look at different part of data. We want to estimate proportion of female population from north part of U.S. that is not married. For that, we will use the data given in the column ‘married’.

married_female1 <- data1 %>% filter(female == "Female", northcen == 1)

We calculate the proportion of not married females:

n_total_fem1 <- nrow(married_female1)

n_nonmarried_fem1 <- nrow((filter(married_female1, married == "Notmarried")))

phat_nonmarried_fem1 <- n_nonmarried_fem1 / n_total_fem1

Now, we have to check if phat * n > 5 as well as (1 - phat) * n > 5 - whether the sample and calculated proportion fall into the rule of thumb:

n_total_fem1 * phat_nonmarried_fem1

## [1] 30

n_total_fem1 * (1 - phat_nonmarried_fem1)

## [1] 35

Now, having the proportion of non married females from the sample we can use that to estimate a population parameter. Here, we will again use 95% confidence interval.

Because we are dealing with small sample (n < 100), we will use the following formula:

\[phat \pm z * \sqrt{\frac{phat(1-phat)}{n}}\]

z <- qnorm(0.975)

lower <- phat_nonmarried_fem1 - z * sqrt(phat_nonmarried_fem1 * (1 - phat_nonmarried_fem1) / n_total_fem1)

upper <- phat_nonmarried_fem1 + z * sqrt(phat_nonmarried_fem1 * (1 - phat_nonmarried_fem1) / n_total_fem1)

prop_nonmarried_fem1 <- c(lower, upper)
prop_nonmarried_fem1

## [1] 0.3403468 0.5827301

Using this formula gives us a 95% confidence that real proportion of not married females in north U.S. is between 0.3403468 and 0.5827301

Male data

Now we want to see the same in male’s population.

married_male1 <- data1 %>% filter(female == "Male", northcen == 1)

n_total_men1 <- nrow(married_male1)

n_nonmarried_men1 <- nrow((filter(married_male1, married == "Notmarried")))

phat_nonmarried_men1 <- n_nonmarried_men1 / n_total_men1

We have to check the rule of thumb:

n_total_men1 * phat_nonmarried_men1

## [1] 23

n_total_men1 * (1 - phat_nonmarried_men1)

## [1] 44

z <- qnorm(0.975)

lower <- phat_nonmarried_men1 - z * sqrt(phat_nonmarried_men1 * (1 - phat_nonmarried_men1) / n_total_men1)

upper <- phat_nonmarried_men1 + z * sqrt(phat_nonmarried_men1 * (1 - phat_nonmarried_men1) / n_total_men1)

prop_nonmarried_men1 <- c(lower, upper)
prop_nonmarried_men1

## [1] 0.2295926 0.4569746

Here we got a 95% confidence interval between values 0.2295926 and 0.4569746.

Now we want to compare these intervals and visualize that on a boxplot:

boxplot(prop_nonmarried_fem1, prop_nonmarried_men1, names = c("Female", "Male"))

We got a result stating that based on our sample we estimate on average around 45% of females to be not married and only around 35% of males to be unmarried. At first, we might find this result surprising but we have to take into consideration that a lot of married women did not have a job in 1970’s. However, married men were even more likely to have a job because of their family.

Comparing populations

Now, we want to compare the difference between males and females population proportion of non married individuals. We will use again 95% confidence interval.

To compare independent samples that approves the rule of thump (np >= 5 and n(1-p) >= 5) we will use following formula:

\[(\hat{p}_1 - \hat{p}_2) \pm z * \sqrt{\frac{\hat{p}_1*(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2*(1-\hat{p}_2)}{n_2}}\]

difference_prop <- phat_nonmarried_fem1 - phat_nonmarried_men1

z <- qnorm(0.975)

lower <- difference_prop - z * sqrt((phat_nonmarried_fem1 * (1 - phat_nonmarried_fem1) / n_nonmarried_fem1) + (phat_nonmarried_men1 * (1 - phat_nonmarried_men1) / n_nonmarried_men1))

upper <- difference_prop + z * sqrt((phat_nonmarried_fem1 * (1 - phat_nonmarried_fem1) / n_nonmarried_fem1) + (phat_nonmarried_men1 * (1 - phat_nonmarried_men1) / n_nonmarried_men1))

difference_prop95 <- c(lower, upper)
difference_prop95

## [1] -0.1453276  0.3818373

Here we see that our 95% confidence interval gives us values from -0.1453276 to 0.3818373. Hence, we are 95% confident that true difference in the population proportion lies in the given interval. Because the difference in sample proportions was not that high, the estimated difference in population proportion also includes negative value - we are not certain which proportion will be higher, then.