Lab report

Exercises:

Exercise 1:

##        n
## 1 0.4215
  1. What percent of the original sample (acs) are employed?

About 42.2% of the original sample (a proportion of 0.4215) are employed.

Exercise 2:

## # A tibble: 2 x 4
##   gender   xbar      s     n
##   <fct>   <dbl>  <dbl> <int>
## 1 male   55887. 68768.   470
## 2 female 29244. 32026.   373
  1. At a first glance how do the average incomes of males and females compare? Make sure to include the visualization and the summary statistics in your answer, and discuss/ interpret them.

At first glance, males appear to have a higher average income than females. The boxplot shows that the median income of males is higher than that of females, and that male incomes are more spread out, with more possible outliers at the upper end. The summary statistics show that females have a much lower mean income than males (about $29,244 vs. $55,887) and a smaller standard deviation (about $32,026 vs. $68,768). This means there is less variability in females’ incomes than in males’.

Exercise 3:

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_male = 470, y_bar_male = 55887.234, s_male = 68767.8814
## n_female = 373, y_bar_female = 29243.6997, s_female = 32025.9848
## 95% CI (male - female): (19605.3014 , 33681.7672)

  1. Construct a 95% confidence interval for the difference between the average incomes of males and females using the inference function, and interpret this interval.

We are 95% confident that the average income of males is between 19,605.30 and 33,681.77 dollars higher than the average income of females (the interval is for male - female).
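The interval above can be checked by hand from the summary statistics in the output. The following is a minimal sketch in base R, not the code used for the report. It assumes the interval is a two-sample t interval with the conservative degrees of freedom min(n_male, n_female) - 1 = 372, which is the df value reported in the hypothesis test output for Exercise 6. The variable names are chosen here for illustration.

```r
# Summary statistics copied from the inference() output above.
n_male <- 470
ybar_male <- 55887.234
s_male <- 68767.8814
n_female <- 373
ybar_female <- 29243.6997
s_female <- 32025.9848

# Point estimate and standard error of the difference (male - female).
diff_means <- ybar_male - ybar_female
se <- sqrt(s_male^2 / n_male + s_female^2 / n_female)

# Assumption: conservative degrees of freedom, min(n1, n2) - 1.
df <- min(n_male, n_female) - 1
t_star <- qt(0.975, df)  # critical value for 95% confidence

# 95% confidence interval: point estimate +/- margin of error.
ci <- diff_means + c(-1, 1) * t_star * se
print(round(ci, 2))
```

Under these assumptions the result should agree with the reported interval up to rounding of the inputs.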

Exercise 4:

  1. Based on this interval is there a statistically significant difference between the average incomes of men and women? Why, or why not?

Yes, there is a statistically significant difference between the average incomes of men and women. The null hypothesis in this case is that there is no difference between the average incomes of men and women, which corresponds to a difference of 0. Because our 95% confidence interval does not include 0, the equivalent two-sided hypothesis test would give a p-value below 0.05. We would therefore reject the null hypothesis and conclude that the difference is statistically significant.

Exercise 5:

  1. What is the significance level for the equivalent hypothesis test that evaluates whether there is a significant difference between average incomes of men and women.

The confidence level is equal to 1 minus the significance level (alpha). Therefore 0.95 = 1 - alpha, which gives alpha = 0.05 as the significance level for the equivalent two-sided hypothesis test.

Exercise 6:

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_male = 470, y_bar_male = 55887.234, s_male = 68767.8814
## n_female = 373, y_bar_female = 29243.6997, s_female = 32025.9848
## H0: mu_male =  mu_female
## HA: mu_male != mu_female
## t = 7.4437, df = 372
## p_value = < 0.0001

  1. Conduct this hypothesis test using the inference function, and interpret your results in context of the data and the research question. Do your results from the confidence interval and the hypothesis test agree?

H0: There is no difference between the mean incomes of males and females.

HA: There is a difference between the mean incomes of males and females.

Our p-value is less than 0.0001, which is smaller than our significance level of 0.05. We reject the null hypothesis in favor of the alternative hypothesis and conclude that there is a significant difference between the average incomes of males and females.

These results agree with the results of our confidence interval: the 95% interval does not contain 0, and the test rejects the null hypothesis at the 0.05 level.
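The test statistic and p-value in the output can be reproduced from the summary statistics alone. The sketch below uses a hypothetical helper, `t_test_from_summary`, written for this illustration (it is not part of any package). It assumes unpooled standard errors and df = min(n1, n2) - 1, which matches the df = 372 shown in the output.

```r
# Hypothetical helper: two-sample t test from summary statistics.
# Assumes unpooled standard errors and df = min(n1, n2) - 1.
t_test_from_summary <- function(n1, ybar1, s1, n2, ybar2, s2) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)      # standard error of the difference
  t_stat <- (ybar1 - ybar2) / se         # null value of the difference is 0
  df <- min(n1, n2) - 1
  p_value <- 2 * pt(-abs(t_stat), df)    # two-sided p-value
  list(t = t_stat, df = df, p_value = p_value)
}

# Males vs. females, values copied from the output above.
res <- t_test_from_summary(470, 55887.234, 68767.8814,
                           373, 29243.6997, 32025.9848)
print(res)
```

The same helper can be applied to the summary statistics of any of the other two-group comparisons in this report.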

Exercise 7:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   35.00   40.00   38.93   42.00   99.00
  1. Create a bar plot of the distribution of the emp_type variable, and also include the summary statistics you calculated above in your answer. What percent of the sample are full time and what percent are part time employees?

The summary statistics for hours worked in the overall sample show a median of 40 and a mean of 38.93. The bar plot shows that most males are full time rather than part time (81% vs. 19%), while females are more evenly divided between full time and part time (57% vs. 43%). Overall, 71% of employees are full time and 29% are part time.

Exercise 8:

  1. Are females more heavily represented among full time employees or part time employees?
## # A tibble: 4 x 4
## # Groups:   emp_type [2]
##   emp_type  gender     n   pct
##   <chr>     <fct>  <int> <dbl>
## 1 full time male     383 0.643
## 2 full time female   213 0.357
## 3 part time male      87 0.352
## 4 part time female   160 0.648

Females are more heavily represented among part-time employees, being about 65% of that workforce. Females are only about 36% of the full-time workforce.
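The percentages above follow directly from the counts in the table. As a minimal base R sketch (the object names are chosen here for illustration), the counts can be placed in a matrix and converted to proportions with `prop.table`:

```r
# Counts copied from the table above (rows: emp_type, columns: gender).
counts <- matrix(c(383, 213,
                    87, 160),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(emp_type = c("full time", "part time"),
                                 gender = c("male", "female")))

# Share of each gender within each employment type (rows sum to 1).
within_emp_type <- prop.table(counts, margin = 1)

# Share of each employment type within each gender (columns sum to 1).
within_gender <- prop.table(counts, margin = 2)

# Overall share of full time and part time employees.
overall_emp_type <- rowSums(counts) / sum(counts)

print(round(within_emp_type, 3))
print(round(within_gender, 3))
print(round(overall_emp_type, 3))
```

The row-wise proportions correspond to the `pct` column in the table, while the column-wise and overall proportions correspond to the full time and part time percentages quoted in Exercise 7.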


On your own:

1:

  1. Create two subsets of the acs_emp dataset: one for full time employees and one for part time employees. No interpretation is needed for this question, just the code is sufficient.
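The code for this step is not shown in the report. One way it could be done is sketched below in base R. The data frame `acs_emp_toy` is a made-up stand-in with invented rows so the example runs on its own; the subset names `acs_ft` and `acs_pt` are also assumptions. The `emp_type` labels "full time" and "part time" are taken from the output in Exercise 8.

```r
# Made-up stand-in for acs_emp; only the emp_type column matters here.
acs_emp_toy <- data.frame(
  income = c(50000, 12000, 64000, 9000),
  emp_type = c("full time", "part time", "full time", "part time"),
  stringsAsFactors = FALSE
)

# One subset per employment type.
acs_ft <- subset(acs_emp_toy, emp_type == "full time")
acs_pt <- subset(acs_emp_toy, emp_type == "part time")

print(nrow(acs_ft))
print(nrow(acs_pt))
```

With the real data, the same two `subset` calls would be applied to `acs_emp` instead of the toy data frame.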

2:

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_male = 383, y_bar_male = 63183.5509, s_male = 68974.109
## n_female = 213, y_bar_female = 39752.1127, s_female = 35288.6675
## H0: mu_male =  mu_female
## HA: mu_male != mu_female
## t = 5.4822, df = 212
## p_value = < 0.0001

H0: There is no difference between the mean incomes of full time male and female employees.

HA: There is a difference between the mean incomes of full time male and female employees.

  1. Use a hypothesis test to evaluate whether there is a difference in average incomes of full time male and female employees. If the difference is significant, also include a confidence interval (at the equivalent confidence level) estimating the magnitude of the average income difference.

Our p-value is less than 0.0001, which is smaller than our significance level of 0.05. We reject the null hypothesis in favor of the alternative hypothesis and conclude that there is a significant difference between the average incomes of full time males and full time females.

Since the difference is significant, we construct a confidence interval at the equivalent confidence level. The hypothesis test used the default significance level of alpha = 0.05, and the confidence level is 1 minus alpha: 1 - 0.05 = 0.95. We therefore construct a 95% confidence interval.

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_male = 383, y_bar_male = 63183.5509, s_male = 68974.109
## n_female = 213, y_bar_female = 39752.1127, s_female = 35288.6675
## 95% CI (male - female): (15006.2634 , 31856.6131)

We are 95% confident that the average income of full time males is between 15,006.26 and 31,856.61 dollars higher than the average income of full time females.

3:

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_male = 87, y_bar_male = 23766.6667, s_male = 58112.1248
## n_female = 160, y_bar_female = 15254.375, s_female = 19859.9366
## H0: mu_male =  mu_female
## HA: mu_male != mu_female
## t = 1.3249, df = 86
## p_value = 0.1887

H0: There is no difference between the mean incomes of part time male and female employees.

HA: There is a difference between the mean incomes of part time male and female employees.

  1. Use a hypothesis test to evaluate whether there is a difference in average incomes of part time male and female employees. If the difference is significant, also include a confidence interval (at the equivalent confidence level) estimating the magnitude of the average income difference.

Our p-value is 0.1887, which is larger than our significance level of 0.05. We fail to reject the null hypothesis: the data do not provide sufficient evidence of a difference between the average incomes of part time males and part time females. Because the difference is not significant, no confidence interval is needed.

4:

  1. What do your findings from these hypothesis test suggest about whether or not working full or part time might be a confounding variable in the relationship between gender and income?

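The findings suggest that working full or part time may be a confounding variable in the relationship between gender and income. Employment type is related to gender (females make up about 65% of part time employees but only about 36% of full time employees) and to income (part time employees of both genders earn less on average than full time employees). Once we compare within employment type, the difference between males and females is no longer statistically significant among part time employees, although it remains significant among full time employees.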
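To make the comparison concrete, the observed differences in mean income (male - female) can be computed from the group means reported in the three outputs above. This is a small base R sketch. The vector name `gaps` and its element names are chosen here for illustration.

```r
# Differences in sample mean income (male - female), from the outputs above.
gaps <- c(
  overall   = 55887.234  - 29243.6997,  # all employees
  full_time = 63183.5509 - 39752.1127,  # full time employees only
  part_time = 23766.6667 - 15254.375    # part time employees only
)

print(round(gaps, 2))
```

The observed difference is smaller within each employment type than in the combined sample, which is the pattern one would expect if employment type accounts for part of the overall difference.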

5:

  1. What type of a test would we use to compare the average salaries across the various race / ethnicity groups in this dataset? Explain your reasoning.

We would use an ANOVA (analysis of variance) test. ANOVA tests whether there is any difference among the means of more than two groups. Race/ethnicity has more than two levels in this dataset, and because we do not know in advance which groups, if any, differ from one another, we need a single test that compares all of the group means at once.

6:

## Response variable: numerical
## Explanatory variable: categorical (4 levels) 
## n_white = 670, y_bar_white = 44491.0448, s_white = 56564.7207
## n_black = 76, y_bar_black = 29953.2895, s_black = 23313.5402
## n_asian = 39, y_bar_asian = 86406.4103, s_asian = 104998.3911
## n_other = 58, y_bar_other = 29648.2759, s_other = 29511.4562
## 
## ANOVA:
##            df           Sum_Sq          Mean_Sq       F  p_value
## race        3 97229184003.6589 32409728001.2196 10.2616 < 0.0001
## Residuals 839 2649854777471.31  3158348960.0373                 
## Total     842 2747083961474.97                                  
## 
## Pairwise tests - t tests with pooled SD:
## # A tibble: 6 x 3
##   group1 group2     p.value
##   <chr>  <chr>        <dbl>
## 1 black  white  0.0329     
## 2 asian  white  0.00000682 
## 3 other  white  0.0540     
## 4 asian  black  0.000000421
## 5 other  black  0.975      
## 6 other  asian  0.00000129

  1. Conduct this hypothesis test using the inference function. Write your hypotheses, and interpret your conclusion in context of the data and the research question.

H0: There is no difference between the mean incomes of employees in different racial/ethnic groups.

HA: The mean income of at least one racial/ethnic group differs from the others.

The ANOVA gives F = 10.26 with a p-value below 0.0001, which is smaller than our significance level of 0.05. We reject the null hypothesis in favor of the alternative hypothesis and conclude that the mean income of at least one racial/ethnic group differs from the others.

The pairwise comparisons with p-values smaller than 0.05 were: (1) black-white, p-value = 0.0329; (2) asian-white, p-value = 0.00000682; (3) asian-black, p-value = 0.000000421; (4) other-asian, p-value = 0.00000129. The other-white (p-value = 0.0540) and other-black (p-value = 0.975) comparisons were not significant at this level.
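The F statistic in the ANOVA table can be recomputed from the sums of squares and degrees of freedom. The base R sketch below does this, and also shows what a Bonferroni-adjusted threshold for the six pairwise comparisons would be. The Bonferroni step is an assumption added here for illustration; the report itself compares each pairwise p-value to 0.05. All object names are chosen for this example.

```r
# Values copied from the ANOVA table above.
ss_race <- 97229184003.6589
df_race <- 3
ss_resid <- 2649854777471.31
df_resid <- 839

# Mean squares, F statistic, and upper-tail p-value.
ms_race <- ss_race / df_race
ms_resid <- ss_resid / df_resid
f_stat <- ms_race / ms_resid
p_value <- pf(f_stat, df_race, df_resid, lower.tail = FALSE)

# Pairwise p-values copied from the output above.
pairwise_p <- c(black_white = 0.0329, asian_white = 0.00000682,
                other_white = 0.0540, asian_black = 0.000000421,
                other_black = 0.975,  other_asian = 0.00000129)

# Assumption: Bonferroni adjustment for choose(4, 2) = 6 comparisons.
alpha_bonf <- 0.05 / choose(4, 2)

print(round(f_stat, 4))
print(p_value)
print(names(pairwise_p)[pairwise_p < alpha_bonf])
```

Under the stricter Bonferroni threshold, only the comparisons involving the asian group stay below the cutoff; the black-white comparison (p-value = 0.0329) does not.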

7:

  1. Pick another numerical variable from the dataset to be your response variable, and also pick a categorical explanatory variable. Conduct the appropriate hypothesis test, using the inference function, to compare means of the response variable across levels of the explanatory variable. Make sure to state your research question, and interpret your conclusion in context of the dataset.

The research question is whether the average time it takes to get to work differs by marital status.

H0: There is no difference in the mean time it takes to get to work between married and non-married employees.

HA: There is a difference in the mean time it takes to get to work between married and non-married employees.

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_yes = 454, y_bar_yes = 26.859, s_yes = 23.6314
## n_no = 329, y_bar_no = 24.8085, s_no = 20.2717
## H0: mu_yes =  mu_no
## HA: mu_yes != mu_no
## t = 1.3023, df = 328
## p_value = 0.1937

Our p-value is 0.1937, which is larger than our significance level of 0.05. We fail to reject the null hypothesis: the data do not provide sufficient evidence of a difference in the average time it takes to get to work between married and non-married employees.