Start the project by clearing the workspace. Then load the R package openxlsx and the data Earnings_and_Height.
rm(list=ls())
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.3.3
id <- "1XKjDOQBJcxwslhwipkJAF2qLNmFW9Bfu"
earn <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(earn)
## 'data.frame': 17870 obs. of 11 variables:
## $ sex : num 0 0 0 0 0 0 0 0 0 0 ...
## $ age : num 48 41 26 37 35 25 29 44 50 38 ...
## $ mrd : num 1 6 1 1 6 6 1 4 6 1 ...
## $ educ : num 13 12 16 16 16 15 16 18 14 12 ...
## $ cworker : num 1 1 1 1 1 1 1 3 2 4 ...
## $ region : num 3 2 1 2 1 4 2 4 3 3 ...
## $ race : num 1 1 1 1 1 1 1 1 1 1 ...
## $ earnings : num 84055 14021 84055 84055 28560 ...
## $ height : num 65 65 60 67 68 63 67 65 67 66 ...
## $ weight : num 133 155 108 150 180 101 150 125 129 110 ...
## $ occupation: num 1 1 1 1 1 1 1 1 1 1 ...
- What is the median value of height in the sample?
median(earn$height)
## [1] 67
- Estimate the average earnings for workers whose height is at most 67 inches.
earn_l67 <- subset(earn, height<=67)$earnings # select the subsample with height<=67
ave_earn_l67 <- mean(earn_l67)
print(ave_earn_l67)
## [1] 44488.44
- Estimate the average earnings for workers whose height is greater than 67 inches.
earn_g67 <- subset(earn, height>67)$earnings # select the subsample with height>67
ave_earn_g67 <- mean(earn_g67)
print(ave_earn_g67)
## [1] 49987.88
- On average, do taller workers earn more than shorter workers? How much more? What is a 95% confidence interval for the difference in average earnings?
To investigate whether taller workers earn more than shorter workers on average, we conduct a t test at the 5% significance level. The null and alternative hypotheses are: \[ H_0: \mu_{g67} - \mu_{l67} = 0 \quad \text{vs} \quad H_a: \mu_{g67} - \mu_{l67} > 0 \] This is an upper-tailed test. The estimated difference in average earnings is 49987.88 - 44488.44 = 5499.44 dollars per year. We compute the t statistic and then the p-value. The p-value prints as 0 (it is too small to be distinguished from zero in floating-point arithmetic), which is well below 5%. Therefore, we reject the null hypothesis and conclude that taller workers earn more than shorter workers on average.
# Step-1: Compute SE for the difference between ave_earn_g67 and ave_earn_l67.
se_earn_g67 <- sqrt(var(earn_g67)/length(earn_g67))
se_earn_l67 <- sqrt(var(earn_l67)/length(earn_l67))
se_diff <- sqrt(se_earn_g67^2 + se_earn_l67^2)
# Step-2: Compute the t statistic.
diff <- ave_earn_g67 - ave_earn_l67
t <- diff / se_diff
# Step-3: Find the p-value.
p_value <- 1-pnorm(t)
print(p_value)
## [1] 0
The 95% CI for the difference in average earnings is:
lower_CI <- diff - qnorm(1 - 0.05/2) * se_diff
upper_CI <- diff + qnorm(1 - 0.05/2) * se_diff
print(c(lower_CI, upper_CI))
## [1] 4706.295 6292.585
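As a cross-check on the manual steps, the same comparison can be run with the built-in Welch test. The sketch below uses simulated stand-ins for earn_g67 and earn_l67 (the names x_tall and x_short and all simulated values are assumptions, not the real data), so only the agreement between the two approaches is meaningful, not the numbers themselves.

```r
# Simulated stand-ins for earn_g67 and earn_l67 (hypothetical values, not the real data).
set.seed(1)
x_tall  <- rnorm(500, mean = 50000, sd = 27000)
x_short <- rnorm(600, mean = 44500, sd = 26000)

# Manual calculation, following Step-1 and Step-2 above.
se_manual <- sqrt(var(x_tall) / length(x_tall) + var(x_short) / length(x_short))
t_manual  <- (mean(x_tall) - mean(x_short)) / se_manual

# Built-in Welch test: unequal variances, upper-tailed alternative.
welch <- t.test(x_tall, x_short, alternative = "greater", var.equal = FALSE)

# The two t statistics agree up to floating-point error.
print(all.equal(unname(welch$statistic), t_manual))

# A two-sided call returns a 95% confidence interval for the difference.
print(t.test(x_tall, x_short, var.equal = FALSE)$conf.int)
```

Note that t.test uses the t distribution with Welch degrees of freedom rather than the normal distribution, so its p-value and interval differ slightly from the normal-based ones; with samples this large the difference is negligible.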
- Construct a scatterplot of annual earnings (Earnings) on height (Height). Notice that the points on the plot fall along horizontal lines. (There are only 23 distinct values of Earnings). Why? (Hint: Carefully read the detailed data description.)
The data documentation states that individual earnings were recorded in 23 brackets, and that every individual in a bracket is assigned the same value, the bracket average. The dataset therefore contains only 23 distinct values of earnings, which is why the points on the plot fall along horizontal lines.
plot(x=earn$height, y=earn$earnings,
main="Annual Earnings vs Height",
xlab="height", ylab="earnings")
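To see how bracketing produces a small number of distinct values, here is a minimal sketch on simulated earnings. The log-normal distribution and the quantile-based bracket boundaries are assumptions made for illustration; the real survey brackets are not reproduced here.

```r
set.seed(2)
# Hypothetical continuous earnings (not the real data).
true_earn <- rlnorm(5000, meanlog = 10.5, sdlog = 0.6)

# Assign each value to one of 23 brackets; quantile cut points are an assumed scheme.
breaks  <- quantile(true_earn, probs = seq(0, 1, length.out = 24))
bracket <- cut(true_earn, breaks = breaks, include.lowest = TRUE)

# Replace each value with the average of its bracket.
reported <- ave(true_earn, bracket, FUN = mean)

# Number of distinct reported values: one per bracket.
print(length(unique(reported)))
```

A scatterplot of reported against any other variable would show the same horizontal banding as the plot above.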
- Run a regression of Earnings on Height. (i) What is the estimated slope? (ii) Use the estimated regression to predict earnings for a worker who is 67 inches tall, for a worker who is 70 inches tall, and for a worker who is 65 inches tall.
Using the function lm, we estimate the intercept and the slope in the linear model: \[ earnings_i = \beta_0 + \beta_1 height_i + u_i. \] The estimated slope is 707.67. Using \(\hat\beta_0\) and \(\hat\beta_1\), the predicted earnings are $46901.26 for a worker who is 67 inches tall, $49024.28 for a worker who is 70 inches tall, and $45485.92 for a worker who is 65 inches tall.
fit <- lm(earnings~height, data=earn)
summary(fit)
##
## Call:
## lm(formula = earnings ~ height, data = earn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47836 -21879 -7976 34323 50599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -512.73 3386.86 -0.151 0.88
## height 707.67 50.49 14.016 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared: 0.01088, Adjusted R-squared: 0.01082
## F-statistic: 196.5 on 1 and 17868 DF, p-value: < 2.2e-16
b0 <- coef(fit)[1]
b1 <- coef(fit)[2]
b0 + b1 * 67
## (Intercept)
## 46901.26
b0 + b1 * 70
## (Intercept)
## 49024.28
b0 + b1 * 65
## (Intercept)
## 45485.92
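The three predictions can also be obtained in one call with predict(). The sketch below runs on simulated data (the data frame sim and its values are hypothetical) so that it is self-contained; with the real data, the same call on fit would be expected to return the values printed above.

```r
set.seed(3)
# Hypothetical data with the same variable names as the exercise (not the real sample).
sim <- data.frame(height = round(rnorm(1000, mean = 67, sd = 4)))
sim$earnings <- 700 * sim$height + rnorm(1000, sd = 25000)

fit_sim <- lm(earnings ~ height, data = sim)

# predict() evaluates the fitted line at each requested height.
new_heights <- data.frame(height = c(67, 70, 65))
pred <- predict(fit_sim, newdata = new_heights)

# Same numbers as the manual b0 + b1 * x calculation.
manual <- coef(fit_sim)[1] + coef(fit_sim)[2] * new_heights$height
print(all.equal(unname(pred), unname(manual)))
```

Using predict() also avoids the stray "(Intercept)" label that the manual calculation inherits from coef(fit)[1].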
- Suppose height were measured in centimeters instead of inches. Answer the following questions about the Earnings on Height (in cm) regression. (i) What is the estimated slope of the regression? (ii) What is the estimated intercept? (iii) What is the \(R^2\)? (iv) What is the standard error of the regression?
Recall that \(1 \text{ cm} = 0.394 \text{ inches}\). The estimated regression in (d), with units shown, is
\[\widehat{earnings}\,(\$) = -512.7\,(\$) + 707.7\,(\$/\text{inch}) \times Height\,(\text{inches}),\]
\[R^2\ (\text{unit free}) = 0.011,\]
\[SER = 26777\,(\$).\]
Note that
\[\begin{aligned}
707.7\,(\$/\text{inch}) \times Height\,(\text{inches})
&= 707.7\,(\$/\text{inch}) \times 0.394\,(\text{inch}/\text{cm}) \times Height\,(\text{cm}) \\
&= 278.8\,(\$/\text{cm}) \times Height\,(\text{cm}).
\end{aligned}\]
So the regression is
\[\widehat{earnings}\,(\$) = -512.7\,(\$) + 278.8\,(\$/\text{cm}) \times Height\,(\text{cm}),\]
\[R^2\ (\text{unit free}) = 0.011,\]
\[SER = 26777\,(\$).\]
Only the slope changes. The intercept, \(R^2\), and SER are unaffected by rescaling the regressor.
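The effect of changing units can be checked numerically. The following sketch fits the regression twice on simulated data, once with height in inches and once in centimeters; the data frame sim and the coefficients used to generate it are assumptions chosen only for illustration.

```r
set.seed(4)
# Hypothetical data (not the real sample).
sim <- data.frame(height_in = rnorm(1000, mean = 67, sd = 4))
sim$earnings  <- -500 + 700 * sim$height_in + rnorm(1000, sd = 25000)
sim$height_cm <- sim$height_in / 0.394   # same conversion factor as in the text

fit_in <- lm(earnings ~ height_in, data = sim)
fit_cm <- lm(earnings ~ height_cm, data = sim)

# Slope in $/cm equals the slope in $/inch times 0.394.
print(c(coef(fit_in)[2] * 0.394, coef(fit_cm)[2]))

# Intercept, R-squared, and SER are the same in both regressions.
print(rbind(
  inches = c(coef(fit_in)[1], R2 = summary(fit_in)$r.squared, SER = summary(fit_in)$sigma),
  cm     = c(coef(fit_cm)[1], R2 = summary(fit_cm)$r.squared, SER = summary(fit_cm)$sigma)
))
```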
- Run a regression of Earnings on Height, using data for female workers only. (i) What is the estimated slope? (ii) A randomly selected woman is 1 inch taller than the average woman in the sample. Would you predict her earnings to be higher or lower than the average earnings for women in the sample? By how much?
The estimated slope is 511.2. A woman who is one inch taller than average is predicted to have earnings that are $511.2 per year higher than average.
earn_female <- subset(earn, sex==0)
fit.female <- lm(earnings~height, data=earn_female)
summary(fit.female)
##
## Call:
## lm(formula = earnings ~ height, data = earn_female)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42748 -22006 -7466 36641 46865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12650.9 6383.7 1.982 0.0475 *
## height 511.2 98.9 5.169 2.4e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26800 on 9972 degrees of freedom
## Multiple R-squared: 0.002672, Adjusted R-squared: 0.002572
## F-statistic: 26.72 on 1 and 9972 DF, p-value: 2.396e-07
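The "higher than average by the slope" interpretation relies on the fact that an OLS line with an intercept passes through the sample means. A minimal sketch on simulated data (the data frame sim and its parameters are hypothetical) illustrates both steps of the argument.

```r
set.seed(5)
# Hypothetical data (not the real female subsample).
sim <- data.frame(height = rnorm(800, mean = 64.5, sd = 2.7))
sim$earnings <- 12000 + 500 * sim$height + rnorm(800, sd = 26000)

fit_sim <- lm(earnings ~ height, data = sim)
h_bar   <- mean(sim$height)

# Predictions at the average height and at one inch above it.
pred <- predict(fit_sim, newdata = data.frame(height = c(h_bar, h_bar + 1)))

# Prediction at the average height equals average earnings.
print(all.equal(unname(pred[1]), mean(sim$earnings)))

# One extra inch raises the prediction by exactly the slope.
print(all.equal(unname(pred[2] - pred[1]), unname(coef(fit_sim)["height"])))
```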
- Repeat (f) for male workers.
The estimated slope is 1306.9. A man who is one inch taller than average is predicted to have earnings that are $1306.9 per year higher than average.
earn_male <- subset(earn, sex==1)
fit.male <- lm(earnings~height, data=earn_male)
summary(fit.male)
##
## Call:
## lm(formula = earnings ~ height, data = earn_male)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50158 -22373 -8118 33091 59228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43130.3 7068.5 -6.102 1.1e-09 ***
## height 1306.9 100.8 12.969 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26670 on 7894 degrees of freedom
## Multiple R-squared: 0.02086, Adjusted R-squared: 0.02074
## F-statistic: 168.2 on 1 and 7894 DF, p-value: < 2.2e-16
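The female and male slopes can also be recovered from a single regression with a height-by-sex interaction. This is an alternative to the two subsample regressions above, not the method used in this solution, and the sketch runs on simulated data (the data frame sim and its parameters are hypothetical) with the same coding, sex equal to 0 for women and 1 for men.

```r
set.seed(6)
n <- 2000
# Hypothetical data (not the real sample); sex: 0 = female, 1 = male.
sim <- data.frame(sex = rbinom(n, 1, 0.45))
sim$height   <- rnorm(n, mean = 64.5 + 5.5 * sim$sex, sd = 2.7)
sim$earnings <- 10000 - 55000 * sim$sex + (500 + 800 * sim$sex) * sim$height +
  rnorm(n, sd = 26000)

# Separate regressions, as in the text.
fit_f <- lm(earnings ~ height, data = subset(sim, sex == 0))
fit_m <- lm(earnings ~ height, data = subset(sim, sex == 1))

# One regression with an interaction term.
fit_int <- lm(earnings ~ height * sex, data = sim)
b <- coef(fit_int)

# Female slope is the height coefficient; male slope adds the interaction.
print(all.equal(unname(b["height"]), unname(coef(fit_f)["height"])))
print(all.equal(unname(b["height"] + b["height:sex"]), unname(coef(fit_m)["height"])))
```

The interaction coefficient is the difference between the male and female slopes, so its t statistic tests whether the two slopes differ.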
- Do you think that height is uncorrelated with other factors that cause earnings? That is, do you think that the regression error term, \(u_i\), has a conditional mean of 0 given Height (\(X_i\))? (You will investigate this more in the Earnings and Height exercises in later chapters.)
Height may be correlated with other factors that cause earnings. For example, height may be correlated with “strength,” and in some occupations, stronger workers may be more productive. If so, the error term \(u_i\) does not have a conditional mean of 0 given height, and the estimated slope picks up part of the effect of these omitted factors. There are many other potential factors that may be correlated with height and cause earnings, and you will investigate some of these in future exercises.
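A small simulation can illustrate how an omitted factor distorts the slope. Everything in this sketch is hypothetical: strength is an invented unobserved variable, and the coefficients are arbitrary values chosen so that strength raises both height and earnings.

```r
set.seed(7)
n <- 5000
strength <- rnorm(n)                               # hypothetical unobserved factor
height   <- 67 + 2 * strength + rnorm(n, sd = 3)   # height is correlated with strength
earnings <- 20000 + 300 * height + 4000 * strength + rnorm(n, sd = 20000)

short <- lm(earnings ~ height)              # omits strength
long  <- lm(earnings ~ height + strength)   # controls for strength

# The slope on height is larger when strength is left out of the regression.
print(c(short = unname(coef(short)["height"]), long = unname(coef(long)["height"])))
```

In this setup the regression that omits strength attributes part of the effect of strength to height, which is the concern raised in the answer above.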