Lab 5: Exercise

E5.1

Use the data set Earnings_and_Height described in Empirical Exercise 4.2 to carry out the following exercises.

Start the project by clearing the workspace. Then load the R package openxlsx and the data Earnings_and_Height.

rm(list=ls()) 
library(openxlsx)

id <- "1XKjDOQBJcxwslhwipkJAF2qLNmFW9Bfu"
earn <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(earn)

## 'data.frame':    17870 obs. of  11 variables:
##  $ sex       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ age       : num  48 41 26 37 35 25 29 44 50 38 ...
##  $ mrd       : num  1 6 1 1 6 6 1 4 6 1 ...
##  $ educ      : num  13 12 16 16 16 15 16 18 14 12 ...
##  $ cworker   : num  1 1 1 1 1 1 1 3 2 4 ...
##  $ region    : num  3 2 1 2 1 4 2 4 3 3 ...
##  $ race      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ earnings  : num  84055 14021 84055 84055 28560 ...
##  $ height    : num  65 65 60 67 68 63 67 65 67 66 ...
##  $ weight    : num  133 155 108 150 180 101 150 125 129 110 ...
##  $ occupation: num  1 1 1 1 1 1 1 1 1 1 ...

Run a regression of Earnings on Height. i. Is the estimated slope statistically significant? ii. Construct a 95% confidence interval for the slope coefficient.

[Ans] According to the two-sided t test, the p-value is less than 0.05, so we reject the null hypothesis \(H_0: \beta_1 = 0\) and conclude that the slope coefficient is statistically significant at the 5% level. The 95% CI for the slope coefficient is \([608.71, 806.64]\).

fit.earn <- lm(earnings~height, data=earn)  #run the linear regression
summary(fit.earn)

## 
## Call:
## lm(formula = earnings ~ height, data = earn)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47836 -21879  -7976  34323  50599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -512.73    3386.86  -0.151     0.88    
## height        707.67      50.49  14.016   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared:  0.01088,    Adjusted R-squared:  0.01082 
## F-statistic: 196.5 on 1 and 17868 DF,  p-value: < 2.2e-16

confint(fit.earn, "height")   #find the 95% CI

##           2.5 %   97.5 %
## height 608.7078 806.6353
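The t statistic and the interval can also be rebuilt by hand from the estimate and its standard error. The sketch below copies the rounded values from the `summary(fit.earn)` output, so its results should agree with `confint()` only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary(fit.earn) output (rounded as printed)
b1  <- 707.67   # estimated slope on height
se1 <- 50.49    # standard error of the slope
df  <- 17868    # residual degrees of freedom

t_stat <- b1 / se1                   # t statistic for H0: beta1 = 0
p_val  <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
crit   <- qt(0.975, df)              # critical value for a 95% interval
ci     <- c(lower = b1 - crit * se1, upper = b1 + crit * se1)

print(round(t_stat, 3))
print(p_val < 0.05)   # TRUE means the slope is significant at the 5% level
print(round(ci, 2))
```

The same three lines (estimate, standard error, degrees of freedom) are all that is needed to repeat the calculation for the women-only and men-only regressions.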

Repeat (a) for women.

[Ans] The slope for women is statistically significant. The 95% CI is \([317.37, 705.08]\).

earn.f <- subset(earn, sex==0)
fit.earn.f <- lm(earnings~height, data=earn.f)  #run the linear regression
summary(fit.earn.f)

## 
## Call:
## lm(formula = earnings ~ height, data = earn.f)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42748 -22006  -7466  36641  46865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12650.9     6383.7   1.982   0.0475 *  
## height         511.2       98.9   5.169  2.4e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26800 on 9972 degrees of freedom
## Multiple R-squared:  0.002672,   Adjusted R-squared:  0.002572 
## F-statistic: 26.72 on 1 and 9972 DF,  p-value: 2.396e-07

confint(fit.earn.f, "height")   #find the 95% CI

##           2.5 %   97.5 %
## height 317.3654 705.0789

Repeat (a) for men.

[Ans] The slope for men is statistically significant. The 95% CI is \([1109.33, 1504.39]\).

earn.m <- subset(earn, sex==1)
fit.earn.m <- lm(earnings~height, data=earn.m)  #run the linear regression
summary(fit.earn.m)

## 
## Call:
## lm(formula = earnings ~ height, data = earn.m)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50158 -22373  -8118  33091  59228 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43130.3     7068.5  -6.102  1.1e-09 ***
## height        1306.9      100.8  12.969  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26670 on 7894 degrees of freedom
## Multiple R-squared:  0.02086,    Adjusted R-squared:  0.02074 
## F-statistic: 168.2 on 1 and 7894 DF,  p-value: < 2.2e-16

confint(fit.earn.m, "height")   #find the 95% CI

##           2.5 %   97.5 %
## height 1109.332 1504.388

Test the null hypothesis that the effect of height on earnings is the same for men and women. (Hint: See Exercise 5.15.)

[Ans] We reject the null and claim that the effect of height on earnings is different for men and women.

The usual way to solve this question is to treat the two slope estimates as coming from independent samples (men and women are different individuals) and test their difference, in the same way as a test for a difference in means. According to (b) and (c), the estimated difference in slopes is \(\hat\beta_{1,male} - \hat\beta_{1,female}=1306.9-511.2=795.7\) and, because the two estimates are independent, its standard error is \(SE(\hat\beta_{1,male} - \hat\beta_{1,female})=\sqrt{SE(\hat\beta_{1,male})^2 + SE(\hat\beta_{1,female})^2}=\sqrt{100.8^2 + 98.9^2}=141.22\). The t statistic is \(795.7/141.22=5.63\) with a p-value close to 0. Therefore, we reject the null hypothesis.
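The arithmetic in this test can be checked with a few lines of R. This is a minimal sketch that uses the rounded estimates printed in (b) and (c) and the large-sample normal approximation for the p-value; the variable names are illustrative.

```r
# Slopes and standard errors copied from the men-only and women-only regressions
b_m <- 1306.9; se_m <- 100.8   # men
b_f <- 511.2;  se_f <- 98.9    # women

diff_hat <- b_m - b_f                  # estimated difference in slopes
se_diff  <- sqrt(se_m^2 + se_f^2)      # valid when the two samples are independent
t_diff   <- diff_hat / se_diff         # t statistic for H0: equal slopes
p_diff   <- 2 * pnorm(-abs(t_diff))    # two-sided p-value, normal approximation

print(round(c(difference = diff_hat, se = se_diff, t = t_diff), 2))
print(p_diff)
```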

Another way to solve this question is to use multiple regression with an interaction term between height and sex. The coefficient on the interaction term is the difference in the effect of height between men and women. We will study this method in Chapter 8.

fit.diff <- lm(earnings~height + sex + I(height*sex), data=earn)  #run the linear regression
summary(fit.diff)

## 
## Call:
## lm(formula = earnings ~ height + sex + I(height * sex), data = earn)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50158 -22006  -7977  34398  59228 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      12650.86    6370.12   1.986   0.0471 *  
## height             511.22      98.69   5.180 2.24e-07 ***
## sex             -55781.20    9529.61  -5.853 4.90e-09 ***
## I(height * sex)    795.64     141.24   5.633 1.79e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26740 on 17866 degrees of freedom
## Multiple R-squared:  0.01346,    Adjusted R-squared:  0.0133 
## F-statistic: 81.26 on 3 and 17866 DF,  p-value: < 2.2e-16
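In the output above, the interaction estimate 795.64 matches the difference 795.7 from the two separate regressions up to rounding. Its standard error, 141.24, differs slightly from 141.22, presumably because the pooled regression estimates a single residual variance for both groups. The sketch below illustrates the equality of the point estimates on simulated data; the data-generating values are hypothetical and are not taken from Earnings_and_Height.

```r
set.seed(42)   # hypothetical simulated data, not the lab data
n        <- 400
sex      <- rbinom(n, 1, 0.45)                    # 1 = male, 0 = female, as in the lab
height   <- rnorm(n, mean = 64 + 6 * sex, sd = 3)
earnings <- 10000 - 50000 * sex + (500 + 800 * sex) * height + rnorm(n, sd = 25000)
sim      <- data.frame(earnings, height, sex)

# Separate regressions by group
fit_f <- lm(earnings ~ height, data = subset(sim, sex == 0))
fit_m <- lm(earnings ~ height, data = subset(sim, sex == 1))

# Pooled regression with an interaction term, written as in the lab
fit_int <- lm(earnings ~ height + sex + I(height * sex), data = sim)

slope_gap   <- unname(coef(fit_m)["height"] - coef(fit_f)["height"])
interaction <- unname(coef(fit_int)[4])           # 4th coefficient is the interaction

print(c(slope_gap = slope_gap, interaction = interaction))
print(isTRUE(all.equal(slope_gap, interaction, tolerance = 1e-6)))
```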

E5.2

Using the data set Growth described in Empirical Exercise 4.1, but excluding the data for Malta, run a regression of Growth on TradeShare.

id <- "1BZAxYZsUtZjeuEugYrHUuHWSlHXZ_4tu"
Growth <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),
                 sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(Growth)

## 'data.frame':    65 obs. of  8 variables:
##  $ country_name : chr  "India" "Argentina" "Japan" "Brazil" ...
##  $ growth       : num  1.915 0.618 4.305 2.93 1.712 ...
##  $ oil          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rgdp60       : num  766 4462 2954 1784 9895 ...
##  $ tradeshare   : num  0.141 0.157 0.158 0.16 0.161 ...
##  $ yearsschool  : num  1.45 4.99 6.71 2.89 8.66 ...
##  $ rev_coups    : num  0.133 0.933 0 0.1 0 ...
##  $ assasinations: num  0.867 1.933 0.2 0.1 0.433 ...

Is the estimated regression slope statistically significant? That is, can you reject the null hypothesis \(H_0: \beta_1 = 0\) vs. a two-sided alternative hypothesis at the 10%, 5%, or 1% significance level?

[Ans] Because the p-value is 0.09, the slope is statistically significant at the 10% significance level but not at the 5% or 1% level.

Growth.noM <- subset(Growth, country_name != "Malta")
fit.noM <- lm(growth~tradeshare, data=Growth.noM)
summary(fit.noM)

## 
## Call:
## lm(formula = growth ~ tradeshare, data = Growth.noM)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4247 -0.9383  0.2091  0.9265  5.3776 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.9574     0.5804   1.650   0.1041  
## tradeshare    1.6809     0.9874   1.702   0.0937 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.789 on 62 degrees of freedom
## Multiple R-squared:  0.04466,    Adjusted R-squared:  0.02925 
## F-statistic: 2.898 on 1 and 62 DF,  p-value: 0.09369

What is the p-value associated with the coefficient’s t-statistic?

[Ans] The p-value is 0.0937 (about 0.09).
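The p-value can be recomputed from the t statistic and compared with each significance level directly. A minimal sketch, using the rounded estimate and standard error printed in `summary(fit.noM)`; the variable names are illustrative.

```r
# Values copied from the summary(fit.noM) output (rounded as printed)
b1  <- 1.6809   # estimated slope on tradeshare
se1 <- 0.9874   # standard error of the slope
df  <- 62       # residual degrees of freedom

t_stat <- b1 / se1
p_val  <- 2 * pt(-abs(t_stat), df)   # two-sided p-value from the t distribution

alpha  <- c("10%" = 0.10, "5%" = 0.05, "1%" = 0.01)
reject <- p_val < alpha              # TRUE where H0: beta1 = 0 is rejected

print(round(p_val, 4))
print(reject)
```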

Construct a 90% confidence interval for \(\beta_1\).

[Ans] The 90% CI is \([0.03, 3.33]\).

confint(fit.noM, "tradeshare", level=0.9)

##                   5 %     95 %
## tradeshare 0.03220284 3.329606
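The interval ties back to the significance result: a 90% interval that excludes 0 corresponds to rejecting at the 10% level, while a 95% interval that includes 0 corresponds to not rejecting at the 5% level. The sketch below rebuilds both intervals from the rounded summary output; `ci_for` is a hypothetical helper, not part of the lab code.

```r
# Values copied from the summary(fit.noM) output (rounded as printed)
b1  <- 1.6809
se1 <- 0.9874
df  <- 62

# Hypothetical helper: two-sided confidence interval at a given level
ci_for <- function(level) {
  crit <- qt(1 - (1 - level) / 2, df)
  c(lower = b1 - crit * se1, upper = b1 + crit * se1)
}

ci90 <- ci_for(0.90)
ci95 <- ci_for(0.95)

print(round(ci90, 4))   # compare with confint(fit.noM, "tradeshare", level=0.9)
print(round(ci95, 4))
print(c(zero_in_90 = ci90["lower"] <= 0, zero_in_95 = ci95["lower"] <= 0))
```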

E5.3

On the text website, http://www.pearsonglobaleditions.com, you will find the data file Birthweight_Smoking, which contains data for a random sample of babies born in Pennsylvania in 1989. The data include the baby’s birth weight together with various characteristics of the mother, including whether she smoked during the pregnancy. A detailed description is given in Birthweight_Smoking_Description, also available on the website. In this exercise, you will investigate the relationship between birth weight and smoking during pregnancy.

id <- "1IL42szr5_GLat_hqY30yJVV_JVHxHEmO"
bw <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),
                 sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(bw)

## 'data.frame':    3000 obs. of  12 variables:
##  $ nprevist   : num  12 5 12 13 9 11 12 10 13 10 ...
##  $ alcohol    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre1    : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ tripre2    : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ tripre3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre0    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ birthweight: num  4253 3459 2920 2600 3742 ...
##  $ smoker     : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ unmarried  : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ educ       : num  12 16 11 17 13 16 14 13 17 14 ...
##  $ age        : num  27 24 23 28 27 33 24 38 29 28 ...
##  $ drinks     : num  0 0 0 0 0 0 0 0 0 0 ...

In the sample: i. What is the average value of Birthweight for all mothers? ii. For mothers who smoke? iii. For mothers who do not smoke?

[Ans] The average value of Birthweight is 3382.934 for all mothers, 3178.832 for mothers who smoke, and 3432.06 for mothers who do not smoke.

mean.bw <- mean(bw$birthweight) #average value of Birthweight for all mothers
print(mean.bw)

## [1] 3382.934

bw.sk <- subset(bw, smoker==1)    # subset of the data for smokers
bw.nsk <- subset(bw, smoker==0)   # subset of the data for non-smokers

m.bw.sk <- mean(bw.sk$birthweight)   # average birth weight for smokers
print(m.bw.sk)

## [1] 3178.832

m.bw.nsk <- mean(bw.nsk$birthweight) # average birth weight for non-smokers
print(m.bw.nsk)

## [1] 3432.06

i. Use the data in the sample to estimate the difference in average birth weight for smoking and nonsmoking mothers. ii. What is the standard error for the estimated difference in (i)? iii. Construct a 95% confidence interval for the difference in the average birth weight for smoking and nonsmoking mothers.

[Ans] The estimated difference is exactly \(\hat\beta_1\) from the regression of birthweight on smoker, \(-253.23\). Its standard error is \(26.95\). Its 95% CI is \([-306.07, -200.38]\).

fit.sk <- lm(birthweight~smoker, data=bw)
summary(fit.sk)

## 
## Call:
## lm(formula = birthweight ~ smoker, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3007.06  -313.06    26.94   366.94  2322.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3432.06      11.87 289.115   <2e-16 ***
## smoker       -253.23      26.95  -9.396   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared:  0.0286, Adjusted R-squared:  0.02828 
## F-statistic: 88.28 on 1 and 2998 DF,  p-value: < 2.2e-16

confint(fit.sk, "smoker")

##            2.5 %    97.5 %
## smoker -306.0736 -200.3831

Run a regression of Birthweight on the binary variable Smoker. i. Explain how the estimated slope and intercept are related to your answers in parts (a) and (b). ii. Explain how the \(SE(\hat\beta_1)\) is related to your answer in b(ii). iii. Construct a 95% confidence interval for the effect of smoking on birth weight.

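[Ans] i. The intercept, \(3432.06\), is the average birth weight for non-smokers (\(Smoker = 0\)) from part (a). The slope, \(-253.23\), is the difference between the average birth weights of smokers (\(Smoker = 1\)) and non-smokers (\(Smoker = 0\)), \(3178.83 - 3432.06\), as in b(i). ii. \(SE(\hat\beta_1) = 26.95\) is the standard error of the estimated difference reported in b(ii), because that answer was obtained from this regression. iii. The 95% CI for the effect of smoking on birth weight is \([-306.07, -200.38]\), the same interval as in b(iii).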
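The link between the dummy-variable regression and the group means can be verified numerically. The sketch below first checks the arithmetic with the sample means reported in part (a), then uses simulated data (hypothetical values, not the Birthweight_Smoking file) to show that the intercept equals the mean for the group coded 0, the slope equals the difference in means, and the default `lm` standard error equals the pooled-variance two-sample standard error. Note that this default is the homoskedasticity-only standard error, so a formula that allows unequal group variances would generally give a slightly different number.

```r
# Arithmetic check with the sample means reported in part (a)
m_smoke <- 3178.832
m_non   <- 3432.06
print(m_smoke - m_non)   # compare with the slope estimate of -253.23

# Simulated illustration (hypothetical data-generating values)
set.seed(7)
n           <- 600
smoker      <- rbinom(n, 1, 0.2)
birthweight <- 3430 - 250 * smoker + rnorm(n, sd = 580)
fit         <- lm(birthweight ~ smoker)

y0 <- birthweight[smoker == 0]   # non-smokers
y1 <- birthweight[smoker == 1]   # smokers

# Pooled-variance standard error of the difference in means
pooled_var <- ((length(y0) - 1) * var(y0) + (length(y1) - 1) * var(y1)) / (n - 2)
se_pooled  <- sqrt(pooled_var * (1 / length(y0) + 1 / length(y1)))
se_reg     <- summary(fit)$coefficients["smoker", "Std. Error"]

print(c(intercept = unname(coef(fit)[1]), mean_nonsmokers = mean(y0)))
print(c(slope = unname(coef(fit)[2]), mean_difference = mean(y1) - mean(y0)))
print(c(se_regression = se_reg, se_pooled = se_pooled))
```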

Do you think smoking is uncorrelated with other factors that cause low birth weight? That is, do you think that the regression error term, say \(u_i\), has a conditional mean of 0 given Smoking (\(X_i\))? (You will investigate this further in Birthweight and Smoking exercises in later chapters.)

[Ans] Give your answer. (In general, the answer is no. It is easy to think of other factors in the error term that are correlated with smoking, such as income and years of education.)
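A small simulation can make the concern concrete. The sketch below is purely illustrative: every number in it is a hypothetical assumption, not an estimate from the Birthweight_Smoking data. It assumes that mothers with more education smoke less and, separately, have heavier babies, so leaving education in the error term makes the error correlated with smoking.

```r
set.seed(123)   # hypothetical simulated data; all parameter values are assumptions
n      <- 5000
educ   <- rnorm(n, mean = 13, sd = 2.5)
# Assumed: the probability of smoking falls as education rises
smoker <- rbinom(n, 1, plogis(-1.4 - 0.5 * (educ - 13)))
# Assumed true effect of smoking is -200; education has its own positive effect
birthweight <- 3400 - 200 * smoker + 40 * (educ - 13) + rnorm(n, sd = 500)

short <- unname(coef(lm(birthweight ~ smoker))["smoker"])          # educ omitted
long  <- unname(coef(lm(birthweight ~ smoker + educ))["smoker"])   # educ included

print(cor(smoker, educ))              # negative by construction
print(c(short = short, long = long))  # the short regression overstates the harm
```

Under these assumptions the regression that omits education attributes part of the education effect to smoking, which is the sense in which the conditional mean of \(u_i\) given smoking is not 0.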