Lab 6: Exercise

E6.1

Use the Birthweight_Smoking data set introduced in Empirical Exercise E5.3 to answer the following questions.

Start the project by clearing the workspace. Then load the R package openxlsx and the data Birthweight.

rm(list=ls()) 
library(openxlsx)

## Warning: package 'openxlsx' was built under R version 4.3.3

id <- "1IL42szr5_GLat_hqY30yJVV_JVHxHEmO"
bw <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),
                 sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(bw)

## 'data.frame':    3000 obs. of  12 variables:
##  $ nprevist   : num  12 5 12 13 9 11 12 10 13 10 ...
##  $ alcohol    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre1    : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ tripre2    : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ tripre3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre0    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ birthweight: num  4253 3459 2920 2600 3742 ...
##  $ smoker     : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ unmarried  : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ educ       : num  12 16 11 17 13 16 14 13 17 14 ...
##  $ age        : num  27 24 23 28 27 33 24 38 29 28 ...
##  $ drinks     : num  0 0 0 0 0 0 0 0 0 0 ...

Regress Birthweight on Smoker. What is the estimated effect of smoking on birth weight?

[Ans] On average, babies born to smokers weigh 253.2 grams less than babies born to non-smokers.

fit.bw <- lm(birthweight~smoker, data=bw)  #run the linear regression
summary(fit.bw)

## 
## Call:
## lm(formula = birthweight ~ smoker, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3007.06  -313.06    26.94   366.94  2322.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3432.06      11.87 289.115   <2e-16 ***
## smoker       -253.23      26.95  -9.396   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared:  0.0286, Adjusted R-squared:  0.02828 
## F-statistic: 88.28 on 1 and 2998 DF,  p-value: < 2.2e-16
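
With a single binary regressor, the OLS intercept is the mean of the group coded 0 and the slope is the difference between the two group means. The sketch below illustrates this on a small made-up data frame (`toy` and its values are hypothetical, not taken from the Birthweight sample).

```r
# Toy data: hypothetical values, NOT the Birthweight sample
toy <- data.frame(
  birthweight = c(3400, 3500, 3300, 3600, 3100, 3200, 3000),
  smoker      = c(0, 0, 0, 0, 1, 1, 1)
)

fit_toy <- lm(birthweight ~ smoker, data = toy)

# Group means computed directly from the data
mean_nonsmoker <- mean(toy$birthweight[toy$smoker == 0])
mean_smoker    <- mean(toy$birthweight[toy$smoker == 1])

# Intercept should match the non-smoker mean,
# slope should match (smoker mean - non-smoker mean)
intercept_toy <- unname(coef(fit_toy)["(Intercept)"])
slope_toy     <- unname(coef(fit_toy)["smoker"])
print(c(intercept = intercept_toy,
        slope = slope_toy,
        diff_in_means = mean_smoker - mean_nonsmoker))
```

The same check on the real data would compare `coef(fit.bw)` with the two group means of `bw$birthweight`.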

Regress Birthweight on Smoker, Alcohol, and Nprevist.

fit.bw2 <- lm(birthweight~smoker+alcohol+nprevist, data=bw)
summary(fit.bw2)

## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2733.53  -307.57    21.42   358.09  2192.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3051.249     34.016  89.701  < 2e-16 ***
## smoker      -217.580     26.680  -8.155 5.07e-16 ***
## alcohol      -30.491     76.234  -0.400    0.689    
## nprevist      34.070      2.855  11.933  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 570.5 on 2996 degrees of freedom
## Multiple R-squared:  0.07285,    Adjusted R-squared:  0.07192 
## F-statistic: 78.47 on 3 and 2996 DF,  p-value: < 2.2e-16

Using the two conditions in Key Concept 6.1, explain why the exclusion of Alcohol and Nprevist could lead to omitted variable bias in the regression estimated in (a).

[Ans] Smoking may be correlated with both alcohol consumption and the number of prenatal doctor visits, thus satisfying condition (1) in Key Concept 6.1. Moreover, both alcohol consumption and the number of doctor visits may have their own independent effects on birth weight, thus satisfying condition (2) in Key Concept 6.1.

Is the estimated effect of smoking on birth weight substantially different from the regression that excludes Alcohol and Nprevist? Does the regression in (a) seem to suffer from omitted variable bias?

[Ans] The estimated effect is smaller in magnitude in the multiple regression: it falls from about 253 grams to about 218 grams. Given that both conditions in Key Concept 6.1 are likely to be satisfied, the regression in (a) may suffer from omitted variable bias.
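
The two conditions for omitted variable bias can be illustrated with a small simulation. This is only a sketch under assumed values: the data-generating process, the variable names (`visits`, `weight`), and all parameter values below are hypothetical and chosen for illustration, not estimated from the Birthweight data.

```r
set.seed(1)
n <- 5000

smoker <- rbinom(n, size = 1, prob = 0.2)

# Condition 1: the omitted variable is correlated with the included regressor
visits <- 11 - 2 * smoker + rnorm(n, sd = 3)

# Condition 2: the omitted variable is itself a determinant of the outcome
# (assumed true effect of smoking: -200)
weight <- 3000 - 200 * smoker + 30 * visits + rnorm(n, sd = 100)

# "Short" regression omits visits; "long" regression includes it
b_short <- unname(coef(lm(weight ~ smoker))["smoker"])
b_long  <- unname(coef(lm(weight ~ smoker + visits))["smoker"])

print(c(short = b_short, long = b_long))
```

In this setup the short regression attributes part of the effect of fewer visits to smoking, so its coefficient is more negative than the assumed true effect, which mirrors the direction of the change between the two regressions above.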

Jane smoked during her pregnancy, did not drink alcohol, and had 8 prenatal care visits. Use the regression to predict the birth weight of Jane’s child.

[Ans] The predicted birth weight of Jane’s child is 3106.23 grams.

coef.bw2 <- fit.bw2$coefficients
bw.pred.Jane <- coef.bw2[1] + coef.bw2[2]*1 + coef.bw2[3]*0 + coef.bw2[4]*8
print(bw.pred.Jane)

## (Intercept) 
##    3106.228
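
As a cross-check, the same prediction can be reproduced from the rounded coefficients printed in the regression summary. The sketch below hard-codes those printed values, so it agrees with the result above only up to rounding; the vector names `b` and `jane` are introduced here for illustration.

```r
# Coefficients copied from the printed summary of fit.bw2 (rounded)
b <- c(intercept = 3051.249, smoker = -217.580,
       alcohol = -30.491, nprevist = 34.070)

# Jane: smoker, no alcohol, 8 prenatal visits (1 multiplies the intercept)
jane <- c(intercept = 1, smoker = 1, alcohol = 0, nprevist = 8)

pred_jane <- sum(b * jane)
print(round(pred_jane, 1))

# With the fitted model in memory, an equivalent call would be:
# predict(fit.bw2, newdata = data.frame(smoker = 1, alcohol = 0, nprevist = 8))
```

Using `predict` with a `newdata` data frame avoids relying on the position of each coefficient in the vector.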

Compute $R^2$ and $\bar{R}^2$. Why are they so similar?

[Ans] $R^2=0.0729$, $\bar{R}^2=0.0719$. Their closeness arises from the large sample size ($n = 3000$), where the penalty term in $\bar{R}^2$, $(n-1)/(n-k-1)$, equals $(n-1)/(n-4)=1.001$, making $\bar{R}^2$ almost identical to $R^2$.
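
The relationship $\bar{R}^2 = 1 - \frac{n-1}{n-k-1}(1 - R^2)$ can be checked numerically. The sketch below plugs in the values printed in the regression summary ($n = 3000$, $k = 3$, $R^2 = 0.07285$); the variable names are introduced here for illustration.

```r
n  <- 3000      # sample size
k  <- 3         # number of regressors (excluding the intercept)
r2 <- 0.07285   # Multiple R-squared from the printed summary

# Degrees-of-freedom adjustment factor
penalty <- (n - 1) / (n - k - 1)

# Adjusted R-squared
adj_r2 <- 1 - penalty * (1 - r2)

print(c(penalty = penalty, adj_r2 = adj_r2))
```

Because the adjustment factor is so close to 1, the result should agree with the adjusted R-squared reported by `summary(fit.bw2)` up to rounding.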

How should you interpret the coefficient on Nprevist? Does the coefficient measure a causal effect of prenatal visits on birth weight? If not, what does it measure?

[Ans] Nprevist is a control variable. It captures, for example, mother’s access to healthcare and health. Because Nprevist is a control variable, its coefficient does not have a causal interpretation.

An alternative way to control for prenatal visits is to use the binary variables Tripre0 through Tripre3. Regress Birthweight on Smoker, Alcohol, Tripre0, Tripre2, and Tripre3.

fit.bw3 <- lm(birthweight~smoker+alcohol+tripre0+tripre2+tripre3, data=bw)
summary(fit.bw3)

## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + tripre0 + tripre2 + 
##     tripre3, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3029.55  -307.55    31.35   372.45  2401.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3454.55      12.65 273.077  < 2e-16 ***
## smoker       -228.85      27.16  -8.424  < 2e-16 ***
## alcohol       -15.10      77.54  -0.195 0.845613    
## tripre0      -697.97     106.88  -6.531 7.66e-11 ***
## tripre2      -100.84      29.62  -3.404 0.000672 ***
## tripre3      -136.96      59.58  -2.299 0.021595 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 578.7 on 2994 degrees of freedom
## Multiple R-squared:  0.04647,    Adjusted R-squared:  0.04487 
## F-statistic: 29.18 on 5 and 2994 DF,  p-value: < 2.2e-16

Why is Tripre1 excluded from the regression? What would happen if you included it in the regression?

[Ans] Tripre1 is omitted to avoid perfect multicollinearity: $Tripre0+Tripre1+Tripre2+Tripre3=1$, which is the value of the “constant” regressor that determines the intercept. If Tripre0, Tripre1, Tripre2, Tripre3, and the constant term are all included in the regression, R automatically drops one of the dummy variables (here Tripre3, whose coefficient is reported as NA) and runs the regression.

fit.bw4 <- lm(birthweight~smoker+alcohol+tripre0+tripre1+tripre2+tripre3, data=bw)
summary(fit.bw4)

## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + tripre0 + tripre1 + 
##     tripre2 + tripre3, data = bw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3029.55  -307.55    31.35   372.45  2401.29 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3317.59      59.00  56.231  < 2e-16 ***
## smoker       -228.85      27.16  -8.424  < 2e-16 ***
## alcohol       -15.10      77.54  -0.195   0.8456    
## tripre0      -561.01     120.88  -4.641 3.61e-06 ***
## tripre1       136.96      59.58   2.299   0.0216 *  
## tripre2        36.12      64.17   0.563   0.5736    
## tripre3           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 578.7 on 2994 degrees of freedom
## Multiple R-squared:  0.04647,    Adjusted R-squared:  0.04487 
## F-statistic: 29.18 on 5 and 2994 DF,  p-value: < 2.2e-16
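
The dummy variable trap can be reproduced on simulated data. In the sketch below, the group variable `grp`, the dummies `d0` to `d3`, and the outcome `y` are all hypothetical; the point is only that the four dummies sum to the constant, so `lm` cannot estimate all of them together with an intercept.

```r
set.seed(2)
n   <- 40
grp <- rep(0:3, length.out = n)   # four mutually exclusive categories

# One dummy per category; exactly one of them equals 1 for each observation
d0 <- as.numeric(grp == 0)
d1 <- as.numeric(grp == 1)
d2 <- as.numeric(grp == 2)
d3 <- as.numeric(grp == 3)

# Hypothetical outcome
y <- 3000 - 500 * d0 - 100 * d2 - 150 * d3 + rnorm(n, sd = 50)

# Including all four dummies plus the intercept is perfectly collinear
fit_all <- lm(y ~ d0 + d1 + d2 + d3)

print(coef(fit_all))   # the last aliased dummy is reported as NA
print(fit_all$rank)    # rank is below the number of coefficients
```

Dropping any one dummy, or the intercept, removes the collinearity; which one is dropped only changes the reference category.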

The estimated coefficient on Tripre0 is large and negative. What does this coefficient measure? Interpret its value.

[Ans] The estimated coefficient on Tripre0 is $-697.97$. It measures the difference relative to the omitted category: babies born to women who had no prenatal doctor visits ($Tripre0 = 1$) had birth weights that were on average 697.97 grams ($\approx$ 1.5 lbs) lower than babies born to women whose first doctor visit was during the first trimester ($Tripre1 = 1$), holding smoking and alcohol use constant.

Interpret the value of the estimated coefficients on Tripre2 and Tripre3.

[Ans] The estimated coefficients on Tripre2 and Tripre3 are $-100.84$ and $-136.96$, respectively. Babies born to women whose first doctor visit was during the second trimester ($Tripre2 = 1$) had birth weights that were on average $100.84$ grams ($\approx 0.2$ lbs) lower than babies born to women whose first visit was during the first trimester ($Tripre1 = 1$). Babies born to women whose first doctor visit was during the third trimester ($Tripre3 = 1$) had birth weights that were on average $136.96$ grams ($\approx 0.3$ lbs) lower than babies in that same first-trimester reference group.
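
Dummy coefficients are always measured relative to the omitted category, so changing the reference group only shifts them by a constant. The sketch below takes the printed estimates with Tripre1 as the base and re-expresses them with Tripre3 as the base; the vector names are introduced here for illustration, and the grams-per-pound constant is the standard conversion factor.

```r
# Printed estimates with tripre1 as the omitted category (its effect is 0)
b_base1 <- c(intercept = 3454.55, tripre0 = -697.97,
             tripre1 = 0, tripre2 = -100.84, tripre3 = -136.96)

# Re-express everything relative to tripre3 instead
shift <- unname(b_base1["tripre3"])
b_base3 <- c(
  intercept = unname(b_base1["intercept"]) + shift,
  b_base1[c("tripre0", "tripre1", "tripre2")] - shift
)
print(round(b_base3, 2))

# Convert the tripre0 effect from grams to pounds
grams_per_lb <- 453.59237
lbs_tripre0  <- abs(unname(b_base1["tripre0"])) / grams_per_lb
print(round(lbs_tripre0, 2))
```

The shifted values should line up with the estimates R reports when all four dummies are entered and Tripre3 is the one dropped.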

Does the regression in (d) explain a larger fraction of the variance in birth weight than the regression in (b)?

[Ans] No. The regression in (d) explains a smaller fraction of the variance in birth weight than the regression in (b): both $R^2$ ($0.046$ versus $0.073$) and $\bar{R}^2$ ($0.045$ versus $0.072$) are smaller in (d).

E6.2

Using the data set Growth described in Empirical Exercise 4.1, but excluding the data for Malta, carry out the following exercises.

id <- "1BZAxYZsUtZjeuEugYrHUuHWSlHXZ_4tu"
Growth <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),
                 sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
Growth.noM <- subset(Growth, country_name != "Malta")
str(Growth.noM)

## 'data.frame':    64 obs. of  8 variables:
##  $ country_name : chr  "India" "Argentina" "Japan" "Brazil" ...
##  $ growth       : num  1.915 0.618 4.305 2.93 1.712 ...
##  $ oil          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rgdp60       : num  766 4462 2954 1784 9895 ...
##  $ tradeshare   : num  0.141 0.157 0.158 0.16 0.161 ...
##  $ yearsschool  : num  1.45 4.99 6.71 2.89 8.66 ...
##  $ rev_coups    : num  0.133 0.933 0 0.1 0 ...
##  $ assasinations: num  0.867 1.933 0.2 0.1 0.433 ...

Construct a table that shows the sample mean, standard deviation, and minimum and maximum values for the series Growth, TradeShare, YearsSchool, Oil, Rev_Coups, Assassinations, and RGDP60. Include the appropriate units for all entries. (The sample mean and minimum and maximum values of each variable can be computed using summary, the standard deviation can be calculated using sd.)

[Ans]

Variable	Mean	SD	Minimum	Maximum	Units
Growth	1.87	1.82	-2.81	7.16	Percentage Points
Rgdp60	3131.0	2523.0	367.0	9895.0	$1960
Tradeshare	0.542	0.229	0.141	1.128	Unit free
yearsschool	3.96	2.55	0.20	10.07	years
Rev_coups	0.170	0.225	0.000	0.970	Coups per year
Assassinations	0.282	0.494	0.000	2.467	Assassinations per year
Oil	0.00	0.00	0.00	0.00	0–1 Dummy variable

summary(Growth.noM)

##  country_name           growth             oil        rgdp60    
##  Length:64          Min.   :-2.8119   Min.   :0   Min.   : 367  
##  Class :character   1st Qu.: 0.8057   1st Qu.:0   1st Qu.:1144  
##  Mode  :character   Median : 1.9745   Median :0   Median :2028  
##                     Mean   : 1.8691   Mean   :0   Mean   :3131  
##                     3rd Qu.: 2.8283   3rd Qu.:0   3rd Qu.:5180  
##                     Max.   : 7.1569   Max.   :0   Max.   :9895  
##    tradeshare      yearsschool       rev_coups       assasinations   
##  Min.   :0.1405   Min.   : 0.200   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.3847   1st Qu.: 1.880   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.5390   Median : 3.550   Median :0.08333   Median :0.1000  
##  Mean   :0.5424   Mean   : 3.959   Mean   :0.17007   Mean   :0.2819  
##  3rd Qu.:0.6588   3rd Qu.: 5.343   3rd Qu.:0.26667   3rd Qu.:0.2333  
##  Max.   :1.1279   Max.   :10.070   Max.   :0.97037   Max.   :2.4667

sd(Growth.noM$growth)

## [1] 1.816189
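
Rather than calling `sd` one variable at a time, the four statistics can be collected for every column in one step. The sketch below uses a small made-up data frame (`toy`) and a helper named `summarise_column`, both hypothetical and introduced only for illustration.

```r
# Toy data: hypothetical values, NOT the Growth sample
toy <- data.frame(
  growth     = c(1.0, 2.0, 3.0, 6.0),
  tradeshare = c(0.2, 0.4, 0.6, 0.8)
)

# Helper returning the four statistics for one numeric vector
summarise_column <- function(x) {
  c(Mean = mean(x), SD = sd(x), Minimum = min(x), Maximum = max(x))
}

# sapply gives one column per variable; transpose so each variable is a row
stats_table <- t(sapply(toy, summarise_column))
print(round(stats_table, 3))

# On the real data, restrict to the numeric columns first, for example:
# num_cols <- Growth.noM[sapply(Growth.noM, is.numeric)]
# t(sapply(num_cols, summarise_column))
```

The units column still has to be added by hand, since it is not stored in the data.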

Run a regression of Growth on TradeShare, YearsSchool, Rev_Coups, Assassinations, and RGDP60. What is the value of the coefficient on Rev_Coups? Interpret the value of this coefficient. Is it large or small in a real-world sense?

[Ans] The coefficient on Rev_Coups is $-2.15$. An additional coup in a five-year period reduces the average annual growth rate by $(2.15/5) = 0.43$ percentage points over this 25-year period. This means that GDP in 1995 is expected to be approximately $0.43 \times 25 = 10.75\%$ lower. This is a large effect.

fit.growth <- lm(growth~tradeshare+yearsschool+rev_coups+assasinations+rgdp60, data=Growth.noM)
summary(fit.growth)

## 
## Call:
## lm(formula = growth ~ tradeshare + yearsschool + rev_coups + 
##     assasinations + rgdp60, data = Growth.noM)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6897 -0.9459 -0.0565  0.8286  5.1534 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.6268915  0.7830280   0.801  0.42663    
## tradeshare     1.3408193  0.9600631   1.397  0.16786    
## yearsschool    0.5642445  0.1431131   3.943  0.00022 ***
## rev_coups     -2.1504256  1.1185900  -1.922  0.05947 .  
## assasinations  0.3225844  0.4880043   0.661  0.51121    
## rgdp60        -0.0004613  0.0001508  -3.059  0.00336 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.594 on 58 degrees of freedom
## Multiple R-squared:  0.2911, Adjusted R-squared:   0.23 
## F-statistic: 4.764 on 5 and 58 DF,  p-value: 0.001028

Use the regression to predict the average annual growth rate for a country that has average values for all regressors.

[Ans] The predicted growth rate at the mean values for all regressors is 1.87.

#Extract the coefficients from the regression model
coefficients <- coef(fit.growth)

# Create a vector of average values for all regressors
avg_values <- c(1,
                mean(Growth.noM$tradeshare),
                mean(Growth.noM$yearsschool),
                mean(Growth.noM$rev_coups),
                mean(Growth.noM$assasinations),
                mean(Growth.noM$rgdp60)
                )

# Calculate the predicted average annual growth rate
predicted_growth <- sum(coefficients * avg_values)

# Print the predicted average annual growth rate
print(predicted_growth)

## [1] 1.86912
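
The predicted value coincides with the sample mean of growth (1.87) for a reason: when a regression includes an intercept, the OLS fitted line passes through the point of sample means. The sketch below checks this property on simulated data; the data frame `toy` and its coefficients are hypothetical.

```r
set.seed(3)
n  <- 64
x1 <- runif(n)
x2 <- rnorm(n, mean = 4, sd = 2)
y  <- 0.5 + 1.3 * x1 + 0.5 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2)

fit_toy <- lm(y ~ x1 + x2, data = toy)

# One-row data frame holding the sample mean of each regressor
at_means <- as.data.frame(t(colMeans(toy[, c("x1", "x2")])))

# Prediction at the regressor means versus the mean of the outcome
pred_at_means <- unname(predict(fit_toy, newdata = at_means))
print(c(prediction = pred_at_means, mean_y = mean(toy$y)))
```

The two printed numbers should agree up to floating-point error, whatever the simulated coefficients are.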

Repeat (c), but now assume that the country’s value for TradeShare is one standard deviation above the mean.

[Ans] The resulting predicted value is 2.18.

# Create a vector of regressor values: tradeshare one SD above its mean, all others at their means
avg_values_2 <- c(1,
                mean(Growth.noM$tradeshare)+sd(Growth.noM$tradeshare),
                mean(Growth.noM$yearsschool),
                mean(Growth.noM$rev_coups),
                mean(Growth.noM$assasinations),
                mean(Growth.noM$rgdp60)
                )

# Calculate the predicted average annual growth rate
predicted_growth_2 <- sum(coefficients * avg_values_2)

# Print the predicted average annual growth rate
print(predicted_growth_2)

## [1] 2.175273
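
Because only TradeShare changes between the two predictions, their difference should equal the TradeShare coefficient times one standard deviation of TradeShare. The sketch below works through that arithmetic using the numbers printed above; the variable names are introduced here for illustration.

```r
pred_mean    <- 1.86912     # prediction with all regressors at their means
pred_plus_sd <- 2.175273    # prediction with tradeshare one SD above its mean
b_tradeshare <- 1.3408193   # printed coefficient on tradeshare

# Change in predicted growth (percentage points)
change <- pred_plus_sd - pred_mean

# Standard deviation of tradeshare implied by the two predictions
implied_sd <- change / b_tradeshare

print(c(change = change, implied_sd = implied_sd))
```

With the data loaded, the implied value can be compared directly with `sd(Growth.noM$tradeshare)`.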

Why is Oil omitted from the regression? What would happen if it were included?

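[Ans] The variable “oil” has a value of 0 for all 64 countries in the sample. Including it in the regression would cause perfect multicollinearity: a column of zeros is a trivial linear combination of the other regressors (0 times the constant), so the regressor matrix $X$ does not have full column rank and $X'X$ is not invertible. In practice, R would drop the variable and report NA for its coefficient.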
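
The rank problem caused by a regressor with no variation can be seen directly on simulated data. In the sketch below, `x1`, `oil`, and `y` are hypothetical; `oil` is set to zero for every observation to mimic the sample.

```r
set.seed(4)
n   <- 64
x1  <- rnorm(n)
oil <- rep(0, n)                 # no variation: zero for every observation
y   <- 1 + 2 * x1 + rnorm(n)

# Regressor matrix with a constant, x1, and the all-zero column
X <- cbind(const = 1, x1 = x1, oil = oil)

# Three columns, but the rank is lower, so X'X cannot be inverted
rank_X <- qr(X)$rank
print(c(columns = ncol(X), rank = rank_X))

# lm detects the problem and leaves the coefficient on oil undefined
fit_oil <- lm(y ~ x1 + oil)
print(coef(fit_oil))
```

The remaining coefficients are the same as in a regression that leaves `oil` out altogether.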