Problem Set 3

This will be an empirical problem set examining the relationship between an infant’s birth weight and whether the mother is a smoker. You may be reading this in a html file, which was created using RMarkdown. The RMarkdown file used to create this file is posted on CCLE and contains code boxes to help you get started. You may want to load the RMarkdown file in RStudio and work on it directly to obtain your answers and display the code you used along the way.

The data bw_smoking.csv is posted in the same directory as this homework on CCLE. Download the data file bw_smoking.csv into your computer (if you run R in your computer) or upload it into jupyter (if you are running R remotely instead). Loading the data into your workspace is a bit more involved that we have seen in lab because it is a csv file. To load the data: First, make sure that the dataset is in the working directory. In RStudio, you can use the bottom right box to navigate to the folder where you saved the file. To set the working directory to be that folder just click on More and press Set as Working Directory. Second, once the data is in the working directory you should be able to load it using the following code:

Questions

(10 pts) The description of the data is contained in a file PS3 - Birthweight_Smoking_Description.pdf, which is posted in the same directory as this homework on CCLE. Read the description carefully.
- What is the description of the variable birthweight?
- What is the description of the variable smoker?

Birthweight description is Birth weight of infant (in grams)
Smoker description is Indicator equal to one if the mother smoked during pregnancy and zero, otherwise

(10 pts) Once the data is loaded into R, you should see a data frame called bwdata in your workspace.
- How many variables are stored in bwdata?
- What is the average birth weights for the entire sample?
- What is the variance of birth weight for the entire sample?

# You can type your code directly into this code-chunk. The included comments 
# contain some hints to get you started with the questions.

# Part 1: The function `names` can be used learn what is hiding in an object
# Remember the command ?names asks R for help.

names (bwdata) #bwdata has 12 variables

##  [1] "nprevist"    "alcohol"     "tripre1"     "tripre2"     "tripre3"    
##  [6] "tripre0"     "birthweight" "smoker"      "unmarried"   "educ"       
## [11] "age"         "drinks"

# Part 2: The function `mean` can compute the average for you. 
# Remember the command ?mean asks R for help.

mean(bwdata$birthweight, trim = 0, na.rm = FALSE) #mean is 3382.934 grams or 3.382934 kilos

## [1] 3382.934

# Part 3: The function `var` can compute the variance for you. 
# Remember the command ?var asks R for help.

var(bwdata$birthweight, y = NULL, na.rm = FALSE) #variance is 350656.9 grams squared or 35.06569 kilos

## [1] 350656.9

(15 pts) Create a scatterplot of birthweigth (on the y-axis) and smoker (on the x-axis).

# You can type your code directly into this code-chunk. When using `knit` 
# the graph will be included directly in your output file. Remember a good 
# function for scatter-plots is called `plot` (and ?plot will ask R for help)

plot(bwdata$smoker, bwdata$birthweight)

(20 pts) Run a linear regression of the variable birthweight on the variable smoker in the model \[\text{birthweight} = \beta_1 + \beta_2 \text{smoker} + \epsilon\] using the lm function and save the output in an object called regout.
- What is the point estimate \(b_2\) for \(\beta_2\)?
- What are the degrees of freedom in this regression (i.e. \(n-K\))?
- Using the vector of residuals created by the lm function, compute \(\hat \sigma^2\).

# You can type your code directly into this code-chunk. The included comments 
# contain some hints to get you started with the questions.

# Part 1: Remember the `lm` function lets us run linear regression

regout <- lm(bwdata$birthweight ~ bwdata$smoker, data = bwdata)
b_2 <- coef(regout)["bwdata$smoker"]
print(b_2) #point estimate of b2 is -253.2284

## bwdata$smoker 
##     -253.2284

# Part 2: Recall the output of lm has more information than what is displayed.
# Type attributes(regout) or names(regout) to see what is in there. 
# The df.residual inside regout is what we want

lm_residuals <- df.residual(regout)
print(lm_residuals) #2998 degrees of freedom in this regression

## [1] 2998

# Part 3: Be careful if using the var function because it divides by n-1

sigma_hat_sq <- sum(regout$residuals^2) / lm_residuals #The sigma squared (variance of the error term) is 340740.4 grams squared or 34.07404 kilos
print(sigma_hat_sq) #

## [1] 340740.4

(15 pts) Building on the previous question answer the following:
- What is the standard error for \(b_2\)?
- Build a 99\(\%\) confidence interval for \(\beta_2\) using a normal approximation.

# You can type your code directly into this code-chunk. The included comments 
# contain some hints to get you started with the questions.

# Part 1: The `summary` function may be helpful

summary(regout)

## 
## Call:
## lm(formula = bwdata$birthweight ~ bwdata$smoker, data = bwdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3007.06  -313.06    26.94   366.94  2322.94 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3432.06      11.87 289.115   <2e-16 ***
## bwdata$smoker  -253.23      26.95  -9.396   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared:  0.0286, Adjusted R-squared:  0.02828 
## F-statistic: 88.28 on 1 and 2998 DF,  p-value: < 2.2e-16

b2_se <- summary(regout)$coefficients["bwdata$smoker", "Std. Error"]
print(b2_se) #the standard error for b2 is 26.95

## [1] 26.95149

# Part 2: The function `qnorm` can return quantiles of the normal distribution

z_critical <- qnorm(0.995)
lower <- b_2 - z_critical * b2_se
upper <- b_2 + z_critical * b2_se
print(lower) #lower bound is -322.6508

## bwdata$smoker 
##     -322.6508

print(upper) #upper bound is -183.8059

## bwdata$smoker 
##     -183.8059

(15 pts) Create a scatterplot of birthweigth (on the y-axis) and smoker (on the x-axis) and add to it the estimated linear regression line.

# You can type your code directly into this code-chunk. 
# Recall you already have the scatter plot of the data from part 3.
# The `curve` function is helpful for drawing the regression line.

plot(bwdata$smoker, bwdata$birthweight)
curve(coef(regout)["(Intercept)"] + coef(regout)["bwdata$smoker"] * x,
      from = 0, to = 1,
      add = TRUE,
      lwd = 2)

(15 pts) In the linear regression model we estimated we assume as usual that \(E[\epsilon|\text{smoker}] = 0\).
- What is your estimate for \(E[\text{birthweight}|\text{not smoker}]\)?
- What is your estimate for \(E[\text{birthweight}|\text{smoker}]-E[\text{birthweight}|\text{not smoker}]\)?
- Do our results imply smoking causes lower infant birthweight?

# Part 1: 
regout <- lm(bwdata$birthweight ~ bwdata$smoker, data = bwdata)
birthweight_nonsmoker <- coef(regout)["(Intercept)"]
print(birthweight_nonsmoker) #birthweight | not smoker is the same as b1 + b2(0) or b1 so the coefficient of the linear regression

## (Intercept) 
##     3432.06

# Part 2:
birthweight_nonsmoker <- coef(regout)["(Intercept)"]

birthweight_smoker <- coef(regout)["(Intercept)"] + coef(regout)["bwdata$smoker"]

answer <- birthweight_smoker - birthweight_nonsmoker
print(answer)

## (Intercept) 
##   -253.2284

Problem Set 3

Econ 103: Introduction to Econometrics

Due Date: 11/16/2025

Questions