This will be an empirical problem set examining the relationship between an infant’s birth weight and whether the mother is a smoker. You may be reading this in a html file, which was created using RMarkdown. The RMarkdown file used to create this file is posted on CCLE and contains code boxes to help you get started. You may want to load the RMarkdown file in RStudio and work on it directly to obtain your answers and display the code you used along the way.
The data bw_smoking.csv is posted in the same directory
as this homework on CCLE. Download the data file
bw_smoking.csv into your computer (if you run R in your
computer) or upload it into jupyter (if you are running R remotely
instead). Loading the data into your workspace is a bit more involved
that we have seen in lab because it is a csv file. To load the data:
First, make sure that the dataset is in the working directory. In
RStudio, you can use the bottom right box to navigate to the folder
where you saved the file. To set the working directory to be that folder
just click on More and press
Set as Working Directory. Second, once the data is in the
working directory you should be able to load it using the following
code:
# Tell R to clear the workspace
rm(list = ls())
# Tell R to load the dataset
# Warning: If you are running this code from the RMarkdown by pressing the same arrow,
# then make sure the Markdown file is in the same directory as the csv file otherwise
# the code will not work. If you are loading the data by typing into the console, then
# it should work as long ast the data is in the working directory
bwdata <- read.csv("bw_smoking.csv")
PS3 - Birthweight_Smoking_Description.pdf, which is posted
in the same directory as this homework on CCLE. Read the description
carefully.
What is the description of the variable
birthweight?
What is the description of the variable
smoker?
Birthweight description is Birth weight of infant (in grams)
Smoker description is Indicator equal to one if the mother smoked during pregnancy and zero, otherwise
(10 pts) Once the data is loaded into R, you should see a data
frame called bwdata in your workspace.
How many variables are stored in bwdata?
What is the average birth weights for the entire sample?
What is the variance of birth weight for the entire sample?
# You can type your code directly into this code-chunk. The included comments
# contain some hints to get you started with the questions.
# Part 1: The function `names` can be used learn what is hiding in an object
# Remember the command ?names asks R for help.
names (bwdata) #bwdata has 12 variables
## [1] "nprevist" "alcohol" "tripre1" "tripre2" "tripre3"
## [6] "tripre0" "birthweight" "smoker" "unmarried" "educ"
## [11] "age" "drinks"
# Part 2: The function `mean` can compute the average for you.
# Remember the command ?mean asks R for help.
mean(bwdata$birthweight, trim = 0, na.rm = FALSE) #mean is 3382.934 grams or 3.382934 kilos
## [1] 3382.934
# Part 3: The function `var` can compute the variance for you.
# Remember the command ?var asks R for help.
var(bwdata$birthweight, y = NULL, na.rm = FALSE) #variance is 350656.9 grams squared or 35.06569 kilos
## [1] 350656.9
birthweigth (on the
y-axis) and smoker (on the x-axis).# You can type your code directly into this code-chunk. When using `knit`
# the graph will be included directly in your output file. Remember a good
# function for scatter-plots is called `plot` (and ?plot will ask R for help)
plot(bwdata$smoker, bwdata$birthweight)
(20 pts) Run a linear regression of the variable
birthweight on the variable smoker in the
model \[\text{birthweight} = \beta_1 +
\beta_2 \text{smoker} + \epsilon\] using the lm
function and save the output in an object called
regout.
What is the point estimate \(b_2\) for \(\beta_2\)?
What are the degrees of freedom in this regression (i.e. \(n-K\))?
Using the vector of residuals created by the lm
function, compute \(\hat
\sigma^2\).
# You can type your code directly into this code-chunk. The included comments
# contain some hints to get you started with the questions.
# Part 1: Remember the `lm` function lets us run linear regression
regout <- lm(bwdata$birthweight ~ bwdata$smoker, data = bwdata)
b_2 <- coef(regout)["bwdata$smoker"]
print(b_2) #point estimate of b2 is -253.2284
## bwdata$smoker
## -253.2284
# Part 2: Recall the output of lm has more information than what is displayed.
# Type attributes(regout) or names(regout) to see what is in there.
# The df.residual inside regout is what we want
lm_residuals <- df.residual(regout)
print(lm_residuals) #2998 degrees of freedom in this regression
## [1] 2998
# Part 3: Be careful if using the var function because it divides by n-1
sigma_hat_sq <- sum(regout$residuals^2) / lm_residuals #The sigma squared (variance of the error term) is 340740.4 grams squared or 34.07404 kilos
print(sigma_hat_sq) #
## [1] 340740.4
(15 pts) Building on the previous question answer the following:
What is the standard error for \(b_2\)?
Build a 99\(\%\) confidence interval for \(\beta_2\) using a normal approximation.
# You can type your code directly into this code-chunk. The included comments
# contain some hints to get you started with the questions.
# Part 1: The `summary` function may be helpful
summary(regout)
##
## Call:
## lm(formula = bwdata$birthweight ~ bwdata$smoker, data = bwdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3007.06 -313.06 26.94 366.94 2322.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3432.06 11.87 289.115 <2e-16 ***
## bwdata$smoker -253.23 26.95 -9.396 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared: 0.0286, Adjusted R-squared: 0.02828
## F-statistic: 88.28 on 1 and 2998 DF, p-value: < 2.2e-16
b2_se <- summary(regout)$coefficients["bwdata$smoker", "Std. Error"]
print(b2_se) #the standard error for b2 is 26.95
## [1] 26.95149
# Part 2: The function `qnorm` can return quantiles of the normal distribution
z_critical <- qnorm(0.995)
lower <- b_2 - z_critical * b2_se
upper <- b_2 + z_critical * b2_se
print(lower) #lower bound is -322.6508
## bwdata$smoker
## -322.6508
print(upper) #upper bound is -183.8059
## bwdata$smoker
## -183.8059
birthweigth (on the
y-axis) and smoker (on the x-axis) and add to it the
estimated linear regression line.# You can type your code directly into this code-chunk.
# Recall you already have the scatter plot of the data from part 3.
# The `curve` function is helpful for drawing the regression line.
plot(bwdata$smoker, bwdata$birthweight)
curve(coef(regout)["(Intercept)"] + coef(regout)["bwdata$smoker"] * x,
from = 0, to = 1,
add = TRUE,
lwd = 2)
(15 pts) In the linear regression model we estimated we assume as usual that \(E[\epsilon|\text{smoker}] = 0\).
What is your estimate for \(E[\text{birthweight}|\text{not smoker}]\)?
What is your estimate for \(E[\text{birthweight}|\text{smoker}]-E[\text{birthweight}|\text{not smoker}]\)?
Do our results imply smoking causes lower infant birthweight?
# Part 1:
regout <- lm(bwdata$birthweight ~ bwdata$smoker, data = bwdata)
birthweight_nonsmoker <- coef(regout)["(Intercept)"]
print(birthweight_nonsmoker) #birthweight | not smoker is the same as b1 + b2(0) or b1 so the coefficient of the linear regression
## (Intercept)
## 3432.06
# Part 2:
birthweight_nonsmoker <- coef(regout)["(Intercept)"]
birthweight_smoker <- coef(regout)["(Intercept)"] + coef(regout)["bwdata$smoker"]
answer <- birthweight_smoker - birthweight_nonsmoker
print(answer)
## (Intercept)
## -253.2284