This will be an empirical problem set examining the relationship between an infant’s birth weight and whether the mother is a smoker. You may be reading this in a html file, which was created using RMarkdown. The RMarkdown file used to create this file is posted on CCLE and contains code boxes to help you get started. You may want to load the RMarkdown file in RStudio and work on it directly to obtain your answers and display the code you used along the way.

The data bw_smoking.csv is posted in the same directory as this homework on CCLE. Download the data file bw_smoking.csv into your computer (if you run R in your computer) or upload it into jupyter (if you are running R remotely instead). Loading the data into your workspace is a bit more involved that we have seen in lab because it is a csv file. To load the data: First, make sure that the dataset is in the working directory. In RStudio, you can use the bottom right box to navigate to the folder where you saved the file. To set the working directory to be that folder just click on More and press Set as Working Directory. Second, once the data is in the working directory you should be able to load it using the following code:

# Tell R to clear the workspace
rm(list = ls())

# Tell R to load the dataset
# Warning: If you are running this code from the RMarkdown by pressing the same arrow,
# then make sure the Markdown file is in the same directory as the csv file otherwise 
# the code will not work. If you are loading the data by typing into the console, then 
# it should work as long ast the data is in the working directory
bwdata <- read.csv("bw_smoking.csv")

Questions

Birthweight description is Birth weight of infant (in grams)
Smoker description is Indicator equal to one if the mother smoked during pregnancy and zero, otherwise
# You can type your code directly into this code-chunk. The included comments 
# contain some hints to get you started with the questions.

# Part 1: The function `names` can be used learn what is hiding in an object
# Remember the command ?names asks R for help.

names (bwdata) #bwdata has 12 variables
##  [1] "nprevist"    "alcohol"     "tripre1"     "tripre2"     "tripre3"    
##  [6] "tripre0"     "birthweight" "smoker"      "unmarried"   "educ"       
## [11] "age"         "drinks"
# Part 2: The function `mean` can compute the average for you. 
# Remember the command ?mean asks R for help.

mean(bwdata$birthweight, trim = 0, na.rm = FALSE) #mean is 3382.934 grams or 3.382934 kilos
## [1] 3382.934
# Part 3: The function `var` can compute the variance for you. 
# Remember the command ?var asks R for help.

var(bwdata$birthweight, y = NULL, na.rm = FALSE) #variance is 350656.9 grams squared or 35.06569 kilos 
## [1] 350656.9
# You can type your code directly into this code-chunk. When using `knit` 
# the graph will be included directly in your output file. Remember a good 
# function for scatter-plots is called `plot` (and ?plot will ask R for help)

plot(bwdata$smoker, bwdata$birthweight)

# You can type your code directly into this code-chunk. The included comments 
# contain some hints to get you started with the questions.

# Part 1: Remember the `lm` function lets us run linear regression

regout <- lm(bwdata$birthweight ~ bwdata$smoker, data = bwdata)
b_2 <- coef(regout)["bwdata$smoker"]
print(b_2) #point estimate of b2 is -253.2284
## bwdata$smoker 
##     -253.2284
# Part 2: Recall the output of lm has more information than what is displayed.
# Type attributes(regout) or names(regout) to see what is in there. 
# The df.residual inside regout is what we want

lm_residuals <- df.residual(regout)
print(lm_residuals) #2998 degrees of freedom in this regression
## [1] 2998
# Part 3: Be careful if using the var function because it divides by n-1

sigma_hat_sq <- sum(regout$residuals^2) / lm_residuals #The sigma squared (variance of the error term) is 340740.4 grams squared or 34.07404 kilos
print(sigma_hat_sq) #
## [1] 340740.4
# You can type your code directly into this code-chunk. The included comments 
# contain some hints to get you started with the questions.

# Part 1: The `summary` function may be helpful

summary(regout) 
## 
## Call:
## lm(formula = bwdata$birthweight ~ bwdata$smoker, data = bwdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3007.06  -313.06    26.94   366.94  2322.94 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3432.06      11.87 289.115   <2e-16 ***
## bwdata$smoker  -253.23      26.95  -9.396   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 583.7 on 2998 degrees of freedom
## Multiple R-squared:  0.0286, Adjusted R-squared:  0.02828 
## F-statistic: 88.28 on 1 and 2998 DF,  p-value: < 2.2e-16
b2_se <- summary(regout)$coefficients["bwdata$smoker", "Std. Error"]
print(b2_se) #the standard error for b2 is 26.95
## [1] 26.95149
# Part 2: The function `qnorm` can return quantiles of the normal distribution

z_critical <- qnorm(0.995)
lower <- b_2 - z_critical * b2_se
upper <- b_2 + z_critical * b2_se
print(lower) #lower bound is -322.6508
## bwdata$smoker 
##     -322.6508
print(upper) #upper bound is -183.8059
## bwdata$smoker 
##     -183.8059
# You can type your code directly into this code-chunk. 
# Recall you already have the scatter plot of the data from part 3.
# The `curve` function is helpful for drawing the regression line.

plot(bwdata$smoker, bwdata$birthweight)
curve(coef(regout)["(Intercept)"] + coef(regout)["bwdata$smoker"] * x,
      from = 0, to = 1,
      add = TRUE,
      lwd = 2)

# Part 1: 
regout <- lm(bwdata$birthweight ~ bwdata$smoker, data = bwdata)
birthweight_nonsmoker <- coef(regout)["(Intercept)"]
print(birthweight_nonsmoker) #birthweight | not smoker is the same as b1 + b2(0) or b1 so the coefficient of the linear regression
## (Intercept) 
##     3432.06
# Part 2:
birthweight_nonsmoker <- coef(regout)["(Intercept)"]

birthweight_smoker <- coef(regout)["(Intercept)"] + coef(regout)["bwdata$smoker"]

answer <- birthweight_smoker - birthweight_nonsmoker
print(answer)
## (Intercept) 
##   -253.2284