A Reminder

  • Keep your # comments short inside a code chunk.

  • Write paragraphs above or below the code chunks.

  • Do not round intermediate calculations. Round your final calculation to 2 significant digits.
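
The rounding rule can be illustrated with a small sketch. The coefficient 29.6 is the rounded nprevist estimate printed in the lm2 output later in this document; the choice of 3 visits and the variable names are made up for illustration.

```r
# Hypothetical example: change in predicted birthweight for 3 extra visits
b_nprevist <- 29.6
visits <- 3

effect_full  <- b_nprevist * visits     # keep full precision: 88.8
effect_final <- signif(effect_full, 2)  # round only the final answer: 89

# Rounding the coefficient first gives a different final answer
effect_early <- signif(b_nprevist, 2) * visits  # 30 * 3 = 90
```

signif() rounds to significant digits, whereas round() rounds to decimal places, so signif() is the function that matches the rule above.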

Instruction

Answer the following questions using the appropriate dataset and codebook. For each question, provide (1) your code, (2) the R output, and (3) the answer in complete sentences.

Step 0: Load all necessary libraries (if any)

options(scipen=999, digits = 3)

Q1

Import the dataset named “birthweight_smoking.csv” and give it a name.

df <- read.csv("birthweight_smoking.csv")

Q2

Create a smaller dataset containing the variables “birthweight”,“nprevist”,“drinks”,“smoker”,“age”,“educ”, and “alcohol”. [1 point]

dataMini <- df[,c("birthweight","nprevist","age","educ","drinks","smoker","alcohol")] 
# subset data frame by shortlisting named columns
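
A minimal sketch of the same subsetting step on a made-up two-row data frame, with a guard against misspelled column names. The object names toy, keep, and toyMini are hypothetical, and the values are invented; the real data come from birthweight_smoking.csv.

```r
# Made-up stand-in for df (values are invented for illustration)
toy <- data.frame(birthweight = c(3200, 2900), nprevist = c(10, 8),
                  age = c(27, 31), educ = c(12, 16), drinks = c(0, 1),
                  smoker = c(0, 1), alcohol = c(0, 1), unmarried = c(0, 1))

keep <- c("birthweight", "nprevist", "drinks", "smoker",
          "age", "educ", "alcohol")

# Stop early if any requested column name is missing or misspelled
stopifnot(all(keep %in% names(toy)))

toyMini <- toy[, keep]  # select columns by name, in the order given in keep

names(toyMini)
dim(toyMini)            # 2 rows, 7 columns
```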

Q3

Reproduce the regression models lm1 and lm2 based on the lab demonstration and show the summary regression outputs.

lm1: Regress birthweight on smoker, alcohol and nprevist.

lm2: Regress birthweight on smoker, alcohol, nprevist and unmarried.

  • correct code for producing regression models (1 point)
  • correct code for showing summary results (1 point)
lm1 <- lm(birthweight ~ smoker + alcohol + nprevist, data = df) 
summary(lm1)
## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2733.5  -307.6    21.4   358.1  2192.7 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  3051.25      34.02   89.70 < 0.0000000000000002 ***
## smoker       -217.58      26.68   -8.16  0.00000000000000051 ***
## alcohol       -30.49      76.23   -0.40                 0.69    
## nprevist       34.07       2.85   11.93 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 570 on 2996 degrees of freedom
## Multiple R-squared:  0.0729, Adjusted R-squared:  0.0719 
## F-statistic: 78.5 on 3 and 2996 DF,  p-value: <0.0000000000000002
lm2 <- lm(birthweight ~ smoker + alcohol + nprevist + unmarried, data = df) 
summary(lm2)
## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2798.8  -309.2    25.4   361.8  2363.7 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   3134.4       35.7   87.91 < 0.0000000000000002 ***
## smoker        -175.4       27.1   -6.47     0.00000000011275 ***
## alcohol        -21.1       75.6   -0.28                 0.78    
## nprevist        29.6        2.9   10.21 < 0.0000000000000002 ***
## unmarried     -187.1       26.0   -7.20     0.00000000000078 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566 on 2995 degrees of freedom
## Multiple R-squared:  0.0886, Adjusted R-squared:  0.0874 
## F-statistic: 72.8 on 4 and 2995 DF,  p-value: <0.0000000000000002
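
Because lm2 is lm1 plus one predictor, a nested model can also be built with update(). The sketch below uses mtcars, a dataset that ships with R, as a stand-in, so the variables have nothing to do with birthweight; the names m1, m2, and m2_full are hypothetical.

```r
# Stand-in data: mtcars is built into R
m1 <- lm(mpg ~ wt + hp, data = mtcars)

# update() refits m1 with one extra predictor, mirroring lm1 -> lm2
m2 <- update(m1, . ~ . + am)

# The same model written out in full, for comparison
m2_full <- lm(mpg ~ wt + hp + am, data = mtcars)

all.equal(coef(m2), coef(m2_full))  # TRUE
```

Writing the formula out in full, as in the answer above, is easier to read in an assignment; update() mainly helps avoid retyping long formulas.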

Q4

Run the third model (lm3) by adding age and educ to lm2. Find the summary regression results.

  • run a new lm() model (0.5 points)
  • call the model using summary() (0.5 points)
lm3 <- lm(birthweight ~ smoker + alcohol + nprevist + unmarried + age + educ, 
          data = df) 

# you need to call lm3 using the summary(), 
# otherwise you can't see the results

summary(lm3)
## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried + 
##     age + educ, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2820.0  -304.8    22.6   357.3  2355.3 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 3199.426     83.337   38.39 < 0.0000000000000002 ***
## smoker      -176.959     27.474   -6.44      0.0000000001378 ***
## alcohol      -14.758     75.839   -0.19                 0.85    
## nprevist      29.775      2.926   10.17 < 0.0000000000000002 ***
## unmarried   -199.319     28.396   -7.02      0.0000000000027 ***
## age           -2.494      2.269   -1.10                 0.27    
## educ           0.238      5.542    0.04                 0.97    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566 on 2993 degrees of freedom
## Multiple R-squared:  0.089,  Adjusted R-squared:  0.0872 
## F-statistic: 48.7 on 6 and 2993 DF,  p-value: <0.0000000000000002

Q5

Find out the AIC and BIC values for lm1, lm2, and lm3.

  • 1 point for the correct code
AIC(lm1, lm2, lm3)
##     df   AIC
## lm1  5 46598
## lm2  6 46549
## lm3  8 46552
BIC(lm1, lm2, lm3)
##     df   BIC
## lm1  5 46628
## lm2  6 46585
## lm3  8 46600
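
To see where these numbers come from, AIC and BIC can be reproduced by hand from the log-likelihood. The sketch below uses the built-in mtcars data as a stand-in (the model m is hypothetical). It also shows why the df column counts one more than the number of coefficients: the residual variance is an estimated parameter too, which is why lm1, with four coefficients, has df = 5.

```r
# Stand-in model on data that ships with R
m <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(m)
k  <- attr(ll, "df")  # 3 coefficients + 1 residual variance = 4
n  <- nobs(m)         # number of observations used in the fit

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

all.equal(aic_manual, AIC(m))
all.equal(bic_manual, BIC(m))
```

Both criteria reward a higher log-likelihood and penalise extra parameters; BIC's penalty, log(n) per parameter, is larger than AIC's 2 whenever n exceeds about 7.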

Q6

Which model is the BEST FIT model (i.e., has the highest adjusted R-squared value and the lowest AIC and BIC values) among the three models? Interpret the BEST FIT model using the F-statistic, adjusted R-squared, and the AIC and BIC values.

  • selection of model 2 as the best fit model (0.5 points)
  • Indicate and interpret the F-statistic, adjusted R-squared, and AIC and BIC values (0.5 points each, totalling 1.5 points)
  • 0.5 point deduction for each inaccuracy

Response:

Based on these criteria, we choose Model 2 (lm2) as the final model: it fits the data best, has the highest explanatory power, and is the most efficient and parsimonious of the three.

The F-statistic of Model 2 is 72.79 and the p-value is close to 0. The model is significant at the 0.05 level, meaning that at least one of the slope coefficients in the model differs significantly from zero.

Model 2 (lm2) has the highest adjusted R-squared value (8.74%) among the three models. It means that model lm2 (with smoker, alcohol, nprevist and unmarried as independent variables) explains about 8.74% of the variance in birthweight.

Model 2 (lm2) also has the lowest AIC and BIC values, which are 46548.98 (or 46549) and 46585.02 (or 46585) respectively, indicating it is the most efficient and parsimonious model of the three.
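
The comparison above can also be assembled into one table rather than read off three separate summaries. This is a minimal sketch on the built-in mtcars data, used here as a stand-in; the model list and the names models and fit_table are hypothetical.

```r
# Three nested stand-in models on data that ships with R
models <- list(m1 = lm(mpg ~ wt, data = mtcars),
               m2 = lm(mpg ~ wt + hp, data = mtcars),
               m3 = lm(mpg ~ wt + hp + qsec, data = mtcars))

# One row per model: adjusted R-squared, F-statistic, AIC, BIC
fit_table <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  f_stat = sapply(models, function(m) unname(summary(m)$fstatistic["value"])),
  aic    = sapply(models, AIC),
  bic    = sapply(models, BIC),
  row.names = names(models)
)
fit_table

# Which model is preferred on each criterion
rownames(fit_table)[which.max(fit_table$adj_r2)]
rownames(fit_table)[which.min(fit_table$aic)]
rownames(fit_table)[which.min(fit_table$bic)]
```

The three criteria do not have to agree. When they do, as for lm2 in this assignment, the choice of model is straightforward.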


Q7

Interpret the intercept and the significance of independent variables of the BEST FIT model, including the direction and size of the significant beta coefficients.

  • significance of variables (1 point)
  • interpretations of significant coefficients (1 point)
  • 0.5 point deduction for each incorrect description

Note that for dummy/categorical variables, your response should compare the DV based on the groups. See interpretations below.

Response: Model 2 [for ease of reference, you may recall the model here]

summary(lm2)
## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2798.8  -309.2    25.4   361.8  2363.7 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   3134.4       35.7   87.91 < 0.0000000000000002 ***
## smoker        -175.4       27.1   -6.47     0.00000000011275 ***
## alcohol        -21.1       75.6   -0.28                 0.78    
## nprevist        29.6        2.9   10.21 < 0.0000000000000002 ***
## unmarried     -187.1       26.0   -7.20     0.00000000000078 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566 on 2995 degrees of freedom
## Multiple R-squared:  0.0886, Adjusted R-squared:  0.0874 
## F-statistic: 72.8 on 4 and 2995 DF,  p-value: <0.0000000000000002
  • Maternal smoking (smoker), unmarried status (unmarried), and the number of prenatal visits (nprevist) are significant predictors of infant birthweight at the 0.05 level. Alcohol consumption during pregnancy (alcohol) is not a significant predictor at the 0.05 level.

  • Intercept: When all predictors equal 0 (a married, non-smoking mother who did not drink alcohol during pregnancy and had no prenatal visits), the predicted birthweight is 3134.4 grams.

  • smoker: Holding all other variables constant, children with smoking mothers weighed 175.38 grams less on average than children with non-smoking mothers. This difference is significant at the 0.05 level.

  • nprevist: Each additional prenatal visit (a 1-unit increase in nprevist) was associated with a 29.6-gram increase in infant birthweight (DV) on average, holding other factors constant. This effect is significant at the 0.05 level.

  • unmarried: Compared to children whose mothers were married, children whose mothers were unmarried weighed 187.13 grams less on average, holding all other variables constant. This difference is significant at the 0.05 level.

  • alcohol: The coefficient for alcohol was not statistically significant at the 0.05 level.
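
The interpretations above can be checked with a small worked prediction. The coefficients are copied from the printed lm2 output (rounded to one decimal); the mother described is hypothetical, and the vector names b, x, and x_ns are made up for this sketch.

```r
# Coefficients copied from the printed lm2 output (rounded to one decimal)
b <- c(intercept = 3134.4, smoker = -175.4, alcohol = -21.1,
       nprevist = 29.6, unmarried = -187.1)

# Hypothetical mother: smoker, no alcohol, 10 prenatal visits, unmarried
x <- c(intercept = 1, smoker = 1, alcohol = 0, nprevist = 10, unmarried = 1)

pred <- sum(b * x)  # 3134.4 - 175.4 + 29.6 * 10 - 187.1
pred

# Same mother, but non-smoking: the gap equals the smoker coefficient
x_ns <- replace(x, "smoker", 0)
sum(b * x_ns) - pred
```

The prediction is about 3067.9 grams, and switching smoker from 1 to 0 raises it by 175.4 grams. That is what "holding all other variables constant" means in the interpretations above.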


Reference materials: Why negative residuals?

The main objective of a least-squares regression model is to minimize the sum of squared residuals by identifying the best-fit line that models the relationship between the independent variable (x) and the dependent variable (y). However, not every model fits the data accurately: in some cases the predicted values \(\hat{y_i}\) differ substantially from the observed values \(y_i\). A residual is the observed value minus the predicted value, so it can be negative or positive. A negative residual indicates that the predicted value is higher than the actual value, while a positive residual indicates that the predicted value is lower than the actual value.
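
A short simulation illustrates the sign of residuals. Everything here is made up for illustration: the seed, the sample size, and the numbers used to generate x and y.

```r
set.seed(1)                                # arbitrary seed, for reproducibility
x <- rbinom(50, 1, 0.3)                    # made-up 0/1 predictor
y <- 3400 - 250 * x + rnorm(50, sd = 500)  # made-up outcome

fit <- lm(y ~ x)
r <- y - fitted(fit)                       # residual = observed - predicted

all.equal(unname(r), unname(resid(fit)))   # matches R's own residuals
table(sign(r))  # both negative and positive residuals occur
sum(r)          # essentially zero when the model includes an intercept
sum(r^2)        # the quantity that least squares minimises
```

Because the residuals of a model with an intercept sum to zero, some of them must be negative whenever any are positive.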

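We examine a simple regression of birthweight on smoker (called Model 2 in this reference material; it has a single predictor and is not the lm2 fitted in Q3) to better understand the predicted birthweight of infants whose mothers did not smoke. The regression line for this model is: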

\[\hat{y_i} = 3432.1 + (-253.2)x \] where \(x=0\) indicates that the mother does not smoke. Using this equation, we can predict that the average birthweight for infants with non-smoking mothers is 3432.1 grams.

To check the accuracy of our prediction, we can examine the actual data point. We do this by calculating the residual, which is the difference between the actual value \(y_i\) and the predicted value \(\hat{y_i}\). The minimum residual for this model is -3007.1, which means that there is a data point with a birthweight that is 3007.1 grams less than our predicted value.

To find the actual data point, we can use the following formula:

\[r_i = y_i - \hat{y_i}\] where \(y_i\) is the observed value (data point), \(\hat{y_i}\) is the predicted value based on model 2.

Rearranging the equation, we get:

\[ y_i = r_i + \hat{y_i}\] Substituting in the values for \(r_i\) and \(\hat{y_i}\), we get:

\[ y_i = -3007.1 + 3432.1\] \[ y_i = 425 \]
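
The same arithmetic in R, using only the numbers quoted in this section (the variable names are hypothetical):

```r
y_hat <- 3432.1 + (-253.2) * 0  # predicted birthweight when x = 0 (non-smoker)
r_min <- -3007.1                # minimum residual quoted in the text
y_obs <- r_min + y_hat          # observed value implied by r = y - y_hat
y_obs
```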

Upon inspecting the dataset, you will find that the lowest birthweight recorded is 425 grams, which is the actual birthweight for the data point with the minimum (most negative) residual. This value differs greatly from the predicted birthweight of 3432.1 grams, which highlights the limitations of relying on a simple linear regression model with just one independent variable.

To develop a more robust model, we should incorporate additional variables into our analysis through multiple regression. By incorporating more variables, we can gain a better understanding of the relationships between various factors and birthweight, ultimately leading to a more accurate and comprehensive model.

summary(dataMini$birthweight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     425    3062    3420    3383    3750    5755
# alternatively, from the data stored in the lm2 model object
min(lm2$model$birthweight)
## [1] 425
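
To locate the observation behind the minimum residual directly, which.min() can be applied to the residuals. The sketch below runs on simulated data, so the seed, the data frame toy, and all the values are made up; on the real data the same calls would be applied to the fitted model.

```r
set.seed(2)  # arbitrary seed, for reproducibility
toy <- data.frame(smoker = rbinom(40, 1, 0.3))
toy$birthweight <- 3400 - 250 * toy$smoker + rnorm(40, sd = 500)

fit <- lm(birthweight ~ smoker, data = toy)

i      <- which.min(resid(fit))  # row with the most negative residual
r_i    <- resid(fit)[i]
yhat_i <- fitted(fit)[i]

# y_i = r_i + yhat_i recovers the observed value for that row
all.equal(unname(r_i + yhat_i), toy$birthweight[i])
toy[i, ]
```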

Submit your Assignment

Statastic – Well done!

Step 1: Double-check that you have answered all the questions thoroughly, and ALWAYS check for accuracy!

Step 2: If you use an R Markdown (.Rmd) document, knit it: click the downward-pointing triangle next to Knit and choose Knit to PDF. If you use an R script (.R), transfer your code, results, and written answers to a Word document, then convert it to a PDF.

Step 3: Submit your assignment to Gradescope https://www.gradescope.com/.