A Reminder

Keep your # comments short in a code chunk.
Write paragraphs above or below the code chunks.

Instructions

Answer the following questions using the appropriate dataset and codebook. For each question, provide (1) your code, (2) the R output, and (3) the answer in complete sentences.

Setup and load necessary libraries

options(scipen = 999, digits = 2)
# avoid scientific notation and print about two significant digits

Q1

Import the dataset named “birthweight_smoking.csv” and assign it to a named object.

correct code (1 point)

smoking_data <- read.csv("birthweight_smoking.csv")

Q2

Create a smaller dataset containing the variables “birthweight”, “nprevist”, “drinks”, and “smoker”.

correct code (1 point)

dataMini <- smoking_data[,c("birthweight","nprevist","drinks","smoker")]

Q3

How are birthweight and nprevist correlated? Describe the direction and strength of the correlation using the Pearson correlation coefficient.

correct code (0.5 points)
correct description of the coefficient, strength and direction (0.5 points)
0.5 points deducted for each incorrect description or code

Response: Birthweight and the number of prenatal visits are weakly but positively correlated with a correlation coefficient of 0.23.

cor(smoking_data$birthweight, smoking_data$nprevist)

## [1] 0.23
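
As a rough illustration of what cor() computes, the sketch below reproduces the Pearson coefficient by hand as the covariance divided by the product of the two standard deviations. The vectors visits and weight are made-up toy values used only for illustration; they are not rows from birthweight_smoking.csv.

```r
# Toy data (hypothetical values, not from birthweight_smoking.csv)
visits <- c(2, 5, 8, 10, 12, 15)
weight <- c(2900, 3100, 3050, 3400, 3300, 3600)

# Pearson's r = cov(x, y) / (sd(x) * sd(y))
r_manual  <- cov(visits, weight) / (sd(visits) * sd(weight))
r_builtin <- cor(visits, weight, method = "pearson")

# Both approaches should agree up to floating-point error
all.equal(r_manual, r_builtin)
```

The sign of the coefficient gives the direction of the relationship, and its absolute value (between 0 and 1) gives the strength.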

Q4

Run cor.test() and discuss whether the correlation between birthweight and nprevist is statistically significant at the 0.05 level.

correct code (0.5 points)
correct description of the significance, p-value, conclusion (0.5 points)
0.5 points deducted for each incorrect or missing description or code

Response: The test shows that the correlation between birthweight and the number of prenatal visits (r = 0.23) is statistically significant at the 0.05 level: the p-value is far below 0.05 (reported as < 0.0000000000000002). We therefore reject the null hypothesis that the true correlation between birthweight and the number of prenatal visits is zero, although the correlation itself is weak.

# the order of the variables does not matter

cor.test(smoking_data$birthweight, smoking_data$nprevist, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  smoking_data$birthweight and smoking_data$nprevist
## t = 13, df = 2998, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.19 0.26
## sample estimates:
##  cor 
## 0.23
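
The object returned by cor.test() stores each piece of the printed output, which can be handy when writing up the answer. The sketch below is a minimal example on made-up toy vectors (visits and weight are hypothetical, not taken from the assignment data); it pulls out the estimate, degrees of freedom, and p-value, and checks the t statistic against the formula t = r * sqrt(df) / sqrt(1 - r^2).

```r
# Toy data (hypothetical values, not from birthweight_smoking.csv)
visits <- c(2, 5, 8, 10, 12, 15, 9, 11)
weight <- c(2900, 3100, 3050, 3400, 3300, 3600, 3000, 3450)

ct <- cor.test(visits, weight, method = "pearson")

r  <- unname(ct$estimate)   # sample correlation
df <- unname(ct$parameter)  # degrees of freedom, n - 2
p  <- ct$p.value            # two-sided p-value

# The t statistic is a function of r and the degrees of freedom
t_manual <- r * sqrt(df) / sqrt(1 - r^2)
all.equal(t_manual, unname(ct$statistic))

# Decision at the 0.05 level
p < 0.05
```

Plugging the rounded values from the output above (r = 0.23, df = 2998) into the same formula gives roughly 12.9, in line with the printed t = 13.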

Q5

Let’s regress birthweight (y) on smoker (x) using the small dataset and generate the regression results using summary(). Interpret the outputs including the intercept and the beta coefficient of smoker. Discuss whether the beta coefficient of smoker is statistically significant at the 0.05 level. Note that a one-unit increase in smoker means making a comparison between non-smoking mothers (0) and smoking mothers (1).

correct description of the coefficient and interpretation (0.5 points)
correct description of the p-value and significance (0.5 points)
0.5 points deducted for each incorrect description or code

model2: The intercept is 3432.1 grams, meaning that when smoker = 0 (non-smoking mothers), the model predicts an average infant birthweight of 3432.1 grams. The coefficient for smoker is -253.2, which means that infants with smoking mothers weighed 253.2 grams less on average than infants with non-smoking mothers. The p-value is less than 0.05, suggesting that the coefficient for smoker is statistically significant at the 0.05 level.

model2 <- lm(birthweight ~ smoker, data = dataMini)

summary(model2)

## 
## Call:
## lm(formula = birthweight ~ smoker, data = dataMini)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3007.1  -313.1    26.9   366.9  2322.9 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   3432.1       11.9   289.1 <0.0000000000000002 ***
## smoker        -253.2       27.0    -9.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 584 on 2998 degrees of freedom
## Multiple R-squared:  0.0286, Adjusted R-squared:  0.0283 
## F-statistic: 88.3 on 1 and 2998 DF,  p-value: <0.0000000000000002
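
With a single 0/1 predictor, the regression simply reproduces the two group means: the intercept is the mean for the group coded 0, and the slope is the difference between the two group means. The sketch below checks this on a small made-up data frame (toy and its values are hypothetical, not the assignment data).

```r
# Toy data (hypothetical values, not from birthweight_smoking.csv)
toy <- data.frame(
  birthweight = c(3500, 3400, 3600, 3300, 3100, 3200, 3000, 3250),
  smoker      = c(0, 0, 0, 0, 1, 1, 1, 1)
)

fit <- lm(birthweight ~ smoker, data = toy)

# Group means computed directly
mean_non <- mean(toy$birthweight[toy$smoker == 0])
mean_smk <- mean(toy$birthweight[toy$smoker == 1])

# Intercept = mean for smoker == 0; slope = difference in group means
all.equal(unname(coef(fit)["(Intercept)"]), mean_non)
all.equal(unname(coef(fit)["smoker"]), mean_smk - mean_non)
```

Applied to the model2 output above, the predicted average birthweight for infants of smoking mothers is 3432.1 - 253.2 = 3178.9 grams.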

Q6

Let’s regress birthweight (y) on drinks (x) using the small dataset and generate the regression results using summary(). Interpret the outputs including the intercept and the beta coefficient. Discuss whether the beta coefficient of drinks is statistically significant at the 0.05 level. Note that a one-unit increase in drinks means one additional drink per week.

correct description of the beta coefficient and interpretation (0.5 points)
correct description of the p-value and significance (0.5 points)
0.5 points deducted for each incorrect description or code

model3: The intercept is 3384.6 grams, meaning that when drinks = 0 (mothers who have no drinks per week), the model predicts an average infant birthweight of 3384.6 grams. The coefficient for drinks is -27.9, which means that predicted birthweight is 27.9 grams lower on average for each additional drink the mother has per week. However, the p-value (0.0759) is larger than 0.05, suggesting that the coefficient for drinks is not statistically significant at the 0.05 level.

model3 <- lm(birthweight ~ drinks, data = dataMini)

summary(model3)

## 
## Call:
## lm(formula = birthweight ~ drinks, data = dataMini)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2959.6  -322.6    45.3   375.4  2370.4 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   3384.6       10.8  312.05 <0.0000000000000002 ***
## drinks         -27.9       15.7   -1.78               0.076 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 592 on 2998 degrees of freedom
## Multiple R-squared:  0.00105,    Adjusted R-squared:  0.000717 
## F-statistic: 3.15 on 1 and 2998 DF,  p-value: 0.0759
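
To see where the t value and p-value for drinks come from, the sketch below recomputes them from the rounded estimate, standard error, and residual degrees of freedom printed above. Because the inputs are rounded, the result should be close to, but not necessarily identical to, the values reported by summary().

```r
# Values copied from the printed summary(model3) output
estimate  <- -27.9   # coefficient for drinks
std_error <- 15.7    # its standard error
df_resid  <- 2998    # residual degrees of freedom

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

t_value
p_value
p_value < 0.05   # FALSE means not significant at the 0.05 level
```

In a simple regression with one predictor, the overall F-test and the t-test on the slope are equivalent, which is why the p-value on the F-statistic line matches the one for drinks.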

Reference materials: Why negative residuals?

The main objective of an ordinary least squares regression is to identify the best-fit line, the one that minimizes the sum of squared residuals, to model the relationship between the independent variable (x) and dependent variable (y). However, it’s important to note that not all models are capable of fitting the data accurately. In some cases, the predicted values \(\hat{y_i}\) from the model can significantly differ from the actual observed values \(y_i\). A residual is the difference between the actual value and the predicted value (actual minus predicted), so it can be negative or positive. A negative residual indicates that the predicted value is higher than the actual value, while a positive residual indicates that the predicted value is lower than the actual value.

We are examining Model 2 to better understand the relationship between mothers’ smoking status and the birthweight of their infants. The regression line for this model is:

\[\hat{y_i} = 3432.1 + (-253.2)x \] where \(x=0\) indicates that the mother does not smoke. Using this equation, we can predict that the average birthweight for infants with non-smoking mothers is 3432.1 grams.

To check the accuracy of our prediction, we can examine the actual data point. We do this by calculating the residual, which is the difference between the actual value \(y_i\) and the predicted value \(\hat{y_i}\). The minimum residual for this model is -3007.1, which means that there is a data point with a birthweight that is 3007.1 grams less than our predicted value.

To find the actual data point, we can use the following formula:

\[r_i = y_i - \hat{y_i}\] where \(y_i\) is the observed value (data point), \(\hat{y_i}\) is the predicted value based on model 2.

Rearranging the equation, we get:

\[ y_i = r_i + \hat{y_i}\] Substituting in the values for \(r_i\) (the minimum residual, -3007.1) and \(\hat{y_i}\) (the predicted value for a non-smoking mother, 3432.1), we get:

\[ y_i = -3007.1 + 3432.1\] \[ y_i = 425 \]

Upon inspecting the dataset, you will find that the lowest birthweight recorded is 425 grams, which is the actual birthweight for the data point with the minimum (most negative) residual. This value is far from the predicted birthweight of 3432.1 grams, which highlights the limitations of relying on a simple linear regression model with just one independent variable.

To develop a more robust model, we should incorporate additional variables into our analysis through multiple regression. By incorporating more variables, we can gain a better understanding of the relationships between various factors and birthweight, ultimately leading to a more accurate and comprehensive model.

summary(dataMini$birthweight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     425    3062    3420    3383    3750    5755

# alternatively, from the data stored in model2
min(model2$model$birthweight)

## [1] 425
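
The same check can be done without assuming which group the observation belongs to, by pairing each residual with its own fitted value. The sketch below is a minimal example on a made-up data frame (toy and its values are hypothetical, not the assignment data): it finds the row with the most negative residual and recovers the observed value as residual plus prediction.

```r
# Toy data (hypothetical values, not from birthweight_smoking.csv)
toy <- data.frame(
  birthweight = c(3500, 3400, 3600, 900, 3100, 3200, 3000, 3250),
  smoker      = c(0, 0, 0, 0, 1, 1, 1, 1)
)

fit <- lm(birthweight ~ smoker, data = toy)

# Row index of the most negative residual
i_min <- which.min(residuals(fit))

r_i    <- unname(residuals(fit)[i_min])  # residual for that row
yhat_i <- unname(fitted(fit)[i_min])     # predicted value for that row

# Observed value recovered as residual + prediction
y_i <- r_i + yhat_i
all.equal(y_i, toy$birthweight[i_min])

toy[i_min, ]   # inspect the full row, including smoker status
```

Running the same steps on model2 and dataMini should point to the 425-gram observation discussed above.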

Lab 4 Assignment (Answer Key)

Viviana Wu

2025-03-02
