Keep your # comments short in a code chunk.
Write paragraphs above or below the code chunks.
Answer the following questions using the appropriate dataset and codebook. For each question, provide (1) your codes, (2) R outputs AND (3) the answer in complete sentences.
Setup and load necessary libraries
Import the dataset named “birthweight_smoking.csv” and give your dataset a name.
Create a smaller dataset containing the variables “birthweight”, “nprevist”, “drinks”, and “smoker”.
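A minimal sketch of the import and subsetting steps. The names smoking_data and dataMini are taken from the outputs further down; the helper select_vars(), the toy data frame, and its extra age column are hypothetical stand-ins so the chunk runs without the real file.

```r
# Hypothetical helper: keep only the four variables used in this assignment.
select_vars <- function(df) {
  keep <- c("birthweight", "nprevist", "drinks", "smoker")
  df[, keep]
}

# Toy data standing in for the real file (all values are made up).
toy <- data.frame(
  birthweight = c(3200, 2900, 3600),
  nprevist    = c(10, 8, 12),
  drinks      = c(0, 1, 0),
  smoker      = c(0, 1, 0),
  age         = c(25, 31, 28)  # extra column that gets dropped
)
toy_mini <- select_vars(toy)
names(toy_mini)

# With the real file in the working directory, the same steps would be:
# smoking_data <- read.csv("birthweight_smoking.csv")
# dataMini     <- select_vars(smoking_data)
```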
Q3: How are birthweight and nprevist correlated? Describe the direction and strength of the correlation using Pearson correlation.
Response: Birthweight and the number of prenatal visits are weakly but positively correlated with a correlation coefficient of 0.23.
## [1] 0.23
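A correlation like the one above can be computed with cor(). The sketch below uses simulated data (the data frame sim and all of its values are assumptions, not the assignment data), so its result will not match 0.23.

```r
# Simulated stand-in for the birthweight data (values are made up).
set.seed(1)
n <- 200
nprevist    <- rpois(n, lambda = 11)
birthweight <- 3000 + 30 * nprevist + rnorm(n, sd = 550)
sim <- data.frame(birthweight, nprevist)

# Pearson correlation; the order of the two variables does not matter.
r <- cor(sim$birthweight, sim$nprevist, method = "pearson")
round(r, 2)

# Same quantity from its definition: covariance over the product of the SDs.
r_manual <- cov(sim$birthweight, sim$nprevist) /
  (sd(sim$birthweight) * sd(sim$nprevist))
```

With the real data, the equivalent call would presumably be cor(dataMini$birthweight, dataMini$nprevist).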
Run cor.test() and discuss whether the correlation between birthweight and nprevist is statistically significant at the 0.05 level.
Response: The test result suggests that the correlation between birthweight and the number of prenatal visits (r = 0.23) is statistically significant at the 0.05 level: the p-value is effectively 0, far below 0.05. We reject the null hypothesis that birthweight and the number of prenatal visits are uncorrelated, though the correlation is weak.
# the order of the variables does not matter
cor.test(smoking_data$birthweight, smoking_data$nprevist, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: smoking_data$birthweight and smoking_data$nprevist
## t = 13, df = 2998, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.19 0.26
## sample estimates:
## cor
## 0.23
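As a rough check on the output above, the t statistic can be recovered from the reported correlation and degrees of freedom using t = r * sqrt(df) / sqrt(1 - r^2). Because r is rounded to 0.23 here, the result only approximately reproduces the printed t = 13.

```r
# Values copied from the cor.test() output above (r is rounded).
r  <- 0.23
df <- 2998

# t statistic for testing H0: true correlation = 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_stat), df)
p_val < 0.05
```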
Let’s regress birthweight (y) on smoker (x) using the small dataset and generate the regression results using summary(). Interpret the outputs including the intercept and the beta coefficient of smoker. Discuss whether the beta coefficient of smoker is statistically significant at the 0.05 level. Note that one unit increase in smoker means making a comparison between non-smoking mothers (0) and smoking mothers (1).
model2: The intercept is 3432.1 grams, meaning that when smoker is 0 (non-smoking mothers), the model predicts an average infant birthweight of 3432.1 grams. The coefficient for smoker is -253.2, which means that infants of smoking mothers weighed 253.2 grams less on average than infants of non-smoking mothers. The p-value is less than 0.05, so the coefficient for smoker is statistically significant at the 0.05 level.
##
## Call:
## lm(formula = birthweight ~ smoker, data = dataMini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3007.1 -313.1 26.9 366.9 2322.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3432.1 11.9 289.1 <0.0000000000000002 ***
## smoker -253.2 27.0 -9.4 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 584 on 2998 degrees of freedom
## Multiple R-squared: 0.0286, Adjusted R-squared: 0.0283
## F-statistic: 88.3 on 1 and 2998 DF, p-value: <0.0000000000000002
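With a single 0/1 predictor, the intercept equals the mean outcome of the 0 group and the slope equals the difference between the two group means. The sketch below illustrates this on simulated data (sim and its values are assumptions, not the assignment data) and then applies the reported coefficients.

```r
# Simulated stand-in data (values are made up).
set.seed(42)
n <- 500
smoker      <- rbinom(n, 1, 0.2)
birthweight <- 3430 - 250 * smoker + rnorm(n, sd = 580)
sim <- data.frame(birthweight, smoker)

fit <- lm(birthweight ~ smoker, data = sim)
group_means <- tapply(sim$birthweight, sim$smoker, mean)

coef(fit)                               # intercept and slope
group_means["0"]                        # equals the intercept
group_means["1"] - group_means["0"]     # equals the slope

# Using the coefficients reported above, the predicted average
# birthweight for infants of smoking mothers (smoker = 1) is:
pred_smoker <- 3432.1 + (-253.2) * 1
pred_smoker
```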
Let’s regress birthweight (y) on drinks (x) using the small dataset and generate the regression results using summary(). Interpret the outputs including the intercept and the beta coefficient. Discuss whether the beta coefficient of drinks is statistically significant at the 0.05 level. Note that one unit increase in drinks means an additional number of drinks per week.
model3: The intercept is 3384.56 grams, meaning that when drinks is 0 (mothers who had no drinks per week), the model predicts an average infant birthweight of 3384.56 grams. The coefficient for drinks is -27.9, which means that infants’ birthweight (y) goes down by 27.9 grams on average for each additional drink the mother has per week. However, the p-value (0.0759) is larger than 0.05, so the coefficient for drinks is not statistically significant at the 0.05 level.
##
## Call:
## lm(formula = birthweight ~ drinks, data = dataMini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2959.6 -322.6 45.3 375.4 2370.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3384.6 10.8 312.05 <0.0000000000000002 ***
## drinks -27.9 15.7 -1.78 0.076 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 592 on 2998 degrees of freedom
## Multiple R-squared: 0.00105, Adjusted R-squared: 0.000717
## F-statistic: 3.15 on 1 and 2998 DF, p-value: 0.0759
The main objective of an ordinary least squares regression model is to minimize the sum of squared residuals by identifying the best-fit line that models the relationship between the independent variable (x) and dependent variable (y). However, even the best-fit line does not fit every observation accurately. In some cases, the predicted values \(\hat{y_i}\) from the model can differ substantially from the actual observed values \(y_i\). Residuals, which are the actual value minus the predicted value, can be negative or positive. A negative residual indicates that the predicted value is higher than the actual value, while a positive residual indicates that the predicted value is lower than the actual value.
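A small sketch of these ideas on simulated data (x, y, and the perturbed slope are assumptions chosen for illustration): residuals are computed as observed minus fitted, they sum to roughly zero when the model has an intercept, and the least squares line has a smaller sum of squared residuals than a nearby alternative line.

```r
# Simulated data (values are made up).
set.seed(7)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)

fit <- lm(y ~ x)

# Residual = observed value minus predicted value.
res <- y - fitted(fit)
all.equal(unname(res), unname(residuals(fit)))

# With an intercept, the residuals sum to (numerically) zero.
sum(res)

# Sum of squared residuals for the fitted line and for a line
# with a slightly different slope.
sse_ols <- sum(res^2)
alt_pred <- coef(fit)[1] + (coef(fit)[2] + 0.1) * x
sse_alt <- sum((y - alt_pred)^2)
sse_ols < sse_alt
```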
We examine Model 2 to see how well it predicts the birthweight of infants whose mothers did not smoke. The regression line for this model is:
\[\hat{y_i} = 3432.1 + (-253.2)x \] where \(x=0\) indicates that the mother does not smoke. Using this equation, we can predict that the average birthweight for infants with non-smoking mothers is 3432.1 grams.
To check the accuracy of our prediction, we can examine the actual data point. We do this by calculating the residual, which is the difference between the actual value \(y_i\) and the predicted value \(\hat{y_i}\). The minimum residual for this model is -3007.1, which means that there is a data point with a birthweight that is 3007.1 grams less than our predicted value.
To find the actual data point, we can use the following formula:
\[r_i = y_i - \hat{y_i}\] where \(y_i\) is the observed value (data point), \(\hat{y_i}\) is the predicted value based on model 2.
Rearranging the equation, we get:
\[ y_i = r_i + \hat{y_i}\] Substituting in the values for \(r_i\) and \(\hat{y_i}\), we get:
\[ y_i = -3007.1 + 3432.1\] \[ y_i = 425 \]
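The same rearrangement can be checked in R. The first part of this sketch uses the values reported for model 2; the second part uses simulated data (sim and its values are assumptions) to show how the observation with the minimum residual could be located in a fitted model.

```r
# Reported values: minimum residual and predicted value for x = 0.
r_min <- -3007.1
y_hat <- 3432.1
y_obs <- r_min + y_hat
y_obs

# Simulated stand-in data (values are made up).
set.seed(3)
smoker      <- rbinom(300, 1, 0.2)
birthweight <- 3430 - 250 * smoker + rnorm(300, sd = 580)
sim <- data.frame(birthweight, smoker)
fit <- lm(birthweight ~ smoker, data = sim)

# Row with the smallest (most negative) residual.
i <- which.min(residuals(fit))

# Observed value = residual + predicted value for that row.
sim$birthweight[i]
residuals(fit)[i] + fitted(fit)[i]
```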
Upon inspecting the dataset, you will find that the lowest birthweight recorded is 425 grams, which is the actual birthweight for the data point with the minimum (largest negative) residual. This value is far from the predicted birthweight of 3432.1 grams, which highlights the limitations of relying on a simple linear regression model with just one independent variable.
To develop a more robust model, we should incorporate additional variables into our analysis through multiple regression. By incorporating more variables, we can gain a better understanding of the relationships between various factors and birthweight, ultimately leading to a more accurate and comprehensive model.
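One possible form of such a multiple regression is sketched below. The choice of predictors (smoker, drinks, nprevist) is an assumption based on the variables in the small dataset, and the data are simulated, so the coefficients say nothing about the real data.

```r
# Simulated stand-in data (values are made up).
set.seed(11)
n <- 400
sim <- data.frame(
  smoker   = rbinom(n, 1, 0.2),
  drinks   = rpois(n, 0.1),
  nprevist = rpois(n, 11)
)
sim$birthweight <- 3100 - 250 * sim$smoker - 30 * sim$drinks +
  30 * sim$nprevist + rnorm(n, sd = 550)

# Multiple regression: several predictors on the right-hand side.
fit_multi <- lm(birthweight ~ smoker + drinks + nprevist, data = sim)
summary(fit_multi)$coefficients
```

With the real data, the analogous call would presumably be lm(birthweight ~ smoker + drinks + nprevist, data = dataMini).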
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 425 3062 3420 3383 3750 5755
## [1] 425