Slides for Week 5 are available here.
In this tutorial, you will learn how to:
Step 0. Look at the Data Documentation Check the
Documentation for Birthweight_Smoking
in your project
folder.
Your filename.csv
needs to be exact and indicate the
file format, which is a .csv file (Comma Separated Values file)–a
delimited text file that uses a comma to separate values. It is one type
of data-friendly Excel spreadsheets besides the common .xlsx format)
For more exploratory data analysis, see previous weeks’ lab materials.
## [1] 3000 12
## nprevist alcohol tripre1 tripre2 tripre3
## Min. : 0 Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.: 9 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:0.000
## Median :12 Median :0.000 Median :1.000 Median :0.000 Median :0.000
## Mean :11 Mean :0.019 Mean :0.804 Mean :0.153 Mean :0.033
## 3rd Qu.:13 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :35 Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.000
## tripre0 birthweight smoker unmarried educ
## Min. :0.00 Min. : 425 Min. :0.000 Min. :0.000 Min. : 0.0
## 1st Qu.:0.00 1st Qu.:3062 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:12.0
## Median :0.00 Median :3420 Median :0.000 Median :0.000 Median :12.0
## Mean :0.01 Mean :3383 Mean :0.194 Mean :0.227 Mean :12.9
## 3rd Qu.:0.00 3rd Qu.:3750 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:14.0
## Max. :1.00 Max. :5755 Max. :1.000 Max. :1.000 Max. :17.0
## age drinks
## Min. :14.0 Min. : 0.00
## 1st Qu.:23.0 1st Qu.: 0.00
## Median :27.0 Median : 0.00
## Mean :26.9 Mean : 0.06
## 3rd Qu.:31.0 3rd Qu.: 0.00
## Max. :44.0 Max. :21.00
First, let’s create a new data frame with the variables we will use.
We can subset our original data frame by selecting the columns pretty
quickly. Specify the column names through this code:
data[,c("variable1", "variable2")]
.
na.omit()
: Remove missing values# remove NA (i.e., rows with missing values) first
dataMini <- na.omit(dataMini)
# this dataset is clean, but in reality, data tend to have missing values
Our first basic model with three IVs, namely smoker
,
alcohol
and nprevist
:
##
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2733.5 -307.6 21.4 358.1 2192.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3051.25 34.02 89.70 < 0.0000000000000002 ***
## smoker -217.58 26.68 -8.16 0.00000000000000051 ***
## alcohol -30.49 76.23 -0.40 0.69
## nprevist 34.07 2.85 11.93 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 570 on 2996 degrees of freedom
## Multiple R-squared: 0.0729, Adjusted R-squared: 0.0719
## F-statistic: 78.5 on 3 and 2996 DF, p-value: <0.0000000000000002
When working with datasets that have missing values in some variables, it’s important to check the number of observations in each regression model. The number may vary depending on which variables are included in the model. However, the birthweight data we’re working with here is very clean, so the number of observations is the same across all models. Nonetheless, it’s always a good practice to check the sample size in each regression you compute!
## [1] 3000
The F value is the ratio of the mean regression sum of squares divided by the mean error sum of squares.
It is the probability that the null hypothesis (H0) for the full model (i.e., all of the regression coefficients are jointly zero) is true. Null hypothesis (H0) is which we try to disprove or reject at the 0.05 level.
F-statistic for the first model is 78.5 and the p-value is close to 0. The model is significant at the 0.05 level, we can reject H0, meaning there is at least one significant coefficient in the model.
\(Adjusted R^2\) and \(R^2\) measure how close the data are to the
fitted regression line. 0% indicates that the model explains none of the
variability of the response data around its mean. How much of the
variance in birthweight
is explained by our independent
variables?
Adjusted R-squared for lm1: 0.0719
It means that model lm1 (with smoker, alcohol and nprevist as independent variables) explains about 7.19% of the variance in birthweight.
##
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2733.5 -307.6 21.4 358.1 2192.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3051.25 34.02 89.70 < 0.0000000000000002 ***
## smoker -217.58 26.68 -8.16 0.00000000000000051 ***
## alcohol -30.49 76.23 -0.40 0.69
## nprevist 34.07 2.85 11.93 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 570 on 2996 degrees of freedom
## Multiple R-squared: 0.0729, Adjusted R-squared: 0.0719
## F-statistic: 78.5 on 3 and 2996 DF, p-value: <0.0000000000000002
Partial R-Squared for each independent variable (for glm or lm models) The partial R-square (or coefficient of partial determination or a squared partial correlation) measures a fully partialled proportion of the variance in \(Y\) (the unique contribution) explained by \(Xi\) over the variance in \(Y\) that is not associated with any other predictors already included in the model.
partial.rsq for nprevist = 0.0453756
It means that nprevist explains about 4.54% of the variance in birthweight in lm1.
## $adjustment
## [1] FALSE
##
## $variable
## [1] "smoker" "alcohol" "nprevist"
##
## $partial.rsq
## [1] 0.0217171 0.0000534 0.0453756
What is the average infant birthweight when smoker
,
alcohol
and nprevist
are zero?
The Constant is 3051.25 grams, meaning when all our variables are 0, our model predicts the average infant birthweight will be 3051.25 grams.
What are the coefficients for smoker
,
alcohol
and nprevist
? Are they statistically
significant at the 0.05 level?
Model lm1
Based on the adjusted r-squared value, 7.19% of the variance in our dependent variable is explained by the first model.
Both smoking mothers (smoker) and the number of prenatal visit (nprevist) are significant predictors of infant birthweight, but not for mothers drinking alcohol during pregnancy (alcohol) at the 0.05 level (p < 0.05 for both variables).
smoker
: Compared to children whose mothers were
non-smokers, the average birthweight of children whose mothers were
smokers is significantly lower by 217.58 grams on average at the 0.05
level, holding all other variables constant.
nprevist
: A 1-unit increase in the number of
prenatal visit is associated with an increase in the infant birthweight
(DV) by 34.07 grams, holding other variables constant.
alcohol
: The coefficient for alcohol is not
statistically significant at the 0.05 level. (You may skip the
coefficient interpretation when a variable is not statistically
significant.)
Hierarchical regression is a way to show if variables of your interest explain a statistically significant amount of variance in your Dependent Variable (DV) after accounting for all other variables. This is a framework for model comparison rather than a statistical method. In this framework, you build several regression models by adding variables to a previous model at each step; later models always include smaller models in previous steps. In many cases, our interest is to determine whether newly added variables show a significant improvement in R2 (the proportion of explained variance in DV by the model). (Read more here)
Recall our first model
##
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2733.5 -307.6 21.4 358.1 2192.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3051.25 34.02 89.70 < 0.0000000000000002 ***
## smoker -217.58 26.68 -8.16 0.00000000000000051 ***
## alcohol -30.49 76.23 -0.40 0.69
## nprevist 34.07 2.85 11.93 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 570 on 2996 degrees of freedom
## Multiple R-squared: 0.0729, Adjusted R-squared: 0.0719
## F-statistic: 78.5 on 3 and 2996 DF, p-value: <0.0000000000000002
Our second model: adding unmarried
to the model
##
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2798.8 -309.2 25.4 361.8 2363.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3134.4 35.7 87.91 < 0.0000000000000002 ***
## smoker -175.4 27.1 -6.47 0.00000000011275 ***
## alcohol -21.1 75.6 -0.28 0.78
## nprevist 29.6 2.9 10.21 < 0.0000000000000002 ***
## unmarried -187.1 26.0 -7.20 0.00000000000078 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 566 on 2995 degrees of freedom
## Multiple R-squared: 0.0886, Adjusted R-squared: 0.0874
## F-statistic: 72.8 on 4 and 2995 DF, p-value: <0.0000000000000002
Information criteria, unlike significance tests, are indices that allow for model comparison accounting for model complexity and sample size.
Smaller models are better and more efficient. Smaller values of AIC and BIC is better. In the case where AIC and BIC give conflicting results, we will use BIC as the criteria for a more conservative estimate.
Which model has the highest adjusted R-squared value and the lowest AIC and BIC values?
## df AIC
## lm1 5 46598
## lm2 6 46549
## df BIC
## lm1 5 46628
## lm2 6 46585