a. We first upload and inspect the height-weight dataset with a descriptive statistics summary.
htwt = read.csv("htwt.csv", header=TRUE)
stat.desc(htwt)
## SEX AGE HEIGHT WEIGHT
## nbr.val NA 237.0000000 2.370000e+02 2.370000e+02
## nbr.null NA 0.0000000 0.000000e+00 0.000000e+00
## nbr.na NA 0.0000000 0.000000e+00 0.000000e+00
## min NA 13.9000000 5.050000e+01 5.050000e+01
## max NA 25.0000000 7.200000e+01 1.715000e+02
## range NA 11.1000000 2.150000e+01 1.210000e+02
## sum NA 3897.0000000 1.454340e+04 2.401000e+04
## median NA 16.3000000 6.150000e+01 1.010000e+02
## mean NA 16.4430380 6.136456e+01 1.013080e+02
## SE.mean NA 0.1196882 2.562814e-01 1.262809e+00
## CI.mean NA 0.2357938 5.048915e-01 2.487818e+00
## var NA 3.3950890 1.556620e+01 3.779407e+02
## std.dev NA 1.8425767 3.945402e+00 1.944070e+01
## coef.var NA 0.1120582 6.429447e-02 1.918969e-01
b. From here we can investigate a simple model predicting weight from age.
model.a = lm(htwt$WEIGHT ~ htwt$AGE) #Model A: WGTi = B_0 + B_1*AGEi + ei
model.a
##
## Call:
## lm(formula = htwt$WEIGHT ~ htwt$AGE)
##
## Coefficients:
## (Intercept) htwt$AGE
## -8.794 6.696
The intercept estimated here indicates that, for a newborn (age = 0), one’s weight is predicted to be -8.794 pounds. Of course, without first centering the age predictor around the mean, this intercept has no real-world meaning. The estimated slope indicates that, for an age increase of one year, one’s weight is predicted to rise 6.696 pounds. Of course, in real-world settings this linear relationship will not hold throughout the whole life course.
c. As a necessary step in data visualization, we plot the data overlaid with the regression line.
If we were additionally interested in the relative importance of age on weight for males and females, we could run two regressions: one for the males and one for the females. The plot below depicts the results of such tests.
Since the intercepts for the male and female regression lines appear nearly equal but the regression line for males rises above that for females, it appears that the slope of the age predictor is greater for males than females. That is, before running any formal tests, it appears that age matters more for weight in males than in females.
d. We now investigate a compact model for estimating weight.
model.c = lm(htwt$WEIGHT ~ 1) #Model: WGTi = B_0 + e_i
Here \(b_0 = 101.3\), suggesting that a mean weight of \(101.3\) is a good predictor of the weights of individuals in the sample.
e. We now add the regression line for the compact model to the previous plot.
f. Having defined and ran the compact and augmented models, we now compute PRE and F from SSE(A) and SSE(C) “by hand” and using the modelCompare function.
modelSummary(model.a) #SSE(A) = 53269.9
## lm(formula = htwt$WEIGHT ~ htwt$AGE)
## Observations: 237
##
## Linear model fit by least squares
##
## Coefficients:
## Estimate SE t Pr(>|t|)
## (Intercept) -8.7935 8.8005 -0.999 0.319
## htwt$AGE 6.6959 0.5319 12.589 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sum of squared errors (SSE): 53269.9, Error df: 235
## R-squared: 0.4028
modelSummary(model.c) #SSE(C) = 89194.0
## lm(formula = htwt$WEIGHT ~ 1)
## Observations: 237
##
## Linear model fit by least squares
##
## Coefficients:
## Estimate SE t Pr(>|t|)
## (Intercept) 101.308 1.263 80.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sum of squared errors (SSE): 89194.0, Error df: 236
## R-squared: 0.0000
#Computing PRE and F "by hand":
n = length(htwt$WEIGHT)
PRE = (89194-53269.9)/89194
PRE
## [1] 0.4027636
F = PRE/((1-PRE)/(n-2))
F
## [1] 158.4791
#displaying all information on model comparison using modelCompare
modelCompare(ModelA = model.a, ModelC = model.c) #p << .0001
## SSE (Compact) = 89194.01
## SSE (Augmented) = 53269.93
## Delta R-Squared = 0.4027634
## Partial Eta-Squared (PRE) = 0.4027634
## F(1,235) = 158.4789, p = 4.054587e-28
g. We now reporting our results in two forms:
Statistical conclusion: We regressed weight (lbs) on age (years) and observed a significant relationship (F(1, 235) = 158.47, PRE = 0.4028, p << .00001). In particular, our best-fit model predicted an increase of 6.69 lbs for every additional year of life.
Substantive conclusion: We found that, on average, increased age is associated with increased weight.
We first upload and inspect the APGAR dataset with a descriptive statistics summary.
APGAR = read.csv("apgar.csv", header=TRUE)
stat.desc(APGAR)
## CASE APGAR GENDER SMOKES WGTGAIN
## nbr.val 60.0000000 60.0000000 60.00000000 60.0000000 60.0000000
## nbr.null 0.0000000 0.0000000 30.00000000 44.0000000 0.0000000
## nbr.na 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## min 1.0000000 1.0000000 0.00000000 0.0000000 8.0000000
## max 60.0000000 10.0000000 1.00000000 1.0000000 75.0000000
## range 59.0000000 9.0000000 1.00000000 1.0000000 67.0000000
## sum 1830.0000000 401.0000000 30.00000000 16.0000000 1625.0000000
## median 30.5000000 7.0000000 0.50000000 0.0000000 25.0000000
## mean 30.5000000 6.6833333 0.50000000 0.2666667 27.0833333
## SE.mean 2.2546249 0.2715218 0.06509446 0.0575717 1.6254146
## CI.mean.0.95 4.5114940 0.5433139 0.13025370 0.1152007 3.2524471
## var 305.0000000 4.4234463 0.25423729 0.1988701 158.5183616
## std.dev 17.4642492 2.1031991 0.50421948 0.4459485 12.5904075
## coef.var 0.5725983 0.3146931 1.00843897 1.6723068 0.4648766
## GESTAT PRENAT ANNINC
## nbr.val 60.0000000 60.0000000 60.0000000
## nbr.null 0.0000000 6.0000000 0.0000000
## nbr.na 0.0000000 0.0000000 0.0000000
## min 20.0000000 0.0000000 10.0000000
## max 42.0000000 3.0000000 180.0000000
## range 22.0000000 3.0000000 170.0000000
## sum 2227.0000000 118.0000000 3347.0000000
## median 38.0000000 2.0000000 41.5000000
## mean 37.1166667 1.9666667 55.7833333
## SE.mean 0.6954622 0.1279271 5.2383361
## CI.mean.0.95 1.3916166 0.2559816 10.4818863
## var 29.0200565 0.9819209 1646.4098870
## std.dev 5.3870267 0.9909192 40.5759767
## coef.var 0.1451377 0.5038572 0.7273853
aI. We first define the compact and augmented models we’ll use for our model comparison:
\(Model C: APGAR_i = \beta_0 + \epsilon_i\)
\(Model A: APGAR_i = \beta_0 + \beta_{1}*GESTAT_i + \epsilon_i\)
aII. With these models defined, we can now run them separately and carry out a model comparison.
model.c2 = lm(APGAR$APGAR ~ 1)
model.c2
##
## Call:
## lm(formula = APGAR$APGAR ~ 1)
##
## Coefficients:
## (Intercept)
## 6.683
model.a2 = lm(APGAR$APGAR ~ APGAR$GESTAT)
model.a2
##
## Call:
## lm(formula = APGAR$APGAR ~ APGAR$GESTAT)
##
## Coefficients:
## (Intercept) APGAR$GESTAT
## -0.9303 0.2051
modelCompare(ModelC = model.c2, ModelA = model.a2)
## SSE (Compact) = 260.9833
## SSE (Augmented) = 188.939
## Delta R-Squared = 0.2760496
## Partial Eta-Squared (PRE) = 0.2760496
## F(1,58) = 22.11599, p = 1.629713e-05
aIII. From the test in aII we have that SSE(A) = 188.9, SSE(C) = 260.98, PRE = 0.276, F(1,58) = 22.11 and p = 1.63e-05. Intercepts of 6.683 for the compact model and -0.9303 for the augmented model indicate that the predicted APGAR score for a baby whose mother had no gestation time are \(6.683\) and \(-0.9303\), respectively. However, without first centering the predictors these values have no meaning. A slope of \(0.2051\) in the augmented model indicates that for an increase in every one unit of gestation time, the baby’s predicted APGAR score will increase by \(0.2051\).
aIV. We now report our results in two forms:
Statistical conclusion: I regressed APGAR scores on gestational time (GESTAT) and found a significant relationship between these two variables (F(1,58) = 22.11, PRE = 0.276, p < .001). Our best-fit model predicted an increase of \(0.2051\) APGAR points for every unit increase in gestational time.
Short, substantive conclusion: greater gestational times are associated with improved APGAR scores.
aV. We an aside, we compute the correlation between GESTAT and APGAR:
gestCorr = cor(APGAR$APGAR, APGAR$GESTAT)
gestCorr
## [1] 0.5254043
b. However, we note that the intercept of the original model is meaningless. To remedy this we center the predictors and re-run the analysis.
GESTAT_c = APGAR$GESTAT - mean(APGAR$GESTAT)
model.c3 = model.c2 #compact model remains unchanged
model.a3 = lm(APGAR$APGAR ~ GESTAT_c) #Model: APGARi = B_0 + B_1*GESTAT_ci + ei
model.a3
##
## Call:
## lm(formula = APGAR$APGAR ~ GESTAT_c)
##
## Coefficients:
## (Intercept) GESTAT_c
## 6.6833 0.2051
modelCompare(ModelC = model.c3, ModelA = model.a3) #the model comparison is identical to that in aII as expected
## SSE (Compact) = 260.9833
## SSE (Augmented) = 188.939
## Delta R-Squared = 0.2760496
## Partial Eta-Squared (PRE) = 0.2760496
## F(1,58) = 22.11599, p = 1.629713e-05
The centered augmented model yielded an intercept of \(6.6833\), the mean of the APGAR scores, predicting that a baby experiencing exactly the mean level of gestation should likewise receive the mean APGAR score. In contrast, the uncentered augmented model included an intercept of \(-0.9303\) as described above. This latter value is meaningless for practical purposes. The slopes of both the centered and uncentered augmented models are the same, so no additional intpretation in necessary beyond that given in aIII-IV.
c. We can also re-run the analysis after rescaling the gestation predictor to days.
GESTAT_days = APGAR$GESTAT*7
model.c4 = model.c2 #compact model remains unchanged
model.a4 = lm(APGAR$APGAR ~ GESTAT_days) # Model: APGARi = B_0 + B_1*GESTAT_days + ei
model.a4
##
## Call:
## lm(formula = APGAR$APGAR ~ GESTAT_days)
##
## Coefficients:
## (Intercept) GESTAT_days
## -0.9303 0.0293
modelCompare(ModelC = model.c4, ModelA = model.a4) #the model comparison is identical to that in aII and b as expected
## SSE (Compact) = 260.9833
## SSE (Augmented) = 188.939
## Delta R-Squared = 0.2760496
## Partial Eta-Squared (PRE) = 0.2760496
## F(1,58) = 22.11599, p = 1.629713e-05
The unit-converted augmented model yielded the same intercept as the unconverted augmented model, but a different slope. In particular, the converted model estimated a slope of \(0.0293\), indicating that for every additional day of gestation, a baby’s APGAR score is predicted to rise by \(0.0293\) points. Importantly, this slope is 1/7th of that of the unconverted model slope, which has the same interpretation for gestational increases in weeks. Mathematically, this checks out: \(0.2051*GESTAT = 0.0293*GESTAT_{days} = 0.0293*GESTAT*7\)
d. To verify that the original, centered and rescaled models yield the same predictions, we compute a predicted APGAR score given a gestation time of 39 weeks (273 days, 1.88333 weeks from mean GESTAT).
pred39_original = -0.9303 + 0.2051*39
pred39_original
## [1] 7.0686
pred39_centered = 6.6833 + 0.2051*1.88333
pred39_centered
## [1] 7.069571
pred39_scaled = -0.9303 + 0.0293*273
pred39_scaled
## [1] 7.0686
As expected, the models yield the same predictions.
e. In this exercise I learned a number of important aspects of linear regressions: 1) deviating predictors from the mean ensures that the intercept will be the outcome mean and therefore intelligible;scaling the predictors by some constant (in order to change units, say) causes the slope to be scaled by the inverse of that constant and that doing this allows the slope to be interpret on a new scale; 3) neither 1) nor 2) affect the predicted outcomes, goodness of fit and model comparisons. Moreover, each of 1), 2) and 3) are easily proven mathematically.
We first read in data and inspect with a descriptive statistics summary:
insurance = read.csv("insurance_dataset.csv", header=TRUE)
stat.desc(insurance)
## age sex bmi children smoker region
## nbr.val 1.338000e+03 NA 1.338000e+03 1.338000e+03 NA NA
## nbr.null 0.000000e+00 NA 0.000000e+00 5.740000e+02 NA NA
## nbr.na 0.000000e+00 NA 0.000000e+00 0.000000e+00 NA NA
## min 1.800000e+01 NA 1.596000e+01 0.000000e+00 NA NA
## max 6.400000e+01 NA 5.313000e+01 5.000000e+00 NA NA
## range 4.600000e+01 NA 3.717000e+01 5.000000e+00 NA NA
## sum 5.245900e+04 NA 4.102763e+04 1.465000e+03 NA NA
## median 3.900000e+01 NA 3.040000e+01 1.000000e+00 NA NA
## mean 3.920703e+01 NA 3.066340e+01 1.094918e+00 NA NA
## SE.mean 3.841024e-01 NA 1.667142e-01 3.295616e-02 NA NA
## CI.mean.0.95 7.535090e-01 NA 3.270500e-01 6.465140e-02 NA NA
## var 1.974014e+02 NA 3.718788e+01 1.453213e+00 NA NA
## std.dev 1.404996e+01 NA 6.098187e+00 1.205493e+00 NA NA
## coef.var 3.583531e-01 NA 1.988751e-01 1.100989e+00 NA NA
## charges
## nbr.val 1.338000e+03
## nbr.null 0.000000e+00
## nbr.na 0.000000e+00
## min 1.121874e+03
## max 6.377043e+04
## range 6.264855e+04
## sum 1.775582e+07
## median 9.382033e+03
## mean 1.327042e+04
## SE.mean 3.310675e+02
## CI.mean.0.95 6.494682e+02
## var 1.466524e+08
## std.dev 1.211001e+04
## coef.var 9.125566e-01
(this dataset includes annual medical insurance costs of \(1338\) Americans alongside a number of personal characteristics such as age, sex, smoking status, and so on).
…we’ll test if a continuous BMI measurement is associated with annual medical costs in the insurance dataset…
a. We first define the compact and augmented models:
\(Model C: charges_i = \beta_0 + \epsilon_i\)
\(Model A: charges_i = \beta_0 + \beta_1*BMI_i + \epsilon_i\)
b. We now run the corresponding model comparison:
model.c5 = lm(insurance$charges ~ 1)
model.a5 = lm(insurance$charges ~ insurance$bmi)
modelCompare(ModelC = model.c5, ModelA = model.a5)
## SSE (Compact) = 196074221568
## SSE (Augmented) = 188360830332
## Delta R-Squared = 0.03933914
## Partial Eta-Squared (PRE) = 0.03933914
## F(1,1336) = 54.70931, p = 2.459086e-13
We report that F(1, 1336) = 54.71, PRE = 0.039 and p << .0001.
c. We therefore report that, on average, higher BMI is associated with higher annual medical insurance costs.
…for fun…plotting!
ggplot(insurance, aes(bmi, charges)) + geom_point() + geom_smooth(method = "lm")
#it appears that there are two separate categories of people hiding in this plot...
ggplot(insurance, aes(bmi, charges, color=smoker)) + geom_point() + geom_smooth(method = "lm")
It turns out that the group above the regression line are smokers and that increasing BMI increases insurance costs to a greater extent amongst smokers over non-smokers, an interaction effect.