Question 1

a. We first upload and inspect the height-weight dataset with a descriptive statistics summary.

htwt = read.csv("htwt.csv", header=TRUE)
stat.desc(htwt)
##          SEX          AGE       HEIGHT       WEIGHT
## nbr.val   NA  237.0000000 2.370000e+02 2.370000e+02
## nbr.null  NA    0.0000000 0.000000e+00 0.000000e+00
## nbr.na    NA    0.0000000 0.000000e+00 0.000000e+00
## min       NA   13.9000000 5.050000e+01 5.050000e+01
## max       NA   25.0000000 7.200000e+01 1.715000e+02
## range     NA   11.1000000 2.150000e+01 1.210000e+02
## sum       NA 3897.0000000 1.454340e+04 2.401000e+04
## median    NA   16.3000000 6.150000e+01 1.010000e+02
## mean      NA   16.4430380 6.136456e+01 1.013080e+02
## SE.mean   NA    0.1196882 2.562814e-01 1.262809e+00
## CI.mean   NA    0.2357938 5.048915e-01 2.487818e+00
## var       NA    3.3950890 1.556620e+01 3.779407e+02
## std.dev   NA    1.8425767 3.945402e+00 1.944070e+01
## coef.var  NA    0.1120582 6.429447e-02 1.918969e-01

b. From here we can investigate a simple model predicting weight from age.

model.a = lm(htwt$WEIGHT ~ htwt$AGE) #Model A: WGTi = B_0 + B_1*AGEi + ei 
model.a
## 
## Call:
## lm(formula = htwt$WEIGHT ~ htwt$AGE)
## 
## Coefficients:
## (Intercept)     htwt$AGE  
##      -8.794        6.696

The intercept estimated here indicates that, for a newborn (age = 0), one’s weight is predicted to be -8.794 pounds. Of course, without first centering the age predictor around the mean, this intercept has no real-world meaning. The estimated slope indicates that, for an age increase of one year, one’s weight is predicted to rise 6.696 pounds. Of course, in real-world settings this linear relationship will not hold throughout the whole life course.

c. As a necessary step in data visualization, we plot the data overlaid with the regression line.

If we were additionally interested in the relative importance of age on weight for males and females, we could run two regressions: one for the males and one for the females. The plot below depicts the results of such tests.

Since the intercepts for the male and female regression lines appear nearly equal but the regression line for males rises above that for females, it appears that the slope of the age predictor is greater for males than females. That is, before running any formal tests, it appears that age matters more for weight in males than in females.

d. We now investigate a compact model for estimating weight.

model.c = lm(htwt$WEIGHT ~ 1) #Model: WGTi = B_0 + e_i

Here \(b_0 = 101.3\), suggesting that a mean weight of \(101.3\) is a good predictor of the weights of individuals in the sample.

e. We now add the regression line for the compact model to the previous plot.

f. Having defined and ran the compact and augmented models, we now compute PRE and F from SSE(A) and SSE(C) “by hand” and using the modelCompare function.

modelSummary(model.a) #SSE(A) = 53269.9
## lm(formula = htwt$WEIGHT ~ htwt$AGE)
## Observations: 237
## 
## Linear model fit by least squares
## 
## Coefficients:
##             Estimate      SE      t Pr(>|t|)    
## (Intercept)  -8.7935  8.8005 -0.999    0.319    
## htwt$AGE      6.6959  0.5319 12.589   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 53269.9, Error df: 235
## R-squared:  0.4028
modelSummary(model.c) #SSE(C) = 89194.0
## lm(formula = htwt$WEIGHT ~ 1)
## Observations: 237
## 
## Linear model fit by least squares
## 
## Coefficients:
##             Estimate      SE     t Pr(>|t|)    
## (Intercept)  101.308   1.263 80.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sum of squared errors (SSE): 89194.0, Error df: 236
## R-squared:  0.0000
#Computing PRE and F "by hand":
n = length(htwt$WEIGHT)
PRE = (89194-53269.9)/89194
PRE
## [1] 0.4027636
F = PRE/((1-PRE)/(n-2))
F
## [1] 158.4791
#displaying all information on model comparison using modelCompare
modelCompare(ModelA = model.a, ModelC = model.c) #p << .0001
## SSE (Compact) =  89194.01 
## SSE (Augmented) =  53269.93 
## Delta R-Squared =  0.4027634 
## Partial Eta-Squared (PRE) =  0.4027634 
## F(1,235) = 158.4789, p = 4.054587e-28

g. We now reporting our results in two forms:

Statistical conclusion: We regressed weight (lbs) on age (years) and observed a significant relationship (F(1, 235) = 158.47, PRE = 0.4028, p << .00001). In particular, our best-fit model predicted an increase of 6.69 lbs for every additional year of life.

Substantive conclusion: We found that, on average, increased age is associated with increased weight.

Question 2

We first upload and inspect the APGAR dataset with a descriptive statistics summary.

APGAR = read.csv("apgar.csv", header=TRUE)
stat.desc(APGAR)
##                      CASE       APGAR      GENDER     SMOKES      WGTGAIN
## nbr.val        60.0000000  60.0000000 60.00000000 60.0000000   60.0000000
## nbr.null        0.0000000   0.0000000 30.00000000 44.0000000    0.0000000
## nbr.na          0.0000000   0.0000000  0.00000000  0.0000000    0.0000000
## min             1.0000000   1.0000000  0.00000000  0.0000000    8.0000000
## max            60.0000000  10.0000000  1.00000000  1.0000000   75.0000000
## range          59.0000000   9.0000000  1.00000000  1.0000000   67.0000000
## sum          1830.0000000 401.0000000 30.00000000 16.0000000 1625.0000000
## median         30.5000000   7.0000000  0.50000000  0.0000000   25.0000000
## mean           30.5000000   6.6833333  0.50000000  0.2666667   27.0833333
## SE.mean         2.2546249   0.2715218  0.06509446  0.0575717    1.6254146
## CI.mean.0.95    4.5114940   0.5433139  0.13025370  0.1152007    3.2524471
## var           305.0000000   4.4234463  0.25423729  0.1988701  158.5183616
## std.dev        17.4642492   2.1031991  0.50421948  0.4459485   12.5904075
## coef.var        0.5725983   0.3146931  1.00843897  1.6723068    0.4648766
##                    GESTAT      PRENAT       ANNINC
## nbr.val        60.0000000  60.0000000   60.0000000
## nbr.null        0.0000000   6.0000000    0.0000000
## nbr.na          0.0000000   0.0000000    0.0000000
## min            20.0000000   0.0000000   10.0000000
## max            42.0000000   3.0000000  180.0000000
## range          22.0000000   3.0000000  170.0000000
## sum          2227.0000000 118.0000000 3347.0000000
## median         38.0000000   2.0000000   41.5000000
## mean           37.1166667   1.9666667   55.7833333
## SE.mean         0.6954622   0.1279271    5.2383361
## CI.mean.0.95    1.3916166   0.2559816   10.4818863
## var            29.0200565   0.9819209 1646.4098870
## std.dev         5.3870267   0.9909192   40.5759767
## coef.var        0.1451377   0.5038572    0.7273853

aI. We first define the compact and augmented models we’ll use for our model comparison:

\(Model C: APGAR_i = \beta_0 + \epsilon_i\)

\(Model A: APGAR_i = \beta_0 + \beta_{1}*GESTAT_i + \epsilon_i\)

aII. With these models defined, we can now run them separately and carry out a model comparison.

model.c2 = lm(APGAR$APGAR ~ 1)
model.c2
## 
## Call:
## lm(formula = APGAR$APGAR ~ 1)
## 
## Coefficients:
## (Intercept)  
##       6.683
model.a2 = lm(APGAR$APGAR ~ APGAR$GESTAT)
model.a2
## 
## Call:
## lm(formula = APGAR$APGAR ~ APGAR$GESTAT)
## 
## Coefficients:
##  (Intercept)  APGAR$GESTAT  
##      -0.9303        0.2051
modelCompare(ModelC = model.c2, ModelA = model.a2)
## SSE (Compact) =  260.9833 
## SSE (Augmented) =  188.939 
## Delta R-Squared =  0.2760496 
## Partial Eta-Squared (PRE) =  0.2760496 
## F(1,58) = 22.11599, p = 1.629713e-05

aIII. From the test in aII we have that SSE(A) = 188.9, SSE(C) = 260.98, PRE = 0.276, F(1,58) = 22.11 and p = 1.63e-05. Intercepts of 6.683 for the compact model and -0.9303 for the augmented model indicate that the predicted APGAR score for a baby whose mother had no gestation time are \(6.683\) and \(-0.9303\), respectively. However, without first centering the predictors these values have no meaning. A slope of \(0.2051\) in the augmented model indicates that for an increase in every one unit of gestation time, the baby’s predicted APGAR score will increase by \(0.2051\).

aIV. We now report our results in two forms:

Statistical conclusion: I regressed APGAR scores on gestational time (GESTAT) and found a significant relationship between these two variables (F(1,58) = 22.11, PRE = 0.276, p < .001). Our best-fit model predicted an increase of \(0.2051\) APGAR points for every unit increase in gestational time.

Short, substantive conclusion: greater gestational times are associated with improved APGAR scores.

aV. We an aside, we compute the correlation between GESTAT and APGAR:

gestCorr = cor(APGAR$APGAR, APGAR$GESTAT)
gestCorr
## [1] 0.5254043

b. However, we note that the intercept of the original model is meaningless. To remedy this we center the predictors and re-run the analysis.

GESTAT_c = APGAR$GESTAT - mean(APGAR$GESTAT)
model.c3 = model.c2 #compact model remains unchanged
model.a3 = lm(APGAR$APGAR ~ GESTAT_c) #Model: APGARi = B_0 + B_1*GESTAT_ci + ei
model.a3
## 
## Call:
## lm(formula = APGAR$APGAR ~ GESTAT_c)
## 
## Coefficients:
## (Intercept)     GESTAT_c  
##      6.6833       0.2051
modelCompare(ModelC = model.c3, ModelA = model.a3) #the model comparison is identical to that in aII as expected
## SSE (Compact) =  260.9833 
## SSE (Augmented) =  188.939 
## Delta R-Squared =  0.2760496 
## Partial Eta-Squared (PRE) =  0.2760496 
## F(1,58) = 22.11599, p = 1.629713e-05

The centered augmented model yielded an intercept of \(6.6833\), the mean of the APGAR scores, predicting that a baby experiencing exactly the mean level of gestation should likewise receive the mean APGAR score. In contrast, the uncentered augmented model included an intercept of \(-0.9303\) as described above. This latter value is meaningless for practical purposes. The slopes of both the centered and uncentered augmented models are the same, so no additional intpretation in necessary beyond that given in aIII-IV.

c. We can also re-run the analysis after rescaling the gestation predictor to days.

GESTAT_days = APGAR$GESTAT*7
model.c4 = model.c2 #compact model remains unchanged
model.a4 = lm(APGAR$APGAR ~ GESTAT_days) # Model: APGARi = B_0 + B_1*GESTAT_days + ei
model.a4
## 
## Call:
## lm(formula = APGAR$APGAR ~ GESTAT_days)
## 
## Coefficients:
## (Intercept)  GESTAT_days  
##     -0.9303       0.0293
modelCompare(ModelC = model.c4, ModelA = model.a4) #the model comparison is identical to that in aII and b as expected
## SSE (Compact) =  260.9833 
## SSE (Augmented) =  188.939 
## Delta R-Squared =  0.2760496 
## Partial Eta-Squared (PRE) =  0.2760496 
## F(1,58) = 22.11599, p = 1.629713e-05

The unit-converted augmented model yielded the same intercept as the unconverted augmented model, but a different slope. In particular, the converted model estimated a slope of \(0.0293\), indicating that for every additional day of gestation, a baby’s APGAR score is predicted to rise by \(0.0293\) points. Importantly, this slope is 1/7th of that of the unconverted model slope, which has the same interpretation for gestational increases in weeks. Mathematically, this checks out: \(0.2051*GESTAT = 0.0293*GESTAT_{days} = 0.0293*GESTAT*7\)

d. To verify that the original, centered and rescaled models yield the same predictions, we compute a predicted APGAR score given a gestation time of 39 weeks (273 days, 1.88333 weeks from mean GESTAT).

pred39_original = -0.9303 + 0.2051*39
pred39_original
## [1] 7.0686
pred39_centered = 6.6833 + 0.2051*1.88333
pred39_centered
## [1] 7.069571
pred39_scaled = -0.9303 + 0.0293*273
pred39_scaled
## [1] 7.0686

As expected, the models yield the same predictions.

e. In this exercise I learned a number of important aspects of linear regressions: 1) deviating predictors from the mean ensures that the intercept will be the outcome mean and therefore intelligible;scaling the predictors by some constant (in order to change units, say) causes the slope to be scaled by the inverse of that constant and that doing this allows the slope to be interpret on a new scale; 3) neither 1) nor 2) affect the predicted outcomes, goodness of fit and model comparisons. Moreover, each of 1), 2) and 3) are easily proven mathematically.

Question 3

We first read in data and inspect with a descriptive statistics summary:

insurance = read.csv("insurance_dataset.csv", header=TRUE)
stat.desc(insurance)
##                       age sex          bmi     children smoker region
## nbr.val      1.338000e+03  NA 1.338000e+03 1.338000e+03     NA     NA
## nbr.null     0.000000e+00  NA 0.000000e+00 5.740000e+02     NA     NA
## nbr.na       0.000000e+00  NA 0.000000e+00 0.000000e+00     NA     NA
## min          1.800000e+01  NA 1.596000e+01 0.000000e+00     NA     NA
## max          6.400000e+01  NA 5.313000e+01 5.000000e+00     NA     NA
## range        4.600000e+01  NA 3.717000e+01 5.000000e+00     NA     NA
## sum          5.245900e+04  NA 4.102763e+04 1.465000e+03     NA     NA
## median       3.900000e+01  NA 3.040000e+01 1.000000e+00     NA     NA
## mean         3.920703e+01  NA 3.066340e+01 1.094918e+00     NA     NA
## SE.mean      3.841024e-01  NA 1.667142e-01 3.295616e-02     NA     NA
## CI.mean.0.95 7.535090e-01  NA 3.270500e-01 6.465140e-02     NA     NA
## var          1.974014e+02  NA 3.718788e+01 1.453213e+00     NA     NA
## std.dev      1.404996e+01  NA 6.098187e+00 1.205493e+00     NA     NA
## coef.var     3.583531e-01  NA 1.988751e-01 1.100989e+00     NA     NA
##                   charges
## nbr.val      1.338000e+03
## nbr.null     0.000000e+00
## nbr.na       0.000000e+00
## min          1.121874e+03
## max          6.377043e+04
## range        6.264855e+04
## sum          1.775582e+07
## median       9.382033e+03
## mean         1.327042e+04
## SE.mean      3.310675e+02
## CI.mean.0.95 6.494682e+02
## var          1.466524e+08
## std.dev      1.211001e+04
## coef.var     9.125566e-01

(this dataset includes annual medical insurance costs of \(1338\) Americans alongside a number of personal characteristics such as age, sex, smoking status, and so on).

…we’ll test if a continuous BMI measurement is associated with annual medical costs in the insurance dataset…

a. We first define the compact and augmented models:

\(Model C: charges_i = \beta_0 + \epsilon_i\)

\(Model A: charges_i = \beta_0 + \beta_1*BMI_i + \epsilon_i\)

b. We now run the corresponding model comparison:

model.c5 = lm(insurance$charges ~ 1)
model.a5 = lm(insurance$charges ~ insurance$bmi)
modelCompare(ModelC = model.c5, ModelA = model.a5)
## SSE (Compact) =  196074221568 
## SSE (Augmented) =  188360830332 
## Delta R-Squared =  0.03933914 
## Partial Eta-Squared (PRE) =  0.03933914 
## F(1,1336) = 54.70931, p = 2.459086e-13

We report that F(1, 1336) = 54.71, PRE = 0.039 and p << .0001.

c. We therefore report that, on average, higher BMI is associated with higher annual medical insurance costs.

…for fun…plotting!

ggplot(insurance, aes(bmi, charges)) + geom_point() + geom_smooth(method = "lm") 

#it appears that there are two separate categories of people hiding in this plot...
ggplot(insurance, aes(bmi, charges, color=smoker)) + geom_point() + geom_smooth(method = "lm") 

It turns out that the group above the regression line are smokers and that increasing BMI increases insurance costs to a greater extent amongst smokers over non-smokers, an interaction effect.