DATA 605 Problem Set 12: Multiple Regression

Read in our CSV data

# TODO: replace local file path with GitHub URL
df <- read.csv("~/CUNY/computationalMath605/data/real-world-data.csv")

head(df)

##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp
## 1 0.000572294      20      92    112
## 2 0.004614439     169    3128   3297
## 3 0.002091362     108    5184   5292
## 4 0.003500000    2589  169725 172314
## 5 0.001146162      36    1620   1656
## 6 0.002773810     503   12543  13046

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

library(ggplot2)
ggplot(df, aes(x=TotExp, y=LifeExp)) + geom_point()  # response on y, predictor on x

Let’s run a simple linear regression on these two variables using the R functions lm and summary.

linModel <- lm( LifeExp~TotExp, df)

summary(linModel)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

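The multiple \(R^2\) of 0.2577 (adjusted: 0.2537) means that our input variable (TotExp) explains only about \(26\%\) of the variance in our output variable (LifeExp). The F-statistic (65.26 on 1 and 188 degrees of freedom) compares our model to a model with only an intercept parameter. Its p-value (\(7.71 \times 10^{-14}\)) is the probability of seeing an F-statistic at least this large if TotExp had no linear relationship with LifeExp, so we reject that null hypothesis. Because we only have one additional parameter, this p-value matches the one from the t-test on the TotExp coefficient. The Std. Error column shows the standard error of each coefficient estimate (Intercept and TotExp); the TotExp estimate is about 8 times its standard error (t = 8.079). The residual standard error of 9.371 means a typical prediction misses the observed life expectancy by roughly 9 years.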
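
The statistics in the summary are tied together, and a short check can make that concrete. The sketch below uses only numbers copied from the printout above (the variable names are illustrative, not part of the original analysis) and relies on two standard identities for a one-predictor model: the F-statistic equals the squared t value of the slope, and \(R^2 = F / (F + \text{residual df})\).

```r
# Values copied from the summary(linModel) printout above
t_slope <- 8.079   # t value for the TotExp coefficient
df_resid <- 188    # residual degrees of freedom
n <- df_resid + 2  # observations: residual df + 2 estimated coefficients

# With a single predictor, the F-statistic is the squared slope t value
f_stat <- t_slope^2

# R^2 and adjusted R^2 recovered from F and the degrees of freedom
r2 <- f_stat / (f_stat + df_resid)
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Upper-tail probability of the F distribution gives the model p-value
p_val <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(round(c(F = f_stat, R2 = r2, adjR2 = adj_r2), 4))
print(p_val)
```

Because the inputs are rounded, the results should agree with the printed summary only up to rounding in the last digit.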

Assumptions for Regression

Linear Relationship: The data, as plotted above, do not exhibit a linear relationship visually. From a quick “eye test”, a curved (nonlinear) function may model these data better.
Residuals are normally distributed: We’ll plot the distribution of residuals below. They do not appear normally distributed, as the histogram is skewed left.

res <- resid(linModel)  # alternatively, linModel$residuals
hist(res)

Finally, we can generate a Q-Q plot of the residuals to check the normality assumption

# Create Q-Q plot of our residuals
qqnorm(res)
qqline(res)

The points in this plot do not follow the reference line, so the residuals are not normally distributed. In this case a linear regression model on the untransformed variables does not fit our data well.
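
The visual checks above can be backed by a formal test and a residuals-vs-fitted plot. The sketch below is an assumption-laden illustration: it runs on simulated data (the variables x and y are hypothetical and only loosely mimic a skewed predictor with a saturating response), not on the real data set, so its numbers say nothing about linModel itself.

```r
# Hypothetical data: a skewed predictor with a saturating response
# (NOT the real LifeExp / TotExp data)
set.seed(605)
x <- rexp(190, rate = 1 / 40000)
y <- 40 + 8 * log10(x + 1) + rnorm(190, sd = 4)

# Fit a straight line to the curved relationship
fit <- lm(y ~ x)
res_sim <- resid(fit)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(res_sim)
print(sw$p.value)

# Residuals vs. fitted values: curvature or a funnel shape signals
# a violated linearity or constant-variance assumption
plot(fitted(fit), res_sim)
abline(h = 0, lty = 2)
```

The same two calls, shapiro.test(res) and plot(fitted(linModel), res), could be applied to the real model once df is loaded.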

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better”?

df$LifeExpRaised <- df$LifeExp ^ 4.6
df$TotExpRaised <- df$TotExp ^ 0.06

head(df)

##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp LifeExpRaised TotExpRaised
## 1 0.000572294      20      92    112      29305338     1.327251
## 2 0.004614439     169    3128   3297     327935478     1.625875
## 3 0.002091362     108    5184   5292     327935478     1.672697
## 4 0.003500000    2589  169725 172314     636126841     2.061481
## 5 0.001146162      36    1620   1656      26230450     1.560068
## 6 0.002773810     503   12543  13046     372636298     1.765748

# Plotting our variables raised to their respective powers (response on the y-axis)
ggplot(df, aes(x=TotExpRaised, y=LifeExpRaised)) + geom_point()

The transformed data look to have a much more linear relationship. Let’s run our linear model on these data and then interpret our summary statistics.

linModelRaised <- lm( LifeExpRaised~TotExpRaised, df)

summary(linModelRaised)

## 
## Call:
## lm(formula = LifeExpRaised ~ TotExpRaised, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -736527910   46817945  -15.73   <2e-16 ***
## TotExpRaised  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

In this case, we see an even smaller p-value (below \(2.2 \times 10^{-16}\) rather than \(7.71 \times 10^{-14}\)), but both are effectively zero, so the p-value alone does little to separate the two models. The more telling changes are in \(R^2\) and the F-statistic. \(R^2\) rises from 0.2577 to 0.7298, so the transformed predictor explains about \(73\%\) of the variance in the transformed response. The F-statistic rises from 65.26 to 507.7, meaning this model improves on its own intercept-only model by a much wider margin than linModel does. The standard errors of our coefficients and the residual standard error (90490000) are larger, but that is because the scale of our data is much larger after raising LifeExp to the 4.6 power; they cannot be compared directly with those of linModel.

Overall, our linModelRaised is a “better” model: the transformed data look far more linear, which results in a better fit from a linear model.

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

Using R’s built-in predict function, we can pass our linModelRaised object as well as a data frame of our new TotExp^.06 values

# predicting LifeExp^4.6 at the new TotExp^.06 values
newExp <- data.frame(TotExpRaised=c(1.5, 2.5))
predict(linModelRaised, newdata = newExp)

##         1         2 
## 193562414 813622630
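
The values printed above are on the transformed scale (LifeExp^4.6), not in years. To turn them into life expectancies we can undo the transform by raising them to the 1/4.6 power. The sketch below rebuilds the two predictions from the coefficients printed in summary(linModelRaised), so it runs without the data file; the variable names are illustrative.

```r
# Coefficients copied from the summary(linModelRaised) printout above
b0 <- -736527910   # intercept
b1 <- 620060216    # slope on TotExp^0.06

# New predictor values from the question
tot_exp_raised <- c(1.5, 2.5)

# Predictions on the transformed scale (LifeExp^4.6)
pred_raised <- b0 + b1 * tot_exp_raised

# Back-transform to years by inverting the 4.6 power
life_exp <- pred_raised^(1 / 4.6)
print(round(life_exp, 1))
```

This gives forecasts of roughly 63.3 years at TotExp^.06 = 1.5 and 86.5 years at TotExp^.06 = 2.5. With the model object in memory, predict(linModelRaised, newdata = newExp)^(1 / 4.6) does the same thing in one step.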

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

We can use lm again, since it accepts more than one predictor variable (multiple regression). We’ll store the fitted model as mlm.

# Storing our last term as a product column
df$mDExp <- df$PropMD * df$TotExp

# build the multiple regression model
mlm <- lm(LifeExp ~ PropMD + TotExp + mDExp, df)
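
As an alternative to storing the product column, R's formula syntax can build the interaction term itself: PropMD * TotExp expands to both main effects plus their interaction. The sketch below checks that the two approaches give the same estimates. It uses simulated data (the data frame sim and its coefficients are hypothetical stand-ins for df), so it illustrates the syntax only.

```r
# Hypothetical data standing in for df (NOT the real data set)
set.seed(12)
sim <- data.frame(PropMD = runif(100, 0, 0.005),
                  TotExp = runif(100, 10, 50000))
sim$LifeExp <- 60 + 1500 * sim$PropMD + 7e-05 * sim$TotExp + rnorm(100, sd = 5)

# Approach used above: precompute the product column
sim$mDExp <- sim$PropMD * sim$TotExp
fit_manual <- lm(LifeExp ~ PropMD + TotExp + mDExp, data = sim)

# Formula shorthand: main effects plus the PropMD:TotExp interaction
fit_formula <- lm(LifeExp ~ PropMD * TotExp, data = sim)

# Same estimates; only the name of the last coefficient differs
same_fit <- all.equal(unname(coef(fit_manual)), unname(coef(fit_formula)))
print(same_fit)

# predict() builds the interaction itself, so no mDExp column is needed
predict(fit_formula, data.frame(PropMD = 0.003, TotExp = 20000))
```

One practical advantage of the formula version is that new data passed to predict only needs the PropMD and TotExp columns.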

Again, let’s look at and interpret our summary stats on this model

summary(mlm)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + mDExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD       1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp       7.233e-05  8.982e-06   8.053 9.39e-14 ***
## mDExp       -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The F-statistic (34.49 on 3 and 186 degrees of freedom, p-value below \(2.2 \times 10^{-16}\)) indicates that this model fits better than an intercept-only model, and every coefficient, including the interaction term, is significant well below the 0.001 level. However, the adjusted \(R^2\) of 0.3471 means the model explains only about \(35\%\) of the variance in LifeExp, and the residual standard error of 8.765 years is only slightly lower than the 9.371 from the simple model. The model is statistically significant, but it is not a good fit.

mresid <- resid(mlm)
hist(mresid)

Forecast LifeExp when \(PropMD=0.03\) and \(TotExp = 14\). Does this forecast seem realistic? Why or why not?

Again, we can use the predict method here for these new input values.

multiNewData <- data.frame(PropMD=0.03, TotExp = 14)
multiNewData$mDExp <- multiNewData$PropMD * multiNewData$TotExp

# Predicting with our given values
predict(mlm, multiNewData)

##       1 
## 107.696

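While this predicted value is possible, it is too high to be realistic. The highest life expectancy in the rows shown by head(df) is 82 years, and a national average over 100 years is not a reasonable value, knowing how rare it is for people to live past 100. The inputs are also an unusual combination: PropMD = 0.03 is roughly nine times the largest value among those rows, while TotExp = 14 is well below the smallest, so the model is likely extrapolating beyond the data it was fit on.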
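
Breaking the forecast into its terms shows where the extra years come from. The sketch below uses the coefficients as rounded in the summary(mlm) printout, so its total differs slightly from the predict output; the variable names are illustrative.

```r
# Coefficients as rounded in the summary(mlm) printout above
b <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
       TotExp = 7.233e-05, mDExp = -6.026e-03)

# Inputs from the question
prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
contrib <- b * c(1, prop_md, tot_exp, prop_md * tot_exp)
print(round(contrib, 3))

# The sum differs slightly from predict() because the coefficients are rounded
forecast <- sum(contrib)
print(forecast)
```

The PropMD term alone adds about 44.9 years to the intercept of 62.77, while the TotExp and interaction terms contribute almost nothing at these inputs. The forecast is driven almost entirely by the PropMD slope applied to a very large PropMD value.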

hist(resid(mlm))

Looking at the above, the residuals from our multiple regression model appear to be skewed left, so they are not normally distributed. We can also see a pattern in the plot of residuals against fitted values below, which would violate our assumption of homoscedasticity (constant residual variance). This could result in a model that does not produce an ideal fit. In part, the data we’re working with here were not transformed as in step 2, resulting in a poorer fit from our linear model.

plot(fitted(mlm), resid(mlm))

We can also call plot on the model object, which dispatches to the plot.lm function, in order to generate the standard diagnostic graphs with a single function call

plot(mlm)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced


Andrew Bowen

2023-04-09
