Regression Analysis in R

Assignment instructions

Follow a series of instructions on WHO data from 2008, provided with the assignment.

Load Data

I uploaded the assignment data to GitHub. The following variables and descriptions were provided with the assignment instructions:

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures

datalocation <- 'https://raw.githubusercontent.com/pkofy/DATA605/main/HomeworkWK12/who.csv'
df <- read.csv(file = datalocation)

Part 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

# Run linear regression
lm <- lm(LifeExp ~ TotExp, data=df)
lm

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Coefficients:
## (Intercept)       TotExp  
##   6.475e+01    6.297e-05

# Provide scatter plot
plot(df$TotExp, df$LifeExp, main = "Scatter plot of LifeExp~TotExp",
     xlab = "Total Expenditures", ylab = "Life Expectancy",
     pch = 19, frame = FALSE)
abline(lm, col = "blue")

# Call statistics for interpretation
summary(lm)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Interpret slrm statistics

The F-statistic, 65.26 on 1 and 188 DF, compares the current model to a model that has only the intercept parameter. It is most informative in multiple linear regression, where it tests all of the explanatory variables jointly. With a single explanatory variable it is equivalent to the t-test on the slope: the F-statistic is the square of the slope’s t value (8.079^2 is about 65.27) and the two p-values are the same (7.71e-14). Since that p-value is far below 0.05, we reject the hypothesis that the slope is zero.
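The link between the F-statistic and the slope’s t value can be checked by hand. The sketch below uses only numbers copied from the printed summary; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(lm) output above
t_slope    <- 8.079   # t value for TotExp
f_reported <- 65.26   # F-statistic on 1 and 188 DF

# With one predictor, F equals the squared t value (up to rounding)
f_from_t <- t_slope^2

# Upper-tail p-value of the F distribution with 1 and 188 DF
p_from_f <- pf(f_reported, df1 = 1, df2 = 188, lower.tail = FALSE)

print(f_from_t)
print(p_from_f)
```

The squared t value agrees with the reported F-statistic to within rounding, and the F-distribution tail probability should land on the same order of magnitude as the p-value printed in the summary.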

The Multiple R-squared statistic, 0.2577, indicates that 25.77% of the variation in Life Expectancy is explained by the explanatory variable, Total Expenditures. The Adjusted R-squared, 0.2537, is slightly smaller because it applies a penalty for the number of explanatory variables; with only one, the penalty is small.
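Both R-squared values can be recovered from the F-statistic and its degrees of freedom, which makes the size of the adjustment concrete. This is a sketch using the standard identities and the numbers printed in the summary; the variable names are illustrative.

```r
# Values copied from the summary(lm) output above
f_stat   <- 65.26
df_model <- 1      # number of explanatory variables
df_resid <- 188    # residual degrees of freedom

# R^2 from F: F = (R^2 / df_model) / ((1 - R^2) / df_resid)
r2 <- (f_stat * df_model) / (f_stat * df_model + df_resid)

# Adjusted R^2 rescales the unexplained share by (n - 1) / df_resid
n      <- df_model + df_resid + 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```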

The Residual standard error, 9.371 on 188 DF, estimates the standard deviation of the residuals. A larger number indicates the model doesn’t come as close to the data points. Here a typical prediction misses by about 9.4 years of life expectancy, which seems relatively high, but we should compare to a second model.
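For reference, the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below shows this on R’s built-in `cars` data set, used here only as a stand-in so the block runs without the WHO file.

```r
# Stand-in data: the built-in cars data set, not the WHO data
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares and residual degrees of freedom
rss      <- sum(residuals(fit)^2)
df_resid <- df.residual(fit)

# Residual standard error computed by hand
rse_manual <- sqrt(rss / df_resid)

# The same quantity as reported by summary()
rse_reported <- summary(fit)$sigma

print(c(manual = rse_manual, reported = rse_reported))
```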

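The p-values for the intercept and the slope are both very low (< 2e-16 and 7.71e-14). Total Expenditures is clearly related to Life Expectancy, although a low p-value alone does not show that the relationship is linear.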

Does the Model Satisfy?

Based on the four assumptions for simple linear regression, the model does not satisfy them:
1) Is there a linear relationship? No, look at the scatter plot. The relationship isn’t linear, but the data clearly follow a curve, so we can try a non-linear transformation to see if we can achieve linearity.
2) Are the residuals independent? We can’t tell from what we’ve looked at so far.
3) Do the residuals have constant variance across the range of the independent variable? This property is called homoskedasticity, and it doesn’t hold here: the graph shows the spread around the regression line changing depending on where you are on the x-axis.
4) Are the residuals normally distributed? We haven’t plotted them, but the residual summary is lopsided (median 3.154, minimum -24.764, maximum 13.292), and in the graph the points aren’t scattered symmetrically on either side of the regression line.
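Assumptions 3 and 4 can be checked numerically rather than by eye. The sketch below defines a hypothetical helper, `check_residuals()`, and runs it on simulated data with a curved relationship (not the WHO data). The Spearman correlation between absolute residuals and fitted values is only a rough heuristic for non-constant variance, not a formal test.

```r
# Hypothetical helper: two quick numeric residual checks for a fitted lm
check_residuals <- function(fit) {
  res <- residuals(fit)
  c(
    # Shapiro-Wilk test: a small p-value suggests non-normal residuals
    shapiro_p = shapiro.test(res)$p.value,
    # Rank correlation of |residuals| with fitted values:
    # values far from 0 hint that the spread changes along the fit
    spread_trend = cor(abs(res), fitted(fit), method = "spearman")
  )
}

# Simulated stand-in data: y depends on log(x), but we fit a straight line
set.seed(605)
x <- runif(190, min = 10, max = 5000)
y <- 40 + 5 * log(x) + rnorm(190, sd = 3)
sim_fit <- lm(y ~ x)

diag_out <- check_residuals(sim_fit)
print(diag_out)
```

On the models in this report the same helper could be called on the fitted objects directly, and `plot(fit, which = 1:2)` on any fitted model draws the usual residuals-vs-fitted and normal Q-Q plots for a visual check.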

Part 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

df$LifeExpTrans <- df$LifeExp^4.6
df$TotExpTrans <- df$TotExp^0.06

# Run linear regression
lmTrans <- lm(LifeExpTrans ~ TotExpTrans, data=df)
lmTrans

## 
## Call:
## lm(formula = LifeExpTrans ~ TotExpTrans, data = df)
## 
## Coefficients:
## (Intercept)  TotExpTrans  
##  -736527909    620060216

# Provide scatter plot
plot(df$TotExpTrans, df$LifeExpTrans, main = "Scatter plot of LifeExp^4.6~TotExp^0.06",
     xlab = "Total Expenditures^0.06", ylab = "Life Expectancy^4.6",
     pch = 19, frame = FALSE)
abline(lmTrans, col = "blue")

# Call statistics for interpretation
summary(lmTrans)

## 
## Call:
## lm(formula = LifeExpTrans ~ TotExpTrans, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExpTrans  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Interpret slrm2 statistics

The F-statistic is 507.7 on 1 and 188 DF, compared to 65.26 for the first model. With a single explanatory variable it is the square of the slope’s t value (22.53^2 is about 507.6), and its p-value (< 2.2e-16) is far below 0.05, so we reject the hypothesis that the slope is zero.

The Multiple R-squared statistic is 0.7298 compared to 0.2577. This indicates that 72.98% of the variation in Life Expectancy^4.6 is explained by the explanatory variable, Total Expenditures^0.06. The fit is much better than in the first model.

The Residual standard error is 90490000 on 188 DF, compared to 9.371 for the first model. It estimates the standard deviation of the residuals, and a larger number indicates the model doesn’t come as close to the data points; however, the response here is Life Expectancy^4.6 rather than years, so the two values are on different scales and can’t be compared directly.

The p-values are all very low (< 2e-16). Clearly Total Expenditures^0.06 is related to Life Expectancy^4.6.

Is the Second Model Better?

Yes, the second model is better. The R-squared statistic is about 73% instead of about 26%. The graph shows a roughly linear relationship. We didn’t do residual analysis, so we can only presume the residuals are closer to normal and unbiased. Let’s see what we can do looking at additional variables.

Part 3

Using the results from Part 2, forecast life expectancy when TotExp^.06=1.5. Then forecast life expectancy when TotExp^.06=2.5.

# When 1.5, life expectancy is 63.31153
new <- data.frame(TotExpTrans=c(1.5))
predict(lmTrans, newdata = new)^(1/4.6)

##        1 
## 63.31153

# When 2.5, life expectancy is 86.50645
new <- data.frame(TotExpTrans=c(2.5))
predict(lmTrans, newdata = new)^(1/4.6)

##        1 
## 86.50645
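The two forecasts can be reproduced by hand from the printed coefficients, which also makes the back-transformation explicit. This is a sketch with the coefficients copied from the summary output above; `forecast_life_exp()` is an illustrative helper, not part of the original analysis.

```r
# Coefficients copied from the summary(lmTrans) output above
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on TotExp^0.06

# Predict LifeExp^4.6, then undo the 4.6 power to get years
forecast_life_exp <- function(tot_exp_trans) {
  (b0 + b1 * tot_exp_trans)^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
print(forecasts)

# Raw TotExp implied by each transformed input: undo the 0.06 power
implied_tot_exp <- c(1.5, 2.5)^(1 / 0.06)
print(implied_tot_exp)
```

Undoing the 0.06 power shows what the inputs mean in dollars: TotExp^.06 = 1.5 corresponds to a TotExp of roughly 861, and TotExp^.06 = 2.5 to roughly 4.3 million.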

Part 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

# Create new variable
df$PropMDxTotExp <- df$PropMD * df$TotExp

# Build the regression model
mlr <- lm(LifeExp ~ PropMD+TotExp+PropMDxTotExp, data=df)
mlr

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMDxTotExp, data = df)
## 
## Coefficients:
##   (Intercept)         PropMD         TotExp  PropMDxTotExp  
##     6.277e+01      1.497e+03      7.233e-05     -6.026e-03

# Call statistics for interpretation
summary(mlr)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMDxTotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMDxTotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
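An alternative to building the product column by hand is R’s formula interface, where `a * b` expands to `a + b + a:b`. The sketch below uses simulated data with made-up column names (`a`, `b`, `y`) to show that both approaches give the same coefficients and predictions; it is not a rerun of the WHO model.

```r
# Simulated stand-in data with an interaction effect (not the WHO data)
set.seed(605)
sim   <- data.frame(a = runif(100), b = runif(100, min = 0, max = 50))
sim$y <- 60 + 10 * sim$a + 0.2 * sim$b - 0.1 * sim$a * sim$b + rnorm(100)

# Approach 1: create the product column manually
sim$axb    <- sim$a * sim$b
fit_manual <- lm(y ~ a + b + axb, data = sim)

# Approach 2: let the formula build the interaction term
fit_formula <- lm(y ~ a * b, data = sim)

# Same estimates, only the name of the last coefficient differs
same_coefs <- all.equal(unname(coef(fit_manual)), unname(coef(fit_formula)))
print(same_coefs)

# With the formula version, predict() needs only the raw inputs
new_point    <- data.frame(a = 0.5, b = 20)
pred_formula <- predict(fit_formula, newdata = new_point)
print(pred_formula)
```

The practical benefit of the formula version is at prediction time: the new data frame needs only the raw variables, so there is no product column to forget.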

Interpret mlrm statistics

The F-statistic is 34.49 on 3 and 186 DF. It compares the model to an intercept-only model and tests the three explanatory variables jointly. Its p-value is almost zero (< 2.2e-16), so at least one of the explanatory variables is related to Life Expectancy.

The Multiple R-squared statistic is 0.3574. This indicates that 35.74% of the variation in Life Expectancy is explained by the explanatory variables. The Adjusted R-squared, 0.3471, is lowered by more than we saw in the first two models, which had only one independent variable. We should rerun this model with the Transformed Life Expectancy and Transformed Total Expenditures.

The Residual standard error is 8.765 on 186 DF, lower than the 9.371 of the first model, so the standard deviation of the residuals is slightly smaller. It can’t be compared with the second model, whose response is on a different scale.

The p-values are all very low. Each of the three terms, including the interaction, has a statistically significant relation to the dependent variable.

How good is the model?

The model is not great! The variance explained is low at about 36%.

I would have started with all of the variables and removed the ones that don’t seem to be related, i.e. those with high p-values for their coefficients. I would also try the transformed life expectancy and transformed total expenditures, or add polynomial terms to find out which transformations have the most predictive effect.

Part 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

These are unrealistic inputs and an unrealistic forecast.

Even in the best of circumstances only an outlier would live to 107. Going from one doctor per 1,000 people to one doctor per 33 people isn’t going to dramatically increase life expectancy (unless they are all research doctors!). It’s also preposterous that only $14 a year would be spent personally and by the government on health care in a country with that many doctors and such a long life expectancy. The forecast comes almost entirely from the PropMD term: 1.497e+03 x 0.03 adds about 45 years to the intercept of 62.77, while the TotExp and interaction terms contribute almost nothing. Extending a straight-line effect of PropMD to such an extreme combination of inputs is not a trustworthy extrapolation.

# With the given inputs, life expectancy is predicted at 107.696
new <- data.frame(PropMD=0.03,TotExp=c(14))
new$PropMDxTotExp <- new$PropMD * new$TotExp
predict(mlr, newdata = new)

##       1 
## 107.696
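To see where the 107.7 comes from, the forecast can be split into one contribution per model term. This sketch uses the rounded coefficients printed in the summary above, so the total differs slightly from the `predict()` result; the variable names are illustrative.

```r
# Coefficients copied (rounded) from the summary(mlr) output above
coefs <- c(intercept   = 6.277e+01,
           prop_md     = 1.497e+03,
           tot_exp     = 7.233e-05,
           interaction = -6.026e-03)

# Inputs from the assignment
prop_md <- 0.03
tot_exp <- 14

# Each term's contribution to the forecast, in years
contributions <- coefs * c(1, prop_md, tot_exp, prop_md * tot_exp)
total <- sum(contributions)

print(round(contributions, 4))
print(total)
```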