CUNY MSDS DATA 605 HW 12 - Regression

Data Introduction

Libraries

library(tidyverse)
library(olsrr)
library(car)
library(gvlma)
options(scipen=8) #removes scinotation 

data <- read.csv("https://raw.githubusercontent.com/nschettini/Other-projects/master/who.csv", header = T)

The purpose of this assignment is to predict the life expectancy for a county in years, using regression.

Variables

Country: name of the country

LifeExp: average life expectancy for the country in years

InfantSurvival: proportion of those surviving to one year or more

Under5Survival: proportion of those surviving to five years or more

TBFree: proportion of the population without TB.

PropMD: proportion of the population who are MDs

PropRN: proportion of the population who are RNs

PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate

GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.

Model 1 - LifeExp ~ TotExp

  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
model1 <- lm(LifeExp ~TotExp  , data)
intercept <- coef(model1)[1]
slope <- coef(model1)[2]

ggplot(model1, aes(TotExp, LifeExp))+ 
  geom_point() + 
  geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)

summary(model1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                 Estimate   Std. Error t value Pr(>|t|)    
## (Intercept) 64.753374534  0.753536611  85.933  < 2e-16 ***
## TotExp       0.000062970  0.000007795   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Model 1 Summary

The \(R^2\) from this model accounts for 0.2537 of the variability of the data, which means that only 25% of the variance in the response variable can be explained by the independent variable.

Both the y-intercept and TotExp’s p-values are small (near zero), meaning that the probability of observation these relationships due to chance is small.

The Standard error is 9.371.

The linear model is expressed as \(life exp = 64.75 + 0.00006*x\)

The F-statistic and p-value indicate that we would reject the null hypothesis (\(H_0\)), that there isn’t a relationship between the variables.

Model 1 - Evaulation and Residual Analysis

ols_plot_resid_qq(model1)

ols_plot_resid_hist(model1)

ols_plot_resid_fit(model1)

crPlots(model1)

The residual analysis show that the assumptions of linear regression are not met.

Linearity - there does not appear to be a linear relationship

Normality - According to the histogram and Q-Q plot, the residuals are not normally distributed.

Homoscedasticity -

ncvTest(model1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.599177, Df = 1, p = 0.10692

The p-value is greater than .05 - fail to reject \(H_0\)

Independence -

durbinWatsonTest(model1)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.06872515      1.802451   0.176
##  Alternative hypothesis: rho != 0
gvlma(model1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Coefficients:
## (Intercept)       TotExp  
## 64.75337453   0.00006297  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model1) 
## 
##                        Value          p-value                   Decision
## Global Stat        56.737011 0.00000000001405 Assumptions NOT satisfied!
## Skewness           30.532757 0.00000003282766 Assumptions NOT satisfied!
## Kurtosis            0.002804 0.95777263030755    Assumptions acceptable.
## Link Function      26.074703 0.00000032845930 Assumptions NOT satisfied!
## Heteroscedasticity  0.126747 0.72182921484679    Assumptions acceptable.

Model 2 - Transformations

2.Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

LifeExpP <- data$LifeExp^4.6
TotExpP <- data$TotExp^.06

model2 <- lm(LifeExpP~ TotExpP, data)
summary(model2)
## 
## Call:
## lm(formula = LifeExpP ~ TotExpP, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExpP      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Summary

The \(R^2\) from this model accounts for 0.72283 of the variability of the data, which means that 72% of the variance in the response variable can be explained by the independent variable.

Both the y-intercept and TotExp’s p-values are small (near zero), meaning that the probability of observation these relationships due to chance is small.

The Standard error is 53.6 years

This mode is a vast improvement on the base model from section 1. This model outperforms the previous one.

90490000^(1/4.6)
## [1] 53.66557

The linear model is expressed as \(life exp = 64.75 + 0.00006*x\)

The F-statistic and p-value indicate that we would reject the null hypothesis (\(H_0\)), that there isn’t a relationship between the variables.

intercept <- coef(model2)[1]
slope <- coef(model2)[2]

ggplot(model2, aes(TotExpP, LifeExpP))+ 
  geom_point() + 
  geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)

Residual Analysis

ols_plot_resid_qq(model2)

ols_plot_resid_hist(model2)

ols_plot_resid_fit(model2)

crPlots(model2)

The CRPlot shows that there is now a linear relationship between the variables. While the data is more normalized than previously, there is still a left skew as shown by the histogram and Q-Q plot.

Predictions

3.Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

predictdata <- data.frame(TotExpP=c(1.5,2.5))

predict(model2, predictdata,interval="predict")^(1/4.6)
##        fit      lwr      upr
## 1 63.31153 35.93545 73.00793
## 2 86.50645 81.80643 90.43414

Model 3

  1. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

model3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data)

Evaulation

summary(model3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                     Estimate     Std. Error t value Pr(>|t|)    
## (Intercept)     62.772703255    0.795605238  78.899  < 2e-16 ***
## PropMD        1497.493952519  278.816879652   5.371 2.32e-07 ***
## TotExp           0.000072333    0.000008982   8.053 9.39e-14 ***
## PropMD:TotExp   -0.006025686    0.001472357  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The adj \(R^2\) accounts for 0.3471 of the variability of the data, which means that only 34% of the variance in the response variable can be explained by the independent variable.

Both the y-intercept and other variables p-value or low (near zero), meaning that the probability of observation these relationships due to chance is small.

The F-statistic and p-value indicate that we would reject the null hypothesis (\(H_0\)), that there isn’t a relationship between the variables.

Residual Analysis

ols_plot_resid_qq(model3)

ols_plot_resid_hist(model3)

ols_plot_resid_fit(model3)

The data does not resemble a normal distribution, as shown in the histogram (left skew) and the Q-Q pllots. The residuals do not appear to be centered around 0 from the residual plot.

Predictions

  1. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
predictdata2 <- data.frame(PropMD=0.03, TotExp=14)

predict(model3, predictdata2,interval="predict")
##       fit      lwr      upr
## 1 107.696 84.24791 131.1441

The predicted range of vaulues appear to be too high. The max value shows to be around 83. The data is predicting a much higher fit and CI.

Nicholas Schettini

11/18/2018