CUNY MSDS DATA 605 HW 12 - Regression
Data Introduction
Libraries
library(tidyverse)
library(olsrr)
library(car)
library(gvlma)
options(scipen=8) #removes scinotation
data <- read.csv("https://raw.githubusercontent.com/nschettini/Other-projects/master/who.csv", header = T)
The purpose of this assignment is to predict the life expectancy for a county in years, using regression.
Variables
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.
Model 1 - LifeExp ~ TotExp
- Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
model1 <- lm(LifeExp ~TotExp , data)
intercept <- coef(model1)[1]
slope <- coef(model1)[2]
ggplot(model1, aes(TotExp, LifeExp))+
geom_point() +
geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)
summary(model1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.753374534 0.753536611 85.933 < 2e-16 ***
## TotExp 0.000062970 0.000007795 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Model 1 Summary
The \(R^2\) from this model accounts for 0.2537 of the variability of the data, which means that only 25% of the variance in the response variable can be explained by the independent variable.
Both the y-intercept and TotExp’s p-values are small (near zero), meaning that the probability of observation these relationships due to chance is small.
The Standard error is 9.371.
The linear model is expressed as \(life exp = 64.75 + 0.00006*x\)
The F-statistic and p-value indicate that we would reject the null hypothesis (\(H_0\)), that there isn’t a relationship between the variables.
Model 1 - Evaulation and Residual Analysis
ols_plot_resid_qq(model1)
ols_plot_resid_hist(model1)
ols_plot_resid_fit(model1)
crPlots(model1)
The residual analysis show that the assumptions of linear regression are not met.
Linearity - there does not appear to be a linear relationship
Normality - According to the histogram and Q-Q plot, the residuals are not normally distributed.
Homoscedasticity -
ncvTest(model1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.599177, Df = 1, p = 0.10692
The p-value is greater than .05 - fail to reject \(H_0\)
Independence -
durbinWatsonTest(model1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.06872515 1.802451 0.176
## Alternative hypothesis: rho != 0
gvlma(model1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Coefficients:
## (Intercept) TotExp
## 64.75337453 0.00006297
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = model1)
##
## Value p-value Decision
## Global Stat 56.737011 0.00000000001405 Assumptions NOT satisfied!
## Skewness 30.532757 0.00000003282766 Assumptions NOT satisfied!
## Kurtosis 0.002804 0.95777263030755 Assumptions acceptable.
## Link Function 26.074703 0.00000032845930 Assumptions NOT satisfied!
## Heteroscedasticity 0.126747 0.72182921484679 Assumptions acceptable.
Model 2 - Transformations
2.Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
LifeExpP <- data$LifeExp^4.6
TotExpP <- data$TotExp^.06
model2 <- lm(LifeExpP~ TotExpP, data)
summary(model2)
##
## Call:
## lm(formula = LifeExpP ~ TotExpP, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExpP 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Summary
The \(R^2\) from this model accounts for 0.72283 of the variability of the data, which means that 72% of the variance in the response variable can be explained by the independent variable.
Both the y-intercept and TotExp’s p-values are small (near zero), meaning that the probability of observation these relationships due to chance is small.
The Standard error is 53.6 years
This mode is a vast improvement on the base model from section 1. This model outperforms the previous one.
90490000^(1/4.6)
## [1] 53.66557
The linear model is expressed as \(life exp = 64.75 + 0.00006*x\)
The F-statistic and p-value indicate that we would reject the null hypothesis (\(H_0\)), that there isn’t a relationship between the variables.
intercept <- coef(model2)[1]
slope <- coef(model2)[2]
ggplot(model2, aes(TotExpP, LifeExpP))+
geom_point() +
geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)
Residual Analysis
ols_plot_resid_qq(model2)
ols_plot_resid_hist(model2)
ols_plot_resid_fit(model2)
crPlots(model2)
The CRPlot shows that there is now a linear relationship between the variables. While the data is more normalized than previously, there is still a left skew as shown by the histogram and Q-Q plot.
Predictions
3.Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
predictdata <- data.frame(TotExpP=c(1.5,2.5))
predict(model2, predictdata,interval="predict")^(1/4.6)
## fit lwr upr
## 1 63.31153 35.93545 73.00793
## 2 86.50645 81.80643 90.43414
Model 3
- Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
model3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data)
Evaulation
summary(model3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.772703255 0.795605238 78.899 < 2e-16 ***
## PropMD 1497.493952519 278.816879652 5.371 2.32e-07 ***
## TotExp 0.000072333 0.000008982 8.053 9.39e-14 ***
## PropMD:TotExp -0.006025686 0.001472357 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The adj \(R^2\) accounts for 0.3471 of the variability of the data, which means that only 34% of the variance in the response variable can be explained by the independent variable.
Both the y-intercept and other variables p-value or low (near zero), meaning that the probability of observation these relationships due to chance is small.
The F-statistic and p-value indicate that we would reject the null hypothesis (\(H_0\)), that there isn’t a relationship between the variables.
Residual Analysis
ols_plot_resid_qq(model3)
ols_plot_resid_hist(model3)
ols_plot_resid_fit(model3)
The data does not resemble a normal distribution, as shown in the histogram (left skew) and the Q-Q pllots. The residuals do not appear to be centered around 0 from the residual plot.
Predictions
- Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
predictdata2 <- data.frame(PropMD=0.03, TotExp=14)
predict(model3, predictdata2,interval="predict")
## fit lwr upr
## 1 107.696 84.24791 131.1441
The predicted range of vaulues appear to be too high. The max value shows to be around 83. The data is predicting a much higher fit and CI.