The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Data explaination:
* Country: name of the country
* LifeExp: average life expectancy for the country in years
* InfantSurvival: proportion of those surviving to one year or more
* Under5Survival: proportion of those surviving to five years or more
* TBFree: proportion of the population without TB.
* PropMD: proportion of the population who are MDs
* PropRN: proportion of the population who are RNs
* PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
* GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
* TotExp: sum of personal and government expenditures.
who = read_csv("./who.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
## Country = col_character(),
## LifeExp = col_double(),
## InfantSurvival = col_double(),
## Under5Survival = col_double(),
## TBFree = col_double(),
## PropMD = col_double(),
## PropRN = col_double(),
## PersExp = col_double(),
## GovtExp = col_double(),
## TotExp = col_double()
## )
head(who)
## # A tibble: 6 x 10
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan~ 42 0.835 0.743 0.998 2.29e-4 5.72e-4
## 2 Albania 71 0.985 0.983 1.000 1.14e-3 4.61e-3
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3
## 4 Andorra 82 0.997 0.996 1.000 3.30e-3 3.50e-3
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3
## 6 Antigu~ 73 0.99 0.989 1.000 1.43e-4 2.77e-3
## # ... with 3 more variables: PersExp <dbl>, GovtExp <dbl>, TotExp <dbl>
options("scipen"=100, "digits"=4)
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, \({ R }^{ 2 }\), standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
who%>%
ggplot(aes(LifeExp,TotExp))+
geom_point(size = 4, alpha = .4)+
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
who_linear <- lm(LifeExp ~ TotExp, data = who)
summary(who_linear)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.76 -4.78 3.15 7.12 13.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.75337453 0.75353661 85.93 < 0.0000000000000002 ***
## TotExp 0.00006297 0.00000779 8.08 0.000000000000077 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.37 on 188 degrees of freedom
## Multiple R-squared: 0.258, Adjusted R-squared: 0.254
## F-statistic: 65.3 on 1 and 188 DF, p-value: 0.0000000000000771
\[LifeExp=64.7533+0.00006297*TotExp\] At this time, this is a good signigicant predictor of evulation since the p-value is near 0 (less than .05). However looking at both the \({ R }^{ 2 }\) figures it fits only ~25% of the data.
qqnorm(who_linear$residuals)
qqline(who_linear$residuals)
The Q-Q plot shows the data is not uniformly scattered and has a definate curve. This matches the initial scatter plot.
I can not say clearly at this time the assumptions of linear regression are met.
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, \({ R }^{ 2 }\), standard error, and p-values. Which model is “better?”
who<-who%>%
mutate(LifeExp2 = LifeExp^4.6)%>%
mutate(TotExp2 = TotExp^.06)
who_linear2 <- lm(LifeExp2 ~ TotExp2, data = who)
summary(who_linear2)
##
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.7 <0.0000000000000002 ***
## TotExp2 620060216 27518940 22.5 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90500000 on 188 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.728
## F-statistic: 508 on 1 and 188 DF, p-value: <0.0000000000000002
\[Previous\] \[LifeExp=64.7533+0.00006297*TotExp\] \[Updated\] \[LifeExp=-736527910+620060216*TotExp\]
While there is no major change to the p-value (it remains signifiant and under 0.05), there is a large change in the \({ R }^{ 2 }\) figures. Previous model fit ~25% of the data, the updated model now fits ~73% of the data.
qqnorm(who_linear2$residuals)
qqline(who_linear2$residuals)
There is now no longer as much of a curve to the top of the data as it starts fo fall in line with the residual line. There is no change to the bottom of the data, which is to be expected.
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
TotExp <- 1.5
TotExp2 <- 2.5
raiseexp <- -736527910+620060216*TotExp
raiseexp2 <- -736527910+620060216*TotExp2
(Lifeexp1 <- raiseexp^(1/4.6))
## [1] 63.31
(Lifeexp2 <- raiseexp2^(1/4.6))
## [1] 86.51
Build the following multiple regression model and interpret the F Statistics, R^2, standard error,and p-values. How good is the model?
\[LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp\]
who_linear3 <- lm(data = who, LifeExp ~ PropMD + TotExp + PropMD*TotExp)
summary(who_linear3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.32 -4.13 2.10 6.54 13.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.77270326 0.79560524 78.90 < 0.0000000000000002 ***
## PropMD 1497.49395252 278.81687965 5.37 0.000000232060277 ***
## TotExp 0.00007233 0.00000898 8.05 0.000000000000094 ***
## PropMD:TotExp -0.00602569 0.00147236 -4.09 0.000063527329494 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.77 on 186 degrees of freedom
## Multiple R-squared: 0.357, Adjusted R-squared: 0.347
## F-statistic: 34.5 on 3 and 186 DF, p-value: <0.0000000000000002
\[LifeExp = 62.7727+0.00007233 TotExp + 1497.4939 PropMD + -0.006025 * TotExp*PropMD\]
All T-Values do show statistically significant predictor values to the evaluation score.
The p-value is also near 0 (under 0.05) show statistically significant predictor values to the evaluation score.
The \({ R }^{ 2 }\) figures are ~38% and ~35% which is low varibility of the model.
The Residual Standard Error shows 8.77 w/ 186 degrees of freedom making it significant. We would have to take in to account the datapoints will be on average 8.77 off from the model. This is not off from the first model.
TotExp = 14
PropMD=.03
LifeExp = 62.7727+0.00007233*TotExp + 1497.4939*PropMD + -0.006025*TotExp*PropMD
LifeExp
## [1] 107.7
107 does not seem too far off from the model forecast, however, it does present a small bit of discussion. 3% of your population is Doctors and your total expendatures is only 14 billion does not seem reasonable. To have confidence in the model, I should be able to plug in the data from an actual country and get near enough to the same LifeExp number listed.
Pulling the numbers from UK
TotExp = 243184
PropMD= 0.0022085
LifeExp_mod = 62.7727+0.00007233*TotExp + 1497.4939*PropMD + -0.006025*TotExp*PropMD
LifeExp_mod
## [1] 80.43
LifeExpUK <- who%>%
filter(Country == "United Kingdom")%>%
mutate(LifeExpUK = LifeExp)%>%
mutate(LifeExp_Model = .GlobalEnv$LifeExp_mod)%>%
select(LifeExpUK, LifeExp_Model)
LifeExpUK
## # A tibble: 1 x 2
## LifeExpUK LifeExp_Model
## <dbl> <dbl>
## 1 79 80.4
This is within the degrees of freedom already identified, I would accept this model forecast.