The attached who.csv dataset contains real-world data from 2008. The
variables included follow.
- Country: name of the country
- LifeExp: average life expectancy for the country in years
- InfantSurvival: proportion of those surviving to one year or
more
- Under5Survival: proportion of those surviving to five years or
more
- TBFree: proportion of the population without TB.
- PropMD: proportion of the population who are MDs
- PropRN: proportion of the population who are RNs
- PersExp: mean personal expenditures on healthcare in US dollars at
average exchange rate
- GovtExp: mean government expenditures per capita on healthcare, US
dollars at average exchange rate
- TotExp: sum of personal and government expenditures.
who = read.csv("who.csv")
print(head(who))
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD
## 1 Afghanistan 42 0.835 0.743 0.99769 0.000228841
## 2 Albania 71 0.985 0.983 0.99974 0.001143127
## 3 Algeria 71 0.967 0.962 0.99944 0.001060478
## 4 Andorra 82 0.997 0.996 0.99983 0.003297297
## 5 Angola 41 0.846 0.740 0.99656 0.000070400
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991 0.000142857
## PropRN PersExp GovtExp TotExp
## 1 0.000572294 20 92 112
## 2 0.004614439 169 3128 3297
## 3 0.002091362 108 5184 5292
## 4 0.003500000 2589 169725 172314
## 5 0.001146162 36 1620 1656
## 6 0.002773810 503 12543 13046
#scatterplot
plot(who$TotExp, who$LifeExp, xlab='TotExp', ylab='LifeExp', main='Difference in Exps (LifeExp~TotExp)')
#linear regression
who_lm <- lm(LifeExp ~ TotExp, data = who)
summary(who_lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Looking at the summary of the simple linear regression, we can
see:
F statistic: 65.26 on 1 and 188 DF
R^2: 0.2577 (adjusted R^2: 0.2537)
Residual standard error: 9.371 on 188 degrees of freedom
p-value: 7.714e-14
The large F statistic and the very low p-value indicate that TotExp is a statistically significant predictor of LifeExp. However, the R^2 of ~0.26 means the model explains only about 26% of the variance in life expectancy. In addition, the residual standard error of 9.371 means a typical prediction is off by about 9 years, which is large for life expectancy, so this model is unreliable.
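As a quick consistency check on the summary, R^2 can be recovered from the F statistic and its degrees of freedom, since F = (R^2 / df1) / ((1 - R^2) / df2). The sketch below uses only the numbers printed in the summary; the helper name r2_from_f is made up for illustration and is not part of base R.

```r
# Recover R^2 from an F statistic and its degrees of freedom.
# r2_from_f is a hypothetical helper, not part of base R.
r2_from_f <- function(f, df1, df2) {
  (f * df1) / (f * df1 + df2)
}

# Values copied from the summary(who_lm) output
r2 <- r2_from_f(f = 65.26, df1 = 1, df2 = 188)
print(round(r2, 4))
```

The result agrees with the reported Multiple R-squared of 0.2577, so the F statistic and R^2 are two views of the same fit rather than independent pieces of evidence.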
The assumptions of simple linear regression are:
- Linearity: X and Y have a linear relationship
- Homoscedasticity: the residuals have constant variance across all values of X
- Independence: all observations are independent of each other
- Normality: the residuals are normally distributed
If any of these is not met, the inferences drawn from the linear model (p-values, standard errors) are not reliable.
Given the scatterplot above, we can already see that the model does not meet these assumptions, because the relationship between x and y is not linear. As further evidence, the histogram below shows that the residuals are skewed rather than symmetric (the summary above shows a long left tail, with a minimum of -24.764 against a maximum of 13.292), so the normality assumption is not met either.
hist(who_lm$residuals, xlab='Residuals', main='Histogram of residuals/Normality check')
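A histogram only speaks to normality. A minimal sketch of a fuller check follows, using R's built-in diagnostic plots for lm objects plus a Shapiro-Wilk test. The data here are synthetic and deliberately curved, purely for illustration (they are not the WHO data), and check_residuals is a hypothetical helper name.

```r
# Synthetic, deliberately curved data for illustration only (not the WHO data)
set.seed(42)
x <- runif(200, min = 1, max = 100)
y <- 40 + 8 * log(x) + rnorm(200, sd = 2)
fit <- lm(y ~ x)  # straight-line fit to a curved relationship

# check_residuals is a hypothetical helper name
check_residuals <- function(model) {
  plot(model, which = 1)          # residuals vs fitted: linearity, constant variance
  plot(model, which = 2)          # normal Q-Q plot: normality of residuals
  shapiro.test(residuals(model))  # formal normality test, returned to the caller
}

result <- check_residuals(fit)
print(result$p.value)
```

With the model from this document, the equivalent call would be check_residuals(who_lm).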
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, R^2, standard error, and p-values. Which model is “better?”
who$LifeExp_raised <- who$LifeExp^4.6
who$TotExp_raised <- who$TotExp^0.06
#scatterplot
plot(who$TotExp_raised, who$LifeExp_raised, xlab='TotExp^0.06', ylab='LifeExp^4.6', main='Difference in Exps (LifeExp^4.6~TotExp^0.06)')
#linear regression
who_lm_raised <- lm(LifeExp_raised ~ TotExp_raised, data = who)
summary(who_lm_raised)
##
## Call:
## lm(formula = LifeExp_raised ~ TotExp_raised, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_raised 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Looking at the summary of this linear regression, we can
see:
F statistic: 507.7 on 1 and 188 DF
R^2: 0.7298 (adjusted R^2: 0.7283)
Residual standard error: 90490000 on 188 degrees of freedom
p-value: < 2.2e-16
Almost everything indicates that this is a better fit. The plot looks more linear, the R^2 is much higher (about 73% of the variance explained versus about 26%), and the p-value is still very low. The residual standard error looks far larger, but it is measured in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 of the first model. Overall, the transformed model is the better of the two.
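Because the residual standard error is reported in units of LifeExp^4.6, one rough way to put it back on the scale of years is a first-order approximation: an error of s in LifeExp^4.6 corresponds to about s / (4.6 * LifeExp^3.6) years near a given life expectancy. The sketch below assumes a reference life expectancy of 70 years, which is an illustrative choice and not a value taken from the analysis, so the result is only an approximation.

```r
# Approximate the transformed model's residual standard error in years.
# First-order approximation: d(LifeExp^4.6) = 4.6 * LifeExp^3.6 * d(LifeExp)
se_transformed <- 90490000  # residual standard error from summary(who_lm_raised)
ref_life_exp   <- 70        # assumed reference point (illustrative only)

se_years <- se_transformed / (4.6 * ref_life_exp^3.6)
print(se_years)
```

Near 70 years this works out to roughly 4 to 5 years, which is smaller than the first model's 9.371, although the conversion changes with the reference point chosen.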
To check the regression assumptions:
hist(who_lm_raised$residuals, xlab='Residuals', main='Histogram of residuals/Normality check')
qqnorm(who_lm_raised$residuals)
In the two plots, the residuals are much closer to normal (only slightly skewed), and the points on the Q-Q plot fall close to a straight line. The scatterplot of the transformed variables also looks roughly linear. Assuming the observations are independent (one row per country), the linearity and normality assumptions look reasonable for this model; the histogram and Q-Q plot alone do not test constant variance, which would need a residuals-versus-fitted plot. (The only argument would be about normality: whether the residuals are too skewed or not.)
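Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5.
Then forecast life expectancy when TotExp^.06 = 2.5.
Because the model predicts LifeExp^4.6, the forecast is back-transformed with the 1/4.6 power: LifeExp = (Intercept + (TotExp_raised * x))^(1/4.6), x being the TotExp^.06 value and TotExp_raised the slope coefficient.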
Intercept = -736527910
TotExp_raised = 620060216
i = 1/4.6
cat('When TotExp^.06 =1.5, the forecast life expectancy is: ', ((Intercept+(TotExp_raised*1.5))^i), '\n')
## When TotExp^.06 =1.5, the forecast life expectancy is: 63.31153
cat('When TotExp^.06 =2.5, the forecast life expectancy is: ', ((Intercept+(TotExp_raised*2.5))^i))
## When TotExp^.06 =2.5, the forecast life expectancy is: 86.50645
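To put the two inputs in context, the 0.06 power can be undone to see what level of spending each value of TotExp^.06 implies. The sketch below wraps the same back-transformation in a function and adds that conversion. The coefficients are copied from the summary output, and the function names forecast_life_exp and implied_tot_exp are hypothetical.

```r
# Coefficients copied from summary(who_lm_raised)
b0 <- -736527910
b1 <- 620060216

# forecast_life_exp is a hypothetical helper; x is TotExp^0.06
forecast_life_exp <- function(x) {
  (b0 + b1 * x)^(1 / 4.6)  # undo the 4.6 power on the response
}

# Undo the 0.06 power to get the implied spending in dollars
implied_tot_exp <- function(x) {
  x^(1 / 0.06)
}

x_values <- c(1.5, 2.5)
print(data.frame(
  TotExp_raised = x_values,
  TotExp        = implied_tot_exp(x_values),
  LifeExp       = forecast_life_exp(x_values)
))
```

The second input corresponds to spending in the millions of dollars, well above the largest TotExp in the rows printed at the top of this document (172314 for Andorra), so that forecast should be read with caution.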
Build the following multiple regression model and interpret the F statistic, R^2, standard error, and p-values. How good is the model?
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
LifeExp = PropMD + TotExp + PropMD x TotExp
LifeExp ~ PropMD + TotExp + PropMD * TotExp
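In R's formula syntax, the * operator already expands to both main effects plus their interaction, so the formula above and the shorter LifeExp ~ PropMD * TotExp should describe the same model. A small sketch to confirm this by comparing the term labels R generates (no data is needed for this check):

```r
# Term labels R generates for the formula as written in this document
labels_long <- attr(terms(LifeExp ~ PropMD + TotExp + (PropMD * TotExp)),
                    "term.labels")

# Term labels for the compact form
labels_short <- attr(terms(LifeExp ~ PropMD * TotExp), "term.labels")

print(labels_long)
print(labels_short)
print(identical(labels_long, labels_short))
```

Writing out the main effects explicitly is harmless; it just repeats terms that R would include anyway.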
who_lm_4 <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who)
summary(who_lm_4)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
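Looking at the summary of this linear regression, we can
see:
F statistic: 34.49 on 3 and 186 DF
R^2: 0.3574 (adjusted R^2: 0.3471)
Residual standard error: 8.765 on 186 degrees of freedom
p-value: < 2.2e-16
Given the low R^2, this is not a good model, and it is not as good as the transformed model above. The overall p-value is still very low, and all three terms are individually significant. The residual standard error (8.765) is on the original LifeExp scale, so it is comparable to the first model's 9.371 and only slightly lower; it is not comparable to the transformed model's 90490000.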
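Since this model has more predictors than the first one, adjusted R^2 is the fairer comparison between the two untransformed models. The sketch below recomputes it from the reported R^2 values using the standard formula; adj_r2 is a hypothetical helper name, and n = 190 is inferred from the residual degrees of freedom (188 plus 2 coefficients in the first model).

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p.
# adj_r2 is a hypothetical helper name.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 190  # inferred: 188 residual df + 2 coefficients in the first model

simple_adj      <- adj_r2(0.2577, n, p = 1)  # LifeExp ~ TotExp
interaction_adj <- adj_r2(0.3574, n, p = 3)  # LifeExp ~ PropMD * TotExp

print(c(simple = simple_adj, interaction = interaction_adj))
```

Both values match the summaries up to rounding, and the interaction model still comes out ahead of the simple model after the penalty for extra terms.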
To check the regression assumptions:
hist(who_lm_4$residuals, xlab='Residuals', main='Histogram of residuals/Normality check')
Based solely on the histogram, the residuals are heavily skewed (the summary shows a long left tail, with a minimum of -27.320 against a maximum of 13.074), so the normality assumption is not met and the inferences from this linear model are unreliable.
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast
seem realistic? Why or why not?
I am going to use the predict command: predict(linear_model, data.frame(new predictor values))
cat('When PropMD=.03 and TotExp = 14, the forecast life expectancy is: ', predict(who_lm_4, data.frame(PropMD = 0.03, TotExp = 14)))
## When PropMD=.03 and TotExp = 14, the forecast life expectancy is: 107.696
This forecast does NOT seem realistic because the predicted life expectancy is ~108, which is too high. After a quick search, I found that the top three life expectancies in the world (Hong Kong, Macao and Japan) are all around 85, so this forecast is well above the highest real values.
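To see where the ~108 comes from, the prediction can be rebuilt by hand from the coefficients. The sketch below uses the coefficients as printed (rounded) in the summary, so the total differs slightly from the predict() output.

```r
# Coefficients copied from summary(who_lm_4), rounded as printed
b <- c(intercept = 62.77, PropMD = 1497,
       TotExp = 7.233e-05, interaction = -6.026e-03)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the predicted life expectancy
contributions <- c(
  intercept   = b[["intercept"]],
  PropMD      = b[["PropMD"]] * prop_md,
  TotExp      = b[["TotExp"]] * tot_exp,
  interaction = b[["interaction"]] * prop_md * tot_exp
)

print(contributions)
print(sum(contributions))
```

The PropMD term alone adds about 45 years on top of the intercept, which is what pushes the forecast past 100. For comparison, PropMD = 0.03 is roughly ten times the largest value in the rows printed at the top of this document (0.0033 for Andorra).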