Data 605 HW12

Question 1

The attached who.csv dataset contains real-world data from 2008. The variables included follow.
- Country: name of the country
- LifeExp: average life expectancy for the country in years
- InfantSurvival: proportion of those surviving to one year or more
- Under5Survival: proportion of those surviving to five years or more
- TBFree: proportion of the population without TB.
- PropMD: proportion of the population who are MDs
- PropRN: proportion of the population who are RNs
- PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
- GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
- TotExp: sum of personal and government expenditures.

who = read.csv("who.csv")
print(head(who))

##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp
## 1 0.000572294      20      92    112
## 2 0.004614439     169    3128   3297
## 3 0.002091362     108    5184   5292
## 4 0.003500000    2589  169725 172314
## 5 0.001146162      36    1620   1656
## 6 0.002773810     503   12543  13046

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

#scatterplot
plot(who$TotExp, who$LifeExp, xlab='TotExp', ylab='LifeExp', main='Difference in Exps (LifeExp~TotExp)')

#linear regression
who_lm <- lm(LifeExp ~ TotExp, data = who)
summary(who_lm)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Looking at the summary of the simple linear regression, we can see:
F statistic: 65.26 on 1 and 188 DF
R^2: 0.2577 (adjusted: 0.2537)
Residual standard error: 9.371 on 188 degrees of freedom
p-value: 7.714e-14

The F statistic and the very low p-value indicate that TotExp is a statistically significant predictor of LifeExp. However, the R^2 of 0.2577 means the model explains only about 26% of the variation in life expectancy. In addition, the residual standard error of 9.371 means the fitted values are typically off by about 9 years, which is large for life expectancy, so the model is unreliable.
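As a quick consistency check, R^2 and adjusted R^2 can be recovered from the F statistic and its degrees of freedom. The sketch below only re-uses the numbers printed in the summary above; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary() output above
f_stat <- 65.26  # F statistic
df1    <- 1      # numerator degrees of freedom (number of predictors)
df2    <- 188    # denominator (residual) degrees of freedom

# R^2 = (F * df1) / (F * df1 + df2)
r2 <- (f_stat * df1) / (f_stat * df1 + df2)

# Adjusted R^2 penalises for the number of predictors
n      <- df1 + df2 + 1  # number of observations
adj_r2 <- 1 - (1 - r2) * (n - 1) / df2

cat("R^2:", round(r2, 4), " adjusted R^2:", round(adj_r2, 4), "\n")
```

Both values agree with the summary to four decimal places, which confirms that the F statistic and R^2 carry the same information about the overall fit.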

The assumptions of simple linear regression are:
- Linearity: X and Y have a linear relationship
- Homoscedasticity: the residuals have constant variance across all values of X
- Independence: the observations are independent of each other
- Normality: the residuals are normally distributed
If any of these is not met, the inferences drawn from the linear model are not reliable.
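A minimal sketch of how these assumptions are usually checked graphically. It runs on simulated data (not who.csv) so that it is self-contained; `x`, `y` and `fit` are hypothetical names, and the same calls would apply to any fitted `lm` object.

```r
# Simulated data that satisfies the assumptions by construction
set.seed(1)
x   <- runif(100, 0, 10)
y   <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
res <- resid(fit)

op <- par(mfrow = c(1, 2))

# Linearity and homoscedasticity: residuals vs fitted values should
# form a flat, evenly spread band around zero
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)

par(op)
```

A curve or a funnel shape in the first plot points to non-linearity or non-constant variance; systematic departures from the line in the second plot point to non-normal residuals.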

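Given the scatterplot above, we can already conclude that the data do not meet the assumptions of linear regression, since the relationship between x and y is not linear. As further evidence, the histogram below shows that the residuals are not symmetric: most of the mass sits to the right of zero, with a long tail on the negative side (the minimum residual is -24.764 against a maximum of 13.292), so the normality assumption fails as well.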

hist(who_lm$residuals, xlab='Residuals', main='Histogram of residuals/Normality check')
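Skew can also be quantified rather than judged by eye. The sketch below defines a small moment-based skewness helper (a hypothetical function, not from any package) and applies it to a toy vector made of the five residual quantiles printed in the summary above. That vector is only a crude stand-in; with the fitted model available one would call `skewness(resid(who_lm))` instead.

```r
# Moment-based sample skewness: negative = long left tail,
# positive = long right tail, near zero = roughly symmetric
skewness <- function(r) {
  d <- r - mean(r)
  mean(d^3) / mean(d^2)^1.5
}

# Toy stand-in: Min, 1Q, Median, 3Q, Max of the residuals reported above
toy_resid <- c(-24.764, -4.778, 3.154, 7.116, 13.292)

skew_toy <- skewness(toy_resid)
cat("Skewness of the toy residual vector:", skew_toy, "\n")
```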

Question 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

who$LifeExp_raised <- who$LifeExp^4.6
who$TotExp_raised <- who$TotExp^0.06

#scatterplot
plot(who$TotExp_raised, who$LifeExp_raised, xlab='TotExp^0.06', ylab='LifeExp^4.6', main='Difference in Exps (LifeExp^4.6~TotExp^0.06)')

#linear regression
who_lm_raised <- lm(LifeExp_raised ~ TotExp_raised, data = who)
summary(who_lm_raised)

## 
## Call:
## lm(formula = LifeExp_raised ~ TotExp_raised, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -736527910   46817945  -15.73   <2e-16 ***
## TotExp_raised  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Looking at the summary of this linear regression, we can see:
F statistic: 507.7 on 1 and 188 DF
R^2: 0.7298 (adjusted: 0.7283)
Residual standard error: 90490000 on 188 degrees of freedom
p-value: < 2.2e-16

Almost everything indicates that this is a better fit. The plot looks more linear, the R^2 is much higher (0.7298 versus 0.2577) and the p-value is still very low. The residual standard error looks far larger, but it is measured in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 years of the first model. The transformed model is therefore the better of the two, even if it is not a perfect one.
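To see why the residual standard error is so large in absolute terms, it helps to look at the scale of the transformed response. The sketch below raises a few illustrative life expectancies (chosen here, not taken from the data) to the 4.6 power and expresses the reported residual standard error as a share of each value.

```r
# Illustrative life expectancies in years (assumed values, not from who.csv)
life_exp <- c(40, 60, 70, 80)

# Same transformation as applied to the response variable
transformed <- life_exp^4.6

# Residual standard error copied from the summary() output above
rse <- 90490000

scale_check <- data.frame(
  LifeExp     = life_exp,
  LifeExp_4.6 = transformed,
  RSE_share   = rse / transformed  # RSE relative to the response value
)
print(scale_check)
```

The transformed response runs into the hundreds of millions, so a residual standard error of about 9e7 reflects the new units rather than a worse fit.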

Checking the regression assumptions:

hist(who_lm_raised$residuals, xlab='Residuals', main='Histogram of residuals/Normality check')

qqnorm(who_lm_raised$residuals)

In the two plots, the residuals are much closer to normal (only slightly skewed) and the points on the Q-Q plot fall roughly along a straight line. Taking independence as given, these plots, together with the more linear scatterplot above, suggest that the other three assumptions are reasonably met. The only debatable point is normality, depending on how much skew is considered acceptable.

Question 3

Using the results from 2, forecast life expectancy when TotExp^.06 =1.5.
Then forecast life expectancy when TotExp^.06=2.5.

LifeExp = (Intercept + (TotExp_raised * x))^(1/4.6), where x is TotExp^.06

Intercept = -736527910
TotExp_raised = 620060216
i = 1/4.6  

cat('When TotExp^.06 =1.5, the forecast life expectancy is: ', ((Intercept+(TotExp_raised*1.5))^i), '\n')

## When TotExp^.06 =1.5, the forecast life expectancy is:  63.31153

cat('When TotExp^.06 =2.5, the forecast life expectancy is: ', ((Intercept+(TotExp_raised*2.5))^i))

## When TotExp^.06 =2.5, the forecast life expectancy is:  86.50645
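The same back-transformation can be wrapped in a small function. This is a sketch using the coefficients reported in the Question 2 summary; `forecast_life_exp` is a hypothetical helper name. It also shows why the exponent needs parentheses: in R, `^` binds more tightly than `/`, so `z^1/4.6` means `(z^1)/4.6`.

```r
# Coefficients copied from the Question 2 summary() output
b0 <- -736527910  # intercept
b1 <-  620060216  # slope for TotExp^0.06

# Predict LifeExp^4.6, then undo the 4.6 power to get back to years
forecast_life_exp <- function(x, power = 4.6) {
  (b0 + b1 * x)^(1 / power)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
cat("Forecasts for TotExp^.06 = 1.5 and 2.5:", forecasts, "\n")

# Precedence pitfall: without parentheses the result is divided by 4.6
# instead of having the 4.6th root taken
wrong <- (b0 + b1 * 1.5)^1 / 4.6
cat("Without parentheses:", wrong, "\n")
```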

Question 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
LifeExp = PropMD + TotExp + PropMD x TotExp
LifeExp ~ PropMD + TotExp + PropMD * TotExp

who_lm_4 <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who)
summary(who_lm_4)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
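A note on the formula syntax: in an R formula, `a * b` already expands to `a + b + a:b`, so writing the main effects as well is harmless but redundant. The sketch below checks this on simulated data (the variables `a`, `b`, `y` are made up for illustration and are not the who.csv columns).

```r
# Simulated data with a genuine interaction effect
set.seed(42)
d   <- data.frame(a = runif(50), b = runif(50))
d$y <- 1 + 2 * d$a - d$b + 3 * d$a * d$b + rnorm(50, sd = 0.1)

# Three equivalent ways to specify main effects plus interaction
m_long  <- lm(y ~ a + b + (a * b), data = d)  # form used in this report
m_star  <- lm(y ~ a * b,           data = d)  # shorthand
m_colon <- lm(y ~ a + b + a:b,     data = d)  # fully explicit

same_long_star  <- isTRUE(all.equal(coef(m_long), coef(m_star)))
same_star_colon <- isTRUE(all.equal(coef(m_star), coef(m_colon)))
cat("All three formulas give the same coefficients:",
    same_long_star && same_star_colon, "\n")
```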

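Looking at the summary of this linear regression, we can see:
F statistic: 34.49 on 3 and 186 DF
R^2: 0.3574 (adjusted: 0.3471)
Residual standard error: 8.765 on 186 degrees of freedom
p-value: < 2.2e-16

Given the low R^2, this is not a good model: it explains about 36% of the variation in life expectancy, which is better than the first model (0.2577) but well short of the transformed model (0.7298). The overall p-value is still very low, and each coefficient is individually significant. The residual standard error of 8.765 years is slightly lower than the 9.371 of the first model; it looks far lower than that of the transformed model only because the response is back on its original scale.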

Checking the regression assumptions:

hist(who_lm_4$residuals, xlab='Residuals', main='Histogram of residuals/Normality check')

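Based solely on the histogram, the residuals are clearly skewed: most of the mass sits to the right of zero, with a long tail on the negative side (the minimum residual is -27.320 against a maximum of 13.074). The normality assumption is therefore not met, and the model is not a good fit for linear regression.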

Question 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
I am going to use the predict command: predict(linear_model, data.frame(variables for changes))

cat('When PropMD=.03 and TotExp = 14, the forecast life expectancy is: ', predict(who_lm_4, data.frame(PropMD = 0.03, TotExp = 14)))

## When PropMD=.03 and TotExp = 14, the forecast life expectancy is:  107.696

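This forecast does NOT seem realistic because the predicted life expectancy is ~108 years, which is too high. After a quick search, I found that the top three life expectancies in the world (Hong Kong, Macao & Japan) are all around 85, so this forecast is far above the highest real values. Almost all of the increase over the intercept comes from the PropMD term (1497 x 0.03 is about 45 years), so the model is extrapolating linearly from a very large PropMD value.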
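The forecast can be reproduced by hand from the coefficient table, which makes it easy to see which term drives it. This sketch uses the rounded estimates printed in the Question 4 summary, so the total differs slightly from the `predict()` result; the variable names are illustrative.

```r
# Rounded estimates copied from the Question 4 summary() output
b0 <-  6.277e+01  # intercept
b1 <-  1.497e+03  # PropMD
b2 <-  7.233e-05  # TotExp
b3 <- -6.026e-03  # PropMD:TotExp interaction

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
contrib <- c(
  intercept   = b0,
  PropMD      = b1 * prop_md,
  TotExp      = b2 * tot_exp,
  interaction = b3 * prop_md * tot_exp
)
print(round(contrib, 4))

pred_manual <- sum(contrib)
cat("Manual forecast:", pred_manual, "\n")
```

The TotExp and interaction terms are negligible at these inputs, so the forecast is essentially the intercept plus the PropMD term.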