Data 605 - A12

Michael Muller

November 19, 2017

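library(psych)    # provides describe()
library(ggplot2)  # provides ggplot(), used for the scatterplots below
who = read.csv('who.csv')
describe(who)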

##                vars   n     mean       sd  median  trimmed     mad   min
## Country*          1 190    95.50    54.99   95.50    95.50   70.42  1.00
## LifeExp           2 190    67.38    10.85   70.00    68.47   10.38 40.00
## InfantSurvival    3 190     0.96     0.04    0.98     0.97    0.02  0.84
## Under5Survival    4 190     0.95     0.06    0.97     0.96    0.03  0.73
## TBFree            5 190     1.00     0.00    1.00     1.00    0.00  0.99
## PropMD            6 190     0.00     0.00    0.00     0.00    0.00  0.00
## PropRN            7 190     0.00     0.01    0.00     0.00    0.00  0.00
## PersExp           8 190   742.00  1354.00  199.50   386.70  256.49  3.00
## GovtExp           9 190 40953.49 86140.65 5385.00 17671.33 7692.47 10.00
## TotExp           10 190 41695.49 87449.85 5541.00 18060.03 7899.29 13.00
##                      max     range  skew kurtosis      se
## Country*          190.00    189.00  0.00    -1.22    3.99
## LifeExp            83.00     43.00 -0.80    -0.25    0.79
## InfantSurvival      1.00      0.16 -1.34     1.11    0.00
## Under5Survival      1.00      0.27 -1.57     1.71    0.00
## TBFree              1.00      0.01 -1.66     2.70    0.00
## PropMD              0.04      0.04  7.52    64.25    0.00
## PropRN              0.07      0.07  7.25    74.81    0.00
## PersExp          6350.00   6347.00  2.48     5.64   98.23
## GovtExp        476420.00 476410.00  2.86     8.39 6249.30
## TotExp         482750.00 482737.00  2.85     8.32 6344.28

ggplot(who, aes(x=TotExp, y=LifeExp)) + geom_point(shape=1) + geom_smooth(method=lm)

q1lm = lm(data = who,LifeExp~TotExp)

hist(resid(q1lm))

summary(q1lm)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

F-Statistic
The F-stat tests whether there is any relationship between the response and the predictor.
The null hypothesis is that there is no relationship; if our F-stat is big (and its p-value correspondingly small), we reject it.
In our case the F-stat is 65.26 on 1 and 188 degrees of freedom, which is very big, meaning there is strong evidence of a relationship between LifeExp and TotExp.
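
As a check on the reported F-stat, it can be recomputed from the R-squared and the degrees of freedom. This is a minimal sketch using the rounded values from the summary printout, so it agrees with the printout only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary(q1lm) printout (rounded)
r2 <- 0.2577       # Multiple R-squared
n  <- 190          # observations
k  <- 1            # predictors (TotExp)
t_totexp <- 8.079  # t value for TotExp

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
f_from_r2 <- (r2 / k) / ((1 - r2) / (n - k - 1))

# With a single predictor, F is the square of that predictor's t value
f_from_t <- t_totexp^2

round(c(f_from_r2 = f_from_r2, f_from_t = f_from_t), 2)
```

Both values come out within rounding error of the reported 65.26.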

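\(R^2\)
Also known as the coefficient of determination, it estimates the proportion of variability in the response that is explained by the model (here about 26%).
Since R-squared can’t tell us whether the fit is biased, we need to plot the residuals against the fitted values to properly interpret the R^2.
We will be looking for residuals scattered symmetrically around zero across the whole range of fitted values.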

plot(fitted(q1lm),resid(q1lm))
abline(0,0)

This is terrible: the residuals are not scattered symmetrically around zero, so I can’t rely on our R^2 metric.

Residual standard error
This is the estimated standard deviation of the residuals. (Roughly, how far a typical dot is from the regression line.)
Since this is a measure of spread around the line, the lower the better for finding a well-fitting model.
This is pretty high: 9.371 years, against a LifeExp range of 40 to 83.
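
To make the definition concrete, the residual standard error can be computed by hand and compared with what `summary()` reports. This is a sketch on simulated data (not the WHO data); the names `fit`, `rse_by_hand` and `rse_reported` are illustrative.

```r
set.seed(1)
# Simulated data, for illustration only (not the WHO data)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 1.5)
fit <- lm(y ~ x)

# RSE = sqrt(residual sum of squares / residual degrees of freedom)
rse_by_hand <- sqrt(sum(resid(fit)^2) / df.residual(fit))

# The same quantity as stored by summary()
rse_reported <- summary(fit)$sigma

c(by_hand = rse_by_hand, reported = rse_reported)
```

The divisor is the residual degrees of freedom (n minus the number of estimated coefficients), which is why the printout for the first model says "on 188 degrees of freedom".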

P-value
This is the probability of seeing a relationship at least this strong if there were in fact no relationship (the null hypothesis).
A small p-value is best; ours is 7.714e-14, far below 5%. It is very unlikely that this relationship is due to chance alone.
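
The p-value in the printout can be reproduced from the test statistics with base R's distribution functions. This is a sketch using the rounded F and t values from the summary, so the results should agree with the printout only approximately.

```r
# Values copied from the summary(q1lm) printout
f_stat <- 65.26
df1 <- 1
df2 <- 188
t_totexp <- 8.079

# Upper-tail probability of the F distribution under the null hypothesis
p_from_f <- pf(f_stat, df1, df2, lower.tail = FALSE)

# Two-sided tail probability of the t distribution for the TotExp slope
p_from_t <- 2 * pt(-abs(t_totexp), df2)

c(p_from_f = p_from_f, p_from_t = p_from_t)
```

With a single predictor the two tests are equivalent, which is why the overall p-value and the TotExp p-value in the printout match.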

Assumptions required for regression CHECKLIST:

Relationship is not linear. We see this from the original scatterplot.
Residuals are not normally distributed, as seen from the histogram.
Multicollinearity is N/A as we do not have multiple independent variables. Autocorrelation is not checked here, since the rows are separate countries rather than an ordered series.
Homoscedasticity fails, as the residual plot shows the residuals are not spread evenly around zero. (They are not of a common variance.)
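
The checklist above is judged by eye from the plots. As a rough numeric companion, the sketch below runs three base R checks on simulated, deliberately non-linear data (not the WHO data). The auxiliary regression on squared residuals is only an informal stand-in for a formal test such as Breusch-Pagan, and all names are illustrative.

```r
set.seed(42)
# Simulated, deliberately non-linear data (not the WHO data)
x <- rexp(200, rate = 1 / 1000)
y <- 50 + 4 * log(x + 1) + rnorm(200, sd = 2)
fit <- lm(y ~ x)
res <- resid(fit)

# Normality of residuals: Shapiro-Wilk (small p-value = evidence against normality)
shapiro_p <- shapiro.test(res)$p.value

# Constant variance: regress squared residuals on fitted values;
# a clearly non-zero slope hints at heteroscedasticity
aux <- lm(res^2 ~ fitted(fit))
aux_slope_p <- summary(aux)$coefficients[2, 4]

# Linearity: does adding a curved term improve the fit?
curve_p <- anova(fit, lm(y ~ x + log(x + 1)))$`Pr(>F)`[2]

c(shapiro = shapiro_p, variance = aux_slope_p, curvature = curve_p)
```

These numbers complement the histogram and residual plot rather than replace them; a small p-value in any of the three points at the corresponding assumption.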

Q1 Conclusion : Model fails to be accepted as proper regression.

who$LifeExpq2 = who$LifeExp^(4.6)
who$TotExpq2 = who$TotExp^(.06)
q2lm = lm(data=who,LifeExpq2~TotExpq2)
summary(q2lm)

## 
## Call:
## lm(formula = LifeExpq2 ~ TotExpq2, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExpq2     620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

ggplot(who, aes(x=TotExpq2, y=LifeExpq2)) + geom_point(shape=1) + geom_smooth(method=lm)

hist(resid(q2lm))

plot(fitted(q2lm),resid(q2lm))
abline(0,0)

F-Statistic

Much higher than in problem 1 (507.7 versus 65.26), with the same number of observations and predictors. This is a good thing.

R^2

Variability explained by the model rose to 73% (R^2 = 0.7298); that is amazing. The residual plot is symmetric this time around.

Residual standard error

Became huge, but so did the scale of our response (LifeExp^4.6), so it is not directly comparable with model 1. Our residual plot doesn’t show too much of an improvement. Not much to say about this.

P-Value

Dropped from 7.714e-14 to below 2.2e-16, so the probability of seeing a relationship this strong under the null hypothesis is even closer to 0.

Q2 Conclusion : Model 2 is a much better fit than model 1.

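# Back-transform model 2's fit: LifeExp = (b0 + b1 * TotExp^.06)^(1/4.6)
expect = function(x){
(-736527910 + 620060216 * x)^(1/4.6)}
# Forecast LifeExp over a grid of TotExp^.06 values
someExpenditure= sapply(seq(1,5,.5),expect)
rbind(seq(1,5,.5),someExpenditure)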

##                 [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
##                    1  1.50000  2.00000  2.50000  3.00000  3.50000   4.0000
## someExpenditure  NaN 63.31153 77.93928 86.50645 92.79589 97.84379 102.0978
##                     [,8]     [,9]
##                   4.5000   5.0000
## someExpenditure 105.7953 109.0788

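Q3 Conclusion : As expenditure goes up, life expectancy does too. (The NaN at 1 arises because the fitted value there is negative and cannot be raised to a fractional power.)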
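
To see which of these forecasts sit inside the data, the grid can be compared with the observed TotExp range from the `describe(who)` output (min 13, max 482750). This is a sketch: the coefficients are the ones printed for model 2, while the column names and the range check are illustrative additions.

```r
# Back-transform of model 2, using the coefficients printed in summary(q2lm)
expect <- function(x) (-736527910 + 620060216 * x)^(1 / 4.6)

# Observed TotExp range on the transformed scale (min 13, max 482750)
x_min <- 13^0.06
x_max <- 482750^0.06

grid <- seq(1, 5, 0.5)
in_range <- grid >= x_min & grid <= x_max

data.frame(
  TotExp_q2 = grid,
  TotExp_raw = grid^(1 / 0.06),  # undo the ^.06 transform
  forecast = expect(grid),       # NaN where the fitted value is negative
  in_observed_range = in_range
)
```

On this grid only 1.5 and 2 lie between the transformed minimum and maximum, so the forecasts at 2.5 and above are extrapolations beyond anything the model was fitted on.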

q4lm = lm(data=who,LifeExp~PropMD+TotExp+PropMD*TotExp)
summary(q4lm)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
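
A note on the formula: in R, `a * b` already expands to `a + b + a:b`, which is why the printout shows a single `PropMD:TotExp` row. The sketch below checks on simulated data (not the WHO data) that the spelled-out and shorthand formulas give the same fit; the data frame `d` and its columns are illustrative.

```r
set.seed(7)
# Simulated data, for illustration only (not the WHO data)
d <- data.frame(a = runif(100), b = runif(100))
d$y <- 1 + 2 * d$a - d$b + 3 * d$a * d$b + rnorm(100, sd = 0.1)

# Spelled out, in the same style as q4lm
fit_long <- lm(y ~ a + b + a * b, data = d)
# Shorthand: a * b expands to a + b + a:b
fit_short <- lm(y ~ a * b, data = d)

all.equal(coef(fit_long), coef(fit_short))
```

Under that expansion, `LifeExp ~ PropMD * TotExp` would be an equivalent, shorter way to write the model above.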

hist(resid(q4lm))

plot(fitted(q4lm),resid(q4lm))
abline(0,0)

F-Statistic
34.49 on 3 and 186 degrees of freedom. Lower than our previous F-stats, but still far above one and large enough to reject the null hypothesis.
Accepted.

R^2
Low at 36% (0.3574); the residual plot shows a massive skew.
Rejected.

Residual Standard Error
Pretty high at approx. 9 (8.765) but not terrible. The residual plot suggests that our errors will be smaller when we forecast from predictor values close to the median.
Accepted.

P-Value
As near 0 as in our second model (< 2.2e-16); this is very good.
Accepted.

Q4 Conclusion : Model accepted

Q5LifeExpectation = (6.277*(10)) + (1.497*(10^3))*.03 + (7.233*(10^-5))*14 - (6.026*(10^-3)) *(.03*14)
Q5LifeExpectation

## [1] 107.6785

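This is not realistic. A TotExp of 14 is one above the minimum (13), while the mean happens to be about 41695, and a PropMD of .03 is close to the maximum (0.04). The forecast of 107.68 years is also well above the maximum observed LifeExp of 83. Unless total expenditure goes down as you age (which it doesn’t), I would not expect a life expectancy outside the observed range to go with near-minimum expenditure.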
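
Breaking the forecast into its four terms shows where the 107.68 comes from. This is a sketch using the rounded coefficients from the `summary(q4lm)` printout and the maximum LifeExp from the `describe(who)` output; the variable names are illustrative.

```r
# Coefficients as rounded in the summary(q4lm) printout
b0 <- 6.277e+01
b_md <- 1.497e+03
b_exp <- 7.233e-05
b_int <- -6.026e-03

PropMD <- 0.03
TotExp <- 14

# Contribution of each term to the forecast
parts <- c(
  intercept = b0,
  PropMD = b_md * PropMD,
  TotExp = b_exp * TotExp,
  interaction = b_int * PropMD * TotExp
)
parts

forecast <- sum(parts)
forecast

# Largest LifeExp in the data, from the describe(who) output
max_life_exp <- 83
forecast > max_life_exp
```

The PropMD term alone adds about 44.9 years on top of the 62.77 intercept, while the TotExp and interaction terms are negligible at this expenditure level, so the forecast is driven almost entirely by a PropMD value near the top of its observed range.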

Q5 Conclusion : Particular forecast set at (PropMD ==.03 && TotExp == 14) is unrealistic