library(psych)    # provides describe()
library(ggplot2)  # provides ggplot()
who = read.csv('who.csv')
describe(who)
##                vars   n     mean       sd  median  trimmed     mad   min
## Country*          1 190    95.50    54.99   95.50    95.50   70.42  1.00
## LifeExp           2 190    67.38    10.85   70.00    68.47   10.38 40.00
## InfantSurvival    3 190     0.96     0.04    0.98     0.97    0.02  0.84
## Under5Survival    4 190     0.95     0.06    0.97     0.96    0.03  0.73
## TBFree            5 190     1.00     0.00    1.00     1.00    0.00  0.99
## PropMD            6 190     0.00     0.00    0.00     0.00    0.00  0.00
## PropRN            7 190     0.00     0.01    0.00     0.00    0.00  0.00
## PersExp           8 190   742.00  1354.00  199.50   386.70  256.49  3.00
## GovtExp           9 190 40953.49 86140.65 5385.00 17671.33 7692.47 10.00
## TotExp           10 190 41695.49 87449.85 5541.00 18060.03 7899.29 13.00
##                      max     range  skew kurtosis      se
## Country*          190.00    189.00  0.00    -1.22    3.99
## LifeExp            83.00     43.00 -0.80    -0.25    0.79
## InfantSurvival      1.00      0.16 -1.34     1.11    0.00
## Under5Survival      1.00      0.27 -1.57     1.71    0.00
## TBFree              1.00      0.01 -1.66     2.70    0.00
## PropMD              0.04      0.04  7.52    64.25    0.00
## PropRN              0.07      0.07  7.25    74.81    0.00
## PersExp          6350.00   6347.00  2.48     5.64   98.23
## GovtExp        476420.00 476410.00  2.86     8.39 6249.30
## TotExp         482750.00 482737.00  2.85     8.32 6344.28
ggplot(who, aes(x=TotExp, y=LifeExp)) + geom_point(shape=1) + geom_smooth(method=lm)
q1lm = lm(data = who,LifeExp~TotExp)
hist(resid(q1lm))
summary(q1lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -24.764  -4.778   3.154   7.116  13.292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
F-Statistic
The F-stat tests whether or not there is a relationship between the variables (response/predictor).
If our F-stat is big relative to its F distribution, we reject the null hypothesis (that there is no relationship).
In our case, our F-stat is 65.26 on 1 and 188 DF, which is very big, meaning there is a statistically significant relationship between LifeExp and TotExp.
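As a rough check on what "big" means here, the F-stat can be compared with the 5% critical value of the F distribution on the same degrees of freedom. This is a minimal sketch using base R's `qf()` and `pf()`; the numbers are copied from the summary output above, and the names `f_stat`, `df1`, `df2`, `f_crit` and `p_val` are only for illustration.

```r
# Values copied from summary(q1lm) above
f_stat = 65.26
df1 = 1     # number of predictors
df2 = 188   # residual degrees of freedom

# 5% critical value: the F-stat has to exceed this to reject the null
f_crit = qf(0.95, df1, df2)

# Upper-tail probability of an F-stat at least this large under the null
p_val = pf(f_stat, df1, df2, lower.tail = FALSE)

f_stat > f_crit
p_val
```

The second value should land close to the p-value printed on the last line of the summary.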
\(R^2\)
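Also known as the coefficient of determination, this estimates the share of variability in the response that is explained by the model. Ours is 0.2577, so the model explains about 26% of the variability in LifeExp.
Since R-squared can’t account for bias, we need to plot our residuals against the fitted values to properly interpret the R^2.
We will be looking for residuals scattered symmetrically around the zero line.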
plot(fitted(q1lm),resid(q1lm))
abline(0,0)
This is terrible: the residuals are not symmetric around zero, so I can’t trust our R^2 metric.
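To make the "variability explained" reading concrete, R^2 can be computed by hand as one minus the ratio of residual to total sum of squares. The sketch below uses simulated stand-in data, not the WHO data, and the names `x`, `y`, `fit`, `ss_res`, `ss_tot` and `r2_by_hand` are made up for the illustration.

```r
# Simulated stand-in data (NOT the WHO data), just to show the calculation
set.seed(1)
x = runif(100, 0, 10)
y = 2 + 0.5 * x + rnorm(100)
fit = lm(y ~ x)

ss_res = sum(resid(fit)^2)       # variability left over after the fit
ss_tot = sum((y - mean(y))^2)    # total variability in the response
r2_by_hand = 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
all.equal(r2_by_hand, summary(fit)$r.squared)
```

With the real model, the same comparison would use `resid(q1lm)` and `who$LifeExp`.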
Residual standard error
This is the estimated standard deviation of the residuals. (Roughly, how far away a typical dot is from the regression line.)
Since this is a measure of spread, the lower the better when looking for a well-fitting model.
This is pretty high: 9.371 years, not much below the 10.85 standard deviation of LifeExp itself.
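The residual standard error can also be reproduced by hand: it is the square root of the residual sum of squares divided by the residual degrees of freedom. A small sketch on simulated stand-in data (not the WHO data), with illustrative names:

```r
# Simulated stand-in data (NOT the WHO data)
set.seed(2)
x = runif(100, 0, 10)
y = 2 + 0.5 * x + rnorm(100, sd = 3)
fit = lm(y ~ x)

# sqrt of the residual sum of squares over the residual degrees of freedom
rse_by_hand = sqrt(sum(resid(fit)^2) / df.residual(fit))

# Should agree with the "Residual standard error" line of summary()
all.equal(rse_by_hand, summary(fit)$sigma)
```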
P-value
This is the probability of seeing a relationship at least this strong if there were really no relationship between the variables.
A small p-value is best; ours is 7.714e-14, far below 5%. This relationship is very unlikely to be due to chance.
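With a single predictor, the same p-value can be recovered from the t value of the slope, and the squared t value matches the F-stat. A minimal sketch using base R's `pt()`, with the t value and degrees of freedom copied from the summary output; the variable names are illustrative.

```r
# Values copied from the TotExp row of summary(q1lm)
t_val = 8.079
df = 188

# Two-sided p-value: probability of a |t| at least this large under the null
p_val = 2 * pt(abs(t_val), df, lower.tail = FALSE)
p_val

# With one predictor, t^2 is (up to rounding) the F-statistic of 65.26
t_val^2
```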
Relationship is not linear. We see this from the original scatterplot.
Residuals are not normally distributed, as seen from the histogram.
Multicollinearity is N/A as we do not have multiple independent variables, and autocorrelation is not a concern since the rows are countries rather than ordered observations.
Homoscedasticity fails, as the residual plot shows no symmetry between values. (They are not of a common variance.)
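The visual check of the residual histogram can be backed up with a formal normality test. The sketch below runs base R's `shapiro.test()` on simulated stand-in data with deliberately skewed errors (not the WHO data); with the real model one would pass `resid(q1lm)` instead. The names `x`, `y`, `fit` and `sw` are illustrative.

```r
# Simulated stand-in data (NOT the WHO data) with deliberately skewed errors
set.seed(3)
x = runif(190, 0, 10)
y = 1 + 2 * x - rexp(190, rate = 0.5)   # left-skewed noise
fit = lm(y ~ x)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw = shapiro.test(resid(fit))
sw$p.value
```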
who$LifeExpq2 = who$LifeExp^(4.6)
who$TotExpq2 = who$TotExp^(.06)
q2lm = lm(data=who,LifeExpq2~TotExpq2)
summary(q2lm)
##
## Call:
## lm(formula = LifeExpq2 ~ TotExpq2, data = who)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -308616089  -53978977   13697187   59139231  211951764
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExpq2     620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
ggplot(who, aes(x=TotExpq2, y=LifeExpq2)) + geom_point(shape=1) + geom_smooth(method=lm)
hist(resid(q2lm))
plot(fitted(q2lm),resid(q2lm))
abline(0,0)
F-Statistic
Much higher than in problem 1 (507.7 vs. 65.26), with the same amount of data/variables. This is a good thing.
R^2
Variability explained by the model rose to 73%, that is amazing. Residual plot is symmetric this time around.
Residual standard error
Became huge (90,490,000), but so did our data: the response is now LifeExp^4.6, so this number can’t be compared with the 9.371 from problem 1. Our residual plot doesn’t show too much of an improvement. Not much to say about this.
P-Value
Dropped from 7.714e-14 to below 2.2e-16, so the chance of seeing this relationship when none exists is even closer to 0.
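Because the residual standard error of the transformed model is on the LifeExp^4.6 scale, one way to get a number in years is to back-transform the fitted values and measure the error there. This is only a sketch on simulated stand-in data (not the WHO data); the coefficients loosely mimic q2lm, the noise level is an assumption, and all names are illustrative.

```r
# Simulated stand-in data (NOT the WHO data); coefficients loosely mimic q2lm
set.seed(4)
x = runif(190, 1.5, 2.2)                 # plays the role of TotExp^.06
y_t = -736527910 + 620060216 * x +
  rnorm(190, sd = 3e7)                   # plays the role of LifeExp^4.6
life = y_t^(1/4.6)                       # back to years
fit = lm(y_t ~ x)

summary(fit)$sigma                       # RSE on the transformed scale (huge)

# Back-transform the fitted values and measure the error in years instead
fitted_years = fitted(fit)^(1/4.6)
rmse_years = sqrt(mean((life - fitted_years)^2))
rmse_years
```

With the real data the equivalent would be `sqrt(mean((who$LifeExp - fitted(q2lm)^(1/4.6))^2))`, which is comparable to the 9.371 from problem 1 as long as every fitted value is positive.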
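# Back-transform a forecast from model 2; x is TotExp^.06, not raw TotExp
expect = function(x){
  (-736527910 + 620060216 * x)^(1/4.6)}
# Forecast life expectancy for TotExp^.06 from 1 to 5 (NaN when the base is negative)
someExpenditure= sapply(seq(1,5,.5),expect)
rbind(seq(1,5,.5),someExpenditure)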
##                 [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
##                    1  1.50000  2.00000  2.50000  3.00000  3.50000   4.0000
## someExpenditure  NaN 63.31153 77.93928 86.50645 92.79589 97.84379 102.0978
##                     [,8]     [,9]
##                   4.5000   5.0000
## someExpenditure 105.7953 109.0788
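Two things in this table are worth spelling out: why the first forecast is NaN, and how much of the grid lies outside the data. The sketch below reuses the `expect` function and coefficients from above, and takes the min (13) and max (482750) of TotExp from the `describe(who)` output.

```r
# Same back-transform as above, coefficients copied from summary(q2lm)
expect = function(x) (-736527910 + 620060216 * x)^(1/4.6)

expect(1.5)   # 63.31153, as in the output above
expect(2.5)   # 86.50645, as in the output above

# At x = 1 the quantity in parentheses is negative, and R returns NaN
# for a negative number raised to a fractional power
-736527910 + 620060216 * 1

# Range of TotExp^.06 actually present in the data, from the min (13)
# and max (482750) reported by describe(who)
c(13, 482750)^.06
```

The transformed predictor only spans roughly 1.17 to 2.19 in the data, so the forecasts beyond that, including the ones above 100 years, are extrapolations.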
q4lm = lm(data=who,LifeExp~PropMD+TotExp+PropMD*TotExp)
summary(q4lm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -27.320  -4.132   2.098   6.540  13.074
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
hist(resid(q4lm))
plot(fitted(q4lm),resid(q4lm))
abline(0,0)
F-Statistic
34.49, lower than our previous F-stats, but what matters is that it is far above what we would see with no relationship: on 3 and 186 DF the p-value is below 2.2e-16.
Accepted.
R^2
Low at 36%; residual plot shows a massive skew.
Rejected.
Residual Standard Error
Pretty high at approx. 9 but not terrible. Residual plot tells us that the errors get smaller when we forecast from predictor values close to the median.
Accepted.
P-Value
As near 0 as our second model (below 2.2e-16), this is very good.
Accepted.
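The interaction term means the effect of PropMD on LifeExp is not a single number: it depends on TotExp. A small sketch using the coefficients from the summary above; the function name `propmd_slope` is made up for the illustration, and the TotExp values are the min, median and max from `describe(who)`.

```r
# Coefficients copied from summary(q4lm) above
b_propmd = 1.497e+03    # PropMD
b_inter  = -6.026e-03   # PropMD:TotExp

# Slope of LifeExp with respect to PropMD at a given TotExp
propmd_slope = function(totexp) b_propmd + b_inter * totexp

propmd_slope(c(13, 5541, 482750))   # at the min, median and max TotExp

# TotExp at which the fitted effect of PropMD changes sign
-b_propmd / b_inter
```

Under this fit, more doctors are associated with higher life expectancy at low and typical spending levels, while the sign flips at the very top of the TotExp range.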
Q5LifeExpectation = (6.277*(10)) + (1.497*(10^3))*.03 + (7.233*(10^-5))*14 - (6.026*(10^-3)) *(.03*14)
Q5LifeExpectation
## [1] 107.6785
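This is not realistic. The forecast of 107.7 years is well above the highest life expectancy in the data (83). A TotExp of 14 is one above the min, while the mean happens to be about 41,695, and a PropMD of .03 is close to the max of .04. Unless total expenditure goes down as you age (which it doesn’t), I would not expect a life expectancy outside the observed range to come with spending this close to the minimum.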
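The same forecast can be written as a coefficient vector times a design row, which makes the comparison with the observed range easy to script. A minimal sketch with the rounded coefficients from `summary(q4lm)` and the LifeExp maximum from `describe(who)`; the names `coefs`, `new_x` and `forecast` are illustrative.

```r
# Coefficients copied from summary(q4lm):
# intercept, PropMD, TotExp, PropMD:TotExp
coefs = c(6.277e+01, 1.497e+03, 7.233e-05, -6.026e-03)

# Design row for the forecast point PropMD = .03, TotExp = 14
new_x = c(1, .03, 14, .03 * 14)

forecast = sum(coefs * new_x)
forecast        # 107.6785, matching the hand calculation

# Largest LifeExp in the data, from describe(who)
forecast > 83   # TRUE: the forecast lies outside the observed range
```

With the fitted object available, `predict(q4lm, newdata = data.frame(PropMD = .03, TotExp = 14))` would give nearly the same value using the unrounded coefficients.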