1. Download the dataset from the source.

lifexp <- read.table("https://raw.githubusercontent.com/angus001/Data605/master/Assign12_rawdata.csv", header = TRUE, sep = ",")
  2. Clean up the data. 2.a Inspect the data columns.
colnames(lifexp)
##  [1] "Country"        "LifeExp"        "InfantSurvival" "Under5Survival"
##  [5] "TBFree"         "PropMD"         "PropRN"         "PersExp"       
##  [9] "GovtExp"        "TotExp"
head(lifexp)
##               Country LifeExp InfantSurvival Under5Survival  TBFree
## 1         Afghanistan      42          0.835          0.743 0.99769
## 2             Albania      71          0.985          0.983 0.99974
## 3             Algeria      71          0.967          0.962 0.99944
## 4             Andorra      82          0.997          0.996 0.99983
## 5              Angola      41          0.846          0.740 0.99656
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991
##        PropMD      PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294      20      92    112
## 2 0.001143127 0.004614439     169    3128   3297
## 3 0.001060478 0.002091362     108    5184   5292
## 4 0.003297297 0.003500000    2589  169725 172314
## 5 0.000070400 0.001146162      36    1620   1656
## 6 0.000142857 0.002773810     503   12543  13046
pairs(lifexp,gap=0.5, col = 'navyblue')

Question 1. Perform simple linear regression.

The R-squared of 0.2302 indicates that total expenditure alone explains about 23% of the variance in life expectancy across countries. The p-value for TotExp (1.99e-09) is far below 0.05, so the relationship is statistically significant, although significance alone does not show that a straight line is the right form. R-squared measures the strength of the relationship between the input and output variables. The F-statistic tests whether the model explains more variance than an intercept-only model: a low F-value means the predictor adds little, while a high F-value (here 41.27 on 1 and 138 degrees of freedom) means it adds a lot.

lifexplm <- lm(LifeExp ~ TotExp,  data = lifexp )

summary(lifexplm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = lifexp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.669  -4.096   3.443   7.260  13.450 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.466e+01  9.070e-01  71.291  < 2e-16 ***
## TotExp      6.093e-05  9.484e-06   6.424 1.99e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.763 on 138 degrees of freedom
## Multiple R-squared:  0.2302, Adjusted R-squared:  0.2246 
## F-statistic: 41.27 on 1 and 138 DF,  p-value: 1.993e-09
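
As a quick check on how R-squared and the F-statistic relate, the sketch below recomputes F and its p-value from the numbers printed by summary(lifexplm). The variable names (r2, k, df_resid) are illustrative, and the inputs are copied by hand from the output above, so the results agree with summary() only up to rounding.

```r
# Values copied from the summary(lifexplm) output (rounded as printed)
r2       <- 0.2302   # Multiple R-squared
k        <- 1        # number of predictors (TotExp)
df_resid <- 138      # residual degrees of freedom

# F compares variance explained per predictor with
# unexplained variance per residual degree of freedom
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution gives the overall p-value
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

round(f_stat, 2)   # close to the 41.27 reported by summary()
p_value
```

Because F is a function of R-squared and the degrees of freedom, the two statistics are not independent checks: with one predictor, a larger R-squared on the same data always gives a larger F.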
{plot(lifexp$TotExp,lifexp$LifeExp,main = "Life Expectancy vs. Total Medical Expenditure",
      col = 'navyblue', pch = 16, xlab = "Total Expenditure (Adjusted Dollars)", ylab = "Life Expectancy (Years)")
abline(lifexplm, col = "red")}

Question 2. Raise life expectancy to the power of 4.6 and TotExp to the power of 0.06, then perform another linear regression on the two transformed variables.

The regression line fits the transformed scatterplot much better than in the first model, though not perfectly. The R-squared of 0.7105 means the model now explains about 71 percent of the variance in LifeExp^4.6 across countries. The F-statistic rises from 41.27 to 338.7, and its p-value (< 2.2e-16) also supports the significance of the model.

lifexp$LifeExp46 <- (lifexp$LifeExp)^4.6
lifexp$TotExp06<-(lifexp$TotExp)^.06
lifexplm2 <-lm(LifeExp46 ~ TotExp06,  data = lifexp )
summary(lifexplm2)
## 
## Call:
## lm(formula = LifeExp46 ~ TotExp06, data = lifexp)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -306134143  -57556675   17946146   62597037  210341349 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -720554749   56186881  -12.82   <2e-16 ***
## TotExp06     609874086   33137634   18.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 92060000 on 138 degrees of freedom
## Multiple R-squared:  0.7105, Adjusted R-squared:  0.7084 
## F-statistic: 338.7 on 1 and 138 DF,  p-value: < 2.2e-16
{plot(lifexp$TotExp06, lifexp$LifeExp46,main = "Version 2 Life Expectancy vs. Total Medical Expenditure",
      col = 'navyblue', pch = 16, xlab = "Total Expenditure (Raised to the Power 0.06)", ylab = "Life Expectancy (Raised to the Power 4.6)")
abline(lifexplm2, col = "red")
}

Question 3. Using the results from Question 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.
colnames(lifexp)
##  [1] "Country"        "LifeExp"        "InfantSurvival" "Under5Survival"
##  [5] "TBFree"         "PropMD"         "PropRN"         "PersExp"       
##  [9] "GovtExp"        "TotExp"         "LifeExp46"      "TotExp06"
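
The forecast asked for above can be sketched directly from the coefficients printed by summary(lifexplm2). The helper name forecast_life_exp is made up for this illustration, and the coefficients are typed in as printed rather than read from the model object, so the result is approximate. The fitted line predicts LifeExp^4.6, so the prediction has to be raised to the power 1/4.6 to get back to years.

```r
# Coefficients copied from summary(lifexplm2) (as printed)
b0 <- -720554749   # intercept
b1 <-  609874086   # slope on TotExp^0.06

# Predict LifeExp^4.6, then undo the 4.6 power to return to years
forecast_life_exp <- function(totexp06) {
  life_exp46 <- b0 + b1 * totexp06
  life_exp46^(1 / 4.6)
}

forecast_life_exp(c(1.5, 2.5))   # roughly 63 and 86 years
```

With the fitted model in memory, the same numbers should come from predict(lifexplm2, data.frame(TotExp06 = c(1.5, 2.5)))^(1 / 4.6).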
Question 4. Build the following multiple regression model and interpret the F-statistic, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

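The model fitted here uses all available predictors rather than only PropMD, TotExp, and their interaction. With multiple regression, about 92% of the variance in life expectancy is explained (R-squared 0.9229). TotExp and GovtExp have large p-values (0.1725 and 0.1820), so they are removed in the backward elimination that follows. PersExp has no estimate because TotExp is the sum of PersExp and GovtExp in the rows shown by head(lifexp), which makes the three columns linearly dependent. The median residual (0.42) is close to zero, although the minimum of -15.83 shows that at least one country is fitted poorly.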

Also, the residual Q-Q plot follows a straight line, so the model appears quite robust.

lifexplm3 <-lm(LifeExp ~ TotExp+GovtExp+InfantSurvival+Under5Survival+TBFree+PropMD+PropRN+PersExp,  data = lifexp )
summary(lifexplm3)
## 
## Call:
## lm(formula = LifeExp ~ TotExp + GovtExp + InfantSurvival + Under5Survival + 
##     TBFree + PropMD + PropRN + PersExp, data = lifexp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8274  -1.1990   0.4179   1.7742   6.8482 
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.073e+03  1.611e+02  -6.658 6.79e-10 ***
## TotExp          1.093e-03  7.970e-04   1.372   0.1725    
## GovtExp        -1.085e-03  8.090e-04  -1.342   0.1820    
## InfantSurvival  1.094e+02  4.878e+01   2.242   0.0267 *  
## Under5Survival  5.525e+01  2.870e+01   1.925   0.0564 .  
## TBFree          9.836e+02  1.668e+02   5.898 2.91e-08 ***
## PropMD          6.546e+02  2.606e+02   2.512   0.0132 *  
## PropRN         -3.257e+02  1.311e+02  -2.485   0.0142 *  
## PersExp                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.16 on 132 degrees of freedom
## Multiple R-squared:  0.9229, Adjusted R-squared:  0.9188 
## F-statistic: 225.7 on 7 and 132 DF,  p-value: < 2.2e-16
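
The NA row for PersExp and the note "1 not defined because of singularities" are what lm reports when one predictor is an exact linear combination of others. A minimal sketch, using only the six rows printed by head(lifexp) earlier (the data frame name six_rows is made up for this illustration), checks whether TotExp equals PersExp + GovtExp in those rows. If the same holds for the whole data set, that would explain the dropped coefficient.

```r
# Expenditure columns for the six countries shown by head(lifexp)
six_rows <- data.frame(
  PersExp = c(20, 169, 108, 2589, 36, 503),
  GovtExp = c(92, 3128, 5184, 169725, 1620, 12543),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# TRUE means TotExp carries no information beyond PersExp and GovtExp
all(six_rows$PersExp + six_rows$GovtExp == six_rows$TotExp)
```

On the full model, alias(lifexplm3) is one way to list such exact dependencies among the predictors.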

Remove the two variables with the largest p-values (TotExp and GovtExp, both above 0.05).

lifexplm4 <- update(lifexplm3, .~. -TotExp, data = lifexp)
lifexplm4 <- update(lifexplm4, .~. -GovtExp, data = lifexp)
summary(lifexplm4)
## 
## Call:
## lm(formula = LifeExp ~ InfantSurvival + Under5Survival + TBFree + 
##     PropMD + PropRN + PersExp, data = lifexp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8016  -1.1772   0.4368   1.8917   6.8701 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.070e+03  1.607e+02  -6.659 6.62e-10 ***
## InfantSurvival  1.099e+02  4.866e+01   2.258   0.0256 *  
## Under5Survival  5.467e+01  2.862e+01   1.910   0.0582 .  
## TBFree          9.808e+02  1.663e+02   5.898 2.88e-08 ***
## PropMD          6.583e+02  2.599e+02   2.533   0.0125 *  
## PropRN         -3.184e+02  1.302e+02  -2.445   0.0158 *  
## PersExp         1.555e-03  2.639e-04   5.891 2.98e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.152 on 133 degrees of freedom
## Multiple R-squared:  0.9227, Adjusted R-squared:  0.9192 
## F-statistic: 264.4 on 6 and 133 DF,  p-value: < 2.2e-16
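
To see why dropping the two predictors barely moves R-squared yet slightly raises adjusted R-squared, the sketch below recomputes the adjusted values from the two summaries. The helper adj_r2 is an illustrative name, and the inputs are the rounded figures printed above, so the results match only to about four decimals.

```r
# Adjusted R-squared penalises R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 140  # observations: 132 residual df + 8 estimated coefficients

adj_full    <- adj_r2(0.9229, n, p = 7)  # lifexplm3: 7 estimated slopes
adj_reduced <- adj_r2(0.9227, n, p = 6)  # lifexplm4: 6 estimated slopes

round(c(full = adj_full, reduced = adj_reduced), 4)
```

The reduced model loses almost no explained variance while using one fewer estimated slope, which is why its adjusted R-squared comes out marginally higher.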

Residual analysis.

plot(fitted(lifexplm4),resid(lifexplm4))

qqnorm(resid(lifexplm4), col = "blue")
qqline(resid(lifexplm4), col = "red")

Question 5. Forecast LifeExp when PropMD = .03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
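
The interaction model written out in the question (b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp) and the forecast requested here could be set up along the following lines. This is only a sketch: it fits the model to the six rows printed by head(lifexp), so the coefficients and the forecast are not meaningful results, and the names six_rows and fit_interaction are made up for the illustration. The real analysis would pass the full lifexp data frame instead.

```r
# Six rows printed by head(lifexp); the full analysis would use lifexp itself
six_rows <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  PropMD  = c(0.000228841, 0.001143127, 0.001060478,
              0.003297297, 0.000070400, 0.000142857),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# PropMD * TotExp expands to PropMD + TotExp + PropMD:TotExp,
# i.e. b0 + b1*PropMD + b2*TotExp + b3*PropMD*TotExp
fit_interaction <- function(df) lm(LifeExp ~ PropMD * TotExp, data = df)

fit <- fit_interaction(six_rows)   # illustrative only: 6 rows, 4 coefficients

# Forecast at the values asked for in the question
new_point <- data.frame(PropMD = 0.03, TotExp = 14)
forecast  <- predict(fit, newdata = new_point)

# Flag extrapolation: is the new point inside the range of the fitted data?
in_range <- c(
  PropMD = new_point$PropMD >= min(six_rows$PropMD) &&
           new_point$PropMD <= max(six_rows$PropMD),
  TotExp = new_point$TotExp >= min(six_rows$TotExp) &&
           new_point$TotExp <= max(six_rows$TotExp)
)
in_range
```

A forecast is easier to trust when both entries of in_range are TRUE for the data the model was fitted on. Here the check runs against six rows only, so it says nothing definite about the full data set.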