The attached who.csv dataset contains real-world data from 2008. The variables are as follows.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures

  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
# Step 1: Load who.csv.

whodata = read.csv("who.csv")

# Summarize every column of whodata.

summary(whodata)
##                 Country       LifeExp      InfantSurvival  
##  Afghanistan        :  1   Min.   :40.00   Min.   :0.8350  
##  Albania            :  1   1st Qu.:61.25   1st Qu.:0.9433  
##  Algeria            :  1   Median :70.00   Median :0.9785  
##  Andorra            :  1   Mean   :67.38   Mean   :0.9624  
##  Angola             :  1   3rd Qu.:75.00   3rd Qu.:0.9910  
##  Antigua and Barbuda:  1   Max.   :83.00   Max.   :0.9980  
##  (Other)            :184                                   
##  Under5Survival       TBFree           PropMD              PropRN         
##  Min.   :0.7310   Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883  
##  1st Qu.:0.9253   1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455  
##  Median :0.9745   Median :0.9992   Median :0.0010474   Median :0.0027584  
##  Mean   :0.9459   Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336  
##  3rd Qu.:0.9900   3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164  
##  Max.   :0.9970   Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387  
##                                                                           
##     PersExp           GovtExp             TotExp      
##  Min.   :   3.00   Min.   :    10.0   Min.   :    13  
##  1st Qu.:  36.25   1st Qu.:   559.5   1st Qu.:   584  
##  Median : 199.50   Median :  5385.0   Median :  5541  
##  Mean   : 742.00   Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 515.25   3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :6350.00   Max.   :476420.0   Max.   :482750  
## 
# Scatterplot of LifeExp (y-axis) against TotExp (x-axis)
plot(whodata$TotExp,whodata$LifeExp)

# We will do a linear regression.

q1linreg = lm(whodata$LifeExp ~ whodata$TotExp)

summary(q1linreg)
## 
## Call:
## lm(formula = whodata$LifeExp ~ whodata$TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.475e+01  7.535e-01  85.933  < 2e-16 ***
## whodata$TotExp 6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
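# The F-statistic is 65.26 on 1 and 188 DF with a p-value of 7.714e-14, far below 0.05,
# so TotExp is a statistically significant predictor of LifeExp.
# R^2 is low: the model explains only about 25.8% of the variance in LifeExp.
# The residual standard error is 9.371 years, which is large given that the middle half of LifeExp spans 61.25 to 75.
# Next, look at the residuals.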

plot(fitted(q1linreg), resid(q1linreg))
abline(h=0)

mean(q1linreg$residuals)
## [1] -1.902718e-16
# The mean of the residuals is effectively zero.

hist(q1linreg$residuals)

qqnorm(q1linreg$residuals)
qqline(q1linreg$residuals)

# The residuals do not look normal: they do not follow the straight line in the Q-Q plot.
# So the assumptions of simple linear regression are not met.
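As a consistency check on the statistics quoted from the summary output, the sketch below recomputes them from one another using only the rounded numbers printed above. In a simple regression the F-statistic equals the square of the slope's t value, and R^2 can be recovered from F and the residual degrees of freedom. The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(q1linreg) output (rounded as printed)
t_slope <- 8.079    # t value of the TotExp slope
f_stat  <- 65.26    # F-statistic
df_res  <- 188      # residual degrees of freedom

# In simple regression, F equals the squared t value of the slope
f_from_t <- t_slope^2

# With one predictor, R^2 = F / (F + residual df)
r2_from_f <- f_stat / (f_stat + df_res)

# Upper-tail p-value of the F-statistic
p_val <- pf(f_stat, df1 = 1, df2 = df_res, lower.tail = FALSE)

print(c(f_from_t = f_from_t, r2_from_f = r2_from_f, p_val = p_val))
```

The recomputed F and R^2 should agree with the printed 65.26 and 0.2577 up to rounding, which confirms that the single p-value reported for the model and for the slope are the same test.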
  2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
whodata2 = whodata

whodata2$LifeExp = (whodata2$LifeExp)^4.6
whodata2$TotExp = (whodata2$TotExp)^.06


# Scatterplot of LifeExp^4.6 (y-axis) against TotExp^.06 (x-axis)
plot(whodata2$TotExp,whodata2$LifeExp)

# Now the plot looks more linear.


# We will do a linear regression.

q2linreg = lm(whodata2$LifeExp ~ whodata2$TotExp)

summary(q2linreg)
## 
## Call:
## lm(formula = whodata2$LifeExp ~ whodata2$TotExp)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -736527910   46817945  -15.73   <2e-16 ***
## whodata2$TotExp  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
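# The F-statistic is 507.7 on 1 and 188 DF with a p-value below 2.2e-16, far below 0.05.
# R^2 is high: the model explains about 73% of the variance in LifeExp^4.6.
# The residual standard error is 90490000, but it is in units of LifeExp^4.6,
# so it cannot be compared directly with the 9.371 years from model 1.
# Next, look at the residuals.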

plot(fitted(q2linreg), resid(q2linreg))
abline(h=0)

mean(q2linreg$residuals)
## [1] 2.268873e-09
# The mean of the residuals is effectively zero.

hist(q2linreg$residuals)

qqnorm(q2linreg$residuals)
qqline(q2linreg$residuals)

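# The residuals look approximately normal, though they drift away from the line a little at the end of the Q-Q plot.
# Model 2 is better than model 1: its R^2 is far higher (0.7298 against 0.2577) and its residuals are closer to normal.
# For model 2 the assumptions of linear regression are reasonably met.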
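The question describes TotExp^.06 as nearly a log transform. A minimal sketch of why: for a small exponent lambda, (x^lambda - 1) / lambda approaches log(x), so x^.06 is close to a linear function of log(x). The grid below spans the observed TotExp range from the summary (13 to 482750); it is an illustrative grid, not the WHO data, and the variable names are assumptions.

```r
# Log-spaced grid over the observed TotExp range (min 13, max 482750)
tot_exp <- exp(seq(log(13), log(482750), length.out = 100))

power_tr <- tot_exp^0.06   # transform used in the model
log_tr   <- log(tot_exp)   # natural log, for comparison

# Box-Cox form: (x^lambda - 1) / lambda approaches log(x) as lambda -> 0
boxcox_tr <- (tot_exp^0.06 - 1) / 0.06

# A correlation close to 1 means the two transforms are nearly linearly related
r_power_log <- cor(power_tr, log_tr)
print(r_power_log)

# Largest gap between the Box-Cox form and the log over the grid
print(max(boxcox_tr - log_tr))
```

Because the two transforms are almost linearly related over this range, a regression on TotExp^.06 behaves much like a regression on log(TotExp).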
  3. Using the results from the transformed model above, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.
# LifeExp^4.6 = -736527910 + 620060216 * (TotExp^.06)
# Substitute TotExp^.06 = 1.5 into the formula above.
B0 = -736527910
B1 = 620060216
LifeExp = round(B0 + B1*(1.5))
LifeExp
## [1] 193562414
# To get the actual value, invert the transformation by raising to the power 1/4.6.
LifeExp = (LifeExp)^(1/4.6)
LifeExp
## [1] 63.31153
# For 2.5

LifeExp = round(B0 + B1*(2.5))
LifeExp
## [1] 813622630
# To get the actual value, invert the transformation by raising to the power 1/4.6.
LifeExp = (LifeExp)^(1/4.6)
LifeExp
## [1] 86.50645
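The forecast arithmetic can be wrapped in a small helper so that both values come from one expression. This is a sketch using the rounded coefficients printed in the summary of the transformed model; `forecast_life_exp` is a hypothetical name, and the intercept is added with its own negative sign.

```r
# Rounded coefficients copied from the summary(q2linreg) output
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on TotExp^.06

# Hypothetical helper: predict LifeExp^4.6, then invert the 4.6 power
forecast_life_exp <- function(tot_exp_06) {
  life_exp_46 <- b0 + b1 * tot_exp_06
  life_exp_46^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
print(forecasts)
```

With these coefficients the helper gives roughly 63.3 years at TotExp^.06 = 1.5 and roughly 86.5 years at TotExp^.06 = 2.5.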
  4. Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
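# We have to build the multiple linear regression.

# There is 1 dependent variable and 2 independent variables.
# First do some basic analysis, such as the correlation between the dependent and independent variables.
# We also check for multicollinearity between the independent variables.


# kdepairs() is not in base R; it is assumed here to come from the ResourceSelection package.
library(ResourceSelection)

whodata3 = whodata
subwho3 = subset(whodata3, select = c(2,6,10))  # LifeExp, PropMD, TotExp
kdepairs(subwho3)

# The relationships between LifeExp and the two predictors do not look linear, and the correlations are modest
# (about 0.51 between LifeExp and TotExp, the square root of the R^2 of 0.2577 from model 1).


subwho3 = subset(whodata3, select = c(6,10))  # PropMD, TotExp
kdepairs(subwho3)

# PropMD and TotExp show no strong linear relationship. Their correlation is low,
# so multicollinearity between them does not look like a concern.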

multreg = lm(whodata3$LifeExp ~ whodata3$PropMD + whodata3$TotExp + whodata3$PropMD * whodata3$TotExp)

summary(multreg)
## 
## Call:
## lm(formula = whodata3$LifeExp ~ whodata3$PropMD + whodata3$TotExp + 
##     whodata3$PropMD * whodata3$TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.277e+01  7.956e-01  78.899  < 2e-16 ***
## whodata3$PropMD                  1.497e+03  2.788e+02   5.371 2.32e-07 ***
## whodata3$TotExp                  7.233e-05  8.982e-06   8.053 9.39e-14 ***
## whodata3$PropMD:whodata3$TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
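# The p-values for PropMD, TotExp, and their interaction are all far below 0.05, so all three terms are statistically significant.
# The F-statistic is 34.49 on 3 and 186 DF with a p-value below 2.2e-16.
# R^2 is low at about 36%, and the residual standard error is 8.765 years.
# Next, look at the residuals.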

plot(fitted(multreg), resid(multreg))
abline(h=0)

mean(multreg$residuals)
## [1] -7.890435e-16
# The mean of the residuals is effectively zero.

hist(multreg$residuals)

qqnorm(multreg$residuals)
qqline(multreg$residuals)

# The residuals are not normally distributed; the histogram is left-skewed.
# The assumptions of multiple linear regression are not met, so the model is not a good one despite its significant terms.
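Because the model contains an interaction, the effect of each predictor depends on the value of the other. The sketch below, using the rounded coefficients printed in the summary, computes the slope of LifeExp with respect to each predictor and the point at which that slope changes sign. The function and variable names are hypothetical.

```r
# Coefficients copied from the summary(multreg) output (rounded as printed)
b_md  <- 1.497e+03    # PropMD
b_exp <- 7.233e-05    # TotExp
b_int <- -6.026e-03   # PropMD:TotExp interaction

# Slope of LifeExp with respect to TotExp at a given PropMD
slope_tot_exp <- function(prop_md) b_exp + b_int * prop_md

# Slope of LifeExp with respect to PropMD at a given TotExp
slope_prop_md <- function(tot_exp) b_md + b_int * tot_exp

# Predictor values at which each slope changes sign
prop_md_flip <- -b_exp / b_int
tot_exp_flip <- -b_md / b_int

print(c(prop_md_flip = prop_md_flip, tot_exp_flip = tot_exp_flip))
```

Both sign changes fall inside the observed ranges from the data summary (PropMD up to 0.0351, TotExp up to 482750), so the fitted model implies that more spending or more doctors lowers life expectancy for some countries. That is a further reason to treat the model with caution.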
  5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
LifeExp2 = round(6.277e+01 + (1.497e+03 * .03) + (7.233e-05 * 14) + (-6.026e-03 * .03 * 14))
LifeExp2
## [1] 108
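# A life expectancy of 108 years is not realistic: the highest value in the data is 83, and even Japan is far below 108.
# The inputs are also an extreme combination: PropMD = .03 is close to the maximum in the data (0.0351),
# while TotExp = 14 is close to the minimum (13), so the forecast is an extrapolation.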
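The realism check can be made explicit by comparing the forecast and its inputs with the observed ranges. This is a sketch using the rounded coefficients from the model summary and values copied from the data summary; `predict_life_exp` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper using rounded coefficients from summary(multreg)
predict_life_exp <- function(prop_md, tot_exp) {
  6.277e+01 + 1.497e+03 * prop_md + 7.233e-05 * tot_exp -
    6.026e-03 * prop_md * tot_exp
}

pred <- predict_life_exp(prop_md = 0.03, tot_exp = 14)

# Observed values copied from summary(whodata)
max_life_exp <- 83          # largest LifeExp in the data
prop_md_q3   <- 0.0024584   # third quartile of PropMD

print(pred)

# TRUE if the forecast exceeds every life expectancy in the data
print(pred > max_life_exp)

# How many times larger the requested PropMD is than the third quartile
print(0.03 / prop_md_q3)
```

The requested PropMD is more than ten times the third quartile, so the forecast sits in a region with almost no supporting data.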