The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures
# Step 1: Let us load the who.csv
whodata = read.csv("who.csv")
# Let us do a small summary on the whodata.
summary(whodata)
## Country LifeExp InfantSurvival
## Afghanistan : 1 Min. :40.00 Min. :0.8350
## Albania : 1 1st Qu.:61.25 1st Qu.:0.9433
## Algeria : 1 Median :70.00 Median :0.9785
## Andorra : 1 Mean :67.38 Mean :0.9624
## Angola : 1 3rd Qu.:75.00 3rd Qu.:0.9910
## Antigua and Barbuda: 1 Max. :83.00 Max. :0.9980
## (Other) :184
## Under5Survival TBFree PropMD PropRN
## Min. :0.7310 Min. :0.9870 Min. :0.0000196 Min. :0.0000883
## 1st Qu.:0.9253 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455
## Median :0.9745 Median :0.9992 Median :0.0010474 Median :0.0027584
## Mean :0.9459 Mean :0.9980 Mean :0.0017954 Mean :0.0041336
## 3rd Qu.:0.9900 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164
## Max. :0.9970 Max. :1.0000 Max. :0.0351290 Max. :0.0708387
##
## PersExp GovtExp TotExp
## Min. : 3.00 Min. : 10.0 Min. : 13
## 1st Qu.: 36.25 1st Qu.: 559.5 1st Qu.: 584
## Median : 199.50 Median : 5385.0 Median : 5541
## Mean : 742.00 Mean : 40953.5 Mean : 41696
## 3rd Qu.: 515.25 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :6350.00 Max. :476420.0 Max. :482750
##
# Scatterplot of LifeExp (y-axis) against TotExp (x-axis).
plot(whodata$TotExp,whodata$LifeExp)
# We will do a linear regression.
q1linreg = lm(whodata$LifeExp ~ whodata$TotExp)
summary(q1linreg)
##
## Call:
## lm(formula = whodata$LifeExp ~ whodata$TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## whodata$TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
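# The p-value (7.71e-14) is far below 0.05, so TotExp is a statistically significant predictor.
# R^2 is low, however: the model explains only about 26% of the variance in LifeExp (R^2 = 0.2577).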
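# A minimal sketch of where R^2 comes from: R^2 = 1 - SSE/SST. The data below are synthetic
# (an assumption for illustration, not the WHO values), so only the identity matters, not the number.

```r
# Synthetic data: hypothetical values, NOT taken from who.csv
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50, sd = 4)

fit <- lm(y ~ x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
sse <- sum(resid(fit)^2)
sst <- sum((y - mean(y))^2)
r2_manual <- 1 - sse / sst

r2_manual                  # computed by hand
summary(fit)$r.squared     # value reported by summary(); the two should agree
```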
# We will look at the residuals.
plot(fitted(q1linreg), resid(q1linreg))
abline(h=0)
mean(q1linreg$residuals)
## [1] -1.902718e-16
# The mean of the residuals is essentially zero.
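# A short sketch of why a near-zero residual mean says little about fit quality: least squares with
# an intercept forces the residuals to average zero, even for a poorly specified model.
# The data are synthetic and deliberately nonlinear (an assumption for illustration only).

```r
# Synthetic, deliberately nonlinear data: hypothetical values, NOT taken from who.csv
set.seed(2)
x <- runif(100, 1, 100)
y <- exp(0.05 * x) + rnorm(100)

# A straight line is the wrong shape for these data
bad_fit <- lm(y ~ x)

# The residual mean is still zero up to floating-point error
res_mean <- mean(resid(bad_fit))
res_mean
```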
hist(q1linreg$residuals)
qqnorm(q1linreg$residuals)
qqline(q1linreg$residuals)
# The residuals do not look normal: in the Q-Q plot they deviate from the straight line.
# So the assumptions of linear regression are not met.
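# The visual Q-Q check can be complemented by a formal test. The sketch below uses the Shapiro-Wilk
# test (shapiro.test from base R's stats package), which is not part of the analysis above and is
# offered only as an option. The data are synthetic with skewed noise (an assumption for illustration).

```r
# Synthetic data with skewed (exponential) noise: hypothetical, NOT taken from who.csv
set.seed(3)
x <- runif(80, 0, 10)
y <- 5 + 1.5 * x + rexp(80, rate = 0.5)

fit <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(resid(fit))
sw$statistic   # W statistic, closer to 1 means closer to normal
sw$p.value     # a small p-value is evidence against normality
```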
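# Transform both variables: raise LifeExp to the power 4.6 and TotExp to the power 0.06.
whodata2 = whodata
whodata2$LifeExp = (whodata2$LifeExp)^4.6
whodata2$TotExp = (whodata2$TotExp)^.06
# Scatterplot of the transformed LifeExp (y-axis) against the transformed TotExp (x-axis).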
plot(whodata2$TotExp,whodata2$LifeExp)
# Now the plot looks more linear.
# We will do a linear regression.
q2linreg = lm(whodata2$LifeExp ~ whodata2$TotExp)
summary(q2linreg)
##
## Call:
## lm(formula = whodata2$LifeExp ~ whodata2$TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## whodata2$TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
# The p-value (< 2.2e-16) is far below 0.05.
# R^2 is much higher than in model 1, at about 0.73.
# We will look at the residuals.
plot(fitted(q2linreg), resid(q2linreg))
abline(h=0)
mean(q2linreg$residuals)
## [1] 2.268873e-09
# The mean of the residuals is essentially zero relative to their scale (about 1e8).
hist(q2linreg$residuals)
qqnorm(q2linreg$residuals)
qqline(q2linreg$residuals)
# The residuals look approximately normal, though they drift away from the line at the tails.
# Model 2 fits better than model 1.
# For model 2, the assumptions of linear regression are reasonably met.
# LifeExpectancy^4.6 = -736527910 + 620060216 * TotExp^0.06
# Forecast for a transformed TotExp of 1.5: substitute 1.5 into the formula above.
B0 = -736527910
B1 = 620060216
LifeExp = round(B0 + B1*(1.5))
LifeExp
## [1] 193562414
# To get back to years, raise to the power 1/4.6 (the inverse of the 4.6 transform).
LifeExp = (LifeExp)^(1/4.6)
LifeExp
## [1] 63.31153
# Forecast for a transformed TotExp of 2.5
LifeExp = round(B0 + B1*(2.5))
LifeExp
## [1] 813622630
# Again, raise to the power 1/4.6 to get back to years.
LifeExp = (LifeExp)^(1/4.6)
LifeExp
## [1] 86.50645
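# The model 2 equation, LifeExp^4.6 = -736527910 + 620060216 * TotExp^0.06, can be wrapped in a small
# helper that applies the intercept and slope and then undoes the 4.6 power. This is a sketch: the
# function name predict_lifeexp is hypothetical, and the coefficients are copied from the summary above.

```r
# Coefficients copied from the model 2 summary output
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: takes the already-transformed predictor (TotExp^0.06)
# and returns life expectancy in years by inverting the 4.6 power
predict_lifeexp <- function(totexp_t) {
  (b0 + b1 * totexp_t)^(1 / 4.6)
}

# Vectorised over both forecast points: roughly 63.3 and 86.5 years
pred <- predict_lifeexp(c(1.5, 2.5))
pred
```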
# Next we build a multiple linear regression:
# one dependent variable (LifeExp) and two independent variables (PropMD and TotExp).
# First, some basic analysis: the correlation between the dependent and independent variables.
# We also check for multicollinearity between the independent variables.
# kdepairs() comes from the ResourceSelection package.
library(ResourceSelection)
whodata3 = whodata
# Columns 2, 6 and 10 of the data are LifeExp, PropMD and TotExp.
subwho3 = subset(whodata3, select = c(LifeExp, PropMD, TotExp))
kdepairs(subwho3)
# There is no clear linear relationship between the dependent and independent variables. The correlations are low.
subwho3 = subset(whodata3, select = c(PropMD, TotExp))
kdepairs(subwho3)
# There is no clear linear relationship between the independent variables. The correlation is low, so multicollinearity is not a concern.
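# A numeric companion to the pairs plot: the correlation between two predictors and the variance
# inflation factor, which for exactly two predictors is VIF = 1 / (1 - r^2). The data are synthetic
# (an assumption for illustration), and x1 and x2 are hypothetical stand-ins for PropMD and TotExp.

```r
# Synthetic predictors with mild correlation: hypothetical, NOT taken from who.csv
set.seed(5)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)

# Pearson correlation between the two predictors
r <- cor(x1, x2)

# With two predictors, the VIF of each one is 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

r
vif   # values near 1 suggest little multicollinearity
```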
multreg = lm(whodata3$LifeExp ~ whodata3$PropMD + whodata3$TotExp + whodata3$PropMD * whodata3$TotExp)
summary(multreg)
##
## Call:
## lm(formula = whodata3$LifeExp ~ whodata3$PropMD + whodata3$TotExp +
## whodata3$PropMD * whodata3$TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## whodata3$PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## whodata3$TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## whodata3$PropMD:whodata3$TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
# All three terms (PropMD, TotExp, and their interaction) have p-values far below 0.05, so each is statistically significant.
# R^2 is low, at about 0.36.
# Look at the residuals.
plot(fitted(multreg), resid(multreg))
abline(h=0)
mean(multreg$residuals)
## [1] -7.890435e-16
# The mean of the residuals is essentially zero.
hist(multreg$residuals)
qqnorm(multreg$residuals)
qqline(multreg$residuals)
# The residuals are not normally distributed. The histogram is left-skewed.
# So the assumptions of multiple linear regression are not met.
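# Forecast LifeExp for PropMD = 0.03 and TotExp = 14. The interaction term is PropMD * TotExp.
LifeExp2 = round(6.277e+01 + (1.497e+03 * .03) + (7.233e-05 * 14) + (-6.026e-03 * .03 * 14))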
LifeExp2
## [1] 108
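# A life expectancy of 108 years is unrealistic, even for Japan; the highest value in this dataset is 83.
# The forecast pairs a PropMD of 0.03 (near the observed maximum of 0.035) with a TotExp of 14
# (near the observed minimum of 13), so it extrapolates to a combination the data barely covers.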
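# A sketch of the same forecast as a reusable function, written so the interaction term is always
# PropMD * TotExp. The function name forecast_lifeexp is hypothetical, and the coefficients are the
# rounded estimates printed in the summary above.

```r
# Coefficients copied (as printed) from the multiple regression summary
coefs <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
           TotExp = 7.233e-05, interaction = -6.026e-03)

# Hypothetical helper: linear predictor including the PropMD:TotExp interaction
forecast_lifeexp <- function(prop_md, tot_exp) {
  unname(coefs["intercept"] +
         coefs["PropMD"] * prop_md +
         coefs["TotExp"] * tot_exp +
         coefs["interaction"] * prop_md * tot_exp)
}

# PropMD = 0.03, TotExp = 14: about 107.7, which rounds to 108
f <- forecast_lifeexp(0.03, 14)
f
```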