—————————————————————————

Student Name : Sachid Deshmukh

—————————————————————————

Multiple Linear Regression

Load Life Expectancy Data

life.exp.df = read.csv("https://raw.githubusercontent.com/mlforsachid/MSDSQ2Data605/master/Week12/HW-12/who.csv")
head(life.exp.df)
##               Country LifeExp InfantSurvival Under5Survival  TBFree
## 1         Afghanistan      42          0.835          0.743 0.99769
## 2             Albania      71          0.985          0.983 0.99974
## 3             Algeria      71          0.967          0.962 0.99944
## 4             Andorra      82          0.997          0.996 0.99983
## 5              Angola      41          0.846          0.740 0.99656
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991
##        PropMD      PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294      20      92    112
## 2 0.001143127 0.004614439     169    3128   3297
## 3 0.001060478 0.002091362     108    5184   5292
## 4 0.003297297 0.003500000    2589  169725 172314
## 5 0.000070400 0.001146162      36    1620   1656
## 6 0.000142857 0.002773810     503   12543  13046
str(life.exp.df)
## 'data.frame':    190 obs. of  10 variables:
##  $ Country       : Factor w/ 190 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ LifeExp       : int  42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num  0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num  0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num  0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num  2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
##  $ PropRN        : num  0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : int  20 169 108 2589 36 503 484 88 3181 3788 ...
##  $ GovtExp       : int  92 3128 5184 169725 1620 12543 19170 1856 187616 189354 ...
##  $ TotExp        : int  112 3297 5292 172314 1656 13046 19654 1944 190797 193142 ...

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

# TotExp on the x-axis and LifeExp on the y-axis, matching LifeExp~TotExp
plot(life.exp.df$TotExp, life.exp.df$LifeExp)

fit = lm(LifeExp~TotExp, data=life.exp.df)
summary(fit)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = life.exp.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

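From the model summary above we can see that the residuals are unevenly distributed: the median is 3.154 rather than close to 0, and the minimum (-24.764) is much farther from 0 than the maximum (13.292). This skew suggests that the assumption of normally distributed residuals is not met. R-squared is 0.2577, so TotExp explains only about 26% of the variance in life expectancy, indicating a poor fit. The residual standard error is 9.371 on 188 degrees of freedom. The coefficient standard errors are small relative to the estimates (for TotExp, 7.795e-06 against an estimate of 6.297e-05), so there is not much variability in the coefficient estimates. The F-statistic is 65.26 with a p-value of 7.714e-14, and the p-values for both the intercept and TotExp are significant, indicating that TotExp helps in predicting life expectancy.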
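As a quick consistency check on the summary output: in a regression with a single predictor, the F-statistic equals the square of the slope's t value, and R-squared can be recovered from F and the residual degrees of freedom. The sketch below only re-uses numbers printed in the summary above; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary() output of LifeExp ~ TotExp
t.slope = 8.079   # t value for TotExp
f.stat  = 65.26   # F-statistic
df.res  = 188     # residual degrees of freedom

# With a single predictor, F = t^2 (up to rounding of the printed values)
f.from.t = t.slope^2

# With a single predictor, R^2 = F / (F + residual df)
r.sq = f.stat / (f.stat + df.res)

# Upper-tail p-value of the F-statistic on 1 and 188 DF
p.val = pf(f.stat, 1, df.res, lower.tail = FALSE)

print(c(f.from.t, r.sq, p.val))
```

The three values should agree with the printed F-statistic, Multiple R-squared and p-value up to rounding.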

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

life.exp.df$LifeExp = life.exp.df$LifeExp^4.6
life.exp.df$TotExp = life.exp.df$TotExp^.06
# Transformed TotExp on the x-axis, transformed LifeExp on the y-axis
plot(life.exp.df$TotExp, life.exp.df$LifeExp)

fit = lm(LifeExp~TotExp, data=life.exp.df)
summary(fit)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = life.exp.df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

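After transforming the variables we can see that the residuals are more evenly distributed around zero than before. Both coefficient p-values are below 2e-16, and the overall p-value is below 2.2e-16. R-squared increased from 0.2577 to 0.7298 and the F-statistic increased from 65.26 to 507.7. The residual standard error (90490000) is on the scale of LifeExp^4.6, so it cannot be compared directly with the 9.371 of the first model. The model with transformed variables is clearly better than the earlier model.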
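The transformation can also be written inside the model formula with `I()`, which leaves the original columns untouched. The following is a minimal sketch, assuming only the first six rows of the data as printed by `head()` above; `demo.df`, `fit.inline` and `fit.copy` are illustrative names, and coefficients from six rows will not match the full 190-row fit.

```r
# First six rows of the data, copied from the head() output.
# This subset is for illustration only.
demo.df = data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# Transform inside the formula; I() makes ^ mean arithmetic power
fit.inline = lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = demo.df)

# Equivalent fit on explicitly transformed copies of the columns
trans.df = data.frame(y = demo.df$LifeExp^4.6, x = demo.df$TotExp^0.06)
fit.copy = lm(y ~ x, data = trans.df)

# Both approaches give the same coefficients,
# and demo.df still holds the raw values
print(unname(coef(fit.inline)))
print(unname(coef(fit.copy)))
```

Because the raw columns are preserved, the data does not have to be read again before fitting a later model on the untransformed variables.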

3. Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5

TotExp = c(1.5, 2.5)
test.df = data.frame(TotExp)
life.exp.pred = predict.lm(fit, newdata =test.df)
life.exp.pred = life.exp.pred ^ (1/4.6)
print(life.exp.pred)
##        1        2 
## 63.31153 86.50645
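The same forecasts can be reproduced by hand from the printed coefficients, which makes the back-transformation explicit. This is a sketch using the coefficients as they appear in the summary output above; the variable names are illustrative.

```r
# Coefficients copied from the transformed-model summary
b0 = -736527910   # intercept
b1 = 620060216    # slope for TotExp^0.06

x.new = c(1.5, 2.5)               # values of TotExp^0.06
y.trans = b0 + b1 * x.new         # predicted LifeExp^4.6
life.exp.hand = y.trans^(1/4.6)   # undo the 4.6 power to get LifeExp

print(life.exp.hand)
```

The results should agree with the `predict.lm` output above up to rounding of the printed coefficients.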

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

life.exp.df = read.csv("https://raw.githubusercontent.com/mlforsachid/MSDSQ2Data605/master/Week12/HW-12/who.csv")
life.exp.df$IntTerm = life.exp.df$PropMD * life.exp.df$TotExp
fit = lm(LifeExp~PropMD+TotExp+IntTerm, data=life.exp.df)
summary(fit)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + IntTerm, data = life.exp.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD       1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp       7.233e-05  8.982e-06   8.053 9.39e-14 ***
## IntTerm     -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

From the above model summary we can conclude the following

1. Residuals are not evenly distributed (median 2.098, minimum -27.320, maximum 13.074)

2. P values of all the regression coefficients are significant. This indicates that all the predictor variables are helping in predicting life expectancy

3. Note the interaction term PropMD x TotExp added to the model. Its p-value (6.35e-05) is significant, indicating it is helpful in predicting life expectancy

4. The F-statistic (34.49 on 3 and 186 DF) is statistically significant. This indicates that at least one predictor is useful in predicting the outcome variable (life expectancy)

5. Multiple R-squared and adjusted R-squared are 0.3574 and 0.3471 respectively, and the residual standard error is 8.765, indicating a poor fit of the model

Based on R-squared, which indicates a lack of fit for this model, we can conclude that the simple linear regression model with transformed variables (R-squared 0.7298) is much better than the multiple linear regression model
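Adjusted R-squared penalizes R-squared for the number of predictors, which matters when comparing a one-predictor model with a three-predictor model. The sketch below recomputes it for the three models from the printed Multiple R-squared values; `adj.r.sq` is a hypothetical helper written for this illustration, not a base R function.

```r
# Adjusted R^2 from R^2, sample size n and number of predictors k
adj.r.sq = function(r.sq, n, k) {
  1 - (1 - r.sq) * (n - 1) / (n - k - 1)
}

n = 190  # number of observations, from the str() output

adj.simple      = adj.r.sq(0.2577, n, 1)  # LifeExp ~ TotExp
adj.transformed = adj.r.sq(0.7298, n, 1)  # LifeExp^4.6 ~ TotExp^0.06
adj.multiple    = adj.r.sq(0.3574, n, 3)  # PropMD, TotExp and interaction

print(c(adj.simple, adj.transformed, adj.multiple))
```

The values should match the Adjusted R-squared lines in the three summaries up to rounding of the inputs. Note that the transformed model has a different response variable, so its R-squared is not on exactly the same footing as the other two.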

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

PropMD = c(.03)
TotExp = c(14)
IntTerm = PropMD*TotExp
test.df = data.frame(PropMD, TotExp, IntTerm)
life.exp.pred = predict.lm(fit, newdata =test.df)
print(life.exp.pred)
##       1 
## 107.696

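This forecast doesn’t seem realistic. A life expectancy of about 108 is far above the values in the data (at most 82 in the rows printed above), and it is very unlikely for any country to have a life expectancy beyond 100. The inputs are also far outside the data: PropMD = 0.03 is roughly ten times the largest PropMD in the rows printed above (about 0.0033), so the model is extrapolating well beyond what it was fitted on.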
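To see where the forecast comes from, it can be split into the contribution of each term. This is a sketch using the rounded coefficients printed in the multiple-regression summary above, so the total differs slightly from the 107.696 returned by `predict.lm`; the names `contrib`, `b.md`, `b.exp` and `b.int` are illustrative.

```r
# Coefficients copied from the multiple-regression summary (rounded)
b0    = 6.277e+01    # intercept
b.md  = 1.497e+03    # PropMD
b.exp = 7.233e-05    # TotExp
b.int = -6.026e-03   # PropMD x TotExp interaction

PropMD = 0.03
TotExp = 14

# Contribution of each term to the forecast
contrib = c(
  intercept   = b0,
  PropMD      = b.md * PropMD,
  TotExp      = b.exp * TotExp,
  interaction = b.int * PropMD * TotExp
)

print(contrib)
print(sum(contrib))
```

Almost all of the increase over the intercept comes from the PropMD term, since the TotExp and interaction terms are close to zero at TotExp = 14.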