Assignment-12

1

library(tidyverse)  # provides read_csv() (readr) and ggplot() (ggplot2)

who_orig <- read_csv("/Users/bchand005c/CUNY/DATA-605/assignment/week-12/who.csv")
who <- who_orig
head(who)

## # A tibble: 6 x 10
##   Country LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN
##   <chr>     <int>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>
## 1 Afghan…      42          0.835          0.743  0.998 2.29e-4 5.72e-4
## 2 Albania      71          0.985          0.983  1.000 1.14e-3 4.61e-3
## 3 Algeria      71          0.967          0.962  0.999 1.06e-3 2.09e-3
## 4 Andorra      82          0.997          0.996  1.000 3.30e-3 3.50e-3
## 5 Angola       41          0.846          0.74   0.997 7.04e-5 1.15e-3
## 6 Antigu…      73          0.99           0.989  1.000 1.43e-4 2.77e-3
## # ... with 3 more variables: PersExp <int>, GovtExp <int>, TotExp <int>

who.lm <- lm(LifeExp ~ TotExp, who)
ggplot(data = who, aes(x = TotExp, y = LifeExp)) + 
        geom_point(color='blue') +
        geom_smooth(method = "lm", se = FALSE)

summary(who.lm)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

# Residuals vs. fitted values
plot(fitted(who.lm), resid(who.lm))
abline(h = 0)

# Residuals Q-Q plot
qqnorm(who.lm$residuals)
qqline(who.lm$residuals)

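The residual standard error of 9.371 years is modest relative to a typical life expectancy (the intercept is about 64.75 years), which on its own is a reasonable sign.
The F-statistic adds little here: with a single predictor it is the square of the slope's t value (8.079^2 is about 65.27) and carries the same p-value.
R^2 is low: the model explains only 25.77% of the variability in life expectancy (25.37% adjusted).
The p-value of 7.71e-14 is close to 0, so the relationship between TotExp and LifeExp is very unlikely to be due to random variation alone.
The residual and Q-Q plots indicate that the residuals are not normally distributed around 0. This tells us that this is not a good model.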
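
The remark about the F-statistic can be checked numerically. The sketch below uses R's built-in cars data set as a stand-in (an assumption, since who.csv does not ship with R) to show that, for a single-predictor model, the F-statistic equals the squared t value of the slope. It then repeats the arithmetic with the rounded values printed in the summary above.

```r
# Stand-in example: 'cars' ships with base R; it is not the WHO data.
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

t_slope <- coef(s)["speed", "t value"]   # t value of the slope
f_stat  <- s$fstatistic[["value"]]       # overall F-statistic

# For one predictor the two agree up to floating-point error.
same_test <- isTRUE(all.equal(t_slope^2, f_stat))
print(same_test)

# Same check with the rounded values reported for who.lm.
t_reported <- 8.079
f_reported <- 65.26
print(t_reported^2)   # close to f_reported; the gap comes from rounding
```

Because the two statistics test the same hypothesis in this setting, the F-statistic only becomes separately informative once the model has more than one term.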

2

who <- read_csv("/Users/bchand005c/CUNY/DATA-605/assignment/week-12/who.csv")

lifeexp_4.6 <- who$LifeExp^4.6
totexp_0.06 <- who$TotExp^0.06

who2_lm <- lm(lifeexp_4.6 ~ totexp_0.06)

summary(who2_lm)

## 
## Call:
## lm(formula = lifeexp_4.6 ~ totexp_0.06)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## totexp_0.06  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

plot(lifeexp_4.6 ~  totexp_0.06)
abline(who2_lm)

plot(fitted(who2_lm),resid(who2_lm))
abline(h=0)

qqnorm(who2_lm$residuals)
qqline(who2_lm$residuals)

The residual standard error of 90490000 looks high, but it is measured in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 from the first model.
As before, the F-statistic adds little because the model has a single predictor.
R^2 is good: the model explains 72.98% of the variability in the transformed response (72.83% adjusted).
The p-value is below 2.2e-16, so the relationship is very unlikely to be due to random variation alone.
The residuals are roughly normally distributed around 0, with some minor deviations. This tells us that this is a relatively good model.
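
The residual standard error is expressed in the units of the response (here LifeExp^4.6), whereas R^2 is unit-free. The sketch below illustrates this on R's built-in cars data set, used as a stand-in because who.csv does not ship with R: multiplying the response by a constant multiplies the residual standard error by the same constant and leaves R^2 unchanged. The factor of 1000 is arbitrary.

```r
# Stand-in example: 'cars' ships with base R; it is not the WHO data.
fit_raw    <- lm(dist ~ speed, data = cars)
fit_scaled <- lm(I(dist * 1000) ~ speed, data = cars)  # same response, new units

rse_ratio <- sigma(fit_scaled) / sigma(fit_raw)   # ratio of residual standard errors
r2_raw    <- summary(fit_raw)$r.squared
r2_scaled <- summary(fit_scaled)$r.squared

print(rse_ratio)             # the RSE grows by the rescaling factor
print(c(r2_raw, r2_scaled))  # R^2 does not depend on the units
```

For comparing the two WHO models, R^2 and the residual plots are therefore the more meaningful yardsticks than the raw residual standard error.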

3

test.dat <- data.frame(totexp_0.06=c(1.5, 2.5))
predict(who2_lm, test.dat) ^ (1/4.6)

##        1        2 
## 63.31153 86.50645
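
The forecast above can be reproduced by hand from the coefficients reported for who2_lm: evaluate the fitted line at each value of TotExp^0.06, then undo the power transform on the response. The sketch below uses the coefficients as printed in the summary, so it may differ from predict() in the last digits.

```r
# Coefficients as printed in summary(who2_lm) (rounded by the printout).
b0 <- -736527910
b1 <-  620060216

totexp_0.06 <- c(1.5, 2.5)

lifeexp_4.6_hat <- b0 + b1 * totexp_0.06   # prediction on the LifeExp^4.6 scale
lifeexp_hat <- lifeexp_4.6_hat^(1 / 4.6)   # back-transform to years
print(lifeexp_hat)
```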

4

who <- read_csv("/Users/bchand005c/CUNY/DATA-605/assignment/week-12/who.csv")

multiple_regression <- lm(LifeExp ~ PropMD + TotExp + TotExp * PropMD, data=who)
summary(multiple_regression)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + TotExp * PropMD, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

plot(fitted(multiple_regression),resid(multiple_regression))
abline(h=0)

qqnorm(multiple_regression$residuals)
qqline(multiple_regression$residuals)

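The residual standard error of 8.765 years is slightly lower than the 9.371 of the first model, which is a good sign.
The F-statistic is useful here because the model has three terms; 34.49 on 3 and 186 degrees of freedom is fairly high.
R^2 is still low: the model explains only 35.74% of the variability (34.71% adjusted).
The overall p-value is below 2.2e-16 and every coefficient is individually significant, so the relationship is very unlikely to be due to random variation alone.
The residuals are not normally distributed around 0. This tells us that this is not a good model.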
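
A note on the model formula: in R, a * b expands to a + b + a:b, so PropMD + TotExp + TotExp * PropMD lists the main effects twice and R drops the duplicates, which is why the summary shows a single PropMD:TotExp term. The sketch below checks the equivalence on R's built-in mtcars data set, used as a stand-in because who.csv does not ship with R; the variables mpg, wt and hp belong to that data set.

```r
# Stand-in example: 'mtcars' ships with base R; it is not the WHO data.
fit_long  <- lm(mpg ~ wt + hp + hp * wt, data = mtcars)  # redundant spelling
fit_short <- lm(mpg ~ wt * hp, data = mtcars)            # compact spelling

print(names(coef(fit_long)))
print(names(coef(fit_short)))

# Both spellings describe the same model, so the fitted values agree.
same_fit <- isTRUE(all.equal(fitted(fit_long), fitted(fit_short)))
print(same_fit)
```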

5

test.dat <- data.frame(PropMD=0.03, TotExp=14)
predict(multiple_regression, test.dat)

##       1 
## 107.696

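A predicted life expectancy of 107.696 years is unrealistic for a human population. PropMD = 0.03 is also roughly nine times the largest value in the rows shown at the top (3.30e-3), so the forecast is probably an extrapolation beyond most of the data. So this is a bad model.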
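
To see where the forecast comes from, it can be broken down term by term. The sketch below uses the coefficients as printed in summary(multiple_regression); they are rounded to four significant digits, so the total differs slightly from the predict() output.

```r
# Coefficients as printed in summary(multiple_regression) (rounded).
b_intercept   <- 6.277e+01
b_propmd      <- 1.497e+03
b_totexp      <- 7.233e-05
b_interaction <- -6.026e-03

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the predicted life expectancy, in years.
contributions <- c(
  intercept   = b_intercept,
  PropMD      = b_propmd * prop_md,
  TotExp      = b_totexp * tot_exp,
  interaction = b_interaction * prop_md * tot_exp
)
print(round(contributions, 4))

total <- sum(contributions)
print(total)
```

With these rounded coefficients, the PropMD term alone adds about 44.9 years to the intercept of 62.77, while the TotExp and interaction terms are negligible at TotExp = 14.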

Binish Kurian Chandy

11/16/2018
