1
library(tidyverse) # read_csv() comes from readr and ggplot() from ggplot2
who_orig <- read_csv("/Users/bchand005c/CUNY/DATA-605/assignment/week-12/who.csv")
who <- who_orig
head(who)
## # A tibble: 6 x 10
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… 42 0.835 0.743 0.998 2.29e-4 5.72e-4
## 2 Albania 71 0.985 0.983 1.000 1.14e-3 4.61e-3
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3
## 4 Andorra 82 0.997 0.996 1.000 3.30e-3 3.50e-3
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3
## 6 Antigu… 73 0.99 0.989 1.000 1.43e-4 2.77e-3
## # ... with 3 more variables: PersExp <int>, GovtExp <int>, TotExp <int>
who.lm <- lm(LifeExp ~ TotExp, who)
ggplot(data = who, aes(x = TotExp, y = LifeExp)) +
geom_point(color='blue') +
geom_smooth(method = "lm", se = FALSE)

summary(who.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
# Residual analysis: residuals vs. fitted values
plot(fitted(who.lm),resid(who.lm))

# Residuals Q-Q plot
qqnorm(who.lm$residuals)
qqline(who.lm$residuals)

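The residual standard error of 9.371 (in years of LifeExp, on 188 degrees of freedom) is relatively low compared with the LifeExp values themselves and is a good indicator.
The F-statistic is not particularly useful here: with a single predictor, F (65.26) is just the square of the t value for TotExp (8.079^2), so it adds nothing beyond the t-test.
The R^2 value is too low: the model explains only 25.77% of the variability in LifeExp (adjusted R^2 is 25.37%).
Since the p-value (7.714e-14) is close to 0, the relationship between TotExp and LifeExp is unlikely to be due to random variation.
The residuals are not normally distributed around 0: they are skewed (median 3.154, minimum -24.764, maximum 13.292) rather than symmetric. The linear-model assumptions are not met, so this is not a good model.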
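The point about the F-statistic can be checked directly. The sketch below uses made-up data (not the WHO file; the names x, y and fit are illustrative only) and confirms that, in a one-predictor regression, the overall F-statistic equals the squared t value of the slope.

```r
# Synthetic data for illustration only; this is not the WHO dataset.
set.seed(605)
x <- runif(100, 0, 10)
y <- 3 + 0.5 * x + rnorm(100)

fit <- lm(y ~ x)
s <- summary(fit)

t_slope <- s$coefficients["x", "t value"]  # t statistic for the slope
f_stat  <- unname(s$fstatistic["value"])   # overall F statistic

# With one predictor the two tests coincide: F equals t squared.
all.equal(f_stat, t_slope^2)
```

The same check on the output above gives 8.079^2 = 65.27, which agrees with the reported F-statistic of 65.26 up to rounding.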
2
who <- read_csv("/Users/bchand005c/CUNY/DATA-605/assignment/week-12/who.csv")
lifeexp_4.6 <- who$LifeExp^4.6
totexp_0.06 <- who$TotExp^0.06
who2_lm <- lm(lifeexp_4.6 ~ totexp_0.06)
summary(who2_lm)
##
## Call:
## lm(formula = lifeexp_4.6 ~ totexp_0.06)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## totexp_0.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
plot(lifeexp_4.6 ~ totexp_0.06)
abline(who2_lm)

plot(fitted(who2_lm),resid(who2_lm))
abline(h=0)

qqnorm(who2_lm$residuals)
qqline(who2_lm$residuals)

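The residual standard error of 90490000 looks high, but it is measured in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 from question 1.
The F-statistic is not particularly useful here: with a single predictor, F (507.7) is just the square of the t value for totexp_0.06 (22.53^2, up to rounding).
The R^2 value is good: the model explains 72.98% of the variability in the transformed response.
Since the p-value (< 2.2e-16) is close to 0, the relationship is unlikely to be due to random variation.
The residuals are approximately normally distributed around 0, with some minor deviations. This tells us that this is a relatively good model.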
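One caveat when reading the residual standard error of a transformed model: it is expressed in the units of the response, while R^2 is unit-free. A minimal sketch on synthetic data (not the WHO file; all names here are illustrative) shows that rescaling the response multiplies the residual standard error by the same factor and leaves R^2 unchanged.

```r
# Synthetic data for illustration only; this is not the WHO dataset.
set.seed(12)
x <- runif(80, 1, 5)
y <- 10 + 4 * x + rnorm(80)
y_big <- 1e6 * y                 # same response, expressed in much smaller units

fit_small <- lm(y ~ x)
fit_big   <- lm(y_big ~ x)

# Residual standard error scales with the response ...
rse_small <- summary(fit_small)$sigma
rse_big   <- summary(fit_big)$sigma
rse_big / rse_small

# ... while R^2 does not change.
r2_small <- summary(fit_small)$r.squared
r2_big   <- summary(fit_big)$r.squared
c(r2_small, r2_big)
```

For this reason R^2 and the residual plots are the more useful basis for comparing the transformed model with the untransformed one.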
3
test.dat <- data.frame(totexp_0.06=c(1.5, 2.5))
predict(who2_lm, test.dat) ^ (1/4.6)
## 1 2
## 63.31153 86.50645
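As a cross-check, the same forecasts can be worked out by hand from the coefficients printed in the summary for question 2. This is only a sketch: b0 and b1 are copied from that printed output, and the variable names are illustrative.

```r
# Coefficients copied from the summary(who2_lm) output.
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on TotExp^0.06

x_new <- c(1.5, 2.5)             # values of TotExp^0.06
y_trans <- b0 + b1 * x_new       # predicted LifeExp^4.6
life_exp <- y_trans^(1 / 4.6)    # undo the power transform to get years
round(life_exp, 2)
```

The result should land very close to the predict() values above, since predict() does the same linear calculation before the back-transform.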
4
who <- read_csv("/Users/bchand005c/CUNY/DATA-605/assignment/week-12/who.csv")
multiple_regression <- lm(LifeExp ~ PropMD + TotExp + TotExp * PropMD, data=who)
summary(multiple_regression)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + TotExp * PropMD, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
plot(fitted(multiple_regression),resid(multiple_regression))
abline(h=0)

qqnorm(multiple_regression$residuals)
qqline(multiple_regression$residuals)

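The residual standard error of 8.765 is low (slightly below the 9.371 from question 1) and is a good indicator.
The F-statistic is useful here because the model has three terms: 34.49 on 3 and 186 DF is fairly high and indicates that the predictors are jointly significant.
The R^2 value is not good: the model explains only 35.74% of the variability.
Since the overall p-value (< 2.2e-16) and all coefficient p-values are close to 0, the relationship is unlikely to be due to random variation.
The residuals are not normally distributed around 0: they are skewed (median 2.098, minimum -27.320, maximum 13.074). This tells us that this is not a good model.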
5
test.dat <- data.frame(PropMD=0.03, TotExp=14)
predict(multiple_regression, test.dat)
## 1
## 107.696
A life expectancy of 107.696 years is unrealistic for humans, so this is a bad model.
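To see where the forecast comes from, it can be split into its four terms. The sketch below uses the rounded coefficients printed in the summary for question 4, so the total differs slightly from the predict() value; the variable names are illustrative.

```r
# Rounded coefficients copied from summary(multiple_regression).
b0    <- 62.77       # intercept
b_md  <- 1497        # PropMD
b_exp <- 7.233e-05   # TotExp
b_int <- -6.026e-03  # PropMD:TotExp interaction

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the predicted LifeExp.
parts <- c(intercept   = b0,
           PropMD      = b_md * prop_md,
           TotExp      = b_exp * tot_exp,
           interaction = b_int * prop_md * tot_exp)
parts
sum(parts)
```

Almost all of the excess over the intercept comes from the PropMD term (1497 * 0.03 is about 45 years), while the TotExp and interaction terms are negligible at TotExp = 14.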