The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Provide a scatterplot of LifeExp~TotExp, and run a simple linear regression. Do not transform the variables. Provide and interpret the F statistic, \(R^2\), standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
Solution
Load the CSV data into R. The data consists of 10 columns and 190 observations.
who <- read.csv("/Users/subhalaxmirout/DATA 605/who.csv")
dim(who)
## [1] 190 10
The table below shows the entire data set, followed by summary statistics for each variable.
library(DT)
DT::datatable(who)
summary(who)
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN PersExp
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883 Min. : 3.00
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455 1st Qu.: 36.25
## Median :0.9992 Median :0.0010474 Median :0.0027584 Median : 199.50
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336 Mean : 742.00
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164 3rd Qu.: 515.25
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387 Max. :6350.00
## GovtExp TotExp
## Min. : 10.0 Min. : 13
## 1st Qu.: 559.5 1st Qu.: 584
## Median : 5385.0 Median : 5541
## Mean : 40953.5 Mean : 41696
## 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :476420.0 Max. :482750
Check for any missing values (NA) in the data.
who[!complete.cases(who),]
## [1] Country LifeExp InfantSurvival Under5Survival TBFree
## [6] PropMD PropRN PersExp GovtExp TotExp
## <0 rows> (or 0-length row.names)
No NA values are present in the data set.
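As a complementary check, a per-column count of missing values can be easier to read than an empty data frame. The sketch below uses a small made-up data frame (`toy`, which is hypothetical and not part of who.csv) so that it runs on its own; the same two calls should apply to `who`.

```r
# Hypothetical toy data, standing in for `who` so the snippet is self-contained
toy <- data.frame(
  Country = c("A", "B", "C"),
  LifeExp = c(70, NA, 65),
  TotExp  = c(1200, 340, NA)
)

# TRUE if any cell in the data frame is NA
has_na <- anyNA(toy)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

print(has_na)
print(na_per_column)
```

For a data set with no missing values, `anyNA()` returns `FALSE` and every column count is zero.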
library(ggplot2)
library(scales)
ggplot(who, aes(TotExp, LifeExp)) +
geom_point(col ="blue") +
ylab("Avg Life Expectancy (years)") +
xlab("Total Expenditures (Personal and Government)") +
theme( axis.line = element_line(colour = "darkblue",
size = 1, linetype = "solid")) +
scale_x_continuous(labels = dollar)
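The plot above should be read alongside the summary statistics: TotExp is strongly right-skewed (median 5,541 against a mean of 41,696 and a maximum of 482,750), so most countries are bunched at the low-expenditure end of the axis, which already suggests that a straight line will fit poorly.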
lm = lm(LifeExp ~ TotExp,data = who)
summary(lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
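From the linear model we get: an F statistic of 65.26 on 1 and 188 degrees of freedom with a p-value of 7.714e-14, so TotExp is a statistically significant predictor of LifeExp; an \(R^2\) of 0.2577, meaning the model explains only about 26% of the variation in life expectancy; and a residual standard error of 9.371 years, which is large relative to the observed range of 40 to 83 years.
We can say that, although the relationship is statistically significant, this model is not a good fit.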
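For a regression with a single predictor, the overall F statistic is the square of the slope's t statistic, and \(R^2\) can be recovered from F and the residual degrees of freedom. The sketch below checks both identities against the rounded values printed in the model summary, so small rounding differences are expected; the variable names are only illustrative.

```r
# Values copied (rounded) from the summary(lm) output
t_slope  <- 8.079   # t value for TotExp
f_stat   <- 65.26   # F-statistic
df_resid <- 188     # residual degrees of freedom

# With a single predictor, F equals t^2 (up to rounding)
f_from_t <- t_slope^2

# R^2 = F / (F + residual df) when there is one predictor
r2_from_f <- f_stat / (f_stat + df_resid)

# Upper-tail probability of the F statistic on 1 and 188 df
p_from_f <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(c(f_from_t = f_from_t, r2_from_f = r2_from_f, p_from_f = p_from_f))
```

This is why the p-value reported for the F statistic matches the p-value of the TotExp coefficient in a simple linear regression.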
residual <- resid(lm)
plot(who$TotExp, residual, xlab="Total Expenditures (Personal and Government)",
ylab="Residuals",
main="Residual Plot" )
abline(h = 0, col = "blue", lwd=2, lty=2)
abline(h = 10, col = "dark red", lwd=2, lty=2)
abline(h = -10, col = "dark red", lwd=2, lty=2)
hist(residual, col = "sky blue")
qqnorm(residual)
qqline(residual)
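From the residual analysis, we found that the assumptions of simple linear regression do not appear to be met: the residuals are not centered on zero and are left-skewed (median 3.154, minimum -24.764, maximum 13.292 in the model summary), so the normality assumption is doubtful.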
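The visual checks can be backed up with a few numbers. The helper below, `check_residuals()`, is a hypothetical function written for this illustration, and it is demonstrated on R's built-in `cars` data as a stand-in because who.csv is not bundled with the snippet. It is a rough sketch of common diagnostics, not a full test of the regression assumptions.

```r
# Summarise residual diagnostics for a fitted lm object (illustrative helper)
check_residuals <- function(model) {
  r <- resid(model)
  list(
    # Sample skewness: near 0 for symmetric residuals
    skewness = mean((r - mean(r))^3) / sd(r)^3,
    # Shapiro-Wilk test: a small p-value suggests non-normal residuals
    shapiro_p = shapiro.test(r)$p.value,
    # Rank correlation between |residuals| and fitted values:
    # values far from 0 hint at non-constant variance
    spread_trend = cor(abs(r), fitted(model), method = "spearman")
  )
}

# Stand-in example on the built-in `cars` data (not the WHO data)
demo_fit  <- lm(dist ~ speed, data = cars)
demo_diag <- check_residuals(demo_fit)
print(demo_diag)
```

Applied to the model fitted in this section, the same call would be `check_residuals(lm)`.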
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, R^2, standard error, and p-values. Which model is “better?”
Solution
library(dplyr)
who <- who %>%
dplyr::mutate(TotExp_new = TotExp^.06,
LifeExp_new = LifeExp^4.6)
ggplot(who, aes(TotExp_new, LifeExp_new)) +
geom_point(col = "blue") +
ylab("Avg Life Expectancy (years) to the power 4.6") +
xlab("Avg Personal and Government Expenditures (US dollars) to the power 0.06") +
theme( axis.line = element_line(colour = "darkblue",
size = 1, linetype = "solid")) +
scale_x_continuous(labels = dollar) +
scale_y_continuous(labels = comma)
lm_transform <- lm(LifeExp_new ~ TotExp_new,data = who)
summary(lm_transform)
##
## Call:
## lm(formula = LifeExp_new ~ TotExp_new, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_new 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
residual_new <- resid(lm_transform)
plot(who$TotExp_new, residual_new, xlab="Total Expenditures (Personal and Government)",
ylab="Residuals",
main="Residual Plot" )
abline(h = 0, col = "blue", lwd=2, lty=2)
hist(residual_new, col = "sky blue")
qqnorm(residual_new)
qqline(residual_new)
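The transformed model shows: an F statistic of 507.7 on 1 and 188 degrees of freedom with a p-value below 2.2e-16; an \(R^2\) of 0.7298, so the model explains about 73% of the variation in LifeExp^4.6; and a residual standard error of 90,490,000, which is on the scale of LifeExp^4.6 and so cannot be compared directly with the 9.371 of the first model.
Of the two models, the transformed model is better than the first: its \(R^2\) is much higher (0.7298 against 0.2577) and its F statistic is larger (507.7 against 65.26).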
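Because the two models have responses on different scales (years versus years^4.6), their \(R^2\) values and standard errors are not strictly comparable. One option is to back-transform the fitted values of the transformed model and compare prediction error in years. The sketch below does this on simulated data, since who.csv is not bundled here; the simulated relationship, the seed, and the `rmse()` helper are all assumptions made for illustration, and the numbers it prints say nothing about the WHO data.

```r
set.seed(605)  # arbitrary seed so the simulation is reproducible

# Simulated stand-in for who.csv: life expectancy rises with log spending
n <- 190
sim <- data.frame(TotExp = exp(runif(n, log(13), log(482750))))
sim$LifeExp <- 45 + 3 * log(sim$TotExp) + rnorm(n, sd = 4)

# The same two model forms as in this section
fit_raw <- lm(LifeExp ~ TotExp, data = sim)
fit_tr  <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = sim)

# Back-transform the second model's fitted values to years;
# pmax() guards against a negative fitted value before taking the root
pred_raw <- fitted(fit_raw)
pred_tr  <- pmax(fitted(fit_tr), 0)^(1 / 4.6)

# Root mean squared error, in the units of the observed values
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

rmse_years <- c(
  untransformed = rmse(sim$LifeExp, pred_raw),
  transformed   = rmse(sim$LifeExp, pred_tr)
)
print(rmse_years)
```

With the real data, the equivalent comparison would use `fitted(lm)` and `fitted(lm_transform)^(1 / 4.6)` against `who$LifeExp`.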
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
Solution
Using the regression equation of the transformed model: \[LifeExp^{4.6} = -736527910 + 620060216 \times TotExp^{0.06}\]
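# total_exp_t is the already-transformed predictor, i.e. TotExp^0.06
forecast_life_expectancy <- function(total_exp_t)
{
  # Predict LifeExp^4.6, then take the 4.6th root to get years
  result <- (-736527910 + 620060216 * total_exp_t)^(1 / 4.6)
  return(result)
}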
forecast_life_expectancy(1.5)
## [1] 63.31153
forecast_life_expectancy(2.5)
## [1] 86.50645
The forecast life expectancy is about 63 years when TotExp^.06 = 1.5 and about 87 years when TotExp^.06 = 2.5.
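It is also worth checking whether the two inputs fall inside the range of the data. The sketch below repeats the forecast from the printed coefficients and compares the requested values of TotExp^.06 with the observed range, using the minimum (13) and maximum (482750) of TotExp from the data summary. The variable names are illustrative only.

```r
# Coefficients copied from the transformed model's summary output
b0 <- -736527910
b1 <- 620060216

# Forecast on the transformed scale, then undo the ^4.6 transformation
x_new <- c(1.5, 2.5)
life_exp_forecast <- (b0 + b1 * x_new)^(1 / 4.6)

# Total expenditure implied by each transformed value: TotExp = x^(1 / 0.06)
tot_exp_implied <- x_new^(1 / 0.06)

# Range of TotExp^0.06 actually observed (TotExp min 13, max 482750)
observed_range <- c(13, 482750)^0.06

print(round(life_exp_forecast, 2))
print(tot_exp_implied)
print(observed_range)
```

The largest observed value of TotExp^.06 is about 2.19, so the forecast at 2.5 is an extrapolation beyond the data and should be treated with more caution than the forecast at 1.5.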
Build the following multiple regression model and interpret the F statistic, R^2, standard error, and p-values. How good is the model? \[LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\]
Solution
lm3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
summary(lm3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
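From model 3 above, we found: an F statistic of 34.49 on 3 and 186 degrees of freedom with a p-value below 2.2e-16, so the model as a whole is significant; all three terms (PropMD, TotExp, and their interaction) have p-values below 0.001; the \(R^2\) is 0.3574 (adjusted 0.3471), so the model explains about 36% of the variation in life expectancy; and the residual standard error is 8.765 years.
This model is only a modest improvement on the first model (\(R^2\) of 0.2577) and is not as good as the transformed model (\(R^2\) of 0.7298).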
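A note on the formula: in R, `a * b` already expands to `a + b + a:b`, so listing the main effects alongside the `*` term is harmless but redundant. The sketch below inspects the expanded terms of both spellings without fitting anything; `f_verbose` and `f_short` are names made up for this illustration.

```r
# The formula used for model 3, with main effects plus the `*` term
f_verbose <- LifeExp ~ PropMD + TotExp + PropMD * TotExp

# Equivalent shorthand: `a * b` expands to a + b + a:b
f_short <- LifeExp ~ PropMD * TotExp

# terms() expands a formula and drops duplicated terms
labels_verbose <- attr(terms(f_verbose), "term.labels")
labels_short   <- attr(terms(f_short), "term.labels")

print(labels_verbose)
print(identical(labels_verbose, labels_short))
```

Both spellings yield the same three terms, which is why the coefficient table has a single `PropMD:TotExp` row.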
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
Solution
Coefficients of model 3:
summary(lm3)['coefficients']
## $coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277270e+01 7.956052e-01 78.899309 6.207187e-145
## PropMD 1.497494e+03 2.788169e+02 5.370887 2.320603e-07
## TotExp 7.233324e-05 8.981926e-06 8.053199 9.386290e-14
## PropMD:TotExp -6.025686e-03 1.472357e-03 -4.092543 6.352733e-05
forecast_life_expectancy_2 <- function(PropMD, total_exp)
{
# The interaction coefficient is negative (-6.025686e-03), so it is subtracted
result <- 62.77270 + 1497.494 * PropMD + 0.00007233 * total_exp - 0.006025686 * PropMD * total_exp
return(result)
}
forecast_life_expectancy_2(0.03, 14)
## [1] 107.696
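The predicted value of about 107.7 years is not realistic. It is well above the highest life expectancy in the data (83 years), and the inputs are an unusual combination: PropMD = 0.03 is close to the maximum observed (0.0351), while TotExp = 14 is close to the minimum (13). The model is extrapolating, and the large PropMD coefficient alone adds roughly 45 years to the forecast.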
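To see where the forecast comes from, it can be split into the contribution of each term. The sketch below uses the coefficients from the model 3 coefficient table, including the negative sign on the interaction; the vector names are illustrative and not part of the original analysis.

```r
# Coefficients copied from the model 3 coefficient table
coefs <- c(
  intercept   = 62.77270,
  PropMD      = 1497.494,
  TotExp      = 7.233324e-05,
  interaction = -6.025686e-03   # note the negative sign
)

# Predictor values for the forecast; the leading 1 multiplies the intercept
inputs <- c(intercept = 1, PropMD = 0.03, TotExp = 14, interaction = 0.03 * 14)

# Contribution of each term to the forecast, in years
contribution <- coefs * inputs
forecast <- sum(contribution)

print(round(contribution, 4))
print(round(forecast, 2))
```

The TotExp and interaction terms contribute almost nothing at TotExp = 14, so the forecast is essentially the intercept plus the PropMD term.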