#required packages
library(dplyr)
#read data
data <- read.csv(
  file = "https://raw.githubusercontent.com/olga0503/DATA-621/master/who.csv",
  stringsAsFactors = TRUE, header = TRUE)
#display first six records
head(data)
## Country LifeExp InfantSurvival Under5Survival TBFree
## 1 Afghanistan 42 0.835 0.743 0.99769
## 2 Albania 71 0.985 0.983 0.99974
## 3 Algeria 71 0.967 0.962 0.99944
## 4 Andorra 82 0.997 0.996 0.99983
## 5 Angola 41 0.846 0.740 0.99656
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991
## PropMD PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294 20 92 112
## 2 0.001143127 0.004614439 169 3128 3297
## 3 0.001060478 0.002091362 108 5184 5292
## 4 0.003297297 0.003500000 2589 169725 172314
## 5 0.000070400 0.001146162 36 1620 1656
## 6 0.000142857 0.002773810 503 12543 13046
Simple linear regression is described by the following equation:
\(LifeExp = b_0 + b_1 \cdot TotExp\)
#build linear model
linear_model <- lm(LifeExp ~ TotExp, data = data)
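As a rough illustration of what `lm` estimates, the least-squares intercept and slope can also be computed by hand. The sketch below uses only the six records printed by `head(data)` above, so its numbers differ from the full-data fit; the names `who_head`, `b0_hand`, `b1_hand` and `fit_head` are made up for this example.

```r
# Six records copied from the head(data) output; NOT the full data set
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

x <- who_head$TotExp
y <- who_head$LifeExp

# Closed-form least-squares estimates for simple linear regression
b1_hand <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hand <- mean(y) - b1_hand * mean(x)

# The same estimates obtained from lm()
fit_head <- lm(LifeExp ~ TotExp, data = who_head)

# TRUE if the hand-computed and lm() coefficients agree
all.equal(unname(coef(fit_head)), c(b0_hand, b1_hand))
```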
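Linear regression should satisfy several assumptions, including a linear relationship between the response and the explanatory variable. The scatter plot and correlation below check linearity: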
plot(LifeExp ~ TotExp, data = data)
abline(linear_model, col="red")
cor(data$LifeExp,data$TotExp)
## [1] 0.5076339
The graph and the correlation coefficient of about 0.51 show a moderate positive linear relationship between the response variable ‘LifeExp’ and the explanatory variable ‘TotExp’.
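In simple linear regression, the squared correlation equals the share of variance explained by the model. A minimal check, using the correlation value printed above (the variable names `r` and `r_squared` are illustrative):

```r
# Correlation reported by cor(data$LifeExp, data$TotExp)
r <- 0.5076339

# Squared correlation = R-squared of the one-predictor model
r_squared <- r^2
round(r_squared, 4)
```

The rounded value agrees with the Multiple R-squared of 0.2577 that `summary(linear_model)` reports for this model.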
#create histogram
par(mfrow=c(1,2))
hist(data$TotExp, probability=TRUE, col="gray", border="white", main="Distribution of Total Expenses")
d <- density(data$TotExp)
lines(d, col="red")
#normal probability plot
qqnorm(data$TotExp)
qqline(data$TotExp)
The distribution of the variable ‘TotExp’ is strongly skewed to the right, so a transformation may improve the linear fit. One option is to replace the variable with its logarithm.
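A minimal sketch of the log transformation, assuming only the six records shown by `head(data)` (far too few for real conclusions). The column name `LogTotExp` and the objects `who_head`, `cor_raw` and `cor_log` are hypothetical; on the full data the same step would be `data %>% mutate(LogTotExp = log(TotExp))`.

```r
# Six records copied from the head(data) output; NOT the full data set
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# Natural log compresses the long right tail of TotExp
who_head$LogTotExp <- log(who_head$TotExp)

# Compare the linear association before and after the transformation
cor_raw <- cor(who_head$LifeExp, who_head$TotExp)
cor_log <- cor(who_head$LifeExp, who_head$LogTotExp)
c(raw = cor_raw, log = cor_log)
```

On these six records the correlation with the logged variable is higher than with the raw one, which is the pattern a right-skewed predictor often produces.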
summary(linear_model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The summary shows that:
The F-statistic of 65.26 (p-value 7.714e-14) indicates that the model as a whole is statistically significant: its p-value is below the 5% significance level, so the model fits better than an intercept-only model.
The variable ‘TotExp’ is statistically significant, as its p-value (7.71e-14) is below the 5% significance level. Thus, the fitted regression line is:
\(LifeExp = 64.75 + 6.297 \times 10^{-5} \cdot TotExp\)
The intercept of 64.75 is the predicted ‘LifeExp’ when ‘TotExp’ equals 0. The slope of 6.297e-05 indicates that a one-unit increase in ‘TotExp’ is associated with an average increase of 6.297e-05 years in ‘LifeExp’.
R-squared of 0.2577 indicates that 25.77% of the variability in ‘LifeExp’ is explained by the model.
Adjusted R-squared of 0.2537, which penalizes R-squared for the number of predictors, is nearly identical because the model has a single predictor.
The residual standard error (RSE) of 9.371 on 188 degrees of freedom is the typical size of a residual: observed life expectancy deviates from the fitted line by about 9.4 years on average.
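The quantities reported by `summary()` can be reproduced from the residuals. The sketch below does this on the six `head(data)` records only, so the values differ from those above; `who_head`, `fit_head` and the `*_hand` names are assumptions made for illustration.

```r
# Six records copied from the head(data) output; NOT the full data set
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)
fit_head <- lm(LifeExp ~ TotExp, data = who_head)

rss <- sum(residuals(fit_head)^2)                        # residual sum of squares
tss <- sum((who_head$LifeExp - mean(who_head$LifeExp))^2) # total sum of squares
n <- nrow(who_head)
p <- 1                                                   # number of predictors

rse_hand    <- sqrt(rss / (n - p - 1))                   # residual standard error
r2_hand     <- 1 - rss / tss                             # R-squared
adj_r2_hand <- 1 - (1 - r2_hand) * (n - 1) / (n - p - 1) # adjusted R-squared

# Compare with the values stored in the summary object
s <- summary(fit_head)
c(rse = s$sigma, r2 = s$r.squared, adj_r2 = s$adj.r.squared)
c(rse = rse_hand, r2 = r2_hand, adj_r2 = adj_r2_hand)
```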
data_modified <- data %>% mutate(LifeExp=LifeExp^4.6, TotExp= TotExp^0.06)
#build linear model
linear_model <- lm(LifeExp ~ TotExp, data = data_modified)
summary(linear_model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data_modified)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
The F-statistic of 507.7 (p-value < 2.2e-16) shows that the model as a whole is statistically significant at the 5% level.
The transformed variable ‘TotExp^0.06’ is statistically significant, as its p-value is below the 5% significance level. Thus, the fitted regression line is:
\(LifeExp^{4.6} = -736527910 + 620060216 \cdot TotExp^{0.06}\)
The intercept of -736527910 is the predicted ‘LifeExp^4.6’ when ‘TotExp^0.06’ equals 0, which is unrealistic because it is negative. The slope of 620060216 indicates that a one-unit increase in ‘TotExp^0.06’ increases the predicted ‘LifeExp^4.6’ by 620060216.
Compared with the previous model, R-squared rises from 0.2577 to 0.7298 (72.98% of the variability in the transformed response is explained by the model) and adjusted R-squared rises from 0.2537 to 0.7283. The RSE of 90490000 is much larger than the previous 9.371, but it is measured in units of ‘LifeExp^4.6’ rather than years, so the two RSE values are not directly comparable.
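RSE is expressed in the units of the response, so its size changes whenever the response is rescaled, while R-squared does not. A small sketch of this, assuming the six `head(data)` records and a hypothetical column `LifeExpMonths` (life expectancy in months):

```r
# Six records copied from the head(data) output; NOT the full data set
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)
who_head$LifeExpMonths <- who_head$LifeExp * 12  # same response, different unit

fit_years  <- lm(LifeExp ~ TotExp, data = who_head)
fit_months <- lm(LifeExpMonths ~ TotExp, data = who_head)

rse_years  <- summary(fit_years)$sigma
rse_months <- summary(fit_months)$sigma
r2_years   <- summary(fit_years)$r.squared
r2_months  <- summary(fit_months)$r.squared

# RSE scales with the response unit; R-squared is unchanged
c(rse_ratio = rse_months / rse_years, r2_years = r2_years, r2_months = r2_months)
```

The power transformation ‘LifeExp^4.6’ changes the scale far more drastically than a unit change, which is why R-squared is the safer basis for comparing the two models here.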
TotExp_0.06 <- 1.5
LifeExp_4.6 = -736527910 + 620060216*TotExp_0.06
LifeExp_4.6
## [1] 193562414
TotExp_0.06 <- 2.5
LifeExp_4.6_2 = -736527910 + 620060216*TotExp_0.06
LifeExp_4.6_2
## [1] 813622630
LifeExp_4.6_2-LifeExp_4.6
## [1] 620060216
The difference between the predicted ‘LifeExp^4.6’ at ‘TotExp^0.06’ = 1.5 and at ‘TotExp^0.06’ = 2.5 is 620060216, which equals the slope. This confirms that a one-unit increase in ‘TotExp^0.06’ increases the predicted ‘LifeExp^4.6’ by 620060216.
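The values computed above are on the ‘LifeExp^4.6’ scale. To express a forecast in years, the result can be raised to the power 1/4.6. A minimal sketch using the rounded coefficients from the summary; the function name `forecast_life_exp` and its argument are hypothetical:

```r
# Coefficients copied from the transformed-model summary
intercept <- -736527910
slope     <- 620060216

# Forecast LifeExp^4.6, then undo the power transformation
forecast_life_exp <- function(tot_exp_006) {
  life_exp_46 <- intercept + slope * tot_exp_006
  life_exp_46^(1 / 4.6)
}

forecast_years <- forecast_life_exp(c(1.5, 2.5))
round(forecast_years, 2)
```

This gives roughly 63 years at ‘TotExp^0.06’ = 1.5 and roughly 87 years at ‘TotExp^0.06’ = 2.5. The back-transformation only works when the forecast on the transformed scale is positive.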
\(LifeExp = b_0 + b_1 \cdot PropMD + b_2 \cdot TotExp + b_3 \cdot PropMD \cdot TotExp\)
multiple_linear_model <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = data)
summary(multiple_linear_model)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The summary shows that:
The F-statistic of 34.49 (p-value < 2.2e-16) indicates that the predictors are jointly statistically significant at the 5% level.
All predictors, including the interaction term, are statistically significant, as their p-values are below the 5% significance level. Thus, the fitted regression equation is:
\(LifeExp = 62.77 + 1497 \cdot PropMD + 7.233 \times 10^{-5} \cdot TotExp - 6.026 \times 10^{-3} \cdot PropMD \cdot TotExp\)
The intercept of 62.77 is the predicted ‘LifeExp’ when all predictors equal 0. Because the model contains an interaction, the coefficient of 1.497e+03 is the change in ‘LifeExp’ per one-unit increase in ‘PropMD’ when ‘TotExp’ equals 0, and the coefficient of 7.233e-05 is the change in ‘LifeExp’ per one-unit increase in ‘TotExp’ when ‘PropMD’ equals 0. The interaction coefficient of -6.026e-03 indicates that a one-unit increase in ‘PropMD*TotExp’ decreases ‘LifeExp’ by 6.026e-03 while the other terms are held constant; equivalently, the effect of each predictor weakens as the other one grows.
R-squared of 0.3574 indicates that 35.74% of the variability in ‘LifeExp’ is explained by the model.
Adjusted R-squared of 0.3471, which penalizes R-squared for the number of predictors, indicates that 34.71% of the variability is explained after that adjustment.
The residual standard error (RSE) of 8.765 on 186 degrees of freedom is the typical size of a residual: observed life expectancy deviates from the fitted values by about 8.8 years on average.
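With an interaction term, the effect of ‘TotExp’ on ‘LifeExp’ is not a single number: it equals b2 + b3*PropMD. A short sketch using the rounded coefficients from the summary; the names `b_tot_exp`, `b_interaction`, `effect_of_tot_exp` and `prop_md_zero_effect` are illustrative:

```r
# Coefficients copied from the multiple-regression summary
b_tot_exp     <- 7.233e-05   # coefficient of TotExp
b_interaction <- -6.026e-03  # coefficient of PropMD:TotExp

# Change in LifeExp per one-unit increase in TotExp at a given PropMD
effect_of_tot_exp <- function(prop_md) {
  b_tot_exp + b_interaction * prop_md
}

effect_of_tot_exp(c(0, 0.001, 0.003))

# PropMD value at which the fitted effect of TotExp becomes zero
prop_md_zero_effect <- -b_tot_exp / b_interaction
prop_md_zero_effect
```

Under this fitted equation the effect of ‘TotExp’ shrinks as ‘PropMD’ grows and turns negative once ‘PropMD’ exceeds roughly 0.012.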
PropMd <- 0.03
TotExp <- 14
LifeExp = 6.277e+01 + 1.497e+03*PropMd + 7.233e-05*TotExp - 6.026e-03*PropMd*TotExp
round(LifeExp,0)
## [1] 108
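The result looks unrealistic, since very few people live to 108. The forecast is driven by ‘PropMD’ = 0.03, which is roughly ten times the largest ‘PropMD’ value among the first six records (about 0.0033), so the model is being extrapolated well beyond typical values.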
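Splitting the forecast into its four terms shows where the 108 years come from. This is a sketch with the rounded coefficients from the summary; the vector name `terms_years` is made up for the example.

```r
prop_md <- 0.03
tot_exp <- 14

# Contribution of each term of the fitted equation, in years
terms_years <- c(
  intercept   = 6.277e+01,
  prop_md     = 1.497e+03 * prop_md,
  tot_exp     = 7.233e-05 * tot_exp,
  interaction = -6.026e-03 * prop_md * tot_exp
)

round(terms_years, 3)
sum(terms_years)
```

The ‘PropMD’ term alone adds about 45 years to the intercept, while the ‘TotExp’ and interaction terms are negligible at these inputs.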