#required packages
library(dplyr)
#read data
data <- read.csv(
  file = "https://raw.githubusercontent.com/olga0503/DATA-621/master/who.csv",
  stringsAsFactors = TRUE, header = TRUE)

#display first six records
head(data)
##               Country LifeExp InfantSurvival Under5Survival  TBFree
## 1         Afghanistan      42          0.835          0.743 0.99769
## 2             Albania      71          0.985          0.983 0.99974
## 3             Algeria      71          0.967          0.962 0.99944
## 4             Andorra      82          0.997          0.996 0.99983
## 5              Angola      41          0.846          0.740 0.99656
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991
##        PropMD      PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294      20      92    112
## 2 0.001143127 0.004614439     169    3128   3297
## 3 0.001060478 0.002091362     108    5184   5292
## 4 0.003297297 0.003500000    2589  169725 172314
## 5 0.000070400 0.001146162      36    1620   1656
## 6 0.000142857 0.002773810     503   12543  13046

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Simple linear regression is described by the following equation:

\(LifeExp = b0 + b1*TotExp\)

#build linear model
linear_model <- lm(LifeExp ~ TotExp, data = data)

Linear regression should satisfy the following assumptions:

  1. Linear relationship. Linear regression requires the relationship between the independent and dependent variables to be linear.
#scatterplot with the fitted regression line
plot(LifeExp ~ TotExp, data = data)
abline(linear_model, col="red")

cor(data$LifeExp,data$TotExp)
## [1] 0.5076339

The scatterplot and the correlation coefficient of about 0.51 indicate a moderate linear relationship between the response variable ‘LifeExp’ and the explanatory variable ‘TotExp’.

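  2. Normality. Linear regression analysis assumes that the residuals are approximately normally distributed. Strongly skewed variables often lead to non-normal residuals, so the distribution of ‘TotExp’ is examined first.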
#create histogram
par(mfrow=c(1,2))
hist(data$TotExp, probability=TRUE, col="gray", border="white", main="Distribution of Total Expenses")
d <- density(data$TotExp)
lines(d, col="red")
#normal probability plot 
qqnorm(data$TotExp)
qqline(data$TotExp) 

The distribution of the variable ‘TotExp’ is strongly skewed to the right, so the variable needs to be transformed. One option is to replace the variable with its logarithm.

  3. No autocorrelation. Linear regression analysis requires little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other. Here we assume that the residuals are independent of each other.
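The residual-based assumptions can also be checked numerically. The sketch below is only illustrative: it refits the model on the six records printed by `head(data)` (the full data set has 190 rows), so its numbers say little about the real fit. The names `who_head`, `fit`, `sw`, `lag1` and `spread` are introduced here and are not part of the original analysis.

```r
# First six records as printed by head(data); used only to keep the
# sketch self-contained (assumption: the full data frame would be used in practice)
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

fit <- lm(LifeExp ~ TotExp, data = who_head)
res <- resid(fit)

# Normality of the residuals (Shapiro-Wilk test from the stats package)
sw <- shapiro.test(res)

# Lag-1 correlation of the residuals in row order (rough autocorrelation check)
n_res <- length(res)
lag1 <- cor(res[-1], res[-n_res])

# Rough constant-variance check: do absolute residuals grow with fitted values?
spread <- cor(abs(res), fitted(fit))

sw$p.value
lag1
spread
```

On the full data, the same calls would be applied to `linear_model`; `plot(linear_model, which = 1:2)` draws the corresponding residuals-versus-fitted and normal Q-Q plots.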
summary(linear_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The summary statistics show that:

  1. The F-statistic of 65.26 on 1 and 188 degrees of freedom has a p-value of 7.714e-14, well below the 5% significance level, so the model fits significantly better than an intercept-only model.

  2. The variable ‘TotExp’ is statistically significant, as its p-value (7.71e-14) is less than the significance level of 5%. Thus, the linear regression is described by the following equation:

\(LifeExp = 6.475e+01 + 6.297e-05*TotExp\)

The intercept of 6.475e+01 indicates that the predicted ‘LifeExp’ equals 6.475e+01 when ‘TotExp’ equals 0. The slope of 6.297e-05 indicates that a one-unit increase in ‘TotExp’ is associated with an average increase of 6.297e-05 in ‘LifeExp’.

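  3. R-squared of 0.2577 indicates that 25.77% of the variability in the dependent variable is explained by the model.

  4. Adjusted R-squared of 0.2537, which penalizes R-squared for the number of predictors, indicates that only 25.37% of the variation is explained by the estimated regression line.

  5. The residual standard error (RSE) of 9.371 on 188 degrees of freedom estimates the standard deviation of the residuals, that is, the typical distance between an observed ‘LifeExp’ value and the fitted line.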
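The reported statistics are tied together by standard formulas, which gives a quick consistency check. The sketch below recomputes the F-statistic and adjusted R-squared from the printed R-squared, and the p-value from the printed F-statistic. The variable names are introduced here for illustration, and the sample size of 190 is inferred from the 188 residual degrees of freedom plus two estimated coefficients.

```r
# Values copied from the summary(linear_model) output
r2 <- 0.2577   # Multiple R-squared
n  <- 190      # observations (assumption: 188 residual df + 2 coefficients)
k  <- 1        # number of predictors

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Upper-tail probability of the printed F-statistic under F(k, n - k - 1)
p_value <- pf(65.26, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(f_stat, 2)
round(adj_r2, 4)
p_value
```

Because the inputs are rounded, the recomputed values should agree with the printed 65.26, 0.2537 and 7.714e-14 only up to rounding.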

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

data_modified <- data %>% mutate(LifeExp=LifeExp^4.6, TotExp= TotExp^0.06)

#build linear model
linear_model <- lm(LifeExp ~ TotExp, data = data_modified)
summary(linear_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data_modified)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The F-statistic of 507.7 on 1 and 188 degrees of freedom has a p-value below 2.2e-16, less than the 5% significance level, so the model fits significantly better than an intercept-only model.

The variable ‘TotExp’ is statistically significant as its p-value is less than the significance level of 5%. Thus, the linear regression is described by the following equation:

\(LifeExp^{4.6} = -736527910 + 620060216*TotExp^{0.06}\)

The intercept of -736527910 indicates that the predicted ‘LifeExp^4.6’ equals -736527910 when ‘TotExp^0.06’ equals 0, which is unrealistic since a power of life expectancy cannot be negative. The slope of 620060216 indicates that a one-unit increase in ‘TotExp^0.06’ is associated with an average increase of 620060216 in ‘LifeExp^4.6’.

Compared with the previous model, R-squared rises from 0.2577 to 0.7298 (72.98% of the variability in the dependent variable is explained by the model), adjusted R-squared rises from 0.2537 to 0.7283, and the F-statistic rises from 65.26 to 507.7. The RSE of 90490000 is much greater than the previous 9.371, but the two values are not directly comparable because the response is now measured on the scale of ‘LifeExp^4.6’ rather than ‘LifeExp’. On these measures the transformed model is the better one.

3. Using the results from 3, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06=2.5.

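#forecast on the transformed scale (LifeExp^4.6) at TotExp^0.06 = 1.5
TotExp_0.06 <- 1.5
LifeExp_4.6 <- -736527910 + 620060216*TotExp_0.06
LifeExp_4.6
## [1] 193562414
#forecast on the transformed scale at TotExp^0.06 = 2.5
TotExp_0.06 <- 2.5
LifeExp_4.6_2 <- -736527910 + 620060216*TotExp_0.06
LifeExp_4.6_2
## [1] 813622630
#difference between the two forecasts
LifeExp_4.6_2-LifeExp_4.6
## [1] 620060216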

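The difference between the predicted ‘LifeExp^4.6’ values at ‘TotExp^.06’ = 1.5 and ‘TotExp^.06’ = 2.5 is 620060216, which confirms that a one-unit increase in ‘TotExp^0.06’ increases the predicted ‘LifeExp^4.6’ by 620060216. Both forecasts are on the transformed scale, so they must be raised to the power 1/4.6 to obtain life expectancy itself.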
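Since the question asks for life expectancy rather than its 4.6th power, the forecasts can be converted back to the original scale. The following is a minimal sketch using the coefficients printed in the summary; `forecast_life_exp` is a hypothetical helper name introduced here.

```r
# Coefficients copied from the transformed-model summary
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: forecast LifeExp from a value of TotExp^0.06
forecast_life_exp <- function(tot_exp_006) {
  life_exp_46 <- b0 + b1 * tot_exp_006  # prediction on the LifeExp^4.6 scale
  life_exp_46^(1 / 4.6)                 # back-transform to the LifeExp scale
}

life_exp_forecast <- forecast_life_exp(c(1.5, 2.5))
life_exp_forecast
```

This gives forecasts of about 63.3 and 86.5 for ‘TotExp^.06’ = 1.5 and 2.5 respectively. The back-transform only works where the transformed prediction is positive, which is the case for both inputs here.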

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

\(LifeExp = b0 + b1*PropMD + b2*TotExp + b3*PropMD*TotExp\)

multiple_linear_model <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = data)
summary(multiple_linear_model)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The summary statistics show that:

  1. The F-statistic of 34.49 on 3 and 186 degrees of freedom has a p-value below 2.2e-16, less than the 5% significance level, so the independent variables jointly have explanatory power.

  2. All independent variables are statistically significant, as their p-values are less than the significance level of 5%. Thus, the linear regression is described by the following equation:

\(LifeExp = 6.277e+01 + 1.497e+03*PropMD + 7.233e-05*TotExp - 6.026e-03*PropMD*TotExp\)

The intercept of 6.277e+01 indicates that the predicted ‘LifeExp’ equals 6.277e+01 when all independent variables equal 0. The coefficient of 1.497e+03 indicates that a one-unit increase in ‘PropMD’ increases ‘LifeExp’ by 1.497e+03 when ‘TotExp’ equals 0. The coefficient of 7.233e-05 indicates that a one-unit increase in ‘TotExp’ increases ‘LifeExp’ by 7.233e-05 when ‘PropMD’ equals 0. The interaction coefficient of -6.026e-03 indicates that a one-unit increase in ‘PropMD*TotExp’ decreases ‘LifeExp’ by 6.026e-03 while keeping all remaining independent variables constant; equivalently, the effect of each variable weakens as the other one grows.

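  3. R-squared of 0.3574 indicates that 35.74% of the variability in the dependent variable is explained by the model.

  4. Adjusted R-squared of 0.3471, which penalizes R-squared for the number of predictors, indicates that only 34.71% of the variation is explained by the estimated regression line.

  5. The residual standard error (RSE) of 8.765 on 186 degrees of freedom estimates the standard deviation of the residuals, that is, the typical distance between an observed ‘LifeExp’ value and the fitted value.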
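Because the model contains an interaction term, the effect of each variable depends on the level of the other: the slope for ‘TotExp’ is b2 + b3*PropMD and the slope for ‘PropMD’ is b1 + b3*TotExp. The sketch below evaluates these slopes with the printed coefficients. The function names and the input values are chosen here purely for illustration.

```r
# Coefficients copied from summary(multiple_linear_model)
b_propmd      <- 1.497e+03
b_totexp      <- 7.233e-05
b_interaction <- -6.026e-03

# Change in predicted LifeExp per unit of TotExp at a given PropMD
slope_totexp <- function(prop_md) b_totexp + b_interaction * prop_md

# Change in predicted LifeExp per unit of PropMD at a given TotExp
slope_propmd <- function(tot_exp) b_propmd + b_interaction * tot_exp

# Illustrative inputs (assumed values, roughly within the range of the printed records)
slope_totexp(c(0, 0.001, 0.003))
slope_propmd(c(0, 1000, 100000))
```

Both slopes shrink as the other variable increases, which is what the negative interaction coefficient expresses.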

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

#inputs for the forecast
PropMD <- 0.03
TotExp <- 14

#plug the inputs into the fitted equation
LifeExp <- 6.277e+01 + 1.497e+03*PropMD + 7.233e-05*TotExp - 6.026e-03*PropMD*TotExp
round(LifeExp,0)
## [1] 108

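The result looks unrealistic: a life expectancy of 108 years is far above the values observed in the data (the first six records range from 41 to 82). The input ‘PropMD’ = 0.03 is also roughly nine times the largest value among those records (0.0033), so the forecast extrapolates well beyond that range.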
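To see where the forecast of 108 comes from, the prediction can be split into the contribution of each term. This is a small sketch with the printed coefficients; the vector name `contributions` is introduced here for illustration.

```r
# Inputs from the question
PropMD <- 0.03
TotExp <- 14

# Contribution of each term of the fitted equation to the forecast
contributions <- c(
  intercept   = 6.277e+01,
  PropMD      = 1.497e+03 * PropMD,
  TotExp      = 7.233e-05 * TotExp,
  interaction = -6.026e-03 * PropMD * TotExp
)

contributions
sum(contributions)
```

Almost all of the excess over the intercept comes from the ‘PropMD’ term (1.497e+03 * 0.03 = 44.91), while the ‘TotExp’ and interaction terms are negligible at these inputs.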