Regression Analysis Linear Model

The who.csv is a pre-selection course data set containing real-world data from 2008. The data set will provide analysis models such as: simple linear regression with non-transformed and transformed variables, forecasting, and multiple regression.

Load Required Libraries

library(tidyverse)
library(corrplot)
library(gvlma)
library(ggplot2)

Dataset

Load dataset from repository or local drive.

Dataset Contents

Variable Name Variable Description
Country name of the country
LifeExp average life expectancy for the country in years
InfantSurvival proportion of those surviving to one year or more
Under5Survival proportion of those surviving to five years or more
TBFree proportion of the population without TB
PropMD proportion of the population who are MDs
PropRN proportion of the population who are RNs
PersExp mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp sum of personal and government expeditures

Convert the dataset components to a dataframe

who <- data.frame(df)

View the dataframe structure. The who dataset contains 190 observations (rows) and 10 variables (columns).

str(who)
## 'data.frame':    190 obs. of  10 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ LifeExp       : num  42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num  0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num  0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num  0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num  2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
##  $ PropRN        : num  0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : num  20 169 108 2589 36 ...
##  $ GovtExp       : num  92 3128 5184 169725 1620 ...
##  $ TotExp        : num  112 3297 5292 172314 1656 ...

View the object summaries depending on class (numeric): minimum value, maximum value, mean value, 1st quartile (25th percentile), and 3rd quartile (75th percentile)

summary(who)
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN             PersExp       
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25  
##  Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00  
##     GovtExp             TotExp      
##  Min.   :    10.0   Min.   :    13  
##  1st Qu.:   559.5   1st Qu.:   584  
##  Median :  5385.0   Median :  5541  
##  Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :476420.0   Max.   :482750

Count NA values in all dataframe columns

colSums(is.na(who))
##        Country        LifeExp InfantSurvival Under5Survival         TBFree 
##              0              0              0              0              0 
##         PropMD         PropRN        PersExp        GovtExp         TotExp 
##              0              0              0              0              0

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables.

Scatterplot of LifeExp~TotExp

plot(LifeExp ~ TotExp, data = who, col = "red")

#abline(reg = lm(LifeExp ~ TotExp, data = who))

Simple Linear Regression Model

who.model1 <- lm(LifeExp ~ TotExp, data = who)
summary(who.model1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Provide and interpret the F statistics, R^2, standard error, and p-values only. Simple Linear Regression Equation for Model1: \[LifeExp=6.475e+1+6.297e−05∗TotExp±7.795e−06\]

F-statistic: is a global test if ANY of the independent variables is related to the outcome. The model p-value associated with the F-statistic is less than the significance level of 0.05, meaning the independent variable (TotExp) is related to the dependent variable (LifeExp).
R^2: this is the Estimate column, also called the regression coefficient. The number in the table (6.297e-05) tells us that for every one unit increase in the average LifeExp there is a corresponding 6.297e-05 unit increase for the sum of personal and government expenditures, TotExp.
Standard Error: this number shows a variation of 7.795e-06 in our estimate of the relationship between average LifeExp and TotExp (sum of personal and government expenditures) which is minimal.
p-value: the Pr(>|t|) column indicates the estimated effect of TotExp on LifeExp. The p-value is so low (7.71e-14 < 0.001), we can conclude that TotExp has a statiscally significant effect on LifeExp.

Discuss whether the assumptions of the simple linear regression met.

Results: The who.model1 model do not meet assumptions of a simple linear regression. There was more data collected approximately around an average age of 70 and contributed to the data violation for homoscedasticity. In the homosecedasticity check, the Normal Q-Qplot of the model shows the data points deviates significantly from the straight line.

  1. Independence fo observations(aka no autocorrelation)

The model, who.model1 has only one independent variable and one dependent variable and a test for any hidden relationship among variables is not required.

  1. Normality

The histogram shows the dependent variable has a skewed left (non-symmetric) distribution, indicating there is no center value and the mode is near the right tail of the data.

hist(who$LifeExp, col = "blue", border = "yellow")

3. Linearity

The relationship between the independent and dependent variable is non-linear because the distribution of the points has a curvilinear relationship.

plot(LifeExp ~ TotExp, data = who, col = "red")

4. Homoscedasticity

Plot residuals to view any bias and look at the red lines representing the mean of the residuals.

par(mfrow = c(2,2))
plot(who.model1, col = "blue")

#par(mfrow = c(1,1)) #to reset plotting graph individual in the entire window

#abline(who.model1) #graphics package

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables.

options(scipen = 0)
options(digits = 6)

who$LifeExp4.6 <- who$LifeExp^4.6
who$TotExp.06 <- who$TotExp^0.06

who.model2 <- lm(LifeExp4.6 ~ TotExp.06, who)
summary(who.model2)
## 
## Call:
## lm(formula = LifeExp4.6 ~ TotExp.06, data = who)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.09e+08 -5.40e+07  1.37e+07  5.91e+07  2.12e+08 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.37e+08   4.68e+07   -15.7   <2e-16 ***
## TotExp.06    6.20e+08   2.75e+07    22.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90500000 on 188 degrees of freedom
## Multiple R-squared:  0.73,   Adjusted R-squared:  0.728 
## F-statistic:  508 on 1 and 188 DF,  p-value: <2e-16

Provide and interpret the F statistics, R^2, standard error, and p-values. Linear Regression Equation for Model2:

\[LifeExp4.6=−7.375e+08+6.20e+08∗TotExp.06±2.75e+07\]

F-statistic: is a global test if ANY of the independent variables is related to the outcome. The model p-value associated with the F-statistic is equal to the significance level of 0.05, meaning the independent variable (TotExp.06) is related to the dependent variable (LifeExp4.6).
R^2: this is the Estimate column, also called the regression coefficient. The number in the table (6.20e+08) tells us that for every one unit increase in the average LifeExp4.6 there is a corresponding 6.20e+08 unit increase for the sum of personal and government expenditures, TotExp.06.
Standard Error: this number shows a variation of 2.75e+07 in our estimate of the relationship between average LifeExp and TotExp (sum of personal and government expenditures).
p-value: the Pr(>|t|) column indicates the estimated effect of TotExp.06 on LifeExp4.6. The p-value is so low (2e-16 < 0.001), we can conclude that TotExp.06 has a statiscally significant effect on LifeExp.

Discuss whether the assumptions of the simple linear regression met.

Results: The who.model2 model do met assumptions of a simple linear regression. The data points are close and centered around the mean fitted lines as indicated in homoscedasticity Normal Q-Qplot..

1. Independence fo observations(aka no autocorrelation)

The model, who.model2 has only one independent variable and one dependent variable and a test for any hidden relationship among variables is not required.

2. Normality

The histogram shows the dependent variable has a slight symmetrical distribution, a center peak is shown. The histogram tapering of bar differs on the right and left indicating unbalance tails of the data.

hist(who$LifeExp4.6, col = "purple1", border = "blue")

3. Linearity

The relationship between the independent and dependent variable is linear because the distribution direction of the points have positive relationship.

plot(LifeExp4.6 ~ TotExp.06, data = who, pch = 21, 
     col = "red",  #border color
     bg = "plum2",  #fill color
     cex = 2)      #symbol style

4. Homoscedasticity

Plot residuals to view any bias and look at the red lines representing the mean of the residuals showing all basically horizontal and centered around zero.

par(mfrow = c(2,2))
plot(who.model2, col = "darkorchid4")

#par(mfrow = c(1,1)) #to reset plotting graph individual in the entire window

#abline(who.model1) #graphics package

Which model is “better?”

Results: The nearly log transformation model, who.model2, is better than than who.model1 model without transformations. The who.model2 model meet assumptions of a simple linear regression. The data was more evenly distributed/collected on average age and the homoscedasticity check validated the data log transformation. In the homosecedasticity check, the Normal Q-Qplot of the model shows that the real residuals mostly lie on a straight diagonal line with some minor deviations along each of the tails.

3. Using the results from 3, forecast life expectancy when TotExp^.06 = 1.5.

The forecast life expectancy when TotExp^.06 = 2.5.

values <- data.frame(TotExp.06=c(1.5,2.5))
predict(who.model2, values)^(1/4.6)
##       1       2 
## 63.3115 86.5064

Results:

The life expectancy when TotExp^06 = 1.5 is: 63.3115
The life expectancy when TotExp^06 = 2.5 is: 86.5064 ## 4.

4. Build the following multiple regression model…

\[LifeExp=b0+b1∗PropMD+b2∗TotExp+b3∗PropMD∗TotEx\]

who.model4 <- lm(LifeExp ~ PropMD + TotExp+ PropMD * TotExp, who)

summary(gvlma(who.model4))   #gvlma package
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27.32  -4.13   2.10   6.54  13.07 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.28e+01   7.96e-01   78.90  < 2e-16 ***
## PropMD         1.50e+03   2.79e+02    5.37  2.3e-07 ***
## TotExp         7.23e-05   8.98e-06    8.05  9.4e-14 ***
## PropMD:TotExp -6.03e-03   1.47e-03   -4.09  6.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.77 on 186 degrees of freedom
## Multiple R-squared:  0.357,  Adjusted R-squared:  0.347 
## F-statistic: 34.5 on 3 and 186 DF,  p-value: <2e-16
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = who.model4) 
## 
##                    Value  p-value                   Decision
## Global Stat        87.07 0.00e+00 Assumptions NOT satisfied!
## Skewness           33.42 7.42e-09 Assumptions NOT satisfied!
## Kurtosis            0.56 4.54e-01    Assumptions acceptable.
## Link Function      52.73 3.83e-13 Assumptions NOT satisfied!
## Heteroscedasticity  0.36 5.49e-01    Assumptions acceptable.

…interpret the F statistics, R^2, standard error, and p-values.

F-statistic: is a global test if ANY of the independent variables is related to the outcome. The model p-value associated with the F-statistic is equal to the significance level of 0.05, meaning the independent variables is related to the dependent variable (LifeExp).
R^2: this is the Estimate column, also called the regression coefficient. The number in the table tells us that for every one unit increase in the model components (predictors) there is a corresponding increase to the response TotExp, variable. However, the interactions between PropMD and TotExp indicates a negative relationship with value significance to the response variable.
Standard Error: this number shows a variation in our estimate of the relationship between the predictor variables and response (LifeExp) variable.
p-value: the Pr(>|t|) column indicates the individual predictor components estimated effect on the response (LifeExp) variable. The p-value is so low (p < 0.001), we can conclude that all predictors have a statistical significant effect on LifeExp.

How good is the model?

The gvlma () function was used to determine how good the model fit the linear regression model. The Global Validation of Linear Models Assumptions (gvlma) is a top-level function that contains all of the components returned by the call to lm for fitting the linear model. The assessment of the linear model assumptions using the global test provided results for: Global Stat (NOT satisfied), Skewness (NOT satisfied), Kurtosis (acceptable), Link Function (NOT satisfied), and Heteroscedasticity (acceptable).

Results: The who.model4 model do not meet assumptions of a simple linear regression. There was more data collected approximately around an average age of 70, the data points are left skewed, no linear relationship between predictor variables and response variable.

Plot for Global Validation of Linear Models Assumptions (gvlma)

plot(gvlma((who.model4)))

Let’s discuss whether the assumptions of the multiple linear regression met.

Results: The who.model4 model do not meet assumptions of a simple linear regression. There was more data collected approximately around an average age of 70 and contributed to the data violation for homoscedasticity. In the homosecedasticity check, the Normal Q-Qplot of the model shown that the real residuals is not horizontally and centered around zero.

1. Independence for observations(aka no autocorrelation)

The model, who.model2 has only one independent variable and one dependent variable and a test for any hidden relationship among variables is not required.

Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the left side of the correlogram, the legend color show the correlation coefficients and the corresponding colors.

Independence of Observations Outcome: A positive correlation implies that the two variables change in the same direction, and a negative correlation indicates when one variable change the other variable change in the opposite direction. The model, who, variables in the regression has no autocorrelation.

#package "corrplot"
who2 <- who %>% mutate_if(is.factor, as.numeric)
who2 <- who2[ , -c(1, 3:5, 7:9, 11:15)]
corrplot.mixed(cor(who2), lower = "number", upper = "square", order = "hclust",
           tl.col = "black", tl.srt = 45) #Multi correlation plot

2. Normality

The histogram shows the dependent variable has a skewed left (non-symmetric) distribution, indicating there is no center value and the mode is near the right tail of the data.

hist(who$LifeExp, col = "blue", border = "yellow")

3. Linearity

The relationship between the independent and dependent variable is non-linear because the distribution of the points has a curvilinear relationship.

par(mfrow = c(1, 2))
plot(LifeExp ~ PropMD + TotExp+ PropMD * TotExp, data = who, pch = 21, 
     col = "brown4",  #border color
     bg = "lightcoral",  #fill color
     cex = 2)      #symbol style

4. Homoscedasticity

Plot residuals to view any bias and look at the red lines representing the mean of the residuals.

par(mfrow = c(2,2))
plot(who.model4, col = "darkorchid4")

#par(mfrow = c(1,1)) #to reset plotting graph individual in the entire window

#abline(who.model1) #graphics package

5. Forecast LifeExp when PropMD = .03 and TotExp = 14.

values <- data.frame(PropMD=0.03, TotExp=14)
predict(who.model4, values, interval = "confidence")
##       fit   lwr     upr
## 1 107.696 91.86 123.532

Does this forecast seem realistic? Why or why not?

In the model, who.model4, the forecast LifeExp is 107.696 when PropMD = 0.03 and TotExp = 14. The forecast of 107.696 average life expectancy is not realistic. The histogram shows the data distribution points are left skewed (non-symmetric).
ggplot(data = who.model4, aes(x = who.model4$residuals)) +
  geom_histogram(bins = 20, fill = 'indianred1', color = 'black') +
  labs(title = "Histogram of Residuals", x = 'Residuals', y = 'Frequency')