The who.csv is a pre-selection course data set containing real-world data from 2008. The data set will provide analysis models such as: simple linear regression with non-transformed and transformed variables, forecasting, and multiple regression.
library(tidyverse)
library(corrplot)
library(gvlma)
library(ggplot2)
Load dataset from repository or local drive.
| Variable Name | Variable Description |
|---|---|
| Country | name of the country |
| LifeExp | average life expectancy for the country in years |
| InfantSurvival | proportion of those surviving to one year or more |
| Under5Survival | proportion of those surviving to five years or more |
| TBFree | proportion of the population without TB |
| PropMD | proportion of the population who are MDs |
| PropRN | proportion of the population who are RNs |
| PersExp | mean personal expenditures on healthcare in US dollars at average exchange rate |
| GovtExp | mean government expenditures per capita on healthcare, US dollars at average exchange rate |
| TotExp | sum of personal and government expeditures |
who <- data.frame(df)
View the dataframe structure. The who dataset contains 190 observations (rows) and 10 variables (columns).
str(who)
## 'data.frame': 190 obs. of 10 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ LifeExp : num 42 71 71 82 41 73 75 69 82 80 ...
## $ InfantSurvival: num 0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
## $ Under5Survival: num 0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
## $ TBFree : num 0.998 1 0.999 1 0.997 ...
## $ PropMD : num 2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
## $ PropRN : num 0.000572 0.004614 0.002091 0.0035 0.001146 ...
## $ PersExp : num 20 169 108 2589 36 ...
## $ GovtExp : num 92 3128 5184 169725 1620 ...
## $ TotExp : num 112 3297 5292 172314 1656 ...
View the object summaries depending on class (numeric): minimum value, maximum value, mean value, 1st quartile (25th percentile), and 3rd quartile (75th percentile)
summary(who)
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN PersExp
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883 Min. : 3.00
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455 1st Qu.: 36.25
## Median :0.9992 Median :0.0010474 Median :0.0027584 Median : 199.50
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336 Mean : 742.00
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164 3rd Qu.: 515.25
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387 Max. :6350.00
## GovtExp TotExp
## Min. : 10.0 Min. : 13
## 1st Qu.: 559.5 1st Qu.: 584
## Median : 5385.0 Median : 5541
## Mean : 40953.5 Mean : 41696
## 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :476420.0 Max. :482750
Count NA values in all dataframe columns
colSums(is.na(who))
## Country LifeExp InfantSurvival Under5Survival TBFree
## 0 0 0 0 0
## PropMD PropRN PersExp GovtExp TotExp
## 0 0 0 0 0
Scatterplot of LifeExp~TotExp
plot(LifeExp ~ TotExp, data = who, col = "red")
#abline(reg = lm(LifeExp ~ TotExp, data = who))
Simple Linear Regression Model
who.model1 <- lm(LifeExp ~ TotExp, data = who)
summary(who.model1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Provide and interpret the F statistics, R^2, standard error, and p-values only. Simple Linear Regression Equation for Model1: \[LifeExp=6.475e+1+6.297e−05∗TotExp±7.795e−06\]
Estimate column, also called the regression coefficient. The number in the table (6.297e-05) tells us that for every one unit increase in the average LifeExp there is a corresponding 6.297e-05 unit increase for the sum of personal and government expenditures, TotExp.Pr(>|t|) column indicates the estimated effect of TotExp on LifeExp. The p-value is so low (7.71e-14 < 0.001), we can conclude that TotExp has a statiscally significant effect on LifeExp.Discuss whether the assumptions of the simple linear regression met.
Results: The who.model1 model do not meet assumptions of a simple linear regression. There was more data collected approximately around an average age of 70 and contributed to the data violation for homoscedasticity. In the homosecedasticity check, the Normal Q-Qplot of the model shows the data points deviates significantly from the straight line.
The model, who.model1 has only one independent variable and one dependent variable and a test for any hidden relationship among variables is not required.
The histogram shows the dependent variable has a skewed left (non-symmetric) distribution, indicating there is no center value and the mode is near the right tail of the data.
hist(who$LifeExp, col = "blue", border = "yellow")
3. Linearity
The relationship between the independent and dependent variable is non-linear because the distribution of the points has a curvilinear relationship.
plot(LifeExp ~ TotExp, data = who, col = "red")
4. Homoscedasticity
Plot residuals to view any bias and look at the red lines representing the mean of the residuals.
par(mfrow = c(2,2))
plot(who.model1, col = "blue")
#par(mfrow = c(1,1)) #to reset plotting graph individual in the entire window
#abline(who.model1) #graphics package
options(scipen = 0)
options(digits = 6)
who$LifeExp4.6 <- who$LifeExp^4.6
who$TotExp.06 <- who$TotExp^0.06
who.model2 <- lm(LifeExp4.6 ~ TotExp.06, who)
summary(who.model2)
##
## Call:
## lm(formula = LifeExp4.6 ~ TotExp.06, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.09e+08 -5.40e+07 1.37e+07 5.91e+07 2.12e+08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.37e+08 4.68e+07 -15.7 <2e-16 ***
## TotExp.06 6.20e+08 2.75e+07 22.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90500000 on 188 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.728
## F-statistic: 508 on 1 and 188 DF, p-value: <2e-16
Provide and interpret the F statistics, R^2, standard error, and p-values. Linear Regression Equation for Model2:
\[LifeExp4.6=−7.375e+08+6.20e+08∗TotExp.06±2.75e+07\]
Estimate column, also called the regression coefficient. The number in the table (6.20e+08) tells us that for every one unit increase in the average LifeExp4.6 there is a corresponding 6.20e+08 unit increase for the sum of personal and government expenditures, TotExp.06.Pr(>|t|) column indicates the estimated effect of TotExp.06 on LifeExp4.6. The p-value is so low (2e-16 < 0.001), we can conclude that TotExp.06 has a statiscally significant effect on LifeExp.Discuss whether the assumptions of the simple linear regression met.
Results: The who.model2 model do met assumptions of a simple linear regression. The data points are close and centered around the mean fitted lines as indicated in homoscedasticity Normal Q-Qplot..
1. Independence fo observations(aka no autocorrelation)
The model, who.model2 has only one independent variable and one dependent variable and a test for any hidden relationship among variables is not required.
2. Normality
The histogram shows the dependent variable has a slight symmetrical distribution, a center peak is shown. The histogram tapering of bar differs on the right and left indicating unbalance tails of the data.
hist(who$LifeExp4.6, col = "purple1", border = "blue")
3. Linearity
The relationship between the independent and dependent variable is linear because the distribution direction of the points have positive relationship.
plot(LifeExp4.6 ~ TotExp.06, data = who, pch = 21,
col = "red", #border color
bg = "plum2", #fill color
cex = 2) #symbol style
4. Homoscedasticity
Plot residuals to view any bias and look at the red lines representing the mean of the residuals showing all basically horizontal and centered around zero.
par(mfrow = c(2,2))
plot(who.model2, col = "darkorchid4")
#par(mfrow = c(1,1)) #to reset plotting graph individual in the entire window
#abline(who.model1) #graphics package
Which model is “better?”
Results: The nearly log transformation model, who.model2, is better than than who.model1 model without transformations. The who.model2 model meet assumptions of a simple linear regression. The data was more evenly distributed/collected on average age and the homoscedasticity check validated the data log transformation. In the homosecedasticity check, the Normal Q-Qplot of the model shows that the real residuals mostly lie on a straight diagonal line with some minor deviations along each of the tails.
The forecast life expectancy when TotExp^.06 = 2.5.
values <- data.frame(TotExp.06=c(1.5,2.5))
predict(who.model2, values)^(1/4.6)
## 1 2
## 63.3115 86.5064
Results:
\[LifeExp=b0+b1∗PropMD+b2∗TotExp+b3∗PropMD∗TotEx\]
who.model4 <- lm(LifeExp ~ PropMD + TotExp+ PropMD * TotExp, who)
summary(gvlma(who.model4)) #gvlma package
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.32 -4.13 2.10 6.54 13.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.28e+01 7.96e-01 78.90 < 2e-16 ***
## PropMD 1.50e+03 2.79e+02 5.37 2.3e-07 ***
## TotExp 7.23e-05 8.98e-06 8.05 9.4e-14 ***
## PropMD:TotExp -6.03e-03 1.47e-03 -4.09 6.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.77 on 186 degrees of freedom
## Multiple R-squared: 0.357, Adjusted R-squared: 0.347
## F-statistic: 34.5 on 3 and 186 DF, p-value: <2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = who.model4)
##
## Value p-value Decision
## Global Stat 87.07 0.00e+00 Assumptions NOT satisfied!
## Skewness 33.42 7.42e-09 Assumptions NOT satisfied!
## Kurtosis 0.56 4.54e-01 Assumptions acceptable.
## Link Function 52.73 3.83e-13 Assumptions NOT satisfied!
## Heteroscedasticity 0.36 5.49e-01 Assumptions acceptable.
…interpret the F statistics, R^2, standard error, and p-values.
Estimate column, also called the regression coefficient. The number in the table tells us that for every one unit increase in the model components (predictors) there is a corresponding increase to the response TotExp, variable. However, the interactions between PropMD and TotExp indicates a negative relationship with value significance to the response variable.Pr(>|t|) column indicates the individual predictor components estimated effect on the response (LifeExp) variable. The p-value is so low (p < 0.001), we can conclude that all predictors have a statistical significant effect on LifeExp.How good is the model?
The gvlma () function was used to determine how good the model fit the linear regression model. The Global Validation of Linear Models Assumptions (gvlma) is a top-level function that contains all of the components returned by the call to lm for fitting the linear model. The assessment of the linear model assumptions using the global test provided results for: Global Stat (NOT satisfied), Skewness (NOT satisfied), Kurtosis (acceptable), Link Function (NOT satisfied), and Heteroscedasticity (acceptable).
Results: The who.model4 model do not meet assumptions of a simple linear regression. There was more data collected approximately around an average age of 70, the data points are left skewed, no linear relationship between predictor variables and response variable.
Plot for Global Validation of Linear Models Assumptions (gvlma)
plot(gvlma((who.model4)))
Let’s discuss whether the assumptions of the multiple linear regression met.
Results: The who.model4 model do not meet assumptions of a simple linear regression. There was more data collected approximately around an average age of 70 and contributed to the data violation for homoscedasticity. In the homosecedasticity check, the Normal Q-Qplot of the model shown that the real residuals is not horizontally and centered around zero.
1. Independence for observations(aka no autocorrelation)
The model, who.model2 has only one independent variable and one dependent variable and a test for any hidden relationship among variables is not required.
Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the left side of the correlogram, the legend color show the correlation coefficients and the corresponding colors.
Independence of Observations Outcome: A positive correlation implies that the two variables change in the same direction, and a negative correlation indicates when one variable change the other variable change in the opposite direction. The model, who, variables in the regression has no autocorrelation.
#package "corrplot"
who2 <- who %>% mutate_if(is.factor, as.numeric)
who2 <- who2[ , -c(1, 3:5, 7:9, 11:15)]
corrplot.mixed(cor(who2), lower = "number", upper = "square", order = "hclust",
tl.col = "black", tl.srt = 45) #Multi correlation plot
2. Normality
The histogram shows the dependent variable has a skewed left (non-symmetric) distribution, indicating there is no center value and the mode is near the right tail of the data.
hist(who$LifeExp, col = "blue", border = "yellow")
3. Linearity
The relationship between the independent and dependent variable is non-linear because the distribution of the points has a curvilinear relationship.
par(mfrow = c(1, 2))
plot(LifeExp ~ PropMD + TotExp+ PropMD * TotExp, data = who, pch = 21,
col = "brown4", #border color
bg = "lightcoral", #fill color
cex = 2) #symbol style
4. Homoscedasticity
Plot residuals to view any bias and look at the red lines representing the mean of the residuals.
par(mfrow = c(2,2))
plot(who.model4, col = "darkorchid4")
#par(mfrow = c(1,1)) #to reset plotting graph individual in the entire window
#abline(who.model1) #graphics package
values <- data.frame(PropMD=0.03, TotExp=14)
predict(who.model4, values, interval = "confidence")
## fit lwr upr
## 1 107.696 91.86 123.532
Does this forecast seem realistic? Why or why not?
ggplot(data = who.model4, aes(x = who.model4$residuals)) +
geom_histogram(bins = 20, fill = 'indianred1', color = 'black') +
labs(title = "Histogram of Residuals", x = 'Residuals', y = 'Frequency')