{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
Load the Data Load the WHO dataset into a data frame.
# Load the necessary package
library(readr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the cars dataset
who_data = read_csv('https://raw.githubusercontent.com/hawa1983/DATA605Discussion/main/who.csv')
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(who_data)
## # A tibble: 6 × 10
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20
## 2 Albania 71 0.985 0.983 1.00 1.14e-3 4.61e-3 169
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108
## 4 Andorra 82 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3 36
## 6 Antigua … 73 0.99 0.989 1.00 1.43e-4 2.77e-3 503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>
Below is the plot of Average Life Expectancy and Total Government and Individual Spending. The assumption of linearity requires that there is a linear relationship between the independent variable (Total Expenditure) and the dependent variable (Life Expectancy). In the scatter plot, the relationship does not appear to be linear across the entire range of Total Expenditure. The points seem to flatten out as Total Expenditure increases, suggesting a possible logarithmic or polynomial relationship rather than a strictly linear one.
# From the data frame 'who_data' with the columns LifeExp and TotExp
plot(who_data$TotExp, who_data$LifeExp, main = "Life Expectancy vs Total Expenditure", xlab = "Total Expenditure", ylab = "Life Expectancy", pch = 19)
Below is the Simple Linear model of the data with Total Expenditure ‘TotExp’ as the independent variable and Life Expectancy ‘LifeExp’ as the response variable.
# Create a linear model predicting stopping distance based on speed
who.lm <- lm(LifeExp ~ TotExp, data = who_data)
who.lm
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
##
## Coefficients:
## (Intercept) TotExp
## 6.475e+01 6.297e-05
LifeExp=64.75+0.00006297×TotExp
# Summary of the model
summary(who.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
From the output of the linear regression analysis above, we can interpret the following statistics:
F-statistic: With an F-statistic of 65.26 and a p-value of 7.71e-14, the model is statistically significant, rejecting the null hypothesis that the model has no explanatory power.
R-squared: An R-squared of 0.2577 means the model explains roughly 25.77% of the variance in life expectancy, which measures the model’s explanatory power but not the strength or direction of the association.
Adjusted R-squared: At 0.2537, the adjusted R-squared accounts for the number of predictors, offering a more accurate fit measurement by penalizing for unnecessary variables.
Standard Error: The residual standard error is 9.371, which indicates the average distance that the observed values fall from the regression line.
p-values: The p-value for the TotExp coefficient is 7.71e-14, showing that TotExp is a statistically significant predictor of LifeExp. The intercept also has a statistically significant p-value of <2e-16.
Based on the regression diagnostic plots below, we can assess whether the assumptions of simple linear regression are met.
par(mfrow=c(2,2))
plot(who.lm)
Linearity Assumption: The curve in the Residuals vs Fitted plot suggests a non-linear relationship between the predictor and response variable, indicating that the linearity assumption may be violated.
Normality Assumption: The deviations from the line in the tails of the Q-Q plot imply that the residuals may not be normally distributed, particularly with heavier tails than expected under normality.
Homoscedasticity: The Scale-Location plot displays a pattern where the spread of residuals increases with fitted values, hinting at potential heteroscedasticity, contrary to the constant variance assumption.
Independence: While the provided plots do not directly address this, the independence of residuals would require additional tests if data have a natural ordering (like time-series).
Influence: A few points in the Residuals vs Leverage plot show higher leverage and Cook’s distance, indicating the presence of influential points that could affect the model’s estimates.
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
who_data$LifeExp_Power <- who_data$LifeExp^4.6
who_data$TotExp_Power <- who_data$TotExp^0.06
# Display the data frame to confirm the new columns have been added
head(who_data)
## # A tibble: 6 × 12
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20
## 2 Albania 71 0.985 0.983 1.00 1.14e-3 4.61e-3 169
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108
## 4 Andorra 82 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3 36
## 6 Antigua … 73 0.99 0.989 1.00 1.43e-4 2.77e-3 503
## # ℹ 4 more variables: GovtExp <dbl>, TotExp <dbl>, LifeExp_Power <dbl>,
## # TotExp_Power <dbl>
The scatter plot shows a positive relationship between Life Expectancy raised to the 4.6th power and Total Expenditure raised to the 0.06th power. The upward trend of the fitted line indicates that as Total Expenditure increases, Life Expectancy tends to increase on the transformed scale. The plot supports the use of a transformed model to capture the relationship between these variables more linearly.
# Plot the scatter plot
plot(who_data$TotExp_Power, who_data$LifeExp_Power,
xlab = "Total Expenditure (Power 0.06)",
ylab = "Life Expectancy (Power 4.6)",
main = "Plot of Life Expectancy (Power 4.6) vs Total Expenditure (Power 0.06)")
# Add a linear regression line to the plot:
who.lm_power <- lm(LifeExp_Power ~ TotExp_Power, data = who_data)
abline(who.lm_power, col = "red")
# Summary of the model
summary(who.lm_power)
##
## Call:
## lm(formula = LifeExp_Power ~ TotExp_Power, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_Power 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
F-statistic: With a value of 507.7 and a p-value less than 2.2e-16, this model has a very strong statistical significance, indicating that the model provides a better fit than the intercept-only model.
R-squared: The R-squared value is 0.7298, suggesting that approximately 72.98% of the variability in the transformed Life Expectancy (LifeExp_Power) is explained by the transformed Total Expenditure (TotExp_Power).
Adjusted R-squared: With a value of 0.7283, this model’s adjusted R-squared is quite close to the R-squared value, which indicates that the number of predictors (in this case, one) is appropriate for the number of observations.
Standard Error: The residual standard error is 904,000 on 188 degrees of freedom. This value is quite large; however, considering that the dependent variable was raised to the power of 4.6, the actual scale of the residuals might still be appropriate.
p-values: The p-value for the TotExp_Power coefficient is extremely low (< 2.2e-16), demonstrating a highly significant relationship with LifeExp_Power.
Comparison - The transformed model explains a substantially higher proportion of the variance in life expectancy as indicated by the R-squared value (72.98% vs 25.77%). - The standard errors are not directly comparable due to the transformation, but the residual standard error in the original model suggests a closer fit to the non-transformed data. - Both models show statistical significance for Total Expenditure, but the transformed model has a much higher F-statistic, indicating a stronger overall model fit. - The F-statistic and corresponding p-values are extremely low for both models, reinforcing the significance of the models.
Based on the regression diagnostic plots below, we can assess whether the assumptions of simple linear regression are met.
par(mfrow=c(2,2))
plot(who.lm_power)
Based on the diagnostic plots provided for the linear regression model:
Residuals vs Fitted: This plot doesn’t show any clear patterns or systematic structure, which is good as it suggests linearity. However, there appears to be an increase in the spread of residuals as the fitted values increase, which may indicate potential heteroscedasticity.
Q-Q Plot of Residuals: The residuals largely follow the reference line, especially in the middle of the distribution, indicating that the residuals are approximately normally distributed. There are some deviations in the tails, but these are not extreme.
Scale-Location (or Spread-Location): The spread of residuals appears more constant across the range of fitted values, suggesting that the transformation has helped with homoscedasticity.
Residuals vs Leverage: This plot helps us to find influential cases — points that have a substantial impact on the calculation of the regression coefficients. Here, no individual points stand out with particularly high leverage or Cook’s distance, suggesting there are no highly influential outliers in the data.
Comparison of the Residual Plots of Two Models The transformed model seems to better satisfy the assumptions of linear regression.
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
# Predict LifeExp_Power for TotExp_Power values
predicted_power_1 <- predict(who.lm_power, newdata = data.frame(TotExp_Power = 1.5))
predicted_power_2 <- predict(who.lm_power, newdata = data.frame(TotExp_Power = 2.5))
# Convert predicted LifeExp_Power back to LifeExp
predicted_lifeexp_1 <- predicted_power_1^(1/4.6)
predicted_lifeexp_2 <- predicted_power_2^(1/4.6)
# Output the predictions
predicted_lifeexp_1
## 1
## 63.31153
predicted_lifeexp_2
## 1
## 86.50645
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
# Fit the model
who_mlm <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who_data)
# Get a summary of the model
summary(who_mlm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Based on the regression output:
F-Statistic: The F-statistic is 34.49 with a p-value < 2.2e-16, which is highly significant. This indicates that the model is statistically significant and the explanatory variables collectively have a strong linear relationship with the dependent variable, LifeExp.
R-squared: The R-squared value is 0.3574, meaning approximately 35.74% of the variability in LifeExp is explained by the model. While not extremely high, this does suggest the model has a moderate fit to the data.
Standard Error: The residual standard error is 8.765 on 186 degrees of freedom. This value is the estimate of the standard deviation of the error term, and it tells you how much the observed values deviate from the model’s predicted values. A smaller standard error would be preferable as it would indicate a tighter clustering of points around the fitted regression line.
P-values: All the p-values for the coefficients, including the interaction term (PropMD*TotExp), are highly significant (p < 0.05), suggesting that each term contributes to the model. It’s important to note that both the individual predictors and their interaction are significant.
Overall, the model is statistically significant as indicated by the F-statistic and the significant p-values for the coefficients. However, the R-squared suggests that while the model explains some of the variability in the dependent variable, there might be other factors that also affect LifeExp which are not included in the model.
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not
The forecast for LifeExp when PropMD is 0.03 and TotExp is 14 is approximately 107.68 years.
In terms of realism, a life expectancy of over 107 years seems quite unrealistic for the current population, given that the highest average life expectancies in the world are in the mid-80s. This forecast might suggest that the model is extrapolating beyond the range of data it was trained on, or that there are other factors at play that the model is not accounting for.
# Manual calculation of the forecast
# Given coefficients from the model output
intercept = 6.277e+01
b1 = 1.497e+03
b2 = 7.233e-05
b3 = -6.026e-03
# Given values for PropMD and TotExp
PropMD_value = 0.03
TotExp_value = 14
# Calculating the interaction term
interaction = PropMD_value * TotExp_value
# Forecasting LifeExp using the model coefficients
LifeExp_forecast = intercept + (b1 * PropMD_value) + (b2 * TotExp_value) + (b3 * interaction)
LifeExp_forecast
## [1] 107.6785