CUNY_605_Multiple_Regression
- Load in the Data and Libraries
- Question 1
- Results from First Model
- Question 2 - Manipulating the Variables for a Second Regression
- Question 3 - Predict the life expectancy when TotExp^.06 = 1.5.
- Question 4 Build the Following Model
- Question 5 - Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this seem realistic?
Load in the Data and Libraries
# Read the WHO data set (the path is specific to the author's machine)
df <- read.csv("C:/Users/carnout/Documents/who.csv", header = TRUE)
library(ggplot2)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.5
## v tidyr 0.8.2 v dplyr 0.7.8
## v readr 1.3.1 v stringr 1.3.1
## v tibble 1.4.2 v forcats 0.3.0
## -- Conflicts ---------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Question 1
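# Scatter plot of life expectancy against total expenditure, with a fitted line
ggplot(data = df, aes(x = TotExp, y = LifeExp)) +
  geom_point(color = 'blue') +
  geom_smooth(method = "lm", se = FALSE)
# Fit the simple linear model, then inspect its diagnostic plots and summary
model <- lm(LifeExp ~ TotExp, data = df)
plot(model)
summary(model)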
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
model
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Coefficients:
## (Intercept) TotExp
## 6.475e+01 6.297e-05
Results from First Model
The simple linear model of life expectancy on total expenditure has a slope of 6.297e-05, which means that every additional 15,880 (in units of tens of millions of dollars) spent is associated with an increase of about one year in average life expectancy.
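As a check on that figure, the spending associated with one extra year is the reciprocal of the TotExp slope printed in the summary above (here \(B_1\) denotes that slope):

\[\Delta TotExp = \frac{1}{B_1} = \frac{1}{6.297 \times 10^{-5}} \approx 15{,}880\]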
The residual standard error (S) is 9.371, which tells us that the observed life expectancies typically fall about 9.4 years from the fitted line.
The p-value for the TotExp coefficient is 7.71e-14, far below 0.05, which shows that there is a statistically significant relationship between a country's expenditure and the life expectancy of its citizens.
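The F-statistic (65.26 on 1 and 188 degrees of freedom) points to the same statistically significant relationship between the two variables. With a single predictor it is simply the square of the slope's t value (8.079^2 ≈ 65.27), so it carries the same p-value.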
The adjusted R-squared value is 0.2537, which means that roughly 25% of the variability in life expectancy is explained by the country's expenditures.
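The adjusted value can be recovered from the multiple R-squared in the summary. The sketch below assumes n = 190 observations (188 residual degrees of freedom plus the 2 estimated coefficients) and p = 1 predictor. It uses the rounded R-squared of 0.2577, so the last digit is approximate:

\[\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.2577)\,\frac{189}{188} \approx 0.254\]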
The assumptions needed for a linear model are:
- Linear relationship - This does not seem to hold for these two variables.
- Multivariate normality - The residuals are not normally distributed, which can be seen in the quantile (Q-Q) plot.
- Multicollinearity - Not present, because there is only one predictor variable.
- No autocorrelation in the data - Not a concern here, because each observation is a separate country rather than a point in a time series.
- Homoscedasticity - Heteroscedasticity is clearly present in the model, because the spread of the residuals is not constant across the fitted values.
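The plots from plot(model) are judged by eye. A numeric check of the normality and constant-variance assumptions could look like the sketch below. It runs on simulated data (not the WHO data) so that it is self-contained. The variable names, seed, and data-generating formula are all assumptions made for illustration. The variance check is a simple studentized Breusch-Pagan style test written in base R, not a call to any particular package.

```r
set.seed(605)  # arbitrary seed so the simulated example is reproducible

# Simulated data only (NOT the WHO data): a curved relationship whose
# noise grows with x, loosely mimicking the problems described above.
n <- 190
x <- runif(n, min = 1, max = 100)
y <- 40 + 8 * log(x) + rnorm(n, sd = 0.05 * x)
sim <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = sim)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test.
# A small p-value suggests the residuals are not normally distributed.
sw <- shapiro.test(res)

# Constant variance: regress the squared residuals on the fitted values.
# n * R^2 of that auxiliary regression is compared with a chi-squared(1)
# distribution; a small p-value suggests heteroscedasticity.
res_sq <- res^2
fit_vals <- fitted(fit)
aux <- lm(res_sq ~ fit_vals)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(shapiro_p = sw$p.value, bp_p = bp_p))
```

Applied to the report's own fit, the same steps would start from residuals(model) and fitted(model) instead of the simulated fit.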
Question 2 - Manipulating the Variables for a Second Regression
# Transform both variables: LifeExp^4.6 and TotExp^0.06
df2 <- df
df2$LifeExp <- df2$LifeExp^4.6
df2$TotExp <- df2$TotExp^0.06
ggplot(data = df2, aes(x = TotExp, y = LifeExp)) +
  geom_point(color = 'blue') +
  geom_smooth(method = "lm", se = FALSE)
# Refit the simple linear model on the transformed variables
transformed_model <- lm(LifeExp ~ TotExp, data = df2)
plot(transformed_model)
summary(transformed_model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
transformed_model
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
##
## Coefficients:
## (Intercept) TotExp
## -736527909 620060216
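The transformed model has a much larger F-statistic (507.7 versus 65.26) and a p-value below 2.2e-16, and the adjusted R-squared is now 0.7283, which means that almost 73% of the variability of the transformed response variable (LifeExp^4.6) can be explained by the transformed predictor variable (TotExp^0.06). The residual standard error of 90,490,000 is on the LifeExp^4.6 scale, so it cannot be compared directly with the 9.371 years of the first model.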
Question 3 - Predict the life expectancy when TotExp^.06 = 1.5.
forecasted_Life_expectancy <- function(x) {
  # x is TotExp^0.06; the fitted value is on the LifeExp^4.6 scale
  ans <- coef(transformed_model)[1] + coef(transformed_model)[2] * x
  return(ans^(1 / 4.6))  # Invert the 4.6 power to get life expectancy in years
}
print(paste0("What is the forecasted life expectancy when TotExp^.06 = 1.5? ", round(forecasted_Life_expectancy(1.5),2), " years."))
## [1] "What is the forecasted life expectancy when TotExp^.06 = 1.5? 63.31 years."
print(paste0("What is the forecasted life expectancy when TotExp^.06 = 2.5? ", round(forecasted_Life_expectancy(2.5),2), " years."))
## [1] "What is the forecasted life expectancy when TotExp^.06 = 2.5? 86.51 years."
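The first forecast can be checked by hand from the coefficients printed for the transformed model. Here \(B_0\) and \(B_1\) denote its intercept and slope, and the final step undoes the 4.6 power applied to LifeExp:

\[LifeExp = \left(B_0 + B_1 \times 1.5\right)^{1/4.6} = \left(-736527910 + 620060216 \times 1.5\right)^{1/4.6} = \left(193562414\right)^{1/4.6} \approx 63.31\]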
Question 4 Build the Following Model
\[LifeExp = B_0 + B_1 * PropMD + B_2 * TotExp + B_3 * PropMD * TotExp\]
df$PropMD_TotExp <- df$PropMD*df$TotExp
multiple_model <- lm(LifeExp~ PropMD + TotExp + PropMD_TotExp,df)
plot(multiple_model)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
summary(multiple_model)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD_TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD_TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
multiple_model
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD_TotExp, data = df)
##
## Coefficients:
## (Intercept) PropMD TotExp PropMD_TotExp
## 6.277e+01 1.497e+03 7.233e-05 -6.026e-03
This model has an F-statistic of 34.49 with a statistically significant p-value (< 2.2e-16), a residual standard error of 8.765, and an adjusted R-squared of 0.3471. By adjusted R-squared it fits better than the first model (0.2537) but worse than the second (0.7283). Precision improves only slightly over the first model, with the residual standard error falling from 9.371 to 8.765 years. The adjusted R-squared suggests that 34.71% of the variation in life expectancy is explained by this model.
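The interaction term above was built by hand as a product column. R's formula syntax can express the same model as PropMD * TotExp, which expands to both main effects plus their interaction. The sketch below compares the two approaches on simulated data (not the WHO data). The seed, value ranges, and coefficients used to generate the data are assumptions made for illustration.

```r
set.seed(605)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in for the WHO columns (NOT the real data).
n <- 190
sim <- data.frame(
  PropMD = runif(n, min = 0, max = 0.03),
  TotExp = runif(n, min = 10, max = 100000)
)
sim$LifeExp <- 60 + 500 * sim$PropMD + 1e-4 * sim$TotExp + rnorm(n, sd = 5)

# Approach used in this report: add the product as its own column.
sim$PropMD_TotExp <- sim$PropMD * sim$TotExp
manual_fit <- lm(LifeExp ~ PropMD + TotExp + PropMD_TotExp, data = sim)

# Formula shorthand: PropMD * TotExp means PropMD + TotExp + PropMD:TotExp.
formula_fit <- lm(LifeExp ~ PropMD * TotExp, data = sim)

# The coefficient names differ (PropMD_TotExp vs PropMD:TotExp),
# so compare the estimates without their names.
same_fit <- all.equal(unname(coef(manual_fit)), unname(coef(formula_fit)))
print(same_fit)
```

One practical advantage of the formula form is that predict() can build the interaction from PropMD and TotExp alone, so new data needs no precomputed product column.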
Question 5 - Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this seem realistic?
forecast_revised.lm <- function(propmd, totexp) {
  b <- coef(multiple_model)  # (Intercept), PropMD, TotExp, PropMD_TotExp
  ans <- b[1] + b[2] * propmd + b[3] * totexp + b[4] * propmd * totexp
  return(unname(ans))
}
print(paste0("What is the forecasted life expectancy when PropMD = 0.3 and TotExp = 14? ", round(forecast_revised.lm(0.3,14),2), " years."))
## [1] "What is the forecasted life expectancy when PropMD = 0.3 and TotExp = 14? 512 years."
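The forecast is 512 years, which is not a realistic life expectancy. Note that the call above evaluates PropMD = 0.3 rather than the .03 stated in the question. Substituting PropMD = 0.03 into the fitted coefficients gives roughly 107.7 years, which is still unrealistic. The model is linear in PropMD with a large positive coefficient (1497), so nothing keeps its predictions within a plausible human lifespan.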
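Both values of PropMD can be checked without the data by plugging the printed coefficients into the model equation. The sketch below hard-codes the estimates from the summary output, which are rounded to four significant figures, so its results differ slightly from those of the fitted model object. The vector b and the helper forecast_from_printed are hypothetical names introduced for this check.

```r
# Coefficients copied from the printed summary(multiple_model) output.
# They are rounded to four significant figures, so results are approximate.
b <- c(intercept = 62.77, propmd = 1497, totexp = 7.233e-05,
       interaction = -6.026e-03)

# Hypothetical helper: evaluates B_0 + B_1*PropMD + B_2*TotExp + B_3*PropMD*TotExp.
forecast_from_printed <- function(propmd, totexp) {
  unname(b["intercept"] + b["propmd"] * propmd + b["totexp"] * totexp +
           b["interaction"] * propmd * totexp)
}

at_0.3  <- forecast_from_printed(0.3, 14)   # value used in the call above
at_0.03 <- forecast_from_printed(0.03, 14)  # value stated in the question

print(round(c(PropMD_0.3 = at_0.3, PropMD_0.03 = at_0.03), 1))
```

Either way the PropMD term dominates the forecast: it contributes about 449 years at PropMD = 0.3 and about 45 years at PropMD = 0.03, on top of an intercept of roughly 63 years.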