# tidyverse provides %>% and anti_join(), used later to drop influential rows
library(AppliedPredictiveModeling)
library(readr)
library(foreign)
library(MASS)
library(tidyverse)
# Read the whitespace-delimited income data (sep = "" splits on any whitespace)
data <- read.csv("C:/Users/chris/OneDrive - University of Texas at San Antonio/DATA ANALYTICS MSDA/SPRING 2023/STA 6543 Predictive Modeling/R/Income.txt", sep = "")
plot(data$Income, data$Expenditure, xlab="Income", ylab="Expenditure")
fit <- lm(Expenditure ~ Income, data = data)
summary(fit)
##
## Call:
## lm(formula = Expenditure ~ Income, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.390 -42.146 -6.162 30.630 224.210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -151.26509 64.12183 -2.359 0.0224 *
## Income 0.06894 0.00835 8.256 9.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.41 on 48 degrees of freedom
## Multiple R-squared: 0.5868, Adjusted R-squared: 0.5782
## F-statistic: 68.16 on 1 and 48 DF, p-value: 9.055e-11
abline(fit, lwd = 2, col = 2)  # add the fitted regression line to the scatterplot
The least-squares regression line (the regression equation) from the computer output is:
y = -151.265 + 0.069x
The slope of the least-squares regression line is 0.069: for each one-unit increase in per capita income, the model predicts a 0.069-unit increase in per capita public school expenditure.
The y-intercept, -151.265, is the estimated value of y when x = 0, i.e., the value at which the regression line crosses the y-axis. Taken literally, it says that the expected expenditure is -151.265 when income is zero; since expenditure cannot be negative and no state has zero per capita income, the intercept is an extrapolation rather than a meaningful prediction on its own.
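The slope and intercept can also be pulled straight from the fitted model object; a minimal sketch (the income value of 10000 below is purely illustrative, not a figure from the data):
# Fitted intercept and slope, as reported in the summary above
coef(fit)   # roughly -151.265 (intercept) and 0.0689 (slope)
# Model-predicted expenditure at an illustrative per capita income of 10000:
# -151.265 + 0.0689 * 10000, i.e. about 538
predict(fit, newdata = data.frame(Income = 10000))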
The least-squares regression equation has the general form y = b + mx, where y is Expenditure (the response), m is the slope, and b is the y-intercept.
From my output, the fitted least-squares regression equation is: Expenditure = -151.265 + 0.0689*Income
The computer output shows Multiple R-squared = 0.5868, meaning that about 59% of the variation in expenditure can be explained by income.
The common error variance $\sigma^2$ is estimated by the mean squared error, $MSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - 2)$, where the numerator is the sum of squared errors (SSE) and $n - 2$ is the residual degrees of freedom (n observations minus the 2 estimated parameters: the intercept and the slope).
From the output, the residual standard error is 61.41, so MSE = 61.41^2 ≈ 3771.19.
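These quantities can also be computed directly from the model object instead of being read off the printout; a minimal sketch using base-R accessors:
# R-squared: proportion of variance in Expenditure explained by Income
summary(fit)$r.squared                      # about 0.5868
# MSE two ways: squared residual standard error, and SSE / (n - 2)
sigma(fit)^2                                # about 3771
sum(residuals(fit)^2) / df.residual(fit)    # same value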
par(mfrow = c(2,2))
plot(fit)
Looking at the diagnostic plots, there are indeed some outliers, along with other issues such as heteroscedasticity (non-constant residual variance). The bottom-right plot, Residuals vs Leverage, shows that some of the outliers carry substantial leverage as well.
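For a formal check on the non-constant variance, a Breusch-Pagan test is one option; this is a sketch assuming the lmtest package is available (it is not loaded above):
# Breusch-Pagan test for heteroscedasticity; a small p-value
# indicates non-constant residual variance
# install.packages("lmtest")   # if needed
lmtest::bptest(fit)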
# Flag observations whose Cook's distance exceeds 3x the mean distance
cooksD <- cooks.distance(fit)
influential <- cooksD[cooksD > 3 * mean(cooksD, na.rm = TRUE)]
names_of_influential <- names(influential)
influential
## 18 36
## 2.3149630 0.2772922
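Base R can also plot Cook's distance directly, which makes the flagged observations easy to spot; a quick sketch:
# Cook's distance per observation (panel 4 of plot.lm),
# with the 3x-mean cutoff used above drawn as a dashed line
plot(fit, which = 4)
abline(h = 3 * mean(cooksD, na.rm = TRUE), lty = 2)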
Two states (observations 18 and 36) have a Cook's distance greater than three times the mean. Let's exclude these states and refit the model to see whether the fit improves.
# Drop the two influential rows, then refit without them
df <- data[names_of_influential, ]
data_without_outliers <- data %>% anti_join(df)
## Joining, by = c("State", "Expenditure", "Income")
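The same exclusion can be done in base R without a join, since the influential labels are row names of the data; a minimal equivalent sketch:
# Keep every row whose name is not in the influential set
data_without_outliers <- data[!(rownames(data) %in% names_of_influential), ]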
fit2 <- lm(Expenditure ~ Income, data = data_without_outliers)
summary(fit2)
##
## Call:
## lm(formula = Expenditure ~ Income, data = data_without_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.477 -36.981 -5.175 32.912 103.068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.509199 61.123294 -0.957 0.343
## Income 0.056243 0.008104 6.940 1.12e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.35 on 46 degrees of freedom
## Multiple R-squared: 0.5115, Adjusted R-squared: 0.5009
## F-statistic: 48.16 on 1 and 46 DF, p-value: 1.121e-08
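Before turning to the diagnostics, it helps to put the two fits side by side; a small sketch pulling the headline numbers from each model object:
# Residual standard error and R-squared, before vs. after exclusion
c(rse_fit = sigma(fit), rse_fit2 = sigma(fit2))    # 61.41 vs 49.35
c(r2_fit  = summary(fit)$r.squared,
  r2_fit2 = summary(fit2)$r.squared)               # 0.5868 vs 0.5115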
The residual standard error drops from 61.41 to 49.35, so the typical prediction error is noticeably smaller; R-squared is slightly lower, which can happen when extreme points that inflated the explained variation are removed. This demonstrates how influential outliers can be in regression models. Let's look at the diagnostic plots for our new model.
par(mfrow = c(2, 2))
plot(fit2)
Compared with our previous diagnostic plots, these are greatly improved. Looking again at the Residuals vs Leverage plot, no remaining points have substantial leverage, which supports the better-behaved fit. The refitted least-squares regression equation is: Expenditure = -58.51 + 0.056x.
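To visualize the effect of dropping the two states, both fitted lines can be drawn over the original scatterplot; a short sketch (colors are arbitrary choices):
# Original data with both fits overlaid
plot(data$Income, data$Expenditure, xlab = "Income", ylab = "Expenditure")
abline(fit,  col = 2, lwd = 2)   # fit on all states
abline(fit2, col = 4, lwd = 2)   # refit without observations 18 and 36
legend("topleft", legend = c("All states", "Without obs. 18 & 36"),
       col = c(2, 4), lwd = 2)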