library(AppliedPredictiveModeling)
library(readr)
library(foreign)
library(MASS)
library(tidyverse)

Load data

data <- read.csv("C:/Users/chris/OneDrive - University of Texas at San Antonio/DATA ANALYTICS  MSDA/SPRING 2023/STA 6543 Predictive Modeling/R/Income.txt", sep="")

a) Draw a scatter-plot to check the relationship between Income and Expenditure.

plot(data$Income, data$Expenditure, xlab="Income", ylab="Expenditure")

b) Find and interpret the slope for the least squares regression line

plot(data$Income, data$Expenditure, xlab="Income", ylab="Expenditure")

fit= lm(Expenditure~Income, data=data)
summary(fit)
## 
## Call:
## lm(formula = Expenditure ~ Income, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.390  -42.146   -6.162   30.630  224.210 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -151.26509   64.12183  -2.359   0.0224 *  
## Income         0.06894    0.00835   8.256 9.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.41 on 48 degrees of freedom
## Multiple R-squared:  0.5868, Adjusted R-squared:  0.5782 
## F-statistic: 68.16 on 1 and 48 DF,  p-value: 9.055e-11
abline(fit, lwd=2,col=2)

Linear Regression Equation

The least-squares regression line or the regression equation from the computer output is:

y = 0.069x - 151.265

The slope of the least square regression line from the regression equation, or the coefficient is 0.069. The meaning of the slope of this least-squares regression line is that for every one per capita increase in income, this model predicts a 0.069 increase in per capita spending in public school expenditure.

c) Find and interpret y-intercept for the least squares regression line

The y-intercept of the least-squares regression line, y = - 151.265 + 0.069x, has a negative intercept of -151.265, and it is the estimated value of y when x = 0, and it is the value at which the regression line crosses the y-axis. The y-intercept of -151.265 means that when income is zero, the expected expenditure is -151.265.

d) Find the least square regression equation and circle the results from your outputs.

summary(fit)
## 
## Call:
## lm(formula = Expenditure ~ Income, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.390  -42.146   -6.162   30.630  224.210 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -151.26509   64.12183  -2.359   0.0224 *  
## Income         0.06894    0.00835   8.256 9.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.41 on 48 degrees of freedom
## Multiple R-squared:  0.5868, Adjusted R-squared:  0.5782 
## F-statistic: 68.16 on 1 and 48 DF,  p-value: 9.055e-11

Regression Equation

The least square regression equation is given by: y = mx + b

where, y = Expenditure (DV) m = slope b = y-intercept

From my output, the fitted least square regression equation is: Expenditure = 0.0689*Income - 151.265

e) Find proportion of the variation that can be explained by the least squares regression line (i.e., R2).

We can see in the computer output that R-Sq = 0.5868. This means that about 59% of the variation in expenditure can be explained by income.

f) Find the estimator of σ2 (i.e., s2).

The common error variance σ2 is estimated by the mean squared error MSE=∑(yi−yi^)2/(n−2) where the numerator is the sum of squared errors (SSE). The n-2 is the degrees of freedom (2 for the intercept + explanatory variable).

From the data output, the MSE = 61.41^2 = 3771.19

g) From part a), do you see any outlier or influential points in this dataset? If so, please remove these points and re-find the least squares regression equation.

par(mfrow = c(2,2))
plot(fit)

In looking at the diagnostic plots I see that there are indeed some outliers (among other issues such as heteroscedasticity). If you look at the plot on the bottom right, Residuals vs Leverage, you’ll see that some of the outliers have some significant leverage as well.

Using the Cook’s Distance method to remove Outliers from Dataset

cooksD <- cooks.distance(fit)
influential <- cooksD[(cooksD > (3 * mean(cooksD, na.rm = TRUE)))]
names_of_influential <- names(influential)
influential
##        18        36 
## 2.3149630 0.2772922

We see that 2 States have a Cook’s distance greater than 3x the mean. Let’s exclude these States and rerun the model to see if we have a better fit.

df <- data[names_of_influential,]
data_without_outliers <- data %>% anti_join(df)
## Joining, by = c("State", "Expenditure", "Income")
fit2 <- lm(Expenditure ~ Income, data = data_without_outliers)
summary(fit2)
## 
## Call:
## lm(formula = Expenditure ~ Income, data = data_without_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.477 -36.981  -5.175  32.912 103.068 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -58.509199  61.123294  -0.957    0.343    
## Income        0.056243   0.008104   6.940 1.12e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.35 on 46 degrees of freedom
## Multiple R-squared:  0.5115, Adjusted R-squared:  0.5009 
## F-statistic: 48.16 on 1 and 46 DF,  p-value: 1.121e-08

The model fit has improved substantially. This demonstrates how influential outliers can be in regression models. Let’s look at the diagnostic plots for our new model.

par(mfrow = c(2, 2))
plot(fit2)

In comparing with our previous diagnostic plots, these plots are greatly improved. Looking again at the Residuals vs Leverage plot, we see that we don’t have any remaining points with significant leverage, leading to a better fit for our model.The re-find least square regression equation is: Expenditure = 0.056x - 58.51.