Anscombe’s quartet demonstrates the importance of graphing data when analyzing it: its four data sets share nearly identical summary statistics yet look completely different when plotted.
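As a quick illustration (not part of the original analysis), R ships with the built-in anscombe data frame; a minimal sketch fitting and plotting each of the four x/y pairs shows nearly identical coefficients but very different scatterplots:

# Fit and plot each of the four Anscombe pairs; the printed coefficients are
# almost identical even though the scatterplots differ dramatically
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  plot(x, y, main = paste("Set", i))
  abline(fit, col = "red")
  print(coef(fit))
}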
The file “satgpa.csv” contains information regarding SAT scores and GPAs.
data <- read.csv('satgpa.csv')
head(data, n=2)
## sex sat_v sat_m sat_sum hs_gpa fy_gpa
## 1 1 65 62 127 3.4 3.18
## 2 2 58 64 122 4.0 3.33
par(mfrow = c(1, 2))
# Kernel density estimate of the combined SAT score
kd_sat_sum <- density(data$sat_sum)
plot(kd_sat_sum)
polygon(kd_sat_sum, col = 'steelblue', border = 'black')
# Kernel density estimate of first-year GPA
kd_fy_gpa <- density(data$fy_gpa)
plot(kd_fy_gpa)
polygon(kd_fy_gpa, col = 'steelblue', border = 'black')
We now fit a simple linear regression model predicting fy_gpa from sat_sum.
model <- lm(fy_gpa ~ sat_sum, data = data)
model
##
## Call:
## lm(formula = fy_gpa ~ sat_sum, data = data)
##
## Coefficients:
## (Intercept) sat_sum
## 0.001927 0.023866
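The fitted coefficients can be used directly for prediction. As a sketch, predicting the first-year GPA for a hypothetical total SAT score of 120 (an arbitrary value within the range seen above):

# predict() applies the fitted equation 0.001927 + 0.023866 * sat_sum,
# giving roughly 2.87 for sat_sum = 120
predict(model, newdata = data.frame(sat_sum = 120))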
Linear regression makes several assumptions about the data: a linear relationship between predictor and response, independence of the errors, homoscedasticity (constant error variance), and normally distributed residuals. The standard diagnostic plots below help check these assumptions.
# All four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model)
# Residuals vs Fitted
plot(model, 1)
Ideally, this plot should show no discernible pattern in the residuals, and the red line should be approximately horizontal at zero. To address non-linearity, nonlinear transformations of the predictors, such as log(x), sqrt(x), and x^2, can be employed.
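For example, a quadratic term could be tried; this is only a sketch of the mechanics, with no claim that it improves the fit for this particular data:

# Hypothetical alternative model with a second-degree polynomial in sat_sum
model_quad <- lm(fy_gpa ~ poly(sat_sum, 2), data = data)
plot(model_quad, 1)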
# Normal Q-Q
plot(model, 2)
The points in the Q-Q plot should fall approximately along the reference line, indicating that the residuals are roughly normally distributed.
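A complementary numerical check of residual normality, not part of the original analysis, is the Shapiro-Wilk test (it accepts at most 5000 observations; output not shown):

# Shapiro-Wilk test on the model residuals; a small p-value would indicate
# departure from normality
shapiro.test(residuals(model))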
# Scale-Location
plot(model, 3)
Homoscedasticity, or equal variance of the residuals, holds when the residuals are spread evenly across the range of fitted values. A good sign is a roughly horizontal red line with evenly scattered points.
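A formal companion to this visual check is the non-constant variance score test ncvTest() from the car package (also loaded below for the Durbin-Watson test); shown here as a sketch, output not displayed:

# Score test of the null hypothesis of constant error variance
library(car)
ncvTest(model)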
# Cook's distance
plot(model, 4)
# Residuals vs Leverage
plot(model, 5)
Observations with standardized residuals exceeding 3 in absolute value are possible outliers. Data points lying outside the Cook’s distance lines exert a substantial influence on the regression, and excluding them could change the results of the analysis.
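Both thresholds can also be checked programmatically; the 4/n cutoff for Cook’s distance used below is a common rule of thumb, not the only choice:

# Observations with standardized residuals larger than 3 in absolute value
std_res <- rstandard(model)
which(abs(std_res) > 3)
# Cook's distances above a rule-of-thumb cutoff of 4/n
cooks_d <- cooks.distance(model)
which(cooks_d > 4 / nrow(data))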
Autocorrelation of the residuals suggests information the model fails to capture. In time series data, it usually means past values are being ignored; in non-time-series data, it can indicate systematic bias, with the model under- or over-predicting under specific conditions, and it may also stem from a violation of the linearity assumption.
library(car)
## Loading required package: carData
durbinWatsonTest(model)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02003578 2.039963 0.534
## Alternative hypothesis: rho != 0
The p-value of 0.534 is greater than the typical significance level of 0.05, so we fail to reject the null hypothesis; there is no evidence of significant autocorrelation in the residuals.