# Clear the workspace
rm(list = ls()) # Clear all files from your environment
# gc() # Clear unused memory
cat("\f") # Clear the console
graphics.off() # Clear all graphs
# Clear the workspace
rm(list = ls()) # Clear all files from your environment
# gc() # Clear unused memory
cat("\f") # Clear the console
graphics.off() # Clear all graphs
# Loads specified data sets, or list the available data sets.
data()
# Store the data mtcars as a dataframe
df <- as.data.frame(mtcars)
# Create a scatter plot
plot(x = mtcars$wt,
y = mtcars$mpg,
xlab = "Weight",
ylab = "Miles Per Gallon",
main = "Cars Dataset"
)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
# fitted and residual
When you run the plot()
function on a linear regression model, it produces a four-panel diagnostic plot that includes a histogram of the residuals, a Normal QQ plot of the residuals, a plot of the residuals vs. fitted values, and a plot of the residuals vs. leverage.
# Set up the plotting region / graphical parameters to have four subplots
par(mfrow = c(2, 2))
# Create a residual plot
plot(model)
You can also create individual residual plots by specifying the which argument. Here’s an example:
# Create a residual plot of residuals vs. fitted values
plot(model,
which = 1
)
This will produce a plot of the residuals against the fitted values.
In a residual vs. fitted plot after running the lm()
command in R, the red line represents the fitted line through the residuals. This line is often called the “zero line” or the “residuals equals zero line”, and represents the line of perfect prediction. The closer the residuals are to the zero line, the better the model fits the data.
If there is a pattern or trend in the residual vs. fitted plot, such as a curve or a funnel shape, it suggests that the model is not adequately capturing the variability in the data. In such cases, you may need to consider -
adding additional predictors to the model,
transforming the response or predictor variables, or
using a different model altogether.
It’s important to examine the residual vs. fitted plot along with other diagnostic plots, such as the QQ plot of the residuals and the scale-location plot, to assess the adequacy of the linear regression model. See next sections.
Alternatively, you can use the ggplot2
package to create customized residual plots. Here’s an example:
library(ggplot2)
# Create a data frame of fitted values and residuals
df <- data.frame(fitted = model$fitted.values,
residual = model$residuals
)
# Create a residual plot using ggplot2
ggplot(data = df,
mapping = aes(x = fitted,
y = residual)
) +
geom_point() +
geom_smooth(method = "loess") +
xlab("Fitted Values") +
ylab("Residuals") +
ggtitle("Residual Plot")
`geom_smooth()` using formula = 'y ~ x'
In this example, we first create a data frame of fitted values and residuals from the linear regression model using the $
operator. We then use ggplot()
and geom_point()
to create a scatterplot of the residuals against the fitted values. We also add a smoothed line to the plot using geom_smooth()
with the method = "loess"
argument, and add axis labels and a title using xlab()
, ylab()
, and ggtitle()
.
# Create a QQ plot of the residuals
plot(model,
which = 2
)
When you run the plot()
function on a linear regression model with which = 2
, it produces a Normal QQ plot of the residuals.
A Q-Q (quantile-quantile) plot is a graphical technique that allows us to compare the distribution of a sample with a theoretical distribution. In a Q-Q plot, the quantiles of the sample data are plotted against the corresponding quantiles of a theoretical distribution, such as the normal distribution. Here are some key takeaways from a Q-Q plot:
Linearity: If the Q-Q plot forms a straight line, then the sample data is approximately normally distributed. If the line deviates from a straight line, then the sample data is not normally distributed1.
Slope: The slope of the line in the Q-Q plot indicates the relative variance of the sample data compared to the theoretical distribution. If the slope is steeper than the line for the theoretical distribution, then the sample data has a higher variance. If the slope is shallower than the line for the theoretical distribution, then the sample data has a lower variance.
Outliers: Outliers in the sample data may be visible in a Q-Q plot as points that deviate from the straight line. These outliers may indicate a departure from the normal distribution, or they may be due to sampling error or measurement error.
Skewness: If the line in the Q-Q plot is not centered on the diagonal line, then the sample data may be skewed.
Overall, a Q-Q plot can provide valuable insights into the distribution of a sample data set, and can help us to identify departures from a theoretical distribution, such as the normal distribution.
You can also use the ggplot2
package to create customized QQ plots of the residuals. Here’s an example:
library(ggplot2)
# Create a data frame of the residuals
df <- data.frame(residuals = model$residuals)
# Create a QQ plot using ggplot2
ggplot(data = df,
mapping = aes(sample = residuals)
) +
stat_qq() +
stat_qq_line() +
xlab("Theoretical Quantiles") +
ylab("Sample Quantiles") +
ggtitle("Normal QQ Plot of Residuals")
In this example, we first create a data frame of the residuals from the linear regression model using the $
operator. We then use ggplot()
with aes()
to specify that we want to create a QQ plot of the residuals, and add a stat_qq()
layer to create the QQ plot itself. We also add a stat_qq_line()
layer to add a reference line to the plot, and add axis labels and a title using xlab()
, ylab()
, and ggtitle()
.
# Create a scale location plot
plot(model,
which = 3
)
A “scale-location” or “spread-location” chart is a graphical tool used to assess the assumptions of homoscedasticity (constant variance of residuals) in a linear regression model. In this type of chart, you typically plot the square root of the absolute standardized residuals against the fitted values of the model.
Standardized residuals are the residuals (the differences between observed and predicted values) of the model divided by their estimated standard deviation. These standardized residuals are often referred to as “Studentized residuals.”
\[Standardized \ Residual_i = \dfrac{Residual_i}{Estimated \ Standard \ Deviation \ of \ Residuals}\]
When you call rstandard(model)
, it calculates the standardized residuals for all observations in the model and returns a vector of these standardized residuals. These standardized residuals are useful for assessing the goodness of fit, identifying outliers, and checking the assumptions of the linear regression model.
residuals(model)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
-2.2826106 -0.9197704 -2.0859521 1.2973499
Hornet Sportabout Valiant Duster 360 Merc 240D
-0.2001440 -0.6932545 -3.9053627 4.1637381
Merc 230 Merc 280 Merc 280C Merc 450SE
2.3499593 0.2998560 -1.1001440 0.8668731
Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
-0.0502472 -1.8830236 1.1733496 2.1032876
Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
5.9810744 6.8727113 1.7461954 6.4219792
Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
-2.6110037 -2.9725862 -3.7268663 -3.4623553
Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
2.4643670 0.3564263 0.1520430 1.2010593
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
-4.5431513 -2.7809399 -3.2053627 -1.0274952
?rstandard
rstandard(model)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
-0.76616765 -0.30743051 -0.70575249 0.43275114
Hornet Sportabout Valiant Duster 360 Merc 240D
-0.06681879 -0.23148309 -1.30552216 1.38889709
Merc 230 Merc 280 Merc 280C Merc 450SE
0.78392687 0.10010803 -0.36728706 0.29288651
Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
-0.01683789 -0.63159969 0.42296071 0.76979873
Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
2.17353314 2.33490215 0.61035691 2.21708271
Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
-0.87964013 -0.99313634 -1.24418015 -1.16279098
Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
0.82771968 0.12244407 0.05177187 0.42254270
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
-1.51549710 -0.93086929 -1.07151943 -0.34388215
Loading required package: psych
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 32 0 3 -0.13 -0.27 3.05 -4.54 6.87 11.42 0.64 -0.3 0.53
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 32 0.01 1.02 -0.04 -0.09 1.03 -1.52 2.33 3.85 0.68 -0.24 0.18
You might use standardized residuals to create diagnostic plots, such as a residuals versus leverage plot or a Q-Q plot, to evaluate the model’s assumptions and identify influential observations or outliers.
To interpret the scale-location plot, look at -
Vertical Spread: Examine the spread of the points along the y-axis (square root of absolute standardized residuals). Ideally, you want the points to be evenly spread out with no discernible funnel shape. If the spread remains roughly consistent as you move along the x-axis, it indicates that the variance of residuals is roughly constant across the range of fitted values.
Issues: If you observe a fan-shaped pattern or a clear trend (increasing or decreasing spread) as the fitted values increase, it suggests heteroscedasticity, which violates the assumption of constant variance.
In summary, a scale-location plot helps you assess whether the assumption of homoscedasticity is met in your linear regression model. If the plot shows no clear pattern and the spread of residuals remains relatively constant as fitted values increase, the assumption is likely met. If you observe a discernible pattern or increasing/decreasing spread, it may indicate a violation of the assumption and the need to consider alternative modeling techniques or transformations.
This can help ensure you understand what exactly is R plotting above.
# pull out the x and y variable that you want to plot
scale_location_data <- data.frame(
Fitted_Values = fitted(model),
Sqrt_Abs_Std_Residuals = sqrt(abs(rstandard(model)))
)
# Create a scale-location (spread-location) plot with ggplot2 package
scale_location_plot <- ggplot(data = scale_location_data,
mapping = aes(x = Fitted_Values,
y = Sqrt_Abs_Std_Residuals)
) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE,
color = "blue") + # Add a fitted line
labs(x = "Fitted Values",
y = "Square Root of Absolute Standardized Residuals") +
ggtitle("Scale-Location (Spread-Location) Plot")
scale_location_plot
`geom_smooth()` using formula = 'y ~ x'
# Create a residual versus leverage chart
plot(model,
which = 4
)
# Add Cook's distance values to the plot
abline(h = 0.5,
col = "red",
lty = 1) # Add a horizontal line at the threshold
A “residual versus leverage” plot, often referred to as a “residual plot” or “residuals vs. leverage plot,” is a graphical tool used to assess the influence of individual data points on the fit of a linear regression model. This plot helps you identify outliers and influential observations.
Interpret the residuals vs. leverage plot:
Residuals: The vertical axis represents the residuals, which are the differences between the observed values and the predicted values from your linear regression model.
Leverage: The horizontal axis represents the leverage values for each data point. Leverage measures how much influence a data point has on the regression coefficients. High leverage values indicate that a data point has the potential to influence the model significantly.
Identification of Outliers: Look for data points that have large absolute residuals (far from zero) and high leverage values (far from the center). These points are potential outliers and influential observations. Outliers have large residuals, while influential observations have high leverage. SO POINTS IN THE TOP RIGHT AND BOTTOM LEFT in particular.
Influence: The influence of a data point is determined not only by its leverage but also by how much it affects the model parameters (coefficients). Points that are both high in leverage and have large residuals exert a strong influence on the regression model.
Patterns: Examine any patterns in the plot. For example, if you see a data point with a large positive residual and high leverage, it may be an influential observation that is driving the regression line away from the majority of the data.
In summary, a residuals vs. leverage plot helps you identify potential outliers and influential observations in your linear regression model. Points that deviate significantly from the center of the plot (high leverage) and have large residuals can have a substantial impact on the model’s coefficients. Careful consideration of these points may be necessary to assess their impact on the validity of your regression analysis.
Again, this can help ensure you understand what exactly is R plotting above.
# Extract residuals and leverage values from the model
residuals_vs_leverage_data <- data.frame(
Leverage = hatvalues(model),
Residuals = residuals(model)
)
# Create a residual versus leverage plot
residuals_vs_leverage_plot <- ggplot(residuals_vs_leverage_data, aes(x = Leverage, y = Residuals)) +
geom_point() +
labs(x = "Leverage", y = "Residuals") +
ggtitle("Residuals vs. Leverage Plot")
# Show the plot
print(residuals_vs_leverage_plot)
Lets try to bring the chart closer to what R originally generated. Try adding the contour lines if you prefer, or I just print out the names of observations that are potentailly troublesome (Cook’s distance > 0.05)
# Calculate Cook's distance values
cooksd <- cooks.distance(model)
# Create a data frame with residuals, leverage, and Cook's distance
residuals_vs_leverage_data <- data.frame(
Leverage = hatvalues(model),
Residuals = residuals(model),
Cooks_Distance = cooksd
)
# Create a residual versus leverage plot
residuals_vs_leverage_plot <- ggplot(residuals_vs_leverage_data,
aes(x = Leverage, y = Residuals)) +
geom_point(aes(size = Cooks_Distance), alpha = 0.7) + # Size of points indicates Cook's distance
labs(x = "Leverage",
y = "Residuals") +
ggtitle("Residuals vs. Leverage Plot with Cook's Distance") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") + # Add a horizontal reference line at y = 0
scale_size_continuous(range = c(2, 10), breaks = c(0.05, 0.1, 0.2), labels = c("0.05", "0.1", "0.2")) + # Adjust point size and labels
theme_minimal()
# Label observations with Cook's distance > 0.05
label_threshold <- 0.05
residuals_vs_leverage_plot +
geom_text(data = subset(residuals_vs_leverage_data, Cooks_Distance > label_threshold),
aes(label = rownames(subset(residuals_vs_leverage_data, Cooks_Distance > label_threshold))),
position = position_nudge(y = 0.15), col = "red", size = 3)
When we say that a sample data is not normally distributed, we mean that the distribution of the data does not follow a normal or Gaussian distribution.
In a normal distribution, the data is symmetric around the mean, with the majority of the data clustered around the mean and fewer observations at the tails of the distribution. The shape of the distribution is bell-shaped, with the tails tapering off gradually on either side of the mean.
When the sample data is not normally distributed, it may exhibit a variety of shapes and patterns. For example, the distribution may be skewed, with a longer tail on one side of the distribution than the other. Alternatively, the distribution may be bimodal, with two distinct peaks in the data. Other types of non-normal distributions include uniform distributions, which have equal frequencies across the range of values, and exponential distributions, which have a right-skewed shape.
When we perform statistical analyses on non-normal data, the results may not be reliable or accurate. Therefore, it is important to assess the normality of our data before conducting any statistical tests or modeling. We can use graphical techniques such as histograms, box plots, and Q-Q plots to visualize the distribution of our data and identify any departures from normality.↩︎