Week 3: Gauss-Markov Assumptions and Residual Analysis
Corinne Willis
I. Gauss-Markov Assumptions
Stated Assumptions
If the following assumptions are met, ordinary least squares (OLS) regression produces the best linear unbiased estimator (BLUE).
- Linearity - There must be a linear relationship between y and x (i.e. linear in parameters and the error term).
- Full Column Rank / Non-Collinearity - There should be no perfect multi-collinearity.
- Exogeneity - The predictors are not correlated with the error terms.
- Homoscedasticity / Nonautocorrelation - Each error term has the same finite variance and is not correlated with any other error term.
- Data Generation - Data can be any mixture of constants and random variables, but must be generated by a mechanism that is unrelated to the error terms.
- Normal Distribution - The error terms should be normally distributed.
Non-Technical Explanation of Assumptions
- Linearity - There must be a linear relationship between y and x (i.e. linear in parameters and the error term), meaning that we keep things simple and straightforward in terms of how the different variables (or pieces of information) affect each other. This allows us to make predictions and interpret the results in a clear way that makes sense for the data we are reviewing. If the relationship between the variables were not linear but more complex, we would need more complex methods to review them.
- Full Column Rank / Non-Collinearity - There should be no perfect multi-collinearity meaning that each variable we use in an analysis should be unique so that it contributes something that other variables do not. When making predictions, this allows us to better understand a single variable’s specific impact on an outcome.
- Exogeneity - The predictors are not correlated with the error terms, meaning that the variables being reviewed are not influenced by the outcomes being measured. This ensures that the relationships we uncover between variables are meaningful and not distorted.
- Homoscedasticity / Nonautocorrelation - Each error term has the same finite variance and is not correlated with any other error term, meaning that the differences between our predictions and actual outcomes are caused by real factors. This helps ensure that our predictions are reliable and not influenced by extreme fluctuations or patterns in the data that we did not capture.
- Data Generation - Data can be any mixture of constants and random variables, but must be generated by a mechanism that is unrelated to the error terms meaning that the way data is collected for an analysis should not influence the outcome of the analysis. This will ensure that the analysis is fair and accurate.
- Normal Distribution - The error terms should be normally distributed meaning that the difference between predicted and actual outcomes (the error terms) follow a predictable pattern and are not just due to random chance.
Technical Explanation of Assumptions
Linearity - The model should be linear in parameters in that the relationship between the y and x variables can be expressed as shown below. \(\beta\) represents the parameters to be estimated and \(\epsilon\) is the error term.
\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ...+ \beta_px_p + \epsilon \]
Full Column Rank / Non-Collinearity - The independent variables should not be perfectly collinear, in that no independent variable can be expressed as a linear combination of the other independent variables. In other words, \(X\) is an \(n \times k\) matrix of full rank and the columns of the \(X\) matrix are linearly independent. If the matrix does not have full column rank, then \(X^TX\) is not invertible and we cannot compute the OLS estimator.
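As an illustration (not part of the original analysis), one way to check that a design matrix has full column rank in R is to compare the rank from a QR decomposition against the number of columns; the predictors below are an arbitrary example drawn from the built-in iris data.

# Illustrative check using the built-in iris data (not part of the original model)
X <- model.matrix(Petal.Length ~ Petal.Width + Sepal.Width, data = iris)
qr(X)$rank == ncol(X) # TRUE when no column is a linear combination of the others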
Exogeneity - The predictors are not correlated with the error terms, meaning that the error terms average out to zero for any value of X. This also means that none of the independent variables provide any information about the expected values of the error terms. This can be written as shown below.
\[ E[\epsilon \mid X] = 0 \]
Homoscedasticity / Nonautocorrelation - Each error term has the same finite variance and is not correlated with any other error term. Homoscedasticity states that the variance \(\sigma^2\) for \(\epsilon_i\) is the same for all \(i\). Nonautocorrelation states that knowing something about the error term for one observation tells us nothing about the error term for other observations. This can be written as shown below.
\[ \Omega = \sigma^2I \]
Data Generation - Data can be any mixture of constants and random variables, but must be generated by a mechanism that is unrelated to the error terms. In other words, \(X\) is collected using a random sampling methodology from the population, so that the sample is a representation of the population and statistical inference can be made about the parameters.
Normal Distribution - The error terms should be normally distributed. Although not necessary for the Gauss-Markov Theorem, it is often assumed for the purposes of conducting statistical tests and constructing confidence intervals. This can be written as shown below.
\[ \epsilon \mid X \sim N[0, \sigma^2 I] \]
II. Linear Regression
Load Data
The iris data set is a data frame with 150 observations of 5 variables named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
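The data are loaded from the iris data frame that ships with base R.

# Load the Iris data set
mydata <- iris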
Estimating Equation
For this analysis, I will create a model to understand the Petal Length of the Iris flower according to its relationship with the Petal Width. The y and x variables are both measured in centimeters.
\[ Petal.Length_i = \beta_0 + \beta_1 Petal.Width_i + \epsilon_i \]
Simple Linear Regression Model
# Run a simple linear regression
my_reg <- lm(Petal.Length ~ Petal.Width, data=mydata)
summary(my_reg)
Call:
lm(formula = Petal.Length ~ Petal.Width, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-1.33542 -0.30347 -0.02955 0.25776 1.39453
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.08356 0.07297 14.85 <2e-16 ***
Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4782 on 148 degrees of freedom
Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
As shown in the summary above, the regression of Iris Petal Length (y) on Petal Width (x) has an F-statistic p-value below 2.2e-16, which is significant at the 0.001 level and tells us the model is worth examining further. The R-squared values tell us that about 93% of the variation in Petal Length is explained by Petal Width. The median of the residuals is nearly 0 at -0.02955, which suggests the residuals are roughly symmetric around zero.
Both the intercept and the x-variable are statistically significant, with p-values below 2e-16. The coefficient estimate for the intercept (1.08356) is the estimated Petal Length when the Petal Width is 0. The coefficient estimate for the x-variable Petal Width (2.22994) tells us that if the Petal Width increases by 1 centimeter, the Petal Length is expected to increase by about 2.23 centimeters. This implies that an iris petal tends to be longer than it is wide.
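As a quick illustration of the fitted equation, the estimates imply a predicted Petal Length of roughly 1.08 + 2.23(1) ≈ 3.31 centimeters for a hypothetical Petal Width of 1 centimeter; the same value can be obtained with predict().

# Predicted Petal.Length at an illustrative Petal.Width of 1 cm
predict(my_reg, newdata = data.frame(Petal.Width = 1))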
III. Plots
Original Model
# Plot the simple linear regression model
plot(my_reg)

Residuals vs. Fitted
In the first plot, Residuals vs. Fitted, we see the fitted values on the x-axis and residuals on the y-axis. There is a dotted line at 0 and a red line that runs through the data points; ideally, the red line should stay close to 0. This plot tells us whether there is a linear relationship between the mean of the response and the explanatory variable. From the red line above, there appears to be a slight quadratic curvature, which indicates a possible violation of the linearity assumption. Regarding the constant variance assumption, we do not see a funnel pattern in which the points are more tightly clustered at one end.
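If only this panel is needed, plot() on an lm object accepts a which argument to draw a single diagnostic plot, for example:

# Re-draw only the Residuals vs. Fitted panel
plot(my_reg, which = 1)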
Q-Q Residuals
In the Q-Q Residuals plot, we have theoretical quantiles on the x-axis and standardized residuals on the y-axis. Ideally, the data points would follow closely along the dotted line. There do not appear to be very large deviations in the plot shown above, which indicates that we are not violating the normality assumption.
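As an optional complement to the visual check (not part of the original analysis), a Shapiro-Wilk test on the residuals gives a formal assessment of normality:

# Shapiro-Wilk normality test of the residuals; a large p-value is consistent with normality
shapiro.test(residuals(my_reg))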
Scale-Location
In the third plot, Scale-Location, the fitted values are on the x-axis and the square root of the absolute standardized residuals is on the y-axis. Ideally, we would see a trend line that is fairly flat, neither increasing nor decreasing. In the plot shown above, there does appear to be an increasing trend among the data points, which could indicate a possible violation of the constant variance assumption.
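A Breusch-Pagan test could back up this visual impression; the sketch below assumes the lmtest package is installed, which is not used in the original analysis.

# Breusch-Pagan test; a small p-value suggests non-constant error variance
library(lmtest)
bptest(my_reg)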
Residuals vs. Leverage
The last plot, Residuals vs. Leverage, has leverage on the x-axis and standardized residuals on the y-axis. Ideally, we would not see many data points outside of the red Cook's distance contour lines. In the plot shown above, those lines do not appear at all, which indicates that no data points fall outside of them. This means that we do not have any highly influential observations.
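Influence can also be checked numerically with Cook's distance; as an optional sketch, a common rule of thumb flags observations above 4/n.

# Flag observations whose Cook's distance exceeds the 4/n rule of thumb
cd <- cooks.distance(my_reg)
which(cd > 4 / length(cd))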
Transformed Model
# Transform all variables to natural log
mydata_transform <- transform(mydata,
                              ln_Petal.Length = log(mydata$Petal.Length),
                              ln_Petal.Width = log(mydata$Petal.Width))

# Run the simple linear regression again with the log transformed y and x variables
my_reg2 <- lm(ln_Petal.Length ~ ln_Petal.Width, data=mydata_transform)
summary(my_reg2)
Call:
lm(formula = ln_Petal.Length ~ ln_Petal.Width, data = mydata_transform)
Residuals:
Min 1Q Median 3Q Max
-0.50884 -0.05956 -0.00564 0.07763 0.46510
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.27491 0.01279 99.66 <2e-16 ***
ln_Petal.Width 0.57959 0.01286 45.07 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1543 on 148 degrees of freedom
Multiple R-squared: 0.9321, Adjusted R-squared: 0.9316
F-statistic: 2031 on 1 and 148 DF, p-value: < 2.2e-16
When compared to the original regression, this log transformed analysis has the same reported p-value (< 2.2e-16) and a slightly higher R-squared (0.9321 versus 0.9271). The median of the residuals is now closer to 0 at -0.00564, which tells us that the transformation made the residuals slightly more symmetric.
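As a side note, the same transformed fit could have been obtained without creating new columns by applying log() directly inside the formula, for example:

# Equivalent model with the log transformation applied inside the formula
my_reg2_alt <- lm(log(Petal.Length) ~ log(Petal.Width), data = mydata)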
Next, I will plot this new regression to see whether this looks like a better fit than the original.
# Plot the linear regression of log transformed y and x variables
plot(my_reg2)

Residuals vs. Fitted
In the Residuals vs. Fitted plot, we no longer see the quadratic curvature that appeared in the original regression model's plot, and we still do not see a funnel type of pattern in the data points. These are both good indications that the transformed model does not violate the linearity or constant variance assumptions.
Q-Q Residuals
In the Q-Q Residuals plot, we are seeing more deviation from the reference line than before, which means there is a possibility that we are violating the normality assumption. However, the deviations do not seem too drastic, so this may still be acceptable.
Scale-Location
In the third plot, Scale-Location, there is no longer an increasing trend. The spread appears more constant after the log transformation, so the constant variance assumption does not appear to be violated.
Residuals vs. Leverage
The last plot, Residuals vs. Leverage, still does not show the red Cook's distance contour lines, which indicates that no data points fall outside of them. This means that the transformed model still does not have any highly influential observations.
IV. Conclusion
The Iris dataset contains three different species, which may be the cause of some of the irregularity in the data, as each species has its own distinct qualities that tend to cluster into three groups. Focusing on a single species might have reduced these irregularities, as sketched below. Ultimately, the log transformation was still able to make the residuals slightly more symmetric, address the apparent linearity and constant-variance issues, and produce a somewhat better-fitting regression model.
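As a possible follow-up to that idea (not carried out here), the same regression could be re-fit within a single species, for example setosa, and its diagnostics compared to the pooled model:

# Re-fit the regression within one species (setosa chosen for illustration)
my_reg_setosa <- lm(Petal.Length ~ Petal.Width,
                    data = subset(mydata, Species == "setosa"))
summary(my_reg_setosa)
plot(my_reg_setosa)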