Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Introduction
I decided to use the ‘Iris’ data set for this discussion question.
Petal.Length will be predicted via a multiple regression model using the
following terms: ‘Sepal.Length’, ‘Sepal.Length2’, ‘is_versicolor’,
‘Sepal.Width’, ‘Petal.Width’, and ‘is_versicolor:Sepal.Width’.
Brief Visualization of the Data
Scatter plots of the variables suggest linear relationships between
some of the variables, most notably between Petal.Width and
Petal.Length. Many of the scatter plots also show two distinct groups
of observations, with one group displaying less variance/spread, e.g.
Petal.Length vs. Sepal.Width.
pairs(iris, gap=0.5)
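One way to see which observations form the separate group is to color the points by species. The snippet below is a minimal sketch; the color palette is an arbitrary choice for illustration and is not part of the original analysis.

```r
data(iris)

# Hypothetical palette: one color per species (any three distinct colors work).
species_cols <- c(setosa = "steelblue",
                  versicolor = "darkorange",
                  virginica = "forestgreen")

# Look up a color for every row by its species name.
point_cols <- species_cols[as.character(iris$Species)]

# Scatter plot matrix of the four measurements, colored by species.
pairs(iris[, 1:4], gap = 0.5, col = point_cols, pch = 19)
```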
Variable creation
The variable ‘is_versicolor’ was created to satisfy the dichotomous
variable requirement for the assignment. The versicolor species is
denoted in the data as ‘1’ and any other species is denoted as ‘0’.
The variable ‘Sepal.Length2’ was created to satisfy the quadratic variable requirement.
In the scatter plots that include the new variables, the two groups of observations remain visible.
#dichotomous variable
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
#quadratic variable for Sepal.Length
iris$Sepal.Length2 <- iris$Sepal.Length^2
pairs(iris, gap=0.5)
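As a quick sanity check on the two derived columns, the following sketch cross-tabulates the indicator against Species and confirms that the squared column matches Sepal.Length multiplied by itself. The name `tab` is introduced here only for illustration.

```r
data(iris)
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2

# Each species should fall entirely into one column of the indicator.
tab <- table(iris$Species, iris$is_versicolor)
print(tab)

# The quadratic column should equal Sepal.Length times itself.
all.equal(iris$Sepal.Length2, iris$Sepal.Length * iris$Sepal.Length)
```

Writing `I(Sepal.Length^2)` directly in the model formula would be an equivalent way to add the quadratic term without creating a separate column.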
Linear Model
Below is the model.
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width,
            data = iris)
summary(model)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
## Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95073 -0.18402 -0.01191 0.19097 1.16467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.42472 1.39089 1.024 0.307
## Sepal.Length 0.13233 0.47782 0.277 0.782
## Sepal.Length2 0.04651 0.03805 1.223 0.224
## is_versicolor -0.54263 0.49629 -1.093 0.276
## Sepal.Width -0.61633 0.09141 -6.742 3.56e-10 ***
## Petal.Width 1.48217 0.07356 20.150 < 2e-16 ***
## is_versicolor:Sepal.Width 0.24662 0.17062 1.445 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3136 on 143 degrees of freedom
## Multiple R-squared: 0.9697, Adjusted R-squared: 0.9684
## F-statistic: 763.2 on 6 and 143 DF, p-value: < 2.2e-16
Residuals
The median is close to zero, and the first and third quartiles are of
similar magnitude (-0.184 and 0.191), which hints that the residuals are
roughly symmetric about zero and is consistent with a normal
distribution. The minimum and maximum values (-0.951 and 1.165) are also
of similar magnitude, which is consistent with that symmetry and
provides some evidence that the model is a reasonable fit.
Coefficients
Most coefficients in this model are not statistically significant; the
exceptions are Petal.Width and Sepal.Width (p-value < 0.05).
The t-value for Petal.Width is 20.150. This high t-value indicates that the Petal.Width coefficient is highly statistically significant in predicting Petal.Length.
Sepal.Width has a smaller t-value in absolute terms (t = -6.742), but it is also statistically significant (p-value < 0.05) and can be considered a useful predictor of Petal.Length.
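To make the interaction and quadratic terms easier to read, the sketch below derives the implied Sepal.Width slope for each group and the marginal effect of Sepal.Length from the fitted coefficients. It refits the same model so that it runs on its own; the helper names (`slope_other`, `slope_versicolor`, `sepal_length_effect`) are introduced here for illustration only.

```r
data(iris)
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width,
            data = iris)
b <- coef(model)

# Sepal.Width slope for non-versicolor flowers (is_versicolor = 0).
slope_other <- b[["Sepal.Width"]]

# Sepal.Width slope for versicolor: main effect plus interaction,
# i.e. -0.61633 + 0.24662 = -0.36971 using the reported estimates.
slope_versicolor <- b[["Sepal.Width"]] + b[["is_versicolor:Sepal.Width"]]

# With a quadratic term, the effect of Sepal.Length depends on its value:
# d(Petal.Length)/d(Sepal.Length) = b1 + 2 * b2 * Sepal.Length.
sepal_length_effect <- function(x) {
  b[["Sepal.Length"]] + 2 * b[["Sepal.Length2"]] * x
}

c(other = slope_other, versicolor = slope_versicolor)
sepal_length_effect(mean(iris$Sepal.Length))
```

Read this way, and holding the other predictors fixed, each additional unit of Sepal.Width is associated with a change of about -0.62 in Petal.Length for non-versicolor flowers and about -0.37 for versicolor, while the is_versicolor coefficient (-0.54) is the shift in the intercept for versicolor when Sepal.Width is zero. Since the interaction, the indicator, and both Sepal.Length terms are not statistically significant, these differences should be read with caution.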
Residual Standard Error and Degrees of Freedom
Generally speaking, for normally distributed residuals the residual
standard error is approximately 1.5 times the magnitude of the 1st and
3rd quartile residuals. Here the RSE (0.3136) is about 1.6 to 1.7 times
the quartiles (-0.184 and 0.191), somewhat above the 1.5 expected under
normality, which hints at tails slightly heavier than normal. The RSE is
still relatively low, but this comparison provides some evidence that
the model may be improved.
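The 1.5 rule of thumb comes from the normal distribution, whose quartiles sit at about 0.6745 standard deviations from the mean, so the standard deviation is about 1.48 times the quartile. The sketch below compares that reference value with the ratio observed for this model; it refits the model so that it runs on its own, and the names `rse`, `q` and `ratio` are illustrative.

```r
data(iris)
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width,
            data = iris)

# Residual standard error as reported by summary(model).
rse <- sigma(model)

# First and third quartiles of the residuals.
q <- quantile(resid(model), c(0.25, 0.75))

# Observed ratio of the RSE to the size of each quartile.
ratio <- rse / abs(q)
ratio

# Reference ratio for a normal distribution.
1 / qnorm(0.75)
```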
The Multiple R-squared Value
The reported R-squared of 0.9697 for this model means that 96.97% of
the variability in Petal.Length is explained by the model.
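The R-squared value can be reproduced from the residual and total sums of squares, which makes the "proportion of variability explained" reading concrete. This is a minimal sketch that refits the model so it runs on its own; `ss_res`, `ss_tot`, `r2` and `adj_r2` are illustrative names.

```r
data(iris)
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width,
            data = iris)

y <- iris$Petal.Length
n <- length(y)                 # 150 observations
p <- length(coef(model)) - 1   # 6 predictors, excluding the intercept

ss_res <- sum(resid(model)^2)    # variability left unexplained
ss_tot <- sum((y - mean(y))^2)   # total variability in Petal.Length

r2 <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```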
Below, residual analysis is conducted in the form of a Residual versus Fitted Value Plot and a Q-Q Plot.
Residual versus Fitted Value Plot
For a Residual versus Fitted Value Plot to support the linear model,
residuals should be scattered evenly around the horizontal line where
the residual equals zero. The Residual versus Fitted Value Plot below
shows two separate groupings of observations, which provides possible
evidence that (1) the relationship is not fully linear and (2) the
constant-variance (homoscedasticity) assumption is violated, since the
residuals of one group are more scattered than those of the other,
indicating possible subgroups.
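One hedged way to check whether the two groupings line up with species is to compare the spread of the residuals by species and to color the residual plot accordingly. The sketch below refits the model so it runs on its own; the palette and the name `res_sd_by_species` are assumptions made for illustration.

```r
data(iris)
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width,
            data = iris)

res <- resid(model)

# Standard deviation of the residuals within each species;
# clearly unequal values would point toward non-constant variance.
res_sd_by_species <- tapply(res, iris$Species, sd)
res_sd_by_species

# Residuals versus fitted values, colored by species (arbitrary palette).
species_cols <- c(setosa = "steelblue",
                  versicolor = "darkorange",
                  virginica = "forestgreen")
plot(fitted(model), res,
     col = species_cols[as.character(iris$Species)], pch = 19,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```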
Q-Q Plot
For the Q-Q Plot to support our linear model, we would expect the
plotted values to follow a straight line, indicating that the residuals
are normally distributed. Below, our model’s Q-Q Plot suggests that the
distribution of the residuals is approximately normal. However, both the
right and left tails deviate slightly from the expected straight line,
pointing to possible skew or heavy tails and possible outliers. This
suggests that the model could be improved.
# diagnostic plots for the linear model, arranged in a 2 x 2 grid
par(mfrow=c(2,2))
plot(model)
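The visual reading of the Q-Q Plot can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test from base R to the residuals; this test is an addition for illustration and was not part of the original analysis. A small p-value would indicate a departure from normality. The model is refit so the snippet runs on its own.

```r
data(iris)
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width,
            data = iris)

# Shapiro-Wilk test of the null hypothesis that the residuals are normal.
sw <- shapiro.test(resid(model))
sw

# Stand-alone Q-Q plot of the residuals with a reference line.
qqnorm(resid(model))
qqline(resid(model))
```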
Conclusion
In conclusion, the model could be greatly refined. Subgroups are almost
definitely present in the data. I think the linear model could be
appropriate for this data if the presence of the subgroups could be
accounted for. Furthermore, backward elimination could help simplify
the model by dropping non-significant terms, and examining the
residuals by Species could help identify the subgroups. The comparison
of the RSE with the 1st/3rd quartile residuals, together with the
diagnostic plots, suggests that the residuals are not exactly normally
distributed, so the underlying data would have to be adjusted to
account for this, or a non-linear model could also be appropriate.
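As a rough sketch of these refinements, the code below runs backward elimination with `step()` and, as one possible way to account for the subgroups, fits an alternative model that uses the full Species factor instead of the single indicator. Note that `step()` selects by AIC rather than by p-value, and that `species_model` is an assumption introduced here, not a model from the analysis above.

```r
data(iris)
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width,
            data = iris)

# Backward elimination: drop terms one at a time while the AIC improves.
reduced <- step(model, direction = "backward", trace = 0)
formula(reduced)

# Hypothetical alternative: let the intercept and the Sepal.Width slope
# differ across all three species.
species_model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 +
                      Species * Sepal.Width + Petal.Width,
                    data = iris)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_table <- AIC(model, reduced, species_model)
aic_table
```

The residual plots for whichever model is preferred would still need to be rechecked before concluding that a linear model is appropriate.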