DATA 605 Week 12
Introduction
As part of the Fundamentals of Computational Mathematics coursework at the CUNY School of Professional Studies, we need to build a multiple regression model. The model needs at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. For this exercise I will be using the classic Iris data set.
Data Acquisition
I will be predicting the petal length. I know that there is a strong relationship between the sepal length and the petal length (a correlation of 0.8717538). I will be using the sepal length in the model as a quadratic term.
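The correlation quoted above can be reproduced from the built-in `iris` data frame. A minimal check, assuming base R's `cor()` with its default Pearson method (the variable name `sepal_petal_cor` is only illustrative):

```r
# Pearson correlation between sepal length and petal length in the built-in iris data
sepal_petal_cor <- cor(iris$Sepal.Length, iris$Petal.Length)
print(sepal_petal_cor)
```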
From prior work with this dataset, I know that the setosa species of iris is distinct from the versicolor and virginica species. I will be using a setosa vs. not-setosa indicator as the dichotomous term.
I also want to differentiate within the versicolor and virginica group, so I will use the petal width as the quantitative variable in the interaction term.
library(dplyr)

df <- iris %>%
  mutate(Sepal.Length.Squared = Sepal.Length ^ 2) %>%                  # Quadratic term
  mutate(Not.Setosa = ifelse(Species == "setosa", 0, 1)) %>%           # Dichotomous term
  mutate(Petal.Width.Interaction = Not.Setosa * Petal.Width) %>%       # Dichotomous vs. quantitative interaction
  select(Petal.Length, Sepal.Length.Squared, Not.Setosa, Petal.Width.Interaction) %>%
  arrange(Petal.Length)
The end result is the following data (first ten rows shown):
Petal.Length | Sepal.Length.Squared | Not.Setosa | Petal.Width.Interaction |
---|---|---|---|
1.0 | 21.16 | 0 | 0 |
1.1 | 18.49 | 0 | 0 |
1.2 | 33.64 | 0 | 0 |
1.2 | 25.00 | 0 | 0 |
1.3 | 22.09 | 0 | 0 |
1.3 | 29.16 | 0 | 0 |
1.3 | 30.25 | 0 | 0 |
1.3 | 19.36 | 0 | 0 |
1.3 | 25.00 | 0 | 0 |
1.3 | 20.25 | 0 | 0 |
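Every row shown here has Not.Setosa equal to 0, so the interaction column is 0 as well. A quick sanity check of the two derived columns over the full data set might look like the sketch below. It rebuilds the columns with base R rather than dplyr, and the variable names are illustrative.

```r
# Rebuild the dichotomous and interaction terms with base R (same values as the dplyr pipeline)
not_setosa <- ifelse(iris$Species == "setosa", 0, 1)
petal_width_interaction <- not_setosa * iris$Petal.Width

# Count of rows in each group: setosa (0) vs. versicolor or virginica (1)
print(table(not_setosa))

# The interaction is zero for every setosa row and equals Petal.Width otherwise
stopifnot(all(petal_width_interaction[not_setosa == 0] == 0))
stopifnot(all(petal_width_interaction[not_setosa == 1] ==
              iris$Petal.Width[not_setosa == 1]))
```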
Visualization
It’s a bit harder to visualize this model because of the number of dimensions but here’s the clearest viz I was able to work up:
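One simple way to look at the data, not necessarily the plot the text refers to, is a scatter plot of petal length against sepal length with points coloured by the dichotomous term. This is a sketch in base graphics; the colours, labels, and title are my own choices.

```r
# Colour points by the dichotomous term: 0 = setosa, 1 = versicolor or virginica
not_setosa <- ifelse(iris$Species == "setosa", 0, 1)
point_cols <- ifelse(not_setosa == 0, "darkorange", "steelblue")

# Petal length (the response) against sepal length (the basis of the quadratic term)
plot(iris$Sepal.Length, iris$Petal.Length,
     col = point_cols, pch = 19,
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)",
     main = "Petal length vs. sepal length by species group")
legend("topleft", legend = c("setosa", "not setosa"),
       col = c("darkorange", "steelblue"), pch = 19)
```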
Model Creation
Now to create the multiple regression model:
model <- lm(Petal.Length ~ ., df)
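Written out, the formula `Petal.Length ~ .` regresses petal length on the three remaining columns of `df`. As a sketch, with $D$ as my own shorthand for Not.Setosa and the $\beta$ symbols standing for the coefficients to be estimated:

```latex
\begin{aligned}
\text{Petal.Length} &= \beta_0 + \beta_1\,\text{Sepal.Length}^2 + \beta_2\,D
                       + \beta_3\,(D \times \text{Petal.Width}) + \varepsilon \\[4pt]
D = 0 \ (\text{setosa}): \quad
\text{Petal.Length} &= \beta_0 + \beta_1\,\text{Sepal.Length}^2 + \varepsilon \\
D = 1 \ (\text{otherwise}): \quad
\text{Petal.Length} &= (\beta_0 + \beta_2) + \beta_1\,\text{Sepal.Length}^2
                       + \beta_3\,\text{Petal.Width} + \varepsilon
\end{aligned}
```

Because there is no separate Petal.Width term, petal width only affects the prediction for versicolor and virginica flowers. For setosa the model reduces to an intercept plus the quadratic sepal-length term.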
Model Evaluation
summary(model)
Call:
lm(formula = Petal.Length ~ ., data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-0.83663 -0.16153 -0.01399  0.17506  1.09909

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             0.282882   0.105986   2.669  0.00847 **
Sepal.Length.Squared    0.046824   0.003898  12.013  < 2e-16 ***
Not.Setosa              0.999883   0.123952   8.067 2.41e-13 ***
Petal.Width.Interaction 1.054157   0.080886  13.033  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2828 on 146 degrees of freedom
Multiple R-squared:  0.9749, Adjusted R-squared:  0.9743
F-statistic:  1887 on 3 and 146 DF,  p-value: < 2.2e-16
This model performed very well. The adjusted R-squared is 0.974343, which is phenomenal! All of the terms are statistically significant at the 0.01 level.
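The adjusted R-squared and the coefficient table can also be pulled out of the summary object programmatically, and the fitted model can be used to predict a new observation. The sketch below rebuilds the data with base R so that it runs on its own. The `new_flower` values (a non-setosa flower with a 6.0 cm sepal and a 1.5 cm wide petal) are hypothetical and chosen only for illustration. With the coefficients reported above, the prediction works out to roughly 0.283 + 0.0468 × 36 + 1.000 + 1.054 × 1.5 ≈ 4.55 cm.

```r
# Rebuild the model data with base R (equivalent to the dplyr pipeline in the text)
not_setosa <- ifelse(iris$Species == "setosa", 0, 1)
df <- data.frame(
  Petal.Length            = iris$Petal.Length,
  Sepal.Length.Squared    = iris$Sepal.Length^2,
  Not.Setosa              = not_setosa,
  Petal.Width.Interaction = not_setosa * iris$Petal.Width
)
model <- lm(Petal.Length ~ ., data = df)

# Adjusted R-squared and the coefficient table, taken from the summary object
model_summary <- summary(model)
adj_r2 <- model_summary$adj.r.squared
print(adj_r2)
print(coef(model_summary))

# Hypothetical flower: not setosa, sepal length 6.0 cm, petal width 1.5 cm
new_flower <- data.frame(
  Sepal.Length.Squared    = 6.0^2,
  Not.Setosa              = 1,
  Petal.Width.Interaction = 1 * 1.5
)
predicted_length <- unname(predict(model, newdata = new_flower))
print(predicted_length)
```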
Residual Analysis
Let’s take a look at the residuals
summary(resid(model))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.83663 -0.16153 -0.01399  0.00000  0.17506  1.09909
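The median is close to zero, and the first and third quartiles (-0.16153 and 0.17506) are roughly symmetric around it, which is consistent with residuals that are fairly normally distributed. The mean is zero, as expected for a least-squares fit with an intercept. Let’s take a look at the plots:
There are a couple of observations that are outliers, but generally everything looks great!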
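Diagnostic plots of this kind can be produced with the standard `plot()` method for `lm` objects. This is a minimal sketch, assuming the default four panels are what is wanted. The data are rebuilt with base R so the block runs on its own.

```r
# Rebuild the model with base R (equivalent to the dplyr pipeline in the text)
not_setosa <- ifelse(iris$Species == "setosa", 0, 1)
df <- data.frame(
  Petal.Length            = iris$Petal.Length,
  Sepal.Length.Squared    = iris$Sepal.Length^2,
  Not.Setosa              = not_setosa,
  Petal.Width.Interaction = not_setosa * iris$Petal.Width
)
model <- lm(Petal.Length ~ ., data = df)
model_residuals <- resid(model)

# Least-squares residuals average to zero (up to floating-point error) when the model has an intercept
print(mean(model_residuals))

# Default diagnostics: residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(2, 2))
plot(model)
```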
Conclusion
This model explains almost all of the variability in petal length: the R-squared is about 0.97. The residual plots do not raise any concern, the Q-Q plot is fairly linear, and all of the coefficients are statistically significant.