DATA 605 Week 12

Introduction

As part of the Fundamentals of Computational Mathematics coursework at the CUNY School of Professional Studies, we need to build a multiple regression model. The model needs at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. For this exercise I will be using the classic Iris data set.

Data Acquisition

I will be predicting the petal length. There is a strong relationship between sepal length and petal length (a correlation of 0.8717538), so I will use the squared sepal length as the quadratic term.
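The correlation quoted above can be reproduced directly from the built-in `iris` data frame. This is a minimal sketch; the variable name `sepal_petal_cor` is only illustrative.

```r
# Pearson correlation between sepal length and petal length
sepal_petal_cor <- cor(iris$Sepal.Length, iris$Petal.Length)
print(sepal_petal_cor)
```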

Because of my knowledge of this dataset, I know that the setosa species of iris is distinct from the versicolor and virginica species. I will use a setosa versus non-setosa indicator as the dichotomous term.
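One way to see the separation is to compare the range of petal lengths within each species. The sketch below is only an illustration of the claim, not part of the original analysis; `petal_ranges` is a name chosen for this example.

```r
# Minimum and maximum petal length for each species
petal_ranges <- tapply(iris$Petal.Length, iris$Species, range)
print(petal_ranges)

# TRUE if the longest setosa petal is shorter than every non-setosa petal
setosa_is_separate <- petal_ranges$setosa[2] <
  min(petal_ranges$versicolor[1], petal_ranges$virginica[1])
print(setosa_is_separate)
```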

I want to differentiate the versicolor and virginica groups, so I will multiply the dichotomous term by the petal width to form the dichotomous vs. quantitative interaction term.

library(dplyr)
df <- iris %>%
  mutate(Sepal.Length.Squared = Sepal.Length ^ 2) %>% # Quadratic term
  mutate(Not.Setosa = ifelse(Species == "setosa", 0, 1)) %>% # Dichotomous term
  mutate(Petal.Width.Interaction = Not.Setosa * Petal.Width) %>% # Dichotomous vs. quantitative interaction
  select(Petal.Length, Sepal.Length.Squared, Not.Setosa, Petal.Width.Interaction) %>%
  arrange(Petal.Length)

The end result is the following data (first ten rows shown):

Petal.Length Sepal.Length.Squared Not.Setosa Petal.Width.Interaction
1.0 21.16 0 0
1.1 18.49 0 0
1.2 33.64 0 0
1.2 25.00 0 0
1.3 22.09 0 0
1.3 29.16 0 0
1.3 30.25 0 0
1.3 19.36 0 0
1.3 25.00 0 0
1.3 20.25 0 0
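A quick sanity check on the derived columns: the interaction term should be zero for every setosa row and equal to the petal width otherwise. The sketch below rebuilds the same columns in base R (an assumption made so the example runs without dplyr); `check_df` is a name used only here.

```r
# Rebuild the derived columns without dplyr
check_df <- data.frame(
  Petal.Length = iris$Petal.Length,
  Sepal.Length.Squared = iris$Sepal.Length^2,
  Not.Setosa = ifelse(iris$Species == "setosa", 0, 1)
)
check_df$Petal.Width.Interaction <- check_df$Not.Setosa * iris$Petal.Width

# The interaction is zero whenever the flower is setosa ...
all_zero_for_setosa <- all(check_df$Petal.Width.Interaction[check_df$Not.Setosa == 0] == 0)

# ... and equals the petal width for every other flower
matches_width_otherwise <- all(
  check_df$Petal.Width.Interaction[check_df$Not.Setosa == 1] ==
    iris$Petal.Width[check_df$Not.Setosa == 1]
)
print(c(all_zero_for_setosa, matches_width_otherwise))
```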

Visualization

It’s a bit harder to visualize this model because of the number of dimensions, but here’s the clearest visualization I was able to work up:
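The original figure is not reproduced here. As a rough stand-in, and only as an assumption about what a comparable plot could look like, the base R sketch below puts the quadratic term on the x-axis, colors points by the dichotomous term, and scales point size by petal width.

```r
# Hypothetical reconstruction, not the author's original figure
not_setosa <- ifelse(iris$Species == "setosa", 0, 1)
point_cols <- ifelse(not_setosa == 1, "steelblue", "tomato")

plot(iris$Sepal.Length^2, iris$Petal.Length,
     col = point_cols,
     pch = 19,
     cex = 0.5 + iris$Petal.Width / 2,  # larger points have wider petals
     xlab = "Sepal length squared",
     ylab = "Petal length",
     main = "Petal length vs. squared sepal length")
legend("topleft",
       legend = c("setosa", "not setosa"),
       col = c("tomato", "steelblue"),
       pch = 19)
```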

Model Creation

Now to create the multiple regression model:

model <- lm(Petal.Length ~ ., df)
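The same model can also be written with the terms spelled out in the formula instead of precomputed columns. This is a sketch of an equivalent specification, not the approach used above; `flowers` and `formula_model` are names introduced for the example.

```r
# Add only the dichotomous indicator; the other terms are built in the formula
flowers <- transform(iris, Not.Setosa = ifelse(Species == "setosa", 0, 1))

# I() protects the square inside the formula;
# Not.Setosa:Petal.Width adds the product term without a Petal.Width main effect
formula_model <- lm(
  Petal.Length ~ I(Sepal.Length^2) + Not.Setosa + Not.Setosa:Petal.Width,
  data = flowers
)
print(coef(formula_model))
```

Writing the terms in the formula keeps the original measurements in the data frame, which makes predicting for new flowers less error-prone because the squared and interaction columns do not have to be rebuilt by hand.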

Model Evaluation

summary(model)

Call:
lm(formula = Petal.Length ~ ., data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.83663 -0.16153 -0.01399  0.17506  1.09909 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.282882   0.105986   2.669  0.00847 ** 
Sepal.Length.Squared    0.046824   0.003898  12.013  < 2e-16 ***
Not.Setosa              0.999883   0.123952   8.067 2.41e-13 ***
Petal.Width.Interaction 1.054157   0.080886  13.033  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2828 on 146 degrees of freedom
Multiple R-squared:  0.9749,    Adjusted R-squared:  0.9743 
F-statistic:  1887 on 3 and 146 DF,  p-value: < 2.2e-16

This model performed very well. The adjusted R-squared is 0.974343, which is phenomenal! All of the terms are statistically significant.
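To see how the coefficients combine, the sketch below predicts the petal length of a made-up flower (not setosa, sepal length 6.0, petal width 1.5) both with `predict()` and by hand. The flower and the names `new_flower`, `by_predict`, and `by_hand` are assumptions for illustration; the data frame is rebuilt in base R so the example runs on its own.

```r
# Rebuild the modeling data frame and refit the model
df <- data.frame(
  Petal.Length = iris$Petal.Length,
  Sepal.Length.Squared = iris$Sepal.Length^2,
  Not.Setosa = ifelse(iris$Species == "setosa", 0, 1)
)
df$Petal.Width.Interaction <- df$Not.Setosa * iris$Petal.Width
model <- lm(Petal.Length ~ ., df)

# Hypothetical flower: not setosa, sepal length 6.0, petal width 1.5
new_flower <- data.frame(
  Sepal.Length.Squared = 6.0^2,
  Not.Setosa = 1,
  Petal.Width.Interaction = 1 * 1.5
)

# Prediction from the fitted model
by_predict <- unname(predict(model, newdata = new_flower))

# Same prediction by hand: intercept + b1 * 36 + b2 * 1 + b3 * 1.5
by_hand <- sum(coef(model) * c(1, 36, 1, 1.5))

print(c(by_predict = by_predict, by_hand = by_hand))
```

With the rounded coefficients reported in the summary, the hand calculation is 0.282882 + 0.046824 × 36 + 0.999883 + 1.054157 × 1.5, or about 4.55.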

Residual Analysis

Let’s take a look at the residuals:

summary(resid(model))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.83663 -0.16153 -0.01399  0.00000  0.17506  1.09909 

The mean and median are close to zero, which is good. The first and third quartiles are roughly symmetric around the median, which hints that the residuals are fairly normally distributed. Let’s take a look at the plots:

There are a couple of observations that are outliers, but generally everything looks great!
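The residual plots themselves are not reproduced here. A minimal sketch of how comparable diagnostics could be drawn in base R follows; it assumes a residuals-vs-fitted plot and a normal Q-Q plot are the plots of interest, and it rebuilds the model so the example runs on its own.

```r
# Rebuild the modeling data frame and refit the model
df <- data.frame(
  Petal.Length = iris$Petal.Length,
  Sepal.Length.Squared = iris$Sepal.Length^2,
  Not.Setosa = ifelse(iris$Species == "setosa", 0, 1)
)
df$Petal.Width.Interaction <- df$Not.Setosa * iris$Petal.Width
model <- lm(Petal.Length ~ ., df)

res <- resid(model)

# Two panels side by side
old_par <- par(mfrow = c(1, 2))

# Residuals vs. fitted values: look for curvature or a funnel shape
plot(fitted(model), res,
     xlab = "Fitted petal length", ylab = "Residual",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line suggest roughly normal residuals
qqnorm(res)
qqline(res)

par(old_par)
```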

Conclusion

This model explains almost all of the variability in petal length. The R-squared is about 0.97. The residual plots do not raise any concern. The Q-Q plot is fairly linear. The coefficients are statistically significant.

Mike Silva

16 April 2019