DATA 605 Week 12

Introduction
Data Acquisition
Visualization
Model Creation
Model Evaluation
Conclusion

Introduction

As part of the Fundamentals of Computational Mathematics coursework at the CUNY School of Professional Studies, we need to build a multiple regression model. The model needs at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. For this exercise I will be using the classic Iris data set.

Data Acquisition

I will be predicting the petal length. Sepal length and petal length are strongly related (a correlation of 0.8717538), so I will use sepal length in the model as a quadratic term.
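The correlation quoted above can be checked directly against the built-in data. This is a minimal sketch; the variable name `r` is only illustrative.

```r
# iris ships with base R (datasets package), so no extra packages are needed.
# cor() computes the Pearson correlation by default.
r <- cor(iris$Sepal.Length, iris$Petal.Length)
print(r)
```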

From my knowledge of this data set, the setosa species of iris is distinct from the versicolor and virginica species. I will use that split as the dichotomous term.

I also want to differentiate within the versicolor and virginica group, so I will use the petal width, multiplied by the dichotomous term, as the dichotomous vs. quantitative interaction term.

library(dplyr)
df <- iris %>%
  mutate(Sepal.Length.Squared = Sepal.Length ^ 2) %>% # Quadratic term
  mutate(Not.Setosa = ifelse(Species == "setosa", 0, 1)) %>% # Dichotomous term
  mutate(Petal.Width.Interaction = Not.Setosa * Petal.Width) %>% # Dichotomous vs. quantitative interaction
  select(Petal.Length, Sepal.Length.Squared, Not.Setosa, Petal.Width.Interaction) %>%
  arrange(Petal.Length)

The first ten rows of the resulting data are:

Petal.Length	Sepal.Length.Squared	Not.Setosa	Petal.Width.Interaction
1.0	21.16	0	0
1.1	18.49	0	0
1.2	33.64	0	0
1.2	25.00	0	0
1.3	22.09	0	0
1.3	29.16	0	0
1.3	30.25	0	0
1.3	19.36	0	0
1.3	25.00	0	0
1.3	20.25	0	0

Visualization

It’s a bit harder to visualize this model because of the number of dimensions but here’s the clearest viz I was able to work up:
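One way to squeeze all four variables into a single scatter plot is sketched below. It is only an illustration and not necessarily the figure produced for this write-up; the name `plot_df`, the colours, and the point-size scaling are all assumptions.

```r
# Rebuild the modelling columns with base R so the sketch is self-contained.
plot_df <- data.frame(
  Petal.Length         = iris$Petal.Length,
  Sepal.Length.Squared = iris$Sepal.Length^2,
  Not.Setosa           = ifelse(iris$Species == "setosa", 0, 1),
  Petal.Width          = iris$Petal.Width
)
plot_df$Petal.Width.Interaction <- plot_df$Not.Setosa * plot_df$Petal.Width

# x = quadratic term, y = response, colour = dichotomous term,
# point size = interaction term (setosa points get the minimum size).
plot(
  plot_df$Sepal.Length.Squared, plot_df$Petal.Length,
  col  = ifelse(plot_df$Not.Setosa == 1, "steelblue", "tomato"),
  cex  = 0.6 + 0.6 * plot_df$Petal.Width.Interaction,
  pch  = 19,
  xlab = "Sepal.Length.Squared",
  ylab = "Petal.Length"
)
legend("topleft", legend = c("setosa", "not setosa"),
       col = c("tomato", "steelblue"), pch = 19)
```

Encoding the interaction term as point size keeps the plot two-dimensional, at the cost of making that variable harder to read precisely.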

Model Creation

Now to create the multiple regression model:

model <- lm(Petal.Length ~ ., df)
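Written out, the formula `Petal.Length ~ .` regresses petal length on the three remaining columns of `df`. Using coefficient symbols introduced here purely for illustration, the model being fit is:

```latex
\text{Petal.Length} = \beta_0
  + \beta_1\,\text{Sepal.Length}^2
  + \beta_2\,\text{Not.Setosa}
  + \beta_3\,(\text{Not.Setosa} \times \text{Petal.Width})
  + \varepsilon
```

For setosa flowers (Not.Setosa = 0) the last two terms vanish, so petal width only enters the prediction for versicolor and virginica.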

Model Evaluation

summary(model)


Call:
lm(formula = Petal.Length ~ ., data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.83663 -0.16153 -0.01399  0.17506  1.09909 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.282882   0.105986   2.669  0.00847 ** 
Sepal.Length.Squared    0.046824   0.003898  12.013  < 2e-16 ***
Not.Setosa              0.999883   0.123952   8.067 2.41e-13 ***
Petal.Width.Interaction 1.054157   0.080886  13.033  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2828 on 146 degrees of freedom
Multiple R-squared:  0.9749,    Adjusted R-squared:  0.9743 
F-statistic:  1887 on 3 and 146 DF,  p-value: < 2.2e-16

This model performed very well. The adjusted R-squared is 0.974343, which is phenomenal! All of the terms are statistically significant.
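To see how the coefficients combine, here is a sketch of a single prediction. The flower is hypothetical (not setosa, sepal length 6.0, petal width 1.5), and the names `fit_df`, `fit`, `new_flower`, `by_hand`, and `by_predict` are illustrative. The data frame is rebuilt with base R so the block runs on its own. Plugging the reported coefficients in by hand gives 0.282882 + 0.046824 × 36 + 0.999883 + 1.054157 × 1.5, or roughly 4.55.

```r
# Rebuild the modelling data and refit so the sketch is self-contained.
fit_df <- data.frame(
  Petal.Length            = iris$Petal.Length,
  Sepal.Length.Squared    = iris$Sepal.Length^2,
  Not.Setosa              = ifelse(iris$Species == "setosa", 0, 1)
)
fit_df$Petal.Width.Interaction <- fit_df$Not.Setosa * iris$Petal.Width
fit <- lm(Petal.Length ~ ., fit_df)

# Hypothetical flower: not setosa, sepal length 6.0, petal width 1.5.
new_flower <- data.frame(
  Sepal.Length.Squared    = 6.0^2,
  Not.Setosa              = 1,
  Petal.Width.Interaction = 1 * 1.5
)

# Prediction by hand: intercept plus each coefficient times its term.
by_hand <- sum(coef(fit) * c(1, 6.0^2, 1, 1.5))

# The same prediction via predict().
by_predict <- unname(predict(fit, newdata = new_flower))

print(c(by_hand = by_hand, by_predict = by_predict))
```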

Residual Analysis

Let’s take a look at the residuals:

summary(resid(model))

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.83663 -0.16153 -0.01399  0.00000  0.17506  1.09909

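The mean is zero, as expected for a least-squares fit with an intercept, and the median is close to zero, which is good. The first and third quartiles (-0.16 and 0.18) are roughly symmetric around zero, which hints that the residuals are fairly normally distributed. Let’s take a look at the plots: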

There are a couple of observations that are outliers, but generally everything looks great!
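Diagnostic plots of this kind can be drawn with base R graphics. The sketch below is one possible version, not necessarily the exact figures referred to above: it shows residuals against fitted values and a normal Q-Q plot side by side. The names `diag_df` and `diag_fit` are illustrative.

```r
# Rebuild the modelling data and refit so the sketch is self-contained.
diag_df <- data.frame(
  Petal.Length         = iris$Petal.Length,
  Sepal.Length.Squared = iris$Sepal.Length^2,
  Not.Setosa           = ifelse(iris$Species == "setosa", 0, 1)
)
diag_df$Petal.Width.Interaction <- diag_df$Not.Setosa * iris$Petal.Width
diag_fit <- lm(Petal.Length ~ ., diag_df)

# Two panels side by side.
old_par <- par(mfrow = c(1, 2))

# Residuals vs. fitted values: look for curvature or a funnel shape.
plot(fitted(diag_fit), resid(diag_fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line suggest roughly normal residuals.
qqnorm(resid(diag_fit))
qqline(resid(diag_fit))

# Restore the previous graphics settings.
par(old_par)
```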

Conclusion

This model explains almost all of the variability in petal length: the R-squared is about 0.97. The residual plots do not raise any concern, the Q-Q plot is fairly linear, and all of the coefficients are statistically significant.