DATA 605 Discussion Week 12

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Introduction
I decided to use the ‘iris’ data set for this discussion question. Petal.Length will be predicted via a multiple regression model using the following terms: ‘Sepal.Length’, ‘Sepal.Length2’, ‘is_versicolor’, ‘Sepal.Width’, ‘Petal.Width’ and the interaction ‘is_versicolor:Sepal.Width’.

Brief Visualization of the Data
Scatter plots of the variables suggest linear relationships between some pairs, most notably between Petal.Width and Petal.Length. Many of the scatter plots also show two distinct groups of observations, with one group displaying less variance/spread than the other, e.g. Petal.Length vs Sepal.Width.

pairs(iris, gap=0.5)
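To make the two groups easier to see, the same matrix can be colored by species. This is only a minimal sketch: the palette and the names `palette_by_species` and `species_cols` are illustrative choices, not part of the original analysis.

```r
# Map each species to a color (the palette is an arbitrary, illustrative choice)
palette_by_species <- c(setosa = "red", versicolor = "darkgreen", virginica = "blue")
species_cols <- palette_by_species[as.character(iris$Species)]

# Scatter plot matrix of the four numeric measurements, colored by species
pairs(iris[, 1:4], col = species_cols, gap = 0.5)
```

If the groups seen in the plain matrix line up with the colors, that would suggest the grouping is driven by species.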

Variable creation
The variable ‘is_versicolor’ was created to satisfy the dichotomous variable requirement for the assignment. The versicolor species is coded as ‘1’ and any other species is coded as ‘0’.

The variable ‘Sepal.Length2’ was created to satisfy the quadratic variable requirement.

A scatter plot matrix that includes the new variables is notable because the two groups of observations are still present.

#dichotomous variable 
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)

#quadratic variable for Sepal.Length
iris$Sepal.Length2 <- iris$Sepal.Length^2

pairs(iris, gap=0.5)
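The coding of the dichotomous variable can be checked with a quick cross-tabulation against Species. This is a small sketch; `coding_check` is a name introduced here for illustration.

```r
# Recreate the indicator so this snippet runs on its own
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)

# Rows are species, columns are the 0/1 indicator
coding_check <- table(iris$Species, iris$is_versicolor)
print(coding_check)
```

Only the versicolor row should have counts in the ‘1’ column.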

Linear Model
Below is the model.

model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor + Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)

summary(model)

## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor + 
##     Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95073 -0.18402 -0.01191  0.19097  1.16467 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.42472    1.39089   1.024    0.307    
## Sepal.Length               0.13233    0.47782   0.277    0.782    
## Sepal.Length2              0.04651    0.03805   1.223    0.224    
## is_versicolor             -0.54263    0.49629  -1.093    0.276    
## Sepal.Width               -0.61633    0.09141  -6.742 3.56e-10 ***
## Petal.Width                1.48217    0.07356  20.150  < 2e-16 ***
## is_versicolor:Sepal.Width  0.24662    0.17062   1.445    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3136 on 143 degrees of freedom
## Multiple R-squared:  0.9697, Adjusted R-squared:  0.9684 
## F-statistic: 763.2 on 6 and 143 DF,  p-value: < 2.2e-16
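As an aside, the quadratic term does not need its own column. A sketch of the same model written with `I(Sepal.Length^2)` inside the formula is shown here, assuming the variables are created as above; `model_col`, `model_I` and `same_fit` are illustrative names.

```r
# Recreate the derived variables so this snippet runs on its own
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2

# Model using the precomputed squared column
model_col <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
                  Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)

# Same model with the square computed inside the formula
model_I <- lm(Petal.Length ~ Sepal.Length + I(Sepal.Length^2) + is_versicolor +
                Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)

# The two fits should agree; only the coefficient label differs
same_fit <- all.equal(unname(fitted(model_col)), unname(fitted(model_I)))
print(same_fit)
```

Because Sepal.Length and its square are strongly correlated, the large standard errors on those two terms may partly reflect collinearity rather than a lack of relationship.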

Residuals
The median residual (-0.012) is close to zero, and the first and third quartiles (-0.184 and 0.191) are of nearly equal magnitude, which hints that the residuals are roughly symmetric about zero, as would be expected if they were normally distributed. Additionally, the minimum and maximum values (-0.951 and 1.165) are of similar magnitude, which provides some evidence that the model is a good fit.

Coefficients
Most coefficients in this model are not statistically significant. The exceptions are Petal.Width and Sepal.Width (p-value < 0.05).

The t-value for Petal.Width is 20.150. This high t-value suggests a very strong statistical significance of the Petal.Width coefficient in predicting Petal.Length.

Sepal.Width has a smaller t-value in absolute terms (t = -6.742) but is still statistically significant (p-value < 0.05), so it can be considered a decent predictor of Petal.Length.
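Because of the interaction term, the Sepal.Width coefficient is the slope for non-versicolor flowers only. For versicolor the slope is the sum of the Sepal.Width and interaction coefficients, roughly -0.62 + 0.25 = -0.37 from the table above, and the is_versicolor coefficient (about -0.54) is the shift in intercept when Sepal.Width is zero. Neither the is_versicolor nor the interaction coefficient is significant, so the two slopes cannot be distinguished at the 0.05 level. A minimal sketch of the arithmetic, with `slope_other` and `slope_versicolor` as illustrative names:

```r
# Rebuild the model so this snippet runs on its own
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)

b <- coef(model)

# Slope of Sepal.Width when is_versicolor == 0
slope_other <- b[["Sepal.Width"]]

# Slope of Sepal.Width when is_versicolor == 1 (main effect plus interaction)
slope_versicolor <- b[["Sepal.Width"]] + b[["is_versicolor:Sepal.Width"]]

print(c(other = slope_other, versicolor = slope_versicolor))
```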

Residual Standard Error and Degrees of Freedom
Generally speaking, a residual standard error (RSE) that is approximately 1.5 times the magnitude of the 1st and 3rd quartile residuals is consistent with normally distributed residuals. Here the RSE is 0.3136 on 143 degrees of freedom, while 1.5 times the quartiles (-0.184 and 0.191) gives roughly 0.28 to 0.29, so the RSE is somewhat larger than the rule of thumb predicts. This hints at tails that are a little heavier than a normal distribution would produce. The RSE is still relatively low regardless of distribution, but the mismatch does provide evidence that the model may be improved.
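The rule of thumb can be checked directly. For a normal distribution the quartiles sit at about 0.674 standard deviations from the mean, so the expected ratio of standard deviation to quartile is 1 / qnorm(0.75), or about 1.48. The sketch below computes the observed ratios; `ratio` and `normal_ratio` are illustrative names.

```r
# Rebuild the model so this snippet runs on its own
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)

rse <- sigma(model)                                  # residual standard error
q <- quantile(residuals(model), c(0.25, 0.75))       # 1st and 3rd quartile residuals

ratio <- rse / abs(q)            # observed RSE-to-quartile ratios
normal_ratio <- 1 / qnorm(0.75)  # ratio expected under normality

print(ratio)
print(normal_ratio)
```

Ratios above the normal value point toward heavier tails than a normal distribution, and ratios below it point toward lighter tails.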

The Multiple R-squared Value
The reported R-squared of 0.9697 for this model means that 96.97% of the variability in Petal.Length is explained by the model.
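For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A short sketch that reproduces the reported value by hand, with `sse`, `sst` and `r2_manual` as illustrative names:

```r
# Rebuild the model so this snippet runs on its own
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)

sse <- sum(residuals(model)^2)                                # unexplained variation
sst <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)   # total variation
r2_manual <- 1 - sse / sst

print(r2_manual)
print(summary(model)$r.squared)
```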

Part 6: Residual Analysis

Below, residual analysis is conducted in the form of a Residual versus Fitted Value Plot and a Q-Q Plot.

Residual versus Fitted Value Plot
For a Residual versus Fitted Value Plot to support the linear model, residuals should be scattered evenly around the horizontal line where the residual equals zero. The Residual versus Fitted Value Plot below shows two separate groupings of observations, which provides possible evidence that (1) the relationship is not linear and (2) the constant-variance (homoscedasticity) assumption is violated, since the residuals in one group are more scattered than in the other. This also points to possible subgroups in the data.
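One rough way to follow up on the unequal scatter is to compare the residual spread across groups. The sketch below assumes the groups correspond to Species, which the plot alone does not confirm, and uses Bartlett's test, which is itself sensitive to non-normal residuals, so the result should be read cautiously. `resid_sd_by_species` and `bt` are illustrative names.

```r
# Rebuild the model so this snippet runs on its own
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)

# Standard deviation of the residuals within each species
resid_sd_by_species <- tapply(residuals(model), iris$Species, sd)
print(resid_sd_by_species)

# Bartlett's test of equal residual variance across species
bt <- bartlett.test(residuals(model), iris$Species)
print(bt)
```

A small p-value would be consistent with the unequal spread suggested by the plot.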

Q-Q Plot
For the Q-Q Plot to support our linear model, we would expect the plotted values to follow a straight line, indicating the residuals are normally distributed. Our model’s Q-Q Plot below suggests that the distribution of the residuals is roughly normal. However, both the right and left tails deviate slightly from the expected straight line, hinting at skew or heavy tails and possible outliers. This suggests that the model could be improved.

#diagnostic plots for the fitted model, arranged in a 2 x 2 grid
par(mfrow=c(2,2))
plot(model)
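The visual impression from the Q-Q Plot can be supplemented with a formal check. A minimal sketch using the Shapiro-Wilk test from base R is shown here; `sw` is an illustrative name, and the test was not part of the original analysis.

```r
# Rebuild the model so this snippet runs on its own
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(model))
print(sw)
```

A p-value below 0.05 would be evidence against normality, in line with the tail deviations described above.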

Conclusion
In conclusion, the model could be greatly refined. Subgroups are almost definitely present in the data. I think the linear model could be appropriate for this data if the presence of the subgroups could be accounted for. Furthermore, backward elimination could simplify the model by dropping the non-significant terms, which may make the subgroup structure easier to see. The comparison of the RSE with the residual 1st/3rd quartiles and the tails of the Q-Q Plot suggest that the residuals are not perfectly normal, so the underlying data would have to be adjusted to account for this, or a non-linear model might be more appropriate.
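The two refinements mentioned above can be sketched as follows. This is one possible approach, not a definitive one: `step()` performs AIC-based backward elimination rather than p-value-based elimination, and replacing the indicator with the full Species factor is an assumption about how the subgroups might be accounted for. `model_reduced` and `model_species` are illustrative names.

```r
# Rebuild the model so this snippet runs on its own
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Sepal.Length2 <- iris$Sepal.Length^2
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 + is_versicolor +
              Sepal.Width + Petal.Width + is_versicolor:Sepal.Width, data = iris)

# Backward elimination driven by AIC (trace = 0 suppresses the step log)
model_reduced <- step(model, direction = "backward", trace = 0)
print(formula(model_reduced))

# Alternative: let all three species have their own intercept and Sepal.Width slope
model_species <- lm(Petal.Length ~ Sepal.Length + Sepal.Length2 +
                      Species * Sepal.Width + Petal.Width, data = iris)

# Lower AIC indicates a better trade-off between fit and complexity
print(AIC(model, model_reduced, model_species))
```

Re-running the residual plots on whichever model is preferred would show whether the two groupings in the residuals have been absorbed.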

Gregg Maloy
