Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
I will be using Possum Measurements dataset from the DAAG library. The dataset contains information that can be used to identify weather Possum was captured in Victoria, Australia or other areas such as New South Wales or Queensland.
case - observation number
site - one of seven locations where possums were trapped
Pop - a factor which classifies the sites as Vic - Victoria, other - New South Wales or Queensland
sex - a factor with levels f female, m male
age - age of Possum
hdlngth - head length
skullw - skull width
totlngth - total length
taill - tail length
footlgth - foot length
earconch - ear conch length
eye - distance from medial canthus to lateral canthus of right eye
chest - chest girth (in cm)
belly - belly girth (in cm)
Pop variable is the dichotomous term. \(tailsq = taill^2\) is quadratic term. Pop Vs. taill is one dichotomous vs. quantitative interaction term. This will be used to determine weather tail length can be used to determine if Possum captured in Victoria.
library(DAAG)
library(knitr)
library(psych)
possum.data <- data.frame(possum)
#eliminate rows with missing data
possum.data <- na.omit(possum.data)
#convert factors to numeric values
possum.data$place <- ifelse(possum.data$Pop=='other',0,1) #Vic = 1, Other = 0
possum.data$gender <- ifelse(possum.data$sex=='m',1,0) #male = 1, female = 0
possum.data$tailsq <- possum.data$taill^2 #additional quadratic term
describeBy(possum.data$taill, group=possum.data$Pop,mat=T,type=3,digits=2)
## item group1 vars n mean sd median trimmed mad min max range skew
## X11 1 Vic 1 43 35.95 1.77 36 35.93 1.48 32 39.5 7.5 0.09
## X12 2 other 1 58 37.86 1.71 38 37.80 1.48 34 43.0 9.0 0.42
## kurtosis se
## X11 -0.11 0.27
## X12 0.47 0.23
boxplot(taill~Pop, data=possum.data, main="Place of Capture Vs. Tail Length", xlab = "Place of Capture", ylab = "Tail Length", names = c("Victoria","Other"))
Comparing the results, Possum captured in Victoria, Australia have smaller tail length compared to one caught in other parts of Australia.
possum.lm <- lm(place~gender+age+hdlngth+skullw+totlngth+taill+footlgth+earconch+eye+chest+belly+tailsq, data=possum.data)
possum.lm
##
## Call:
## lm(formula = place ~ gender + age + hdlngth + skullw + totlngth +
## taill + footlgth + earconch + eye + chest + belly + tailsq,
## data = possum.data)
##
## Coefficients:
## (Intercept) gender age hdlngth skullw
## 3.893556 -0.050397 0.016151 -0.023746 -0.013144
## totlngth taill footlgth earconch eye
## 0.015428 -0.368878 0.036229 0.068267 0.011322
## chest belly tailsq
## 0.011705 -0.004616 0.004119
Regression model
Possum Capture Place = 3.893556 - 0.050397 * gender + 0.016151 * age - 0.023746 * hdlngth - 0.013144 * skullw + 0.015428 * totlngth - 0.368878 * taill + 0.036229 * footlgth + 0.068267 * earconch + 0.011322 * eye + 0.011705 * chest - 0.004616 * belly + 0.004119 * tailsq
Gender variable does not provide any valuable information. Since variable takes values 0 or 1, when all other factors remain constant, using gender variable we can derive all female Possums belong to Victoria and male belong to other areas.
gender, hdlngth, skullw, taill and belly variables are negatively correlated. In other words, they act as balancing numbers such that net result of the model is 0 or 1.
All other variables are positively correlated.
summary(possum.lm)
##
## Call:
## lm(formula = place ~ gender + age + hdlngth + skullw + totlngth +
## taill + footlgth + earconch + eye + chest + belly + tailsq,
## data = possum.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26076 -0.07880 -0.00742 0.06629 0.42082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.893556 3.600557 1.081 0.28248
## gender -0.050397 0.031274 -1.611 0.11065
## age 0.016151 0.008114 1.990 0.04965 *
## hdlngth -0.023746 0.007171 -3.311 0.00135 **
## skullw -0.013144 0.006903 -1.904 0.06015 .
## totlngth 0.015428 0.006755 2.284 0.02479 *
## taill -0.368878 0.191848 -1.923 0.05775 .
## footlgth 0.036229 0.006517 5.559 2.86e-07 ***
## earconch 0.068267 0.006611 10.326 < 2e-16 ***
## eye 0.011322 0.014955 0.757 0.45101
## chest 0.011705 0.011101 1.054 0.29458
## belly -0.004616 0.007059 -0.654 0.51484
## tailsq 0.004119 0.002555 1.612 0.11044
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1387 on 88 degrees of freedom
## Multiple R-squared: 0.9315, Adjusted R-squared: 0.9221
## F-statistic: 99.65 on 12 and 88 DF, p-value: < 2.2e-16
belly, eye, skullw, chest and tailsq have highest p-value, above 0.05 significance level indicating they are not relevant to model. \(R^2\) value is 0.9315 and Adjusted \(R^2\) is 0.9221.
#eliminate belly, eye, gender, skullw, chest and tailsq variables
possum.lm <- update(possum.lm, .~. -belly -eye -gender -skullw -chest -tailsq)
summary(possum.lm)
##
## Call:
## lm(formula = place ~ age + hdlngth + totlngth + taill + footlgth +
## earconch, data = possum.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30386 -0.09470 -0.00183 0.08133 0.42874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.793093 0.503973 -3.558 0.000588 ***
## age 0.017062 0.007840 2.176 0.032043 *
## hdlngth -0.031031 0.005770 -5.378 5.48e-07 ***
## totlngth 0.016303 0.006393 2.550 0.012380 *
## taill -0.062692 0.011092 -5.652 1.69e-07 ***
## footlgth 0.038212 0.006366 6.002 3.62e-08 ***
## earconch 0.068922 0.006548 10.525 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1409 on 94 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9196
## F-statistic: 191.5 on 6 and 94 DF, p-value: < 2.2e-16
\(R^2\) and Adjusted \(R^2\) value decreased, however taill significance went up.
#add skullw back
possum.lm <- update(possum.lm, .~. +skullw)
summary(possum.lm)
##
## Call:
## lm(formula = place ~ age + hdlngth + totlngth + taill + footlgth +
## earconch + skullw, data = possum.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29592 -0.08360 -0.00208 0.08841 0.42178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.754166 0.501902 -3.495 0.000728 ***
## age 0.018056 0.007827 2.307 0.023283 *
## hdlngth -0.025957 0.006741 -3.851 0.000216 ***
## totlngth 0.016581 0.006360 2.607 0.010635 *
## taill -0.063033 0.011033 -5.713 1.33e-07 ***
## footlgth 0.039394 0.006384 6.171 1.75e-08 ***
## earconch 0.067371 0.006601 10.206 < 2e-16 ***
## skullw -0.009323 0.006502 -1.434 0.154995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1402 on 93 degrees of freedom
## Multiple R-squared: 0.926, Adjusted R-squared: 0.9205
## F-statistic: 166.3 on 7 and 93 DF, p-value: < 2.2e-16
Adding skullw variable back does not have much impact to the model. This indicates skullw variable is not significant to the model.
possum.lm <- update(possum.lm, .~. -skullw)
plot(possum.lm, which=c(1,1))
Plot Residuals Vs. Fitted shows the distribution of residuals values around mean. They are not evenly spread around the mean. The graph shows a downward trend.
qqnorm(residuals(possum.lm))
qqline(residuals(possum.lm))
This graph indicates residuals are normally distributed. There are some outliers at top right part of the graph.
1 or 0Residuals Vs. Fitted plot and Q-Q plot leads to two different conclusions. Residuals Vs. Fitted plot says residuals are not normally distributed were as Q-Q plot shows residuals are normally distributed as they hug the line.