Discussion

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Dataset

I will be using Possum Measurements dataset from the DAAG library. The dataset contains information that can be used to identify weather Possum was captured in Victoria, Australia or other areas such as New South Wales or Queensland.

Columns

case - observation number

site - one of seven locations where possums were trapped

Pop - a factor which classifies the sites as Vic - Victoria, other - New South Wales or Queensland

sex - a factor with levels f female, m male

age - age of Possum

hdlngth - head length

skullw - skull width

totlngth - total length

taill - tail length

footlgth - foot length

earconch - ear conch length

eye - distance from medial canthus to lateral canthus of right eye

chest - chest girth (in cm)

belly - belly girth (in cm)

Analysis

Pop variable is the dichotomous term. \(tailsq = taill^2\) is quadratic term. Pop Vs. taill is one dichotomous vs. quantitative interaction term. This will be used to determine weather tail length can be used to determine if Possum captured in Victoria.

library(DAAG)
library(knitr)
library(psych)

possum.data <- data.frame(possum)

#eliminate rows with missing data
possum.data <- na.omit(possum.data)

#convert factors to numeric values
possum.data$place <- ifelse(possum.data$Pop=='other',0,1) #Vic = 1, Other = 0
possum.data$gender <- ifelse(possum.data$sex=='m',1,0) #male = 1, female = 0
possum.data$tailsq <- possum.data$taill^2 #additional quadratic term

Dichotomous vs. Quantitative interaction term

describeBy(possum.data$taill, group=possum.data$Pop,mat=T,type=3,digits=2)
##     item group1 vars  n  mean   sd median trimmed  mad min  max range skew
## X11    1    Vic    1 43 35.95 1.77     36   35.93 1.48  32 39.5   7.5 0.09
## X12    2  other    1 58 37.86 1.71     38   37.80 1.48  34 43.0   9.0 0.42
##     kurtosis   se
## X11    -0.11 0.27
## X12     0.47 0.23
boxplot(taill~Pop, data=possum.data, main="Place of Capture Vs. Tail Length", xlab = "Place of Capture", ylab = "Tail Length", names = c("Victoria","Other"))

Comparing the results, Possum captured in Victoria, Australia have smaller tail length compared to one caught in other parts of Australia.

Multi-Factor Regression

possum.lm <- lm(place~gender+age+hdlngth+skullw+totlngth+taill+footlgth+earconch+eye+chest+belly+tailsq, data=possum.data)
possum.lm
## 
## Call:
## lm(formula = place ~ gender + age + hdlngth + skullw + totlngth + 
##     taill + footlgth + earconch + eye + chest + belly + tailsq, 
##     data = possum.data)
## 
## Coefficients:
## (Intercept)       gender          age      hdlngth       skullw  
##    3.893556    -0.050397     0.016151    -0.023746    -0.013144  
##    totlngth        taill     footlgth     earconch          eye  
##    0.015428    -0.368878     0.036229     0.068267     0.011322  
##       chest        belly       tailsq  
##    0.011705    -0.004616     0.004119

Regression model

Possum Capture Place = 3.893556 - 0.050397 * gender + 0.016151 * age - 0.023746 * hdlngth - 0.013144 * skullw + 0.015428 * totlngth - 0.368878 * taill + 0.036229 * footlgth + 0.068267 * earconch + 0.011322 * eye + 0.011705 * chest - 0.004616 * belly + 0.004119 * tailsq

Gender variable does not provide any valuable information. Since variable takes values 0 or 1, when all other factors remain constant, using gender variable we can derive all female Possums belong to Victoria and male belong to other areas.

gender, hdlngth, skullw, taill and belly variables are negatively correlated. In other words, they act as balancing numbers such that net result of the model is 0 or 1.

All other variables are positively correlated.

summary(possum.lm)
## 
## Call:
## lm(formula = place ~ gender + age + hdlngth + skullw + totlngth + 
##     taill + footlgth + earconch + eye + chest + belly + tailsq, 
##     data = possum.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26076 -0.07880 -0.00742  0.06629  0.42082 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.893556   3.600557   1.081  0.28248    
## gender      -0.050397   0.031274  -1.611  0.11065    
## age          0.016151   0.008114   1.990  0.04965 *  
## hdlngth     -0.023746   0.007171  -3.311  0.00135 ** 
## skullw      -0.013144   0.006903  -1.904  0.06015 .  
## totlngth     0.015428   0.006755   2.284  0.02479 *  
## taill       -0.368878   0.191848  -1.923  0.05775 .  
## footlgth     0.036229   0.006517   5.559 2.86e-07 ***
## earconch     0.068267   0.006611  10.326  < 2e-16 ***
## eye          0.011322   0.014955   0.757  0.45101    
## chest        0.011705   0.011101   1.054  0.29458    
## belly       -0.004616   0.007059  -0.654  0.51484    
## tailsq       0.004119   0.002555   1.612  0.11044    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1387 on 88 degrees of freedom
## Multiple R-squared:  0.9315, Adjusted R-squared:  0.9221 
## F-statistic: 99.65 on 12 and 88 DF,  p-value: < 2.2e-16

belly, eye, skullw, chest and tailsq have highest p-value, above 0.05 significance level indicating they are not relevant to model. \(R^2\) value is 0.9315 and Adjusted \(R^2\) is 0.9221.

#eliminate belly, eye, gender, skullw, chest and tailsq variables
possum.lm <- update(possum.lm, .~. -belly -eye -gender -skullw -chest -tailsq)
summary(possum.lm)
## 
## Call:
## lm(formula = place ~ age + hdlngth + totlngth + taill + footlgth + 
##     earconch, data = possum.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30386 -0.09470 -0.00183  0.08133  0.42874 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.793093   0.503973  -3.558 0.000588 ***
## age          0.017062   0.007840   2.176 0.032043 *  
## hdlngth     -0.031031   0.005770  -5.378 5.48e-07 ***
## totlngth     0.016303   0.006393   2.550 0.012380 *  
## taill       -0.062692   0.011092  -5.652 1.69e-07 ***
## footlgth     0.038212   0.006366   6.002 3.62e-08 ***
## earconch     0.068922   0.006548  10.525  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1409 on 94 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9196 
## F-statistic: 191.5 on 6 and 94 DF,  p-value: < 2.2e-16

\(R^2\) and Adjusted \(R^2\) value decreased, however taill significance went up.

#add skullw back
possum.lm <- update(possum.lm, .~. +skullw)
summary(possum.lm)
## 
## Call:
## lm(formula = place ~ age + hdlngth + totlngth + taill + footlgth + 
##     earconch + skullw, data = possum.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29592 -0.08360 -0.00208  0.08841  0.42178 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.754166   0.501902  -3.495 0.000728 ***
## age          0.018056   0.007827   2.307 0.023283 *  
## hdlngth     -0.025957   0.006741  -3.851 0.000216 ***
## totlngth     0.016581   0.006360   2.607 0.010635 *  
## taill       -0.063033   0.011033  -5.713 1.33e-07 ***
## footlgth     0.039394   0.006384   6.171 1.75e-08 ***
## earconch     0.067371   0.006601  10.206  < 2e-16 ***
## skullw      -0.009323   0.006502  -1.434 0.154995    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1402 on 93 degrees of freedom
## Multiple R-squared:  0.926,  Adjusted R-squared:  0.9205 
## F-statistic: 166.3 on 7 and 93 DF,  p-value: < 2.2e-16

Adding skullw variable back does not have much impact to the model. This indicates skullw variable is not significant to the model.

Residual Analysis

possum.lm <- update(possum.lm, .~. -skullw)
plot(possum.lm, which=c(1,1))

Plot Residuals Vs. Fitted shows the distribution of residuals values around mean. They are not evenly spread around the mean. The graph shows a downward trend.

qqnorm(residuals(possum.lm))
qqline(residuals(possum.lm))

This graph indicates residuals are normally distributed. There are some outliers at top right part of the graph.

Conclusion