setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lecture/Unit3_regression/last_week")

Multiple linear regression

Lion data: predicting age from nose coloration

Preliminaries

Load the data

lions <- read.csv("lion_age_by_pop.csv" )


There are 2 populations, The main Serengeti population and the Ngorogoro crater (sub?) population

summary(lions)
##  portion.black      age.years          population
##  Min.   :0.1000   Min.   : 1.100   Ngorogoro:10  
##  1st Qu.:0.1475   1st Qu.: 2.175   Serengeti:22  
##  Median :0.2650   Median : 3.500                 
##  Mean   :0.3197   Mean   : 4.309                 
##  3rd Qu.:0.4325   3rd Qu.: 5.850                 
##  Max.   :0.7900   Max.   :13.100


In the book they consider all of the data combined and ignore the two separate populations (which are operate geographically and genetically)


Does the relationship between age and nose-blackness work the same for both populations?


Plot lion data w/ both populations combined

We will “pool” the data from both populations (that is, ignore population differences)

Run regression

#build regression model
m.pooled <- lm(age.years ~ portion.black, data = lions)

Look at summary of regression

summary(m.pooled)
## 
## Call:
## lm(formula = age.years ~ portion.black, data = lions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5457 -1.1457 -0.3384  0.9245  4.3426 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.9262     0.5591   1.657    0.108    
## portion.black  10.5827     1.4884   7.110 6.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.66 on 30 degrees of freedom
## Multiple R-squared:  0.6276, Adjusted R-squared:  0.6152 
## F-statistic: 50.55 on 1 and 30 DF,  p-value: 6.59e-08

Plot pooled data

#plot
plot(age.years ~ portion.black, data = lions, 
     cex = 2, lwd = 2,
     xlim = c(0.0, 0.9),
     ylim = c(0,14))
abline(m.pooled, lwd = 3, col = 2)

#To Do: add equations for line

Plot lion data seperated by population

i.Serengeti <- which(lions$population == "Serengeti")
i.Ngorogoro <- which(lions$population == "Ngorogoro")

#plot Serengeti
plot(age.years ~ portion.black, 
     data = lions,
     subset = i.Serengeti, 
     cex = 2, lwd = 2,col = 2,
     xlim = c(0.0, 0.9),
     ylim = c(0,14))


#plot Serengeti
points(age.years ~ portion.black, 
     data = lions,
     subset = i.Ngorogoro, 
     cex = 2, lwd = 2,
     col = 3,pch = 2,
     xlim = c(0.0, 0.9),
     ylim = c(0,14))

#add legend
legend("topleft",
       legend = c("Serengeti","Ngorogoro"),
       col = c(2,3),pch = c(1,2),cex = 1.2,
       )

What model best describes the data: 1 line or 2?

Hypothesis 0: there is no relationship

The null hypothesis Ho is that there is no relationship between how black the nose is and age. We can see that there is a relationship and will skip this Ho.

Hypothesis 1: the populations are the same

The data can be modeled with a single regression line. Any difference between the populations due to random noise.

Hypothesis 2: the populations are different

The populations have different biological/ecological processes going on that make the relationship between age and nose blackness depend on the population We say there is an INTERACTION between population and portion.black stated another way: The relationship between nose blackness and age is different between the two populations Therefore, the slope of the lines are different *(This include the possibility that there is no relationship in one population (slope = 0) and is a relationship in the other (slope >0) )



Run multiple regression model

*We indicate in interaction in R using the multiple symbol “*"


Look at multiple regression output

summary(m.by.pop)
## 
## Call:
## lm(formula = age.years ~ portion.black * population, data = lions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3453 -0.4460  0.0258  0.5827  2.9336 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         2.7015     0.9032   2.991  0.00574 ** 
## portion.black                      10.0877     1.8801   5.365 1.02e-05 ***
## populationSerengeti                -1.4667     1.0106  -1.451  0.15781    
## portion.black:populationSerengeti  -3.3352     2.3630  -1.411  0.16914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.16 on 28 degrees of freedom
## Multiple R-squared:  0.8303, Adjusted R-squared:  0.8122 
## F-statistic: 45.68 on 3 and 28 DF,  p-value: 6.508e-11

Test models

Compare the model w/1 line vs. the model w/ 2 lines

anova(m.pooled, # data combined (aka "pooled")
      m.by.pop) # data seperated by population
## Analysis of Variance Table
## 
## Model 1: age.years ~ portion.black
## Model 2: age.years ~ portion.black * population
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 82.712                                  
## 2     28 37.679  2    45.034 16.733 1.657e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

MOre stuff

#no relationship
m.null <- lm(age.years ~ 1,data = lions)

#relationship is the same for both populations
m.pooled <- lm(age.years ~ portion.black,data = lions)


#slope is the same but intercepts are different
m.by.pop.add <- lm(age.years ~ portion.black + population,data = lions)


#slopes AND intercepts are different
m.by.pop.intnx <- lm(age.years ~ portion.black*population,data = lions)
#relationship is the same for both populations
summary(m.pooled)
## 
## Call:
## lm(formula = age.years ~ portion.black, data = lions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5457 -1.1457 -0.3384  0.9245  4.3426 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.9262     0.5591   1.657    0.108    
## portion.black  10.5827     1.4884   7.110 6.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.66 on 30 degrees of freedom
## Multiple R-squared:  0.6276, Adjusted R-squared:  0.6152 
## F-statistic: 50.55 on 1 and 30 DF,  p-value: 6.59e-08
#slope is the same but intercepts are different
summary(m.by.pop.add)
## 
## Call:
## lm(formula = age.years ~ portion.black + population, data = lions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5415 -0.5361 -0.1484  0.5990  3.5691 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.6283     0.6306   5.753 3.14e-06 ***
## portion.black         7.9764     1.1582   6.887 1.45e-07 ***
## populationSerengeti  -2.7185     0.4928  -5.517 6.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.18 on 29 degrees of freedom
## Multiple R-squared:  0.8183, Adjusted R-squared:  0.8057 
## F-statistic: 65.29 on 2 and 29 DF,  p-value: 1.827e-11
#slopes AND intercepts are different
summary(m.by.pop.intnx)
## 
## Call:
## lm(formula = age.years ~ portion.black * population, data = lions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3453 -0.4460  0.0258  0.5827  2.9336 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         2.7015     0.9032   2.991  0.00574 ** 
## portion.black                      10.0877     1.8801   5.365 1.02e-05 ***
## populationSerengeti                -1.4667     1.0106  -1.451  0.15781    
## portion.black:populationSerengeti  -3.3352     2.3630  -1.411  0.16914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.16 on 28 degrees of freedom
## Multiple R-squared:  0.8303, Adjusted R-squared:  0.8122 
## F-statistic: 45.68 on 3 and 28 DF,  p-value: 6.508e-11