Fitting a linear model

Reading data

spiderweb <- read.table(file="spider_web.csv", header = TRUE, sep = ",")
spiderweight <- read.table(file="spider_weight.csv", header = TRUE, sep = ",")

Merging data

mydata <- merge(spiderweight, spiderweb, by = c('spider'), all.x=T)

Sub-setting data to Canberra only

spider <- mydata[mydata$population == "canberra",]

Checking subsetted data

dim(spider) #returns the dimensions of the data frame, showing the number of rows and columns
## [1] 96  4
head(spider) #displays the first few rows of the data frame
##    spider weight population web_dia
## 1    sp_1   9.29   canberra   90.21
## 2   sp_10   9.39   canberra  101.23
## 3  sp_100   9.56   canberra   99.55
## 13  sp_11  10.93   canberra  105.88
## 24  sp_12  10.23   canberra  104.41
## 35  sp_13   8.06   canberra   70.82
tail(spider) #displays the last few rows of the data frame
##     spider weight population web_dia
## 285  sp_94  10.76   canberra   98.38
## 286  sp_95  10.29   canberra   97.67
## 287  sp_96   8.78   canberra   88.47
## 288  sp_97  10.12   canberra  103.09
## 289  sp_98   8.94   canberra   92.58
## 290  sp_99  10.66   canberra  102.32
summary(spider) #generates summary statistics for each variable in the data frame
##     spider              weight        population           web_dia      
##  Length:96          Min.   : 6.770   Length:96          Min.   : 61.49  
##  Class :character   1st Qu.: 9.193   Class :character   1st Qu.: 91.06  
##  Mode  :character   Median :10.215   Mode  :character   Median :101.34  
##                     Mean   :10.046                      Mean   :100.21  
##                     3rd Qu.:10.943                      3rd Qu.:109.84  
##                     Max.   :12.110                      Max.   :127.45
unique(spider$population) #returns the unique values of the 'population' variable in the spider data frame
## [1] "canberra"
length(unique(spider$population)) #calculates the number of unique values in the 'population' variable 
## [1] 1
unique(spider$spider) #returns the unique values of the 'spider' variable in the spider data frame
##  [1] "sp_1"   "sp_10"  "sp_100" "sp_11"  "sp_12"  "sp_13"  "sp_14"  "sp_15" 
##  [9] "sp_16"  "sp_17"  "sp_18"  "sp_19"  "sp_2"   "sp_20"  "sp_21"  "sp_22" 
## [17] "sp_23"  "sp_24"  "sp_25"  "sp_27"  "sp_28"  "sp_29"  "sp_3"   "sp_30" 
## [25] "sp_31"  "sp_32"  "sp_33"  "sp_34"  "sp_35"  "sp_36"  "sp_37"  "sp_38" 
## [33] "sp_39"  "sp_4"   "sp_40"  "sp_41"  "sp_42"  "sp_43"  "sp_44"  "sp_45" 
## [41] "sp_46"  "sp_48"  "sp_49"  "sp_5"   "sp_50"  "sp_51"  "sp_52"  "sp_53" 
## [49] "sp_54"  "sp_55"  "sp_56"  "sp_57"  "sp_58"  "sp_59"  "sp_6"   "sp_60" 
## [57] "sp_61"  "sp_62"  "sp_63"  "sp_64"  "sp_65"  "sp_66"  "sp_67"  "sp_68" 
## [65] "sp_69"  "sp_7"   "sp_70"  "sp_71"  "sp_72"  "sp_73"  "sp_74"  "sp_75" 
## [73] "sp_76"  "sp_77"  "sp_78"  "sp_79"  "sp_8"   "sp_80"  "sp_81"  "sp_82" 
## [81] "sp_83"  "sp_84"  "sp_85"  "sp_86"  "sp_87"  "sp_9"   "sp_90"  "sp_91" 
## [89] "sp_92"  "sp_93"  "sp_94"  "sp_95"  "sp_96"  "sp_97"  "sp_98"  "sp_99"
length(unique(spider$spider)) #calculates the number of unique values in the 'spider' variable
## [1] 96

Fit a linear model

fit <- lm(spider$web_dia ~ spider$weight) #The lm() function is used to fit a linear model to the data 

Create a scatterplot

plot(spider$weight, spider$web_dia, xlab = "Spider weight (g)", ylab = "Web diameter (cm)",
  pch = 16, cex = 1.3, col = "blue", main = "Web diameter versus spider weight")
abline(fit, lty=2, col="red") #adds the linear regression 'fit' to the plot

Get the coefficients of the model using the R function coefficients()

coefficients(fit)
##   (Intercept) spider$weight 
##      -2.39153      10.21288

Interpretation of coefficients: for each one-gram increase in spider weight, the web diameter is expected to increase by about 10.2 cm.
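
As a quick check of this interpretation, the fitted equation can be used to predict the web diameter for a spider of a given weight (10 g is an arbitrary example value chosen here):

b <- coefficients(fit) #stores the intercept and slope
b[1] + b[2] * 10 #predicted web diameter (cm) for a 10 g spider: about -2.39 + 10.21*10 = 99.7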

Calculate R squared

summary(fit)$r.squared
## [1] 0.8171372

Interpretation of R squared: the R² of 0.82 is close to 1, indicating a good fit; about 82% of the variation in web diameter is explained by spider weight.
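
As a sketch, the same R² can be computed by hand as 1 − RSS/TSS, the proportion of the total variation in web diameter explained by the model:

1 - sum(resid(fit)^2) / sum((spider$web_dia - mean(spider$web_dia))^2) #should match summary(fit)$r.squared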


Fitting a binary covariate

Create a new variable in the dataset spider to store the binary covariate for weight

spider$bin_weight <- ifelse(spider$weight >= 10, 1, 0) #creates a new binary variable 'bin_weight' in the dataframe based on spider weight, where spiders weighing 10g or more are labeled as 1 and others as 0 

Refit model with binary covariate

fit <- lm(spider$web_dia ~ spider$bin_weight) #fits a new linear model using the binary covariate 'bin_weight' instead of the continuous 'weight' variable 

Get the coefficients of the binary model

coefficients(fit)
##       (Intercept) spider$bin_weight 
##          89.22326          19.89410

Plot the model with binary covariate

Remember: when we have a binary variable, the equation changes to:

Reference category i.e. category assigned 0 (in this case weight < 10g): Y = alpha = 89.2

Other category i.e. category assigned 1 (in this case weight >= 10g): Y = alpha + beta = 89.2 + 19.9 = 109.1

plot(spider$bin_weight, spider$web_dia)
abline(fit, lty=2)
abline(89.2, 0, lty=2, col="red") #adds a horizontal line at y = 89.2, the mean web diameter for spiders < 10g
abline(89.2 + 19.9, 0, lty=2, col="green") #adds a horizontal line at y = 109.1, the mean web diameter for spiders >= 10g

Interpretation: The fitted equation is web_dia = 89.2 + 19.9 × bin_weight. It states that the web diameter is expected to be 89.2 cm for spiders weighing < 10g, and 19.9 cm larger (109.1 cm) for spiders weighing >= 10g
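
A quick way to check this interpretation is to compare the coefficients with the mean web diameter in each weight group: in a model with a single binary covariate, the intercept is the mean of the reference group and the intercept plus the slope is the mean of the other group.

tapply(spider$web_dia, spider$bin_weight, mean) #expected to be roughly 89.2 (bin_weight = 0) and 109.1 (bin_weight = 1)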

What would happen if the coding of 0 and 1 were reversed?

spider$bin_weight2 <- ifelse(spider$weight < 10, 1, 0) #creates a new binary variable 'bin_weight2' with the coding reversed: spiders weighing less than 10g are labeled 1 and others 0

fit <- lm(spider$web_dia ~ spider$bin_weight2) #re-fit the model with the changed binary variable to see if results are different

coefficients(fit) #Get the model coefficients 
##        (Intercept) spider$bin_weight2 
##           109.1174           -19.8941
plot(spider$bin_weight2, spider$web_dia) #plot the data 
abline(fit, lty=2) #adds the fitted regression line
abline(109, 0, lty=2, col="red") #adds a horizontal line at y = 109.1, the mean web diameter for spiders >= 10g
abline(109-19.9, 0, lty=2, col="green") #adds a horizontal line at y = 89.2, the mean web diameter for spiders < 10g

Interpretation: The fitted equation is web_dia = 109.1 − 19.9 × bin_weight2. It states that the web diameter is expected to be 109.1 cm for spiders weighing >= 10g, and 19.9 cm smaller for spiders weighing < 10g

This shows that the choice of coding does not affect the interpretation of the results; only the reference category, and hence the sign of the coefficient, changes.
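
One way to convince yourself of this is to check that the two codings produce identical fitted values (a minimal sketch; fit1 and fit2 are new names used here so the existing fit object is not overwritten):

fit1 <- lm(web_dia ~ bin_weight, data = spider) #model with the original coding
fit2 <- lm(web_dia ~ bin_weight2, data = spider) #model with the reversed coding
all.equal(unname(fitted(fit1)), unname(fitted(fit2))) #expected to return TRUE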

Comparing the model with the continuous covariate to the model with the binary covariate: the overall trend observed in the first model is still apparent in the second. However, the binary model only captures the average web diameter in each weight category (< 10g or >= 10g), rather than using the full range of weights as the continuous-covariate model does.


Multiple regression

To study the association of the monthly average temperature (in °C, X) and hotel occupation (in %, Y), we consider data from three cities:

  1. Polenca (Mallorca, Spain) as a summer holiday destination,

  2. Davos (Switzerland) as a winter skiing destination, and

  3. Basel (Switzerland) as a business destination.

The data is as follows, with values per month

#Average monthly temperature 
Temp_Davos <- c(-6,-5,2,4,7,15,17,19,13,9,4,0)
Temp_Polenca <- c(10,10,14,17,22,24,26,27,22,19,14,12)
Temp_Basel <- c(1,0,5,9,14,20,23,24,21,14,9,4)

#Hotel occupation 
Hotel_Davos <- c(91,89,76,52,42,36,37,39,26,27,68,92)
Hotel_Polenca <- c(13,21,42,64,79,81,86,92,36,23,13,41)
Hotel_Basel <- c(23,82,40,45,39,43,50,95,64,78,9,12)

Create the month variable

Month <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

Create the dataframe

HotelOccup <- data.frame(Month, Temp_Davos, Hotel_Davos, Temp_Polenca, Hotel_Polenca, Temp_Basel, Hotel_Basel) 

Use the function write.csv() to output the file (note that, unlike write.table(), there is no need to supply a sep argument, because write.csv() always uses commas)

write.csv(HotelOccup, "HotelOccupation.csv") 
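
The regression outputs interpreted in the questions below assume the three cities' series have been stacked into a single long-format data frame. Here is a minimal sketch of one way to do this (the names hotel_long, City, Temp and Hotel are choices made here, not part of the original exercise; with the default factor level ordering, Basel is the reference category):

City <- rep(c("Davos", "Polenca", "Basel"), each = 12) #city label for each monthly observation
Temp <- c(Temp_Davos, Temp_Polenca, Temp_Basel) #stacked temperatures
Hotel <- c(Hotel_Davos, Hotel_Polenca, Hotel_Basel) #stacked hotel occupation
hotel_long <- data.frame(Month = rep(Month, 3), City = factor(City), Temp, Hotel)
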
  a. Interpret the output from a regression model where “temperature” is the covariate and “hotel occupation” is the outcome

The output suggests a 0.077% increase in hotel occupation for each one-degree increase in temperature. However, this increase is not significant (p = 0.883 > 0.05), so we cannot show an association between temperature and hotel occupation.
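
A sketch of how this model might be fitted, assuming the hotel_long data frame from the sketch above:

fit_temp <- lm(Hotel ~ Temp, data = hotel_long) #temperature as the only covariate
summary(fit_temp)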

  b. Interpret the output where ‘city’ is the covariate and “hotel occupation” is the outcome

The average hotel occupation is higher in Davos (by 7.9%) and Polenca (by 0.9%) compared with Basel (the reference category). That is, on average, hotel occupation is 48.3% in Basel, 56.2% (48.3 + 7.9) in Davos and 49.2% (48.3 + 0.9) in Polenca.

However, these differences are not significant (p > 0.05). Therefore, the model cannot show a significant difference in hotel occupation between Davos/Polenca and Basel.
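
A sketch of the corresponding model with city as the covariate (again assuming hotel_long; Basel is the reference level of the City factor):

fit_city <- lm(Hotel ~ City, data = hotel_long) #city as a categorical covariate
summary(fit_city)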

  c. Interpret the ANOVA output and compare it with the output from b)

The analysis of variance (ANOVA) table tells us that the variable city as a whole does not have a significant association with hotel occupation (p = 0.75 > 0.05). This is consistent with the output from b), where neither city coefficient was significantly different from the reference category; the ANOVA simply tests the city factor as a whole rather than each category separately.
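
A sketch of the ANOVA, assuming the fit_city object from the sketch above:

anova(fit_city) #tests the city factor as a whole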

  d. In the output for the multiple linear regression model, both ‘city’ and “temperature” are included as covariates. How can the coefficients be interpreted?

In the multiple regression model, we cannot show an association between temperature and hotel occupation (given the city), and we cannot show an association between city and hotel occupation (given the temperature).
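
A sketch of the multiple regression with both covariates, assuming hotel_long from above:

fit_both <- lm(Hotel ~ City + Temp, data = hotel_long) #city and temperature as covariates
summary(fit_both)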

  e. Now consider the regression model for hotel occupation and temperature fitted separately for each city. How can the results be interpreted, and what are the implications with respect to the models estimated in (a)–(d)? How can the models be improved?

Stratifying the data yields considerably different results:

In Davos, where tourists go for skiing, each increase of 1 °C relates to a drop in hotel occupation of 2.7%. The estimate β ≈ −2.7 is also significantly different from zero (p = 0.000231).

In Polenca, a summer holiday destination, an increase of 1 °C implies an increase in hotel occupation of almost 4%. This estimate is also significantly different from zero (p = 0.00114 < 0.05).

In Basel, a business destination, there is a somewhat higher hotel occupation for higher temperatures (β = 1.3); however, the estimate is not significantly different from zero.

While there is no overall association between temperature and hotel occupation, there is an association between them if one looks at the different cities separately. This suggests that an interaction between temperature and city should be included in the model.
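
A sketch of the stratified analysis, fitting the temperature model separately within each city (assuming hotel_long from above):

fit_davos <- lm(Hotel ~ Temp, data = subset(hotel_long, City == "Davos"))
fit_polenca <- lm(Hotel ~ Temp, data = subset(hotel_long, City == "Polenca"))
fit_basel <- lm(Hotel ~ Temp, data = subset(hotel_long, City == "Basel"))
summary(fit_davos)$coefficients #repeat for the other two cities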

  f. See the output of the model fitted including city, temperature and the interaction between them. How do you interpret the results?

Both interaction terms are significantly different from zero (p = 0.000375 and p = 0.033388).

The estimate of temperature therefore differs by city, and the estimate of city differs by temperature.

For the reference city of Basel, the association between temperature and hotel occupation is estimated as 1.31;

For Davos it is 1.31 − 4.00 = −2.69. This means that in Davos, each one-unit increase in temperature is associated with a decrease in hotel occupation of about 2.7%

For Polenca it is 1.31 + 2.66 = 3.97. This means that in Polenca, each one-unit increase in temperature is associated with an increase in hotel occupation of about 4.0%

Note: these results are identical to those obtained when a separate regression was fitted for each city in part e); they are just summarized in a different way
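
A sketch of the interaction model, assuming hotel_long from above; with this coding, the per-city slopes can be recovered by adding the interaction terms to the Temp coefficient:

fit_int <- lm(Hotel ~ Temp * City, data = hotel_long) #main effects plus temperature-by-city interaction
summary(fit_int)
coef(fit_int)["Temp"] + coef(fit_int)["Temp:CityDavos"] #slope for Davos, expected to be about -2.7 as in part e)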

Extra notes

Adjusted R squared tells us how much of the variation in the outcome variable is explained by the predictor variables, adjusted for the number of predictors in the model

The F statistic tests whether at least one variable included in the model is associated with the outcome variable.
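
Both quantities are reported at the bottom of summary() for a fitted lm object and can also be extracted directly; a small sketch (here fit stands for any fitted lm object, such as those above):

summary(fit)$adj.r.squared #adjusted R squared
summary(fit)$fstatistic #F statistic with its numerator and denominator degrees of freedom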


Akaike Information Criterion/Evidence Ratio

Show that tree density (TreeDens) explains the amount of coarse woody debris (CWD) in a forest by using model selection to compare a linear model of CWD as a function of TreeDens with a null (intercept-only) model

Here is the data

TreeDens <- c(1270, 1210, 1800, 1875, 1300, 2150, 1330, 964, 961, 1400, 1280, 976, 771, 833, 883, 956) ## tree density

CWD <- c(121, 41, 183, 130, 127, 134, 65, 52, 12, 46, 54, 97, 1, 4, 1, 4) ## coarse woody debris

And here is the plot

plot(TreeDens,CWD, pch=19)

Step 1: Model Fitting

m1 <- lm(CWD ~ TreeDens) #fits a linear regression model of CWD as a function of TreeDens
m2 <- lm(CWD ~ 1) #fits a null model, which means it only includes an intercept term (no predictor variables)

Step 2: Model Comparison

summary(m1)
## 
## Call:
## lm(formula = CWD ~ TreeDens)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.62 -22.41 -13.33  26.16  61.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -77.09908   30.60801  -2.519 0.024552 *  
## TreeDens      0.11552    0.02343   4.930 0.000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.32 on 14 degrees of freedom
## Multiple R-squared:  0.6345, Adjusted R-squared:  0.6084 
## F-statistic:  24.3 on 1 and 14 DF,  p-value: 0.0002216

Therefore, for every unit increase in tree density, CWD increases by 0.11552 units, and this increase is statistically significant (p = 0.000222).

summary(m2)
## 
## Call:
## lm(formula = CWD ~ 1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -66.0  -57.0  -14.0   55.5  116.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    67.00      14.51   4.618 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.03 on 15 degrees of freedom

In Model 2, where only an intercept term is considered, the coefficient associated with the intercept is 67.00.

This value represents the estimated average amount of coarse woody debris (CWD) when tree density is not taken into account: in an intercept-only model, the intercept is simply the sample mean of the outcome variable.

So, in this context, the value of 67.00 means that when tree density is not considered, the average amount of coarse woody debris is estimated to be approximately 67 units.
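
This can be verified directly in R:

mean(CWD) #equals the intercept of m2 (67.00), since the null model just estimates the overall mean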

plot(TreeDens,CWD, pch=19)
abline(m1, lty=2) #adds the fitted regression line from m1
abline(m2, lty=2, col="red") #adds the null-model line, a horizontal line at the mean of CWD

Step 3: Model Evaluation

3a) Calculate the log-likelihood of each model

logLik(m1)
## 'log Lik.' -79.11161 (df=3)
logLik(m2)
## 'log Lik.' -87.1633 (df=2)

The log-likelihood quantifies the probability of observing the given data under the assumed statistical model. The higher the log-likelihood, the better the model is at predicting the observed data.

Therefore, in this case, Model 1 (m1) has a higher log-likelihood than Model 2 (m2), suggesting that Model 1 provides a better fit to the observed data.

3b) Calculate the Akaike Information Criterion (AIC) for each model

AIC1 <- AIC(m1)
AIC1
## [1] 164.2232
AIC2 <- AIC(m2)
AIC2
## [1] 178.3266

AIC is a tool used for model selection, especially when comparing multiple models fitted to the same dataset. It helps us determine which model provides the best balance between goodness of fit and simplicity.

AIC takes into account how well the model fits the data. A model that fits the data well will have a low AIC value, indicating that it explains a large portion of the variability in the data.

AIC penalizes models for their complexity. More complex models, with more parameters or features, are penalized more heavily than simpler models. This penalty discourages overfitting, where a model captures noise or random fluctuations in the data rather than true underlying patterns.

When comparing models, the one with the lowest AIC value is considered the best, as it strikes the best balance between goodness of fit and model complexity. A lower AIC value indicates a model that explains the data well while being relatively simple.

In the context of the output provided, Model 1 (m1) has an AIC value of approximately 164.2232, while Model 2 (m2) has an AIC value of approximately 178.3266. Since Model 1 has the lower AIC value, it is considered the better model among the two.
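
As a sketch, the AIC values above can be reproduced by hand from the log-likelihoods as AIC = -2*logLik + 2k, where k is the number of estimated parameters (the df reported by logLik(): intercept, slope and residual variance for m1; intercept and residual variance for m2):

-2 * as.numeric(logLik(m1)) + 2 * 3 #should match AIC1 (about 164.2)
-2 * as.numeric(logLik(m2)) + 2 * 2 #should match AIC2 (about 178.3)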

3c) Calculate the difference in AIC

delta.IC <- function(x) x - min(x) # where x is a vector of AIC values

AICvec <- c(AIC1,AIC2)
d <- delta.IC(AICvec)
d
## [1]  0.00000 14.10339

The difference in AIC (delta AIC) is the difference between each model’s AIC and the lowest AIC among the candidate models. Larger differences indicate less support for that model relative to the best one.

The first value (0.00000) represents the difference between Model 1’s AIC and the minimum AIC (which is Model 1’s AIC itself, indicating that Model 1 is the best). The second value (14.10339) represents the difference between Model 2’s AIC and the minimum AIC (which is Model 1’s AIC), indicating that Model 2 is 14.10339 units worse than Model 1.

3d) Calculate the weighted AIC (Akaike weights)

weight.IC <- function(x) (exp(-0.5*x))/sum(exp(-0.5*x)) 
wAIC <- weight.IC(d)
wAIC
## [1] 0.9991348091 0.0008651909

The weighted AIC calculates the relative likelihood of each model being the best among the candidate models.

The first value (0.9991348091) represents the weight assigned to Model 1, indicating that there’s a 99.91% chance that Model 1 is the best model among the two. The second value (0.0008651909) represents the weight assigned to Model 2, indicating that there’s only a 0.09% chance that Model 2 is the best model among the two.

3e) Calculate the evidence ratio

ER <- wAIC[1]/wAIC[2]
ER
## [1] 1154.814

The evidence ratio provides a measure of the relative support for one model over another. A higher evidence ratio indicates stronger support for the model with the higher weight.

In this case, an evidence ratio of 1154.814 suggests that Model 1 is strongly favored over Model 2, with Model 1 being approximately 1154.814 times more likely to be the best model compared to Model 2.
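
Equivalently, the evidence ratio can be computed directly from the AIC difference, since the two weights share a common denominator:

exp(0.5 * d[2]) #should match ER (about 1155)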

Overall, there is strong support for model 1 (m1) compared to model 2 (m2).

Note: By using multiple methods, we can gain a more comprehensive understanding of which model is most appropriate for the given data.