Reading data
spiderweb <- read.table(file="spider_web.csv", header = TRUE, sep = ",")
spiderweight <- read.table(file="spider_weight.csv", header = TRUE, sep = ",")
Merging data
mydata <- merge(spiderweight, spiderweb, by = c('spider'), all.x=T)
Sub-setting data to Canberra only
spider <- mydata[mydata$population == "canberra",]
Checking subsetted data
dim(spider) #returns the dimensions of the data frame, showing the number of rows and columns
## [1] 96 4
head(spider) #displays the first few rows of the data frame
## spider weight population web_dia
## 1 sp_1 9.29 canberra 90.21
## 2 sp_10 9.39 canberra 101.23
## 3 sp_100 9.56 canberra 99.55
## 13 sp_11 10.93 canberra 105.88
## 24 sp_12 10.23 canberra 104.41
## 35 sp_13 8.06 canberra 70.82
tail(spider) #displays the last few rows of the data frame
## spider weight population web_dia
## 285 sp_94 10.76 canberra 98.38
## 286 sp_95 10.29 canberra 97.67
## 287 sp_96 8.78 canberra 88.47
## 288 sp_97 10.12 canberra 103.09
## 289 sp_98 8.94 canberra 92.58
## 290 sp_99 10.66 canberra 102.32
summary(spider) #generates summary statistics for each variable in the data frame
## spider weight population web_dia
## Length:96 Min. : 6.770 Length:96 Min. : 61.49
## Class :character 1st Qu.: 9.193 Class :character 1st Qu.: 91.06
## Mode :character Median :10.215 Mode :character Median :101.34
## Mean :10.046 Mean :100.21
## 3rd Qu.:10.943 3rd Qu.:109.84
## Max. :12.110 Max. :127.45
unique(spider$population) #returns the unique values of the 'population' variable in the spider data frame
## [1] "canberra"
length(unique(spider$population)) #calculates the number of unique values in the 'population' variable
## [1] 1
unique(spider$spider) #returns the unique values of the 'spider' variable in the spider data frame
## [1] "sp_1" "sp_10" "sp_100" "sp_11" "sp_12" "sp_13" "sp_14" "sp_15"
## [9] "sp_16" "sp_17" "sp_18" "sp_19" "sp_2" "sp_20" "sp_21" "sp_22"
## [17] "sp_23" "sp_24" "sp_25" "sp_27" "sp_28" "sp_29" "sp_3" "sp_30"
## [25] "sp_31" "sp_32" "sp_33" "sp_34" "sp_35" "sp_36" "sp_37" "sp_38"
## [33] "sp_39" "sp_4" "sp_40" "sp_41" "sp_42" "sp_43" "sp_44" "sp_45"
## [41] "sp_46" "sp_48" "sp_49" "sp_5" "sp_50" "sp_51" "sp_52" "sp_53"
## [49] "sp_54" "sp_55" "sp_56" "sp_57" "sp_58" "sp_59" "sp_6" "sp_60"
## [57] "sp_61" "sp_62" "sp_63" "sp_64" "sp_65" "sp_66" "sp_67" "sp_68"
## [65] "sp_69" "sp_7" "sp_70" "sp_71" "sp_72" "sp_73" "sp_74" "sp_75"
## [73] "sp_76" "sp_77" "sp_78" "sp_79" "sp_8" "sp_80" "sp_81" "sp_82"
## [81] "sp_83" "sp_84" "sp_85" "sp_86" "sp_87" "sp_9" "sp_90" "sp_91"
## [89] "sp_92" "sp_93" "sp_94" "sp_95" "sp_96" "sp_97" "sp_98" "sp_99"
length(unique(spider$spider)) #calculates the number of unique values in the 'spider' variable
## [1] 96
Fit a linear model
fit <- lm(spider$web_dia ~ spider$weight) #The lm() function is used to fit a linear model to the data
Create a scatterplot
plot(spider$weight, spider$web_dia, xlab = "Spider weight (g)", ylab = "Web diameter (cm)",
pch = 16, cex = 1.3, col = "blue", main = "Web diameter versus spider weight")
abline(fit, lty=2, col="red") #adds the linear regression 'fit' to the plot
Get the coefficients of the model using the R function
coefficients(fit)
## (Intercept) spider$weight
## -2.39153 10.21288
Interpretation of coefficients: For each one gram increase in the spider weight, the web diameter increases by 10.2 cm
Calculate R squared
summary(fit)$r.squared
## [1] 0.8171372
Interpretation of R squared: The R2 is close to 1, indicating good fit
Create a new variable in the dataset spider to store the binary covariate for weight
spider$bin_weight <- ifelse(spider$weight >= 10, 1, 0) #creates a new binary variable 'bin_weight' in the dataframe based on spider weight, where spiders weighing 10g or more are labeled as 1 and others as 0
Refit model with binary covariate
fit <- lm(spider$web_dia ~ spider$bin_weight) #fits a new linear model using the binary covariate 'bin_weight' instead of the continuous 'weight' variable
Get the coefficients of the binary model
coefficients(fit)
## (Intercept) spider$bin_weight
## 89.22326 19.89410
Plot the model with binary covariate
Remember: when we have a binary variable, the equation changes to:
Reference category i.e. category assigned 0 (in this case weight < 10g): Y = alpha = 89.2
Other category i.e. category assigned 1 (in this case weight >= 10g): Y = alpha + beta = 89.2 + 19.9 = 109.1
plot(spider$bin_weight, spider$web_dia)
abline(fit, lty=2)
abline(89.2, 0, lty=2, col="red") #adds a horizontal line to at y = 89.2
abline(89.2 + 19.9, 0, lty=2, col="green")
Interpretation: The overall equation is web_dia = 89.2 + 19.9 (for bin_weight > 10g). The equation states that the web-dia is expected to be 89.2 cm for spiders < 10g and 19.9 cm larger for spiders that weight >= 10g
What would happen if the code for 0 and 1 was the opposite?
spider$bin_weight2 <- ifelse(spider$weight < 10, 1, 0) #Create a new variable in the dataset spider to store the binary covariate for weight
fit <- lm(spider$web_dia ~ spider$bin_weight2) #re-fit the model with the changed binary variable to see if results are different
coefficients(fit) #Get the model coefficients
## (Intercept) spider$bin_weight2
## 109.1174 -19.8941
plot(spider$bin_weight2, spider$web_dia) #plot the data
abline(fit, lty=2)
abline(109, 0, lty=2, col="red")
abline(109-19.9, 0, lty=2, col="green")
Interpretation: The overall equation is web_dia = 109.1 - 19.9 (for bin_weight < 10g). The equation states that the web-dia is expected to be 109 cm for spiders > 10g but 19.9 cm smaller for spiders that weight < 10g
This means the choice of coding does not affect interpretation of results
Comparing the models using continuous covariate vs binary covariate: despite using a binary covariate in the second model, the overall trend observed in the first model (using continuous covariate) is still consistent. However, the second model only captures the average values for weights in each category (less than 10 grams or 10 grams or more), rather than examining the full range of weights as in the continuous covariate model
To study the association of the monthly average temperature (in ◦C, X) and hotel occupation (in %, Y), we consider data from three cities:
Polenca (Mallorca, Spain) as a summer holiday destination,
Davos (Switzerland) as a winter skiing destination, and
Basel (Switzerland) as a business destination.
The data is as follows, with values per month
#Average monthly temperature
Temp_Davos <- c(-6,-5,2,4,7,15,17,19,13,9,4,0)
Temp_Polenca <- c(10,10,14,17,22,24,26,27,22,19,14,12)
Temp_Basel <- c(1,0,5,9,14,20,23,24,21,14,9,4)
#Hotel occupation
Hotel_Davos <- c(91,89,76,52,42,36,37,39,26,27,68,92)
Hotel_Polenca <- c(13,21,42,64,79,81,86,92,36,23,13,41)
Hotel_Basel <- c(23,82,40,45,39,43,50,95,64,78,9,12)
Create the month variable
Month <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Create the dataframe
HotelOccup <- data.frame(Month, Temp_Davos, Hotel_Davos, Temp_Polenca, Hotel_Polenca, Temp_Basel, Hotel_Basel)
Use the function write.csv() to output the file (note
there is no need to use the argument sep = “.”)
write.csv(HotelOccup, "HotelOccupation.csv")
The output suggests a 0.077 % increase of hotel occupation for each one degree increase in temperature. However, as this increase is not significant (p = 0.883 > 0.05), we therefore cannot show an association between temperature and hotel occupation.
The average hotel occupation is higher in Davos (7.9 %) and Polenca (0.9 %) compared with Basel (reference category). I.e. on average hotel occupation in Basel is 48.3%, in Davos is 56.2% (48.3+7.9) and in Polenca is 49.2% (48.3+0.9).
However, these differences are not significant (p > 0.05). Therefore, the model cannot show a significant difference in hotel occupation between Davos/Polenca and Basel.
The analysis of variance (ANOVA) table tells us that the variable city as a whole did not have a significant association with hotel occupation (as p = 0.75 > 0.05).
In the multivariate model, we cannot show an association between temperature and hotel occupation (given the city); and we cannot show an association between city and hotel occupation (given the temperature).
Stratifying the data yields considerably different results:
In Davos, where tourists go for skiing, each increase of 1◦C relates to a drop in hotel occupation of 2.7%. The estimate β ≈ −2.7 is also significantly different from zero (p = 0.000231).
In Polenca, a summer holiday destination, an increase of 1◦C implies an increase of hotel occupation of almost 4 %. This estimate is also significantly different from zero (p = 0.00114 < 0.05).
In Basel, a business destination, there is a somewhat higher hotel occupation for higher temperatures (β = 1.3); however, the estimate is not significantly different from zero.
While there is no overall association between temperature and hotel occupation, there is an association between them if one looks at the different cities separately. This suggests that an interaction between temperature and city should be included in the model
Both interaction terms are significantly different from zero (p = 0.000375 and p = 0.033388).
The estimate of temperature therefore differs by city, and the estimate of city differs by temperature.
For the reference city of Basel, the association between temperature and hotel occupation is estimated as 1.31;
For Davos it is 1.31 − 4.00 = −2.69. This means that for city Davos, for every increase in one unit of temperature, there will be a decrease in hotel occupation by 4.0%
For Polenca is it 1.31 + 2.66 = 3.97. This means that for city Polenca, for every increase in one unit of temperature, there will be an increase in hotel occupation by 2.7%
Note: these results are identical to those obtained when three different regressions were fitted per city (part e), they are just summarized in a different way
Extra notes
Adjusted R squared tells how much of the change in the outcome variable can be explained by the predictor variables
F statistic confirms whether at least one variable included in the model is associated with the outcome variable.
Show that tree density (TreeDens) explains the amount of coarse woody debris (CWD) in a forest by using model selection between a linear model of CWD versus TreeDens and a null (intercept only) model
Here is the data
TreeDens <- c(1270, 1210, 1800, 1875, 1300, 2150, 1330, 964, 961, 1400, 1280, 976, 771, 833, 883, 956) ## tree density
CWD <- c(121, 41, 183, 130, 127, 134, 65, 52, 12, 46, 54, 97, 1, 4, 1, 4) ## coarse woody debris
And here is the plot
plot(TreeDens,CWD, pch=19)
Step 1: Model Fitting
m1 <- lm(CWD ~ TreeDens) #fits a linear regression model of TreeDens based on CWD
m2 <- lm(CWD ~ 1) #fits a null model, which means it only includes an intercept term (no predictor variables)
Step 2: Model Comparison
summary(m1)
##
## Call:
## lm(formula = CWD ~ TreeDens)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.62 -22.41 -13.33 26.16 61.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -77.09908 30.60801 -2.519 0.024552 *
## TreeDens 0.11552 0.02343 4.930 0.000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.32 on 14 degrees of freedom
## Multiple R-squared: 0.6345, Adjusted R-squared: 0.6084
## F-statistic: 24.3 on 1 and 14 DF, p-value: 0.0002216
Therefore, for every unit increase in tree density, there is a statistically significant increase in CWD by 0.11552.
summary(m2)
##
## Call:
## lm(formula = CWD ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.0 -57.0 -14.0 55.5 116.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.00 14.51 4.618 0.000335 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.03 on 15 degrees of freedom
In Model 2, where only an intercept term is considered, the coefficient associated with the intercept is 67.00.
This value represents the estimated average amount of coarse woody debris (CWD) when tree density is not taken into account. In other words, it’s the expected value of CWD when the predictor variable (in this case, tree density) is not considered in the model.
So, in this context, the value of 67.00 means that when tree density is not considered, the average amount of coarse woody debris is estimated to be approximately 67 units.
plot(TreeDens,CWD, pch=19)
abline(m1, lty=2)
abline(m2, lty=2, col="red")
Step 3: Model Evaluation
3a) Calculate the log-likelihood of each model
logLik(m1)
## 'log Lik.' -79.11161 (df=3)
logLik(m2)
## 'log Lik.' -87.1633 (df=2)
The log liklihood quantifies the probability of observing the given data under the assumed statistical model. The higher the log-likelihood, the better the model is at predicting the observed data.
Therefore, in this case, Model 1 (m1) has a higher log-likelihood than Model 2 (m2), suggesting that Model 1 provides a better fit to the observed data.
3b) Calculate the Akaike Information Criterion (AIC) for each model
AIC1 <- AIC(m1)
AIC1
## [1] 164.2232
AIC2 <- AIC(m2)
AIC2
## [1] 178.3266
AIC is a tool used for model selection, especially when comparing multiple models fitted to the same dataset. It helps us determine which model provides the best balance between goodness of fit and simplicity.
AIC takes into account how well the model fits the data. A model that fits the data well will have a low AIC value, indicating that it explains a large portion of the variability in the data.
AIC penalizes models for their complexity. More complex models, with more parameters or features, are penalized more heavily than simpler models. This penalty discourages overfitting, where a model captures noise or random fluctuations in the data rather than true underlying patterns.
When comparing models, the one with the lowest AIC value is considered the best, as it strikes the best balance between goodness of fit and model complexity. A lower AIC value indicates a model that explains the data well while being relatively simple.
In the context of the output provided, Model 1 (m1) has an AIC value of approximately 164.2232, while Model 2 (m2) has an AIC value of approximately 178.3266. Since Model 1 has the lower AIC value, it is considered the better model among the two.
3c) Calculate the difference in AIC
delta.IC <- function(x) x - min(x) # where x is to be a vector of AIC
AICvec <- c(AIC1,AIC2)
d <- delta.IC(AICvec)
d
## [1] 0.00000 14.10339
The difference in AIC calculates the difference in AIC values between models. Smaller differences suggest that one model is significantly better than the other.
The first value (0.00000) represents the difference between Model 1’s AIC and the minimum AIC (which is Model 1’s AIC itself, indicating that Model 1 is the best). The second value (14.10339) represents the difference between Model 2’s AIC and the minimum AIC (which is Model 1’s AIC), indicating that Model 2 is 14.10339 units worse than Model 1.
3d) Calculate the the weighted AIC
weight.IC <- function(x) (exp(-0.5*x))/sum(exp(-0.5*x))
wAIC <- weight.IC(d)
wAIC
## [1] 0.9991348091 0.0008651909
The weighted AIC calculates the relative likelihood of each model being the best among the candidate models.
The first value (0.9991348091) represents the weight assigned to Model 1, indicating that there’s a 99.91% chance that Model 1 is the best model among the two. The second value (0.0008651909) represents the weight assigned to Model 2, indicating that there’s only a 0.09% chance that Model 2 is the best model among the two.
3e) Calculate the evidence ratio
ER <- wAIC[1]/wAIC[2]
ER
## [1] 1154.814
The evidence ratio provides a measure of the relative support for one model over another. A higher evidence ratio indicates stronger support for the model with the higher weight.
In this case, an evidence ratio of 1154.814 suggests that Model 1 is strongly favored over Model 2, with Model 1 being approximately 1154.814 times more likely to be the best model compared to Model 2.
Overall, there is strong support for model 1 (m1) compared to model 2 (m2)
Note: By using multiple methods, we can gain a more comprehensive understanding of which model is most appropriate for the given data.