SurvFemales <- 11
SurvMales <- 8
NotSurvFemales <- 11
NotSurvMales <- 4
1. Calculate probability of survival
p.survival <- (SurvFemales + SurvMales)/(SurvFemales + NotSurvFemales + SurvMales + NotSurvMales)
p.survival
## [1] 0.5588235
2. Calculate odds of survival
odds.survival <- p.survival/(1-p.survival)
odds.survival
## [1] 1.266667
3. Take log(odds of survival) to transform data
log.odds <- log(odds.survival)
log.odds
## [1] 0.2363888
4. Calculate the odds of survival based on gender
Odds(survival female) = probability of survival given female / probability of not surviving given female
OR
probability of survival given female / 1 - probability of surviving given female
p.survival_given_female <- SurvFemales/(SurvFemales + NotSurvFemales)
p.notsurvival_given_female <- SurvFemales/(SurvFemales + NotSurvFemales)
odds.survivalF <- p.survival_given_female/p.notsurvival_given_female
odds.survivalF
## [1] 1
Odds(survival male) = probability of survival given male / probability of not surviving given male
p.survival_given_male <- SurvMales/(SurvMales + NotSurvMales)
p.notsurvival_given_male <- NotSurvMales/(SurvMales + NotSurvMales)
odds.survivalM <- p.survival_given_male/p.notsurvival_given_male
odds.survivalM
## [1] 2
5. Calculate the odds ratio
odds.ratio <- odds.survivalF/odds.survivalM
odds.ratio
## [1] 0.5
Therefore, the odds of surviving being a female is 0.5x that the odds of surviving being a male
** 6. Now lets backtransform to confirm our answer from part 1**
Remember p = odds/1+odds
p.survival <- odds.survival/(1+odds.survival)
p.survival
## [1] 0.5588235
Sea turtles are now hatching in a nearby beach while a severe heatwave is taking place. As a team, we will survey the beach to deteremine if turtle hatchlings are surviving the event. We will also find out if the hatchlings are females or males. With the data collected, create a confusion-type of matrix (we will do this together in class)
Ensure to compare answers to manual answers from above
a) Input matrix into R
SurvFemales <- 11
SurvMales <- 8
NotSurvFemales <- 11
NotSurvMales <- 4
Create vector of zeros and ones
Survived <- c(rep(1, SurvFemales), rep(0, NotSurvFemales), rep(1, SurvMales), rep(0, NotSurvMales))
Survived
## [1] 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0
Create vector of males and females
Sex <- c(rep("female", SurvFemales), rep("female", NotSurvFemales), rep("male", SurvMales), rep("male", NotSurvMales))
Sex
## [1] "female" "female" "female" "female" "female" "female" "female" "female"
## [9] "female" "female" "female" "female" "female" "female" "female" "female"
## [17] "female" "female" "female" "female" "female" "female" "male" "male"
## [25] "male" "male" "male" "male" "male" "male" "male" "male"
## [33] "male" "male"
mydata <- data.frame(Survived, Sex)
mydata
## Survived Sex
## 1 1 female
## 2 1 female
## 3 1 female
## 4 1 female
## 5 1 female
## 6 1 female
## 7 1 female
## 8 1 female
## 9 1 female
## 10 1 female
## 11 1 female
## 12 0 female
## 13 0 female
## 14 0 female
## 15 0 female
## 16 0 female
## 17 0 female
## 18 0 female
## 19 0 female
## 20 0 female
## 21 0 female
## 22 0 female
## 23 1 male
## 24 1 male
## 25 1 male
## 26 1 male
## 27 1 male
## 28 1 male
## 29 1 male
## 30 1 male
## 31 0 male
## 32 0 male
## 33 0 male
## 34 0 male
b) Create a new variable for Male and another for Female. This will later be useful for calculating the odds based on sex
mydata$Male <- ifelse(mydata$Sex == "male", 1, 0) #creates a new column 'male' and if sex is equal to "male", it assigns the value 1, otherwise, it assigns 0.
mydata$Female <- ifelse(mydata$Sex == "female", 1, 0) #creates a new column 'female' and if sex is equal to "female", it assigns the value 1, otherwise, it assigns 0.
mydata
## Survived Sex Male Female
## 1 1 female 0 1
## 2 1 female 0 1
## 3 1 female 0 1
## 4 1 female 0 1
## 5 1 female 0 1
## 6 1 female 0 1
## 7 1 female 0 1
## 8 1 female 0 1
## 9 1 female 0 1
## 10 1 female 0 1
## 11 1 female 0 1
## 12 0 female 0 1
## 13 0 female 0 1
## 14 0 female 0 1
## 15 0 female 0 1
## 16 0 female 0 1
## 17 0 female 0 1
## 18 0 female 0 1
## 19 0 female 0 1
## 20 0 female 0 1
## 21 0 female 0 1
## 22 0 female 0 1
## 23 1 male 1 0
## 24 1 male 1 0
## 25 1 male 1 0
## 26 1 male 1 0
## 27 1 male 1 0
## 28 1 male 1 0
## 29 1 male 1 0
## 30 1 male 1 0
## 31 0 male 1 0
## 32 0 male 1 0
## 33 0 male 1 0
## 34 0 male 1 0
c) Calculate the probability of turtles surviving
alive <- sum(Survived) #tally's the total number of 1's under the column survived (i.e. the total of number of turtles that survived)
probsurvival <- sum(alive)/dim(mydata)[1] #dim gives no. of rows (i.e. total number of turtles) and no. of columns as a vector. Indexing the first element will return the total number of turtles
probsurvival
## [1] 0.5588235
d) Calculate the odds of survival
OddsSurv <- probsurvival/(1 - probsurvival)
OddsSurv
## [1] 1.266667
e) Calculate the odds of survival dependent on sex. What is the odds ratio?
Calculate the Odds of females surviving.
Remember odds of females surviving = probability of females surviving/probability of females not surviving
OR
probability of females surviving/1 - probability of females surviving
SurvFem <- ifelse(mydata$Sex == "female" & mydata$Survived == 1, 1, 0) #assigns a 1 to turtles in the data set that are female and survived, and a 0 to turtles in the dataset that are female and did not survive.
SurvFem
## [1] 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NotSurvFem <- ifelse(mydata$Sex == "female" & mydata$Survived == 0, 1, 0) #assigns a 1 to turtles in the data set that are female and did not survive, and a 0 to turtles in the dataset that are female and survived.
NotSurvFem
## [1] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
OddsFem <- (sum(SurvFem) / sum(mydata$Female)) / (sum(NotSurvFem) / sum(mydata$Female))
OddsFem
## [1] 1
Calculate the Odds of males surviving
SurvMale <- ifelse(mydata$Sex == "male" & mydata$Survived == 1, 1, 0)
SurvMale
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0
NotSurvMale <- ifelse(mydata$Sex == "male" & mydata$Survived == 0, 1, 0)
NotSurvMale
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
OddsMale <- (sum(SurvMale) / sum(mydata$Male)) / (sum(NotSurvMale) / sum(mydata$Male))
OddsMale
## [1] 2
Calculate the Odds Ratio
OddsRatio <- OddsFem/OddsMale
OddsRatio
## [1] 0.5
f) Calculate the log(odds) of survival
logOdds <- log(OddsSurv)
logOdds
## [1] 0.2363888
e) Show that the backtransformed log(odds of survival) are equal to the probability of survival from part c)
Remember that: probability of survival = backtransformed log(odds of survival) = odds of survial/1 + odds of survival
myprob <- OddsSurv /(1+OddsSurv)
myprob
## [1] 0.5588235
a) Read in the dataset “Flatfish.txt” and do a quick data exploration.
Solea <- read.table("Flatfish.txt", header = TRUE, sep = "t")
Note: text files are tab delinated, not comma separated
Data exploration
Note: we are basically just interested in the last 3 columns
head(Solea)
## Sample season month temperature salinity Occurrences
## 1 1 1 5 20 30 0
## 2 2 1 5 18 29 0
## 3 3 1 5 19 30 1
## 4 4 1 5 20 29 0
## 5 5 1 5 20 30 0
## 6 6 1 5 20 32 0
tail(Solea)
## Sample season month temperature salinity Occurrences
## 60 60 2 9 19 31 0
## 61 61 2 9 19 31 0
## 62 62 2 9 21 10 1
## 63 63 2 9 21 16 0
## 64 64 2 9 22 19 0
## 65 65 2 9 21 5 0
Include a plot of Occurrences v. Temperature and another of Occurrences v. Salinity
plot(Solea$temperature, Solea$Occurrences) #creates a scatter plot where the x-axis represents the temperature and the y-axis represents the occurrences of flatfish (presence or absence)
plot(Solea$salinity, Solea$Occurrences) #creates a scatter plot where the x-axis represents salinity and the y-axis represents the occurrences of flatfish (presence or absence)
b) What type of model should be fitted to these data? Why?
The response variable is binary, indicating we need logistic regression modelling.
The model should be a glm (generalised linear model) with binomial distribution and a logit link function.
c) Fit the appropriate model to the relationship between Solea and salinity
Fit logistic regression model with ‘salinity’ as predictor variable
Solea.logit <- glm(Occurrences ~ salinity, data = Solea, family = binomial(link = "logit"))
summary(Solea.logit)
##
## Call:
## glm(formula = Occurrences ~ salinity, family = binomial(link = "logit"),
## data = Solea)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.66071 0.90167 2.951 0.003169 **
## salinity -0.12985 0.03494 -3.716 0.000202 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 87.492 on 64 degrees of freedom
## Residual deviance: 68.560 on 63 degrees of freedom
## AIC: 72.56
##
## Number of Fisher Scoring iterations: 4
Calculate deviance explained by the model
mydeviance <- ((Solea.logit$null.deviance - Solea.logit$deviance) / Solea.logit$null.deviance)*100
round(mydeviance, 2)
## [1] 21.64
Interpretation: The model with salinity only explains just over 20% of the deviance/variance
d) Interpret the modelling results
The logistic regression equation is given by log(odds of flatfish occurrence) = 2.7 - 0.13*salinity
The coefficient estimate for salinity is significant and equal to -0.12985 (~ -0.13). This means that the log(odds of flatfish occurrence) decreases by 0.13 for every one unit increase in salinity
Therefore, the odds of flatfish occurrence will increase by e^-0.13 = 0.88 for every one unit increase in salinity.
For every one unit change in salinity:
For salinity = 0 –> log(odds) = 2.7 - 0.13*0 = 2.7 –> odds = e^(2.7) = 14.9
For salinity = 1 –> log(odds) = 2.7 - 0.13*1 = 2.57 –> odds = e^(2.57) = 13.1
Therefore, for every one unit change in salinity, the odds decrease by (13.1 - 14.9/14.9) * 100 = 12%
e) Plot the raw data and the fitted relationship between Solea. Next plot the confidence intervals for this relationship. What is the problem with this plot?
First create a scatter plot with the x-axis representing salinity, and the y-axis representing the probability of flatfish presence
Note: The with() function in R is used to simplify code
when working with data frames. It allows you to execute expressions in
an environment where the data frame’s columns can be referred to by
their names directly, without specifying the data frame name each time.
In this case,we can directly reference these columns without specifying
Solea$salinity and Solea$Occurrences.
Next, sort the Solea data frame by salinity in ascending order and save it in the variable ind. Note: the order will be as per row number NOT the value of salinity itself
Now predict the fitted relationship and plot it
with(Solea, plot(salinity, Occurrences, xlab = "Salinity", ylab = "Probability Solea present",
axes = T, #ensures that both x and y axes are drawn on the plot
type = "p")) #indicates that the plot should display individual data points as symbols or points
ind <- order(Solea$salinity)
predictions <- predict(Solea.logit, type="response")[ind]
lines(Solea$salinity[ind], predictions, type = "l", lty = 1, col = 'blue')
Predict the proportion of flatfish present and calculate and plot 95%
confidence intervals
Solea.predict <- predict(Solea.logit, type = "response", se = TRUE)
with(Solea, plot(salinity, Occurrences, xlab = "Salinity", ylab = "Probability Solea present",
axes = T, #ensures that both x and y axes are drawn on the plot
type = "p")) #indicates that the plot should display individual data points as symbols or points
predictions <- predict(Solea.logit, type="response")[ind]
lines(Solea$salinity[ind], predictions, type = "l", lty = 1, col = 'blue')
lines(Solea$salinity[ind], Solea.predict$fit[ind] - (qt(0.975, dim(Solea)[1]) * Solea.predict$se[ind]), lty = 2, col = 'red')
lines(Solea$salinity[ind], Solea.predict$fit[ind] + (qt(0.975, dim(Solea)[1]) * Solea.predict$se[ind]), lty = 2, col = 'red')
Problem: The issue highlighted here is that the confidence intervals extend beyond the acceptable range of probabilities (0 to 1). This is problematic because probabilities cannot be less than 0 or greater than 1. Having confidence intervals that exceed these limits suggests that the model’s predictions may not be reliable in certain regions of the predictor variable (in this case, salinity). This could indicate issues with the model’s assumptions or the data itself.
f) What is the ratio of evidence for a relationship between Solea presence and estuary salinity compared to the null model?
Fit a null model
Solea.null <- glm(Occurrences ~ 1, family = binomial (link="logit"), data = Solea)
summary(Solea.null)
##
## Call:
## glm(formula = Occurrences ~ 1, family = binomial(link = "logit"),
## data = Solea)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4055 0.2532 -1.601 0.109
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 87.492 on 64 degrees of freedom
## Residual deviance: 87.492 on 64 degrees of freedom
## AIC: 89.492
##
## Number of Fisher Scoring iterations: 4
Do a model selection procedure to get the ER
First get AICx
AICvec <- c(AIC(Solea.logit), AIC(Solea.null))
AICvec
## [1] 72.55999 89.49152
Calculate the difference dAIC
delta.IC <- function(x) x - min(x) ## where x is to be a vector of AIC
d <- delta.IC(AICvec)
d
## [1] 0.00000 16.93152
And now the wAIC
weight.IC <- function(x) (exp(-0.5*x))/sum(exp(-0.5*x)) ## Where x is a vector of dIC
wAIC <- weight.IC(d)
wAIC
## [1] 0.999789489 0.000210511
We can now calculate the ‘Evidence ratio’ by dividing one weight by the other
ER <- wAIC[1]/wAIC[2]
ER
## [1] 4749.345
Interpretation: overall, there high support for Solea.logit comparing to Solea.null