Logit transformation by hand

SurvFemales <- 11
SurvMales <- 8
NotSurvFemales <- 11
NotSurvMales <- 4

1. Calculate probability of survival

p.survival <-  (SurvFemales + SurvMales)/(SurvFemales + NotSurvFemales + SurvMales + NotSurvMales)
p.survival

## [1] 0.5588235

2. Calculate odds of survival

odds.survival <- p.survival/(1-p.survival)
odds.survival

## [1] 1.266667

3. Take log(odds of survival) to transform data

log.odds <- log(odds.survival)
log.odds

## [1] 0.2363888

4. Calculate the odds of survival based on gender

Odds(survival female) = probability of survival given female / probability of not surviving given female

probability of survival given female / 1 - probability of surviving given female

p.survival_given_female <- SurvFemales/(SurvFemales + NotSurvFemales) 
p.notsurvival_given_female <- SurvFemales/(SurvFemales + NotSurvFemales)

odds.survivalF <- p.survival_given_female/p.notsurvival_given_female
odds.survivalF

## [1] 1

Odds(survival male) = probability of survival given male / probability of not surviving given male

p.survival_given_male <- SurvMales/(SurvMales + NotSurvMales)
p.notsurvival_given_male <- NotSurvMales/(SurvMales + NotSurvMales)

odds.survivalM <- p.survival_given_male/p.notsurvival_given_male
odds.survivalM

## [1] 2

5. Calculate the odds ratio

odds.ratio <- odds.survivalF/odds.survivalM
odds.ratio

## [1] 0.5

Therefore, the odds of surviving being a female is 0.5x that the odds of surviving being a male

** 6. Now lets backtransform to confirm our answer from part 1**

Remember p = odds/1+odds

p.survival <- odds.survival/(1+odds.survival)
p.survival

## [1] 0.5588235

Exercise 1

Sea turtles are now hatching in a nearby beach while a severe heatwave is taking place. As a team, we will survey the beach to deteremine if turtle hatchlings are surviving the event. We will also find out if the hatchlings are females or males. With the data collected, create a confusion-type of matrix (we will do this together in class)

Ensure to compare answers to manual answers from above

a) Input matrix into R

SurvFemales <- 11
SurvMales <- 8
NotSurvFemales <- 11
NotSurvMales <- 4

Create vector of zeros and ones

Survived <- c(rep(1, SurvFemales), rep(0, NotSurvFemales), rep(1, SurvMales), rep(0, NotSurvMales))
Survived

##  [1] 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0

Create vector of males and females

Sex <- c(rep("female", SurvFemales), rep("female", NotSurvFemales), rep("male", SurvMales), rep("male", NotSurvMales))
Sex

##  [1] "female" "female" "female" "female" "female" "female" "female" "female"
##  [9] "female" "female" "female" "female" "female" "female" "female" "female"
## [17] "female" "female" "female" "female" "female" "female" "male"   "male"  
## [25] "male"   "male"   "male"   "male"   "male"   "male"   "male"   "male"  
## [33] "male"   "male"

mydata <- data.frame(Survived, Sex)
mydata

##    Survived    Sex
## 1         1 female
## 2         1 female
## 3         1 female
## 4         1 female
## 5         1 female
## 6         1 female
## 7         1 female
## 8         1 female
## 9         1 female
## 10        1 female
## 11        1 female
## 12        0 female
## 13        0 female
## 14        0 female
## 15        0 female
## 16        0 female
## 17        0 female
## 18        0 female
## 19        0 female
## 20        0 female
## 21        0 female
## 22        0 female
## 23        1   male
## 24        1   male
## 25        1   male
## 26        1   male
## 27        1   male
## 28        1   male
## 29        1   male
## 30        1   male
## 31        0   male
## 32        0   male
## 33        0   male
## 34        0   male

b) Create a new variable for Male and another for Female. This will later be useful for calculating the odds based on sex

mydata$Male <-  ifelse(mydata$Sex == "male", 1, 0) #creates a new column 'male' and if sex is equal to "male", it assigns the value 1, otherwise, it assigns 0.
mydata$Female <- ifelse(mydata$Sex == "female", 1, 0)  #creates a new column 'female' and if sex is equal to "female", it assigns the value 1, otherwise, it assigns 0.
mydata

##    Survived    Sex Male Female
## 1         1 female    0      1
## 2         1 female    0      1
## 3         1 female    0      1
## 4         1 female    0      1
## 5         1 female    0      1
## 6         1 female    0      1
## 7         1 female    0      1
## 8         1 female    0      1
## 9         1 female    0      1
## 10        1 female    0      1
## 11        1 female    0      1
## 12        0 female    0      1
## 13        0 female    0      1
## 14        0 female    0      1
## 15        0 female    0      1
## 16        0 female    0      1
## 17        0 female    0      1
## 18        0 female    0      1
## 19        0 female    0      1
## 20        0 female    0      1
## 21        0 female    0      1
## 22        0 female    0      1
## 23        1   male    1      0
## 24        1   male    1      0
## 25        1   male    1      0
## 26        1   male    1      0
## 27        1   male    1      0
## 28        1   male    1      0
## 29        1   male    1      0
## 30        1   male    1      0
## 31        0   male    1      0
## 32        0   male    1      0
## 33        0   male    1      0
## 34        0   male    1      0

c) Calculate the probability of turtles surviving

alive <- sum(Survived) #tally's the total number of 1's under the column survived (i.e. the total of number of turtles that survived)

probsurvival <- sum(alive)/dim(mydata)[1] #dim gives no. of rows (i.e. total number of turtles) and no. of columns as a vector. Indexing the first element will return the total number of turtles 
probsurvival

## [1] 0.5588235

d) Calculate the odds of survival

OddsSurv <- probsurvival/(1 - probsurvival)
OddsSurv

## [1] 1.266667

e) Calculate the odds of survival dependent on sex. What is the odds ratio?

Calculate the Odds of females surviving.

Remember odds of females surviving = probability of females surviving/probability of females not surviving

probability of females surviving/1 - probability of females surviving

SurvFem <-  ifelse(mydata$Sex == "female" & mydata$Survived == 1, 1, 0) #assigns a 1 to turtles in the data set that are female and survived, and a 0 to turtles in the dataset that are female and did not survive. 
SurvFem

##  [1] 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

NotSurvFem <- ifelse(mydata$Sex == "female" & mydata$Survived == 0, 1, 0) #assigns a 1 to turtles in the data set that are female and did not survive, and a 0 to turtles in the dataset that are female and survived. 
NotSurvFem

##  [1] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

OddsFem <- (sum(SurvFem) / sum(mydata$Female)) / (sum(NotSurvFem) / sum(mydata$Female)) 
OddsFem

## [1] 1

Calculate the Odds of males surviving

SurvMale <- ifelse(mydata$Sex == "male" & mydata$Survived == 1, 1, 0)
SurvMale

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0

NotSurvMale <- ifelse(mydata$Sex == "male" & mydata$Survived == 0, 1, 0)
NotSurvMale

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1

OddsMale <- (sum(SurvMale) / sum(mydata$Male)) / (sum(NotSurvMale) / sum(mydata$Male))
OddsMale

## [1] 2

Calculate the Odds Ratio

OddsRatio <-  OddsFem/OddsMale
OddsRatio

## [1] 0.5

f) Calculate the log(odds) of survival

logOdds <- log(OddsSurv)
logOdds

## [1] 0.2363888

e) Show that the backtransformed log(odds of survival) are equal to the probability of survival from part c)

Remember that: probability of survival = backtransformed log(odds of survival) = odds of survial/1 + odds of survival

myprob <- OddsSurv /(1+OddsSurv)
myprob

## [1] 0.5588235

Exercise 2: Predicting the presence or absence of flatfish based on temperature and salinity

a) Read in the dataset “Flatfish.txt” and do a quick data exploration.

Solea <- read.table("Flatfish.txt", header = TRUE, sep = "t")

Note: text files are tab delinated, not comma separated

Data exploration

Note: we are basically just interested in the last 3 columns

head(Solea)

##   Sample season month temperature salinity Occurrences
## 1      1      1     5          20       30           0
## 2      2      1     5          18       29           0
## 3      3      1     5          19       30           1
## 4      4      1     5          20       29           0
## 5      5      1     5          20       30           0
## 6      6      1     5          20       32           0

tail(Solea)

##    Sample season month temperature salinity Occurrences
## 60     60      2     9          19       31           0
## 61     61      2     9          19       31           0
## 62     62      2     9          21       10           1
## 63     63      2     9          21       16           0
## 64     64      2     9          22       19           0
## 65     65      2     9          21        5           0

Include a plot of Occurrences v. Temperature and another of Occurrences v. Salinity

plot(Solea$temperature, Solea$Occurrences) #creates a scatter plot where the x-axis represents the temperature and the y-axis represents the occurrences of flatfish (presence or absence)

plot(Solea$salinity, Solea$Occurrences) #creates a scatter plot where the x-axis represents salinity and the y-axis  represents the occurrences of flatfish (presence or absence)

b) What type of model should be fitted to these data? Why?

The response variable is binary, indicating we need logistic regression modelling.

The model should be a glm (generalised linear model) with binomial distribution and a logit link function.

c) Fit the appropriate model to the relationship between Solea and salinity

Fit logistic regression model with ‘salinity’ as predictor variable

Solea.logit <- glm(Occurrences ~ salinity, data = Solea, family = binomial(link = "logit"))
summary(Solea.logit)

## 
## Call:
## glm(formula = Occurrences ~ salinity, family = binomial(link = "logit"), 
##     data = Solea)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.66071    0.90167   2.951 0.003169 ** 
## salinity    -0.12985    0.03494  -3.716 0.000202 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 87.492  on 64  degrees of freedom
## Residual deviance: 68.560  on 63  degrees of freedom
## AIC: 72.56
## 
## Number of Fisher Scoring iterations: 4

Calculate deviance explained by the model

mydeviance <- ((Solea.logit$null.deviance - Solea.logit$deviance) / Solea.logit$null.deviance)*100
round(mydeviance, 2)

## [1] 21.64

Interpretation: The model with salinity only explains just over 20% of the deviance/variance

d) Interpret the modelling results

The logistic regression equation is given by log(odds of flatfish occurrence) = 2.7 - 0.13*salinity

The coefficient estimate for salinity is significant and equal to -0.12985 (~ -0.13). This means that the log(odds of flatfish occurrence) decreases by 0.13 for every one unit increase in salinity

Therefore, the odds of flatfish occurrence will increase by e^-0.13 = 0.88 for every one unit increase in salinity.

For every one unit change in salinity:

For salinity = 0 –> log(odds) = 2.7 - 0.13*0 = 2.7 –> odds = e^(2.7) = 14.9

For salinity = 1 –> log(odds) = 2.7 - 0.13*1 = 2.57 –> odds = e^(2.57) = 13.1

Therefore, for every one unit change in salinity, the odds decrease by (13.1 - 14.9/14.9) * 100 = 12%

e) Plot the raw data and the fitted relationship between Solea. Next plot the confidence intervals for this relationship. What is the problem with this plot?

First create a scatter plot with the x-axis representing salinity, and the y-axis representing the probability of flatfish presence

Note: The with() function in R is used to simplify code when working with data frames. It allows you to execute expressions in an environment where the data frame’s columns can be referred to by their names directly, without specifying the data frame name each time. In this case,we can directly reference these columns without specifying Solea$salinity and Solea$Occurrences.

Next, sort the Solea data frame by salinity in ascending order and save it in the variable ind. Note: the order will be as per row number NOT the value of salinity itself

Now predict the fitted relationship and plot it

with(Solea, plot(salinity, Occurrences, xlab = "Salinity", ylab = "Probability Solea present", 
                axes = T, #ensures that both x and y axes are drawn on the plot
                type = "p")) #indicates that the plot should display individual data points as symbols or points
ind <- order(Solea$salinity)
predictions <- predict(Solea.logit, type="response")[ind]

lines(Solea$salinity[ind], predictions, type = "l", lty = 1, col = 'blue')

Predict the proportion of flatfish present and calculate and plot 95% confidence intervals

Solea.predict <- predict(Solea.logit, type = "response", se = TRUE)

with(Solea, plot(salinity, Occurrences, xlab = "Salinity", ylab = "Probability Solea present", 
                axes = T, #ensures that both x and y axes are drawn on the plot
                type = "p")) #indicates that the plot should display individual data points as symbols or points

predictions <- predict(Solea.logit, type="response")[ind]
lines(Solea$salinity[ind], predictions, type = "l", lty = 1, col = 'blue')

lines(Solea$salinity[ind], Solea.predict$fit[ind] - (qt(0.975, dim(Solea)[1]) * Solea.predict$se[ind]), lty = 2, col = 'red')
lines(Solea$salinity[ind], Solea.predict$fit[ind] + (qt(0.975, dim(Solea)[1]) * Solea.predict$se[ind]), lty = 2, col = 'red')

Problem: The issue highlighted here is that the confidence intervals extend beyond the acceptable range of probabilities (0 to 1). This is problematic because probabilities cannot be less than 0 or greater than 1. Having confidence intervals that exceed these limits suggests that the model’s predictions may not be reliable in certain regions of the predictor variable (in this case, salinity). This could indicate issues with the model’s assumptions or the data itself.

f) What is the ratio of evidence for a relationship between Solea presence and estuary salinity compared to the null model?

Fit a null model

Solea.null <- glm(Occurrences ~ 1, family = binomial (link="logit"), data = Solea)
summary(Solea.null)

## 
## Call:
## glm(formula = Occurrences ~ 1, family = binomial(link = "logit"), 
##     data = Solea)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.4055     0.2532  -1.601    0.109
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 87.492  on 64  degrees of freedom
## Residual deviance: 87.492  on 64  degrees of freedom
## AIC: 89.492
## 
## Number of Fisher Scoring iterations: 4

Do a model selection procedure to get the ER

First get AICx

AICvec <- c(AIC(Solea.logit), AIC(Solea.null))
AICvec

## [1] 72.55999 89.49152

Calculate the difference dAIC

delta.IC <- function(x) x - min(x) ## where x is to be a vector of AIC
d <- delta.IC(AICvec)
d

## [1]  0.00000 16.93152

And now the wAIC

weight.IC <- function(x) (exp(-0.5*x))/sum(exp(-0.5*x)) ## Where x is a vector of dIC
wAIC <- weight.IC(d)
wAIC

## [1] 0.999789489 0.000210511

We can now calculate the ‘Evidence ratio’ by dividing one weight by the other

ER <- wAIC[1]/wAIC[2]
ER

## [1] 4749.345

Interpretation: overall, there high support for Solea.logit comparing to Solea.null

Week 9 Workshop: Binomial Logistic Regression

2024-05-01

Logit transformation by hand

Exercise 1

Exercise 2: Predicting the presence or absence of flatfish based on temperature and salinity