library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# For the logistic model we will use the Titanic data set and predict survival
# Import the data set
library(readr)
train <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
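As the message itself suggests, the column-type note can be silenced at import time, for example:
# Equivalent import without the column-specification message
train <- read_csv("train.csv", show_col_types = FALSE)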
# Clean the data: drop rows with a missing Age, then encode sex as 0s and 1s (male = 0, female = 1)
train <- train[!is.na(train$Age), ]
train <- train %>%
  mutate(Sex_binary = ifelse(Sex == "male", 0, 1))
head(train)
## # A tibble: 6 × 13
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 0 3 Braund… male 22 1 0 A/5 2… 7.25 <NA>
## 2 2 1 1 Cuming… fema… 38 1 0 PC 17… 71.3 C85
## 3 3 1 3 Heikki… fema… 26 0 0 STON/… 7.92 <NA>
## 4 4 1 1 Futrel… fema… 35 1 0 113803 53.1 C123
## 5 5 0 3 Allen,… male 35 0 0 373450 8.05 <NA>
## 6 7 0 1 McCart… male 54 0 0 17463 51.9 E46
## # ℹ 2 more variables: Embarked <chr>, Sex_binary <dbl>
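A quick optional check (not in the original output): filtering out missing ages drops 177 of the 891 imported rows, leaving 714, which matches the 713 null-deviance degrees of freedom in the model summary below.
# Confirm the size of the cleaned data: 891 - 177 = 714 rows remain
nrow(train)
sum(is.na(train$Age)) # 0 after the filter above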
model <- glm(formula = Survived ~ Pclass + Sex_binary + Age, family = binomial(link = 'logit'), data = train)
summary(model)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex_binary + Age, family = binomial(link = "logit"),
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.533875 0.456247 5.554 2.80e-08 ***
## Pclass -1.288545 0.139259 -9.253 < 2e-16 ***
## Sex_binary 2.522131 0.207283 12.168 < 2e-16 ***
## Age -0.036929 0.007628 -4.841 1.29e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 647.29 on 710 degrees of freedom
## AIC: 655.29
##
## Number of Fisher Scoring iterations: 5
exp() calculates e^x. A logistic regression coefficient is the expected change in the log-odds of the outcome per one-unit change in the corresponding predictor X, so exponentiating a coefficient converts it into an odds ratio.
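Rather than exponentiating each estimate by hand, the odds ratios for all terms can be pulled out at once; this is a small convenience sketch, not part of the original output:
# Odds ratio for every term: exponentiate the whole coefficient vector
exp(coef(model))
# Percent change in the odds per one-unit increase in each predictor
(exp(coef(model)) - 1) * 100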
Pclass: The Pclass variable is defined as follows: 1 = first class, 2 = second class, 3 = third class.
Coefficient: -1.288545
exp(-1.288545)
## [1] 0.2756716
Since the coefficient is negative, there is an inverse relationship between Pclass and survival: as Pclass increases (generally indicating poorer passengers), the odds of survival decrease, which is expected based on the historical record.
A one-unit increase in Pclass corresponds to a decrease of about 1.29 in the log-odds of survival.
Exponentiating the coefficient yields 0.2757, meaning that for every one-unit increase in Pclass the odds of survival decrease by 1 - 0.2757 = 0.7243, or about 72.43%.
Sex_binary: male = 0, female = 1.
Coefficient: 2.522131.
The positive value means there is a positive relationship between the sex coding and survival; here it means that being a woman substantially increased the chance of survival.
exp(2.522131)
## [1] 12.45511
For a one-unit increase in Sex_binary, the log-odds of survival go up by 2.522.
For a one-unit increase in Sex_binary, the odds of survival increase by a factor of about 12.455, i.e., by (12.455 - 1) ≈ 1145%!
Age: the age of the passengers.
Coefficient: -0.036929. The negative sign means an inverse relationship between age and the odds of survival: as age goes up, the odds of survival go down.
exp(-.036929)
## [1] 0.9637446
For a one-unit (one-year) increase in age, the log-odds of survival decrease by 0.036929.
For a one-unit increase in age, the odds of survival decrease by 1 - 0.9637 = 0.0363, or about 3.63%.
A multiple linear regression would not be a good choice of model here. This is because there are only two outcomes, survival or death (modeled as 1 = survival and 0 = death); the outcome is a categorical variable.
Logistic models are used to estimate the odds of an event occurring (here, survival). They assume a linear relationship between the predictors and the log-odds of the outcome, which makes the relationship between the predictors and the probability itself nonlinear (the S-curve plotted below).
A multiple linear regression would be a good choice for a continuous dependent variable. That model assumes a linear relationship between the predictors and the outcome, and it is well suited to estimating the strength of such relationships.
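To make the contrast concrete, here is a minimal sketch (not part of the original analysis; lpm is just an illustrative name) that fits an ordinary linear model to the same binary outcome. Unlike the logistic model, its fitted values are not constrained to the [0, 1] probability range:
# Linear probability model on the same predictors, for comparison only
lpm <- lm(Survived ~ Pclass + Sex_binary + Age, data = train)
# Fitted values from lm() are not guaranteed to stay within [0, 1]
range(fitted(lpm))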
# Generate predicted probabilities from the fitted model
predicted_probs <- predict(object = model, type = 'response')
# Convert probabilities to binary outcomes
threshold <- .5
binary_predictions <- ifelse(test = predicted_probs >= threshold,
                             yes = 1,
                             no = 0)
#Attach binary predictions to the original dataset
train$binary_predictions <- binary_predictions
summary(train$binary_predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3852 1.0000 1.0000
table(train$binary_predictions)
##
## 0 1
## 439 275
#confusion matrix
conf_matrix <- table(train$Survived, train$binary_predictions,
                     dnn = c("Actual", "Predicted"))
conf_matrix
## Predicted
## Actual 0 1
## 0 356 68
## 1 83 207
true_positives <- conf_matrix[2,2]
false_positives <- conf_matrix[1,2]
true_negatives <- conf_matrix[1,1]
false_negatives <- conf_matrix[2,1]
#calculate sensitivity and specificity
sensitivity <- true_positives/(true_positives + false_negatives)
specificity <- true_negatives/(true_negatives + false_positives)
#print
cat("Sensitivity (True Positive Rate):", sensitivity, "\n")
## Sensitivity (True Positive Rate): 0.7137931
cat("Specificity (True Negative Rate):", specificity, "\n")
## Specificity (True Negative Rate): 0.8396226
library(ggplot2)
# Function to calculate the logistic (S-shaped) curve
logistic_curve <- function(x) {
return(1 / (1 + exp(-x)))
}
# Generate a sequence of values for the x-axis
x_values <- seq(min(predict(model, type = "link")),
                max(predict(model, type = "link")),
                length.out = 100)
# Calculate the predicted probabilities using the logistic curve function
predicted_probabilities_curve <- logistic_curve(x_values)
# Plot the logistic curve and predicted probabilities
ggplot() +
  geom_line(data = data.frame(x = x_values,
                              y = predicted_probabilities_curve),
            aes(x = x, y = y),
            color = "blue", linewidth = 1) +
  geom_point(data = train,
             mapping = aes(x = predict(model, type = "link"),
                           y = predicted_probs,
                           color = factor(binary_predictions)),
             alpha = 0.7, size = 2) +
  scale_color_manual(values = c("0" = "blue", "1" = "red")) +
  labs(title = "S-Curve and Predicted Probabilities",
       x = "Predicted Log-Odds",
       y = "Predicted Probabilities") +
  theme_minimal()
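As a quick consistency check (a small sketch using only objects already created above), predict() on the response scale should match the hand-rolled logistic_curve() applied to the link-scale predictions; base R's plogis() implements the same transform:
# The response-scale predictions are the logistic transform of the link scale
all.equal(predicted_probs, logistic_curve(predict(model, type = "link")))
# plogis() is base R's built-in version of the same S-curve
all.equal(logistic_curve(x_values), plogis(x_values))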
Coming into this class I had taken a number of stats classes, so I was fairly familiar with some of the statistical concepts (though I learned about several of them in much more depth, such as how certain equations are derived).
However, I had zero coding experience coming in. I remember stressing the night the second discussion was due because I could not figure out how to plot a dot plot for the life of me, and when I finally figured it out I could not get a trend line. Fast forward about 14 weeks and I am lifetimes better at R. I feel very confident in a lot of the fundamental packages such as dplyr, ggplot2, and the broader tidyverse, to name a few. I am able to create a much neater R Markdown file.
I feel very confident in basic data cleaning and manipulation, and I tried to improve on those skills in many of my later discussion posts.
I feel as if I grew more patient with coding. At the beginning I was trying to rush through it, and I often got frustrated while debugging. Now I enjoy solving problems with my code, and I find it very rewarding when something works. I am much better at reading code and function definitions, and I believe I am prepared for the next steps in the data analytics program. Four months ago, if you had told me about the improvement I would experience, I would not have believed you.
I tried to push myself in the Discussions, making many of my posts sports related (as that is my dream industry to perform some data analytics in). Doing this made a lot of the assignments fun and interesting for me, and I encourage you all to take the extra time to work on datasets that interest you! It really does pay off!
I truly thank you, Professor Sharma, for leading the way and being an amazing teacher. I hope our paths cross again! You have unlocked a lot of potential I was not aware I had, and you were nothing short of amazing all semester long. And to all my classmates who helped me improve week to week through their discussion posts and feedback on my code: I very much appreciate you all and wish you nothing but the best going forward!
It was very nice to see a core group of the class get closer as the semester went on. I feel I have a solid group of peers that I hope to stay in touch with in future classes and beyond BC.
This has been one of the most rewarding classes I have ever taken, and I was very happy to be a part of it.
-R