library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# For the logistic model we will use the Titanic dataset and predict survival
# Import the dataset


library(readr)
train <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean the data: drop rows with missing Age, since Age is a predictor in the model
train <- train[!is.na(train$Age), ]

# Recode Sex as 0/1: males will be 0 and females will be 1
train <- train %>%
  mutate(Sex_binary = ifelse(Sex == "male", 0, 1))

head(train)
## # A tibble: 6 × 13
##   PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
## 2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
## 3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
## 4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
## 5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
## 6           7        0      1 McCart… male     54     0     0 17463  51.9  E46  
## # ℹ 2 more variables: Embarked <chr>, Sex_binary <dbl>
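
As a quick sanity check, cross-tabulating the original and recoded columns should show every male as 0 and every female as 1:

# Verify the recode: males should all map to 0, females all to 1
table(train$Sex, train$Sex_binary)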

Create the Model

model <- glm(formula = Survived ~ Pclass + Sex_binary + Age, family = binomial(link = 'logit'), data = train)

summary(model)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex_binary + Age, family = binomial(link = "logit"), 
##     data = train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.533875   0.456247   5.554 2.80e-08 ***
## Pclass      -1.288545   0.139259  -9.253  < 2e-16 ***
## Sex_binary   2.522131   0.207283  12.168  < 2e-16 ***
## Age         -0.036929   0.007628  -4.841 1.29e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 647.29  on 710  degrees of freedom
## AIC: 655.29
## 
## Number of Fisher Scoring iterations: 5
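
One quick way to summarize fit from the deviances reported above is McFadden's pseudo R-squared, 1 - (residual deviance / null deviance); a minimal sketch using the fitted model object:

# McFadden's pseudo R^2 from the deviances in summary(model)
1 - model$deviance / model$null.deviance
# approximately 1 - 647.29/964.52 = 0.33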

Interpretation

#exp() calculates e^x

Each logistic regression coefficient is the expected change in the log-odds of the outcome for a one-unit change in the corresponding predictor X.
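
For this model, that relationship is log(p / (1 - p)) = 2.534 - 1.289*Pclass + 2.522*Sex_binary - 0.037*Age, where p is the probability of survival. Rather than exponentiating each coefficient by hand below, all of the odds ratios can also be pulled at once:

# Convert every coefficient from the log-odds scale to an odds ratio
exp(coef(model))
# approximately: (Intercept) 12.60, Pclass 0.276, Sex_binary 12.46, Age 0.964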

Pclass: The Pclass variable is defined as follows: 1 = first class, 2 = second class, 3 = third class.

Coefficient: -1.288545

exp(-1.288545)
## [1] 0.2756716

Since the coefficient is negative, there is an inverse relationship between Pclass and survival: as Pclass increases (which generally means poorer passengers), the odds of survival decrease. This is expected based on the historical record.

A one-unit increase in Pclass corresponds to a decrease of about 1.29 in the log-odds of survival.

Exponentiating the coefficient yields 0.2757, meaning that for every one-unit increase in Pclass, the odds of survival decrease by about 72.43% (1 - 0.2757 = 0.7243).

Sex_binary: male = 0, female = 1.

Coefficient = 2.522.

The positive value means there is a positive relationship between sex (as coded here) and survival: being a woman would have increased your survival chances.

exp(2.522131)
## [1] 12.45511

For an increase of 1 unit in Sex_binary (male to female), the log-odds of survival go up by 2.522.

For an increase of 1 unit in Sex_binary, the odds of survival are about 12.46 times higher, an increase of roughly 1,145% ((12.455 - 1) x 100)!

Age: The age of the passengers.

Coefficient: -0.036929. The negative sign indicates an inverse relationship between age and the odds of survival: as age goes up, the odds of survival decrease.

exp(-.036929)
## [1] 0.9637446

For a one-year increase in age, the log-odds of survival decrease by 0.036929.

For a one-year increase in age, the odds of survival decrease by about 3.63% (1 - 0.9637 = 0.0363).
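
To see how these effects combine, the fitted model can score hypothetical passengers; the two passengers below are made up for illustration and differ only in sex:

# Two hypothetical 30-year-old passengers in 2nd class, male vs. female
new_passengers <- data.frame(Pclass = c(2, 2),
                             Sex_binary = c(0, 1),
                             Age = c(30, 30))
predict(model, newdata = new_passengers, type = "response")
# approximately 0.24 for the male vs. 0.80 for the female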

Why not a Linear Regression?

A linear regression would not be a good choice of model here because there are only two outcomes, survival or death (coded 1 = survival and 0 = death). The outcome is a categorical variable, not a continuous one.

Logistic models are used to estimate the odds of an event occurring (survival), and they assume a linear relationship between the predictors and the log-odds of the outcome, which implies a nonlinear, S-shaped relationship between the predictors and the probability itself.

A multiple linear regression would be a good choice for a continuous dependent variable. That model assumes a linear relationship between the predictors and the outcome and is good at estimating such relationships, but applied to a binary outcome it can produce fitted values below 0 or above 1, as the short check below illustrates.
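
A minimal sketch of that problem, fitting an ordinary least-squares model to the same 0/1 outcome:

# A linear probability model on the binary outcome, for comparison only
lpm <- lm(Survived ~ Pclass + Sex_binary + Age, data = train)
range(fitted(lpm))  # fitted values can fall outside the [0, 1] range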

Confusion Matrix

# Generate predicted probabilities from the fitted model
predicted_probs <- predict(object = model, type = 'response')
# Convert probabilities to binary outcomes using the threshold below
threshold <- .5

binary_predictions <- ifelse(test = predicted_probs >= threshold,
                             yes = 1,
                             no  = 0)

#Attach binary predictions to the original dataset
train$binary_predictions <- binary_predictions

summary(train$binary_predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3852  1.0000  1.0000
table(train$binary_predictions)
## 
##   0   1 
## 439 275
# Confusion matrix of actual vs. predicted outcomes
conf_matrix <- table(train$Survived, train$binary_predictions,
                     dnn = c("Actual", "Predicted"))

conf_matrix
##       Predicted
## Actual   0   1
##      0 356  68
##      1  83 207
true_positives <- conf_matrix[2,2]
false_positives <- conf_matrix[1,2]
true_negatives <- conf_matrix[1,1]
false_negatives <- conf_matrix[2,1]


#calculate sensitivity and specificity
sensitivity <- true_positives/(true_positives + false_negatives)

specificity <- true_negatives/(true_negatives + false_positives)

#print

cat("Sensitivity (True Positive Rate):", sensitivity, "\n")
## Sensitivity (True Positive Rate): 0.7137931
cat("Specificity (True Negative Rate):", specificity, "\n")
## Specificity (True Negative Rate): 0.8396226
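
The same four cells also give overall accuracy and precision, as a small extension of the code above:

# Accuracy: share of all predictions that are correct
accuracy <- (true_positives + true_negatives) / sum(conf_matrix)
# Precision: share of predicted survivors who actually survived
precision <- true_positives / (true_positives + false_positives)
cat("Accuracy:", accuracy, "\n")   # approximately 0.789
cat("Precision:", precision, "\n") # approximately 0.753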

Sigmoid Curve

library(ggplot2)
# Function to calculate the logistic (S-shaped) curve
logistic_curve <- function(x) {
  return(1 / (1 + exp(-x)))
}

# Generate a sequence of values for the x-axis
x_values <- seq(min(predict(model, type = "link")), 
                max(predict(model, type = "link")), 
                length.out = 100)

# Calculate the predicted probabilities using the logistic curve function
predicted_probabilities_curve <- logistic_curve(x_values)


# Plot the logistic curve and predicted probabilities
ggplot() +
  geom_line(data = data.frame(x = x_values, 
                              y = predicted_probabilities_curve),
            aes(x = x,
                y = y), color = "blue", linewidth = 1) +
  geom_point(data = train, mapping = aes(x = predict(model, type = "link"), 
                               y = predicted_probs, 
                               color = factor(binary_predictions)
                             ),
             alpha = 0.7, size = 2) +
  scale_color_manual(values = c("0" = "blue", "1" = "red")) +
  labs(title = "S-Curve and Predicted Probabilities",
       x = "Predicted Log-Odds",
       y = "Predicted Probabilities") +
  theme_minimal()
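
As an aside, base R ships this same sigmoid as plogis(), which can serve as a check on the hand-rolled helper:

# plogis(x) is 1 / (1 + exp(-x)), the same logistic function defined above
all.equal(logistic_curve(2), plogis(2))  # TRUE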

Reflection:

Coming into this class, I had taken a number of stats classes, so I was fairly familiar with some of the statistical concepts (though I learned about some of them in much more depth, such as how certain equations are derived).

However, I had zero coding experience coming in. I remember stressing the night the second discussion was due because I could not figure out how to plot a dot plot for the life of me, and when I finally figured it out I could not get a trend line. Fast forward about 14 weeks and I am lifetimes better at R. I feel very confident with many of the fundamental packages, such as dplyr, ggplot2, and the tidyverse, to name a few, and I am able to create a much neater R Markdown file.

I feel very confident in basic data cleaning and manipulation, and I tried to improve those skills in many of my later discussion posts.

I also feel I developed patience for coding. At the beginning I tried to rush through it and often got frustrated while debugging. Now I enjoy solving problems with my code, and I find it very rewarding when something works. I am much better at reading code and function definitions, and I believe I am prepared for the next steps in the data analytics program. If you had told me four months ago how much I would improve, I would not have believed you.

I tried to push myself in the discussions, making many of my posts sports-related (as that is my dream industry for doing data analytics). Doing this made a lot of the assignments fun and interesting for me, and I encourage you all to take the extra time to work on datasets that interest you! It really does pay off!

I truly thank you, Professor Sharma, for leading the way and being an amazing teacher. I hope our paths cross again! You have unlocked a lot of potential I was not aware I had, and you were nothing short of amazing all semester long. And to all my classmates who helped me improve week to week through their discussion posts and feedback on my code: I very much appreciate you all and wish you nothing but the best going forward!

It was very nice to see a core group of the class get closer as the semester went on. I feel I have a solid group of peers that I hope to stay in touch with in future classes and beyond BC.

This has been one of the most rewarding classes I have ever taken, and I was very happy to be a part of it.

-R