Discussion 7D

# packages
library(datasets)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

# Input data
data("Titanic")
titanic_df <- as.data.frame(Titanic)
str(titanic_df)

## 'data.frame':    32 obs. of  5 variables:
##  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...

Q1. Logistic regression

Equation

\[ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 \times \text{Class} + \beta_2 \times \text{Sex} + \beta_3 \times \text{Age} + \beta_4 \times \text{Freq} \]

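Where:

- $\log\left(\frac{p}{1 - p}\right)$ is the log-odds of survival.
- $p$ is the probability of survival.
- $\beta_0$ is the intercept of the model.
- $\beta_1, \beta_2, \beta_3, \beta_4$ are the coefficients for Class, Sex, Age and Freq respectively. Because Class is a four-level factor, R estimates $\beta_1$ as three separate dummy coefficients (2nd, 3rd, Crew), each measured relative to 1st class.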
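To see how the factor terms in the equation turn into individual coefficients, the design matrix can be inspected directly. This is a minimal sketch using base R's `model.matrix()`; the object name `design` is only illustrative and is not part of the original analysis.

```r
library(datasets)

# Rebuild the data frame used in this discussion
titanic_df <- as.data.frame(Titanic)

# Expand the right-hand side of the model formula into its numeric columns.
# Each factor becomes 0/1 dummy columns, with the first level as the reference.
design <- model.matrix(~ Class + Sex + Age + Freq, data = titanic_df)

# One column per estimated coefficient
colnames(design)

# First few rows, to show the 0/1 coding next to the numeric Freq column
head(design, 4)
```

The column names correspond to the rows of the regression table: the reference levels (1st class, Male, Child) get no column of their own and are absorbed into the intercept.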

Model

logit_model <- glm(Survived ~ Class + Sex + Age + Freq, 
                   data = titanic_df, 
                   family = binomial(link = "logit"))

stargazer(logit_model, type = "text")

## 
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                            Survived          
## ---------------------------------------------
## Class2nd                    -0.026           
##                             (1.010)          
##                                              
## Class3rd                     0.249           
##                             (1.037)          
##                                              
## ClassCrew                    0.271           
##                             (1.045)          
##                                              
## SexFemale                   -0.368           
##                             (0.769)          
##                                              
## AgeAdult                     0.618           
##                             (0.869)          
##                                              
## Freq                        -0.005           
##                             (0.005)          
##                                              
## Constant                     0.098           
##                             (0.877)          
##                                              
## ---------------------------------------------
## Observations                  32             
## Log Likelihood              -21.253          
## Akaike Inf. Crit.           56.506           
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01
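The table reports coefficients on the log-odds scale with standard errors in parentheses. As a rough sketch of how to read it, the coefficients can be exponentiated into odds ratios, and each coefficient can be divided by its standard error to get a Wald z statistic. The names `coef_table`, `odds_ratios`, `z_values` and `p_values` are illustrative, and the model is refitted here only so that the block runs on its own.

```r
library(datasets)

titanic_df <- as.data.frame(Titanic)

# Same specification as the model above
logit_model <- glm(Survived ~ Class + Sex + Age + Freq,
                   data = titanic_df,
                   family = binomial(link = "logit"))

# Coefficient matrix: Estimate, Std. Error, z value, Pr(>|z|)
coef_table <- summary(logit_model)$coefficients

# Odds ratios: exp() of the log-odds coefficients
odds_ratios <- exp(coef(logit_model))

# Wald z statistic = estimate / standard error, with a two-sided p-value
z_values <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]
p_values <- 2 * pnorm(-abs(z_values))

round(cbind(odds_ratio = odds_ratios, z = z_values, p = p_values), 3)
```

A coefficient whose absolute z value is well below about 1.64 does not reach even the 10% level, which is what the missing asterisks in the table indicate.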

Interpretation

Class2nd, Class3rd, ClassCrew: The coefficients are -0.026, 0.249, and 0.271, respectively. Holding other variables constant, the estimated log odds of survival are marginally lower for 2nd class than for 1st class (the coefficient is very close to zero) and higher for 3rd class and crew members than for 1st class. However, the standard errors (1.010, 1.037, and 1.045) are several times larger than the coefficients, so none of these differences is statistically distinguishable from zero.
SexFemale: The coefficient of -0.368 suggests that, holding other factors constant, the log odds of surviving are lower for females than for males. The standard error (0.769) is about twice the size of the coefficient, so this estimate is not statistically significant either.
AgeAdult: The positive coefficient of 0.618 suggests that adults are more likely to survive than children, controlling for other factors. However, the standard error (0.869) is larger than the coefficient itself, so the estimate is unreliable.
Freq: The coefficient is very small (-0.005), with a standard error of the same magnitude (0.005). In this dataset Freq is the number of passengers in each Class/Sex/Age/Survived combination, so it is a cell count rather than a characteristic of a passenger. Used as a predictor, it has a minimal estimated effect on the log odds of survival. Because the model treats each of the 32 rows as a single observation instead of weighting by Freq, the estimates describe table cells rather than individual passengers.
Constant: The intercept of 0.098 (standard error 0.877) represents the log odds of survival when every predictor is at its reference level (1st class, male, child) and Freq is zero.
None of the coefficients is statistically significant at the 10% level or below, as indicated by the absence of asterisks next to the coefficients.
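Since Freq in `as.data.frame(Titanic)` counts the passengers in each Class/Sex/Age/Survived cell, one alternative specification is to pass it to `glm()` as a weight instead of using it as a predictor. The sketch below is an assumption about how the analysis could be set up, not the model reported above, and `weighted_model` is an illustrative name. No output is shown because this model was not run as part of this discussion.

```r
library(datasets)

titanic_df <- as.data.frame(Titanic)

# Alternative specification (an assumption, not the model reported above):
# each row counts Freq times, so the fit reflects passengers, not table cells.
# Rows with Freq == 0 carry zero weight and do not contribute to the fit.
weighted_model <- glm(Survived ~ Class + Sex + Age,
                      data = titanic_df,
                      weights = Freq,
                      family = binomial(link = "logit"))

# Coefficients are still log odds relative to 1st class, Male, Child
summary(weighted_model)
```

With this setup the effective sample size is the total passenger count rather than 32, so the standard errors would generally be much smaller than in the unweighted fit.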

Not running a multivariate regression:

A multivariate (linear) regression was not used because the dependent variable (Survived) is categorical. Linear regression is better suited to continuous outcomes and can produce predicted values outside the 0 to 1 range, so it does not capture a binary outcome well without modification.
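The contrast between the two approaches can be sketched by fitting both to the same data and comparing the range of fitted values. This example makes two assumptions that are not part of the original analysis: it recodes Survived as a 0/1 variable (`survived_num`, a hypothetical column) and weights rows by Freq. The names `linear_fit` and `logit_fit` are illustrative.

```r
library(datasets)

titanic_df <- as.data.frame(Titanic)

# Hypothetical 0/1 recoding of the outcome so that lm() can be applied
titanic_df$survived_num <- as.numeric(titanic_df$Survived == "Yes")

# Linear probability model: nothing constrains its fitted values to [0, 1]
linear_fit <- lm(survived_num ~ Class + Sex + Age,
                 data = titanic_df, weights = Freq)

# Logistic model: fitted values are probabilities by construction
logit_fit <- glm(Survived ~ Class + Sex + Age,
                 data = titanic_df, weights = Freq,
                 family = binomial(link = "logit"))

# Compare the ranges of the fitted values from the two models
range(fitted(linear_fit))
range(fitted(logit_fit))
```

The logit link maps any value of the linear predictor into the interval between 0 and 1, which is why the logistic fit always returns valid probabilities while the linear fit has no such guarantee.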

Q2. Reflection

I’ve really come a long way in my data analysis skills, especially since diving into R. It’s amazing how quickly I’ve picked up various statistical distributions and their applications, which has been super helpful in understanding different kinds of data. Calculating probabilities and getting to grips with the Central Limit Theorem have also been eye-openers, showing me just how much you can predict and infer from a given dataset.

On top of that, learning about simple regression has really rounded out my skills, letting me see the relationships between variables and how one can influence another. It’s been quite practical, too, because I’ve been using R to apply these theories hands-on. Whether it’s manipulating data, visualizing it, or running different analyses, I’m getting a lot of practice. I’m excited to keep building on this foundation, exploring more complex models and techniques that R has to offer.