Welcome to the Final! Please read the instructions below before starting the exam.

Good luck!

1 Question 1: Distribution of an individual score vs Sample Mean (CLT)

IQ scores are known to closely follow a Gaussian distribution with a mean of 100 and a population standard deviation of 16 (Stanford-Binet).

1.1 (4 points)

A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.

mu <- 100
sigma <- 16
x <- 110

pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)
## [1] 0.2659855

Probability statement: \(P(X \ge 110)\). The probability that this person will have an IQ of 110 or higher is 26.60%.

1.2 (4 points)

What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with \(\mu=100\) and \(\sigma=16\)? Be sure to write the probability statement and show your R code.

mu <- 100
sigma <- 16
x <- 110
n <- 12
standard_error <- sigma / sqrt(n)

pnorm(x, mean = mu, sd = standard_error, lower.tail = FALSE)
## [1] 0.01519141

Probability statement: \(P(\bar{X} \ge 110)\), where \(\bar{X}\) has standard error \(\sigma/\sqrt{n} = 16/\sqrt{12} \approx 4.62\). The probability that the mean IQ of the 12-person jury would be 110 or above, if drawn from a normal population with \(\mu = 100\) and \(\sigma = 16\), is 1.52%.
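
As a sanity check on the CLT calculation, a quick Monte Carlo simulation (a minimal sketch; the seed and 100,000 draws are arbitrary choices) should land near 1.5%:

set.seed(1)
jury_means <- replicate(1e5, mean(rnorm(12, mean = 100, sd = 16)))
mean(jury_means >= 110)  # should come out close to 0.0152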

2 Question 2: Hypothesis testing - Proportion

Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.

2.1 (6 points)

A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it, and uses the point estimate from this survey as evidence. Conduct a hypothesis test with \(\alpha=.05\) to determine if these data provide strong evidence that more than 50% of adult Americans do not go to college for financial reasons. What are the critical value, test statistic, and p-value for the proportion of Americans who do not go to college for financial reasons? Do you reject or fail to reject the null?

alpha <- .05
n <- 331 
pi <- .5
p_hat <- .48

standard_error <- sqrt((pi * (1 - pi)) / n)

z_score <- (p_hat - pi) / standard_error
z_score
## [1] -0.7277362
critical_value <- qnorm(1 - alpha)  # right-tailed test, since H_a: pi > .5
print(critical_value)
## [1] 1.644854
if(z_score > critical_value) {
  print('Reject the null hypothesis')
} else{
  print("Fail to reject the null hypothesis")
}
## [1] "Fail to reject the null hypothesis"

With \(z = -0.73\) far below the critical value of 1.64, and a p-value of roughly 0.77, we fail to reject the null hypothesis: these data do not provide strong evidence that more than 50% of adult Americans skip college for financial reasons.

2.2 (2 points)

Would you expect a confidence interval (\(\alpha=.05\), double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.

standard_error_two <- sqrt((p_hat * (1 - p_hat)) / n)
standard_error_two
## [1] 0.02746049
critical_value_two <- qnorm(1 - alpha/2)
print(critical_value_two)
## [1] 1.959964
confidence_interval_upper <- p_hat + critical_value_two * standard_error_two
confidence_interval_upper
## [1] 0.5338216
confidence_interval_lower <- p_hat - critical_value_two * standard_error_two
confidence_interval_lower
## [1] 0.4261784

Yes. The hypothesized value of .5 lies inside the 95% confidence interval (42.62%, 53.38%), which is consistent with the two-sided test failing to reject \(H_0: \pi = .5\).

3 Question 3: Hypothesis testing - Difference of two means

3.1 (6 points)

You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.

Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different? Assume \(\alpha = 5\%\).

n <- 22
alpha <- .05
sigma <- 1.8
mu_control <- 6.1
mu_treatment <- 4.9

standard_error <- sqrt((sigma^2 / n) + (sigma^2 / n))

t_score <- (mu_treatment - mu_control) / standard_error
print(t_score)
## [1] -2.211083
df <- n - 1  # conservative df: min(n1, n2) - 1 = 21
critical_value <- qt(alpha/2, df)  # t critical value, not normal
print(critical_value)
## [1] -2.079614
p_value <- 2 * pt(t_score, df, lower.tail = TRUE)
p_value
## [1] 0.03825675

We test \(H_0: \mu_T = \mu_C\) against \(H_a: \mu_T \ne \mu_C\). Since the p-value (.038) is less than .05 (equivalently, \(|t| = 2.21\) exceeds the critical value of 2.08), we reject the null hypothesis: the data provide strong evidence that the average number of food items recalled differs between the treatment and control groups.
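
The same test can be packaged as a small helper for situations where only summary statistics are available (a minimal sketch; it uses the pooled df of \(n_1 + n_2 - 2 = 42\), which yields a slightly smaller p-value than the conservative df above, with the same conclusion):

t_test_from_summary <- function(xbar1, xbar2, s1, s2, n1, n2, alpha = .05) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)  # standard error of the difference in means
  t  <- (xbar1 - xbar2) / se
  df <- n1 + n2 - 2                  # pooled df (group sds are equal here)
  p  <- 2 * pt(-abs(t), df)          # two-sided p-value
  list(t = t, df = df, p_value = p, reject = p < alpha)
}
t_test_from_summary(4.9, 6.1, 1.8, 1.8, 22, 22)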

4 Question 4: Working backwards

4.1 (6 points)

Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.

n <- 25
confidence_interval_lower <- 65
confidence_interval_upper <- 77
alpha <- 1 - .9

sample_mean <- (confidence_interval_lower + confidence_interval_upper) / 2
sample_mean
## [1] 71
margin_of_error <- (confidence_interval_upper - confidence_interval_lower) / 2
margin_of_error
## [1] 6
critical_value <- qt(1 - alpha/2, df = n - 1)  # t critical value, since the population sd is unknown
print(critical_value)
## [1] 1.710882
standard_error <- margin_of_error / critical_value
standard_error
## [1] 3.506963
pop_standard_deviation <- standard_error * sqrt(n)
pop_standard_deviation
## [1] 17.53482

The sample mean is 71, the margin of error is 6, the standard error is 3.51, and the estimated population standard deviation is 17.53. Because the population standard deviation is unknown and n = 25, the critical value comes from the t-distribution with 24 degrees of freedom (1.711), not from the normal distribution (1.645).
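
A quick consistency check: rebuilding the interval from the recovered pieces should reproduce (65, 77).

sample_mean + c(-1, 1) * critical_value * standard_error  # returns 65 and 77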

5 Question 5: Assumptions

5.1 (4 points)

The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -

mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
##       control treatment
## alive       4        24
## dead       30        45

Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?

We cannot use the normal approximation because the success-failure condition fails: a normal approximation for a difference in proportions requires at least about 10 successes and 10 failures in each group, but the control group has only 4 survivors. With so few successes, the sampling distribution of the difference in proportions is skewed rather than approximately normal. A normal-based confidence interval would therefore have the wrong coverage (it could even extend beyond the range of possible proportions), potentially leading to erroneous conclusions about the efficacy of these transplants.
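
A minimal check of the success-failure condition, using the counts from the table above (the usual rule of thumb requires at least 10 successes and 10 failures in each group):

successes <- c(control = 4, treatment = 24)
failures  <- c(control = 30, treatment = 45)
successes >= 10 & failures >= 10  # control fails: only 4 survivors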

6 Question 6: Hypothesis testing in a regression

6.1 (6 points)

Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.

\[ Wins_i = \beta_0 + \beta_1 Assists_i + \epsilon_i \]

\[ Null\ hypothesis: H_0: \beta_1 = 0 \] \[ Alternative\ hypothesis: H_a: \beta_1 \ne 0 \] For this assignment I am using the hoopR data set, which contains various past NBA statistics. I am trying to find the relationship between a team’s number of assists and its number of team wins. The null hypothesis is that the coefficient on assists (\(\beta_1\)) is zero, meaning that the number of assists a team records has no impact on the number of games it wins. The alternative hypothesis is that assists do have an impact on wins; since the function below computes a two-sided p-value, the test is two-sided.

lm_hypothesis_test <- function(lin_regression_model, coef_name, null_value = 0, alpha = .05){
  # Pull the estimated coefficient and its standard error from the fitted "lm" object
  coef_estimation <- coef(lin_regression_model)[coef_name]
  standard_error <- summary(lin_regression_model)$coefficients[coef_name, "Std. Error"]

  # t statistic for H_0: beta = null_value, with the model's residual degrees of freedom
  t_score <- (coef_estimation - null_value) / standard_error
  df <- lin_regression_model$df.residual
  p_value <- 2 * pt(-abs(t_score), df)  # two-sided p-value

  conclusion <- ifelse(p_value < alpha, 'Reject H0', 'Fail to reject H0')

  return(list(
    t_score = t_score,
    p_value = p_value,
    conclusion = conclusion
  ))
}
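
A quick way to validate the function before using it (a sketch on R's built-in mtcars data): with null_value = 0, it should reproduce the t value and two-sided p-value from the summary() coefficient table.

fit <- lm(mpg ~ wt, data = mtcars)
lm_hypothesis_test(fit, "wt")
summary(fit)$coefficients["wt", c("t value", "Pr(>|t|)")]  # should match t_score and p_value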

6.2 (10 points)

Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.

library(hoopR)  # provides load_nba_team_box()

nba_stats <- load_nba_team_box(seasons = 2023)
model <- lm(team_winner ~ assists, data = nba_stats)
beta_hat <- coef(model)['assists']
se_beta <- summary(model)$coefficients['assists', 'Std. Error']
beta_hat
##    assists 
## 0.03011146
se_beta
## [1] 0.001780496
df <- model$df.residual
manual_tscore <- beta_hat / se_beta
p_value <- 2 * pt(-abs(manual_tscore), df)
manual_tscore
##  assists 
## 16.91184
p_value
##     assists 
## 5.53805e-61
alpha <- .05
if(p_value < alpha) {
  conclude <- 'Reject H0: Assists impact a teams total wins'
} else{
  conclude <- 'Fail to reject H0: Assists do not impact a teams total wins'
}
   conclude
## [1] "Reject H0: Assists impact a teams total wins"
nba_stats <- load_nba_team_box(seasons = 2023)  # re-downloaded here, so estimates differ slightly from the chunk above
lin_regression_model <- lm(team_winner ~ assists, data = nba_stats)

lm_hypothesis_test(lin_regression_model, "assists", null_value = 0, alpha = .05)
## $t_score
##  assists 
## 15.43826 
## 
## $p_value
##      assists 
## 1.547036e-51 
## 
## $conclusion
##     assists 
## "Reject H0"

As you can see, the function carries out exactly the manual procedure, returning the t-score, p-value, and conclusion. The magnitudes differ slightly from the manual run only because the data set was re-downloaded between the two chunks; both runs reject \(H_0\) decisively, with p-values far below .05.

7 Question 7: Determinants of smoking

Import the smoking.csv file into R. The variables are “smoker” - whether or not one smokes; “smkban” - whether or not one faces a smoking ban at one’s workplace (this is fairly old data); “age” - in years; “hsdrop” - high school dropout; “hsgrad” - graduated from high school but has no further education; “colsome” - has some college education; “colgrad” - graduated from college, as well as some self-explanatory demographic variables.

library(readxl)
getwd()
## [1] "C:/Users/mickp/Downloads"
smoking_data <- read_excel('smokingtwo.xlsx')

7.1 (4 points)

Estimate a logit regression model of whether or not one smokes as a function of the other variables.

logit_regression <- glm(smoker ~ smkban + age + hsdrop + hsgrad + colsome + colgrad +
                          hispanic + female + black,
                        data = smoking_data, family = binomial(link = 'logit'))
summary(logit_regression)
## 
## Call:
## glm(formula = smoker ~ smkban + age + hsdrop + hsgrad + colsome + 
##     colgrad + hispanic + female + black, family = binomial(link = "logit"), 
##     data = smoking_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.696765   0.138696 -12.234  < 2e-16 ***
## smkban      -0.250735   0.049164  -5.100 3.40e-07 ***
## age         -0.007452   0.001987  -3.751 0.000176 ***
## hsdrop       1.931075   0.131261  14.712  < 2e-16 ***
## hsgrad       1.523305   0.114308  13.326  < 2e-16 ***
## colsome      1.180080   0.116518  10.128  < 2e-16 ***
## colgrad      0.424753   0.125801   3.376 0.000734 ***
## hispanic    -0.584845   0.083085  -7.039 1.93e-12 ***
## female      -0.188720   0.049105  -3.843 0.000121 ***
## black       -0.149472   0.089994  -1.661 0.096732 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11074  on 9999  degrees of freedom
## Residual deviance: 10502  on 9990  degrees of freedom
## AIC: 10522
## 
## Number of Fisher Scoring iterations: 4

7.2 (4 points)

What does the coefficient on “smkban” tell you? How about the coefficient on “age”? How about the coefficient on “colgrad”?

The coefficient on ‘smkban’ is -0.25: facing a smoking ban at one’s workplace is associated with a 0.25 decrease in the log-odds of smoking, holding the other variables constant. The coefficient on ‘age’ is -0.0075; although small, it is statistically significant (p < .001), indicating that each additional year of age is associated with a slight decrease in the log-odds of smoking. The coefficient on ‘colgrad’ is 0.42: relative to the omitted education category (those with education beyond a four-year degree), graduating from college is associated with 0.42 higher log-odds of smoking.
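
Because logit coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret:

exp(coef(logit_regression)[c("smkban", "age", "colgrad")])
# e.g. exp(-0.25) is about 0.78: facing a workplace ban multiplies the odds of smoking by roughly 0.78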

8 Question 8: Difference-in-differences

Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.

nsw <- read_excel('nsw.xlsx')

8.1 (4 points)

Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?

\[ \Delta earnings_i = \beta_0 + \beta_1 treat_i + \epsilon_i \] \[ Null\ hypothesis: H_0: \beta_1 = 0 \] \[ Alternative\ hypothesis: H_a: \beta_1 \ne 0 \] Here \(\beta_1\) represents the difference in the average change in real earnings between the treatment group and the control group, and \(\beta_0\) represents the average change in real earnings for the control group. (The regression reports a two-sided p-value, so the test is two-sided.)

real_earnings <-  nsw$re78 - nsw$re75
earnings_model <- lm(real_earnings ~ treat, data = nsw)
summary(earnings_model)
## 
## Call:
## lm(formula = real_earnings ~ treat, data = nsw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37995  -3198  -1442   3677  56114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2063.4      359.3   5.743 1.37e-08 ***
## treat          846.9      560.1   1.512    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7406 on 720 degrees of freedom
## Multiple R-squared:  0.003165,   Adjusted R-squared:  0.00178 
## F-statistic: 2.286 on 1 and 720 DF,  p-value: 0.131

The model estimates that the earnings gain between 1975 and 1978 was, on average, $846.90 larger in the treatment group than in the control group. However, this effect is not statistically significant at the 10% significance level, since the p-value (.131) exceeds .1. There is therefore not sufficient evidence to conclude that the program increased earnings.

8.2 (4 points)

Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?

earnings_model_two <- lm(real_earnings ~ treat + age + education + black + hispanic + married + nodegree, data = nsw)
summary(earnings_model_two)
## 
## Call:
## lm(formula = real_earnings ~ treat + age + education + black + 
##     hispanic + married + nodegree, data = nsw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38153  -3383  -1076   3503  55863 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  1986.822   3123.319   0.636  0.52490   
## treat         869.294    561.559   1.548  0.12207   
## age             4.194     43.424   0.097  0.92308   
## education      26.633    215.306   0.124  0.90159   
## black         112.085    956.157   0.117  0.90672   
## hispanic     1229.152   1253.250   0.981  0.32704   
## married     -2210.949    767.530  -2.881  0.00409 **
## nodegree     -217.758    891.527  -0.244  0.80710   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7384 on 714 degrees of freedom
## Multiple R-squared:  0.01729,    Adjusted R-squared:  0.007654 
## F-statistic: 1.794 on 7 and 714 DF,  p-value: 0.08539

With the controls added, the estimated treatment effect is $869.29, very close to the univariate estimate of $846.89, which is what we would expect if treatment was randomly assigned. It remains statistically insignificant at the 10% significance level, since the p-value (0.12207) exceeds .1, so there is still not sufficient evidence to conclude that the program increased earnings.

8.3 (6 points)

Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?

treated_data <- nsw[nsw$treat == 1, ]
control_data <- nsw[nsw$treat == 0, ]

treated_mean <- mean(treated_data$re78) - mean(treated_data$re75)
control_mean <- mean(control_data$re78) - mean(control_data$re75)
difference_approach <- treated_mean - control_mean
difference_approach
## [1] 846.8883

The estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 is $846.89.
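
The same estimate can be recovered from a regression on a long-format data frame, which is the standard difference-in-differences setup (a sketch assuming the nsw columns used above; the interaction coefficient reproduces the $846.89):

nsw_long <- data.frame(
  earnings = c(nsw$re75, nsw$re78),
  post     = rep(c(0, 1), each = nrow(nsw)),  # 0 = 1975, 1 = 1978
  treat    = rep(nsw$treat, times = 2)
)
did_model <- lm(earnings ~ treat * post, data = nsw_long)
coef(did_model)["treat:post"]  # difference-in-differences estimate, ~846.89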

9 Question 9: Data Question

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Import the titanic train.csv file into R. You will have to clean the data to run some of the commands/set up the workflow.

mydata <- read_excel("C:/Users/mickp/Downloads/train.xlsx")
  1. (4 points) Run some preliminary correlations of Survived with some other variables.
cor(mydata$Survived, mydata$SibSp)
## [1] -0.0353225
cor(mydata$Survived, mydata$Fare)
## [1] 0.2573065
mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm = TRUE)  # impute missing ages with the median
cor(mydata$Survived, mydata$Age)
## [1] -0.06491042
mydata$Sex <- as.numeric(factor(mydata$Sex))  # female = 1, male = 2 (alphabetical factor levels)
cor(mydata$Survived, mydata$Sex)
## [1] -0.5433514
  2. (3 points) Conduct descriptive statistics of the dataset. Anything interesting you find?
library(psych)
describe(mydata)
##             vars   n   mean     sd median trimmed    mad  min    max  range
## PassengerId    1 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Survived       2 891   0.38   0.49   0.00    0.35   0.00 0.00   1.00   1.00
## Pclass         3 891   2.31   0.84   3.00    2.39   0.00 1.00   3.00   2.00
## Name*          4 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Sex            5 891   1.65   0.48   2.00    1.68   0.00 1.00   2.00   1.00
## Age            6 891  29.36  13.02  28.00   28.83   8.90 0.42  80.00  79.58
## SibSp          7 891   0.52   1.10   0.00    0.27   0.00 0.00   8.00   8.00
## Parch          8 891   0.38   0.81   0.00    0.18   0.00 0.00   6.00   6.00
## Ticket*        9 891 339.52 200.83 338.00  339.65 268.35 1.00 681.00 680.00
## Fare          10 891  32.20  49.69  14.45   21.38  10.24 0.00 512.33 512.33
## Cabin*        11 204  77.00  42.23  76.00   77.09  54.11 1.00 147.00 146.00
## Embarked*     12 889   2.54   0.79   3.00    2.67   0.00 1.00   3.00   2.00
##              skew kurtosis   se
## PassengerId  0.00    -1.20 8.62
## Survived     0.48    -1.77 0.02
## Pclass      -0.63    -1.28 0.03
## Name*        0.00    -1.20 8.62
## Sex         -0.62    -1.62 0.02
## Age          0.51     0.97 0.44
## SibSp        3.68    17.73 0.04
## Parch        2.74     9.69 0.03
## Ticket*      0.00    -1.28 6.73
## Fare         4.77    33.12 1.66
## Cabin*       0.00    -1.19 2.96
## Embarked*   -1.26    -0.23 0.03

The age data indicate that the passengers were fairly young on average (mean age 29.36), with substantial spread, as the standard deviation (13.02) and range (0.42 to 80) suggest. The mean and median fares are very different (32.20 vs 14.45): fare is heavily right-skewed (skew = 4.77), with a small number of very expensive tickets pulling the mean well above the median.

  3. (3 points) Use the set.seed(100) command, and create a subset of the train dataset that has only 500 observations.
library(dplyr)  # provides sample_n()
set.seed(100)
titanic_subset <- sample_n(mydata, 500)
summary(titanic_subset)
##   PassengerId       Survived         Pclass          Name          
##  Min.   :  1.0   Min.   :0.000   Min.   :1.000   Length:500        
##  1st Qu.:222.8   1st Qu.:0.000   1st Qu.:1.000   Class :character  
##  Median :428.5   Median :0.000   Median :3.000   Mode  :character  
##  Mean   :440.0   Mean   :0.378   Mean   :2.308                     
##  3rd Qu.:659.2   3rd Qu.:1.000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.000   Max.   :3.000                     
##       Sex             Age            SibSp         Parch      
##  Min.   :1.000   Min.   : 0.42   Min.   :0.0   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:22.00   1st Qu.:0.0   1st Qu.:0.000  
##  Median :2.000   Median :28.00   Median :0.0   Median :0.000  
##  Mean   :1.676   Mean   :29.36   Mean   :0.5   Mean   :0.366  
##  3rd Qu.:2.000   3rd Qu.:35.00   3rd Qu.:1.0   3rd Qu.:0.000  
##  Max.   :2.000   Max.   :71.00   Max.   :8.0   Max.   :5.000  
##     Ticket               Fare            Cabin             Embarked        
##  Length:500         Min.   :  0.000   Length:500         Length:500        
##  Class :character   1st Qu.:  7.896   Class :character   Class :character  
##  Mode  :character   Median : 14.456   Mode  :character   Mode  :character  
##                     Mean   : 32.092                                        
##                     3rd Qu.: 30.772                                        
##                     Max.   :512.329
  4. (4 points) Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample.
ols <- lm(Survived ~ Pclass + Sex + Age + SibSp + Parch, data = titanic_subset)
summary(ols)
## 
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch, data = titanic_subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0654 -0.2347 -0.1264  0.2540  0.9752 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.787299   0.099751  17.918  < 2e-16 ***
## Pclass      -0.175550   0.023069  -7.610 1.40e-13 ***
## Sex         -0.465582   0.039870 -11.678  < 2e-16 ***
## Age         -0.006771   0.001502  -4.507 8.22e-06 ***
## SibSp       -0.040120   0.017629  -2.276   0.0233 *  
## Parch       -0.013569   0.025350  -0.535   0.5927    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3984 on 494 degrees of freedom
## Multiple R-squared:  0.3329, Adjusted R-squared:  0.3262 
## F-statistic: 49.31 on 5 and 494 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(ols)

  5. (4 points) Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficients from your linear model to forecast/predict/estimate the survival variable given independent variable values/data.
predicted_values <- predict(ols, titanic_subset, type = 'response')
survival_estimate <- ifelse(predicted_values > .5, 1, 0)
table(survival_estimate)
## survival_estimate
##   0   1 
## 343 157
  6. (4 points) Create a confusion matrix, and calculate the accuracy for the entire train dataset. In other words, how many individuals who survived or died in the Titanic crash were correctly classified by your regression?
confusion_matrix <- table(predicted = survival_estimate, actual = titanic_subset$Survived)
confusion_matrix
##          actual
## predicted   0   1
##         0 272  71
##         1  39 118
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.78
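
Accuracy alone hides how the errors split across the two classes; sensitivity and specificity from the same confusion matrix (rows are predictions, columns are actual outcomes) make this visible:

sensitivity <- confusion_matrix["1", "1"] / sum(confusion_matrix[, "1"])  # survivors correctly predicted: 118/189
specificity <- confusion_matrix["0", "0"] / sum(confusion_matrix[, "0"])  # deaths correctly predicted: 272/311
c(sensitivity = sensitivity, specificity = specificity)  # about 0.62 and 0.87
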
  7. (6 points) Pclass, Sex, Age can give you an accuracy of about 80%. Can you get higher accuracy?

Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model - interpretation of your coefficients, how well your linear regression can be used to predict correctly, et cetera.

The following modelling approaches may be helpful (read, and focus on relevant sections) -

https://www.kaggle.com/code/jeremyd/titanic-logistic-regression-in-r

https://www.kaggle.com/code/hillabehar/titanic-analysis-with-r/report

In this section I attempt to reach a prediction accuracy of 80% or greater by adding variables beyond the baseline of Pclass, Sex, and Age. The OLS assumptions (linearity, independence of errors, homoscedasticity, and normality of errors) must be kept in mind for the predictions to be reliable; with a binary dependent variable, a linear probability model necessarily violates homoscedasticity, which is one motivation for the logistic model tried below. The linear regression model is: \[ Survived_i = \beta_0 + \beta_1X_{p,i} + \beta_2X_{s,i} + \beta_3X_{a,i} + \dots +\epsilon_i \]

mydata$log_Fare <- log(mydata$Fare + 1)  # log-transform fare to tame its heavy right skew

ols_two <- lm(Survived ~ Pclass + Sex + SibSp + Age + log_Fare + Embarked + Parch,
              data = mydata)
summary(ols_two)
## 
## Call:
## lm(formula = Survived ~ Pclass + Sex + SibSp + Age + log_Fare + 
##     Embarked + Parch, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07607 -0.19712 -0.09409  0.24102  0.99873 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.631911   0.130890  12.468  < 2e-16 ***
## Pclass      -0.143082   0.023961  -5.971 3.41e-09 ***
## Sex         -0.497412   0.028417 -17.504  < 2e-16 ***
## SibSp       -0.052215   0.013993  -3.731 0.000203 ***
## Age         -0.005841   0.001078  -5.419 7.72e-08 ***
## log_Fare     0.050414   0.021904   2.302 0.021588 *  
## EmbarkedQ    0.002694   0.055250   0.049 0.961127    
## EmbarkedS   -0.054766   0.034442  -1.590 0.112169    
## Parch       -0.025071   0.018721  -1.339 0.180868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3787 on 880 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.3989, Adjusted R-squared:  0.3934 
## F-statistic:    73 on 8 and 880 DF,  p-value: < 2.2e-16
log_model <- glm(Survived ~ Pclass + Sex + SibSp + Age + Embarked + Fare + Parch,
                 data = mydata,
                 family = binomial)
summary(log_model)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + SibSp + Age + Embarked + 
##     Fare + Parch, family = binomial, data = mydata)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.966842   0.668551  11.917  < 2e-16 ***
## Pclass      -1.098660   0.143543  -7.654 1.95e-14 ***
## Sex         -2.719922   0.200663 -13.555  < 2e-16 ***
## SibSp       -0.324239   0.108889  -2.978   0.0029 ** 
## Age         -0.039330   0.007845  -5.013 5.35e-07 ***
## EmbarkedQ   -0.062831   0.381463  -0.165   0.8692    
## EmbarkedS   -0.411475   0.236604  -1.739   0.0820 .  
## Fare         0.001940   0.002374   0.817   0.4138    
## Parch       -0.088638   0.118505  -0.748   0.4545    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.82  on 888  degrees of freedom
## Residual deviance:  784.93  on 880  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 802.93
## 
## Number of Fisher Scoring iterations: 5
prediction <- predict(log_model, mydata, type = 'response')
survival_prediction <- ifelse(prediction > .6, 1, 0)
confusion_matrix_three <- table(predicted = survival_prediction, actual = mydata$Survived)
accuracy_three <- sum(diag(confusion_matrix_three)) / sum(confusion_matrix_three)
accuracy_three
## [1] 0.8143982

I struggled to get an accuracy above 80% until I pivoted to a logistic regression model, which is well suited to this problem given the binary nature of the outcome (survival). The model summary shows that the class and sex variables have the most pronounced relationships with survival, both negative. Fare, by contrast, is not statistically significant once class is controlled for; its information is largely captured by Pclass. Adding variables beyond class, sex, and age increases accuracy, albeit only modestly.
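
One caveat: the 81.4% figure is in-sample accuracy, since the model is scored on the same rows used to fit it. A quick holdout split (a sketch; the 70/30 split and predictor set are arbitrary choices) gives a more honest estimate:

set.seed(100)
train_idx <- sample(seq_len(nrow(mydata)), size = 0.7 * nrow(mydata))
holdout_fit <- glm(Survived ~ Pclass + Sex + SibSp + Age,
                   data = mydata[train_idx, ], family = binomial)
holdout_prob <- predict(holdout_fit, newdata = mydata[-train_idx, ], type = 'response')
holdout_pred <- ifelse(holdout_prob > .5, 1, 0)
mean(holdout_pred == mydata$Survived[-train_idx])  # out-of-sample accuracy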