Welcome to the Final! Please read the instructions below before starting the exam.
Tips for cracking the exam:
Good luck!
The distribution of IQ scores is known to closely follow a Gaussian distribution with a mean centered at 100 and a population standard deviation of 16 (Stanford-Binet).
A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.
\[ P(X \geq 110) = 1 - P(X < 110), \quad X \sim N(100,\ 16) \]
mu <- 100     # population mean IQ
sigma <- 16   # population standard deviation
x <- 110
1 - pnorm(x, mu, sigma, lower.tail = TRUE)   # P(X >= 110)
## [1] 0.2659855
The probability that this person will have an IQ of 110 or higher is 26.60%.
What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with \(\mu=100\) and \(\sigma=16\)? Be sure to write the probability statement and show your R code.
\[ P(\bar{X} \geq 110), \quad \bar{X} \sim N\!\left(100,\ 16/\sqrt{12}\right) \]
mu <- 100
sigma <- 16
x <- 110
n <- 12
standard_error <- sigma / sqrt(n)   # standard error of the mean: sigma / sqrt(n)
1 - pnorm(x, mu, standard_error, lower.tail = TRUE)   # P(Xbar >= 110)
## [1] 0.01519141
The probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with mu = 100 and sigma = 16 is 1.52%.
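As a quick sanity check of the \(\sigma/\sqrt{n}\) reasoning (not part of the required answer), we can simulate many 12-person juries from the stated population and see how often the sample mean reaches 110; the proportion should land close to the analytical 1.52%.

set.seed(1)
jury_means <- replicate(100000, mean(rnorm(12, mean = 100, sd = 16)))
mean(jury_means >= 110)   # should be close to 0.0152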
Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.
A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it, and it uses the point estimate from this survey as evidence. Conduct a hypothesis test with \(\alpha=.05\) to determine whether these data provide strong evidence that fewer than 50% of adult Americans who decide not to go to college do so for financial reasons. What are the critical value, test statistic, and p-value for the proportion of Americans who do not go to college for financial reasons? Do you reject or fail to reject the null?
alpha <- .05
n <- 331
p0 <- .5        # hypothesized proportion under the null (renamed from pi to avoid masking R's built-in constant)
p_hat <- .48    # sample proportion
standard_error <- sqrt((p0 * (1 - p0)) / n)   # standard error under the null
z_score <- (p_hat - p0) / standard_error
z_score
## [1] -0.7277362
critical_value <- qnorm(alpha)   # left-tail critical value for Ha: p < 0.5
print(critical_value)
## [1] -1.644854
if (z_score < critical_value) {
  print('Reject the null hypothesis')
} else {
  print("Fail to reject the null hypothesis")
}
## [1] "Fail to reject the null hypothesis"
With \(H_0: p = 0.5\) and \(H_a: p < 0.5\), the test statistic (z = -0.73) does not fall below the critical value (-1.64), so we fail to reject the null hypothesis.
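The prompt also asks for the p-value, which the chunk above does not report; a small addition using the z statistic already computed gives the left-tail p-value (roughly 0.23, well above .05, consistent with failing to reject).

p_value <- pnorm(z_score)   # P(Z <= z) under H0
p_value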
Would you expect a confidence interval (\(\alpha=.05\), double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.
standard_error_two <- sqrt((p_hat * (1 - p_hat)) / n)
standard_error_two
## [1] 0.02746049
critical_value_two <- qnorm(1 - alpha/2)
print(critical_value_two)
## [1] 1.959964
confidence_interval_upper <- p_hat + critical_value_two * standard_error_two
confidence_interval_upper
## [1] 0.5338216
confidence_interval_lower <- p_hat - critical_value_two * standard_error_two
confidence_interval_lower
## [1] 0.4261784
We would expect the confidence interval for the proportion of American adults who decide not to attend college for financial reasons to include .5, since 0.5 lies within the interval (42.62%, 53.38%). This is consistent with failing to reject the null hypothesis above.
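As a cross-check (not part of the original answer), R's built-in prop.test can produce a comparable interval. The count of "yes" responses is approximated as round(p_hat * n), since only the 48% figure is reported, and prop.test returns a Wilson score interval, so it will differ slightly from the Wald interval above.

x_yes <- round(p_hat * n)   # approximately 159 respondents citing cost
prop.test(x_yes, n, conf.level = 0.95, correct = FALSE)$conf.int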
You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.
Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups differs? Assume \(\alpha = 5\%\).
n <- 22
alpha <- .05
sigma <- 1.8
mu_control <- 6.1
mu_treatment <- 4.9
standard_error <- sqrt((sigma^2 / n) + (sigma^2 / n))   # standard error of the difference in means
t_score <- (mu_treatment - mu_control) / standard_error
print(t_score)
## [1] -2.211083
df <- n - 1                         # conservative df: min(n1, n2) - 1 = 21
critical_value <- qt(alpha/2, df)   # two-sided t critical value (a t, not z, test since the SDs are sample estimates)
print(critical_value)
## [1] -2.079614
p_value <- 2 * pt(t_score, df, lower.tail = TRUE)   # two-sided p-value
p_value
## [1] 0.03825675
Since this p-value (.038) is less than .05, the result is statistically significant. These data therefore provide strong evidence that the average numbers of food items recalled by the patients in the treatment and control groups are different; we reject the null hypothesis.
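The calculation above uses the conservative df = n - 1 = 21; as a sketch (not required by the prompt), the Welch degrees of freedom can be computed from the same summary statistics. Because the two groups have identical sizes and standard deviations, they reduce to 2(n - 1) = 42, giving a slightly smaller two-sided p-value.

welch_df <- (sigma^2 / n + sigma^2 / n)^2 /
  ((sigma^2 / n)^2 / (n - 1) + (sigma^2 / n)^2 / (n - 1))
welch_df                    # 42 when the two groups match exactly
2 * pt(t_score, welch_df)   # two-sided p-value with the Welch df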
Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.
n <- 25
confidence_interval_lower <- 65
confidence_interval_upper <- 77
alpha <- 1 - .9
sample_mean <- (confidence_interval_lower + confidence_interval_upper) / 2
sample_mean
## [1] 71
margin_of_error <- (confidence_interval_upper - confidence_interval_lower) / 2
margin_of_error
## [1] 6
critical_value <- qnorm(1 - alpha/2)   # two-sided 90% critical value (normal approximation)
standard_error <- margin_of_error / critical_value   # margin of error = critical value * SE, so SE is about 3.65
pop_standard_deviation <- (margin_of_error * sqrt(n)) / critical_value
pop_standard_deviation
## [1] 18.2387
The sample mean is 71, the margin of error is 6, the standard error is approximately 3.65, and the implied population standard deviation is 18.24.
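Because the prompt states that the population standard deviation is unknown and the sample is small (n = 25), a t critical value with 24 degrees of freedom is arguably more appropriate than qnorm(); this sketch (an alternative, not the answer above) backs out the standard error and the implied sample standard deviation on that basis.

t_crit <- qt(1 - alpha/2, df = n - 1)            # t critical value, df = 24
standard_error_t <- margin_of_error / t_crit     # ME = t* x SE
sd_estimate_t <- standard_error_t * sqrt(n)      # SE = s / sqrt(n)
c(critical_value = t_crit, standard_error = standard_error_t, sample_sd = sd_estimate_t)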
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -
mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
## control treatment
## alive 4 24
## dead 30 45
Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?
We cannot construct such an interval using the normal approximation because the success-failure condition fails: the normal model for a proportion (or a difference in proportions) is only reasonable when each group has at least roughly 10 "successes" and 10 "failures", and the control group has only 4 survivors. With so few successes, the sampling distribution of the difference in survival rates is skewed rather than approximately normal, so a normal-based confidence interval would have the wrong width and coverage and could lead to erroneous conclusions about the efficacy of these transplants.
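A quick check of that condition using the counts already stored in mym (a small addition, not required by the prompt):

colSums(mym)           # group sizes: control 34, treatment 69
mym["alive", ] >= 10   # FALSE for the control group: only 4 survivors, below the usual cutoff of 10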
Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.
\[ \text{Team wins}_i = \beta_0 + \beta_1\,\text{Assists}_i + \epsilon_i \]
\[ \text{Null hypothesis: } H_0: \beta_1 = 0 \] \[ \text{Alternative hypothesis: } H_a: \beta_1 \neq 0 \] For this assignment I am using the hoopR data set, which contains various past NBA statistics. I am examining the relationship between a team's number of assists and its team wins, where \(\beta_1\) is the coefficient on assists. The null hypothesis is that the coefficient on assists is zero, meaning that the number of assists a team records has no impact on the number of games it wins. The alternative hypothesis states the opposite: assists do affect the number of games won.
lm_hypothesis_test <- function(lin_regression_model, coef_name, null_value = 0, alpha = .05){
  # Pull the estimated coefficient and its standard error from the fitted lm object
  coef_estimation <- coef(lin_regression_model)[coef_name]
  standard_error <- summary(lin_regression_model)$coefficients[coef_name, "Std. Error"]
  # t statistic for H0: beta = null_value, using the model's residual degrees of freedom
  t_score <- (coef_estimation - null_value) / standard_error
  df <- lin_regression_model$df.residual
  # Two-sided p-value from the t distribution
  p_value <- 2 * pt(-abs(t_score), df)
  conclusion <- ifelse(p_value < alpha, 'Reject H0', 'Fail to reject H0')
  return(list(
    t_score = t_score,
    p_value = p_value,
    conclusion = conclusion
  ))
}
Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.
model <- lm(team_winner ~ assists, data = nba_stats)
beta_hat <- coef(model)['assists']
se_beta <- summary(model)$coefficients['assists', 'Std. Error']
beta_hat
## assists
## 0.03011146
se_beta
## [1] 0.001780496
df <- model$df.residual
manual_tscore <- beta_hat / se_beta
p_value <- 2 * pt(-abs(manual_tscore), df)
manual_tscore
## assists
## 16.91184
p_value
## assists
## 5.53805e-61
alpha <- .05
if (p_value < alpha) {
  conclude <- "Reject H0: Assists impact a team's total wins"
} else {
  conclude <- "Fail to reject H0: Assists do not impact a team's total wins"
}
conclude
## [1] "Reject H0: Assists impact a team's total wins"
library(hoopR)   # provides load_nba_team_box()
nba_stats <- load_nba_team_box(seasons = 2023)
lin_regression_model <- lm(team_winner ~ assists, data = nba_stats)
lm_hypothesis_test(lin_regression_model, "assists", null_value = 0, alpha = .05)
## $t_score
## assists
## 15.43826
##
## $p_value
## assists
## 1.547036e-51
##
## $conclusion
## assists
## "Reject H0"
As shown, the function reaches the same conclusion as the manual calculation: a large t statistic, a p-value far below .05, and "Reject H0". The exact t-score and p-value differ slightly between the two runs because the data set was reloaded before the function was called, but the inference is identical.
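As an additional cross-check (not part of the original answer), R's built-in coefficient table reports the same two-sided t-test of each coefficient against zero, so its t value and Pr(>|t|) columns should match the custom function's output.

summary(lin_regression_model)$coefficients["assists", ]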
Import the smoking.csv file into R. The variables are "smoker" - whether or not one smokes; "smkban" - whether or not one faces a smoking ban at one's workplace (this is fairly old data); "age" - in years; "hsdrop" - high school dropout; "hsgrad" - graduated from high school but has no further education; "colsome" - has some college education; "colgrad" - graduated from college, as well as some self-explanatory demographic variables.
library(readxl)
getwd()
## [1] "C:/Users/mickp/Downloads"
smoking_data <- read_excel('smokingtwo.xlsx')
Estimate a logit regression model of whether or not one smokes as a function of the other variables.
logit_regression <- glm(smoker ~ smkban + age + hsdrop + hsgrad + colsome + colgrad + hispanic + female + black, data = smoking_data, family = binomial(link = 'logit'))
summary(logit_regression)
##
## Call:
## glm(formula = smoker ~ smkban + age + hsdrop + hsgrad + colsome +
## colgrad + hispanic + female + black, family = binomial(link = "logit"),
## data = smoking_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.696765 0.138696 -12.234 < 2e-16 ***
## smkban -0.250735 0.049164 -5.100 3.40e-07 ***
## age -0.007452 0.001987 -3.751 0.000176 ***
## hsdrop 1.931075 0.131261 14.712 < 2e-16 ***
## hsgrad 1.523305 0.114308 13.326 < 2e-16 ***
## colsome 1.180080 0.116518 10.128 < 2e-16 ***
## colgrad 0.424753 0.125801 3.376 0.000734 ***
## hispanic -0.584845 0.083085 -7.039 1.93e-12 ***
## female -0.188720 0.049105 -3.843 0.000121 ***
## black -0.149472 0.089994 -1.661 0.096732 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11074 on 9999 degrees of freedom
## Residual deviance: 10502 on 9990 degrees of freedom
## AIC: 10522
##
## Number of Fisher Scoring iterations: 4
What does the coefficient on "smkban" tell you? How about the coefficient on "age"? How about the coefficient on "colgrad"?
The coefficient on 'smkban' is -0.25: facing a smoking ban at one's workplace is associated with a 0.25 decrease in the log-odds of smoking, holding the other variables fixed, and the effect is highly statistically significant. The coefficient on 'age' is -0.007: each additional year of age is associated with a 0.007 decrease in the log-odds of smoking. The per-year effect is small, but it is statistically significant (p < .001), so age is not unrelated to smoking. The coefficient on 'colgrad' is 0.42: relative to the omitted education category, college graduates have a 0.42 higher log-odds of smoking, a positive and statistically significant association.
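Log-odds are hard to read directly; exponentiating the coefficients converts them to odds ratios (a small addition, not part of the original answer). For example, exp(-0.25) is roughly 0.78, so facing a workplace ban multiplies the odds of smoking by about 0.78, all else equal.

exp(coef(logit_regression))   # odds ratios for each predictor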
Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.
nsw <- read_excel('nsw.xlsx')
Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?
\[ \Delta\text{earnings}_i = \beta_0 + \beta_1\,\text{treat}_i + \epsilon_i \] \[ \text{Null hypothesis: } H_0: \beta_1 = 0 \] \[ \text{Alternative hypothesis: } H_a: \beta_1 \neq 0 \] In this case \(\beta_1\) represents the difference in the average change in real earnings between the treatment group and the control group, and \(\beta_0\) represents the average change in real earnings for those in the control group.
real_earnings <- nsw$re78 - nsw$re75
earnings_model <- lm(real_earnings ~ treat, data = nsw)
summary(earnings_model)
##
## Call:
## lm(formula = real_earnings ~ treat, data = nsw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37995 -3198 -1442 3677 56114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2063.4 359.3 5.743 1.37e-08 ***
## treat 846.9 560.1 1.512 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7406 on 720 degrees of freedom
## Multiple R-squared: 0.003165, Adjusted R-squared: 0.00178
## F-statistic: 2.286 on 1 and 720 DF, p-value: 0.131
This model estimates that being in the treatment group is associated with an $846.9 larger increase in real earnings between 1975 and 1978 than the control group (whose average increase is the intercept, $2,063.4). However, the effect is not statistically significant at the 10% significance level since the p-value (.131) exceeds .1, so there is not sufficient evidence that the program had an impact.
Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?
earnings_model_two <- lm(real_earnings ~ treat + age + education + black + hispanic + married + nodegree, data = nsw)
summary(earnings_model_two)
##
## Call:
## lm(formula = real_earnings ~ treat + age + education + black +
## hispanic + married + nodegree, data = nsw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38153 -3383 -1076 3503 55863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1986.822 3123.319 0.636 0.52490
## treat 869.294 561.559 1.548 0.12207
## age 4.194 43.424 0.097 0.92308
## education 26.633 215.306 0.124 0.90159
## black 112.085 956.157 0.117 0.90672
## hispanic 1229.152 1253.250 0.981 0.32704
## married -2210.949 767.530 -2.881 0.00409 **
## nodegree -217.758 891.527 -0.244 0.80710
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7384 on 714 degrees of freedom
## Multiple R-squared: 0.01729, Adjusted R-squared: 0.007654
## F-statistic: 1.794 on 7 and 714 DF, p-value: 0.08539
In this adjusted model, which controls for age, education, race/ethnicity, marital status, and degree status, the estimated treatment effect is $869.29, very close to the univariate estimate of $846.9. It is still not statistically significant at the 10% significance level since the p-value (0.122) exceeds .1, so there is again not sufficient evidence that the program had an impact.
Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?
treated_data <- nsw[nsw$treat == 1, ]
control_data <- nsw[nsw$treat == 0, ]
treated_mean <- mean(treated_data$re78) - mean(treated_data$re75)
control_mean <- mean(control_data$re78) - mean(control_data$re75)
difference_approach <- treated_mean - control_mean
difference_approach
## [1] 846.8883
The estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 is $846.89.
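The prompt asks for a data frame set up for a difference-in-differences estimate; one way to do this (a sketch, assuming the dplyr and tidyr packages are available, rather than the group-means approach used above) is to stack the 1975 and 1978 earnings into a long format and regress earnings on a treat x post interaction, whose coefficient reproduces the same estimate.

library(dplyr)
library(tidyr)
nsw_long <- nsw %>%
  pivot_longer(c(re75, re78), names_to = "year", values_to = "earnings") %>%
  mutate(post = as.numeric(year == "re78"))   # post = 1 for 1978, 0 for 1975
did_model <- lm(earnings ~ treat * post, data = nsw_long)
coef(did_model)["treat:post"]   # should reproduce the 846.89 difference-in-differences estimate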
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: "what sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Import the titanic train.csv file into R. You will have to clean the data to run some of the commands and set up the workflow.
mydata <- read_excel("C:/Users/mickp/Downloads/train.xlsx")
cor(mydata$Survived, mydata$SibSp)
## [1] -0.0353225
cor(mydata$Survived, mydata$Fare)
## [1] 0.2573065
mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm = TRUE)   # impute missing ages with the median age
cor(mydata$Survived, mydata$Age)
## [1] -0.06491042
mydata$Sex <- as.numeric(factor(mydata$Sex))   # recode Sex as numeric (1 = female, 2 = male) so cor() can be computed
cor(mydata$Survived, mydata$Sex)
## [1] -0.5433514
library(psych)
describe(mydata)
## vars n mean sd median trimmed mad min max range
## PassengerId 1 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Survived 2 891 0.38 0.49 0.00 0.35 0.00 0.00 1.00 1.00
## Pclass 3 891 2.31 0.84 3.00 2.39 0.00 1.00 3.00 2.00
## Name* 4 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Sex 5 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00
## Age 6 891 29.36 13.02 28.00 28.83 8.90 0.42 80.00 79.58
## SibSp 7 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00
## Parch 8 891 0.38 0.81 0.00 0.18 0.00 0.00 6.00 6.00
## Ticket* 9 891 339.52 200.83 338.00 339.65 268.35 1.00 681.00 680.00
## Fare 10 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33
## Cabin* 11 204 77.00 42.23 76.00 77.09 54.11 1.00 147.00 146.00
## Embarked* 12 889 2.54 0.79 3.00 2.67 0.00 1.00 3.00 2.00
## skew kurtosis se
## PassengerId 0.00 -1.20 8.62
## Survived 0.48 -1.77 0.02
## Pclass -0.63 -1.28 0.03
## Name* 0.00 -1.20 8.62
## Sex -0.62 -1.62 0.02
## Age 0.51 0.97 0.44
## SibSp 3.68 17.73 0.04
## Parch 2.74 9.69 0.03
## Ticket* 0.00 -1.28 6.73
## Fare 4.77 33.12 1.66
## Cabin* 0.00 -1.19 2.96
## Embarked* -1.26 -0.23 0.03
The age data indicate that the passengers are fairly young on average, with a mean age of 29.36, although there is a lot of spread, as suggested by the large standard deviation and range. The mean and median fare are very different (32.20 versus 14.45): a small number of very expensive tickets pulls the mean well above the median, so the fare distribution is strongly right-skewed (skew of 4.77), while a large share of passengers paid low fares.
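That right skew in Fare is one reason a log(Fare + 1) transform is used in the report section below; a quick illustrative comparison of the two scales (a small addition):

summary(mydata$Fare)            # mean well above the median: strong right skew
summary(log(mydata$Fare + 1))   # the logged fares are much more symmetric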
set.seed(100)
titanic_subset <- sample_n(mydata, 500)
summary(titanic_subset)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.000 Min. :1.000 Length:500
## 1st Qu.:222.8 1st Qu.:0.000 1st Qu.:1.000 Class :character
## Median :428.5 Median :0.000 Median :3.000 Mode :character
## Mean :440.0 Mean :0.378 Mean :2.308
## 3rd Qu.:659.2 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :891.0 Max. :1.000 Max. :3.000
## Sex Age SibSp Parch
## Min. :1.000 Min. : 0.42 Min. :0.0 Min. :0.000
## 1st Qu.:1.000 1st Qu.:22.00 1st Qu.:0.0 1st Qu.:0.000
## Median :2.000 Median :28.00 Median :0.0 Median :0.000
## Mean :1.676 Mean :29.36 Mean :0.5 Mean :0.366
## 3rd Qu.:2.000 3rd Qu.:35.00 3rd Qu.:1.0 3rd Qu.:0.000
## Max. :2.000 Max. :71.00 Max. :8.0 Max. :5.000
## Ticket Fare Cabin Embarked
## Length:500 Min. : 0.000 Length:500 Length:500
## Class :character 1st Qu.: 7.896 Class :character Class :character
## Mode :character Median : 14.456 Mode :character Mode :character
## Mean : 32.092
## 3rd Qu.: 30.772
## Max. :512.329
ols <- lm(Survived ~ Pclass + Sex + Age + SibSp + Parch, data = titanic_subset)
summary(ols)
##
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch, data = titanic_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0654 -0.2347 -0.1264 0.2540 0.9752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.787299 0.099751 17.918 < 2e-16 ***
## Pclass -0.175550 0.023069 -7.610 1.40e-13 ***
## Sex -0.465582 0.039870 -11.678 < 2e-16 ***
## Age -0.006771 0.001502 -4.507 8.22e-06 ***
## SibSp -0.040120 0.017629 -2.276 0.0233 *
## Parch -0.013569 0.025350 -0.535 0.5927
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3984 on 494 degrees of freedom
## Multiple R-squared: 0.3329, Adjusted R-squared: 0.3262
## F-statistic: 49.31 on 5 and 494 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(ols)
predicted_values <- predict(ols, titanic_subset, type = 'response')
survival_estimate <- ifelse(predicted_values > .5, 1, 0)
table(survival_estimate)
## survival_estimate
## 0 1
## 343 157
confusion_matrix <- table(predicted = survival_estimate, actual = titanic_subset$Survived)
confusion_matrix
## actual
## predicted 0 1
## 0 272 71
## 1 39 118
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.78
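Beyond raw accuracy, the confusion matrix also yields sensitivity and specificity (a small addition using the table computed above), which show that the linear probability model is better at identifying non-survivors than survivors.

sensitivity <- confusion_matrix["1", "1"] / sum(confusion_matrix[, "1"])   # 118 / (118 + 71)
specificity <- confusion_matrix["0", "0"] / sum(confusion_matrix[, "0"])   # 272 / (272 + 39)
c(sensitivity = sensitivity, specificity = specificity)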
Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model - interpretation of your coefficients, how well your linear regression can be used to predict correctly, et cetera.
The following modelling approaches may be helpful (read, and focus on relevant sections) -
https://www.kaggle.com/code/jeremyd/titanic-logistic-regression-in-r
https://www.kaggle.com/code/hillabehar/titanic-analysis-with-r/report
In this section I will attempt to obtain a prediction accuracy of 80% or greater by adding variables beyond Pclass, Sex, and Age. The OLS assumptions (linearity, independence of errors, homoscedasticity, and normality of errors) must be kept in mind so that the eventual predictions are reliable. The linear regression model follows the equation: \[ \text{Survived}_i = \beta_0 + \beta_1 X_{p,i} + \beta_2 X_{s,i} + \beta_3 X_{a,i} + \dots + \epsilon_i \]
mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm = TRUE)
mydata$log_Fare <- log(mydata$Fare +1)
ols_two <- lm(mydata$Survived ~ mydata$Pclass + mydata$Sex + mydata$SibSp + mydata$Age + mydata$log_Fare + mydata$Embarked + mydata$Parch)
summary(ols_two)
##
## Call:
## lm(formula = mydata$Survived ~ mydata$Pclass + mydata$Sex + mydata$SibSp +
## mydata$Age + mydata$log_Fare + mydata$Embarked + mydata$Parch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07607 -0.19712 -0.09409 0.24102 0.99873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.631911 0.130890 12.468 < 2e-16 ***
## mydata$Pclass -0.143082 0.023961 -5.971 3.41e-09 ***
## mydata$Sex -0.497412 0.028417 -17.504 < 2e-16 ***
## mydata$SibSp -0.052215 0.013993 -3.731 0.000203 ***
## mydata$Age -0.005841 0.001078 -5.419 7.72e-08 ***
## mydata$log_Fare 0.050414 0.021904 2.302 0.021588 *
## mydata$EmbarkedQ 0.002694 0.055250 0.049 0.961127
## mydata$EmbarkedS -0.054766 0.034442 -1.590 0.112169
## mydata$Parch -0.025071 0.018721 -1.339 0.180868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3787 on 880 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.3989, Adjusted R-squared: 0.3934
## F-statistic: 73 on 8 and 880 DF, p-value: < 2.2e-16
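To relate this fitted model back to the OLS assumptions mentioned above, the standard residual diagnostics can be inspected (a quick sketch; with a binary outcome the homoscedasticity and normality assumptions are necessarily strained, which is one motivation for the logistic model that follows).

par(mfrow = c(2, 2))
plot(ols_two)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage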
log_model <- glm(Survived ~ Pclass + Sex + SibSp + Age + Embarked + Fare + Parch,
                 data = mydata,
                 family = binomial)
summary(log_model)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + SibSp + Age + Embarked +
## Fare + Parch, family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.966842 0.668551 11.917 < 2e-16 ***
## Pclass -1.098660 0.143543 -7.654 1.95e-14 ***
## Sex -2.719922 0.200663 -13.555 < 2e-16 ***
## SibSp -0.324239 0.108889 -2.978 0.0029 **
## Age -0.039330 0.007845 -5.013 5.35e-07 ***
## EmbarkedQ -0.062831 0.381463 -0.165 0.8692
## EmbarkedS -0.411475 0.236604 -1.739 0.0820 .
## Fare 0.001940 0.002374 0.817 0.4138
## Parch -0.088638 0.118505 -0.748 0.4545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1182.82 on 888 degrees of freedom
## Residual deviance: 784.93 on 880 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 802.93
##
## Number of Fisher Scoring iterations: 5
prediction <- predict(log_model, mydata, type = 'response')
survival_prediction <- ifelse(prediction > .6, 1, 0)   # classify as survived when the predicted probability exceeds 0.6
confusion_matrix_three <- table(predicted = survival_prediction, actual = mydata$Survived)
accuracy_three <- sum(diag(confusion_matrix_three)) / sum(confusion_matrix_three)
accuracy_three
## [1] 0.8143982
I struggled to get an accuracy above 80% until I pivoted to a logistic regression model. The logistic model is well suited to this problem given the binary nature of the outcome (survival). Looking at the summary of this model, the class and sex variables have the most pronounced (negative) relationships with survival. Fare, by contrast, is not statistically significant once passenger class and the other variables are controlled for. Adding variables beyond class, sex, and age increases accuracy, although only modestly.
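The 81% accuracy above is computed on the same rows used to fit the model, so it is an in-sample figure; a simple holdout split (a sketch, not part of the original workflow, reusing the same formula and 0.6 threshold) gives a rough out-of-sample estimate of how well the model generalises.

set.seed(100)
train_idx <- sample(seq_len(nrow(mydata)), size = floor(0.7 * nrow(mydata)))
train <- mydata[train_idx, ]
test  <- mydata[-train_idx, ]
holdout_model <- glm(Survived ~ Pclass + Sex + SibSp + Age + Embarked + Fare + Parch,
                     data = train, family = binomial)
holdout_prob <- predict(holdout_model, newdata = test, type = "response")
holdout_pred <- ifelse(holdout_prob > .6, 1, 0)
mean(holdout_pred == test$Survived, na.rm = TRUE)   # holdout accuracy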