The distribution of the average IQ score is known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Benet).
A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.
# p(IQ >= 110)
p_IQ_110_or_higher <- 1 - pnorm(110, mean = 100, sd = 16)
p_IQ_110_or_higher
## [1] 0.2659855
The probability of this person will have an IQ of 110 or higher is 26.6%.
What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with \(\mu=100\) and \(\sigma=16\)? Be sure to write the probability statement and show your R code.
se <- 16 / sqrt(12)
p_jury <- 1 - pnorm(110, mean = 100, sd = se)
p_jury
## [1] 0.01519141
The probability that the mean IQ of the 12-person jury would be 110 or above would be 1.52%
Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.
A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it and uses the point estimate from this survey as evidence. Conduct a hypothesis test with \(\alpha=.05\) to determine if these data provide strong evidence that more than 50% adult Americans do not go to college for financial reasons. What is the critical value, test statistic, and p-value for proportion of Americans who do not go college for financial reasons. Do you reject or fail to reject the null ?
H0: Mean of 50% adult American do not
go to college for financial reasons.
HA: Mean of more than 50%
adult American do not go to college for financial reasons.
n <- 331
p_hat <- 0.48
p_0 <- 0.5
se <- sqrt((p_0 * (1 - p_0)) / n)
alpha <- 0.05
# critical value
critical <- qnorm(0.05)
# test statistic
t_stat <- (p_hat - p_0) / se
# p value
p_value <- pnorm(t_stat , lower.tail = TRUE)
cat("Critical Value:", critical, "\n")
## Critical Value: -1.644854
cat("Test Statistic:", t_stat, "\n")
## Test Statistic: -0.7277362
cat("P-Value:", p_value, "\n")
## P-Value: 0.2333875
# function for hypothesis testing
myp=function(p, alpha){
if(p<alpha){print('REJECT H0')}else{print('FAIL 2 REJECT')}
}
myp(p_value, alpha)
## [1] "FAIL 2 REJECT"
We fail to reject our null hypothesis of mean of 50% adult American do not go to college for financial reasons at the significance level of 0.05.
Would you expect a confidence interval (\(\alpha=.05\), double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se
ci
## [1] 0.4261353 0.5338647
Yes, we see from this output that 0.5 is included within this output of confidence interval (0.4261, 0.5339)
You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.
Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different? Assume \(\alpha = 5\%\).
H0: The average number of items
recalled by the patients in treatment and control group is the same.
HA: The average number of items recalled by the patients in
treatment and control group is different.
mu_1 <- 4.9
sigma_1 <- 1.8
mu_2 <- 6.1
sigma_2 <- 1.8
n_1 <- 22
n_2 <- 22
alpha <- .05
# calculate variance
var_1 = sigma_1^2
var_2 = sigma_2^2
# standard error
se <- sqrt((var_1 / n) + (var_2 / n))
t_statistic <- (mu_1-mu_2) / se
p_value <- 2 * pt(t_statistic, df = min(n_1, n_2) - 1, lower.tail = TRUE)
myp(p_value, alpha)
## [1] "REJECT H0"
Because we are reject our null hypothesis, the data does provide strong evidence that average number of food items recalled by the patient between treatment and control groups be different.
Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.
lower <- 65
upper <- 77
n <- 25
# sample mean
sample_mean <- (lower + upper) / 2
cat("Sample mean is", sample_mean)
## Sample mean is 71
# margin of error
me <- (upper - lower) / 2
cat("Margin of error is", me)
## Margin of error is 6
# standard error
se <- me/ qt(0.95, df = n - 1)
cat("Standard error is", se)
## Standard error is 3.506963
# standard deviation
sd <- se * sqrt(n)
cat("Standard deviation is", sd)
## Standard deviation is 17.53481
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -
mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
## control treatment
## alive 4 24
## dead 30 45
Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?
The sample size in this experiment is just too small for us to conduct meaningful analysis, particularly using the normal approximation. When this small sample size, we are very likely to get a wider confidence interval range with large error margin, which the result doesn’t help to provide the most precise estimation to represent the entire population.
Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.
hypothesis_test <- function(lm_model, coefficient, null_value = 0, alternative = "two.sided",
alpha = 0.05) {
# extract the coefficients and standard error
coef_summary <- summary(lm_model)$coefficients
estimate <- coef_summary[coefficient, 1]
se <- coef_summary[coefficient, 2]
# calculate t tests
t_stat <- (estimate - null_value) / se
# degrees of freedom
df <- lm_model$df.residual
# p value
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
# check hypothesis
conclusion <- if (p_value < alpha) {
print('REJECT H0') }
else{
print('FAIL 2 REJECT')
}
# return necessary results
return(list(t_statistic = t_stat, p_value = p_value, conclusion = conclusion))
}
Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.
data("swiss")
mylm <- lm(Fertility ~ Education , data = swiss)
coef_summary <- summary(mylm)$coefficients
estimate <- coef_summary["Education", 1]
se <- coef_summary["Education", 2]
t_stat <- estimate / se
t_stat
## [1] -5.953619
df <- mylm$df.residual
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
p_value
## [1] 3.658617e-07
myp(p_value, alpha = 0.05)
## [1] "REJECT H0"
hypothesis_test(mylm, coefficient = "Education")
## [1] "REJECT H0"
## $t_statistic
## [1] -5.953619
##
## $p_value
## [1] 3.658617e-07
##
## $conclusion
## [1] "REJECT H0"
Outputs are the same.
Import the smoking.csv file into R. The variables are “smoker” - whether or not one smokes; “snmkban” - whether or not one faces a smoking ban at one’s workplace (this is fairly old data); “age” - in years; “hsdrop” - high school dropout; “hsgrad” - graduated from high school but has no further education; “colsome” - has some college education; “colgrad” - graduated from college, as well as some self-explanatory demographic variables.
Estimate a logit regression model of whether or not one smokes as a function of the other variables.
smoking <- read.csv("smoking.csv")
logit <- glm(smoker ~ ., data = smoking, family = binomial)
summary(logit)
##
## Call:
## glm(formula = smoker ~ ., family = binomial, data = smoking)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.696765 0.138696 -12.234 < 2e-16 ***
## smkban -0.250735 0.049164 -5.100 3.40e-07 ***
## age -0.007452 0.001987 -3.751 0.000176 ***
## hsdrop 1.931075 0.131261 14.712 < 2e-16 ***
## hsgrad 1.523305 0.114308 13.326 < 2e-16 ***
## colsome 1.180080 0.116518 10.128 < 2e-16 ***
## colgrad 0.424753 0.125801 3.376 0.000734 ***
## black -0.149472 0.089994 -1.661 0.096732 .
## hispanic -0.584845 0.083085 -7.039 1.93e-12 ***
## female -0.188720 0.049105 -3.843 0.000121 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11074 on 9999 degrees of freedom
## Residual deviance: 10502 on 9990 degrees of freedom
## AIC: 10522
##
## Number of Fisher Scoring iterations: 4
What does the coefficient on “snmkban” tell you? How about the coefficient on “age”? How about the coefficient on “colgrad”?
snmkban: The odds of smokers would
decrease by 22.18% ((1 - e^(-0.250735)) * 100) if there is smoke bans at
one’s workplace in comparison to the workplace without smoke bans. This
means as we are holding all other variables constant, the ones who have
a smoke bans at their workplace are less likely to smoke than those who
does NOT have a smoke bans at workplace.
age: For every unit
increase in age, the odds of the smokers would decrease by 1% ((1 -
(e^(-0.117452)) * 100). This means as we hold all other variables
constant, less people are likely to smoke as they ages.
colgrad:
The odds of smokers would increase by 52.92% ((e^(0.424753) - 1) * 100)
of those individuals who graduated college than those who didn’t
graduate, as we hold other variables constant.
Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.
Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?
H0: There is no difference in real
earnings between 1975 and 1978 of those who received work experience and
counseling in a sheltered environments.
HA: There is a difference
real earnings between 1975 and 1978 of those who received work
experience and counseling in a sheltered environments.
nsw <- read.csv("nsw.csv")
univariate_reg <- lm(re78 - re75 ~ treat, data = nsw)
summary(univariate_reg )
##
## Call:
## lm(formula = re78 - re75 ~ treat, data = nsw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37995 -3198 -1442 3677 56114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2063.4 359.3 5.743 1.37e-08 ***
## treat 846.9 560.1 1.512 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7406 on 720 degrees of freedom
## Multiple R-squared: 0.003165, Adjusted R-squared: 0.00178
## F-statistic: 2.286 on 1 and 720 DF, p-value: 0.131
With a p-value (0.131) larger than 0.05, we fail to reject our null hypothesis that there is no difference of effect in terms of receiving work experience and counseling in a sheltered environment on real earnings between 1975 and 1978.
Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?
univariate_reg1 <- lm(re78 - re75 ~ treat + education + age + black + hispanic + married + nodegree, data = nsw)
summary(univariate_reg1)
##
## Call:
## lm(formula = re78 - re75 ~ treat + education + age + black +
## hispanic + married + nodegree, data = nsw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38153 -3383 -1076 3503 55863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1986.822 3123.319 0.636 0.52490
## treat 869.294 561.559 1.548 0.12207
## education 26.633 215.306 0.124 0.90159
## age 4.194 43.424 0.097 0.92308
## black 112.085 956.157 0.117 0.90672
## hispanic 1229.152 1253.250 0.981 0.32704
## married -2210.949 767.530 -2.881 0.00409 **
## nodegree -217.758 891.527 -0.244 0.80710
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7384 on 714 degrees of freedom
## Multiple R-squared: 0.01729, Adjusted R-squared: 0.007654
## F-statistic: 1.794 on 7 and 714 DF, p-value: 0.08539
After adding the control variables,
only this married variable is statistically significant enough
to influence the model while all other variables are not statistically
significant. Hence, this is showing how even when other dependent
variables are added, this model is still not the best in terms of
demonstrating the relationship of treatment effect throughout
time.
Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?
# create separate data frames for 1975 (denotes as 0) and 1978 (denotes as 1)
nsw_75 <- data.frame(id = 1:nrow(nsw),
treat = nsw$treat,
time = 0,
earnings = nsw$re75)
nsw_78 <- data.frame(id = 1:nrow(nsw),
treat = nsw$treat,
time = 1,
earnings = nsw$re78)
# combine data frames
nsw_combined<- rbind(nsw_75, nsw_78)
# create an interaction term between treat and time
nsw_combined$did <- nsw_combined$treat * nsw_combined$time
did_model <- lm(earnings ~ treat + time + did, data = nsw_combined)
summary(did_model)
##
## Call:
## lm(formula = earnings ~ treat + time + did, data = nsw_combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5976 -3066 -1861 2255 54332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3026.68 275.82 10.974 < 2e-16 ***
## treat 39.42 430.04 0.092 0.927
## time 2063.37 390.06 5.290 1.41e-07 ***
## did 846.89 608.17 1.393 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5686 on 1440 degrees of freedom
## Multiple R-squared: 0.04585, Adjusted R-squared: 0.04386
## F-statistic: 23.07 on 3 and 1440 DF, p-value: 1.385e-14
did_model1 <- lm(earnings ~ did, data = nsw_combined)
summary(did_model1)
##
## Call:
## lm(formula = earnings ~ did, data = nsw_combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5976 -3801 -2195 2499 54332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3801.4 169.8 22.390 < 2e-16 ***
## did 2174.9 374.4 5.809 7.7e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5750 on 1442 degrees of freedom
## Multiple R-squared: 0.02287, Adjusted R-squared: 0.02219
## F-statistic: 33.75 on 1 and 1442 DF, p-value: 7.7e-09
The very first model using DID approach only has variable time being statistically significant, which perhaps is not providing enough helpful information. However when we only keep the interaction term, or the DID estimator, its large positive coefficient (2174.9) in the model tells us how treatment does have a positive effect on the real earnings over time from 1975 to 1978 with a statistic significance level at 1%.
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
Import the titanic train.csv file into R. You will have to clean the data to run some of commands/set up the workflow.
train <- read.csv("train.csv")
head(train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
# fill in NAs
train$Age[is.na(train$Age)] = median(train$Age, na.rm = TRUE)
correlations <- cor(train[, c("Pclass", "Survived", "Age", "SibSp", "Fare")])
correlations
## Pclass Survived Age SibSp Fare
## Pclass 1.00000000 -0.33848104 -0.33989833 0.08308136 -0.54949962
## Survived -0.33848104 1.00000000 -0.06491042 -0.03532250 0.25730652
## Age -0.33989833 -0.06491042 1.00000000 -0.23329633 0.09668842
## SibSp 0.08308136 -0.03532250 -0.23329633 1.00000000 0.15965104
## Fare -0.54949962 0.25730652 0.09668842 0.15965104 1.00000000
Survived and Pclass
have a negative correlation.
Survived and Age
have a negative correlation.
Survived and SibSp
have a negative correlation.
Survived and Fare
have a positive correlation.
library("psych")
describe(train)
## vars n mean sd median trimmed mad min max range
## PassengerId 1 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Survived 2 891 0.38 0.49 0.00 0.35 0.00 0.00 1.00 1.00
## Pclass 3 891 2.31 0.84 3.00 2.39 0.00 1.00 3.00 2.00
## Name* 4 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Sex* 5 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00
## Age 6 891 29.36 13.02 28.00 28.83 8.90 0.42 80.00 79.58
## SibSp 7 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00
## Parch 8 891 0.38 0.81 0.00 0.18 0.00 0.00 6.00 6.00
## Ticket* 9 891 339.52 200.83 338.00 339.65 268.35 1.00 681.00 680.00
## Fare 10 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33
## Cabin* 11 891 18.63 38.14 1.00 8.29 0.00 1.00 148.00 147.00
## Embarked* 12 891 3.53 0.80 4.00 3.66 0.00 1.00 4.00 3.00
## skew kurtosis se
## PassengerId 0.00 -1.20 8.62
## Survived 0.48 -1.77 0.02
## Pclass -0.63 -1.28 0.03
## Name* 0.00 -1.20 8.62
## Sex* -0.62 -1.62 0.02
## Age 0.51 0.97 0.44
## SibSp 3.68 17.73 0.04
## Parch 2.74 9.69 0.03
## Ticket* 0.00 -1.28 6.73
## Fare 4.77 33.12 1.66
## Cabin* 2.09 3.07 1.28
## Embarked* -1.27 -0.16 0.03
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.36 Mean :0.523 Mean :0.3816
## 3rd Qu.:35.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 28 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
It is surprising to see summary results
such as max/min/kurtosis (and more) are still displayed for variables
with character types according to description(), but I do notice the
asterisk next to these special variables.
For binary variable,
such as Survived, it is interesting to see how R is treating it
as integer, which explains why we also see Min/Max/Quantile values
displayed for it using summary() function.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(100)
train_subset <- sample_n(train, 500)
str(train_subset)
## 'data.frame': 500 obs. of 12 variables:
## $ PassengerId: int 714 503 358 624 718 470 516 823 838 98 ...
## $ Survived : int 0 0 0 0 1 1 0 0 0 1 ...
## $ Pclass : int 3 3 2 3 2 3 1 1 3 1 ...
## $ Name : chr "Larsson, Mr. August Viktor" "O'Sullivan, Miss. Bridget Mary" "Funk, Miss. Annie Clemmer" "Hansen, Mr. Henry Damsgaard" ...
## $ Sex : chr "male" "female" "female" "male" ...
## $ Age : num 29 28 38 21 27 0.75 47 38 28 23 ...
## $ SibSp : int 0 0 0 0 0 2 0 0 0 0 ...
## $ Parch : int 0 0 0 0 0 1 0 0 0 1 ...
## $ Ticket : chr "7545" "330909" "237671" "350029" ...
## $ Fare : num 9.48 7.63 13 7.85 10.5 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "S" "Q" "S" "S" ...
ols <- lm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Cabin + Embarked, data = train_subset)
summary(ols)
##
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Cabin + Embarked, data = train_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72839 -0.18183 -0.07858 0.05158 0.95235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.954832 0.153667 6.214 1.32e-09 ***
## Pclass -0.084622 0.044371 -1.907 0.05723 .
## Sexmale -0.419610 0.046253 -9.072 < 2e-16 ***
## Age -0.005613 0.001790 -3.135 0.00185 **
## SibSp -0.065145 0.020440 -3.187 0.00155 **
## Parch -0.002622 0.028690 -0.091 0.92723
## Fare 0.002959 0.001308 2.262 0.02426 *
## CabinA10 -0.367248 0.400503 -0.917 0.35972
## CabinA14 -0.442341 0.399118 -1.108 0.26841
## CabinA16 0.347194 0.403580 0.860 0.39016
## CabinA20 0.721148 0.401858 1.795 0.07350 .
## CabinA24 -0.421052 0.398828 -1.056 0.29174
## CabinA26 0.758692 0.402480 1.885 0.06016 .
## CabinA31 0.682200 0.401096 1.701 0.08976 .
## CabinA36 -0.226756 0.402710 -0.563 0.57370
## CabinA5 -0.154613 0.405929 -0.381 0.70349
## CabinA7 -0.227095 0.402789 -0.564 0.57321
## CabinB101 -0.769891 0.727208 -1.059 0.29039
## CabinB18 0.129264 0.289289 0.447 0.65524
## CabinB19 -0.202384 0.400515 -0.505 0.61362
## CabinB20 0.375755 0.286319 1.312 0.19016
## CabinB3 -0.246541 0.451740 -0.546 0.58554
## CabinB30 -0.266510 0.405349 -0.657 0.51126
## CabinB35 0.059472 0.290305 0.205 0.83779
## CabinB37 -0.212923 0.403217 -0.528 0.59776
## CabinB38 -0.271628 0.399258 -0.680 0.49669
## CabinB49 0.258762 0.292265 0.885 0.37650
## CabinB5 -0.365722 0.357488 -1.023 0.30692
## CabinB50 0.638776 0.401063 1.593 0.11203
## CabinB57 B59 B63 B66 -0.393050 0.475064 -0.827 0.40853
## CabinB58 B60 -0.682405 0.385070 -1.772 0.07714 .
## CabinB69 0.217745 0.403290 0.540 0.58956
## CabinB77 0.047205 0.402763 0.117 0.90676
## CabinB86 -0.550207 0.402875 -1.366 0.17281
## CabinB94 -0.221143 0.402707 -0.549 0.58322
## CabinB96 B98 0.052161 0.408572 0.128 0.89848
## CabinC103 0.381729 0.402008 0.950 0.34292
## CabinC104 0.755976 0.399473 1.892 0.05917 .
## CabinC106 0.621268 0.399434 1.555 0.12066
## CabinC110 -0.335697 0.398767 -0.842 0.40039
## CabinC118 -0.328171 0.402431 -0.815 0.41530
## CabinC123 -0.045353 0.285390 -0.159 0.87382
## CabinC124 -0.320818 0.286196 -1.121 0.26298
## CabinC125 0.008876 0.423370 0.021 0.98328
## CabinC126 0.203194 0.399869 0.508 0.61163
## CabinC2 -0.414778 0.399537 -1.038 0.29984
## CabinC22 C26 -1.232018 0.422480 -2.916 0.00375 **
## CabinC23 C25 C27 -0.852661 0.391318 -2.179 0.02993 *
## CabinC30 -0.227186 0.399803 -0.568 0.57019
## CabinC32 -0.069423 0.416137 -0.167 0.86759
## CabinC45 -0.330061 0.462909 -0.713 0.47626
## CabinC46 -0.411730 0.400345 -1.028 0.30438
## CabinC47 0.618692 0.401282 1.542 0.12393
## CabinC50 0.200705 0.405416 0.495 0.62083
## CabinC52 0.606475 0.399184 1.519 0.12949
## CabinC65 -0.319219 0.297348 -1.074 0.28368
## CabinC68 0.088406 0.406909 0.217 0.82812
## CabinC7 -0.173788 0.427538 -0.406 0.68461
## CabinC70 0.322011 0.412443 0.781 0.43543
## CabinC82 -0.919538 0.454110 -2.025 0.04355 *
## CabinC85 0.197329 0.402238 0.491 0.62400
## CabinC86 -0.419674 0.407552 -1.030 0.30376
## CabinC90 0.118039 0.401363 0.294 0.76884
## CabinC92 0.088477 0.403708 0.219 0.82664
## CabinC95 -0.944637 0.461385 -2.047 0.04128 *
## CabinC99 -0.070093 0.415651 -0.169 0.86617
## CabinD 0.161218 0.232522 0.693 0.48850
## CabinD10 D12 0.493670 0.402284 1.227 0.22049
## CabinD19 0.699744 0.398801 1.755 0.08010 .
## CabinD21 0.203600 0.399871 0.509 0.61092
## CabinD26 -0.368599 0.402249 -0.916 0.36005
## CabinD33 0.449958 0.291007 1.546 0.12286
## CabinD36 -0.011097 0.408309 -0.027 0.97833
## CabinD46 -0.282505 0.398896 -0.708 0.47923
## CabinD47 0.168861 0.404484 0.417 0.67656
## CabinD48 -0.454938 0.414034 -1.099 0.27253
## CabinD49 0.473942 0.402338 1.178 0.23952
## CabinD6 -0.371640 0.399374 -0.931 0.35265
## CabinD7 0.322845 0.403936 0.799 0.42463
## CabinD9 0.021960 0.402328 0.055 0.95650
## CabinE10 0.879381 0.394655 2.228 0.02643 *
## CabinE101 0.338141 0.233183 1.450 0.14782
## CabinE12 0.745211 0.399419 1.866 0.06282 .
## CabinE121 0.722978 0.397207 1.820 0.06950 .
## CabinE24 0.692665 0.286285 2.419 0.01600 *
## CabinE25 0.678337 0.399219 1.699 0.09008 .
## CabinE33 0.131795 0.399866 0.330 0.74188
## CabinE38 -0.159371 0.401692 -0.397 0.69177
## CabinE40 -0.038005 0.415989 -0.091 0.92725
## CabinE44 -0.005927 0.285772 -0.021 0.98346
## CabinE46 -0.296000 0.399365 -0.741 0.45903
## CabinE49 0.170432 0.402662 0.423 0.67234
## CabinE50 0.590841 0.401088 1.473 0.14153
## CabinE58 -0.257555 0.399434 -0.645 0.51943
## CabinE67 -0.067958 0.287699 -0.236 0.81339
## CabinE77 -0.491778 0.399680 -1.230 0.21927
## CabinE8 0.228004 0.399675 0.570 0.56868
## CabinF E69 0.457812 0.399298 1.147 0.25227
## CabinF G73 -0.158725 0.394610 -0.402 0.68773
## CabinF2 0.643842 0.286199 2.250 0.02503 *
## CabinF33 0.347364 0.283088 1.227 0.22054
## CabinF38 -0.170883 0.398601 -0.429 0.66837
## CabinF4 0.460721 0.285071 1.616 0.10686
## CabinG6 0.008726 0.233420 0.037 0.97020
## CabinT -0.298107 0.398733 -0.748 0.45513
## EmbarkedQ 0.023757 0.084989 0.280 0.77998
## EmbarkedS -0.004943 0.059455 -0.083 0.93379
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3934 on 393 degrees of freedom
## Multiple R-squared: 0.4826, Adjusted R-squared: 0.3431
## F-statistic: 3.458 on 106 and 393 DF, p-value: < 2.2e-16
# drop other variables which are not statistically significant enough
ols_updated <- lm(Survived ~ Pclass + Sex + Age + SibSp, data = train_subset)
summary(ols_updated)
##
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + SibSp, data = train_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0814 -0.2296 -0.1248 0.2526 0.9761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.314190 0.083579 15.724 < 2e-16 ***
## Pclass -0.175536 0.023053 -7.615 1.35e-13 ***
## Sexmale -0.460820 0.038836 -11.866 < 2e-16 ***
## Age -0.006731 0.001499 -4.489 8.90e-06 ***
## SibSp -0.043828 0.016199 -2.706 0.00706 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3981 on 495 degrees of freedom
## Multiple R-squared: 0.3325, Adjusted R-squared: 0.3271
## F-statistic: 61.65 on 4 and 495 DF, p-value: < 2.2e-16
predicted_values <- predict(ols_updated, train_subset)
train_subset$predicted_survived <- ifelse(predicted_values > 0.5, 1, 0)
table(Predicted = train_subset$predicted_survived)
## Predicted
## 0 1
## 340 160
My model would predict that 340/500 individuals would not survive, and 160/500 would not survive.
predicted_values <- predict(ols_updated, train)
train$predicted_survived <- ifelse(predicted_values > 0.5, 1, 0)
conf_matrix1 <- table(Predicted = train$predicted_survived,
Actual = train$Survived)
conf_matrix1
## Actual
## Predicted 0 1
## 0 477 107
## 1 72 235
accuracy1 <- sum(diag(conf_matrix1)) / sum(conf_matrix1)
accuracy1
## [1] 0.7991021
My OLS model has a prediction accuracy of close of 80%.
Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model - interpretation of your coefficients, how well your linear regression can be used predict correctly, et cetra.
logit <- glm(Survived ~ Pclass + Sex + Age + Pclass*Sex + SibSp,
data = train, family = "binomial")
summary(logit)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Pclass * Sex +
## SibSp, family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.009515 0.935776 8.559 < 2e-16 ***
## Pclass -2.229760 0.305638 -7.295 2.98e-13 ***
## Sexmale -6.144993 0.886275 -6.934 4.11e-12 ***
## Age -0.042227 0.008241 -5.124 2.99e-07 ***
## SibSp -0.340267 0.104578 -3.254 0.00114 **
## Pclass:Sexmale 1.364235 0.324890 4.199 2.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 767.9 on 885 degrees of freedom
## AIC: 779.9
##
## Number of Fisher Scoring iterations: 6
# predict probabilities
logit_pred <- predict(logit, train, type = "response")
train$logit_predicted <- ifelse(logit_pred > 0.5, 1, 0)
logit_conf_matrix <- table(Predicted = train$logit_predicted,
Actual = train$Survived)
logit_conf_matrix
## Actual
## Predicted 0 1
## 0 503 119
## 1 46 223
logit_accuracy <- sum(diag(logit_conf_matrix)) / sum(logit_conf_matrix)
logit_accuracy
## [1] 0.8148148
By adding an interaction term and adding back the SibSp, I am able to bring up this prediction model’s accuracy rate to around 81.48%.
The negative coefficient for
Pclass shows that the higher class numbers (those with low
socioeconomic background) is associated with low survival rate.
The negative coefficient for Sexmale shows how males are less
likely to survive.
The negative coefficient for Age
shows how the older the passenger is the less it will likely to survive.
The negative coefficient for SibSp shows those with more
siblings also reduce survival rate.
The positive coefficient for
the interaction term between class and sex (male) suggests how the first
class followed the rule more than those in lower classes by allowing
women to escape/survive.
Because this is a binary logit model, so unlike the OLS assumptions, the logit assumptions would include Linearity of the Log-Odds, Independence of Observations, Absence of Multicollinearity, and Large Samples. It seems like all the assumptions are met for the logit model expect for the multicollinearity part, as we are noticing Sex variable have a large VIF value of 21 using high variance inflation factor (VIF) testing. However, the multicollinearirty assumption also seem to be met when looking at the small standard errors in comparison to the coefficients for the variables.
In general, though, logit model would be a slight better prediction model than OLS with a slight higher accuracy rate. This was a tragic event that perhaps everyone was trying to survive with their best they can, many other factors (such as human emotions/relationships) could also affect the survival rate of particular group of people which a model cannot truly predict.
# check linearity
logit1 <- predict(logit, type = "link")
plot(train$Age, logit1, xlab = "SibSp", ylab = "Logit")
abline(lm(logit1 ~ train$Age), col = "red")
plot(train$Age, logit1, xlab = "Age", ylab = "Logit")
abline(lm(logit1 ~ train$Age), col = "red")
plot(train$Age, logit1, xlab = "Pclass", ylab = "Logit")
abline(lm(logit1 ~ train$Age), col = "red")
# check for multicollinearity
car::vif(logit)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## Pclass Sex Age SibSp Pclass:Sex
## 7.637949 21.834624 1.270969 1.128889 19.890722
The following modelling approaches may be helpful (read, and focus on relevant sections) -
https://www.kaggle.com/code/jeremyd/titanic-logistic-regression-in-r
https://www.kaggle.com/code/hillabehar/titanic-analysis-with-r/report