1 Question 1: Distribution of an individual score vs Sample Mean (CLT)

The distribution of the average IQ score is known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Benet).

1.1 (4 points)

A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.

# p(IQ >= 110)
p_IQ_110_or_higher <- 1 - pnorm(110, mean = 100, sd = 16)
p_IQ_110_or_higher
## [1] 0.2659855

The probability of this person will have an IQ of 110 or higher is 26.6%.

1.2 (4 points)

What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with \(\mu=100\) and \(\sigma=16\)? Be sure to write the probability statement and show your R code.

se <- 16 / sqrt(12)

p_jury <- 1 - pnorm(110, mean = 100, sd = se)
p_jury
## [1] 0.01519141

The probability that the mean IQ of the 12-person jury would be 110 or above would be 1.52%

2 Question 2: Hypothesis testing - Proportion

Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.

2.1 (6 points)

A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it and uses the point estimate from this survey as evidence. Conduct a hypothesis test with \(\alpha=.05\) to determine if these data provide strong evidence that more than 50% adult Americans do not go to college for financial reasons. What is the critical value, test statistic, and p-value for proportion of Americans who do not go college for financial reasons. Do you reject or fail to reject the null ?

H0: Mean of 50% adult American do not go to college for financial reasons.
HA: Mean of more than 50% adult American do not go to college for financial reasons.

n <- 331
p_hat <- 0.48
p_0 <- 0.5
se <- sqrt((p_0 * (1 - p_0)) / n)
alpha <- 0.05

# critical value 
critical <- qnorm(0.05)
# test statistic
t_stat <- (p_hat - p_0) / se
# p value
p_value <- pnorm(t_stat , lower.tail = TRUE)

cat("Critical Value:", critical, "\n")
## Critical Value: -1.644854
cat("Test Statistic:", t_stat, "\n")
## Test Statistic: -0.7277362
cat("P-Value:", p_value, "\n")
## P-Value: 0.2333875
# function for hypothesis testing
myp=function(p, alpha){
  if(p<alpha){print('REJECT H0')}else{print('FAIL 2 REJECT')}
}

myp(p_value, alpha)
## [1] "FAIL 2 REJECT"

We fail to reject our null hypothesis of mean of 50% adult American do not go to college for financial reasons at the significance level of 0.05.

2.2 (2 points)

Would you expect a confidence interval (\(\alpha=.05\), double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.

ci <- p_hat + c(-1, 1) * qnorm(0.975) * se
ci
## [1] 0.4261353 0.5338647

Yes, we see from this output that 0.5 is included within this output of confidence interval (0.4261, 0.5339)

3 Question 3: Hypothesis testing - Difference of two means

3.1 (6 points)

You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.

Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different? Assume \(\alpha = 5\%\).

H0: The average number of items recalled by the patients in treatment and control group is the same.
HA: The average number of items recalled by the patients in treatment and control group is different.

mu_1 <- 4.9
sigma_1 <- 1.8
mu_2 <- 6.1
sigma_2 <- 1.8
n_1 <- 22
n_2 <- 22
alpha <- .05

# calculate variance
var_1 = sigma_1^2 
var_2 = sigma_2^2

# standard error 
se <- sqrt((var_1 / n) + (var_2 / n))

t_statistic <- (mu_1-mu_2) / se  
p_value <- 2 * pt(t_statistic, df = min(n_1, n_2) - 1, lower.tail = TRUE)
myp(p_value, alpha)
## [1] "REJECT H0"

Because we are reject our null hypothesis, the data does provide strong evidence that average number of food items recalled by the patient between treatment and control groups be different.

4 Question 4: Working backwards

4.1 (6 points)

Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.

lower <- 65
upper <- 77
n <- 25

# sample mean
sample_mean <- (lower + upper) / 2
cat("Sample mean is", sample_mean)
## Sample mean is 71
# margin of error
me <- (upper - lower) / 2
cat("Margin of error is", me)
## Margin of error is 6
# standard error 
se <- me/ qt(0.95, df = n - 1)
cat("Standard error is", se)
## Standard error is 3.506963
# standard deviation
sd <- se * sqrt(n)
cat("Standard deviation is", sd)
## Standard deviation is 17.53481

5 Question 5: Assumptions

5.1 (4 points)

The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -

mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
##       control treatment
## alive       4        24
## dead       30        45

Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?

The sample size in this experiment is just too small for us to conduct meaningful analysis, particularly using the normal approximation. When this small sample size, we are very likely to get a wider confidence interval range with large error margin, which the result doesn’t help to provide the most precise estimation to represent the entire population.

6 Question 6: Hypothesis testing in a regression

6.1 (6 points)

Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.

6.2 (10 points)

hypothesis_test <- function(lm_model, coefficient, null_value = 0, alternative = "two.sided",
                            alpha = 0.05) {
  
  # extract the coefficients and standard error
  coef_summary <- summary(lm_model)$coefficients
  estimate <- coef_summary[coefficient, 1]
  se <- coef_summary[coefficient, 2]
  
  # calculate t tests
  t_stat <- (estimate - null_value) / se
  
  # degrees of freedom
  df <- lm_model$df.residual
  
  # p value
  p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
  
  # check hypothesis
  conclusion <- if (p_value < alpha) {
    print('REJECT H0') }
  else{
    print('FAIL 2 REJECT')
  }
  
  # return necessary results
  return(list(t_statistic = t_stat, p_value = p_value, conclusion = conclusion))
}

Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.

data("swiss")

mylm <- lm(Fertility ~ Education , data = swiss)

coef_summary <- summary(mylm)$coefficients
estimate <- coef_summary["Education", 1]
se <- coef_summary["Education", 2]
  
t_stat <- estimate / se
t_stat
## [1] -5.953619
df <- mylm$df.residual
  
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
p_value
## [1] 3.658617e-07
myp(p_value, alpha = 0.05)
## [1] "REJECT H0"
hypothesis_test(mylm, coefficient = "Education")
## [1] "REJECT H0"
## $t_statistic
## [1] -5.953619
## 
## $p_value
## [1] 3.658617e-07
## 
## $conclusion
## [1] "REJECT H0"

Outputs are the same.

7 Question 7: Determinants of smoking

Import the smoking.csv file into R. The variables are “smoker” - whether or not one smokes; “snmkban” - whether or not one faces a smoking ban at one’s workplace (this is fairly old data); “age” - in years; “hsdrop” - high school dropout; “hsgrad” - graduated from high school but has no further education; “colsome” - has some college education; “colgrad” - graduated from college, as well as some self-explanatory demographic variables.

7.1 (4 points)

Estimate a logit regression model of whether or not one smokes as a function of the other variables.

smoking <- read.csv("smoking.csv")
logit <- glm(smoker ~ ., data = smoking, family = binomial)
summary(logit)
## 
## Call:
## glm(formula = smoker ~ ., family = binomial, data = smoking)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.696765   0.138696 -12.234  < 2e-16 ***
## smkban      -0.250735   0.049164  -5.100 3.40e-07 ***
## age         -0.007452   0.001987  -3.751 0.000176 ***
## hsdrop       1.931075   0.131261  14.712  < 2e-16 ***
## hsgrad       1.523305   0.114308  13.326  < 2e-16 ***
## colsome      1.180080   0.116518  10.128  < 2e-16 ***
## colgrad      0.424753   0.125801   3.376 0.000734 ***
## black       -0.149472   0.089994  -1.661 0.096732 .  
## hispanic    -0.584845   0.083085  -7.039 1.93e-12 ***
## female      -0.188720   0.049105  -3.843 0.000121 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11074  on 9999  degrees of freedom
## Residual deviance: 10502  on 9990  degrees of freedom
## AIC: 10522
## 
## Number of Fisher Scoring iterations: 4

7.2 (4 points)

What does the coefficient on “snmkban” tell you? How about the coefficient on “age”? How about the coefficient on “colgrad”?

snmkban: The odds of smokers would decrease by 22.18% ((1 - e^(-0.250735)) * 100) if there is smoke bans at one’s workplace in comparison to the workplace without smoke bans. This means as we are holding all other variables constant, the ones who have a smoke bans at their workplace are less likely to smoke than those who does NOT have a smoke bans at workplace.
age: For every unit increase in age, the odds of the smokers would decrease by 1% ((1 - (e^(-0.117452)) * 100). This means as we hold all other variables constant, less people are likely to smoke as they ages.
colgrad: The odds of smokers would increase by 52.92% ((e^(0.424753) - 1) * 100) of those individuals who graduated college than those who didn’t graduate, as we hold other variables constant.

8 Question 8: Difference-in-differences

Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.

8.1 (4 points)

Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?

H0: There is no difference in real earnings between 1975 and 1978 of those who received work experience and counseling in a sheltered environments.
HA: There is a difference real earnings between 1975 and 1978 of those who received work experience and counseling in a sheltered environments.

nsw <- read.csv("nsw.csv")
univariate_reg <- lm(re78 - re75 ~ treat, data = nsw)
summary(univariate_reg )
## 
## Call:
## lm(formula = re78 - re75 ~ treat, data = nsw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37995  -3198  -1442   3677  56114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2063.4      359.3   5.743 1.37e-08 ***
## treat          846.9      560.1   1.512    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7406 on 720 degrees of freedom
## Multiple R-squared:  0.003165,   Adjusted R-squared:  0.00178 
## F-statistic: 2.286 on 1 and 720 DF,  p-value: 0.131

With a p-value (0.131) larger than 0.05, we fail to reject our null hypothesis that there is no difference of effect in terms of receiving work experience and counseling in a sheltered environment on real earnings between 1975 and 1978.

8.2 (4 points)

Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?

univariate_reg1 <- lm(re78 - re75 ~ treat + education + age + black + hispanic + married + nodegree, data = nsw)
summary(univariate_reg1)
## 
## Call:
## lm(formula = re78 - re75 ~ treat + education + age + black + 
##     hispanic + married + nodegree, data = nsw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38153  -3383  -1076   3503  55863 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  1986.822   3123.319   0.636  0.52490   
## treat         869.294    561.559   1.548  0.12207   
## education      26.633    215.306   0.124  0.90159   
## age             4.194     43.424   0.097  0.92308   
## black         112.085    956.157   0.117  0.90672   
## hispanic     1229.152   1253.250   0.981  0.32704   
## married     -2210.949    767.530  -2.881  0.00409 **
## nodegree     -217.758    891.527  -0.244  0.80710   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7384 on 714 degrees of freedom
## Multiple R-squared:  0.01729,    Adjusted R-squared:  0.007654 
## F-statistic: 1.794 on 7 and 714 DF,  p-value: 0.08539

After adding the control variables, only this married variable is statistically significant enough to influence the model while all other variables are not statistically significant. Hence, this is showing how even when other dependent variables are added, this model is still not the best in terms of demonstrating the relationship of treatment effect throughout time.

8.3 (6 points)

Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?

# create separate data frames for 1975 (denotes as 0) and 1978 (denotes as 1)
nsw_75 <- data.frame(id = 1:nrow(nsw),
                     treat = nsw$treat,
                     time = 0,  
                     earnings = nsw$re75)
nsw_78 <- data.frame(id = 1:nrow(nsw),
                     treat = nsw$treat,
                     time = 1,  
                     earnings = nsw$re78)
# combine data frames
nsw_combined<- rbind(nsw_75, nsw_78)

# create an interaction term between treat and time
nsw_combined$did <- nsw_combined$treat * nsw_combined$time

did_model <- lm(earnings ~ treat + time + did, data = nsw_combined)
summary(did_model)
## 
## Call:
## lm(formula = earnings ~ treat + time + did, data = nsw_combined)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5976  -3066  -1861   2255  54332 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3026.68     275.82  10.974  < 2e-16 ***
## treat          39.42     430.04   0.092    0.927    
## time         2063.37     390.06   5.290 1.41e-07 ***
## did           846.89     608.17   1.393    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5686 on 1440 degrees of freedom
## Multiple R-squared:  0.04585,    Adjusted R-squared:  0.04386 
## F-statistic: 23.07 on 3 and 1440 DF,  p-value: 1.385e-14
did_model1 <- lm(earnings ~ did, data = nsw_combined)
summary(did_model1)
## 
## Call:
## lm(formula = earnings ~ did, data = nsw_combined)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5976  -3801  -2195   2499  54332 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3801.4      169.8  22.390  < 2e-16 ***
## did           2174.9      374.4   5.809  7.7e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5750 on 1442 degrees of freedom
## Multiple R-squared:  0.02287,    Adjusted R-squared:  0.02219 
## F-statistic: 33.75 on 1 and 1442 DF,  p-value: 7.7e-09

The very first model using DID approach only has variable time being statistically significant, which perhaps is not providing enough helpful information. However when we only keep the interaction term, or the DID estimator, its large positive coefficient (2174.9) in the model tells us how treatment does have a positive effect on the real earnings over time from 1975 to 1978 with a statistic significance level at 1%.

9 Question 9: Data Question

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

Import the titanic train.csv file into R. You will have to clean the data to run some of commands/set up the workflow.

  1. (4 points) Run some preliminary correlations of Survived with some other variables.
train <- read.csv("train.csv")

head(train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
# fill in NAs 
train$Age[is.na(train$Age)] = median(train$Age, na.rm = TRUE)

correlations <- cor(train[, c("Pclass", "Survived", "Age", "SibSp", "Fare")])
correlations
##               Pclass    Survived         Age       SibSp        Fare
## Pclass    1.00000000 -0.33848104 -0.33989833  0.08308136 -0.54949962
## Survived -0.33848104  1.00000000 -0.06491042 -0.03532250  0.25730652
## Age      -0.33989833 -0.06491042  1.00000000 -0.23329633  0.09668842
## SibSp     0.08308136 -0.03532250 -0.23329633  1.00000000  0.15965104
## Fare     -0.54949962  0.25730652  0.09668842  0.15965104  1.00000000

Survived and Pclass have a negative correlation.
Survived and Age have a negative correlation.
Survived and SibSp have a negative correlation.
Survived and Fare have a positive correlation.

  1. (3 points) Conduct descriptive statistics of the dataset. Anything interesting you find ?
library("psych")
describe(train)
##             vars   n   mean     sd median trimmed    mad  min    max  range
## PassengerId    1 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Survived       2 891   0.38   0.49   0.00    0.35   0.00 0.00   1.00   1.00
## Pclass         3 891   2.31   0.84   3.00    2.39   0.00 1.00   3.00   2.00
## Name*          4 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Sex*           5 891   1.65   0.48   2.00    1.68   0.00 1.00   2.00   1.00
## Age            6 891  29.36  13.02  28.00   28.83   8.90 0.42  80.00  79.58
## SibSp          7 891   0.52   1.10   0.00    0.27   0.00 0.00   8.00   8.00
## Parch          8 891   0.38   0.81   0.00    0.18   0.00 0.00   6.00   6.00
## Ticket*        9 891 339.52 200.83 338.00  339.65 268.35 1.00 681.00 680.00
## Fare          10 891  32.20  49.69  14.45   21.38  10.24 0.00 512.33 512.33
## Cabin*        11 891  18.63  38.14   1.00    8.29   0.00 1.00 148.00 147.00
## Embarked*     12 891   3.53   0.80   4.00    3.66   0.00 1.00   4.00   3.00
##              skew kurtosis   se
## PassengerId  0.00    -1.20 8.62
## Survived     0.48    -1.77 0.02
## Pclass      -0.63    -1.28 0.03
## Name*        0.00    -1.20 8.62
## Sex*        -0.62    -1.62 0.02
## Age          0.51     0.97 0.44
## SibSp        3.68    17.73 0.04
## Parch        2.74     9.69 0.03
## Ticket*      0.00    -1.28 6.73
## Fare         4.77    33.12 1.66
## Cabin*       2.09     3.07 1.28
## Embarked*   -1.27    -0.16 0.03
summary(train)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:22.00   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.36   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:35.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 28 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

It is surprising to see summary results such as max/min/kurtosis (and more) are still displayed for variables with character types according to description(), but I do notice the asterisk next to these special variables.
For binary variable, such as Survived, it is interesting to see how R is treating it as integer, which explains why we also see Min/Max/Quantile values displayed for it using summary() function.

  1. (3 points) Use set.seed(100) command, and create a subset of train dataset that has only 500 observations.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(100)
train_subset <- sample_n(train, 500)

str(train_subset)
## 'data.frame':    500 obs. of  12 variables:
##  $ PassengerId: int  714 503 358 624 718 470 516 823 838 98 ...
##  $ Survived   : int  0 0 0 0 1 1 0 0 0 1 ...
##  $ Pclass     : int  3 3 2 3 2 3 1 1 3 1 ...
##  $ Name       : chr  "Larsson, Mr. August Viktor" "O'Sullivan, Miss. Bridget Mary" "Funk, Miss. Annie Clemmer" "Hansen, Mr. Henry Damsgaard" ...
##  $ Sex        : chr  "male" "female" "female" "male" ...
##  $ Age        : num  29 28 38 21 27 0.75 47 38 28 23 ...
##  $ SibSp      : int  0 0 0 0 0 2 0 0 0 0 ...
##  $ Parch      : int  0 0 0 0 0 1 0 0 0 1 ...
##  $ Ticket     : chr  "7545" "330909" "237671" "350029" ...
##  $ Fare       : num  9.48 7.63 13 7.85 10.5 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "S" "Q" "S" "S" ...
  1. (4 points) Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample.
ols <- lm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Cabin + Embarked, data = train_subset)
summary(ols)
## 
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare + Cabin + Embarked, data = train_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72839 -0.18183 -0.07858  0.05158  0.95235 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.954832   0.153667   6.214 1.32e-09 ***
## Pclass               -0.084622   0.044371  -1.907  0.05723 .  
## Sexmale              -0.419610   0.046253  -9.072  < 2e-16 ***
## Age                  -0.005613   0.001790  -3.135  0.00185 ** 
## SibSp                -0.065145   0.020440  -3.187  0.00155 ** 
## Parch                -0.002622   0.028690  -0.091  0.92723    
## Fare                  0.002959   0.001308   2.262  0.02426 *  
## CabinA10             -0.367248   0.400503  -0.917  0.35972    
## CabinA14             -0.442341   0.399118  -1.108  0.26841    
## CabinA16              0.347194   0.403580   0.860  0.39016    
## CabinA20              0.721148   0.401858   1.795  0.07350 .  
## CabinA24             -0.421052   0.398828  -1.056  0.29174    
## CabinA26              0.758692   0.402480   1.885  0.06016 .  
## CabinA31              0.682200   0.401096   1.701  0.08976 .  
## CabinA36             -0.226756   0.402710  -0.563  0.57370    
## CabinA5              -0.154613   0.405929  -0.381  0.70349    
## CabinA7              -0.227095   0.402789  -0.564  0.57321    
## CabinB101            -0.769891   0.727208  -1.059  0.29039    
## CabinB18              0.129264   0.289289   0.447  0.65524    
## CabinB19             -0.202384   0.400515  -0.505  0.61362    
## CabinB20              0.375755   0.286319   1.312  0.19016    
## CabinB3              -0.246541   0.451740  -0.546  0.58554    
## CabinB30             -0.266510   0.405349  -0.657  0.51126    
## CabinB35              0.059472   0.290305   0.205  0.83779    
## CabinB37             -0.212923   0.403217  -0.528  0.59776    
## CabinB38             -0.271628   0.399258  -0.680  0.49669    
## CabinB49              0.258762   0.292265   0.885  0.37650    
## CabinB5              -0.365722   0.357488  -1.023  0.30692    
## CabinB50              0.638776   0.401063   1.593  0.11203    
## CabinB57 B59 B63 B66 -0.393050   0.475064  -0.827  0.40853    
## CabinB58 B60         -0.682405   0.385070  -1.772  0.07714 .  
## CabinB69              0.217745   0.403290   0.540  0.58956    
## CabinB77              0.047205   0.402763   0.117  0.90676    
## CabinB86             -0.550207   0.402875  -1.366  0.17281    
## CabinB94             -0.221143   0.402707  -0.549  0.58322    
## CabinB96 B98          0.052161   0.408572   0.128  0.89848    
## CabinC103             0.381729   0.402008   0.950  0.34292    
## CabinC104             0.755976   0.399473   1.892  0.05917 .  
## CabinC106             0.621268   0.399434   1.555  0.12066    
## CabinC110            -0.335697   0.398767  -0.842  0.40039    
## CabinC118            -0.328171   0.402431  -0.815  0.41530    
## CabinC123            -0.045353   0.285390  -0.159  0.87382    
## CabinC124            -0.320818   0.286196  -1.121  0.26298    
## CabinC125             0.008876   0.423370   0.021  0.98328    
## CabinC126             0.203194   0.399869   0.508  0.61163    
## CabinC2              -0.414778   0.399537  -1.038  0.29984    
## CabinC22 C26         -1.232018   0.422480  -2.916  0.00375 ** 
## CabinC23 C25 C27     -0.852661   0.391318  -2.179  0.02993 *  
## CabinC30             -0.227186   0.399803  -0.568  0.57019    
## CabinC32             -0.069423   0.416137  -0.167  0.86759    
## CabinC45             -0.330061   0.462909  -0.713  0.47626    
## CabinC46             -0.411730   0.400345  -1.028  0.30438    
## CabinC47              0.618692   0.401282   1.542  0.12393    
## CabinC50              0.200705   0.405416   0.495  0.62083    
## CabinC52              0.606475   0.399184   1.519  0.12949    
## CabinC65             -0.319219   0.297348  -1.074  0.28368    
## CabinC68              0.088406   0.406909   0.217  0.82812    
## CabinC7              -0.173788   0.427538  -0.406  0.68461    
## CabinC70              0.322011   0.412443   0.781  0.43543    
## CabinC82             -0.919538   0.454110  -2.025  0.04355 *  
## CabinC85              0.197329   0.402238   0.491  0.62400    
## CabinC86             -0.419674   0.407552  -1.030  0.30376    
## CabinC90              0.118039   0.401363   0.294  0.76884    
## CabinC92              0.088477   0.403708   0.219  0.82664    
## CabinC95             -0.944637   0.461385  -2.047  0.04128 *  
## CabinC99             -0.070093   0.415651  -0.169  0.86617    
## CabinD                0.161218   0.232522   0.693  0.48850    
## CabinD10 D12          0.493670   0.402284   1.227  0.22049    
## CabinD19              0.699744   0.398801   1.755  0.08010 .  
## CabinD21              0.203600   0.399871   0.509  0.61092    
## CabinD26             -0.368599   0.402249  -0.916  0.36005    
## CabinD33              0.449958   0.291007   1.546  0.12286    
## CabinD36             -0.011097   0.408309  -0.027  0.97833    
## CabinD46             -0.282505   0.398896  -0.708  0.47923    
## CabinD47              0.168861   0.404484   0.417  0.67656    
## CabinD48             -0.454938   0.414034  -1.099  0.27253    
## CabinD49              0.473942   0.402338   1.178  0.23952    
## CabinD6              -0.371640   0.399374  -0.931  0.35265    
## CabinD7               0.322845   0.403936   0.799  0.42463    
## CabinD9               0.021960   0.402328   0.055  0.95650    
## CabinE10              0.879381   0.394655   2.228  0.02643 *  
## CabinE101             0.338141   0.233183   1.450  0.14782    
## CabinE12              0.745211   0.399419   1.866  0.06282 .  
## CabinE121             0.722978   0.397207   1.820  0.06950 .  
## CabinE24              0.692665   0.286285   2.419  0.01600 *  
## CabinE25              0.678337   0.399219   1.699  0.09008 .  
## CabinE33              0.131795   0.399866   0.330  0.74188    
## CabinE38             -0.159371   0.401692  -0.397  0.69177    
## CabinE40             -0.038005   0.415989  -0.091  0.92725    
## CabinE44             -0.005927   0.285772  -0.021  0.98346    
## CabinE46             -0.296000   0.399365  -0.741  0.45903    
## CabinE49              0.170432   0.402662   0.423  0.67234    
## CabinE50              0.590841   0.401088   1.473  0.14153    
## CabinE58             -0.257555   0.399434  -0.645  0.51943    
## CabinE67             -0.067958   0.287699  -0.236  0.81339    
## CabinE77             -0.491778   0.399680  -1.230  0.21927    
## CabinE8               0.228004   0.399675   0.570  0.56868    
## CabinF E69            0.457812   0.399298   1.147  0.25227    
## CabinF G73           -0.158725   0.394610  -0.402  0.68773    
## CabinF2               0.643842   0.286199   2.250  0.02503 *  
## CabinF33              0.347364   0.283088   1.227  0.22054    
## CabinF38             -0.170883   0.398601  -0.429  0.66837    
## CabinF4               0.460721   0.285071   1.616  0.10686    
## CabinG6               0.008726   0.233420   0.037  0.97020    
## CabinT               -0.298107   0.398733  -0.748  0.45513    
## EmbarkedQ             0.023757   0.084989   0.280  0.77998    
## EmbarkedS            -0.004943   0.059455  -0.083  0.93379    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3934 on 393 degrees of freedom
## Multiple R-squared:  0.4826, Adjusted R-squared:  0.3431 
## F-statistic: 3.458 on 106 and 393 DF,  p-value: < 2.2e-16
# drop other variables which are not statistically significant enough 
ols_updated <- lm(Survived ~ Pclass + Sex + Age + SibSp, data = train_subset)
summary(ols_updated)
## 
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + SibSp, data = train_subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0814 -0.2296 -0.1248  0.2526  0.9761 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.314190   0.083579  15.724  < 2e-16 ***
## Pclass      -0.175536   0.023053  -7.615 1.35e-13 ***
## Sexmale     -0.460820   0.038836 -11.866  < 2e-16 ***
## Age         -0.006731   0.001499  -4.489 8.90e-06 ***
## SibSp       -0.043828   0.016199  -2.706  0.00706 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3981 on 495 degrees of freedom
## Multiple R-squared:  0.3325, Adjusted R-squared:  0.3271 
## F-statistic: 61.65 on 4 and 495 DF,  p-value: < 2.2e-16
  1. (4 points) Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficient from your linear model to forecast/predict/estimate the survival variable given independant variable values /data.
predicted_values <- predict(ols_updated, train_subset)
train_subset$predicted_survived <- ifelse(predicted_values > 0.5, 1, 0)

table(Predicted = train_subset$predicted_survived)
## Predicted
##   0   1 
## 340 160

My model would predict that 340/500 individuals would not survive, and 160/500 would not survive.

  1. (4 points) Create a confusion matrix, and calculate the accuracy for the entire train dataset. In other words, how many individuals who survived or died the titanic crash were correctly classified by your regression?
predicted_values <- predict(ols_updated, train)
train$predicted_survived <- ifelse(predicted_values > 0.5, 1, 0)

conf_matrix1 <- table(Predicted = train$predicted_survived, 
                     Actual = train$Survived)
conf_matrix1
##          Actual
## Predicted   0   1
##         0 477 107
##         1  72 235
accuracy1 <- sum(diag(conf_matrix1)) / sum(conf_matrix1)
accuracy1
## [1] 0.7991021

My OLS model has a prediction accuracy of close of 80%.

  1. (6 points) Pclass, Sex, Age can give you an accuracy of about 80%. Can you get higher accuracy ?

Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model - interpretation of your coefficients, how well your linear regression can be used predict correctly, et cetra.

logit <- glm(Survived ~ Pclass + Sex + Age + Pclass*Sex + SibSp, 
                   data = train, family = "binomial")
summary(logit)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Pclass * Sex + 
##     SibSp, family = "binomial", data = train)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     8.009515   0.935776   8.559  < 2e-16 ***
## Pclass         -2.229760   0.305638  -7.295 2.98e-13 ***
## Sexmale        -6.144993   0.886275  -6.934 4.11e-12 ***
## Age            -0.042227   0.008241  -5.124 2.99e-07 ***
## SibSp          -0.340267   0.104578  -3.254  0.00114 ** 
## Pclass:Sexmale  1.364235   0.324890   4.199 2.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  767.9  on 885  degrees of freedom
## AIC: 779.9
## 
## Number of Fisher Scoring iterations: 6
# predict probabilities
logit_pred <- predict(logit, train, type = "response")
train$logit_predicted <- ifelse(logit_pred > 0.5, 1, 0)


logit_conf_matrix <- table(Predicted = train$logit_predicted, 
                           Actual = train$Survived)
logit_conf_matrix
##          Actual
## Predicted   0   1
##         0 503 119
##         1  46 223
logit_accuracy <- sum(diag(logit_conf_matrix)) / sum(logit_conf_matrix)
logit_accuracy
## [1] 0.8148148

By adding an interaction term and adding back the SibSp, I am able to bring up this prediction model’s accuracy rate to around 81.48%.

The negative coefficient for Pclass shows that the higher class numbers (those with low socioeconomic background) is associated with low survival rate.
The negative coefficient for Sexmale shows how males are less likely to survive.
The negative coefficient for Age shows how the older the passenger is the less it will likely to survive.
The negative coefficient for SibSp shows those with more siblings also reduce survival rate.
The positive coefficient for the interaction term between class and sex (male) suggests how the first class followed the rule more than those in lower classes by allowing women to escape/survive.

Because this is a binary logit model, so unlike the OLS assumptions, the logit assumptions would include Linearity of the Log-Odds, Independence of Observations, Absence of Multicollinearity, and Large Samples. It seems like all the assumptions are met for the logit model expect for the multicollinearity part, as we are noticing Sex variable have a large VIF value of 21 using high variance inflation factor (VIF) testing. However, the multicollinearirty assumption also seem to be met when looking at the small standard errors in comparison to the coefficients for the variables.

In general, though, logit model would be a slight better prediction model than OLS with a slight higher accuracy rate. This was a tragic event that perhaps everyone was trying to survive with their best they can, many other factors (such as human emotions/relationships) could also affect the survival rate of particular group of people which a model cannot truly predict.

# check linearity
logit1 <- predict(logit, type = "link")
plot(train$Age, logit1, xlab = "SibSp", ylab = "Logit")
abline(lm(logit1 ~ train$Age), col = "red")  

plot(train$Age, logit1, xlab = "Age", ylab = "Logit")
abline(lm(logit1 ~ train$Age), col = "red")  

plot(train$Age, logit1, xlab = "Pclass", ylab = "Logit")
abline(lm(logit1 ~ train$Age), col = "red")  

# check for multicollinearity
car::vif(logit)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##     Pclass        Sex        Age      SibSp Pclass:Sex 
##   7.637949  21.834624   1.270969   1.128889  19.890722

The following modelling approaches may be helpful (read, and focus on relevant sections) -

https://www.kaggle.com/code/jeremyd/titanic-logistic-regression-in-r

https://www.kaggle.com/code/hillabehar/titanic-analysis-with-r/report