1 Question 1: Distribution of an individual score vs Sample Mean (CLT)

The distribution of IQ scores is known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Binet).

1.1 (4 points) A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.

mean1 <- 100   # population mean
sd1 <- 16      # population standard deviation
IQ <- 110
p1_1 <- 1 - pnorm(IQ, mean1, sd1)   # P(X >= 110)
print(p1_1)
## [1] 0.2659855
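
Written out as a probability statement (Φ is the standard-normal CDF):

P(X ≥ 110) = 1 − P(X < 110) = 1 − Φ((110 − 100)/16) = 1 − Φ(0.625) ≈ 0.266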

1.2 (4 points) What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with μ=100 and σ=16? Be sure to write the probability statement and show your R code.

mu1 <- 100      # population mean
sd1 <- 16       # population standard deviation
n <- 12         # jury size
mean1 <- 110    # sample mean of interest
SE <- sd1/sqrt(n)        # standard error of the sample mean
z <- (mean1-mu1)/SE      # z-score of the sample mean
p1_2 <- 1 - pnorm(z)     # P(Xbar >= 110)
print(p1_2)
## [1] 0.01519141
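
By the CLT the sample mean satisfies X̄ ~ N(100, 16/√12), so the probability statement is:

P(X̄ ≥ 110) = 1 − Φ((110 − 100)/(16/√12)) = 1 − Φ(2.165) ≈ 0.0152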

2 Question 2: Hypothesis testing - Proportion

Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.

2.1 (6 points) A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford school, and uses the point estimate from this survey as evidence. Conduct a hypothesis test with α=.05 to determine if these data provide strong evidence that more than 50% of adult Americans do not go to college for financial reasons. What are the critical value, test statistic, and p-value for the proportion of Americans who do not go to college for financial reasons? Do you reject or fail to reject the null?

pi0 <- 0.5   # hypothesized proportion under H0 (renamed: pi is a built-in constant in R)
n <- 331
p <- 0.48    # sample proportion
alpha <- 0.05
z <- (p - pi0) / sqrt(pi0 * (1 - pi0) / n)   # z test statistic
p2_1 <- 1 - pnorm(z)                  # upper-tail p-value for Ha: p > 0.5
critical_value <- qnorm(1 - alpha)    # upper-tail critical value
print(z)
## [1] -0.7277362
print(critical_value)
## [1] 1.644854
print(p2_1)
## [1] 0.7666125

H0: p = 0.5 versus Ha: p > 0.5. Since the p-value (0.7666125) is greater than α (0.05) and the test statistic (-0.728) does not exceed the critical value (1.645), we fail to reject the null hypothesis: the data do not provide strong evidence that more than 50% of adult Americans forgo college for financial reasons (the point estimate, 0.48, is in fact below 0.5).
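
As a cross-check, base R's prop.test runs the equivalent chi-squared test. The survey's raw count of successes is not reported, so the x below assumes 48% of 331 rounds to 159:

# hypothetical count: round(0.48 * 331) = 159; correct = FALSE matches the z-test above
prop.test(x = 159, n = 331, p = 0.5, alternative = "greater", correct = FALSE)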

2.2 (2 points) Would you expect a confidence interval (α=.05, double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.

n <- 331
p <- 0.48 
alpha <- 0.05
SE <- sqrt((p * (1-p))/n)
lower <- p - qnorm(0.975) * SE
upper <- p + qnorm(0.975) * SE
cat("the 95% confidence interval for the proportion of American adults between", lower ,"to", upper)
## the 95% confidence interval for the proportion of American adults between 0.4261784 to 0.5338216

Since the hypothesized value (0.5) lies within the interval (0.4262, 0.5338), I would expect a 95% confidence interval for the proportion of American adults who decide not to go to college for financial reasons to include the hypothesized value; equivalently, a two-sided test at α=.05 would fail to reject p = 0.5.

3 Question 3: Hypothesis testing - Difference of two means

3.1 (6 points) You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.

Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different? Assume α=5%

myp = function(p, alpha){
  if(p < alpha){print('REJECT Ho')} else {print('FAIL TO REJECT Ho')}
}
xbar1 <- 4.9    # treatment group: sample mean of items recalled
s1 <- 1.8       # treatment group: sample standard deviation
n1 <- 22
xbar2 <- 6.1    # control group: sample mean
s2 <- 1.8       # control group: sample standard deviation
n2 <- 22
alpha <- 0.05
t3_1 <- (xbar1 - xbar2) / sqrt((s1^2 / n1) + (s2^2 / n2))
# Welch-Satterthwaite degrees of freedom
df1 <- ((s1^2 / n1) + (s2^2 / n2))^2
df2 <- ((s1^2 / n1)^2 / (n1 - 1)) + ((s2^2 / n2)^2 / (n2 - 1))
df <- df1 / df2
p3_1 <- 2 * pt(-abs(t3_1), df)   # two-sided p-value
print(t3_1)
## [1] -2.211083
print(p3_1)
## [1] 0.03252806
myp(p3_1,alpha)
## [1] "REJECT Ho"

H0: μ_treatment = μ_control versus Ha: μ_treatment ≠ μ_control. Since the p-value (0.0325) is less than α (0.05), we reject the null hypothesis: the data provide strong evidence that the average number of food items recalled differs between the treatment and control groups.
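
As a cross-check from the same summary statistics, the tsum.test function in the BSDA package (assuming it is installed) runs a Welch t-test directly from means and standard deviations:

library(BSDA)
tsum.test(mean.x = 4.9, s.x = 1.8, n.x = 22,
          mean.y = 6.1, s.y = 1.8, n.y = 22,
          alternative = "two.sided", var.equal = FALSE)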

4 Question 4: Working backwards

4.1 (6 points) Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.

n<- 25
sample_mean<-(65+77)/2
print(sample_mean)
## [1] 71
margin_of_error <- (77-65)/2
print(margin_of_error)
## [1] 6
t_star <- qt(0.95, df = n - 1)   # two-sided 90% CI uses the 0.95 t quantile
standard_error <- margin_of_error / t_star
print(standard_error)
## [1] 3.506963
population_standard_deviation<-standard_error * sqrt(n)
print(population_standard_deviation)
## [1] 17.53481
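
For reference, the algebra being inverted is the confidence-interval formula x̄ ± t*·s/√n:

x̄ = (65 + 77)/2 = 71
ME = (77 − 65)/2 = 6
SE = s/√n = ME/t* = 6/qt(0.95, 24) = 6/1.7109 ≈ 3.507
s = SE·√n = 3.507 × 5 ≈ 17.53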

5 Question 5: Assumptions

5.1 (4 points) The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -

mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
##       control treatment
## alive       4        24
## dead       30        45

Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?

We cannot use the normal approximation because the success-failure condition fails: the normal approximation to the binomial requires at least 10 successes and 10 failures in each group, but the control group has only 4 patients alive, well below 10 (the treatment group's 24 survivors and both groups' death counts are fine, but one failing cell is enough). With so few successes, the sampling distribution of the control group's survival proportion is skewed rather than approximately normal. If we constructed a normal-approximation confidence interval anyway, its actual coverage would not match the stated confidence level, and the interval could be badly centered or even extend below 0, giving an inaccurate result.
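
A quick numerical check of the success-failure condition, using the counts from the table above:

# the normal approximation needs at least 10 successes and 10 failures per group
alive <- c(control = 4, treatment = 24)
dead  <- c(control = 30, treatment = 45)
rbind(alive, dead) >= 10   # control has only 4 alive, so the condition fails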

6 Question 6: Hypothesis testing in a regression

6.1 (6 points) Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.

myfunction <- function(mymodel, x, beta0 = 0, alpha = 0.05){
  # mymodel: an object of class "lm"
  # x:       name of the coefficient to test, e.g. "carb"
  # beta0:   value of the coefficient under the Null Hypothesis
  # alpha:   significance level
  coeff_summary <- summary(mymodel)$coefficients
  coeff_estimate <- coef(mymodel)[x]
  coeff_SE <- coeff_summary[x, "Std. Error"]
  df <- mymodel$df.residual
  t <- (coeff_estimate - beta0) / coeff_SE   # t statistic
  p_value <- 2 * pt(-abs(t), df)             # two-sided p-value
  if (p_value < alpha) {
    myp <- "Reject the Null Hypothesis"
  } else {
    myp <- "Fail to Reject the Null Hypothesis"
  }
  return(list(
    t = t,
    p_value = p_value,
    myp = myp
  ))
}

6.2 (10 points) Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mymodel <- lm(mpg ~ carb, data = mtcars)
summary(mymodel)
## 
## Call:
## lm(formula = mpg ~ carb, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.250 -3.316 -1.433  3.384 10.083 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.8723     1.8368  14.085 9.22e-15 ***
## carb         -2.0557     0.5685  -3.616  0.00108 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.113 on 30 degrees of freedom
## Multiple R-squared:  0.3035, Adjusted R-squared:  0.2803 
## F-statistic: 13.07 on 1 and 30 DF,  p-value: 0.001084
Elements of the test: H0: β_carb = 0 versus Ha: β_carb ≠ 0, at α = 0.05, using the t-distribution with n − 2 = 30 degrees of freedom. By hand:

coeff_summary <- summary(mymodel)$coefficients
coeff_estimate <- coef(mymodel)[2]            # slope on carb
coeff_SE <- coeff_summary[2, "Std. Error"]
df <- mymodel$df.residual                     # 30
t <- (coeff_estimate - 0) / coeff_SE          # tests beta0 = 0
p_value <- 2 * pt(-abs(t), df)                # two-sided p-value
print(t)
##     carb 
## -3.61575
print(p_value)
##        carb 
## 0.001084446
result <- myfunction(mymodel, x="mtcars$cyl", alpha=0.05)
print(result)
## $t
##     carb 
## -3.61575 
## 
## $p_value
##        carb 
## 0.001084446 
## 
## $myp
## [1] "Reject the Null Hypothesis"

7 Question 7: Determinants of smoking

Import the smoking.csv file into R. The variables are “smoker” - whether or not one smokes; “smkban” - whether or not one faces a smoking ban at one’s workplace (this is fairly old data); “age” - in years; “hsdrop” - high school dropout; “hsgrad” - graduated from high school but has no further education; “colsome” - has some college education; “colgrad” - graduated from college, as well as some self-explanatory demographic variables.

mydata1 <- read.csv("C:/Users/dingz/Downloads/smoking.csv")
str(mydata1)
## 'data.frame':    10000 obs. of  10 variables:
##  $ smoker  : int  1 1 0 1 0 0 1 1 0 0 ...
##  $ smkban  : int  1 1 0 0 1 0 1 0 1 0 ...
##  $ age     : int  41 44 19 29 28 40 47 36 49 44 ...
##  $ hsdrop  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hsgrad  : int  1 0 0 1 0 0 0 0 0 0 ...
##  $ colsome : int  0 1 1 0 1 1 1 1 1 1 ...
##  $ colgrad : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ black   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hispanic: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ female  : int  1 1 1 1 1 0 1 0 1 0 ...

7.1 (4 points) Estimate a logit regression model of whether or not one smokes as a function of the other variables.

mymodel1 <- glm(smoker ~ smkban + age + hsdrop + hsgrad + colsome + colgrad + black + hispanic + female, 
             data = mydata1, 
             family = binomial)
summary(mymodel1)
## 
## Call:
## glm(formula = smoker ~ smkban + age + hsdrop + hsgrad + colsome + 
##     colgrad + black + hispanic + female, family = binomial, data = mydata1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.696765   0.138696 -12.234  < 2e-16 ***
## smkban      -0.250735   0.049164  -5.100 3.40e-07 ***
## age         -0.007452   0.001987  -3.751 0.000176 ***
## hsdrop       1.931075   0.131261  14.712  < 2e-16 ***
## hsgrad       1.523305   0.114308  13.326  < 2e-16 ***
## colsome      1.180080   0.116518  10.128  < 2e-16 ***
## colgrad      0.424753   0.125801   3.376 0.000734 ***
## black       -0.149472   0.089994  -1.661 0.096732 .  
## hispanic    -0.584845   0.083085  -7.039 1.93e-12 ***
## female      -0.188720   0.049105  -3.843 0.000121 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11074  on 9999  degrees of freedom
## Residual deviance: 10502  on 9990  degrees of freedom
## AIC: 10522
## 
## Number of Fisher Scoring iterations: 4

7.2 (4 points) What does the coefficient on “smkban” tell you? How about the coefficient on “age”? How about the coefficient on “colgrad”?

Coefficient on smkban: The coefficient on smkban is -0.250735, so facing a workplace smoking ban is associated with a 0.2507 decrease in the log-odds of being a smoker, holding the other variables fixed. The model suggests that workplace smoking bans are associated with a lower probability of smoking, and the effect is highly significant (p ≈ 3.4e-07).

Coefficient on age: The coefficient on age is -0.007452, small in magnitude but statistically significant (p ≈ 0.0002): each additional year of age is associated with a 0.00745 decrease in the log-odds of smoking, holding the other variables fixed.

Coefficient on colgrad: The coefficient on colgrad is 0.424753. Because hsdrop, hsgrad, colsome, and colgrad are all included in the regression, the omitted education category is adults with education beyond a four-year degree, so college graduates (colgrad = 1) have log-odds of smoking 0.4248 higher than that most-educated baseline group, holding the other variables fixed. Note also that the education coefficients shrink monotonically from hsdrop to colgrad: more education is associated with less smoking.
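
One common way to make the log-odds coefficients easier to read is to exponentiate them into odds ratios (a convenience step, not required by the question):

exp(coef(mymodel1)[c("smkban", "age", "colgrad")])
# smkban:  exp(-0.2507) ≈ 0.78  -- a workplace ban multiplies the odds of smoking by about 0.78
# age:     exp(-0.0075) ≈ 0.993 -- each extra year multiplies the odds by about 0.993
# colgrad: exp(0.4248)  ≈ 1.53  -- college grads have about 1.5 times the odds of the baseline group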

8 Question 8: Difference-in-differences

Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.

mydata2 <- read.csv("C:/Users/dingz/Downloads/nsw.csv")
str(mydata2)
## 'data.frame':    722 obs. of  10 variables:
##  $ data_id  : chr  "Lalonde Sample" "Lalonde Sample" "Lalonde Sample" "Lalonde Sample" ...
##  $ treat    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ age      : int  37 22 30 27 33 22 23 32 22 33 ...
##  $ education: int  11 9 12 11 8 9 12 11 16 12 ...
##  $ black    : int  1 0 1 1 1 1 1 1 1 0 ...
##  $ hispanic : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ married  : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ nodegree : int  1 1 0 1 1 1 0 1 0 0 ...
##  $ re75     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ re78     : num  9930 3596 24909 7506 290 ...

8.1 (4 points) Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?

mydata2$increase <- mydata2$re78 - mydata2$re75
mymodel2 <- lm(increase ~ treat, data = mydata2)
summary(mymodel2)
## 
## Call:
## lm(formula = increase ~ treat, data = mydata2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37995  -3198  -1442   3677  56114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2063.4      359.3   5.743 1.37e-08 ***
## treat          846.9      560.1   1.512    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7406 on 720 degrees of freedom
## Multiple R-squared:  0.003165,   Adjusted R-squared:  0.00178 
## F-statistic: 2.286 on 1 and 720 DF,  p-value: 0.131

The coefficient on treat estimates the effect: receiving work experience and counseling is associated with an $846.9 larger increase in real earnings between 1975 and 1978 than in the control group. Testing H0: β_treat = 0 against Ha: β_treat ≠ 0, the p-value (0.131) is greater than α = 0.1, so we fail to reject the null hypothesis: the effect is not statistically significant at the 10% significance level.

8.2 (4 points) Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?

mymodel3 <- lm(increase ~ treat + age + education + black + hispanic + married + nodegree, data = mydata2)
summary(mymodel3)
## 
## Call:
## lm(formula = increase ~ treat + age + education + black + hispanic + 
##     married + nodegree, data = mydata2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38153  -3383  -1076   3503  55863 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  1986.822   3123.319   0.636  0.52490   
## treat         869.294    561.559   1.548  0.12207   
## age             4.194     43.424   0.097  0.92308   
## education      26.633    215.306   0.124  0.90159   
## black         112.085    956.157   0.117  0.90672   
## hispanic     1229.152   1253.250   0.981  0.32704   
## married     -2210.949    767.530  -2.881  0.00409 **
## nodegree     -217.758    891.527  -0.244  0.80710   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7384 on 714 degrees of freedom
## Multiple R-squared:  0.01729,    Adjusted R-squared:  0.007654 
## F-statistic: 1.794 on 7 and 714 DF,  p-value: 0.08539

With the controls added, the estimated effect ($869.29) is slightly larger than in the univariate model ($846.89). Testing H0: β_treat = 0 against Ha: β_treat ≠ 0, the p-value (0.12207) is still greater than α = 0.1, so the effect remains statistically insignificant at the 10% significance level.

8.3 (6 points) Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?

treated <- mydata2[mydata2$treat == 1, ]
control <- mydata2[mydata2$treat == 0, ]
# difference-in-differences: (mean change, treated) minus (mean change, control)
did <- mean(treated$increase) - mean(control$increase)
print(did)
## [1] 846.8883

The difference-in-differences estimate is $846.8883: treated workers’ earnings grew about $847 more between 1975 and 1978 than control workers’. It matches the coefficient on treat from 8.1, as it must, since the outcome there was already the 1975–1978 change.
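
Since the question asks for a data frame set up for difference-in-differences, one way to build it (a sketch using the variables loaded above; mydid and mymodel_did are names I introduce here) is to stack the 1975 and 1978 earnings and regress on the treatment-by-period interaction, whose coefficient reproduces the estimate:

# long format: one row per worker-year
mydid <- data.frame(
  earnings = c(mydata2$re75, mydata2$re78),
  post     = rep(c(0, 1), each = nrow(mydata2)),   # 0 = 1975, 1 = 1978
  treat    = rep(mydata2$treat, 2)
)
mymodel_did <- lm(earnings ~ treat * post, data = mydid)
coef(mymodel_did)["treat:post"]   # equals 846.8883, the DiD estimate above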

9 Question 9: Data Question

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Import the titanic train.csv file into R. You will have to clean the data to run some of the commands / set up the workflow.

mydata3 <- read.csv("C:/Users/dingz/Downloads/train.csv")
str(mydata3)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
# Impute missing ages with the median age
mydata3$Age[is.na(mydata3$Age)] <- median(mydata3$Age, na.rm = TRUE)
# Encode Sex numerically: 1 = female, 2 = male (alphabetical factor order)
mydata3$sex <- as.numeric(factor(mydata3$Sex))

1. (4 points) Run some preliminary correlations of Survived with some other variables.

cor(mydata3$Survived, mydata3$Pclass)
## [1] -0.338481
cor(mydata3$Survived, mydata3$SibSp)
## [1] -0.0353225
cor(mydata3$Survived, mydata3$Age)
## [1] -0.06491042
cor(mydata3$Survived, mydata3$Parch)
## [1] 0.08162941
cor(mydata3$Survived, mydata3$Fare)
## [1] 0.2573065
cor(mydata3$Survived, mydata3$sex)
## [1] -0.5433514
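
The same correlations can be collected in one call, which makes it easier to scan for the strongest predictors (sex, then Pclass):

round(cor(mydata3[, c("Survived", "Pclass", "Age", "SibSp", "Parch", "Fare", "sex")]), 3)
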
2. (3 points) Conduct descriptive statistics of the dataset. Anything interesting that you find?

library('psych')
describe(mydata3)
##             vars   n   mean     sd median trimmed    mad  min    max  range
## PassengerId    1 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Survived       2 891   0.38   0.49   0.00    0.35   0.00 0.00   1.00   1.00
## Pclass         3 891   2.31   0.84   3.00    2.39   0.00 1.00   3.00   2.00
## Name*          4 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Sex*           5 891   1.65   0.48   2.00    1.68   0.00 1.00   2.00   1.00
## Age            6 891  29.36  13.02  28.00   28.83   8.90 0.42  80.00  79.58
## SibSp          7 891   0.52   1.10   0.00    0.27   0.00 0.00   8.00   8.00
## Parch          8 891   0.38   0.81   0.00    0.18   0.00 0.00   6.00   6.00
## Ticket*        9 891 339.52 200.83 338.00  339.65 268.35 1.00 681.00 680.00
## Fare          10 891  32.20  49.69  14.45   21.38  10.24 0.00 512.33 512.33
## Cabin*        11 891  18.63  38.14   1.00    8.29   0.00 1.00 148.00 147.00
## Embarked*     12 891   3.53   0.80   4.00    3.66   0.00 1.00   4.00   3.00
## sex           13 891   1.65   0.48   2.00    1.68   0.00 1.00   2.00   1.00
##              skew kurtosis   se
## PassengerId  0.00    -1.20 8.62
## Survived     0.48    -1.77 0.02
## Pclass      -0.63    -1.28 0.03
## Name*        0.00    -1.20 8.62
## Sex*        -0.62    -1.62 0.02
## Age          0.51     0.97 0.44
## SibSp        3.68    17.73 0.04
## Parch        2.74     9.69 0.03
## Ticket*      0.00    -1.28 6.73
## Fare         4.77    33.12 1.66
## Cabin*       2.09     3.07 1.28
## Embarked*   -1.27    -0.16 0.03
## sex         -0.62    -1.62 0.02

Age: The mean age is 29.36, and the variability is large (ages range from 0.42 to 80.00).

Fare: The mean fare (32.20) is more than double the median (14.45), with a maximum of 512.33. The distribution is strongly right-skewed (skew = 4.77), with a few very expensive tickets pulling the mean up.

Sex: There are more male passengers than female (the mean of the Sex indicator is 1.65, where 2 = male).

Survived: The overall survival rate on the Titanic is 38%.

SibSp: Most passengers did not travel with siblings or spouses (median SibSp = 0).

Parch: Most passengers did not travel with parents or children (median Parch = 0).

3. (3 points) Use the set.seed(100) command and create a subset of the train dataset that has only 500 observations.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(100)
trainsubset <- sample_n(mydata3, 500)
summary(trainsubset)
##   PassengerId       Survived         Pclass          Name          
##  Min.   :  1.0   Min.   :0.000   Min.   :1.000   Length:500        
##  1st Qu.:222.8   1st Qu.:0.000   1st Qu.:1.000   Class :character  
##  Median :428.5   Median :0.000   Median :3.000   Mode  :character  
##  Mean   :440.0   Mean   :0.378   Mean   :2.308                     
##  3rd Qu.:659.2   3rd Qu.:1.000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.000   Max.   :3.000                     
##      Sex                 Age            SibSp         Parch      
##  Length:500         Min.   : 0.42   Min.   :0.0   Min.   :0.000  
##  Class :character   1st Qu.:22.00   1st Qu.:0.0   1st Qu.:0.000  
##  Mode  :character   Median :28.00   Median :0.0   Median :0.000  
##                     Mean   :29.36   Mean   :0.5   Mean   :0.366  
##                     3rd Qu.:35.00   3rd Qu.:1.0   3rd Qu.:0.000  
##                     Max.   :71.00   Max.   :8.0   Max.   :5.000  
##     Ticket               Fare            Cabin             Embarked        
##  Length:500         Min.   :  0.000   Length:500         Length:500        
##  Class :character   1st Qu.:  7.896   Class :character   Class :character  
##  Mode  :character   Median : 14.456   Mode  :character   Mode  :character  
##                     Mean   : 32.092                                        
##                     3rd Qu.: 30.772                                        
##                     Max.   :512.329                                        
##       sex       
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.676  
##  3rd Qu.:2.000  
##  Max.   :2.000

4. (4 points) Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable, on your n=500 sample.

mymodel4  <- lm(Survived ~ Pclass + Sex + Age + SibSp + Parch, data = trainsubset)
summary(mymodel4)
## 
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch, data = trainsubset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0654 -0.2347 -0.1264  0.2540  0.9752 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.321717   0.084813  15.584  < 2e-16 ***
## Pclass      -0.175550   0.023069  -7.610 1.40e-13 ***
## Sexmale     -0.465582   0.039870 -11.678  < 2e-16 ***
## Age         -0.006771   0.001502  -4.507 8.22e-06 ***
## SibSp       -0.040120   0.017629  -2.276   0.0233 *  
## Parch       -0.013569   0.025350  -0.535   0.5927    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3984 on 494 degrees of freedom
## Multiple R-squared:  0.3329, Adjusted R-squared:  0.3262 
## F-statistic: 49.31 on 5 and 494 DF,  p-value: < 2.2e-16
plot(mymodel4)   # regression diagnostic plots (not shown)

5. (4 points) Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficients from your linear model to forecast/predict/estimate the survival variable given independent variable values/data.

You may have to transform your predicted values into 0 or 1, say using an ifelse rule/command. ?ifelse in R shows the syntax and usage. E.g., ifelse(predicted > .5, yes = 1, no = 0)

mypredict1 <- predict(mymodel4, trainsubset, type = "response")
mypredict_survived1  <- ifelse(mypredict1 > 0.5, 1, 0)
table(mypredict_survived1)
## mypredict_survived1
##   0   1 
## 343 157
mymatrix1 <- table(trainsubset$Survived, mypredict_survived1)
print(mymatrix1)
##    mypredict_survived1
##       0   1
##   0 272  39
##   1  71 118
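
For reference, the confusion matrix above already implies the in-sample accuracy on the n = 500 subset:

sum(diag(mymatrix1)) / sum(mymatrix1)   # (272 + 118) / 500 = 0.78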

6. (4 points) Create a confusion matrix and calculate the accuracy for the entire train dataset. In other words, how many individuals who survived or died in the Titanic crash were correctly classified by your regression?

Here I refit the three-variable model (Pclass, Sex, Age) referenced in item 7 and evaluate its predictions on the full train dataset:

mymodel5 <- lm(Survived ~ Pclass + Sex + Age, data = trainsubset)
summary(mymodel5)
## 
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age, data = trainsubset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0791 -0.2568 -0.1164  0.2534  0.9845 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.267864   0.082326  15.401  < 2e-16 ***
## Pclass      -0.176925   0.023193  -7.628 1.23e-13 ***
## Sexmale     -0.454444   0.039011 -11.649  < 2e-16 ***
## Age         -0.005937   0.001480  -4.012 6.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4007 on 496 degrees of freedom
## Multiple R-squared:  0.3227, Adjusted R-squared:  0.3186 
## F-statistic: 78.76 on 3 and 496 DF,  p-value: < 2.2e-16
mypredict2<- predict(mymodel5, mydata3, type = "response")
mypredict_survived2  <- ifelse(mypredict2 > 0.5, 1, 0)
table(mypredict_survived2)
## mypredict_survived2
##   0   1 
## 578 313
mymatrix2 <- table(mydata3$Survived, mypredict_survived2)
print(mymatrix2)
##    mypredict_survived2
##       0   1
##   0 472  77
##   1 106 236
myaccuracy <- sum(diag(mymatrix2)) / sum(mymatrix2)
print(myaccuracy)
## [1] 0.7946128

7. (6 points) Pclass, Sex, and Age can give you an accuracy of about 80%. Can you get higher accuracy? Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model: interpretation of your coefficients, how well your linear regression predicts correctly, etc.

mypredict3<- predict(mymodel4, mydata3, type = "response")
mypredict_survived3  <- ifelse(mypredict3 > 0.5, 1, 0)
table(mypredict_survived3)
## mypredict_survived3
##   0   1 
## 587 304
mymatrix3 <- table(mydata3$Survived, mypredict_survived3)
print(mymatrix3)
##    mypredict_survived3
##       0   1
##   0 480  69
##   1 107 235
myaccuracy2 <- sum(diag(mymatrix3)) / sum(mymatrix3)
print(myaccuracy2)
## [1] 0.8024691

Earlier descriptive work showed that most Titanic survivors were young adults, which suggests that family variables such as SibSp may also carry predictive information. Hence, instead of mymodel5 (a linear regression on Pclass, Sex, and Age), I used mymodel4, which adds SibSp and Parch. The accuracy on the full train dataset improves from 0.794 to 0.802, so SibSp (and, to a lesser extent, Parch) does help predict whether a passenger survived.
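
One direction that might push accuracy a little higher (a sketch, not a tuned model: adding Fare and switching to a logit specification are my choices, not part of the workflow above) is logistic regression, which respects the binary outcome:

mymodel6 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Fare,
                data = trainsubset, family = binomial)
mypredict4 <- ifelse(predict(mymodel6, mydata3, type = "response") > 0.5, 1, 0)
mymatrix4 <- table(mydata3$Survived, mypredict4)
sum(diag(mymatrix4)) / sum(mymatrix4)   # compare against 0.802; run to check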