1 Question 1: Distribution of an individual score vs Sample Mean (CLT)
The distribution of the average IQ score is known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Binet).
1.1 (4 points) A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.
# Probability statement: P(X >= 110), where X ~ N(100, 16)
mean1 <- 100
sd1 <- 16
IQ <- 110
p1_1 <- 1 - pnorm(IQ, mean1, sd1)
print(p1_1)
## [1] 0.2659855
1.2 (4 points) What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with μ=100 and σ=16? Be sure to write the probability statement and show your R code.
# Probability statement: P(Xbar >= 110), where Xbar ~ N(100, 16/sqrt(12)) by the CLT
mu1 <- 100
sd1 <- 16
n <- 12
mean1 <- 110
SE <- sd1 / sqrt(n)
z <- (mean1 - mu1) / SE
p1_2 <- 1 - pnorm(z)
print(p1_2)
## [1] 0.01519141
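As a cross-check (a minimal sketch using the same values), the probability can also be computed directly from the sampling distribution of the mean, without standardizing by hand:
# P(Xbar >= 110) with Xbar ~ N(100, 16/sqrt(12)) under the CLT
pnorm(110, mean = 100, sd = 16 / sqrt(12), lower.tail = FALSE)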
2 Question 2: Hypothesis testing - Proportion
Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.
2.1 (6 points) A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it, and uses the point estimate from this survey as evidence. Conduct a hypothesis test with α=.05 to determine whether these data provide strong evidence that more than 50% of adult Americans do not go to college for financial reasons. What are the critical value, test statistic, and p-value for the proportion of Americans who do not go to college for financial reasons? Do you reject or fail to reject the null?
pi0 <- 0.5   # hypothesized proportion (avoids overwriting R's built-in pi)
n <- 331
p <- 0.48
alpha <- 0.05
# z statistic for a one-sample proportion; the p-value below is the lower
# tail, matching the newspaper's claim that only a minority (pi < 0.5)
z <- (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
p2_1 <- pnorm(z)
critical_value <- qnorm(alpha)
print(z)
## [1] -0.7277362
print(critical_value)
## [1] -1.644854
print(p2_1)
## [1] 0.2333875
Since the p-value (0.2334) is greater than alpha (0.05) and the test statistic (-0.728) is greater than the critical value (-1.645), we fail to reject the null hypothesis.
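For reference, base R's exact binomial test leads to the same conclusion. This is only a sketch: the raw count of respondents is not reported, so round(0.48 * 331) = 159 successes is an assumption.
# Exact binomial cross-check; the count 159 is assumed from 0.48 * 331
binom.test(round(0.48 * 331), 331, p = 0.5, alternative = "less")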
2.2 (2 points) Would you expect a confidence interval (α=.05, double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.
n <- 331
p <- 0.48
alpha <- 0.05
SE <- sqrt((p * (1-p))/n)
lower <- p - qnorm(0.975) * SE
upper <- p + qnorm(0.975) * SE
cat("the 95% confidence interval for the proportion of American adults between", lower ,"to", upper)
## the 95% confidence interval for the proportion of American adults between 0.4261784 to 0.5338216
Since the hypothesized value (0.5) lies within the interval (0.4262, 0.5338), yes: I would expect a 95% confidence interval for the proportion of American adults who decide not to go to college for financial reasons to include the hypothesized value.
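As a cross-check (a sketch under the same assumed count of round(0.48 * 331) = 159), prop.test reports a score-based interval; its bounds differ slightly from the normal-approximation interval above because it uses the Wilson method with continuity correction:
prop.test(round(0.48 * 331), 331, conf.level = 0.95)$conf.int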
3 Question 3: Hypothesis testing - Difference of two means
3.1 (6 points) You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group, who ate their lunch while playing solitaire, were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.
Do these data provide strong evidence that the average numbers of food items recalled by the patients in the treatment and control groups are different? Assume α=5%.
myp <- function(p, alpha){
  if(p < alpha){print('REJECT Ho')} else {print('FAIL 2 REJECT')}
}
xbar1 <- 4.9   # treatment group sample mean
s1 <- 1.8      # treatment group sample sd
n1 <- 22
xbar2 <- 6.1   # control group sample mean
s2 <- 1.8      # control group sample sd
n2 <- 22
alpha <- 0.05
# Welch two-sample t statistic from summary statistics
t3_1 <- (xbar1 - xbar2) / sqrt((s1^2 / n1) + (s2^2 / n2))
# Welch-Satterthwaite degrees of freedom
df_num <- ((s1^2 / n1) + (s2^2 / n2))^2
df_den <- ((s1^2 / n1)^2 / (n1 - 1)) + ((s2^2 / n2)^2 / (n2 - 1))
df <- df_num / df_den
# Two-sided p-value; t3_1 is negative, so 2 * pt(t3_1, df) = 2 * P(T <= t3_1)
p3_1 <- 2 * pt(t3_1, df)
print(t3_1)
## [1] -2.211083
print(p3_1)
## [1] 0.03252806
myp(p3_1,alpha)
## [1] "REJECT Ho"
Since the p-value (0.0325) is less than alpha (0.05), we reject the null hypothesis: the data provide strong evidence that the average numbers of items recalled by the treatment and control groups differ.
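As a cross-check from summary statistics (a sketch assuming the BSDA package is installed; tsum.test is not part of base R):
# Welch test from summary statistics; should match t3_1 and p3_1 above
BSDA::tsum.test(mean.x = 4.9, s.x = 1.8, n.x = 22,
                mean.y = 6.1, s.y = 1.8, n.y = 22,
                var.equal = FALSE)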
4 Question 4: Working backwards
4.1 (6 points) Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.
n<- 25
sample_mean<-(65+77)/2
print(sample_mean)
## [1] 71
margin_of_error <- (77-65)/2
print(margin_of_error)
## [1] 6
t <- qt(0.90 + (1 - 0.90) / 2, df = n - 1)  # 95th percentile of t with n - 1 = 24 df
standard_error<-margin_of_error/t
print(standard_error)
## [1] 3.506963
population_standard_deviation<-standard_error * sqrt(n)
print(population_standard_deviation)
## [1] 17.53481
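As a sanity check (a minimal sketch reusing the objects above), rebuilding the interval from the recovered quantities should reproduce (65, 77):
# Rebuild the 90% CI from the recovered sample mean and standard error
sample_mean + c(-1, 1) * t * standard_error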
5 Question 5: Assumptions
5.1 (4 points) The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -
mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
## control treatment
## alive 4 24
## dead 30 45
Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?
In this case the binomial distributions cannot be approximated by the normal distribution because the counts are too small. The normal approximation (via the Central Limit Theorem) requires a sufficiently large number of both successes and failures in each group, commonly at least 10 of each. Here there are only 4 survivors in the control group (and 24 in the treatment group), so the success-failure condition fails and the sampling distribution of the difference in survival rates will be skewed rather than normal. If we constructed the confidence interval despite this problem, its stated coverage would be unreliable: the interval assumes a symmetric normal curve while the true sampling distribution is skewed, leading to inaccurate (and possibly impossible, e.g. out-of-range) bounds.
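A quick sketch of the success-failure check in R, computed from the table above, makes the failure concrete:
# Success-failure condition: each group needs at least ~10 alive and ~10 dead
# for the normal approximation; the control group has only 4 alive
mym["alive", ] / colSums(mym)   # observed survival rates by group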
6 Question 6: Hypothesis testing in a regression
6.1 (6 points) Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.
myfunction <- function(mymodel, coefname, beta0 = 0, alpha = 0.05){
  coeff_summary <- summary(mymodel)$coefficients
  coeff_estimate <- coef(mymodel)[coefname]   # keeps the coefficient's name
  coeff_SE <- coeff_summary[coefname, "Std. Error"]
  df <- mymodel$df.residual
  # t statistic for H0: beta = beta0 vs Ha: beta != beta0
  t <- (coeff_estimate - beta0) / coeff_SE
  p_value <- 2 * pt(-abs(t), df)
  if (p_value < alpha) {
    myp <- "Reject the Null Hypothesis"
  } else {
    myp <- "Fail to Reject the Null Hypothesis"
  }
  return(list(
    t = t,
    p_value = p_value,
    myp = myp
  ))
}
6.2 (10 points) Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
mymodel <- lm(mpg ~ carb, data = mtcars)
summary(mymodel)
##
## Call:
## lm(formula = mpg ~ carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.250 -3.316 -1.433 3.384 10.083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.8723 1.8368 14.085 9.22e-15 ***
## carb -2.0557 0.5685 -3.616 0.00108 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.113 on 30 degrees of freedom
## Multiple R-squared: 0.3035, Adjusted R-squared: 0.2803
## F-statistic: 13.07 on 1 and 30 DF, p-value: 0.001084
# By hand: test H0: beta_carb = 0 vs Ha: beta_carb != 0 at alpha = 0.05
coeff_summary <- summary(mymodel)$coefficients
coeff_estimate <- coef(mymodel)[2]
coeff_SE <- coeff_summary[2, "Std. Error"]
df <- mymodel$df.residual
t <- coeff_estimate / coeff_SE
p_value <- 2 * pt(-abs(t), df)
print(t)
## carb
## -3.61575
print(p_value)
## carb
## 0.001084446
result <- myfunction(mymodel, coefname = "carb", beta0 = 0, alpha = 0.05)
print(result)
## $t
## carb
## -3.61575
##
## $p_value
## carb
## 0.001084446
##
## $myp
## [1] "Reject the Null Hypothesis"
7 Question 7: Determinants of smoking
Import the smoking.csv file into R. The variables are “smoker” - whether or not one smokes; “smkban” - whether or not one faces a smoking ban at one’s workplace (this is fairly old data); “age” - in years; “hsdrop” - high school dropout; “hsgrad” - graduated from high school but has no further education; “colsome” - has some college education; “colgrad” - graduated from college, as well as some self-explanatory demographic variables.
mydata1 <- read.csv("C:/Users/dingz/Downloads/smoking.csv")
str(mydata1)
## 'data.frame': 10000 obs. of 10 variables:
## $ smoker : int 1 1 0 1 0 0 1 1 0 0 ...
## $ smkban : int 1 1 0 0 1 0 1 0 1 0 ...
## $ age : int 41 44 19 29 28 40 47 36 49 44 ...
## $ hsdrop : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hsgrad : int 1 0 0 1 0 0 0 0 0 0 ...
## $ colsome : int 0 1 1 0 1 1 1 1 1 1 ...
## $ colgrad : int 0 0 0 0 0 0 0 0 0 0 ...
## $ black : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hispanic: int 0 0 0 0 0 0 0 0 0 0 ...
## $ female : int 1 1 1 1 1 0 1 0 1 0 ...
7.1 (4 points) Estimate a logit regression model of whether or not one smokes as a function of the other variables.
mymodel1 <- glm(smoker ~ smkban + age + hsdrop + hsgrad + colsome + colgrad + black + hispanic + female,
data = mydata1,
family = binomial)
summary(mymodel1)
##
## Call:
## glm(formula = smoker ~ smkban + age + hsdrop + hsgrad + colsome +
## colgrad + black + hispanic + female, family = binomial, data = mydata1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.696765 0.138696 -12.234 < 2e-16 ***
## smkban -0.250735 0.049164 -5.100 3.40e-07 ***
## age -0.007452 0.001987 -3.751 0.000176 ***
## hsdrop 1.931075 0.131261 14.712 < 2e-16 ***
## hsgrad 1.523305 0.114308 13.326 < 2e-16 ***
## colsome 1.180080 0.116518 10.128 < 2e-16 ***
## colgrad 0.424753 0.125801 3.376 0.000734 ***
## black -0.149472 0.089994 -1.661 0.096732 .
## hispanic -0.584845 0.083085 -7.039 1.93e-12 ***
## female -0.188720 0.049105 -3.843 0.000121 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11074 on 9999 degrees of freedom
## Residual deviance: 10502 on 9990 degrees of freedom
## AIC: 10522
##
## Number of Fisher Scoring iterations: 4
7.2 (4 points) What does the coefficient on “smkban” tell you? How about the coefficient on “age”? How about the coefficient on “colgrad”?
Coefficient on smkban: The coefficient on smkban is -0.250735, indicating a negative association between facing a workplace smoking ban and smoking: facing a ban (smkban = 1 rather than 0) lowers the log-odds of being a smoker by 0.250735, holding the other variables constant. According to the model, workplace smoking bans are associated with a lower likelihood of smoking.
Coefficient on age: The coefficient on age is -0.007452: each additional year of age lowers the log-odds of being a smoker by 0.007452, holding the other variables constant. The per-year effect is small in magnitude, but it is statistically significant (p ≈ 0.0002).
Coefficient on colgrad: The coefficient on colgrad is 0.424753, a positive association: people who graduated from college (colgrad = 1) have log-odds of smoking 0.424753 higher than the omitted baseline education category (those with education beyond a four-year degree), holding the other variables constant. Note that the coefficients on hsdrop, hsgrad, and colsome are even larger, so college graduates are still less likely to smoke than those with less education.
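To make these magnitudes concrete, the log-odds coefficients can be exponentiated into odds ratios (an optional sketch, not required by the question):
# Odds ratios: e.g. exp(-0.250735) = 0.78, so a workplace ban multiplies
# the odds of smoking by about 0.78, holding the other variables constant
exp(coef(mymodel1)[c("smkban", "age", "colgrad")])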
8 Question 8: Difference-in-differences
Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.
mydata2 <- read.csv("C:/Users/dingz/Downloads/nsw.csv")
str(mydata2)
## 'data.frame': 722 obs. of 10 variables:
## $ data_id : chr "Lalonde Sample" "Lalonde Sample" "Lalonde Sample" "Lalonde Sample" ...
## $ treat : int 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 37 22 30 27 33 22 23 32 22 33 ...
## $ education: int 11 9 12 11 8 9 12 11 16 12 ...
## $ black : int 1 0 1 1 1 1 1 1 1 0 ...
## $ hispanic : int 0 1 0 0 0 0 0 0 0 0 ...
## $ married : int 1 0 0 0 0 0 0 0 0 1 ...
## $ nodegree : int 1 1 0 1 1 1 0 1 0 0 ...
## $ re75 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ re78 : num 9930 3596 24909 7506 290 ...
8.1 (4 points) Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?
# Outcome: change in real earnings from 1975 to 1978
mydata2$increase <- mydata2$re78 - mydata2$re75
mymodel2 <- lm(increase ~ treat, data = mydata2)
summary(mymodel2)
##
## Call:
## lm(formula = increase ~ treat, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37995 -3198 -1442 3677 56114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2063.4 359.3 5.743 1.37e-08 ***
## treat 846.9 560.1 1.512 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7406 on 720 degrees of freedom
## Multiple R-squared: 0.003165, Adjusted R-squared: 0.00178
## F-statistic: 2.286 on 1 and 720 DF, p-value: 0.131
H0: the treatment has no effect on the change in real earnings (βtreat = 0); Ha: it does (βtreat ≠ 0). According to the model, receiving work experience and counseling is associated with an increase in real-earnings growth between 1975 and 1978 of $846.9 relative to the control group. Since the p-value (0.131) is greater than alpha (0.1), we fail to reject the null hypothesis; the effect is not statistically significant at the 10% significance level.
8.2 (4 points) Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?
mymodel3 <- lm(increase ~ treat + age + education + black + hispanic + married + nodegree, data = mydata2)
summary(mymodel3)
##
## Call:
## lm(formula = increase ~ treat + age + education + black + hispanic +
## married + nodegree, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38153 -3383 -1076 3503 55863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1986.822 3123.319 0.636 0.52490
## treat 869.294 561.559 1.548 0.12207
## age 4.194 43.424 0.097 0.92308
## education 26.633 215.306 0.124 0.90159
## black 112.085 956.157 0.117 0.90672
## hispanic 1229.152 1253.250 0.981 0.32704
## married -2210.949 767.530 -2.881 0.00409 **
## nodegree -217.758 891.527 -0.244 0.80710
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7384 on 714 degrees of freedom
## Multiple R-squared: 0.01729, Adjusted R-squared: 0.007654
## F-statistic: 1.794 on 7 and 714 DF, p-value: 0.08539
With the controls added, the estimated treatment effect on the 1975-1978 increase in real earnings ($869.294) is slightly larger than in the univariate model ($846.9). The hypotheses are the same as in the previous part. Since the p-value (0.12207) still exceeds alpha (0.1), the effect remains not statistically significant at the 10% significance level.
8.3 (6 points) Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?
# Difference-in-differences: (mean change for treated) - (mean change for control)
treated <- mydata2[mydata2$treat == 1, ]
control <- mydata2[mydata2$treat == 0, ]
did <- mean(treated$increase) - mean(control$increase)
print(did)
## [1] 846.8883
The difference-in-differences estimate of the effect on the increase in real earnings between 1975 and 1978 is 846.8883, which matches the coefficient on treat from the univariate regression in part 8.1.
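The same estimate can be recovered from an explicit difference-in-differences regression on a stacked (long) data frame, as the question's phrasing suggests. The sketch below uses illustrative names (longdata, did_model); the treat:post interaction coefficient should equal 846.8883:
# One row per worker-year: earnings in 1975 (post = 0) and 1978 (post = 1)
longdata <- data.frame(
  earnings = c(mydata2$re75, mydata2$re78),
  post = rep(c(0, 1), each = nrow(mydata2)),
  treat = rep(mydata2$treat, 2)
)
# The interaction term is the difference-in-differences estimator
did_model <- lm(earnings ~ treat * post, data = longdata)
coef(did_model)["treat:post"]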
9 Question 9: Data Question
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
Import the titanic train.csv file into R. You will have to clean the data to run some of the commands / set up the workflow.
mydata3 <- read.csv("C:/Users/dingz/Downloads/train.csv")
str(mydata3)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
# Impute missing ages with the median age
mydata3$Age[is.na(mydata3$Age)] <- median(mydata3$Age, na.rm = TRUE)
# Encode Sex numerically (1 = female, 2 = male) for the correlations below
mydata3$sex <- as.numeric(factor(mydata3$Sex))
1.(4 points) Run some preliminary correlations of Survived with some other variables.
cor(mydata3$Survived, mydata3$Pclass)
## [1] -0.338481
cor(mydata3$Survived, mydata3$SibSp)
## [1] -0.0353225
cor(mydata3$Survived, mydata3$Age)
## [1] -0.06491042
cor(mydata3$Survived, mydata3$Parch)
## [1] 0.08162941
cor(mydata3$Survived, mydata3$Fare)
## [1] 0.2573065
cor(mydata3$Survived, mydata3$sex)
## [1] -0.5433514
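The same pairwise correlations can be collected in one call (a small sketch over the numeric columns used above; numvars is an illustrative name):
numvars <- c("Survived", "Pclass", "SibSp", "Age", "Parch", "Fare", "sex")
cor(mydata3[, numvars])["Survived", ]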
library('psych')
describe(mydata3)
## vars n mean sd median trimmed mad min max range
## PassengerId 1 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Survived 2 891 0.38 0.49 0.00 0.35 0.00 0.00 1.00 1.00
## Pclass 3 891 2.31 0.84 3.00 2.39 0.00 1.00 3.00 2.00
## Name* 4 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Sex* 5 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00
## Age 6 891 29.36 13.02 28.00 28.83 8.90 0.42 80.00 79.58
## SibSp 7 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00
## Parch 8 891 0.38 0.81 0.00 0.18 0.00 0.00 6.00 6.00
## Ticket* 9 891 339.52 200.83 338.00 339.65 268.35 1.00 681.00 680.00
## Fare 10 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33
## Cabin* 11 891 18.63 38.14 1.00 8.29 0.00 1.00 148.00 147.00
## Embarked* 12 891 3.53 0.80 4.00 3.66 0.00 1.00 4.00 3.00
## sex 13 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00
## skew kurtosis se
## PassengerId 0.00 -1.20 8.62
## Survived 0.48 -1.77 0.02
## Pclass -0.63 -1.28 0.03
## Name* 0.00 -1.20 8.62
## Sex* -0.62 -1.62 0.02
## Age 0.51 0.97 0.44
## SibSp 3.68 17.73 0.04
## Parch 2.74 9.69 0.03
## Ticket* 0.00 -1.28 6.73
## Fare 4.77 33.12 1.66
## Cabin* 2.09 3.07 1.28
## Embarked* -1.27 -0.16 0.03
## sex -0.62 -1.62 0.02
Age: The mean age is 29.36, and the variability is large (ages range from 0.42 to 80.00).
Fare: The mean fare (32.20) is well above the median (14.45), with a maximum of 512.33, so the distribution is strongly right-skewed: a few passengers paid very high fares, while some (minimum fare of 0) traveled for free.
Sex: There were more male passengers than female passengers.
Survived: The survival rate on the Titanic was 38%.
SibSp: Most passengers did not travel with siblings or spouses.
Parch: Most passengers did not travel with parents or children.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(100)
trainsubset <- sample_n(mydata3, 500)
summary(trainsubset)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.000 Min. :1.000 Length:500
## 1st Qu.:222.8 1st Qu.:0.000 1st Qu.:1.000 Class :character
## Median :428.5 Median :0.000 Median :3.000 Mode :character
## Mean :440.0 Mean :0.378 Mean :2.308
## 3rd Qu.:659.2 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :891.0 Max. :1.000 Max. :3.000
## Sex Age SibSp Parch
## Length:500 Min. : 0.42 Min. :0.0 Min. :0.000
## Class :character 1st Qu.:22.00 1st Qu.:0.0 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0 Median :0.000
## Mean :29.36 Mean :0.5 Mean :0.366
## 3rd Qu.:35.00 3rd Qu.:1.0 3rd Qu.:0.000
## Max. :71.00 Max. :8.0 Max. :5.000
## Ticket Fare Cabin Embarked
## Length:500 Min. : 0.000 Length:500 Length:500
## Class :character 1st Qu.: 7.896 Class :character Class :character
## Mode :character Median : 14.456 Mode :character Mode :character
## Mean : 32.092
## 3rd Qu.: 30.772
## Max. :512.329
## sex
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :1.676
## 3rd Qu.:2.000
## Max. :2.000
4.(4 points) Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample.
mymodel4 <- lm(Survived ~ Pclass + Sex + Age + SibSp + Parch, data = trainsubset)
summary(mymodel4)
##
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch, data = trainsubset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0654 -0.2347 -0.1264 0.2540 0.9752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.321717 0.084813 15.584 < 2e-16 ***
## Pclass -0.175550 0.023069 -7.610 1.40e-13 ***
## Sexmale -0.465582 0.039870 -11.678 < 2e-16 ***
## Age -0.006771 0.001502 -4.507 8.22e-06 ***
## SibSp -0.040120 0.017629 -2.276 0.0233 *
## Parch -0.013569 0.025350 -0.535 0.5927
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3984 on 494 degrees of freedom
## Multiple R-squared: 0.3329, Adjusted R-squared: 0.3262
## F-statistic: 49.31 on 5 and 494 DF, p-value: < 2.2e-16
plot(mymodel4)  # standard lm diagnostic plots (not shown here)
5.(4 points) Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficients from your linear model to forecast/predict/estimate the survival variable given independent variable values/data.
You may have to transform your predicted values into 0 or 1, say using an ifelse rule/command. ?ifelse in R shows the syntax and usage. EG - ifelse(predicted > .5, yes = 1, no = 0)
mypredict1 <- predict(mymodel4, trainsubset, type = "response")
mypredict_survived1 <- ifelse(mypredict1 > 0.5, 1, 0)
table(mypredict_survived1)
## mypredict_survived1
## 0 1
## 343 157
mymatrix1 <- table(trainsubset$Survived, mypredict_survived1)
print(mymatrix1)
## mypredict_survived1
## 0 1
## 0 272 39
## 1 71 118
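For comparison with the full-sample accuracy computed in the next part, the in-sample accuracy on the n = 500 subset follows directly from this matrix (a small sketch):
# In-sample accuracy on the subset: (272 + 118) / 500 = 0.78
sum(diag(mymatrix1)) / sum(mymatrix1)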
6.(4 points) Create a confusion matrix, and calculate the accuracy for the entire train dataset. In other words, how many individuals who survived or died in the Titanic crash were correctly classified by your regression?
mymodel5 <- lm(Survived ~ Pclass + Sex + Age , data = trainsubset)
summary(mymodel5)
##
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age, data = trainsubset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0791 -0.2568 -0.1164 0.2534 0.9845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.267864 0.082326 15.401 < 2e-16 ***
## Pclass -0.176925 0.023193 -7.628 1.23e-13 ***
## Sexmale -0.454444 0.039011 -11.649 < 2e-16 ***
## Age -0.005937 0.001480 -4.012 6.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4007 on 496 degrees of freedom
## Multiple R-squared: 0.3227, Adjusted R-squared: 0.3186
## F-statistic: 78.76 on 3 and 496 DF, p-value: < 2.2e-16
mypredict2<- predict(mymodel5, mydata3, type = "response")
mypredict_survived2 <- ifelse(mypredict2 > 0.5, 1, 0)
table(mypredict_survived2)
## mypredict_survived2
## 0 1
## 578 313
mymatrix2 <- table(mydata3$Survived, mypredict_survived2)
print(mymatrix2)
## mypredict_survived2
## 0 1
## 0 472 77
## 1 106 236
myaccuracy <- sum(diag(mymatrix2)) / sum(mymatrix2)
print(myaccuracy)
## [1] 0.7946128
7.(6 points) Pclass, Sex, Age can give you an accuracy of about 80%. Can you get higher accuracy? Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model - interpretation of your coefficients, how well your linear regression can be used to predict correctly, et cetera.
mypredict3<- predict(mymodel4, mydata3, type = "response")
mypredict_survived3 <- ifelse(mypredict3 > 0.5, 1, 0)
table(mypredict_survived3)
## mypredict_survived3
## 0 1
## 587 304
mymatrix3 <- table(mydata3$Survived, mypredict_survived3)
print(mymatrix3)
## mypredict_survived3
## 0 1
## 0 480 69
## 1 107 235
myaccuracy2 <- sum(diag(mymatrix3)) / sum(mymatrix3)
print(myaccuracy2)
## [1] 0.8024691
Our earlier exploration showed that most Titanic survivors were young adults, which suggests that family variables such as SibSp may also help determine whether a passenger survived. Hence, instead of using mymodel5, which regresses Survived on Pclass, Sex, and Age, I used mymodel4, which adds SibSp and Parch. Including these variables raises the accuracy on the full training data from 0.794 to 0.802, so SibSp and Parch carry some additional predictive information.
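One natural extension, sketched below but not run here, is to swap the linear probability model for the logistic regression already used in Question 7, which keeps predicted probabilities inside [0, 1]; mymodel6 and mypredict4 are illustrative names and the resulting accuracy is not reported:
# Logistic alternative on the same predictors
mymodel6 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch,
                data = trainsubset, family = binomial)
mypredict4 <- ifelse(predict(mymodel6, mydata3, type = "response") > 0.5, 1, 0)
mean(mypredict4 == mydata3$Survived)   # classification accuracy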