---
title: "Final Exam"
author: "Doru Cojoc"
date: "2024-03-03"
output:
  html_document:
    toc: no
    number_sections: yes
---
Welcome to the Final! Please read the instructions below before starting the exam.
Tips for cracking the exam:
Good luck!
IQ scores are known to closely follow a Gaussian (normal) distribution with a mean of 100 and a population standard deviation of 16 (Stanford-Binet).
A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.
1-pnorm(q=110, mean = 100, sd = 16, lower.tail = TRUE, log.p = FALSE)
0.2659855
Probability statement: \(P(X \ge 110) = 1 - P(X < 110)\). Subtracting pnorm() from 1 yields the right-tail probability, in this case the probability that a randomly selected person's IQ is 110 or higher.
What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with \(\mu=100\) and \(\sigma=16\)? Be sure to write the probability statement and show your R code.
1-pnorm(q=110, mean = 100, sd = 16/sqrt(12), lower.tail = TRUE, log.p = FALSE)
0.01519141
Probability statement: \(P(\bar{X} \ge 110)\). The CLT tells us that the mean of a sample of n = 12 IQ scores is approximately normal with mean 100, so we can calculate this probability the same way as in the previous part, substituting the standard deviation with sd/sqrt(n) (the standard error of the mean).
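As an informal check (a sketch, not required by the question), this probability can also be approximated by simulating many 12-person samples:
set.seed(1)   # arbitrary seed, assumed only for reproducibility
sim_means <- replicate(100000, mean(rnorm(12, mean = 100, sd = 16)))
mean(sim_means >= 110)   # should come out close to 0.0152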
Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.
A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it, and it uses the point estimate from this survey as evidence. Conduct a hypothesis test with \(\alpha=.05\) to determine whether these data provide strong evidence that more than 50% of adult Americans do not go to college for financial reasons. What are the critical value, test statistic, and p-value for the proportion of Americans who do not go to college for financial reasons? Do you reject or fail to reject the null?
\[H_0: \pi \le 0.50, \quad H_a: \pi > 0.50\]
Ho: the proportion of American adults (without a four-year college degree and not enrolled in school) who do not go to college for financial reasons is less than or equal to 0.50.
Ha: that proportion is greater than 0.50.
Test statistic distribution: z is the standard choice for a one-sample proportion test, since the standard error under the null is known; with n = 331, the t distribution with n − 1 degrees of freedom used below gives essentially the same answer.
myp <- function(p, alpha) {if (p < alpha) {print('REJECT Ho')} else {print('FAIL 2 REJECT')}}
phat <- 0.48   # sample proportion
p <- 0.50      # hypothesized population value
q <- 1 - p
alpha <- 0.05
n <- 331       # sample size (must be defined before df)
df <- n - 1
se <- sqrt(p*q/n)
t <- (phat - p)/se   # t <- (sample estimate - null hypothesis value)/se
test_statistic <- t
test_statistic
-0.7277362
critical_value <- qt(p=0.95, df = n-1)   # upper-tail critical value, since Ha is one-sided (pi > 0.50)
critical_value
1.649484
p_value <- 1 - pnorm(t)   # right-tail probability beyond the test statistic, matching Ha: pi > 0.50
p_value
0.7666125
alpha
0.05
myp(p = p_value, alpha = alpha)
"FAIL 2 REJECT"
We find that the critical value is 1.649, the test statistic is t = -0.73 (on the t distribution with n - 1 = 330 degrees of freedom), and the p-value is 0.767.
Since the p-value exceeds \(\alpha = 0.05\) (equivalently, the test statistic does not exceed the critical value), we fail to reject the null hypothesis: the data do not provide strong evidence that more than 50% of American adults without a four-year college degree decide not to go to college for financial reasons.
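As an informal cross-check (a sketch, not required by the question), base R's prop.test() runs the same one-sided test from the counts; the exact number of successes is not reported, so treating it as 159 ≈ 0.48 × 331 is an assumption:
prop.test(x = 159, n = 331, p = 0.5, alternative = "greater", correct = FALSE)
# the p-value should come out well above 0.05, matching the decision above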
Would you expect a confidence interval (\(\alpha=.05\), double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.
We construct the two-sided confidence interval using a tail area of 0.025 on each side.
critical_value <- qt(p=0.975, df = n-1)
lower <- phat - critical_value*se   # se is already sqrt(p*q/n), so we do not divide by sqrt(n) again
upper <- phat + critical_value*se
lower
0.4259
upper
0.5341
The 95% confidence interval for the proportion of American adults who decide not to go to college for financial reasons runs from roughly 0.43 to 0.53, so it does include the hypothesized value 0.5. This is what we should expect: the test statistic (-0.73) is well inside the two-sided critical values, so a two-sided interval at \(\alpha=.05\) must contain 0.5.
You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.
Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different? Assume \(\alpha = 5\%\).
\[H_o: \mu_1-\mu_2 = 0, H_a: \mu_1-\mu_2
\neq 0 \] Ho: The difference of the two means is equal to
zero.
Ha: The difference of the two means is not equal to zero.
Using alpha=0.05 and the t distribution.
mu1 <- 4.9   # set given values
mu2 <- 6.1
alpha <- 0.05
n1 <- 22
n2 <- 22
df1 <- n1-1
df2 <- n2-1
sd1 <- 1.8
sd2 <- 1.8
var1 <- sd1^2
var2 <- sd2^2
num_point_estimate_diff <- (mu1 - mu2)   # point estimate of the difference / numerator term
den_Se <- sqrt(var1/n1 + var2/n2)   # standard error / denominator term
t <- (num_point_estimate_diff - 0)/den_Se   # ratio between (point estimate - hypothesized value 0) and standard error
test_statistic <- t
test_statistic
-2.211083
numdf <- (var1/n1 + var2/n2)^2
dendf <- (var1/n1)^2/df1 + (var2/n2)^2/df2
df <- numdf/dendf   # Welch-Satterthwaite approximation of the degrees of freedom (about 42 here)
p_value <- 2*pt(q = -abs(t), df = df)   # multiplied by 2 to account for two tails; use the tail beyond |t| so the p-value stays between 0 and 1
myp(p = p_value, alpha = alpha)
"REJECT Ho"
The two-tailed p-value is roughly 0.03, which is below the significance level of 0.05, so we reject the null hypothesis that the difference between the two means is equal to zero. The data therefore do provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups differs.
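A complementary check using the same summary statistics (a sketch, not part of the required answer): the 95% confidence interval for the difference in means excludes zero, which matches the decision to reject.
ci <- (mu1 - mu2) + c(-1, 1) * qt(0.975, df = df) * den_Se
ci   # approximately (-2.3, -0.1)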
Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.
sample_mean <- (65+77)/2   # the sample mean is the midpoint of the confidence interval
sample_mean
71
margin_of_error <- (77-65)/2   # the margin of error is half the width of the interval
margin_of_error
6
\[\text{Margin of error} = z_{\alpha/2} \times \sigma/\sqrt{n}\]
Solving for the population standard deviation gives
\[\sigma = \frac{\text{Margin of error} \times \sqrt{n}}{z_{\alpha/2}}\]
n <- 25   # sample size given in the problem (n was last set to 331 above)
z_halfalpha <- qnorm(p=0.95, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
z_halfalpha
1.644854
sd <- (margin_of_error*sqrt(n))/z_halfalpha
sd
18.2387
se <- sd/sqrt(n)   # the standard error is the ratio of the population standard deviation to the square root of the sample size
se
3.647741
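Because the population standard deviation is said to be unknown, the interval may well have been built from the t distribution rather than the standard normal; a sketch of the t-based back-calculation (an assumption, since the problem does not say which distribution was used):
t_halfalpha <- qt(p = 0.95, df = n - 1)            # about 1.711 for df = 24
sd_t <- (margin_of_error * sqrt(n)) / t_halfalpha  # about 17.5
se_t <- sd_t / sqrt(n)                             # about 3.5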
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -
mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
## control treatment
## alive 4 24
## dead 30 45
Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?
An important condition for using a normal approximation for a binomial proportion is that each group being studied (in this case treatment and control) must have at least 5 successes (alive) and 5 failures (dead) (or 10 of each for more certainty). The control group only has 4 successes, so the conditions for using a normal approximation are not met.
Constructing a confidence interval despite this problem is not advisable. With so few survivors in the control group, the sampling distribution of the difference in proportions is skewed rather than approximately normal, so a normal-based interval's symmetric margin of error and nominal 95% coverage would be misleading, and conclusions drawn from such an interval could not be trusted with a high degree of confidence.
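If an inference is still wanted despite the small counts, an exact method that does not rely on the normal approximation is an option (a sketch, not required by the question):
fisher.test(mym)   # Fisher's exact test on the 2x2 table; also returns an exact CI for the odds ratio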
Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.
\[H_0: \beta_1 = 0,\quad H_a: \beta_1 \neq 0\]
lmfunc <- lm(y ~ x)
hyp_func <- function(y, x, lmfunc, t, alpha, p) {
  if (p < alpha) {
    return(list(t, p, "Reject Ho"))
  } else {
    return(list(t, p, "Fail to Reject Ho"))
  }
}
Where y=dependent variable, x=independent variable, lmfunc is our lm
object, t is the test statistic from the model, alpha is a given
significance level, and p is the p-value from the model.
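One possible refinement (a sketch, with hypothetical names such as lm_ttest): instead of passing t and p in by hand, the function can pull the estimate and standard error of the tested coefficient out of the "lm" object and compute the test statistic and p-value itself, using the usual coef(summary(fit)) layout:
lm_ttest <- function(fit, coef_name, null_value = 0, alpha = 0.05) {
  ctab <- coef(summary(fit))               # coefficient table: Estimate, Std. Error, ...
  est  <- ctab[coef_name, "Estimate"]
  se   <- ctab[coef_name, "Std. Error"]
  t_stat <- (est - null_value) / se        # t statistic against the hypothesized value
  p <- 2 * pt(-abs(t_stat), df = fit$df.residual)   # two-sided p-value on the t distribution
  list(test_statistic = t_stat, p_value = p,
       conclusion = ifelse(p < alpha, "Reject Ho", "Fail to Reject Ho"))
}
# e.g. lm_ttest(lmfunc, "nsw$treat") should reproduce t = 1.512 and p = 0.131 for the regression below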
Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.
Let’s test this function on the nsw dataset from question 8.
lmfunc<-lm(nsw$increase~nsw$treat)
hyp_func(nsw$increase, nsw$treat, lmfunc, t=1.512, alpha=0.05, p=0.131)
[[1]]
[1] 1.512
[[2]]
[1] 0.131
[[3]]
[1] "Fail to Reject Ho"
The function returns a list of all the values we are interested in.
In order to obtain the arguments of the function, we must first run the
linear regression model. From there, we extract the appropriate test
statistics and p-values and choose an appropriate significance
level.
Import the smoking.csv file into R. The variables are “smoker” - whether or not one smokes; “snmkban” - whether or not one faces a smoking ban at one’s workplace (this is fairly old data); “age” - in years; “hsdrop” - high school dropout; “hsgrad” - graduated from high school but has no further education; “colsome” - has some college education; “colgrad” - graduated from college, as well as some self-explanatory demographic variables.
Estimate a logit regression model of whether or not one smokes as a function of the other variables.
smoking <- read.csv("C:/Users/brady/OneDrive/Documents/smoking.csv")   # load csv file
smokingglm <- glm(smoker ~ smkban + age + hsdrop + hsgrad + colsome + colgrad + black + hispanic + female, data = smoking, family = 'binomial')   # logistic regression model of whether or not someone smokes as a function of the other variables
summary(smokingglm)   # display the estimated model
Call:
glm(formula = smoker ~ smkban + age + hsdrop + hsgrad + colsome + colgrad + black + hispanic + female, family = "binomial", data = smoking)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.696765 0.138696 -12.234 < 2e-16 ***
smkban -0.250735 0.049164 -5.100 3.40e-07 ***
age -0.007452 0.001987 -3.751 0.000176 ***
hsdrop 1.931075 0.131261 14.712 < 2e-16 ***
hsgrad 1.523305 0.114308 13.326 < 2e-16 ***
colsome 1.180080 0.116518 10.128 < 2e-16 ***
colgrad 0.424753 0.125801 3.376 0.000734 ***
black -0.149472 0.089994 -1.661 0.096732 .
hispanic -0.584845 0.083085 -7.039 1.93e-12 ***
female -0.188720 0.049105 -3.843 0.000121 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11074 on 9999 degrees of freedom
Residual deviance: 10502 on 9990 degrees of freedom
AIC: 10522
Number of Fisher Scoring iterations: 4
What does the coefficient on “snmkban” tell you? How about the coefficient on “age”? How about the coefficient on “colgrad”?
smkban: The coefficient on smkban of -0.25 is on the log-odds scale: facing a smoking ban at one's workplace is associated with lower odds of being a smoker (the odds are multiplied by exp(-0.25) ≈ 0.78), holding the other variables constant.
age: The coefficient on age of -0.007 means that each additional year of age is associated with a small decrease in the log-odds of smoking (odds multiplied by roughly 0.993 per year); the effect is small in magnitude even though it is statistically significant.
colgrad: The coefficient on colgrad of 0.42 means that, relative to the omitted education category, college graduates have higher odds of smoking (odds multiplied by exp(0.42) ≈ 1.5), holding the other variables constant.
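To read all of the coefficients on the odds scale at once, the fitted model can be exponentiated (a sketch):
exp(coef(smokingglm))              # odds ratios for a one-unit change in each variable
exp(confint.default(smokingglm))   # Wald confidence intervals, also on the odds-ratio scale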
Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.
Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?
nsw <- read.csv("C:/Users/brady/OneDrive/Documents/nsw.csv")   # load csv in R
nsw
nsw$increase <- nsw$re78 - nsw$re75   # create a variable for the increase in real earnings between 1975 and 1978
nsw
nswlm = lm(increase ~ treat, data = nsw)   # create linear model
summary(nswlm)   # print summary results
Call:
lm(formula = increase ~ treat, data = nsw)
Residuals:
Min 1Q Median 3Q Max
-37995 -3198 -1442 3677 56114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2063.4 359.3 5.743 1.37e-08 ***
treat 846.9 560.1 1.512 0.131
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7406 on 720 degrees of freedom
Multiple R-squared: 0.003165, Adjusted R-squared: 0.00178
F-statistic: 2.286 on 1 and 720 DF, p-value: 0.131
Effect: According to the model, workers who received work experience and counseling in a sheltered environment saw, on average, an increase in real earnings between 1975 and 1978 that was about 846.9 dollars larger than the increase for workers in the control group.
\[H_0: Increase_0 = Increase_1,\:H_a:
Increase_0 \neq Increase_1\] Null hypothesis: The increase in
real earnings between 1975 and 1978 of the control group is equal to
that of the treat group (no effect)
Alternative hypothesis: The increase in real earnings between 1975 and
1978 of the control group is not equal to that of the treat group (there
is an effect)
myp(p = 0.131, alpha = 0.1)
"FAIL 2 REJECT"
Since the p-value is greater than the significance level, we fail to
reject the null hypothesis and conclude that the effect of a worker
receiving work experience and counseling in a sheltered environment on
the increase in real earnings between 1975 and 1978 is not statistically
significant at the 10% significance level.
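An equivalent way to see this (a sketch): the 90% confidence interval for the treat coefficient should contain 0, mirroring the failure to reject at the 10% level.
confint(nswlm, "treat", level = 0.90)   # should include 0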
Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?
nswlm = lm(increase ~ treat + age + education + black + hispanic + married + nodegree + re75 + re78, data = nsw)
summary(nswlm)
Call:
lm(formula = increase ~ treat + age + education + black + hispanic + married + nodegree + re75 + re78, data = nsw)
Residuals:
Min 1Q Median 3Q Max
-1.127e-10 -9.630e-13 -1.460e-13 7.760e-13 8.713e-11
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.099e-12 2.924e-12 -1.060e+00 0.28966
treat 5.203e-13 5.256e-13 9.900e-01 0.32253
age 1.786e-13 4.057e-14 4.402e+00 1.24e-05 ***
education 6.360e-13 2.015e-13 3.157e+00 0.00166 **
black -1.478e-12 9.005e-13 -1.642e+00 0.10107
hispanic -1.906e-12 1.173e-12 -1.625e+00 0.10460
married 6.774e-12 7.312e-13 9.265e+00 < 2e-16 ***
nodegree 4.941e-13 8.331e-13 5.930e-01 0.55329
re75 -1.000e+00 5.284e-17 -1.893e+16 < 2e-16 ***
re78 1.000e+00 4.198e-17 2.382e+16 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.897e-12 on 712 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 9.254e+31 on 9 and 712 DF, p-value: < 2.2e-16
Effect: With these controls in place, the coefficient on treat is 5.203x10^-13, essentially zero. This is a mechanical artifact: because increase = re78 - re75 and both re75 and re78 are included as regressors, the regression is a perfect identity (coefficients of -1 and +1 on re75 and re78, an R-squared of 1, and residuals on the order of 1e-12), so the treat coefficient carries no information about the treatment effect.
\[H_0: Increase_0 = Increase_1,\:H_a:
Increase_0 \neq Increase_1\]
myp(p = 0.32253, alpha = 0.10)
"FAIL 2 REJECT"
Since the p-value for treat (0.32253) is greater than the significance level of 0.10, the treat coefficient is not statistically significant at the 10% level, but in this specification that result is not meaningful.
The problem is not that the model is "crowded" but that re78 was included as a control: together with re75 it determines the outcome exactly, so the regression cannot tell us anything about the effect of treat on increase. Control variables should be restricted to pre-treatment characteristics (age, education, race/ethnicity, marital status, degree status, and re75), as sketched below.
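A sketch of that more sensible specification (nswlm2 is a hypothetical name), keeping only pre-treatment controls:
nswlm2 <- lm(increase ~ treat + age + education + black + hispanic + married + nodegree + re75, data = nsw)
summary(nswlm2)$coefficients["treat", ]   # estimate, SE, t value, and p-value for treat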
Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?
nsw_long <- data.frame(earnings = c(nsw$re75, nsw$re78), treated = rep(nsw$treat, 2), post = rep(c(0, 1), each = nrow(nsw)))   # one row per worker-year: 1975 (post = 0) and 1978 (post = 1)
The objective of this problem is to create a data frame with a time (post-program) indicator and interact it with the treatment indicator in order to study the effect of treatment on earnings over time. Using DID, we can examine the groups both before and after treatment and compare the change in earnings from 1975 to 1978 in the treatment group with the change in the control group, as sketched below.
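A sketch of the difference-in-differences regression on this long data frame; the coefficient on the interaction term is the DID estimate, and with only two periods and two groups it should match the 846.9 found in the univariate regression above:
didlm <- lm(earnings ~ treated * post, data = nsw_long)
summary(didlm)   # the treated:post coefficient is the estimated program effect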
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
Import the titanic train.csv file into R. You will have to clean the data to run some of the commands/set up the workflow.
Data clean up:
train <- read.csv("C:/Users/brady/OneDrive/Documents/train.csv")   # load the titanic training data
train$Age[is.na(train$Age)] <- median(train$Age, na.rm=TRUE)   # set NA values to the column median
train$SibSp[is.na(train$SibSp)] <- median(train$SibSp, na.rm=TRUE)
train$Parch[is.na(train$Parch)] <- median(train$Parch, na.rm=TRUE)
train$sexnumeric <- as.numeric(factor(train$Sex))   # convert categorical variables to numeric codes
train$embarkednumeric <- as.numeric(factor(train$Embarked))
cor(train$Survived, train$sexnumeric)   # correlation coefficients between Survived and selected variables from the train dataset
-0.5433514
cor(train$Survived, train$Age)
-0.06491042
cor(train$Survived, train$SibSp)
-0.0353225
cor(train$Survived, train$Parch)
0.08162941
cor(train$Survived, train$embarkednumeric)
-0.1765092
library('psych')
describe(train)   # yields descriptive statistics of the train dataset
vars n mean sd median trimmed mad min max range skew kurtosis se
PassengerId 1 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00 0.00 -1.20 8.62
Survived 2 891 0.38 0.49 0.00 0.35 0.00 0.00 1.00 1.00 0.48 -1.77 0.02
Pclass 3 891 2.31 0.84 3.00 2.39 0.00 1.00 3.00 2.00 -0.63 -1.28 0.03
Name* 4 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00 0.00 -1.20 8.62
Sex* 5 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00 -0.62 -1.62 0.02
Age 6 891 29.36 13.02 28.00 28.83 8.90 0.42 80.00 79.58 0.51 0.97 0.44
SibSp 7 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00 3.68 17.73 0.04
Parch 8 891 0.38 0.81 0.00 0.18 0.00 0.00 6.00 6.00 2.74 9.69 0.03
Ticket* 9 891 339.52 200.83 338.00 339.65 268.35 1.00 681.00 680.00 0.00 -1.28 6.73
Fare 10 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33 4.77 33.12 1.66
Cabin* 11 891 18.63 38.14 1.00 8.29 0.00 1.00 148.00 147.00 2.09 3.07 1.28
Embarked* 12 891 3.53 0.80 4.00 3.66 0.00 1.00 4.00 3.00 -1.27 -0.16 0.03
sexnumeric 14 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00 -0.62 -1.62 0.02
embarkednumeric 15 891 3.53 0.80 4.00 3.66 0.00 1.00 4.00 3.00 -1.27 -0.16 0.03
Interesting finds: The variables PassengerId and Name* have exactly the same statistics; Name is a factor with 891 unique levels, so its integer codes run from 1 to 891 just like PassengerId, and neither carries meaningful numeric information. A majority of the variables exhibit negative kurtosis. I presume this is because most of them can only take on a handful of values, which makes their distributions flatter and lighter-tailed than a normal distribution.
set.seed(100)   # set seed for reproducibility
library("dplyr")   # invoke dplyr
train <- sample_n(train, 500)   # create a subset of train with 500 observations
survivedlm <- lm(train$Survived ~ train$Sex + train$Pclass + train$Age + train$SibSp + train$Parch + train$Embarked)
survivedlm
Call:
lm(formula = train$Survived ~ train$Sex + train$Pclass + train$Age + train$SibSp + train$Parch + train$Embarked)
Coefficients:
(Intercept) train$Sexmale train$Pclass train$Age train$SibSp train$Parch train$EmbarkedQ train$EmbarkedS
1.343341 -0.462952 -0.168571 -0.006615 -0.038300 -0.013013 -0.033922 -0.059378
mypred = predict(survivedlm, train, type='response')
mypred   # prints the predicted probability of survival for each observation
mypred[mypred < 0.5] = 0   # categorize observations as 0 or 1
mypred[mypred > 0.5] = 1
mypred   # prints survived (1) or not survived (0) for each observation, based on the condition set in the previous two lines
train <- read.csv("C:/Users/brady/OneDrive/Documents/train.csv")   # reload the full train dataset
I redid the data cleanup steps and redefined survivedlm and the predict call to reflect the full train dataset.
table(mypred, train$Survived)   # confusion matrix
mypred 0 1
0 472 103
1 77 239
This model correctly predicts roughly 79.8% ((472+239)/891) of the outcomes in the training data.
glm2<-glm(train$Survived~train$Pclass+train$Sex+train$Age+train$Fare, family='binomial')
mypred=predict(glm2, train, type='response')
mypred[mypred<0.5]=0
mypred[mypred>0.5]=1
table(mypred, train$Survived)
mypred 0 1
0 469 98
1 80 244
(469+244)/891
0.8002245
After numerous attempts at attaining an accuracy higher than 80%, I found that adding Fare to the model increased the in-sample accuracy to about 80.0%.
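These accuracy figures are computed on the same observations used to fit the model; a sketch of a held-out check (assuming the NA imputation above has been applied to the reloaded train data) might look like this:
set.seed(100)   # seed chosen arbitrarily for reproducibility
idx <- sample(seq_len(nrow(train)), size = round(0.8 * nrow(train)))   # 80/20 split
holdout_fit <- glm(Survived ~ Pclass + Sex + Age + Fare, data = train[idx, ], family = "binomial")
holdout_pred <- as.numeric(predict(holdout_fit, newdata = train[-idx, ], type = "response") > 0.5)
mean(holdout_pred == train$Survived[-idx])   # out-of-sample accuracy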
When considering various models, we must keep in mind the key assumptions for OLS: the residuals of the model must be nearly normal, their variability must be nearly constant, the observations must be independent, and each predictor must be linearly related to the outcome. Taking a deeper dive into the model I came up with, passenger class and fare are strongly negatively correlated (Pclass is coded 1 = first class, so higher fares go with lower class numbers). This is a collinearity concern between two predictors rather than a violation of the independence assumption:
cor(train$Pclass, train$Fare)
-0.5494996
The Q-Q plot of the residuals for this model appears mostly normal, although it veers slightly away from the reference line towards the middle of the data. Linearity of each predictor's relationship with the outcome should be judged from residual plots rather than inferred from the mere presence of a coefficient in the fitted equation.
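A sketch of how to produce these diagnostics in base R (the second panel of plot.lm is the Q-Q plot discussed above):
par(mfrow = c(1, 2))
plot(survivedlm, which = 1)   # residuals vs fitted values (constant-variance check)
plot(survivedlm, which = 2)   # normal Q-Q plot of the residuals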
Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model - interpretation of your coefficients, how well your linear regression can be used to predict correctly, et cetera.
The following modelling approaches may be helpful (read, and focus on relevant sections) -
https://www.kaggle.com/code/jeremyd/titanic-logistic-regression-in-r
https://www.kaggle.com/code/hillabehar/titanic-analysis-with-r/report