---
title: "Final Exam"
author: "Doru Cojoc"
date: "2024-03-03"
output:
  html_document:
    toc: no
    number_sections: yes
---
Welcome to the Final! Please read the instructions below before starting the exam.
Tips for cracking the exam:
Good luck!
IQ scores are known to closely follow a Gaussian (normal) distribution with a mean of 100 and a population standard deviation of 16 (Stanford-Binet).
A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.
1-pnorm(q=110, mean = 100, sd = 16, lower.tail = TRUE, log.p = FALSE)
0.2659855
Probability statement: \(P(X \ge 110) = 1 - P(X < 110)\). Subtracting pnorm() from 1 yields the right-tail probability, in this case the probability that a randomly selected person's IQ is 110 or higher.
What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with \(\mu=100\) and \(\sigma=16\)? Be sure to write the probability statement and show your R code.
1-pnorm(q=110, mean = 100, sd = 16/sqrt(12), lower.tail = TRUE, log.p = FALSE)
0.01519141
Probability statement: \(P(\bar{X} \ge 110)\). The CLT tells us that the mean of a sample of n = 12 IQ scores is approximately normal with mean 100, so we can calculate this probability the same way as in the previous part, substituting the standard deviation with sd/sqrt(n) (the standard error of the mean).
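As an informal check (a sketch, not required by the question), this probability can also be approximated by simulating many 12-person samples:
set.seed(1)   # arbitrary seed, assumed only for reproducibility
sim_means <- replicate(100000, mean(rnorm(12, mean = 100, sd = 16)))
mean(sim_means >= 110)   # should come out close to 0.0152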
Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.
A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it, and it uses the point estimate from this survey as evidence. Conduct a hypothesis test with \(\alpha=.05\) to determine whether these data provide strong evidence that more than 50% of adult Americans do not go to college for financial reasons. What are the critical value, test statistic, and p-value for the proportion of Americans who do not go to college for financial reasons? Do you reject or fail to reject the null?
\[H_0: \pi \le 0.50, \quad H_a: \pi > 0.50\]
Ho: the proportion of American adults (without a four-year college degree and not enrolled in school) who do not go to college for financial reasons is less than or equal to 0.50.
Ha: that proportion is greater than 0.50.
Test statistic distribution: z is the standard choice for a one-sample proportion test, since the standard error under the null is known; with n = 331, the t distribution with n − 1 degrees of freedom used below gives essentially the same answer.
myp <- function(p, alpha) {if (p < alpha) {print('REJECT Ho')} else {print('FAIL 2 REJECT')}}
phat <- 0.48   # sample proportion
p <- 0.50      # hypothesized population value
q <- 1 - p
alpha <- 0.05
n <- 331       # sample size (must be defined before df)
df <- n - 1
se <- sqrt(p*q/n)
t <- (phat - p)/se   # t <- (sample estimate - null hypothesis value)/se
test_statistic <- t
test_statistic
-0.7277362
critical_value <- qt(p=0.95, df = n-1)   # upper-tail critical value, since Ha is one-sided (pi > 0.50)
critical_value
1.649484
p_value <- 1 - pnorm(t)   # right-tail probability beyond the test statistic, matching Ha: pi > 0.50
p_value
0.7666125
alpha
0.05
myp(p = p_value, alpha = alpha)
"FAIL 2 REJECT"
We find that the critical value is 1.649, the test statistic is t = -0.73 (on the t distribution with n - 1 = 330 degrees of freedom), and the p-value is 0.767.
Since the p-value exceeds \(\alpha = 0.05\) (equivalently, the test statistic does not exceed the critical value), we fail to reject the null hypothesis: the data do not provide strong evidence that more than 50% of American adults without a four-year college degree decide not to go to college for financial reasons.
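As an informal cross-check (a sketch, not required by the question), base R's prop.test() runs the same one-sided test from the counts; the exact number of successes is not reported, so treating it as 159 ≈ 0.48 × 331 is an assumption:
prop.test(x = 159, n = 331, p = 0.5, alternative = "greater", correct = FALSE)
# the p-value should come out well above 0.05, matching the decision above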
Would you expect a confidence interval (\(\alpha=.05\), double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.
We construct the two-sided confidence interval using a tail area of 0.025 on each side.
critical_value <- qt(p=0.975, df = n-1)
lower <- phat - critical_value*se   # se is already sqrt(p*q/n), so we do not divide by sqrt(n) again
upper <- phat + critical_value*se
lower
0.4259
upper
0.5341
The 95% confidence interval for the proportion of American adults who decide not to go to college for financial reasons runs from roughly 0.43 to 0.53, so it does include the hypothesized value 0.5. This is what we should expect: the test statistic (-0.73) is well inside the two-sided critical values, so a two-sided interval at \(\alpha=.05\) must contain 0.5.
You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.
Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different? Assume \(\alpha = 5\%\).
\[H_o: \mu_1-\mu_2 = 0, H_a: \mu_1-\mu_2
\neq 0 \] Ho: The difference of the two means is equal to
zero.
Ha: The difference of the two means is not equal to zero.
Using alpha=0.05 and the t distribution.
mu1 <- 4.9   # set given values
mu2 <- 6.1
alpha <- 0.05
n1 <- 22
n2 <- 22
df1 <- n1-1
df2 <- n2-1
sd1 <- 1.8
sd2 <- 1.8
var1 <- sd1^2
var2 <- sd2^2
num_point_estimate_diff <- (mu1 - mu2)   # point estimate of the difference / numerator term
den_Se <- sqrt(var1/n1 + var2/n2)   # standard error / denominator term
t <- (num_point_estimate_diff - 0)/den_Se   # ratio between (point estimate - hypothesized value 0) and standard error
test_statistic <- t
test_statistic
-2.211083
numdf <- (var1/n1 + var2/n2)^2
dendf <- (var1/n1)^2/df1 + (var2/n2)^2/df2
df <- numdf/dendf   # Welch-Satterthwaite approximation of the degrees of freedom (about 42 here)
p_value <- 2*pt(q = -abs(t), df = df)   # multiplied by 2 to account for two tails; use the tail beyond |t| so the p-value stays between 0 and 1
myp(p = p_value, alpha = alpha)
"REJECT Ho"
The two-tailed p-value is roughly 0.03, which is below the significance level of 0.05, so we reject the null hypothesis that the difference between the two means is equal to zero. The data therefore do provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups differs.
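A complementary check using the same summary statistics (a sketch, not part of the required answer): the 95% confidence interval for the difference in means excludes zero, which matches the decision to reject.
ci <- (mu1 - mu2) + c(-1, 1) * qt(0.975, df = df) * den_Se
ci   # approximately (-2.3, -0.1)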
Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.
sample_mean <- (65+77)/2   # the sample mean is the midpoint of the confidence interval
sample_mean
71
margin_of_error <- (77-65)/2   # the margin of error is half the width of the interval
margin_of_error
6
\[\text{Margin of error} = z_{\alpha/2} \times \sigma/\sqrt{n}\]
Solving for the population standard deviation gives
\[\sigma = \frac{\text{Margin of error} \times \sqrt{n}}{z_{\alpha/2}}\]
n <- 25   # sample size given in the problem (n was last set to 331 above)
z_halfalpha <- qnorm(p=0.95, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
z_halfalpha
1.644854
sd <- (margin_of_error*sqrt(n))/z_halfalpha
sd
18.2387
se <- sd/sqrt(n)   # the standard error is the ratio of the population standard deviation to the square root of the sample size
se
3.647741
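Because the population standard deviation is said to be unknown, the interval may well have been built from the t distribution rather than the standard normal; a sketch of the t-based back-calculation (an assumption, since the problem does not say which distribution was used):
t_halfalpha <- qt(p = 0.95, df = n - 1)            # about 1.711 for df = 24
sd_t <- (margin_of_error * sqrt(n)) / t_halfalpha  # about 17.5
se_t <- sd_t / sqrt(n)                             # about 3.5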
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -
mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
## control treatment
## alive 4 24
## dead 30 45
Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?
An important condition for using a normal approximation for a binomial proportion is that each group being studied (in this case treatment and control) must have at least 5 successes (alive) and 5 failures (dead) (or 10 of each for more certainty). The control group only has 4 successes, so the conditions for using a normal approximation are not met.
Constructing a confidence interval despite this problem is not advisable. With so few survivors in the control group, the sampling distribution of the difference in proportions is skewed rather than approximately normal, so a normal-based interval's symmetric margin of error and nominal 95% coverage would be misleading, and conclusions drawn from such an interval could not be trusted with a high degree of confidence.
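If an inference is still wanted despite the small counts, an exact method that does not rely on the normal approximation is an option (a sketch, not required by the question):
fisher.test(mym)   # Fisher's exact test on the 2x2 table; also returns an exact CI for the odds ratio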
Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.
\[H_0: \beta_1 = 0,\quad H_a: \beta_1 \neq 0\]
lmfunc <- lm(y ~ x)
hyp_func <- function(y, x, lmfunc, t, alpha, p) {
  if (p < alpha) {
    return(list(t, p, "Reject Ho"))
  } else {
    return(list(t, p, "Fail to Reject Ho"))
  }
}
Where y=dependent variable, x=independent variable, lmfunc is our lm
object, t is the test statistic from the model, alpha is a given
significance level, and p is the p-value from the model.
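One possible refinement (a sketch, with hypothetical names such as lm_ttest): instead of passing t and p in by hand, the function can pull the estimate and standard error of the tested coefficient out of the "lm" object and compute the test statistic and p-value itself, using the usual coef(summary(fit)) layout:
lm_ttest <- function(fit, coef_name, null_value = 0, alpha = 0.05) {
  ctab <- coef(summary(fit))               # coefficient table: Estimate, Std. Error, ...
  est  <- ctab[coef_name, "Estimate"]
  se   <- ctab[coef_name, "Std. Error"]
  t_stat <- (est - null_value) / se        # t statistic against the hypothesized value
  p <- 2 * pt(-abs(t_stat), df = fit$df.residual)   # two-sided p-value on the t distribution
  list(test_statistic = t_stat, p_value = p,
       conclusion = ifelse(p < alpha, "Reject Ho", "Fail to Reject Ho"))
}
# e.g. lm_ttest(lmfunc, "nsw$treat") should reproduce t = 1.512 and p = 0.131 for the regression below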
Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.
Let’s test this function on the nsw dataset from question 8.
lmfunc<-lm(nsw$increase~nsw$treat)
hyp_func(nsw$increase, nsw$treat, lmfunc, t=1.512, alpha=0.05, p=0.131)
[[1]]
[1] 1.512
[[2]]
[1] 0.131
[[3]]
[1] "Fail to Reject Ho"
The function returns a list of all the values we are interested in.
In order to obtain the arguments of the function, we must first run the
linear regression model. From there, we extract the appropriate test
statistics and p-values and choose an appropriate significance
level.
Import the smoking.csv file into R. The variables are “smoker” - whether or not one smokes; “snmkban” - whether or not one faces a smoking ban at one’s workplace (this is fairly old data); “age” - in years; “hsdrop” - high school dropout; “hsgrad” - graduated from high school but has no further education; “colsome” - has some college education; “colgrad” - graduated from college, as well as some self-explanatory demographic variables.
Estimate a logit regression model of whether or not one smokes as a function of the other variables.
smoking <- read.csv("C:/Users/brady/OneDrive/Documents/smoking.csv")   # load csv file
smokingglm <- glm(smoker ~ smkban + age + hsdrop + hsgrad + colsome + colgrad + black + hispanic + female, data = smoking, family = 'binomial')   # logistic regression model of whether or not someone smokes as a function of the other variables
summary(smokingglm)   # display the estimated model
Call:
glm(formula = smoker ~ smkban + age + hsdrop + hsgrad + colsome + colgrad + black + hispanic + female, family = "binomial", data = smoking)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.696765 0.138696 -12.234 < 2e-16 ***
smkban -0.250735 0.049164 -5.100 3.40e-07 ***
age -0.007452 0.001987 -3.751 0.000176 ***
hsdrop 1.931075 0.131261 14.712 < 2e-16 ***
hsgrad 1.523305 0.114308 13.326 < 2e-16 ***
colsome 1.180080 0.116518 10.128 < 2e-16 ***
colgrad 0.424753 0.125801 3.376 0.000734 ***
black -0.149472 0.089994 -1.661 0.096732 .
hispanic -0.584845 0.083085 -7.039 1.93e-12 ***
female -0.188720 0.049105 -3.843 0.000121 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11074 on 9999 degrees of freedom
Residual deviance: 10502 on 9990 degrees of freedom
AIC: 10522
Number of Fisher Scoring iterations: 4
What does the coefficient on “snmkban” tell you? How about the coefficient on “age”? How about the coefficient on “colgrad”?
smkban: The coefficient on smkban of -0.25 is on the log-odds scale: facing a smoking ban at one's workplace is associated with lower odds of being a smoker (the odds are multiplied by exp(-0.25) ≈ 0.78), holding the other variables constant.
age: The coefficient on age of -0.007 means that each additional year of age is associated with a small decrease in the log-odds of smoking (odds multiplied by roughly 0.993 per year); the effect is small in magnitude even though it is statistically significant.
colgrad: The coefficient on colgrad of 0.42 means that, relative to the omitted education category, college graduates have higher odds of smoking (odds multiplied by exp(0.42) ≈ 1.5), holding the other variables constant.
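To read all of the coefficients on the odds scale at once, the fitted model can be exponentiated (a sketch):
exp(coef(smokingglm))              # odds ratios for a one-unit change in each variable
exp(confint.default(smokingglm))   # Wald confidence intervals, also on the odds-ratio scale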
Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.
Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?
nsw <- read.csv("C:/Users/brady/OneDrive/Documents/nsw.csv")   # load csv in R
nsw
nsw$increase <- nsw$re78 - nsw$re75   # create a variable for the increase in real earnings between 1975 and 1978
nsw
nswlm = lm(increase ~ treat, data = nsw)   # create linear model
summary(nswlm)   # print summary results
Call:
lm(formula = increase ~ treat, data = nsw)
Residuals:
Min 1Q Median 3Q Max
-37995 -3198 -1442 3677 56114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2063.4 359.3 5.743 1.37e-08 ***
treat 846.9 560.1 1.512 0.131
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7406 on 720 degrees of freedom
Multiple R-squared: 0.003165, Adjusted R-squared: 0.00178
F-statistic: 2.286 on 1 and 720 DF, p-value: 0.131
Effect: According to the model, workers who received work experience and counseling in a sheltered environment saw, on average, an increase in real earnings between 1975 and 1978 that was about 846.9 dollars larger than the increase for workers in the control group.
\[H_0: Increase_0 = Increase_1,\:H_a:
Increase_0 \neq Increase_1\] Null hypothesis: The increase in
real earnings between 1975 and 1978 of the control group is equal to
that of the treat group (no effect)
Alternative hypothesis: The increase in real earnings between 1975 and
1978 of the control group is not equal to that of the treat group (there
is an effect)
myp(p = 0.131, alpha = 0.1)
"FAIL 2 REJECT"
Since the p-value is greater than the significance level, we fail to
reject the null hypothesis and conclude that the effect of a worker
receiving work experience and counseling in a sheltered environment on
the increase in real earnings between 1975 and 1978 is not statistically
significant at the 10% significance level.
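An equivalent way to see this (a sketch): the 90% confidence interval for the treat coefficient should contain 0, mirroring the failure to reject at the 10% level.
confint(nswlm, "treat", level = 0.90)   # should include 0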
Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?
nswlm = lm(increase ~ treat + age + education + black + hispanic + married + nodegree + re75 + re78, data = nsw)
summary(nswlm)
Call:
lm(formula = increase ~ treat + age + education + black + hispanic + married + nodegree + re75 + re78, data = nsw)
Residuals:
Min 1Q Median 3Q Max
-1.127e-10 -9.630e-13 -1.460e-13 7.760e-13 8.713e-11
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.099e-12 2.924e-12 -1.060e+00 0.28966
treat 5.203e-13 5.256e-13 9.900e-01 0.32253
age 1.786e-13 4.057e-14 4.402e+00 1.24e-05 ***
education 6.360e-13 2.015e-13 3.157e+00 0.00166 **
black -1.478e-12 9.005e-13 -1.642e+00 0.10107
hispanic -1.906e-12 1.173e-12 -1.625e+00 0.10460
married 6.774e-12 7.312e-13 9.265e+00 < 2e-16 ***
nodegree 4.941e-13 8.331e-13 5.930e-01 0.55329
re75 -1.000e+00 5.284e-17 -1.893e+16 < 2e-16 ***
re78 1.000e+00 4.198e-17 2.382e+16 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.897e-12 on 712 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 9.254e+31 on 9 and 712 DF, p-value: < 2.2e-16
Effect: With these controls in place, the coefficient on treat is 5.203x10^-13, essentially zero. This is a mechanical artifact: because increase = re78 - re75 and both re75 and re78 are included as regressors, the regression is a perfect identity (coefficients of -1 and +1 on re75 and re78, an R-squared of 1, and residuals on the order of 1e-12), so the treat coefficient carries no information about the treatment effect.
\[H_0: Increase_0 = Increase_1,\:H_a:
Increase_0 \neq Increase_1\]
myp(p = 0.32253, alpha = 0.10)
"FAIL 2 REJECT"
Since the p-value for treat (0.32253) is greater than the significance level of 0.10, the treat coefficient is not statistically significant at the 10% level, but in this specification that result is not meaningful.
The problem is not that the model is "crowded" but that re78 was included as a control: together with re75 it determines the outcome exactly, so the regression cannot tell us anything about the effect of treat on increase. Control variables should be restricted to pre-treatment characteristics (age, education, race/ethnicity, marital status, degree status, and re75), as sketched below.
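A sketch of that more sensible specification (nswlm2 is a hypothetical name), keeping only pre-treatment controls:
nswlm2 <- lm(increase ~ treat + age + education + black + hispanic + married + nodegree + re75, data = nsw)
summary(nswlm2)$coefficients["treat", ]   # estimate, SE, t value, and p-value for treat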
Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?
nsw_long <- data.frame(earnings = c(nsw$re75, nsw$re78), treated = rep(nsw$treat, 2), post = rep(c(0, 1), each = nrow(nsw)))   # one row per worker-year: 1975 (post = 0) and 1978 (post = 1)
The objective of this problem is to create a data frame with a time (post-program) indicator and interact it with the treatment indicator in order to study the effect of treatment on earnings over time. Using DID, we can examine the groups both before and after treatment and compare the change in earnings from 1975 to 1978 in the treatment group with the change in the control group, as sketched below.
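A sketch of the difference-in-differences regression on this long data frame; the coefficient on the interaction term is the DID estimate, and with only two periods and two groups it should match the 846.9 found in the univariate regression above:
didlm <- lm(earnings ~ treated * post, data = nsw_long)
summary(didlm)   # the treated:post coefficient is the estimated program effect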
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
Import the titanic train.csv file into R. You will have to clean the data to run some of the commands/set up the workflow.
Data clean up:
train <- read.csv("C:/Users/brady/OneDrive/Documents/train.csv")   # load the titanic training data
train$Age[is.na(train$Age)] <- median(train$Age, na.rm=TRUE)   # set NA values to the column median
train$SibSp[is.na(train$SibSp)] <- median(train$SibSp, na.rm=TRUE)
train$Parch[is.na(train$Parch)] <- median(train$Parch, na.rm=TRUE)
train$sexnumeric <- as.numeric(factor(train$Sex))   # convert categorical variables to numeric codes
train$embarkednumeric <- as.numeric(factor(train$Embarked))
cor(train$Survived, train$sexnumeric)   # correlation coefficients between Survived and selected variables from the train dataset
-0.5433514
cor(train$Survived, train$Age)
-0.06491042
cor(train$Survived, train$SibSp)
-0.0353225
cor(train$Survived, train$Parch)
0.08162941
cor(train$Survived, train$embarkednumeric)
-0.1765092
library('psych')
describe(train)   # yields descriptive statistics of the train dataset
vars n mean sd median trimmed mad min max range skew kurtosis se
PassengerId 1 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00 0.00 -1.20 8.62
Survived 2 891 0.38 0.49 0.00 0.35 0.00 0.00 1.00 1.00 0.48 -1.77 0.02
Pclass 3 891 2.31 0.84 3.00 2.39 0.00 1.00 3.00 2.00 -0.63 -1.28 0.03
Name* 4 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00 0.00 -1.20 8.62
Sex* 5 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00 -0.62 -1.62 0.02
Age 6 891 29.36 13.02 28.00 28.83 8.90 0.42 80.00 79.58 0.51 0.97 0.44
SibSp 7 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00 3.68 17.73 0.04
Parch 8 891 0.38 0.81 0.00 0.18 0.00 0.00 6.00 6.00 2.74 9.69 0.03
Ticket* 9 891 339.52 200.83 338.00 339.65 268.35 1.00 681.00 680.00 0.00 -1.28 6.73
Fare 10 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33 4.77 33.12 1.66
Cabin* 11 891 18.63 38.14 1.00 8.29 0.00 1.00 148.00 147.00 2.09 3.07 1.28
Embarked* 12 891 3.53 0.80 4.00 3.66 0.00 1.00 4.00 3.00 -1.27 -0.16 0.03
sexnumeric 14 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00 -0.62 -1.62 0.02
embarkednumeric 15 891 3.53 0.80 4.00 3.66 0.00 1.00 4.00 3.00 -1.27 -0.16 0.03
Interesting finds: The variables PassengerId and Name* have exactly the same statistics; Name is a factor with 891 unique levels, so its integer codes run from 1 to 891 just like PassengerId, and neither carries meaningful numeric information. A majority of the variables exhibit negative kurtosis. I presume this is because most of them can only take on a handful of values, which makes their distributions flatter and lighter-tailed than a normal distribution.
set.seed(100)   # set seed for reproducibility
library("dplyr")   # invoke dplyr
train <- sample_n(train, 500)   # create a subset of train with 500 observations
survivedlm <- lm(train$Survived ~ train$Sex + train$Pclass + train$Age + train$SibSp + train$Parch + train$Embarked)
survivedlm
Call:
lm(formula = train$Survived ~ train$Sex + train$Pclass + train$Age + train$SibSp + train$Parch + train$Embarked)
Coefficients:
(Intercept) train$Sexmale train$Pclass train$Age train$SibSp train$Parch train$EmbarkedQ train$EmbarkedS
1.343341 -0.462952 -0.168571 -0.006615 -0.038300 -0.013013 -0.033922 -0.059378
mypred = predict(survivedlm, train, type='response')
mypred   # prints the predicted probability of survival for each observation
mypred[mypred < 0.5] = 0   # categorize observations as 0 or 1
mypred[mypred > 0.5] = 1
mypred   # prints survived (1) or not survived (0) for each observation, based on the condition set in the previous two lines
train <- read.csv("C:/Users/brady/OneDrive/Documents/train.csv")   # reload the full train dataset
I redid the data cleanup steps and redefined survivedlm and the predict call to reflect the full train dataset.
table(mypred, train$Survived)   # confusion matrix
mypred 0 1
0 472 103
1 77 239
This model correctly predicts roughly 79.8% ((472+239)/891) of the outcomes in the training data.
glm2<-glm(train$Survived~train$Pclass+train$Sex+train$Age+train$Fare, family='binomial')
mypred=predict(glm2, train, type='response')
mypred[mypred<0.5]=0
mypred[mypred>0.5]=1
table(mypred, train$Survived)
mypred 0 1
0 469 98
1 80 244
(469+244)/891
0.8002245
After numerous attempts at attaining an accuracy higher than 80%, I found that adding Fare to the model increased the in-sample accuracy to about 80.0%.
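These accuracy figures are computed on the same observations used to fit the model; a sketch of a held-out check (assuming the NA imputation above has been applied to the reloaded train data) might look like this:
set.seed(100)   # seed chosen arbitrarily for reproducibility
idx <- sample(seq_len(nrow(train)), size = round(0.8 * nrow(train)))   # 80/20 split
holdout_fit <- glm(Survived ~ Pclass + Sex + Age + Fare, data = train[idx, ], family = "binomial")
holdout_pred <- as.numeric(predict(holdout_fit, newdata = train[-idx, ], type = "response") > 0.5)
mean(holdout_pred == train$Survived[-idx])   # out-of-sample accuracy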
When considering various models, we must keep in mind the key assumptions for OLS: the residuals of the model must be nearly normal, their variability must be nearly constant, the observations must be independent, and each predictor must be linearly related to the outcome. Taking a deeper dive into the model I came up with, passenger class and fare are strongly negatively correlated (Pclass is coded 1 = first class, so higher fares go with lower class numbers). This is a collinearity concern between two predictors rather than a violation of the independence assumption:
cor(train$Pclass, train$Fare)
-0.5494996
The Q-Q plot of the residuals for this model appears mostly normal, although it veers slightly away from the reference line towards the middle of the data. Linearity of each predictor's relationship with the outcome should be judged from residual plots rather than inferred from the mere presence of a coefficient in the fitted equation.
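A sketch of how to produce these diagnostics in base R (the second panel of plot.lm is the Q-Q plot discussed above):
par(mfrow = c(1, 2))
plot(survivedlm, which = 1)   # residuals vs fitted values (constant-variance check)
plot(survivedlm, which = 2)   # normal Q-Q plot of the residuals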
Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model - interpretation of your coefficients, how well your linear regression can be used to predict correctly, et cetera.
The following modelling approaches may be helpful (read, and focus on relevant sections) -
https://www.kaggle.com/code/jeremyd/titanic-logistic-regression-in-r
https://www.kaggle.com/code/hillabehar/titanic-analysis-with-r/report