#clearing workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 535403 28.6 1195012 63.9 NA 669420 35.8
## Vcells 997483 7.7 8388608 64.0 16384 1851771 14.2
cat("\f") # Clear the console
if(!is.null(dev.list())) dev.off() # Clear all plots
## null device
## 1
#preparing packages
packages <- c("psych",
"tidyverse",
"ggplot2",
"lemon",
"gridExtra",
"ggrepel",
"scales",
"knitr",
"dplyr"
)
for (i in seq_along(packages)) {
  if (!packages[i] %in% rownames(installed.packages())) {
    install.packages(packages[i], dependencies = TRUE)
  }
  library(packages[i], character.only = TRUE)
}
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'lemon'
##
##
## The following object is masked from 'package:purrr':
##
## %||%
##
##
##
## Attaching package: 'gridExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
##
##
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:readr':
##
## col_factor
##
##
## The following objects are masked from 'package:psych':
##
## alpha, rescale
rm(packages)
Welcome to the Final! Please read the instructions below before starting the exam. Good luck!
The distribution of IQ scores is known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Binet).
A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.
Probability Statement: \(P(IQ \ge 110)\); \(\mu = 100\), \(\sigma = 16\)
#defining parameters of interest
mu <- 100
sigma <- 16
pnorm(110, mean = mu, sd = sigma, lower.tail = FALSE)
## [1] 0.2659855
Interpretation: There is a .2660 probability that a person randomly selected for jury duty has an IQ greater than or equal to 110.
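Equivalently, we can standardize first and evaluate the upper tail of the standard normal - a quick check using the parameters defined above (z.iq is a new helper name introduced here):
#equivalent calculation via standardization
z.iq <- (110 - mu)/sigma #z = 0.625
pnorm(z.iq, lower.tail = FALSE) #matches 0.2659855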
What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with \(\mu=100\) and \(\sigma=16\)? Be sure to write the probability statement and show your R code.
Probability Statement: \(P(\bar{X} \ge 110)\); \(\mu = 100\), \(\sigma = 16\), \(n = 12\)
#defining sample size
n <- 12
#determining sigma of sample
sample.sigma <- sigma/sqrt(n)
#determining probability
round(pnorm(110, mean = mu, sd = sample.sigma, lower.tail = FALSE),4)
## [1] 0.0152
Interpretation: There is a .0152 probability that the mean IQ of the 12-person jury is 110 or higher.
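As a sanity check on this CLT-based answer, we can simulate a large number of 12-person juries - a minimal sketch using the objects defined above (the seed and replication count are arbitrary):
#simulating 100,000 juries of 12 and measuring the tail frequency
set.seed(1)
jury.means <- replicate(1e5, mean(rnorm(n, mean = mu, sd = sigma)))
mean(jury.means >= 110) #should land near 0.0152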
Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.
A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it and uses the point estimate from this survey as evidence. Conduct a hypothesis test with \(\alpha=.05\) to determine if these data provide strong evidence that more than 50% of adult Americans do not go to college for financial reasons. What are the critical value, test statistic, and p-value for the proportion of Americans who do not go to college for financial reasons? Do you reject or fail to reject the null?
Hypothesis Test:
H0 - The proportion of American adults who decide not to go to college due to financial constraints is at most 50% (\(\pi \le .50\)).
Ha - The proportion of American adults who decide not to go to college due to financial constraints is greater than 50% (\(\pi > .50\)).
#defining parameters of interest
pi <- .50 #population parameter
n <- 331 #sample size
p.hat <- .48 #point estimate
alpha <- .05 #level of significance
#calculating standard error
se <- sqrt((pi * (1-pi))/n)
#calculating z-score
z <- (p.hat - pi)/se
#calculating p-value (upper tail, since Ha: pi > .50)
p.value <- pnorm(z, lower.tail = FALSE)
print(p.value)
## [1] 0.7666125
myp <- function(p, alpha){
  if(p < alpha){print('REJECT Ho')} else {print('FAIL 2 REJECT')}
}
#drawing conclusion
myp(p.value, alpha)
## [1] "FAIL 2 REJECT"
Interpretation: At a significance level of .05, we do not have sufficient evidence to reject the null hypothesis (p = .7666). Indeed, the sample proportion (.48) falls below .50, so the data cannot support the claim that more than 50% of American adults forgo college for financial reasons.
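For comparison, base R's prop.test runs the same one-sided test directly from counts - a sketch, assuming the reported 48% corresponds to roughly 159 of the 331 respondents:
#cross-check with the built-in proportion test (chi-square form of the z-test)
prop.test(x = 159, n = 331, p = 0.5, alternative = "greater", correct = FALSE)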
Would you expect a confidence interval (\(\alpha=.05\), double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.
upper.bound <- p.hat + qnorm(p = .975) * se
lower.bound <- p.hat - qnorm(p = .975) * se
ci <- c(lower.bound, upper.bound)
print(ci)
## [1] 0.4261353 0.5338647
Interpretation: Yes. The 95% confidence interval (0.4261, 0.5339) contains the hypothesized value of .5, which is consistent with failing to reject the null. (Note that se here reuses the null-based standard error from above; the conventional interval uses \(\hat{p}\) in the SE, giving nearly identical bounds.)
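A quick sketch of that conventional Wald interval, using the objects defined above - the bounds barely move and still contain .5:
#Wald interval with the p-hat-based standard error
se.hat <- sqrt((p.hat * (1 - p.hat))/n)
p.hat + c(-1, 1) * qnorm(.975) * se.hat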
You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.
Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different? Assume \(\alpha = 5\%\).
H0 - The average number of food items recalled by the treatment and control groups are equivalent.
Ha - The average number of food items recalled by the treatment and control groups are different.
#defining parameters of interest
n.1 <- 22 #sample size of treatment group
mu.1 <- 4.9 #mean of treatment group
sigma.1 <- 1.8 #sd of treatment group
n.2 <- 22 #sample size of control group
mu.2 <- 6.1 #mean of control group
sigma.2 <- 1.8 #sd of control group
alpha <- .05 #level of significance
#calculating variance
var.1 <- sigma.1^2
var.2 <- sigma.2^2
#calculating standard error
se <- sqrt((var.1/n.1)+(var.2/n.2))
#calculating test statistic
t <- (mu.1 - mu.2)/se
#determining degrees of freedom
df <- n.1 + n.2 - 2
#calculating p-value
p.value <- 2*pt(abs(t), df = df, lower.tail = FALSE)
print(p.value)
## [1] 0.03252806
#drawing conclusion
myp(p.value, alpha)
## [1] "REJECT Ho"
Interpretation: At a significance level of .05, we have strong evidence supporting the alternative hypothesis - the average number of food items recalled differs between the treatment and control groups.
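One consistency note: the test above pairs an unpooled standard error with pooled degrees of freedom (n.1 + n.2 - 2). Because the two groups have equal sizes and equal standard deviations, the pooled-variance SE is identical here, so the df choice is harmless - a quick sketch using objects already defined:
#pooled-variance SE coincides with the unpooled SE in this balanced, equal-sd case
sp2 <- ((n.1 - 1) * var.1 + (n.2 - 1) * var.2)/(n.1 + n.2 - 2)
se.pooled <- sqrt(sp2 * (1/n.1 + 1/n.2))
all.equal(se, se.pooled) #TRUE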
Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.
#defining parameters of interest
n <- 25
upper.bound <- 77
lower.bound <- 65
alpha <- .1
#determining sample mean
x.bar <- (upper.bound + lower.bound)/2
print(x.bar)
## [1] 71
Interpretation: The mean of the sample is 71.
#determining margin of error
margin.error <- (upper.bound - lower.bound)/2
print(margin.error)
## [1] 6
Interpretation: The margin of error is 6.
#determining standard error (using the t quantile, since sigma is unknown)
se <- margin.error/qt(.95, df = n - 1)
print(se)
## [1] 3.506962
Interpretation: The standard error is 3.51.
#determining the standard deviation implied by the margin of error
sigma <- se * sqrt(n)
print(sigma)
## [1] 17.53481
Interpretation: The estimated standard deviation is 17.53; since \(\sigma\) is unknown, this sample-based value serves as our estimate of the population standard deviation.
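As a check, the recovered quantities should reproduce the stated interval - a minimal sketch:
#rebuilding the 90% t-interval from the recovered mean and standard deviation
x.bar + c(-1, 1) * qt(.95, df = n - 1) * sigma/sqrt(n) #should return 65 and 77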
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -
mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
## control treatment
## alive 4 24
## dead 30 45
Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?
Interpretation: The normal approximation for a difference in proportions requires the success-failure condition: at least 10 "successes" and 10 "failures" in each group. Here the control group has only 4 survivors (out of just 34 patients), so the condition clearly fails.
If we constructed the interval anyway, the true sampling distribution of the difference would be skewed rather than approximately normal, so the stated coverage (e.g., 95%) would be wrong - the interval could easily be too narrow or mis-centered, leading to misleading conclusions and false confidence in any inference about the population.
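The failure of the success-failure condition is easy to verify from the table - a sketch using the mym matrix defined above:
#success-failure check: each group needs at least 10 'successes' and 10 'failures'
colSums(mym) #group sizes: 34 control, 69 treatment
rbind(successes = mym["alive", ], failures = mym["dead", ])
#the control group has only 4 survivors, well below the usual threshold of 10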
Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.
my.function <- function(my.lm, coefficient.name, null.value = 0, alternative = "two.sided", alpha = 0.05) {
  coeff.summary <- summary(my.lm)$coefficients #isolating coefficient table
  #check
  if (!(coefficient.name %in% rownames(coeff.summary))) {
    stop("Coefficient name not found in the model.")
  }
  #isolating coefficient estimate and standard error from fitted model
  coeff.estimate <- coeff.summary[coefficient.name, "Estimate"]
  coeff.se <- coeff.summary[coefficient.name, "Std. Error"]
  t.stat <- (coeff.estimate - null.value)/coeff.se #calculating test statistic against the null value
  df <- df.residual(my.lm) #extracting residual degrees of freedom
  #calculating p-value under different scenarios
  if (alternative == "two.sided") {
    p.value <- 2 * pt(abs(t.stat), df = df, lower.tail = FALSE)
  } else if (alternative == "greater") {
    p.value <- pt(t.stat, df = df, lower.tail = FALSE)
  } else if (alternative == "less") {
    p.value <- pt(t.stat, df = df, lower.tail = TRUE)
  } else {
    stop("Invalid 'alternative' argument. Must be one of 'two.sided', 'greater', or 'less'.") #check
  }
  #drawing conclusion
  if (p.value < alpha) {
    conclusion <- "Reject Null Hypothesis"
  } else {
    conclusion <- "Fail to Reject Null Hypothesis"
  }
  #returning results
  return(list(
    t.stat = t.stat,
    p.value = p.value,
    conclusion = conclusion
  ))
}
Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.
Hypothesis:
H0 - The sepal length of iris flowers has no effect on sepal width.
Ha - The sepal length of iris flowers has an effect on sepal width.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Estimating Equation:
\[ \text{Sepal.Width} = \beta_0 + \beta_1 \cdot \text{Sepal.Length} + \epsilon \] \[ \epsilon \sim N(0,\sigma^2) \]
#built in function
my.lm <- lm(iris$Sepal.Width ~ iris$Sepal.Length)
summary(my.lm)
##
## Call:
## lm(formula = iris$Sepal.Width ~ iris$Sepal.Length)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1095 -0.2454 -0.0167 0.2763 1.3338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41895 0.25356 13.48 <2e-16 ***
## iris$Sepal.Length -0.06188 0.04297 -1.44 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4343 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
Interpretation: The slope indicates that a one-unit (1 cm) increase in sepal length is associated with a .06188 decrease in predicted sepal width - this estimate is not statistically significant (p = .152). The intercept suggests that when sepal length is zero (well outside the observed data), the predicted mean sepal width is 3.41895 - this estimate is statistically significant at the .001 level.
sepal.length.coeff <- coef(my.lm)[2] #isolating coefficient
sepal.length.se <- summary(my.lm)$coefficients[2, "Std. Error"] #isolating se
#caluclating test statistic
t.stat.by.hand <- (sepal.length.coeff - 0) / sepal.length.se
#determining degrees of freedom (n minus the 2 estimated coefficients)
df <- nrow(iris) - 2
#determining p-value
p.value.by.hand <- 2 * pt(-abs(t.stat.by.hand), df = df)
print(t.stat.by.hand)
## iris$Sepal.Length
## -1.440287
print(p.value.by.hand)
## iris$Sepal.Length
## 0.1518983
Interpretation: At a significance level of .05, we do not have sufficient evidence to reject the null hypothesis - our p-value of .15 is greater than alpha (.05). The data suggests sepal length does not have an effect on sepal width.
#using custom function
result <- my.function(my.lm, "iris$Sepal.Length", alternative = "two.sided", alpha = 0.05)
print(result)
## $t.stat
## [1] -1.440287
##
## $p.value
## [1] 0.1518983
##
## $conclusion
## [1] "Fail to Reject Null Hypothesis"
Interpretation: My custom function and the manual calculation yield identical results. Our conclusion thus holds: at a significance level of .05, we do not have sufficient evidence to reject the null hypothesis - the data suggests sepal length does not have an effect on sepal width.
Import the smoking.csv file into R. The variables are "smoker" - whether or not one smokes; "smkban" - whether or not one faces a smoking ban at one's workplace (this is fairly old data); "age" - in years; "hsdrop" - high school dropout; "hsgrad" - graduated from high school but has no further education; "colsome" - has some college education; "colgrad" - graduated from college; as well as some self-explanatory demographic variables.
#setting wd
setwd("/Users/josephmancuso/Documents/BC/Spring'24/Final Exam")
getwd()
## [1] "/Users/josephmancuso/Documents/BC/Spring'24/Final Exam"
#importing csv file
smoke <- read.csv("/Users/josephmancuso/Documents/BC/Spring'24/Final Exam/smoking.csv"
, check.names = FALSE
, stringsAsFactors = FALSE
, na.strings = ""
)
#preparing summary statistics
describe(smoke)
## vars n mean sd median trimmed mad min max range skew
## smoker 1 10000 0.24 0.43 0 0.18 0.00 0 1 1 1.20
## smkban 2 10000 0.61 0.49 1 0.64 0.00 0 1 1 -0.45
## age 3 10000 38.69 12.11 37 37.98 13.34 18 88 70 0.51
## hsdrop 4 10000 0.09 0.29 0 0.00 0.00 0 1 1 2.84
## hsgrad 5 10000 0.33 0.47 0 0.28 0.00 0 1 1 0.74
## colsome 6 10000 0.28 0.45 0 0.23 0.00 0 1 1 0.98
## colgrad 7 10000 0.20 0.40 0 0.12 0.00 0 1 1 1.52
## black 8 10000 0.08 0.27 0 0.00 0.00 0 1 1 3.18
## hispanic 9 10000 0.11 0.32 0 0.02 0.00 0 1 1 2.44
## female 10 10000 0.56 0.50 1 0.58 0.00 0 1 1 -0.26
## kurtosis se
## smoker -0.55 0.00
## smkban -1.80 0.00
## age -0.22 0.12
## hsdrop 6.06 0.00
## hsgrad -1.45 0.00
## colsome -1.04 0.00
## colgrad 0.32 0.00
## black 8.08 0.00
## hispanic 3.94 0.00
## female -1.93 0.00
#checking for missing values
colSums(is.na(smoke))
## smoker smkban age hsdrop hsgrad colsome colgrad black
## 0 0 0 0 0 0 0 0
## hispanic female
## 0 0
Estimate a logit regression model of whether or not one smokes as a function of the other variables.
#running correlation analysis
cor(smoke)
## smoker smkban age hsdrop hsgrad
## smoker 1.0000000000 -0.088295899 -0.03417260 0.09891966 0.11677429
## smkban -0.0882958987 1.000000000 0.04002314 -0.09622929 -0.07763858
## age -0.0341725998 0.040023142 1.00000000 0.04401114 0.02607083
## hsdrop 0.0989196643 -0.096229290 0.04401114 1.00000000 -0.22061488
## hsgrad 0.1167742886 -0.077638583 0.02607083 -0.22061488 1.00000000
## colsome 0.0052359677 0.015219258 -0.09235948 -0.19764754 -0.43450990
## colgrad -0.1230708747 0.085258235 -0.03388195 -0.15700484 -0.34516067
## black 0.0005880358 0.006974056 -0.00480490 0.01677509 0.02548611
## hispanic -0.0204386397 -0.034598175 -0.10451016 0.28983694 -0.03117953
## female -0.0314577461 0.115149149 0.02779034 -0.06450376 0.08210537
## colsome colgrad black hispanic female
## smoker 0.005235968 -0.12307087 0.0005880358 -0.020438640 -0.03145775
## smkban 0.015219258 0.08525823 0.0069740559 -0.034598175 0.11514915
## age -0.092359484 -0.03388195 -0.0048049004 -0.104510156 0.02779034
## hsdrop -0.197647536 -0.15700484 0.0167750927 0.289836936 -0.06450376
## hsgrad -0.434509903 -0.34516067 0.0254861131 -0.031179533 0.08210537
## colsome 1.000000000 -0.30922736 0.0338694662 -0.005440164 0.03659891
## colgrad -0.309227364 1.00000000 -0.0440024820 -0.097196535 -0.03274681
## black 0.033869466 -0.04400248 1.0000000000 -0.066529449 0.03823082
## hispanic -0.005440164 -0.09719654 -0.0665294488 1.000000000 -0.03639827
## female 0.036598909 -0.03274681 0.0382308196 -0.036398270 1.00000000
#fitting model
my.logit <- glm(smoker ~ ., data = smoke, family = "binomial")
summary(my.logit)
##
## Call:
## glm(formula = smoker ~ ., family = "binomial", data = smoke)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.696765 0.138696 -12.234 < 2e-16 ***
## smkban -0.250735 0.049164 -5.100 3.40e-07 ***
## age -0.007452 0.001987 -3.751 0.000176 ***
## hsdrop 1.931075 0.131261 14.712 < 2e-16 ***
## hsgrad 1.523305 0.114308 13.326 < 2e-16 ***
## colsome 1.180080 0.116518 10.128 < 2e-16 ***
## colgrad 0.424753 0.125801 3.376 0.000734 ***
## black -0.149472 0.089994 -1.661 0.096732 .
## hispanic -0.584845 0.083085 -7.039 1.93e-12 ***
## female -0.188720 0.049105 -3.843 0.000121 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11074 on 9999 degrees of freedom
## Residual deviance: 10502 on 9990 degrees of freedom
## AIC: 10522
##
## Number of Fisher Scoring iterations: 4
plot(my.logit)
What does the coefficient on "smkban" tell you? How about the coefficient on "age"? How about the coefficient on "colgrad"?
Interpretation:
smkban - Facing a workplace smoking ban (smkban = 1) lowers the log odds of being a smoker by .2507, holding the other variables constant. The coefficient is statistically significant at the .001 level and indicates that workplace smoking bans are associated with a lower likelihood of smoking among employees.
age - For every one-year increase in age, the log odds of being a smoker decrease by .0074, holding the other variables constant. The coefficient is statistically significant at the .001 level and indicates that older individuals are less likely to smoke.
colgrad - Being a college graduate (colgrad = 1) raises the log odds of being a smoker by .4248 relative to the omitted (most-educated) category, holding the other variables constant. The coefficient is statistically significant at the .001 level.
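Log odds are easier to communicate as odds ratios - a short sketch exponentiating the fitted coefficients:
#converting log-odds coefficients to odds ratios
round(exp(coef(my.logit)), 3)
#e.g., exp(-0.2507) is about 0.778: facing a workplace ban multiplies the odds of smoking by roughly 0.78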
Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.
#importing csv file
nsw <- read.csv("/Users/josephmancuso/Documents/BC/Spring'24/Final Exam/nsw.csv"
, check.names = FALSE
, stringsAsFactors = FALSE
, na.strings = ""
)
#preparing summary statistics
describe(nsw)
## vars n mean sd median trimmed mad min max
## data_id* 1 722 1.00 0.00 1.00 1.00 0.00 1 1.00
## treat 2 722 0.41 0.49 0.00 0.39 0.00 0 1.00
## age 3 722 24.52 6.63 23.00 23.56 5.93 17 55.00
## education 4 722 10.27 1.70 10.00 10.37 1.48 3 16.00
## black 5 722 0.80 0.40 1.00 0.88 0.00 0 1.00
## hispanic 6 722 0.11 0.31 0.00 0.01 0.00 0 1.00
## married 7 722 0.16 0.37 0.00 0.08 0.00 0 1.00
## nodegree 8 722 0.78 0.41 1.00 0.85 0.00 0 1.00
## re75 9 722 3042.90 5066.14 936.31 1932.15 1388.17 0 37431.66
## re78 10 722 5454.64 6252.94 3951.89 4482.25 5859.07 0 60307.93
## range skew kurtosis se
## data_id* 0.00 NaN NaN 0.00
## treat 1.00 0.36 -1.87 0.02
## age 38.00 1.40 2.32 0.25
## education 13.00 -0.69 1.67 0.06
## black 1.00 -1.50 0.25 0.01
## hispanic 1.00 2.57 4.60 0.01
## married 1.00 1.83 1.35 0.01
## nodegree 1.00 -1.35 -0.18 0.02
## re75 37431.66 2.95 11.51 188.54
## re78 60307.93 2.21 10.49 232.71
#checking for missing values
colSums(is.na(nsw))
## data_id treat age education black hispanic married nodegree
## 0 0 0 0 0 0 0 0
## re75 re78
## 0 0
Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?
H0 - Receiving work experience and counseling in a sheltered environment has no effect on the increase in real earnings between 1975 and 1978.
Ha - Receiving work experience and counseling in a sheltered environment has an effect on the increase in real earnings between 1975 and 1978.
#determining increase in earnings and appending to dataset
nsw$increase.earnings <- nsw$re78 - nsw$re75
#fitting model
nsw.lm <- lm(nsw$increase.earnings ~ nsw$treat)
summary(nsw.lm)
##
## Call:
## lm(formula = nsw$increase.earnings ~ nsw$treat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37995 -3198 -1442 3677 56114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2063.4 359.3 5.743 1.37e-08 ***
## nsw$treat 846.9 560.1 1.512 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7406 on 720 degrees of freedom
## Multiple R-squared: 0.003165, Adjusted R-squared: 0.00178
## F-statistic: 2.286 on 1 and 720 DF, p-value: 0.131
Interpretation: The coefficient indicates that individuals who received the treatment saw their real earnings increase by $846.90 more than those who did not. The p-value (.131) exceeds the 10% significance level, so the effect is not statistically significant - we do not have sufficient evidence to reject the null hypothesis.
Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?
#fitting model
nsw.lm.2 <- lm(nsw$increase.earnings ~ nsw$treat + nsw$age + nsw$education + nsw$black + nsw$hispanic + nsw$married + nsw$nodegree)
summary(nsw.lm.2)
##
## Call:
## lm(formula = nsw$increase.earnings ~ nsw$treat + nsw$age + nsw$education +
## nsw$black + nsw$hispanic + nsw$married + nsw$nodegree)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38153 -3383 -1076 3503 55863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1986.822 3123.319 0.636 0.52490
## nsw$treat 869.294 561.559 1.548 0.12207
## nsw$age 4.194 43.424 0.097 0.92308
## nsw$education 26.633 215.306 0.124 0.90159
## nsw$black 112.085 956.157 0.117 0.90672
## nsw$hispanic 1229.152 1253.250 0.981 0.32704
## nsw$married -2210.949 767.530 -2.881 0.00409 **
## nsw$nodegree -217.758 891.527 -0.244 0.80710
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7384 on 714 degrees of freedom
## Multiple R-squared: 0.01729, Adjusted R-squared: 0.007654
## F-statistic: 1.794 on 7 and 714 DF, p-value: 0.08539
Interpretation: The estimated effect is slightly larger ($869.29) when controlling for the other variables, but its p-value (.12207) again exceeds the 10% significance level. Consequently, we draw the same conclusion - we have insufficient evidence to reject the null hypothesis.
Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?
#splitting the data into treated and control groups
treated <- nsw[nsw$treat == 1, ]
control <- nsw[nsw$treat == 0, ]
#determining mean change in earnings for each group
treated.diff <- mean(treated$increase.earnings)
control.diff <- mean(control$increase.earnings)
#calculating DID
diff.in.diff <- treated.diff - control.diff
print(diff.in.diff)
## [1] 846.8883
Interpretation: The DID approach indicates the treatment group's earnings grew by $846.89 more between 1975 and 1978 than the control group's - identical to the univariate regression coefficient above.
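The same estimate falls out of a regression formulation - a sketch stacking earnings into a long data frame (nsw.long and did.lm are new names introduced here) with a post-period indicator and a treat-by-post interaction:
#difference-in-differences via regression
nsw.long <- data.frame(earnings = c(nsw$re75, nsw$re78),
                       post = rep(c(0, 1), each = nrow(nsw)),
                       treat = rep(nsw$treat, 2))
did.lm <- lm(earnings ~ treat * post, data = nsw.long)
coef(did.lm)["treat:post"] #should match 846.8883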
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
Import the titanic train.csv file into R. You will have to clean the data to run some of the commands/set up the workflow.
#importing csv file
train <- read.csv("/Users/josephmancuso/Documents/BC/Spring'24/Final Exam/train.csv")
missing.observations <- colSums(is.na(train))
print(missing.observations)
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
#imputing missing values
age.source.data <- train$Age #storing the original column
age.median <- median(train$Age, na.rm = TRUE) #calculating median age
train$Age[is.na(train$Age)] <- age.median #replacing NAs with the median age
sum(is.na(train$Age)) #check
## [1] 0
#transforming sex into a binary
train$Sex <- ifelse(train$Sex == "male",1,0)
(4 points) Run some preliminary correlations of Survived with some other variables.
cor(train$Survived, train$Age)
## [1] -0.06491042
Interpretation: Very weak, negative correlation between age and survival.
cor(train$Survived, train$Sex)
## [1] -0.5433514
Interpretation: Moderate, negative correlation between sex and survival - male passengers (coded 1) were less likely to survive.
cor(train$Survived, train$Pclass)
## [1] -0.338481
Interpretation: Moderate, negative correlation between passenger class and survival - higher class numbers (lower socioeconomic class) are associated with lower survival.
cor(train$Survived, train$SibSp)
## [1] -0.0353225
Interpretation: Very weak, negative correlation between the number of siblings and spouses aboard and survival.
cor(train$Survived, train$Parch)
## [1] 0.08162941
Interpretation: Very weak, positive correlation between the number of parents and children aboard and survival.
(3 points) Conduct descriptive statistics of the dataset. Anything interesting you find?
describe(train)
## vars n mean sd median trimmed mad min max range
## PassengerId 1 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Survived 2 891 0.38 0.49 0.00 0.35 0.00 0.00 1.00 1.00
## Pclass 3 891 2.31 0.84 3.00 2.39 0.00 1.00 3.00 2.00
## Name* 4 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Sex 5 891 0.65 0.48 1.00 0.68 0.00 0.00 1.00 1.00
## Age 6 891 29.36 13.02 28.00 28.83 8.90 0.42 80.00 79.58
## SibSp 7 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00
## Parch 8 891 0.38 0.81 0.00 0.18 0.00 0.00 6.00 6.00
## Ticket* 9 891 339.52 200.83 338.00 339.65 268.35 1.00 681.00 680.00
## Fare 10 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33
## Cabin* 11 891 18.63 38.14 1.00 8.29 0.00 1.00 148.00 147.00
## Embarked* 12 891 3.53 0.80 4.00 3.66 0.00 1.00 4.00 3.00
## skew kurtosis se
## PassengerId 0.00 -1.20 8.62
## Survived 0.48 -1.77 0.02
## Pclass -0.63 -1.28 0.03
## Name* 0.00 -1.20 8.62
## Sex -0.62 -1.62 0.02
## Age 0.51 0.97 0.44
## SibSp 3.68 17.73 0.04
## Parch 2.74 9.69 0.03
## Ticket* 0.00 -1.28 6.73
## Fare 4.77 33.12 1.66
## Cabin* 2.09 3.07 1.28
## Embarked* -1.27 -0.16 0.03
Highlights:
Age: The mean age of passengers on the Titanic is relatively young (about 29; the median is 28). The variance (169.5, below) and standard deviation (13) indicate substantial variability - most ages fall within roughly 13 years of the mean.
var(train$Age)
## [1] 169.5125
Fare: The mean ticket price of $32.20 was relatively expensive - according to BLS data, it is equivalent to $514.56 in 2024 dollars. The max ticket price of $512.33 translates into roughly $8,200 using the same conversion factor. The min ticket price of $0 suggests some individuals received "free passage."
SibSp: The SibSp variable ranges up to 8 siblings/spouses, indicating some large families were aboard; however, the standard deviation (1.1) shows most observations sit close to the mean (.52).
Sex: After transforming sex into a binary variable (male = 1), it appears more men were on the ship than women (mean = .65).
set.seed(100)
train.subset <- sample_n(train, 500)
(4 points) Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample.
my.lm <- lm(train.subset$Survived ~ train.subset$Pclass + train.subset$Age + train.subset$Sex)
summary(my.lm)
##
## Call:
## lm(formula = train.subset$Survived ~ train.subset$Pclass + train.subset$Age +
## train.subset$Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0791 -0.2568 -0.1164 0.2534 0.9845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.267864 0.082326 15.401 < 2e-16 ***
## train.subset$Pclass -0.176925 0.023193 -7.628 1.23e-13 ***
## train.subset$Age -0.005937 0.001480 -4.012 6.94e-05 ***
## train.subset$Sex -0.454444 0.039011 -11.649 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4007 on 496 degrees of freedom
## Multiple R-squared: 0.3227, Adjusted R-squared: 0.3186
## F-statistic: 78.76 on 3 and 496 DF, p-value: < 2.2e-16
Interpretation: The coefficients suggest that a one-unit increase in Pclass or Age decreases predicted survival by .177 and .006 respectively, and that being male (Sex = 1) decreases predicted survival by .454. All three coefficients are statistically significant at the .001 level. The adjusted R-squared of .3186 suggests the model could be improved, since a large share of the variability remains unexplained.
plot(my.lm)
(4 points) Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficients from your linear model to forecast/predict/estimate the survival variable given independent variable values/data.
#predicted probabilities of survival
predictions <- predict(my.lm, newdata = train.subset, type = "response")
#defining threshold
threshold <- 0.5
#transforming predicted values into binary outcomes
predicted.survival <- ifelse(predictions >= threshold, yes = 1, no = 0)
table(predicted.survival)
## predicted.survival
## 0 1
## 339 161
Interpretation: My model predicts that 339 of the 500 sampled individuals did not survive the journey.
#expanding model to encompass full dataset
my.lm.2 <- lm(train$Survived ~ train$Pclass + train$Age + train$Sex)
summary(my.lm.2)
##
## Call:
## lm(formula = train$Survived ~ train$Pclass + train$Age + train$Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08688 -0.23209 -0.08626 0.22073 0.99862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.282644 0.058032 22.103 < 2e-16 ***
## train$Pclass -0.185779 0.016561 -11.218 < 2e-16 ***
## train$Age -0.004993 0.001058 -4.721 2.72e-06 ***
## train$Sex -0.499229 0.027332 -18.265 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3828 on 887 degrees of freedom
## Multiple R-squared: 0.3832, Adjusted R-squared: 0.3811
## F-statistic: 183.7 on 3 and 887 DF, p-value: < 2.2e-16
#predicted probabilities of survival
predictions.2 <- predict(my.lm.2, newdata = train, type = "response")
#defining threshold
threshold.2 <- 0.5
#transforming predicted values into binary outcomes
predicted.survival.2 <- ifelse(predictions.2 >= threshold.2, yes = 1, no = 0)
table(predicted.survival.2)
## predicted.survival.2
## 0 1
## 573 318
#building confusion matrix
observed <- train$Survived
outcome.table <- table(observed, predicted.survival.2)
outcome.table
## predicted.survival.2
## observed 0 1
## 0 467 82
## 1 106 236
#calculating accuracy
outcome.table.accuracy <- sum(diag(outcome.table)) / sum(outcome.table)
print(outcome.table.accuracy)
## [1] 0.7890011
Interpretation: When extending the model to the full train dataset, the model accurately classified 79% of passengers.
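Accuracy alone can mask asymmetric errors, so it is worth splitting the confusion matrix into sensitivity and specificity - a quick sketch using outcome.table from above:
#sensitivity (true-positive rate) and specificity (true-negative rate)
sensitivity <- outcome.table["1", "1"]/sum(outcome.table["1", ]) #236/342, about .69
specificity <- outcome.table["0", "0"]/sum(outcome.table["0", ]) #467/549, about .85
c(sensitivity = sensitivity, specificity = specificity)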
Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model - interpretation of your coefficients, how well your linear regression can be used to predict correctly, et cetera.
The following modelling approaches may be helpful (read, and focus on relevant sections) -
https://www.kaggle.com/code/jeremyd/titanic-logistic-regression-in-r
https://www.kaggle.com/code/hillabehar/titanic-analysis-with-r/report
Background: One of the key takeaways from our first assignment was the relationship between passenger age and survival status. Notched boxplots were created to compare the median age of Titanic passengers who survived or perished. The top boxplot ("Survived") indicates most Titanic survivors were young adults - a trend that also holds for passengers who did not survive the voyage, as shown in the lower boxplot ("Perished").
Additionally, children were more likely to survive the voyage, as indicated by the lower minimum in the Survived boxplot. The reverse holds true for the elderly, who were less likely to survive (lower maximum and a greater number of outliers in the Perished boxplot).
Given that age was a statistically significant independent variable in our initial OLS model, and that the likelihood of survival varies across age groups, I believe classifying the age variable into groups will improve the accuracy of predictions.
boxplot(train$Age~train$Survived, notch=TRUE, horizontal=T,
xlab = "Age",
ylab = "Survived",
main = "Age & Survival Rate"
)
train.2 <- as.data.frame(train)
#binning age variable (cut() already returns a factor)
bins <- c(0,18,50,80)
labels <- c("Youth", "Adult", "Old")
train.2 <- train.2 %>%
  mutate(age.categories = cut(Age, breaks = bins, labels = labels))
Model: To test my assumption, I built a logistic regression model that controls for Pclass, the new binned age variable, Sex, Parch, and SibSp. I added Parch under the assumption that passengers with children aboard would behave differently than passengers traveling alone - parents will likely prioritize the welfare of their children, which could change the odds of survival. Similar logic justifies including SibSp - passengers may have behaved differently if "loved ones" were aboard, which could factor into survival odds.
my.logit <- glm(Survived ~ Pclass + age.categories + Sex + Parch + SibSp, data = train.2, family = "binomial")
summary(my.logit)
##
## Call:
## glm(formula = Survived ~ Pclass + age.categories + Sex + Parch +
## SibSp, family = "binomial", data = train.2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.7190 0.4418 10.680 < 2e-16 ***
## Pclass -1.0759 0.1143 -9.409 < 2e-16 ***
## age.categoriesAdult -0.9888 0.2629 -3.761 0.000169 ***
## age.categoriesOld -1.7884 0.4282 -4.177 2.96e-05 ***
## Sex -2.7486 0.1978 -13.899 < 2e-16 ***
## Parch -0.1060 0.1132 -0.936 0.349383
## SibSp -0.3319 0.1092 -3.040 0.002367 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 797.97 on 884 degrees of freedom
## AIC: 811.97
##
## Number of Fisher Scoring iterations: 5
Interpretation: The pattern of significance is similar to the original OLS model - Pclass and Sex remain statistically significant. Interestingly, both the "Adult" and "Old" categories of the age variable are statistically significant at the .001 level. My assumption about SibSp also appears to hold - the presence of siblings or spouses aboard is statistically significant at the .01 level. However, the model challenges my intuition about direction, since the log odds of survival decrease with each additional sibling or spouse aboard (and similarly for Parch).
The difference between the null and residual deviance (388.69) indicates the model explains a substantial portion of the variability. AIC, however, is mainly useful for comparing candidate models on the same data; on its own it says little about over-fitting, which would be better checked against a held-out test set.
#predicted probabilities of survival
predictions.2 <- predict(my.logit, newdata = train.2, type = "response")
#defining threshold
threshold.2 <- 0.5
#transforming predicted values into binary outcomes
predicted.survival.2 <- ifelse(predictions.2 >= threshold.2, yes = 1, no = 0)
table(predicted.survival.2)
## predicted.survival.2
## 0 1
## 597 294
#building confusion matrix
observed.2 <- train.2$Survived
outcome.table.2 <- table(observed.2, predicted.survival.2)
outcome.table.2
## predicted.survival.2
## observed.2 0 1
## 0 486 63
## 1 111 231
#determining accuracy
outcome.table.accuracy.2 <- sum(diag(outcome.table.2)) / sum(outcome.table.2)
print(outcome.table.accuracy.2)
## [1] 0.8047138
Interpretation: My new model slightly improved on the accuracy of the original OLS model (80% vs. 79%). Greater accuracy may be achieved by adding socioeconomic variables (based on Fare and Pclass) to categorize passengers. An alternative approach could involve interacting Age and Sex - both variables are highly statistically significant and, together, may be a more accurate predictor of survival (see the sketch below).
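A sketch of that alternative - adding an age.categories-by-Sex interaction to the final model (my.logit.3 is a new name; results not shown here):
#candidate refinement: interacting binned age with sex
my.logit.3 <- glm(Survived ~ Pclass + age.categories * Sex + Parch + SibSp,
                  data = train.2, family = "binomial")
summary(my.logit.3)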