#clearing workspace
rm(list = ls()) # Clear environment
gc()            # Clear unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 535403 28.6    1195012 63.9         NA   669420 35.8
## Vcells 997483  7.7    8388608 64.0      16384  1851771 14.2
cat("\f")       # Clear the console
if(!is.null(dev.list())) dev.off() # Clear all plots
## null device 
##           1
#preparing packages
packages <- c("psych",
              "tidyverse",
              "ggplot2",
              "lemon",
              "gridExtra",
              "ggrepel",
              "scales",
              "knitr",
              "dplyr"
                  )
   
  for (i in 1:length(packages)) {
    if (!packages[i] %in% rownames(installed.packages())) {
      install.packages(packages[i], dependencies = TRUE)
    }
    library(packages[i], character.only = TRUE)
  }
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'lemon'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     %||%
## 
## 
## 
## Attaching package: 'gridExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## 
## 
## Attaching package: 'scales'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## 
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
  rm(packages)

Welcome to the Final! Please read the instructions below before starting the exam.

Tips for cracking the exam:

Good luck!

1 Question 1: Distribution of an individual score vs Sample Mean (CLT)

IQ scores are known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Binet).

1.1 (4 points)

A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.

Probability Statement: P(IQ >= 110); μ = 100, σ = 16

#defining parameters of interest
mu <- 100
sigma <- 16

pnorm(110, mean = mu, sd = sigma, lower.tail = FALSE)
## [1] 0.2659855

Interpretation: There is a .2660 probability that a person randomly selected for jury duty has an IQ greater than or equal to 110.

1.2 (4 points)

What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with \(\mu=100\) and \(\sigma=16\)? Be sure to write the probability statement and show your R code.

Probability Statement: P(Mean IQ >= 110); μ = 100, σ = 16, n = 12

#defining sample size
n <- 12

#determining the standard error of the sample mean
sample.sigma <- sigma/sqrt(n)

#determining probability
round(pnorm(110, mean = mu, sd = sample.sigma, lower.tail = FALSE),4)
## [1] 0.0152

Interpretation: There is a .0152 probability that the mean IQ of a 12-person jury is greater than or equal to 110.
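
Sanity check: a quick Monte Carlo sketch (the seed and replication count are arbitrary choices, not part of the exam answer) - simulating many 12-person juries from N(100, 16) should give a share of sample means at or above 110 close to .0152.

#simulating the sampling distribution of the mean (mu, sigma, n defined above)
set.seed(1)
sim.means <- replicate(100000, mean(rnorm(n, mean = mu, sd = sigma)))
mean(sim.means >= 110) #should land near 0.0152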

2 Question 2: Hypothesis testing - Proportion

Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.

2.1 (6 points)

A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it and uses the point estimate from this survey as evidence. Conduct a hypothesis test with \(\alpha=.05\) to determine if these data provide strong evidence that more than 50% of adult Americans do not go to college for financial reasons. What are the critical value, test statistic, and p-value for the proportion of Americans who do not go to college for financial reasons? Do you reject or fail to reject the null?

Hypothesis Test:

  1. H0 - The percentage of American adults who decide not to go to college due to financial constraints is at most 50% (p ≤ .50).

  2. Ha - The percentage of American adults who decide not to go to college due to financial constraints is greater than 50% (p > .50).

#defining parameters of interest
p.0 <- .50   #hypothesized proportion (renamed to avoid masking the built-in constant pi)
n <- 331     #sample size
p.hat <- .48 #point estimate
alpha <- .05 #level of significance
#calculating standard error under the null
se <- sqrt((p.0 * (1-p.0))/n)

#calculating test statistic
z <- (p.hat - p.0)/se
print(round(z, 4))
## [1] -0.7277

#critical value for the one-sided test
qnorm(1 - alpha)
## [1] 1.644854

#calculating p-value (upper tail, matching Ha: p > .50)
p.value <- pnorm(z, lower.tail = FALSE)
print(p.value)
## [1] 0.7666125
myp <- function(p, alpha){
  if (p < alpha) print("REJECT H0") else print("FAIL TO REJECT H0")
}
#drawing conclusion
myp(p.value, alpha)
## [1] "FAIL TO REJECT H0"

Interpretation: At a significance level of .05, we fail to reject the null hypothesis: the test statistic (z ≈ -0.73) falls far below the critical value (1.645) and the p-value (.7666) exceeds alpha. The data provide no evidence that more than 50% of American adults skip college for financial reasons - indeed, the point estimate itself is below 50%.
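
As an exact cross-check (a sketch: the prompt reports only the rounded 48%, so treating it as 159 of 331 respondents is an assumption), base R's binom.test computes the exact binomial p-value.

#exact binomial test; x = 159 approximates 48% of 331
binom.test(x = 159, n = 331, p = 0.5, alternative = "greater") #one-sided p-value should be close to the .7666 above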

2.2 (2 points)

Would you expect a confidence interval (\(\alpha=.05\), double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.

#the confidence interval uses the sample proportion in the standard error
se.ci <- sqrt((p.hat * (1-p.hat))/n)

upper.bound <- p.hat + qnorm(p = .975) * se.ci
lower.bound <- p.hat - qnorm(p = .975) * se.ci

ci <- c(lower.bound, upper.bound)
print(round(ci, 4))
## [1] 0.4262 0.5338

Interpretation: Yes - the 95% confidence interval (.4262, .5338) contains the hypothesized value of .5, consistent with failing to reject the null in the corresponding two-sided test.

3 Question 3: Hypothesis testing - Difference of two means

3.1 (6 points)

You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22) was 6.1, with a standard deviation of 1.8.

Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different? Assume \(\alpha = 5\%\).

3.2 Hypothesis Test:

  1. H0 - The average number of food items recalled by the treatment and control groups are equivalent.

  2. Ha - The average number of food items recalled by the treatment and control groups are different.

#defining parameters of interest 

n.1 <- 22 #sample size of treatment group
mu.1 <- 4.9 #mean of treatment group
sigma.1 <- 1.8 #sd of treatment group

n.2 <- 22 #sample size of control group
mu.2 <- 6.1 #mean of control group
sigma.2 <- 1.8 #sd of control group

alpha <- .05 #level of significance 
#calculating variance
var.1 <- sigma.1^2
var.2 <- sigma.2^2

#calculating standard error
se <- sqrt((var.1/n.1)+(var.2/n.2))

#calculating test statistic (named t.stat to avoid masking base::t)
t.stat <- (mu.1 - mu.2)/se

#determining degrees of freedom (pooled df; with equal sds and ns, Welch's df is also 42)
df <- n.1 + n.2 - 2

#calculating p-value
p.value <- 2*pt(abs(t.stat), df = df, lower.tail = FALSE)
print(p.value)
## [1] 0.03252806
#drawing conclusion
myp(p.value, alpha)
## [1] "REJECT Ho"

Interpretation: At a significance level of .05, we reject the null hypothesis (p = .0325) - the data provide strong evidence that the average number of food items recalled differs between the treatment and control groups.
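
As a cross-check, here is a small helper (a sketch assuming only the summary statistics above are available) that reproduces the unpooled t statistic and two-sided p-value:

#t-test from summary statistics
t.from.summary <- function(m1, s1, n1, m2, s2, n2, df) {
  se <- sqrt(s1^2/n1 + s2^2/n2) #unpooled standard error
  t.stat <- (m1 - m2)/se
  p <- 2 * pt(abs(t.stat), df = df, lower.tail = FALSE)
  c(t.stat = t.stat, p.value = p)
}
t.from.summary(4.9, 1.8, 22, 6.1, 1.8, 22, df = 42) #matches t = -2.211, p = .0325 above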

4 Question 4: Working backwards

4.1 (6 points)

Suppose the population distribution is approximately normal and the population standard deviation is unknown. You are told the 90% confidence interval for the population mean is (65,77), and the confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, standard error and the population standard deviation.

#defining parameters of interest
n <- 25
upper.bound <- 77
lower.bound <- 65
alpha <- .1
#determining sample mean
x.bar <- (upper.bound + lower.bound)/2
print(x.bar)
## [1] 71

Interpretation: The mean of the sample is 71.

#determining margin of error
margin.error <- (upper.bound - lower.bound)/2
print(margin.error)
## [1] 6

Interpretation: The margin of error is 6.

#determining standard error (using the t critical value, since sigma is unknown)
se <- margin.error/qt(.95, df = n - 1)
print(se)
## [1] 3.506963

Interpretation: The standard error is 3.51.

#determining sigma
sigma <- (margin.error*sqrt(n))/qt(.95, df = n - 1)
print(sigma)
## [1] 17.53481

Interpretation: The standard deviation is 17.53.
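
Verification: plugging the recovered values back into the t-interval formula reproduces the original interval (exactly, by construction, since sigma = se × sqrt(n)).

#reconstructing the interval from the recovered values
x.bar + c(-1, 1) * qt(.95, df = n - 1) * sigma/sqrt(n)
## [1] 65 77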

5 Question 5: Assumptions

5.1 (4 points)

The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group -

mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
##       control treatment
## alive       4        24
## dead       30        45

Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?

Interpretation: The normal approximation for a difference in proportions relies on the success-failure condition - roughly 10 or more "successes" and 10 or more "failures" observed in each group. Here the control group has only 4 surviving patients (4 alive vs. 30 dead), so the condition fails and the sampling distribution of the difference in survival rates is not well approximated by a normal distribution. The small, unbalanced counts are the core problem, not merely the discreteness of the data.

If we constructed the interval anyway, its bounds could be badly misplaced and the stated 95% coverage would not hold: with such small counts the sampling distribution is skewed, so the interval may under- or overestimate the plausible range. Unreliable intervals lead to misleading conclusions and false confidence, preventing reliable inferences about the population. (A quick check of the rule of thumb follows below.)
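
A one-line check against the table defined above:

#success-failure check: any FALSE flags a violation of the rule of thumb
mym >= 10
##       control treatment
## alive   FALSE      TRUE
## dead     TRUE      TRUE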

6 Question 6: Hypothesis testing in a regression

6.1 (6 points)

Write your own function in R that executes a hypothesis test following a linear regression, using the t-distribution. The function should be your own, and should use the t-distribution. One of the arguments of the function should be an object of class “lm” (that is, the output of a “lm” function used to fit linear regressions). The other arguments should be the elements necessary for a hypothesis test (the coefficient tested, the level according to the Null and Alternative Hypotheses, the significance level, etc.) The function should return the test statistic, the p-value, and the conclusion of the test.

my.function <- function(my.lm, coefficient.name, null.value = 0, alternative = "two.sided", alpha = 0.05) {
  
  coeff.summary <- summary(my.lm)$coefficients #isolating coefficient
  
  #check
  if (!(coefficient.name %in% rownames(coeff.summary))) {
    stop("Coefficient name not found in the model.")
  }
  
  #isolating coefficient estimate and standard error from fitted model
  coeff.estimate <- coeff.summary[coefficient.name, "Estimate"]
  coeff.se <- coeff.summary[coefficient.name, "Std. Error"]
  
  t.stat <- (coeff.estimate - null.value)/coeff.se #calculating test statistic
  
  df <- df.residual(my.lm) #calculating degrees of freedom
  
  #calculating p.value under different scenarios
  if (alternative == "two.sided") {
    p.value <- 2 * pt(abs(t.stat), df = df, lower.tail = FALSE)
  } else if (alternative == "greater") {
    p.value <- pt(t.stat, df = df, lower.tail = FALSE)
  } else if (alternative == "less") {
    p.value <- pt(t.stat, df = df, lower.tail = TRUE)
  } else {
    stop("Invalid 'alternative' argument. Must be one of 'two.sided', 'greater', or 'less'.") #check
  }
  
  #drawing conclusion
  if (p.value < alpha) {
    conclusion <- "Reject Null Hypothesis"
  } else {
    conclusion <- "Fail to Reject Null Hypothesis"
  }
  
  #returning results
  return(list(
    t.stat =  t.stat,
    p.value = p.value,
    conclusion = conclusion
  ))
}

6.2 (10 points)

Test the function for a regression of your choice using any data set you wish. Specify the elements of the test and carry out the test by hand. Then check that the function you wrote in the previous part gives the same (correct) answer.

Hypothesis (test on the slope coefficient \(\beta_1\)):

  1. H0 - \(\beta_1 = 0\): the sepal length of iris flowers has no linear association with sepal width.

  2. Ha - \(\beta_1 \neq 0\): the sepal length of iris flowers has a linear association with sepal width.

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Estimating Equation:

\[ \text{Sepal.Width}_i = \beta_0 + \beta_1 \, \text{Sepal.Length}_i + \epsilon_i \] \[\epsilon_i \sim N(0,\sigma^2)\]

#built in function
my.lm <- lm(iris$Sepal.Width ~ iris$Sepal.Length)
summary(my.lm)
## 
## Call:
## lm(formula = iris$Sepal.Width ~ iris$Sepal.Length)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1095 -0.2454 -0.0167  0.2763  1.3338 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.41895    0.25356   13.48   <2e-16 ***
## iris$Sepal.Length -0.06188    0.04297   -1.44    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4343 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

Interpretation: The slope estimate indicates that a one-unit increase in sepal length is associated with a .06188 decrease in predicted sepal width; this coefficient is not statistically significant (p = .152). The intercept suggests that when sepal length is zero, the predicted mean sepal width is 3.41895 - that estimate is statistically significant at the .001 level, although a sepal length of zero lies well outside the range of the data.

6.2.1 Calculation by Hand:

sepal.length.coeff <- coef(my.lm)[2] #isolating coefficient
sepal.length.se <- summary(my.lm)$coefficients[2, "Std. Error"] #isolating se

#calculating test statistic
t.stat.by.hand <- (sepal.length.coeff - 0) / sepal.length.se

#determining degrees of freedom (n minus the 2 estimated coefficients)
df <- nrow(iris) - 2

#determining p-value
p.value.by.hand <- 2 * pt(-abs(t.stat.by.hand), df = df)
print(t.stat.by.hand)
## iris$Sepal.Length 
##         -1.440287
print(p.value.by.hand)
## iris$Sepal.Length 
##         0.1518983

Interpretation: At a significance level of .05, we do not have sufficient evidence to reject the null hypothesis - our p-value of .152 is greater than alpha (.05). The data suggest sepal length does not have a statistically detectable effect on sepal width.

6.2.2 Calculation using Custom Function:

#using custom function
result <- my.function(my.lm, "iris$Sepal.Length", alternative = "two.sided", alpha = 0.05)

print(result)
## $t.stat
## [1] -1.440287
## 
## $p.value
## [1] 0.1518983
## 
## $conclusion
## [1] "Fail to Reject Null Hypothesis"

Interpretation: My custom function and the manual calculation yield identical results (t = -1.4403, p = .1519). Our conclusion thus holds: at a significance level of .05, we do not have sufficient evidence to reject the null hypothesis - the data suggest sepal length does not have a statistically detectable effect on sepal width.

7 Question 7: Determinants of smoking

Import the smoking.csv file into R. The variables are “smoker” - whether or not one smokes; “smkban” - whether or not one faces a smoking ban at one’s workplace (this is fairly old data); “age” - in years; “hsdrop” - high school dropout; “hsgrad” - graduated from high school but has no further education; “colsome” - has some college education; “colgrad” - graduated from college, as well as some self-explanatory demographic variables.

#setting wd
setwd("/Users/josephmancuso/Documents/BC/Spring'24/Final Exam")
getwd()
## [1] "/Users/josephmancuso/Documents/BC/Spring'24/Final Exam"
#importing csv file
smoke <- read.csv("/Users/josephmancuso/Documents/BC/Spring'24/Final Exam/smoking.csv"
                                    , check.names = FALSE
                                    , stringsAsFactors = FALSE
                                    , na.strings = ""
                                    )
#preparing summary statistics
describe(smoke)
##          vars     n  mean    sd median trimmed   mad min max range  skew
## smoker      1 10000  0.24  0.43      0    0.18  0.00   0   1     1  1.20
## smkban      2 10000  0.61  0.49      1    0.64  0.00   0   1     1 -0.45
## age         3 10000 38.69 12.11     37   37.98 13.34  18  88    70  0.51
## hsdrop      4 10000  0.09  0.29      0    0.00  0.00   0   1     1  2.84
## hsgrad      5 10000  0.33  0.47      0    0.28  0.00   0   1     1  0.74
## colsome     6 10000  0.28  0.45      0    0.23  0.00   0   1     1  0.98
## colgrad     7 10000  0.20  0.40      0    0.12  0.00   0   1     1  1.52
## black       8 10000  0.08  0.27      0    0.00  0.00   0   1     1  3.18
## hispanic    9 10000  0.11  0.32      0    0.02  0.00   0   1     1  2.44
## female     10 10000  0.56  0.50      1    0.58  0.00   0   1     1 -0.26
##          kurtosis   se
## smoker      -0.55 0.00
## smkban      -1.80 0.00
## age         -0.22 0.12
## hsdrop       6.06 0.00
## hsgrad      -1.45 0.00
## colsome     -1.04 0.00
## colgrad      0.32 0.00
## black        8.08 0.00
## hispanic     3.94 0.00
## female      -1.93 0.00
#checking for missing values
colSums(is.na(smoke))
##   smoker   smkban      age   hsdrop   hsgrad  colsome  colgrad    black 
##        0        0        0        0        0        0        0        0 
## hispanic   female 
##        0        0

7.1 (4 points)

Estimate a logit regression model of whether or not one smokes as a function of the other variables.

#running correlation analysis
cor(smoke)
##                 smoker       smkban         age      hsdrop      hsgrad
## smoker    1.0000000000 -0.088295899 -0.03417260  0.09891966  0.11677429
## smkban   -0.0882958987  1.000000000  0.04002314 -0.09622929 -0.07763858
## age      -0.0341725998  0.040023142  1.00000000  0.04401114  0.02607083
## hsdrop    0.0989196643 -0.096229290  0.04401114  1.00000000 -0.22061488
## hsgrad    0.1167742886 -0.077638583  0.02607083 -0.22061488  1.00000000
## colsome   0.0052359677  0.015219258 -0.09235948 -0.19764754 -0.43450990
## colgrad  -0.1230708747  0.085258235 -0.03388195 -0.15700484 -0.34516067
## black     0.0005880358  0.006974056 -0.00480490  0.01677509  0.02548611
## hispanic -0.0204386397 -0.034598175 -0.10451016  0.28983694 -0.03117953
## female   -0.0314577461  0.115149149  0.02779034 -0.06450376  0.08210537
##               colsome     colgrad         black     hispanic      female
## smoker    0.005235968 -0.12307087  0.0005880358 -0.020438640 -0.03145775
## smkban    0.015219258  0.08525823  0.0069740559 -0.034598175  0.11514915
## age      -0.092359484 -0.03388195 -0.0048049004 -0.104510156  0.02779034
## hsdrop   -0.197647536 -0.15700484  0.0167750927  0.289836936 -0.06450376
## hsgrad   -0.434509903 -0.34516067  0.0254861131 -0.031179533  0.08210537
## colsome   1.000000000 -0.30922736  0.0338694662 -0.005440164  0.03659891
## colgrad  -0.309227364  1.00000000 -0.0440024820 -0.097196535 -0.03274681
## black     0.033869466 -0.04400248  1.0000000000 -0.066529449  0.03823082
## hispanic -0.005440164 -0.09719654 -0.0665294488  1.000000000 -0.03639827
## female    0.036598909 -0.03274681  0.0382308196 -0.036398270  1.00000000
#fitting model
my.logit <- glm(smoker ~ ., data = smoke, family = "binomial")
summary(my.logit)
## 
## Call:
## glm(formula = smoker ~ ., family = "binomial", data = smoke)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.696765   0.138696 -12.234  < 2e-16 ***
## smkban      -0.250735   0.049164  -5.100 3.40e-07 ***
## age         -0.007452   0.001987  -3.751 0.000176 ***
## hsdrop       1.931075   0.131261  14.712  < 2e-16 ***
## hsgrad       1.523305   0.114308  13.326  < 2e-16 ***
## colsome      1.180080   0.116518  10.128  < 2e-16 ***
## colgrad      0.424753   0.125801   3.376 0.000734 ***
## black       -0.149472   0.089994  -1.661 0.096732 .  
## hispanic    -0.584845   0.083085  -7.039 1.93e-12 ***
## female      -0.188720   0.049105  -3.843 0.000121 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11074  on 9999  degrees of freedom
## Residual deviance: 10502  on 9990  degrees of freedom
## AIC: 10522
## 
## Number of Fisher Scoring iterations: 4
plot(my.logit)

7.2 (4 points)

What does the coefficient on “smkban” tell you? How about the coefficient on “age”? How about the coefficient on “colgrad”?

Interpretation:

  • smkban - Facing a workplace smoking ban is associated with a .2507 decrease in the log odds of smoking, holding the other variables constant. The coefficient is statistically significant at the .001 level, consistent with workplace bans reducing the likelihood of smoking among employees.

  • age - Each additional year of age is associated with a .0074 decrease in the log odds of smoking, holding the other variables constant; older individuals are less likely to smoke. The coefficient is statistically significant at the .001 level.

  • colgrad - Being a college graduate, rather than a member of the omitted education category (presumably those with education beyond college), is associated with a .4248 increase in the log odds of smoking. The coefficient is statistically significant at the .001 level. (The odds-ratio conversion below restates these effects on a multiplicative scale.)
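
Since log odds are hard to read directly, exponentiating the coefficients converts them to odds ratios (a small sketch; the values follow from the estimates above, e.g. exp(-0.2507) ≈ 0.778):

#converting log odds to odds ratios
round(exp(coef(my.logit)), 3) #smkban ≈ 0.778: facing a ban multiplies the odds of smoking by about 0.78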

8 Question 8: Difference-in-differences

Import the “nsw.csv” file in R. The data set contains observations from The National Supported Work Demonstration (NSW), which was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. The variables of interest are “education” – the number of years of education for a worker in the sample; “age” – a worker’s age in years; “re75” – real earnings in 1975, at the beginning of the program; “re78” - real earnings in 1978, at the end of the program; “treat” - whether or not a worker was placed in the treatment group receiving work experience and counseling in a sheltered environment; as well as a few self-explanatory demographic variables.

#importing csv file
nsw <- read.csv("/Users/josephmancuso/Documents/BC/Spring'24/Final Exam/nsw.csv"
                                    , check.names = FALSE
                                    , stringsAsFactors = FALSE
                                    , na.strings = ""
                                    )
#preparing summary statistics
describe(nsw)
##           vars   n    mean      sd  median trimmed     mad min      max
## data_id*     1 722    1.00    0.00    1.00    1.00    0.00   1     1.00
## treat        2 722    0.41    0.49    0.00    0.39    0.00   0     1.00
## age          3 722   24.52    6.63   23.00   23.56    5.93  17    55.00
## education    4 722   10.27    1.70   10.00   10.37    1.48   3    16.00
## black        5 722    0.80    0.40    1.00    0.88    0.00   0     1.00
## hispanic     6 722    0.11    0.31    0.00    0.01    0.00   0     1.00
## married      7 722    0.16    0.37    0.00    0.08    0.00   0     1.00
## nodegree     8 722    0.78    0.41    1.00    0.85    0.00   0     1.00
## re75         9 722 3042.90 5066.14  936.31 1932.15 1388.17   0 37431.66
## re78        10 722 5454.64 6252.94 3951.89 4482.25 5859.07   0 60307.93
##              range  skew kurtosis     se
## data_id*      0.00   NaN      NaN   0.00
## treat         1.00  0.36    -1.87   0.02
## age          38.00  1.40     2.32   0.25
## education    13.00 -0.69     1.67   0.06
## black         1.00 -1.50     0.25   0.01
## hispanic      1.00  2.57     4.60   0.01
## married       1.00  1.83     1.35   0.01
## nodegree      1.00 -1.35    -0.18   0.02
## re75      37431.66  2.95    11.51 188.54
## re78      60307.93  2.21    10.49 232.71
#checking for missing values
colSums(is.na(nsw))
##   data_id     treat       age education     black  hispanic   married  nodegree 
##         0         0         0         0         0         0         0         0 
##      re75      re78 
##         0         0

8.1 (4 points)

Estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978 in a univariate regression. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)?

  1. H0 - Receiving work experience and counseling in a sheltered environment has no effect on the increase in real earnings between 1975 and 1978

  2. Ha - Receiving work experience and counseling in a sheltered environment has an effect on the increase in real earnings between 1975 and 1978.

#determining the increase in earnings and appending it to the dataset
nsw$increase.earnings <- nsw$re78 - nsw$re75

#fitting model
nsw.lm <- lm(nsw$increase.earnings ~ nsw$treat)
summary(nsw.lm)
## 
## Call:
## lm(formula = nsw$increase.earnings ~ nsw$treat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37995  -3198  -1442   3677  56114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2063.4      359.3   5.743 1.37e-08 ***
## nsw$treat      846.9      560.1   1.512    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7406 on 720 degrees of freedom
## Multiple R-squared:  0.003165,   Adjusted R-squared:  0.00178 
## F-statistic: 2.286 on 1 and 720 DF,  p-value: 0.131

Interpretation: The coefficient indicates that, on average, treated individuals' real earnings increased by $846.90 more between 1975 and 1978 than those of untreated individuals. The p-value of .131 is greater than the 10% significance level, so the estimate is not statistically significant - we do not have sufficient evidence to reject the null hypothesis.

8.2 (4 points)

Re-estimate the effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978, but this time use the other variables as control variables. What is this effect? Is it statistically significant at 10% significance level (make sure to state the hypotheses)? How does it compare to the effect estimated in the previous part?

#fitting model
nsw.lm.2 <- lm(nsw$increase.earnings ~ nsw$treat + nsw$age + nsw$education + nsw$black + nsw$hispanic + nsw$married + nsw$nodegree) 
summary(nsw.lm.2)
## 
## Call:
## lm(formula = nsw$increase.earnings ~ nsw$treat + nsw$age + nsw$education + 
##     nsw$black + nsw$hispanic + nsw$married + nsw$nodegree)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38153  -3383  -1076   3503  55863 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    1986.822   3123.319   0.636  0.52490   
## nsw$treat       869.294    561.559   1.548  0.12207   
## nsw$age           4.194     43.424   0.097  0.92308   
## nsw$education    26.633    215.306   0.124  0.90159   
## nsw$black       112.085    956.157   0.117  0.90672   
## nsw$hispanic   1229.152   1253.250   0.981  0.32704   
## nsw$married   -2210.949    767.530  -2.881  0.00409 **
## nsw$nodegree   -217.758    891.527  -0.244  0.80710   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7384 on 714 degrees of freedom
## Multiple R-squared:  0.01729,    Adjusted R-squared:  0.007654 
## F-statistic: 1.794 on 7 and 714 DF,  p-value: 0.08539

Interpretation: The coefficient is slightly higher ($869.29) when controlling for the other variables. However, the p-value is not statistically significant, and again exceeds the 10% level of significance (.12207). Consequently, we can draw a similar conclusion - we have insufficient evidence to reject the null hypothesis.

8.3 (6 points)

Create a data frame in which to estimate the same effect, but this time using a difference-in-differences approach. Don’t use any other control variables. What is the estimated effect of receiving work experience and counseling in a sheltered environment on the increase in real earnings between 1975 and 1978?

#splitting the sample into treated and control groups
treated <- nsw[nsw$treat == 1, ]
control <- nsw[nsw$treat == 0, ]

#determining mean change in earnings for each group
treated.diff <- mean(treated$increase.earnings)
control.diff <- mean(control$increase.earnings)

#calculating DID
diff.in.diff <- treated.diff - control.diff

print(diff.in.diff)
## [1] 846.8883

Interpretation: The DID estimate indicates the treatment group's real earnings grew by $846.89 more between 1975 and 1978 than the control group's. This matches the univariate regression coefficient from 8.1 exactly, as it must - the sketch below shows the equivalent interaction regression.
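
Equivalently, the canonical regression formulation of DID (a sketch): reshape to one row per worker-year and interact treatment with a post-period indicator; the coefficient on the interaction is the difference-in-differences.

#difference-in-differences as an interaction regression
nsw.long <- data.frame(earnings = c(nsw$re75, nsw$re78),
                       treat    = rep(nsw$treat, 2),
                       post     = rep(c(0, 1), each = nrow(nsw)))
did.lm <- lm(earnings ~ treat * post, data = nsw.long)
coef(did.lm)["treat:post"] #reproduces 846.8883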

9 Question 9: Data Question

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

Import the titanic train.csv file into R. You will have to clean the data to run some of the commands/set up the workflow.

#importing csv file
train <- read.csv("/Users/josephmancuso/Documents/BC/Spring'24/Final Exam/train.csv")
missing.observations <- colSums(is.na(train))
print(missing.observations)
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
#imputing missing values
age.source.data <- train$Age #Storing source data

age.median <- median(train$Age, na.rm = T) #Calculating median age 

train$Age[is.na(train$Age)] <- median(train$Age, na.rm=TRUE) #Replacing NAs with median age 

sum(is.na(train$Age)) #Check 
## [1] 0
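
A quick before/after comparison (using the age.source.data copy saved above) shows what median imputation does to the distribution:

#comparing the age distribution before and after imputation
summary(age.source.data) #original values, including 177 NAs
summary(train$Age)       #imputed: mass concentrates at the median of 28
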
#transforming sex into a binary
train$Sex <- ifelse(train$Sex == "male",1,0)
  1. (4 points) Run some preliminary correlations of Survived with some other variables.

    cor(train$Survived, train$Age)
    ## [1] -0.06491042

    Interpretation: Very weak, negative correlation between age and survival.

    cor(train$Survived, train$Sex)
    ## [1] -0.5433514

    Interpretation: Moderate, negative correlation between sex and survival - with males coded as 1, men were considerably less likely to survive.

    cor(train$Survived, train$Pclass)
    ## [1] -0.338481

    Interpretation: Moderate, negative correlation between passenger class and survival - higher Pclass values denote lower classes, so third-class passengers were less likely to survive.

    cor(train$Survived, train$SibSp)
    ## [1] -0.0353225

    Interpretation: Very weak, negative correlation between the number of siblings and spouses aboard and survival.

    cor(train$Survived, train$Parch)
    ## [1] 0.08162941

    Interpretation: Very weak, positive correlation between the number of parents and children aboard and survival.

  2. (3 points) Conduct descriptive statistics of the dataset. Anything interesting you find?

    describe(train)
    ##             vars   n   mean     sd median trimmed    mad  min    max  range
    ## PassengerId    1 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
    ## Survived       2 891   0.38   0.49   0.00    0.35   0.00 0.00   1.00   1.00
    ## Pclass         3 891   2.31   0.84   3.00    2.39   0.00 1.00   3.00   2.00
    ## Name*          4 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
    ## Sex            5 891   0.65   0.48   1.00    0.68   0.00 0.00   1.00   1.00
    ## Age            6 891  29.36  13.02  28.00   28.83   8.90 0.42  80.00  79.58
    ## SibSp          7 891   0.52   1.10   0.00    0.27   0.00 0.00   8.00   8.00
    ## Parch          8 891   0.38   0.81   0.00    0.18   0.00 0.00   6.00   6.00
    ## Ticket*        9 891 339.52 200.83 338.00  339.65 268.35 1.00 681.00 680.00
    ## Fare          10 891  32.20  49.69  14.45   21.38  10.24 0.00 512.33 512.33
    ## Cabin*        11 891  18.63  38.14   1.00    8.29   0.00 1.00 148.00 147.00
    ## Embarked*     12 891   3.53   0.80   4.00    3.66   0.00 1.00   4.00   3.00
    ##              skew kurtosis   se
    ## PassengerId  0.00    -1.20 8.62
    ## Survived     0.48    -1.77 0.02
    ## Pclass      -0.63    -1.28 0.03
    ## Name*        0.00    -1.20 8.62
    ## Sex         -0.62    -1.62 0.02
    ## Age          0.51     0.97 0.44
    ## SibSp        3.68    17.73 0.04
    ## Parch        2.74     9.69 0.03
    ## Ticket*      0.00    -1.28 6.73
    ## Fare         4.77    33.12 1.66
    ## Cabin*       2.09     3.07 1.28
    ## Embarked*   -1.27    -0.16 0.03

    Highlights: About 38% of passengers survived (mean of Survived = .38), and roughly 65% were male (mean of Sex = .65). Fare is heavily right-skewed (skew = 4.77; median 14.45 vs. mean 32.20, max 512.33), hinting at wide socioeconomic disparities. Most passengers traveled without siblings/spouses or parents/children aboard (medians of SibSp and Parch are 0), and the median age was 28.

  3. (3 points) Use the set.seed(100) command, and create a subset of the train dataset that has only 500 observations.
set.seed(100)
train.subset <- sample_n(train, 500)
  4. (4 points) Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample.

    my.lm <- lm(train.subset$Survived ~ train.subset$Pclass + train.subset$Age + train.subset$Sex)
    summary(my.lm)
    ## 
    ## Call:
    ## lm(formula = train.subset$Survived ~ train.subset$Pclass + train.subset$Age + 
    ##     train.subset$Sex)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -1.0791 -0.2568 -0.1164  0.2534  0.9845 
    ## 
    ## Coefficients:
    ##                      Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)          1.267864   0.082326  15.401  < 2e-16 ***
    ## train.subset$Pclass -0.176925   0.023193  -7.628 1.23e-13 ***
    ## train.subset$Age    -0.005937   0.001480  -4.012 6.94e-05 ***
    ## train.subset$Sex    -0.454444   0.039011 -11.649  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.4007 on 496 degrees of freedom
    ## Multiple R-squared:  0.3227, Adjusted R-squared:  0.3186 
    ## F-statistic: 78.76 on 3 and 496 DF,  p-value: < 2.2e-16

    Interpretation: The coefficients suggest that a one-unit increase in Pclass, Age, or Sex decreases the predicted survival probability by .1769, .0059, and .4544 respectively, holding the others constant. All three coefficients are statistically significant at the .001 level. The adjusted R-squared of .3186 suggests the model could be improved, given that a large share of the variability remains unexplained.

    plot(my.lm)

  5. (4 points) Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficients from your linear model to forecast/predict/estimate the survival variable given independent variable values/data.

#predicted probabilities of survival
predictions <- predict(my.lm, newdata = train.subset, type = "response")

#defining threshold
threshold <- 0.5

#transforming predicted probabilities into binary predictions
predicted.survival <- ifelse(predictions >= threshold, yes = 1, no = 0)

table(predicted.survival)
## predicted.survival
##   0   1 
## 339 161

Interpretation: My model predicts that 339 of the 500 sampled individuals did not survive the journey.

  6. (4 points) Create a confusion matrix, and calculate the accuracy for the entire train dataset. In other words, how many individuals who survived or died in the Titanic crash were correctly classified by your regression?
#expanding model to encompass full dataset
my.lm.2 <- lm(train$Survived ~ train$Pclass + train$Age + train$Sex)
summary(my.lm.2)
## 
## Call:
## lm(formula = train$Survived ~ train$Pclass + train$Age + train$Sex)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08688 -0.23209 -0.08626  0.22073  0.99862 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.282644   0.058032  22.103  < 2e-16 ***
## train$Pclass -0.185779   0.016561 -11.218  < 2e-16 ***
## train$Age    -0.004993   0.001058  -4.721 2.72e-06 ***
## train$Sex    -0.499229   0.027332 -18.265  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3828 on 887 degrees of freedom
## Multiple R-squared:  0.3832, Adjusted R-squared:  0.3811 
## F-statistic: 183.7 on 3 and 887 DF,  p-value: < 2.2e-16
#predicted probabilities of survival
predictions.2 <- predict(my.lm.2, newdata = train, type = "response")

#defining threshold
threshold.2 <- 0.5

#transforming predicted probabilities into binary predictions
predicted.survival.2 <- ifelse(predictions.2 >= threshold.2, yes = 1, no = 0)

table(predicted.survival.2)
## predicted.survival.2
##   0   1 
## 573 318
#building confusion matrix
observed <- train$Survived
outcome.table <- table(observed, predicted.survival.2)
outcome.table
##         predicted.survival.2
## observed   0   1
##        0 467  82
##        1 106 236
#calculating accuracy
outcome.table.accuracy <- sum(diag(outcome.table)) / sum(outcome.table)
print(outcome.table.accuracy)
## [1] 0.7890011

Interpretation: When extending the model to the full train dataset, the model accurately classified 79% of passengers.
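
Beyond overall accuracy, the confusion matrix also yields the class-specific rates (a small sketch computed from the table above):

#sensitivity (true-positive rate) and specificity (true-negative rate)
outcome.table["1", "1"] / sum(outcome.table["1", ]) #sensitivity: 236/342, about .69
outcome.table["0", "0"] / sum(outcome.table["0", ]) #specificity: 467/549, about .85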

  7. (6 points) Pclass, Sex, and Age can give you an accuracy of about 80%. Can you get higher accuracy?

Organise your work in the form of a short report. Make sure to defend (OLS assumptions) and explain your model - interpretation of your coefficients, how well your linear regression can be used to predict correctly, et cetera.

The following modelling approaches may be helpful (read, and focus on relevant sections) -

https://www.kaggle.com/code/jeremyd/titanic-logistic-regression-in-r

https://www.kaggle.com/code/hillabehar/titanic-analysis-with-r/report

describe(train)
##             vars   n   mean     sd median trimmed    mad  min    max  range
## PassengerId    1 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Survived       2 891   0.38   0.49   0.00    0.35   0.00 0.00   1.00   1.00
## Pclass         3 891   2.31   0.84   3.00    2.39   0.00 1.00   3.00   2.00
## Name*          4 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Sex            5 891   0.65   0.48   1.00    0.68   0.00 0.00   1.00   1.00
## Age            6 891  29.36  13.02  28.00   28.83   8.90 0.42  80.00  79.58
## SibSp          7 891   0.52   1.10   0.00    0.27   0.00 0.00   8.00   8.00
## Parch          8 891   0.38   0.81   0.00    0.18   0.00 0.00   6.00   6.00
## Ticket*        9 891 339.52 200.83 338.00  339.65 268.35 1.00 681.00 680.00
## Fare          10 891  32.20  49.69  14.45   21.38  10.24 0.00 512.33 512.33
## Cabin*        11 891  18.63  38.14   1.00    8.29   0.00 1.00 148.00 147.00
## Embarked*     12 891   3.53   0.80   4.00    3.66   0.00 1.00   4.00   3.00
##              skew kurtosis   se
## PassengerId  0.00    -1.20 8.62
## Survived     0.48    -1.77 0.02
## Pclass      -0.63    -1.28 0.03
## Name*        0.00    -1.20 8.62
## Sex         -0.62    -1.62 0.02
## Age          0.51     0.97 0.44
## SibSp        3.68    17.73 0.04
## Parch        2.74     9.69 0.03
## Ticket*      0.00    -1.28 6.73
## Fare         4.77    33.12 1.66
## Cabin*       2.09     3.07 1.28
## Embarked*   -1.27    -0.16 0.03

Background: One of the key takeaways from our first assignment was the interaction between the age of passengers and survival status. Notched boxplots were created to compare the median age of Titanic passengers who survived or perished. The top boxplot (“Survived”) indicates most Titanic survivors were young adults - the trend holds for passengers who did not survive the voyage as indicated by the lower boxplot (“Perished”).

Additionally, children were more likely to survive the voyage, as indicated by the lower minimum line in the Survived boxplot. The reverse holds true for the elderly, who were less likely to survive the voyage (lower maximum and a greater number of outliers in the Perished boxplot).

Given that age was a statistically significant independent variable in our initial OLS model, and that the likelihood of survival varies across age groups, I believe classifying the age variable into groups will improve the accuracy of predictions.

boxplot(train$Age ~ train$Survived, notch = TRUE, horizontal = TRUE,
        xlab = "Age",
        ylab = "Survived",
        main = "Age & Survival Rate")

train.2 <- as.data.frame(train)

#binning age variable
bins <- c(0,18,50,80)
labels <- c("Youth", "Adult", "Old")

train.2 <- train.2 %>%
  mutate(age.categories = cut(Age, breaks = bins, labels = labels)) #cut() already returns a factor

Model: To test my assumption, I built a logistic regression model controlling for Pclass, the new binned age variable, Sex, Parch, and SibSp. I added Parch under the assumption that passengers with children aboard would behave differently from passengers traveling alone - parents will likely prioritize the welfare of their children, which could change the odds of survival. Similar logic justifies the inclusion of SibSp - passengers may have behaved differently if “loved ones” were aboard, which could be a factor in survival odds.

my.logit <- glm(Survived ~ Pclass + age.categories + Sex + Parch + SibSp, data = train.2, family = "binomial")
summary(my.logit)
## 
## Call:
## glm(formula = Survived ~ Pclass + age.categories + Sex + Parch + 
##     SibSp, family = "binomial", data = train.2)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           4.7190     0.4418  10.680  < 2e-16 ***
## Pclass               -1.0759     0.1143  -9.409  < 2e-16 ***
## age.categoriesAdult  -0.9888     0.2629  -3.761 0.000169 ***
## age.categoriesOld    -1.7884     0.4282  -4.177 2.96e-05 ***
## Sex                  -2.7486     0.1978 -13.899  < 2e-16 ***
## Parch                -0.1060     0.1132  -0.936 0.349383    
## SibSp                -0.3319     0.1092  -3.040 0.002367 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  797.97  on 884  degrees of freedom
## AIC: 811.97
## 
## Number of Fisher Scoring iterations: 5

Interpretation: The pattern of significance is similar to the original OLS model - Pclass and Sex remain statistically significant. Interestingly, both the “Adult” and “Old” age categories are statistically significant at the .001 level relative to the “Youth” baseline. My assumption about SibSp also appears to hold - the presence of siblings or spouses aboard is statistically significant at the .01 level. However, the model challenges my framing: the log odds of survival decrease with each additional sibling or spouse aboard (the same direction holds for Parch, though it is not significant).

The drop from the null deviance (1186.66) to the residual deviance (797.97) - 388.69 on 6 degrees of freedom - indicates the predictors explain a substantial portion of the variability. The AIC (811.97), by contrast, is only meaningful for comparing candidate models, so it alone does not indicate over-fitting. (A rough deviance-based fit summary follows below.)
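
McFadden's pseudo-R-squared offers a quick deviance-based fit summary (a sketch; it is no substitute for out-of-sample validation):

#McFadden's pseudo-R-squared
1 - my.logit$deviance/my.logit$null.deviance #roughly 0.33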

#predicted probabilities of survival
predictions.2 <- predict(my.logit, newdata = train.2, type = "response")

#defining threshold
threshold.2 <- 0.5

#transforming predicted probabilities into binary predictions
predicted.survival.2 <- ifelse(predictions.2 >= threshold.2, yes = 1, no = 0)

table(predicted.survival.2)
## predicted.survival.2
##   0   1 
## 597 294
#building confusion matrix
observed.2 <- train.2$Survived
outcome.table.2 <- table(observed.2, predicted.survival.2)
outcome.table.2
##           predicted.survival.2
## observed.2   0   1
##          0 486  63
##          1 111 231
#determining accuracy
outcome.table.accuracy.2 <- sum(diag(outcome.table.2)) / sum(outcome.table.2)

print(outcome.table.accuracy.2)
## [1] 0.8047138

Interpretation: My new model slightly improved on the accuracy of the original OLS model (80.5% vs. 78.9%). Greater accuracy may be achieved by adding socioeconomic variables (based on Fare and Pclass) to categorize passengers. An alternative approach could involve interacting Age and Sex - both variables are highly statistically significant and, together, may be a more accurate predictor of survival (see the sketch below).
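
A sketch of that interaction idea (a hypothetical specification, not estimated here; Parch is dropped given its insignificance above):

#allowing the age-group effect to differ by sex
my.logit.interaction <- glm(Survived ~ Pclass + age.categories * Sex + SibSp,
                            data = train.2, family = "binomial")
summary(my.logit.interaction)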