1. Conduct the probit regression
library(AER)
library(xtable)
data("SwissLabor")
require(mgcv)
require(np)
require(arm)
SwissLabor$agesq <- (SwissLabor$age*10)^2 / 1000
SwissLabor$nyc <- SwissLabor$youngkids
SwissLabor$noc <- SwissLabor$oldkids
SwissLabor$nlinc <- SwissLabor$income
probit_model <- glm(participation ~ age + agesq + education + nyc + noc + nlinc + foreign, family = binomial(link = "probit"), data = SwissLabor)
summary(probit_model)
Call:
glm(formula = participation ~ age + agesq + education + nyc +
noc + nlinc + foreign, family = binomial(link = "probit"),
data = SwissLabor)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.919 -0.970 -0.479 1.021 2.480
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.7491 1.4069 2.66 0.0077 **
age 2.0753 0.4054 5.12 0.0000003077309 ***
agesq -2.9434 0.4995 -5.89 0.0000000037941 ***
education 0.0192 0.0179 1.07 0.2843
nyc -0.7145 0.1004 -7.12 0.0000000000011 ***
noc -0.1470 0.0509 -2.89 0.0039 **
nlinc -0.6669 0.1320 -5.05 0.0000004327887 ***
foreignyes 0.7144 0.1213 5.89 0.0000000039152 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1203.2 on 871 degrees of freedom
Residual deviance: 1017.2 on 864 degrees of freedom
AIC: 1033
Number of Fisher Scoring iterations: 4
2.1: Histograms
hist(SwissLabor$age*10, xlab="Age in years", main="Distribution of Age within the Swiss Sample", col="lightblue")

If all Swiss citizens were evenly represented in this sample and that Swiss population is either growing or constant, we would expect that there would be a slight drop-off. Since there is a build-up to age 30, we should suspect that younger citizens in this sample are somehow different from those who are older.

This sample looks roughly normally distributed with the average number of years of formal education equal to less than high school.

Within the sample the vast majority of our sample has no young children.

Non-labor income looks like it may be exponentially distributed with a mean of approximately ~$35.000. The log tranformation strongly controls for this distribution such that the log of yearly non-labor income is approximately normal.

While the median number of older children is zero, many more people within the sample have older children than younger children.
plot(as.factor(SwissLabor$participation), main="Distribution of labor force participation in the Swiss sample", col="lightblue")

When plotting labor force participation of the sample, we see that the majority of the sample does not participate in the labor force, but that both participating and non-participation are present in the sample.
plot(as.factor(SwissLabor$foreign), main="Distribution of foreign residency in the Swiss sample", col="lightblue")

Roughly 75% of the sample is not a foreign resident.
2.2: Logit: log odds regression from table one, linear in log odds
Instructions: Write out the logit (log odds) regression equivalent of the model proposed in Table I (i.e., write the regression model in terms of a function that is linear in the log odds).
exp(cbind(OR = coef(probit_model), confint(probit_model)))
Waiting for profiling to be done...
OR 2.5 % 97.5 %
(Intercept) 42.482 2.70 705.05
age 7.967 3.60 17.77
agesq 0.053 0.02 0.14
education 1.019 0.98 1.06
nyc 0.489 0.40 0.59
noc 0.863 0.78 0.95
nlinc 0.513 0.39 0.66
foreignyes 2.043 1.61 2.59
# xtabs(~participation + foreign, data = SwissLabor)
Thus, the log odds equivalent is to: \(Participation_{odds}=~42.4+7.9age+0.053age^2+1.02educ+-.49NYC+-.86NOC+-.51nlinc+2.04Foreign\\\) indicating that for a one unit increase in years of formal schooling (\(education\)), we expect the odds of participating in the labor force to increase by a factor of ~\(1.02\).
Next, write the equivalent model in terms of the Bernoulli likelihood function using the logit link function.
In compact form, the Bernoulli likelihood function is: \(Pr(Part_{i} = y_{i}) = \pi^{y_{i}}_{i}\times(1-\pi_{y_{i}})^{1-y_{i}}\) for \(y_{i} = [0,1]\). Using the return values from R, this function is equal to: \(\pi_i = \frac{exp(Participation_{odds})}{1+exp(Participation_{odds})}\) with the odds previously defined.
The logit function (log unit) calculates the “log-odds” that the dependent variable, participation, will be true. By first taking the odds, we remove the range restriction from the dependent variable. What this means is that we will no longer generate estimates of probability that are greater than one or less than zero. If we do not, and instead directly transformed the dependent variable, we would likely overpredict the certainty of the tails of the distribution.
3. Conduct the logistic regression version of the model proposed in Table I.
Interpret the logistic regression coefficients in terms of odds ratios. Note: Don’t worry about interpreting the age coefficients. Plot the predicted probability of labor force participation against the woman’s age, holding the other variables at their mean. (1 point)
logr_model <- glm(participation ~ age + agesq + education + nyc + noc + nlinc + foreign, family = binomial, data=SwissLabor)
# summary(probit_model)
# summary(logr_model)
# get the raw fitted probabilities
probit_predict <- predict(probit_model, type = 'response')
logr_predict <- predict(logr_model, type = 'response')
################################################################################
## Calibration table
probs <- logr_predict
# Get the deciles of the fitted probabilities
decile.cutpoints <- quantile(probs, probs = seq(0, 1, .1))
# Create a new variable that identifies
# the quantile that each fitted probibility falls in
decileID <- cut(probs,
breaks = decile.cutpoints,
labels = 1:10,
include.lowest = TRUE)
# cbind(probs, decileID)
# Calculate the number that switched in each decile
# You can see the counts by using table:
tab <- table(decileID, participation)
# tab
# To turn the table into a data.frame, use as.data.frame.matrix
observed <- as.data.frame.matrix(tab)
# To calculate the expected values for each decile,
# we need to sum the fitted probabilities for each decile
# We can do this by using tapply to compute
# the sum of probs within each level of decileID:
expected1 <- tapply(probs, decileID, FUN = sum)
# Create a summary calibration table
# Create cutpoints that represent the 9 intervals that exist for deciles
interval.cutpoints <- round(quantile(probs, probs = seq(.1, 1, .1)), 2)
# Create a dataframe with these cutpoints:
cal <- data.frame(interval.cutpoints)
# Add a column of observed 1's
cal$observed1 <- observed[, 2]
# Add a column of expected 1's
cal$expected1 <- round(expected1, 0)
# Add a column for the total # of observations in each decile
cal$total <- table(decileID)
################################################################################
| 10% |
0.16 |
11 |
10 |
88 |
| 20% |
0.24 |
20 |
18 |
87 |
| 30% |
0.31 |
22 |
24 |
87 |
| 40% |
0.39 |
29 |
30 |
87 |
| 50% |
0.47 |
34 |
37 |
87 |
| 60% |
0.52 |
49 |
43 |
87 |
| 70% |
0.59 |
46 |
48 |
87 |
| 80% |
0.67 |
53 |
54 |
87 |
| 90% |
0.78 |
61 |
64 |
87 |
| 100% |
0.93 |
76 |
73 |
88 |
Looking at these values, it seems like the model is well-specified since the predictions and the actuals are similar.
4. Calibration plot
[--Comments #!cal1–]
3: Conduct the logistic regression version of the model proposed in Table 1.
2.3.1: Interpret the logistic regression coefficients in terms of log odds ratios.
2.3.2: Plot the predicted probability of labor force participation against woman’s age holding the
other variables at their mean
2.4: Create and examine a calibration plot and calibration table for the model proposed in Table 1
table()
2.4.1: Does there seem to be a model misspecification?
2.4.2: Why or why not?
2.5: Compare logistic regression and a generalized additive model with smoothing on age and education
for the model proposed in Table 1
gam()
2.5.1: Look at the partial residual plots from the GAM. Do any transformations seem necessary
for age or education? why?
2.6: Use Stukel’s test for the logistic regression
???
2.6.1: Use Stukel’s test for the GAM
!!!
2.6.2: Does the logit link function seem appropriate? Why or why not?
___?
2.7: Create a cross-validated ROC curve for the logistic regression:
predictions <- prediction(predict(logr_model, type = "response"), y)
Error in prediction(predict(logr_model, type = "response"), y) :
Number of predictions in each run must be equal to the number of labels for each run.
2.7.1: Create a cross-validated ROC curve with transformations and interactions indicated by previous work:
2.7.2: Which model perfoms better in terms of cross-validation? Which do I prefer?
2.8: Comment on statistical inference and causality stories:
2.8.1: Use simulations or confidence intervals
sims()
---
title: "R Notebook"
output: html_notebook
author: "Eric Silver"
---

#1. Conduct the probit regression

```{r}
library(AER)
library(xtable)
data("SwissLabor")
require(mgcv)
require(np)
require(arm)
require(ROCR)

SwissLabor$agesq <- (SwissLabor$age*10)^2 / 1000
SwissLabor$nyc <- SwissLabor$youngkids
SwissLabor$noc <- SwissLabor$oldkids
SwissLabor$nlinc <- SwissLabor$income

probit_model <- glm(participation ~ age + agesq + education + nyc + noc + nlinc + foreign, family = binomial(link = "probit"), data = SwissLabor)
summary(probit_model)
```


#2.1: Histograms
```{r}
hist(SwissLabor$age*10, xlab="Age in years", main="Distribution of Age within the Swiss Sample", col="lightblue")
```
If all Swiss citizens were evenly represented in this sample and that Swiss population
is either growing or constant, we would expect that there would be a slight drop-off. 
Since there is a build-up to age 30, we should suspect that younger citizens in this sample
are somehow different from those who are older.

```{r}
hist(SwissLabor$education, xlab="Years of formal education", main="Distribution of Years of Formal Education within the Swiss Sample", col="lightblue")
```

This sample looks roughly normally distributed with the average number of years of formal education equal 
to less than high school.

```{r}
hist(SwissLabor$nyc, xlab="Number of Young Children", main="Distribution of the number of young children within the Swiss Sample", col="lightblue", breaks=4)
```

Within the sample the vast majority of our sample has no young children.

```{r}
hist(SwissLabor$nlinc, xlab="Log of yearly non-labor income", main="Distribution of the log of yearly non-labor income", col="lightblue")
```

Non-labor income looks like it may be exponentially distributed with a mean of approximately ~$35.000. The
log tranformation strongly controls for this distribution such that the log of yearly non-labor income is
approximately normal.

```{r}
hist(SwissLabor$noc, xlab="Number of Older Children", main="Distribution of the number of older children within the Swiss Sample", col="lightblue", breaks=7)
```

While the median number of older children is zero, many more people within the sample have older children
than younger children.

```{r}
plot(as.factor(SwissLabor$participation), main="Distribution of labor force participation in the Swiss sample", col="lightblue")
```
When plotting labor force participation of the sample, we see that the majority of the sample does not 
participate in the labor force, but that both participating and non-participation are present in the 
sample.

```{r}
plot(as.factor(SwissLabor$foreign), main="Distribution of foreign residency in the Swiss sample", col="lightblue")
```

Roughly 75% of the sample is not a foreign resident.

# 2.2: Logit: log odds regression from table one, linear in log odds
Instructions: Write out the logit (log odds) regression equivalent of the model
proposed in Table I (i.e., write the regression model in terms of a
function that is linear in the log odds). 
```{r}
exp(cbind(OR = coef(probit_model), confint(probit_model)))
# xtabs(~participation + foreign, data = SwissLabor)
```
Thus, the log odds equivalent is to:
$Participation_{odds}=~42.4+7.9age+0.053age^2+1.02educ+-.49NYC+-.86NOC+-.51nlinc+2.04Foreign\\$
indicating that for a one unit increase in years of formal schooling ($education$), we expect the odds 
of participating in the labor force to increase by a factor of ~$1.02$.

Next, write the equivalent
model in terms of the Bernoulli likelihood function using the logit link
function. 

In compact form, the Bernoulli likelihood function is:
$Pr(Part_{i} = y_{i}) = \pi^{y_{i}}_{i}\times(1-\pi_{y_{i}})^{1-y_{i}}$
for $y_{i} = [0,1]$. Using the return values from R, this function is
equal to: $\pi_i = \frac{exp(Participation_{odds})}{1+exp(Participation_{odds})}$ with 
the odds previously defined.

The logit function (log unit) calculates the "log-odds" that the dependent 
variable, participation, will be true.  By first taking the odds, we remove the
range restriction from the dependent variable.  What this means is that we will 
no longer generate estimates of probability that are greater than one or less 
than zero. If we do not, and instead directly transformed the dependent variable, 
we would likely overpredict the certainty of the tails of the distribution.



#3. Conduct the logistic regression version of the model proposed in Table I.
Interpret the logistic regression coefficients in terms of odds ratios. Note:
Don’t worry about interpreting the age coefficients. Plot the predicted
probability of labor force participation against the woman’s age, holding
the other variables at their mean. (1 point)

```{r}
logr_model <- glm(participation ~ age + agesq + education + nyc + noc + nlinc + foreign, family = binomial, data=SwissLabor)

# get the raw fitted probabilities
probit_predict <- predict(probit_model, type = 'response')
logr_predict <- predict(logr_model, type = 'response')

## Calibration table
probs <- logr_predict
# Get the deciles of the fitted probabilities 
decile.cutpoints <- quantile(probs, probs = seq(0, 1, .1))

# Create a new variable that identifies 
# the quantile that each fitted probibility falls in 
decileID <- cut(probs, 
                breaks = decile.cutpoints, 
                labels = 1:10, 
                include.lowest = TRUE)

# Calculate the number that switched in each decile 
# You can see the counts by using table: 
tab <- table(decileID, participation)
# tab
# To turn the table into a data.frame, use as.data.frame.matrix
observed <- as.data.frame.matrix(tab)

# To calculate the expected values for each decile, 
# we need to sum the fitted probabilities for each decile
# We can do this by using tapply to compute 
# the sum of probs within each level of decileID: 
expected1 <- tapply(probs, decileID, FUN = sum)

# Create a summary calibration table
# Create cutpoints that represent the 9 intervals that exist for deciles 
interval.cutpoints <- round(quantile(probs, probs = seq(.1, 1, .1)), 2)

# Create a dataframe with these cutpoints: 
cal <- data.frame(interval.cutpoints)

# Add a column of observed 1's
cal$observed1 <- observed[, 2]

# Add a column of expected 1's
cal$expected1 <- round(expected1, 0)

# Add a column for the total # of observations in each decile
cal$total <- table(decileID)
################################################################################ 
```

```{r, results='asis'}
kable(cal)
#print(xtable(cal), type="html")
```
Looking at these values, it seems like the model is well-specified since the predictions and the 
actuals are similar.

#4. Calibration plot
```{r}

```

[\--Comments #!cal1--]

# 3: Conduct the logistic regression version of the model proposed in Table 1.  


# 2.3.1: Interpret the logistic regression coefficients in terms of log odds ratios. 

# 2.3.2: Plot the predicted probability of labor force participation against woman's age holding the
# other variables at their mean

# 2.4: Create and examine a calibration plot and calibration table for the model proposed in Table 1


# table()

# 2.4.1: Does there seem to be a model misspecification?

# 2.4.2: Why or why not?

# 2.5: Compare logistic regression and a generalized additive model with smoothing on age and education
# for the model proposed in Table 1

# gam()


# 2.5.1: Look at the partial residual plots from the GAM.  Do any transformations seem necessary 
# for age or education?  why?


# 2.6: Use Stukel's test for the logistic regression

???

# 2.6.1: Use Stukel's test for the GAM

!!!

# 2.6.2: Does the logit link function seem appropriate?  Why or why not?

___?

# 2.7: Create a cross-validated ROC curve for the logistic regression:
```{r}

################################################################################
## install.packages("ROCR", repos = "http://lib.stat.cmu.edu/R/CRAN/")
library(ROCR)

# Generate predictions from model
glm1 <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
logr_model <- glm(participation ~ age + agesq + education + nyc + noc + nlinc + foreign, family = binomial(link = "logit"), data=SwissLabor)
predictions <- prediction(predict(logr_model, type = "response"), y)
str(predictions)
cuts <- unlist(predictions@cutoffs)
cuts
fp <- unlist(predictions@fp)
fp
tp <- unlist(predictions@tp)
tp
tn <- unlist(predictions@tn)
tn
fn <- unlist(predictions@fn)
fn

options(scipen = 99, digits = 2)
preds <- data.frame("cuts" = cuts, "fp" = fp, "tp" = tp, "tn" = tn, "fn" = fn)
preds

# Generate true positive and false positive rates from predictions
perf <- performance(predictions, measure = "tpr", x.measure = "fpr")
str(perf)
cut <- unlist(perf@alpha.values)
fpr <- unlist(perf@x.values)
tpr <- unlist(perf@y.values)
cbind(cut, tpr, fpr)

# Plot ROC curve
par(cex = 1.3, mar = c(5, 4, 2, 1))
plot(perf, 
     colorize         = TRUE,
     color            = rainbow(10),
     main             = "Receiver Operating Characteristic",
     print.cutoffs.at = seq(.0, .9, by = 0.1),
     text.adj         = c(-.5, 1.2),
     xlab             = "False Positive Rate (1-Specificity) (FP/(FP+TN))",
     ylab             = "True Positive Rate (Sensitivity) (TP/(TP+FN))")
abline(0, 1)

## Cross-validation
# generate empty prediction and outcome vectors
predictions <- c()
labels <- c() 
dat <- data.frame(y, x1, x2)

# logistic regression CV function
logit.cv <- function(data = dat, k = 5){
  # select number of folds
  folds <- k <- 5
  # generate fold sequence
  fold.num <- rep(1:folds, length.out = length(y))
  # randomize fold sequence
  fold.samp <- sample(fold.num)
  for(i in 1:k){
    # Training data 
    train <- data[fold.samp != i, ]
    # Test data takes the remaining rows
    test <- data[fold.samp == i, ]
    # Run glm on training data
    glm <- glm(y ~ x1 + x2, data = train, 
           family = binomial(link = "logit"))
    # Make probability predictions for test data
    glmpred <- predict(glm, test, type = "response")
    # Add the predictions for this iteration to the data frame
    predictions <- c(predictions, glmpred)
    # Add the actual outcomes for this iteration to the data frame
    labels <- c(labels, test$y)
    }
  return(list(predictions, labels))
}

logit.cv()

cvdata <- replicate(2, logit.cv())
cvdata

preds <- sapply(cvdata[1, ], cbind)
preds

labs <- sapply(cvdata[2, ], cbind)
labs

# Run the ROCR prediction and performance measures
glmerr <- prediction(preds, labs)
glmerr
glmerr@cutoffs
glmerr@fp
glmerr@tp

glmperf <- performance(glmerr, measure="tpr", x.measure="fpr")
glmperf
glmperf@alpha.values
glmperf@x.values
glmperf@y.values

# This gives a vector of AUCs
glmauc <- performance(glmerr, measure = "auc")
# Unlist the AUCs
glmauc <- unlist(glmauc@y.values)
glmauc
# Take the average
glmauc <- mean(glmauc) 

# Cross-validated ROC curve
par(cex = 1.3, mar = c(5, 4, 2, 1))
plot(glmperf,
     colorize         = TRUE,
     color            = rainbow(10),
     main             = "Cross-Validated ROC Curve",
     avg              = 'threshold', 
     spread.estimate  = 'stddev',
     print.cutoffs.at = seq(.0, .9, by = 0.1),
     text.adj         = c(-.5, 1.2),
     xlab             = "Average False Positive Rate",
     ylab             = "Average True Positive Rate")
abline(0, 1)
################################################################################

```

# 2.7.1: Create a cross-validated ROC curve with transformations and interactions indicated by previous work:
# 2.7.2: Which model perfoms better in terms of cross-validation?  Which do I prefer?

# 2.8: Comment on statistical inference and causality stories:
# 2.8.1: Use simulations or confidence intervals
# sims()

2.8: Comment on statistical inference and causality stories: