Activity06 - Logistic Regression
Load the necessary packages
We will use two packages from Posit: {tidyverse} and {tidymodels}, as well as any other necessary packages.
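The setup chunk is not echoed in this document; a minimal sketch of what it presumably looks like is below ({ggthemes} is an assumption, included only because theme_few() is used in later plots).

```r
# Minimal sketch of the (not echoed) setup chunk.
# {ggthemes} is assumed here only because theme_few() appears in later plots.
library(tidyverse)
library(tidymodels)
library(ggthemes)
```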
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
Load the data
The data
The data we are working with is from the OpenIntro site. From OpenIntro’s description of the data:
This experiment data comes from a study that sought to understand the influence of race and gender on job application callback rates.
The study monitored job postings in Boston and Chicago for several months during 2001 and 2002 and used this to build up a set of test cases.
Over this time period, the researchers randomly generated résumé attributes, such as years of experience and education details, to create realistic-looking résumés to send out to the job postings.
They then randomly assigned a name to the résumé that would communicate the applicant’s gender and race. The first names chosen for the study were selected so that the names would predominantly be recognized as belonging to black or white individuals.
For example, Lakisha was a name that their survey indicated would be interpreted as a black woman, while Greg was a name that would generally be interpreted to be associated with a white male.
On an initial reading, there are some concerns with how this study is designed. In the article, the authors do point out some of these concerns (in Sections 3.5 and 5.1).
We will now read in the resume CSV file that we got from the OpenIntro site.
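The import chunk is not echoed either; a minimal sketch, assuming the CSV was downloaded from OpenIntro and saved locally as data/resume.csv (the file path is a placeholder), is:

```r
# Sketch of the import step; the path "data/resume.csv" is a placeholder for
# wherever the OpenIntro CSV was saved.
resume <- readr::read_csv("data/resume.csv")
```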
## Rows: 4870 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): job_city, job_industry, job_type, job_ownership, job_req_min_exper...
## dbl (20): job_ad_id, job_fed_contractor, job_equal_opp_employer, job_req_any...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s explore the dataset
We can use dplyr::glimpse() to see some meta information about the R data frame.
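For reference, the output below comes from a call like:

```r
# Column-by-column overview of the resume data frame
dplyr::glimpse(resume)
```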
## Rows: 4,870
## Columns: 30
## $ job_ad_id <dbl> 384, 384, 384, 384, 385, 386, 386, 385, 386, 38…
## $ job_city <chr> "Chicago", "Chicago", "Chicago", "Chicago", "Ch…
## $ job_industry <chr> "manufacturing", "manufacturing", "manufacturin…
## $ job_type <chr> "supervisor", "supervisor", "supervisor", "supe…
## $ job_fed_contractor <dbl> NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, N…
## $ job_equal_opp_employer <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ job_ownership <chr> "unknown", "unknown", "unknown", "unknown", "no…
## $ job_req_any <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,…
## $ job_req_communication <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ job_req_education <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_req_min_experience <chr> "5", "5", "5", "5", "some", NA, NA, "some", NA,…
## $ job_req_computer <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,…
## $ job_req_organization <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ job_req_school <chr> "none_listed", "none_listed", "none_listed", "n…
## $ received_callback <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ firstname <chr> "Allison", "Kristen", "Lakisha", "Latonya", "Ca…
## $ race <chr> "white", "white", "black", "black", "white", "w…
## $ gender <chr> "f", "f", "f", "f", "f", "m", "f", "f", "f", "m…
## $ years_college <dbl> 4, 3, 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 1,…
## $ college_degree <dbl> 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ honors <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ worked_during_school <dbl> 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,…
## $ years_experience <dbl> 6, 6, 6, 6, 22, 6, 5, 21, 3, 6, 8, 8, 4, 4, 5, …
## $ computer_skills <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ special_skills <dbl> 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,…
## $ volunteer <dbl> 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0,…
## $ military <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ employment_holes <dbl> 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,…
## $ has_email_address <dbl> 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0,…
## $ resume_quality <chr> "low", "high", "low", "high", "high", "low", "h…
After doing this and a review of the data’s description page, we will answer the following questions:
- Is this an observational study or an experiment? Explain.
This is an experiment: the researchers manipulate some of the variables rather than simply observing them. Names that signal race and gender are randomly assigned to résumés, while other variables, such as years of experience, are controlled. Because the study involves random assignment and controlled manipulation of variables, the researchers can draw causal conclusions about the impact of race and gender on callback rates.
- The variable of interest is received_callback. What type of variable is this? What do the values represent?
received_callback is a binary categorical variable. It can take on one of two possible values, indicating whether a callback was received (1) or not (0).
- Let’s create an appropriate data visualization for received_callback using {ggplot2}.
# Bar plot for received_callback
ggplot(resume, aes(x = factor(received_callback))) +
geom_bar(fill = "pink", color = "black") +
scale_x_discrete(labels = c("0" = "No Callback", "1" = "Callback")) +
labs(title = "Distribution of Callbacks for Job Applications",
x = "Callback Status",
y = "Count of Resumes",
fill = "Callback Status") +
theme_few()
- Below is a numerical summary table that reiterates (i.e., provides numerical values for) the plot in (3).
resume %>%
mutate(received_callback = case_when(
received_callback == 0 ~ "No",
received_callback == 1 ~ "Yes"
)) %>%
count(received_callback) %>%
mutate(percent = round(n / sum(n) * 100, 2)) %>%
knitr::kable()
received_callback | n | percent |
---|---|---|
No | 4478 | 91.95 |
Yes | 392 | 8.05 |
- Using the output from (3) and (4), we notice that:
The percentage of résumés that received a callback is very small (about 8%). Both the plot and the table highlight this disparity, showing that only a small fraction of résumés received a callback.
Probability and odds
Let’s answer the following questions using our output from (3) and (4):
- What is the probability that a randomly selected résumé/person will be called back?
P(callback) = number of résumés that received a callback / total number of résumés = 392/4870 ≈ 0.0805, or about 8%
- What are the odds that a randomly selected résumé/person will be called back?
The odds of an event occurring are calculated as the ratio of the probability of the event happening to the probability of the event not happening. Using the given data:
Probability of receiving a callback, p(callback), is 0.0805
Probability of not receiving a callback, p(no callback), is 1 - 0.0805 = 0.9195.
Odds(callback) = p(callback)/p(no callback) = 0.0805/0.9195 = 0.0875
This can also be expressed as approximately 1 to 11.43, meaning for every 1 résumé that receives a callback, about 11.43 résumés do not receive a callback.
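As a quick check, both quantities can be computed directly in R; a small sketch using the callback count from (4) and the total number of résumés:

```r
# Probability and odds of a callback: 392 callbacks out of 4870 résumés
p_callback <- 392 / 4870                        # callbacks / total résumés
odds_callback <- p_callback / (1 - p_callback)  # p / (1 - p)
c(probability = p_callback, odds = odds_callback)  # approx. 0.0805 and 0.0875
```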
Logistic regression
Logistic regression is one form of a generalized linear
model.
For this type of model, the outcome/response variable takes one of two
levels (sometimes called a binary variable or a two-level categorical
variable).
In this activity, \(Y_i\) takes the value 1 if a résumé receives a callback and 0 if it does not.
Generally, we will let the probability of a “success” (a 1) be \(p_i\) and the probability of a “failure” (a
0) be \(1 - p_i\).
Therefore, the odds of a “success” are:
\[ \frac{Pr(Y_i = 1)}{Pr(Y_i = 0)} = \frac{p_i}{1-p_i} \]
We use the logit function (or log odds) to model binary outcome variables:
\[ \begin{equation*} \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X \end{equation*} \]
To keep things simpler, we will first explore a logistic regression model with a two-level categorical explanatory variable: race - the inferred race associated with the first name on the résumé.
Below is a two-way table (also known as a contingency table or crosstable), where the rows are the response variable levels, the columns are the explanatory variable levels, and the cells show the percentage (with the count in parentheses). Note that the values in each column add to 100%.
resume %>%
mutate(received_callback = case_when(
received_callback == 0 ~ "No",
received_callback == 1 ~ "Yes"
),
race = case_when(
race == "black" ~ "Black",
race == "white" ~ "White"
)) %>%
group_by(race, received_callback) %>%
summarise(n = n()) %>%
mutate(percent = round(n / sum(n) * 100, 2),
percent_n = glue::glue("{percent} ({n})")) %>%
select(received_callback, race, percent_n) %>%
pivot_wider(
names_from = race,
values_from = percent_n
) %>%
knitr::kable()
## `summarise()` has grouped output by 'race'. You can override using the
## `.groups` argument.
received_callback | Black | White |
---|---|---|
No | 93.55 (2278) | 90.35 (2200) |
Yes | 6.45 (157) | 9.65 (235) |
Let’s answer the following question using the above table:
- What is the probability that a randomly selected résumé/person perceived as Black will be called back?
Probability of callback for Black applicants: P(callback | Black) = 6.45% = 0.0645
- What are the odds that a randomly selected résumé/person perceived as Black will be called back?
To find the odds that a randomly selected résumé/person perceived as Black will be called back, we use the odds formula:
Odds(Black) = P(callback | Black) / (1 − P(callback | Black)) = 0.0645 / (1 − 0.0645) = 0.0645/0.9355 ≈ 0.0689
This process of calculating conditional (e.g., if a résumé/person perceived as Black is called back) odds will be helpful as we fit our logistic model.
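These conditional probabilities and odds can also be verified directly from the data; a small sketch (received_callback is still a numeric 0/1 variable at this point):

```r
# Conditional probability and odds of a callback, by inferred race
resume %>%
  group_by(race) %>%
  summarise(p_callback = mean(received_callback == 1)) %>%
  mutate(odds_callback = p_callback / (1 - p_callback))
```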
We will now begin to use the {tidymodels} method for fitting models.
# The {tidymodels} method for logistic regression requires that the response be a factor variable
resume <- resume %>%
mutate(received_callback = as.factor(received_callback))
logistic_spec <- logistic_reg() %>%
set_engine("glm")
logistic_spec
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
resume_mod <- logistic_spec %>%
fit(received_callback ~ race, data = resume, family = "binomial")
tidy(resume_mod) %>%
knitr::kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -2.675 | 0.083 | -32.417 | 0 |
racewhite | 0.438 | 0.107 | 4.083 | 0 |
After doing this, we will respond to the following questions:
- Write the estimated regression equation, rounded to 3 digits.
racewhite is an indicator variable that is 1 if the race is “White” and 0 if the race is “Black” (the reference category).
\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.675 + 0.438 \times \texttt{racewhite} \]
- Let’s use our equation in (8) to write the simplified estimated regression equation corresponding to résumés/persons perceived as Black, rounded to 3 digits.
For résumés/persons perceived as Black, the indicator variable racewhite is 0. Substituting racewhite = 0 into the estimated regression equation:
\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.675 + 0.438 \times 0 = -2.675 \]
Based on this model, if a randomly selected résumé/person perceived as Black:
- The log-odds that they will be called back are −2.675 + 0.438 × 0.
The log-odds that a randomly selected résumé/person perceived as Black will be called back are −2.675.
- What are the odds that they will be called back?
How does this relate back to our answer from (7)?
Hint: In (9) we obtained the log-odds (i.e., the natural log-odds).
How can we back-transform this value to obtain the odds?
To back-transform the log-odds to obtain the odds, we exponentiate the log-odds value.
Odds = e^(log-odds) = e^(−2.675) ≈ 0.069
In question 7, we calculated the odds of a randomly selected résumé/person perceived as Black being called back using the observed frequencies. The odds calculated from the logistic regression model (0.069) match the odds calculated directly from the observed data (0.0689). This consistency confirms that our model is accurately reflecting the observed data.
- What is the probability that they will be called back? How does this relate back to our answer from (6)?
Hint: We use the odds in (11) to calculate this value.
To find the probability from the odds, we use the relationship between odds and probability:
Probability = odds / (1 + odds) = 0.069 / (1 + 0.069) ≈ 0.0645
In question 6, we calculated the probability of a résumé/person perceived as Black being called back directly from the observed data, which was 0.0645 (or 6.45%).
The probability calculated from the odds derived from the logistic regression model (0.0645) matches the observed probability (0.0645). This consistency indicates that our logistic regression model accurately captures the relationship observed in the data.
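These back-transformations are easy to verify in R; a small sketch using the estimated intercept:

```r
# Back-transforming the intercept (the log-odds for résumés perceived as Black)
b0 <- -2.675
exp(b0)                  # odds, approximately 0.069
exp(b0) / (1 + exp(b0))  # probability, approximately 0.0645 (same as plogis(b0))
```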
- How does the output from the following code relate to what we obtained before in (8)? How can we use it to help answer (12)?
Before (8): Estimated regression equation
The estimated regression equation was: −2.675 + 0.438 × racewhite, where racewhite = 1 if the résumé is perceived as White, and 0 if perceived as Black.
Exponentiated output:
The exponentiated intercept and coefficient give the odds and the odds ratio: Intercept = e^(−2.675) ≈ 0.069 and racewhite = e^(0.438) ≈ 1.550.
These exponentiated values translate the log-odds to odds, making the interpretation more straightforward.
To answer Question (12), we needed to calculate the probability of a résumé/person perceived as Black receiving a callback using the odds obtained from the exponentiated intercept.
Let’s now create a logistic regression model
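The code that produces the table below is not echoed; it was presumably something like the following sketch, which exponentiates the coefficients and adds confidence intervals (the exact call is an assumption).

```r
# Sketch of code that would produce the exponentiated-coefficient table below
tidy(resume_mod, exponentiate = TRUE, conf.int = TRUE) %>%
  knitr::kable(digits = 3)
```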
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.069 | 0.083 | -32.417 | 0 | 0.058 | 0.081 |
racewhite | 1.550 | 0.107 | 4.083 | 0 | 1.257 | 1.915 |
Challenge: Extending to Multiple Logistic Regression
We will explore the following question:
Is there a difference in callback rates for Chicago jobs, after adjusting for an applicant’s years of experience, race, and gender?
Specifically, we will fit the following model, where \(\hat{p}\) is the estimated probability of
receiving a callback for a job in Chicago.
\[ \begin{equation*} \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = \hat\beta_0 + \hat\beta_1 \times (\texttt{years\_experience}) + \hat\beta_2 \times (\texttt{race:White}) + \hat\beta_3 \times (\texttt{gender:male}) \end{equation*} \]
Note that the researchers have the variable labeled
gender
with the categorizations: “male” and “female”.
resume_subset <- resume %>%
filter(job_city == "Chicago") %>%
mutate(race = case_when(
race == "white" ~ "White",
TRUE ~ "Black"
),
gender = case_when(
gender == "f" ~ "female",
TRUE ~ "male"
)) %>%
select(received_callback, years_experience, race, gender)
Let’s describe what the above code does in the context of this problem.
- filter(job_city == "Chicago"): This filters the resume dataset to include only the observations where the job city is Chicago.
- mutate(race = case_when(race == "white" ~ "White", TRUE ~ "Black")): This modifies the race variable so that it only has two levels: “White” and “Black”. If the race is “white”, it is recoded to “White”; otherwise, it is recoded to “Black”.
- mutate(gender = case_when(gender == "f" ~ "female", TRUE ~ "male")): This modifies the gender variable so that it only has two levels: “female” and “male”. If the gender is “f”, it is recoded to “female”; otherwise, it is recoded to “male”.
- select(received_callback, years_experience, race, gender): This selects only the columns received_callback, years_experience, race, and gender from the filtered and mutated dataset.
In the context of exploring whether there is a difference in callback rates for Chicago jobs, after adjusting for an applicant’s years of experience, race, and gender, this code prepares a subset of the original dataset. This subset contains only the necessary variables and only includes observations relevant to Chicago.
Relationship Exploration
There are many variables in this model. Let’s explore each explanatory variable’s relationship with the response variable.
To explore the relationship between the received_callback response variable and each of the explanatory variables (years_experience, race, and gender) in the resume_subset dataset, we can create individual visualizations using {ggplot2}.
# Convert received_callback to a factor for plotting
resume_subset$received_callback <- as.factor(resume_subset$received_callback)
# Combined plot using facet_grid
resume_subset_plot <- ggplot(resume_subset, aes(x = years_experience, fill = received_callback)) +
geom_histogram(binwidth = 1, position = "dodge") +
facet_grid(race ~ gender) +
labs(
title = "Years of Experience vs. Callback Received by Race and Gender",
x = "Years of Experience",
y = "Count",
fill = "Received Callback"
) +
theme_few() +
theme(legend.position = "bottom")
resume_subset_plot
After doing this, we will answer the following question:
- Describe any patterns. What do you notice?
This plot is hard to read: callbacks are rare in every panel, so the callback bars are dwarfed by the no-callback bars, and raw counts make it difficult to compare callback rates across years of experience, race, and gender. Plotting callback proportions instead of counts would be more informative (see the sketch below).
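One possible improvement, sketched below (this plot is not part of the original activity), is to display the proportion of résumés receiving a callback rather than raw counts:

```r
# Proportion of résumés receiving a callback by years of experience,
# faceted by inferred race and gender (an alternative to the count histogram)
resume_subset %>%
  group_by(race, gender, years_experience) %>%
  summarise(prop_callback = mean(received_callback == "1"), .groups = "drop") %>%
  ggplot(aes(x = years_experience, y = prop_callback)) +
  geom_col() +
  facet_grid(race ~ gender) +
  labs(
    x = "Years of Experience",
    y = "Proportion receiving a callback"
  )
```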
Fitting the model
Using the logistic model code above, let’s fit the model to address
our research question.
Focusing on the estimated coefficient for years_experience, we would say:
# Fit the logistic regression model
logistic_model <- glm(
received_callback ~ years_experience + race + gender,
data = resume_subset,
family = "binomial"
)
# Display the model summary
summary(logistic_model)
##
## Call:
## glm(formula = received_callback ~ years_experience + race + gender,
## family = "binomial", data = resume_subset)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.27861 0.18132 -18.082 < 2e-16 ***
## years_experience 0.04487 0.01631 2.751 0.00594 **
## raceWhite 0.42645 0.15679 2.720 0.00653 **
## gendermale 0.58011 0.20306 2.857 0.00428 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1333.7 on 2703 degrees of freedom
## Residual deviance: 1313.5 on 2700 degrees of freedom
## AIC: 1321.5
##
## Number of Fisher Scoring iterations: 5
For each additional year of experience for an applicant in Chicago, we expect the log odds of the applicant receiving a callback to increase by 0.045 units, holding the applicant’s inferred race and inferred gender constant.
This interpretation is somewhat confusing because we are describing
this in log odds.
Fortunately, we can convert these back to odds using the following
transformation:
\[ \text{odds} = e^{\log(\text{odds})} \]
Odds = e^0.045 ≈ 1.046
We saw how to do this in (13).
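In code, this is just exponentiating the fitted coefficient; a small sketch:

```r
# Convert the years_experience coefficient from log-odds to an odds ratio
exp(coef(logistic_model)["years_experience"])  # approximately 1.046
```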
After doing this, we will answer the following question:
- Interpret the estimated coefficient for years_experience.
For each additional year of experience, the odds of receiving a callback are multiplied by approximately 1.046. In other words, the odds of an applicant receiving a callback are about 4.6% higher per additional year of experience, holding inferred race and gender constant.
Assessing model fit
Now we want to check the residuals of this model to check the model’s
fit.
As we saw for multiple linear regression, there are various kinds of
residuals that try to adjust for various features of the data. Two new
residuals to explore are Pearson residuals and Deviance
residuals.
Pearson residuals
The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation.
\[ \text{Pearson}_i = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1 - \hat{p}_i)}} \]
Deviance residuals
Deviance residuals are popular because the sum of squares of these residuals is the deviance statistic.
\[ d_i = \text{sign}(y_i - \hat{p}_i)\sqrt{2\Big[y_i\log\Big(\frac{y_i}{\hat{p}_i}\Big) + (1 - y_i)\log\Big(\frac{1 - y_i}{1 - \hat{p}_i}\Big)\Big]} \]
Since Pearson residuals are similar to residuals that we have already explored, we will instead focus on the deviance residuals.
# To store residuals and create row number variable
mult_log_aug <- augment(logistic_model, type.predict = "response",
type.residuals = "deviance") %>%
mutate(id = row_number())
# Plot residuals vs fitted values
ggplot(data = mult_log_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Fitted values",
y = "Deviance residuals",
title = "Deviance residuals vs. fitted")
# Plot residuals vs row number
ggplot(data = mult_log_aug, aes(x = id, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "id",
y = "Deviance residuals",
title = "Deviance residuals vs. id")
This plot shows the relationship between the deviance residuals and the fitted values from the logistic regression model.
The presence of two distinct lines is a common occurrence in logistic regression, reflecting the binary nature of the response variable (0 or 1).
The red horizontal line at y = 0 indicates the expected residual value of 0 if the model perfectly fit the data.
Here we produced two residual plots: the deviance residuals against the fitted values and the deviance residuals against the index id (an index plot).
The index plot allows us to easily see some of the more extreme observations - there are a lot (\(|d_i| > 2\) is quite alarming).
The residual plot may look odd (why are there two distinct lines?!?), but this is a pretty typical shape when working with a binary response variable (the original data are really either a 0 or a 1).
In general, because there are so many extreme values in the index plot, this model leaves room for improvement.