This template runs the simplest kind of logistic regression: a model with one outcome and one predictor.
You use logistic regression when the thing you want to predict (the outcome) has exactly two categories.
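In our example dataset the outcome is
RealizationOfRecipient, which is either NP (the recipient is realized as a noun phrase) or PP (the recipient is realized as a prepositional phrase).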
A predictor is one thing you think might influence which of the two outcomes happens. The model measures how strongly your single predictor pushes the outcome toward one category or the other.
How to use this template
- Read the short text before each grey code block.
- Run the block: click the small green ▶ arrow at its top-right corner.
- Run the blocks in order, from top to bottom.
- Wherever you see STUDENT:, you may need to change something. Everything else can stay exactly as it is.
A package is a toolbox of extra functions someone else already wrote for you. You need to load your packages once each time you open RStudio.
The first time you use a package on a computer, it also has to be
installed. If R complains that a package is “not
found”, delete the # in front of the matching
install.packages(...) line, run that line once, then put
the # back.
# install.packages("tidyverse")
# install.packages("broom")
# install.packages("gtsummary")
# install.packages("ggeffects")
library(tidyverse) # data handling + plots
library(broom) # turns model output into a tidy table
library(gtsummary) # makes a clean, report-ready results table
library(ggeffects) # effect plots (predicted probabilities)
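If you are not sure which of these packages are already installed, a small check can tell you before library() complains. This is only a sketch in base R; the names pkgs, installed and not_installed are made up for this example and are not used anywhere else in the template.

```r
# The packages this template loads
pkgs <- c("tidyverse", "broom", "gtsummary", "ggeffects")

# requireNamespace() returns TRUE if a package is installed, FALSE if not
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Install these first: ", paste(not_installed, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Any package named in the message is one whose install.packages(...) line you need to run once.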
Put your .csv file in the same folder
as this .Rmd file. Then you only need to write the file
name — no long folder path.
STUDENT: Change the file name below to match your file.
data <- read_csv("Sem1_bresnan_et_al_2008_dative.csv")
# Show the first rows and the type of each column:
glimpse(data)
## Rows: 3,263
## Columns: 16
## $ Speaker <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Modality <chr> "written", "written", "written", "written", "wr…
## $ Verb <chr> "feed", "give", "give", "give", "offer", "give"…
## $ SemanticClass <chr> "t", "a", "a", "a", "c", "a", "t", "a", "a", "a…
## $ LengthOfRecipient <dbl> 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1,…
## $ AnimacyOfRec <chr> "animate", "animate", "animate", "animate", "an…
## $ DefinOfRec <chr> "definite", "definite", "definite", "definite",…
## $ PronomOfRec <chr> "pronominal", "nonpronominal", "nonpronominal",…
## $ LengthOfTheme <dbl> 14, 3, 13, 5, 3, 4, 4, 1, 11, 2, 3, 3, 5, 2, 8,…
## $ AnimacyOfTheme <chr> "inanimate", "inanimate", "inanimate", "inanima…
## $ DefinOfTheme <chr> "indefinite", "indefinite", "definite", "indefi…
## $ PronomOfTheme <chr> "nonpronominal", "nonpronominal", "nonpronomina…
## $ RealizationOfRecipient <chr> "NP", "NP", "NP", "NP", "NP", "NP", "NP", "NP",…
## $ AccessOfRec <chr> "given", "given", "given", "given", "given", "g…
## $ AccessOfTheme <chr> "new", "new", "new", "new", "new", "new", "new"…
## $ HeavinessNum <dbl> 13, 1, 12, 4, 1, 2, 2, 0, 10, 1, 1, 1, 4, 0, 7,…
glimpse() lists every column (variable) in your data.
Notice the little labels: <chr> means text (a
category) and <dbl> / <int> means
a number. You’ll use these names in the next step, so spell them
exactly as shown here.
The outcome is always RealizationOfRecipient. You only
pick one predictor.
STUDENT: Put the name of your chosen predictor between the quotation marks below. Use the exact spelling from
glimpse() above. Then give it a friendly label — this is just the text that will appear on your plot later.
my_predictor <- "LengthOfRecipient" # <- STUDENT: your predictor
my_label <- "Length of recipient (in words)" # <- STUDENT: a friendly label
# Keep only the outcome + your one predictor, drop rows with missing values,
# and make sure text columns are treated as categories (factors).
# (Leave this part as it is.)
data <- data %>%
  select(RealizationOfRecipient, all_of(my_predictor)) %>%
  drop_na() %>%
  mutate(across(where(is.character), as.factor))
Which outcome is the model predicting? Run the line below. R always predicts the probability of the second category in the list. The first category is the baseline (the “reference” it compares against).
levels(data$RealizationOfRecipient) # the 2nd name = the outcome being predicted
## [1] "NP" "PP"
In our example this prints "NP" "PP", so the model
predicts the probability of PP. If your outcome is
different, just remember: second name = predicted
outcome, and read the rest of this template with that in
mind.
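To see where the order of the two names comes from, here is a minimal sketch on a made-up vector (the object names outcome and outcome_flipped are hypothetical). By default R sorts factor levels alphabetically; relevel() moves a level of your choice to the front, which would change which category counts as the baseline.

```r
# Toy outcome, not the real data
outcome <- factor(c("NP", "PP", "NP", "PP", "NP"))
levels(outcome)          # "NP" "PP": NP is the baseline, PP is predicted

# Make PP the baseline instead, so the model would predict NP
outcome_flipped <- relevel(outcome, ref = "PP")
levels(outcome_flipped)  # "PP" "NP"
```

You do not need this for the template as written; it only shows what to look at if your own outcome comes out in an unexpected order.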
Before modelling, always look at your data. This builds your intuition for what the model should find.
First, how often does each outcome occur?
data %>% count(RealizationOfRecipient)
Now look at your predictor. It is either a category (text, like Modality) or a number (like LengthOfRecipient). The block below checks which one you have and prints the right summary automatically — so you can just run it. You don’t need to change anything.
if (is.numeric(data[[my_predictor]])) {
  # Predictor is a number: mean & SD within each outcome group
  data %>%
    group_by(RealizationOfRecipient) %>%
    summarise(
      Mean = round(mean(.data[[my_predictor]]), 2),
      SD = round(sd(.data[[my_predictor]]), 2),
      .groups = "drop"
    )
} else {
  # Predictor is a category: counts crossed with the outcome
  table(data[[my_predictor]], data$RealizationOfRecipient)
}
This single line builds the model. Read the formula as a sentence:
“predict RealizationOfRecipient from
my_predictor.”
# reformulate() just builds the formula RealizationOfRecipient ~ <your predictor>
# from the name you typed above. (Leave this as it is.)
model <- glm(reformulate(my_predictor, "RealizationOfRecipient"),
             data = data,
             family = binomial) # "binomial" tells R this is logistic regression
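If you are curious what reformulate() hands to glm(), you can run it on its own. The sketch below uses the example predictor name from this template; the object names example_predictor and f are made up for illustration.

```r
# Build the formula from two pieces of text: predictor(s) first, outcome second
example_predictor <- "LengthOfRecipient"
f <- reformulate(example_predictor, response = "RealizationOfRecipient")

f         # prints the formula
class(f)  # "formula"
```

The result is the same formula you would get by typing RealizationOfRecipient ~ LengthOfRecipient by hand, which is why changing my_predictor is enough to change the model.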
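Each row gives the predictor’s name, its coefficient B (in log-odds), the standard error SE, the test statistic z, and the p-value: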
broom::tidy(model) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
  rename(Predictor = term, B = estimate, SE = std.error, z = statistic, p = p.value)
The first row, (Intercept), is just the baseline — you
don’t interpret it. The row with your predictor’s name is the one you
care about.
Log-odds are awkward, so we convert B into an odds ratio (OR) by “exponentiating” it. The OR is the number you actually report and interpret:
broom::tidy(model, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
  rename(Predictor = term, OR = estimate, SE = std.error,
         z = statistic, p = p.value, CI_low = conf.low, CI_high = conf.high)
CI_low and CI_high are the lower and upper
ends of the 95% confidence interval for the OR. If that
interval does not include 1 (an OR of 1 means no effect), the effect is
significant at the .05 level.
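To make “exponentiating” less mysterious, the sketch below computes an odds ratio by hand and compares it with the model. It uses a small made-up dataset (the counts 60, 20, 30, 30 are invented, not taken from the real data), and the interval is the simple Wald interval exp(B ± 1.96 × SE), which is an assumption of this example: the interval in your table may be computed by a different method and can differ slightly.

```r
# Toy data: 80 spoken rows (60 NP, 20 PP) and 60 written rows (30 NP, 30 PP)
toy <- data.frame(
  outcome  = factor(rep(c("NP", "PP", "NP", "PP"), times = c(60, 20, 30, 30))),
  modality = factor(rep(c("spoken", "written"), times = c(80, 60)))
)

# Odds ratio by hand: odds of PP in each group, then their ratio
tab <- table(toy$modality, toy$outcome)
odds_spoken  <- tab["spoken", "PP"] / tab["spoken", "NP"]    # 20 / 60
odds_written <- tab["written", "PP"] / tab["written", "NP"]  # 30 / 30
or_by_hand   <- odds_written / odds_spoken                   # 3

# Odds ratio from the model: exp() of the coefficient B
fit <- glm(outcome ~ modality, data = toy, family = binomial)
b   <- coef(summary(fit))["modalitywritten", "Estimate"]
se  <- coef(summary(fit))["modalitywritten", "Std. Error"]
or_model <- exp(b)

# Wald 95% confidence interval on the odds-ratio scale
ci_wald <- exp(b + c(-1, 1) * qnorm(0.975) * se)

round(c(by_hand = or_by_hand, model = or_model), 2)
round(ci_wald, 2)
```

Both routes give the same odds ratio, and because the whole interval lies above 1, this toy effect would count as significant.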
This produces a tidy table with the odds ratio, its confidence interval, and the p-value — close to APA format and easy to paste into a report.
tbl_regression(model, exponentiate = TRUE)
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| LengthOfRecipient | 2.02 | 1.87, 2.18 | <0.001 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
This compares your model to an “empty” model that has no predictor at all. A significant result (p < .05) means the model with your predictor fits the data better than the empty model by more than chance alone would explain.
null_model <- glm(RealizationOfRecipient ~ 1, data = data, family = binomial)
anova(null_model, model, test = "Chisq")
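The χ² value in this output is the drop in deviance from the empty model to your model. The sketch below recomputes it by hand on simulated data; the variables x and y and the numbers used to simulate them are invented for illustration only.

```r
# Simulated data: a numeric predictor x and a 0/1 outcome y
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 0.8 * x))

sim_null <- glm(y ~ 1, family = binomial)  # empty model
sim_full <- glm(y ~ x, family = binomial)  # model with the predictor

# Test by hand: difference in deviance and in residual degrees of freedom
chi_sq <- deviance(sim_null) - deviance(sim_full)
df     <- df.residual(sim_null) - df.residual(sim_full)
p_val  <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# The same comparison with anova()
lrt <- anova(sim_null, sim_full, test = "Chisq")
lrt
c(chi_sq = chi_sq, df = df, p = p_val)
```

The hand-computed χ² and df should match the Deviance and Df columns in the second row of the anova() output.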
This plot turns the numbers into a picture: the predicted probability of the PP construction across the values of your predictor. The shaded band (for a number) or the error bars (for a category) show the 95% confidence interval.
eff <- ggpredict(model, terms = my_predictor)
plot(eff) + ggtitle(paste("Predicted probability of PP by", my_label))
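The probabilities in a plot like this come from feeding the model’s log-odds through the inverse logit function, which R calls plogis(). Here is a minimal sketch on simulated data showing that doing this by hand matches predict(); the data frame sim, the column len and the simulation settings are all made up for this example.

```r
# Simulated data: recipient length (1 to 8 words) and an NP/PP outcome
set.seed(42)
n   <- 300
len <- sample(1:8, n, replace = TRUE)
pp  <- rbinom(n, size = 1, prob = plogis(-2 + 0.5 * len))
sim <- data.frame(len = len, outcome = factor(ifelse(pp == 1, "PP", "NP")))

fit <- glm(outcome ~ len, data = sim, family = binomial)

# Predicted probability of PP for lengths 1 to 8, computed two ways
new_len    <- data.frame(len = 1:8)
log_odds   <- coef(fit)[["(Intercept)"]] + coef(fit)[["len"]] * new_len$len
by_hand    <- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
by_predict <- predict(fit, newdata = new_len, type = "response")

round(cbind(len = new_len$len, by_hand, by_predict), 3)
```

The two probability columns are identical, so the curve in the plot is simply the coefficient table drawn on the probability scale.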
Copy the sentence that fits your result, then replace each
[slot] with the value from your tables. Round B,
SE, z, and OR to 2 decimals;
report p exactly, or as p < .001 when it is very
small.
Where to find each value
- B, SE, z, p → the “Results table (raw coefficients)”.
- OR and its CI → the “Odds ratios” table.
- χ², df, p for the overall model → the “overall model test” output.
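If you would rather not format p-values by hand, a small helper can apply the rule described above. This is a sketch only: format_p is a hypothetical function written for this example, it is not part of any package, and it handles one p-value at a time.

```r
# Format a single p-value for a report: "p < .001" when very small,
# otherwise three decimals without the leading zero
format_p <- function(p) {
  if (p < .001) {
    "p < .001"
  } else {
    paste0("p = ", sub("^0", "", formatC(p, format = "f", digits = 3)))
  }
}

format_p(0.0004)  # "p < .001"
format_p(0.032)   # "p = .032"
```

The other values (B, SE, z, OR) only need round(value, 2).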
A logistic regression was conducted to predict the realization of the recipient as a prepositional phrase (PP) rather than a noun phrase (NP) from [predictor]. The model significantly improved on a null model with no predictors, χ²([df]) = [chi], p [= / < .001].
If your predictor is a category:
[Predictor] significantly predicted recipient realization, B = [B], SE = [SE], z = [z], p = [p]. The odds of a PP realization were [OR] times [higher / lower] for [level] than for [baseline level], OR = [OR], 95% CI [[CI_low], [CI_high]].
Worked example: Modality significantly predicted recipient realization, B = 0.71, SE = 0.18, z = 3.94, p < .001. The odds of a PP realization were 2.03 times higher for written than for spoken language, OR = 2.03, 95% CI [1.43, 2.89].
If your predictor is a number:
[Predictor] significantly predicted recipient realization, B = [B], SE = [SE], z = [z], p = [p]. Each additional [unit, e.g. “word”] [increased / decreased] the odds of a PP realization by a factor of [OR], OR = [OR], 95% CI [[CI_low], [CI_high]].
Worked example: Length of recipient significantly predicted recipient realization, B = 0.28, SE = 0.05, z = 5.60, p < .001. Each additional word increased the odds of a PP realization by a factor of 1.32, OR = 1.32, 95% CI [1.20, 1.46].
If your predictor is not significant:
[Predictor] did not significantly predict recipient realization, B = [B], SE = [SE], z = [z], p = [p], so there was no evidence that it influenced the choice of construction.
Reporting tip: B is the log-odds coefficient (some textbooks write it as β). Always report the odds ratio (OR) alongside it, because that is the number readers can actually interpret.