1 What this template does

This template runs the simplest kind of logistic regression: a model with one outcome and one predictor.

You use logistic regression when the thing you want to predict (the outcome) has exactly two categories.

In our example dataset the outcome is RealizationOfRecipient, which is either:

  • NP = double-object construction (e.g. give the child a book), or
  • PP = prepositional dative (e.g. give a book to the child).

A predictor is one thing you think might influence which of the two outcomes happens. The model measures how strongly your single predictor pushes the outcome toward one category or the other.

How to use this template

  1. Read the short text before each grey code block.
  2. Run the block: click the small green ▶ arrow at its top-right corner.
  3. Run the blocks in order, from top to bottom.
  4. Wherever you see STUDENT:, you may need to change something. Everything else can stay exactly as it is.

2 Setup: load the packages

A package is a toolbox of extra functions that someone else has already written for you. You need to load each package once every time you open RStudio.

The first time you use a package on a computer, it also has to be installed. If R complains that “there is no package called ...”, delete the # in front of the matching install.packages(...) line, run that line once, then put the # back.

# install.packages("tidyverse")
# install.packages("broom")
# install.packages("gtsummary")
# install.packages("ggeffects")

library(tidyverse)  # data handling + plots
library(broom)      # turns model output into a tidy table
library(gtsummary)  # makes a clean, report-ready results table
library(ggeffects)  # effect plots (predicted probabilities)
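
If you are not sure which packages still need installing, a quick check along these lines may help. This is only a sketch: the names needed, installed and missing_pkgs are made up for this example, and the list of packages is copied from the library() calls above.

```r
# Packages this template loads (copied from the library() calls above).
needed <- c("tidyverse", "broom", "gtsummary", "ggeffects")

# requireNamespace() returns TRUE if a package is installed,
# without attaching it to your session.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still have to be installed (empty if none).
missing_pkgs <- needed[!installed]
missing_pkgs
```

Any name that is printed here is a package whose install.packages(...) line you need to run once.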

3 Load your data

Put your .csv file in the same folder as this .Rmd file. Then you only need to write the file name — no long folder path.

STUDENT: Change the file name below to match your file.

data <- read_csv("Sem1_bresnan_et_al_2008_dative.csv")

# Show the first rows and the type of each column:
glimpse(data)
## Rows: 3,263
## Columns: 16
## $ Speaker                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Modality               <chr> "written", "written", "written", "written", "wr…
## $ Verb                   <chr> "feed", "give", "give", "give", "offer", "give"…
## $ SemanticClass          <chr> "t", "a", "a", "a", "c", "a", "t", "a", "a", "a…
## $ LengthOfRecipient      <dbl> 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1,…
## $ AnimacyOfRec           <chr> "animate", "animate", "animate", "animate", "an…
## $ DefinOfRec             <chr> "definite", "definite", "definite", "definite",…
## $ PronomOfRec            <chr> "pronominal", "nonpronominal", "nonpronominal",…
## $ LengthOfTheme          <dbl> 14, 3, 13, 5, 3, 4, 4, 1, 11, 2, 3, 3, 5, 2, 8,…
## $ AnimacyOfTheme         <chr> "inanimate", "inanimate", "inanimate", "inanima…
## $ DefinOfTheme           <chr> "indefinite", "indefinite", "definite", "indefi…
## $ PronomOfTheme          <chr> "nonpronominal", "nonpronominal", "nonpronomina…
## $ RealizationOfRecipient <chr> "NP", "NP", "NP", "NP", "NP", "NP", "NP", "NP",…
## $ AccessOfRec            <chr> "given", "given", "given", "given", "given", "g…
## $ AccessOfTheme          <chr> "new", "new", "new", "new", "new", "new", "new"…
## $ HeavinessNum           <dbl> 13, 1, 12, 4, 1, 2, 2, 0, 10, 1, 1, 1, 4, 0, 7,…

glimpse() lists every column (variable) in your data. Notice the little labels: <chr> means text (a category) and <dbl> / <int> means a number. You’ll use these names in the next step, so spell them exactly as shown here.


4 Choose your one predictor

The outcome is always RealizationOfRecipient. You only pick one predictor.

STUDENT: Put the name of your chosen predictor between the quotation marks below. Use the exact spelling from glimpse() above. Then give it a friendly label — this is just the text that will appear on your plot later.

my_predictor <- "LengthOfRecipient"                          # <- STUDENT: your predictor
my_label     <- "Length of recipient (in words)"               # <- STUDENT: a friendly label

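# Keep only the outcome + your one predictor, drop rows with missing values,
# and make sure text columns are treated as categories (factors).
# (Leave this part as it is. It overwrites `data`, so if you change your
# predictor later, re-run the template from "3 Load your data" onwards.)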
data <- data %>%
  select(RealizationOfRecipient, all_of(my_predictor)) %>%
  drop_na() %>%
  mutate(across(where(is.character), as.factor))

Which outcome is the model predicting? Run the line below. For a two-category outcome, glm() predicts the probability of the second category in the list. The first category is the baseline (the “reference” it compares against).

levels(data$RealizationOfRecipient)   # the 2nd name = the outcome being predicted
## [1] "NP" "PP"

In our example this prints "NP" "PP", so the model predicts the probability of PP. If your outcome is different, just remember: second name = predicted outcome, and read the rest of this template with that in mind.
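
If you would rather predict the other category, one common option is to change the order of the levels with relevel(). The sketch below uses a small made-up vector (toy_outcome and toy_flipped are illustration names, not part of the template) so that it does not touch your real data.

```r
# Toy outcome column (made up for illustration, not the real dataset).
toy_outcome <- factor(c("NP", "PP", "NP", "PP", "NP"))
levels(toy_outcome)   # first level = baseline, second = predicted outcome

# relevel() moves the chosen category to the front, making it the baseline.
# The values themselves do not change, only the order of the levels.
toy_flipped <- relevel(toy_outcome, ref = "PP")
levels(toy_flipped)   # PP is now the baseline, so NP would be predicted
```

If you did this to the outcome column of your own data, every place where this template says PP would then refer to NP instead.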


5 Look at the data first (descriptive statistics)

Before modelling, always look at your data. This builds your intuition for what the model should find.

First, how often does each outcome occur?

data %>% count(RealizationOfRecipient)

Now look at your predictor. It is either a category (text, like Modality) or a number (like LengthOfRecipient). The block below checks which one you have and prints the right summary automatically — so you can just run it. You don’t need to change anything.

  • If your predictor is a category, you get a small cross-table: how the outcome splits within each category.
  • If your predictor is a number, you get its average and spread within each outcome group.

if (is.numeric(data[[my_predictor]])) {
  # Predictor is a number: mean & SD within each outcome group
  data %>%
    group_by(RealizationOfRecipient) %>%
    summarise(
      Mean = round(mean(.data[[my_predictor]]), 2),
      SD   = round(sd(.data[[my_predictor]]), 2),
      .groups = "drop"
    )
} else {
  # Predictor is a category: counts crossed with the outcome
  table(data[[my_predictor]], data$RealizationOfRecipient)
}
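
Raw counts in a cross-table can be hard to compare when the categories differ in size, so it may help to look at proportions as well. The following is a minimal sketch on made-up data (the data frame toy and its six rows are invented for illustration); prop.table() with margin = 1 turns each row into proportions that sum to 1.

```r
# Toy data (made up for illustration): six sentences with a category and an outcome.
toy <- data.frame(
  Modality = c("spoken", "spoken", "spoken", "written", "written", "written"),
  Outcome  = c("NP", "NP", "PP", "NP", "PP", "PP")
)

# Counts: one row per category, one column per outcome.
toy_counts <- table(toy$Modality, toy$Outcome)

# margin = 1 divides each count by its row total, so every row sums to 1.
toy_props <- round(prop.table(toy_counts, margin = 1), 2)
toy_props
```

In this toy table, two of the three spoken sentences are NP, so the spoken row shows 0.67 for NP and 0.33 for PP.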

6 Fit the logistic regression model

This single command builds the model. Read the formula as a sentence: “predict RealizationOfRecipient from my_predictor.”

# reformulate() just builds the formula  RealizationOfRecipient ~ <your predictor>
# from the name you typed above. (Leave this as it is.)
model <- glm(reformulate(my_predictor, "RealizationOfRecipient"),
             data   = data,
             family = binomial)   # "binomial" tells R this is logistic regression
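
If you are curious what reformulate() actually hands to glm(), you can run it on its own. This small sketch uses the illustration names toy_predictor and toy_formula so that it does not overwrite anything in the template.

```r
# The predictor name as text, exactly as you would type it in step 4.
toy_predictor <- "LengthOfRecipient"

# reformulate() glues the outcome and the predictor into a formula object.
toy_formula <- reformulate(toy_predictor, response = "RealizationOfRecipient")

toy_formula          # prints the formula: outcome ~ predictor
class(toy_formula)   # a "formula", the same thing you get by typing y ~ x
```

Typing RealizationOfRecipient ~ LengthOfRecipient by hand would give the same formula; the template uses reformulate() only so that you have to type the predictor name once.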

6.1 Results table (raw coefficients)

Each row gives:

  • B — the coefficient on the log-odds scale (hard to read directly — see the odds ratios below),
  • SE — standard error (how uncertain B is),
  • z — the test statistic,
  • p — significance. A predictor is usually called “significant” when p < .05.

broom::tidy(model) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
  rename(Predictor = term, B = estimate, SE = std.error, z = statistic, p = p.value)

The first row, (Intercept), is just the baseline — you don’t interpret it. The row with your predictor’s name is the one you care about.
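
To get a feel for the log-odds scale that B lives on, it can help to convert a few log-odds values into probabilities. The sketch below uses base R's plogis() (log-odds to probability) and qlogis() (probability back to log-odds); the three input values are arbitrary examples, not results from the model.

```r
# Three arbitrary log-odds values: negative, zero, positive.
toy_log_odds <- c(-2, 0, 2)

# plogis() applies the logistic function 1 / (1 + exp(-x)).
toy_prob <- plogis(toy_log_odds)
round(toy_prob, 2)

# qlogis() goes the other way: log(p / (1 - p)).
qlogis(toy_prob)
```

A log-odds of 0 corresponds to a probability of exactly 0.5, negative values to probabilities below 0.5, and positive values to probabilities above 0.5.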

6.2 Odds ratios (the easy-to-read version)

Log-odds are awkward, so we convert B into an odds ratio (OR) by “exponentiating” it. The OR is the number you actually report and interpret:

  • OR = 1 → no effect.
  • OR > 1 → the predictor increases the odds of the predicted outcome (PP).
  • OR < 1 → the predictor decreases the odds of the predicted outcome (PP).
  • Quick percentage: (OR − 1) × 100% is the change in odds. Example: OR = 1.50 → +50% odds; OR = 0.80 → −20% odds.

# Note: only the estimate and its confidence interval are exponentiated;
# SE and z stay on the log-odds scale.
broom::tidy(model, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
  rename(Predictor = term, OR = estimate, SE = std.error,
         z = statistic, p = p.value, CI_low = conf.low, CI_high = conf.high)

CI_low and CI_high are the lower and upper ends of the 95% confidence interval for the OR. If that interval does not include 1, the effect is significant at the .05 level.
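
The conversion from B to an odds ratio can also be done by hand, which may make the table less mysterious. The sketch below takes B = 0.28 and SE = 0.05 from the worked example for a numeric predictor in the write-up section of this template. The interval here is a simple approximation (B ± 1.96 × SE, then exponentiated), which is an assumption of this sketch; the interval that tidy() prints may differ slightly because it can be computed in a different way.

```r
# Values taken from the template's worked example for a numeric predictor.
toy_B  <- 0.28
toy_SE <- 0.05

# Odds ratio: exponentiate the log-odds coefficient.
toy_OR <- exp(toy_B)

# Approximate 95% interval: build it on the log-odds scale, then exponentiate.
toy_CI <- exp(toy_B + c(-1, 1) * qnorm(0.975) * toy_SE)

# Quick percentage change in odds.
toy_pct <- (toy_OR - 1) * 100

round(c(OR = toy_OR, CI_low = toy_CI[1], CI_high = toy_CI[2], pct = toy_pct), 2)
```

Rounded to two decimals this gives OR = 1.32 with an interval from 1.20 to 1.46, that is, roughly 32% higher odds per additional unit.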

6.3 A clean, report-ready table

This produces a tidy table with the odds ratio, its confidence interval, and the p-value — close to APA format and easy to paste into a report.

tbl_regression(model, exponentiate = TRUE)

Characteristic       OR     95% CI        p-value
LengthOfRecipient    2.02   1.87, 2.18    <0.001

Abbreviations: CI = Confidence Interval, OR = Odds Ratio

6.4 Does the predictor help at all? (overall model test)

This compares your model to an “empty” model that has no predictor at all. A significant result (p < .05) means the model with your predictor fits the data significantly better than the empty model.

null_model <- glm(RealizationOfRecipient ~ 1, data = data, family = binomial)
anova(null_model, model, test = "Chisq")
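
The three numbers you need from this test (χ², df and p) can also be calculated by hand, which shows where they come from. The sketch below is an illustration only: it uses R's built-in mtcars data instead of the dative data, and the names toy_null, toy_model, toy_chi, toy_df and toy_p are made up so that nothing in the template is overwritten.

```r
# Two toy models on a built-in dataset: outcome am (0/1), predictor wt.
toy_null  <- glm(am ~ 1,  data = mtcars, family = binomial)
toy_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Chi-square = how much the deviance drops when the predictor is added.
toy_chi <- toy_null$deviance - toy_model$deviance

# df = how many extra parameters the bigger model estimates.
toy_df <- toy_null$df.residual - toy_model$df.residual

# p = probability of a chi-square at least this large if the predictor did nothing.
toy_p <- pchisq(toy_chi, df = toy_df, lower.tail = FALSE)

c(chi_sq = toy_chi, df = toy_df, p = toy_p)
```

In the anova() output, these values correspond to the Deviance, Df and Pr(>Chi) columns in the second row.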

7 Effect plot (predicted probabilities)

This plot turns the numbers into a picture: the predicted probability of the PP construction across the values of your predictor. The shaded band (for a number) or the error bars (for a category) show the 95% confidence interval.

eff <- ggpredict(model, terms = my_predictor)
plot(eff) + ggtitle(paste("Predicted probability of PP by", my_label))
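
If you want a predicted probability for a few specific values rather than a whole plot, base R's predict() can give you roughly the kind of numbers the plot is drawn from. This is a sketch on the built-in mtcars data, not on the dative data; toy_model and toy_grid are illustration names, and the values 2, 3 and 4 are arbitrary.

```r
# Toy logistic regression on a built-in dataset: outcome am (0/1), predictor wt.
toy_model <- glm(am ~ wt, data = mtcars, family = binomial)

# The predictor values we want predictions for (arbitrary example values).
toy_grid <- data.frame(wt = c(2, 3, 4))

# type = "response" returns probabilities instead of log-odds.
toy_grid$prob <- unname(predict(toy_model, newdata = toy_grid, type = "response"))
toy_grid
```

Each row of toy_grid now holds one predictor value and the probability the model predicts for it. The column name in newdata has to match the predictor name used in the model.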


8 Writing it up: ready-made APA sentences

Copy the sentence that fits your result, then replace each [slot] with the value from your tables. Round B, SE, z, and OR to 2 decimals; report p exactly, or as p < .001 when it is very small.

Where to find each value

  • B, SE, z, p → the “Results table (raw coefficients)”.
  • OR and its CI → the “Odds ratios” table.
  • χ², df, p for the overall model → the “overall model test” output.
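
To avoid formatting slips when copying p-values, a small helper like the one below could do the rounding for you. Note that format_p() is a hypothetical function written for this sketch, not part of any package, and it follows the rule stated above: exact p with three decimals and no leading zero, or p < .001 when it is very small.

```r
# Hypothetical helper: format one p-value the way the sentences below expect.
format_p <- function(p) {
  if (p < 0.001) {
    "p < .001"
  } else {
    # Three decimals, then strip the leading zero (0.032 becomes .032).
    paste0("p = ", sub("^0", "", formatC(p, format = "f", digits = 3)))
  }
}

format_p(0.0321)    # "p = .032"
format_p(0.00004)   # "p < .001"
```

You would call it with a p-value from your own tables, for example the p column of the raw coefficients table.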

8.0.1 Overall model

A logistic regression was conducted to predict the realization of the recipient as a prepositional phrase (PP) rather than a noun phrase (NP) from [predictor]. The model significantly improved on a null model with no predictors, χ²([df]) = [chi], p [= / < .001].

8.0.2 If your predictor is a category and significant

[Predictor] significantly predicted recipient realization, B = [B], SE = [SE], z = [z], p = [p]. The odds of a PP realization were [OR] times [higher / lower] for [level] than for [baseline level], OR = [OR], 95% CI [[CI_low], [CI_high]].

Worked example: Modality significantly predicted recipient realization, B = 0.71, SE = 0.18, z = 3.94, p < .001. The odds of a PP realization were 2.03 times higher for written than for spoken language, OR = 2.03, 95% CI [1.43, 2.89].

8.0.3 If your predictor is a number and significant

[Predictor] significantly predicted recipient realization, B = [B], SE = [SE], z = [z], p = [p]. Each additional [unit, e.g. “word”] [increased / decreased] the odds of a PP realization by a factor of [OR], OR = [OR], 95% CI [[CI_low], [CI_high]].

Worked example: Length of recipient significantly predicted recipient realization, B = 0.28, SE = 0.05, z = 5.60, p < .001. Each additional word increased the odds of a PP realization by a factor of 1.32, OR = 1.32, 95% CI [1.20, 1.46].

8.0.4 If your predictor is not significant

[Predictor] did not significantly predict recipient realization, B = [B], SE = [SE], z = [z], p = [p], so the data provide no evidence that it influenced the choice of construction.

Reporting tip: B is the log-odds coefficient (some textbooks write it as β). Always report the odds ratio (OR) alongside it, because that is the number readers can actually interpret.


9 Quick checklist before you submit