This template offers an opinionated guide on how to structure a modeling analysis. Your individual modeling analysis may require you to add to, subtract from, or otherwise change this structure, but consider this a general framework to start from. If you want to learn more about using tidymodels, check out our Getting Started guide.

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
nyt_titles <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-10/nyt_titles.tsv')
## Rows: 7431 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (2): title, author
## dbl  (5): id, year, total_weeks, debut_rank, best_rank
## date (1): first_week
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(nyt_titles)
## Rows: 7,431
## Columns: 8
## $ id          <dbl> 0, 1, 10, 100, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1…
## $ title       <chr> "\"H\" IS FOR HOMICIDE", "\"I\" IS FOR INNOCENT", "''G'' I…
## $ author      <chr> "Sue Grafton", "Sue Grafton", "Sue Grafton", "W. Bruce Cam…
## $ year        <dbl> 1991, 1992, 1990, 2012, 2006, 2016, 1985, 1994, 2002, 1999…
## $ total_weeks <dbl> 15, 11, 6, 1, 1, 3, 16, 5, 4, 1, 3, 2, 11, 6, 9, 8, 1, 1, …
## $ first_week  <date> 1991-05-05, 1992-04-26, 1990-05-06, 2012-05-27, 2006-02-1…
## $ debut_rank  <dbl> 1, 14, 4, 3, 11, 1, 9, 7, 7, 12, 13, 5, 12, 2, 11, 13, 2, …
## $ best_rank   <dbl> 2, 2, 8, 14, 14, 7, 2, 10, 12, 17, 13, 13, 8, 5, 5, 11, 4,…

Question and Data

Can a word in the title indicate the best rank of a book?

The dataset lists best-selling books according to The New York Times. It includes each book's title and author, the number of weeks it spent on the best-seller list, its best rank on the list, and the rank at which it debuted. It also includes the year, the first week on the list, and a book id.

The primary variables of interest here are debut rank, total weeks on the list, the author, and the book title. They are of interest because we want to look for a correlation between debut rank and total weeks on the list, and the book title in particular allows us to separate one book from another.
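
As a rough illustration of the correlation mentioned above, here is a minimal sketch that computes a Spearman rank correlation between debut rank and total weeks. It uses only the first twelve rows shown in the `glimpse()` output, copied into a small data frame whose name, `nyt_sample`, is made up for this example, so the resulting number says nothing reliable about the full dataset. Spearman is our assumption here, chosen because both variables are ranks or counts.

```r
# First twelve rows of debut_rank and total_weeks, copied from the glimpse() output.
# `nyt_sample` is a hypothetical name used only for this sketch.
nyt_sample <- data.frame(
  debut_rank  = c(1, 14, 4, 3, 11, 1, 9, 7, 7, 12, 13, 5),
  total_weeks = c(15, 11, 6, 1, 1, 3, 16, 5, 4, 1, 3, 2)
)

# Rank-based correlation; cor() handles tied values by using average ranks.
rho <- cor(nyt_sample$debut_rank, nyt_sample$total_weeks, method = "spearman")
rho
```

On the full data, the same call could be applied to `nyt_titles$debut_rank` and `nyt_titles$total_weeks`.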

Explore data

Exploratory data analysis (EDA) is an important part of the modeling process.

nyt_titles %>%
    ggplot(aes(best_rank)) +
    geom_histogram(bins = 30)

library(tidytext)

tidy_nyt_titles <-
    nyt_titles %>%
    unnest_tokens(word, title)

tidy_nyt_titles %>%
    count(word, sort = TRUE)
## # A tibble: 5,770 × 2
##    word      n
##    <chr> <int>
##  1 the    2992
##  2 of      952
##  3 a       362
##  4 in      268
##  5 and     242
##  6 to      159
##  7 for     107
##  8 night    85
##  9 you      80
## 10 love     76
## # ℹ 5,760 more rows
tidy_nyt_titles %>%
    group_by(word) %>%
    summarise(n = n(), 
              best_rank = mean(best_rank)) %>%
    ggplot(aes(n, best_rank)) +
    geom_hline(yintercept = mean(tidy_nyt_titles$best_rank),
               lty = 2, color = "gray50", linewidth = 1.5) +
    geom_point(color = "midnightblue", alpha = .7) +
    scale_x_log10()
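
Words that appear in only one or two titles have very noisy mean ranks, which is why the left side of a plot like this tends to be widely scattered. One possible follow-up, sketched here on a tiny made-up set of titles rather than the real data, is to keep only words above a minimum count before comparing mean best rank. The names `toy_titles`, `min_n`, and `frequent_words` and the threshold of 2 are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)
library(tidytext)

# Made-up titles and ranks, for illustration only (not taken from nyt_titles)
toy_titles <- tibble(
  title = c("THE NIGHT WATCH", "LOVE IN THE NIGHT",
            "A NIGHT OF LOVE", "THE LAST SUMMER"),
  best_rank = c(2, 8, 5, 11)
)

min_n <- 2  # hypothetical threshold; tune for the real data

frequent_words <- toy_titles %>%
  unnest_tokens(word, title) %>%           # one row per lowercased word
  group_by(word) %>%
  summarise(n = n(),
            mean_best_rank = mean(best_rank)) %>%
  filter(n >= min_n) %>%                   # drop rarely used words
  arrange(desc(n))

frequent_words
```

With the real data, `toy_titles` would be replaced by `nyt_titles`, and a larger threshold would likely be more appropriate.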

Build models

Let’s consider how to spend our data budget: