Steps we will follow:

1. Load the data set
2. Clean the data
3. Split the data into training and testing sets
4. Create a model
5. Evaluate the model

## 1. Load the data

Let's get started by loading the libraries we will be using.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(palmerpenguins)
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
### Let's now do a simple visualization to see what our data set looks like
ggplot(penguins,aes(flipper_length_mm, bill_length_mm, color = sex, size = body_mass_g)) +
geom_point(alpha = 0.5) +
facet_wrap(~species)
## Warning: Removed 2 rows containing missing values (geom_point).
## 2. Cleaning the data

Since our model will classify sex, we will drop the rows where sex is missing. Then we will drop the variables we will not use in the classification (year and island).
penguins_df <- penguins %>%
filter(!is.na(sex)) %>%
select(-year, -island)
penguins_df
## # A tibble: 333 × 6
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie 39.1 18.7 181 3750 male
## 2 Adelie 39.5 17.4 186 3800 female
## 3 Adelie 40.3 18 195 3250 female
## 4 Adelie 36.7 19.3 193 3450 female
## 5 Adelie 39.3 20.6 190 3650 male
## 6 Adelie 38.9 17.8 181 3625 female
## 7 Adelie 39.2 19.6 195 4675 male
## 8 Adelie 41.1 17.6 182 3200 female
## 9 Adelie 38.6 21.2 191 3800 male
## 10 Adelie 34.6 21.1 198 4400 male
## # … with 323 more rows
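Before dropping rows, it can help to see where the missing values actually are. The sketch below is an optional check that is not part of the original steps; `na_counts` and `n_complete_sex` are names made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Count the missing values in every column of the raw data
na_counts <- penguins %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Number of rows left after dropping penguins with unknown sex
n_complete_sex <- penguins %>%
  filter(!is.na(sex)) %>%
  nrow()
n_complete_sex
```

The second number should match the 333 rows of `penguins_df` shown above, which means 11 of the original 344 penguins have no recorded sex.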
## 3. Training and testing

Here we split the data into training and testing sets, using sex as the strata variable so that both sets keep a similar share of each sex.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom 0.8.0 ✔ rsample 1.0.0
## ✔ dials 1.0.0 ✔ tune 0.2.0
## ✔ infer 1.0.2 ✔ workflows 0.2.6
## ✔ modeldata 0.1.1 ✔ workflowsets 0.2.1
## ✔ parsnip 1.0.0 ✔ yardstick 1.0.0
## ✔ recipes 0.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
set.seed(123)
penguin_split <- initial_split(penguins_df, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
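To see what `strata = sex` buys us, we can compare the share of each sex in the two sets. This is a rough check rather than part of the original walkthrough; the name `sex_props` is hypothetical, and the setup lines are repeated so the block runs on its own.

```r
library(tidymodels)
library(palmerpenguins)

# Setup repeated from the steps above so this block runs on its own
penguins_df <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

set.seed(123)
penguin_split <- initial_split(penguins_df, strata = sex)

# Share of each sex in the training and testing sets
sex_props <- bind_rows(
  train = training(penguin_split),
  test  = testing(penguin_split),
  .id = "set"
) %>%
  count(set, sex) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
sex_props
```

If the stratification worked, the `prop` column should be close to the same value for a given sex in both sets.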
Next, let's create bootstrap resamples of the training data, which we will use to evaluate our model.
set.seed(123)
penguin_boot <- bootstraps(penguin_train)
penguin_boot
## # Bootstrap sampling
## # A tibble: 25 × 2
## splits id
## <list> <chr>
## 1 <split [249/93]> Bootstrap01
## 2 <split [249/91]> Bootstrap02
## 3 <split [249/90]> Bootstrap03
## 4 <split [249/91]> Bootstrap04
## 5 <split [249/85]> Bootstrap05
## 6 <split [249/87]> Bootstrap06
## 7 <split [249/94]> Bootstrap07
## 8 <split [249/88]> Bootstrap08
## 9 <split [249/95]> Bootstrap09
## 10 <split [249/89]> Bootstrap10
## # … with 15 more rows
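Each `<split [249/93]>` entry in the output above pairs an analysis set (rows drawn with replacement, the same size as the training data) with an assessment set (the rows that were not drawn). A small sketch for peeking inside one split, using rsample's `analysis()` and `assessment()`; the names `first_split`, `n_analysis` and `n_assessment` are made up for illustration.

```r
library(tidymodels)
library(palmerpenguins)

# Setup repeated from the steps above so this block runs on its own
penguins_df <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

set.seed(123)
penguin_split <- initial_split(penguins_df, strata = sex)
penguin_train <- training(penguin_split)

set.seed(123)
penguin_boot <- bootstraps(penguin_train)

# Pull out the first bootstrap resample
first_split <- penguin_boot$splits[[1]]

# Rows sampled with replacement: same size as the training set
n_analysis <- nrow(analysis(first_split))

# Rows never drawn into the analysis set, used to measure performance
n_assessment <- nrow(assessment(first_split))

c(analysis = n_analysis, assessment = n_assessment)
```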
## 4. Building the model

Let's set up our logistic regression model specification.
glm_spec <- logistic_reg() %>%
set_engine("glm")
glm_spec
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
Next, let's start putting together a tidymodels workflow(), a helper object that manages modeling pipelines with pieces that fit together.
penguin_wf <- workflow() %>%
add_formula(sex ~ .)
penguin_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: None
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## sex ~ .
Let's now add the model and fit it to each of the resamples. Here we fit the logistic regression model.
glm_rs <- penguin_wf %>%
add_model(glm_spec) %>%
fit_resamples(
resamples = penguin_boot,
control = control_resamples(save_pred = TRUE)
)
## ! Bootstrap05: preprocessor 1/1, model 1/1: glm.fit: fitted probabilities numerically 0...
## ! Bootstrap08: preprocessor 1/1, model 1/1: glm.fit: fitted probabilities numerically 0...
## ! Bootstrap23: preprocessor 1/1, model 1/1: glm.fit: fitted probabilities numerically 0...
glm_rs
## # Resampling results
## # Bootstrap sampling
## # A tibble: 25 × 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [249/93]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 2 <split [249/91]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 3 <split [249/90]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 4 <split [249/91]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 5 <split [249/85]> Bootstrap05 <tibble [2 × 4]> <tibble [1 × 3]> <tibble>
## 6 <split [249/87]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 7 <split [249/94]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 8 <split [249/88]> Bootstrap08 <tibble [2 × 4]> <tibble [1 × 3]> <tibble>
## 9 <split [249/95]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 10 <split [249/89]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## # … with 15 more rows
##
## There were issues with some computations:
##
## - Warning(s) x3: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Use `collect_notes(object)` for more information.
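The printed results point to `collect_notes()` for details on the warnings. The sketch below follows that hint. A warning that fitted probabilities are numerically 0 or 1 usually means the predictors separate the two classes almost perfectly in that resample, though that is an interpretation rather than something the output states. The names `glm_wf` and `notes` are hypothetical, and the setup is repeated so the block runs on its own.

```r
library(tidymodels)
library(palmerpenguins)

# Setup repeated from the steps above so this block runs on its own
penguins_df <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

set.seed(123)
penguin_split <- initial_split(penguins_df, strata = sex)
penguin_train <- training(penguin_split)

set.seed(123)
penguin_boot <- bootstraps(penguin_train)

# Formula and model combined into one workflow
glm_wf <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

glm_rs <- fit_resamples(
  glm_wf,
  resamples = penguin_boot,
  control = control_resamples(save_pred = TRUE)
)

# One row per warning or error, with the resample it came from
notes <- collect_notes(glm_rs)
notes
```

Which resamples raise the warning can differ between package versions, so the rows may not match the three resamples listed above.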
## 5. Evaluate the model

collect_metrics() averages the performance metrics over the 25 bootstrap resamples.
collect_metrics(glm_rs)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.918 25 0.00639 Preprocessor1_Model1
## 2 roc_auc binary 0.979 25 0.00254 Preprocessor1_Model1
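The resampled metrics above (mean accuracy of about 0.918 and ROC AUC of about 0.979) come only from the training data. Two follow-ups can round out the evaluation: a confusion matrix built from the predictions we saved with `save_pred = TRUE`, and a final fit checked against the test set we held out in step 3. This is a sketch of one way to do it, not part of the original steps; `glm_wf`, `resample_cm`, `penguin_final` and `test_metrics` are names made up for illustration.

```r
library(tidymodels)
library(palmerpenguins)

# Setup repeated from the steps above so this block runs on its own
penguins_df <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

set.seed(123)
penguin_split <- initial_split(penguins_df, strata = sex)
penguin_train <- training(penguin_split)

set.seed(123)
penguin_boot <- bootstraps(penguin_train)

glm_wf <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

glm_rs <- fit_resamples(
  glm_wf,
  resamples = penguin_boot,
  control = control_resamples(save_pred = TRUE)
)

# Confusion matrix pooled over the assessment-set predictions of all resamples
resample_cm <- collect_predictions(glm_rs) %>%
  conf_mat(truth = sex, estimate = .pred_class)
resample_cm

# Fit once on the full training set, then evaluate on the held-out test set
penguin_final <- last_fit(glm_wf, penguin_split)
test_metrics <- collect_metrics(penguin_final)
test_metrics
```

If the test-set metrics land close to the resampled ones, that is a good sign the bootstrap estimates were not overly optimistic.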
### Now you have created your first working logistic regression model