We will classify penguins by sex using a simple logistic regression.

Steps we will follow:

1. Load the data set
2. Clean the data
3. Train and test
4. Create a model
5. Evaluate the model

## 1. Load the data

Let's get started by loading the libraries we will be using.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(palmerpenguins)
penguins

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

### Let's now do a simple visualization to see how our data set looks

ggplot(penguins, aes(flipper_length_mm, bill_length_mm, color = sex, size = body_mass_g)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~species)

## Warning: Removed 2 rows containing missing values (geom_point).

## 2. Cleaning the data

Because our model will predict sex, we drop the rows where sex is missing. We then drop the variables we will not use in the classification (year and island).

penguins_df <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)
penguins_df

## # A tibble: 333 × 6
##    species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
##    <fct>            <dbl>         <dbl>             <int>       <int> <fct> 
##  1 Adelie            39.1          18.7               181        3750 male  
##  2 Adelie            39.5          17.4               186        3800 female
##  3 Adelie            40.3          18                 195        3250 female
##  4 Adelie            36.7          19.3               193        3450 female
##  5 Adelie            39.3          20.6               190        3650 male  
##  6 Adelie            38.9          17.8               181        3625 female
##  7 Adelie            39.2          19.6               195        4675 male  
##  8 Adelie            41.1          17.6               182        3200 female
##  9 Adelie            38.6          21.2               191        3800 male  
## 10 Adelie            34.6          21.1               198        4400 male  
## # … with 323 more rows
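
Before dropping rows, it can help to count how many values are missing in each column. Here is a minimal base R sketch on a made-up data frame (the `toy` data and its values are assumptions for illustration, not the penguins data). It also shows a base R equivalent of the `filter()` and `select()` step above.

```r
# Toy data frame with missing values (hypothetical, not the penguins data)
toy <- data.frame(
  sex            = c("male", NA, "female", "female"),
  bill_length_mm = c(39.1, NA, 40.3, 36.7),
  year           = c(2007, 2007, 2008, 2009)
)

# Count missing values per column before deciding what to drop
na_counts <- colSums(is.na(toy))
na_counts

# Base R equivalent of filter(!is.na(sex)) followed by select(-year)
toy_clean <- toy[!is.na(toy$sex), setdiff(names(toy), "year")]
nrow(toy_clean)
```

The same `colSums(is.na(...))` check should work on `penguins` directly once palmerpenguins is loaded.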

## 3. Training and testing

Here we split the data into training and testing sets, using sex as the strata so that both sets keep a similar proportion of male and female penguins.

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──

## ✔ broom        0.8.0     ✔ rsample      1.0.0
## ✔ dials        1.0.0     ✔ tune         0.2.0
## ✔ infer        1.0.2     ✔ workflows    0.2.6
## ✔ modeldata    0.1.1     ✔ workflowsets 0.2.1
## ✔ parsnip      1.0.0     ✔ yardstick    1.0.0
## ✔ recipes      0.2.0

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/

set.seed(123)
penguin_split <- initial_split(penguins_df, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
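
To see roughly what a stratified split does, here is a base R sketch on made-up data. It samples 75% of the rows within each level of sex, so the class proportions stay the same in both pieces. The `toy` data frame and the 75% figure are assumptions for illustration, and rsample's own implementation may differ in its details.

```r
# Toy data: 40 "female" and 60 "male" rows (hypothetical, not the penguins data)
set.seed(123)
toy <- data.frame(
  id  = 1:100,
  sex = factor(rep(c("female", "male"), times = c(40, 60)))
)

# Sample 75% of the row ids within each level of sex
train_ids <- unlist(lapply(
  split(toy$id, toy$sex),
  function(ids) sample(ids, size = floor(0.75 * length(ids)))
))

toy_train <- toy[toy$id %in% train_ids, ]
toy_test  <- toy[!toy$id %in% train_ids, ]

# Class proportions are preserved in both pieces
prop.table(table(toy_train$sex))
prop.table(table(toy_test$sex))
```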

We need to evaluate our model using the training data. To do this, let's create bootstrap resamples of the training data.

set.seed(123)
penguin_boot <- bootstraps(penguin_train)
penguin_boot

## # Bootstrap sampling 
## # A tibble: 25 × 2
##    splits           id         
##    <list>           <chr>      
##  1 <split [249/93]> Bootstrap01
##  2 <split [249/91]> Bootstrap02
##  3 <split [249/90]> Bootstrap03
##  4 <split [249/91]> Bootstrap04
##  5 <split [249/85]> Bootstrap05
##  6 <split [249/87]> Bootstrap06
##  7 <split [249/94]> Bootstrap07
##  8 <split [249/88]> Bootstrap08
##  9 <split [249/95]> Bootstrap09
## 10 <split [249/89]> Bootstrap10
## # … with 15 more rows
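
Each split above shows two numbers, such as [249/93]. The sketch below, in base R, illustrates where numbers like these come from: the first is a sample of the training rows drawn with replacement, and the second is the set of rows that were never drawn. The variable names are our own, and this is a simplified picture of one resample rather than rsample's actual code.

```r
set.seed(123)
n <- 249  # size of the training set in the output above

# Analysis set: n row indices drawn with replacement
analysis_idx <- sample(n, size = n, replace = TRUE)

# Assessment set: rows that were never drawn ("out-of-bag")
assessment_idx <- setdiff(seq_len(n), analysis_idx)

length(analysis_idx)    # always n
length(assessment_idx)  # varies by resample, about 0.368 * n on average
```

The 0.368 comes from (1 - 1/n)^n, which is close to exp(-1) for large n. That is why the assessment sets above hover around 90 rows.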

## 4. Building the model

Let's set up our logistic regression model specification.

glm_spec <- logistic_reg() %>%
  set_engine("glm")

glm_spec

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
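
With the "glm" engine, the model is fit with base R's `glm()` using a binomial family. Here is a small sketch of that on simulated data, to show what a logistic regression estimates. The simulated variables and the coefficient values 0.5 and 2 are assumptions for illustration only.

```r
set.seed(123)
n <- 200
x <- rnorm(n)

# Simulate a binary outcome whose log-odds are 0.5 + 2 * x (hypothetical values)
p <- plogis(0.5 + 2 * x)
y <- rbinom(n, size = 1, prob = p)

# Fit the logistic regression
fit <- glm(y ~ x, family = binomial)
coef(fit)

# Predicted probabilities are the logistic function of the linear predictor
eta <- coef(fit)[1] + coef(fit)[2] * x
same <- all.equal(unname(predict(fit, type = "response")), unname(plogis(eta)))
same
```

The estimated coefficients should land near the values used in the simulation, though not exactly on them.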

Next, let's start putting together a tidymodels workflow(), a helper object that manages modeling pipelines with pieces that fit together.

penguin_wf <- workflow() %>%
  add_formula(sex ~ .)

penguin_wf

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: None
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## sex ~ .

Let's now add the model and fit it to each of the resamples.

glm_rs <- penguin_wf %>%
  add_model(glm_spec) %>%
  fit_resamples(
    resamples = penguin_boot,
    control = control_resamples(save_pred = TRUE)
  )

## ! Bootstrap05: preprocessor 1/1, model 1/1: glm.fit: fitted probabilities numerically 0...

## ! Bootstrap08: preprocessor 1/1, model 1/1: glm.fit: fitted probabilities numerically 0...

## ! Bootstrap23: preprocessor 1/1, model 1/1: glm.fit: fitted probabilities numerically 0...

glm_rs

## # Resampling results
## # Bootstrap sampling 
## # A tibble: 25 × 5
##    splits           id          .metrics         .notes           .predictions
##    <list>           <chr>       <list>           <list>           <list>      
##  1 <split [249/93]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>    
##  2 <split [249/91]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>    
##  3 <split [249/90]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>    
##  4 <split [249/91]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>    
##  5 <split [249/85]> Bootstrap05 <tibble [2 × 4]> <tibble [1 × 3]> <tibble>    
##  6 <split [249/87]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>    
##  7 <split [249/94]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>    
##  8 <split [249/88]> Bootstrap08 <tibble [2 × 4]> <tibble [1 × 3]> <tibble>    
##  9 <split [249/95]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>    
## 10 <split [249/89]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>    
## # … with 15 more rows
## 
## There were issues with some computations:
## 
##   - Warning(s) x3: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Use `collect_notes(object)` for more information.
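
One common cause of the "fitted probabilities numerically 0 or 1" warning is a predictor that separates the two classes perfectly, or almost perfectly, in the data being fit. We have not confirmed that this is what happened in the three resamples above, but the base R sketch below reproduces the warning on a tiny made-up data set.

```r
# Toy data where x perfectly separates the two classes (hypothetical)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

# Fit the model and collect any warning messages instead of printing them
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
msgs
```

The resampling still completed here, and the warning affected only 3 of the 25 resamples.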

## 5. Evaluate the model

collect_metrics(glm_rs)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.918    25 0.00639 Preprocessor1_Model1
## 2 roc_auc  binary     0.979    25 0.00254 Preprocessor1_Model1
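
The `mean`, `n` and `std_err` columns summarize the metric across the 25 resamples. As far as we understand, `std_err` is the standard deviation of the per-resample values divided by the square root of `n`. Here is a base R sketch of that calculation, using made-up accuracy values (the numbers in `acc` are assumptions, not our real results).

```r
# Hypothetical per-resample accuracies (made up for illustration)
acc <- c(0.90, 0.92, 0.93, 0.91, 0.94)

# Mean across resamples and its standard error
mean_acc <- mean(acc)
std_err  <- sd(acc) / sqrt(length(acc))

mean_acc
std_err
```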

### Now you have created your first working logistic regression model

log_reg_penguin data

Kelvin Nyongesa

2022-06-28
