Prompt 1:

I have a dataset called attrition_raw_tbl that looks like this.

attrition_raw_tbl %>% glimpse()

## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

The goal is to help predict attrition for employees.

Please write R code to create a predictive model that predicts the probability of attrition.

Series of prompts with errors

Prompt 2

Load required libraries

library(dplyr)
library(tidyr)
library(caret)
library(glmnet)

Assuming ‘attrition_raw_tbl’ is your dataset

Preprocess the data

attrition_data <- attrition_raw_tbl %>%
  # Drop irrelevant columns
  select(-EmployeeNumber, -EmployeeCount, -Over18, -StandardHours) %>%
  # Convert categorical variables to factors
  mutate_if(is.character, as.factor) %>%
  # Handle missing values if any
  na.omit()

Split data into training and testing sets

set.seed(123)  # for reproducibility
train_index <- createDataPartition(attrition_data$Attrition, p = 0.7, list = FALSE)
train_data <- attrition_data[train_index, ]
test_data <- attrition_data[-train_index, ]

Train a logistic regression model

model <- glm(Attrition ~ ., data = train_data, family = binomial)

Make predictions on the test set

predictions <- predict(model, newdata = test_data, type = "response")

Evaluate the model

confusion_matrix <- table(Actual = test_data$Attrition, Predicted = ifelse(predictions > 0.5, "Yes", "No"))
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

Print the confusion matrix and accuracy

print(confusion_matrix)
print(paste("Accuracy:", accuracy))
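
Since the goal is a probability of attrition, it helps to confirm which class the `type = "response"` output refers to. With a factor response, `glm(..., family = binomial)` treats the first level as failure and models the probability of the other level, which is "Yes" here. The sketch below checks this on made-up data; the `toy` data frame and the coefficients that generate it are hypothetical and are not taken from `attrition_raw_tbl`.

```r
set.seed(123)
n <- 200

# Hypothetical stand-in data with two predictors
toy <- data.frame(
  OverTime      = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  MonthlyIncome = round(runif(n, 1000, 20000))
)

# Made-up data-generating process, for illustration only
p_leave <- plogis(-1 + 1.5 * (toy$OverTime == "Yes") - 0.0001 * toy$MonthlyIncome)
toy$Attrition <- factor(ifelse(runif(n) < p_leave, "Yes", "No"),
                        levels = c("No", "Yes"))

# Same model call as in the response above
model <- glm(Attrition ~ ., data = toy, family = binomial)

# Fitted probabilities of the second factor level ("Yes")
prob_yes <- predict(model, newdata = toy, type = "response")

# For a logistic fit with an intercept, the mean fitted probability
# matches the observed share of the modelled class
mean(prob_yes)
mean(toy$Attrition == "Yes")
```

If the two means agree, the predictions are probabilities of "Yes", so thresholding them at 0.5 and labelling values above it "Yes" is consistent.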

Prompt 3:

“Please use tidymodels instead of caret and to use the h2o model instead of glmnet”

Load required libraries

library(dplyr)
library(tidymodels)
library(h2o)

Assuming ‘attrition_raw_tbl’ is your dataset

Preprocess the data

Split data into training and testing sets

set.seed(123)  # for reproducibility
train_data <- attrition_data %>%
  initial_split(prop = 0.7, strata = Attrition)

train_data <- training(train_data)
test_data <- testing(train_data)

Train an h2o logistic regression model

h2o.init()
train_data_h2o <- as.h2o(train_data)
test_data_h2o <- as.h2o(test_data)

model <- h2o.glm(x = names(train_data_h2o)[-1],
                 y = "Attrition",
                 training_frame = train_data_h2o,
                 family = "binomial")

Make predictions on the test set

predictions <- h2o.predict(model, newdata = test_data_h2o)
predictions <- as.data.frame(predictions$predict)

Evaluate the model

confusion_matrix <- table(Actual = test_data$Attrition, Predicted = ifelse(predictions$predict > 0.5, "Yes", "No"))
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

Print the confusion matrix and accuracy

print(confusion_matrix)
print(paste("Accuracy:", accuracy))

Shutdown h2o

h2o.shutdown()
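
The only code difference between this response and the next one is the split step: here the object returned by `initial_split()` is stored in `train_data` and then overwritten by `training()`, so `testing()` is called on a plain data frame instead of the split object. The sketch below reproduces that pattern on a small made-up data frame and contrasts it with keeping the split under its own name. It assumes the rsample package (loaded by tidymodels) is installed; the `toy` data is hypothetical.

```r
library(rsample)

set.seed(123)

# Hypothetical stand-in data: 80 "No" and 20 "Yes" rows
toy <- data.frame(
  Attrition = factor(rep(c("No", "Yes"), times = c(80, 20))),
  Age       = sample(18:60, 100, replace = TRUE)
)

# Pattern from the response above: the split object is overwritten
train_data <- initial_split(toy, prop = 0.7, strata = Attrition)
train_data <- training(train_data)   # now a data frame, the split is gone
result <- tryCatch(testing(train_data), error = function(e) e)
inherits(result, "error")            # testing() cannot work on a data frame

# Keeping the split under its own name lets both helpers work
split_data <- initial_split(toy, prop = 0.7, strata = Attrition)
train_ok <- training(split_data)
test_ok  <- testing(split_data)
nrow(train_ok) + nrow(test_ok)       # all rows are accounted for
```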

Prompt 4

“Error”

Load required libraries

library(dplyr)
library(tidymodels)
library(h2o)

Assuming ‘attrition_raw_tbl’ is your dataset

Preprocess the data

Split data into training and testing sets

set.seed(123)  # for reproducibility
split_data <- initial_split(attrition_data, prop = 0.7, strata = Attrition)
train_data <- training(split_data)
test_data <- testing(split_data)

Train an h2o logistic regression model

h2o.init()
train_data_h2o <- as.h2o(train_data)
test_data_h2o <- as.h2o(test_data)

model <- h2o.glm(x = names(train_data_h2o)[-1],
                 y = "Attrition",
                 training_frame = train_data_h2o,
                 family = "binomial")

Make predictions on the test set

predictions <- h2o.predict(model, newdata = test_data_h2o)
predictions <- as.data.frame(predictions$predict)

Evaluate the model

confusion_matrix <- table(Actual = test_data$Attrition, Predicted = ifelse(predictions$predict > 0.5, "Yes", "No"))
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

Print the confusion matrix and accuracy

print(confusion_matrix)
print(paste("Accuracy:", accuracy))

Shutdown h2o

h2o.shutdown()

Prompt 5

“Please update from h2o.glm to h2o.automl”

Final Code

# Load required libraries
library(dplyr)
library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.6     ✔ workflows    1.1.3
## ✔ modeldata    1.3.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.3.0
## ✔ recipes      1.0.8

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/

library(h2o)

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
## 
## ----------------------------------------------------------------------

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:lubridate':
## 
##     day, hour, month, week, year

## The following objects are masked from 'package:stats':
## 
##     cor, sd, var

## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

# Assuming 'attrition_raw_tbl' is your dataset
# Preprocess the data
attrition_data <- attrition_raw_tbl %>%
  # Drop irrelevant columns
  select(-EmployeeNumber, -EmployeeCount, -Over18, -StandardHours) %>%
  # Convert categorical variables to factors
  mutate_if(is.character, as.factor) %>%
  # Handle missing values if any
  na.omit()

# Split data into training and testing sets
set.seed(123)  # for reproducibility
split_data <- initial_split(attrition_data, prop = 0.7, strata = Attrition)
train_data <- training(split_data)
test_data <- testing(split_data)

# Train an h2o AutoML model
h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         6 days 22 hours 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.44.0.3 
##     H2O cluster version age:    4 months and 9 days 
##     H2O cluster name:           H2O_started_from_R_Vanessa_vmr042 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.15 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.3.1 (2023-06-16)

## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is (4 months and 9 days) old. There may be a newer version available.
## Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

train_data_h2o <- as.h2o(train_data)


test_data_h2o <- as.h2o(test_data)


automl <- h2o.automl(x = names(train_data_h2o)[-1],
                      y = "Attrition",
                      training_frame = train_data_h2o,
                      max_runtime_secs = 30)  # adjust runtime as needed

## Warning in .verify_dataxy(training_frame, x, y): removing response variable
## from the explanatory variables


# Get the best model from AutoML
best_model <- automl@leader

# Make predictions on the test set
predictions <- h2o.predict(best_model, newdata = test_data_h2o)


predictions <- as.data.frame(predictions$predict)

# Evaluate the model
confusion_matrix <- table(Actual = test_data$Attrition, Predicted = ifelse(predictions$predict > 0.5, "Yes", "No"))

## Warning in Ops.factor(predictions$predict, 0.5): '>' not meaningful for factors

accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Print the confusion matrix and accuracy
print(confusion_matrix)

## < table of extent 2 x 0 >

print(paste("Accuracy:", accuracy))

## [1] "Accuracy: NaN"
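
The `NaN` accuracy follows from the warning above: the `predict` column holds class labels as a factor, so comparing it with `0.5` yields only `NA` values and the confusion matrix ends up with no predicted columns. A minimal sketch of two ways to evaluate instead, using a hand-made data frame in place of the real predictions. It assumes that for a binary response `h2o.predict()` returns a `predict` label column plus one probability column per class (`No` and `Yes` here); the numbers are invented.

```r
# Stand-in for as.data.frame(h2o.predict(best_model, newdata = test_data_h2o))
predictions_df <- data.frame(
  predict = factor(c("No", "Yes", "No", "No", "Yes"), levels = c("No", "Yes")),
  No      = c(0.90, 0.20, 0.70, 0.55, 0.35),
  Yes     = c(0.10, 0.80, 0.30, 0.45, 0.65)
)

# Stand-in for test_data$Attrition
actual <- factor(c("No", "Yes", "Yes", "No", "No"), levels = c("No", "Yes"))

# Option 1: compare the predicted labels directly
confusion_matrix <- table(Actual = actual, Predicted = predictions_df$predict)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(confusion_matrix)
print(paste("Accuracy:", accuracy))  # 3 of 5 correct, so 0.6

# Option 2: threshold the probability of "Yes" at a chosen cutoff
predicted_from_prob <- factor(
  ifelse(predictions_df$Yes > 0.5, "Yes", "No"),
  levels = c("No", "Yes")
)
```

Option 2 also keeps the attrition probability itself, which is what the first prompt asked for. Fixing the factor levels keeps the table 2 x 2 even if one class is never predicted.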

# Shutdown h2o
h2o.shutdown()

## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?

Code Along 12: Coding Faster With ChatGPT

Vanessa Riise Naas

2024-04-30
