Load the required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The original Project 4 consisted of processing emails and labeling them as spam or not spam. For this project, we instead explored the mushroom dataset from the UCI Machine Learning Repository. Classifying mushrooms as edible or poisonous is important to public health, culinary arts, and biological research. We will try to classify these mushrooms as poisonous or edible using a logistic regression model.
In the data preprocessing section of the project, several steps were taken to prepare the mushroom dataset for logistic regression analysis. First, the dataset was loaded into R for processing. The primary focus was to ensure that the data was 'tidy' and formatted correctly for analysis. This involved converting the categorical variables into a numeric format, since the modeling steps that follow operate on numeric columns.
During this conversion, each category within a variable was assigned a unique numerical identifier. This step preserves the categorical information in a form that the logistic regression model can use. Additionally, the dataset was checked for missing values. The only missing values are in the stalk_root column, and the rows containing them were removed so that the model receives a complete dataset for training and testing.
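To make the encoding step concrete, the sketch below applies the same factor-to-number conversion to a small made-up vector. The values and the names `odor`, `odor_levels`, `odor_num`, and `odor_dummy` are illustrative assumptions and are not taken from the mushroom data. It also shows dummy (indicator) coding with `model.matrix()` as one possible alternative, since integer codes imply an ordering between categories that a regression model will treat as meaningful.

```r
# Toy example (hypothetical values, not taken from the mushroom data)
odor <- c("almond", "foul", "none", "foul")
odor_levels <- c("almond", "foul", "none")

# Integer coding: each category receives a numeric identifier
# determined by its position in `odor_levels`
odor_num <- as.numeric(factor(odor, levels = odor_levels))
print(odor_num)

# Dummy (indicator) coding: an intercept plus one 0/1 column
# for every level except the first (reference) level
odor_fct <- factor(odor, levels = odor_levels)
odor_dummy <- model.matrix(~ odor_fct)
print(odor_dummy)
```

Integer coding keeps the data frame compact, while dummy coding avoids treating, for example, "none" as three times "almond".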
# Read the file
url <- "https://raw.githubusercontent.com/Mattr5541/DATA_607_Project-4/main/mushroom/agaricus-lepiota_data.txt"
file = readLines(url)
split_data <- lapply(file, function(x) unlist(strsplit(x, ",")))
# Convert the data to a dataframe
df <- as.data.frame(do.call(rbind, split_data))
# The names of the 23 columns as detailed in agaricus-lepiota_names.txt
col_names = c('class', 'cap_shape', 'cap_surface', 'cap_color', 'bruises', 'odor', 'gill_attachment', 'gill_spacing', 'gill_size', 'gill_color', 'stalk_shape', 'stalk_root', 'stalk_surface_above_ring', 'stalk_surface_below_ring', 'stalk_color_above_ring', 'stalk_color_below_ring', 'veil_type', 'veil_color', 'ring_number', 'ring_type', 'spore_print_color', 'population', 'habitat')
# Assign the names to each column of the dataframe
names(df) <- col_names
# The type of class for each mushroom is binary, poisonous or edible
# There are no missing values for them, so ifelse() can be used
df$class <- ifelse(df$class == 'p', 'poisonous', 'edible')
# These columns also have binary values
df$bruises <- ifelse(df$bruises == 't', 'bruises', 'no')
df$gill_size <- ifelse(df$gill_size == 'b', 'broad', 'narrow')
df$stalk_shape <- ifelse(df$stalk_shape == 'e', 'enlarging', 'tapering')
df$veil_type <- ifelse(df$veil_type == 'p', 'partial', 'universal')
# Rename all the values in each column to make them more clear as noted in agaricus-lepiota_names.txt
# The same file states the only missing values are in the stalk_root column
df <- df |>
mutate(cap_shape = case_when(
cap_shape == 'b' ~ 'bell',
cap_shape == 'c' ~ 'conical',
cap_shape == 'x' ~ 'convex',
cap_shape == 'f' ~ 'flat',
cap_shape == 'k' ~ 'knobbed',
TRUE ~ 'sunken'),
cap_surface = case_when(
cap_surface == 'f' ~ 'fibrous',
cap_surface == 'g' ~ 'grooves',
cap_surface == 's' ~ 'scaly',
TRUE ~ 'smooth'
),
cap_color = case_when(
cap_color == 'n' ~ 'brown',
cap_color == 'b' ~ 'buff',
cap_color == 'c' ~ 'cinnamon',
cap_color == 'g' ~ 'gray',
cap_color == 'r' ~ 'green',
cap_color == 'p' ~ 'pink',
cap_color == 'u' ~ 'purple',
cap_color == 'e' ~ 'red',
cap_color == 'w' ~ 'white',
TRUE ~ 'yellow'
),
odor = case_when(
odor == 'a' ~ 'almond',
odor == 'l' ~ 'anise',
odor == 'c' ~ 'creosote',
odor == 'y' ~ 'fishy',
odor == 'f' ~ 'foul',
odor == 'm' ~ 'musty',
odor == 'n' ~ 'none',
odor == 'p' ~ 'pungent',
TRUE ~ 'spicy'
),
gill_attachment = case_when(
gill_attachment == 'a' ~ 'attached',
gill_attachment == 'd' ~ 'descending',
gill_attachment == 'f' ~ 'free',
TRUE ~ 'notched'
),
gill_spacing = case_when(
gill_spacing == 'c' ~ 'close',
gill_spacing == 'w' ~ 'crowded',
TRUE ~ 'distant'
),
gill_color = case_when(
gill_color == 'k' ~ 'black',
gill_color == 'n' ~ 'brown',
gill_color == 'b' ~ 'buff',
gill_color == 'h' ~ 'chocolate',
gill_color == 'g' ~ 'gray',
gill_color == 'r' ~ 'green',
gill_color == 'o' ~ 'orange',
gill_color == 'p' ~ 'pink',
gill_color == 'u' ~ 'purple',
gill_color == 'e' ~ 'red',
gill_color == 'w' ~ 'white',
TRUE ~ 'yellow'
),
stalk_root = case_when(
stalk_root == 'b' ~ 'bulbous',
stalk_root == 'c' ~ 'club',
stalk_root == 'u' ~ 'cup',
stalk_root == 'e' ~ 'equal',
stalk_root == 'z' ~ 'rhizomorphs',
stalk_root == 'r' ~ 'rooted',
TRUE ~ NA
),
stalk_surface_above_ring = case_when(
stalk_surface_above_ring == 'f' ~ 'fibrous',
stalk_surface_above_ring == 'y' ~ 'scaly',
stalk_surface_above_ring == 'k' ~ 'silky',
TRUE ~ 'smooth'
),
stalk_surface_below_ring = case_when(
stalk_surface_below_ring == 'f' ~ 'fibrous',
stalk_surface_below_ring == 'y' ~ 'scaly',
stalk_surface_below_ring == 'k' ~ 'silky',
TRUE ~ 'smooth'
),
stalk_color_above_ring = case_when(
stalk_color_above_ring == 'n' ~ 'brown',
stalk_color_above_ring == 'b' ~ 'buff',
stalk_color_above_ring == 'c' ~ 'cinnamon',
stalk_color_above_ring == 'g' ~ 'gray',
stalk_color_above_ring == 'o' ~ 'orange',
stalk_color_above_ring == 'p' ~ 'pink',
stalk_color_above_ring == 'e' ~ 'red',
stalk_color_above_ring == 'w' ~ 'white',
TRUE ~ 'yellow'
),
stalk_color_below_ring = case_when(
stalk_color_below_ring == 'n' ~ 'brown',
stalk_color_below_ring == 'b' ~ 'buff',
stalk_color_below_ring == 'c' ~ 'cinnamon',
stalk_color_below_ring == 'g' ~ 'gray',
stalk_color_below_ring == 'o' ~ 'orange',
stalk_color_below_ring == 'p' ~ 'pink',
stalk_color_below_ring == 'e' ~ 'red',
stalk_color_below_ring == 'w' ~ 'white',
TRUE ~ 'yellow'
),
veil_color = case_when(
veil_color == 'n' ~ 'brown',
veil_color == 'o' ~ 'orange',
veil_color == 'w' ~ 'white',
TRUE ~ 'yellow'
),
ring_number = case_when(
ring_number == 'n' ~ '0',
ring_number == 'o' ~ '1',
TRUE ~ '2'
),
ring_type = case_when(
ring_type == 'c' ~ 'cobwebby',
ring_type == 'e' ~ 'evanescent',
ring_type == 'f' ~ 'flaring',
ring_type == 'l' ~ 'large',
ring_type == 'n' ~ 'none',
ring_type == 'p' ~ 'pendant',
ring_type == 's' ~ 'sheathing',
TRUE ~ 'zone'
),
spore_print_color = case_when(
spore_print_color == 'k' ~ 'black',
spore_print_color == 'n' ~ 'brown',
spore_print_color == 'b' ~ 'buff',
spore_print_color == 'h' ~ 'chocolate',
spore_print_color == 'r' ~ 'green',
spore_print_color == 'o' ~ 'orange',
spore_print_color == 'u' ~ 'purple',
spore_print_color == 'w' ~ 'white',
TRUE ~ 'yellow'
),
population = case_when(
population == 'a' ~ 'abundant',
population == 'c' ~ 'clustered',
population == 'n' ~ 'numerous',
population == 's' ~ 'scattered',
population == 'v' ~ 'several',
TRUE ~ 'solitary'
),
habitat = case_when(
habitat == 'g' ~ 'grasses',
habitat == 'l' ~ 'leaves',
habitat == 'm' ~ 'meadows',
habitat == 'p' ~ 'paths',
habitat == 'u' ~ 'urban',
habitat == 'w' ~ 'waste',
TRUE ~ 'woods'
))
# Convert the number of rings from character to numeric
df$ring_number <- as.numeric(df$ring_number)
# According to agaricus-lepiota_names.txt, there are 2480 missing values in the stalk_root column, so we filter those rows out
df <- df |>
filter(!is.na(stalk_root))
mushroom_colors <- c('brown', 'orange', 'white', 'yellow', 'buff', 'gray', 'pink', 'red', 'green', 'purple', 'cinnamon', 'black', 'chocolate')
surface <- c('fibrous', 'scaly', 'silky', 'smooth')
df_num <- df
df_num$class <- as.numeric(factor(df_num$class, levels=c('poisonous', 'edible')))
df_num$cap_shape <- as.numeric(factor(df_num$cap_shape, levels=c('bell', 'conical', 'convex', 'flat', 'knobbed', 'sunken')))
df_num$cap_surface <- as.numeric(factor(df_num$cap_surface, levels=c('fibrous', 'grooves', 'scaly', 'smooth')))
df_num$cap_color <- as.numeric(factor(df_num$cap_color, levels=mushroom_colors))
df_num$bruises <- as.numeric(factor(df_num$bruises, levels=c('bruises', 'no')))
df_num$odor <- as.numeric(factor(df_num$odor, levels=c('almond', 'anise', 'creosote', 'fishy', 'foul', 'musty', 'none', 'pungent' , 'spicy')))
df_num$gill_attachment <- as.numeric(factor(df_num$gill_attachment, levels=c('attached','descending','free', 'notched')))
df_num$gill_spacing <- as.numeric(factor(df_num$gill_spacing, levels=c('close', 'crowded', 'distant')))
df_num$gill_size <- as.numeric(factor(df_num$gill_size, levels=c('broad', 'narrow')))
df_num$gill_color <- as.numeric(factor(df_num$gill_color, levels=mushroom_colors))
df_num$stalk_shape <- as.numeric(factor(df_num$stalk_shape, levels=c('enlarging', 'tapering')))
df_num$stalk_root <- as.numeric(factor(df_num$stalk_root, levels=c('bulbous', 'club', 'cup', 'equal', 'rhizomorphs', 'rooted')))
df_num$stalk_surface_above_ring <- as.numeric(factor(df_num$stalk_surface_above_ring, levels=surface))
df_num$stalk_surface_below_ring <- as.numeric(factor(df_num$stalk_surface_below_ring, levels=surface))
df_num$stalk_color_above_ring <- as.numeric(factor(df_num$stalk_color_above_ring, levels=mushroom_colors))
df_num$stalk_color_below_ring <- as.numeric(factor(df_num$stalk_color_below_ring, levels=mushroom_colors))
df_num$veil_type <- as.numeric(factor(df_num$veil_type, levels=c('partial', 'universal')))
df_num$veil_color <- as.numeric(factor(df_num$veil_color, levels=mushroom_colors))
df_num$ring_type <- as.numeric(factor(df_num$ring_type, levels=c('cobwebby', 'evanescent', 'flaring', 'large', 'none', 'pendant', 'sheathing', 'zone')))
df_num$spore_print_color <- as.numeric(factor(df_num$spore_print_color, levels=mushroom_colors))
df_num$population <- as.numeric(factor(df_num$population, levels=c('abundant', 'clustered', 'numerous', 'scattered', 'several', 'solitary')))
df_num$habitat <- as.numeric(factor(df_num$habitat, levels=c('grasses', 'leaves', 'meadows', 'paths', 'urban', 'waste', 'woods')))
# Convert the binary columns to zeros and ones
# Veil type is also binary, but every row contains partial; not a single row contains universal for this column
binary_cols <- c(1, 5, 9, 11)
for (col in binary_cols) {
df_num[[col]] <- df_num[[col]] - 1
}
For exploratory data analysis, the mushroom dataset was examined to gain insight into its characteristics and uncover any underlying patterns. This involved generating summary statistics, including the minimum, median, mean, and quartiles of ring_number and the mode of every column.
Bar plots were created for categorical features such as ring_number, bruises, gill_size, and stalk_shape. These visualizations show the frequency distribution of the categories within each feature. Because the plots are faceted by class, they also help to visualize the relationship between these features and the target variable (edible or poisonous).
A correlation matrix was computed to show how the features are related to each other.
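A column that takes a single value, as veil_type does in this dataset, has zero standard deviation, so `cor()` returns NA for it and emits a warning. The sketch below shows one way to drop such columns before computing correlations. The data frame `toy` and the helper vector `varies` are hypothetical names used only for illustration.

```r
# Toy data frame (hypothetical); `constant` mimics a column with one value
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 4, 3, 6),
  constant = c(1, 1, 1, 1, 1)
)

# Flag the columns that contain more than one distinct value
varies <- vapply(toy, function(col) length(unique(col)) > 1, logical(1))

# Correlate only the columns that vary, which avoids NA rows and columns
toy_cor <- cor(toy[, varies])
print(round(toy_cor, 3))
```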
# Summary Statistics
summary(df_num$ring_number)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.015 1.000 2.000
# Get the mode of every column
# The most common class in this dataset is edible (class = 1, since poisonous was coded as 0)
df_num |>
summarise(across(everything(), ~as.numeric(names(which.max(table(.))))))
## class cap_shape cap_surface cap_color bruises odor gill_attachment
## 1 1 3 4 6 0 7 3
## gill_spacing gill_size gill_color stalk_shape stalk_root
## 1 1 0 7 1 1
## stalk_surface_above_ring stalk_surface_below_ring stalk_color_above_ring
## 1 4 4 3
## stalk_color_below_ring veil_type veil_color ring_number ring_type
## 1 3 1 3 1 6
## spore_print_color population habitat
## 1 1 5 7
# The number of mushrooms with 1 or 2 rings is similar for both poisonous and edible types
# There are no mushrooms that are edible with 0 rings in this dataset
df |>
ggplot(aes(x = ring_number)) +
geom_bar() +
facet_grid(. ~ class) +
labs(x = "Number of Rings", y = "Count") +
scale_y_log10()
# Comparing the binary columns
# Poisonous mushrooms tend not to have bruises, while edible mushrooms tend to have them
df |>
ggplot(aes(x = bruises)) +
geom_bar() +
facet_grid(. ~ class) +
labs(x = "Bruises?", y = "Count") +
scale_y_log10()
# Both poisonous and edible mushrooms have more broad gill sizes than narrow
df |>
ggplot(aes(x = gill_size)) +
geom_bar() +
facet_grid(. ~ class) +
labs(x = "Type of Gill Size", y = "Count") +
scale_y_log10()
# Poisonous mushrooms have more enlarging stalk shape than tapering while the opposite is true for edible mushrooms
df |>
ggplot(aes(x = stalk_shape)) +
geom_bar() +
facet_grid(. ~ class) +
labs(x = "Stalk Shape", y = "Count") +
scale_y_log10()
We will be performing a simple machine learning procedure, so the next step is to randomly split the dataset into 70% for training and 30% for testing.
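The split in this report is done with caret's `createDataPartition()`. As a rough illustration of what a class-balanced 70/30 split involves, here is a minimal base R sketch on simulated data. The data frame `toy` and the names `train_idx`, `toy_train`, and `toy_test` are assumptions made for this example, and the code is not meant to reproduce caret's internals.

```r
set.seed(12345)  # fixed seed so the random split is reproducible

# Simulated data (hypothetical): 40 rows of class 0 and 60 rows of class 1
toy <- data.frame(
  class = rep(c(0, 1), times = c(40, 60)),
  x = rnorm(100)
)

# Sample 70% of the row indices within each class separately,
# so both subsets keep the original class proportions
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$class), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

print(table(toy_train$class))
print(table(toy_test$class))
```

Sampling within each class keeps the edible-to-poisonous ratio similar in both subsets, which matters when one class is more common than the other.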
print(cor(df_num[, c(1:23)]))
## Warning in cor(df_num[, c(1:23)]): the standard deviation is zero
## class cap_shape cap_surface cap_color
## class 1.000000000 -0.07442866 -0.0468588384 -0.094145965
## cap_shape -0.074428656 1.00000000 -0.1360355590 0.055995891
## cap_surface -0.046858838 -0.13603556 1.0000000000 -0.052383041
## cap_color -0.094145965 0.05599589 -0.0523830408 1.000000000
## bruises -0.435561947 0.17500456 -0.2285895204 0.035063040
## odor 0.112626833 0.39178575 -0.1865878728 0.063718249
## gill_attachment 0.071944939 -0.04736870 -0.0581044722 -0.056427238
## gill_spacing 0.264159844 0.07020886 -0.2044475297 -0.144347669
## gill_size -0.215288952 0.07158188 -0.0423561776 -0.122150503
## gill_color -0.235281902 0.04460456 -0.0752318996 0.012597579
## stalk_shape 0.592445682 0.19355638 -0.1735538666 0.045344886
## stalk_root 0.271684328 0.03249354 0.0546906075 -0.361274640
## stalk_surface_above_ring 0.246214816 -0.12052939 0.0918231084 0.013669579
## stalk_surface_below_ring 0.189169644 -0.13493275 0.0082750463 0.064299207
## stalk_color_above_ring 0.052756291 0.12337621 -0.0245223940 0.176569415
## stalk_color_below_ring 0.045345317 0.11485028 -0.0234544798 0.168681357
## veil_type NA NA NA NA
## veil_color -0.047920723 -0.01726316 0.0387019345 -0.006054078
## ring_number -0.008615448 -0.11327544 0.0442890890 -0.014443061
## ring_type 0.098533305 -0.15075739 0.1973860432 0.061151616
## spore_print_color -0.437212476 0.04884640 -0.0122651174 0.078088269
## population -0.203882302 0.16741067 -0.0003903856 0.203156911
## habitat 0.102083009 0.15257501 -0.0754217326 0.226197897
## bruises odor gill_attachment gill_spacing
## class -0.43556195 0.112626833 0.071944939 0.26415984
## cap_shape 0.17500456 0.391785745 -0.047368698 0.07020886
## cap_surface -0.22858952 -0.186587873 -0.058104472 -0.20444753
## cap_color 0.03506304 0.063718249 -0.056427238 -0.14434767
## bruises 1.00000000 0.025353056 -0.064351010 0.43918287
## odor 0.02535306 1.000000000 -0.012427493 0.12926755
## gill_attachment -0.06435101 -0.012427493 1.000000000 0.02662965
## gill_spacing 0.43918287 0.129267549 0.026629645 1.00000000
## gill_size 0.04017781 -0.007439447 0.021353017 0.17848372
## gill_color 0.32220573 0.034852019 0.042701547 0.08830182
## stalk_shape -0.34829416 0.540036206 0.057738250 0.31405968
## stalk_root 0.12941254 -0.022209165 -0.005458082 0.46535646
## stalk_surface_above_ring -0.49815038 -0.084534498 0.028452169 -0.34491819
## stalk_surface_below_ring -0.42381521 0.062333938 0.082123661 -0.29486021
## stalk_color_above_ring -0.11817807 0.228182741 -0.204688855 -0.26037155
## stalk_color_below_ring -0.12707225 0.219691997 -0.203260804 -0.27660353
## veil_type NA NA NA NA
## veil_color 0.04286260 0.027176919 0.002131061 0.08002589
## ring_number -0.12216179 0.084158509 0.346681618 -0.04231552
## ring_type -0.81265833 -0.199523522 -0.001909714 -0.69498832
## spore_print_color 0.24598095 -0.093987906 0.054523698 -0.17814453
## population -0.19629556 0.083650188 0.115499342 -0.59591012
## habitat -0.32635512 0.208066175 -0.057171234 -0.38078574
## gill_size gill_color stalk_shape stalk_root
## class -0.215288952 -0.235281902 0.59244568 0.271684328
## cap_shape 0.071581885 0.044604557 0.19355638 0.032493537
## cap_surface -0.042356178 -0.075231900 -0.17355387 0.054690608
## cap_color -0.122150503 0.012597579 0.04534489 -0.361274640
## bruises 0.040177810 0.322205730 -0.34829416 0.129412536
## odor -0.007439447 0.034852019 0.54003621 -0.022209165
## gill_attachment 0.021353017 0.042701547 0.05773825 -0.005458082
## gill_spacing 0.178483716 0.088301818 0.31405968 0.465356460
## gill_size 1.000000000 -0.107142754 -0.28236762 0.173320541
## gill_color -0.107142754 1.000000000 -0.07447205 -0.001986807
## stalk_shape -0.282367616 -0.074472052 1.00000000 -0.046273909
## stalk_root 0.173320541 -0.001986807 -0.04627391 1.000000000
## stalk_surface_above_ring 0.169684968 -0.226252875 -0.01342712 -0.121060618
## stalk_surface_below_ring 0.192483604 -0.161637300 0.06657057 -0.325789753
## stalk_color_above_ring -0.208079519 -0.052139021 0.17725580 -0.321906677
## stalk_color_below_ring -0.229419833 -0.043913951 0.18456601 -0.313490282
## veil_type NA NA NA NA
## veil_color 0.099801403 -0.028442432 -0.03845800 0.003635492
## ring_number -0.033930758 -0.026291139 -0.09174828 -0.082381544
## ring_type 0.187912486 -0.298049579 -0.01184888 -0.356232188
## spore_print_color -0.152695961 0.178614734 -0.21201900 -0.200204311
## population -0.008851112 -0.062171035 -0.09959825 -0.520161281
## habitat 0.070147202 -0.164714443 0.25984857 -0.539517617
## stalk_surface_above_ring stalk_surface_below_ring
## class 0.24621482 0.189169644
## cap_shape -0.12052939 -0.134932745
## cap_surface 0.09182311 0.008275046
## cap_color 0.01366958 0.064299207
## bruises -0.49815038 -0.423815211
## odor -0.08453450 0.062333938
## gill_attachment 0.02845217 0.082123661
## gill_spacing -0.34491819 -0.294860208
## gill_size 0.16968497 0.192483604
## gill_color -0.22625288 -0.161637300
## stalk_shape -0.01342712 0.066570573
## stalk_root -0.12106062 -0.325789753
## stalk_surface_above_ring 1.00000000 0.421548925
## stalk_surface_below_ring 0.42154893 1.000000000
## stalk_color_above_ring 0.12810174 0.138633257
## stalk_color_below_ring 0.13610770 0.145497456
## veil_type NA NA
## veil_color -0.05996394 -0.054700515
## ring_number 0.05729046 0.102365643
## ring_type 0.57772556 0.504202676
## spore_print_color -0.19065556 -0.153964654
## population 0.24686841 0.235524299
## habitat 0.35731950 0.385854958
## stalk_color_above_ring stalk_color_below_ring
## class 0.052756291 0.0453453170
## cap_shape 0.123376210 0.1148502772
## cap_surface -0.024522394 -0.0234544798
## cap_color 0.176569415 0.1686813573
## bruises -0.118178072 -0.1270722464
## odor 0.228182741 0.2196919974
## gill_attachment -0.204688855 -0.2032608038
## gill_spacing -0.260371547 -0.2766035289
## gill_size -0.208079519 -0.2294198325
## gill_color -0.052139021 -0.0439139508
## stalk_shape 0.177255802 0.1845660131
## stalk_root -0.321906677 -0.3134902815
## stalk_surface_above_ring 0.128101743 0.1361076996
## stalk_surface_below_ring 0.138633257 0.1454974564
## stalk_color_above_ring 1.000000000 0.3711363006
## stalk_color_below_ring 0.371136301 1.0000000000
## veil_type NA NA
## veil_color -0.001324077 -0.0009802644
## ring_number -0.228806015 -0.2258631047
## ring_type 0.160300478 0.1764519571
## spore_print_color -0.003703535 0.0048091508
## population 0.256794339 0.2524786093
## habitat 0.382121322 0.3862637473
## veil_type veil_color ring_number ring_type
## class NA -0.0479207229 -0.008615448 0.098533305
## cap_shape NA -0.0172631588 -0.113275436 -0.150757395
## cap_surface NA 0.0387019345 0.044289089 0.197386043
## cap_color NA -0.0060540784 -0.014443061 0.061151616
## bruises NA 0.0428625966 -0.122161788 -0.812658330
## odor NA 0.0271769190 0.084158509 -0.199523522
## gill_attachment NA 0.0021310610 0.346681618 -0.001909714
## gill_spacing NA 0.0800258896 -0.042315521 -0.694988316
## gill_size NA 0.0998014029 -0.033930758 0.187912486
## gill_color NA -0.0284424317 -0.026291139 -0.298049579
## stalk_shape NA -0.0384580027 -0.091748279 -0.011848875
## stalk_root NA 0.0036354917 -0.082381544 -0.356232188
## stalk_surface_above_ring NA -0.0599639356 0.057290457 0.577725560
## stalk_surface_below_ring NA -0.0547005152 0.102365643 0.504202676
## stalk_color_above_ring NA -0.0013240770 -0.228806015 0.160300478
## stalk_color_below_ring NA -0.0009802644 -0.225863105 0.176451957
## veil_type 1 NA NA NA
## veil_color NA 1.0000000000 -0.003386337 -0.075648421
## ring_number NA -0.0033863372 1.000000000 0.090418953
## ring_type NA -0.0756484213 0.090418953 1.000000000
## spore_print_color NA -0.0363168703 -0.001725182 -0.123749162
## population NA -0.0769312205 0.132167949 0.443578638
## habitat NA -0.0331436154 -0.103343759 0.523897696
## spore_print_color population habitat
## class -0.437212476 -0.2038823018 0.10208301
## cap_shape 0.048846402 0.1674106717 0.15257501
## cap_surface -0.012265117 -0.0003903856 -0.07542173
## cap_color 0.078088269 0.2031569106 0.22619790
## bruises 0.245980947 -0.1962955617 -0.32635512
## odor -0.093987906 0.0836501881 0.20806618
## gill_attachment 0.054523698 0.1154993419 -0.05717123
## gill_spacing -0.178144526 -0.5959101225 -0.38078574
## gill_size -0.152695961 -0.0088511121 0.07014720
## gill_color 0.178614734 -0.0621710354 -0.16471444
## stalk_shape -0.212018999 -0.0995982471 0.25984857
## stalk_root -0.200204311 -0.5201612807 -0.53951762
## stalk_surface_above_ring -0.190655556 0.2468684140 0.35731950
## stalk_surface_below_ring -0.153964654 0.2355242987 0.38585496
## stalk_color_above_ring -0.003703535 0.2567943390 0.38212132
## stalk_color_below_ring 0.004809151 0.2524786093 0.38626375
## veil_type NA NA NA
## veil_color -0.036316870 -0.0769312205 -0.03314362
## ring_number -0.001725182 0.1321679488 -0.10334376
## ring_type -0.123749162 0.4435786385 0.52389770
## spore_print_color 1.000000000 0.1602302778 -0.07083490
## population 0.160230278 1.0000000000 0.51657302
## habitat -0.070834902 0.5165730219 1.00000000
#install.packages("caret")
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
#install.packages("brglm2")
library(brglm2)
## Warning: package 'brglm2' was built under R version 4.3.2
#Setting a seed to keep the outcomes consistent
set.seed(12345)
index <- createDataPartition(df_num$class, p = .70, list = FALSE)
train <- df_num[index,]
test <- df_num[-index,]
A logistic regression model was developed to predict whether a mushroom is edible or poisonous based on various features in the dataset. Logistic regression is a popular method for binary classification.
First, we specified the logistic regression model with the target variable (edible or poisonous) as a function of the predictor variables. These predictors were cap_shape, cap_color, bruises, and others; gill_size and veil_type were left out so that the model would converge. The training dataset, which consisted of 70% of the total data, was used to fit the model, giving it a substantial amount of data from which to learn the relationships between the features and the target variable.
Next, the model was trained using the glm() function in R, specifying the binomial family to denote a logistic regression. During this training phase, the model estimated a coefficient for each predictor, adjusting them to best fit the training data.
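As a small, self-contained illustration of the same workflow, the sketch below fits a binomial `glm()` to R's built-in `mtcars` data, which stands in for the mushroom training set purely as an assumption for this example. The names `toy_model`, `toy_prob`, and `toy_class` are hypothetical.

```r
# Built-in mtcars data stand in for the training set (assumption):
# am is a 0/1 outcome and wt is a numeric predictor
toy_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale; exponentiate for odds ratios
print(coef(toy_model))
print(exp(coef(toy_model)))

# Predicted probabilities, then a 0.5 cutoff to obtain class labels
toy_prob <- predict(toy_model, type = "response")
toy_class <- ifelse(toy_prob > 0.5, 1, 0)
print(table(predicted = toy_class, actual = mtcars$am))
```

The 0.5 cutoff mirrors the one used later in this report; a different threshold would trade sensitivity against specificity.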
#Function to detect binary columns
# is_bin <- function(train) {
# binary_col <- sapply(train, function(column) {
# all(column %in% c(0, 1))
# })
# return(binary_col)
# }
#
# binary_col <- is_bin(train)
#
# print(binary_col)
str(train)
## 'data.frame': 3951 obs. of 23 variables:
## $ class : num 0 1 1 1 0 1 1 0 1 1 ...
## $ cap_shape : num 3 3 1 3 3 1 1 3 3 6 ...
## $ cap_surface : num 3 3 3 3 4 3 3 4 1 1 ...
## $ cap_color : num 1 4 3 6 3 4 4 3 1 6 ...
## $ bruises : num 0 0 0 1 0 0 0 0 1 1 ...
## $ odor : num 8 1 2 7 8 1 1 8 7 7 ...
## $ gill_attachment : num 3 3 3 3 3 3 3 3 3 3 ...
## $ gill_spacing : num 1 1 1 2 1 1 1 1 2 1 ...
## $ gill_size : num 1 0 0 0 1 0 0 1 0 1 ...
## $ gill_color : num 12 12 1 12 7 6 3 12 1 12 ...
## $ stalk_shape : num 0 0 0 1 0 0 0 0 1 0 ...
## $ stalk_root : num 4 2 2 4 4 2 2 4 4 4 ...
## $ stalk_surface_above_ring: num 4 4 4 4 4 4 4 4 4 4 ...
## $ stalk_surface_below_ring: num 4 4 4 4 4 4 4 4 1 4 ...
## $ stalk_color_above_ring : num 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_below_ring : num 3 3 3 3 3 3 3 3 3 3 ...
## $ veil_type : num 1 1 1 1 1 1 1 1 1 1 ...
## $ veil_color : num 3 3 3 3 3 3 3 3 3 3 ...
## $ ring_number : num 1 1 1 1 1 1 1 1 1 1 ...
## $ ring_type : num 6 6 6 2 6 6 6 6 2 6 ...
## $ spore_print_color : num 12 1 1 1 12 12 1 1 12 1 ...
## $ population : num 4 3 3 1 5 4 4 5 1 6 ...
## $ habitat : num 5 1 3 1 1 3 1 5 1 5 ...
#The model properly converges when gill size and veil type are removed; but note that a warning is still produced: glm.fit: fitted probabilities numerically 0 or 1 occurred
train_model <- glm(class ~ cap_shape + cap_surface + cap_color + bruises + odor + gill_attachment +
gill_spacing + gill_color + stalk_shape + stalk_root +
stalk_surface_above_ring + stalk_surface_below_ring +
stalk_color_above_ring + stalk_color_below_ring + veil_color + ring_number +
ring_type + spore_print_color + population + habitat, data = train, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(train_model)
##
## Call:
## glm(formula = class ~ cap_shape + cap_surface + cap_color + bruises +
## odor + gill_attachment + gill_spacing + gill_color + stalk_shape +
## stalk_root + stalk_surface_above_ring + stalk_surface_below_ring +
## stalk_color_above_ring + stalk_color_below_ring + veil_color +
## ring_number + ring_type + spore_print_color + population +
## habitat, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 65.79078 1826.56649 0.036 0.971267
## cap_shape -0.28985 0.10687 -2.712 0.006682 **
## cap_surface -0.23494 0.07232 -3.248 0.001160 **
## cap_color -0.00187 0.03654 -0.051 0.959179
## bruises -1.48213 0.45000 -3.294 0.000989 ***
## odor -3.31584 0.20108 -16.490 < 2e-16 ***
## gill_attachment -4.15344 207.75505 -0.020 0.984050
## gill_spacing 2.80895 0.50471 5.565 2.61e-08 ***
## gill_color -0.14785 0.02248 -6.578 4.78e-11 ***
## stalk_shape 18.33905 1.06766 17.177 < 2e-16 ***
## stalk_root 6.86953 0.44158 15.557 < 2e-16 ***
## stalk_surface_above_ring 1.80216 0.15274 11.799 < 2e-16 ***
## stalk_surface_below_ring 1.75452 0.15372 11.413 < 2e-16 ***
## stalk_color_above_ring 0.31952 0.06249 5.113 3.16e-07 ***
## stalk_color_below_ring 0.24207 0.05931 4.082 4.47e-05 ***
## veil_color -20.16103 572.31316 -0.035 0.971899
## ring_number 19.83821 1.34213 14.781 < 2e-16 ***
## ring_type -6.23461 0.39974 -15.597 < 2e-16 ***
## spore_print_color -0.23299 0.02296 -10.146 < 2e-16 ***
## population 0.85217 0.16473 5.173 2.30e-07 ***
## habitat 0.82870 0.08477 9.776 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5260.6 on 3950 degrees of freedom
## Residual deviance: 1133.0 on 3930 degrees of freedom
## AIC: 1175
##
## Number of Fisher Scoring iterations: 14
We can now use the model to make predictions on the training data to evaluate its performance.
The results from the confusion matrix and cross-validation provided insights into how well the model was performing. They offered a detailed look at the model’s strengths and weaknesses in classifying the mushrooms as edible or poisonous, based on the training data. This evaluation phase was critical for understanding the efficacy of the logistic regression model before proceeding to test it on unseen data.
#This code will generate predictions
train_pred <- predict(train_model, type = "response", newdata = train)
train_pred_class <- ifelse(train_pred > 0.5, 1, 0)
#And now we can assess the accuracy of those predictions
confusionMatrix(factor(train_pred_class), factor(train$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1418 60
## 1 97 2376
##
## Accuracy : 0.9603
## 95% CI : (0.9537, 0.9661)
## No Information Rate : 0.6166
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9156
##
## Mcnemar's Test P-Value : 0.004064
##
## Sensitivity : 0.9360
## Specificity : 0.9754
## Pos Pred Value : 0.9594
## Neg Pred Value : 0.9608
## Prevalence : 0.3834
## Detection Rate : 0.3589
## Detection Prevalence : 0.3741
## Balanced Accuracy : 0.9557
##
## 'Positive' Class : 0
##
#As we can see, the model predicts edibility on the training data with 96% accuracy
#And now we can cross-validate the data
cv_results <- train(class ~ cap_shape + cap_surface + cap_color + bruises + odor + gill_attachment +
gill_spacing + gill_color + stalk_shape + stalk_root +
stalk_surface_above_ring + stalk_surface_below_ring +
stalk_color_above_ring + stalk_color_below_ring + veil_color + ring_number +
ring_type + spore_print_color + population + habitat, data = train, method = "glm",
trControl = trainControl(method = "cv", number = 10), family = 'binomial')
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(cv_results)
## Generalized Linear Model
##
## 3951 samples
## 20 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3556, 3555, 3556, 3556, 3556, 3556, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1873762 0.8516147 0.07693206
#As we can see, the RMSE is low and the R^2 is high, suggesting that the model is likely a good fit; note that caret reports regression metrics here because class is numeric rather than a 2-level factor
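Confusion Matrix: On the training set, the model classified 3794 of 3951 mushrooms correctly (accuracy 0.9603), with a sensitivity of 0.9360 and a specificity of 0.9754.
Model Performance: The 10-fold cross-validation reported an RMSE of 0.187 and an R-squared of 0.852.
According to these results, the model is doing a good job at classifying mushrooms as edible or poisonous.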
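The caret warning in the output above indicates that `train()` treated the numeric 0/1 outcome as a regression target, which is why RMSE and R-squared were reported instead of accuracy. As a hedged illustration of cross-validating classification accuracy directly, here is a minimal base R sketch of k-fold cross-validation on simulated data. The data, the coefficients used to simulate it, and the names `toy`, `fold_id`, and `fold_accuracy` are all assumptions for this example.

```r
set.seed(12345)

# Simulated binary outcome (hypothetical data, not the mushroom set)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(0.8 * toy$x1 - 0.5 * toy$x2))

k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold labels

# For each fold: fit on the other folds, predict the held-out fold,
# and record the share of correct class predictions
fold_accuracy <- vapply(seq_len(k), function(fold) {
  fit <- glm(y ~ x1 + x2, data = toy[fold_id != fold, ], family = binomial)
  prob <- predict(fit, newdata = toy[fold_id == fold, ], type = "response")
  mean(ifelse(prob > 0.5, 1, 0) == toy$y[fold_id == fold])
}, numeric(1))

print(round(fold_accuracy, 3))
print(mean(fold_accuracy))
```

With caret itself, the warning text suggests converting the outcome to a 2-level factor so that classification metrics are reported.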
To evaluate the model on unseen data, we used the test dataset, which comprised 30% of the entire dataset. It contained the same features as the training set but had not been used during model training. The model made predictions on this test data, estimating whether each mushroom was edible or poisonous based on the patterns learned from the training data.
#This code will apply the model to our test dataset
test_pred <- predict(train_model, type = "response", newdata = test)
test_pred_class <- ifelse(test_pred > 0.5, 1,0)
#This code will check for accuracy
confusionMatrix(factor(test_pred_class), factor(test$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 596 26
## 1 45 1026
##
## Accuracy : 0.9581
## 95% CI : (0.9474, 0.9671)
## No Information Rate : 0.6214
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9104
##
## Mcnemar's Test P-Value : 0.03266
##
## Sensitivity : 0.9298
## Specificity : 0.9753
## Pos Pred Value : 0.9582
## Neg Pred Value : 0.9580
## Prevalence : 0.3786
## Detection Rate : 0.3520
## Detection Prevalence : 0.3674
## Balanced Accuracy : 0.9525
##
## 'Positive' Class : 0
##
##As we can see, the results are rather consistent, with a 95.8% accuracy for predicting edibility
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
#The following code will calculate the Area under the Curve
rocCurve <- roc(test$class, test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(rocCurve)
auc(rocCurve)
## Area under the curve: 0.9867
#The plot and diagnostics demonstrate the high predictive accuracy of our model
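As a rough illustration of what the AUC measures, the sketch below computes it by hand as the share of (poisonous, edible) pairs in which the poisonous case receives the higher score. The `labels` and `scores` vectors are hypothetical values made up for this example, not output from our model, and the rank formula is a standard equivalent of the area under the ROC curve rather than the routine pROC uses internally.

```r
# Hypothetical labels (1 = case, 0 = control) and predicted probabilities
labels <- c(0, 0, 1, 1, 0, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.2, 0.9)

n_case    <- sum(labels == 1)
n_control <- sum(labels == 0)

# Rank-sum (Mann-Whitney) form of the AUC: the fraction of case/control
# pairs in which the case has the higher score
score_ranks <- rank(scores)
auc_manual  <- (sum(score_ranks[labels == 1]) - n_case * (n_case + 1) / 2) /
  (n_case * n_control)

auc_manual
```

Here 8 of the 9 case/control pairs are ordered correctly, so the value is 8/9. An AUC of 0.9867 means nearly every such pair in the test set is ordered correctly.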
Confusion Matrix:
Each pair below is (prediction, reference), and class 0 (edible) is the 'Positive' class reported by confusionMatrix.
True Positives (0,0): 596 - The model correctly predicted 596 instances where mushrooms were not poisonous (edible).
False Positives (0,1): 26 - The model incorrectly predicted 26 instances as non-poisonous (edible) when they were actually poisonous.
False Negatives (1,0): 45 - The model incorrectly predicted 45 instances as poisonous when they were actually non-poisonous (edible).
True Negatives (1,1): 1026 - The model correctly predicted 1026 instances where mushrooms were poisonous.
Accuracy: Our model was correct for 95.81% (0.9581) of the mushrooms in the test set. The 95% confidence interval for the accuracy is (0.9474, 0.9671).
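No Information Rate: 0.6214 - This means that if we were to guess the most common class for every mushroom, we’d be right about 62.14% of the time.
P-Value [Acc > NIR]: < 2e-16 - This p-value is extremely low, so the model’s accuracy is significantly higher than the No Information Rate.
Kappa: 0.9104 - Kappa measures agreement between the predictions and the true labels beyond what chance would produce. Since the value is close to one, we know that our model is effective.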
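Sensitivity and Specificity:
Sensitivity: 0.9298 - The model correctly identified 92.98% of the mushrooms that were actually non-poisonous (the positive class).
Specificity: 0.9753 - The model correctly identified 97.53% of the mushrooms that were actually poisonous.
Predictive Values:
Pos Pred Value: 0.9582 - When the model predicted non-poisonous, it was correct 95.82% of the time.
Neg Pred Value: 0.9580 - When the model predicted poisonous, it was correct 95.80% of the time.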
Prevalence: 0.3786 - 37.86% of the mushrooms in the test set were actually non-poisonous (edible).
Detection Rate: 0.3520 - 35.20% of all mushrooms in the test set were correctly identified as non-poisonous by the model.
Detection Prevalence: 0.3674 - 36.74% of all mushrooms were predicted as non-poisonous by the model.
Area Under the Curve (AUC): 0.9867 - This is very close to 1, meaning the model has good ability to differentiate between edible and poisonous mushrooms.
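The statistics listed above can be reproduced directly from the four cells of the confusion matrix. The following is a minimal sketch in base R, assuming the counts are typed in by hand from the printed matrix and that class 0 is the positive class, as confusionMatrix reports. The object names (`cm`, `tp`, and so on) are chosen for this example only.

```r
# Counts copied from the printed confusion matrix
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(596, 45, 26, 1026), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Class 0 is the positive class
tp <- cm["0", "0"]  # predicted 0, actually 0
fp <- cm["0", "1"]  # predicted 0, actually 1
fn <- cm["1", "0"]  # predicted 1, actually 0
tn <- cm["1", "1"]  # predicted 1, actually 1

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn)
)
round(metrics, 4)
```

Rounded to four decimals, these agree with the values printed by confusionMatrix (0.9581, 0.9298, 0.9753, 0.9582, 0.9580).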
In UCI’s Machine Learning Repository, there is a secondary mushroom dataset with more data, made up of new entries. I will import this new data and test the accuracy of our logistic regression model for identifying whether mushrooms are poisonous or edible.
Before we can use this secondary data, we must prepare it so that it is in the same format as the data used to train our logistic regression model. Since not all columns in our original model are in the secondary dataset, we either have to create those columns and estimate their values or train a new model without them. We chose to train a new model that will be used to evaluate the new dataset.
secondary_dat=read.csv(url("https://raw.githubusercontent.com/Mattr5541/DATA_607_Project-4/main/mushroom/secondary_data.csv"))
names(secondary_dat)
## [1] "class.cap.diameter.cap.shape.cap.surface.cap.color.does.bruise.or.bleed.gill.attachment.gill.spacing.gill.color.stem.height.stem.width.stem.root.stem.surface.stem.color.veil.type.veil.color.has.ring.ring.type.spore.print.color.habitat.season"
col_names_sec = c('class','cap_diameter', 'cap_shape', 'cap_surface', 'cap_color', 'bruises', 'gill_attachment', 'gill_spacing', 'gill_color', 'stem_height', 'stem_width', 'stem_root', 'stem_surface', 'stem_color', 'veil_type', 'veil_color', 'has_ring', 'ring_type', 'spore_print_color', 'habitat', 'season')
split_data = str_split_fixed(secondary_dat$class.cap.diameter.cap.shape.cap.surface.cap.color.does.bruise.or.bleed.gill.attachment.gill.spacing.gill.color.stem.height.stem.width.stem.root.stem.surface.stem.color.veil.type.veil.color.has.ring.ring.type.spore.print.color.habitat.season, pattern = ";", n = 21)
test_2_df=as.data.frame(split_data)
names(test_2_df)=col_names_sec
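The single long column name printed by names(secondary_dat) suggests the file is semicolon-delimited, which is why the split step above is needed. A possible shortcut, sketched here on a hypothetical three-column excerpt rather than the real file, is to pass `sep = ";"` to read.csv so the columns are separated on import. The sample rows are invented for illustration.

```r
# Hypothetical three-column excerpt standing in for secondary_data.csv
sample_text <- "class;cap-diameter;cap-shape\np;15.26;x\ne;6.5;f"

# sep = ";" splits the columns at read time, so no str_split_fixed() is needed
sample_dat <- read.csv(text = sample_text, sep = ";", stringsAsFactors = FALSE)

# read.csv turns the hyphens in the header into dots; swap them for underscores
names(sample_dat) <- gsub(".", "_", names(sample_dat), fixed = TRUE)

str(sample_dat)
```

Reading the file this way also keeps numeric columns such as cap_diameter numeric, whereas str_split_fixed returns every column as character.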
test_2_df$class = as.numeric(factor(test_2_df$class, levels=c('p','e')))
test_2_df$cap_shape = as.numeric(factor(test_2_df$cap_shape, levels=c('b', 'c', 'x', 'f', 'k', 's')))
test_2_df$cap_color = as.numeric(factor(test_2_df$cap_color, levels=c("n","o","w","y","b","g","p","e","r","u","c","k","l")))
test_2_df$bruises = as.numeric(factor(test_2_df$bruises, levels=c('t', 'f')))
test_2_df$gill_attachment = as.numeric(factor(test_2_df$gill_attachment, levels=c('a','d','e', 'f')))
test_2_df$gill_color = as.numeric(factor(test_2_df$gill_color, levels=c("n","o","w","y","b","g","p","e","r","u","c","k","l")))
test_2_df$veil_type = as.numeric(factor(test_2_df$veil_type, levels=c('p', 'u')))
test_2_df$veil_color = as.numeric(factor(test_2_df$veil_color, levels=c("n","o","w","y","b","g","p","e","r","u","c","k","l")))
test_2_df$habitat = as.numeric(factor(test_2_df$habitat, levels=c('g', 'l', 'm', 'p', 'u', 'w', 'd')))
test_2 = test_2_df %>%
select(c(class,`cap_shape`,`cap_color`,bruises,`gill_attachment`, `gill_color`, `veil_type`,`veil_color`, `habitat`))
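One thing worth checking with this style of encoding: any value that is not listed in `levels` silently becomes NA, and rows with NA predictors drop out of the later predictions. The sketch below shows the behavior on a hypothetical vector; the value "o" is used only as an example of a code missing from the level set and is not a claim about what the secondary data contains.

```r
# Hypothetical raw values; "o" is deliberately absent from the level set
cap_shape_raw <- c("x", "f", "o", "b")
levels_used   <- c("b", "c", "x", "f", "k", "s")

encoded <- as.numeric(factor(cap_shape_raw, levels = levels_used))
encoded

# Count how many values were lost to NA during encoding
n_missing <- sum(is.na(encoded))
n_missing
```

Running a count like this on each encoded column of test_2_df would show how many rows survive into the evaluation.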
train_model_2 = glm(class ~ cap_shape + cap_color + bruises + gill_attachment + gill_color +veil_type+ veil_color +
habitat, data = train, family = binomial)
summary(train_model_2)
##
## Call:
## glm(formula = class ~ cap_shape + cap_color + bruises + gill_attachment +
## gill_color + veil_type + veil_color + habitat, family = binomial,
## data = train)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.454e+01 1.144e+03 0.021 0.9829
## cap_shape 2.748e-02 5.095e-02 0.539 0.5896
## cap_color -8.603e-02 1.686e-02 -5.102 3.37e-07 ***
## bruises -1.699e+00 8.255e-02 -20.577 < 2e-16 ***
## gill_attachment 7.090e+00 1.258e+02 0.056 0.9551
## gill_color -7.867e-02 9.467e-03 -8.309 < 2e-16 ***
## veil_type NA NA NA NA
## veil_color -1.452e+01 3.601e+02 -0.040 0.9678
## habitat -2.901e-02 1.561e-02 -1.859 0.0631 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5260.6 on 3950 degrees of freedom
## Residual deviance: 4433.2 on 3943 degrees of freedom
## AIC: 4449.2
##
## Number of Fisher Scoring iterations: 13
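The NA row for veil_type, together with the note "1 not defined because of singularities", is what glm reports when a predictor carries no independent information. One common cause is a column that takes a single value in the training data, though that is an assumption here and not something the output confirms. A quick way to check, sketched on a hypothetical miniature data frame:

```r
# Hypothetical miniature of the training data; veil_type has a single value
toy_train <- data.frame(
  bruises   = c(1, 2, 1, 2),
  veil_type = c(1, 1, 1, 1),
  habitat   = c(3, 5, 1, 7)
)

# Count distinct values per column and flag columns with fewer than two
n_distinct_vals <- sapply(toy_train, function(col) length(unique(col)))
constant_cols   <- names(n_distinct_vals)[n_distinct_vals < 2]
constant_cols
```

A column flagged this way can be dropped from the formula, which would also avoid the rank-deficient warning from predict.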
#generate predictions
train_pred_2 = predict(train_model_2, type = "response", newdata = train)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
train_pred_class_2 = ifelse(train_pred_2 > 0.5, 1, 0)
#assess the accuracy of those predictions
confusionMatrix(factor(train_pred_class_2), factor(train$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1049 464
## 1 466 1972
##
## Accuracy : 0.7646
## 95% CI : (0.7511, 0.7778)
## No Information Rate : 0.6166
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5021
##
## Mcnemar's Test P-Value : 0.9738
##
## Sensitivity : 0.6924
## Specificity : 0.8095
## Pos Pred Value : 0.6933
## Neg Pred Value : 0.8089
## Prevalence : 0.3834
## Detection Rate : 0.2655
## Detection Prevalence : 0.3829
## Balanced Accuracy : 0.7510
##
## 'Positive' Class : 0
##
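#Note: this formula includes odor and omits veil_type, so it does not cross-validate exactly the same model as train_model_2
cv_results_2 = train(class ~ cap_shape + cap_color + bruises + odor + gill_attachment + gill_color + veil_color + habitat, data = train, method = "glm",
trControl = trainControl(method = "cv", number = 10), family = 'binomial')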
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
print(cv_results_2)
## Generalized Linear Model
##
## 3951 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3556, 3556, 3556, 3556, 3556, 3556, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.4179356 0.2624879 0.3584861
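The warning above explains why RMSE and R^2 are reported: the outcome is numeric, so caret treats the task as regression. For a classifier, cross-validated accuracy is usually easier to interpret. The following is a minimal base-R sketch of 10-fold cross-validated accuracy, written by hand rather than with caret and run on simulated stand-in data. The data frame `toy`, the predictor `x`, and the seed are all assumptions made for this example.

```r
set.seed(607)

# Simulated stand-in data: one numeric predictor and a 0/1 outcome
n   <- 200
toy <- data.frame(x = rnorm(n))
toy$class <- rbinom(n, 1, plogis(1.5 * toy$x))

# Randomly assign each row to one of k folds
k    <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, classify the held-out fold at 0.5, record its accuracy
fold_acc <- sapply(seq_len(k), function(i) {
  fit  <- glm(class ~ x, data = toy[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = toy[fold == i, ], type = "response")
  mean(ifelse(prob > 0.5, 1, 0) == toy$class[fold == i])
})

cv_accuracy <- mean(fold_acc)
cv_accuracy
```

With caret itself, converting class to a two-level factor before calling train, as the warning suggests, should make it report Accuracy and Kappa instead.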
The second model was less accurate than the first one. This was expected, since the secondary dataset did not have the complete set of columns in our original dataset. Let’s evaluate this second model on our secondary test data.
#generate predictions on the secondary dataset (the confusion matrix is computed below, once the predicted classes exist)
test_pred_2 = predict(train_model_2, type = "response", newdata = test_2)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
test_pred_class_2 = ifelse(test_pred_2 > 0.5, 2, 1)
test_pred_class_2 = factor(test_pred_class_2, levels = c(1, 2))
test_2$class = factor(test_2$class, levels = c(1, 2))
confusionMatrix(test_pred_class_2, test_2$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 2400 945
## 2 95 0
##
## Accuracy : 0.6977
## 95% CI : (0.682, 0.713)
## No Information Rate : 0.7253
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : -0.0528
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9619
## Specificity : 0.0000
## Pos Pred Value : 0.7175
## Neg Pred Value : 0.0000
## Prevalence : 0.7253
## Detection Rate : 0.6977
## Detection Prevalence : 0.9724
## Balanced Accuracy : 0.4810
##
## 'Positive' Class : 1
##
rocCurve_2 = roc(test_2$class, test_pred_2)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot(rocCurve_2)
auc(rocCurve_2)
## Area under the curve: 0.6834
True Positives (2400): The model correctly predicted 2400 instances as ‘poisonous’ mushrooms.
True Negatives (0): The model did not correctly predict any instances of ‘edible’ mushrooms.
False Positives (945): The model incorrectly classified 945 instances as ‘poisonous’ , which were actually ‘edible’.
False Negatives (95): The model incorrectly classified 95 instances as ‘edible’, which were actually ‘poisonous’.
Accuracy (69.77%): The model correctly predicted about 70% of the instances. However, this is below the No Information Rate (72.53%), meaning the model does worse than always guessing the most common class in this dataset.
Kappa (-0.0528): A negative kappa means the agreement between predictions and true labels is slightly worse than chance, so the model is not suitable for this dataset.
Sensitivity (96.19%): The model is highly sensitive in predicting poisonous mushrooms but fails significantly in predicting edible mushrooms, as indicated by a specificity of 0%.
Specificity (0%): The model fails to correctly identify any true negatives (edible mushrooms).
Positive Predictive Value (71.75%): When the model predicts poisonous mushrooms, it is correct about 72% of the time.
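To make the Kappa and No Information Rate figures concrete, the sketch below recomputes both from the four counts in the secondary confusion matrix. It is a by-hand illustration in base R with the counts typed in from the printed output; the object names are chosen for this example only.

```r
# Counts copied from the secondary confusion matrix
# (rows = prediction, columns = reference; matrix() fills column by column)
cm2 <- matrix(c(2400, 95, 945, 0), nrow = 2,
              dimnames = list(Prediction = c("1", "2"),
                              Reference  = c("1", "2")))
n_obs <- sum(cm2)

# Observed agreement is the accuracy
observed <- sum(diag(cm2)) / n_obs

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm2) * colSums(cm2)) / n_obs^2

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible
cohen_kappa <- (observed - expected) / (1 - expected)

# No Information Rate: share of the most common reference class
nir <- max(colSums(cm2)) / n_obs

round(c(accuracy = observed, kappa = cohen_kappa, nir = nir), 4)
```

Because the model predicts class 1 for almost every row, the agreement expected by chance (about 0.713) is slightly higher than the observed accuracy (about 0.698), which is what pushes kappa below zero.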
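Our secondary model was much less accurate at predicting whether mushrooms in the secondary dataset are poisonous: it never correctly identified an edible mushroom. In the future, I would consider using more columns and filling in the missing values in the new dataset (either by inserting the mean or by looking up each data point mushroom by mushroom). It would also be worth confirming that the numeric codes mean the same thing in both datasets, since the categories were encoded separately for each. Ideally, the more data a model has to train on, the better it becomes.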
In this project, we implemented and evaluated a logistic regression model to classify mushrooms as edible or poisonous.
The logistic regression model, trained on the training set, demonstrated effective learning of the relationships between various mushroom features and their edibility status. The model’s performance was initially evaluated on the training data using a confusion matrix and cross-validation, showing promising results.
Our logistic regression model had good results, with high accuracy, sensitivity, and specificity. These outcomes suggest that the model effectively distinguishes between edible and poisonous mushrooms, aligning well with known characteristics of these fungi. Notably, features like odor, gill color, and cap shape have emerged as significant predictors.
There are some results that we should explore in the future. For instance, the model’s occasional confusion between edible and poisonous mushrooms, as evidenced by the false positives and false negatives, hints at subtle morphological or chemical similarities that aren’t captured by the dataset. In the future, we could explore more complex models, like neural networks, to capture more intricate patterns. Integrating data from different geographic regions and environments could enhance the model’s applicability to diverse real-world scenarios.