Analysis of Effects of Bacteria on Humans

Data Programming Project

Author

Martin Stojanovski

Published

June 6, 2026

Introduction:

Studying and classifying bacteria into categories based on their species, natural habitats, and potential impacts on human health is essential for fields such as microbiology, public health, and environmental science. By studying bacteria based on the categories mentioned above, we can infer how these microorganisms have an effect on humans and their health, uncover patterns and relationships among bacterial families, and illustrate the diversity of bacterial life and its relevance to ecosystems.

This project aims to study bacteria based on: 1) The scientific name of the bacterial species. 2) The taxonomic family they belong to. 3) Their natural habitats or where they are typically found. 4) Whether they are harmful to humans or not.

Problem Description:

We will follow a list of questions to guide our analysis:

  1. Which bacterial families contain the most species, and how do harmful and safe species compare within each family?
  2. What proportion of bacterial species in the dataset are harmful to humans?
  3. Which bacterial families have the highest proportion of harmful species?

The goal is to provide more information on the types of bacteria humans should be more wary of, whether they are conducting research on them, getting diagnosed with them, or working with them daily in medical centers.

The following libraries will be used:

library(tidyverse)
library(knitr)
library(kableExtra)
library(scales)

The data set that I will be using is:

data <- read_csv("C:/Users/Martin/Downloads/archive/bacteria_list_200.csv")
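The absolute path above only resolves on the author's machine. As a minimal sketch, assuming the CSV is kept in a `data` folder next to the report (a hypothetical layout, not stated in the original), a relative path plus an existence check makes the loading step portable. The helper name `read_bacteria` is made up for illustration, and it uses base R's `read.csv()` rather than `read_csv()` so the sketch runs without extra packages.

```r
# Hypothetical project layout: the CSV sits in a "data" folder next to the report
csv_path <- file.path("data", "bacteria_list_200.csv")

# Illustrative helper: stop early with a readable message if the file is missing
read_bacteria <- function(path) {
  if (!file.exists(path)) {
    stop("Dataset not found at: ", path)
  }
  # check.names = FALSE keeps headers such as "Where Found" unchanged
  utils::read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
}

# Self-contained demonstration on a temporary file built from two rows of Table 1
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Family,Where Found,Harmful to Humans",
  "Escherichia coli,Enterobacteriaceae,Intestinal tract,Yes",
  "Bacillus subtilis,Bacillaceae,Soil,No"
), demo_path)

demo <- read_bacteria(demo_path)
names(demo)
```

Because two of the column names contain spaces, they have to be wrapped in backticks wherever they are used in code, as in `Where Found` and `Harmful to Humans` throughout the rest of the report.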

Presentation of the Data:

In this section, I will show the head of my data frame, list the column types, and explain what each column contains.

data %>%
  head() %>%
  kable(caption = "Table 1: First 6 rows of the Bacteria Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13)

Table 1: First 6 rows of the Bacteria Dataset

| Name | Family | Where Found | Harmful to Humans |
| --- | --- | --- | --- |
| Escherichia coli | Enterobacteriaceae | Intestinal tract | Yes |
| Staphylococcus aureus | Staphylococcaceae | Skin, nasal passages | Yes |
| Lactobacillus acidophilus | Lactobacillaceae | Human mouth & intestine | No |
| Bacillus subtilis | Bacillaceae | Soil | No |
| Clostridium botulinum | Clostridiaceae | Soil, improperly canned foods | Yes |
| Streptococcus pneumoniae | Streptococcaceae | Throat, nasal passages | Yes |

tibble(
  Column        = names(data),
  Type          = sapply(data, class),
  `Non-NA`      = sapply(data, function(x) sum(!is.na(x))),
  `NA Count`    = sapply(data, function(x) sum(is.na(x))),
  `Example`     = sapply(data, function(x) as.character(x[1]))
) %>%
  kable(caption = "Table 2: Column types and completeness") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13)

Table 2: Column types and completeness

| Column | Type | Non-NA | NA Count | Example |
| --- | --- | --- | --- | --- |
| Name | character | 199 | 0 | Escherichia coli |
| Family | character | 199 | 0 | Enterobacteriaceae |
| Where Found | character | 199 | 0 | Intestinal tract |
| Harmful to Humans | character | 199 | 0 | Yes |

Column 1 (Name): The scientific name of each bacterial species.

Column 2 (Family): The taxonomic family the species belongs to.

Column 3 (Where Found): The natural habitats where the species is typically found.

Column 4 (Harmful to Humans): Whether the species is harmful to humans (Yes or No).
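Every later summary compares `Harmful to Humans` against the exact strings "Yes" and "No", so stray spellings or whitespace would silently drop rows from the counts. The following is a small base R sketch of a sanity check, shown on the six values from Table 1 rather than the full file; the variable names are illustrative only.

```r
# The "Harmful to Humans" values of the six rows shown in Table 1
harmful <- c("Yes", "Yes", "No", "No", "Yes", "Yes")

# Tabulate the values, including NA if any are present
counts <- table(harmful, useNA = "ifany")

# Any label other than "Yes"/"No" (after trimming whitespace) is unexpected
unexpected <- setdiff(unique(trimws(harmful)), c("Yes", "No"))

counts
length(unexpected) == 0  # TRUE when the column is clean
```

On the real data the same check would be run on data[["Harmful to Humans"]].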

Transformation of the Data

Species Count per Family + Safety Percentage + Total Number of Safe and Unsafe Bacteria

family_summary <- data %>%
  filter(!is.na(Family)) %>%
  group_by(Family) %>%
  summarise(
    Total_Species  = n(),
    Harmful_Count  = sum(`Harmful to Humans` == "Yes", na.rm = TRUE),
    Safe_Count     = sum(`Harmful to Humans` == "No",  na.rm = TRUE)
  ) %>%
  mutate(
    Harmful_Percent = round(Harmful_Count / Total_Species * 100, 1)
  ) %>%
  arrange(desc(Total_Species))

family_summary %>%
  head(15) %>%
  kable(
    caption = "Table 3: Top 15 bacterial families by species count, with harmful vs. safe breakdown.",
    col.names = c("Family", "Total Species", "Harmful", "Safe", "% Harmful")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 13
  ) %>%
  column_spec(5, bold = TRUE,
              color = ifelse(
                family_summary %>% head(15) %>% pull(Harmful_Percent) >= 50,
                "firebrick", "darkgreen"
              ))

Table 3: Top 15 bacterial families by species count, with harmful vs. safe breakdown.

| Family | Total Species | Harmful | Safe | % Harmful |
| --- | --- | --- | --- | --- |
| Enterobacteriaceae | 21 | 11 | 10 | 52.4 |
| Bacillaceae | 7 | 0 | 7 | 0.0 |
| Flavobacteriaceae | 6 | 5 | 1 | 83.3 |
| Streptococcaceae | 6 | 4 | 2 | 66.7 |
| Bifidobacteriaceae | 5 | 2 | 3 | 40.0 |
| Alcaligenaceae | 4 | 3 | 1 | 75.0 |
| Clostridiaceae | 4 | 3 | 1 | 75.0 |
| Comamonadaceae | 4 | 0 | 4 | 0.0 |
| Corynebacteriaceae | 4 | 4 | 0 | 100.0 |
| Micrococcaceae | 4 | 1 | 3 | 25.0 |
| Neisseriaceae | 4 | 4 | 0 | 100.0 |
| Pasteurellaceae | 4 | 4 | 0 | 100.0 |
| Xanthomonadaceae | 4 | 2 | 2 | 50.0 |
| Acetobacteraceae | 3 | 0 | 3 | 0.0 |
| Bacteroidaceae | 3 | 1 | 2 | 33.3 |
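Percentages built on only a handful of species are uncertain. As a rough illustration that is not part of the original analysis, base R's `binom.test()` can attach an exact 95% confidence interval to a proportion. The counts come from Table 3. Treating the listed species as a random sample from each family is an assumption made only for this sketch.

```r
# Exact (Clopper-Pearson) 95% interval for a family with 4 of 4 harmful species,
# e.g. Corynebacteriaceae in Table 3
ci_small <- binom.test(x = 4, n = 4)$conf.int

# Same interval for Enterobacteriaceae: 11 harmful out of 21 species
ci_large <- binom.test(x = 11, n = 21)$conf.int

# Express both intervals as percentages
round(100 * ci_small, 1)
round(100 * ci_large, 1)
```

The interval for the four-species family is much wider than the single figure of 100% suggests, which is worth keeping in mind when comparing small families.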

Habitat Danger Profile

habitat_summary <- data %>%
  filter(!is.na(`Where Found`)) %>%
  mutate(
    # Simplify habitats by extracting the first location mentioned
    Habitat_Simple = str_trim(str_split(`Where Found`, ",", simplify = TRUE)[, 1])
  ) %>%
  group_by(Habitat_Simple, `Harmful to Humans`) %>%
  summarise(Count = n(), .groups = "drop") %>%
  pivot_wider(
    names_from  = `Harmful to Humans`,
    values_from = Count,
    values_fill = 0
  ) %>%
  mutate(
    Total        = Yes + No,
    Risk_Percent = round(Yes / Total * 100, 1)
  ) %>%
  arrange(desc(Risk_Percent)) %>%
  filter(Total >= 3)
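The habitat simplification above keeps only the text before the first comma, so any secondary location is dropped before counting. A minimal base R sketch of the same idea, using `sub()` and `trimws()` in place of the stringr calls, shows the effect on values taken from Table 1:

```r
# "Where Found" values from Table 1
where_found <- c(
  "Intestinal tract",
  "Skin, nasal passages",
  "Soil, improperly canned foods",
  "Human mouth & intestine"
)

# Remove everything from the first comma onward, then trim surrounding spaces
first_location <- trimws(sub(",.*$", "", where_found))
first_location
```

Entries without a comma are left unchanged, so "Human mouth & intestine" stays a single habitat, while "nasal passages" and "improperly canned foods" no longer contribute to any count.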

habitat_summary %>%
  kable(
    caption = "Table 4: Habitat danger profile (habitats with 3+ species, sorted by % harmful).",
    # pivot_wider() orders new columns by first appearance, which puts `No`
    # before `Yes` here, so the labels must follow that order
    col.names = c("Habitat", "Safe (No)", "Harmful (Yes)", "Total", "% Harmful")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 13
  )

Table 4: Habitat danger profile (habitats with 3+ species, sorted by % harmful).

| Habitat | Safe (No) | Harmful (Yes) | Total | % Harmful |
| --- | --- | --- | --- | --- |
| Genitourinary tract | 0 | 3 | 3 | 100.0 |
| Human oral cavity | 0 | 3 | 3 | 100.0 |
| Human urogenital tract | 0 | 3 | 3 | 100.0 |
| Infected animals | 0 | 6 | 6 | 100.0 |
| Respiratory tract | 0 | 5 | 5 | 100.0 |
| Throat | 0 | 3 | 3 | 100.0 |
| Human skin | 1 | 4 | 5 | 80.0 |
| Human mouth | 1 | 2 | 3 | 66.7 |
| Water | 3 | 5 | 8 | 62.5 |
| Plants | 3 | 4 | 7 | 57.1 |
| Intestinal tract | 6 | 6 | 12 | 50.0 |
| Freshwater | 4 | 3 | 7 | 42.9 |
| Soil | 29 | 15 | 44 | 34.1 |
| Mouth | 2 | 1 | 3 | 33.3 |
| Skin | 3 | 1 | 4 | 25.0 |
| Marine environments | 5 | 1 | 6 | 16.7 |
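Table 4 lists several labels that appear to describe the same place, for example "Human mouth", "Mouth", and "Human oral cavity", or "Skin" and "Human skin". One possible refinement, sketched below, is to merge such labels with a lookup table before counting. The mapping `habitat_map` and the helper `normalise_habitat` are hypothetical, and which labels really belong together is a judgement call that is not made in the original analysis.

```r
# Hypothetical lookup table: deciding which labels to merge is a judgement call
habitat_map <- c(
  "Human mouth"            = "Mouth",
  "Human oral cavity"      = "Mouth",
  "Human skin"             = "Skin",
  "Human urogenital tract" = "Genitourinary tract"
)

# Replace labels found in the lookup table and leave all others untouched
normalise_habitat <- function(x) {
  merged <- unname(habitat_map[x])
  ifelse(is.na(merged), x, merged)
}

habitats <- c("Human mouth", "Mouth", "Human oral cavity", "Soil", "Human skin")
table(normalise_habitat(habitats))
```

Applied to the Habitat_Simple column before the group_by() step, a mapping like this would pool the small groups and could change which habitats pass the 3-species threshold.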

Overall Harmful vs. Non-harmful Summary

overall_summary <- data %>%
  filter(!is.na(`Harmful to Humans`)) %>%
  group_by(`Harmful to Humans`) %>%
  summarise(
    Count         = n(),
    Top_Family    = names(sort(table(Family), decreasing = TRUE))[1],
    Top_Habitat   = names(sort(table(`Where Found`), decreasing = TRUE))[1]
  ) %>%
  mutate(Percent = round(Count / sum(Count) * 100, 1))

overall_summary %>%
  kable(
    caption = "Table 5: Overall split of harmful vs. non-harmful bacteria, with most common family and habitat for each group.",
    col.names = c("Harmful to Humans", "Count", "Most Common Family",
                  "Most Common Habitat", "% of Dataset")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 13
  )

Table 5: Overall split of harmful vs. non-harmful bacteria, with most common family and habitat for each group.

| Harmful to Humans | Count | Most Common Family | Most Common Habitat | % of Dataset |
| --- | --- | --- | --- | --- |
| No | 102 | Enterobacteriaceae | Soil | 51.3 |
| Yes | 97 | Enterobacteriaceae | Intestinal tract | 48.7 |
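The percentages in Table 5 can be re-derived from the two counts, which also answers the second guiding question directly. A short worked check in base R, using the counts from the table:

```r
# Counts of non-harmful and harmful species from Table 5
counts <- c(No = 102, Yes = 97)

# Total number of species and each group's share, rounded to one decimal
total   <- sum(counts)
percent <- round(counts / total * 100, 1)

total
percent
```

The total is the 199 species reported in Table 2, and the harmful share works out to 97 / 199, or about 48.7%.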

Exploratory Data Analysis

Stacked Bar Chart of Families with 3 or More Species, by Species Count:

data %>%
  filter(!is.na(Family), !is.na(`Harmful to Humans`)) %>%
  count(Family, `Harmful to Humans`) %>%
  group_by(Family) %>%
  mutate(Total = sum(n)) %>%
  ungroup() %>%
  filter(Total >= 3) %>%
  mutate(Family = fct_reorder(Family, Total)) %>%
  ggplot(aes(x = Family, y = n, fill = `Harmful to Humans`)) +
  geom_col(position = "stack", width = 0.5) +
  coord_flip() +
  scale_fill_manual(
    values = c("Yes" = "#C0392B", "No" = "#27AE60")
  ) +
  labs(
    title    = "Bacterial Families by Species Count",
    subtitle = "Families with 3 or more species — red = harmful, green = safe",
    x        = "Family",
    y        = "Number of Species",
    fill     = "Harmful to Humans"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title      = element_text(face = "bold", size = 15),
    plot.subtitle   = element_text(colour = "grey40"),
    legend.position = "bottom" 
  )

Figure 1: Bacterial families with 3 or more species, ranked by number of species and coloured by harmfulness.

The chart shows every family with at least 3 species, ranked by how many species it contains. Each bar is split into red (harmful) and green (safe) segments, so you can see at a glance both how large a family is and how dangerous it tends to be.
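The ranking in the chart comes from reordering the factor levels of Family by the total species count before plotting. As a small sketch of that idea, base R's `reorder()` is used here in place of forcats' `fct_reorder()`, which is a swap made only to keep the example dependency-free. The three counts are taken from Table 3.

```r
# Species counts for three families, taken from Table 3
families <- factor(c("Enterobacteriaceae", "Bacillaceae", "Flavobacteriaceae"))
totals   <- c(21, 7, 6)

# reorder() sorts the factor levels by the supplied values, in ascending order
ordered_family <- reorder(families, totals)
levels(ordered_family)
```

The levels run from the smallest family to the largest. Since coord_flip() draws the first level at the bottom of the axis, the largest family ends up at the top of the chart.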

Pie Chart of Overall Harmful vs. Non-Harmful Bacteria

#| fig-cap: "Figure 2: Proportion of harmful vs. non-harmful bacteria in the dataset."

data %>%
  filter(!is.na(`Harmful to Humans`)) %>%
  count(`Harmful to Humans`) %>%
  mutate(
    Percent = round(n / sum(n) * 100, 1),
    Label   = paste0(`Harmful to Humans`, "\n", Percent, "%")
  ) %>%
  ggplot(aes(x = 2, y = n, fill = `Harmful to Humans`)) +
  geom_col(width = 1, colour = "white", linewidth = 1) +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +
  scale_fill_manual(values = c("Yes" = "#C0392B", "No" = "#27AE60")) +
  geom_text(aes(label = Label),
            position = position_stack(vjust = 0.5),
            colour = "white", fontface = "bold", size = 5) +
  labs(
    title    = "Proportion of Harmful vs. Non-Harmful Bacteria",
    subtitle = "Based on 199 species in the dataset",
    fill     = "Harmful to Humans"
  ) +
  theme_void(base_size = 13) +
  theme(
    plot.title      = element_text(face = "bold", size = 15, hjust = 0.5),
    plot.subtitle   = element_text(colour = "grey40", hjust = 0.5),
    legend.position = "none"
  )

Figure 2: Proportion of harmful vs. non-harmful bacteria in the dataset.

The chart shows a simple split of the entire dataset into harmful and non-harmful bacteria, displayed as percentages.

Lollipop chart of the Family Composition by Harmfulness of the Bacteria

data %>%
  filter(!is.na(Family), !is.na(`Harmful to Humans`)) %>%
  group_by(Family) %>%
  summarise(
    Total   = n(),
    Harmful = sum(`Harmful to Humans` == "Yes")
  ) %>%
  filter(Total >= 3) %>%
  mutate(
    Pct    = Harmful / Total * 100,
    Family = fct_reorder(Family, Pct)
  ) %>%
  ggplot(aes(x = Family, y = Pct)) +
  geom_segment(aes(xend = Family, y = 0, yend = Pct),
               colour = "grey60", linewidth = 0.8) +
  geom_point(aes(colour = Pct), size = 4) +
  scale_colour_gradient(low = "#27AE60", high = "#C0392B") +
  coord_flip() +
  labs(
    title    = "Percentage of Harmful Species per Family",
    subtitle = "Families with 3+ species — green = mostly safe, red = mostly harmful",
    x        = "Family",
    y        = "% Harmful Species",
    colour   = "% Harmful"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title      = element_text(face = "bold", size = 15),
    plot.subtitle   = element_text(colour = "grey40"),
    legend.position = "right"
  )

Figure 3: Lollipop chart of harmful species percentage per family.

The chart ranks families with 3 or more species by the proportion of their species that are harmful, with dots colored on a green-to-red gradient.

Conclusion

In this project, I explored how bacterial species are distributed across families and habitats, and what that means for human health. Of the 199 species analysed, 97 (48.7%) are classified as harmful to humans, and the share varies considerably with taxonomic family and habitat.

Some families stand out as disproportionately dangerous: every listed species of Corynebacteriaceae, Neisseriaceae, and Pasteurellaceae is harmful, while Bacillaceae, Comamonadaceae, and Acetobacteraceae contain no harmful species in this dataset. The habitat danger profile showed that several environments in close contact with the human body, such as the respiratory tract, throat, and urogenital tract, consist entirely of harmful species, whereas soil (34.1% harmful) and marine environments (16.7% harmful) are mostly home to harmless ones. These patterns have clear implications for hygiene, clinical, and environmental risk assessment.

Together, the stacked bar chart and the lollipop chart illustrated that harmfulness is not evenly distributed across bacterial families, reinforcing the importance of family-level classification in microbiology and public health screening. Knowing which families carry the highest risk can help researchers and clinicians prioritize monitoring and treatment efforts. Overall, a bacterium's potential impact on human health is closely tied to where it lives and what family it belongs to, and a better understanding of these patterns can support decision-making in medical, laboratory, and environmental settings alike.

References

  1. Kanchana1990. (2024, March 27). Bacteria dataset. Kaggle. https://www.kaggle.com/datasets/kanchana1990/bacteria-dataset