library(tidyverse)
library(knitr)
library(kableExtra)
library(scales)
Analysis of Effects of Bacteria on Humans
Data Programming Project
Introduction:
Studying and classifying bacteria by species, natural habitat, and potential impact on human health is essential for fields such as microbiology, public health, and environmental science. Classifying bacteria along these lines lets us infer how these microorganisms affect human health, uncover patterns and relationships among bacterial families, and illustrate the diversity of bacterial life and its relevance to ecosystems.
This project aims to study bacteria based on: 1) The scientific name of the bacterial species. 2) The taxonomic family they belong to. 3) Their natural habitats or where they are typically found. 4) Whether they are harmful to humans or not.
Problem Description:
We will follow a list of questions to guide our analysis:
- Which bacterial families contain the most species, and how do harmful and safe species compare within each family?
- What proportion of bacterial species in the dataset are harmful to humans?
- Which bacterial families have the highest proportion of harmful species?
The goal is to provide more information on the types of bacteria humans should be more wary of, whether they are conducting research on them, getting diagnosed with them, or working with them daily in medical centers.
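The following libraries will be used: tidyverse, knitr, kableExtra, and scales. They are loaded in the setup code at the top of this document.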
The data set that I will be using is the Kaggle Bacteria dataset (Kanchana1990, 2024), read from a local CSV file:
data <- read_csv("C:/Users/Martin/Downloads/archive/bacteria_list_200.csv")
Presentation of the Data:
In this section, I will show the first rows of my data frame, list the column types, and explain what each column contains.
data %>%
  head() %>%
  kable(caption = "Table 1: First 6 rows of the Bacteria Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13)

| Name | Family | Where Found | Harmful to Humans |
|---|---|---|---|
| Escherichia coli | Enterobacteriaceae | Intestinal tract | Yes |
| Staphylococcus aureus | Staphylococcaceae | Skin, nasal passages | Yes |
| Lactobacillus acidophilus | Lactobacillaceae | Human mouth & intestine | No |
| Bacillus subtilis | Bacillaceae | Soil | No |
| Clostridium botulinum | Clostridiaceae | Soil, improperly canned foods | Yes |
| Streptococcus pneumoniae | Streptococcaceae | Throat, nasal passages | Yes |
tibble(
  Column = names(data),
  Type = sapply(data, class),
  `Non-NA` = sapply(data, function(x) sum(!is.na(x))),
  `NA Count` = sapply(data, function(x) sum(is.na(x))),
  `Example` = sapply(data, function(x) as.character(x[1]))
) %>%
  kable(caption = "Table 2: Column types and completeness") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13)

| Column | Type | Non-NA | NA Count | Example |
|---|---|---|---|---|
| Name | character | 199 | 0 | Escherichia coli |
| Family | character | 199 | 0 | Enterobacteriaceae |
| Where Found | character | 199 | 0 | Intestinal tract |
| Harmful to Humans | character | 199 | 0 | Yes |
Column 1 (Name): The scientific name of each bacterial species.
Column 2 (Family): The taxonomic family the species belongs to.
Column 3 (Where Found): The habitat or habitats where the species is typically found.
Column 4 (Harmful to Humans): Whether the species is harmful to humans (Yes or No).
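Because `Harmful to Humans` is stored as character (see Table 2), it can help to confirm that it holds only the two expected labels before summarising it. The following is a minimal sketch on a toy tibble whose rows are copied from Table 1. The names `toy` and `harmful_flag` are hypothetical and are not part of the original analysis.

```r
library(tidyverse)

# Toy rows copied from Table 1; the real analysis would use `data` instead.
toy <- tibble(
  Name = c("Escherichia coli", "Bacillus subtilis", "Clostridium botulinum"),
  `Harmful to Humans` = c("Yes", "No", "Yes")
)

# Stop early if the column contains anything other than "Yes" or "No".
stopifnot(all(toy$`Harmful to Humans` %in% c("Yes", "No")))

# Hypothetical helper column: a logical flag is easy to sum or average.
toy <- toy %>%
  mutate(harmful_flag = `Harmful to Humans` == "Yes")

# Number of harmful species in the toy rows.
sum(toy$harmful_flag)
```

A logical flag like this is optional; the analysis compares against the strings "Yes" and "No" directly, which works equally well as long as the labels are clean.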
Transformation of the Data
Species Count per Family + Percentage Harmful + Total Number of Harmful and Safe Bacteria
# Count total, harmful, and safe species per family, then compute % harmful
family_summary <- data %>%
  filter(!is.na(Family)) %>%
  group_by(Family) %>%
  summarise(
    Total_Species = n(),
    Harmful_Count = sum(`Harmful to Humans` == "Yes", na.rm = TRUE),
    Safe_Count = sum(`Harmful to Humans` == "No", na.rm = TRUE)
  ) %>%
  mutate(
    Harmful_Percent = round(Harmful_Count / Total_Species * 100, 1)
  ) %>%
  arrange(desc(Total_Species))

# Show the 15 largest families; colour the % column red at 50% or above
family_summary %>%
  head(15) %>%
  kable(
    caption = "Table 3: Top 15 bacterial families by species count, with harmful vs. safe breakdown.",
    col.names = c("Family", "Total Species", "Harmful", "Safe", "% Harmful")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 13
  ) %>%
  column_spec(5, bold = TRUE,
              color = ifelse(
                family_summary %>% head(15) %>% pull(Harmful_Percent) >= 50,
                "firebrick", "darkgreen"
              ))

| Family | Total Species | Harmful | Safe | % Harmful |
|---|---|---|---|---|
| Enterobacteriaceae | 21 | 11 | 10 | 52.4 |
| Bacillaceae | 7 | 0 | 7 | 0.0 |
| Flavobacteriaceae | 6 | 5 | 1 | 83.3 |
| Streptococcaceae | 6 | 4 | 2 | 66.7 |
| Bifidobacteriaceae | 5 | 2 | 3 | 40.0 |
| Alcaligenaceae | 4 | 3 | 1 | 75.0 |
| Clostridiaceae | 4 | 3 | 1 | 75.0 |
| Comamonadaceae | 4 | 0 | 4 | 0.0 |
| Corynebacteriaceae | 4 | 4 | 0 | 100.0 |
| Micrococcaceae | 4 | 1 | 3 | 25.0 |
| Neisseriaceae | 4 | 4 | 0 | 100.0 |
| Pasteurellaceae | 4 | 4 | 0 | 100.0 |
| Xanthomonadaceae | 4 | 2 | 2 | 50.0 |
| Acetobacteraceae | 3 | 0 | 3 | 0.0 |
| Bacteroidaceae | 3 | 1 | 2 | 33.3 |
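As a check on the arithmetic behind the "% Harmful" column, the sketch below recomputes the percentage for three families using the counts shown in Table 3. The tibble `toy_counts` and the result `checked` are illustrative names only.

```r
library(tidyverse)

# Counts copied from Table 3 for three families.
toy_counts <- tibble(
  Family        = c("Enterobacteriaceae", "Flavobacteriaceae", "Bacillaceae"),
  Total_Species = c(21, 6, 7),
  Harmful_Count = c(11, 5, 0)
)

# Same formula as in family_summary: harmful share, as a percentage, to 1 decimal.
checked <- toy_counts %>%
  mutate(Harmful_Percent = round(Harmful_Count / Total_Species * 100, 1))

checked$Harmful_Percent
# 52.4 83.3 0.0
```

For example, Enterobacteriaceae has 11 harmful species out of 21, and 11 / 21 × 100 rounds to 52.4.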
Habitat Danger Profile
habitat_summary <- data %>%
  filter(!is.na(`Where Found`)) %>%
  mutate(
    # Simplify habitats by extracting the first location mentioned
    Habitat_Simple = str_trim(str_split(`Where Found`, ",", simplify = TRUE)[, 1])
  ) %>%
  group_by(Habitat_Simple, `Harmful to Humans`) %>%
  summarise(Count = n(), .groups = "drop") %>%
  pivot_wider(
    names_from = `Harmful to Humans`,
    values_from = Count,
    values_fill = 0
  ) %>%
  mutate(
    Total = Yes + No,
    Risk_Percent = round(Yes / Total * 100, 1)
  ) %>%
  arrange(desc(Risk_Percent)) %>%
  filter(Total >= 3)
# pivot_wider() creates the count columns in order of first appearance,
# here No then Yes, so the column labels must follow that order.
habitat_summary %>%
  kable(
    caption = "Table 4: Habitat danger profile (habitats with 3+ species, sorted by % harmful).",
    col.names = c("Habitat", "Safe (No)", "Harmful (Yes)", "Total", "% Harmful")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 13
  )

| Habitat | Safe (No) | Harmful (Yes) | Total | % Harmful |
|---|---|---|---|---|
| Genitourinary tract | 0 | 3 | 3 | 100.0 |
| Human oral cavity | 0 | 3 | 3 | 100.0 |
| Human urogenital tract | 0 | 3 | 3 | 100.0 |
| Infected animals | 0 | 6 | 6 | 100.0 |
| Respiratory tract | 0 | 5 | 5 | 100.0 |
| Throat | 0 | 3 | 3 | 100.0 |
| Human skin | 1 | 4 | 5 | 80.0 |
| Human mouth | 1 | 2 | 3 | 66.7 |
| Water | 3 | 5 | 8 | 62.5 |
| Plants | 3 | 4 | 7 | 57.1 |
| Intestinal tract | 6 | 6 | 12 | 50.0 |
| Freshwater | 4 | 3 | 7 | 42.9 |
| Soil | 29 | 15 | 44 | 34.1 |
| Mouth | 2 | 1 | 3 | 33.3 |
| Skin | 3 | 1 | 4 | 25.0 |
| Marine environments | 5 | 1 | 6 | 16.7 |
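Table 4 lists some habitats under more than one label, for example "Human skin" and "Skin", or "Human mouth" and "Mouth". Whether these labels should be merged is a judgement call about the data. If they were merged, one possible approach is sketched below. The helper `normalise_habitat()` is hypothetical and is not part of the original analysis; applying it would change the counts in the table.

```r
library(tidyverse)

# A few habitat labels as they appear in Table 4.
habitats <- c("Human skin", "Skin", "Human mouth", "Mouth", "Soil")

# Hypothetical helper: drop a leading "Human " and restore sentence case.
normalise_habitat <- function(x) {
  x %>%
    str_remove(regex("^human\\s+", ignore_case = TRUE)) %>%
    str_to_sentence()
}

normalise_habitat(habitats)
# "Skin" "Skin" "Mouth" "Mouth" "Soil"
```

In the pipeline, such a helper could be applied to `Habitat_Simple` before the `group_by()` step, so that merged labels are counted together.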
Overall Harmful vs. Non-harmful Summary
# For each group (harmful / not harmful), report its size
# plus its most frequent family and habitat
overall_summary <- data %>%
  filter(!is.na(`Harmful to Humans`)) %>%
  group_by(`Harmful to Humans`) %>%
  summarise(
    Count = n(),
    Top_Family = names(sort(table(Family), decreasing = TRUE))[1],
    Top_Habitat = names(sort(table(`Where Found`), decreasing = TRUE))[1]
  ) %>%
  mutate(Percent = round(Count / sum(Count) * 100, 1))

overall_summary %>%
  kable(
    caption = "Table 5: Overall split of harmful vs. non-harmful bacteria, with most common family and habitat for each group.",
    col.names = c("Harmful to Humans", "Count", "Most Common Family",
                  "Most Common Habitat", "% of Dataset")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 13
  )

| Harmful to Humans | Count | Most Common Family | Most Common Habitat | % of Dataset |
|---|---|---|---|---|
| No | 102 | Enterobacteriaceae | Soil | 51.3 |
| Yes | 97 | Enterobacteriaceae | Intestinal tract | 48.7 |
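The "most common" columns in Table 5 come from the `names(sort(table(x), decreasing = TRUE))[1]` idiom. The sketch below shows how it behaves on a small made-up vector of habitat labels; the data here are illustrative, not taken from the dataset.

```r
# Toy habitat labels; "Soil" appears three times, more than any other label.
habitats <- c("Soil", "Water", "Soil", "Skin", "Soil", "Water")

# table() counts each label, sort() orders the counts from largest to smallest,
# and names(...)[1] keeps the label with the highest count.
counts <- sort(table(habitats), decreasing = TRUE)
most_common <- names(counts)[1]

most_common
# "Soil"
```

Note that this idiom reports a single label even when two labels are tied for first place, so a tie would not be visible in the table.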
Exploratory Data Analysis
Bar Chart of Families with 3 or More Species, by Species Count:
data %>%
  filter(!is.na(Family), !is.na(`Harmful to Humans`)) %>%
  count(Family, `Harmful to Humans`) %>%
  group_by(Family) %>%
  mutate(Total = sum(n)) %>%
  ungroup() %>%
  filter(Total >= 3) %>%
  mutate(Family = fct_reorder(Family, Total)) %>%
  ggplot(aes(x = Family, y = n, fill = `Harmful to Humans`)) +
  geom_col(position = "stack", width = 0.5) +
  coord_flip() +
  scale_fill_manual(
    values = c("Yes" = "#C0392B", "No" = "#27AE60")
  ) +
  labs(
    title = "Bacterial Families by Species Count",
    subtitle = "Families with 3 or more species; red = harmful, green = safe",
    x = "Family",
    y = "Number of Species",
    fill = "Harmful to Humans"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 15),
    plot.subtitle = element_text(colour = "grey40"),
    legend.position = "bottom"
  )
This chart shows every family with three or more species, ranked by how many species it contains. Each bar is split into red (harmful) and green (safe) segments, so you can see at a glance both how large a family is and how dangerous it tends to be.
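The plot code above keeps every family with at least three species rather than a fixed number of families. If a fixed number of the largest families were wanted instead, dplyr's `slice_max()` is one option. The sketch below uses a made-up tibble (`toy_family`, with placeholder family names) and keeps the three largest families; it is an illustration, not the method used for the chart.

```r
library(tidyverse)

# Made-up family sizes; the names A-D are placeholders.
toy_family <- tibble(
  Family = c("A", "B", "C", "D"),
  Total  = c(21, 7, 6, 6)
)

# Keep exactly the 3 largest families; with_ties = FALSE prevents a tie
# at the cutoff from returning more than 3 rows.
top_families <- toy_family %>%
  slice_max(Total, n = 3, with_ties = FALSE)

nrow(top_families)
# 3
```

With the default `with_ties = TRUE`, families tied at the cutoff would all be kept, so the result could contain more rows than requested.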
Pie Chart of Overall Harmful vs. Non-Harmful Bacteria
#| fig-cap: "Figure 2: Proportion of harmful vs. non-harmful bacteria in the dataset."
data %>%
  filter(!is.na(`Harmful to Humans`)) %>%
  count(`Harmful to Humans`) %>%
  mutate(
    Percent = round(n / sum(n) * 100, 1),
    Label = paste0(`Harmful to Humans`, "\n", Percent, "%")
  ) %>%
  ggplot(aes(x = 2, y = n, fill = `Harmful to Humans`)) +
  geom_col(width = 1, colour = "white", linewidth = 1) +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +
  scale_fill_manual(values = c("Yes" = "#C0392B", "No" = "#27AE60")) +
  geom_text(aes(label = Label),
            position = position_stack(vjust = 0.5),
            colour = "white", fontface = "bold", size = 5) +
  labs(
    title = "Proportion of Harmful vs. Non-Harmful Bacteria",
    subtitle = "Based on 199 species in the dataset",
    fill = "Harmful to Humans"
  ) +
  theme_void(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 15, hjust = 0.5),
    plot.subtitle = element_text(colour = "grey40", hjust = 0.5),
    legend.position = "none"
  )
A simple split of the entire dataset into harmful vs. non-harmful bacteria, displayed as percentages.
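The percentage labels in the chart are built with `round()` and `paste0()`. Since the scales package is already loaded, its `percent()` function is an alternative way to format the same values. The sketch below applies it to the counts reported in Table 5; the variable names are illustrative.

```r
library(scales)

# Counts taken from Table 5: 102 non-harmful and 97 harmful species.
counts <- c(102, 97)

# percent() formats a proportion as a percentage string;
# accuracy = 0.1 keeps one decimal place.
labels <- percent(counts / sum(counts), accuracy = 0.1)

labels
# "51.3%" "48.7%"
```

Either approach gives the same labels here; `percent()` simply moves the rounding and the "%" sign into one call.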
Lollipop chart of the Family Composition by Harmfulness of the Bacteria
data %>%
  filter(!is.na(Family), !is.na(`Harmful to Humans`)) %>%
  group_by(Family) %>%
  summarise(
    Total = n(),
    Harmful = sum(`Harmful to Humans` == "Yes")
  ) %>%
  filter(Total >= 3) %>%
  mutate(
    Pct = Harmful / Total * 100,
    Family = fct_reorder(Family, Pct)
  ) %>%
  ggplot(aes(x = Family, y = Pct)) +
  geom_segment(aes(xend = Family, y = 0, yend = Pct),
               colour = "grey60", linewidth = 0.8) +
  geom_point(aes(colour = Pct), size = 4) +
  scale_colour_gradient(low = "#27AE60", high = "#C0392B") +
  coord_flip() +
  labs(
    title = "Percentage of Harmful Species per Family",
    subtitle = "Families with 3+ species; green = mostly safe, red = mostly harmful",
    x = "Family",
    y = "% Harmful Species",
    colour = "% Harmful"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 15),
    plot.subtitle = element_text(colour = "grey40"),
    legend.position = "right"
  )
This chart ranks families with three or more species by the proportion of their species that are harmful, with dots colored on a green-to-red gradient.
Conclusion
In this project, I aimed to explore how bacterial species are distributed across families and habitats, and what that means for human health. Of the 199 species analyzed, 97 (48.7%) are classified as harmful to humans, and the risk varies considerably with taxonomic family and habitat. Some families stand out as disproportionately dangerous: every species of Corynebacteriaceae, Neisseriaceae, and Pasteurellaceae in the dataset is harmful, while Bacillaceae, Comamonadaceae, and Acetobacteraceae contain no harmful species.
The habitat danger profile showed that environments in close contact with humans, such as the respiratory tract, the throat, and the oral cavity, tend to have a higher share of harmful species than soil or marine environments, although most of these habitats are represented by only a handful of species. This has clear implications for hygiene, clinical, and environmental risk assessment.
Together, the lollipop chart and the stacked bar chart illustrated that harmfulness is not evenly distributed across bacterial families, reinforcing the importance of family-level classification in microbiology and public health screening. Knowing which families carry the highest risk can help researchers and clinicians prioritize monitoring and treatment efforts. Overall, this analysis underscores that not all bacteria are created equal; their potential impact on human health is closely tied to where they live and what family they belong to. A more informed understanding of these patterns can support better decision-making in medical, laboratory, and environmental settings alike.
References
- Kanchana1990. (2024, March 27). Bacteria dataset. Kaggle. https://www.kaggle.com/datasets/kanchana1990/bacteria-dataset