Introduction

The intent of this analysis is to explore the multifaceted characteristics of various dog breeds, leveraging a dataset provided by the American Kennel Club. This dataset is augmented by insights from the Segmanta Big Pet Survey, offering a unique perspective on the traits that pet owners value most in their canine companions. It encompasses a range of breed-specific attributes, including temperament, popularity, physical dimensions, life expectancy, shedding patterns, energy levels, and trainability. These assessments aim to highlight the underlying trends and preferences that shape the popularity and perception of different dog breeds, providing valuable insights into the reasons why people select the pets they do.

Dataset

These analyses use a comprehensive dataset provided by the American Kennel Club, which offers detailed information on a wide range of dog breeds. The data is particularly valuable because it can be linked to other sources, such as the Segmanta Big Pet Survey, which was used to identify the most sought-after traits for one of the analyses.

Key attributes included for each breed are:

  • Breed - The specific name of the dog breed.
  • Temperament - Characteristic behavior and personality traits of the breed.
  • Popularity - A ranking indicating the breed’s popularity.
  • Height Range - Specifies the height spectrum for the breed in centimeters.
  • Weight Range - Provides the weight range in kilograms.
  • Life Expectancy - Expected lifespan range in years.
  • Group - The AKC classification group for the breed.
  • Shedding (Shedding Value, Shedding Category) - Information on the breed’s shedding pattern, both quantitatively and qualitatively.
  • Energy Level - A numerical and categorical representation of the breed’s energy.
  • Trainability - Indicates the ease of training the breed, again, both numerically and descriptively.

The dataset provides a holistic view of each breed’s physical and behavioral characteristics. Popularity is a rank from 1 (most popular) to 192 (least popular), and the insights derived from this data center primarily on the top 25% of breeds by that rank.

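# Packages used below. theme_modern() is assumed to come from the 'see' package, and
# flatUIPalette is assumed to be a named vector of colours defined elsewhere.
library(data.table)  # fread()
library(dplyr)       # %>%, mutate(), arrange(), filter()
library(ggplot2)
library(see)

baseDataset <- fread("C:/Users/grego/OneDrive/Desktop/School/Data Visualizations/Dog Breeds/dog_breeds.csv")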

Analysis

# Replace any empty strings in the dataset with NA (missing values)
baseDataset[baseDataset == ""] <- NA

# Replace any instances of the string "of" in the dataset with NA (missing values) to correct for erroneous datapoints
baseDataset[baseDataset == "of"] <- NA

# Convert 'Popularity' to numeric; entries that cannot be parsed become NA (with a warning)
baseDataset <- baseDataset %>%
  mutate(Popularity = as.numeric(as.character(Popularity)))

# Arrange the data by 'Popularity' in ascending order (rank 1 first), then keep only the top 25% of rows
top_25_percent <- baseDataset %>%
  arrange(Popularity) %>%
  filter(row_number() <= n() * 0.25)
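
To make the top-25% cut concrete, here is a minimal base R sketch of the same idea on made-up data. The `toy` data frame, its breed names, and its ranks are purely illustrative and are not taken from the AKC dataset; `order()` stands in for `arrange()`, and `head()` for the `row_number()` filter.

```r
# Hypothetical toy data: eight made-up breeds with popularity ranks (1 = most popular).
toy <- data.frame(
  Breed = paste("Breed", LETTERS[1:8]),
  Popularity = c(5, 1, NA, 3, 8, 2, 7, 4)
)

# Sort by rank; order() places NA last by default, as arrange() does.
toy_sorted <- toy[order(toy$Popularity), ]

# Keep the first quarter of the rows (8 * 0.25 = 2 rows).
n_keep <- floor(nrow(toy_sorted) * 0.25)
toy_top <- head(toy_sorted, n_keep)

print(toy_top)
```

Because the cutoff is computed from the total row count, rows with a missing rank still count toward the 25% even though they sort last.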

Popularity by Dog Group

Here we compare popularity across the seven AKC groups: Herding, Hound, Non-Sporting, Sporting, Terrier, Toy, and Working. Popularity is a rank, so lower values indicate more popular breeds. The violin plot shows a density estimate for each group, where the width of a violin at a given value reflects how many breeds in that group sit near that value.

The median popularity value for each group is denoted by a white dot. The Hound and Working groups have higher medians than the others, meaning their typical breed sits further down the popularity ranking, while the Non-Sporting group has the lowest median. The distribution within each group varies: the Non-Sporting group’s distribution is particularly narrow, suggesting a high degree of consistency in popularity among its breeds. Conversely, the Sporting and Toy groups display wider distributions, indicating more varied popularity among their breeds.

The Herding, Sporting, and Working groups exhibit fairly symmetrical distributions, implying a balanced spread of popularity. In contrast, the Terrier group shows a slight skew towards lower values. The length of each violin indicates the overall spread of the data; the Toy and Working groups cover a substantial range, which points to both highly popular and much less popular breeds within these groups.

# Filter the baseDataset to include only rows where 'Popularity' and 'Group' are not NA
df_groups <- baseDataset %>%
  filter(!is.na(Popularity), !is.na(Group))

# Create a violin plot showing the distribution of 'Popularity' for each 'Group'.
ggplot(df_groups, aes(x = Group, y = Popularity, fill = Group)) +
  geom_violin(trim = FALSE) +
  stat_summary(
    fun = median,
    geom = "point",
    color = "white",
    fill = "white",
    size = 3,
    shape = 21,
    show.legend = FALSE
  ) +
  theme_modern() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    legend.title = element_blank(),
    legend.position = "none",
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    axis.text.x = element_text(
      angle = 45,
      vjust = 1,
      hjust = 1
    ),
    panel.grid.major = element_line(colour = "grey60", linewidth = 0.2),
    panel.grid.minor = element_line(colour = "grey90", linewidth = 0.2)
  ) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Popularity by Dog Group",
       x = "Group",
       y = "Popularity") +
  annotate(
    "text",
    x = Inf,
    y = -Inf,
    label = "*White dot indicates median",
    hjust = 1,
    vjust = 10,
    size = 4,
    color = "black",
    fontface = "italic"
  ) +
  coord_cartesian(clip = "off")
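
The medians and spreads read off the violins could also be checked numerically. The sketch below does this in base R on a small made-up data frame (the `toy` values are illustrative only); with the real data, the same calls would presumably be applied to `df_groups`.

```r
# Hypothetical data: two groups with made-up popularity ranks (1 = most popular).
toy <- data.frame(
  Group = c("Herding", "Herding", "Herding", "Toy", "Toy", "Toy"),
  Popularity = c(10, 40, 90, 5, 60, 150)
)

# Median rank per group: the value marked in white on each violin.
group_medians <- tapply(toy$Popularity, toy$Group, median)

# Range of ranks per group: a rough numeric counterpart to the violin's length.
group_ranges <- tapply(toy$Popularity, toy$Group, function(x) diff(range(x)))

print(group_medians)
print(group_ranges)
```

Since popularity is a rank, a lower median here means the group's typical breed is more popular.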

Top Breeds Life Expectancy vs. Average Height & Weight

This chart shows a weak negative relationship between body size and average life expectancy: the trend line is fitted to average height, while average weight is encoded in the size and color of each point. On average, smaller dog breeds tend to live longer than larger ones. However, the relationship is weak, and there is considerable variability around the trend line. For example, the Chihuahua, one of the smallest breeds on the chart, has a shorter life expectancy than some larger breeds, such as the Beagle.

The chart relies on breed-level averages (the midpoint of each breed’s minimum and maximum values), so it does not account for individual variation in size and lifespan within a breed. The dataset also does not capture other factors that can affect lifespan, such as genetics, diet, and exercise.

The inverse relationship between body size and lifespan across dog breeds is well documented, and this chart provides a visual representation of that relationship for the most popular breeds.

# Calculate midpoint weight, height, and life expectancy per breed, then drop breeds with missing or zero values
df_size <- top_25_percent %>%
  mutate(
    AverageWeight = (`Min Weight` + `Max Weight`) / 2,
    AverageHeight = (`Min Height` + `Max Height`) / 2,
    AverageLifeExpectancy = (`Min Life Expectancy` + `Max Life Expectancy`) / 2
  ) %>%
  filter(AverageWeight != 0, AverageHeight != 0, AverageLifeExpectancy != 0)


# Create Height & Weight vs. Life Expectancy plot
ggplot(df_size, aes(x = AverageHeight, y = AverageLifeExpectancy)) +
  geom_point(aes(size = AverageWeight, color = AverageWeight), alpha = 0.6) +
  geom_smooth(
    method = 'lm',
    se = FALSE,
    color = flatUIPalette["belize_hole"],
    linewidth = 2
  ) +
  scale_color_gradient(low = "lightblue", high = flatUIPalette["belize_hole"]) +
  scale_size_area(max_size = 8) +
  theme_modern() +
  theme(
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
    legend.background = element_rect(
      fill = "white",
      linewidth = 4,
      colour = "white"
    ),
    legend.justification = c(0, 1),
    axis.ticks = element_line(colour = "grey70", linewidth = 0.2),
    axis.title.x = element_text(face = "bold", hjust = 0.5),
    axis.title.y = element_text(face = "bold"),
    panel.grid.major = element_line(colour = "grey60", linewidth = 0.2),
    panel.grid.minor = element_line(colour = "grey90", linewidth = 0.2)
  ) +
  labs(
    title = "Top Breeds Life Expectancy vs. Average Height & Weight",
    x = "Average Height (cm)",
    y = "Average Life Expectancy (years)",
    color = "Average Weight (kg)",
    size = "Average Weight (kg)"
  ) +
  theme(legend.position = "right") +
  guides(size = guide_legend(override.aes = list(color = flatUIPalette["belize_hole"]))) +
  guides(color = "none")
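
How weak the trend is could be quantified rather than judged by eye. The following base R sketch computes a Pearson correlation and the slope of a linear fit on made-up values; the `toy` data frame is hypothetical and only borrows the column names used above, so its numbers say nothing about the real dataset.

```r
# Hypothetical breed-level averages (made-up values, not from the AKC data).
toy <- data.frame(
  AverageHeight = c(20, 30, 40, 50, 60, 70),           # cm
  AverageWeight = c(3, 8, 15, 25, 35, 60),             # kg
  AverageLifeExpectancy = c(15, 14, 13.5, 12, 11, 9)   # years
)

# Pearson correlation of each size measure with life expectancy.
r_height <- cor(toy$AverageHeight, toy$AverageLifeExpectancy)
r_weight <- cor(toy$AverageWeight, toy$AverageLifeExpectancy)

# Slope of a linear fit of life expectancy on height (years per cm),
# the same kind of line that geom_smooth(method = "lm") draws.
fit <- lm(AverageLifeExpectancy ~ AverageHeight, data = toy)
slope <- coef(fit)[["AverageHeight"]]

print(c(r_height = r_height, r_weight = r_weight, slope = slope))
```

On the real data, a coefficient close to 0 would support describing the relationship as weak, while a value closer to -1 would suggest a stronger trend than the text implies.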

Top 25% Breeds Energy Level vs. Trainability Plot

This visual shows the number of dog breeds that fall into each combination of energy level and trainability. The energy level categories are:

  • Couch Potato
  • Calm
  • Regular Exercise
  • Needs Lots of Activity
  • Energetic

From this we can infer that trainability does not necessarily correspond with energy level: the chart shows no clear pattern between the two. For example, there are breeds considered easy to train in all five energy level categories.

The data also shows that the largest number of breeds fall into the middle categories of both energy level and trainability. This suggests that most popular breeds have moderate energy levels and require an average amount of training effort.

# Keep rows with non-missing, non-zero trainability values and energy level categories
df_train <- top_25_percent %>%
  filter(
    !is.na(`Trainability Value`),
    `Trainability Value` != "",
    `Trainability Value` != 0,
    !is.na(`Energy Level Category`),
    `Energy Level Category` != "",
    `Energy Level Category` != 0
  )

# Cut `Trainability Value` into bins; labels = FALSE returns the integer bin index
df_train$TrainabilityLevelVal <- cut(df_train$`Trainability Value`,
                                     breaks = seq(0, 1, by = .01),
                                     labels = FALSE)
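
As a rough illustration of what `cut()` returns here, the sketch below bins a few made-up values on a 0 to 1 scale. It uses five coarse bins rather than the 100 used above so the result is easy to follow; the `trainability` vector and the `breaks` are assumptions for the example only.

```r
# Hypothetical trainability values on a 0-1 scale (made-up for illustration).
trainability <- c(0.2, 0.4, 0.6, 0.8, 1.0)

# Five coarse bins: (0, 0.2], (0.2, 0.4], ..., (0.8, 1]. Intervals are closed on the right.
breaks <- c(0, 0.2, 0.4, 0.6, 0.8, 1)

# labels = FALSE returns the integer index of the bin each value falls into.
bin_index <- cut(trainability, breaks = breaks, labels = FALSE)
print(bin_index)

# A value of exactly 0 lies outside the first interval and becomes NA
# unless include.lowest = TRUE is set.
zero_bin <- cut(0, breaks = breaks, labels = FALSE)
print(zero_bin)
```

The filter above already removes zero values, so the NA case should not arise in `df_train`, but it is worth keeping in mind if that filter is ever relaxed.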

# Summarize the count data
count_data <- df_train %>%
  group_by(`Energy Level Category`, TrainabilityLevelVal) %>%
  summarize(BreedCount = n(), .groups = 'drop')

# Define the custom order for the Energy Level Category
energy_levels_order <-
  c("Couch Potato",
    "Calm",
    "Regular Exercise",
    "Needs Lots of Activity",
    "Energetic")

# Convert Energy Level Category to a factor with the specified order
count_data$`Energy Level Category` <-
  factor(count_data$`Energy Level Category`, levels = energy_levels_order)

ggplot(count_data,
       aes(
         x = `Energy Level Category`,
         y = as.factor(TrainabilityLevelVal),
         fill = BreedCount
       )) +
  geom_tile(color = "white",
            lwd = 1.5,
            linetype = 1) +
  geom_text(aes(label = BreedCount)) +
  scale_fill_gradientn(colors = c(flatUIPalette["peter_river"], flatUIPalette["sun_flower"], flatUIPalette["alizarin"])) +
  theme_modern() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.text.x = element_text(
      angle = 30,
      vjust = 0.8,
      hjust = 0.8
    ),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold")
  ) +
  labs(title = "Top 25% of Breeds by Energy Level Category and Trainability",
       x = "Energy Level Category",
       y = "Trainability Level",
       fill = "Breed Count")

Conclusion

The findings shed light on the diverse attributes that define each breed, from physical characteristics to behavioral traits. By analyzing aspects such as popularity trends, shedding patterns, life expectancy, and the relationship between energy levels and trainability, the study provides insight into what makes certain breeds stand out. This analysis not only serves as a resource for prospective dog owners making informed decisions but also contributes to the broader understanding of canine characteristics and their implications for breeding, training, and pet ownership. Understanding the relationship between a dog’s inherent traits and its appeal to humans helps explain why we have bonded across species lines for many millennia.