Popular Dog Breeds Analysis

Introduction

The intent of this analysis is to explore the multifaceted characteristics of various dog breeds, leveraging a dataset provided by the American Kennel Club. This dataset is augmented by insights from the Segmanta Big Pet Survey, offering a unique perspective on the traits that pet owners value most in their canine companions. It encompasses a range of breed-specific attributes, including temperament, popularity, physical dimensions, life expectancy, shedding patterns, energy levels, and trainability. These assessments aim to highlight the underlying trends and preferences that shape the popularity and perception of different dog breeds, providing valuable insights into the reasons why people select the pets they do.

Dataset

These analyses use a comprehensive dataset provided by the American Kennel Club, which offers detailed insights into various dog breeds. The data is particularly valuable because it covers a wide array of breed-specific information and can be linked to other sources, such as the Segmanta Big Pet Survey, from which the most sought-after traits examined in one analysis were derived.

Key attributes included for each breed are:

Breed - The specific name of the dog breed.
Temperament - Characteristic behavior and personality traits of the breed.
Popularity - A ranking indicating the breed’s popularity.
Height Range - Specifies the height spectrum for the breed in centimeters.
Weight Range - Provides the weight range in kilograms.
Life Expectancy - Expected lifespan range in years.
Group - The AKC classification group for the breed.
Shedding (Shedding Value, Shedding Category) - Information on the breed’s shedding pattern, both quantitatively and qualitatively.
Energy Level - A numerical and categorical representation of the breed’s energy.
Trainability - Indicates the ease of training the breed, again, both numerically and descriptively.

The dataset provides a holistic view of each breed’s physical and behavioral characteristics. Popularity is recorded as a rank from 1 (most popular) to 192 (least popular), and the insights derived from this data center primarily on the most popular quartile (top 25%) of breeds.

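# Packages used throughout: data.table for fread(), dplyr for the pipelines,
# ggplot2 and plotly for the charts. theme_modern() is assumed to come from the
# see package; flatUIPalette is assumed to be a named colour vector defined elsewhere.
library(data.table)
library(dplyr)
library(ggplot2)
library(plotly)
library(see)

baseDataset <- fread("C:/Users/grego/OneDrive/Desktop/School/Data Visualizations/Dog Breeds/dog_breeds.csv")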

Analysis

# Replace any empty strings in the dataset with NA (missing values)
baseDataset[baseDataset == ""] <- NA

# Replace any instances of the string "of" in the dataset with NA (missing values) to correct for erroneous datapoints
baseDataset[baseDataset == "of"] <- NA

# Convert 'Popularity' to numeric; values that cannot be parsed become NA (with a warning)
baseDataset <- baseDataset %>%
  mutate(Popularity = as.numeric(as.character(Popularity)))

# Arrange the data by 'Popularity' in ascending order (rank 1 first), then keep only the top 25% of rows
top_25_percent <- baseDataset %>%
  arrange(Popularity) %>%
  filter(row_number() <= n() * 0.25)
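
The quartile cut above counts every row, including any whose rank is missing, when it computes the 25% threshold. The following is a minimal base R sketch, on a hypothetical eight-breed toy table rather than the AKC data, of how the cut could be taken after dropping missing ranks. The names `toy`, `ranked`, and `top_quartile_toy` are illustrative assumptions and do not appear in the analysis.

```r
# Hypothetical toy data: 8 breeds, one with a missing popularity rank
toy <- data.frame(
  Breed = c("A", "B", "C", "D", "E", "F", "G", "H"),
  Popularity = c(3, 1, NA, 7, 2, 5, 4, 6)
)

# Drop missing ranks first so they do not count towards the quartile size
ranked <- toy[!is.na(toy$Popularity), ]

# Sort ascending (rank 1 = most popular) and keep the first 25% of rows
ranked <- ranked[order(ranked$Popularity), ]
n_keep <- floor(nrow(ranked) * 0.25)
top_quartile_toy <- ranked[seq_len(n_keep), ]

print(top_quartile_toy)
```

With seven ranked breeds the cut keeps one row; had the missing rank been counted, the threshold would have allowed two.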

Popularity by Dog Group

Here we observe a comparative visualization of popularity scores across seven different dog groups: Herding, Hound, Non-Sporting, Sporting, Terrier, Toy, and Working. Because popularity is recorded as a rank (1 = most popular), lower values on the vertical axis correspond to more popular breeds. The plot provides a density estimation of the scores, where the width of each violin indicates the frequency of data points at different levels of popularity within each group.

The median popularity score for each group is denoted by a white dot, revealing that the Hound and Working groups have higher median popularity scores compared to the others, with the Non-Sporting group having the lowest median score. The distribution within each group is varied; the Non-Sporting group’s distribution is particularly narrow, suggesting a high degree of consistency in popularity scores among its breeds. Conversely, the Sporting and Toy groups display wider distributions, indicating a more varied perception of popularity among their respective breeds.

The Herding, Sporting, and Working groups exhibit fairly symmetrical distributions, implying a balanced spread of popularity. In contrast, the Terrier group shows a slight skew towards lower popularity scores. The range of the violins indicates the overall spread of the data, with the Toy and Working groups showing a substantial range, indicating the presence of both highly popular and much less popular breeds within these groups.

# Filter the baseDataset to include only rows where 'Popularity' and 'Group' are not NA
df_groups <- baseDataset %>%
  filter(!is.na(Popularity), !is.na(Group))

# Create a violin plot showing the distribution of 'Popularity' for each 'Group'.
ggplot(df_groups, aes(x = Group, y = Popularity, fill = Group)) +
  geom_violin(trim = FALSE) +
  stat_summary(
    fun = median,
    geom = "point",
    color = "white",
    fill = "white",
    size = 3,
    shape = 21, # filled circle, so the marker matches the "white dot" caption
    show.legend = FALSE
  ) +
  theme_modern() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    legend.title = element_blank(),
    legend.position = "none",
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    axis.text.x = element_text(
      angle = 45,
      vjust = 1,
      hjust = 1
    ),
    panel.grid.major = element_line(colour = "grey60", linewidth = 0.2),
    panel.grid.minor = element_line(colour = "grey90", linewidth = 0.2)
  ) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Popularity by Dog Group",
       x = "Group",
       y = "Popularity") +
  annotate(
    "text",
    x = Inf,
    y = -Inf,
    label = "*White dot indicates median",
    hjust = 1,
    vjust = 10,
    size = 4,
    color = "black",
    fontface = "italic"
  ) +
  coord_cartesian(clip = "off")
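
The medians and spreads read off the violins can also be checked numerically. The following is a minimal base R sketch on hypothetical ranks for three groups (the values are invented for illustration and are not taken from the dataset). It computes the median and interquartile range per group; since popularity is a rank, a lower median means a more popular group.

```r
# Hypothetical toy ranks for three AKC groups (not taken from the dataset)
toy <- data.frame(
  Group = c("Toy", "Toy", "Toy", "Hound", "Hound", "Hound", "Herding", "Herding"),
  Popularity = c(10, 40, 150, 60, 90, 120, 20, 30)
)

# Median and interquartile range of the popularity rank within each group
group_median <- tapply(toy$Popularity, toy$Group, median)
group_iqr <- tapply(toy$Popularity, toy$Group, IQR)

print(round(cbind(median = group_median, IQR = group_iqr), 1))
```

A wide interquartile range corresponds to a violin whose bulk is stretched vertically, while a small one corresponds to a compact violin.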

Traits of the Most Popular Dogs

According to Segmanta’s Big Pet Survey 2020, the traits pet owners most desire in dogs can be described using the following 16 words:

Loving	Sweet	Playful	Loyal
Happy	Smart	Friendly	Funny
Energetic	Protective	Lazy	Goofy
Silly	Hyper	Needy	Cuddly

The most frequently observed trait among popular dogs is Friendly, with 17 occurrences, indicating that this is a highly valued characteristic in dogs that are considered popular. Following Friendly, the trait Smart appears 13 times, suggesting intelligence is also a significant factor in a dog’s popularity. Playful and Loyal traits are next, with 10 and 9 occurrences respectively, which shows that these behaviors are also appreciated in popular dogs.

The traits Energetic, Happy, and Loving each have 4 occurrences. This implies that while these traits are desirable, they may not be as strongly associated with popularity as the traits with higher occurrences. Funny and Sweet are the least observed traits in the dataset, each with just one occurrence, suggesting that while these traits are endearing, they may not be primary factors in determining a dog’s popularity.

The remaining terms appear in the temperament descriptions of other breeds in the dataset, but not in those of the most popular quartile.

# Create the list of words to search for
words <- c(
  "Loving",
  "Sweet",
  "Playful",
  "Loyal",
  "Happy",
  "Smart",
  "Friendly",
  "Funny",
  "Energetic",
  "Protective",
  "Lazy",
  "Goofy",
  "Silly",
  "Hyper",
  "Needy",
  "Cuddly"
)

# Keep top-quartile rows that have both a popularity rank and a temperament description
df_temperament <- top_25_percent %>%
  filter(!is.na(Popularity) & !is.na(Temperament))

# Count the breeds whose temperament description mentions each word (case-insensitive)
word_counts <- data.frame(
  word = words,
  count = vapply(words, function(w) {
    sum(grepl(w, df_temperament$Temperament, ignore.case = TRUE))
  }, integer(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Filter out words that do not appear in the temperament descriptions
word_counts <- word_counts %>%
  filter(count != 0)

# Create a bar plot showing the occurrences of each word in the descriptions of the top 25% of dog breeds
ggplot(word_counts, aes(x = reorder(word, -count), y = count)) +
  geom_bar(stat = "identity", aes(fill = word)) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    legend.title = element_blank(),
    legend.position = "none",
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    axis.text.x = element_text(
      angle = 45,
      vjust = 1,
      hjust = 1
    ),
    panel.grid.major.y = element_line(colour = "grey60", linewidth = 0.2),
    panel.grid.minor.y = element_line(colour = "grey90", linewidth = 0.2)
  ) +
  geom_text(
    aes(label = count),
    vjust = -0.5,
    size = 4,
    fontface = "bold"
  ) +
  labs(title = "Traits of the Most Popular Dogs",
       x = "Temperament Trait",
       y = "Occurrences")
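
The counts above rely on substring matching, so a word such as Friendly would also be counted inside a longer word. The following is a minimal base R sketch, using hypothetical temperament strings rather than the dataset, of how whole-word matching could be done instead. The helper `count_trait` is an illustrative assumption and is not part of the analysis.

```r
# Hypothetical temperament strings (not taken from the dataset)
temperament <- c("Friendly, Smart, Loyal", "Unfriendly, Alert", "Smart, Playful", NA)

# Count the descriptions that contain `word` as a whole word, ignoring case
count_trait <- function(word, descriptions) {
  pattern <- paste0("\\b", word, "\\b")
  sum(grepl(pattern, descriptions, ignore.case = TRUE, perl = TRUE))
}

# Plain substring matching also hits "Unfriendly"
substring_hits <- sum(grepl("Friendly", temperament, ignore.case = TRUE))

# Word-boundary matching counts only the standalone word
whole_word_hits <- count_trait("Friendly", temperament)

print(c(substring = substring_hits, whole_word = whole_word_hits))
```

Missing descriptions are treated as non-matches by `grepl`, so they do not need to be removed before counting.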

Top 25% Most Popular Dogs by Shedding Category

This analysis provides a breakdown of the shedding frequencies for the most popular quartile of dog breeds. Shedding is categorized into five different frequencies: Seasonal, Occasional, Regularly, Infrequent, and Frequent.

The largest segment of the pie chart is Seasonal, which accounts for 42.6% of the most popular dogs. This suggests that the largest share of popular dogs, though not a majority, shed in a pattern that aligns with changes in the season, which might be due to natural cycles of coat growth and shedding in response to climate changes.

Occasional shedding represents 20.6% of the chart, indicating a significant number of popular dogs do not shed continuously but may do so in response to specific triggers or less regularly. ‘Regularly’ shedding dogs comprise 19.1%, showing that a nearly equal proportion of popular dogs shed hair at a consistent rate throughout the year.

Dogs that shed Infrequently make up 13.2% of the pie, which could suggest that while low-shedding dogs are appreciated by many, they are less common among the most popular dogs. Not surprisingly, the Frequent shedding category is the smallest segment at 4.41%, indicating that such dogs are the least common among the most popular ones, possibly due to the higher maintenance required to manage their shedding.

Overall, this illustrates a preference for breeds that shed seasonally, which could reflect a balance between the aesthetic and practical aspects of dog ownership. Dogs that require less frequent grooming are also well represented, whereas breeds that shed very frequently are relatively rare in the group of the most popular dogs.

# Filter out rows with NA in 'Popularity' or 'Shedding Category'
df_shedding <- top_25_percent %>%
  filter(!is.na(Popularity) & !is.na(`Shedding Category`))

# Aggregate data by shedding category for the top 25% breeds
category_counts <- df_shedding %>%
  group_by(`Shedding Category`) %>%
  count() %>%
  ungroup() %>%
  mutate(perc = n / sum(n)) %>%
  arrange(desc(perc)) %>%
  mutate(labels = scales::percent(perc))

# Create an interactive pie chart
pie <-
  plot_ly(
    category_counts,
    labels = ~ `Shedding Category`,
    values = ~ n,
    type = 'pie',
    textinfo = 'percent+label',
    insidetextorientation = 'horizontal',
    hoverinfo = 'label+percent',
    text = ~ labels
  ) %>%
  layout(
    title = list(
      text = 'Top 25% Most Popular Dogs by Shedding Category',
      font = list(
        size = 18,
        color = 'black',
        weight = 'bold'
      )
    ),
    showlegend = TRUE,
    margin = list(t = 100)
  )

pie
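
The percentages shown in the pie chart are simply each category's count divided by the total. The following is a minimal base R sketch of that arithmetic on ten hypothetical breeds (the category assignments are invented for illustration and do not reproduce the figures reported above).

```r
# Hypothetical shedding categories for ten breeds (not taken from the dataset)
shedding <- c("Seasonal", "Seasonal", "Seasonal", "Seasonal", "Occasional",
              "Occasional", "Regularly", "Regularly", "Infrequent", "Frequent")

# Count breeds per category, convert to shares of the total, largest first
counts <- table(shedding)
shares <- sort(prop.table(counts), decreasing = TRUE)

print(round(100 * shares, 1))
```

In this toy example the largest category holds 40% of the breeds, which makes it the most common category without being a majority.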

Top Breeds Life Expectancy vs. Average Height & Weight

This analysis depicts a weak negative relationship between breed size and average life expectancy: on average, smaller dog breeds tend to live longer than larger dog breeds. The trend line is fitted to average height, while average weight is shown by point size and color. The correlation is weak, however, and there is a significant amount of variability around the trend line. For example, the Chihuahua, which is one of the smallest breeds on the chart, has a shorter life expectancy than some larger breeds, such as the Beagle.

The chart relies on breed-level averages, which means that it does not account for the individual variation in weight and lifespan that can occur within each breed. The dataset also does not take into account other factors that can affect lifespan, such as genetics, diet, and exercise.

The link between body size and lifespan in mammals is well-established, and this chart provides a visual representation of this relationship for the most popular dog breeds.

# Calculate average sizes and life expectancy for breeds and filter those without values
df_size <- top_25_percent %>%
  mutate(
    AverageWeight = (`Min Weight` + `Max Weight`) / 2,
    AverageHeight = (`Min Height` + `Max Height`) / 2,
    AverageLifeExpectancy = (`Min Life Expectancy` + `Max Life Expectancy`) / 2
  ) %>%
  filter(AverageWeight != 0, AverageHeight != 0, AverageLifeExpectancy != 0)


# Create Height & Weight vs. Life Expectancy plot
ggplot(df_size, aes(x = AverageHeight, y = AverageLifeExpectancy)) +
  geom_point(aes(size = AverageWeight, color = AverageWeight), alpha = 0.6) +
  geom_smooth(
    method = 'lm',
    se = FALSE,
    color = flatUIPalette["belize_hole"],
    linewidth = 2
  ) +
  scale_color_gradient(low = "lightblue", high = flatUIPalette["belize_hole"]) +
  scale_size_area(max_size = 8) +
  theme_modern() +
  theme(
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
    legend.background = element_rect(
      fill = "white",
      linewidth = 4,
      colour = "white"
    ),
    legend.justification = c(0, 1),
    axis.ticks = element_line(colour = "grey70", linewidth = 0.2),
    axis.title.x = element_text(face = "bold", hjust = 0.5),
    axis.title.y = element_text(face = "bold"),
    panel.grid.major = element_line(colour = "grey60", linewidth = 0.2),
    panel.grid.minor = element_line(colour = "grey90", linewidth = 0.2)
  ) +
  labs(
    title = "Top Breeds Life Expectancy vs. Average Height & Weight",
    x = "Average Height (cm)",
    y = "Average Life Expectancy (years)",
    color = "Average Weight (kg)",
    size = "Average Weight (kg)"
  ) +
  theme(legend.position = "right") +
  guides(
    size = guide_legend(override.aes = list(color = flatUIPalette["belize_hole"])),
    color = "none"
  )
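
The strength and direction of the trend can be quantified alongside the plot. The following is a minimal base R sketch on hypothetical averages for six breeds (the values are invented for illustration, not dataset values). It computes the Pearson correlation and the slope of the least-squares line, which is the kind of fit `geom_smooth(method = 'lm')` draws.

```r
# Hypothetical averages for six breeds (illustrative only, not dataset values)
toy <- data.frame(
  AverageHeight = c(20, 30, 40, 50, 60, 70),          # cm
  AverageLifeExpectancy = c(14, 15, 13, 12, 12.5, 10) # years
)

# Pearson correlation: the sign gives the direction, the magnitude the strength
r <- cor(toy$AverageHeight, toy$AverageLifeExpectancy)

# Slope of the least-squares line of life expectancy on height
fit <- lm(AverageLifeExpectancy ~ AverageHeight, data = toy)
slope <- unname(coef(fit)["AverageHeight"])

print(c(correlation = r, slope_years_per_cm = slope))
```

A negative slope means each additional centimetre of height is associated with a slightly shorter expected lifespan in the toy data; it says nothing about causation.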

Top 25% Breeds Energy Level vs. Trainability Plot

This visual shows the number of dog breeds that fall into each combination of energy level and trainability. The energy level categories, in the order used on the chart, are:

Couch Potato	Calm	Regular Exercise
Needs Lots of Activity	Energetic

Trainability does not necessarily correspond with energy level: the chart shows no clear pattern between a dog breed’s energy level and its trainability. For example, there are breeds that are considered easy to train in all five energy level categories.

The data shows that the largest number of breeds fall into the middle categories of both energy level and trainability. This suggests that most popular breeds tend to have moderate levels of energy and require an average level of training effort.

# Keep rows with a usable trainability value and energy level category
df_train <- top_25_percent %>%
  filter(
    !is.na(`Trainability Value`),
    `Trainability Value` != "",
    `Trainability Value` != 0,
    !is.na(`Energy Level Category`),
    `Energy Level Category` != "",
    `Energy Level Category` != 0
  )

# Cut the `Trainability Value` into bins of width 0.01
df_train$TrainabilityLevelVal <- cut(df_train$`Trainability Value`,
                                     breaks = seq(0, 1, by = .01),
                                     labels = FALSE)

# Summarize the count data
count_data <- df_train %>%
  group_by(`Energy Level Category`, TrainabilityLevelVal) %>%
  summarize(BreedCount = n(), .groups = 'drop')

# Define the custom order for the Energy Level Category
energy_levels_order <-
  c("Couch Potato",
    "Calm",
    "Regular Exercise",
    "Needs Lots of Activity",
    "Energetic")

# Convert Energy Level Category to a factor with the specified order
count_data$`Energy Level Category` <-
  factor(count_data$`Energy Level Category`, levels = energy_levels_order)

ggplot(count_data,
       aes(
         x = `Energy Level Category`,
         y = as.factor(TrainabilityLevelVal),
         fill = BreedCount
       )) +
  geom_tile(color = "white",
            lwd = 1.5,
            linetype = 1) +
  geom_text(aes(label = BreedCount)) +
  scale_fill_gradientn(colors = c(flatUIPalette["peter_river"], flatUIPalette["sun_flower"], flatUIPalette["alizarin"])) +
  theme_modern() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.text.x = element_text(
      angle = 30,
      vjust = 0.8,
      hjust = 0.8
    ),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold")
  ) +
  labs(title = "Top 25% of Breeds by Energy Level Category and Trainability",
       x = "Energy Level Category",
       y = "Trainability Level",
       fill = "Breed Count")
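
The counts behind the tiles come from binning the trainability value and cross-tabulating it against the energy level category. The following is a minimal base R sketch of that idea on five hypothetical breeds. The coarser five-bin breaks and the bin labels ("Very low" to "Very high") are assumptions made for readability and differ from the 0.01-wide bins used in the analysis.

```r
# Hypothetical breeds (illustrative only, not dataset values)
toy <- data.frame(
  Energy = c("Calm", "Energetic", "Regular Exercise", "Regular Exercise", "Energetic"),
  Trainability = c(0.1, 0.9, 0.5, 0.5, 0.5)
)

# Five equal-width bins; each interval is open on the left and closed on the right
toy$TrainBin <- cut(toy$Trainability,
                    breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                    labels = c("Very low", "Low", "Medium", "High", "Very high"))

# Fix the category order so the table rows follow the order used on the chart
toy$Energy <- factor(toy$Energy,
                     levels = c("Couch Potato", "Calm", "Regular Exercise",
                                "Needs Lots of Activity", "Energetic"))

# Cross-tabulate: one cell per energy category and trainability bin
counts <- table(toy$Energy, toy$TrainBin)
print(counts)
```

Because the energy level is a factor with all five levels declared, categories with no breeds still appear in the table as rows of zeros, which makes empty combinations explicit.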

Conclusion

The findings shed light on the diverse attributes that define each breed, from their physical characteristics to behavioral traits. By analyzing aspects such as popularity trends, shedding patterns, life expectancy, and the relationship between energy levels and trainability, the study provides insight into what makes certain breeds stand out among the rest. This analysis not only serves as a valuable resource for prospective dog owners making informed decisions but also contributes to the broader understanding of canine characteristics and their implications in the realms of breeding, training, and pet ownership. Understanding the intricate relationship between a dog’s inherent traits and its appeal to humans helps to explain why we’ve bonded across species lines for many millennia.