Instructions

This exam is covered by the UHD honor code. Turning in this exam acknowledges that all work submitted on this exam is your own, and that you have not used any off limit resource. You may use the R help files and your notes as needed.

Be sure to commit your rmd and your knitted results to GitHub.

You are submitting this exam under the UHD honor code. Answer all below questions on the Exam2_template_S24.Rmd provided on Github. Complete the questions, knit the file to html, commit and push to your repository BEFORE 1:15. Any commit that is pushed after 1:15 will not be graded.

Code clarity and organization will be 5 points of your total grade. Make sure you have comments where needed and your code is easy to follow (and grade), do not print out whole dataframes. Please read this document carefully to make sure that you don’t miss any requested task.

0 - Loading in Data

Use this code to read in your data.

AKCdata is data about 136 breeds of dogs, from the American Kennel Club. There are variables with missing data.

AKCdata <- read_csv("https://raw.githubusercontent.com/kshoemaker/Exam2Data/main/best_in_show.csv", show_col_types = F, na = c("", "no data","-")) %>%
  mutate(intelligence_category = factor(intelligence_category, levels = c("Unknown", "Lowest", "Fair", "Average", "Above average", "Excellent", "Brightest"), ordered = T, exclude = NULL), 
         Suitability_for_Children = factor(Suitability_for_Children, levels = c("Unknown", "Low", "Medium", "High"), ordered = T, exclude = NULL))

sizedata <- read_csv("https://raw.githubusercontent.com/kshoemaker/Exam2Data/main/DogSize.csv", show_col_types = F)

1 - Dog Info - EDA using dplyr and ggplot

We will start with AKCdata, information about dog breeds from the American Kennel Club.

glimpse(AKCdata)
## Rows: 136
## Columns: 11
## $ Dog_breed                <chr> "Portuguese Water Dog", "Finnish Lapphund", "…
## $ category                 <chr> "working", "herding", "terrier", "working", "…
## $ Popularity               <dbl> 56, 104, 106, 111, 112, 141, 142, 150, 152, 1…
## $ INTELLIGENCE_ranking     <dbl> NA, NA, NA, NA, 34, 34, 47, 24, NA, 41, NA, N…
## $ Average_Lifetime         <dbl> 11.42, 7.33, 8.42, 10.75, 12.77, 9.90, 8.70, …
## $ Suitability_for_Children <ord> Unknown, Low, Low, Medium, Low, Medium, Unkno…
## $ size_category            <chr> "medium", "medium", "small", "large", "medium…
## $ intelligence_category    <ord> Unknown, Unknown, Unknown, Unknown, Above ave…
## $ Average_Price            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ price_bracket            <chr> "Unknown", "Unknown", "Unknown", "Unknown", "…
## $ Food_cost_per_year       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
summary(AKCdata)
##   Dog_breed           category           Popularity     INTELLIGENCE_ranking
##  Length:136         Length:136         Min.   :  1.00   Min.   : 1.00       
##  Class :character   Class :character   1st Qu.: 36.50   1st Qu.:27.75       
##  Mode  :character   Mode  :character   Median : 76.50   Median :42.50       
##                                        Mean   : 78.88   Mean   :41.46       
##                                        3rd Qu.:120.25   3rd Qu.:57.25       
##                                        Max.   :169.00   Max.   :80.00       
##                                                         NA's   :24          
##  Average_Lifetime Suitability_for_Children size_category     
##  Min.   : 1.830   Unknown:24               Length:136        
##  1st Qu.: 9.623   Low    :71               Class :character  
##  Median :11.280   Medium :32               Mode  :character  
##  Mean   :10.890   High   : 9                                 
##  3rd Qu.:12.338                                              
##  Max.   :16.500                                              
##                                                              
##    intelligence_category Average_Price    price_bracket      Food_cost_per_year
##  Unknown      :24        Min.   : 283.0   Length:136         Min.   : 197.0    
##  Lowest       :11        1st Qu.: 608.5   Class :character   1st Qu.: 278.0    
##  Fair         :21        Median : 800.0   Mode  :character   Median : 400.0    
##  Average      :30        Mean   : 893.7                      Mean   : 410.9    
##  Above average:24        3rd Qu.:1038.2                      3rd Qu.: 400.0    
##  Excellent    :16        Max.   :3460.0                      Max.   :1044.0    
##  Brightest    :10        NA's   :18                          NA's   :49

Using dplyr and the data, answer the following questions:

AKCdata %>% arrange(Popularity) %>% slice(1:10)
## # A tibble: 10 × 11
##    Dog_breed          category  Popularity INTELLIGENCE_ranking Average_Lifetime
##    <chr>              <chr>          <dbl>                <dbl>            <dbl>
##  1 Labrador Retriever sporting           1                    7            12.0 
##  2 German Shepherd    herding            2                    3             9.73
##  3 Beagle             hound              3                   73            12.3 
##  4 Golden Retriever   sporting           4                    4            12.0 
##  5 Yorkshire Terrier  toy                5                   27            12.6 
##  6 Bulldog            non-spor…          6                   78             6.29
##  7 Boxer              working            7                   48             8.81
##  8 Poodle             non-spor…          8                    2            12.0 
##  9 Dachshund          hound              9                   49            12.6 
## 10 Rottweiler         working           10                    9             9.11
## # ℹ 6 more variables: Suitability_for_Children <ord>, size_category <chr>,
## #   intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## #   Food_cost_per_year <dbl>
AKCdata %>% arrange(INTELLIGENCE_ranking) %>% slice(1:10)
## # A tibble: 10 × 11
##    Dog_breed           category Popularity INTELLIGENCE_ranking Average_Lifetime
##    <chr>               <chr>         <dbl>                <dbl>            <dbl>
##  1 Border Collie       herding          45                    1            12.5 
##  2 Poodle              non-spo…          8                    2            12.0 
##  3 German Shepherd     herding           2                    3             9.73
##  4 Golden Retriever    sporting          4                    4            12.0 
##  5 Doberman Pinscher   working          13                    5            10.3 
##  6 Shetland Sheepdog   herding          20                    6            12.5 
##  7 Labrador Retriever  sporting          1                    7            12.0 
##  8 Papillon            toy              38                    8            13   
##  9 Rottweiler          working          10                    9             9.11
## 10 Australian Cattle … herding          60                   10            11.7 
## # ℹ 6 more variables: Suitability_for_Children <ord>, size_category <chr>,
## #   intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## #   Food_cost_per_year <dbl>
AKCdata %>% arrange(-Average_Lifetime) %>% slice (1:5)
## # A tibble: 5 × 11
##   Dog_breed        category     Popularity INTELLIGENCE_ranking Average_Lifetime
##   <chr>            <chr>             <dbl>                <dbl>            <dbl>
## 1 Chihuahua        toy                  14                   67             16.5
## 2 Canaan Dog       herding             168                   NA             14.7
## 3 Tibetan Spaniel  non-sporting        114                   46             14.4
## 4 Silky Terrier    toy                  85                   37             14.2
## 5 Swedish Vallhund herding             153                   NA             14.2
## # ℹ 6 more variables: Suitability_for_Children <ord>, size_category <chr>,
## #   intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## #   Food_cost_per_year <dbl>
AKCdata %>% group_by(category) %>% summarise(max(Average_Lifetime))
## # A tibble: 7 × 2
##   category     `max(Average_Lifetime)`
##   <chr>                          <dbl>
## 1 herding                         14.7
## 2 hound                           13.6
## 3 non-sporting                    14.4
## 4 sporting                        12.9
## 5 terrier                         14  
## 6 toy                             16.5
## 7 working                         12.6
# Enter number of dog do you want to suggest
numdog_suggest = 3

AKCdata %>% filter(Suitability_for_Children =="High") %>% 
  arrange(-Average_Lifetime) %>% 
  slice(1:numdog_suggest)
## # A tibble: 3 × 11
##   Dog_breed         category Popularity INTELLIGENCE_ranking Average_Lifetime
##   <chr>             <chr>         <dbl>                <dbl>            <dbl>
## 1 Chihuahua         toy              14                   67             16.5
## 2 Dachshund         hound             9                   49             12.6
## 3 Yorkshire Terrier toy               5                   27             12.6
## # ℹ 6 more variables: Suitability_for_Children <ord>, size_category <chr>,
## #   intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## #   Food_cost_per_year <dbl>
AKCdata %>% group_by(category) %>% summarise(count = n()) %>% arrange(-count)
## # A tibble: 7 × 2
##   category     count
##   <chr>        <int>
## 1 sporting        25
## 2 working         23
## 3 terrier         22
## 4 herding         17
## 5 hound           17
## 6 toy             17
## 7 non-sporting    15
AKCdata %>%
  group_by(category) %>%
  summarise(mean_Average_Price = mean(Average_Price, na.rm = TRUE), mean_Popularity = mean(Popularity, na.rm = TRUE)) %>% 
  arrange(-mean_Average_Price)
## # A tibble: 7 × 3
##   category     mean_Average_Price mean_Popularity
##   <chr>                     <dbl>           <dbl>
## 1 working                   1237.            65.4
## 2 non-sporting               982.            71.4
## 3 terrier                    911.            99.2
## 4 hound                      790.            85.1
## 5 sporting                   782.            84.8
## 6 herding                    756.            89.3
## 7 toy                        709.            52
AKCdata %>% ggplot(aes(x= Average_Price)) + 
  geom_histogram(bins = 50, color= "black",  alpha = 0.5, position = 'identity') + 
  theme_minimal()
## Warning: Removed 18 rows containing non-finite values (`stat_bin()`).

category_summary <- AKCdata %>% group_by(category) %>%
  mutate(avg_price = mean(Average_Price,na.rm = TRUE)) %>% arrange(avg_price)

# finding the most expensive and least expensive 2 categories for each

most_expensive_category <- tail(unique(category_summary$category), 2)
least_expensive_category <- head(unique(category_summary$category), 2)

AKCdata %>% filter(category %in% c(most_expensive_category, least_expensive_category))%>%
  ggplot(aes(x = Average_Price, fill = category)) +
  geom_histogram(bins = 50, color = "black", alpha = 0.5, position = "identity") +
  facet_wrap(~category) +
  theme_minimal()
## Warning: Removed 9 rows containing non-finite values (`stat_bin()`).

2 - Relationships

Go back to the unfiltered AKCdata to answer the following questions about relationships in the data.

Answer : by the graph I can say, the Popularity and Average_Price does not have relationship, when the popularity increase but the price keep the same on average.

AKCdata %>% ggplot(aes(x= Popularity, y = Average_Price, color = Popularity)) +
  geom_point()
## Warning: Removed 18 rows containing missing values (`geom_point()`).

AKCdata %>% arrange(-Average_Price) %>% 
  slice(1:10) %>% 
  select(Dog_breed, category, Popularity, Average_Price)
## # A tibble: 10 × 4
##    Dog_breed                          category     Popularity Average_Price
##    <chr>                              <chr>             <dbl>         <dbl>
##  1 Tibetan Mastiff                    working             122          3460
##  2 Black Russian Terrier              working             128          2833
##  3 Bulldog                            non-sporting          6          2680
##  4 Norfolk Terrier                    terrier             120          2083
##  5 French Bulldog                     non-sporting         18          1900
##  6 Miniature Bull Terrier             terrier             127          1740
##  7 Spinone Italiano                   sporting            123          1725
##  8 Greater Swiss Mountain Dog         working              82          1605
##  9 Saluki                             hound               117          1525
## 10 Nova Scotia Duck Tolling Retriever sporting            107          1500
AKCdata %>% group_by(intelligence_category) %>%
  summarise(count = n()) %>%
  arrange(-count)
## # A tibble: 7 × 2
##   intelligence_category count
##   <ord>                 <int>
## 1 Average                  30
## 2 Unknown                  24
## 3 Above average            24
## 4 Fair                     21
## 5 Excellent                16
## 6 Lowest                   11
## 7 Brightest                10
ggplot(AKCdata, aes(x = category, fill = intelligence_category)) +
  geom_bar(position = "dodge", color = "black", stat = "count") +
  labs(x = "Category", y = "Count", fill = "Intelligence Category") +
  theme_minimal()

ggplot(AKCdata, aes(x = intelligence_category, fill = category)) +
  geom_bar(position = "dodge", color = "black", stat = "count") +
  labs(x = "Intelligence", y = "Count", fill = "Category") +
  theme_minimal()

library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
AKCdata_filtered <- AKCdata[!is.na(AKCdata$INTELLIGENCE_ranking), ]

bubble_plot <- ggplot(AKCdata_filtered, aes(x = INTELLIGENCE_ranking, y = Popularity, color = category)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(3, 10)) +
  labs(x = "Intelligence Ranking", y = "Popularity", title = "Bubble Plot of Popularity vs Intelligence Ranking by Category") +
  theme_minimal()

bubble_plot

3 - New Variable for Total Costs

We have the average purchase price of the dog, Average_Price, the average lifetime of the dog, Average_Lifetime, and the food cost per year, Food_cost_per_year. If we’re interested in the lifetime cost of the dog, we can find that!

(3.1) Create a new column in your dataframe called Total_cost that adds together the Average_Price of the dog breed and the Food_cost_per_year times the Average_Lifetime of that breed. Be sure to add it to your dataframe.

AKCdata <- AKCdata %>% mutate(Total_cost = Average_Price + Food_cost_per_year * Average_Lifetime )

AKCdata
## # A tibble: 136 × 12
##    Dog_breed           category Popularity INTELLIGENCE_ranking Average_Lifetime
##    <chr>               <chr>         <dbl>                <dbl>            <dbl>
##  1 Portuguese Water D… working          56                   NA            11.4 
##  2 Finnish Lapphund    herding         104                   NA             7.33
##  3 Cesky Terrier       terrier         106                   NA             8.42
##  4 Anatolian Shepherd… working         111                   NA            10.8 
##  5 Bearded Collie      herding         112                   34            12.8 
##  6 Field Spaniel       sporting        141                   34             9.9 
##  7 Scottish Deerhound  hound           142                   47             8.7 
##  8 Irish Water Spaniel sporting        150                   24             9.33
##  9 Löwchen             non-spo…        152                   NA            10   
## 10 Curly Coated Retri… sporting        154                   41            10.8 
## # ℹ 126 more rows
## # ℹ 7 more variables: Suitability_for_Children <ord>, size_category <chr>,
## #   intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## #   Food_cost_per_year <dbl>, Total_cost <dbl>

(3.2) Let’s check again now that we have this total cost value - is there a relationship between Total_cost and Popularity?

AKCdata %>% ggplot(aes(x= Popularity, y = Total_cost, color = Popularity)) +
  geom_point()
## Warning: Removed 49 rows containing missing values (`geom_point()`).

4 - Join in the size info

The AKC data has a size category, but nothing about their height and weight specifically. Most of the dog breeds have their sizes listed in the second “sizedata” dataframe.

(4.1) Using a join, combine the size info with the AKCdata. We are only interested in dog breeds that have AKC data and size data at this point, so consider the correct join to use.

combine_AKCdata <- left_join(AKCdata, sizedata, by = c("Dog_breed"="Breed"))
combine_AKCdata
## # A tibble: 136 × 18
##    Dog_breed           category Popularity INTELLIGENCE_ranking Average_Lifetime
##    <chr>               <chr>         <dbl>                <dbl>            <dbl>
##  1 Portuguese Water D… working          56                   NA            11.4 
##  2 Finnish Lapphund    herding         104                   NA             7.33
##  3 Cesky Terrier       terrier         106                   NA             8.42
##  4 Anatolian Shepherd… working         111                   NA            10.8 
##  5 Bearded Collie      herding         112                   34            12.8 
##  6 Field Spaniel       sporting        141                   34             9.9 
##  7 Scottish Deerhound  hound           142                   47             8.7 
##  8 Irish Water Spaniel sporting        150                   24             9.33
##  9 Löwchen             non-spo…        152                   NA            10   
## 10 Curly Coated Retri… sporting        154                   41            10.8 
## # ℹ 126 more rows
## # ℹ 13 more variables: Suitability_for_Children <ord>, size_category <chr>,
## #   intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## #   Food_cost_per_year <dbl>, Total_cost <dbl>, height_low_inches <dbl>,
## #   height_avg_inches <dbl>, height_high_inches <dbl>, weight_low_lbs <dbl>,
## #   weight_avg_lbs <dbl>, weight_high_lbs <dbl>

(4.2) It is said that small dogs live longer. Choose an appropriate tool (or tools) to investigate the relationship between size and average lifetime. You may use any of the variables available to you.

ggplot(combine_AKCdata, aes(x = Average_Lifetime , y = weight_avg_lbs)) +
  geom_point() +
  labs(x = "Average Lifetime", y = "Weight") +
  theme_minimal()
## Warning: Removed 19 rows containing missing values (`geom_point()`).

5 - Your Turn

Using one or both of these dataframes of dogs, state a question and create a plot (or plots) of your own to answer that question. You may use the original dataframes, the filtered dataframes, the joined dataframe, or any combination of your choice.

Your question doesn’t have to have an interesting answer, but your plot must answer the question. However, extra credit will be given for interesting plots and questions.

Question : find the dog breed that is less expensive but popular. Answer : Affenpinscher

less_expensive_popular <- AKCdata %>% na.omit() %>%
  arrange(Total_cost) %>%
  slice_head(n = 5)

# Create a bar plot to visualize the popularity of less expensive dog breeds
ggplot(less_expensive_popular, aes(x = reorder(Dog_breed, Popularity), y = Popularity)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Breed", y = "Popularity", title = "Popularity of Less Expensive Dog Breeds") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))