This exam is covered by the UHD honor code. Turning in this exam acknowledges that all work submitted on this exam is your own, and that you have not used any off limit resource. You may use the R help files and your notes as needed.
Be sure to commit your rmd and your knitted results to GitHub.
You are submitting this exam under the UHD honor code. Answer all
below questions on the Exam2_template_S24.Rmd provided on
Github. Complete the questions, knit the file to html, commit and push
to your repository BEFORE 1:15. Any commit that is
pushed after 1:15 will not be graded.
Code clarity and organization will be 5 points of your total grade. Make sure you have comments where needed and your code is easy to follow (and grade), do not print out whole dataframes. Please read this document carefully to make sure that you don’t miss any requested task.
Use this code to read in your data.
AKCdata is data about 136 breeds of dogs, from the American Kennel Club. There are variables with missing data.
AKCdata <- read_csv("https://raw.githubusercontent.com/kshoemaker/Exam2Data/main/best_in_show.csv", show_col_types = F, na = c("", "no data","-")) %>%
mutate(intelligence_category = factor(intelligence_category, levels = c("Unknown", "Lowest", "Fair", "Average", "Above average", "Excellent", "Brightest"), ordered = T, exclude = NULL),
Suitability_for_Children = factor(Suitability_for_Children, levels = c("Unknown", "Low", "Medium", "High"), ordered = T, exclude = NULL))
sizedata <- read_csv("https://raw.githubusercontent.com/kshoemaker/Exam2Data/main/DogSize.csv", show_col_types = F)
We will start with AKCdata, information about dog breeds
from the American Kennel Club.
glimpse(AKCdata)
## Rows: 136
## Columns: 11
## $ Dog_breed <chr> "Portuguese Water Dog", "Finnish Lapphund", "…
## $ category <chr> "working", "herding", "terrier", "working", "…
## $ Popularity <dbl> 56, 104, 106, 111, 112, 141, 142, 150, 152, 1…
## $ INTELLIGENCE_ranking <dbl> NA, NA, NA, NA, 34, 34, 47, 24, NA, 41, NA, N…
## $ Average_Lifetime <dbl> 11.42, 7.33, 8.42, 10.75, 12.77, 9.90, 8.70, …
## $ Suitability_for_Children <ord> Unknown, Low, Low, Medium, Low, Medium, Unkno…
## $ size_category <chr> "medium", "medium", "small", "large", "medium…
## $ intelligence_category <ord> Unknown, Unknown, Unknown, Unknown, Above ave…
## $ Average_Price <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ price_bracket <chr> "Unknown", "Unknown", "Unknown", "Unknown", "…
## $ Food_cost_per_year <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
summary(AKCdata)
## Dog_breed category Popularity INTELLIGENCE_ranking
## Length:136 Length:136 Min. : 1.00 Min. : 1.00
## Class :character Class :character 1st Qu.: 36.50 1st Qu.:27.75
## Mode :character Mode :character Median : 76.50 Median :42.50
## Mean : 78.88 Mean :41.46
## 3rd Qu.:120.25 3rd Qu.:57.25
## Max. :169.00 Max. :80.00
## NA's :24
## Average_Lifetime Suitability_for_Children size_category
## Min. : 1.830 Unknown:24 Length:136
## 1st Qu.: 9.623 Low :71 Class :character
## Median :11.280 Medium :32 Mode :character
## Mean :10.890 High : 9
## 3rd Qu.:12.338
## Max. :16.500
##
## intelligence_category Average_Price price_bracket Food_cost_per_year
## Unknown :24 Min. : 283.0 Length:136 Min. : 197.0
## Lowest :11 1st Qu.: 608.5 Class :character 1st Qu.: 278.0
## Fair :21 Median : 800.0 Mode :character Median : 400.0
## Average :30 Mean : 893.7 Mean : 410.9
## Above average:24 3rd Qu.:1038.2 3rd Qu.: 400.0
## Excellent :16 Max. :3460.0 Max. :1044.0
## Brightest :10 NA's :18 NA's :49
Using dplyr and the data, answer the following questions:
Popularity and
INTELLIGENCE_ranking, 1 is the highest ranking, so the most
popular dog will have Popularity of 1.AKCdata %>% arrange(Popularity) %>% slice(1:10)
## # A tibble: 10 × 11
## Dog_breed category Popularity INTELLIGENCE_ranking Average_Lifetime
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Labrador Retriever sporting 1 7 12.0
## 2 German Shepherd herding 2 3 9.73
## 3 Beagle hound 3 73 12.3
## 4 Golden Retriever sporting 4 4 12.0
## 5 Yorkshire Terrier toy 5 27 12.6
## 6 Bulldog non-spor… 6 78 6.29
## 7 Boxer working 7 48 8.81
## 8 Poodle non-spor… 8 2 12.0
## 9 Dachshund hound 9 49 12.6
## 10 Rottweiler working 10 9 9.11
## # ℹ 6 more variables: Suitability_for_Children <ord>, size_category <chr>,
## # intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## # Food_cost_per_year <dbl>
AKCdata %>% arrange(INTELLIGENCE_ranking) %>% slice(1:10)
## # A tibble: 10 × 11
## Dog_breed category Popularity INTELLIGENCE_ranking Average_Lifetime
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Border Collie herding 45 1 12.5
## 2 Poodle non-spo… 8 2 12.0
## 3 German Shepherd herding 2 3 9.73
## 4 Golden Retriever sporting 4 4 12.0
## 5 Doberman Pinscher working 13 5 10.3
## 6 Shetland Sheepdog herding 20 6 12.5
## 7 Labrador Retriever sporting 1 7 12.0
## 8 Papillon toy 38 8 13
## 9 Rottweiler working 10 9 9.11
## 10 Australian Cattle … herding 60 10 11.7
## # ℹ 6 more variables: Suitability_for_Children <ord>, size_category <chr>,
## # intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## # Food_cost_per_year <dbl>
average_lifetime?AKCdata %>% arrange(-Average_Lifetime) %>% slice (1:5)
## # A tibble: 5 × 11
## Dog_breed category Popularity INTELLIGENCE_ranking Average_Lifetime
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chihuahua toy 14 67 16.5
## 2 Canaan Dog herding 168 NA 14.7
## 3 Tibetan Spaniel non-sporting 114 46 14.4
## 4 Silky Terrier toy 85 37 14.2
## 5 Swedish Vallhund herding 153 NA 14.2
## # ℹ 6 more variables: Suitability_for_Children <ord>, size_category <chr>,
## # intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## # Food_cost_per_year <dbl>
category?AKCdata %>% group_by(category) %>% summarise(max(Average_Lifetime))
## # A tibble: 7 × 2
## category `max(Average_Lifetime)`
## <chr> <dbl>
## 1 herding 14.7
## 2 hound 13.6
## 3 non-sporting 14.4
## 4 sporting 12.9
## 5 terrier 14
## 6 toy 16.5
## 7 working 12.6
Suitability_for_Children and that live a long time. Use
this data to find the ideal dogs for me.# Enter number of dog do you want to suggest
numdog_suggest = 3
AKCdata %>% filter(Suitability_for_Children =="High") %>%
arrange(-Average_Lifetime) %>%
slice(1:numdog_suggest)
## # A tibble: 3 × 11
## Dog_breed category Popularity INTELLIGENCE_ranking Average_Lifetime
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chihuahua toy 14 67 16.5
## 2 Dachshund hound 9 49 12.6
## 3 Yorkshire Terrier toy 5 27 12.6
## # ℹ 6 more variables: Suitability_for_Children <ord>, size_category <chr>,
## # intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## # Food_cost_per_year <dbl>
category?AKCdata %>% group_by(category) %>% summarise(count = n()) %>% arrange(-count)
## # A tibble: 7 × 2
## category count
## <chr> <int>
## 1 sporting 25
## 2 working 23
## 3 terrier 22
## 4 herding 17
## 5 hound 17
## 6 toy 17
## 7 non-sporting 15
Average_Price and Popularity. Arrange your
results by average price.AKCdata %>%
group_by(category) %>%
summarise(mean_Average_Price = mean(Average_Price, na.rm = TRUE), mean_Popularity = mean(Popularity, na.rm = TRUE)) %>%
arrange(-mean_Average_Price)
## # A tibble: 7 × 3
## category mean_Average_Price mean_Popularity
## <chr> <dbl> <dbl>
## 1 working 1237. 65.4
## 2 non-sporting 982. 71.4
## 3 terrier 911. 99.2
## 4 hound 790. 85.1
## 5 sporting 782. 84.8
## 6 herding 756. 89.3
## 7 toy 709. 52
Average_Price of all the
dogs.AKCdata %>% ggplot(aes(x= Average_Price)) +
geom_histogram(bins = 50, color= "black", alpha = 0.5, position = 'identity') +
theme_minimal()
## Warning: Removed 18 rows containing non-finite values (`stat_bin()`).
category_summary <- AKCdata %>% group_by(category) %>%
mutate(avg_price = mean(Average_Price,na.rm = TRUE)) %>% arrange(avg_price)
# finding the most expensive and least expensive 2 categories for each
most_expensive_category <- tail(unique(category_summary$category), 2)
least_expensive_category <- head(unique(category_summary$category), 2)
AKCdata %>% filter(category %in% c(most_expensive_category, least_expensive_category))%>%
ggplot(aes(x = Average_Price, fill = category)) +
geom_histogram(bins = 50, color = "black", alpha = 0.5, position = "identity") +
facet_wrap(~category) +
theme_minimal()
## Warning: Removed 9 rows containing non-finite values (`stat_bin()`).
Go back to the unfiltered AKCdata to answer the following questions about relationships in the data.
Popularity
and the Average_Price? Use an appropriate plot to
investigate visually.Answer : by the graph I can say, the Popularity and Average_Price does not have relationship, when the popularity increase but the price keep the same on average.
AKCdata %>% ggplot(aes(x= Popularity, y = Average_Price, color = Popularity)) +
geom_point()
## Warning: Removed 18 rows containing missing values (`geom_point()`).
Dog_breed, category, Popularity
and Average_PriceAKCdata %>% arrange(-Average_Price) %>%
slice(1:10) %>%
select(Dog_breed, category, Popularity, Average_Price)
## # A tibble: 10 × 4
## Dog_breed category Popularity Average_Price
## <chr> <chr> <dbl> <dbl>
## 1 Tibetan Mastiff working 122 3460
## 2 Black Russian Terrier working 128 2833
## 3 Bulldog non-sporting 6 2680
## 4 Norfolk Terrier terrier 120 2083
## 5 French Bulldog non-sporting 18 1900
## 6 Miniature Bull Terrier terrier 127 1740
## 7 Spinone Italiano sporting 123 1725
## 8 Greater Swiss Mountain Dog working 82 1605
## 9 Saluki hound 117 1525
## 10 Nova Scotia Duck Tolling Retriever sporting 107 1500
intelligence_category? You may answer with a summary table
or with an appropriate graph.AKCdata %>% group_by(intelligence_category) %>%
summarise(count = n()) %>%
arrange(-count)
## # A tibble: 7 × 2
## intelligence_category count
## <ord> <int>
## 1 Average 30
## 2 Unknown 24
## 3 Above average 24
## 4 Fair 21
## 5 Excellent 16
## 6 Lowest 11
## 7 Brightest 10
(2.4) What is the relationship between category and
intelligence_category? Use an appropriate graph to
look.
Answer: base on the graph we can easier see the proportion of intelligent dog in each category or proportion of each kind of dog in each intelligent category. For example : in sporting they choose dog with excellent or above average intelligent while for working they don’t care about that.
ggplot(AKCdata, aes(x = category, fill = intelligence_category)) +
geom_bar(position = "dodge", color = "black", stat = "count") +
labs(x = "Category", y = "Count", fill = "Intelligence Category") +
theme_minimal()
ggplot(AKCdata, aes(x = intelligence_category, fill = category)) +
geom_bar(position = "dodge", color = "black", stat = "count") +
labs(x = "Intelligence", y = "Count", fill = "Category") +
theme_minimal()
category,
Popularity and the Intelligence_ranking? Use
an appropriate plot to find out.library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
AKCdata_filtered <- AKCdata[!is.na(AKCdata$INTELLIGENCE_ranking), ]
bubble_plot <- ggplot(AKCdata_filtered, aes(x = INTELLIGENCE_ranking, y = Popularity, color = category)) +
geom_point(alpha = 0.7) +
scale_size_continuous(range = c(3, 10)) +
labs(x = "Intelligence Ranking", y = "Popularity", title = "Bubble Plot of Popularity vs Intelligence Ranking by Category") +
theme_minimal()
bubble_plot
We have the average purchase price of the dog,
Average_Price, the average lifetime of the dog,
Average_Lifetime, and the food cost per year,
Food_cost_per_year. If we’re interested in the lifetime
cost of the dog, we can find that!
(3.1) Create a new column in your dataframe called
Total_cost that adds together the
Average_Price of the dog breed and the
Food_cost_per_year times the Average_Lifetime
of that breed. Be sure to add it to your dataframe.
AKCdata <- AKCdata %>% mutate(Total_cost = Average_Price + Food_cost_per_year * Average_Lifetime )
AKCdata
## # A tibble: 136 × 12
## Dog_breed category Popularity INTELLIGENCE_ranking Average_Lifetime
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Portuguese Water D… working 56 NA 11.4
## 2 Finnish Lapphund herding 104 NA 7.33
## 3 Cesky Terrier terrier 106 NA 8.42
## 4 Anatolian Shepherd… working 111 NA 10.8
## 5 Bearded Collie herding 112 34 12.8
## 6 Field Spaniel sporting 141 34 9.9
## 7 Scottish Deerhound hound 142 47 8.7
## 8 Irish Water Spaniel sporting 150 24 9.33
## 9 Löwchen non-spo… 152 NA 10
## 10 Curly Coated Retri… sporting 154 41 10.8
## # ℹ 126 more rows
## # ℹ 7 more variables: Suitability_for_Children <ord>, size_category <chr>,
## # intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## # Food_cost_per_year <dbl>, Total_cost <dbl>
(3.2) Let’s check again now that we have this total cost value - is
there a relationship between Total_cost and
Popularity?
AKCdata %>% ggplot(aes(x= Popularity, y = Total_cost, color = Popularity)) +
geom_point()
## Warning: Removed 49 rows containing missing values (`geom_point()`).
The AKC data has a size category, but nothing about their height and weight specifically. Most of the dog breeds have their sizes listed in the second “sizedata” dataframe.
(4.1) Using a join, combine the size info with the AKCdata. We are only interested in dog breeds that have AKC data and size data at this point, so consider the correct join to use.
combine_AKCdata <- left_join(AKCdata, sizedata, by = c("Dog_breed"="Breed"))
combine_AKCdata
## # A tibble: 136 × 18
## Dog_breed category Popularity INTELLIGENCE_ranking Average_Lifetime
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Portuguese Water D… working 56 NA 11.4
## 2 Finnish Lapphund herding 104 NA 7.33
## 3 Cesky Terrier terrier 106 NA 8.42
## 4 Anatolian Shepherd… working 111 NA 10.8
## 5 Bearded Collie herding 112 34 12.8
## 6 Field Spaniel sporting 141 34 9.9
## 7 Scottish Deerhound hound 142 47 8.7
## 8 Irish Water Spaniel sporting 150 24 9.33
## 9 Löwchen non-spo… 152 NA 10
## 10 Curly Coated Retri… sporting 154 41 10.8
## # ℹ 126 more rows
## # ℹ 13 more variables: Suitability_for_Children <ord>, size_category <chr>,
## # intelligence_category <ord>, Average_Price <dbl>, price_bracket <chr>,
## # Food_cost_per_year <dbl>, Total_cost <dbl>, height_low_inches <dbl>,
## # height_avg_inches <dbl>, height_high_inches <dbl>, weight_low_lbs <dbl>,
## # weight_avg_lbs <dbl>, weight_high_lbs <dbl>
(4.2) It is said that small dogs live longer. Choose an appropriate tool (or tools) to investigate the relationship between size and average lifetime. You may use any of the variables available to you.
ggplot(combine_AKCdata, aes(x = Average_Lifetime , y = weight_avg_lbs)) +
geom_point() +
labs(x = "Average Lifetime", y = "Weight") +
theme_minimal()
## Warning: Removed 19 rows containing missing values (`geom_point()`).
Using one or both of these dataframes of dogs, state a question and create a plot (or plots) of your own to answer that question. You may use the original dataframes, the filtered dataframes, the joined dataframe, or any combination of your choice.
Your question doesn’t have to have an interesting answer, but your plot must answer the question. However, extra credit will be given for interesting plots and questions.
Question : find the dog breed that is less expensive but popular. Answer : Affenpinscher
less_expensive_popular <- AKCdata %>% na.omit() %>%
arrange(Total_cost) %>%
slice_head(n = 5)
# Create a bar plot to visualize the popularity of less expensive dog breeds
ggplot(less_expensive_popular, aes(x = reorder(Dog_breed, Popularity), y = Popularity)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(x = "Breed", y = "Popularity", title = "Popularity of Less Expensive Dog Breeds") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))