Download chickens.csv to your working directory. Make sure to set your working directory appropriately! This dataset was created by modifying the R built-in dataset chickwts.
Import the chickens.csv data into R. Store it in a data.frame named ch_df and print out the entire ch_df to the screen.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
ch_df <- read_csv("chickens.csv")
## Rows: 71 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): weight, feed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ch_df
## # A tibble: 71 × 2
## weight feed
## <chr> <chr>
## 1 206 meatmeal
## 2 140 horsebean
## 3 <NA> <NA>
## 4 318 sunflower
## 5 332 casein
## 6 na horsebean
## 7 216 na
## 8 143 horsebean
## 9 271 soybean
## 10 315 meatmeal
## # … with 61 more rows
There are some missing values in this dataset. Unfortunately they are represented in a number of different ways.
num_na <- sum(is.na(ch_df))
# There are 12 NA elements in the original ch_df
ch_df <- ch_df %>%
  # Recode the non-standard missing-value markers as NA in every column
  mutate(across(everything(), ~ ifelse(.x %in% c("-", "?", "na"), NA, .x)))
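An alternative to recoding after import is to declare the missing-value markers when the file is read. The sketch below uses base R's read.csv() with its na.strings argument on a few made-up rows (not taken from chickens.csv) so that it runs on its own; readr's read_csv() accepts a similar na argument.

```r
# Made-up sample rows for illustration only, not the real chickens.csv
sample_csv <- "weight,feed
206,meatmeal
na,horsebean
216,-
?,soybean"

# Treat every listed marker as a missing value at import time
toy <- read.csv(text = sample_csv,
                na.strings = c("", "NA", "na", "-", "?"))

# weight is parsed as a number because the markers never reach type detection
str(toy)
colSums(is.na(toy))
```

With this approach the weight column would not need a separate as.numeric() step, provided every marker used in the file is listed.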
Now that the dataset is clean, let’s see what percentage of our data is missing.
weight_na_pct <- round(sum(is.na(ch_df$weight)) / nrow(ch_df) * 100, 2)
feed_na_pct <- round(sum(is.na(ch_df$feed)) / nrow(ch_df) * 100, 2)
# Recount the NAs after cleaning; num_na was computed before the markers were recoded
total_na_pct <- round(sum(is.na(ch_df)) / (nrow(ch_df) * ncol(ch_df)) * 100, 2)
cat("Percentage of missing data in weight column:", weight_na_pct, "%\n")
## Percentage of missing data in weight column: 16.9 %
cat("Percentage of missing data in feed column:", feed_na_pct, "%\n")
## Percentage of missing data in feed column: 14.08 %
cat("Percentage of missing data in entire dataset:", total_na_pct, "%\n")
## Percentage of missing data in entire dataset: 15.49 %
EXTRA CREDIT (Optional): Figure out how to create these print statements so that the name and percentage number are not hard-coded into the statement. In other words, so that the name and percentage number are read in dynamically (for example, from a variable, from a function call, etc.) instead of just written in the statement. Please ask me for clarification if necessary.
# fill in your code here
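One possible approach to the extra credit, as a sketch only: loop over the column names so that both the name and the percentage come from the data frame itself. The helper report_na_pct() and the toy data frame are hypothetical names introduced for illustration.

```r
# Hypothetical helper: print the percentage of missing values for each column
# and for the whole data frame, reading names and numbers from the data itself
report_na_pct <- function(df) {
  for (col in names(df)) {
    pct <- round(mean(is.na(df[[col]])) * 100, 2)
    cat("Percentage of missing data in", col, "column:", pct, "%\n")
  }
  total_pct <- round(mean(is.na(df)) * 100, 2)
  cat("Percentage of missing data in entire dataset:", total_pct, "%\n")
}

# Made-up data frame for illustration, not chickens.csv
toy_df <- data.frame(weight = c(206, NA, 318, 332),
                     feed = c("meatmeal", "horsebean", NA, NA))
report_na_pct(toy_df)
```

Calling report_na_pct(ch_df) would produce the same three statements for the chickens data without any column name written into the code.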
ch_df <- ch_df %>%
  mutate(weight = as.numeric(weight))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `weight = as.numeric(weight)`.
## Caused by warning:
## ! NAs introduced by coercion
# weight was read as character, which caused errors later on, so it is converted to numeric here.
# The warning shows that some remaining entries were not valid numbers and became NA.
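The coercion warning means that some entries in weight were still not valid numbers when as.numeric() ran. A minimal sketch for listing such entries before converting is shown below; non_numeric_values() and the toy vector are hypothetical and do not come from chickens.csv.

```r
# Hypothetical helper: return the distinct non-missing entries
# that as.numeric() cannot parse
non_numeric_values <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  unique(x[!is.na(x) & is.na(parsed)])
}

# Made-up character vector for illustration
toy_weight <- c("206", "140", NA, "N/A", "332", "n/a", "N/A")
non_numeric_values(toy_weight)
```

Running the helper on the character version of ch_df$weight would show which extra markers still need to be recoded as NA.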
feed_stats <- ch_df %>%
  group_by(feed) %>%
  summarize(weight_mean = mean(weight, na.rm = TRUE),
            weight_median = median(weight, na.rm = TRUE))
feed_stats
## # A tibble: 9 × 3
## feed weight_mean weight_median
## <chr> <dbl> <dbl>
## 1 casein 314. 325
## 2 horsebean 161. 160
## 3 linseed 232. 236.
## 4 meatmeal 304. 315
## 5 not sure 329 329
## 6 soybean 242. 249
## 7 sunflower 353. 340
## 8 unknown 263 263
## 9 <NA> 241. 217
max_median_feed <- feed_stats %>%
  filter(weight_median == max(weight_median)) %>%
  pull(feed)
Sunflower has the maximum median chicken weight (340).
hist(ch_df$weight)
boxplot(weight ~ feed, data = ch_df, xlab = "Feed", ylab = "Weight", main = "Chicken Weights by Feed Type")
The charts are consistent with my median and mean calculations. There appears to be one outlier in the box plot for horsebean.
ch_df %>%
  group_by(feed) %>%
  summarize(min = min(weight, na.rm = TRUE),
            q1 = quantile(weight, 0.25, na.rm = TRUE),
            median = median(weight, na.rm = TRUE),
            q3 = quantile(weight, 0.75, na.rm = TRUE),
            max = max(weight, na.rm = TRUE),
            iqr = IQR(weight, na.rm = TRUE))
## # A tibble: 9 × 7
## feed min q1 median q3 max iqr
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 casein 222 277. 325 356 379 78.8
## 2 horsebean 108 142. 160 174. 227 32
## 3 linseed 148 205 236. 263. 309 57.8
## 4 meatmeal 206 280. 315 334. 380 54
## 5 not sure 329 329 329 329 329 0
## 6 soybean 158 225 249 268 327 43
## 7 sunflower 318 328 340 366. 423 38.5
## 8 unknown 263 263 263 263 263 0
## 9 <NA> 141 169 217 295 404 126
The five-number summary confirms that 227 is an outlier in horsebean: it lies above the upper fence of roughly q3 + 1.5 × IQR ≈ 174 + 1.5 × 32 = 222.
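The outlier check can also be done in code with the usual 1.5 × IQR rule. The sketch below is one way to do it in base R; iqr_outliers() is a hypothetical helper, the toy vector is made up, and the last line runs on the built-in chickwts data rather than chickens.csv, so its results will differ from the table above. Note that boxplot() computes its fences from Tukey hinges, which can differ slightly from quantile().

```r
# Hypothetical helper: return the values of x outside the 1.5 * IQR fences
iqr_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# Made-up weights for illustration: q1 = 2, q3 = 4, so the fences are -1 and 7
toy <- c(1, 2, 3, 4, 100)
iqr_outliers(toy)

# Apply the same check to each feed group of the built-in chickwts data
lapply(split(chickwts$weight, chickwts$feed), iqr_outliers)
```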
ggplot(data = ch_df, aes(x = feed, y = weight)) +
  geom_boxplot() +
  labs(x = "Feed", y = "Weight", title = "Chicken Weights by Feed Type")
## Warning: Removed 15 rows containing non-finite values (`stat_boxplot()`).