Download chickens.csv to your working directory. Make sure to set your working directory appropriately! This dataset was created by modifying the R built-in dataset chickwts.
Import the chickens.csv data into R. Store it in a data.frame named ch_df and print out the entire ch_df to the screen.
setwd("/Users/mikea/Desktop/Data 110 /project_chickens")
library(tidyverse)
ch_df <- read_csv("chickens.csv")
ch_df
## # A tibble: 71 × 2
## weight feed
## <chr> <chr>
## 1 206 meatmeal
## 2 140 horsebean
## 3 <NA> <NA>
## 4 318 sunflower
## 5 332 casein
## 6 na horsebean
## 7 216 na
## 8 143 horsebean
## 9 271 soybean
## 10 315 meatmeal
## # ℹ 61 more rows
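Typing ch_df on its own shows only the first 10 rows, because read_csv returns a tibble. A minimal sketch of printing every row is below; it uses a small made-up tibble (toy is hypothetical, not the chickens data) so that it runs on its own.

```r
library(tibble)

# Hypothetical 25-row tibble standing in for ch_df
toy <- tibble(id = 1:25, value = id * 2)

# n = Inf asks the tibble print method to show all rows instead of the first 10
print(toy, n = Inf)
```

The same call on the real data would be print(ch_df, n = Inf).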
There are some missing values in this dataset. Unfortunately they are represented in a number of different ways.
sum(is.na(ch_df))
## [1] 12
# R recognizes 12 values as NA; this count does not include missing values written in other ways (for example "na")
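One possible way to have R count every spelling as missing is to declare them at import time through the na argument of read_csv. The sketch below is only an illustration: the CSV text is made up, and the list of placeholder spellings in missing_tokens is an assumption about what the file contains.

```r
library(readr)

# Made-up CSV text standing in for chickens.csv (hypothetical values)
toy_csv <- "weight,feed\n206,meatmeal\nna,horsebean\nN/A,sunflower\n?,casein\n318,-\n"

# Assumed placeholder spellings; matching is exact, so each case variant is listed
missing_tokens <- c("", "NA", "na", "N/A", "n/a", "?", "-")

# I() marks the string as literal data rather than a file path
toy <- read_csv(I(toy_csv), na = missing_tokens, show_col_types = FALSE)

# Missing values per column, now including the former placeholders
colSums(is.na(toy))
```

Because the placeholders become NA before column types are guessed, weight is read as a number rather than as text in this sketch.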
#ch_df <- ch_df %>%
#na.omit(ch_df) %>%
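# Drop rows whose weight or feed is one of these placeholder strings.
# Note: %in% matches exactly and is case-sensitive, so "N/A" is not caught,
# and the string "NA" does not match a true NA; those rows remain below.
ch_df <- ch_df %>%
  filter(!(weight %in% c("?", "na", "n/a", "-", "NA")) &
           !(feed %in% c("?", "na", "n/a", "-", "NA")))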
ch_df
## # A tibble: 61 × 2
## weight feed
## <chr> <chr>
## 1 206 meatmeal
## 2 140 horsebean
## 3 <NA> <NA>
## 4 318 sunflower
## 5 332 casein
## 6 143 horsebean
## 7 271 soybean
## 8 315 meatmeal
## 9 227 horsebean
## 10 N/A sunflower
## # ℹ 51 more rows
# this was actually very hard and took me a while!
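The output above still contains a true NA row and an "N/A" entry. A sketch of a case-insensitive alternative is below. It assumes the same kind of placeholder spellings (missing_tokens is an assumed list) and runs on a made-up data frame rather than the real file. Whether such rows should be recoded to NA or dropped depends on what the exercise intends, so both steps are shown separately.

```r
library(dplyr)

# Hypothetical data standing in for ch_df
toy <- data.frame(
  weight = c("206", "na", NA, "N/A", "318", "271"),
  feed   = c("meatmeal", "horsebean", NA, "sunflower", "-", "soybean"),
  stringsAsFactors = FALSE
)

# Assumed placeholder spellings, written in lower case
missing_tokens <- c("", "?", "-", "na", "n/a")

# Step 1: recode every placeholder to a true NA (comparison ignores case and spaces)
recoded <- toy %>%
  mutate(across(everything(),
                ~ if_else(tolower(trimws(.x)) %in% missing_tokens,
                          NA_character_, .x)))

# Step 2 (optional): drop rows with any NA, then convert weight to numeric
complete <- recoded %>%
  filter(!is.na(weight), !is.na(feed)) %>%
  mutate(weight = as.numeric(weight))

complete
```

Keeping the recode and the drop as separate objects makes it possible to compute missing-data percentages on recoded before any rows are removed.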
Now that the dataset is clean, let’s see what percentage of our data is missing.
missing_weight <- sum(is.na(ch_df$weight))
percentage_missing_weight <- (missing_weight / nrow(ch_df)) * 100
missing_feed <- sum(is.na(ch_df$feed))
percentage_missing_feed <- (missing_feed / nrow(ch_df)) * 100
missing_total <- sum(rowSums(is.na(ch_df)))
percentage_missing_total <- (missing_total / (nrow(ch_df) * ncol(ch_df))) * 100
cat("Percentage of missing data in weight column:", percentage_missing_weight, "%.\n")
## Percentage of missing data in weight column: 11.47541 %.
cat("Percentage of missing data in feed column:", percentage_missing_feed, "%.\n")
## Percentage of missing data in feed column: 8.196721 %.
cat("Percentage of missing data in the entire dataset:", percentage_missing_total, "%.\n")
## Percentage of missing data in the entire dataset: 9.836066 %.
EXTRA CREDIT (Optional): Figure out how to create these print statements so that the name and percentage number are not hard-coded into the statement. In other words, so that the name and percentage number are read in dynamically (for example, from a variable, from a function call, etc.) instead of just written in the statement. Please ask me for clarification if necessary.
# fill in your code here
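One possible approach to the extra credit, sketched under assumptions: loop over the column names so that both the name and the percentage come from the data instead of being typed into the string. The data frame toy and the helper pct_missing are hypothetical names introduced for this illustration.

```r
# Hypothetical data standing in for ch_df
toy <- data.frame(
  weight = c(206, NA, 318, 332),
  feed   = c("meatmeal", "horsebean", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of NA values in a vector, as a percentage
pct_missing <- function(x) mean(is.na(x)) * 100

# Build one message per column; the name and the number are filled in by sprintf
messages <- sprintf(
  "Percentage of missing data in %s column: %.2f%%",
  names(toy),
  sapply(toy, pct_missing)
)

cat(messages, sep = "\n")
```

With the real data, replacing toy with ch_df would produce one line per column without any column name written by hand.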
ch_df$weight <- as.numeric(ch_df$weight)
## Warning: NAs introduced by coercion
grouped_data <- ch_df %>%
  group_by(feed) %>%
  summarize(weight_mean = mean(weight, na.rm = TRUE),
            weight_median = median(weight, na.rm = TRUE))
grouped_data
## # A tibble: 9 × 3
## feed weight_mean weight_median
## <chr> <dbl> <dbl>
## 1 casein 314. 325
## 2 horsebean 161. 160
## 3 linseed 232. 236.
## 4 meatmeal 304. 315
## 5 not sure 329 329
## 6 soybean 242. 249
## 7 sunflower 353. 340
## 8 unknown 263 263
## 9 <NA> 298. 286.
grouped_data$feed[which.max(grouped_data$weight_median)]
## [1] "sunflower"
# sunflower
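The summary table above still includes a group for missing feed. As a hedged alternative to which.max, the sketch below excludes rows without a feed label before summarizing and then picks the top group with slice_max. The data frame toy is made up for illustration.

```r
library(dplyr)

# Hypothetical data standing in for ch_df
toy <- data.frame(
  weight = c(300, 340, 150, NA, 280, 250),
  feed   = c("sunflower", "sunflower", "horsebean", "horsebean", NA, "casein"),
  stringsAsFactors = FALSE
)

top_feed <- toy %>%
  filter(!is.na(feed)) %>%                       # ignore rows without a feed label
  group_by(feed) %>%
  summarize(weight_median = median(weight, na.rm = TRUE),
            .groups = "drop") %>%
  slice_max(weight_median, n = 1)                # keep the row with the largest median

top_feed
```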
hist(ch_df$weight)
boxplot(weight ~ feed, data = ch_df)
# Yes, this confirms that sunflower has the highest median weight. I can tell because the median line of the sunflower box sits higher than those of the other feeds. There are a few outliers.
summary(ch_df)
## weight feed
## Min. :108.0 Length:61
## 1st Qu.:219.5 Class :character
## Median :267.0 Mode :character
## Mean :270.9
## 3rd Qu.:328.0
## Max. :423.0
## NA's :10
library(ggplot2)
ggplot(ch_df, aes(x = feed, y = weight)) +
  geom_boxplot(fill = "lightblue") +
  xlab("Feed") +
  ylab("Weight") +
  ggtitle("Box Plot of Weight by Feed Type")
## Warning: Removed 10 rows containing non-finite values (`stat_boxplot()`).