Download chickens.csv to your working directory. Make sure to set your working directory appropriately! This dataset was created by modifying the R built-in dataset chickwts.
Import the chickens.csv data into R. Store it in a data.frame named ch_df and print out the entire ch_df to the screen.
ch_df <- read.csv("chickens.csv")
ch_df
## weight feed
## 1 206 meatmeal
## 2 140 horsebean
## 3 <NA> <NA>
## 4 318 sunflower
## 5 332 casein
## 6 na horsebean
## 7 216 na
## 8 143 horsebean
## 9 271 soybean
## 10 315 meatmeal
## 11 227 horsebean
## 12 N/A sunflower
## 13 322 sunflower
## 14 352 casein
## 15 329 not sure
## 16 N/A linseed
## 17 379 casein
## 18 153 ?
## 19 N/A linseed
## 20 213 linseed
## 21 257
## 22 179 horsebean
## 23 380 meatmeal
## 24 327 soybean
## 25 260 linseed
## 26 168 horsebean
## 27 248 soybean
## 28 181 linseed
## 29 160 horsebean
## 30 <NA> sunflower
## 31 soybean
## 32 340 sunflower
## 33 260 casein
## 34 169 ?
## 35 171 soybean
## 36 368 casein
## 37 283 casein
## 38 334 sunflower
## 39 - unknown
## 40 309 linseed
## 41 soybean
## 42 295 ?
## 43 404 <NA>
## 44 392 sunflower
## 45 na casein
## 46 267 soybean
## 47 303 meatmeal
## 48 250 soybean
## 49 243 soybean
## 50 108 horsebean
## 51 229 linseed
## 52 <NA> horsebean
## 53 222 casein
## 54 344 meatmeal
## 55 263 unknown
## 56 148 linseed
## 57 318 casein
## 58 - meatmeal
## 59 258 meatmeal
## 60 <NA> sunflower
## 61 325 meatmeal
## 62 217
## 63 271 linseed
## 64 244 linseed
## 65 341 sunflower
## 66 141 ?
## 67 158 soybean
## 68 423 sunflower
## 69 316 <NA>
## 70 na soybean
## 71 casein
There are some missing values in this dataset. Unfortunately they are represented in a number of different ways.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
sum(is.na(ch_df))
## [1] 7
#ch_df$weight
#is.na(ch_df$weight)
#ch_df$feed
#sum(is.na(ch_df$feed))
#ch_df
ch_df[ch_df == ""] <- NA
ch_df[ch_df == "?"] <- NA
ch_df[ch_df == "N/A"] <- NA
ch_df[ch_df == "na"] <- NA
ch_df[ch_df == "-"] <- NA
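The placeholder replacements above can also be folded into the import step. The sketch below is only an illustration: it writes a small hypothetical file (not the real chickens.csv) to a temporary location and passes the placeholder tokens to the na.strings argument of read.csv. The names tmp, na_tokens and demo_df are made up for this example.

```r
# Hypothetical miniature of chickens.csv, written to a temp file so the
# sketch runs on its own.
tmp <- tempfile(fileext = ".csv")
writeLines(c("weight,feed",
             "206,meatmeal",
             "na,horsebean",
             "216,na",
             "N/A,sunflower",
             "153,?",
             "257,",
             "-,unknown",
             "NA,NA"), tmp)

# Every token listed here is read as NA, in both columns.
na_tokens <- c("NA", "", "?", "N/A", "na", "-")
demo_df <- read.csv(tmp, na.strings = na_tokens)

str(demo_df)             # weight is numeric straight away
colSums(is.na(demo_df))  # NA count per column
```

Because the placeholders never reach the weight column as text, no separate as.numeric conversion is needed with this approach. Labels such as "unknown" are left untouched, as in the replacements above.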
Now that the dataset is clean, let’s see what percentage of our data is missing.
sum(is.na(ch_df$weight)) / length(ch_df$weight) * 100
## [1] 21.12676
sum(is.na(ch_df$feed)) / length(ch_df$feed) * 100
## [1] 14.08451
sum(is.na(ch_df)) / (length(ch_df$weight) + length(ch_df$feed)) * 100
## [1] 17.60563
"Percentage of missing data in the weight column: 21.12676%."
## [1] "Percentage of missing data in the weight column: 21.12676%."
"Percentage of missing data in the feed column: 14.08451%."
## [1] "Percentage of missing data in the feed column: 14.08451%."
"Percentage of missing data in the entire dataset: 17.60563%."
## [1] "Percentage of missing data in the entire dataset: 17.60563%."
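Since is.na() returns TRUE/FALSE values, taking their mean gives the proportion missing directly. The following is a minimal sketch on a made-up four-row data frame (toy_df is a hypothetical name, not the assignment data) showing the per-column and whole-table percentages in one step each.

```r
# Hypothetical toy data: 2 of 4 weights and 1 of 4 feeds are missing.
toy_df <- data.frame(
  weight = c(206, NA, 216, NA),
  feed = c("meatmeal", "horsebean", NA, "sunflower"),
  stringsAsFactors = FALSE
)

# is.na() on a data frame gives a logical matrix; colMeans() averages
# each column, so multiplying by 100 yields percent missing per column.
col_pct <- colMeans(is.na(toy_df)) * 100

# mean() over the whole matrix gives the share of all cells that are NA.
all_pct <- mean(is.na(toy_df)) * 100

col_pct
all_pct
```

This gives the same numbers as dividing sum(is.na(...)) by the number of values, but scales to any number of columns without listing them.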
EXTRA CREDIT (Optional): Figure out how to create these print statements so that the name and percentage number are not hard-coded into the statement. In other words, so that the name and percentage number are read in dynamically (for example, from a variable, from a function call, etc.) instead of just written in the statement. Please ask me for clarification if necessary.
na_weight <- sum(is.na(ch_df$weight)) / length(ch_df$weight) * 100
na_feed <- sum(is.na(ch_df$feed)) / length(ch_df$feed) * 100
na_df <- sum(is.na(ch_df)) / (length(ch_df$weight) + length(ch_df$feed)) * 100
print(paste0("Percentage of missing data in weight: ", na_weight))
## [1] "Percentage of missing data in weight: 21.1267605633803"
print(paste0("Percentage of missing data in feed: ", na_feed))
## [1] "Percentage of missing data in feed: 14.0845070422535"
print(paste0("Percentage of missing data in the dataset: ", na_df))
## [1] "Percentage of missing data in the dataset: 17.6056338028169"
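One possible way to make the column name dynamic as well, and to control the number of decimals, is sprintf(). This is only a sketch: missing_msg() is a hypothetical helper written for this example, and toy_df is made-up data rather than ch_df.

```r
# Hypothetical helper: builds the message for one column of a data frame.
# %s is filled with the column name, %.5f with the percentage rounded to
# five decimals, and %% prints a literal percent sign.
missing_msg <- function(df, col) {
  pct <- mean(is.na(df[[col]])) * 100
  sprintf("Percentage of missing data in the %s column: %.5f%%.", col, pct)
}

# Made-up data so the sketch runs on its own.
toy_df <- data.frame(
  weight = c(206, NA, 216, NA),
  feed = c("meatmeal", "horsebean", NA, "sunflower"),
  stringsAsFactors = FALSE
)

# Loop over the column names instead of writing them out by hand.
for (col in names(toy_df)) {
  print(missing_msg(toy_df, col))
}
```

With the real data, the same loop over names(ch_df) would cover both columns without any name or number typed into the statement.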
ch_df$weight <- as.character(ch_df$weight)
ch_df$weight <- as.numeric(ch_df$weight)
new_ch_df <- ch_df %>%
  group_by(feed) %>%
  summarise(weight_mean = mean(weight, na.rm = TRUE),
            weight_median = median(weight, na.rm = TRUE))
new_ch_df[which.max(new_ch_df$weight_median),]
## # A tibble: 1 x 3
## feed weight_mean weight_median
## <chr> <dbl> <dbl>
## 1 sunflower 353. 340
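As a cross-check of the grouped summary, the same medians can be computed in base R with tapply(). The sketch below uses only the non-missing sunflower, casein and horsebean weights as they appear in the printout above; the object names weights, sub_df and med are invented for the example.

```r
# Non-missing weights for three feeds, copied from the printed data.
weights <- list(
  sunflower = c(318, 322, 340, 334, 392, 341, 423),
  casein    = c(332, 352, 379, 260, 368, 283, 222, 318),
  horsebean = c(140, 143, 227, 179, 168, 160, 108)
)

# stack() turns the list into a two-column data frame: values and ind.
sub_df <- stack(weights)

# Median weight per feed; the result is a named vector.
med <- tapply(sub_df$values, sub_df$ind, median, na.rm = TRUE)
med

# Name of the feed with the highest median.
names(which.max(med))
```

For these three feeds the medians agree with the summary tables in this document (340, 325 and 160).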
hist(ch_df$weight)
boxplot(ch_df$weight ~ ch_df$feed)
At first glance the histogram looks unimodal. However, the 200-250 and 300-350 bins hold more chickens than the 250-300 bin between them. There is no significant skew.
The box plot confirms the mean and median calculations: sunflower has the highest median chicken weight, followed by casein, and horsebean has the lowest mean and median.
The histogram does not show any outliers, but the box plot shows one outlier for the horsebean feed.
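The horsebean outlier can be confirmed numerically. The sketch below assumes the seven non-missing horsebean weights as they appear in the printed data and uses boxplot.stats(), the function that base R's boxplot() relies on, to list the points drawn beyond the whiskers.

```r
# Non-missing horsebean weights, copied from the printed data.
horsebean <- c(140, 143, 227, 179, 168, 160, 108)

# boxplot.stats() returns the five box values in $stats and any points
# more than 1.5 box lengths beyond the box edges in $out.
hb_stats <- boxplot.stats(horsebean)
hb_stats$stats
hb_stats$out
```

Only the largest weight falls outside the upper whisker limit, which matches the single point drawn above the horsebean box.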
library("dplyr")
summary(ch_df)
## weight feed
## Min. :108.0 Length:71
## 1st Qu.:211.2 Class :character
## Median :261.5 Mode :character
## Mean :264.1
## 3rd Qu.:325.5
## Max. :423.0
## NA's :15
ch_df %>%
  group_by(feed) %>%
  summarize(min = min(weight, na.rm = TRUE),
            q1 = quantile(weight, 0.25, na.rm = TRUE),
            median = median(weight, na.rm = TRUE),
            mean = mean(weight, na.rm = TRUE),
            q3 = quantile(weight, 0.75, na.rm = TRUE),
            max = max(weight, na.rm = TRUE))
## # A tibble: 9 x 7
## feed min q1 median mean q3 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 casein 222 277. 325 314. 356 379
## 2 horsebean 108 142. 160 161. 174. 227
## 3 linseed 148 205 236. 232. 263. 309
## 4 meatmeal 206 280. 315 304. 334. 380
## 5 not sure 329 329 329 329 329 329
## 6 soybean 158 225 249 242. 268 327
## 7 sunflower 318 328 340 353. 366. 423
## 8 unknown 263 263 263 263 263 263
## 9 <NA> 141 169 217 241. 295 404
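The table above still has separate "not sure" and "unknown" groups, because those labels were not among the placeholders replaced earlier. Whether they should count as missing is a judgment call. If they should, one way to recode them is sketched below on four rows taken from the printout (toy_df and vague_labels are hypothetical names).

```r
# Four rows copied from the printed data, two with vague feed labels.
toy_df <- data.frame(
  weight = c(329, 263, 318, 160),
  feed = c("not sure", "unknown", "casein", "horsebean"),
  stringsAsFactors = FALSE
)

# Assumption: these two labels carry no usable feed information.
vague_labels <- c("not sure", "unknown")
toy_df$feed[toy_df$feed %in% vague_labels] <- NA

# Count the remaining feed groups, including NA.
table(toy_df$feed, useNA = "ifany")
```

Applying the same recoding to ch_df would change the missing-data percentages reported earlier, so those would need to be recomputed afterwards.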
p <- ggplot(ch_df, aes(x = feed, y = weight, fill = feed)) +
  geom_boxplot() +
  labs(title = "Relationships between Type of Feed and Weight of Chickens",
       x = "Type of Feed", y = "Weight of Chickens")
p + scale_fill_brewer(palette = "Dark2") + theme_minimal()
## Warning: Removed 15 rows containing non-finite values (stat_boxplot).
As shown in the charts, ggplot gives a nicer look, with colors and details such as the positioning of the legend and labels.
The base R box plot shows only one outlier (the maximum for horsebean), whereas the ggplot version shows two: the maximum for horsebean and the minimum for soybean.
Base R shows only some of the x-axis labels, whereas ggplot shows them all.
ggplot can also add a theme to the plot.
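The differing outlier counts can be traced to how the box edges are computed. Base R's boxplot() uses Tukey's hinges (as returned by fivenum()), while ggplot2's geom_boxplot() uses the 25th and 75th percentiles from quantile(). The sketch below assumes the eight non-missing soybean weights as they appear in the printed data and applies the 1.5 box-length rule both ways; the variable names are invented for the example.

```r
# Non-missing soybean weights, copied from the printed data.
soybean <- c(271, 327, 248, 171, 267, 250, 243, 158)

# Base R: box edges are Tukey's hinges.
hinges <- fivenum(soybean)[c(2, 4)]
base_lower <- hinges[1] - 1.5 * diff(hinges)
base_out <- boxplot.stats(soybean)$out

# ggplot2: box edges are the quantile() quartiles (default type 7).
qs <- quantile(soybean, c(0.25, 0.75), names = FALSE)
gg_lower <- qs[1] - 1.5 * diff(qs)
gg_upper <- qs[2] + 1.5 * diff(qs)
gg_out <- soybean[soybean < gg_lower | soybean > gg_upper]

c(base_lower = base_lower, gg_lower = gg_lower)
base_out
gg_out
```

With only eight values the hinges and the quartiles differ noticeably, so the smallest soybean weight sits inside the lower limit under the hinge rule but just outside it under the quantile rule.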