Data Import

  1. Download chickens.csv to your working directory. Make sure to set your working directory appropriately! This dataset was created by modifying the R built-in dataset chickwts.

  2. Import the chickens.csv data into R. Store it in a data.frame named ch_df and print out the entire ch_df to the screen.

ch_df <- read.csv("chickens.csv")
ch_df
##    weight      feed
## 1     206  meatmeal
## 2     140 horsebean
## 3    <NA>      <NA>
## 4     318 sunflower
## 5     332    casein
## 6      na horsebean
## 7     216        na
## 8     143 horsebean
## 9     271   soybean
## 10    315  meatmeal
## 11    227 horsebean
## 12    N/A sunflower
## 13    322 sunflower
## 14    352    casein
## 15    329  not sure
## 16    N/A   linseed
## 17    379    casein
## 18    153         ?
## 19    N/A   linseed
## 20    213   linseed
## 21    257          
## 22    179 horsebean
## 23    380  meatmeal
## 24    327   soybean
## 25    260   linseed
## 26    168 horsebean
## 27    248   soybean
## 28    181   linseed
## 29    160 horsebean
## 30   <NA> sunflower
## 31          soybean
## 32    340 sunflower
## 33    260    casein
## 34    169         ?
## 35    171   soybean
## 36    368    casein
## 37    283    casein
## 38    334 sunflower
## 39      -   unknown
## 40    309   linseed
## 41          soybean
## 42    295         ?
## 43    404      <NA>
## 44    392 sunflower
## 45     na    casein
## 46    267   soybean
## 47    303  meatmeal
## 48    250   soybean
## 49    243   soybean
## 50    108 horsebean
## 51    229   linseed
## 52   <NA> horsebean
## 53    222    casein
## 54    344  meatmeal
## 55    263   unknown
## 56    148   linseed
## 57    318    casein
## 58      -  meatmeal
## 59    258  meatmeal
## 60   <NA> sunflower
## 61    325  meatmeal
## 62    217          
## 63    271   linseed
## 64    244   linseed
## 65    341 sunflower
## 66    141         ?
## 67    158   soybean
## 68    423 sunflower
## 69    316      <NA>
## 70     na   soybean
## 71           casein

Clean Missing Values

There are some missing values in this dataset. Unfortunately, they are represented in several different ways.

  1. Clean up this dataset by doing the following:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
sum(is.na(ch_df))
## [1] 7
# Recode each placeholder that stands for a missing value as a true NA.
ch_df[ch_df == ""] <- NA
ch_df[ch_df == "?"] <- NA
ch_df[ch_df == "N/A"] <- NA
ch_df[ch_df == "na"] <- NA
ch_df[ch_df == "-"] <- NA
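An alternative to recoding after the fact is to declare the placeholders at import time through the na.strings argument of read.csv. The sketch below is only an illustration: csv_text and sample_df are hypothetical names, and the four-row sample stands in for chickens.csv rather than reproducing it.

```r
# Hypothetical four-row sample standing in for chickens.csv (not the real file).
csv_text <- "weight,feed
206,meatmeal
na,horsebean
N/A,?
-,
"

# Every string listed in na.strings is read in as NA.
# Labels such as "not sure" or "unknown" could be added here too,
# if they should also count as missing (an assumption, not part of the task).
sample_df <- read.csv(text = csv_text,
                      na.strings = c("", "?", "N/A", "na", "-", "NA"))

str(sample_df)
sum(is.na(sample_df))
```

Because the placeholders never reach the weight column as text, weight is read in as a numeric column straight away and needs no later conversion.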

Now that the dataset is clean, let’s see what percentage of our data is missing.

  1. Calculate the percentage of missing data from the weight column, the feed column, and the entire dataset. Print out each result in the following format: “Percentage of missing data in [fill in the column or dataset name]: [fill in percentage]%.”
sum(is.na(ch_df$weight)) / nrow(ch_df) * 100
## [1] 21.12676
sum(is.na(ch_df$feed)) / nrow(ch_df) * 100
## [1] 14.08451
sum(is.na(ch_df)) / (nrow(ch_df) * ncol(ch_df)) * 100
## [1] 17.60563
"Percentage of missing data in the weight column: 21.12676%."
## [1] "Percentage of missing data in the weight column: 21.12676%."
"Percentage of missing data in the feed column: 14.08451%."
## [1] "Percentage of missing data in the feed column: 14.08451%."
"Percentage of missing data in the entire dataset: 17.60563%."
## [1] "Percentage of missing data in the entire dataset: 17.60563%."

EXTRA CREDIT (Optional): Figure out how to create these print statements so that the name and percentage number are not hard-coded into the statement. In other words, so that the name and percentage number are read in dynamically (for example, from a variable, from a function call, etc.) instead of just written in the statement. Please ask me for clarification if necessary.

na_weight <- sum(is.na(ch_df$weight)) / nrow(ch_df) * 100
na_feed <- sum(is.na(ch_df$feed)) / nrow(ch_df) * 100
na_df <- sum(is.na(ch_df)) / (nrow(ch_df) * ncol(ch_df)) * 100

print(paste0("Percentage of missing data in weight: ", na_weight))
## [1] "Percentage of missing data in weight: 21.1267605633803"
print(paste0("Percentage of missing data in feed: ", na_feed))
## [1] "Percentage of missing data in feed: 14.0845070422535"
print(paste0("Percentage of missing data in the dataset: ", na_df))
## [1] "Percentage of missing data in the dataset: 17.6056338028169"
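One way to avoid hard-coding the name as well as the number is to wrap the calculation in a small helper and loop over the column names. This is a minimal sketch, not the only way to do it: report_missing and demo_df are hypothetical names, and the tiny demo data frame is made up so the example runs on its own.

```r
# Hypothetical helper: builds the message from whatever object and label it is given.
report_missing <- function(x, label) {
  pct <- mean(is.na(x)) * 100   # share of NA values, as a percentage
  sprintf("Percentage of missing data in %s: %.2f%%.", label, pct)
}

# Made-up demo data so the sketch is self-contained.
demo_df <- data.frame(weight = c(206, NA, 318, NA),
                      feed = c("casein", NA, "soybean", "linseed"))

# Column names are read from the data frame instead of being typed in.
for (col in names(demo_df)) {
  print(report_missing(demo_df[[col]], col))
}

# The same helper works for the whole data frame.
print(report_missing(demo_df, "demo_df"))
```

With the real data, the same loop could be run over names(ch_df). The %.2f in the format string rounds the percentage to two decimals, which is a presentation choice rather than a requirement.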

Data Investigation

  1. Group the data by feed and find the mean and median weight for each group. Your result should be a new data frame with the group means in a column named weight_mean and the group medians in a column named weight_median. Save this new data frame; you can name the data frame as you wish. (Remember that variable names should be somewhat descriptive of what they contain.)
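# Convert weight to numeric; going through character is safe even if it was read as a factor.
ch_df$weight <- as.numeric(as.character(ch_df$weight))

# Mean and median weight per feed type, ignoring missing weights.
new_ch_df <- ch_df %>%
  group_by(feed) %>%
  summarise(weight_mean = mean(weight, na.rm = TRUE),
            weight_median = median(weight, na.rm = TRUE))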
  1. Find the feed that has the maximum median chicken weight.
new_ch_df[which.max(new_ch_df$weight_median),]
## # A tibble: 1 x 3
##   feed      weight_mean weight_median
##   <chr>           <dbl>         <dbl>
## 1 sunflower        353.           340
  1. Create a quick histogram of the weight from the original data frame using the Base R Plotting package.
hist(ch_df$weight)

  1. Create a box plot with feed type as the X axis.
# weight was already converted to numeric above, so it can be plotted directly.
boxplot(ch_df$weight ~ ch_df$feed)

  1. What do these charts tell you? Does the box plot confirm your mean and median calculations? If yes, how so? Are there any outliers displayed in either chart? Confirm this using the five number summary for specific feed types and the IQR.

At first glance, the histogram looks unimodal. On closer inspection, the 200-250 and 300-350 bins hold more chickens than the 250-300 bin between them, so the distribution dips slightly in the middle. There is no significant skew.

The box plot confirms the mean and median calculations: sunflower has the highest median chicken weight, followed by casein, and horsebean has the lowest mean and median.

The histogram does not show any outliers, but the box plot shows one outlier for the horsebean feed.

# dplyr is already attached by library(tidyverse) above.
summary(ch_df)
##      weight          feed          
##  Min.   :108.0   Length:71         
##  1st Qu.:211.2   Class :character  
##  Median :261.5   Mode  :character  
##  Mean   :264.1                     
##  3rd Qu.:325.5                     
##  Max.   :423.0                     
##  NA's   :15
ch_df %>%         
  group_by(feed) %>% 
  summarize(min = min(weight, na.rm = TRUE),
            q1 = quantile(weight, 0.25, na.rm = TRUE),
            median = median(weight, na.rm = TRUE),
            mean = mean(weight, na.rm = TRUE),
            q3 = quantile(weight, 0.75, na.rm = TRUE),
            max = max(weight, na.rm = TRUE))
## # A tibble: 9 x 7
##   feed        min    q1 median  mean    q3   max
##   <chr>     <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 casein      222  277.   325   314.  356    379
## 2 horsebean   108  142.   160   161.  174.   227
## 3 linseed     148  205    236.  232.  263.   309
## 4 meatmeal    206  280.   315   304.  334.   380
## 5 not sure    329  329    329   329   329    329
## 6 soybean     158  225    249   242.  268    327
## 7 sunflower   318  328    340   353.  366.   423
## 8 unknown     263  263    263   263   263    263
## 9 <NA>        141  169    217   241.  295    404
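The horsebean outlier can be checked against the 1.5 * IQR rule directly. The sketch below uses the seven non-missing horsebean weights as they appear in the printed data frame. The names horsebean, lower_fence, upper_fence, and horsebean_outliers are introduced here for illustration only.

```r
# Horsebean weights transcribed from the printed data, missing values left out.
horsebean <- c(140, 143, 227, 179, 168, 160, 108)

q <- quantile(horsebean, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
iqr <- IQR(horsebean)                                   # Q3 - Q1

# Tukey fences: values more than 1.5 * IQR beyond the quartiles count as outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

horsebean_outliers <- horsebean[horsebean < lower_fence | horsebean > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(horsebean_outliers)
```

With Q1 = 141.5 and Q3 = 173.5, the IQR is 32 and the fences are 93.5 and 221.5, so the maximum of 227 falls outside the upper fence while the minimum of 108 stays inside the lower one.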
  1. Recreate the box plot above using ggplot and add x and y axis labels as well as a title. Describe any differences you notice about the two plots.
p <- ggplot(ch_df, aes(x = feed, y = weight, fill = feed)) +
  geom_boxplot() +
  labs(title = "Relationships between Type of Feed and Weight of Chickens",
       x = "Type of Feed", y = "Weight of Chickens")

p + scale_fill_brewer(palette = "Dark2") + theme_minimal()
## Warning: Removed 15 rows containing non-finite values (stat_boxplot).

As the charts show, ggplot gives a more polished look, with colors and details such as a legend and axis labels.

The base R plot shows only one outlier, the horsebean maximum, whereas the ggplot version shows two: the horsebean maximum and the soybean minimum.

Base R labels only some of the feed types on the x axis, whereas ggplot labels them all.

ggplot also lets you add a theme.
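The extra soybean outlier in the ggplot version most likely comes from how the box edges are computed. As far as I understand it, base R's boxplot() uses Tukey's hinges (the same values fivenum() returns), while ggplot2's geom_boxplot() uses the 25th and 75th percentiles from quantile(). The sketch below compares the two rules on the eight non-missing soybean weights from the printed data; all variable names are introduced for illustration.

```r
# Soybean weights transcribed from the printed data, missing values left out.
soybean <- c(271, 327, 248, 171, 267, 250, 243, 158)

# Base R rule: box edges are the lower and upper hinges from fivenum().
hinges <- fivenum(soybean)[c(2, 4)]
base_outliers <- boxplot.stats(soybean)$out   # what boxplot() would flag

# Quantile rule (assumed to match ggplot2): box edges are quantile() quartiles.
quartiles <- quantile(soybean, c(0.25, 0.75), names = FALSE)
fence_low <- quartiles[1] - 1.5 * diff(quartiles)
fence_high <- quartiles[2] + 1.5 * diff(quartiles)
quantile_outliers <- soybean[soybean < fence_low | soybean > fence_high]

print(hinges)
print(quartiles)
print(base_outliers)
print(quantile_outliers)
```

The hinges are 207 and 269, which puts the lower fence at 114, so 158 is not flagged. The quartiles are 225 and 268, which puts the lower fence at 160.5, so 158 is flagged. That matches the difference between the two plots.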