Data Import

  1. Download chickens.csv to your working directory. Make sure to set your working directory appropriately! This dataset was created by modifying the R built-in dataset chickwts.

  2. Import the chickens.csv data into R. Store it in a data.frame named ch_df and print out the entire ch_df to the screen.

# Point R at the folder that holds chickens.csv
setwd("~/Data 101")
# Read the CSV file into a data frame
ch_df <- read.csv("chickens.csv")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Print the entire data frame to the screen, as the task asks
print(ch_df)

Clean Missing Values

There are some missing values in this dataset. Unfortunately they are represented in a number of different ways.

  1. Clean up this dataset by doing the following:
sum(is.na(ch_df))
## [1] 7
# Recode every placeholder for a missing value as NA, across all columns
ch_df[ch_df == ""] <- NA
ch_df[ch_df == "na"] <- NA
ch_df[ch_df == "N/A"] <- NA
ch_df[ch_df == "?"] <- NA
ch_df[ch_df == "NA"] <- NA
ch_df[ch_df == "-"] <- NA
view(ch_df)
# Column-by-column version of the same recoding
na_codes <- c("na", "N/A", "", "?", "NA", "-")
ch_df <- ch_df %>%
  mutate(weight = replace(weight, weight %in% na_codes, NA),
         feed = replace(feed, feed %in% na_codes, NA))
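
The same placeholders can also be declared when the file is read, so that no recoding step is needed afterwards. The following is a minimal sketch, assuming the placeholder strings are exactly the ones handled above. It writes a small made-up file to a temporary path (`demo_path` and `demo_df` are illustrative names) instead of reading chickens.csv, so the values are not the real data.

```r
# Illustrative rows only; these are not the contents of chickens.csv.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("weight,feed",
             "179,horsebean",
             "na,linseed",
             "?,",
             "N/A,soybean",
             "-,casein",
             "309,NA"),
           demo_path)

# na.strings lists every string that read.csv should treat as missing.
demo_df <- read.csv(demo_path,
                    na.strings = c("", "na", "N/A", "?", "NA", "-"))

str(demo_df)             # weight arrives as a number, not as text
colSums(is.na(demo_df))  # count of missing values in each column
```

Because the placeholders never reach the data frame as text, the weight column does not need a later conversion to numeric.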

Now that the dataset is clean, let’s see what percentage of our data is missing.

  1. Calculate the percentage of missing data from the weight column, the feed column, and the entire dataset. Print out each result in the following format: “Percentage of missing data in [fill in the column or dataset name]: [fill in percentage]%.”
mean(is.na(ch_df$weight))*100
## [1] 21.12676
print("Percentage of missing data in the weight column: 21.13%")
## [1] "Percentage of missing data in the weight column: 21.13%"
mean(is.na(ch_df$feed))*100
## [1] 14.08451
print("Percentage of missing data in the feed column: 14.08%")
## [1] "Percentage of missing data in the feed column: 14.08%"
mean(is.na(ch_df))*100
## [1] 17.60563
print("Percentage of missing data in the entire dataset: 17.61%")
## [1] "Percentage of missing data in the entire dataset: 17.61%"

EXTRA CREDIT (Optional): Figure out how to create these print statements so that the name and percentage number are not hard-coded into the statement. In other words, so that the name and percentage number are read in dynamically (for example, from a variable, from a function call, etc.) instead of just written in the statement. Please ask me for clarification if necessary.

x <- mean(is.na(ch_df$weight)) * 100
y <- mean(is.na(ch_df$feed)) * 100
z <- mean(is.na(ch_df)) * 100
cat("weight column:", x, "\n")
## weight column: 21.12676
cat("feed column:", y, "\n")
## feed column: 14.08451
cat("entire dataset:", z, "\n")
## entire dataset: 17.60563
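
To match the requested sentence format without hard-coding either the name or the number, the calculation and the message can be wrapped in one function. This is only a sketch: `report_missing()` is a hypothetical helper name, and the small `demo_df` below is made-up data standing in for ch_df.

```r
# report_missing() is a hypothetical helper, not part of the assignment.
# x may be a single column or a whole data frame; label is the name to print.
report_missing <- function(x, label) {
  pct <- mean(is.na(x)) * 100
  # %.2f rounds to two decimals; %% prints a literal percent sign.
  msg <- sprintf("Percentage of missing data in %s: %.2f%%", label, pct)
  cat(msg, "\n", sep = "")
  invisible(msg)
}

# Made-up data standing in for ch_df.
demo_df <- data.frame(weight = c(179, NA, 160, NA),
                      feed   = c("horsebean", "linseed", NA, "soybean"))

report_missing(demo_df$weight, "the weight column")
report_missing(demo_df$feed, "the feed column")
report_missing(demo_df, "the entire dataset")
```

Calling the function with `ch_df$weight`, `ch_df$feed`, and `ch_df` would produce the three statements for the real data.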

Data Investigation

  1. Group the data by feed and find the mean and median weight for each group. Your result should be a new data frame with the group means in a column named weight_mean and the group medians in a column named weight_median. Save this new data frame; you can name the data frame as you wish. (Remember that variable names should be somewhat descriptive of what they contain.)
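# Convert weight to numeric so that mean() and median() can be computed
ch_df$weight <- as.numeric(as.character(ch_df$weight))
# One row per feed type, with the mean and median weight of that group
Chick <- ch_df %>%
  group_by(feed) %>%
  summarise(weight_mean = mean(weight, na.rm = TRUE),
            weight_median = median(weight, na.rm = TRUE))
view(Chick)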
  1. Find the feed that has the maximum median chicken weight.
which.max(Chick$weight_median)
## [1] 7
Chick[7,]
## # A tibble: 1 x 3
##   feed      weight_mean weight_median
##   <chr>           <dbl>         <dbl>
## 1 sunflower        353.           340
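
The lookup and the row extraction can be combined so that the row number is never typed by hand. A minimal base R sketch, assuming the built-in chickwts dataset as a stand-in for the cleaned ch_df; because chickwts is the unmodified data, its result will not necessarily match the output above. The names `feed_medians` and `top_feed` are illustrative.

```r
# chickwts stands in for the cleaned ch_df here (assumption).
feed_medians <- aggregate(weight ~ feed, data = chickwts, FUN = median)
names(feed_medians)[2] <- "weight_median"

# Use which.max() directly as the row index instead of typing the number.
top_feed <- feed_medians[which.max(feed_medians$weight_median), ]
print(top_feed)
```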
  1. Create a quick histogram of the weight from the original data frame using the Base R Plotting package.
# Base R histogram; hist() drops the missing weights without a warning
hist(ch_df$weight, main = "Histogram of Chicken Weight", xlab = "Weight")

  1. Create a box plot with feed type as the X axis.
# Base R box plot of weight for each feed type
boxplot(weight ~ feed, data = ch_df, xlab = "feed", ylab = "weight")

  1. What do these charts tell you? Does the box plot confirm your mean and median calculations? If yes, how so? Are there any outliers displayed in either chart? Confirm this using the five number summary for specific feed types and the IQR.

The histogram shows how the weights in the data frame are distributed. The box plot confirms only the median calculations, because the line inside each box is the group median; the mean is not displayed in a box plot. Outliers appear as individual points beyond the whiskers. Two outliers are visible, on the horsebean and the soybean boxes.
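
The outliers can be checked numerically with the five number summary and the 1.5 × IQR rule. The sketch below assumes the built-in chickwts dataset as a stand-in for the cleaned ch_df, so its results may differ from the plot above; `find_outliers()` is a hypothetical helper name. It uses quartiles from `quantile()`, which can differ slightly from the hinges that `fivenum()` and the base R `boxplot()` use.

```r
# find_outliers() is a hypothetical helper, not part of the assignment.
# It returns the values that fall outside the 1.5 * IQR fences.
find_outliers <- function(w) {
  w <- w[!is.na(w)]
  q <- quantile(w, c(0.25, 0.75))
  fence <- 1.5 * IQR(w)
  w[w < q[1] - fence | w > q[2] + fence]
}

# chickwts stands in for the cleaned ch_df here (assumption).
horsebean <- chickwts$weight[chickwts$feed == "horsebean"]

# Five number summary: minimum, lower hinge, median, upper hinge, maximum.
fivenum(horsebean)
IQR(horsebean)

# Outlying weights for every feed type.
lapply(split(chickwts$weight, chickwts$feed), find_outliers)
```

Replacing chickwts with ch_df in the last lines would test the horsebean and soybean points seen in the box plot.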

  1. Recreate the box plot above using ggplot and add x and y axis labels as well as a title. Describe any differences you notice about the two plots.
ggplot(ch_df, aes(feed, weight, fill = feed)) +
  geom_boxplot() +
  labs(x = "Feed type", y = "Weight", title = "Chicken Weight by Feed Type")
## Warning: Removed 15 rows containing non-finite values (stat_boxplot).

There isn’t much difference in what the two plots show. The second plot adds a title and fills each box with a color for its feed type, which also adds a legend.