Project 2b - Chickens

getwd()

## [1] "C:/Users/Jerome/Documents/0000_Work_Files/0000_Montgomery_College/Data_Science_101/Data_101_Fall_2022/Homework_9_Due_7Nov2022/project2b"

### Setup
#### install.packages("naniar")  I tried this package while converting weight from character to numeric. That code didn't work, and I believe I deleted those commands. Ultimately transform() worked, as explained below.
####library(naniar)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(dplyr)
library(readr)

Data Import

Download chickens.csv to your working directory. Make sure to set your working directory appropriately! This dataset was created by modifying the R built-in dataset chickwts.
Import the chickens.csv data into R. Store it in a data.frame named ch_df and print out the entire ch_df to the screen.

# fill in your code here
ch_df <- read_csv("chickens.csv")

## Rows: 71 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): weight, feed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

print(ch_df)

## # A tibble: 71 × 2
##    weight feed     
##    <chr>  <chr>    
##  1 206    meatmeal 
##  2 140    horsebean
##  3 <NA>   <NA>     
##  4 318    sunflower
##  5 332    casein   
##  6 na     horsebean
##  7 216    na       
##  8 143    horsebean
##  9 271    soybean  
## 10 315    meatmeal 
## # … with 61 more rows

Clean Missing Values

There are some missing values in this dataset. Unfortunately they are represented in a number of different ways.

Clean up this dataset by doing the following:

Calculate how many elements in the original ch_df are recognized as NA by R.
Change all of the missing elements to NA in ch_df.
You do NOT have to fill in the missing values. Just leave them as NA.

# fill in your code here
table(is.na(ch_df))

## 
## FALSE  TRUE 
##   130    12

table(is.na(ch_df$weight))

## 
## FALSE  TRUE 
##    64     7

table(is.na(ch_df$feed))

## 
## FALSE  TRUE 
##    66     5

### In the original table, 12 NA values were observed, 7 in the weight column and 5 in the feed column.

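### After cleaning with the commands below, 25 elements are recognized as NA: 15 in the weight column and 10 in the feed column.
### Note: the feed column still contains "not sure" (1 row) and "unknown" (2 rows), which these commands do not convert to NA.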


ch_df[ch_df == '?'] <- NA
ch_df[ch_df == 'na'] <- NA
ch_df[ch_df == '-'] <- NA
ch_df[ch_df == 'N/A'] <- NA
table(is.na(ch_df))

## 
## FALSE  TRUE 
##   117    25

table(ch_df$weight)

## 
## 108 140 141 143 148 153 158 160 168 169 171 179 181 206 213 216 217 222 227 229 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 243 244 248 250 257 258 260 263 267 271 283 295 303 309 315 316 318 322 325 327 
##   1   1   1   1   1   1   2   1   1   2   1   1   1   1   1   1   2   1   1   1 
## 329 332 334 340 341 344 352 368 379 380 392 404 423 
##   1   1   1   1   1   1   1   1   1   1   1   1   1

table(ch_df$feed)

## 
##    casein horsebean   linseed  meatmeal  not sure   soybean sunflower   unknown 
##        10         9        10         8         1        11        10         2
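The four separate assignments above could also be collapsed into one step. The sketch below is only an illustration on a small made-up data frame: `demo_df` and `missing_tokens` are hypothetical names, and treating "not sure" and "unknown" as missing is an assumption about what those values mean.

```r
# Hypothetical toy data that mimics the mixed missing-value markers in chickens.csv
demo_df <- data.frame(
  weight = c("206", "na", "?", "318", NA, "-"),
  feed   = c("meatmeal", "horsebean", "N/A", "unknown", "casein", "not sure"),
  stringsAsFactors = FALSE
)

# Assumption: every token in this vector stands for a missing value
missing_tokens <- c("?", "na", "-", "N/A", "not sure", "unknown")

# Replace any cell that matches a token with NA, one column at a time
demo_df[] <- lapply(demo_df, function(col) {
  col[col %in% missing_tokens] <- NA
  col
})

# Count the NA cells after cleaning
na_count <- sum(is.na(demo_df))
print(na_count)
```

An alternative is to declare the tokens at import time through the `na` argument of `read_csv()`, for example `read_csv("chickens.csv", na = c("", "NA", "na", "?", "-", "N/A"))`, so the values arrive as NA in the first place.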

Now that the dataset is clean, let’s see what percentage of our data is missing.

Calculate the percentage of missing data from the weight column, the feed column, and the entire dataset. Print out each result in the following format: “Percentage of missing data in [fill in the column or dataset name]: [fill in percentage]%.”

# fill in your code here

15/71

## [1] 0.2112676

10/71

## [1] 0.1408451

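25/142

## [1] 0.1760563

### ch_df has 71 rows x 2 columns = 142 elements, so the whole-dataset percentage uses 142 as the denominator.
###Percentage of missing data in the weight column: 21.13%
###Percentage of missing data in the feed column: 14.08%
###Percentage of missing data in ch_df: 17.61%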

EXTRA CREDIT (Optional): Figure out how to create these print statements so that the name and percentage number are not hard-coded into the statement. In other words, so that the name and percentage number are read in dynamically (for example, from a variable, from a function call, etc.) instead of just written in the statement. Please ask me for clarification if necessary.

# fill in your code here
### I think I'll skip this.
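For reference, one way the extra-credit statements might be produced without hard-coding names or numbers is sketched below. It runs on a small made-up data frame; `demo_df` and `pct_missing_msg` are hypothetical names, and in the report the cleaned `ch_df` would take the place of `demo_df`.

```r
# Hypothetical toy data; in the report this would be ch_df after cleaning
demo_df <- data.frame(
  weight = c(206, NA, 318, NA),
  feed   = c("meatmeal", "horsebean", NA, "casein")
)

# Build the sentence from a label and any object that is.na() accepts
pct_missing_msg <- function(label, x) {
  pct <- 100 * mean(is.na(x))  # share of NA elements, as a percentage
  sprintf("Percentage of missing data in %s: %.2f%%.", label, pct)
}

# Column names are read from the data frame rather than typed into the message
for (col in names(demo_df)) {
  print(pct_missing_msg(col, demo_df[[col]]))
}

# The same function works on the whole data frame
msg_all <- pct_missing_msg("demo_df", demo_df)
print(msg_all)
```

Because `mean(is.na(x))` divides by the number of elements in whatever it is given, the same call uses the row count for a single column and rows times columns for the whole data frame.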

Data Investigation

Group the data by feed and find the mean and median weight for each group. Your result should be a new data frame with the group means in a column named weight_mean and the group medians in a column named weight_median. Save this new data frame; you can name the data frame as you wish. (Remember that variable names should be somewhat descriptive of what they contain.)

# fill in your code here
### Explanation: When I first ran the code to calculate the mean and median for each group, I got an error because weight was stored as character. I tried several times to convert weight to numeric with as.numeric, but it didn't work. I then removed the NAs from the entire data frame and tried as.numeric again; it still didn't work. After searching online, I found the transform() approach at https://www.geeksforgeeks.org/how-to-convert-dataframe-column-from-character-to-numeric-in-r/ which gave me a numeric weight column. Next I calculate the mean and median for each group.
ch_df_na <- na.omit(ch_df)
ch_df_na <- transform(ch_df_na, weight = as.numeric(weight))
class(ch_df_na$weight)

## [1] "numeric"

### I tried the following (now commented-out) code first; it worked, except it didn't save the data frame with the mean and median. Then I found your code in the bandwidth example and tried it; after several false starts, it worked.
#ch_df_na$feed <- as.factor(ch_df_na$feed)
#class(ch_df_na$feed)
#feed_grps <- group_by(ch_df_na, feed)
#summarise(feed_grps, 
#weight_mean=mean(weight),
#weight_median=median(weight))

feed_grps2 <- ch_df_na %>%
  group_by(feed) %>%
  summarise(weight_mean = mean(weight),
            weight_median = median(weight))

Find the feed that has the maximum median chicken weight.

# fill in your code here
### By reading from the table generated in the preceding step, I see that Sunflower has a median value of 340, the highest of all the feed groups.
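The same answer could be obtained in code rather than by reading the table. The sketch below is a minimal illustration on made-up weights: `demo_df`, `feed_summary`, and `top_feed` are hypothetical names, and in the report the saved summary data frame would be used directly.

```r
library(dplyr)

# Hypothetical toy data with the same column names as ch_df_na
demo_df <- data.frame(
  feed   = c("casein", "casein", "casein",
             "horsebean", "horsebean",
             "sunflower", "sunflower", "sunflower"),
  weight = c(260, 325, 368, 140, 160, 318, 340, 423)
)

# Group means and medians, as in the step above
feed_summary <- demo_df %>%
  group_by(feed) %>%
  summarise(weight_mean = mean(weight),
            weight_median = median(weight))

# Keep only the row with the largest median
top_feed <- feed_summary %>%
  slice_max(weight_median, n = 1)
print(top_feed)
```

`slice_max()` returns every tied row by default, so two feeds sharing the top median would both be shown.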

Create a quick histogram of the weight from the original data frame using the Base R Plotting package.

# fill in your code here
### Note: I had the same problem with weight here. It wasn't numeric, so I couldn't create a histogram. I reused the transform() code from the smaller "no NAs" data frame to convert weight to numeric first.
ch_df <- transform(ch_df, weight = as.numeric(weight))
class(ch_df$weight)

## [1] "numeric"

hist(ch_df$weight)

Create a box plot with feed type as the X axis.

# fill in your code here

# Formula interface with data = gives clean "feed" and "weight" axis labels
boxplot(weight ~ feed, data = ch_df_na)

What do these charts tell you? Does the box plot confirm your mean and median calculations? If yes, how so? Are there any outliers displayed in either chart? Confirm this using the five number summary for specific feed types and the IQR.

# fill in your code here

#feed_grps3 <- ch_df_na %>% group_by(feed) %>%
#summarise(weight_mean = mean(weight),
#weight_median = median(weight))
#ch_df_na <- transform(ch_df_na,
      #                       feed = as.factor(feed))
#class(ch_df_na$feed)

casein <- filter(ch_df_na, feed == "casein")
horsebean <- filter(ch_df_na, feed == "horsebean")
fivenum(horsebean$weight)

## [1] 108.0 141.5 160.0 173.5 227.0

IQR(horsebean$weight)

## [1] 32

fivenum(casein$weight)

## [1] 222.0 271.5 325.0 360.0 379.0

IQR(casein$weight)

## [1] 78.75

sunflower <- filter(ch_df_na, feed == "sunflower")
fivenum(sunflower$weight)

## [1] 318.0 328.0 340.0 366.5 423.0

IQR(sunflower$weight)

## [1] 38.5

### The charts tell me that chicken weight varies with the type of feed the chickens received. I ran IQR and fivenum for 3 of the 8 groups, and the results correspond with the values shown on the boxplots. There is one outlier for horsebean, a high value of 227, as shown on the boxplot. No outliers appear on the histogram, at least as I drew it. No group stands apart from the rest.
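The horsebean outlier can be checked numerically with the usual 1.5 x IQR rule. The sketch below plugs in the five-number summary and IQR printed above; the variable names are hypothetical and introduced only for illustration.

```r
# Five-number summary and IQR for horsebean, copied from the output above
horsebean_fivenum <- c(108.0, 141.5, 160.0, 173.5, 227.0)
horsebean_iqr <- 32

# Tukey's rule: points more than 1.5 * IQR beyond the hinges are flagged
lower_fence <- horsebean_fivenum[2] - 1.5 * horsebean_iqr
upper_fence <- horsebean_fivenum[4] + 1.5 * horsebean_iqr

# Compare the minimum and maximum against the fences
max_is_outlier <- horsebean_fivenum[5] > upper_fence
min_is_outlier <- horsebean_fivenum[1] < lower_fence

print(c(lower = lower_fence, upper = upper_fence))
print(c(max_is_outlier = max_is_outlier, min_is_outlier = min_is_outlier))
```

The upper fence works out to 173.5 + 48 = 221.5, so the maximum of 227 lies above it, while the minimum of 108 is inside the lower fence of 93.5. With the raw weights available, `boxplot.stats(horsebean$weight)$out` lists the points that `boxplot()` draws as outliers.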

Recreate the box plot above using ggplot and add x and y axis labels as well as a title. Describe any differences you notice about the two plots.

# fill in your code here
p3 <- ch_df_na %>%
  ggplot(aes(feed, weight, fill = feed)) + 
  ggtitle("Weights by Feed Type") +
  xlab("Feed Type") +
  ylab("Range of Weights by Feed Type") +
  geom_boxplot() +
  scale_fill_discrete(name = "feed", labels = c("casein", "horsebean","linseed", "meatmeal", "not sure", "soybean", "sunflower", "unknown"))
p3

### The ggplot boxplots are prettier. The base R boxplots have a more informative y axis, with more tick marks. The data are the same. The ggplot boxplots have much more informative labels on the x axis: all of them are shown, whereas the base R boxplots show only a subset of the labels.

Robert Jerome Anderson