getwd()
## [1] "C:/Users/Jerome/Documents/0000_Work_Files/0000_Montgomery_College/Data_Science_101/Data_101_Fall_2022/Homework_9_Due_7Nov2022/project2b"
### Setup
####install.packages("naniar") I tried this package while converting weight from character to numeric. That code did not work, and I believe I deleted those commands. Ultimately transform() worked, as I explain below.
####library(naniar)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(readr)
Download chickens.csv to your working directory. Make sure to set your working directory appropriately! This dataset was created by modifying the R built-in dataset chickwts.
Import the chickens.csv data into R. Store it in a data.frame named ch_df and print out the entire ch_df to the screen.
# fill in your code here
ch_df <- read_csv( "chickens.csv")
## Rows: 71 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): weight, feed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(ch_df)
## # A tibble: 71 × 2
## weight feed
## <chr> <chr>
## 1 206 meatmeal
## 2 140 horsebean
## 3 <NA> <NA>
## 4 318 sunflower
## 5 332 casein
## 6 na horsebean
## 7 216 na
## 8 143 horsebean
## 9 271 soybean
## 10 315 meatmeal
## # … with 61 more rows
There are some missing values in this dataset. Unfortunately they are represented in a number of different ways.
# fill in your code here
table(is.na(ch_df))
##
## FALSE TRUE
## 130 12
table(is.na(ch_df$weight))
##
## FALSE TRUE
## 64 7
table(is.na(ch_df$feed))
##
## FALSE TRUE
## 66 5
### In the original table, 12 NA values were observed, 7 in the weight column and 5 in the feed column.
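### After cleaning (recoding '?', 'na', '-', and 'N/A' as NA), 25 NA values are observed. The feed values 'not sure' and 'unknown' are left as they are.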
ch_df[ch_df == '?'] <- NA
ch_df[ch_df == 'na'] <- NA
ch_df[ch_df == '-'] <- NA
ch_df[ch_df == 'N/A'] <- NA
table(is.na(ch_df))
##
## FALSE TRUE
## 117 25
table(ch_df$weight)
##
## 108 140 141 143 148 153 158 160 168 169 171 179 181 206 213 216 217 222 227 229
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 243 244 248 250 257 258 260 263 267 271 283 295 303 309 315 316 318 322 325 327
## 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 1 1 1
## 329 332 334 340 341 344 352 368 379 380 392 404 423
## 1 1 1 1 1 1 1 1 1 1 1 1 1
table(ch_df$feed)
##
## casein horsebean linseed meatmeal not sure soybean sunflower unknown
## 10 9 10 8 1 11 10 2
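The four separate replacements above could also be written as one pass over a vector of missing-value markers. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative, not rows from chickens.csv). Treating "not sure" and "unknown" as missing is an assumption that the steps above do not make.

```r
# Toy data standing in for ch_df (illustrative values only)
toy <- data.frame(
  weight = c("206", "na", "?", "318", NA, "-"),
  feed   = c("meatmeal", "horsebean", "N/A", "unknown", "not sure", "casein"),
  stringsAsFactors = FALSE
)

# Strings assumed to mean "missing"; the last two are an assumption
missing_markers <- c("?", "na", "-", "N/A", "not sure", "unknown")

# Replace every marker with NA, one column at a time
toy[] <- lapply(toy, function(col) {
  col[col %in% missing_markers] <- NA
  col
})

# With the markers gone, weight converts to numeric without warnings
toy$weight <- as.numeric(toy$weight)

# Count the NA values per column
colSums(is.na(toy))
```

Keeping the markers in one vector means a newly discovered spelling of "missing" only has to be added in one place.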
Now that the dataset is clean, let’s see what percentage of our data is missing.
# fill in your code here
15/71
## [1] 0.2112676
10/71
## [1] 0.1408451
25/71
## [1] 0.3521127
###Percentage of missing data in the weight column: 21.13%
###Percentage of missing data in the feed column: 14.08%
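###Percentage of missing data in ch_df: 17.61% (25 missing values out of 71 rows x 2 columns = 142 cells; 25/71 = 35.21% divides by the number of rows rather than the number of cells)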
EXTRA CREDIT (Optional): Figure out how to create these print statements so that the name and percentage number are not hard-coded into the statement. In other words, so that the name and percentage number are read in dynamically (for example, from a variable, from a function call, etc.) instead of just written in the statement. Please ask me for clarification if necessary.
# fill in your code here
### I think I'll skip this.
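For reference, one way the extra-credit statements could be produced without hard-coding names or numbers is sketched below. It runs on a small made-up data frame (`toy`), and `pct_missing` is a hypothetical helper name. The same loop should work on ch_df, but that has not been run here.

```r
# Toy data standing in for ch_df (illustrative values only)
toy <- data.frame(
  weight = c(206, NA, 318, NA),
  feed   = c("casein", NA, "soybean", "linseed"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: percentage of NA values in a vector or data frame
pct_missing <- function(x) mean(is.na(x)) * 100

# One statement per column; the name and number come from the data
for (col in names(toy)) {
  cat(sprintf("Percentage of missing data in the %s column: %.2f%%\n",
              col, pct_missing(toy[[col]])))
}

# Whole data frame: is.na() returns a logical matrix, so mean() covers all cells
cat(sprintf("Percentage of missing data in %s: %.2f%%\n",
            "toy", pct_missing(toy)))
```

Because `mean()` of a logical vector is the proportion of TRUE values, no counts need to be typed in by hand.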
# fill in your code here
### Explanation: When I tried to calculate the mean and median for each group, I got an error because weight was stored as character. I tried several times to convert weight to numeric with as.numeric, but it did not work. I then removed the NAs from the entire data frame and tried as.numeric again, which still did not work. After searching online, I found the transform() approach at https://www.geeksforgeeks.org/how-to-convert-dataframe-column-from-character-to-numeric-in-r/ which gave me a numeric weight column. Next I will calculate the mean and median for each group.
ch_df_na <- na.omit(ch_df)
ch_df_na <- transform(ch_df_na,
weight = as.numeric(weight))
class(ch_df_na$weight)
## [1] "numeric"
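As a point of comparison with the transform() call above, the sketch below shows the same conversion done with as.numeric() directly on a small made-up data frame (`toy` is illustrative). The key detail is that the converted column is assigned back to the data frame. Whether a missing assignment was the cause of the earlier failed attempts is not known, so this is only one possible explanation.

```r
# Toy data standing in for ch_df_na (illustrative values only)
toy <- data.frame(
  weight = c("206", "140", NA),
  feed   = c("meatmeal", "horsebean", "casein"),
  stringsAsFactors = FALSE
)

# as.numeric() returns a new vector and does not modify toy in place
class(toy$weight)

# Assigning the result back to the column makes the change stick
toy$weight <- as.numeric(toy$weight)
class(toy$weight)
```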
### I tried the following (now commented-out) code first. It worked, except that it did not save the table of means and medians. I then found your code in the bandwidth example and tried it; after several false starts, it worked.
#ch_df_na$feed <- as.factor(ch_df_na$feed)
#class(ch_df_na$feed)
#feed_grps <- group_by(ch_df_na, feed)
#summarise(feed_grps,
#weight_mean=mean(weight),
#weight_median=median(weight))
feed_grps2 <- ch_df_na %>%
  group_by(feed) %>%
  summarise(weight_mean = mean(weight),
            weight_median = median(weight))
# fill in your code here
### Reading from the summary table (feed_grps2) created in the preceding step, I see that sunflower has a median weight of 340, the highest of all the feed groups.
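Instead of reading the answer off the summary table, the group with the highest median could be looked up in code. This is a minimal base R sketch on made-up data (`toy`, `medians`, and `top_feed` are illustrative names and values, not the chickens.csv results).

```r
# Toy data standing in for ch_df_na (illustrative values only)
toy <- data.frame(
  feed   = c("casein", "casein", "casein",
             "horsebean", "horsebean",
             "sunflower", "sunflower", "sunflower"),
  weight = c(300, 320, 340, 140, 160, 330, 350, 400)
)

# Median weight per feed group, returned as a named vector
medians <- tapply(toy$weight, toy$feed, median)

# Name of the group with the largest median
top_feed <- names(which.max(medians))
top_feed
```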
# fill in your code here
### Note: I had the same problem with weight here. It was not numeric, so I could not create a histogram. I applied the same transform() code that I used on the smaller "No NAs" data frame to change weight to numeric so I could create the histogram.
ch_df <- transform(ch_df,
weight = as.numeric(weight))
class(ch_df$weight)
## [1] "numeric"
hist(ch_df$weight)
# fill in your code here
boxplot(weight ~ feed, data = ch_df_na)
# fill in your code here
#feed_grps3 <- ch_df_na %>% group_by(feed) %>%
#summarise(weight_mean = mean(weight),
#weight_median = median(weight))
#ch_df_na <- transform(ch_df_na,
# feed = as.factor(feed))
#class(ch_df_na$feed)
casein <- filter(ch_df_na, feed == "casein")
horsebean <- filter(ch_df_na, feed == "horsebean")
fivenum(horsebean$weight)
## [1] 108.0 141.5 160.0 173.5 227.0
IQR(horsebean$weight)
## [1] 32
fivenum(casein$weight)
## [1] 222.0 271.5 325.0 360.0 379.0
IQR(casein$weight)
## [1] 78.75
sunflower <- filter(ch_df_na, feed == "sunflower")
fivenum(sunflower$weight)
## [1] 318.0 328.0 340.0 366.5 423.0
IQR(sunflower$weight)
## [1] 38.5
### The charts show that the weight of the chickens varies with the type of feed they received. I ran IQR and fivenum for 3 of the 8 groups and found that the results correspond with the values shown on the boxplots. Horsebean has one outlier, a high value of 227, as shown on the boxplot. No outliers appear on the histogram, at least as I drew it. No group stands apart from the rest.
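The horsebean outlier can be checked by hand with the usual boxplot rule, which flags points more than 1.5 times the hinge spread beyond the hinges. The sketch below uses the fivenum() values reported above for horsebean; the variable names are illustrative.

```r
# fivenum() output for horsebean, copied from the results above:
# minimum, lower hinge, median, upper hinge, maximum
horsebean_five <- c(108.0, 141.5, 160.0, 173.5, 227.0)

# Spread between the hinges (the box height on the boxplot)
hinge_spread <- horsebean_five[4] - horsebean_five[2]

# Fences at 1.5 times the spread beyond each hinge
upper_fence <- horsebean_five[4] + 1.5 * hinge_spread
lower_fence <- horsebean_five[2] - 1.5 * hinge_spread

# The maximum lies above the upper fence, so it is drawn as an outlier
horsebean_five[5] > upper_fence

# The minimum lies inside the lower fence, so it is not an outlier
horsebean_five[1] < lower_fence
```

Note that boxplot() works from the fivenum() hinges, while IQR() uses quantile(), so the two measures of spread can differ. For casein above, the hinges are 88.5 apart while IQR() reports 78.75.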
# fill in your code here
p3 <- ch_df_na %>%
  ggplot(aes(feed, weight, fill = feed)) +
  ggtitle("Weights by Feed Type") +
  xlab("Feed Type") +
  ylab("Range of Weights by Feed Type") +
  geom_boxplot() +
  scale_fill_discrete(name = "feed",
                      labels = c("casein", "horsebean", "linseed", "meatmeal",
                                 "not sure", "soybean", "sunflower", "unknown"))
p3
### The ggplot boxplots are prettier. The base R boxplots have a more informative y axis, with more tick marks. The data are the same. The ggplot boxplots have much more informative labels on the x axis: all are shown, whereas the base R boxplots show only a subset of the labels.