library(here)
library(dplyr)
library(readxl)
library(tidyr)
library(tidyverse)
library(tibble)
First open up excel to preview the data. (I cannot seem to pull in an image into R otherwise I would to show what it looks like in excel)
I see column A and G are blank columns. Also the four rows are not necessary to pull in. I will use the range function to pull in just from ‘B5:L125’ and name the columns with col_names. For column G that still gets pulled in, I will label that column ‘delete’ and delete it by selecting it and using !contains to delete that column.
eggPoultry <- here("challenge_datasets/organiceggpoultry.xls") %>%
read_excel(range = "B6:L125", col_names= c("Date", "Doz Extra Large", ".5 Doz Extra Large", "Doz Large", ".5 Doz Large", "delete", "Whole Chicken", "B/S Breast", "Bone-in Breast", "Legs","Thighs"))%>%
select(!contains("delete"))
head(eggPoultry)
## # A tibble: 6 × 10
## Date `Doz Extra Large` `.5 Doz Extra Large` `Doz Large` `.5 Doz Large`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Jan 2004 230 132 230 126
## 2 February 230 134. 226. 128.
## 3 March 230 137 225 131
## 4 April 234. 137 225 131
## 5 May 236 137 225 131
## 6 June 241 137 231. 134.
## # ℹ 5 more variables: `Whole Chicken` <dbl>, `B/S Breast` <dbl>,
## # `Bone-in Breast` <chr>, Legs <dbl>, Thighs <chr>
There are cells with the text ‘too few’. I assume this means there is too few data to sold within poultry to be logged into this data. I assume this text in the cell is causing two columns (Bone-in Breast & Thighs) to be categorized as character rather than numeric.
I to use mutate to replace the character columns with as.numeric to change to numeric.
eggPoultry <- eggPoultry %>%
mutate(`Bone-in Breast` = as.numeric(`Bone-in Breast`),
Thighs = as.numeric(Thighs))
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Bone-in Breast = as.numeric(`Bone-in Breast`)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
head(eggPoultry)
## # A tibble: 6 × 10
## Date `Doz Extra Large` `.5 Doz Extra Large` `Doz Large` `.5 Doz Large`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Jan 2004 230 132 230 126
## 2 February 230 134. 226. 128.
## 3 March 230 137 225 131
## 4 April 234. 137 225 131
## 5 May 236 137 225 131
## 6 June 241 137 231. 134.
## # ℹ 5 more variables: `Whole Chicken` <dbl>, `B/S Breast` <dbl>,
## # `Bone-in Breast` <dbl>, Legs <dbl>, Thighs <dbl>
First The data column has the year only for January, but I want a column with all the years. I use mutate to have a new column made of year to be if the str_detect finds “Jan” and then gsub can subs ‘Jan’ for blank to just get the year. To use fill to fill the year down the column.
In class, str_detect was mentioned so I dug into string functions and found a str_replace. I am sure there is a more effective way to remove the years in the date column (which contains month), but the str_replace is able to remove the occurance in the date column with nothing and I indicated all the years to be removed in reach row.
eggPoultry <- eggPoultry %>%
mutate(year = ifelse(str_detect(Date, "Jan"), gsub("Jan", "", Date), NA)) %>%
fill(year, .direction = "down")
eggPoultryClean <- eggPoultry %>%
mutate(
Date = str_replace(Date, "2004", ""),
Date = str_replace(Date, "2005", ""),
Date = str_replace(Date, "2006", ""),
Date = str_replace(Date, "2007", ""),
Date = str_replace(Date, "2008", ""),
Date = str_replace(Date, "2009", ""),
Date = str_replace(Date, "2010", ""),
Date = str_replace(Date, "2011", ""),
Date = str_replace(Date, "2012", ""),
Date = str_replace(Date, "2013", ""),
Date = str_replace(Date, "/1", ""))
head(eggPoultryClean)
## # A tibble: 6 × 11
## Date `Doz Extra Large` `.5 Doz Extra Large` `Doz Large` `.5 Doz Large`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 "Jan " 230 132 230 126
## 2 "February" 230 134. 226. 128.
## 3 "March" 230 137 225 131
## 4 "April" 234. 137 225 131
## 5 "May" 236 137 225 131
## 6 "June" 241 137 231. 134.
## # ℹ 6 more variables: `Whole Chicken` <dbl>, `B/S Breast` <dbl>,
## # `Bone-in Breast` <dbl>, Legs <dbl>, Thighs <dbl>, year <chr>
“According to Wickham & Grolemund, there are three principles for tidy data.
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.”
In the eggPoultry data set, each observation does not have its own row. The goal is to make the table longer so eggs and chicken products are not both occupying a row together. To do this, I will use pivot_longer and list out all the products to be put into a column named ‘Product’ with the prices of each product going in a column named ‘Price’
eggPoultryLong <- eggPoultryClean %>%
pivot_longer(c("Doz Extra Large", ".5 Doz Extra Large", "Doz Large", ".5 Doz Large", "Whole Chicken", "B/S Breast", "Bone-in Breast", "Legs","Thighs"), names_to = "Product", values_to = "Price")
head(eggPoultryLong)
## # A tibble: 6 × 4
## Date year Product Price
## <chr> <chr> <chr> <dbl>
## 1 "Jan " " 2004" Doz Extra Large 230
## 2 "Jan " " 2004" .5 Doz Extra Large 132
## 3 "Jan " " 2004" Doz Large 230
## 4 "Jan " " 2004" .5 Doz Large 126
## 5 "Jan " " 2004" Whole Chicken 198.
## 6 "Jan " " 2004" B/S Breast 646.
It is missing the unit (price per carton for eggs and price cents per pound for chicken), but I am not sure how to add a column with the data when it is was not column in the original data set but noted at the top of the excel.