library(tidyverse) # load all libraries needed for this assignment
#> Warning: package 'tidyverse' was built under R version 4.0.3
#> -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
#> v ggplot2 3.3.2     v purrr   0.3.4
#> v tibble  3.0.4     v dplyr   1.0.2
#> v tidyr   1.1.2     v stringr 1.4.0
#> v readr   1.4.0     v forcats 0.5.0
#> Warning: package 'tibble' was built under R version 4.0.3
#> Warning: package 'readr' was built under R version 4.0.3
#> Warning: package 'dplyr' was built under R version 4.0.3
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(knitr)
#library(readr)
#library(dplyr)
#library(stringr)
library(magrittr) # need to check this package
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(utils)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

Web link: https://rpubs.com/amekueko/688323

Github link: https://github.com/asmozo24/TidyVerse_Extended_Assignment

Description

In this assignment, we extend a classmate’s existing work on the Tidyverse assignment. In addition, we evaluate the classmate’s overall work and annotated code, based on the instructor’s guidance. The classmates’ work can be found in the public repo (https://github.com/acatlin/FALL2020TIDYVERSE). We randomly selected submissions from Orli Khaimova, BDavidoff and John Mazon. We chose to extend Orli Khaimova’s work because her dataset is rich and offers the opportunity to practice more tidyverse packages, as we demonstrate below.

Orli Khaimova (23/25): https://github.com/acatlin/FALL2020TIDYVERSE/blob/master/Orli%20Khaimova%20Create%20tidyverse.Rmd

BDavidoff (23/25): https://github.com/acatlin/FALL2020TIDYVERSE/blob/master/Tidyverse_Create.Rmd

John Mazon (25/25): https://github.com/acatlin/FALL2020TIDYVERSE/blob/master/tidyverse_johnmazon.Rmd

Data source

The dataset called ‘Avocado Prices’ was collected from kaggle.com, where it was last updated by Justin Kiggins 2 years ago. It contains historical data on avocado prices and sales volume in multiple US markets. The dataset is about 2 MB.
https://www.kaggle.com/neuromusic/avocado-prices

Research Question

What is the total volume by region?

I am a bit lost here… maybe tell your reader what you are about to do (a description), in one or two lines…

R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Factors are also helpful for reordering character vectors to improve display. The goal of the forcats package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values. Some examples include:

https://rdrr.io/cran/forcats/f/README.md

fct_rev relevels the levels of a factor in reverse order. In this case, I factored the regions and then used the function in order to put them in reverse alphabetical order. By doing so, I was able to print the regions alphabetically in the graph.
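To make the effect of fct_rev concrete, here is a minimal sketch on a toy factor. The three region names are a small sample chosen for illustration, not the full dataset.

```r
library(forcats)

# Toy factor with a few region names (illustrative sample only)
region <- factor(c("Albany", "Boston", "Chicago", "Albany"))

levels(region)                 # default level order is alphabetical

rev_region <- fct_rev(region)  # same values, level order reversed
levels(rev_region)
```

After coord_flip(), ggplot2 draws the first level at the bottom of the axis, which is why reversing the levels makes the regions read alphabetically from top to bottom.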

geom_pointrange graphs the interval for the average price of avocados for each region. I had to define a ymin and ymax as well. It is useful for drawing confidence intervals and in this case the range of prices.


# I think it is good to add some comments so that others better understand your approach or code.

avocados <- read.csv("https://raw.githubusercontent.com/okhaimova/DATA-607/master/avocado.csv")

# let's take a look at avocados details 
str(avocados) # 18249 obs. of  14 variables
#> 'data.frame':    18249 obs. of  14 variables:
#>  $ X           : int  0 1 2 3 4 5 6 7 8 9 ...
#>  $ Date        : chr  "2015-12-27" "2015-12-20" "2015-12-13" "2015-12-06" ...
#>  $ AveragePrice: num  1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
#>  $ Total.Volume: num  64237 54877 118220 78992 51040 ...
#>  $ X4046       : num  1037 674 795 1132 941 ...
#>  $ X4225       : num  54455 44639 109150 71976 43838 ...
#>  $ X4770       : num  48.2 58.3 130.5 72.6 75.8 ...
#>  $ Total.Bags  : num  8697 9506 8145 5811 6184 ...
#>  $ Small.Bags  : num  8604 9408 8042 5677 5986 ...
#>  $ Large.Bags  : num  93.2 97.5 103.1 133.8 197.7 ...
#>  $ XLarge.Bags : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ type        : chr  "conventional" "conventional" "conventional" "conventional" ...
#>  $ year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
#>  $ region      : chr  "Albany" "Albany" "Albany" "Albany" ...
view(avocados) # view() opens the data in a separate window, so you do not have to rerun code each time you want to look at it

#let's check missing data
sum(is.na(avocados))
#> [1] 0

# I think saving the dataset to a local drive or other storage might be a safe practice
write.csv(avocados, 'avocados.csv', row.names = FALSE) # the data already has an index column (X), so row names are not written
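One way to combine the download and the local copy is a small caching helper. This is only a sketch: read_cached_csv is a hypothetical name, and the local file name in the usage comment is my assumption.

```r
# Hypothetical helper: download the CSV only when no local copy exists yet,
# then always read from the local file
read_cached_csv <- function(url, path) {
  if (!file.exists(path)) {
    download.file(url, destfile = path, mode = "wb")
  }
  read.csv(path)
}

# Usage (left commented out because it needs network access):
# avocados <- read_cached_csv(
#   "https://raw.githubusercontent.com/okhaimova/DATA-607/master/avocado.csv",
#   "avocado.csv"  # assumed local file name
# )
```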

Using the dplyr package (grammar)


# let's find out how many years of data are in the dataset
avocados %>%
  count(year)


# so 4 years, from 2015 to 2018
# now let's look at the data by year with only selected columns; there is another way of doing this, but probably not in dplyr
avocados2015 <- avocados %>% 
  select(Date, AveragePrice, Total.Volume, type, year, region) %>% 
  filter(year == 2015) 
view(avocados2015)

# let's see how many regions are present in the new data
avocados2015 %>%
  count(region)


# there are six summary regions (West, TotalUS, Northeast, Midsouth, SouthCentral, Southeast); let's remove them, this can be done with base R's grepl()
# note: grepl() matches substrings, so "West" also drops any region whose name contains "West"
avocados2015a <- avocados2015 %>%
  filter(!grepl("West|TotalUS|Northeast|Midsouth|SouthCentral|Southeast", region))

# slice() selects rows by position, not by value, so it cannot remove regions by name
#avocados2015a <- avocados2015 %>% 
#  slice(-c(West, TotalUS, Northeast, Midsouth, SouthCentral, Southeast))

#let's take a look at the result 
avocados2015a %>%
  count(region)
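Since grepl() matches substrings, it may be worth comparing it with exact matching. Below is a small sketch on made-up rows; the toy data frame and the summary_regions vector are my own illustration, not part of the original code.

```r
library(dplyr)

# Toy data: two summary regions and two market regions (volumes are made up)
toy <- data.frame(
  region = c("West", "WestTexNewMexico", "TotalUS", "Albany"),
  Total.Volume = c(100, 20, 500, 5),
  stringsAsFactors = FALSE
)

summary_regions <- c("West", "TotalUS", "Northeast", "Midsouth",
                     "SouthCentral", "Southeast")

# Substring matching: "West" also matches "WestTexNewMexico"
partial <- toy %>% filter(!grepl("West|TotalUS", region))

# Exact matching: only rows whose region equals a listed name are dropped
exact <- toy %>% filter(!region %in% summary_regions)
```

With exact matching, WestTexNewMexico stays in the data. Which behaviour is wanted depends on whether that region counts as a summary region.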


# Now, let's rearrange the dates by month; the actual date format is yyyy-mm-dd
# let's see the data structure
str(avocados2015a)
#> 'data.frame':    4888 obs. of  6 variables:
#>  $ Date        : chr  "2015-12-27" "2015-12-20" "2015-12-13" "2015-12-06" ...
#>  $ AveragePrice: num  1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
#>  $ Total.Volume: num  64237 54877 118220 78992 51040 ...
#>  $ type        : chr  "conventional" "conventional" "conventional" "conventional" ...
#>  $ year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
#>  $ region      : chr  "Albany" "Albany" "Albany" "Albany" ...

# converting the Date column from character to Date; a four-digit year needs %Y (%y expects a two-digit year)
avocados2015b <- avocados2015a %>%
  mutate(Date = as.Date(Date, format = "%Y-%m-%d")) %>%
  arrange(Date)

# another way of converting date and time is strptime(); the format string must match the data
#strptime(avocados2015a$Date, "%Y-%m-%d")

Using stringr

If date formatting is troublesome in the dplyr grammar, another way is to use the stringr grammar: change the appearance of the date from yyyy-mm-dd (2015-12-27) to 2015-December-27, then search the string for a word such as “December” or “November”. We can then plot average price or total volume against the new month variable, or create a faceted plot by month (facet_wrap()).


# changing the date format from yyyy-mm-dd (2015-12-27) to 2015-December-27
# ISOdate() expects numeric year, month and day arguments, so format() on a Date is used instead
#avocados2015a %>%
#  mutate(Date = format(as.Date(Date), "%Y-%B-%d"))

# another way to detect a matching pattern: str_detect() works on a column inside filter(), here matching month 12
#avocados2015a %>% 
#  filter(str_detect(Date, "-12-"))
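The month-name idea from this section can be sketched with stringr on a few made-up rows. The column names month_num, month_name and Date_long are my own, and month.name is base R's built-in vector of English month names, so the result should not depend on the locale.

```r
library(dplyr)
library(stringr)

# Toy rows in the same yyyy-mm-dd layout as the Date column
toy <- data.frame(
  Date = c("2015-12-27", "2015-11-29", "2015-12-06"),
  AveragePrice = c(1.33, 1.28, 1.08),
  stringsAsFactors = FALSE
)

with_month <- toy %>%
  mutate(
    month_num  = as.integer(str_sub(Date, 6, 7)),  # characters 6-7 hold the month
    month_name = month.name[month_num],            # e.g. 12 becomes "December"
    Date_long  = str_c(str_sub(Date, 1, 4), month_name,
                       str_sub(Date, 9, 10), sep = "-")
  )

# Search the rewritten date for a month by name
december <- with_month %>% filter(str_detect(Date_long, "December"))
```

The month_name column could then serve as the variable for facet_wrap() or for grouping.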

Plots


# let's do some plots: total volume by region. We can do this two ways: #1 is to sum total volume by region, #2 is a bar plot or histogram
# option 2: I will use a bar plot with ggplot2
# there is a bit of a problem with size: with 47 regions, the total volume scale on the y axis needs to be adjusted
# fct_reorder() orders by the median by default, while geom_col() stacks the rows into a sum, so we reorder by the sum
avocados2015a %>%
  ggplot(aes(x = forcats::fct_reorder(region, Total.Volume, .fun = sum), y = Total.Volume)) +
  geom_col() + 
  coord_flip() + 
  labs(x="Region", y="Total Volume of Avocados", title = "Total Volume of Avocados by U.S. Region")


# now let's add the two types, conventional and organic
avocados2015a %>%
  ggplot(aes(x = forcats::fct_reorder(region, Total.Volume, .fun = sum), y = Total.Volume, fill = type)) +
  geom_col() + 
  coord_flip() + 
  labs(x="Region", y="Total Volume of Avocados", title = "Total Volume of Avocados by U.S. Regions")
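Option 1 mentioned in the comments (summing total volume by region) answers the research question directly as a table. Here is a minimal sketch on toy numbers; the values and the total_volume column name are made up for illustration.

```r
library(dplyr)

# Toy data with the same columns used in the plots (values are made up)
toy <- data.frame(
  region = c("Albany", "Albany", "Chicago", "Chicago", "Boston"),
  type = c("conventional", "organic", "conventional", "organic", "conventional"),
  Total.Volume = c(100, 10, 900, 50, 400),
  stringsAsFactors = FALSE
)

volume_by_region <- toy %>%
  group_by(region) %>%
  summarise(total_volume = sum(Total.Volume), .groups = "drop") %>%
  arrange(desc(total_volume))   # largest region first
```

The same pipeline should work on avocados2015a, and the summarised table could be passed to geom_col() so that each region is drawn as a single bar.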

Plot average price by date (not yet what I really want)

# Not really what I want yet; Date is still a character column here, so it is converted to get a continuous date axis
avocados %>%
  ggplot(aes(x = as.Date(Date), y = AveragePrice)) +
  geom_point(color = "blue") +
  labs(title = "Average Price of Avocados by Date",
       #subtitle = "average price trend ",
       y = "Average Price",
       x = "Date") + theme_bw(base_size = 15)


# let's plot by facet, one panel per year
avocados %>%
  ggplot(aes(x = as.Date(Date), y = AveragePrice)) +
  geom_point(color = "blue") +
  facet_wrap(~ year, scales = "free_x") +  # every layer must be joined with +, otherwise labs() and theme_bw() are not applied
  labs(title = "Average Price of Avocados by Year",
       #subtitle = "average price trend ",
       y = "Average Price",
       x = "Date") + theme_bw(base_size = 15)

# Example here: I think you want to redefine the data type as Date
avocados$Date <- as.Date(avocados$Date)
# redefining year as character... I thought year would ideally be numeric or integer.
# So I can see you want to do something special with year, but I cannot tell what
avocados$year <- as.character(avocados$year)
# Good comment
# factor the regions, then use forcats to reverse the order to make it z-a
avocados$region <- avocados$region %>%
  as.factor() %>%
  fct_rev()
# I have never done this long a grouping for a plot... it looks good. Reviewing your code actually shows me my own mistake: there is no way to see the result without running it. I think there was an option to include a sample result or an HTML version.
# The avocados$Date vector contains about 1000 values... a little heavy. I had that problem too.
avocados %>% 
  ggplot(aes(y = AveragePrice, x = region, 
             ymin = AveragePrice-sd(AveragePrice), ymax = AveragePrice+sd(AveragePrice))) +
  geom_pointrange(aes(color = as.factor(region)), size =.01) +
  ylab("Average Price") +
  xlab("Region") +
  ggtitle("Average Price Range") +
  coord_flip() +
  theme(legend.position = "none")
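As far as I can tell, sd(AveragePrice) inside aes() is computed over the whole dataset, and every row gets its own point range. If the goal is one range per region, a possible alternative is to summarise first. This is a sketch on toy prices, and the mean_price and sd_price names are my own.

```r
library(dplyr)
library(ggplot2)

# Toy prices for two regions (values are made up)
toy <- data.frame(
  region = rep(c("Albany", "Chicago"), each = 3),
  AveragePrice = c(1.0, 1.2, 1.4, 1.5, 1.5, 1.8),
  stringsAsFactors = FALSE
)

# One row per region: mean price and standard deviation within that region
price_summary <- toy %>%
  group_by(region) %>%
  summarise(mean_price = mean(AveragePrice),
            sd_price   = sd(AveragePrice),
            .groups = "drop")

# One point range per region: mean plus or minus one standard deviation
p <- ggplot(price_summary,
            aes(x = region, y = mean_price,
                ymin = mean_price - sd_price,
                ymax = mean_price + sd_price)) +
  geom_pointrange() +
  coord_flip() +
  labs(x = "Region", y = "Average Price", title = "Average Price Range")
```

Printing p draws the plot. This is a different summary from the original figure, so it is meant as a comparison rather than a replacement.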

Conclusion

In this peer-review assignment, I formatted the YAML header and added the Description, Data source, Research Question, dplyr, stringr, Plots and plot-average-price-by-date sections. The stringr and plot-average-price-by-date sections still need work; I think something about the data needs a closer look. I added a lot of comments to let you know what I was doing. I hope you will be able to tell the difference, and I hope my work will help you. Overall, your dataset is rich and offers a lot of practice for someone who wants to explore more R packages. I actually think this could be a good project if statistical analysis were added.

Collaborating and Providing peer to peer Feedback

Extended Orli Khaimova TidyVerse Assignment

Alexis Mekueko

Orli Khaimova

10/27/2020