Introduction :-

As part of project 2 , where we have to take 3 different messy datasets and try to transform them using tifyr & dplyr and other libraries of R and prepare them for analysis so that we can infer something meaningfull from the same.

Problem Statment :-

This data set contains the avacado consumption of 30 cities throughout US, the data is scattered cross various excel(s) one each for one city. It has a lot of unwanted data which is messy and needs to be filtered out before to can use it to Analyse and infer something from the same.

Solution :-

We are using below libraries in our quest to resolve the above problem.:- readxl
dplyr
kableExtra
sqldf
stringr
ggplot2

Using the list.files method of readxl library , we are loading all the excel(s) in the folder with pattern “.xls” and using sapply function to read all files and bind_rows function add the file name to each row using name “City”
Using read.csv function we are loading another file region_city.csv which contains the city and to which region those city belongs information.
Then we exclude the unwanted columns from the first Data frame using select method of dplyr and apply mutate on that data frame to get the name of the city by removing the ‘.xls’ and unwanted characters from the city name.
Using mutate we add a new column(Total Sales) which is the average price * total volume for that city.
Using the sqldf , we then combine the 2 data frames based on the city , now the new Data frame has everything including the region to which the city belongs.
Then we query to get the region & average Total Sales and group them by region.
Using KappleExtra library we display a few of the records of the final Data frame to be used for our analysis in Graph.
Using ggplot function of ggplot2 library we plot and show which region has the highest average total sales of Avacado .

files <- list.files(pattern = "*.xls")
tbl <- sapply(files, read_excel, simplify=FALSE) %>%
bind_rows(.id = "City")

## readxl works best with a newer version of the tibble package.
## You currently have tibble v1.4.2.
## Falling back to column name repair from tibble <= v1.4.2.
## Message displays once per session.

kable(head(tbl)) %>% 
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")

City	Date	AveragePrice	Total Volume	4046	4225	4770	Total Bags	Small Bags	Large Bags	XLarge Bags
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls	2018-12-02	1.07	533473.8	249111.9	30272.99	2042.89	252046.0	151025.6	101014.75	5.56
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls	2018-11-25	1.08	426317.5	195607.5	28770.91	716.07	201223.0	124864.5	76321.78	36.67
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls	2018-11-18	1.03	499927.1	213505.0	27104.44	1262.54	258055.1	177498.5	80556.69	0.00
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls	2018-11-11	0.94	647115.1	268965.5	26762.72	1907.38	349479.4	251715.8	97753.68	9.99
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls	2018-11-04	0.91	707297.6	290875.5	31476.18	2000.78	382945.1	260623.2	122289.71	32.22
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls	2018-10-28	0.87	679128.2	304275.5	37877.05	2415.06	334560.6	207054.5	124439.39	3066.67

region_table <- read.csv("region_city.csv")

tbl <- select(tbl , 1:4) %>%
          mutate(City = str_replace_all(City,"HAB_Retail_Volume_and_Price_2018_conventional_",""))  %>%
mutate(City = str_replace_all(City,".xls","")) 
colnames(tbl) <- c("City","Date","AvgPrice","TotVol")

tbl <- mutate(tbl,Total_Sales = tbl$AvgPrice * tbl$TotVol)

combine_tbl <- sqldf("select a.City , a.total_sales , a.AvgPrice , b.region from tbl a left outer join region_table b on upper(a.city) = upper(b.city)" ) 

graph_data <- sqldf("select region 'region' , round(avg(total_sales)) avg_sales  from combine_tbl
              group by region" )

kable(head(graph_data)) %>% 
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")

region	avg_sales
CALIFORNIA	5199388
GREAT LAKES	896085
MIDSOUTH	1483060
NORTHEAST	2145480
SOUTH CENTRAL	892462
SOUTHEAST	634714

graph_data %>%
  ggplot(aes(x=region, y=avg_sales, fill=region)) + 
  scale_x_discrete(limits=graph_data$region) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales By Region", x="Region", y="Avg Sales")

Summary :-

Using the graph we can clearly see that California region has the highest total sales of Avacado, followed by North East region and West region has the least consumption on avacado.

Project2_A

Samriti Malhotra

Vishal Arora

March 10, 2019

Introduction :-

Problem Statment :-

Solution :-

Summary :-