As part of project 2 , where we have to take 3 different messy datasets and try to transform them using tifyr & dplyr and other libraries of R and prepare them for analysis so that we can infer something meaningfull from the same.
This data set contains the avacado consumption of 30 cities throughout US, the data is scattered cross various excel(s) one each for one city. It has a lot of unwanted data which is messy and needs to be filtered out before to can use it to Analyse and infer something from the same.
We are using below libraries in our quest to resolve the above problem.:- readxl
dplyr
kableExtra
sqldf
stringr
ggplot2
files <- list.files(pattern = "*.xls")
tbl <- sapply(files, read_excel, simplify=FALSE) %>%
bind_rows(.id = "City")
## readxl works best with a newer version of the tibble package.
## You currently have tibble v1.4.2.
## Falling back to column name repair from tibble <= v1.4.2.
## Message displays once per session.
kable(head(tbl)) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width = F,position = "left",font_size = 12) %>%
row_spec(0, background ="gray")
City | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags |
---|---|---|---|---|---|---|---|---|---|---|
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls | 2018-12-02 | 1.07 | 533473.8 | 249111.9 | 30272.99 | 2042.89 | 252046.0 | 151025.6 | 101014.75 | 5.56 |
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls | 2018-11-25 | 1.08 | 426317.5 | 195607.5 | 28770.91 | 716.07 | 201223.0 | 124864.5 | 76321.78 | 36.67 |
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls | 2018-11-18 | 1.03 | 499927.1 | 213505.0 | 27104.44 | 1262.54 | 258055.1 | 177498.5 | 80556.69 | 0.00 |
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls | 2018-11-11 | 0.94 | 647115.1 | 268965.5 | 26762.72 | 1907.38 | 349479.4 | 251715.8 | 97753.68 | 9.99 |
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls | 2018-11-04 | 0.91 | 707297.6 | 290875.5 | 31476.18 | 2000.78 | 382945.1 | 260623.2 | 122289.71 | 32.22 |
HAB_Retail_Volume_and_Price_2018_conventional_Atlanta.xls | 2018-10-28 | 0.87 | 679128.2 | 304275.5 | 37877.05 | 2415.06 | 334560.6 | 207054.5 | 124439.39 | 3066.67 |
region_table <- read.csv("region_city.csv")
tbl <- select(tbl , 1:4) %>%
mutate(City = str_replace_all(City,"HAB_Retail_Volume_and_Price_2018_conventional_","")) %>%
mutate(City = str_replace_all(City,".xls",""))
colnames(tbl) <- c("City","Date","AvgPrice","TotVol")
tbl <- mutate(tbl,Total_Sales = tbl$AvgPrice * tbl$TotVol)
combine_tbl <- sqldf("select a.City , a.total_sales , a.AvgPrice , b.region from tbl a left outer join region_table b on upper(a.city) = upper(b.city)" )
graph_data <- sqldf("select region 'region' , round(avg(total_sales)) avg_sales from combine_tbl
group by region" )
kable(head(graph_data)) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width = F,position = "left",font_size = 12) %>%
row_spec(0, background ="gray")
region | avg_sales |
---|---|
CALIFORNIA | 5199388 |
GREAT LAKES | 896085 |
MIDSOUTH | 1483060 |
NORTHEAST | 2145480 |
SOUTH CENTRAL | 892462 |
SOUTHEAST | 634714 |
graph_data %>%
ggplot(aes(x=region, y=avg_sales, fill=region)) +
scale_x_discrete(limits=graph_data$region) +
geom_bar(stat="identity") +
labs(title = "Sales By Region", x="Region", y="Avg Sales")
Using the graph we can clearly see that California region has the highest total sales of Avacado, followed by North East region and West region has the least consumption on avacado.