As a first step, we came up with a stategy for working on the assignment together. First, we picked the 3 datasets we were interested in analyzing. Then we each picked one dataset to take a lead on after discussing some basic strategies. We worked on those individually, collaborating and reaching out to each other with questions. That gave us an opportunity to perfect our data wrangling and collaboration skils. Then we worked on a 3rd dataset together.
The dataset contains information from the U.S. Department of Agriculture about the longitude and latitude, state, address, name, zip code, etc, of Farmers Markets in the United States. The data is considered to be wide format because, for example, the variable produce/product is structured as if every single type of produce/product is a variable.
Information is read from a csv file into R. The data has a “wide” structure. The data is then made tidy and tranformed, and analysis is performed.
The goal is to use the tidyverse packages, we will provide alternative ways of perfoming the analyses in this project.
Which of the states have most of farmers markets with organic produce and meat?
What categories of foods are predominantly present in farmers markets in the U.S and by States?
library(knitr)
library(kableExtra)# manipulate table styles
suppressMessages(library(tidyverse))
## Warning: package 'dplyr' was built under R version 3.5.1
df <- read.csv(file="https://raw.githubusercontent.com/che10vek/Data-607-Assignments/master/Export.csv", header=TRUE, sep=",")
str(df, give.attr = F) # with not to show attributes as substructures. If T then str is really long...
## 'data.frame': 8735 obs. of 59 variables:
## $ FMID : int 1018261 1018318 1009364 1010691 1002454 1011100 1009845 1005586 1008071 1012710 ...
## $ MarketName : Factor w/ 8170 levels " Caledonia Farmers Market Association - Danville",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Website : Factor w/ 4326 levels "","http:// jasper.agrilife.org/jasper-county-farmers-market/",..: 4236 3613 644 1 1 721 720 1 1 1 ...
## $ Facebook : Factor w/ 3613 levels ""," https://www.facebook.com/FPConservatory",..: 833 3253 1 1 1 37 423 1813 425 424 ...
## $ Twitter : Factor w/ 781 levels ""," no twitter",..: 1 1 1 1 1 3 319 1 172 124 ...
## $ Youtube : Factor w/ 135 levels ""," Delcambre Seafood and Farmers Market",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ OtherMedia : Factor w/ 632 levels ""," http://instagram.com/forsythfarmersmarket",..: 1 1 1 64 1 3 432 177 1 503 ...
## $ street : Factor w/ 8254 levels "","\t10336 S Estrella Pkwy, 85338, USA",..: 1 4545 502 582 701 2836 1529 1050 1237 4123 ...
## $ city : Factor w/ 5103 levels ""," Buffalo",..: 1108 3471 4166 2396 3159 3081 3159 4991 4786 4787 ...
## $ County : Factor w/ 1504 levels "","Abbeville",..: 209 359 1 97 946 371 946 942 400 400 ...
## $ State : Factor w/ 53 levels "Alabama","Alaska",..: 47 36 42 26 33 44 33 8 9 9 ...
## $ zip : Factor w/ 6359 levels "","00000","0033",..: 503 1 2068 4536 733 2571 731 1361 1390 1392 ...
## $ Season1Date : Factor w/ 2726 levels ""," to 07/01/2016",..: 1904 2104 1 258 2677 775 1789 1053 703 355 ...
## $ Season1Time : Factor w/ 1703 levels "","Fri: 1:00 AM-6:00 PM;",..: 1592 516 1 1454 1343 1096 999 114 516 516 ...
## $ Season2Date : Factor w/ 411 levels "","01/01/2014 to 04/01/2014",..: 190 1 1 1 1 1 1 1 1 1 ...
## $ Season2Time : Factor w/ 215 levels "","Fri: 11:00 AM-3:30 PM;",..: 178 1 1 1 1 1 1 1 1 1 ...
## $ Season3Date : Factor w/ 82 levels "","01/01/2012 to 05/31/2012",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Season3Time : Factor w/ 49 levels "","Fri: 3:00 PM-6:00 PM;",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Season4Date : Factor w/ 7 levels "","01/13/2019 to 01/13/2019",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Season4Time : Factor w/ 6 levels "","Sat: 10:00 AM-1:00 PM;",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ x : num -72.1 -81.7 -82.8 -94.3 -73.9 ...
## $ y : num 44.4 41.4 34.8 37.5 40.8 ...
## $ Location : Factor w/ 11 levels "","Closed-off public street",..: 1 1 1 1 11 1 6 9 10 1 ...
## $ Credit : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 1 2 2 ...
## $ WIC : Factor w/ 2 levels "N","Y": 2 1 1 1 1 1 2 1 2 2 ...
## $ WICcash : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 2 2 ...
## $ SFMNP : Factor w/ 2 levels "N","Y": 2 2 1 1 2 1 2 1 2 2 ...
## $ SNAP : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 2 2 2 2 ...
## $ Organic : Factor w/ 3 levels "-","N","Y": 3 1 1 1 1 3 3 2 3 2 ...
## $ Bakedgoods : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 1 2 2 ...
## $ Cheese : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 2 1 2 1 ...
## $ Crafts : Factor w/ 2 levels "N","Y": 2 2 1 2 2 1 2 1 1 2 ...
## $ Flowers : Factor w/ 2 levels "N","Y": 2 2 1 1 2 2 2 1 2 2 ...
## $ Eggs : Factor w/ 2 levels "N","Y": 2 2 1 2 1 2 2 1 2 2 ...
## $ Seafood : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Herbs : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 2 2 2 ...
## $ Vegetables : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 2 2 2 ...
## $ Honey : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 1 2 2 ...
## $ Jams : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 1 2 2 ...
## $ Maple : Factor w/ 2 levels "N","Y": 2 2 1 1 1 2 2 1 1 1 ...
## $ Meat : Factor w/ 2 levels "N","Y": 2 1 1 2 1 2 2 1 2 2 ...
## $ Nursery : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Nuts : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 2 1 2 1 ...
## $ Plants : Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
## $ Poultry : Factor w/ 2 levels "N","Y": 2 2 1 2 1 2 2 1 2 2 ...
## $ Prepared : Factor w/ 2 levels "N","Y": 2 1 1 2 2 2 2 1 1 2 ...
## $ Soap : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 1 1 1 ...
## $ Trees : Factor w/ 2 levels "N","Y": 2 1 1 1 1 1 1 1 1 1 ...
## $ Wine : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 2 1 1 1 ...
## $ Coffee : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 2 1 1 2 ...
## $ Beans : Factor w/ 2 levels "N","Y": 2 1 1 1 1 1 1 1 2 1 ...
## $ Fruits : Factor w/ 2 levels "N","Y": 2 2 1 2 1 2 2 2 2 2 ...
## $ Grains : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 2 1 ...
## $ Juices : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 2 1 2 1 ...
## $ Mushrooms : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 1 1 1 1 ...
## $ PetFood : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 1 1 1 2 ...
## $ Tofu : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ WildHarvested: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ updateTime : Factor w/ 6348 levels "1/10/2018 12:23:05 PM",..: 3290 3387 386 139 391 2148 2050 1846 1999 4813 ...
For our analysis, not all variables in the data frame are needed. So we select the ones that will be used in the analysis.
wide_df <- df %>% select(FMID,State, Organic:WildHarvested)
This data is considered to be wide format because the variables from Organic to WildHarvested are structured as if there were single variables.
We can reshape the data into the long format by taking the Organic:WildHarvested transform then to a single column called “items”.
long_df <- wide_df %>% gather(Items, Sell_Item,Organic:WildHarvested)
## Warning: attributes are not identical across measure variables;
## they will be dropped
long_df[long_df == "-"] <- NA
kable(head(long_df)) %>% kable_styling(bootstrap_options = "striped", full_width = F)
FMID | State | Items | Sell_Item |
---|---|---|---|
1018261 | Vermont | Organic | Y |
1018318 | Ohio | Organic | NA |
1009364 | South Carolina | Organic | NA |
1010691 | Missouri | Organic | NA |
1002454 | New York | Organic | NA |
1011100 | Tennessee | Organic | Y |
Analysis 1: Which of the states have most of farmers markets offering both organic and meat products?
Analysis 2: What categories of foods are predominantly present in farmers markets in the U.S and by States?
Using the original wide format data:
data1 <- wide_df %>% group_by(State) %>%
select(State, Organic, Meat) %>%
filter(Organic=="Y", Meat=="Y") %>%
count(State) %>%
#ungroup() %>% #if this line is commented,the class function will show "grouped_df",
#but there is no difference in the results shown in the print displayed.
arrange(desc(n))
class(data1)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
print.data.frame(data1) # The function "print.data.frame" from base R, to show all rows.
## State n
## 1 California 189
## 2 New York 147
## 3 Massachusetts 79
## 4 Washington 67
## 5 Wisconsin 67
## 6 Illinois 61
## 7 North Carolina 56
## 8 Oregon 56
## 9 Florida 55
## 10 Michigan 54
## 11 Ohio 51
## 12 Vermont 47
## 13 Georgia 46
## 14 Pennsylvania 46
## 15 Virginia 46
## 16 Colorado 43
## 17 Maine 43
## 18 New Hampshire 43
## 19 Texas 36
## 20 Indiana 35
## 21 Maryland 35
## 22 Tennessee 32
## 23 New Jersey 31
## 24 Minnesota 30
## 25 Connecticut 26
## 26 Kentucky 25
## 27 District of Columbia 24
## 28 Iowa 24
## 29 Missouri 19
## 30 Arkansas 15
## 31 Arizona 13
## 32 Idaho 13
## 33 South Carolina 12
## 34 Kansas 11
## 35 Montana 11
## 36 Hawaii 10
## 37 West Virginia 10
## 38 Nebraska 9
## 39 New Mexico 9
## 40 Rhode Island 9
## 41 Delaware 7
## 42 Louisiana 7
## 43 Nevada 7
## 44 South Dakota 7
## 45 Mississippi 6
## 46 Oklahoma 6
## 47 Alabama 5
## 48 Wyoming 5
## 49 Utah 4
## 50 Alaska 2
## 51 North Dakota 1
Using the long format data:
data2 <- long_df %>%
select(FMID,State, Items,Sell_Item) %>%
filter(Items=="Organic"| Items=="Meat")
kable(head(data2))%>% kable_styling(bootstrap_options = "striped", full_width = F)
FMID | State | Items | Sell_Item |
---|---|---|---|
1018261 | Vermont | Organic | Y |
1018318 | Ohio | Organic | NA |
1009364 | South Carolina | Organic | NA |
1010691 | Missouri | Organic | NA |
1002454 | New York | Organic | NA |
1011100 | Tennessee | Organic | Y |
data2 <-data2 %>% group_by(State) %>%
spread(Items,Sell_Item) %>%
filter(Organic=="Y",Meat=="Y") %>%
count(State) %>%
arrange(desc(n))
print(data2, n=15) # The function "print" from base R, to show n rows.
## # A tibble: 51 x 2
## # Groups: State [51]
## State n
## <fct> <int>
## 1 California 189
## 2 New York 147
## 3 Massachusetts 79
## 4 Washington 67
## 5 Wisconsin 67
## 6 Illinois 61
## 7 North Carolina 56
## 8 Oregon 56
## 9 Florida 55
## 10 Michigan 54
## 11 Ohio 51
## 12 Vermont 47
## 13 Georgia 46
## 14 Pennsylvania 46
## 15 Virginia 46
## # ... with 36 more rows
California and New York are the top states in raw number of farmers market that sells both organic and meat products. An interesting, analysis is to check the proportion to the state’s population.
We can filter the information needed and and count the number of times a particular produce or product appears in the data frame.
df_items <- long_df %>% select(Items, Sell_Item)
freq_items <- df_items %>% group_by(Items) %>%
filter(Sell_Item=="Y") %>%
count(Sell_Item) %>%
arrange(desc(n))
kable(freq_items) %>% kable_styling(bootstrap_options = "striped", full_width = F)
Items | Sell_Item | n |
---|---|---|
Vegetables | Y | 5698 |
Bakedgoods | Y | 5172 |
Fruits | Y | 4799 |
Honey | Y | 4768 |
Jams | Y | 4735 |
Herbs | Y | 4565 |
Eggs | Y | 4400 |
Flowers | Y | 4002 |
Soap | Y | 3983 |
Plants | Y | 3830 |
Crafts | Y | 3636 |
Prepared | Y | 3585 |
Meat | Y | 3230 |
Cheese | Y | 2878 |
Poultry | Y | 2654 |
Organic | Y | 2385 |
Coffee | Y | 2090 |
Maple | Y | 1892 |
Nuts | Y | 1737 |
Trees | Y | 1638 |
Seafood | Y | 1452 |
Juices | Y | 1442 |
Mushrooms | Y | 1395 |
PetFood | Y | 1127 |
Wine | Y | 1017 |
Beans | Y | 861 |
WildHarvested | Y | 855 |
Grains | Y | 848 |
Nursery | Y | 324 |
Tofu | Y | 232 |
Alternately, we can create a frequency table to count the number of markets that sell a particular type of produce or product.
This seems a less efficient approach because we have to perform more transformations to the data.
factor(df_items)
## Items Sell_Item
## <NA> <NA>
## 2 Levels: c("Organic", "Organic", "Organic", "Bakedgoods", "Bakedgoods", "Cheese", "Cheese", "Crafts", "Crafts", "Flowers", "Flowers", "Eggs", "Eggs", "Seafood", "Seafood", "Herbs", "Herbs", "Vegetables", "Vegetables", "Honey", "Honey", "Jams", "Jams", "Maple", "Maple", "Meat", "Meat", "Nursery", "Nursery", "Nuts", "Nuts", "Plants", "Plants", "Poultry", "Poultry", "Prepared", "Prepared", "Soap", "Soap", "Trees", "Trees", "Wine", "Wine", "Coffee", "Coffee", "Beans", "Beans", "Fruits", "Fruits", "Grains", "Grains", \n"Juices", "Juices", "Mushrooms", "Mushrooms", "PetFood", "PetFood", "Tofu", "Tofu", "WildHarvested", "WildHarvested") ...
tb <- table(df_items)
kable(head(tb))%>% kable_styling(bootstrap_options = "striped", full_width = F)
N | Y | |
---|---|---|
Bakedgoods | 3563 | 5172 |
Beans | 7874 | 861 |
Cheese | 5857 | 2878 |
Coffee | 6645 | 2090 |
Crafts | 5099 | 3636 |
Eggs | 4335 | 4400 |
tb <- as.data.frame(tb)
str(tb)
## 'data.frame': 60 obs. of 3 variables:
## $ Items : Factor w/ 30 levels "Bakedgoods","Beans",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Sell_Item: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq : int 3563 7874 5857 6645 5099 4335 4733 3936 7887 4170 ...
tb <- tb %>% filter(tb$Sell_Item=="Y") %>% select(Items, Freq) %>%
arrange(desc(Freq))
kable(tb) %>% kable_styling(bootstrap_options = "striped", full_width = F)
Items | Freq |
---|---|
Vegetables | 5698 |
Bakedgoods | 5172 |
Fruits | 4799 |
Honey | 4768 |
Jams | 4735 |
Herbs | 4565 |
Eggs | 4400 |
Flowers | 4002 |
Soap | 3983 |
Plants | 3830 |
Crafts | 3636 |
Prepared | 3585 |
Meat | 3230 |
Cheese | 2878 |
Poultry | 2654 |
Organic | 2385 |
Coffee | 2090 |
Maple | 1892 |
Nuts | 1737 |
Trees | 1638 |
Seafood | 1452 |
Juices | 1442 |
Mushrooms | 1395 |
PetFood | 1127 |
Wine | 1017 |
Beans | 861 |
WildHarvested | 855 |
Grains | 848 |
Nursery | 324 |
Tofu | 232 |
In addition to a frequency table, we can also graph a bar plot of the number of markets that sell a particular type of produce (or product).
barplot(tb$Freq, main = "U.S. Farmers Market",
ylab = "Frequency",
names.arg = tb$Items,
las=2,
col = "blue")
Data shows that most of the markets sell vegetables and baked goods.