As a first step, we came up with a stategy for working on the assignment together. First, we picked the 3 datasets we were interested in analyzing. Then we each picked one dataset to take a lead on after discussing some basic strategies. We worked on those individually, collaborating and reaching out to each other with questions. That gave us an opportunity to perfect our data wrangling and collaboration skils. Then we worked on a 3rd dataset together.

Dataset 1

The dataset contains information from the U.S. Department of Agriculture about the longitude and latitude, state, address, name, zip code, etc, of Farmers Markets in the United States. The data is considered to be wide format because, for example, the variable produce/product is structured as if every single type of produce/product is a variable.

Tidying, transforming and analyzing Data

Information is read from a csv file into R. The data has a “wide” structure. The data is then made tidy and tranformed, and analysis is performed.

The goal is to use the tidyverse packages, we will provide alternative ways of perfoming the analyses in this project.


Load packages

library(knitr)
library(kableExtra)# manipulate table styles
suppressMessages(library(tidyverse))
## Warning: package 'dplyr' was built under R version 3.5.1

Read the data

df <- read.csv(file="https://raw.githubusercontent.com/che10vek/Data-607-Assignments/master/Export.csv", header=TRUE, sep=",")
str(df, give.attr = F) # with not to show attributes as substructures. If T then str is really long... 
## 'data.frame':    8735 obs. of  59 variables:
##  $ FMID         : int  1018261 1018318 1009364 1010691 1002454 1011100 1009845 1005586 1008071 1012710 ...
##  $ MarketName   : Factor w/ 8170 levels " Caledonia Farmers Market Association - Danville",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Website      : Factor w/ 4326 levels "","http:// jasper.agrilife.org/jasper-county-farmers-market/",..: 4236 3613 644 1 1 721 720 1 1 1 ...
##  $ Facebook     : Factor w/ 3613 levels ""," https://www.facebook.com/FPConservatory",..: 833 3253 1 1 1 37 423 1813 425 424 ...
##  $ Twitter      : Factor w/ 781 levels ""," no twitter",..: 1 1 1 1 1 3 319 1 172 124 ...
##  $ Youtube      : Factor w/ 135 levels ""," Delcambre Seafood and Farmers Market",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ OtherMedia   : Factor w/ 632 levels ""," http://instagram.com/forsythfarmersmarket",..: 1 1 1 64 1 3 432 177 1 503 ...
##  $ street       : Factor w/ 8254 levels "","\t10336 S Estrella Pkwy, 85338, USA",..: 1 4545 502 582 701 2836 1529 1050 1237 4123 ...
##  $ city         : Factor w/ 5103 levels ""," Buffalo",..: 1108 3471 4166 2396 3159 3081 3159 4991 4786 4787 ...
##  $ County       : Factor w/ 1504 levels "","Abbeville",..: 209 359 1 97 946 371 946 942 400 400 ...
##  $ State        : Factor w/ 53 levels "Alabama","Alaska",..: 47 36 42 26 33 44 33 8 9 9 ...
##  $ zip          : Factor w/ 6359 levels "","00000","0033",..: 503 1 2068 4536 733 2571 731 1361 1390 1392 ...
##  $ Season1Date  : Factor w/ 2726 levels ""," to 07/01/2016",..: 1904 2104 1 258 2677 775 1789 1053 703 355 ...
##  $ Season1Time  : Factor w/ 1703 levels "","Fri: 1:00 AM-6:00 PM;",..: 1592 516 1 1454 1343 1096 999 114 516 516 ...
##  $ Season2Date  : Factor w/ 411 levels "","01/01/2014 to 04/01/2014",..: 190 1 1 1 1 1 1 1 1 1 ...
##  $ Season2Time  : Factor w/ 215 levels "","Fri: 11:00 AM-3:30 PM;",..: 178 1 1 1 1 1 1 1 1 1 ...
##  $ Season3Date  : Factor w/ 82 levels "","01/01/2012 to 05/31/2012",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Season3Time  : Factor w/ 49 levels "","Fri: 3:00 PM-6:00 PM;",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Season4Date  : Factor w/ 7 levels "","01/13/2019 to 01/13/2019",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Season4Time  : Factor w/ 6 levels "","Sat: 10:00 AM-1:00 PM;",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ x            : num  -72.1 -81.7 -82.8 -94.3 -73.9 ...
##  $ y            : num  44.4 41.4 34.8 37.5 40.8 ...
##  $ Location     : Factor w/ 11 levels "","Closed-off public street",..: 1 1 1 1 11 1 6 9 10 1 ...
##  $ Credit       : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 1 2 2 ...
##  $ WIC          : Factor w/ 2 levels "N","Y": 2 1 1 1 1 1 2 1 2 2 ...
##  $ WICcash      : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 2 2 ...
##  $ SFMNP        : Factor w/ 2 levels "N","Y": 2 2 1 1 2 1 2 1 2 2 ...
##  $ SNAP         : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 2 2 2 2 ...
##  $ Organic      : Factor w/ 3 levels "-","N","Y": 3 1 1 1 1 3 3 2 3 2 ...
##  $ Bakedgoods   : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 1 2 2 ...
##  $ Cheese       : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 2 1 2 1 ...
##  $ Crafts       : Factor w/ 2 levels "N","Y": 2 2 1 2 2 1 2 1 1 2 ...
##  $ Flowers      : Factor w/ 2 levels "N","Y": 2 2 1 1 2 2 2 1 2 2 ...
##  $ Eggs         : Factor w/ 2 levels "N","Y": 2 2 1 2 1 2 2 1 2 2 ...
##  $ Seafood      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Herbs        : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 2 2 2 ...
##  $ Vegetables   : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 2 2 2 ...
##  $ Honey        : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 1 2 2 ...
##  $ Jams         : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 1 2 2 ...
##  $ Maple        : Factor w/ 2 levels "N","Y": 2 2 1 1 1 2 2 1 1 1 ...
##  $ Meat         : Factor w/ 2 levels "N","Y": 2 1 1 2 1 2 2 1 2 2 ...
##  $ Nursery      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Nuts         : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 2 1 2 1 ...
##  $ Plants       : Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
##  $ Poultry      : Factor w/ 2 levels "N","Y": 2 2 1 2 1 2 2 1 2 2 ...
##  $ Prepared     : Factor w/ 2 levels "N","Y": 2 1 1 2 2 2 2 1 1 2 ...
##  $ Soap         : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 1 1 1 ...
##  $ Trees        : Factor w/ 2 levels "N","Y": 2 1 1 1 1 1 1 1 1 1 ...
##  $ Wine         : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Coffee       : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 2 1 1 2 ...
##  $ Beans        : Factor w/ 2 levels "N","Y": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Fruits       : Factor w/ 2 levels "N","Y": 2 2 1 2 1 2 2 2 2 2 ...
##  $ Grains       : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 2 1 ...
##  $ Juices       : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 2 1 2 1 ...
##  $ Mushrooms    : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 1 1 1 1 ...
##  $ PetFood      : Factor w/ 2 levels "N","Y": 2 1 1 1 1 2 1 1 1 2 ...
##  $ Tofu         : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ WildHarvested: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ updateTime   : Factor w/ 6348 levels "1/10/2018 12:23:05 PM",..: 3290 3387 386 139 391 2148 2050 1846 1999 4813 ...

Select data

For our analysis, not all variables in the data frame are needed. So we select the ones that will be used in the analysis.

wide_df <- df %>% select(FMID,State, Organic:WildHarvested)

Tidying the data

This data is considered to be wide format because the variables from Organic to WildHarvested are structured as if there were single variables.

We can reshape the data into the long format by taking the Organic:WildHarvested transform then to a single column called “items”.

long_df <- wide_df %>% gather(Items, Sell_Item,Organic:WildHarvested)
## Warning: attributes are not identical across measure variables;
## they will be dropped
long_df[long_df == "-"] <- NA
kable(head(long_df)) %>% kable_styling(bootstrap_options = "striped", full_width = F)
FMID State Items Sell_Item
1018261 Vermont Organic Y
1018318 Ohio Organic NA
1009364 South Carolina Organic NA
1010691 Missouri Organic NA
1002454 New York Organic NA
1011100 Tennessee Organic Y

Analysis

  • Analysis 1: Which of the states have most of farmers markets offering both organic and meat products?

  • Analysis 2: What categories of foods are predominantly present in farmers markets in the U.S and by States?

Analysis 1

Using the original wide format data:

data1 <- wide_df %>% group_by(State) %>% 
        select(State, Organic, Meat)  %>% 
        filter(Organic=="Y", Meat=="Y") %>% 
        count(State) %>% 
        #ungroup() %>% #if this line is commented,the class function will show "grouped_df", 
                      #but there is no difference in the results shown in the print displayed.
        arrange(desc(n))
class(data1)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
print.data.frame(data1) # The function "print.data.frame" from base R, to show all rows.
##                   State   n
## 1            California 189
## 2              New York 147
## 3         Massachusetts  79
## 4            Washington  67
## 5             Wisconsin  67
## 6              Illinois  61
## 7        North Carolina  56
## 8                Oregon  56
## 9               Florida  55
## 10             Michigan  54
## 11                 Ohio  51
## 12              Vermont  47
## 13              Georgia  46
## 14         Pennsylvania  46
## 15             Virginia  46
## 16             Colorado  43
## 17                Maine  43
## 18        New Hampshire  43
## 19                Texas  36
## 20              Indiana  35
## 21             Maryland  35
## 22            Tennessee  32
## 23           New Jersey  31
## 24            Minnesota  30
## 25          Connecticut  26
## 26             Kentucky  25
## 27 District of Columbia  24
## 28                 Iowa  24
## 29             Missouri  19
## 30             Arkansas  15
## 31              Arizona  13
## 32                Idaho  13
## 33       South Carolina  12
## 34               Kansas  11
## 35              Montana  11
## 36               Hawaii  10
## 37        West Virginia  10
## 38             Nebraska   9
## 39           New Mexico   9
## 40         Rhode Island   9
## 41             Delaware   7
## 42            Louisiana   7
## 43               Nevada   7
## 44         South Dakota   7
## 45          Mississippi   6
## 46             Oklahoma   6
## 47              Alabama   5
## 48              Wyoming   5
## 49                 Utah   4
## 50               Alaska   2
## 51         North Dakota   1

Using the long format data:

data2 <- long_df %>% 
        select(FMID,State, Items,Sell_Item)  %>% 
        filter(Items=="Organic"| Items=="Meat") 
kable(head(data2))%>% kable_styling(bootstrap_options = "striped", full_width = F)
FMID State Items Sell_Item
1018261 Vermont Organic Y
1018318 Ohio Organic NA
1009364 South Carolina Organic NA
1010691 Missouri Organic NA
1002454 New York Organic NA
1011100 Tennessee Organic Y
data2 <-data2 %>% group_by(State) %>% 
        spread(Items,Sell_Item) %>%
        filter(Organic=="Y",Meat=="Y") %>% 
        count(State) %>% 
        arrange(desc(n))
print(data2, n=15) # The function "print" from base R, to show n rows.
## # A tibble: 51 x 2
## # Groups:   State [51]
##    State              n
##    <fct>          <int>
##  1 California       189
##  2 New York         147
##  3 Massachusetts     79
##  4 Washington        67
##  5 Wisconsin         67
##  6 Illinois          61
##  7 North Carolina    56
##  8 Oregon            56
##  9 Florida           55
## 10 Michigan          54
## 11 Ohio              51
## 12 Vermont           47
## 13 Georgia           46
## 14 Pennsylvania      46
## 15 Virginia          46
## # ... with 36 more rows
Conclusion:

California and New York are the top states in raw number of farmers market that sells both organic and meat products. An interesting, analysis is to check the proportion to the state’s population.

Analysis 2

Using dplyr only:

We can filter the information needed and and count the number of times a particular produce or product appears in the data frame.

df_items <- long_df %>% select(Items, Sell_Item)
freq_items <- df_items %>% group_by(Items) %>% 
        filter(Sell_Item=="Y") %>% 
        count(Sell_Item) %>% 
        arrange(desc(n))
kable(freq_items) %>% kable_styling(bootstrap_options = "striped", full_width = F)
Items Sell_Item n
Vegetables Y 5698
Bakedgoods Y 5172
Fruits Y 4799
Honey Y 4768
Jams Y 4735
Herbs Y 4565
Eggs Y 4400
Flowers Y 4002
Soap Y 3983
Plants Y 3830
Crafts Y 3636
Prepared Y 3585
Meat Y 3230
Cheese Y 2878
Poultry Y 2654
Organic Y 2385
Coffee Y 2090
Maple Y 1892
Nuts Y 1737
Trees Y 1638
Seafood Y 1452
Juices Y 1442
Mushrooms Y 1395
PetFood Y 1127
Wine Y 1017
Beans Y 861
WildHarvested Y 855
Grains Y 848
Nursery Y 324
Tofu Y 232

Using table() and dplyr:

Alternately, we can create a frequency table to count the number of markets that sell a particular type of produce or product.

This seems a less efficient approach because we have to perform more transformations to the data.

 factor(df_items)
##     Items Sell_Item 
##      <NA>      <NA> 
## 2 Levels: c("Organic", "Organic", "Organic", "Bakedgoods", "Bakedgoods", "Cheese", "Cheese", "Crafts", "Crafts", "Flowers", "Flowers", "Eggs", "Eggs", "Seafood", "Seafood", "Herbs", "Herbs", "Vegetables", "Vegetables", "Honey", "Honey", "Jams", "Jams", "Maple", "Maple", "Meat", "Meat", "Nursery", "Nursery", "Nuts", "Nuts", "Plants", "Plants", "Poultry", "Poultry", "Prepared", "Prepared", "Soap", "Soap", "Trees", "Trees", "Wine", "Wine", "Coffee", "Coffee", "Beans", "Beans", "Fruits", "Fruits", "Grains", "Grains", \n"Juices", "Juices", "Mushrooms", "Mushrooms", "PetFood", "PetFood", "Tofu", "Tofu", "WildHarvested", "WildHarvested") ...
 tb <- table(df_items) 
 kable(head(tb))%>% kable_styling(bootstrap_options = "striped", full_width = F)
N Y
Bakedgoods 3563 5172
Beans 7874 861
Cheese 5857 2878
Coffee 6645 2090
Crafts 5099 3636
Eggs 4335 4400
 tb <- as.data.frame(tb)
 str(tb)
## 'data.frame':    60 obs. of  3 variables:
##  $ Items    : Factor w/ 30 levels "Bakedgoods","Beans",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Sell_Item: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq     : int  3563 7874 5857 6645 5099 4335 4733 3936 7887 4170 ...
 tb <- tb %>% filter(tb$Sell_Item=="Y") %>% select(Items, Freq) %>% 
         arrange(desc(Freq))
 kable(tb) %>% kable_styling(bootstrap_options = "striped", full_width = F)
Items Freq
Vegetables 5698
Bakedgoods 5172
Fruits 4799
Honey 4768
Jams 4735
Herbs 4565
Eggs 4400
Flowers 4002
Soap 3983
Plants 3830
Crafts 3636
Prepared 3585
Meat 3230
Cheese 2878
Poultry 2654
Organic 2385
Coffee 2090
Maple 1892
Nuts 1737
Trees 1638
Seafood 1452
Juices 1442
Mushrooms 1395
PetFood 1127
Wine 1017
Beans 861
WildHarvested 855
Grains 848
Nursery 324
Tofu 232

Displaying the data

In addition to a frequency table, we can also graph a bar plot of the number of markets that sell a particular type of produce (or product).

 barplot(tb$Freq, main = "U.S. Farmers Market",
 ylab = "Frequency",
 names.arg = tb$Items,
 las=2,
 col = "blue")

Conclusion:

Data shows that most of the markets sell vegetables and baked goods.