I am pulling down data from USDA about the prices of fruits and vegetables. This data comes in an extremely messy format, with a spreadsheet for each fruit or vegetable. I pulled each spreadsheet down from the website and combined them into a dataset. I decided for this assignment to limit my analysis to the solid foods and leave the juice for another time. I analyzed the price by categories such as “dried, fresh, etc.” while controlling for the specific item.

Start by pulling HTML from website to get a list of the csv links.

library(rvest)
library(stringr)
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)

webpage <- read_html('https://www.ers.usda.gov/data-products/fruit-and-vegetable-prices/')
webtable <- html_nodes(webpage, "table td")
links <- paste0('https://www.ers.usda.gov/',unlist(str_extract_all(webtable,'webdocs.+csv')))
unique_links <- unique(links)

I’m going to start with lists because they are a bit for flexible in terms of variable numbers of rows, etc. The spreasheets all have footnotes that give more detail. I will store those for reference, but won’t dig into them for this analysis.
I’m also going to store the individual headers for reference, but not keep them with the dataset.

price_list <- list() #This list will store the pricing data
footnotes <- list()  #This will store the footnotes
headers <- list()    #This will store the headers
j = 1
k = 1
maxlength=0
for (u in 1:length(unique_links)) {
  fruit_data <- read_delim(unique_links[u],",")
  
  fruit_name = str_extract(unique_links[u],'/[^/]*\\.csv')
  fruit_name = str_replace_all(str_replace_all(str_replace_all(fruit_name,'/',''),'.csv',''),'_',' ')
  
  #Pull out first two rows as header, and save in case we need them
  header <- unname(apply(fruit_data, 2, function(x) paste(x[1],x[2])))
  headers[[u]] = header
  
  fruit_data = fruit_data[-c(1,2),]

  #loop through data set
  for (i in 1:nrow(fruit_data)) {
    row <- fruit_data[i,]
    
    if (sum(is.na(row)) == ncol(row)) {
      #For completely blank rows - don't do anything
    } else if(!is.na(unlist(str_extract(row[1],'^\\d+')))) {
      #rows that start with a number are assumed to be footnotes. 
      #Put them in their own list
      footnotes[[j]] <- as.list(c(fruit_name,
                          unlist(str_extract(row[1],'^\\d+')),
                          unlist(str_replace(row[1],'^\\d+',''))
                          ))
      j = j+1
    } else if(!is.na(unlist(str_extract(row[1],'^Source')))) {
      #rows that start with a number are assumed to be footnotes. 
      #Put them in their own list
      footnotes[[j]] <- as.list(c(fruit_name,
                                  'Source',
                                  unlist(str_replace(row[1],'^Source: ',''))
      ))
      j = j+1
    }else {
      #The rest go into our price data list
      price_list[[k]] <- c(fruit_name,unname(as.list(row)))
      k = k+1
      len = length(unname(as.list(row))) +1
      if (len > maxlength) {
        maxlength = len
      }
    }
  }
}

I now have my list of lists. Since there are more columns in some of the lists, I will create a data frame with the maximum number of columns, and fill NA’s for ones with fewer columns.

price_data = data.frame(matrix(NA, nrow = 0, ncol = maxlength))

for(r in 1:length(price_list)) {
  row = price_list[[r]]
  rowcount = length(row)
  
  while (rowcount < 10) {
    row = c(row,NA)
    rowcount = rowcount + 1
  }
  
  row_df = as.data.frame(row)
  
  names(row_df) <- c('X1','X2','X3','X4','X5','X6','X7','X8','X9','X10')
  
  price_data = rbind(price_data, row_df)
  
}
head(price_data)
##         X1              X2    X3         X4   X5    X6     X7    X8 X9 X10
## 1   apples          Fresh1 $1.62  per pound  0.9 0.243 pounds $0.44 NA  NA
## 2   apples     Applesauce2 $1.05  per pound    1 0.540 pounds $0.57 NA  NA
## 3   apples           Juice  <NA>       <NA> <NA>  <NA>   <NA>  <NA> NA  NA
## 4   apples Ready to drink3 $0.63   per pint    1     8  fl oz $0.32 NA  NA
## 5   apples         Frozen4 $0.51   per pint    1     8  fl oz $0.26 NA  NA
## 6 apricots          Fresh1 $3.09  per pound 0.93 0.364 pounds $1.21 NA  NA

I now have my data frame, with generic column names.
Column 9 and 10 look completely blank, let’s make sure that’s the case

nrow(price_data)
## [1] 176
sum(is.na(price_data[9]))
## [1] 176
sum(is.na(price_data[10]))
## [1] 176

They are blank, let’s remove

price_data <- price_data[-c(9,10)]

Next, let’s remove the footnotes pointers

price_data['X2'] = apply(price_data,1, function(x) str_replace_all(x['X2'],'\\d*,*\\d+ *$',''))
price_data = drop_na(price_data, 'X3')

Instead of dealing with separate units, I will separate into two datasets.

price_data_juice <- price_data %>%
  filter(str_detect(X4,'pint'))

price_data <- price_data %>%
  filter(str_detect(X4,'pint')==FALSE)

From here on out I’m just going to work with the solid food dataset and leave the juice for another day. I believe columns 4 and 7 are just the names of the units. Now that we have separated, these should be superfluous.
Check to make sure they are the same (pounds for solid food, fl oz for juice).

table(price_data[4])
## 
##  per pound   per pint  per pound 
##        144          0          2
table(price_data[7])
## 
## pounds  fl oz  pound 
##    144      0      2
price_data = price_data[-c(4,7)]

Let’s get rid of extra spaces and dollar signs, then convert the numeric data to numeric columns.

clean_up_data <- function(row) {
  new_row = sapply(row, function(x) str_replace(str_trim(x),'\\$',''))
  return(new_row)
}

price_data_2 = as.data.frame(apply(price_data, 2, clean_up_data))

price_data_3 <- as.data.frame(lapply(price_data_2, function(col) {
  if (suppressWarnings(all(!is.na(as.numeric(as.character(col)))))) {
    as.numeric(as.character(col))
  } else {
    col
  }
}))
head(price_data_3)
##         X1                               X2   X3   X5    X6   X8
## 1   apples                            Fresh 1.62 0.90 0.243 0.44
## 2   apples                       Applesauce 1.05 1.00 0.540 0.57
## 3 apricots                            Fresh 3.09 0.93 0.364 1.21
## 4 apricots                  Packed in juice 1.47 1.00 0.540 0.80
## 5 apricots Packed in syrup, syrup discarded 1.86 0.65 0.441 1.26
## 6 apricots                            Dried 7.33 1.00 0.143 1.05

The intermediate columns are just to convert price “retail price”" to “price per cup”. We’re looking at price and so can remove these. Let’s just check quick that there isn’t some information here we’d be missing out on.

price_data_3['Check'] =  round(price_data_3['X3']*price_data_3['X6']/price_data_3['X5'],2)

price_data_3[price_data_3['X8']!=price_data_3['Check'],]
##                   X1                       X2   X3    X5    X6   X8 Check
## 4           apricots          Packed in juice 1.47 1.000 0.540 0.80  0.79
## 18              figs                    Dried 6.13 0.960 0.165 1.06  1.05
## 20    fruit cocktail Packed in syrup or water 1.26 0.650 0.441 0.86  0.85
## 24          honeydew                    Fresh 0.83 0.460 0.375 0.67  0.68
## 35           peaches                   Frozen 3.19 1.000 0.331 1.05  1.06
## 39         pineapple                    Fresh 0.65 0.510 0.364 0.47  0.46
## 42         pineapple                    Dried 5.87 1.000 0.154 0.91  0.90
## 44             plums           Dried (Prunes) 4.73 1.000 0.187 0.89  0.88
## 56         asparagus                    Fresh 3.08 0.494 0.397 2.47  2.48
## 68  brussels sprouts                    Fresh 2.96 1.060 0.342 0.95  0.96
## 111 mixed vegetables                   Canned 1.12 0.650 0.334 0.57  0.58
## 117   mustard greens                    Fresh 2.68 0.840 0.309 0.98  0.99
## 122             okra                    Fresh 3.82 0.769 0.350 1.75  1.74
## 124           olives                   Canned 5.09 1.000 0.298 1.51  1.52
## 127      pinto beans                    Dried 1.09 2.399 0.386 0.17  0.18
## 131          pumpkin                   Canned 1.38 1.000 0.540 0.74  0.75
## 134          spinach                   Boiled 3.83 0.770 0.331 1.64  1.65
## 145    turnip greens                   Canned 0.78 0.650 0.353 0.43  0.42

You can see that the “check” column is just a combination with the other two columns. The 18 where they are different look like rounding errors. We are safe to remove these and just keep our prices.

price_data_4 = price_data_3[-c(4,5,7)]

Let’s fix our column names and see how it looks.

names(price_data_4) = c('Food','Form','Retail_Price','Price_Per_Cup')
str(price_data_4)
## 'data.frame':    146 obs. of  4 variables:
##  $ Food         : Factor w/ 73 levels "acorn squash",..: 2 2 3 3 3 3 7 9 11 11 ...
##  $ Form         : Factor w/ 33 levels "Applesauce","Boiled",..: 9 1 9 21 23 6 9 14 9 14 ...
##  $ Retail_Price : num  1.62 1.05 3.09 1.47 1.86 7.33 0.55 3.64 5.66 3.69 ...
##  $ Price_Per_Cup: num  0.44 0.57 1.21 0.8 1.26 1.05 0.28 1.2 1.89 1.22 ...
sort(table(price_data_4['Form']),decreasing=T)
## 
##                            Fresh                           Canned 
##                               45                               24 
##                           Frozen                            Dried 
##                               24                               15 
##                  Packed in juice         Packed in syrup or water 
##                                5                                4 
##                          Florets                            Heads 
##                                2                                2 
##                       Applesauce                           Boiled 
##                                1                                1 
## Canned, packed in syrup or water                     Cooked whole 
##                                1                                1 
##                   Dried (Prunes)              Fresh green cabbage 
##                                1                                1 
##                Fresh red cabbage        Fresh, consumed with peel 
##                                1                                1 
##                    Fresh, peeled              Frozen french fries 
##                                1                                1 
##                       Full Heads                 Grape and cherry 
##                                1                                1 
##                           Hearts                      Large round 
##                                1                                1 
## Packed in syrup, syrup discarded                          Raisins 
##                                1                                1 
##                              Raw                         Raw baby 
##                                1                                1 
##                        Raw whole                    Roma and plum 
##                                1                                1 
##                       Sauerkraut                           Sliced 
##                                1                                1 
##                           Sticks                  Trimmed bunches 
##                                1                                1 
##                            Whole 
##                                1

I think it would be best to take some of the specific labels and put them in to more broad categories for analysis. I’ll add a new column for this so we can refer back to the given data.
I don’t think there is any elegant way to do this, I’ll need to just list out the new categories.

categorize = function(x){
  switch(x,
  'Packed in juice' = 'Packed',
  'Packed in syrup or water' = 'Packed',
  'Florets' = 'Fresh',
  'Heads' = 'Fresh',
  'Applesauce' = 'Other',
  'Boiled' = 'Other',
  'Canned, packed in syrup or water' = 'Packed',
  'Cooked whole' = 'Other',
  'Dried (Prunes)' = 'Dried',
  'Fresh green cabbage' = 'Fresh',
  'Fresh red cabbage' = 'Fresh',
  'Fresh, consumed with peel' = 'Fresh',
  'Fresh, peeled' = 'Other',
  'Frozen french fries' = 'Frozen',
  'Full Heads' = 'Fresh',
  'Grape and cherry' = 'Fresh',
  'Hearts' = 'Other',
  'Large round' = 'Fresh',
  'Packed in syrup, syrup discarded' = 'Packed',
  'Raisins' = 'Dried',
  'Raw' = 'Fresh',
  'Raw baby' = 'Fresh',
  'Raw whole' = 'Fresh',
  'Roma and plum' = 'Fresh',
  'Sauerkraut' = 'Other',
  'Sliced' = 'Other',
  'Sticks' = 'Other',
  'Trimmed bunches' = 'Other',
  'Whole' = 'Fresh',
  x
  )
}

price_data_4['Category'] = apply(price_data_4['Form'],1, categorize)
table(price_data_4['Category'])
## 
## Canned  Dried  Fresh Frozen  Other Packed 
##     24     17     60     25      9     11
boxplot(Price_Per_Cup ~ Category, data=price_data_4)

We are down to 6 categories, all with at least a few members. From the boxplot we can see that “Packed” appears the most expensive per cup.
Like the “airline delays” example, it’s possible we could run into Simpson’s Paradox here. Because there are so many different foods, pulling them together into one chart is tough to look at. Instead I’ll take all the two-way combinations, and then pull any fruit that has a record in both of those categories, and chart that.

compare_two = function(cat1,cat2) {
  #find all foods that have an entry in BOTH lists
  a <-filter(price_data_4,Category==cat1)['Food']
  b<- filter(price_data_4,Category==cat2)['Food']
  intersection = intersect(a,b)
  intersect_list <- as.character(unlist(intersection[,1]))
  intersect_rows = price_data_4[price_data_4$Food %in% intersect_list,]
  intersect_rows = filter(intersect_rows, Category==cat1 | Category==cat2)
  if (nrow(intersection) > 1) {
  ggplot(filter(intersect_rows, str_detect(Food,'^[a-e].*')), aes(Food, Price_Per_Cup, fill=Category)) + 
    geom_bar(position="dodge", stat="identity")
  }
}

combinations <- as.data.frame(t(combn(unique(price_data_4$Category),2)))

apply(combinations, 1, function(x) compare_two(x[1],x[2]))
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]

## 
## [[9]]

## 
## [[10]]

## 
## [[11]]
## NULL
## 
## [[12]]
## NULL
## 
## [[13]]
## NULL
## 
## [[14]]

## 
## [[15]]

There’s a lot to unpack (no pun intended) here. It does look like Fresh is generally more costly than Frozen or Canned, and that Frozen is more costly than canned.

A simpler way to look at the relative effects is to throw both food and category into a linear model and look at the coefficients. R will automatically turn our categorical variables into dummy variables.

model <- glm(Price_Per_Cup ~Category + Food,data=price_data_4)
summary(model)
## 
## Call:
## glm(formula = Price_Per_Cup ~ Category + Food, data = price_data_4)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.55942  -0.08801   0.00000   0.08866   0.97089  
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.86537    0.32198   2.688 0.009039 ** 
## CategoryDried             0.05914    0.12654   0.467 0.641732    
## CategoryFresh             0.23463    0.10047   2.335 0.022481 *  
## CategoryFrozen            0.06194    0.10292   0.602 0.549266    
## CategoryOther             0.44353    0.17217   2.576 0.012168 *  
## CategoryPacked            0.45579    0.17437   2.614 0.011011 *  
## Foodapples               -0.69945    0.38208  -1.831 0.071536 .  
## Foodapricots             -0.08671    0.35344  -0.245 0.806942    
## Foodartichoke             1.29244    0.35716   3.619 0.000565 ***
## Foodasparagus             1.18911    0.35716   3.329 0.001409 ** 
## Foodavocados             -0.14000    0.43261  -0.324 0.747223    
## Foodbananas              -0.82000    0.43261  -1.895 0.062283 .  
## Foodbeets                -0.32537    0.44413  -0.733 0.466315    
## Foodberries mixed         0.27269    0.44149   0.618 0.538871    
## Foodblack beans          -0.49494    0.38661  -1.280 0.204817    
## Foodblackberries          0.54134    0.37723   1.435 0.155862    
## Foodblackeye peas        -0.50494    0.38661  -1.306 0.195923    
## Foodblueberries           0.30134    0.37723   0.799 0.427175    
## Foodbroccoli             -0.25577    0.35445  -0.722 0.473010    
## Foodbrussels sprouts     -0.17366    0.37723  -0.460 0.646741    
## Foodbutternut squash     -0.28000    0.43261  -0.647 0.519662    
## Foodcabbage              -0.75630    0.35674  -2.120 0.037658 *  
## Foodcantaloupe           -0.72000    0.43261  -1.664 0.100653    
## Foodcarrots              -0.69032    0.33825  -2.041 0.045148 *  
## Foodcauliflower          -0.43244    0.35445  -1.220 0.226665    
## Foodcelery               -0.82890    0.40354  -2.054 0.043814 *  
## Foodcherries              0.64942    0.38246   1.698 0.094075 .  
## Foodcollard greens       -0.36756    0.35716  -1.029 0.307064    
## Foodcorn sweet            0.12911    0.35716   0.361 0.718860    
## Foodcranberries          -0.34451    0.45085  -0.764 0.447426    
## Foodcucumbers            -0.80445    0.38208  -2.105 0.038948 *  
## Fooddates                -0.01451    0.45085  -0.032 0.974417    
## Foodfigs                  0.13549    0.45085   0.301 0.764698    
## Foodfruit cocktail       -0.51116    0.40496  -1.262 0.211173    
## Foodgrapefruit           -0.15000    0.43261  -0.347 0.729865    
## Foodgrapes               -0.33226    0.37999  -0.874 0.384990    
## Foodgreat northern beans -0.50494    0.38661  -1.306 0.195923    
## Foodgreen beans          -0.42089    0.35716  -1.178 0.242720    
## Foodgreen peas           -0.29634    0.38294  -0.774 0.441700    
## Foodgreen peppers        -0.62000    0.43261  -1.433 0.156396    
## Foodhoneydew             -0.43000    0.43261  -0.994 0.323766    
## Foodkale                 -0.19866    0.37723  -0.527 0.600177    
## Foodkidney beans         -0.50494    0.38661  -1.306 0.195923    
## Foodkiwi                  0.01000    0.43261   0.023 0.981626    
## Foodlentils              -0.70451    0.45085  -1.563 0.122780    
## Foodlettuce iceberg      -0.82000    0.43261  -1.895 0.062283 .  
## Foodlettuce romaine      -0.72445    0.38208  -1.896 0.062200 .  
## Foodlima beans           -0.31906    0.36192  -0.882 0.381101    
## Foodmangoes              -0.03726    0.37999  -0.098 0.922186    
## Foodmixed vegetables     -0.32134    0.35107  -0.915 0.363260    
## Foodmushrooms            -0.62945    0.38208  -1.647 0.104083    
## Foodmustard greens       -0.26756    0.35716  -0.749 0.456355    
## Foodnavy beans           -0.49494    0.38661  -1.280 0.204817    
## Foodnectarines           -0.44000    0.43261  -1.017 0.312723    
## Foodokra                  0.18634    0.37723   0.494 0.622917    
## Foodolives                0.64463    0.44413   1.451 0.151251    
## Foodonions               -0.69000    0.43261  -1.595 0.115359    
## Foodoranges              -0.44000    0.43261  -1.017 0.312723    
## Foodpapaya               -0.28726    0.37999  -0.756 0.452287    
## Foodpeaches              -0.22991    0.35176  -0.654 0.515572    
## Foodpears                -0.36411    0.36779  -0.990 0.325693    
## Foodpineapple            -0.41921    0.35344  -1.186 0.239718    
## Foodpinto beans          -0.56994    0.38661  -1.474 0.145037    
## Foodplums                -0.18226    0.37999  -0.480 0.633028    
## Foodpomegranate           0.17000    0.43261   0.393 0.695578    
## Foodpotatoes             -0.56089    0.35716  -1.570 0.120953    
## Foodpumpkin              -0.12537    0.44413  -0.282 0.778582    
## Foodradish               -0.65000    0.43261  -1.502 0.137599    
## Foodraspberries           0.92134    0.37723   2.442 0.017196 *  
## Foodred peppers          -0.35000    0.43261  -0.809 0.421315    
## Foodspinach              -0.11790    0.34682  -0.340 0.734950    
## Foodstrawberries         -0.15866    0.37723  -0.421 0.675390    
## Foodsummer squash        -0.25000    0.43261  -0.578 0.565251    
## Foodsweet potatoes       -0.53000    0.43261  -1.225 0.224759    
## Foodtangerines           -0.28000    0.43261  -0.647 0.519662    
## Foodtomatoes             -0.22134    0.34293  -0.645 0.520812    
## Foodturnip greens        -0.22423    0.35716  -0.628 0.532228    
## Foodwatermelon           -0.90000    0.43261  -2.080 0.041261 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09357697)
## 
##     Null deviance: 36.2113  on 145  degrees of freedom
## Residual deviance:  6.3632  on  68  degrees of freedom
## AIC: 114.9
## 
## Number of Fisher Scoring iterations: 2
coef <- model$coefficients
pvalues <- coef(summary(model))[,4]
coef_cat <- coef[str_detect(names(coef),"Category.+")]

Since we’re looking for the effects of the categoriesk I’ll pull just the coefficients that point to our categories.
The model automatically removed Canned since it is a linear combination of the other categories. We can interpret this as having a coefficient of 0.

list('Coefficients' = coef_cat,
'P-values' = pvalues[str_detect(names(pvalues),"Category.+")])
## $Coefficients
##  CategoryDried  CategoryFresh CategoryFrozen  CategoryOther CategoryPacked 
##     0.05914107     0.23462952     0.06194211     0.44353246     0.45578980 
## 
## $`P-values`
##  CategoryDried  CategoryFresh CategoryFrozen  CategoryOther CategoryPacked 
##     0.64173240     0.02248140     0.54926552     0.01216775     0.01101121

Because Dried and Frozen have such high P-values, it’s best to think of them as having no effect. The other three have P-values around 0.01-0.02, and so we can interpret the coefficient as their effect on pricing. I think “Fresh” is the best category to use as a baseline, so let’s report the rest relative to Fresh.

names(coef_cat) <- str_replace(names(coef_cat),'Category','')

coef_cat = c(coef_cat,'Canned' = 0)
coef_cat['Dried'] = 0
coef_cat['Frozen'] = 0
coef_cat = coef_cat - coef_cat['Fresh']
sort(round(coef_cat,3))
##  Dried Frozen Canned  Fresh  Other Packed 
## -0.235 -0.235 -0.235  0.000  0.209  0.221

Dried, Frozen and Canned are around 24 cents less per cup than fresh. Frozen and canned being less expensive than fresh matches with what we saw in the charts. It’s interesting though that Frozen and Canned aren’t shown to be any different, when every fruit that had a form in both frozen and canned had a higher price for frozen. The foods that had both were a fairly small sample, so it’s not that surprising our chart of 5 foods doesn’t match with the overall effect.

Packed (which includes packed in juice, packed in syrup, etc) is around 22 cents more per cup.
Other (which is a combination of lots of things) Are about 21 cents more per cup.
I wouldn’t put much stock into that last one, since it was really just a catch-all. Reminder of which forms got put into “Other:

price_data_4['Form'][price_data_4['Category']=='Other']
## [1] "Applesauce"      "Sauerkraut"      "Cooked whole"    "Trimmed bunches"
## [5] "Sticks"          "Fresh, peeled"   "Hearts"          "Sliced"         
## [9] "Boiled"