I am pulling down data from USDA about the prices of fruits and vegetables. This data comes in an extremely messy format, with a spreadsheet for each fruit or vegetable. I pulled each spreadsheet down from the website and combined them into a dataset. I decided for this assignment to limit my analysis to the solid foods and leave the juice for another time. I analyzed the price by categories such as “dried, fresh, etc.” while controlling for the specific item.
Start by pulling HTML from website to get a list of the csv links.
library(rvest)
library(stringr)
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
webpage <- read_html('https://www.ers.usda.gov/data-products/fruit-and-vegetable-prices/')
webtable <- html_nodes(webpage, "table td")
links <- paste0('https://www.ers.usda.gov/',unlist(str_extract_all(webtable,'webdocs.+csv')))
unique_links <- unique(links)
I’m going to start with lists because they are a bit for flexible in terms of variable numbers of rows, etc. The spreasheets all have footnotes that give more detail. I will store those for reference, but won’t dig into them for this analysis.
I’m also going to store the individual headers for reference, but not keep them with the dataset.
price_list <- list() #This list will store the pricing data
footnotes <- list() #This will store the footnotes
headers <- list() #This will store the headers
j = 1
k = 1
maxlength=0
for (u in 1:length(unique_links)) {
fruit_data <- read_delim(unique_links[u],",")
fruit_name = str_extract(unique_links[u],'/[^/]*\\.csv')
fruit_name = str_replace_all(str_replace_all(str_replace_all(fruit_name,'/',''),'.csv',''),'_',' ')
#Pull out first two rows as header, and save in case we need them
header <- unname(apply(fruit_data, 2, function(x) paste(x[1],x[2])))
headers[[u]] = header
fruit_data = fruit_data[-c(1,2),]
#loop through data set
for (i in 1:nrow(fruit_data)) {
row <- fruit_data[i,]
if (sum(is.na(row)) == ncol(row)) {
#For completely blank rows - don't do anything
} else if(!is.na(unlist(str_extract(row[1],'^\\d+')))) {
#rows that start with a number are assumed to be footnotes.
#Put them in their own list
footnotes[[j]] <- as.list(c(fruit_name,
unlist(str_extract(row[1],'^\\d+')),
unlist(str_replace(row[1],'^\\d+',''))
))
j = j+1
} else if(!is.na(unlist(str_extract(row[1],'^Source')))) {
#rows that start with a number are assumed to be footnotes.
#Put them in their own list
footnotes[[j]] <- as.list(c(fruit_name,
'Source',
unlist(str_replace(row[1],'^Source: ',''))
))
j = j+1
}else {
#The rest go into our price data list
price_list[[k]] <- c(fruit_name,unname(as.list(row)))
k = k+1
len = length(unname(as.list(row))) +1
if (len > maxlength) {
maxlength = len
}
}
}
}
I now have my list of lists. Since there are more columns in some of the lists, I will create a data frame with the maximum number of columns, and fill NA’s for ones with fewer columns.
price_data = data.frame(matrix(NA, nrow = 0, ncol = maxlength))
for(r in 1:length(price_list)) {
row = price_list[[r]]
rowcount = length(row)
while (rowcount < 10) {
row = c(row,NA)
rowcount = rowcount + 1
}
row_df = as.data.frame(row)
names(row_df) <- c('X1','X2','X3','X4','X5','X6','X7','X8','X9','X10')
price_data = rbind(price_data, row_df)
}
head(price_data)
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 apples Fresh1 $1.62 per pound 0.9 0.243 pounds $0.44 NA NA
## 2 apples Applesauce2 $1.05 per pound 1 0.540 pounds $0.57 NA NA
## 3 apples Juice <NA> <NA> <NA> <NA> <NA> <NA> NA NA
## 4 apples Ready to drink3 $0.63 per pint 1 8 fl oz $0.32 NA NA
## 5 apples Frozen4 $0.51 per pint 1 8 fl oz $0.26 NA NA
## 6 apricots Fresh1 $3.09 per pound 0.93 0.364 pounds $1.21 NA NA
I now have my data frame, with generic column names.
Column 9 and 10 look completely blank, let’s make sure that’s the case
nrow(price_data)
## [1] 176
sum(is.na(price_data[9]))
## [1] 176
sum(is.na(price_data[10]))
## [1] 176
They are blank, let’s remove
price_data <- price_data[-c(9,10)]
Next, let’s remove the footnotes pointers
price_data['X2'] = apply(price_data,1, function(x) str_replace_all(x['X2'],'\\d*,*\\d+ *$',''))
price_data = drop_na(price_data, 'X3')
Instead of dealing with separate units, I will separate into two datasets.
price_data_juice <- price_data %>%
filter(str_detect(X4,'pint'))
price_data <- price_data %>%
filter(str_detect(X4,'pint')==FALSE)
From here on out I’m just going to work with the solid food dataset and leave the juice for another day. I believe columns 4 and 7 are just the names of the units. Now that we have separated, these should be superfluous.
Check to make sure they are the same (pounds for solid food, fl oz for juice).
table(price_data[4])
##
## per pound per pint per pound
## 144 0 2
table(price_data[7])
##
## pounds fl oz pound
## 144 0 2
price_data = price_data[-c(4,7)]
Let’s get rid of extra spaces and dollar signs, then convert the numeric data to numeric columns.
clean_up_data <- function(row) {
new_row = sapply(row, function(x) str_replace(str_trim(x),'\\$',''))
return(new_row)
}
price_data_2 = as.data.frame(apply(price_data, 2, clean_up_data))
price_data_3 <- as.data.frame(lapply(price_data_2, function(col) {
if (suppressWarnings(all(!is.na(as.numeric(as.character(col)))))) {
as.numeric(as.character(col))
} else {
col
}
}))
head(price_data_3)
## X1 X2 X3 X5 X6 X8
## 1 apples Fresh 1.62 0.90 0.243 0.44
## 2 apples Applesauce 1.05 1.00 0.540 0.57
## 3 apricots Fresh 3.09 0.93 0.364 1.21
## 4 apricots Packed in juice 1.47 1.00 0.540 0.80
## 5 apricots Packed in syrup, syrup discarded 1.86 0.65 0.441 1.26
## 6 apricots Dried 7.33 1.00 0.143 1.05
The intermediate columns are just to convert price “retail price”" to “price per cup”. We’re looking at price and so can remove these. Let’s just check quick that there isn’t some information here we’d be missing out on.
price_data_3['Check'] = round(price_data_3['X3']*price_data_3['X6']/price_data_3['X5'],2)
price_data_3[price_data_3['X8']!=price_data_3['Check'],]
## X1 X2 X3 X5 X6 X8 Check
## 4 apricots Packed in juice 1.47 1.000 0.540 0.80 0.79
## 18 figs Dried 6.13 0.960 0.165 1.06 1.05
## 20 fruit cocktail Packed in syrup or water 1.26 0.650 0.441 0.86 0.85
## 24 honeydew Fresh 0.83 0.460 0.375 0.67 0.68
## 35 peaches Frozen 3.19 1.000 0.331 1.05 1.06
## 39 pineapple Fresh 0.65 0.510 0.364 0.47 0.46
## 42 pineapple Dried 5.87 1.000 0.154 0.91 0.90
## 44 plums Dried (Prunes) 4.73 1.000 0.187 0.89 0.88
## 56 asparagus Fresh 3.08 0.494 0.397 2.47 2.48
## 68 brussels sprouts Fresh 2.96 1.060 0.342 0.95 0.96
## 111 mixed vegetables Canned 1.12 0.650 0.334 0.57 0.58
## 117 mustard greens Fresh 2.68 0.840 0.309 0.98 0.99
## 122 okra Fresh 3.82 0.769 0.350 1.75 1.74
## 124 olives Canned 5.09 1.000 0.298 1.51 1.52
## 127 pinto beans Dried 1.09 2.399 0.386 0.17 0.18
## 131 pumpkin Canned 1.38 1.000 0.540 0.74 0.75
## 134 spinach Boiled 3.83 0.770 0.331 1.64 1.65
## 145 turnip greens Canned 0.78 0.650 0.353 0.43 0.42
You can see that the “check” column is just a combination with the other two columns. The 18 where they are different look like rounding errors. We are safe to remove these and just keep our prices.
price_data_4 = price_data_3[-c(4,5,7)]
Let’s fix our column names and see how it looks.
names(price_data_4) = c('Food','Form','Retail_Price','Price_Per_Cup')
str(price_data_4)
## 'data.frame': 146 obs. of 4 variables:
## $ Food : Factor w/ 73 levels "acorn squash",..: 2 2 3 3 3 3 7 9 11 11 ...
## $ Form : Factor w/ 33 levels "Applesauce","Boiled",..: 9 1 9 21 23 6 9 14 9 14 ...
## $ Retail_Price : num 1.62 1.05 3.09 1.47 1.86 7.33 0.55 3.64 5.66 3.69 ...
## $ Price_Per_Cup: num 0.44 0.57 1.21 0.8 1.26 1.05 0.28 1.2 1.89 1.22 ...
sort(table(price_data_4['Form']),decreasing=T)
##
## Fresh Canned
## 45 24
## Frozen Dried
## 24 15
## Packed in juice Packed in syrup or water
## 5 4
## Florets Heads
## 2 2
## Applesauce Boiled
## 1 1
## Canned, packed in syrup or water Cooked whole
## 1 1
## Dried (Prunes) Fresh green cabbage
## 1 1
## Fresh red cabbage Fresh, consumed with peel
## 1 1
## Fresh, peeled Frozen french fries
## 1 1
## Full Heads Grape and cherry
## 1 1
## Hearts Large round
## 1 1
## Packed in syrup, syrup discarded Raisins
## 1 1
## Raw Raw baby
## 1 1
## Raw whole Roma and plum
## 1 1
## Sauerkraut Sliced
## 1 1
## Sticks Trimmed bunches
## 1 1
## Whole
## 1
I think it would be best to take some of the specific labels and put them in to more broad categories for analysis. I’ll add a new column for this so we can refer back to the given data.
I don’t think there is any elegant way to do this, I’ll need to just list out the new categories.
categorize = function(x){
switch(x,
'Packed in juice' = 'Packed',
'Packed in syrup or water' = 'Packed',
'Florets' = 'Fresh',
'Heads' = 'Fresh',
'Applesauce' = 'Other',
'Boiled' = 'Other',
'Canned, packed in syrup or water' = 'Packed',
'Cooked whole' = 'Other',
'Dried (Prunes)' = 'Dried',
'Fresh green cabbage' = 'Fresh',
'Fresh red cabbage' = 'Fresh',
'Fresh, consumed with peel' = 'Fresh',
'Fresh, peeled' = 'Other',
'Frozen french fries' = 'Frozen',
'Full Heads' = 'Fresh',
'Grape and cherry' = 'Fresh',
'Hearts' = 'Other',
'Large round' = 'Fresh',
'Packed in syrup, syrup discarded' = 'Packed',
'Raisins' = 'Dried',
'Raw' = 'Fresh',
'Raw baby' = 'Fresh',
'Raw whole' = 'Fresh',
'Roma and plum' = 'Fresh',
'Sauerkraut' = 'Other',
'Sliced' = 'Other',
'Sticks' = 'Other',
'Trimmed bunches' = 'Other',
'Whole' = 'Fresh',
x
)
}
price_data_4['Category'] = apply(price_data_4['Form'],1, categorize)
table(price_data_4['Category'])
##
## Canned Dried Fresh Frozen Other Packed
## 24 17 60 25 9 11
boxplot(Price_Per_Cup ~ Category, data=price_data_4)
We are down to 6 categories, all with at least a few members. From the boxplot we can see that “Packed” appears the most expensive per cup.
Like the “airline delays” example, it’s possible we could run into Simpson’s Paradox here. Because there are so many different foods, pulling them together into one chart is tough to look at. Instead I’ll take all the two-way combinations, and then pull any fruit that has a record in both of those categories, and chart that.
compare_two = function(cat1,cat2) {
#find all foods that have an entry in BOTH lists
a <-filter(price_data_4,Category==cat1)['Food']
b<- filter(price_data_4,Category==cat2)['Food']
intersection = intersect(a,b)
intersect_list <- as.character(unlist(intersection[,1]))
intersect_rows = price_data_4[price_data_4$Food %in% intersect_list,]
intersect_rows = filter(intersect_rows, Category==cat1 | Category==cat2)
if (nrow(intersection) > 1) {
ggplot(filter(intersect_rows, str_detect(Food,'^[a-e].*')), aes(Food, Price_Per_Cup, fill=Category)) +
geom_bar(position="dodge", stat="identity")
}
}
combinations <- as.data.frame(t(combn(unique(price_data_4$Category),2)))
apply(combinations, 1, function(x) compare_two(x[1],x[2]))
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
## NULL
##
## [[12]]
## NULL
##
## [[13]]
## NULL
##
## [[14]]
##
## [[15]]
There’s a lot to unpack (no pun intended) here. It does look like Fresh is generally more costly than Frozen or Canned, and that Frozen is more costly than canned.
A simpler way to look at the relative effects is to throw both food and category into a linear model and look at the coefficients. R will automatically turn our categorical variables into dummy variables.
model <- glm(Price_Per_Cup ~Category + Food,data=price_data_4)
summary(model)
##
## Call:
## glm(formula = Price_Per_Cup ~ Category + Food, data = price_data_4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.55942 -0.08801 0.00000 0.08866 0.97089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.86537 0.32198 2.688 0.009039 **
## CategoryDried 0.05914 0.12654 0.467 0.641732
## CategoryFresh 0.23463 0.10047 2.335 0.022481 *
## CategoryFrozen 0.06194 0.10292 0.602 0.549266
## CategoryOther 0.44353 0.17217 2.576 0.012168 *
## CategoryPacked 0.45579 0.17437 2.614 0.011011 *
## Foodapples -0.69945 0.38208 -1.831 0.071536 .
## Foodapricots -0.08671 0.35344 -0.245 0.806942
## Foodartichoke 1.29244 0.35716 3.619 0.000565 ***
## Foodasparagus 1.18911 0.35716 3.329 0.001409 **
## Foodavocados -0.14000 0.43261 -0.324 0.747223
## Foodbananas -0.82000 0.43261 -1.895 0.062283 .
## Foodbeets -0.32537 0.44413 -0.733 0.466315
## Foodberries mixed 0.27269 0.44149 0.618 0.538871
## Foodblack beans -0.49494 0.38661 -1.280 0.204817
## Foodblackberries 0.54134 0.37723 1.435 0.155862
## Foodblackeye peas -0.50494 0.38661 -1.306 0.195923
## Foodblueberries 0.30134 0.37723 0.799 0.427175
## Foodbroccoli -0.25577 0.35445 -0.722 0.473010
## Foodbrussels sprouts -0.17366 0.37723 -0.460 0.646741
## Foodbutternut squash -0.28000 0.43261 -0.647 0.519662
## Foodcabbage -0.75630 0.35674 -2.120 0.037658 *
## Foodcantaloupe -0.72000 0.43261 -1.664 0.100653
## Foodcarrots -0.69032 0.33825 -2.041 0.045148 *
## Foodcauliflower -0.43244 0.35445 -1.220 0.226665
## Foodcelery -0.82890 0.40354 -2.054 0.043814 *
## Foodcherries 0.64942 0.38246 1.698 0.094075 .
## Foodcollard greens -0.36756 0.35716 -1.029 0.307064
## Foodcorn sweet 0.12911 0.35716 0.361 0.718860
## Foodcranberries -0.34451 0.45085 -0.764 0.447426
## Foodcucumbers -0.80445 0.38208 -2.105 0.038948 *
## Fooddates -0.01451 0.45085 -0.032 0.974417
## Foodfigs 0.13549 0.45085 0.301 0.764698
## Foodfruit cocktail -0.51116 0.40496 -1.262 0.211173
## Foodgrapefruit -0.15000 0.43261 -0.347 0.729865
## Foodgrapes -0.33226 0.37999 -0.874 0.384990
## Foodgreat northern beans -0.50494 0.38661 -1.306 0.195923
## Foodgreen beans -0.42089 0.35716 -1.178 0.242720
## Foodgreen peas -0.29634 0.38294 -0.774 0.441700
## Foodgreen peppers -0.62000 0.43261 -1.433 0.156396
## Foodhoneydew -0.43000 0.43261 -0.994 0.323766
## Foodkale -0.19866 0.37723 -0.527 0.600177
## Foodkidney beans -0.50494 0.38661 -1.306 0.195923
## Foodkiwi 0.01000 0.43261 0.023 0.981626
## Foodlentils -0.70451 0.45085 -1.563 0.122780
## Foodlettuce iceberg -0.82000 0.43261 -1.895 0.062283 .
## Foodlettuce romaine -0.72445 0.38208 -1.896 0.062200 .
## Foodlima beans -0.31906 0.36192 -0.882 0.381101
## Foodmangoes -0.03726 0.37999 -0.098 0.922186
## Foodmixed vegetables -0.32134 0.35107 -0.915 0.363260
## Foodmushrooms -0.62945 0.38208 -1.647 0.104083
## Foodmustard greens -0.26756 0.35716 -0.749 0.456355
## Foodnavy beans -0.49494 0.38661 -1.280 0.204817
## Foodnectarines -0.44000 0.43261 -1.017 0.312723
## Foodokra 0.18634 0.37723 0.494 0.622917
## Foodolives 0.64463 0.44413 1.451 0.151251
## Foodonions -0.69000 0.43261 -1.595 0.115359
## Foodoranges -0.44000 0.43261 -1.017 0.312723
## Foodpapaya -0.28726 0.37999 -0.756 0.452287
## Foodpeaches -0.22991 0.35176 -0.654 0.515572
## Foodpears -0.36411 0.36779 -0.990 0.325693
## Foodpineapple -0.41921 0.35344 -1.186 0.239718
## Foodpinto beans -0.56994 0.38661 -1.474 0.145037
## Foodplums -0.18226 0.37999 -0.480 0.633028
## Foodpomegranate 0.17000 0.43261 0.393 0.695578
## Foodpotatoes -0.56089 0.35716 -1.570 0.120953
## Foodpumpkin -0.12537 0.44413 -0.282 0.778582
## Foodradish -0.65000 0.43261 -1.502 0.137599
## Foodraspberries 0.92134 0.37723 2.442 0.017196 *
## Foodred peppers -0.35000 0.43261 -0.809 0.421315
## Foodspinach -0.11790 0.34682 -0.340 0.734950
## Foodstrawberries -0.15866 0.37723 -0.421 0.675390
## Foodsummer squash -0.25000 0.43261 -0.578 0.565251
## Foodsweet potatoes -0.53000 0.43261 -1.225 0.224759
## Foodtangerines -0.28000 0.43261 -0.647 0.519662
## Foodtomatoes -0.22134 0.34293 -0.645 0.520812
## Foodturnip greens -0.22423 0.35716 -0.628 0.532228
## Foodwatermelon -0.90000 0.43261 -2.080 0.041261 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.09357697)
##
## Null deviance: 36.2113 on 145 degrees of freedom
## Residual deviance: 6.3632 on 68 degrees of freedom
## AIC: 114.9
##
## Number of Fisher Scoring iterations: 2
coef <- model$coefficients
pvalues <- coef(summary(model))[,4]
coef_cat <- coef[str_detect(names(coef),"Category.+")]
Since we’re looking for the effects of the categoriesk I’ll pull just the coefficients that point to our categories.
The model automatically removed Canned since it is a linear combination of the other categories. We can interpret this as having a coefficient of 0.
list('Coefficients' = coef_cat,
'P-values' = pvalues[str_detect(names(pvalues),"Category.+")])
## $Coefficients
## CategoryDried CategoryFresh CategoryFrozen CategoryOther CategoryPacked
## 0.05914107 0.23462952 0.06194211 0.44353246 0.45578980
##
## $`P-values`
## CategoryDried CategoryFresh CategoryFrozen CategoryOther CategoryPacked
## 0.64173240 0.02248140 0.54926552 0.01216775 0.01101121
Because Dried and Frozen have such high P-values, it’s best to think of them as having no effect. The other three have P-values around 0.01-0.02, and so we can interpret the coefficient as their effect on pricing. I think “Fresh” is the best category to use as a baseline, so let’s report the rest relative to Fresh.
names(coef_cat) <- str_replace(names(coef_cat),'Category','')
coef_cat = c(coef_cat,'Canned' = 0)
coef_cat['Dried'] = 0
coef_cat['Frozen'] = 0
coef_cat = coef_cat - coef_cat['Fresh']
sort(round(coef_cat,3))
## Dried Frozen Canned Fresh Other Packed
## -0.235 -0.235 -0.235 0.000 0.209 0.221
Dried, Frozen and Canned are around 24 cents less per cup than fresh. Frozen and canned being less expensive than fresh matches with what we saw in the charts. It’s interesting though that Frozen and Canned aren’t shown to be any different, when every fruit that had a form in both frozen and canned had a higher price for frozen. The foods that had both were a fairly small sample, so it’s not that surprising our chart of 5 foods doesn’t match with the overall effect.
Packed (which includes packed in juice, packed in syrup, etc) is around 22 cents more per cup.
Other (which is a combination of lots of things) Are about 21 cents more per cup.
I wouldn’t put much stock into that last one, since it was really just a catch-all. Reminder of which forms got put into “Other:
price_data_4['Form'][price_data_4['Category']=='Other']
## [1] "Applesauce" "Sauerkraut" "Cooked whole" "Trimmed bunches"
## [5] "Sticks" "Fresh, peeled" "Hearts" "Sliced"
## [9] "Boiled"