USDA - NASS Pumpkin Production Dataset by State (2018-2020)

This dataset includes planted, harvested, and production data for U.S. pumpkin-producing States for the years 2018-2020.

Load in the Dataset

setwd("/cloud/project")
statepumpkin <- read.csv("USDANASS_State Pumpkin Production_2018_2020.csv")

Clean up the Data

Make all headers lower case and remove underscores. Remove commas from the numeric columns so they can be converted from text to numbers.

names(statepumpkin) <- tolower(names(statepumpkin))
names(statepumpkin) <- gsub("_","",names(statepumpkin))
str(statepumpkin)
## 'data.frame':    44 obs. of  8 variables:
##  $ year             : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ state            : chr  "CALIFORNIA" "ILLINOIS" "INDIANA" "MICHIGAN" ...
##  $ statefips        : int  6 17 18 26 36 37 39 41 99 42 ...
##  $ commodity        : chr  "PUMPKINS" "PUMPKINS" "PUMPKINS" "PUMPKINS" ...
##  $ acresplanted     : chr  "3,900" "16,400" "6,200" "5,700" ...
##  $ acresharvested   : chr  "3,800" "15,900" "6,000" "5,400" ...
##  $ productioncwt    : chr  "1,026,000" "5,644,500" "960,000" "864,000" ...
##  $ productiondollars: chr  "20,705,000" "21,301,000" "16,397,000" "14,440,000" ...
# Strip the thousands separators, then convert from character to numeric.
# Entries that are not numbers (the withheld values) become NA, hence the warnings.
statepumpkin$acresplanted <- as.numeric(gsub(",", "", statepumpkin$acresplanted))
## Warning: NAs introduced by coercion
statepumpkin$acresharvested <- as.numeric(gsub(",", "", statepumpkin$acresharvested))
## Warning: NAs introduced by coercion
statepumpkin$productioncwt <- as.numeric(gsub(",", "", statepumpkin$productioncwt))
## Warning: NAs introduced by coercion
statepumpkin$productiondollars <- as.numeric(gsub(",", "", statepumpkin$productiondollars))
## Warning: NAs introduced by coercion
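
The four conversions above repeat the same pattern, so they could probably be written once and applied to several columns. The following is only a sketch of that idea: the helper name `to_number` and the small data frame `pumpkin_sample` are assumptions made for illustration, with sample values copied from the `str()` output above.

```r
# Sample values copied from the str() output above (first four 2020 rows)
pumpkin_sample <- data.frame(
  acresplanted   = c("3,900", "16,400", "6,200", "5,700"),
  acresharvested = c("3,800", "15,900", "6,000", "5,400"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop thousands separators, then convert to numeric
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Apply the helper to every column that holds comma-formatted numbers
num_cols <- c("acresplanted", "acresharvested")
pumpkin_sample[num_cols] <- lapply(pumpkin_sample[num_cols], to_number)

str(pumpkin_sample)
```

On the full dataset, `num_cols` would list all four comma-formatted columns, and the same coercion warning would be expected wherever a value is not a number.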

2019: Data for Texas and Wisconsin were withheld by USDA-NASS and are coded as “NA” to avoid disclosing data for individual operations. Data for “Other States” that produce pumpkins were aggregated by USDA-NASS to prevent disclosing data for individual operations.

2018 and 2020: Data for “Other States” is not present in the 2018 dataset and is zero in the 2020 dataset.

Summary of the Dataset

str(statepumpkin)
## 'data.frame':    44 obs. of  8 variables:
##  $ year             : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ state            : chr  "CALIFORNIA" "ILLINOIS" "INDIANA" "MICHIGAN" ...
##  $ statefips        : int  6 17 18 26 36 37 39 41 99 42 ...
##  $ commodity        : chr  "PUMPKINS" "PUMPKINS" "PUMPKINS" "PUMPKINS" ...
##  $ acresplanted     : num  3900 16400 6200 5700 6300 3200 3900 2300 0 7300 ...
##  $ acresharvested   : num  3800 15900 6000 5400 5600 2600 3600 2200 0 7000 ...
##  $ productioncwt    : num  1026000 5644500 960000 864000 700000 ...
##  $ productiondollars: num  20705000 21301000 16397000 14440000 13203000 ...
summary(statepumpkin)
##       year         state             statefips      commodity        
##  Min.   :2018   Length:44          Min.   : 6.00   Length:44         
##  1st Qu.:2018   Class :character   1st Qu.:26.00   Class :character  
##  Median :2019   Mode  :character   Median :39.00   Mode  :character  
##  Mean   :2019                      Mean   :38.93                     
##  3rd Qu.:2020                      3rd Qu.:48.75                     
##  Max.   :2020                      Max.   :99.00                     
##                                                                      
##   acresplanted   acresharvested  productioncwt     productiondollars 
##  Min.   :    0   Min.   :    0   Min.   :      0   Min.   :       0  
##  1st Qu.: 2950   1st Qu.: 2600   1st Qu.: 429625   1st Qu.: 9322750  
##  Median : 4900   Median : 4800   Median : 690750   Median :13801500  
##  Mean   : 5086   Mean   : 4643   Mean   :1014507   Mean   :13561119  
##  3rd Qu.: 6100   3rd Qu.: 5600   3rd Qu.:1026000   3rd Qu.:16979250  
##  Max.   :16400   Max.   :15900   Max.   :5644500   Max.   :27592000  
##  NA's   :2       NA's   :2       NA's   :2         NA's   :2
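
The summary reports two missing values per numeric column but not which rows they belong to. One way to list them is `complete.cases()`. The sketch below uses a three-row stand-in, `pumpkin_sample`, which is an assumption for illustration: the Illinois 2020 values are copied from the `str()` output above, and the 2019 Texas and Wisconsin rows are set to NA as described in the notes on withheld data.

```r
# Stand-in for the full data frame (see the lead-in for where the values come from)
pumpkin_sample <- data.frame(
  year           = c(2020L, 2019L, 2019L),
  state          = c("ILLINOIS", "TEXAS", "WISCONSIN"),
  acresplanted   = c(16400, NA, NA),
  acresharvested = c(15900, NA, NA),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows that have no NA in any column,
# so negating it keeps only the rows with at least one missing value
missing_rows <- pumpkin_sample[!complete.cases(pumpkin_sample), c("year", "state")]
missing_rows
```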

Summary Statistics by Year

library(table1)
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1::label(statepumpkin$acresplanted) <- "Acres Planted"
table1::label(statepumpkin$acresharvested) <- "Acres Harvested"
table1::label(statepumpkin$productioncwt) <- "Production CWT"
table1::label(statepumpkin$productiondollars) <- "Production in Dollars"
table1::table1(~acresplanted + acresharvested + productioncwt + productiondollars | year, data = statepumpkin)
## Warning in table1.formula(~acresplanted + acresharvested + productioncwt + :
## Terms to the right of '|' in formula 'x' define table columns and are expected
## to be factors with meaningful labels.
                     2018 (N=16)                   2019 (N=14)                   2020 (N=14)                   Overall (N=44)
Acres Planted
  Mean (SD)          4720 (2640)                   5680 (2930)                   4990 (3820)                   5090 (3100)
  Median [Min, Max]  4550 [1700, 12000]            5450 [2000, 13100]            3900 [0, 16400]               4900 [0, 16400]
  Missing            0 (0%)                        2 (14.3%)                     0 (0%)                        2 (4.5%)
Acres Harvested
  Mean (SD)          4230 (2410)                   5100 (2350)                   4730 (3720)                   4640 (2850)
  Median [Min, Max]  4300 [1400, 11000]            5200 [1900, 10900]            3750 [0, 15900]               4800 [0, 15900]
  Missing            0 (0%)                        2 (14.3%)                     0 (0%)                        2 (4.5%)
Production CWT
  Mean (SD)          963000 (1210000)              1120000 (1050000)             982000 (1390000)              1010000 (1200000)
  Median [Min, Max]  661000 [111000, 5170000]      762000 [410000, 4200000]      718000 [0, 5640000]           691000 [0, 5640000]
  Missing            0 (0%)                        2 (14.3%)                     0 (0%)                        2 (4.5%)
Production in Dollars
  Mean (SD)          12200000 (6980000)            15000000 (6050000)            13900000 (7940000)            13600000 (7000000)
  Median [Min, Max]  12100000 [2260000, 27600000]  16000000 [5150000, 27000000]  15400000 [0, 25900000]        13800000 [0, 27600000]
  Missing            0 (0%)                        2 (14.3%)                     0 (0%)                        2 (4.5%)
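
The `table1` warning above appears because `year` is stored as an integer, while the term to the right of `|` is expected to be a factor. Converting the column first should avoid the warning. The snippet below is a minimal sketch on a short vector of years rather than on the full data frame.

```r
# A few year values like those in the dataset
year <- c(2020L, 2019L, 2018L, 2018L)

# factor() turns the integer values into labelled categories
year_f <- factor(year)
levels(year_f)
is.factor(year_f)

# Applied to the data frame, before calling table1():
# statepumpkin$year <- factor(statepumpkin$year)
```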

Data Visualization was completed in Tableau.

https://public.tableau.com/views/Data110_Project1_USPumpkinProduction_JannetyMosley/Dashboard-USPumpkins?:language=en-US&:display_count=n&:origin=viz_share_link

Data Source: USDA-National Agricultural Statistics Service (https://www.nass.usda.gov)

Data Description

In Project 1, I used data from the USDA National Agricultural Statistics Service (USDA-NASS). The dataset includes planted, harvested, and production data for U.S. pumpkin-producing States for the years 2018 to 2020. I reviewed the data to see how pumpkin production changed across the three years. The following variables are included in the dataset: “Year” (categorical variable), “State” (categorical variable), “State FIPS” (numeric constant variable), “Commodity” (categorical variable), “Acres Planted” (continuous variable), “Acres Harvested” (continuous variable), “Production CWT” (production in hundredweight; continuous variable), and “Production Dollars” (discrete-continuous variable). To clean the dataset, I made all headers lower case and removed the underscores from them. I also removed the commas from the values of “Acres Planted”, “Acres Harvested”, “Production CWT”, and “Production Dollars”. R reads numbers written with commas as text, so the commas had to be removed and the columns converted to numeric before a summary could be run in RStudio.

I first ran an overall summary of the dataset in RStudio, but it did not give me enough detail. As a result, I created a table that summarizes “Acres Planted”, “Acres Harvested”, “Production CWT”, and “Production Dollars” by “Year”. In 2019, two cases are missing: the data for Texas and Wisconsin were withheld by USDA-NASS to avoid disclosing data for individual operations and are coded as “NA”. Data for “Other States” that produce pumpkins were aggregated by USDA-NASS for the same reason. “Other States” is not present in the 2018 data and is zero in the 2020 data. Average production increased from 2018 to 2019, from about 963,000 to 1,120,000 cwt and from about $12.2 million to $15.0 million. Both averages declined in 2020, to about 982,000 cwt and $13.9 million, but not below the 2018 averages.

The visualizations for this project were created in Tableau. Separate worksheets were built for “Acres Planted”, “Acres Harvested”, and “Production CWT” for 2018 to 2020, and a dashboard combines the worksheets so that changes in pumpkin production are easy to compare. Because the data only cover States that produce pumpkins, all other States are shown in gray on the map. Pumpkin production in the United States takes place in the Mid-Atlantic, the Northeast, Texas, the upper Midwest, and on the West Coast. From 2019 on, Minnesota, Tennessee, and New Jersey drop out of the dataset, with no planted, harvested, or production figures reported for them. Illinois was the top producer of pumpkins from 2018 to 2020, producing 5,644,500 cwt worth $21,301,000 in 2020. California, Indiana, Texas, and Virginia also ranked among the top five States from 2018 to 2020. Both California and Texas show a decline from 2018 to 2020 in acres planted, acres harvested, and production. North Carolina’s loss of over half of its pumpkin production also stands out on the map. Finally, total pumpkin production in the United States decreased from 2018 to 2020, even though planted and harvested acres, after declining in 2019, increased slightly in 2020.

One thing I wish I had done to improve this dataset, visualization, and analysis is to calculate the actual percent difference between the years. It would have given a clearer picture of the change over the three years.
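
The percent difference between years mentioned above could be calculated along these lines. This is only a sketch: it uses the rounded yearly means of “Production CWT” from the summary table rather than the full dataset, and `pct_change` is a hypothetical helper name.

```r
# Yearly mean production (cwt), as reported (rounded) in the summary table
mean_cwt <- c("2018" = 963000, "2019" = 1120000, "2020" = 982000)

# Hypothetical helper: percent change of each value relative to the previous one
pct_change <- function(x) {
  100 * diff(x) / head(x, -1)
}

# One value per year-over-year step, labelled with the later year
round(pct_change(mean_cwt), 1)
```

With the full data frame, yearly totals could be computed first, for example with `aggregate(productioncwt ~ year, data = statepumpkin, FUN = sum)`, and the resulting column passed to the same helper.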