This data set includes planted, harvested, and production data for U.S. pumpkin-producing States for the years 2018 to 2020.
setwd("/cloud/project")
statepumpkin <- read.csv("USDANASS_State Pumpkin Production_2018_2020.csv")
Make all headers lower case and remove underscores. Remove commas from the numeric columns so they can be converted from character to numeric.
names(statepumpkin) <- tolower(names(statepumpkin))
names(statepumpkin) <- gsub("_","",names(statepumpkin))
str(statepumpkin)
## 'data.frame': 44 obs. of 8 variables:
## $ year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ state : chr "CALIFORNIA" "ILLINOIS" "INDIANA" "MICHIGAN" ...
## $ statefips : int 6 17 18 26 36 37 39 41 99 42 ...
## $ commodity : chr "PUMPKINS" "PUMPKINS" "PUMPKINS" "PUMPKINS" ...
## $ acresplanted : chr "3,900" "16,400" "6,200" "5,700" ...
## $ acresharvested : chr "3,800" "15,900" "6,000" "5,400" ...
## $ productioncwt : chr "1,026,000" "5,644,500" "960,000" "864,000" ...
## $ productiondollars: chr "20,705,000" "21,301,000" "16,397,000" "14,440,000" ...
statepumpkin$acresplanted <- as.numeric(gsub(",", "", statepumpkin$acresplanted))
## Warning: NAs introduced by coercion
statepumpkin$acresharvested <- as.numeric(gsub(",", "", statepumpkin$acresharvested))
## Warning: NAs introduced by coercion
statepumpkin$productioncwt <- as.numeric(gsub(",", "", statepumpkin$productioncwt))
## Warning: NAs introduced by coercion
statepumpkin$productiondollars <- as.numeric(gsub(",", "", statepumpkin$productiondollars))
## Warning: NAs introduced by coercion
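The four conversions above follow one pattern, so they could be applied in a single pass, and the coercion warning can be traced back to the raw cells that failed to parse. The following is only a minimal sketch on a small made-up data frame, not the real CSV: the "(D)" placeholder, the `to_number` helper, and the `raw`, `clean`, and `failed` names are illustrative assumptions.

```r
# Hypothetical stand-in for the raw columns; "(D)" is an assumed withheld-value marker
raw <- data.frame(
  state = c("ILLINOIS", "TEXAS", "INDIANA"),
  acresplanted = c("16,400", "(D)", "6,200"),
  productioncwt = c("5,644,500", "(D)", "960,000"),
  stringsAsFactors = FALSE
)

# Strip thousands separators and convert; non-numeric text becomes NA.
# suppressWarnings() hides the coercion warning because failures are inspected below.
to_number <- function(x) {
  suppressWarnings(as.numeric(gsub(",", "", x, fixed = TRUE)))
}

# Apply the same conversion to every comma-formatted column
num_cols <- c("acresplanted", "productioncwt")
clean <- raw
clean[num_cols] <- lapply(raw[num_cols], to_number)

# Rows whose raw cell was present but could not be parsed as a number
failed <- raw[is.na(clean$acresplanted) & !is.na(raw$acresplanted),
              c("state", "acresplanted")]
print(failed)
```

Listing the failed cells this way makes it possible to confirm that every `NA` corresponds to a value that was deliberately withheld rather than to a formatting problem.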
2019: Data for Texas and Wisconsin were withheld by USDA-NASS to avoid disclosing data for individual operations and are coded as “NA”. Data for “Other States” that produce pumpkins were aggregated by USDA-NASS for the same reason.
2018 and 2020: Data for “Other States” are not present in the 2018 dataset and are zero in the 2020 dataset.
str(statepumpkin)
## 'data.frame': 44 obs. of 8 variables:
## $ year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ state : chr "CALIFORNIA" "ILLINOIS" "INDIANA" "MICHIGAN" ...
## $ statefips : int 6 17 18 26 36 37 39 41 99 42 ...
## $ commodity : chr "PUMPKINS" "PUMPKINS" "PUMPKINS" "PUMPKINS" ...
## $ acresplanted : num 3900 16400 6200 5700 6300 3200 3900 2300 0 7300 ...
## $ acresharvested : num 3800 15900 6000 5400 5600 2600 3600 2200 0 7000 ...
## $ productioncwt : num 1026000 5644500 960000 864000 700000 ...
## $ productiondollars: num 20705000 21301000 16397000 14440000 13203000 ...
summary(statepumpkin)
## year state statefips commodity
## Min. :2018 Length:44 Min. : 6.00 Length:44
## 1st Qu.:2018 Class :character 1st Qu.:26.00 Class :character
## Median :2019 Mode :character Median :39.00 Mode :character
## Mean :2019 Mean :38.93
## 3rd Qu.:2020 3rd Qu.:48.75
## Max. :2020 Max. :99.00
##
## acresplanted acresharvested productioncwt productiondollars
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2950 1st Qu.: 2600 1st Qu.: 429625 1st Qu.: 9322750
## Median : 4900 Median : 4800 Median : 690750 Median :13801500
## Mean : 5086 Mean : 4643 Mean :1014507 Mean :13561119
## 3rd Qu.: 6100 3rd Qu.: 5600 3rd Qu.:1026000 3rd Qu.:16979250
## Max. :16400 Max. :15900 Max. :5644500 Max. :27592000
## NA's :2 NA's :2 NA's :2 NA's :2
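The zero minimums in this summary appear to come from the 2020 “Other States” row, which the notes above describe as zero, so that row pulls the 2020 statistics down. One option is to recode that placeholder as missing before summarizing. This is a sketch on illustrative rows, assuming the row is labeled "OTHER STATES" in the `state` column (the actual label is not shown in the output above); `demo` and `is_placeholder` are hypothetical names.

```r
# Illustrative rows in the shape of statepumpkin; the "OTHER STATES" label is assumed
demo <- data.frame(
  year = c(2020L, 2020L, 2020L),
  state = c("ILLINOIS", "INDIANA", "OTHER STATES"),
  acresplanted = c(16400, 6200, 0),
  stringsAsFactors = FALSE
)

# Mean with the zero placeholder row counted as a real observation
mean_with_zero <- mean(demo$acresplanted)

# Recode the zero placeholder as missing rather than as a true zero
is_placeholder <- demo$state == "OTHER STATES" & demo$acresplanted == 0
demo$acresplanted[is_placeholder] <- NA

# Mean over the named States only
mean_without <- mean(demo$acresplanted, na.rm = TRUE)

print(c(with_zero = mean_with_zero, without = mean_without))
```

Whether the zero should be kept or treated as missing is a judgment call; the sketch only shows how much a single placeholder row can shift a mean.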
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
table1::label(statepumpkin$acresplanted) <- "Acres Planted"
table1::label(statepumpkin$acresharvested) <- "Acres Harvested"
table1::label(statepumpkin$productioncwt) <- "Production CWT"
table1::label(statepumpkin$productiondollars) <- "Production in Dollars"
table1::table1(~acresplanted + acresharvested + productioncwt + productiondollars | year, data = statepumpkin)
## Warning in table1.formula(~acresplanted + acresharvested + productioncwt + :
## Terms to the right of '|' in formula 'x' define table columns and are expected
## to be factors with meaningful labels.
| | 2018 (N=16) | 2019 (N=14) | 2020 (N=14) | Overall (N=44) |
|---|---|---|---|---|
| Acres Planted | | | | |
| Mean (SD) | 4720 (2640) | 5680 (2930) | 4990 (3820) | 5090 (3100) |
| Median [Min, Max] | 4550 [1700, 12000] | 5450 [2000, 13100] | 3900 [0, 16400] | 4900 [0, 16400] |
| Missing | 0 (0%) | 2 (14.3%) | 0 (0%) | 2 (4.5%) |
| Acres Harvested | | | | |
| Mean (SD) | 4230 (2410) | 5100 (2350) | 4730 (3720) | 4640 (2850) |
| Median [Min, Max] | 4300 [1400, 11000] | 5200 [1900, 10900] | 3750 [0, 15900] | 4800 [0, 15900] |
| Missing | 0 (0%) | 2 (14.3%) | 0 (0%) | 2 (4.5%) |
| Production CWT | | | | |
| Mean (SD) | 963000 (1210000) | 1120000 (1050000) | 982000 (1390000) | 1010000 (1200000) |
| Median [Min, Max] | 661000 [111000, 5170000] | 762000 [410000, 4200000] | 718000 [0, 5640000] | 691000 [0, 5640000] |
| Missing | 0 (0%) | 2 (14.3%) | 0 (0%) | 2 (4.5%) |
| Production in Dollars | | | | |
| Mean (SD) | 12200000 (6980000) | 15000000 (6050000) | 13900000 (7940000) | 13600000 (7000000) |
| Median [Min, Max] | 12100000 [2260000, 27600000] | 16000000 [5150000, 27000000] | 15400000 [0, 25900000] | 13800000 [0, 27600000] |
| Missing | 0 (0%) | 2 (14.3%) | 0 (0%) | 2 (4.5%) |
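The warning printed before the table arises because `year` is stored as an integer rather than a factor. Converting it first should avoid the warning, and a base R cross-check of the per-year means is a quick way to confirm the table. This is a minimal sketch: the `demo` data frame and its values are made up for illustration, and the commented `table1` call reuses the formula style from above.

```r
# Made-up values for illustration only
demo <- data.frame(
  year = c(2018L, 2018L, 2019L, 2019L, 2020L, 2020L),
  acresplanted = c(4000, 5000, 6000, NA, 3000, 5000)
)

# Store year as a factor so it defines table columns with explicit labels
demo$year <- factor(demo$year)

# Base R cross-check: mean acres planted per year.
# The formula interface drops rows with NA by default (na.action = na.omit).
by_year <- aggregate(acresplanted ~ year, data = demo, FUN = mean)
print(by_year)

# With table1 installed, the grouped table could then be requested as before:
# table1::table1(~ acresplanted | year, data = demo)
```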
https://public.tableau.com/views/Data110_Project1_USPumpkinProduction_JannetyMosley/Dashboard-USPumpkins?:language=en-US&:display_count=n&:origin=viz_share_link
Data Source: USDA-National Agricultural Statistics Service (https://www.nass.usda.gov)
In Project 1, I used data from the USDA National Agricultural Statistics Service (USDA-NASS). This data set includes planted, harvested, and production data for U.S. pumpkin-producing States for the years 2018 to 2020. I reviewed the data to see how pumpkin production changed across the three years. The following variables are included in the dataset: “Year” (categorical variable), “State” (categorical variable), “State FIPS” (numeric identifier, constant for each State), “Commodity” (categorical variable), “Acres Planted” (continuous variable), “Acres Harvested” (continuous variable), “Production CWT” (continuous variable), and “Production Dollars” (discrete-continuous variable). To clean the data set, I made all headers lower case and removed underscores. I also removed the commas from the values in “Acres Planted”, “Acres Harvested”, “Production CWT”, and “Production Dollars”. The commas had to be removed because R read these columns as character strings, and they must be numeric before a summary in RStudio can report statistics for them.
I first ran an overall summary of the dataset in RStudio, but it did not provide enough detail. As a result, I created a table that summarizes “Acres Planted”, “Acres Harvested”, “Production CWT”, and “Production Dollars” by “Year”. In 2019, two cases are missing: data for Texas and Wisconsin were withheld by USDA-NASS to avoid disclosing data for individual operations and are coded as “NA”. Data for “Other States” that produce pumpkins were aggregated by USDA-NASS for the same reason. Also, data for “Other States” are not present in the 2018 dataset and are zero in the 2020 dataset. The average for pumpkin production increased from 2018 to 2019, from about 963,000 cwt to 1,120,000 cwt and from about $12.2 million to $15.0 million. Both averages declined in 2020, to about 982,000 cwt and $13.9 million, but stayed above their 2018 levels.
The visualizations for this project were created in Tableau. Separate worksheets were built for “Acres Planted”, “Acres Harvested”, and “Production CWT” for 2018 to 2020, and a dashboard combines the worksheets so that changes in pumpkin production can be compared easily. Because the data cover only States that produce pumpkins, all other States in the U.S. are shown in gray on the map. Pumpkin production in the United States takes place in the mid-Atlantic, the Northeast, Texas, the upper Midwest, and the West Coast. In 2019, Minnesota, Tennessee, and New Jersey dropped out of the State-level data, either because they stopped producing pumpkins or because their figures were folded into the aggregated “Other States” category. Illinois was the top producer of pumpkins in each year from 2018 to 2020; in 2020 it produced 5,644,500 cwt valued at $21,301,000. California, Indiana, Texas, and Virginia also ranked among the top five States from 2018 to 2020. Both California and Texas show a decline from 2018 to 2020 in pumpkin acres planted, acres harvested, and production. North Carolina’s loss of over half of its pumpkin production also stands out on the map. Overall, the analysis shows that total pumpkin production in the United States decreased from 2018 to 2020, even though planted and harvested acres, after declining in 2019, increased slightly in 2020. One thing I wish I had done to improve this dataset, visualization, and analysis is to calculate the actual percent difference between the years, which would have given a clearer picture of the change over the three years.
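The percent differences mentioned above could be computed directly from the yearly figures. This is a minimal sketch using the rounded “Production CWT” means from the summary table (963,000; 1,120,000; 982,000). Those inputs are rounded to three significant figures, so the results are approximate, and the `pct_change` helper is a hypothetical name.

```r
# Rounded per-year means of Production CWT, copied from the summary table
mean_cwt <- c("2018" = 963000, "2019" = 1120000, "2020" = 982000)

# Percent change from an old value to a new value
pct_change <- function(old, new) (new - old) / old * 100

changes <- c(
  "2018 to 2019" = pct_change(mean_cwt[["2018"]], mean_cwt[["2019"]]),
  "2019 to 2020" = pct_change(mean_cwt[["2019"]], mean_cwt[["2020"]]),
  "2018 to 2020" = pct_change(mean_cwt[["2018"]], mean_cwt[["2020"]])
)
print(round(changes, 1))
```

On these rounded means, average production rose about 16.3% from 2018 to 2019 and fell about 12.3% from 2019 to 2020, ending roughly 2% above 2018. The same helper could be applied to yearly totals or to individual States.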