I started with the idea of looking at a possible correlation between access to locally grown fresh food and better student performance in school (Good Food Equals Good Grades), but I was stymied by the breadth of data sources. It was going to take me months just to define good grades and good food.
Another idea came up while looking at the agriculture sites: could I use weather data to predict useful information for food-crop planting? At the big-picture end of the question: is climate change moving the best planting days (freeze or frost days)? At the narrow end: when is the earliest I should plant my tomatoes this spring? That is a practical question whose answer I can actually use, and the information looks more focused.
The job for me now is how best to predict the planting date. My answer is to follow the data science workflow. I really like the simplicity and flow of Hadley Wickham’s Collect –> Analyze –> Communicate, but it is hard to beat a good mnemonic, so I decided to lay out the rest of this document using OSEMN (with the hope of making it awesome).
Library Packages Used
library(rvest) # To read in HTML table of Freeze-Frost Dates
library(dplyr) # To use the tbl_df and other functions
library(randomForest) # For attempted modeling
The most important factor in determining when to plant any vegetable in your garden is the “LAST FROST DATE” in the spring and the “FIRST FROST DATE” in the fall for your area. These dates are based on historical weather data collected over a 30-year period and compiled by the National Climatic Data Center from over 5,800 weather monitoring stations throughout the United States.
For each weather monitoring station, a FREEZE DAY is any day of the year on which the temperature reaches 32°F or below, and a FROST DAY is any day on which the temperature reaches 36°F or below. So why worry about FROST DAYS more than FREEZE DAYS? A freeze is what will kill many plants, but weather monitoring stations are typically mounted four to six feet above the ground. During clear, calm, cold nights the temperature at ground level, where your garden is, can be much colder and even freeze. This is why most gardeners play it safe by using Frost Days as planting guides.
I wanted planting dates, or Frost Days (dates), to compare against my analysis. My hope was to find a history of Farmer’s Almanac planting dates over time. What I found were at least two almanacs (The Old Farmer’s Almanac and Farmers’ Almanac), neither of which seemed to keep old predictions around on their websites. Unlike their winter weather predictions, both seemed to rely on the National Climatic Data Center dates for planting (last frost day) based on the USDA Hardiness Zones.
I found the best table form of the data on the Organic Gardening site and reproduce it below. I used R to read in the web table and make a dataframe, mostly to show that I could read in HTML; the table is simple enough that a copy-and-paste with a little formatting would have worked.
We should note that my home is in USDA Hardiness Zone 7, which gives a First Frost Date of October 15 and a Last Frost Date of April 15, so April 15 is my target date. We might also note that Zone 1 has an incredibly short one-month growing season.
I considered this table and the zones to be the “old school” method. When I found a “New and Improved” method on The How Do Gardener website, I expected newer data to give me an earlier date (warming trend). The site reads: “The National Climatic Data Center has recently released an entirely new data set, this one collected between 1981 and 2010.” I entered my zip code into their new Freeze and Frost Day Calculator and discovered two things: my nearest weather monitoring station is named Vienna, and my last Frost Day is May 11 (last Freeze Day is April 27; only a 10% chance of a freeze/frost after these dates).
I assume this is May 11, 2016, and it is almost a month later than the April 15 from the table. I now have a date range to examine, and a significant one for a farmer or gardener.
Freeze-Frost Data Table
# Read the frost-date page, pull out its tables, and parse the first one
freeze <- read_html("http://organicgardening.about.com/od/organicgardening101/a/frostdatechart.htm", encoding = "UTF-8")
freeze <- html_nodes(freeze, "table")
freeze <- html_table(freeze[[1]], header = TRUE)
tbl_df(freeze)
## Source: local data frame [11 x 3]
##
## USDA Hardiness Zone First Frost Date Last Frost Date
## (int) (chr) (chr)
## 1 1 July 15th June 15th
## 2 2 Autust 15th May 15th
## 3 3 September 15th May 15th
## 4 4 September 15th May 15th
## 5 5 October 15th April 15th
## 6 6 October 15th April 15th
## 7 7 October 15th April 15th
## 8 8 November 15th March 15th
## 9 9 December 15th February 15th
## 10 10 December 15th January 31st (sometimes earlier)
## 11 11 No frost. No frost.
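With the table in a dataframe, pulling out my own zone is a one-liner. A quick sketch (the column name is exactly as html_table read it from the page):
# Look up the frost dates for my zone
subset(freeze, `USDA Hardiness Zone` == 7)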
I requested the “Daily Normals” from all the weather stations in Fairfax County, Virginia from the Global Historical Climatology Network (GHCN) of the National Climatic Data Center of the National Oceanic and Atmospheric Administration (NOAA). I received a CSV file of 100,197 observations and 33 variables.
At first look the data seem to span September 1, 1950 to December 6, 2015. We can see the variables and sample data below. There are forty-five (45) weather stations in Fairfax County, yet somehow they give us forty-four (44) names, forty-two (42) elevations, forty-five (45) latitudes, and forty-eight (48) longitudes. This hints that some weather stations moved over the years.
A lot of data is missing, with several fields holding the value -9999, which the documentation says indicates missing data or data that has not been received. This should be fine, because what the documentation calls the five core values (PRCP, SNOW, SNWD, TMAX, TMIN) seem to have values, and they should be all I need.
Fairfax County Weather Data
# Read the GHCN daily data for Fairfax County (CSV posted to GitHub);
# stringsAsFactors = TRUE leaves the station ID and coordinate columns as factors
weather <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/fcwd.csv", header = TRUE, stringsAsFactors = TRUE)
str(weather)
## 'data.frame': 100197 obs. of 33 variables:
## $ STATION : Factor w/ 45 levels "GHCND:US1VAALC003",..: 35 35 35 35 35 35 35 35 35 35 ...
## $ STATION_NAME: Factor w/ 44 levels "ALEXANDRIA 1.8 S VA US",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ ELEVATION : Factor w/ 42 levels "105.5","106.1",..: 33 33 33 33 33 33 33 33 33 33 ...
## $ LATITUDE : Factor w/ 45 levels "38.713","38.71667",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ LONGITUDE : Factor w/ 48 levels "-77.0547","-77.08333",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ DATE : int 19500901 19500902 19500903 19500904 19500905 19500906 19500907 19500908 19500909 19500910 ...
## $ MDPR : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ MDSF : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ DAPR : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ DASF : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ PRCP : int 10 46 5 25 0 0 0 0 282 145 ...
## $ SNWD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SNOW : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TMAX : int 306 311 317 283 222 228 250 272 222 306 ...
## $ TMIN : int 211 217 217 206 106 94 100 128 167 194 ...
## $ TOBS : int 256 278 233 222 156 200 217 211 217 217 ...
## $ WESD : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WESF : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WDFG : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ PGTM : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WSFG : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT09 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT14 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT07 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT01 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT06 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT05 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT11 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT04 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT16 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT08 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT18 : int -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
## $ WT03 : int -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 1 ...
tbl_df(weather)
## Source: local data frame [100,197 x 33]
##
## STATION STATION_NAME ELEVATION LATITUDE LONGITUDE
## (fctr) (fctr) (fctr) (fctr) (fctr)
## 1 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## 2 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## 3 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## 4 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## 5 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## 6 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## 7 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## 8 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## 9 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## 10 GHCND:USC00440403 BAILEYS CROSSROADS VA US 78.9 38.85 -77.13333
## .. ... ... ... ... ...
## Variables not shown: DATE (int), MDPR (int), MDSF (int), DAPR (int), DASF
## (int), PRCP (int), SNWD (int), SNOW (int), TMAX (int), TMIN (int), TOBS
## (int), WESD (int), WESF (int), WDFG (int), PGTM (int), WSFG (int), WT09
## (int), WT14 (int), WT07 (int), WT01 (int), WT06 (int), WT05 (int), WT11
## (int), WT04 (int), WT16 (int), WT08 (int), WT18 (int), WT03 (int)
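As an aside, the per-column counts quoted earlier (45 stations, 44 names, and so on) can be reproduced by counting unique values in the first five columns. A quick sketch:
# Count distinct values in the station identification columns
sapply(weather[, 1:5], function(x) length(unique(x)))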
The documentation for the data set says the five core values are:
PRCP = Precipitation (tenths of mm)
SNOW = Snowfall (mm)
SNWD = Snow depth (mm)
TMAX = Maximum temperature (tenths of degrees Celsius)
TMIN = Minimum temperature (tenths of degrees Celsius)
So TMAX is the maximum temperature on that date measured in tenths of degrees Celsius (e.g., 306 = 30.6 degrees Celsius), and TMIN is the minimum temperature on that day in the same units. We are primarily interested in TMIN, but it makes sense to keep all five core values.
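Since the frost and freeze thresholds are quoted in Fahrenheit, a tiny conversion helper makes the tenths-of-Celsius values easier to read (my own addition; tenths_c_to_f is not part of the data or the original workflow):
# Convert GHCN tenths-of-degrees-Celsius values to degrees Fahrenheit
tenths_c_to_f <- function(tenths) (tenths / 10) * 9 / 5 + 32
tenths_c_to_f(306) # 87.08 degrees F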
We also need the observed date and the weather station location data (at least until we determine the most complete data closest to me). We can get rid of the other columns and eventually narrow our data set to just seven (7) columns. On the other end, we can zero in on the weather station closest to me and drop the other 44 stations. This gives us the more manageable data set below.
The latitude of Reston, Virginia (where I live) is 38.9544 and the longitude is -77.3464. I decided to compare my location with the weather stations by differencing latitude and longitude, and to count observations by station (30 years of data from one station is roughly 10,958 observations; I would love two or three times that).
We start by using the count function from dplyr to get the stations and their coordinates into a table with observation counts. We then do the latitude and longitude math and add the columns. A difference of one degree of latitude is about 69 miles, so a tenth of a degree is about 6.9 miles and a hundredth about 0.69 miles. There are a lot of weather stations close to me. The table below drops the raw latitude and longitude so the other columns fit. The n column is the count of observations from that station, and latdiff and longdiff measure the offset in degrees from Reston, VA.
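To turn those degree offsets into rough distances, a small helper could look like the sketch below (deg_to_miles is my own; it shrinks longitude degrees by the cosine of the latitude, which is only an approximation):
# Approximate straight-line distance in miles from degree offsets near Reston
deg_to_miles <- function(latdiff, longdiff, lat = 38.9544) {
sqrt((latdiff * 69)^2 + (longdiff * 69 * cos(lat * pi / 180))^2)
}
deg_to_miles(0.028930, -0.003600) # Dranesville: about 2 miles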
Weather Stations in Fairfax County
# Tally observations per station/coordinate combination
stations <- count(weather, STATION, STATION_NAME, LATITUDE, LONGITUDE)
# LATITUDE and LONGITUDE were read as factors, so go through character to get numbers
stations$LATITUDE <- as.character(stations$LATITUDE)
stations$LONGITUDE <- as.character(stations$LONGITUDE)
latdiff <- as.numeric(stations$LATITUDE) - 38.9544
longdiff <- as.numeric(stations$LONGITUDE) + 77.3464 # adding, because longitudes here are negative
stations <- cbind(stations, latdiff, longdiff)
stations <- arrange(stations, desc(latdiff), desc(longdiff))
stations$STATION <- substr(stations$STATION, 7, 17) # shorten station ID for fit
subset(stations, select = c(-LATITUDE, -LONGITUDE))
## STATION STATION_NAME n latdiff longdiff
## 1 USC00442491 DRANESVILLE VA US 1064 0.028930 -0.003600
## 2 USC00443929 HERNDON VA US 944 0.012270 -0.036930
## 3 US1VAFX0035 HERNDON 0.7 WSW VA US 453 0.010400 -0.053600
## 4 US1VAFX0040 VIENNA 3.3 N VA US 1926 -0.006900 0.082700
## 5 US1VAFX0057 RESTON 1.8 SSW VA US 565 -0.025000 -0.014400
## 6 US1VAFX0001 HERNDON 3.3 S VA US 3181 -0.032300 -0.033700
## 7 US1VAFX0070 OAK HILL 1.0 NE VA US 239 -0.032500 -0.046600
## 8 US1VAFX0041 MCLEAN 2.3 SE VA US 2475 -0.036400 0.203000
## 9 US1VAFX0052 MCLEAN 1.5 S VA US 1753 -0.037700 0.164200
## 10 USC00448737 VIENNA VA US 211 -0.037730 0.113070
## 11 US1VAFX0038 MCLEAN 1.5 S VA US 171 -0.038000 0.167000
## 12 US1VAFX0072 OAK HILL 0.3 WSW VA US 198 -0.042600 -0.067700
## 13 US1VAFX0053 MCLEAN 2.4 SSE VA US 1673 -0.046000 0.191500
## 14 US1VAFX0051 VIENNA 1.3 W VA US 1881 -0.051400 0.062800
## 15 US1VAFX0043 FAIRFAX 7.0 NW VA US 1399 -0.053600 -0.064900
## 16 USC00448737 VIENNA VA US 16730 -0.054400 0.129730
## 17 USC00448737 VIENNA VA US 3100 -0.054400 0.080100
## 18 USC00448737 VIENNA VA US 3003 -0.054400 0.080010
## 19 US1VAFX0060 OAKTON 0.4 E VA US 684 -0.062200 0.053200
## 20 US1VAFX0061 VIENNA 0.8 S VA US 645 -0.065800 0.083500
## 21 US1VAFX0064 FALLS CHURCH 1.6 W VA US 744 -0.066900 0.142300
## 22 USC00442922 FALLS CHURCH 2 SW VA US 3132 -0.071070 0.163070
## 23 USC00448737 VIENNA VA US 1405 -0.071070 0.096400
## 24 USC00441570 CHANTILLY VA US 2190 -0.071070 -0.086930
## 25 USC00440390 BACON NR FAIRFAX VA US 221 -0.087730 -0.003600
## 26 US1VAFX0054 FAIRFAX 4.1 W VA US 482 -0.089800 -0.026800
## 27 US1VAFX0003 LAKE BARCROFT 0.5 NE VA US 160 -0.096900 0.193900
## 28 US1VAFX0005 FAIRFAX 2.3 W VA US 1916 -0.097200 0.005500
## 29 US1VAFX0059 FAIRFAX 2.2 E VA US 104 -0.103300 0.088300
## 30 USC00440403 BAILEYS CROSSROADS VA US 486 -0.104400 0.213070
## 31 USC00442922 FALLS CHURCH 2 SW VA US 4385 -0.104400 0.146400
## 32 USC00442890 FAIRFAX VA US 1794 -0.104400 0.046400
## 33 USC00445218 MANION NR FAIRFAX VA US 224 -0.104400 -0.003600
## 34 US1VAFX0032 CENTREVILLE 1.0 W VA US 204 -0.105000 -0.114900
## 35 US1VAFX0046 FAIRFAX 2.6 ESE VA US 124 -0.110500 0.095500
## 36 US1VAFX0002 LAKE BARCROFT 0.8 SSW VA US 891 -0.113700 0.180500
## 37 USC00448266 SWETNAM VA US 808 -0.121067 0.013067
## 38 USC00442890 FAIRFAX VA US 3461 -0.121070 0.029730
## 39 US1VAFX0037 MANTUA 1.4 S VA US 1649 -0.122000 0.089500
## 40 USC00440216 ANNANDALE VA US 1338 -0.137730 0.146400
## 41 USC00440090 ALEXANDRIA CITY GARA VA US 6103 -0.154400 0.263070
## 42 US1VAFX0047 BURKE 1.4 NE VA US 449 -0.158800 0.090000
## 43 US1VAALC003 ALEXANDRIA 1.8 S VA US 515 -0.159200 0.261500
## 44 US1VAFX0024 WEST SPRINGFIELD 0.7 W VA US 124 -0.166700 0.100300
## 45 USC00443635 GROVETON VA US 6846 -0.187730 0.246400
## 46 US1VAFX0033 FRANCONIA 1.3 SSE VA US 3621 -0.208200 0.207800
## 47 US1VAFX0049 ALEXANDRIA 4.6 S VA US 607 -0.215900 0.291700
## 48 US1VAFX0063 ALEXANDRIA 5.6 SSW VA US 850 -0.223300 0.257400
## 49 US1VAFX0031 MOUNT VERNON 0.5 NE VA US 1927 -0.230400 0.244100
## 50 USW00093728 DAVISON ARMY AIR FIELD VA US 5080 -0.237730 0.163070
## 51 US1VAFX0050 LORTON 1.2 NE VA US 1785 -0.241400 0.120100
## 52 USC00440216 ANNANDALE VA US 562 NA NA
## 53 USC00440403 BAILEYS CROSSROADS VA US 1 NA NA
## 54 USC00442922 FALLS CHURCH 2 SW VA US 31 NA NA
## 55 USC00448266 SWETNAM VA US 1 NA NA
## 56 USC00448737 VIENNA VA US 3682 NA NA
The interesting thing about this data is how few observations there are from some of the stations. Dranesville, which is about 2 miles from me, has 1,064 readings, or less than 3 years' worth. Since the one web site mentioned Vienna as my closest station and it also has the most data, I focused on that station. The Vienna station at row 16 has the most observations at 16,730. I found a few stations with Vienna in the name, and a few Vienna stations with the same ID but different location data. I decided to pull these out and look more closely.
I looked at the most recent data, for the first week of December 2015, and found four data sources for each day. However, only station USC00448737, or 37, had a value for TMIN; US1VAFX0061, US1VAFX0040, and US1VAFX0051 all had -9999 for TMIN (they did have values for precipitation). After spot-checking and seeing the same pattern throughout the data set, I decided to work with just station 37.
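Something like the following reproduces that spot check (a sketch, written after the fact, run before any filtering of the weather dataframe):
# Compare TMIN across the four Vienna-area sources for Dec 1-7, 2015
dec <- subset(weather,
DATE >= 20151201 & DATE <= 20151207 & substr(STATION_NAME, 1, 6) == "VIENNA",
select = c(STATION, DATE, PRCP, TMIN))
dec[order(dec$DATE), ]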
Vienna Weather Stations
# Keep only the stations whose names start with VIENNA
weather <- subset(weather, subset = (substr(weather$STATION_NAME, 1, 6) == "VIENNA"))
weather <- subset(weather, select = c(STATION, DATE, PRCP, SNOW, SNWD, TMAX, TMIN))
tbl_df(weather)
## Source: local data frame [32,583 x 7]
##
## STATION DATE PRCP SNOW SNWD TMAX TMIN
## (fctr) (int) (int) (int) (int) (int) (int)
## 1 GHCND:US1VAFX0061 20130420 305 -9999 -9999 -9999 -9999
## 2 GHCND:US1VAFX0061 20130425 13 -9999 -9999 -9999 -9999
## 3 GHCND:US1VAFX0061 20130427 0 0 -9999 -9999 -9999
## 4 GHCND:US1VAFX0061 20130429 30 -9999 -9999 -9999 -9999
## 5 GHCND:US1VAFX0061 20130430 53 -9999 -9999 -9999 -9999
## 6 GHCND:US1VAFX0061 20130501 23 -9999 -9999 -9999 -9999
## 7 GHCND:US1VAFX0061 20130502 0 0 -9999 -9999 -9999
## 8 GHCND:US1VAFX0061 20130503 0 0 -9999 -9999 -9999
## 9 GHCND:US1VAFX0061 20130513 0 0 -9999 -9999 -9999
## 10 GHCND:US1VAFX0061 20130514 0 0 -9999 -9999 -9999
## .. ... ... ... ... ... ... ...
# Keep only station 37 and put its observations in date order
weather <- subset(weather, subset = (STATION == "GHCND:USC00448737"))
weather <- weather[order(weather$DATE), ]
tbl_df(weather)
## Source: local data frame [28,131 x 7]
##
## STATION DATE PRCP SNOW SNWD TMAX TMIN
## (fctr) (int) (int) (int) (int) (int) (int)
## 1 GHCND:USC00448737 19250401 0 -9999 -9999 -9999 -9999
## 2 GHCND:USC00448737 19250402 5 -9999 -9999 -9999 -9999
## 3 GHCND:USC00448737 19250403 0 -9999 -9999 -9999 -9999
## 4 GHCND:USC00448737 19250404 0 -9999 -9999 -9999 -9999
## 5 GHCND:USC00448737 19250405 0 -9999 -9999 -9999 -9999
## 6 GHCND:USC00448737 19250406 0 -9999 -9999 -9999 -9999
## 7 GHCND:USC00448737 19250407 0 -9999 -9999 -9999 -9999
## 8 GHCND:USC00448737 19250408 0 -9999 -9999 -9999 -9999
## 9 GHCND:USC00448737 19250409 0 -9999 -9999 -9999 -9999
## 10 GHCND:USC00448737 19250410 0 -9999 -9999 -9999 -9999
## .. ... ... ... ... ... ... ...
Searching through the Station 37 data, I found that values for TMIN started being recorded on December 1, 1995. Before that date all the values for TMIN are recorded as -9999; after it there are regular recorded values and the occasional -9999. I decided to take the subset from 12/1/1995 on as my scrubbed data set, though I was disappointed not to have more data to analyze. I took one last look to ensure I did not pick up any duplicate dates or two observations on the same day (the number of observations and the number of unique dates are the same). And I replaced all the -9999 values in the core five with NA, because -9999 would otherwise read as a very cold temperature. I went from over 100,000 observations down to just over 7,000 (not my hoped-for 30 years).
Scrubbed Final (small) Dataset
weather <- subset(weather, subset = (DATE >= 19951201)) # DATE is stored as an integer like 19951201
tbl_df(weather)
## Source: local data frame [7,284 x 7]
##
## STATION DATE PRCP SNOW SNWD TMAX TMIN
## (fctr) (int) (int) (int) (int) (int) (int)
## 1 GHCND:USC00448737 19951201 0 0 0 61 -11
## 2 GHCND:USC00448737 19951202 0 0 0 178 39
## 3 GHCND:USC00448737 19951203 0 0 0 94 -6
## 4 GHCND:USC00448737 19951204 0 0 0 172 50
## 5 GHCND:USC00448737 19951205 0 0 0 172 -22
## 6 GHCND:USC00448737 19951206 0 0 0 67 11
## 7 GHCND:USC00448737 19951207 0 0 0 39 6
## 8 GHCND:USC00448737 19951208 0 0 0 61 -61
## 9 GHCND:USC00448737 19951209 -9999 102 51 39 -28
## 10 GHCND:USC00448737 19951210 0 0 -9999 0 -100
## .. ... ... ... ... ... ... ...
sapply(weather, function(x) length(unique(x)))
## STATION DATE PRCP SNOW SNWD TMAX TMIN
## 1 7284 202 34 30 88 80
weather$PRCP[weather$PRCP == -9999] <- NA
weather$SNOW[weather$SNOW == -9999] <- NA
weather$SNWD[weather$SNWD == -9999] <- NA
weather$TMAX[weather$TMAX == -9999] <- NA
weather$TMIN[weather$TMIN == -9999] <- NA
tbl_df(weather)
## Source: local data frame [7,284 x 7]
##
## STATION DATE PRCP SNOW SNWD TMAX TMIN
## (fctr) (int) (int) (int) (int) (int) (int)
## 1 GHCND:USC00448737 19951201 0 0 0 61 -11
## 2 GHCND:USC00448737 19951202 0 0 0 178 39
## 3 GHCND:USC00448737 19951203 0 0 0 94 -6
## 4 GHCND:USC00448737 19951204 0 0 0 172 50
## 5 GHCND:USC00448737 19951205 0 0 0 172 -22
## 6 GHCND:USC00448737 19951206 0 0 0 67 11
## 7 GHCND:USC00448737 19951207 0 0 0 39 6
## 8 GHCND:USC00448737 19951208 0 0 0 61 -61
## 9 GHCND:USC00448737 19951209 NA 102 51 39 -28
## 10 GHCND:USC00448737 19951210 0 0 NA 0 -100
## .. ... ... ... ... ... ... ...
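As an aside, the five replacement lines above could be collapsed into one. A sketch using base R indexing on the core columns:
# Replace the -9999 sentinel with NA across all five core columns at once
core <- c("PRCP", "SNOW", "SNWD", "TMAX", "TMIN")
weather[core][weather[core] == -9999] <- NA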
We learned above that Freeze Days are days where the temperature is 32°F or lower, which is zero degrees Celsius. We also learned that we are more interested in Frost Days, days where the temperature is 36°F or lower. We need to convert 36°F to Celsius for a target temperature: (36 - 32) * 5/9 = 2.22 degrees C. Since TMIN is measured in tenths of degrees Celsius, we are looking for TMIN < 22. I added a column to the dataframe set to TRUE when TMIN < 22, and I summarized counts by month and year. Frost Days below shows that out of 7,284 observations, 121 are NA (no value for TMIN), 2,391 are frost days, and the rest have a TMIN at or above 2.2 degrees Celsius.
Frost Days
Frost <- (weather$TMIN < 22) # TRUE when the daily low fell below 2.2 C (22 tenths)
Year <- substr(weather$DATE, 1, 4)
Month <- substr(weather$DATE, 5, 6)
weather <- cbind(weather, Frost, Year, Month)
count(weather, Frost)
## Source: local data frame [3 x 2]
##
## Frost n
## (lgl) (int)
## 1 FALSE 4772
## 2 TRUE 2391
## 3 NA 121
Using the Year and Month columns for convenience, we see the years 1995 to 1998 below. The 1995 data adds up to 31 observations, which is correct since we only have December. The year 1996 has 121 frost days + 230 warmer days + 15 NAs, which equals 366 (1996 was a leap year). The year 1997 totals 365 with no NAs, and so on. The breakout by month shows similar counts and lets us examine April and May, the months that matter for me.
Years & Months with Frost Days
FrostYears <- count(weather, Year, Frost, sort = TRUE)
FrostYears
## Source: local data frame [52 x 3]
## Groups: Year [21]
##
## Year Frost n
## (fctr) (lgl) (int)
## 1 1995 TRUE 28
## 2 1995 FALSE 3
## 3 1996 FALSE 230
## 4 1996 TRUE 121
## 5 1996 NA 15
## 6 1997 FALSE 242
## 7 1997 TRUE 123
## 8 1998 FALSE 251
## 9 1998 TRUE 104
## 10 1998 NA 10
## .. ... ... ...
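Those yearly totals are easy to verify in one line (a quick sketch):
# Each year's TRUE + FALSE + NA counts should sum to the days observed that year
aggregate(n ~ Year, data = FrostYears, sum)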
FrostMonths <- count(weather, Year, Month, Frost, sort = TRUE)
FrostMonths
## Source: local data frame [426 x 4]
## Groups: Year, Month [241]
##
## Year Month Frost n
## (fctr) (fctr) (lgl) (int)
## 1 1995 12 TRUE 28
## 2 1995 12 FALSE 3
## 3 1996 01 TRUE 28
## 4 1996 01 FALSE 3
## 5 1996 02 TRUE 21
## 6 1996 02 FALSE 6
## 7 1996 02 NA 2
## 8 1996 03 TRUE 22
## 9 1996 03 FALSE 9
## 10 1996 04 FALSE 20
## .. ... ... ... ...
We also want to examine dates to see how often we had a Frost Day after April 15, and within that set, how often they came after May 11 (my two target dates). To get the after-April 15 frost days I looked for frost from April 16 to June 30 of each year (>0415 & <0701), and I did the same thing to get the frost days after May 11. I stored both flags in their own columns in the weather dataset for convenience. We can count the after-April 15 frost days the same way we counted all the frost days, and if we subset where April is TRUE we get the number of times there was a frost day after April 15 in a year. Same thing for May 11.
After April 15 Frost Days
April <- (weather$TMIN < 22) & (substr(weather$DATE, 5, 8) > '0415') & (substr(weather$DATE, 5, 8) < '0701')
May <- (weather$TMIN < 22) & (substr(weather$DATE, 5, 8) > '0511') & (substr(weather$DATE, 5, 8) < '0701')
weather <- cbind(weather, April, May)
April15 <- count(weather, Year, April, sort = TRUE)
April15 <- subset(April15, subset = April)
April15
## Source: local data frame [11 x 3]
## Groups: Year [11]
##
## Year April n
## (fctr) (lgl) (int)
## 1 1996 TRUE 2
## 2 1997 TRUE 2
## 3 1999 TRUE 2
## 4 2001 TRUE 4
## 5 2002 TRUE 1
## 6 2004 TRUE 3
## 7 2005 TRUE 3
## 8 2008 TRUE 2
## 9 2010 TRUE 1
## 10 2013 TRUE 5
## 11 2014 TRUE 6
After May 11 Frost Days
May11 <- count(weather, Year, May, sort = TRUE)
May11 <- subset(May11, subset = May)
May11
## Source: local data frame [2 x 3]
## Groups: Year [2]
##
## Year May n
## (fctr) (lgl) (int)
## 1 1996 TRUE 1
## 2 2013 TRUE 2
We can see above that we had 11 years with frost days after April 15. With 20 years of data, that is more than half the years, or 55%. Not great odds for betting the crop. We can also see that the trend is not towards fewer frost days after April 15, because the highest counts we have are for 2013 and 2014. May fares better, with only two years having frost days after May 11, or 10%.
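Those percentages come straight from the row counts (a quick sketch; at this point each table holds one row per year that had a post-deadline frost):
# Share of the 20 springs (1996-2015) with at least one late frost
nrow(April15) / 20 # 11/20 = 0.55
nrow(May11) / 20 # 2/20 = 0.10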
To help visualize this, I plotted the frost days after April 15 by year, adding zeros for the years without frost days so the plot covers every year.
After April 15 Frost Days
# Append a zero-count row for each year that had no post-April 15 frost days
for (y in c('1998', '2000', '2003', '2006', '2007', '2009', '2011', '2012', '2015')) {
April15 <- rbind(April15, c(y, FALSE, 0))
}
April15$Year <- as.character(April15$Year)
April15 <- April15[order(April15$Year), ]
plot(April15$Year, April15$n, type = "l", col = "red", main = "Frost Days After April 15th", xlab = "Year", ylab = "Days")
I thought it might be interesting to look at the average TMIN and TMAX by year to see if any trends show up. I pulled all the April data out of weather and then aggregated by year, taking the mean of TMAX and TMIN. The plot of TMIN follows; to me it looks more variable than the frost days. It would be easy to use this same method on other months or a whole year, but I am not seeing the value here.
AprilTemps <- subset(weather, Month == "04")
AprilTemps <- subset(AprilTemps, select = c(DATE, Year, TMAX, TMIN))
AprilAvg <- aggregate(AprilTemps[, 3:4], list(AprilTemps$Year), mean, na.rm = TRUE)
plot(AprilAvg$Group.1, AprilAvg$TMIN, type = "l", col = "green", main = "Mean TMIN in April", xlab = "Year", ylab = "Celsius")
My only modeling experience was our spam-filter modeling in class. Perhaps that is why I could only think of a model that predicts whether there will be a frost after April 15 given the data for the previous months. Each observation for my model spans the “non-growing” season for my area (September to March) and has, for each month, the number of frost days, the average TMAX and TMIN, the total snowfall and snow depth, and the total precipitation.
We would add a logical TRUE-FALSE column for frost days after April 15 for that season. The thing that seems different from our spam example is that we can also include the number of frost days after April 15 in our observation (it’s not just spam, it’s spam 6 times). I understood that the algorithms we used did not understand the data in the columns, so I guess the number of post-April 15 frost days can just be another column. My model dataframe, built below, has one row per season: columns are named by month initial (S for September through M for March) plus a variable code (PR = total precipitation, SN = total snowfall, SD = total snow depth, TX = mean TMAX, TN = mean TMIN), followed by the frost-day count (FDC) and the after-April 15 logical (AA15).
With my twenty years of data that only gives 19 observations to train a model, which seems like too low a number. I wrote the loop below to create the model dataset and gave it a try. I also wanted a per-month frost-day count (an XFD column for each month X in a season), but did not get that working in time.
Frost Model Dataframe
for (i in 1996:2014) {
# Create season label, e.g. 1996-1997
season <- paste(as.character(i), as.character(i + 1), sep = "-")
# Pull out the data for this season
season.df <- subset(weather, subset = (weather$DATE >= paste(as.character(i), "0901", sep = "") & weather$DATE <= paste(as.character(i + 1), "0331", sep = "")))
# Drop the columns we don't need
season.df <- subset(season.df, select = (c(Month, PRCP, SNOW, SNWD, TMAX, TMIN)))
# Find the average daily max and min temps
Avg <- aggregate(season.df[, 5:6], list(season.df$Month), mean, na.rm = TRUE)
# Find the total rain, snow, and snow depth
Tot <- aggregate(season.df[, 2:4], list(season.df$Month), sum, na.rm = TRUE)
# Make a season row for the model dataframe. Avg and Tot rows sort by month
# level as Jan, Feb, Mar, Sep, Oct, Nov, Dec; Tot columns 2:4 are PRCP, SNOW,
# SNWD and Avg columns 2:3 are TMAX, TMIN
newRow <- data.frame(Season = season,
SPR = Tot[4, 2], SSN = Tot[4, 3], SSD = Tot[4, 4], STX = round(Avg[4, 2], 0), STN = round(Avg[4, 3], 0),
OPR = Tot[5, 2], OSN = Tot[5, 3], OSD = Tot[5, 4], OTX = round(Avg[5, 2], 0), OTN = round(Avg[5, 3], 0),
NPR = Tot[6, 2], NSN = Tot[6, 3], NSD = Tot[6, 4], NTX = round(Avg[6, 2], 0), NTN = round(Avg[6, 3], 0),
DPR = Tot[7, 2], DSN = Tot[7, 3], DSD = Tot[7, 4], DTX = round(Avg[7, 2], 0), DTN = round(Avg[7, 3], 0),
JPR = Tot[1, 2], JSN = Tot[1, 3], JSD = Tot[1, 4], JTX = round(Avg[1, 2], 0), JTN = round(Avg[1, 3], 0),
FPR = Tot[2, 2], FSN = Tot[2, 3], FSD = Tot[2, 4], FTX = round(Avg[2, 2], 0), FTN = round(Avg[2, 3], 0),
MPR = Tot[3, 2], MSN = Tot[3, 3], MSD = Tot[3, 4], MTX = round(Avg[3, 2], 0), MTN = round(Avg[3, 3], 0),
FDC = 0, AA15 = FALSE)
if (i == 1996) {
FrostModel <- newRow
} else {
FrostModel <- rbind(FrostModel, newRow)
}
}
# Number of frost days after April 15 during each season; April15 is in year
# order, so row i + 1 of April15 is the spring that ends season i. The last
# season (2014-2015) keeps its defaults of 0 and FALSE, since 2015 had no
# frost days after April 15.
for (i in 1:18) {
FrostModel[i, 37] <- April15[i + 1, 3] # FDC
FrostModel[i, 38] <- April15[i + 1, 2] # AA15
}
# Restore the column types after the character assignments
FrostModel$FDC <- as.numeric(FrostModel$FDC)
FrostModel$AA15 <- as.logical(FrostModel$AA15)
FrostModel
## Season SPR SSN SSD STX STN OPR OSN OSD OTX OTN NPR NSN NSD NTX
## 1 1996-1997 2385 0 0 238 146 1334 0 0 180 74 789 8 0 100
## 2 1997-1998 1277 0 0 235 133 792 0 0 189 74 1536 0 0 104
## 3 1998-1999 515 0 0 268 158 310 0 0 189 85 188 0 0 148
## 4 1999-2000 2832 0 0 239 143 543 0 0 171 63 537 0 0 173
## 5 2000-2001 1298 0 0 227 134 10 0 0 195 77 196 0 0 120
## 6 2001-2002 577 0 0 230 127 240 0 0 190 66 216 0 0 171
## 7 2002-2003 628 0 0 250 145 1447 0 0 166 83 1263 0 0 109
## 8 2003-2004 1804 0 0 236 148 1082 0 0 167 63 1378 0 0 163
## 9 2004-2005 1496 0 0 241 152 239 3 0 179 78 1144 0 0 151
## 10 2005-2006 49 0 0 267 150 2296 0 0 178 96 574 0 0 147
## 11 2006-2007 1491 0 0 217 135 1154 0 0 171 69 1655 0 0 138
## 12 2007-2008 446 0 0 258 141 934 0 0 219 112 343 0 0 127
## 13 2008-2009 832 0 0 244 148 506 0 0 171 62 395 0 0 118
## 14 2009-2010 439 0 0 229 147 1100 0 0 171 73 1154 0 0 137
## 15 2010-2011 370 0 0 267 143 1357 0 0 198 86 311 0 0 135
## 16 2011-2012 2353 0 0 245 156 1040 0 0 180 74 469 0 0 158
## 17 2012-2013 584 0 0 247 131 837 0 0 178 72 147 0 0 93
## 18 2013-2014 406 0 0 241 112 1253 0 0 185 78 586 0 0 100
## 19 2014-2015 321 0 0 283 176 746 0 0 232 125 530 25 25 142
## NTN DPR DSN DSD DTX DTN JPR JSN JSD JTX JTN FPR FSN FSD FTX FTN
## 1 -4 1503 0 0 78 -4 618 76 101 59 -47 564 217 407 97 97
## 2 20 603 228 228 81 -13 1556 0 51 90 3 1819 0 0 95 95
## 3 14 511 0 0 107 12 1470 96 228 76 -32 724 13 25 87 87
## 4 49 765 0 0 88 -12 810 486 1881 55 -47 594 21 1295 101 101
## 5 7 559 203 203 29 -61 738 87 203 46 -49 536 76 76 89 89
## 6 39 452 0 0 107 4 404 76 76 88 -13 128 0 0 112 112
## 7 17 1032 735 735 53 -34 648 198 380 21 -62 1601 1003 5032 29 29
## 8 46 1483 560 560 70 -23 542 160 737 24 -68 610 21 635 63 63
## 9 35 760 0 0 80 -24 826 216 406 70 -30 404 183 405 83 83
## 10 26 744 404 404 51 -44 739 0 0 100 -6 590 330 890 90 90
## 11 32 411 0 0 124 -3 691 58 76 85 -22 716 275 786 40 40
## 12 21 918 178 178 72 -9 383 76 176 69 -33 728 0 0 85 85
## 13 16 841 0 0 89 -21 773 76 203 26 -64 87 26 50 87 87
## 14 42 1733 2211 2211 60 -35 676 216 253 46 -46 1042 1036 9221 31 31
## 15 11 441 101 101 32 -42 525 280 659 29 -55 491 23 280 89 89
## 16 33 862 0 0 100 3 710 25 75 73 -27 247 18 0 99 99
## 17 -10 799 0 0 84 -3 679 25 50 68 -34 347 10 0 49 49
## 18 -9 1004 151 151 75 -13 632 253 633 19 -83 841 431 1779 51 51
## 19 36 618 0 0 84 -4 713 167 483 31 -54 430 339 1296 14 14
## MPR MSN MSD MTX MTN FDC AA15
## 1 1230 0 0 128 128 2 TRUE
## 2 1554 0 0 125 125 0 FALSE
## 3 1050 343 660 106 106 2 TRUE
## 4 851 0 0 155 155 0 FALSE
## 5 1416 0 0 102 102 4 TRUE
## 6 945 0 0 126 126 1 TRUE
## 7 1049 25 915 125 125 0 FALSE
## 8 565 0 0 138 138 3 TRUE
## 9 1361 170 228 94 94 3 TRUE
## 10 34 5 0 130 130 0 FALSE
## 11 785 96 102 136 136 0 FALSE
## 12 777 0 0 134 134 2 TRUE
## 13 599 191 457 115 115 0 FALSE
## 14 493 0 51 140 140 1 TRUE
## 15 1215 8 0 122 122 0 FALSE
## 16 506 0 0 174 174 0 FALSE
## 17 783 191 50 75 75 5 TRUE
## 18 957 368 660 82 82 6 TRUE
## 19 1044 181 838 97 97 0 FALSE
str(FrostModel)
## 'data.frame': 19 obs. of 38 variables:
## $ Season: Factor w/ 19 levels "1996-1997","1997-1998",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ SPR : int 2385 1277 515 2832 1298 577 628 1804 1496 49 ...
## $ SSN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SSD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ STX : num 238 235 268 239 227 230 250 236 241 267 ...
## $ STN : num 146 133 158 143 134 127 145 148 152 150 ...
## $ OPR : int 1334 792 310 543 10 240 1447 1082 239 2296 ...
## $ OSN : int 0 0 0 0 0 0 0 0 3 0 ...
## $ OSD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ OTX : num 180 189 189 171 195 190 166 167 179 178 ...
## $ OTN : num 74 74 85 63 77 66 83 63 78 96 ...
## $ NPR : int 789 1536 188 537 196 216 1263 1378 1144 574 ...
## $ NSN : int 8 0 0 0 0 0 0 0 0 0 ...
## $ NSD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NTX : num 100 104 148 173 120 171 109 163 151 147 ...
## $ NTN : num -4 20 14 49 7 39 17 46 35 26 ...
## $ DPR : int 1503 603 511 765 559 452 1032 1483 760 744 ...
## $ DSN : int 0 228 0 0 203 0 735 560 0 404 ...
## $ DSD : int 0 228 0 0 203 0 735 560 0 404 ...
## $ DTX : num 78 81 107 88 29 107 53 70 80 51 ...
## $ DTN : num -4 -13 12 -12 -61 4 -34 -23 -24 -44 ...
## $ JPR : int 618 1556 1470 810 738 404 648 542 826 739 ...
## $ JSN : int 76 0 96 486 87 76 198 160 216 0 ...
## $ JSD : int 101 51 228 1881 203 76 380 737 406 0 ...
## $ JTX : num 59 90 76 55 46 88 21 24 70 100 ...
## $ JTN : num -47 3 -32 -47 -49 -13 -62 -68 -30 -6 ...
## $ FPR : int 564 1819 724 594 536 128 1601 610 404 590 ...
## $ FSN : int 217 0 13 21 76 0 1003 21 183 330 ...
## $ FSD : int 407 0 25 1295 76 0 5032 635 405 890 ...
## $ FTX : num 97 95 87 101 89 112 29 63 83 90 ...
## $ FTN : num 97 95 87 101 89 112 29 63 83 90 ...
## $ MPR : int 1230 1554 1050 851 1416 945 1049 565 1361 34 ...
## $ MSN : int 0 0 343 0 0 0 25 0 170 5 ...
## $ MSD : int 0 0 660 0 0 0 915 0 228 0 ...
## $ MTX : num 128 125 106 155 102 126 125 138 94 130 ...
## $ MTN : num 128 125 106 155 102 126 125 138 94 130 ...
## $ FDC : num 2 0 2 0 4 1 0 3 3 0 ...
## $ AA15 : logi TRUE FALSE TRUE FALSE TRUE TRUE ...
I tried running a Random Forest on my small occurrence data set, but I could not get the function to work the way I intended. I am sure I don’t understand the model well enough to set up the data correctly. You will see below that I get a warning about regression that I did not expect. I did change out values for a non-logical field in case that was my problem, but got the same warning.
# For reproducibility; 123 has no particular meaning
set.seed(123)
index <- sample(1:nrow(FrostModel),size = 0.7*nrow(FrostModel))
train <- FrostModel[index,]
test <- FrostModel[-index,]
nrow(train)
## [1] 13
nrow(test)
## [1] 6
# Create a random forest with 500 trees; it runs, but as regression rather
# than the classification I wanted
rf <- randomForest(AA15 ~ SPR + SSN + SSD + STX +STN + OPR + OSN + OSD + OTX + OTN + NPR + NSN + NSD + NTX + NTN + DPR + DSN + DSD + DTX + DTN + JPR + JSN + JSD + JTX + JTN + FPR + FSN + FSD + FTX + FTN + MPR + MSN + MSD + MTX + MTN + FDC, data = train, importance = TRUE, ntree=500)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
# How many trees are needed to reach the minimum error estimate?
# This is a simple problem; it appears that about 71 trees would be enough.
which.min(rf$mse)
## [1] 71
# Using the importance() function to calculate the importance of each variable
imp <- as.data.frame(sort(importance(rf)[,1],decreasing = TRUE),optional = T)
names(imp) <- "% Inc MSE"
imp
## % Inc MSE
## FDC 10.9425184
## MTX 4.1201995
## MTN 4.0126168
## STN 1.9234897
## OTX 1.9025736
## NTN 0.8284135
## MPR 0.3844196
## SSN 0.0000000
## SSD 0.0000000
## OSN 0.0000000
## OSD 0.0000000
## NSN 0.0000000
## NSD 0.0000000
## JSD -0.1561776
## NPR -0.6353092
## OTN -0.6353479
## NTX -0.7417325
## STX -0.8920200
## SPR -1.0010015
## DSD -1.0010015
## FPR -1.0010015
## FTN -1.0010015
## OPR -1.3493956
## JSN -1.3623216
## DPR -1.3745786
## MSD -1.4083490
## DSN -1.4112077
## JTN -1.6370932
## MSN -1.6551301
## FTX -1.6997530
## JTX -1.7144257
## DTN -1.7418139
## JPR -1.8957848
## FSN -1.9915221
## FSD -1.9994604
## DTX -2.0839774
There is not a lot of interpretation to make here. I found out that actually getting a BIG data set is more difficult than it seems. A source may have a lot of data (over 100,000 observations), but when you boil it down to what you can use, it shrinks a lot (to about 7,000 observations).
Based on my simple analysis, planting crops after May 11 is a better idea for my area than planting closer to April 15. There is only a 10% chance of a frost day after May 11, versus over a 50% chance of one after April 15.
I definitely need to learn more about modeling in general and Random Forest in particular. The warning message makes it sound like I was doing regression when I was aiming for classification. The importance of the variables makes sense to me: the number of frost days after April 15 (FDC) was the most important at almost 11%. The next two are the average daily maximum and minimum temperatures for March, at about 4% each. September, October, and November temperatures come up next. Snowfall and rain seemed to be of less importance.
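If I had more time, the first thing I would try is making the response a factor, which is what tells randomForest to classify instead of regress. An untested sketch (whether 13 training rows are enough is another question):
# Classification version: a factor response makes randomForest build a classifier
train$AA15 <- as.factor(train$AA15)
rf2 <- randomForest(AA15 ~ . - Season, data = train, importance = TRUE, ntree = 500)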
I would also try using data closer to April next time, maybe just January, February, and March, to see if I can get closer with less data and less work.
I would like to credit Pedro M. on R-Bloggers for a good example of Random Forest: http://www.r-bloggers.com/part-4a-modelling-predicting-the-amount-of-rain/