Introduction

I started with the idea of looking at the possible correlation between access to locally grown fresh food and better student performance in school, Good-Food Equals Good-Grades, but was stymied by the breadth of data sources. It was going to take me months to define good grades and good food.

Another idea came up while looking at the agriculture sites. Could I use weather data to predict useful information for food crop planting? At the big picture end of the question is climate change moving the best planting days (freeze or frost days)? At the narrow end of the question, when is the earliest I should plant my tomatoes this spring? This is a practical question I can use the answer to and the information looks more focused.

The job for me now is how best to predict the best planting date. My answer is to follow the data science workflow. I really like the simplicity and flow of Hadley Wickham’s Collect –> Analyze –> Communicate, but it is hard to beat a good mnemonic, so I decided to layout the rest of this document using OSEMN (with the hope of making it awesome).

Library Packages Used

library(rvest)          # To read in HTML table of Freeze-Frost Dates
library(dplyr)          # To use the tbl_df and other functions
library(randomForest)   # For attempted modeling

Obtain Data

The most important factor in determining when to plant any vegetable in your garden is the “LAST FROST DATE” in the spring, and the “FIRST FROST DATE” in the fall for your area. These dates for a given area are based on historical weather data from that area collected over a 30 year period and compiled by the National Climatic Data Center from over 5,800 Weather Monitoring Stations throughout the United States.

For each Weather Monitoring Station, a FREEZE DAY is any day in the year that the temperature reaches 32°F or below. A FROST DAY is any day in the year that the temperature reaches 36°F or below. So why worry about FROST DAYS more so than FREEZE DAYS? Well, a Freeze is what will kill many plants. But, Weather Monitoring Stations are typically mounted four to six feet above the ground. During clear, calm, and cold nights the temperature at ground level, where your garden is, can become much colder and even freeze. This is why most gardeners play it safe by using Frost Days as planting guides.

Traditional Dates

I wanted planting dates or Frost Days (dates) to compare to my analysis. My hope was to find some history of the Farmer’s Almanac planting dates over time for comparison. What I found was at least two almanacs (The Old Farmer’s Almanac and Farmers’ Almanac), neither of which seemed to keep old predictions around on their websites. Unlike predicting the winter weather, they all seemed to use the National Climatic Data Center dates for planting (last frost day) based on the USDA Hardiness Zones.

I found the best table form of the data on the Organic Gardening site and reproduce it below. I used R to read in the web tables and make a dataframe to show that I could read in HTML. This table was simple enough I could have copied it manually or maybe even used a simple copy-and-paste and formatted the results.

We should note that my home is in USDA Hardiness Zone 7, which gives the First Frost Date of October 15 and the Last Frost Date of April 15, so April 15 is my target date. We might also want to note that zone 1 has an incredibly short 1-month growing season.

I considered this table and the zones to be the “old school” method. When I found a “New and Improved” method on The How Do Gardner website I expected newer data to give me an earlier date (warming trend). The site reads: “The National Climatic Data Center has recently released an entirely new data set, this one collected between 1981 and 2010.” I entered my zip code into their new Freeze and Frost Day Calculator and discovered two things. My nearest weather monitoring station is named Vienna; and my last Frost Day is May 11 (last Freeze Day is April 27; Only a 10% chance of a Freeze/Frost after these dates).

I assume this is May 11, 2016 and it is almost a month later than the April 15 from the table. I now have a date range to examine and a significant one for a farmer or gardener.

Freeze-Frost Data Table

freeze <- read_html("http://organicgardening.about.com/od/organicgardening101/a/frostdatechart.htm", encoding = "UTF-8")
freeze <- html_nodes(freeze, "table")
freeze <- html_table(freeze[[1]], header = TRUE)
tbl_df(freeze)
## Source: local data frame [11 x 3]
## 
##    USDA Hardiness Zone First Frost Date                  Last Frost Date
##                  (int)            (chr)                            (chr)
## 1                    1        July 15th                        June 15th
## 2                    2      Autust 15th                         May 15th
## 3                    3   September 15th                         May 15th
## 4                    4   September 15th                         May 15th
## 5                    5     October 15th                       April 15th
## 6                    6     October 15th                       April 15th
## 7                    7     October 15th                       April 15th
## 8                    8    November 15th                       March 15th
## 9                    9    December 15th                    February 15th
## 10                  10    December 15th January 31st (sometimes earlier)
## 11                  11        No frost.                        No frost.

Weather Station Data

I requested the “Daily Normals” from all the weather stations in Fairfax County, Virginia from the Global Historical Climatology Network (GHCN) of the The National Climatic Data Center of the National Oceanic and Atmospheric Administration (NOAA) of the United States. I received a CSV file of 100,197 observations and 33 variables.

On first look the data seems to span the dates from September 1, 1950 to December 6, 2015. We can see the variables and sample data below. We see that there are forty-five (45) weather stations in Fairfax County and somehow that gives us forty-four (44) names, forty-two (42) elevations, forty-five (45) latitudes and forty-eight (48) longitudes. This hints at movement of some weather stations over the years.

A lot of data is missing with several fields having the value -9999, which the documentation said indicates missing data or data that has not been received. This should be fine, because what the documentation calls the five core values (PRCP, SNOW, SNWD, TMAX, TMIN) seem to have values and should be all I need.

Fairfax County Weather Data

weather <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/fcwd.csv", header = TRUE, stringsAsFactors = TRUE)
str(weather)
## 'data.frame':    100197 obs. of  33 variables:
##  $ STATION     : Factor w/ 45 levels "GHCND:US1VAALC003",..: 35 35 35 35 35 35 35 35 35 35 ...
##  $ STATION_NAME: Factor w/ 44 levels "ALEXANDRIA 1.8 S VA US",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ ELEVATION   : Factor w/ 42 levels "105.5","106.1",..: 33 33 33 33 33 33 33 33 33 33 ...
##  $ LATITUDE    : Factor w/ 45 levels "38.713","38.71667",..: 19 19 19 19 19 19 19 19 19 19 ...
##  $ LONGITUDE   : Factor w/ 48 levels "-77.0547","-77.08333",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ DATE        : int  19500901 19500902 19500903 19500904 19500905 19500906 19500907 19500908 19500909 19500910 ...
##  $ MDPR        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ MDSF        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ DAPR        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ DASF        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ PRCP        : int  10 46 5 25 0 0 0 0 282 145 ...
##  $ SNWD        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SNOW        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TMAX        : int  306 311 317 283 222 228 250 272 222 306 ...
##  $ TMIN        : int  211 217 217 206 106 94 100 128 167 194 ...
##  $ TOBS        : int  256 278 233 222 156 200 217 211 217 217 ...
##  $ WESD        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WESF        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WDFG        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ PGTM        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WSFG        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT09        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT14        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT07        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT01        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT06        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT05        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT11        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT04        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT16        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT08        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT18        : int  -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
##  $ WT03        : int  -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 1 ...
tbl_df(weather)
## Source: local data frame [100,197 x 33]
## 
##              STATION             STATION_NAME ELEVATION LATITUDE LONGITUDE
##               (fctr)                   (fctr)    (fctr)   (fctr)    (fctr)
## 1  GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## 2  GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## 3  GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## 4  GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## 5  GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## 6  GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## 7  GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## 8  GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## 9  GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## 10 GHCND:USC00440403 BAILEYS CROSSROADS VA US      78.9    38.85 -77.13333
## ..               ...                      ...       ...      ...       ...
## Variables not shown: DATE (int), MDPR (int), MDSF (int), DAPR (int), DASF
##   (int), PRCP (int), SNWD (int), SNOW (int), TMAX (int), TMIN (int), TOBS
##   (int), WESD (int), WESF (int), WDFG (int), PGTM (int), WSFG (int), WT09
##   (int), WT14 (int), WT07 (int), WT01 (int), WT06 (int), WT05 (int), WT11
##   (int), WT04 (int), WT16 (int), WT08 (int), WT18 (int), WT03 (int)

Scrub Data

The documentation for the data set says the five core values are:

The documentation explains that TMAX is the maximum temperature on that date measured in Celsius to tenths (e.g., 306 = 30.6 degrees Celsius). TMIN is the minimum temperature on that day measured in the same units. We are primarily interested in TMIN, but it makes sense to keep all five core values.

We also need the date observed and the data on weather station location (at least until we determine the most complete data closest to me). We can get rid of the other columns and eventually narrow our data set to just seven (7) columns. On the other end, we can zero in on the weather station closest to me, drop the other 44 stations. This gives us the more manageable data set below.

Close & Complete Data

The latitude of Reston, Virginia (where I live) is 38.9544 and the longitude is -77.3464. I decide to compare my location with the weather stations by comparing lat and long and counting observations by station (30 years of data from 1 station ~= 10,958 observations, I would love 2 or 3 times that).

We start by using the Count function from dplyr to get the stations and their coordinates into a table with counts. We then do the lat and long math and add the columns. A difference of one degree in latitude is about 69 miles, which means a tenth would be 6.9 miles and a hundredth .69 miles. There are a lot of weather stations that are close to me. The table below drops lat and long, so the other columns will fit. The n column is the count of observations from that station and latdiff and longdiff measures the distance from Reston, VA.

Weather Stations in Fairfax County

stations <- count(weather, STATION, STATION_NAME, LATITUDE, LONGITUDE)
stations$LATITUDE <- as.character(stations$LATITUDE)
stations$LONGITUDE <- as.character(stations$LONGITUDE)
latdiff <- as.numeric(stations$LATITUDE) - 38.9544
longdiff <- as.numeric(stations$LONGITUDE) + 77.3464
stations <- cbind(stations, latdiff, longdiff)
stations <- arrange(stations, desc(latdiff), desc(longdiff))
stations$STATION <- substr(stations$STATION, 7, 17)             # shorten station ID for fit
subset(stations, select = c(-LATITUDE, -LONGITUDE))
##        STATION                 STATION_NAME     n   latdiff  longdiff
## 1  USC00442491            DRANESVILLE VA US  1064  0.028930 -0.003600
## 2  USC00443929                HERNDON VA US   944  0.012270 -0.036930
## 3  US1VAFX0035        HERNDON 0.7 WSW VA US   453  0.010400 -0.053600
## 4  US1VAFX0040           VIENNA 3.3 N VA US  1926 -0.006900  0.082700
## 5  US1VAFX0057         RESTON 1.8 SSW VA US   565 -0.025000 -0.014400
## 6  US1VAFX0001          HERNDON 3.3 S VA US  3181 -0.032300 -0.033700
## 7  US1VAFX0070        OAK HILL 1.0 NE VA US   239 -0.032500 -0.046600
## 8  US1VAFX0041          MCLEAN 2.3 SE VA US  2475 -0.036400  0.203000
## 9  US1VAFX0052           MCLEAN 1.5 S VA US  1753 -0.037700  0.164200
## 10 USC00448737                 VIENNA VA US   211 -0.037730  0.113070
## 11 US1VAFX0038           MCLEAN 1.5 S VA US   171 -0.038000  0.167000
## 12 US1VAFX0072       OAK HILL 0.3 WSW VA US   198 -0.042600 -0.067700
## 13 US1VAFX0053         MCLEAN 2.4 SSE VA US  1673 -0.046000  0.191500
## 14 US1VAFX0051           VIENNA 1.3 W VA US  1881 -0.051400  0.062800
## 15 US1VAFX0043         FAIRFAX 7.0 NW VA US  1399 -0.053600 -0.064900
## 16 USC00448737                 VIENNA VA US 16730 -0.054400  0.129730
## 17 USC00448737                 VIENNA VA US  3100 -0.054400  0.080100
## 18 USC00448737                 VIENNA VA US  3003 -0.054400  0.080010
## 19 US1VAFX0060           OAKTON 0.4 E VA US   684 -0.062200  0.053200
## 20 US1VAFX0061           VIENNA 0.8 S VA US   645 -0.065800  0.083500
## 21 US1VAFX0064     FALLS CHURCH 1.6 W VA US   744 -0.066900  0.142300
## 22 USC00442922      FALLS CHURCH 2 SW VA US  3132 -0.071070  0.163070
## 23 USC00448737                 VIENNA VA US  1405 -0.071070  0.096400
## 24 USC00441570              CHANTILLY VA US  2190 -0.071070 -0.086930
## 25 USC00440390       BACON NR FAIRFAX VA US   221 -0.087730 -0.003600
## 26 US1VAFX0054          FAIRFAX 4.1 W VA US   482 -0.089800 -0.026800
## 27 US1VAFX0003   LAKE BARCROFT 0.5 NE VA US   160 -0.096900  0.193900
## 28 US1VAFX0005          FAIRFAX 2.3 W VA US  1916 -0.097200  0.005500
## 29 US1VAFX0059          FAIRFAX 2.2 E VA US   104 -0.103300  0.088300
## 30 USC00440403     BAILEYS CROSSROADS VA US   486 -0.104400  0.213070
## 31 USC00442922      FALLS CHURCH 2 SW VA US  4385 -0.104400  0.146400
## 32 USC00442890                FAIRFAX VA US  1794 -0.104400  0.046400
## 33 USC00445218      MANION NR FAIRFAX VA US   224 -0.104400 -0.003600
## 34 US1VAFX0032      CENTREVILLE 1.0 W VA US   204 -0.105000 -0.114900
## 35 US1VAFX0046        FAIRFAX 2.6 ESE VA US   124 -0.110500  0.095500
## 36 US1VAFX0002  LAKE BARCROFT 0.8 SSW VA US   891 -0.113700  0.180500
## 37 USC00448266                SWETNAM VA US   808 -0.121067  0.013067
## 38 USC00442890                FAIRFAX VA US  3461 -0.121070  0.029730
## 39 US1VAFX0037           MANTUA 1.4 S VA US  1649 -0.122000  0.089500
## 40 USC00440216              ANNANDALE VA US  1338 -0.137730  0.146400
## 41 USC00440090   ALEXANDRIA CITY GARA VA US  6103 -0.154400  0.263070
## 42 US1VAFX0047           BURKE 1.4 NE VA US   449 -0.158800  0.090000
## 43 US1VAALC003       ALEXANDRIA 1.8 S VA US   515 -0.159200  0.261500
## 44 US1VAFX0024 WEST SPRINGFIELD 0.7 W VA US   124 -0.166700  0.100300
## 45 USC00443635               GROVETON VA US  6846 -0.187730  0.246400
## 46 US1VAFX0033      FRANCONIA 1.3 SSE VA US  3621 -0.208200  0.207800
## 47 US1VAFX0049       ALEXANDRIA 4.6 S VA US   607 -0.215900  0.291700
## 48 US1VAFX0063     ALEXANDRIA 5.6 SSW VA US   850 -0.223300  0.257400
## 49 US1VAFX0031    MOUNT VERNON 0.5 NE VA US  1927 -0.230400  0.244100
## 50 USW00093728 DAVISON ARMY AIR FIELD VA US  5080 -0.237730  0.163070
## 51 US1VAFX0050          LORTON 1.2 NE VA US  1785 -0.241400  0.120100
## 52 USC00440216              ANNANDALE VA US   562        NA        NA
## 53 USC00440403     BAILEYS CROSSROADS VA US     1        NA        NA
## 54 USC00442922      FALLS CHURCH 2 SW VA US    31        NA        NA
## 55 USC00448266                SWETNAM VA US     1        NA        NA
## 56 USC00448737                 VIENNA VA US  3682        NA        NA

Selected Data

The interesting thing about this data is how few observation there are from some of the stations. Dranesville, which is about 2 miles from me, has 1,064 readings, or less than 3 years worth. Since the one web site mentioned Vienna as my closest station and it also has the most data, I focused on that station. The Vienna station at row 16 has the most observations at 16,730. I found a few locations with Vienna in the name and few Vienna stations with the same ID, but different location data. I decided to pull these out and look more closely.

I looked at the most recent data for the first week of December 2015 and found four data sources for each day, however only station USC00448737, or 37, had a value for TMIN; US1VAFX0061, US1VAFX0040, and US1VAFX0051 all had -9999 for TMIN (they had values for precipitation). After spot checking and seeing the same pattern throughout the data set I decided to work with just station 37.

Vienna Weather Stations

weather <- subset(weather, subset = (substr(weather$STATION_NAME, 1, 6) == "VIENNA"))
weather <- subset(weather, select = c(STATION, DATE, PRCP, SNOW, SNWD, TMAX, TMIN))
tbl_df(weather)
## Source: local data frame [32,583 x 7]
## 
##              STATION     DATE  PRCP  SNOW  SNWD  TMAX  TMIN
##               (fctr)    (int) (int) (int) (int) (int) (int)
## 1  GHCND:US1VAFX0061 20130420   305 -9999 -9999 -9999 -9999
## 2  GHCND:US1VAFX0061 20130425    13 -9999 -9999 -9999 -9999
## 3  GHCND:US1VAFX0061 20130427     0     0 -9999 -9999 -9999
## 4  GHCND:US1VAFX0061 20130429    30 -9999 -9999 -9999 -9999
## 5  GHCND:US1VAFX0061 20130430    53 -9999 -9999 -9999 -9999
## 6  GHCND:US1VAFX0061 20130501    23 -9999 -9999 -9999 -9999
## 7  GHCND:US1VAFX0061 20130502     0     0 -9999 -9999 -9999
## 8  GHCND:US1VAFX0061 20130503     0     0 -9999 -9999 -9999
## 9  GHCND:US1VAFX0061 20130513     0     0 -9999 -9999 -9999
## 10 GHCND:US1VAFX0061 20130514     0     0 -9999 -9999 -9999
## ..               ...      ...   ...   ...   ...   ...   ...
weather <- subset(weather, subset = (STATION == "GHCND:USC00448737"))
weather <- weather[order(weather$DATE), ]
tbl_df(weather)
## Source: local data frame [28,131 x 7]
## 
##              STATION     DATE  PRCP  SNOW  SNWD  TMAX  TMIN
##               (fctr)    (int) (int) (int) (int) (int) (int)
## 1  GHCND:USC00448737 19250401     0 -9999 -9999 -9999 -9999
## 2  GHCND:USC00448737 19250402     5 -9999 -9999 -9999 -9999
## 3  GHCND:USC00448737 19250403     0 -9999 -9999 -9999 -9999
## 4  GHCND:USC00448737 19250404     0 -9999 -9999 -9999 -9999
## 5  GHCND:USC00448737 19250405     0 -9999 -9999 -9999 -9999
## 6  GHCND:USC00448737 19250406     0 -9999 -9999 -9999 -9999
## 7  GHCND:USC00448737 19250407     0 -9999 -9999 -9999 -9999
## 8  GHCND:USC00448737 19250408     0 -9999 -9999 -9999 -9999
## 9  GHCND:USC00448737 19250409     0 -9999 -9999 -9999 -9999
## 10 GHCND:USC00448737 19250410     0 -9999 -9999 -9999 -9999
## ..               ...      ...   ...   ...   ...   ...   ...

After searching through the Station 37 data I found values for TMIN started being recorded on December 1, 1995. Before that date all the values for TMIN are recorded as -9999. After this date there are regular recorded values and the occasional -9999. I decided to take the subset from 12/1/1995 on as my scrubbed data set, but I was disappointment not to have more data to analyze. I took one last look to insure I did not pick up any duplicate dates or two observations on the same day (the number of observations and the number of unique dates are the same). And, I replaced all the -9999 in the core-five with NA, because -9999 would be a cold temperature. I went from over 100,000 observations down to just over 7,000 (not my hoped for 30 years).

Scrubbed Final (small) Dataset

weather <- subset(weather, subset = (DATE >= "19951201"))
tbl_df(weather)
## Source: local data frame [7,284 x 7]
## 
##              STATION     DATE  PRCP  SNOW  SNWD  TMAX  TMIN
##               (fctr)    (int) (int) (int) (int) (int) (int)
## 1  GHCND:USC00448737 19951201     0     0     0    61   -11
## 2  GHCND:USC00448737 19951202     0     0     0   178    39
## 3  GHCND:USC00448737 19951203     0     0     0    94    -6
## 4  GHCND:USC00448737 19951204     0     0     0   172    50
## 5  GHCND:USC00448737 19951205     0     0     0   172   -22
## 6  GHCND:USC00448737 19951206     0     0     0    67    11
## 7  GHCND:USC00448737 19951207     0     0     0    39     6
## 8  GHCND:USC00448737 19951208     0     0     0    61   -61
## 9  GHCND:USC00448737 19951209 -9999   102    51    39   -28
## 10 GHCND:USC00448737 19951210     0     0 -9999     0  -100
## ..               ...      ...   ...   ...   ...   ...   ...
sapply(weather, function(x) length(unique(x)))
## STATION    DATE    PRCP    SNOW    SNWD    TMAX    TMIN 
##       1    7284     202      34      30      88      80
weather$PRCP[weather$PRCP == -9999] <- NA
weather$SNOW[weather$SNOW == -9999] <- NA
weather$SNWD[weather$SNWD == -9999] <- NA
weather$TMAX[weather$TMAX == -9999] <- NA
weather$TMIN[weather$TMIN == -9999] <- NA
tbl_df(weather)
## Source: local data frame [7,284 x 7]
## 
##              STATION     DATE  PRCP  SNOW  SNWD  TMAX  TMIN
##               (fctr)    (int) (int) (int) (int) (int) (int)
## 1  GHCND:USC00448737 19951201     0     0     0    61   -11
## 2  GHCND:USC00448737 19951202     0     0     0   178    39
## 3  GHCND:USC00448737 19951203     0     0     0    94    -6
## 4  GHCND:USC00448737 19951204     0     0     0   172    50
## 5  GHCND:USC00448737 19951205     0     0     0   172   -22
## 6  GHCND:USC00448737 19951206     0     0     0    67    11
## 7  GHCND:USC00448737 19951207     0     0     0    39     6
## 8  GHCND:USC00448737 19951208     0     0     0    61   -61
## 9  GHCND:USC00448737 19951209    NA   102    51    39   -28
## 10 GHCND:USC00448737 19951210     0     0    NA     0  -100
## ..               ...      ...   ...   ...   ...   ...   ...

Explore Data

We learned above that Freeze Days are days where the temperature is 32 degrees F or lower, which is equivalent to zero degree Celsius. We also learned that we are more interested in Frost Days, which are days where the temperature is 36 degrees F or lower. We need to convert 36 F to Celsius for a target temperature, which is (36 - 32) * 5/9 = 2.22 degrees C. Since TMIN is measured in tenths of degrees Celsius, we are looking for TMIN < 22. I added a column to the dataframe with the value set to true when TMIN < 22, and I summarized counts by month and year. Frost Days below shows that out of 7,284 observations, 121 are NA (no value for TMIN), 2,391 are frost-days and the rest have a TMIN > 2.2 degrees Celsius.

Frost Days

Frost <- (weather$TMIN < 22)
Year <- substr(weather$DATE, 1, 4)
Month <- substr(weather$DATE, 5, 6)
weather <- cbind(weather, Frost, Year, Month)
count(weather, Frost)
## Source: local data frame [3 x 2]
## 
##   Frost     n
##   (lgl) (int)
## 1 FALSE  4772
## 2  TRUE  2391
## 3    NA   121

Using the Year and Month columns for convenience we see the years 1995 to 1998 below. The 1995 data adds up to 31 observations, which is correct since we only have December. The year 1996 has 121 frost-days + 230 warmer days + 15 NA’s, which equals 366 (1996 must be a leap year). The year 1997 totals 365 with no NA’s and so on. The break out by month shows similar counts and lets us examine the important months of April and May for me.

Years & Months with Frost Days

FrostYears <- count(weather, Year, Frost, sort = TRUE)
FrostYears
## Source: local data frame [52 x 3]
## Groups: Year [21]
## 
##      Year Frost     n
##    (fctr) (lgl) (int)
## 1    1995  TRUE    28
## 2    1995 FALSE     3
## 3    1996 FALSE   230
## 4    1996  TRUE   121
## 5    1996    NA    15
## 6    1997 FALSE   242
## 7    1997  TRUE   123
## 8    1998 FALSE   251
## 9    1998  TRUE   104
## 10   1998    NA    10
## ..    ...   ...   ...
FrostMonths <- count(weather, Year, Month, Frost, sort = TRUE)
FrostMonths
## Source: local data frame [426 x 4]
## Groups: Year, Month [241]
## 
##      Year  Month Frost     n
##    (fctr) (fctr) (lgl) (int)
## 1    1995     12  TRUE    28
## 2    1995     12 FALSE     3
## 3    1996     01  TRUE    28
## 4    1996     01 FALSE     3
## 5    1996     02  TRUE    21
## 6    1996     02 FALSE     6
## 7    1996     02    NA     2
## 8    1996     03  TRUE    22
## 9    1996     03 FALSE     9
## 10   1996     04 FALSE    20
## ..    ...    ...   ...   ...

We also want to examine dates to see how often we had a Frost Day after April 15, and within that set how often they came after May 11 (my two target dates). To get the after April 15 Frost-Days I looked for frost from April 16 to June 30 of each year (>0415 & <0701). I did the same thing to get the Frost-Days after May 11. I stored both the frost-days after April 15 and May 11 in their own columns in the weather dataset for convenience. We can count the after April 15 frost-days the same way we did all the frost-days and if we subset when April is True we get the number of time there was a frost day after April 15 in a year. Same thing for May 11.

After April 15 Frost Days

April <- (weather$TMIN < 22) & (substr(weather$DATE, 5, 8) > '0415') & (substr(weather$DATE, 5, 8) < '0701')
May <- (weather$TMIN < 22) & (substr(weather$DATE, 5, 8) > '0511') & (substr(weather$DATE, 5, 8) < '0701')
weather <- cbind(weather, April, May)
April15 <- count(weather, Year, April, sort = TRUE)
April15 <- subset(April15, subset = April)
April15
## Source: local data frame [11 x 3]
## Groups: Year [11]
## 
##      Year April     n
##    (fctr) (lgl) (int)
## 1    1996  TRUE     2
## 2    1997  TRUE     2
## 3    1999  TRUE     2
## 4    2001  TRUE     4
## 5    2002  TRUE     1
## 6    2004  TRUE     3
## 7    2005  TRUE     3
## 8    2008  TRUE     2
## 9    2010  TRUE     1
## 10   2013  TRUE     5
## 11   2014  TRUE     6

After May 11 Frost Days

May11 <- count(weather, Year, May, sort = TRUE)
May11 <- subset(May11, subset = May)
May11
## Source: local data frame [2 x 3]
## Groups: Year [2]
## 
##     Year   May     n
##   (fctr) (lgl) (int)
## 1   1996  TRUE     1
## 2   2013  TRUE     2

We can see above that we had 11 years where there were frost-days after April 15. Now that we are down to 20 years of data that is more than half the years or 55%. Not great odds for betting the crop. We can also see that the trend is not towards less frost-days after April 15, because the highest numbers we have are for 2013 and 2014. May fairs better with only two years where there were frost-days after May 11, or 10%.

To help us visualize this I plotted the frost-days after April 15 by year. I added zeros for the years without frost-days to smooth out the plot.

After April 15 Frost Days

tmp <- c('1998', FALSE, 0)
April15 <- rbind(April15, tmp)
tmp <- c('2000', FALSE, 0)
April15 <- rbind(April15, tmp)
tmp <- c('2003', FALSE, 0)
April15 <- rbind(April15, tmp)
tmp <- c('2006', FALSE, 0)
April15 <- rbind(April15, tmp)
tmp <- c('2007', FALSE, 0)
April15 <- rbind(April15, tmp)
tmp <- c('2009', FALSE, 0)
April15 <- rbind(April15, tmp)
tmp <- c('2011', FALSE, 0)
April15 <- rbind(April15, tmp)
tmp <- c('2012', FALSE, 0)
April15 <- rbind(April15, tmp)
tmp <- c('2015', FALSE, 0)
April15 <- rbind(April15, tmp)
April15$Year <- as.character(April15$Year)
April15 <- April15[order(April15$Year), ]
plot(April15$Year, April15$n, type = "l", col = "red", main = "Frost Days After April 15th", xlab = "Year", ylab = "Days")

I thought it might be interesting to look at the average TMIN and TMAX by year to see if any trends show up. I pull all the April data out of weather and then aggregate by year taking the mean of TMAX and TMIN. The plot of TMIN follows. To me it looked more variable than the frost-days. It would be easy to use this same method on other months or a whole year, but I am not seeing the value here.

AprilTemps <- subset(weather, Month == "04")
AprilTemps <- subset(AprilTemps, select = c(DATE, Year, TMAX, TMIN))
AprilAvg <- aggregate(AprilTemps[, 3:4], list(AprilTemps$Year), mean, na.rm = TRUE)
plot(AprilAvg$Group.1, AprilAvg$TMIN, type = "l", col = "green", main = "Mean TMIN in April", xlab = "Year", ylab = "Celsius")

Model Data

My only modeling experience was our spam filter modeling in class. Perhaps that is why I could only think of a model that predicts whether there will be a frost after April 15 given the data for the previous months. Each observation for my model spans the “non-growing” season for my area (October to March) and has the number of Frost Days for each month, the average TMAX and TMIN for each month, the total snowfall and snow depth, and the total precipitation per month.

Model Structure

We would add a logical True-False column for frost-days after April 15 for that season. The thing that seems different from our spam example is that we could also include the number of Frost-Days after April 15 in our observation (it’s not just spam, but it’s spam 6 times). I understood that the algorithms we used did not understand the data in the columns, so I guess number of post April 15 frost-days could just be another column. My model would look like this:

  • Season = 5 characters, e.g., 96-97
  • SPR = total precipitation (mm) in September = numeric or integer
  • SSN = total snowfall (mm) in September = numeric or integer
  • SSD = total snow depth (mm) in September = numeric or integer
  • STX = mean daily max temperature (tenths of degrees C) = numeric or integer
  • STM = mean daily minimum temperature (tenths of degrees C) = numeric or integer
  • Same six value for October: ODF, OPR, OSN, OSD, OSX, OTM
  • Same six value for November: NDF, NPR, NSN, NSD, NSX, NTM
  • Same six value for December: DDF, DPR, DSN, DSD, DSX, DTM
  • Same six value for January: JDF, JPR, JSN, JSD, JSX, JTM
  • Same six value for February: FDF, FPR, FSN, FSD, FSX, FTM
  • Same six value for March: MDF, MPR, MSN, MSD, MSX, MTM
  • AFD = number of Frost-Days after April 15th this season = numeric or integer
  • Frost = True if minimum temp below 2.2 C after April 15th, else False (& AFD = 0) = logical

With my twenty years of data that would only give us 19 observations to train a model, which seems like to low a number. I wrote a function below to create the model dataset and give it a try. I also wanted to use XFD = # of frost-days in Xember for each month in a season, but did not get that working in time.

Frost Model Dataframe

for (i in 1996:2014) {
        # Create season label, e.g. 1996-1997
        season <- paste(as.character(i), as.character(i + 1), sep = "-")
        # Pull out the data for this season
        season.df <- subset(weather, subset = (weather$DATE >= paste(as.character(i), "0901", sep = "") & weather$DATE <= paste(as.character(i + 1), "0331", sep = "")))
        # Drop the columns we don't need
        season.df <- subset(season.df, select = (c(Month, PRCP, SNOW, SNWD, TMAX, TMIN)))
        # Find the average daily max and min temps
        Avg <- aggregate(season.df[, 5:6], list(season.df$Month), mean, na.rm = TRUE)
        # Find the total rain, snow, and snow depth
        Tot <- aggregate(season.df[, 2:4], list(season.df$Month), sum, na.rm = TRUE)
        # Make a season row for the model dataframe
        newRow <- data.frame(Season = season, SPR = Tot[4, 2], SSN = Tot[4, 3], SSD = Tot[4, 4], STX = round(Avg[4, 2], 0), STN = round(Avg[4, 3], 0), OPR = Tot[5, 2], OSN = Tot[5, 3], OSD = Tot[5, 4], OTX = round(Avg[5, 2], 0), OTN = round(Avg[5, 3], 0), NPR = Tot[6, 2], NSN = Tot[6, 3], NSD = Tot[6, 4], NTX = round(Avg[6, 2], 0), NTN = round(Avg[6, 3], 0), DPR = Tot[7, 2], DSN = Tot[7, 4], DSD = Tot[7, 4], DTX = round(Avg[7, 2], 0), DTN = round(Avg[7, 3], 0), JPR = Tot[1, 2], JSN = Tot[1, 3], JSD = Tot[1, 4], JTX = round(Avg[1, 2], 0), JTN = round(Avg[1, 3], 0), FPR = Tot[2, 2], FSN = Tot[2, 3], FSD = Tot[2, 4], FTX = round(Avg[2, 2], 0), FTN = round(Avg[2, 2], 0), MPR = Tot[3, 2], MSN = Tot[3, 3], MSD = Tot[3, 4], MTX = round(Avg[3, 2], 0), MTN = round(Avg[3, 2], 0), FDC = 0, AA15 = FALSE)
        if (i == 1996) {
                FrostModel <- newRow 
                
        } else {
                FrostModel <- rbind(FrostModel, newRow)
        }
}

for (i in 1:18) {
        # Number of Frost-Days after April 15 during this season
        FrostModel[i, 37] <- April15[i + 1, 3]
        FrostModel[i, 38] <- April15[i + 1, 2]
        FrostModel$FDC <- as.numeric(FrostModel$FDC)
        FrostModel$AA15 <- as.logical(FrostModel$AA15)
}
FrostModel
##       Season  SPR SSN SSD STX STN  OPR OSN OSD OTX OTN  NPR NSN NSD NTX
## 1  1996-1997 2385   0   0 238 146 1334   0   0 180  74  789   8   0 100
## 2  1997-1998 1277   0   0 235 133  792   0   0 189  74 1536   0   0 104
## 3  1998-1999  515   0   0 268 158  310   0   0 189  85  188   0   0 148
## 4  1999-2000 2832   0   0 239 143  543   0   0 171  63  537   0   0 173
## 5  2000-2001 1298   0   0 227 134   10   0   0 195  77  196   0   0 120
## 6  2001-2002  577   0   0 230 127  240   0   0 190  66  216   0   0 171
## 7  2002-2003  628   0   0 250 145 1447   0   0 166  83 1263   0   0 109
## 8  2003-2004 1804   0   0 236 148 1082   0   0 167  63 1378   0   0 163
## 9  2004-2005 1496   0   0 241 152  239   3   0 179  78 1144   0   0 151
## 10 2005-2006   49   0   0 267 150 2296   0   0 178  96  574   0   0 147
## 11 2006-2007 1491   0   0 217 135 1154   0   0 171  69 1655   0   0 138
## 12 2007-2008  446   0   0 258 141  934   0   0 219 112  343   0   0 127
## 13 2008-2009  832   0   0 244 148  506   0   0 171  62  395   0   0 118
## 14 2009-2010  439   0   0 229 147 1100   0   0 171  73 1154   0   0 137
## 15 2010-2011  370   0   0 267 143 1357   0   0 198  86  311   0   0 135
## 16 2011-2012 2353   0   0 245 156 1040   0   0 180  74  469   0   0 158
## 17 2012-2013  584   0   0 247 131  837   0   0 178  72  147   0   0  93
## 18 2013-2014  406   0   0 241 112 1253   0   0 185  78  586   0   0 100
## 19 2014-2015  321   0   0 283 176  746   0   0 232 125  530  25  25 142
##    NTN  DPR  DSN  DSD DTX DTN  JPR JSN  JSD JTX JTN  FPR  FSN  FSD FTX FTN
## 1   -4 1503    0    0  78  -4  618  76  101  59 -47  564  217  407  97  97
## 2   20  603  228  228  81 -13 1556   0   51  90   3 1819    0    0  95  95
## 3   14  511    0    0 107  12 1470  96  228  76 -32  724   13   25  87  87
## 4   49  765    0    0  88 -12  810 486 1881  55 -47  594   21 1295 101 101
## 5    7  559  203  203  29 -61  738  87  203  46 -49  536   76   76  89  89
## 6   39  452    0    0 107   4  404  76   76  88 -13  128    0    0 112 112
## 7   17 1032  735  735  53 -34  648 198  380  21 -62 1601 1003 5032  29  29
## 8   46 1483  560  560  70 -23  542 160  737  24 -68  610   21  635  63  63
## 9   35  760    0    0  80 -24  826 216  406  70 -30  404  183  405  83  83
## 10  26  744  404  404  51 -44  739   0    0 100  -6  590  330  890  90  90
## 11  32  411    0    0 124  -3  691  58   76  85 -22  716  275  786  40  40
## 12  21  918  178  178  72  -9  383  76  176  69 -33  728    0    0  85  85
## 13  16  841    0    0  89 -21  773  76  203  26 -64   87   26   50  87  87
## 14  42 1733 2211 2211  60 -35  676 216  253  46 -46 1042 1036 9221  31  31
## 15  11  441  101  101  32 -42  525 280  659  29 -55  491   23  280  89  89
## 16  33  862    0    0 100   3  710  25   75  73 -27  247   18    0  99  99
## 17 -10  799    0    0  84  -3  679  25   50  68 -34  347   10    0  49  49
## 18  -9 1004  151  151  75 -13  632 253  633  19 -83  841  431 1779  51  51
## 19  36  618    0    0  84  -4  713 167  483  31 -54  430  339 1296  14  14
##     MPR MSN MSD MTX MTN FDC  AA15
## 1  1230   0   0 128 128   2  TRUE
## 2  1554   0   0 125 125   0 FALSE
## 3  1050 343 660 106 106   2  TRUE
## 4   851   0   0 155 155   0 FALSE
## 5  1416   0   0 102 102   4  TRUE
## 6   945   0   0 126 126   1  TRUE
## 7  1049  25 915 125 125   0 FALSE
## 8   565   0   0 138 138   3  TRUE
## 9  1361 170 228  94  94   3  TRUE
## 10   34   5   0 130 130   0 FALSE
## 11  785  96 102 136 136   0 FALSE
## 12  777   0   0 134 134   2  TRUE
## 13  599 191 457 115 115   0 FALSE
## 14  493   0  51 140 140   1  TRUE
## 15 1215   8   0 122 122   0 FALSE
## 16  506   0   0 174 174   0 FALSE
## 17  783 191  50  75  75   5  TRUE
## 18  957 368 660  82  82   6  TRUE
## 19 1044 181 838  97  97   0 FALSE
str(FrostModel)
## 'data.frame':    19 obs. of  38 variables:
##  $ Season: Factor w/ 19 levels "1996-1997","1997-1998",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ SPR   : int  2385 1277 515 2832 1298 577 628 1804 1496 49 ...
##  $ SSN   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SSD   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ STX   : num  238 235 268 239 227 230 250 236 241 267 ...
##  $ STN   : num  146 133 158 143 134 127 145 148 152 150 ...
##  $ OPR   : int  1334 792 310 543 10 240 1447 1082 239 2296 ...
##  $ OSN   : int  0 0 0 0 0 0 0 0 3 0 ...
##  $ OSD   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OTX   : num  180 189 189 171 195 190 166 167 179 178 ...
##  $ OTN   : num  74 74 85 63 77 66 83 63 78 96 ...
##  $ NPR   : int  789 1536 188 537 196 216 1263 1378 1144 574 ...
##  $ NSN   : int  8 0 0 0 0 0 0 0 0 0 ...
##  $ NSD   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NTX   : num  100 104 148 173 120 171 109 163 151 147 ...
##  $ NTN   : num  -4 20 14 49 7 39 17 46 35 26 ...
##  $ DPR   : int  1503 603 511 765 559 452 1032 1483 760 744 ...
##  $ DSN   : int  0 228 0 0 203 0 735 560 0 404 ...
##  $ DSD   : int  0 228 0 0 203 0 735 560 0 404 ...
##  $ DTX   : num  78 81 107 88 29 107 53 70 80 51 ...
##  $ DTN   : num  -4 -13 12 -12 -61 4 -34 -23 -24 -44 ...
##  $ JPR   : int  618 1556 1470 810 738 404 648 542 826 739 ...
##  $ JSN   : int  76 0 96 486 87 76 198 160 216 0 ...
##  $ JSD   : int  101 51 228 1881 203 76 380 737 406 0 ...
##  $ JTX   : num  59 90 76 55 46 88 21 24 70 100 ...
##  $ JTN   : num  -47 3 -32 -47 -49 -13 -62 -68 -30 -6 ...
##  $ FPR   : int  564 1819 724 594 536 128 1601 610 404 590 ...
##  $ FSN   : int  217 0 13 21 76 0 1003 21 183 330 ...
##  $ FSD   : int  407 0 25 1295 76 0 5032 635 405 890 ...
##  $ FTX   : num  97 95 87 101 89 112 29 63 83 90 ...
##  $ FTN   : num  97 95 87 101 89 112 29 63 83 90 ...
##  $ MPR   : int  1230 1554 1050 851 1416 945 1049 565 1361 34 ...
##  $ MSN   : int  0 0 343 0 0 0 25 0 170 5 ...
##  $ MSD   : int  0 0 660 0 0 0 915 0 228 0 ...
##  $ MTX   : num  128 125 106 155 102 126 125 138 94 130 ...
##  $ MTN   : num  128 125 106 155 102 126 125 138 94 130 ...
##  $ FDC   : num  2 0 2 0 4 1 0 3 3 0 ...
##  $ AA15  : logi  TRUE FALSE TRUE FALSE TRUE TRUE ...

Random Forest

I tried running a Random Forest on my small occurrence data set, but I could not get the function to work. I am sure I don’t understand the model well enough to set up the data correctly. You will see below that I get an error about missing objects that I do not understand. I did change out values for a non-logical field in case that was mine problem, but got the same error.

# For reproducibility; 123 has no particular meaning
set.seed(123)
index <- sample(1:nrow(FrostModel),size = 0.7*nrow(FrostModel))
train <- FrostModel[index,]
test <- FrostModel[-index,]
nrow(train)
## [1] 13
nrow(test)
## [1] 6
# Create a random forest with 500 trees; couldn't get this to work
rf <- randomForest(AA15 ~ SPR + SSN + SSD + STX +STN + OPR + OSN + OSD + OTX + OTN + NPR + NSN + NSD + NTX + NTN + DPR + DSN + DSD + DTX + DTN + JPR + JSN + JSD + JTX + JTN + FPR + FSN + FSD + FTX + FTN + MPR + MSN + MSD + MTX + MTN + FDC, data = train, importance = TRUE, ntree=500)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
# How many trees are needed to reach the minimum error estimate? 
# This is a simple problem; it appears that about 71 trees would be enough.
which.min(rf$mse)
## [1] 71
# Using the importance()  function to calculate the importance of each variable
imp <- as.data.frame(sort(importance(rf)[,1],decreasing = TRUE),optional = T)
names(imp) <- "% Inc MSE"
imp
##      % Inc MSE
## FDC 10.9425184
## MTX  4.1201995
## MTN  4.0126168
## STN  1.9234897
## OTX  1.9025736
## NTN  0.8284135
## MPR  0.3844196
## SSN  0.0000000
## SSD  0.0000000
## OSN  0.0000000
## OSD  0.0000000
## NSN  0.0000000
## NSD  0.0000000
## JSD -0.1561776
## NPR -0.6353092
## OTN -0.6353479
## NTX -0.7417325
## STX -0.8920200
## SPR -1.0010015
## DSD -1.0010015
## FPR -1.0010015
## FTN -1.0010015
## OPR -1.3493956
## JSN -1.3623216
## DPR -1.3745786
## MSD -1.4083490
## DSN -1.4112077
## JTN -1.6370932
## MSN -1.6551301
## FTX -1.6997530
## JTX -1.7144257
## DTN -1.7418139
## JPR -1.8957848
## FSN -1.9915221
## FSD -1.9994604
## DTX -2.0839774

Interpret Data

There is not a lot of interpretation to make here. I found out that actually getting a BIG data set is more difficult than it seems. They may have a lot of data (>100,000 observations), but when you boil it down to what you can use it reduces a lot (~7,000 observations).

Based on my simple analysis I discovered that planting crops after May 11 is a better idea for my area than closer to April 15. There is only a 10% chance of a Frost-Day after May 11 and over a 50% chance of one after April 15.

I diffenently need to learn more about modeling in general and Random Forest specially. The warning message makes it sound like I was trying to do regression and I was aiming for classification. The importance of the variables makes sense to me. The number of Frost-Days after April 15 (FDC) was the most important at almost 11%. The next two are the average maximum daily temperature in March and the average daily minimum for March at about 4% each. September, October and Novemeber temperatures come up next. Snow fall and rain seemed to be of less importance.

I would try using data closer to April next time, maybe January, February and March to see if I can get closer with less data and work.

I would like to credit Pedro M. on R-Bloggers for a good example on Radon Forest. http://www.r-bloggers.com/part-4a-modelling-predicting-the-amount-of-rain/