Using SQL inside R for Data Preparation

Using an SQL database as a data source is ideal for working with larger sets of data, but there will be times when you may want to use the SQL language to prepare your data when there is no time to set up and connect to an SQL database.

Never fear! As long as you can get your data quickly into R via either a .csv or Excel file or by connecting to a web resource, you can still use SQL syntax inside R to do data manipulations.

An Example

For the final bridge project this summer I looked at 94 Atlantic hurricanes from 1950 to 2012, but the dataset on hurricanes did not include the categories of each hurricane. Fortunately I found the official Saffir-Simpson Hurricane Wind Scale (SSHWS) table on the website of the National Hurricane Center (part of NOAA) at https://www.nhc.noaa.gov/aboutsshws.phpNOAA. Using SQL I was able to extract from it the category criteria and tie it back to the original hurricane dataset.

Sample Dataset: Atlantic Hurricanes

The Atlantic hurricanes dataset comes from a Git repository of datasets at http://vincentarelbundock.github.io/Rdatasets/. I downloaded a copy of the dataset and put it in my GitHub repository. Let’s import that to R first.

library(tidyverse)

gitURL1 <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/hurricanes.csv"
hurricanes <- read.csv(gitURL1)

glimpse(hurricanes)
## Rows: 94
## Columns: 13
## $ X              <chr> "Easy1950", "King1950", "Able1952", "Barbara1953", "...
## $ Name           <chr> "Easy", "King", "Able", "Barbara", "Florence", "Caro...
## $ Year           <int> 1950, 1950, 1952, 1953, 1953, 1954, 1954, 1954, 1955...
## $ LF.WindsMPH    <int> 120, 130, 85, 85, 85, 120, 120, 145, 120, 85, 120, 1...
## $ LF.PressureMB  <int> 958, 955, 985, 987, 985, 960, 954, 938, 962, 987, 96...
## $ LF.times       <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1...
## $ BaseDamage     <dbl> 3.3000, 28.0000, 2.7500, 1.0000, 0.2000, 460.2275, 4...
## $ NDAM2014       <int> 1870, 6030, 170, 65, 18, 21375, 3520, 28500, 2270, 1...
## $ AffectedStates <chr> "FL", "FL", "SC", "NC", "FL", "NC,NY,CT,RI", "MA,ME"...
## $ firstLF        <chr> "9/4/1950", "10/17/1950", "8/30/1952", "8/13/1953", ...
## $ deaths         <int> 2, 4, 3, 1, 0, 60, 20, 20, 0, 200, 7, 15, 416, 1, 0,...
## $ mf             <chr> "f", "m", "m", "f", "f", "f", "f", "f", "f", "f", "m...
## $ BaseDam2014    <dbl> 32.419419, 275.073859, 24.569434, 8.867416, 1.773483...

Data Wrangling

We can see that the category of each hurricane is not included in the first dataset.

Fortunately the Saffir-Simpsons scale is available from the National Hurricane Center (part of NOAA) at https://www.nhc.noaa.gov/aboutsshws.php. Let’s import this data from a csv file made from the website’s data.

gitURL2 <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/Saffir-SimpsonHurricaneWindScale.csv"
categories <- read.csv(gitURL2)

The ‘categories’ data frame shows wind speeds in three units: miles per hour (mph), kilometers per hour (kp/h) and knots (kt), while the hurricanes dataset only records landfall windspeed in mph. Therefore, before we can merge the category field into the original hurricanes dataset, we need to isolate just the wind speed in units of mph.

Let’s make a subset of the data that has only category and sustained.winds in MPH, and let’s also separate the Min and Max wind speeds into their own respective fields for easier comparisons down the line. In order to do the string manipulations we could use SQL substring and in-string functions to make the process easier, but how do we write SQL code in the R interface?

We can use an R package called sqldf (SQL data frame) to write SQL code against a data frame inside the R interface:

library(sqldf)

catwinds <- sqldf("select [Category],[Sustained.Winds],cast(substr([Sustained.Winds],0,case when instr([Sustained.Winds],'-') = 0 then instr([Sustained.Winds],' ') else instr([Sustained.Winds],'-') end) AS INT) AS MinSpeed, cast(case when instr([Sustained.Winds],'-') = 0 then 999 else substr([Sustained.Winds],instr([Sustained.Winds],'-')+1,instr([Sustained.Winds],' ')-instr([Sustained.Winds],'-')) end AS INT) AS MaxSpeed from categories where [Sustained.Winds] LIKE '%mph%'")

catwinds                   
##   Category   Sustained.Winds MinSpeed MaxSpeed
## 1        1         74-95 mph       74       95
## 2        2        96-110 mph       96      110
## 3        3       111-129 mph      111      129
## 4        4       130-156 mph      130      156
## 5        5 157 mph or higher      157      999

With this new data we can assign a category to each storm in the hurricanes dataset by comparing its LF windspeed to the Min and Max windspeeds in the SSHWS category criteria.

for (i in 1:nrow(hurricanes)){
  hurricanes$cat[i] <- catwinds$Category[hurricanes$LF.WindsMPH[i] >= catwinds$MinSpeed & hurricanes$LF.WindsMPH[i] <= catwinds$MaxSpeed]
}

# let's see what the above script did to the hurricanes table by looking at a subset of it

hurricanescat <-  sqldf("select [Name],[Year],[LF.WindsMPH], [cat] from hurricanes")

hurricanescat
##         Name Year LF.WindsMPH cat
## 1       Easy 1950         120   3
## 2       King 1950         130   4
## 3       Able 1952          85   1
## 4    Barbara 1953          85   1
## 5   Florence 1953          85   1
## 6      Carol 1954         120   3
## 7       Edna 1954         120   3
## 8      Hazel 1954         145   4
## 9     Connie 1955         120   3
## 10     Diane 1955          85   1
## 11      Ione 1955         120   3
## 12    Flossy 1956         105   2
## 13    Audrey 1957         145   4
## 14    Helene 1958         120   3
## 15     Debra 1959          85   1
## 16    Gracie 1959         120   3
## 17     Donna 1960         145   4
## 18     Ethel 1960          85   1
## 19     Carla 1961         145   4
## 20     Cindy 1963          85   1
## 21      Cleo 1964         105   2
## 22      Dora 1964         105   2
## 23     Hilda 1964         120   3
## 24    Isbell 1964         105   2
## 25     Betsy 1965         120   3
## 26      Alma 1966         105   2
## 27      Inez 1966          85   1
## 28    Beulah 1967         120   3
## 29    Gladys 1968         105   2
## 30   Camille 1969         190   5
## 31     Celia 1970         120   3
## 32      Fern 1971          85   1
## 33     Edith 1971         105   2
## 34    Ginger 1971          85   1
## 35     Agnes 1972          85   1
## 36    Carmen 1974         120   3
## 37    Eloise 1975         120   3
## 38     Belle 1976          85   1
## 39      Babe 1977          85   1
## 40       Bob 1979          85   1
## 41     David 1979         105   2
## 42  Frederic 1979         120   3
## 43     Allen 1980         115   3
## 44    Alicia 1983         115   3
## 45     Diana 1984         110   2
## 46       Bob 1985          75   1
## 47     Danny 1985          90   1
## 48     Elena 1985         115   3
## 49    Gloria 1985         120   3
## 50      Juan 1985          85   1
## 51      Kate 1985         100   2
## 52    Bonnie 1986          85   1
## 53   Charley 1986          75   1
## 54     Floyd 1987          75   1
## 55  Florence 1988          80   1
## 56   Chantal 1989          80   1
## 57      Hugo 1989         140   4
## 58     Jerry 1989          85   1
## 59       Bob 1991         105   2
## 60    Andrew 1992         170   5
## 61     Emily 1993         115   3
## 62      Erin 1995         100   2
## 63      Opal 1995         115   3
## 64    Bertha 1996         105   2
## 65      Fran 1996         115   3
## 66     Danny 1997          80   1
## 67    Bonnie 1998         110   2
## 68      Earl 1998          80   1
## 69   Georges 1998         105   2
## 70      Bret 1999         115   3
## 71     Floyd 1999         105   2
## 72     Irene 1999          80   1
## 73      Lili 2002          90   1
## 74 Claudette 2003          90   1
## 75    Isabel 2003         105   2
## 76      Alex 2004          80   1
## 77   Charley 2004         150   4
## 78    Gaston 2004          75   1
## 79   Frances 2004         105   2
## 80      Ivan 2004         120   3
## 81    Jeanne 2004         120   3
## 82     Cindy 2005          75   1
## 83    Dennis 2005         120   3
## 84   Katrina 2005         125   3
## 85   Ophelia 2005          75   1
## 86      Rita 2005         115   3
## 87     Wilma 2005         120   3
## 88  Humberto 2007          90   1
## 89     Dolly 2008          85   1
## 90    Gustav 2008         105   2
## 91       Ike 2008         110   2
## 92     Irene 2011          75   1
## 93     Isaac 2012          80   1
## 94     Sandy 2012          75   1

Conclusion

We now have an official category strength assigned to each hurricane based on its wind speed in MPH at landfall…thanks to the sqldf package!