Using an SQL database as a data source is ideal for working with larger sets of data, but there will be times when you may want to use the SQL language to prepare your data when there is no time to set up and connect to an SQL database.
Never fear! As long as you can get your data quickly into R via either a .csv or Excel file or by connecting to a web resource, you can still use SQL syntax inside R to do data manipulations.
For the final bridge project this summer I looked at 94 Atlantic hurricanes from 1950 to 2012, but the dataset on hurricanes did not include the categories of each hurricane. Fortunately I found the official Saffir-Simpson Hurricane Wind Scale (SSHWS) table on the website of the National Hurricane Center (part of NOAA) at https://www.nhc.noaa.gov/aboutsshws.phpNOAA. Using SQL I was able to extract from it the category criteria and tie it back to the original hurricane dataset.
The Atlantic hurricanes dataset comes from a Git repository of datasets at http://vincentarelbundock.github.io/Rdatasets/. I downloaded a copy of the dataset and put it in my GitHub repository. Let’s import that to R first.
library(tidyverse)
gitURL1 <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/hurricanes.csv"
hurricanes <- read.csv(gitURL1)
glimpse(hurricanes)
## Rows: 94
## Columns: 13
## $ X <chr> "Easy1950", "King1950", "Able1952", "Barbara1953", "...
## $ Name <chr> "Easy", "King", "Able", "Barbara", "Florence", "Caro...
## $ Year <int> 1950, 1950, 1952, 1953, 1953, 1954, 1954, 1954, 1955...
## $ LF.WindsMPH <int> 120, 130, 85, 85, 85, 120, 120, 145, 120, 85, 120, 1...
## $ LF.PressureMB <int> 958, 955, 985, 987, 985, 960, 954, 938, 962, 987, 96...
## $ LF.times <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1...
## $ BaseDamage <dbl> 3.3000, 28.0000, 2.7500, 1.0000, 0.2000, 460.2275, 4...
## $ NDAM2014 <int> 1870, 6030, 170, 65, 18, 21375, 3520, 28500, 2270, 1...
## $ AffectedStates <chr> "FL", "FL", "SC", "NC", "FL", "NC,NY,CT,RI", "MA,ME"...
## $ firstLF <chr> "9/4/1950", "10/17/1950", "8/30/1952", "8/13/1953", ...
## $ deaths <int> 2, 4, 3, 1, 0, 60, 20, 20, 0, 200, 7, 15, 416, 1, 0,...
## $ mf <chr> "f", "m", "m", "f", "f", "f", "f", "f", "f", "f", "m...
## $ BaseDam2014 <dbl> 32.419419, 275.073859, 24.569434, 8.867416, 1.773483...
We can see that the category of each hurricane is not included in the first dataset.
Fortunately the Saffir-Simpsons scale is available from the National Hurricane Center (part of NOAA) at https://www.nhc.noaa.gov/aboutsshws.php. Let’s import this data from a csv file made from the website’s data.
gitURL2 <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/Saffir-SimpsonHurricaneWindScale.csv"
categories <- read.csv(gitURL2)
The ‘categories’ data frame shows wind speeds in three units: miles per hour (mph), kilometers per hour (kp/h) and knots (kt), while the hurricanes dataset only records landfall windspeed in mph. Therefore, before we can merge the category field into the original hurricanes dataset, we need to isolate just the wind speed in units of mph.
Let’s make a subset of the data that has only category and sustained.winds in MPH, and let’s also separate the Min and Max wind speeds into their own respective fields for easier comparisons down the line. In order to do the string manipulations we could use SQL substring and in-string functions to make the process easier, but how do we write SQL code in the R interface?
We can use an R package called sqldf (SQL data frame) to write SQL code against a data frame inside the R interface:
library(sqldf)
catwinds <- sqldf("select [Category],[Sustained.Winds],cast(substr([Sustained.Winds],0,case when instr([Sustained.Winds],'-') = 0 then instr([Sustained.Winds],' ') else instr([Sustained.Winds],'-') end) AS INT) AS MinSpeed, cast(case when instr([Sustained.Winds],'-') = 0 then 999 else substr([Sustained.Winds],instr([Sustained.Winds],'-')+1,instr([Sustained.Winds],' ')-instr([Sustained.Winds],'-')) end AS INT) AS MaxSpeed from categories where [Sustained.Winds] LIKE '%mph%'")
catwinds
## Category Sustained.Winds MinSpeed MaxSpeed
## 1 1 74-95 mph 74 95
## 2 2 96-110 mph 96 110
## 3 3 111-129 mph 111 129
## 4 4 130-156 mph 130 156
## 5 5 157 mph or higher 157 999
With this new data we can assign a category to each storm in the hurricanes dataset by comparing its LF windspeed to the Min and Max windspeeds in the SSHWS category criteria.
for (i in 1:nrow(hurricanes)){
hurricanes$cat[i] <- catwinds$Category[hurricanes$LF.WindsMPH[i] >= catwinds$MinSpeed & hurricanes$LF.WindsMPH[i] <= catwinds$MaxSpeed]
}
# let's see what the above script did to the hurricanes table by looking at a subset of it
hurricanescat <- sqldf("select [Name],[Year],[LF.WindsMPH], [cat] from hurricanes")
hurricanescat
## Name Year LF.WindsMPH cat
## 1 Easy 1950 120 3
## 2 King 1950 130 4
## 3 Able 1952 85 1
## 4 Barbara 1953 85 1
## 5 Florence 1953 85 1
## 6 Carol 1954 120 3
## 7 Edna 1954 120 3
## 8 Hazel 1954 145 4
## 9 Connie 1955 120 3
## 10 Diane 1955 85 1
## 11 Ione 1955 120 3
## 12 Flossy 1956 105 2
## 13 Audrey 1957 145 4
## 14 Helene 1958 120 3
## 15 Debra 1959 85 1
## 16 Gracie 1959 120 3
## 17 Donna 1960 145 4
## 18 Ethel 1960 85 1
## 19 Carla 1961 145 4
## 20 Cindy 1963 85 1
## 21 Cleo 1964 105 2
## 22 Dora 1964 105 2
## 23 Hilda 1964 120 3
## 24 Isbell 1964 105 2
## 25 Betsy 1965 120 3
## 26 Alma 1966 105 2
## 27 Inez 1966 85 1
## 28 Beulah 1967 120 3
## 29 Gladys 1968 105 2
## 30 Camille 1969 190 5
## 31 Celia 1970 120 3
## 32 Fern 1971 85 1
## 33 Edith 1971 105 2
## 34 Ginger 1971 85 1
## 35 Agnes 1972 85 1
## 36 Carmen 1974 120 3
## 37 Eloise 1975 120 3
## 38 Belle 1976 85 1
## 39 Babe 1977 85 1
## 40 Bob 1979 85 1
## 41 David 1979 105 2
## 42 Frederic 1979 120 3
## 43 Allen 1980 115 3
## 44 Alicia 1983 115 3
## 45 Diana 1984 110 2
## 46 Bob 1985 75 1
## 47 Danny 1985 90 1
## 48 Elena 1985 115 3
## 49 Gloria 1985 120 3
## 50 Juan 1985 85 1
## 51 Kate 1985 100 2
## 52 Bonnie 1986 85 1
## 53 Charley 1986 75 1
## 54 Floyd 1987 75 1
## 55 Florence 1988 80 1
## 56 Chantal 1989 80 1
## 57 Hugo 1989 140 4
## 58 Jerry 1989 85 1
## 59 Bob 1991 105 2
## 60 Andrew 1992 170 5
## 61 Emily 1993 115 3
## 62 Erin 1995 100 2
## 63 Opal 1995 115 3
## 64 Bertha 1996 105 2
## 65 Fran 1996 115 3
## 66 Danny 1997 80 1
## 67 Bonnie 1998 110 2
## 68 Earl 1998 80 1
## 69 Georges 1998 105 2
## 70 Bret 1999 115 3
## 71 Floyd 1999 105 2
## 72 Irene 1999 80 1
## 73 Lili 2002 90 1
## 74 Claudette 2003 90 1
## 75 Isabel 2003 105 2
## 76 Alex 2004 80 1
## 77 Charley 2004 150 4
## 78 Gaston 2004 75 1
## 79 Frances 2004 105 2
## 80 Ivan 2004 120 3
## 81 Jeanne 2004 120 3
## 82 Cindy 2005 75 1
## 83 Dennis 2005 120 3
## 84 Katrina 2005 125 3
## 85 Ophelia 2005 75 1
## 86 Rita 2005 115 3
## 87 Wilma 2005 120 3
## 88 Humberto 2007 90 1
## 89 Dolly 2008 85 1
## 90 Gustav 2008 105 2
## 91 Ike 2008 110 2
## 92 Irene 2011 75 1
## 93 Isaac 2012 80 1
## 94 Sandy 2012 75 1
We now have an official category strength assigned to each hurricane based on its wind speed in MPH at landfall…thanks to the sqldf package!