In this section of the report I describe:
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "storm.csv.bz2")
storm <- read.csv("storm.csv.bz2")
Examining the structure of the data shows it to be a large dataset with nearly 1m observations and 37 variables.
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The variables of particular interest are likely to be date of events (BGN_DATE
), the type of event (EVTYPE
), and the relevant damage variables (FATALITIES
, INJURIES
, PROPDMG
, CROPDMG
).
For much of the data management I used the dplyr package, and lubridate for data conversion
library(dplyr)
library(lubridate)
## Convert column names to lower case for ease of use
colnames(storm) <- tolower(colnames(storm))
##Select variables of interest
storm1 <- select(storm, bgn_date, evtype, fatalities, injuries, propdmg, cropdmg)
## Create date variable
storm1$newdate <- mdy_hms(as.character(storm1$bgn_date))
head(storm1)
## bgn_date evtype fatalities injuries propdmg cropdmg
## 1 4/18/1950 0:00:00 TORNADO 0 15 25.0 0
## 2 4/18/1950 0:00:00 TORNADO 0 0 2.5 0
## 3 2/20/1951 0:00:00 TORNADO 0 2 25.0 0
## 4 6/8/1951 0:00:00 TORNADO 0 2 2.5 0
## 5 11/15/1951 0:00:00 TORNADO 0 2 2.5 0
## 6 11/15/1951 0:00:00 TORNADO 0 6 2.5 0
## newdate
## 1 1950-04-18
## 2 1950-04-18
## 3 1951-02-20
## 4 1951-06-08
## 5 1951-11-15
## 6 1951-11-15
## Plot the data over time
par(mfrow = c(1,3))
with(storm1, plot(table(year(newdate)), type = 'l', xlab="Year", ylab="Report frequency"))
with(storm1, plot(table(month(newdate, label = TRUE)), type = 'l', xlab="Month", ylab="Report frequency"))
with(storm1, barplot(table(wday(newdate, label = TRUE, abbr = TRUE)), xlab="Day of week", ylab="Report frequency"))
We can see event reporting has increased over the years with a rapid increase from the mid 1990s. Events are most likely to be reported in the summmer months and during the week.
We can see that evtype
is factor variable with 985 levels, whereas the codebook suggests that there are 48 storm codes: -
Astronomical Low Tide Z
Avalanche Z
Blizzard Z
Coastal Flood Z
Cold/Wind Chill Z
Debris Flow C
Dense Fog Z
Dense Smoke Z
Drought Z
Dust Devil C
Dust Storm Z
Excessive Heat Z
Extreme Cold/Wind Chill Z
Flash Flood C
Flood C
Frost/Freeze Z
Funnel Cloud C
Freezing Fog Z
Hail C
Heat Z
Heavy Rain C
Heavy Snow Z
High Surf Z
High Wind Z
Hurricane (Typhoon) Z
Ice Storm Z
Lake-Effect Snow Z
Lakeshore Flood Z
Lightning C
Marine Hail M
Marine High Wind M
Marine Strong Wind M
Marine Thunderstorm Wind M
Rip Current Z
Seiche Z
Sleet Z
Storm Surge/Tide Z
Strong Wind Z
Thunderstorm Wind C
Tornado C
Tropical Depression Z
Tropical Storm Z
Tsunami Z
Volcanic Ash Z
Waterspout M
Wildfire Z
Winter Storm Z
Winter Weather Z
We need to examine the evtype
variable more closely.
## Convert the storm codes to a look up table
lu<- c('Astronomical Low Tide',
'Avalanche',
'Blizzard',
'Coastal Flood',
'Cold/Wind Chill',
'Debris Flow',
'Dense Fog',
'Dense Smoke',
'Drought',
'Dust Devil',
'Dust Storm',
'Excessive Heat',
'Extreme Cold/Wind Chill',
'Flash Flood',
'Flood',
'Frost/Freeze',
'Funnel Cloud',
'Freezing Fog',
'Hail',
'Heat',
'Heavy Rain',
'Heavy Snow',
'High Surf',
'High Wind',
'Hurricane (Typhoon)',
'Ice Storm',
'Lake-Effect Snow',
'Lakeshore Flood',
'Lightning',
'Marine Hail',
'Marine High Wind',
'Marine Strong Wind',
'Marine Thunderstorm Wind',
'Rip Current',
'Seiche',
'Sleet',
'Storm Surge/Tide',
'Strong Wind',
'Thunderstorm Wind',
'Tornado',
'Tropical Depression',
'Tropical Storm',
'Tsunami',
'Volcanic Ash',
'Waterspout',
'Wildfire',
'Winter Storm',
'Winter Weather')
## Convert to lower case
lu<- tolower(lu)
head(table(storm1$evtype), 20); tail(table(storm1$evtype), 20)
##
## HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD
## 1 1 1
## LIGHTNING TSTM WIND TSTM WIND (G45)
## 1 4 1
## WATERSPOUT WIND ?
## 1 1 1
## ABNORMAL WARMTH ABNORMALLY DRY ABNORMALLY WET
## 4 2 1
## ACCUMULATED SNOWFALL AGRICULTURAL FREEZE APACHE COUNTY
## 4 6 1
## ASTRONOMICAL HIGH TIDE ASTRONOMICAL LOW TIDE AVALANCE
## 103 174 1
## AVALANCHE BEACH EROSIN
## 386 1
##
## WIND DAMAGE WIND GUSTS WIND STORM
## 27 3 1
## WIND/HAIL WINDS WINTER MIX
## 1 36 3
## WINTER STORM WINTER STORM HIGH WINDS WINTER STORM/HIGH WIND
## 11433 1 1
## WINTER STORM/HIGH WINDS WINTER STORMS Winter Weather
## 1 3 19
## WINTER WEATHER WINTER WEATHER MIX WINTER WEATHER/MIX
## 7026 6 1104
## WINTERY MIX Wintry mix Wintry Mix
## 2 3 1
## WINTRY MIX WND
## 90 1
We can see that storm coding is highly inconsistent with variations in case, spelling, code combinations and so on.
## Create a new variable to see what proportion of storm data is correctly coded
storm1$evtype1 <- tolower(storm1$evtype)
storm1$evnew <- ifelse(storm1$evtype1 %in% lu, "correct", "incorrect")
with(storm1, table(evnew, year(newdate)))
##
## evnew 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959
## correct 223 269 272 492 609 992 968 1409 1314 1161
## incorrect 0 0 0 0 0 421 735 775 899 652
##
## evnew 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
## correct 1226 1494 1559 1145 1439 1800 1338 1730 1783 1416
## incorrect 719 752 830 823 909 1055 1050 958 1529 1510
##
## evnew 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
## correct 1421 1927 1456 2297 2783 2336 2026 2005 1899 2233
## incorrect 1794 1544 712 2166 2603 2639 1742 1723 1758 2046
##
## evnew 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
## correct 2965 2324 3562 3329 3769 4152 4361 3111 3310 4699
## incorrect 3181 2193 3570 4993 3566 3827 4365 4256 3947 5711
##
## evnew 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
## correct 4882 6019 7091 7827 11684 14617 20934 17356 22608 19221
## incorrect 6064 6503 6443 4780 8947 13353 11336 11324 15520 12068
##
## evnew 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## correct 20399 21382 21791 25774 25613 25043 29739 43217 55606 45771
## incorrect 14072 13580 14502 13978 13750 14141 14295 72 57 46
##
## evnew 2010 2011
## correct 48093 62080
## incorrect 68 94
This shows that 29.6% of events are not coded as per the codebook, although it is greatly improved since 2007. We can’t ignore this so will need to recode the evtype
variable. There are 0 NAs in the dataset.
There are 852 evtype
variants to be recoded.
## Exclude correctly coded variables, group by `evtype`, count frequency and sort by frequency
storm1 %>% filter(evnew == 0) %>% group_by(evtype1) %>% summarise(freq = length(evtype1)) %>% arrange(-freq)
## Source: local data frame [0 x 2]
##
## Variables not shown: evtype1 (chr), freq (lgl)
## Recode most frequent coding errors
storm1[grep("^tstm", storm1$evtype1),]$evtype1 <- "thunderstorm wind"
storm1[grep("^thun", storm1$evtype1),]$evtype1 <- "thunderstorm wind"
storm1[grep("^marine ts", storm1$evtype1),]$evtype1 <- "marine thunderstorm wind"
storm1[grep("^hail", storm1$evtype1),]$evtype1 <- "hail"
storm1[grep("^winter w", storm1$evtype1),]$evtype1 <- "winter weather"
storm1[grep("^winter s", storm1$evtype1),]$evtype1 <- "winter storm"
storm1[grep("^tornado", storm1$evtype1),]$evtype1 <- "tornado"
storm1[grep("^lakes", storm1$evtype1),]$evtype1 <- "lakeshore flood"
storm1[grep("^wild", storm1$evtype1),]$evtype1 <- "wildfire"
storm1[grep("^waters", storm1$evtype1),]$evtype1 <- "waterspout"
storm1[grep("fld$", storm1$evtype1),]$evtype1 <- "flood"
## Review results
storm1 %>% group_by(evtype1) %>% summarise(freq = length(evtype1)) %>% arrange(-freq)
## Source: local data frame [737 x 2]
##
## evtype1 freq
## 1 thunderstorm wind 324774
## 2 hail 288776
## 3 tornado 60686
## 4 flash flood 54277
## 5 flood 28721
## 6 high wind 20214
## 7 lightning 15754
## 8 heavy snow 15708
## 9 marine thunderstorm wind 11987
## 10 heavy rain 11742
## .. ... ...
## and recheck proportion of storm codes recorded as per code book
storm1$evnew <- ifelse(storm1$evtype1 %in% lu, 'correct', 'incorrect')
with(storm1, table(evnew, year(newdate)))
##
## evnew 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959
## correct 223 269 272 492 609 1413 1703 2184 2213 1813
## incorrect 0 0 0 0 0 0 0 0 0 0
##
## evnew 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
## correct 1945 2246 2389 1968 2348 2855 2388 2688 3312 2926
## incorrect 0 0 0 0 0 0 0 0 0 0
##
## evnew 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
## correct 3215 3471 2168 4463 5386 4975 3768 3728 3657 4279
## incorrect 0 0 0 0 0 0 0 0 0 0
##
## evnew 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
## correct 6146 4517 7132 8322 7335 7979 8726 7367 7257 10410
## incorrect 0 0 0 0 0 0 0 0 0 0
##
## evnew 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
## correct 10946 12522 13534 11606 18950 25281 31458 27867 37209 30457
## incorrect 0 0 0 1001 1681 2689 812 813 919 832
##
## evnew 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## correct 33477 34115 35674 39545 39132 38875 43955 43217 55606 45771
## incorrect 994 847 619 207 231 309 79 72 57 46
##
## evnew 2010 2011
## correct 48093 62080
## incorrect 68 94
This cleaning exerise reduced the miscoding rate to 1.4% which we will accept given time constraints on analysis.
## Create summaries of the the injury and damage variables
apply(storm1[,3:6], 2, summary)
## fatalities injuries propdmg cropdmg
## Min. 0.00000 0.0000 0.00 0.000
## 1st Qu. 0.00000 0.0000 0.00 0.000
## Median 0.00000 0.0000 0.00 0.000
## Mean 0.01678 0.1557 12.06 1.527
## 3rd Qu. 0.00000 0.0000 0.50 0.000
## Max. 583.00000 1700.0000 5000.00 990.000
For this exercise I have looked at overall costs of damage, fatalities, and total injury, using the correctly coded dataset, and calculated a series of new fields:
## filter the storm data frame to exclude miscoded events; create injury and cost variables
storm2 <- storm1 %>% filter(evnew == "correct") %>% mutate(injury = fatalities + injuries, cost = propdmg + cropdmg)
## create data summaries by event and year of total injury, total cost, injury rate, fatality rate, cost per 1000 events
storm3 <- storm2 %>% group_by(evtype1, year = year(newdate)) %>% summarise( n = length(injury), tot_inj = sum(injury), tot_cost = sum(cost), rate_inj = round(tot_inj/n * 1000,2), fatality_rate = round(sum(fatalities)/n *1000,2), cost_per1000 = round(tot_cost/n *1000,2)) %>% arrange(year, -fatality_rate)
## Because coding levels were low prior to 1995 I have decided to use data from 1995 onwards
storm3 <- storm3 %>% filter(year > 1994)
## See which event type is most frequent, has the greatest number of injuries, and highest costs
counts <- with(storm3, tapply(n, evtype1, sum))
which.max(counts)
## thunderstorm wind
## 37
injury <- with(storm3, tapply(tot_inj, evtype1, sum))
which.max(injury)
## tornado
## 38
costs <- with(storm3, tapply(tot_cost, evtype1, sum))
which.max(costs)
## thunderstorm wind
## 37
Thunderstorm winds are the most frequently reported events since 1995, however tornados are responsible for the largest number of injuries with thunderstorm winds for the greatest cost. If we calculate the injury rate per 1000 events, summarised in Figure 1, we can see that thunderstorm winds have lower injury rates than avalanches and heat, and lower than tornadoes. In general there have been declines in injury rates per 1000 reported events over the last 15 years.
Thunderstorm winds are responsible for the greatest absolute costs, but although much less frequent, volcanic ash and tsunamis incur greater cost per event. This is shown in Figure 2.
If there were more time, further breakdown by type of cost, trend and type of injury would be undertaken.
require(ggplot2)
g<- ggplot(data = storm3, aes(year, log10(rate_inj)))
g + geom_line() + facet_wrap(~evtype1) + theme(axis.text=element_text(size=6),
axis.title=element_text(size=10,face="bold"))+
theme(strip.text.x = element_text(size = 6, colour = "blue")) +geom_smooth() + ggtitle("Fig 1. Storm related injuries per 1000 reported events; 1995-2011")
require(lattice)
bwplot(evtype1~log10(cost_per1000), data = storm3, main = "Fig2. Boxplot of costs per 1000 storm events, 1995-2011")