Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
library(R.utils)
library(lattice)
filename <- 'StormData.csv.bz2'
content <- 'StormData.csv'
# Download the archive only if neither it nor the extracted CSV is present
if (!file.exists(filename) && !file.exists(content))
{
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', filename, method='curl')
}
# Extract the archive (bunzip2 comes from R.utils) unless the CSV already exists
if (file.exists(filename) && !file.exists(content))
{
bunzip2(filename, content, remove = FALSE, skip = TRUE)
}
Now, let’s have a first look at the data.
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1000 1 1/13/1972 0:00:00 0215 CST 67 HENRY AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1000 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1000 NA 0 8.4 200 3 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1000 2 250 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1000 3136 8524 3143 8522 1000
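The code that produced the output above is not shown in this report. A minimal sketch of how such a first look could be obtained follows. The dataframe name `df1` matches the name used later in the report; the tiny inline sample and the temporary file are assumptions standing in for the real `StormData.csv`.

```r
# Stand-in for StormData.csv: a tiny sample written to a temporary file
# (assumption for illustration; the real file has 37 columns).
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "STATE,EVTYPE,FATALITIES,INJURIES",
  "AL,TORNADO,0,15",
  "AL,TORNADO,0,0",
  "AL,TORNADO,0,2"
), sample_csv)

# Read the data; stringsAsFactors = FALSE keeps EVTYPE as plain text
df1 <- read.csv(sample_csv, stringsAsFactors = FALSE)

head(df1)   # first rows
names(df1)  # column names
df1[2, ]    # a single row selected by position
```

With the real data, `sample_csv` would simply be replaced by the path to the extracted CSV.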
Checking how many rows have missing values.
Rows with missing fatalities:
## [1] 0
Rows with missing injuries:
## [1] 0
As the counts show, neither FATALITIES nor INJURIES contains any missing values.
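The counts above can be obtained with `is.na()`. The following is a small sketch on a toy dataframe; the values in `df1` here are illustrative assumptions, not the storm data itself.

```r
# Toy stand-in for df1 (assumption; values are illustrative)
df1 <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(0, 2, 0),
  INJURIES   = c(15, 0, 1)
)

# Number of rows with a missing value in each column of interest
missing_fatalities <- sum(is.na(df1$FATALITIES))
missing_injuries   <- sum(is.na(df1$INJURIES))

# NA counts for every column of the dataframe in one call
na_per_column <- colSums(is.na(df1))

missing_fatalities
missing_injuries
```

`colSums(is.na(df1))` is a convenient way to extend the same check to all columns at once.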
Let’s have a look at all the types of events that this dataset contains.
This suggests that some event types describe the same kind of event but are recorded under different labels (differences in spelling, case, or wording), so they appear as distinct types in this dataframe. We can observe a few examples:
Now let’s fix these by grouping these events together under one event type.
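The listing and grouping code is not shown in this report, so the following is only a sketch of one possible approach, not the method actually used here. The labels in `evtype` and the `synonyms` map are hypothetical; note that the tables later in this report still list labels such as TSTM WIND and THUNDERSTORM WIND separately, so the author's grouping evidently differed from this illustration.

```r
# Illustrative labels (assumption: not taken from the report's hidden code)
evtype <- c(" TSTM WIND", "tstm wind", "THUNDERSTORM WIND",
            "Flash  Flood", "FLASH FLOOD")

# Normalise case and surrounding whitespace, then collapse repeated spaces
clean <- toupper(trimws(evtype))
clean <- gsub("\\s+", " ", clean)

# Hypothetical synonym map: names are variants, values are canonical labels
synonyms <- c("TSTM WIND" = "THUNDERSTORM WIND")
hit <- clean %in% names(synonyms)
clean[hit] <- unname(synonyms[clean[hit]])

# Distinct event types after cleaning
unique(clean)
```

Keeping the synonym map as a named vector makes the grouping decisions explicit and easy to review.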
Now let’s create a new dataframe that sums FATALITIES and INJURIES for each EVTYPE, which gives a clearer picture of the overall impact of each event type.
df2 <- aggregate(list(Fatalities=df1$FATALITIES, Injuries=df1$INJURIES), by=list(Event=df1$EVTYPE), FUN=sum)
head(df2)
## Event Fatalities Injuries
## 1 HIGH SURF ADVISORY 0 0
## 2 COASTAL FLOOD 0 0
## 3 FLASH FLOOD 0 0
## 4 LIGHTNING 0 0
## 5 TSTM WIND 0 0
## 6 TSTM WIND (G45) 0 0
Now that we have the population impact of each event type, we can sort the dataframe in descending order of fatalities, breaking ties by injuries.
## Event Fatalities Injuries
## 832 TORNADO 5633 91346
## 130 EXCESSIVE HEAT 1903 6525
## 153 FLASH FLOOD 978 1777
## 275 HEAT 937 2100
## 463 LIGHTNING 816 5230
## 584 RIP CURRENT 572 529
## 854 TSTM WIND 504 6957
## 170 FLOOD 470 6789
## 358 HIGH WIND 248 1137
## 19 AVALANCHE 224 170
## 969 WINTER STORM 206 1321
## 278 HEAT WAVE 172 309
## 140 EXTREME COLD 160 231
## 349 HIGH SURF 143 200
## 758 THUNDERSTORM WIND 133 1488
## 310 HEAVY SNOW 127 1021
## 141 EXTREME COLD/WIND CHILL 125 24
## 674 STRONG WIND 103 280
## 30 BLIZZARD 101 805
## 290 HEAVY RAIN 98 251
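The sorting step that produced the table above is not shown. A minimal sketch of a descending sort on two keys follows; the name `df3` matches the dataframe used in the later code, while the toy rows are assumptions chosen so that a tie on fatalities occurs.

```r
# Toy stand-in for df2 (assumption; event names and counts are illustrative)
df2 <- data.frame(
  Event      = c("A", "B", "C", "D"),
  Fatalities = c(5, 20, 9, 5),
  Injuries   = c(70, 900, 21, 3)
)

# Sort by fatalities (descending), then by injuries (descending) for ties
df3 <- df2[order(-df2$Fatalities, -df2$Injuries), ]

head(df3, 20)
```

Negating the numeric columns inside `order()` gives a descending sort on both keys; the original row names are kept, which is why the table above shows non-consecutive row numbers.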
Now let’s draw a bar plot of fatalities and injuries for the ten events with the most fatalities.
plot1 <- barchart(Fatalities + Injuries ~ Event, data = df3[1:10,],
key = list(
space = "right",
text = list(c("Fatalities", "Injuries"), col = 'black'),
rectangles = list(col = c("dodgerblue", "salmon"))
),
main = "Fatalities and Injuries per event", xlab = "Event", ylab = "Count",
scales = list(x = list(rot = 45)),
col = c("dodgerblue", "salmon")
)
print(plot1)
First, we assume that a fatality has more impact than an injury; in this analysis, each fatality is weighted twice as heavily as an injury.
Now let’s create a new column that holds the weighted impact per recorded event.
# Create a new column for the number of recorded events of each type
df3$Total_Events <- sapply(df3$Event, function(event_s){
nrow(df1[(df1$EVTYPE == event_s),])
})
df3$Impact_per_Event <- (df3$Fatalities * 2 + df3$Injuries) / df3$Total_Events
head(df3, 10)
## Event Fatalities Injuries Total_Events Impact_per_Event
## 832 TORNADO 5633 91346 60652 1.69181560
## 130 EXCESSIVE HEAT 1903 6525 1678 6.15673421
## 153 FLASH FLOOD 978 1777 54277 0.06877683
## 275 HEAT 937 2100 767 5.18122555
## 463 LIGHTNING 816 5230 15754 0.43557192
## 584 RIP CURRENT 572 529 774 2.16149871
## 854 TSTM WIND 504 6957 219940 0.03621442
## 170 FLOOD 470 6789 25326 0.30518045
## 358 HIGH WIND 248 1137 20212 0.08079359
## 19 AVALANCHE 224 170 386 1.60103627
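As a cross-check of the calculation above, here is a small sketch on toy data. It counts records per event type with `table()` in a single pass, which is an alternative to the `sapply()` loop and not the code used in this report; `fatality_weight`, `totals`, and the toy rows are assumptions introduced for illustration.

```r
# Toy stand-in for df1 (assumption; values are illustrative)
df1 <- data.frame(
  EVTYPE     = c("A", "A", "A", "B", "B"),
  FATALITIES = c(1, 0, 2, 0, 1),
  INJURIES   = c(4, 0, 6, 1, 0)
)

# Totals per event type, as in the aggregate() call earlier in the report
totals <- aggregate(list(Fatalities = df1$FATALITIES, Injuries = df1$INJURIES),
                    by = list(Event = df1$EVTYPE), FUN = sum)

# Number of records per event type, looked up by event name
counts <- table(df1$EVTYPE)
totals$Total_Events <- as.vector(counts[as.character(totals$Event)])

# Weighted impact per event: fatalities count twice as much as injuries
fatality_weight <- 2
totals$Impact_per_Event <-
  (totals$Fatalities * fatality_weight + totals$Injuries) / totals$Total_Events

# Same formula applied to the TORNADO row reported above
tornado_impact <- (5633 * fatality_weight + 91346) / 60652
```

The last line reproduces the TORNADO value from the table, since (2 × 5633 + 91346) / 60652 is approximately 1.6918.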
Now let’s sort the dataframe by the Impact_per_Event column. We also assume that an event type with 100 or fewer recorded events does not provide enough data for a reliable impact analysis, so only types with more than 100 events are kept.
# Sorting data in the order of highest impact per event
df3 <- df3[(order(-df3$Impact_per_Event)),]
# Number of events should be higher than 100
df4 <- df3[(df3$Total_Events > 100),]
head(df4, 20)
## Event Fatalities Injuries Total_Events Impact_per_Event
## 130 EXCESSIVE HEAT 1903 6525 1678 6.1567342
## 275 HEAT 937 2100 767 5.1812256
## 584 RIP CURRENT 572 529 774 2.1614987
## 832 TORNADO 5633 91346 60652 1.6918156
## 19 AVALANCHE 224 170 386 1.6010363
## 188 FOG 62 734 538 1.5947955
## 117 DUST STORM 22 440 427 1.1334895
## 426 ICE STORM 89 1975 2006 1.0732802
## 401 HURRICANE 61 46 174 0.9655172
## 140 EXTREME COLD 160 231 655 0.8412214
## 846 TROPICAL STORM 58 340 690 0.6608696
## 349 HIGH SURF 143 200 953 0.5099685
## 463 LIGHTNING 816 5230 15754 0.4355719
## 957 WIND 23 86 340 0.3882353
## 954 WILDFIRE 87 1456 4218 0.3864391
## 79 COLD/WIND CHILL 95 12 539 0.3747681
## 30 BLIZZARD 101 805 2719 0.3703567
## 115 DUST DEVIL 2 42 141 0.3262411
## 884 UNSEASONABLY WARM 11 17 126 0.3095238
## 170 FLOOD 470 6789 25326 0.3051804
Now let’s draw a bar plot of the ten event types with the highest impact per event.
plot2 <- barchart(Impact_per_Event ~ Event, data = df4[1:10,],
main = "Weighted impact per event", xlab = "Event", ylab = "Impact per event",
scales = list(x = list(rot = 45)),
col = "dodgerblue"
)
print(plot2)
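Now let’s create a new dataframe with the total economic damage per event type. Here the total is the sum of the raw PROPDMG and CROPDMG values; the magnitude codes in PROPDMGEXP and CROPDMGEXP are not applied, so the figures below are not dollar amounts and combine records stored in different units.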
df5 <- aggregate(list(Total_Economic_Damage=df1$PROPDMG + df1$CROPDMG), by=list(Event=df1$EVTYPE), FUN=sum)
head(df5)
## Event Total_Economic_Damage
## 1 HIGH SURF ADVISORY 200
## 2 COASTAL FLOOD 0
## 3 FLASH FLOOD 50
## 4 LIGHTNING 0
## 5 TSTM WIND 108
## 6 TSTM WIND (G45) 8
# Now sorting the dataframe by economic damage
df5 <- df5[(order(-df5$Total_Economic_Damage)),]
head(df5)
## Event Total_Economic_Damage
## 832 TORNADO 3312276.7
## 153 FLASH FLOOD 1599325.1
## 854 TSTM WIND 1445168.2
## 244 HAIL 1268289.7
## 170 FLOOD 1067976.4
## 758 THUNDERSTORM WIND 943635.6
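The aggregation above adds PROPDMG and CROPDMG as recorded, without using the PROPDMGEXP and CROPDMGEXP columns. A sketch of how those codes could be applied is shown below. It assumes that K, M and B stand for thousands, millions and billions, and it treats any other or empty code as a multiplier of 1, which is a simplifying assumption; `exp_multiplier`, `storms`, and `Damage_USD` are hypothetical names and the toy values are illustrative.

```r
# Toy records (assumption; values are illustrative)
storms <- data.frame(
  EVTYPE     = c("A", "A", "B"),
  PROPDMG    = c(25, 2.5, 1.2),
  PROPDMGEXP = c("K", "M", "B"),
  CROPDMG    = c(0, 10, 0),
  CROPDMGEXP = c("", "K", ""),
  stringsAsFactors = FALSE
)

# Map a magnitude code to a numeric multiplier.
# Unknown or empty codes fall back to 1 (assumption).
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- unname(lookup[toupper(code)])
  m[is.na(m)] <- 1
  m
}

# Damage per record in dollars, then totals per event type
storms$Damage_USD <-
  storms$PROPDMG * exp_multiplier(storms$PROPDMGEXP) +
  storms$CROPDMG * exp_multiplier(storms$CROPDMGEXP)

damage_by_event <- aggregate(list(Total_Economic_Damage = storms$Damage_USD),
                             by = list(Event = storms$EVTYPE), FUN = sum)
damage_by_event
```

Applying the multipliers before aggregating would change the rankings reported in this section, because records in thousands and records in billions would no longer be added on the same scale.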
Now, we’ll see which event types cause the highest average economic damage per recorded event.
df5$Total_Events <- sapply(df5$Event, function(event_s){
nrow(df1[(df1$EVTYPE == event_s),])
})
df5$Economic_Impact <- df5$Total_Economic_Damage / df5$Total_Events
# Now sort the dataframe by economic damage per event
df5 <- df5[(order(-df5$Economic_Impact)),]
# Number of events must be greater than 100
df5 <- df5[(df5$Total_Events > 100),]
head(df5, 10)
## Event Total_Economic_Damage Total_Events Economic_Impact
## 401 HURRICANE 20852.99 174 119.84477
## 588 RIVER FLOOD 17345.70 173 100.26416
## 846 TROPICAL STORM 54322.80 690 78.72870
## 668 STORM SURGE 19398.49 261 74.32372
## 185 FLOODING 8824.90 120 73.54083
## 903 URBAN FLOOD 14216.50 249 57.09438
## 832 TORNADO 3312276.68 60652 54.61117
## 669 STORM SURGE/TIDE 7627.05 148 51.53412
## 164 FLASH FLOODING 33623.20 682 49.30088
## 170 FLOOD 1067976.36 25326 42.16917
Now let’s draw a barplot for economic damage per event type.
plot3 <- barchart(Economic_Impact ~ Event, data = df5[1:10,],
main = "Average Economic Impact per event", xlab = "Event", ylab = "Economic Impact per event",
scales = list(x = list(rot = 45)),
col = "dodgerblue"
)
print(plot3)