The document presents an exploratory analysis of the NOAA Storm Database. The dataset has been taken from this link. It records environmental events, such as typhoons, thunderstorms, floods and flash floods, that occurred between 1950 and 2011. The dataset also describes the economic loss that each of these events has caused to the country, along with counts of the injuries and fatalities attributed to each event. The project aims to derive relationships and insights from the dataset about injuries and property damage, and to determine which event type is the most dangerous.
We’ll download the dataset from this link as a bz2 file, which is a compressed form of CSV that read.csv can read directly. The following code downloads the file from the internet and reads it into a data frame.
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', 'StormData.csv.bz2')
dataset = read.csv('StormData.csv.bz2')
The dimensions of the dataset are as follows:
dim(dataset)
## [1] 902297 37
The structure of the dataset, along with the data types of its columns, is as follows:
str(dataset)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The libraries used in the project are mentioned below.
library(dplyr)
library(stringr)
library(gridExtra)
library(ggplot2)
library(usmap)
library(lubridate)
For the analysis, we’ll first understand and clean the values present in the PROPDMGEXP and EVTYPE columns of the dataset.
The unique values in the PROPDMGEXP column are as follows:
unique(dataset$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
We can map K to thousands, M to millions, B to billions and H to hundreds. Documentation from the official NOAA website has been used to identify the meaning of the remaining codes in the PROPDMGEXP column: the numeric codes correspond to a multiplier of 10, ‘+’ and ‘-’ correspond to 1, and ‘?’ and the empty string correspond to 0.
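For reference, this mapping can be collected into a single named lookup vector (a sketch only; the factor relabelling in the next chunk is what the analysis actually uses):

# Assumed multipliers for each PROPDMGEXP code; the empty string also maps to 0
exp_multiplier <- c('H' = 1e2, 'h' = 1e2, 'K' = 1e3, 'M' = 1e6, 'm' = 1e6, 'B' = 1e9,
                    '0' = 10, '1' = 10, '2' = 10, '3' = 10, '4' = 10, '5' = 10,
                    '6' = 10, '7' = 10, '8' = 10, '+' = 1, '-' = 1, '?' = 0)

Looking up exp_multiplier[as.character(dataset$PROPDMGEXP)] would then give the per-record multiplier, with any unmatched code (such as the empty string) returning NA to be replaced by 0.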
We now replace the codes in the dataset with the multipliers derived above. Before doing that, we trim any surrounding whitespace; both upper- and lower-case codes are handled by the mapping below.
# Trim whitespace, then map each exponent code to its numeric multiplier
dataset$PROPDMGEXP = str_trim(dataset$PROPDMGEXP, side = 'both')
dataset$PROPDMGEXP = factor(dataset$PROPDMGEXP,
  levels = c('-', '?', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', 'B', 'H', 'K', 'M', 'h', 'm', ''),
  labels = c(1, 0, 1, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1000000000, 100, 1000, 1000000, 100, 1000000, 0))
dataset$PROPDMGEXP = as.numeric(as.character(dataset$PROPDMGEXP))
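As a quick sanity check (not part of the original output), the resulting multipliers can be tabulated to confirm that every record received a numeric value:

# Distribution of the numeric multipliers; useNA shows any codes that failed to map
table(dataset$PROPDMGEXP, useNA = 'ifany')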
There are a lot of near-duplicate labels in the EVTYPE column. To normalise them, we’ll convert the column to lower case and remove the numeric values and parenthesised details.
dataset$EVTYPE = str_squish(str_to_lower(dataset$EVTYPE))
dataset$EVTYPE = str_trim(str_replace_all(dataset$EVTYPE, '\\(.*\\)|[.0-9]|\\)', ''))
The code above strips the parenthesised and numeric details, so that, for example, ‘thunderstorm wind (g24)’ becomes ‘thunderstorm wind’.
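A quick illustration of this step on a made-up input string (not a record from the dataset):

# Parenthesised details and digits are stripped, then whitespace is trimmed
str_trim(str_replace_all(str_squish(str_to_lower('Thunderstorm Wind (G24)')),
                         '\\(.*\\)|[.0-9]|\\)', ''))
# expected result: "thunderstorm wind"

The next chunk then consolidates the remaining near-duplicate labels (for example the ‘tstm’ abbreviations) into canonical categories.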
a = str_replace_all(dataset$EVTYPE, 'tstm', 'thunderstorm')
a = str_replace_all(a, 'winds', 'wind')
a = str_replace_all(a, '.*thunderstorm.*', 'thunderstorm wind')
a = str_replace_all(a, 'flood(.*)', 'flood')
a = str_replace_all(a, '.*blizzard.*', 'blizzard')
a = str_replace_all(a, '.*snow.*', 'winter storm')
a = str_replace_all(a, '.*cold.*', 'winter weather')
a = str_replace_all(a, '.*warm.*', 'heat')
a = str_replace_all(a, '.*heat.*', 'heat')
a = str_replace_all(a, '.*hot.*', 'heat')
a = str_replace_all(a, '.*rain.*', 'heavy rain')
a = str_replace_all(a, '.*ice.*', 'ice storm')
a = str_replace_all(a, ' fld.*', ' flood')
a = str_replace_all(a, '.*^(?!flash).*flood.*', 'flood')
a = str_replace_all(a, '.*hurricane.*', 'hurricane')
a = str_replace_all(a, '.*dry.*', 'drought')
a = str_replace_all(a, '.*hail.*', 'hail')
a = str_replace_all(a, '.*surf.*', 'high surf')
a = str_replace_all(a, '.*tropical storm.*', 'tropical storm')
a = str_replace_all(a, '.*fire.*', 'wildfire')
a = str_replace_all(a, '.*wet.*', 'heavy rain')
a = str_replace_all(a, '.*frost.*', 'freeze/frost')
a = str_replace_all(a, '.*waterspout.*', 'waterspout')
a = str_replace_all(a, '.*lightning.*', 'lightning')
a = str_replace_all(a, '.*high.*tide.*', 'storm surge/tide')
a = str_replace_all(a, '.*tornado.*', 'tornado')
a = str_replace_all(a, '.*shower.*', 'heavy rain')
a = str_replace_all(a, '.*high wind.*', 'high wind')
dataset$EVTYPE = a
The EVTYPE column of the dataset is now relatively clean. A few spelling mistakes still exist, but the number of erroneous records is very low.
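As a rough check on the consolidation (not shown in the original output), one can count how many distinct event labels remain:

# Number of distinct EVTYPE labels after cleaning
length(unique(dataset$EVTYPE))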
We’ll identify the most frequent events that have occurred during the time span.
events <- dataset %>%
group_by(EVTYPE) %>%
filter(EVTYPE != '?') %>%
summarise(No.Of.Occurances = n(),
Total_damage = sum(PROPDMG*PROPDMGEXP, na.rm = TRUE),
Total_injuries = sum(INJURIES, na.rm = TRUE),
Total_fatalities = sum(FATALITIES, na.rm = TRUE))
The first few rows of the summarised events data are as follows:
head(events)
## # A tibble: 6 x 5
## EVTYPE No.Of.Occurances Total_damage Total_injuries Total_fatalities
## <chr> <int> <dbl> <dbl> <dbl>
## 1 agricultur~ 6 0 0 0
## 2 apache cou~ 1 5000 0 0
## 3 astronomic~ 174 320000 0 0
## 4 avalance 1 0 0 1
## 5 avalanche 386 3721800 170 224
## 6 beach eros~ 1 0 0 0
Rearranging the summary in decreasing order of the number of occurrences:
events = events[order(events$No.Of.Occurances, decreasing = TRUE), ]
The following dataframe describes the top 10 most frequent events.
head(select(events, EVTYPE, No.Of.Occurances), 10)
## # A tibble: 10 x 2
## EVTYPE No.Of.Occurances
## <chr> <int>
## 1 thunderstorm wind 336808
## 2 hail 289282
## 3 tornado 60684
## 4 flash flood 55037
## 5 flood 31056
## 6 winter storm 29131
## 7 high wind 21915
## 8 lightning 15760
## 9 heavy rain 12225
## 10 winter weather 9499
Rearranging the summary in decreasing order of total property damage:
events = events[order(events$Total_damage, decreasing = TRUE), ]
The dataframe given below describes the top 10 events that have caused the most financial losses, in US dollars ($).
head(select(events, EVTYPE, Total_damage), 10)
## # A tibble: 10 x 2
## EVTYPE Total_damage
## <chr> <dbl>
## 1 flood 150837629534
## 2 hurricane 84756180010
## 3 tornado 56941934097
## 4 storm surge 43323536000
## 5 flash flood 16732372111
## 6 hail 15974566877
## 7 thunderstorm wind 12576178222
## 8 wildfire 8501628500
## 9 winter storm 7716297017
## 10 tropical storm 7714390550
Next, we’ll look at the top 10 events with the most injuries and the most fatalities; a side-by-side plot of these two rankings is sketched after the tables below.
Rearranging the summary in decreasing order of Total_injuries:
events = events[order(events$Total_injuries, decreasing = TRUE), ]
head(select(events, EVTYPE, Total_injuries), 10)
## # A tibble: 10 x 2
## EVTYPE Total_injuries
## <chr> <dbl>
## 1 tornado 91364
## 2 thunderstorm wind 9544
## 3 heat 9243
## 4 flood 6896
## 5 lightning 5231
## 6 winter storm 2486
## 7 ice storm 2154
## 8 flash flood 1785
## 9 wildfire 1608
## 10 high wind 1476
Rearranging the summary in decreasing order of Total_fatalities:
events = events[order(events$Total_fatalities, decreasing = TRUE), ]
head(select(events, EVTYPE, Total_fatalities), 10)
## # A tibble: 10 x 2
## EVTYPE Total_fatalities
## <chr> <dbl>
## 1 tornado 5633
## 2 heat 3178
## 3 flash flood 1018
## 4 lightning 817
## 5 thunderstorm wind 754
## 6 flood 535
## 7 winter weather 469
## 8 winter storm 375
## 9 rip current 368
## 10 high wind 292
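A side-by-side bar chart of these two rankings can be drawn with ggplot2 and gridExtra. The sketch below is illustrative; the titles, ordering and layout are assumptions rather than a reproduction of the original figure.

# Top 10 events by injuries and by fatalities, plotted side by side
top_injuries <- head(events[order(events$Total_injuries, decreasing = TRUE), ], 10)
top_fatalities <- head(events[order(events$Total_fatalities, decreasing = TRUE), ], 10)
p1 <- ggplot(top_injuries, aes(reorder(EVTYPE, Total_injuries), Total_injuries)) +
  geom_col() + coord_flip() +
  labs(title = 'Top 10 events by injuries', x = 'Event type', y = 'Total injuries')
p2 <- ggplot(top_fatalities, aes(reorder(EVTYPE, Total_fatalities), Total_fatalities)) +
  geom_col() + coord_flip() +
  labs(title = 'Top 10 events by fatalities', x = 'Event type', y = 'Total fatalities')
grid.arrange(p1, p2, ncol = 2)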
We have established the most frequent events, the ones that cause the most financial damage, and the ones that cause the most injuries and fatalities.
Now we’ll analyse which geographical regions/states have had the most calamities. To establish this, we’ll group the dataset by state and count the number of events in each.
states = dataset %>%
group_by(STATE) %>%
summarise(Count = n())
Arranging the counts in descending order:
states = states[order(states$Count, decreasing = TRUE), ]
colnames(states) <- c('state', 'count')
head(states)
## # A tibble: 6 x 2
## state count
## <fct> <int>
## 1 TX 83728
## 2 KS 53440
## 3 OK 46802
## 4 MO 35648
## 5 IA 31069
## 6 NE 30271
As we can see above, Texas leads with the highest number of events. The top states, Texas, Kansas, Oklahoma, Missouri, Iowa and Nebraska, are all located in the central part of the country. We’ll analyse which event has occurred the most in these states. The plot below displays the states according to the number of events.
plot_usmap(data = states, value = 'count') +
scale_fill_continuous(low = 'white', high = 'red', name = 'Number of Events') +
theme(legend.position = 'right', title = element_text(size = 14, hjust = 0.5)) +
ggtitle('Number of Events across the Country')
state_events = dataset %>%
filter(STATE %in% c('TX', 'KS', 'OK', 'MO', 'IA', 'NE')) %>%
group_by(STATE, EVTYPE) %>%
summarise(Count = n())
state_events = state_events[order(state_events$Count, decreasing = T), ]
head(state_events, 10)
## # A tibble: 10 x 3
## # Groups: STATE [6]
## STATE EVTYPE Count
## <fct> <chr> <int>
## 1 TX hail 37209
## 2 KS hail 28470
## 3 OK hail 23701
## 4 TX thunderstorm wind 23394
## 5 NE hail 16357
## 6 KS thunderstorm wind 15546
## 7 OK thunderstorm wind 14939
## 8 MO hail 14193
## 9 MO thunderstorm wind 12186
## 10 IA thunderstorm wind 11295
We can see that hail or thunderstorm wind is the most frequent event in each of these states, and that together the two account for more than 50% of the events in every state.
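This share can be verified directly (a quick check, not part of the original output):

# Fraction of each state's events that are hail or thunderstorm wind
dataset %>%
  filter(STATE %in% c('TX', 'KS', 'OK', 'MO', 'IA', 'NE')) %>%
  group_by(STATE) %>%
  summarise(Share = mean(EVTYPE %in% c('hail', 'thunderstorm wind')))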
To calculate the property damage, we’ll group the data by state and compute the amount of damage by multiplying the PROPDMG and PROPDMGEXP columns.
prop_dmg = dataset %>%
group_by(STATE) %>%
summarise(Property_damage = sum(PROPDMG*PROPDMGEXP/1000000000,
na.rm = TRUE))
colnames(prop_dmg) = c('state', 'property_damage')
plot_usmap(data = prop_dmg, value = 'property_damage') +
scale_fill_continuous(low = 'white', high = 'red',
name = 'Amount (in Billion $)') +
theme(legend.position = 'right', title = element_text(size = 14, hjust = 0.5)) +
ggtitle('Property damage across the Country')
We can see from the plot that California has experienced the most property damage from these events of any state in the country.
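The ranking behind this observation can be inspected directly (output not reproduced here):

# States with the largest total property damage, in billions of dollars
head(prop_dmg[order(prop_dmg$property_damage, decreasing = TRUE), ])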
# Trend of the events over the years
To determine the trend in the number of events across the years, we’ll add a new column to the dataset containing the year of each event.
dataset$year = year(mdy_hms(dataset$BGN_DATE))
year_events = dataset %>%
group_by(year) %>%
summarise(Count = n())
ggplot(year_events) +
geom_line(aes(year, Count)) +
theme_bw()+
labs(title = 'Year wise trend of the number of events', x = 'Year', y = 'Count') +
theme(plot.title = element_text(hjust = 0.5))
Following are the observations from the storm dataset: