-This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.
-The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, the type of event, and estimates of the resulting fatalities, injuries, and various forms of damage.
-The analysis uses a subset of the data: EVTYPE, FATALITIES, INJURIES, CROPDMG, and PROPDMG, plus two derived columns, CROPEXP and PROPEXP, built from CROPDMGEXP and PROPDMGEXP respectively.
-CROPDMGEXP and PROPDMGEXP hold exponent codes for the CROPDMG and PROPDMG values, so each code was converted to a numeric multiplier and multiplied by the corresponding DMG column to get the actual damage value.
-Graphs were plotted with the ggplot2 package, and the data were processed with the dplyr package.
-The analysis found that tornadoes are responsible for the largest number of fatalities and injuries.
-It also found that floods cause the most property damage, while droughts cause the most crop damage.
-Objective: Explore the NOAA Storm Database to help answer important questions about severe weather events.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
We load the required libraries, read the raw data with read.csv, and use summary to get an overview of what the data look like.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
raw_data <- read.csv("repdata-data-StormData.csv")
summary(raw_data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
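The download is bzip2-compressed, while the read.csv call above points at an already extracted .csv file. As far as we know, read.csv can also read a .bz2 file directly, because R's file connections detect the compression when reading. Here is a minimal sketch on a toy file; the data frame, the temporary file, and the column choice are made up for illustration and are not taken from the NOAA file:

```r
# Build a tiny bzip2-compressed CSV in a temporary location (toy data, not the NOAA file).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD"),
  FATALITIES = c(3, 1),
  REMARKS    = c("a", "b"),
  stringsAsFactors = FALSE
)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path through file(), which detects bzip2 compression on read.
storm_toy <- read.csv(tmp, stringsAsFactors = FALSE)

# Keep only the columns needed downstream (hypothetical selection for this toy file).
keep <- c("EVTYPE", "FATALITIES")
storm_toy <- storm_toy[, keep]
storm_toy
```

Dropping unneeded columns right after reading keeps the later steps lighter, which may matter for a table of roughly 900,000 rows.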
The variables used in this analysis are:
- EVTYPE (event type or calamity type)
- FATALITIES
- INJURIES
- CROPDMG
- CROPDMGEXP
- PROPDMG
- PROPDMGEXP
Across the United States, which types of events are most harmful with respect to population health?
As the summary shows, grouping the FATALITIES and INJURIES variables by event type (the EVTYPE column) tells us which event types are most harmful to population health.
We select the FATALITIES and EVTYPE columns from the raw data and aggregate them to obtain the 10 most harmful event types by fatality count.
#library(dplyr)
fatalities <- raw_data %>%
select(EVTYPE, FATALITIES) %>%
group_by(EVTYPE) %>%
summarise(FATALITIES = sum(FATALITIES))
## `summarise()` ungrouping output (override with `.groups` argument)
fatalities_top_10 <- fatalities[order(-fatalities$FATALITIES), ][1:10, ]
fatalities_top_10
## # A tibble: 10 x 2
## EVTYPE FATALITIES
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
Next we select the INJURIES and EVTYPE columns from the raw data and aggregate them to obtain the 10 most harmful event types by injury count.
#library(dplyr)
injuries <- raw_data %>%
select(EVTYPE, INJURIES) %>%
group_by(EVTYPE) %>%
summarise(INJURIES = sum(INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
injuries_top_10 <- injuries[order(-injuries$INJURIES), ][1:10, ]
injuries_top_10
## # A tibble: 10 x 2
## EVTYPE INJURIES
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
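The two rankings come from the same grouping, so both sums could also be computed in a single pass. A small sketch of that idea, assuming dplyr 1.0 or later (for `.groups` and `slice_max`); the toy data frame and its counts are invented and only stand in for raw_data:

```r
library(dplyr)

# Toy stand-in for raw_data; the counts are made up for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT", "FLOOD"),
  FATALITIES = c(5, 2, 1, 4, 0),
  INJURIES   = c(40, 10, 3, 0, 2),
  stringsAsFactors = FALSE
)

# Sum both health measures per event type in one grouped summary.
health <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    FATALITIES = sum(FATALITIES),
    INJURIES   = sum(INJURIES),
    .groups    = "drop"
  )

# Take the top rows for each measure separately (n = 2 here, 10 in the report).
top_fatal  <- slice_max(health, FATALITIES, n = 2)
top_injury <- slice_max(health, INJURIES, n = 2)
top_fatal
top_injury
```

Keeping both measures in one table also makes it easy to compare an event type's rank on fatalities against its rank on injuries.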
To get a clearer picture, we plot the two rankings side by side using the ggplot2 and gridExtra libraries.
#library(ggplot2)
#library(gridExtra) #to plot side by side
fatalities_plot <- ggplot(fatalities_top_10, aes(reorder(EVTYPE, FATALITIES), FATALITIES)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
xlab("Event Type") + ylab("Fatalities")
injuries_plot <- ggplot(injuries_top_10, aes(reorder(EVTYPE, INJURIES), INJURIES)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
xlab("Event Type") + ylab("Injuries")
grid.arrange(fatalities_plot, injuries_plot, ncol = 2)
The graph shows clearly that TORNADO is the most harmful event type, causing the most fatalities and the most injuries in the United States.
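One caveat about the rankings: TSTM WIND and THUNDERSTORM WIND appear as separate rows in the injuries table, so the totals for what is probably the same kind of event are split. The sketch below shows how such labels could be merged before grouping. The mapping is our assumption rather than an official NOAA coding, and the toy vector is invented for illustration:

```r
# Toy event labels; the merge rule below is an assumption, not NOAA's official coding.
evtype <- c("TSTM WIND", "thunderstorm wind ", "THUNDERSTORM WIND", "TORNADO")

# Upper-case the labels and trim stray whitespace first.
clean <- toupper(trimws(evtype))

# Collapse the abbreviation into the spelled-out label.
clean <- sub("^TSTM WIND$", "THUNDERSTORM WIND", clean)

# Count how many rows fall under each cleaned label.
table(clean)
```

Merging labels this way would change the counts lower in the rankings, although on the figures shown it would not displace TORNADO from the top.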
Across the United States, which types of events have the greatest economic consequences?
To assess the economic damage caused by these events, we can use two columns, CROPDMG and PROPDMG, which record the damage to crops and to property respectively. Finding which event types caused the most crop and property damage answers this question.
On closer inspection there is a complication: two further columns, CROPDMGEXP and PROPDMGEXP, hold the exponent (magnitude) codes for the damage values. Ignoring them would misrepresent the data and lead to a faulty analysis.
So we first convert each exponent code into a numeric multiplier and multiply it by the corresponding DMG value to get the actual damage amount; then we can carry on with the analysis.
Converting the exponent codes in the CROPDMGEXP column:
raw_data$CROPEXP[raw_data$CROPDMGEXP == "M"] <- 1e+06
raw_data$CROPEXP[raw_data$CROPDMGEXP == "K"] <- 1000
raw_data$CROPEXP[raw_data$CROPDMGEXP == "m"] <- 1e+06
raw_data$CROPEXP[raw_data$CROPDMGEXP == "B"] <- 1e+09
raw_data$CROPEXP[raw_data$CROPDMGEXP == "0"] <- 1
raw_data$CROPEXP[raw_data$CROPDMGEXP == "k"] <- 1000
raw_data$CROPEXP[raw_data$CROPDMGEXP == "2"] <- 100
raw_data$CROPEXP[raw_data$CROPDMGEXP == ""] <- 1
raw_data$CROPEXP[raw_data$CROPDMGEXP == "?"] <- 0
raw_data$CROPDMGVAL <- raw_data$CROPDMG * raw_data$CROPEXP
Converting the exponent codes in the PROPDMGEXP column:
raw_data$PROPEXP[raw_data$PROPDMGEXP == "K"] <- 1000
raw_data$PROPEXP[raw_data$PROPDMGEXP == "M"] <- 1e+06
raw_data$PROPEXP[raw_data$PROPDMGEXP == ""] <- 1
raw_data$PROPEXP[raw_data$PROPDMGEXP == "B"] <- 1e+09
raw_data$PROPEXP[raw_data$PROPDMGEXP == "m"] <- 1e+06
raw_data$PROPEXP[raw_data$PROPDMGEXP == "0"] <- 1
raw_data$PROPEXP[raw_data$PROPDMGEXP == "5"] <- 1e+05
raw_data$PROPEXP[raw_data$PROPDMGEXP == "6"] <- 1e+06
raw_data$PROPEXP[raw_data$PROPDMGEXP == "4"] <- 10000
raw_data$PROPEXP[raw_data$PROPDMGEXP == "2"] <- 100
raw_data$PROPEXP[raw_data$PROPDMGEXP == "3"] <- 1000
raw_data$PROPEXP[raw_data$PROPDMGEXP == "h"] <- 100
raw_data$PROPEXP[raw_data$PROPDMGEXP == "7"] <- 1e+07
raw_data$PROPEXP[raw_data$PROPDMGEXP == "H"] <- 100
raw_data$PROPEXP[raw_data$PROPDMGEXP == "1"] <- 10
raw_data$PROPEXP[raw_data$PROPDMGEXP == "8"] <- 1e+08
raw_data$PROPEXP[raw_data$PROPDMGEXP == "+"] <- 0
raw_data$PROPEXP[raw_data$PROPDMGEXP == "-"] <- 0
raw_data$PROPEXP[raw_data$PROPDMGEXP == "?"] <- 0
raw_data$PROPDMGVAL <- raw_data$PROPDMG * raw_data$PROPEXP
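The one-assignment-per-code approach above can also be written as a single lookup table, which keeps the mapping for both columns in one place. This is only a sketch of an alternative with the same multipliers; the names `exp_lookup` and `to_multiplier` are hypothetical and the sample values are invented:

```r
# Multipliers keyed by exponent code, mirroring the assignments above.
# `exp_lookup` and `to_multiplier` are hypothetical names used for illustration.
exp_lookup <- c(
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
  "0" = 1,   "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
  "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8,
  "+" = 0,   "-" = 0,   "?" = 0
)

to_multiplier <- function(code) {
  code <- toupper(code)             # treat "k", "m", "h" like "K", "M", "H"
  mult <- unname(exp_lookup[code])  # NA for blank or unknown codes
  mult[code == ""] <- 1             # blank code maps to 1, as in the assignments above
  mult
}

# Worked example: 25 with "K" is 25,000; 2.5 with "m" is 2,500,000.
dmg  <- c(25, 2.5, 10, 7)
code <- c("K", "m", "", "?")
dmg * to_multiplier(code)
```

A code missing from the table yields NA instead of silently becoming 0 or 1, so unexpected codes show up in the result.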
Looking at the newly constructed columns, we can see the converted multipliers and the final damage columns (PROPDMGVAL and CROPDMGVAL):
head(raw_data[, 38:41])
## CROPEXP CROPDMGVAL PROPEXP PROPDMGVAL
## 1 1 0 1000 25000
## 2 1 0 1000 2500
## 3 1 0 1000 25000
## 4 1 0 1000 2500
## 5 1 0 1000 2500
## 6 1 0 1000 2500
Now we can use the newly created columns together with the EVTYPE column to answer our question.
First we add CROPDMGVAL and PROPDMGVAL to get the total damage:
raw_data$TOTALDMG <- raw_data$CROPDMGVAL + raw_data$PROPDMGVAL
We select EVTYPE and TOTALDMG (calculated above) and aggregate them to obtain the 10 event types that caused the greatest total damage (crop plus property):
#library(dplyr)
total_dmg <- raw_data %>%
select(EVTYPE, TOTALDMG) %>%
group_by(EVTYPE) %>%
summarise(TOTALDMG = sum(TOTALDMG))
## `summarise()` ungrouping output (override with `.groups` argument)
total_dmg_top10 <- total_dmg[order(-total_dmg$TOTALDMG), ][1:10, ]
total_dmg_top10
## # A tibble: 10 x 2
## EVTYPE TOTALDMG
## <chr> <dbl>
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57362333886.
## 4 STORM SURGE 43323541000
## 5 HAIL 18761221986.
## 6 FLASH FLOOD 18243991078.
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041360
Plotting the above data:
economic_loss_plot <- ggplot(total_dmg_top10, aes(reorder(EVTYPE, TOTALDMG), TOTALDMG)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
xlab("Event Type") + ylab("TOTALDMG")
economic_loss_plot
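The ranking above combines crop and property damage. To see which event type leads each category on its own, the two value columns can be summed and ranked separately. A minimal sketch, assuming dplyr 1.0 or later; the toy rows are invented and only stand in for raw_data after the conversion step, so the output says nothing about the real figures:

```r
library(dplyr)

# Toy rows standing in for raw_data after conversion; the values are invented.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "DROUGHT", "FLOOD", "HAIL"),
  PROPDMGVAL = c(5e6, 1e5, 2e6, 3e5),
  CROPDMGVAL = c(1e5, 4e6, 0, 2e5),
  stringsAsFactors = FALSE
)

# Sum property and crop damage per event type.
dmg_by_type <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    PROPDMGVAL = sum(PROPDMGVAL),
    CROPDMGVAL = sum(CROPDMGVAL),
    .groups    = "drop"
  )

# Rank each damage category separately and keep the leading event type.
top_prop <- dmg_by_type %>% arrange(desc(PROPDMGVAL)) %>% head(1)
top_crop <- dmg_by_type %>% arrange(desc(CROPDMGVAL)) %>% head(1)
top_prop$EVTYPE
top_crop$EVTYPE
```

Applied to the real data with `head(10)`, the same pattern would give one top 10 table per damage category.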