Synopsis

In this analysis we tried to find the types of weather events which may cause highest impact in terms of death ,injuries and economic costs.The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database which contains weather events from 1950s to current decade has been explored to find the required answers.After the analysis we found that TORNADO causes most fatalities and injuries becoming top event causing public health problems. Calculation of economic costs is calculated via both property and crop damage. FLOOD is found to be the top most event causing economic costs.

Loading and Processing the Raw Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data [47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

1. Data Processing.

The following code downloads the data zip file from the website. As second step using read.csv() function, file is unzipped and loaded into dataframe.

if(!exists("rawdata.bz2")) {
Url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(Url, "rawdata.bz2")}
Url2<-"https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(Url2, basename(Url2))
if(!exists("df1")) {
    df1 <- read.csv(bzfile("rawdata.bz2"),header = TRUE)
  } 

2. Examining the data set and selecting necessary columns

str(df1)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The data set has 902297 obs. of 37 variables and to our analysis we only need 6 variables from the dataset.

Health variables:
* FATALITIES: approx. number of deaths
* INJURIES: approx. number of injuries

Economic variables:

PROPDMG: approx. property damags
PROPDMGEXP: the units for property damage value
CROPDMG: approx. crop damages
CROPDMGEXP: the units for crop damage value

So we made a subset of dataframe using the following code.

df2 <- df1[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
dim(df2)
## [1] 902297      7

3. Checking for any missing value.

summary(df2)
##     EVTYPE            FATALITIES          INJURIES            PROPDMG       
##  Length:902297      Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00  
##  Class :character   1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00  
##  Mode  :character   Median :  0.0000   Median :   0.0000   Median :   0.00  
##                     Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06  
##                     3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50  
##                     Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00  
##   PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Length:902297      Min.   :  0.000   Length:902297     
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  1.527                     
##                     3rd Qu.:  0.000                     
##                     Max.   :990.000

No missing values found in the dataset.

2. Data manipulation and Analysis

Qn.1 Across the United States, which types of events are most harmful with respect to population health?

In the data there were two variables regarding health impact;FATALITIES & INJURIES. Using dplyr package, data was initially grouped according to event types, the summarized according to totals of fatalities and injuries for each event. After arranging values, top 10 selected and saved in st1 & st2.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
st1<-df2%>%  group_by(EVTYPE)%>%
        summarise(fatality=sum(FATALITIES))%>%
        slice_max(order_by=fatality,n=10)
st1
## # A tibble: 10 × 2
##    EVTYPE         fatality
##    <chr>             <dbl>
##  1 TORNADO            5633
##  2 EXCESSIVE HEAT     1903
##  3 FLASH FLOOD         978
##  4 HEAT                937
##  5 LIGHTNING           816
##  6 TSTM WIND           504
##  7 FLOOD               470
##  8 RIP CURRENT         368
##  9 HIGH WIND           248
## 10 AVALANCHE           224
st2<-df2%>%  group_by(EVTYPE)%>%
    summarise(injury=sum(INJURIES))%>%
    slice_max(order_by=injury,n=10)
st2
## # A tibble: 10 × 2
##    EVTYPE            injury
##    <chr>              <dbl>
##  1 TORNADO            91346
##  2 TSTM WIND           6957
##  3 FLOOD               6789
##  4 EXCESSIVE HEAT      6525
##  5 LIGHTNING           5230
##  6 HEAT                2100
##  7 ICE STORM           1975
##  8 FLASH FLOOD         1777
##  9 THUNDERSTORM WIND   1488
## 10 HAIL                1361

Qn.2 Across the United States, which types of events have the greatest economic consequences?

Economic Impact

The data provides two types of economic impact, namely property damage (PROPDMG) and crop damage (CROPDMG). The actual damage in $USD is indicated by PROPDMGEXP and CROPDMGEXP parameters as exponential. After searching for individual events in the storm website, the index in the PROPDMGEXP and CROPDMGEXP can be interpreted roughly as the following:-

H, h -> hundreds = x100
K, K -> kilos = x1,000
M, m -> millions = x1,000,000
B,b -> billions = x1,000,000,000
(+) -> x1
(-) -> x0
(?) -> x0
blank -> x0
(0) -> x0

Step1: Two character vectors of unique values were made for property damage and crop damage. Corresponding multiplier vectors also made using above mentioned logic on exponential.

Step2: The unique value vector and multiplier vector were combined into df2,df3 separately for property and crop damage. columns were renamed so as to match the df2 vector.

prop_unique<-unique(df2$PROPDMGEXP)
crop_unique<-unique(df2$CROPDMGEXP)
prop_multi<-c(10^3,10^6,0,10^9,10^6,1,0,10^5,10^6,0,10^4,10^2,10^3,10^2,10^7,10^2,0,10^1,10^8)
crop_multi<-c(0,10^6,10^3,10^6,10^9,0,0,10^3,10^2)
df3<-data.frame(prop_unique,prop_multi)
df4<-data.frame(crop_unique,crop_multi)
colnames(df3)<- c("PROPDMGEXP","prop_multi")
colnames(df4)<- c("CROPDMGEXP","crop_multi")

The resultant vectors are then put into data pipeline of dplyr and each step doing the following task.

  1. combining df2 and property damage multiplier by PROPDMGEXP column and left join function.
  2. combining data frame from 1st step and crop damage multiplier by CROPDMGEXP column and left join function.
  3. Creation of actual damage by multiplication of damage and respective multiplier for crop and damage.
  4. Creation of total damage variable by totaling crop and property damage.
  5. grouping total damage by event type and summarizing the cost.
  6. Selection of top 10 values of cost.
st3 <-df2 %>% left_join(df3,by="PROPDMGEXP")%>%
            left_join(df4,by="CROPDMGEXP")%>%
            mutate(propdamage=PROPDMG*prop_multi,cropdamage=CROPDMG*crop_multi) %>%
            mutate(totaldamage=cropdamage+propdamage)%>%
            group_by(EVTYPE)%>%
            summarise(cost=sum(totaldamage))%>%
            slice_max(order_by=cost,n=10)
st3
## # A tibble: 10 × 2
##    EVTYPE                    cost
##    <chr>                    <dbl>
##  1 FLOOD             150319678250
##  2 HURRICANE/TYPHOON  71913712800
##  3 TORNADO            57362333650
##  4 STORM SURGE        43323541000
##  5 HAIL               18761221670
##  6 FLASH FLOOD        18243990610
##  7 DROUGHT            15018672000
##  8 HURRICANE          14610229010
##  9 RIVER FLOOD        10148404500
## 10 ICE STORM           8967041310

Results

1. Events that causing highest public health impact in US.

par(mfrow = c(1, 2))
plot1<-barplot(st1$fatality,main="Top 10 high fatality events",ylab = "Total fatality",col="blue")
text(plot1, par("usr")[3], labels = st1$EVTYPE, srt = 90, adj = c(1.1,1.1), xpd = TRUE, cex=0.7)
plot2<-barplot(st2$injury,main="Top 10 high injuries events",ylab = "Total injuries",col="magenta")
text(plot2, par("usr")[3], labels = st2$EVTYPE, srt = 90, adj = c(1.1,1.1), xpd = TRUE, cex=0.7)

The st1 & st2 summary tables made in the analysis section is plotted parallely showing top 10 fatalities and injuries causing events. It is understood that TORNADO as the lead cause of mortality and morbidity.

2. Events that causing highest economic costs in US.

par(mfrow = c(1, 1))
plot1<-barplot(st3$cost,main="Top 10 events with highest economic damage.",ylab = "Total cost",col="purple")
text(plot1, par("usr")[3], labels = st3$EVTYPE, srt = 90, adj = c(1.1,1.1), xpd = TRUE, cex=0.7)

The st3 summary table made earlier is plotted showing top 10 events causing highest damage. It is clear that FLOOD as the most damaging one.