Synopsis

This is my second project for reproducible research. Reproducible research is very important. The dataset comes from the U.S. National Oceanic and Atmospheric Administration (NOAA). I use this dataset to analyze which types of storm events have had the largest impact on population health and on the economy.

I first loaded the dataset and extracted the variables required for the analysis, then checked them for missing values, and finally plotted the totals to see which event types had the largest impact.

The plots show that tornadoes caused the highest health impact (fatalities and injuries) and floods caused the highest economic impact.

Data Processing

Code for reading the dataset

stormData<-read.csv("repdata_data_StormData.csv.bz2",header = TRUE, sep = ",")

Understanding the dataset

dim(stormData)
## [1] 902297     37
str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
summary(stormData)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI          END_DATE        
##  Min.   :   0.000   Length:902297      Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   1.484                                                           
##  3rd Qu.:   1.000                                                           
##  Max.   :3749.000                                                           
##                                                                             
##    END_TIME           COUNTY_END COUNTYENDN       END_RANGE       
##  Length:902297      Min.   :0    Mode:logical   Min.   :  0.0000  
##  Class :character   1st Qu.:0    NA's:902297    1st Qu.:  0.0000  
##  Mode  :character   Median :0                   Median :  0.0000  
##                     Mean   :0                   Mean   :  0.9862  
##                     3rd Qu.:0                   3rd Qu.:  0.0000  
##                     Max.   :0                   Max.   :925.0000  
##                                                                   
##    END_AZI           END_LOCATI            LENGTH              WIDTH         
##  Length:902297      Length:902297      Min.   :   0.0000   Min.   :   0.000  
##  Class :character   Class :character   1st Qu.:   0.0000   1st Qu.:   0.000  
##  Mode  :character   Mode  :character   Median :   0.0000   Median :   0.000  
##                                        Mean   :   0.2301   Mean   :   7.503  
##                                        3rd Qu.:   0.0000   3rd Qu.:   0.000  
##                                        Max.   :2315.0000   Max.   :4400.000  
##                                                                              
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000                     
##                                                                           
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 
head(stormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

Extracting the specific variables from the storm dataset that are needed to analyze the impact of storms on population health and the economy.

The variables important for our analysis are INJURIES, FATALITIES, PROPDMG, CROPDMG, CROPDMGEXP, PROPDMGEXP and EVTYPE (the event type, which is the variable we group by).

mainVar<-c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
finalData<- stormData[,mainVar]

Checking the first 6 and last 6 rows of our finalData which we extracted.

head(finalData)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0
tail(finalData)
##                EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 902292 WINTER WEATHER          0        0       0          K       0          K
## 902293      HIGH WIND          0        0       0          K       0          K
## 902294      HIGH WIND          0        0       0          K       0          K
## 902295      HIGH WIND          0        0       0          K       0          K
## 902296       BLIZZARD          0        0       0          K       0          K
## 902297     HEAVY SNOW          0        0       0          K       0          K

Now, checking for missing values:

sum(is.na(finalData$FATALITIES))
## [1] 0
sum(is.na(finalData$INJURIES))
## [1] 0
sum(is.na(finalData$PROPDMG))
## [1] 0
sum(is.na(finalData$PROPDMGEXP))
## [1] 0
sum(is.na(finalData$CROPDMG))
## [1] 0
sum(is.na(finalData$CROPDMGEXP))
## [1] 0
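
The per-column checks above can also be written as one call. The following is a minimal sketch on a small made-up data frame (the `toy` object and its values are hypothetical, not taken from the storm data). It also counts empty strings separately, since `is.na()` does not treat `""` as missing.

```r
# Toy stand-in for finalData (hypothetical values, for illustration only)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(0, 1, NA),
  CROPDMGEXP = c("", "K", ""),
  stringsAsFactors = FALSE
)

# Count NA values per column in a single call
naCounts <- colSums(is.na(toy))
print(naCounts)

# Empty strings are not NA, so count them separately
emptyCounts <- colSums(toy == "", na.rm = TRUE)
print(emptyCounts)
```

On the real data, the same two calls could be applied to `finalData` directly.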

After checking each variable, there are no missing (NA) values. Note that is.na() does not count empty strings, such as the blank CROPDMGEXP entries seen above. So, now we will transform these variables.

Transforming the variables

max(finalData$EVTYPE)
## [1] "WND"
sort(table(finalData$EVTYPE), decreasing = TRUE)[1:15]
## 
##               HAIL          TSTM WIND  THUNDERSTORM WIND            TORNADO 
##             288661             219940              82563              60652 
##        FLASH FLOOD              FLOOD THUNDERSTORM WINDS          HIGH WIND 
##              54277              25326              20843              20212 
##          LIGHTNING         HEAVY SNOW         HEAVY RAIN       WINTER STORM 
##              15754              15708              11723              11433 
##     WINTER WEATHER       FUNNEL CLOUD   MARINE TSTM WIND 
##               7026               6839               6175

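Now we will group similar events together with the help of the grep function. The assignments run in order, so when an event type matches more than one pattern, the last matching pattern wins (for example, "THUNDERSTORM WIND" matches both WIND and STORM and ends up as STORM).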

finalData$EVENT <- "OTHER"
finalData$EVENT[grep("HAIL", finalData$EVTYPE, ignore.case = TRUE)] <- "HAIL"
finalData$EVENT[grep("HEAT", finalData$EVTYPE, ignore.case = TRUE)] <- "HEAT"
finalData$EVENT[grep("FLOOD", finalData$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
finalData$EVENT[grep("WIND", finalData$EVTYPE, ignore.case = TRUE)] <- "WIND"
finalData$EVENT[grep("STORM", finalData$EVTYPE, ignore.case = TRUE)] <- "STORM"
finalData$EVENT[grep("SNOW", finalData$EVTYPE, ignore.case = TRUE)] <- "SNOW"
finalData$EVENT[grep("TORNADO", finalData$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
finalData$EVENT[grep("WINTER", finalData$EVTYPE, ignore.case = TRUE)] <- "WINTER"
finalData$EVENT[grep("RAIN", finalData$EVTYPE, ignore.case = TRUE)] <- "RAIN"
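
To make the effect of the assignment order concrete, here is a small sketch that applies the same patterns, in the same order, to a handful of labels. The `evtype` sample vector and the loop are illustrative assumptions; the grouping code above uses separate `grep` assignments instead.

```r
# Hypothetical sample of raw event labels, chosen from EVTYPE values shown earlier
evtype <- c("THUNDERSTORM WIND", "WINTER STORM", "HIGH WIND", "HEAVY RAIN", "LIGHTNING")

# Patterns in the same order as the grouping code; later matches overwrite earlier ones
patterns <- c("HAIL", "HEAT", "FLOOD", "WIND", "STORM", "SNOW", "TORNADO", "WINTER", "RAIN")

event <- rep("OTHER", length(evtype))
for (p in patterns) {
  event[grepl(p, evtype, ignore.case = TRUE)] <- p
}

print(data.frame(evtype, event))
```

Here "THUNDERSTORM WIND" is labeled STORM and "WINTER STORM" is labeled WINTER, because those patterns are applied later than WIND and STORM respectively. Labels that match no pattern stay as OTHER.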

Now, checking the values:

table(finalData$EVENT)
## 
##   FLOOD    HAIL    HEAT   OTHER    RAIN    SNOW   STORM TORNADO    WIND  WINTER 
##   82686  289270    2648   48970   12241   17660  113156   60700  255362   19604

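Doing the same with the other variables: the damage exponent codes (PROPDMGEXP and CROPDMGEXP) are converted to numeric multipliers, and the resulting damage is totaled for each EVTYPE: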

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
varia<- c("EVTYPE", "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
finDamage <- finalData[, varia]

# Unique exponent codes found in PROPDMGEXP, sorted
sym <- sort(unique(as.character(finDamage$PROPDMGEXP)))
# Multiplier for each code, matched to sym by position
multiply <- c(0,0,0,1,10,10,10,10,10,10,10,10,10,10^9,10^2,10^2,10^3,10^6,10^6)
convert <- data.frame(sym, multiply)
# Look up the multiplier for each row's exponent code
finDamage$Prop <- convert$multiply[match(finDamage$PROPDMGEXP, convert$sym)]
finDamage$Crop <- convert$multiply[match(finDamage$CROPDMGEXP, convert$sym)]

finDamage <- finDamage %>% mutate(PROPDMG = PROPDMG*Prop) %>% mutate(CROPDMG = CROPDMG*Crop) %>% mutate(DMG = PROPDMG+CROPDMG)

finDamageTol <- finDamage %>% group_by(EVTYPE)%>% summarize(TOLEVTYPE=sum(DMG))%>%arrange(-TOLEVTYPE)
## `summarise()` ungrouping output (override with `.groups` argument)
head(finDamageTol,15)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 15 x 2
##    EVTYPE               TOLEVTYPE
##    <chr>                    <dbl>
##  1 FLOOD             150319678250
##  2 HURRICANE/TYPHOON  71913712800
##  3 TORNADO            57352117607
##  4 STORM SURGE        43323541000
##  5 FLASH FLOOD        17562132111
##  6 DROUGHT            15018672000
##  7 HURRICANE          14610229010
##  8 RIVER FLOOD        10148404500
##  9 ICE STORM           8967041810
## 10 TROPICAL STORM      8382236550
## 11 WINTER STORM        6715441260
## 12 HIGH WIND           5908617580
## 13 WILDFIRE            5060586800
## 14 TSTM WIND           5038936340
## 15 STORM SURGE/TIDE    4642038000
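
The multiplier table above is matched to the exponent codes by position, so it relies on the order that `sort()` returns, which can vary with the locale. Any code that appears in CROPDMGEXP but not in PROPDMGEXP (for example a lowercase "k", if present) would also be looked up as NA. As a hedged alternative, not the method used in this report, the sketch below maps each code to its multiplier by name. The names `expMultiplier` and `toMultiplier` are hypothetical, the values mirror the table above as I read it, and treating "k" like "K" and unknown codes as 0 are assumptions.

```r
# Named lookup: each exponent code maps directly to its multiplier.
# Values mirror the positional table above ("-" and "?" -> 0, "+" -> 1,
# digits 0-8 -> 10); the lowercase "k" entry is an added assumption.
expMultiplier <- c(
  "-" = 0, "?" = 0, "+" = 1,
  setNames(rep(10, 9), as.character(0:8)),
  "H" = 1e2, "h" = 1e2,
  "K" = 1e3, "k" = 1e3,
  "M" = 1e6, "m" = 1e6,
  "B" = 1e9
)

# Convert a vector of exponent codes to numeric multipliers.
# Empty strings and unknown codes index as NA and are set to 0 (assumption).
toMultiplier <- function(code) {
  m <- unname(expMultiplier[as.character(code)])
  m[is.na(m)] <- 0
  m
}

codes <- c("K", "k", "M", "", "B", "5", "?")
print(toMultiplier(codes))
```

With this helper, a column such as `PROPDMG * toMultiplier(PROPDMGEXP)` would not depend on sort order. Totals could differ slightly from the table above if the data contain codes that the positional lookup turned into NA.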

Health impact: to analyze the health impact, we calculate the total fatalities and total injuries for each event type.

library(dplyr)
finFatalities <- finalData %>% select(EVTYPE, FATALITIES) %>% group_by(EVTYPE) %>% summarise(tolFatalities = sum(FATALITIES)) %>% arrange(-tolFatalities)
## `summarise()` ungrouping output (override with `.groups` argument)
head(finFatalities, 15)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 15 x 2
##    EVTYPE            tolFatalities
##    <chr>                     <dbl>
##  1 TORNADO                    5633
##  2 EXCESSIVE HEAT             1903
##  3 FLASH FLOOD                 978
##  4 HEAT                        937
##  5 LIGHTNING                   816
##  6 TSTM WIND                   504
##  7 FLOOD                       470
##  8 RIP CURRENT                 368
##  9 HIGH WIND                   248
## 10 AVALANCHE                   224
## 11 WINTER STORM                206
## 12 RIP CURRENTS                204
## 13 HEAT WAVE                   172
## 14 EXTREME COLD                160
## 15 THUNDERSTORM WIND           133
finInjuries <- finalData %>% select(EVTYPE, INJURIES) %>% group_by(EVTYPE) %>% summarise(tolInjuries = sum(INJURIES)) %>% arrange(-tolInjuries)
## `summarise()` ungrouping output (override with `.groups` argument)
head(finInjuries, 15)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 15 x 2
##    EVTYPE            tolInjuries
##    <chr>                   <dbl>
##  1 TORNADO                 91346
##  2 TSTM WIND                6957
##  3 FLOOD                    6789
##  4 EXCESSIVE HEAT           6525
##  5 LIGHTNING                5230
##  6 HEAT                     2100
##  7 ICE STORM                1975
##  8 FLASH FLOOD              1777
##  9 THUNDERSTORM WIND        1488
## 10 HAIL                     1361
## 11 WINTER STORM             1321
## 12 HURRICANE/TYPHOON        1275
## 13 HIGH WIND                1137
## 14 HEAVY SNOW               1021
## 15 WILDFIRE                  911
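
The two summaries above can also be produced in a single pass. This is a minimal sketch on made-up data (the `toy` data frame and the `healthTotals` name are hypothetical); it keeps the column names used above and sets `.groups`, which also silences the "ungrouping output" message.

```r
library(dplyr)

# Toy stand-in for finalData (hypothetical values, for illustration only)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 2, 5),
  INJURIES   = c(15, 2, 0, 1)
)

# Total fatalities and injuries per event type in one summarise() call
healthTotals <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    tolFatalities = sum(FATALITIES),
    tolInjuries   = sum(INJURIES),
    .groups = "drop"
  ) %>%
  arrange(desc(tolFatalities))

print(healthTotals)
```

Keeping both totals in one table makes it easier to compare the fatality and injury rankings side by side.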

Results

Health Impact

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.2
g <- ggplot(finFatalities[1:15,], aes(x=reorder(EVTYPE, -tolFatalities), y=tolFatalities)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + ggtitle("Top 15 Event Types by Total Fatalities") + labs(x="Type of Event", y="Total Fatalities")
print(g)

g1 <- ggplot(finInjuries[1:15,], aes(x=reorder(EVTYPE, -tolInjuries), y=tolInjuries)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + ggtitle("Top 15 Event Types by Total Injuries") + labs(x="Type of Event", y="Total Injuries")
print(g1)

The graphs show that tornadoes caused the highest health impact, in both fatalities and injuries.

Now, we will study the impact on the economy.

g2 <- ggplot(finDamageTol[1:15,], aes(x=reorder(EVTYPE, -TOLEVTYPE), y=TOLEVTYPE)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + ggtitle("Top 15 Event Types by Economic Impact") + labs(x="Type of Event", y="Total Impact on Economy")
print(g2)

It can be concluded that floods caused the highest economic impact.
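
Because the damage totals run into the hundreds of billions, the axis can be easier to read when the values are rescaled. The sketch below is one possible way to do that, using the top three totals copied from the damage table earlier in this report. The `topDamage` and `g3` names are hypothetical, and the axis label assumes the totals are in US dollars.

```r
library(ggplot2)

# Top three totals copied from the damage table earlier in the report
topDamage <- data.frame(
  EVTYPE    = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO"),
  TOLEVTYPE = c(150319678250, 71913712800, 57352117607)
)

# Rescale to billions so the axis labels stay short
topDamage$billions <- topDamage$TOLEVTYPE / 1e9

# Horizontal bars keep long event names readable without rotating text
g3 <- ggplot(topDamage, aes(x = reorder(EVTYPE, billions), y = billions)) +
  geom_col() +
  coord_flip() +
  labs(x = "Type of Event", y = "Total damage (billions of USD, assumed unit)")
print(g3)
```

The same rescaling could be applied to `finDamageTol[1:15,]` before plotting.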