Synopsis

This script explores the NOAA Storm Data set to find the severe weather events that most affect public health (fatalities and injuries) and cause the most economic damage (property damage and crop damage).

To do this:

  1. Loaded the data into R. The file is large and bzip2-compressed, so it is read through a bzfile connection, which decompresses it on the fly.

  2. Used dplyr to select the desired columns and summarise the data.

  3. Used ggplot2 to visualize the data.

  4. Concluded that TORNADO causes the most fatalities and injuries, and the most economic damage.

Data Processing

# download the repdata-data-StormData.csv.bz2 file to your working directory
setwd("/Users/Tammy/online_courses/data_science")
dat<- read.csv(bzfile("repdata-data-StormData.csv.bz2"),quote="\"", stringsAsFactors=FALSE)
library(dplyr)
# convert to local dataframe
dat<- tbl_df(dat)
dat
## Source: local data frame [902,297 x 37]
## 
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1        1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2        1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3        1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4        1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5        1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6        1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
## 7        1 11/16/1951 0:00:00     0100       CST      9     BLOUNT    AL
## 8        1  1/22/1952 0:00:00     0900       CST    123 TALLAPOOSA    AL
## 9        1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA    AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE    AL
## ..     ...                ...      ...       ...    ...        ...   ...
## Variables not shown: EVTYPE (chr), BGN_RANGE (dbl), BGN_AZI (chr),
##   BGN_LOCATI (chr), END_DATE (chr), END_TIME (chr), COUNTY_END (dbl),
##   COUNTYENDN (lgl), END_RANGE (dbl), END_AZI (chr), END_LOCATI (chr),
##   LENGTH (dbl), WIDTH (dbl), F (int), MAG (dbl), FATALITIES (dbl),
##   INJURIES (dbl), PROPDMG (dbl), PROPDMGEXP (chr), CROPDMG (dbl),
##   CROPDMGEXP (chr), WFO (chr), STATEOFFIC (chr), ZONENAMES (chr), LATITUDE
##   (dbl), LONGITUDE (dbl), LATITUDE_E (dbl), LONGITUDE_ (dbl), REMARKS
##   (chr), REFNUM (dbl)
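
As a small illustration of the loading step, read.csv can read a bzip2-compressed file through a bzfile connection without unpacking it to disk first. The sketch below uses a tiny made-up table and a temporary file (both hypothetical, not the storm data).

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

# Write it to a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back: bzfile() decompresses on the fly
toy_back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(toy_back)
unlink(tmp)
```

The round trip returns the same two rows, which is all the real call on the storm file relies on.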

There are too many columns, so let's select only the ones of interest.

dat_select<- select(dat, 
                    c(STATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,CROPDMG))

dat_select
## Source: local data frame [902,297 x 6]
## 
##    STATE  EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1     AL TORNADO          0       15    25.0       0
## 2     AL TORNADO          0        0     2.5       0
## 3     AL TORNADO          0        2    25.0       0
## 4     AL TORNADO          0        2     2.5       0
## 5     AL TORNADO          0        2     2.5       0
## 6     AL TORNADO          0        6     2.5       0
## 7     AL TORNADO          0        1     2.5       0
## 8     AL TORNADO          0        0     2.5       0
## 9     AL TORNADO          1       14    25.0       0
## 10    AL TORNADO          0        0    25.0       0
## ..   ...     ...        ...      ...     ...     ...
# summary statistics
str(dat_select)
## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  6 variables:
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
summary(dat_select)
##     STATE              EVTYPE            FATALITIES      
##  Length:902297      Length:902297      Min.   :  0.0000  
##  Class :character   Class :character   1st Qu.:  0.0000  
##  Mode  :character   Mode  :character   Median :  0.0000  
##                                        Mean   :  0.0168  
##                                        3rd Qu.:  0.0000  
##                                        Max.   :583.0000  
##     INJURIES            PROPDMG           CROPDMG       
##  Min.   :   0.0000   Min.   :   0.00   Min.   :  0.000  
##  1st Qu.:   0.0000   1st Qu.:   0.00   1st Qu.:  0.000  
##  Median :   0.0000   Median :   0.00   Median :  0.000  
##  Mean   :   0.1557   Mean   :  12.06   Mean   :  1.527  
##  3rd Qu.:   0.0000   3rd Qu.:   0.50   3rd Qu.:  0.000  
##  Max.   :1700.0000   Max.   :5000.00   Max.   :990.000

Results

Rank fatalities and injuries by event type

dat_select %>% group_by(EVTYPE) %>% summarise(sum_FATALITIES=sum(FATALITIES),
                                              sum_INJURIES=sum(INJURIES)) %>%
        # desc() takes one column, so each sort key gets its own desc()
        arrange(desc(sum_FATALITIES), desc(sum_INJURIES))
## Source: local data frame [985 x 3]
## 
##            EVTYPE sum_FATALITIES sum_INJURIES
## 1         TORNADO           5633        91346
## 2  EXCESSIVE HEAT           1903         6525
## 3     FLASH FLOOD            978         1777
## 4            HEAT            937         2100
## 5       LIGHTNING            816         5230
## 6       TSTM WIND            504         6957
## 7           FLOOD            470         6789
## 8     RIP CURRENT            368          232
## 9       HIGH WIND            248         1137
## 10      AVALANCHE            224          170
## ..            ...            ...          ...

We see that TORNADO causes the most fatalities.
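
The ranking step is a grouped sum followed by a two-key sort. In current dplyr, desc() accepts a single vector, so a sort on two keys uses one desc() call per key. A minimal sketch on made-up toy records (the data frame `toy` and its values are hypothetical):

```r
library(dplyr)

# Made-up toy records; event types A and B tie on total fatalities
toy <- data.frame(EVTYPE = c("A", "A", "B", "C", "C"),
                  FATALITIES = c(1, 2, 3, 0, 1),
                  INJURIES = c(5, 0, 9, 2, 0),
                  stringsAsFactors = FALSE)

ranked <- toy %>%
        group_by(EVTYPE) %>%
        summarise(sum_FATALITIES = sum(FATALITIES),
                  sum_INJURIES = sum(INJURIES)) %>%
        # fatalities first; injuries break ties
        arrange(desc(sum_FATALITIES), desc(sum_INJURIES))
print(ranked)
```

Here A and B both total 3 fatalities, and B is listed first because it has more injuries.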

Boxplot of fatalities caused by severe weather events

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
fatalities<- dat_select %>% 
        group_by(EVTYPE) %>% 
        summarise(sum_FATALITIES=sum(FATALITIES),sum_INJURIES=sum(INJURIES)) %>%
        arrange(desc(sum_FATALITIES), desc(sum_INJURIES))

# ylim() drops event types whose totals fall outside 5..1000
ggplot(fatalities) + geom_boxplot(aes(x=1, y=sum_FATALITIES)) + ylim(c(5,1000)) +
        xlab("severe weather events") + ylab("FATALITIES") +
        ggtitle("boxplot of FATALITIES caused by severe weather events")
## Warning: Removed 910 rows containing non-finite values (stat_boxplot).

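Most other severe weather events caused comparatively few deaths; the 5633 deaths attributed to TORNADO stand out. Note that the y-axis limits of 5 to 1000 exclude totals outside that range from the boxplot, TORNADO included, which is why 910 of the 985 event types were removed.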
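
The "Removed 910 rows" warning comes from how ylim() works: it sets scale limits, and values outside them are turned into NA before the boxplot statistics are computed. If the aim is only to zoom the view, coord_cartesian(ylim = ...) keeps every value. A small sketch with made-up totals (the data frame `toy` is hypothetical):

```r
library(ggplot2)

# Made-up totals per event type, with one extreme value
toy <- data.frame(sum_FATALITIES = c(0, 3, 12, 40, 250, 5633))

# Scale limits: values outside [5, 1000] become NA and are dropped
p_drop <- ggplot(toy) +
        geom_boxplot(aes(x = 1, y = sum_FATALITIES)) +
        ylim(c(5, 1000))

# Coordinate limits: all values are kept, only the view is zoomed
p_zoom <- ggplot(toy) +
        geom_boxplot(aes(x = 1, y = sum_FATALITIES)) +
        coord_cartesian(ylim = c(5, 1000))

# How many toy values fall outside the scale limits
n_outside <- sum(toy$sum_FATALITIES < 5 | toy$sum_FATALITIES > 1000)
print(n_outside)
```

The two plots can differ noticeably, because the first computes its quartiles only from the values that survive the limits.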

dat_select %>% group_by(EVTYPE) %>% summarise(sum_FATALITIES=sum(FATALITIES),
                                              sum_INJURIES=sum(INJURIES)) %>%
        arrange(desc(sum_INJURIES), desc(sum_FATALITIES))
## Source: local data frame [985 x 3]
## 
##               EVTYPE sum_FATALITIES sum_INJURIES
## 1            TORNADO           5633        91346
## 2          TSTM WIND            504         6957
## 3              FLOOD            470         6789
## 4     EXCESSIVE HEAT           1903         6525
## 5          LIGHTNING            816         5230
## 6               HEAT            937         2100
## 7          ICE STORM             89         1975
## 8        FLASH FLOOD            978         1777
## 9  THUNDERSTORM WIND            133         1488
## 10              HAIL             15         1361
## ..               ...            ...          ...

We see that TORNADO causes the most injuries as well.

Rank property damage and crop damage by event type

dat_select %>% group_by(EVTYPE) %>% summarise(sum_PROPDMG=sum(PROPDMG),
                                              sum_CROPDMG=sum(CROPDMG)) %>% 
        arrange(desc(sum_PROPDMG), desc(sum_CROPDMG))
## Source: local data frame [985 x 3]
## 
##                EVTYPE sum_PROPDMG sum_CROPDMG
## 1             TORNADO   3212258.2   100018.52
## 2         FLASH FLOOD   1420124.6   179200.46
## 3           TSTM WIND   1335965.6   109202.60
## 4               FLOOD    899938.5   168037.88
## 5   THUNDERSTORM WIND    876844.2    66791.45
## 6                HAIL    688693.4   579596.28
## 7           LIGHTNING    603351.8     3580.61
## 8  THUNDERSTORM WINDS    446293.2    18684.93
## 9           HIGH WIND    324731.6    17283.21
## 10       WINTER STORM    132720.6     1978.99
## ..                ...         ...         ...

We see that TORNADO causes the most property damage.

dat_select %>% group_by(EVTYPE) %>% summarise(sum_PROPDMG=sum(PROPDMG),
                                              sum_CROPDMG=sum(CROPDMG)) %>% 
        arrange(desc(sum_CROPDMG), desc(sum_PROPDMG))
## Source: local data frame [985 x 3]
## 
##                EVTYPE sum_PROPDMG sum_CROPDMG
## 1                HAIL   688693.38   579596.28
## 2         FLASH FLOOD  1420124.59   179200.46
## 3               FLOOD   899938.48   168037.88
## 4           TSTM WIND  1335965.61   109202.60
## 5             TORNADO  3212258.16   100018.52
## 6   THUNDERSTORM WIND   876844.17    66791.45
## 7             DROUGHT     4099.05    33898.62
## 8  THUNDERSTORM WINDS   446293.18    18684.93
## 9           HIGH WIND   324731.56    17283.21
## 10         HEAVY RAIN    50842.14    11122.80
## ..                ...         ...         ...

We see that HAIL causes the most crop damage.
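
The tables above list TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS as separate event types. If these are treated as one kind of event (an assumption; the analysis here keeps them apart), the labels could be merged before grouping. A minimal sketch, where `clean_evtype` is a hypothetical helper and not part of the original script:

```r
# Labels as they appear in the tables above
evtype <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", "HAIL")

# Hypothetical helper: trim, upper-case, then merge the three wind spellings
clean_evtype <- function(x) {
        x <- toupper(trimws(x))
        x[x %in% c("TSTM WIND", "THUNDERSTORM WINDS")] <- "THUNDERSTORM WIND"
        x
}

cleaned <- clean_evtype(evtype)
print(table(cleaned))
```

Applied to the EVTYPE column before group_by, this would pool the three wind rows into one total, which could change their position in the rankings.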

Add PROPDMG and CROPDMG together into a new column called ECONOMYDMG.

dat_select %>% mutate(ECONOMYDMG= PROPDMG + CROPDMG) %>% group_by(EVTYPE) %>%
        summarise(sum_ECONOMYDMG=sum(ECONOMYDMG)) %>% 
        arrange(desc(sum_ECONOMYDMG)) 
## Source: local data frame [985 x 2]
## 
##                EVTYPE sum_ECONOMYDMG
## 1             TORNADO      3312276.7
## 2         FLASH FLOOD      1599325.1
## 3           TSTM WIND      1445168.2
## 4                HAIL      1268289.7
## 5               FLOOD      1067976.4
## 6   THUNDERSTORM WIND       943635.6
## 7           LIGHTNING       606932.4
## 8  THUNDERSTORM WINDS       464978.1
## 9           HIGH WIND       342014.8
## 10       WINTER STORM       134699.6
## ..                ...            ...

We see that TORNADO has the largest combined total of property damage and crop damage.

Economy_dmg<- dat_select %>% 
        mutate(ECONOMYDMG= PROPDMG + CROPDMG) %>% group_by(EVTYPE) %>%
        summarise(sum_ECONOMYDMG=sum(ECONOMYDMG)) %>% 
        arrange(desc(sum_ECONOMYDMG))

Boxplot of the total economic damage for all severe weather event types from 1950 to 2011

ggplot(Economy_dmg) + geom_boxplot(aes(x=1, y=sum_ECONOMYDMG))

median(Economy_dmg$sum_ECONOMYDMG[Economy_dmg$sum_ECONOMYDMG>0])
## [1] 59

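Among event types with nonzero damage, the median summed value is 59. This figure is in the raw units of PROPDMG and CROPDMG: the sums here do not apply the PROPDMGEXP and CROPDMGEXP columns, so it should not be read directly as a dollar amount.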
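
The data set also carries PROPDMGEXP and CROPDMGEXP columns (see the variable list under Data Processing), which the sums above do not use. Assuming the convention that K, M and B stand for thousands, millions and billions (an assumption on my part, and any other code is simply treated as a multiplier of 1 here), a dollar amount could be sketched as follows. The helper `exp_to_multiplier` and the toy rows are hypothetical.

```r
# Hypothetical helper: map an exponent code to a multiplier.
# K/M/B = 1e3/1e6/1e9 is assumed; anything else falls back to 1.
exp_to_multiplier <- function(code) {
        lookup <- c(K = 1e3, M = 1e6, B = 1e9)
        mult <- unname(lookup[toupper(code)])
        mult[is.na(mult)] <- 1
        mult
}

# Made-up rows in the same shape as PROPDMG / PROPDMGEXP
toy <- data.frame(PROPDMG = c(25, 2.5, 1),
                  PROPDMGEXP = c("K", "M", ""),
                  stringsAsFactors = FALSE)
toy$PROPDMG_USD <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
print(toy)
```

With scaling like this, a single row coded in millions or billions can outweigh many rows coded in thousands, so the ranking of event types may differ from the unscaled one.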

The boxplot above shows that the data are heavily skewed, so let's apply a log2 transformation and draw the boxplot again.

ggplot(Economy_dmg) + geom_boxplot(aes(x=1, y=log2(sum_ECONOMYDMG+1))) +
        xlab("severe weather events") + ylab("log2 transformed economic damage") +
        ggtitle("boxplot of economic damage caused by severe weather events, log2 scale")
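
A quick numeric check of why the plot uses log2(x + 1) rather than log2(x): some event types have a summed damage of zero, and log2(0) is -Inf, which a boxplot cannot draw. Adding 1 maps zero to zero and keeps the ordering. The vector below takes three values from the results above and adds a zero for illustration.

```r
# Summed damage values taken from the results above, plus a zero
x <- c(0, 59, 134699.6, 3312276.7)

# Adding 1 before taking log2 keeps zeros finite
log2_x <- log2(x + 1)
print(round(log2_x, 2))

# Without the +1 the zero becomes -Inf
print(log2(x)[1])
```

The transformed values span a far narrower range than the raw ones, which is what makes the second boxplot readable.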