Synopsis

This report aims to identify the climatic events that are most harmful to health and to the economy in the USA. To do that, it uses the Storm Data database provided by the National Weather Service. More precisely, it uses the FATALITIES, INJURIES, PROPDMG and EVTYPE variables, together with PROPDMGEXP, which gives the unit of PROPDMG.

Data Processing

Reading and loading the raw data

The original data is available here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

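 # Packages that provide %>%, rename(), mutate(), ggplot(), etc.
 library(dplyr)
 library(ggplot2)

 download.file(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "sourceData.bz2")

 # read.table() reads the bzip2-compressed file directly
 rawData <- read.table(file="sourceData.bz2", sep = ",", header = TRUE)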
 
 str(rawData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
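
Since the dataset has more than 900,000 rows, it can be convenient to skip the download when a local copy already exists. The following is a minimal sketch of that idea; the helper name `download_if_missing` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: download `url` to `destfile` only when the file is not
# already present. Returns TRUE if a download was attempted, FALSE otherwise.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)
  }
  download.file(url = url, destfile = destfile, mode = "wb")
  TRUE
}

# Example with a file that already exists, so nothing is downloaded
existing <- tempfile(fileext = ".bz2")
file.create(existing)
skipped <- !download_if_missing(
  "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
  existing
)
```

In the report, the same helper could be called with `destfile = "sourceData.bz2"` before `read.table()`.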

Transformations

The next section includes the code that applies the changes needed to continue with the analysis. It also creates a new dataset that includes only the fields useful for the analysis.

Modifications

  1. Rename the variables
  2. Convert the START_DATE variable from string to Date format

newRawData <- rawData %>%
  rename(START_DATE = BGN_DATE,
         EVENT_TYPE = EVTYPE,
         PROPERTY_DAMAGE = PROPDMG,
         STATE_CODE = STATE__,
         PROPERTY_DAMAGE_EXP = PROPDMGEXP) %>%
  # Only the date part is parsed; the time part of the string is dropped
  mutate(START_DATE = as.Date(START_DATE, format = "%m/%d/%Y"))
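
As a small illustration of the date conversion, the sketch below parses two BGN_DATE values taken from the `str()` output above. The variable names `raw_dates`, `parsed` and `bad` are only illustrative.

```r
# Two BGN_DATE values copied from the str() output of rawData
raw_dates <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00")

# The format describes only the date part; the trailing " 0:00:00" is ignored
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# A value that does not match the format becomes NA instead of raising an error
bad <- as.Date("1950-04-18", format = "%m/%d/%Y")
```

Counting the `NA` values with `sum(is.na(newRawData$START_DATE))` would be one way to check that every date was parsed.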

To group the data by EVENT_TYPE successfully, the following changes are required:

  1. Trim the leading and trailing whitespace of the EVENT_TYPE values
  2. Convert all EVENT_TYPE values to uppercase
  3. Merge the HEAT, HEAT WAVE and EXCESSIVE HEAT event types. All of them become HEAT

newRawData <- newRawData %>%
  mutate(EVENT_TYPE = trimws(EVENT_TYPE, which = "both")) %>%
  mutate(EVENT_TYPE = toupper(EVENT_TYPE)) %>%
  # HEAT WAVE and EXCESSIVE HEAT are recorded as HEAT
  mutate(EVENT_TYPE = ifelse(EVENT_TYPE == "HEAT WAVE" | EVENT_TYPE == "EXCESSIVE HEAT",
                             "HEAT", EVENT_TYPE))

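The effect of these three steps can be seen on a handful of made-up labels. This is only a sketch in base R; the vector `raw_types` is hypothetical and does not come from the dataset.

```r
# Hypothetical raw labels with inconsistent case and stray spaces
raw_types <- c(" Heat Wave", "EXCESSIVE HEAT", "heat", "Tornado ")

# Steps 1 and 2: trim whitespace, then convert to uppercase
clean_types <- toupper(trimws(raw_types, which = "both"))

# Step 3: record the heat-related labels as HEAT
clean_types <- ifelse(clean_types %in% c("HEAT WAVE", "EXCESSIVE HEAT"),
                      "HEAT", clean_types)

# Count how many records fall under each normalized label
type_counts <- table(clean_types)
```

Without the normalization, the four labels would form four separate groups; after it, three of them are counted together as HEAT.
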
To determine the property damage, it is necessary to consider that the values are expressed in thousands (K), millions (M) or billions (B), so they must be converted to the same unit.

  1. Create a new column with the values in thousands (K): millions and billions are converted to thousands. Values with any other exponent are counted as 0.

newRawData <- newRawData %>%
  mutate(PROPERTY_DAMAGE_THOUSAND =
           ifelse(PROPERTY_DAMAGE_EXP == "M" | PROPERTY_DAMAGE_EXP == "m",
                  PROPERTY_DAMAGE * 1000,        # millions to thousands
           ifelse(PROPERTY_DAMAGE_EXP == "B",
                  PROPERTY_DAMAGE * 1000000,     # billions to thousands
           ifelse(PROPERTY_DAMAGE_EXP == "K",
                  PROPERTY_DAMAGE,               # already in thousands
                  0))))                          # any other exponent
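
The same conversion can also be written with a lookup table of multipliers. The sketch below is an alternative, not the code used in this report: the function name `to_thousands` is hypothetical, and unlike the nested `ifelse` it also accepts lowercase `k` and `b`, which is an assumption about how such values should be treated.

```r
# Hypothetical helper: convert a damage value and its exponent code to thousands
to_thousands <- function(value, exp) {
  # Multiplier that turns each unit into thousands
  multipliers <- c(K = 1, M = 1e3, B = 1e6)
  m <- multipliers[toupper(exp)]
  # Codes that are not K, M or B (including the empty string) give NA here
  m[is.na(m)] <- 0
  unname(value * m)
}

# Worked example: 25 K, 2.5 m, 1 B and 5 with an empty exponent
converted <- to_thousands(c(25, 2.5, 1, 5), c("K", "m", "B", ""))
```

With these inputs, 25 K stays at 25, 2.5 million becomes 2500, 1 billion becomes 1000000, and the value without a recognized exponent becomes 0.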


# Keep only the fields that are useful for the analysis
dataToWork <- newRawData %>%
  select(EVENT_TYPE, FATALITIES, INJURIES,
         PROPERTY_DAMAGE, PROPERTY_DAMAGE_EXP, PROPERTY_DAMAGE_THOUSAND,
         STATE_CODE, STATE, START_DATE, COUNTY, COUNTYNAME)

str(dataToWork)
## 'data.frame':    902297 obs. of  11 variables:
##  $ EVENT_TYPE              : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES              : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES                : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPERTY_DAMAGE         : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPERTY_DAMAGE_EXP     : chr  "K" "K" "K" "K" ...
##  $ PROPERTY_DAMAGE_THOUSAND: num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ STATE_CODE              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ STATE                   : chr  "AL" "AL" "AL" "AL" ...
##  $ START_DATE              : Date, format: "1950-04-18" "1950-04-18" ...
##  $ COUNTY                  : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME              : chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...

Results

This section presents summary information about the variables used in the analysis, together with the plots that help to achieve the goal of the report. The plots show how the event types compare in terms of fatalities, injuries and property damage.

Fatalities

summary(dataToWork$FATALITIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000

Plot of fatalities by event type (only event types with more than 100 fatalities in total, log10 scale on the y axis)

fatalitiesByEventType <- dataToWork %>%
  group_by(EVENT_TYPE) %>%
  summarise(deadByEvent = sum(FATALITIES)) %>%
  filter(deadByEvent > 100)

ggplot(fatalitiesByEventType, aes(x=EVENT_TYPE, y = deadByEvent))+
   geom_point()+
   scale_y_continuous(trans='log10')+
   scale_x_discrete(guide = guide_axis(angle = 90))
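
By default the x axis lists the event types in alphabetical order. If a ranking is easier to read, one option is to order the axis by the total with `reorder()`. The sketch below uses made-up totals; `toyTotals` and `orderedPlot` are hypothetical names, and the numbers are not results from the dataset.

```r
library(ggplot2)

# Hypothetical totals, only to illustrate the ordering
toyTotals <- data.frame(
  EVENT_TYPE  = c("FLOOD", "HEAT", "TORNADO"),
  deadByEvent = c(400, 3000, 5000)
)

# reorder() turns EVENT_TYPE into a factor whose levels follow -deadByEvent,
# so the event type with the largest total comes first on the x axis
toyTotals$EVENT_TYPE <- reorder(toyTotals$EVENT_TYPE, -toyTotals$deadByEvent)

orderedPlot <- ggplot(toyTotals, aes(x = EVENT_TYPE, y = deadByEvent)) +
  geom_point() +
  scale_y_continuous(trans = "log10") +
  scale_x_discrete(guide = guide_axis(angle = 90))
```

The same `reorder()` call could be applied to `fatalitiesByEventType` before plotting.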

Injuries

summary(dataToWork$INJURIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000

Plot of injuries by event type (only event types with more than 100 injuries in total)

injuriesByEventType <- dataToWork %>%
  group_by(EVENT_TYPE) %>%
  summarise(injuriesByEvent = sum(INJURIES)) %>%
  filter(injuriesByEvent > 100)

ggplot(injuriesByEventType, aes(x=EVENT_TYPE, y = injuriesByEvent))+
   geom_point()+
   scale_x_discrete(guide = guide_axis(angle = 90))

Property damage

# Raw PROPDMG values, before the K/M/B exponent is applied
summary(dataToWork$PROPERTY_DAMAGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00

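Plot of property damage by event type (damage in thousands; only event types with a total above 100000, log10 scale on the y axis)

damageByEventType <- dataToWork %>%
  group_by(EVENT_TYPE) %>%
  summarise(damageByEvent = sum(PROPERTY_DAMAGE_THOUSAND)) %>%
  filter(damageByEvent > 100000)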

ggplot(damageByEventType, aes(x=EVENT_TYPE, y = damageByEvent))+
   geom_point()+
   scale_y_continuous(trans='log10')+
   scale_x_discrete(guide = guide_axis(angle = 90))
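
Since the goal is to identify the most harmful events, a ranked table can complement the plots. The following sketch shows the idea on a small made-up data frame; `toyData` and `topFatalities` are hypothetical names, and the values are not taken from the dataset.

```r
library(dplyr)

# Hypothetical records, only to illustrate the ranking
toyData <- data.frame(
  EVENT_TYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 5, 4, 1, 1)
)

# Total fatalities per event type, largest first, keeping the top two
topFatalities <- toyData %>%
  group_by(EVENT_TYPE) %>%
  summarise(deadByEvent = sum(FATALITIES)) %>%
  arrange(desc(deadByEvent)) %>%
  head(2)
```

Replacing `toyData` with `dataToWork`, and FATALITIES with INJURIES or PROPERTY_DAMAGE_THOUSAND, would give the corresponding rankings for the other two variables.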