This report aims to identify the climatic events that are most harmful to public health and the economy in the USA. It uses the Storm Data database provided by the National Weather Service, and more precisely the Fatalities, Injuries, PropertyDamage and EvType variables (columns FATALITIES, INJURIES, PROPDMG and EVTYPE, together with the unit column PROPDMGEXP).
The original data is available here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
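library(dplyr)    # %>%, rename(), mutate(), select(), group_by(), summarise()
library(ggplot2)  # ggplot() and the scales used in the plots
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              destfile = "sourceData.bz2",
              mode = "wb")  # binary mode keeps the archive intact on Windows
rawData <- read.table(file = "sourceData.bz2", sep = ",", header = TRUE)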
str(rawData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The next section contains the code that applies the changes needed to continue with the analysis. It also creates a new dataset that includes only the fields useful for the analysis.
Modifications
newRawData <- rawData %>%
  rename(START_DATE = BGN_DATE,
         EVENT_TYPE = EVTYPE,
         PROPERTY_DAMAGE = PROPDMG,
         STATE_CODE = STATE__,
         PROPERTY_DAMAGE_EXP = PROPDMGEXP) %>%
  mutate(START_DATE = as.Date(START_DATE, format = "%m/%d/%Y"))
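As a quick check of the date conversion, the sketch below parses one of the BGN_DATE strings shown in the str(rawData) output. It relies on as.Date() stopping once the format has been matched, so the trailing time is ignored. The variable names rawDate and parsedDate are illustrative only.

```r
# One raw BGN_DATE value, copied from the str(rawData) output
rawDate <- "4/18/1950 0:00:00"

# The format describes only month/day/year; the trailing " 0:00:00" is ignored
parsedDate <- as.Date(rawDate, format = "%m/%d/%Y")

print(parsedDate)
print(class(parsedDate))
```

The result is the Date 1950-04-18, which matches the first START_DATE value in the str(dataToWork) output.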
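To group reliably by EVENT_TYPE, the labels must first be normalized: surrounding whitespace is trimmed, the text is converted to upper case, and the HEAT WAVE and EXCESSIVE HEAT labels are merged into HEAT.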
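Before running them on the full data set, the same three steps (trim, upper-case, merge the heat labels) can be tried on a few labels. This is a minimal sketch in base R; the vector eventTypes and its values are made up for illustration and are not taken from the data.

```r
# Hypothetical raw labels showing the inconsistencies the cleaning targets
eventTypes <- c(" TORNADO", "Tornado", "HEAT WAVE", "Excessive Heat ", "HEAT")

# Trim surrounding whitespace, then convert to upper case
cleaned <- toupper(trimws(eventTypes, which = "both"))

# Merge the two heat-related labels into a single HEAT label
cleaned <- ifelse(cleaned %in% c("HEAT WAVE", "EXCESSIVE HEAT"), "HEAT", cleaned)

# Count how many raw labels fall into each cleaned group
print(table(cleaned))
```

The five raw labels collapse into two groups, TORNADO (2) and HEAT (3). Without the cleaning, a group by would treat them as five different event types. The full data set is cleaned the same way with dplyr: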
newRawData <- newRawData %>%
  mutate(EVENT_TYPE = trimws(EVENT_TYPE, which = "both")) %>%
  mutate(EVENT_TYPE = toupper(EVENT_TYPE)) %>%
  mutate(EVENT_TYPE = ifelse(EVENT_TYPE %in% c("HEAT WAVE", "EXCESSIVE HEAT"),
                             "HEAT", EVENT_TYPE))
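To compute the property damage, note that the amounts are expressed in different units, marked K (thousands), M (millions) and B (billions), so they must be converted to a common unit. Here everything is expressed in thousands of dollars, and rows with any other unit code are set to 0.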
newRawData <- newRawData %>%
  mutate(PROPERTY_DAMAGE_THOUSAND = case_when(
    PROPERTY_DAMAGE_EXP %in% c("M", "m") ~ PROPERTY_DAMAGE * 1000,     # millions
    PROPERTY_DAMAGE_EXP == "B"           ~ PROPERTY_DAMAGE * 1000000,  # billions
    PROPERTY_DAMAGE_EXP == "K"           ~ PROPERTY_DAMAGE,            # already thousands
    TRUE                                 ~ 0                           # any other code
  ))
dataToWork <- newRawData %>%
  select(EVENT_TYPE, FATALITIES, INJURIES, PROPERTY_DAMAGE, PROPERTY_DAMAGE_EXP,
         PROPERTY_DAMAGE_THOUSAND, STATE_CODE, STATE, START_DATE, COUNTY, COUNTYNAME)
str(dataToWork)
## 'data.frame': 902297 obs. of 11 variables:
## $ EVENT_TYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES : num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPERTY_DAMAGE : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPERTY_DAMAGE_EXP : chr "K" "K" "K" "K" ...
## $ PROPERTY_DAMAGE_THOUSAND: num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ STATE_CODE : num 1 1 1 1 1 1 1 1 1 1 ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ START_DATE : Date, format: "1950-04-18" "1950-04-18" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME : chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
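The unit conversion above can be checked on a handful of rows. The sketch below is an alternative formulation, not the code used in this report: it uses a named lookup vector in base R, and the sample values are made up for illustration. Like the code above, it counts only K, M (or m) and B and sets every other code to 0.

```r
# Made-up sample rows: damage amount and its unit code
sampleDamage <- data.frame(
  PROPERTY_DAMAGE     = c(25, 2.5, 1.2, 5, 10),
  PROPERTY_DAMAGE_EXP = c("K", "M", "B", "m", ""),
  stringsAsFactors    = FALSE
)

# Multiplier that expresses each unit in thousands of dollars
multiplier <- c(K = 1, M = 1000, m = 1000, B = 1000000)

# Look up the multiplier by code; codes without an entry give NA and are set to 0
factorByRow <- unname(multiplier[sampleDamage$PROPERTY_DAMAGE_EXP])
factorByRow[is.na(factorByRow)] <- 0

sampleDamage$PROPERTY_DAMAGE_THOUSAND <- sampleDamage$PROPERTY_DAMAGE * factorByRow
print(sampleDamage)
```

For example, 2.5 with code M becomes 2500 (thousands), while the row with an empty code contributes 0 to any later sum.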
The following summaries and plots describe the variables used in the analysis and support the goal of the report. They show how the event types compare in terms of fatalities, injuries and property damage.
summary(dataToWork$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
# Total fatalities per event type; keep only types with more than 100 deaths
fatalitiesByEventType <- dataToWork %>%
  group_by(EVENT_TYPE) %>%
  summarise(deadByEvent = sum(FATALITIES)) %>%
  filter(deadByEvent > 100)
ggplot(fatalitiesByEventType, aes(x = EVENT_TYPE, y = deadByEvent)) +
  geom_point() +
  scale_y_continuous(trans = "log10") +
  scale_x_discrete(guide = guide_axis(angle = 90))
summary(dataToWork$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
# Total injuries per event type; keep only types with more than 100 injuries
injuriesByEventType <- dataToWork %>%
  group_by(EVENT_TYPE) %>%
  summarise(injuriesByEvent = sum(INJURIES)) %>%
  filter(injuriesByEvent > 100)
ggplot(injuriesByEventType, aes(x = EVENT_TYPE, y = injuriesByEvent)) +
  geom_point() +
  scale_x_discrete(guide = guide_axis(angle = 90))
summary(dataToWork$PROPERTY_DAMAGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.06 0.50 5000.00
# Total damage per event type in thousands of dollars; keep types above 100,000 (i.e. $100 million)
damageByEventType <- dataToWork %>%
  group_by(EVENT_TYPE) %>%
  summarise(damageByEvent = sum(PROPERTY_DAMAGE_THOUSAND)) %>%
  filter(damageByEvent > 100000)
ggplot(damageByEventType, aes(x = EVENT_TYPE, y = damageByEvent)) +
  geom_point() +
  scale_y_continuous(trans = "log10") +
  scale_x_discrete(guide = guide_axis(angle = 90))
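The plots show the totals per event type but leave the ranking to the eye. As a possible complement, the sketch below sorts aggregated totals to list the most harmful event types. It uses base R and made-up values, so toyData, totals, ranking and topEvents are hypothetical names and the numbers are not results from the Storm Data set.

```r
# Made-up records (not results from the Storm Data set)
toyData <- data.frame(
  EVENT_TYPE = c("TORNADO", "HEAT", "FLOOD", "TORNADO", "HEAT", "HAIL"),
  FATALITIES = c(3, 5, 1, 2, 4, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities per event type
totals <- aggregate(FATALITIES ~ EVENT_TYPE, data = toyData, FUN = sum)

# Sort from most to least harmful and keep the top three
ranking <- totals[order(totals$FATALITIES, decreasing = TRUE), ]
topEvents <- head(ranking, 3)
print(topEvents)
```

The same pattern could be applied to fatalitiesByEventType, injuriesByEventType and damageByEventType by ordering on their total columns.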