This report covers a basic exploratory analysis of the NOAA storm data to identify the most harmful event types across the USA from 1950 to 2011. See also the National Climatic Data Center Storm Events FAQ.
The most important thing in data analysis is the question; the data come second. So let's define our questions before doing anything else:
Across the United States
1-Which types of events are most harmful with respect to population health?
2-Which types of events have the greatest economic consequences?
So far so good. The Storm Data Documentation gives us a hint about the different variables. Let's get the NOAA storm data from this link and prepare it for this quick study:
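library(dplyr) # provides %>%, group_by(), summarise(), arrange() and top_n() used below
# file.path() builds a portable path instead of hard-coding the Windows "\\" separator
dir.create("data-raw", showWarnings = FALSE)
rawDataFilePath <- file.path("data-raw", "NOAA.bz2")
# mode = "wb" keeps the compressed file intact when downloading on Windows
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              destfile = rawDataFilePath, mode = "wb")
rawData <- read.csv(rawDataFilePath, stringsAsFactors = TRUE) # factors, as in the str() output below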
str(rawData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
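The rest of this study uses only a handful of these 37 columns (EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG), so loading everything is optional. Here is a minimal sketch of reading just the columns of interest with base R. `read_columns` is a hypothetical helper name, and the tiny CSV is made up so the example runs without the full download.

```r
# Sketch: read only selected columns from a CSV (plain or bz2-compressed).
# `read_columns` is a hypothetical helper, not part of the original report.
read_columns <- function(path, keep) {
  # Read the header plus one row, just to learn the column names and order.
  header <- names(read.csv(path, nrows = 1, check.names = FALSE))
  # "NULL" tells read.csv to skip a column; NA lets it guess the type.
  classes <- ifelse(header %in% keep, NA_character_, "NULL")
  read.csv(path, colClasses = classes, check.names = FALSE)
}

# Tiny stand-in file (made-up rows) so the example is self-contained.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("EVTYPE,MAG,FATALITIES,INJURIES",
             "TORNADO,0,1,15",
             "HAIL,175,0,0"), demo_path)
demo <- read_columns(demo_path, c("EVTYPE", "FATALITIES", "INJURIES"))
print(names(demo))
```

With the real file, the same call would presumably take `rawDataFilePath` and the five column names listed above.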
We have 902,297 observations of 37 variables. From the assignment we know that the “EVTYPE” variable defines the event type. So how can we start? Which variables determine the health or economic impact?
I read the storm data documentation and could not find a clear code book or variable definitions, but I concluded the following. As per “2.6 Fatalities/Injuries” (page 9), the “FATALITIES” and “INJURIES” variables define the population health impact. As per “2.7 Damage” (page 12), the “CROPDMG” and “PROPDMG” variables define the economic impact. Note that “PROPDMGEXP” and “CROPDMGEXP” hold the magnitude codes (such as K, M, B) for those amounts, which this quick study does not apply. So let's reshape our dataset to include only those variables and remove NA data, starting with the health impact variables:
healthImpactData <- rawData[, c("EVTYPE", "FATALITIES", "INJURIES")] %>%
  na.omit() %>%
  group_by(EVTYPE) %>%
  summarise(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), both = FATALITIES + INJURIES)
str(healthImpactData)
## Classes 'tbl_df', 'tbl' and 'data.frame': 985 obs. of 4 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ both : num 0 0 0 0 0 0 0 0 0 0 ...
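The first EVTYPE level shown above, " HIGH SURF ADVISORY", starts with a space, which suggests that some of the 985 labels may differ only in whitespace or letter case. If so, their totals are split across several groups. Below is a small sketch of normalising labels before grouping. The rows and label variants are made up for illustration, and the aggregation uses base R rather than dplyr to stay dependency-free.

```r
# Sketch: normalise event labels before grouping, on made-up rows.
# The label variants here are illustrative, not taken from the dataset.
events <- data.frame(
  EVTYPE     = c("TORNADO", " tornado", "Tornado ", "HAIL"),
  FATALITIES = c(2, 1, 0, 0),
  INJURIES   = c(10, 3, 1, 4),
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace and upper-case so variants collapse together.
events$EVTYPE <- toupper(trimws(events$EVTYPE))

# Sum both measures per cleaned label (base R analogue of group_by + summarise).
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = events, FUN = sum)
print(totals)
```

This only handles whitespace and case. Merging genuinely different spellings of the same event would need a manual mapping, which is outside this quick study.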
Super, so let's figure out which event is the most harmful in terms of population health:
# set up a 1 x 2 plot area
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
# rank by INJURIES explicitly; a bare top_n(3) would rank by the last column (both)
injData <- arrange(healthImpactData, desc(INJURIES)) %>% top_n(3, INJURIES)
barplot(injData$INJURIES, names.arg = injData$EVTYPE, main = "Top 3 INJURIES",
ylab = "INJURIES", cex.axis = 0.8, cex.names = 0.7, las = 2)
# rank by FATALITIES explicitly, for the same reason
facData <- arrange(healthImpactData, desc(FATALITIES)) %>% top_n(3, FATALITIES)
barplot(facData$FATALITIES, names.arg = facData$EVTYPE, main = "Top 3 FATALITIES",
ylab = "FATALITIES", cex.axis = 0.8, cex.names = 0.7, las = 2)
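dplyr's top_n() ranks by the last column of the table when no weighting column is given, so a "top 3 by fatalities" can quietly turn into a "top 3 by both". The sketch below shows the difference on made-up totals, using a dependency-free base R helper. `top_by` is a hypothetical name, not a dplyr function.

```r
# Sketch: top-n rows by an explicitly named column, on made-up totals.
totals <- data.frame(
  EVTYPE     = c("A", "B", "C", "D"),
  FATALITIES = c(50, 5, 20, 1),
  INJURIES   = c(10, 300, 40, 200),
  stringsAsFactors = FALSE
)
totals$both <- totals$FATALITIES + totals$INJURIES

# `top_by` is a hypothetical helper: sort descending by `column`, keep n rows.
top_by <- function(df, column, n) {
  head(df[order(df[[column]], decreasing = TRUE), ], n)
}

top_fatal <- top_by(totals, "FATALITIES", 2)
top_both  <- top_by(totals, "both", 2)
print(top_fatal$EVTYPE)
print(top_both$EVTYPE)
```

In this toy data the two rankings pick entirely different event types, which is why naming the ranking column matters.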
So let's combine both variables to get the overall picture:
par(mfrow = c(1, 1))
healthImpactData <- arrange(healthImpactData, desc(both)) %>% top_n(10, both)
barplot(healthImpactData$both, names.arg = healthImpactData$EVTYPE, main = "Top 10 events by population health impact (FATALITIES + INJURIES), USA, 1950-2011", ylab = "Fatalities + injuries", las = 1)
So, it's clear from the plot above that the tornado is the most harmful natural event for population health across the USA from 1950 to 2011.
We will repeat the above steps for economic impact, but show only the overall plot:
economicImpactData <- rawData[, c("EVTYPE", "PROPDMG", "CROPDMG")] %>% na.omit() %>% group_by(EVTYPE) %>%
  summarise(PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG), both = PROPDMG + CROPDMG) %>%
  arrange(desc(both)) %>% top_n(10, both)
barplot(economicImpactData$both, names.arg = economicImpactData$EVTYPE, main = "Top 10 events by economic impact (PROPDMG + CROPDMG), USA, 1950-2011", ylab = "PROPDMG + CROPDMG (raw values)", las = 1)
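The sums above add up raw PROPDMG and CROPDMG values. In the dataset each amount is paired with an exponent code (PROPDMGEXP, CROPDMGEXP). The Storm Data documentation describes "K", "M" and "B" as thousands, millions and billions. Below is a rough sketch of how the codes could be applied before summing. `exp_multiplier` is a hypothetical helper, the rows are made up, and treating every other code as a multiplier of 1 is an assumption of this sketch, not something the documentation states.

```r
# Sketch: turn a damage magnitude plus its exponent code into dollars.
# Assumption: K/M/B (any case) mean thousand/million/billion; every other
# code (blank, digits, "?", "+", "-") is treated as a multiplier of 1 here.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1  # unknown or empty codes fall back to 1
  mult
}

# Made-up rows standing in for PROPDMG / PROPDMGEXP.
damage <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)
damage$PROPDMG_USD <- damage$PROPDMG * exp_multiplier(damage$PROPDMGEXP)
print(damage)
```

Applying such a conversion to both damage columns before grouping could change the economic ranking, so the plot above is best read as a ranking of raw magnitudes.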
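By these measures, the tornado is the most harmful event for both population health and the economy from 1950 to 2011. The economic ranking rests on raw PROPDMG and CROPDMG sums, without the exponent codes applied.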
Thank you for reading this quick analysis.