The analysis answers the following questions:
The present analysis consists of the following steps:
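# Download the compressed NOAA storm data once (binary mode), then read it through a bzfile connection
url_df <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("repdata-data-StormData.csv.bz2"))
  download.file(url_df, "repdata-data-StormData.csv.bz2", mode = "wb")
df <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))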
number_uniqueEVTYPE <- length(unique(df$EVTYPE))
number_Summary <- length(grep("Summary", df$EVTYPE, ignore.case = TRUE))
number_TSTM <- length(grep("TSTM", df$EVTYPE, ignore.case = TRUE))
The EVTYPE variable indicates the type of weather event. There are 985 unique EVTYPE values, which need to be mapped to the 48 event types defined by NOAA. The 48 NOAA Event Types are listed on page 6 of the National Weather Service Storm Data Documentation.
The 48 NOAA Event Types:
StormData <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill",
"Debris Flow", "Dense Fog", "Dense Smoke","Drought", "Dust Devil", "Dust Storm",
"Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze",
"Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf",
"High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood",
"Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide",
"Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression",
"Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm",
"Winter Weather", "Summary")
In addition, 76 records contain summary data rather than individual events, so a separate type, “Summary”, was added for them.
Finally, 227236 records use the abbreviation “TSTM” for “THUNDERSTORM”. The abbreviation was expanded:
df$EVTYPE <- gsub("TSTM", "THUNDERSTORM", df$EVTYPE, ignore.case = TRUE)
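As a small illustration of this replacement, the sketch below applies the same gsub call to a hypothetical vector of event labels (the values in ev are made up for the example and are not taken from the dataset):

```r
# Hypothetical event labels, mixing upper and lower case
ev <- c("TSTM WIND", "Tstm Wind/Hail", "MARINE TSTM WIND", "HAIL")

# Same call as in the analysis: every case-insensitive match of "TSTM"
# is replaced by the upper-case word "THUNDERSTORM"
ev_full <- gsub("TSTM", "THUNDERSTORM", ev, ignore.case = TRUE)

print(ev_full)
```

The replacement text is always upper case, so a label such as "Tstm Wind/Hail" ends up in mixed case. This should not matter for the matching step, which compares strings with ignore.case = TRUE.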
To convert the EVTYPE values to the 48 NOAA Event Types (+ Summary), we used approximate string matching based on string distances, as described in “An introduction to data cleaning with R” by Edwin de Jonge and Mark van der Loo:
A string distance is an algorithm or equation that indicates how much two strings differ from each other. An important distance measure is implemented by R’s native adist function. This function counts how many basic operations are needed to turn one string into another.
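To make the notion of a string distance concrete, here is a minimal sketch with made-up strings. It assumes only base R's adist, which computes the generalized Levenshtein (edit) distance:

```r
# adist() returns a matrix of edit distances: the minimal number of
# insertions, deletions and substitutions needed to turn one string into another
d1 <- adist("kitten", "sitting")

# With ignore.case = TRUE, differences in letter case cost nothing,
# so only the trailing "S" has to be deleted
d2 <- adist("FLASH FLOODS", "Flash Flood", ignore.case = TRUE)

print(d1[1, 1])  # 3
print(d2[1, 1])  # 1
```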
D <- adist(df$EVTYPE, StormData, ignore.case = TRUE)
i <- apply(D, 1, which.min)
df$TYPE <- StormData[i]
The function adist returns the distance matrix D, with one row per EVTYPE value and one column per entry of StormData. For each row of D we then find the index of the smallest distance (which.min returns the first one in case of ties) and create a new column TYPE holding the corresponding NOAA Event Type.
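Because many records share the same EVTYPE label, the distances do not have to be computed record by record. The following is a minimal sketch of an alternative, not the method used in this analysis. The toy vectors evtype and types are hypothetical stand-ins for df$EVTYPE and StormData. Distances are computed for the unique labels only and then mapped back to the records:

```r
# Hypothetical stand-ins for df$EVTYPE and StormData
evtype <- c("THUNDERSTORM WINDS", "TORNADO", "FLASH FLOODING",
            "TORNADO", "HEAVY RAINS", "THUNDERSTORM WINDS")
types  <- c("Flash Flood", "Flood", "Heavy Rain", "Thunderstorm Wind", "Tornado")

# Direct approach, as in the analysis: one row of distances per record
D_all <- adist(evtype, types, ignore.case = TRUE)
type_direct <- types[apply(D_all, 1, which.min)]

# Alternative: distances for the unique labels only, mapped back with match()
u <- unique(evtype)
D_u <- adist(u, types, ignore.case = TRUE)
type_u <- types[apply(D_u, 1, which.min)]
type_mapped <- type_u[match(evtype, u)]

print(data.frame(evtype, type_mapped))
```

Both variants assign the same type to every record, since each row of the distance matrix depends only on the label itself. The second one builds a matrix with one row per distinct label instead of one row per record.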
The analysis aims to identify the most severe weather events. Pareto analysis is a suitable technique for this, because it helps to identify the main causes of a problem. Essentially, it estimates the effect attributable to each cause, then selects the few most significant causes whose combined effect is reasonably close to the total.
To find the types of events that are most harmful to population health, we first restricted the data to events that caused fatalities or injuries and summed the harm for each event type:
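The selection step of a Pareto analysis can be sketched in a few lines of base R. The totals in x are hypothetical and not taken from the storm data. The 80% threshold is the conventional choice and is also the one used later in this report:

```r
# Hypothetical totals per cause (not taken from the storm data)
x <- c(A = 50, B = 6, C = 30, D = 10, E = 4)

# Sort causes from largest to smallest contribution
x_sorted <- sort(x, decreasing = TRUE)

# Cumulative share of the total, in percent
cum_pct <- 100 * cumsum(x_sorted) / sum(x_sorted)

# Smallest leading group of causes whose cumulative share reaches 80%
k <- which(cum_pct >= 80)[1]
vital_few <- names(x_sorted)[seq_len(k)]

print(round(cum_pct, 1))
print(vital_few)
```

In this toy example the causes A and C together account for 80% of the total, so they form the group a Pareto chart would single out.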
library("dplyr")
df_harm <- df %>% filter((FATALITIES + INJURIES) > 0) %>%
group_by(TYPE) %>% summarise(HARM = sum(FATALITIES + INJURIES))
Then we plotted the Pareto chart using the package qcc.
library("qcc")
harm <- df_harm$HARM
names(harm) <- df_harm$TYPE
pareto.chart(harm, main = "Pareto chart for the most harmful events", ylab = "Total personal harm")
##
## Pareto chart analysis for harm
## Frequency Cum.Freq. Percentage Cum.Percent.
## Tornado 96997 96997 62.308171616 62.30817
## Thunderstorm Wind 10236 107233 6.575321347 68.88349
## Excessive Heat 8723 115956 5.603412281 74.48691
## Flood 8528 124484 5.478149711 79.96505
## Lightning 6051 130535 3.886993891 83.85205
## Heat 3649 134184 2.344015982 86.19606
## Flash Flood 2846 137030 1.828191144 88.02426
## Ice Storm 2118 139148 1.360544218 89.38480
## High Wind 1824 140972 1.171686805 90.55649
## Wildfire 1824 142796 1.171686805 91.72817
## Winter Storm 1633 144429 1.048993724 92.77717
## Hail 1377 145806 0.884546453 93.66171
## Hurricane (Typhoon) 1353 147159 0.869129521 94.53084
## Heavy Snow 1251 148410 0.803607562 95.33445
## Rip Current 1106 149516 0.710463600 96.04491
## Blizzard 906 150422 0.581989170 96.62690
## Winter Weather 606 151028 0.389277524 97.01618
## Dust Storm 462 151490 0.296775934 97.31296
## Tropical Storm 449 151939 0.288425096 97.60138
## Strong Wind 442 152381 0.283928491 97.88531
## Avalanche 420 152801 0.269796304 98.15511
## High Surf 408 153209 0.262087838 98.41720
## Funnel Cloud 396 153605 0.254379372 98.67157
## Heavy Rain 386 153991 0.247955651 98.91953
## Dense Fog 362 154353 0.232538719 99.15207
## Seiche 256 154609 0.164447271 99.31652
## Extreme Cold/Wind Chill 171 154780 0.109845638 99.42636
## Tsunami 162 154942 0.104064289 99.53043
## Coastal Flood 130 155072 0.083508380 99.61393
## Cold/Wind Chill 121 155193 0.077727030 99.69166
## Waterspout 93 155286 0.059740610 99.75140
## Marine Thunderstorm Wind 79 155365 0.050747400 99.80215
## Drought 68 155433 0.043681306 99.84583
## Storm Surge/Tide 67 155500 0.043038934 99.88887
## Freezing Fog 51 155551 0.032760980 99.92163
## Dust Devil 45 155596 0.028906747 99.95054
## Marine Strong Wind 37 155633 0.023767770 99.97431
## Marine Hail 26 155659 0.016701676 99.99101
## Marine High Wind 6 155665 0.003854233 99.99486
## Sleet 6 155671 0.003854233 99.99872
## Tropical Depression 2 155673 0.001284744 100.00000
abline(h=(sum(harm)*.8),col="red",lwd=1)
The function pareto.chart returns a table containing the descriptive statistics used to draw the Pareto chart. We also added a horizontal line at 80% of the total harm to identify the most severe events.
To find the types of events that have the greatest economic consequences, we first restricted the data to events with property or crop damage and summed the damage in dollars for each event type:
df_damage <- df %>% filter((PROPDMG + CROPDMG) > 0) %>% group_by(TYPE) %>%
  summarise(DMG = sum(ifelse(PROPDMGEXP == "K", PROPDMG * 10^3,
                      ifelse(PROPDMGEXP == "M", PROPDMG * 10^6,
                      ifelse(PROPDMGEXP == "B", PROPDMG * 10^9, PROPDMG))) +
                      ifelse(CROPDMGEXP == "K", CROPDMG * 10^3,
                      ifelse(CROPDMGEXP == "M", CROPDMG * 10^6,
                      ifelse(CROPDMGEXP == "B", CROPDMG * 10^9, CROPDMG)))))
The variables PROPDMGEXP and CROPDMGEXP give the units of the PROPDMG and CROPDMG variables. Their values K, M, and B mean thousands, millions, and billions. Any other value leaves the amount unchanged in this calculation.
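The unit conversion can also be written as a lookup table instead of nested ifelse calls. The sketch below is an illustration only. The names unit_mult and to_dollars are hypothetical helpers and the input values are made up. It is meant to mirror the calculation above, in which only upper-case K, M and B are scaled:

```r
# Multipliers for the unit codes handled in the analysis
unit_mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert an amount and its unit code to dollars.
# Codes other than "K", "M", "B" (including "" and lower-case letters)
# leave the amount unchanged, mirroring the nested ifelse() calls.
to_dollars <- function(amount, unit) {
  mult <- unname(unit_mult[as.character(unit)])
  mult[is.na(mult)] <- 1
  amount * mult
}

# Made-up example values
amount <- c(25, 2.5, 1.2, 10, 3)
unit   <- c("K", "M", "B", "", "k")
print(to_dollars(amount, unit))
```

Wrapping unit in toupper() inside the helper would also scale lower-case codes such as "k". That would be a change to the method, so the totals could then differ from the ones reported here.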
Plotting the Pareto chart:
damage <- df_damage$DMG
names(damage) <- df_damage$TYPE
pareto.chart(damage, ylab = "Total damage", main = "Pareto chart for the most damaging events")
##
## Pareto chart analysis for damage
## Frequency Cum.Freq. Percentage
## Flood 151264401313 151264401313 3.175332e+01
## Hurricane (Typhoon) 75471243830 226735645143 1.584287e+01
## Tornado 57345497660 284081142803 1.203793e+01
## Storm Surge/Tide 47965834000 332046976803 1.006895e+01
## Flash Flood 28862068967 360909045770 6.058706e+00
## Hail 18760322493 379669368263 3.938154e+00
## Drought 15025751600 394695119863 3.154195e+00
## Seiche 14625479310 409320599173 3.070171e+00
## Thunderstorm Wind 11050921073 420371520245 2.319802e+00
## Wildfire 9240150185 429611670430 1.939686e+00
## Ice Storm 9210099560 438821769990 1.933378e+00
## Tropical Storm 8409291550 447231061540 1.765273e+00
## Winter Storm 6717503751 453948565291 1.410134e+00
## High Wind 6680148743 460628714034 1.402292e+00
## Marine Thunderstorm Wind 2862652400 463491366434 6.009260e-01
## Winter Weather 2542298000 466033664434 5.336775e-01
## Heavy Rain 1562007442 467595671876 3.278955e-01
## Frost/Freeze 1456916000 469052587876 3.058348e-01
## Funnel Cloud 1380905000 470433492876 2.898786e-01
## Heavy Snow 1087550752 471521043628 2.282979e-01
## Lightning 948304537 472469348165 1.990674e-01
## Blizzard 771393950 473240742115 1.619305e-01
## Excessive Heat 649211480 473889953595 1.362820e-01
## Coastal Flood 563965560 474453919155 1.183873e-01
## Sleet 458485000 474912404155 9.624486e-02
## Heat 421360450 475333764605 8.845170e-02
## Strong Wind 251577740 475585342345 5.281103e-02
## Tsunami 144082000 475729424345 3.024559e-02
## Marine High Wind 112817010 475842241355 2.368247e-02
## High Surf 101555500 475943796855 2.131846e-02
## Marine Hail 70917000 476014713855 1.488685e-02
## Cold/Wind Chill 68600000 476083313855 1.440046e-02
## Dense Fog 65774000 476149087855 1.380723e-02
## Waterspout 60777200 476209865055 1.275831e-02
## Debris Flow 42050000 476251915055 8.827107e-03
## Lake-Effect Snow 40362000 476292277055 8.472763e-03
## Extreme Cold/Wind Chill 26503000 476318780055 5.563492e-03
## Freezing Fog 13554500 476332334555 2.845351e-03
## Dust Storm 9799100 476342133655 2.057020e-03
## Astronomical Low Tide 9745000 476351878655 2.045664e-03
## Avalanche 8721800 476360600455 1.830874e-03
## Lakeshore Flood 7545000 476368145455 1.583841e-03
## Marine Strong Wind 2118330 476370263785 4.446784e-04
## Tropical Depression 1737000 476372000785 3.646298e-04
## Dust Devil 743130 476372743915 1.559973e-04
## Volcanic Ash 500000 476373243915 1.049597e-04
## Rip Current 163000 476373406915 3.421685e-05
## Dense Smoke 100050 476373506965 2.100243e-05
##
## Pareto chart analysis for damage
## Cum.Percent.
## Flood 31.75332
## Hurricane (Typhoon) 47.59619
## Tornado 59.63412
## Storm Surge/Tide 69.70307
## Flash Flood 75.76178
## Hail 79.69993
## Drought 82.85413
## Seiche 85.92430
## Thunderstorm Wind 88.24410
## Wildfire 90.18379
## Ice Storm 92.11717
## Tropical Storm 93.88244
## Winter Storm 95.29257
## High Wind 96.69486
## Marine Thunderstorm Wind 97.29579
## Winter Weather 97.82947
## Heavy Rain 98.15736
## Frost/Freeze 98.46320
## Funnel Cloud 98.75308
## Heavy Snow 98.98137
## Lightning 99.18044
## Blizzard 99.34237
## Excessive Heat 99.47865
## Coastal Flood 99.59704
## Sleet 99.69329
## Heat 99.78174
## Strong Wind 99.83455
## Tsunami 99.86479
## Marine High Wind 99.88848
## High Surf 99.90980
## Marine Hail 99.92468
## Cold/Wind Chill 99.93908
## Dense Fog 99.95289
## Waterspout 99.96565
## Debris Flow 99.97448
## Lake-Effect Snow 99.98295
## Extreme Cold/Wind Chill 99.98851
## Freezing Fog 99.99136
## Dust Storm 99.99341
## Astronomical Low Tide 99.99546
## Avalanche 99.99729
## Lakeshore Flood 99.99887
## Marine Strong Wind 99.99932
## Tropical Depression 99.99968
## Dust Devil 99.99984
## Volcanic Ash 99.99994
## Rip Current 99.99998
## Dense Smoke 100.00000
abline(h=(sum(damage)*.8),col="red",lwd=1)
According to the Pareto principle, roughly 80% of the effects come from 20% of the causes. Looking at the Pareto charts, we draw the following conclusions: