In this project we read storm-related consequences data and try to answer two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
We will look at the data and work out which columns matter for this analysis. Then we will group the data set by event type and sum the selected columns to answer the stated questions.
We will analyze U.S. government storm data. The sources are:
* Dataset url
* National Weather Service url
* National Climatic Data Center Storm Events FAQ
We can read the data without unpacking it, simply by using the read.csv function.
dane <- read.csv("repdata_data_StormData.csv.bz2")
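This works because R's file connections detect bzip2 compression when a file is opened for reading. A minimal sketch of that behaviour, assuming a tiny made-up table written to a temporary file (the `toy` data and the temporary path are illustrative and not part of the storm data):

```r
# Illustrative only: a tiny made-up table, not the real storm data.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(2, 0))

# Write it as a bzip2-compressed CSV in a temporary location.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads the compressed file without a manual decompression step.
toy_back <- read.csv(tmp)
print(toy_back)
```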
We can take a quick look at the table to check that it loaded correctly.
head(dane)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
And check the column names.
names(dane)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
As we can see, the columns we need are EVTYPE, FATALITIES and INJURIES. We will also need PROPDMG and CROPDMG to answer the second question.
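Since only a handful of the 37 columns are used, the data frame could be narrowed before any grouping. A small sketch, assuming a one-row made-up data frame in place of `dane`; the `needed` vector is a hypothetical name, and including the two exponent columns in it is an assumption rather than something the analysis here relies on:

```r
library(dplyr)

# Hypothetical list of columns of interest (names taken from names(dane)).
needed <- c("EVTYPE", "FATALITIES", "INJURIES",
            "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Made-up single row standing in for the full storm data.
toy <- data.frame(EVTYPE = "TORNADO", FATALITIES = 0, INJURIES = 15,
                  PROPDMG = 25, PROPDMGEXP = "K",
                  CROPDMG = 0, CROPDMGEXP = "",
                  REMARKS = "", REFNUM = 1)

# Fail early if a required column is missing, then keep only those columns.
stopifnot(all(needed %in% names(toy)))
toy_small <- toy %>% select(all_of(needed))
print(names(toy_small))
```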
First we need the fatalities and injuries for every type of disaster that can occur. For this we will group the data frame by event type, sum both columns, and save the result in the d2 variable.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
d2 <- dane %>%
  group_by(EVTYPE) %>%
  summarize(fat = sum(FATALITIES, na.rm = TRUE), inj = sum(INJURIES, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(fat))
head(d2)
## # A tibble: 6 × 3
## EVTYPE fat inj
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
We will keep the fatalities in a separate variable.
dane_fatal <- d2 %>% select(type=EVTYPE,fatalities=fat) %>% arrange(desc(fatalities))
head(dane_fatal)
## # A tibble: 6 × 2
## type fatalities
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
We do the same for the injuries.
dane_injuries <- d2 %>% select(type=EVTYPE,injuries=inj) %>% arrange(desc(injuries))
head(dane_injuries)
## # A tibble: 6 × 2
## type injuries
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
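As we saw in the data overview, we can use PROPDMG and CROPDMG to answer the second question. For this we will create a per-event-type summary of the main data set. Note that these sums use the raw PROPDMG and CROPDMG values and do not apply the PROPDMGEXP and CROPDMGEXP columns.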
d3 <- dane %>%
  group_by(EVTYPE) %>%
  summarize(dmg = sum(PROPDMG, na.rm = TRUE), farms = sum(CROPDMG, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(dmg))
head(d3)
## # A tibble: 6 × 3
## EVTYPE dmg farms
## <chr> <dbl> <dbl>
## 1 TORNADO 3212258. 100019.
## 2 FLASH FLOOD 1420125. 179200.
## 3 TSTM WIND 1335966. 109203.
## 4 FLOOD 899938. 168038.
## 5 THUNDERSTORM WIND 876844. 66791.
## 6 HAIL 688693. 579596.
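The head(dane) output shows a PROPDMGEXP value of K next to each PROPDMG figure, so the raw sums above mix different magnitudes. A hedged sketch of how the exponent columns might be applied before summing, on made-up rows; the helper `exp_to_multiplier` is hypothetical, and the mapping (K = thousand, M = million, B = billion, anything else treated as 1) is an assumption that should be checked against the Storm Events documentation:

```r
library(dplyr)

# Hypothetical helper: map an exponent code to a numeric multiplier.
# Assumption: K = 1e3, M = 1e6, B = 1e9; any other code is treated as 1.
exp_to_multiplier <- function(code) {
  code <- toupper(trimws(code))
  case_when(
    code == "K" ~ 1e3,
    code == "M" ~ 1e6,
    code == "B" ~ 1e9,
    TRUE ~ 1
  )
}

# Made-up rows for illustration; column names match the storm data.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD"),
  PROPDMG    = c(25, 2.5, 1.5),
  PROPDMGEXP = c("K", "M", "B"),
  CROPDMG    = c(0, 10, 0),
  CROPDMGEXP = c("", "k", "")
)

# Scale each row first, then sum per event type.
toy_usd <- toy %>%
  mutate(
    prop_usd = PROPDMG * exp_to_multiplier(PROPDMGEXP),
    crop_usd = CROPDMG * exp_to_multiplier(CROPDMGEXP)
  ) %>%
  group_by(EVTYPE) %>%
  summarize(prop_usd = sum(prop_usd), crop_usd = sum(crop_usd)) %>%
  arrange(desc(prop_usd))
print(toy_usd)
```

With scaling applied, a single event recorded in billions can outweigh many events recorded in thousands, so the ranking may differ from the raw-sum ranking.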
We will also create a data frame that addresses the property damage.
dane_dmg <- d3 %>% select(type=EVTYPE,property_dmg=dmg ) %>% arrange(desc(property_dmg))
head(dane_dmg)
## # A tibble: 6 × 2
## type property_dmg
## <chr> <dbl>
## 1 TORNADO 3212258.
## 2 FLASH FLOOD 1420125.
## 3 TSTM WIND 1335966.
## 4 FLOOD 899938.
## 5 THUNDERSTORM WIND 876844.
## 6 HAIL 688693.
And one for the crop damage.
dane_farms <- d3 %>% select(type=EVTYPE,crop_dmg=farms ) %>% arrange(desc(crop_dmg))
head(dane_farms)
## # A tibble: 6 × 2
## type crop_dmg
## <chr> <dbl>
## 1 HAIL 579596.
## 2 FLASH FLOOD 179200.
## 3 FLOOD 168038.
## 4 TSTM WIND 109203.
## 5 TORNADO 100019.
## 6 THUNDERSTORM WIND 66791.
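The tables list TSTM WIND and THUNDERSTORM WIND as separate event types, which splits what looks like one kind of event across two rows. A minimal sketch of normalising the labels before grouping, on made-up rows; the helper `clean_evtype` is hypothetical, and reading "TSTM" as an abbreviation of "THUNDERSTORM" is an assumption:

```r
library(dplyr)

# Hypothetical helper: normalise event-type labels before grouping.
# Assumption: a leading "TSTM" abbreviates "THUNDERSTORM".
clean_evtype <- function(x) {
  x <- toupper(trimws(x))        # unify case, drop outer whitespace
  x <- gsub("\\s+", " ", x)      # collapse repeated inner whitespace
  sub("^TSTM\\b", "THUNDERSTORM", x, perl = TRUE)
}

# Made-up rows for illustration.
toy <- data.frame(
  EVTYPE  = c("TSTM WIND", "THUNDERSTORM WIND", " tstm  wind "),
  CROPDMG = c(1, 2, 3)
)

toy_clean <- toy %>%
  mutate(EVTYPE = clean_evtype(EVTYPE)) %>%
  group_by(EVTYPE) %>%
  summarize(farms = sum(CROPDMG))
print(toy_clean)
```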
To reproduce the plots you need the ggplot2 package.
library(ggplot2)
The two elements that indicate harm to population health are:
fatalities
injuries
The top event types by fatalities are:
head(dane_fatal)
## # A tibble: 6 × 2
## type fatalities
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
The top event types by injuries are:
head(dane_injuries)
## # A tibble: 6 × 2
## type injuries
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
We can see both of them on this plot:
# Overlay injuries and fatalities for the 20 event types with the most fatalities.
ggplot(d2[1:20, ], aes(x = EVTYPE)) +
  geom_bar(aes(y = inj, fill = "Injuries"), stat = "identity") +
  geom_bar(aes(y = fat, fill = "Fatalities"), stat = "identity") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_fill_manual(values = c(Fatalities = "red", Injuries = "blue")) +
  labs(title = "Fatalities and Injuries by Disaster Type", x = "Disaster Type", y = "Number of People", fill = "")
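The bars can also be sorted by total harm and drawn side by side instead of overlaid. A sketch of that variant, assuming a small data frame built from the six rows of the head(d2) output; `top_harm` and `harm_long` are hypothetical names, and tidyr's pivot_longer is used to reshape the two count columns into one:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Values copied from the head(d2) output.
top_harm <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD",
             "HEAT", "LIGHTNING", "TSTM WIND"),
  fat = c(5633, 1903, 978, 937, 816, 504),
  inj = c(91346, 6525, 1777, 2100, 5230, 6957)
)

# Long format: one row per (event type, kind of harm).
harm_long <- top_harm %>%
  pivot_longer(c(fat, inj), names_to = "kind", values_to = "count") %>%
  mutate(
    kind   = ifelse(kind == "fat", "Fatalities", "Injuries"),
    # Order event types by descending fatalities + injuries.
    EVTYPE = reorder(EVTYPE, -count, FUN = sum)
  )

p <- ggplot(harm_long, aes(x = EVTYPE, y = count, fill = kind)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(Fatalities = "red", Injuries = "blue")) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Fatalities and Injuries by Disaster Type",
       x = "Disaster Type", y = "Number of People", fill = "")
print(p)
```

Mapping fill to a real column lets ggplot2 build the legend from the data, so labels and colours cannot drift apart.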
As we can see, the main disasters contributing to this are:
tornado
TSTM wind
There are two elements in this data frame that are related to economic consequences:
property damage
crop damage
The top event types by property damage are:
head(dane_dmg)
## # A tibble: 6 × 2
## type property_dmg
## <chr> <dbl>
## 1 TORNADO 3212258.
## 2 FLASH FLOOD 1420125.
## 3 TSTM WIND 1335966.
## 4 FLOOD 899938.
## 5 THUNDERSTORM WIND 876844.
## 6 HAIL 688693.
The top event types by crop damage are:
head(dane_farms)
## # A tibble: 6 × 2
## type crop_dmg
## <chr> <dbl>
## 1 HAIL 579596.
## 2 FLASH FLOOD 179200.
## 3 FLOOD 168038.
## 4 TSTM WIND 109203.
## 5 TORNADO 100019.
## 6 THUNDERSTORM WIND 66791.
We can see the combination of these on the bar plot:
# Overlay property and crop damage for the 20 event types with the most property damage.
ggplot(d3[1:20, ], aes(x = EVTYPE)) +
  geom_bar(aes(y = dmg, fill = "Property"), stat = "identity") +
  geom_bar(aes(y = farms, fill = "Farms"), stat = "identity") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_fill_manual(values = c(Farms = "red", Property = "blue")) +
  labs(title = "Property and Crop Damage by Disaster Type", x = "Disaster Type", y = "Reported damage (raw sums)", fill = "")
As we see in the plot, the main disasters contributing to economic consequences are (property + crop damage, raw sums):
tornado (3212258 + 100019 = 3312277)
flash flood (1420125 + 179200 = 1599325)
TSTM wind (1335966 + 109203 = 1445169)
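The combined ranking can be computed rather than added up by hand. A short sketch, assuming a data frame built from the six rows of the head(d3) output; `top_dmg` and `ranked` are hypothetical names, and the figures are the raw sums without the exponent columns applied:

```r
library(dplyr)

# Values copied from the head(d3) output (raw PROPDMG / CROPDMG sums).
top_dmg <- data.frame(
  EVTYPE = c("TORNADO", "FLASH FLOOD", "TSTM WIND",
             "FLOOD", "THUNDERSTORM WIND", "HAIL"),
  dmg   = c(3212258, 1420125, 1335966, 899938, 876844, 688693),
  farms = c(100019, 179200, 109203, 168038, 66791, 579596)
)

# Add property and crop damage, then sort by the combined total.
ranked <- top_dmg %>%
  mutate(total = dmg + farms) %>%
  arrange(desc(total))
print(ranked)
```

On these six rows the order is tornado, flash flood, TSTM wind, hail, flood, thunderstorm wind.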