The goal of this report is to show the effects of weather on public health and the economy of the United States. With that information, government authorities can plan preventive and corrective actions to reduce the effects of severe weather events such as tornadoes and floods. The U.S. National Oceanic and Atmospheric Administration (NOAA) has gathered weather data since 1950, so an exploratory data analysis (EDA) is carried out to better understand the effects caused by weather, which include fatalities, injuries, and property damage.
The data was obtained from the Peer-graded Assignment: Course Project 2 page of the Reproducible Research web site. The file is a CSV compressed in bz2 format, so it is downloaded first and then read.
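library(dplyr)      # %>%, group_by, summarize, as_tibble
library(lubridate)  # year()
library(ggplot2)    # bar charts in Figures 2 and 3
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile = "dataset.csv.bz2", mode = "wb")
data_1 <- read.csv("dataset.csv.bz2", stringsAsFactors = FALSE)
data_1 <- as_tibble(data_1)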
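Since the report downloads the file each time it is run, it may help to skip the download when a local copy already exists. The following is a minimal sketch under that assumption; `fetch_once` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper (not in the original report): download a file only
# when no local copy exists yet, and return the local path invisibly.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Possible usage with the names from the report:
# fetch_once(fileURL, "dataset.csv.bz2")
# data_1 <- read.csv("dataset.csv.bz2", stringsAsFactors = FALSE)
```

read.csv can read the bz2 file directly, so no separate decompression step is needed.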
The original data set covers 1950 to 2011, more than 60 years of data, but the early years contain little information. BGN_DATE is a character variable, so it is converted to a date-time class and the year is extracted in order to compare the number of records in the early years with the number in the last 30 years.
new_date <- strptime(data_1$BGN_DATE, "%m/%d/%Y %H:%M:%S")
data_1$YEAR <- year(new_date)
year_table <- table(data_1$YEAR)
y1 <- sum(year_table)
y2 <- sum(head(year_table, 32))
y3 <- sum(tail(year_table, 30))
y3/y1
## [1] 0.9046556
About 90% of the records fall in the last 30 years. A further statistical analysis would be needed to decide which years form an adequate sample. That analysis is outside the scope of this report, so the entire data set is used.
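The date conversion and the share calculation can be illustrated on a few made-up timestamps. This is only a sketch: the values in `bgn` are invented for illustration, and it uses base R's format() instead of year() so that it runs without extra packages. It also assumes that "the last 30 years" means the 30 calendar years ending at the latest year in the data, which matches tail(year_table, 30) only if every year is present in the table.

```r
# Toy timestamps in the same layout as BGN_DATE (illustrative values only).
bgn <- c("4/18/1950 0:00:00", "6/3/1985 0:00:00",
         "7/21/1996 0:00:00", "11/5/2011 0:00:00")

# Parse the text into date-times, then keep only the four-digit year.
parsed <- strptime(bgn, "%m/%d/%Y %H:%M:%S", tz = "UTC")
years  <- as.integer(format(parsed, "%Y"))

# Share of records in the 30 calendar years ending at the latest year.
cutoff <- max(years) - 29
share_recent <- mean(years >= cutoff)
share_recent
# 0.75 for the toy data: three of the four years are 1982 or later
```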
The first question concerns the health problems caused by weather. The EVTYPE (type of weather event), FATALITIES, and INJURIES variables are used in this analysis. First, only FATALITIES is considered; second, INJURIES; and finally the sum of the two. In each case the data is grouped by EVTYPE and the variable is summed per event type. The result is sorted in descending order, reduced to the two relevant columns, and the top ten rows are presented as a table.
data_2 <- data_1 %>% group_by(EVTYPE) %>% summarize(FATALITIES = sum(FATALITIES)) %>% arrange(desc(FATALITIES)) %>% select(EVTYPE, FATALITIES)
data_3 <- head(data_2, 10)
data_3
## # A tibble: 10 x 2
## EVTYPE FATALITIES
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
The top ten shows that the three leading causes of fatalities were TORNADO, EXCESSIVE HEAT, and FLASH FLOOD.
data_4 <- data_1 %>% group_by(EVTYPE) %>% summarize(INJURIES = sum(INJURIES)) %>% arrange(desc(INJURIES)) %>% select(EVTYPE, INJURIES)
data_5 <- head(data_4, 10)
data_5
## # A tibble: 10 x 2
## EVTYPE INJURIES
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
The top ten shows that the three leading causes of injuries were TORNADO, TSTM WIND, and FLOOD.
data_6 <- data_1 %>% group_by(EVTYPE) %>% summarize(HEALTH_PROBLEMS = sum(FATALITIES, INJURIES)) %>% arrange(desc(HEALTH_PROBLEMS)) %>% select(EVTYPE, HEALTH_PROBLEMS)
data_7 <- head(data_6, 10)
data_7
## # A tibble: 10 x 2
## EVTYPE HEALTH_PROBLEMS
## <chr> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
The top ten (fatalities + injuries) shows that the three leading causes of health problems were TORNADO, EXCESSIVE HEAT, and TSTM WIND.
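The three summaries above follow the same group-and-sum pattern, so they could also be produced in a single pass. The sketch below shows this on a small invented table (`toy` is a stand-in for data_1, and its numbers are not NOAA figures); it is an alternative arrangement, not the method used in this report.

```r
library(dplyr)

# Toy rows standing in for data_1 (illustrative values, not NOAA figures).
toy <- tibble(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 1, 4, 0),
  INJURIES   = c(10, 5, 1, 3)
)

# One grouped summary produces all three totals at once. Inside summarize,
# HEALTH_PROBLEMS is computed from the per-group sums defined just before it.
health <- toy %>%
  group_by(EVTYPE) %>%
  summarize(
    FATALITIES      = sum(FATALITIES),
    INJURIES        = sum(INJURIES),
    HEALTH_PROBLEMS = FATALITIES + INJURIES
  ) %>%
  arrange(desc(HEALTH_PROBLEMS))

health
```

Each individual ranking can then be obtained by re-sorting the same table, for example with arrange(desc(FATALITIES)).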
The second question concerns the economic problems caused by weather. The EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP variables are used in this analysis. PROPDMGEXP and CROPDMGEXP hold a character code that must be taken into account to determine the actual amounts in PROPDMG and CROPDMG: "K" stands for thousands, "M" for millions, and "B" for billions. The data is therefore filtered to keep only the rows with "", "K", "M", or "B" in PROPDMGEXP or CROPDMGEXP, and rows with any other code are ignored. Next, as in the health analysis, the data is grouped by EVTYPE and the variable is summed per event type. The result is sorted in descending order, reduced to the two relevant columns, and the top ten rows are presented as a table, with amounts in millions.
x1 <- data_1 %>% filter(PROPDMGEXP == "")
x2 <- data_1 %>% filter(PROPDMGEXP == "K") %>% mutate(PROPDMG = PROPDMG*1000)
x3 <- data_1 %>% filter(PROPDMGEXP == "M") %>% mutate(PROPDMG = PROPDMG*1000000)
x4 <- data_1 %>% filter(PROPDMGEXP == "B") %>% mutate(PROPDMG = PROPDMG*1000000000)
data_10 <- rbind(x1, x2, x3, x4)
data_11 <- data_10 %>% group_by(EVTYPE) %>% summarize(PROPDMG = sum(PROPDMG)) %>% mutate(PROPDMG = PROPDMG/1000000) %>% arrange(desc(PROPDMG)) %>% select(EVTYPE, PROPDMG)
data_12 <- head(data_11, 10)
data_12
## # A tibble: 10 x 2
## EVTYPE PROPDMG
## <chr> <dbl>
## 1 FLOOD 144658.
## 2 HURRICANE/TYPHOON 69306.
## 3 TORNADO 56926.
## 4 STORM SURGE 43324.
## 5 FLASH FLOOD 16141.
## 6 HAIL 15727.
## 7 HURRICANE 11868.
## 8 TROPICAL STORM 7704.
## 9 WINTER STORM 6688.
## 10 HIGH WIND 5270.
The top ten shows that the three largest sources of property damage, in millions, were FLOOD, HURRICANE/TYPHOON, and TORNADO.
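The exponent handling used for the property damage can also be written as a lookup of multipliers. The sketch below assumes the same rule as the report (only "", "K", "M", and "B" are recognized); `multiplier` and `to_dollars` are hypothetical names introduced for illustration, and the input values are invented.

```r
# Multipliers for the exponent codes kept in this report. The empty code ""
# (amount taken as is) is handled separately because a named vector cannot
# be looked up by an empty name.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert an amount and its exponent code to a plain
# amount. Codes outside the recognized set yield NA so they can be dropped.
to_dollars <- function(amount, exp_code) {
  mult <- ifelse(exp_code == "", 1, unname(multiplier[exp_code]))
  amount * mult
}

to_dollars(c(25, 2.5, 1, 10, 3), c("K", "M", "B", "", "h"))
# 25000, 2500000, 1e9, 10, NA
```

Returning NA for unrecognized codes keeps the decision about those rows explicit: they can be removed with a filter or counted before being discarded.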
x10 <- data_1 %>% filter(CROPDMGEXP == "")
x20 <- data_1 %>% filter(CROPDMGEXP == "K") %>% mutate(CROPDMG = CROPDMG*1000)
x30 <- data_1 %>% filter(CROPDMGEXP == "M") %>% mutate(CROPDMG = CROPDMG*1000000)
x40 <- data_1 %>% filter(CROPDMGEXP == "B") %>% mutate(CROPDMG = CROPDMG*1000000000)
data_20 <- rbind(x10, x20, x30, x40)
data_21 <- data_20 %>% group_by(EVTYPE) %>% summarize(CROPDMG = sum(CROPDMG)) %>% mutate(CROPDMG = CROPDMG/1000000) %>% arrange(desc(CROPDMG)) %>% select(EVTYPE, CROPDMG)
data_22 <- head(data_21, 10)
data_22
## # A tibble: 10 x 2
## EVTYPE CROPDMG
## <chr> <dbl>
## 1 DROUGHT 13973.
## 2 FLOOD 5662.
## 3 RIVER FLOOD 5029.
## 4 ICE STORM 5022.
## 5 HAIL 3026.
## 6 HURRICANE 2742.
## 7 HURRICANE/TYPHOON 2608.
## 8 FLASH FLOOD 1421.
## 9 EXTREME COLD 1293.
## 10 FROST/FREEZE 1094.
The top ten shows that the three largest sources of crop damage, in millions, were DROUGHT, FLOOD, and RIVER FLOOD.
data_25 <- rbind(data_10, data_20)
data_30 <- data_25 %>% group_by(EVTYPE) %>% summarize(ECONOMIC_PROBLEMS = sum(PROPDMG, CROPDMG)) %>% mutate(ECONOMIC_PROBLEMS = ECONOMIC_PROBLEMS/1000000) %>% arrange(desc(ECONOMIC_PROBLEMS)) %>% select(EVTYPE, ECONOMIC_PROBLEMS)
data_31 <- head(data_30, 10)
data_31
## # A tibble: 10 x 2
## EVTYPE ECONOMIC_PROBLEMS
## <chr> <dbl>
## 1 FLOOD 150321.
## 2 HURRICANE/TYPHOON 71914.
## 3 TORNADO 57344.
## 4 STORM SURGE 43324.
## 5 HAIL 18754.
## 6 FLASH FLOOD 17564.
## 7 DROUGHT 15019.
## 8 HURRICANE 14610.
## 9 RIVER FLOOD 10148.
## 10 ICE STORM 8967.
The top ten shows that the three largest sources of combined (property + crop) damage, in millions, were FLOOD, HURRICANE/TYPHOON, and TORNADO.
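One detail of the combined total is worth noting. rbind(data_10, data_20) stacks two copies of the records: in the first copy only PROPDMG has been scaled, and in the second only CROPDMG. sum(PROPDMG, CROPDMG) therefore also adds the unscaled values from the other column of each copy. Those unscaled values are small next to the scaled ones, so the effect on the ranking is probably minor, but scaling both columns on the same rows avoids it. The sketch below shows one way to do that on invented data; `scale_by_exp` is a hypothetical helper, and treating unrecognized codes as a zero contribution is an assumption made here to mirror the report's filter.

```r
library(dplyr)

# Hypothetical helper: scale an amount by its exponent code. Codes other
# than "", "K", "M", "B" contribute 0 (an assumption mirroring the filter).
scale_by_exp <- function(amount, exp_code) {
  amount * case_when(
    exp_code == ""  ~ 1,
    exp_code == "K" ~ 1e3,
    exp_code == "M" ~ 1e6,
    exp_code == "B" ~ 1e9,
    TRUE            ~ 0
  )
}

# Toy rows standing in for data_1 (illustrative values, not NOAA figures).
toy <- tibble(
  EVTYPE     = c("FLOOD", "FLOOD", "DROUGHT"),
  PROPDMG    = c(2, 500, 0),
  PROPDMGEXP = c("M", "K", ""),
  CROPDMG    = c(100, 0, 3),
  CROPDMGEXP = c("K", "", "M")
)

# Scale both damage columns on the same rows, then total per event type
# and express the result in millions.
economic <- toy %>%
  mutate(PROP_USD = scale_by_exp(PROPDMG, PROPDMGEXP),
         CROP_USD = scale_by_exp(CROPDMG, CROPDMGEXP)) %>%
  group_by(EVTYPE) %>%
  summarize(ECONOMIC_PROBLEMS = sum(PROP_USD + CROP_USD) / 1e6) %>%
  arrange(desc(ECONOMIC_PROBLEMS))

economic
```

With the toy data, DROUGHT totals 3 million and FLOOD 2.6 million, and each record is counted exactly once.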
As was mentioned before, this report considers the entire data set from 1950 to 2011. The distribution per year is presented in Figure 1.
barplot(table(data_1$YEAR), col = "green", xlab="Year", ylab="Data", main = "Barplot of Data per Year", panel.first = grid())
Figure 1. Barplot of Data per Year.
Data was gathered for more than 60 years, but the last 20 years contain more records than the earlier ones. This may be because fewer events were recorded in the early years.
The severe weather events that caused the most fatalities and injuries combined are Tornado, Excessive Heat, and Tstm Wind. Considering the total health problems (fatalities + injuries), Tornado is the most dangerous event, with close to 100000 fatalities and injuries (96979). The government should focus on the effects of tornadoes.
g3 <- ggplot(data_7, aes(x = reorder(EVTYPE, -HEALTH_PROBLEMS), y = HEALTH_PROBLEMS))
g3 + geom_bar(stat = "identity", fill = "blue") + theme(axis.text.x = element_text(angle = 45, hjust =1, size = 8)) + ggtitle("Top Ten Health Problems per Weather Event") + xlab("Weather Event") + ylab("Health Problems")
Figure 2. Top Ten Health Problems per Weather Event.
This figure shows the total fatalities and injuries caused by the most severe weather events. The effect of tornadoes is far larger than that of any other event.
On the other hand, the severe weather events that caused the most property and crop damage are Flood, Hurricane/Typhoon, and Tornado. Considering the total economic problems (property + crops), Flood is the most costly event, with damage of about 150321 million. The government should also focus on the effects of floods.
g6 <- ggplot(data_31, aes(x = reorder(EVTYPE, -ECONOMIC_PROBLEMS), y = ECONOMIC_PROBLEMS))
g6 + geom_bar(stat = "identity", fill = "red") + theme(axis.text.x = element_text(angle = 45, hjust =1, size = 8)) + ggtitle("Top Ten Economic Problems per Weather Event") + xlab("Weather Event") + ylab("Economic Problems")
Figure 3. Top Ten Economic Problems per Weather Event.
This figure shows the total economic problems caused by the most severe weather events. The combined total for Flood, Hurricane/Typhoon, and Tornado is 279579 million USD.