The data analysis presented in this document is for Course Project 2 in the Reproducible Reasearch course, provided by John Hopkins University on Coursera platform.The database used for this analysis was provided by U.S. National Oceanic and Atmospheric Administration (NOAA) and it tracks major weather events, including where and when they occur, as well as estimation on fatalities, injuries, and economic figures.The main goals of this project were to evaluate what type wether events would have a higher impact on population health as well as economic consequences.To evaluate the impact on population health, injury and fatality values were used. Property and Crop damage costs were the variables used to evaluate the economic consequences. It was found that Tornados and Excessive Heat are the most impactful events on population health, while Drought and Hurricane/Typhoons were the events causing larger economic consequences.
As mentioned, the data used for this project was obtained from NOAA’s database. The funcion read.csv was used to read the data into R Studio, and then stored in a variable called StormData.
# Open csv file with raw data
StormData <- read.csv("repdata_data_StormData.csv.bz2")
StormData has a total of 902297 rows and 37 columns. The time frame of the data stored in StormData variable ranges from 1950 until 2011. Therefore, in order to help reducing the size of data to analyse, only data after 1990 was considered and stored in a new variable called SampleStormData. Further pre-processing included change date format of values in column BGN_DATE, and keep columns that contain relevant information for the data analysis.
# Change date format of column BGN_DATE to Date and Save it in new column BGN_DATE_YMD
StormData$BGN_DATE <- as.Date(format(strptime(as.character(StormData$BGN_DATE),
"%m/%d/%Y"),
"%Y-%m-%d"))
# Filter StormData for year > 1990
StormData$YEAR <- year(StormData$BGN_DATE)
SampleStormData <- StormData[StormData$YEAR > "1990", ]
# Columns to keep and filter dataset SampleStormData to keep just the columns selected
keepVar <- c("BGN_DATE","YEAR","STATE","EVTYPE","FATALITIES","INJURIES",
"PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","LATITUDE",
"LONGITUDE","REFNUM")
SampleStormData <- SampleStormData[,keepVar]
# Set all Events to Capital letters
SampleStormData$EVTYPE <- factor(toupper(SampleStormData$EVTYPE))
In order to know what type of events were most harmful to the population health, between 1991 and 2011, we will group the variable EVTYPE, and then count the total of Injuries and Fatalities per event. Addionaly, those values will then be transformed into percentages.
# Total Injuries per Event Type
totalInjperEvType <- aggregate.data.frame(SampleStormData$INJURIES,
by = list(SampleStormData$EVTYPE),
FUN = sum)
colnames(totalInjperEvType) <- c("EVTYPE","Total_Injuries")
# Order Total Injuries in descend order
totalInjperEvType <- totalInjperEvType[order(totalInjperEvType$Total_Injuries,
decreasing = TRUE),]
# Show total Injuries in Percentage
totalInjperEvType$PercTotalInj <- totalInjperEvType$Total_Injuries/sum(totalInjperEvType$Total_Injuries)*100
totalInjperEvType$PercTotalInj <- round(totalInjperEvType$PercTotalInj, 2)
# Total Fatalities per Event Type
totalFatalperEvType <- aggregate.data.frame(SampleStormData$FATALITIES,
by = list(SampleStormData$EVTYPE),
FUN = sum)
colnames(totalFatalperEvType) <- c("EVTYPE","Total_Fatalities")
# Order Total Injuries in descend order
totalFatalperEvType <- totalFatalperEvType[order(totalFatalperEvType$Total_Fatalities,
decreasing = TRUE),]
# Show total Injuries in Percentage
totalFatalperEvType$PercTotalInj <- totalFatalperEvType$Total_Fatalities/sum(totalFatalperEvType$Total_Fatalities)*100
totalFatalperEvType$PercTotalInj <- round(totalFatalperEvType$PercTotalInj, 2)
# SUMMARY RESULTS IN A FORM OF BAR PLOT IN PANEL
plot1 <- ggplot(data = totalInjperEvType[totalInjperEvType$PercTotalInj > 5,],
aes(EVTYPE, PercTotalInj))+
labs(x = NULL, y = "Total Injuries (%)",
title = "Most harmful (>5%) events in terms of injury, 1991-2011")+
theme(plot.title = element_text(size=9),
axis.title = element_text(size = 9))+
geom_bar(stat = "Identity")+
geom_text(aes(label=PercTotalInj), vjust=-0.3, size=2)
plot2 <- ggplot(data = totalFatalperEvType[totalFatalperEvType$PercTotalInj > 5,],
aes(EVTYPE, PercTotalInj))+
labs(x = "Event Type ", y = "Total Fatalities (%)",
title = "Most harmful (>5%) events in terms of fatality, 1991-2011")+
theme(plot.title = element_text(size=9),
axis.title = element_text(size = 9))+
geom_bar(stat = "Identity")+
geom_text(aes(label=PercTotalInj), vjust=-0.3, size=2)
multiplot(plot1,plot2, cols=1)
Fig. 1 - Most harmful events (rate > %5) for Population Heath, 1991-2011.
Results presented in Fig.1 show that, generally, TORNADO is the weather event that causes more Injuries in the population, in a total of 35.5% of all injuries, while Excessive Heat is repsonsible for 17.3% of total fatalities, between 1991 and 2011.
Further data analysis was done intending to address the following questions:
# TORNADO
subsetTornado_INJ <- aggregate.data.frame(SampleStormData$INJURIES,
by = list(SampleStormData$EVTYPE == "TORNADO",
SampleStormData$STATE,
SampleStormData$BGN_DATE),
FUN = sum)
subsetTornado_INJ <- subsetTornado_INJ[subsetTornado_INJ$Group.1 =="TRUE",]
subsetTornado_INJ$Group.1 <- NULL
colnames(subsetTornado_INJ) <- c("STATE", "DATE","Total_Injuries")
subsetTornado_INJ <- subsetTornado_INJ[order(subsetTornado_INJ$Total_Injuries,
decreasing = TRUE),]
subsetTornado_INJ[1:3,]
## STATE DATE Total_Injuries
## 104123 AL 2011-04-27 2092
## 104678 MO 2011-05-22 1152
## 104162 TN 2011-04-27 857
Above are shown the top 3 tornados that caused most injuries. this data suggests that in 2011 a tornado caused large amouth of injuries (almost 3000) throughout the states of Alabama and Tennessie (AL and TN, respectively). Official data has been found supporting this cause, and a tornado outbreak in the Deep South has been reported.
To answer this question, data was filtered based on event type, and then groped by state and year. A sum function was applied to the variable FATALITIES.
# EXCESSIVE HEAT
subsetEXCHEAT <- aggregate.data.frame(SampleStormData$FATALITIES,
by = list(SampleStormData$EVTYPE == "EXCESSIVE HEAT",
SampleStormData$STATE,
SampleStormData$YEAR),
FUN = sum)
subsetEXCHEAT <- subsetEXCHEAT[subsetEXCHEAT$Group.1 =="TRUE",]
subsetEXCHEAT$Group.1 <- NULL
colnames(subsetEXCHEAT) <- c("STATE", "YEAR", "Total_Fatalities")
subsetEXCHEAT <- subsetEXCHEAT[order(subsetEXCHEAT$Total_Fatalities,
decreasing = TRUE),]
subsetEXCHEAT[1:3,]
## STATE YEAR Total_Fatalities
## 531 IL 1999 138
## 571 PA 1999 88
## 550 MO 1999 77
The results presented above show that large majoritie of fatalities due to Extreme Heat were in the states of Illinois, Pennsylvania and Montana, which is consistent with official records of a Summer heat wave during 1999.
To understand what weather events had the highest economic impact, we have to analyse the crop damage (CROPDMG) and property damaga (PROPDMG). The values in these columns had to be multiplied by the corresponding CROPDMGEXP and PROPDMGEXP to produce the actual monetary value. After doing so and check the data, one event in particular stood out as a possible outlier (REFNUM = 605943). It was then decided to eliminate it from the dataset.
# Replace 'K', 'M' and 'B' with their numeric equivalents. After that, replace NA with 1. Applies to PROPDMGEXP and CROPDMGEXP
SampleStormData$PROPDMGEXP <- recode(SampleStormData$PROPDMGEXP, 'K' = 1000, 'M'=1000000, 'B'=1000000000)
SampleStormData$CROPDMGEXP <- recode(SampleStormData$CROPDMGEXP, 'K' = 1000, 'M'=1000000, 'B'=1000000000)
SampleStormData$PROPDMGEXP[is.na(SampleStormData$PROPDMGEXP)] <- 1
SampleStormData$CROPDMGEXP[is.na(SampleStormData$CROPDMGEXP)] <- 1
# Calculate damage amounts for PROPDMG,CROPDMG and Total Damage Amount
SampleStormData$PROPDMGAMT <- SampleStormData$PROPDMG*SampleStormData$PROPDMGEXP
SampleStormData$CROPDMGAMT <- SampleStormData$CROPDMG*SampleStormData$CROPDMGEXP
SampleStormData$TOTDMGAMT <- SampleStormData$CROPDMGAMT + SampleStormData$PROPDMGAMT
No Official Weather Events of type FLOOD to be reported in CA on 2016-01-01. Most likely this entry is wrong. We delete it.
# Remove possible outlier
SampleStormData<- SampleStormData[!(SampleStormData$REFNUM == "605943"),]
To assess which weather event caused more crop damage, we sum the CROPDMGAMT by EVTYPE, and then calculate the percentage of each relative to the total crop costs.
# Crop Damaga data processing
TotalCropDmgPerEVTYPE <- aggregate.data.frame(SampleStormData$CROPDMGAMT,
by = list(SampleStormData$EVTYPE),
FUN = sum)
colnames(TotalCropDmgPerEVTYPE) <- c("EVTYPE","TOTALCROPDMGAMT")
# Sort data in descending order
TotalCropDmgPerEVTYPE <- TotalCropDmgPerEVTYPE[order(TotalCropDmgPerEVTYPE$TOTALCROPDMGAMT,
decreasing = TRUE),]
# Show Crop Damages in Percentage
TotalCropDmgPerEVTYPE$PercTotal <- TotalCropDmgPerEVTYPE$TOTALCROPDMGAMT/sum(TotalCropDmgPerEVTYPE$TOTALCROPDMGAMT)*100
TotalCropDmgPerEVTYPE$PercTotal <- round(TotalCropDmgPerEVTYPE$PercTotal, 2)
For total property damage costs we do the same data analysis as for crop damage.
# Total Property Damages per Event Type
TotalPropDmgPerEVTYPE <- aggregate.data.frame(SampleStormData$PROPDMGAMT,
by = list(SampleStormData$EVTYPE),
FUN = sum)
colnames(TotalPropDmgPerEVTYPE) <- c("EVTYPE","TOTALPROPDMGAMT")
# Sort data in descending order
TotalPropDmgPerEVTYPE <- TotalPropDmgPerEVTYPE[order(TotalPropDmgPerEVTYPE$TOTALPROPDMGAMT,
decreasing = TRUE),]
# Show total Injuries in Percentage
TotalPropDmgPerEVTYPE$PercTotal <- TotalPropDmgPerEVTYPE$TOTALPROPDMGAMT/
sum(TotalPropDmgPerEVTYPE$TOTALPROPDMGAMT)*100
TotalPropDmgPerEVTYPE$PercTotal <- round(TotalPropDmgPerEVTYPE$PercTotal, 2)
We plot the results of total crop damage amount (TotalCropDmgPerEVTYPE) and total property damage amount (TotalPropDmgPerEVTYPE) for values above 5%. This is just for the purpose of better visualizationof those results who are relevant for this analysis.
# SUMMARY RESULTS IN A FORM OF BAR PLOT IN PANEL
plot3 <- ggplot(data = TotalCropDmgPerEVTYPE[totalInjperEvType$PercTotal > 5,],
aes(EVTYPE, PercTotal))+
labs(x = NULL, y = "Total Crop Costs (%)",
title = "Events with higher (> 5%) Total Crop Costs, 1991-2011")+
theme(plot.title = element_text(size=9),
axis.title = element_text(size = 9))+
geom_bar(stat = "Identity")+
geom_text(aes(label=PercTotal), vjust=-0.3, size=2)
plot4 <- ggplot(data = TotalPropDmgPerEVTYPE[TotalPropDmgPerEVTYPE$PercTotal > 5,],
aes(EVTYPE, PercTotal))+
labs(x = "Event Type ", y = "Total Prop. Costs (%)",
title = "Events with higher (> 5%) Total Property Costs , 1991-2011")+
theme(plot.title = element_text(size=9),
axis.title = element_text(size = 9))+
geom_bar(stat = "Identity",
)+
geom_text(aes(label=PercTotal), vjust=-0.3, size=2)
multiplot(plot3,plot4, cols=1)
Fig. 2 - Weather events with higher (rate > %5) Economic consequences, 1991-2011.
Highest crop damage costs were caused by DROUGHT about 28.5% of the total costs reported between 1991 and 2011, while HURRICAN/TYPHOON events were responsible for 24.4% of the total property damage costs, as reported on FIG.2.
Further data analysis was done intending to address the following questions:
# Further Investication for Total Crop Damage
subsetDROUGHT <- SampleStormData[which(SampleStormData$EVTYPE == "DROUGHT"),
names(SampleStormData) %in%
c("BGN_DATE_YMD","YEAR","STATE","CROPDMGAMT")
]
subsetDROUGHT <- subsetDROUGHT[order(subsetDROUGHT$CROPDMGAMT,
decreasing = TRUE),]
subsetDROUGHT$CROPDMGAMT <- format(subsetDROUGHT$CROPDMGAMT, scientific = TRUE)
subsetDROUGHT[1:3,]
## YEAR STATE CROPDMGAMT
## 639347 2006 TX 1.0000e+09
## 422676 2001 IA 5.7885e+08
## 410175 2000 TX 5.1500e+08
In 2006, in Texas, it was recorded the DROUGHT event with highest amounth of crop damages, in a total of $1B dollars. According to official recordings, Wildfires have been reported during this time in Texas.
# Further Investication for Total Property Damage
subsetHURTYPH <- SampleStormData[which(SampleStormData$EVTYPE == "HURRICANE/TYPHOON"),
names(SampleStormData) %in%
c("BGN_DATE_YMD","YEAR","STATE","PROPDMGAMT")
]
subsetHURTYPH <- subsetHURTYPH[order(subsetHURTYPH$PROPDMGAMT,
decreasing = TRUE),]
subsetHURTYPH[1:3,]
## YEAR STATE PROPDMGAMT
## 577675 2005 LA 1.693e+10
## 569308 2005 FL 1.000e+10
## 581533 2005 MS 7.350e+09
2005 was a devastating year regarding Hurricanes; Katrina, for example. Total Property costs reflect that exacly, as we can above. In the state of Luisianna it has been reported approx. $17B dollars of property costs. Both Louisianna and Mississippi were sevearly hit by Katrina. Results also show that in the state of Florida, total property damage costs have been reported to be $10B dollars, these were due to Hurricane Wilma. These results show how devastating these weather events can be, not just in terms of lives but also in property costs.
With this project, it was possible to concluded which weather events have a higher impact on population health, as well as economic consequences. Data analysed was restricted to events reported between 1991 and 2011. In this time frame, Tornados were responsible for the majority of injuries, while excessive heat caused the highest number of fatalities. In terms of economic consequences, it was found that drought events are responsible for larger crop damage costs, while hurricanes/typhoon events cause the highest property damage costs.