This analysis uses the publicly available data from the “National Weather Service Storm Data Documentation”. The data consists of recorded weather events and their characteristics, such as the event type (e.g. tornado), the number of injuries and fatalities, and the amount of property and crop damage. These characteristics are used to answer two questions: which weather event type is most harmful with respect to population health, and which event type causes the highest economic damage? To get clean and tidy data for this purpose, some data processing is necessary. After that, comparative plots show the top weather events with respect to their impact.
First we read in the data from the unzipped CSV file and set the column “REFNUM” as row names.
library(dplyr)
library(ggplot2)
library(gridExtra)
# Reading StormData from CSV File. Set the field "REFNUM" as rownames.
data2<-read.csv("repdata_data_StormData.csv", row.names = "REFNUM")
# Calculating the number of different event-types
# (unique() works whether read.csv returns EVTYPE as character or as factor)
num_EVTYPE<-length(unique(data2$EVTYPE))
A look at the column “EVTYPE” shows that many event types differ only in their capitalization and the use of special characters. Because of that we have 985 different event types. To reduce this number, we convert all characters to upper case and remove all non-alphabetical characters.
# Mutate all event-types to upper case and remove all non-alphabetical characters
data2$EVTYPE<-toupper(data2$EVTYPE)
data2$EVTYPE<-gsub("[^[:alpha:]]","",data2$EVTYPE)
# Create a new factor of the variable
data2$EVTYPE<-factor(data2$EVTYPE)
# Calculating the number of different event-types
num_EVTYPE<-length(levels(data2$EVTYPE))
This reduces the number of different event types to 701.
The columns “PROPDMGEXP” and “CROPDMGEXP” give us the magnitude of the values in the columns “PROPDMG” and “CROPDMG”. The combination of both gives us the property and crop damage of the recorded event. An inspection of the “PROPDMGEXP” and “CROPDMGEXP” fields shows that there are some faulty recordings due to capitalization and invalid characters. So we transform every character to upper case and keep only the rows with valid characters (“B”, “K”, “M”) or an empty field.
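To illustrate the effect of this normalization, here is a minimal sketch on a handful of made-up labels. The vector `raw` is hypothetical and not taken from the dataset; only the two cleaning steps are the same as above.

```r
# Hypothetical event-type labels that differ only in case and punctuation
raw <- c("Tstm Wind", "TSTM WIND", "tstm wind.", "Flash Flood", "FLASH-FLOOD", "Hail")

# Same two steps as above: upper case, then drop every non-alphabetical character
clean <- gsub("[^[:alpha:]]", "", toupper(raw))

n_before <- length(unique(raw))    # number of distinct raw labels
n_after  <- length(unique(clean))  # number of distinct normalized labels
print(unique(clean))
```

Note that the pattern `[^[:alpha:]]` also removes spaces and digits, so labels that differ only in a number or in spacing are merged as well.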
# Transform "PROPDMGEXP" and "CROPDMGEXP" to upper case
data2$PROPDMGEXP<-toupper(data2$PROPDMGEXP)
data2$CROPDMGEXP<-toupper(data2$CROPDMGEXP)
# Read number of rows before filtering
lenght_before<-nrow(data2)
# Filter for valid characters
data2<-data2%>%filter((PROPDMGEXP %in% c("","B","K","M")) & (CROPDMGEXP %in% c("","B","K","M")))
# Create a new factor of the variables
data2$PROPDMGEXP<-factor(data2$PROPDMGEXP)
data2$CROPDMGEXP<-factor(data2$CROPDMGEXP)
# Read number of rows after filtering
lenght_after<-nrow(data2)
By filtering out the invalid records we lose 348 of 902297 rows (about 0.04%). For now we assume that this is not a significant loss of information.
Next we calculate the property and crop damage in dollars and save it in new columns.
# Calculate the property and crop damage in dollars
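Before accepting such a loss, it can help to look at which codes are actually dropped. The following is a small sketch on a hypothetical vector `exp_codes`; the invalid codes in it are invented for illustration and are not a list of what the dataset contains.

```r
# Hypothetical exponent codes, including some invalid ones ("+", "5", "?")
exp_codes <- c("K", "k", "M", "", "B", "+", "5", "K", "?")
valid <- c("", "B", "K", "M")

# Same logic as the filter above: upper case first, then test membership
exp_upper <- toupper(exp_codes)
keep <- exp_upper %in% valid

dropped <- sum(!keep)                          # rows the filter would remove
share_dropped <- dropped / length(exp_codes)   # fraction of rows removed
print(table(exp_upper[!keep]))                 # which invalid codes occur, and how often
```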
data2<-data2 %>% mutate(PROPDMG2= case_when(PROPDMGEXP=="B" ~ PROPDMG*1000000000,PROPDMGEXP=="K" ~ PROPDMG*1000, PROPDMGEXP=="M" ~ PROPDMG*1000000, TRUE ~ PROPDMG))
data2<-data2 %>% mutate(CROPDMG2 = case_when(CROPDMGEXP=="B" ~ CROPDMG*1000000000,CROPDMGEXP=="K" ~ CROPDMG*1000, CROPDMGEXP=="M" ~ CROPDMG*1000000, TRUE ~ CROPDMG))
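The same conversion can also be written once and reused for both columns. The sketch below uses a named lookup vector instead of `case_when`; the names `multiplier` and `to_dollars` and the sample values are assumptions made for this illustration and are not part of the analysis code.

```r
# Multipliers for the exponent codes kept by the filter; an empty code means "no multiplier"
multiplier <- c(B = 1e9, M = 1e6, K = 1e3)

to_dollars <- function(value, exp_code) {
  # Look up the multiplier by name; codes without an entry (here "") yield NA
  m <- unname(multiplier[as.character(exp_code)])
  # Treat a missing multiplier as 1, like the TRUE branch of case_when above
  m[is.na(m)] <- 1
  value * m
}

# Hypothetical sample values
damage <- to_dollars(c(25, 2.5, 1.2, 500), c("K", "M", "B", ""))
print(damage)
```

With this helper, the two new columns could be filled with `to_dollars(PROPDMG, PROPDMGEXP)` and `to_dollars(CROPDMG, CROPDMGEXP)`.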
Finally we parse the strings in the columns “BGN_DATE” and “END_DATE” to get variables of class Date.
# Transform the date-string to a variable of class date
data2<-data2%>%mutate(BGN_DATE=as.Date(BGN_DATE, "%m/%d/%Y") )
data2<-data2%>%mutate(END_DATE=as.Date(END_DATE, "%m/%d/%Y") )
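A quick check on a few sample strings shows how this format string behaves. The values in `raw_dates` are hypothetical; it is an assumption here that the raw fields look like a month/day/year date followed by a time part, and that some end dates are empty.

```r
# Hypothetical raw values: month/day/year followed by a time, plus one empty field
raw_dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00", "")

# The format only describes the date part; trailing characters are ignored
parsed <- as.Date(raw_dates, "%m/%d/%Y")

n_missing <- sum(is.na(parsed))   # empty or unparsable strings become NA
print(parsed)
```

Counting the `NA` values after parsing is a cheap way to see how many records have no usable date.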
The results of the analysis are shown in several comparative plots, which show the total number of fatalities and injuries and the total amount of property and crop damage over all weather events, together with the 4 highest sums grouped by event type.
To get these plots we group the clean dataset by event-type.
# Grouping the clean data by event-type
data_EVTYPE<-group_by(data2, EVTYPE)
First we calculate the sums of the fatalities grouped by weather event type. After that we add a row with the total number of fatalities, so that this value also appears in the plots below. In addition we order the rows and levels to get the highest values first and an ordered barplot.
# FATALITIES in total and per EVTYPE
results_FAT<-summarize(data_EVTYPE, sum_FAT=sum(FATALITIES))
# Calculate the total number of all fatalities and add this entry as a new event-type to the results_FAT data
total_FAT<-sum(data2$FATALITIES)
levels(results_FAT$EVTYPE)<-c(levels(results_FAT$EVTYPE),"TOTAL")
results_FAT<-rbind(results_FAT,data.frame(EVTYPE=factor(x = "TOTAL", levels = levels(results_FAT$EVTYPE)), sum_FAT=as.double(total_FAT)))
# Order the records and levels in descending order
results_FAT<- results_FAT %>% arrange(desc(sum_FAT))
results_FAT$EVTYPE <- factor(results_FAT$EVTYPE, levels = results_FAT$EVTYPE[order(-results_FAT$sum_FAT)])
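Since the same three steps (sum per event type, append a total row, sort) are repeated for every measure, they could be wrapped in one function. The following is a base R sketch rather than the dplyr code used in this analysis; the function name `sum_with_total` and the data frame `toy` are hypothetical.

```r
# Sum `value` per `group`, append a "TOTAL" row, and sort in descending order
sum_with_total <- function(group, value) {
  sums <- tapply(value, as.character(group), sum)
  res <- data.frame(EVTYPE = c(names(sums), "TOTAL"),
                    total = c(as.numeric(sums), sum(value)),
                    stringsAsFactors = FALSE)
  res <- res[order(-res$total), ]
  # Fix the level order so that a bar plot keeps the sorted order
  res$EVTYPE <- factor(res$EVTYPE, levels = res$EVTYPE)
  rownames(res) <- NULL
  res
}

# Hypothetical toy data
toy <- data.frame(EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
                  FATALITIES = c(5, 2, 3, 1, 4))
top <- sum_with_total(toy$EVTYPE, toy$FATALITIES)
print(top)
```

The sketch assumes that no event type is itself called "TOTAL"; otherwise the factor levels would no longer be unique.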
Now we do the same for the injuries.
# INJURIES in total and per EVTYPE
results_INJ<-summarize(data_EVTYPE, sum_INJ=sum(INJURIES))
# Calculate the total number of all injuries and add this entry as a new event-type to the results_INJ data
total_INJ<-sum(data2$INJURIES)
levels(results_INJ$EVTYPE)<-c(levels(results_INJ$EVTYPE),"TOTAL")
results_INJ<-rbind(results_INJ,data.frame(EVTYPE=factor(x = "TOTAL", levels = levels(results_INJ$EVTYPE)), sum_INJ=as.double(total_INJ)))
# Order the records and levels in descending order
results_INJ<- results_INJ %>% arrange(desc(sum_INJ))
results_INJ$EVTYPE <- factor(results_INJ$EVTYPE, levels = results_INJ$EVTYPE[order(-results_INJ$sum_INJ)])
After that we plot the two barplots next to each other to see which weather event types cause the most fatalities and injuries.
# Creating a barplot with respect to the fatalities
p1<-ggplot(results_FAT[c(1:5),], aes( y=sum_FAT, x=EVTYPE)) +
geom_bar(stat="identity" ) +
labs(title="Fatalities by weather event", x="Weather Event", y="Count" ) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
# Creating a barplot with respect to the injuries
p2<-ggplot(results_INJ[c(1:5),], aes( y=sum_INJ, x=EVTYPE)) +
geom_bar(stat="identity" ) +
labs(title="Injuries by weather event", x="Weather Event", y="Count" ) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
# Plotting both barplots next to each other
grid.arrange(p1,p2, nrow=1)
The left plot shows that nearly 33% of the total fatalities are caused by tornadoes, followed by excessive heat, flash floods and heat. The right plot shows that nearly 66% of the injuries are also caused by tornadoes, followed by TSTM wind, floods and excessive heat. So the result is that tornadoes are the most harmful weather events with respect to population health.
Second we calculate the sums of the property damage grouped by weather event type. After that we add a row with the total amount of property damage, so that this value also appears in the plots below. In addition we order the rows and levels to get the highest values first and an ordered barplot.
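Shares like these can also be computed directly instead of being read off the bars. The sketch below uses invented per-event sums in a vector `sums`; the numbers are placeholders and not the values from this dataset.

```r
# Hypothetical per-event sums (placeholders, not the real values)
sums <- c(TORNADO = 50, EXCESSIVEHEAT = 25, FLASHFLOOD = 15, HEAT = 10)

# Share of each event type in the overall total, in percent
share <- round(100 * sums / sum(sums), 1)
share <- sort(share, decreasing = TRUE)
print(share)
```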
# PROPERTY Damage in total and per EVTYPE
results_PROPDMG<-summarize(data_EVTYPE,sum_PROPDMG=sum(PROPDMG2))
# Calculate the total amount of all property damage and add this entry as a new event-type to the results_PROPDMG data
total_PROPDMG<-sum(data2$PROPDMG2)
levels(results_PROPDMG$EVTYPE)<-c(levels(results_PROPDMG$EVTYPE),"TOTAL")
results_PROPDMG<-rbind(results_PROPDMG,data.frame(EVTYPE=factor(x = "TOTAL", levels = levels(results_PROPDMG$EVTYPE)), sum_PROPDMG=as.double(total_PROPDMG)))
# Order the records and levels in descending order
results_PROPDMG<- results_PROPDMG %>% arrange(desc(sum_PROPDMG))
results_PROPDMG$EVTYPE <- factor(results_PROPDMG$EVTYPE, levels = results_PROPDMG$EVTYPE[order(-results_PROPDMG$sum_PROPDMG)])
Now we do the same for the crop damage values.
# CROP Damage in total and per EVTYPE
results_CROPDMG<-summarize(data_EVTYPE,sum_CROPDMG=sum(CROPDMG2))
# Calculate the total amount of all crop damage and add this entry as a new event-type to the results_CROPDMG data
total_CROPDMG<-sum(data2$CROPDMG2)
levels(results_CROPDMG$EVTYPE)<-c(levels(results_CROPDMG$EVTYPE),"TOTAL")
results_CROPDMG<-rbind(results_CROPDMG,data.frame(EVTYPE=factor(x = "TOTAL", levels = levels(results_CROPDMG$EVTYPE)), sum_CROPDMG=as.double(total_CROPDMG)))
# Order the records and levels in descending order
results_CROPDMG<- results_CROPDMG %>% arrange(desc(sum_CROPDMG))
results_CROPDMG$EVTYPE <- factor(results_CROPDMG$EVTYPE, levels = results_CROPDMG$EVTYPE[order(-results_CROPDMG$sum_CROPDMG)])
After that we plot the two barplots next to each other to see which weather event types cause the most damage to property and crops.
p3<-ggplot(results_PROPDMG[c(1:5),], aes( y=sum_PROPDMG, x=EVTYPE)) +
geom_bar(stat="identity" ) +
labs(title="Property damage by weather event", x="Weather Event", y="Sum in $" ) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p4<-ggplot(results_CROPDMG[c(1:5),], aes( y=sum_CROPDMG, x=EVTYPE)) +
geom_bar(stat="identity" ) +
labs(title="Crop damage by weather event", x="Weather Event", y="Sum in $" ) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
grid.arrange(p3,p4, nrow=1)
The left plot shows that nearly 35% of the total property damage is caused by floods, followed by hurricanes/typhoons, tornadoes and storm surges. The right plot shows that nearly 30% of the crop damage is caused by droughts, followed by floods, river floods and ice storms. So the result is that the weather events with the highest damage to the economy (defined by property and crop damage) are floods and droughts.