This analysis is of the US National Oceanic and Atmospheric Administration's (NOAA) Storm Database. This database is a collection of records of storms, severe weather events and similar occurrences, and provides information on when and where they occurred and on the injuries, fatalities and other harm they caused. It also contains estimates of property damage, crop damage and other losses.
The goal of the analysis is to determine which event types have had a substantial effect on people's health and which event types have had a major impact on the economy.
The database contains 902,297 observations of 37 variables. It was analysed to find out how many event types were recorded (985 different categories of events). The database was also summarised to check for missing values. Potential outliers were examined and their validity checked by reading through the event recording details in 'REMARKS'.
Loading the data into R - since we are dealing with a large data set, use 'fread' for faster reading of the data into R.
library("data.table")
## Warning: package 'data.table' was built under R version 3.2.5
stormdata <- fread( "StormData.csv",verbose = TRUE )
## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.523066 GB.
## Memory mapping ... ok
## Detected eol as \r\n (CRLF) in that order, the Windows standard.
## Positioned on line 1 after skip or autostart
## This line is the autostart and not blank so searching up for the last non-blank ... line 1
## Detecting sep ... ','
## Detected 37 columns. Longest stretch was from line 1 to line 30
## Starting data input on line 1 (either column names or first row of data). First 10 characters: "STATE__",
## All the fields on line 1 are character fields. Treating as the column names.
## Count of eol: 1307675 (including 1 at the end)
## Count of sep: 34819802
## nrow = MIN( nsep [34819802] / ncol [37] -1, neol [1307675] - nblank [1] ) = 967216
## Type codes ( first 5 rows): 3444344430000303003343333430000333303
## Type codes (+ middle 5 rows): 3444344434444303443343333434440333343
## Type codes (+ last 5 rows): 3444344434444303443343333434444333343
## Type codes: 3444344434444303443343333434444333343 (after applying colClasses and integer64)
## Type codes: 3444344434444303443343333434444333343 (after applying drop or select (if supplied)
## Allocating 37 column slots (37 - 0 dropped)
##
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:12
## Warning in fread("StormData.csv", verbose = TRUE): Read less rows (902297)
## than were allocated (967216). Run again with verbose=TRUE and please
## report.
## 0.000s ( 0%) Memory map (rerun may be quicker)
## 0.000s ( 0%) sep and header detection
## 1.454s ( 13%) Count rows (wc -l)
## 0.000s ( 0%) Column type detection (first, middle and last 5 rows)
## 0.562s ( 5%) Allocation of 902297x37 result (xMB) in RAM
## 9.135s ( 82%) Reading data
## 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
## 0.000s ( 0%) Coercing data already read in type bumps (if any)
## 0.020s ( 0%) Changing na.strings to NA
## 11.171s Total
Convert the fields EVTYPE, STATE and COUNTY to factors.
stormdata$COUNTY <- as.factor(stormdata$COUNTY)
stormdata$STATE <- as.factor(stormdata$STATE)
stormdata$EVTYPE <- as.factor(stormdata$EVTYPE)
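The checks described in the synopsis (counting the distinct event categories, looking for missing values and reading the 'REMARKS' narrative for suspect records) are not shown explicitly; a minimal sketch of how they could be run at this point, assuming the standard Storm Data column names:
nlevels(stormdata$EVTYPE)                     # number of distinct event categories (985 noted above)
sapply(stormdata, function(x) sum(is.na(x)))  # missing values per column
stormdata[PROPDMG == max(PROPDMG), REMARKS]   # narrative for an extreme property-damage record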
Subset the storm database so that only the required fields are selected. This eases the processing load.
stormdata1 <- as.data.frame(stormdata)
columns_needed <- c("FATALITIES","INJURIES","EVTYPE","CROPDMG","PROPDMG")
required_data <- stormdata1[columns_needed]
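As an alternative to subsetting after reading everything, fread's select argument can restrict the read to just these columns; a sketch (the object name required_data_alt is illustrative and not used later):
required_data_alt <- as.data.frame(
  fread("StormData.csv",
        select = c("FATALITIES", "INJURIES", "EVTYPE", "CROPDMG", "PROPDMG")))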
Sum up the fields by event type.
health1 <- aggregate( . ~ EVTYPE , data = required_data, sum)
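The same totals could also be computed directly on the data.table without the data frame conversion; a sketch of an equivalent (health1 from aggregate above is what the rest of the analysis uses):
health1_dt <- stormdata[, lapply(.SD, sum), by = EVTYPE,
                        .SDcols = c("FATALITIES", "INJURIES", "CROPDMG", "PROPDMG")]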
To answer the question on population health, let's look at the three event types with the largest totals of 'Fatalities' and 'Injuries'.
Let's plot the three events that caused the most injuries.
descend_injuries <- order(health1$INJURIES , decreasing = TRUE)
Injuries_data <- health1[descend_injuries,]
First3_Injuries <- Injuries_data[1:3,]
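The order-then-take-top-three pattern above is repeated below for fatalities, property damage and crop damage; a small hypothetical helper (top3_by, not part of the original code) could factor it out:
top3_by <- function(df, column) {
  df[order(df[[column]], decreasing = TRUE), ][1:3, ]
}
First3_Injuries <- top3_by(health1, "INJURIES")  # equivalent to the three lines above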
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
ggplot(First3_Injuries, aes(FATALITIES, INJURIES, width = 5000)) + geom_bar(stat = "identity", fill = "green", color = "red") + facet_grid(~ EVTYPE)
Similarly, let's plot the three events that caused the most fatalities.
descend_fatalities <- order(health1$FATALITIES , decreasing = TRUE)
Fatalities_data <- health1[descend_fatalities,]
First3_Fatality <- Fatalities_data[1:3,]
ggplot(First3_Fatality, aes(FATALITIES, INJURIES, width = 200)) + geom_bar(stat = "identity", fill = "blue", color = "red") + facet_grid(~ EVTYPE)
The data provides us with estimates of the property damage and crop damage done by the different events. By summing these values by event type we can see which event types caused the greatest economic consequences.
Let's plot the three events that caused the most property damage.
descend_propertydmg <- order(health1$PROPDMG , decreasing = TRUE)
Propdmg_data <- health1[descend_propertydmg,]
First3_Propdmg <- Propdmg_data[1:3,]
ggplot(First3_Propdmg, aes(CROPDMG, PROPDMG, width = 2000)) + geom_bar(stat = "identity", fill = "yellow", color = "red") + facet_grid(~ EVTYPE)
Now let's plot the three events that caused the most damage to crops.
descend_cropdmg <- order(health1$CROPDMG , decreasing = TRUE)
Cropdmg_data <- health1[descend_cropdmg,]
First3_Cropdmg <- Cropdmg_data[1:3,]
ggplot(First3_Cropdmg, aes(PROPDMG, CROPDMG, width = 20000)) + geom_bar(stat = "identity", fill = "grey", color = "red") + facet_grid(~ EVTYPE)
Events most harmful to population health: TORNADO, FLASH FLOOD, EXCESSIVE HEAT, TSTM WIND.
Events with the greatest economic consequences: TORNADO, FLASH FLOOD, TSTM WIND, HAIL.