Project 2 Final

Image sourced from the website stormhighway.com; it is a photo taken by Dan Robinson.

Introduction

The data set I chose is from NOAA (the National Oceanic and Atmospheric Administration), and it covers a variety of statistics on every recorded tornado in the U.S. from 1950 to 2023. I chose to analyze data on tornadoes because I have been fascinated by extreme weather events since I was young. Tornadoes in particular sit right in the sweet spot between awe-inspiring and terrifying. Scientists are still not sure exactly how and why tornadoes form, so they have always been a little mysterious. I have only been around a tornado in person twice, once when I was too young to remember and a second time just this past year, but my father is from North Carolina and experienced quite a few in his childhood, and the stories he passed on to me furthered my curiosity. My favorite movie as a child was ‘Twisters’, a movie all about chasing and analyzing tornadoes. A couple of the variables in the data set were confusing at first look, so I cleaned the set by re-naming all the variables that were hard to understand, a couple of which still require further explanation. First, ‘mag’ (magnitude), or as I renamed it, ‘severity_rating’, refers to the Fujita scale, which is a method of ranking tornadoes based on the damage they cause, which is then associated with a wind speed. The rankings run from 0 to 5, with 0 being the least severe and 5 the most. Unlike the rating system for hurricanes, the Fujita scale can give a tornado with devastatingly high wind speeds a relatively low ranking if it happened to form over a field, leaving people and property unaffected. In fact, some of the technically ‘biggest’ tornadoes in history are ranked relatively low on the Fujita scale for that reason. Scientists look at the destruction after the fact and then estimate the tornado’s wind speeds accordingly. Second, the ‘ns’ variable, or as I renamed it, ‘number_states’, simply refers to the number of states a single tornado affects or passes through.
A phrase I will be using frequently in this report is ‘tornado alley’, which simply means the collection of states where the most tornadoes happen in the U.S. For both my preliminary plots and my final visualizations, I downsized the data set to only include the variables I was confident I understood, which were “year”, “month”, “day”, “time”, “state”, “severity_rating”, “injuries”, “fatalities”, “lat” and “lng” (the starting latitude and longitude), “end_latitude”, “end_longitude”, “length_miles”, “width_yards”, “property_loss”, and “number_states”. I included those variables so I could investigate and figure out what I wanted to visualize; however, I did not end up using many of them in my final visualizations.

library(tidyverse)
library(tidyr)
library(hms)
library(highcharter)
library(plotly)
setwd("~/Documents/Data 110")
tornadoes <- readr::read_csv("1950-2023_all_tornadoes.csv") 
head(tornadoes)

## # A tibble: 6 × 29
##      om    yr mo    dy    date       time      tz st    stf     stn   mag   inj
##   <dbl> <dbl> <chr> <chr> <date>     <time> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1   192  1950 10    01    1950-10-01 21:00      3 OK    40       23     1     0
## 2   193  1950 10    09    1950-10-09 02:15      3 NC    37        9     3     3
## 3   195  1950 11    20    1950-11-20 02:20      3 KY    21        1     2     0
## 4   196  1950 11    20    1950-11-20 04:00      3 KY    21        2     1     0
## 5   197  1950 11    20    1950-11-20 07:30      3 MS    28       14     1     3
## 6   194  1950 11    04    1950-11-04 17:00      3 PA    42        5     3     1
## # ℹ 17 more variables: fat <dbl>, loss <dbl>, closs <dbl>, slat <dbl>,
## #   slon <dbl>, elat <dbl>, elon <dbl>, len <dbl>, wid <dbl>, ns <dbl>,
## #   sn <dbl>, sg <dbl>, f1 <dbl>, f2 <dbl>, f3 <dbl>, f4 <dbl>, fc <dbl>

Re-naming Variables:

Here I am using the rename function to change some of the less legible variable names to something that makes more sense.

tornadoes <- tornadoes |>    
  rename(                 
    year = yr,
    month = mo,
    day = dy,
    time_zone = tz,
    state = st,
    severity_rating = mag,
    injuries = inj,
    fatalities = fat,
    property_loss = loss,
    lat = slat,
    lng = slon,
    end_latitude = elat,
    end_longitude = elon,
    length_miles = len,
    width_yards = wid,
    number_states = ns,
    county_1 = f1,
    county_2 = f2,
    county_3 = f3,
    county_4 = f4)

Re-naming Months:

Here I am just re-naming the months from numbers to the actual names in case I want to use that variable for something later on.

tornadoes$month[tornadoes$month == "01"]<- "January" 
tornadoes$month[tornadoes$month == "02"]<- "February" 
tornadoes$month[tornadoes$month == "03"]<- "March"
tornadoes$month[tornadoes$month == "04"]<- "April"
tornadoes$month[tornadoes$month == "05"]<- "May"
tornadoes$month[tornadoes$month == "06"]<- "June"
tornadoes$month[tornadoes$month == "07"]<- "July"
tornadoes$month[tornadoes$month == "08"]<- "August"
tornadoes$month[tornadoes$month == "09"]<- "September"
tornadoes$month[tornadoes$month == "10"]<- "October"
tornadoes$month[tornadoes$month == "11"]<- "November"
tornadoes$month[tornadoes$month == "12"]<- "December"
head(tornadoes)

## # A tibble: 6 × 29
##      om  year month    day   date       time   time_zone state stf     stn
##   <dbl> <dbl> <chr>    <chr> <date>     <time>     <dbl> <chr> <chr> <dbl>
## 1   192  1950 October  01    1950-10-01 21:00          3 OK    40       23
## 2   193  1950 October  09    1950-10-09 02:15          3 NC    37        9
## 3   195  1950 November 20    1950-11-20 02:20          3 KY    21        1
## 4   196  1950 November 20    1950-11-20 04:00          3 KY    21        2
## 5   197  1950 November 20    1950-11-20 07:30          3 MS    28       14
## 6   194  1950 November 04    1950-11-04 17:00          3 PA    42        5
## # ℹ 19 more variables: severity_rating <dbl>, injuries <dbl>, fatalities <dbl>,
## #   property_loss <dbl>, closs <dbl>, lat <dbl>, lng <dbl>, end_latitude <dbl>,
## #   end_longitude <dbl>, length_miles <dbl>, width_yards <dbl>,
## #   number_states <dbl>, sn <dbl>, sg <dbl>, county_1 <dbl>, county_2 <dbl>,
## #   county_3 <dbl>, county_4 <dbl>, fc <dbl>
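As a side note, the twelve assignments above could probably be collapsed into a single lookup. The following is only a minimal sketch on a made-up vector (month_codes is a hypothetical stand-in for the month column, not the real data); it relies on month.name, a constant built into base R.

```r
# Toy vector standing in for the zero-padded month column (assumed values)
month_codes <- c("01", "10", "11", "12")

# month.name is a built-in base R constant holding the 12 English month names;
# converting the codes to integers lets them index into it directly
month_names <- month.name[as.integer(month_codes)]

month_names
```

On the real data, the same idea would be applied to tornadoes$month in place of month_codes.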

Checking for NA’s:

sum(is.na(tornadoes))

## [1] 0

Beginning of Making Second Final Plot:

Here I am creating a sub-data set using the select function, that only includes certain variables I might be interested in using. I am filtering the severity rating to only be category 5 tornadoes, because I find the ones that cause the most damage to be the most interesting. I am also filtering for the 7 states in tornado alley because that is where the most severe tornadoes are found most frequently.

tornadoes_2 <- tornadoes |> 
  select(year,              
         month,             
         day,                
         time,
         state,
         severity_rating,
         injuries,
         fatalities,
         lat,
         lng,
         end_latitude,
         end_longitude,
         length_miles,
         width_yards,
         number_states,
         property_loss) |>
  filter(severity_rating == 5) |>
  filter(state %in% c("TX", "OK","KS", "NE", "IA", "SD", "MO"))

Beginning of Creating Second Preliminary Plot:

In this sub-set, I wanted to find out how many tornadoes were recorded each year, so I used the group_by function to group by year, then I mutated a new column, num_tornadoes, and used the n() function to find the number of observations per grouped year. I then un-grouped the data so that it would print in the format of the original data set without having to use a bunch of summarize arguments, meaning each entry for 1950 returns the same value under the num_tornadoes column, and so on.

tornadoes_3 <- tornadoes |>
  group_by(year) |>
  mutate(num_tornadoes = n()) |>
  ungroup()
tornadoes_3

## # A tibble: 71,398 × 30
##       om  year month    day   date       time   time_zone state stf     stn
##    <dbl> <dbl> <chr>    <chr> <date>     <time>     <dbl> <chr> <chr> <dbl>
##  1   192  1950 October  01    1950-10-01 21:00          3 OK    40       23
##  2   193  1950 October  09    1950-10-09 02:15          3 NC    37        9
##  3   195  1950 November 20    1950-11-20 02:20          3 KY    21        1
##  4   196  1950 November 20    1950-11-20 04:00          3 KY    21        2
##  5   197  1950 November 20    1950-11-20 07:30          3 MS    28       14
##  6   194  1950 November 04    1950-11-04 17:00          3 PA    42        5
##  7   198  1950 December 02    1950-12-02 15:00          3 IL    17        7
##  8   199  1950 December 02    1950-12-02 16:00          3 IL    17        8
##  9   200  1950 December 02    1950-12-02 16:25          3 AR    5        12
## 10   201  1950 December 02    1950-12-02 17:30          3 IL    17        9
## # ℹ 71,388 more rows
## # ℹ 20 more variables: severity_rating <dbl>, injuries <dbl>, fatalities <dbl>,
## #   property_loss <dbl>, closs <dbl>, lat <dbl>, lng <dbl>, end_latitude <dbl>,
## #   end_longitude <dbl>, length_miles <dbl>, width_yards <dbl>,
## #   number_states <dbl>, sn <dbl>, sg <dbl>, county_1 <dbl>, county_2 <dbl>,
## #   county_3 <dbl>, county_4 <dbl>, fc <dbl>, num_tornadoes <int>
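For comparison, if only one row per year were needed (for example, for a bar chart of yearly totals), dplyr's count() could produce that directly. This is a minimal sketch on a made-up five-row table, not the code used in this report; toy and yearly_counts are hypothetical names.

```r
library(dplyr)

# Made-up stand-in for the tornado table: three tornadoes in 1950, two in 1951
toy <- tibble(year = c(1950, 1950, 1950, 1951, 1951))

# count() groups by year, tallies the rows, and returns one row per year
yearly_counts <- toy |>
  count(year, name = "num_tornadoes")

yearly_counts
```

The trade-off is that count() drops every other column, whereas the group_by/mutate approach above keeps the full data set and repeats the yearly total on each row.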

Beginning of Creating First Final Plot (1/2):

In this chunk, using the select function, I created a new sub data set with the same variables as the first sub data set, only this time, I did not filter for only category 5 tornadoes.

tornadoes_6 <- tornadoes |>
  select(year,
         month,
         day,
         time,
         state,
         severity_rating,
         injuries,
         fatalities,
         lat,
         lng,
         end_latitude,
         end_longitude,
         length_miles,
         width_yards,
         number_states,
         property_loss) |>
  filter(state %in% c("TX", "OK","KS", "NE", "IA", "SD", "MO"))

Beginning For Creating First Final Plot (2/2):

Here I am creating a very similar data set to the past two, except I am filtering for category 4 and 5 tornadoes only, so the data is small enough to be imported into highcharter without crashing RStudio.

tornadoes_8 <- tornadoes |>
  select(year,
         month,
         day,
         time,
         state,
         severity_rating,
         injuries,
         fatalities,
         lat,
         lng,
         end_latitude,
         end_longitude,
         length_miles,
         width_yards,
         number_states,
         property_loss) |>
  filter(state %in% c("TX", "OK","KS", "NE", "IA", "SD", "MO"))|>
  filter(severity_rating %in% c(4,5))

Beginning of Creating First Preliminary Plot:

In this chunk, I am creating a new data frame pulled from my third sub-setted data frame (tornadoes_6), by first grouping by state, and then summarizing by the sum of the tornadoes with rank 5 from each state, and then once I realized there was a tie, I also summarized by the sum of the tornadoes with rank 4 from each state. I then ranked them in descending order by each state’s number of both category 5 and 4 tornadoes combined, which eliminated the tie. I also summarized by the sum of all fatalities from every tornado in each of the seven states.

tornadoes_sev <- tornadoes_6 |>
  group_by(state) |>
summarise(
  num_cat5 = sum(severity_rating == 5),
  num_cat4 = sum(severity_rating == 4),
    num_fat = sum(fatalities)
) |>
  mutate(cat_rank = rank(-(num_cat5 + num_cat4))) |>
  mutate(cat_rank = as.character(cat_rank))
  
  
tornadoes_sev

## # A tibble: 7 × 5
##   state num_cat5 num_cat4 num_fat cat_rank
##   <chr>    <int>    <int>   <dbl> <chr>   
## 1 IA           7       51      97 3       
## 2 KS           9       48     297 4       
## 3 MO           2       50     440 5       
## 4 NE           2       37      60 6       
## 5 OK          10       65     483 1       
## 6 SD           1        9      19 7       
## 7 TX           6       61     658 2
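To illustrate why combining the two counts removed the tie, here is a small sketch using the MO, NE, and OK counts from the table above. It assumes only base R's rank(), which by default gives tied values the average of their ranks.

```r
# Counts for three of the seven states, copied from the table above
num_cat5 <- c(MO = 2, NE = 2, OK = 10)
num_cat4 <- c(MO = 50, NE = 37, OK = 65)

# Ranking on category 5 alone: MO and NE tie, so each gets the average rank 2.5
rank_cat5_only <- rank(-num_cat5)

# Ranking on category 4 and 5 combined: the totals differ, so the tie disappears
rank_combined <- rank(-(num_cat5 + num_cat4))

rank_cat5_only
rank_combined
```

The ranks here are relative to these three states only, so they differ from the cat_rank values in the full seven-state table.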

Linear Regression Model:

Here I am creating my linear regression model to investigate the relationships between my predictor (independent) variables, time, length_miles, width_yards, and end_latitude, and my response (dependent) variable, fatalities.

lm_model <- lm(fatalities ~ time + length_miles + width_yards + end_latitude, data = tornadoes)

summary(lm_model)

## 
## Call:
## lm(formula = fatalities ~ time + length_miles + width_yards + 
##     end_latitude, data = tornadoes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -9.819  -0.083   0.020   0.110 155.978 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.620e-02  2.005e-02  -4.798  1.6e-06 ***
## time          8.811e-07  3.216e-07   2.740  0.00615 ** 
## length_miles  4.524e-02  7.414e-04  61.020  < 2e-16 ***
## width_yards   7.815e-04  2.909e-05  26.862  < 2e-16 ***
## end_latitude -4.413e-03  3.271e-04 -13.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.536 on 71393 degrees of freedom
## Multiple R-squared:  0.08356,    Adjusted R-squared:  0.08351 
## F-statistic:  1627 on 4 and 71393 DF,  p-value: < 2.2e-16

Equation:

Fatalities = -0.0962 + 0.0000008811(time) + 0.04524(length_miles) + 0.0007815(width_yards) - 0.004413(end_latitude)
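As a worked example of reading this equation, the sketch below plugs in one hypothetical tornado. The input values are invented for illustration, and it assumes that time enters the model as seconds after midnight, which is how an hms column is stored numerically and which would fit the very small time coefficient.

```r
# Coefficients copied from the fitted equation above
b0       <- -0.0962
b_time   <- 0.0000008811
b_length <- 0.04524
b_width  <- 0.0007815
b_endlat <- -0.004413

# Hypothetical tornado (all four values are made up for illustration)
time_secs    <- 17 * 3600  # 17:00, assumed to be coded as seconds after midnight
length_miles <- 10
width_yards  <- 200
end_latitude <- 35

predicted_fatalities <- b0 +
  b_time * time_secs +
  b_length * length_miles +
  b_width * width_yards +
  b_endlat * end_latitude

round(predicted_fatalities, 2)
```

Under those assumptions the prediction works out to roughly 0.41 fatalities, which shows how small the predicted values are for an ordinary tornado.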

Diagnostic Plot 1:

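Here, I am plotting the width of a tornado against fatalities with a fitted regression line. Note that fatalities are on the x-axis and width is on the y-axis, so the line shown regresses width on fatalities, the reverse of the direction used in the model above.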

ggplot(tornadoes, aes(x = fatalities, y = width_yards)) +
  geom_point() +  
  geom_smooth(method = "lm", se = TRUE, color = "red") + 
  labs(x = "Number of Fatalities",
       y = "Width in Yards",
       title = "Linear Regression: Fatalities caused vs. Width of a Tornado")+  
  theme_bw()

## `geom_smooth()` using formula = 'y ~ x'

Diagnostic Plot 2:

Here, I am plotting the time of tornado onset against fatalities with a fitted regression line (fatalities on the x-axis, time on the y-axis).

ggplot(tornadoes, aes(x = fatalities, y = time)) +
  geom_point() +  
  geom_smooth(method = "lm", se = TRUE, color = "red") + 
  labs(x = "Number of Fatalities",
       y = "Time of Tornado Onset",
       title = "Linear Regression: Fatalities caused vs. Time of Tornado Onset")+  
  theme_bw()

## `geom_smooth()` using formula = 'y ~ x'

Diagnostic Plot 3:

Here, I am creating a residuals vs. fitted plot for my linear regression model.

plot(lm_model, which=1)

Linear Regression Analysis:

Based on the p-values of the variables in the model, I can see that length_miles, width_yards, and end_latitude (all p < 2e-16) are significantly associated with the number of fatalities a tornado causes, along with time, though it is slightly less so (p-value 0.00615). The end latitude’s association most likely has something to do with certain coordinates having more highly populated areas, or simply being a place where tornadoes happen more frequently and more severely, i.e. tornado alley, so that relationship makes sense. At first I was surprised time was not more significantly associated, as I thought whether a tornado struck at night or during the day would have a great effect on fatalities, either because more people are out during the day, meaning fewer people would be hurt at night, or because it is harder to spot a tornado in the dark, resulting in more injuries at night. I theorized that this could be due to the time frame when tornadoes most often happen, such as the afternoon or early evening, when it could still be light out. I Google searched this and it seemed to hold up, so, if that were true, there would simply be less data on nighttime tornadoes, making a pattern hard to find. However, according to my second diagnostic plot, there are quite a few recorded tornadoes in the late evening, so I suppose my initial thoughts were simply unfounded. The relationship between the width and miles a tornado covers and the fatalities it causes was unsurprising and seems rather self-explanatory, i.e. the bigger a tornado is and the farther it travels, the more chance it has to kill people. However, the adjusted R-squared for the model is quite low: the model accounts for only about 8.4% of the variance in fatalities. The fact is, most tornadoes do not kill people, and therefore most of my data is of tornadoes with extremely low to no fatalities.
This means it is hard to be confident about patterns when the data on deadly tornadoes is so sparse, as is shown in my diagnostic plots one and two, with clustering around x=0 and only a few outliers pulling the data to the right and shaping the regression line. Though the lines in the first and second plots suggest some sort of positive relationship between the variables, e.g. as the width of a tornado increases, fatalities increase, the confidence band also widens significantly as the line ascends, due to the lack of data. Potentially, my dependent variable is mostly influenced by independent variables I am unable to explore with this data set. Additionally, while it is true that bigger tornadoes that travel farther have the potential to inflict great damage and cause high fatalities, that all depends on the random chance that a tornado forms over a populated area. So perhaps the variance in fatalities is mostly random. As for my third diagnostic plot, residuals vs. fitted, most of the residuals are around zero, which means my predictions were close for those tornadoes, but that is only really because most of the tornadoes are low casualty, i.e. that is where I have the bulk of my data. It seems like my model works okay for predicting low casualty tornadoes, but because the only real examples of higher fatality tornadoes are sparse outliers, the residuals grow larger and the model begins to fail as fatalities rise. Overall, the model is not a very good fit.

First Preliminary Plot:

In this chunk, I created my first ggplot, pulling from the data set tornadoes_sev. I set the x axis to the state, reordered in ascending order by tornado severity ranking (states ranked by how many category 4 and 5 tornadoes occurred in each state in total), the y axis to the total number of fatalities, and I filled by the tornado severity ranking. I used scale_fill_manual to create a gradient of reds because it felt like an appropriate color for visualizing destruction (dark red = most, light red = least). Lastly, I changed the theme to black and white, and rotated the x-axis text by 45 degrees.

  plot1 <- tornadoes_sev |>
  ggplot() +
  geom_bar(aes(x = fct_reorder(state, cat_rank), y = num_fat, fill = cat_rank),
      position = "dodge", stat = "identity") +
  labs(y = "Number of Fatalities",
       x = "State",
       title = "States in Tornado Alley Compared by fatalities, and Ranked by Number of Cat. 4/5 Tornadoes",
       subtitle = "In the Years 1950-2023", 
       fill = "Rank by Number of Cat. 4/5 Tornadoes (most = 1, least = 7)",
      caption = "Source: NOAA") +
  scale_fill_manual(values = c("1" = '#8a0000', "2" = '#b30000', "3" = '#e40000', "4" = '#ff1e1e', "5" = '#ff6868', "6" = '#ff9090', "7" = '#feb9b9')) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
plot1

Second Preliminary Plot:

In this chunk I am looking at the number of tornadoes recorded each year, in U.S. history (1950-2023), simply to see if tornadoes are in fact trending upwards.

plot5 <- tornadoes_3 |>
  ggplot() +
  geom_bar(aes(x = year, y = num_tornadoes),
      position = "dodge", stat = "identity") +
  labs(y = "Number of Tornadoes Recorded",
       x = "Year",
       title = "Number of Tornadoes Recorded Each year, 1950-2023",
      caption = "Source: NOAA") +
  theme_bw() 
  

plot5

Final Plot 1 ggplot (1/2):

In this chunk, I created a scatterplot comparing the width of a tornado to its length, with point size showing the number of fatalities. I created a custom color palette to represent each state in tornado alley. I also changed the theme to dark because I thought it made the points stand out better.

Final_1 <- ggplot(tornadoes_6, aes(x = width_yards, y = length_miles, color = state, size = fatalities)) +
  scale_color_manual(values = c("TX" = "green", "OK" = "pink", "KS" = "red", "NE" = "blue", "IA" = "yellow", "SD" = "purple", "MO" = "orange")) +
  geom_point(alpha = 0.6) +
  labs(title = "Tornado Alley Tornadoes Compared by Width, Miles Covered, and Fatalities Caused",
       subtitle = "In the Years 1950-2023",
       caption = "Source: NOAA",
       x = "Width(yards)",
       y = "Miles Covered",
       color = "State",
       size = "Number of Fatalities") +
  theme_dark()
Final_1

Final Plot 1 Highcharter (2/2):

In this chunk, I recreated the above plot with a downsized data set (only category 4 and 5 tornadoes), adding interactivity with highcharter. I made a custom palette to represent the states that matches the one above, but formatted for highcharter. I figured out how to do the majority of this from the class notes, and I still struggled a lot with it.

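# Colors are applied to the state groups in alphabetical order (IA, KS, MO, NE, OK, SD, TX),
# so they are listed in that order to match the ggplot palette above
state_colors <- c("yellow", "red", "orange", "blue", "pink", "purple", "green")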


  highchart() |>
    hc_add_series( data = tornadoes_8,
                   type = "scatter",
                   hcaes(x = width_yards, 
                         y = length_miles, 
                         group = state, 
                         size = fatalities),
                   opacity = 0.6) |>
  hc_colors(state_colors) |>
  hc_title(text = "Tornado Alley Tornadoes Compared by Width, Miles Covered, and Fatalities Caused") |>
  hc_subtitle(text = "In the Years 1950–2023") |>
  hc_xAxis(title = list(text = "Width (yards)")) |>
  hc_yAxis(title = list(text = "Miles Covered")) |>
  hc_legend(title = list(text = "State"), align = "right", verticalAlign = "top") |>
  hc_tooltip(pointFormat = paste(
    "<b>Width:</b> {point.x} yards<br>",
    "<b>Miles Covered:</b> {point.y}<br>",
    "<b>Fatalities:</b> {point.z}"
  ))

Final Plot 2:

Here I am making a 3D scatterplot in plotly comparing the year, fatalities, and width of all category 5 tornadoes from tornado alley in recorded history (1950-2023). I figured out how to do this based on web research (see bibliography). The main things in this chunk that I think are not self-explanatory are hoverinfo, which I set equal to “text”, and text, an object I defined on the next line down. Hoverinfo controls what information appears in the tooltip.

tornadoes_2 |>
  plot_ly(x = ~year,
          y = ~fatalities,   # matches the "Fatalities" y-axis title and range set in layout()
          z = ~width_yards,
          type = "scatter3d",
          mode = "markers+lines",
          color = ~state,
          colors = "Set1",
          hoverinfo = "text",
          # <br> puts each field on its own line in the tooltip
          text = ~paste("Year:", year,
                        "<br>Width: ", width_yards,
                        "<br>Miles Traveled: ", length_miles,
                        "<br>Fatalities: ", fatalities,
                        "<br>Source: NOAA"),
          marker = list(size = 5)) |>
    layout(
      legend = list(
      title = list(text = "<b> States </b>")),
    scene = list(
      xaxis = list(title = "Year", range = c(min(tornadoes_2$year), max(tornadoes_2$year))),
      yaxis = list(title = "Fatalities", range = c(0, max(tornadoes_2$fatalities))),
      zaxis = list(title = "Width (yards)", range = c(0, max(tornadoes_2$width_yards)))
    )
  )

Essay:

According to NOAA, as I discussed previously, a tornado’s strength is measured by the damage it causes, which then allows scientists to attach an estimated wind speed after the fact, based on the force required to knock down trees, throw cars, or even completely destroy towns. There is a lot about tornadoes, such as how and why they form, that scientists don’t entirely understand, making them hard to predict and random in many ways, as my linear regression model potentially showed. Even though certain variables like tornado width and length might be significantly associated with something like higher fatalities, they don’t account for a lot of the variance, as so much rests on variables we don’t fully understand. What we do know boils down to this: most tornadoes with the potential to be really destructive form in big storms called supercells, which are “rotating thunderstorms” (Severe Weather 101 - Tornadoes). The theory is that temperature changes inside and outside the supercell interacting could be what causes tornadoes to form, but there is also evidence that this is not always the case. Much about tornadoes is a mystery, which is a big part of why I find myself drawn to them. The first preliminary plot I made compared the 7 states of tornado alley by the number of category 4 and 5 tornadoes each state had in recorded history, and by the total number of fatalities. Something interesting I found there was that just because a state had more category 4 and 5 tornadoes than others (sometimes by a lot) did not mean it had more total fatalities than states that had fewer severe tornadoes. I was initially confused by this, as I know tornado rankings come from damage caused, so shouldn’t the state with the most category 4 and 5 tornadoes also have the highest total of tornado-related casualties?
I realized the answer is most likely that the state with the most category 4 and 5 tornadoes could have had more property loss than fatalities, as that contributes to rankings just the same, though it is still confusing. For my second preliminary plot, I wanted to see how the number of tornadoes per year has changed as the years have gone on. Granted, some of the results could be due to the fact that we have gotten better at tracking and documenting when tornadoes occur, but you can clearly see an overall increase in the number of tornadoes recorded per year, which I expected given that global warming is causing more and more severe weather events. For my first final plot, I created a scatterplot with the width of the tornadoes on the x axis, the length of the tornadoes (miles covered) on the y axis, color showing the state (I only included the 7 states in tornado alley again), and the size of the bubbles showing fatalities. This plot was interesting; it seemed to show that it might be hard for a tornado to cover a long distance and also hurt a lot of people. I wanted to look deeper into the width of a tornado versus miles covered because, in my regression model, it seemed width was slightly more associated with fatalities, and I was curious about that. I think this result is probably mostly due to chance. I know there are few tornadoes recorded in this data set that resulted in high casualties, so perhaps those tornadoes were simply randomly very wide and not very long. I tried to put this plot into highcharter, because I don’t like how plotly takes away my caption and the labels for my legend, etc., but unfortunately, with the size of the data set, it nearly crashed my computer when I tried to run my code. Because of this, I ended up downsizing the data to only focus on category 4 and 5 tornadoes, which ended up being about 380 observations as opposed to the original 28,000 or so. For my second final plot, I created a 3D plotly scatterplot.
When making this, I read a lot online about the specific syntax and code necessary to create it, referencing websites like plotly.com and their article entitled ‘3D Scatter Plots in R’, as well as geeksforgeeks.org and their article entitled ‘Scatter Slot using Plotly in R’. I also checked out some of the prior work from my classmates who had done something similar. In this plot, I again compared the 7 states in tornado alley, but this time by width, fatalities, and year. Since tornadoes with greater width might be associated with more deaths, and since it seems more tornadoes are happening each year, I wanted to see if tornadoes are getting wider and more fatal over the years. The plot showed that over the years tornadoes do seem to be getting wider, and also slightly more fatal, but it was hard to see any positive relationship between width and fatalities. The results were not necessarily a surprise, again, considering climate change. When it comes to things I wish I could have done, there is one thing I spent a particularly long time trying to fix that I think would have made my plot more legible if it had worked. In the 3D scatterplot, I had initially wanted to tie the size of the dots to the number of fatalities a tornado caused, as an extra way to view that variable, but unfortunately plotly seemed to default to bigger bubbles = fewer fatalities and smaller bubbles = more fatalities. I could not figure out how to reverse that, so I took away the sizing altogether. Overall, I had a lot of trouble with the variables of this data set. The way the time variable was formatted gave me lots of problems initially, and there were several other variables that I had thought were one thing but on closer inspection made no sense, so I had to do quite a bit more research than I was expecting. Some of the variables I had thought were quantitative ended up being categorical, and that drastically changed my plans for wrangling and visualizing this data.
I believe I made about 9 different plots, some GIS, some scatterplot, a heat map, etc., before the final report. I tried so many different things that didn’t really work, until finally, I landed on these plots.

Bibliography

Source for data set: NOAA

Article used for background research: ‘Severe Weather 101 – Tornadoes’–> sourced from nssl.noaa.gov

1st article used for 3d scatterplot reference: ‘3D Scatter Plots in R’–> sourced from plotly.com

2nd article used for 3d scatterplot reference: ‘Scatter Slot using Plotly in R’–> sourced from geeksforgeeks.org

Source for opening image: stormhighway.com, a photo taken by Dan Robinson.