Impact of Tornadoes on Life, Property, and Agriculture in Maryland

Author

Mia Ramirez

Source: Wikipedia Commons for Fujita Scale, Tornado image: https://www.cbs8.com/article/weather/weather-blog/15-years-later-a-look-back-on-the-la-plata-tornado/65-434613884

Introduction

North America has the most tornadoes per year of any continent, and the United States has the most tornadoes per year of any country. Tornadoes are violent rotating pillars of air extending from clouds to the ground. According to NOAA, approximately 1,200 tornadoes occur annually across the United States, with Maryland averaging roughly 15-20 per year, mostly weak EF0-EF1 events. Survey teams rank the magnitude of a tornado based on wind speed, observed damage and other related factors. The Enhanced Fujita (EF) Scale rates tornadoes from EF0 to EF5 based on estimated wind speeds derived from observed damage, and has been used since 2007. The standardized ratings provide crucial data for emergency response, insurance claims, and tracking storm event impacts through NOAA’s Storm Events. This dataset is sourced from NOAA, containing information from 1950 to 2025, and I am interested in looking into the impacts across Maryland. I hope to better understand how the state has been affected and to see the ways the factors impact a tornado’s rating or “magnitude”

Citations:

Did You Know? | National Centers for Environmental Information (NCEI), www.ncei.noaa.gov/access/monitoring/dyk/tornadocount. Accessed 3 July 2026.

Monthly and Annual Numbers of Tornadoes - Graphs and Maps | NOAA Climate.Gov, www.climate.gov/maps-data/dataset/monthly-and-annual-numbers-tornadoes-graphs-and-maps. Accessed 3 July 2026.

Tornado Intensity - Wind Speed, Damage, Formation, www.britannica.com/science/tornado/Tornado-intensity. Accessed 3 July 2026.

Source of data: https://www.spc.noaa.gov/wcm/

Guide to data/“Read Me”: https://www.spc.noaa.gov/wcm/data/SPC_severe_database_description.pdf

Exploring Data, Correlation, Regression Models

Data Cleaning

I install the packages that will help me clean and organize the data (“tidyverse”, “dplyr”) as well as the packages to help me visualize the data (“ggplot2”, “plotly”).

install.packages("tidyverse")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)

install.packages("dplyr")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)

install.packages("ggplot2")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)

install.packages("plotly")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)

I load the libraries, including the library that contains the read_csv function to help import the dataset.

#Import libraries
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(readr)

#Import the dataset
tornadoes1950_2025 <- read_csv("1950-2025_all_tornadoes.csv")

Rows: 74956 Columns: 31

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): om, mo, dy, st, stf
dbl  (22): yr, tz, stn, mag, inj, fat, loss, closs, slat, slon, elat, elon, ...
date  (2): date, edat
time  (2): time, etime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I get an overview of the dataset. I use colnames to see the variables, str to see the the classes of all the variables, and head to get a snapshot of a few observations.

colnames(tornadoes1950_2025)

 [1] "om"    "yr"    "mo"    "dy"    "date"  "time"  "tz"    "st"    "stf"  
[10] "stn"   "mag"   "inj"   "fat"   "loss"  "closs" "slat"  "slon"  "elat" 
[19] "elon"  "len"   "wid"   "ns"    "sn"    "sg"    "f1"    "f2"    "f3"   
[28] "f4"    "fc"    "edat"  "etime"

str(tornadoes1950_2025)

spc_tbl_ [74,956 × 31] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ om   : chr [1:74956] "192" "193" "195" "196" ...
 $ yr   : num [1:74956] 1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
 $ mo   : chr [1:74956] "10" "10" "11" "11" ...
 $ dy   : chr [1:74956] "01" "09" "20" "20" ...
 $ date : Date[1:74956], format: "1950-10-01" "1950-10-09" ...
 $ time : 'hms' num [1:74956] 21:00:00 02:15:00 02:20:00 04:00:00 ...
  ..- attr(*, "units")= chr "secs"
 $ tz   : num [1:74956] 3 3 3 3 3 3 3 3 3 3 ...
 $ st   : chr [1:74956] "OK" "NC" "KY" "KY" ...
 $ stf  : chr [1:74956] "40" "37" "21" "21" ...
 $ stn  : num [1:74956] 23 9 1 2 14 5 7 8 12 9 ...
 $ mag  : num [1:74956] 1 3 2 1 1 3 2 3 3 1 ...
 $ inj  : num [1:74956] 0 3 0 0 3 1 3 25 0 0 ...
 $ fat  : num [1:74956] 0 0 0 0 0 0 1 2 0 0 ...
 $ loss : num [1:74956] 4 5 5 5 4 5 4 6 1 4 ...
 $ closs: num [1:74956] 0 0 0 0 0 0 0 0 0 0 ...
 $ slat : num [1:74956] 36.7 34.2 37.4 38.2 32.4 ...
 $ slon : num [1:74956] -102.5 -78.6 -87.2 -84.5 -89.1 ...
 $ elat : num [1:74956] 36.9 0 0 0 0 ...
 $ elon : num [1:74956] -102 0 0 0 0 ...
 $ len  : num [1:74956] 15.8 2 0.1 0.1 2 15.9 18.8 18 7.8 9.6 ...
 $ wid  : num [1:74956] 10 880 10 10 37 100 50 200 10 50 ...
 $ ns   : num [1:74956] 1 1 1 1 1 1 1 1 1 1 ...
 $ sn   : num [1:74956] 1 1 1 1 1 1 1 1 1 1 ...
 $ sg   : num [1:74956] 1 1 1 1 1 1 1 1 1 1 ...
 $ f1   : num [1:74956] 25 47 177 209 101 71 119 119 65 157 ...
 $ f2   : num [1:74956] 0 0 0 0 0 11 117 5 0 0 ...
 $ f3   : num [1:74956] 0 0 0 0 0 0 0 0 0 0 ...
 $ f4   : num [1:74956] 0 0 0 0 0 0 0 0 0 0 ...
 $ fc   : num [1:74956] 0 0 0 0 0 0 0 0 0 0 ...
 $ edat : Date[1:74956], format: NA NA ...
 $ etime: 'hms' num [1:74956] NA NA NA NA ...
  ..- attr(*, "units")= chr "secs"
 - attr(*, "spec")=
  .. cols(
  ..   om = col_character(),
  ..   yr = col_double(),
  ..   mo = col_character(),
  ..   dy = col_character(),
  ..   date = col_date(format = ""),
  ..   time = col_time(format = ""),
  ..   tz = col_double(),
  ..   st = col_character(),
  ..   stf = col_character(),
  ..   stn = col_double(),
  ..   mag = col_double(),
  ..   inj = col_double(),
  ..   fat = col_double(),
  ..   loss = col_double(),
  ..   closs = col_double(),
  ..   slat = col_double(),
  ..   slon = col_double(),
  ..   elat = col_double(),
  ..   elon = col_double(),
  ..   len = col_double(),
  ..   wid = col_double(),
  ..   ns = col_double(),
  ..   sn = col_double(),
  ..   sg = col_double(),
  ..   f1 = col_double(),
  ..   f2 = col_double(),
  ..   f3 = col_double(),
  ..   f4 = col_double(),
  ..   fc = col_double(),
  ..   edat = col_date(format = ""),
  ..   etime = col_time(format = "")
  .. )
 - attr(*, "problems")=<pointer: 0x620855511630>

head(tornadoes1950_2025)

# A tibble: 6 × 31
  om       yr mo    dy    date       time      tz st    stf     stn   mag   inj
  <chr> <dbl> <chr> <chr> <date>     <time> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 192    1950 10    01    1950-10-01 21:00      3 OK    40       23     1     0
2 193    1950 10    09    1950-10-09 02:15      3 NC    37        9     3     3
3 195    1950 11    20    1950-11-20 02:20      3 KY    21        1     2     0
4 196    1950 11    20    1950-11-20 04:00      3 KY    21        2     1     0
5 197    1950 11    20    1950-11-20 07:30      3 MS    28       14     1     3
6 194    1950 11    04    1950-11-04 17:00      3 PA    42        5     3     1
# ℹ 19 more variables: fat <dbl>, loss <dbl>, closs <dbl>, slat <dbl>,
#   slon <dbl>, elat <dbl>, elon <dbl>, len <dbl>, wid <dbl>, ns <dbl>,
#   sn <dbl>, sg <dbl>, f1 <dbl>, f2 <dbl>, f3 <dbl>, f4 <dbl>, fc <dbl>,
#   edat <date>, etime <time>

Next, I create cleaned datasets to work with. We use filter to create a dataset containing the tornado information for Maryland. We use select, and negate the selection using !, to make a dataset. We use select again and create a new dataset with the variables we removed from marylandtornadoes.

#Filter to keep only Maryland observations. Keep in mind some tornadoes do cross multiple states, some of these observations will have information for the maryland segment of a multi-state tornado.
maryland <- tornadoes1950_2025 %>% filter(tornadoes1950_2025$st == "MD")

#Select, and the ! to keep all variables other than those listed. This is the main tornado data.
marylandtornadoes <- maryland %>% select(!c(stn,stf,ns,sn,sg,f1,f2,f3,f4))

#Select, and the variables that help determine whether the observation is of a single, double, or multi-state tornado path.
marylandpaths <- maryland %>% select(c(om,stn,stf,ns,sn,sg,f1,f2,f3,f4))

For the paths dataset, NOAA uses number codes to help identify which tornado observations in the datasheet are of a tornado segment, or the whole record of a tornado. This is important because some tornadoes may have crossed multiple states and event done different amounts of property damage or killed different numbers of people across different states. We are going to use this information when we create our Tableau visualization.

#Clean the paths dataset, use rowwise to make sure the Number of States(ns), State Number (sn), and Segment Number (sg) are all kept in line.  Use mutate to help rename observations to make their meaning more clear.
marylandpaths <- marylandpaths %>% rowwise() |>
  mutate(Tornado_Segment = list(c_across(ns:sg)))
marylandpaths$Tornado_Segment <- as.character(marylandpaths$Tornado_Segment)
marylandpaths <- marylandpaths %>% mutate(Tornado_Info = Tornado_Segment)
marylandpaths <- marylandpaths %>% mutate(Tornado_Info = recode(Tornado_Info,
                          "c(3, 1, 2)" = "First Segment for Three-State Tornado",
                          "c(2, 1, 2)" = "First Segment for Two-State Tornado",
                          "c(2, 0, 1)" = "Two-State Tornado",
                          "c(1, 1, 1)" = "Entire Record [Unless All 4 FIPS zero]"))

For the dataset containing the other variables for the tornadoes, rename the columns so it’s more clear what variable is being recorded.

#Use dplyr function rename to adjust column names
marylandtornadoes <- marylandtornadoes %>% rename("Number_of_Tornadoes" = om,
                            "Year" = yr,
       "Month" = mo,
       "Day" = dy,
       "Date"= date,
       "Time" = time,
       "Time Zone" = tz,
       "State" = st,
       "Magnitude" = mag,
       "Injuries" = inj,
       "Fatalities"= fat,
       "Property_Loss" = loss,
       "Crop_Loss" = closs,
       "Start_Latitude" = slat,
       "Start_Longitude" = slon,
       "End_Latitude" = elat,
       "End_Longitude" = elon,
       "Path_Length" = len,
       "Path_Width" = wid,
       "Rated_or_Unknown" = fc,
       "End_Date" = edat,
       "End_Time" = etime)

I’m going to check and see if I have any NA in the observations for my variables. I see that End Date and End Time has multiple NAs, so it’s probably not ideal data to analyze. Also, I see the magnitude is listed as a number, but it’s not a quantitative variable in reality. Instead, it’s categorical, following the Enhanced Fujita scale. (-9 for EFU which is unknown speed and no surveyable damage, 0 for EF0 winds up to 85mph, 1 for EF1 winds up to 110mph, etc). I’m going to adjust the dataset accordingly.

#Check for NA in observations
colSums(is.na(marylandpaths))

             om             stn             stf              ns              sn 
              0               0               0               0               0 
             sg              f1              f2              f3              f4 
              0               0               0               0               0 
Tornado_Segment    Tornado_Info 
              0               0

colSums(is.na(marylandtornadoes))

Number_of_Tornadoes                Year               Month                 Day 
                  0                   0                   0                   0 
               Date                Time           Time Zone               State 
                  0                   0                   0                   0 
          Magnitude            Injuries          Fatalities       Property_Loss 
                  0                   0                   0                   0 
          Crop_Loss      Start_Latitude     Start_Longitude        End_Latitude 
                  0                   0                   0                   0 
      End_Longitude         Path_Length          Path_Width    Rated_or_Unknown 
                  0                   0                   0                   0 
           End_Date            End_Time 
                277                 277

#For fot the classses of all variables
str(marylandpaths)

rowws_df [430 × 12] (S3: rowwise_df/tbl_df/tbl/data.frame)
 $ om             : chr [1:430] "183" "93" "96" "97" ...
 $ stn            : num [1:430] 1 1 2 3 4 1 2 1 2 1 ...
 $ stf            : chr [1:430] "24" "24" "24" "24" ...
 $ ns             : num [1:430] 1 1 1 1 1 1 1 2 1 1 ...
 $ sn             : num [1:430] 1 1 1 1 1 1 1 1 1 1 ...
 $ sg             : num [1:430] 1 1 1 1 1 1 1 2 1 1 ...
 $ f1             : num [1:430] 29 21 21 11 31 33 3 21 23 47 ...
 $ f2             : num [1:430] 0 0 0 0 0 0 0 0 0 0 ...
 $ f3             : num [1:430] 0 0 0 0 0 0 0 0 0 0 ...
 $ f4             : num [1:430] 0 0 0 0 0 0 0 0 0 0 ...
 $ Tornado_Segment: chr [1:430] "c(1, 1, 1)" "c(1, 1, 1)" "c(1, 1, 1)" "c(1, 1, 1)" ...
 $ Tornado_Info   : chr [1:430] "Entire Record [Unless All 4 FIPS zero]" "Entire Record [Unless All 4 FIPS zero]" "Entire Record [Unless All 4 FIPS zero]" "Entire Record [Unless All 4 FIPS zero]" ...
 - attr(*, "groups")= tibble [430 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list<int> [1:430] 
  .. ..$ : int 1
  .. ..$ : int 2
  .. ..$ : int 3
  .. ..$ : int 4
  .. ..$ : int 5
  .. ..$ : int 6
  .. ..$ : int 7
  .. ..$ : int 8
  .. ..$ : int 9
  .. ..$ : int 10
  .. ..$ : int 11
  .. ..$ : int 12
  .. ..$ : int 13
  .. ..$ : int 14
  .. ..$ : int 15
  .. ..$ : int 16
  .. ..$ : int 17
  .. ..$ : int 18
  .. ..$ : int 19
  .. ..$ : int 20
  .. ..$ : int 21
  .. ..$ : int 22
  .. ..$ : int 23
  .. ..$ : int 24
  .. ..$ : int 25
  .. ..$ : int 26
  .. ..$ : int 27
  .. ..$ : int 28
  .. ..$ : int 29
  .. ..$ : int 30
  .. ..$ : int 31
  .. ..$ : int 32
  .. ..$ : int 33
  .. ..$ : int 34
  .. ..$ : int 35
  .. ..$ : int 36
  .. ..$ : int 37
  .. ..$ : int 38
  .. ..$ : int 39
  .. ..$ : int 40
  .. ..$ : int 41
  .. ..$ : int 42
  .. ..$ : int 43
  .. ..$ : int 44
  .. ..$ : int 45
  .. ..$ : int 46
  .. ..$ : int 47
  .. ..$ : int 48
  .. ..$ : int 49
  .. ..$ : int 50
  .. ..$ : int 51
  .. ..$ : int 52
  .. ..$ : int 53
  .. ..$ : int 54
  .. ..$ : int 55
  .. ..$ : int 56
  .. ..$ : int 57
  .. ..$ : int 58
  .. ..$ : int 59
  .. ..$ : int 60
  .. ..$ : int 61
  .. ..$ : int 62
  .. ..$ : int 63
  .. ..$ : int 64
  .. ..$ : int 65
  .. ..$ : int 66
  .. ..$ : int 67
  .. ..$ : int 68
  .. ..$ : int 69
  .. ..$ : int 70
  .. ..$ : int 71
  .. ..$ : int 72
  .. ..$ : int 73
  .. ..$ : int 74
  .. ..$ : int 75
  .. ..$ : int 76
  .. ..$ : int 77
  .. ..$ : int 78
  .. ..$ : int 79
  .. ..$ : int 80
  .. ..$ : int 81
  .. ..$ : int 82
  .. ..$ : int 83
  .. ..$ : int 84
  .. ..$ : int 85
  .. ..$ : int 86
  .. ..$ : int 87
  .. ..$ : int 88
  .. ..$ : int 89
  .. ..$ : int 90
  .. ..$ : int 91
  .. ..$ : int 92
  .. ..$ : int 93
  .. ..$ : int 94
  .. ..$ : int 95
  .. ..$ : int 96
  .. ..$ : int 97
  .. ..$ : int 98
  .. ..$ : int 99
  .. .. [list output truncated]
  .. ..@ ptype: int(0)

str(marylandtornadoes)

tibble [430 × 22] (S3: tbl_df/tbl/data.frame)
 $ Number_of_Tornadoes: chr [1:430] "183" "93" "96" "97" ...
 $ Year               : num [1:430] 1950 1952 1952 1952 1952 ...
 $ Month              : chr [1:430] "08" "04" "04" "04" ...
 $ Day                : chr [1:430] "11" "05" "05" "05" ...
 $ Date               : Date[1:430], format: "1950-08-11" "1952-04-05" ...
 $ Time               : 'hms' num [1:430] 17:00:00 12:00:00 14:00:00 15:00:00 ...
  ..- attr(*, "units")= chr "secs"
 $ Time Zone          : num [1:430] 3 3 3 3 3 3 3 3 3 3 ...
 $ State              : chr [1:430] "MD" "MD" "MD" "MD" ...
 $ Magnitude          : num [1:430] 1 1 1 0 1 1 1 0 1 1 ...
 $ Injuries           : num [1:430] 0 0 1 0 0 0 0 0 8 0 ...
 $ Fatalities         : num [1:430] 0 0 0 0 0 0 0 0 0 0 ...
 $ Property_Loss      : num [1:430] 3 4 4 2 4 3 2 0 4 4 ...
 $ Crop_Loss          : num [1:430] 0 0 0 0 0 0 0 0 0 0 ...
 $ Start_Latitude     : num [1:430] 39.3 39.5 39.6 39.1 39 ...
 $ Start_Longitude    : num [1:430] -75.9 -77.5 -77.4 -75.8 -77.2 ...
 $ End_Latitude       : num [1:430] 0 0 0 0 39 ...
 $ End_Longitude      : num [1:430] 0 0 0 0 -77.2 ...
 $ Path_Length        : num [1:430] 6 1 1 0.3 0.1 1 1 0.1 12 1.3 ...
 $ Path_Width         : num [1:430] 10 50 50 33 70 200 200 150 880 50 ...
 $ Rated_or_Unknown   : num [1:430] 0 0 0 0 0 0 0 0 0 0 ...
 $ End_Date           : Date[1:430], format: NA NA ...
 $ End_Time           : 'hms' num [1:430] NA NA NA NA ...
  ..- attr(*, "units")= chr "secs"

#Convert the column data to chr using as.character.
marylandtornadoes$Magnitude <- as.character(marylandtornadoes$Magnitude)

Correlation

Following some of the methods demonstrated in the class tutorial that Dr. Saidi has created, I’m going to do a correlation matrix, as well as multiple linear regression, to analyze the data further.

https://rpubs.com/rsaidi/1007730

install.packages("DataExplorer")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)

library(DataExplorer)
library(ggplot2)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Creating a correlation matrix to see which variables are have high, low, or no correlation. The red colors show that Injuries and Fatalities are variables that are highly (more than 0.5) correlated. The length of the tornado’s path is also highly (about or more than 0.5) correlated with fatalities and injuries.

plot_correlation(marylandtornadoes)

6 features with more than 20 categories ignored!
Number_of_Tornadoes: 397 categories
Day: 31 categories
Date: 238 categories
Time: 275 categories
End_Date: 75 categories
End_Time: 141 categories

Warning in cor(x = structure(list(Year = c(1950, 1952, 1952, 1952, 1952, : the
standard deviation is zero

cor(marylandtornadoes$Path_Length, marylandtornadoes$Fatalities)

[1] 0.4384552

Linear Regression and Multiple Regression

I am using a linear regression (formula y~x) to look for a relationship between two variables of note from above, Path_Length and Property_Loss. Based on the graph, this doesn’t look like there is a strong relationship. I’m going to move onto other ways of looking for insights.

linregmaryland <- ggplot(marylandtornadoes, aes(x = Path_Length,
                                                y = Property_Loss))+
  geom_point()+
  geom_smooth(method='lm',formula=y~x, se = FALSE, linetype= "dotdash")
  
  linregmaryland

I’m going to creature mutiple linear models, and pick some of the variables I’m interested in. Fatalities, Injuries, and Magnitude are all dependent variables (in fact, scientists determine a Tornado’s magnitude based on some of those other factors, and Fatalities/Injuries are usually the result of a tornado that has a severe presentation in Path_length and path_width) so I’m going to run a fit with theses variables. The correlation matrix helped me highlight some variables of note.
My result for fit1 has highlights at least two variables that are significant, Injuries and Path_Length, in relation with Fatalities. The R-squared value is also higher than 0.5 which means this is somewhat trustworthy data.

fit1 <- lm(Fatalities ~ Injuries + Property_Loss + Crop_Loss + Magnitude + Path_Length + Path_Width, data = marylandtornadoes) 
summary(fit1)


Call:
lm(formula = Fatalities ~ Injuries + Property_Loss + Crop_Loss + 
    Magnitude + Path_Length + Path_Width, data = marylandtornadoes)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.10343 -0.01391 -0.00017  0.00614  1.97125 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.978e-03  1.388e-01   0.029  0.97714    
Injuries       1.975e-02  1.271e-03  15.534  < 2e-16 ***
Property_Loss -3.879e-09  8.711e-09  -0.445  0.65633    
Crop_Loss     -8.241e-06  8.351e-06  -0.987  0.32426    
Magnitude0    -7.519e-03  1.391e-01  -0.054  0.95693    
Magnitude1     2.697e-04  1.392e-01   0.002  0.99846    
Magnitude2    -5.646e-02  1.408e-01  -0.401  0.68865    
Magnitude3     2.166e-01  1.490e-01   1.454  0.14676    
Magnitude4     1.244e-01  1.893e-01   0.657  0.51149    
Path_Length    7.187e-03  2.247e-03   3.199  0.00149 ** 
Path_Width    -9.825e-05  7.329e-05  -1.341  0.18076    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1387 on 419 degrees of freedom
Multiple R-squared:  0.5921,    Adjusted R-squared:  0.5824 
F-statistic: 60.82 on 10 and 419 DF,  p-value: < 2.2e-16

Fit2 shows a relationship where injuries and Path_Length continue to be significantly related to Fatalities, even though I removed some other variables in this model.

fit2 <- lm(Fatalities ~ Injuries + Crop_Loss + Path_Length + Path_Width, data = marylandtornadoes) 
summary(fit2)


Call:
lm(formula = Fatalities ~ Injuries + Crop_Loss + Path_Length + 
    Path_Width, data = marylandtornadoes)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.21114 -0.01564  0.00123  0.00566  1.97558 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.072e-03  9.602e-03  -0.320 0.749162    
Injuries     2.074e-02  1.097e-03  18.906  < 2e-16 ***
Crop_Loss   -9.461e-06  8.370e-06  -1.130 0.258940    
Path_Length  8.014e-03  2.253e-03   3.556 0.000418 ***
Path_Width  -1.003e-04  7.251e-05  -1.383 0.167303    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1416 on 425 degrees of freedom
Multiple R-squared:  0.5691,    Adjusted R-squared:  0.565 
F-statistic: 140.3 on 4 and 425 DF,  p-value: < 2.2e-16

I noticed in my correlation matrix that of the Magnitude variable, while EF0, EF1, and EF2 were not highly correlated with other variables, the EF3 and EF4 were. I’ve created another model where now I’m focused more on the EF3 and EF4 tornadoes. This model shoes that EF3 tornadoes are signficantly related to the Fatalities variable. I would expect then, to see observations where EF3 tornadoes in Maryland might have fatalities whereas other types of tornadoes in Maryland do not.

fit3 <- lm(Fatalities ~ Injuries + Crop_Loss + (Magnitude == "3") + (Magnitude == "4") + Path_Length + Path_Width, data = marylandtornadoes) 
summary(fit3)


Call:
lm(formula = Fatalities ~ Injuries + Crop_Loss + (Magnitude == 
    "3") + (Magnitude == "4") + Path_Length + Path_Width, data = marylandtornadoes)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.13909 -0.01173  0.00337  0.00727  1.98085 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -3.721e-03  9.527e-03  -0.391  0.69632    
Injuries              1.954e-02  1.269e-03  15.401  < 2e-16 ***
Crop_Loss            -7.973e-06  8.223e-06  -0.970  0.33283    
Magnitude == "3"TRUE  2.279e-01  5.418e-02   4.207 3.16e-05 ***
Magnitude == "4"TRUE  1.521e-01  1.270e-01   1.198  0.23156    
Path_Length           7.099e-03  2.225e-03   3.191  0.00153 ** 
Path_Width           -1.065e-04  7.197e-05  -1.480  0.13959    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.139 on 423 degrees of freedom
Multiple R-squared:  0.5868,    Adjusted R-squared:  0.5809 
F-statistic: 100.1 on 6 and 423 DF,  p-value: < 2.2e-16

Anova Test

ANOVA tests help determine that my models where I changed the variables for analysis (for example, removing Magnitude, only selecting certain magnitudes, yielded significant results. According to my p-value of 0.0007675, The p-value is small, showing that the Property_Loss variable contributes significantly to our model.
Based on this, I’m going to focus on highlighting Property_Loss in my visualizations.

anova(fit2, fit1)

Analysis of Variance Table

Model 1: Fatalities ~ Injuries + Crop_Loss + Path_Length + Path_Width
Model 2: Fatalities ~ Injuries + Property_Loss + Crop_Loss + Magnitude + 
    Path_Length + Path_Width
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    425 8.5178                                  
2    419 8.0632  6   0.45456 3.9368 0.0007675 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fit1, fit3)

Analysis of Variance Table

Model 1: Fatalities ~ Injuries + Property_Loss + Crop_Loss + Magnitude + 
    Path_Length + Path_Width
Model 2: Fatalities ~ Injuries + Crop_Loss + (Magnitude == "3") + (Magnitude == 
    "4") + Path_Length + Path_Width
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    419 8.0632                           
2    423 8.1679 -4  -0.10473 1.3606 0.2468

R -Visualization

This bubble plot graph, with interactivity thanks to plotly, shows that the path and width of a tornado do not have to be large in order to do damage. There are many large “bubbles” in the lower left quadrant of the graph, and in fact many of those tornadoes were of a low magnitude. Because tornadoes are rated based on multiple factors, I suspect that some of those low magnitude, short-path tornadoes might have caused a lot of damage due to an unrepresented variable. I wonder if some of those tornadoes were in more developed areas?

scatterplot <- ggplot(marylandtornadoes, aes(x= Path_Length,
                                             y = Path_Width,
                                             size = Property_Loss,
                                             color = Magnitude))+
  geom_point(alpha=0.5)+
  labs(title = "Effect of Tornado Path and Width on Property Loss",
       subtitle = "Faster tornadoes are not responsible for the most damage",
       caption = "Source: NOAA")+
  xlab("Length of Tornado Path (mi)")+
  ylab("Width of Tornado Path (yd)")+
  scale_color_manual(values = c("black","violet","blue","cyan","yellow","red"),
                     labels = c("-9" = "Unknown", "0" = "EF0", "1" = "EF1", "2"="EF2","3"="EF3", "4"="EF4"))

ggplotly(scatterplot)

Tableau Visualization

The follow Tableau Public link brings us to a map of Maryland, with occurences of tornadoes that ONLY occurred in Maryland (none of these tornadoes crossed the border to do damage in other states).This is to ensure we only look at Property losses to places in Maryland. We used our marylandpaths dataset and the marylandtornadoes dataset to build this map. Upon examination, we can see some of the more devastating tornadoes took place in populated, densely built areas such as Baltimore, Columbia and Elicott City–and in fact, those tornadoes all took place in June 2024!

https://public.tableau.com/shared/8P7758GQR?:display_count=n&:origin=viz_share_link