Earthquakes & Induced quakes- Know the difference

INTRODUCTION:

This data set is taken from the USGS (U.S. Geological Survey). The USGS provides reliable scientific information to describe and understand the Earth; minimize loss of life and property from natural disasters; manage water, biological, energy, and mineral resources; and enhance and protect our quality of life.

As part of its program, the USGS monitors and reports on earthquakes, assesses earthquake impacts and hazards, and conducts targeted research on the causes and effects of earthquakes. The USGS provides real-time notifications, feeds and web services about earthquakes just after they happen. Further details can be found at the link below:

https://earthquake.usgs.gov/earthquakes/feed/

The data set contains details of all earthquakes that have happened in the last 30 days and is updated every 15 minutes on the USGS website. I have uploaded this data set to Kaggle with the update frequency set to weekly in Kaggle’s Dataset settings. I followed the instructions in this kernel, https://www.kaggle.com/paultimothymooney/how-to-create-an-auto-updating-dataset-on-kaggle, to create an auto-updating data set on Kaggle.

THE ANALYSIS:

While mining this data set through the normal EDA process, I came across the fact that not all earthquakes are natural: a few are indeed caused by humans, although they are very small in number. I also found that in a period of one month (February to March 2019) more than 8,000 seismic events were recorded all over the world. Of these, only about 2% are not the usual earthquakes and are caused by quarry blasts, chemical explosions, ice quakes etc.

At the end of the analysis I have tried to predict earthquakes and other quakes (seismic activities related to explosions, quarry blasts etc.). I have also tried to handle the class imbalance problem, because the classes in the data set are split roughly 98:2.

The next steps are pretty usual ones with loading and probing the data. Let’s get started with loading the libraries first.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(gghighlight)

library(leaflet)

library(IRdisplay)

options(scipen = 999)
options(warn = -1)
library(lubridate)
library(viridis)

## Loading required package: viridisLite

library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

library(caTools)
library(kernlab)

## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(ROSE)

## Loaded ROSE 0.0-4

## Reading in files

list.files(path = "../input")

## character(0)

earthquake <- read_csv("all_month.csv/all_month.csv")

## Rows: 8161 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): magType, net, id, place, type, status, locationSource, magSource
## dbl  (12): latitude, longitude, depth, mag, nst, gap, dmin, rms, horizontalE...
## dttm  (2): time, updated
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(earthquake)

## Rows: 8,161
## Columns: 22
## $ time            <dttm> 2019-03-11 04:30:18, 2019-03-11 04:24:26, 2019-03-11 …
## $ latitude        <dbl> 33.52100, 61.06940, 38.77333, 65.01810, 37.29130, 35.1…
## $ longitude       <dbl> -116.7945, -151.8329, -122.7435, -148.7292, -117.5088,…
## $ depth           <dbl> 2.82, 105.60, 1.16, 8.60, 6.90, 5.00, 6.20, 48.70, 2.5…
## $ mag             <dbl> 0.44, 1.10, 0.73, 1.60, 0.70, 1.80, 1.70, 1.90, 0.56, …
## $ magType         <chr> "ml", "ml", "md", "ml", "ml", "mb_lg", "ml", "ml", "md…
## $ nst             <dbl> 18, NA, 5, NA, 8, NA, 30, NA, 10, NA, NA, NA, NA, NA, …
## $ gap             <dbl> 53.00, NA, 213.00, NA, 176.36, 64.00, 93.14, NA, 77.00…
## $ dmin            <dbl> 0.009898, NA, 0.011220, NA, 0.375000, 0.219000, 0.1370…
## $ rms             <dbl> 0.1400, 0.2500, 0.0100, 0.5100, 0.2800, 0.5100, 0.3300…
## $ net             <chr> "ci", "ak", "nc", "ak", "nn", "us", "nn", "ak", "nc", …
## $ id              <chr> "ci38263479", "ak01937u587x", "nc73150441", "ak01937u2…
## $ updated         <dttm> 2019-03-11 04:33:55, 2019-03-11 04:27:53, 2019-03-11 …
## $ place           <chr> "11km NE of Aguanga, CA", "51km NW of Nikiski, Alaska"…
## $ type            <chr> "earthquake", "earthquake", "earthquake", "earthquake"…
## $ horizontalError <dbl> 0.25, NA, 1.55, NA, NA, 1.10, NA, NA, 0.42, 5.80, NA, …
## $ depthError      <dbl> 1.02, 1.50, 0.70, 0.20, 24.50, 1.70, 0.70, 1.50, 1.31,…
## $ magError        <dbl> 0.129, NA, 0.230, NA, NA, 0.255, NA, NA, NA, 0.080, NA…
## $ magNst          <dbl> 11, NA, 2, NA, NA, 4, NA, NA, 1, 15, NA, NA, 11, 60, 1…
## $ status          <chr> "automatic", "automatic", "automatic", "automatic", "a…
## $ locationSource  <chr> "ci", "ak", "nc", "ak", "nn", "us", "nn", "ak", "nc", …
## $ magSource       <chr> "ci", "ak", "nc", "ak", "nn", "us", "nn", "ak", "nc", …

head(earthquake)

## # A tibble: 6 × 22
##   time                latitude longitude  depth   mag magType   nst   gap
##   <dttm>                 <dbl>     <dbl>  <dbl> <dbl> <chr>   <dbl> <dbl>
## 1 2019-03-11 04:30:18     33.5    -117.    2.82  0.44 ml         18   53 
## 2 2019-03-11 04:24:26     61.1    -152.  106.    1.1  ml         NA   NA 
## 3 2019-03-11 04:22:42     38.8    -123.    1.16  0.73 md          5  213 
## 4 2019-03-11 04:13:42     65.0    -149.    8.6   1.6  ml         NA   NA 
## 5 2019-03-11 04:10:13     37.3    -118.    6.9   0.7  ml          8  176.
## 6 2019-03-11 04:04:12     35.1     -97.6   5     1.8  mb_lg      NA   64 
## # ℹ 14 more variables: dmin <dbl>, rms <dbl>, net <chr>, id <chr>,
## #   updated <dttm>, place <chr>, type <chr>, horizontalError <dbl>,
## #   depthError <dbl>, magError <dbl>, magNst <dbl>, status <chr>,
## #   locationSource <chr>, magSource <chr>

A brief description of the earthquake event terms/variables:

depth- Depth of the event in kilometers. The depth where the earthquake begins to rupture. This depth may be relative to the WGS84 geoid, mean sea-level, or the average elevation of the seismic stations which provided arrival-time data for the earthquake location.

depthError- Uncertainty of reported depth of the event in kilometers.

dmin- Horizontal distance from the epicenter to the nearest station (in degrees). 1 degree is approximately 111.2 kilometers. In general, the smaller this number, the more reliable is the calculated depth of the earthquake.

gap- The largest azimuthal gap between azimuthally adjacent stations (in degrees). In general, the smaller this number, the more reliable is the calculated horizontal position of the earthquake. Earthquake locations in which the azimuthal gap exceeds 180 degrees typically have large location and depth uncertainties.

horizontalError- Uncertainty of reported location of the event in kilometers.

id- A unique identifier for the event. This is the current preferred id for the event, and may change over time. See the “ids” GeoJSON format property.

latitude- Decimal degrees latitude. Negative values for southern latitudes.

locationSource- The network that originally authored the reported location of this event.

longitude- Decimal degrees longitude. Negative values for western longitudes.

mag- The magnitude for the event. Earthquake magnitude is a measure of the size of an earthquake at its source.

magError- Uncertainty of reported magnitude of the event.

magNst- The total number of seismic stations used to calculate the magnitude for this earthquake.

magSource- Network that originally authored the reported magnitude for this event.

magType- The method or algorithm used to calculate the preferred magnitude for the event.

net- The ID of a data contributor. Identifies the network considered to be the preferred source of information for this event.

nst- The total number of seismic stations used to determine earthquake location.

place- Textual description of named geographic region near to the event. This may be a city name, or a Flinn-Engdahl Region name. If there is no nearby city within 300 kilometers (or if the nearby cities database is unavailable for some reason), the Flinn-Engdahl (F-E) seismic and geographical regionalization scheme is used.

rms- The root-mean-square (RMS) travel time residual, in sec, using all weights. This parameter provides a measure of the fit of the observed arrival times to the predicted arrival times for this location. Smaller numbers reflect a better fit of the data.

status- Indicates whether the event has been reviewed by a human. Status is either automatic or reviewed. Automatic events are directly posted by automatic processing systems and have not been verified or altered by a human. Reviewed events have been looked at by a human. The level of review can range from a quick validity check to a careful reanalysis of the event.

time- Time when the event occurred. Times are reported in milliseconds since the epoch (1970-01-01T00:00:00.000Z), and do not include leap seconds. In certain output formats, the date is formatted for readability.

type- Type of seismic event.

updated- Time when the event was most recently updated. Times are reported in milliseconds since the epoch.
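
The dmin and gap definitions above translate into two simple derived columns. The following is a minimal sketch in base R using toy values copied from rows 1, 3 and 5 of the glimpse() output; the names stations, KM_PER_DEGREE, dmin_km and poorly_constrained are hypothetical and not part of the USGS data.

```r
# Toy values taken from the glimpse() output (rows 1, 3 and 5)
stations <- data.frame(
  dmin = c(0.009898, 0.011220, 0.375000),
  gap  = c(53, 213, 176.36)
)

# Approximation quoted in the dmin definition: 1 degree is about 111.2 km
KM_PER_DEGREE <- 111.2

# Distance from the epicenter to the nearest station, in kilometers
stations$dmin_km <- stations$dmin * KM_PER_DEGREE

# Flag events whose azimuthal gap exceeds 180 degrees
stations$poorly_constrained <- stations$gap > 180

print(stations)
```

Events flagged this way are the ones the gap definition describes as typically having large location and depth uncertainties, so they may deserve extra care in any location-based analysis.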

EXPLORATORY ANALYSIS:

Around 7 earthquakes have a magnitude of more than 6 on the Richter scale.

options(repr.plot.width=12, repr.plot.height=6)

ggplot(earthquake, aes(time, mag)) + 
geom_line(color = 'steelblue')+ theme_bw()+ 
ggtitle("Earthquake Magnitude")+
xlab('Date')+
ylab('Magnitude')+
theme(plot.title = element_text(hjust = 0.5))

Feature Extraction:

For now, I will extract two features from the data:

Location of the Quake(state/country etc.)

Time of the day in ‘Hour’

The line plot below shows the number of earthquakes that happened in each hour of the day. Note that the timestamps are in UTC, so the hours are not local times at the epicenter. Midnight seems to be the most popular hour of the day, when most of the earthquakes have happened, while 5 P.M. sees the lowest total number of earthquakes. Looking at the maximum magnitude in each hour, the 8-9 A.M. window contains one of the strongest earthquakes, at around magnitude 7.

#Location
earthquake$location <- sub('.*,\\s*','', earthquake$place)

#Time of the day in 'Hour'
earthquake$hour <- ymd_hms(earthquake$time)
earthquake$hour <- hour(earthquake$hour)

#Visualizing the number of quakes that have happened at a particular time
earthquake %>% 
filter(!is.na(mag))%>%
group_by(hour)%>%
summarise(count = length(id),max_magnitude = max(mag))%>%
ggplot(aes(hour,count, color = max_magnitude))+geom_line()+
scale_color_viridis(direction = -1)+
scale_x_continuous(breaks=seq(0,23,1))+
xlab("Time of the day")+
ylab("Number of earthquakes")+
ggtitle("Number of quakes as per time of the day")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))
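
To make the two extraction steps concrete, here is a small base R sketch on toy inputs. The first two place strings come from the data shown earlier; the third is an illustrative region-style name with no comma, included as an assumption to show what the regular expression does in that case.

```r
places <- c("11km NE of Aguanga, CA",
            "51km NW of Nikiski, Alaska",
            "Southern Mid-Atlantic Ridge")  # illustrative place without a comma

# The greedy '.*,' consumes everything up to the LAST comma and
# '\\s*' removes the whitespace that follows it
location <- sub(".*,\\s*", "", places)
print(location)

# Hour of the day (UTC) from a timestamp, using base R only
stamp <- as.POSIXct("2019-03-11 04:30:18", tz = "UTC")
hour_utc <- as.integer(format(stamp, "%H"))
print(hour_utc)
```

A place string without a comma is returned unchanged, so region names end up in the location column next to states and countries.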

In the next map we can see the places where the latest earthquakes have happened. The Pacific belt, starting from South American countries such as Chile, Peru and Ecuador and passing through the west coast of the USA to Alaska, shows a lot of seismic activity. However, the magnitudes are more or less in the range of 1-5. The next belt of seismic activity can be seen in countries like Indonesia, Japan, Papua New Guinea and New Zealand.

Following are the most prominent quakes that have happened with higher magnitude of 6 and above. (Between 18th Feb’19- 18th Mar’19)

Place: 115km ESE of Palora, Ecuador Magnitude: 7.5 Time: 2019-02-22 10:17:22

Place: 27km NNE of Azangaro, Peru Magnitude: 7 Time: 2019-03-01 08:50:41

Place: 116km SE of L’Esperance Rock, New Zealand Magnitude: 6.4 Time: 2019-03-06 15:46:14

Place: 49km NW of Namatanai, Papua New Guinea Magnitude: 6.4 Time: 2019-02-17 14:35:55

Place: 28km S of Cliza, Bolivia Magnitude: 6.3 Time: 2019-03-15 05:03:50

Place: 260km SE of Lambasa, Fiji Magnitude: 6.2 Time: 2019-03-10 08:12:25

Place: 140km SSW of Kulumadau, Papua New Guinea Magnitude: 6.1 Time: 2019-03-10 12:48:00

bins=seq(1, 8.0, by=1.0)
palette = colorBin( palette="YlOrBr", domain=earthquake$mag, na.color="transparent", bins=bins)


d=leaflet(earthquake) %>% 
  addTiles()  %>% 
  setView( lng = 166.45, lat = 21, zoom = 1.25) %>%
  addProviderTiles(providers$Esri.WorldImagery) %>%
  addCircleMarkers(~longitude, ~latitude, 
    fillColor = ~palette(mag), fillOpacity = 0.7, color="white", radius=3, stroke=FALSE,
     popup = paste("Place:", earthquake$place, "<br>",
            "Magnitude:", earthquake$mag, "<br>",
            "Time:", earthquake$time, "<br>")) %>%
  addLegend( pal=palette, values=~mag, opacity=0.9, title = "Magnitude", position = "bottomright" )


htmlwidgets::saveWidget(d, "d.html")
display_html('<iframe src="d.html" width=100% height=450></iframe>')

The table below lists the 25 locations with the most seismic activity; some are countries and some are U.S. states. Although around 32% of the earthquakes happened in California (listed as CA) and 31% in Alaska, their average magnitudes are low, at about 1.0 and 1.7 respectively. By comparison, countries like New Zealand, Indonesia, Papua New Guinea, Chile, Japan and the Philippines account for a much smaller share of earthquakes, but the average magnitude of these earthquakes is above 4.4. Note that "CA" and "California" appear as separate rows because the place strings are not written consistently.

earthquake %>% 
group_by(location) %>% 
filter(!(is.na(mag)))%>%
summarise(Number_of_quakes = length(location), 
          Average_Magnitude = mean(mag))%>%
mutate(Percent = round(prop.table(Number_of_quakes)*100,2))%>%
arrange(desc(Number_of_quakes))%>% top_n(25, Number_of_quakes)

## # A tibble: 25 × 4
##    location    Number_of_quakes Average_Magnitude Percent
##    <chr>                  <int>             <dbl>   <dbl>
##  1 CA                      2643             1.02    32.4 
##  2 Alaska                  2551             1.72    31.3 
##  3 Nevada                   480             0.776    5.88
##  4 Utah                     480             1.39     5.88
##  5 Puerto Rico              269             2.37     3.3 
##  6 Hawaii                   241             1.83     2.95
##  7 Montana                  174             1.27     2.13
##  8 Washington               114             1.06     1.4 
##  9 Indonesia                102             4.60     1.25
## 10 California                86             0.980    1.05
## # ℹ 15 more rows
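
Because "CA" and "California" are counted separately in the table above, the labels could be harmonised before grouping. The following is a minimal sketch on a toy vector; the lookup table abbreviations is hypothetical and would need to be extended for any other abbreviations found in the data.

```r
location <- c("CA", "Alaska", "California", "Nevada", "CA")

# Hypothetical lookup of abbreviations to full names; extend as needed
abbreviations <- c(CA = "California")

# Replace only the entries that have a known abbreviation
hit <- location %in% names(abbreviations)
location_clean <- location
location_clean[hit] <- unname(abbreviations[location[hit]])

print(table(location_clean))
```

Applying a step like this before group_by(location) would merge the two California rows into one.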

We know that depth is the point at which the earthquake begins to rupture. I was curious to know whether a larger depth leads to a higher-magnitude earthquake. In the scatter plot of magnitude vs. depth below, the deeper events are mostly earthquakes of magnitude 4-5, while the depths of earthquakes above magnitude 6 are scattered. Lower-magnitude earthquakes generally occur within 200 km of the surface.

ggplot(earthquake, aes(mag,depth,color = hour))+
geom_jitter(alpha = 0.5)+
gghighlight(depth > 200 | mag > 4)+
scale_x_continuous(breaks=seq(0,9,1))+
scale_color_viridis(direction = -1)+
theme_bw()+
xlab('Magnitude')+
ylab('Depth')+
ggtitle('Magnitude Vs. Depth in Km')+
theme(plot.title = element_text(hjust = 0.5))

Seismic activities also occur due to human activities such as explosions and quarry blasts. However, 98% of the seismic activities are earthquakes.

earthquake %>%
group_by(type)%>%
summarise(count = length(type))%>%
mutate(Percent = prop.table(count)*100)%>%
ggplot(aes(type,Percent, fill = type))+
geom_col()+theme_bw()+
xlab('Type of Seismic Activity')+
ylab('Percent')+
ggtitle('Percent of Seismic Activity')+
theme(plot.title = element_text(hjust = 0.5))
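
As a quick check of the percentage calculation, here is a toy sketch in base R. The vector type is made up for illustration. It shows that the multiplication by 100 has to happen after prop.table(), because scaling the counts first leaves the proportions unchanged.

```r
# Made-up event types: 8 earthquakes and 2 other events
type <- c(rep("earthquake", 8), "quarry blast", "explosion")
counts <- table(type)

# Normalise first, then scale to percent
percent <- prop.table(counts) * 100
print(percent)

# Scaling the counts before normalising still yields fractions that sum to 1
fractions <- prop.table(counts * 100)
print(sum(fractions))
```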

When we visualize these seismic activities according to their type, we find that most of the seismic activities other than earthquakes take place in the USA (mostly CA, Alaska, Nevada and Washington) and Canada. Alaska experiences a lot of ice quakes, though. Click on the map to know more.

pal <- colorFactor(palette = "Accent",domain = earthquake$type)

e=leaflet(earthquake) %>% 
  addTiles()  %>% 
  setView( lng = -119.417931, lat = 50.778259, zoom = 3) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(~longitude, ~latitude, 
    fillColor = ~pal(type), fillOpacity = 0.5, color="white", radius=3, stroke=FALSE,
     popup = paste("Place:", earthquake$place, "<br>",
            "Magnitude:", earthquake$mag, "<br>",
            "Time:", earthquake$time, "<br>",
            "Type:", earthquake$type, "<br>")) %>%
  addLegend( pal=pal, values=~type, opacity=0.9, title = "Type", position = "bottomright" )


htmlwidgets::saveWidget(e, "e.html")
display_html('<iframe src="e.html" width=100% height=450></iframe>')

How to tell the difference between Earthquakes and Other Quakes?

Depth:

We know that ‘depth’ is where the earthquake begins to rupture. The boxplot below shows that most of the seismic activities not related to nature occur at or near the surface of the earth. Earthquakes, on the other hand, mostly occur 10-20 km below the surface and can originate much deeper. This is typically not the case with other types of seismic activity.

ggplot(earthquake, aes('',depth, color = type))+geom_boxplot()+
scale_y_continuous(limits = c(0, 100))+theme_bw()+
xlab('Type of Seismic Activity')+
ggtitle('Depth of Seismic Activity')

Magnitude:

The next important parameter is the magnitude. While a natural earthquake may have a magnitude anywhere from 0 to 7, the magnitude of human-made quakes is much lower and mostly hovers around 0-2.2 on the Richter scale.

ggplot(earthquake, aes('',mag, color = type))+geom_boxplot()+
scale_y_continuous(limits = c(0, 8))+theme_bw()+
xlab('Type of Seismic Activity')+
ggtitle('Magnitude of Seismic Activity')

Time of the Day:

Well, if you experience shaking between midnight and 5 in the morning, consider checking the news for a chemical explosion in your area. If your dinner table is shaking, then maybe it’s a quarry blast.

ggplot(earthquake, aes('',hour, color = type))+geom_boxplot()+
scale_y_continuous(limits = c(0, 24))+theme_bw()+
xlab('Type of Seismic Activity')+
ggtitle('Time of Seismic Activity')

PREDICTION:

In the next section I try to predict the type of seismic activity using a few features: depth, magnitude, latitude, longitude and hour. There are many more features in the data set, but for now let’s use these parameters to make a basic prediction.

First of all, I am going to prepare the data set for prediction. I will group all the seismic activities that are not natural earthquakes into one category, as we saw that most of the other quakes have more or less the same range of depth, magnitude and location.

#Choosing the relevant columns
quake <- earthquake[, c("latitude", "longitude", "depth", "mag", "hour", "type")]

#Categorising the type
quake$type <- ifelse(quake$type == 'earthquake', 'Earthquake', 'Otherquake')

#quake[,c(1:5)] <- scale(quake[,c(1:5)])

#Converting our target column 'type' into a factor
quake$type <- factor(quake$type)

#Checking and removing any row having NA
colSums(is.na(quake))

##  latitude longitude     depth       mag      hour      type 
##         0         0         0         4         0         0

quake <- quake[!is.na(quake$mag),]

#Now look at the data
head(quake)

## # A tibble: 6 × 6
##   latitude longitude  depth   mag  hour type      
##      <dbl>     <dbl>  <dbl> <dbl> <int> <fct>     
## 1     33.5    -117.    2.82  0.44     4 Earthquake
## 2     61.1    -152.  106.    1.1      4 Earthquake
## 3     38.8    -123.    1.16  0.73     4 Earthquake
## 4     65.0    -149.    8.6   1.6      4 Earthquake
## 5     37.3    -118.    6.9   0.7      4 Earthquake
## 6     35.1     -97.6   5     1.8      4 Earthquake

I am going to use a Support Vector Machine to make the prediction, and I anticipate a class imbalance problem because 98% of the observations in our data set are natural earthquakes and only 2% belong to other quakes.

I will first train the model without handling the class imbalance and then use the ROSE library to treat the class imbalance problem.

table(quake$type)

## 
## Earthquake Otherquake 
##       8005        152

# splitting the data between train and test
set.seed(1234)

indices = sample.split(quake$type, SplitRatio = 0.7)

train = quake[indices,]

test = quake[!(indices),]

head(train)

## # A tibble: 6 × 6
##   latitude longitude  depth   mag  hour type      
##      <dbl>     <dbl>  <dbl> <dbl> <int> <fct>     
## 1     33.5    -117.    2.82  0.44     4 Earthquake
## 2     61.1    -152.  106.    1.1      4 Earthquake
## 3     38.8    -123.    1.16  0.73     4 Earthquake
## 4     65.0    -149.    8.6   1.6      4 Earthquake
## 5     35.1     -97.6   5     1.8      4 Earthquake
## 6     37.3    -117.    6.2   1.7      3 Earthquake

table(train$type)

## 
## Earthquake Otherquake 
##       5604        106

table(test$type)

## 
## Earthquake Otherquake 
##       2401         46
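
The train and test tables above keep roughly the same 98:2 class ratio because the split is done per class. The idea can be sketched in base R as follows. This is an illustration on a made-up label vector, not a description of how caTools::sample.split is implemented internally.

```r
set.seed(1234)

# Made-up labels with a 98:2 imbalance
type <- factor(c(rep("Earthquake", 980), rep("Otherquake", 20)))

# For each class, draw 70% of its row indices for the training set
train_idx <- unlist(lapply(split(seq_along(type), type), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train_type <- type[train_idx]
test_type  <- type[-train_idx]

print(table(train_type))
print(table(test_type))
```

Sampling within each class guarantees that the rare class is represented in both sets, which a plain random split cannot promise when one class has only a few rows.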

set.seed(1234)
#Using RBF Kernel
Model_RBF <- ksvm(type~ ., data = train, scale = FALSE, kernel = "rbfdot")
Eval_RBF <- predict(Model_RBF, test[,-6])

#confusion matrix - RBF Kernel
confusionMatrix(Eval_RBF,test$type)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Earthquake Otherquake
##   Earthquake       2401         46
##   Otherquake          0          0
##                                           
##                Accuracy : 0.9812          
##                  95% CI : (0.975, 0.9862) 
##     No Information Rate : 0.9812          
##     P-Value [Acc > NIR] : 0.5391          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.00000000003247
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9812          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9812          
##          Detection Rate : 0.9812          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Earthquake      
##

In the above prediction we can clearly see the flaw. All the earthquakes are predicted correctly, making the Sensitivity (true positive rate) really high. However, none of the ‘other quakes’ are predicted correctly, leading to a Specificity (true negative rate) of 0.

TREATING THE CLASS IMBALANCE:

I am going to use the ROSE (Random Over-Sampling Examples) package to treat the class imbalance issue by using over sampling, under sampling and both. Let’s start with over sampling first.

set.seed(1234)
over <- ovun.sample(type~., data = train, method="over", N= 11640)$data
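
Conceptually, random over sampling repeats minority-class rows until the classes are comparable in size. The base R sketch below illustrates that idea on a made-up data frame named toy. It is an assumption-laden illustration of the technique, not the ROSE implementation, and ovun.sample offers further options that are not reproduced here.

```r
set.seed(1234)

# Made-up data: 95 earthquakes and 5 other quakes
toy <- data.frame(
  mag  = c(runif(95, 0, 7), runif(5, 0, 2.2)),
  type = factor(c(rep("Earthquake", 95), rep("Otherquake", 5)))
)

majority <- toy[toy$type == "Earthquake", ]
minority <- toy[toy$type == "Otherquake", ]

# Resample minority rows with replacement until both classes have the same size
extra <- minority[sample.int(nrow(minority), nrow(majority), replace = TRUE), ]
balanced <- rbind(majority, extra)

print(table(balanced$type))
```

Only the training data should be resampled in this way; the test set keeps its original class ratio so that the evaluation reflects real conditions.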

set.seed(1234)
#Using RBF Kernel
Model_RBF <- ksvm(type~ ., data = over, scale = FALSE, kernel = "rbfdot")
Eval_RBF <- predict(Model_RBF, test[,-6])

  #confusion matrix - RBF Kernel
  confusionMatrix(Eval_RBF,test$type)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Earthquake Otherquake
##   Earthquake       2285          4
##   Otherquake        116         42
##                                              
##                Accuracy : 0.951              
##                  95% CI : (0.9416, 0.9592)   
##     No Information Rate : 0.9812             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.3941             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.9517             
##             Specificity : 0.9130             
##          Pos Pred Value : 0.9983             
##          Neg Pred Value : 0.2658             
##              Prevalence : 0.9812             
##          Detection Rate : 0.9338             
##    Detection Prevalence : 0.9354             
##       Balanced Accuracy : 0.9324             
##                                              
##        'Positive' Class : Earthquake         
##
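
The statistics in the output above can be reproduced by hand from the four cell counts of the confusion matrix, with Earthquake as the positive class. The variable names below are chosen for illustration only.

```r
# Cell counts read off the confusion matrix above (positive class: Earthquake)
tp <- 2285  # earthquakes predicted as Earthquake
fn <- 116   # earthquakes predicted as Otherquake
fp <- 4     # other quakes predicted as Earthquake
tn <- 42    # other quakes predicted as Otherquake

sensitivity       <- tp / (tp + fn)  # share of earthquakes recovered
specificity       <- tn / (tn + fp)  # share of other quakes recovered
neg_pred_value    <- tn / (tn + fn)  # share of Otherquake predictions that are right
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(sensitivity = sensitivity,
              specificity = specificity,
              neg_pred_value = neg_pred_value,
              balanced_accuracy = balanced_accuracy), 4))
```

The low negative predictive value shows the cost of over sampling: most events flagged as Otherquake are in fact earthquakes, even though nearly all true other quakes are caught.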

There we go, we can clearly see a significant increase in Specificity. 42 out of 46 other quakes in our test data are predicted correctly, although 116 earthquakes are now misclassified as other quakes. Introducing over sampling has really helped to treat the class imbalance and has increased our Specificity from 0 to 91%.

Laura Carolina

2024-03-04