Goal

Landslide susceptibility refers to the spatial probability of landslides occurring over a specific area. It can predict where a landslide may occur, but not when, so it remains a spatial probability. Landslide susceptibility mapping requires accurate locations of past landslides. The task is a classic classification problem in which landslide locations serve as presence data and non-landslide locations, randomly selected across the study area, serve as absence data. Selecting non-landslide locations requires extensive data exploration; this link provides examples of some absence-data sampling techniques. Predictors for the classification algorithm will be derived from a digital elevation model (DEM). Specifically, we will use eight DEM-based variables, referred to as causal factors in landslide susceptibility mapping. The accuracy of the landslide susceptibility map depends not only on the accurate identification of landslide locations but also on the quality of the causal factors. Since these causal factors are derived from the DEM, using a high-quality DEM can significantly enhance the map's accuracy. This study will compare the effectiveness of a 1-meter resolution DEM with that of the Shuttle Radar Topography Mission (SRTM) DEM, which has a 30-meter resolution.
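A minimal sketch of one absence-sampling strategy (random points across the study area, kept a minimum distance from any mapped landslide) is shown below; the study-area polygon Watauga_County.shp, the 200 m exclusion distance, and the candidate sample size are illustrative assumptions, not part of the original workflow.

library(sf)

county    <- st_read("Watauga_County.shp")        # hypothetical study-area polygon
slide_pts <- st_read("Landslides_Points.shp")     # mapped landslide points

set.seed(123)
# Draw candidate points, drop any within 200 m of a mapped landslide,
# then keep as many as there are landslide points so the classes stay balanced.
candidates <- st_sample(county, size = 5000)
too_close  <- lengths(st_is_within_distance(candidates, slide_pts, dist = 200)) > 0
absence    <- st_sf(geometry = candidates[!too_close][seq_len(nrow(slide_pts))])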

Variables or Causal Factors

Eight DEM-based variables will be used (a sketch of deriving several of them in R follows the list):

  1. Elevation

  2. Slope

  3. Aspect

  4. Plan Curvature

  5. Profile Curvature

  6. Topographic Wetness Index (TWI)

  7. Stream Power Index (SPI)

  8. Terrain Ruggedness Index (TRI)
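A minimal sketch of deriving several of these factors in R with the terra package, assuming a projected 1 m DEM (dem_1m.tif) and a flow-accumulation raster (flowacc_1m.tif) prepared beforehand; the file names are illustrative, and plan and profile curvature are left to a GIS tool, as was done in ArcGIS for this study.

library(terra)

dem  <- rast("dem_1m.tif")                 # hypothetical projected DEM (meters)
facc <- rast("flowacc_1m.tif")             # hypothetical flow-accumulation raster

slope_deg <- terrain(dem, v = "slope",  unit = "degrees")
aspect    <- terrain(dem, v = "aspect", unit = "degrees")
tri       <- terrain(dem, v = "TRI")       # terrain ruggedness index

# TWI and SPI from the specific catchment area and slope (in radians)
cell <- res(dem)[1]
sca  <- (facc + 1) * cell                  # specific catchment area
beta <- terrain(dem, v = "slope", unit = "radians")
twi  <- log(sca / tan(beta))               # topographic wetness index
spi  <- sca * tan(beta)                    # stream power index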

Install and Load Required Packages

library(sf)
library(tidyverse)
library(randomForest)

library(caret)
library(pROC)
library(e1071)

Upload and Convert Shape Files

We have two shapefiles, both point layers: one shows the locations of landslides and the other the locations of randomly generated non-landslide points. Our study area is Watauga County, North Carolina. We will read these two files into R and convert them into two separate data frames.

# Upload the Shape Files
library(sf)
Landslides_Points <- st_read("Landslides_Points.shp")
Non_Landslides_Points <- st_read("Non_Landslides_Points.shp")

# Convert to Data Frames
landslides <- as.data.frame(Landslides_Points)
Non_landslides <- as.data.frame(Non_Landslides_Points)

Check the Dataframes

# Print the heads of the Data Frames
#print(head(landslides))
#print(head(Non_landslides))
#
library(kableExtra)
my_data <- as_tibble(Landslides_Points)
my_data
## # A tibble: 2,257 × 68
##    OBJECTID Process_ID Project         County Quad  Col_Date Mvmnt_Date SM_State
##       <int>      <dbl> <chr>           <chr>  <chr> <chr>    <chr>      <chr>   
##  1    59592       3256 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
##  2    59593       3257 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
##  3    59594       3258 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
##  4    59595       3259 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
##  5    59596       3260 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
##  6    59598       3262 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
##  7    59599       3263 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
##  8    59600       3264 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
##  9    59601       3265 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
## 10    59602       3266 Landslide Haza… Watau… Gran… <NA>     1940/08/1… <NA>    
## # ℹ 2,247 more rows
## # ℹ 60 more variables: SM_Rate_1 <chr>, SM_Mat_1 <chr>, SM_Type_1 <chr>,
## #   SM_Rate_2 <chr>, SM_Mat_2 <chr>, SM_Type_2 <chr>, Slp_Config <chr>,
## #   Fatalities <int>, Confidence <int>, MapMethod <chr>, GlobalID <chr>,
## #   SHAPE_1 <chr>, Slope_DEM_ <dbl>, Slope <dbl>, Slope_1 <dbl>, DEM_30 <dbl>,
## #   Slope_30 <dbl>, Elevation <dbl>, Aspect <dbl>, Plan <dbl>, Pr <dbl>,
## #   TWI <dbl>, SPI <dbl>, Slope_12 <dbl>, LS <dbl>, TRI <dbl>, V <chr>, …
my_data1 <- as_tibble(Non_Landslides_Points)
my_data1
## # A tibble: 2,257 × 36
##     OID_   CID Elevation   Plan     Pr Aspect Slope_12     SPI   TRI   TWI  Soil
##    <dbl> <dbl>     <dbl>  <dbl>  <dbl>  <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl>
##  1     0     0     3565.  3.11  -1.08   333.      9.13  0.154  0.517  1.83     6
##  2     0     0     3562. 14.5    1.33    91.2    38.3   6.72   0.578  2.18     6
##  3     0     0     3660. -4.13   2.22    91.4    20.4   1.85   0.524  2.60     6
##  4     0     0     3646.  3.82  -4.87   333.      8.97  0.152  0.434  1.85     6
##  5     0     0     4565.  4.01  -0.951   76.8    13.5   0.232  0.536  1.43     6
##  6     0     0     4137. -1.35   0.179  343.     17.6  34.0    0.496  5.84     6
##  7     0     0     4260.  0.194 -0.921   80.4     5.78  0.0967 0.351  2.29     6
##  8     0     0     4126.  1.62   4.13   154.     12.8   0.439  0.397  2.18     6
##  9     0     0     4447.  2.04  -6.32   177.     17.7   0.315  0.516  1.14     3
## 10     0     0     3995.  2.06  -1.03    30.9    11.0   0.188  0.535  1.64     4
## # ℹ 2,247 more rows
## # ℹ 25 more variables: Aspect_1 <dbl>, PL <dbl>, PR_1 <dbl>, Geology <dbl>,
## #   NDVI <dbl>, Rainfall <dbl>, Slope_1 <dbl>, SPI_1 <dbl>, TRI_1 <dbl>,
## #   TWI_1 <dbl>, Landuse <dbl>, LULC_chang <dbl>, Imprevious <dbl>, Rain <dbl>,
## #   Distance_R <dbl>, Aspect_DEM <dbl>, TRI_12 <dbl>, DEM_Projec <dbl>,
## #   Hillshade <dbl>, Plan_curva <dbl>, Profile_cu <dbl>, Slope <dbl>,
## #   SPI_3 <dbl>, TWI__3 <dbl>, geometry <POINT [°]>

These two data frames contain several unnecessary columns. I used ArcGIS to derive the causal factors from the DEM, and then the Extract Multi Values to Points tool was used to extract the cell values of the causal factors at each landslide and non-landslide point. We need one target variable (V) and eight causal factors (Elevation, Slope, Aspect, Plan Curvature, Profile Curvature, TWI, TRI, SPI). We will select these variables and, since they carry different names in the two data frames, rename them for consistency. An R alternative to the ArcGIS extraction step is sketched below.
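A minimal sketch of that extraction step in R with terra::extract(), assuming the causal factors are available as rasters (the file names below are illustrative):

library(terra)
library(sf)

# Hypothetical single-band rasters of the causal factors (add the remaining layers as needed)
factors <- c(rast("elevation_1m.tif"), rast("slope_1m.tif"), rast("twi_1m.tif"))
pts     <- st_read("Landslides_Points.shp")

vals <- terra::extract(factors, vect(pts))                  # one row per point; first column is an ID
landslides_vals <- cbind(st_drop_geometry(pts), vals[, -1])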

Select Required Columns for 1m DEM

library(tidyverse)
library(dplyr)
landslides <- landslides %>% select(Slope_12, Elevation, Aspect, Plan, Pr, V, TWI, SPI, TRI)
Non_landslides <- Non_landslides %>% select(Slope_12, Elevation, Aspect, Plan, Pr, CID, TWI, SPI, TRI)

# Rename columns for consistency
landslides <- rename(landslides, Slope = Slope_12, PL = Plan, PR = Pr)
Non_landslides <- rename(Non_landslides, Slope = Slope_12, PL = Plan, PR = Pr, V = CID)
##Lets show them using tables
library(DT)
dat1 <- landslides
dat2 <- Non_landslides
datatable(
data = landslides,
caption = "Landslides",
filter = "top",
class = 'cell-border stripe'
)
datatable(
data = Non_landslides,
caption = "Non_Landslides",
filter = "top",
class = 'cell-border stripe'
)
# library(kableExtra)
# dat1
# kable(dat1)%>%kable_styling(latex_options = 'sriped')
# dat <- landslides[1:10,]
# knitr::kable(dat, caption= "Landslides")
# dat1 <- Non_Landslides_Points[1:10,]
# knitr::kable(dat1, caption= "Non_Landslides")

Combine Data Frames and Clean Data

Now we will combine the two data frames into a single data frame. But before that, we need to check the variable types.

tab1 <- summary(landslides)
tab2 <- summary(Non_landslides)
kable(tab1, caption="Summary Landslides")
Summary Landslides
Slope Elevation Aspect PL PR V TWI SPI TRI
Min. :-9999.00 Min. :-9999 Min. :-9999.00 Min. :-9999.000 Min. :-9999.000 Length:2257 Min. :-9999.000 Min. : -9999.00 Min. :-9999.000
1st Qu.: 25.61 1st Qu.: 2830 1st Qu.: 90.24 1st Qu.: -4.176 1st Qu.: -3.606 Class :character 1st Qu.: 1.172 1st Qu.: 1.15 1st Qu.: 0.476
Median : 30.57 Median : 3134 Median : 126.64 Median : -0.528 Median : 0.261 Mode :character Median : 2.541 Median : 4.37 Median : 0.500
Mean : -18.56 Mean : 3012 Mean : 85.47 Mean : -49.398 Mean : -48.531 NA Mean : -45.738 Mean : 186.77 Mean : -12.789
3rd Qu.: 35.32 3rd Qu.: 3437 3rd Qu.: 172.47 3rd Qu.: 2.988 3rd Qu.: 3.975 NA 3rd Qu.: 4.526 3rd Qu.: 34.64 3rd Qu.: 0.524
Max. : 72.29 Max. : 5162 Max. : 357.53 Max. : 44.670 Max. : 81.796 NA Max. : 14.860 Max. :232744.00 Max. : 0.842
kable(tab2, caption="Summary Non_Landslides")
Summary Non_Landslides
Slope Elevation Aspect PL PR V TWI SPI TRI
Min. :-9999.000 Min. :-9999 Min. :-9999.00 Min. :-9999.000 Min. :-9999.000 Min. :0 Min. :-9999.000 Min. : -9999.00 Min. :-9999.000
1st Qu.: 7.986 1st Qu.: 3056 1st Qu.: 91.27 1st Qu.: -1.134 1st Qu.: -1.390 1st Qu.:0 1st Qu.: 1.946 1st Qu.: 0.30 1st Qu.: 0.469
Median : 14.085 Median : 3335 Median : 185.17 Median : 0.012 Median : 0.000 Median :0 Median : 3.073 Median : 0.97 Median : 0.502
Mean : 5.951 Mean : 3345 Mean : 171.33 Mean : -8.904 Mean : -8.792 Mean :0 Mean : -18.668 Mean : 221.66 Mean : -65.956
3rd Qu.: 20.366 3rd Qu.: 3665 3rd Qu.: 268.38 3rd Qu.: 1.302 3rd Qu.: 1.411 3rd Qu.:0 3rd Qu.: 4.592 3rd Qu.: 5.02 3rd Qu.: 0.532
Max. : 59.009 Max. : 5295 Max. : 359.78 Max. : 21.310 Max. : 68.018 Max. :0 Max. : 19.732 Max. :270285.00 Max. : 0.884

The summaries show that the target variable V is in character format, and we will convert it to a factor. All variables have missing data coded as -9999, which we will drop after joining the two data frames.

Combine the Dataframes

landslides$V <- as.character(landslides$V)
Non_landslides$V <- as.character(Non_landslides$V)
Data <- rbind(landslides, Non_landslides)
Data_m <- Data

Drop the missing values labeled as -9999

Data_m <- Data_m %>% filter(across(-V, ~ . != -9999))
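dplyr now recommends if_any() or if_all() in place of across() inside filter() (that is what the deprecation warning further down refers to); an equivalent, non-deprecated version of this filter keeps only rows where every causal factor differs from -9999:

Data_m <- Data_m %>% filter(if_all(-V, ~ . != -9999))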

Now let's see the summary again.

##Lets show them using tables
library(DT)
dat3 <- summary(Data_m)

kable(dat3, caption="Summary of Data")
Summary of Data
Slope Elevation Aspect PL PR V TWI SPI TRI
Min. : 0.01862 Min. :1413 Min. : 0.00 Min. :-53.1819 Min. :-71.61270 Length:4486 Min. :-1.033 Min. : -66.25 Min. :0.0186
1st Qu.:13.36940 1st Qu.:2933 1st Qu.: 91.27 1st Qu.: -2.3481 1st Qu.: -2.20290 Class :character 1st Qu.: 1.592 1st Qu.: 0.61 1st Qu.:0.4736
Median :23.40960 Median :3246 Median :144.91 Median : -0.1317 Median : 0.08716 Mode :character Median : 2.854 Median : 2.08 Median :0.5013
Mean :22.62100 Mean :3217 Mean :157.74 Mean : -0.3573 Mean : 0.13419 NA Mean : 3.239 Mean : 234.45 Mean :0.5012
3rd Qu.:31.42370 3rd Qu.:3556 3rd Qu.:217.14 3rd Qu.: 1.8725 3rd Qu.: 2.36965 NA 3rd Qu.: 4.548 3rd Qu.: 13.41 3rd Qu.:0.5280
Max. :72.29000 Max. :5295 Max. :359.78 Max. : 44.6705 Max. : 81.79600 NA Max. :19.732 Max. :270285.00 Max. :0.8844

Scale the Variables to Z-Score

## Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
## ℹ Please use `if_any()` or `if_all()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
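The scaling chunk itself is not echoed above; only the warning from the earlier filter() call appears. Presumably it mirrored the 30 m workflow shown later in this document, scaling the eight causal factors and converting V to a factor, roughly:

# Scale variables to z-scores (sketch mirroring the 30 m code below)
scale_to_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Data_m <- Data_m %>% mutate(across(.cols = -V, .fns = scale_to_z))
Data_m$V <- as.factor(Data_m$V)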

Split Data into Train and Test Sets

# Load necessary libraries
library(caret)
library(dplyr)

# Set a seed for reproducibility
set.seed(123)

# Perform stratified sampling
trainIndex <- createDataPartition(Data_m$V, p = 0.8, list = FALSE)

# Split the data into training and test sets
trainData <- Data_m[trainIndex, ]
testData <- Data_m[-trainIndex, ]

# Check the distribution of target variable in training and test sets
dat4 <- table(trainData$V)
dat5 <- table(testData$V)
kable(dat4)
Var1 Freq
0 1792
1 1797
kable(dat5)
Var1 Freq
0 448
1 449

So in our training dataset there are 1792 absence and 1797 presence points, whereas in the test dataset there are 448 absence and 449 presence points. Now we will train our model.

Logistic Regression Model

# Load necessary libraries
library(pROC)
library(ggplot2)

library(gt)

# Train a logistic regression model
logistic_model_1m <- glm(V ~ ., data = trainData, family = binomial)

# Summarize the model
summary(logistic_model_1m)
## 
## Call:
## glm(formula = V ~ ., family = binomial, data = trainData)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.03158    0.04773  -0.662   0.5082    
## Slope        2.21441    0.07486  29.580  < 2e-16 ***
## Elevation   -0.31600    0.05157  -6.128 8.89e-10 ***
## Aspect      -0.38257    0.04935  -7.753 9.00e-15 ***
## PL          -0.13833    0.07261  -1.905   0.0568 .  
## PR          -0.17099    0.07159  -2.388   0.0169 *  
## TWI          0.28014    0.05302   5.284 1.27e-07 ***
## SPI          0.03856    0.06065   0.636   0.5249    
## TRI          0.15603    0.06062   2.574   0.0101 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4975.4  on 3588  degrees of freedom
## Residual deviance: 2831.4  on 3580  degrees of freedom
## AIC: 2849.4
## 
## Number of Fisher Scoring iterations: 5
exp(coef(logistic_model_1m))
## (Intercept)       Slope   Elevation      Aspect          PL          PR 
##   0.9689135   9.1559973   0.7290574   0.6821078   0.8708116   0.8428334 
##         TWI         SPI         TRI 
##   1.3233170   1.0393155   1.1688608
#kable(l$coefficients, caption="g")

Our goal is not to interpret the logistic regression results; rather, we are concerned with prediction. Let's check the prediction accuracy.
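Because this assessment is repeated for several models below, a small helper function could wrap it. The function is hypothetical (not part of the original workflow) and is shown only as a sketch:

# Hypothetical helper: confusion matrix and accuracy for a fitted glm at a 0.5 cutoff
assess_glm <- function(model, data, threshold = 0.5) {
  prob <- predict(model, newdata = data, type = "response")
  pred <- ifelse(prob > threshold, 1, 0)
  cm   <- table(observed = data$V, predicted = pred)
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}
# Example use: assess_glm(logistic_model_1m, testData)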

Accuracy Assessment for 1 meter DEM

# Make predictions on the training set

library(dplyr)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(caret)

pred_prob_train <- predict(logistic_model_1m , newdata = trainData, type = "response")
pred_class_train <- ifelse(pred_prob_train > 0.5, 1, 0)

# Evaluate the model on training set
confusion_matrix_train <- table(trainData$V, pred_class_train)

##knitr::kable(confusion_matrix_train)
accuracy_train <- sum(diag(confusion_matrix_train)) / sum(confusion_matrix_train)
print(confusion_matrix_train)
##    pred_class_train
##        0    1
##   0 1487  305
##   1  251 1546
 print(paste("Training Accuracy:", accuracy_train))
## [1] "Training Accuracy: 0.84508219559766"
#kable(t)
# Make predictions on the test set
pred_prob_test <- predict(logistic_model_1m, newdata = testData, type = "response")
pred_class_test <- ifelse(pred_prob_test > 0.5, 1, 0)

# Evaluate the model on test set
confusion_matrix_test <- table(testData$V, pred_class_test)
accuracy_test <- sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test)
print(confusion_matrix_test)
##    pred_class_test
##       0   1
##   0 382  66
##   1  71 378
print(paste("Test Accuracy:", accuracy_test))
## [1] "Test Accuracy: 0.84726867335563"
accuracy_test_1m <- as.data.frame(accuracy_test)

# Calculate ROC curve and AUC
roc_curve <- roc(testData$V, pred_prob_test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
auc_1m <- auc_value 
print(paste("AUC:", auc_value))
## [1] "AUC: 0.898628897550111"
# Plot ROC curve using ggplot2
roc_data <- data.frame(
  specificity = rev(roc_curve$specificities),
  sensitivity = rev(roc_curve$sensitivities)
)

ggplot(roc_data, aes(x = specificity, y = sensitivity)) +
  geom_line(color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Logistic Regression_1m", x = "Specificity", y = "Sensitivity") +
  theme_minimal()

write.csv(Data_m, "Data_1m.csv", row.names = FALSE)

The logistic regression model gives satisfactory accuracy with the 1m DEM-based causal factors. Now we will work with the SRTM (30 m) DEM-based causal factors and check the prediction accuracy. This will help us understand the role of DEM resolution and accuracy in landslide prediction over an area. We will repeat the whole process; only the dataset will be different.

Select Required Columns for SRTM (30m) DEM

# Upload the Shape Files
library(sf)
Landslides_Points <- st_read("Landslides_Points.shp")
## Reading layer `Landslides_Points' from data source 
##   `C:\R\Shape\Shape Files\Data\Landslides_Points.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2257 features and 67 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 349401.4 ymin: 265880.2 xmax: 387724.9 ymax: 293335.3
## Projected CRS: NAD83 / North Carolina
Non_Landslides_Points <- st_read("Non_Landslides_Points.shp")
## Reading layer `Non_Landslides_Points' from data source 
##   `C:\R\Shape\Shape Files\Data\Non_Landslides_Points.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2257 features and 35 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.9147 ymin: 36.11658 xmax: -81.46408 ymax: 36.38499
## Geodetic CRS:  NAD83
library(tidyverse)
library(dplyr)
landslides <- as.data.frame(Landslides_Points)
Non_landslides <- as.data.frame(Non_Landslides_Points)
landslides_30 <- landslides %>% select(Slope_1_14, DEM_Projec, Aspect_DEM, Plan_curva, Profile_cu, V, TWI__3, SPI_3, TRI_12)
Non_landslides_30 <- Non_landslides %>% select(Slope, DEM_Projec, Aspect_DEM, Plan_curva, Profile_cu, CID, TWI__3, SPI_3, TRI_12)

# Rename columns for consistency
landslides_30 <- rename(landslides_30, Slope = Slope_1_14, PL = Plan_curva, PR = Profile_cu, Elevation= DEM_Projec, Aspect=Aspect_DEM, TWI= TWI__3, SPI= SPI_3, TRI=TRI_12)
Non_landslides_30 <- rename(Non_landslides_30, PL = Plan_curva, PR = Profile_cu, Elevation= DEM_Projec, Aspect=Aspect_DEM, TWI= TWI__3, SPI= SPI_3, TRI=TRI_12, V = CID)
##Lets show them using tables
library(DT)
dat1 <- landslides_30
dat2 <- Non_landslides_30
datatable(
data = landslides_30,
caption = "Landslides",
filter = "top",
class = 'cell-border stripe'
)
datatable(
data = Non_landslides_30,
caption = "Non_Landslides",
filter = "top",
class = 'cell-border stripe'
)

Combine the Dataframes and Clean Data

tab1 <- summary(landslides_30)
tab2 <- summary(Non_landslides_30)
kable(tab1, caption="Summary Landslides_30m")
Summary Landslides_30m
Slope Elevation Aspect PL PR V TWI SPI TRI
Min. : 2.07 Min. : 449.0 Min. :-9999.00 Min. :-3.11276 Min. :-3.21878 Length:2257 Min. :-3.4113 Min. : 0.0 Min. :0.2922
1st Qu.:15.21 1st Qu.: 876.0 1st Qu.: 88.23 1st Qu.:-0.31934 1st Qu.:-0.46307 Class :character 1st Qu.:-2.0951 1st Qu.: 0.0 1st Qu.:0.4547
Median :20.25 Median : 971.0 Median : 130.91 Median : 0.03593 Median :-0.07091 Mode :character Median :-1.5829 Median : 15.8 Median :0.5048
Mean :20.08 Mean : 950.7 Mean : 126.98 Mean : 0.03297 Mean :-0.08110 NA Mean :-1.4046 Mean : 1148.7 Mean :0.5088
3rd Qu.:24.74 3rd Qu.:1063.0 3rd Qu.: 176.63 3rd Qu.: 0.37393 3rd Qu.: 0.29913 NA 3rd Qu.:-0.9447 3rd Qu.: 44.1 3rd Qu.:0.5644
Max. :44.21 Max. :1586.0 Max. : 359.49 Max. : 3.60871 Max. : 2.86714 NA Max. :10.0097 Max. :2402323.8 Max. :0.7527
kable(tab2, caption="Summary Non_Landslides")
Summary Non_Landslides
Slope Elevation Aspect PL PR V TWI SPI TRI
Min. : 0.3552 Min. : 434 Min. :-9999.0 Min. :-2.04006 Min. :-2.39026 Min. :0 Min. :-2.9974 Min. : 0.0 Min. :0.1510
1st Qu.: 6.7917 1st Qu.: 936 1st Qu.: 85.6 1st Qu.:-0.22708 1st Qu.:-0.21649 1st Qu.:0 1st Qu.:-1.5306 1st Qu.: 0.0 1st Qu.:0.4013
Median :11.1627 Median :1022 Median : 175.2 Median : 0.00000 Median : 0.02460 Median :0 Median :-0.8667 Median : 12.1 Median :0.4701
Mean :11.7711 Mean :1030 Mean : 160.2 Mean : 0.01086 Mean : 0.05174 Mean :0 Mean :-0.1922 Mean : 2444.6 Mean :0.4697
3rd Qu.:15.9215 3rd Qu.:1126 3rd Qu.: 257.0 3rd Qu.: 0.24601 3rd Qu.: 0.32753 3rd Qu.:0 3rd Qu.: 0.4310 3rd Qu.: 48.6 3rd Qu.:0.5400
Max. :37.2150 Max. :1618 Max. : 359.5 Max. : 1.72111 Max. : 2.73880 Max. :0 Max. :12.0588 Max. :2588779.8 Max. :0.7704
landslides_30$V <- as.character(landslides_30$V)
Non_landslides_30$V <- as.character(Non_landslides_30$V)
Data_30 <- rbind(landslides_30, Non_landslides_30)
Data_m_30 <- Data_30

Drop the Missing Values Labeled as -9999

Data_m_30 <- Data_m_30 %>% filter(across(-V, ~ . != -9999))
## Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
## ℹ Please use `if_any()` or `if_all()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Store the Clean Dataset for 1m DEM and 30m DEM as CSV Files

We will store the clean 1m DEM and 30m DEM data as CSV files so that we can use them later without having to clean them again.

write.csv(Data_m, "Data_1m.csv", row.names = FALSE)
write.csv(Data_m_30, "Data_30m.csv", row.names = FALSE)

Scale the Variables to Z-Score

# Scale variables to z-scores
scale_to_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Data_m_30 <- Data_m_30 %>% mutate(across(.cols = -V, .fns = scale_to_z))
Data_m_30$V <- as.factor(Data_m_30$V)

Split Data into Train and Test Sets

# Load necessary libraries
library(caret)
library(dplyr)

# Set a seed for reproducibility
set.seed(123)

# Perform stratified sampling
trainIndex <- createDataPartition(Data_m_30$V, p = 0.8, list = FALSE)

# Split the data into training and test sets
trainData <- Data_m_30[trainIndex, ]
testData <- Data_m_30[-trainIndex, ]

# Check the distribution of target variable in training and test sets
dat4 <- table(trainData$V)
dat5 <- table(testData$V)
kable(dat4)
Var1 Freq
0 1804
1 1804
kable(dat5)
Var1 Freq
0 450
1 451

Logistic Regression Model

# Load necessary libraries
library(pROC)
library(ggplot2)

library(gt)

# Train a logistic regression model
logistic_model_30m <- glm(V ~ ., data = trainData, family = binomial)

# Summarize the model
summary(logistic_model_30m)
## 
## Call:
## glm(formula = V ~ ., family = binomial, data = trainData)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.04696    0.04473  -1.050    0.294    
## Slope        1.35156    0.06116  22.099  < 2e-16 ***
## Elevation   -0.57739    0.04683 -12.329  < 2e-16 ***
## Aspect      -0.37151    0.04437  -8.373  < 2e-16 ***
## PL          -0.25238    0.05986  -4.216 2.48e-05 ***
## PR          -0.27709    0.05670  -4.887 1.02e-06 ***
## TWI         -0.44548    0.08233  -5.411 6.26e-08 ***
## SPI          0.12954    0.10531   1.230    0.219    
## TRI          0.56899    0.04986  11.411  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5001.8  on 3607  degrees of freedom
## Residual deviance: 3304.0  on 3599  degrees of freedom
## AIC: 3322
## 
## Number of Fisher Scoring iterations: 5
exp(coef(logistic_model_30m))
## (Intercept)       Slope   Elevation      Aspect          PL          PR 
##   0.9541291   3.8634647   0.5613606   0.6896913   0.7769459   0.7579835 
##         TWI         SPI         TRI 
##   0.6405158   1.1383086   1.7664896
#summary(logistic_model_1m)

Let's see the results of the two logistic regression models side by side.

a <- summary(logistic_model_30m)
b <- summary(logistic_model_1m)
re1 <- as.data.frame(a$coefficients)
re2 <- as.data.frame(b$coefficients)
re3 <- cbind(re1,re2)
colnames(re3) <- c("Estimate_30m", "Std.Error_30m", "Z Value_30m", "Pr(>|Z|_30m)",
                   "Estimate_1m", "Std.Error_1m", "Z Value_1m", "Pr(>|Z|_1m)")


kable(re3, caption = "Comparison")
Comparison
Estimate_30m Std.Error_30m Z Value_30m Pr(>|Z|_30m) Estimate_1m Std.Error_1m Z Value_1m Pr(>|Z|_1m)
(Intercept) -0.0469563 0.0447307 -1.049754 0.2938313 -0.0315799 0.0477302 -0.6616344 0.5082055
Slope 1.3515644 0.0611590 22.099196 0.0000000 2.2144091 0.0748610 29.5802686 0.0000000
Elevation -0.5773918 0.0468332 -12.328697 0.0000000 -0.3160028 0.0515664 -6.1280777 0.0000000
Aspect -0.3715111 0.0443703 -8.372967 0.0000000 -0.3825676 0.0493470 -7.7525966 0.0000000
PL -0.2523846 0.0598575 -4.216422 0.0000248 -0.1383296 0.0726106 -1.9050876 0.0567687
PR -0.2770936 0.0566990 -4.887102 0.0000010 -0.1709860 0.0715890 -2.3884399 0.0169201
TWI -0.4454815 0.0823265 -5.411154 0.0000001 0.2801415 0.0530191 5.2837881 0.0000001
SPI 0.1295435 0.1053086 1.230131 0.2186479 0.0385623 0.0606510 0.6358070 0.5249023
TRI 0.5689943 0.0498644 11.410838 0.0000000 0.1560296 0.0606154 2.5740911 0.0100504
# datatable(
# data = re3,
# caption = "Comparison 30m vs 1m",
# filter = "top",
# class = 'cell-border stripe'
# )

Our main goal is prediction, so we are not concerned with the coefficients here. But when we produce the landslide susceptibility map we will use only the significant causal factors; for both DEMs, slope, elevation, aspect, profile curvature, TWI, and TRI are significant. A sketch of producing the map from the fitted model follows.
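A minimal sketch of producing the susceptibility map itself with terra::predict(), assuming a multi-layer raster whose layer names match the model's predictors (Slope, Elevation, Aspect, PL, PR, TWI, SPI, TRI) and whose values were scaled the same way as the training data; the file names are illustrative:

library(terra)

factors_1m <- rast("factors_1m_scaled.tif")   # hypothetical stack of scaled causal-factor layers
susceptibility_1m <- terra::predict(factors_1m, logistic_model_1m, type = "response")
writeRaster(susceptibility_1m, "susceptibility_1m.tif", overwrite = TRUE)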

Accuracy Assessment for 30 meter DEM

library(caret)

pred_prob_train <- predict(logistic_model_30m , newdata = trainData, type = "response")
pred_class_train <- ifelse(pred_prob_train > 0.5, 1, 0)

# Evaluate the model on training set
confusion_matrix_train <- table(trainData$V, pred_class_train)

##knitr::kable(confusion_matrix_train)
accuracy_train <- sum(diag(confusion_matrix_train)) / sum(confusion_matrix_train)
print(confusion_matrix_train)
##    pred_class_train
##        0    1
##   0 1410  394
##   1  370 1434
 print(paste("Training Accuracy:", accuracy_train))
## [1] "Training Accuracy: 0.788248337028825"
 # Make predictions on the test set
pred_prob_test <- predict(logistic_model_30m, newdata = testData, type = "response")
pred_class_test <- ifelse(pred_prob_test > 0.5, 1, 0)

# Evaluate the model on test set
confusion_matrix_test <- table(testData$V, pred_class_test)
accuracy_test <- sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test)
print(confusion_matrix_test)
##    pred_class_test
##       0   1
##   0 345 105
##   1  91 360
print(paste("Test Accuracy:", accuracy_test))
## [1] "Test Accuracy: 0.782463928967814"
accuracy_test_30m <- as.data.frame(accuracy_test)

# Calculate ROC curve and AUC
roc_curve <- roc(testData$V, pred_prob_test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_curve)

auc_30m <- as.data.frame(auc_value)
print(paste("AUC:", auc_value))
## [1] "AUC: 0.857031288494703"
# Plot ROC curve using ggplot2
roc_data <- data.frame(
  specificity = rev(roc_curve$specificities),
  sensitivity = rev(roc_curve$sensitivities)
)

ggplot(roc_data, aes(x = specificity, y = sensitivity)) +
  geom_line(color = "green") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Logistic Regression_30m", x = "Specificity", y = "Sensitivity") +
  theme_minimal()

Comparison of Accuracy Between 1m and 30m DEM

## Combine the four data frames created earlier and relabel the columns
library(flextable)
## 
## Attaching package: 'flextable'
## The following objects are masked from 'package:kableExtra':
## 
##     as_image, footnote
## The following object is masked from 'package:purrr':
## 
##     compose
Acc_Comp_LR <- cbind(accuracy_test_1m, accuracy_test_30m, auc_1m, auc_30m)
colnames(Acc_Comp_LR) <- c('Accuracy_1m', 'Accuracy_30m', 'AUC_1m', 'AUC_30m')
#kable(Acc_Comp_LR, caption= "Comparison of Logistic Regression Models" )
Acc_Comp_LR%>%
  kbl(caption = "Comparison of Logistic Regression Models")%>%
  kable_styling()
Comparison of Logistic Regression Models
Accuracy_1m Accuracy_30m AUC_1m AUC_30m
0.8472687 0.7824639 0.8986289 0.8570313
#flextable(Acc_Comp_LR, )

The logistic regression models show that variables derived from the 1m DEM give better prediction accuracy than variables derived from the 30 m DEM. We will check another model next.
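The two ROC curves can also be overlaid with pROC::ggroc() to visualize this difference. The sketch below assumes the roc objects from the two accuracy assessments were saved under separate names (roc_1m and roc_30m are hypothetical; in the code above the object roc_curve is simply overwritten):

library(pROC)
library(ggplot2)

ggroc(list(`1m DEM` = roc_1m, `30m DEM` = roc_30m)) +
  labs(title = "ROC Curves for Logistic Regression: 1m vs 30m DEM", colour = "DEM") +
  theme_minimal()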

Random Forest Model

We will not carry out any hyperparameter tuning right now; here we just want to see the differences between the two resolutions. Later we will work on hyperparameter tuning; a hedged preview of what that could look like with caret is sketched below.
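As a preview of that tuning, a sketch with caret::train() grid-searches mtry using 5-fold cross-validation; the grid and fold count are illustrative choices, and trainData is whichever training set is being modeled:

library(caret)
library(randomForest)

set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
rf_tuned <- train(V ~ ., data = trainData, method = "rf",
                  tuneGrid = expand.grid(mtry = 2:5),
                  trControl = ctrl, ntree = 500)
rf_tuned$bestTune   # mtry with the best cross-validated accuracy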

1m DEM

Data Processing

# Upload the Shape Files
library(sf)
Landslides_Points <- st_read("Landslides_Points.shp")
## Reading layer `Landslides_Points' from data source 
##   `C:\R\Shape\Shape Files\Data\Landslides_Points.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2257 features and 67 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 349401.4 ymin: 265880.2 xmax: 387724.9 ymax: 293335.3
## Projected CRS: NAD83 / North Carolina
Non_Landslides_Points <- st_read("Non_Landslides_Points.shp")
## Reading layer `Non_Landslides_Points' from data source 
##   `C:\R\Shape\Shape Files\Data\Non_Landslides_Points.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2257 features and 35 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.9147 ymin: 36.11658 xmax: -81.46408 ymax: 36.38499
## Geodetic CRS:  NAD83
# Convert to Data Frames
landslides <- as.data.frame(Landslides_Points)
Non_landslides <- as.data.frame(Non_Landslides_Points)
library(tidyverse)
library(dplyr)
landslides <- landslides %>% select(Slope_12, Elevation, Aspect, Plan, Pr, V, TWI, SPI, TRI)
Non_landslides <- Non_landslides %>% select(Slope_12, Elevation, Aspect, Plan, Pr, CID, TWI, SPI, TRI)

# Rename columns for consistency
landslides <- rename(landslides, Slope = Slope_12, PL = Plan, PR = Pr)
Non_landslides <- rename(Non_landslides, Slope = Slope_12, PL = Plan, PR = Pr, V = CID)

#  Combine Data frames
landslides$V <- as.character(landslides$V)
Non_landslides$V <- as.character(Non_landslides$V)
Data <- rbind(landslides, Non_landslides)
Data_m <- Data
Data_m <- Data_m %>% filter(across(-V, ~ . != -9999))
## Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
## ℹ Please use `if_any()` or `if_all()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#Train and Test Datasets
# Load necessary libraries
library(caret)
library(dplyr)

# Set a seed for reproducibility
set.seed(123)
Data_m$V <- as.factor(Data_m$V)

# Perform stratified sampling
trainIndex <- createDataPartition(Data_m$V, p = 0.8, list = FALSE)

# Split the data into training and test sets
trainData <- Data_m[trainIndex, ]
testData <- Data_m[-trainIndex, ]
# Train a Random Forest model
rf_model_1m <- randomForest(V ~ ., data = trainData, ntree = 500, mtry = 3, importance = TRUE)

# Summarize the model
print(rf_model_1m)
## 
## Call:
##  randomForest(formula = V ~ ., data = trainData, ntree = 500,      mtry = 3, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 15.44%
## Confusion matrix:
##      0    1 class.error
## 0 1494  298   0.1662946
## 1  256 1541   0.1424597
# Make predictions on the test set
rf_pred_prob <- predict(rf_model_1m, newdata = testData, type = "prob")[,2]
rf_pred_class <- predict(rf_model_1m, newdata = testData, type = "response")

# Evaluate the model
confusion_matrix_rf <- confusionMatrix(as.factor(rf_pred_class), as.factor(testData$V))
print(confusion_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 380  81
##          1  68 368
##                                           
##                Accuracy : 0.8339          
##                  95% CI : (0.8079, 0.8577)
##     No Information Rate : 0.5006          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6678          
##                                           
##  Mcnemar's Test P-Value : 0.3256          
##                                           
##             Sensitivity : 0.8482          
##             Specificity : 0.8196          
##          Pos Pred Value : 0.8243          
##          Neg Pred Value : 0.8440          
##              Prevalence : 0.4994          
##          Detection Rate : 0.4236          
##    Detection Prevalence : 0.5139          
##       Balanced Accuracy : 0.8339          
##                                           
##        'Positive' Class : 0               
## 
# Calculate ROC curve and AUC
roc_curve_rf <- roc(testData$V, rf_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value_rf <- auc(roc_curve_rf)
print(paste("AUC for Random Forest:", auc_value_rf))
## [1] "AUC for Random Forest: 0.915461939230035"
auc_1m <- as.data.frame(auc_value_rf )
# Plot ROC curve using ggplot2
roc_data_rf <- data.frame(
  specificity = rev(roc_curve_rf$specificities),
  sensitivity = rev(roc_curve_rf$sensitivities)
)

ggplot(roc_data_rf, aes(x = specificity, y = sensitivity)) +
  geom_line(color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Random Forest", x = "Specificity", y = "Sensitivity") +
  theme_minimal()

# Get variable importance
importance_rf <- importance(rf_model_1m )
importance_df <- data.frame(Variable = rownames(importance_rf), 
                            MeanDecreaseGini = importance_rf[, "MeanDecreaseGini"])

# Plot variable importance using ggplot2
p1 <- ggplot(importance_df, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  labs(title = "Variable Importance Plot_1m", x = "Variable", y = "Mean Decrease in Gini") +
  theme_minimal()
p1

30 m DEM

library(tidyverse)
library(dplyr)
landslides <- as.data.frame(Landslides_Points)
Non_landslides <- as.data.frame(Non_Landslides_Points)
landslides_30 <- landslides %>% select(Slope_1_14, DEM_Projec, Aspect_DEM, Plan_curva, Profile_cu, V, TWI__3, SPI_3, TRI_12)
Non_landslides_30 <- Non_landslides %>% select(Slope, DEM_Projec, Aspect_DEM, Plan_curva, Profile_cu, CID, TWI__3, SPI_3, TRI_12)

# Rename columns for consistency
landslides_30 <- rename(landslides_30, Slope = Slope_1_14, PL = Plan_curva, PR = Profile_cu, Elevation= DEM_Projec, Aspect=Aspect_DEM, TWI= TWI__3, SPI= SPI_3, TRI=TRI_12)
Non_landslides_30 <- rename(Non_landslides_30, PL = Plan_curva, PR = Profile_cu, Elevation= DEM_Projec, Aspect=Aspect_DEM, TWI= TWI__3, SPI= SPI_3, TRI=TRI_12, V = CID)
landslides_30$V <- as.factor(landslides_30$V)
Non_landslides_30$V <- as.factor(Non_landslides_30$V)
Data_30 <- rbind(landslides_30, Non_landslides_30)
Data_m_30 <- Data_30

Data_m_30 <- Data_m_30 %>% filter(across(-V, ~ . != -9999))
## Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
## ℹ Please use `if_any()` or `if_all()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Scale variables to z-scores
scale_to_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Data_m_30 <- Data_m_30 %>% mutate(across(.cols = -V, .fns = scale_to_z))
Data_m_30$V <- as.factor(Data_m_30$V)

# Load necessary libraries
library(caret)
library(dplyr)

# Set a seed for reproducibility
set.seed(123)

# Perform stratified sampling
trainIndex <- createDataPartition(Data_m_30$V, p = 0.8, list = FALSE)

# Split the data into training and test sets
trainData <- Data_m_30[trainIndex, ]
testData <- Data_m_30[-trainIndex, ]
# Train a Random Forest model
rf_model_30m <- randomForest(V ~ ., data = trainData, ntree = 500, mtry = 3, importance = TRUE)

# Summarize the model
print(rf_model_30m)
## 
## Call:
##  randomForest(formula = V ~ ., data = trainData, ntree = 500,      mtry = 3, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.96%
## Confusion matrix:
##      1    0 class.error
## 1 1453  351   0.1945676
## 0  369 1435   0.2045455
# Make predictions on the test set
rf_pred_prob <- predict(rf_model_30m, newdata = testData, type = "prob")[,2]
rf_pred_class <- predict(rf_model_30m, newdata = testData, type = "response")

# Evaluate the model
confusion_matrix_rf <- confusionMatrix(as.factor(rf_pred_class), as.factor(testData$V))
print(confusion_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1 366  90
##          0  85 360
##                                           
##                Accuracy : 0.8058          
##                  95% CI : (0.7784, 0.8311)
##     No Information Rate : 0.5006          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6115          
##                                           
##  Mcnemar's Test P-Value : 0.7624          
##                                           
##             Sensitivity : 0.8115          
##             Specificity : 0.8000          
##          Pos Pred Value : 0.8026          
##          Neg Pred Value : 0.8090          
##              Prevalence : 0.5006          
##          Detection Rate : 0.4062          
##    Detection Prevalence : 0.5061          
##       Balanced Accuracy : 0.8058          
##                                           
##        'Positive' Class : 1               
## 
# Calculate ROC curve and AUC
roc_curve_rf <- roc(testData$V, rf_pred_prob)
## Setting levels: control = 1, case = 0
## Setting direction: controls < cases
auc_value_rf <- auc(roc_curve_rf)
print(paste("AUC for Random Forest:", auc_value_rf))
## [1] "AUC for Random Forest: 0.885826558265583"
auc_30m <- as.data.frame(auc_value_rf )
# Plot ROC curve using ggplot2
roc_data_rf <- data.frame(
  specificity = rev(roc_curve_rf$specificities),
  sensitivity = rev(roc_curve_rf$sensitivities)
)

ggplot(roc_data_rf, aes(x = specificity, y = sensitivity)) +
  geom_line(color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Random Forest_30m", x = "Specificity", y = "Sensitivity") +
  theme_minimal()

# Get variable importance
importance_rf <- importance(rf_model_30m )
importance_df <- data.frame(Variable = rownames(importance_rf), 
                            MeanDecreaseGini = importance_rf[, "MeanDecreaseGini"])

# Plot variable importance using ggplot2
p2 <- ggplot(importance_df, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  labs(title = "Variable Importance Plot_30m", x = "Variable", y = "Mean Decrease in Gini") +
  theme_minimal()
p2

Comparison 1m vs 30m

library(patchwork)

p1 + p2

auc_comparison <- cbind(auc_1m, auc_30m)
colnames(auc_comparison) <- c("auc_1m", "auc_30m")

auc_comparison%>%
  kbl()%>%
  kable_styling()
auc_1m auc_30m
0.9154619 0.8858266

Both models show that the 1m DEM-based variables give better accuracy.

Now we will take a slightly different approach. Previously we used only DEM-based factors; now we will add some non-DEM-based factors:

  1. Land use/Land Cover

  2. Land use/Land Cover Change Index

  3. Imperviousness

  4. Rainfall

  5. Distance from Road (DR)

  6. Geology

  7. Normalized Difference Vegetation Index (NDVI)

  8. Soil

Since we did not include these variables in the previous modeling, we will create two separate data frames that include them.

Data Cleaning and Preparation

# Upload the Shape Files
library(sf)
Landslides_Points <- st_read("Landslides_Points.shp")
## Reading layer `Landslides_Points' from data source 
##   `C:\R\Shape\Shape Files\Data\Landslides_Points.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2257 features and 67 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 349401.4 ymin: 265880.2 xmax: 387724.9 ymax: 293335.3
## Projected CRS: NAD83 / North Carolina
Non_Landslides_Points <- st_read("Non_Landslides_Points.shp")
## Reading layer `Non_Landslides_Points' from data source 
##   `C:\R\Shape\Shape Files\Data\Non_Landslides_Points.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2257 features and 35 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.9147 ymin: 36.11658 xmax: -81.46408 ymax: 36.38499
## Geodetic CRS:  NAD83
# Convert to Data Frames
landslides <- as.data.frame(Landslides_Points)
Non_landslides <- as.data.frame(Non_Landslides_Points)
library(tidyverse)
library(dplyr)
landslides <- landslides %>% select(Slope_12, Elevation, Aspect, Plan, Pr, V, TWI, SPI, TRI, Imprevious , Change, Geology, Rain, Distance_R, NDVI, Soil, Landuse)
Non_landslides <- Non_landslides %>% select(Slope_12, Elevation, Aspect, Plan, Pr, CID, TWI, SPI, TRI, Imprevious, LULC_chang, Geology, Rain, Distance_R, NDVI, Soil, Landuse)

#  Rename columns for consistency
 landslides <- rename(landslides, Slope = Slope_12, PL = Plan, PR = Pr, Imp= Imprevious, Rainfall= Rain, DR= Distance_R )
 Non_landslides <- rename(Non_landslides, Slope = Slope_12, PL = Plan, PR = Pr, V = CID, Imp= Imprevious, Rainfall= Rain, DR= Distance_R, Change=LULC_chang)

##Combine Datasets
 landslides$V <- as.character(landslides$V)
Non_landslides$V <- as.character(Non_landslides$V)
Data <- rbind(landslides, Non_landslides)
Data_m <- Data
## Drop the Missing values
Data_m <- Data_m %>% filter(across(-V, ~ . != -9999))
## Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
## ℹ Please use `if_any()` or `if_all()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#write.csv(Data_m, "Data_All_1m.csv", row.names = FALSE)

#Scale the Variables
# Load necessary library
library(dplyr)

# Function to scale variables to z-scores
scale_to_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Data_m <- Data_m %>%
  mutate(across(.cols = -c(V, Soil, Landuse, Geology, Change), .fns = scale_to_z))
Data_m$Soil <- as.factor(Data_m$Soil )
Data_m$Geology <- as.factor(Data_m$Geology )
Data_m$Landuse <- as.factor(Data_m$Landuse )
Data_m$Change <- as.factor(Data_m$Change )
write.csv(Data_m, "Data_All_1m.csv", row.names = FALSE)

Split Data into Train and Test Sets

# Load necessary libraries
library(caret)
library(dplyr)

# Set a seed for reproducibility
set.seed(123)
Data_m$V <- as.factor(Data_m$V)
# Perform stratified sampling
trainIndex <- createDataPartition(Data_m$V, p = 0.8, list = FALSE)

# Split the data into training and test sets
trainData <- Data_m[trainIndex, ]
testData <- Data_m[-trainIndex, ]

Logistic Regression Model

# Load necessary libraries
library(pROC)
library(ggplot2)

library(gt)
Data_m$V <- as.factor(Data_m$V)
# Train a logistic regression model
logistic_model_1m_all <- glm(V ~ ., data = trainData, family = binomial)

# Summarize the model
summary(logistic_model_1m_all )
## 
## Call:
## glm(formula = V ~ ., family = binomial, data = trainData)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.932e-01  3.371e-01  -1.759 0.078501 .  
## Slope        2.129e+00  8.098e-02  26.292  < 2e-16 ***
## Elevation   -8.920e-02  8.369e-02  -1.066 0.286519    
## Aspect      -3.406e-01  5.262e-02  -6.473 9.60e-11 ***
## PL          -1.137e-01  7.597e-02  -1.497 0.134510    
## PR          -1.504e-01  7.318e-02  -2.055 0.039904 *  
## TWI          3.265e-01  5.636e-02   5.793 6.91e-09 ***
## SPI          8.032e-03  3.486e-02   0.230 0.817786    
## TRI          2.174e-01  6.272e-02   3.466 0.000528 ***
## Imp         -3.075e-01  1.480e-01  -2.078 0.037736 *  
## Change3      4.280e-01  5.332e-01   0.803 0.422228    
## Change8     -2.899e-01  4.121e-01  -0.703 0.481759    
## Change10    -2.911e+00  1.045e+03  -0.003 0.997778    
## Change11     3.253e-01  2.392e-01   1.360 0.173909    
## Geology2    -9.832e-01  2.137e-01  -4.600 4.22e-06 ***
## Geology3     5.798e-01  2.181e-01   2.659 0.007839 ** 
## Geology4    -1.397e+00  3.065e-01  -4.557 5.18e-06 ***
## Geology5    -1.570e+01  4.112e+02  -0.038 0.969552    
## Geology6    -1.884e-01  1.467e-01  -1.284 0.198968    
## Geology7    -2.578e-01  2.012e-01  -1.281 0.200215    
## Rainfall    -2.884e-01  6.605e-02  -4.367 1.26e-05 ***
## DR           8.459e-02  5.625e-02   1.504 0.132640    
## NDVI         2.590e-01  6.456e-02   4.012 6.02e-05 ***
## Soil2       -7.975e-01  1.577e+00  -0.506 0.613104    
## Soil3       -3.432e-01  7.770e-01  -0.442 0.658678    
## Soil4       -1.097e-01  3.020e-01  -0.363 0.716556    
## Soil5       -3.098e+00  1.166e+00  -2.656 0.007913 ** 
## Soil6        4.570e-01  3.011e-01   1.518 0.129079    
## Soil7        6.673e-02  2.999e-01   0.223 0.823903    
## Landuse22    1.239e+00  5.853e-01   2.117 0.034281 *  
## Landuse23    2.139e+00  1.142e+00   1.873 0.061136 .  
## Landuse24    3.612e+00  1.958e+00   1.844 0.065145 .  
## Landuse31   -1.072e+01  6.076e+02  -0.018 0.985928    
## Landuse41    9.666e-01  2.016e-01   4.795 1.63e-06 ***
## Landuse42   -3.524e-01  4.419e-01  -0.798 0.425153    
## Landuse43   -1.202e-01  2.305e-01  -0.522 0.601931    
## Landuse52    2.268e-01  6.032e-01   0.376 0.706971    
## Landuse71   -7.449e-01  7.888e-01  -0.944 0.345002    
## Landuse81   -7.090e-01  2.728e-01  -2.599 0.009362 ** 
## Landuse90   -1.161e+01  1.455e+03  -0.008 0.993633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4968.5  on 3583  degrees of freedom
## Residual deviance: 2551.2  on 3544  degrees of freedom
## AIC: 2631.2
## 
## Number of Fisher Scoring iterations: 14
exp(coef(logistic_model_1m))
## (Intercept)       Slope   Elevation      Aspect          PL          PR 
##   0.9689135   9.1559973   0.7290574   0.6821078   0.8708116   0.8428334 
##         TWI         SPI         TRI 
##   1.3233170   1.0393155   1.1688608

Accuracy Assessment

library(caret)

pred_prob_train <- predict(logistic_model_1m_all, newdata = trainData, type = "response")
pred_class_train <- ifelse(pred_prob_train > 0.5, 1, 0)

# Evaluate the model on training set
confusion_matrix_train <- table(trainData$V, pred_class_train)

##knitr::kable(confusion_matrix_train)
accuracy_train <- sum(diag(confusion_matrix_train)) / sum(confusion_matrix_train)
print(confusion_matrix_train)
##    pred_class_train
##        0    1
##   0 1488  300
##   1  256 1540
 print(paste("Training Accuracy:", accuracy_train))
## [1] "Training Accuracy: 0.844866071428571"
 # Make predictions on the test set
pred_prob_test <- predict(logistic_model_1m_all, newdata = testData, type = "response")
pred_class_test <- ifelse(pred_prob_test > 0.5, 1, 0)

# Evaluate the model on test set
confusion_matrix_test <- table(testData$V, pred_class_test)
accuracy_test <- sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test)
print(confusion_matrix_test)
##    pred_class_test
##       0   1
##   0 375  71
##   1  65 384
print(paste("Test Accuracy:", accuracy_test))
## [1] "Test Accuracy: 0.84804469273743"
accuracy_test_1m_all <- as.data.frame(accuracy_test)

# Calculate ROC curve and AUC
roc_curve <- roc(testData$V, pred_prob_test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
auc_1m <- auc_value 
auc_1m_All <- as.data.frame(auc_value)
print(paste("AUC:", auc_value))
## [1] "AUC: 0.899138094619833"

SRTM (30m)

# Upload the Shape Files
library(sf)
Landslides_Points <- st_read("Landslides_Points.shp")
## Reading layer `Landslides_Points' from data source 
##   `C:\R\Shape\Shape Files\Data\Landslides_Points.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2257 features and 67 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 349401.4 ymin: 265880.2 xmax: 387724.9 ymax: 293335.3
## Projected CRS: NAD83 / North Carolina
Non_Landslides_Points <- st_read("Non_Landslides_Points.shp")
## Reading layer `Non_Landslides_Points' from data source 
##   `C:\R\Shape\Shape Files\Data\Non_Landslides_Points.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2257 features and 35 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.9147 ymin: 36.11658 xmax: -81.46408 ymax: 36.38499
## Geodetic CRS:  NAD83
library(tidyverse)
library(dplyr)
landslides <- as.data.frame(Landslides_Points)
Non_landslides <- as.data.frame(Non_Landslides_Points)
landslides_30 <- landslides %>% select(Slope_1_14, DEM_Projec, Aspect_DEM, Plan_curva, Profile_cu, V, TWI__3, SPI_3, TRI_12, Imprevious , Change, Geology, Rain, Distance_R, NDVI, Soil, Landuse)
Non_landslides_30 <- Non_landslides %>% select(Slope, DEM_Projec, Aspect_DEM, Plan_curva, Profile_cu, CID, TWI__3, SPI_3, TRI_12, Imprevious, LULC_chang, Geology, Rain, Distance_R, NDVI, Soil, Landuse)

# Rename columns for consistency
landslides_30 <- rename(landslides_30, Slope = Slope_1_14, PL = Plan_curva, PR = Profile_cu, Elevation= DEM_Projec, Aspect=Aspect_DEM, TWI= TWI__3, SPI= SPI_3, TRI=TRI_12, Imp= Imprevious, Rainfall= Rain, DR= Distance_R)
Non_landslides_30 <- rename(Non_landslides_30, PL = Plan_curva, PR = Profile_cu, Elevation= DEM_Projec, Aspect=Aspect_DEM, TWI= TWI__3, SPI= SPI_3, TRI=TRI_12, V = CID, Imp= Imprevious, Rainfall= Rain, DR= Distance_R, Change=LULC_chang)


landslides_30$V <- as.character(landslides_30$V)
Non_landslides_30$V <- as.character(Non_landslides_30$V)
Data_30 <- rbind(landslides_30, Non_landslides_30)
Data_m_30_All <- Data_30

Data_m_30_All <- Data_m_30_All %>% filter(across(-V, ~ . != -9999))
## Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
## ℹ Please use `if_any()` or `if_all()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Scale variables to z-scores
scale_to_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Data_m_30_All <- Data_m_30_All %>%
  mutate(across(.cols = -c(V, Soil, Landuse, Geology, Change), .fns = scale_to_z))
Data_m_30_All$V <- as.factor(Data_m_30_All$V)

Data_m_30_All$Soil <- as.factor(Data_m_30_All$Soil )
Data_m_30_All$Geology <- as.factor(Data_m_30_All$Geology )
Data_m_30_All$Landuse <- as.factor(Data_m_30_All$Landuse )
Data_m_30_All$Change <- as.factor(Data_m_30_All$Change )
write.csv(Data_m_30_All, "Data_All_30m.csv", row.names = FALSE)
# Load necessary libraries
library(caret)
library(dplyr)

# Set a seed for reproducibility
set.seed(123)

# Perform stratified sampling
trainIndex <- createDataPartition(Data_m_30_All$V, p = 0.8, list = FALSE)

# Split the data into training and test sets
trainData <- Data_m_30_All[trainIndex, ]
testData <- Data_m_30_All[-trainIndex, ]

Logistic Regression Model (30m)

# Load necessary libraries
library(pROC)
library(ggplot2)

library(gt)

# Train a logistic regression model
logistic_model_30m_All <- glm(V ~ ., data = trainData, family = binomial)

# Summarize the model
summary(logistic_model_30m_All)
## 
## Call:
## glm(formula = V ~ ., family = binomial, data = trainData)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -8.78817 2399.54468  -0.004  0.99708    
## Slope                          1.32015    0.06714  19.663  < 2e-16 ***
## Elevation                     -0.35077    0.07901  -4.439 9.03e-06 ***
## Aspect                        -0.38827    0.04792  -8.102 5.40e-16 ***
## PL                            -0.17453    0.06316  -2.763  0.00572 ** 
## PR                            -0.23243    0.05904  -3.937 8.25e-05 ***
## TWI                           -0.35262    0.08759  -4.026 5.68e-05 ***
## SPI                            0.05814    0.03468   1.676  0.09365 .  
## TRI                            0.43022    0.05471   7.864 3.73e-15 ***
## Imp                           -0.15058    0.12003  -1.255  0.20965    
## Change0.132491109316288       -2.39375 3393.46863  -0.001  0.99944    
## Change0.547077872218507        0.77116    0.46275   1.666  0.09562 .  
## Change2.6200116867296         -0.21569    0.36976  -0.583  0.55968    
## Change3.44918521253404        -1.42871 2118.78452  -0.001  0.99946    
## Change3.86377197543626         0.25914    0.21203   1.222  0.22165    
## Geology-0.572883768189359     -1.59029    0.19637  -8.099 5.56e-16 ***
## Geology-0.141785862467707     -0.16160    0.19010  -0.850  0.39528    
## Geology0.289312043253945      -1.69809    0.30485  -5.570 2.54e-08 ***
## Geology0.720409948975597     -17.63793  797.19173  -0.022  0.98235    
## Geology1.15150785469725       -0.63433    0.13410  -4.730 2.24e-06 ***
## Geology1.5826057604189        -0.14733    0.18648  -0.790  0.42950    
## Rainfall                      -0.31314    0.06118  -5.119 3.08e-07 ***
## DR                             0.15756    0.05118   3.079  0.00208 ** 
## NDVI                           0.18499    0.06035   3.065  0.00218 ** 
## Soil-2.17319987284062         -2.74623    1.73956  -1.579  0.11441    
## Soil-1.54568057601141         -1.12567    0.75167  -1.498  0.13424    
## Soil-0.918161279182197        -0.39696    0.27291  -1.455  0.14580    
## Soil-0.290641982352987       -16.05199  287.60091  -0.056  0.95549    
## Soil0.336877314476222          0.04945    0.27072   0.183  0.85507    
## Soil0.964396611305432         -0.34253    0.27022  -1.268  0.20493    
## Landuse-1.48123028866035       8.88355 2399.54467   0.004  0.99705    
## Landuse-1.41404193684783       9.43539 2399.54473   0.004  0.99686    
## Landuse-1.34685358503532      10.41828 2399.54486   0.004  0.99654    
## Landuse-1.2796652332228       10.61164 2399.54525   0.004  0.99647    
## Landuse-0.809346770535189     -4.42004 2743.17837  -0.002  0.99871    
## Landuse-0.137463252410028      9.73041 2399.54467   0.004  0.99676    
## Landuse-0.0702749005975118     8.75122 2399.54471   0.004  0.99709    
## Landuse-0.00308654878499566    8.83675 2399.54467   0.004  0.99706    
## Landuse0.60160861752765        9.36249 2399.54471   0.004  0.99689    
## Landuse1.87818730196546        7.95211 2399.54480   0.003  0.99736    
## Landuse2.55007082009062        8.44723 2399.54468   0.004  0.99719    
## Landuse3.15476598640326       -2.44852 3393.46863  -0.001  0.99942    
## Landuse3.49070774546584       -2.94774 3393.46869  -0.001  0.99931    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4997.6  on 3604  degrees of freedom
## Residual deviance: 2982.0  on 3562  degrees of freedom
## AIC: 3068
## 
## Number of Fisher Scoring iterations: 15
exp(coef(logistic_model_30m))
## (Intercept)       Slope   Elevation      Aspect          PL          PR 
##   0.9541291   3.8634647   0.5613606   0.6896913   0.7769459   0.7579835 
##         TWI         SPI         TRI 
##   0.6405158   1.1383086   1.7664896

Accuracy Assessment 30 m

library(caret)

pred_prob_train <- predict(logistic_model_30m_All , newdata = trainData, type = "response")
pred_class_train <- ifelse(pred_prob_train > 0.5, 1, 0)

# Evaluate the model on training set
confusion_matrix_train <- table(trainData$V, pred_class_train)

##knitr::kable(confusion_matrix_train)
accuracy_train <- sum(diag(confusion_matrix_train)) / sum(confusion_matrix_train)
print(confusion_matrix_train)
##    pred_class_train
##        0    1
##   0 1451  350
##   1  311 1493
print(paste("Training Accuracy:", accuracy_train))
## [1] "Training Accuracy: 0.816643550624133"
 # Make predictions on the test set
pred_prob_test <- predict(logistic_model_30m_All, newdata = testData, type = "response")
pred_class_test <- ifelse(pred_prob_test > 0.5, 1, 0)

# Evaluate the model on test set
confusion_matrix_test <- table(testData$V, pred_class_test)
accuracy_test <- sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test)
print(confusion_matrix_test)
##    pred_class_test
##       0   1
##   0 366  84
##   1  89 362
print(paste("Test Accuracy:", accuracy_test))
## [1] "Test Accuracy: 0.807991120976693"
accuracy_test_30m_All <- as.data.frame(accuracy_test)

# Calculate ROC curve and AUC
roc_curve <- roc(testData$V, pred_prob_test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_curve)

auc_30m_All <- as.data.frame(auc_value)
print(paste("AUC:", auc_value))
## [1] "AUC: 0.897161862527716"

Random Forest Model

1m DEM With All Data

Data_m <- read.csv("Data_All_1m.csv")
Data_m$V <- as.factor(Data_m$V )
Data_m$Soil <- as.factor(Data_m$Soil )
Data_m$Geology <- as.factor(Data_m$Geology )
Data_m$Landuse <- as.factor(Data_m$Landuse )
Data_m$Change <- as.factor(Data_m$Change )

#Train and Test Datasets
# Load necessary libraries
library(caret)
library(dplyr)

# Set a seed for reproducibility
set.seed(123)
Data_m$V <- as.factor(Data_m$V)

# Perform stratified sampling
trainIndex <- createDataPartition(Data_m$V, p = 0.8, list = FALSE)

# Split the data into training and test sets
trainData <- Data_m[trainIndex, ]
testData <- Data_m[-trainIndex, ]

# Train a Random Forest model
rf_model_1m_All <- randomForest(V ~ ., data = trainData, ntree = 500, mtry = 3, importance = TRUE)

# Summarize the model
print(rf_model_1m_All)
## 
## Call:
##  randomForest(formula = V ~ ., data = trainData, ntree = 500,      mtry = 3, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 12.83%
## Confusion matrix:
##      0    1 class.error
## 0 1550  238   0.1331096
## 1  222 1574   0.1236080
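# Optional sketch (not part of the original workflow): mtry = 3 is close to the
# default floor(sqrt(p)) for classification. tuneRF() can check this choice by
# searching over mtry with the out-of-bag error on the same training data.
# tune_res <- tuneRF(x = trainData[, setdiff(names(trainData), "V")],
#                    y = trainData$V,
#                    mtryStart = 3, ntreeTry = 200, stepFactor = 2,
#                    improve = 0.01, trace = TRUE, plot = FALSE)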
# Make predictions on the test set
rf_pred_prob <- predict(rf_model_1m_All, newdata = testData, type = "prob")[,2]
rf_pred_class <- predict(rf_model_1m_All, newdata = testData, type = "response")

# Evaluate the model
confusion_matrix_rf <- confusionMatrix(as.factor(rf_pred_class), as.factor(testData$V))
print(confusion_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 393  54
##          1  53 395
##                                          
##                Accuracy : 0.8804         
##                  95% CI : (0.8574, 0.901)
##     No Information Rate : 0.5017         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7609         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.8812         
##             Specificity : 0.8797         
##          Pos Pred Value : 0.8792         
##          Neg Pred Value : 0.8817         
##              Prevalence : 0.4983         
##          Detection Rate : 0.4391         
##    Detection Prevalence : 0.4994         
##       Balanced Accuracy : 0.8804         
##                                          
##        'Positive' Class : 0              
## 
# Calculate ROC curve and AUC
roc_curve_rf <- roc(testData$V, rf_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value_rf <- auc(roc_curve_rf)
print(paste("AUC for Random Forest:", auc_value_rf))
## [1] "AUC for Random Forest: 0.945938657904461"
auc_1m_All <- as.data.frame(auc_value_rf )
# Plot ROC curve using ggplot2
roc_data_rf <- data.frame(
  specificity = rev(roc_curve_rf$specificities),
  sensitivity = rev(roc_curve_rf$sensitivities)
)

ggplot(roc_data_rf, aes(x = specificity, y = sensitivity)) +
  geom_line(color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Random Forest", x = "Specificity", y = "Sensitivity") +
  theme_minimal()

# Get variable importance
importance_rf <- importance(rf_model_1m_All )
importance_df <- data.frame(Variable = rownames(importance_rf), 
                            MeanDecreaseGini = importance_rf[, "MeanDecreaseGini"])

# Plot variable importance using ggplot2
p3 <- ggplot(importance_df, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  labs(title = "Variable Importance Plot_1m", x = "Variable", y = "Mean Decrease in Gini") +
  theme_minimal()
p3
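
Besides the Gini measure, the importance() matrix also stores MeanDecreaseAccuracy when the model is trained with importance = TRUE. A quick sketch (not original output) for comparing the two measures side by side:

# Sketch: compare both importance measures for the 1 m random forest model
imp_both <- as.data.frame(importance(rf_model_1m_All))
imp_both$Variable <- rownames(imp_both)
imp_both[order(-imp_both$MeanDecreaseAccuracy),
         c("Variable", "MeanDecreaseAccuracy", "MeanDecreaseGini")]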

30 m DEM With All Data

# Read the 30 m dataset (all factors) and convert categorical variables to factors
Data_m_30 <- read.csv("Data_All_30m.csv")

Data_m_30$V <- as.factor(Data_m_30$V)
Data_m_30$Soil <- as.factor(Data_m_30$Soil)
Data_m_30$Geology <- as.factor(Data_m_30$Geology)
Data_m_30$Landuse <- as.factor(Data_m_30$Landuse)
Data_m_30$Change <- as.factor(Data_m_30$Change)


# Load necessary libraries
library(caret)
library(dplyr)

# Set a seed for reproducibility
set.seed(123)

# Perform stratified sampling
trainIndex <- createDataPartition(Data_m_30$V, p = 0.8, list = FALSE)

# Split the data into training and test sets
trainData <- Data_m_30[trainIndex, ]
testData <- Data_m_30[-trainIndex, ]

# Train a Random Forest model
rf_model_30m_All <- randomForest(V ~ ., data = trainData, ntree = 500, mtry = 3, importance = TRUE)

# Summarize the model
print(rf_model_30m_All )
## 
## Call:
##  randomForest(formula = V ~ ., data = trainData, ntree = 500,      mtry = 3, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 14.51%
## Confusion matrix:
##      0    1 class.error
## 0 1529  272   0.1510272
## 1  251 1553   0.1391353
# Make predictions on the test set
rf_pred_prob <- predict(rf_model_30m_All , newdata = testData, type = "prob")[,2]
rf_pred_class <- predict(rf_model_30m_All , newdata = testData, type = "response")

# Evaluate the model
confusion_matrix_rf <- confusionMatrix(as.factor(rf_pred_class), as.factor(testData$V))
print(confusion_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 393  59
##          1  57 392
##                                           
##                Accuracy : 0.8713          
##                  95% CI : (0.8476, 0.8924)
##     No Information Rate : 0.5006          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7425          
##                                           
##  Mcnemar's Test P-Value : 0.926           
##                                           
##             Sensitivity : 0.8733          
##             Specificity : 0.8692          
##          Pos Pred Value : 0.8695          
##          Neg Pred Value : 0.8731          
##              Prevalence : 0.4994          
##          Detection Rate : 0.4362          
##    Detection Prevalence : 0.5017          
##       Balanced Accuracy : 0.8713          
##                                           
##        'Positive' Class : 0               
## 
# Calculate ROC curve and AUC
roc_curve_rf <- roc(testData$V, rf_pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value_rf <- auc(roc_curve_rf)
print(paste("AUC for Random Forest:", auc_value_rf))
## [1] "AUC for Random Forest: 0.943254496181325"
auc_30m_All <- as.data.frame(auc_value_rf )
# Plot ROC curve using ggplot2
roc_data_rf <- data.frame(
  specificity = rev(roc_curve_rf$specificities),
  sensitivity = rev(roc_curve_rf$sensitivities)
)

ggplot(roc_data_rf, aes(x = specificity, y = sensitivity)) +
  geom_line(color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Random Forest_30m", x = "Specificity", y = "Sensitivity") +
  theme_minimal()

# Get variable importance
importance_rf <- importance(rf_model_30m_All )
importance_df <- data.frame(Variable = rownames(importance_rf), 
                            MeanDecreaseGini = importance_rf[, "MeanDecreaseGini"])

# Plot variable importance using ggplot2
p4 <- ggplot(importance_df, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  labs(title = "Variable Importance Plot_30m", x = "Variable", y = "Mean Decrease in Gini") +
  theme_minimal()
p4

library(patchwork)

# Arrange plots p1-p4 in a single panel using patchwork
p1 + p2 + p3 + p4
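
If the combined panel needs to be exported, ggsave() also works on a patchwork object; the file name and dimensions below are only illustrative:

# Sketch: save the combined panel to disk (file name is illustrative)
combined_panel <- p1 + p2 + p3 + p4
ggsave("importance_panels.png", combined_panel, width = 10, height = 8, dpi = 300)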

Let's compare all models.

The random forest model clearly achieves higher accuracy than the logistic regression model. When the DEM-based factors are complemented with non-DEM-based factors, the accuracy increases for both models. We used 8 DEM-based factors and 8 non-DEM-based factors. In all models, the factors derived from the 1 m DEM give the highest accuracy. One interesting finding is that complementing the SRTM-based (30 m) factors with non-DEM-based factors improved the accuracy by 6.1%.
Model                     Accuracy   Area Under the Curve (AUC)
LR, 1 m DEM only          0.845      0.899
LR, 1 m all factors       0.833      0.915
LR, 30 m DEM only         0.783      0.857
LR, 30 m all factors      0.808      0.897
RF, 1 m DEM only          0.834      0.915
RF, 1 m all factors       0.880      0.946
RF, 30 m DEM only         0.806      0.888
RF, 30 m all factors      0.877      0.943
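
The comparison table above appears to have been assembled manually from the individual model runs. As a sketch, the same summary could be built programmatically; the values below are simply copied from the table, and in practice they would come from the accuracy and AUC objects stored after each model fit:

# Sketch: collect the reported accuracy and AUC values into one comparison table
model_comparison <- data.frame(
  Model    = c("LR, 1 m DEM only", "LR, 1 m all factors",
               "LR, 30 m DEM only", "LR, 30 m all factors",
               "RF, 1 m DEM only", "RF, 1 m all factors",
               "RF, 30 m DEM only", "RF, 30 m all factors"),
  Accuracy = c(0.845, 0.833, 0.783, 0.808, 0.834, 0.880, 0.806, 0.877),
  AUC      = c(0.899, 0.915, 0.857, 0.897, 0.915, 0.946, 0.888, 0.943)
)
knitr::kable(model_comparison, digits = 3)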