Ryan Nicholas Project 3

Landslides

Intro to Data set

For my data set, I chose NASA's Global Landslide Catalog. This data set has information from 2007 to 2018. It includes many types of landslides, from big to small, and the data was collected from all kinds of media, from news reports to scientific studies. The data sources are listed in the “Source_Name” variable, along with a link to each source.

This data set has 31 variables. I'm not going to use all of them, but I will explain them. As mentioned in the first paragraph, the first two are the source's name and the link to the source from which the landslide information was pulled. The next four variables cover the landslide event's time, place, and name. The location description explains where the landslide happened, and location accuracy shows how accurately the landslide is plotted. Landslide category describes what type of landslide it is, landslide trigger describes what triggered it, and landslide size indicates how big it was. The landslide setting shows where the landslide took place, such as above a road or on an unknown slope. Fatality count shows the number of deaths caused by the landslide, and injury count shows the number of injuries. Storm name gives the name of the storm (if there was one) that triggered the landslide. The photo link points to photos related to the landslide, and notes holds any notes from the data collector. The next two variables, which come before country name and code, are less important; they are mostly about the source of the data. The country name and code show the country in which the landslide happened. Admin division name shows the region where the landslide happened, and admin division population shows the population of that region (not much information is given on these variables, so this is more of a guess). Gazeteer closest point shows where the landslide was observed from, and gazeteer distance shows how far from the landslide the observation was made. The next three variables record when the entry was submitted, when it was created, and, if it was edited, when that happened. Finally, the last two are longitude and latitude, which give the coordinates of the landslide.

I didn’t do much cleaning besides removing NAs from the data I was using; further explanation can be found in the data filtering sections of each plot.

Background Research And Personal Interest on Data set

I chose this data set because I love working with natural event data sets. For my last project, I did volcanoes, and for this project, I’m doing landslides. I’m less knowledgeable about landslides than volcanoes, so I’m really excited to dig into this data set. However, I did do some background research, which I will explain next.

According to the BGS, “A landslide is a mass movement of material, such as rock, earth, or debris, down a slope.” These can also be very dangerous in some cases. Many things can trigger a landslide, from natural erosion to human intervention with construction.

I found an interesting piece of information on another website: “Globally, the highest numbers of fatalities from landslides occur in the mountains of Asia and Central and South America, as well as on steep islands in the Caribbean and Southeast Asia.” I got this information from USC Dornsife. I want to explore this myself in my final plot, seeing if it’s true that the most dangerous landslides come from these regions.

library(tidyverse)

Warning: package 'readr' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(highcharter)

Warning: package 'highcharter' was built under R version 4.3.3

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

landslides <- read.csv("Global_Landslide_Catalog_Export_20240504.csv")

Linear Regression Data Filtering

For my linear regression, I want to see if there is a correlation between the number of injuries and the number of deaths. In order to do that, I need to filter out the NAs from both variables.

LandslideDandENONA <- landslides |>
  filter(!is.na(injury_count) & !is.na(fatality_count) & fatality_count <= 100 & injury_count <= 50) #filters out the NAs from both injury_count and fatality_count, and filters out values greater than 100 for fatality count and values greater than 50 for injury count.

To reduce spread, I limit the fatality count to a maximum of 100 and the injury count to a maximum of 50. This narrows the range and keeps the graph readable.
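As a rough illustration of what this filter keeps and drops, the sketch below applies the same rules as the filter code (no NAs, at most 100 fatalities, at most 50 injuries) to a small made-up data frame. The values are invented for the example and are not from the catalog, and it uses base R so it runs without loading any packages.

```r
# Toy data: invented values, not taken from the NASA catalog.
toy <- data.frame(
  injury_count   = c(0, 2, NA, 60, 5, 1),
  fatality_count = c(1, 0, 3, 4, 250, NA)
)

# Keep rows where both counts are present and within the caps.
keep <- !is.na(toy$injury_count) & !is.na(toy$fatality_count) &
  toy$fatality_count <= 100 & toy$injury_count <= 50

toy_filtered <- toy[keep, ]
rows_dropped <- nrow(toy) - nrow(toy_filtered)

print(toy_filtered)
print(rows_dropped)
```

Running the same row count on the real data before and after filtering would show how many landslides the caps exclude from the regression.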

Linear Regression And Analysis

#making the linear reg graph
ggplot(LandslideDandENONA, aes(x = injury_count, y = fatality_count)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Injury Count VS Fatality Count For Landslides",
       caption = "Source: NASA") +
  xlab("Injury Count") +
  ylab("Fatality Count") +
  theme_minimal(base_size = 12)

I limited the counts this much because I felt the full range made the graph far too wide.

Analysis

cor(LandslideDandENONA$injury_count, LandslideDandENONA$fatality_count)

[1] 0.2979571

There is only a weak positive correlation here.

fit1 <- lm(fatality_count ~ injury_count, data=LandslideDandENONA)
summary(fit1)


Call:
lm(formula = fatality_count ~ injury_count, data = LandslideDandENONA)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.687  -0.745  -0.745  -0.745  99.255 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.74495    0.06341   11.75   <2e-16 ***
injury_count  0.55427    0.02432   22.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.55 on 5330 degrees of freedom
Multiple R-squared:  0.08878,   Adjusted R-squared:  0.08861 
F-statistic: 519.3 on 1 and 5330 DF,  p-value: < 2.2e-16

My correlation for this relationship is 0.298, which is a weak positive correlation: landslides with more injuries tend to have somewhat more deaths, but the relationship is far from strong.

However, the equation for this regression would be,

fatality_count = 0.554 (injury_count) + 0.745

This equation means that for each additional injury, the predicted fatality count increases by about 0.554, and the intercept of 0.745 is the predicted fatality count when the injury count is zero.
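To make the equation concrete, this small sketch plugs a few injury counts into it. The coefficients are copied from the summary(fit1) output; the function name predict_fatalities is just a hypothetical helper for this example.

```r
# Coefficients copied from the summary(fit1) output.
intercept <- 0.74495
slope     <- 0.55427

# Hypothetical helper: predicted fatality count for a given injury count.
predict_fatalities <- function(injuries) {
  intercept + slope * injuries
}

# Predictions for 0, 1, and 10 injuries.
predictions <- predict_fatalities(c(0, 1, 10))
print(predictions)
```

With the fitted model in memory, predict(fit1, newdata = data.frame(injury_count = c(0, 1, 10))) should give the same values, up to rounding of the printed coefficients.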

The adjusted R-squared value shows that about 8.9% of the variation in fatality count is explained by the model, which means that about 91% of the variation is not explained by injury count.
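As a quick consistency check, in a regression with a single predictor the multiple R-squared is the square of the correlation coefficient. The sketch below squares the value reported by cor() and compares it with the R-squared in the model summary; both numbers are copied from the output in this report.

```r
# Correlation reported by cor() in the Analysis section.
r <- 0.2979571

# For a one-predictor linear model, R-squared equals r squared.
r_squared   <- r^2
unexplained <- 1 - r_squared

print(round(r_squared, 5))    # compare with "Multiple R-squared" in summary(fit1)
print(round(unexplained, 3))  # share of variation the model leaves unexplained
```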

Latitude Position’s Effect On Landslides (Plot 1)

Data Filtering For Plot 1

For plot one, I want to put every landslide into a latitude band that reflects its distance from the equator. I will do this by making one categorical variable with five values: two for positive latitudes (up to 45 and above 45), two for negative latitudes with the same structure, and one for anything between -1 and 1.

EquatorLandslides <- landslides |>
  mutate(EquatorDistance = case_when(
    is.na(latitude) ~ NA_character_,
    latitude < -45 ~ "-46 to -90",
    latitude <= -1 ~ "-1 to -45",
    latitude < 1 ~ "-0.99 to 0.99",
    latitude <= 45 ~ "1 to 45",
    TRUE ~ "46 to 90"
  )) #This is me using case_when and mutate to add a new variable that assigns a latitude range to every data point; it will be used to help create my graph. The conditions are checked in order, so every latitude falls into exactly one range with no gaps between them.
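Another way to build this kind of binned variable is base R's cut(), which takes the break points directly. The sketch below is only an illustration on made-up latitudes; the labels are my own, and the boundary handling differs slightly from the code above because each interval here is closed on the right (for example, a latitude of exactly 1 lands in the middle band).

```r
# Made-up latitudes for illustration only.
lat <- c(-60, -45.5, -10, 0.5, 30, 45.5, 80)

# Breaks at -90, -45, -1, 1, 45, 90. Each interval is closed on the right,
# and include.lowest = TRUE keeps -90 itself in the first band.
lat_band <- cut(
  lat,
  breaks = c(-90, -45, -1, 1, 45, 90),
  labels = c("-90 to -45", "-45 to -1", "-1 to 1", "1 to 45", "45 to 90"),
  include.lowest = TRUE
)

band_counts <- table(lat_band)
print(band_counts)
```

Because cut() returns a factor with the bands in order, a table of the result also lists empty bands with a count of 0.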

Next, I filter out the non-applicable landslide sizes. First, let's look at the possible size values.

Uniquesizes <- unique(EquatorLandslides$landslide_size)
print(Uniquesizes)

[1] "large"        "small"        "medium"       "unknown"      "very_large"  
[6] ""             "catastrophic"

#Here I look through to see what sizes landslides come in

I want to remove “unknown” and “” sized landslides as they do not help my data.

EquatorLandslides <-EquatorLandslides |>
  filter(landslide_size != "" & landslide_size != "unknown") #Here I filter out the blank size and the unknown size, as neither will be helpful for my visualization

Plot 1

size_levels <- c("small", "medium", "large", "very_large", "catastrophic") #The sizes in the same order as the x-axis categories

#Counts the landslides of each size within one latitude range. Using a factor with fixed levels keeps the counts in x-axis order and gives 0 for sizes that never occur in that range.
count_by_size <- function(range_label) {
  sizes <- EquatorLandslides |>
    filter(EquatorDistance == range_label) |>
    pull(landslide_size)
  as.numeric(table(factor(sizes, levels = size_levels)))
}

highchart1 <- highchart() |>
  hc_chart(type = 'column') |>
  hc_xAxis(categories = c("Small","Medium","Large","Very Large","Catastrophic"), title = list(text = 'Size Of Landslide')) |>
  #I add a series for each individual latitude range, each holding the count of how many landslides happened at that range and at each size.
  hc_add_series(name = "-90 to -46", data = count_by_size("-46 to -90"), color = "#782624") |>
  hc_add_series(name = "-45 to -1", data = count_by_size("-1 to -45"), color = "#DC4D01") |>
  hc_add_series(name = "-0.99 to 0.99", data = count_by_size("-0.99 to 0.99"), color = "#F0E130") |>
  hc_add_series(name = "1 to 45", data = count_by_size("1 to 45"), color = "#3eb489") |>
  hc_add_series(name = "46 to 90", data = count_by_size("46 to 90"), color = "#000080") |>
  hc_yAxis(title = list(text = "Number Of Landslides")) |>
  hc_legend(title = list(text = "Latitude Range"), layout = "vertical", align = "right", borderWidth = 1) |>
  hc_title(text = "How Latitude Position Affects Landslides") |>
  hc_caption(text = "source: NASA") |>
  hc_add_theme(hc_theme_ft())
  

highchart1

Explanation: As you can see, I made a bar chart with Highcharter for my first visualization. This bar chart aims to see if latitude affects landslides, and from this, I can see a pattern that suggests it might. If you play around with the legend a bit, you can see a lot of landslides in the 1 to 45 latitude range. This range has the most landslides by far. Also, the ranges north of the equator have more landslides than the ranges south of it, which I found interesting.

I would have liked to figure out how to eliminate the -1 and -2 on the X-axis when I zoom in on -90 to -46. But overall, I am happy with this visualization.

How many landslides are in different landslide categories (Plot 2)

Data Filtering and Creation of Plot 2

To begin, I created a calculated field in Tableau to get the total number of landslides for each category. I used the formula ‘Count([LANDSLIDE CATEGORY])’ in the calculated field, which gives the count of landslides in each category.

I also filtered out the null and blank landslide categories as they were unimportant to the visualization.
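For comparison, the same count-per-category step could be sketched in base R as below. This is only an illustration of the idea, not how the Tableau plot was built; the category values are made up for the example.

```r
# Made-up categories for illustration only, including a blank and an NA.
toy_categories <- c("landslide", "mudslide", "landslide", "rock_fall", "", NA, "landslide")

# Drop missing and blank categories, then count each remaining category.
valid <- toy_categories[!is.na(toy_categories) & toy_categories != ""]
category_counts <- sort(table(valid), decreasing = TRUE)

print(category_counts)
```

These counts are what would determine the size of each box in a treemap.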

Plot 2

https://public.tableau.com/views/TreemapLandslides/Sheet1?:language=en-US&publish=yes&:sid=&:display_count=n&:origin=viz_share_link

Explanation:

For this one, I wanted to do a treemap. I haven’t messed with treemaps much, so it would be an excellent opportunity to do so. Each color represents one of the different landslide categories for this treemap, and the size depends on the number of landslides in that category.

The conclusion I got from this is that normal landslides happen more often than the rest. If you hover over a box, you can also see a tooltip that shows the number of fatalities and injuries caused by that type. This treemap also showed me that normal landslides were the deadliest, which I suspect is because they happen the most.

I wanted to add columns for each landslide size, but I didn’t because I felt it would have made the plot too similar to my first plot.

Landslide Danger Across Countries (Plot 3)

Data Filtering Of Plot 3

For this one, I wanted to make a map. I struggled with creating a map for project two, so I want to make this one my “redemption” in a way.

I also used Tableau for this one as it would be much easier to use than Leaflet.

The first thing I’d like to do for this is make a new variable.

So, for this, I want to create a danger level value, which will be done by again creating a calculated field equation. This calculated field equation will be…

“AVG([Fatality Count] * 0.70) + AVG([Injury Count] * 0.30)”

This represents that the average death counts will be weighted by 70%, and average injury counts will be weighted by 30%. Overall, this should give me a range on how dangerous the landslides are in that region.

I then filtered out any country with fewer than 25 landslides, because averages based on only a few events can be massively inflated by a single severe landslide.

I also wanted to make the average fatality and the average injury count as separate variables, so I used the same method as before.
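The calculated fields above can be mirrored in R to sanity-check the idea. The sketch below is a minimal version on made-up records (the countries and counts are invented, and the toy data has no missing values): it computes the average fatality count, average injury count, and the 70/30 weighted danger level per country, then drops countries with too few events. The threshold name min_events is hypothetical; the report uses 25, while the toy example uses 2.

```r
# Made-up records for illustration only; not taken from the catalog.
toy <- data.frame(
  country_name   = c("A", "A", "A", "B", "B", "C"),
  fatality_count = c(10, 0, 2, 1, 1, 50),
  injury_count   = c(0, 4, 2, 0, 2, 10)
)

# Average fatalities and injuries per country.
by_country <- aggregate(cbind(fatality_count, injury_count) ~ country_name,
                        data = toy, FUN = mean)
names(by_country) <- c("country_name", "avg_fatality", "avg_injury")

# Number of landslide events per country.
event_counts <- table(toy$country_name)
by_country$n_events <- as.integer(event_counts[as.character(by_country$country_name)])

# Same weighting as the calculated field: 70% fatalities, 30% injuries.
by_country$danger_level <- 0.70 * by_country$avg_fatality + 0.30 * by_country$avg_injury

# Hypothetical threshold: the report uses 25; the toy data uses 2.
min_events <- 2
danger_by_country <- by_country[by_country$n_events >= min_events, ]
print(danger_by_country)
```

Country C shows why the filter matters: its single severe event gives it a danger level of 38, far above the others, even though it is based on only one landslide.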

Plot 3

https://public.tableau.com/app/profile/ryan.nicholas8068/viz/LandslideMap/Sheet4

Explanation:

For this one, as I explained earlier, I wanted to plot my data on a world map. I first tried displaying individual landslides, but that didn’t work well and looked ugly and messy. So, I grouped the data by country. I tried to think of a way to display both the average fatalities and the average injuries caused by landslides in a country, so I decided to make a value that combined the two. This value represents the level of danger a country faces on average from landslides. Countries with a darker shade of red have, on average, more dangerous landslides than countries with a lighter shade of red. I wish the data set had its own “Danger Level” variable, but I also liked the challenge of making my own.

I think it’s really interesting that Asia and South America have, on average, more dangerous landslides than the rest of the world, which is consistent with my background research. I wish this data set had more records, because I would have loved to include more countries. Unfortunately, many countries only had 1-10 landslides, which would have thrown off the averages.

If I were to change anything, it’d be finding a new data set I can interlink with this one, maybe one about regions with a lot of mountains. I also wish I could have figured out how to put a tooltip that showed the most common landslide type for that country.

Bibliography

Data Source from: https://data.nasa.gov/Earth-Science/Global-Landslide-Catalog-Export/dd9e-wu2v/about_data

“Global toll from landslides is heaviest in developing countries” by Joshua West, https://dornsife.usc.edu/news/stories/global-toll-from-landslides-is-heaviest-in-developing-countries/#:~:text=Thousands%20killed%20in%20single%20events&text=Globally%2C%20the%20highest%20numbers%20of,the%20Caribbean%20and%20Southeast%20Asia.

“Understanding landslides” by BGS https://www.bgs.ac.uk/discovering-geology/earth-hazards/landslides/#:~:text=A%20landslide%20is%20a%20mass,fail%20and%20a%20landslide%20occurs.