A. The source and topic of the data, any variables included, what kind of variables they are, and how you cleaned the dataset up

I found roller_coasters.csv in our class-shared Google Drive provided by Professor Saidi. One of the reasons I selected this data set is its variety of variables. It includes over 50 columns of character, integer, and double (numeric) data detailing cost, manufacturers and designers, opening dates, material types, and much more that I thought was interesting.

Looking at the data set, it was obvious that a lot of information was blank or incomplete and had to be cleaned before I could create any visualizations. I started by converting every blank cell to NA with mutate_all; otherwise, filtering on missing values would give flawed results, because is.na() does not treat an empty string as missing. Next, I removed rows with NA in each column I would be using for my visualizations, using filter(!is.na(column_name)) with the unquoted column name. Since the data includes roller coasters that have been closed or removed (destroyed), I then filtered for only operating roller coasters with filter(Status %in% c(“Operating”)). I thought this would be more relevant, and including closed and removed coasters would make the scatter plot too cluttered. Then, because the units for speed and height are mixed between km/h and mph and between ft and m, I kept the units consistent by filtering the speed1_unit column by c(“mph”) and the height_unit column by c(“ft”).
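As a minimal sketch of these cleaning steps, the example below runs them on a small made-up data frame (the rows are hypothetical, not taken from roller_coasters.csv). It uses across(everything(), ...), which current dplyr recommends in place of the superseded mutate_all, and refers to columns by their unquoted names inside filter().

```r
library(dplyr)

# Hypothetical rows for illustration only
toy <- data.frame(
  coaster_name = c("A", "B", "C", "D"),
  Status       = c("Operating", "Removed", "Operating", "Operating"),
  speed1_value = c(50, 60, 80, NA),
  speed1_unit  = c("mph", "mph", "km/h", ""),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(across(everything(), ~ ifelse(.x == "", NA, .x))) %>% # blank cells become NA
  filter(!is.na(speed1_value), !is.na(speed1_unit)) %>%        # drop rows with missing values
  filter(Status == "Operating", speed1_unit == "mph")          # operating coasters measured in mph

toy_clean
```

Only coaster "A" should remain: "B" is removed, "C" is measured in km/h, and "D" has a missing speed and a blank unit.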

Finally, when I went to plot, my x-axis had dozens of values because the Length column was a character type that included “ft” after the number, like “100 ft.” I needed to create a column holding only the number, “100,” with the “ft” removed, and convert it to a numeric type. This was difficult because simply running mutate(length_value = as.numeric(Length)) leads to the warning “NAs introduced by coercion” and no values in the length_value column. To fix this and get the correct outcome, I added a str_extract step and a str_replace_all step inside mutate beforehand to pull out the number and remove commas, “ft,” and anything else that would cause errors when converting with as.numeric.
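The conversion problem and the fix can be illustrated on a few made-up strings (hypothetical values, not rows from the data set). This is only a sketch of the approach: the pattern keeps the first number it finds and allows for a single thousands separator, so a value such as "12,345,678 ft" would need a broader pattern.

```r
library(stringr)

# Hypothetical sample values for illustration only
length_raw <- c("100 ft", "1,234.5 ft", "5,312 ft (1,619 m)", NA)

# Direct conversion fails: the unit suffix and commas are not numeric
direct <- suppressWarnings(as.numeric(length_raw))

# Step 1: extract the first number, including an optional comma group and decimals
extracted <- str_extract(length_raw, "\\d+(,\\d+)?\\.?\\d*")

# Step 2: remove the thousands separators
no_commas <- str_replace_all(extracted, ",", "")

# Step 3: convert the cleaned strings to numeric
length_value <- as.numeric(no_commas)

direct
length_value
```

Here every element of direct is NA, while length_value holds 100, 1234.5, 5312, and NA for the missing entry.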

Create a Scatterplot to Visualize Operating Roller Coasters

library(tidyverse)
library(plotly)
library(treemap)
library(viridis)
library(highcharter)

Set working directory and load the dataset

setwd("C:/Users/jakea/OneDrive/Desktop/MC 2022/DATA-110/Database Data_110")
coasters <- read.csv("roller_coasters.csv") # load dataset

Use mutate_all to add NA where cells are blank, then remove NAs from the necessary columns. Additionally, filter for operating coasters and for data in ft and mph

coasters_clean <- coasters %>%
  mutate_all(~ifelse(.x == "", NA, .x)) %>% # Add NA value to blanks
  filter(!is.na(Length) & !is.na(speed1_value) & !is.na(height_value) & !is.na(Type_Main) & !is.na(Location) & !is.na(year_introduced) & !is.na(Manufacturer) & !is.na(Model) & !is.na(opening_date_clean) & !is.na(Cost) & !is.na(Duration) & !is.na(Designer)) %>% # Removes the NA values from necessary columns
  filter(Status %in% c("Operating")) %>% # only operating coasters
  filter(speed1_unit %in% c("mph")) %>% # only uses speed data tracked in mph
  filter(height_unit %in% c("ft")) # only uses height data tracked in ft

Create a new column that removes the “ft” in Length and changes the value from a character to a numeric

coasters_clean <- coasters_clean %>% 
  mutate(length_value = str_extract(Length, "\\d+(,\\d+)?\\.?\\d*")) %>% # Extracts the length values from the Length column using a regular expression pattern and assign results to new column length_value
  mutate(length_value = str_replace_all(length_value, ",", "")) %>% # Removes commas before converting data type
  mutate(length_value = as.numeric(length_value)) # Converts the length_value column to numeric data type

Create the scatterplot outline

coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
  labs(title = "Operating Roller Coasters") +
  xlab("Length (ft)") +
  ylab("Height (ft)") 
coaster_plot

Add geom_point, setting size = speed1_value and color = Type_Main

coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
  geom_point(aes(size = speed1_value, color = Type_Main)) +
  labs(title = "Operating Roller Coasters")+
  xlab("Length (ft)") +
  ylab("Height (ft)") 
coaster_plot 

Customize the Scatterplot

coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
  geom_point(aes(size = speed1_value, color = Type_Main)) +
  scale_color_manual(values = c("Wood" = "#964B00", "Steel" = "#808080", "Other" = "#1F77B4")) + # Manually set colors to match material
  labs(title = "Operating Roller Coasters")+
  xlab("Length (ft)") +
  ylab("Height (ft)") +
  guides(size = guide_legend(title = "Speed (mph)"), color = guide_legend(title = "Roller Coaster Type")) + # Customize the legend titles
  ylim(0, 500) +
  theme_bw() + # Change the theme 
  theme(plot.title = element_text(hjust = 0.5)) # Center the title
coaster_plot 

Add Interactivity

coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
  geom_point(aes(size = speed1_value, color = Type_Main)) +
  scale_color_manual(values = c("Wood" = "#964B00", "Steel" = "#808080", "Other" = "#1F77B4"),name = "Roller Coaster Type &") + # Manually set colors to match material and customize the legend title
  scale_size_continuous(name  = "Speed (mph)") + # Customize the legend title
  labs(title = "Operating Roller Coasters", 
       x = "Length (ft)", 
       y = "Height (ft)") +
  ylim(0, 500) +
  theme_bw() + # Change the theme 
  theme(plot.title = element_text(hjust = 0.5)) # Center the title
coaster_plotly <- ggplotly(coaster_plot)
coaster_plotly 

It is hard to see the dots behind one another, so make the points transparent and add information to the tooltips

coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
  geom_point(aes(size = speed1_value, color = Type_Main, label = coaster_name, label2 = Status, label3 = Speed, label4 = Height, label5 = Drop, label6 = Length), alpha = 0.5) +
  scale_color_manual(values = c("Wood" = "#964B00", "Steel" = "#808080", "Other" = "#1F77B4"),name = "Roller Coaster Type &") + # Manually set colors to match material
  scale_size_continuous(name  = "Speed (mph)") +
  labs(title = "Operating Roller Coasters", 
       x = "Length (ft)", 
       y = "Height (ft)") +
  ylim(0, 500) +
  theme_bw() + # Change the theme 
  theme(plot.title = element_text(hjust = 0.5)) # Center the title
coaster_plotly <- ggplotly(coaster_plot)
coaster_plotly

Create a 3D scatter plot

 # Create a 3D scatter plot
coaster_plot_ly <- plot_ly(
  coasters_clean, x = ~speed1_value, y = ~length_value, z = ~height_value, 
  color = ~Type_Main, colors = c("Wood" = "#964B00", "Steel" = "#808080", "Other" = "#1F77B4"),
  type = "scatter3d", mode = "markers", marker = list(size = 5), alpha = 0.75, text = ~paste("Name: ", coaster_name, "<br>", "Type: ", Type_Main, "<br>", "Manufacturer: ", Manufacturer, "<br>", "Location: ", Location, "<br>",  "Opened: ", Opening.date, "<br>", "Length: ", length_value, "ft", "<br>", "Speed: ", speed1_value, "mph", "<br>", "Height: ", height_value, "ft")) %>% # "<br>" creates spacing between each label in the tool tip, so it is not one long line
  layout(title = "Operating Roller Coasters", 
         scene = list(xaxis = list(title = "Speed (mph)", range = c(150, 0)),
          yaxis = list(title = "Length (ft)", range = c(0, 8000)),
          zaxis = list(title = "Height (ft)", range = c(0, 500))))
coaster_plot_ly

Explore creating a scatter plot with highcharter

coaster_hc <- highchart() %>%
  hc_chart(type = "scatter") %>%
  hc_add_series(coasters_clean[coasters_clean$Type_Main == "Steel",], "scatter", 
                hcaes(x = length_value, y = height_value, size = speed1_value), name = "Steel", color = "#808080") %>%
  hc_add_series(coasters_clean[coasters_clean$Type_Main == "Wood",], "scatter", 
                hcaes(x = length_value, y = height_value, size = speed1_value), name = "Wooden", color ="#964B00") %>%
  hc_add_series(coasters_clean[coasters_clean$Type_Main == "Other",], "scatter", 
                hcaes(x = length_value, y = height_value, size = speed1_value), name = "Other", color = "#1F77B4") %>%
  hc_title(text = "Tallest and Longest Operating Roller Coasters Around the World") %>%
  hc_subtitle(text = "Source: roller_coasters.csv") %>%
  hc_xAxis(title = list(text = "Length (ft)")) %>%
  hc_yAxis(title = list(text = "Height (ft)"),
           max = 600) %>% # Bigger circle so had to increase range
  hc_tooltip(pointFormat = "Coaster: {point.coaster_name}, Location: {point.Location}, Manufacturer: {point.Manufacturer}, Designer: {point.Designer}, Opening Date: {point.opening_date_clean}, Model: {point.Model}, Type: {series.name}, Cost: {point.Cost}, Length: {point.x}ft, Height: {point.y}ft, Speed: {point.speed1_value}mph, Duration: {point.Duration}min") %>% # Customize tooltip 
  hc_add_theme(hc_theme_economist())
coaster_hc

Another perspective viewing the height vs speed

coaster_hc <- highchart() %>%
  hc_chart(type = "scatter") %>%
  hc_add_series(coasters_clean[coasters_clean$Type_Main == "Steel",], "scatter", 
                hcaes(x = speed1_value, y = height_value, size = length_value), name = "Steel", color = "#808080") %>%
  hc_add_series(coasters_clean[coasters_clean$Type_Main == "Wood",], "scatter", 
                hcaes(x = speed1_value, y = height_value, size = length_value), name = "Wooden", color ="#964B00") %>%
  hc_add_series(coasters_clean[coasters_clean$Type_Main == "Other",], "scatter", 
                hcaes(x = speed1_value, y = height_value, size = length_value), name = "Other", color = "#1F77B4") %>%
  hc_title(text = "Fastest Operating Roller Coasters Around the World") %>%
  hc_subtitle(text = "Source: roller_coasters.csv") %>%
  hc_xAxis(title = list(text = "Speed (mph)")) %>%
  hc_yAxis(title = list(text = "Height (ft)"),
           max = 600) %>% # Bigger circle so had to increase range
  hc_tooltip(pointFormat = "Coaster: {point.coaster_name}, Location: {point.Location}, Manufacturer: {point.Manufacturer}, Designer: {point.Designer}, Opening Date: {point.opening_date_clean}, Model: {point.Model}, Type: {series.name}, Cost: {point.Cost}, Length: {point.length_value}ft, Height: {point.y}ft, Speed: {point.speed1_value}mph, Duration: {point.Duration}min") %>% # Customize tooltip; x is speed in this chart, so length comes from length_value
  hc_add_theme(hc_theme_economist())
coaster_hc

B. What the visualization represents? Any interesting patterns or surprises that arise within the visualization.

My various visualizations throughout the markdown file show the relationship between roller coasters’ height, length, and speed. Coloring the points by material type allows us to see whether there is any correlation with steel, wooden, or other coasters, and the tooltips give us information about additional variables such as cost. Comparing height against length, we see a positive trend. However, it is weak and has a significant outlier. Looking at the size of the points (speed of coasters), we see that slower coasters (smaller circles) tend to be under 200 ft, although there are many larger circles in the same range.

Viewing the 3D scatter plot gives us another dimension for viewing the data, and we can rotate the plot so that height is on the y-axis and speed is on the x-axis. We see a strong positive trend showing that taller coasters are faster than shorter ones, except for Furius Baco, whose speed is 83.8 mph, greater than 90 percent of the other coasters, even though it is the shortest of all the coasters in my cleaned data set. Furthermore, my highcharter scatter plot comparing height and speed reinforces the positive trend between the two, regardless of the coaster type.
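One way to back up a visual impression of a trend is to compute a correlation coefficient. The sketch below does this on made-up height and speed values (hypothetical numbers chosen only to illustrate the call); with the real data, the same cor() call could be applied to the height_value and speed1_value columns of coasters_clean, assuming both are numeric.

```r
# Hypothetical values for illustration only, not taken from the data set
height_ft <- c(50, 100, 150, 200, 300)
speed_mph <- c(30, 45, 55, 70, 90)

# Pearson correlation: values near 1 suggest a strong positive linear trend
r <- cor(height_ft, speed_mph)
r
```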

Another interesting pattern I found was that Intamin was a popular manufacturer of the most high-end coasters (the faster steel coasters) and even some smaller steel coasters. On the other hand, no manufacturer made more than one wooden coaster in the cleaned data set. Lastly, I noticed that Universal Studios and Disney roller coasters cost the most out of all the coasters. Hagrid’s Magical Creatures is the most expensive, costing 300 million, and Expedition Everest at Disney’s Animal Kingdom is the second most expensive, costing 100 million.

C. Anything that you might have shown that you could not get to work or that you wished you could have included.

I wish I had included a side-by-side scatter plot comparing the older closed/removed roller coasters. It could be helpful and give us a view of how roller coasters have changed over time. Technological advances have enabled us to build faster and larger coasters than ever before. Additionally, we could take a deeper look at cost, manufacturers, and material types. When did steel coasters become dominant in the industry? Have coasters gotten cheaper or more expensive to build? Who has been the industry-leading manufacturer over time? How has competition changed?
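A rough sketch of how such a side-by-side comparison might look is given below, using facet_wrap() to draw one panel per status. The rows are made up for illustration, and the "Removed" label is an assumption about how closed coasters are coded in the Status column; with the real data, the Status filter in the cleaning step would have to be skipped so that non-operating coasters remain.

```r
library(ggplot2)

# Hypothetical rows for illustration only
toy <- data.frame(
  length_value = c(2000, 3500, 5000, 1500, 2800, 4200),
  height_value = c(80, 150, 300, 60, 110, 200),
  Status       = c("Operating", "Operating", "Operating",
                   "Removed", "Removed", "Removed")
)

status_plot <- ggplot(toy, aes(length_value, height_value)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ Status) + # one panel per status value, drawn side by side
  labs(title = "Roller Coasters by Status",
       x = "Length (ft)",
       y = "Height (ft)") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) # Center the title
status_plot
```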