I found roller_coasters.csv in our class-shared Google Drive, provided by Professor Saidi. One reason I selected this data set is its variety of variables. It includes over 50 columns of character, integer, and double (numeric) data detailing cost, manufacturers and designers, opening dates, material types, and much more that I found interesting.
Looking at the data set, it was obvious that many cells were blank or incomplete, so the data had to be cleaned before I could create any visualizations. I started by replacing every blank cell with NA; otherwise, filtering would give flawed results because is.na() does not treat an empty string as missing. After adding the NAs across all columns with mutate_all(), I filtered each column I would be using in my visualizations with filter(!is.na(column_name)), where the column name is written without quotes. Next, since the data includes roller coasters that have been closed or removed (destroyed), I filtered for only operating roller coasters. I thought this would be more relevant, and including closed and removed coasters would make the scatter plot too cluttered. I did so with filter(Status %in% c("Operating")). Then, because the units are mixed between mph and km/h for speed and between ft and m for height, I kept only the rows where speed1_unit is "mph" and height_unit is "ft" so that the values in speed1_value and height_value are comparable.
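The blank-to-NA step and the filters described above can be sketched on a small made-up data frame. The values are invented for illustration, and na_if() is swapped in here as a more compact alternative to the ifelse() approach; only the column names mirror the ones in roller_coasters.csv. The column name inside is.na() is unquoted, because is.na("Status") would test the literal string rather than the column.

```r
library(dplyr)

# Toy data (invented values); empty strings stand in for blank cells
toy <- data.frame(
  coaster_name = c("A", "B", "C", "D"),
  Status       = c("Operating", "Removed", "Operating", ""),
  speed1_unit  = c("mph", "mph", "km/h", "mph"),
  speed1_value = c(50, 40, 80, 30),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>% # blank strings become NA
  filter(!is.na(Status)) %>%         # unquoted column name
  filter(Status == "Operating") %>%  # keep operating coasters only
  filter(speed1_unit == "mph")       # keep speeds recorded in mph

toy_clean$coaster_name
```

Only coaster "A" survives: "B" is removed, "C" is recorded in km/h, and "D" has a blank status.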
Finally, when I went to plot, my x-axis had dozens of separate values because the Length column was a character type and included “ft” after the number, like “100 ft.” I needed to create a column holding only the number (the “100”), with the “ft” removed and the result converted to a numeric type. This was difficult because simply calling mutate(length_value = as.numeric(Length)) produces the warning “NAs introduced by coercion” and leaves only NA values in the length_value column. To fix this, I first used str_extract() to pull out the number and str_replace_all() to remove commas, so that nothing left in the string would cause errors when I converted it with as.numeric().
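To illustrate why the direct conversion fails and what the extraction step does, here is a minimal sketch on three made-up strings written in the style of the Length column. The input values are assumptions for illustration; the regular expression is the one used in the cleaning code.

```r
library(stringr)

# Example strings in the style of the Length column (made-up values)
length_raw <- c("100 ft", "5,312 ft", "2,705.5 ft")

# Direct conversion fails: every element becomes NA, with a coercion warning
direct <- suppressWarnings(as.numeric(length_raw))

# Extract the numeric part, drop the thousands separator, then convert
digits <- str_extract(length_raw, "\\d+(,\\d+)?\\.?\\d*")
digits <- str_replace_all(digits, ",", "")
length_value <- as.numeric(digits)

length_value
```

One limitation worth noting: the pattern allows only a single comma group, so a value such as "1,234,567 ft" would be cut off at "1,234". Lengths in feet are unlikely to reach that size, but the pattern would need adjusting for larger numbers.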
library(tidyverse)
library(plotly)
library(treemap)
library(viridis)
library(highcharter)
setwd("C:/Users/jakea/OneDrive/Desktop/MC 2022/DATA-110/Database Data_110")
coasters <- read.csv("roller_coasters.csv") # load dataset
coasters_clean <- coasters %>%
mutate_all(~ifelse(.x == "", NA, .x)) %>% # Add NA value to blanks
filter(!is.na(Length) & !is.na(speed1_value) & !is.na(height_value) & !is.na(Type_Main) & !is.na(Location) & !is.na(year_introduced) & !is.na(Manufacturer) & !is.na(Model) & !is.na(opening_date_clean) & !is.na(Cost) & !is.na(Duration) & !is.na(Designer)) %>% # Removes the NA values from necessary columns
filter(Status %in% c("Operating")) %>% # only operating coasters
filter(speed1_unit %in% c("mph")) %>% # only uses speed data tracked in mph
filter(height_unit %in% c("ft")) # only uses height data tracked in ft
coasters_clean <- coasters_clean %>%
mutate(length_value = str_extract(Length, "\\d+(,\\d+)?\\.?\\d*")) %>% # Extracts the length values from the Length column using a regular expression pattern and assign results to new column length_value
mutate(length_value = str_replace_all(length_value, ",", "")) %>% # Removes commas before converting data type
mutate(length_value = as.numeric(length_value)) # Converts the length_value column to numeric data type
coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
labs(title = "Operating Roller Coasters")+
xlab("Length (ft)") +
ylab("Height (ft)")
coaster_plot
coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
geom_point(aes(size = speed1_value, color = Type_Main)) +
labs(title = "Operating Roller Coasters")+
xlab("Length (ft)") +
ylab("Height (ft)")
coaster_plot
coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
geom_point(aes(size = speed1_value, color = Type_Main)) +
scale_color_manual(values = c("Wood" = "#964B00", "Steel" = "#808080", "Other" = "#1F77B4")) + # Manually set colors to match material
labs(title = "Operating Roller Coasters")+
xlab("Length (ft)") +
ylab("Height (ft)") +
guides(size = guide_legend(title = "Speed (mph)"), color = guide_legend(title = "Roller Coaster Type")) + # Customize the legend titles
ylim(0, 500) +
theme_bw() + # Change the theme
theme(plot.title = element_text(hjust = 0.5)) # Center the title; theme() applies to this plot, while theme_update() changes the global theme
coaster_plot
coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
geom_point(aes(size = speed1_value, color = Type_Main)) +
scale_color_manual(values = c("Wood" = "#964B00", "Steel" = "#808080", "Other" = "#1F77B4"),name = "Roller Coaster Type &") + # Manually set colors to match material and customize the legend title
scale_size_continuous(name = "Speed (mph)") + # Customize the legend title
labs(title = "Operating Roller Coasters",
x = "Length (ft)",
y = "Height (ft)") +
ylim(0, 500) +
theme_bw() + # Change the theme
theme(plot.title = element_text(hjust = 0.5)) # Center the title; theme() applies to this plot, while theme_update() changes the global theme
coaster_plotly <- ggplotly(coaster_plot)
coaster_plotly
coaster_plot <- ggplot(coasters_clean, aes(length_value, height_value)) +
geom_point(aes(size = speed1_value, color = Type_Main, label = coaster_name, label2 = Status, label3 = Speed, label4 = Height, label5 = Drop, label6 = Length), alpha = 0.5) +
scale_color_manual(values = c("Wood" = "#964B00", "Steel" = "#808080", "Other" = "#1F77B4"),name = "Roller Coaster Type &") + # Manually set colors to match material
scale_size_continuous(name = "Speed (mph)") +
labs(title = "Operating Roller Coasters",
x = "Length (ft)",
y = "Height (ft)") +
ylim(0, 500) +
theme_bw() + # Change the theme
theme(plot.title = element_text(hjust = 0.5)) # Center the title; theme() applies to this plot, while theme_update() changes the global theme
coaster_plotly <- ggplotly(coaster_plot)
coaster_plotly
# Create a 3D scatter plot
coaster_plot_ly <- plot_ly(
coasters_clean, x = ~speed1_value, y = ~length_value, z = ~height_value,
color = ~Type_Main, colors = c("Wood" = "#964B00", "Steel" = "#808080", "Other" = "#1F77B4"),
type = "scatter3d", mode = "markers", marker = list(size = 5), alpha = 0.75, text = ~paste("Name: ", coaster_name, "<br>", "Type: ", Type_Main, "<br>", "Manufacturer: ", Manufacturer, "<br>", "Location: ", Location, "<br>", "Opened: ", Opening.date, "<br>", "Length: ", length_value, "ft", "<br>", "Speed: ", speed1_value, "mph", "<br>", "Height: ", height_value, "ft")) %>% # "<br>" creates spacing between each label in the tool tip, so it is not one long line
layout(title = "Operating Roller Coasters",
scene = list(xaxis = list(title = "Speed (mph)", range = c(150, 0)),
yaxis = list(title = "Length (ft)", range = c(0, 8000)),
zaxis = list(title = "Height (ft)", range = c(0, 500))))
coaster_plot_ly
coaster_hc <- highchart() %>%
hc_chart(type = "scatter") %>%
hc_add_series(coasters_clean[coasters_clean$Type_Main == "Steel",], "scatter",
hcaes(x = length_value, y = height_value, size = speed1_value), name = "Steel", color = "#808080") %>%
hc_add_series(coasters_clean[coasters_clean$Type_Main == "Wood",], "scatter",
hcaes(x = length_value, y = height_value, size = speed1_value), name = "Wooden", color ="#964B00") %>%
hc_add_series(coasters_clean[coasters_clean$Type_Main == "Other",], "scatter",
hcaes(x = length_value, y = height_value, size = speed1_value), name = "Other", color = "#1F77B4") %>%
hc_title(text = "Tallest and Longest Operating Roller Coasters Around the World") %>%
hc_subtitle(text = "Source: roller_coasters.csv") %>%
hc_xAxis(title = list(text = "Length (ft)")) %>%
hc_yAxis(title = list(text = "Height (ft)"),
max = 600) %>% # Bigger circle so had to increase range
hc_tooltip(pointFormat = "Coaster: {point.coaster_name}, Location: {point.Location}, Manufacturer: {point.Manufacturer}, Designer: {point.Designer}, Opening Date: {point.opening_date_clean}, Model: {point.Model}, Type: {series.name}, Cost: {point.Cost}, Length: {point.x}ft, Height: {point.y}ft, Speed: {point.speed1_value}mph, Duration: {point.Duration}min") %>% # Customize tooltip; x is length in ft
hc_add_theme(hc_theme_economist())
coaster_hc
coaster_hc <- highchart() %>%
hc_chart(type = "scatter") %>%
hc_add_series(coasters_clean[coasters_clean$Type_Main == "Steel",], "scatter",
hcaes(x = speed1_value, y = height_value, size = length_value), name = "Steel", color = "#808080") %>%
hc_add_series(coasters_clean[coasters_clean$Type_Main == "Wood",], "scatter",
hcaes(x = speed1_value, y = height_value, size = length_value), name = "Wooden", color ="#964B00") %>%
hc_add_series(coasters_clean[coasters_clean$Type_Main == "Other",], "scatter",
hcaes(x = speed1_value, y = height_value, size = length_value), name = "Other", color = "#1F77B4") %>%
hc_title(text = "Fastest Operating Roller Coasters Around the World") %>%
hc_subtitle(text = "Source: roller_coasters.csv") %>%
hc_xAxis(title = list(text = "Speed (mph)")) %>%
hc_yAxis(title = list(text = "Height (ft)"),
max = 600) %>% # Bigger circle so had to increase range
hc_tooltip(pointFormat = "Coaster: {point.coaster_name}, Location: {point.Location}, Manufacturer: {point.Manufacturer}, Designer: {point.Designer}, Opening Date: {point.opening_date_clean}, Model: {point.Model}, Type: {series.name}, Cost: {point.Cost}, Length: {point.length_value}ft, Height: {point.y}ft, Speed: {point.x}mph, Duration: {point.Duration}min") %>% # Customize tooltip; x is speed here, so length comes from length_value
hc_add_theme(hc_theme_economist())
coaster_hc
My visualizations throughout the markdown file show the relationship between roller coasters’ height, length, and speed. Coloring the points by material type lets us see whether any pattern differs among steel, wooden, and other coasters, and the tooltips give us information about additional variables such as cost. Comparing height to length, we see a positive trend. However, it is weak and has a significant outlier. Looking at the size of the points (the speed of the coasters), we see that slower coasters (smaller circles) tend to be under 200 ft tall, although there are many larger circles in the same range.
The 3D scatter plot gives us another dimension for viewing the data. Rotating the plot so that height is on the vertical axis and speed is on the horizontal axis, we see a strong positive trend showing that taller coasters are faster than shorter ones. The exception is Furius Baco, whose speed of 83.8 mph is greater than that of 90 percent of the other coasters even though it is the shortest of all the coasters in my cleaned data set. Furthermore, my highcharter scatter plot comparing height and speed reinforces the positive trend between the two, regardless of coaster type.
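One way to put numbers on "weak" and "strong" is the Pearson correlation coefficient from base R's cor(). The sketch below uses invented values, not the real data set, so the coefficients it produces say nothing about the actual coasters; with the real data, the same two calls would be run on the columns of coasters_clean.

```r
# Toy values (invented) standing in for coasters_clean
toy <- data.frame(
  height_value = c(50, 100, 150, 200, 300, 400),
  speed1_value = c(30, 45, 55, 65, 85, 110),
  length_value = c(2000, 1500, 4000, 2500, 5000, 3000)
)

# Pearson correlation: values near 1 mean a strong positive linear trend
r_height_speed  <- cor(toy$height_value, toy$speed1_value)
r_height_length <- cor(toy$height_value, toy$length_value)

r_height_speed
r_height_length
```

Because a single outlier can pull the Pearson coefficient noticeably, it may also be worth comparing against cor(..., method = "spearman"), which is based on ranks.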
Another interesting pattern I found was that Intamin was a popular manufacturer of the most high-end (faster) steel coasters and even some smaller steel coasters. On the other hand, no manufacturer made more than one wooden coaster in the cleaned data set. Lastly, I noticed that Universal Studios and Disney roller coasters cost the most out of all the coasters. Hagrid’s Magical Creatures is the most expensive, costing $300 million, and Expedition Everest at Disney’s Animal Kingdom is the second most expensive, costing $100 million.
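The manufacturer pattern can be checked with a count per manufacturer and material type rather than by reading tooltips. This is a minimal sketch on invented rows: "Maker B", "Maker C", and "Maker D" are hypothetical placeholders, and the counts do not reflect the real data set.

```r
library(dplyr)

# Toy rows (invented); "Maker B", "Maker C", "Maker D" are placeholders
toy <- data.frame(
  Manufacturer = c("Intamin", "Intamin", "Intamin", "Maker B", "Maker C", "Maker D"),
  Type_Main    = c("Steel", "Steel", "Steel", "Steel", "Wood", "Wood"),
  stringsAsFactors = FALSE
)

# Number of coasters per manufacturer within each material type
maker_counts <- toy %>%
  count(Type_Main, Manufacturer, name = "n_coasters") %>%
  arrange(Type_Main, desc(n_coasters))

# Largest number of wooden coasters built by any single manufacturer
max_wood <- maker_counts %>%
  filter(Type_Main == "Wood") %>%
  summarise(m = max(n_coasters)) %>%
  pull(m)

maker_counts
max_wood
```

Running the same pipeline on coasters_clean would show directly whether any manufacturer appears more than once among the wooden coasters.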
I wish I had included a side-by-side scatter plot comparing the older closed and removed roller coasters. It would give us a view of how roller coasters have changed over time, since technological advances have enabled us to build faster and larger coasters than ever before. Additionally, we could take a deeper look at cost, manufacturers, and material types. When did steel coasters become dominant in the industry? Have coasters gotten cheaper or more expensive to build? Who has been the industry-leading manufacturer over time? How has competition changed?