Roller Coasters

pacman::p_load(knitr, readr, RColorBrewer, tidyverse)

The starting data set

The original csv file of roller coaster data from Kaggle, Kaggle roller coaster data, is a tidy dataset with 14 variables in 157 rows.

rc <- as_tibble(read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/RollerCoaster/RollerCoaster.csv"))
str(rc)

## tibble [157 x 14] (S3: tbl_df/tbl/data.frame)
##  $ Coaster          : chr [1:157] "Zippin Pippin" "Jack Rabbit" "Thunderhawk" "Giant Dipper" ...
##  $ Park             : chr [1:157] "Libertyland" "Kennywood Park" "Dorney Park" "Santa Cruz Beach Boardwalk" ...
##  $ City             : chr [1:157] "Memphis" "West Mifflin" "Allentown" "Santa Cruz" ...
##  $ State            : chr [1:157] "Tennessee" "Pennsylvania" "Pennsylvania" "California" ...
##  $ Type             : chr [1:157] "Wooden" "Wooden" "Wooden" "Wooden" ...
##  $ Design           : chr [1:157] "Sit Down" "Sit Down" "Sit Down" "Sit Down" ...
##  $ Year_Opened      : int [1:157] 1915 1921 1923 1924 1924 1927 1935 1940 1946 1951 ...
##  $ Top_Speed        : int [1:157] 40 45 45 55 55 48 50 55 50 25 ...
##  $ Max_Height       : int [1:157] 70 40 80 70 70 85 55 90 84 37 ...
##  $ Drop             : int [1:157] 70 70 65 65 95 78 52 89 78 25 ...
##  $ Length           : int [1:157] 2865 2132 2767 2640 2887 2746 2650 2800 3360 1300 ...
##  $ Duration         : int [1:157] 90 96 78 112 101 75 105 120 105 84 ...
##  $ Inversions       : chr [1:157] "N" "N" "N" "N" ...
##  $ Num_of_Inversions: int [1:157] 0 0 0 0 0 0 0 0 0 0 ...

The data includes:

Coaster name, character
theme or amusement Park, City and State where it’s located, all character
Year Opened, four digit integer
Type of construction, categorical: wood or steel
Design, categorical

unique(rc$Design)  # display the categories for Design

## [1] "Sit Down"      "Suspended"     "Stand Up"      "Pipeline"     
## [5] "Inverted"      "Wing"          "Flying"        "4th Dimension"

Top Speed (mph), numeric
Max Height (feet), numeric
Drop (feet), numeric
Length (feet), numeric
Duration (seconds), numeric
If the coaster has Inversions (Y/N)
Number of inversions, integer

31 rows have incomplete data, missing values in the Speed, Height, Drop, Length, Duration or Num_of_Inversions columns.

colSums(is.na(rc))  # count the number of na in each column

##           Coaster              Park              City             State 
##                 0                 0                 0                 0 
##              Type            Design       Year_Opened         Top_Speed 
##                 0                 0                 0                10 
##        Max_Height              Drop            Length          Duration 
##                 5                 3                 3                28 
##        Inversions Num_of_Inversions 
##                 0                 1

A little data research

The Roller Coaster Database shows 708 operating roller coasters in the United States. There’s isn’t any information on Kaggle about the selection criteria of the roller coasters included in the dataset so conclusions based on this dataset probably don’t reflect reality. I feel I can augment the data without angst, all in the name of fun and learning.

I used the Roller Coaster Database (RCDB), Wikipedia, Coasterpedia - Roller Coaster wiki and Ultimate Coaster to find some of the missing data values and to add a few more rows to the roller coaster dataset. My augmented dataset has 187 roller coasters.

# read RollerCoaster_new and change the column names to lower case

rc <- read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/RollerCoaster/RollerCoaster_new.csv") %>%
  as_tibble() %>%
  rename_with(tolower)      # rename the variables to lower case
str(rc)                     # look at the tibble

## tibble [187 x 14] (S3: tbl_df/tbl/data.frame)
##  $ coaster          : chr [1:187] "Adventure Express" "Afterburner" "Alpengeist" "American Eagle" ...
##  $ park             : chr [1:187] "Kings Island" "Fun Spot" "Busch Gardens Williamsburg" "Six Flags Great America" ...
##  $ city             : chr [1:187] "Kings Mills" "Angola" "Williamsburg" "Gurnee" ...
##  $ state            : chr [1:187] "Ohio" "Indiana" "Virginia" "Illinois" ...
##  $ type             : chr [1:187] "Steel" "Steel" "Steel" "Wood" ...
##  $ design           : chr [1:187] "Sit Down" "Sit Down" "Inverted" "Sit Down" ...
##  $ year_opened      : int [1:187] 1991 1991 1997 1981 1991 1999 2000 2005 1998 1999 ...
##  $ top_speed        : num [1:187] 35 45 67 66 50 73 50 64.8 70 81 ...
##  $ max_height       : num [1:187] 63 56 195 127 128 170 95 45 200 105 ...
##  $ drop             : num [1:187] 47 47 170 147 144 210 92 31 105 85 ...
##  $ length           : num [1:187] 2963 635 3828 4650 2700 ...
##  $ duration         : int [1:187] 140 66 190 143 110 135 115 64 48 120 ...
##  $ inversions       : chr [1:187] "N" "Y" "Y" "N" ...
##  $ num_of_inversions: int [1:187] 0 1 6 0 4 0 0 0 3 5 ...

But I still couldn’t find the drop height for ten of the roller coasters.

colSums(is.na(rc))  # count the number of na in each column

##           coaster              park              city             state 
##                 0                 0                 0                 0 
##              type            design       year_opened         top_speed 
##                 0                 0                 0                 0 
##        max_height              drop            length          duration 
##                 0                10                 0                 0 
##        inversions num_of_inversions 
##                 0                 0

Filling in all the drops

Rather than eliminate these ten rows, I’ll assign values for drop by using the average proportion of drop to max_height:

mean(rc$drop / rc$max_height, na.rm = TRUE)   # calc the average proportion of drop to height

## [1] 0.908926

This proportion is pretty close to one, I’ll look at the relationship of drop to max_height and see if it’s really that close to one.

rc %>% ggplot(aes(x = drop, y = max_height)) + 
         geom_point(na.rm = TRUE) +                          # dot plot drop against height 
         geom_smooth(na.rm = TRUE) +                         # and draw a line on the plot
         theme_minimal()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

It’s not quite a straight line, and there are a few points a distance from the line. Since mean is not resistant to outliers, I’ll see if there are outliers for the proportion.

dbh <-rc %>% select (drop, max_height) %>%               # create a tibble 
       mutate(dropbyheight = drop / max_height)          # with the proportion of drop to max_height
dbh %>% ggplot(aes(dropbyheight)) +                      # boxplot dropbyheight to see outliers
         geom_boxplot(outlier.color = "red", outlier.size = 2.5, na.rm = TRUE) +
         theme_minimal()

There are both low and high outliers. I’ll calculate the cutoffs for eliminating the the low and high outliers using the standard 25th percentile - IQR * 1.5 and 75th percentile + IQR * 1.5.

lb <- quantile(dbh$dropbyheight, 0.25, na.rm = TRUE) -     # calculate the cutoff for low outliers
      IQR(dbh$dropbyheight, na.rm = TRUE) * 1.5

ub <- quantile(dbh$dropbyheight,0.75, na.rm = TRUE) +      # calculate the cutoff for high outliers
      IQR(dbh$dropbyheight, na.rm = TRUE) * 1.5 

x <- dbh %>% filter(dropbyheight > lb & dropbyheight < ub, na.rm = TRUE)    # filter out the outliers
xhd <-  mean(x$dropbyheight)                                  # calc the avg. proportion of drop / max_height
xhd

## [1] 0.8978595

This looks a little better, so I’ll use this average proportion to fill in the missing drop values.

x <- rc$max_height * xhd                        # create a vector of drop values using height * avg. proportion
rc$drop <- coalesce(rc$drop, x)                 # fill in the empty drop values in drop
colSums(is.na(rc))                              # check that there are no more missing values

##           coaster              park              city             state 
##                 0                 0                 0                 0 
##              type            design       year_opened         top_speed 
##                 0                 0                 0                 0 
##        max_height              drop            length          duration 
##                 0                 0                 0                 0 
##        inversions num_of_inversions 
##                 0                 0

Adding design as numeric

The design of a roller coaster has an impact on the ride experience. In order to include it in the Fun Factor calculation it needs to have a numeric value. The categories for design and their descriptions are:

unique(rc$design)   # display the categories for design

## [1] "Sit Down"      "Inverted"      "Wing"          "Suspended"    
## [5] "Stand Up"      "Flying"        "Pipeline"      "4th Dimension"

Sit Down - a traditional roller coaster ridden while sitting down.

Bobsled - designed like a bobsled run – without a fixed track. The train travels freely through a trough.

Stand Up - a coaster ridden while standing up instead of sitting down.

Flying - meant to simulate the sensations of flight by with riders in a prone superhero-like position.

Pipeline - a coaster where riders are positioned between the rails instead of above or below.

Inverted - a roller coaster which uses trains traveling beneath, rather than on top of, the track. Unlike a suspended roller coaster, an inverted roller coaster’s trains don’t pivot freely.

Suspended - a roller coaster using trains which travel beneath the track and pivot on a swinging arm from side to side, exaggerating the track’s banks and turns.

Wing - pairs of riders sit on either side of a roller coaster track in which nothing is above or below the riders.

4th Dimension - riders are rotated independently of the orientation of the track, generally about a horizontal axis that is perpendicular to the track.

Let’s see how many of each design are in the roller coaster dataset:

rc %>%  ggplot(aes(design)) +                                                   # bar plot for design variable
        geom_bar(position = "identity", alpha = 0.6) +
        theme_minimal()

Sit Down coasters are a definite majority, there’s only one 4th Dimension, about 25 Inverted, and a few of most of the other designs. Based on this diagram, using the design descriptions and continuing with angst-free data augmentation, I’m assigning the following numeric values for design:

Sit Down = 100

Bobsled = 200

Inverted = 300

Stand Up = 400

Suspended = 450

Flying = 500

Pipeline = 550

Wing = 600

4th Dimension = 700

rc <- rc %>% mutate(dsgn_val =                           # create a new variable and 
             ifelse(design == "Sit Down", 100,           # assign numeric values to each
             ifelse(design == "Bobsled", 200,            # design category in the tibble
             ifelse(design == "Inverted", 300,
             ifelse(design == "Stand Up", 400,
             ifelse(design == "Suspended", 450,
             ifelse(design == "Flying", 500,
             ifelse(design == "Pipeline", 550, 
             ifelse(design == "Wing", 600,
             ifelse(design == "4th Dimension", 700, 100)))))))))
             )

Roller Coaster Fun Factor

Now I’m ready to calculate the Fun Factor. Inversions are a big element in roller coaster excitement. But Number of Inversions is a single digit in all of the rows and it won’t have a very big impact on the total Fun Factor. So I’ll multiply the number of inversions by 100 to give those coasters their due score.

Fun Factor is the sum of speed + height + drop + length + duration + 100 * number of inversions + design value. I’m expecting a wide spread in Fun Factor. Most of the variables in the data have wide ranges. There are older roller coasters and kiddie roller coasters included in the data that are shorter, have slower speeds and don’t have inversions.

summary(rc)

##    coaster              park               city              state          
##  Length:187         Length:187         Length:187         Length:187        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      type              design           year_opened     top_speed    
##  Length:187         Length:187         Min.   :1915   Min.   :  6.0  
##  Class :character   Class :character   1st Qu.:1988   1st Qu.: 47.4  
##  Mode  :character   Mode  :character   Median :1997   Median : 55.0  
##                                        Mean   :1991   Mean   : 55.3  
##                                        3rd Qu.:2000   3rd Qu.: 65.0  
##                                        Max.   :2016   Max.   :120.0  
##    max_height         drop           length        duration    
##  Min.   : 18.0   Min.   :  9.0   Min.   : 200   Min.   : 28.0  
##  1st Qu.: 78.0   1st Qu.: 70.0   1st Qu.:2131   1st Qu.: 92.0  
##  Median :105.0   Median : 92.0   Median :2767   Median :120.0  
##  Mean   :118.6   Mean   :109.3   Mean   :2973   Mean   :122.9  
##  3rd Qu.:149.5   3rd Qu.:141.0   3rd Qu.:3886   3rd Qu.:150.0  
##  Max.   :420.0   Max.   :400.0   Max.   :7359   Max.   :420.0  
##   inversions        num_of_inversions    dsgn_val    
##  Length:187         Min.   :0.000     Min.   :100.0  
##  Class :character   1st Qu.:0.000     1st Qu.:100.0  
##  Mode  :character   Median :0.000     Median :100.0  
##                     Mean   :1.668     Mean   :158.6  
##                     3rd Qu.:3.000     3rd Qu.:100.0  
##                     Max.   :7.000     Max.   :700.0

rc <- rc %>%
      mutate(fun = top_speed + max_height + drop + length + duration +    # calculate the fun factor
               (num_of_inversions * 100) +  dsgn_val) 
rc %>%  ggplot(aes(fun, fill = type)) +                                       # plot a histogram of fun factor
        geom_histogram(position = "identity", alpha = 0.6, color = "white") +     # outline the bars in white
        labs(x = "Fun Factor", y = "Coaster Count") +                              # label the x & y axis
        scale_fill_discrete(name = "Construction", type = c("steelblue4", "brown4")) + # label the legend 
        theme_minimal()                                                              # and use descriptive colors

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The top roller coaster

top20 <- arrange(rc, desc(fun)) %>%                              # sort rc descending by fun factor
         select(fun, coaster, state, park, year_opened,          # select columns     
                length, design, dsgn_val, num_of_inversions) %>% 
         head(20)                                                # grab the top 20

The coaster with the highest fun factor is wooden! It’s the Beast at Paramounts Kings Island in Ohio, and it’s the longest roller coaster in the data. The second highest fun factor is the Son Of Beast, the second longest and also at Paramounts Kings Island. Folks in Ohio can have lots of coasting fun, six of the longest coasters are in Ohio.

kable(top20, align = 'c', row.names = TRUE,                      # display top 20
       col.names = c("Fun Factor", "Coaster", "State", "Park",  
                     "Year", "Length", "Design", "Design Value", "# Inversions"),
      caption = "Top 20 Roller Coasters")

Top 20 Roller Coasters
	Fun Factor	Coaster	State	Park	Year	Length	Design	Design Value	# Inversions
1	8025	Beast	Ohio	Paramounts Kings Island	1979	7359	Sit Down	100	0
2	7882	Son Of Beast	Ohio	Paramounts Kings Island	2000	7032	Sit Down	100	1
3	7647	Fury 325	North Carolina	Carowinds	2015	6602	Sit Down	100	0
4	7538	Millennium Force	Ohio	Cedar Point	2000	6595	Sit Down	100	0
5	7087	Voyage	Indiana	Holiday World	2006	6442	Sit Down	100	0
6	6713	California Screamin	California	Disneys California Adventure	2001	6072	Sit Down	100	1
7	6620	Desperado	Nevada	Buffalo Bills Resort & Casino	1994	5843	Sit Down	100	0
8	6365	Mamba	Missouri	Worlds of Fun	1998	5600	Sit Down	100	0
9	6360	Steel Force	Pennsylvania	Dorney Park	1997	5600	Sit Down	100	0
10	6217	Wild Thing	Minnesota	Valleyfair!	1996	5460	Sit Down	100	0
11	6207	Titan	Texas	Six Flags Over Texas	2001	5312	Sit Down	100	0
12	6161	Superman - Ride Of Steel	Massachusetts	Six Flags New England	2000	5400	Sit Down	100	0
13	6159	Nitro	New Jersey	Six Flags Great Adventure	2001	5394	Sit Down	100	0
14	6147	Intimidator	North Carolina	Carowinds	2010	5316	Sit Down	100	0
15	6108	Superman - Ride Of Steel	New York	Six Flags Darien Lake	1999	5400	Sit Down	100	0
16	6101	Mean Streak	Ohio	Cedar Point	1991	5427	Sit Down	100	0
17	6087	Diamondback	Ohio	Kings Island	2009	5282	Sit Down	100	0
18	6043	Superman - Ride Of Steel	Maryland	Six Flags America	2000	5350	Sit Down	100	0
19	5917	Riddlers Revenge	California	Six Flags Magic Mountain	1998	4370	Stand Up	400	6
20	5798	Magnum XL-200	Ohio	Cedar Point	1989	5106	Sit Down	100	0

Length is the biggest contributing component in Fun Factor. I don’t think one component should outweigh the others in the calculation. I’ll weight length, as I did for Number of Inversions, so it contributes to Fun Factor more equally. The longer lengths are in thousands of feet, so I’ll divide it by 10 to bring it down into scale with the other components. Dividing length by 10 gives the following histogram for Fun Factor.

colSums(is.na(rc))

##           coaster              park              city             state 
##                 0                 0                 0                 0 
##              type            design       year_opened         top_speed 
##                 0                 0                 0                 0 
##        max_height              drop            length          duration 
##                 0                 0                 0                 0 
##        inversions num_of_inversions          dsgn_val               fun 
##                 0                 0                 0                 0

rc <- rc %>%                                                              
      mutate(fun = top_speed + max_height + drop +    # re-calculate fun factor       
               (length / 10) + duration +             # with lenght / 10
               (num_of_inversions * 100) +  dsgn_val) 
rc %>%  ggplot(aes(fun, fill = type)) +                                       # plot a new histogram of fun factor
        geom_histogram(position = "identity", alpha = 0.6, color = "white") +      # outline the bars
        labs(x = "Fun Factor", y = "Coaster Count") +                              # label the x & y axis
        scale_fill_discrete(name = "Construction", type = c("steelblue4", "brown4"))  + # label the legend 
        theme_minimal()                                                              # and use descriptive colors

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

top20 <- arrange(rc, desc(fun)) %>%                              # sort rc descending by fun factor
         rowid_to_column() %>%                                         # add row ids as a column
         select(rowid, fun, coaster, state, park, year_opened,          # select columns     
                length, design, dsgn_val, num_of_inversions) %>% 
         head(20)                                                # grab the top 20
noi <- top20 %>%
       filter(num_of_inversions == 0)
noi

## # A tibble: 2 x 10
##   rowid   fun coaster      state      park    year_opened length design dsgn_val
##   <int> <dbl> <chr>        <chr>      <chr>         <int>  <dbl> <chr>     <dbl>
## 1    12 1705. Fury 325     North Car~ Carowi~        2015   6602 Sit D~      100
## 2    19 1602. Millennium ~ Ohio       Cedar ~        2000   6595 Sit D~      100
## # ... with 1 more variable: num_of_inversions <int>

Now the top coasters are steel, have higher design values and more inversions. But longer, sit down coasters with no inversions also show up in the top 20, Fury 325 at 12 and Millennium Force at 19. So I think the new Fun Factor calculation gives a better representation for each component.

kable(top20, align = 'c',                                                          # display top 20
       col.names = c("Row#", "Fun Factor", "Coaster", "State", "Park",  
                     "Year", "Length", "Design", "Design Value", "# Inversions"),
       caption = "Top 20 Roller Coasters - scaled Fun Factor")

Top 20 Roller Coasters - scaled Fun Factor
Row#	Fun Factor	Coaster	State	Park	Year	Length	Design	Design Value	# Inversions
1	2147.0	X2	California	Six Flags Magic Mountain	2002	3610	4th Dimension	700	2
2	1984.0	Riddlers Revenge	California	Six Flags Magic Mountain	1998	4370	Stand Up	400	6
3	1916.3	Montu	Florida	Busch Gardens Tampa	1996	3983	Inverted	300	7
4	1904.8	Alpengeist	Virginia	Busch Gardens Williamsburg	1997	3828	Inverted	300	6
5	1826.5	Chang	Kentucky	Six Flags Kentucky Kingdom	1997	4155	Stand Up	400	5
6	1762.0	Viper	California	Six Flags Magic Mountain	1990	3830	Sit Down	100	7
7	1753.7	Medusa	California	Six Flags Discovery Kingdom	2000	3937	Sit Down	100	7
8	1732.5	Scream!	California	Six Flags Magic Mountain	2003	3985	Sit Down	100	7
9	1728.0	Raptor	Ohio	Cedar Point	1994	3790	Inverted	300	6
10	1716.0	Great American Scream Machine	New Jersey	Six Flags Great Adventure	1989	3800	Sit Down	100	7
11	1709.8	Kumba	Florida	Busch Gardens Tampa	1993	3978	Sit Down	100	7
12	1705.2	Fury 325	North Carolina	Carowinds	2015	6602	Sit Down	100	0
13	1697.7	Kraken	Florida	SeaWorld Orlando	2000	4177	Sit Down	100	7
14	1692.0	Mantis	Ohio	Cedar Point	1996	3900	Stand Up	400	4
15	1673.5	Medusa	New Jersey	Six Flags Great Adventure	1999	3985	Sit Down	100	7
16	1672.5	Silver Bullet	California	Knott’s Berry Farm	2004	3125	Inverted	300	6
17	1658.5	Superman Krypton Coaster	Texas	Six Flags Fiesta Texas	2000	4025	Sit Down	100	6
18	1633.9	Batman The Ride	Texas	Six Flags Fiesta Texas	2015	1019	Wing	600	6
19	1602.5	Millennium Force	Ohio	Cedar Point	2000	6595	Sit Down	100	0
20	1587.0	Incredible Hulk	Florida	Universal Studios Islands of Adventure	1999	3700	Sit Down	100	7

Older and kiddie coasters

low10 <- arrange(rc, desc(fun)) %>%                                       # sort rc descending by fun factor
         rowid_to_column() %>%                                            # add row ids as a column
         select(rowid, fun, coaster, state, park, type, year_opened) %>%  # select a subset of columns
         tail(10)                                                         # assign the last 10 row to low10

oldest <- arrange(rc, desc(fun)) %>%                 # sort rc descending by fun factor
          rowid_to_column() %>%                      # add row ids as a column
          filter(year_opened == min(year_opened))    # find the coaster with the earliest year

As I expected, coasters at the low end are older or geared to the very young. However the oldest coaster Zippin Pippin in Tennessee, built in 1915, is ranked 150 out of 187 and has a fun factor of 656.5 due to it’s length of 2865 feet.

kable(low10, align = c('c','l','l','l','c','c'),                        # display low10
       col.names = c("Row#", "Fun Factor", "Coaster", "State", "Park",  
                     "Type", "Year"),
       caption = "Bottom 10 Coasters")

Bottom 10 Coasters
Row#	Fun Factor	Coaster	State	Park	Type	Year
178	401.0000	Comet	Pennsylvania	Waldameer Park	Wood	1951
179	377.1000	Woodstock Express	Ohio	Cedar Point	Steel	1999
180	377.0000	Bobsleds	New York	Seabreeze	Steel	1962
181	365.2000	Leap The Dips	Pennsylvania	Lakemont Park	Wood	1999
182	362.7122	Wild Chipmunk	Colorado	Lakeside Amusement Park	Steel	1955
183	286.5703	Gadget’s Go Coaster	California	Disneyland	Steel	1993
184	281.5200	Merlin’s Revenge	California	Castle Amusement Park	Steel	2001
185	280.0000	High Speed Thrill Coaster	Pennsylvania	Knoebels	Steel	1955
186	279.5200	Spacely’s Sprocket Rockets	Illinois	Six Flags Great America	Steel	1998
187	236.3593	Jr. Gemini	Ohio	Cedar Point	Steel	1979

kable(oldest, align = 'c', caption = "Oldest Coaster") # display it

Oldest Coaster
rowid	coaster	park	city	state	type	design	year_opened	top_speed	max_height	drop	length	duration	inversions	num_of_inversions	dsgn_val	fun
150	Zippin Pippin	Libertyland	Memphis	Tennessee	Wood	Sit Down	1915	40	70	70	2865	90	N	0	100	656.5

Where the fun is

To get a more realistic view of the distribution of roller coasters in the U.S., I used RCDB’s census page where I could search for the total count of operating roller coasters by state.

rc1 <- as_tibble(read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/RollerCoaster/USRollerCoasterCensus.csv"))
str(rc1)

## tibble [43 x 4] (S3: tbl_df/tbl/data.frame)
##  $ Location: chr [1:43] "Alabama" "Arizona" "Arkansas" "California" ...
##  $ Steel   : int [1:43] 7 4 5 78 12 5 47 20 4 19 ...
##  $ Wood    : int [1:43] 1 0 1 6 2 3 3 2 2 4 ...
##  $ Total   : int [1:43] 8 4 6 84 14 8 50 22 6 23 ...

rcc <-  arrange(rc1, Total, Location)         # sort by total coasters
row.names(rcc) <- rcc$Location       # rename the rows according to the state

## Warning: Setting row names on a tibble is deprecated.

rc_mat <- data.matrix(rcc)            # convert dataframe to a matrix  (required by heatmap)
rc_mat_ <- rc_mat[,2:4]     # remove the redundant column of state

# Call heatmap using a ColorBrewer color set, margins=c(7,10) for aspect ratio, titles of graph, x and y labels, 
# font size of x and y labels, and set up a RowSideColors bar

varcols = setNames(colorRampPalette(brewer.pal(nrow(rc_mat_), "BuPu"))(nrow(rc_mat_)), 
                  rownames(rc_mat_))   # parameter for RowSideColors

## Warning in brewer.pal(nrow(rc_mat_), "BuPu"): n too large, allowed maximum for palette BuPu is 9
## Returning the palette you asked for with that many colors

heatmap(rc_mat_, 
        Rowv = NA, Colv = NA, 
        col= colorRampPalette(brewer.pal(nrow(rc_mat_), "BuPu"))(nrow(rc_mat_)),
        s=0.6, v=1, scale="column", 
        margins=c(5,8), 
        main = "Where the Coasters are", 
        xlab = "Roller Coasters", 
        ylab="State",
        labCol = c("Steel","Wood", "Total"),
        cexCol=1, 
        RowSideColors = varcols)

## layout: widths =  0.05 0.2 4 , heights =  0.25 4 ; lmat=
##      [,1] [,2] [,3]
## [1,]    0    0    4
## [2,]    3    1    2

## Warning in brewer.pal(nrow(rc_mat_), "BuPu"): n too large, allowed maximum for palette BuPu is 9
## Returning the palette you asked for with that many colors

California at the top isn’t surprising, there are a number of amusement parks in LA, Orange and San Diego counties, and the temperate climate means the parks can be open year round (except that all amusement parks are closed right now due to Covid 19). Pennsylvania has five more operating coasters than Florida, which I didn’t expect. Almost a third of them are wooden (Pennsylvania has the most wooden roller coasters) and ten of them were opened before 1955. The oldest operating roller coaster in Florida opened in 1972.

Seven states don’t have any roller coasters, Delaware, Rhode Island and Vermont are small and close to states with amusement parks. Montana and Wyoming are probably too sparsely populated, and Alaska is probably too cold. Disney runs the Aulani resort in Hawaii, but it doesn’t have a roller coaster. I doubt anyone misses them in the island paradise.

rcc <- arrange(rc1, desc(Total))             # sort the roller coaster census descending by Total
kable(rcc, align = c("l", "c", "c", "c"),    # display the roller coaster census data
      caption = "U.S. Roller Coaster Census")

U.S. Roller Coaster Census
Location	Steel	Wood	Total
California	78	6	84
Pennsylvania	37	18	55
Florida	47	3	50
Texas	44	4	48
New Jersey	41	2	43
New York	34	6	40
Ohio	32	8	40
Missouri	19	6	25
Illinois	19	4	23
Georgia	20	2	22
Virginia	18	4	22
Maryland	17	2	19
North Carolina	13	2	15
Colorado	12	2	14
Indiana	8	6	14
Massachusetts	13	1	14
Michigan	10	3	13
Minnesota	11	2	13
Tennessee	11	2	13
Wisconsin	6	6	12
Kentucky	8	3	11
Utah	9	1	10
Iowa	5	4	9
Alabama	7	1	8
Connecticut	5	3	8
New Hampshire	6	2	8
Oklahoma	7	1	8
New Mexico	6	1	7
Arkansas	5	1	6
Idaho	4	2	6
Maine	5	1	6
Washington	4	2	6
Kansas	5	0	5
Nevada	5	0	5
Oregon	5	0	5
Arizona	4	0	4
Louisiana	4	0	4
South Carolina	3	1	4
West Virginia	2	2	4
South Dakota	2	0	2
Mississippi	1	0	1
Nebraska	1	0	1
North Dakota	1	0	1

Notes

I was able to do almost everything I wanted to in R and R Markdown.

I’m very pleased to have found kable to format tibbles nicely. In order to make the .rmd reusable when the data changes I want to be able to output data values inline with the text, i.e. output the year for the oldest coaster in the middle of the sentece. I tried several different functions, print(), paste() and finally looked at the R Markdown cheatsheet and discovered the single backtick. That worked beautifully.
I haven’t been able to read a data file from GitHub, I tried including the RCurl library and the following code

#  rc <- read.csv(getURL("https://github.com/sopranomax/Data110_Projects/a5664bb01296baacdb5093bd523d8a0118b3a737/RollerCoaster.csv"))

but get the following error message.

Error in function (type, msg, asError = TRUE) : error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

References

Kaggle, kaggle.com;
- Roller coaster data, Kaggle roller coaster data
Roller Coaster Database, rcdb.com
Coasterpedia, coasterpedia.net
Ultimate Coaster, ultimaterollercoaster.com
tripsavvy, tripsavvy.com;
- Roller coaster terms, tripsavvy roller coaster terms

Resources

Github repository, https://github.com/sopranomax/Data110_Projects
- Original Roller Coaster dataset, RollerCoaster.csv
- Augmented Roller Coaster dataset, RollerCoaster_new.csv
- U.S. Roller Coaster Census dataset, USRollerCoasterCensus.csv
RPubs, Roller Coasters html