pacman::p_load(knitr, readr, RColorBrewer, tidyverse)

The starting data set

The original csv file of roller coaster data from Kaggle, Kaggle roller coaster data, is a tidy dataset with 14 variables in 157 rows.

rc <- as_tibble(read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/RollerCoaster/RollerCoaster.csv"))
str(rc)
## tibble [157 x 14] (S3: tbl_df/tbl/data.frame)
##  $ Coaster          : chr [1:157] "Zippin Pippin" "Jack Rabbit" "Thunderhawk" "Giant Dipper" ...
##  $ Park             : chr [1:157] "Libertyland" "Kennywood Park" "Dorney Park" "Santa Cruz Beach Boardwalk" ...
##  $ City             : chr [1:157] "Memphis" "West Mifflin" "Allentown" "Santa Cruz" ...
##  $ State            : chr [1:157] "Tennessee" "Pennsylvania" "Pennsylvania" "California" ...
##  $ Type             : chr [1:157] "Wooden" "Wooden" "Wooden" "Wooden" ...
##  $ Design           : chr [1:157] "Sit Down" "Sit Down" "Sit Down" "Sit Down" ...
##  $ Year_Opened      : int [1:157] 1915 1921 1923 1924 1924 1927 1935 1940 1946 1951 ...
##  $ Top_Speed        : int [1:157] 40 45 45 55 55 48 50 55 50 25 ...
##  $ Max_Height       : int [1:157] 70 40 80 70 70 85 55 90 84 37 ...
##  $ Drop             : int [1:157] 70 70 65 65 95 78 52 89 78 25 ...
##  $ Length           : int [1:157] 2865 2132 2767 2640 2887 2746 2650 2800 3360 1300 ...
##  $ Duration         : int [1:157] 90 96 78 112 101 75 105 120 105 84 ...
##  $ Inversions       : chr [1:157] "N" "N" "N" "N" ...
##  $ Num_of_Inversions: int [1:157] 0 0 0 0 0 0 0 0 0 0 ...

The data includes:

unique(rc$Design)  # display the categories for Design
## [1] "Sit Down"      "Suspended"     "Stand Up"      "Pipeline"     
## [5] "Inverted"      "Wing"          "Flying"        "4th Dimension"

31 rows have incomplete data, missing values in the Speed, Height, Drop, Length, Duration or Num_of_Inversions columns.

colSums(is.na(rc))  # count the number of na in each column
##           Coaster              Park              City             State 
##                 0                 0                 0                 0 
##              Type            Design       Year_Opened         Top_Speed 
##                 0                 0                 0                10 
##        Max_Height              Drop            Length          Duration 
##                 5                 3                 3                28 
##        Inversions Num_of_Inversions 
##                 0                 1

A little data research

The Roller Coaster Database shows 708 operating roller coasters in the United States. There’s isn’t any information on Kaggle about the selection criteria of the roller coasters included in the dataset so conclusions based on this dataset probably don’t reflect reality. I feel I can augment the data without angst, all in the name of fun and learning.

I used the Roller Coaster Database (RCDB), Wikipedia, Coasterpedia - Roller Coaster wiki and Ultimate Coaster to find some of the missing data values and to add a few more rows to the roller coaster dataset. My augmented dataset has 187 roller coasters.

# read RollerCoaster_new and change the column names to lower case

rc <- read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/RollerCoaster/RollerCoaster_new.csv") %>%
  as_tibble() %>%
  rename_with(tolower)      # rename the variables to lower case
str(rc)                     # look at the tibble
## tibble [187 x 14] (S3: tbl_df/tbl/data.frame)
##  $ coaster          : chr [1:187] "Adventure Express" "Afterburner" "Alpengeist" "American Eagle" ...
##  $ park             : chr [1:187] "Kings Island" "Fun Spot" "Busch Gardens Williamsburg" "Six Flags Great America" ...
##  $ city             : chr [1:187] "Kings Mills" "Angola" "Williamsburg" "Gurnee" ...
##  $ state            : chr [1:187] "Ohio" "Indiana" "Virginia" "Illinois" ...
##  $ type             : chr [1:187] "Steel" "Steel" "Steel" "Wood" ...
##  $ design           : chr [1:187] "Sit Down" "Sit Down" "Inverted" "Sit Down" ...
##  $ year_opened      : int [1:187] 1991 1991 1997 1981 1991 1999 2000 2005 1998 1999 ...
##  $ top_speed        : num [1:187] 35 45 67 66 50 73 50 64.8 70 81 ...
##  $ max_height       : num [1:187] 63 56 195 127 128 170 95 45 200 105 ...
##  $ drop             : num [1:187] 47 47 170 147 144 210 92 31 105 85 ...
##  $ length           : num [1:187] 2963 635 3828 4650 2700 ...
##  $ duration         : int [1:187] 140 66 190 143 110 135 115 64 48 120 ...
##  $ inversions       : chr [1:187] "N" "Y" "Y" "N" ...
##  $ num_of_inversions: int [1:187] 0 1 6 0 4 0 0 0 3 5 ...

But I still couldn’t find the drop height for ten of the roller coasters.

colSums(is.na(rc))  # count the number of na in each column
##           coaster              park              city             state 
##                 0                 0                 0                 0 
##              type            design       year_opened         top_speed 
##                 0                 0                 0                 0 
##        max_height              drop            length          duration 
##                 0                10                 0                 0 
##        inversions num_of_inversions 
##                 0                 0

Filling in all the drops

Rather than eliminate these ten rows, I’ll assign values for drop by using the average proportion of drop to max_height:

mean(rc$drop / rc$max_height, na.rm = TRUE)   # calc the average proportion of drop to height
## [1] 0.908926

This proportion is pretty close to one, I’ll look at the relationship of drop to max_height and see if it’s really that close to one.

rc %>% ggplot(aes(x = drop, y = max_height)) + 
         geom_point(na.rm = TRUE) +                          # dot plot drop against height 
         geom_smooth(na.rm = TRUE) +                         # and draw a line on the plot
         theme_minimal()                                     
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

It’s not quite a straight line, and there are a few points a distance from the line. Since mean is not resistant to outliers, I’ll see if there are outliers for the proportion.

dbh <-rc %>% select (drop, max_height) %>%               # create a tibble 
       mutate(dropbyheight = drop / max_height)          # with the proportion of drop to max_height
dbh %>% ggplot(aes(dropbyheight)) +                      # boxplot dropbyheight to see outliers
         geom_boxplot(outlier.color = "red", outlier.size = 2.5, na.rm = TRUE) +
         theme_minimal()

There are both low and high outliers. I’ll calculate the cutoffs for eliminating the the low and high outliers using the standard 25th percentile - IQR * 1.5 and 75th percentile + IQR * 1.5.

lb <- quantile(dbh$dropbyheight, 0.25, na.rm = TRUE) -     # calculate the cutoff for low outliers
      IQR(dbh$dropbyheight, na.rm = TRUE) * 1.5

ub <- quantile(dbh$dropbyheight,0.75, na.rm = TRUE) +      # calculate the cutoff for high outliers
      IQR(dbh$dropbyheight, na.rm = TRUE) * 1.5 

x <- dbh %>% filter(dropbyheight > lb & dropbyheight < ub, na.rm = TRUE)    # filter out the outliers
xhd <-  mean(x$dropbyheight)                                  # calc the avg. proportion of drop / max_height
xhd
## [1] 0.8978595

This looks a little better, so I’ll use this average proportion to fill in the missing drop values.

x <- rc$max_height * xhd                        # create a vector of drop values using height * avg. proportion
rc$drop <- coalesce(rc$drop, x)                 # fill in the empty drop values in drop
colSums(is.na(rc))                              # check that there are no more missing values
##           coaster              park              city             state 
##                 0                 0                 0                 0 
##              type            design       year_opened         top_speed 
##                 0                 0                 0                 0 
##        max_height              drop            length          duration 
##                 0                 0                 0                 0 
##        inversions num_of_inversions 
##                 0                 0

Adding design as numeric

The design of a roller coaster has an impact on the ride experience. In order to include it in the Fun Factor calculation it needs to have a numeric value. The categories for design and their descriptions are:

unique(rc$design)   # display the categories for design
## [1] "Sit Down"      "Inverted"      "Wing"          "Suspended"    
## [5] "Stand Up"      "Flying"        "Pipeline"      "4th Dimension"

Sit Down - a traditional roller coaster ridden while sitting down.

Bobsled - designed like a bobsled run – without a fixed track. The train travels freely through a trough.

Stand Up - a coaster ridden while standing up instead of sitting down.

Flying - meant to simulate the sensations of flight by with riders in a prone superhero-like position.

Pipeline - a coaster where riders are positioned between the rails instead of above or below.

Inverted - a roller coaster which uses trains traveling beneath, rather than on top of, the track. Unlike a suspended roller coaster, an inverted roller coaster’s trains don’t pivot freely.

Suspended - a roller coaster using trains which travel beneath the track and pivot on a swinging arm from side to side, exaggerating the track’s banks and turns.

Wing - pairs of riders sit on either side of a roller coaster track in which nothing is above or below the riders.

4th Dimension - riders are rotated independently of the orientation of the track, generally about a horizontal axis that is perpendicular to the track.

Let’s see how many of each design are in the roller coaster dataset:

rc %>%  ggplot(aes(design)) +                                                   # bar plot for design variable
        geom_bar(position = "identity", alpha = 0.6) +
        theme_minimal()

Sit Down coasters are a definite majority, there’s only one 4th Dimension, about 25 Inverted, and a few of most of the other designs. Based on this diagram, using the design descriptions and continuing with angst-free data augmentation, I’m assigning the following numeric values for design:

Sit Down = 100

Bobsled = 200

Inverted = 300

Stand Up = 400

Suspended = 450

Flying = 500

Pipeline = 550

Wing = 600

4th Dimension = 700

rc <- rc %>% mutate(dsgn_val =                           # create a new variable and 
             ifelse(design == "Sit Down", 100,           # assign numeric values to each
             ifelse(design == "Bobsled", 200,            # design category in the tibble
             ifelse(design == "Inverted", 300,
             ifelse(design == "Stand Up", 400,
             ifelse(design == "Suspended", 450,
             ifelse(design == "Flying", 500,
             ifelse(design == "Pipeline", 550, 
             ifelse(design == "Wing", 600,
             ifelse(design == "4th Dimension", 700, 100)))))))))
             )

Roller Coaster Fun Factor

Now I’m ready to calculate the Fun Factor. Inversions are a big element in roller coaster excitement. But Number of Inversions is a single digit in all of the rows and it won’t have a very big impact on the total Fun Factor. So I’ll multiply the number of inversions by 100 to give those coasters their due score.

Fun Factor is the sum of speed + height + drop + length + duration + 100 * number of inversions + design value. I’m expecting a wide spread in Fun Factor. Most of the variables in the data have wide ranges. There are older roller coasters and kiddie roller coasters included in the data that are shorter, have slower speeds and don’t have inversions.

summary(rc)
##    coaster              park               city              state          
##  Length:187         Length:187         Length:187         Length:187        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      type              design           year_opened     top_speed    
##  Length:187         Length:187         Min.   :1915   Min.   :  6.0  
##  Class :character   Class :character   1st Qu.:1988   1st Qu.: 47.4  
##  Mode  :character   Mode  :character   Median :1997   Median : 55.0  
##                                        Mean   :1991   Mean   : 55.3  
##                                        3rd Qu.:2000   3rd Qu.: 65.0  
##                                        Max.   :2016   Max.   :120.0  
##    max_height         drop           length        duration    
##  Min.   : 18.0   Min.   :  9.0   Min.   : 200   Min.   : 28.0  
##  1st Qu.: 78.0   1st Qu.: 70.0   1st Qu.:2131   1st Qu.: 92.0  
##  Median :105.0   Median : 92.0   Median :2767   Median :120.0  
##  Mean   :118.6   Mean   :109.3   Mean   :2973   Mean   :122.9  
##  3rd Qu.:149.5   3rd Qu.:141.0   3rd Qu.:3886   3rd Qu.:150.0  
##  Max.   :420.0   Max.   :400.0   Max.   :7359   Max.   :420.0  
##   inversions        num_of_inversions    dsgn_val    
##  Length:187         Min.   :0.000     Min.   :100.0  
##  Class :character   1st Qu.:0.000     1st Qu.:100.0  
##  Mode  :character   Median :0.000     Median :100.0  
##                     Mean   :1.668     Mean   :158.6  
##                     3rd Qu.:3.000     3rd Qu.:100.0  
##                     Max.   :7.000     Max.   :700.0
rc <- rc %>%
      mutate(fun = top_speed + max_height + drop + length + duration +    # calculate the fun factor
               (num_of_inversions * 100) +  dsgn_val) 
rc %>%  ggplot(aes(fun, fill = type)) +                                       # plot a histogram of fun factor
        geom_histogram(position = "identity", alpha = 0.6, color = "white") +     # outline the bars in white
        labs(x = "Fun Factor", y = "Coaster Count") +                              # label the x & y axis
        scale_fill_discrete(name = "Construction", type = c("steelblue4", "brown4")) + # label the legend 
        theme_minimal()                                                              # and use descriptive colors
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The top roller coaster

top20 <- arrange(rc, desc(fun)) %>%                              # sort rc descending by fun factor
         select(fun, coaster, state, park, year_opened,          # select columns     
                length, design, dsgn_val, num_of_inversions) %>% 
         head(20)                                                # grab the top 20

The coaster with the highest fun factor is wooden! It’s the Beast at Paramounts Kings Island in Ohio, and it’s the longest roller coaster in the data. The second highest fun factor is the Son Of Beast, the second longest and also at Paramounts Kings Island. Folks in Ohio can have lots of coasting fun, six of the longest coasters are in Ohio.

kable(top20, align = 'c', row.names = TRUE,                      # display top 20
       col.names = c("Fun Factor", "Coaster", "State", "Park",  
                     "Year", "Length", "Design", "Design Value", "# Inversions"),
      caption = "Top 20 Roller Coasters")
Top 20 Roller Coasters
Fun Factor Coaster State Park Year Length Design Design Value # Inversions
1 8025 Beast Ohio Paramounts Kings Island 1979 7359 Sit Down 100 0
2 7882 Son Of Beast Ohio Paramounts Kings Island 2000 7032 Sit Down 100 1
3 7647 Fury 325 North Carolina Carowinds 2015 6602 Sit Down 100 0
4 7538 Millennium Force Ohio Cedar Point 2000 6595 Sit Down 100 0
5 7087 Voyage Indiana Holiday World 2006 6442 Sit Down 100 0
6 6713 California Screamin California Disneys California Adventure 2001 6072 Sit Down 100 1
7 6620 Desperado Nevada Buffalo Bills Resort & Casino 1994 5843 Sit Down 100 0
8 6365 Mamba Missouri Worlds of Fun 1998 5600 Sit Down 100 0
9 6360 Steel Force Pennsylvania Dorney Park 1997 5600 Sit Down 100 0
10 6217 Wild Thing Minnesota Valleyfair! 1996 5460 Sit Down 100 0
11 6207 Titan Texas Six Flags Over Texas 2001 5312 Sit Down 100 0
12 6161 Superman - Ride Of Steel Massachusetts Six Flags New England 2000 5400 Sit Down 100 0
13 6159 Nitro New Jersey Six Flags Great Adventure 2001 5394 Sit Down 100 0
14 6147 Intimidator North Carolina Carowinds 2010 5316 Sit Down 100 0
15 6108 Superman - Ride Of Steel New York Six Flags Darien Lake 1999 5400 Sit Down 100 0
16 6101 Mean Streak Ohio Cedar Point 1991 5427 Sit Down 100 0
17 6087 Diamondback Ohio Kings Island 2009 5282 Sit Down 100 0
18 6043 Superman - Ride Of Steel Maryland Six Flags America 2000 5350 Sit Down 100 0
19 5917 Riddlers Revenge California Six Flags Magic Mountain 1998 4370 Stand Up 400 6
20 5798 Magnum XL-200 Ohio Cedar Point 1989 5106 Sit Down 100 0

Length is the biggest contributing component in Fun Factor. I don’t think one component should outweigh the others in the calculation. I’ll weight length, as I did for Number of Inversions, so it contributes to Fun Factor more equally. The longer lengths are in thousands of feet, so I’ll divide it by 10 to bring it down into scale with the other components. Dividing length by 10 gives the following histogram for Fun Factor.

colSums(is.na(rc)) 
##           coaster              park              city             state 
##                 0                 0                 0                 0 
##              type            design       year_opened         top_speed 
##                 0                 0                 0                 0 
##        max_height              drop            length          duration 
##                 0                 0                 0                 0 
##        inversions num_of_inversions          dsgn_val               fun 
##                 0                 0                 0                 0
rc <- rc %>%                                                              
      mutate(fun = top_speed + max_height + drop +    # re-calculate fun factor       
               (length / 10) + duration +             # with lenght / 10
               (num_of_inversions * 100) +  dsgn_val) 
rc %>%  ggplot(aes(fun, fill = type)) +                                       # plot a new histogram of fun factor
        geom_histogram(position = "identity", alpha = 0.6, color = "white") +      # outline the bars
        labs(x = "Fun Factor", y = "Coaster Count") +                              # label the x & y axis
        scale_fill_discrete(name = "Construction", type = c("steelblue4", "brown4"))  + # label the legend 
        theme_minimal()                                                              # and use descriptive colors
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

top20 <- arrange(rc, desc(fun)) %>%                              # sort rc descending by fun factor
         rowid_to_column() %>%                                         # add row ids as a column
         select(rowid, fun, coaster, state, park, year_opened,          # select columns     
                length, design, dsgn_val, num_of_inversions) %>% 
         head(20)                                                # grab the top 20
noi <- top20 %>%
       filter(num_of_inversions == 0)
noi
## # A tibble: 2 x 10
##   rowid   fun coaster      state      park    year_opened length design dsgn_val
##   <int> <dbl> <chr>        <chr>      <chr>         <int>  <dbl> <chr>     <dbl>
## 1    12 1705. Fury 325     North Car~ Carowi~        2015   6602 Sit D~      100
## 2    19 1602. Millennium ~ Ohio       Cedar ~        2000   6595 Sit D~      100
## # ... with 1 more variable: num_of_inversions <int>

Now the top coasters are steel, have higher design values and more inversions. But longer, sit down coasters with no inversions also show up in the top 20, Fury 325 at 12 and Millennium Force at 19. So I think the new Fun Factor calculation gives a better representation for each component.

kable(top20, align = 'c',                                                          # display top 20
       col.names = c("Row#", "Fun Factor", "Coaster", "State", "Park",  
                     "Year", "Length", "Design", "Design Value", "# Inversions"),
       caption = "Top 20 Roller Coasters - scaled Fun Factor")
Top 20 Roller Coasters - scaled Fun Factor
Row# Fun Factor Coaster State Park Year Length Design Design Value # Inversions
1 2147.0 X2 California Six Flags Magic Mountain 2002 3610 4th Dimension 700 2
2 1984.0 Riddlers Revenge California Six Flags Magic Mountain 1998 4370 Stand Up 400 6
3 1916.3 Montu Florida Busch Gardens Tampa 1996 3983 Inverted 300 7
4 1904.8 Alpengeist Virginia Busch Gardens Williamsburg 1997 3828 Inverted 300 6
5 1826.5 Chang Kentucky Six Flags Kentucky Kingdom 1997 4155 Stand Up 400 5
6 1762.0 Viper California Six Flags Magic Mountain 1990 3830 Sit Down 100 7
7 1753.7 Medusa California Six Flags Discovery Kingdom 2000 3937 Sit Down 100 7
8 1732.5 Scream! California Six Flags Magic Mountain 2003 3985 Sit Down 100 7
9 1728.0 Raptor Ohio Cedar Point 1994 3790 Inverted 300 6
10 1716.0 Great American Scream Machine New Jersey Six Flags Great Adventure 1989 3800 Sit Down 100 7
11 1709.8 Kumba Florida Busch Gardens Tampa 1993 3978 Sit Down 100 7
12 1705.2 Fury 325 North Carolina Carowinds 2015 6602 Sit Down 100 0
13 1697.7 Kraken Florida SeaWorld Orlando 2000 4177 Sit Down 100 7
14 1692.0 Mantis Ohio Cedar Point 1996 3900 Stand Up 400 4
15 1673.5 Medusa New Jersey Six Flags Great Adventure 1999 3985 Sit Down 100 7
16 1672.5 Silver Bullet California Knott’s Berry Farm 2004 3125 Inverted 300 6
17 1658.5 Superman Krypton Coaster Texas Six Flags Fiesta Texas 2000 4025 Sit Down 100 6
18 1633.9 Batman The Ride Texas Six Flags Fiesta Texas 2015 1019 Wing 600 6
19 1602.5 Millennium Force Ohio Cedar Point 2000 6595 Sit Down 100 0
20 1587.0 Incredible Hulk Florida Universal Studios Islands of Adventure 1999 3700 Sit Down 100 7

Older and kiddie coasters

low10 <- arrange(rc, desc(fun)) %>%                                       # sort rc descending by fun factor
         rowid_to_column() %>%                                            # add row ids as a column
         select(rowid, fun, coaster, state, park, type, year_opened) %>%  # select a subset of columns
         tail(10)                                                         # assign the last 10 row to low10

oldest <- arrange(rc, desc(fun)) %>%                 # sort rc descending by fun factor
          rowid_to_column() %>%                      # add row ids as a column
          filter(year_opened == min(year_opened))    # find the coaster with the earliest year

As I expected, coasters at the low end are older or geared to the very young. However the oldest coaster Zippin Pippin in Tennessee, built in 1915, is ranked 150 out of 187 and has a fun factor of 656.5 due to it’s length of 2865 feet.

kable(low10, align = c('c','l','l','l','c','c'),                        # display low10
       col.names = c("Row#", "Fun Factor", "Coaster", "State", "Park",  
                     "Type", "Year"),
       caption = "Bottom 10 Coasters")
Bottom 10 Coasters
Row# Fun Factor Coaster State Park Type Year
178 401.0000 Comet Pennsylvania Waldameer Park Wood 1951
179 377.1000 Woodstock Express Ohio Cedar Point Steel 1999
180 377.0000 Bobsleds New York Seabreeze Steel 1962
181 365.2000 Leap The Dips Pennsylvania Lakemont Park Wood 1999
182 362.7122 Wild Chipmunk Colorado Lakeside Amusement Park Steel 1955
183 286.5703 Gadget’s Go Coaster California Disneyland Steel 1993
184 281.5200 Merlin’s Revenge California Castle Amusement Park Steel 2001
185 280.0000 High Speed Thrill Coaster Pennsylvania Knoebels Steel 1955
186 279.5200 Spacely’s Sprocket Rockets Illinois Six Flags Great America Steel 1998
187 236.3593 Jr. Gemini Ohio Cedar Point Steel 1979
kable(oldest, align = 'c', caption = "Oldest Coaster") # display it
Oldest Coaster
rowid coaster park city state type design year_opened top_speed max_height drop length duration inversions num_of_inversions dsgn_val fun
150 Zippin Pippin Libertyland Memphis Tennessee Wood Sit Down 1915 40 70 70 2865 90 N 0 100 656.5

Where the fun is

To get a more realistic view of the distribution of roller coasters in the U.S., I used RCDB’s census page where I could search for the total count of operating roller coasters by state.

rc1 <- as_tibble(read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/RollerCoaster/USRollerCoasterCensus.csv"))
str(rc1)
## tibble [43 x 4] (S3: tbl_df/tbl/data.frame)
##  $ Location: chr [1:43] "Alabama" "Arizona" "Arkansas" "California" ...
##  $ Steel   : int [1:43] 7 4 5 78 12 5 47 20 4 19 ...
##  $ Wood    : int [1:43] 1 0 1 6 2 3 3 2 2 4 ...
##  $ Total   : int [1:43] 8 4 6 84 14 8 50 22 6 23 ...
rcc <-  arrange(rc1, Total, Location)         # sort by total coasters
row.names(rcc) <- rcc$Location       # rename the rows according to the state
## Warning: Setting row names on a tibble is deprecated.
rc_mat <- data.matrix(rcc)            # convert dataframe to a matrix  (required by heatmap)
rc_mat_ <- rc_mat[,2:4]     # remove the redundant column of state

# Call heatmap using a ColorBrewer color set, margins=c(7,10) for aspect ratio, titles of graph, x and y labels, 
# font size of x and y labels, and set up a RowSideColors bar

varcols = setNames(colorRampPalette(brewer.pal(nrow(rc_mat_), "BuPu"))(nrow(rc_mat_)), 
                  rownames(rc_mat_))   # parameter for RowSideColors
## Warning in brewer.pal(nrow(rc_mat_), "BuPu"): n too large, allowed maximum for palette BuPu is 9
## Returning the palette you asked for with that many colors
heatmap(rc_mat_, 
        Rowv = NA, Colv = NA, 
        col= colorRampPalette(brewer.pal(nrow(rc_mat_), "BuPu"))(nrow(rc_mat_)),
        s=0.6, v=1, scale="column", 
        margins=c(5,8), 
        main = "Where the Coasters are", 
        xlab = "Roller Coasters", 
        ylab="State",
        labCol = c("Steel","Wood", "Total"),
        cexCol=1, 
        RowSideColors = varcols)          
## layout: widths =  0.05 0.2 4 , heights =  0.25 4 ; lmat=
##      [,1] [,2] [,3]
## [1,]    0    0    4
## [2,]    3    1    2
## Warning in brewer.pal(nrow(rc_mat_), "BuPu"): n too large, allowed maximum for palette BuPu is 9
## Returning the palette you asked for with that many colors

California at the top isn’t surprising, there are a number of amusement parks in LA, Orange and San Diego counties, and the temperate climate means the parks can be open year round (except that all amusement parks are closed right now due to Covid 19). Pennsylvania has five more operating coasters than Florida, which I didn’t expect. Almost a third of them are wooden (Pennsylvania has the most wooden roller coasters) and ten of them were opened before 1955. The oldest operating roller coaster in Florida opened in 1972.

Seven states don’t have any roller coasters, Delaware, Rhode Island and Vermont are small and close to states with amusement parks. Montana and Wyoming are probably too sparsely populated, and Alaska is probably too cold. Disney runs the Aulani resort in Hawaii, but it doesn’t have a roller coaster. I doubt anyone misses them in the island paradise.

rcc <- arrange(rc1, desc(Total))             # sort the roller coaster census descending by Total
kable(rcc, align = c("l", "c", "c", "c"),    # display the roller coaster census data
      caption = "U.S. Roller Coaster Census")
U.S. Roller Coaster Census
Location Steel Wood Total
California 78 6 84
Pennsylvania 37 18 55
Florida 47 3 50
Texas 44 4 48
New Jersey 41 2 43
New York 34 6 40
Ohio 32 8 40
Missouri 19 6 25
Illinois 19 4 23
Georgia 20 2 22
Virginia 18 4 22
Maryland 17 2 19
North Carolina 13 2 15
Colorado 12 2 14
Indiana 8 6 14
Massachusetts 13 1 14
Michigan 10 3 13
Minnesota 11 2 13
Tennessee 11 2 13
Wisconsin 6 6 12
Kentucky 8 3 11
Utah 9 1 10
Iowa 5 4 9
Alabama 7 1 8
Connecticut 5 3 8
New Hampshire 6 2 8
Oklahoma 7 1 8
New Mexico 6 1 7
Arkansas 5 1 6
Idaho 4 2 6
Maine 5 1 6
Washington 4 2 6
Kansas 5 0 5
Nevada 5 0 5
Oregon 5 0 5
Arizona 4 0 4
Louisiana 4 0 4
South Carolina 3 1 4
West Virginia 2 2 4
South Dakota 2 0 2
Mississippi 1 0 1
Nebraska 1 0 1
North Dakota 1 0 1

Notes

I was able to do almost everything I wanted to in R and R Markdown.

  1. I’m very pleased to have found kable to format tibbles nicely. In order to make the .rmd reusable when the data changes I want to be able to output data values inline with the text, i.e. output the year for the oldest coaster in the middle of the sentece. I tried several different functions, print(), paste() and finally looked at the R Markdown cheatsheet and discovered the single backtick. That worked beautifully.

  2. I haven’t been able to read a data file from GitHub, I tried including the RCurl library and the following code

#  rc <- read.csv(getURL("https://github.com/sopranomax/Data110_Projects/a5664bb01296baacdb5093bd523d8a0118b3a737/RollerCoaster.csv"))

but get the following error message.

Error in function (type, msg, asError = TRUE) : error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

References

  1. Kaggle, kaggle.com;
  2. Roller Coaster Database, rcdb.com
  3. Coasterpedia, coasterpedia.net
  4. Ultimate Coaster, ultimaterollercoaster.com
  5. tripsavvy, tripsavvy.com;

Resources

  1. Github repository, https://github.com/sopranomax/Data110_Projects
  2. RPubs, Roller Coasters html