pacman::p_load(knitr, readr, RColorBrewer, tidyverse)
The original csv file of roller coaster data from Kaggle, Kaggle roller coaster data, is a tidy dataset with 14 variables in 157 rows.
rc <- as_tibble(read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/RollerCoaster/RollerCoaster.csv"))
str(rc)
## tibble [157 x 14] (S3: tbl_df/tbl/data.frame)
## $ Coaster : chr [1:157] "Zippin Pippin" "Jack Rabbit" "Thunderhawk" "Giant Dipper" ...
## $ Park : chr [1:157] "Libertyland" "Kennywood Park" "Dorney Park" "Santa Cruz Beach Boardwalk" ...
## $ City : chr [1:157] "Memphis" "West Mifflin" "Allentown" "Santa Cruz" ...
## $ State : chr [1:157] "Tennessee" "Pennsylvania" "Pennsylvania" "California" ...
## $ Type : chr [1:157] "Wooden" "Wooden" "Wooden" "Wooden" ...
## $ Design : chr [1:157] "Sit Down" "Sit Down" "Sit Down" "Sit Down" ...
## $ Year_Opened : int [1:157] 1915 1921 1923 1924 1924 1927 1935 1940 1946 1951 ...
## $ Top_Speed : int [1:157] 40 45 45 55 55 48 50 55 50 25 ...
## $ Max_Height : int [1:157] 70 40 80 70 70 85 55 90 84 37 ...
## $ Drop : int [1:157] 70 70 65 65 95 78 52 89 78 25 ...
## $ Length : int [1:157] 2865 2132 2767 2640 2887 2746 2650 2800 3360 1300 ...
## $ Duration : int [1:157] 90 96 78 112 101 75 105 120 105 84 ...
## $ Inversions : chr [1:157] "N" "N" "N" "N" ...
## $ Num_of_Inversions: int [1:157] 0 0 0 0 0 0 0 0 0 0 ...
The data includes:
unique(rc$Design) # display the categories for Design
## [1] "Sit Down" "Suspended" "Stand Up" "Pipeline"
## [5] "Inverted" "Wing" "Flying" "4th Dimension"
31 rows have incomplete data, missing values in the Speed, Height, Drop, Length, Duration or Num_of_Inversions columns.
colSums(is.na(rc)) # count the number of na in each column
## Coaster Park City State
## 0 0 0 0
## Type Design Year_Opened Top_Speed
## 0 0 0 10
## Max_Height Drop Length Duration
## 5 3 3 28
## Inversions Num_of_Inversions
## 0 1
The Roller Coaster Database shows 708 operating roller coasters in the United States. There’s isn’t any information on Kaggle about the selection criteria of the roller coasters included in the dataset so conclusions based on this dataset probably don’t reflect reality. I feel I can augment the data without angst, all in the name of fun and learning.
I used the Roller Coaster Database (RCDB), Wikipedia, Coasterpedia - Roller Coaster wiki and Ultimate Coaster to find some of the missing data values and to add a few more rows to the roller coaster dataset. My augmented dataset has 187 roller coasters.
# read RollerCoaster_new and change the column names to lower case
rc <- read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/RollerCoaster/RollerCoaster_new.csv") %>%
as_tibble() %>%
rename_with(tolower) # rename the variables to lower case
str(rc) # look at the tibble
## tibble [187 x 14] (S3: tbl_df/tbl/data.frame)
## $ coaster : chr [1:187] "Adventure Express" "Afterburner" "Alpengeist" "American Eagle" ...
## $ park : chr [1:187] "Kings Island" "Fun Spot" "Busch Gardens Williamsburg" "Six Flags Great America" ...
## $ city : chr [1:187] "Kings Mills" "Angola" "Williamsburg" "Gurnee" ...
## $ state : chr [1:187] "Ohio" "Indiana" "Virginia" "Illinois" ...
## $ type : chr [1:187] "Steel" "Steel" "Steel" "Wood" ...
## $ design : chr [1:187] "Sit Down" "Sit Down" "Inverted" "Sit Down" ...
## $ year_opened : int [1:187] 1991 1991 1997 1981 1991 1999 2000 2005 1998 1999 ...
## $ top_speed : num [1:187] 35 45 67 66 50 73 50 64.8 70 81 ...
## $ max_height : num [1:187] 63 56 195 127 128 170 95 45 200 105 ...
## $ drop : num [1:187] 47 47 170 147 144 210 92 31 105 85 ...
## $ length : num [1:187] 2963 635 3828 4650 2700 ...
## $ duration : int [1:187] 140 66 190 143 110 135 115 64 48 120 ...
## $ inversions : chr [1:187] "N" "Y" "Y" "N" ...
## $ num_of_inversions: int [1:187] 0 1 6 0 4 0 0 0 3 5 ...
But I still couldn’t find the drop height for ten of the roller coasters.
colSums(is.na(rc)) # count the number of na in each column
## coaster park city state
## 0 0 0 0
## type design year_opened top_speed
## 0 0 0 0
## max_height drop length duration
## 0 10 0 0
## inversions num_of_inversions
## 0 0
Rather than eliminate these ten rows, I’ll assign values for drop by using the average proportion of drop to max_height:
mean(rc$drop / rc$max_height, na.rm = TRUE) # calc the average proportion of drop to height
## [1] 0.908926
This proportion is pretty close to one, I’ll look at the relationship of drop to max_height and see if it’s really that close to one.
rc %>% ggplot(aes(x = drop, y = max_height)) +
geom_point(na.rm = TRUE) + # dot plot drop against height
geom_smooth(na.rm = TRUE) + # and draw a line on the plot
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It’s not quite a straight line, and there are a few points a distance from the line. Since mean is not resistant to outliers, I’ll see if there are outliers for the proportion.
dbh <-rc %>% select (drop, max_height) %>% # create a tibble
mutate(dropbyheight = drop / max_height) # with the proportion of drop to max_height
dbh %>% ggplot(aes(dropbyheight)) + # boxplot dropbyheight to see outliers
geom_boxplot(outlier.color = "red", outlier.size = 2.5, na.rm = TRUE) +
theme_minimal()
There are both low and high outliers. I’ll calculate the cutoffs for eliminating the the low and high outliers using the standard 25th percentile - IQR * 1.5 and 75th percentile + IQR * 1.5.
lb <- quantile(dbh$dropbyheight, 0.25, na.rm = TRUE) - # calculate the cutoff for low outliers
IQR(dbh$dropbyheight, na.rm = TRUE) * 1.5
ub <- quantile(dbh$dropbyheight,0.75, na.rm = TRUE) + # calculate the cutoff for high outliers
IQR(dbh$dropbyheight, na.rm = TRUE) * 1.5
x <- dbh %>% filter(dropbyheight > lb & dropbyheight < ub, na.rm = TRUE) # filter out the outliers
xhd <- mean(x$dropbyheight) # calc the avg. proportion of drop / max_height
xhd
## [1] 0.8978595
This looks a little better, so I’ll use this average proportion to fill in the missing drop values.
x <- rc$max_height * xhd # create a vector of drop values using height * avg. proportion
rc$drop <- coalesce(rc$drop, x) # fill in the empty drop values in drop
colSums(is.na(rc)) # check that there are no more missing values
## coaster park city state
## 0 0 0 0
## type design year_opened top_speed
## 0 0 0 0
## max_height drop length duration
## 0 0 0 0
## inversions num_of_inversions
## 0 0
The design of a roller coaster has an impact on the ride experience. In order to include it in the Fun Factor calculation it needs to have a numeric value. The categories for design and their descriptions are:
unique(rc$design) # display the categories for design
## [1] "Sit Down" "Inverted" "Wing" "Suspended"
## [5] "Stand Up" "Flying" "Pipeline" "4th Dimension"
Sit Down - a traditional roller coaster ridden while sitting down.
Bobsled - designed like a bobsled run – without a fixed track. The train travels freely through a trough.
Stand Up - a coaster ridden while standing up instead of sitting down.
Flying - meant to simulate the sensations of flight by with riders in a prone superhero-like position.
Pipeline - a coaster where riders are positioned between the rails instead of above or below.
Inverted - a roller coaster which uses trains traveling beneath, rather than on top of, the track. Unlike a suspended roller coaster, an inverted roller coaster’s trains don’t pivot freely.
Suspended - a roller coaster using trains which travel beneath the track and pivot on a swinging arm from side to side, exaggerating the track’s banks and turns.
Wing - pairs of riders sit on either side of a roller coaster track in which nothing is above or below the riders.
4th Dimension - riders are rotated independently of the orientation of the track, generally about a horizontal axis that is perpendicular to the track.
Let’s see how many of each design are in the roller coaster dataset:
rc %>% ggplot(aes(design)) + # bar plot for design variable
geom_bar(position = "identity", alpha = 0.6) +
theme_minimal()
Sit Down coasters are a definite majority, there’s only one 4th Dimension, about 25 Inverted, and a few of most of the other designs. Based on this diagram, using the design descriptions and continuing with angst-free data augmentation, I’m assigning the following numeric values for design:
Sit Down = 100
Bobsled = 200
Inverted = 300
Stand Up = 400
Suspended = 450
Flying = 500
Pipeline = 550
Wing = 600
4th Dimension = 700
rc <- rc %>% mutate(dsgn_val = # create a new variable and
ifelse(design == "Sit Down", 100, # assign numeric values to each
ifelse(design == "Bobsled", 200, # design category in the tibble
ifelse(design == "Inverted", 300,
ifelse(design == "Stand Up", 400,
ifelse(design == "Suspended", 450,
ifelse(design == "Flying", 500,
ifelse(design == "Pipeline", 550,
ifelse(design == "Wing", 600,
ifelse(design == "4th Dimension", 700, 100)))))))))
)
Now I’m ready to calculate the Fun Factor. Inversions are a big element in roller coaster excitement. But Number of Inversions is a single digit in all of the rows and it won’t have a very big impact on the total Fun Factor. So I’ll multiply the number of inversions by 100 to give those coasters their due score.
Fun Factor is the sum of speed + height + drop + length + duration + 100 * number of inversions + design value. I’m expecting a wide spread in Fun Factor. Most of the variables in the data have wide ranges. There are older roller coasters and kiddie roller coasters included in the data that are shorter, have slower speeds and don’t have inversions.
summary(rc)
## coaster park city state
## Length:187 Length:187 Length:187 Length:187
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## type design year_opened top_speed
## Length:187 Length:187 Min. :1915 Min. : 6.0
## Class :character Class :character 1st Qu.:1988 1st Qu.: 47.4
## Mode :character Mode :character Median :1997 Median : 55.0
## Mean :1991 Mean : 55.3
## 3rd Qu.:2000 3rd Qu.: 65.0
## Max. :2016 Max. :120.0
## max_height drop length duration
## Min. : 18.0 Min. : 9.0 Min. : 200 Min. : 28.0
## 1st Qu.: 78.0 1st Qu.: 70.0 1st Qu.:2131 1st Qu.: 92.0
## Median :105.0 Median : 92.0 Median :2767 Median :120.0
## Mean :118.6 Mean :109.3 Mean :2973 Mean :122.9
## 3rd Qu.:149.5 3rd Qu.:141.0 3rd Qu.:3886 3rd Qu.:150.0
## Max. :420.0 Max. :400.0 Max. :7359 Max. :420.0
## inversions num_of_inversions dsgn_val
## Length:187 Min. :0.000 Min. :100.0
## Class :character 1st Qu.:0.000 1st Qu.:100.0
## Mode :character Median :0.000 Median :100.0
## Mean :1.668 Mean :158.6
## 3rd Qu.:3.000 3rd Qu.:100.0
## Max. :7.000 Max. :700.0
rc <- rc %>%
mutate(fun = top_speed + max_height + drop + length + duration + # calculate the fun factor
(num_of_inversions * 100) + dsgn_val)
rc %>% ggplot(aes(fun, fill = type)) + # plot a histogram of fun factor
geom_histogram(position = "identity", alpha = 0.6, color = "white") + # outline the bars in white
labs(x = "Fun Factor", y = "Coaster Count") + # label the x & y axis
scale_fill_discrete(name = "Construction", type = c("steelblue4", "brown4")) + # label the legend
theme_minimal() # and use descriptive colors
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
top20 <- arrange(rc, desc(fun)) %>% # sort rc descending by fun factor
select(fun, coaster, state, park, year_opened, # select columns
length, design, dsgn_val, num_of_inversions) %>%
head(20) # grab the top 20
The coaster with the highest fun factor is wooden! It’s the Beast at Paramounts Kings Island in Ohio, and it’s the longest roller coaster in the data. The second highest fun factor is the Son Of Beast, the second longest and also at Paramounts Kings Island. Folks in Ohio can have lots of coasting fun, six of the longest coasters are in Ohio.
kable(top20, align = 'c', row.names = TRUE, # display top 20
col.names = c("Fun Factor", "Coaster", "State", "Park",
"Year", "Length", "Design", "Design Value", "# Inversions"),
caption = "Top 20 Roller Coasters")
| Fun Factor | Coaster | State | Park | Year | Length | Design | Design Value | # Inversions | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 8025 | Beast | Ohio | Paramounts Kings Island | 1979 | 7359 | Sit Down | 100 | 0 |
| 2 | 7882 | Son Of Beast | Ohio | Paramounts Kings Island | 2000 | 7032 | Sit Down | 100 | 1 |
| 3 | 7647 | Fury 325 | North Carolina | Carowinds | 2015 | 6602 | Sit Down | 100 | 0 |
| 4 | 7538 | Millennium Force | Ohio | Cedar Point | 2000 | 6595 | Sit Down | 100 | 0 |
| 5 | 7087 | Voyage | Indiana | Holiday World | 2006 | 6442 | Sit Down | 100 | 0 |
| 6 | 6713 | California Screamin | California | Disneys California Adventure | 2001 | 6072 | Sit Down | 100 | 1 |
| 7 | 6620 | Desperado | Nevada | Buffalo Bills Resort & Casino | 1994 | 5843 | Sit Down | 100 | 0 |
| 8 | 6365 | Mamba | Missouri | Worlds of Fun | 1998 | 5600 | Sit Down | 100 | 0 |
| 9 | 6360 | Steel Force | Pennsylvania | Dorney Park | 1997 | 5600 | Sit Down | 100 | 0 |
| 10 | 6217 | Wild Thing | Minnesota | Valleyfair! | 1996 | 5460 | Sit Down | 100 | 0 |
| 11 | 6207 | Titan | Texas | Six Flags Over Texas | 2001 | 5312 | Sit Down | 100 | 0 |
| 12 | 6161 | Superman - Ride Of Steel | Massachusetts | Six Flags New England | 2000 | 5400 | Sit Down | 100 | 0 |
| 13 | 6159 | Nitro | New Jersey | Six Flags Great Adventure | 2001 | 5394 | Sit Down | 100 | 0 |
| 14 | 6147 | Intimidator | North Carolina | Carowinds | 2010 | 5316 | Sit Down | 100 | 0 |
| 15 | 6108 | Superman - Ride Of Steel | New York | Six Flags Darien Lake | 1999 | 5400 | Sit Down | 100 | 0 |
| 16 | 6101 | Mean Streak | Ohio | Cedar Point | 1991 | 5427 | Sit Down | 100 | 0 |
| 17 | 6087 | Diamondback | Ohio | Kings Island | 2009 | 5282 | Sit Down | 100 | 0 |
| 18 | 6043 | Superman - Ride Of Steel | Maryland | Six Flags America | 2000 | 5350 | Sit Down | 100 | 0 |
| 19 | 5917 | Riddlers Revenge | California | Six Flags Magic Mountain | 1998 | 4370 | Stand Up | 400 | 6 |
| 20 | 5798 | Magnum XL-200 | Ohio | Cedar Point | 1989 | 5106 | Sit Down | 100 | 0 |
Length is the biggest contributing component in Fun Factor. I don’t think one component should outweigh the others in the calculation. I’ll weight length, as I did for Number of Inversions, so it contributes to Fun Factor more equally. The longer lengths are in thousands of feet, so I’ll divide it by 10 to bring it down into scale with the other components. Dividing length by 10 gives the following histogram for Fun Factor.
colSums(is.na(rc))
## coaster park city state
## 0 0 0 0
## type design year_opened top_speed
## 0 0 0 0
## max_height drop length duration
## 0 0 0 0
## inversions num_of_inversions dsgn_val fun
## 0 0 0 0
rc <- rc %>%
mutate(fun = top_speed + max_height + drop + # re-calculate fun factor
(length / 10) + duration + # with lenght / 10
(num_of_inversions * 100) + dsgn_val)
rc %>% ggplot(aes(fun, fill = type)) + # plot a new histogram of fun factor
geom_histogram(position = "identity", alpha = 0.6, color = "white") + # outline the bars
labs(x = "Fun Factor", y = "Coaster Count") + # label the x & y axis
scale_fill_discrete(name = "Construction", type = c("steelblue4", "brown4")) + # label the legend
theme_minimal() # and use descriptive colors
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
top20 <- arrange(rc, desc(fun)) %>% # sort rc descending by fun factor
rowid_to_column() %>% # add row ids as a column
select(rowid, fun, coaster, state, park, year_opened, # select columns
length, design, dsgn_val, num_of_inversions) %>%
head(20) # grab the top 20
noi <- top20 %>%
filter(num_of_inversions == 0)
noi
## # A tibble: 2 x 10
## rowid fun coaster state park year_opened length design dsgn_val
## <int> <dbl> <chr> <chr> <chr> <int> <dbl> <chr> <dbl>
## 1 12 1705. Fury 325 North Car~ Carowi~ 2015 6602 Sit D~ 100
## 2 19 1602. Millennium ~ Ohio Cedar ~ 2000 6595 Sit D~ 100
## # ... with 1 more variable: num_of_inversions <int>
Now the top coasters are steel, have higher design values and more inversions. But longer, sit down coasters with no inversions also show up in the top 20, Fury 325 at 12 and Millennium Force at 19. So I think the new Fun Factor calculation gives a better representation for each component.
kable(top20, align = 'c', # display top 20
col.names = c("Row#", "Fun Factor", "Coaster", "State", "Park",
"Year", "Length", "Design", "Design Value", "# Inversions"),
caption = "Top 20 Roller Coasters - scaled Fun Factor")
| Row# | Fun Factor | Coaster | State | Park | Year | Length | Design | Design Value | # Inversions |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2147.0 | X2 | California | Six Flags Magic Mountain | 2002 | 3610 | 4th Dimension | 700 | 2 |
| 2 | 1984.0 | Riddlers Revenge | California | Six Flags Magic Mountain | 1998 | 4370 | Stand Up | 400 | 6 |
| 3 | 1916.3 | Montu | Florida | Busch Gardens Tampa | 1996 | 3983 | Inverted | 300 | 7 |
| 4 | 1904.8 | Alpengeist | Virginia | Busch Gardens Williamsburg | 1997 | 3828 | Inverted | 300 | 6 |
| 5 | 1826.5 | Chang | Kentucky | Six Flags Kentucky Kingdom | 1997 | 4155 | Stand Up | 400 | 5 |
| 6 | 1762.0 | Viper | California | Six Flags Magic Mountain | 1990 | 3830 | Sit Down | 100 | 7 |
| 7 | 1753.7 | Medusa | California | Six Flags Discovery Kingdom | 2000 | 3937 | Sit Down | 100 | 7 |
| 8 | 1732.5 | Scream! | California | Six Flags Magic Mountain | 2003 | 3985 | Sit Down | 100 | 7 |
| 9 | 1728.0 | Raptor | Ohio | Cedar Point | 1994 | 3790 | Inverted | 300 | 6 |
| 10 | 1716.0 | Great American Scream Machine | New Jersey | Six Flags Great Adventure | 1989 | 3800 | Sit Down | 100 | 7 |
| 11 | 1709.8 | Kumba | Florida | Busch Gardens Tampa | 1993 | 3978 | Sit Down | 100 | 7 |
| 12 | 1705.2 | Fury 325 | North Carolina | Carowinds | 2015 | 6602 | Sit Down | 100 | 0 |
| 13 | 1697.7 | Kraken | Florida | SeaWorld Orlando | 2000 | 4177 | Sit Down | 100 | 7 |
| 14 | 1692.0 | Mantis | Ohio | Cedar Point | 1996 | 3900 | Stand Up | 400 | 4 |
| 15 | 1673.5 | Medusa | New Jersey | Six Flags Great Adventure | 1999 | 3985 | Sit Down | 100 | 7 |
| 16 | 1672.5 | Silver Bullet | California | Knott’s Berry Farm | 2004 | 3125 | Inverted | 300 | 6 |
| 17 | 1658.5 | Superman Krypton Coaster | Texas | Six Flags Fiesta Texas | 2000 | 4025 | Sit Down | 100 | 6 |
| 18 | 1633.9 | Batman The Ride | Texas | Six Flags Fiesta Texas | 2015 | 1019 | Wing | 600 | 6 |
| 19 | 1602.5 | Millennium Force | Ohio | Cedar Point | 2000 | 6595 | Sit Down | 100 | 0 |
| 20 | 1587.0 | Incredible Hulk | Florida | Universal Studios Islands of Adventure | 1999 | 3700 | Sit Down | 100 | 7 |
low10 <- arrange(rc, desc(fun)) %>% # sort rc descending by fun factor
rowid_to_column() %>% # add row ids as a column
select(rowid, fun, coaster, state, park, type, year_opened) %>% # select a subset of columns
tail(10) # assign the last 10 row to low10
oldest <- arrange(rc, desc(fun)) %>% # sort rc descending by fun factor
rowid_to_column() %>% # add row ids as a column
filter(year_opened == min(year_opened)) # find the coaster with the earliest year
As I expected, coasters at the low end are older or geared to the very young. However the oldest coaster Zippin Pippin in Tennessee, built in 1915, is ranked 150 out of 187 and has a fun factor of 656.5 due to it’s length of 2865 feet.
kable(low10, align = c('c','l','l','l','c','c'), # display low10
col.names = c("Row#", "Fun Factor", "Coaster", "State", "Park",
"Type", "Year"),
caption = "Bottom 10 Coasters")
| Row# | Fun Factor | Coaster | State | Park | Type | Year |
|---|---|---|---|---|---|---|
| 178 | 401.0000 | Comet | Pennsylvania | Waldameer Park | Wood | 1951 |
| 179 | 377.1000 | Woodstock Express | Ohio | Cedar Point | Steel | 1999 |
| 180 | 377.0000 | Bobsleds | New York | Seabreeze | Steel | 1962 |
| 181 | 365.2000 | Leap The Dips | Pennsylvania | Lakemont Park | Wood | 1999 |
| 182 | 362.7122 | Wild Chipmunk | Colorado | Lakeside Amusement Park | Steel | 1955 |
| 183 | 286.5703 | Gadget’s Go Coaster | California | Disneyland | Steel | 1993 |
| 184 | 281.5200 | Merlin’s Revenge | California | Castle Amusement Park | Steel | 2001 |
| 185 | 280.0000 | High Speed Thrill Coaster | Pennsylvania | Knoebels | Steel | 1955 |
| 186 | 279.5200 | Spacely’s Sprocket Rockets | Illinois | Six Flags Great America | Steel | 1998 |
| 187 | 236.3593 | Jr. Gemini | Ohio | Cedar Point | Steel | 1979 |
kable(oldest, align = 'c', caption = "Oldest Coaster") # display it
| rowid | coaster | park | city | state | type | design | year_opened | top_speed | max_height | drop | length | duration | inversions | num_of_inversions | dsgn_val | fun |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 150 | Zippin Pippin | Libertyland | Memphis | Tennessee | Wood | Sit Down | 1915 | 40 | 70 | 70 | 2865 | 90 | N | 0 | 100 | 656.5 |
To get a more realistic view of the distribution of roller coasters in the U.S., I used RCDB’s census page where I could search for the total count of operating roller coasters by state.
rc1 <- as_tibble(read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/RollerCoaster/USRollerCoasterCensus.csv"))
str(rc1)
## tibble [43 x 4] (S3: tbl_df/tbl/data.frame)
## $ Location: chr [1:43] "Alabama" "Arizona" "Arkansas" "California" ...
## $ Steel : int [1:43] 7 4 5 78 12 5 47 20 4 19 ...
## $ Wood : int [1:43] 1 0 1 6 2 3 3 2 2 4 ...
## $ Total : int [1:43] 8 4 6 84 14 8 50 22 6 23 ...
rcc <- arrange(rc1, Total, Location) # sort by total coasters
row.names(rcc) <- rcc$Location # rename the rows according to the state
## Warning: Setting row names on a tibble is deprecated.
rc_mat <- data.matrix(rcc) # convert dataframe to a matrix (required by heatmap)
rc_mat_ <- rc_mat[,2:4] # remove the redundant column of state
# Call heatmap using a ColorBrewer color set, margins=c(7,10) for aspect ratio, titles of graph, x and y labels,
# font size of x and y labels, and set up a RowSideColors bar
varcols = setNames(colorRampPalette(brewer.pal(nrow(rc_mat_), "BuPu"))(nrow(rc_mat_)),
rownames(rc_mat_)) # parameter for RowSideColors
## Warning in brewer.pal(nrow(rc_mat_), "BuPu"): n too large, allowed maximum for palette BuPu is 9
## Returning the palette you asked for with that many colors
heatmap(rc_mat_,
Rowv = NA, Colv = NA,
col= colorRampPalette(brewer.pal(nrow(rc_mat_), "BuPu"))(nrow(rc_mat_)),
s=0.6, v=1, scale="column",
margins=c(5,8),
main = "Where the Coasters are",
xlab = "Roller Coasters",
ylab="State",
labCol = c("Steel","Wood", "Total"),
cexCol=1,
RowSideColors = varcols)
## layout: widths = 0.05 0.2 4 , heights = 0.25 4 ; lmat=
## [,1] [,2] [,3]
## [1,] 0 0 4
## [2,] 3 1 2
## Warning in brewer.pal(nrow(rc_mat_), "BuPu"): n too large, allowed maximum for palette BuPu is 9
## Returning the palette you asked for with that many colors
California at the top isn’t surprising, there are a number of amusement parks in LA, Orange and San Diego counties, and the temperate climate means the parks can be open year round (except that all amusement parks are closed right now due to Covid 19). Pennsylvania has five more operating coasters than Florida, which I didn’t expect. Almost a third of them are wooden (Pennsylvania has the most wooden roller coasters) and ten of them were opened before 1955. The oldest operating roller coaster in Florida opened in 1972.
Seven states don’t have any roller coasters, Delaware, Rhode Island and Vermont are small and close to states with amusement parks. Montana and Wyoming are probably too sparsely populated, and Alaska is probably too cold. Disney runs the Aulani resort in Hawaii, but it doesn’t have a roller coaster. I doubt anyone misses them in the island paradise.
rcc <- arrange(rc1, desc(Total)) # sort the roller coaster census descending by Total
kable(rcc, align = c("l", "c", "c", "c"), # display the roller coaster census data
caption = "U.S. Roller Coaster Census")
| Location | Steel | Wood | Total |
|---|---|---|---|
| California | 78 | 6 | 84 |
| Pennsylvania | 37 | 18 | 55 |
| Florida | 47 | 3 | 50 |
| Texas | 44 | 4 | 48 |
| New Jersey | 41 | 2 | 43 |
| New York | 34 | 6 | 40 |
| Ohio | 32 | 8 | 40 |
| Missouri | 19 | 6 | 25 |
| Illinois | 19 | 4 | 23 |
| Georgia | 20 | 2 | 22 |
| Virginia | 18 | 4 | 22 |
| Maryland | 17 | 2 | 19 |
| North Carolina | 13 | 2 | 15 |
| Colorado | 12 | 2 | 14 |
| Indiana | 8 | 6 | 14 |
| Massachusetts | 13 | 1 | 14 |
| Michigan | 10 | 3 | 13 |
| Minnesota | 11 | 2 | 13 |
| Tennessee | 11 | 2 | 13 |
| Wisconsin | 6 | 6 | 12 |
| Kentucky | 8 | 3 | 11 |
| Utah | 9 | 1 | 10 |
| Iowa | 5 | 4 | 9 |
| Alabama | 7 | 1 | 8 |
| Connecticut | 5 | 3 | 8 |
| New Hampshire | 6 | 2 | 8 |
| Oklahoma | 7 | 1 | 8 |
| New Mexico | 6 | 1 | 7 |
| Arkansas | 5 | 1 | 6 |
| Idaho | 4 | 2 | 6 |
| Maine | 5 | 1 | 6 |
| Washington | 4 | 2 | 6 |
| Kansas | 5 | 0 | 5 |
| Nevada | 5 | 0 | 5 |
| Oregon | 5 | 0 | 5 |
| Arizona | 4 | 0 | 4 |
| Louisiana | 4 | 0 | 4 |
| South Carolina | 3 | 1 | 4 |
| West Virginia | 2 | 2 | 4 |
| South Dakota | 2 | 0 | 2 |
| Mississippi | 1 | 0 | 1 |
| Nebraska | 1 | 0 | 1 |
| North Dakota | 1 | 0 | 1 |
I was able to do almost everything I wanted to in R and R Markdown.
I’m very pleased to have found kable to format tibbles nicely. In order to make the .rmd reusable when the data changes I want to be able to output data values inline with the text, i.e. output the year for the oldest coaster in the middle of the sentece. I tried several different functions, print(), paste() and finally looked at the R Markdown cheatsheet and discovered the single backtick. That worked beautifully.
I haven’t been able to read a data file from GitHub, I tried including the RCurl library and the following code
# rc <- read.csv(getURL("https://github.com/sopranomax/Data110_Projects/a5664bb01296baacdb5093bd523d8a0118b3a737/RollerCoaster.csv"))
but get the following error message.
Error in function (type, msg, asError = TRUE) : error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version