milk listrx and pop data framesProject 2 called on students to select three of our classmates’ selected untidy datasets and proposed analyses, create CSV files containing all of the information included in each dataset, read the CSV files into R, use tidyr and dplyr as needed to tidy and transform the data, and carry out the proposed analyses.
The three untidy datasets that I chose to work with were already available online in CSV format, but in one case, the data for each of the observations of interest were downloaded in five separate CSV files, which were renamed for ease of use and ultimately combined.
NOTE: Several of the required CSV files are downloaded to the current working directory in the execution of this project’s R code, including the relatively large 271.8 MB CSV file of Citi Bike trip data.
# Change working directory as needed, CSV files are downloaded to current working directory
setwd("~/Project2")
library(RCurl)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(lubridate)
library(leaflet)
library(knitr)
Aaron Grzasko’s Nutitritional Content of Milk proposal required selecting five different types of milk, consolidating their nutritional information, and analyzing that data across type (here almond, goat, human, rice, and whole). The data were sourced from the USDA Food Composition Databases. The Stack Overflow questions and answers referenced below were very helpful guides for reading in the individual CSV files into a single list of data frames which could be manipulated and tidied collectively, and for showing a method of indexing the list elements so that a column consisting of the relevant milk type (based on the list element name) could be added to each data frame.
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_almond.csv", "milk_almond.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_goat.csv", "milk_goat.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_human.csv", "milk_human.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_rice.csv", "milk_rice.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_whole.csv", "milk_whole.csv", method = "curl")
milk_files <- list.files(pattern = "^milk.+\\.csv$")
milk <- lapply(milk_files,
function(x) read.csv(file = x, header = FALSE,
stringsAsFactors = FALSE, encoding = "latin1"))
names(milk) <- str_replace_all(milk_files, "(milk_)|(\\.csv)", "")
milk listBefore reducing the data for each milk type to 3 columns representing the nutrient names, units of measure, and values per 100 g - an attribute common to all the selected milk types and the basis for the subsequent analysis - I extracted the cup sizes in grams for each type so that calories per cup could be easily calculated.
milk <- lapply(milk, function(x) x %>% slice(5:nrow(x)))
# The line of code below was used to ID the columns with cup sizes
# lapply(milk, function(x) x[1, ])
cups <- data.frame(c("almond", "goat", "human", "rice", "whole"),
as.numeric(c(str_extract(milk$almond[1, 4], "\\d{3}\\.\\d{1}"),
str_extract(milk$goat[1, 5], "\\d{3}\\.\\d{1}"),
str_extract(milk$human[1, 5], "\\d{3}\\.\\d{1}"),
str_extract(milk$rice[1, 4], "\\d{3}\\.\\d{1}"),
str_extract(milk$whole[1, 4], "\\d{3}\\.\\d{1}"))),
stringsAsFactors = FALSE)
colnames(cups) <- c("type", "cup.size")
milk <- lapply(milk, function(x) x %>% select(num_range("V", 1:3)))
new_colnames <- paste(milk[[1]][1, ])
new_colnames[3] <- "Value/100g"
milk <- lapply(milk, setNames, nm = new_colnames)
milk <- lapply(milk, function(x) x %>% slice(2:nrow(x)))
add_index_colname <- function(i, x) {
x[[i]] <- x[[i]] %>% mutate(Kind = rep(names(x[i]), times = nrow(x[[i]])))
}
milk <- lapply(seq_along(milk), add_index_colname, x = milk)
names(milk) <- str_replace_all(milk_files, "(milk_)|(\\.csv)", "")
milk <- bind_rows(milk)
milk <- milk %>% filter(Unit != "") %>% unite(Nutrient, Nutrient, Unit, sep = " (")
milk$Nutrient <- str_c(milk$Nutrient, ")")
milk <- milk %>% group_by(Kind) %>% spread(Nutrient, `Value/100g`, convert = TRUE)
kable(milk %>% group_by(Kind) %>% summarize(`Protein (g)`/`Energy (kcal)`))
| Kind | Protein (g)/Energy (kcal) |
|---|---|
| almond | 0.0393333 |
| goat | 0.0515942 |
| human | 0.0147143 |
| rice | 0.0059574 |
| whole | 0.0516393 |
milk <- milk %>% inner_join(cups, by = c("Kind" = "type"))
kable(milk %>% group_by(Kind) %>%
summarize(`kcal/cup` = `Energy (kcal)` * (cup.size/100)))
| Kind | kcal/cup |
|---|---|
| almond | 39.30 |
| goat | 168.36 |
| human | 172.20 |
| rice | 112.80 |
| whole | 148.84 |
kable(milk %>% group_by(Kind) %>%
summarize(`kcal/cup` = `Energy (kcal)` * (cup.size/100)) %>%
arrange(`kcal/cup`))
| Kind | kcal/cup |
|---|---|
| almond | 39.30 |
| rice | 112.80 |
| whole | 148.84 |
| goat | 168.36 |
| human | 172.20 |
kable(milk %>% group_by(Kind) %>%
summarize(net.carbohydrates =
`Sugars, total (g)` - `Fiber, total dietary (g)`))
| Kind | net.carbohydrates |
|---|---|
| almond | 0.00 |
| goat | 4.45 |
| human | 6.89 |
| rice | 4.98 |
| whole | 5.05 |
I provided a link to US National Health Expenditure Accounts (NHEA) historical data from CMS and requested an analysis of changes in total prescription drug spending per capita and the distribution of funding sources for prescription drug expenditures over time.
NHE.URL <- getURL("https://raw.githubusercontent.com/juddanderman/Project2/master/NHE2014.csv")
spend <- read.csv(text = NHE.URL, header = FALSE, stringsAsFactors = FALSE)
spend[1:5, 1:5]
## V1
## 1 National Health Expenditures by Type of Service and Source of Funds: Calendar Years 1960-2014
## 2 Expenditure Amount (Millions)
## 3 Total National Health Expenditures
## 4 Out of pocket
## 5 Health Insurance
## V2 V3 V4 V5
## 1
## 2 1960 1961 1962 1963
## 3 27,214 29,138 31,842 34,595
## 4 12,949 13,357 14,255 15,311
## 5 7,497 8,236 8,999 9,892
colnames(spend)[2:ncol(spend)] <- spend[2, 2:ncol(spend)]
pop <- spend %>% filter(V1 == "POPULATION")
pop[2:ncol(pop)] <- as.integer(pop[2:ncol(pop)])
rx <- spend %>% slice(which(str_detect(spend$V1, "Prescription")):(which(str_detect(spend$V1, "Prescription")) + 29))
colnames(rx)[1] <- "payer"
title <- spend[1, 1]
rx and pop data framesI cleaned up and trimmed the strings representing payer names or funding sources, but first looked at the unaltered rx$payer vector as the levels of indentation revealed at a glance which elements of the vector represented umbrella categories of payer types.
#rx$payer
rx$payer <- str_replace(rx$payer, "\\s\\(.+\\)", "")
rx$payer <- str_replace_all(rx$payer, "\\*|'", "")
rx$payer <- str_replace_all(rx$payer, "\\b\\s\\b|/", ".")
rx$payer[7:8] <- str_c(str_trim(rx$payer[6]), str_trim(rx$payer[7:8]), sep = "-")
rx$payer[10:11] <- str_c(str_trim(rx$payer[9]), str_trim(rx$payer[10:11]), sep = "-")
rx$payer[21:22] <- str_c(str_trim(rx$payer[20]), str_trim(rx$payer[21:22]), sep = "-")
rx$payer[24:25] <- str_c(str_trim(rx$payer[23]), str_trim(rx$payer[24:25]), sep = "-")
rx$payer <- str_trim(rx$payer)
rx$payer
## [1] "Total.Prescription.Drug.Expenditures"
## [2] "Out.of.pocket"
## [3] "Health.Insurance"
## [4] "Private.Health.Insurance"
## [5] "Medicare"
## [6] "Medicaid"
## [7] "Medicaid-Federal"
## [8] "Medicaid-State.and.Local"
## [9] "CHIP"
## [10] "CHIP-Federal"
## [11] "CHIP-State.and.Local"
## [12] "Department.of.Defense"
## [13] "Department.of.Veterans.Affairs"
## [14] "Other.Third.Party.Payers.and.Programs"
## [15] "Worksite.Health.Care"
## [16] "Other.Private.Revenues"
## [17] "Indian.Health.Services"
## [18] "Workers.Compensation"
## [19] "General.Assistance"
## [20] "Maternal.Child.Health"
## [21] "Maternal.Child.Health-Federal"
## [22] "Maternal.Child.Health-State.and.Local"
## [23] "Vocational.Rehabilitation"
## [24] "Vocational.Rehabilitation-Federal"
## [25] "Vocational.Rehabilitation-State.and.Local"
## [26] "Other.Federal.Programs"
## [27] "SAMHSA"
## [28] "Other.State.and.Local.Programs"
## [29] "School.Health"
## [30] "Total.CMS.Programs"
rx <- rx %>% gather(year, spending, 2:ncol(rx)) %>% spread(payer, spending)
rx <- apply(rx, 2, function(x) str_replace(x, "-", 0))
rx <- apply(rx, 2, function(x) str_replace(x, ",", ""))
rx <- apply(rx, 2, function(x) as.integer(str_trim(x)))
rx <- data.frame(rx, stringsAsFactors = FALSE)
pop <- pop %>% gather(year, population, 2:ncol(pop)) %>% select(year, population)
pop$year <- as.integer(pop$year)
head(pop)
## year population
## 1 1960 186
## 2 1961 189
## 3 1962 192
## 4 1963 195
## 5 1964 197
## 6 1965 200
rx <- rx %>% inner_join(pop, by = "year") %>% select(year, population, everything())
rx[1:5, 1:5]
## year population CHIP CHIP.Federal CHIP.State.and.Local
## 1 1960 186 0 0 0
## 2 1961 189 0 0 0
## 3 1962 192 0 0 0
## 4 1963 195 0 0 0
## 5 1964 197 0 0 0
per_cap <- rx %>% group_by(year) %>% summarize(per.cap.rx = Total.Prescription.Drug.Expenditures/population)
ggplot(per_cap, aes(x = year, y = per.cap.rx)) +
geom_point() + geom_smooth() +
labs(x = "Year", y = "Per Capita Rx Expenditures (nominal)")
In an attempt to adjust for inflation, I sourced annual averages of the Consumer Price Index for All Urban Consumers (CPI-U) from the Bureau of Labor Statistics (I used the data from Series ID: CUUR0000AA0 with 1967 as the base year, and selected for annual averages to be included in the .xlsx download, which was later saved in .csv format) and joined those data to the prescription spending data extracted from the original NHEA table. Then, using the average CPI-U for 2014 as an index, I standardized prescription drug spending each year from 1960 to 2014 in 2014 dollars.
cpi.URL <- getURL("https://raw.githubusercontent.com/juddanderman/Project2/master/CPI-U.csv")
cpi <- read.csv(text = cpi.URL, header = FALSE, stringsAsFactors = FALSE)
cpi <- cpi %>% slice(10:nrow(cpi))
colnames(cpi) <- cpi[1, ]
cpi <- cpi[-1, ]
cpi <- cpi %>% select(Year, Annual)
colnames(cpi)[2] <- "avg.cpi"
cpi <- data.frame(apply(cpi, 1:2, function(x) as.numeric(x)))
cpi <- cpi %>% filter (Year >= 1960 & Year <= 2014)
per_cap <- per_cap %>% inner_join(cpi, by = c("year" = "Year"))
cpi_index <- per_cap$avg.cpi[per_cap$year == 2014]
per_cap <- per_cap %>% mutate(infl.adj.rx = per.cap.rx * (cpi_index/avg.cpi))
ggplot(per_cap, aes(x = year, y = infl.adj.rx)) +
geom_point() + geom_smooth() +
labs(x = "Year", y = "Per Capita Rx Expenditures (in 2014 dollars)")
per_cap <- per_cap %>% select(year, per.cap.rx, infl.adj.rx) %>% gather(type, per.cap, per.cap.rx:infl.adj.rx)
ggplot(per_cap, aes(x = year, y = per.cap, color = type)) +
geom_point() + geom_smooth() +
labs(x = "Year", y = "Per Capita Rx Expenditures",
main = "Inflation Adjusted and Unadjusted Per Capita Expenditures") +
scale_color_discrete(name = "Inflation Adjusted/Nominal Spending",
labels = c("2014 dollars", "nominal")) +
theme(legend.position = "bottom")
The inflation adjusted scatter plot of per capita prescription drug spending over time shows greater variabliity in the rate and direction of change than the unadjusted plot of nominal spending per capita, as we see additional periods of noticeable decline or stabilization. We do, however, see a fairly consistent strong increase in inflation adjusted per capita spending from 1980 to 2008-2009, at which point we may be witnessing the effects of the of the Great Recession on prescription drug expenditures.
rx_props <- rx %>% mutate_each(funs(./Total.Prescription.Drug.Expenditures), -c(year, population, Total.Prescription.Drug.Expenditures)) %>% select(-Total.Prescription.Drug.Expenditures)
props_3 <- rx_props %>% select(year, Health.Insurance, Out.of.pocket, Other.Third.Party.Payers.and.Programs)
# The line below was used to check that the sum of the proportions
# was equal to one for each year
#props_3 %>% select(-year) %>% mutate(check_props = rowSums(.))
props_3 <- props_3 %>% gather(payer, prop, 2:4)
props_3$payer <- factor(props_3$payer,
levels = c("Health.Insurance",
"Other.Third.Party.Payers.and.Programs"))
ggplot(props_3, aes(x = year, y = prop)) +
geom_bar(aes(fill = payer), stat = "identity", position = "stack") +
labs(x = "Year", y = "Proportion") +
theme(legend.position = "bottom") +
scale_fill_discrete(name = "Funding Source",
labels = c("Health Insurance",
"Out of Pocket",
"Other Third Party Payers"))
ggplot(props_3, aes(x = year, y = prop)) +
geom_area(aes(fill = payer), position = "stack") +
labs(x = "Year", y = "Proportion") +
theme(legend.position = "bottom") +
scale_fill_discrete(name = "Funding Source",
labels = c("Health Insurance",
"Out of Pocket",
"Other Third Party Payers"))
props_18 <- rx_props %>% select(year, Out.of.pocket, Private.Health.Insurance, Medicare, Medicaid, CHIP, Department.of.Defense, Department.of.Veterans.Affairs, Worksite.Health.Care, Other.Private.Revenues, Indian.Health.Services, Workers.Compensation, General.Assistance, Maternal.Child.Health, Vocational.Rehabilitation, Other.Federal.Programs, SAMHSA, Other.State.and.Local.Programs, School.Health)
# The line below was used to check that the sum of the proportions
# was equal to one for each year
#props_18 %>% select(-year) %>% mutate(check_props = rowSums(.))
props_18 <- props_18 %>% gather(payer, prop, 2:ncol(props_18))
(props_18 %>% filter(year == 2014) %>% arrange(desc(prop)))[1:5, ]
## year payer prop
## 1 2014 Private.Health.Insurance 0.42757425
## 2 2014 Medicare 0.29018670
## 3 2014 Out.of.pocket 0.15022607
## 4 2014 Medicaid 0.09181452
## 5 2014 Department.of.Defense 0.01844151
set.seed(2)
ggplot(props_18, aes(x = year, y = prop)) +
geom_bar(aes(fill = payer), stat = "identity", position = "stack") +
labs(x = "Year", y = "Proportion") +
theme(legend.position = "right") +
scale_fill_manual(name = "Payer",
values = sample(rainbow(18),
size = 18,
replace = FALSE)) +
annotate("text", x = 1980, y = 0.25, label = "Out of Pocket") +
annotate("text", x = 2004, y = 0.5, label = "Private Insurance") +
annotate("text", x = 2010, y = 0.76, label = "Medicare") +
annotate("text", x = 1990, y = 0.91, label = "Medicaid")
props_18$prop[props_18$year == 2014 & props_18$payer == "Private.Health.Insurance"]
## [1] 0.4275743
set.seed(2)
ggplot(props_18, aes(x = year, y = prop)) +
geom_area(aes(fill = payer), position = "stack") +
labs(x = "Year", y = "Proportion") +
theme(legend.position = "right") +
scale_fill_manual(name = "Payer",
values = sample(rainbow(18),
size = 18,
replace = FALSE)) +
annotate("text", x = 1980, y = 0.25, label = "Out of Pocket") +
annotate("text", x = 2004, y = 0.5, label = "Private Insurance") +
annotate("text", x = 2010, y = 0.76, label = "Medicare") +
annotate("text", x = 1990, y = 0.91, label = "Medicaid")
The stacked bar and area plots of the proportion of prescription drug expenditures by funding source reveal a consistent decrease in the proportion of out-of-pocket spending and an increase in the proportion of spending by health insurance payers. When the health insurers are broken down into more specific payer types, we can more easily see the change over time in the share of prescription drug spending by private insurance companies, Medicaid, and Medicare. Interestingly, we can very clearly see the results of the implementation of Medicare Part D prescription drug benefit program in 2006 in the sharp increase in the proportion of spending by Medicare and decreases in the proportions of the other top payers at that point in time.
Finally, I plot the inflation adjusted total prescription drug spending over time for the four largest funding sources revealed in the stacked bar and area plots above, i.e. out-of-pocket, private insurance, Medicaid, Medicare.
infl_adj_top_4 <- rx %>% select(year, Out.of.pocket, Private.Health.Insurance, Medicaid, Medicare) %>% inner_join(cpi, by = c("year" = "Year")) %>% mutate_each(funs((.) * (cpi_index/avg.cpi)), -year) %>% select(1:5) %>% gather(payer, infl.adj.spending, 2:5)
ggplot(infl_adj_top_4, aes(x = year, y = infl.adj.spending, color = payer)) +
geom_point() + geom_smooth() +
labs(x = "Year", y = "Inflatjion Adj. Spending (2014 dollars)")
Sharon Morris provided a link to the June 2016 NYC Citi Bike Trip Data and requested an analysis to produce a profile of the typical Citi Bike user.
download.file("https://s3.amazonaws.com/tripdata/201606-citibike-tripdata.zip", "201606-citibike-tripdata.zip", method = "curl")
unzip("201606-citibike-tripdata.zip")
bike <- read.csv("201606-citibike-tripdata.csv", header = TRUE, stringsAsFactors = FALSE)
head(bike)
## tripduration starttime stoptime start.station.id
## 1 1470 6/1/2016 00:00:18 6/1/2016 00:24:48 380
## 2 229 6/1/2016 00:00:20 6/1/2016 00:04:09 3092
## 3 344 6/1/2016 00:00:21 6/1/2016 00:06:06 449
## 4 1120 6/1/2016 00:00:28 6/1/2016 00:19:09 522
## 5 229 6/1/2016 00:00:53 6/1/2016 00:04:42 335
## 6 946 6/1/2016 00:01:01 6/1/2016 00:16:48 503
## start.station.name start.station.latitude start.station.longitude
## 1 W 4 St & 7 Ave S 40.73401 -74.00294
## 2 Berry St & N 8 St 40.71901 -73.95853
## 3 W 52 St & 9 Ave 40.76462 -73.98789
## 4 E 51 St & Lexington Ave 40.75715 -73.97208
## 5 Washington Pl & Broadway 40.72904 -73.99405
## 6 E 20 St & Park Ave 40.73827 -73.98752
## end.station.id end.station.name end.station.latitude
## 1 3236 W 42 St & Dyer Ave 40.75898
## 2 3103 N 11 St & Wythe Ave 40.72153
## 3 469 Broadway & W 53 St 40.76344
## 4 401 Allen St & Rivington St 40.72020
## 5 285 Broadway & E 14 St 40.73455
## 6 495 W 47 St & 10 Ave 40.76270
## end.station.longitude bikeid usertype birth.year gender
## 1 -73.99380 19859 Subscriber 1972 1
## 2 -73.95782 16233 Subscriber 1967 1
## 3 -73.98268 22397 Subscriber 1989 1
## 4 -73.98998 16231 Subscriber 1991 1
## 5 -73.99074 15400 Subscriber 1989 1
## 6 -73.99301 25193 Subscriber 1974 1
mdy_hms(bike$stoptime[1]) - mdy_hms(bike$starttime[1])
## Time difference of 24.5 mins
bike$tripduration[1]/60
## [1] 24.5
day_times <- bike %>% select(starttime) %>%
mutate(weekday = wday(mdy_hms(starttime), label = TRUE, abbr = FALSE),
time_of_day = hour(mdy_hms(bike$starttime)))
day_of_wk <- day_times %>% group_by(weekday) %>%
summarize(num_rides = n())
day_of_wk <- day_of_wk %>%
mutate(prop_rides = round(num_rides/sum(day_of_wk$num_rides), digits = 2)) %>%
arrange(weekday)
kable(day_of_wk)
| weekday | num_rides | prop_rides |
|---|---|---|
| Sunday | 151296 | 0.10 |
| Monday | 200488 | 0.14 |
| Tuesday | 209782 | 0.14 |
| Wednesday | 262595 | 0.18 |
| Thursday | 261553 | 0.18 |
| Friday | 200696 | 0.14 |
| Saturday | 173908 | 0.12 |
start_day_times <- day_times %>%
mutate(day_type = ifelse (weekday %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"), "weekday", "weekend")) %>%
group_by(day_type, time_of_day) %>% summarize(num_rides = n()) %>%
mutate(prop_rides = round(num_rides/sum(day_of_wk$num_rides), digits = 2)) %>%
arrange(day_type, time_of_day) %>%
top_n(5, wt = num_rides)
kable(start_day_times)
| day_type | time_of_day | num_rides | prop_rides |
|---|---|---|---|
| weekday | 8 | 107598 | 0.07 |
| weekday | 9 | 77398 | 0.05 |
| weekday | 17 | 121131 | 0.08 |
| weekday | 18 | 120633 | 0.08 |
| weekday | 19 | 83371 | 0.06 |
| weekend | 13 | 26509 | 0.02 |
| weekend | 14 | 27804 | 0.02 |
| weekend | 15 | 27463 | 0.02 |
| weekend | 16 | 27144 | 0.02 |
| weekend | 17 | 27271 | 0.02 |
commute_rides <- start_day_times %>% filter(day_type == "weekday") %>% summarize(prop_rides = sum(prop_rides))
From the tables above, we can see that the majority of Citi bike rides in June 2016 occured during the work week, and since 34% of all rides began on weekdays between 8 and 10 am or 5 and 8 pm, it appears that the largest share of Citi Bike rides are for commuting to and from work.
gender_codes <- data.frame(key = c("unknown", "male", "female"), val = 0:2)
gender_name <- function(x) gender_codes$key[gender_codes$val == x]
# Check customers for birth year and known gender data
nrow(bike %>% filter(usertype == "Customer" & !is.na(birth.year)))
## [1] 0
nrow(bike %>% filter(usertype == "Customer" & gender != 0))
## [1] 0
avg_age <- round(mean(year(Sys.time()) - bike$birth.year, na.rm = TRUE), digits = 1)
avg_trip_minutes <- round(mean(bike$tripduration)/60, digits = 2)
typ_gender <- bike %>% group_by(gender) %>%
summarize(N = n()) %>% top_n(1) %>% select(gender) %>%
mutate(gender = gender_name(gender))
## Selecting by N
names(typ_gender) <- "typ_gender"
typical_rider <- bike %>% group_by(usertype, gender) %>%
summarize(N = n(),
age = mean(year(Sys.time()) - birth.year, na.rm = TRUE),
duration_minutes = mean(tripduration)/60)
typical_rider <- typical_rider %>%
mutate(gender = gender_name(gender), age = round(age, digits = 1),
duration_minutes = round(duration_minutes, digits = 2))
The first table below shows the most common gender and average age of Citi Bike subscribers - since we do not have gender or birth year data for the non-subscribing customers - and the average trip length in minutes for all rides. The second table breaks down the average trip duration in minutes across rider types (i.e. subscribers and customers) and genders.
kable(data.frame(typ_gender, avg_age, avg_trip_minutes))
| typ_gender | avg_age | avg_trip_minutes |
|---|---|---|
| male | 38.3 | 16.5 |
kable(typical_rider)
| usertype | gender | N | age | duration_minutes |
|---|---|---|---|---|
| Customer | unknown | 156832 | NaN | 35.87 |
| Subscriber | unknown | 41404 | 42.9 | 24.29 |
| Subscriber | male | 947464 | 38.7 | 13.26 |
| Subscriber | female | 314618 | 37.1 | 15.56 |
Here we examine the most used Citi Bike stations in terms of total tranfers (i.e. ride starts and ends) and then map those stations using the leaflet package.
starts <- bike %>% group_by(start.station.name) %>%
summarize(N = n()) %>% arrange(desc(N))
starts <- starts %>%
transmute(ride.status =
str_extract(colnames(starts[1]), "^[[:alpha:]]+\\b"),
station.name = start.station.name, N)
ends <- bike %>% group_by(end.station.name) %>%
summarize(N = n()) %>% arrange(desc(N))
ends <- ends %>%
transmute(ride.status =
str_extract(colnames(ends[1]), "^[[:alpha:]]+\\b"),
station.name = end.station.name, N)
pop_stations <- bind_rows(starts, ends)
pop_stations$ride.status <- factor(pop_stations$ride.status, levels=c("start", "end"))
pop_stations <- pop_stations %>%
spread(ride.status, N) %>%
mutate(total = start + end) %>%
top_n(20, wt = total)
pop_stations <- left_join(pop_stations,
bike[, c("start.station.name", "start.station.id",
"start.station.latitude", "start.station.longitude")],
by = c("station.name" = "start.station.name")) %>%
distinct()
pop_stations <- pop_stations %>%
rename(station.id = start.station.id,
latitude = start.station.latitude,
longitude = start.station.longitude)
kable(pop_stations %>% select(station.name, total))
| station.name | total |
|---|---|
| 12 Ave & W 40 St | 20382 |
| 8 Ave & W 33 St | 17957 |
| Broadway & E 14 St | 15899 |
| Broadway & E 22 St | 23597 |
| Broadway & W 24 St | 16060 |
| Carmine St & 6 Ave | 17821 |
| Central Park S & 6 Ave | 15771 |
| Christopher St & Greenwich St | 15622 |
| Cleveland Pl & Spring St | 17575 |
| E 17 St & Broadway | 22807 |
| E 47 St & Park Ave | 15561 |
| Greenwich Ave & 8 Ave | 18430 |
| Lafayette St & E 8 St | 20410 |
| Lafayette St & Jersey St | 16120 |
| Pershing Square North | 31296 |
| South End Ave & Liberty St | 17628 |
| Vesey Pl & River Terrace | 16422 |
| W 20 St & 11 Ave | 19401 |
| W 21 St & 6 Ave | 21348 |
| West St & Chambers St | 24680 |
nyc1 <- leaflet() %>% setView(lng = -73.995, lat = 40.73, zoom = 12) %>%
addTiles() %>%
addCircles(lng = pop_stations$longitude, lat = pop_stations$latitude,
radius = pop_stations$total / 50,
color = "red",
popup = paste(sep = "<br/>", pop_stations$station.name,
paste("Citibike Station ID:",
pop_stations$station.id)))
nyc1
Finally, I found the most popular Citi Bike routes in June 2016 and roughly mapped those routes as lines between the relevant start and end stations. Interestingly, several of the most common routes had the same start and end station.
routes <- bike %>% group_by(start.station.id, start.station.name,
start.station.longitude, start.station.latitude,
end.station.id, end.station.name,
end.station.longitude, end.station.latitude) %>%
summarize(rides = n()) %>% arrange(desc(rides)) %>% ungroup() %>% top_n(10, wt = rides)
kable(routes[, c("start.station.name", "end.station.name", "rides")])
| start.station.name | end.station.name | rides |
|---|---|---|
| Central Park S & 6 Ave | Central Park S & 6 Ave | 1136 |
| Central Park S & 6 Ave | 5 Ave & E 78 St | 1002 |
| 12 Ave & W 40 St | West St & Chambers St | 627 |
| Central Park S & 6 Ave | Central Park West & W 85 St | 592 |
| Centre St & Chambers St | Centre St & Chambers St | 563 |
| Yankee Ferry Terminal | Yankee Ferry Terminal | 557 |
| 5 Ave & E 78 St | Central Park West & W 85 St | 545 |
| Soissons Landing | Yankee Ferry Terminal | 544 |
| Central Park West & W 85 St | Central Park S & 6 Ave | 537 |
| Soissons Landing | Soissons Landing | 536 |
nyc2 <- leaflet() %>% setView(lng = -73.975, lat = 40.740, zoom = 12) %>%
addTiles() %>%
addMarkers(lng = routes$start.station.longitude,
lat = routes$start.station.latitude,
popup = paste(sep = "<br/>", routes$start.station.name,
paste("Citibike Station ID:",
routes$start.station.id))) %>%
addMarkers(lng = routes$end.station.longitude,
lat = routes$end.station.latitude,
popup = paste(sep = "<br/>", routes$end.station.name,
paste("Citibike Station ID:",
routes$end.station.id)))
# add lines between citibike start and end stations, replace "red" below with
# palette()[(i %% length(palette())) + 1] to get different colored routes
# modify palette as needed: e.g. palette("default"), palette(c("red", "blue", "gray"))
for (i in 1:nrow(routes)) {
nyc2 <- nyc2 %>%
addPolylines(lng = unlist(routes[i, c(3, 7)]),
lat = unlist(routes[i, c(4, 8)]),
opacity = 0.6, color = "red")
}
nyc2