Project 2 called on students to select three of our classmates’ selected untidy datasets and proposed analyses, create CSV files containing all of the information included in each dataset, read the CSV files into R, use tidyr and dplyr as needed to tidy and transform the data, and carry out the proposed analyses.

The three untidy datasets that I chose to work with were already available online in CSV format, but in one case, the data for each of the observations of interest were downloaded in five separate CSV files, which were renamed for ease of use and ultimately combined.

NOTE: Several of the required CSV files are downloaded to the current working directory in the execution of this project’s R code, including the relatively large 271.8 MB CSV file of Citi Bike trip data.

Load required packages

# Change working directory as needed, CSV files are downloaded to current working directory
setwd("~/Project2")

library(RCurl)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(lubridate)
library(leaflet)
library(knitr)

Animal and Plant Milk Nutritional Content

Aaron Grzasko’s Nutitritional Content of Milk proposal required selecting five different types of milk, consolidating their nutritional information, and analyzing that data across type (here almond, goat, human, rice, and whole). The data were sourced from the USDA Food Composition Databases. The Stack Overflow questions and answers referenced below were very helpful guides for reading in the individual CSV files into a single list of data frames which could be manipulated and tidied collectively, and for showing a method of indexing the list elements so that a column consisting of the relevant milk type (based on the list element name) could be added to each data frame.

Import CSV data

download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_almond.csv", "milk_almond.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_goat.csv", "milk_goat.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_human.csv", "milk_human.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_rice.csv", "milk_rice.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_whole.csv", "milk_whole.csv", method = "curl")

milk_files <- list.files(pattern = "^milk.+\\.csv$")

milk <- lapply(milk_files, 
               function(x) read.csv(file = x, header = FALSE, 
                                    stringsAsFactors = FALSE, encoding = "latin1"))

names(milk) <- str_replace_all(milk_files, "(milk_)|(\\.csv)", "")

Clean and tidy data frame elements of milk list

Before reducing the data for each milk type to 3 columns representing the nutrient names, units of measure, and values per 100 g - an attribute common to all the selected milk types and the basis for the subsequent analysis - I extracted the cup sizes in grams for each type so that calories per cup could be easily calculated.

milk <- lapply(milk, function(x) x %>% slice(5:nrow(x)))

# The line of code below was used to ID the columns with cup sizes
# lapply(milk, function(x) x[1, ])

cups <- data.frame(c("almond", "goat", "human", "rice", "whole"), 
         as.numeric(c(str_extract(milk$almond[1, 4], "\\d{3}\\.\\d{1}"), 
           str_extract(milk$goat[1, 5], "\\d{3}\\.\\d{1}"),
           str_extract(milk$human[1, 5], "\\d{3}\\.\\d{1}"),
           str_extract(milk$rice[1, 4], "\\d{3}\\.\\d{1}"),
           str_extract(milk$whole[1, 4], "\\d{3}\\.\\d{1}"))),
         stringsAsFactors =  FALSE)
colnames(cups) <- c("type", "cup.size")
           
milk <- lapply(milk, function(x) x %>% select(num_range("V", 1:3)))

new_colnames <- paste(milk[[1]][1, ])
new_colnames[3] <- "Value/100g"

milk <- lapply(milk, setNames, nm = new_colnames)

milk <- lapply(milk, function(x) x %>% slice(2:nrow(x)))

add_index_colname <- function(i, x) {
  x[[i]] <- x[[i]] %>% mutate(Kind = rep(names(x[i]), times = nrow(x[[i]])))
}

milk <- lapply(seq_along(milk), add_index_colname, x = milk)
names(milk) <- str_replace_all(milk_files, "(milk_)|(\\.csv)", "")

milk <- bind_rows(milk)
milk <- milk %>% filter(Unit != "") %>% unite(Nutrient, Nutrient, Unit, sep = " (")

milk$Nutrient <- str_c(milk$Nutrient, ")")

milk <- milk %>% group_by(Kind) %>% spread(Nutrient, `Value/100g`, convert = TRUE)

Calculate the ratio of protein to total calories for each milk type

kable(milk %>% group_by(Kind) %>% summarize(`Protein (g)`/`Energy (kcal)`))
Kind Protein (g)/Energy (kcal)
almond 0.0393333
goat 0.0515942
human 0.0147143
rice 0.0059574
whole 0.0516393

Calculate the average calorie content per cup across milk types

milk <- milk %>% inner_join(cups, by = c("Kind" = "type"))

kable(milk %>% group_by(Kind) %>% 
        summarize(`kcal/cup` = `Energy (kcal)` * (cup.size/100)))
Kind kcal/cup
almond 39.30
goat 168.36
human 172.20
rice 112.80
whole 148.84

Arrange the observations in ascending order by total calories per cup

kable(milk %>% group_by(Kind) %>% 
        summarize(`kcal/cup` = `Energy (kcal)` * (cup.size/100)) %>% 
        arrange(`kcal/cup`))
Kind kcal/cup
almond 39.30
rice 112.80
whole 148.84
goat 168.36
human 172.20

Calculate net carbohydrates (i.e. gross carbohydrates - dietary fiber)

kable(milk %>% group_by(Kind) %>% 
        summarize(net.carbohydrates = 
                    `Sugars, total (g)` - `Fiber, total dietary (g)`))
Kind net.carbohydrates
almond 0.00
goat 4.45
human 6.89
rice 4.98
whole 5.05

US Prescription Drug Spending 1960-2014

I provided a link to US National Health Expenditure Accounts (NHEA) historical data from CMS and requested an analysis of changes in total prescription drug spending per capita and the distribution of funding sources for prescription drug expenditures over time.

Import CSV data

NHE.URL <- getURL("https://raw.githubusercontent.com/juddanderman/Project2/master/NHE2014.csv")

spend <- read.csv(text = NHE.URL, header = FALSE, stringsAsFactors = FALSE)
spend[1:5, 1:5]
##                                                                                               V1
## 1 National Health Expenditures by Type of Service and Source of Funds:  Calendar Years 1960-2014
## 2                                                                  Expenditure Amount (Millions)
## 3                                                             Total National Health Expenditures
## 4                                                                                  Out of pocket
## 5                                                                               Health Insurance
##         V2       V3       V4       V5
## 1                                    
## 2     1960     1961     1962     1963
## 3  27,214   29,138   31,842   34,595 
## 4  12,949   13,357   14,255   15,311 
## 5   7,497    8,236    8,999    9,892
colnames(spend)[2:ncol(spend)] <- spend[2, 2:ncol(spend)]

Extract population counts and prescription drug spending by year

pop <- spend %>% filter(V1 == "POPULATION")
pop[2:ncol(pop)] <- as.integer(pop[2:ncol(pop)]) 

rx <- spend %>% slice(which(str_detect(spend$V1, "Prescription")):(which(str_detect(spend$V1, "Prescription")) + 29))
colnames(rx)[1] <- "payer"

title <- spend[1, 1]

Clean up and melt data, join rx and pop data frames

I cleaned up and trimmed the strings representing payer names or funding sources, but first looked at the unaltered rx$payer vector as the levels of indentation revealed at a glance which elements of the vector represented umbrella categories of payer types.

#rx$payer
rx$payer <- str_replace(rx$payer, "\\s\\(.+\\)", "")
rx$payer <- str_replace_all(rx$payer, "\\*|'", "")
rx$payer <- str_replace_all(rx$payer, "\\b\\s\\b|/", ".")
rx$payer[7:8] <- str_c(str_trim(rx$payer[6]), str_trim(rx$payer[7:8]), sep = "-")
rx$payer[10:11] <- str_c(str_trim(rx$payer[9]), str_trim(rx$payer[10:11]), sep = "-")
rx$payer[21:22] <- str_c(str_trim(rx$payer[20]), str_trim(rx$payer[21:22]), sep = "-")
rx$payer[24:25] <- str_c(str_trim(rx$payer[23]), str_trim(rx$payer[24:25]), sep = "-")
rx$payer <- str_trim(rx$payer)
rx$payer
##  [1] "Total.Prescription.Drug.Expenditures"     
##  [2] "Out.of.pocket"                            
##  [3] "Health.Insurance"                         
##  [4] "Private.Health.Insurance"                 
##  [5] "Medicare"                                 
##  [6] "Medicaid"                                 
##  [7] "Medicaid-Federal"                         
##  [8] "Medicaid-State.and.Local"                 
##  [9] "CHIP"                                     
## [10] "CHIP-Federal"                             
## [11] "CHIP-State.and.Local"                     
## [12] "Department.of.Defense"                    
## [13] "Department.of.Veterans.Affairs"           
## [14] "Other.Third.Party.Payers.and.Programs"    
## [15] "Worksite.Health.Care"                     
## [16] "Other.Private.Revenues"                   
## [17] "Indian.Health.Services"                   
## [18] "Workers.Compensation"                     
## [19] "General.Assistance"                       
## [20] "Maternal.Child.Health"                    
## [21] "Maternal.Child.Health-Federal"            
## [22] "Maternal.Child.Health-State.and.Local"    
## [23] "Vocational.Rehabilitation"                
## [24] "Vocational.Rehabilitation-Federal"        
## [25] "Vocational.Rehabilitation-State.and.Local"
## [26] "Other.Federal.Programs"                   
## [27] "SAMHSA"                                   
## [28] "Other.State.and.Local.Programs"           
## [29] "School.Health"                            
## [30] "Total.CMS.Programs"
rx <- rx %>% gather(year, spending, 2:ncol(rx)) %>% spread(payer, spending)

rx <- apply(rx, 2, function(x) str_replace(x, "-", 0))
rx <- apply(rx, 2, function(x) str_replace(x, ",", ""))
rx <- apply(rx, 2, function(x) as.integer(str_trim(x)))

rx <- data.frame(rx, stringsAsFactors = FALSE)

pop <- pop %>% gather(year, population, 2:ncol(pop)) %>% select(year, population)
pop$year <- as.integer(pop$year)

head(pop)
##   year population
## 1 1960        186
## 2 1961        189
## 3 1962        192
## 4 1963        195
## 5 1964        197
## 6 1965        200
rx <- rx %>% inner_join(pop, by = "year") %>% select(year, population, everything())

rx[1:5, 1:5]
##   year population CHIP CHIP.Federal CHIP.State.and.Local
## 1 1960        186    0            0                    0
## 2 1961        189    0            0                    0
## 3 1962        192    0            0                    0
## 4 1963        195    0            0                    0
## 5 1964        197    0            0                    0

Total spending per capita on prescription drugs

per_cap <- rx %>% group_by(year) %>% summarize(per.cap.rx = Total.Prescription.Drug.Expenditures/population)

ggplot(per_cap, aes(x = year, y = per.cap.rx)) + 
  geom_point() + geom_smooth() + 
  labs(x = "Year", y = "Per Capita Rx Expenditures (nominal)")

Inflation adjusted per capita prescription expenditures

In an attempt to adjust for inflation, I sourced annual averages of the Consumer Price Index for All Urban Consumers (CPI-U) from the Bureau of Labor Statistics (I used the data from Series ID: CUUR0000AA0 with 1967 as the base year, and selected for annual averages to be included in the .xlsx download, which was later saved in .csv format) and joined those data to the prescription spending data extracted from the original NHEA table. Then, using the average CPI-U for 2014 as an index, I standardized prescription drug spending each year from 1960 to 2014 in 2014 dollars.

cpi.URL <- getURL("https://raw.githubusercontent.com/juddanderman/Project2/master/CPI-U.csv")

cpi <- read.csv(text = cpi.URL, header = FALSE, stringsAsFactors = FALSE)
cpi <- cpi %>% slice(10:nrow(cpi))
colnames(cpi) <- cpi[1, ]
cpi <- cpi[-1, ]

cpi <- cpi %>% select(Year, Annual)
colnames(cpi)[2] <- "avg.cpi"
cpi <- data.frame(apply(cpi, 1:2, function(x) as.numeric(x)))

cpi <- cpi %>% filter (Year >= 1960 & Year <= 2014)

per_cap <- per_cap %>% inner_join(cpi, by = c("year" = "Year"))

cpi_index <- per_cap$avg.cpi[per_cap$year == 2014]

per_cap <- per_cap %>% mutate(infl.adj.rx = per.cap.rx * (cpi_index/avg.cpi))

ggplot(per_cap, aes(x = year, y = infl.adj.rx)) + 
  geom_point() + geom_smooth() + 
  labs(x = "Year", y = "Per Capita Rx Expenditures (in 2014 dollars)")

per_cap <- per_cap %>% select(year, per.cap.rx, infl.adj.rx) %>% gather(type, per.cap, per.cap.rx:infl.adj.rx)

ggplot(per_cap, aes(x = year, y = per.cap, color = type)) + 
  geom_point() + geom_smooth() + 
  labs(x = "Year", y = "Per Capita Rx Expenditures", 
       main = "Inflation Adjusted and Unadjusted Per Capita Expenditures") +
  scale_color_discrete(name = "Inflation Adjusted/Nominal Spending",
                       labels = c("2014 dollars", "nominal")) +
  theme(legend.position = "bottom")

The inflation adjusted scatter plot of per capita prescription drug spending over time shows greater variabliity in the rate and direction of change than the unadjusted plot of nominal spending per capita, as we see additional periods of noticeable decline or stabilization. We do, however, see a fairly consistent strong increase in inflation adjusted per capita spending from 1980 to 2008-2009, at which point we may be witnessing the effects of the of the Great Recession on prescription drug expenditures.

Change in distribution of spending across funding sources/payers over time

rx_props <- rx %>% mutate_each(funs(./Total.Prescription.Drug.Expenditures), -c(year, population, Total.Prescription.Drug.Expenditures)) %>% select(-Total.Prescription.Drug.Expenditures)

props_3 <- rx_props %>% select(year, Health.Insurance, Out.of.pocket,  Other.Third.Party.Payers.and.Programs)

# The line below was used to check that the sum of the proportions
# was equal to one for each year

#props_3 %>% select(-year) %>% mutate(check_props = rowSums(.))

props_3 <- props_3 %>% gather(payer, prop, 2:4)

props_3$payer <- factor(props_3$payer, 
                        levels = c("Health.Insurance",
                                   "Other.Third.Party.Payers.and.Programs"))

ggplot(props_3, aes(x = year, y = prop)) +
  geom_bar(aes(fill = payer), stat = "identity", position = "stack") + 
  labs(x = "Year", y = "Proportion") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Funding Source", 
                      labels = c("Health Insurance", 
                                 "Out of Pocket",
                                 "Other Third Party Payers"))

ggplot(props_3, aes(x = year, y = prop)) +
  geom_area(aes(fill = payer), position = "stack") + 
  labs(x = "Year", y = "Proportion") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Funding Source", 
                      labels = c("Health Insurance", 
                                 "Out of Pocket",
                                 "Other Third Party Payers"))

props_18 <- rx_props %>% select(year, Out.of.pocket, Private.Health.Insurance, Medicare, Medicaid, CHIP, Department.of.Defense, Department.of.Veterans.Affairs, Worksite.Health.Care, Other.Private.Revenues, Indian.Health.Services, Workers.Compensation, General.Assistance, Maternal.Child.Health, Vocational.Rehabilitation, Other.Federal.Programs, SAMHSA, Other.State.and.Local.Programs, School.Health)

# The line below was used to check that the sum of the proportions
# was equal to one for each year

#props_18 %>% select(-year) %>% mutate(check_props = rowSums(.))

props_18 <- props_18 %>% gather(payer, prop, 2:ncol(props_18))

(props_18 %>% filter(year == 2014) %>% arrange(desc(prop)))[1:5, ]
##   year                    payer       prop
## 1 2014 Private.Health.Insurance 0.42757425
## 2 2014                 Medicare 0.29018670
## 3 2014            Out.of.pocket 0.15022607
## 4 2014                 Medicaid 0.09181452
## 5 2014    Department.of.Defense 0.01844151
set.seed(2)
ggplot(props_18, aes(x = year, y = prop)) +
  geom_bar(aes(fill = payer), stat = "identity", position = "stack") + 
  labs(x = "Year", y = "Proportion") +
  theme(legend.position = "right") +
  scale_fill_manual(name = "Payer", 
                    values = sample(rainbow(18), 
                                    size = 18, 
                                    replace = FALSE)) +
  annotate("text", x = 1980, y = 0.25, label = "Out of Pocket") +
  annotate("text", x = 2004, y = 0.5, label = "Private Insurance") +
  annotate("text", x = 2010, y = 0.76, label = "Medicare") +
  annotate("text", x = 1990, y = 0.91, label = "Medicaid") 

props_18$prop[props_18$year == 2014 & props_18$payer == "Private.Health.Insurance"]
## [1] 0.4275743
set.seed(2)
ggplot(props_18, aes(x = year, y = prop)) +
  geom_area(aes(fill = payer), position = "stack") + 
  labs(x = "Year", y = "Proportion") +
  theme(legend.position = "right") +
  scale_fill_manual(name = "Payer", 
                    values = sample(rainbow(18), 
                                    size = 18, 
                                    replace = FALSE)) +
  annotate("text", x = 1980, y = 0.25, label = "Out of Pocket") +
  annotate("text", x = 2004, y = 0.5, label = "Private Insurance") +
  annotate("text", x = 2010, y = 0.76, label = "Medicare") +
  annotate("text", x = 1990, y = 0.91, label = "Medicaid") 

The stacked bar and area plots of the proportion of prescription drug expenditures by funding source reveal a consistent decrease in the proportion of out-of-pocket spending and an increase in the proportion of spending by health insurance payers. When the health insurers are broken down into more specific payer types, we can more easily see the change over time in the share of prescription drug spending by private insurance companies, Medicaid, and Medicare. Interestingly, we can very clearly see the results of the implementation of Medicare Part D prescription drug benefit program in 2006 in the sharp increase in the proportion of spending by Medicare and decreases in the proportions of the other top payers at that point in time.

Finally, I plot the inflation adjusted total prescription drug spending over time for the four largest funding sources revealed in the stacked bar and area plots above, i.e. out-of-pocket, private insurance, Medicaid, Medicare.

infl_adj_top_4 <- rx %>% select(year, Out.of.pocket, Private.Health.Insurance, Medicaid, Medicare) %>% inner_join(cpi, by = c("year" = "Year")) %>% mutate_each(funs((.) * (cpi_index/avg.cpi)), -year) %>% select(1:5) %>% gather(payer, infl.adj.spending, 2:5)

ggplot(infl_adj_top_4, aes(x = year, y = infl.adj.spending, color = payer)) +
  geom_point() + geom_smooth() +
  labs(x = "Year", y = "Inflatjion Adj. Spending (2014 dollars)")

NYC Citi Bike Data

Sharon Morris provided a link to the June 2016 NYC Citi Bike Trip Data and requested an analysis to produce a profile of the typical Citi Bike user.

Import CSV data

download.file("https://s3.amazonaws.com/tripdata/201606-citibike-tripdata.zip", "201606-citibike-tripdata.zip", method = "curl")

unzip("201606-citibike-tripdata.zip")

bike <- read.csv("201606-citibike-tripdata.csv", header = TRUE, stringsAsFactors = FALSE)

head(bike)
##   tripduration         starttime          stoptime start.station.id
## 1         1470 6/1/2016 00:00:18 6/1/2016 00:24:48              380
## 2          229 6/1/2016 00:00:20 6/1/2016 00:04:09             3092
## 3          344 6/1/2016 00:00:21 6/1/2016 00:06:06              449
## 4         1120 6/1/2016 00:00:28 6/1/2016 00:19:09              522
## 5          229 6/1/2016 00:00:53 6/1/2016 00:04:42              335
## 6          946 6/1/2016 00:01:01 6/1/2016 00:16:48              503
##         start.station.name start.station.latitude start.station.longitude
## 1         W 4 St & 7 Ave S               40.73401               -74.00294
## 2        Berry St & N 8 St               40.71901               -73.95853
## 3          W 52 St & 9 Ave               40.76462               -73.98789
## 4  E 51 St & Lexington Ave               40.75715               -73.97208
## 5 Washington Pl & Broadway               40.72904               -73.99405
## 6       E 20 St & Park Ave               40.73827               -73.98752
##   end.station.id        end.station.name end.station.latitude
## 1           3236      W 42 St & Dyer Ave             40.75898
## 2           3103     N 11 St & Wythe Ave             40.72153
## 3            469      Broadway & W 53 St             40.76344
## 4            401 Allen St & Rivington St             40.72020
## 5            285      Broadway & E 14 St             40.73455
## 6            495        W 47 St & 10 Ave             40.76270
##   end.station.longitude bikeid   usertype birth.year gender
## 1             -73.99380  19859 Subscriber       1972      1
## 2             -73.95782  16233 Subscriber       1967      1
## 3             -73.98268  22397 Subscriber       1989      1
## 4             -73.98998  16231 Subscriber       1991      1
## 5             -73.99074  15400 Subscriber       1989      1
## 6             -73.99301  25193 Subscriber       1974      1

Citi Bike user profiles

mdy_hms(bike$stoptime[1]) - mdy_hms(bike$starttime[1])
## Time difference of 24.5 mins
bike$tripduration[1]/60
## [1] 24.5
day_times <- bike %>% select(starttime) %>% 
  mutate(weekday =  wday(mdy_hms(starttime), label = TRUE, abbr = FALSE),
         time_of_day = hour(mdy_hms(bike$starttime)))

day_of_wk <- day_times %>% group_by(weekday) %>% 
  summarize(num_rides = n())

day_of_wk <- day_of_wk %>% 
  mutate(prop_rides = round(num_rides/sum(day_of_wk$num_rides), digits = 2)) %>%
  arrange(weekday)  

kable(day_of_wk)
weekday num_rides prop_rides
Sunday 151296 0.10
Monday 200488 0.14
Tuesday 209782 0.14
Wednesday 262595 0.18
Thursday 261553 0.18
Friday 200696 0.14
Saturday 173908 0.12
start_day_times <- day_times %>% 
  mutate(day_type = ifelse (weekday %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"), "weekday", "weekend")) %>% 
  group_by(day_type, time_of_day) %>% summarize(num_rides = n()) %>% 
   mutate(prop_rides = round(num_rides/sum(day_of_wk$num_rides), digits = 2)) %>%
  arrange(day_type, time_of_day) %>%
  top_n(5, wt = num_rides)

kable(start_day_times)
day_type time_of_day num_rides prop_rides
weekday 8 107598 0.07
weekday 9 77398 0.05
weekday 17 121131 0.08
weekday 18 120633 0.08
weekday 19 83371 0.06
weekend 13 26509 0.02
weekend 14 27804 0.02
weekend 15 27463 0.02
weekend 16 27144 0.02
weekend 17 27271 0.02
commute_rides <- start_day_times %>% filter(day_type == "weekday") %>% summarize(prop_rides = sum(prop_rides))

From the tables above, we can see that the majority of Citi bike rides in June 2016 occured during the work week, and since 34% of all rides began on weekdays between 8 and 10 am or 5 and 8 pm, it appears that the largest share of Citi Bike rides are for commuting to and from work.

gender_codes <- data.frame(key = c("unknown", "male", "female"), val = 0:2)
gender_name <- function(x)  gender_codes$key[gender_codes$val == x]

# Check customers for birth year and known gender data
nrow(bike %>% filter(usertype == "Customer" & !is.na(birth.year)))
## [1] 0
nrow(bike %>% filter(usertype == "Customer" & gender != 0))
## [1] 0
avg_age <- round(mean(year(Sys.time()) - bike$birth.year, na.rm = TRUE), digits = 1)
avg_trip_minutes <- round(mean(bike$tripduration)/60, digits = 2)
typ_gender <- bike %>% group_by(gender) %>% 
  summarize(N = n()) %>% top_n(1) %>% select(gender) %>% 
  mutate(gender = gender_name(gender))
## Selecting by N
names(typ_gender) <- "typ_gender"

typical_rider <- bike %>% group_by(usertype, gender) %>%
  summarize(N = n(), 
            age = mean(year(Sys.time()) - birth.year, na.rm = TRUE),
            duration_minutes = mean(tripduration)/60) 

typical_rider <- typical_rider %>% 
  mutate(gender = gender_name(gender), age = round(age, digits = 1), 
         duration_minutes = round(duration_minutes, digits = 2))

The first table below shows the most common gender and average age of Citi Bike subscribers - since we do not have gender or birth year data for the non-subscribing customers - and the average trip length in minutes for all rides. The second table breaks down the average trip duration in minutes across rider types (i.e. subscribers and customers) and genders.

kable(data.frame(typ_gender, avg_age, avg_trip_minutes))
typ_gender avg_age avg_trip_minutes
male 38.3 16.5
kable(typical_rider)
usertype gender N age duration_minutes
Customer unknown 156832 NaN 35.87
Subscriber unknown 41404 42.9 24.29
Subscriber male 947464 38.7 13.26
Subscriber female 314618 37.1 15.56

Most used Citi Bike stations

Here we examine the most used Citi Bike stations in terms of total tranfers (i.e. ride starts and ends) and then map those stations using the leaflet package.

starts <- bike %>% group_by(start.station.name) %>% 
  summarize(N = n()) %>% arrange(desc(N))
starts <- starts %>% 
  transmute(ride.status = 
              str_extract(colnames(starts[1]), "^[[:alpha:]]+\\b"), 
            station.name = start.station.name, N)

ends <- bike %>% group_by(end.station.name) %>% 
  summarize(N = n()) %>% arrange(desc(N))
ends <- ends %>% 
  transmute(ride.status = 
              str_extract(colnames(ends[1]), "^[[:alpha:]]+\\b"), 
            station.name = end.station.name, N)

pop_stations <- bind_rows(starts, ends)
pop_stations$ride.status <- factor(pop_stations$ride.status, levels=c("start", "end"))
pop_stations <- pop_stations %>% 
  spread(ride.status, N) %>% 
  mutate(total = start + end) %>% 
  top_n(20, wt = total)

pop_stations <- left_join(pop_stations, 
                          bike[, c("start.station.name", "start.station.id",
                                   "start.station.latitude", "start.station.longitude")], 
                          by = c("station.name" = "start.station.name")) %>% 
  distinct()

pop_stations <- pop_stations %>% 
  rename(station.id = start.station.id, 
         latitude = start.station.latitude, 
         longitude = start.station.longitude)

kable(pop_stations %>% select(station.name, total))
station.name total
12 Ave & W 40 St 20382
8 Ave & W 33 St 17957
Broadway & E 14 St 15899
Broadway & E 22 St 23597
Broadway & W 24 St 16060
Carmine St & 6 Ave 17821
Central Park S & 6 Ave 15771
Christopher St & Greenwich St 15622
Cleveland Pl & Spring St 17575
E 17 St & Broadway 22807
E 47 St & Park Ave 15561
Greenwich Ave & 8 Ave 18430
Lafayette St & E 8 St 20410
Lafayette St & Jersey St 16120
Pershing Square North 31296
South End Ave & Liberty St 17628
Vesey Pl & River Terrace 16422
W 20 St & 11 Ave 19401
W 21 St & 6 Ave 21348
West St & Chambers St 24680
nyc1 <- leaflet() %>% setView(lng = -73.995, lat = 40.73, zoom = 12) %>%  
  addTiles() %>%
  addCircles(lng = pop_stations$longitude, lat = pop_stations$latitude, 
             radius = pop_stations$total / 50,
             color = "red", 
             popup = paste(sep = "<br/>", pop_stations$station.name,
                           paste("Citibike Station ID:",
                                 pop_stations$station.id))) 

nyc1

Data sources:

Helpful R resources:

http://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames

http://stackoverflow.com/questions/12344982/r-lapply-statement-with-index