Data 607 Project 2

Load required packages
Animal and Plant Milk Nutritional Content
US Prescription Drug Spending 1960-2014
NYC Citi Bike Data
Data sources:
Helpful R resources:

Project 2 called on students to select three of our classmates’ selected untidy datasets and proposed analyses, create CSV files containing all of the information included in each dataset, read the CSV files into R, use tidyr and dplyr as needed to tidy and transform the data, and carry out the proposed analyses.

The three untidy datasets that I chose to work with were already available online in CSV format, but in one case, the data for each of the observations of interest were downloaded in five separate CSV files, which were renamed for ease of use and ultimately combined.

NOTE: Several of the required CSV files are downloaded to the current working directory in the execution of this project’s R code, including the relatively large 271.8 MB CSV file of Citi Bike trip data.

Load required packages

# Change working directory as needed, CSV files are downloaded to current working directory
setwd("~/Project2")

library(RCurl)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(lubridate)
library(leaflet)
library(knitr)

Animal and Plant Milk Nutritional Content

Aaron Grzasko’s Nutitritional Content of Milk proposal required selecting five different types of milk, consolidating their nutritional information, and analyzing that data across type (here almond, goat, human, rice, and whole). The data were sourced from the USDA Food Composition Databases. The Stack Overflow questions and answers referenced below were very helpful guides for reading in the individual CSV files into a single list of data frames which could be manipulated and tidied collectively, and for showing a method of indexing the list elements so that a column consisting of the relevant milk type (based on the list element name) could be added to each data frame.

Import CSV data

download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_almond.csv", "milk_almond.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_goat.csv", "milk_goat.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_human.csv", "milk_human.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_rice.csv", "milk_rice.csv", method = "curl")
download.file("https://raw.githubusercontent.com/juddanderman/Project2/master/milk_whole.csv", "milk_whole.csv", method = "curl")

milk_files <- list.files(pattern = "^milk.+\\.csv$")

milk <- lapply(milk_files, 
               function(x) read.csv(file = x, header = FALSE, 
                                    stringsAsFactors = FALSE, encoding = "latin1"))

names(milk) <- str_replace_all(milk_files, "(milk_)|(\\.csv)", "")

Clean and tidy data frame elements of `milk` list

Before reducing the data for each milk type to 3 columns representing the nutrient names, units of measure, and values per 100 g - an attribute common to all the selected milk types and the basis for the subsequent analysis - I extracted the cup sizes in grams for each type so that calories per cup could be easily calculated.

milk <- lapply(milk, function(x) x %>% slice(5:nrow(x)))

# The line of code below was used to ID the columns with cup sizes
# lapply(milk, function(x) x[1, ])

cups <- data.frame(c("almond", "goat", "human", "rice", "whole"), 
         as.numeric(c(str_extract(milk$almond[1, 4], "\\d{3}\\.\\d{1}"), 
           str_extract(milk$goat[1, 5], "\\d{3}\\.\\d{1}"),
           str_extract(milk$human[1, 5], "\\d{3}\\.\\d{1}"),
           str_extract(milk$rice[1, 4], "\\d{3}\\.\\d{1}"),
           str_extract(milk$whole[1, 4], "\\d{3}\\.\\d{1}"))),
         stringsAsFactors =  FALSE)
colnames(cups) <- c("type", "cup.size")
           
milk <- lapply(milk, function(x) x %>% select(num_range("V", 1:3)))

new_colnames <- paste(milk[[1]][1, ])
new_colnames[3] <- "Value/100g"

milk <- lapply(milk, setNames, nm = new_colnames)

milk <- lapply(milk, function(x) x %>% slice(2:nrow(x)))

add_index_colname <- function(i, x) {
  x[[i]] <- x[[i]] %>% mutate(Kind = rep(names(x[i]), times = nrow(x[[i]])))
}

milk <- lapply(seq_along(milk), add_index_colname, x = milk)
names(milk) <- str_replace_all(milk_files, "(milk_)|(\\.csv)", "")

milk <- bind_rows(milk)
milk <- milk %>% filter(Unit != "") %>% unite(Nutrient, Nutrient, Unit, sep = " (")

milk$Nutrient <- str_c(milk$Nutrient, ")")

milk <- milk %>% group_by(Kind) %>% spread(Nutrient, `Value/100g`, convert = TRUE)

Calculate the ratio of protein to total calories for each milk type

kable(milk %>% group_by(Kind) %>% summarize(`Protein (g)`/`Energy (kcal)`))

Kind	`Protein (g)`/`Energy (kcal)`
almond	0.0393333
goat	0.0515942
human	0.0147143
rice	0.0059574
whole	0.0516393

Calculate the average calorie content per cup across milk types

milk <- milk %>% inner_join(cups, by = c("Kind" = "type"))

kable(milk %>% group_by(Kind) %>% 
        summarize(`kcal/cup` = `Energy (kcal)` * (cup.size/100)))

Kind	kcal/cup
almond	39.30
goat	168.36
human	172.20
rice	112.80
whole	148.84

Arrange the observations in ascending order by total calories per cup

kable(milk %>% group_by(Kind) %>% 
        summarize(`kcal/cup` = `Energy (kcal)` * (cup.size/100)) %>% 
        arrange(`kcal/cup`))

Kind	kcal/cup
almond	39.30
rice	112.80
whole	148.84
goat	168.36
human	172.20

Calculate net carbohydrates (i.e. gross carbohydrates - dietary fiber)

kable(milk %>% group_by(Kind) %>% 
        summarize(net.carbohydrates = 
                    `Sugars, total (g)` - `Fiber, total dietary (g)`))

Kind	net.carbohydrates
almond	0.00
goat	4.45
human	6.89
rice	4.98
whole	5.05

US Prescription Drug Spending 1960-2014

I provided a link to US National Health Expenditure Accounts (NHEA) historical data from CMS and requested an analysis of changes in total prescription drug spending per capita and the distribution of funding sources for prescription drug expenditures over time.

Import CSV data

NHE.URL <- getURL("https://raw.githubusercontent.com/juddanderman/Project2/master/NHE2014.csv")

spend <- read.csv(text = NHE.URL, header = FALSE, stringsAsFactors = FALSE)
spend[1:5, 1:5]

##                                                                                               V1
## 1 National Health Expenditures by Type of Service and Source of Funds:  Calendar Years 1960-2014
## 2                                                                  Expenditure Amount (Millions)
## 3                                                             Total National Health Expenditures
## 4                                                                                  Out of pocket
## 5                                                                               Health Insurance
##         V2       V3       V4       V5
## 1                                    
## 2     1960     1961     1962     1963
## 3  27,214   29,138   31,842   34,595 
## 4  12,949   13,357   14,255   15,311 
## 5   7,497    8,236    8,999    9,892

colnames(spend)[2:ncol(spend)] <- spend[2, 2:ncol(spend)]

Extract population counts and prescription drug spending by year

pop <- spend %>% filter(V1 == "POPULATION")
pop[2:ncol(pop)] <- as.integer(pop[2:ncol(pop)]) 

rx <- spend %>% slice(which(str_detect(spend$V1, "Prescription")):(which(str_detect(spend$V1, "Prescription")) + 29))
colnames(rx)[1] <- "payer"

title <- spend[1, 1]

Clean up and melt data, join `rx` and `pop` data frames

I cleaned up and trimmed the strings representing payer names or funding sources, but first looked at the unaltered rx$payer vector as the levels of indentation revealed at a glance which elements of the vector represented umbrella categories of payer types.

#rx$payer
rx$payer <- str_replace(rx$payer, "\\s\\(.+\\)", "")
rx$payer <- str_replace_all(rx$payer, "\\*|'", "")
rx$payer <- str_replace_all(rx$payer, "\\b\\s\\b|/", ".")
rx$payer[7:8] <- str_c(str_trim(rx$payer[6]), str_trim(rx$payer[7:8]), sep = "-")
rx$payer[10:11] <- str_c(str_trim(rx$payer[9]), str_trim(rx$payer[10:11]), sep = "-")
rx$payer[21:22] <- str_c(str_trim(rx$payer[20]), str_trim(rx$payer[21:22]), sep = "-")
rx$payer[24:25] <- str_c(str_trim(rx$payer[23]), str_trim(rx$payer[24:25]), sep = "-")
rx$payer <- str_trim(rx$payer)
rx$payer

##  [1] "Total.Prescription.Drug.Expenditures"     
##  [2] "Out.of.pocket"                            
##  [3] "Health.Insurance"                         
##  [4] "Private.Health.Insurance"                 
##  [5] "Medicare"                                 
##  [6] "Medicaid"                                 
##  [7] "Medicaid-Federal"                         
##  [8] "Medicaid-State.and.Local"                 
##  [9] "CHIP"                                     
## [10] "CHIP-Federal"                             
## [11] "CHIP-State.and.Local"                     
## [12] "Department.of.Defense"                    
## [13] "Department.of.Veterans.Affairs"           
## [14] "Other.Third.Party.Payers.and.Programs"    
## [15] "Worksite.Health.Care"                     
## [16] "Other.Private.Revenues"                   
## [17] "Indian.Health.Services"                   
## [18] "Workers.Compensation"                     
## [19] "General.Assistance"                       
## [20] "Maternal.Child.Health"                    
## [21] "Maternal.Child.Health-Federal"            
## [22] "Maternal.Child.Health-State.and.Local"    
## [23] "Vocational.Rehabilitation"                
## [24] "Vocational.Rehabilitation-Federal"        
## [25] "Vocational.Rehabilitation-State.and.Local"
## [26] "Other.Federal.Programs"                   
## [27] "SAMHSA"                                   
## [28] "Other.State.and.Local.Programs"           
## [29] "School.Health"                            
## [30] "Total.CMS.Programs"

rx <- rx %>% gather(year, spending, 2:ncol(rx)) %>% spread(payer, spending)

rx <- apply(rx, 2, function(x) str_replace(x, "-", 0))
rx <- apply(rx, 2, function(x) str_replace(x, ",", ""))
rx <- apply(rx, 2, function(x) as.integer(str_trim(x)))

rx <- data.frame(rx, stringsAsFactors = FALSE)

pop <- pop %>% gather(year, population, 2:ncol(pop)) %>% select(year, population)
pop$year <- as.integer(pop$year)

head(pop)

##   year population
## 1 1960        186
## 2 1961        189
## 3 1962        192
## 4 1963        195
## 5 1964        197
## 6 1965        200

rx <- rx %>% inner_join(pop, by = "year") %>% select(year, population, everything())

rx[1:5, 1:5]

##   year population CHIP CHIP.Federal CHIP.State.and.Local
## 1 1960        186    0            0                    0
## 2 1961        189    0            0                    0
## 3 1962        192    0            0                    0
## 4 1963        195    0            0                    0
## 5 1964        197    0            0                    0

Total spending per capita on prescription drugs

per_cap <- rx %>% group_by(year) %>% summarize(per.cap.rx = Total.Prescription.Drug.Expenditures/population)

ggplot(per_cap, aes(x = year, y = per.cap.rx)) + 
  geom_point() + geom_smooth() + 
  labs(x = "Year", y = "Per Capita Rx Expenditures (nominal)")

Inflation adjusted per capita prescription expenditures

In an attempt to adjust for inflation, I sourced annual averages of the Consumer Price Index for All Urban Consumers (CPI-U) from the Bureau of Labor Statistics (I used the data from Series ID: CUUR0000AA0 with 1967 as the base year, and selected for annual averages to be included in the .xlsx download, which was later saved in .csv format) and joined those data to the prescription spending data extracted from the original NHEA table. Then, using the average CPI-U for 2014 as an index, I standardized prescription drug spending each year from 1960 to 2014 in 2014 dollars.

cpi.URL <- getURL("https://raw.githubusercontent.com/juddanderman/Project2/master/CPI-U.csv")

cpi <- read.csv(text = cpi.URL, header = FALSE, stringsAsFactors = FALSE)
cpi <- cpi %>% slice(10:nrow(cpi))
colnames(cpi) <- cpi[1, ]
cpi <- cpi[-1, ]

cpi <- cpi %>% select(Year, Annual)
colnames(cpi)[2] <- "avg.cpi"
cpi <- data.frame(apply(cpi, 1:2, function(x) as.numeric(x)))

cpi <- cpi %>% filter (Year >= 1960 & Year <= 2014)

per_cap <- per_cap %>% inner_join(cpi, by = c("year" = "Year"))

cpi_index <- per_cap$avg.cpi[per_cap$year == 2014]

per_cap <- per_cap %>% mutate(infl.adj.rx = per.cap.rx * (cpi_index/avg.cpi))

ggplot(per_cap, aes(x = year, y = infl.adj.rx)) + 
  geom_point() + geom_smooth() + 
  labs(x = "Year", y = "Per Capita Rx Expenditures (in 2014 dollars)")

per_cap <- per_cap %>% select(year, per.cap.rx, infl.adj.rx) %>% gather(type, per.cap, per.cap.rx:infl.adj.rx)

ggplot(per_cap, aes(x = year, y = per.cap, color = type)) + 
  geom_point() + geom_smooth() + 
  labs(x = "Year", y = "Per Capita Rx Expenditures", 
       main = "Inflation Adjusted and Unadjusted Per Capita Expenditures") +
  scale_color_discrete(name = "Inflation Adjusted/Nominal Spending",
                       labels = c("2014 dollars", "nominal")) +
  theme(legend.position = "bottom")

The inflation adjusted scatter plot of per capita prescription drug spending over time shows greater variabliity in the rate and direction of change than the unadjusted plot of nominal spending per capita, as we see additional periods of noticeable decline or stabilization. We do, however, see a fairly consistent strong increase in inflation adjusted per capita spending from 1980 to 2008-2009, at which point we may be witnessing the effects of the of the Great Recession on prescription drug expenditures.

Change in distribution of spending across funding sources/payers over time

rx_props <- rx %>% mutate_each(funs(./Total.Prescription.Drug.Expenditures), -c(year, population, Total.Prescription.Drug.Expenditures)) %>% select(-Total.Prescription.Drug.Expenditures)

props_3 <- rx_props %>% select(year, Health.Insurance, Out.of.pocket,  Other.Third.Party.Payers.and.Programs)

# The line below was used to check that the sum of the proportions
# was equal to one for each year

#props_3 %>% select(-year) %>% mutate(check_props = rowSums(.))

props_3 <- props_3 %>% gather(payer, prop, 2:4)

props_3$payer <- factor(props_3$payer, 
                        levels = c("Health.Insurance",
                                   "Other.Third.Party.Payers.and.Programs"))

ggplot(props_3, aes(x = year, y = prop)) +
  geom_bar(aes(fill = payer), stat = "identity", position = "stack") + 
  labs(x = "Year", y = "Proportion") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Funding Source", 
                      labels = c("Health Insurance", 
                                 "Out of Pocket",
                                 "Other Third Party Payers"))

ggplot(props_3, aes(x = year, y = prop)) +
  geom_area(aes(fill = payer), position = "stack") + 
  labs(x = "Year", y = "Proportion") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "Funding Source", 
                      labels = c("Health Insurance", 
                                 "Out of Pocket",
                                 "Other Third Party Payers"))

props_18 <- rx_props %>% select(year, Out.of.pocket, Private.Health.Insurance, Medicare, Medicaid, CHIP, Department.of.Defense, Department.of.Veterans.Affairs, Worksite.Health.Care, Other.Private.Revenues, Indian.Health.Services, Workers.Compensation, General.Assistance, Maternal.Child.Health, Vocational.Rehabilitation, Other.Federal.Programs, SAMHSA, Other.State.and.Local.Programs, School.Health)

# The line below was used to check that the sum of the proportions
# was equal to one for each year

#props_18 %>% select(-year) %>% mutate(check_props = rowSums(.))

props_18 <- props_18 %>% gather(payer, prop, 2:ncol(props_18))

(props_18 %>% filter(year == 2014) %>% arrange(desc(prop)))[1:5, ]

##   year                    payer       prop
## 1 2014 Private.Health.Insurance 0.42757425
## 2 2014                 Medicare 0.29018670
## 3 2014            Out.of.pocket 0.15022607
## 4 2014                 Medicaid 0.09181452
## 5 2014    Department.of.Defense 0.01844151

set.seed(2)
ggplot(props_18, aes(x = year, y = prop)) +
  geom_bar(aes(fill = payer), stat = "identity", position = "stack") + 
  labs(x = "Year", y = "Proportion") +
  theme(legend.position = "right") +
  scale_fill_manual(name = "Payer", 
                    values = sample(rainbow(18), 
                                    size = 18, 
                                    replace = FALSE)) +
  annotate("text", x = 1980, y = 0.25, label = "Out of Pocket") +
  annotate("text", x = 2004, y = 0.5, label = "Private Insurance") +
  annotate("text", x = 2010, y = 0.76, label = "Medicare") +
  annotate("text", x = 1990, y = 0.91, label = "Medicaid")

props_18$prop[props_18$year == 2014 & props_18$payer == "Private.Health.Insurance"]

## [1] 0.4275743

set.seed(2)
ggplot(props_18, aes(x = year, y = prop)) +
  geom_area(aes(fill = payer), position = "stack") + 
  labs(x = "Year", y = "Proportion") +
  theme(legend.position = "right") +
  scale_fill_manual(name = "Payer", 
                    values = sample(rainbow(18), 
                                    size = 18, 
                                    replace = FALSE)) +
  annotate("text", x = 1980, y = 0.25, label = "Out of Pocket") +
  annotate("text", x = 2004, y = 0.5, label = "Private Insurance") +
  annotate("text", x = 2010, y = 0.76, label = "Medicare") +
  annotate("text", x = 1990, y = 0.91, label = "Medicaid")

The stacked bar and area plots of the proportion of prescription drug expenditures by funding source reveal a consistent decrease in the proportion of out-of-pocket spending and an increase in the proportion of spending by health insurance payers. When the health insurers are broken down into more specific payer types, we can more easily see the change over time in the share of prescription drug spending by private insurance companies, Medicaid, and Medicare. Interestingly, we can very clearly see the results of the implementation of Medicare Part D prescription drug benefit program in 2006 in the sharp increase in the proportion of spending by Medicare and decreases in the proportions of the other top payers at that point in time.

Finally, I plot the inflation adjusted total prescription drug spending over time for the four largest funding sources revealed in the stacked bar and area plots above, i.e. out-of-pocket, private insurance, Medicaid, Medicare.

infl_adj_top_4 <- rx %>% select(year, Out.of.pocket, Private.Health.Insurance, Medicaid, Medicare) %>% inner_join(cpi, by = c("year" = "Year")) %>% mutate_each(funs((.) * (cpi_index/avg.cpi)), -year) %>% select(1:5) %>% gather(payer, infl.adj.spending, 2:5)

ggplot(infl_adj_top_4, aes(x = year, y = infl.adj.spending, color = payer)) +
  geom_point() + geom_smooth() +
  labs(x = "Year", y = "Inflatjion Adj. Spending (2014 dollars)")

NYC Citi Bike Data

Sharon Morris provided a link to the June 2016 NYC Citi Bike Trip Data and requested an analysis to produce a profile of the typical Citi Bike user.

Import CSV data

download.file("https://s3.amazonaws.com/tripdata/201606-citibike-tripdata.zip", "201606-citibike-tripdata.zip", method = "curl")

unzip("201606-citibike-tripdata.zip")

bike <- read.csv("201606-citibike-tripdata.csv", header = TRUE, stringsAsFactors = FALSE)

head(bike)

##   tripduration         starttime          stoptime start.station.id
## 1         1470 6/1/2016 00:00:18 6/1/2016 00:24:48              380
## 2          229 6/1/2016 00:00:20 6/1/2016 00:04:09             3092
## 3          344 6/1/2016 00:00:21 6/1/2016 00:06:06              449
## 4         1120 6/1/2016 00:00:28 6/1/2016 00:19:09              522
## 5          229 6/1/2016 00:00:53 6/1/2016 00:04:42              335
## 6          946 6/1/2016 00:01:01 6/1/2016 00:16:48              503
##         start.station.name start.station.latitude start.station.longitude
## 1         W 4 St & 7 Ave S               40.73401               -74.00294
## 2        Berry St & N 8 St               40.71901               -73.95853
## 3          W 52 St & 9 Ave               40.76462               -73.98789
## 4  E 51 St & Lexington Ave               40.75715               -73.97208
## 5 Washington Pl & Broadway               40.72904               -73.99405
## 6       E 20 St & Park Ave               40.73827               -73.98752
##   end.station.id        end.station.name end.station.latitude
## 1           3236      W 42 St & Dyer Ave             40.75898
## 2           3103     N 11 St & Wythe Ave             40.72153
## 3            469      Broadway & W 53 St             40.76344
## 4            401 Allen St & Rivington St             40.72020
## 5            285      Broadway & E 14 St             40.73455
## 6            495        W 47 St & 10 Ave             40.76270
##   end.station.longitude bikeid   usertype birth.year gender
## 1             -73.99380  19859 Subscriber       1972      1
## 2             -73.95782  16233 Subscriber       1967      1
## 3             -73.98268  22397 Subscriber       1989      1
## 4             -73.98998  16231 Subscriber       1991      1
## 5             -73.99074  15400 Subscriber       1989      1
## 6             -73.99301  25193 Subscriber       1974      1

Citi Bike user profiles

mdy_hms(bike$stoptime[1]) - mdy_hms(bike$starttime[1])

## Time difference of 24.5 mins

bike$tripduration[1]/60

## [1] 24.5

day_times <- bike %>% select(starttime) %>% 
  mutate(weekday =  wday(mdy_hms(starttime), label = TRUE, abbr = FALSE),
         time_of_day = hour(mdy_hms(bike$starttime)))

day_of_wk <- day_times %>% group_by(weekday) %>% 
  summarize(num_rides = n())

day_of_wk <- day_of_wk %>% 
  mutate(prop_rides = round(num_rides/sum(day_of_wk$num_rides), digits = 2)) %>%
  arrange(weekday)  

kable(day_of_wk)

weekday	num_rides	prop_rides
Sunday	151296	0.10
Monday	200488	0.14
Tuesday	209782	0.14
Wednesday	262595	0.18
Thursday	261553	0.18
Friday	200696	0.14
Saturday	173908	0.12

start_day_times <- day_times %>% 
  mutate(day_type = ifelse (weekday %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"), "weekday", "weekend")) %>% 
  group_by(day_type, time_of_day) %>% summarize(num_rides = n()) %>% 
   mutate(prop_rides = round(num_rides/sum(day_of_wk$num_rides), digits = 2)) %>%
  arrange(day_type, time_of_day) %>%
  top_n(5, wt = num_rides)

kable(start_day_times)

day_type	time_of_day	num_rides	prop_rides
weekday	8	107598	0.07
weekday	9	77398	0.05
weekday	17	121131	0.08
weekday	18	120633	0.08
weekday	19	83371	0.06
weekend	13	26509	0.02
weekend	14	27804	0.02
weekend	15	27463	0.02
weekend	16	27144	0.02
weekend	17	27271	0.02

commute_rides <- start_day_times %>% filter(day_type == "weekday") %>% summarize(prop_rides = sum(prop_rides))

From the tables above, we can see that the majority of Citi bike rides in June 2016 occured during the work week, and since 34% of all rides began on weekdays between 8 and 10 am or 5 and 8 pm, it appears that the largest share of Citi Bike rides are for commuting to and from work.

gender_codes <- data.frame(key = c("unknown", "male", "female"), val = 0:2)
gender_name <- function(x)  gender_codes$key[gender_codes$val == x]

# Check customers for birth year and known gender data
nrow(bike %>% filter(usertype == "Customer" & !is.na(birth.year)))

## [1] 0

nrow(bike %>% filter(usertype == "Customer" & gender != 0))

## [1] 0

avg_age <- round(mean(year(Sys.time()) - bike$birth.year, na.rm = TRUE), digits = 1)
avg_trip_minutes <- round(mean(bike$tripduration)/60, digits = 2)
typ_gender <- bike %>% group_by(gender) %>% 
  summarize(N = n()) %>% top_n(1) %>% select(gender) %>% 
  mutate(gender = gender_name(gender))

## Selecting by N

names(typ_gender) <- "typ_gender"

typical_rider <- bike %>% group_by(usertype, gender) %>%
  summarize(N = n(), 
            age = mean(year(Sys.time()) - birth.year, na.rm = TRUE),
            duration_minutes = mean(tripduration)/60) 

typical_rider <- typical_rider %>% 
  mutate(gender = gender_name(gender), age = round(age, digits = 1), 
         duration_minutes = round(duration_minutes, digits = 2))

The first table below shows the most common gender and average age of Citi Bike subscribers - since we do not have gender or birth year data for the non-subscribing customers - and the average trip length in minutes for all rides. The second table breaks down the average trip duration in minutes across rider types (i.e. subscribers and customers) and genders.

kable(data.frame(typ_gender, avg_age, avg_trip_minutes))

typ_gender	avg_age	avg_trip_minutes
male	38.3	16.5

kable(typical_rider)

usertype	gender	N	age	duration_minutes
Customer	unknown	156832	NaN	35.87
Subscriber	unknown	41404	42.9	24.29
Subscriber	male	947464	38.7	13.26
Subscriber	female	314618	37.1	15.56

Most used Citi Bike stations

Here we examine the most used Citi Bike stations in terms of total tranfers (i.e. ride starts and ends) and then map those stations using the leaflet package.

starts <- bike %>% group_by(start.station.name) %>% 
  summarize(N = n()) %>% arrange(desc(N))
starts <- starts %>% 
  transmute(ride.status = 
              str_extract(colnames(starts[1]), "^[[:alpha:]]+\\b"), 
            station.name = start.station.name, N)

ends <- bike %>% group_by(end.station.name) %>% 
  summarize(N = n()) %>% arrange(desc(N))
ends <- ends %>% 
  transmute(ride.status = 
              str_extract(colnames(ends[1]), "^[[:alpha:]]+\\b"), 
            station.name = end.station.name, N)

pop_stations <- bind_rows(starts, ends)
pop_stations$ride.status <- factor(pop_stations$ride.status, levels=c("start", "end"))
pop_stations <- pop_stations %>% 
  spread(ride.status, N) %>% 
  mutate(total = start + end) %>% 
  top_n(20, wt = total)

pop_stations <- left_join(pop_stations, 
                          bike[, c("start.station.name", "start.station.id",
                                   "start.station.latitude", "start.station.longitude")], 
                          by = c("station.name" = "start.station.name")) %>% 
  distinct()

pop_stations <- pop_stations %>% 
  rename(station.id = start.station.id, 
         latitude = start.station.latitude, 
         longitude = start.station.longitude)

kable(pop_stations %>% select(station.name, total))

station.name	total
12 Ave & W 40 St	20382
8 Ave & W 33 St	17957
Broadway & E 14 St	15899
Broadway & E 22 St	23597
Broadway & W 24 St	16060
Carmine St & 6 Ave	17821
Central Park S & 6 Ave	15771
Christopher St & Greenwich St	15622
Cleveland Pl & Spring St	17575
E 17 St & Broadway	22807
E 47 St & Park Ave	15561
Greenwich Ave & 8 Ave	18430
Lafayette St & E 8 St	20410
Lafayette St & Jersey St	16120
Pershing Square North	31296
South End Ave & Liberty St	17628
Vesey Pl & River Terrace	16422
W 20 St & 11 Ave	19401
W 21 St & 6 Ave	21348
West St & Chambers St	24680

nyc1 <- leaflet() %>% setView(lng = -73.995, lat = 40.73, zoom = 12) %>%  
  addTiles() %>%
  addCircles(lng = pop_stations$longitude, lat = pop_stations$latitude, 
             radius = pop_stations$total / 50,
             color = "red", 
             popup = paste(sep = "<br/>", pop_stations$station.name,
                           paste("Citibike Station ID:",
                                 pop_stations$station.id))) 

nyc1

Most popular Citi Bike routes

Finally, I found the most popular Citi Bike routes in June 2016 and roughly mapped those routes as lines between the relevant start and end stations. Interestingly, several of the most common routes had the same start and end station.

routes <- bike %>% group_by(start.station.id, start.station.name, 
                            start.station.longitude, start.station.latitude, 
                            end.station.id, end.station.name, 
                            end.station.longitude, end.station.latitude) %>% 
  summarize(rides = n()) %>% arrange(desc(rides)) %>% ungroup() %>% top_n(10, wt = rides)

kable(routes[, c("start.station.name", "end.station.name", "rides")])

start.station.name	end.station.name	rides
Central Park S & 6 Ave	Central Park S & 6 Ave	1136
Central Park S & 6 Ave	5 Ave & E 78 St	1002
12 Ave & W 40 St	West St & Chambers St	627
Central Park S & 6 Ave	Central Park West & W 85 St	592
Centre St & Chambers St	Centre St & Chambers St	563
Yankee Ferry Terminal	Yankee Ferry Terminal	557
5 Ave & E 78 St	Central Park West & W 85 St	545
Soissons Landing	Yankee Ferry Terminal	544
Central Park West & W 85 St	Central Park S & 6 Ave	537
Soissons Landing	Soissons Landing	536

nyc2 <- leaflet() %>% setView(lng = -73.975, lat = 40.740, zoom = 12) %>% 
  addTiles() %>% 
  addMarkers(lng = routes$start.station.longitude, 
             lat = routes$start.station.latitude,
             popup = paste(sep = "<br/>", routes$start.station.name,
                           paste("Citibike Station ID:",
                                 routes$start.station.id)))  %>% 
  addMarkers(lng = routes$end.station.longitude, 
             lat = routes$end.station.latitude,
             popup = paste(sep = "<br/>", routes$end.station.name,
                           paste("Citibike Station ID:",
                                 routes$end.station.id)))

# add lines between citibike start and end stations, replace "red" below with 
# palette()[(i %% length(palette())) + 1] to get different colored routes
# modify palette as needed: e.g. palette("default"), palette(c("red", "blue", "gray"))

for (i in 1:nrow(routes)) {
  nyc2 <- nyc2 %>% 
    addPolylines(lng = unlist(routes[i, c(3, 7)]), 
                 lat = unlist(routes[i, c(4, 8)]), 
                 opacity = 0.6, color = "red") 
}
nyc2

Data sources:

Helpful R resources:

http://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames

http://stackoverflow.com/questions/12344982/r-lapply-statement-with-index

Data 607 Project 2

Judd Anderman

October 16, 2016

Load required packages

Animal and Plant Milk Nutritional Content

Import CSV data

Clean and tidy data frame elements of `milk` list

Calculate the ratio of protein to total calories for each milk type

Calculate the average calorie content per cup across milk types

Arrange the observations in ascending order by total calories per cup

Calculate net carbohydrates (i.e. gross carbohydrates - dietary fiber)

US Prescription Drug Spending 1960-2014

Import CSV data

Extract population counts and prescription drug spending by year

Clean up and melt data, join `rx` and `pop` data frames

Total spending per capita on prescription drugs

Inflation adjusted per capita prescription expenditures

Change in distribution of spending across funding sources/payers over time

NYC Citi Bike Data

Import CSV data

Citi Bike user profiles

Most used Citi Bike stations

Most popular Citi Bike routes

Data sources:

Helpful R resources:

Data 607 Project 2

Judd Anderman

October 16, 2016

Load required packages

Animal and Plant Milk Nutritional Content

Import CSV data

Clean and tidy data frame elements of milk list

Calculate the ratio of protein to total calories for each milk type

Calculate the average calorie content per cup across milk types

Arrange the observations in ascending order by total calories per cup

Calculate net carbohydrates (i.e. gross carbohydrates - dietary fiber)

US Prescription Drug Spending 1960-2014

Import CSV data

Extract population counts and prescription drug spending by year

Clean up and melt data, join rx and pop data frames

Total spending per capita on prescription drugs

Inflation adjusted per capita prescription expenditures

Change in distribution of spending across funding sources/payers over time

NYC Citi Bike Data

Import CSV data

Citi Bike user profiles

Most used Citi Bike stations

Most popular Citi Bike routes

Data sources:

Helpful R resources:

Clean and tidy data frame elements of `milk` list

Clean up and melt data, join `rx` and `pop` data frames