Let’s start with the Hawaii Farmer’s Markets data, because this will take a lot of work to tidy.
library(stringr)  # str_* functions are used throughout the cleanup below
hawaii.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/Project%202/Hawaii_Farmer_s_Markets.csv"
download.file(hawaii.url, "farmers_markets.csv")
hi.farmers.markets <- read.csv("farmers_markets.csv", na.strings = c("", " "))
To begin with, there is a huge section of data that was entered incorrectly. As near as I can tell, the original CSV had a separate section for each island's farmer's markets, as evidenced by the multiple headers within the file. At some point, however, someone lumped all the farmer's markets together, which is why there is an Island column.
Moreover, in the incorrectly entered rows, the farmer's market name contains multiple lines of information; whoever entered them used newline characters rather than commas, or simply copy-pasted from another source. Fortunately, those incorrect entries are all duplicates; simply searching through the CSV in Excel was enough to confirm this. That means we can delete them outright.
hi.farmers.markets <- hi.farmers.markets[-c(34:146), ]
Conveniently, this also removes the duplicate headers scattered throughout the dataset.
There are also special characters within our data that we don't need: â???T shows up in multiple names (most likely a curly apostrophe mangled in encoding). Rather than guess at a replacement, I am opting to remove it. Likewise, the occasional \n shows up throughout our data.
hi.farmers.markets$FARMER.S.MARKET <- iconv(hi.farmers.markets$FARMER.S.MARKET, to = "ASCII//TRANSLIT")
hi.farmers.markets[6, 1]
## [1] "Hawaiian Homestead Farmersa?T Market at Kuhio Hale Building"
All those weird characters are now converted to a?T, so we can remove them, along with the \n, using regex.
hi.farmers.markets$FARMER.S.MARKET <- str_replace_all(hi.farmers.markets$FARMER.S.MARKET, "a\\?T", "")
hi.farmers.markets$FARMER.S.MARKET <- str_replace_all(hi.farmers.markets$FARMER.S.MARKET, "\n", "")
hi.farmers.markets$Location.1 <- str_replace_all(hi.farmers.markets$Location.1, "\n", " ")
Before we move on, let’s decide what we’d like to examine. There are a lot of missing values in this dataset. Had we more time, it might be possible to fill the gaps, but for now let’s focus on the general schedule for each farmer’s market, and which island they serve.
We do have a slight problem where some island values are incorrectly filled with websites, so we’ll also take care of that now.
hi.farmers.markets <- hi.farmers.markets[, -c(10, 11, 13, 14, 15, 16)]
hi.farmers.markets$Island <- str_replace_all(hi.farmers.markets$Island, "www\\.co\\.honolulu\\.hi\\.us\\/parks\\/programs\\/pom\\s?|\nhttp\\:\\/\\/alamoanafarmersmarket\\.com\\/ala\\-moana\\.html", "")
head(hi.farmers.markets, 10)
## FARMER.S.MARKET
## 1 Country Market & Craft Fair at Waimanalo Homestead Community Center
## 2 Halawa District Park (People's Open Market)
## 3 Kahaluu Farmers' Market at Windward Baptist Church
## 4 Kailua Elementary School
## 5 Fern Forest Town Market
## 6 Hawaiian Homestead Farmers Market at Kuhio Hale Building
## 7 Hilo Coffee Mill "Market @ The Mill"
## 8 Hilos Farmers Market
## 9 Honokaa Farmers Market at Old Botelho Bldg., Downtown Honokaa
## 10 Ho'oulu Community Farmers Market at Keauhou Beach Resort Royal Gardens
## Sunday Monday Tuesday Wednesday Thursday Friday Saturday
## 1 X <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> X <NA>
## 3 <NA> <NA> <NA> <NA> <NA> X <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> X
## 5 <NA> <NA> <NA> <NA> <NA> <NA> X*
## 6 <NA> <NA> <NA> <NA> <NA> <NA> X
## 7 <NA> <NA> <NA> <NA> <NA> <NA> X
## 8 <NA> <NA> <NA> X <NA> <NA> X
## 9 <NA> <NA> <NA> <NA> <NA> <NA> X
## 10 <NA> <NA> <NA> X <NA> <NA> <NA>
## Time Island
## 1 9:00 AM to 4:00PM Oahu
## 2 7:00 AM to 8:00 AM Oahu
## 3 3:00 PM to 7:00 PM Oahu
## 4 4:00 PM to 8:00 PM Oahu
## 5 9:00 AM to 1:00 PM Hawaii
## 6 7:00 AM to 12:00 NOON Hawaii
## 7 8:00 AM - 1:00 PM Hawaii
## 8 6:00 AM to 4:00 PM Hawaii
## 9 7:30 AM to 2:00 PM Hawaii
## 10 12:30 PM to 5:30 PM Hawaii
Now let’s try working on the hours of operations!
hi.farmers.markets$Open <- ""
hi.farmers.markets$Close <- ""
hi.farmers.markets$Diff <- ""
My plan is to split the hours into Open and Close columns for those markets with actual hours of operation. After all, how would one quantify a farmer's market listed with only a single time, or the one market that is "All Day"? We'll also add a column for how long each market operates.
split.time <- function(x){
  for(i in 1:length(x$Time)){
    # Match both "H:MM AM to H:MM PM" and "H:MM AM - H:MM PM" style ranges
    hours <- str_extract(x$Time[i], "\\d+\\:\\d+\\s?[:upper:]{2} to \\d+\\:\\d+\\s?[:alpha:]{2,}|\\d+\\:\\d+\\s?[:alpha:]{2} \\- \\d+\\:\\d+\\s?[:alpha:]{2,}")
    if(!is.na(hours)){
      open.hour <- str_extract(hours, "\\d+\\:\\d+\\s?[:upper:]{2}")
      close.hour <- str_extract(hours, "\\s\\d+\\:\\d+\\s?[:alpha:]{2,}")
      # Normalize "9:00AM" to "9:00 AM"
      if(!str_detect(open.hour, "\\s")){
        l <- str_extract(open.hour, "\\d+\\:\\d+")
        r <- str_extract(open.hour, "[:upper:]{2}")
        open.hour <- paste(l, r, sep = " ")
      }
      # "12:00 NOON" and its variants are really PM
      if(str_detect(close.hour, "NOON|Noon|noon")){
        close.hour <- str_replace(close.hour, "NOON|Noon|noon", "PM")
      }
      close.hour <- str_replace(close.hour, "\\s", "")  # drop the leading space
      # Normalize "1:00PM" to "1:00 PM", taking the AM/PM marker from the closing time itself
      if(!str_detect(close.hour, "\\s")){
        l <- str_extract(close.hour, "\\d+\\:\\d+")
        r <- toupper(str_extract(close.hour, "[:alpha:]{2}"))
        close.hour <- paste(l, r, sep = " ")
      }
      # %I (12-hour clock) is required for %p to register AM/PM; %H would silently ignore it
      o <- strptime(open.hour, format = "%I:%M %p")
      c <- strptime(close.hour, format = "%I:%M %p")
      d <- abs(difftime(o, c, units = "hours"))
      x$Open[i] <- open.hour
      x$Close[i] <- close.hour
      x$Diff[i] <- d
    }
  }
  return(x)
}
hi.farmers.markets <- split.time(hi.farmers.markets)
hi.farmers.markets <- hi.farmers.markets[, -9]  # drop the original Time column
We have two more things to settle before we can get to any analysis. First, we should remove duplicate farmer's markets, if there are any. Then we should reformat the days: we don't need the * for our purposes, but we can keep the X.
hi.farmers.markets <- hi.farmers.markets[!duplicated(hi.farmers.markets$FARMER.S.MARKET), ]
# Normalize every day-of-week column in one pass instead of repeating the same two lines
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
for(day in days){
  hi.farmers.markets[[day]][!is.na(hi.farmers.markets[[day]])] <- "X"
  hi.farmers.markets[[day]] <- factor(hi.farmers.markets[[day]])
}
colnames(hi.farmers.markets)[1] <- "Markets"
# write.csv(hi.farmers.markets, "clean_farmers_markets.csv", row.names = FALSE)
We see that the most popular time for a farmer’s market to open is 7:00 AM, followed by 9:00 AM, 8:00 AM, and 10:00 AM. There are also quite a few afternoon openings at 3:00 PM and 4:00 PM.
We see that the most popular closing time is 1:00 PM, followed by 2:00 PM, 11:00 AM, 12:00 PM, and 5:00 PM.
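The plots behind these observations aren't reproduced here, but the same counts can be tallied directly from the new columns; a quick sketch:
sort(table(hi.farmers.markets$Open[hi.farmers.markets$Open != ""]), decreasing = TRUE)
sort(table(hi.farmers.markets$Close[hi.farmers.markets$Close != ""]), decreasing = TRUE)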
Interestingly, most farmer's markets are open for only 1 hour, while others are open for 3 to 5 hours. A surprising number are open for only 45 minutes.
It isn't much of a surprise that Oahu has the lion's share of farmer's markets, but it is surprising that it hosts twice as many as second place. Still, the Big Island has a respectable showing with 25 farmer's markets.
Lastly, we see that most of the farmer's markets are open only one day a week. Though some are open 7 days a week, none are open 6. The mean number of days open is about 1.7.
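Both of these last summaries can be checked straight from the cleaned frame; a quick sketch (days.open is my name, not the original's):
# Markets per island
sort(table(hi.farmers.markets$Island), decreasing = TRUE)
# Days open per market: count the non-NA day columns in each row
days.open <- rowSums(!is.na(hi.farmers.markets[, days]))
table(days.open)
mean(days.open)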
metro.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/Project%202/Fare_Card_History_for_Metropolitan_Transportation_Authority__MTA___Beginning_2010.csv"
download.file(metro.url, "2010_mta.csv")
metrocard <- read.csv("2010_mta.csv")
There is a lot of information in this dataset, which records Metrocard usage per week from 2010 to 2018. For now, I am only interested in Astoria's stations: that is where I live, two stations there are currently shut down for platform work, and more will be shut down later in the year. So let's filter this big dataset down.
By peeking at the levels of the Station column, we see that almost all stations are identified by their cross streets. I am only interested in the Astoria stations on the N/W line. So, for example, if I wanted the information for 30 Av, I would need to look up 30TH AVENUE & 31ST STREET. But there is a quirk: the levels show two separate entries, "30TH AVENUE & 31ST STREET" and "30TH AVENUE & 31ST STREET " (note the trailing space), with a different number of records under each. For now, let's extract both and tidy them afterward.
astoria.stations <- c("30TH AVENUE & 31ST STREET", "30TH AVENUE & 31ST STREET ", "36TH AVENUE & 31ST STREET", "36TH AVENUE & 31ST STREET ", "39TH AVENUE & 31ST STREET", "39TH AVENUE & 31ST STREET ", "ASTORIA BLVD & 31ST STREET", "ASTORIA BLVD & 31ST STREET ", "BROADWAY & 31ST STREET", "BROADWAY & 31ST STREET ", "DITMARS BLVD & 31ST STREET ", "DITMARS BLVD & 31ST STREET", "QUEENSBORO PLAZA", "QUEENSBORO PLAZA ")
astoria <- metrocard[metrocard$Station %in% astoria.stations, ]
head(astoria, 10)
## From.Date To.Date Remote.Station.ID Station
## 79 2/13/2016 2/19/2016 R090 39TH AVENUE & 31ST STREET
## 80 2/13/2016 2/19/2016 R091 36TH AVENUE & 31ST STREET
## 81 2/13/2016 2/19/2016 R092 BROADWAY & 31ST STREET
## 82 2/13/2016 2/19/2016 R093 30TH AVENUE & 31ST STREET
## 83 2/13/2016 2/19/2016 R094 ASTORIA BLVD & 31ST STREET
## 84 2/13/2016 2/19/2016 R095 DITMARS BLVD & 31ST STREET
## 110 2/13/2016 2/19/2016 R121 QUEENSBORO PLAZA
## 545 1/30/2016 2/5/2016 R090 39TH AVENUE & 31ST STREET
## 546 1/30/2016 2/5/2016 R091 36TH AVENUE & 31ST STREET
## 547 1/30/2016 2/5/2016 R092 BROADWAY & 31ST STREET
## Full.Fare Senior.Citizen...Disabled
## 79 5498 310
## 80 12391 688
## 81 20706 1386
## 82 23772 1247
## 83 21134 894
## 84 29178 1510
## 110 19525 601
## 545 5863 340
## 546 14497 843
## 547 24436 1590
## X7.Day.ADA.Farecard.Access.System.Unlimited
## 79 58
## 80 60
## 81 134
## 82 154
## 83 115
## 84 151
## 110 122
## 545 57
## 546 67
## 547 152
## X30.Day.ADA.Farecard.Access.System.Unlimited Joint.Rail.Road.Ticket
## 79 104 2
## 80 155 12
## 81 325 8
## 82 337 13
## 83 264 12
## 84 308 13
## 110 314 83
## 545 134 6
## 546 223 11
## 547 366 15
## X7.Day.Unlimited X30.Day.Unlimited
## 79 4524 3834
## 80 7505 14468
## 81 10911 29449
## 82 13389 31086
## 83 11629 24597
## 84 13895 35371
## 110 14633 19267
## 545 4678 4781
## 546 8873 17065
## 547 12121 34025
## X14.Day.Reduced.Fare.Media.Unlimited X1.Day.Unlimited
## 79 0 0
## 80 0 0
## 81 0 0
## 82 0 0
## 83 0 0
## 84 0 0
## 110 0 0
## 545 0 0
## 546 0 0
## 547 0 0
## X14.Day.Unlimited X7.Day.Express.Bus.Pass Transit.Check.Metrocard
## 79 0 0 46
## 80 0 0 135
## 81 0 0 279
## 82 0 0 196
## 83 0 0 282
## 84 0 0 434
## 110 0 0 167
## 545 0 0 49
## 546 0 0 155
## 547 0 0 361
## LIB.Special.Senior Rail.Road.Unlimited.No.Trade
## 79 122 1
## 80 239 13
## 81 455 10
## 82 511 12
## 83 272 24
## 84 712 8
## 110 253 108
## 545 153 1
## 546 310 18
## 547 473 17
## Transit.Check.Metrocard.Annual.Metrocard Mail.and.Ride.EZPass.Express
## 79 317 56
## 80 1014 170
## 81 2548 406
## 82 2714 383
## 83 2025 351
## 84 2881 547
## 110 2143 392
## 545 363 81
## 546 1150 171
## 547 3049 486
## Mail.and.Ride.Unlimited Path.2.Trip Airtran.Full.Fare Airtran.30.Day
## 79 57 0 141 0
## 80 270 0 131 0
## 81 650 0 252 0
## 82 537 0 247 0
## 83 371 0 210 0
## 84 502 0 269 0
## 110 350 0 349 0
## 545 86 0 112 0
## 546 326 0 143 0
## 547 720 0 270 0
## Airtran.10.Trip Airtran.Monthly
## 79 0 0
## 80 0 0
## 81 0 0
## 82 0 0
## 83 0 0
## 84 0 0
## 110 0 0
## 545 0 0
## 546 0 0
## 547 0 0
Looking at the data again, it seems that every station has a corresponding Remote.Station.ID. Rather than fixing the station names themselves, we can replace each Remote.Station.ID with its corresponding station name, then remove the Station column completely, along with any other columns we won't be needing.
# str_replace_all accepts a named vector of pattern = replacement pairs
station.names <- c("R090" = "39th Ave", "R091" = "36th Ave", "R092" = "Broadway",
                   "R093" = "30th Ave", "R094" = "Astoria Blvd",
                   "R095" = "Ditmars Blvd", "R121" = "Queensboro Plaza")
astoria$Remote.Station.ID <- str_replace_all(astoria$Remote.Station.ID, station.names)
astoria$Remote.Station.ID <- factor(astoria$Remote.Station.ID)
# str_extract (not str_extract_all) returns a character vector rather than a list
astoria$From.Date <- str_extract(astoria$From.Date, "\\d{4}")
astoria <- astoria[, c(1, 3, 5, 6, 10, 11)]
colnames(astoria) <- c("Year", "Station", "Full.Fare", "SCD", "x7.Day.Un", "x30.Day.Un")
astoria <- astoria[order(as.numeric(astoria$Year)), ]
astoria$Year <- as.numeric(astoria$Year)
annual.astoria <- aggregate(. ~ Year + Station, data = astoria, sum)
head(annual.astoria, 10)
## Year Station Full.Fare SCD x7.Day.Un x30.Day.Un
## 1 2010 30th Ave 890466 37916 300552 1158843
## 2 2011 30th Ave 1505589 64101 532517 1767177
## 3 2012 30th Ave 1612274 71517 585371 1860489
## 4 2013 30th Ave 1585521 75747 670494 1825441
## 5 2014 30th Ave 1479235 75287 727827 1736341
## 6 2015 30th Ave 1449250 69118 486062 1501683
## 7 2016 30th Ave 1345898 78681 782095 1667235
## 8 2017 30th Ave 963811 55998 551091 1151983
## 9 2018 30th Ave 1 0 0 1
## 10 2010 36th Ave 482943 20246 175349 580075
Now that we've got our data, let's see the effect of the 30th Avenue and 36th Avenue closures that happened in late 2017.
As expected, there were some major declines in 2017 for both 30th Ave and 36th Ave. Did the other stations pick up the slack?
Ridership holds roughly steady in 2017 for each neighboring station, but without the marked increase we would expect; in most instances the numbers stay the same, or rise or fall slightly. Perhaps we are being too granular? What if we took the total ridership across Full.Fare, SCD, x7.Day.Un, and x30.Day.Un?
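That total isn't computed in the code shown above; here is a minimal sketch of the aggregation, with astoria.totals as my own (assumed) name:
astoria.totals <- annual.astoria
astoria.totals$Total <- with(astoria.totals, Full.Fare + SCD + x7.Day.Un + x30.Day.Un)
head(astoria.totals[, c("Year", "Station", "Total")])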
Indeed, ridership mostly held steady through late 2017 at the open stations. Admittedly, this would need to be revisited once 2018 is through, when we can see the full effect the (planned) seven-month shutdown had on Astoria. Still, I would have expected the riders lost at the closed stations to show up at the open ones! The drop at the closed stations is especially dramatic when one considers they were closed for only one full month in 2017.
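The ridership.diff table below compares 2016 and 2017 totals per station; a sketch of one way to build it, reusing the assumed astoria.totals frame from the previous sketch:
t16 <- subset(astoria.totals, Year == 2016)
t17 <- subset(astoria.totals, Year == 2017)
m <- merge(t16, t17, by = "Station", suffixes = c(".16", ".17"))
ridership.diff <- data.frame(Station = m$Station,
                             Percent.Change = (m$Total.17 - m$Total.16) / m$Total.16)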
ridership.diff
## Station Percent.Change
## 1 Astoria Blvd 0.01936947
## 2 30th Ave -0.29712262
## 3 Broadway -0.03966321
## 4 36th Ave -0.31777958
## 5 39th Ave -0.03849471
## 6 Queensboro Plaza 0.08487820
We can easily calculate the drop between years: for both 30th Ave and 36th Ave, ridership dropped by about 30%, yet we don't see a matching increase at the neighboring stations. As the graphs illustrate, most of those stations actually see minor decreases. Astoria Blvd sees a negligible two percent increase, and Queensboro Plaza, one of the main choke points into Manhattan, sees only an 8% increase.
CitiBike also came to Astoria around this time. Perhaps people affected by the subway closures coped by taking other modes of transit, such as CitiBike, the bus, or taxis and ride shares.
movie.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/Project%202/movie_metadata.csv"
download.file(movie.url, "imdb_data.csv")
imdb <- read.csv("imdb_data.csv", na.strings = c("", " ", "NA"))
There are a lot of columns here. For now, let’s concern ourselves with the power of social media. Is there a positive correlation between Facebook likes for the movie/cast and IMDB score? Does this also translate to a better gross?
We’ll start by removing all the columns we don’t need. We will also rearrange the columns into something more readable. Lastly, we’ll only concern ourselves with films categorized as USA.
imdb <- imdb[, c(12, 21, 24, 2, 11, 7, 15, 9, 23, 13, 14, 28, 19, 26)]
imdb <- subset(imdb, country == "USA")
imdb <- imdb[, -2]
The movie_title column has a strange character at the end of every title, so let's tidy that up next. We will also drop any entries with NA values, along with any movies that have 0 Facebook likes.
imdb$movie_title <- str_replace_all(imdb$movie_title, "Â\\s*", "")
imdb <- imdb[complete.cases(imdb),]
imdb <- imdb[imdb$movie_facebook_likes != 0,]
imdb <- imdb[!duplicated(imdb$movie_title), ]
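The three top-10 tables that follow can be produced along these lines; in this sketch, top.gross, top.likes, and top.score are my assumed names rather than the original's:
top.gross <- head(imdb[order(imdb$gross, decreasing = TRUE), ], 10)
top.likes <- head(imdb[order(imdb$movie_facebook_likes, decreasing = TRUE), ], 10)
top.score <- head(imdb[order(imdb$imdb_score, decreasing = TRUE), ], 10)
top.gross[, c("movie_title", "title_year", "gross")]  # the first table, printed below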
## movie_title title_year gross
## 1 Avatar 2009 760505847
## 27 Titanic 1997 658672302
## 30 Jurassic World 2015 652177271
## 18 The Avengers 2012 623279547
## 67 The Dark Knight 2008 533316061
## 241 Star Wars: Episode I - The Phantom Menace 1999 474544677
## 3025 Star Wars: Episode IV - A New Hope 1977 460935665
## 9 Avengers: Age of Ultron 2015 458991599
## 4 The Dark Knight Rises 2012 448130642
## 3081 E.T. the Extra-Terrestrial 1982 434949459
Avatar is king of the gross according to IMDB, followed by Titanic, Jurassic World, The Avengers, and The Dark Knight. Only 4 of the top 10 movies come from this decade; 4 come from the previous millennium, and E.T. the Extra-Terrestrial is the oldest of them all.
## movie_title title_year movie_facebook_likes
## 97 Interstellar 2014 349000
## 297 Django Unchained 2012 199000
## 11 Batman v Superman: Dawn of Justice 2016 197000
## 180 The Revenant 2015 190000
## 98 Inception 2010 175000
## 100 The Hobbit: An Unexpected Journey 2012 166000
## 4 The Dark Knight Rises 2012 164000
## 279 The Martian 2015 153000
## 30 Jurassic World 2015 150000
## 1579 The Grand Budapest Hotel 2014 149000
In contrast, the movies with the most Facebook likes all come from this decade. Inception is the oldest of the 10, but falls in the middle of the group.
## movie_title title_year imdb_score
## 1938 The Shawshank Redemption 1994 9.3
## 3467 The Godfather 1972 9.2
## 67 The Dark Knight 2008 9.0
## 2838 The Godfather: Part II 1974 9.0
## 340 The Lord of the Rings: The Return of the King 2003 8.9
## 1875 Schindler's List 1993 8.9
## 3356 Pulp Fiction 1994 8.9
## 98 Inception 2010 8.8
## 684 Fight Club 1999 8.8
## 837 Forrest Gump 1994 8.8
Lastly, the highest-rated movies on IMDB are older and more varied. The 1970s make a much stronger showing here, with The Godfather and The Godfather: Part II both appearing. Half the movies in the top ten come from the 1990s, perhaps reflecting the age of IMDB's user base?
Let’s see if there are any overlaps.
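The common object printed below holds any titles shared by all three lists; a sketch of one way to construct it, reusing the assumed top.* frames from the earlier sketch:
common <- top.gross[top.gross$movie_title %in% top.likes$movie_title &
                    top.gross$movie_title %in% top.score$movie_title, ]
common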
common
## [1] movie_title title_year
## [3] director_name actor_1_name
## [5] actor_2_name actor_3_name
## [7] gross budget
## [9] num_voted_users cast_total_facebook_likes
## [11] movie_facebook_likes num_user_for_reviews
## [13] imdb_score
## <0 rows> (or 0-length row.names)
Nope. There are no movies common to the top 10 of each list. So let's see what other trends we can find.
There doesn't seem to be much of a trend when comparing a movie's likes with its cast's total likes; on average, the movie garners more likes than the cast combined.
On the other hand, there is a rough suggestion that a movie's Facebook likes track its gross.
Comparing gross with IMDB rating is just a mess; there seems to be no relationship between the two.
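To put rough numbers on these impressions, the pairwise correlations can be checked directly; a quick sketch on the cleaned imdb frame:
cor(imdb$movie_facebook_likes, imdb$cast_total_facebook_likes)
cor(imdb$movie_facebook_likes, imdb$gross)
cor(imdb$gross, imdb$imdb_score)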
Let's look at Facebook likes vs. gross one more time, limiting ourselves to 2012 - 2017. Facebook went public in 2012, so we will assume that is roughly when it hit peak popularity.
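A sketch of that restriction, with fb.era as my assumed name for the subset:
fb.era <- subset(imdb, title_year >= 2012 & title_year <= 2017)
cor(fb.era$movie_facebook_likes, fb.era$gross)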
Without running this through a machine learning algorithm it's hard to know for certain, but maybe. It makes a kind of sense that Facebook likes would mirror gross: both express how an audience reacts to a movie, and a like is a rawer signal than deliberately rating a film.