For this project, we’ll be taking 3 data sources referenced from recent discussion topics and analyzing them. We’ll need to tidy the datasets, then take a brief dive into what kind of story the data wants to tell us.
We’ll be using a few pages that we’ll load first:
library(tidyr)
library(scales)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(openxlsx)
This table takes a look at the average high, low, precipiation, and snow by month for your local area. The website determined LaGuardia Airport is closest to me, so we’re taking a look at a snapshot of that table taken on March 2, 2020. Let’s load the data and take a look:
weather <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607/master/laguardia_weather.csv', stringsAsFactors = F)
head(weather)
## X Jan Feb Mar Apr May Jun Jul Aug Sep
## 1 Average high in ºF 39.00 42.00 50.00 61 71.00 80.00 85.0 84.00 76.00
## 2 Average low in ºF 27.00 29.00 35.00 44 54.00 64.00 69.0 69.00 62.00
## 3 Av. precipitation in inch 3.17 2.76 3.97 4 3.79 3.94 4.5 4.12 3.73
## 4 Av. snowfall in inch 7.00 9.00 4.00 1 0.00 0.00 0.0 0.00 0.00
## Oct Nov Dec
## 1 65.00 55.00 44.00
## 2 51.00 42.00 32.00
## 3 3.78 3.41 3.56
## 4 0.00 0.00 5.00
A few things we need to tidy this dataset: * We need to rename the first column to something more meaningful * It might be useful to convert the month names into numbers or a factor so that we can plot it in chronological order
names(weather)[1] <- 'measure'
w <- gather(weather, month, value, -measure)
w$month <- factor(w$month, levels = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'))
str(w)
## 'data.frame': 48 obs. of 3 variables:
## $ measure: chr "Average high in ºF" "Average low in ºF" "Av. precipitation in inch" "Av. snowfall in inch" ...
## $ month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ value : num 39 27 3.17 7 42 29 2.76 9 50 35 ...
Now that the data is ready, lets plot out the data and share some thoughts:
# avg high and low temps by month
filter(w, substr(measure,1,7) == 'Average') %>%
ggplot(aes(x = month, y = value, color = measure, group = measure)) +
geom_line() +
theme_bw() +
labs(x = element_blank(), y = element_blank(), title = 'Average Temperature by Month, 1981 - 2010', subtitle = 'LaGuardia Airport') +
theme(legend.title = element_blank())
filter(w, substr(measure,1,7) != 'Average') %>%
ggplot(aes(x = month, y = value, color = measure, group = measure)) +
geom_line() +
theme_bw() +
labs(x = element_blank(), y = element_blank(), title = 'Average Rain and Snow by Month, 1981 - 2010', subtitle = 'LaGuardia Airport') +
theme(legend.title = element_blank())
Not surprisingly, the hottest months are during summer. It looks like from June to mid August the temperature averages 80 or above, which is very warm. Also not a surprise, the coldest months are in winter, but its a bit harder to see the dip there because the chart above is setup as a traditional calendar, Jan - Dec.
Regarding rain and snow, there’s virtually no snow between May and November and it typically peaks in February. We haven’t had any snow this season, so I wonder how our current year compares to the climate data here. Rain is pretty even throughout the year, peaking in July and also dipping during the winter months.
w$month <- factor(w$month, levels = c('Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'))
filter(w, substr(measure,1,7) == 'Average') %>%
ggplot(aes(x = month, y = value, color = measure, group = measure)) +
geom_line() +
theme_bw() +
labs(x = element_blank(), y = element_blank(), title = 'Average Temperature by Month, 1981 - 2010', subtitle = 'LaGuardia Airport') +
theme(legend.title = element_blank()) +
geom_hline(yintercept = 32, linetype = 2)
filter(w, substr(measure,1,7) != 'Average') %>%
ggplot(aes(x = month, y = value, color = measure, group = measure)) +
geom_line() +
theme_bw() +
labs(x = element_blank(), y = element_blank(), title = 'Average Rain and Snow by Month, 1981 - 2010', subtitle = 'LaGuardia Airport') +
theme(legend.title = element_blank())
Here’s a look at the average temperature chart with winter in the middle and a horizontal line at the freezing point. It looks like there’s danger of freezing temperatures from December to mid February. Watch out for driving on icy roads in the morning!
Next, lets take a look at MBTA data, which details average ridership across different public transportation modes in Boston. W
mbta <- read.xlsx('https://github.com/chitrarth2018/607-Project-2/blob/master/mbta.xlsx?raw=true', startRow = 2)
head(mbta)
## X1 mode 2007-01 2007-02 2007-03 2007-04 2007-05 2007-06 2007-07
## 1 1 All Modes by Qtr NA NA 1187.653 NA NA 1245.959 NA
## 2 2 Boat 4.000 3.600 40.000 4.300 4.900 5.800 6.521
## 3 3 Bus 335.819 338.675 339.867 352.162 354.367 350.543 357.519
## 4 4 Commuter Rail 142.200 138.500 137.700 139.500 139.000 143.000 142.391
## 5 5 Heavy Rail 435.294 448.271 458.583 472.201 474.579 477.032 471.735
## 6 6 Light Rail 227.231 240.262 241.444 255.557 248.262 246.108 243.286
## 2007-08 2007-09 2007-10 2007-11 2007-12 2008-01 2008-02 2008-03 2008-04
## 1 NA 1256.571 NA NA 1216.890 NA NA 1253.522 NA
## 2 6.572 5.469 5.145 3.763 2.985 3.175 3.111 3.512 4.164
## 3 355.479 372.598 368.847 330.826 312.920 340.324 352.905 361.155 368.189
## 4 142.364 143.051 146.542 145.089 141.585 142.145 142.607 137.453 140.389
## 5 461.605 499.566 457.741 488.348 448.268 472.624 492.100 494.046 513.204
## 6 234.907 265.748 241.434 250.497 233.379 241.223 249.306 253.132 271.070
## 2008-05 2008-06 2008-07 2008-08 2008-09 2008-10 2008-11 2008-12 2009-01
## 1 NA 1314.821 NA NA 1307.041 NA NA 1232.655 NA
## 2 4.015 5.189 6.016 5.800 4.587 4.285 3.488 3.007 3.014
## 3 363.903 362.962 370.921 361.057 389.537 357.974 345.423 325.767 338.532
## 4 142.585 142.057 145.731 144.565 141.907 151.957 152.952 140.810 141.448
## 5 507.952 518.349 512.309 476.990 517.324 523.644 487.115 446.743 461.004
## 6 258.351 266.961 270.158 239.344 258.171 250.063 232.068 205.420 215.660
## 2009-02 2009-03 2009-04 2009-05 2009-06 2009-07 2009-08 2009-09 2009-10
## 1 NA 1209.792 NA NA 1233.085 NA NA 1230.461 NA
## 2 3.196 3.330 4.049 4.119 4.900 6.444 5.903 4.696 4.212
## 3 360.412 353.686 359.380 354.750 347.865 339.477 332.661 374.260 385.868
## 4 143.529 142.893 142.340 144.225 142.006 137.691 139.158 139.087 137.104
## 5 482.407 467.224 493.152 475.634 473.099 470.828 466.676 500.403 513.406
## 6 228.737 222.844 238.232 224.962 226.259 230.308 231.783 250.922 230.739
## 2009-11 2009-12 2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07
## 1 NA 1207.845 NA NA 1208.857 NA NA 1244.409 NA
## 2 3.576 3.113 3.207 3.195 3.481 4.452 4.415 5.411 6.513
## 3 366.980 332.394 362.226 361.138 373.443 378.611 380.171 363.275 353.040
## 4 129.343 126.066 130.910 131.918 131.252 131.722 128.800 129.144 122.935
## 5 480.278 440.925 464.069 480.121 483.397 502.374 487.400 490.263 488.587
## 6 214.711 194.446 204.396 213.136 211.693 227.246 217.805 215.922 218.729
## 2010-08 2010-09 2010-10 2010-11 2010-12 2011-01 2011-02 2011-03 2011-04
## 1 NA 1225.527 NA NA 1216.262 NA NA 1223.452 NA
## 2 6.269 4.699 4.402 3.731 3.156 3.140 3.284 3.674 4.251
## 3 343.688 381.622 384.987 367.955 326.338 334.958 346.234 380.399 380.446
## 4 129.732 132.892 131.033 130.889 121.422 128.396 125.463 134.374 134.169
## 5 473.731 521.099 532.403 502.887 450.433 468.418 504.068 516.730 528.631
## 6 210.530 236.368 236.366 221.881 196.211 198.450 219.886 227.935 242.280
## 2011-05 2011-06 2011-07 2011-08 2011-09 2011-10
## 1 NA 1302.414 NA NA 1290.549 NA
## 2 4.431 5.474 6.581 6.733 5.003 4.484
## 3 385.289 376.317 361.585 353.793 388.271 398.456
## 4 136.140 135.581 132.410 130.616 136.901 128.720
## 5 528.122 529.528 532.888 508.145 550.137 554.932
## 6 225.776 221.865 231.010 220.164 244.949 237.768
List of action items to tidy: * Remove the first column as we don’t need row numbers as a column * Take the wide month columns and bring them into their own field * Convert the month to a date datatype * Remove summary values
mbta2 <- select(mbta, -X1) %>%
gather(month, avg_riders, -mode) %>%
mutate(month = as.Date(str_c(month, '-01')))
#just transportation modes
t <- filter(mbta2, mode != 'TOTAL' & mode != 'All Modes by Qtr' & mode != 'Pct Chg / Yr')
#summary fields
ts <- filter(mbta2, mode == 'TOTAL' | mode == 'All Modes by Qtr' | mode == 'Pct Chg / Yr')
Okay, lets take a quick look at average ridership by mode:
ggplot(t, aes(x = month, y = avg_riders, color = mode)) +
geom_line() +
theme(legend.title = element_blank()) +
labs(x = element_blank(), y = element_blank(), title = 'Average Ridership by Transportation Mode', subtitle = 'MBTA Data') +
theme_bw()
It looks like heavy rail has the most riders, followed by bus, light rail, and then commuter rail. All other modes seem insignificant compared to the top 4. Its possible the capacity of the others is highly limited and has special considerations, like special router or special passengers.
When looking at the top 3 transportation modes, it looks like there’s some seasonality in the recent years, specifically ridership being the lowest at the end of the year, possibly due to fewer commuters during holidays. Let’s take a look at ridership by month with years split out for the top 3 modes:
filter(t, mode == 'Heavy Rail' | mode == 'Bus' | mode == 'Light Rail') %>%
mutate(mo = month(month), yr = year(month)) %>%
ggplot(aes(x = mo, y = avg_riders, color = factor(yr))) +
geom_line() +
facet_wrap(~mode, nrow = 3, scales = 'free_y') +
scale_x_continuous(breaks = seq(1,12,1)) +
theme(legend.title = element_blank()) +
labs(x = element_blank(), y = element_blank(), title = 'Average Riders by Month', subtitle = 'Top 3 Transportation Modes - MBTA') +
theme_bw()
Well, it looks like not only does ridership drop in December, it also seems to dip again in August. This could be because there’s students and other commuters on summer holidays.
Just out of curosity, lets take a look at ridership for the least popular modes:
filter(t, mode != 'Heavy Rail' & mode != 'Bus' & mode != 'Light Rail' & mode != 'Commuter Rail') %>%
mutate(mo = month(month), yr = year(month)) %>%
ggplot(aes(x = mo, y = avg_riders, color = factor(yr))) +
geom_line() +
facet_wrap(~mode, nrow = 3, scales = 'free_y') +
scale_x_continuous(breaks = seq(1,12,1)) +
theme(legend.title = element_blank()) +
labs(x = element_blank(), y = element_blank(), title = 'Average Riders by Month', subtitle = 'Top 3 Transportation Modes - MBTA') +
theme_bw()
It would be interesting to see other data points regarding time, money, and trip counts to provide further analysis.
Lets take a quick look at the dataset for Bob Ross Paintings.
br <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607/master/bob_ross.csv', stringsAsFactors = F)
head(br)
## EPISODE TITLE APPLE_FRAME AURORA_BOREALIS BARN BEACH BOAT
## 1 S01E01 "A WALK IN THE WOODS" 0 0 0 0 0
## 2 S01E02 "MT. MCKINLEY" 0 0 0 0 0
## 3 S01E03 "EBONY SUNSET" 0 0 0 0 0
## 4 S01E04 "WINTER MIST" 0 0 0 0 0
## 5 S01E05 "QUIET STREAM" 0 0 0 0 0
## 6 S01E06 "WINTER MOON" 0 0 0 0 0
## BRIDGE BUILDING BUSHES CABIN CACTUS CIRCLE_FRAME CIRRUS CLIFF CLOUDS CONIFER
## 1 0 0 1 0 0 0 0 0 0 0
## 2 0 0 0 1 0 0 0 0 1 1
## 3 0 0 0 1 0 0 0 0 0 1
## 4 0 0 1 0 0 0 0 0 1 1
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 1 0 0 0 0 0 1
## CUMULUS DECIDUOUS DIANE_ANDRE DOCK DOUBLE_OVAL_FRAME FARM FENCE FIRE
## 1 0 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 1 0
## 4 0 0 0 0 0 0 0 0
## 5 0 1 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## FLORIDA_FRAME FLOWERS FOG FRAMED GRASS GUEST HALF_CIRCLE_FRAME
## 1 0 0 0 0 1 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## HALF_OVAL_FRAME HILLS LAKE LAKES LIGHTHOUSE MILL MOON MOUNTAIN MOUNTAINS
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 1 0
## 3 0 0 0 0 0 0 0 1 1
## 4 0 0 1 0 0 0 0 1 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 1 0 0 0 1 1 1
## NIGHT OCEAN OVAL_FRAME PALM_TREES PATH PERSON PORTRAIT RECTANGLE_3D_FRAME
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 1 0 0 0 0 0 0 0
## RECTANGULAR_FRAME RIVER ROCKS SEASHELL_FRAME SNOW SNOWY_MOUNTAIN SPLIT_FRAME
## 1 0 1 0 0 0 0 0
## 2 0 0 0 0 1 1 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 1 0
## 5 0 1 1 0 0 0 0
## 6 0 0 0 0 1 1 0
## STEVE_ROSS STRUCTURE SUN TOMB_FRAME TREE TREES TRIPLE_FRAME WATERFALL WAVES
## 1 0 0 0 0 1 1 0 0 0
## 2 0 0 0 0 1 1 0 0 0
## 3 0 1 1 0 1 1 0 0 0
## 4 0 0 0 0 1 1 0 0 0
## 5 0 0 0 0 1 1 0 0 0
## 6 0 1 0 0 1 1 0 0 0
## WINDMILL WINDOW_FRAME WINTER WOOD_FRAMED
## 1 0 0 0 0
## 2 0 0 1 0
## 3 0 0 1 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 1 0
It look like thematic elements within each painting is tallied. Each painting is associated to a season and an episode. Let’s add the elements to a field and count them that way, as well as add in the season number as a separate field:
names(br)[1] <- 'episode'
names(br)[2] <- 'title'
bob <- gather(br, element, count, -episode, -title) %>%
filter(count == 1) %>%
mutate(season = as.numeric(substr(episode, 2,3)),
ep_num = as.numeric(substr(episode, 5,6)),
element = str_to_title(str_replace_all(element, '_', ' '))
)
head(bob)
## episode title element count season ep_num
## 1 S17E08 "VIEW FROM THE PARK" Apple Frame 1 17 8
## 2 S08E13 "NORTHERN LIGHTS" Aurora Borealis 1 8 13
## 3 S29E12 "AURORA'S DANCE" Aurora Borealis 1 29 12
## 4 S03E11 "RUSTIC BARN" Barn 1 3 11
## 5 S05E10 "THE WINDMILL" Barn 1 5 10
## 6 S05E13 "MEADOW STREAM" Barn 1 5 13
Let’s visualize this data to make it easier to understand. Lets start by looking at the most and also least most used elements:
group_by(bob, element) %>%
summarize(total_count = sum(count)) %>%
ggplot(aes(x = reorder(element, total_count), y = total_count)) +
geom_col() +
coord_flip() +
theme_bw() +
labs(x = element_blank(), y = element_blank(), title = 'Most Popular Thematic Elements Used by Bob Ross') +
theme(text = element_text(size = 7))
The top two elements are tree and trees, which are arguably the same thing. The next two are types of trees, so its safe to say that Bob Ross really incorporated trees into his paintings. Most of these elements are found in nature.
Lets take a look at the total number of elements in each painting:
s_count <- group_by(bob, season, ep_num) %>%
summarize(total_count = sum(count))
ggplot(s_count, aes(x = total_count)) +
geom_density() +
labs(x = element_blank(), y = element_blank(), title = 'Probabilty Density Chart of Thematic Elements Per Painting') +
theme_classic()
summary(s_count$total_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 6.000 8.000 8.053 10.000 15.000
mean(s_count$total_count)
## [1] 8.0525
boxplot(s_count$total_count)
It looks like there’s usually about 8 elements in each painting, but as few as 2 and at most there have been 15 elements.
Just out of curiosity, what were the paintings with 2 and 15 elements, and which ones were they?
filter(s_count, total_count == 2 | total_count == 15) %>%
inner_join(bob, by = c('season' = 'season', 'ep_num' = 'ep_num')) %>%
group_by(episode, title) %>%
summarize(total_elements = sum(count))
## # A tibble: 5 x 3
## # Groups: episode [5]
## episode title total_elements
## <chr> <chr> <int>
## 1 S05E12 "\"INDIAN GIRL\"" 2
## 2 S07E05 "\"PORTRAIT OF SALLY\"" 2
## 3 S10E09 "\"MOUNTAIN OVAL\"" 15
## 4 S16E06 "\"CONTEMPLATIVE LADY\"" 2
## 5 S29E10 "\"POT 'O POSIES\"" 2
Not sure what the Guest element is, but it seems to be one of two elements for the paintings with only two elements.
Mountain Oval sounds like it has a lot of elements!