Intro

For this project, we’ll be taking 3 data sources referenced from recent discussion topics and analyzing them. We’ll need to tidy the datasets, then take a brief dive into what kind of story the data wants to tell us.

We’ll be using a few pages that we’ll load first:

library(tidyr)
library(scales)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(openxlsx)

1. Weather by Month from usclimatedata.com

This table takes a look at the average high, low, precipiation, and snow by month for your local area. The website determined LaGuardia Airport is closest to me, so we’re taking a look at a snapshot of that table taken on March 2, 2020. Let’s load the data and take a look:

weather <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607/master/laguardia_weather.csv', stringsAsFactors = F)
head(weather)
##                           X   Jan   Feb   Mar Apr   May   Jun  Jul   Aug   Sep
## 1        Average high in ºF 39.00 42.00 50.00  61 71.00 80.00 85.0 84.00 76.00
## 2         Average low in ºF 27.00 29.00 35.00  44 54.00 64.00 69.0 69.00 62.00
## 3 Av. precipitation in inch  3.17  2.76  3.97   4  3.79  3.94  4.5  4.12  3.73
## 4      Av. snowfall in inch  7.00  9.00  4.00   1  0.00  0.00  0.0  0.00  0.00
##     Oct   Nov   Dec
## 1 65.00 55.00 44.00
## 2 51.00 42.00 32.00
## 3  3.78  3.41  3.56
## 4  0.00  0.00  5.00

A few things we need to tidy this dataset: * We need to rename the first column to something more meaningful * It might be useful to convert the month names into numbers or a factor so that we can plot it in chronological order

names(weather)[1] <- 'measure'
w <- gather(weather, month, value, -measure)
w$month <- factor(w$month, levels = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'))
str(w)
## 'data.frame':    48 obs. of  3 variables:
##  $ measure: chr  "Average high in ºF" "Average low in ºF" "Av. precipitation in inch" "Av. snowfall in inch" ...
##  $ month  : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ value  : num  39 27 3.17 7 42 29 2.76 9 50 35 ...

Now that the data is ready, lets plot out the data and share some thoughts:

# avg high and low temps by month
filter(w, substr(measure,1,7) == 'Average') %>%
ggplot(aes(x = month, y = value, color = measure, group = measure)) +
  geom_line() +
  theme_bw() +
  labs(x = element_blank(), y = element_blank(), title = 'Average Temperature by Month, 1981 - 2010', subtitle = 'LaGuardia Airport') +
  theme(legend.title = element_blank())

filter(w, substr(measure,1,7) != 'Average') %>%
ggplot(aes(x = month, y = value, color = measure, group = measure)) +
  geom_line() +
  theme_bw() +
  labs(x = element_blank(), y = element_blank(), title = 'Average Rain and Snow by Month, 1981 - 2010', subtitle = 'LaGuardia Airport') +
  theme(legend.title = element_blank())

Not surprisingly, the hottest months are during summer. It looks like from June to mid August the temperature averages 80 or above, which is very warm. Also not a surprise, the coldest months are in winter, but its a bit harder to see the dip there because the chart above is setup as a traditional calendar, Jan - Dec.

Regarding rain and snow, there’s virtually no snow between May and November and it typically peaks in February. We haven’t had any snow this season, so I wonder how our current year compares to the climate data here. Rain is pretty even throughout the year, peaking in July and also dipping during the winter months.

w$month <- factor(w$month, levels = c('Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'))

filter(w, substr(measure,1,7) == 'Average') %>%
ggplot(aes(x = month, y = value, color = measure, group = measure)) +
  geom_line() +
  theme_bw() +
  labs(x = element_blank(), y = element_blank(), title = 'Average Temperature by Month, 1981 - 2010', subtitle = 'LaGuardia Airport') +
  theme(legend.title = element_blank()) +
  geom_hline(yintercept = 32, linetype = 2)

filter(w, substr(measure,1,7) != 'Average') %>%
ggplot(aes(x = month, y = value, color = measure, group = measure)) +
  geom_line() +
  theme_bw() +
  labs(x = element_blank(), y = element_blank(), title = 'Average Rain and Snow by Month, 1981 - 2010', subtitle = 'LaGuardia Airport') +
  theme(legend.title = element_blank())

Here’s a look at the average temperature chart with winter in the middle and a horizontal line at the freezing point. It looks like there’s danger of freezing temperatures from December to mid February. Watch out for driving on icy roads in the morning!


MBTA

Next, lets take a look at MBTA data, which details average ridership across different public transportation modes in Boston. W

mbta <- read.xlsx('https://github.com/chitrarth2018/607-Project-2/blob/master/mbta.xlsx?raw=true', startRow = 2)
head(mbta)
##   X1             mode 2007-01 2007-02  2007-03 2007-04 2007-05  2007-06 2007-07
## 1  1 All Modes by Qtr      NA      NA 1187.653      NA      NA 1245.959      NA
## 2  2             Boat   4.000   3.600   40.000   4.300   4.900    5.800   6.521
## 3  3              Bus 335.819 338.675  339.867 352.162 354.367  350.543 357.519
## 4  4    Commuter Rail 142.200 138.500  137.700 139.500 139.000  143.000 142.391
## 5  5       Heavy Rail 435.294 448.271  458.583 472.201 474.579  477.032 471.735
## 6  6       Light Rail 227.231 240.262  241.444 255.557 248.262  246.108 243.286
##   2007-08  2007-09 2007-10 2007-11  2007-12 2008-01 2008-02  2008-03 2008-04
## 1      NA 1256.571      NA      NA 1216.890      NA      NA 1253.522      NA
## 2   6.572    5.469   5.145   3.763    2.985   3.175   3.111    3.512   4.164
## 3 355.479  372.598 368.847 330.826  312.920 340.324 352.905  361.155 368.189
## 4 142.364  143.051 146.542 145.089  141.585 142.145 142.607  137.453 140.389
## 5 461.605  499.566 457.741 488.348  448.268 472.624 492.100  494.046 513.204
## 6 234.907  265.748 241.434 250.497  233.379 241.223 249.306  253.132 271.070
##   2008-05  2008-06 2008-07 2008-08  2008-09 2008-10 2008-11  2008-12 2009-01
## 1      NA 1314.821      NA      NA 1307.041      NA      NA 1232.655      NA
## 2   4.015    5.189   6.016   5.800    4.587   4.285   3.488    3.007   3.014
## 3 363.903  362.962 370.921 361.057  389.537 357.974 345.423  325.767 338.532
## 4 142.585  142.057 145.731 144.565  141.907 151.957 152.952  140.810 141.448
## 5 507.952  518.349 512.309 476.990  517.324 523.644 487.115  446.743 461.004
## 6 258.351  266.961 270.158 239.344  258.171 250.063 232.068  205.420 215.660
##   2009-02  2009-03 2009-04 2009-05  2009-06 2009-07 2009-08  2009-09 2009-10
## 1      NA 1209.792      NA      NA 1233.085      NA      NA 1230.461      NA
## 2   3.196    3.330   4.049   4.119    4.900   6.444   5.903    4.696   4.212
## 3 360.412  353.686 359.380 354.750  347.865 339.477 332.661  374.260 385.868
## 4 143.529  142.893 142.340 144.225  142.006 137.691 139.158  139.087 137.104
## 5 482.407  467.224 493.152 475.634  473.099 470.828 466.676  500.403 513.406
## 6 228.737  222.844 238.232 224.962  226.259 230.308 231.783  250.922 230.739
##   2009-11  2009-12 2010-01 2010-02  2010-03 2010-04 2010-05  2010-06 2010-07
## 1      NA 1207.845      NA      NA 1208.857      NA      NA 1244.409      NA
## 2   3.576    3.113   3.207   3.195    3.481   4.452   4.415    5.411   6.513
## 3 366.980  332.394 362.226 361.138  373.443 378.611 380.171  363.275 353.040
## 4 129.343  126.066 130.910 131.918  131.252 131.722 128.800  129.144 122.935
## 5 480.278  440.925 464.069 480.121  483.397 502.374 487.400  490.263 488.587
## 6 214.711  194.446 204.396 213.136  211.693 227.246 217.805  215.922 218.729
##   2010-08  2010-09 2010-10 2010-11  2010-12 2011-01 2011-02  2011-03 2011-04
## 1      NA 1225.527      NA      NA 1216.262      NA      NA 1223.452      NA
## 2   6.269    4.699   4.402   3.731    3.156   3.140   3.284    3.674   4.251
## 3 343.688  381.622 384.987 367.955  326.338 334.958 346.234  380.399 380.446
## 4 129.732  132.892 131.033 130.889  121.422 128.396 125.463  134.374 134.169
## 5 473.731  521.099 532.403 502.887  450.433 468.418 504.068  516.730 528.631
## 6 210.530  236.368 236.366 221.881  196.211 198.450 219.886  227.935 242.280
##   2011-05  2011-06 2011-07 2011-08  2011-09 2011-10
## 1      NA 1302.414      NA      NA 1290.549      NA
## 2   4.431    5.474   6.581   6.733    5.003   4.484
## 3 385.289  376.317 361.585 353.793  388.271 398.456
## 4 136.140  135.581 132.410 130.616  136.901 128.720
## 5 528.122  529.528 532.888 508.145  550.137 554.932
## 6 225.776  221.865 231.010 220.164  244.949 237.768

List of action items to tidy: * Remove the first column as we don’t need row numbers as a column * Take the wide month columns and bring them into their own field * Convert the month to a date datatype * Remove summary values

mbta2 <- select(mbta, -X1) %>%
  gather(month, avg_riders, -mode) %>%
  mutate(month = as.Date(str_c(month, '-01')))

#just transportation modes
t <- filter(mbta2, mode != 'TOTAL' & mode != 'All Modes by Qtr' & mode != 'Pct Chg / Yr')

#summary fields
ts <- filter(mbta2, mode == 'TOTAL' | mode == 'All Modes by Qtr' | mode == 'Pct Chg / Yr')

Okay, lets take a quick look at average ridership by mode:

ggplot(t, aes(x = month, y = avg_riders, color = mode)) +
  geom_line() +
  theme(legend.title = element_blank()) +
  labs(x = element_blank(), y = element_blank(), title = 'Average Ridership by Transportation Mode', subtitle = 'MBTA Data') +
  theme_bw()

It looks like heavy rail has the most riders, followed by bus, light rail, and then commuter rail. All other modes seem insignificant compared to the top 4. Its possible the capacity of the others is highly limited and has special considerations, like special router or special passengers.

When looking at the top 3 transportation modes, it looks like there’s some seasonality in the recent years, specifically ridership being the lowest at the end of the year, possibly due to fewer commuters during holidays. Let’s take a look at ridership by month with years split out for the top 3 modes:

filter(t, mode == 'Heavy Rail' | mode == 'Bus' | mode == 'Light Rail') %>%
  mutate(mo = month(month), yr = year(month)) %>%
  ggplot(aes(x = mo, y = avg_riders, color = factor(yr))) +
  geom_line() +
  facet_wrap(~mode, nrow = 3, scales = 'free_y') +
  scale_x_continuous(breaks = seq(1,12,1)) +
  theme(legend.title = element_blank()) +
  labs(x = element_blank(), y = element_blank(), title = 'Average Riders by Month', subtitle = 'Top 3 Transportation Modes - MBTA') +
  theme_bw()

Well, it looks like not only does ridership drop in December, it also seems to dip again in August. This could be because there’s students and other commuters on summer holidays.

Just out of curosity, lets take a look at ridership for the least popular modes:

filter(t, mode != 'Heavy Rail' & mode != 'Bus' & mode != 'Light Rail' & mode != 'Commuter Rail') %>%
  mutate(mo = month(month), yr = year(month)) %>% 
  ggplot(aes(x = mo, y = avg_riders, color = factor(yr))) +
  geom_line() +
  facet_wrap(~mode, nrow = 3, scales = 'free_y') +
  scale_x_continuous(breaks = seq(1,12,1)) +
  theme(legend.title = element_blank()) +
  labs(x = element_blank(), y = element_blank(), title = 'Average Riders by Month', subtitle = 'Top 3 Transportation Modes - MBTA') +
  theme_bw()

  • First, it looks like there was a spike in Boat ridership in March 2007, but generally boat ridership reaches its peak in summer.
  • Private Bus ridership looks to be down to a new level since the beginning of 2009.
  • RIDE has been steadily increasing every year.
  • Trackless Trolley has been constant except for the second half of 2010.
  • All of the above follow the same holiday trend where ridership is down at the end of the year.

It would be interesting to see other data points regarding time, money, and trip counts to provide further analysis.


Bob Ross

Lets take a quick look at the dataset for Bob Ross Paintings.

br <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607/master/bob_ross.csv', stringsAsFactors = F)
head(br)
##   EPISODE                 TITLE APPLE_FRAME AURORA_BOREALIS BARN BEACH BOAT
## 1  S01E01 "A WALK IN THE WOODS"           0               0    0     0    0
## 2  S01E02        "MT. MCKINLEY"           0               0    0     0    0
## 3  S01E03        "EBONY SUNSET"           0               0    0     0    0
## 4  S01E04         "WINTER MIST"           0               0    0     0    0
## 5  S01E05        "QUIET STREAM"           0               0    0     0    0
## 6  S01E06         "WINTER MOON"           0               0    0     0    0
##   BRIDGE BUILDING BUSHES CABIN CACTUS CIRCLE_FRAME CIRRUS CLIFF CLOUDS CONIFER
## 1      0        0      1     0      0            0      0     0      0       0
## 2      0        0      0     1      0            0      0     0      1       1
## 3      0        0      0     1      0            0      0     0      0       1
## 4      0        0      1     0      0            0      0     0      1       1
## 5      0        0      0     0      0            0      0     0      0       0
## 6      0        0      0     1      0            0      0     0      0       1
##   CUMULUS DECIDUOUS DIANE_ANDRE DOCK DOUBLE_OVAL_FRAME FARM FENCE FIRE
## 1       0         1           0    0                 0    0     0    0
## 2       0         0           0    0                 0    0     0    0
## 3       0         0           0    0                 0    0     1    0
## 4       0         0           0    0                 0    0     0    0
## 5       0         1           0    0                 0    0     0    0
## 6       0         0           0    0                 0    0     0    0
##   FLORIDA_FRAME FLOWERS FOG FRAMED GRASS GUEST HALF_CIRCLE_FRAME
## 1             0       0   0      0     1     0                 0
## 2             0       0   0      0     0     0                 0
## 3             0       0   0      0     0     0                 0
## 4             0       0   0      0     0     0                 0
## 5             0       0   0      0     0     0                 0
## 6             0       0   0      0     0     0                 0
##   HALF_OVAL_FRAME HILLS LAKE LAKES LIGHTHOUSE MILL MOON MOUNTAIN MOUNTAINS
## 1               0     0    0     0          0    0    0        0         0
## 2               0     0    0     0          0    0    0        1         0
## 3               0     0    0     0          0    0    0        1         1
## 4               0     0    1     0          0    0    0        1         0
## 5               0     0    0     0          0    0    0        0         0
## 6               0     0    1     0          0    0    1        1         1
##   NIGHT OCEAN OVAL_FRAME PALM_TREES PATH PERSON PORTRAIT RECTANGLE_3D_FRAME
## 1     0     0          0          0    0      0        0                  0
## 2     0     0          0          0    0      0        0                  0
## 3     0     0          0          0    0      0        0                  0
## 4     0     0          0          0    0      0        0                  0
## 5     0     0          0          0    0      0        0                  0
## 6     1     0          0          0    0      0        0                  0
##   RECTANGULAR_FRAME RIVER ROCKS SEASHELL_FRAME SNOW SNOWY_MOUNTAIN SPLIT_FRAME
## 1                 0     1     0              0    0              0           0
## 2                 0     0     0              0    1              1           0
## 3                 0     0     0              0    0              0           0
## 4                 0     0     0              0    0              1           0
## 5                 0     1     1              0    0              0           0
## 6                 0     0     0              0    1              1           0
##   STEVE_ROSS STRUCTURE SUN TOMB_FRAME TREE TREES TRIPLE_FRAME WATERFALL WAVES
## 1          0         0   0          0    1     1            0         0     0
## 2          0         0   0          0    1     1            0         0     0
## 3          0         1   1          0    1     1            0         0     0
## 4          0         0   0          0    1     1            0         0     0
## 5          0         0   0          0    1     1            0         0     0
## 6          0         1   0          0    1     1            0         0     0
##   WINDMILL WINDOW_FRAME WINTER WOOD_FRAMED
## 1        0            0      0           0
## 2        0            0      1           0
## 3        0            0      1           0
## 4        0            0      0           0
## 5        0            0      0           0
## 6        0            0      1           0

It look like thematic elements within each painting is tallied. Each painting is associated to a season and an episode. Let’s add the elements to a field and count them that way, as well as add in the season number as a separate field:

names(br)[1] <- 'episode'
names(br)[2] <- 'title'
bob <- gather(br, element, count, -episode, -title) %>%
  filter(count == 1) %>%
  mutate(season = as.numeric(substr(episode, 2,3)),
         ep_num = as.numeric(substr(episode, 5,6)),
         element = str_to_title(str_replace_all(element, '_', ' '))
         )
head(bob)
##   episode                title         element count season ep_num
## 1  S17E08 "VIEW FROM THE PARK"     Apple Frame     1     17      8
## 2  S08E13    "NORTHERN LIGHTS" Aurora Borealis     1      8     13
## 3  S29E12     "AURORA'S DANCE" Aurora Borealis     1     29     12
## 4  S03E11        "RUSTIC BARN"            Barn     1      3     11
## 5  S05E10       "THE WINDMILL"            Barn     1      5     10
## 6  S05E13      "MEADOW STREAM"            Barn     1      5     13
  • Edited the field names to avoid caps lock, just because its easier to work with.
  • Also formatted the column names to make them easier to read.

Let’s visualize this data to make it easier to understand. Lets start by looking at the most and also least most used elements:

group_by(bob, element) %>%
  summarize(total_count = sum(count)) %>%
  ggplot(aes(x = reorder(element, total_count), y = total_count)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(x = element_blank(), y = element_blank(), title = 'Most Popular Thematic Elements Used by Bob Ross') +
  theme(text = element_text(size = 7)) 

The top two elements are tree and trees, which are arguably the same thing. The next two are types of trees, so its safe to say that Bob Ross really incorporated trees into his paintings. Most of these elements are found in nature.

Lets take a look at the total number of elements in each painting:

s_count <- group_by(bob, season, ep_num) %>%
  summarize(total_count = sum(count))

ggplot(s_count, aes(x = total_count)) +
  geom_density() +
  labs(x = element_blank(), y = element_blank(), title = 'Probabilty Density Chart of Thematic Elements Per Painting') +
  theme_classic()

summary(s_count$total_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   6.000   8.000   8.053  10.000  15.000
mean(s_count$total_count)
## [1] 8.0525
boxplot(s_count$total_count)

It looks like there’s usually about 8 elements in each painting, but as few as 2 and at most there have been 15 elements.

Just out of curiosity, what were the paintings with 2 and 15 elements, and which ones were they?

filter(s_count, total_count == 2 | total_count == 15) %>%
  inner_join(bob, by = c('season' = 'season', 'ep_num' = 'ep_num')) %>%
  group_by(episode, title) %>%
  summarize(total_elements = sum(count))
## # A tibble: 5 x 3
## # Groups:   episode [5]
##   episode title                    total_elements
##   <chr>   <chr>                             <int>
## 1 S05E12  "\"INDIAN GIRL\""                     2
## 2 S07E05  "\"PORTRAIT OF SALLY\""               2
## 3 S10E09  "\"MOUNTAIN OVAL\""                  15
## 4 S16E06  "\"CONTEMPLATIVE LADY\""              2
## 5 S29E10  "\"POT 'O POSIES\""                   2

Not sure what the Guest element is, but it seems to be one of two elements for the paintings with only two elements.

Mountain Oval sounds like it has a lot of elements!