Introduction and Approach

For this project we were asked to choose three “wide” datasets, create .CSV (or alternatively databases) for them, then tidy and analyze the data. I partnered with two others from Data 607 for this project - Karim Hammoud and Jack Write. Our methodology was to each take one dataset to work on and then share the code and discussions questions / challenges with the rest of the group, the hopeful outcome being higher quality work based on feedback from all in the group.

Dataset 1: Global Food and Feed Production Analysis, by Cameron

Proposed Analysis

The key questions that were proposed to answer are:

  1. For a specific country see its current top 3 production items and whether they have changed since 1961

  2. Determine whether feed has overtaken food in any areas or food items

  3. Compare a high protein crop like a legume to an animal-based product and determine whether they have increased or decreased over time.

Data

The data set I chose to work with is the Food and Agricultural Organization (FAO) data on worldwide food (for human consumption and feed (for animal consumption) production and distribution, as hosted on Kaggle. It is a very wide data set with 63 columns in total.

Formal citation for data set:

Oppenheim, R. (2017; November). Who eats the food we grow?, Version 7. Retrieved 30 September 2020 from https://www.kaggle.com/dorbicycle/world-foodfeed-production/version/7.

# Load required packages
library(tidyverse)
library(gridExtra)

# Load data from Github
rawdata <- read.csv("https://raw.githubusercontent.com/cwestsmith/cuny-msds/master/datasets/FAO.csv")

# Quick look at the data to make sure it loaded correctly.  It consists of 63 columns and 21,477 rows at this point.
glimpse(rawdata)
## Rows: 21,477
## Columns: 63
## $ Area.Abbreviation <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", ...
## $ Area.Code         <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Area              <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afg...
## $ Item.Code         <int> 2511, 2805, 2513, 2513, 2514, 2514, 2517, 2520, 2...
## $ Item              <chr> "Wheat and products", "Rice (Milled Equivalent)",...
## $ Element.Code      <int> 5142, 5142, 5521, 5142, 5521, 5142, 5142, 5142, 5...
## $ Element           <chr> "Food", "Food", "Feed", "Food", "Feed", "Food", "...
## $ Unit              <chr> "1000 tonnes", "1000 tonnes", "1000 tonnes", "100...
## $ latitude          <dbl> 33.94, 33.94, 33.94, 33.94, 33.94, 33.94, 33.94, ...
## $ longitude         <dbl> 67.71, 67.71, 67.71, 67.71, 67.71, 67.71, 67.71, ...
## $ Y1961             <int> 1928, 183, 76, 237, 210, 403, 17, 0, 111, 45, 0, ...
## $ Y1962             <int> 1904, 183, 76, 237, 210, 403, 18, 0, 97, 45, 0, 4...
## $ Y1963             <int> 1666, 182, 76, 237, 214, 410, 19, 0, 103, 45, 0, ...
## $ Y1964             <int> 1950, 220, 76, 238, 216, 415, 20, 0, 110, 45, 0, ...
## $ Y1965             <int> 2001, 220, 76, 238, 216, 415, 21, 0, 113, 31, 0, ...
## $ Y1966             <int> 1808, 195, 75, 237, 216, 413, 22, 0, 117, 14, 16,...
## $ Y1967             <int> 2053, 231, 71, 225, 235, 454, 23, 0, 128, 19, 23,...
## $ Y1968             <int> 2045, 235, 72, 227, 232, 448, 24, 0, 130, 30, 31,...
## $ Y1969             <int> 2154, 238, 73, 230, 236, 455, 25, 0, 134, 34, 28,...
## $ Y1970             <int> 1819, 213, 74, 234, 200, 383, 26, 0, 125, 15, 9, ...
## $ Y1971             <int> 1963, 205, 71, 223, 201, 386, 26, 0, 147, 0, 13, ...
## $ Y1972             <int> 2215, 233, 70, 219, 216, 416, 27, 0, 138, 0, 13, ...
## $ Y1973             <int> 2310, 246, 72, 225, 228, 439, 27, 0, 143, 28, 6, ...
## $ Y1974             <int> 2335, 246, 76, 240, 231, 445, 28, 0, 160, 32, 0, ...
## $ Y1975             <int> 2434, 255, 77, 244, 234, 451, 29, 0, 169, 20, 0, ...
## $ Y1976             <int> 2512, 263, 80, 255, 240, 463, 37, 0, 324, 28, 10,...
## $ Y1977             <int> 2282, 235, 60, 185, 228, 439, 32, 0, 176, 24, 16,...
## $ Y1978             <int> 2454, 254, 65, 203, 234, 451, 33, 0, 225, 24, 13,...
## $ Y1979             <int> 2443, 270, 64, 198, 228, 440, 31, 0, 232, 34, 6, ...
## $ Y1980             <int> 2129, 259, 64, 202, 226, 437, 31, 0, 240, 61, 15,...
## $ Y1981             <int> 2133, 248, 60, 189, 210, 407, 29, 0, 247, 50, 0, ...
## $ Y1982             <int> 2068, 217, 55, 174, 199, 384, 27, 0, 248, 43, 0, ...
## $ Y1983             <int> 1994, 217, 53, 167, 192, 371, 28, 0, 242, 38, 0, ...
## $ Y1984             <int> 1851, 197, 51, 160, 182, 353, 26, 0, 235, 46, 0, ...
## $ Y1985             <int> 1791, 186, 48, 151, 173, 334, 25, 0, 226, 23, 0, ...
## $ Y1986             <int> 1683, 200, 46, 145, 170, 330, 23, 0, 217, 25, 0, ...
## $ Y1987             <int> 2194, 193, 46, 145, 154, 298, 23, 0, 196, 3, 0, 8...
## $ Y1988             <int> 1801, 202, 47, 148, 148, 287, 23, 0, 198, 45, 0, ...
## $ Y1989             <int> 1754, 191, 46, 145, 137, 265, 23, 0, 184, 54, 0, ...
## $ Y1990             <int> 1640, 199, 43, 135, 144, 279, 24, 0, 205, 47, 0, ...
## $ Y1991             <int> 1539, 197, 43, 132, 126, 245, 24, 0, 203, 29, 0, ...
## $ Y1992             <int> 1582, 249, 40, 120, 90, 170, 18, 0, 210, 29, 0, 5...
## $ Y1993             <int> 1840, 218, 50, 155, 141, 272, 22, 0, 210, 29, 0, ...
## $ Y1994             <int> 1855, 260, 46, 143, 150, 289, 20, 0, 211, 29, 0, ...
## $ Y1995             <int> 1853, 319, 41, 125, 159, 310, 21, 0, 212, 29, 0, ...
## $ Y1996             <int> 2177, 254, 44, 138, 108, 209, 17, 0, 213, 29, 0, ...
## $ Y1997             <int> 2343, 326, 50, 159, 90, 173, 20, 0, 214, 28, 0, 5...
## $ Y1998             <int> 2407, 347, 48, 154, 99, 192, 21, 0, 214, 28, 0, 5...
## $ Y1999             <int> 2463, 270, 43, 141, 72, 141, 17, 0, 217, 28, 0, 6...
## $ Y2000             <int> 2600, 372, 26, 84, 35, 66, 20, 0, 219, 29, 0, 61,...
## $ Y2001             <int> 2668, 411, 29, 83, 48, 93, 20, 0, 215, 29, 0, 61,...
## $ Y2002             <int> 2776, 448, 70, 122, 89, 170, 18, 0, 217, 29, 0, 7...
## $ Y2003             <int> 3095, 460, 48, 144, 63, 117, 16, 1, 347, 51, 0, 9...
## $ Y2004             <int> 3249, 419, 58, 185, 120, 231, 15, 2, 276, 50, 0, ...
## $ Y2005             <int> 3486, 445, 236, 43, 208, 67, 21, 1, 294, 29, 0, 1...
## $ Y2006             <int> 3704, 546, 262, 44, 233, 82, 11, 1, 294, 61, 0, 1...
## $ Y2007             <int> 4164, 455, 263, 48, 249, 67, 19, 0, 260, 65, 0, 1...
## $ Y2008             <int> 4252, 490, 230, 62, 247, 69, 21, 0, 242, 54, 0, 2...
## $ Y2009             <int> 4538, 415, 379, 55, 195, 71, 18, 0, 250, 114, 0, ...
## $ Y2010             <int> 4605, 442, 315, 60, 178, 82, 14, 0, 192, 83, 0, 2...
## $ Y2011             <int> 4711, 476, 203, 72, 191, 73, 14, 0, 169, 83, 0, 2...
## $ Y2012             <int> 4810, 425, 367, 78, 200, 77, 14, 0, 196, 69, 0, 2...
## $ Y2013             <int> 4895, 422, 360, 89, 200, 76, 12, 0, 230, 81, 0, 2...

Tidy and Transform

To enhance our ability to analyze the data within the Tidyverse we need to do some tidying and transforming. This increases the number of rows from 21,477 to 1,138,281, so the benefit of enhanced analysis comes at a cost of CPU performance.

# Combine the year columns into key-pair columns indicating year and units.
longdata_cs <- rawdata %>% 
  pivot_longer(Y1961:Y2013, names_to = "Year", values_to = "Total_Units")

# Pivot the food and feed columns for a wider format.
longdata_cs <- longdata_cs %>% 
  pivot_wider(names_from = Element, values_from = Total_Units, values_fill = 0)

# Remove the Y in from the values in the year column and convert to number.
longdata_cs <- longdata_cs %>% mutate(Year=as.numeric(gsub("Y","", Year)))

# Confirm everything looks ok
str(longdata_cs)
## tibble [1,138,281 x 12] (S3: tbl_df/tbl/data.frame)
##  $ Area.Abbreviation: chr [1:1138281] "AFG" "AFG" "AFG" "AFG" ...
##  $ Area.Code        : int [1:1138281] 2 2 2 2 2 2 2 2 2 2 ...
##  $ Area             : chr [1:1138281] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Item.Code        : int [1:1138281] 2511 2511 2511 2511 2511 2511 2511 2511 2511 2511 ...
##  $ Item             : chr [1:1138281] "Wheat and products" "Wheat and products" "Wheat and products" "Wheat and products" ...
##  $ Element.Code     : int [1:1138281] 5142 5142 5142 5142 5142 5142 5142 5142 5142 5142 ...
##  $ Unit             : chr [1:1138281] "1000 tonnes" "1000 tonnes" "1000 tonnes" "1000 tonnes" ...
##  $ latitude         : num [1:1138281] 33.9 33.9 33.9 33.9 33.9 ...
##  $ longitude        : num [1:1138281] 67.7 67.7 67.7 67.7 67.7 ...
##  $ Year             : num [1:1138281] 1961 1962 1963 1964 1965 ...
##  $ Food             : int [1:1138281] 1928 1904 1666 1950 2001 1808 2053 2045 2154 1819 ...
##  $ Feed             : int [1:1138281] 0 0 0 0 0 0 0 0 0 0 ...
# Create new data frame including only the columns required for analysis and
# use group_by to consolidate similar rows
analysisdata <- longdata_cs %>% 
  select(Area, Year, Item, Feed, Food) %>%
  group_by(Area, Year, Item) %>%
  summarize(Feed_Units=sum(Feed), Food_Units=sum(Food), Total_Units=sum(Feed+Food))

# Make sure everything looks as intended 
head(analysisdata, 10)
## # A tibble: 10 x 6
## # Groups:   Area, Year [1]
##    Area         Year Item                     Feed_Units Food_Units Total_Units
##    <chr>       <dbl> <chr>                         <int>      <int>       <int>
##  1 Afghanistan  1961 Alcoholic Beverages               0          0           0
##  2 Afghanistan  1961 Animal fats                       0         20          20
##  3 Afghanistan  1961 Apples and products               0         14          14
##  4 Afghanistan  1961 Bananas                           0          0           0
##  5 Afghanistan  1961 Barley and products              76        237         313
##  6 Afghanistan  1961 Beer                              0          0           0
##  7 Afghanistan  1961 Beverages, Alcoholic              0          0           0
##  8 Afghanistan  1961 Bovine Meat                       0         43          43
##  9 Afghanistan  1961 Butter, Ghee                      0         11          11
## 10 Afghanistan  1961 Cereals - Excluding Beer        286       2767        3053

Analysis

Now that the data is tidied it can be analyzed. We can see here that in 1961 the top producers were U.S., China, and India (in that order), but in 2013 the order has shifted: China, India, and U.S.

# View top countries by Total_Units by earliest year (1961)
analysisdata %>% ungroup() %>%
  filter(Year == min(Year)) %>%
  select(Year, Area, Total_Units) %>%
  group_by(Year, Area) %>%
  summarize(Total_Units = sum(Total_Units)) %>%
  arrange(desc(Total_Units))
## # A tibble: 174 x 3
## # Groups:   Year [1]
##     Year Area                     Total_Units
##    <dbl> <chr>                          <int>
##  1  1961 United States of America      559347
##  2  1961 China, mainland               477279
##  3  1961 India                         309565
##  4  1961 Germany                       212839
##  5  1961 France                        138385
##  6  1961 Poland                        130442
##  7  1961 Brazil                        126501
##  8  1961 United Kingdom                117090
##  9  1961 Japan                         112141
## 10  1961 Italy                         105534
## # ... with 164 more rows
# View top countries by Total_Units by latest year (2013)
analysisdata %>% ungroup() %>%
  filter(Year == max(Year)) %>%
  select(Year, Area, Total_Units) %>%
  group_by(Year, Area) %>%
  summarize(Total_Units = sum(Total_Units)) %>%
  arrange(desc(Total_Units))
## # A tibble: 174 x 3
## # Groups:   Year [1]
##     Year Area                     Total_Units
##    <dbl> <chr>                          <int>
##  1  2013 China, mainland              3191155
##  2  2013 India                        1336593
##  3  2013 United States of America      938639
##  4  2013 Brazil                        439967
##  5  2013 Russian Federation            370913
##  6  2013 Nigeria                       300923
##  7  2013 Indonesia                     256888
##  8  2013 Germany                       218891
##  9  2013 Mexico                        210121
## 10  2013 Pakistan                      190505
## # ... with 164 more rows

Question 1

For a specific country see its current top 3 production items and whether they have changed since 1961

We can see that yes the items have changed from 1961 to 2013 (the latest year of data).

In 1961 the items were (1) Milk - Excluding Butter; (2) Cereals - Excluding Beer; and (3) Starchy Roots.

In 2013 the items were (1) Milk - Excluding Butter; (2) Cereals - Excluding Beer; and (3) Wheat and products.

Interestingly starchy roots is not even in the top 10 in 2013. We can see in the chart that it has steadily been decreasing over time.

# Top 10 Units in 1961
analysisdata %>% filter(Area == "Switzerland", Total_Units > 0, Year == 1961) %>%
  group_by(Year, Area, Item) %>%
  summarize(Total_Units = sum(Total_Units)) %>%
  slice_max(Total_Units, n=10)
## # A tibble: 10 x 4
## # Groups:   Year, Area [1]
##     Year Area        Item                     Total_Units
##    <dbl> <chr>       <chr>                          <int>
##  1  1961 Switzerland Milk - Excluding Butter         5484
##  2  1961 Switzerland Cereals - Excluding Beer        1341
##  3  1961 Switzerland Starchy Roots                   1021
##  4  1961 Switzerland Potatoes and products           1009
##  5  1961 Switzerland Fruits - Excluding Wine          746
##  6  1961 Switzerland Wheat and products               673
##  7  1961 Switzerland Alcoholic Beverages              661
##  8  1961 Switzerland Fruits, Other                    405
##  9  1961 Switzerland Vegetables                       400
## 10  1961 Switzerland Beer                             385
# Top  Units in 2013
analysisdata %>% filter(Area == "Switzerland", Total_Units > 0, Year == 2013) %>%
  group_by(Year, Area, Item) %>%
  summarize(Total_Units = sum(Total_Units)) %>%
  slice_max(Total_Units, n=10)
## # A tibble: 10 x 4
## # Groups:   Year, Area [1]
##     Year Area        Item                     Total_Units
##    <dbl> <chr>       <chr>                          <int>
##  1  2013 Switzerland Milk - Excluding Butter         6780
##  2  2013 Switzerland Cereals - Excluding Beer        1706
##  3  2013 Switzerland Wheat and products               983
##  4  2013 Switzerland Vegetables                       880
##  5  2013 Switzerland Fruits - Excluding Wine          837
##  6  2013 Switzerland Alcoholic Beverages              790
##  7  2013 Switzerland Vegetables, Other                649
##  8  2013 Switzerland Meat                             584
##  9  2013 Switzerland Sugar & Sweeteners               488
## 10  2013 Switzerland Beer                             454
# What happened to starchy roots in Switzerland?
analysisdata %>% filter(Area == "Switzerland", Item == "Starchy Roots", Total_Units > 0) %>%
  ggplot(aes(x=Year, y=Total_Units)) +
  geom_bar(stat="identity") +
  ggtitle("Starchy Roots Production in Switzerland - 1961 to 2013")

Question 2

Determine whether feed has overtaken food in any areas or food items

In 2013 there were 627 items with a feed to food ratio of greater than one, meaning that more food was produced for animal consumption than human consumption for those products. By comparison, in 1961 there were 274 products.

# Calculate and summarize ratio for 2013
analysisdata %>% ungroup %>% filter(Year==2013) %>% 
  select(Item, Feed_Units, Food_Units) %>%
  mutate(Feed_To_Food_Ratio = Feed_Units / Food_Units) %>%
  filter(Food_Units > 0, Feed_Units > 0, Feed_To_Food_Ratio > 1) %>%
  arrange(desc(Feed_To_Food_Ratio))
## # A tibble: 627 x 4
##    Item                       Feed_Units Food_Units Feed_To_Food_Ratio
##    <chr>                           <int>      <int>              <dbl>
##  1 Rape and Mustardseed             3715          1              3715 
##  2 Maize and products               2099          3               700.
##  3 Pulses, Other and products        532          1               532 
##  4 Soyabeans                         959          2               480.
##  5 Barley and products               475          1               475 
##  6 Barley and products              8000         17               471.
##  7 Barley and products              1380          3               460 
##  8 Barley and products              7313         16               457.
##  9 Soyabeans                         444          1               444 
## 10 Barley and products              6598         25               264.
## # ... with 617 more rows
# Calculate and summarize ratio for 1961

analysisdata %>% ungroup %>% filter(Year==1961) %>% 
  select(Item, Feed_Units, Food_Units) %>%
  mutate(Feed_To_Food_Ratio = Feed_Units / Food_Units) %>%
  filter(Food_Units > 0, Feed_Units > 0, Feed_To_Food_Ratio > 1) %>%
  arrange(desc(Feed_To_Food_Ratio))
## # A tibble: 274 x 4
##    Item                Feed_Units Food_Units Feed_To_Food_Ratio
##    <chr>                    <int>      <int>              <dbl>
##  1 Cereals, Other            1202          1              1202 
##  2 Barley and products       2616          3               872 
##  3 Oats                       584          1               584 
##  4 Oats                       485          1               485 
##  5 Oats                      2366          5               473.
##  6 Barley and products       4070         15               271.
##  7 Barley and products       2603         11               237.
##  8 Barley and products       2009          9               223.
##  9 Barley and products       5000         28               179.
## 10 Cereals, Other             430          3               143.
## # ... with 264 more rows

Question 3

Compare a high protein crop like a legume to an animal-based product and determine whether they have increased or decreased over time.

Both soyabean and bovine meat production have increased significantly since 1961, with soyabean production increasing by a factor of 6.3 and bovine meat by 2.63.

bovine_soya_combined <- analysisdata %>% ungroup %>%
  filter(Item=="Soyabeans" | Item=="Bovine Meat", Total_Units > 0) %>%
  select(Year, Item, Total_Units) %>%
  group_by(Year, Item) %>%
  summarize(Total_Units = sum(Total_Units))

soyabean1961 <- bovine_soya_combined %>% 
  filter(Item == "Soyabeans", Year == 1961) %>% 
  select(Total_Units)
soyabean2013 <- bovine_soya_combined %>% 
  filter(Item == "Soyabeans", Year == 2013) %>% 
  select(Total_Units)

bovine1961 <- bovine_soya_combined %>% 
  filter(Item == "Bovine Meat", Year == 1961) %>% 
  select(Total_Units)
bovine2013 <- bovine_soya_combined %>% 
  filter(Item == "Bovine Meat", Year == 2013) %>% 
  select(Total_Units)

soyabeanchange <- soyabean2013$Total_Units / soyabean1961$Total_Units
bovinechange <- bovine2013$Total_Units / bovine1961$Total_Units

chart1 <- bovine_soya_combined %>% 
  filter(Item == "Soyabeans") %>%
  ggplot(aes(x = Year, y = Total_Units)) +
  geom_bar(stat="identity") +
  ggtitle("Global Soyabeans Production - 1961 to 2013") + 
  theme(text = element_text(size=8))

chart2 <- bovine_soya_combined %>% 
  ggplot(aes(x = Year, y = Total_Units, fill = Item)) +
  geom_bar(position = "dodge", stat="identity") +
  ggtitle("Global Bovine Meat vs Soyabeans Production - 1961 to 2013") + 
  theme(text = element_text(size=8))

cat("Since 1961 Soyabean production has increased by a factor of", round(soyabeanchange, 2), "\n")
## Since 1961 Soyabean production has increased by a factor of 6.3
cat("Since 1961 Bovine meat production has increased by a factor of", round(bovinechange, 2), "\n")
## Since 1961 Bovine meat production has increased by a factor of 2.63
grid.arrange(arrangeGrob(chart1, chart2), nrow = 1)

Conclusion

In conclusion, production has changed dramatically since 1961 from a quantity (# of units produced), type (food versus feed), and area (country of production) perspective. There was a significant spike in production during the early 1990’s. Detailed summary below.

summaryinfo <- analysisdata %>% 
  ungroup() %>%
  filter(Year == 1961 | Year == 2013) %>%
  select(Year, Feed_Units, Food_Units, Total_Units) %>%
  group_by(Year) %>%
  summarize(Feed_Units = sum(Feed_Units), Food_Units = sum(Food_Units), Total_Units = sum(Total_Units))

feed_increase <- as.numeric(summaryinfo[2,2] / summaryinfo[1,2])
food_increase <- as.numeric(summaryinfo[2,3] / summaryinfo[1,3])
total_increase <- as.numeric(summaryinfo[2,4] / summaryinfo[1,4])

cat("Feed production has increased by a factor of", round(feed_increase,2), "since 1961\n")
## Feed production has increased by a factor of NA since 1961
cat("Food production has increased by a factor of", round(food_increase,2), "since 1961\n")
## Food production has increased by a factor of NA since 1961
cat("Total production has increased by a factor of", round(total_increase,2), "since 1961\n")
## Total production has increased by a factor of NA since 1961

Dataset 2: NE Collections and Expenses, by Kareem

Proposed Analysis

In this project we will use the data of the north east region different collections of revenue and spending in 2016, the budget and finances of nine states in the north east with the total of US collection of revenue from that year.

The following terms are used to describe a state’s finances:

Revenues come mainly from tax collections, licensing fees, federal aid, and returns on investments.

Expenditures generally include spending on government salaries, infrastructure, education, public pensions, public assistance, corrections, Medicaid, and transportation.

State funds include general and other state-based funds.

Federal funds are "funds received directly from the federal government.

Total Expense is calculated by adding together the totals for state and federal funds used for expenditures.

The analysis: 1 - Analyze the total collection revenues and Expenses for NE States. 2 - Compare the different collection revenues. 3 - Analyze NE states per capita collection and per capita spending.

Data

Load data from Github

Formal citation for data set:

Ballotpedia. New York state budget and finances. Retrieved 1 October 2020 from https://ballotpedia.org/New_York_state_budget_and_finances.

url <- "https://raw.githubusercontent.com/cwestsmith/cuny-msds/master/datasets/NE_tax_and_spending.csv"

NE_data <- read.csv(url, sep = ",")
head(NE_data)
##           State Property_taxes Sales_and_gross_receipts Licenses Income_taxes
## 1      New York             NA                 24790017  1794987     50690443
## 2   Connecticut             NA                  6149782   455454      8276620
## 3    New Jersey           4638                 13173332  1499889     15585479
## 4  Pennsylvania          43124                 19284374  2159170     14388463
## 5 New Hampshire         406394                   982832   329549       788210
## 6         Maine          35425                  2077913   272253      1689129
##   Other_taxes Federal_aid State_funds Federal_funds X2016_population
## 1     4078516    48168577      101232         49476         19745289
## 2      363091     6349942       24964          6122          3576452
## 3     1283382    15438821       42398         17440          8944469
## 4     1519458    21647328       50858         27153         12784227
## 5      134961     1658713        3623          2162          1334795
## 6       55522     2997132        5519          2536          1331479

Tidy

Calculate the total collection which includes all different types of taxes as well as total expenses for state and federal funds.

# Replace NA with 0, 
# Calculate the total collection per state
# Calculating the total expense per state
wide_data <- NE_data %>%
 replace(is.na(.), 0) %>%
 mutate(Total_collections = Property_taxes+Sales_and_gross_receipts+Licenses+Income_taxes+Other_taxes) %>%
  mutate(Total_Expense = State_funds + Federal_funds)

head(wide_data)
##           State Property_taxes Sales_and_gross_receipts Licenses Income_taxes
## 1      New York              0                 24790017  1794987     50690443
## 2   Connecticut              0                  6149782   455454      8276620
## 3    New Jersey           4638                 13173332  1499889     15585479
## 4  Pennsylvania          43124                 19284374  2159170     14388463
## 5 New Hampshire         406394                   982832   329549       788210
## 6         Maine          35425                  2077913   272253      1689129
##   Other_taxes Federal_aid State_funds Federal_funds X2016_population
## 1     4078516    48168577      101232         49476         19745289
## 2      363091     6349942       24964          6122          3576452
## 3     1283382    15438821       42398         17440          8944469
## 4     1519458    21647328       50858         27153         12784227
## 5      134961     1658713        3623          2162          1334795
## 6       55522     2997132        5519          2536          1331479
##   Total_collections Total_Expense
## 1          81353963        150708
## 2          15244947         31086
## 3          31546720         59838
## 4          37394589         78011
## 5           2641946          5785
## 6           4130242          8055

Calculate

Calculate the per capita for collection and expenses per state

wide_data <- wide_data %>%
 mutate(Per_capita_collection = round((Total_collections / X2016_population)*1000)) %>% 
 mutate(Per_capita_expense = round((Total_Expense / X2016_population)*1000000))

head(wide_data)
##           State Property_taxes Sales_and_gross_receipts Licenses Income_taxes
## 1      New York              0                 24790017  1794987     50690443
## 2   Connecticut              0                  6149782   455454      8276620
## 3    New Jersey           4638                 13173332  1499889     15585479
## 4  Pennsylvania          43124                 19284374  2159170     14388463
## 5 New Hampshire         406394                   982832   329549       788210
## 6         Maine          35425                  2077913   272253      1689129
##   Other_taxes Federal_aid State_funds Federal_funds X2016_population
## 1     4078516    48168577      101232         49476         19745289
## 2      363091     6349942       24964          6122          3576452
## 3     1283382    15438821       42398         17440          8944469
## 4     1519458    21647328       50858         27153         12784227
## 5      134961     1658713        3623          2162          1334795
## 6       55522     2997132        5519          2536          1331479
##   Total_collections Total_Expense Per_capita_collection Per_capita_expense
## 1          81353963        150708                  4120               7633
## 2          15244947         31086                  4263               8692
## 3          31546720         59838                  3527               6690
## 4          37394589         78011                  2925               6102
## 5           2641946          5785                  1979               4334
## 6           4130242          8055                  3102               6050

Transform

Transform the data from wide to long

# Transform and arrange by states
long_data <- wide_data %>%
 rename(Sales_receipts = Sales_and_gross_receipts, population = X2016_population) %>%
 gather(Details, "Total", 2:14) %>%
 arrange(State)

head(long_data)
##         State        Details   Total
## 1 Connecticut Property_taxes       0
## 2 Connecticut Sales_receipts 6149782
## 3 Connecticut       Licenses  455454
## 4 Connecticut   Income_taxes 8276620
## 5 Connecticut    Other_taxes  363091
## 6 Connecticut    Federal_aid 6349942

Export

Export data into a .csv file

# Get working directory path
path <- getwd()

# Export file to working directory.  The file.path function has been used to ensure platform independence (i.e. take into account the different path syntaxes for various operating systems)
write.csv(long_data, file.path(path, "project2_dataset2.csv"))

Analysis

Filtering only the Collection streams for NE States.

# Selecting the target of the filter which are the collections
target <- c("Property_taxes" , "Sales_receipts", "Licenses", "Income_taxes", "Other_taxes")

# Apply the filters and filter out the US collection which is reletivly high,
Revenue <- filter(long_data, Details %in% target & !(State == "United States"))

head(Revenue)
##         State        Details   Total
## 1 Connecticut Property_taxes       0
## 2 Connecticut Sales_receipts 6149782
## 3 Connecticut       Licenses  455454
## 4 Connecticut   Income_taxes 8276620
## 5 Connecticut    Other_taxes  363091
## 6       Maine Property_taxes   35425

Plot to compare the different Collections streams per State

Revenue %>% 
  ggplot(aes(fill=Details,y=Total,x=reorder(State,-Total))) + 
  geom_bar(position="dodge",stat="identity") + 
  ylab("Total Collections per Thousands")+ xlab("North East States") + 
  theme(axis.text.x = element_text(angle=90))

Filtering the Per_capita_collection and Per_capita_expense for NE States.

# selecting the target for filter which is per capita
target <- c("Per_capita_collection" , "Per_capita_expense")

# Filter by per capita collection and per capita expense
long_data <- filter(long_data, Details %in% target)

head(long_data)
##           State               Details Total
## 1   Connecticut Per_capita_collection  4263
## 2   Connecticut    Per_capita_expense  8692
## 3         Maine Per_capita_collection  3102
## 4         Maine    Per_capita_expense  6050
## 5 Massachusetts Per_capita_collection  4005
## 6 Massachusetts    Per_capita_expense  8867

Plot for Per Capita Collection and Expense for NE States.

long_data %>% 
  ggplot(aes(fill=Details,y=Total,x=reorder(State,-Total))) + 
  geom_bar(position="dodge",stat="identity") + 
  ylab("Per Capita Totals")+ xlab("North East States") + 
  theme(axis.text.x = element_text(angle=90))

Conclusion

Based on the above analysis we can see that New York has the highest overall state revenue, expense collections, income taxes and sales receipts, but Vermont has the highest per capita collections and Rhode Island has the highest per capita expenses. All of the states in the North East region have higher per capita expenses than the US average, and within that region only New Hampshire fell below the US average for per capita collection.

Dataset 3: Movies on Streaming Platforms, by Jack

Approach

Orli asked the question, which streaming service gives the best bang for your buck? For this assignment I analyzed the distribution of ratings and defined a “good movie” as one standard deviation above the mean.

Data

Load data from Github

Formal citation for data set:

Bhatia, Ruchi. (2020; May). Movies on Netflix, Primve Video, Hulu and Disney +, Version 2. Retrieved 3 October 2020 from https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney.

url <- "https://raw.githubusercontent.com/cwestsmith/cuny-msds/master/datasets/MoviesOnStreamingPlatforms_updated.csv"

dat <- read.csv(url)

Tidy

In order for this data to be tidy, every column should be a variable and every row should be an observation. The rows are each a movie (which is our observation) but columns such as, genre, language and director each contain multiple variables.

Age could also be a useful variable to analyze if it was transformed into an int and the variable changed to minimum age (min.age)

The analysis requested by the poster wanted to know the intersection between ranking score and provider, so we will need to change the IMDb and Rotten.Tomatoes rankings so they can be compared.

There are also extraneous rows that will need to be dealt with after our data is tidied.

glimpse(dat)
## Rows: 16,744
## Columns: 17
## $ X               <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ ID              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ Title           <chr> "Inception", "The Matrix", "Avengers: Infinity War"...
## $ Year            <int> 2010, 1999, 2018, 1985, 1966, 2018, 2002, 2012, 198...
## $ Age             <chr> "13+", "18+", "13+", "7+", "18+", "7+", "18+", "18+...
## $ IMDb            <dbl> 8.8, 8.7, 8.5, 8.5, 8.8, 8.4, 8.5, 8.4, 8.4, 8.3, 8...
## $ Rotten.Tomatoes <chr> "87%", "87%", "84%", "96%", "97%", "97%", "95%", "8...
## $ Netflix         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Hulu            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Prime.Video     <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, ...
## $ Disney.         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Type            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Directors       <chr> "Christopher Nolan", "Lana Wachowski,Lilly Wachowsk...
## $ Genres          <chr> "Action,Adventure,Sci-Fi,Thriller", "Action,Sci-Fi"...
## $ Country         <chr> "United States,United Kingdom", "United States", "U...
## $ Language        <chr> "English,Japanese,French", "English", "English", "E...
## $ Runtime         <int> 148, 136, 149, 116, 161, 117, 150, 165, 115, 153, 1...

Ratings:

-rotten.tomatoes and IMDb to percentage -get the percent symbol out and normalize

dat<-dat%>%
  mutate(Rotten.Tomatoes=as.numeric(gsub("%","",Rotten.Tomatoes)))

# Rotten tomatoes as percentage
dat$Rotten.Tomatoes<-dat$Rotten.Tomatoes/100

# IMDB to percentage
dat$IMDb<-dat$IMDb/10

dat<-transform(dat,IMDb=as.numeric(IMDb),
          Rotten.Tomatoes=as.numeric(Rotten.Tomatoes))

Age:

-change age to min.age and remove the + sign, change to int

dat<-dat%>%
  mutate(Age=as.numeric(gsub("\\+","",Age)))

# Replace "all" with 0
dat<-dat%>%
  mutate(Age=as.numeric(gsub("all",0,Age)))

Disney is named “Disney.”, ill remove the period

dat<-dat%>%
  rename(Disney=Disney.)

Remove type and X columns

dat<-dat%>%
  select(-X,-Type)

Genres:

Note: This was not critical for the analysis, but has been included as an interesting challenge to tackle.

-turn into variables for each movie type

-remove original genres column

Step 1:

get a list of unique Genre types

pattern<-"[A-z]+"
list<-dat$Genres%>%
  unlist()%>%
  str_extract_all(pattern)%>%
  unique()
unique_genres<-list%>%
  unlist()%>%
  unique()
unique_genres
##  [1] "Action"      "Adventure"   "Sci"         "Fi"          "Thriller"   
##  [6] "Comedy"      "Western"     "Animation"   "Family"      "Biography"  
## [11] "Drama"       "Music"       "War"         "Crime"       "Fantasy"    
## [16] "Romance"     "History"     "Mystery"     "Horror"      "Sport"      
## [21] "Documentary" "Musical"     "News"        "Short"       "Reality"    
## [26] "TV"          "Talk"        "Show"        "Game"        "Film"       
## [31] "Noir"

Step 2:

Use the list to create new variables and fill them when the “Genres” column contains the variable.

dat1<-dat
N<-length(unique_genres)

# Create regex pattern for the list of unique genres
pat<-paste(unique_genres,collapse="|")

# Loop to create variables for each movie type
for(i in 1:N){
  # Load variable with genre of interest
  var<-unique_genres[i]

  dat1<-dat1%>%
    mutate( !!var := str_extract(Genres,!!var))
}

# Change value from name of the genre to binary
dat2 <- data.frame(
  lapply(dat1,function(x){
    gsub(pat,1,x)
  }))

# Remove extraneous columns
dat2 <- dat2%>%
  select(-Genres,-Fi,-TV,-Show,-Film)

dat2 %>%
  select(Action:Noir) %>%
  head()
##   Action Adventure  Sci Thriller Comedy Western Animation Family Biography
## 1      1         1    1        1   <NA>    <NA>      <NA>   <NA>      <NA>
## 2      1      <NA>    1     <NA>   <NA>    <NA>      <NA>   <NA>      <NA>
## 3      1         1    1     <NA>   <NA>    <NA>      <NA>   <NA>      <NA>
## 4   <NA>         1    1     <NA>      1    <NA>      <NA>   <NA>      <NA>
## 5   <NA>      <NA> <NA>     <NA>   <NA>       1      <NA>   <NA>      <NA>
## 6      1         1    1     <NA>   <NA>    <NA>         1      1      <NA>
##   Drama Music  War Crime Fantasy Romance History Mystery Horror Sport
## 1  <NA>  <NA> <NA>  <NA>    <NA>    <NA>    <NA>    <NA>   <NA>  <NA>
## 2  <NA>  <NA> <NA>  <NA>    <NA>    <NA>    <NA>    <NA>   <NA>  <NA>
## 3  <NA>  <NA> <NA>  <NA>    <NA>    <NA>    <NA>    <NA>   <NA>  <NA>
## 4  <NA>  <NA> <NA>  <NA>    <NA>    <NA>    <NA>    <NA>   <NA>  <NA>
## 5  <NA>  <NA> <NA>  <NA>    <NA>    <NA>    <NA>    <NA>   <NA>  <NA>
## 6  <NA>  <NA> <NA>  <NA>    <NA>    <NA>    <NA>    <NA>   <NA>  <NA>
##   Documentary Musical News Short Reality Talk Game Noir
## 1        <NA>    <NA> <NA>  <NA>    <NA> <NA> <NA> <NA>
## 2        <NA>    <NA> <NA>  <NA>    <NA> <NA> <NA> <NA>
## 3        <NA>    <NA> <NA>  <NA>    <NA> <NA> <NA> <NA>
## 4        <NA>    <NA> <NA>  <NA>    <NA> <NA> <NA> <NA>
## 5        <NA>    <NA> <NA>  <NA>    <NA> <NA> <NA> <NA>
## 6        <NA>    <NA> <NA>  <NA>    <NA> <NA> <NA> <NA>

The film genres are now variables, so you could do analysis like calculating which genre is the highest rated.

Analysis

Exploring the Ratings

First I want to see the distributions of the Rotten Tomatoes and IMDb scores. If they are similar enough, maybe I can impute a missing score from the other.

scores <- dat1 %>%
  filter((!is.na(IMDb)), !is.na(Rotten.Tomatoes))%>%
  select(IMDb,Rotten.Tomatoes)

scores <- scores %>%
  mutate(
    avg_score=rowMeans(.)
  )

# Summary tibble
imdb_summary<-summary(scores$IMDb)
rotten_summary<-summary(scores$Rotten.Tomatoes)
avg_summary<-summary(scores$avg_score)

sum_titles<-c("min","1st Q","median","mean","3rd Q","Max")

summary_tibble<-tibble(sum_titles,imdb_summary,rotten_summary,avg_summary)

boxplot(scores)

Interpreting the boxplot and summary:

It seems that the IMDb ratings are highly clustered around the mean, with a left skew. The Rotten Tomatoes scores seem to be much more evenly distributed because the IQR is much wider. Let’s look at a histogram to confirm.

score_longer<-scores%>%
  select(IMDb,Rotten.Tomatoes)%>%
  pivot_longer(
    cols = IMDb:Rotten.Tomatoes,
    names_to="scorer",
    values_to="values"
  )

score_longer %>% 
  ggplot(aes(x=values,color=scorer))+
  geom_histogram(fill="white",alpha=0.5,position="identity",bins=100)

I would expect the “actual quality” of movies to be a normal distribution. For this reason I am tempted to reject the Rotten Tomatoes ratings for my analysis. There might be some bias in the Rotten Tomatoes ratings that I will discuss in my conclusion.

Since we are rejecting the Rotten Tomatoes ratings, we can bring back in all of the scored IMDb ratings.

imdb_scores<-dat1%>%
  filter(!is.na(IMDb))%>%
  select(IMDb)%>%
  pivot_longer(
    cols = IMDb,
    names_to="IMDb",
    values_to="score"
  )

ggplot(imdb_scores,aes(x=score))+
  geom_histogram(bins=30)

Now that I have a good distribution, I want to decide what a good movie is. I will say that any movie 1 standard deviation above the mean is a good movie.

# Get the score floor
score_floor<-mean(imdb_scores$score)+sd(imdb_scores$score)

# Get our database with imdb scores greater than the score floor
tidy_data<-dat1%>%
  filter(!is.na(IMDb))%>%
  filter(IMDb>!!score_floor)%>%
  select(Title,IMDb,Netflix,Hulu,Prime.Video,Disney)

good_count<-tidy_data%>%
  summarize(sum(Netflix),sum(Hulu),sum(Prime.Video),sum(Disney))%>%unlist
good_count<-str_extract(good_count, "\\d+")%>%
  as.numeric()

provider<-c("netflix","hulu","prime_video","disney")

cost<-c(8.99,11.99,8.99,6.99)

cost_compare<-tibble("provider"=provider,"good_movies"=good_count,"price"=cost)

cost_compare<-cost_compare%>%
  mutate("movie_per_dollar" = round(good_movies/price,2))
  
cost_compare%>%
  select(provider,movie_per_dollar)
## # A tibble: 4 x 2
##   provider    movie_per_dollar
##   <chr>                  <dbl>
## 1 netflix                 75.6
## 2 hulu                    12.0
## 3 prime_video            177. 
## 4 disney                  18.2

Conclusion

I believe we should reject the Rotten Tomatoes scores because I believe it is biased. Rotten Tomatoes is a consumer targeted movie ranking site. I believe there is probably a lot of manipulation of scores by movie producers to try to get people to go to the movie. There is evidence of this as the largest bin is a 100% score, and the count seems to increase linearly with score.

If we decide value is based on the amount of good movies we have access to, then we should count the number of good movies available on a streaming service and get a metric of good movies per dollar. Clearly from the tibble above, Amazon Prime is the streaming service that offers the most good movies per dollar.