For this project we were asked to choose three “wide” datasets, create .CSV files (or, alternatively, databases) for them, then tidy and analyze the data. I partnered with two others from Data 607 for this project - Karim Hammoud and Jack Write. Our methodology was for each of us to take one dataset to work on and then share the code, discussion questions, and challenges with the rest of the group, the hopeful outcome being higher quality work based on feedback from everyone in the group.
The key questions we proposed to answer are:
For a specific country see its current top 3 production items and whether they have changed since 1961
Determine whether feed has overtaken food in any areas or food items
Compare a high protein crop like a legume to an animal-based product and determine whether they have increased or decreased over time.
The data set I chose to work with is the Food and Agriculture Organization (FAO) data on worldwide food (for human consumption) and feed (for animal consumption) production and distribution, as hosted on Kaggle. It is a very wide data set with 63 columns in total.
Formal citation for data set:
Oppenheim, R. (2017, November). Who eats the food we grow?, Version 7. Retrieved 30 September 2020 from https://www.kaggle.com/dorbicycle/world-foodfeed-production/version/7.
# Load required packages
library(tidyverse)
library(gridExtra)
# Load data from Github
rawdata <- read.csv("https://raw.githubusercontent.com/cwestsmith/cuny-msds/master/datasets/FAO.csv")
# Quick look at the data to make sure it loaded correctly. It consists of 63 columns and 21,477 rows at this point.
glimpse(rawdata)
## Rows: 21,477
## Columns: 63
## $ Area.Abbreviation <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", ...
## $ Area.Code <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Area <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afg...
## $ Item.Code <int> 2511, 2805, 2513, 2513, 2514, 2514, 2517, 2520, 2...
## $ Item <chr> "Wheat and products", "Rice (Milled Equivalent)",...
## $ Element.Code <int> 5142, 5142, 5521, 5142, 5521, 5142, 5142, 5142, 5...
## $ Element <chr> "Food", "Food", "Feed", "Food", "Feed", "Food", "...
## $ Unit <chr> "1000 tonnes", "1000 tonnes", "1000 tonnes", "100...
## $ latitude <dbl> 33.94, 33.94, 33.94, 33.94, 33.94, 33.94, 33.94, ...
## $ longitude <dbl> 67.71, 67.71, 67.71, 67.71, 67.71, 67.71, 67.71, ...
## $ Y1961 <int> 1928, 183, 76, 237, 210, 403, 17, 0, 111, 45, 0, ...
## $ Y1962 <int> 1904, 183, 76, 237, 210, 403, 18, 0, 97, 45, 0, 4...
## $ Y1963 <int> 1666, 182, 76, 237, 214, 410, 19, 0, 103, 45, 0, ...
## $ Y1964 <int> 1950, 220, 76, 238, 216, 415, 20, 0, 110, 45, 0, ...
## $ Y1965 <int> 2001, 220, 76, 238, 216, 415, 21, 0, 113, 31, 0, ...
## $ Y1966 <int> 1808, 195, 75, 237, 216, 413, 22, 0, 117, 14, 16,...
## $ Y1967 <int> 2053, 231, 71, 225, 235, 454, 23, 0, 128, 19, 23,...
## $ Y1968 <int> 2045, 235, 72, 227, 232, 448, 24, 0, 130, 30, 31,...
## $ Y1969 <int> 2154, 238, 73, 230, 236, 455, 25, 0, 134, 34, 28,...
## $ Y1970 <int> 1819, 213, 74, 234, 200, 383, 26, 0, 125, 15, 9, ...
## $ Y1971 <int> 1963, 205, 71, 223, 201, 386, 26, 0, 147, 0, 13, ...
## $ Y1972 <int> 2215, 233, 70, 219, 216, 416, 27, 0, 138, 0, 13, ...
## $ Y1973 <int> 2310, 246, 72, 225, 228, 439, 27, 0, 143, 28, 6, ...
## $ Y1974 <int> 2335, 246, 76, 240, 231, 445, 28, 0, 160, 32, 0, ...
## $ Y1975 <int> 2434, 255, 77, 244, 234, 451, 29, 0, 169, 20, 0, ...
## $ Y1976 <int> 2512, 263, 80, 255, 240, 463, 37, 0, 324, 28, 10,...
## $ Y1977 <int> 2282, 235, 60, 185, 228, 439, 32, 0, 176, 24, 16,...
## $ Y1978 <int> 2454, 254, 65, 203, 234, 451, 33, 0, 225, 24, 13,...
## $ Y1979 <int> 2443, 270, 64, 198, 228, 440, 31, 0, 232, 34, 6, ...
## $ Y1980 <int> 2129, 259, 64, 202, 226, 437, 31, 0, 240, 61, 15,...
## $ Y1981 <int> 2133, 248, 60, 189, 210, 407, 29, 0, 247, 50, 0, ...
## $ Y1982 <int> 2068, 217, 55, 174, 199, 384, 27, 0, 248, 43, 0, ...
## $ Y1983 <int> 1994, 217, 53, 167, 192, 371, 28, 0, 242, 38, 0, ...
## $ Y1984 <int> 1851, 197, 51, 160, 182, 353, 26, 0, 235, 46, 0, ...
## $ Y1985 <int> 1791, 186, 48, 151, 173, 334, 25, 0, 226, 23, 0, ...
## $ Y1986 <int> 1683, 200, 46, 145, 170, 330, 23, 0, 217, 25, 0, ...
## $ Y1987 <int> 2194, 193, 46, 145, 154, 298, 23, 0, 196, 3, 0, 8...
## $ Y1988 <int> 1801, 202, 47, 148, 148, 287, 23, 0, 198, 45, 0, ...
## $ Y1989 <int> 1754, 191, 46, 145, 137, 265, 23, 0, 184, 54, 0, ...
## $ Y1990 <int> 1640, 199, 43, 135, 144, 279, 24, 0, 205, 47, 0, ...
## $ Y1991 <int> 1539, 197, 43, 132, 126, 245, 24, 0, 203, 29, 0, ...
## $ Y1992 <int> 1582, 249, 40, 120, 90, 170, 18, 0, 210, 29, 0, 5...
## $ Y1993 <int> 1840, 218, 50, 155, 141, 272, 22, 0, 210, 29, 0, ...
## $ Y1994 <int> 1855, 260, 46, 143, 150, 289, 20, 0, 211, 29, 0, ...
## $ Y1995 <int> 1853, 319, 41, 125, 159, 310, 21, 0, 212, 29, 0, ...
## $ Y1996 <int> 2177, 254, 44, 138, 108, 209, 17, 0, 213, 29, 0, ...
## $ Y1997 <int> 2343, 326, 50, 159, 90, 173, 20, 0, 214, 28, 0, 5...
## $ Y1998 <int> 2407, 347, 48, 154, 99, 192, 21, 0, 214, 28, 0, 5...
## $ Y1999 <int> 2463, 270, 43, 141, 72, 141, 17, 0, 217, 28, 0, 6...
## $ Y2000 <int> 2600, 372, 26, 84, 35, 66, 20, 0, 219, 29, 0, 61,...
## $ Y2001 <int> 2668, 411, 29, 83, 48, 93, 20, 0, 215, 29, 0, 61,...
## $ Y2002 <int> 2776, 448, 70, 122, 89, 170, 18, 0, 217, 29, 0, 7...
## $ Y2003 <int> 3095, 460, 48, 144, 63, 117, 16, 1, 347, 51, 0, 9...
## $ Y2004 <int> 3249, 419, 58, 185, 120, 231, 15, 2, 276, 50, 0, ...
## $ Y2005 <int> 3486, 445, 236, 43, 208, 67, 21, 1, 294, 29, 0, 1...
## $ Y2006 <int> 3704, 546, 262, 44, 233, 82, 11, 1, 294, 61, 0, 1...
## $ Y2007 <int> 4164, 455, 263, 48, 249, 67, 19, 0, 260, 65, 0, 1...
## $ Y2008 <int> 4252, 490, 230, 62, 247, 69, 21, 0, 242, 54, 0, 2...
## $ Y2009 <int> 4538, 415, 379, 55, 195, 71, 18, 0, 250, 114, 0, ...
## $ Y2010 <int> 4605, 442, 315, 60, 178, 82, 14, 0, 192, 83, 0, 2...
## $ Y2011 <int> 4711, 476, 203, 72, 191, 73, 14, 0, 169, 83, 0, 2...
## $ Y2012 <int> 4810, 425, 367, 78, 200, 77, 14, 0, 196, 69, 0, 2...
## $ Y2013 <int> 4895, 422, 360, 89, 200, 76, 12, 0, 230, 81, 0, 2...
To enhance our ability to analyze the data within the tidyverse we need to do some tidying and transforming. Pivoting the 53 year columns into rows increases the number of rows from 21,477 to 1,138,281 (21,477 × 53), so the benefit of easier analysis comes at a cost of memory and processing time. Note that the pivot_wider step below does not shrink the table again: the Food and Feed rows carry different Element.Code values, so they cannot be merged until we drop that column and consolidate with group_by/summarize.
# Combine the year columns into key-value pairs of year and total units
longdata_cs <- rawdata %>%
pivot_longer(Y1961:Y2013, names_to = "Year", values_to = "Total_Units")
# Pivot the Element column (Food / Feed) back out into separate columns
longdata_cs <- longdata_cs %>%
pivot_wider(names_from = Element, values_from = Total_Units, values_fill = 0)
# Remove the Y from the values in the Year column and convert to numeric
longdata_cs <- longdata_cs %>% mutate(Year=as.numeric(gsub("Y","", Year)))
# Confirm everything looks ok
str(longdata_cs)
## tibble [1,138,281 x 12] (S3: tbl_df/tbl/data.frame)
## $ Area.Abbreviation: chr [1:1138281] "AFG" "AFG" "AFG" "AFG" ...
## $ Area.Code : int [1:1138281] 2 2 2 2 2 2 2 2 2 2 ...
## $ Area : chr [1:1138281] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Item.Code : int [1:1138281] 2511 2511 2511 2511 2511 2511 2511 2511 2511 2511 ...
## $ Item : chr [1:1138281] "Wheat and products" "Wheat and products" "Wheat and products" "Wheat and products" ...
## $ Element.Code : int [1:1138281] 5142 5142 5142 5142 5142 5142 5142 5142 5142 5142 ...
## $ Unit : chr [1:1138281] "1000 tonnes" "1000 tonnes" "1000 tonnes" "1000 tonnes" ...
## $ latitude : num [1:1138281] 33.9 33.9 33.9 33.9 33.9 ...
## $ longitude : num [1:1138281] 67.7 67.7 67.7 67.7 67.7 ...
## $ Year : num [1:1138281] 1961 1962 1963 1964 1965 ...
## $ Food : int [1:1138281] 1928 1904 1666 1950 2001 1808 2053 2045 2154 1819 ...
## $ Feed : int [1:1138281] 0 0 0 0 0 0 0 0 0 0 ...
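As an aside, pivot_longer() can strip the “Y” prefix and convert the type in a single step; a sketch equivalent to the separate gsub step above:
# Equivalent one-step pivot: drop the "Y" prefix and convert Year while pivoting
longdata_alt <- rawdata %>%
pivot_longer(Y1961:Y2013,
names_to = "Year",
names_prefix = "Y",
names_transform = list(Year = as.integer),
values_to = "Total_Units")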
# Create new data frame including only the columns required for analysis and
# use group_by to consolidate similar rows
analysisdata <- longdata_cs %>%
select(Area, Year, Item, Feed, Food) %>%
group_by(Area, Year, Item) %>%
summarize(Feed_Units=sum(Feed), Food_Units=sum(Food), Total_Units=sum(Feed+Food))
# Make sure everything looks as intended
head(analysisdata, 10)
## # A tibble: 10 x 6
## # Groups: Area, Year [1]
## Area Year Item Feed_Units Food_Units Total_Units
## <chr> <dbl> <chr> <int> <int> <int>
## 1 Afghanistan 1961 Alcoholic Beverages 0 0 0
## 2 Afghanistan 1961 Animal fats 0 20 20
## 3 Afghanistan 1961 Apples and products 0 14 14
## 4 Afghanistan 1961 Bananas 0 0 0
## 5 Afghanistan 1961 Barley and products 76 237 313
## 6 Afghanistan 1961 Beer 0 0 0
## 7 Afghanistan 1961 Beverages, Alcoholic 0 0 0
## 8 Afghanistan 1961 Bovine Meat 0 43 43
## 9 Afghanistan 1961 Butter, Ghee 0 11 11
## 10 Afghanistan 1961 Cereals - Excluding Beer 286 2767 3053
Now that the data is tidied it can be analyzed. We can see here that in 1961 the top producers were the U.S., China, and India (in that order), but by 2013 the order had shifted to China, India, and the U.S.
# View top countries by Total_Units for the earliest year (1961)
analysisdata %>% ungroup() %>%
filter(Year == min(Year)) %>%
select(Year, Area, Total_Units) %>%
group_by(Year, Area) %>%
summarize(Total_Units = sum(Total_Units)) %>%
arrange(desc(Total_Units))
## # A tibble: 174 x 3
## # Groups: Year [1]
## Year Area Total_Units
## <dbl> <chr> <int>
## 1 1961 United States of America 559347
## 2 1961 China, mainland 477279
## 3 1961 India 309565
## 4 1961 Germany 212839
## 5 1961 France 138385
## 6 1961 Poland 130442
## 7 1961 Brazil 126501
## 8 1961 United Kingdom 117090
## 9 1961 Japan 112141
## 10 1961 Italy 105534
## # ... with 164 more rows
# View top countries by Total_Units for the latest year (2013)
analysisdata %>% ungroup() %>%
filter(Year == max(Year)) %>%
select(Year, Area, Total_Units) %>%
group_by(Year, Area) %>%
summarize(Total_Units = sum(Total_Units)) %>%
arrange(desc(Total_Units))
## # A tibble: 174 x 3
## # Groups: Year [1]
## Year Area Total_Units
## <dbl> <chr> <int>
## 1 2013 China, mainland 3191155
## 2 2013 India 1336593
## 3 2013 United States of America 938639
## 4 2013 Brazil 439967
## 5 2013 Russian Federation 370913
## 6 2013 Nigeria 300923
## 7 2013 Indonesia 256888
## 8 2013 Germany 218891
## 9 2013 Mexico 210121
## 10 2013 Pakistan 190505
## # ... with 164 more rows
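As an aside, the 1961 and 2013 pipelines above differ only in the year, so a small helper removes the repetition (the function name top_areas is mine, not part of the original analysis):
# Total production by Area for a given year, largest first
top_areas <- function(df, yr, n = 10) {
df %>%
ungroup() %>%
filter(Year == yr) %>%
group_by(Area) %>%
summarize(Total_Units = sum(Total_Units)) %>%
slice_max(Total_Units, n = n)
}
top_areas(analysisdata, 1961)
top_areas(analysisdata, 2013)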
Question 1
For a specific country see its current top 3 production items and whether they have changed since 1961
Using Switzerland as our specific country, we can see that yes, the top items have changed from 1961 to 2013 (the latest year of data).
In 1961 the items were (1) Milk - Excluding Butter; (2) Cereals - Excluding Beer; and (3) Starchy Roots.
In 2013 the items were (1) Milk - Excluding Butter; (2) Cereals - Excluding Beer; and (3) Wheat and products.
Interestingly, Starchy Roots is not even in the top 10 in 2013. The chart below shows that its production has been steadily decreasing over time.
# Top 10 Units in 1961
analysisdata %>% filter(Area == "Switzerland", Total_Units > 0, Year == 1961) %>%
group_by(Year, Area, Item) %>%
summarize(Total_Units = sum(Total_Units)) %>%
slice_max(Total_Units, n=10)
## # A tibble: 10 x 4
## # Groups: Year, Area [1]
## Year Area Item Total_Units
## <dbl> <chr> <chr> <int>
## 1 1961 Switzerland Milk - Excluding Butter 5484
## 2 1961 Switzerland Cereals - Excluding Beer 1341
## 3 1961 Switzerland Starchy Roots 1021
## 4 1961 Switzerland Potatoes and products 1009
## 5 1961 Switzerland Fruits - Excluding Wine 746
## 6 1961 Switzerland Wheat and products 673
## 7 1961 Switzerland Alcoholic Beverages 661
## 8 1961 Switzerland Fruits, Other 405
## 9 1961 Switzerland Vegetables 400
## 10 1961 Switzerland Beer 385
# Top 10 Units in 2013
analysisdata %>% filter(Area == "Switzerland", Total_Units > 0, Year == 2013) %>%
group_by(Year, Area, Item) %>%
summarize(Total_Units = sum(Total_Units)) %>%
slice_max(Total_Units, n=10)
## # A tibble: 10 x 4
## # Groups: Year, Area [1]
## Year Area Item Total_Units
## <dbl> <chr> <chr> <int>
## 1 2013 Switzerland Milk - Excluding Butter 6780
## 2 2013 Switzerland Cereals - Excluding Beer 1706
## 3 2013 Switzerland Wheat and products 983
## 4 2013 Switzerland Vegetables 880
## 5 2013 Switzerland Fruits - Excluding Wine 837
## 6 2013 Switzerland Alcoholic Beverages 790
## 7 2013 Switzerland Vegetables, Other 649
## 8 2013 Switzerland Meat 584
## 9 2013 Switzerland Sugar & Sweeteners 488
## 10 2013 Switzerland Beer 454
# What happened to starchy roots in Switzerland?
analysisdata %>% filter(Area == "Switzerland", Item == "Starchy Roots", Total_Units > 0) %>%
ggplot(aes(x=Year, y=Total_Units)) +
geom_bar(stat="identity") +
ggtitle("Starchy Roots Production in Switzerland - 1961 to 2013") Question 2
Determine whether feed has overtaken food in any areas or food items
In 2013 there were 627 country/item combinations with a feed-to-food ratio greater than one, meaning more of the product went to animal consumption than to human consumption. By comparison, in 1961 there were only 274 such combinations.
# Calculate and summarize ratio for 2013
analysisdata %>% ungroup %>% filter(Year==2013) %>%
select(Item, Feed_Units, Food_Units) %>%
mutate(Feed_To_Food_Ratio = Feed_Units / Food_Units) %>%
filter(Food_Units > 0, Feed_Units > 0, Feed_To_Food_Ratio > 1) %>%
arrange(desc(Feed_To_Food_Ratio))
## # A tibble: 627 x 4
## Item Feed_Units Food_Units Feed_To_Food_Ratio
## <chr> <int> <int> <dbl>
## 1 Rape and Mustardseed 3715 1 3715
## 2 Maize and products 2099 3 700.
## 3 Pulses, Other and products 532 1 532
## 4 Soyabeans 959 2 480.
## 5 Barley and products 475 1 475
## 6 Barley and products 8000 17 471.
## 7 Barley and products 1380 3 460
## 8 Barley and products 7313 16 457.
## 9 Soyabeans 444 1 444
## 10 Barley and products 6598 25 264.
## # ... with 617 more rows
# Calculate and summarize ratio for 1961
analysisdata %>% ungroup %>% filter(Year==1961) %>%
select(Item, Feed_Units, Food_Units) %>%
mutate(Feed_To_Food_Ratio = Feed_Units / Food_Units) %>%
filter(Food_Units > 0, Feed_Units > 0, Feed_To_Food_Ratio > 1) %>%
arrange(desc(Feed_To_Food_Ratio))
## # A tibble: 274 x 4
## Item Feed_Units Food_Units Feed_To_Food_Ratio
## <chr> <int> <int> <dbl>
## 1 Cereals, Other 1202 1 1202
## 2 Barley and products 2616 3 872
## 3 Oats 584 1 584
## 4 Oats 485 1 485
## 5 Oats 2366 5 473.
## 6 Barley and products 4070 15 271.
## 7 Barley and products 2603 11 237.
## 8 Barley and products 2009 9 223.
## 9 Barley and products 5000 28 179.
## 10 Cereals, Other 430 3 143.
## # ... with 264 more rows
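The 627 and 274 counts quoted above can also be reproduced directly; a sketch (the helper name feed_over_food is hypothetical):
# Count the country/item rows where feed exceeds food in a given year
feed_over_food <- function(df, yr) {
df %>%
ungroup() %>%
filter(Year == yr, Food_Units > 0, Feed_Units > 0,
Feed_Units / Food_Units > 1) %>%
nrow()
}
feed_over_food(analysisdata, 1961)  # 274 in the run above
feed_over_food(analysisdata, 2013)  # 627 in the run above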
Question 3
Compare a high protein crop like a legume to an animal-based product and determine whether they have increased or decreased over time.
Both soyabean and bovine meat production have increased significantly since 1961, with soyabean production increasing by a factor of 6.3 and bovine meat by a factor of 2.63.
bovine_soya_combined <- analysisdata %>% ungroup %>%
filter(Item=="Soyabeans" | Item=="Bovine Meat", Total_Units > 0) %>%
select(Year, Item, Total_Units) %>%
group_by(Year, Item) %>%
summarize(Total_Units = sum(Total_Units))
soyabean1961 <- bovine_soya_combined %>%
filter(Item == "Soyabeans", Year == 1961) %>%
select(Total_Units)
soyabean2013 <- bovine_soya_combined %>%
filter(Item == "Soyabeans", Year == 2013) %>%
select(Total_Units)
bovine1961 <- bovine_soya_combined %>%
filter(Item == "Bovine Meat", Year == 1961) %>%
select(Total_Units)
bovine2013 <- bovine_soya_combined %>%
filter(Item == "Bovine Meat", Year == 2013) %>%
select(Total_Units)
soyabeanchange <- soyabean2013$Total_Units / soyabean1961$Total_Units
bovinechange <- bovine2013$Total_Units / bovine1961$Total_Units
chart1 <- bovine_soya_combined %>%
filter(Item == "Soyabeans") %>%
ggplot(aes(x = Year, y = Total_Units)) +
geom_bar(stat="identity") +
ggtitle("Global Soyabeans Production - 1961 to 2013") +
theme(text = element_text(size=8))
chart2 <- bovine_soya_combined %>%
ggplot(aes(x = Year, y = Total_Units, fill = Item)) +
geom_bar(position = "dodge", stat="identity") +
ggtitle("Global Bovine Meat vs Soyabeans Production - 1961 to 2013") +
theme(text = element_text(size=8))
cat("Since 1961 Soyabean production has increased by a factor of", round(soyabeanchange, 2), "\n")## Since 1961 Soyabean production has increased by a factor of 6.3
## Since 1961 Bovine meat production has increased by a factor of 2.63
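As an aside, chart1 and chart2 are built above but never drawn; grid.arrange() from gridExtra (loaded at the top) can display them. The growth factors can also be computed in one pipeline rather than four separate filters; a sketch:
# Draw the two charts stacked vertically
grid.arrange(chart1, chart2, nrow = 2)
# Growth factor per item between the first and last years of data
bovine_soya_combined %>%
ungroup() %>%
filter(Year %in% c(1961, 2013)) %>%
pivot_wider(names_from = Year, values_from = Total_Units, names_prefix = "Y") %>%
mutate(Change_Factor = round(Y2013 / Y1961, 2))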
In conclusion, production has changed dramatically since 1961 in terms of quantity (number of units produced), type (food versus feed), and area (country of production). There was a significant spike in production during the early 1990s. A detailed summary is below.
summaryinfo <- analysisdata %>%
ungroup() %>%
filter(Year == 1961 | Year == 2013) %>%
select(Year, Feed_Units, Food_Units, Total_Units) %>%
group_by(Year) %>%
summarize(Feed_Units = sum(Feed_Units), Food_Units = sum(Food_Units), Total_Units = sum(Total_Units))
feed_increase <- as.numeric(summaryinfo[2,2] / summaryinfo[1,2])
food_increase <- as.numeric(summaryinfo[2,3] / summaryinfo[1,3])
total_increase <- as.numeric(summaryinfo[2,4] / summaryinfo[1,4])
cat("Feed production has increased by a factor of", round(feed_increase,2), "since 1961\n")## Feed production has increased by a factor of NA since 1961
## Food production has increased by a factor of NA since 1961
## Total production has increased by a factor of NA since 1961
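The NA factors above come from missing values in the source data: sum() returns NA whenever any of its inputs is NA. A minimal fix, assuming the same column names, is to drop the missing values inside the sums; re-running the cat() lines afterwards then prints the intended factors.
summaryinfo <- analysisdata %>%
ungroup() %>%
filter(Year == 1961 | Year == 2013) %>%
select(Year, Feed_Units, Food_Units, Total_Units) %>%
group_by(Year) %>%
# na.rm = TRUE keeps NAs in the source data from propagating to the totals
summarize(Feed_Units = sum(Feed_Units, na.rm = TRUE),
Food_Units = sum(Food_Units, na.rm = TRUE),
Total_Units = sum(Total_Units, na.rm = TRUE))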
In this part of the project we use 2016 data on the different revenue collections and spending of the Northeast region: the budgets and finances of nine Northeast states, together with the total US collections for that year.
The following terms are used to describe a state’s finances:
Revenues come mainly from tax collections, licensing fees, federal aid, and returns on investments.
Expenditures generally include spending on government salaries, infrastructure, education, public pensions, public assistance, corrections, Medicaid, and transportation.
State funds include general and other state-based funds.
Federal funds are funds received directly from the federal government.
Total Expense is calculated by adding together the totals for state and federal funds used for expenditures.
The analysis:
1 - Analyze the total collection revenues and expenses for NE states.
2 - Compare the different collection revenues.
3 - Analyze NE states' per capita collection and per capita spending.
Load data from Github
Formal citation for data set:
Ballotpedia. New York state budget and finances. Retrieved 1 October 2020 from https://ballotpedia.org/New_York_state_budget_and_finances.
url <- "https://raw.githubusercontent.com/cwestsmith/cuny-msds/master/datasets/NE_tax_and_spending.csv"
NE_data <- read.csv(url, sep = ",")
head(NE_data)
## State Property_taxes Sales_and_gross_receipts Licenses Income_taxes
## 1 New York NA 24790017 1794987 50690443
## 2 Connecticut NA 6149782 455454 8276620
## 3 New Jersey 4638 13173332 1499889 15585479
## 4 Pennsylvania 43124 19284374 2159170 14388463
## 5 New Hampshire 406394 982832 329549 788210
## 6 Maine 35425 2077913 272253 1689129
## Other_taxes Federal_aid State_funds Federal_funds X2016_population
## 1 4078516 48168577 101232 49476 19745289
## 2 363091 6349942 24964 6122 3576452
## 3 1283382 15438821 42398 17440 8944469
## 4 1519458 21647328 50858 27153 12784227
## 5 134961 1658713 3623 2162 1334795
## 6 55522 2997132 5519 2536 1331479
Calculate the total collections, which include all the different types of taxes, as well as the total expenses across state and federal funds.
# Replace NA with 0
# Calculate the total collections per state
# Calculate the total expense per state
wide_data <- NE_data %>%
replace(is.na(.), 0) %>%
mutate(Total_collections = Property_taxes+Sales_and_gross_receipts+Licenses+Income_taxes+Other_taxes) %>%
mutate(Total_Expense = State_funds + Federal_funds)
head(wide_data)
## State Property_taxes Sales_and_gross_receipts Licenses Income_taxes
## 1 New York 0 24790017 1794987 50690443
## 2 Connecticut 0 6149782 455454 8276620
## 3 New Jersey 4638 13173332 1499889 15585479
## 4 Pennsylvania 43124 19284374 2159170 14388463
## 5 New Hampshire 406394 982832 329549 788210
## 6 Maine 35425 2077913 272253 1689129
## Other_taxes Federal_aid State_funds Federal_funds X2016_population
## 1 4078516 48168577 101232 49476 19745289
## 2 363091 6349942 24964 6122 3576452
## 3 1283382 15438821 42398 17440 8944469
## 4 1519458 21647328 50858 27153 12784227
## 5 134961 1658713 3623 2162 1334795
## 6 55522 2997132 5519 2536 1331479
## Total_collections Total_Expense
## 1 81353963 150708
## 2 15244947 31086
## 3 31546720 59838
## 4 37394589 78011
## 5 2641946 5785
## 6 4130242 8055
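The same calculation can be written more compactly with across() and rowSums(); a sketch assuming the same column layout:
# Zero out NAs, then sum the five tax columns in one call
wide_data_alt <- NE_data %>%
mutate(across(where(is.numeric), ~replace_na(.x, 0)),
Total_collections = rowSums(across(Property_taxes:Other_taxes)),
Total_Expense = State_funds + Federal_funds)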
Calculate the per capita collections and expenses for each state.
wide_data <- wide_data %>%
# Collections appear to be reported in thousands of dollars and expenses in
# millions, hence the different multipliers to arrive at dollars per person
mutate(Per_capita_collection = round((Total_collections / X2016_population)*1000)) %>%
mutate(Per_capita_expense = round((Total_Expense / X2016_population)*1000000))
head(wide_data)
## State Property_taxes Sales_and_gross_receipts Licenses Income_taxes
## 1 New York 0 24790017 1794987 50690443
## 2 Connecticut 0 6149782 455454 8276620
## 3 New Jersey 4638 13173332 1499889 15585479
## 4 Pennsylvania 43124 19284374 2159170 14388463
## 5 New Hampshire 406394 982832 329549 788210
## 6 Maine 35425 2077913 272253 1689129
## Other_taxes Federal_aid State_funds Federal_funds X2016_population
## 1 4078516 48168577 101232 49476 19745289
## 2 363091 6349942 24964 6122 3576452
## 3 1283382 15438821 42398 17440 8944469
## 4 1519458 21647328 50858 27153 12784227
## 5 134961 1658713 3623 2162 1334795
## 6 55522 2997132 5519 2536 1331479
## Total_collections Total_Expense Per_capita_collection Per_capita_expense
## 1 81353963 150708 4120 7633
## 2 15244947 31086 4263 8692
## 3 31546720 59838 3527 6690
## 4 37394589 78011 2925 6102
## 5 2641946 5785 1979 4334
## 6 4130242 8055 3102 6050
Transform the data from wide to long
# Transform and arrange by states
long_data <- wide_data %>%
rename(Sales_receipts = Sales_and_gross_receipts, population = X2016_population) %>%
gather(Details, "Total", 2:14) %>%
arrange(State)
head(long_data)
## State Details Total
## 1 Connecticut Property_taxes 0
## 2 Connecticut Sales_receipts 6149782
## 3 Connecticut Licenses 455454
## 4 Connecticut Income_taxes 8276620
## 5 Connecticut Other_taxes 363091
## 6 Connecticut Federal_aid 6349942
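gather() still works but is superseded in current tidyr; an equivalent with pivot_longer() would be (a sketch):
# Pivot every column except State into Details / Total pairs
long_data <- wide_data %>%
rename(Sales_receipts = Sales_and_gross_receipts, population = X2016_population) %>%
pivot_longer(-State, names_to = "Details", values_to = "Total") %>%
arrange(State)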
Export data into a .csv file
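A sketch of that export step (the file name is hypothetical):
# Write the long-format data out for reuse
write.csv(long_data, "NE_long_data.csv", row.names = FALSE)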
Filter to only the collection streams for the NE states.
# Selecting the target of the filter which are the collections
target <- c("Property_taxes" , "Sales_receipts", "Licenses", "Income_taxes", "Other_taxes")
# Apply the filters and exclude the US collections, which are relatively high
Revenue <- filter(long_data, Details %in% target & !(State == "United States"))
head(Revenue)
## State Details Total
## 1 Connecticut Property_taxes 0
## 2 Connecticut Sales_receipts 6149782
## 3 Connecticut Licenses 455454
## 4 Connecticut Income_taxes 8276620
## 5 Connecticut Other_taxes 363091
## 6 Maine Property_taxes 35425
Plot to compare the different collection streams per state.
Revenue %>%
ggplot(aes(fill=Details,y=Total,x=reorder(State,-Total))) +
geom_bar(position="dodge",stat="identity") +
ylab("Total Collections per Thousands")+ xlab("North East States") +
theme(axis.text.x = element_text(angle=90))Filtering the Per_capita_collection and Per_capita_expense for NE States.
# Select the target columns for the filter (the per capita measures)
target <- c("Per_capita_collection" , "Per_capita_expense")
# Filter by per capita collection and per capita expense
long_data <- filter(long_data, Details %in% target)
head(long_data)
## State Details Total
## 1 Connecticut Per_capita_collection 4263
## 2 Connecticut Per_capita_expense 8692
## 3 Maine Per_capita_collection 3102
## 4 Maine Per_capita_expense 6050
## 5 Massachusetts Per_capita_collection 4005
## 6 Massachusetts Per_capita_expense 8867
Plot for Per Capita Collection and Expense for NE States.
long_data %>%
ggplot(aes(fill=Details,y=Total,x=reorder(State,-Total))) +
geom_bar(position="dodge",stat="identity") +
ylab("Per Capita Totals")+ xlab("North East States") +
theme(axis.text.x = element_text(angle=90))
Based on the above analysis we can see that New York has the highest overall state revenue, expenses, income taxes, and sales receipts, but Vermont has the highest per capita collections and Rhode Island has the highest per capita expenses. All of the states in the Northeast region have higher per capita expenses than the US average, and within the region only New Hampshire fell below the US average for per capita collections.
Orli asked the question: which streaming service gives the best bang for your buck? For this assignment I analyzed the distribution of ratings and defined a “good movie” as one rated at least one standard deviation above the mean.
Load data from Github
Formal citation for data set:
Bhatia, Ruchi. (2020, May). Movies on Netflix, Prime Video, Hulu and Disney+, Version 2. Retrieved 3 October 2020 from https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney.
In order for this data to be tidy, every column should be a variable and every row should be an observation. Each row is a movie (our observation), but columns such as genre, language, and director each contain multiple values.
Age could also be a useful variable to analyze if it were transformed into an integer and renamed to minimum age (min.age).
The analysis requested by the poster concerns the intersection between rating score and provider, so the IMDb and Rotten.Tomatoes ratings will need to be rescaled so they can be compared.
There are also extraneous columns (X and Type) that will need to be dealt with once the data is tidied.
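The loading step does not appear in the rendered output; a placeholder sketch is below (the file name is hypothetical — substitute the path to the downloaded Kaggle .csv):
# Hypothetical load step; the actual file path is not shown in the source
dat <- read.csv("MoviesOnStreamingPlatforms.csv")
glimpse(dat)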
## Rows: 16,744
## Columns: 17
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ Title <chr> "Inception", "The Matrix", "Avengers: Infinity War"...
## $ Year <int> 2010, 1999, 2018, 1985, 1966, 2018, 2002, 2012, 198...
## $ Age <chr> "13+", "18+", "13+", "7+", "18+", "7+", "18+", "18+...
## $ IMDb <dbl> 8.8, 8.7, 8.5, 8.5, 8.8, 8.4, 8.5, 8.4, 8.4, 8.3, 8...
## $ Rotten.Tomatoes <chr> "87%", "87%", "84%", "96%", "97%", "97%", "95%", "8...
## $ Netflix <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Hulu <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Prime.Video <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, ...
## $ Disney. <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Type <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Directors <chr> "Christopher Nolan", "Lana Wachowski,Lilly Wachowsk...
## $ Genres <chr> "Action,Adventure,Sci-Fi,Thriller", "Action,Sci-Fi"...
## $ Country <chr> "United States,United Kingdom", "United States", "U...
## $ Language <chr> "English,Japanese,French", "English", "English", "E...
## $ Runtime <int> 148, 136, 149, 116, 161, 117, 150, 165, 115, 153, 1...
Ratings:
- Convert Rotten.Tomatoes and IMDb to comparable proportions
- Remove the percent symbol and normalize
dat<-dat%>%
mutate(Rotten.Tomatoes=as.numeric(gsub("%","",Rotten.Tomatoes)))
# Rotten tomatoes as percentage
dat$Rotten.Tomatoes<-dat$Rotten.Tomatoes/100
# IMDB to percentage
dat$IMDb<-dat$IMDb/10
dat<-transform(dat,IMDb=as.numeric(IMDb),
Rotten.Tomatoes=as.numeric(Rotten.Tomatoes))Age:
- Change Age to a minimum age (min.age): remove the + sign and convert to an integer
dat<-dat%>%
mutate(Age=as.numeric(gsub("\\+","",Age)))
# Replace "all" with 0
dat<-dat%>%
mutate(Age=as.numeric(gsub("all",0,Age)))Disney is named “Disney.”, ill remove the period
Remove the Type and X columns.
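A sketch of those two cleanup steps, assuming the column names shown in the glimpse above:
# Drop the trailing period from Disney. and remove the extraneous columns
dat <- dat %>%
rename(Disney = Disney.) %>%
select(-Type, -X)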
Genres:
Note: This was not critical for the analysis, but has been included as an interesting challenge to tackle.
- Turn each genre into its own variable for each movie
- Remove the original Genres column
Step 1:
Get a list of unique genre types.
pattern<-"[A-z]+"
list<-dat$Genres%>%
unlist()%>%
str_extract_all(pattern)%>%
unique()
unique_genres<-list%>%
unlist()%>%
unique()
unique_genres## [1] "Action" "Adventure" "Sci" "Fi" "Thriller"
## [6] "Comedy" "Western" "Animation" "Family" "Biography"
## [11] "Drama" "Music" "War" "Crime" "Fantasy"
## [16] "Romance" "History" "Mystery" "Horror" "Sport"
## [21] "Documentary" "Musical" "News" "Short" "Reality"
## [26] "TV" "Talk" "Show" "Game" "Film"
## [31] "Noir"
Step 2:
Use the list to create new variables and fill them when the “Genres” column contains the variable.
dat1<-dat
N<-length(unique_genres)
# Create regex pattern for the list of unique genres
pat<-paste(unique_genres,collapse="|")
# Loop to create a column for each genre
for(i in 1:N){
# Genre of interest for this iteration
var<-unique_genres[i]
dat1<-dat1%>%
# !!var := injects the genre name as the new column name; the column holds
# the matched genre string when Genres contains it, NA otherwise
mutate( !!var := str_extract(Genres,!!var))
}
# Change the value from the genre name to binary (1 / NA). Restrict the
# substitution to the genre columns: running gsub over every column would
# also rewrite genre words that appear inside Title, Directors, etc.
dat2 <- dat1 %>%
mutate(across(all_of(unique_genres), ~gsub(pat,1,.x)))
# Remove extraneous columns (Fi, TV, Show, Film are artifacts of splitting
# hyphenated genres)
dat2 <- dat2%>%
select(-Genres,-Fi,-TV,-Show,-Film)
dat2 %>%
select(Action:Noir) %>%
head()
## Action Adventure Sci Thriller Comedy Western Animation Family Biography
## 1 1 1 1 1 <NA> <NA> <NA> <NA> <NA>
## 2 1 <NA> 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 1 1 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> 1 1 <NA> 1 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> 1 <NA> <NA> <NA>
## 6 1 1 1 <NA> <NA> <NA> 1 1 <NA>
## Drama Music War Crime Fantasy Romance History Mystery Horror Sport
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Documentary Musical News Short Reality Talk Game Noir
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
The film genres are now variables, so you could do analysis like calculating which genre is the highest rated.
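For example, a sketch of that genre-level analysis, assuming the across()-based dat2 above (where IMDb is still numeric and the genre columns hold "1" or NA):
# Mean IMDb rating per genre, using the binary genre columns
dat2 %>%
pivot_longer(Action:Noir, names_to = "Genre", values_to = "Flag") %>%
filter(Flag == "1", !is.na(IMDb)) %>%
group_by(Genre) %>%
summarize(Mean_IMDb = mean(IMDb), Movies = n()) %>%
arrange(desc(Mean_IMDb))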
Exploring the Ratings
First I want to see the distributions of the Rotten Tomatoes and IMDb scores. If they are similar enough, maybe I can impute a missing score from the other.
scores <- dat1 %>%
filter((!is.na(IMDb)), !is.na(Rotten.Tomatoes))%>%
select(IMDb,Rotten.Tomatoes)
scores <- scores %>%
mutate(
avg_score=rowMeans(.)
)
# Summary tibble
imdb_summary<-summary(scores$IMDb)
rotten_summary<-summary(scores$Rotten.Tomatoes)
avg_summary<-summary(scores$avg_score)
sum_titles<-c("min","1st Q","median","mean","3rd Q","Max")
summary_tibble<-tibble(sum_titles,imdb_summary,rotten_summary,avg_summary)
boxplot(scores)
Interpreting the boxplot and summary:
It seems that the IMDb ratings are tightly clustered around the mean, with a left skew. The Rotten Tomatoes scores appear much more evenly distributed, since their IQR is much wider. Let’s look at a histogram to confirm.
score_longer<-scores%>%
select(IMDb,Rotten.Tomatoes)%>%
pivot_longer(
cols = IMDb:Rotten.Tomatoes,
names_to="scorer",
values_to="values"
)
score_longer %>%
ggplot(aes(x=values,color=scorer))+
geom_histogram(fill="white",alpha=0.5,position="identity",bins=100)
I would expect the “actual quality” of movies to follow a normal distribution. For this reason I am tempted to reject the Rotten Tomatoes ratings for my analysis; there may be some bias in them, which I will discuss in my conclusion.
Since we are rejecting the Rotten Tomatoes ratings, we can bring back in all of the scored IMDb ratings.
imdb_scores<-dat1%>%
filter(!is.na(IMDb))%>%
select(IMDb)%>%
pivot_longer(
cols = IMDb,
names_to="IMDb",
values_to="score"
)
ggplot(imdb_scores,aes(x=score))+
geom_histogram(bins=30)
Now that I have a good distribution, I want to decide what counts as a good movie. I will say that any movie rated at least one standard deviation above the mean is a good movie.
# Get the score floor
score_floor<-mean(imdb_scores$score)+sd(imdb_scores$score)
# Get our database with imdb scores greater than the score floor
tidy_data<-dat1%>%
filter(!is.na(IMDb))%>%
filter(IMDb>!!score_floor)%>%
select(Title,IMDb,Netflix,Hulu,Prime.Video,Disney)
good_count<-tidy_data%>%
summarize(sum(Netflix),sum(Hulu),sum(Prime.Video),sum(Disney))%>%unlist
good_count<-str_extract(good_count, "\\d+")%>%
as.numeric()
provider<-c("netflix","hulu","prime_video","disney")
cost<-c(8.99,11.99,8.99,6.99)
cost_compare<-tibble("provider"=provider,"good_movies"=good_count,"price"=cost)
cost_compare<-cost_compare%>%
mutate("movie_per_dollar" = round(good_movies/price,2))
cost_compare%>%
select(provider,movie_per_dollar)
## # A tibble: 4 x 2
## provider movie_per_dollar
## <chr> <dbl>
## 1 netflix 75.6
## 2 hulu 12.0
## 3 prime_video 177.
## 4 disney 18.2
I believe we should reject the Rotten Tomatoes scores because they appear to be biased. Rotten Tomatoes is a consumer-targeted movie ranking site, and there is likely a good deal of score manipulation by movie producers trying to draw people to theaters. There is evidence of this in the distribution: the largest bin is the 100% score, and the counts increase roughly linearly with score.
If we decide value is based on the number of good movies we have access to, then we should count the good movies available on each streaming service and compute a metric of good movies per dollar. From the tibble above, Amazon Prime is clearly the streaming service that offers the most good movies per dollar.