Introduction

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

We are asked to choose three datasets from Week 5’s Discussion, transform the data, perform analysis, and have a conclusion.

NYC Subway Ridership from 2013

Cameron Smith provided this example where he suggests to analyze change in ridership by borough from year to next.

The data is found here. I entered the data into a CSV file using Excel’s Get Data method, where it can extract data tables from a website.

Loading

url <- 'https://github.com/dcorrea614/MSDS/raw/master/Annual%20Subway%20Ridership.csv'

dfMTA <- read.csv(file = url)

str(dfMTA)
## 'data.frame':    434 obs. of  10 variables:
##  $ ï..Station..alphabetical.by.borough.: chr  "The Bronx" "138 St-Grand Concourse" "149 St-Grand    Concourse" "161 St-Yankee    Stadium" ...
##  $ X2013                               : chr  "The Bronx" "957,984" "4,427,399" "8,766,012" ...
##  $ X2014                               : chr  "The Bronx" "1,033,559" "4,536,888" "8,961,029" ...
##  $ X2015                               : chr  "The Bronx" "1,056,380" "4,424,754" "8,922,188" ...
##  $ X2016                               : chr  "The Bronx" "1,070,024" "4,381,900" "8,784,407" ...
##  $ X2017                               : chr  "The Bronx" "1,036,746" "4,255,015" "8,596,506" ...
##  $ X2018                               : chr  "The Bronx" "944,598" "3,972,763" "8,392,290" ...
##  $ X2017.2018.Change                   : chr  "The Bronx" "-92,148" "-282,252" "-204,216" ...
##  $ X2017.2018.Change2                  : chr  "The Bronx" "-8.9%" "-6.6%" "-2.40%" ...
##  $ X2018.Rank                          : chr  "The Bronx" "365" "121" "38" ...

Taking a look inside the data, columns have extra characters, and all the numeric values are stored as characters that include commas. Additionally, the dataset is separated by boroughs and the borough names are entered as rows. We need to identify where each borough begins and ends.

Cleaning/Transforming Data

# changing the column names
new_col_name <- c('Station', 2013, 2014, 2015, 2016, 2017, 2018, '2017 - 2018 Net Change', 
                   '2017 - 2018 % Change', '2018 Rank')
colnames(dfMTA) <- new_col_name

# finding the rows where the boroughs are entered
borough <- c('The Bronx', 'Brooklyn', 'Manhattan', 'Queens')

rowvalues <- c()

for(i in 1:length(borough)){
  rowvalues[i] <- rownames(dfMTA[which(dfMTA$'2013' == borough[i]),])
}
rowvalues
## [1] "1"   "70"  "228" "350"
#now that we now where the boroughs dataset begins and ends, we can capture the 
# data accordingly
dfBronx <- dfMTA[2:69,]
dfBronx['Borough'] <- borough[1]
  
dfBrooklyn <- dfMTA[71:227,]
dfBrooklyn['Borough'] <- borough[2]

dfManhattan <- dfMTA[229:349,]
dfManhattan['Borough'] <- borough[3]

dfQueens <- dfMTA[351:dim(dfMTA)[1],]
dfQueens['Borough'] <- borough[4]

# combined all sub datasets
dfMTA2 <- rbind(dfBronx, dfBrooklyn, dfManhattan, dfQueens)

# changed the columns from character to integer and removing commas
dfMTA2 <- dfMTA2 %>%
  mutate('2013' = as.integer(str_remove_all(dfMTA2$'2013', ',')),
         '2014' = as.integer(str_remove_all(dfMTA2$'2014', ',')),
         '2015' = as.integer(str_remove_all(dfMTA2$'2015', ',')),
         '2016' = as.integer(str_remove_all(dfMTA2$'2016', ',')),
         '2017' = as.integer(str_remove_all(dfMTA2$'2017', ',')),
         '2018' = as.integer(str_remove_all(dfMTA2$'2018', ',')),
         '2017 - 2018 Net Change' = as.integer(str_remove_all(dfMTA2$'2017 - 2018 Net Change', ',')),
         '2017 - 2018 % Change' = as.numeric(str_remove_all(dfMTA2$'2017 - 2018 % Change', '%')),
         '2018 Rank' = as.integer(dfMTA2$'2018 Rank')) %>%
  select(Borough, colnames(dfMTA2))

Analysis

The data is now cleaned and we can look at the ridership by borough.

# subset of the data we want to look at
colnames2 <- c('Borough', 2013, 2014, 2015, 2016, 2017, 2018)
  
dfMTA3 <- dfMTA2 %>%
  select(colnames2)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(colnames2)` instead of `colnames2` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
aggMTA <- dfMTA3 %>%
  pivot_longer(!Borough, names_to = 'Year', values_to = 'Ridership') %>%
  group_by(Borough, Year) %>%
  summarize(Avg_Ridership = mean(Ridership, na.rm = TRUE))
## `summarise()` regrouping output by 'Borough' (override with `.groups` argument)
ggplot(data = aggMTA) +
  geom_bar(mapping = aes(x = Year, y = Avg_Ridership, fill = Borough), stat = 'identity') +
  facet_grid(~ Borough) +
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
  labs(title = 'Average Riderhip from 2013 - 2018 by Borough')

Conclusion

From our analysis, ridership by borough has not changed much from 2013 - 2018. Additionally, the data shows that Queens as far more riders than any other borough.

School Diversity

Zhouxin Shi provided this example where he suggests to filter out all schools with a population of less than 100, since any school with a small population may distort our next calculation. Then to calculate the racial average for each school in each state and put into a separate table

Loading

url <- 'https://github.com/dcorrea614/MSDS/raw/master/school_diversity.csv'

dfSchool <- read.csv(file = url)
str(dfSchool)
## 'data.frame':    27944 obs. of  16 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LEAID       : int  100002 100005 100005 100006 100006 100007 100007 100008 100011 100012 ...
##  $ LEA_NAME    : chr  "alabama youth services" "albertville city" "albertville city" "marshall county" ...
##  $ ST          : chr  "AL" "AL" "AL" "AL" ...
##  $ d_Locale_Txt: chr  NA "town-distant" "town-distant" "rural-distant" ...
##  $ SCHOOL_YEAR : chr  "1994-1995" "1994-1995" "2016-2017" "1994-1995" ...
##  $ AIAN        : num  0 0 0.294 0.104 0.492 ...
##  $ Asian       : num  0.589 0.321 0.551 0.134 0.299 ...
##  $ Black       : num  71.709 1.283 3.194 0.373 1.073 ...
##  $ Hispanic    : num  0.196 4.522 46.741 0.909 21.294 ...
##  $ White       : num  27.5 93.9 46.8 98.5 75.8 ...
##  $ Multi       : num  NA NA 2.44 NA 1.04 ...
##  $ Total       : int  509 3118 5447 6707 5687 7671 13938 10440 1973 2389 ...
##  $ diverse     : chr  "Diverse" "Extremely undiverse" "Diverse" "Extremely undiverse" ...
##  $ variance    : num  NA NA 0.0116 NA NA ...
##  $ int_group   : chr  NA NA "Highly integrated" NA ...

Cleaning/Transforming Data

Here we see that the numeric data is currently stored as numeric. To complete the analysis, we need to filter out some of the columns and rows.

Additionally, we are assuming that any NA value in the Racial Percentage is 0.

# we are interested in getting the counts of the students' race
dfSchool2 <- dfSchool %>%
  mutate_all(~replace(., is.na(.), 0)) %>%
  filter(Total > 100) %>%
  mutate(Asian_Pop = Asian / 100 * Total,
         Black_Pop = Black / 100 * Total,
         Hispanic_Pop = Hispanic / 100 * Total,
         White_Pop = White / 100 * Total,
         Multi_Pop = Multi / 100 * Total,
         ) %>%
  group_by(LEA_NAME, ST) %>%
  summarize(Asian = mean(Asian_Pop),
            Black = mean(Black_Pop),
            Hispanic = mean(Hispanic_Pop),
            White = mean(White_Pop),
            Multi = mean(Multi_Pop)
            )
## `summarise()` regrouping output by 'LEA_NAME' (override with `.groups` argument)
head(dfSchool2)
## # A tibble: 6 x 7
## # Groups:   LEA_NAME [6]
##   LEA_NAME                        ST    Asian   Black Hispanic White Multi
##   <chr>                           <chr> <dbl>   <dbl>    <dbl> <dbl> <dbl>
## 1 a-c central cusd 262            IL     0       2.5      6.    489    4. 
## 2 a e r o  spec educ coop         IL     0       8.00     8.00  120.   0  
## 3 a.c.g.c. public school district MN     4.00    1.00    66.    721.  12.0
## 4 abbeville 60                    SC     7    1355.      25.5  1974.  32.5
## 5 abbotsford school district      WI     2       7.00   181.    504.   2.5
## 6 abbott isd                      TX     0       4.00    25.    248.   3.5

Analysis

The above data frame contains the requested analysis which was to “calculate the racial average for each school in each state and put into a separate table”.

Since the spread is very wide I believe it is best to compare the Race ratioin schools.

dfSchool4 <- dfSchool %>%
  mutate_all(~replace(., is.na(.), 0)) %>%
  filter(Total > 100) %>%
  mutate(Asian = Asian / 100,
         Black = Black / 100,
         Hispanic = Hispanic / 100,
         White = White / 100,
         Multi = Multi / 100,
         ) %>%
  group_by(ST) %>%
  summarize(Asian = mean(Asian),
            Black = mean(Black),
            Hispanic = mean(Hispanic),
            White = mean(White),
            Multi = mean(Multi)) %>%
  pivot_longer(!ST, names_to = 'Ethnicity', values_to = 'Ratio')
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data = dfSchool4, aes(x = Ethnicity, y = Ratio, fill = Ethnicity)) +
  geom_boxplot() +
  labs(title = 'Ethnicity Ratio Distribution in Schools')

Conclusion

From our analysis, we see that the overall the spread of White student ratio is the highest by a wide margin, where the other race ratios are more comparable.

Streaming Movies

Orli Khaimova provided this example where she suggests to analyze which streaming platform has better movies. The CSV was found on Kaggle.

Loading

url <- 'https://github.com/dcorrea614/MSDS/raw/master/MoviesOnStreamingPlatforms_updated.csv'

dfMovies <- read.csv(file = url)
str(dfMovies)
## 'data.frame':    16744 obs. of  17 variables:
##  $ X              : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ ID             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title          : chr  "Inception" "The Matrix" "Avengers: Infinity War" "Back to the Future" ...
##  $ Year           : int  2010 1999 2018 1985 1966 2018 2002 2012 1981 2009 ...
##  $ Age            : chr  "13+" "18+" "13+" "7+" ...
##  $ IMDb           : num  8.8 8.7 8.5 8.5 8.8 8.4 8.5 8.4 8.4 8.3 ...
##  $ Rotten.Tomatoes: chr  "87%" "87%" "84%" "96%" ...
##  $ Netflix        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Hulu           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Prime.Video    : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ Disney.        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Type           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Directors      : chr  "Christopher Nolan" "Lana Wachowski,Lilly Wachowski" "Anthony Russo,Joe Russo" "Robert Zemeckis" ...
##  $ Genres         : chr  "Action,Adventure,Sci-Fi,Thriller" "Action,Sci-Fi" "Action,Adventure,Sci-Fi" "Adventure,Comedy,Sci-Fi" ...
##  $ Country        : chr  "United States,United Kingdom" "United States" "United States" "United States" ...
##  $ Language       : chr  "English,Japanese,French" "English" "English" "English" ...
##  $ Runtime        : int  148 136 149 116 161 117 150 165 115 153 ...

Cleaning/Transforming

By looking at the data we see that the first column is not needed, the Rotten Tomatoes rating is saved as a character and has a % symbol, and there are “.” in some of the column names.

# Cleaning
dfMovies <- dfMovies[,-1] %>%
  rename(Rotten_Tomatoes = Rotten.Tomatoes, Prime_Video = Prime.Video, Disney = Disney.) %>%
  mutate(Rotten_Tomatoes = as.integer(str_remove(Rotten_Tomatoes, '%')))

# Transforming - we need to identify the platforms where the movies can be streamed.
# I created a subset for each platform and then combined them after
dfNetflix <- dfMovies %>%
  filter(Netflix == 1) %>%
  select(Title, IMDb, Rotten_Tomatoes)
dfNetflix['Platform'] <- 'Netflix'

dfHulu <- dfMovies %>%
  filter(Hulu == 1) %>%
  select(Title, IMDb, Rotten_Tomatoes)
dfHulu['Platform'] <- 'Hulu'

dfPrime_Video <- dfMovies %>%
  filter(Prime_Video == 1) %>%
  select(Title, IMDb, Rotten_Tomatoes)
dfPrime_Video['Platform'] <- 'Prime_Video'

dfDisney <- dfMovies %>%
  filter(Disney == 1) %>%
  select(Title, IMDb, Rotten_Tomatoes)
dfDisney['Platform'] <- 'Disney'

dfMovies2 <- rbind(dfNetflix, dfHulu, dfPrime_Video, dfDisney)

Analysis

The data is now cleaned and can be compared.

ggplot(data = dfMovies2, aes(x = Platform, y = IMDb, fill = Platform)) +
  geom_boxplot() + 
  labs(title = 'IMDb Rating by Platform') 
## Warning: Removed 576 rows containing non-finite values (stat_boxplot).

ggplot(data = dfMovies2, aes(x = Platform, y = Rotten_Tomatoes, fill = Platform)) +
  geom_boxplot() + 
  labs(title = 'Rotten Tomoatoes Rating by Platform') 
## Warning: Removed 11895 rows containing non-finite values (stat_boxplot).

# There are 11895 Rotten Tomatoes Values that are NA

Conclusion

By comparing the platforms by Rotten Tomatoes rating, we see that 11895 out of 17381 rows have NA values, where IMDb is missing only 576 rows. I recommend to only compare the platforms based on the IMDb rating. The graph suggest that the Disney platform has the highest IMDb rated movies and Prime Video has the lowest.

References

“Annual Subway Ridership”, http://web.mta.info/nyct/facts/ridership/ridership_sub_annual.htm

https://raw.githubusercontent.com/szx868/data607/master/school_diversity.csv (refernce not provided)

Ruchi Bhatia.“Movies on Netflix, Prime Video, Hulu and Disney+”, https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney