Statement of Purpose (Narrative Summary)

Data source:

https://www.kaggle.com/datasets/shivamb/netflix-shows

The original goal of this project was to analyze genre trends across different countries of production, defining a two-mode network on which to focus our analysis. It was our hope that by so doing, we could identify genres that showed strong cross-country popularity, and rank genres according to their respective universality. However, given how our raw data was organized, we quickly identified numerous difficulties with this approach. Chief among these was the sheer number of discrete categories for both countries of production and genres: we noted almost 300 separate genres for movies, and several hundred additional genres for television shows. While we were able to identify approximately 12 individual movie categories that should have been easily navigable, the ways in which these categories can be combined quickly devolved into a combinatorial problem (12 genres allow up to 2^12 - 1 = 4,095 distinct non-empty combinations).

Initial efforts were given to parsing these hybrid categories into multiple entries (e.g., splitting a movie categorized into three genres into three separate entries for analysis). Immediate problems with skewing our data became apparent once we realized that our country categorization fell into a similar realm, with many entries having upwards of five production countries. In the worst-case scenarios, this created the potential for a single film to be split into upwards of 20 distinct country/genre pairings; given that the overwhelming majority of films we were looking at presented these complexities, it became apparent that our data would become dominated by this phenomenon to such a degree as to be virtually useless.
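To make the multiplication concrete, the sketch below takes a single made-up title (not a row from the dataset) with three genres and five production countries, and expands it into every country/genre pairing using base R only. The strings and object names are illustrative assumptions.

```r
# Hypothetical single title: the strings mimic the comma-separated format
# of the "listed_in" and "country" columns, but are invented for illustration.
genreString   <- "Dramas, International Movies, Thrillers"
countryString <- "France, Belgium, Germany, Spain, Italy"

# Split each comma-separated string into its individual labels
genres    <- trimws(strsplit(genreString, ",")[[1]])
countries <- trimws(strsplit(countryString, ",")[[1]])

# Every country/genre combination becomes its own row
pairings <- expand.grid(Country = countries, Genre = genres,
                        stringsAsFactors = FALSE)

nrow(pairings)   # 5 countries x 3 genres = 15 rows from one film
```

A title counted 15 times carries 15 times the weight of a single-genre, single-country title, which is the skew described above.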

We then weighed out multiple alternate paths of analysis that could sidestep this problem of fundamentally altering our ratios, and allow a cleaner analysis. Unfortunately, director and cast pairings led to similar conclusions, and few other categories appeared to have significantly interesting information for analysis. After evaluating a series of dead ends, we settled on a compromise solution in the 11th-hour of this project—having run ourselves into a veritable corner without time to start over from scratch with an entirely new set of data, yet still preserving the ability to recycle some of our previous analysis: a two-mode network analysis which seeks to examine a collated list of genres against the complete list of ratings.

It has thus become our goal to identify which genres and rating categories have the greatest representation amongst Netflix offerings, and ideally to highlight which pairings show the greatest prominence therein. Secondarily, this project has evolved into a journey of discovery, deepening our understanding not only of R, but of the Tidyverse (and particularly dplyr), the professionalism offered by the “distill” package, and far too many encounters with mismatched data types across packages that were never intended to translate smoothly between each other without fundamental mastery of their inherent intricacies. On this last point, the sheer scale of the project before us was laid bare, and our relative neophyte status in this world put on full display.

We would like to claim this analysis is as interesting as our initial ambitions—it isn’t; we would like to claim that what follows is as elegant as the naïve dreams we held when we began this project—it’s not. Instead, the following markdown has landed in something of a “Hail Mary” of desperation—raging against the looming deadline clock to cobble together a narrative of tribulations, shattered hopes, and a labyrinthine maze of dead-ends.

Initialization

In this section, we will go over how our header was created with the “distill” package, and showcase succinct ways to ensure packages are both installed and initialized properly for the main code to function.

Further documentation on the distill package can be found on the link below:

https://rstudio.github.io/distill/website.html

This code chunk is a direct copy of how our header has been structured, and is included for anyone who wishes to replicate the template that we have utilized to upload our markdown file to RPubs. In order to utilize this header formatting, the “distill” and “markdown” packages need to be installed, and are explained more under the “Package Installation” section of the markdown.

NOTE: If you want to use this template, remember to remove the comment hashtags!

Further documentation on the markdown package can be found on the link below:

https://rmarkdown.rstudio.com/

#title: "Netflix Genre-Rating Analysis"
#description: | 
#  An exploration of how genre-rating pairings intersect within Netflix programming, a subsequent discussion on the merits of these findings, and a journey of skill-development within RStudio.
#author:
#  - first_name: "Nick"
#    last_name: "D'Ambrosio"
#    affiliation: University of Washington Bothell - School of IAS
#    affiliation_url: https://www.uwb.edu/ias
#  
#  - first_name: "Rico"
#    last_name: "Espinosa"
#    affiliation: University of Washington Bothell - School of IAS
#    affiliation_url: https://www.uwb.edu/ias
#
#  - first_name: "Jaime"
#    last_name: "Yazzie"
#    affiliation: University of Washington Bothell - School of IAS
#    affiliation_url: https://www.uwb.edu/ias
#date: "June 3, 2023"
#output:
#  distill::distill_article:
#    toc: true
#    toc_depth: 3        # Note: "distill" automatically defaults to a header depth of 3
#    #toc_float: false   # Note: "distill" places the table-of-contents to the left by default; change to "true" to move it to the top
#

Package Installation

Make sure to run this code chunk first if you want to create the same markdown file as we have utilized here. The “distill” package was used to format this file for upload to RPubs, and the “markdown” package to create the markdown file itself; both need to be installed prior to creating the header shown above.

Only two packages are needed for this code to run: “igraph”, and “dplyr”. Additionally, “distill” and “markdown” have been included both for consistent internal use across the project team, and for anyone who wishes to follow in our footsteps.

By organizing our packages into a combined list, we are able to easily ensure that each package is checked for installation on the local machine, and then installed if necessary. Additional packages can also be easily added to this list for installation and initialization simply by amending the list itself, with no further modification to the code necessary.

The for-loop will cycle through each element of the “requiredPackages” list (i.e., each element = an individual package), check if they are present, and install them if missing from the local machine.

Further documentation on the dplyr package can be found at the link below:

https://dplyr.tidyverse.org/

requiredPackages = c("igraph", "dplyr", "distill", "markdown", "tidyverse") # Combines packages into list

for (package in requiredPackages) { # Installs packages if not yet installed
  if (!requireNamespace(package, quietly = TRUE))
    install.packages(package)
}

Package Initialization

The following for-loop will load each package into R on the local machine, saving the user from having to individually initialize each package. The suppressPackageStartupMessages() function allows this to happen quietly, without flooding the console with initialization messages. Note that it only hides startup messages: errors and warnings raised while a package loads are still displayed, so a failed initialization should remain visible.

for (package in requiredPackages) {
  suppressPackageStartupMessages(library(package, character.only = TRUE))
}
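A quick way to check what suppressPackageStartupMessages() does and does not hide is to wrap a stand-in function that emits both a startup message and a warning. The function noisyLoad() below is a hypothetical substitute for a package load, written only for this illustration.

```r
# Hypothetical stand-in for a package load that emits both kinds of output
noisyLoad <- function() {
  packageStartupMessage("Welcome banner")   # the kind of message library() prints
  warning("something looks wrong")          # a genuine warning
  "loaded"
}

# Record every message or warning that escapes the suppression wrapper
seen <- character(0)
result <- withCallingHandlers(
  suppressPackageStartupMessages(noisyLoad()),
  warning = function(w) {
    seen <<- c(seen, paste("warning:", conditionMessage(w)))
    invokeRestart("muffleWarning")
  },
  message = function(m) {
    seen <<- c(seen, paste("message:", conditionMessage(m)))
    invokeRestart("muffleMessage")
  }
)

seen     # only the warning was recorded; the banner never reached the handlers
result   # "loaded"
```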

Variant Code

Included are some examples of alternate code that can be utilized for installation and/or initialization. All of these sections are commented out, and can be made viable simply by removing the comment hashtag.

Alternate Package Initialization

Three additional variations of code have been included below that achieve the same function of initialization, and represent alternative pathways a user may take to reach a similar destination. The first two alternatives, lapply() and sapply(), are not entirely silent in their initialization: each returns (and prints) the attached packages after every library() call. The suppressPackageStartupMessages() variant is entirely silent, however, and is merely a manual brute-force alternative to the above for-loop.

  #lapply(requiredPackages, library, character.only = TRUE, quietly = TRUE)


  #sapply(requiredPackages, library, quietly = TRUE, character.only = TRUE)


  #suppressPackageStartupMessages(library(igraph))  # Loads igraph, quietly
  #suppressPackageStartupMessages(library(dplyr))   # Loads dplyr, quietly

Complete Startup Code (Simplified Variant)

A simplified form of the start-up code has been presented here, which simultaneously loads and installs required packages as needed. It is included to provide a relatively universal four-line for-loop for this process—ensuring the user only need amend their package list. It remains commented out to make clear the distinction between installation and initialization in this project, however.

Alternately, this for-loop could be easily crafted into a defined function—e.g., swiftStart(requiredPackages)—and turned into a minimal package in its own right; this exercise has not been undertaken here, though remains an option for the reader.

#requiredPackages = c("igraph", "dplyr", "distill", "markdown") # Combines packages into list

#for (package in requiredPackages) { # Installs packages if not yet installed, then loads them
#  if (!requireNamespace(package, quietly = TRUE))
#    install.packages(package)
#  suppressPackageStartupMessages(library(package, character.only = TRUE))
#}
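As a rough sketch of the swiftStart() idea mentioned above, the loop can be wrapped in a function. Only the name comes from the text; the body, the install argument, and the return value are assumptions made for illustration, not a finished package.

```r
# Hypothetical helper: one possible way to write swiftStart()
swiftStart <- function(packages, install = TRUE) {
  for (package in packages) {
    # Install only when the package is missing and installation is allowed
    if (!requireNamespace(package, quietly = TRUE)) {
      if (!install) stop("Package not installed: ", package)
      install.packages(package)
    }
    # Attach the package without printing its startup messages
    suppressPackageStartupMessages(library(package, character.only = TRUE))
  }
  # Return, invisibly, whether each package is now attached
  invisible(paste0("package:", packages) %in% search())
}

# Usage with packages that ship with R, so nothing is downloaded here
loaded <- swiftStart(c("stats", "utils"), install = FALSE)
loaded
```

Setting install = FALSE turns a missing package into an explicit error instead of a download, which may be preferable on shared or offline machines.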

Setup Code

Data Loading

Here we will set the working directory where our data is stored, and load our “netflix_titles.csv” file. For the sake of completeness, we check the current working directory, set it to our desired location, and re-check it to ensure we are looking at the folder where our data is located.

Be sure to change the setwd() function to your own preferred folder location.

Again, the raw data can be downloaded from:

https://www.kaggle.com/datasets/shivamb/netflix-shows

getwd()   # Confirm current working drive location

[1] "C:/Users/cnand/Desktop/Schoolwork/Fall 2022 - Summer 2023/Spring 2023/Data Visualization/Final Project"

setwd('C:/Users/cnand/Desktop/Schoolwork/Fall 2022 - Summer 2023/Spring 2023/Data Visualization/Final Project')   # Set new working drive location
getwd()   # Confirm new working drive location

[1] "C:/Users/cnand/Desktop/Schoolwork/Fall 2022 - Summer 2023/Spring 2023/Data Visualization/Final Project"

netflixRaw = read.csv("netflix_titles.csv", header = TRUE, row.names=1)
  #netflixGraph = graph_from_data_frame(netflixRaw, directed = FALSE)
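An alternative to changing the working directory is to build the full file path and confirm the file exists before reading it. The sketch below is a minimal illustration under stated assumptions: a temporary folder stands in for the project folder, and a two-row invented file stands in for the Kaggle download.

```r
# Hypothetical folder: a temporary directory stands in for the project folder
dataDir <- tempdir()
csvPath <- file.path(dataDir, "netflix_titles.csv")

# Write a two-row stand-in file so the example runs without the real data;
# the values are invented and only mimic a few of the dataset's columns
toy <- data.frame(
  show_id   = c("s1", "s2"),
  type      = c("Movie", "TV Show"),
  rating    = c("PG-13", "TV-MA"),
  listed_in = c("Documentaries", "Crime TV Shows, TV Dramas"),
  stringsAsFactors = FALSE
)
write.csv(toy, csvPath, row.names = FALSE)

# Stop early with a clear message if the file is not where we expect it
if (!file.exists(csvPath)) stop("File not found: ", csvPath)

# row.names = 1 turns the first column (show_id) into row names
netflixToy <- read.csv(csvPath, header = TRUE, row.names = 1)
dim(netflixToy)        # 2 rows, 3 remaining columns
rownames(netflixToy)   # "s1" "s2"
```

Because row.names = 1 consumes the first column, show_id is no longer available as a regular column after loading.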

Data Manipulation

In this section, we will convert our data to a tibble, rename headers for clarity, and utilize the dplyr package to remove unused rows and columns from our new tibble. In so doing, we will create a second data object that will be the subject of our manipulations, leaving our raw data intact in case we wish to restore columns and/or rows which have been removed. While it is not necessary to create two data objects for our purposes here (and doing so does represent a minor additional load in memory), we have elected to do so for both the sake of simplicity, and ease of manipulating increasingly filtered data.

Type Conversion

This line will convert our data.frame to the “tibble” type; simply, think of tibbles as a more modern take on data frames that provide improvements towards formatting and error checking over their predecessors. While this conversion is likely unnecessary to achieve our goals, dplyr encourages the use of tibbles; additionally, the dplyr package is one component of the larger “tidyverse” ecosystem—a collection of packages developed to extend the capabilities of R and share similar underlying structure; tidyverse itself also encourages the usage of tibbles, rather than the standard data.frame included in the base version of R.

Further information on the tidyverse ecosystem can be found on the link below:

https://www.tidyverse.org/

netflixRaw = as_tibble(netflixRaw)
netflixClean = as_tibble(netflixRaw) # Converts raw data to tibble format
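As a small illustration of how a tibble behaves differently from a base data.frame, the sketch below converts an invented two-row data frame and compares single-column subsetting. The object names and values are assumptions for demonstration only.

```r
library(dplyr)   # dplyr re-exports as_tibble() from the tibble package

# Invented two-column data frame for illustration
df <- data.frame(Rating = c("PG", "R"), Genre = c("Comedies", "Dramas"))
tb <- as_tibble(df)

class(df)   # "data.frame"
class(tb)   # "tbl_df" "tbl" "data.frame"

# Selecting one column with [ , ] simplifies a data.frame to a vector,
# while a tibble stays a tibble
is.data.frame(df[, "Rating"])   # FALSE
is.data.frame(tb[, "Rating"])   # TRUE
```

Differences of this kind are one reason an object can behave unexpectedly when handed to a function written with base data frames or matrices in mind.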

Renaming Headers

While not strictly necessary, this allows us to relabel our headers for easier visualization within our R output. Several headers have been commented out for the sake of efficiency—as we are not interested in including these columns for our purposes here, we may simply remove the columns as seen in the “Filtering Data” section; to be fair, modern computers should show imperceptible differences in processing time in renaming these additional columns that are slated to be removed in this program—but we feel it is good practice to develop efficient code wherever possible, rather than rely upon hardware to buoy poor software practices.

For our purposes, we have elected to only rename columns related to Genre, and television/film rating (e.g., PG-13, TV-MA); code to rename the additional columns has been included and commented out to allow manipulation in these areas if desired.

netflixClean = netflixClean %>% 
  rename(
    #Type = type,
    #Title = title,
    #Director = director,
    #Cast = cast,
    #Country = country,
    #Date_Added = date_added,
    #Release_Year = release_year,
    Rating = rating,
    #Duration = duration,
    Genre = listed_in,
    #Description = description
  )

Filtering Data

We have utilized two functions in the following code chunk: filter() and mutate(), both from the dplyr package. In the case of the former, filter() allows us to select only the rows that match the conditions in which we are interested; for our purposes, we have elected to narrow our focus to 12 specific genres from our raw data set.

There are several reasons for this decision. First, the raw data includes over 300 unique character strings serving as “Genre” identifiers. This is not an insurmountable obstacle in and of itself, as we may simply break apart combined genres into multiple entries (i.e., a film designated “Romance, Comedy” can be broken down into separate entries for “Romance” and “Comedy”). It does, however, bring us to the second part of the problem: how does our analysis skew when the overwhelming majority of films and television shows on Netflix are represented under multiple genres? Thus, for the sake of simplicity, we have elected to focus only on films which are categorized within a single discrete genre that has no overlap with other genres; while this does truncate our total available data significantly, it also avoids misrepresenting single films under multiple entries.

Lastly, commented sections within the filter() function have been left in to represent pathways to only select Movies (while filtering out TV shows), and two separate pathways to remove rows that have NULL data fields under their respective columns.

The second function included under the below code chunk is that of mutate(). For our purposes, we have simply elected to utilize the mutate() function to eliminate columns that we are not interested in at this time—narrowing our focus solely to the Genre and Rating columns.

The end result of this process is that we end up with a truncated set of rows with only our two columns of interest.

For any user wishing further depth on these functions, both filter() and mutate() have significant scope beyond what we have included below, and we encourage further exploration on the topic with the following documentation:

https://dplyr.tidyverse.org/reference/filter.html

https://dplyr.tidyverse.org/reference/mutate.html

netflixClean = netflixClean %>% 
  filter(
    #Type == "Movie"    # Only looks at movies, and removes tv shows from results
    #Country != "",     # Removes rows which do not have country data
    Genre %in% c(
      "Action & Adventure", "Children & Family Movies", "Comedies",
      "Documentaries", "Dramas", "Horror Movies", "Music & Musicals",
      "Romantic Movies", "Sci-Fi & Fantasy", "Sports Movies",
      "Stand-Up Comedy", "Thrillers"
    ),   # Keeps only rows tagged with exactly one of these 12 genres
    Rating != ""   # Removes rows which do not have rating data
    #director = NULL,   # Removes rows without director data
    #cast = NULL,       # Removes rows without cast data
    #date_added = NULL,
    #release_year = NULL,
    #rating = NULL,
    #duration = NULL,
    #description = NULL
  )

netflixClean = netflixClean %>% 
  mutate(
    show_id = NULL,
    type = NULL,   # Will remove "Type" (e.g., "Movie", "TV Show") column
    title = NULL,
    director = NULL,
    cast = NULL,
    country = NULL,
    date_added = NULL,
    release_year = NULL,
    #Rating = NULL,
    duration = NULL,
    #Genre = NULL,
    description = NULL
  )
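The same filter-then-trim pattern can also be written with select(), which keeps named columns rather than setting the unwanted ones to NULL. The sketch below is a minimal illustration on invented rows; the shortened genre list and the object names are assumptions, not the project's data.

```r
library(dplyr)

# Invented rows using the renamed Rating/Genre columns plus one extra column
toy <- tibble(
  title  = c("A", "B", "C", "D"),
  Rating = c("R", "", "PG", "TV-MA"),
  Genre  = c("Dramas", "Comedies", "Dramas, Thrillers", "Stand-Up Comedy")
)

keepGenres <- c("Comedies", "Dramas", "Stand-Up Comedy")   # shortened list

toyClean <- toy %>%
  filter(Genre %in% keepGenres, Rating != "") %>%   # single-genre rows with a rating
  select(Rating, Genre)                             # keep only the two columns of interest

toyClean   # rows "A" and "D" remain
```

Row "B" is dropped for its empty rating and row "C" for its combined genre string, mirroring the two conditions used above.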

We can see the results of the above filtering when we call our ‘netflixClean’ tibble object, showcasing Rating/Genre pairings for our selected Genres:

netflixClean

# A tibble: 1,418 × 2
   Rating Genre                   
   <chr>  <chr>                   
 1 PG-13  Documentaries           
 2 PG     Children & Family Movies
 3 TV-14  Thrillers               
 4 TV-Y   Children & Family Movies
 5 PG-13  Comedies                
 6 PG-13  Thrillers               
 7 PG     Documentaries           
 8 R      Action & Adventure      
 9 TV-PG  Children & Family Movies
10 TV-Y   Children & Family Movies
# ℹ 1,408 more rows

Counting

Here we call the count() function to better understand how many occurrences of each genre (n = 12), rating (n = 11), and genre-rating pair (n = 57) are present in our collated data. Additionally, we have converted our total pairings into a matrix in anticipation of utilizing it to offer clean visualization within our plots in the following section.

Secondarily to its primary purpose: this provides an excellent anchor point of reference—a convenient location to double-check data against to ensure that it matches expectations; throughout the process of visualization, this section proved instrumental.

count(netflixClean, vars = Genre)

# A tibble: 12 × 2
   vars                         n
   <chr>                    <int>
 1 Action & Adventure         128
 2 Children & Family Movies   215
 3 Comedies                   110
 4 Documentaries              359
 5 Dramas                     137
 6 Horror Movies               55
 7 Music & Musicals            10
 8 Romantic Movies              3
 9 Sci-Fi & Fantasy             1
10 Sports Movies                1
11 Stand-Up Comedy            334
12 Thrillers                   65

count(netflixClean, vars = Rating)

# A tibble: 11 × 2
   vars      n
   <chr> <int>
 1 G        16
 2 NR       22
 3 PG       45
 4 PG-13   117
 5 R       240
 6 TV-14   182
 7 TV-G     33
 8 TV-MA   505
 9 TV-PG   103
10 TV-Y     85
11 TV-Y7    70

count(netflixClean, vars = Genre, wt_var = Rating)

# A tibble: 57 × 3
   vars                     wt_var     n
   <chr>                    <chr>  <int>
 1 Action & Adventure       PG-13     23
 2 Action & Adventure       R         86
 3 Action & Adventure       TV-14      2
 4 Action & Adventure       TV-MA     17
 5 Children & Family Movies G         15
 6 Children & Family Movies PG        25
 7 Children & Family Movies TV-G      11
 8 Children & Family Movies TV-PG      9
 9 Children & Family Movies TV-Y      85
10 Children & Family Movies TV-Y7     70
# ℹ 47 more rows

fullList = as.matrix(count(netflixClean, vars = Genre, wt_var = Rating))
fullList

      vars                       wt_var  n    
 [1,] "Action & Adventure"       "PG-13" " 23"
 [2,] "Action & Adventure"       "R"     " 86"
 [3,] "Action & Adventure"       "TV-14" "  2"
 [4,] "Action & Adventure"       "TV-MA" " 17"
 [5,] "Children & Family Movies" "G"     " 15"
 [6,] "Children & Family Movies" "PG"    " 25"
 [7,] "Children & Family Movies" "TV-G"  " 11"
 [8,] "Children & Family Movies" "TV-PG" "  9"
 [9,] "Children & Family Movies" "TV-Y"  " 85"
[10,] "Children & Family Movies" "TV-Y7" " 70"
[11,] "Comedies"                 "NR"    "  3"
[12,] "Comedies"                 "PG"    "  1"
[13,] "Comedies"                 "PG-13" " 29"
[14,] "Comedies"                 "R"     " 31"
[15,] "Comedies"                 "TV-14" " 17"
[16,] "Comedies"                 "TV-MA" " 29"
[17,] "Documentaries"            "G"     "  1"
[18,] "Documentaries"            "NR"    " 12"
[19,] "Documentaries"            "PG"    " 11"
[20,] "Documentaries"            "PG-13" " 14"
[21,] "Documentaries"            "R"     " 14"
[22,] "Documentaries"            "TV-14" " 98"
[23,] "Documentaries"            "TV-G"  " 18"
[24,] "Documentaries"            "TV-MA" "115"
[25,] "Documentaries"            "TV-PG" " 76"
[26,] "Dramas"                   "NR"    "  1"
[27,] "Dramas"                   "PG"    "  8"
[28,] "Dramas"                   "PG-13" " 38"
[29,] "Dramas"                   "R"     " 44"
[30,] "Dramas"                   "TV-14" "  9"
[31,] "Dramas"                   "TV-MA" " 29"
[32,] "Dramas"                   "TV-PG" "  8"
[33,] "Horror Movies"            "NR"    "  1"
[34,] "Horror Movies"            "PG-13" "  4"
[35,] "Horror Movies"            "R"     " 27"
[36,] "Horror Movies"            "TV-14" "  3"
[37,] "Horror Movies"            "TV-MA" " 20"
[38,] "Music & Musicals"         "R"     "  1"
[39,] "Music & Musicals"         "TV-14" "  3"
[40,] "Music & Musicals"         "TV-MA" "  3"
[41,] "Music & Musicals"         "TV-PG" "  3"
[42,] "Romantic Movies"          "TV-14" "  1"
[43,] "Romantic Movies"          "TV-G"  "  1"
[44,] "Romantic Movies"          "TV-MA" "  1"
[45,] "Sci-Fi & Fantasy"         "TV-14" "  1"
[46,] "Sports Movies"            "TV-PG" "  1"
[47,] "Stand-Up Comedy"          "NR"    "  5"
[48,] "Stand-Up Comedy"          "PG-13" "  1"
[49,] "Stand-Up Comedy"          "R"     "  7"
[50,] "Stand-Up Comedy"          "TV-14" " 28"
[51,] "Stand-Up Comedy"          "TV-G"  "  3"
[52,] "Stand-Up Comedy"          "TV-MA" "284"
[53,] "Stand-Up Comedy"          "TV-PG" "  6"
[54,] "Thrillers"                "PG-13" "  8"
[55,] "Thrillers"                "R"     " 30"
[56,] "Thrillers"                "TV-14" " 20"
[57,] "Thrillers"                "TV-MA" "  7"
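Note that the as.matrix() call above stores the counts as character strings (hence the quoted values in the printed output), because a matrix can hold only one data type. If numeric counts are wanted in matrix form, one option is a genre-by-rating cross-tabulation. The sketch below uses invented rows and base R only, and is an illustration rather than the approach taken in this project.

```r
# Invented genre/rating rows standing in for netflixClean
toy <- data.frame(
  Genre  = c("Dramas", "Dramas", "Dramas", "Comedies", "Comedies"),
  Rating = c("R", "R", "PG-13", "R", "TV-14")
)

# Cross-tabulate: rows are genres, columns are ratings, cells are counts
incidence <- unclass(table(toy$Genre, toy$Rating))

storage.mode(incidence)    # "integer"
incidence["Dramas", "R"]   # 2
```

A matrix of this shape is the usual incidence-matrix form of a two-mode network, with a zero wherever a genre-rating pair never occurs.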

dat = graph_from_data_frame(count(netflixClean, vars = Genre, wt_var = Rating))
dat

IGRAPH 0877cbf DN-- 23 57 -- 
+ attr: name (v/c), n (e/n)
+ edges from 0877cbf (vertex names):
 [1] Action & Adventure      ->PG-13 Action & Adventure      ->R    
 [3] Action & Adventure      ->TV-14 Action & Adventure      ->TV-MA
 [5] Children & Family Movies->G     Children & Family Movies->PG   
 [7] Children & Family Movies->TV-G  Children & Family Movies->TV-PG
 [9] Children & Family Movies->TV-Y  Children & Family Movies->TV-Y7
[11] Comedies                ->NR    Comedies                ->PG   
[13] Comedies                ->PG-13 Comedies                ->R    
[15] Comedies                ->TV-14 Comedies                ->TV-MA
+ ... omitted several edges
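graph_from_data_frame() treats the first two columns as edge endpoints and keeps any remaining columns as edge attributes, which is why n appears as an edge attribute above. To mark the graph explicitly as two-mode, a logical type attribute can be added to the vertices. The sketch below uses an invented three-row edge list; the object names are assumptions.

```r
library(igraph)

# Invented weighted edge list in the same shape as the count() output
edges <- data.frame(
  Genre  = c("Dramas", "Dramas", "Comedies"),
  Rating = c("R", "PG-13", "R"),
  n      = c(44, 38, 31)
)

# The first two columns become the edge endpoints; n becomes an edge attribute
g <- graph_from_data_frame(edges, directed = FALSE)

# Mark the two modes: FALSE for genres, TRUE for ratings
V(g)$type <- V(g)$name %in% edges$Rating

is_bipartite(g)   # TRUE once a "type" vertex attribute exists
vcount(g)         # 4 vertices: 2 genres + 2 ratings
E(g)$n            # 44 38 31, carried along from the data frame
```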

Remove duplicates

Here is the final step in the process of cleaning up the original data into a usable network: purging duplicate entries to avoid creating an unnecessarily large network of actors and events. Prior to this point, each genre-rating pair would have had two individual distinct nodes, creating a visual mess as a best-case scenario, and virtually useless clutter at worst (and in all likelihood, reality).

netflixSimple = distinct(netflixClean, Rating, Genre)
combo = netflixSimple

Here Be Dragons

In this section, we find the remains of commented-out code that has survived the many purges of failed attempts to derive aesthetically beautiful plots from the homely results we finally arrived at: the lingering echoes of our innocence, as it were. The true scale of the paths we stumbled and searched helplessly upon has been lost to the madness of late nights, and fervent prayers to forgotten gods.

Smaug

Unknown code chunk; a mishmash of desperate and vain stumbles to claw hope from despair. This appears to have been a last-ditch attempt to recycle previous projects (or third-party code?) to finally beat our tibble data back into the shape iGraph needed; results were approximately as successful as forcing a round peg the size of a redwood into a square hole the size of a shoebox—perhaps humorous to witness, but wildly unsuccessful.

#suppressPackageStartupMessages(library(asnipe)) # Loads asnipe, quietly
#suppressPackageStartupMessages(library(igraph)) # Loads igraph, quietly
#suppressPackageStartupMessages(library(dendextend)) # Loads dendextend, quietly

# Create a graph from Shizuka et al (2014) network of golden-crowned sparrows
#assoc = as.matrix(read.csv("https://dshizuka.github.io/networkanalysis/SampleData/Sample_association.csv", header = T, row.names=1))
#gbi = t(assoc) # Transpose the data 
#mat = get_network(t(assoc), association_index="SRI") # Adjacency matrix based on "simple ratio index" (measure for animal network analysis)
#gSparrow = graph_from_adjacency_matrix(mat, "undirected", weighted = T) # igraph object

#seedValue = 2022 # Can be any value desired; change here until arriving at visual representation desired

#set.seed(seedValue) # Important to establish conformity across our plots
#plot(gSparrow, edge.width=E(gSparrow)$weight, vertex.label = "")  # Note weighted edges

Ancalagon the Black

Further desperate code chunks, left in as an artifact to our failures; they represent more of the dead-end approaches taken to wrangle our tibble data into something iGraph recognizes.

#suppressPackageStartupMessages(library(asnipe)) # Loads asnipe, quietly
#suppressPackageStartupMessages(library(dendextend)) # Loads dendextend, quietly

#seedValue = 2023



#netflixGraph = as.matrix(netflixClean, header = T, row.names = 1)
#netflixGraph2 = t(netflixGraph)
#netflixMatrix = get_network(t(netflixGraph))
#netflixMatrix2 = graph_from_adjacency_matrix(netflixMatrix, "undirected", weighted = T)

#set.seed(seedValue)
#plot(netflixGraph, edge.width=E(netflixGraph)$weight, vertex.label = "")

Glaurung

What follows is what remains of many false starts to beat our tibble data types back into iGraph usable formats. None of these were successful, and these comments remained buried at the end of our markdown in shame; they are presented here in remembrance for the light they falsely promised us.

#netflixGraph = as.matrix(netflixClean, header = T, row.names = 1)
#netflixGraph2 = t(netflixGraph)
#netflixGraph
#netflixGraph2

Chrysophylax Dives

Here we find several of the failed attempts to sort our Ratings and Genres into separate color combinations. Alas, this task ultimately proved to be beyond us—a further shipwreck upon the seas of our inadequacy.

#dat <- data.frame(col1 = letters[1:20], 
#                  col2 = LETTERS[1:20])
#
#g <- graph.edgelist(as.matrix(dat),directed = T)
 
#assign color information
#try head(V(g)) to see the structure of nodes
#V(g)$color <- c("red", "blue" )
 
#plot(g, layout= layout_nicely(g))
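For reference, one way the two modes might be colored is to assign each vertex a color according to which column its name came from, rather than supplying a short vector of colors. This is a hedged sketch on an invented edge list, not a repair of the attempts above; the colors and object names are arbitrary assumptions.

```r
library(igraph)

# Invented genre/rating edge list for illustration
edges <- data.frame(
  Genre  = c("Dramas", "Dramas", "Comedies"),
  Rating = c("R", "PG-13", "R")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# One color per vertex, chosen by which column the vertex name came from
V(g)$color <- ifelse(V(g)$name %in% edges$Genre, "lightblue", "orange")

# Inspect the assignment before plotting
data.frame(name = V(g)$name, color = V(g)$color)

plot(g, layout = layout_in_circle(g))
```

This relies on genre and rating labels never sharing a name, which holds for the labels shown in this document.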

Plots (AKA: The Pot of Pyrite at the End of the Rainbow)

Rewriting to Excel

This was the semi-successful result of trying to wrangle our tibble data back into iGraph; while we were able to utilize it to feed back into iGraph successfully, its merits were lackluster visually compared to our goals. It remains commented out currently, as RStudio is error-prone at trying to re-create new data when doing so would require overwriting existing data—understandable as a precaution, doubly-so when said data is currently loaded in RStudio itself.

#write.csv(combo, "C:/Users/cnand/Desktop/Schoolwork/Fall 2022 - Summer 2023/Spring 2023/Data Visualization/Final Project/graph.csv", row.names = FALSE)

A Monument to Failure and Perseverance

Included here is an amalgamation of failed attempts to wrangle our data back into working order, and awkward stretches to bridge the gap between our previous successes and our hoped-for goals.

Several example plots have been left uncommented to highlight the evolution of our progress starting at the point of barest success. In and of themselves, they represent little that is interesting—but may represent a useful contextual reference for our later plots in terms of evolutionary progress.

#ggplot(netflixClean, aes(x = weight, y = hindfoot_length))

combo = combo %>%
  #join the data to itself
  left_join(combo, by = c('Genre', 'Rating')) #%>%
  #this is undirected so x %--% y is the same as y %--% x
  #filter(item.x < item.y) %>%
  #group_by(item.x, item.y) %>%
  #summarize(n = sum(n))

graph = graph_from_data_frame(combo, directed = FALSE)
#head(V(graph))

excelGraph = read.csv("graph.csv", header = TRUE, as.is = T)
#head(V(excelGraph))

#g2 <- graph.data.frame(topology, vertices=answers, directed=FALSE)
#g <- simplify(g2)
#V(g)$color <- ifelse(V(g)$Q1_I1 == 1, "lightblue", "orange")

#<- graph.data.frame(combo, vertices=excelGraph, directed=FALSE)

#V(g)$color <- ifelse(V(g)$Q1_I1 == 1, "lightblue", "orange")

#g <- simplify(g2)
#V(g)$color <- ifelse(V(g)$Q1_I1 == 1, "lightblue", "orange")


graphMatrixPre = as.matrix(graph)
#graphMatrixPre2 = as.matrix(excelGraph)
#head(V(graphMatrixPre))
graphMatrix = graph_from_adjacency_matrix(graphMatrixPre)
#graphMatrix2 = graph_from_adjacency_matrix(graphMatrixPre2)
head(V(graphMatrix))

+ 6/23 vertices, named, from 0922b4f:
[1] PG-13 PG    TV-14 TV-Y  R     TV-PG

#cols = 5:74
str(netflixClean)

tibble [1,418 × 2] (S3: tbl_df/tbl/data.frame)
 $ Rating: chr [1:1418] "PG-13" "PG" "TV-14" "TV-Y" ...
 $ Genre : chr [1:1418] "Documentaries" "Children & Family Movies" "Thrillers" "Children & Family Movies" ...

nodeWeight = count(netflixClean, vars = Genre)
str(nodeWeight)

tibble [12 × 2] (S3: tbl_df/tbl/data.frame)
 $ vars: chr [1:12] "Action & Adventure" "Children & Family Movies" "Comedies" "Documentaries" ...
 $ n   : int [1:12] 128 215 110 359 137 55 10 3 1 1 ...

edgeWeight = count(netflixClean, vars = Genre, wt_var = Rating)
str(edgeWeight)

tibble [57 × 3] (S3: tbl_df/tbl/data.frame)
 $ vars  : chr [1:57] "Action & Adventure" "Action & Adventure" "Action & Adventure" "Action & Adventure" ...
 $ wt_var: chr [1:57] "PG-13" "R" "TV-14" "TV-MA" ...
 $ n     : int [1:57] 23 86 2 17 15 25 11 9 85 70 ...

#nodeWeight = lapply(nodeWeight[cols], as.numeric)

#set.seed(seedValue)
par(mar=c(0,0,0,0))
plot(graph, vertex.size=nodeWeight$n / 15, edge.width = edgeWeight$n / 2)

plot(graph)

plot(graph, edge.width = edgeWeight$n / 10)

plot(graph, vertex.size=nodeWeight$n / 10, edge.width = 10)

circle = layout_in_circle(graph)
#colors = c("gray50", "gold")
#V(graphMatrix)$color = ifelse(excelGraph[V(graphMatrix), 2] == "1", "gray", "brown")
head(V(graphMatrix))

+ 6/23 vertices, named, from 0922b4f:
[1] PG-13 PG    TV-14 TV-Y  R     TV-PG

#V(graphMatrix)$color = ifelse(excelGraph[V(graphMatrix), 2] == "2", "gray", "brown")
#dat = data.frame(col = let)
#V(excelGraph)$color <- c("red", "blue" )

#plot(graphMatrix, directed = FALSE)
#plot(graph, layout = circle, vertex.color = col[as.numeric(V(g)$type) + 1]),
#edge.width = edgeWeight$n * 100,

#layout_fr = layout_with_fr(graphMatrix)

plot(graphMatrix, layout=layout_nicely)

The Glimmers of Promise

While these plots never materialized into the end-goal forms, we still feel they represent useful objects of analysis and visualization for the purposes we have undertaken. Included are a series of four circle plots, which we believe offer a better visual representation of the data observed.

In the first graph, we see raw unweighted connections between genres and ratings. It is provided as a reference point against which to compare the visual refinements we endeavored to create. Also left as a relic of our process is the mislabeling of color pairings, which highlights the emergence of mistranslations between our data and our plots; clearly, there has been a divergence between intent and reality, which we have been unsuccessful in navigating.

In the second graph, we see our edges weighted by the prevalence of connections between our genre-rating pairs (i.e., actor-event pairs). The most striking observation is the connection strength between the “Children & Family Movies” genre and the “G” rating; given the traditional preference for children’s movies to be broadly accessible to families of all ages, this follows intuitively, and we see similarly strong connections between the genre and the ratings “PG” and “TV-Y”. However, given the problems noted in plots #3 and #4, it would be unwise to assume that this data is fully accurate; the mistranslations between data and plot may well extend into this plot as well.

In the third graph, we see the relative prominence of each genre and rating to one another. Unfortunately, due to data mistranslations yet undiscovered, this is the point where our analysis begins to break down into uselessness—we know from our previous analysis (as observed in the Count section) that TV-Y and unrated programming do not represent the dominant ratings for Netflix programming; similarly, we know that neither Action & Adventure nor Sci-Fi & Fantasy represent the preeminent genres of our collated data. Additionally, we notice that our edges have disappeared from all nodes not directly tied to the “Documentaries” genre for reasons unbeknownst to us; our working assumption remains a data mistranslation, but more experienced eyes (or significantly greater time) would be needed to disentangle this particular mystery fully.

The fourth graph is something of a hopeful proof-of-intent for what we set out to achieve in following this path: a relatively clean and readable plot, with weighted edges and nodes clearly showcasing prominence at a glance. Unfortunately, it shares the characteristic failures observed in the third graph, namely removing all edges not tied to “Documentaries” and incorrectly weighting the nodes. Despite this, we feel it represents an imperfect model of where our end destination was headed. Had we resolved these issues in a timely manner, it is likely we would have utilized the working template to construct a series of plots offering a deeper look at the evolution of genre-rating pairings across time.
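One possible source of the mismatches described above, which we cannot confirm from the code shown, is positional matching: plot() pairs vertex.size and edge.width with vertices and edges strictly by position, so a weight vector sorted differently from V(graph) or E(graph) would silently land on the wrong elements. A minimal sketch of one way around this follows. It uses hypothetical toy counts, and the names pairs, g, and nodeTotals are illustrative rather than objects from our own workflow. The idea is to store the weights on the graph itself so they cannot drift out of order.

```r
library(igraph)

# Toy genre-rating counts (hypothetical values, not taken from our data)
pairs <- data.frame(
  Genre  = c("Documentaries", "Documentaries", "Stand-Up Comedy"),
  Rating = c("TV-MA", "TV-14", "TV-MA"),
  n      = c(5, 3, 4)
)

# Extra columns of the edge list become edge attributes, so E(g)$n
# stays attached to the edge it describes.
g <- graph_from_data_frame(pairs, directed = FALSE)

# Node totals computed from the graph itself are named by vertex,
# so they follow the order of V(g).
nodeTotals <- strength(g, weights = E(g)$n)

# Mark the two modes by name rather than by position.
V(g)$type  <- V(g)$name %in% pairs$Rating
V(g)$color <- ifelse(V(g)$type, "gold", "gray50")

# Only draw when run interactively
if (interactive()) {
  plot(g, layout = layout_in_circle,
       vertex.size = 10 + 2 * nodeTotals,
       edge.width  = E(g)$n)
}
```

Because every weight here is derived from, or stored on, the same graph object, reordering the input table would not change which node or edge a weight is drawn on.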

seedValue = 2023 # Utilize variable for seed; easier to change in one location, than across all

par(mar=c(0,0,0,0)) # Increase usable area of plot; visually easier to identify connections

head(V(graphMatrix))

+ 6/23 vertices, named, from 0922b4f:
[1] PG-13 PG    TV-14 TV-Y  R     TV-PG

colors = ifelse(combo$Rating == "PG", "darkgreen",
         ifelse(combo$Rating == "PG-13", "orange",
         ifelse(combo$Rating == "R", "red",
         "black"))) # Fallback color for every other rating

fullListMatrix = as.matrix(fullList)

set.seed(seedValue)
plot(graphMatrix, layout=layout_in_circle, vertex.color = colors) # Raw connections, unweighted
#title(main = "Tears", line = -2, y = 10)
#title(sub = "Despair", line = -5)
#legend(x = -0.55, y = -1.2, c("Fourth Grade","Fifth Grade", "Principal"), pch=21, pt.bg = legendcolors, pt.cex=2.5, cex=1.2, bty="n", ncol=1)

legend(x = -1.95, y = 1.15, c("Raw connections"))

set.seed(seedValue)
plot(graphMatrix, layout=layout_in_circle, edge.width = edgeWeight$n / 12) # Connections weighted by number of pairings between each Rating/Genre pair
legend(x = -2.05, y = 1.15, c("Edges weighted by prevalence"))

set.seed(seedValue)
plot(graphMatrix, layout=layout_in_circle, vertex.size=nodeWeight$n / 10) # Plot weighted by number of entries in each category
legend(x = -2.15, y = 1.25, c("Nodes weighted by prevalence"))

set.seed(seedValue)
plot(graphMatrix, layout=layout_in_circle, edge.width = edgeWeight$n / 4, vertex.size=nodeWeight$n / 10) # Plot weighted by both Rating/Genre edge pairing, and nodes weighted by total number of entries in each category
legend(x = -2.15, y = 1.25, c("Weighted edges & nodes"))

A Deeper Look at Analysis

Given the nature of how we have structured our data, as well as its raw form, it is difficult to derive a significant amount of value from the methods of analysis learned over the quarter. Nonetheless, the degree() function serves a function identical to our counting in the previous section: it summarizes our findings for prevalence of genre and rating. In this, we are able to see a preference for programming rated R and TV-MA, with these two roughly analogous ratings comprising 745 of 1418 entries, or approximately 52.539% of the total; we also see that PG-13 and TV-14 (also roughly analogous in rating) comprise an additional 21.086% (n = 299) of our total entries. Combined, this represents a total of 1044 entries geared towards individuals thirteen and above, or 73.625% of Netflix programming analyzed. Youth-oriented programming (e.g., PG, G, TV-Y, TV-Y7, TV-G, TV-PG) accounts for 352 entries, or 24.824% of total entries. There are 22 unrated entries, representing 1.551% of total programming.
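The bracket totals and shares quoted above can be reproduced with a few lines of base R. This is only a sketch: the counts are copied by hand from the degree() output, and the names ratingCounts, brackets, bracketTotals, and bracketShares are illustrative rather than objects from our workflow.

```r
# Rating counts copied from the degree() output for the collated data
ratingCounts <- c("PG-13" = 117, "PG" = 45, "TV-14" = 182, "TV-Y" = 85,
                  "R" = 240, "TV-PG" = 103, "TV-G" = 33, "TV-Y7" = 70,
                  "TV-MA" = 505, "G" = 16, "NR" = 22)

# Age brackets as grouped in the text (the grouping is our own judgment call)
brackets <- list(
  adult      = c("R", "TV-MA"),
  adolescent = c("PG-13", "TV-14"),
  youth      = c("PG", "G", "TV-Y", "TV-Y7", "TV-G", "TV-PG"),
  unrated    = "NR"
)

# Sum the counts within each bracket, then express them as percentages
bracketTotals <- sapply(brackets, function(r) sum(ratingCounts[r]))
bracketShares <- round(100 * bracketTotals / sum(ratingCounts), 3)

print(bracketTotals)
print(bracketShares)
```

Keeping the bracket definitions in one list makes it easy to test a different grouping, for example moving TV-PG into the adolescent bracket, without redoing the arithmetic by hand.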

Assuming our method of paring down the dataset has left our results roughly equivalent to Netflix programming more broadly, this reflects a tendency for Netflix to heavily prioritize adult-oriented programming, with youth and child programming considered secondary concerns. This analysis makes a certain intuitive sense—adults are by a significant margin the most likely consumers to be Netflix subscribers themselves, and it would be an unwise assumption that all Netflix subscribers will have children. With adolescent-oriented programming representing around a fifth of total programming, and child-friendly programming representing around a quarter of total programming, this gives Netflix a profile that incorporates the interests of younger viewers and families, while still favoring paying clientele. In fairness, there is no reason why programming acceptable to a broader audience cannot be watched by older viewers—G and PG films can certainly be appropriate for adults, even when they are not considered the sole or primary audience.

Luckily, as we still have access to our raw data, we are able to draw comparisons against the entire available Netflix database of programming to see if these comparisons hold water. From this, we can see 4009 entries (45.521% of total, vs. 52.539% truncated) are rated either R, NC-17 or TV-MA; this suggests a similar pattern to what we observed in our data, though not to the degree of prominence that our collated data suggests. When comparing for adolescent-oriented programming (PG-13, TV-14) we see a total of 2650 entries (30.090% of total, vs. 21.086% truncated). Together these represent 75.610% of total Netflix programming appropriate to adolescents and adults, similar to our collated findings at 73.625%. Lastly, child-friendly programming (G, PG, TV-G, TV-PG, TV-Y, TV-Y7, TV-Y7-FV) represents 2058 entries, or 23.368% of total Netflix programming, also close to our previous truncated findings of 24.824% for this age group.

While the findings are broadly similar between our collated data and the raw data—and thus our conclusions about adults being largely the primary target audience of Netflix programming writ large—we do see a notable difference in terms of adolescent-friendly programming. Whereas our previous assessment saw adults favored at a ratio of 2.492:1 in relation to adolescents, we see that ratio drop to 1.513:1—still in favor of adults, but not nearly to the dominant degree previously observed. This suggests a programming strategy from Netflix that seeks a broader audience appeal than previously considered—if still favoring paying subscribers, it is a favor that simultaneously encourages subscribers with families.
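The adult-to-adolescent ratios quoted above follow directly from the rating counts. The sketch below copies those counts by hand from the degree() and count() outputs; the variable names are illustrative only.

```r
# Counts for the two brackets, copied from the collated and raw outputs
collated <- c("R" = 240, "TV-MA" = 505, "PG-13" = 117, "TV-14" = 182)
raw      <- c("R" = 799, "NC-17" = 3, "TV-MA" = 3207,
              "PG-13" = 490, "TV-14" = 2160)

adultRatings      <- c("R", "NC-17", "TV-MA")
adolescentRatings <- c("PG-13", "TV-14")

# Ratio of adult entries to adolescent entries; ratings absent from a
# dataset (NC-17 in the collated data) are simply skipped.
bracketRatio <- function(counts) {
  adult      <- sum(counts[names(counts) %in% adultRatings])
  adolescent <- sum(counts[names(counts) %in% adolescentRatings])
  round(adult / adolescent, 3)
}

adultToAdolescent <- c(collated = bracketRatio(collated),
                       raw      = bracketRatio(raw))
print(adultToAdolescent)
```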

When examining our collated genres for prevalence, we see the prominence of Documentaries (n=359) and Stand-Up Comedy (n=334) as relatively outsized categories, with Children & Family Movies (n=215) coming in a respectable third place. When cross-referencing these categories with betweenness, we see that Documentaries has the highest observed betweenness score amongst genres (86.616), followed by Children & Family Movies (46.693); conversely, we see that Stand-Up Comedy only has a betweenness score of 8.075. From this, we might infer that both Documentaries and Children & Family Movies represent genres with broad appeal across age demographics, and that Stand-Up Comedy is relatively limited in terms of age.

We can further analyze this hypothesis by checking it against our genre-rating pairs, and assessing rating variability for these genres. In the case of Documentaries, we see a broad range of Ratings, from G/TV-G to R/TV-MA, which remain relatively populated from the PG/TV-PG age bracket through adults. Conversely, when checking Stand-Up Comedy against Ratings we observe an overwhelming clustering towards TV-MA, with a smaller pairing seen with TV-14; this suggests the possibility that Stand-Up Comedy is largely oriented towards a specific age range, which might help explain the relatively low betweenness score seen with it.

Unfortunately, this analysis begins to break down when examining Children & Family Movies. From our Genre-Rating pairings, we can observe only the following pairings with this genre: G, PG, TV-G, TV-PG, TV-Y, and TV-Y7. This suggests an age range as narrow as that observed with Stand-Up Comedy. Given the significant discrepancy in the betweenness scores for these two genres, and their relatively similar demographic focus on specific age ranges, one is tempted to ask the question: what does this mean?

Perhaps the answer lies with the betweenness scores of the ratings themselves. If this were the case, we might expect high betweenness scores for the ratings observed in Children & Family Movies to correlate with the betweenness score of the genre itself; similarly, we might expect to see a low betweenness score for TV-MA, as that represents the dominant rating within the Stand-Up Comedy genre.

Unfortunately, this hypothesis only raises further questions, as the observed betweenness scores for TV-PG (44.917) and TV-MA (41.875) are the highest of any rating. While we do observe a higher score for TV-PG, the gap is not nearly large enough to explain this discrepancy by itself. Similarly, the PG (14.564) and TV-G (9.756) ratings are unlikely to account for the difference in betweenness between these genres, even when factored in.

The simplest answer we can think to offer to this conundrum is thus: the usage of betweenness for this data is not particularly relevant nor useful—and insights derived may be little more than coincidental in their relation towards genre or rating prominence.
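One reading consistent with the output, offered as a sketch rather than a confirmed explanation: betweenness rewards nodes that sit on the only route to other nodes, not nodes with many entries. TV-Y and TV-Y7 both score 0 and attach only to Children & Family Movies, so every shortest path from those ratings to the rest of the network would have to pass through that genre. The hypothetical toy network below (the pairings are made up for illustration, and toyPairs, toyGraph, and toyBetweenness are illustrative names) shows the effect in miniature.

```r
library(igraph)

# Hypothetical toy network: one edge per distinct genre-rating pairing
toyPairs <- data.frame(
  Genre  = c("Children & Family Movies", "Children & Family Movies",
             "Children & Family Movies", "Documentaries",
             "Documentaries", "Stand-Up Comedy"),
  Rating = c("TV-Y", "TV-Y7", "PG", "PG", "TV-MA", "TV-MA")
)
toyGraph <- graph_from_data_frame(toyPairs, directed = FALSE)

# Betweenness counts shortest paths that pass through each node;
# no entry counts are involved in this calculation.
toyBetweenness <- betweenness(toyGraph, directed = FALSE)
print(toyBetweenness)
```

In this toy case the genre that is the sole link to two ratings scores well above zero, while a genre attached to a single shared rating scores zero regardless of how many titles it might contain.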

Still, we should not write off the prominence of these genres entirely simply because they are unable to be analyzed via betweenness; they still provide an important window into the workings of Netflix programming. With the prominence of Children & Family Movies, we begin to see the emergence of what constitutes a large proportion of youth-oriented programming, given its representation in every rating from G to TV-Y7. This makes a certain sense, as it would be odd to produce programming intended for children, yet not appropriate for them. Of course, this insight isn’t particularly interesting in itself; it is more a confirmation that what we might intuitively expect aligns with the data we have derived. Thus, while it is not groundbreaking, it at least helps provide evidence that our methodology aligns with the reality we could expect to find. It also suggests that child-oriented programming offered by Netflix is typically labelled explicitly as such, with a full 100% of TV-Y and TV-Y7 programming observed falling under the Children & Family Movies category. Additionally, 93.750% of G (15 of 16), 55.556% of PG (25 of 45), and 33.333% of TV-G (11 of 33) programming also fall under this genre, further reinforcing the idea that Netflix strongly prefers an explicit orientation towards children (and families with children) under this genre categorization.

Similarly,

netflixAnalysis = graph_from_data_frame(netflixClean, directed = FALSE)
#netflixAnalysis
#head(V(netflixAnalysis))

#diameter(netflixAnalysis)
#edge_density(netflixAnalysis)
#edge_density(graphMatrix)
#transitivity(netflixAnalysis)

degree(netflixAnalysis)

                   PG-13                       PG 
                     117                       45 
                   TV-14                     TV-Y 
                     182                       85 
                       R                    TV-PG 
                     240                      103 
                    TV-G                    TV-Y7 
                      33                       70 
                   TV-MA                        G 
                     505                       16 
                      NR            Documentaries 
                      22                      359 
Children & Family Movies                Thrillers 
                     215                       65 
                Comedies       Action & Adventure 
                     110                      128 
                  Dramas            Horror Movies 
                     137                       55 
         Stand-Up Comedy         Music & Musicals 
                     334                       10 
         Romantic Movies            Sports Movies 
                       3                        1 
        Sci-Fi & Fantasy 
                       1

count(netflixRaw, vars = rating)

# A tibble: 18 × 2
   vars           n
   <chr>      <int>
 1 ""             4
 2 "66 min"       1
 3 "74 min"       1
 4 "84 min"       1
 5 "G"           41
 6 "NC-17"        3
 7 "NR"          80
 8 "PG"         287
 9 "PG-13"      490
10 "R"          799
11 "TV-14"     2160
12 "TV-G"       220
13 "TV-MA"     3207
14 "TV-PG"      863
15 "TV-Y"       307
16 "TV-Y7"      334
17 "TV-Y7-FV"     6
18 "UR"           3

count(netflixClean, vars = Genre, wt_var = Rating)

# A tibble: 57 × 3
   vars                     wt_var     n
   <chr>                    <chr>  <int>
 1 Action & Adventure       PG-13     23
 2 Action & Adventure       R         86
 3 Action & Adventure       TV-14      2
 4 Action & Adventure       TV-MA     17
 5 Children & Family Movies G         15
 6 Children & Family Movies PG        25
 7 Children & Family Movies TV-G      11
 8 Children & Family Movies TV-PG      9
 9 Children & Family Movies TV-Y      85
10 Children & Family Movies TV-Y7     70
# ℹ 47 more rows

betweenness(netflixAnalysis)

                   PG-13                       PG 
             3.351965440             14.564281379 
                   TV-14                     TV-Y 
            37.792843092              0.000000000 
                       R                    TV-PG 
            16.359213071             44.916605425 
                    TV-G                    TV-Y7 
             9.756488132              0.000000000 
                   TV-MA                        G 
            41.874507360              0.310161830 
                      NR            Documentaries 
             0.073934271             86.615503960 
Children & Family Movies                Thrillers 
            46.693407921              0.719259376 
                Comedies       Action & Adventure 
             3.287497371              1.117510567 
                  Dramas            Horror Movies 
            11.041896424              0.406851904 
         Stand-Up Comedy         Music & Musicals 
             8.075264892              0.038208428 
         Romantic Movies            Sports Movies 
             0.004599158              0.000000000 
        Sci-Fi & Fantasy 
             0.000000000

Conclusion

The chief takeaway from this project has been thus: our ambitions exceeded our capabilities. While we were able to finally arrive at a modicum of useful visualization, it has been clear that all three of us have been in over our heads in terms of familiarity, necessitating the rapid absorption of a breadth of previously unknown material. Had one of us been familiar with the syntax of R prior to this course, it is possible that the end results may have been different, with that prior experience lent to our endeavors; instead, we have found ourselves in something of a “blind-leading-the-blind” situation.

This is not to say the project has been a failure. While we may not have arrived at our original goals (or necessarily even our secondary or tertiary goals), the entire experience has been deeply enlightening towards our understanding of what data visualization can comprise: we have undergone a crash-course through the Tidyverse, sifted through mountains of forum posts on StackExchange, StackOverflow, and RStudio Community, and pored over documentation for iGraph, dplyr, and RDocumentation itself.

The truth is, we weren’t able to successfully translate our data manipulations from the Tidyverse into a successful visual representation. There’s no easy way to pretty it up—this experience has been defined not by success in our original aim, but rather the journey of discovery along the way. We would love to have delivered a beautiful final project to serve as a capstone over the quarter, but that remains a dream unfortunately out of reach.

In hindsight, we may have had greater success had we chosen a simpler project—the ambitions of analyzing a two-mode network for the first time while simultaneously learning multiple packages from scratch proved too much—a case of trying to run at a sprint while barely past walking without assistance.

Contextually for our team, further work on this project could likely be buoyed by consulting with someone possessing additional expertise—particularly regarding how to translate data from the Tidyverse back into iGraph; as iGraph represents a more familiar ecosystem to us, the translation between the two has similarly represented a hurdle which we were unable to clear—simply, there are points of connection and understanding between what we already know, and what we have learned along the way, which remain unbridged.

More broadly: being able to utilize a greater amount of the raw dataset would likely prove a boon in future endeavors. As our collated version of the data, which focuses only on single-genre examples, reduced our sample size to approximately a sixth of the original (1418 of 8807 entries), we have likely skewed our results unintentionally. The path forward in this regard is likely complicated; other than a brute force approach to manually reorganizing the data over an extended period of time, no simple answers emerged for the skill level we possess. We did consider writing functions which could separate categories in an automated fashion upon hitting commas while parsing character text (the sole dividing character between genres), but our lack of familiarity with R left little pathway forward to do so in an efficient manner for the available time.
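The comma-splitting idea mentioned above can be sketched in base R without writing a custom parser. The example assumes the genre column is named listed_in (our assumption about the raw file; only rating appears in the output shown earlier) and uses made-up titles. The fractional weight is one possible guard against multi-genre titles dominating the counts, not a method we tested.

```r
# Toy rows mimicking the raw file; `listed_in` is assumed to be the
# comma-separated genre column, and the titles are made up.
raw <- data.frame(
  title     = c("Title A", "Title B"),
  rating    = c("R", "TV-MA"),
  listed_in = c("Dramas, Thrillers", "Documentaries"),
  stringsAsFactors = FALSE
)

# Split on commas, dropping any whitespace after the comma
genreList  <- strsplit(raw$listed_in, ",\\s*")
genreCount <- lengths(genreList)

# One row per title-genre pairing; the fractional weight keeps each
# title's total contribution at 1 (one possible guard against skew).
long <- data.frame(
  title  = rep(raw$title,  genreCount),
  rating = rep(raw$rating, genreCount),
  genre  = unlist(genreList),
  weight = rep(1 / genreCount, genreCount),
  stringsAsFactors = FALSE
)
print(long)
```

Summing weight by genre and rating, rather than counting rows, would let a three-genre title count as one third of an entry in each of its genres.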

Additionally, this pathway would better open up the ability to divide our analysis into defined sections of time (e.g., 5 years, 10 years), and compare if trends in pairings were consistent temporally—at least in terms of what programming Netflix has determined to have lasting popular appeal. We considered including this data within this project, but a cursory analysis proved that our somewhat inelegant solution to limiting categorical overload amongst genres limited our data too heavily into recent years to prove particularly insightful—with the majority of films coming out within the 21st century, and disproportionately weighted towards the present day. Even if such a ratio were to hold true across the larger dataset, we could still expect to see our entries from the 20th-century go from n < 100 to n > 500; given that the 1990s held no more than fifty entries in total, with no more than thirty prior to 1990, it is difficult to see how much value could have been derived via such an analysis given the methods we deployed during this project—success in this regard would likely require rethinking our approach fundamentally to avoid overly limiting our available data.
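Dividing entries into fixed spans of time, as described above, takes only integer division. The sketch below uses hypothetical release years and assumes a column named release_year; binWidth can be set to 5 or 10 to match the spans mentioned.

```r
# Hypothetical release years; `release_year` is assumed to be the
# relevant column name in the raw data.
toy <- data.frame(
  release_year = c(1987, 1994, 1999, 2008, 2016, 2019, 2020),
  rating       = c("PG", "R", "PG-13", "TV-14", "TV-MA", "TV-MA", "TV-Y"),
  stringsAsFactors = FALSE
)

binWidth <- 10 # Span of each period in years

# Floor each year to the start of its period (e.g., 1994 -> 1990)
toy$period <- (toy$release_year %/% binWidth) * binWidth

# Entries per period, then a rating-by-period cross-tabulation
periodCounts <- table(toy$period)
print(periodCounts)
print(table(toy$rating, toy$period))
```

A table like periodCounts would make the sparsity problem described above visible immediately, before any per-period network is built.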

It remains our hope that what has been delivered showcases our adaptability and resilience in the face of adversity, and a willingness to deliver a product to the best of our capabilities—if not our own self-standards.