# packages
library(plotly)
library(dslabs)
library(ggplot2)
library(tidyverse)

Problem 1 [20 points]

Create a plotly graph of your choosing that represents at least two variables, one of which must be a categorical variable.

This plot can be a scatter plot, overlaid density plots (the graphing variable is continuous, with separate densities grouped by a categorical variable), etc. Choropleth maps could also be on the list; you have to admit they look kinda cool.

The graph must include:

customized hover text that is informative to the graphing elements in the plot
separate color to represent groups
labeled axes and appropriate title

# read in the dataset
vgsales <- read.csv("vgsales.csv")

# clean the dataset
vg_clean <- vgsales %>%
  drop_na(Genre, Platform, Name, Global_Sales)

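# Find the 5 platforms with the most entries
# (named top_platforms so it does not mask base R's table() function)
top_platforms <- vg_clean$Platform %>%
  table() %>%
  sort() %>%
  tail(5) %>%
  names()

# Dataset with just the top 5 platforms
vg_clean2 <- vg_clean %>%
  filter(Platform %in% top_platforms)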

# plot
plot_ly(vg_clean2,
        x = ~NA_Sales, color = ~Platform, 
        type = 'histogram',
        histnorm = 'density',
        hoverinfo = 'text',  # hover text
        text = ~paste("Game: ", Name,
                      "<br>Platform: ", Platform,
                      "<br>Year: ", Year,
                      "<br>North America Sales: ", NA_Sales)) %>%
  layout(
    title = "North America Sales by Platform",
    xaxis = list(title = "NA Sales (millions)",
                 range = c(0,1.5)),
    yaxis = list(title = "Count",
                 range = c(0, 9000)),
    showlegend = TRUE
  )
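
The call above sets histnorm = 'density'. As I understand the plotly documentation, that option divides each bin's count by the bin width, so the bar heights only equal raw counts when the bins are one unit wide. Below is a minimal base-R sketch of that scaling. The sales values and the bin width of 0.5 are made up for illustration and do not come from vgsales.csv.

```r
# Hypothetical sales values (millions); not taken from vgsales.csv
sales <- c(0.05, 0.10, 0.12, 0.30, 0.45, 0.60, 0.62, 0.90, 1.10, 1.40)

# Assumed bin edges of width 0.5 covering the plotted range 0 to 1.5
breaks <- seq(0, 1.5, by = 0.5)
bin_width <- diff(breaks)

# Raw number of values falling in each bin
counts <- as.vector(table(cut(sales, breaks = breaks, include.lowest = TRUE)))

# Count divided by bin width: the scaling described for histnorm = "density"
density_height <- counts / bin_width

# Side-by-side view of counts and density-scaled heights
result <- data.frame(bin_start = head(breaks, -1), counts, density_height)
print(result)
```

In this toy case the heights (10, 6, 4) are twice the counts (5, 3, 2) because each bin is 0.5 wide.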

Include at least a 1-paragraph discussion about the graph. Discuss what is being plotted and what information is being displayed in the graph. Discuss any information that the reader may gain from hovering the cursor over graphing elements. Discuss any issues/challenges you had (if any) while making the plot, and how you dealt with or overcame them.

Discussion:

The graph is a layered histogram showing the distribution of North American video game sales (NA_Sales) across the five platforms with the most entries in the dataset: DS, PS2, PS3, Wii, and X360. Each platform has its own color, which makes the platforms easy to compare. The x-axis displays NA sales in millions, while the y-axis represents the count of games within each sales range. The hover text provides more information about each individual game: the game name, platform, release year, and North American sales. A challenge that came up was ensuring that the graph was easy to read. To address this, I limited the x-axis to the range 0 to 1.5 million and the y-axis to the range 0 to 9,000, which lets readers focus on the most relevant part of the data.
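
To illustrate how the "platforms with the most entries" were picked, here is a small base-R sketch of the same count, sort, and keep-the-tail idea. The platform vector is hypothetical, and it keeps the top 2 rather than the top 5 so the result is easy to check by eye.

```r
# Hypothetical platform column; the real one comes from vgsales.csv
platform <- c("DS", "DS", "DS", "PS2", "PS2", "Wii", "PC", "PS2", "DS", "Wii")

# Count entries per platform, sort ascending, keep the last (largest) two names
top_platforms <- names(tail(sort(table(platform)), 2))

# Keep only the entries whose platform is in the top group
kept <- platform[platform %in% top_platforms]

print(top_platforms)
print(length(kept))
```

Because sort() orders the counts from smallest to largest, tail() returns the most frequent platforms, with the single most frequent one last.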

Problem 2 [20 points]

Create an animated plotly graph with a data set of your choosing. This can be, but does not have to be a scatter plot. Also, the animation does not have to take place over time. As mentioned in the notes, the frame can be set to a categorical variable. However, the categories the frames cycle through should be organized (if needs be) such that the progression through them shows some pattern.

This graph should include:

Aside from the graphing variable, a separate categorical variable. For example, in our animated scatter plot we color grouped the points by continent.
Appropriate axis labels and a title
Augment the frame label to make it more visible. This can include changing the font size and color to make it stand out more, and/or moving the frame label to a new location in the plotting region. Note, if you do this, make sure it is still clearly visible and does not obstruct the view of your plot.

# read in the dataset
student <- read.csv("student_habits_performance.csv")

# create the plot
student %>%
  plot_ly(
    x = ~study_hours_per_day,
    y = ~exam_score,
    type = 'scatter',      # scatter plot
    mode = "markers",
    color = ~gender,
    frame = ~age,   # animate by age
    hoverinfo = 'text',
    text = ~paste("Student ID: ", student_id,
                  "<br>Study Hours: ", study_hours_per_day,
                  "<br>Exam Score: ", exam_score,
                  "<br>Age: ", age,
                  "<br>Gender: ", gender)
  ) %>%
  layout(
    title = "Study Hours vs. Exam Score",
    xaxis = list(title = "Study Hours (per day)"),
    yaxis = list(title = "Exam Score")
  )
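
The plot above keeps the default frame label. If the label needed to stand out more, one option would be plotly's animation_slider() with a currentvalue list. The sketch below uses a small made-up data frame whose column names mirror the ones above, and the prefix, color, and size are arbitrary choices, not values from the assignment.

```r
library(plotly)

# Hypothetical toy data; column names mirror those used in the plot above
toy <- data.frame(
  study_hours_per_day = c(1, 3, 5, 2, 4, 6),
  exam_score          = c(55, 70, 90, 60, 80, 95),
  gender              = c("Female", "Male", "Female", "Male", "Female", "Male"),
  age                 = c(17, 17, 17, 18, 18, 18)
)

p <- toy %>%
  plot_ly(
    x = ~study_hours_per_day,
    y = ~exam_score,
    type = "scatter",
    mode = "markers",
    color = ~gender,
    frame = ~age      # animate by age
  ) %>%
  # Enlarge and recolor the frame label, and prefix it with the variable name
  animation_slider(
    currentvalue = list(prefix = "Age: ", font = list(color = "red", size = 20))
  )
```

Printing p displays the animated plot with the larger, red "Age: " label above the slider.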

Include at least a 1-paragraph discussion about the plot. Discuss what you are plotting and what trends can be seen throughout the animation. Discuss any issues, if any, you ran into in making the plot and how you overcame them.

Discussion:

The animated plot above shows the relationship between study hours per day and exam scores for students, with the animation stepping through the different ages. Each point on the scatter plot is colored by the student’s gender, which lets us compare performance by gender and age. As the animation moves through the ages, a general pattern appears: students who reported more study hours tended to achieve higher exam scores. The relationship is not perfectly linear, which could be due to other variables in the dataset, such as sleep or work. One trend that held across all ages was that no one who reported less than 4 hours of study time per day scored 100. Another interesting trend came up in the 19-year-old group. Every other age had a roughly linear trend across the whole range of study hours. This group instead showed a slight downward trend from 0 to 2 study hours per day, then a cluster following the same trend as the others from about 2 to 5.5 hours, and beyond that everyone scored 100. I didn’t run into any issues with the frame labeling. The plot’s default kept a clean visual without any modifications needed. The animation slider ranged from 17 to 24 years of age.
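
A claim like "no one with a score of 100 reported less than 4 hours of study" can also be checked directly on the data rather than by eye. Here is a minimal base-R sketch of such a check. The rows are invented for illustration; only the column names follow the ones used in the plot.

```r
# Hypothetical rows; only the column names follow the dataset used above
toy <- data.frame(
  student_id          = c("S1", "S2", "S3", "S4", "S5"),
  study_hours_per_day = c(1.5, 3.9, 4.2, 6.0, 5.1),
  exam_score          = c(62, 88, 100, 100, 93),
  age                 = c(17, 19, 19, 20, 21)
)

# Rows with a perfect score but fewer than 4 study hours per day
violations <- subset(toy, exam_score == 100 & study_hours_per_day < 4)

# Zero rows means the claim holds for this data
print(nrow(violations))

# Smallest reported study time among perfect scores, per age
perfect <- subset(toy, exam_score == 100)
min_hours <- tapply(perfect$study_hours_per_day, perfect$age, min)
print(min_hours)
```

Running the same two subset() calls on the real student data frame would confirm or refute the pattern seen in the animation.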

Sp25 Assignment 5 [40 points]

Samantha Thomas
