Libraries

library(plotly)
library(dslabs)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(ISLR)
library(tibble)

Problem 1 Data

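data("Hitters")
# Player names are stored as row names; move them into a Player column
Hitters <- Hitters %>%
  rownames_to_column(var = "Player")
# Each name starts with a hyphen; strip it
Hitters$Player <- str_remove(Hitters$Player, "^-")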

Problem 1

Hitters %>% 
  plot_ly(x = ~Runs, y = ~Salary, color = ~League, colors = c("blue", "red"), 
          hoverinfo = "text",
          text = ~paste("Runs:", Runs, "<br>",
                       "Salary:", Salary, "<br>",
                       "Player:", Player)) %>%
  add_markers() %>%
  layout(title = "Salary (1987) vs Runs (1986) by League",
         yaxis = list(title = "Salary (in thousands $)"),
         legend = list(title = list(text = "League")))

Problem 1 Plot Description

This scatter plot is based on the ‘Hitters’ data set from the ISLR package in R. The x-axis shows Runs, the number of runs each player scored in 1986. The y-axis shows Salary, the player's 1987 opening-day salary in thousands of dollars. Each point represents a player. A blue point means the player is in the American League and a red point means the player is in the National League. Hovering over a point shows the exact runs, salary, and name of the player. Knowing the player's name is especially helpful when looking at outliers. If someone has few runs but a high salary, the reader can look into whether that player is skilled in other ways, was injured, has had a long career, etc.

Runs and Salary appear to have a moderate positive correlation. This means players who scored more runs in 1986 often had higher salaries in 1987. The points are widely spread, however, so other factors must also affect salary. In terms of league, American and National League players are spread about equally across the plot. A few more of the high-salary outliers are American League players, but not enough to make a large difference in the overall spread. The only issue I encountered while making the plot was that the original Hitters data set stored the player names as row names, each with a hyphen in front. I fixed this by using rownames_to_column() from the tibble package to make a Player column. Then I used str_remove() from the stringr package (loaded with tidyverse) to remove the hyphen.
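The visual impression of a moderate positive relationship can be checked numerically. The following is a minimal base R sketch, not part of the original analysis: cor_by_group() is a hypothetical helper, and the toy data frame contains made-up values used only to show the call.

```r
# Hypothetical helper: Pearson correlation of x and y, overall and within each group.
cor_by_group <- function(df, x, y, group) {
  # Keep only rows where both variables are observed
  ok <- complete.cases(df[, c(x, y)])
  df <- df[ok, ]
  overall <- cor(df[[x]], df[[y]])
  per_group <- sapply(split(df, df[[group]]),
                      function(d) cor(d[[x]], d[[y]]))
  list(overall = overall, per_group = per_group)
}

# Made-up toy data for illustration only (not values from Hitters)
toy <- data.frame(
  Runs   = c(10, 20, 30, 40, 15, 25, 35, 45),
  Salary = c(100, 250, 300, NA, 90, 200, 260, 500),
  League = factor(c("A", "A", "A", "A", "N", "N", "N", "N"))
)

res <- cor_by_group(toy, "Runs", "Salary", "League")
print(res)
```

With the real data, the call would be cor_by_group(Hitters, "Runs", "Salary", "League"). Dropping incomplete rows matters there because Salary is missing for some players in Hitters.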

Problem 2

LifeExpectancy <- read.csv("LifeExpectancy.csv")
LifeExpectancy %>%
  plot_ly(x = ~BMI, y = ~Life_expectancy, 
          hoverinfo = "text",
          text = ~paste("BMI:", BMI, "<br>",
                       "Life Expectancy:", Life_expectancy, "<br>",
                       "Country:", Country), showlegend = FALSE)  %>%
  # showlegend is a trace attribute, so it belongs in add_text(), not inside textfont
  add_text(x = 26, y = 50, text = ~Year, frame = ~Year,
           textfont = list(size = 100, color = toRGB("grey90")),
           showlegend = FALSE) %>%
  add_markers(frame = ~Year, color = ~Region, showlegend = TRUE) %>%
  layout(title = "Life Expectancy vs BMI by Region from 2000-2015",
         yaxis = list(title = "Life Expectancy"),
         legend = list(title = list(text = "Region")))

Problem 2 Plot Description

This animated scatter plot is based on a WHO Life Expectancy data set found on Kaggle. This data set is similar to gapminder but includes more health-related variables. The x-axis shows body mass index (BMI) and the y-axis shows Life Expectancy. Each point represents a country and is colored according to its Region. The frame is the Year, so pressing the play button follows each country’s point from 2000 to 2015. Hovering over a point shows the exact BMI, Life Expectancy, and Country name. The frame label that shows the year can also be seen in large grey text near the center of the plot.

Overall, there seems to be a positive relationship between BMI and Life Expectancy in every year. This is probably because countries with lower BMIs are dealing with higher levels of starvation and poverty. Life Expectancy has increased overall across countries over the years. Many African countries were below 50 years in 2000, but all countries were above 50 years in 2015. BMI has increased as well, with more countries shifting to the right as the animation plays. In some cases, once BMI reaches an unhealthy level, Life Expectancy is lower. For example, some Oceanic countries have the highest BMIs but Life Expectancy in the 60-70 year range. The region coloring also shows that African countries tend to have low BMIs and low life expectancies, while many Asian countries have low BMIs and higher life expectancies. European countries tend to have BMIs slightly above 25, in the overweight range, but still maintain high life expectancy.
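A claim such as "all countries were above 50 years in 2015" can be checked with a per-year count. The sketch below is a base R illustration under stated assumptions: count_below() is a hypothetical helper, the toy values are made up, and only the column names Year and Life_expectancy are taken from the plotting code.

```r
# Hypothetical helper: number of countries below a life-expectancy threshold, per year.
count_below <- function(df, threshold = 50) {
  below <- df$Life_expectancy < threshold
  # Sum the TRUE values within each year, ignoring missing values
  tapply(below, df$Year, sum, na.rm = TRUE)
}

# Made-up toy data for illustration only (not values from the WHO data set)
toy <- data.frame(
  Country = c("A", "B", "C", "A", "B", "C"),
  Year = c(2000, 2000, 2000, 2015, 2015, 2015),
  Life_expectancy = c(45, 48, 70, 55, 60, 75)
)

counts <- count_below(toy)
print(counts)
```

On the real data this would be count_below(LifeExpectancy), and a count of zero for 2015 would support the statement.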

The only issue I encountered in creating this plot was that the frame label added “trace 0” to the legend, and I had trouble removing it. At first I tried putting showlegend = FALSE into add_text() and showlegend = TRUE into add_markers() so that “trace 0” would disappear without removing the region legend, but this did not change anything. I kept these two showlegend arguments and added showlegend = FALSE to plot_ly(). This fixed the issue: the legend is off by default, and add_markers() turns it back on for the region traces only, so the frame label no longer appears as “trace 0”.
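Per-trace legend control can be sketched without the animation. The example below is a simplified illustration, not the code used for the plot above: the toy and label data frames are made up, and showlegend is passed directly as an argument of each add_*() call so that the text trace stays out of the legend while the marker traces remain in it.

```r
library(plotly)

# Made-up toy values for illustration only (not from the WHO data set)
toy <- data.frame(
  BMI = c(22, 24, 26, 28),
  Life_expectancy = c(60, 65, 72, 78),
  Region = c("Africa", "Africa", "Europe", "Europe")
)
# A single background label, standing in for the year shown in each frame
label <- data.frame(x = 25, y = 70, lab = "2000")

p <- plot_ly() %>%
  # Text trace: hidden from the legend
  add_text(data = label, x = ~x, y = ~y, text = ~lab,
           textfont = list(size = 60, color = toRGB("grey90")),
           showlegend = FALSE) %>%
  # Marker traces: one legend entry per Region
  add_markers(data = toy, x = ~BMI, y = ~Life_expectancy,
              color = ~Region, showlegend = TRUE)

# Print p in an interactive session or an R Markdown chunk to render it
```

Here showlegend sits next to textfont rather than inside its list, because it is an attribute of the trace and not of the font.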

Fa25_Assignment5_Aminah_Harp

Aminah Harp

2025-11-25