Load in Libraries!

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.4
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Set Working Directory!

getwd()
## [1] "C:/Users/Joeyc/Documents/School/Spring 2021/Project 1"
setwd("/Users/Joeyc/Documents/School/Spring 2021/Project 1")
getwd()
## [1] "C:/Users/Joeyc/Documents/School/Spring 2021/Project 1"

Load in Athletes CSV File

Athletes <- read.csv("Highest Paid Athletes.csv")

Let’s take a look at the first plot. I made a scatter plot in ggplot2 with a smoothed trend line drawn through it (geom_smooth(), which here defaults to a LOESS fit rather than a straight regression line). The line shows how typical earnings among the top 10 moved from year to year. Since there were a lot of points on the graph, I used ggplotly to add hover tooltips that make individual points easier to inspect.

p1 <- ggplot(Athletes) +
  aes(x = Year, y= earnings) +
  geom_point(aes(color = Sport)) +
  geom_smooth() +
  xlab("Year") +
  ylab("Earnings in Millions") +
  ggtitle("Top 10 Highest Paid Athletes from 1990 to 2020")
ggplotly(p1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
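Because the message above reports that geom_smooth() chose LOESS, the line on the plot is a local smoother. The following is a minimal sketch of how a straight regression line could be requested instead. It assumes a made-up `demo` data frame in place of the Athletes file; the values are random, not real earnings.

```r
library(ggplot2)

# Hypothetical stand-in for the Athletes data (random values, not real earnings).
set.seed(1)
demo <- data.frame(
  Year     = rep(1990:2020, each = 10),
  earnings = runif(310, min = 10, max = 100)
)

# Straight-line fit: method = "lm" requests an ordinary linear regression.
p_lm <- ggplot(demo, aes(x = Year, y = earnings)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Local smoother: what geom_smooth() picks by default for fewer than 1,000 points.
p_loess <- ggplot(demo, aes(x = Year, y = earnings)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Print p_lm or p_loess (or pass either to plotly::ggplotly) to draw the plot.
```

Stating `method` and `formula` explicitly also silences the informational message that geom_smooth() prints when it has to choose for itself.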

Next, we have a graph similar to the first one, but the hover tooltip carries more detail. It lists the athlete’s name along with their earnings in millions, their country, and the sport they play.

p2 <- plot_ly(data = Athletes,
              x=~Year,
              y=~earnings,
              type="scatter",
              mode= "markers",
              color = ~as.factor(Sport),
              hoverinfo= "text",
              text = paste("Name: ", Athletes$Name,
                           "<br>",
                           "Earnings In Millions: ", Athletes$earnings,
                           "<br>",
                           "Country: ", Athletes$Nationality,
                           "<br>",
                           "Sport: ", Athletes$Sport)
              ) 
p2
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Now let’s look at a histogram of yearly earnings for the top 10 highest-paid athletes in the world.

p3 <- Athletes %>%
  ggplot(aes(earnings, fill = Sport)) +
  geom_histogram(binwidth = 10) +
  xlab("Earnings in Millions") +
  ylab("Number of Athletes") +
  ggtitle("Number of Athletes in Each Bracket of Earnings")
ggplotly(p3)

Again, we have a histogram, but this one is clearer because any athlete who made $150 million or more in a year has been filtered out.

p4 <- Athletes %>%
  filter( earnings < 150 ) %>%
  ggplot(aes(earnings, fill = Sport)) +
  geom_histogram(binwidth = 5) +
  xlab("Earnings in Millions") +
  ylab("Number of Athletes") +
  ggtitle("Number of Athletes in Each Bracket of Earnings without Outliers")
ggplotly(p4)
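Before trusting the trimmed histogram, it can help to look at exactly which rows the filter drops. This is a small sketch with a hypothetical four-row `demo` data frame (the names and amounts are invented); the column names mirror the ones used above.

```r
library(dplyr)

# Hypothetical rows; earnings are in millions and made up for illustration.
demo <- data.frame(
  Name     = c("Athlete A", "Athlete B", "Athlete C", "Athlete D"),
  Sport    = c("Boxing", "Soccer", "Golf", "Boxing"),
  earnings = c(300, 95, 150, 40)
)

# Rows the histogram keeps: strictly less than 150.
kept <- demo %>% filter(earnings < 150)

# Rows treated as outliers: 150 or more, including a value of exactly 150.
dropped <- demo %>% filter(earnings >= 150)

dropped
```

Listing the `dropped` rows shows which sports the outliers belong to, and it makes the boundary explicit: a value of exactly 150 is excluded by `earnings < 150`.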

Now we have a bar plot showing the number of times each athlete was in the top 10 in earnings for the year.

p5 <- Athletes %>%
  ggplot(aes(Name, fill = Sport)) +
  geom_bar() +
  xlab("Athletes") +
  ylab("Number of Times in the Top 10") +
  ggtitle("Number of Times Each Athlete Was in the Top 10 of Earnings") +
  coord_flip()
ggplotly(p5)
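geom_bar() counts rows per name behind the scenes. When the bars are crowded, the same counts can be read as a table. Here is a minimal sketch using dplyr's count() on a hypothetical `demo` data frame (the athlete names are placeholders, not rows from the dataset):

```r
library(dplyr)

# Hypothetical top-10 entries; one row per athlete per year.
demo <- data.frame(
  Year  = c(2018, 2018, 2019, 2019, 2020, 2020),
  Name  = c("Athlete A", "Athlete B", "Athlete A",
            "Athlete C", "Athlete A", "Athlete B"),
  Sport = c("Golf", "Boxing", "Golf", "Soccer", "Golf", "Boxing")
)

# One row per athlete with the number of appearances, largest first.
appearances <- demo %>% count(Name, Sport, sort = TRUE)

appearances
```

Adding a filter() on the rank column before count() would give the equivalent table for a narrower cut of the list.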

Since that was a little crowded, I narrowed it to the number of times each athlete was in the top 5 in earnings for a year.

p6 <- Athletes %>%
  filter(Current.Rank <= 5) %>%
  ggplot(aes(Name, fill = Sport)) +
  geom_bar() +
  xlab("Athletes") +
  ylab("Number of Times in the Top 5") +
  ggtitle("Number of Times Each Athlete Was in the Top 5 of Earnings") +
  coord_flip()
ggplotly(p6)

Next, I was interested in a couple of things. First, the number of times each athlete was the number one earner in a year.

p7 <- Athletes %>%
  filter(Current.Rank <= 1) %>%
  ggplot(aes(Name, fill = Sport)) +
  geom_bar() +
  ggtitle("Times Each Athlete Was #1") +
  xlab("Athletes") +
  ylab("Number of Times They Were #1") +
  coord_flip()
ggplotly(p7)

Lastly, I looked at how many times each sport and each country produced the number one earner.

p8 <- Athletes %>%
  filter(Current.Rank <= 1) %>%
  ggplot(aes(Sport, fill = Name)) +
  geom_bar() +
  ggtitle("Times Each Sport Was #1") +
  xlab("Sport") +
  ylab("Number of Times It Was #1") +
  coord_flip()

p9 <- Athletes %>%
  filter(Current.Rank <= 1) %>%
  ggplot(aes(Nationality, fill = Name)) +
  geom_bar() +
  ggtitle("Times Each Country Was #1") +
  xlab("Country") +
  ylab("Number of Times It Was #1") +
  coord_flip()

grid.arrange(p8, p9, nrow=2)

The dataset I am analyzing is from Kaggle. It lists the top 10 highest-earning athletes for each year over the last 30 years (1990 to 2020). Unfortunately, the dataset is missing 2001. The variables are the athlete’s name, their country, their rank among the highest-paid athletes for the year, their rank the previous year, the sport they played that year, the year, and their earnings in millions. These are a mix of continuous and discrete variables. The dataset was clean enough to work with as is; the only preparation I needed was the filter function in some of the plots.
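A gap like the missing 2001 can be confirmed in one line by comparing the years present against the full range. The sketch below uses a hypothetical `years_present` vector as a stand-in for `unique(Athletes$Year)`, built to match the gap described above.

```r
# Hypothetical stand-in for unique(Athletes$Year): every year except 2001.
years_present <- c(1990:2000, 2002:2020)

# Years in the expected range that never appear in the data.
missing_years <- setdiff(1990:2020, years_present)

missing_years
```

On the real data, `table(Athletes$Year)` would additionally show whether every remaining year has exactly 10 rows.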
In this project I have several visualizations. First, I started with a scatter plot with a smoothed trend line on top. It shows that earnings among the top 10 athletes have gradually increased over the years.
Next, I created two histograms. The first one showed how many top 10 entries fell into each earnings bracket. That histogram was very spread out because of a few outliers. All of the outliers were boxers, so boxing pays the most at the very top end. I then made the histogram again without those few outliers, which helped me see how the sports were distributed across the earnings brackets. Outside of boxing, soccer had the next highest-paid top-end athletes.
Finally, I made bar plots. I flipped them to show the data better, so they are horizontal bar plots. The first one broke down how many times each athlete made it into the top 10. You can clearly see that Michael Jordan and Tiger Woods appeared in the top 10 more often than any other athlete, by about five appearances. Then I dug deeper to see who was in the top 5 most often. Tiger and Michael still lead the way, which suggests that when they made the top 10, they most likely made the top 5 too. Lastly, there are a few more bar plots: one showing which athlete held the #1 spot most often, then which sport, and which country. It turns out that Tiger Woods, golf, and the USA each come first in their category, and Tiger Woods accounts for the lead in all three.
I wish I had been able to include more fields in the hover text box in ggplotly. I was only able to customize it in the plotly graph, which is why I made both.
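One approach that may cover this in ggplotly is to map a `text` aesthetic in the ggplot and then pass `tooltip = "text"` to ggplotly(). The following is a minimal sketch, assuming a hypothetical three-row `demo` data frame with the same column names as the Athletes file (the rows are invented):

```r
library(ggplot2)
library(plotly)

# Hypothetical rows with the same column names as the Athletes data.
demo <- data.frame(
  Name        = c("Athlete A", "Athlete B", "Athlete C"),
  Year        = c(2018, 2019, 2020),
  earnings    = c(40, 95, 120),
  Nationality = c("USA", "Portugal", "Argentina"),
  Sport       = c("Golf", "Soccer", "Soccer")
)

# `text` is not drawn by ggplot2; ggplotly() reads it for the hover box.
p <- ggplot(demo, aes(
  x = Year, y = earnings, color = Sport,
  text = paste0(
    "Name: ", Name,
    "<br>Earnings In Millions: ", earnings,
    "<br>Country: ", Nationality,
    "<br>Sport: ", Sport
  )
)) +
  geom_point()

# Show only the custom text in the tooltip.
p_hover <- ggplotly(p, tooltip = "text")
p_hover
```

If a smoother is added to the same plot, it is probably safer to move the `text` mapping into the point layer's own aes() so it does not affect how the smoother groups the data; ggplot2 may warn about an unknown aesthetic in that case, but ggplotly() should still pick it up.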