Lab 3B

You will want to load ggplot2 and dplyr to manipulate and visualize the data.

library(ggplot2)
library(dplyr)
library(plotly)
library(RColorBrewer)
library(ggpubr)

Requirements:

For the following questions, answer using dplyr and ggplot2 and plotly.
Include both a knitted html output and the Rmarkdown file for your submission to Canvas.

Data

The lab’s questions will include data from the Rugby 2015 World Cup.

rugby_data = read.csv("rubgyworldcup2015.csv", sep = ",", header = TRUE)

Quesiton 1

Remove all NA values in the rugby_data dataset. Create a third categorical variable that categorizes the team country by continent:

Americas: Canada, USA, Argentina, Uruguay
Europe: England, France, Ireland, Italy, Romania, Scotland, Wales
Asia: Georgia, Japan,
Oceania: Australia, Fiji, New Zealand, Samoa, Tonga
Africa: Namibia, South Africa

Use this variable as the x-axis label. Using ggplot2, create faceted bar graphs. On the left bar graph plot the average height per team per continent and on the right bar graph plot the average weight per team per continent. Colorize based on team. Give your graphs titles, label x- and y-axes, and label the bar based on its value. After your graph, include a write-up describing any conclusions based on the visualization.

r1 = na.omit(rugby_data)
americas =  c("Canada","USA","Argentina","Uruguay")
europe = c("England","France","Ireland","Italy","Romania","Scotland", "Wales")
asia = c("Georgia","Japan")
oceania = c("Australia","Fiji","New Zealand","Samoa","Tonga")
africa = c("Namibia", "South Africa")
r1 = r1%>%
  group_by(team)%>%
  mutate(cont = case_when(
    team %in% americas ~ "Americas",
    team %in% europe ~ "Europe",
    team %in% oceania ~ "Oceania",
    team %in% africa ~ "Africa",
    team %in% asia ~ "Asia",
    TRUE ~ "Other"
  ))

r2 = r1%>%
  group_by(team, cont)%>%
  summarise(height_mean = mean(height_cm), weight_mean = mean(weight_kg))

## `summarise()` has grouped output by 'team'. You can override using the `.groups` argument.

g1 = ggplot(data = r2, aes(x = cont, y = height_mean, fill = team))+
  geom_bar(stat = "identity", position = position_dodge(), color = "black")+
  geom_text(aes(label = round(height_mean,digits = 1)), vjust = 2, position = position_dodge(.9), size = 1.2)+
  ggtitle("Average Height by Team") + ylab("Height (cm)") + xlab("Continient")+
  labs(fill = "Team")+theme(legend.key.size = unit(0.4, 'cm'), legend.position = "bottom")

g2 = ggplot(data = r2, aes(x = cont, y = weight_mean, fill = team))+
  geom_bar(stat = "identity", position = position_dodge(), color = "black")+
  geom_text(aes(label = round(weight_mean,digits = 1)), vjust = 2, position = position_dodge(.9), size = 1.2)+
  ggtitle("Average Weight by Team") + ylab("Weight (kg)") + xlab("Continient")+
  labs(fill = "Team")+theme(legend.key.size = unit(0.4, 'cm'), legend.position = "bottom")

ggarrange(g1,g2, common.legend = TRUE, legend = "bottom")

Conclusion: The graphs appear to follow the same shape. They are just scaled differently (weight 0-100kg and height 0-150cm.) Uruguay appears to be the lightest and smallest.

Question 2

Using polt_ly, create side-by-side box plot that shows the distribution of age based on position. Colorize the plot based on the continent. Add title, x- and y-axis labels, and a key. Include a write up afterwards of anything interesting you noticed in the data visualization.

r3 <- r1%>%
  group_by(position, cont)

figAge <- plot_ly(data = r3, x = ~position, y = ~age, color = ~cont, type = "box")%>% 
  layout(boxmode = "group", title = "Age by Position", yaxis = list(title = "Age"), xaxis = list(title = "Position", tickangle = 90))
figAge

## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## 'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'computed', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'newshape', 'activeshape', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'

Conclusion: It is interesting to see that only in America is the position Utility Forward and Back. You can also see that some positions appear more normally distributed like Oceanian Back Row and other appear very skewed like Asian Wing

Question 3

Consider the height_cm (x) and weight_kg (y) variables from rugby_df. First, plot x and y using a scatterplot where color is based on continent. The plot should be rendered using plot_ly and should show height_cm, weight_kg, continent, and name of each observation upon hover. The x- and y-axes should be labeled and the plot should have a title. Add the regression equation line for each of the 5 continents.

Hint: To add the regression equation to your plotly plot, add a column to the data that reports the fitted estimate of weight_kg given the regression equation. Then add the lines mapping x to y-hat using add_trace() function.

Then, calculate the least squares regression equations that models the relationship between x and y based on continent (hint: there should be 5 sets of slops and y-intercepts). The analysis should display p-values for each calculated slope in a dataframe that is clearly labeled. After the graph and analysis, include a write-up discussing whether you believe the data can be sufficiently modeled using a linear regression equation. You may include other analysis tools such as the coefficient of determination or the correlation coefficient.

Hint: Use lmList from lme4 package to find the equations simultaneously based on continent.

library(lme4)

## Loading required package: Matrix

r4 = r1%>%
  group_by(cont)

r4$fv <- r4%>% ##Fitted Values
  lm(weight_kg ~ height_cm*cont,.) %>%
  fitted.values()

regLines = lmList(weight_kg ~ height_cm | cont, data = r4)


figReg <- plot_ly(data = r4, x = ~height_cm, y = ~weight_kg, color = ~cont, type = "scatter", text = ~paste("Name: ", name, "\n Continent: ", cont, "\n Weight: ", weight_kg, "kg", "\n Height: ", height_cm, "cm"))%>%
  layout(title = "Height v Weight", yaxis = list(title = "Weight (kg)"), xaxis = list(title = "Height (cm)"))%>%
  add_trace(x = ~height_cm, y = ~fv, mode = "lines")

figReg

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

regLines

## Call: lmList(formula = weight_kg ~ height_cm | cont, data = r4) 
## Coefficients:
##          (Intercept) height_cm
## Africa     -66.39762 0.9120978
## Americas   -81.81404 0.9830348
## Asia      -101.47058 1.1029536
## Europe    -107.37021 1.1292029
## Oceania    -71.96029 0.9452436
## 
## Degrees of freedom: 643 total; 633 residual
## Residual standard error: 10.44036

summary(regLines)

## Call:
##   Model: weight_kg ~ height_cm | NULL 
##    Data: r4 
## 
## Coefficients:
##    (Intercept) 
##            Estimate Std. Error   t value     Pr(>|t|)
## Africa    -66.39762   33.73023 -1.968490 4.944765e-02
## Americas  -81.81404   23.88010 -3.426035 6.520971e-04
## Asia     -101.47058   30.13398 -3.367314 8.050369e-04
## Europe   -107.37021   18.63093 -5.763009 1.290987e-08
## Oceania   -71.96029   22.14145 -3.250026 1.215093e-03
##    height_cm 
##           Estimate Std. Error   t value     Pr(>|t|)
## Africa   0.9120978 0.18122286  5.033017 6.300651e-07
## Americas 0.9830348 0.12896340  7.622588 9.128444e-14
## Asia     1.1029536 0.16392424  6.728435 3.843633e-11
## Europe   1.1292029 0.09951299 11.347291 2.706610e-27
## Oceania  0.9452436 0.11853510  7.974378 7.180252e-15
## 
## Residual standard error: 10.44036 on 633 degrees of freedom

Conclusion: I do not believe the data is linearly correlated. The data is very scattered and doesn’t match the plotted linear model. The p-values are all near zero with a high standard error thus again showing the data isn’t linearly related.

Question 4

Create a new variable in rugby_df called n that calculates the count of the number of players per debut year per continent. Using a line plot, visualize this new varaible. Your graph should have 5 lines, colorized by continent, formatted with x- and y-axis labels and a title. Include a discussion on how the new variable was created and anything you observe in the visualization.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

r5 = r1%>%
  group_by(cont)%>%
  mutate(year = year(dmy(debut)))

r5 = count(r5, year)

figYear <- plot_ly(data = r5, x = ~year, y = ~n, color = ~cont, mode = "lines")%>%
  layout(title = "Player Debut per Year", yaxis = list(title = "Number of Player Debut"), xaxis = list(title = "Year"))


figYear

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

Conclusion: There seems to be a rise of player out of Europe while Asia and Africa have significantly less. I would be interested to see the correlations to Rugby’s popularity in each continent and how it related to the number of debuts.

Question 5

Create a final visualization of rugby_df. Requirements:

Graph should include at least 3 variables, one of which is created using dplyr
Graph should either be created using ggplot2 package and then run through ggplotly or should be created using plotly package, so that the end result is interactive.
Graph should be formatted with x- and y-axis labels, a title, and colorization based on one variable

Following your graph, create a write-up on which variables you chose and why, and any conclusions that you have come to based on your analysis.

r6 = r1%>%
  filter(cont == "Americas")

figAmericas = plot_ly(data = r6, x= ~years_since_debut, y = ~caps, symbol = ~position, symbols = c(0,1,2,3,4,5,6,7,8,9,10,11), color = ~team, type = "scatter",
                      text = ~paste("Name: ", name, "\n Team: ", team, "\n Position: ", position))%>%
  layout(title = "Apperences v Years Played", yaxis = list(title = "Apperences"), xaxis = list(title = "Years Since Debut"))

figAmericas

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Conclusion: The graph above looks at the only players in the Americas. I wanted to see how difference positions plays a roll in the number of appearances (caps). There is an obvious positive trend of years since debut and the number of appearances. There doesn’t appear to have one position that gets more appearances or stay playing longer. There is however more Hookers above 10 years than any other position. Argentina also has the most players above 10 years.