/40

+2 completion (turned in on time, turned in Rmd and Html…)

You will want to load ggplot2 and dplyr to manipulate and visualize the data.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Requirements:

Data

The lab’s questions will include data from the Rugby 2015 World Cup.

rugby_data = read.csv("rubgyworldcup2015.csv", sep = ",", header = TRUE)

Quesiton 1

Remove all NA values in the rugby_data dataset. Create a third categorical variable that categorizes the team country by continent:

Use this variable as the x-axis label. Using ggplot2, create faceted bar graphs. On the left bar graph plot the average height per team per continent and on the right bar graph plot the average weight per team per continent. Colorize based on team. Give your graphs titles, label x- and y-axes, and label the bar based on its value. After your graph, include a write-up describing any conclusions based on the visualization.

+1 Continent Created +1 Side-by-side graphs showing avg height and avg weight +1 Avg height/weight correct +1 Graph shows bars per team per continent +1 Colorization based on team +2 Formatting includes title, x- and y-labels and labels on bars +1 commentary

library(tidyr)
table(is.na(rugby_data))
## 
## FALSE  TRUE 
##  6454     6
rugby_data2 <- rugby_data %>%
  na.omit()

rugby_data2 <- rugby_data2 %>%
  mutate(continent = case_when(
    team %in% c("Canada","USA","Argentina","Uruguay") ~ "Americas",
    team %in% c("England","France","Ireland","Italy","Romania","Scotland","Wales") ~ "Europe",
    team %in% c("Georgia","Japan") ~ "Asia",
    team %in% c("Australia","Fiji","New Zealand","Samoa","Tonga") ~ "Oceania",
    team %in% c("Namibia","South Africa") ~ "Africa"
  ))

rugby_avg <- rugby_data2 %>%
  group_by(continent, team) %>%
  summarise(
    avg_height = mean(height_cm),
    avg_weight = mean(weight_kg)
  )
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
num_sum <- rugby_avg %>%
  pivot_longer(cols = c(avg_height, avg_weight),
               names_to = "aggregate",
               values_to = "average")
num_sum
a = ggplot(num_sum, aes(x = continent, y = average, fill = team)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = round(average,1)),
            position = position_dodge(width = .9),
            vjust = -0.5,
            size = 2.5,
            check_overlap = TRUE) +
  facet_wrap(~aggregate, scales = "fixed") +
  labs(title = "Average Height and Weight by Team and Continent",
       x = "Continent",
       y = "Average Value") +
  theme_bw()

ggplotly(a) 
library(tidyr)
table(is.na(rugby_data))
## 
## FALSE  TRUE 
##  6454     6
rugby_data2 <- rugby_data %>%
  na.omit()

rugby_data2 <- rugby_data2 %>%
  mutate(continent = case_when(
    team %in% c("Canada","USA","Argentina","Uruguay") ~ "Americas",
    team %in% c("England","France","Ireland","Italy","Romania","Scotland","Wales") ~ "Europe",
    team %in% c("Georgia","Japan") ~ "Asia",
    team %in% c("Australia","Fiji","New Zealand","Samoa","Tonga") ~ "Oceania",
    team %in% c("Namibia","South Africa") ~ "Africa"
  ))

rugby_avg <- rugby_data2 %>%
  group_by(continent, team) %>%
  summarise(
    avg_height = mean(height_cm),
    avg_weight = mean(weight_kg)
  )
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
num_sum <- rugby_avg %>%
  pivot_longer(cols = c(avg_height, avg_weight),
               names_to = "aggregate",
               values_to = "average")
num_sum
b = ggplot(num_sum, aes(x = continent, y = average, fill = team)) +
  geom_bar(stat = "identity", position = position_dodge2(width = 0.9)) +
  geom_text(aes(y = average/2,
                label = round(average,2),
            group = team),
            position = position_dodge(width = 0.9),
            size = 3,
            angle = 90,
            hjust = 0,
            size = 3,
            color = "black") +
  facet_wrap(~aggregate, scales = "fixed") +
  labs(title = "Average Height and Weight by Team and Continent",
       x = "Continent",
       y = "Average Value") +
  theme_bw()
## Warning: Duplicated aesthetics after name standardisation: size
b

Comment: The chart shows average height and average weight of rugby teams categorized by continent and team. The x-axis represents the continents, while the y-axis depicts the average values for the height and weight variables. Additionally, colors represent different teams. Teams from Oceania appear to have the heaviest players on average. American players seem to be the lightest, while Asian players appear to be shorter compared to teams from other continents. Players from the Europe continent seem to be the tallest.

Question 2

Using polt_ly, create side-by-side box plot that shows the distribution of age based on position. Colorize the plot based on the continent. Add title, x- and y-axis labels, and a key. Include a write up afterwards of anything interesting you noticed in the data visualization.

+2 plotly side-by-side boxplot created +1 Position on x-axis age on y-axis +1 Colorized based on continent +2 Formatting includes title, x- and y-labels and key +1 commentary

plot_ly(
  rugby_data2,
  x = ~position,
  y = ~age,
  color = ~continent,
  type = "box"
) %>%
  layout(
    title = "Age Distribution by Position",
    xaxis = list(title = "Position"),
    yaxis = list(title = "Age"),
    boxmode = "group"
  )
plot_ly(
  rugby_data2,
  x = ~position,
  y = ~age,
  color = ~continent,
  type = "box"
) %>%
  layout(
    title = "Age Distribution by Position",
    xaxis = list(title = "Position"),
    yaxis = list(title = "Age")
  )

Comment: The graph demonstrates the distribution of player ages by rugby position based on different continents. The x-axis portrays player positions, whereas the y-axis illustrates the age variable. As shown, players in the prop and lock positions seem to have the highest median ages across all continents. In addition, the overall distribution shows that most positions have a median age between 25 and 30.

Question 3

Consider the height_cm (x) and weight_kg (y) variables from rugby_df. First, plot x and y using a scatterplot where color is based on continent. The plot should be rendered using plot_ly and should show height_cm, weight_kg, continent, and name of each observation upon hover. The x- and y-axes should be labeled and the plot should have a title. Add the regression equation line for each of the 5 continents.

Hint: To add the regression equation to your plotly plot, add a column to the data that reports the fitted estimate of weight_kg given the regression equation. Then add the lines mapping x to y-hat using add_trace() function.

Then, calculate the least squares regression equations that models the relationship between x and y based on continent (hint: there should be 5 sets of slops and y-intercepts). The analysis should display p-values for each calculated slope in a dataframe that is clearly labeled. After the graph and analysis, include a write-up discussing whether you believe the data can be sufficiently modeled using a linear regression equation. You may include other analysis tools such as the coefficient of determination or the correlation coefficient.

Hint: Use lmList from lme4 package to find the equations simultaneously based on continent.

+1 Plotly scatterplot created +1 height_cm and weight_cm on x- and y- axis +1 Colorized based on continent +1 Hover shows height_cm, weight_kg, continent, and name +1 Least squares regression equations calculated +1 Least squares regression equations labeled on graph +1 P-values for least squares regression equations calculated +1 Commentary

# to prevent error because of attaching the name to each point: Error in gsub("\n", br(), a, fixed = TRUE) : input string 1 is invalid in this locale
rugby_data3 <- rugby_data2 %>%
  mutate(across(where(is.character), ~iconv(.x, "", "UTF-8", sub = "")))


p <- plot_ly(
 rugby_data3,
 x = ~height_cm,
 y = ~weight_kg,
 color = ~continent,
 text = ~paste("Name:", name,
               "<br>Height:", height_cm,
               "<br>Weight:", weight_kg,
               "<br>Continent:", continent),
 hoverinfo = "text",
 type = "scatter",
 mode = "markers"
)


rugby_data3 <- rugby_data3 %>%
 group_by(continent) %>%
 mutate(weight_pred = predict(lm(weight_kg ~ height_cm)))


rugby_data3 <- rugby_data3 %>%
  group_by(continent) %>%
  do({
    model <- lm(weight_kg ~ height_cm, data = .)
    data.frame(., weight_pred = predict(model))
  })

p <- p %>%
 add_trace(
   data = rugby_data3,
   x = ~height_cm,
   y = ~weight_pred,
   color = ~continent,
   type = "scatter",
   mode = "lines",
   showlegend = FALSE
 )

p%>%
 layout(
   title = "Height vs Weight by Continent",
   xaxis = list(title = "Height (cm)"),
   yaxis = list(title = "Weight (kg)")
 )
library(lme4)
## Warning: package 'lme4' was built under R version 4.3.3
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Warning in check_dep_version(): ABI version mismatch: 
## lme4 was built with Matrix ABI version 1
## Current Matrix ABI version is 0
## Please re-install lme4 from source or restore original 'Matrix' package
models <- lmList(weight_kg ~ height_cm | continent, data = rugby_data2)


summary(models)
## Call:
##   Model: weight_kg ~ height_cm | NULL 
##    Data: rugby_data2 
## 
## Coefficients:
##    (Intercept) 
##            Estimate Std. Error   t value     Pr(>|t|)
## Africa    -66.39762   33.73023 -1.968490 4.944765e-02
## Americas  -81.81404   23.88010 -3.426035 6.520971e-04
## Asia     -101.47058   30.13398 -3.367314 8.050369e-04
## Europe   -107.37021   18.63093 -5.763009 1.290987e-08
## Oceania   -71.96029   22.14145 -3.250026 1.215093e-03
##    height_cm 
##           Estimate Std. Error   t value     Pr(>|t|)
## Africa   0.9120978 0.18122286  5.033017 6.300651e-07
## Americas 0.9830348 0.12896340  7.622588 9.128444e-14
## Asia     1.1029536 0.16392424  6.728435 3.843633e-11
## Europe   1.1292029 0.09951299 11.347291 2.706610e-27
## Oceania  0.9452436 0.11853510  7.974378 7.180252e-15
## 
## Residual standard error: 10.44036 on 633 degrees of freedom
summary_fit <- summary(models)

p_values <- data.frame(
  continent = names(models),
  slope = coef(models)[,2],
  slope_p_value = summary_fit$coefficients[, "Pr(>|t|)", "height_cm"]
)


p_values
coefs <- coef(models)
results <- data.frame(
  continent = names(models),
  intercept = coefs[,1],
  slope = coefs[,2],
  slope_p_value = p_values
)

eq_labels <- results %>%
  mutate(label = paste0(
    continent, ": y = ",
    round(intercept, 2), " + ",
    round(slope, 2), "x"
  ))

p <- p %>%
  add_annotations(
    text = eq_labels$label,
    x = max(rugby_data3$height_cm),
    y = seq(
      max(rugby_data3$weight_kg),
      max(rugby_data3$weight_kg) - 20,
      length.out = nrow(eq_labels)
    ),
    showarrow = FALSE
  )

p %>%
  layout(
    title = "Height vs Weight by Continent",
    xaxis = list(title = "Height (cm)"),
    yaxis = list(title = "Weight (kg)")
  )

Africa: y = -66.4 + 0.91x Americas: y = -81.81 + 0.98x Asia: y = -101.47 + 1.1x Europe: y = -107.37 + 1.13x Oceania y = -71.96 + 0.95x

Comment: The graph shows the relationship between height and weight of rugby players among different continents. The x-axis represents the height, while the y-axis depicts the weight of players. The data portrays strong positive correlations between height and weight in all continents. The p-values for the slopes are statistically significant since the values are extremely small. This means the height and weight variables are highly correlated to one another, which the taller the player, the heavier they are.

Question 4

Create a new variable in rugby_df called n that calculates the count of the number of players per debut year per continent. Using a line plot, visualize this new varaible. Your graph should have 5 lines, colorized by continent, formatted with x- and y-axis labels and a title. Include a discussion on how the new variable was created and anything you observe in the visualization.

+1 n created +2 Multi Lineplot created +1 Year on x-axis and Count on y-axis +1 Colorized based on continent +1 Formatted with x- and y-axis labels and a title +1 Commentary

rugby_data3 <- rugby_data2 %>%
  mutate(debut_year = as.integer(format(as.Date(debut, format = "%d-%b-%y"), "%Y"))) %>%
  filter(!is.na(debut_year))

debut_counts <- rugby_data3 %>%
  group_by(continent, debut_year) %>%
  summarise(n = n(), .groups = "drop")

ggplot(debut_counts, aes(x = debut_year, y = n, color = continent, group = continent)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  labs(
    title = "Number of Players Debuting per Year by Continent",
    x = "Debut Year",
    y = "Count of Players",
    color = "Continent"
  ) +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The chart demonstrates the number of rugby players that debut each year from distict continents. The x-axis portrays their debut years, while the y-axis depicts the number of players debuting per year. Overall, Europe has the highest counts among all continents. The debut peak for a lot of countries is between 2012-2014

Question 5

Create a final visualization of rugby_df. Requirements:

Following your graph, create a write-up on which variables you chose and why, and any conclusions that you have come to based on your analysis.

+2 3 variables displayed +2 Rendered with ggplot + ggplotly or plotly +1 Formatted with x- and y-axis labels and a title +2 Graph is appropriate for data +1 Commentary

z = ggplot(rugby_data2,
       aes(x = position, y = weight_kg, fill = continent)) +
  geom_boxplot() +
  labs(
    title = "Player Weight Distribution by Position and Continent",
    x = "Position",
    y = "Weight (kg)"
  ) +
  theme_minimal()
ggplotly(z)
rugby_df2 <- rugby_data2 %>%
  mutate(position_group = ifelse(position %in% c("Prop","Hooker","Lock","Back Row"),
                                 "Forwards","Backs"))

p <- ggplot(rugby_df2, aes(x = height_cm,
                          y = weight_kg,
                          color = position_group,
                          size = caps)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Height vs Weight by Position Group",
    x = "Height (cm)",
    y = "Weight (kg)",
    color = "Position Group",
    size = "Caps"
  ) +
  theme_minimal()

ggplotly(p)

Comment:The boxplot above shows the distribution of rugby players’ weights based on their positions and continents. The x-axis represents player positions, while the y-axis depicts the weights of rugby players in kilograms(kg). As shown above, the prop and lock positions have the highest weights overall(110kg-130kg) since these positions require a lot of strength. In contrast, scrum halves tend to be the lightest among the other positions(70kg-85kg), as the role needs a high level of speed.

Comment: The scatterplot above exhibits the relationship between heights and weights of rugby players. The positions were first grouped into backs and forwards. This new third variable was created to further display the overall shape of the data comparing forward and backward positions. The x-axis depicts the heights of rugby players in cm, while the y-axis demonstrates the weights of rugby players in kg. There is also a legend that portrays the colors that differentiate the forward and backward players, and the size of the points illustrates the number of times a player has played for their national team. There is a positive relationship/association between the variables weight and height in this data, indicating the taller the player, the heavier they are. Forward players are typically heavier and taller than backs because they need a high level of strength. On the other hand, the backs(red) are generally lighter and shorter, as they need a high level of speed.