/40
+2 completion (turned in on time, turned in Rmd and Html…)
You will want to load ggplot2 and dplyr to
manipulate and visualize the data.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
For the following questions, answer using dplyr and
ggplot2 and plotly.
Include both a knitted html output and the Rmarkdown file for your submission to Canvas.
The lab’s questions will include data from the Rugby 2015 World Cup.
rugby_data = read.csv("rubgyworldcup2015.csv", sep = ",", header = TRUE)
Remove all NA values in the rugby_data dataset. Create a
third categorical variable that categorizes the team country by
continent:
Use this variable as the x-axis label. Using ggplot2,
create faceted bar graphs. On the left bar graph plot the average height
per team per continent and on the right bar graph plot the average
weight per team per continent. Colorize based on team. Give your graphs
titles, label x- and y-axes, and label the bar based on its value. After
your graph, include a write-up describing any conclusions based on the
visualization.
+1 Continent Created +1 Side-by-side graphs showing avg height and avg weight +1 Avg height/weight correct +1 Graph shows bars per team per continent +1 Colorization based on team +2 Formatting includes title, x- and y-labels and labels on bars +1 commentary
library(tidyr)
table(is.na(rugby_data))
##
## FALSE TRUE
## 6454 6
rugby_data2 <- rugby_data %>%
na.omit()
rugby_data2 <- rugby_data2 %>%
mutate(continent = case_when(
team %in% c("Canada","USA","Argentina","Uruguay") ~ "Americas",
team %in% c("England","France","Ireland","Italy","Romania","Scotland","Wales") ~ "Europe",
team %in% c("Georgia","Japan") ~ "Asia",
team %in% c("Australia","Fiji","New Zealand","Samoa","Tonga") ~ "Oceania",
team %in% c("Namibia","South Africa") ~ "Africa"
))
rugby_avg <- rugby_data2 %>%
group_by(continent, team) %>%
summarise(
avg_height = mean(height_cm),
avg_weight = mean(weight_kg)
)
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
num_sum <- rugby_avg %>%
pivot_longer(cols = c(avg_height, avg_weight),
names_to = "aggregate",
values_to = "average")
num_sum
a = ggplot(num_sum, aes(x = continent, y = average, fill = team)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(average,1)),
position = position_dodge(width = .9),
vjust = -0.5,
size = 2.5,
check_overlap = TRUE) +
facet_wrap(~aggregate, scales = "fixed") +
labs(title = "Average Height and Weight by Team and Continent",
x = "Continent",
y = "Average Value") +
theme_bw()
ggplotly(a)
library(tidyr)
table(is.na(rugby_data))
##
## FALSE TRUE
## 6454 6
rugby_data2 <- rugby_data %>%
na.omit()
rugby_data2 <- rugby_data2 %>%
mutate(continent = case_when(
team %in% c("Canada","USA","Argentina","Uruguay") ~ "Americas",
team %in% c("England","France","Ireland","Italy","Romania","Scotland","Wales") ~ "Europe",
team %in% c("Georgia","Japan") ~ "Asia",
team %in% c("Australia","Fiji","New Zealand","Samoa","Tonga") ~ "Oceania",
team %in% c("Namibia","South Africa") ~ "Africa"
))
rugby_avg <- rugby_data2 %>%
group_by(continent, team) %>%
summarise(
avg_height = mean(height_cm),
avg_weight = mean(weight_kg)
)
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
num_sum <- rugby_avg %>%
pivot_longer(cols = c(avg_height, avg_weight),
names_to = "aggregate",
values_to = "average")
num_sum
b = ggplot(num_sum, aes(x = continent, y = average, fill = team)) +
geom_bar(stat = "identity", position = position_dodge2(width = 0.9)) +
geom_text(aes(y = average/2,
label = round(average,2),
group = team),
position = position_dodge(width = 0.9),
size = 3,
angle = 90,
hjust = 0,
size = 3,
color = "black") +
facet_wrap(~aggregate, scales = "fixed") +
labs(title = "Average Height and Weight by Team and Continent",
x = "Continent",
y = "Average Value") +
theme_bw()
## Warning: Duplicated aesthetics after name standardisation: size
b
Comment: The chart shows average height and average weight of rugby
teams categorized by continent and team. The x-axis represents the
continents, while the y-axis depicts the average values for the height
and weight variables. Additionally, colors represent different teams.
Teams from Oceania appear to have the heaviest players on average.
American players seem to be the lightest, while Asian players appear to
be shorter compared to teams from other continents. Players from the
Europe continent seem to be the tallest.
Using polt_ly, create side-by-side box plot that shows
the distribution of age based on position. Colorize the plot based on
the continent. Add title, x- and y-axis labels, and a key. Include a
write up afterwards of anything interesting you noticed in the data
visualization.
+2 plotly side-by-side boxplot created +1 Position on x-axis age on y-axis +1 Colorized based on continent +2 Formatting includes title, x- and y-labels and key +1 commentary
plot_ly(
rugby_data2,
x = ~position,
y = ~age,
color = ~continent,
type = "box"
) %>%
layout(
title = "Age Distribution by Position",
xaxis = list(title = "Position"),
yaxis = list(title = "Age"),
boxmode = "group"
)
plot_ly(
rugby_data2,
x = ~position,
y = ~age,
color = ~continent,
type = "box"
) %>%
layout(
title = "Age Distribution by Position",
xaxis = list(title = "Position"),
yaxis = list(title = "Age")
)
# plot_ly(rugby_data2, x= ~position, y = ~age, color = ~continent, type = "box") %>%
# layout(title = "Distribution of Age Based on Rugby Position")
Comment: The graph demonstrates the distribution of player ages by rugby position based on different continents. The x-axis portrays player positions, whereas the y-axis illustrates the age variable. As shown, players in the prop and lock positions seem to have the highest median ages across all continents. In addition, the overall distribution shows that most positions have a median age between 25 and 30.
Consider the height_cm (x) and weight_kg
(y) variables from rugby_df. First, plot x and y using a
scatterplot where color is based on continent. The plot should be
rendered using plot_ly and should show
height_cm, weight_kg, continent,
and name of each observation upon hover. The x- and y-axes
should be labeled and the plot should have a title. Add the regression
equation line for each of the 5 continents.
Hint: To add the regression equation to your plotly plot, add a
column to the data that reports the fitted estimate of
weight_kg given the regression equation. Then add the lines
mapping x to y-hat using add_trace() function.
Then, calculate the least squares regression equations that models the relationship between x and y based on continent (hint: there should be 5 sets of slops and y-intercepts). The analysis should display p-values for each calculated slope in a dataframe that is clearly labeled. After the graph and analysis, include a write-up discussing whether you believe the data can be sufficiently modeled using a linear regression equation. You may include other analysis tools such as the coefficient of determination or the correlation coefficient.
Hint: Use lmList from lme4 package to find
the equations simultaneously based on continent.
+1 Plotly scatterplot created +1 height_cm and weight_cm on x- and y- axis +1 Colorized based on continent +1 Hover shows height_cm, weight_kg, continent, and name +1 Least squares regression equations calculated +1 Least squares regression equations labeled on graph +1 P-values for least squares regression equations calculated +1 Commentary
# to prevent error because of attaching the name to each point: Error in gsub("\n", br(), a, fixed = TRUE) : input string 1 is invalid in this locale
rugby_data3 <- rugby_data2 %>%
mutate(across(where(is.character), ~iconv(.x, "", "UTF-8", sub = "")))
p <- plot_ly(
rugby_data3,
x = ~height_cm,
y = ~weight_kg,
color = ~continent,
text = ~paste("Name:", name,
"<br>Height:", height_cm,
"<br>Weight:", weight_kg,
"<br>Continent:", continent),
hoverinfo = "text",
type = "scatter",
mode = "markers"
)
rugby_data3 <- rugby_data3 %>%
group_by(continent) %>%
mutate(weight_pred = predict(lm(weight_kg ~ height_cm)))
rugby_data3 <- rugby_data3 %>%
group_by(continent) %>%
do({
model <- lm(weight_kg ~ height_cm, data = .)
data.frame(., weight_pred = predict(model))
})
p <- p %>%
add_trace(
data = rugby_data3,
x = ~height_cm,
y = ~weight_pred,
color = ~continent,
type = "scatter",
mode = "lines",
showlegend = FALSE
)
p%>%
layout(
title = "Height vs Weight by Continent",
xaxis = list(title = "Height (cm)"),
yaxis = list(title = "Weight (kg)")
)
library(lme4)
## Warning: package 'lme4' was built under R version 4.3.3
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Warning in check_dep_version(): ABI version mismatch:
## lme4 was built with Matrix ABI version 1
## Current Matrix ABI version is 0
## Please re-install lme4 from source or restore original 'Matrix' package
models <- lmList(weight_kg ~ height_cm | continent, data = rugby_data2)
summary(models)
## Call:
## Model: weight_kg ~ height_cm | NULL
## Data: rugby_data2
##
## Coefficients:
## (Intercept)
## Estimate Std. Error t value Pr(>|t|)
## Africa -66.39762 33.73023 -1.968490 4.944765e-02
## Americas -81.81404 23.88010 -3.426035 6.520971e-04
## Asia -101.47058 30.13398 -3.367314 8.050369e-04
## Europe -107.37021 18.63093 -5.763009 1.290987e-08
## Oceania -71.96029 22.14145 -3.250026 1.215093e-03
## height_cm
## Estimate Std. Error t value Pr(>|t|)
## Africa 0.9120978 0.18122286 5.033017 6.300651e-07
## Americas 0.9830348 0.12896340 7.622588 9.128444e-14
## Asia 1.1029536 0.16392424 6.728435 3.843633e-11
## Europe 1.1292029 0.09951299 11.347291 2.706610e-27
## Oceania 0.9452436 0.11853510 7.974378 7.180252e-15
##
## Residual standard error: 10.44036 on 633 degrees of freedom
summary_fit <- summary(models)
p_values <- data.frame(
continent = names(models),
slope = coef(models)[,2],
slope_p_value = summary_fit$coefficients[, "Pr(>|t|)", "height_cm"]
)
p_values
coefs <- coef(models)
results <- data.frame(
continent = names(models),
intercept = coefs[,1],
slope = coefs[,2],
slope_p_value = p_values
)
eq_labels <- results %>%
mutate(label = paste0(
continent, ": y = ",
round(intercept, 2), " + ",
round(slope, 2), "x"
))
p <- p %>%
add_annotations(
text = eq_labels$label,
x = max(rugby_data3$height_cm),
y = seq(
max(rugby_data3$weight_kg),
max(rugby_data3$weight_kg) - 20,
length.out = nrow(eq_labels)
),
showarrow = FALSE
)
p %>%
layout(
title = "Height vs Weight by Continent",
xaxis = list(title = "Height (cm)"),
yaxis = list(title = "Weight (kg)")
)
Africa: y = -66.4 + 0.91x Americas: y = -81.81 + 0.98x Asia: y = -101.47 + 1.1x Europe: y = -107.37 + 1.13x Oceania y = -71.96 + 0.95x
Comment: The graph shows the relationship between height and weight of rugby players among different continents. The x-axis represents the height, while the y-axis depicts the weight of players. The data portrays strong positive correlations between height and weight in all continents. The p-values for the slopes are statistically significant since the values are extremely small. This means the height and weight variables are highly correlated to one another, which the taller the player, the heavier they are.
Create a new variable in rugby_df called n
that calculates the count of the number of players per debut year per
continent. Using a line plot, visualize this new varaible. Your graph
should have 5 lines, colorized by continent, formatted with x- and
y-axis labels and a title. Include a discussion on how the new variable
was created and anything you observe in the visualization.
+1 n created +2 Multi Lineplot created +1 Year on x-axis and Count on y-axis +1 Colorized based on continent +1 Formatted with x- and y-axis labels and a title +1 Commentary
rugby_data3 <- rugby_data2 %>%
mutate(debut_year = as.integer(format(as.Date(debut, format = "%d-%b-%y"), "%Y"))) %>%
filter(!is.na(debut_year))
debut_counts <- rugby_data3 %>%
group_by(continent, debut_year) %>%
summarise(n = n(), .groups = "drop")
ggplot(debut_counts, aes(x = debut_year, y = n, color = continent, group = continent)) +
geom_line(size = 1.2) +
geom_point(size = 2) +
labs(
title = "Number of Players Debuting per Year by Continent",
x = "Debut Year",
y = "Count of Players",
color = "Continent"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The chart demonstrates the number of rugby players that debut each year from distict continents. The x-axis portrays their debut years, while the y-axis depicts the number of players debuting per year. Overall, Europe has the highest counts among all continents. The debut peak for a lot of countries is between 2012-2014
Create a final visualization of rugby_df.
Requirements:
dplyrggplot2 package
and then run through ggplotly or should be created using
plotly package, so that the end result is interactive.Following your graph, create a write-up on which variables you chose and why, and any conclusions that you have come to based on your analysis.
+2 3 variables displayed +2 Rendered with ggplot + ggplotly or plotly +1 Formatted with x- and y-axis labels and a title +2 Graph is appropriate for data +1 Commentary
z = ggplot(rugby_data2,
aes(x = position, y = weight_kg, fill = continent)) +
geom_boxplot() +
labs(
title = "Player Weight Distribution by Position and Continent",
x = "Position",
y = "Weight (kg)"
) +
theme_minimal()
ggplotly(z)
rugby_df2 <- rugby_data2 %>%
mutate(position_group = ifelse(position %in% c("Prop","Hooker","Lock","Back Row"),
"Forwards","Backs"))
p <- ggplot(rugby_df2, aes(x = height_cm,
y = weight_kg,
color = position_group,
size = caps)) +
geom_point(alpha = 0.7) +
labs(
title = "Height vs Weight by Position Group",
x = "Height (cm)",
y = "Weight (kg)",
color = "Position Group",
size = "Caps"
) +
theme_minimal()
ggplotly(p)
Comment:The boxplot above shows the distribution of rugby players’ weights based on their positions and continents. The x-axis represents player positions, while the y-axis depicts the weights of rugby players in kilograms(kg). As shown above, the prop and lock positions have the highest weights overall(110kg-130kg) since these positions require a lot of strength. In contrast, scrum halves tend to be the lightest among the other positions(70kg-85kg), as the role needs a high level of speed.
Comment: The scatterplot above exhibits the relationship between heights and weights of rugby players. The positions were first grouped into backs and forwards. This new third variable was created to further display the overall shape of the data comparing forward and backward positions. The x-axis depicts the heights of rugby players in cm, while the y-axis demonstrates the weights of rugby players in kg. There is also a legend that portrays the colors that differentiate the forward and backward players, and the size of the points illustrates the number of times a player has played for their national team. There is a positive relationship/association between the variables weight and height in this data, indicating the taller the player, the heavier they are. Forward players are typically heavier and taller than backs because they need a high level of strength. On the other hand, the backs(red) are generally lighter and shorter, as they need a high level of speed.