General comments:
- All the plots should be labelled appropriately (axes, legends, titles). There will be marks allocated for this.
- Please submit both your .Rmd and the generated output file (.html or .pdf) on Canvas before the due date/time.
- Please make sure that the .Rmd file compiles without any errors. The marker will not spend time fixing the bugs in your code.
- Please avoid specifying absolute paths.
- Your submission must be original, and if we recognize that you have copied answers from another student in the course, we will deduct your marks.
- You must use tidyverse packages to answer the questions in this assignment. Please use dplyr for data wrangling/manipulation, ggplot2 for data visualisation, and lubridate for dates/times. Some parts of Problem 2 will use plots from the fpp3 packages.
- IMPORTANT NOTE: There are some questions that are for STATS 786 only. Students taking STATS 326, while you are welcome to attempt these questions, please do not submit answers to them.
Due: Friday 15 March 2023 at 16:00 (NZ time)
The lakers data set (in the lubridate package) contains play-by-play statistics of each Los Angeles Lakers basketball game in the 2008-2009 regular season. It contains the following variables:
| Variable | Description |
|---|---|
| date | Date of the game |
| opponent | Name of the opposition team |
| game_type | Home or away game |
| time | Time remaining on the game clock in a given period (counting down from 12 minutes) |
| period | The period of play (most games have four quarters, each 12 minutes in duration; some games go into a 5-minute overtime period if tied at the end of regular play) |
| etype | The type of play made (e.g., shot, turnover, rebound) |
| team | Name of the NBA team the player who made the play belongs to |
| player | Name of the player that the play was made by |
| result | Whether the play (e.g., a shot) was made or missed |
| type | A more detailed description of the type of play made |
| x | The \(x\)-coordinate on the field of play (in ft) |
| y | The \(y\)-coordinate on the field of play (in ft) |
- 6 Marks
- Read in the `lakers` data set and convert this into a `tibble` object.
- Keep only the rows relating to Kobe Bryant. Name this object `kobe`.
- Transform the `date` variable into a `lubridate` date format (noting that it is currently in integer format).
- Shot location is given by `x` and `y`. The center of the hoop is located at the coordinates \((25, 5.25)\). Center the `x` and `y` variables to \((0, 0)\); you will want to overwrite `x` and `y` in your `kobe` data set.
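library(tidyverse)
library(lubridate)
# load the play-by-play data and convert it to a tibble
data(lakers)
lakers <- as_tibble(lakers)
# keep Kobe Bryant's plays, parse the integer dates, and center the coordinates on the hoop
kobe <- lakers %>%
  filter(player == "Kobe Bryant") %>%
  mutate(date = ymd(date), x = x - 25, y = y - 5.25)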
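As a quick check of the date conversion and centering steps, the sketch below applies the same operations to a two-row toy tibble. The values in `toy` are made up for illustration and are not rows from `lakers`.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy rows: integer dates in yyyymmdd form and raw court coordinates
toy <- tibble(date = c(20081028L, 20090414L), x = c(25, 28), y = c(5.25, 9.25))

toy_centered <- toy %>%
  mutate(
    date = ymd(date),  # parse the integer as year-month-day
    x = x - 25,        # the hoop center moves to x = 0
    y = y - 5.25       # the hoop center moves to y = 0
  )

toy_centered
```

A shot taken exactly at the hoop (the first row) should end up at \((0, 0)\) after centering.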
- 6 Marks
- Subset the `kobe` data set by only considering plays that are shot attempts (i.e., where `etype` is equal to `shot`). Name this new data set `kobe.shot`.
- Make a scatter plot of the centered shot location, colouring the points by `result`. You should use the `geom_point` layer.
- Set the transparency of the points to `alpha = 0.5`.
- Use the default colour scheme, but reverse the colour order so that shots made is green(ish) and shots missed is red(ish). Hint: You can use `scale_colour_discrete` with an additional argument.
# keep only shot attempts
kobe.shot <- kobe %>% filter(etype == "shot")
# scatter plot of the centered shot locations, coloured by result
ggplot(data = kobe.shot,
       mapping = aes(x = x, y = y, colour = result)) +
  geom_point(alpha = 0.5) +
  # reverse the default hue order so that made is green(ish) and missed is red(ish)
  scale_colour_discrete(direction = -1) +
  labs(x = "Centered x (ft)", y = "Centered y (ft)", colour = "Result",
       title = "Shot Locations of Kobe Bryant") +
  theme_minimal()
- 6 Marks
- Using the `kobe.shot` data set, produce a 2-dimensional density plot (with contours) of Kobe Bryant’s shot locations. You will want to use both `geom_density_2d_filled` and `geom_density_2d`. Do not colour by `result`.
- Remove the legend using the `legend.position` argument in the `theme` layer.
ggplot(kobe.shot, aes(x = x, y = y)) +
  # filled density bands for the background colour (a constant fill would hide the bands)
  geom_density_2d_filled() +
  # contour lines on top
  geom_density_2d(colour = "white") +
  # removing the legend
  theme(legend.position = "none") +
  labs(x = "Centered x (ft)", y = "Centered y (ft)",
       title = "Two-Dimensional Density Plot of Kobe Bryant's Shot Locations")
- 9 Marks
- Within the `kobe.shot` data set, create a variable called `distance` that calculates the distance a shot was taken from the hoop. You will need to use Pythagoras’ theorem, i.e., \(\text{distance} = \sqrt{x^2 + y^2}\).
- Then create another variable within your `kobe.shot` data set called `indicator` that concatenates the values of `result` with `game_type`. Hint: You can use the `paste` function. You should end up with a variable in your data set that takes on the four values: “made home”, “made away”, “missed home”, “missed away”.
- Plot histograms showing the distribution of distance using `geom_histogram`. Use `facet_wrap` to create separate panels for all values of `indicator`. (You should end up with four panels on the same figure).
- Fill the histograms by `indicator` such that the interior of the bars are different colours for the four different groups.
- Remove the legend.
# calculates the distance of each shot from the hoop and
# creates an indicator variable combining the result with home/away
kobe.shot <- kobe.shot %>%
  mutate(distance = sqrt(x^2 + y^2),
         indicator = paste(result, game_type, sep = " "))
# histograms showing the distribution of distance, one panel per indicator value
ggplot(data = kobe.shot, aes(x = distance, fill = indicator)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ indicator) +
  labs(x = "Distance from hoop (ft)", y = "Number of shots",
       title = "Kobe Bryant's Shot Distances by Result and Game Type") +
  theme(legend.position = "none")
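To make the distance formula and the `paste` step concrete, here is a small worked sketch on made-up centered coordinates (the rows in `toy_shots` are hypothetical, chosen so the distances come out as whole numbers).

```r
library(dplyr)

# Hypothetical centered shots (feet from the hoop); not real rows from kobe.shot
toy_shots <- tibble(
  x = c(3, -6, 0),
  y = c(4, 8, 15),
  result = c("made", "missed", "made"),
  game_type = c("home", "away", "away")
)

toy_shots <- toy_shots %>%
  mutate(
    distance = sqrt(x^2 + y^2),           # Pythagoras: 3-4-5 and 6-8-10 triangles
    indicator = paste(result, game_type)  # paste() separates with a space by default
  )

toy_shots
```

Because the coordinates are squared, a shot from the left of the hoop (negative `x`) gets the same distance as its mirror image on the right.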
- 11 Marks
- Subset the original `kobe` data set (not the `kobe.shot` data set) by considering plays that are only free throws (i.e., where `etype` is equal to `free throw`). Call this new data set `kobe.free`.
- Within the `kobe.free` data set, calculate the total number of points from free throws per game as well as the free throw percentage per game. You will want to use the `group_by`, `summarise`, `sum`, and `n` functions.
- Plot Kobe Bryant’s free throw percentage per game using `geom_segment` to create vertical line segments from 0 to the free throw percentage. Your \(x\)-axis should be `date` and your \(y\)-axis should be free throw percentage.
- Add transparency proportional to the total number of points per game. (i.e., a larger number of points should have darker line segments).
# free throws only
kobe.free <- kobe %>%
  filter(etype == "free throw")
# total points and free throw percentage per game
kobe.free <- kobe.free %>%
  group_by(date) %>%
  summarise(total_points = sum(points),
            attempts = n(),
            free_throw_percentage = 100 * sum(result == "made") / n())
# alpha is mapped to total points, so games with more points get darker segments
ggplot(kobe.free, aes(x = date, y = free_throw_percentage)) +
  geom_segment(aes(xend = date, yend = 0, alpha = total_points), linewidth = 1.5) +
  geom_point(alpha = 0.5) +
  labs(x = "Date", y = "Free Throw Percentage", alpha = "Total points",
       title = "Kobe Bryant's Free Throw Percentage per Game") +
  theme_minimal()
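The per-game summary can be checked by hand on a tiny example. The sketch below uses a made-up `toy_ft` tibble and assumes, as a labelled assumption, that `points` is 1 for a made free throw and 0 for a miss.

```r
library(dplyr)

# Hypothetical free-throw rows for two games
# (assumption: points is 1 for a made free throw and 0 for a miss)
toy_ft <- tibble(
  date = as.Date(c("2008-10-28", "2008-10-28", "2008-10-28",
                   "2008-10-29", "2008-10-29")),
  result = c("made", "made", "missed", "made", "made"),
  points = c(1, 1, 0, 1, 1)
)

toy_summary <- toy_ft %>%
  group_by(date) %>%
  summarise(
    total_points = sum(points),                                # points from free throws
    attempts = n(),                                            # free throws taken
    free_throw_percentage = 100 * sum(result == "made") / n()  # share made, in percent
  )

toy_summary
```

In the first toy game two of three attempts are made, so the percentage is two thirds of 100; in the second, both attempts are made.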
- 7 Marks
- Using the `kobe` data set, find the unique dates that Kobe Bryant played in the 2008-2009 regular season. Hint: You will want to use the `group_by`, `summarise`, and `n_distinct` functions. You should end up with a data set with 78 rows. Name this data set `kobe.week`.
- Then create a variable that tells you the day of the week the game was played. You will need an appropriate `lubridate` function.
- Plot a bar chart that shows the frequency of games played on each of the seven days of the week.
- Comment on the most common and least common game days.
# one row per unique game date
kobe.week <- kobe %>%
  group_by(date) %>%
  summarise(n_dates = n_distinct(date))
# variable that tells you the day of the week the game was played
kobe.week <- kobe.week %>%
  mutate(day_of_week = wday(date, label = TRUE))
# bar chart that shows the frequency of games played on each of the seven days of the week
# (the day labels come from the factor levels, so they are not set by hand)
ggplot(data = kobe.week, aes(x = day_of_week)) +
  geom_bar(fill = "orange", colour = "grey") +
  labs(x = "Day of the week", y = "Number of games",
       title = "Games Played by Kobe Bryant on Each Day of the Week, 2008-2009 Regular Season") +
  theme_minimal()
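As a small illustration of the day-of-week step, the sketch below labels three made-up dates with `wday` and counts them. The dates in `toy_games` are hypothetical inputs, not taken from `kobe.week`.

```r
library(dplyr)
library(lubridate)

# Hypothetical game dates (28 October 2008 was a Tuesday)
toy_games <- tibble(date = ymd(c("2008-10-28", "2008-10-29", "2008-11-04")))

day_counts <- toy_games %>%
  mutate(day_of_week = wday(date, label = TRUE)) %>%  # ordered factor, Sunday first by default
  count(day_of_week)                                  # one row per day that occurs

day_counts
```

Because `wday(label = TRUE)` returns an ordered factor, a bar chart built on it keeps the days in calendar order rather than alphabetical order.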
The data set auckland_temps.csv contains the monthly average temperatures in Auckland from July 1994 until January 2024. [Data source: https://cliflo.niwa.co.nz]
- 5 Marks
- Read in the data using `read_csv` (don’t use `read.csv`).
- Convert the `Month` variable into the correct date format. Hint: You will need to use a function from the `tsibble` package.
- Coerce your `tibble` to a `tsibble` object with `Month` as the index.
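library(tidyverse)
library(fpp3)
# read_csv() (from readr) returns a tibble, as the question requires
auckland_temps <- read_csv("auckland_temps.csv")
# convert Month to a yearmonth index and coerce to a tsibble
auckland_temps <- auckland_temps %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)
print(auckland_temps, n = 10)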
## # A tsibble: 355 x 2 [1M]
## Month Temperature
## <mth> <dbl>
## 1 1994 Jul 10.5
## 2 1994 Aug 11
## 3 1994 Sep 12.2
## 4 1994 Oct 13.6
## 5 1994 Nov 15.7
## 6 1994 Dec 18.7
## 7 1995 Jan 19.7
## 8 1995 Feb 20.5
## 9 1995 Mar 19
## 10 1995 Apr 18.2
## # ℹ 345 more rows
- 7 Marks
- Create a time plot, seasonal plot, and subseries plot of the data.
- Comment on the seasonality in the plots. Which month has the highest average temperatures, and which month has the lowest?
- Comment on whether there is a trend in the data and if so, in what direction.
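# time plot
auckland_temps %>%
  autoplot(Temperature) +
  labs(x = "Month", y = "Temperature", title = "Time Plot of Monthly Average Temperature in Auckland")
# seasonal plot
auckland_temps %>%
  gg_season(Temperature, labels = "right") +
  labs(x = "Month", y = "Temperature", title = "Seasonal Plot of Monthly Average Temperature in Auckland")
# subseries plot
auckland_temps %>%
  gg_subseries(Temperature) +
  labs(x = "Year", y = "Temperature", title = "Subseries Plot of Monthly Average Temperature in Auckland")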
The first plot is a time plot of Auckland's monthly average temperature from July 1994 to January 2024. Its dominant feature is a regular yearly cycle of peaks and troughs. Apart from this cycle, the level of the series changes only slightly over the years, so any long-term trend is hard to judge from the time plot alone. The following plots give a clearer picture of how the temperature varies by month and across years.
The second plot is a seasonal plot, which overlays each year's temperatures by month. Temperatures generally fall from January until June or July, the coldest part of the Southern Hemisphere winter, and then rise gradually until December. The more recent years tend to sit slightly higher, which suggests a slight upward trend, and the January 2024 value (the only 2024 observation in the data) is among the highest for its month. This is consistent with wider concerns about warming, but this plot alone is not enough to draw firm conclusions.
The last plot is a subseries plot, which shows the temperatures for each month in a separate panel across the years. It shows the same pattern as the previous plots, a decrease in temperature until July and an increase from August to December, but separating the months makes it easier to compare a given month across the years. January, February, and March are the hottest months, and June and July are the months with the lowest recorded temperatures. Every month also shows considerable year-to-year variation.
- 8 Marks
- Create a lag plot for the first 12 lags. Use the `point` geometry and set `alpha = 0.5`.
- Write a sentence or two explaining what autocorrelation is.
- Comment on the patterns you observe in the lag plot, explaining why we see these specific autocorrelation patterns for lags 1, 6, and 12.
auckland_temps %>%
  gg_lag(Temperature, lags = 1:12, geom = "point", alpha = 0.5) +
  labs(title = "Lag Plot of Monthly Average Temperature in Auckland")
Autocorrelation measures the strength of the linear relationship between a time series and a lagged copy of itself, that is, between observations that are a fixed number of time periods apart. Here, a lag plot is used to show the autocorrelation of the Auckland temperatures at lags of 1 to 12 months.
The seasonal cycle explains the autocorrelation patterns at lags 1, 6, and 12. Lags 1 and 12 both show a positive linear relationship. At lag 1, each month is compared with the month before it, and consecutive months have similar temperatures. At lag 12, each month is compared with the same month one year earlier, so the seasonal conditions, and therefore the temperatures, are very similar. Lag 6 shows a negative linear relationship, because observations six months apart fall in opposite seasons: when one is warm, the other is cool.
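The signs described above can be illustrated numerically. The sample autocorrelation at lag \(k\) is commonly computed as \(r_k = \frac{\sum_{t=k+1}^{T}(y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T}(y_t - \bar{y})^2}\). The sketch below applies base R's `acf` to a synthetic monthly series with an annual cycle; the series `temp` is an assumption made for illustration and is not the Auckland data.

```r
# Hypothetical monthly series: a pure annual cycle plus noise (not the Auckland data)
set.seed(1)
months <- 1:240
temp <- 15 + 5 * cos(2 * pi * months / 12) + rnorm(240, sd = 0.5)

# Sample autocorrelations at lags 0 to 12; plot = FALSE returns the values only
r <- acf(temp, lag.max = 12, plot = FALSE)$acf[, 1, 1]
names(r) <- 0:12

# Lags 1 and 12 should be positive, lag 6 negative
round(r[c("1", "6", "12")], 2)
```

For a series dominated by a 12-month cycle, the autocorrelation is positive at lags 1 and 12 and negative at lag 6, which matches the direction of the point clouds in the lag plot.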
- STATS 786 only 10 Marks
- In this question, you will recreate the lag plot from (3).
- Instead of using `gg_lag`, you will use functions from the `dplyr` and `ggplot2` packages to create your own lag plot. You may also need functions from `lubridate`, `forcats`, and `tidyr`.
- Try to get your plot as close to the `gg_lag` plot you produced in (3). You will get full marks if your plot is exactly the same as what you get with `gg_lag`. Marks will be deducted for inconsistencies.
- Note: There are many ways to solve this problem, but here are some things you may want to consider: how to create lagged variables in your data set, how to name them, how to convert your data set into a long format, and how to facet your plot.
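One possible way to follow the hints in the note is sketched below: create one lagged column per lag, pivot to long format, and facet by lag. This is only a sketch under stated assumptions. It runs on a synthetic series (`toy_temps`, with hypothetical columns `idx` and `month`) rather than `auckland_temps`, the column names `lag 1` to `lag 12` are chosen for illustration, and the styling only approximates `gg_lag` rather than reproducing it exactly.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the Auckland series: an annual cycle plus noise
set.seed(1)
toy_temps <- tibble(
  idx = 1:120,
  month = factor(month.abb[(idx - 1) %% 12 + 1], levels = month.abb),
  Temperature = 15 + 5 * cos(2 * pi * idx / 12) + rnorm(120, sd = 0.5)
)

# One column per lag, named "lag 1" ... "lag 12"
lagged <- toy_temps
for (k in 1:12) {
  lagged[[paste("lag", k)]] <- dplyr::lag(toy_temps$Temperature, k)
}

# Long format: one row per (observation, lag) pair, dropping the undefined early lags
lag_long <- lagged %>%
  pivot_longer(starts_with("lag "), names_to = "lag", values_to = "lagged_value") %>%
  mutate(lag = factor(lag, levels = paste("lag", 1:12))) %>%  # keep lags in numeric order
  filter(!is.na(lagged_value))

# One panel per lag, points coloured by the month of the current observation
lag_plot <- ggplot(lag_long, aes(x = lagged_value, y = Temperature, colour = month)) +
  geom_abline(colour = "grey", linetype = "dashed") +  # reference line y = x
  geom_point(alpha = 0.5) +
  facet_wrap(~ lag) +
  coord_fixed() +
  labs(x = "lag(Temperature, n)", y = "Temperature", colour = "Month")

lag_plot
```

Converting `lag` to a factor with explicit levels matters here: left as a character column, the panels would sort as "lag 1", "lag 10", "lag 11", and so on.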
Total possible marks for Problem 2: 20 Marks for 326; 30 Marks for 786
Total possible marks for Assignment 1: 65 Marks for 326; 75 Marks for 786
Comment on the most common and least common game days:
Based on the bar chart of Kobe Bryant's games in the 2008-2009 regular season, Tuesday and Friday are the most common game days, each with more than 15 games. This could be due to NBA scheduling or various other factors. Sunday, with 15 games, is the third most common day. Wednesday and Thursday are close to average, with around 10 games each, while Monday and Saturday are the least common game days.