General comments:

  • All the plots should be labelled appropriately (axes, legends, titles). There will be marks allocated for this.

  • Please submit both your .Rmd, and the generated output file .html or .pdf on Canvas before the due date/time.

  • Please make sure that the .Rmd file compiles without any errors. The marker will not spend time fixing the bugs in your code.

  • Please avoid specifying absolute paths.

  • Your submission must be original, and if we recognize that you have copied answers from another student in the course, we will deduct your marks.

  • You must use tidyverse packages to answer the questions in this assignment. Please use dplyr for data wrangling/manipulation, ggplot2 for data visualisation, and lubridate for dates/times. Some parts of Problem 2 will use plots from the fpp3 packages.

  • IMPORTANT NOTE: There are some questions that are for STATS 786 only. Students taking STATS 326, while you are welcome to attempt these questions, please do not submit answers to them.

Due: Friday 15 March 2023 at 16:00 (NZ time)

Problem 1: Kobe Bryant

The lakers data set (in the lubridate package) contains play-by-play statistics of each Los Angeles Lakers basketball game in the 2008-2009 regular season. It contains the following variables:

Variable    Description
date        Date of the game
opponent    Name of the opposition team
game_type   Home or away game
time        Time remaining on the game clock in a given period (counting down from 12 minutes)
period      The period of play (most games have four 12-minute quarters; games tied at the end of regular play go into 5-minute overtime periods)
etype       The type of play made (e.g., shot, turnover, rebound)
team        Name of the NBA team the player who made the play belongs to
player      Name of the player that the play was made by
result      Whether the shot was made or missed
points      Number of points scored on the play
type        A more detailed description of the type of play made
x           The \(x\)-coordinate on the field of play (in ft)
y           The \(y\)-coordinate on the field of play (in ft)
  1. 6 Marks
  • Read in the lakers data set and convert this into a tibble object.
  • Keep only the rows relating to Kobe Bryant. Name this object kobe.
  • Transform the date variable into a lubridate date format (noting that it is currently in integer format).
  • Shot location is given by x and y. The center of the hoop is located at the coordinates \((25, 5.25)\). Center the x and y variables to \((0, 0)\); you will want to overwrite x and y in your kobe data set.
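library(tidyverse)
library(lubridate)

# Load the lakers data (shipped with lubridate) and convert it to a tibble
data(lakers)
lakers <- as_tibble(lakers)

# Keep Kobe Bryant's plays, parse the integer date, and centre x and y on the hoop
kobe <- lakers %>%
  filter(player == "Kobe Bryant") %>%
  mutate(date = ymd(date),
         x = x - 25,
         y = y - 5.25)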
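As a quick check of the date conversion and centering steps, the following minimal sketch applies them to a small made-up tibble. The object names `toy` and `toy_centred` and all of the values are hypothetical and are not rows from the lakers data.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows: integer dates in yyyymmdd form and raw court coordinates
toy <- tibble(date = c(20081028L, 20090414L),
              x = c(25, 30),
              y = c(5.25, 10.25))

toy_centred <- toy %>%
  mutate(date = ymd(date),  # integer yyyymmdd -> Date
         x = x - 25,        # the hoop sits at x = 25
         y = y - 5.25)      # the hoop sits at y = 5.25

toy_centred
```

A shot taken from directly under the hoop should end up at \((0, 0)\) after centering, which is what the first row illustrates.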
  2. 6 Marks
  • Subset the kobe data set by only considering plays that are shot attempts (i.e., where etype is equal to shot). Name this new data set kobe.shot.
  • Make a scatter plot of the centered shot location, colouring the points by result. You should use the geom_point layer.
  • Set the transparency of the points to alpha = 0.5.
  • Use the default colour scheme, but reverse the colour order so that shots made is green(ish) and shots missed is red(ish). Hint: You can use scale_colour_discrete with an additional argument.
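# Keep only the plays that are shot attempts
kobe.shot <- kobe %>%
  filter(etype == "shot")

# Scatter plot of the centred shot locations, coloured by result
ggplot(data = kobe.shot,
       mapping = aes(x = x, y = y, colour = result)) +
  geom_point(alpha = 0.5) +
  # Reverse the default hue order so that made shots are green(ish) and missed shots red(ish)
  scale_colour_discrete(direction = -1) +
  labs(x = "Centred x (ft)", y = "Centred y (ft)", colour = "Result",
       title = "Shot Locations of Kobe Bryant, 2008-2009 Regular Season") +
  theme_minimal()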
  3. 6 Marks
  • Using the kobe.shot data set, produce a 2-dimensional density plot (with contours) of Kobe Bryant’s shot locations. You will want to use both geom_density_2d_filled and geom_density_2d. Do not colour by result.
  • Remove the legend using legend.position argument in the theme layer.
ggplot(kobe.shot, aes(x = x, y = y)) +
  # Filled density bands for the background (a constant fill would hide the density levels)
  geom_density_2d_filled() +
  # Contour lines drawn on top of the filled bands
  geom_density_2d(colour = "white") +
  # Remove the legend
  theme(legend.position = "none") +
  labs(x = "Centred x (ft)", y = "Centred y (ft)",
       title = "Two-Dimensional Density Plot of Kobe Bryant's Shot Locations")

  4. 9 Marks
  • Within the kobe.shot data set, create a variable called distance that calculates the distance a shot was taken from the hoop. You will need to use Pythagoras’ theorem, i.e., \(\text{distance} = \sqrt{x^2 + y^2}\).
  • Then create another variable within your kobe.shot data set called indicator that concatenates the values of result with game_type. Hint: You can use the paste function. You should end up with a variable in your data set that takes on the four values: “made home”, “made away”, “missed home”, “missed away”.
  • Plot histograms showing the distribution of distance using geom_histogram. Use facet_wrap to create separate panels for all values of indicator. (You should end up with four panels on the same figure.)
  • Fill the histograms by indicator such that the interior of the bars are different colours for the four different groups.
  • Remove the legend.
# Distance of each shot from the hoop (Pythagoras), and result combined with game type
kobe.shot <- kobe.shot %>%
  mutate(distance = sqrt(x^2 + y^2),
         indicator = paste(result, game_type))

# Histograms of shot distance, one panel per value of indicator
ggplot(data = kobe.shot, aes(x = distance, fill = indicator)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ indicator) +
  labs(x = "Distance from hoop (ft)", y = "Number of shots",
       title = "Kobe Bryant's Shot Distances by Result and Game Type") +
  theme(legend.position = "none")
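The distance and indicator calculations can be checked by hand on a tiny example. The sketch below uses a hypothetical tibble, `toy_shots`, whose first row is a 3-4-5 right triangle, so its distance from the hoop should be exactly 5 ft.

```r
library(dplyr)

# Hypothetical centred shot locations (not taken from the lakers data)
toy_shots <- tibble(x = c(3, 0),
                    y = c(4, 0),
                    result = c("made", "missed"),
                    game_type = c("home", "away"))

toy_shots <- toy_shots %>%
  mutate(distance = sqrt(x^2 + y^2),            # Pythagoras' theorem
         indicator = paste(result, game_type))  # paste() separates with a space by default

toy_shots
```

Because `paste` already inserts a single space between its arguments, `sep = " "` does not need to be supplied.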

  5. 11 Marks
  • Subset the original kobe data set (not the kobe.shot data set) by considering plays that are only free throws (i.e., where etype is equal to free throw). Call this new data set kobe.free.
  • Within the kobe.free data set, calculate the total number of points from free throws per game as well as the free throw percentage per game. You will want to use the group_by, summarise, sum, and n functions.
  • Plot Kobe Bryant’s free throw percentage per game using geom_segment to create vertical line segments from 0 to the free throw percentage. Your \(x\)-axis should be date and your \(y\)-axis should be free throw percentage.
  • Add transparency proportional to the total number of points per game. (i.e., a larger number of points should have darker line segments).
# Keep only the free throws from the original kobe data set
kobe.free <- kobe %>%
  filter(etype == "free throw")

# Total points from free throws and free throw percentage per game
kobe.free <- kobe.free %>%
  group_by(date) %>%
  summarise(total_points = sum(points),
            free_throw_percentage = 100 * sum(result == "made") / n())

# Vertical segments from 0 to the free throw percentage;
# alpha is mapped to total_points so that higher-scoring games are drawn darker
ggplot(kobe.free, aes(x = date, y = free_throw_percentage)) +
  geom_segment(aes(xend = date, yend = 0, alpha = total_points), linewidth = 1.5) +
  geom_point(alpha = 0.5) +
  labs(x = "Date", y = "Free throw percentage (%)", alpha = "Points from free throws",
       title = "Kobe Bryant's Free Throw Percentage per Game, 2008-2009 Regular Season") +
  theme_minimal()
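To make the per-game summary concrete, here is a minimal sketch on made-up free throws. The tibble `toy_free` and its values are hypothetical; the point is only to show how `group_by`, `summarise`, `sum`, and `n` combine.

```r
library(dplyr)

# Hypothetical free throws: three attempts in one game, one attempt in the next
toy_free <- tibble(
  date   = as.Date(c("2008-10-28", "2008-10-28", "2008-10-28", "2008-10-29")),
  result = c("made", "missed", "made", "made"),
  points = c(1, 0, 1, 1)
)

toy_summary <- toy_free %>%
  group_by(date) %>%
  summarise(total_points = sum(points),   # points scored from free throws in the game
            attempts = n(),               # number of free throw attempts in the game
            free_throw_percentage = 100 * sum(result == "made") / n())

toy_summary
```

In this toy example the first game has 2 points from 3 attempts (about 66.7%) and the second has 1 point from 1 attempt (100%).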

  6. 7 Marks
  • Using the kobe data set, find the unique dates that Kobe Bryant played in the 2008-2009 regular season. Hint: You will want to use the group_by, summarise, and n_distinct functions. You should end up with a data set with 78 rows. Name this data set kobe.week.
  • Then create a variable that tells you the day of the week the game was played. You will need an appropriate lubridate function.
  • Plot a bar chart that shows the frequency of games played on each of the seven days of the week.
  • Comment on the most common and least common game days.
# One row per date on which Kobe Bryant made at least one play
kobe.week <- kobe %>%
  group_by(date) %>%
  summarise(games = n_distinct(date))

# Day of the week on which each game was played
kobe.week <- kobe.week %>%
  mutate(day_of_week = wday(date, label = TRUE))

# Bar chart of the number of games played on each day of the week
ggplot(data = kobe.week, aes(x = day_of_week)) +
  geom_bar(fill = "orange", colour = "grey") +
  labs(x = "Day of the week", y = "Number of games",
       title = "Games Played by Kobe Bryant by Day of the Week, 2008-2009 Regular Season") +
  theme_minimal()
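The numbers behind the bar chart can also be tabulated directly. The sketch below does this for three made-up dates (the tibble `toy_games` is hypothetical) and assumes lubridate's default week start, under which `wday` numbers Sunday as 1 and Saturday as 7.

```r
library(dplyr)
library(lubridate)

# Hypothetical game dates (not taken from the lakers data)
toy_games <- tibble(date = ymd(c("2008-10-28", "2008-10-29", "2008-11-04")))

toy_counts <- toy_games %>%
  mutate(day_number = wday(date),                  # 1 = Sunday, ..., 7 = Saturday by default
         day_of_week = wday(date, label = TRUE)) %>%  # ordered factor; labels depend on the locale
  count(day_number, day_of_week, name = "games")

toy_counts
```

Applying the same `count` step to kobe.week would give the exact number of games per day, which is easier to quote in a comment than values read off the bars.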

Comment on the most common and least common game days:

Based on the bar chart, Kobe Bryant's most common game days in the 2008-2009 regular season were Tuesdays and Fridays, each with more than 15 games, followed by Sundays with about 15. This is probably a consequence of how the NBA schedules its games, among other factors. Wednesdays and Thursdays sit in the middle with around 10 games each, while Mondays and Saturdays were the least common game days.

Total possible marks for Problem 1: 45 Marks

Problem 2: Monthly average temperatures in Auckland

The data set auckland_temps.csv contains the monthly average temperatures in Auckland from July 1994 until January 2024. [Data source: https://cliflo.niwa.co.nz]

  1. 5 Marks
  • Read in the data using read_csv (don’t use read.csv).
  • Convert the Month variable into the correct date format. Hint: You will need to use a function from the tsibble package.
  • Coerce your tibble to a tsibble object with Month as the index.
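library(fpp3)

# read_csv (from readr) returns a tibble, unlike base R's read.csv
auckland_temps <- read_csv("auckland_temps.csv")

# Convert Month to a yearmonth object and coerce the tibble to a tsibble indexed by Month
auckland_temps <- auckland_temps %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)

print(auckland_temps, n = 10)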
## # A tsibble: 355 x 2 [1M]
##       Month Temperature
##       <mth>       <dbl>
##  1 1994 Jul        10.5
##  2 1994 Aug        11  
##  3 1994 Sep        12.2
##  4 1994 Oct        13.6
##  5 1994 Nov        15.7
##  6 1994 Dec        18.7
##  7 1995 Jan        19.7
##  8 1995 Feb        20.5
##  9 1995 Mar        19  
## 10 1995 Apr        18.2
## # ℹ 345 more rows
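The conversion can be tried without the CSV file. The sketch below rebuilds the first three rows of the printed output by hand; it assumes, hypothetically, that the Month column is stored as text such as "1994 Jul", which may not match the actual file.

```r
library(dplyr)
library(tsibble)

# First three rows of the output above, with Month assumed to be stored as text
toy_temps <- tibble(Month = c("1994 Jul", "1994 Aug", "1994 Sep"),
                    Temperature = c(10.5, 11, 12.2))

toy_ts <- toy_temps %>%
  mutate(Month = yearmonth(Month)) %>%  # character -> yearmonth
  as_tsibble(index = Month)             # tibble -> tsibble with a monthly index

toy_ts
```

If the file stores the month in another format, `yearmonth` may need a different input or a `format` argument, so it is worth inspecting the raw column first.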
  2. 7 Marks
  • Create a time plot, seasonal plot, and subseries plot of the data.
  • Comment on the seasonality in the plots. Which month has the highest average temperatures, and which month has the lowest?
  • Comment on whether there is a trend in the data and if so, in what direction.
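# Time plot
auckland_temps %>%
  autoplot(Temperature) +
  labs(x = "Month", y = "Temperature (°C)",
       title = "Time Plot of Monthly Average Temperature in Auckland")

# Seasonal plot: one line per year
auckland_temps %>%
  gg_season(Temperature, labels = "right") +
  labs(x = "Month", y = "Temperature (°C)",
       title = "Seasonal Plot of Monthly Average Temperature in Auckland")

# Subseries plot: one panel per month
auckland_temps %>%
  gg_subseries(Temperature) +
  labs(x = "Year", y = "Temperature (°C)",
       title = "Subseries Plot of Monthly Average Temperature in Auckland")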

The first plot is a time plot of Auckland's monthly average temperature from the 1990s to the 2020s. It shows a strong, regular seasonal pattern, with one peak and one trough every year. The overall level of the series changes only slightly over the years, so no clear trend stands out from the time plot alone. The following plots give a clearer picture of the seasonal pattern and of how temperatures have changed over time.

The second plot is a seasonal plot, which overlays each year's temperatures against the month of the year. Temperatures generally fall from January until June or July, during the New Zealand winter, and then rise gradually until December. The more recent years tend to sit slightly higher, which suggests a weak upward trend, and the 2024 observation (January, since the data end there) is among the highest values compared with the other years. Linking this to a wider global concern would need more evidence and expertise than these plots provide.

The last plot is a subseries plot, which shows the temperatures for each month in its own panel across the years. It shows the same pattern as the previous plots, a decrease in temperature up until July and an increase from August to December, but separates the months so that the variation over the years is easier to compare. January, February, and March are the hottest months, and June and July are the months with the lowest recorded temperatures. Every month also shows considerable variation from year to year across the decades.

  3. 8 Marks
  • Create a lag plot for the first 12 lags. Use the point geometry and set alpha = 0.5.
  • Write a sentence or two explaining what autocorrelation is.
  • Comment on the patterns you observe in the lag plot, explaining why we see these specific autocorrelation patterns for lags 1, 6, and 12.
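auckland_temps %>%
  gg_lag(Temperature, lags = 1:12, geom = "point", alpha = 0.5) +
  labs(x = "Lagged temperature (°C)", y = "Temperature (°C)",
       title = "Lag Plot of Monthly Average Temperature in Auckland, Lags 1 to 12")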

Explanation And Interpretation Of Autocorrelation And Specific Patterns In The Lag Plot Above:

Autocorrelation measures the linear relationship between a time series and lagged values of itself, that is, the correlation between \(y_t\) and \(y_{t-k}\) for a given lag \(k\). Here, the lag plot shows each monthly Auckland temperature plotted against the temperature 1 to 12 months earlier, with one panel per lag.

The autocorrelation patterns at lags 1, 6, and 12 follow from the seasonality of the data. At lag 1, each month is compared with the month before it; adjacent months fall in the same part of the year and have similar temperatures, so the points lie close to a positive linear trend. At lag 12, each month is compared with the same month one year earlier, when the weather and seasonal conditions are very similar, so the relationship is again strongly positive. At lag 6, each month is compared with the month half a year earlier, which falls in the opposite season (for example, summer against winter), so high temperatures pair with low ones and the relationship is negative.
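The signs of these autocorrelations can be illustrated with a small calculation. The sketch below uses a hypothetical, noise-free seasonal series (a cosine with a 12-month period, not the Auckland data) and a helper, `lag_cor`, introduced here only for illustration. It computes the plain Pearson correlation of the lagged pairs, which differs slightly from the estimator used by `acf`.

```r
# Hypothetical noise-free seasonal series: 10 years of monthly values
t <- 1:120
temp <- 15 + 5 * cos(2 * pi * t / 12)

# Correlation between the series and itself shifted back by k months
lag_cor <- function(y, k) {
  n <- length(y)
  cor(y[(k + 1):n], y[1:(n - k)])
}

lag_cors <- sapply(c(1, 6, 12), function(k) lag_cor(temp, k))
names(lag_cors) <- c("lag 1", "lag 6", "lag 12")
lag_cors
```

For this idealised series the correlation is strongly positive at lag 1, essentially \(-1\) at lag 6, and essentially \(+1\) at lag 12, which mirrors the shapes seen in the corresponding panels of a lag plot for strongly seasonal monthly data.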

  4. STATS 786 only 10 Marks
  • In this question, you will recreate the lag plot from (3).
  • Instead of using gg_lag, you will use functions from the dplyr and ggplot2 packages to create your own lag plot. You may also need functions from lubridate, forcats, and tidyr.
  • Try to get your plot as close as possible to the gg_lag plot you produced in (3). You will get full marks if your plot is exactly the same as what you get with gg_lag. Marks will be deducted for inconsistencies.
  • Note: There are many ways to solve this problem, but here are some things you may want to consider: how to create lagged variables in your data set, how to name them, how to convert your data set into a long format, and how to facet your plot.
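One possible approach to the considerations listed above is sketched here on synthetic data. The tibble `toy` is a hypothetical stand-in for auckland_temps, and the styling choices (dashed grey identity line, colouring by month, square panels) are assumptions about what a lag plot typically looks like rather than a verified match to gg_lag.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for auckland_temps: 10 years of synthetic monthly temperatures
set.seed(1)
toy <- tibble(
  Month = seq(as.Date("2000-01-01"), by = "month", length.out = 120),
  Temperature = 15 + 5 * cos(2 * pi * (as.integer(format(Month, "%m")) - 2) / 12) +
    rnorm(120, sd = 0.5)
)

# Create one lagged column per lag, named "lag 1", ..., "lag 12"
with_lags <- toy
for (k in 1:12) {
  with_lags[[paste0("lag ", k)]] <- dplyr::lag(toy$Temperature, k)
}

# Convert to long format: one row per (month, lag) pair
lag_long <- with_lags %>%
  pivot_longer(starts_with("lag "), names_to = "lag", values_to = "lag_value") %>%
  mutate(lag = factor(lag, levels = paste0("lag ", 1:12)),  # keep panels in numeric order
         month = factor(month.abb[as.integer(format(Month, "%m"))], levels = month.abb)) %>%
  filter(!is.na(lag_value))  # the first k observations have no value k months earlier

# Facet by lag and plot each value against its lagged value
p <- ggplot(lag_long, aes(x = lag_value, y = Temperature, colour = month)) +
  geom_abline(colour = "gray", linetype = "dashed") +
  geom_point(alpha = 0.5) +
  facet_wrap(~ lag) +
  theme(aspect.ratio = 1) +
  labs(x = "lag(Temperature, n)", y = "Temperature", colour = "Month")
p
```

Converting the lag names to a factor with explicit levels matters here: without it, the panels would be ordered alphabetically ("lag 1", "lag 10", "lag 11", ...) rather than numerically.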

Total possible marks for Problem 2: 20 Marks for 326, 30 Marks for 786

Total possible marks for Assignment 1: 65 Marks for 326, 75 Marks for 786