Analyzing Soccer Ticket Pricing

Introduction

Analyzing Soccer Ticket Pricing using data based on Seatgeek event tickets. 
The event that it is analyzing is the Nashville SC at intermiami SF game held in Florida. 
The analysis for Deal score and various of other variables have helped us correlate how prices change from game to game

Problem Statement

The management of Nashville SC at Inter Miami CF looks at the ticket pricing variations and strategies to best optimize revenue.

Objectives

The objective is to understand the factors that influence the ticket prices changing from game to game. 
This means using data visualization to find the independent variables such as section and Deal Score(rating of a ticket).

Step 1: Install and Load the R packages

# install.packages("readxl")
library(readxl)#Allows us to import Excel Files

options(scipen = 999)#Avoids using Scientific Notation in favor of decimals

Step 2: Import Excel Data

event_data <- read_excel(file.choose())
event_data

## # A tibble: 1,000 × 27
##    event_url        event_id event_category event_subcategory title datetime_utc
##    <chr>               <dbl> <chr>          <chr>             <chr> <chr>       
##  1 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
##  2 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
##  3 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
##  4 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
##  5 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
##  6 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
##  7 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
##  8 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
##  9 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
## 10 https://seatgee…  5900975 SPORTS         MLS               Nash… 2023/08/30 …
## # ℹ 990 more rows
## # ℹ 21 more variables: datetime_local <chr>, name <chr>, address <chr>,
## #   city <chr>, state <chr>, country <chr>, postal_code <dbl>,
## #   formatted_address <chr>, ticket_id <chr>, ticket_price <dbl>,
## #   total_price <dbl>, fee <dbl>, full_section <chr>, section <chr>, row <chr>,
## #   quantity <dbl>, delivery_method <chr>, marketplace <chr>, deal_label <chr>,
## #   deal_score <chr>, in_hand_date <chr>

Cleaning Data

#Makes a new data frame with only relevant columns
price_data <- subset(event_data, select = c(ticket_price, section, row, quantity, deal_score))
price_data

## # A tibble: 1,000 × 5
##    ticket_price section row   quantity deal_score        
##           <dbl> <chr>   <chr>    <dbl> <chr>             
##  1          369 130     4            2 8.8000000000000007
##  2          210 113     35           2 6.5               
##  3          173 125     6            2 9.3000000000000007
##  4          345 122     1            3 7.2               
##  5          275 128     21           2 3.4               
##  6          215 122     15           4 7.5               
##  7          298 127     4            2 6.4               
##  8          260 128     25           4 3.2               
##  9          335 112     2            2 7.9               
## 10          383 126     1            2 5.7               
## # ℹ 990 more rows

Data Description: A description of some of the features are presented in the table below.

Variable	Definition
Price	Price of ticket
Row	Row of ticket (Lower is closer)
Deal_Score	Seatgeek’s rating of value (Highers is better)
Section	Seating of the ticket inside a row
Quantity	Number of tickets sold

# Extract numeric part from the 'section' column.  Ex: Section 89 -> 89
price_data$section <- as.numeric(gsub("[^0-9]", "", price_data$section))

# Extract numeric part from the 'row' column. Ex: Row 20 -> 20
price_data$row <- as.numeric(gsub("[^0-9]", "", price_data$row))

#Turns the column numeric and turns anything else into N/A
price_data$deal_score <- as.numeric(price_data$deal_score)

## Warning: NAs introduced by coercion

Data Visualization

pairs(price_data)

model <- lm( ticket_price ~ row + deal_score + section + quantity, data = price_data)
summary(model)

## 
## Call:
## lm(formula = ticket_price ~ row + deal_score + section + quantity, 
##     data = price_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -604.1 -178.5  -48.8  105.7 7037.7 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 1336.6408    61.4915  21.737 <0.0000000000000002 ***
## row          -19.3089     1.5175 -12.725 <0.0000000000000002 ***
## deal_score  -100.4576     5.4885 -18.303 <0.0000000000000002 ***
## section        0.4141     0.3972   1.043               0.297    
## quantity      -4.2246     6.3723  -0.663               0.508    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 463.8 on 954 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.3264, Adjusted R-squared:  0.3236 
## F-statistic: 115.6 on 4 and 954 DF,  p-value: < 0.00000000000000022

Interpretation: Only Row and Deal_Score are significant (p_value > 0.05) so must omit the rest
Keeping only significant values make new data frame

Creating dataframe with only significant variables

sig_price_data <- subset(price_data, select = c(ticket_price, row, deal_score))
sig_price_data

## # A tibble: 1,000 × 3
##    ticket_price   row deal_score
##           <dbl> <dbl>      <dbl>
##  1          369     4        8.8
##  2          210    35        6.5
##  3          173     6        9.3
##  4          345     1        7.2
##  5          275    21        3.4
##  6          215    15        7.5
##  7          298     4        6.4
##  8          260    25        3.2
##  9          335     2        7.9
## 10          383     1        5.7
## # ℹ 990 more rows

Data Summary

summary(sig_price_data)

##   ticket_price         row          deal_score   
##  Min.   : 138.0   Min.   : 0.00   Min.   :0.200  
##  1st Qu.: 207.0   1st Qu.:10.00   1st Qu.:3.400  
##  Median : 304.0   Median :18.00   Median :6.350  
##  Mean   : 457.6   Mean   :17.76   Mean   :5.598  
##  3rd Qu.: 503.0   3rd Qu.:26.00   3rd Qu.:7.900  
##  Max.   :8354.0   Max.   :38.00   Max.   :9.700  
##                   NA's   :31      NA's   :8

Interpretation: Average price of a ticket is 457.60 dollars

Scatterplot Matrix & Correlation

pairs(sig_price_data)

cor(sig_price_data, use = "pairwise.complete.obs")

##              ticket_price         row  deal_score
## ticket_price    1.0000000 -0.29780261 -0.46121541
## row            -0.2978026  1.00000000 -0.09824359
## deal_score     -0.4612154 -0.09824359  1.00000000

Price and row have a -0.298 correlation so as price goes up you go down in rows. 
  (Getting closer to field)
Price and deal_score have a -0.461 correlation so as price goes up deal score goes down.
  (Higher deal_score shows better value so cheaper tickets are a better deal)

qqnorm(residuals(model))
qqline(residuals(model))

Assumption of Normality of Residuals:
Residuals should lie along the 45-degree line
This plot shows deviation from the line especially in the upper right tail
non-normality: significant deviation from the line
impacts confidence intervals and hypothesis tests
Extreme outliers suggest that the residuals do not follow a normal distribution

plot(model, which = 1)

Assumptions of Linearity and Homoscedasticity:
Residuals should be randomly scattered around zero
Our plot shows a funnel like shape with residuals increasing in spread as the fitted value increase
Heteroscedasticity: variance of residuals are not constant
extreme outliers indicate a few data points have an unusually high influence on the model

Create a Linear Model

model <- lm( ticket_price ~ row + deal_score, data = sig_price_data)
summary(model)

## 
## Call:
## lm(formula = ticket_price ~ row + deal_score, data = sig_price_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -604.9 -188.0  -46.3  110.0 7053.0 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 1359.398     44.471   30.57 <0.0000000000000002 ***
## row          -19.250      1.479  -13.01 <0.0000000000000002 ***
## deal_score   -99.479      5.415  -18.37 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 462 on 966 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.3247, Adjusted R-squared:  0.3233 
## F-statistic: 232.2 on 2 and 966 DF,  p-value: < 0.00000000000000022

Coefficient Interpretation
The expected price of a ticket if row and deal score are 0 would be 1359.40$.

The expected price would go down $19.25 for every row you move up in. 
  (Sitting farther away from the field makes the seat cheaper)

The expected price goes down $99.48 for every whole deal score increase.
  (Better/Higher deal score = Cheaper tickets)

Adjusted R-squared: about 32.33% of the variance in the dependent variable
This means that there is a weak model fit and there may be important factors missing from the model.
Model is significant because p-value is < 0.05

Conclusions

Conclusion & Recommendation
- We can conclude that it is possible to predict ticket seating based on certain variables such as price
- We observed that two variables, namely row and deal score are statistically significant (p-value < 0.05)

Business Recommendation
- The stadiums can use this to implement future pricing strategies
- The ticket companies can use it for understanding the pricing decisions
- Use it to see how changes in different factors affect ticket pricing