Week 6 Data Dive

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the written content and the output of any embedded R code chunks. You can embed an R code chunk like this:

library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

dataset <- read_excel("~/Downloads/UFC_Dataset.xls")
head(dataset)

## # A tibble: 6 × 118
##   RedFighter     BlueFighter RedOdds BlueOdds RedExpectedValue BlueExpectedValue
##   <chr>          <chr>         <dbl>    <dbl>            <dbl>             <dbl>
## 1 Jack Hermanss… Joe Pyfer       205     -250            205                40  
## 2 Dan Ige        Andre Fili     -185      154             54.1             154  
## 3 Robert Bryczek Ihor Potie…    -230      190             43.5             190  
## 4 Brad Tavares   Gregory Ro…     190     -230            190                43.5
## 5 Michael Johns… Darrius Fl…    -155      130             64.5             130  
## 6 Rodolfo Vieira Armen Petr…    -105     -115             95.2              87.0
## # ℹ 112 more variables: Date <dttm>, Location <chr>, Country <chr>,
## #   Winner <chr>, TitleBout <lgl>, WeightClass <chr>, Gender <chr>,
## #   NumberOfRounds <dbl>, BlueCurrentLoseStreak <dbl>,
## #   BlueCurrentWinStreak <dbl>, BlueDraws <dbl>, BlueAvgSigStrLanded <dbl>,
## #   BlueAvgSigStrPct <dbl>, BlueAvgSubAtt <dbl>, BlueAvgTDLanded <dbl>,
## #   BlueAvgTDPct <dbl>, BlueLongestWinStreak <dbl>, BlueLosses <dbl>,
## #   BlueTotalRoundsFought <dbl>, BlueTotalTitleBouts <dbl>, …

#Combining Red and Blue fighter data to get a picture of overall strikes landed
red_data <- dataset |> select(RedAvgSigStrLanded, RedReachCms) |> mutate(fighter_color = "Red")
blue_data <- dataset |> select(BlueAvgSigStrLanded, BlueReachCms) |> mutate(fighter_color = "Blue")

#Rename to shared column names so the two tables can be stacked row-wise
combined_data <- rbind(
  red_data |> rename(AvgsigStrLanded = RedAvgSigStrLanded, ReachCms = RedReachCms),
  blue_data |> rename(AvgsigStrLanded = BlueAvgSigStrLanded, ReachCms = BlueReachCms)
)

#Filtering the data to make the graphs neater 
filtered_combined_data <- combined_data |>
  filter(ReachCms >= 125 & ReachCms <= 225)

filtered_dataset <- dataset |>
  filter(HeightDif >= -50 & HeightDif <= 50)

#Visualization for correlation between reach and average significant strikes landed.
#Interpretation: This visualization reinforces the statements I make below about a weak linear relationship between the two variables. The data points are spread out with no real identifiable trend.
ggplot(filtered_combined_data, aes(x = ReachCms, y = AvgsigStrLanded, color = fighter_color)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Correlation between Average Significant Strikes Landed and Reach",
    x = "Reach (cm)",
    y = "Average Significant Strikes Landed",
    color = "Fighter Color"
  )

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1385 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 1385 rows containing missing values or values outside the scale range
## (`geom_point()`).
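
The warnings above say that rows with missing or out-of-range values were dropped before plotting. As a rough sketch of how one might check where missing values come from, the example below counts them per column. The `toy` data frame and its values are hypothetical stand-ins, not rows from the UFC dataset.

```r
# Toy data frame with hypothetical values; NA marks a missing measurement.
toy <- data.frame(
  ReachCms = c(180, NA, 175, 190, 185),
  AvgsigStrLanded = c(20.5, 31.0, NA, NA, 12.2)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows where both values are present and can be plotted.
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Running the same two calls on `filtered_combined_data` should show which column accounts for the removed rows.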

#Visualization for the relationship between height difference and strikes landed by Red fighter
#Interpretation: This visualization reinforces the statements I make below about a weak linear relationship between the two variables. The data points are spread out with no real identifiable trend.
ggplot(filtered_dataset, aes(x = HeightDif, y = RedAvgSigStrLanded)) +
  geom_jitter(width = 0.1, height = 0.1, alpha = 0.4) +
  #Linear fit; a binomial GLM requires y values between 0 and 1, so it cannot be fit to strike averages
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Correlation between Height Difference and Red Fighter Strikes Landed",
    x = "Height Difference (cm)",
    y = "Average Significant Strikes Landed"
  )

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 455 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 455 rows containing missing values or values outside the scale range
## (`geom_point()`).

#Correlation Coefficient for the relationship between Reach and Strikes landed
#Interpretation: Since the coefficient is very close to zero, there is no strong linear relationship between a fighter's reach and their average significant strikes landed. This is surprising, since reach is often viewed as an important factor for strikers and hailed as a massive advantage in all striking combat sports. I need to figure out whether I did this incorrectly and determine what factors do affect high striking volume.
correlation_reach_sigstrikes <- cor(filtered_combined_data$ReachCms, filtered_combined_data$AvgsigStrLanded, use = "complete.obs")
print(correlation_reach_sigstrikes)

## [1] -0.05446923
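
One way to check whether a coefficient this small is distinguishable from zero is `cor.test()` from base R's stats package, which reports a confidence interval and p-value alongside the estimate. The sketch below uses simulated data, not the UFC dataset: `reach` and `strikes` are hypothetical vectors generated independently of each other, so the true correlation is zero by construction.

```r
set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-ins (assumptions, not real fight data).
reach <- rnorm(200, mean = 180, sd = 10)   # reach in cm
strikes <- rnorm(200, mean = 22, sd = 15)  # strike averages, independent of reach

# Pearson correlation test with a 95% confidence interval.
result <- cor.test(reach, strikes, method = "pearson", conf.level = 0.95)
print(result$estimate)  # sample correlation coefficient
print(result$conf.int)  # 95% confidence interval for the correlation
print(result$p.value)   # p-value for the null hypothesis of zero correlation
```

With the real data, the two arguments would be `filtered_combined_data$ReachCms` and `filtered_combined_data$AvgsigStrLanded`.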

#Correlation Coefficient for the relationship between Height Difference and Red's Strikes Landed
#Interpretation: Since the coefficient is very close to zero, there is no strong linear relationship between the height difference and the Red fighter's average significant strikes landed. This is surprising, since a height advantage, like reach, is often viewed as an important factor for strikers and hailed as a massive advantage in all striking combat sports. I need to figure out whether I did this incorrectly and determine what factors do affect high striking volume.
correlation_height_sigstrikes <- cor(filtered_dataset$HeightDif, filtered_dataset$RedAvgSigStrLanded, use = "complete.obs")
print(correlation_height_sigstrikes)

## [1] 0.02376966
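
To put the two coefficients in perspective: in a simple linear regression with a single predictor, the share of variance explained (R-squared) equals the square of the Pearson correlation. The short calculation below applies that identity to the two values printed above. The variable names are illustrative.

```r
# Correlation coefficients copied from the output above.
r_reach  <- -0.05446923  # reach vs. average significant strikes landed
r_height <-  0.02376966  # height difference vs. Red's strikes landed

# For one predictor, R-squared is the squared correlation coefficient.
r_squared <- c(reach = r_reach^2, height = r_height^2)

# Express as a percentage of variance explained.
print(round(r_squared * 100, 2))
```

Both values come out well under 1%, which matches the flat trend lines in the plots.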

#Confidence Intervals
t_test_reach_strikes <- t.test(filtered_combined_data$AvgsigStrLanded, conf.level = 0.95)
print(t_test_reach_strikes$conf.int)

## [1] 21.66575 22.43829
## attr(,"conf.level")
## [1] 0.95

##We can say with 95% confidence that the true mean of average significant strikes landed falls within this interval. This gives us insight into the central tendency of the variable. Note that the interval describes the mean, not the spread of individual fighters, so it serves as a baseline for judging what a “high” striking volume truly is.

# Confidence Intervals
t_test_height_strikes <- t.test(filtered_dataset$RedAvgSigStrLanded, conf.level = 0.95)
print(t_test_height_strikes$conf.int)

## [1] 22.14389 23.19773
## attr(,"conf.level")
## [1] 0.95

##We can say with 95% confidence that the true mean of average significant strikes landed by the Red fighter falls within this interval. This gives us insight into the central tendency of the variable and where its mean lies.
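
For reference, the interval that `t.test()` reports for a single sample is the sample mean plus or minus a t critical value times the standard error. The sketch below rebuilds that interval by hand and compares it with the built-in result. The vector `x` is simulated for illustration and is not the UFC data.

```r
set.seed(2)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in for a column of strike averages (an assumption).
x <- rnorm(100, mean = 22, sd = 15)

n <- length(x)
std_error <- sd(x) / sqrt(n)     # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # two-sided 95% critical value

# Manual interval: mean plus or minus critical value times standard error.
manual_ci <- mean(x) + c(-1, 1) * t_crit * std_error

# Built-in interval from a one-sample t-test.
builtin_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(manual_ci)
print(builtin_ci)
```

Because the standard error shrinks as the sample size grows, a large dataset gives a narrow interval around the mean even when individual fighters vary widely.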

2024-10-08