STAT380 Mini-Project 1

Front Matter

#Load Libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("readxl")

#Read Data
COD_Data <- read_xlsx("/Users/bunch/Library/Mobile Documents/com~apple~CloudDocs/Stat 380/Week 8/CODGames2_mp.xlsx")

Variable Exploration

Variable 1

Variable Name: Eliminations

Variable Type: Quantitative

Variable Values: Can take on any whole-number value from 0 up to the winning score (typically 100), but in this data set it ranges from 2 to 39.

Variable Missing Data: No missing data

sum(is.na(COD_Data$Eliminations))

## [1] 0

Description: The Eliminations variable records the number of kills a player gets in a game. A kill is credited to the player who dealt the last damage to an opposing player before they died.

Summary Statistics:

COD_Data %>%
  summarize(N = n(),
            Minimum = min(Eliminations),
            Maximum = max(Eliminations),
            Mean = mean(Eliminations),
            Median = median(Eliminations),
            Standard_Deviation = sd(Eliminations),
            Q1 = quantile(Eliminations, probs = .25),
            Q3 = quantile(Eliminations, probs = .75)
            )

## # A tibble: 1 × 8
##       N Minimum Maximum  Mean Median Standard_Deviation    Q1    Q3
##   <int>   <dbl>   <dbl> <dbl>  <dbl>              <dbl> <dbl> <dbl>
## 1   211       2      39  15.1     14               6.13    11  18.5

Visualization:

ggplot(data = COD_Data,
       mapping = aes(x = Eliminations)) +
  geom_histogram(fill = "orange", color = "white", binwidth = 1) + 
  labs(title = "COD Eliminations Distribution",
       x = "Eliminations",
       y = "Frequency")

Analysis of Visualization: The distribution is mostly symmetric with a slight right skew, as suggested by the mean of 15.09 being greater than the median of 14. Most observations fall between 10 and 20 eliminations, but there are a few outliers above 30, including the maximum of 39.
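As a rough check on the outliers mentioned above, the quartiles from the summary table can be plugged into the common 1.5 × IQR rule. That rule is my assumption for what counts as an outlier here (it is also the default whisker length in geom_boxplot()); the data set itself does not define one. A minimal sketch in base R:

```r
# Quartiles reported in the Eliminations summary table
q1 <- 11
q3 <- 18.5

# Interquartile range and the 1.5 * IQR fences (assumed outlier rule)
iqr <- q3 - q1                  # 7.5
upper_fence <- q3 + 1.5 * iqr   # 29.75
lower_fence <- q1 - 1.5 * iqr   # -0.25

c(lower = lower_fence, upper = upper_fence)
```

Under this rule any game with 30 or more eliminations is flagged, which matches what the histogram shows. The lower fence is below 0, so no low outliers are possible.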

Variable 2

Variable Name: PrimaryWeapon

Variable Type: Categorical

Variable Values: Names of the guns a player can use, such as the “Krig 6” (Assault Rifle) or the “MP5” (Submachine Gun).

Variable Missing Data: No missing data

sum(is.na(COD_Data$PrimaryWeapon))

## [1] 0

Description: The PrimaryWeapon variable records the main gun a player used in a game. A player can have only one primary weapon in their loadout per game.

Summary Statistics:

COD_Data %>%
  count(PrimaryWeapon)

## # A tibble: 9 × 2
##   PrimaryWeapon     n
##   <chr>         <int>
## 1 AK-47             4
## 2 FFAR 1            3
## 3 Krig 6           21
## 4 M16              48
## 5 MG 82             2
## 6 MP5              45
## 7 Pelington 703    38
## 8 QBZ-83           37
## 9 Type 63          13

unique(COD_Data$PrimaryWeapon)

## [1] "M16"           "MP5"           "AK-47"         "Krig 6"       
## [5] "QBZ-83"        "Pelington 703" "FFAR 1"        "Type 63"      
## [9] "MG 82"

Visualization:

ggplot(data = COD_Data,
       mapping = aes(x = PrimaryWeapon)) +
  geom_bar(fill = "lightblue", color = "white") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(title = "COD Primary Weapon Distribution",
       x = "Primary Weapon",
       y = "Frequency")

Analysis of Visualization: As the summary statistics show, the 4 most popular primary weapons are the M16 (48 times), MP5 (45 times), Pelington 703 (38 times), and QBZ-83 (37 times). Usage is very uneven: 4 weapons have a high frequency (>35), 3 have a very low frequency (<5), and only 2 fall in between.
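To put a number on how dominant the top 4 weapons are, the counts from the count(PrimaryWeapon) output can be turned into shares. This is a small base R sketch that retypes the counts by hand; the names weapon_counts, shares, and top4_share are just ones I made up for it.

```r
# Counts copied from the count(PrimaryWeapon) output
weapon_counts <- c(
  "AK-47" = 4, "FFAR 1" = 3, "Krig 6" = 21, "M16" = 48, "MG 82" = 2,
  "MP5" = 45, "Pelington 703" = 38, "QBZ-83" = 37, "Type 63" = 13
)

# Share of all 211 games for each weapon, largest first
shares <- sort(weapon_counts / sum(weapon_counts), decreasing = TRUE)
round(shares, 3)

# Combined share of the four most-used weapons (168 of 211 games)
top4_share <- sum(shares[1:4])
round(top4_share, 3)
```

The four most-used weapons account for 168 of the 211 games, or roughly 80%.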

Variable 3

Variable Name: Damage

Variable Type: Quantitative

Variable Values: Can technically take on any value 0 or above, but ranges from 56 to 960 in this dataset.

Variable Missing Data: No missing data

sum(is.na(COD_Data$Damage))

## [1] 0

Description: The Damage variable represents the amount of damage a player dealt to opposing players’ health. For example, if a player shot an opposing player and the opposing player’s health decreased by 50, then 50 would be added to the Damage variable for the player who fired the shot.

Summary Statistics:

COD_Data %>%
  summarize(N = n(),
            Minimum = min(Damage),
            Maximum = max(Damage),
            Mean = mean(Damage),
            Median = median(Damage),
            Standard_Deviation = sd(Damage),
            Q1 = quantile(Damage, probs = .25),
            Q3 = quantile(Damage, probs = .75)
            )

## # A tibble: 1 × 8
##       N Minimum Maximum  Mean Median Standard_Deviation    Q1    Q3
##   <int>   <dbl>   <dbl> <dbl>  <dbl>              <dbl> <dbl> <dbl>
## 1   211      56     960  415.    397               166.  304.   508

Visualization:

ggplot(data = COD_Data,
       mapping = aes(x = Damage)) +
  geom_histogram(fill = "darkgreen", color = "white", binwidth = 25) +  
  labs(title = "COD Damage Distribution",
       x = "Damage",
       y = "Frequency")

Analysis of Visualization: The distribution of the Damage variable is roughly symmetric with a slight right skew, as supported by the mean of 415.17 being greater than the median of 397. With a Q1 of 303.5 and a Q3 of 508, the middle half of the observations falls between roughly 300 and 500, and the plot agrees. The maximum of 960 appears to be a major outlier, as no other observation comes close to it.
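One way to gauge how extreme the maximum is: express it as a number of standard deviations above the mean. The sketch below uses the values reported above. The standard deviation is printed as "166." in the tibble, so it is rounded and the result is only approximate.

```r
# Values reported in the Damage summary table and analysis
damage_max  <- 960
damage_mean <- 415.17
damage_sd   <- 166   # rounded in the printed tibble, so treat as approximate

# Distance of the maximum from the mean, in standard deviations
z_max <- (damage_max - damage_mean) / damage_sd
round(z_max, 1)
```

The maximum sits a little more than 3 standard deviations above the mean, which fits with it looking isolated in the histogram.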

Variable 4

Variable Name: Map1

Variable Type: Categorical

Variable Values: Can take on any of the 26 map names in the game such as “Echelon”, “Nuketown ’84”, or “Miami Strike”.

Variable Missing Data: 43 observations with missing Map1 data.

sum(is.na(COD_Data$Map1))

## [1] 43

Description: Map1 is the first map option in the vote offered when loading into a game. A map is the setting and environment in which the players fight each other, and each map provides different strategic advantages due to its theme and layout.

Summary Statistics:

map_table <- COD_Data %>% 
  count(Map1)

print(map_table, n = nrow(map_table))

## # A tibble: 27 × 2
##    Map1                  n
##    <chr>             <int>
##  1 Amerika               2
##  2 Apocalypse           10
##  3 Cartel                7
##  4 Checkmate             6
##  5 Collateral Strike     7
##  6 Crossroads Strike     6
##  7 Deprogram             6
##  8 Diesel                8
##  9 Drive-In              3
## 10 Echelon               9
## 11 Express               4
## 12 Garrison              8
## 13 Hijacked              5
## 14 Jungle                9
## 15 Miami Strike          4
## 16 Moscow               11
## 17 Nuketown '84          6
## 18 Raid                  5
## 19 Ruah                  1
## 20 Rush                  8
## 21 Slums                 5
## 22 Standoff              4
## 23 The Pines            12
## 24 WMD                   7
## 25 Yamantau              5
## 26 Zoo                  10
## 27 <NA>                 43

unique(COD_Data$Map1)

##  [1] "Moscow"            NA                  "Drive-In"         
##  [4] "Collateral Strike" "Crossroads Strike" "The Pines"        
##  [7] "Echelon"           "Nuketown '84"      "Jungle"           
## [10] "Rush"              "WMD"               "Diesel"           
## [13] "Zoo"               "Slums"             "Cartel"           
## [16] "Express"           "Deprogram"         "Apocalypse"       
## [19] "Miami Strike"      "Raid"              "Garrison"         
## [22] "Amerika"           "Standoff"          "Yamantau"         
## [25] "Checkmate"         "Hijacked"          "Ruah"

Visualization:

ggplot(data = COD_Data,
       mapping = aes(x = Map1)) +
  geom_bar(fill = "purple", color = "white") + 
  coord_flip() +
  labs(title = "COD Map 1 Distribution",
       x = "Map 1",
       y = "Frequency")

Analysis of Visualization: There are a lot of NAs (43) for Map1. I am not experienced with the game, but my guess is that the game mode had a predetermined map and players did not have the option to change it. Besides that, the maps are fairly evenly distributed, probably because players want variety. However, the map with the minimum frequency of 1, Ruah, does stand out; since Rush appears 8 times, Ruah may be a misspelling of Rush rather than a separate map.
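Ruah appears once while the similarly spelled Rush appears 8 times, so one possibility (an assumption on my part, not something the data confirms) is that Ruah is a typo. A quick way to look for near-duplicate names is the edit distance from base R's adist(). The sketch below uses a few of the map names from the unique() output.

```r
# A few map names as they appear in the unique() output
maps <- c("Rush", "Raid", "Moscow", "Ruah")

# Edit distance between every pair of names (adist() is in the utils package)
d <- adist(maps)
dimnames(d) <- list(maps, maps)
d

# Pairs that differ by a single edit are candidates for a misspelling
which(d == 1 & upper.tri(d), arr.ind = TRUE)
```

Rush and Ruah differ by one character, so it would be worth confirming with whoever recorded the data before treating them as two separate maps.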

Variable 5

Variable Name: Deaths

Variable Type: Quantitative

Variable Values: Can technically take on any value 0 or above, but ranges from 4 to 42 in this dataset.

Variable Missing Data: No missing data

sum(is.na(COD_Data$Deaths))

## [1] 0

Description: The Deaths variable is the number of times a player dies in a game. Unlike most of the other quantitative statistics, a lower number is better here, since it means the player died less often.

Summary Statistics:

COD_Data %>%
  summarize(N = n(),
            Minimum = min(Deaths),
            Maximum = max(Deaths),
            Mean = mean(Deaths),
            Median = median(Deaths),
            Standard_Deviation = sd(Deaths),
            Q1 = quantile(Deaths, probs = .25),
            Q3 = quantile(Deaths, probs = .75)
            )

## # A tibble: 1 × 8
##       N Minimum Maximum  Mean Median Standard_Deviation    Q1    Q3
##   <int>   <dbl>   <dbl> <dbl>  <dbl>              <dbl> <dbl> <dbl>
## 1   211       4      42  15.0     15               5.13    12  17.5

Visualization:

ggplot(data = COD_Data,
       mapping = aes(x = Deaths)) +
  geom_histogram(fill = "pink", color = "white", binwidth = 2) + 
  labs(title = "COD Deaths Distribution",
       x = "Deaths",
       y = "Frequency")

Analysis of Visualization: Most observations appear to fall between 10 and 20 deaths, which is close to our Q1 and Q3 of 12 and 17.5 respectively. The distribution is very symmetric except for a few outliers with more than 27 deaths. Compared with the range of 38, the standard deviation of 5.13 is small, and the plot shows this: apart from the outliers, the data is not very spread out.

3 Questions

Why are there NA’s for “Map1” and “Map2”?
Why are all the values for the “DidPlayerVote” variable “No”?
How is the “Score” variable calculated?

Research Questions

Part 1

Question: Is the player’s performance, as quantified by the amount of experience points gained (TotalXP variable), changing over time?

Data Pre-Processing:

COD_Data_Month <- COD_Data %>% 
  mutate(Month = month(ymd(Date), label = TRUE, abbr = FALSE))

Data Visualization:

ggplot(data = COD_Data_Month,
       mapping = aes(x = Month,
                     y = TotalXP)) +
  geom_boxplot(fill = "yellow", outlier.color = "red") +
  labs(title = "Player Performance Over Time",
       x = "Month",
       y = "Total XP")

Answer: According to the box plots, player performance does not appear to change much over time. The median, 1st quartile, and 3rd quartile are all similar from June to August. It could be argued that performance decreases slightly over time, since June has the highest median, but the difference is small. The lack of change could be because the game had been out for a while at this point, so there were no new game-breaking methods to try and everyone was experienced.

Part 2

Question: Do players perform better on certain maps?

Set Up: To determine how we measure “performing better”, we will use a common video game statistic, the kill to death ratio (KD). KD is computed by dividing eliminations by deaths. It is a good measure of player success: a high number indicates a player gets a lot of kills without dying, and a low number shows a player dies a lot and doesn’t get many kills. I want to see if specific maps give a better KD; perhaps certain settings are more protective or more open, which affects success. The map played on is stored in the Choice variable.

Data Pre-Processing:

COD_Data_P2 <- COD_Data %>%
  mutate(Kill_Death_Ratio = Eliminations / Deaths)

COD_Data_P2 <- COD_Data_P2 %>%
  select(Choice, Kill_Death_Ratio) %>% 
  na.omit()

Data Visualization:

ggplot(data = COD_Data_P2,
       mapping = aes(x = reorder(Choice, Kill_Death_Ratio, FUN = median),
                     y = Kill_Death_Ratio)) +
  geom_boxplot(fill = "lightblue", outlier.color = "lightgreen") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  labs(title = "Kill to Death Ratio by Map",
       x = "Map",
       y = "KD Ratio")

Answer: According to the box plots, it does appear players perform better on certain maps, i.e. they have a higher kill to death ratio. For starters, plenty of maps have a median KD below 1, meaning players die more often than they get kills, which is bad. Many other maps have a median around 1, which is average. But certain maps like Apocalypse, Nuketown ’84, and Miami Strike have noticeably higher KDs than the rest. This could suggest players perform better on these maps, which are probably designed in a way that lets players take advantage of their physical settings.
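The box plot comparison comes down to comparing the median KD for each map. The sketch below shows that calculation on a tiny made-up data frame in base R. The map names "Map A" and "Map B" and all of the numbers are hypothetical and are not taken from COD_Data.

```r
# Hypothetical toy data: map names and values are made up for illustration
toy <- data.frame(
  Choice       = c("Map A", "Map A", "Map A", "Map B", "Map B", "Map B"),
  Eliminations = c(20, 15, 18, 8, 12, 10),
  Deaths       = c(10, 15, 12, 16, 12, 20)
)

# Kill to death ratio for each game
toy$Kill_Death_Ratio <- toy$Eliminations / toy$Deaths

# Median KD per map, sorted from highest to lowest
kd_by_map <- aggregate(Kill_Death_Ratio ~ Choice, data = toy, FUN = median)
kd_by_map <- kd_by_map[order(-kd_by_map$Kill_Death_Ratio), ]
kd_by_map
```

In this toy example Map A has a median KD above 1 and Map B has one below 1. That is the same kind of gap the box plots are used to spot across the real maps.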

Trevor Bunch

March 6th, 2025