Project 2

Author

Daniel Ekane

The dataset includes a variety of variables:

Categorical Variables: These include the name of the game (Name), the platform it was released on (Platform), the genre (Genre), the publisher (Publisher), the developer (Developer), and the ESRB rating (Rating).

Continuous Variables: These include the sales figures, in millions, for North America, Europe, Japan, and other regions, plus the global total (NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales), and the scores given by critics and users (Critic_Score, User_Score). Note that User_Score is read in as text rather than as a number, so it needs to be converted before any numeric analysis.

The data was sourced from a Kaggle dataset. Cleaning involved handling missing values, normalizing text fields, and removing exact duplicate rows. For example, missing values in score fields such as Critic_Score and User_Score are filled with the column median, and text fields such as Publisher are converted to lowercase for uniformity.
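The median imputation described above can be sketched on a small made-up data frame. This is a minimal illustration, not the project's exact code: the `toy` data and the `impute_median` helper are hypothetical, and it assumes a score column may arrive as text with blank or non-numeric entries.

```r
# Toy data standing in for Video_Games.csv (values are made up for illustration)
toy <- data.frame(
  Critic_Score = c(76, NA, 82, 80, NA),
  User_Score   = c("8", "", "8.3", "8", "n/a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NA with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Blank or non-numeric strings become NA when coerced to numeric
toy$User_Score <- suppressWarnings(as.numeric(toy$User_Score))

toy$Critic_Score <- impute_median(toy$Critic_Score)
toy$User_Score   <- impute_median(toy$User_Score)

toy
```

Converting to numeric first matters: `is.na()` only flags true `NA` values, so blank strings in a text column would otherwise be left untouched.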

The reason for choosing this dataset is twofold. Firstly, as a video game enthusiast, I find the analysis of game trends, preferences, and the factors that drive a game’s success intriguing. Secondly, from an analytical viewpoint, this dataset offers a rich field for exploring various statistical and data visualization techniques.

setwd("C:/Users/danie/OneDrive/Documents/Data 110 work/Project 2")
video_games <- read.csv("Video_Games.csv")
# Load necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Initial exploration
head(video_games)
                      Name Platform Year_of_Release        Genre Publisher
1               Wii Sports      Wii            2006       Sports  Nintendo
2        Super Mario Bros.      NES            1985     Platform  Nintendo
3           Mario Kart Wii      Wii            2008       Racing  Nintendo
4        Wii Sports Resort      Wii            2009       Sports  Nintendo
5 Pokemon Red/Pokemon Blue       GB            1996 Role-Playing  Nintendo
6                   Tetris       GB            1989       Puzzle  Nintendo
  NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count
1    41.36    28.96     3.77        8.45        82.53           76           51
2    29.08     3.58     6.81        0.77        40.24           NA           NA
3    15.68    12.76     3.79        3.29        35.52           82           73
4    15.61    10.93     3.28        2.95        32.77           80           73
5    11.27     8.89    10.22        1.00        31.37           NA           NA
6    23.20     2.26     4.22        0.58        30.26           NA           NA
  User_Score User_Count Developer Rating
1          8        322  Nintendo      E
2                    NA                 
3        8.3        709  Nintendo      E
4          8        192  Nintendo      E
5                    NA                 
6                    NA                 
summary(video_games)
     Name             Platform         Year_of_Release       Genre          
 Length:16719       Length:16719       Length:16719       Length:16719      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  Publisher            NA_Sales          EU_Sales         JP_Sales      
 Length:16719       Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000  
 Class :character   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000  
 Mode  :character   Median : 0.0800   Median : 0.020   Median : 0.0000  
                    Mean   : 0.2633   Mean   : 0.145   Mean   : 0.0776  
                    3rd Qu.: 0.2400   3rd Qu.: 0.110   3rd Qu.: 0.0400  
                    Max.   :41.3600   Max.   :28.960   Max.   :10.2200  
                                                                        
  Other_Sales        Global_Sales      Critic_Score    Critic_Count   
 Min.   : 0.00000   Min.   : 0.0100   Min.   :13.00   Min.   :  3.00  
 1st Qu.: 0.00000   1st Qu.: 0.0600   1st Qu.:60.00   1st Qu.: 12.00  
 Median : 0.01000   Median : 0.1700   Median :71.00   Median : 21.00  
 Mean   : 0.04733   Mean   : 0.5335   Mean   :68.97   Mean   : 26.36  
 3rd Qu.: 0.03000   3rd Qu.: 0.4700   3rd Qu.:79.00   3rd Qu.: 36.00  
 Max.   :10.57000   Max.   :82.5300   Max.   :98.00   Max.   :113.00  
                                      NA's   :8582    NA's   :8582    
  User_Score          User_Count       Developer            Rating         
 Length:16719       Min.   :    4.0   Length:16719       Length:16719      
 Class :character   1st Qu.:   10.0   Class :character   Class :character  
 Mode  :character   Median :   24.0   Mode  :character   Mode  :character  
                    Mean   :  162.2                                        
                    3rd Qu.:   81.0                                        
                    Max.   :10665.0                                        
                    NA's   :9129                                           
str(video_games)
'data.frame':   16719 obs. of  16 variables:
 $ Name           : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
 $ Platform       : chr  "Wii" "NES" "Wii" "Wii" ...
 $ Year_of_Release: chr  "2006" "1985" "2008" "2009" ...
 $ Genre          : chr  "Sports" "Platform" "Racing" "Sports" ...
 $ Publisher      : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
 $ NA_Sales       : num  41.4 29.1 15.7 15.6 11.3 ...
 $ EU_Sales       : num  28.96 3.58 12.76 10.93 8.89 ...
 $ JP_Sales       : num  3.77 6.81 3.79 3.28 10.22 ...
 $ Other_Sales    : num  8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
 $ Global_Sales   : num  82.5 40.2 35.5 32.8 31.4 ...
 $ Critic_Score   : int  76 NA 82 80 NA NA 89 58 87 NA ...
 $ Critic_Count   : int  51 NA 73 73 NA NA 65 41 80 NA ...
 $ User_Score     : chr  "8" "" "8.3" "8" ...
 $ User_Count     : int  322 NA 709 192 NA NA 431 129 594 NA ...
 $ Developer      : chr  "Nintendo" "" "Nintendo" "Nintendo" ...
 $ Rating         : chr  "E" "" "E" "E" ...
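# Handling missing data
# Replace NA in Critic_Score with the column median
video_games$Critic_Score[is.na(video_games$Critic_Score)] <- median(video_games$Critic_Score, na.rm = TRUE)
# Note: User_Score is stored as character (see the str() output above), and its
# blank strings are not NA, so is.na() does not flag them in the line below
video_games$User_Score[is.na(video_games$User_Score)] <- median(video_games$User_Score, na.rm = TRUE)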

# Converting data types
# Convert Year_of_Release to integer; entries that are not valid numbers become NA
video_games$Year_of_Release <- as.integer(video_games$Year_of_Release)
Warning: NAs introduced by coercion
# Normalize publisher names to lowercase
video_games$Publisher <- tolower(video_games$Publisher)

# Remove exact duplicate rows
video_games <- video_games[!duplicated(video_games), ]
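Normalizing text before removing duplicates matters, because rows that differ only in case or stray whitespace are otherwise treated as distinct. A minimal sketch on a made-up data frame (the `toy` data is hypothetical, and the `trimws()` step is an addition that the code above does not use):

```r
# Toy data: rows 1 and 2 differ only in case and trailing whitespace
toy <- data.frame(
  Name      = c("Game A", "Game A", "Game B"),
  Publisher = c("Nintendo ", "nintendo", "Sony"),
  stringsAsFactors = FALSE
)

# Normalize first: strip surrounding whitespace, then lowercase
toy$Publisher <- tolower(trimws(toy$Publisher))

# Count exact duplicate rows before dropping them
n_dupes <- sum(duplicated(toy))

# Keep the first occurrence of each row
toy <- toy[!duplicated(toy), ]
```

Counting duplicates before dropping them gives a quick record of how many rows the cleaning step removed.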

# Summary statistics for sales and scores
video_games %>% summarise(
  Average_Global_Sales = mean(Global_Sales, na.rm = TRUE),
  Median_Critic_Score = median(Critic_Score, na.rm = TRUE),
  Median_User_Score = median(User_Score, na.rm = TRUE)
)
  Average_Global_Sales Median_Critic_Score Median_User_Score
1            0.5335427                  71               6.2
# Boxplot of Critic Scores by Genre
ggplot(video_games, aes(x = Genre, y = Critic_Score)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Critic Scores by Genre",
       x = "Genre",
       y = "Critic Score")

# 1. Histogram for Global Sales (Quantitative Variable)
ggplot(video_games, aes(x = Global_Sales)) +
  geom_histogram(bins = 30, fill = "cornflowerblue", color = "black") +
  labs(title = "Histogram of Global Sales",
       x = "Global Sales (Millions)",
       y = "Frequency") +
  theme_minimal()

# 2. Boxplot for Critic Scores by Genre (Categorical and Quantitative Variables)
ggplot(video_games, aes(x = Genre, y = Critic_Score)) +
  geom_boxplot(aes(fill = Genre)) +
  labs(title = "Boxplot of Critic Scores by Genre",
       x = "Genre",
       y = "Critic Score") +
  theme_minimal() +
  # The Brewer "Set1" palette has at most 9 colors, fewer than the number of
  # genres; the discrete viridis scale adapts to any number of levels
  scale_fill_viridis_d()
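Qualitative Brewer palettes have a fixed size (Set1 offers at most 9 colors), so a fill variable with more levels needs a palette that can grow. One option, sketched here with base R's `hcl.colors()`, is to generate exactly as many colors as there are levels. The level count below is a hypothetical stand-in for `length(unique(video_games$Genre))`.

```r
# Hypothetical number of distinct genres
n_levels <- 12

# Generate one color per level from the qualitative "Dark 3" HCL palette
genre_colors <- grDevices::hcl.colors(n_levels, palette = "Dark 3")

genre_colors
```

The resulting vector could then be passed to `scale_fill_manual(values = genre_colors)` in place of a fixed-size Brewer palette.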

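# Bar Plot for Count of Games by Platform
# Values must match the codes in the Platform column exactly; "Nintendo" and "360"
# may match no rows (compare against unique(video_games$Platform))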
filtered_platforms <- video_games %>%
  filter(Platform %in% c("PS4", "PS3", "Nintendo", "PC", "3DS", "360", "DS", "Wii", "WiiU", "GBA"))

ggplot(filtered_platforms, aes(x = Platform, fill = Platform)) +
  geom_bar() +
  labs(title = "Count of Games by Specific Platforms",
       x = "Platform",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2")
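Because `%in%` silently ignores values that do not occur in the column, it can help to check the requested platform codes against the data before filtering. A minimal sketch, assuming made-up platform codes (both vectors below are hypothetical, not taken from the dataset):

```r
# Hypothetical set of codes present in the Platform column
platforms_in_data <- c("Wii", "NES", "GB", "DS", "X360", "PS3")

# Codes we intend to filter on
wanted <- c("PS3", "Nintendo", "360", "Wii")

# Requested codes that never occur in the data and would match no rows
missing_codes <- setdiff(wanted, platforms_in_data)

missing_codes
```

With the real data, `platforms_in_data` would be `unique(video_games$Platform)`; any code reported as missing contributes no bar to the chart.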

# Keep only games with more than 10 million global sales to focus on the best-selling titles
filtered_data <- video_games %>%
  filter(Global_Sales > 10)

# Create the visualization
ggplot(filtered_data, aes(x = Critic_Score, y = Global_Sales, color = Genre)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~Platform) +
  labs(title = "Global Sales vs. Critic Score by Genre and Platform",
       subtitle = "For video games with more than 10 million sales",
       x = "Critic Score",
       y = "Global Sales (Millions)",
       caption = "Data source: Video_Games.csv dataset from Kaggle",
       color = "Genre") +
  scale_color_manual(values = rainbow(length(unique(filtered_data$Genre)))) +
  theme_minimal() +
  theme(legend.position = "bottom", 
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

# Select the 10 games with the highest global sales
top_games <- video_games %>%
  arrange(desc(Global_Sales)) %>%
  head(10)

# Creating the visualization
ggplot(top_games, aes(x = reorder(Name, Global_Sales), y = Global_Sales, fill = Platform)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(Year_of_Release, "-", Genre)), 
            position = position_stack(vjust = 0.5), 
            color = "white", size = 3) +
  labs(title = "Top 10 Most Bought Video Games",
       x = "Video Game",
       y = "Global Sales (Millions)",
       fill = "Platform") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

Visualization and Analysis: The visualizations created from this dataset reveal several patterns. One notable visualization is the bar chart of game counts for selected platforms, such as PS4, PS3, Wii, and others. It shows how much the number of available games varies from one platform to another, with some platforms holding far more titles than the rest.

Another significant visualization shows the top 10 best-selling games, labeled with their year of release and genre and colored by platform. This graph illustrates the most successful games in terms of sales and provides a snapshot of consumer preferences across release years. It’s interesting to see which genres and platforms recur among the top sellers, and how far the most successful games outsell the rest.

While these visualizations provide valuable insights, there are aspects I wish I could have included. For instance, a more in-depth analysis of how critic and user scores relate to sales would be insightful. Additionally, exploring the relationship between a game’s release year and its sales could reveal trends in the gaming industry’s evolution.
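As a starting point for those follow-up questions, the two relationships could be explored along these lines. This is only a sketch on made-up numbers: the `toy` data frame is hypothetical, and a rank correlation is one possible choice because the sales figures are heavily skewed.

```r
# Toy data standing in for the cleaned dataset (values are made up)
toy <- data.frame(
  Year_of_Release = c(2006, 2006, 2008, 2008, 2009),
  Critic_Score    = c(76, 60, 82, 70, 80),
  Global_Sales    = c(82.5, 1.2, 35.5, 0.8, 32.8)
)

# Rank (Spearman) correlation between critic score and global sales
score_sales_cor <- cor(toy$Critic_Score, toy$Global_Sales,
                       method = "spearman", use = "complete.obs")

# Total global sales per release year
sales_by_year <- aggregate(Global_Sales ~ Year_of_Release, data = toy, FUN = sum)

score_sales_cor
sales_by_year
```

The same two calls could be run on `video_games` once the score columns are numeric, and `sales_by_year` could feed a line chart of sales over time.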
