setwd("C:/Users/danie/OneDrive/Documents/Data 110 work/Project 2")
video_games <- read.csv("Video_Games.csv")

Project 2
The dataset includes a variety of variables:
Categorical Variables: These include the name of the game (Name), the platform it was released on (Platform), the genre (Genre), the publisher (Publisher), the developer (Developer), and the ESRB rating (Rating).
Continuous Variables: These encompass the sales figures in different regions (North America, Europe, Japan, and other regions) as well as globally (NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales), and the scores given by critics and users (Critic_Score, User_Score).
The data was sourced from a Kaggle dataset. Cleaning this data involved handling missing values, normalizing text fields, and checking for data consistency. For example, missing values in numerical fields like Critic_Score and User_Score were filled with median values, and text fields such as Publisher were standardized to maintain uniformity.
The reason for choosing this dataset is twofold. Firstly, as a video game enthusiast, I find the analysis of game trends, preferences, and the factors that drive a game’s success intriguing. Secondly, from an analytical viewpoint, this dataset offers a rich field for exploring various statistical and data visualization techniques.
# Load necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Initial exploration
head(video_games)
                      Name Platform Year_of_Release        Genre Publisher
1               Wii Sports      Wii            2006       Sports  Nintendo
2        Super Mario Bros.      NES            1985     Platform  Nintendo
3           Mario Kart Wii      Wii            2008       Racing  Nintendo
4        Wii Sports Resort      Wii            2009       Sports  Nintendo
5 Pokemon Red/Pokemon Blue       GB            1996 Role-Playing  Nintendo
6                   Tetris       GB            1989       Puzzle  Nintendo
  NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count
1    41.36    28.96     3.77        8.45        82.53           76           51
2    29.08     3.58     6.81        0.77        40.24           NA           NA
3    15.68    12.76     3.79        3.29        35.52           82           73
4    15.61    10.93     3.28        2.95        32.77           80           73
5    11.27     8.89    10.22        1.00        31.37           NA           NA
6    23.20     2.26     4.22        0.58        30.26           NA           NA
  User_Score User_Count Developer Rating
1          8        322  Nintendo      E
2                    NA
3        8.3        709  Nintendo      E
4          8        192  Nintendo      E
5                    NA
6                    NA
summary(video_games)
     Name             Platform         Year_of_Release       Genre
 Length:16719       Length:16719       Length:16719       Length:16719
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
  Publisher            NA_Sales          EU_Sales         JP_Sales
 Length:16719       Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000
 Class :character   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000
 Mode  :character   Median : 0.0800   Median : 0.020   Median : 0.0000
                    Mean   : 0.2633   Mean   : 0.145   Mean   : 0.0776
                    3rd Qu.: 0.2400   3rd Qu.: 0.110   3rd Qu.: 0.0400
                    Max.   :41.3600   Max.   :28.960   Max.   :10.2200
  Other_Sales        Global_Sales      Critic_Score    Critic_Count
 Min.   : 0.00000   Min.   : 0.0100   Min.   :13.00   Min.   :  3.00
 1st Qu.: 0.00000   1st Qu.: 0.0600   1st Qu.:60.00   1st Qu.: 12.00
 Median : 0.01000   Median : 0.1700   Median :71.00   Median : 21.00
 Mean   : 0.04733   Mean   : 0.5335   Mean   :68.97   Mean   : 26.36
 3rd Qu.: 0.03000   3rd Qu.: 0.4700   3rd Qu.:79.00   3rd Qu.: 36.00
 Max.   :10.57000   Max.   :82.5300   Max.   :98.00   Max.   :113.00
                                      NA's   :8582    NA's   :8582
  User_Score          User_Count       Developer            Rating
 Length:16719       Min.   :    4.0   Length:16719       Length:16719
 Class :character   1st Qu.:   10.0   Class :character   Class :character
 Mode  :character   Median :   24.0   Mode  :character   Mode  :character
                    Mean   :  162.2
                    3rd Qu.:   81.0
                    Max.   :10665.0
                    NA's   :9129
str(video_games)
'data.frame': 16719 obs. of 16 variables:
$ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
$ Platform : chr "Wii" "NES" "Wii" "Wii" ...
$ Year_of_Release: chr "2006" "1985" "2008" "2009" ...
$ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
$ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
$ NA_Sales : num 41.4 29.1 15.7 15.6 11.3 ...
$ EU_Sales : num 28.96 3.58 12.76 10.93 8.89 ...
$ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
$ Other_Sales : num 8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
$ Global_Sales : num 82.5 40.2 35.5 32.8 31.4 ...
$ Critic_Score : int 76 NA 82 80 NA NA 89 58 87 NA ...
$ Critic_Count : int 51 NA 73 73 NA NA 65 41 80 NA ...
$ User_Score : chr "8" "" "8.3" "8" ...
$ User_Count : int 322 NA 709 192 NA NA 431 129 594 NA ...
$ Developer : chr "Nintendo" "" "Nintendo" "Nintendo" ...
$ Rating : chr "E" "" "E" "E" ...
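The `str()` output shows empty strings (`""`) in columns such as User_Score, Developer, and Rating, and `is.na()` does not count these as missing. A minimal sketch of counting both kinds of gaps per column, using a made-up toy data frame (`toy` and its values are illustrative assumptions, not rows from the real file):

```r
# Toy stand-in for video_games; the values are illustrative only.
toy <- data.frame(
  Name         = c("Game A", "Game B", "Game C", "Game D"),
  Critic_Score = c(76L, NA, 82L, NA),
  User_Score   = c("8", "", "8.3", "7.1"),
  Rating       = c("E", "", "E", ""),
  stringsAsFactors = FALSE
)

# Count true NAs and empty strings separately for each column.
missing_summary <- data.frame(
  n_na    = sapply(toy, function(col) sum(is.na(col))),
  n_empty = sapply(toy, function(col) sum(!is.na(col) & col == ""))
)
print(missing_summary)
```

Running the same two `sapply()` calls on `video_games` would show which character columns hide missing values as empty strings before any imputation is attempted.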
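# Handling missing data
# Replace NA in Critic_Score with the column median.
video_games$Critic_Score[is.na(video_games$Critic_Score)] <- median(video_games$Critic_Score, na.rm = TRUE)
# User_Score was read as character, with "" rather than NA for missing entries,
# so is.na() matches no rows here, and median() on a character vector returns
# the middle string in sort order rather than a numeric median.
video_games$User_Score[is.na(video_games$User_Score)] <- median(video_games$User_Score, na.rm = TRUE)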
# Converting data types
# Convert Year_of_Release to integer; non-numeric entries become NA,
# which is what the coercion warning reports.
video_games$Year_of_Release <- as.integer(video_games$Year_of_Release)
Warning: NAs introduced by coercion
# Standardize Publisher names to lower case
video_games$Publisher <- tolower(video_games$Publisher)
# Remove exact duplicate rows
video_games <- video_games[!duplicated(video_games), ]
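Because `str()` reports User_Score as a character column with empty strings, a numeric median imputation needs a conversion step first. The following is a minimal sketch of that idea on a made-up vector, not the code that produced the results in this report; `user_score_raw` and the `"tbd"` entry are hypothetical stand-ins for whatever non-numeric text the column contains:

```r
# Toy stand-in for video_games$User_Score; values are illustrative only.
user_score_raw <- c("8", "", "8.3", "tbd", "6.5")

# as.numeric() maps "" and non-numeric text to NA. The coercion warning is
# silenced here because those NAs are the intended result.
user_score <- suppressWarnings(as.numeric(user_score_raw))

# Median of the observed scores, then fill the gaps with it.
score_median <- median(user_score, na.rm = TRUE)
user_score[is.na(user_score)] <- score_median

print(score_median)
print(user_score)
```

Applied to the real column, the same three steps would make `is.na()` find the missing scores and make `median()` operate on numbers rather than strings.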
video_games %>% summarise(
  Average_Global_Sales = mean(Global_Sales, na.rm = TRUE),
  Median_Critic_Score = median(Critic_Score, na.rm = TRUE),
  Median_User_Score = median(User_Score, na.rm = TRUE)
)
  Average_Global_Sales Median_Critic_Score Median_User_Score
1            0.5335427                  71               6.2
# Boxplot of Critic Scores by Genre
ggplot(video_games, aes(x = Genre, y = Critic_Score)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Critic Scores by Genre",
       x = "Genre",
       y = "Critic Score")
# 1. Histogram for Global Sales (Quantitative Variable)
ggplot(video_games, aes(x = Global_Sales)) +
  geom_histogram(bins = 30, fill = "cornflowerblue", color = "black") +
  labs(title = "Histogram of Global Sales",
       x = "Global Sales (Millions)",
       y = "Frequency") +
  theme_minimal()
# 2. Boxplot for Critic Scores by Genre (Categorical and Quantitative Variables)
# The "Set1" Brewer palette has at most 9 colours, fewer than the number of
# genres, so the default discrete fill palette is used instead.
ggplot(video_games, aes(x = Genre, y = Critic_Score)) +
  geom_boxplot(aes(fill = Genre)) +
  labs(title = "Boxplot of Critic Scores by Genre",
       x = "Genre",
       y = "Critic Score") +
  theme_minimal()
# Bar Plot for Count of Games by Platform
# Platform values must match the codes used in the data: the Xbox 360 is
# coded "X360", and "Nintendo" is a publisher rather than a platform code.
filtered_platforms <- video_games %>%
  filter(Platform %in% c("PS4", "PS3", "PC", "3DS", "X360", "DS", "Wii", "WiiU", "GBA"))
ggplot(filtered_platforms, aes(x = Platform, fill = Platform)) +
  geom_bar() +
  labs(title = "Count of Games by Specific Platforms",
       x = "Platform",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2")
# Keep only games with more than 10 million global sales to focus on the best sellers
filtered_data <- video_games %>%
  filter(Global_Sales > 10)
# Create the visualization
ggplot(filtered_data, aes(x = Critic_Score, y = Global_Sales, color = Genre)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~Platform) +
  labs(title = "Global Sales vs. Critic Score by Genre and Platform",
       subtitle = "For video games with more than 10 million sales",
       x = "Critic Score",
       y = "Global Sales (Millions)",
       caption = "Data source: Video_Games.csv dataset from Kaggle",
       color = "Genre") +
  scale_color_manual(values = rainbow(length(unique(filtered_data$Genre)))) +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
# Select the 10 games with the highest global sales
top_games <- video_games %>%
  arrange(desc(Global_Sales)) %>%
  head(10)
# Creating the visualization
ggplot(top_games, aes(x = reorder(Name, Global_Sales), y = Global_Sales, fill = Platform)) +
  geom_col() +
  geom_text(aes(label = paste(Year_of_Release, "-", Genre)),
            position = position_stack(vjust = 0.5),
            color = "white", size = 3) +
  labs(title = "Top 10 Most Bought Video Games",
       x = "Video Game",
       y = "Global Sales (Millions)",
       fill = "Platform") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

Visualization and Analysis:
The visualizations created from this dataset reveal intriguing patterns and insights. One notable visualization is the bar chart showing the count of games on specific platforms, such as PS4, PS3, Wii, and others. This visualization highlights the dominance of certain platforms in the gaming market and the variation in the number of games available across platforms.
Another significant visualization shows the top 10 best-selling games, along with their year of release, platform, and genre. This graph not only illustrates the most successful games in terms of sales but also provides a snapshot of consumer preferences and market trends over the years. It is interesting to see how certain genres and platforms have maintained popularity over time, and how some games have achieved remarkable success.
While these visualizations provide valuable insights, there are aspects I wish I could have included. For instance, a more in-depth analysis of how critic and user scores relate to sales would be insightful. Additionally, exploring the relationship between a game's release year and its sales could reveal trends in the gaming industry's evolution.
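As a starting point for those two follow-up questions, here is a minimal sketch in base R. It runs on a made-up toy data frame (`toy` and its values are illustrative assumptions, not results from the real dataset) and shows one possible approach: a rank correlation between critic score and sales, and total sales per release year.

```r
# Toy stand-in for video_games; the values are illustrative only.
toy <- data.frame(
  Year_of_Release = c(2006L, 2006L, 2008L, 2008L, 2009L),
  Critic_Score    = c(76, 60, 82, 70, 80),
  Global_Sales    = c(82.5, 1.2, 35.5, 3.0, 32.8)
)

# Spearman rank correlation is less sensitive than Pearson to the few
# very large sales figures that dominate this kind of data.
score_sales_rho <- cor(toy$Critic_Score, toy$Global_Sales,
                       method = "spearman", use = "complete.obs")

# Total global sales per release year.
sales_by_year <- aggregate(Global_Sales ~ Year_of_Release, data = toy, FUN = sum)

print(score_sales_rho)
print(sales_by_year)
```

On the real data, the correlation would be more meaningful if computed only on rows with an observed critic score, since median-filled scores all share one value.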