Rows: 16598 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Platform, Year, Genre, Publisher
dbl (6): Rank, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales
Introduction
The data set I chose for my project is Video Game Sales, which was collected from vgchartz.com, a public forum for video game enthusiasts. It has a total of 11 variables and over 16,000 entries (16,598 rows). The most notable variables that I will use for analysis are regional sales (in North America, Europe, Japan, and the rest of the world), platform (the console on which the game runs), year of release, and genre. Although about half of the variables are categorical, some of them, such as platform and genre, have few enough categories to fit nicely in the charts. I also noticed that this data favors console platforms, possibly because most big PC games are free to play nowadays. I decided to choose this data set because of the influence video games have had on my life and how much time I spent in front of the TV when I was younger.
N/A values are common in the Year column, which holds the year the game was published. The easiest way to handle them is to remove those rows, along with the rows that have N/A in the Publisher column.
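The cleaning code itself is not shown in this report. A minimal sketch of the step is given here, assuming that missing values are stored as the literal string "N/A" (which would explain why Year was read as a character column). The data frame `vgsales_toy` and its rows are hypothetical stand-ins, not the real file.

```r
# Toy stand-in for the raw data (hypothetical rows, not taken from the real file)
vgsales_toy <- data.frame(
  Name      = c("Game A", "Game B", "Game C", "Game D"),
  Year      = c("2006", "N/A", "2010", "2012"),
  Publisher = c("Nintendo", "Ubisoft", "N/A", "Activision"),
  stringsAsFactors = FALSE
)

# Assumption: missing values appear as the string "N/A" rather than a real NA
has_missing <- vgsales_toy$Year == "N/A" | vgsales_toy$Publisher == "N/A"

# Keep only rows where both Year and Publisher are present
vgsales_clean <- vgsales_toy[!has_missing, ]

nrow(vgsales_clean)
```

If the file instead contained true `NA` values, `is.na()` would be the test to use in place of the string comparison.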
I also found an outlier in the Year column: 2020 contains only a single game. Removing this outlier should not significantly affect the analysis, because it amounts to only 5 rows of the data set.
# Convert year to date
vgsales1$Year <- as.Date(as.character(vgsales1$Year), format = "%Y")
vgsales1$Year <- year(vgsales1$Year)
# Convert name to character
vgsales1$Name <- as.character(vgsales1$Name)
Linear Regression Analysis
# visual representation of the North American sales for each year
ggplot(vgsales1, aes(NA_Sales, Year)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
This plot shows North American sales against release year. There does not seem to be a correlation between sales and year. The one clear outlier, far to the right, is Wii Sports, the game that released along with the Wii console. This makes sense because everyone who wanted a Wii console got Wii Sports with it.
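The visual impression of "no correlation" can be backed by numbers. The sketch below is one way to do that, using a small hypothetical data frame (`sales_toy`) in place of `vgsales1`. It treats sales as the response and year as the predictor, which is an assumption about the intended model; the plot above places Year on the y-axis instead.

```r
# Toy stand-in for vgsales1 (hypothetical values, for illustration only)
sales_toy <- data.frame(
  Year     = c(2000, 2002, 2004, 2006, 2008, 2010),
  NA_Sales = c(1.2, 0.8, 1.5, 0.9, 1.1, 1.0)
)

# Pearson correlation between release year and North American sales
r <- cor(sales_toy$Year, sales_toy$NA_Sales)

# Simple linear model with sales as the response and year as the predictor
fit <- lm(NA_Sales ~ Year, data = sales_toy)

r
coef(fit)["Year"]        # estimated change in sales per additional year
summary(fit)$r.squared   # share of the variance in sales explained by year
```

A correlation near 0 and a small R-squared would support the reading that release year alone says little about North American sales.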
Exploring variables and categories
## Filtering by Top 100 video games by platform
Top_100 <- vgsales1 %>%
  filter(Global_Sales > 7.38)
ggplot(Top_100, aes(x = Platform, y = Global_Sales, fill = Platform)) +
  geom_boxplot(alpha = 0.6) +
  labs(x = "Platform", y = "Total Amount of Copies Sold",
       title = "The Top 100 Best-Selling Video Games by Platform") +
  theme(legend.position = "none") +
  theme(panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
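The filter above depends on a hard-coded sales threshold (7.38), which has to be looked up again whenever the data changes. As an alternative sketch, dplyr's `slice_max()` can select the top rows directly. The data frame `games_toy` and its values are hypothetical, and `n = 3` is used only to keep the example small.

```r
library(dplyr)

# Toy stand-in for vgsales1 (hypothetical values, for illustration only)
games_toy <- data.frame(
  Name         = paste("Game", LETTERS[1:6]),
  Global_Sales = c(30.0, 25.0, 20.0, 7.4, 3.1, 0.5)
)

# Keep the n best-selling rows; n = 3 here, n = 100 would give a top-100 list
top_n_games <- games_toy %>%
  slice_max(Global_Sales, n = 3)

top_n_games
```

Note that `slice_max()` keeps ties by default, so it can return more than `n` rows when several games share the cutoff value.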
This first visualization was my attempt at a jitter plot. I was unable to make it interactive, so I just added labels to the highest-ranked games.
ggplot(vgsales1, aes(x = Genre, y = Global_Sales)) +
  geom_jitter(alpha = 0.1, color = "red") +
  geom_label_repel(data = head(vgsales1, 10), aes(label = Name)) +
  labs(y = "Global Sales", x = "Genre", title = "Global Sales by Genre") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
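The errors that came up when trying to make this plot interactive are not recorded in the report, so their cause is unknown. One thing that may be worth checking is that `ggplotly()` does not convert every geom, and labels from ggrepel such as `geom_label_repel()` may not be supported. The sketch below shows one possible workaround under that assumption: drop the repel labels and show the game name in a hover tooltip instead. The data frame `genre_toy` and its values are hypothetical.

```r
library(ggplot2)
library(plotly)

# Toy stand-in for vgsales1 (hypothetical values, for illustration only)
genre_toy <- data.frame(
  Name         = c("Game A", "Game B", "Game C", "Game D"),
  Genre        = c("Sports", "Sports", "Racing", "Puzzle"),
  Global_Sales = c(30.0, 5.2, 12.4, 8.9)
)

# `text` is not a standard ggplot2 aesthetic; ggplotly() uses it for the tooltip
p <- ggplot(genre_toy, aes(x = Genre, y = Global_Sales, text = Name)) +
  geom_jitter(alpha = 0.5, color = "red", width = 0.2, height = 0) +
  labs(x = "Genre", y = "Global Sales", title = "Global Sales by Genre")

# Convert the static ggplot into an interactive plotly widget
p_interactive <- ggplotly(p, tooltip = "text")
```

Printing `p_interactive` in a rendered HTML report should display the widget, with each point showing its game name on hover.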
This second visualization is the final one. Although I was still unable to make it interactive, I chose to reduce the publishers to just 5 so that the plot is actually readable.
# choose only the top 5 Publishing names to make the plot easier to view
toppub <- vgsales1 %>%
  filter(Publisher %in% c("Electronic Arts", "Activision", "Ubisoft",
                          "Namco Bandai Games", "Nintendo")) %>%
  group_by(Publisher, Year) %>%
  # .groups = "drop" returns an ungrouped result and silences the grouping message
  summarize(total = sum(Global_Sales), .groups = "drop")
# create line plot
vgsales1 %>%
  group_by(Publisher, Year) %>%
  summarize(total = sum(Global_Sales), .groups = "drop") %>%
  ggplot(aes(x = Year, y = total)) +
  geom_point(alpha = 0.5, pch = 21) +
  geom_point(data = toppub, aes(col = Publisher), size = 1.5) +
  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_line(data = toppub, aes(col = Publisher), linewidth = 1.3) +
  scale_color_viridis(discrete = TRUE) +
  # numeric legend positions go in legend.position.inside (ggplot2 >= 3.5.0)
  theme(legend.position = "inside", legend.position.inside = c(0.2, 0.7)) +
  labs(title = "Time series of Global Sales by Publishers",
       caption = "vgchartz.com forum",
       y = "Sales (in million units)")
Conclusion
The analysis of this data set reveals several key findings about video games and the video game industry. The first is that Japan’s video game market shows distinct preferences in genres, publishers, and platforms compared with other regions. Europe and North America tend to have similar preferences, and the rest of the world follows them as well. The video game industry is dominated by larger companies such as Nintendo, Sony, and Microsoft, which is why most top-selling games come from these companies. The data set also shows that the industry is not growing as much as it did in the early 2000s, which may be because of advancements in technology such as mobile gaming and its convenience. One thing I wish I had figured out is how to add interactivity to both visualizations, since I kept getting error messages when using ggplotly().