Upload Dataset and Library
I divide this App Game Analysis into two parts: data visualization for a basic statistical summary, and some simple modeling to find the best educational games. In addition, we will explore the relationship between developers and user ratings.
This project is inspired by a friend I lost touch with, who is really good at PlayerUnknown’s Battlegrounds and Honor of Kings. He was so good that he had to wait a long time for an opponent.
We start with an overview statistical summary of the important variables.
From the bar chart of user rating counts, we can see that the majority of games are rated between 4 and 4.5; the distribution is roughly bell-shaped but negatively skewed.
# game1 is the raw dataset; column 6 is Average.User.Rating
game2<- game1[!is.na(game1[,6]),]
# count the games at each rating value
game3<- group_by(game2,Average.User.Rating)
game4<- dplyr::summarise(game3,count=n())
b<- ggplot(game4, aes(x=Average.User.Rating,y=count,fill=Average.User.Rating))+
  geom_bar(stat="identity")+
  geom_text(aes(label=count),size=3,color="coral",vjust=1.6)+
  ggtitle("Overall User Rating Count Bar Plot")
b
We run the summary function to get an overall idea of User.Rating.Count. The review counts are highly dispersed, ranging from 5 to about 3 million.
To reduce reviewer bias, we keep only the games with a review count of 3306 (the mean value) or more for future analysis.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       5      12      46    3306     309 3032734
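The cut-off step can be sketched on a small example. This is only an illustration under assumptions: the data frame `toy` and its values are made up (chosen so the mean comes out at 3306), and `threshold` and `popular` are hypothetical names, not objects from the analysis.

```r
# Toy stand-in for game2; the values are made up for illustration only.
toy <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  User.Rating.Count = c(5, 12, 46, 309, 16158),
  stringsAsFactors = FALSE
)

# summary() reports the same six statistics as the output above.
print(summary(toy$User.Rating.Count))

# Use the mean review count as the cut-off, then keep games at or above it.
threshold <- mean(toy$User.Rating.Count)
popular <- toy[toy$User.Rating.Count >= threshold, ]
print(popular)
```

Because the real mean (3306) is far above the third quartile (309), a mean-based cut-off keeps fewer than a quarter of the rated games.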
Now we look at the price distribution. We use the original data, which means games with no reviews are included as well. Because of the excessively large number of free games, we categorize the games into 2 groups: Free and $0.99 to $179.99. The donut graph below shows that the majority of the games are free; the most expensive games cost $179.99, and all of them are associated with “3D” or “simulation”.
price <- game1[!is.na(game1[,8]),]
price1<- group_by(price,Price)
price2<- dplyr::summarise(price1,count=n())
price2$lc <- log(1+price2$count)
price2$price_range <- ifelse(price2$Price==0.00,"Free", "0.99-179.99")
#compute percentage
price2$frac=price2$count/sum(price2$count)
price2$ymax=cumsum(price2$frac)
price2$ymin=c(0,head(price2$ymax,n=-1))
price2$labelPosition <- (price2$ymax+price2$ymin)/2
price2$label <-paste0(price2$price_range)
#donut plot
ggplot(price2,aes(ymax=ymax,ymin=ymin,xmax=4,xmin=3,fill=price_range))+geom_rect()+coord_polar(theta="y")+xlim(c(2,4))+theme_void()+
  scale_fill_brewer(palette=16)+ggtitle("App Game Price Tag")
Next we look at the distribution of age ratings. As the graph shows, the majority of games are available for ages 4+.
age <- game1[!is.na(game1[,12]),]
age1<- group_by(age,Age.Rating)
age2<- dplyr::summarise(age1,count=n())
#bar chart
ggplot(age2,aes(fill=Age.Rating, y=count, x=Age.Rating))+
  geom_bar(position ="dodge", stat="identity" )+ggtitle("Age Distribution in Games")
Since brands matter nowadays, we’d like to see if there’s an association between the top developers and their ratings.
Among the games with a non-missing user rating, “Tapps Tecnologia da Informação Ltda.” developed the most games.
dev <- game2[!is.na(game2[,11]),]
dev1<- group_by(dev,Developer)
dev2<- dplyr::summarise(dev1,count=n())
dev2<- arrange(dev2,desc(count))
#treemap
dev2<-dev2[1:20,]
treemap(dev2,index="Developer", vSize="count", type="index",
  fontsize.labels =8 )
We select the developers with 30 or more games to generate a heatmap. For each developer we calculate mscore, the average rating of their games, and scount, the sum of the review counts across all their games. From the heatmap, it is clear that “Tapps Tecnologia” is the winner among the developers! This company has an average score above 4 with roughly 300k user ratings.
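The analysis lists the qualifying developers by name. As a rough sketch, the same selection and the two summaries could also be derived from the data. Everything here is an assumption for illustration: the toy data frame, the `min_games` value of 2 (the analysis uses 30), and the name `ht_toy` are hypothetical.

```r
# Toy stand-in for game2; developer names and numbers are invented.
toy <- data.frame(
  Developer = c("Dev A", "Dev A", "Dev A", "Dev B", "Dev B", "Dev C"),
  Average.User.Rating = c(4.5, 4.0, 4.5, 3.5, 3.0, 5.0),
  User.Rating.Count = c(100, 200, 300, 50, 150, 10),
  stringsAsFactors = FALSE
)
min_games <- 2  # the analysis uses 30; 2 keeps the toy example small

# Count games per developer and keep the prolific ones.
n_games <- table(toy$Developer)
keep <- names(n_games)[n_games >= min_games]
sub <- toy[toy$Developer %in% keep, ]

# Per-developer mean rating (mscore) and total review count (scount).
mscore <- tapply(sub$Average.User.Rating, sub$Developer, mean)
scount <- tapply(sub$User.Rating.Count, sub$Developer, sum)
ht_toy <- data.frame(Developer = names(mscore),
                     mscore = as.numeric(mscore),
                     scount = as.numeric(scount),
                     stringsAsFactors = FALSE)

# Bucket the mean rating the same way as the heatmap's x axis;
# anything below 3 is left as NA rather than mislabelled.
ht_toy$mean.score <- ifelse(ht_toy$mscore >= 4, "Rating >= 4",
                     ifelse(ht_toy$mscore >= 3, "Rating 3~4", NA))
print(ht_toy)
```

Deriving the list this way avoids typos in developer names, at the cost of one extra counting step.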
ht <- game2[ which(game2$Developer %like% "Tapps Tecnologia" | game2$Developer %in% c("Detention Apps","EASY Inc.","Qumaron", " HexWar Games Ltd","8Floor","HexWar Games Ltd")),]
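#ht is ungrouped, so a plain group_by() is enough (the add argument is deprecated in dplyr >= 1.0)
ht2<- ht %>% group_by(Developer) %>% summarise(mscore=mean(Average.User.Rating),scount=sum(User.Rating.Count))
#Categorize Rating
#initialise the column first so every row has a value before the subset assignments
ht2$mean.score <- NA_character_
ht2$mean.score[ht2$mscore>=3 & ht2$mscore<4] <- "Rating 3~4"
ht2$mean.score[ht2$mscore>=4 ] <- "Rating >= 4"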
#heatmap
p<-ggplot(ht2, aes(x=mean.score, y=Developer, fill=scount))+
scale_fill_gradient(low="skyblue",high="pink")+
geom_tile(colour="white")+
labs(y="Developer", x="Average Rating", title="Heatmap of Developer",fill="User Count")
p
How long do the games survive? The density plot looks almost like a chi-square density with 1 degree of freedom.
s<- game2[game2$Original.Release.Date!="NA" & game2$Current.Version.Release.Date!="NA",]
#convert to standard date
s$now=as.Date(s$Current.Version.Release.Date,format="%d/%m/%Y")
s$start=as.Date(s$Original.Release.Date,format="%d/%m/%Y")
#Calculate the differences
s$diff=difftime(s$now,s$start, unit=c("days"))
#density plot
s$syear<-as.numeric(format(s$start,"%Y"))
s$cyear<- as.numeric(format(s$now,"%Y"))
s$diff1 <-as.numeric(s$diff)
ggplot(s,aes(x=diff1,color=s$Average.User.Rating))+
geom_density(fill="lightgreen",alpha=0.3)+
ggtitle("Density Plot of Duration of Game Survival time")+
  xlab("Duration of Game Version Days")
We use a Gantt chart to show the survival time of 10 random games picked with the “sample” function. If a game’s line is empty, the game has only one version.
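The code for the Gantt chart is not shown here, so the following is only a minimal sketch of one way to draw it with ggplot2, not the original implementation. The toy data frame, its dates, and the names `picked` and `gantt` are hypothetical; the columns `start` and `now` mirror the ones computed above.

```r
library(ggplot2)

# Toy stand-in for the data frame s; names and dates are invented.
toy <- data.frame(
  Name  = c("Game A", "Game B", "Game C", "Game D"),
  start = as.Date(c("2015-01-10", "2016-06-01", "2017-03-15", "2018-09-30")),
  now   = as.Date(c("2018-05-20", "2016-06-01", "2019-01-05", "2019-07-01")),
  stringsAsFactors = FALSE
)

set.seed(123)
# Draw a random subset of rows (the analysis samples 10 games; 3 here).
picked <- toy[sample(nrow(toy), 3), ]

# One horizontal segment per game, from first release to latest version.
# A game whose two dates coincide gets a zero-length (empty) line.
gantt <- ggplot(picked, aes(x = start, xend = now, y = Name, yend = Name)) +
  geom_segment(color = "steelblue") +
  labs(x = "Date", y = "Game", title = "Survival time of sampled games")
print(gantt)
```

In this toy data, "Game B" has identical release and current-version dates, so if it is sampled its row stays empty, which is the one-version case described above.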
We will explore the Game Genres
Can we guess a game’s genre by looking at its name? We see that a lot of games actually use the word “Game” in their names to catch people’s attention.
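A word cloud sizes each word by how often it appears, so the same impression can be checked with a plain frequency table. This is a small sketch on invented titles; `toy_names`, `words`, and `freq` are hypothetical names, and the real input would be the Name column.

```r
# Toy stand-in for game1$Name; the titles are invented.
toy_names <- c("Chess Game Pro", "Sudoku Game", "Tower Defense",
               "Word Game Free", "Castle Defense")

# Lower-case the titles, split on anything that is not a letter, and tally.
words <- unlist(strsplit(tolower(toy_names), "[^a-z]+"))
words <- words[nchar(words) > 0]
freq <- sort(table(words), decreasing = TRUE)

# The most frequent words are the ones a word cloud would draw largest.
print(head(freq, 3))
```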
set.seed(123)
wordcloud(words=game1$Name, max.words=123,random.order=FALSE,
rot.per=0.35,colors=brewer.pal(8, "Dark2"),
  main="Wordcloud of Game Names")
The lollipop graph shows that the most popular genres are Strategy and Puzzle. It seems like most people like challenges.
set.seed(123)
wordcloud(words=game1$Genres, max.words=123,random.order=FALSE,
rot.per=0.35,colors=brewer.pal(8, "Dark2"),
  main="Wordcloud of Game Genres")
#keep the games with at least the mean number of reviews, then group them by genre
cat<- game2[ which(game2$User.Rating.Count>=3306 ),]
cat<- group_by(cat,Genres)
cat1<- dplyr::summarise(cat,count=n())
cat1<- arrange(cat1,desc(count))
cat15<- cat1[1:10,]
ggplot(cat15,aes(y=count,x=Genres))+
geom_segment(aes(y=0,yend=count,x=Genres,xend=Genres),color="orange")+
geom_point( color="blue", size=4, alpha=0.6)+coord_flip()+
ggtitle("The Most Popular Game Genres")+
theme_light()+
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)+ggtitle("Genres vs Counts")
#highest rating
h1<- arrange(game1,desc(Average.User.Rating))
h1<- h1[(h1[,6]>4.5),]
ed2<-arrange(h1,desc(User.Rating.Count))
ed2<-ed2[1:15,]
treemap(ed2,index="Name", vSize="User.Rating.Count", type="index",
  fontsize.labels =6 )
Now let’s take a look at the educational games.
From the treemap, we observe that among the 5-star games, Arizona Rose has the most user ratings. I googled it: in Arizona Rose and the Pirates’ Riddles, you decipher the clever codes of Blackbeard’s treasure maps. While shopping for exotic antiques, Arizona stumbles across the lost maps of an infamous pirate’s hidden treasure. Join Arizona on her epic treasure-hunting quest with 200 levels and fortunes waiting to be discovered. There’s always one more puzzle to solve, one more cave or shipwreck to explore and one more treasure to take home at the end!
#Select the games with an educational purpose
ed<- game1[game1$Genres %like% "Education", ]
#Exclude the games with no rating
ed<-ed[!is.na(ed[,6]),]
ed1<- arrange(ed,desc(Average.User.Rating))
ed1<- ed1[(ed1[,6]>4.5),]
ed2<-arrange(ed1,desc(User.Rating.Count))
ed2<-ed2[1:15,]
treemap(ed2,index="Name", vSize="User.Rating.Count", type="index",
  fontsize.labels =6 )
Arizona Rose
Now we look at my personal favorite game: Five in a Line. Surprisingly, there are not many Five in a Line games; the best ones are Five in a Row Pro and Five Field Kono.
five <- game2[game2$Name %like% "Five", ]
five1 <-five[c(1,8),c(3,6,7,9,11,12,17,18)]
five2<- five[c(1,8),]
five2 %>%
mutate(image = paste0('<img width="60%" height="15%" src="', Icon.URL , '"></img>')) %>%
select(image, Name,Average.User.Rating, User.Rating.Count, Genres) %>%
datatable(class = "nowrap hover row-border", escape = FALSE,
  options = list(dom = 't',scrollX = TRUE))
In July 2019, I met a guy online. He is super nice, with extremely fancy handwriting. We became friends and played Five in a Line; I lost all the games, even against his “blind” moves.
All of a sudden, I lost contact with him. Then I realized I don’t even know his real name; all I know is that he is superb at Five in a Line, listens to Draughty in the subway, enjoys drinking a mojito sometimes, and raises his eyebrow when others try to see his exam answers.
My favorite quote from him is: “One 4-connected is not a win; only a two-sided 3-connected counts as winning! I don’t rely on others’ mistakes for my wins.”
I never got a chance to say goodbye to him properly, so I decided to make this Game Analysis in memory of the time with him.
Silver lining, I guess :)
Happy 2020 !