potter_data <- read.csv("Downloads/harry_potter_reviews.csv")
hist(potter_data$user_age, xlab = "Age", ylab="Number of users", col="#C6DBDA", main='')
Age of the sample
library(ggplot2)
# Count how many reviewers picked each favourite character
brplt <- as.data.frame(table(potter_data$favourite_character))
ggplot(brplt, aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  scale_fill_hue(c = 30) +
  theme(legend.position = "none") +
  coord_flip() +
  xlab("Favourite character") +
  ylab("Frequency")
Favourite characters’ frequency
ggplot(potter_data, aes(x = user_sex, y = rating)) +
  geom_boxplot(varwidth = TRUE, fill = "slateblue", alpha = 0.2) +
  xlab("Sex") +
  ylab("Rating")
Distribution of rating among gender groups
library(quanteda)
library(quanteda.textplots)  # textplot_wordcloud() lives here in quanteda 3 and later
library(lexicon)             # provides the hash_lemmas lookup table
potter_corpus <- corpus(potter_data, text_field = "comment", docid_field = "user_id")
potter_tokens <- tokens(potter_corpus, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE)
potter_tokens_2 <- tokens_replace(potter_tokens, pattern = hash_lemmas$token, replacement = hash_lemmas$lemma)
# Drop English stopwords plus character names and their possessive forms
character_names <- c("Draco", "Malfoy", "Hermione", "Granger", "Ron", "Weasley", "Severus", "Snape",
  "Harry", "Potter", "Albus", "Dumbledore", "Hagrid", "Rubeus", "Neville", "Longbottom")
potter_tokens_clean <- tokens_remove(potter_tokens_2,
  pattern = c(stopwords("en"), character_names, paste0(character_names, "'s")))
potter_dfm <- dfm(potter_tokens_clean)
# Keep only terms that appear in at least 0.05% of the reviews
potter_dfm_trim <- dfm_trim(potter_dfm, min_docfreq = 0.0005, docfreq_type = "prop")
textplot_wordcloud(potter_dfm_trim, color = "#015781", min_size = 1, max_size = 5, max_words = 25)
Most frequent words in reviews
library(dplyr)
library(knitr)
library(kableExtra)
# Mean rating for every country and sex combination
potter_table <- potter_data %>%
  group_by(user_country, user_sex) %>%
  summarize(rating_mean = mean(rating, na.rm = TRUE), .groups = "drop")
kable(potter_table, col.names = c("Country", "Sex", "Mean rating"), align = "ccc", caption = "Mean rating values in different countries and genders") %>%
  kable_paper()
Country | Sex | Mean rating |
---|---|---|
Austria | female | 4.277778 |
Austria | male | 4.444444 |
Austria | other | 4.500000 |
Belgium | female | 3.850000 |
Belgium | male | 4.000000 |
Belgium | other | 4.100000 |
Croatia | female | 4.175000 |
Croatia | male | 4.062500 |
Croatia | other | 3.500000 |
Finland | female | 4.140625 |
Finland | male | 3.900000 |
Finland | other | 3.857143 |
France | female | 4.000000 |
France | male | 4.214286 |
France | other | 4.333333 |
Germany | female | 4.020000 |
Germany | male | 3.868421 |
Germany | other | 3.954546 |
Greece | female | 4.184210 |
Greece | male | 3.214286 |
Greece | other | 4.500000 |
Netherlands | female | 4.333333 |
Netherlands | male | 5.000000 |
Netherlands | other | 4.500000 |
Spain | female | 3.909091 |
Spain | male | 3.903226 |
Spain | other | 4.090909 |
Sweden | female | 4.083333 |
Sweden | male | 3.727273 |
Sweden | other | 3.000000 |
Switzerland | female | 3.969697 |
Switzerland | male | 4.062500 |
Switzerland | other | 4.214286 |
Turkey | female | 3.710526 |
Turkey | male | 3.900000 |
Turkey | other | 4.000000 |
United Kingdom | female | 4.250000 |
United Kingdom | male | 4.000000 |
library(beeswarm)
beeswarm(potter_data$user_age, horizontal=TRUE, pch=16, col="red", xlab = 'Age')
Distribution of age among Harry Potter reviewers
Small story: I doubted that a histogram alone was enough of a plot, so I searched for an interesting alternative. Beeswarm sounded funny and interesting.
# mosaicplot() is part of base R's graphics package, so no extra library is needed
mosaicplot(~ favourite_character + rating, data = potter_data,
  main = "",
  xlab = "Favourite character",
  ylab = "Rating",
  color = TRUE)
Mosaic plot of Harry Potter rating by favourite character
Story: First, I understand that the graph may look a little too complicated at first, but it’s cool!!! The story goes like this: I really want many points for this homework, so I tried really hard to find some new visualizations. I faced a big problem, though: the data set that I use has only one continuous variable, so all the cool 3D graphs were not an option. So I googled visualizations for categorical variables and found this treasure. And, of course, I really wanted to understand whether there is a relationship between the favourite character a person chose and the rating, and we can certainly see some interesting things in this plot.
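To put numbers next to the mosaic plot, one option is a table of row proportions: for each favourite character, the share of reviewers who gave each rating. The sketch below is only an illustration. It uses a small made-up data frame (`toy_reviews`, a hypothetical stand-in for `potter_data`), so the values say nothing about the real reviews; it only assumes the same two columns, `favourite_character` and `rating`.

```r
# Toy data standing in for potter_data (values are made up for illustration)
toy_reviews <- data.frame(
  favourite_character = c("Hermione", "Hermione", "Ron", "Ron", "Ron", "Snape"),
  rating = c(5, 4, 5, 3, 3, 4)
)

# Counts of each rating within each favourite character
rating_counts <- table(toy_reviews$favourite_character, toy_reviews$rating)

# Row proportions: share of each rating among fans of a given character
rating_shares <- prop.table(rating_counts, margin = 1)
print(round(rating_shares, 2))
```

Each row of `rating_shares` sums to 1, which matches how the mosaic plot splits every character's column into rating tiles. Swapping `toy_reviews` for `potter_data` should give the numbers behind the plot, assuming the column names match.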
P.S. If you didn’t like my new visualization technique and find the word cloud (3rd plot) more suitable, I can tell the story of how I found out about it as well. For some unknown reason, I decided to enroll in a course on computer methods for text mining (or something like that). It was really hard to understand anything in that course, but I managed to learn how to create a corpus and tokens and even how to delete stopwords (unfortunately, manually). The word cloud was the best thing I learned.