potter_data <- read.csv("Downloads/harry_potter_reviews.csv")

1st plot

hist(potter_data$user_age, xlab = "Age", ylab="Number of users", col="#C6DBDA", main='')

Age of the sample

2nd plot

library(ggplot2)

# Count how often each favourite character occurs (columns: Var1, Freq)
brplt <- as.data.frame(table(potter_data$favourite_character))
ggplot(brplt, aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  scale_fill_hue(c = 30) +
  theme(legend.position = "none") +
  coord_flip() +
  xlab("Favourite character") +
  ylab("Frequency")

Favourite characters’ frequency

3rd plot

ggplot(potter_data, aes(x=user_sex, y=rating)) + 
    geom_boxplot(varwidth = TRUE, fill="slateblue", alpha=0.2) + 
    xlab("Sex") +
    ylab("Rating")

Distribution of rating among gender groups

4th plot

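library(quanteda)            # corpus(), tokens(), dfm()
library(quanteda.textplots)  # textplot_wordcloud() (separate package since quanteda 3.0)
library(lexicon)             # hash_lemmas lemmatization table

potter_corpus <- corpus(potter_data, text_field = "comment", docid_field = "user_id")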

potter_tokens <- tokens(potter_corpus, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE)
potter_tokens_2 <- tokens_replace(potter_tokens, pattern = hash_lemmas$token, replacement = hash_lemmas$lemma)
# Remove English stopwords and character names (with possessives).
# "Rons" is kept from the original list; "Ron's" is added to match the other possessives.
potter_tokens_clean <- tokens_remove(potter_tokens_2,
  pattern = c(stopwords("en"),
              "Draco", "Draco's", "Malfoy", "Malfoy's",
              "Hermione", "Hermione's", "Granger", "Granger's",
              "Ron", "Rons", "Ron's", "Weasley", "Weasley's",
              "Severus", "Severus's", "Snape", "Snape's",
              "Harry", "Harry's", "Potter", "Potter's",
              "Albus", "Albus's", "Dumbledore", "Dumbledore's",
              "Hagrid", "Hagrid's", "Rubeus", "Rubeus's",
              "Neville", "Neville's", "Longbottom", "Longbottom's"))

potter_dfm <- dfm(potter_tokens_clean)
potter_dfm_trim <- dfm_trim(potter_dfm, min_docfreq = 0.0005, docfreq_type = "prop") 

textplot_wordcloud(potter_dfm_trim, color = "#015781", min_size = 1, max_size = 5, max_words = 25)

Most frequent words in reviews

Table with data

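library(dplyr)
library(knitr)       # kable()
library(kableExtra)  # kable_paper()

# Mean rating for every country/sex combination
potter_table <- potter_data %>%
  group_by(user_country, user_sex) %>%
  summarize(rating_mean = mean(rating, na.rm = TRUE), .groups = "drop")
kable(potter_table, col.names = c("Country", "Sex", "Mean rating"), align = "ccc", caption = "Mean rating values in different countries and genders") %>%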
  kable_paper()

Mean rating values in different countries and genders

Country         Sex     Mean rating
Austria         female  4.277778
Austria         male    4.444444
Austria         other   4.500000
Belgium         female  3.850000
Belgium         male    4.000000
Belgium         other   4.100000
Croatia         female  4.175000
Croatia         male    4.062500
Croatia         other   3.500000
Finland         female  4.140625
Finland         male    3.900000
Finland         other   3.857143
France          female  4.000000
France          male    4.214286
France          other   4.333333
Germany         female  4.020000
Germany         male    3.868421
Germany         other   3.954546
Greece          female  4.184210
Greece          male    3.214286
Greece          other   4.500000
Netherlands     female  4.333333
Netherlands     male    5.000000
Netherlands     other   4.500000
Spain           female  3.909091
Spain           male    3.903226
Spain           other   4.090909
Sweden          female  4.083333
Sweden          male    3.727273
Sweden          other   3.000000
Switzerland     female  3.969697
Switzerland     male    4.062500
Switzerland     other   4.214286
Turkey          female  3.710526
Turkey          male    3.900000
Turkey          other   4.000000
United Kingdom  female  4.250000
United Kingdom  male    4.000000

New visualization techniques

A very simple one, but it can be useful

library(beeswarm)
beeswarm(potter_data$user_age, horizontal=TRUE, pch=16, col="red", xlab = 'Age')

Distribution of age among Harry Potter reviewers

Small story: I doubted that a histogram alone was enough of a plot, so I searched for an interesting alternative. Beeswarm sounded funny and interesting.

A more interesting one

library(vcd)  # optional here: mosaicplot() itself is in base graphics; vcd provides mosaic()
mosaicplot(~ favourite_character + rating, data = potter_data, 
           main = "", 
           xlab = "Favourite character", 
           ylab = "Rating", 
           color = TRUE)

Mosaic plot of Harry Potter rating by favourite character

Story: First, I understand that the graph may look a little too complicated at first, but it’s cool!!! The story goes like this: I really want many points for this h/w, so I tried really hard to find some visualizations. I ran into a big problem, though: the data set I use has only one continuous variable, so all the cool 3D graphs were not an option. So I googled visualizations for categorical variables and found this treasure. And, of course, I really wanted to see whether there is a relationship between the favourite character a person chose and the rating, and the plot certainly shows some interesting things.
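To read the mosaic plot as numbers, one option is a table of row shares. This is only a sketch: the `toy` data frame below is made up for illustration and merely reuses the column names `favourite_character` and `rating`; it is not the review data.

```r
# Toy data standing in for potter_data (values are invented for illustration)
toy <- data.frame(
  favourite_character = c("Harry", "Harry", "Hermione", "Hermione", "Hermione", "Ron"),
  rating              = c(5, 4, 5, 5, 3, 4)
)

# Contingency table: rows are characters, columns are ratings
tab <- table(toy$favourite_character, toy$rating)

# Share of each rating within a character (each row sums to 1);
# these shares correspond to how a character's column in the mosaic plot is split
row_shares <- prop.table(tab, margin = 1)
print(round(row_shares, 2))
```

Running the same two calls on the real columns should give the numbers behind each tile, which may be easier to compare than tile sizes.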

P.S. If you didn’t like my new visualization technique and find the word cloud (4th plot) more suitable, I can tell the story of how I found out about it as well. For some unknown reason I decided to enroll in a course on computer methods for text mining (or smth like that). It was really hard to understand anything in that course, but I managed to learn how to create a corpus and tokens and even how to delete stopwords (unfortunately manually), and the word cloud was the best thing I learned.
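The manual part could probably be shortened. A minimal sketch, assuming the goal is just to list each character name plus its possessive form: `character_names` and `name_patterns` are hypothetical helper names, and the names themselves are taken from the removal list used for the 4th plot.

```r
# Base names to drop from the reviews (copied from the list used for the word cloud)
character_names <- c("Draco", "Malfoy", "Hermione", "Granger", "Ron", "Weasley",
                     "Severus", "Snape", "Harry", "Potter", "Albus", "Dumbledore",
                     "Hagrid", "Rubeus", "Neville", "Longbottom")

# Add the possessive form of every name, e.g. "Draco" -> "Draco's"
name_patterns <- c(character_names, paste0(character_names, "'s"))
print(name_patterns)
```

The resulting vector could then be combined with `stopwords("en")` and passed as `pattern` to `tokens_remove()`, in place of the hand-typed list.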