h/w1 ADA

1st plot

hist(potter_data$user_age, xlab = "Age", ylab="Number of users", col="#C6DBDA", main='')

Age of the sample

2nd plot

library(ggplot2)

# Count how often each favourite character occurs (columns: Var1, Freq)
brplt <- as.data.frame(table(potter_data$favourite_character))
ggplot(brplt, aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  scale_fill_hue(c = 30) +
  theme(legend.position = "none") +
  coord_flip() +
  xlab("Favourite character") +
  ylab("Frequency")

Favourite characters’ frequency

3rd plot

ggplot(potter_data, aes(x=user_sex, y=rating)) + 
    geom_boxplot(varwidth = TRUE, fill="slateblue", alpha=0.2) + 
    xlab("Sex") +
    ylab("Rating")

Distribution of rating among gender groups

4th plot

library(quanteda)            # corpus(), tokens(), dfm()
library(quanteda.textplots)  # textplot_wordcloud() (separate package since quanteda 3)
library(lexicon)             # hash_lemmas

potter_corpus <- corpus(potter_data, text_field = "comment", docid_field = "user_id")

potter_tokens <- tokens(potter_corpus, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE)
potter_tokens_2 <- tokens_replace(potter_tokens, pattern = hash_lemmas$token, replacement = hash_lemmas$lemma)

# Character names to drop, together with their possessive forms
character_names <- c("Draco", "Malfoy", "Hermione", "Granger", "Ron", "Weasley", "Severus", "Snape",
                     "Harry", "Potter", "Albus", "Dumbledore", "Hagrid", "Rubeus", "Neville", "Longbottom")
potter_tokens_clean <- tokens_remove(potter_tokens_2,
                                     pattern = c(stopwords("en"), "Rons", character_names, paste0(character_names, "'s")))

potter_dfm <- dfm(potter_tokens_clean)
potter_dfm_trim <- dfm_trim(potter_dfm, min_docfreq = 0.0005, docfreq_type = "prop")

textplot_wordcloud(potter_dfm_trim, color = "#015781", min_size = 1, max_size = 5, max_words = 25)

Most frequent words in reviews

Table with data

library(dplyr)       # %>%, group_by(), summarize()
library(knitr)       # kable()
library(kableExtra)  # kable_paper()

potter_table <- potter_data %>%
  group_by(user_country, user_sex) %>%
  summarize(rating_mean = mean(rating, na.rm = TRUE))
kable(potter_table, col.names = c("Country", "Sex", "Mean rating"), align = "ccc", caption = "Mean rating values in different countries and genders") %>%
  kable_paper()

Mean rating values in different countries and genders
Country	Sex	Mean rating
Austria	female	4.277778
Austria	male	4.444444
Austria	other	4.500000
Belgium	female	3.850000
Belgium	male	4.000000
Belgium	other	4.100000
Croatia	female	4.175000
Croatia	male	4.062500
Croatia	other	3.500000
Finland	female	4.140625
Finland	male	3.900000
Finland	other	3.857143
France	female	4.000000
France	male	4.214286
France	other	4.333333
Germany	female	4.020000
Germany	male	3.868421
Germany	other	3.954546
Greece	female	4.184210
Greece	male	3.214286
Greece	other	4.500000
Netherlands	female	4.333333
Netherlands	male	5.000000
Netherlands	other	4.500000
Spain	female	3.909091
Spain	male	3.903226
Spain	other	4.090909
Sweden	female	4.083333
Sweden	male	3.727273
Sweden	other	3.000000
Switzerland	female	3.969697
Switzerland	male	4.062500
Switzerland	other	4.214286
Turkey	female	3.710526
Turkey	male	3.900000
Turkey	other	4.000000
United Kingdom	female	4.250000
United Kingdom	male	4.000000
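
Some of these means may rest on very few reviews (a value of exactly 5.000000, for instance, could come from a single reviewer), so it can help to show the group size next to each mean. Below is a minimal sketch, assuming the same column names as in the code above. `toy_data` is a made-up stand-in for `potter_data`, and `n_reviews` is a hypothetical column name.

```r
library(dplyr)

# Made-up stand-in for potter_data, only to make the sketch runnable
toy_data <- data.frame(
  user_country = c("Austria", "Austria", "Austria", "Belgium"),
  user_sex     = c("female", "female", "female", "male"),
  rating       = c(4, 5, NA, 3)
)

toy_table <- toy_data %>%
  group_by(user_country, user_sex) %>%
  summarize(
    rating_mean = mean(rating, na.rm = TRUE),
    n_reviews   = sum(!is.na(rating)),  # ratings that actually enter the mean
    .groups     = "drop"
  )

print(toy_table)
```

With the real data, replacing `toy_data` by `potter_data` should give the same table as above plus one extra column of counts.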

New visualization techniques

A very simple one, but it can be useful

library(beeswarm)
beeswarm(potter_data$user_age, horizontal=TRUE, pch=16, col="red", xlab = 'Age')

Distribution of age among Harry Potter reviewers

Small story: I doubted that a histogram alone was enough to count as a plot, so I searched for an interesting alternative. A beeswarm plot sounded funny and interesting.

A more interesting one

library(vcd)  # vcd offers mosaic(); the mosaicplot() call below comes from base R's graphics package
mosaicplot(~ favourite_character + rating, data = potter_data, 
           main = "", 
           xlab = "Favourite character", 
           ylab = "Rating", 
           color = TRUE)

Mosaic plot of Harry Potter rating by favourite character

Story: I understand that the graph may look a little too complicated at first, but it’s cool! The story goes like this: I really want many points for this h/w, so I tried hard to find some new visualizations. I ran into a big problem, though: the data set I use has only one continuous variable, so all the cool 3D graphs were not an option. So I googled visualizations for categorical variables and found this treasure. Of course, I also wanted to see whether there is a relationship between the favourite character a person chose and the rating they gave, and the plot does show some interesting patterns.
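
A mosaic plot is a picture of a two-way contingency table, so the pattern in the plot can also be read off as numbers. Below is a minimal sketch in base R. `toy_data` is a made-up stand-in for `potter_data`: the column names are taken from the code above, and the values are invented.

```r
# Made-up stand-in for potter_data, only to make the sketch runnable
toy_data <- data.frame(
  favourite_character = c("Harry", "Harry", "Harry", "Snape", "Snape", "Hermione"),
  rating              = c(5, 4, 5, 3, 5, 4)
)

# Counts behind the mosaic plot: one cell per character/rating pair
counts <- table(toy_data$favourite_character, toy_data$rating)

# Share of each rating within a character (each row sums to 1);
# these shares correspond to the tile heights within one column of the plot
row_shares <- prop.table(counts, margin = 1)

print(counts)
print(round(row_shares, 2))
```

If the rows of shares look very different from one character to another, that hints at a relationship between favourite character and rating. If they look alike, the plot is mostly showing how popular each character is.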

P.S. If you don’t like my new visualization technique and find the word cloud (4th plot) more suitable, I can tell the story of how I found out about it as well. For some unknown reason I decided to enroll in a course on computer methods for text mining (or something like that). It was really hard to understand anything in that course, but I managed to learn how to create a corpus and tokens and even how to delete stop words (unfortunately manually), and the word cloud was the best thing I learned.