CMTA Project 1

Intro

In this project I analyze how speech differs by gender, using a dataset of book reviews written by men and women. Step by step, I compute several measures and comment on the results right after each metric is calculated. At the end, I briefly summarize my findings.

library(tidyverse)  # provides readr, stringr, dplyr, tidyr and ggplot2, all used below
text <- read_delim("~/review.csv", ";", escape_double = FALSE, trim_ws = TRUE)

Preparation

First, some preparation is needed: let’s remove all punctuation and convert upper case to lower case. After that we lemmatize the text and delete Russian stop words, along with the most frequently used words, as they don’t tell us anything meaningful.

text$Review = str_to_lower(text$Review) 
text$Review = gsub("[[:punct:]]", "",text$Review)


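# Lemmatize with the external mystem binary: -c copies the input, -l prints lemmas only, -d disambiguates
text$lemma <- system2("mystem", c("-c", "-l", "-d"), input = text$Review, stdout = TRUE)
# Strip the braces mystem puts around each lemma, plus any trailing "?" marks
text$lemma <- str_replace_all(text$lemma, "\\{([^}]+?)([?]+)?\\}", "\\1")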
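To make the brace-stripping step concrete, here is a minimal sketch on a made-up string. It assumes mystem wraps each lemma in curly braces and marks guessed lemmas with trailing question marks, which is what the pattern above implies; the sample string is invented, and base R's gsub() stands in for str_replace_all().

```r
# Hypothetical mystem-style output: lemmas in braces, "?" marks a guessed lemma
raw <- "{read} {book} {zzz??}"

# Same pattern as above; perl = TRUE enables the lazy quantifier "+?"
clean <- gsub("\\{([^}]+?)([?]+)?\\}", "\\1", raw, perl = TRUE)

print(clean)
```

Only the first capture group (the lemma itself) is kept, so both the braces and the question marks disappear.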

library(tidytext)   # unnest_tokens()
library(stopwords)  # stopwords("ru"); assumed source, tm exports a function of the same name
rustopwords <- data.frame(words= c(stopwords("ru"),"книга","это","который","весь","свой","мочь","читать","герой","автор","история","просто","роман","очень","самый"), stringsAsFactors=FALSE)

text.tok = text %>% unnest_tokens(words, lemma) %>% na.omit() %>% anti_join(rustopwords, by = "words")

text.tok$Sex = as.character(text.tok$Sex)
text.tok$Sex = ifelse(str_detect(text.tok$Sex, "N/A"), NA, text.tok$Sex)
text.tok = text.tok %>% na.omit()
text.tok$Sex = as.factor(text.tok$Sex)

Frequency lists

To start with, let’s look at the simple frequency lists by gender:

freq_f = text.tok %>% filter(Sex == "Ж") %>% group_by(words) %>%  summarise(count = n()) %>% arrange(-count) %>% top_n(15, count)

ggplot(freq_f, aes(x = reorder(words,count), y = count))+
  geom_bar(stat = "identity", alpha = 0.7, color = "black", fill = "lightblue")+
  coord_flip()+
  theme_minimal()+
  labs(title = "Top 15 words used by Females
       ", x = "Word", y = "Word frequency")+  # labs() labels aesthetics: x is the word axis even after coord_flip()
  theme(text = element_text(size = 15), plot.title = element_text(hjust = 0.5), title = element_text(size = 15))

freq_m = text.tok %>% filter(Sex == "М") %>% group_by(words) %>%  summarise(count = n()) %>% arrange(-count) %>% top_n(15, count)

ggplot(freq_m, aes(x = reorder(words,count), y = count))+
  geom_bar(stat = "identity", alpha = 0.7, color = "black", fill = "lightblue")+
  coord_flip()+
  theme_minimal()+
  labs(title = "Top 15 words used by Males
       ", x = "Word", y = "Word frequency")+  # labs() labels aesthetics: x is the word axis even after coord_flip()
  theme(text = element_text(size = 15), plot.title = element_text(hjust = 0.5), title = element_text(size = 15))

The bar charts show the most popular words used by men and by women. As can be seen, “время” (time) is the most frequent word in both groups.

However, “любовь” (love) is among the 10 most frequently used words in the female subgroup, whereas it does not appear even in the top 15 (actually 16, because of ties) of the male subgroup!
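The “actually 16” comes from how top_n() treats ties: every row tied with the last place is kept, so the list can be longer than requested. A small base-R sketch of the same rule, with invented counts (the names and numbers are illustrative only):

```r
# Invented word counts; "b" and "c" tie for second place
counts <- c(a = 5, b = 3, c = 3, d = 1)

n <- 2
# Value of the n-th largest count; everything at or above it is kept
cutoff <- sort(counts, decreasing = TRUE)[n]
top_with_ties <- counts[counts >= cutoff]

print(names(top_with_ties))  # three words although n is 2
```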

Women are also more likely to use words such as “друг” (friend) and especially “ребенок” (child), which probably tells us that females tend to focus on relationships, family and motherhood, whereas males use much vaguer words.

My personal guess is that females usually focus on the book’s content (e.g. the plot) when writing a review, whereas males are more likely to rate the style of writing, the professionalism of the author, etc.

Wordclouds

For Male:

library(wordcloud2)
wordcloud_m = text.tok %>% filter(Sex == "М") %>% 
    count(words, sort=TRUE) %>% 
    top_n(80, n)

wordcloud2(data = wordcloud_m, rotateRatio = 0)

For Female:

wordcloud_f = text.tok %>% filter(Sex == "Ж") %>% 
    count(words, sort=TRUE) %>% 
    top_n(80, n)

wordcloud2(data = wordcloud_f, rotateRatio = 0)

Some statistics

# Total tokens per sex and each sex's share of all tokens
words_said = text.tok %>% group_by(Sex) %>% summarize(words_said = n()) %>% mutate(pecentage = round((words_said/sum(words_said)*100),1)) 

# Tokens per reviewer (Nickname), then summary statistics by sex
mean_words = text.tok %>% group_by(Nickname, Sex) %>% summarise(count = n()) %>% group_by(Sex) %>% summarise(mean = round(mean(count),2), median = median(count), sd = round(sd(count),2), min = min(count), max = max(count)) %>% select(-Sex)

stats = cbind(words_said, mean_words)

library(formattable)
formattable(stats, 
            align =c("c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Sex	words_said	pecentage	mean	median	sd	min	max
Ж	217554	84.5	212.25	134	246.39	3	2446
М	39790	15.5	219.83	154	278.45	2	2423

We can see that the overwhelming majority of the words (~85%) in our sample are written by women. However, it is interesting to notice that men’s reviews are on average about 7.6 words longer (mean ≈ 219.8 words per review) than women’s (mean ≈ 212.3 words per review).

Nevertheless, the standard deviation of the number of words per review is higher for males than for females, which points to a greater diversity in the length of reviews written by males.

In simple words, in our sample males on average write a little more than females, but the length of a randomly chosen male review is a little less predictable than that of a female one.
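As a quick check of the comparison above, the sketch below recomputes the gap between the means and adds the coefficient of variation (sd divided by mean). The coefficient of variation is not part of the original analysis; it is included here only as one possible scale-free way to compare spread, using the figures from the table.

```r
# Figures copied from the table above (F = Ж, M = М)
stats <- data.frame(
  sex  = c("F", "M"),
  mean = c(212.25, 219.83),
  sd   = c(246.39, 278.45)
)

# Coefficient of variation: spread relative to the mean
stats$cv <- round(stats$sd / stats$mean, 2)

# Gap between the two means, in words
mean_gap <- stats$mean[2] - stats$mean[1]

print(stats)
print(mean_gap)
```

The coefficient comes out at about 1.16 for women and 1.27 for men, so the spread is larger for men in relative terms as well.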

IPM and G2

males <- text.tok %>% filter(Sex == "М")
females <- text.tok %>% filter(Sex == "Ж")

different_table <- bind_rows(males, females) %>%
    dplyr::count(words, Sex) %>% 
    spread(Sex, n, fill = 0) %>% 
  filter(М>5 | Ж>5)


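# Log-likelihood (G2) per word; a and b are the count vectors of the two groups
g2 = function(a, b) {
  c = sum(a)
  d = sum(b)
  # Expected counts if the word were used at the same rate in both groups
  E1 = c * ((a + b) / (c + d))
  E2 = d * ((a + b) / (c + d))
  # 1e-7 keeps log() finite when a count is zero
  return(2*((a*log(a/E1+1e-7)) + (b*log(b/E2+1e-7))))
}

# log2 ratio of relative frequencies; positive when the word is relatively more frequent in a
logratio <- function(a, b) {
    return(log2((a/sum(a)/(b/sum(b)))))
}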
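To see what the two measures do, here is a minimal worked example on invented counts (the three words and their numbers are not taken from the review data). One deliberate difference from the function above: instead of adding 1e-7 inside the logarithm, this sketch treats a zero count as contributing nothing, which is a common convention.

```r
# Invented counts for three words in two groups (not taken from the review data)
female <- c(love = 30, plot = 60, football = 10)
male   <- c(love = 5,  plot = 30, football = 15)

# Expected counts if each word were used at the same rate in both groups
total_f <- sum(female)
total_m <- sum(male)
expected_f <- total_f * (female + male) / (total_f + total_m)
expected_m <- total_m * (female + male) / (total_f + total_m)

# G2: zero counts contribute nothing (0 * log(0) is treated as 0)
term <- function(obs, exp) ifelse(obs == 0, 0, obs * log(obs / exp))
g2_toy <- 2 * (term(female, expected_f) + term(male, expected_m))

# Log ratio of relative frequencies: > 0 leans female, < 0 leans male
logratio_toy <- log2((female / total_f) / (male / total_m))

print(round(g2_toy, 2))
print(round(logratio_toy, 2))
```

Here “plot” is used at the same rate by both groups, so its G2 and log ratio are both 0. “love” and “football” get G2 values of about 6.61 and 7.42, with log ratios of equal size and opposite sign. G2 measures how surprising the difference is, and the log ratio gives its direction and size.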

Top 10 words by g2 metric:

g2_table <- different_table %>%
    mutate(g2=g2(Ж,М)) %>%
    arrange(desc(g2)) %>%
    mutate(g2 = round(g2, 2)) %>%
    # NB: 10e+6 equals 1e7, so the *_ipm columns are per ten million tokens; use 1e+6 for true IPM
    group_by(words) %>% mutate(male_ipm = (М * 10e+6)/nrow(males), female_ipm = (Ж * 10e+6)/nrow(females))

top_g2 = g2_table %>% head(10)

formattable(top_g2, 
            align =c("l","c","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

words	Ж	М	g2	male_ipm	female_ipm
пинчон	0	18	69.26	4523.750	0.0000
аверченко	0	14	53.87	3518.472	0.0000
шалалалулааа	0	13	50.02	3267.153	0.0000
ходасевич	0	12	46.17	3015.833	0.0000
любовь	730	54	45.15	13571.249	33554.8875
лу	4	14	36.06	3518.472	183.8624
бизнес	7	16	35.50	4021.111	321.7592
футбол	2	12	35.32	3015.833	91.9312
крусанов	0	9	34.63	2261.875	0.0000
моралите	0	9	34.63	2261.875	0.0000
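A note on reading the two IPM columns: the pipeline multiplies by 10e+6, which R parses as ten million, so the values shown are ten times larger than true instances-per-million figures. The sketch below checks this for the first row, assuming the male token total of 39790 from the statistics table; the helper name ipm is hypothetical.

```r
# Instances per million: count scaled by corpus size
ipm <- function(count, total) count / total * 1e+6

male_tokens <- 39790  # male token total from the statistics table
pynchon_ipm <- ipm(18, male_tokens)

# The multiplier used in the pipeline is ten times larger than one million
multiplier_ratio <- 10e+6 / 1e+6

print(pynchon_ipm)
print(multiplier_ratio)
```

The true IPM for “пинчон” is therefore about 452.4, one tenth of the 4523.750 shown in the table. The ranking of words is unaffected, since every row is scaled by the same factor.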

From the table we can see that in our sample only men write about (and therefore read?) Thomas Pynchon, Arkady Averchenko and Vladislav Khodasevich. It can also be noticed that men are more likely to read Franz Kafka and Isaac Asimov, mentioning them more frequently than women.

Furthermore, as might be intuitively expected, men are much more likely to talk about football and business. It is astonishing how straightforward the results are! Honestly speaking, I did not expect to find such vivid examples of gender differentiation.

Still, the most outstanding gender-based difference can be seen in the usage of the word “любовь” (love). Women far outpace men in elaborating on topics concerning love and, probably, romantic relationships. Well, this just confirms one of the most popular stereotypes about women.

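# Same table as before plus the log ratio; logratio > 0 means relatively more frequent among women (Ж)
g2_table <- different_table %>%
  mutate(logratio = logratio(Ж,М)) %>% 
  mutate(logratio = round(logratio,2)) %>% 
    mutate(g2=g2(Ж,М)) %>%
    arrange(desc(g2)) %>%
    mutate(g2 = round(g2, 2)) %>%
    # NB: 10e+6 equals 1e7, so the *_ipm columns are per ten million tokens; use 1e+6 for true IPM
    group_by(words) %>% mutate(male_ipm = (М * 10e+6)/nrow(males), female_ipm = (Ж * 10e+6)/nrow(females))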

g2_table_1 = g2_table %>%
    filter(М > 0 & Ж > 0) %>%
    group_by(logratio < 0) %>%
    top_n(10, abs(logratio)) %>%
    ungroup()

male_words = g2_table_1 %>% filter(`logratio < 0` == TRUE) %>% select(-`logratio < 0`) %>% arrange(logratio)
female_words = g2_table_1 %>% filter(`logratio < 0` == FALSE) %>%  select(-`logratio < 0`) %>% arrange(-logratio)

Top “Male” words by logratio:

formattable(male_words, 
            align =c("l","c","c","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

words	Ж	М	logratio	g2	male_ipm	female_ipm
ницше	1	7	-5.35	21.22	1759.236	45.9656
футбол	2	12	-5.13	35.32	3015.833	91.9312
коэльо	1	6	-5.13	17.66	1507.917	45.9656
формально	1	6	-5.13	17.66	1507.917	45.9656
методика	2	11	-5.01	31.79	2764.514	91.9312
биографический	2	10	-4.87	28.29	2513.194	91.9312
сорокин	2	10	-4.87	28.29	2513.194	91.9312
v	2	9	-4.72	24.83	2261.875	91.9312
владимир	2	9	-4.72	24.83	2261.875	91.9312
спецслужба	2	9	-4.72	24.83	2261.875	91.9312
субъект	2	9	-4.72	24.83	2261.875	91.9312

Taking the log ratio into consideration, we get slightly different results on gender-based word usage. I have already talked about football, business, Kafka (“Кафка”), “Тошнота” (Nausea) and Asimov.

Still, it is interesting to note how strongly the word “методика” (methodology) skews towards the male subgroup. This somewhat supports my previously stated hypothesis that men are more pragmatically minded and talk about the author’s style of writing, etc.

Finally, Lev Tolstoy is a little more popular among males than among females.

Top “Female” words by logratio:

formattable(female_words, 
            align =c("l","c","c","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

words	Ж	М	logratio	g2	male_ipm	female_ipm
кот	58	1	3.31	12.02	251.3194	2666.005
зацеплять	53	1	3.18	10.62	251.3194	2436.177
захотеться	51	1	3.12	10.07	251.3194	2344.246
горе	48	1	3.04	9.24	251.3194	2206.349
сквозь	48	1	3.04	9.24	251.3194	2206.349
жалко	46	1	2.98	8.69	251.3194	2114.418
умение	45	1	2.94	8.42	251.3194	2068.452
идеальный	87	2	2.90	16.03	502.6389	3999.007
смотря	42	1	2.84	7.61	251.3194	1930.555
поворот	82	2	2.81	14.69	502.6389	3769.179
задумка	41	1	2.81	7.34	251.3194	1884.590
уезжать	41	1	2.81	7.34	251.3194	1884.590

As for females, it is funny to find out that women are more likely to write about cats (or about stories that mention cats). Judging intuitively, the sentiment of these female words is a little more negative than that of the male list.

Women tend to write about “горе” (grief), “жалко” (a pity) and “уезжать” (to leave). It can be inferred that women are on average more likely to read emotionally intense books and, therefore, to leave more emotionally intense reviews. Or women are just more emotional in general, but identifying this is beyond the scope of what text analysis can do :)

Nice plot for logratio visual estimation

g2_table_1 %>%
    filter(М > 0 & Ж > 0) %>%
    group_by(logratio < 0) %>%
    top_n(10, abs(logratio)) %>%
    ungroup() %>%
    mutate(words = reorder(words, logratio)) %>%
    ggplot(aes(words, logratio, fill = logratio > 0)) +
    geom_col(show.legend = F, alpha = 0.5, color = "black") +
    coord_flip() +
  theme_minimal()+
    scale_fill_discrete(name = "", labels = c("Males", "Females"))+
  labs(title = "Distribution of words by gender
       ", x = "Words", y = "log2 ratio of relative frequencies (Females/Males)")+
  theme(text = element_text(size = 13), plot.title = element_text(hjust = 0.5), title = element_text(size = 13))

Visually, it can be noticed that the male words have larger log-ratio magnitudes, which means that the most male-skewed words are ones that women hardly use. In simple words, men use more distinctive words or mention more distinctive topics/authors in their reviews, based on our sample.
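One caveat when reading the plot: the log ratio splits into log2(Ж/М) for the word plus a constant offset, log2(male total / female total), and that offset is negative because the female corpus is much larger. The sketch below recovers the offset from two rows of the tables above; the helper offset_from is hypothetical and the result is only as precise as the rounded table values.

```r
# Rows copied from the two log-ratio tables: female count, male count, logratio
nietzsche <- c(f = 1,  m = 7, lr = -5.35)
cat_word  <- c(f = 58, m = 1, lr = 3.31)

# logratio = log2(f / m) + offset, where offset = log2(male total / female total)
offset_from <- function(row) unname(row["lr"] - log2(row["f"] / row["m"]))

offsets <- c(offset_from(nietzsche), offset_from(cat_word))
print(round(offsets, 2))
```

Both rows imply an offset of roughly -2.5. A word seen once among women and n times among men therefore lands about 5 log2 units further from zero than its mirror image, so part of the asymmetry in the plot may reflect the difference in corpus size rather than vocabulary alone.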

Conclusion

To sum up, I have uncovered some gender-based differences in the speech of males and females based on their book reviews.

First, males seem to be more methodologically oriented, whereas women are completely absorbed in the book’s content. So to speak, men are more likely to write about the author’s style, writing techniques and so on, while women generally concentrate on the plot of the book.

Second, female users are more emotionally engaged. In their reviews they often elaborate on love, romantic and family/friend relationships, and parenthood. As for specifically male topics, men are much more likely to write about football and business.

Third, counterintuitively, women on average tend to write shorter reviews than men.

Fourth, males mention more unique words and authors than females. Some authors are read exclusively by males in our sample, whereas no author is read exclusively by females.

Fifth, females are drama queens and like writing about cats! :)