1 Input:

1.1 Prepare data:

Firstly, we will prepare a lot packages needed in this chapter. If you don’t have, just simply install them by function install_package(“name_package”).

You can get this dataset from Kaggle in this link: top companies rating and reviews

The description of dataset contain multiple cols:

company: name of the company.
type: the company’s industry sector.
rating: number of stars rate on a 5-star scale.
reviewers: the number of votes.
age: seniority in the profession.
highly_ratedFOR: highly rating factors
critically_ratedFOR: critically rating areas

1.2 Fix data:

Remember always checking the missing value in the dataset before analyst. I suggest a lot of functions in R can highly solve this problem.

library(naniar)
gg_miss_var(df,show_pct = TRUE)

This package {naniar} create beautiful plot and easily using by one line of code.

In R, there are some specialized packages for handling missing data. You can consult this problem in the Flexible Imputation of Missing Data by Stef van Buuren. But in this chapter, I just simply don’t use col critically_ratedFOR and remove null value from extant cols.

2 Process:

2.1 Preliminary analysis:

In CNN model, missing value is not allowed so I will remove the missing value and remove ID and critically_ratedFOR.

Also, I will extract the number from strings in age and reviewers cols, remove duplicated rows and fix error.

df %>% select(-c(names(df[1]),
                       critically_ratedFOR))->df ##Remove the first col because it's ID col, R automatically  create ID for dataset.

df %>% drop_na()->df ##Remove missing value.

library(readr)
df %>% 
  mutate(reviewers = parse_number(reviewers),
         age = parse_number(age)) -> df  ##Extract the number from string.

df %>% distinct()->df ##Remove duplicated rows

Because in col age of data, there are a lot of 2024 values so I think it’s error from data input. I have researched from Google and replace value by 2024 minus the founded year.

##Replace the age of company based on real founded year in Google.
df$age[which(df$age == 2024)]<- c(2024 - c(1935,1882,1953,2015,1963,2011,1987))

Next, we should summarise the information of dataset to have a comprehensive perspective.

library(dplyr)
glimpse(df)

## Rows: 8,196
## Columns: 6
## $ company         <chr> "TCS", "Accenture", "Cognizant", "Wipro", "Capgemini",…
## $ type            <chr> "IT Services & Consulting", "IT Services & Consulting"…
## $ rating          <dbl> 3.8, 4.0, 3.9, 3.8, 3.9, 4.0, 4.0, 3.8, 3.7, 3.7, 3.9,…
## $ reviewers       <dbl> 73.8, 46.8, 42.2, 39.6, 34.3, 34.3, 33.8, 32.1, 28.9, …
## $ age             <dbl> 56, 35, 30, 79, 57, 30, 30, 43, 33, 38, 27, 31, 46, 41…
## $ highly_ratedFOR <chr> "Job Security, Work Life Balance", "Company Culture, S…

2.2 Correlation analysis:

Firstly, we should find out some information.

#How much company?
nrow(df)

## [1] 8196

#How many types of company?
length(unique(df$type))

## [1] 84

#How many types of businesses achieve 5 stars?
table(df %>% 
        group_by(type) %>% 
        summarise(count = sum(rating == 5)) %>% 
        filter(count > 0))

##                                 count
## type                             1 2
##   Architecture & Interior Design 0 1
##   Chemicals                      1 0
##   Emerging Technologies          1 0
##   Internet                       1 0
##   Software Product               1 0

In simple thinking, the older company will have higher rating.

library(ggplot2)
ggplot(data = df,
       mapping = aes(x = age,
                     y = rating))+
  geom_bin2d() +
  scale_fill_continuous(type = "viridis")+
  geom_smooth(method = "lm",col = "#0099f9", size = 1.5)+
  theme_bw()+
  labs(y = "The rating scale",
       x = "The age of company",
       title = "The correlation of age of company and rating scale")

## `geom_smooth()` using formula = 'y ~ x'

In this graph, we can see that the higher rate concentrate on the company with the number of operation years between 0 to 60-70 years. The older company from 200 - 400 only struggling from 3 to 4.5 star

Base on this insight, I want to classify these company into 4 class:

Young if < 20 operation year.
Middle if < 50 operation year.
Old if < 100 operation year.
Too old if > 100 operation year.

df$group<-ifelse(df$age < 20,"Young",ifelse(df$age < 50,"Middle",ifelse(df$age < 100,"Old","Too old")))

Next I will use the group to show the density of rating between group.

According to the graph below, we have some insight:

The rating higher 4.5 star and 2 to 3 star belongs to majority of young age company. So we can think that young company is just contained 2 types: bad or good.
The middle age company seem to be equal rating in all star.
The old and too old age company has better performance because the rating fluctuate between 3 to 4.7 but cannot close perfectly to 5 star.

library(hrbrthemes)

## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

library(viridis)

## Loading required package: viridisLite

ggplot(data=df, 
       aes(x = rating,
           group = group,
           fill= group)) +
    geom_density(adjust=1.5, position="fill") +
    scale_fill_viridis(discrete = TRUE, alpha = 0.8) +
    theme_ipsum()+
    labs(x = "Rating scale",
         y = "")

The rating perhaps is affected by the number of reviewers. I will use jitter plot to illustrate the distribution of number of reviewers.

ggplot(data = df,
       aes(x = fct_relevel(group,c("Young","Middle","Old","Too old")), 
           y = reviewers, 
           fill = group)) +
    geom_violin() +
    scale_fill_viridis(discrete = TRUE, alpha=0.8) +
    theme_light() +
    theme(
      legend.position="none",
      plot.title = element_text(size=14,
                                face="bold"),
      plot.subtitle = element_text(size = 11,
                                   face = "italic")
    ) +
    labs(title = "A boxplot of age company",
         x = "",
         y = "The number of reviewers",
         subtitle = "(Unit = thousand)")

We also test the significant difference between group by anova method.

The result show the diff in the number of reviewers based on the group age of companies. The p-value < 0.05 meaning the difference is significant. Beside, the argument conf.level = .95 extract the additional 95% confidence interval of the difference mean value.

TukeyHSD(aov(reviewers~group,data = df),conf.level=.95)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = reviewers ~ group, data = df)
## 
## $group
##                      diff       lwr         upr     p adj
## Old-Middle       4.932991 -10.35534  20.2213190 0.8406748
## Too old-Middle  40.015276  16.75019  63.2803611 0.0000592
## Young-Middle   -10.368203 -22.43697   1.7005672 0.1212285
## Too old-Old     35.082284   9.44467  60.7198990 0.0024837
## Young-Old      -15.301194 -31.47767   0.8752794 0.0715152
## Young-Too old  -50.383479 -74.24159 -26.5253664 0.0000004

2.3 Modeling:

2.3.1 Fixing data:

Move to the modeling part, we need to convert the data to suitable form to analysis.

Faced to text class, we have to get the insight from the answer of interviewer. Insight can be identified by your inference or usually from the keywords you take in the sentence or paraphrase.

In R having some packages deal with these problems such as: {tidytext}, {textrecipes},… But in this chapter, I will use package {tidytext} due to simple operation in coding.

library(tidytext)

## Warning: package 'tidytext' was built under R version 4.2.3

library(janeaustenr)

## Warning: package 'janeaustenr' was built under R version 4.2.3

#We should convert to new object because if there're errors happen, you can code again with older object: 
##unnest_tokens will create new col contain words from selected cols. (In this example, highly_ratedFOR is old col, text is new col)

df1<-df %>%
  unnest_tokens(text,
                highly_ratedFOR,
                token = "regex",
                pattern = ", ")

Because the dataset have taken keyword previously so we don’t have to fix. If you deal with raw sentences or paraphrase, you should remove stop word - are words that are not useful for an analysis such as “the”, “of”, “to”, and so forth in English.

In R, we have a dataset contain all stop word in English. You can import by code line below: {data(stop_words)}

2.3.2 Description analysis:

The barchart show the total number of critical factor in the study.

Based on these critical factor, we will buid the simple regression model with dependent variable is rating and dependent variable is text.

library(ggplot2)

df1 %>%
  count(text, sort = TRUE) %>%
  mutate(text = reorder(text, n)) %>%
  ggplot(aes(n, text)) +
  geom_col() +
  labs(y = NULL)

I use package jtools because beautiful plot and can be adjusted easily. Since jtools seems to use ggplot2 so you can just change what like you would if you were using ggplot2.

The result show that employee have higher attention in factor related to impacting intangible factors that have uplifting meanings about personal spirituality and surrounding life than tangible factors likes: salary, promotion, work satisfaction.

library(jtools)
reg<-lm(rating ~ as.factor(text),data = df1)

effect_plot(reg, 
            pred = text,
            interval = TRUE,
            cat.geom = "point",
            colors = "lightblue",
            data = df1)+
  labs(y = "Rating scale",
       x = "")+
  coord_flip()

2.3.3 Building model:

I want to build the deep learning model to classify the critical factor affect the rating operation.

First step, I will create training and validation dataset

set.seed(831)
library(caret)
samp <- createDataPartition(df1$rating, p = .70, list = FALSE)
train = df1[samp, ] 
test = df1[-samp, ]

Next, I need to prepare data suitable for analysis due to keras package’s requirement input in R.

library(keras)
num_words = 500
max_length = 20

text_vectorization <- layer_text_vectorization(
  max_tokens = num_words, 
  output_sequence_length = max_length, 
)

##apply
text_vectorization %>% 
  adapt(df1$text)

##Convert due to dummy variable:
y_train<-train$rating - 1
y_train <- to_categorical(y = y_train, num_classes = 5)
y_test<-test$rating - 1
y_test <- to_categorical(y = y_test, num_classes = 5)

library(keras)
embedding_size = 32

input <- layer_input(shape = c(1), dtype = "string")

output <- input %>% 
  text_vectorization() %>%
  layer_embedding(input_dim = num_words + 1, 
                  output_dim = embedding_size) %>%
  bidirectional(layer_lstm(units = embedding_size,
             dropout = 0.2, 
             recurrent_dropout = 0.2)) %>%
  layer_dense(units = embedding_size, 
              activation = "relu") %>%
  layer_dense(units = 5, 
              activation = "softmax")

rnn_model <- keras_model(input, output)

rnn_model %>% compile(
  optimizer = 'adam',
  loss = 'categorical_crossentropy',
  metrics = list('accuracy')
)

2.4 Final results:

2.4.1 Evaluating the model:

Comparing the training and validating data, we notice that no big difference in result happen from epoch 2 go on. So the next runtime, we can decrease the epoch to 5 except 10.

history <- rnn_model %>% fit(
  train$text,
  y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.1,
  verbose=2
)

## Epoch 1/10
## 416/416 - 16s - loss: 0.7374 - accuracy: 0.6312 - val_loss: 0.7511 - val_accuracy: 0.6192 - 16s/epoch - 39ms/step
## Epoch 2/10
## 416/416 - 10s - loss: 0.6717 - accuracy: 0.6658 - val_loss: 0.7473 - val_accuracy: 0.6192 - 10s/epoch - 24ms/step
## Epoch 3/10
## 416/416 - 10s - loss: 0.6701 - accuracy: 0.6637 - val_loss: 0.7567 - val_accuracy: 0.6192 - 10s/epoch - 24ms/step
## Epoch 4/10
## 416/416 - 11s - loss: 0.6704 - accuracy: 0.6646 - val_loss: 0.7526 - val_accuracy: 0.6308 - 11s/epoch - 26ms/step
## Epoch 5/10
## 416/416 - 11s - loss: 0.6695 - accuracy: 0.6665 - val_loss: 0.7497 - val_accuracy: 0.6192 - 11s/epoch - 25ms/step
## Epoch 6/10
## 416/416 - 11s - loss: 0.6679 - accuracy: 0.6700 - val_loss: 0.7830 - val_accuracy: 0.6308 - 11s/epoch - 25ms/step
## Epoch 7/10
## 416/416 - 11s - loss: 0.6698 - accuracy: 0.6646 - val_loss: 0.7625 - val_accuracy: 0.6308 - 11s/epoch - 25ms/step
## Epoch 8/10
## 416/416 - 11s - loss: 0.6698 - accuracy: 0.6614 - val_loss: 0.7478 - val_accuracy: 0.6192 - 11s/epoch - 25ms/step
## Epoch 9/10
## 416/416 - 11s - loss: 0.6688 - accuracy: 0.6624 - val_loss: 0.7524 - val_accuracy: 0.6247 - 11s/epoch - 25ms/step
## Epoch 10/10
## 416/416 - 11s - loss: 0.6683 - accuracy: 0.6694 - val_loss: 0.7595 - val_accuracy: 0.6308 - 11s/epoch - 25ms/step

plot(history)

2.4.2 Predicting by using the model:

Finally is predicting and checking the performance by these code.

## Predict value base on the value from validation data:
result<-predict(rnn_model,test$text)

## 198/198 - 1s - 1s/epoch - 7ms/step

head(result)

##              [,1]        [,2]      [,3]      [,4]         [,5]
## [1,] 4.308903e-06 0.006127234 0.2623272 0.7312081 0.0003330695
## [2,] 1.228825e-05 0.014507041 0.4321555 0.5527015 0.0006237051
## [3,] 1.228825e-05 0.014507041 0.4321555 0.5527015 0.0006237051
## [4,] 1.460161e-06 0.002477859 0.2706771 0.7266413 0.0002023376
## [5,] 1.228825e-05 0.014507041 0.4321555 0.5527015 0.0006237051
## [6,] 1.460161e-06 0.002477859 0.2706771 0.7266413 0.0002023376

2.5 References:

I hope you my presentation helping you in your practice. And also, thank you to all author who wrote the documents below.

Supervised Machine Learning for Text Analysis in R by EMIL HVITFELDT AND JULIA SILGE.
Deep Learning with Text Data by Chelsey Hill.

Keras_model

Lộc

2024-04-16