Introduction

LeetCode questions seem to be an unavoidable topic in the tech interview process, and LeetCode is among the most recognizable coding websites for practicing data structure and algorithm questions.

LeetCode is arguably among the most effective tools that applicants use to improve at interview-style algorithm problems. The goal of this project is not to illustrate how to solve specific problems involving hash tables, dynamic programming, depth-first search or greedy solutions, but to help anyone who is interested know where to start, and to describe some characteristics of LeetCode questions as a whole.

Research questions

The questions identified so far for this dataset are:

-How is the difficulty of a question correlated with its total submitted attempts?

-How does the acceptance rate, which we have added as a new column, correlate with the difficulty of the questions?

-Are all hard questions more difficult than all of the medium and easy questions?

-How does the solution type (video/written) correlate with the premium/paid subscription service?

-Are more of the medium questions locked behind the premium service than the easy ones?

Data:

Read in and clean the dataset leetcode.csv, which will be used in the analysis

Load the packages (including readr) and assign the csv dataset to a variable

  library(readr)
  library(tidyverse)
  library(tidyr) 
  library(rstudioapi)
  library(ggplot2)
  library(dplyr)
  library(plotrix)
  library(reshape2)

# Read the dataset from leetcode.csv in the current working directory
leetcode <- read_csv("leetcode.csv")
# Creating a dataframe that will be useful in the future analysis
newData <- data.frame(leetcode)
view(leetcode)
# Preview and get the column names of the dataset
head(leetcode)

# A tibble: 6 × 8
  question_id video title      link  `total Accepted` `total Submitt…`
        <dbl> <lgl> <chr>      <chr>            <dbl>            <dbl>
1        1959 NA    Minimum P… http…              262              434
2        1949 NA    Implement… http…              691             1126
3        1947 NA    Number of… http…             1976             7740
4        1946 NA    Minimum A… http…             5411            13595
5        1945 NA    Finding t… http…             5890             7460
6        1944 NA    Truncate … http…             7056             8979
# … with 2 more variables: difficulty <dbl>, isPaid <lgl>

colnames(leetcode)

[1] "question_id"     "video"           "title"          
[4] "link"            "total Accepted"  "total Submitted"
[7] "difficulty"      "isPaid"

Identify the variables in the dataset

As shown by colnames() above, we have listed all the variables

Variable type classification:

String/char type : title, link, trueDifficulty

title definition: the title of the leetcode question as shown on the official website

link definition: official link to the specific question

trueDifficulty definition: here we transform 1 to easy difficulty, 2 to medium difficulty, 3 to hard difficulty. Usually, the higher the difficulty, the more time the user is likely to spend

Numeric type : question_id, total Accepted, total Submitted, difficulty, rate

question_id definition: the number which corresponds to each question

total Accepted definition: number of submitted solutions that were accepted, i.e. they compile and run correctly

total Submitted definition: number of solutions/coding attempts submitted to the website, including wrong ones

difficulty definition: in general, this represents the learning curve and how challenging a typical question could be

rate definition: total Accepted divided by total Submitted, i.e. the fraction of submitted solutions that have been accepted by the server (for example, 262 / 434 ≈ 0.60 for the first question in the preview)

Logical type : video, isPaid

video definition: TRUE means the solution has video format, FALSE means written, NA means none

isPaid definition: TRUE means the question requires premium membership, FALSE means it does not
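The mapping from the numeric difficulty code to a readable label, as described for trueDifficulty above, could be sketched as follows. This is only an illustration on a toy vector; the lookup approach and the label spellings are assumptions, since the original does not show how trueDifficulty was built.

```r
# Toy vector: difficulty codes for the first six questions in the preview
difficulty <- c(2, 2, 3, 2, 2, 1)

# Assumed labels, indexed by the numeric code (1 = Easy, 2 = Medium, 3 = Hard)
difficulty_labels <- c("Easy", "Medium", "Hard")

# Hypothetical trueDifficulty column: a character vector of labels
trueDifficulty <- difficulty_labels[difficulty]

print(trueDifficulty)
print(table(trueDifficulty))
```

On the real dataset the same lookup could be applied inside mutate() to add the label as a new column.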

# Use filter() to list all the questions without a solution (video is NA)
unsolvedQuestions <- filter(leetcode, is.na(video))
head(unsolvedQuestions)

# A tibble: 6 × 8
  question_id video title      link  `total Accepted` `total Submitt…`
        <dbl> <lgl> <chr>      <chr>            <dbl>            <dbl>
1        1959 NA    Minimum P… http…              262              434
2        1949 NA    Implement… http…              691             1126
3        1947 NA    Number of… http…             1976             7740
4        1946 NA    Minimum A… http…             5411            13595
5        1945 NA    Finding t… http…             5890             7460
6        1944 NA    Truncate … http…             7056             8979
# … with 2 more variables: difficulty <dbl>, isPaid <lgl>

Data Wrangling and cleaning

Preview the whole dataset before any operations. We first find the dimensions of leetcode.csv (which we have imported as a data frame) by using the R function dim(.), and then create a table containing the count of the different data types in this data frame by nesting a type-reporting function within table(.).

Then we continue exploring the columns by using the function glimpse(.), which behaves like a transposed version of print: it lists every column with its type and its first few values in a compact form, which makes it easy to scan for specific elements.
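A minimal sketch of the dimension and data-type count described above, on a toy data frame. The use of sapply(., class) inside table(.) is an assumption, since the original does not name the nested function, and toy is a hypothetical stand-in for leetcode.

```r
# Toy data frame with the same column types as a few leetcode columns
toy <- data.frame(
  question_id = c(1959, 1949, 1947),
  video       = c(NA, NA, NA),
  title       = c("Minimum Path Cost in a Hidden Grid",
                  "Implement Trie II (Prefix Tree)",
                  "Number of Different Subsequences GCDs"),
  difficulty  = c(2, 2, 3),
  isPaid      = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Number of rows and columns
dims <- dim(toy)

# Count how many columns there are of each class (assumed approach)
type_counts <- table(sapply(toy, class))

print(dims)
print(type_counts)
```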

Checking missing values

We used is.na(.) and colSums(.) to check missing values column by column. is.na(.) tests whether each value passed to it is NA (missing), not NULL, and it is nested inside colSums(.), which then gives us the NA count for each column in leetcode.

dim(leetcode)

[1] 1650    8

head(leetcode, n = 1650)

# A tibble: 1,650 × 8
   question_id video title     link  `total Accepted` `total Submitt…`
         <dbl> <lgl> <chr>     <chr>            <dbl>            <dbl>
 1        1959 NA    Minimum … http…              262              434
 2        1949 NA    Implemen… http…              691             1126
 3        1947 NA    Number o… http…             1976             7740
 4        1946 NA    Minimum … http…             5411            13595
 5        1945 NA    Finding … http…             5890             7460
 6        1944 NA    Truncate… http…             7056             8979
 7        1943 NA    Count Pa… http…              584              850
 8        1937 NA    Maximize… http…              571              819
 9        1936 NA    Maximize… http…             2958            10959
10        1935 NA    Minimum … http…             7134            10125
# … with 1,640 more rows, and 2 more variables: difficulty <dbl>,
#   isPaid <lgl>

glimpse(leetcode)

Rows: 1,650
Columns: 8
$ question_id       <dbl> 1959, 1949, 1947, 1946, 1945, 1944, 1943, …
$ video             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ title             <chr> "Minimum Path Cost in a Hidden Grid", "Imp…
$ link              <chr> "https://leetcode.com/problems/minimum-pat…
$ `total Accepted`  <dbl> 262, 691, 1976, 5411, 5890, 7056, 584, 571…
$ `total Submitted` <dbl> 434, 1126, 7740, 13595, 7460, 8979, 850, 8…
$ difficulty        <dbl> 2, 2, 3, 2, 2, 1, 2, 3, 3, 2, 2, 1, 2, 2, …
$ isPaid            <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TR…

# Count the missing values (NA) in each column
colSums(is.na(leetcode))

    question_id           video           title            link 
              0             701               0               0 
 total Accepted total Submitted      difficulty          isPaid 
              0               0               0               0

We found that leetcode.csv has 1650 rows and 8 columns; after we clean and mutate the dataset, it has a few more columns. We then view all the rows and start handling the missing values.

We replace the NA values in video with “No solution”; since FALSE and TRUE are already recorded, we rename TRUE as “Video” and FALSE as “Written”. This aims to improve the readability of leetcode so that a new reader can understand the elements better. The R function replace_na(.), which takes the dataset and a list of per-column replacement values as input arguments, is one way to fill in missing values.

We can observe that there is a lot of missing data in the video column (701 of 1650 rows) that needs to be cleaned before the actual analysis. We should keep these rows and label them rather than drop them from the analysis.

Usually the data cleaning process includes: renaming columns with more meaningful names; replacing or getting rid of missing values by mutating the dataset.

#replace_na(leetcode, list(video = "No solution"))
# Calling str() function to confirm
str(leetcode)

spec_tbl_df [1,650 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ question_id    : num [1:1650] 1959 1949 1947 1946 1945 ...
 $ video          : logi [1:1650] NA NA NA NA NA NA ...
 $ title          : chr [1:1650] "Minimum Path Cost in a Hidden Grid" "Implement Trie II (Prefix Tree)" "Number of Different Subsequences GCDs" "Minimum Absolute Sum Difference" ...
 $ link           : chr [1:1650] "https://leetcode.com/problems/minimum-path-cost-in-a-hidden-grid" "https://leetcode.com/problems/implement-trie-ii-prefix-tree" "https://leetcode.com/problems/number-of-different-subsequences-gcds" "https://leetcode.com/problems/minimum-absolute-sum-difference" ...
 $ total Accepted : num [1:1650] 262 691 1976 5411 5890 ...
 $ total Submitted: num [1:1650] 434 1126 7740 13595 7460 ...
 $ difficulty     : num [1:1650] 2 2 3 2 2 1 2 3 3 2 ...
 $ isPaid         : logi [1:1650] TRUE TRUE FALSE FALSE FALSE FALSE ...
 - attr(*, "spec")=
  .. cols(
  ..   question_id = col_double(),
  ..   video = col_logical(),
  ..   title = col_character(),
  ..   link = col_character(),
  ..   `total Accepted` = col_double(),
  ..   `total Submitted` = col_double(),
  ..   difficulty = col_double(),
  ..   isPaid = col_logical()
  .. )
 - attr(*, "problems")=<externalptr>

leetcode <- leetcode %>%
  mutate(solution = case_when(
    video == TRUE ~ "Video",
    video == FALSE ~ "Written",
    # video == NA is never TRUE, so missing values are matched with is.na()
    is.na(video) ~ "No solution"))

By now, leetcode has a new column, solution, that contains no missing values, and the cleaning process is done.
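A short check of how missing values behave when labelling a logical column, on a toy tibble (toy is a hypothetical stand-in for leetcode). Comparing a value with NA using == returns NA rather than TRUE, so is.na() is used to catch the missing rows.

```r
library(dplyr)

# Toy column with one value of each kind found in video
toy <- tibble(video = c(TRUE, FALSE, NA))

# Comparing with NA yields NA for every element, never TRUE
na_comparison <- toy$video == NA
print(na_comparison)

# Label each row; is.na() matches the missing value
toy <- toy %>%
  mutate(solution = case_when(
    video == TRUE  ~ "Video",
    video == FALSE ~ "Written",
    is.na(video)   ~ "No solution"
  ))
print(toy)

# Confirm that the new column has no missing values
print(sum(is.na(toy$solution)))
```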

# Define the column rate as total Accepted divided by total Submitted
totalAccepted <- select(leetcode, `total Accepted`)
totalSubmitted <- select(leetcode, `total Submitted`)
# Add a new column named rate to the leetcode dataset
leetcode <- mutate(leetcode, rate = `total Accepted` / `total Submitted`)
# Preview the dataset to check that the rate column looks right
head(leetcode, n = 1650)

# A tibble: 1,650 × 10
   question_id video title     link  `total Accepted` `total Submitt…`
         <dbl> <lgl> <chr>     <chr>            <dbl>            <dbl>
 1        1959 NA    Minimum … http…              262              434
 2        1949 NA    Implemen… http…              691             1126
 3        1947 NA    Number o… http…             1976             7740
 4        1946 NA    Minimum … http…             5411            13595
 5        1945 NA    Finding … http…             5890             7460
 6        1944 NA    Truncate… http…             7056             8979
 7        1943 NA    Count Pa… http…              584              850
 8        1937 NA    Maximize… http…              571              819
 9        1936 NA    Maximize… http…             2958            10959
10        1935 NA    Minimum … http…             7134            10125
# … with 1,640 more rows, and 4 more variables: difficulty <dbl>,
#   isPaid <lgl>, solution <chr>, rate <dbl>

By now the data set contains 10 columns with categorical, logical and double type values.

Compute descriptive statistics

For each variable we compute descriptive statistics: mean, median, and standard deviation for numerical variables, and frequencies for categorical variables. This gives us some understanding of the data, including the relationships between variables.

We group the dataset by the numeric level of difficulty and apply aggregate functions, such as the average submitted count, to each difficulty.

The computations below give the minimum and maximum of total Accepted in the leetcode dataset.

# Compute the minimum and maximum for the accepted solutions
summarize(leetcode, min.totalAccepted = min(`total Accepted`, na.rm = TRUE))

# A tibble: 1 × 1
  min.totalAccepted
              <dbl>
1               262

summarize(leetcode, max.totalAccepted = max(`total Accepted`, na.rm = TRUE))

# A tibble: 1 × 1
  max.totalAccepted
              <dbl>
1           4058693

The computations below give the median and interquartile range of total Accepted in the leetcode dataset.

# Compute the median and IQR for the accepted solutions
summarize(leetcode, median.totalAccepted = median(`total Accepted`, na.rm = TRUE))

# A tibble: 1 × 1
  median.totalAccepted
                 <dbl>
1               43560.

summarize(leetcode, IQR.totalAccepted = IQR(`total Accepted`, na.rm = TRUE))

# A tibble: 1 × 1
  IQR.totalAccepted
              <dbl>
1            112721

The computations below give the mean and standard deviation for the leetcode dataset. mean.totalAcceptedRate is the average rate for each difficulty, calculated by dividing the mean accepted attempts by the mean submitted attempts.

# Compute the mean for the accepted solutions and submitted solutions
summarize(leetcode, mean.totalAccepted = mean(`total Accepted`, na.rm = TRUE))

# A tibble: 1 × 1
  mean.totalAccepted
               <dbl>
1            128905.

summarize(leetcode, mean.totalSubmitted = mean(`total Submitted`, na.rm = TRUE))

# A tibble: 1 × 1
  mean.totalSubmitted
                <dbl>
1             280704.

# Compute the average difficulty: about 1.98, which is a bit below medium (2)
summarize(leetcode, mean.difficulty = mean(difficulty, na.rm = TRUE))

# A tibble: 1 × 1
  mean.difficulty
            <dbl>
1            1.98

# Compute the sd for the accepted solutions
summarize(leetcode, sd.totalAccepted = sd(`total Accepted`, na.rm = TRUE))

# A tibble: 1 × 1
  sd.totalAccepted
             <dbl>
1          233157.

leetcode %>%
  group_by(difficulty) %>%
  summarize(
    mean.totoalSubmittedDifficult = mean(`total Submitted`),
    mean.totalAcceptedDifficult = mean(`total Accepted`),
    mean.totalAcceptedRate = mean.totalAcceptedDifficult / mean.totoalSubmittedDifficult)

# A tibble: 3 × 4
  difficulty mean.totoalSubmittedDi… mean.totalAccep… mean.totalAccep…
       <dbl>                   <dbl>            <dbl>            <dbl>
1          1                 418797.          213609.            0.510
2          2                 269992.          119730.            0.443
3          3                 155624.           58597.            0.377

From the above, we can conclude that hard questions have a lower average acceptance ratio than easy and medium ones: the ratio decreases as the difficulty increases (0.510, 0.443, 0.377). In addition, the harder the difficulty is, the fewer submitted and accepted attempts a question receives on average.
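The ratio of the two means used above is not the same quantity as the mean of the per-question rates. A small sketch of the difference, using the first four rows from the dataset preview; the column names accepted and submitted are shortened here for illustration only.

```r
library(dplyr)

# First four questions from the preview: accepted, submitted, difficulty
toy <- tibble(
  accepted   = c(262, 691, 1976, 5411),
  submitted  = c(434, 1126, 7740, 13595),
  difficulty = c(2, 2, 3, 2)
)

by_difficulty <- toy %>%
  group_by(difficulty) %>%
  summarize(
    # Ratio of means: questions with many attempts weigh more
    ratio_of_means = mean(accepted) / mean(submitted),
    # Mean of rates: every question weighs the same
    mean_of_rates  = mean(accepted / submitted)
  )

print(by_difficulty)
```

The two columns agree only when a group has a single question or all its questions have the same rate, so it is worth stating which of the two is being reported.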

Compute a frequency table of leetcode$difficulty

This shows the number of questions at each difficulty level, followed by the corresponding proportions.

frequencyDifficulty <- leetcode %>%
 select(difficulty) %>%
 table()
frequencyDifficulty

difficulty
  1   2   3 
397 891 362

frequencyDifficulty %>%
 prop.table()

difficulty
        1         2         3 
0.2406061 0.5400000 0.2193939

Visualization

By now, we can start the visualization using ggplot2, with both univariate and bivariate plots, and explain these graphs in terms of the variables, questions and conclusions. We will identify limitations of each visualization, if any.

Visualize a histogram showing the accepted rate of all questions on the x-axis and the density on the y-axis

Visualization #1: leetcode rate distribution and density

ggplot(leetcode, aes(rate)) + 
  ggtitle("All leetcode question rate distribution") + 
  geom_histogram(aes(y = ..density..), alpha = 0.5) +
  # Reference lines: mean (black), median (green), 25th (orange) and 75th (yellow) percentiles
  geom_vline(aes(xintercept=mean(rate)), color="black", linetype="dotdash", size=1) +
  geom_vline(aes(xintercept=median(rate)), color="green", linetype="dashed", size=1) + 
  geom_vline(aes(xintercept=quantile(rate, 0.25)), color="orange", linetype="solid", size=1) +
  geom_vline(aes(xintercept=quantile(rate, 0.75)), color="yellow", linetype="longdash", size=1) +
  # Overlay the density curve; the "+" on the previous line keeps this layer in the plot
  geom_density(alpha = 0.01, fill="red")

The plot gives us information on the overall distribution. We could say that most of the questions have a fair difficulty and that the overall acceptance rate looks roughly bell-shaped, but the plot does not show whether all hard questions are harder than all medium ones. The mean, median, 25th percentile and 75th percentile are all marked on the graph to reflect the overall status.

Hence, we need another approach: first sort the questions in descending order to find those with the highest accepted rate.
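The descending sort, and a direct check of whether every hard question has a lower rate than every medium one, could be sketched as below. It uses the first ten rows shown in the dataset preview and treats the acceptance rate as a proxy for how hard a question is, which is an assumption; the column names accepted and submitted are shortened for illustration.

```r
library(dplyr)

# First ten questions from the preview, with difficulty taken from glimpse()
toy <- tibble(
  question_id = c(1959, 1949, 1947, 1946, 1945, 1944, 1943, 1937, 1936, 1935),
  accepted    = c(262, 691, 1976, 5411, 5890, 7056, 584, 571, 2958, 7134),
  submitted   = c(434, 1126, 7740, 13595, 7460, 8979, 850, 819, 10959, 10125),
  difficulty  = c(2, 2, 3, 2, 2, 1, 2, 3, 3, 2)
) %>%
  mutate(rate = accepted / submitted)

# Questions ordered from the highest to the lowest acceptance rate
ranked <- toy %>% arrange(desc(rate))
print(ranked)

# If every hard question were harder than every medium one, the highest
# hard rate would be below the lowest medium rate
max_hard_rate   <- max(toy$rate[toy$difficulty == 3])
min_medium_rate <- min(toy$rate[toy$difficulty == 2])
all_hard_harder <- max_hard_rate < min_medium_rate
print(all_hard_harder)
```

In this small sample the check is FALSE, because at least one hard question has a higher rate than a medium one; the same comparison can be run on the full dataset.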

Visualization #2: Question count/frequency for each difficulty

# Visualize the bar graph showing the question difficulty as x-axis and the number of questions as y-axis
# difficulty is numeric, so it is converted to a factor for the discrete x scale
ggplot(leetcode, aes(factor(difficulty))) + geom_bar(color="grey", fill="black") +
  labs(title = "difficulty frequency", x = "difficulty") +
  scale_x_discrete(labels=c("Easy", "Medium", "Hard")) +
  theme_bw()

The graph displays the frequency of each question difficulty level across the entire leetcode dataset. Of the 3 difficulty levels, about 55% of the questions are medium, 12% are easy, and 33% are hard.
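The shares quoted above could be computed along the following lines. This is only a minimal sketch: `toy` is a hypothetical stand-in for the leetcode data with invented values, and the name `difficulty_share` is introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for leetcode; values are invented.
toy <- data.frame(difficulty = c(1, 2, 2, 3, 2, 1, 3, 2))

difficulty_share <- toy %>%
  count(difficulty) %>%        # number of questions per difficulty level
  mutate(share = n / sum(n))   # proportion of all questions

print(difficulty_share)
```

On the real data, the same pipeline would presumably be applied to `leetcode` instead of `toy`.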

Visualization #3: Number of leetcode premium questions

Visualize a bar graph with isPaid on the x-axis and the number of questions on the y-axis.

ggplot(leetcode, aes(`isPaid`)) + geom_bar(color="grey", fill="grey") +
labs(title = "Premium sub and question counts") +
theme_minimal()

The graph displays the number of premium questions and non-premium ones.

Visualization #4: Combine the two bar charts

# make the plot
ggplot(leetcode, aes(x = `difficulty`, y = `total Submitted` + `total Accepted`, fill = isPaid)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_manual("isPaid\n", values = c("red", "blue"),
                    labels = c(" unpaid", " paid")) +
  labs(x = "\ndifficulty", y = "totalAttempts\n",
       title = "Total Attempts for each difficulty group related to premium service") +
  theme_bw(base_size = 14)

The graph displays, for each difficulty category, the totals for questions classified as premium and for those classified as non-premium. We found that as the difficulty level increases from 1 to 3, a higher proportion of questions is included in the premium service. From this graph alone, it is hard to determine whether the relationship is linear, even though the trend holds both from easy to medium and from medium to hard.

Among medium questions, premium questions account for a larger percentage of submitted and accepted attempts than among easy and hard questions. Easy questions have the smallest percentage of premium questions.
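The proportion of premium questions within each difficulty level can be checked directly rather than read off the chart. The sketch below assumes that `isPaid` is a logical column; `toy` and `paid_share` are hypothetical names, and the values are invented.

```r
library(dplyr)

# Hypothetical toy data; isPaid is assumed to be logical (TRUE = premium).
toy <- data.frame(
  difficulty = c(1, 1, 1, 2, 2, 2, 3, 3),
  isPaid     = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

paid_share <- toy %>%
  group_by(difficulty) %>%
  summarize(
    n_questions = n(),           # questions at this difficulty level
    n_paid      = sum(isPaid),   # premium questions among them
    paid_share  = mean(isPaid)   # proportion that is premium
  )

print(paid_share)
```

Comparing `paid_share` across the three levels would show whether the premium proportion really rises with difficulty.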

Visualization #5: Difficulty vs the average submission count sd

Visualize a bar graph with question difficulty on the x-axis and the average submission count on the y-axis, and add standard error bars to show the uncertainty around the estimate.

leetcode %>%
  group_by(difficulty) %>%
  summarize(mean.totoalSubmittedDifficult=mean(`total Submitted`), mean.totalAcceptedDifficult=mean(`total Accepted`), mean.totalAcceptedRate=(`mean.totalAcceptedDifficult`/`mean.totoalSubmittedDifficult`), sd.totalSubmitted=sd(`total Submitted`))

# A tibble: 3 × 5
  difficulty mean.totoalSubmittedDi… mean.totalAccep… mean.totalAccep…
       <dbl>                   <dbl>            <dbl>            <dbl>
1          1                 418797.          213609.            0.510
2          2                 269992.          119730.            0.443
3          3                 155624.           58597.            0.377
# … with 1 more variable: sd.totalSubmitted <dbl>

ggplot(leetcode, aes(x = `difficulty`, y = `total Submitted`)) +
    geom_bar(stat="identity", fill="blue", position=position_dodge()) +
    geom_errorbar( aes(ymin=`total Submitted`-533771, ymax=`total Submitted`+533771), colour="blue", position=position_dodge(.9)) +
  labs(title = "Difficulty vs the average submission count sd")

The graph plots question difficulty against the average total submitted. In addition, we added error bars to the bar chart, using the standard deviation value obtained previously as the input. The relationship between question difficulty and total submitted attempts appears fairly linear across the three difficulty levels.
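For reference, a standard error is usually taken as the standard deviation divided by the square root of the group size, computed separately for each group. The sketch below shows one way to do that and to plot the group means with those bars. It is an illustration on hypothetical toy data (the column `total_submitted` stands in for `total Submitted`), not the method used for the chart above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; values are invented for illustration.
toy <- data.frame(
  difficulty      = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  total_submitted = c(100, 200, 300, 80, 120, 160, 30, 50, 70)
)

submitted_summary <- toy %>%
  group_by(difficulty) %>%
  summarize(
    mean_submitted = mean(total_submitted),
    se_submitted   = sd(total_submitted) / sqrt(n())  # standard error of the mean
  )

# One bar per difficulty level, with mean +/- one standard error.
p <- ggplot(submitted_summary, aes(x = factor(difficulty), y = mean_submitted)) +
  geom_col(fill = "blue") +
  geom_errorbar(aes(ymin = mean_submitted - se_submitted,
                    ymax = mean_submitted + se_submitted),
                width = 0.2) +
  labs(x = "difficulty", y = "mean total submitted")
```

Summarizing first means each bar shows a mean rather than a sum over all questions, and each error bar uses the spread of its own group.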

Visualization #6: Difficulty versus TotalSubmittedAttempts boxplot

Replace the numeric values in difficulty by recoding leetcode, then make a boxplot. The result should be separate box plots for each value of difficulty and paid subscription.

Visualize a boxplot with question difficulty on the x-axis and the total submitted count for each question on the y-axis.

leetcode <- mutate(leetcode, trueDifficulty = recode(difficulty, `1` = "easy", `2` = "medium", `3` = "difficult"))

ggplot(leetcode, aes(x = `trueDifficulty`, y = `total Submitted`, fill = isPaid)) +
  geom_boxplot() +
  labs(title = "Difficulty vs. TotalSubmittedAttempts", y = "total submitted attempts", x = "difficulty level") +
  facet_wrap(~ isPaid)

The graph plots question difficulty against the total submitted count. For non-premium questions, which are open to users without a paid subscription, there is a clear correlation between difficulty level and total submitted attempts. The pattern is less obvious for premium questions, where the highest outliers belong to the medium difficulty instead of easy.

Visualization #7: Difficulty versus TotalSubmittedAttempts violin

Instead of a boxplot, we make a violin graph this time. The result should be separate violin plots for each value of difficulty and solution.

Visualize a violin graph with question difficulty on the x-axis and the total submitted count for each question on the y-axis, grouped by solution.

ggplot(leetcode, aes(x = `trueDifficulty`, y = `total Submitted`, fill = solution)) +
  geom_violin() +
  labs(title = "Difficulty vs. TotalSubmittedAttempts", y = "total submitted attempts", x = "difficulty level") +
  facet_wrap(~ solution)

The graph plots question difficulty against the total submitted count. Intriguingly, at every difficulty level, questions with no solution at all have the lowest number of submitted attempts. Note that a submitted attempt does not require the code to compile, which means someone can submit wrong code and it still counts as a submission. This suggests that many users will not even try a problem if it has no solution at all.

Visualization #8: Question_id vs total submitted attempts colored by accepted rate

Visualize a graph with question id on the x-axis and total submission count on the y-axis, colored by acceptance rate.

ggplot(leetcode, aes(x = question_id, y = `total Submitted`, color = rate)) +
  geom_point(alpha = 0.5, size = 1.5) +
  geom_smooth(se = FALSE, color = 'darkred', method = 'loess') +
  xlim(c(0, 1650)) +
  labs(title = 'Question_id vs total submitted attempts colored by accepted rate') +
  facet_wrap(~ difficulty)

The graph displays the trend among three variables (question_id, total submitted and acceptance rate) at each of the three difficulty levels. It suggests that the number of total submissions for a question is affected not only by its difficulty level but also by its question_id. For questions after number 500, the effort and attention from users decrease dramatically. In other words, most people focus only on the first 500 questions.
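One way to put a number on this drop is to compare typical submission counts for the first 500 questions with those for later ones. The sketch below uses hypothetical toy data; the column `total_submitted`, the group labels and the name `by_id_group` are introduced here for illustration only.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration.
toy <- data.frame(
  question_id     = c(10, 250, 480, 700, 1200, 1600),
  total_submitted = c(900, 700, 800, 200, 100, 60)
)

by_id_group <- toy %>%
  mutate(id_group = if_else(question_id <= 500, "first 500", "after 500")) %>%
  group_by(id_group) %>%
  summarize(
    n_questions      = n(),
    median_submitted = median(total_submitted)  # median is robust to a few very popular questions
  )

print(by_id_group)
```

The median is used here instead of the mean because submission counts tend to be heavily skewed by a handful of very popular questions.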

Visualization #9: Rank the questions in another order and list the top 30

Therefore, we can rank the questions in another order and list the top 30 most difficult questions.

# Rank by accepted rate (highest first) and keep the top 30
leetcode %>%
  arrange(desc(rate)) %>%
  filter(rate > 0.89) %>%
  head(n = 30) %>%
  ggplot(aes(x = factor(question_id), y = rate, fill = factor(difficulty))) +
  geom_col(colour = "black") +
  geom_text(aes(label = question_id), vjust = -0.3)

The graph displays the top questions in descending order of accepted rate, so that we can clearly see the question_id and difficulty of the questions at the extremes of the ranking.

From the above, 5 hard questions appear in the list of the easiest questions by acceptance rate, meaning they are easy to solve compared with other hard ones. These questions are: 1022, 771, 865, 1503 and 1061.

From the above, some questions with an easy or medium difficulty level are nevertheless difficult to solve, including: 1646, 1644, 1648, 1634, 1635, 1637, 1638, 1639, 1640, 1642, 1643, 1625, 1626, 1627, 1631, 1632, 1633 and so on.

We can conclude that not all hard questions are guaranteed to be more challenging than all easy or medium questions.

On the contrary, there are many medium-level problems that are difficult to solve, as they have a low acceptance/submission ratio. (In general, a question is considered very difficult when the ratio is below 0.3, i.e. 30%.)
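The 0.3 threshold can be applied mechanically to flag such questions regardless of their labeled difficulty. The sketch below assumes the rate is total accepted divided by total submitted; the toy data, the column names `total_accepted` and `total_submitted`, and the names `flagged` and `hard_by_ratio` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration.
toy <- data.frame(
  question_id     = 1:6,
  difficulty      = c(1, 2, 2, 3, 3, 1),
  total_accepted  = c(60, 20, 45, 10, 70, 25),
  total_submitted = c(100, 100, 100, 100, 100, 100)
)

flagged <- toy %>%
  mutate(
    rate           = total_accepted / total_submitted,  # assumed definition of the rate
    very_difficult = rate < 0.3                         # threshold taken from the text
  )

# Questions below the threshold, hardest first, with their labeled difficulty.
hard_by_ratio <- flagged %>%
  filter(very_difficult) %>%
  arrange(rate)

print(hard_by_ratio)
```

Any easy or medium question that shows up in `hard_by_ratio` is a candidate outlier of the kind discussed above.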

Visualization #10: Difficulty vs total submitted attempts classified by solution

Visualize a graph with difficulty on the x-axis and total submission count on the y-axis, grouped by solution.

# factor() makes difficulty discrete so that each level gets its own box
ggplot(leetcode, aes(x = factor(difficulty), y = `total Submitted`, fill = solution)) +
  geom_boxplot() +
  labs(title = 'Difficulty vs total submitted attempts classified by solution', x = 'difficulty') +
  facet_wrap(~ solution)

The graph plots question difficulty against the total submitted count. At all difficulty levels, a video solution helps the most: there is a big difference in submitted attempts between questions with a video solution and those with only a written one.

Visualization #11: Total submitted vs trueDifficulty

Visualize a density graph of total submitted attempts (x-axis) for each trueDifficulty.

ggplot(leetcode,aes(`total Submitted`, fill=trueDifficulty))+
  geom_density(alpha=0.1) + 
  facet_grid(trueDifficulty ~ .)

The graph displays the density of total submitted attempts for each actual difficulty level. The more attempts a question has, the more likely it is to be an easy or medium question. Some of the easy questions have the most attempts, as new users tend not to try hard questions at all.

Visualization #12: Accepted rate vs Total submitted for different difficulty

Visualize a graph with total submitted on the x-axis and the rate on the y-axis.

ggplot(leetcode, aes(x = `total Submitted`, y = rate, color = factor(trueDifficulty))) +
        geom_point(alpha = 0.2 , size = 1.9) + 
        geom_smooth(se = FALSE, color = 'black') +
        xlim(c(0 , 3000000)) + 
        labs(title = "Accepted rate vs Total submitted for different difficulty") +
        facet_grid(~trueDifficulty)

The relationship observed is that, at every difficulty level, the accepted rate declines significantly as the total submitted first increases. Beyond that point, the rate for both easy and hard questions declines at a roughly constant pace, whereas for medium questions more attempts go with an ever lower acceptance rate. My thought is that many medium questions are challenging but popular at the same time.

Reflection:

Overall, this project, along with the assignments, has been a wonderful experience thanks to the well-thought-out and organized structure of the course. Even though it was technically just a 5-week project, I consider the pacing balanced, and the Professor's passion and practical approach motivated me, since initially I was struggling to pick the right data set and thought coding in R was too challenging for me. R is widely used in many real-world industries, and a large portion of the data sets that I found required specific terminology and professional understanding of fields including fintech, economics, biology, social science and so on. What’s more, a set of reasonably meaningful research questions is not easy to define: sometimes I would think of questions that were hard to answer but not worth the effort in reality. At other times, research questions had meaning but seemed too superficial or straightforward, so that answering them would not require the graphing and visualization tools.

I wish I had known how useful R programming could be, and I should have prepared with some reading materials so that I would have a better understanding of its syntax and grammar. In addition, I would pick a more complex data set to push myself more, since solving coding problems and encountering obstacles improves one’s skills faster. I was initially searching for tech or data science related data sets, and some of them dealt elaborately with reinforcement learning, supervised learning, linear regression, natural language processing and other research-based projects. I found this sort of data set too overwhelming at the time and gave up on them easily.

Sometimes I was too cautious to use untaught methods or packages, so during the entire project I stuck mostly to the course materials. For instance, I did not explore more complicated graphing tools beyond histograms, bar charts, boxplots, violin plots and density plots. Looking back, I could have tried packages like plotly, which also offers aesthetic graphical plots. I could also have used models such as regression, tree models and ANOVA to summarize and present certain data more efficiently.

Another challenge was my lack of statistical background and of experience with interpreted languages, which made it harder to get to know R, since it runs without a compiler.

A naive reader should know something about coding websites and their business models, for instance how leetcode Premium works, and should make sure to understand the columns in the dataset spreadsheet.

Based on the information provided here, I would love to keep working on the project to develop an algorithm and train a model that people could use to predict the difficulty of each existing question from the data available so far, including total accepted submissions, total submitted ones, accept rate, and other variables that could be added to the data set, such as question classifications (array, string, dynamic programming, hash table and so on).

The ultimate goal is to provide a list of questions suitable for each difficulty level, separating out the outliers such as hard medium questions or easy hard questions. There are already some pre-built Python projects using CatBoost (an open source gradient boosting library), TensorFlow (an end-to-end open source platform for machine learning), Keras (a deep learning API) and tabular data preprocessing features. Currently, I am studying the potential output shape, which parameters are required or optional during model training, and what the training speed and accuracy are. Some more practical questions are where to store the training data and how to consistently improve the algorithm.

Conclusion

In general, I was able to answer the majority of the questions, although some more complex graphs are missing, since I mainly used a few univariate basics and should have focused more on bivariate ones. I added standard error bars to visualize the uncertainty, but they are not correct.

In summary, we can answer the research questions as follows:

“How is the difficulty of the questions correlated with the total submitted attempts?” We can conclude that the harder the question, the fewer the accepted attempts and the lower the accepted ratio. The number of total submissions for a question is affected not only by its difficulty level but also by its question_id.

“How is the acceptance rate which we have added as a new column correlate to the difficulty of the questions?” We can conclude that hard questions have a lower average ratio than easy and medium ones. In general, the ratio decreases as the difficulty of the questions increases.

“Are all hard questions more difficult than all of the medium and easy questions?” We can conclude that not all hard questions are guaranteed to be more challenging than all easy or medium questions. On the contrary, there are many medium-level problems that are difficult to solve, as they have a low acceptance/submission ratio.

“How is the video/written solution correlate to the premium/paid subscription service?” We can conclude that only 3.6 percent of questions have video solutions. A larger percentage of premium questions have video solutions than non-premium questions (10% vs 2%). At all difficulty levels, a video solution helps the most: there is a big difference in submitted attempts between questions with a video solution and those with only a written one.

“Are more of the medium questions locked behind the premium service rather than the easy ones?” Among medium questions, premium questions account for a larger percentage of submitted and accepted attempts than among easy and hard questions. Easy questions have the smallest percentage of premium questions. We found that as the difficulty level increases from 1 to 3, a higher proportion of questions is included in the premium service. As total submissions grow, the accepted rate for both easy and hard questions declines at a roughly constant pace, whereas for medium questions more attempts go with an ever lower acceptance rate.

There are a number of questions I initially wished to answer with my dataset but could not. These questions include:

1. In general, do questions with a higher difficulty level usually have longer question titles/descriptions?

Bibliography

1. R for Data Science. Book. https://r4ds.had.co.nz/index.html
2. Programming Language R
3. Azeem, M. A. (2021, April 5). Leetcode. Kaggle. Retrieved July 4, 2022, from https://www.kaggle.com/datasets/muhammedabdulazeem/leetcode/code
4. Lonnieqin. (2021, December 24). Leetcode analysis. Kaggle. Retrieved July 4, 2022, from https://www.kaggle.com/code/lonnieqin/leetcode-analysis
5. Upparipavan. (2022, March 6). Eda on leetcode questions. Kaggle. Retrieved July 4, 2022, from https://www.kaggle.com/code/upparipavan/eda-on-leetcode-questions

FinalProjectHuidiDing