An exploration of LeetCode questions and other determining factors.
knitr::opts_chunk$set(echo = TRUE)
LeetCode questions seem to be an unavoidable topic during the tech interview process, and LeetCode is among the most recognizable coding websites for practicing data structure and algorithm questions.
# Research questions
The research questions identified so far for this dataset are:
- How is the difficulty of the questions correlated with the total submitted attempts?
- How does the acceptance rate, which we add as a new column, correlate with the difficulty of the questions?
- Are all hard questions more difficult than all of the medium and easy questions?
- How does the video/written solution correlate with the premium/paid subscription service?
- Are more of the medium questions locked behind the premium service than the easy ones?
# Import readr and assign csv dataset to a variable
library(readr)
library(tidyverse)
library(tidyr)
library(rstudioapi)
library(ggplot2)
library(dplyr)
library(plotrix)
library(reshape2)
# Read the dataset from the current working directory
leetcode <- read_csv("leetcode.csv")
#newData <- data.frame(leetcode)
view(leetcode)
# Preview the whole dataset before any operations
dim(leetcode)
[1] 1650 8
head(leetcode, n = 1650)
# A tibble: 1,650 × 8
question_id video title link `total Accepted` `total Submitt…`
<dbl> <lgl> <chr> <chr> <dbl> <dbl>
1 1959 NA Minimum … http… 262 434
2 1949 NA Implemen… http… 691 1126
3 1947 NA Number o… http… 1976 7740
4 1946 NA Minimum … http… 5411 13595
5 1945 NA Finding … http… 5890 7460
6 1944 NA Truncate… http… 7056 8979
7 1943 NA Count Pa… http… 584 850
8 1937 NA Maximize… http… 571 819
9 1936 NA Maximize… http… 2958 10959
10 1935 NA Minimum … http… 7134 10125
# … with 1,640 more rows, and 2 more variables: difficulty <dbl>,
# isPaid <lgl>
question_id video title link
0 701 0 0
total Accepted total Submitted difficulty isPaid
0 0 0 0
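The per-column numbers above read like a tally of missing values (701 in `video`, none elsewhere), but the call that produced them is not shown. The following is only a sketch of one common way to get such a tally; `toy` is a made-up data frame standing in for `leetcode`, and the original report may have used a different call.

```r
# Hypothetical toy data standing in for the leetcode tibble
toy <- data.frame(
  question_id = c(1, 2, 3, 4),
  video       = c(NA, TRUE, NA, FALSE),
  isPaid      = c(TRUE, FALSE, FALSE, TRUE)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Applied to the real data, the same pattern would be `colSums(is.na(leetcode))`.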
# Label the video column: TRUE becomes "Video", FALSE becomes "Written", and NA becomes "No solution". This column has a lot of missing data (701 of 1,650 rows), so the analysis below ignores those missing values.
#replace_na(leetcode, list(video = "No solution"))
# Calling str() function to confirm
str(leetcode)
spec_tbl_df [1,650 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ question_id : num [1:1650] 1959 1949 1947 1946 1945 ...
$ video : logi [1:1650] NA NA NA NA NA NA ...
$ title : chr [1:1650] "Minimum Path Cost in a Hidden Grid" "Implement Trie II (Prefix Tree)" "Number of Different Subsequences GCDs" "Minimum Absolute Sum Difference" ...
$ link : chr [1:1650] "https://leetcode.com/problems/minimum-path-cost-in-a-hidden-grid" "https://leetcode.com/problems/implement-trie-ii-prefix-tree" "https://leetcode.com/problems/number-of-different-subsequences-gcds" "https://leetcode.com/problems/minimum-absolute-sum-difference" ...
$ total Accepted : num [1:1650] 262 691 1976 5411 5890 ...
$ total Submitted: num [1:1650] 434 1126 7740 13595 7460 ...
$ difficulty : num [1:1650] 2 2 3 2 2 1 2 3 3 2 ...
$ isPaid : logi [1:1650] TRUE TRUE FALSE FALSE FALSE FALSE ...
- attr(*, "spec")=
.. cols(
.. question_id = col_double(),
.. video = col_logical(),
.. title = col_character(),
.. link = col_character(),
.. `total Accepted` = col_double(),
.. `total Submitted` = col_double(),
.. difficulty = col_double(),
.. isPaid = col_logical()
.. )
- attr(*, "problems")=<externalptr>
leetcode <- leetcode %>%
  mutate(solution = case_when(
    video == TRUE ~ "Video",
    video == FALSE ~ "Written",
    # NA never compares equal to anything, so test for it with is.na()
    is.na(video) ~ "No solution"))
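One detail worth keeping in mind when labelling missing values: in R, a comparison written with `== NA` returns `NA` rather than `TRUE`, so a condition of that form never matches, whereas `is.na()` does. The sketch below shows this on a made-up logical vector (the name `video` mirrors the column, but the values are hypothetical) and uses base R `ifelse()` rather than `case_when()` to stay self-contained.

```r
# Hypothetical logical vector shaped like the video column
video <- c(TRUE, FALSE, NA)

# Comparing against NA with == yields NA for every element, never TRUE
eq_na <- video == NA

# is.na() is the test that actually identifies missing values
is_missing <- is.na(video)

# Base R equivalent of the three-way labelling
solution <- ifelse(is.na(video), "No solution",
                   ifelse(video, "Video", "Written"))
print(solution)
```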
# Define the column rate, calculated by dividing total Accepted by total Submitted
totalAccepted <- select(leetcode, `total Accepted`)
totalSubmitted <- select(leetcode, `total Submitted`)
# Add a new column named rate to the leetcode dataset
leetcode <- mutate(leetcode, rate = `total Accepted` / `total Submitted`)
#rate <- transform(rate, ar = accepetedRate / totalSubmitted)
# Preview the rate to check whether it is right
head(leetcode, n = 1650)
# A tibble: 1,650 × 10
question_id video title link `total Accepted` `total Submitt…`
<dbl> <lgl> <chr> <chr> <dbl> <dbl>
1 1959 NA Minimum … http… 262 434
2 1949 NA Implemen… http… 691 1126
3 1947 NA Number o… http… 1976 7740
4 1946 NA Minimum … http… 5411 13595
5 1945 NA Finding … http… 5890 7460
6 1944 NA Truncate… http… 7056 8979
7 1943 NA Count Pa… http… 584 850
8 1937 NA Maximize… http… 571 819
9 1936 NA Maximize… http… 2958 10959
10 1935 NA Minimum … http… 7134 10125
# … with 1,640 more rows, and 4 more variables: difficulty <dbl>,
# isPaid <lgl>, solution <chr>, rate <dbl>
# Compute the mean for the accepted solutions and submitted solutions
summarize(leetcode, mean.totalAccepted = mean(`total Accepted`, na.rm = TRUE))
# A tibble: 1 × 1
mean.totalAccepted
<dbl>
1 128905.
summarize(leetcode, mean.totalSubmitted = mean(`total Submitted`, na.rm = TRUE))
# A tibble: 1 × 1
mean.totalSubmitted
<dbl>
1 280704.
# Compute the average difficulty: 1.978, which is slightly below medium (2)
summarize(leetcode, mean.difficulty = mean(difficulty, na.rm = TRUE))
# A tibble: 1 × 1
mean.difficulty
<dbl>
1 1.98
# Compute the sd for the accepted solutions
summarize(leetcode, sd.totalAccepted = sd(`total Accepted`, na.rm = TRUE))
# A tibble: 1 × 1
sd.totalAccepted
<dbl>
1 233157.
# Group the dataset by difficulty level and compute the mean submitted count, the mean accepted count, and the ratio of the two for each level
leetcode %>%
  group_by(difficulty) %>%
  summarize(mean.totoalSubmittedDifficult = mean(`total Submitted`),
            mean.totalAcceptedDifficult = mean(`total Accepted`),
            mean.totalAcceptedRate = mean.totalAcceptedDifficult / mean.totoalSubmittedDifficult)
# A tibble: 3 × 4
difficulty mean.totoalSubmittedDi… mean.totalAccep… mean.totalAccep…
<dbl> <dbl> <dbl> <dbl>
1 1 418797. 213609. 0.510
2 2 269992. 119730. 0.443
3 3 155624. 58597. 0.377
# From the table above, hard questions have a lower ratio of accepted to submitted attempts than easy and medium ones: the ratio falls from 0.510 (easy) to 0.443 (medium) to 0.377 (hard). Harder questions also have fewer submitted and fewer accepted attempts on average.
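Note that the ratio in the table is the mean accepted count divided by the mean submitted count, which is not the same as averaging the per-question `rate` column. A minimal sketch with made-up numbers (not taken from the dataset) shows how the two can differ, since the first is dominated by heavily attempted questions while the second weights every question equally.

```r
# Hypothetical counts for three questions of the same difficulty
accepted  <- c(90, 10, 300)
submitted <- c(100, 100, 1000)

# Ratio of the group means (the quantity in the summary table)
ratio_of_means <- mean(accepted) / mean(submitted)

# Mean of the per-question rates (weights every question equally)
mean_of_rates <- mean(accepted / submitted)

print(c(ratio_of_means = ratio_of_means, mean_of_rates = mean_of_rates))
```

Which of the two is the better summary depends on whether the question of interest is about attempts overall or about a typical question.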
# Compute a frequency table of leetcode$difficulty; this shows the number of questions at each difficulty level
frequencyDifficulty <- leetcode %>%
select(difficulty) %>%
table()
frequencyDifficulty
difficulty
1 2 3
397 891 362
frequencyDifficulty %>%
prop.table()
difficulty
1 2 3
0.2406061 0.5400000 0.2193939
# Visualize a pie chart of the number of questions at each of the 3 difficulty levels, labelled with the frequency of each level. About 54% of the questions are medium, 24% are easy, and 22% are hard.
pie(frequencyDifficulty, labels = frequencyDifficulty, main = "Total count for each difficulty")
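If percentages are easier to read than raw counts on the chart, the labels can be built from the frequency table first. This is a small sketch, assuming the counts 397, 891 and 362 from the table above; the names `counts`, `pct` and `pie_labels` are illustrative only.

```r
# Question counts per difficulty, taken from the frequency table above
counts <- c(Easy = 397, Medium = 891, Hard = 362)

# Convert to percentages of all questions, rounded to one decimal place
pct <- round(100 * counts / sum(counts), 1)

# Build labels that combine the level name and its percentage
pie_labels <- paste0(names(counts), " (", pct, "%)")
print(pie_labels)
```

The resulting vector could then be passed as `labels = pie_labels` in the `pie()` call.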
# Sort in descending order of rate to find the questions with the highest acceptance rate
rateDescending <- arrange(leetcode, desc(`rate`))
head(rateDescending, n = 1650)
# A tibble: 1,650 × 10
question_id video title link `total Accepted` `total Submitt…`
<dbl> <lgl> <chr> <chr> <dbl> <dbl>
1 1404 NA Print Im… http… 23790 25230
2 1713 FALSE Dot Prod… http… 35694 39106
3 1089 NA Remove V… http… 64124 70819
4 1603 FALSE Running … http… 287068 323746
5 1205 NA Defangin… http… 297302 336414
6 1791 NA Richest … http… 92205 104538
7 1528 NA Kids Wit… http… 197377 223831
8 1580 NA Shuffle … http… 186267 211271
9 1572 NA Subrecta… http… 39364 44719
10 1635 NA Number o… http… 156334 178288
# … with 1,640 more rows, and 4 more variables: difficulty <dbl>,
# isPaid <lgl>, solution <chr>, rate <dbl>
# Likewise, we could rank the questions in ascending order of rate and list the top 30 most difficult questions
# In the list of the easiest questions above, 5 hard questions are easy to solve compared with other hard ones according to acceptance rate. These questions are: 1022, 771, 865, 1503 and 1061
# Some questions rated easy or medium are nevertheless difficult to solve, including: 1646, 1644, 1648, 1634, 1635, 1637, 1638, 1639, 1640, 1642, 1643, 1625, 1626, 1627, 1631, 1632, 1633 and so on.
# We can conclude that not all hard questions are guaranteed to be more challenging than all easy or medium questions.
# On the contrary, many medium questions are difficult to solve, as they have a low accepted/submitted ratio. (In general, a question is considered very difficult when the ratio is below 0.3, i.e. 30%.)
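The ranking and the outlier checks described above can be sketched as follows. The data frame `toy` is made up, the 0.3 cut-off is the one mentioned in the text, and the 0.7 cut-off for "easy" hard questions is an arbitrary assumption chosen only for illustration.

```r
# Hypothetical rows; difficulty uses the dataset's coding 1 = easy, 2 = medium, 3 = hard
toy <- data.frame(
  question_id = c(101, 102, 103, 104, 105),
  difficulty  = c(3, 3, 2, 1, 2),
  rate        = c(0.72, 0.25, 0.28, 0.85, 0.55)
)

# Ascending order of rate puts the lowest acceptance rates first
hardest_first <- toy[order(toy$rate), ]

# Easy or medium questions below the 0.3 threshold
tough_non_hard <- subset(toy, difficulty < 3 & rate < 0.3)

# Hard questions with a high acceptance rate (0.7 is an assumed cut-off)
lenient_hard <- subset(toy, difficulty == 3 & rate > 0.7)

print(hardest_first)
print(tough_non_hard)
print(lenient_hard)
```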
# Visualization #1:
# Visualize a bar graph with question difficulty on the x-axis and the number of questions on the y-axis
ggplot(leetcode, aes(factor(difficulty))) +
  geom_bar(color = "grey", fill = "black") +
  labs(title = "Difficulty frequency", x = "difficulty") +
  scale_x_discrete(labels = c("Easy", "Medium", "Hard")) +
  theme_bw()
# Visualization #2:
# Visualize a bar graph with isPaid on the x-axis and the number of questions on the y-axis
ggplot(leetcode, aes(isPaid)) +
  geom_bar(color = "grey", fill = "black") +
  labs(title = "Premium subscription and question counts") +
  theme_minimal()
# Visualization #3: Combine the two bar charts
# make the plot
ggplot(leetcode, aes(x = difficulty, y = `total Submitted` + `total Accepted`, fill = isPaid)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  scale_fill_manual("isPaid\n", values = c("red", "blue"),
                    labels = c(" unpaid", " paid")) +
  labs(x = "\ndifficulty", y = "totalAttempts\n", title = "Total attempts for each difficulty group related to premium service") +
  theme_bw(base_size = 14)
# Among medium questions, premium questions account for a larger share of submitted and accepted attempts than among easy and hard questions. Easy questions have the smallest share of premium attempts.
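One caveat when reading a chart like this: with `stat = "identity"` and `position = "dodge"`, each row is drawn as its own bar, and bars that share a difficulty and an isPaid value are drawn on top of each other, so the visible height tends to reflect the largest single question rather than the group total. A hedged sketch of summing the attempts per group first is shown below; `toy` and its column names are made up for illustration.

```r
# Hypothetical rows standing in for the leetcode tibble
toy <- data.frame(
  difficulty = c(1, 1, 2, 2, 2, 3),
  isPaid     = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  submitted  = c(100, 40, 80, 60, 30, 20),
  accepted   = c(50, 20, 30, 30, 10, 5)
)
toy$attempts <- toy$submitted + toy$accepted

# Sum the attempts within each difficulty / isPaid combination
totals <- aggregate(attempts ~ difficulty + isPaid, data = toy, FUN = sum)

# Share of each difficulty's attempts that comes from paid questions
paid_share <- tapply(toy$attempts * toy$isPaid, toy$difficulty, sum) /
  tapply(toy$attempts, toy$difficulty, sum)

print(totals)
print(round(paid_share, 3))
```

A table like `totals` has one row per bar, so plotting it with `stat = "identity"` would show the group sums directly.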
# Visualization #4:
# Visualize a bar graph with question difficulty on the x-axis and the average submission count on the y-axis, and add standard error bars to visualize the uncertainty around the estimate
leetcode %>%
  group_by(difficulty) %>%
  summarize(mean.totoalSubmittedDifficult = mean(`total Submitted`),
            mean.totalAcceptedDifficult = mean(`total Accepted`),
            mean.totalAcceptedRate = mean.totalAcceptedDifficult / mean.totoalSubmittedDifficult,
            sd.totalSubmitted = sd(`total Submitted`))
# A tibble: 3 × 5
difficulty mean.totoalSubmittedDi… mean.totalAccep… mean.totalAccep…
<dbl> <dbl> <dbl> <dbl>
1 1 418797. 213609. 0.510
2 2 269992. 119730. 0.443
3 3 155624. 58597. 0.377
# … with 1 more variable: sd.totalSubmitted <dbl>
ggplot(leetcode, aes(x = `difficulty`, y = `total Submitted`)) +
geom_bar(stat="identity", fill="grey", position=position_dodge()) +
geom_errorbar( aes(ymin=`total Submitted`-533771, ymax=`total Submitted`+533771), colour="orange", position=position_dodge(.9))
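The plot above applies one fixed half-width (533771) to every row. Standard error bars are more commonly computed per group as the standard deviation divided by the square root of the group size and drawn around the group mean. The following is a sketch of that calculation under those assumptions, using made-up numbers; it is not the report's own method.

```r
# Hypothetical submission counts for two difficulty levels
toy <- data.frame(
  difficulty = c(1, 1, 1, 1, 2, 2, 2, 2),
  submitted  = c(10, 20, 30, 40, 100, 200, 300, 400)
)

# Per-group mean, standard deviation, and size
grp_mean <- tapply(toy$submitted, toy$difficulty, mean)
grp_sd   <- tapply(toy$submitted, toy$difficulty, sd)
grp_n    <- tapply(toy$submitted, toy$difficulty, length)

# Standard error of the mean: sd / sqrt(n)
grp_se <- grp_sd / sqrt(grp_n)

# Error-bar limits, one pair per difficulty level
bars <- data.frame(
  difficulty = as.numeric(names(grp_mean)),
  mean = as.vector(grp_mean),
  ymin = as.vector(grp_mean - grp_se),
  ymax = as.vector(grp_mean + grp_se)
)
print(bars)
```

A summary table shaped like `bars` could be plotted with `geom_col()` for the means and `geom_errorbar(aes(ymin = ymin, ymax = ymax))` for the uncertainty.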
# Visualization #5:
# Replace the numeric values in difficulty with labels, recoding leetcode to make a boxplot
# The result should be three separate box plots, one for each difficulty level
leetcode <- mutate(leetcode, trueDifficulty = recode(difficulty, `1` = "easy", `2` = "medium", `3` = "difficult"))
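Because `recode()` with string replacements yields a character column, ggplot2 will order the categories alphabetically (difficult, easy, medium). One way to keep the natural order is a factor with explicit levels. This is a base R sketch on made-up codes, offered as an alternative rather than what the report does.

```r
# Hypothetical difficulty codes using the dataset's 1/2/3 scheme
difficulty <- c(2, 3, 1, 2, 1)

# factor() with explicit levels and labels maps codes to names and fixes their order
trueDifficulty <- factor(difficulty,
                         levels = c(1, 2, 3),
                         labels = c("easy", "medium", "difficult"))

print(levels(trueDifficulty))
print(table(trueDifficulty))
```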
# Visualize a boxplot showing the relationship between question difficulty (x-axis) and the total submitted count for each question (y-axis)
ggplot(leetcode, aes(trueDifficulty, `total Submitted`)) +
  geom_boxplot() +
  labs(title = "Difficulty vs. total submitted attempts", x = "difficulty level", y = "total submitted attempts")
# The graph shows question difficulty on the x-axis and the total submitted count on the y-axis.
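To put a number on the relationship the boxplot shows, a rank correlation is one option, since difficulty is an ordinal 1/2/3 code rather than a continuous measure. The sketch below uses base R `cor()` with `method = "spearman"` on made-up values; the choice of Spearman's rho is an assumption, not something the report specifies.

```r
# Hypothetical data: ordinal difficulty codes and submission counts
difficulty <- c(1, 1, 2, 2, 3, 3)
submitted  <- c(900, 700, 500, 600, 200, 100)

# Spearman's rho is computed on ranks, which suits an ordinal scale
rho <- cor(difficulty, submitted, method = "spearman")
print(rho)
```

A negative value would indicate that higher difficulty goes with fewer submissions, in line with the group means reported earlier.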
# Visualization #6:
# Visualize a histogram with the question id on the x-axis and the number of questions in each bin on the y-axis, split by difficulty and colored by isPaid
ggplot(leetcode, aes(question_id, fill = isPaid)) +
geom_histogram(binwidth = 50) +
labs(title = "question number in all difficulty levels") +
theme_bw() +
facet_wrap(vars(difficulty), scales = "free")
# Preview and get the column names of the dataset
head(leetcode)
# A tibble: 6 × 11
question_id video title link `total Accepted` `total Submitt…`
<dbl> <lgl> <chr> <chr> <dbl> <dbl>
1 1959 NA Minimum P… http… 262 434
2 1949 NA Implement… http… 691 1126
3 1947 NA Number of… http… 1976 7740
4 1946 NA Minimum A… http… 5411 13595
5 1945 NA Finding t… http… 5890 7460
6 1944 NA Truncate … http… 7056 8979
# … with 5 more variables: difficulty <dbl>, isPaid <lgl>,
# solution <chr>, rate <dbl>, trueDifficulty <chr>
colnames(leetcode)
[1] "question_id" "video" "title"
[4] "link" "total Accepted" "total Submitted"
[7] "difficulty" "isPaid" "solution"
[10] "rate" "trueDifficulty"
As you can see from colnames(), we have listed all the variables.
Variable type classification:
String/char type
title definition: the LeetCode question titles shown on the official website
link definition: official link to the specific question
trueDifficulty definition: as noted above, we transform 1 to easy, 2 to medium, and 3 to difficult (hard). Usually, the higher the difficulty, the more time a user is likely to spend
Numeric type
question_id definition: the number that corresponds to each question
total Accepted definition: number of submitted solutions that compile and run correctly
total Submitted definition: number of solutions/coding attempts submitted to the website, which may be wrong
difficulty definition: in general, this represents the learning curve and how challenging a typical question could be
rate definition: total Accepted divided by total Submitted, i.e. the fraction of submitted solutions that were accepted by the server
Logical type
video definition: TRUE means the solution is in video format, FALSE means written, NA means none
isPaid definition: TRUE means the question requires a premium membership, FALSE means it does not
# Use filter() to list all the questions without solutions
unsolvedQuestions <- filter(leetcode, is.na(video))
head(unsolvedQuestions)
# A tibble: 6 × 11
question_id video title link `total Accepted` `total Submitt…`
<dbl> <lgl> <chr> <chr> <dbl> <dbl>
1 1959 NA Minimum P… http… 262 434
2 1949 NA Implement… http… 691 1126
3 1947 NA Number of… http… 1976 7740
4 1946 NA Minimum A… http… 5411 13595
5 1945 NA Finding t… http… 5890 7460
6 1944 NA Truncate … http… 7056 8979
# … with 5 more variables: difficulty <dbl>, isPaid <lgl>,
# solution <chr>, rate <dbl>, trueDifficulty <chr>
– What is missing (if anything) in your analysis process so far?
More complex graphs are missing, since I mainly used a few univariate basics and should focus more on bivariate ones. I added standard error bars to visualize the uncertainty, but they are not correct.
– What conclusions can you make about your research questions at this point?
In summary, we can answer the questions as follows:
“How is the difficulty of the questions correlated with the total submitted attempts?” We can conclude that the harder the question, the fewer submitted and accepted attempts it has on average and the lower its accepted ratio.
“How does the acceptance rate, which we add as a new column, correlate with the difficulty of the questions?” We can conclude that hard questions have a lower average ratio than easy and medium ones. In general, the ratio decreases as the difficulty of the questions increases.
“Are all hard questions more difficult than all of the medium and easy questions?” We can conclude that not all hard questions are guaranteed to be more challenging than all easy or medium questions. On the contrary, many medium questions are difficult to solve, as they have a low accepted/submitted ratio.
“How does the video/written solution correlate with the premium/paid subscription service?” We can conclude that only 3.6 percent of questions have video solutions. A larger percentage of premium questions have video solutions than non-premium questions (10% vs 2%).
“Are more of the medium questions locked behind the premium service than the easy ones?” Among medium questions, premium questions account for a larger share of submitted and accepted attempts than among easy and hard questions. Easy questions have the smallest share of premium attempts.
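The two premium-related answers rest on comparing shares within groups. A minimal sketch of how such shares can be tabulated is given below, using `table()` and `prop.table()` with `margin = 1` so that each row sums to 1. The data frame `toy` is made up and does not reproduce the percentages quoted above.

```r
# Hypothetical rows standing in for the leetcode tibble
toy <- data.frame(
  isPaid     = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  solution   = c("Video", "No solution", "Written", "Written", "No solution", "Video"),
  difficulty = c(2, 1, 2, 3, 1, 2)
)

# Within paid and unpaid questions, how are the solution types split?
solution_by_paid <- prop.table(table(toy$isPaid, toy$solution), margin = 1)

# Within each difficulty level, what fraction of questions is paid?
paid_by_difficulty <- prop.table(table(toy$difficulty, toy$isPaid), margin = 1)

print(round(solution_by_paid, 2))
print(round(paid_by_difficulty, 2))
```

Counting questions this way answers "how many questions are locked", which is a different quantity from the share of attempts shown in the stacked totals.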
– What do you think a naive reader would need to fully understand your graphs?
From my perspective, a naive reader needs to know something about coding websites and their business model, for instance how LeetCode Premium works, and should understand the elements in the dataset spreadsheet. Some coding background would be ideal, since for an attempt to be accepted the code needs to compile and pass the additional test cases.
– Is there anything you want to answer with your dataset, but can’t?
There are a number of questions I initially wished to answer, including:
1. In general, do questions with a higher difficulty level usually have longer question titles/descriptions?
2. Based on the information provided, is there a model or algorithm that we could use to predict the difficulty of each question? If so, how would we train the model?
3. How can I provide a list of questions suitable for each difficulty level that singles out the outliers, such as medium questions that are hard in practice or hard questions that are easy in practice?