The objective of this analysis is to 1) understand the types of questions and concepts where learners make the most mistakes, and 2) identify specific modules and sessions where learner performance is strong, medium, or weak.
## checkbox numeric radio
## 604 32 4695
First, let’s compare the number of questions asked in each module and course.
unique_questions = data[which(!duplicated(data$question_id)), ]
unique_questions %>% group_by(module_name) %>% summarise(n())
## # A tibble: 42 x 2
## module_name n()
## <fctr> <int>
## 1 A/B Testing 4
## 2 Acquisition Analytics 15
## 3 Advanced Regression 28
## 4 Association Rule Mining 49
## 5 Big Data Analytics using Apache Spark 6
## 6 Big Data processing with Hadoop 20
## 7 Business and Data Understanding 36
## 8 Data Ingestion and Processing 18
## 9 Data Preparation 35
## 10 Data Visualization - How to Make Data Presentable! 51
## # ... with 32 more rows
There are a total of 1358 questions (excluding the ‘Welcome’ module).
unique_questions %>% group_by(course_number) %>% summarise(n())
## # A tibble: 7 x 2
## course_number n()
## <int> <int>
## 1 0 62
## 2 1 253
## 3 2 241
## 4 3 269
## 5 4 289
## 6 5 52
## 7 6 192
The number of questions asked in the Big Data course (course 5, with 52 questions) is much lower than in the other courses.
Let’s create separate dataframes for each course.
# aggregating the entire data on average number of attempts and the fraction of correct students, by each question
questions_summary = data %>% group_by(question_id) %>%
summarise(count = sum(user_cnt),
correct = sum(user_cnt*is_correct),
fraction = round(correct/count, 2))
course_names = data[, c(5, 6, 12, 16, 17, 18)]
all_courses_summary = merge(questions_summary, course_names, by="question_id", all.x = F)
all_courses = all_courses_summary[which(!duplicated(all_courses_summary$question_id)), ]
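The per-question correct fraction computed above can be illustrated on a toy table. This is a minimal sketch in base R, assuming that each row of `data` holds the number of users (`user_cnt`) who reached a given outcome (`is_correct` as 0/1) on a question. That reading is inferred from how the code uses the columns, and the `toy` values are invented for illustration.

```r
# Toy stand-in for `data`: one row per (question, outcome) pair.
# Column meanings are assumed from the original code; the values are made up.
toy <- data.frame(
  question_id = c("q1", "q1", "q2", "q2"),
  is_correct  = c(1, 0, 1, 0),
  user_cnt    = c(80, 20, 30, 70)
)

# Users who attempted each question, and users who answered it correctly.
total   <- tapply(toy$user_cnt, toy$question_id, sum)
correct <- tapply(toy$user_cnt * toy$is_correct, toy$question_id, sum)

# Same three columns as questions_summary: count, correct, fraction.
toy_summary <- data.frame(
  question_id = names(total),
  count       = as.vector(total),
  correct     = as.vector(correct),
  fraction    = round(as.vector(correct) / as.vector(total), 2)
)
print(toy_summary)
```

Because the fraction is weighted by `user_cnt`, a question answered by many users is summarised by the share of users who got it right, not by the share of rows marked correct.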
Let’s look at some basic statistics:
# avg of correct fraction
mean(all_courses$fraction)
## [1] 0.7561929
# comparing avg correct across courses
all_courses %>% group_by(course_number) %>% summarise(avg = round(mean(fraction), 2), num_questions = n())
## # A tibble: 7 x 3
## course_number avg num_questions
## <int> <dbl> <int>
## 1 0 0.55 62
## 2 1 0.73 253
## 3 2 0.83 241
## 4 3 0.80 269
## 5 4 0.74 289
## 6 5 0.83 52
## 7 6 0.70 192
# How does the average compare across graded and non-graded questions?
all_courses %>% group_by(graded) %>% summarise(avg = round(mean(fraction), 2), num_questions = n())
## # A tibble: 2 x 3
## graded avg num_questions
## <lgl> <dbl> <int>
## 1 FALSE 0.72 799
## 2 TRUE 0.81 559
On average, 75.6% of learners get a question correct (in either 1 or 2 attempts). The average varies considerably across courses, from 55% in course 0 (placement content) to 83% in courses 2 and 5; course 3 (PA-I) is at 80%.
Also, the average correct rate is 81% on graded questions compared to 72% on non-graded ones.
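As a sanity check, the group averages should combine back to the overall mean when weighted by the number of questions in each group. The sketch below re-enters the rounded values from the two summary tables above, so the results are only approximate.

```r
# Rounded per-course averages and question counts, copied from the table above.
course_avg <- c(0.55, 0.73, 0.83, 0.80, 0.74, 0.83, 0.70)
course_n   <- c(62, 253, 241, 269, 289, 52, 192)

# Rounded graded / non-graded averages and counts, copied from the table above.
graded_avg <- c(0.72, 0.81)
graded_n   <- c(799, 559)

# Both groupings cover the same set of questions.
total_questions <- sum(course_n)

# Question-weighted means; each should land close to mean(all_courses$fraction).
overall_from_courses <- weighted.mean(course_avg, course_n)
overall_from_graded  <- weighted.mean(graded_avg, graded_n)

print(c(total_questions, sum(graded_n)))
print(round(c(overall_from_courses, overall_from_graded), 3))
```

Both weighted means come out within about 0.002 of the reported 0.756, which is consistent with rounding error in the tabulated averages.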
Let’s drill down into individual questions in each course.
# creating dfs for individual courses
course_1 = filter(all_courses, course_number == 1)
course_2 = filter(all_courses, course_number == 2)
course_3 = filter(all_courses, course_number == 3)
course_4 = filter(all_courses, course_number == 4)
course_5 = filter(all_courses, course_number == 5)
course_6 = filter(all_courses, course_number == 6)
course_0 = filter(all_courses, course_number == 0)
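The seven `filter()` calls could also be collapsed into a single `split()`, which returns a named list with one data frame per course. This is an optional sketch on an invented `toy_courses` table, not the approach used in this report.

```r
# Toy stand-in for all_courses (hypothetical values).
toy_courses <- data.frame(
  question_id   = c("q1", "q2", "q3", "q4"),
  course_number = c(0, 1, 1, 2),
  fraction      = c(0.55, 0.70, 0.76, 0.83)
)

# One data frame per course, named by course number ("0", "1", "2").
by_course <- split(toy_courses, toy_courses$course_number)

# Equivalent of course_1 = filter(all_courses, course_number == 1)
course_1_toy <- by_course[["1"]]
print(course_1_toy)
```

The list form makes it easy to loop over courses, for example with `lapply(by_course, function(df) mean(df$fraction))`.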
# Course 0 analysis
course_0 %>% group_by(module_name) %>%
summarise(fraction_correct = round(mean(fraction), 2), count = n())
## # A tibble: 1 x 3
## module_name fraction_correct count
## <fctr> <dbl> <int>
## 1 Placement Support 0.55 62
write.csv(course_0, "course_0.csv")
In placement questions, the correct answer rate is only 55%. Drilling down, the questions with correct rates below 50% fall into two groups:
1. Fundamental concepts, such as cross-validation, overfitting, application of VIF etc.
2. Those requiring a decent amount of thinking and effort, e.g. calculation of confidence intervals using z-scores, calculation of log odds etc.
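The questions under the 50% threshold can be pulled out and sorted from weakest to strongest along these lines. The sketch uses an invented `toy_course_0` table in place of the real `course_0` data frame.

```r
# Toy stand-in for course_0 (hypothetical question ids and fractions).
toy_course_0 <- data.frame(
  question_id = c("q1", "q2", "q3", "q4"),
  fraction    = c(0.82, 0.41, 0.36, 0.55),
  stringsAsFactors = FALSE
)

# Keep questions that fewer than half of the learners answered correctly,
# then list the weakest ones first.
weak <- toy_course_0[toy_course_0$fraction < 0.5, ]
weak <- weak[order(weak$fraction), ]
print(weak)
```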
Based on these observations and the interview feedback, the recommendations for placement content are
# Course 1 analysis
mean(course_1$fraction)
## [1] 0.7304743
course_1 %>% group_by(module_name, graded) %>%
summarise(fraction_correct = round(mean(fraction), 2), count = n())
## Source: local data frame [9 x 4]
## Groups: module_name [?]
##
## module_name graded fraction_correct count
## <fctr> <lgl> <dbl> <int>
## 1 Business and Data Understanding FALSE 0.85 22
## 2 Business and Data Understanding TRUE 0.89 14
## 3 How Data Exists Within Enterprises FALSE 0.71 27
## 4 How Data Exists Within Enterprises TRUE 0.83 38
## 5 Introduction to Data Analytics FALSE 0.74 16
## 6 Introduction to Data Analytics TRUE 0.76 10
## 7 Introduction to Python FALSE 0.56 83
## 8 Language of Data Analysts - Introduction FALSE 0.81 17
## 9 Language of Data Analysts - Introduction TRUE 0.89 26
write.csv(course_1, "course_1.csv")
In every module that has both graded and non-graded questions, performance in graded questions is better than in non-graded ones, by 2 to 12 percentage points.
Introduction to Python has a correct rate of only 56%. This may be because the module is optional, so people don’t take the questions seriously and just browse through the videos casually. All of its questions were non-graded.
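To gauge how much the Python module pulls down the course 1 average, the module-level table above can be recombined with and without it. The sketch below re-enters the rounded `fraction_correct` and `count` values from that table, so the results are approximate.

```r
# Rounded values copied from the course 1 summary table above.
course_1_summary <- data.frame(
  module_name = c(
    "Business and Data Understanding", "Business and Data Understanding",
    "How Data Exists Within Enterprises", "How Data Exists Within Enterprises",
    "Introduction to Data Analytics", "Introduction to Data Analytics",
    "Introduction to Python",
    "Language of Data Analysts - Introduction",
    "Language of Data Analysts - Introduction"
  ),
  graded = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  fraction_correct = c(0.85, 0.89, 0.71, 0.83, 0.74, 0.76, 0.56, 0.81, 0.89),
  count = c(22, 14, 27, 38, 16, 10, 83, 17, 26)
)

is_python <- course_1_summary$module_name == "Introduction to Python"

# Question-weighted average over all modules; should be close to mean(course_1$fraction).
with_python <- weighted.mean(course_1_summary$fraction_correct,
                             course_1_summary$count)

# Same average with the optional Python module left out.
without_python <- weighted.mean(course_1_summary$fraction_correct[!is_python],
                                course_1_summary$count[!is_python])

print(round(c(with_python, without_python), 2))
```

The first value is roughly 0.73, in line with the reported course average, and the second is roughly 0.81. This suggests that the Python module, with 83 of the 253 questions, accounts for most of the gap between course 1 and the better-performing courses.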
In the other modules, most low correct-rate questions have no clear pattern. They are sometimes vague and do not test any serious learning objective (marked in the Excel sheet).
Some specific low correct-rate (graded) questions are:
1. What is VLOOKUP used for?
2. Which types of variables are generally put in columns of a pivot table?
Recommendations:
# Course 2 analysis
course_2 %>% group_by(module_name, graded) %>%
summarise(fraction_correct = round(mean(fraction), 2), count = n())
## Source: local data frame [12 x 4]
## Groups: module_name [?]
##
## module_name graded
## <fctr> <lgl>
## 1 Data Visualization - How to Make Data Presentable! FALSE
## 2 Data Visualization - How to Make Data Presentable! TRUE
## 3 Descriptive Statistics FALSE
## 4 Descriptive Statistics TRUE
## 5 Design of Experiments FALSE
## 6 Design of Experiments TRUE
## 7 Exploratory Data Analysis FALSE
## 8 Exploratory Data Analysis TRUE
## 9 Hypothesis Testing FALSE
## 10 Hypothesis Testing TRUE
## 11 Inferential Statistics FALSE
## 12 Inferential Statistics TRUE
## # ... with 2 more variables: fraction_correct <dbl>, count <int>
write.csv(course_2, "course_2.csv")
In almost all modules, performance in graded questions is significantly better than in non-graded ones.
Performance in course 2 is fairly good, apart from some conceptual questions (marked in the Excel sheet) that most people got wrong. Some examples of such questions:
- Application of joint probability (conceptual)
- Application of binomial distribution (conceptual)
- Terminology: sample and population standard deviation
- Recalling concepts from the previous course (questions on R)
- Questions on skewness
It should be noted that there are quite a few moderately difficult questions from topics such as hypothesis testing, probability, descriptive statistics etc. which more than 80% of learners have attempted correctly.
Many of these questions require some computation using CSV files etc., indicating that people attempt these questions.
Questions on hypothesis testing, percentiles and two-way tables have high correct rates.
Recommendations:
# Course 3 analysis
course_3 %>% group_by(module_name, graded) %>%
summarise(fraction_correct = round(mean(fraction), 2), count = n())
## Source: local data frame [14 x 4]
## Groups: module_name [?]
##
## module_name graded
## <fctr> <lgl>
## 1 Data Preparation FALSE
## 2 Data Preparation TRUE
## 3 Introduction to Models FALSE
## 4 Introduction to Models TRUE
## 5 Linear Regression FALSE
## 6 Linear Regression TRUE
## 7 Supervised Classification - I FALSE
## 8 Supervised Classification - I TRUE
## 9 Supervised Classification II- Logistic Regression FALSE
## 10 Supervised Classification II- Logistic Regression TRUE
## 11 Supervised Classification III: Support Vector Machines FALSE
## 12 Supervised Classification III: Support Vector Machines TRUE
## 13 Unsupervised Learning: Clustering FALSE
## 14 Unsupervised Learning: Clustering TRUE
## # ... with 2 more variables: fraction_correct <dbl>, count <int>
write.csv(course_3, "course_3.csv")
As seen earlier, performance in graded questions is significantly better than in non-graded ones.
Some concepts where performance in graded questions is worse than average are:
- Multicollinearity
- Step-wise variable selection
- Identifying a given task as that of clustering / supervised learning
- Learning algorithm of k-means and hierarchical clustering
- Decile analysis in logistic regression
Recommendations:
# Course 4 analysis
course_4 %>% group_by(module_name, graded) %>%
summarise(fraction_correct = round(mean(fraction), 2), count = n())
## Source: local data frame [14 x 4]
## Groups: module_name [?]
##
## module_name graded fraction_correct count
## <fctr> <lgl> <dbl> <int>
## 1 Advanced Regression FALSE 0.80 17
## 2 Advanced Regression TRUE 0.71 11
## 3 Association Rule Mining FALSE 0.69 26
## 4 Association Rule Mining TRUE 0.81 23
## 5 Decision Trees FALSE 0.77 31
## 6 Decision Trees TRUE 0.83 5
## 7 Ensembles FALSE 0.68 19
## 8 Ensembles TRUE 0.69 10
## 9 Model Selection FALSE 0.81 18
## 10 Model Selection TRUE 0.66 10
## 11 Neural Networks FALSE 0.72 56
## 12 Neural Networks TRUE 0.79 15
## 13 Time Series Analysis FALSE 0.73 38
## 14 Time Series Analysis TRUE 0.79 10
write.csv(course_4, "course_4.csv")
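In PA-2 (course 4), the hypothesis that people score higher in graded questions holds only for some modules. It fails for Model Selection (0.66 graded vs 0.81 non-graded) and Advanced Regression (0.71 vs 0.80), and for Ensembles the two rates are practically equal (0.69 vs 0.68).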
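The graded versus non-graded comparison can be made explicit by computing the gap per module. The sketch below re-enters the rounded values from the course 4 table above in a wide layout; the `gap` column is a name introduced here for illustration.

```r
# Rounded fraction_correct values copied from the course 4 table above.
course_4_summary <- data.frame(
  module_name = c("Advanced Regression", "Association Rule Mining",
                  "Decision Trees", "Ensembles", "Model Selection",
                  "Neural Networks", "Time Series Analysis"),
  non_graded = c(0.80, 0.69, 0.77, 0.68, 0.81, 0.72, 0.73),
  graded     = c(0.71, 0.81, 0.83, 0.69, 0.66, 0.79, 0.79),
  stringsAsFactors = FALSE
)

# Positive gap: learners do better on graded questions in that module.
course_4_summary$gap <- round(course_4_summary$graded -
                              course_4_summary$non_graded, 2)

# Modules where graded performance is lower than non-graded performance.
graded_lower <- course_4_summary$module_name[course_4_summary$gap < 0]

# Sort from the most negative gap to the most positive.
print(course_4_summary[order(course_4_summary$gap), ])
print(graded_lower)
```

With these rounded inputs, only Model Selection and Advanced Regression show a negative gap, while Ensembles is positive by just 0.01.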
A module-wise summary of weak areas:
Recommendations:
# Course 5 analysis
course_5 %>% group_by(module_name, graded) %>%
summarise(fraction_correct = round(mean(fraction), 2), count = n())
## Source: local data frame [8 x 4]
## Groups: module_name [?]
##
## module_name graded fraction_correct count
## <fctr> <lgl> <dbl> <int>
## 1 Big Data Analytics using Apache Spark FALSE 0.84 2
## 2 Big Data Analytics using Apache Spark TRUE 0.94 4
## 3 Big Data processing with Hadoop FALSE 0.80 9
## 4 Big Data processing with Hadoop TRUE 0.79 11
## 5 Data Ingestion and Processing FALSE 0.87 8
## 6 Data Ingestion and Processing TRUE 0.84 10
## 7 Introduction to Big Data FALSE 0.68 4
## 8 Introduction to Big Data TRUE 0.88 4
write.csv(course_5, "course_5.csv")
A few areas where most people got graded questions wrong are:
- HDFS: name nodes, application masters etc.
Recommendations:
# Course 6 analysis
course_6 %>% group_by(module_name, graded) %>%
summarise(fraction_correct = round(mean(fraction), 2), count = n())
## Source: local data frame [19 x 4]
## Groups: module_name [?]
##
## module_name graded fraction_correct
## <fctr> <lgl> <dbl>
## 1 A/B Testing FALSE 0.70
## 2 Acquisition Analytics FALSE 0.49
## 3 Acquisition Analytics TRUE 0.80
## 4 Drug Lifecycle FALSE 0.53
## 5 Drug Lifecycle TRUE 0.50
## 6 Engagement Analytics TRUE 0.77
## 7 Healthcare Data Understanding and Analysis FALSE 0.61
## 8 Healthcare Data Understanding and Analysis TRUE 0.54
## 9 Introduction to Banking and Financial Services FALSE 0.71
## 10 Introduction to e-commerce FALSE 0.90
## 11 Market Mix Modelling FALSE 0.87
## 12 Market Mix Modelling TRUE 0.90
## 13 Price Optimisation FALSE 0.70
## 14 Price Optimisation TRUE 0.82
## 15 Recommendation Systems FALSE 0.84
## 16 Recommendation Systems TRUE 0.72
## 17 Risk Analytics FALSE 0.68
## 18 Risk Analytics TRUE 0.78
## 19 Understanding the Healthcare Domain FALSE 0.68
## # ... with 1 more variables: count <int>
write.csv(course_6, "course_6.csv")
Some observations:
- Healthcare: extremely low correct rates, in both graded and non-graded questions.
- E-commerce: decent correct rates, with an average of 81% overall and 82% in graded questions.
- BFS: most graded questions are answered correctly (78.5% correct rate); the mistakes are mainly in conceptual, domain-related questions in risk analytics.
Based on this analysis and the interview feedback calls, the following observations are clear:
Thus, the recommendations at the program level are as follows:
Non-graded questions should be repeated/reframed and included in the graded questions, thereby increasing the number of graded questions and testing a wider set of concepts
If possible, a coding console should be used heavily to help learners practice R/SQL and other hands-on exercises throughout the program (rather than in 1-2 modules)
Placement content could potentially be made graded, and should include R, SQL, basic concepts of statistics, model building, model selection and evaluation etc.