DA Learning Analysis

The objective of this analysis is to 1) understand the types of questions and concepts where learners make the most mistakes, and 2) identify specific modules and sessions where learner performance is strong, medium or weak.

## checkbox  numeric    radio 
##      604       32     4695

First, let’s compare the number of questions asked in each module and course.

unique_questions = data[which(!duplicated(data$question_id)), ]
unique_questions %>% group_by(module_name) %>% summarise(n())

## # A tibble: 42 x 2
##                                            module_name   n()
##                                                 <fctr> <int>
## 1                                          A/B Testing     4
## 2                                Acquisition Analytics    15
## 3                                  Advanced Regression    28
## 4                              Association Rule Mining    49
## 5                Big Data Analytics using Apache Spark     6
## 6                      Big Data processing with Hadoop    20
## 7                      Business and Data Understanding    36
## 8                        Data Ingestion and Processing    18
## 9                                     Data Preparation    35
## 10 Data Visualization - How to  Make Data Presentable!    51
## # ... with 32 more rows

There are a total of 1358 questions (excluding the ‘Welcome’ module).
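
The de-duplication step above keeps the first row for each `question_id`. As a minimal sketch, the same idea can be written with dplyr's `distinct()` and `count()`. The `toy` data frame and its values are hypothetical and only stand in for `data`.

```r
library(dplyr)

# Hypothetical stand-in for `data`: several rows can share a question_id
toy <- data.frame(
  question_id = c(1, 1, 2, 3, 3, 3),
  module_name = c("A", "A", "A", "B", "B", "B"),
  user_cnt    = c(10, 5, 8, 4, 6, 2)
)

# Keep the first row for each question_id (same effect as !duplicated())
unique_toy <- toy %>% distinct(question_id, .keep_all = TRUE)

# Number of unique questions per module, with a readable column name
questions_per_module <- unique_toy %>% count(module_name, name = "num_questions")
print(questions_per_module)
```

Naming the count column (here `num_questions`) avoids the `n()` header seen in the printed tables.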

Comparing the number of questions in each course and module.

unique_questions %>% group_by(course_number) %>% summarise(n())

## # A tibble: 7 x 2
##   course_number   n()
##           <int> <int>
## 1             0    62
## 2             1   253
## 3             2   241
## 4             3   269
## 5             4   289
## 6             5    52
## 7             6   192

The Big Data course (course 5) has only 52 questions, far fewer than courses 1 to 4 and 6 (192 to 289 each).

Let’s create separate dataframes for each course.

Questions Correctness Analysis

# aggregate by question: total learners, number who answered correctly, and the fraction correct
questions_summary = data %>% group_by(question_id) %>% 
  summarise(count = sum(user_cnt), 
            correct = sum(user_cnt*is_correct), 
            fraction = round(correct/count, 2))
course_names = data[, c(5, 6, 12, 16, 17, 18)]
all_courses_summary = merge(questions_summary, course_names, by="question_id", all.x = F)
all_courses = all_courses_summary[which(!duplicated(all_courses_summary$question_id)), ]

Let’s look at some basic statistics:

What is the average of the fraction of students who have got correct answers?
How does the average compare across courses?
How does the average compare across graded and non-graded questions?

# avg of correct fraction
mean(all_courses$fraction)

## [1] 0.7561929

# comparing avg correct across courses
all_courses %>% group_by(course_number) %>% summarise(avg = round(mean(fraction), 2), num_questions = n())

## # A tibble: 7 x 3
##   course_number   avg num_questions
##           <int> <dbl>         <int>
## 1             0  0.55            62
## 2             1  0.73           253
## 3             2  0.83           241
## 4             3  0.80           269
## 5             4  0.74           289
## 6             5  0.83            52
## 7             6  0.70           192

# How does the average compare across graded and non-graded questions?
all_courses %>% group_by(graded) %>% summarise(avg = round(mean(fraction), 2), num_questions = n())

## # A tibble: 2 x 3
##   graded   avg num_questions
##    <lgl> <dbl>         <int>
## 1  FALSE  0.72           799
## 2   TRUE  0.81           559

On average, 75.6% of learners get a question correct (in either 1 or 2 attempts). The average varies considerably across courses, from 55% in course 0 (placement content) to 83% in courses 2 and 5; course 3 (PA-I) averages 80%.

Also, graded questions have a higher average correct rate (81%) than non-graded ones (72%).
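
Note that `mean(all_courses$fraction)` gives every question equal weight, regardless of how many learners attempted it. The sketch below contrasts that with an attempt-weighted rate. The `toy_summary` data frame is hypothetical, and it assumes that `count` is the number of learners who attempted a question and `correct` is the number who answered it correctly, as in `questions_summary`.

```r
# Hypothetical question-level summary with the same columns as questions_summary
toy_summary <- data.frame(
  question_id = 1:3,
  count   = c(100, 10, 50),  # learners who attempted the question (assumed meaning)
  correct = c(90, 2, 40)     # learners who answered correctly
)
toy_summary$fraction <- round(toy_summary$correct / toy_summary$count, 2)

# Unweighted: every question counts equally (what mean(fraction) reports)
unweighted_rate <- mean(toy_summary$fraction)

# Weighted: every learner attempt counts equally
weighted_rate <- sum(toy_summary$correct) / sum(toy_summary$count)

print(c(unweighted = unweighted_rate, weighted = weighted_rate))
```

The two numbers diverge when rarely attempted questions have unusual correct rates, so it may be worth reporting both.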

Let’s drill down into individual questions in each course.

# creating data frames for individual courses
course_1 = filter(all_courses, course_number == 1)
course_2 = filter(all_courses, course_number == 2)
course_3 = filter(all_courses, course_number == 3)
course_4 = filter(all_courses, course_number == 4)
course_5 = filter(all_courses, course_number == 5)
course_6 = filter(all_courses, course_number == 6)
course_0 = filter(all_courses, course_number == 0)
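
As an alternative to seven separate `filter()` calls, base R's `split()` can build one data frame per course in a single step. This is only a sketch; `toy_courses` is a hypothetical stand-in for `all_courses`.

```r
# Hypothetical stand-in for all_courses
toy_courses <- data.frame(
  question_id   = 1:5,
  course_number = c(0, 1, 1, 2, 2),
  fraction      = c(0.55, 0.70, 0.76, 0.80, 0.86)
)

# One data frame per course, in a list named by course number ("0", "1", "2")
by_course <- split(toy_courses, toy_courses$course_number)

# Mean correct fraction for each course
course_means <- sapply(by_course, function(df) mean(df$fraction))
print(course_means)
```

Individual courses are then available as `by_course[["1"]]`, `by_course[["2"]]` and so on.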

Course 0 Analysis

# Course 0 analysis
course_0 %>% group_by(module_name) %>% 
  summarise(fraction_correct = round(mean(fraction), 2), count = n())

## # A tibble: 1 x 3
##         module_name fraction_correct count
##              <fctr>            <dbl> <int>
## 1 Placement Support             0.55    62

write.csv(course_0, "course_0.csv")

In placement questions, the correct answer rate is only 55%. Drilling down, the questions with a correct rate below 50% fall into two groups:
1. Fundamental concepts, such as cross-validation, overfitting, application of VIF etc.
2. Questions requiring a decent amount of thinking and effort, e.g. calculation of confidence intervals using z-scores, calculation of log odds etc.
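
The drill-down into questions below the 50% threshold can be sketched as follows. The `toy_course` data frame, its values and the `question_name` column are hypothetical; the real `course_0` may use different column names.

```r
library(dplyr)

# Hypothetical stand-in for course_0; question_name is an assumed column
toy_course <- data.frame(
  question_id   = 1:4,
  question_name = c("Cross-validation", "VIF", "SQL joins", "Log odds"),
  fraction      = c(0.42, 0.35, 0.81, 0.48)
)

# Questions answered correctly by fewer than half of the learners, worst first
weak_questions <- toy_course %>%
  filter(fraction < 0.5) %>%
  arrange(fraction)
print(weak_questions)
```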

Based on these observations and the interview feedback, the recommendations for placement content are:

It could be potentially made graded and compulsory
SQL and R should be heavily emphasised as part of placement practice
Basic concepts such as cross-validation, overfitting, model building and evaluation should be focused upon (especially three topics - linear regression, logistic regression and k-means clustering)

Course 1 Analysis

# Course 1 analysis
mean(course_1$fraction)

## [1] 0.7304743

course_1 %>% group_by(module_name, graded) %>% 
  summarise(fraction_correct = round(mean(fraction), 2), count = n())

## Source: local data frame [9 x 4]
## Groups: module_name [?]
## 
##                                module_name graded fraction_correct count
##                                     <fctr>  <lgl>            <dbl> <int>
## 1          Business and Data Understanding  FALSE             0.85    22
## 2          Business and Data Understanding   TRUE             0.89    14
## 3       How Data Exists Within Enterprises  FALSE             0.71    27
## 4       How Data Exists Within Enterprises   TRUE             0.83    38
## 5           Introduction to Data Analytics  FALSE             0.74    16
## 6           Introduction to Data Analytics   TRUE             0.76    10
## 7                   Introduction to Python  FALSE             0.56    83
## 8 Language of Data Analysts - Introduction  FALSE             0.81    17
## 9 Language of Data Analysts - Introduction   TRUE             0.89    26

write.csv(course_1, "course_1.csv")

In almost all modules, performance in graded questions is significantly better than in non-graded ones.

Introduction to Python has a correct rate of only 56%. This may be because the module is optional, so learners do not take the questions seriously and browse through the videos casually. All of these questions were non-graded.

In the other modules, most low correct-rate questions show no clear pattern. They are sometimes vague and do not test any serious learning objective (marked in the Excel sheet).

Some specific low correct-rate (graded) questions are:
1. What is VLOOKUP used for?
2. Which types of variables are generally put in columns of a pivot table?

Recommendations:

At least some questions on Excel should be made graded.
Python needs to be redeveloped, and questions need improvement - can replace existing content with that of the ML/AI program

Course 2 Analysis

# Course 2 analysis
course_2 %>% group_by(module_name, graded) %>% 
  summarise(fraction_correct = round(mean(fraction), 2), count = n())

## Source: local data frame [12 x 4]
## Groups: module_name [?]
## 
##                                            module_name graded
##                                                 <fctr>  <lgl>
## 1  Data Visualization - How to  Make Data Presentable!  FALSE
## 2  Data Visualization - How to  Make Data Presentable!   TRUE
## 3                               Descriptive Statistics  FALSE
## 4                               Descriptive Statistics   TRUE
## 5                                Design of Experiments  FALSE
## 6                                Design of Experiments   TRUE
## 7                            Exploratory Data Analysis  FALSE
## 8                            Exploratory Data Analysis   TRUE
## 9                                   Hypothesis Testing  FALSE
## 10                                  Hypothesis Testing   TRUE
## 11                              Inferential Statistics  FALSE
## 12                              Inferential Statistics   TRUE
## # ... with 2 more variables: fraction_correct <dbl>, count <int>

write.csv(course_2, "course_2.csv")
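
The printed summary above is truncated ("with 2 more variables") because tibbles hide columns that do not fit the console width. Two ways to show every column are sketched below; `toy_modules`, its module names and its values are hypothetical.

```r
library(dplyr)

# Hypothetical module summary with a deliberately long module name
toy_modules <- tibble(
  module_name = c("A module with a long, descriptive name that fills the console",
                  "A short module"),
  graded = c(FALSE, TRUE),
  fraction_correct = c(0.80, 0.85),
  count = c(30L, 21L)
)

# Option 1: ask the tibble printer to show every column regardless of width
print(toy_modules, width = Inf)

# Option 2: convert to a plain data frame, which always prints all columns
full_table <- as.data.frame(toy_modules)
print(full_table)
```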

In almost all modules, performance in graded questions is significantly better than in non-graded ones.

Performance in course 2 is fairly good, apart from some conceptual questions (marked in the Excel sheet) that most learners got wrong. Some examples of such questions:
- Application of joint probability (conceptual)
- Application of binomial distribution (conceptual)
- Terminology: sample and population standard deviation
- Recalling concepts from the previous course (questions on R)
- Questions on skewness

It should be noted that there are quite a few moderately difficult questions from topics such as hypothesis testing, probability and descriptive statistics that more than 80% of learners have answered correctly.

Many of these questions require some computation using CSV files etc., indicating that learners make a genuine attempt at these questions
Questions on hypothesis testing, percentiles, two-way tables have high correct rates

Recommendations:

To improve the ability to recall concepts, coding console exercises should be increased

Course 3 Analysis

# Course 3 analysis
course_3 %>% group_by(module_name, graded) %>% 
  summarise(fraction_correct = round(mean(fraction), 2), count = n())

## Source: local data frame [14 x 4]
## Groups: module_name [?]
## 
##                                               module_name graded
##                                                    <fctr>  <lgl>
## 1                                        Data Preparation  FALSE
## 2                                        Data Preparation   TRUE
## 3                                  Introduction to Models  FALSE
## 4                                  Introduction to Models   TRUE
## 5                                      Linear Regression   FALSE
## 6                                      Linear Regression    TRUE
## 7                           Supervised Classification - I  FALSE
## 8                           Supervised Classification - I   TRUE
## 9       Supervised Classification II- Logistic Regression  FALSE
## 10      Supervised Classification II- Logistic Regression   TRUE
## 11 Supervised Classification III: Support Vector Machines  FALSE
## 12 Supervised Classification III: Support Vector Machines   TRUE
## 13                      Unsupervised Learning: Clustering  FALSE
## 14                      Unsupervised Learning: Clustering   TRUE
## # ... with 2 more variables: fraction_correct <dbl>, count <int>

write.csv(course_3, "course_3.csv")

As seen earlier, performance in graded questions is significantly better than in non-graded ones.

Some concepts where performance in graded questions is worse than average are:
- Multicollinearity
- Step-wise variable selection
- Identifying a given task as that of clustering / supervised learning
- Learning algorithm of k-means and hierarchical clustering
- Decile analysis in logistic regression

Recommendations:

Should increase the number of graded questions on basic, important concepts listed above
Three modules are crucial from an interview POV: Linear regression, Logistic regression and k-means clustering

Course 4 Analysis

# Course 4 analysis
course_4 %>% group_by(module_name, graded) %>% 
  summarise(fraction_correct = round(mean(fraction), 2), count = n())

## Source: local data frame [14 x 4]
## Groups: module_name [?]
## 
##                module_name graded fraction_correct count
##                     <fctr>  <lgl>            <dbl> <int>
## 1      Advanced Regression  FALSE             0.80    17
## 2      Advanced Regression   TRUE             0.71    11
## 3  Association Rule Mining  FALSE             0.69    26
## 4  Association Rule Mining   TRUE             0.81    23
## 5           Decision Trees  FALSE             0.77    31
## 6           Decision Trees   TRUE             0.83     5
## 7                Ensembles  FALSE             0.68    19
## 8                Ensembles   TRUE             0.69    10
## 9         Model Selection   FALSE             0.81    18
## 10        Model Selection    TRUE             0.66    10
## 11         Neural Networks  FALSE             0.72    56
## 12         Neural Networks   TRUE             0.79    15
## 13    Time Series Analysis  FALSE             0.73    38
## 14    Time Series Analysis   TRUE             0.79    10

write.csv(course_4, "course_4.csv")

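In PA-2 (course 4), the hypothesis that learners score higher in graded questions holds only for some modules. It does not hold for Model Selection (0.66 graded vs 0.81 non-graded) or Advanced Regression (0.71 vs 0.80), and the difference is negligible for Ensembles (0.69 vs 0.68).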
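
The graded versus non-graded comparison can be made explicit by computing the gap per module. The sketch below uses the module-level values from the course 4 summary table above; the names `course_4_modules`, `graded_rate`, `non_graded_rate` and `gap` are introduced here for illustration only.

```r
library(dplyr)

# Module-level values copied from the course 4 summary table
course_4_modules <- data.frame(
  module_name = rep(c("Advanced Regression", "Association Rule Mining",
                      "Decision Trees", "Ensembles", "Model Selection",
                      "Neural Networks", "Time Series Analysis"), each = 2),
  graded = rep(c(FALSE, TRUE), times = 7),
  fraction_correct = c(0.80, 0.71, 0.69, 0.81, 0.77, 0.83, 0.68, 0.69,
                       0.81, 0.66, 0.72, 0.79, 0.73, 0.79)
)

# One row per module: graded minus non-graded correct fraction, lowest gap first
graded_gap <- course_4_modules %>%
  group_by(module_name) %>%
  summarise(
    non_graded_rate = fraction_correct[!graded],
    graded_rate     = fraction_correct[graded],
    gap             = round(graded_rate - non_graded_rate, 2)
  ) %>%
  arrange(gap)
print(as.data.frame(graded_gap))
```

With these values the gap is negative for Model Selection (-0.15) and Advanced Regression (-0.09), close to zero for Ensembles (0.01), and positive for the remaining modules.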

A module-wise summary of weak areas:

Model Selection:

Conceptual questions are mostly answered incorrectly, although these questions were difficult
Identifying linear and non-linear models
The concept of model complexity

Advanced regression:

Matrices and least squares minimisation

Ensembles:

Boosting (all basic concepts)
Regularisation concepts

Recommendations:

Course 5 Analysis

# Course 5 analysis
course_5 %>% group_by(module_name, graded) %>% 
  summarise(fraction_correct = round(mean(fraction), 2), count = n())

## Source: local data frame [8 x 4]
## Groups: module_name [?]
## 
##                             module_name graded fraction_correct count
##                                  <fctr>  <lgl>            <dbl> <int>
## 1 Big Data Analytics using Apache Spark  FALSE             0.84     2
## 2 Big Data Analytics using Apache Spark   TRUE             0.94     4
## 3       Big Data processing with Hadoop  FALSE             0.80     9
## 4       Big Data processing with Hadoop   TRUE             0.79    11
## 5         Data Ingestion and Processing  FALSE             0.87     8
## 6         Data Ingestion and Processing   TRUE             0.84    10
## 7              Introduction to Big Data  FALSE             0.68     4
## 8              Introduction to Big Data   TRUE             0.88     4

write.csv(course_5, "course_5.csv")

One area where most learners answered graded questions incorrectly is HDFS (name nodes, application masters etc.).

Recommendations:

Need to increase the quality and quantity of questions significantly
From a placement POV: learners should be advised to state explicitly that they are entry-level Big Data analysts, so that interviewers do not expect much BDE knowledge

Course 6 Analysis

# Course 6 analysis
course_6 %>% group_by(module_name, graded) %>% 
  summarise(fraction_correct = round(mean(fraction), 2), count = n())

## Source: local data frame [19 x 4]
## Groups: module_name [?]
## 
##                                        module_name graded fraction_correct
##                                             <fctr>  <lgl>            <dbl>
## 1                                      A/B Testing  FALSE             0.70
## 2                            Acquisition Analytics  FALSE             0.49
## 3                            Acquisition Analytics   TRUE             0.80
## 4                                   Drug Lifecycle  FALSE             0.53
## 5                                   Drug Lifecycle   TRUE             0.50
## 6                             Engagement Analytics   TRUE             0.77
## 7       Healthcare Data Understanding and Analysis  FALSE             0.61
## 8       Healthcare Data Understanding and Analysis   TRUE             0.54
## 9  Introduction to Banking and Financial Services   FALSE             0.71
## 10                      Introduction to e-commerce  FALSE             0.90
## 11                            Market Mix Modelling  FALSE             0.87
## 12                            Market Mix Modelling   TRUE             0.90
## 13                              Price Optimisation  FALSE             0.70
## 14                              Price Optimisation   TRUE             0.82
## 15                          Recommendation Systems  FALSE             0.84
## 16                          Recommendation Systems   TRUE             0.72
## 17                                 Risk Analytics   FALSE             0.68
## 18                                 Risk Analytics    TRUE             0.78
## 19             Understanding the Healthcare Domain  FALSE             0.68
## # ... with 1 more variables: count <int>

write.csv(course_6, "course_6.csv")

Some observations:
- Healthcare: correct rates are low in both graded and non-graded questions (50-68% across the three healthcare modules)
- E-commerce: decent correct rates, with an 81% average and 82% in graded questions
- BFS: most graded questions are answered correctly (78.5% correct rate); the mistakes are mainly in conceptual, domain-related questions in risk analytics

Recommendations at the Program Level

Based on this analysis and the interview feedback calls, the following observations are clear:

Learners attempt graded questions more sincerely than non-graded ones
In most typical interviews, coding questions on SQL and R are heavily asked (especially in the first few rounds)
Questions on basic concepts, such as cross-validation, linear and logistic regression, model evaluation etc. are heavily asked and learners often get them wrong both in interviews and in graded questions

Thus, the recommendations at the program level are as follows:

Non-graded questions should be repeated/reframed and included in the graded questions, thereby increasing the number of graded questions and testing a wider set of concepts
If possible, coding console should be included heavily to help learners practice R/SQL and other hands-on exercises throughout the program (rather than in 1-2 modules)
Placement content can be potentially made graded, and should include R, SQL, basic concepts of statistics, model building, model selection and evaluation etc.