Research Objective

Following our two previous reports, initial report and sprint 2 report, we sought to further analyse the dataset provided to us by RMIT. (That can be found here.)

Within this report, we set out to the understand the effects of a student’s achievements in correlation to the length of a course. That is, how does the course length affect the student’s achievements in the course? The datasets used to analyse this question were courses which shows the courses offered by the university, assessments which showed each assessment the course had and finally studentAssessment which showed the score each student received for the assessments they submitted. We separated the major task into smaller tasks in hope of finding a rule that made it possible to predict final_result of the students in correlation to the course length.

Methodology

For this research question, we split the tasks into three dfifferent tasks. This was so that we can find out if there are different formulas that allow the prediction of a student’s result.

The process for reading and merging the files was similar across all three tasks. The packages used for these tasks were knitr, dplyr, arules, arulesviz. At first, we looked through the tables that were needed for this question to find the columns needed. We found that the first two datasets assessments and courses had the columns code_module and code_presentation matching. So, we merged the two tables together by the common columns using the merge function. This created a new table that had the column module_presentation_length added to the corresponding course.

kable(tail(merge1)) %>% 
  kable_styling(fixed_thead = T, bootstrap_options = c("striped", "hover", "condensed", full_width = F ))
code_module code_presentation id_assessment assessment_type date weight module_presentation_length
201 GGG 2014J 37442 CMA 229 0 269
202 GGG 2014J 37443 CMA 229 0 269
203 GGG 2014J 37435 TMA 61 0 269
204 GGG 2014J 37436 TMA 124 0 269
205 GGG 2014J 37437 TMA 173 0 269
206 GGG 2014J 37444 Exam 229 100 269

Following that, we needed to merge the new table with the table studentAssessment by the common column they hadid_assessment, using the merge function. The second merged table added the id_student, date_submitted, is_bank and score columns for each assessment the student submitted.

kable(tail(merge2)) %>% 
  kable_styling(fixed_thead = T, bootstrap_options = c("striped", "hover", "condensed", full_width = F)) %>%
  scroll_box(width = "100%")
id_assessment code_module code_presentation assessment_type date weight module_presentation_length id_student date_submitted is_banked score
173925 37443 GGG 2014J CMA 229 0 269 527538 227 0 60
173926 37443 GGG 2014J CMA 229 0 269 534672 229 0 100
173927 37443 GGG 2014J CMA 229 0 269 546286 215 0 80
173928 37444 GGG 2014J Exam 229 100 269 NA NA NA NA
173929 40087 CCC 2014B Exam NA 100 241 NA NA NA NA
173930 40088 CCC 2014J Exam NA 100 269 NA NA NA NA

Task 1:

In this task, we used a formula to find associations between the columns. Formula is: code_module, code_presentation, module_presentation_length => final_results.

After finishing the merging of the needed tables, we needed to create a new table that would show only the relevant information combined with final_result of each student in the courses they took. First, we used the function group_by to group the table by the columns code_module, code_presentation, module_presentation_length and id_student. Then using the summarise function we added the column final_result which showed the calculated sum of the students’ score using the following algorithm final_result = (sum(final_score = ((weight / 100) * score))) . This calculates the score for each assessment with the weight of the assessment. By using this method, the columns that we didn’t specify to group by were removed from the table. We removed the column id_student so that it’ll be easier to find any association rules in the data using newtable2$id_student <- NULL. Following that, we cleared any NULL values from the table to make sure that the code for the association rules will not use any NULL values. This was done by using newtable2 <- newtable2[complete.cases(newtable2), ] then changing all the columns in the table to factors, using the function as.factor. that way it will be usable by the association rules code.

Resulting table:

code_module code_presentation module_presentation_length final_result
FFF 2014J 269 73.25
FFF 2014J 269 8.5
FFF 2014J 269 41.5
FFF 2014J 269 67
FFF 2014J 269 24
FFF 2014J 269 84.25

Task 2:

In this task we used a formula to find associations between the columns. Formula used code_module, code_presentation, weight => score.

In this section, we also needed to create a table with the relevant information. First, we grouped the table using the group_by function. The code_module, code_presentation, weight and score were all used in the grouping. Furthermore, we needed to remove any irrelevant columns in the table that were not needed. The removed columns include date_submitted, is_banked, module_presentation_length and date. We set these columns to NULL once again using the function t1$date_submitted <- NULL. As with the last task, we needed to clear any NULL values from the table, making sure the code for the association rules will not use any NULL values. This was done using t1 <- t1[complete.cases(t1), ] then changed all the columns in the table to factors, using the function as.factor That way it will be usable by the association rules code.

Resulting table:

id_assessment code_module code_presentation assessment_type weight id_student score
37443 GGG 2014J CMA 0 443765 60
37443 GGG 2014J CMA 0 465952 80
37443 GGG 2014J CMA 0 546724 100
37443 GGG 2014J CMA 0 527538 60
37443 GGG 2014J CMA 0 534672 100
37443 GGG 2014J CMA 0 546286 80

Task 3:

In this task we used a formula to find associations between the columns. Formula used assessment_type, module_presentation_length, code_module => score. This task was very similar to task 2 but in this task, we grouped it by assessment_type, module_presentation_length, code_module and score. Again, removing any irrelevant information. So, we removed the columns date_submitted, is_banked, date, id_student and weight. We set those columns to NULL and removed the `NULL values using m2 <- m2[complete.cases(m2), ] , then changed all the columns in the table to factors, using the function as.factor, That way it will be usable by the association rules code.

Resulting table:

id_assessment code_module code_presentation module_presentation_length assessment_type score
37443 GGG 2014J 269 CMA 60
37443 GGG 2014J 269 CMA 80
37443 GGG 2014J 269 CMA 100
37443 GGG 2014J 269 CMA 60
37443 GGG 2014J 269 CMA 100
37443 GGG 2014J 269 CMA 80

After finishing manipulating the data, we needed to produce association rules to find a prediction for each of the formulas. To produce the rules the packages arules and arulesviz were used with the function apriori . this was to determine the set of rules from the new table created, using the parameters support and confidence. We then used the function sort to sort out the produced rules. For each part, we used different sorting parameters, so the rules could be sorted by Lift, support or confidence.

Support: The support represents the frequency of an itemset in the data.

Confidence: The confidence shows how often a rule is found to be true. E.g. if x is bought, how often is y bought. In this context, rather than x and y, the terms Left-Hand-Side (LHS) and Right-Hand-Side (RHS) are used.

Lift: Lift provides the information if a rule LHS -> RHS is random (LHS and RHS are independent) or not. If Lift > 1, both occurrences are dependent. Only for Lift greater than 1 a potential useful rule can be found.

For task 1 we chose to sort them by lift in a decreasing order, because it showed the most values, we can use on the LHS. If support is used, the LHS doesn’t show any values in it, if ‘confidence’ is used the LHS is module_presentation_length, but when lift is used the LHS values are code_module and code_presentation mainly. After sorting we printed out the top 5 rules using the inspect function.

##     lhs                          rhs                                 support confidence      lift count
## [1] {code_module=BBB,                                                                                  
##      code_presentation=2014B} => {module_presentation_length=234} 0.05106858          1 19.581513  1190
## [2] {code_module=DDD,                                                                                  
##      code_presentation=2013J} => {module_presentation_length=261} 0.06432924          1 15.545030  1499
## [3] {code_module=DDD,                                                                                  
##      code_presentation=2014J} => {module_presentation_length=262} 0.06076732          1  7.762159  1416
## [4] {code_module=BBB,                                                                                  
##      code_presentation=2014J} => {module_presentation_length=262} 0.06806283          1  7.762159  1586
## [5] {code_presentation=2013B} => {module_presentation_length=240} 0.16127371          1  6.200639  3758
##     lhs    rhs                                 support   confidence lift
## [1] {}  => {assessment_type=TMA}               0.5655207 0.5655207  1   
## [2] {}  => {assessment_type=CMA}               0.4059365 0.4059365  1   
## [3] {}  => {id_assessment=[3.49e+04,3.74e+04]} 0.3378171 0.3378171  1   
## [4] {}  => {id_student=[6.2e+05,2.7e+06]}      0.3333966 0.3333966  1   
## [5] {}  => {id_assessment=[2.43e+04,3.49e+04)} 0.3333909 0.3333909  1   
##     count
## [1] 98253
## [2] 70527
## [3] 58692
## [4] 57924
## [5] 57923

For task 1, we decided to remove any redundant rules by using using the formula subRules <- which(colSums(is.subset(rules2, rules2)) > 1) to create a subset of rules then using the function rules2 <- rules2[-subRules] to remove the redundant rules. For task 2, we sorted the rules by using the support parameters and for task 3, we sorted the rules by using the confidence parameters, to show the different results we received using the different parameters.

For task 3 we did the same as task 1 except we sorted it by confidence to show the different results.

# print top 5 rules
inspect(rules3[1:5]) 
##     lhs                                    rhs                                   support confidence     lift count
## [1] {module_presentation_length=268}    => {code_presentation=2013J}           0.2020560  1.0000000 3.546997 35105
## [2] {module_presentation_length=269}    => {code_presentation=2014J}           0.2114263  1.0000000 3.332675 36733
## [3] {code_module=BBB}                   => {id_assessment=[1.75e+03,2.43e+04)} 0.2473768  1.0000000 3.041436 42979
## [4] {code_module=FFF}                   => {id_assessment=[3.49e+04,3.74e+04]} 0.2502432  0.7938250 2.349866 43477
## [5] {id_assessment=[1.75e+03,2.43e+04)} => {code_module=BBB}                   0.2473768  0.7523808 3.041436 42979

Findings

As mentioned above, we decided to divide this question into three parts and for each task we had different formulas to try and answer the research question. The formulas we used were: Part 1: code_module, code_presentation, module_presentation_length => final_results Part 2: code_module, code_presentation, weight => score Part 3: assessment_type, module_presentation_length, code_module => score

In all the parts that we worked on for the report, we were looking to find if it is possible to predict final_result and the score of the students, using the formulas mentioned above.

We sorted the rules by different parameters, such as support, confidence and lift. After sorting the rules, we could not find rules that had a high support or high confidence that could help use predict the results using the formulas. In task 2 the LHS value was empty, meaning no rules were produced.

We found out that the rules with high lift value produced rules with more values in the LHS but most of them followed different formulas than what we were using. For task 1, it followed the formulas code_module, code_presentation => module_presentation_length and code_presentation => module_presentation_length. For task 2 it followed the formulas weight => code_module and assessment_type => code_module and for task 3, it followed the formulas module_presentation_length => code_presentation, code_presentation => module_presentation_length and code_module => id_assessment.

We couldn’t use any of the rules that were produced because none of the rules could help us in predicting the final_result or the score of the students according to the formulas we followed. Meaning that that there was no relationship between the tested columns and the final_result/score of the students.

In our research question we tried to find if the module_presentation_length affected the students’ final_result, but we were disappointed to find that there were no rules that show a relationship between them.

Conclusion

In our quest for analysing RMIT’S provided data, we set out to further understand the datasets. We hoped to find a rule that made it possible to predict a student’s final_result in correlation to the length of the course. his was to try answering the question, how does the course length affect the student’s achievements in the course?

In our research, we tried different methods, three methods to be exact, to try and answer the research question. We were disappointed to find that there were no rules that show a relationship between them, with CSV files and columns we used. Therefore, we arnt able to have an evidential conclusion , as we haven’t ventured to other methods yet. We hope to venture deeper into the datasets and find interesting relations and rules. As for now, with the limited data we’ve tested, there is no correlation between the students results and course length.