Following our two previous reports, initial report and sprint 2 report, we sought to further analyse the dataset provided to us by RMIT. (That can be found here.)
Within this report, we set out to the understand the effects of a student’s achievements in correlation to the length of a course. That is, how does the course length affect the student’s achievements in the course? The datasets used to analyse this question were courses which shows the courses offered by the university, assessments which showed each assessment the course had and finally studentAssessment which showed the score each student received for the assessments they submitted. We separated the major task into smaller tasks in hope of finding a rule that made it possible to predict final_result of the students in correlation to the course length.
For this research question, we split the tasks into three dfifferent tasks. This was so that we can find out if there are different formulas that allow the prediction of a student’s result.
The process for reading and merging the files was similar across all three tasks. The packages used for these tasks were knitr, dplyr, arules, arulesviz. At first, we looked through the tables that were needed for this question to find the columns needed. We found that the first two datasets assessments and courses had the columns code_module and code_presentation matching. So, we merged the two tables together by the common columns using the merge function. This created a new table that had the column module_presentation_length added to the corresponding course.
kable(tail(merge1)) %>%
kable_styling(fixed_thead = T, bootstrap_options = c("striped", "hover", "condensed", full_width = F ))
| code_module | code_presentation | id_assessment | assessment_type | date | weight | module_presentation_length | |
|---|---|---|---|---|---|---|---|
| 201 | GGG | 2014J | 37442 | CMA | 229 | 0 | 269 |
| 202 | GGG | 2014J | 37443 | CMA | 229 | 0 | 269 |
| 203 | GGG | 2014J | 37435 | TMA | 61 | 0 | 269 |
| 204 | GGG | 2014J | 37436 | TMA | 124 | 0 | 269 |
| 205 | GGG | 2014J | 37437 | TMA | 173 | 0 | 269 |
| 206 | GGG | 2014J | 37444 | Exam | 229 | 100 | 269 |
Following that, we needed to merge the new table with the table studentAssessment by the common column they hadid_assessment, using the merge function. The second merged table added the id_student, date_submitted, is_bank and score columns for each assessment the student submitted.
kable(tail(merge2)) %>%
kable_styling(fixed_thead = T, bootstrap_options = c("striped", "hover", "condensed", full_width = F)) %>%
scroll_box(width = "100%")
| id_assessment | code_module | code_presentation | assessment_type | date | weight | module_presentation_length | id_student | date_submitted | is_banked | score | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 173925 | 37443 | GGG | 2014J | CMA | 229 | 0 | 269 | 527538 | 227 | 0 | 60 |
| 173926 | 37443 | GGG | 2014J | CMA | 229 | 0 | 269 | 534672 | 229 | 0 | 100 |
| 173927 | 37443 | GGG | 2014J | CMA | 229 | 0 | 269 | 546286 | 215 | 0 | 80 |
| 173928 | 37444 | GGG | 2014J | Exam | 229 | 100 | 269 | NA | NA | NA | NA |
| 173929 | 40087 | CCC | 2014B | Exam | NA | 100 | 241 | NA | NA | NA | NA |
| 173930 | 40088 | CCC | 2014J | Exam | NA | 100 | 269 | NA | NA | NA | NA |
In this task, we used a formula to find associations between the columns. Formula is: code_module, code_presentation, module_presentation_length => final_results.
After finishing the merging of the needed tables, we needed to create a new table that would show only the relevant information combined with final_result of each student in the courses they took. First, we used the function group_by to group the table by the columns code_module, code_presentation, module_presentation_length and id_student. Then using the summarise function we added the column final_result which showed the calculated sum of the students’ score using the following algorithm final_result = (sum(final_score = ((weight / 100) * score))) . This calculates the score for each assessment with the weight of the assessment. By using this method, the columns that we didn’t specify to group by were removed from the table. We removed the column id_student so that it’ll be easier to find any association rules in the data using newtable2$id_student <- NULL. Following that, we cleared any NULL values from the table to make sure that the code for the association rules will not use any NULL values. This was done by using newtable2 <- newtable2[complete.cases(newtable2), ] then changing all the columns in the table to factors, using the function as.factor. that way it will be usable by the association rules code.
Resulting table:
| code_module | code_presentation | module_presentation_length | final_result |
|---|---|---|---|
| FFF | 2014J | 269 | 73.25 |
| FFF | 2014J | 269 | 8.5 |
| FFF | 2014J | 269 | 41.5 |
| FFF | 2014J | 269 | 67 |
| FFF | 2014J | 269 | 24 |
| FFF | 2014J | 269 | 84.25 |
In this task we used a formula to find associations between the columns. Formula used code_module, code_presentation, weight => score.
In this section, we also needed to create a table with the relevant information. First, we grouped the table using the group_by function. The code_module, code_presentation, weight and score were all used in the grouping. Furthermore, we needed to remove any irrelevant columns in the table that were not needed. The removed columns include date_submitted, is_banked, module_presentation_length and date. We set these columns to NULL once again using the function t1$date_submitted <- NULL. As with the last task, we needed to clear any NULL values from the table, making sure the code for the association rules will not use any NULL values. This was done using t1 <- t1[complete.cases(t1), ] then changed all the columns in the table to factors, using the function as.factor That way it will be usable by the association rules code.
Resulting table:
| id_assessment | code_module | code_presentation | assessment_type | weight | id_student | score |
|---|---|---|---|---|---|---|
| 37443 | GGG | 2014J | CMA | 0 | 443765 | 60 |
| 37443 | GGG | 2014J | CMA | 0 | 465952 | 80 |
| 37443 | GGG | 2014J | CMA | 0 | 546724 | 100 |
| 37443 | GGG | 2014J | CMA | 0 | 527538 | 60 |
| 37443 | GGG | 2014J | CMA | 0 | 534672 | 100 |
| 37443 | GGG | 2014J | CMA | 0 | 546286 | 80 |
In this task we used a formula to find associations between the columns. Formula used assessment_type, module_presentation_length, code_module => score. This task was very similar to task 2 but in this task, we grouped it by assessment_type, module_presentation_length, code_module and score. Again, removing any irrelevant information. So, we removed the columns date_submitted, is_banked, date, id_student and weight. We set those columns to NULL and removed the `NULL values using m2 <- m2[complete.cases(m2), ] , then changed all the columns in the table to factors, using the function as.factor, That way it will be usable by the association rules code.
Resulting table:
| id_assessment | code_module | code_presentation | module_presentation_length | assessment_type | score |
|---|---|---|---|---|---|
| 37443 | GGG | 2014J | 269 | CMA | 60 |
| 37443 | GGG | 2014J | 269 | CMA | 80 |
| 37443 | GGG | 2014J | 269 | CMA | 100 |
| 37443 | GGG | 2014J | 269 | CMA | 60 |
| 37443 | GGG | 2014J | 269 | CMA | 100 |
| 37443 | GGG | 2014J | 269 | CMA | 80 |
After finishing manipulating the data, we needed to produce association rules to find a prediction for each of the formulas. To produce the rules the packages arules and arulesviz were used with the function apriori . this was to determine the set of rules from the new table created, using the parameters support and confidence. We then used the function sort to sort out the produced rules. For each part, we used different sorting parameters, so the rules could be sorted by Lift, support or confidence.
Support: The support represents the frequency of an itemset in the data.
Confidence: The confidence shows how often a rule is found to be true. E.g. if x is bought, how often is y bought. In this context, rather than x and y, the terms Left-Hand-Side (LHS) and Right-Hand-Side (RHS) are used.
Lift: Lift provides the information if a rule LHS -> RHS is random (LHS and RHS are independent) or not. If Lift > 1, both occurrences are dependent. Only for Lift greater than 1 a potential useful rule can be found.
For task 1 we chose to sort them by lift in a decreasing order, because it showed the most values, we can use on the LHS. If support is used, the LHS doesn’t show any values in it, if ‘confidence’ is used the LHS is module_presentation_length, but when lift is used the LHS values are code_module and code_presentation mainly. After sorting we printed out the top 5 rules using the inspect function.
## lhs rhs support confidence lift count
## [1] {code_module=BBB,
## code_presentation=2014B} => {module_presentation_length=234} 0.05106858 1 19.581513 1190
## [2] {code_module=DDD,
## code_presentation=2013J} => {module_presentation_length=261} 0.06432924 1 15.545030 1499
## [3] {code_module=DDD,
## code_presentation=2014J} => {module_presentation_length=262} 0.06076732 1 7.762159 1416
## [4] {code_module=BBB,
## code_presentation=2014J} => {module_presentation_length=262} 0.06806283 1 7.762159 1586
## [5] {code_presentation=2013B} => {module_presentation_length=240} 0.16127371 1 6.200639 3758
## lhs rhs support confidence lift
## [1] {} => {assessment_type=TMA} 0.5655207 0.5655207 1
## [2] {} => {assessment_type=CMA} 0.4059365 0.4059365 1
## [3] {} => {id_assessment=[3.49e+04,3.74e+04]} 0.3378171 0.3378171 1
## [4] {} => {id_student=[6.2e+05,2.7e+06]} 0.3333966 0.3333966 1
## [5] {} => {id_assessment=[2.43e+04,3.49e+04)} 0.3333909 0.3333909 1
## count
## [1] 98253
## [2] 70527
## [3] 58692
## [4] 57924
## [5] 57923
For task 1, we decided to remove any redundant rules by using using the formula subRules <- which(colSums(is.subset(rules2, rules2)) > 1) to create a subset of rules then using the function rules2 <- rules2[-subRules] to remove the redundant rules. For task 2, we sorted the rules by using the support parameters and for task 3, we sorted the rules by using the confidence parameters, to show the different results we received using the different parameters.
For task 3 we did the same as task 1 except we sorted it by confidence to show the different results.
# print top 5 rules
inspect(rules3[1:5])
## lhs rhs support confidence lift count
## [1] {module_presentation_length=268} => {code_presentation=2013J} 0.2020560 1.0000000 3.546997 35105
## [2] {module_presentation_length=269} => {code_presentation=2014J} 0.2114263 1.0000000 3.332675 36733
## [3] {code_module=BBB} => {id_assessment=[1.75e+03,2.43e+04)} 0.2473768 1.0000000 3.041436 42979
## [4] {code_module=FFF} => {id_assessment=[3.49e+04,3.74e+04]} 0.2502432 0.7938250 2.349866 43477
## [5] {id_assessment=[1.75e+03,2.43e+04)} => {code_module=BBB} 0.2473768 0.7523808 3.041436 42979
As mentioned above, we decided to divide this question into three parts and for each task we had different formulas to try and answer the research question. The formulas we used were: Part 1: code_module, code_presentation, module_presentation_length => final_results Part 2: code_module, code_presentation, weight => score Part 3: assessment_type, module_presentation_length, code_module => score
In all the parts that we worked on for the report, we were looking to find if it is possible to predict final_result and the score of the students, using the formulas mentioned above.
We sorted the rules by different parameters, such as support, confidence and lift. After sorting the rules, we could not find rules that had a high support or high confidence that could help use predict the results using the formulas. In task 2 the LHS value was empty, meaning no rules were produced.
We found out that the rules with high lift value produced rules with more values in the LHS but most of them followed different formulas than what we were using. For task 1, it followed the formulas code_module, code_presentation => module_presentation_length and code_presentation => module_presentation_length. For task 2 it followed the formulas weight => code_module and assessment_type => code_module and for task 3, it followed the formulas module_presentation_length => code_presentation, code_presentation => module_presentation_length and code_module => id_assessment.
We couldn’t use any of the rules that were produced because none of the rules could help us in predicting the final_result or the score of the students according to the formulas we followed. Meaning that that there was no relationship between the tested columns and the final_result/score of the students.
In our research question we tried to find if the module_presentation_length affected the students’ final_result, but we were disappointed to find that there were no rules that show a relationship between them.
In our quest for analysing RMIT’S provided data, we set out to further understand the datasets. We hoped to find a rule that made it possible to predict a student’s final_result in correlation to the length of the course. his was to try answering the question, how does the course length affect the student’s achievements in the course?
In our research, we tried different methods, three methods to be exact, to try and answer the research question. We were disappointed to find that there were no rules that show a relationship between them, with CSV files and columns we used. Therefore, we arnt able to have an evidential conclusion , as we haven’t ventured to other methods yet. We hope to venture deeper into the datasets and find interesting relations and rules. As for now, with the limited data we’ve tested, there is no correlation between the students results and course length.