Research Objective

Following our two previous reports, initial report and sprint 2 report, we sought to further analyse the dataset provided to us by RMIT. (That can be found here.)

Within this report, we set out to the understand the effects of a student’s achievements in correlation to the length of a course. That is, how does the course length affect the student’s achievements in the course? The datasets used to analyse this question were courses which shows the courses offered by the university, assessments which showed each assessment the course had and finally studentAssessment which showed the score each student received for the assessments they submitted. We separated the major task into smaller tasks in hope of finding a rule that made it possible to predict final_result of the students in correlation to the course length.

Methodology

For this research question, we split the tasks into three dfifferent tasks. This was so that we can find out if there are different formulas that allow the prediction of a student’s result.

The process for reading and merging the files was similar across all three tasks. The packages used for these tasks were knitr, dplyr, arules, arulesviz. At first, we looked through the tables that were needed for this question to find the columns needed. We found that the first two datasets assessments and courses had the columns code_module and code_presentation matching. So, we merged the two tables together by the common columns using the merge function. This created a new table that had the column module_presentation_length added to the corresponding course.

kable(tail(merge1)) %>% 
  kable_styling(fixed_thead = T, bootstrap_options = c("striped", "hover", "condensed", full_width = F ))

	code_module	code_presentation	id_assessment	assessment_type	date	weight	module_presentation_length
201	GGG	2014J	37442	CMA	229	0	269
202	GGG	2014J	37443	CMA	229	0	269
203	GGG	2014J	37435	TMA	61	0	269
204	GGG	2014J	37436	TMA	124	0	269
205	GGG	2014J	37437	TMA	173	0	269
206	GGG	2014J	37444	Exam	229	100	269

Following that, we needed to merge the new table with the table studentAssessment by the common column they hadid_assessment, using the merge function. The second merged table added the id_student, date_submitted, is_bank and score columns for each assessment the student submitted.

kable(tail(merge2)) %>% 
  kable_styling(fixed_thead = T, bootstrap_options = c("striped", "hover", "condensed", full_width = F)) %>%
  scroll_box(width = "100%")

	id_assessment	code_module	code_presentation	assessment_type	date	weight	module_presentation_length	id_student	date_submitted	is_banked	score
173925	37443	GGG	2014J	CMA	229	0	269	527538	227	0	60
173926	37443	GGG	2014J	CMA	229	0	269	534672	229	0	100
173927	37443	GGG	2014J	CMA	229	0	269	546286	215	0	80
173928	37444	GGG	2014J	Exam	229	100	269	NA	NA	NA	NA
173929	40087	CCC	2014B	Exam	NA	100	241	NA	NA	NA	NA
173930	40088	CCC	2014J	Exam	NA	100	269	NA	NA	NA	NA

Task 1:

In this task, we used a formula to find associations between the columns. Formula is: code_module, code_presentation, module_presentation_length => final_results.

After finishing the merging of the needed tables, we needed to create a new table that would show only the relevant information combined with final_result of each student in the courses they took. First, we used the function group_by to group the table by the columns code_module, code_presentation, module_presentation_length and id_student. Then using the summarise function we added the column final_result which showed the calculated sum of the students’ score using the following algorithm final_result = (sum(final_score = ((weight / 100) * score))) . This calculates the score for each assessment with the weight of the assessment. By using this method, the columns that we didn’t specify to group by were removed from the table. We removed the column id_student so that it’ll be easier to find any association rules in the data using newtable2$id_student <- NULL. Following that, we cleared any NULL values from the table to make sure that the code for the association rules will not use any NULL values. This was done by using newtable2 <- newtable2[complete.cases(newtable2), ] then changing all the columns in the table to factors, using the function as.factor. that way it will be usable by the association rules code.

Resulting table:

code_module	code_presentation	module_presentation_length	final_result
FFF	2014J	269	73.25
FFF	2014J	269	8.5
FFF	2014J	269	41.5
FFF	2014J	269	67
FFF	2014J	269	24
FFF	2014J	269	84.25

Task 2:

In this task we used a formula to find associations between the columns. Formula used code_module, code_presentation, weight => score.

In this section, we also needed to create a table with the relevant information. First, we grouped the table using the group_by function. The code_module, code_presentation, weight and score were all used in the grouping. Furthermore, we needed to remove any irrelevant columns in the table that were not needed. The removed columns include date_submitted, is_banked, module_presentation_length and date. We set these columns to NULL once again using the function t1$date_submitted <- NULL. As with the last task, we needed to clear any NULL values from the table, making sure the code for the association rules will not use any NULL values. This was done using t1 <- t1[complete.cases(t1), ] then changed all the columns in the table to factors, using the function as.factor That way it will be usable by the association rules code.

Resulting table:

id_assessment	code_module	code_presentation	assessment_type	id_student	score
37443	GGG	2014J	CMA	443765	60
37443	GGG	2014J	CMA	465952	80
37443	GGG	2014J	CMA	546724	100
37443	GGG	2014J	CMA	527538	60
37443	GGG	2014J	CMA	534672	100
37443	GGG	2014J	CMA	546286	80

Task 3:

In this task we used a formula to find associations between the columns. Formula used assessment_type, module_presentation_length, code_module => score. This task was very similar to task 2 but in this task, we grouped it by assessment_type, module_presentation_length, code_module and score. Again, removing any irrelevant information. So, we removed the columns date_submitted, is_banked, date, id_student and weight. We set those columns to NULL and removed the `NULL values using m2 <- m2[complete.cases(m2), ] , then changed all the columns in the table to factors, using the function as.factor, That way it will be usable by the association rules code.

Resulting table:

id_assessment	code_module	code_presentation	module_presentation_length	assessment_type	score
37443	GGG	2014J	269	CMA	60
37443	GGG	2014J	269	CMA	80
37443	GGG	2014J	269	CMA	100
37443	GGG	2014J	269	CMA	60
37443	GGG	2014J	269	CMA	100
37443	GGG	2014J	269	CMA	80

After finishing manipulating the data, we needed to produce association rules to find a prediction for each of the formulas. To produce the rules the packages arules and arulesviz were used with the function apriori . this was to determine the set of rules from the new table created, using the parameters support and confidence. We then used the function sort to sort out the produced rules. For each part, we used different sorting parameters, so the rules could be sorted by Lift, support or confidence.

Support: The support represents the frequency of an itemset in the data.

Confidence: The confidence shows how often a rule is found to be true. E.g. if x is bought, how often is y bought. In this context, rather than x and y, the terms Left-Hand-Side (LHS) and Right-Hand-Side (RHS) are used.

Lift: Lift provides the information if a rule LHS -> RHS is random (LHS and RHS are independent) or not. If Lift > 1, both occurrences are dependent. Only for Lift greater than 1 a potential useful rule can be found.

For task 1 we chose to sort them by lift in a decreasing order, because it showed the most values, we can use on the LHS. If support is used, the LHS doesn’t show any values in it, if ‘confidence’ is used the LHS is module_presentation_length, but when lift is used the LHS values are code_module and code_presentation mainly. After sorting we printed out the top 5 rules using the inspect function.

##     lhs                          rhs                                 support confidence      lift count
## [1] {code_module=BBB,                                                                                  
##      code_presentation=2014B} => {module_presentation_length=234} 0.05106858          1 19.581513  1190
## [2] {code_module=DDD,                                                                                  
##      code_presentation=2013J} => {module_presentation_length=261} 0.06432924          1 15.545030  1499
## [3] {code_module=DDD,                                                                                  
##      code_presentation=2014J} => {module_presentation_length=262} 0.06076732          1  7.762159  1416
## [4] {code_module=BBB,                                                                                  
##      code_presentation=2014J} => {module_presentation_length=262} 0.06806283          1  7.762159  1586
## [5] {code_presentation=2013B} => {module_presentation_length=240} 0.16127371          1  6.200639  3758

##     lhs    rhs                                 support   confidence lift
## [1] {}  => {assessment_type=TMA}               0.5655207 0.5655207  1   
## [2] {}  => {assessment_type=CMA}               0.4059365 0.4059365  1   
## [3] {}  => {id_assessment=[3.49e+04,3.74e+04]} 0.3378171 0.3378171  1   
## [4] {}  => {id_student=[6.2e+05,2.7e+06]}      0.3333966 0.3333966  1   
## [5] {}  => {id_assessment=[2.43e+04,3.49e+04)} 0.3333909 0.3333909  1   
##     count
## [1] 98253
## [2] 70527
## [3] 58692
## [4] 57924
## [5] 57923

For task 1, we decided to remove any redundant rules by using using the formula subRules <- which(colSums(is.subset(rules2, rules2)) > 1) to create a subset of rules then using the function rules2 <- rules2[-subRules] to remove the redundant rules. For task 2, we sorted the rules by using the support parameters and for task 3, we sorted the rules by using the confidence parameters, to show the different results we received using the different parameters.

For task 3 we did the same as task 1 except we sorted it by confidence to show the different results.

# print top 5 rules
inspect(rules3[1:5])

##     lhs                                    rhs                                   support confidence     lift count
## [1] {module_presentation_length=268}    => {code_presentation=2013J}           0.2020560  1.0000000 3.546997 35105
## [2] {module_presentation_length=269}    => {code_presentation=2014J}           0.2114263  1.0000000 3.332675 36733
## [3] {code_module=BBB}                   => {id_assessment=[1.75e+03,2.43e+04)} 0.2473768  1.0000000 3.041436 42979
## [4] {code_module=FFF}                   => {id_assessment=[3.49e+04,3.74e+04]} 0.2502432  0.7938250 2.349866 43477
## [5] {id_assessment=[1.75e+03,2.43e+04)} => {code_module=BBB}                   0.2473768  0.7523808 3.041436 42979

Findings

As mentioned above, we decided to divide this question into three parts and for each task we had different formulas to try and answer the research question. The formulas we used were: Part 1: code_module, code_presentation, module_presentation_length => final_results Part 2: code_module, code_presentation, weight => score Part 3: assessment_type, module_presentation_length, code_module => score

In all the parts that we worked on for the report, we were looking to find if it is possible to predict final_result and the score of the students, using the formulas mentioned above.

We sorted the rules by different parameters, such as support, confidence and lift. After sorting the rules, we could not find rules that had a high support or high confidence that could help use predict the results using the formulas. In task 2 the LHS value was empty, meaning no rules were produced.

We found out that the rules with high lift value produced rules with more values in the LHS but most of them followed different formulas than what we were using. For task 1, it followed the formulas code_module, code_presentation => module_presentation_length and code_presentation => module_presentation_length. For task 2 it followed the formulas weight => code_module and assessment_type => code_module and for task 3, it followed the formulas module_presentation_length => code_presentation, code_presentation => module_presentation_length and code_module => id_assessment.

We couldn’t use any of the rules that were produced because none of the rules could help us in predicting the final_result or the score of the students according to the formulas we followed. Meaning that that there was no relationship between the tested columns and the final_result/score of the students.

In our research question we tried to find if the module_presentation_length affected the students’ final_result, but we were disappointed to find that there were no rules that show a relationship between them.

Conclusion

In our quest for analysing RMIT’S provided data, we set out to further understand the datasets. We hoped to find a rule that made it possible to predict a student’s final_result in correlation to the length of the course. his was to try answering the question, how does the course length affect the student’s achievements in the course?

In our research, we tried different methods, three methods to be exact, to try and answer the research question. We were disappointed to find that there were no rules that show a relationship between them, with CSV files and columns we used. Therefore, we arnt able to have an evidential conclusion , as we haven’t ventured to other methods yet. We hope to venture deeper into the datasets and find interesting relations and rules. As for now, with the limited data we’ve tested, there is no correlation between the students results and course length.

Research question 3

24/09/2019

Research Objective

Methodology

Task 1:

Task 2:

Task 3:

Findings

Conclusion