Basic Example

Let’s take the example of a teacher who wants to compare the marks and grades of students across two years, 2010 and 2011. The data is stored in tabular format.

load("../../compareDF/data/results_2010.rda")
load("../../compareDF/data/results_2011.rda")
print(results_2010)

##    Division Student Maths Physics Chem Discipline PE Art
## 1         A   Isaac    90      84   91          B  B  34
## 2         A  Akshay    85      92   91          B  B  36
## 3         A Vishwas    93      93   92          A  B  21
## 4         A   Rohit    95      92   71          C  B  37
## 5         A    Venu    99      92   82          A  E  78
## 6         A  Ananth    99      81   91          B  A  24
## 7         B    Jojy    67      92   81          B  A  27
## 8         B   Bulla    84      73   81          C  A  68
## 9         B   Katti    90      95   99          C  B  49
## 10        B Dhakkan    78      96   71          C  C  39
## 11        B   Macho    90      82   81         A+  D  30
## 12        B  Mugger    95      71   94          A  C  26

print(results_2011)

##    Division Student Maths Physics Chem Discipline PE Art
## 1         A   Isaac    90      84   91          A  B  34
## 2         A  Akshay    85      92   91          A  B  36
## 3         A Vishwas    82      93   92          B  B  21
## 4         A   Rohit    94      92   71          D  B  37
## 5         A    Venu   100      92   82          A  E  78
## 6         A  Ananth    78      81   91          B  A  24
## 7         B    Jojy    99      92   81          B  A  27
## 8         B   Bulla    97      73   81          C  A  68
## 9         B   Katti    78      95   99          C  B  49
## 10        B   Rohit    79      96   71          C  C  39
## 11        B   Macho    90      82   81         A+  D  30
## 12        B  Vikram    99      79   98          A  B  99
## 13        B DIkChik    91      71   84          E  C  99

The data shows the performance of students in two divisions, A and B for two years. Some subjects like Maths, Physics, Chemistry and Art are given scores while others like Discipline and PE are given grades.

It is possible that there are students of the same name in two divisions, for example, there is a Rohit in both the divisions in 2011.

It is also possible that some students have dropped out, or added new across the two years. Eg: - Mugger and Dhakkan dropped out while Vikram and Dikchik where added in the Division B

The package allows a user to quickly identify these changes.

Basic Comparison

Now let’s compare the performance of the students across the years. The grouping variables is the Student columns. We will ignore the Division columns and assume that the student names are unique across divisions. In this sub-example, if a student appears in two division, he has studied in both of them.

library(compareDF)
ctable_student = compare_df(results_2011, results_2010, c("Student"))

## Creating comparison table...

## Loading required namespace: htmlTable

## Creating HTML table for first 100 rows

ctable_student$comparison_df

##    Student chng_type Division Maths Physics Chem Discipline PE Art
## 1   Akshay         +        A    85      92   91          A  B  36
## 2   Akshay         -        A    85      92   91          B  B  36
## 3   Ananth         +        A    78      81   91          B  A  24
## 4   Ananth         -        A    99      81   91          B  A  24
## 5    Bulla         +        B    97      73   81          C  A  68
## 6    Bulla         -        B    84      73   81          C  A  68
## 7  Dhakkan         -        B    78      96   71          C  C  39
## 8    Isaac         +        A    90      84   91          A  B  34
## 9    Isaac         -        A    90      84   91          B  B  34
## 10    Jojy         +        B    99      92   81          B  A  27
## 11    Jojy         -        B    67      92   81          B  A  27
## 12   Katti         +        B    78      95   99          C  B  49
## 13   Katti         -        B    90      95   99          C  B  49
## 14  Mugger         -        B    95      71   94          A  C  26
## 15   Rohit         +        A    94      92   71          D  B  37
## 16   Rohit         +        B    79      96   71          C  C  39
## 17   Rohit         -        A    95      92   71          C  B  37
## 18    Venu         +        A   100      92   82          A  E  78
## 19    Venu         -        A    99      92   82          A  E  78
## 20 Vishwas         +        A    82      93   92          B  B  21
## 21 Vishwas         -        A    93      93   92          A  B  21
## 22 DIkChik         +        B    91      71   84          E  C  99
## 23  Vikram         +        B    99      79   98          A  B  99

By default, no columns are excluded from the comparison, so any of the tuple of grouping variables which are different across the two data frames are shown in the comparison table. The comparison_df table shows all the rows for which at least one record has changed. If nothing has changed across the two tables, the rows are not displayed. If a new record has been introduced or a record has been removed, those are displayed as well.

For example, Akshay, Division A has the exact same scores but has two different grades for Discipline across the two years so that row is included.

However, Macho, Division B has had the exact same scores in both the years, so his data is not shown in the comparison table.

HTML Output

While the comparison, table can be quickly summarized in various forms for futher analysis, it is very difficult to process visually. The html_output provides a way to represent this is a way that is easier for the numan eye to read. NOTE: You need to install the htmlTable package for the HTML comparison to work. For the purpose of the readme I am attaching the html as a png because github markdown doesn’t retain styles.

print(ctable_student$html_output)

Student	chng_type	Division	Maths	Physics	Chem	Discipline	PE	Art
Akshay	+	A	85	92	91	A	B	36
Akshay	-	A	85	92	91	B	B	36
Ananth	+	A	78	81	91	B	A	24
Ananth	-	A	99	81	91	B	A	24
Bulla	+	B	97	73	81	C	A	68
Bulla	-	B	84	73	81	C	A	68
Dhakkan	-	B	78	96	71	C	C	39
Isaac	+	A	90	84	91	A	B	34
Isaac	-	A	90	84	91	B	B	34
Jojy	+	B	99	92	81	B	A	27
Jojy	-	B	67	92	81	B	A	27
Katti	+	B	78	95	99	C	B	49
Katti	-	B	90	95	99	C	B	49
Mugger	-	B	95	71	94	A	C	26
Rohit	+	A	94	92	71	D	B	37
Rohit	+	B	79	96	71	C	C	39
Rohit	-	A	95	92	71	C	B	37
Venu	+	A	100	92	82	A	E	78
Venu	-	A	99	92	82	A	E	78
Vishwas	+	A	82	93	92	B	B	21
Vishwas	-	A	93	93	92	A	B	21
DIkChik	+	B	91	71	84	E	C	99
Vikram	+	B	99	79	98	A	B	99

Now it is very easy to see recognize what has changed. A single cell is colored if it has changed across the two datasets. The value of the cell in the older dataset is colored red and the value of the cell in the newer dataset is colored green. Cells that haven’t changed across the two datasets are colored grey.

If a new row was introduced, the Row group names (and all the other columns for that row as well ) are colored in Green. Similarly, a row group name (and the other columns in that row) are colored red if a row was removed.

For Example, Akshay, Ananth and Bulla has had changes in scored, which are in Discipline, Maths, and Maths respectively. Dhakkan and Mugger have dropped out of the dataset from 2010 and the full rows are shown in red, which DikChik and Vikram have joined new in the data set and the full rows are in green.

The same data is represented in tabular form (for further analysis, if necessary) in the comparison_table_diff object

ctable_student$comparison_table_diff

##    Student chng_type Division Maths Physics Chem Discipline PE Art
## 1        .         +        .     .       .    .          +  .   .
## 2        .         -        .     .       .    .          -  .   .
## 3        .         +        .     +       .    .          .  .   .
## 4        .         -        .     -       .    .          .  .   .
## 5        .         +        .     +       .    .          .  .   .
## 6        .         -        .     -       .    .          .  .   .
## 7        -         -        -     -       -    -          -  -   -
## 8        .         +        .     .       .    .          +  .   .
## 9        .         -        .     .       .    .          -  .   .
## 10       .         +        .     +       .    .          .  .   .
## 11       .         -        .     -       .    .          .  .   .
## 12       .         +        .     +       .    .          .  .   .
## 13       .         -        .     -       .    .          .  .   .
## 14       -         -        -     -       -    -          -  -   -
## 15       .         +        +     +       +    .          +  +   +
## 16       .         +        +     +       +    .          +  +   +
## 17       .         -        -     -       -    .          -  -   -
## 18       .         +        .     +       .    .          .  .   .
## 19       .         -        .     -       .    .          .  .   .
## 20       .         +        .     +       .    .          +  .   .
## 21       .         -        .     -       .    .          -  .   .
## 22       +         +        +     +       +    +          +  +   +
## 23       +         +        +     +       +    +          +  +   +

Change Count and Summary

You can get an details of what has changed for each group using the change_count object in the output. A summary of the same is provided in the change_summary object.

ctable_student$change_count

## Source: local data frame [13 x 4]
## 
##    Student changes additions removals
##     (fctr)   (dbl)     (dbl)    (dbl)
## 1   Akshay       1         0        0
## 2   Ananth       1         0        0
## 3    Bulla       1         0        0
## 4  Dhakkan       0         1        0
## 5    Isaac       1         0        0
## 6     Jojy       1         0        0
## 7    Katti       1         0        0
## 8   Mugger       0         1        0
## 9    Rohit       1         0        1
## 10    Venu       1         0        0
## 11 Vishwas       1         0        0
## 12 DIkChik       0         0        1
## 13  Vikram       0         0        1

ctable_student$change_summary

##   old_obs   new_obs   changes additions  removals 
##        12        13         9         2         3

Grouping Multiple Columns

We can also group_multiple columns into the grouping variable

ctable_student_div = compare_df(results_2011, results_2010, c("Division", "Student"))

## Grouping grouping columns

## Creating comparison table...

## Creating HTML table for first 100 rows

ctable_student_div$comparison_df

grp chng_type Division Student Maths Physics Chem Discipline PE Art 1 1 + A Akshay 85 92 91 A B 36 2 1 - A Akshay 85 92 91 B B 36 3 2 + A Ananth 78 81 91 B A 24 4 2 - A Ananth 99 81 91 B A 24 5 3 + A Isaac 90 84 91 A B 34 6 3 - A Isaac 90 84 91 B B 34 7 4 + A Rohit 94 92 71 D B 37 8 4 - A Rohit 95 92 71 C B 37 9 5 + A Venu 100 92 82 A E 78 10 5 - A Venu 99 92 82 A E 78 11 6 + A Vishwas 82 93 92 B B 21 12 6 - A Vishwas 93 93 92 A B 21 13 7 + B Bulla 97 73 81 C A 68 14 7 - B Bulla 84 73 81 C A 68 15 8 + B DIkChik 91 71 84 E C 99 16 9 + B Jojy 99 92 81 B A 27 17 9 - B Jojy 67 92 81 B A 27 18 10 + B Katti 78 95 99 C B 49 19 10 - B Katti 90 95 99 C B 49 20 12 + B Rohit 79 96 71 C C 39 21 13 + B Vikram 99 79 98 A B 99 22 14 - B Dhakkan 78 96 71 C C 39 23 15 - B Mugger 95 71 94 A C 26

Now Rohits in each individual division are considered as belonging to separate groups and compared accordingly. All the other summaries also change appropriately.

Excluding certain Columns

You can ignore certain columns using the exclude parameter. The fields that have to be excluded can be given as a character vector. (This is a convenience function to deal with the case where some columns are not included)

Limiting HTML size

For dataframes which have a large amount of differences in them, generating HTML might take a long time and crash your system. So the maximum diff size for the HTML (and for the HTML visualization only) is capped at 100 by default. If you want to see more difference, you can change the limit_html parameter appropriately.

Tolerance

It is possible that you’d like numbers very close to each other to be ignored. For example, if the marks of a student vary by less that 5% across the years, it could be due to random error and not any real performance indictaor. In such a case, you would want to give a tolerance parameter. For this case, giving a tolerance of 0.05 would ignore all the changes that are less than 5% apart from the lower value.

ctable_student_div = compare_df(results_2011, results_2010, c("Division", "Student"), tolerance = 0.05)

## Grouping grouping columns

## Creating comparison table...

## Creating HTML table for first 100 rows

ctable_student_div$html_output

grp	chng_type	Division	Student	Maths	Physics	Chem	Discipline	PE	Art
1	+	A	Akshay	85	92	91	A	B	36
1	-	A	Akshay	85	92	91	B	B	36
2	+	A	Ananth	78	81	91	B	A	24
2	-	A	Ananth	99	81	91	B	A	24
3	+	A	Isaac	90	84	91	A	B	34
3	-	A	Isaac	90	84	91	B	B	34
4	+	A	Rohit	94	92	71	D	B	37
4	-	A	Rohit	95	92	71	C	B	37
6	+	A	Vishwas	82	93	92	B	B	21
6	-	A	Vishwas	93	93	92	A	B	21
7	+	B	Bulla	97	73	81	C	A	68
7	-	B	Bulla	84	73	81	C	A	68
8	+	B	DIkChik	91	71	84	E	C	99
9	+	B	Jojy	99	92	81	B	A	27
9	-	B	Jojy	67	92	81	B	A	27
10	+	B	Katti	78	95	99	C	B	49
10	-	B	Katti	90	95	99	C	B	49
12	+	B	Rohit	79	96	71	C	C	39
13	+	B	Vikram	99	79	98	A	B	99
14	-	B	Dhakkan	78	96	71	C	C	39
15	-	B	Mugger	95	71	94	A	C	26

Venu from division A who had a score change from 100 to 99 is no longer present in the diff calculation or in the output

Naturally, tolerance has no meaning for non-numeric values.

Acknowledgements

I’d like to thank System Insights Inc. for all the things that I have learned while working there which I have used one way or the other in this package.

A version of this blog has been published as the README for the package.

Introduction

Usage