DETERMINE SUBJECT OF DECKS BY SIMILAR WORDS

Later notebooks will benefit from using the subject of a deck instead of the deck name or folder hierarchy, at least for big collections. Several things may indicate subject similarity: when the decks were added, when they were reviewed, where in the folder hierarchy the user put each deck, and what rare words are in the cards of the deck. The first two are not actually very useful. Text mining is complicated, and the user should have an easy way to give input, so the user’s deck hierarchy takes precedence over word coincidence.

The output of this notebook can vary with small changes to the user’s collection. The variation is most drastic when the final anchor decks merge.

Users not applying this notebook to their own data may still be interested in the plots and results in the section ‘Studies and visualizations of results’.

I imagine many people organize their decks by what they intend to do with them rather than by subject matter. For example, a collection could have lowest level decks named ‘trashed’, ‘highest importance’, ‘learning’ and ‘maintaining’ alongside plenty of single subject folders. The app splits every folder found to contain too much disagreement (a multi subject folder), then the decks those split folders held are added to low level single subject (LL SS) folders. LL SS folders with too much in common are combined with other LL SS folders into subjects. During this process the original folder hierarchy is unaltered and kept separately.

Even though the results of the subject detection algorithm can change a great deal with small changes to the input, running it on your own data will still help you understand your collection and habits. Users applying the app to their own data should start checking that the output is reasonable at ‘## Decks rearranged for Subject’.

The notebook was built around a big collection. To help the algorithm, the user could add prospective decks for things they might be interested in but have no plans for. The algorithm requires that at least half of all lowest level decks are single subject, that there is at least one LL SS ‘anchor’ deck for each subject, and that every multi subject lowest level deck is at most 50% single subject by number of decks with cards.

Running this notebook creates a deck_similar_word_network folder containing an html file which shows all decks at once, together with every edge whose word coincidence between two decks is stronger than the elbow; it is helpful for seeing clusters.

Prep

Load relevant files

#infoRDS("rev.RDS")
#crd <- readRDS(file =paste0(getwd(),"/crd.RDS")) #because for some reason running readRDS when knitting outputs broken stuff
#rev <- readRDS(file =paste0(getwd(),"/rev.RDS"))
load('Step_1_after.RData')
check_all_decks_in_dtm <- length(unique(crd$dek.nam))
# every reviewed card should also exist in the card table; print diagnostics if not
if(!all(as.numeric(rev$cid) %in% as.numeric(crd$cid))){
  print(getwd())
  print(summary(crd))
  print(summary(rev))
} 
#print("go")

Approximate the subject by assuming the lowest level deck is the subject, then put any lowest level folder with fewer than 100 reviews into ‘Other’. This will be superseded by ‘Subject’. Count reviews per lowest level folder. Warning to self: if this is broken then the hierarchy is not in crd yet.

#this is done to CRD in 'finding single subject folders' 
if(T){ #makes agglo variable
hier.low.lev <- rev %>% group_by(dek.heir.lev.1) %>%
  summarise(count.reviews=n()) %>% ungroup() %>% arrange(desc(count.reviews))
print(hier.low.lev)

to.other <- hier.low.lev$dek.heir.lev.1[(hier.low.lev$count.reviews<100)]


crd$dek.agglo <- crd$dek.heir.lev.1
crd$dek.agglo[crd$dek.agglo %in% to.other] <- "Other"
}
## # A tibble: 13 × 2
##    dek.heir.lev.1                                    count.reviews
##    <chr>                                                     <int>
##  1 AAA easy spanish all                                     109379
##  2 AAA regular                                                7924
##  3 AAlearning                                                 2408
##  4 AAtechComplexBad                                           1927
##  5 Aspanish 2                                                  311
##  6 Alogic reasoning epistemology explain                       154
##  7 ACwassupposedtostudy                                        135
##  8 AStatsBasic                                                  62
##  9 ProgramingAdvancedWide                                       33
## 10 ARlang                                                       31
## 11 Health informatics tech sci sys                               6
## 12 MathAdvanced                                                  4
## 13 information retrieval SEO search collab directory             1
if(F){ # broken: per-deck stats at each hierarchy level
  for(i in (max_lv):1)  
    rev <- rev[order(rev[,paste0("dek.heir.lev.",i)]),]
rev %>% group_by(dek.nam) %>%
  select("dek.nam","cards.in.dek",
         "revs.in.dek" ) %>%
  summarise(last(dek.nam),
            last(cards.in.dek),
         last(revs.in.dek)) %>% ungroup()
  
  rev %>% group_by(dek.heir.lev.2) %>%
  summarise(n()) %>% ungroup()
}

First use of similar subject algorithm

Each deck’s card text is concatenated into a single document (one text string per deck).

tcp <- crd %>% 
  group_by(dek.nam) %>%
  summarise(
    deck.long = str_replace_all(first(dek.nam)," -- "," "),
     concated = paste(card.txt,
                      sep = " ", collapse = " "),
    target = paste(simplest.name,deck.long, sep = " ", collapse = " "),
    deck.lvl.1 =  str_split(first(dek.nam),fixed(" -- "))[[1]][1]
    ) %>%
  mutate(num_chars=nchar(concated))
#tcp$deck.lvl.1

#   dim(crd)
# dim(tcp)


#str(tcp)
#tcp$concated[which(tcp$num_chars<20000 & tcp$num_chars>10000)[1]]

tcp$concated <- str_remove_all(tcp$concated, fixed("Chapter %"))
tcp$concated <- str_remove_all(tcp$concated, fixed("chapter %"))
tcp$concated <- str_remove_all(tcp$concated, fixed("chapter"))
tcp$concated <- str_remove_all(tcp$concated, fixed("Chapter"))
tcp$concated <- str_remove_all(tcp$concated, fixed("ETH"))
tcp$concated <- str_remove_all(tcp$concated, fixed("Introduction"))
tcp$concated <- str_remove_all(tcp$concated, fixed("introduction"))
tcp$concated <- str_remove_all(tcp$concated, fixed("learning"))
tcp$concated <- str_remove_all(tcp$concated, fixed("Learning"))


number.clusters <- round(dim(tcp)[1]/10)
number.clusters <- min(number.clusters,10)
# limiting to 10 clusters because too many clusters are not useful for things like graphs
 

#crd$dek.nam[sample(dim(crd)[1],10 )]
#tcp$deck.lvl.1[sample(dim(tcp)[1],10)]

Document term matrix (DTM): documents (decks) as rows, words as columns, and each cell the count of that word in that document.
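To make the layout concrete, here is a minimal sketch on made-up documents (hypothetical deck names and text, not the collection):

# minimal sketch: three hypothetical decks, unigrams only
toy_docs <- c("deckA" = "el gato come pescado el perro come carne",
              "deckB" = "linear regression residuals anova residuals",
              "deckC" = "regression model residuals")
toy_dtm <- textmineR::CreateDtm(doc_vec = toy_docs, doc_names = names(toy_docs),
                                ngram_window = c(1, 1), verbose = FALSE)
as.matrix(toy_dtm) # rows = decks, columns = words, cells = counts of the word in the deck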

{

dtm <- CreateDtm(doc_vec = tcp$concated, 
                 # character vector of documents
                 doc_names = tcp$dek.nam, 
                 # document names, optional
                 ngram_window = c(1, 2), 
                 # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), 
                                  # stopwords from tm
                          stopwords::stopwords(source = "smart")), 
                 # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, 
                 # punctuation - this is the default
                 remove_numbers = TRUE, 
                 # numbers - this is the default
                 verbose = T, # show progress while building the DTM
                 cpus = 2) 
# by default, this will be the max number of cpus 
#dim(dtm)
# colnames(dtm)[1:30]
# rownames(dtm)[1:30]
# remove any tokens that appear in only one document
dtm <- dtm[ , colSums(dtm > 0) > 1 ]
# dtm <- dtm[ , colSums(dtm > 0) <=
#                max(15,dim(tcp)[1]/(number.clusters)) ]
 #most number of decks with very very similar information
 
print(paste("number of decks",dim(dtm)[1]))
if(!(dim(dtm)[1]==check_all_decks_in_dtm)) stop("deck missed in dtm")
decks_without_words <- rownames(dtm)[(rowSums(dtm > 0))==0]
if(length(decks_without_words)>0){
  print("below decks that were cut")
  print(decks_without_words)
}

#

# binarize: we only care whether a word appears in a deck, not how many times
dtm[dtm>1] <- 1
#print(glim<-dtm[10:11,10:11])
}
## [1] "number of decks 393"

Top and bottom lists of term frequency.

#?CreateDtm
tf_mat <- TermDocFreq(dtm = dtm)
tf_mat$doc_freq <- NULL
#str(tf_mat) 

# look at the most frequent tokens, then least frequent small tokens
head(tf_mat[ order(tf_mat$term_freq, decreasing = TRUE) , ], 20)
## # A tibble: 20 × 3
##    term        term_freq   idf
##    <chr>           <dbl> <dbl>
##  1 time              227 0.549
##  2 information       215 0.603
##  3 make              212 0.617
##  4 set               201 0.671
##  5 number            196 0.696
##  6 data              190 0.727
##  7 process           173 0.821
##  8 type              173 0.821
##  9 based             163 0.880
## 10 order             160 0.899
## 11 specific          159 0.905
## 12 work              156 0.924
## 13 means             154 0.937
## 14 people            154 0.937
## 15 test              154 0.937
## 16 part              153 0.943
## 17 term              153 0.943
## 18 form              153 0.943
## 19 long              152 0.950
## 20 change            152 0.950
tf_smlrams <- tf_mat[ !(stringr::str_detect(tf_mat$term, "_")) , ]
tf_smlrams <- tf_smlrams[tf_smlrams$term_freq>2,]
head(tf_smlrams[ order(tf_smlrams$term_freq, decreasing = FALSE) , ], 10)
## # A tibble: 10 × 3
##    term          term_freq   idf
##    <chr>             <dbl> <dbl>
##  1 abbreviation          3  4.88
##  2 abstracting           3  4.88
##  3 abuses                3  4.88
##  4 accidental            3  4.88
##  5 accomplishing         3  4.88
##  6 acknowledges          3  4.88
##  7 adept                 3  4.88
##  8 adjusts               3  4.88
##  9 administers           3  4.88
## 10 advantageous          3  4.88
# look at the most frequent bi-grams, and least frequent too
tf_bigrams <- tf_mat[ stringr::str_detect(tf_mat$term, "_") , ]
head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = TRUE) , ], 20)
## # A tibble: 20 × 3
##    term                term_freq   idf
##    <chr>                   <dbl> <dbl>
##  1 long_term                  73  1.68
##  2 true_false                 53  2.00
##  3 short_term                 48  2.10
##  4 term_memory                48  2.10
##  5 data_set                   38  2.34
##  6 dependent_variable         36  2.39
##  7 data_data                  35  2.42
##  8 working_memory             33  2.48
##  9 decision_making            33  2.48
## 10 real_world                 30  2.57
## 11 image_image                29  2.61
## 12 solve_problem              28  2.64
## 13 linear_regression          27  2.68
## 14 data_frame                 27  2.68
## 15 total_number               25  2.75
## 16 period_time                25  2.75
## 17 function_function          24  2.80
## 18 data_type                  24  2.80
## 19 information_systems        23  2.84
## 20 memory_memory              22  2.88
head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = FALSE) , ], 10)
## # A tibble: 10 × 3
##    term               term_freq   idf
##    <chr>                  <dbl> <dbl>
##  1 aa_true                    2  5.28
##  2 ab_left                    2  5.28
##  3 abierto_sound              2  5.28
##  4 abilities_brain            2  5.28
##  5 ability_act                2  5.28
##  6 ability_assess             2  5.28
##  7 ability_break              2  5.28
##  8 ability_collect            2  5.28
##  9 ability_computer           2  5.28
## 10 ability_creatively         2  5.28
if(F){
for(i in 30:40){
  print(colnames(dtm[ , colSums(dtm > 0) == i])[1:50])
  print(paste(i,"-----------------------------------"))
} 
}
#plot( sort(tf_mat$term_freq ))
saveRDS(tf_mat ,"wordlist.RDS")

Remove ngrams that occur in more than half of all level one folders. This might cause problems for small collections. CHANGEME maybe.
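A toy sketch of the matrix product used below (hypothetical decks, folders and words): the dummy matrix marks which level one folder each deck belongs to, so the product counts how many decks of each folder contain each word.

# toy sketch with made-up names, same shape as the real computation
toy_deck_word <- matrix(c(1, 0,   # deck1 contains w1 only
                          1, 1,   # deck2 contains both words
                          0, 1),  # deck3 contains w2 only
                        nrow = 3, byrow = TRUE,
                        dimnames = list(c("deck1", "deck2", "deck3"), c("w1", "w2")))
toy_folders <- matrix(c(1, 0,   # deck1 and deck2 sit in folder F1
                        1, 0,
                        0, 1),  # deck3 sits in folder F2
                      nrow = 3, byrow = TRUE,
                      dimnames = list(NULL, c("F1", "F2")))
t(toy_folders) %*% toy_deck_word # rows = folders, cells = number of decks in the folder containing the word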

deklvl <- mlr::createDummyFeatures(tcp$deck.lvl.1)
deklvl <- as.matrix(deklvl)

# folder-by-word matrix: how many decks in each level one folder contain each word
folder.by.word <- t(deklvl) %*% dtm

words.remove <- colnames(folder.by.word[ , colSums(folder.by.word > 0) > dim(folder.by.word)[1]/2 ])

dtm <- dtm[,!(colnames(dtm) %in% words.remove)]
#str(deklvl)
#dim(dtm)
#print('though removal is of few words they likely strongly affect the outcome')

The Haberman formula takes two dichotomous variables, calculates their coincidence, and outputs a similarity score (how unlikely it is that there is no relation between them) in standard deviations. Here it is applied to word incidence to estimate the similarity in subject matter between two decks. If bug, CHANGEME.
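A minimal sketch of the calculation, assuming the usual adjusted standardized residual form; netCoin’s internal computation may differ in detail.

# a and b are word incidence vectors of two decks over the same word list (sketch only)
haberman_sketch <- function(a, b) {
  stopifnot(length(a) == length(b), all(a %in% 0:1), all(b %in% 0:1))
  n <- length(a)
  o <- sum(a == 1 & b == 1)   # observed shared words
  e <- sum(a) * sum(b) / n    # expected shared words if the decks were unrelated
  (o - e) / sqrt(e * (1 - sum(a) / n) * (1 - sum(b) / n))
}
# two decks sharing most of their vocabulary score well above zero standard deviations
deck1 <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
deck2 <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
haberman_sketch(deck1, deck2)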

{
require(netCoin)

dtm[dtm>1]<-1
C <- coin(t(dtm), minimum = 1) # [sample.test,] coincidence matrix
N <- asNodes(C) # node data frame
N$deklv1 <- str_split(N$name," -- ",simplify = T)[,1]
E <- edgeList(data=C,min = -1000,level=-1,criteria = "Haberman") # edge data frame
# Net <- netCoin(N,E,dir="decks")
# summary(E)
E <- E[order(-E$Haberman),]
hold_for_conv_div_clusters <- E
# E
# C
# str(dtm)
# str(C)
too_small <- unique(crd$dek.nam)[which(!(unique(crd$dek.nam) %in% E$Target | unique(crd$dek.nam) %in% E$Source))]
if(length(too_small)>0){
  print("Following decks were not added to the netCoin network meaning either deck is too small or too esoteric CHANGME")
print(too_small)
}
} 
## [1] "Following decks were not added to the netCoin network meaning either deck is too small or too esoteric CHANGME"
## [1] "MathAdvanced -- Advanced Statistics -- Short Summary of Course Requirements"

Detect the ‘elbow’. Hopefully it indicates a change in generative process, separating the coincidences that seem high but were created by randomness from those that mean something more.

elbowr <- function(y=sort(curve),x=1:length(y) ) {
  # hints from
  # https://stackoverflow.com/questions/2018178/finding-the-best-trade-off-point-on-a-curve/
  # Find the point furthest from the straight line connecting the extremes. This is not the
  # same as the highest derivative except on a well behaved curve. I use it to quickly
  # approximate a change in the generative process of the distribution, which may be a big mistake.
  # The extremes are assumed to sit at the beginning and end of the data, so be sure to sort.
  lg <- length(x)
  stopifnot( exprs = {
  x[1]==min(x) | x[1]==max(x)
  y[1]==min(y) | y[1]==max(y)
  x[lg]==max(x) | x[lg]==min(x)
  y[lg]==max(y) | y[lg]==min(y)
  })
  
  # x and y into one df
  df <- data.frame(x=x,y=y)
  #plot(df)
  # get line NOT REALLY LINEAR MODEL JUST ENDPOINTS
  intercept <- min(y)
  slope <- (max(y)-min(y))/lg
  # get distance by height not really euclidean 
  df$dist_y <- abs(slope * df$x
                  - df$y + intercept)
  
   out <- which.max(df$dist_y)
  
  return(df[out,])
}

thelbow <- elbowr(sort(E$Haberman))$y

print(paste("elbow / cuttoff:",round(thelbow,digits = 3)))
## [1] "elbow / cuttoff: 11.99"
plot(x=sort(E$Haberman),y=1:length(E$Haberman))
abline(v=thelbow)

# length(unique(E$Target))
# length(unique(E$Source))
# print("the following may be duplicates")
# head(E)
# distance encoding: coincidences at or above ~95% of the elbow become near-zero
# distances, moderate ones get distance 1, the rest distance 10
E$dist <- ifelse(E$Haberman >= thelbow*.95,.0001,
          ifelse(E$Haberman > 2,1,10) )
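
As a quick sanity check of elbowr, here is a toy curve where the bend is known roughly (synthetic data, not the collection): most values come from one distribution and a handful of much larger ones from another, so the detected elbow should sit near where the two regimes meet.

set.seed(1)
toy_curve <- sort(c(rnorm(200, mean = 0, sd = 1), rnorm(10, mean = 15, sd = 2)))
elbowr(toy_curve) # returns the index (x), value (y) and vertical distance of the bend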

Function that makes clusters from similarity network.

require(reshape2, quietly = T)

#View(E)

edge_dist_clustered <- function(Edges,hclust.method="average",cut.at.height=.01,
                         use.elbow.for.height=F,verby=F,vverby=F){
  
  # hierarchical clustering, then cut at "height" or at the "elbow"; outputs the cluster
  # sizes. Can be told to print diagnostics and goodness of fit statistics.
  #Edges=E[in.deck,]
  stopifnot(
     "verby is not atomic" = is.atomic(verby),
     "vverby is not atomic" = is.atomic(vverby),
     "use.elbow.for.height is not atomic" = is.atomic(use.elbow.for.height),
     "cut.at.height is not atomic" = is.atomic(cut.at.height),
     "hclust.method is not atomic" = is.atomic(hclust.method),
     "verby is not logical" = is.logical(verby),
     "vverby is not logical" = is.logical(vverby),
     "use.elbow.for.height is not logical" = is.logical(use.elbow.for.height),
     "cut.at.height is not numeric" = is.numeric(cut.at.height),
     "hclust.method is not character" = is.character(hclust.method),
     "Edges not found" = !is.null(Edges),
     "Edges is not class df" = class(Edges) == "data.frame",
     "names of Edges columns incorrect" = c("Target","Source","dist") %in%
       names(Edges),
     "Edges has no rows" = dim(Edges)[1]>1
 )
  
  rord <- Edges[c("Target","Source","dist")]
  names(rord) <- c("Source","Target","dist")
  Edges <- rbind(rord,Edges[c("Source","Target","dist")])
  
  prevf <- reshape2::dcast(Edges,value.var='dist',
                 Target ~ Source, drop=F ,
                 fill   = NA)
  
  #?rbind()
  waffles=unique(c(Edges$Target, Edges$Source))
  if(dim(prevf)[1] != length(waffles) | dim(prevf)[2] <= length(waffles)   ){
    warning("not all connections between nodes covered in edges")
  }

  rownames(prevf) <- prevf$Target
  prevf$Target = NULL
  prevf[is.na(prevf)] <- max(prevf, na.rm = TRUE) * 2 # missing pairs get a large distance
  
  dista <- as.dist(as.matrix(prevf))
  hcs <- hclust(dista,hclust.method)
  
  if(use.elbow.for.height){
    cut.at.height <- elbowr(y=sort(hcs$height,decreasing = T))$y * 1.1
    if(verby) plot(sort(hcs$height,decreasing = T))
    print(paste("elbow selected",cut.at.height))
    }
  
  cut.singles <- sum(hcs$height > cut.at.height) + 1 # count of remaining clusters
  clustering.singles <- cutree(hcs, max(1,cut.singles))
  dek_cluster_second_output <<- data.frame(cluster.number=clustering.singles,
                                           subject.dek=names(clustering.singles))
  tcs <- table(clustering.singles)
  elms.per.clust <- sum(tcs)/cut.singles #average objects per cluster
  #print(paste("number of expected clusters", cut.singles))
  if(cut.singles==1) return(tcs)
  
  
  if(verby){
    if(vverby){
      for (i in unique(clustering.singles)) {
      print(names(clustering.singles)[clustering.singles==i])
      print(i)
      print("-------------------------------------------")
      }
    }
    # print(paste("RMSE from even clusters",
    #             rmse = sqrt(sum((tcs-elms.per.clust)^2)/cut.singles)))
    # print(paste("RMSE from single cluster",
    #             rmseo = sqrt((max(tcs)-sum(tcs))^2 + sum(tcs[(max(tcs)!=tcs)]^2))
    #             /cut.singles))
    # print(paste("percent in decimal; categorical entropy variance agreement",
    #           cat.var = (sum(tcs^2)/cut.singles) / (elms.per.clust)^2))
    print(paste("percent in decimal; objects outside biggest class",
                p.out = (sum(tcs)-max(tcs))/sum(tcs)))
  }
  
  return(tcs)
}  

Find clusters of decks that have very similar subjects, or at least very similar vocabularies. This can be interesting for understanding the collection and for detecting decks that may be duplicates.

# distance encoding: edges at or above the elbow get small distances (stronger
# coincidence -> smaller distance); everything else gets a distance too large to cluster
E$dist <- ifelse(E$Haberman >= thelbow*1,1000-E$Haberman,
          ifelse(E$Haberman > 2,10000,100000) )
sink <- edge_dist_clustered(Edges=E,hclust.method="complete",
                            verby=F,vverby=F,cut.at.height=1001)
#'complete' will still cluster the same decks together if any other decks are removed but may splinter if decks are added. So this changes based on new decks added. But closest coincidence decks will still be clustered together first and least subject to splintering. 
dp <- crd %>% 
group_by(dek.nam) %>%
  summarize(n=n()) 

dek_subject_clus <- merge(dek_cluster_second_output,dp,by.y = 'dek.nam', by.x = 'subject.dek')

mergagain <- dek_subject_clus %>%
  arrange(-n)  %>% 
  group_by(cluster.number) %>% 
  mutate(subcluster=first(subject.dek)) %>%
  ungroup() %>%
  select(subject.dek,subcluster)

metest <- merge(crd, mergagain, by.x = "dek.nam", by.y = "subject.dek",all.x = T)

crd <- metest
metest <- NULL

crd %>%
  group_by(subcluster) %>% 
  mutate(deks.in.sub=n_distinct(dek.nam)) %>% 
  ungroup() %>%
  group_by(dek.nam) %>%
  summarize(deks.in.sub=first(deks.in.sub),subcluster=first(subcluster)) %>%
  arrange(-deks.in.sub,subcluster) %>%
  select(deks.in.sub,dek.nam,subcluster)
## # A tibble: 393 × 3
##    deks.in.sub dek.nam                                                subcluster
##          <int> <chr>                                                  <chr>     
##  1          15 ACwassupposedtostudy -- HIT Chapter 3 Health Informat… Health in…
##  2          15 Health informatics tech sci sys -- Ch01 Introduction … Health in…
##  3          15 Health informatics tech sci sys -- Chapter 1 Electron… Health in…
##  4          15 Health informatics tech sci sys -- Clinical Informati… Health in…
##  5          15 Health informatics tech sci sys -- Clinical Informati… Health in…
##  6          15 Health informatics tech sci sys -- Clinical decision … Health in…
##  7          15 Health informatics tech sci sys -- Comp 9 - Unit 7 - … Health in…
##  8          15 Health informatics tech sci sys -- HIM - Data and INF… Health in…
##  9          15 Health informatics tech sci sys -- HIM 503 Coordinati… Health in…
## 10          15 Health informatics tech sci sys -- HIT Pr. Mgmt. :Cha… Health in…
## # ℹ 383 more rows

Decks rearranged for Subject. Users with their own data should start checking the output here.

Words are not the same thing as ideas, especially when two very similar ideas do not share many terms. The user may want a subject to be broader or narrower. Any clustering algorithm will find different clusters depending on which decks are missing from the collection. Therefore subjects are best defined by the user as level one folders with the appropriate decks inside. To pick anchors, look at deck_similar_word_network.

Folders which are lowest level but not actually single subject will be separated into their constituents. CHANGEME: modify your own collection to change the outcome. Within each lowest level folder, clusters of decks repeatedly join together (single linkage) so long as at least one deck of the first cluster has vocabulary similar to a deck of the second cluster. If in the final state the biggest cluster does not contain at least 60% of all decks with cards, the folder is broken apart: the folder is removed and each next level constituent folder or deck becomes its own lowest level folder.
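A toy illustration of the split rule with made-up numbers (same 0.40 threshold as the loop below):

# a folder holding 10 card-bearing decks whose biggest single linkage cluster covers
# only 5 of them gets broken apart, because the share left outside exceeds 0.40
toy.decks.in.folder <- 10
toy.biggest.cluster <- 5
(toy.decks.in.folder - toy.biggest.cluster) / toy.decks.in.folder > .40 # TRUE, so split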

dek_name_to_heir <- function(var.to.split){
# split "A -- B -- C" style deck names into one hierarchy level column per piece
split_str <- str_split(var.to.split,stringr::fixed(" -- "))
pre_padded_df <- sapply(split_str, simplify="matrix",function(x){
  for(i in 1:(20-length(x))) x<-c(x,"")
return(x)
}) 
dek_heir_padded <- as.data.frame(t(pre_padded_df))
levels_not_empty <- (apply(dek_heir_padded, 2, 
                           function(x) length(unique(x)))!=1)
dek_heir_padded <- dek_heir_padded[,levels_not_empty]
max_lv <<- dim(dek_heir_padded)[2]
for(i in (max_lv):1) names(dek_heir_padded)[i] <- paste0("dek.heir.lev.",i)
return(dek_heir_padded)
}
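# Hypothetical sanity check: deck names of different depth come back as a padded
# hierarchy data frame, one dek.heir.lev.* column per level and "" where a level is absent.
# Note the side effect: dek_name_to_heir() also sets max_lv in the global environment.
# dek_name_to_heir(c("ARlang -- AmoreR -- Advanced R", "Psychology -- MPTL"))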

if(!("dek.heir.lev.1.source" %in% names(E))){
source.unfurl <- dek_name_to_heir(E$Source)
target.unfurl <- dek_name_to_heir(E$Target)
names(source.unfurl) <- paste(names(source.unfurl),".source",sep="")
names(target.unfurl) <- paste(names(target.unfurl),".target",sep="")
E <- cbind(E,cbind(source.unfurl,target.unfurl))
#View(E)  
}
E$folder.one.cluster.s <- E$dek.heir.lev.1.source
E$folder.one.cluster.t <- E$dek.heir.lev.1.target
crd$subject.category <- crd$dek.heir.lev.1
crd$subject.folder.level <- 1

E$dist <- ifelse(E$Haberman >= thelbow*.95,.0001,
          ifelse(E$Haberman > 2,1,10) )

for(folder.lv in 1:5){ #folder.lv<-1
folders.highest <- unique(c(E$folder.one.cluster.s,E$folder.one.cluster.t))
folder.split <- c()

for(i in folders.highest){ #i<-folders.highest[11]
  in.deck <- E$folder.one.cluster.s %in% i & E$folder.one.cluster.t %in% i
  cards.in.folder <- crd[names(crd)==paste0("dek.heir.lev.",folder.lv)] == i
  count.of.sub.decks <- length(unique(crd$dek.nam[cards.in.folder]))
  u <- NA 
   if(count.of.sub.decks>2){
    u <- edge_dist_clustered(Edges=E[in.deck,],"single",verby=F,vverby=F)
    #print(u)
    #print(i)
    if( (count.of.sub.decks-max(u))/count.of.sub.decks >.40) {
      folder.split <- c(folder.split,i)
    }
   }
  
}
print(paste("folders to split ", folder.lv))
print(folder.split)

change.crd <- crd$subject.category %in% folder.split
change.source <- E$folder.one.cluster.s %in% folder.split
change.target <- E$folder.one.cluster.t %in% folder.split

crd$subject.folder.level[change.crd] <- folder.lv+1
new.crd <- crd[change.crd,names(crd)==paste0("dek.heir.lev.",folder.lv+1)]
new.so <- E[change.source,names(E)==paste0("dek.heir.lev.",folder.lv+1,".source")]
new.tr <- E[change.target,names(E)==paste0("dek.heir.lev.",folder.lv+1,".target")]

crd$subject.category[change.crd] <- new.crd
E$folder.one.cluster.s[change.source] <- new.so
E$folder.one.cluster.t[change.target] <- new.tr
}
## [1] "folders to split  1"
## [1] "basic prog cpp software architecture"
## [2] "ACwassupposedtostudy"                
## [3] "AAA regular"                         
## [1] "folders to split  2"
## [1] "AAAll the small"
## [1] "folders to split  3"
## NULL
## [1] "folders to split  4"
## NULL
## [1] "folders to split  5"
## NULL
#unique(crd$subject.category)

Newly detached decks are attached to the remaining LL SS anchor folders. The new anchor folder for each deck is chosen by the median of the detached deck’s vocabulary similarity to the decks of each anchor folder. When applying this to my collection, most decks end up where they are supposed to be, but some get attached to seemingly unrelated anchors. CHANGEME: again, adjust your own collection.
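A minimal sketch of the rule on made-up numbers (hypothetical anchors A and B), showing why the median is used rather than the single strongest link:

toy_links <- data.frame(main.dek = c("A", "A", "A", "B", "B", "B"),
                        Haberman = c(25, 3, 4, 9, 10, 11))
toy_links %>% group_by(main.dek) %>% summarise(median = median(Haberman))
# A: median(25, 3, 4) = 4 and B: median(9, 10, 11) = 10, so the detached deck attaches
# to B; the median ignores A's single very strong link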

deks.reattach <- unique(crd$subject.category[crd$subject.folder.level > 1])
deks.main <- unique(crd$subject.category[crd$subject.folder.level == 1])
contain.mains <- E$folder.one.cluster.s %in% deks.main | E$folder.one.cluster.t %in% deks.main
E$main.dek <- ""
E$main.dek[contain.mains] <- E$folder.one.cluster.s[contain.mains]
E$main.dek[E$folder.one.cluster.t %in% deks.main] <- E$folder.one.cluster.t[E$folder.one.cluster.t %in% deks.main]

for(i in deks.reattach){ # i <- deks.reattach[1]
  relevant <- E$folder.one.cluster.s %in% i | E$folder.one.cluster.t %in% i
 best.att  <- E %>% 
   filter(relevant & contain.mains) %>%
   group_by(main.dek) %>% summarise(
     median = median(Haberman)
   )
new.name <- best.att$main.dek[which.max(best.att$median )]
# print(max(best.att$median ))
# print(new.name)
# print(i)
crd$subject.category[crd$subject.category==i] <- new.name
E$folder.one.cluster.s[E$folder.one.cluster.s==i] <- new.name
E$folder.one.cluster.t[E$folder.one.cluster.t==i] <- new.name
print(paste(i," to ",new.name))
}
## [1] "AA mine onn reasoning2  to  Alogic reasoning epistemology explain"
## [1] "AA selfhelp to S?  to  Psychology"
## [1] "AAown on spaced repetition  to  AAlearning"
## [1] "AAquicktech  to  ARlang"
## [1] "7 Habits!  to  Psychology"
## [1] "Brain Rules by John Medina &amp; Other Facts  to  Psychology"
## [1] "CDFR 4390- 7 Habits  to  Psychology"
## [1] "Drive by Pink  to  AAlearning"
## [1] "HIT Chapter 3 Health Information Functions, Purpose, and Users  to  Health informatics tech sci sys"
## [1] "Learning How To Learn - Coursera  to  AAlearning"
## [1] "Make It Stick  to  AAlearning"
## [1] "Meta analysis  to  AB stats mle and censorship"
## [1] "Seven Habits  to  Psychology"
## [1] "Seven Habits of a Highly Effective Teen Study Guide  to  Psychology"
## [1] "The New Science of Learning: Chapter 1 Flash Cards  to  Psychology"
## [1] "Ultralearning_badish  to  AAlearning"
## [1] "your best brain  to  Psychology"
## [1] "ZData Visualization BOOK  to  AB stats mle and censorship"
## [1] "Chapter 1 (intro to systems analysis &amp; design)  to  Health informatics tech sci sys"
## [1] "Chapter 4: Programming Languages  to  Health informatics tech sci sys"
## [1] "Chapter 4: Requirements Engineering (R)  to  AAtechComplexBad"
## [1] "Chapter 6  to  ARlang"
## [1] "Chapter 6 - Functions  to  ARlang"
## [1] "Chapter 7 (intro to systems analysis &amp; design)  to  AAtechComplexBad"
## [1] "Chapter Six  to  ARlang"
## [1] "Computer Science - 15. Validation &amp; Verification  to  AAtechComplexBad"
## [1] "CS2002: L2 Software Engineering Basics CH1  to  Health informatics tech sci sys"
## [1] "CSC 200 T3-C7(Software Engineering)  to  AAtechComplexBad"
## [1] "Essentials of Software Engineering Ch. 1-8  to  Health informatics tech sci sys"
## [1] "git 3  to  ProgramingAdvancedWide"
## [1] "Programming Chapter 6 Functions  to  ARlang"
## [1] "RegExX  to  AAtechComplexBad"
## [1] "Software Development  to  AAtechComplexBad"
## [1] "Software Engineering Chapters 3 &amp; 4  to  AAtechComplexBad"
#unique(crd$subject.category)

Many of the preceding steps repeated in one block. The documents of the dtm here are subjects, not decks.

{
make_new_dtm_faster <- crd %>%
  group_by(dek.nam) %>%
  summarise(
    subject.category=first(subject.category)
  )
  
if(!(dim(dtm)[1] == dim(make_new_dtm_faster)[1])) stop("dtm deck names do not line up in alphabetical order like groupings")
#str(dtm)
if(!all(dtm@Dimnames[[1]] == make_new_dtm_faster$dek.nam)) stop("dtm deck names do not line up in alphabetical order like groupings")
#str(tcp)#tcp$deck.lvl.1 for 
#unique(make_new_dtm_faster$subject.category)
deklvl <- mlr::createDummyFeatures(make_new_dtm_faster$subject.category)
deklvl <- as.matrix(deklvl)

dtm2 <- t(deklvl) %*% dtm



# remove any tokens that are in too few
dtm2 <- dtm2[ , colSums(dtm2 > 0) > 1 ]
#again removing words repeated too often though it probably will not matter here
dtm2 <- dtm2[,!(colnames(dtm2) %in% words.remove)]
 
#dim(dtm2)
#colnames(dtm2)[1:30]
#rownames(dtm2)[1:30] 


dtm2[dtm2 > 1] <- 1
C <- coin(t(dtm2), minimum = 1) # [sample.test,] coincidence matrix
N <- asNodes(C) # node data frame
N$deklv1 <- str_split(N$name," -- ",simplify = T)[,1]
E <- edgeList(C,min = -1000,level=-1,criteria = "Haberman") # edge data frame
# Net <- netCoin(N,E,dir="decks")
# summary(E)
E <- E[order(-E$Haberman),]

# hist(E$Haberman)

thelbow <- elbowr(sort(E$Haberman))$y
print(paste('elbow cutoff between subjects ',thelbow))
#summary(sort(tcp$num_chars))
tcp<-tcp[order(tcp$num_chars),]
#tcp$subject

#Net <- netCoin(N, E[E$Haberman>thelbow*1,], dir="aftercut_first_step_aglo")

E$dist <- ifelse(E$Haberman>=thelbow*.95,.0001,
          ifelse(E$Haberman>5,1,10))
}
## [1] "elbow cuttoff between subjects  33.6657116691321"
#old way: remake the dtm from scratch. Time consuming; kept here just in case, and as proof of a nearly identical outcome

if(F){
tcp <- crd %>% 
  group_by(subject.category) %>%
  summarise(
     concated = paste(card.txt, sep = " ", collapse = " "),
     deck.long = str_replace_all(first(subject.category)," -- "," "),
    target = paste(simplest.name,deck.long, sep = " ", collapse = " ")#,
    #deck.lvl.1 =  str_split(first(dek.nam),fixed("--"))[[1]][1]
    ) %>%
  mutate(num_chars=nchar(concated))
#tcp$deck.lvl.1
dim(crd)
dim(tcp)
#str(tcp)
#tcp$concated[which(tcp$num_chars<20000 & tcp$num_chars>10000)[1]]

tcp$concated <- str_remove_all(tcp$concated, fixed("Chapter %"))
tcp$concated <- str_remove_all(tcp$concated, fixed("chapter %"))
tcp$concated <- str_remove_all(tcp$concated, fixed("chapter"))
tcp$concated <- str_remove_all(tcp$concated, fixed("Chapter"))
tcp$concated <- str_remove_all(tcp$concated, fixed("ETH"))
tcp$concated <- str_remove_all(tcp$concated, fixed("Introduction"))
tcp$concated <- str_remove_all(tcp$concated, fixed("introduction"))
tcp$concated <- str_remove_all(tcp$concated, fixed("learning"))
tcp$concated <- str_remove_all(tcp$concated, fixed("Learning"))

crd$subject.category[sample(dim(crd)[1],10 )]


require(textmineR)
dtm2 <- CreateDtm(doc_vec = tcp$concated, 
                 # character vector of documents
                 doc_names = tcp$subject.category, 
                 # document names, optional
                 ngram_window = c(1, 2), 
                 # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), 
                                  # stopwords from tm
                          stopwords::stopwords(source = "smart")), 
                 # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, 
                 # punctuation - this is the default
                 remove_numbers = TRUE, 
                 # numbers - this is the default
                 verbose = T, # Turn off status bar for this demo
                 cpus = 2) 
# by default, this will be the max number of cpus 


dtm2[dtm2>1]<-1
dtm2s[dtm2s>1]<-1
dtm2s <- dtm2s[ , colSums(dtm2s > 0) > 1 ]

str(dtm2)
str(dtm2s)
str(dtm)


for(i in dtm2f@Dimnames[[1]]){
  splat <- which(dtm2@Dimnames[[1]]==i)
  flat <- which(dtm2s@Dimnames[[1]]==i)
  
  cbs<-dtm2[splat,]
  cbs<-data.frame(cbs=cbs,namr=names(cbs))
  
  cbf<-dtm2f[flat,]
  cbf<-data.frame(cbf=cbf,namr=names(cbf))
  
  heyo <- merge(cbs,cbf,all = T)
  print(summary(heyo))
  print(heyo[heyo$cbs!=heyo$cbf,])
}
}

Group Anchor Folders into Subjects

Merge some of the lowest level anchor decks because they have the same subject. CHANGEME: hclust.method can be changed to “complete” or “average” instead of “single”; see ?hclust. TODO: I think there are more appropriate methods, but later. The algorithm may be overenthusiastic about merging.

#str(crd)
#E$dist <- (-E$Haberman)
edge_dist_clustered(E,verby = T,vverby = T,use.elbow.for.height = F,
                    cut.at.height = .1, hclust.method = "single" )
## [1] "AAA easy spanish all" "Aspanish 2"          
## [1] 1
## [1] "-------------------------------------------"
## [1] "AAlearning"                           
## [2] "Alogic reasoning epistemology explain"
## [3] "Psychology"                           
## [1] 2
## [1] "-------------------------------------------"
## [1] "AAtechComplexBad"       "ARlang"                 "ProgramingAdvancedWide"
## [1] 3
## [1] "-------------------------------------------"
## [1] "AB stats mle and censorship" "AStatsBasic"                
## [3] "MathAdvanced"               
## [1] 4
## [1] "-------------------------------------------"
## [1] "econ market financial literacy"
## [1] 5
## [1] "-------------------------------------------"
## [1] "Health informatics tech sci sys"                  
## [2] "information retrieval SEO search collab directory"
## [1] 6
## [1] "-------------------------------------------"
## [1] "percent in decimal; objects outside biggest class 0.785714285714286"
## clustering.singles
## 1 2 3 4 5 6 
## 2 3 3 3 1 2
dp <- crd %>% 
group_by(subject.category) %>%
  summarize(n_char=sum(nchar(card.txt))) 

dek_subject_clus <- merge(dek_cluster_second_output,dp,by.y = 'subject.category', by.x = 'subject.dek')

mergagain <- dek_subject_clus %>%
  arrange(-n_char) %>% 
  group_by(cluster.number) %>% 
  mutate(subject=first(subject.dek)) %>%
  ungroup() %>%
  select(subject.dek,subject)

metest <- merge(crd, mergagain, by.x = "subject.category", by.y = "subject.dek",all.x = T)


if( !all(sort(metest$subject.category) == sort(crd$subject.category))) stop("weird")

metest$subject.guessed<-metest$subject.category
metest$subject.category<-NULL
#subject.guessed is the anchor before merging anchors . Probably should be removed
crd <- metest
metest <- NULL

Cards per subject

crd %>% group_by(subject) %>%
summarize(d=n_distinct(cid))
## # A tibble: 6 × 2
##   subject                                   d
##   <chr>                                 <int>
## 1 Alogic reasoning epistemology explain  9038
## 2 Aspanish 2                            31070
## 3 Health informatics tech sci sys        3770
## 4 MathAdvanced                           3754
## 5 ProgramingAdvancedWide                 8993
## 6 econ market financial literacy          303
# check that no deck ended up with more than one subject
err <- crd %>% group_by(dek.nam) %>%
summarize(d=n_distinct(subject)) %>% 
  filter(d>1)
if(dim(err)[1] > 0) print(err) 

Studies and visualizations of results

Words most common in each subject but rare in other subjects. Useful for understanding the subject and maybe for making new vocabulary cards.
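A toy sketch of the scoring used below (same idea, made-up counts): the share of a term’s occurrences that fall in one subject is raised to the fourth power and multiplied by the raw count, so terms spread evenly over subjects collapse toward zero.

toy_counts <- matrix(c(9, 1,   # "anova": 9 decks of subjA contain it, 1 deck of subjB
                       5, 5),  # "data": split evenly between the two subjects
                     nrow = 2,
                     dimnames = list(c("subjA", "subjB"), c("anova", "data")))
toy_props <- sweep(toy_counts, MARGIN = 2, STATS = colSums(toy_counts), FUN = "/")
toy_props^4 * toy_counts # "anova" keeps ~5.9 for subjA while "data" collapses to ~0.31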

make_new_dtm_faster <- crd %>%
  group_by(dek.nam) %>%
  summarise(
    subject=first(subject)
  )
  
if(!(dim(dtm)[1] == dim(make_new_dtm_faster)[1])) stop("dtm deck names do not line up in alphabetical order like groupings")
#str(dtm)
if(!all(dtm@Dimnames[[1]] == make_new_dtm_faster$dek.nam)) stop("dtm deck names do not line up in alphabetical order like groupings")
#str(tcp)#tcp$deck.lvl.1 for 
#unique(make_new_dtm_faster$subject)
deklvl <- mlr::createDummyFeatures(make_new_dtm_faster$subject)
deklvl <- as.matrix(deklvl)

dtm3 <- t(deklvl) %*% dtm
#dim(dtm3) subjects few are rows

dtm3<-dtm3[,!(colnames(dtm3) %in% words.remove)]
#?sweep() #remove function from every element
props <- sweep(dtm3,MARGIN=2,STATS=colSums(dtm3),FUN='/')
# props is the share of each term's occurrences that fall in this subject
nups <- props * props * props * props * dtm3
# ngrams that are not concentrated in a single subject get their counts reduced
# dramatically: a term split evenly across two subjects is multiplied by 0.5^4 = 0.0625

#  dtm3[1:5,1:30]
# print('space')
#  props[1:5,1:30]
# print('space')
# nups[1:5,1:30]
# summary(dtm3@x)
# summary(props@x)
# summary(nups@x)

vocabulary.new <- data.frame(subjects=rownames(dtm3))
for(i in 1:dim(nups)[1]){
  print(suj.nam <- rownames(dtm3)[i])
  hex <- nups[i,]
  hex <- hex[hex > 1.99]
  bow <- list(names(hex[order(-hex)]))
  vocabulary.new$vocab[i]<-bow
  print( bow[[1]][1:20])
  crd$subject.bag.of.words[crd$subject==suj.nam] <- bow[[1]][1:20]
}
## [1] "Alogic reasoning epistemology explain"
##  [1] "term_memory"    "memories"       "cues"           "sensory"       
##  [5] "consolidation"  "cortex"         "retention"      "recalling"     
##  [9] "hippocampus"    "forgetting"     "emotions"       "massed"        
## [13] "mastery"        "working_memory" "remembering"    "short_term"    
## [17] "auditory"       "interference"   "stimuli"        "traces"        
## [1] "Aspanish 2"
##  [1] "escribir" "poder"    "creer"    "tener"    "estar"    "decir"   
##  [7] "ser"      "hacer"    "como"     "llamar"   "pagar"    "juego"   
## [13] "habla"    "tomar"    "comenzar" "llevar"   "leer"     "querer"  
## [19] "pensar"   "hace"    
## [1] "econ market financial literacy"
##  [1] "price_quantity"       "buyers"               "monopoly"            
##  [4] "supplied_quantity"    "quantity_supplied"    "quantity_demanded"   
##  [7] "demanded_quantity"    "diseconomies"         "diseconomies_scale"  
## [10] "equals_quantity"      "good_falls"           "group_buyers"        
## [13] "law_demand"           "market_group"         "positive_externality"
## [16] "price_ceiling"        "price_floor"          "shortage_situation"  
## [19] "supplied_equals"      "surplus_situation"   
## [1] "Health informatics tech sci sys"
##  [1] "healthcare"             "decision_support"       "health_information"    
##  [4] "information_systems"    "clinical_decision"      "electronic_health"     
##  [7] "ehr"                    "health_care"            "icd"                   
## [10] "information_technology" "information_system"     "clinical_data"         
## [13] "health_record"          "interoperability"       "medical_record"        
## [16] "clinical_information"   "clinicians"             "ehrs"                  
## [19] "providers"              "organizations"         
## [1] "MathAdvanced"
##  [1] "residuals"           "interaction_effect"  "manova"             
##  [4] "ancova"              "spss"                "anova"              
##  [7] "unexplained"         "µ"                   "explained_variance" 
## [10] "multilevel"          "covariate"           "variable_image"     
## [13] "multilevel_analysis" "error_term"          "geq"                
## [16] "linear_combination"  "intercept"           "linear_regression"  
## [19] "covariance_matrix"   "reject_null"        
## [1] "ProgramingAdvancedWide"
##  [1] "data_frame"       "rename"           "arrays"           "data_frames"     
##  [5] "data_table"       "col"              "foo"              "function_return" 
##  [9] "function_returns" "sep"              "dplyr"            "str"             
## [13] "txt"              "num"              "columns_data"     "column_names"    
## [17] "nrow"             "ls"               "strings"          "character_vector"
#note: subject.bag.of.words recycles the 20 words down the subject's rows, so each row holds a single one of the 20 words, but all 20 appear across the subject

saveRDS(vocabulary.new ,"new_vocabulary.RDS")
#crd$subject.bow[sample(1:100000,10)]

Elbow between decks again. The netCoin network is saved to the deck_similar_word_network directory; open its index.html to see it.

E <- hold_for_conv_div_clusters
#hist(E$Haberman)
thelbow <- elbowr(sort(E$Haberman))$y
print(paste("elbow is approximately different generative process so decides cuttoff:",round(thelbow,digits = 3)))
## [1] "elbow is approximately different generative process so decides cuttoff: 11.99"
mor_join <- crd %>% 
  group_by(dek.nam) %>%
  summarise(subject=first(subject))

try({
  N <- merge(N,mor_join,all.x=T,by.y="dek.nam",by.x="name")
  Net <- netCoin(N, E[E$Haberman > thelbow*.95,],color='subject', 
                 dir = "deck_similar_word_network")#, community='lo'
})

# length(unique(E$Target))
# length(unique(E$Source))

Quantile-quantile plot of the similarity between decks of the same subject against the similarity between decks of different subjects. The plot has a nearly constant slope, with internal (same subject) coincidence reaching much higher than external. Internal coincidence is not significantly negative either. So I guess this is decent convergent versus divergent validation of the new clusters?

{
E <- merge(E,mor_join,all.x=T,by.y="dek.nam",by.x="Source")
E <- merge(E,mor_join,all.x=T,by.y="dek.nam",by.x="Target")
names(E)[4] <- "subject.Source"
internal <- E$subject.Source==E$subject.y

int <- quantile(E$Haberman[internal],seq(.01,.99,.001))
ext <- quantile(E$Haberman[!internal],seq(.01,.99,.001))
print("because extremes cut, actual highest external and internal Haberman ")
print(sort(E$Haberman[!internal],decreasing =T)[1:5])
print(sort(E$Haberman[internal],decreasing =T)[1:5])
tp <- ggplot(data=NULL,aes(ext,int)) + geom_point() + geom_smooth(method = "loess")
tp
print(tp)
#ggplot(data=NULL,aes(ext,int)) + geom_point() + xlim(0,max(ext)) +  ylim(0,max(int)) + geom_smooth(method = "glm")
#ggplot(data=E,aes(Haberman,internal)) +geom_boxplot(coef=3)
if(F){
for(i in unique(E$subject.Source)){#i<-'AAlearning'
  internE <- (E$subject.Source==i & E$subject.y==i)
  print(i)
  print(sort(E$Haberman[internE],decreasing =T)[1:5])
  print(sort(E$Haberman[internE],decreasing =F)[1:5])
  plot(sort(E$Haberman[internE]))
}
}
}
## [1] "because extremes cut, actual highest external and internal Haberman "
## [1] 52.83571 43.12399 41.24478 41.03815 36.91748
## [1] 243.4071 243.4071 233.7979 209.7255 181.1292

Decks with the most extreme relations to their subject cluster. For each subject make three tables.

  1. Central. Up to 4 decks with the most connection to the rest of the subject, implying either the culmination or the foundation of the subject. Ex: sentences of a foreign language as the culmination of decks of vocabulary and conjugation, or an overview deck alongside more technical or in depth decks.

  2. Bridging. 2 decks that are most similar to other subjects but still may belong to this one. Not terribly useful.

  3. Misplaced. All decks that coincide more with other subjects than with this one. CHANGEME: move these to where they belong.

subject_net_double <- data.frame(
  d1=c(E$Source,E$Target),
  d2=c(E$Target,E$Source),
  h=c(E$Haberman,E$Haberman),
  sub1=c(E$subject.Source,E$subject.y),
  sub2=c(E$subject.y,E$subject.Source))
for(i in unique(E$subject.Source)){ #i<-'AAlearning'#i<-'ProgramingAdvancedWide'
  g=(subject_net_double$sub1==i)
  if(exists("subject_n")) rm(subject_n)
  subject_n <- as.data.frame(subject_net_double[g,])
  
  print(paste(i, " ------------------------"))
  
  deck_subject_summary <- subject_n %>% 
    group_by(d1) %>% 
    group_by(sub2,.add = TRUE) %>%
    summarize(
     above_elbow = sum(h > thelbow *.95)/length(h) ,
     third_haberman = quantile(h,1/3),
     two_third_haberman = quantile(h,2/3)
    ) 
  
  culmination_foundation <- deck_subject_summary %>% 
    filter(sub2==i) %>%
    arrange(desc(above_elbow)) %>% 
    ungroup() %>% slice_head(n = 2)
  culmination_foundation2 <- deck_subject_summary %>% 
    filter(sub2==i) %>%
    arrange(desc(third_haberman)) %>% 
    ungroup() %>% slice_head(n = 2)
  
  print(unique(rbind(culmination_foundation[1:2,c(-2)],culmination_foundation2[1:2,c(-2)]))[,c("d1")])


  outgoingness <- deck_subject_summary %>% 
    filter(sub2!=i) %>%
    arrange(desc(two_third_haberman)) %>% 
    ungroup() %>% slice_head(n = 2)
  
  print(outgoingness[,c("d1","sub2")])
  
  
  indkey <- deck_subject_summary$sub2[deck_subject_summary$d1==deck_subject_summary$d1[1]] 
  to_name <- function(x) {
    indkey[x]
  }
  ind <- which(indkey == i)
  
  missdeks  <- deck_subject_summary %>%
    summarise(across(c(above_elbow, third_haberman, two_third_haberman), which.max)) %>%
     rowwise() %>%
     mutate(misplaced = sum(c(above_elbow, third_haberman, two_third_haberman) != ind)) %>% 
    ungroup() %>% filter(misplaced>2) %>% 
     mutate(above_elbow=to_name(above_elbow),
            third_haberman=to_name(third_haberman),
            two_third_haberman=to_name(two_third_haberman)) %>%
    group_by(d1) %>%
    mutate( most_common_subject = paste(unique(
      c(above_elbow,third_haberman,two_third_haberman)),collapse = "; ")) %>%
    ungroup() %>% relocate(d1)  %>% select(d1,most_common_subject)
 
  print(missdeks)
        
  }
## [1] "Aspanish 2  ------------------------"
## # A tibble: 3 × 1
##   d1                                                                           
##   <chr>                                                                        
## 1 Aspanish 2 -- Spanish Sentences and Words 1001 (TTS accurate)                
## 2 Aspanish 2 -- Spanish vocab random & with sentences                          
## 3 AAA easy spanish all -- AB Spanish Sentences and Verbs Websters New World 575
## # A tibble: 2 × 2
##   d1                                              sub2                          
##   <chr>                                           <chr>                         
## 1 Aspanish 2 -- Anki 14000 Spanish - English v2.6 Alogic reasoning epistemology…
## 2 Aspanish 2 -- Anki 14000 Spanish - English v2.6 econ market financial literacy
## # A tibble: 0 × 2
## # ℹ 2 variables: d1 <chr>, most_common_subject <chr>
## [1] "Alogic reasoning epistemology explain  ------------------------"
## # A tibble: 4 × 1
##   d1                                                                            
##   <chr>                                                                         
## 1 Psychology -- memory final bootkat                                            
## 2 AAlearning -- AAAAlearning book make it stick -- Make It Stick - Peter Brown …
## 3 Psychology -- MPTL                                                            
## 4 ACwassupposedtostudy -- Learning How To Learn - Coursera                      
## # A tibble: 2 × 2
##   d1                                          sub2                           
##   <chr>                                       <chr>                          
## 1 Psychology -- MPTL                          Health informatics tech sci sys
## 2 Psychology -- Psychology - research methods MathAdvanced                   
## # A tibble: 1 × 2
##   d1                                          most_common_subject
##   <chr>                                       <chr>              
## 1 Psychology -- Psychology - research methods MathAdvanced       
## [1] "ProgramingAdvancedWide  ------------------------"
## # A tibble: 3 × 1
##   d1                                         
##   <chr>                                      
## 1 ARlang -- AmoreR -- R Programming Study Set
## 2 ARlang -- AmoreR -- Advanced R             
## 3 ProgramingAdvancedWide -- Google Interview 
## # A tibble: 2 × 2
##   d1                                                           sub2             
##   <chr>                                                        <chr>            
## 1 AAtechComplexBad -- AZBasicStats                             MathAdvanced     
## 2 ProgramingAdvancedWide -- ITE302cEthicsInformationTechnology Health informati…
## # A tibble: 16 × 2
##    d1                                                        most_common_subject
##    <chr>                                                     <chr>              
##  1 AAA regular -- AAAll the small -- AAquicktech -- BasicMa… MathAdvanced       
##  2 AAtechComplexBad -- AAAinterpretable predictive modeling… MathAdvanced; Heal…
##  3 AAtechComplexBad -- AAAinterpretable predictive modeling… MathAdvanced       
##  4 AAtechComplexBad -- AZBasicStats                          MathAdvanced       
##  5 AAtechComplexBad -- Expert Systems midterm                Alogic reasoning e…
##  6 AAtechComplexBad -- Tech Terms                            Health informatics…
##  7 ProgramingAdvancedWide -- ETH What is that -- ETH Comput… MathAdvanced       
##  8 ProgramingAdvancedWide -- ETH What is that -- ETH Deep L… MathAdvanced       
##  9 ProgramingAdvancedWide -- ETH What is that -- ETH Machin… MathAdvanced       
## 10 ProgramingAdvancedWide -- ETH What is that -- ETH Multim… MathAdvanced       
## 11 ProgramingAdvancedWide -- ETH What is that -- ETH System… econ market financ…
## 12 ProgramingAdvancedWide -- ITE302cEthicsInformationTechno… Health informatics…
## 13 basic prog cpp software architecture -- CSC 200 T3-C7(So… Health informatics…
## 14 basic prog cpp software architecture -- Chapter 7 (intro… Health informatics…
## 15 basic prog cpp software architecture -- Software Develop… Health informatics…
## 16 basic prog cpp software architecture -- Software Enginee… Health informatics…
## [1] "MathAdvanced  ------------------------"
## # A tibble: 4 × 1
##   d1                                                                            
##   <chr>                                                                         
## 1 MathAdvanced -- Advanced Statistics -- Summary -- Combined -- AN(C)OVA & T-te…
## 2 MathAdvanced -- Advanced Statistics -- Lectures -- Combined -- Lecture 3a MAN…
## 3 MathAdvanced -- Math Very Advanced -- 12th Grade -- Probability, Precalc/Stat 
## 4 MathAdvanced -- An Introduction to Statistical Learning                       
## # A tibble: 2 × 2
##   d1                                                                      sub2  
##   <chr>                                                                   <chr> 
## 1 MathAdvanced -- The Language and Tools of Financial Analysis (Coursera) econ …
## 2 MathAdvanced -- Math Very Advanced -- Math -- Probability Theory        Progr…
## # A tibble: 4 × 2
##   d1                                                         most_common_subject
##   <chr>                                                      <chr>              
## 1 MathAdvanced -- Data Science Interviews -- soft-skills     Alogic reasoning e…
## 2 MathAdvanced -- GRAPHS AND NETWORKS                        ProgramingAdvanced…
## 3 MathAdvanced -- ML Jessi                                   ProgramingAdvanced…
## 4 MathAdvanced -- The Language and Tools of Financial Analy… econ market financ…
## [1] "Health informatics tech sci sys  ------------------------"
## # A tibble: 3 × 1
##   d1                                                                            
##   <chr>                                                                         
## 1 Health informatics tech sci sys -- HIM - Data and INFORMATION Standards       
## 2 Health informatics tech sci sys -- Clinical Informatics Board Review          
## 3 information retrieval SEO search collab directory -- info architecture -- Les…
## # A tibble: 2 × 2
##   d1                                                                       sub2 
##   <chr>                                                                    <chr>
## 1 information retrieval SEO search collab directory -- ETH Information Re… Math…
## 2 basic prog cpp software architecture -- Essentials of Software Engineer… Prog…
## # A tibble: 2 × 2
##   d1                                                         most_common_subject
##   <chr>                                                      <chr>              
## 1 basic prog cpp software architecture -- Chapter 4: Progra… MathAdvanced; Prog…
## 2 information retrieval SEO search collab directory -- ETH … MathAdvanced       
## [1] "econ market financial literacy  ------------------------"
## # A tibble: 3 × 1
##   d1                                                                            
##   <chr>                                                                         
## 1 econ market financial literacy -- AP Microeconomics Terms                     
## 2 econ market financial literacy -- AP Microeconomics Terms -- AP Microeconomic…
## 3 econ market financial literacy -- AP Microeconomics Terms -- AP Microeconomic…
## # A tibble: 2 × 2
##   d1                                                                sub2        
##   <chr>                                                             <chr>       
## 1 econ market financial literacy -- Thomas Sowell's Basic Economics Aspanish 2  
## 2 econ market financial literacy -- Thomas Sowell's Basic Economics Alogic reas…
## # A tibble: 0 × 2
## # ℹ 2 variables: d1 <chr>, most_common_subject <chr>

Treemaps by subject. Area is quantity; sqrt_seen_once (used as the area in one treemap) is the square root of the number of cards ever seen in the subject.

#require(treemapify)
require(treemap)
require(grDevices)
 
opp <- crd %>%  
  group_by(subject) %>% 
  summarise(
            Cards_in_deck = n(), 
            did = first(did),
            seen.once = sum(as.numeric(cid) %in% as.numeric(rev$cid))
  ) %>%
  mutate(
    sqrt_cards_in_deck = sqrt(Cards_in_deck),
    sqrt_seen_once = sqrt(seen.once),
    percent_reviewed = (seen.once / Cards_in_deck),
    Sqrt.cards.in.deck = sqrt(Cards_in_deck)
  )

#dim(opp)
#summary(opp)
try({
  treemap(opp, index="subject", vSize="Cards_in_deck", vColor="percent_reviewed", type="manual", palette="RdYlBu")
})

  if(F){ try({
treemap(opp, index="subject", vSize="sqrt_cards_in_deck", vColor="percent_reviewed", type="manual", palette="RdYlBu")
  })}
   try({
 # itreemap(opp)
treemap(opp, index="subject", vSize="sqrt_seen_once", vColor="percent_reviewed", type="manual", palette="RdYlBu")
})

opp <- crd %>%  
  group_by(simplest.name) %>% 
  summarise(
    subject = first(subject),
    Cards_in_deck = n(), 
    did = first(did),
    seen.once = sum((cid) %in% rev$cid)
  ) %>%
  mutate(
    percent_reviewed = (seen.once / Cards_in_deck),
    Sqrt.cards.in.deck = sqrt(Cards_in_deck),
    sqrt.cards.seen.once = sqrt(seen.once)
  )

#dim(opp)
#summary(opp)
try({
treemap(opp, index=c("subject","simplest.name"), vSize="Cards_in_deck", vColor="percent_reviewed", type="manual", palette="RdYlBu")
})

Histogram and density ridges of when each subject’s cards were added to the collection.

require(ggridges)
require(ggplot2)

milli_to_date <- function(mili) as.Date((mili)/(24*60*60*1000), origin = "1970-01-01")
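# quick sanity check, assuming did and rev$id are epoch milliseconds:
# milli_to_date(0)        should give "1970-01-01"
# milli_to_date(86400000) should give "1970-01-02"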

ggplot(crd, aes(x = milli_to_date(did), y=subject,fill = subject, group=subject)) +
  geom_density_ridges(stat = "binline", bins = 60, scale = 1.2) +
  scale_y_discrete(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  coord_cartesian(clip = "off") +
  theme_ridges(font_size = 10)+
    theme( axis.text.x = element_text(angle=30) ,
           legend.position  ='none')+
  scale_x_date(date_breaks="6 months")+
  theme(axis.title.y = element_blank(),axis.title.x = element_blank())

#  ggplot(crd,aes(x=milli_to_date(did),y=subject,fill = after_stat(x) )) + geom_density_ridges_gradient(scale = 1.2,rel_min_height = 0.01,gradient_lwd = 1.) +
#   theme( axis.text.x = element_text(angle=30) )+#after_stat(density)
#   scale_x_date(date_breaks="6 months")+
#   scale_fill_gradientn(colors=rainbow(5))+
#   xlab("Date when subject added") + ylab("Subject")

# ggplot(crd, aes(x = milli_to_date(did), y = subject, fill = stat(x))) +
#   geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
#   theme(axis.title.y = element_blank()) 

# ggplot(crd, aes(x = milli_to_date(did), y = subject, group = subject)) +
#   geom_density_ridges2(
#     stat = "binline",
#     aes(fill = subject),
#     binwidth = 1,
#     scale = 0.95
#   ) +
#   scale_x_continuous( expand = c(0, 0)) +
#   scale_y_discrete(expand = c(0, 0)) +
#   scale_fill_cyclical(values = c("#0000B0", "#7070D0")) +
#   #guides(y = "none") +
#   coord_cartesian(clip = "off") +
#   theme_ridges(grid = FALSE)+
#   scale_x_date(date_breaks="6 months")+ 
#   theme( axis.text.x = element_text(angle=30) )+
# xlab("") + ylab("") 
new_order <- crd %>%
  group_by(subject) %>%
  summarize(use=var(hist(did,breaks=45,plot=F)$counts)) %>%
  arrange(-use)

crd %>%
  #filter(!duplicated(nid)) %>%
ggplot(aes(x=milli_to_date(did),fill=factor(subject,levels = new_order$subject)))+
  geom_histogram(bins = 45)+
  xlab("Date") +
  scale_fill_brewer(palette = "Set1")+
  #ggtitle( "histogram of additions by subject")+
  scale_x_date(date_breaks="6 months")+ 
  theme( axis.text.x = element_text(angle=30) )+
  labs(colour = " ")+
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        legend.title = element_blank())

Cumulative notes in collection by subject.

crd <- crd[order(crd$did),]

crd %>%
  filter(!duplicated(nid)) %>%
  group_by(subject) %>%
  mutate(for.cumcount=1,
    cumulative.notes.in.coll = cumsum(for.cumcount)) %>%
  ggplot(aes(x=milli_to_date(did),y=cumulative.notes.in.coll,color=subject)) + geom_point() +
  scale_x_date(date_breaks="6 months")+ 
  theme( axis.text.x = element_text(angle=30) )+
  theme(axis.title.x = element_blank())

Cumulative notes to be SEEN in collection by subject.

crd <- crd[order(crd$did),]

crd %>%
  filter(!duplicated(nid)) %>%
  filter((nid) %in% rev$nid) %>%
  group_by(subject) %>%
  mutate(for.cumcount=1,
    cumulative.notes.to.be.seen.in.coll = cumsum(for.cumcount)) %>%
  ggplot(aes(x=milli_to_date(did),y=cumulative.notes.to.be.seen.in.coll,color=subject)) + geom_point() +
  scale_x_date(date_breaks="6 months")+ 
  theme( axis.text.x = element_text(angle=30) )+
  theme(axis.title.x = element_blank())

Add subject to rev data table

fr.join <- crd[,c("cid","subject")]
fr.join$cid <- as.numeric(fr.join$cid)
stopifnot(all(rev$cid %in% fr.join$cid))
rev <- merge(rev,fr.join,by="cid",all.x = T,all.y = F)

Histogram and density ridges of when each subject’s cards were reviewed.

ggplot(rev,aes(x=milli_to_date(id),y=subject,fill = after_stat(x))) +
geom_density_ridges_gradient(scale = 1, rel_min_height = 0.0001, gradient_lwd = 1.,panel_scaling=T) +
  coord_cartesian(clip = "off") +
  theme_ridges(font_size = 10)+
  theme( axis.text.x = element_text(angle=30) )+
  scale_x_date(date_breaks="6 months")+
  scale_fill_gradientn(colors=rainbow(8))+
  xlab("Date when reviewed") + ylab("Subject")+
  theme(axis.title.y = element_blank(),axis.title.x = element_blank(),legend.position = 'none')

new_order <- rev %>%
  group_by(subject) %>%
  summarize(use=var(hist(id,breaks=45,plot=F)$counts)) %>%
  arrange(-use)

ggplot(rev,aes(x=milli_to_date(id),fill=factor(subject,levels = new_order$subject)))+
  geom_histogram(bins = 45)+
  xlab("Date") +
  scale_fill_brewer(palette = "Set1")+
  #ggtitle( "Histogram of reviews colored by subject")+
  scale_x_date(date_breaks="6 months")+ 
  theme(axis.text.x = element_text(angle=30))+
  labs(colour = " ")+
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        legend.title = element_blank())

Cumulative new notes reviewed, by subject.

rev <- rev[order(rev$id),]

rev %>%
  filter(!duplicated(nid)) %>%
  group_by(subject) %>%
  mutate(for.cumcount=1,
    cumulative.new.notes.reviewed = cumsum(for.cumcount)) %>%
  ggplot(aes(x=milli_to_date(id),y=cumulative.new.notes.reviewed,color=subject)) + geom_point() +
  scale_x_date(date_breaks="6 months")+ 
  theme( axis.text.x = element_text(angle=30) )+
  theme(axis.title.x = element_blank())

One subject per plot. A histogram of reviews colored by whether the review is of a new card or an already seen card, plus three curves of points in different colors describing the cumulative sums of notes added, notes added that will later be seen, and notes seen so far. TRUE is a review of a card already reviewed and FALSE is the first review of a new note. A two sided note makes 2 cards, like hand -> mano and mano -> hand. Anki has a somewhat similar plot.

rev <- rev[order(rev$id),]
crd <- crd[order(crd$did),]
for(i in unique(crd$subject)){
  crs <- crd[crd$subject==i,]
  res <- rev[rev$subject==i,]
  if(dim(res)[1]<100) next()
  
  nts_fg <- crs$did[!duplicated(crs$nid)]
nts_gd <- (crs$did[!duplicated(crs$nid) & (as.numeric(crs$nid) %in% res$nid)])

p <- ggplot()+
  geom_histogram(aes(x=milli_to_date(res$id),color=!res$first.rev.of.a.note),binwidth = 60)+
  geom_point(aes(x=milli_to_date(nts_fg),cumsum(rep(1,times=length(nts_fg))),
                 color="cum notes in coll"))+
  geom_point(aes(x=milli_to_date(nts_gd),cumsum(rep(1,times=length(nts_gd))),
                 color="c n c, to be seen"))+
  geom_point(aes(x=milli_to_date(res$id[res$first.rev.of.a.note]),
                 cumsum(rep(1,times=length(res$id[res$first.rev.of.a.note]))),
                 color="cum notes seen"))+
  xlab("Date") + ylab("Histogram is of reviews of old and new cards")+ 
  ggtitle( i)+
  scale_x_date(date_breaks="6 months")+ 
  theme( axis.text.x = element_text(angle=30) )+
  labs(colour = " ")+
  theme(axis.title.y = element_blank(),axis.title.x = element_blank())

print(p)

p <- ggplot()+
  geom_histogram(aes(x=milli_to_date(res$id),color=!res$first.rev.of.a.note),binwidth = 60)+
  geom_point(aes(x=milli_to_date(nts_gd),cumsum(rep(1,times=length(nts_gd))),
                 color="c n c, to be seen"))+
  geom_point(aes(x=milli_to_date(res$id[res$first.rev.of.a.note]),
                 cumsum(rep(1,times=length(res$id[res$first.rev.of.a.note]))),
                 color="cum notes seen"))+
  xlab("Date") + ylab("Histogram is of reviews of old and new cards")+ 
  ggtitle( i)+
  scale_x_date(date_breaks="6 months")+ 
  theme( axis.text.x = element_text(angle=30) )+
  labs(colour = " ")+
  theme(axis.title.y = element_blank(),axis.title.x = element_blank())
print(p)
}

end

Check that duplicate columns in crd and rev have not diverged.

{
cols <- names(crd)[names(crd) %in% names(rev)]
#print(cols)
tocompare <- rev[!duplicated(rev$cid),cols]
tocompare <- tocompare[order(tocompare$cid),]
crd_compare <- crd[order(crd$cid),cols]
crd_compare <- crd_compare[as.numeric(crd_compare$cid) %in% tocompare$cid,]


if(nchar(all_equal(crd_compare,tocompare))>10){
  require(arsenal, quietly = T)
  require(diffdf, quietly = T)
  print("redundancies between rev and crd are not aligned this may be a big bug")
  dim(tocompare)
dim(crd_compare)
  #(all_equal(crd_compare,tocompare)[1:20])

print(diffdf(crd_compare,tocompare))

comparedf(crd_compare,tocompare)
}
}

Remove most objects from memory and save the rest into files.

saveRDS(words.remove ,"words_removed.RDS")
saveRDS(rev,"rev2.RDS")
saveRDS(crd,"crd2.RDS")


if(F){
#rm(cards)
#rm(note)
#rm(crd)
#rm(dek)

rm(list=unlist(setdiff(ls(),c("rev","crd","tf_mat","vocabulary.new"))))
}

gc()
##            used  (Mb) gc trigger (Mb)  max used (Mb)
## Ncells  3240683 173.1    5916167  316   5916167  316
## Vcells 27788062 212.1  103929280  793 162388325 1239
save.image(file = 'Step_2_after.RData')