Assignment 4

DUE DATE March 11, 2016, 11:59pm PST

NOTE I expect you to be hard at work already, so assignment 4 is an easy one. The entire solution is provided in the solution folder as a markdown file, but you are encouraged to think through how you would solve it yourself as an intellectual exercise. We will discuss parts of the solution in a class lecture.

Submission Instructions: If you want to try it on your own, feel free to peek at the solution whenever you get stuck. In the worst case, you can even run the solution markdown in its entirety, submit the output, and still receive full credit.

For this assignment, upload both the markdown (ass4.Rmd) and the HTML output (ass4.html) to your private directory on stat290.stanford.edu.

Parallel R Experiments

We briefly talked about the glmnet package in class. See the paper Regularization Paths for Generalized Linear Models via Coordinate Descent by Friedman, Hastie, and Tibshirani in the Journal of Statistical Software, Vol. 33, Issue 1, which discusses the Internet Ad document classification problem with mostly binary features.

Install the standard glmnet package from CRAN. The code you want to run is provided in the solution folder along with the data. There is obviously no single correct answer, and you may or may not see differences in the timings, since this is only a small experiment. Note also that we are doing 10-fold cross-validation with a small number of workers.

1a. Test runs

Experiment with 2, 3, and 4 workers, using two types of backends (the parallel and snow packages), on your own machine. The associated functions are runParallel and runSnow.
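
Before picking worker counts, it may help to know how many cores your machine actually has. A quick way to check (a side note, not part of the assignment) uses the parallel package that ships with R:

## Number of (logical) cores available on this machine
parallel::detectCores()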

## SNOW Run
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-2
runSnow <- function(n, seed = 12345) {
    catn <- function(...) cat(..., "\n")
    internetAd <- readRDS("internetAd.RDS")

    catn("Percent non-zero per glmnet paper",
    (sum(internetAd$x > 0) + sum(internetAd$y > 0)) /
    (prod(dim(internetAd$x)) + length(internetAd$y)))

    stopifnot(require(doSNOW))
    cl <- makeCluster(n)
    registerDoSNOW(cl)
    set.seed(seed)
    time <- system.time(cv <- cv.glmnet(internetAd$x, internetAd$y,
                                        family = "binomial", type.measure = "class",
                                        parallel = TRUE))
    stopCluster(cl)
    list(time = time, cv = cv)
}
resultsSnow <- lapply(2:5, runSnow)
## Percent non-zero per glmnet paper 0.01166918
## Loading required package: doSNOW
## Loading required package: iterators
## Loading required package: snow
## Percent non-zero per glmnet paper 0.01166918 
## Percent non-zero per glmnet paper 0.01166918 
## Percent non-zero per glmnet paper 0.01166918

A plot. Don’t be surprised if you see no performance gains; the parallelization overhead is significant.

library(ggplot2)
d <- data.frame(nWorkers = 2:5, t(sapply(resultsSnow, function(x) x$time)))
qplot(x = nWorkers, y = user.self, geom="line", data=d)
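
For a reference point, you can also time a purely sequential run. This is a minimal sketch (not required by the assignment) reusing the same data and seed; the gap, or lack of one, between this and the parallel timings above is the overhead at work.

internetAd <- readRDS("internetAd.RDS")
set.seed(12345)
## Same fit as above, but sequential (parallel = FALSE is the default)
seqTime <- system.time(cv.glmnet(internetAd$x, internetAd$y,
                                 family = "binomial", type.measure = "class",
                                 parallel = FALSE))
seqTime["elapsed"]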

Check results.

lapply(resultsSnow, function(x) x$cv$lambda.min)
## [[1]]
## [1] 0.001919066
## 
## [[2]]
## [1] 0.001919066
## 
## [[3]]
## [1] 0.001919066
## 
## [[4]]
## [1] 0.001919066

## Parallel Run
runParallel <- function(n, seed = 12345) {
    catn <- function(...) cat(..., "\n")
    internetAd <- readRDS("internetAd.RDS")
    catn("Percent non-zero per glmnet paper",
    (sum(internetAd$x > 0) + sum(internetAd$y > 0)) /
    (prod(dim(internetAd$x)) + length(internetAd$y)))

    stopifnot(require(doParallel))
    registerDoParallel(n)
    set.seed(seed)
    time <- system.time(cv <- cv.glmnet(internetAd$x, internetAd$y,
                                        family = "binomial", type.measure = "class",
                                        parallel = TRUE))
    list(time = time, cv = cv)
}

resultsParallel <- lapply(2:5, runParallel)
## Percent non-zero per glmnet paper 0.01166918
## Loading required package: doParallel
## Loading required package: parallel
## 
## Attaching package: 'parallel'
## The following objects are masked from 'package:snow':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, clusterSplit, makeCluster,
##     parApply, parCapply, parLapply, parRapply, parSapply,
##     splitIndices, stopCluster
## Percent non-zero per glmnet paper 0.01166918 
## Percent non-zero per glmnet paper 0.01166918 
## Percent non-zero per glmnet paper 0.01166918

A plot.

d <- data.frame(nWorkers = 2:5, t(sapply(resultsParallel, function(x) x$time)))
qplot(x = nWorkers, y = user.self, geom="line", data=d)
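
Since user.self measures CPU time in the master R process only, elapsed (wall-clock) time is arguably the fairer metric for parallel runs. An optional sketch overlaying the two backends on elapsed time:

d2 <- rbind(data.frame(nWorkers = 2:5, backend = "snow",
                       elapsed = sapply(resultsSnow, function(x) x$time["elapsed"])),
            data.frame(nWorkers = 2:5, backend = "parallel",
                       elapsed = sapply(resultsParallel, function(x) x$time["elapsed"])))
qplot(x = nWorkers, y = elapsed, color = backend, geom = "line", data = d2)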

Results.

lapply(resultsParallel, function(x) x$cv$lambda.min)
## [[1]]
## [1] 0.001919066
## 
## [[2]]
## [1] 0.001919066
## 
## [[3]]
## [1] 0.001919066
## 
## [[4]]
## [1] 0.001919066

1b. Cluster utilization

For \(n=3\) workers, produce a worker utilization plot for runSnow using snow.time. Replace system.time in the function with snow.time and just plot the resulting object. Rename the function if that helps.

Just for clarity, let’s rename the function to snowPlot after we make the changes.

snowPlot <- function(n, seed = 12345) {
    catn <- function(...) cat(..., "\n")
    internetAd <- readRDS("internetAd.RDS")

    catn("Percent non-zero per glmnet paper",
    (sum(internetAd$x > 0) + sum(internetAd$y > 0)) /
    (prod(dim(internetAd$x)) + length(internetAd$y)))

    stopifnot(require(doSNOW))
    cl <- makeCluster(n)
    registerDoSNOW(cl)
    set.seed(seed)
    plot <- snow.time(cv <- cv.glmnet(internetAd$x, internetAd$y,
                                      family = "binomial", type.measure = "class",
                                      parallel = TRUE))
    stopCluster(cl)
    list(plot = plot, cv = cv)
}
snowResults <- snowPlot(n = 3)
## Percent non-zero per glmnet paper 0.01166918

Utilization plot for snow. The plot method for a snow.time object shows, node by node, when each worker was busy and when it was idle, so you can eyeball the load balance.

plot(snowResults$plot)

Exploring Social Network Data

A popular problem in data mining courses is to write a MapReduce program in Hadoop implementing a simple People You Might Know friendship recommendation algorithm for a social network.

The key idea is that if two people have a lot of mutual friends, then the system should recommend that they connect with each other. Data is provided in the file soc-LiveJournal1Adj.txt. It contains the adjacency list and has multiple lines in the following format:

User<TAB>Friends

User is a unique integer ID corresponding to a unique user and Friends is a comma separated list of unique IDs corresponding to the friends of the user with the unique id User. The friendships are mutual (i.e., edges are undirected): if \(A\) is a friend of \(B\) then \(B\) is also a friend of \(A\).

This dataset is too large to work with on your laptop, so for this exercise, use the simple network of friends below to try out your strategies.

For example, users 1 and 2 are friends but have no other friends in common. Users 1 and 4, also friends, have users 5 and 7 as common friends. User 3 has 4 friends in common with user 4, 2 with user 1, 2 with user 6, and 1 with user 7, but is not friends with any of them.

2a. Mutual friends

For each friend pair \((U,V)\), produce a list of friends they have in common.

## Friend pair
## naras@stat 2011/09/22
##
## The data
##
records <- list(c(1, 2, 4, 5, 7),
                c(2, 1, 3, 9, 10),
                c(3, 2, 5, 8, 9, 10),
                c(4, 1, 5, 7, 8, 9, 10),
                c(5, 1, 3, 4),
                c(6, 7, 8, 9),
                c(7, 1, 4, 6, 8),
                c(8, 3, 4, 6, 7),
                c(9, 2, 3, 4, 6, 10),
                c(10, 2, 3, 4, 9))
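
As a quick check of the hand calculations above, plain base R (just for illustration) confirms the common friends of users 1 and 4:

intersect(c(2, 4, 5, 7), c(1, 5, 7, 8, 9, 10))
## [1] 5 7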

This exercise calls for hash tables of keys and values, so we use the hash package from CRAN. Install it if you haven’t already done so.

Our solution strategy is the following:

  1. Map the data to a set of key-value pairs, where the keys are friend pairs and the values are their sets of friends.

  2. Reduce the values for each key by computing the intersection of the sets of friends.

So we start by generating, for each user \(u\) and each friend \(v\) of \(u\), a key-value pair: the key is the friend pair \((u, v)\) and the value is the list of friends of \(u\). Thus, for the first record above, we get the key-value pairs shown below. Note how the key always has the user ids in sorted order.

Key Value
"1,2" list(c(2, 4, 5, 7))
"1,4" list(c(2, 4, 5, 7))
"1,5" list(c(2, 4, 5, 7))
"1,7" list(c(2, 4, 5, 7))

When the second record is processed, its first friend entry is user 1, so the key "1,2" shows up again and the value above changes to:

Key Value
"1,2" list(c(2, 4, 5, 7), c(1, 3, 9, 10))
"1,4" list(c(2, 4, 5, 7))
"1,5" list(c(2, 4, 5, 7))
"1,7" list(c(2, 4, 5, 7))

And so on: the values keep accumulating the sets of friends, which, when intersected, will give the desired result.
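
To make this concrete, the key construction sorts the two ids, so both orderings of a pair yield the same key; a one-line base-R illustration:

paste(sort(c(2, 1)), collapse = ",")  ## same key as paste(sort(c(1, 2)), collapse = ",")
## [1] "1,2"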

So here is our mapper function.

library(hash)
## hash-2.2.6 provided by Decision Patterns
##
## For each id, and for each friend of that id, emit as key the string
## "id,friend_id" (ids in sorted order) and as value the original list
## of friends of id, as illustrated in the tables above.
##
keyMap <- hash()
mapper <- function(x) {
    id <- x[1] ## first element of the record
    rest <- x[-1] ## the rest of the record
    lapply(rest,
           function(z) {
               key <- paste(sort(c(id, z)), collapse = ",")
               value <- keyMap[[key]]
               ## if key is not yet in keyMap, value will be NULL, so put it in
               if (is.null(value)) { ## not yet in hash table, so put it in
                   keyMap[[key]] <- list(rest)
               } else { ## there's a value already in keyMap, so append
                   keyMap[[key]] <- c(value, list(rest))
               }
           })
    TRUE
}

All we need to do now is apply it to our records.

lapply(records, mapper)
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
## 
## [[8]]
## [1] TRUE
## 
## [[9]]
## [1] TRUE
## 
## [[10]]
## [1] TRUE

You can print keyMap to see what we got.

keyMap
## <hash> containing 21 key-value pair(s).
##   1,2 : 2, 4, 5, 7 1, 3, 9, 10
##   1,4 : 2, 4, 5, 7 1, 5, 7, 8, 9, 10
##   1,5 : 2, 4, 5, 7 1, 3, 4
##   1,7 : 2, 4, 5, 7 1, 4, 6, 8
##   2,10 : 1, 3, 9, 10 2, 3, 4, 9
##   2,3 : 1, 3, 9, 10 2, 5, 8, 9, 10
##   2,9 : 1, 3, 9, 10 2, 3, 4, 6, 10
##   3,10 : 2, 5, 8, 9, 10 2, 3, 4, 9
##   3,5 : 2, 5, 8, 9, 10 1, 3, 4
##   3,8 : 2, 5, 8, 9, 10 3, 4, 6, 7
##   3,9 : 2, 5, 8, 9, 10 2, 3, 4, 6, 10
##   4,10 : 1, 5, 7, 8, 9, 10 2, 3, 4, 9
##   4,5 : 1, 5, 7, 8, 9, 10 1, 3, 4
##   4,7 : 1, 5, 7, 8, 9, 10 1, 4, 6, 8
##   4,8 : 1, 5, 7, 8, 9, 10 3, 4, 6, 7
##   4,9 : 1, 5, 7, 8, 9, 10 2, 3, 4, 6, 10
##   6,7 : 7, 8, 9 1, 4, 6, 8
##   6,8 : 7, 8, 9 3, 4, 6, 7
##   6,9 : 7, 8, 9 2, 3, 4, 6, 10
##   7,8 : 1, 4, 6, 8 3, 4, 6, 7
##   9,10 : 2, 3, 4, 6, 10 2, 3, 4, 9

Let’s dig deeper since the printed value of keyMap does not convey the structure.

keyMap[["1,2"]]
## [[1]]
## [1] 2 4 5 7
## 
## [[2]]
## [1]  1  3  9 10
keyMap[["4,10"]]
## [[1]]
## [1]  1  5  7  8  9 10
## 
## [[2]]
## [1] 2 3 4 9

All we need to do now is reduce the values in the key-value pairs using intersection.

## Grouping by keys done, so now reduce
for (key in keys(keyMap)) {
    values <- keyMap[[key]]
    keyMap[[key]] <- Reduce(f = intersect, x = values)
}
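
Reduce folds intersect over however many sets have accumulated under a key (exactly two per key in this problem, since friendships are mutual, but the code does not depend on that). A quick side illustration with three sets:

Reduce(f = intersect, x = list(c(1, 2, 3), c(2, 3, 4), c(2, 5)))
## [1] 2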

At this point the hash table will contain the common friends for the friend pairs.

for (key in keys(keyMap)) {
    cat(key, "\t", keyMap[[key]], "\n")
}
## 1,2    
## 1,4   5 7 
## 1,5   4 
## 1,7   4 
## 2,10      3 9 
## 2,3   9 10 
## 2,9   3 10 
## 3,10      2 9 
## 3,5    
## 3,8    
## 3,9   2 10 
## 4,10      9 
## 4,5   1 
## 4,7   1 8 
## 4,8   7 
## 4,9   10 
## 6,7   8 
## 6,8   7 
## 6,9    
## 7,8   4 6 
## 9,10      2 3 4

2b. Friend Recommendations

For each user \(U\), recommend users who are not already friends with \(U\) but who have the largest number of mutual friends in common with \(U\). The first recommendation should be the user with the most common friends and the last the one with the fewest. That is why the recommendations for user 3 above would be 4 first, then 1 and 6 in either order, and finally 7.

Here is a solution that just distinguishes friends at distances 1 and 2 and generates suitable key-value pairs. The former are, of course, the direct friends; the latter are candidates reached through a mutual friend.

The menial task of adding a key-value pair to a table can be relegated to a function. The logic is simply this: if there is no value yet for a key, add the key-value pair; if there is, append the new value to what is already stored under the key. (We could also have used this function in part 2a!)

keyMap <- hash()
pushKeyValue <- function(key, value) {
    key <- as.character(key)
    existingValue <- keyMap[[key]]
    ## if key is not yet in keyMap, existingValue will be NULL, so put it in
    if (is.null(existingValue)) { ## not yet in hash table, so put it in
        keyMap[[key]] <- list(value)
    } else { ## there's an existing value already in keyMap, so append
        keyMap[[key]] <- c(existingValue, list(value))
    }
    TRUE
}
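
A quick toy check (hypothetical key and values, deleted immediately so the real run is unaffected) shows the accumulation behavior:

invisible(pushKeyValue("toy", list(user = 99, dist = 1)))
invisible(pushKeyValue("toy", list(user = 98, dist = 2)))
length(keyMap[["toy"]])  ## the second push appended to the first
## [1] 2
del("toy", keyMap)       ## clean up before the real run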

mapper <- function(x) {
    user <- x[1]
    directFriends <- x[-1]
    lapply(directFriends, function(z) {
        pushKeyValue(key = user,
                     value = list(user = z, dist = 1))
    })
    if (length(x) > 2) { ## more than 1 friend
        ## Compute all possible pairs of the user's friends: each such
        ## pair is at distance 2 through this user, and we emit the
        ## relationship in both directions below
        tuples <- combn(directFriends, 2)
        apply(tuples, 2, function(z)
            pushKeyValue(key = z[1],
                         value = list(user = z[2], dist = 2)))
        apply(tuples, 2, function(z)
            pushKeyValue(key = z[2],
                         value = list(user = z[1], dist = 2)))
    }
    TRUE
}
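
As a sanity check on the pair generation, combn enumerates all pairs of a vector; for user 1's friends it yields the six candidate distance-2 pairs:

combn(c(2, 4, 5, 7), 2)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    2    2    4    4    5
## [2,]    4    5    7    5    7    7

Now we apply the mapper to every record.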

lapply(records, mapper)
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
## 
## [[8]]
## [1] TRUE
## 
## [[9]]
## [1] TRUE
## 
## [[10]]
## [1] TRUE
str(keyMap[["1"]])
## List of 17
##  $ :List of 2
##   ..$ user: num 2
##   ..$ dist: num 1
##  $ :List of 2
##   ..$ user: num 4
##   ..$ dist: num 1
##  $ :List of 2
##   ..$ user: num 5
##   ..$ dist: num 1
##  $ :List of 2
##   ..$ user: num 7
##   ..$ dist: num 1
##  $ :List of 2
##   ..$ user: num 3
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 9
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 10
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 5
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 7
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 8
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 9
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 10
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 3
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 4
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 4
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 6
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 8
##   ..$ dist: num 2
str(keyMap[["10"]])
## List of 20
##  $ :List of 2
##   ..$ user: num 1
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 3
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 9
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 2
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 5
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 8
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 9
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 1
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 5
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 7
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 8
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 9
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 2
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 3
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 4
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 6
##   ..$ dist: num 2
##  $ :List of 2
##   ..$ user: num 2
##   ..$ dist: num 1
##  $ :List of 2
##   ..$ user: num 3
##   ..$ dist: num 1
##  $ :List of 2
##   ..$ user: num 4
##   ..$ dist: num 1
##  $ :List of 2
##   ..$ user: num 9
##   ..$ dist: num 1

## Grouping by keys done, so now reduce
for (key in keys(keyMap)) {
    userMap <- hash()
    values <- keyMap[[key]]
    for (v in values) {
        user <- as.character(v[["user"]]) ## make key as char for hashing
        dist <- v[["dist"]]
        if (dist == 1) { ## direct friend: mark NA so it is dropped later
            userMap[[user]] <- NA
        } else if (dist == 2) { ## mutual friends
            currentValue <- userMap[[user]]
            if (!is.null(currentValue)) {
                if (!is.na(currentValue)) { ## bump the count
                    userMap[[user]] <- currentValue + 1
                }
            } else {  ## count 1
                userMap[[user]] <- 1
            }
        }
    }
    ## Drop direct friends
    for (user in keys(userMap)) {
        if (is.na(userMap[[user]])) {
            del(user, userMap)
        }
    }
    counts <- as.list(userMap)
    ord <- order(as.integer(counts), decreasing = TRUE)
    counts <- counts[ord]
    ## coerce the list of counts to a named integer vector
    storage.mode(counts) <- "integer"
    keyMap[[key]] <- counts
}

#sink("recommend-friends.txt")
cat("Recommendation, top row is user, bottom row frequency\n")
## Recommendation, top row is user, bottom row frequency
for (key in keys(keyMap)) {
    cat("Recommendation for User", key, "\n")
    print(keyMap[[key]])
}
## Recommendation for User 1
## 10  3  8  9  6
##  2  2  2  2  1
## Recommendation for User 10
## 1 5 8 6 7
## 2 2 2 1 1
## Recommendation for User 2
## 4 5 6 7 8
## 3 2 1 1 1
## Recommendation for User 3
## 4 1 6 7
## 4 2 2 1
## Recommendation for User 4
## 3 2 6
## 4 3 3
## Recommendation for User 5
## 10  2  7  8  9
##  2  2  2  2  2
## Recommendation for User 6
##  4  3 10  1  2
##  3  2  1  1  1
## Recommendation for User 7
##  5  9 10  2  3
##  2  2  1  1  1
## Recommendation for User 8
##  9  1 10  5  2
##  3  2  2  2  1
## Recommendation for User 9
## 8 1 5 7
## 3 2 2 2
#sink()
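
As a final sanity check (a small sketch), inspect the entry for user 3 directly; per the hand calculation earlier, it recommends 4 (4 common friends), then 1 and 6 (2 each), then 7 (1):

keyMap[["3"]]
## 4 1 6 7
## 4 2 2 1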

Note on problem 2

You can take the code for problem 2b and run it on the actual data file soc-LiveJournal1Adj.txt provided, to generate all the friend recommendations. It takes about an hour on my laptop.

The only change is to generate the records from the data file, which you can do as follows.

lines <- readLines("soc-LiveJournal1Adj.txt")

## records is a list of integer vectors: the first element of each is
## the user id, the rest are friend ids
records <- lapply(lines, function(x) as.integer(strsplit(x, split="[\t,]")[[1]]))
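
If you do run it, remember to start with a fresh hash table, since keyMap still holds the toy-network results at this point. A minimal sketch of the full pipeline:

keyMap <- hash()                    ## fresh table; the old one holds toy results
invisible(lapply(records, mapper))  ## the map step is the slow part
## ... then run the reduce loop and the printing loop from 2b unchanged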

Session Info

sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] hash_2.2.6        doParallel_1.0.10 ggplot2_2.0.0     doSNOW_1.0.14    
## [5] snow_0.4-1        iterators_1.0.8   glmnet_2.0-2      foreach_1.4.3    
## [9] Matrix_1.2-3     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.2      knitr_1.12.3     magrittr_1.5     munsell_0.4.2   
##  [5] colorspace_1.2-6 lattice_0.20-33  stringr_1.0.0    plyr_1.8.3      
##  [9] tools_3.2.3      grid_3.2.3       gtable_0.1.2     htmltools_0.3   
## [13] digest_0.6.9     formatR_1.2.1    codetools_0.2-14 evaluate_0.8    
## [17] rmarkdown_0.9.2  labeling_0.3     stringi_1.0-1    compiler_3.2.3  
## [21] scales_0.3.0