A bag-of-words model represents a document by the counts of the words it contains, ignoring word order. For instance, the document “Mary is quicker than John” would be represented as {Mary:1, is:1, quicker:1, than:1, John:1}, and “John is quicker than Mary” would be represented by exactly the same bag.
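As a quick illustration, the counts can be produced directly in R (assuming a simple lower-cased, whitespace tokenization):
# bag-of-words counts for a single document
table(strsplit(tolower("Mary is quicker than John"), " ")[[1]])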
In the boolean retrieval model, documents are retrieved via Boolean operators such as AND, OR, and NOT. This type of model only checks for the presence or absence of a term in a document, so a query like “Mary AND John” returns exactly the documents that contain both words.
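A minimal sketch of how such a query could be evaluated against a term-document incidence matrix (the 0/1 matrix and document names here are made up for illustration):
# hypothetical incidence matrix: rows are terms, columns are documents
inc <- matrix(c(1,0,1,    # mary
                1,1,0),   # john
              nrow=2, byrow=TRUE,
              dimnames=list(c("mary","john"), c("d1","d2","d3")))
# documents satisfying the boolean query "mary AND john"
colnames(inc)[inc["mary",] & inc["john",]]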
In the vector space model, documents are represented as vectors of term weights; in the simplest case each element is the number of times a term appears in the document. Documents can then be compared by comparing their vectors, which gives a natural way to measure document similarity.
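For example, with a made-up three-term vocabulary:
# hypothetical term-frequency vectors over the vocabulary (duck, recipe, rabbit)
v1 <- c(duck=3, recipe=1, rabbit=0)
v2 <- c(duck=1, recipe=1, rabbit=2)
sum(v1 * v2)   # unnormalized similarity via the dot product; cosine similarity is used later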
TF-IDF is given by the equation \(tf.idf_{t,d} = (1+\log_{10}tf_{t,d})\times \log_{10}(N/df_t)\): the log-scaled term frequency multiplied by the log of the inverse document frequency. The log on the term frequency dampens repetition, so a term that appears 10 times does not make the document 10 times as relevant, while the IDF factor weights terms by rarity, so a term counts for more the less often it appears across the collection of documents.
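As a made-up worked example: a term occurring \(tf_{t,d}=10\) times in a document, in a collection of \(N=1000\) documents of which \(df_t=10\) contain the term, scores \((1+\log_{10}10)\times\log_{10}(1000/10)=2\times 2=4\):
(1 + log10(10)) * log10(1000/10)   # tf-idf for tf=10, N=1000, df=10, giving 4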
A PageRank score measures how important a page is. The key idea is that a page’s importance is determined by the importance of the pages that link to it, but each page only passes on a proportion of its own importance: if Page A links to many pages including Page B, then Page B receives only a small fraction of Page A’s score. PageRank is often described as the probability that a “random surfer”, who moves around the web by repeatedly clicking random links, is visiting the page at any given moment.
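In formula form (consistent with the iteration used later in this write-up), \(r_j = \sum_{i \rightarrow j} \frac{r_i}{d_i}\), where the sum runs over the pages \(i\) linking to page \(j\) and \(d_i\) is the number of outgoing links on page \(i\), so each linking page passes on only a \(1/d_i\) share of its own score.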
TF-IDF alone only measures whether a page contains the words we are looking for; the page itself may still be useless, since techniques like meta-tag stuffing can inflate a TF-IDF score. PageRank, on the other hand, gives a good measure of a page’s importance in the overall web but says nothing about its relevance to the query. One way to combine them is to first use TF-IDF to find relevant pages and then sort those pages by PageRank to filter out useless or spam pages.
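A minimal sketch of that combination, using made-up score vectors (both the scores and the 0.5 relevance cutoff are hypothetical):
# hypothetical TF-IDF relevance scores and PageRank scores, indexed by document
tfidf.score <- c(doc1=0.8, doc2=0.6, doc3=0.1)
pagerank    <- c(doc1=0.05, doc2=0.30, doc3=0.65)
# keep documents that are relevant enough, then order them by PageRank
relevant <- names(tfidf.score)[tfidf.score > 0.5]
relevant[order(pagerank[relevant], decreasing=TRUE)]   # doc2, then doc1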
An inverted index is an index in which the terms themselves are the keys and each term points to the list of documents containing it. With this kind of index you can go directly to the documents that contain a given term instead of scanning every document looking for it. Term-at-a-time query processing uses the inverted index by walking through the query terms and accumulating the documents that contain them (often processing the rarest terms first). Document-at-a-time processing instead evaluates candidate documents one by one against all of the query terms; done naively, without an index, this means reading every document for every query, which is infeasible for large collections.
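A minimal sketch of an inverted index in R, with a term-at-a-time style accumulation (the postings mirror the five documents used in the next part):
# inverted index: each term maps to the documents containing it
index <- list(duck    = c("d1","d2","d3","d5"),
              recipe  = c("d3","d4","d5"),
              beijing = c("d2","d5"))
# accumulate candidate documents for the query and count how many query terms each matches
q <- c("beijing","duck","recipe")
sort(table(unlist(index[q])), decreasing=TRUE)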
For this section we need to use TF-IDF to calculate the matching score between the query Q and each document. We only consider the following terms: dish, duck, rabbit, Beijing, recipe, roast. The TF and IDF tables are shown below along with the TF-IDF scores between query Q and each document. Note that roast does not appear in any of the documents, so I am disregarding it; a document frequency of zero would send its IDF to infinity (division by zero).
First we need to set up the document data.
d1 <- "walks like a duck and quacks like a duck, must be a duck"
d2 <- "Beijing Duck is prized for the thin, crispy duck skin with authentic versions of
the dish serving"
d3 <- "Bugs' ascension to stardom also prompted the Warner animators to recast Daffy Duck
as the rabbit's rival, intensely jealous and determined to steal back the spotlight while
Bugs remained indifferent to the duck's jealousy, or used it to his advantage. This turned
out to be the recipe for the success of the duo"
d4 <- "I found this great recipe for rabbit Braised in Wine on cookingforengineers.com"
d5 <- "Last week Li has shown you how to make the Sechuan duck. Today we'll show you a
recipe for making Chinese dumplings, a popular dish that I had a chance to try last
summer in Beijing."
docs <- list(d1,d2,d3,d4,d5)
docs <- gsub("'", ' ', docs)
names(docs) <- paste0("doc",c(1:length(docs)))
library(tm)  # text-mining package providing Corpus, tm_map, and TermDocumentMatrix
my.corpus <- VectorSource(docs)
my.corpus$Names <- c(names(docs))
my.corpus <- Corpus(my.corpus)
my.corpus <- tm_map(my.corpus, removePunctuation)
tdm <- TermDocumentMatrix(my.corpus,
                          control = list(dictionary = c("dish","duck","rabbit","beijing","recipe")))
tf <- as.matrix(tdm)
Here is the TF table:
## Docs
## Terms 1 2 3 4 5
## beijing 0 1 0 0 1
## dish 0 1 0 0 1
## duck 3 2 2 0 1
## rabbit 0 0 1 1 0
## recipe 0 0 1 1 1
Now we can convert that to a log-frequency table for use in TF-IDF:
tfLog <- ifelse(tf > 0, 1 + log10(tf), 0)  # log-scaled term frequency; zero counts stay zero
tfLog
## Docs
## Terms 1 2 3 4 5
## beijing 0.000000 1.00000 0.00000 0 1
## dish 0.000000 1.00000 0.00000 0 1
## duck 1.477121 1.30103 1.30103 0 1
## rabbit 0.000000 0.00000 1.00000 1 0
## recipe 0.000000 0.00000 1.00000 1 1
Moving on to the IDF part: take the TF table from earlier and turn every count greater than zero into a 1, then take row sums to get each term's document frequency and apply \(\log_{10}(N/df_t)\).
idf <- tf
idf[idf>0]<-1                          # binary incidence: does the term occur in the document?
idf <- log10(ncol(tf)/rowSums(idf))    # log10(N / df_t)
idf
## beijing dish duck rabbit recipe
## 0.39794001 0.39794001 0.09691001 0.39794001 0.22184875
Now, to get our TF-IDF table, we multiply each term's log TF (row) by that term's IDF:
tf.idf <- tfLog*idf
tf.idf
## Docs
## Terms 1 2 3 4 5
## beijing 0.0000000 0.3979400 0.0000000 0.0000000 0.39794001
## dish 0.0000000 0.3979400 0.0000000 0.0000000 0.39794001
## duck 0.1431478 0.1260828 0.1260828 0.0000000 0.09691001
## rabbit 0.0000000 0.0000000 0.3979400 0.3979400 0.00000000
## recipe 0.0000000 0.0000000 0.2218487 0.2218487 0.22184875
Now we are going to compute the similarity of the query “beijing duck recipe” with each of the documents. The assignment did not specify a similarity function, so I will assume cosine similarity. I first define a cosine similarity function, apply it to every pair of columns, and then pull out the row corresponding to the query.
cosd <- function(x, y) {
  # cosine similarity between two vectors, ignoring positions where either value is NA
  z <- rbind(x, y)
  z <- as.matrix(z[, colSums(is.na(z)) == 0])
  x <- z[1, ]
  y <- z[2, ]
  x %*% y / (sqrt(x %*% x) * sqrt(y %*% y))
}
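As a quick sanity check on the function (these test vectors are not part of the assignment data):
cosd(c(1,0,1), c(1,0,1))   # identical vectors give 1
cosd(c(1,0), c(0,1))       # orthogonal vectors give 0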
First, let's set up the query vector (weighting the query terms by their IDF) and add it to the TF-IDF matrix.
query <- idf
query[] <- 0
query["beijing"] <- idf["beijing"]
query["duck"] <- idf["duck"]
query["recipe"] <- idf["recipe"]
query
## beijing dish duck rabbit recipe
## 0.39794001 0.00000000 0.09691001 0.00000000 0.22184875
qtf.idf <- cbind(tf.idf,query)
qtf.idf
## 1 2 3 4 5 query
## beijing 0.0000000 0.3979400 0.0000000 0.0000000 0.39794001 0.39794001
## dish 0.0000000 0.3979400 0.0000000 0.0000000 0.39794001 0.00000000
## duck 0.1431478 0.1260828 0.1260828 0.0000000 0.09691001 0.09691001
## rabbit 0.0000000 0.0000000 0.3979400 0.3979400 0.00000000 0.00000000
## recipe 0.0000000 0.0000000 0.2218487 0.2218487 0.22184875 0.22184875
Now we use our cosine function to compute pairwise similarities between all of the columns (the five documents and the query). Note that dist() here comes from the proxy package, which allows a user-defined function as the method; despite the name, the values it returns are our cosine similarities.
library(proxy)  # proxy::dist supports a custom similarity function and column-wise comparison
dists <- dist(qtf.idf, by_rows=FALSE, method=cosd, upper=TRUE)
dists
## 1 2 3 4 5 query
## 1 0.21861941 0.26671433 0.00000000 0.15818572 0.20805308
## 2 0.21861941 0.05830893 0.00000000 0.93097158 0.63497038
## 3 0.26671433 0.05830893 0.96377563 0.21213328 0.27900737
## 4 0.00000000 0.00000000 0.96377563 0.17633033 0.23191771
## 5 0.15818572 0.93097158 0.21213328 0.17633033 0.76031424
## query 0.20805308 0.63497038 0.27900737 0.23191771 0.76031424
Now, if we just want the similarity of the query with each document, we select out that row and sort it:
sort(as.matrix(dists)["query",1:5],decreasing=TRUE)
## 5 2 3 4 1
## 0.7603142 0.6349704 0.2790074 0.2319177 0.2080531
First create the adjacency matrix:
A = matrix(c(0,1,1,1,1,1,1,
1,0,0,1,1,0,0,
1,0,0,0,0,1,1,
1,1,0,0,0,0,0,
1,1,0,0,0,0,0,
1,0,1,0,0,0,0,
1,0,1,0,0,0,0),
ncol=7,byrow=TRUE)
nodes = c('A','B','C','D','E','F','G')
rownames(A)=nodes
colnames(A)=nodes
A
## A B C D E F G
## A 0 1 1 1 1 1 1
## B 1 0 0 1 1 0 0
## C 1 0 0 0 0 1 1
## D 1 1 0 0 0 0 0
## E 1 1 0 0 0 0 0
## F 1 0 1 0 0 0 0
## G 1 0 1 0 0 0 0
First we transpose the adjacency matrix and then scale each column so it sums to one, giving the column-stochastic transition matrix M.
At = t(A)
At = sweep(At,2,colSums(At),FUN="/")  # divide each column by its sum (column-stochastic M)
At
## A B C D E F G
## A 0.0000000 0.3333333 0.3333333 0.5 0.5 0.5 0.5
## B 0.1666667 0.0000000 0.0000000 0.5 0.5 0.0 0.0
## C 0.1666667 0.0000000 0.0000000 0.0 0.0 0.5 0.5
## D 0.1666667 0.3333333 0.0000000 0.0 0.0 0.0 0.0
## E 0.1666667 0.3333333 0.0000000 0.0 0.0 0.0 0.0
## F 0.1666667 0.0000000 0.3333333 0.0 0.0 0.0 0.0
## G 0.1666667 0.0000000 0.3333333 0.0 0.0 0.0 0.0
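As a quick sanity check that the scaled matrix is column-stochastic:
colSums(At)   # every column should sum to 1 (up to floating point)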
Now we set the initial PageRank vector, in which every page simply gets \(1/n\).
r = matrix(1/ncol(A),ncol(A),1)
rownames(r)=nodes
r
## [,1]
## A 0.1428571
## B 0.1428571
## C 0.1428571
## D 0.1428571
## E 0.1428571
## F 0.1428571
## G 0.1428571
Next we use the update \(r^{k+1}=Mr^k\) (with \(M\) being the scaled matrix At) to compute the first two iterations.
library(knitr)  # for kable()
t = r           # keep each iterate as a column of t
r = At %*% r    # first iteration
t = cbind(t,r)
r = At %*% r    # second iteration
t = cbind(t,r)
colnames(t)=c('r0','r1','r2')
kable(t)
| | r0 | r1 | r2 |
|---|---|---|---|
| A | 0.1428571 | 0.3809524 | 0.2539683 |
| B | 0.1428571 | 0.1666667 | 0.1349206 |
| C | 0.1428571 | 0.1666667 | 0.1349206 |
| D | 0.1428571 | 0.0714286 | 0.1190476 |
| E | 0.1428571 | 0.0714286 | 0.1190476 |
| F | 0.1428571 | 0.0714286 | 0.1190476 |
| G | 0.1428571 | 0.0714286 | 0.1190476 |
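If we keep applying the same update, the vector converges; here is a small sketch continuing the iteration (the 1e-9 tolerance is an arbitrary choice), whose limit matches the normalized dominant eigenvector found next: 0.30 for A, 0.15 for B and C, and 0.10 for each of D through G.
# continue r = At %*% r until the rank vector stops changing
repeat {
  r.new = At %*% r
  if (max(abs(r.new - r)) < 1e-9) break
  r = r.new
}
round(r, 4)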
For the eigenvector approach, all we have to do is take our scaled matrix from the last part and find its principal eigenvector (the eigenvector for eigenvalue 1).
y = eigen(At)
y
## $values
## [1] 1.000000e+00 -6.666667e-01 -5.773503e-01 5.773503e-01 -3.333333e-01
## [6] 3.335461e-17 0.000000e+00
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.7171372 -0.6324555 -2.583706e-16 -2.209445e-16 8.164966e-01
## [2,] 0.3585686 -0.3162278 -5.477226e-01 5.477226e-01 -4.082483e-01
## [3,] 0.3585686 -0.3162278 5.477226e-01 -5.477226e-01 -4.082483e-01
## [4,] 0.2390457 0.3162278 3.162278e-01 3.162278e-01 5.678744e-17
## [5,] 0.2390457 0.3162278 3.162278e-01 3.162278e-01 5.678744e-17
## [6,] 0.2390457 0.3162278 -3.162278e-01 -3.162278e-01 -8.518115e-17
## [7,] 0.2390457 0.3162278 -3.162278e-01 -3.162278e-01 -8.518115e-17
## [,6] [,7]
## [1,] -7.289993e-32 -2.962788e-46
## [2,] 1.497000e-16 -7.817348e-17
## [3,] -9.961632e-17 5.560715e-17
## [4,] 7.071068e-01 7.278153e-17
## [5,] -7.071068e-01 -8.390064e-17
## [6,] -1.270972e-16 -7.071068e-01
## [7,] -1.607489e-17 7.071068e-01
So we grab the eigenvector for the largest eigenvalue, which is 1 and corresponds to column 1:
rownames(y$vectors)=nodes
y$vectors[,which.max(abs(y$val))]
## A B C D E F G
## 0.7171372 0.3585686 0.3585686 0.2390457 0.2390457 0.2390457 0.2390457
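Note that eigen() does not return this vector scaled as a probability distribution; rescaling it to sum to 1 (as is done in the later parts) gives 0.30 for A, 0.15 for B and C, and 0.10 for each of D through G, which is exactly where the power iteration above is heading.
v = y$vectors[, which.max(abs(y$values))]
round(v / sum(v), 4)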
One way in which spammers attempt to increase the rank of their pages is through link spam. The basic idea is to create a “spam farm” of pages that are all pointed to from a target page and in return all point back to the target page, the goal being to capture as much PageRank as possible. In the example below we have a Page A and a Page B in front of the farm: by having Page A point to Page B, which in turn points to all of the spam pages, the structure can avoid being classified as a simple spider trap, since a crawler keeps finding new pages.

Calculating the PageRank of this structure for farms of 5 and 10 spam pages (the simulations below), we will see that Pages A and B each end up with rank \(\frac{1}{3}\), while each spam page gets \(\frac{1}{3}\cdot\frac{1}{n}\); for a farm of 1000 spam pages that would be \(\frac{1}{3000}\) per spam page. The reason is that PageRank models a random walker: at any given time the walker has a one-third chance of being on Page A, on Page B, or somewhere in the spam farm, and if it is in the farm it is on any particular spam page with probability \(1/n\). One way to combat this is to add random teleports so the walker does not stay trapped in the farm. Another possibility is to examine URLs and flag ones that look nonsensical or machine-generated. Finally, one could limit how much rank a page can gain from very lowly ranked pages.
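For reference, the teleport countermeasure corresponds to the standard damped PageRank update \(r^{k+1}=\beta M r^k + \frac{1-\beta}{n}\mathbf{1}\), where \(\beta\) (commonly around 0.85) is the probability of following a link and \(1-\beta\) the probability of jumping to a uniformly random page; the simulations below use the plain, teleport-free transition matrix.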
We will first simulate with 5 spam pages.
First create the adjacency matrix:
A = matrix(c(0,1,0,0,0,0,0,
0,0,1,1,1,1,1,
1,0,0,0,0,0,0,
1,0,0,0,0,0,0,
1,0,0,0,0,0,0,
1,0,0,0,0,0,0,
1,0,0,0,0,0,0),
ncol=7,byrow=TRUE)
nodes = c('A','B','S1','S2','S3','S4','S5')
rownames(A)=nodes
colnames(A)=nodes
A
## A B S1 S2 S3 S4 S5
## A 0 1 0 0 0 0 0
## B 0 0 1 1 1 1 1
## S1 1 0 0 0 0 0 0
## S2 1 0 0 0 0 0 0
## S3 1 0 0 0 0 0 0
## S4 1 0 0 0 0 0 0
## S5 1 0 0 0 0 0 0
First we transpose the adjacency matrix and then scale each column so it sums to one.
At = t(A)
At = sweep(At,2,colSums(At),FUN="/")
At
## A B S1 S2 S3 S4 S5
## A 0 0.0 1 1 1 1 1
## B 1 0.0 0 0 0 0 0
## S1 0 0.2 0 0 0 0 0
## S2 0 0.2 0 0 0 0 0
## S3 0 0.2 0 0 0 0 0
## S4 0 0.2 0 0 0 0 0
## S5 0 0.2 0 0 0 0 0
Next we take our scaled matrix from the last part and find the principal eigenvector:
y = eigen(At)
y
## $values
## [1] -0.5+0.8660254i -0.5-0.8660254i 1.0+0.0000000i 0.0+0.0000000i
## [5] 0.0+0.0000000i 0.0+0.0000000i 0.0+0.0000000i
##
## $vectors
## [,1] [,2] [,3]
## [1,] 0.6741999+0.0000000i 0.6741999+0.0000000i 0.6741999+0i
## [2,] -0.3370999-0.5838742i -0.3370999+0.5838742i 0.6741999+0i
## [3,] -0.0674200+0.1167748i -0.0674200-0.1167748i 0.1348400+0i
## [4,] -0.0674200+0.1167748i -0.0674200-0.1167748i 0.1348400+0i
## [5,] -0.0674200+0.1167748i -0.0674200-0.1167748i 0.1348400+0i
## [6,] -0.0674200+0.1167748i -0.0674200-0.1167748i 0.1348400+0i
## [7,] -0.0674200+0.1167748i -0.0674200-0.1167748i 0.1348400+0i
## [,4] [,5] [,6] [,7]
## [1,] 0.0000000+0i 0.0000000+0i 0.0000000+0i 0.0000000+0i
## [2,] 0.0000000+0i 0.0000000+0i 0.0000000+0i 0.0000000+0i
## [3,] -0.4472136+0i -0.4472136+0i -0.4472136+0i -0.4472136+0i
## [4,] 0.8618034+0i -0.1381966+0i -0.1381966+0i -0.1381966+0i
## [5,] -0.1381966+0i 0.8618034+0i -0.1381966+0i -0.1381966+0i
## [6,] -0.1381966+0i -0.1381966+0i 0.8618034+0i -0.1381966+0i
## [7,] -0.1381966+0i -0.1381966+0i -0.1381966+0i 0.8618034+0i
Note that we have to exclude the eigenvalues (and their eigenvectors) whose imaginary part is not negligible.
rownames(y$vectors)=nodes
goodVals <- abs(Im(y$values)) < 1e-6   # keep only (numerically) real eigenvalues
y$values <- y$values[goodVals]
y$vectors <- y$vectors[,goodVals]
So we grab the eigenvector for the largest remaining eigenvalue, which is 1 and corresponds to column 1, and rescale it to sum to 1:
y <- y$vectors[,which.max(abs(y$val))]
# scale it
y<-y/sum(y)
y
## A B S1 S2 S3
## 0.33333333+0i 0.33333333+0i 0.06666667+0i 0.06666667+0i 0.06666667+0i
## S4 S5
## 0.06666667+0i 0.06666667+0i
We see that we got exactly what we expected, with Pages A and B at \(1/3\) each and each spam page at \(\frac{1}{3}\cdot\frac{1}{5} = 0.0666667\).
Now we will repeat the simulation with 10 spam pages.
First create the adjacency matrix:
A = matrix(c(0,1,0,0,0,0,0,0,0,0,0,0,
0,0,1,1,1,1,1,1,1,1,1,1,
1,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0),
ncol=12,byrow=TRUE)
nodes = c('A','B','S1','S2','S3','S4','S5','S6','S7','S8','S9','S10')
rownames(A)=nodes
colnames(A)=nodes
A
## A B S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
## A 0 1 0 0 0 0 0 0 0 0 0 0
## B 0 0 1 1 1 1 1 1 1 1 1 1
## S1 1 0 0 0 0 0 0 0 0 0 0 0
## S2 1 0 0 0 0 0 0 0 0 0 0 0
## S3 1 0 0 0 0 0 0 0 0 0 0 0
## S4 1 0 0 0 0 0 0 0 0 0 0 0
## S5 1 0 0 0 0 0 0 0 0 0 0 0
## S6 1 0 0 0 0 0 0 0 0 0 0 0
## S7 1 0 0 0 0 0 0 0 0 0 0 0
## S8 1 0 0 0 0 0 0 0 0 0 0 0
## S9 1 0 0 0 0 0 0 0 0 0 0 0
## S10 1 0 0 0 0 0 0 0 0 0 0 0
First we transpose the adjacency matrix and then scale each column so it sums to one.
At = t(A)
At = sweep(At,2,colSums(At),FUN="/")
At
## A B S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
## A 0 0.0 1 1 1 1 1 1 1 1 1 1
## B 1 0.0 0 0 0 0 0 0 0 0 0 0
## S1 0 0.1 0 0 0 0 0 0 0 0 0 0
## S2 0 0.1 0 0 0 0 0 0 0 0 0 0
## S3 0 0.1 0 0 0 0 0 0 0 0 0 0
## S4 0 0.1 0 0 0 0 0 0 0 0 0 0
## S5 0 0.1 0 0 0 0 0 0 0 0 0 0
## S6 0 0.1 0 0 0 0 0 0 0 0 0 0
## S7 0 0.1 0 0 0 0 0 0 0 0 0 0
## S8 0 0.1 0 0 0 0 0 0 0 0 0 0
## S9 0 0.1 0 0 0 0 0 0 0 0 0 0
## S10 0 0.1 0 0 0 0 0 0 0 0 0 0
Next we take our scaled matrix from the last part and find the principal eigenvector:
y = eigen(At)
y
## $values
## [1] 1.0+0.0000000i -0.5+0.8660254i -0.5-0.8660254i 0.0+0.0000000i
## [5] 0.0+0.0000000i 0.0+0.0000000i 0.0+0.0000000i 0.0+0.0000000i
## [9] 0.0+0.0000000i 0.0+0.0000000i 0.0+0.0000000i 0.0+0.0000000i
##
## $vectors
## [,1] [,2] [,3]
## [1,] 0.69006556+0i -0.34503278+0.59761430i -0.34503278-0.59761430i
## [2,] 0.69006556+0i 0.69006556+0.00000000i 0.69006556+0.00000000i
## [3,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [4,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [5,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [6,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [7,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [8,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [9,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [10,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [11,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [12,] 0.06900656+0i -0.03450328-0.05976143i -0.03450328+0.05976143i
## [,4] [,5] [,6] [,7]
## [1,] -3.697785e-32+0i -3.697785e-32+0i -3.697785e-32+0i -3.697785e-32+0i
## [2,] 4.930381e-32+0i 4.930381e-32+0i 4.930381e-32+0i 4.930381e-32+0i
## [3,] -3.162278e-01+0i -3.162278e-01+0i -3.162278e-01+0i -3.162278e-01+0i
## [4,] 9.240253e-01+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [5,] -7.597469e-02+0i 9.240253e-01+0i -7.597469e-02+0i -7.597469e-02+0i
## [6,] -7.597469e-02+0i -7.597469e-02+0i 9.240253e-01+0i -7.597469e-02+0i
## [7,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i 9.240253e-01+0i
## [8,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [9,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [10,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [11,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [12,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [,8] [,9] [,10] [,11]
## [1,] -3.697785e-32+0i -3.697785e-32+0i -3.697785e-32+0i -3.697785e-32+0i
## [2,] 4.930381e-32+0i 4.930381e-32+0i 4.930381e-32+0i 4.930381e-32+0i
## [3,] -3.162278e-01+0i -3.162278e-01+0i -3.162278e-01+0i -3.162278e-01+0i
## [4,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [5,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [6,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [7,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [8,] 9.240253e-01+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [9,] -7.597469e-02+0i 9.240253e-01+0i -7.597469e-02+0i -7.597469e-02+0i
## [10,] -7.597469e-02+0i -7.597469e-02+0i 9.240253e-01+0i -7.597469e-02+0i
## [11,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i 9.240253e-01+0i
## [12,] -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i -7.597469e-02+0i
## [,12]
## [1,] -3.697785e-32+0i
## [2,] 4.930381e-32+0i
## [3,] -3.162278e-01+0i
## [4,] -7.597469e-02+0i
## [5,] -7.597469e-02+0i
## [6,] -7.597469e-02+0i
## [7,] -7.597469e-02+0i
## [8,] -7.597469e-02+0i
## [9,] -7.597469e-02+0i
## [10,] -7.597469e-02+0i
## [11,] -7.597469e-02+0i
## [12,] 9.240253e-01+0i
Note that we have to exclude the eigenvalues (and their eigenvectors) whose imaginary part is not negligible.
rownames(y$vectors)=nodes
goodVals <- abs(Im(y$values)) < 1e-6
y$values <- y$values[goodVals]
y$vectors <- y$vectors[,goodVals]
So we grab the eigenvector for the largest remaining eigenvalue, which is 1 and corresponds to column 1, and rescale it to sum to 1:
y <- y$vectors[,which.max(abs(y$val))]
# scale it
y<-y/sum(y)
y
## A B S1 S2 S3
## 0.33333333+0i 0.33333333+0i 0.03333333+0i 0.03333333+0i 0.03333333+0i
## S4 S5 S6 S7 S8
## 0.03333333+0i 0.03333333+0i 0.03333333+0i 0.03333333+0i 0.03333333+0i
## S9 S10
## 0.03333333+0i 0.03333333+0i
We see that we got exactly what we expected, with Pages A and B at \(1/3\) each and each spam page at \(\frac{1}{3}\cdot\frac{1}{10} = 0.0333333\).
For this part of the homework we first installed Tomcat and then Solr, and deployed Solr onto Tomcat. Nutch was used to crawl 5 web sites (mit.edu, nd.edu, fau.edu, stanford.edu, and uscb.edu). The results of these crawls were imported into Solr for search, and Solr's web front end was used to run some sample searches. In total Solr indexed 39 documents.
The results for the queries are shown below:
| Query | Hits |
|---|---|
| florida | 5 |
| california | 8 |
| catholic | 8 |
I will now include all of the requested screenshots:
Now we will include the searches: 3 from port 8080 and 3 from port 8983, demonstrating that they return the same results.