1. Inverted Index

perl -ne '$a++;s/\s+/ $a\n/g;print $_;' < movies.txt | sort -t ' ' -k1,1 -k2,2n -u > index.txt
head -n5 index.txt

2. Inverted Index Implementation

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

if __name__=="__main__":
    # initial structures
    inverted_index = {}
    number_of_postings = 0
    # read the input line by line
    for line in sys.stdin:
        # split each line into term and posting id
        term, doc_id = line.split(" ")
        # add term to dictionary
        if term not in inverted_index:
            inverted_index[term] = []
        inverted_index[term].append(doc_id)
        number_of_postings += 1
    print len(inverted_index), number_of_postings

Usage examples (and size of dict/postings comparison):

./prog_1.py < index.txt 
4286 52788

./prog_1.py < index_low.txt 
3684 50966

Get the 10 most frequent words from the lowercased test collection and treat them as stop words. How much does this reduce the size of the index (in terms of postings)?

3. Boolean queries

Extend your program from the first part to allow simple conjunctive boolean queries. Start by retrieving the result for a single word query like ‘school’ and show the result of that query.

Implement a function that performs the intersection between two posting lists according to the procedure described in the book and the lectures.

Adjust your query processing routines to return the number of comparisons needed to perform the intersections. Add a simple query optimization feature and measure the number of comparison steps after adding this feature.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys, getopt
import copy

def intersect(p1, p2):
    '''
    Intersection between two lists
    '''
    answer = []
    # We have to copy these arrays, as the algorithm suggests using references
    # and we should not do that in Python
    p1 = copy.deepcopy(p1)
    p2 = copy.deepcopy(p2)
    comps = 0
    while p1 and p2:
        # compare first elements of the lists
        comps += 1
        if p1[0]==p2[0]:
            answer.append(p1[0])
            # remove first elements
            p1.pop(0)
            p2.pop(0)
        elif p1[0]>p2[0]:
            p2.pop(0)
        else:
            p1.pop(0)
    return answer, comps

def intersection(lists):
    '''
    Intersection of n lists
    '''
    if len(lists)<2:
        raise RuntimeError("There should be at least two lists")
    result = lists[0]
    comparasions = 0
    for l in lists[1:]:
        result, comps = intersect(result, l)
        comparasions += comps
    return result, comparasions

def intersection_optimised(lists):
    '''
    Intersection of n lists (optmised version)
    '''
    if len(lists)<2:
        raise RuntimeError("There should be at least two lists")
    lists.sort(cmp=lambda x, y: cmp(len(x), len(y)))
    result = lists[0]
    comparisons = 0
    for l in lists[1:]:
        result, comps = intersect(result, l)
        comparisons += comps
    return result, comparisons

if __name__=="__main__":
    # initilization
    inverted_index = {}
    number_of_postings = 0
    # parse options
    try:
        opts, _ = getopt.getopt(sys.argv[1:],"q:co")
    except getopt.GetoptError:
        print 'prog_2.py [-q term1 AND term2 [-c, -o]] < input'
        print "-q : set the query"
        print "-c : count the number of comparisons"
        print "-o : enable optimisation"
        sys.exit(2)
    # read the input line by line
    for line in sys.stdin:
        # split each line into term and posting id
        term, doc_id = line.replace("\n","").split(" ")
        # add term to dictionary
        if term not in inverted_index:
            inverted_index[term] = []
        inverted_index[term].append(int(doc_id))
        number_of_postings += 1
    opts = dict(opts)
    query = opts.get('-q')
    if query:
        # let's get the list of items
        q_terms = query.split("AND")
        # strip the strings
        q_terms = map(lambda x: x.strip(), q_terms)
        lists = []
        for q_term in q_terms:
            # use empty list as a default value to avoid errors
            q_list = inverted_index.get(q_term, [])
            q_list.sort()
            lists.append(q_list)
        if len(lists)>1:
            if '-o' in opts:
                docs, comparisons = intersection_optimised(lists)
            else:
                docs, comparisons = intersection(lists)
        else:
            docs = lists[0]
            comparisons = 0
        if "-c" in opts:
            print comparisons
        else:
            for doc in docs:
                print doc
    else:
        # No query: just pring the index size and postings count
        print len(inverted_index), number_of_postings

Use option -q to pass the query in quotation marks. Program will return IDs of documents which match the query. Without the query the program will return the size of the dictionary and the number of postings.

./prog_2.py -q 'aids' < index_low.txt
687
874

sed -n 687p movies.txt
" if that person facilitates the commission of such a crime " " or aids , abets or otherwise assists in its commission " . - That' s rather sweeping . - Well , if it' s any comfort , you' re in no jeopardy as long as you stay here , among friends . Are you saying I can' t leave the United States ? As your attorney , I strongly advise you not to travel to any country that recognizes the jurisdiction of the International Criminal Court . Well , just about every country in the world recognizes the ICC . America doesn' t . - Who else ? - SlDNEY :


./prog_2.py -q 'really AND kids' < index_low.txt 
72
224
385
475
523
584

sed -n 523p movies.txt
I wasn' t good at marriage , I wasn' t good at running a bed and breakfast , thank God we didn' t have kids . You don' t want kids ? I don' t think so . Well , I don' t know . For years I' ve never thought about the future , just kind of lived in the moment . But maybe I' m different now . I really want to take the next step , I want to move on with my life . - About the future ...


./prog_2.py -q 'really AND kids AND school' < index_low.txt 
72
224
385

sed -n 224p movies.txt
None of the other kids really liked me . I was always last one picked for everything . Hmm , it' s to bad that we didn' t go to the same school . Hal , I think you' re ready for this . Do I have a son ? No , Hal , you make me laugh . It stretches , it' s for you . Hey , what' s the ' T ' stand for ? Titan . Tighten ?

Use flag -c to tell the program to output the number of comparisons.

./prog_2.py -q 'really AND kids' -c < index_low.txt 
131

./prog_2.py -q 'really AND kids AND school' -c < index_low.txt 
148

Use flag -o to use the optimised version of the intersection function.

./prog_2.py -q 'really AND kids AND school' -c -o < index_low.txt 
111

The number of comparisons reduced from 148 to 111 for this particular query.

4. Tf-idf

Doc. freq:

./prog_2.py -q 'really' <index_low.txt | wc -l
121
./prog_2.py -q 'school' <index_low.txt | wc -l
15
./prog_2.py -q 'kids' <index_low.txt | wc -l
20

Term freq. for each doc:

 perl -ne '$a++;s/\s+/ $a\n/g;print $_;' < movies.txt | tr "[:upper:]" "[:lower:]" | sort -t ' ' -k1,1 -k2,2n | uniq -c  > index_w.txt

Documents with all 3 terms:

./prog_2.py -q 'really AND kids AND school' -o < index_low.txt
72
224
385

R function for tf-idf:

tfidf <- function(tf, df, N) {
(1+log10(tf))*log10(N/df)
}

Calculating plain tf-idf for each word in 3 docs:

cat index_w.txt | grep ' 72$' | grep really
 1 really 72
> tfidf(1, 121, 1000)
[1] 0.9172146

cat index_w.txt | grep ' 72$' | grep school
 2 school 72
> tfidf(2, 15, 1000)
[1] 2.37296

cat index_w.txt | grep ' 72$' | grep kids
 1 kids 72
> tfidf(1, 20, 1000)
[1] 1.69897

cat index_w.txt | grep ' 224$' | grep really
 1 really 224
> tfidf(1, 121, 1000)
[1] 0.9172146

cat index_w.txt | grep ' 224$' | grep school
 1 school 224
> tfidf(1, 15, 1000)
[1] 1.823909

cat index_w.txt | grep ' 224$' | grep kids
 1 kids 224
> tfidf(1, 20, 1000)
[1] 1.69897

cat index_w.txt | grep ' 385$' | grep really
 3 really 385
> tfidf(3, 121, 1000)
[1] 1.354837 

cat index_w.txt | grep ' 385$' | grep school
 2 school 385
> tfidf(2, 15, 1000)
[1] 1.823909
 
cat index_w.txt | grep ' 385$' | grep kids
 1 kids 385
> tfidf(1, 20, 1000)
[1] 1.69897

Calculating similarity and ranking (lnc.ltn)

calc <- data.frame(tokens=c('really', 'school', 'kids'), df=c(121, 15, 20))

#lnc for query vector
calc$q.tf <- c(1,1,1)
calc$q.tf.w <- 1 + log10(calc$q.tf)
calc$q.weight <- 1/sqrt(sum(calc$q.tf.w ^ 2))

#ltn for document vector
calc72 <- calc
calc72$d.tf <- c(1, 2, 1)
calc72$d.tf.w <- 1 + log10(calc72$d.tf)
calc72$d.idf <- log10(1000/calc72$df)
calc72$d.weight <- calc72$d.tf.w * calc72$d.idf

sum(calc72$q.weight * calc72$d.weight)
[1] 2.880484

calc224 <- calc
calc224$d.tf <- c(1, 1, 1)
calc224$d.tf.w <- 1 + log10(calc224$d.tf)
calc224$d.idf <- log10(1000/calc224$df)
calc224$d.weight <- calc224$d.tf.w * calc224$d.idf
sum(calc224$q.weight * calc224$d.weight)
[1] 2.563489

calc385 <- calc
calc385$d.tf <- c(3, 2, 1)
calc385$d.tf.w <- 1 + log10(calc385$d.tf)
calc385$d.idf <- log10(1000/calc385$df)
calc385$d.weight <- calc385$d.tf.w * calc385$d.idf
sum(calc385$q.weight * calc385$d.weight)
[1] 3.133146

So, for our query we will have a decreasing order of results: 385, 72, 224. I.e. the most relevant document to our query is:

Well , he was never really my type . Really ? OK , now you tell me something . Something you’ ve never told anyone . Well , in sh … school … … none of the other kids really liked me . I was always the last one picked for everything . It’ s too bad that we didn’ t go to the same school . Hal , I think you’ re ready for this . - Do I have a son ? - No .