Use one of the corpora here: https://corpus.byu.edu/corpora.asp
Collect two bigrams that you can compare using the association measures listed (they do not have to be X-Y, Z-Y pairs, but that would help you compare them).
Create a dataframe like the one from lecture for those bigrams.
I took the TV Corpus and selected two bigrams, "peanut butter" and "almond butter", and calculated the relation between them using a basic contingency table, where:
A: Co-occurrence of X and Y
B: Number of occurrences of X without Y
C: Number of occurrences of Y without X
D: Number of occurrences that are not X or Y
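To make it explicit where these four cells come from, one row of the table can be built from just three corpus searches (the bigram frequency and the two single-word frequencies) plus the corpus size. The helper below is a minimal sketch of that bookkeeping; the function name is mine, and the example call simply reproduces the peanut butter row that is constructed in the next chunk.
##r chunk
contingencyRow = function(bigramFreq, xFreq, yFreq, corpusSize){
  a = bigramFreq               #co-occurrence of X and Y
  b = xFreq - bigramFreq       #X without Y
  cc = yFreq - bigramFreq      #Y without X
  d = corpusSize - a - b - cc  #neither X nor Y
  c(a = a, b = b, c = cc, d = d)
}
#should reproduce the "peanut butter" row: 1886, 1505, 4922, 326192963
contingencyRow(bigramFreq = 1886, xFreq = 1886 + 1505, yFreq = 1886 + 4922,
               corpusSize = 326201276)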
##r chunk
#a, b, c, d counts for "peanut butter" (N = 326,201,276)
peanutButter = c(1886, 1505, 4922, (326201276 - 1886 - 1505 - 4922))
#a, b, c, d counts for "almond butter"
almondButter = c(34, 639, 6774, (326201276 - 34 - 639 - 6774))
PeanutAlmond_Butter = as.data.frame(rbind(peanutButter, almondButter))
colnames(PeanutAlmond_Butter) = c("a", "b", "c", "d")
PeanutAlmond_Butter
## a b c d
## peanutButter 1886 1505 4922 326192963
## almondButter 34 639 6774 326193829
Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
##r chunk
library(stringdist)
library(Rling)
library(reticulate)
py_install('nltk')
py_install('pandas')
py_install('numpy')
py_install('scipy')
py_install('scikit-learn')
py_install('gensim')
The Python libraries themselves are imported in the Python section below.
Calculate the attraction for your bigrams.
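Concretely, the chunk below computes attraction as a / (a + c) * 100: out of every occurrence of the second word (butter), the percentage that occurs with the node word (peanut or almond).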
##r chunk
attraction = PeanutAlmond_Butter$a/(PeanutAlmond_Butter$a+PeanutAlmond_Butter$c)*100
attraction
## [1] 27.7027027 0.4994125
rownames(PeanutAlmond_Butter)
## [1] "peanutButter" "AlomndButter"
Calculate the reliance for your bigrams.
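Reliance flips the denominator: a / (a + b) * 100, i.e., out of every occurrence of the node word (peanut or almond), the percentage that occurs with butter.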
##r chunk
reliance = PeanutAlmond_Butter$a/(PeanutAlmond_Butter$a+PeanutAlmond_Butter$b)*100
reliance
## [1] 55.617812 5.052006
rownames(PeanutAlmond_Butter)
## [1] "peanutButter" "AlomndButter"
Calculate the LL values for your bigrams.
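Two pieces are computed first in the chunk below: the expected co-occurrence aExp = (a + b)(a + c) / N, which is how often the bigram would occur if the two words were independent, and the Fisher exact p-value for the table. The log10 p-value and the LL value are then given a positive sign when the observed a is above aExp (attraction) and a negative sign when it is below (repulsion).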
##r chunk
#expected frequency
aExp = (PeanutAlmond_Butter$a + PeanutAlmond_Butter$b)*(PeanutAlmond_Butter$a + PeanutAlmond_Butter$c)/
(PeanutAlmond_Butter$a + PeanutAlmond_Butter$b + PeanutAlmond_Butter$c + PeanutAlmond_Butter$d)
#p values
pvF = pv.Fisher.collostr(PeanutAlmond_Butter$a, PeanutAlmond_Butter$b, PeanutAlmond_Butter$c, PeanutAlmond_Butter$d)
#log based on expected frequency
logpvF = ifelse(PeanutAlmond_Butter$a < aExp, log10(pvF), -log10(pvF))
logpvF
## [1] Inf 101.8631
LL = LL.collostr(PeanutAlmond_Butter$a, PeanutAlmond_Butter$b, PeanutAlmond_Butter$c, PeanutAlmond_Butter$d)
LL1 = ifelse(PeanutAlmond_Butter$a < aExp, -LL, LL)
LL1
## [1] 36572.2689 463.7855
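LL.collostr comes from the Rling package; as a rough cross-check, the same log-likelihood (G-squared) statistic can be computed directly from the observed and expected cell counts. The sketch below assumes the standard formula, 2 times the sum of observed * log(observed / expected) over the four cells, and should come out very close to the LL values reported above.
##r chunk
#cross-check of LL using the standard G-squared (log-likelihood ratio) formula
obs = as.matrix(PeanutAlmond_Butter)
N = rowSums(obs)
expCounts = cbind((obs[, "a"] + obs[, "b"]) * (obs[, "a"] + obs[, "c"]) / N,  #expected a
                  (obs[, "a"] + obs[, "b"]) * (obs[, "b"] + obs[, "d"]) / N,  #expected b
                  (obs[, "c"] + obs[, "d"]) * (obs[, "a"] + obs[, "c"]) / N,  #expected c
                  (obs[, "c"] + obs[, "d"]) * (obs[, "b"] + obs[, "d"]) / N)  #expected d
G2 = 2 * rowSums(obs * log(obs / expCounts))
G2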
Calculate the PMI for your bigrams.
##r chunk
PMI = log(PeanutAlmond_Butter$a / aExp)^2
PMI
## [1] 103.84639 60.71194
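Note that because this chunk squares the log of a / aExp, the PMI values reported here are always non-negative: their size reflects how far the observed co-occurrence is from the expected value, but their sign cannot distinguish attraction from repulsion. That distinction comes from the signed LL and Fisher values above.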
Calculate the OR for your bigrams.
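The odds ratio compares the odds of seeing butter after the node word to the odds of seeing butter elsewhere, (a / b) / (c / d) = ad / bc; the chunk below takes its log, so 0 is the neutral point and positive values indicate attraction.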
##r chunk
logOR = log(PeanutAlmond_Butter$a*PeanutAlmond_Butter$d/(PeanutAlmond_Butter$b*PeanutAlmond_Butter$c))
logOR
## [1] 11.327195 7.848611
Given the statistics you have calculated above, what is the relation of your bigrams? Write a short summary of the results, making sure to answer the following:
Both bigrams show attraction rather than repulsion: the observed co-occurrence (a) is well above the expected frequency, the log-transformed Fisher p-values and LL values are positive, and the log odds ratios are positive. The association is much stronger for peanut butter than for almond butter: attraction is 27.7% vs. 0.5%, reliance is 55.6% vs. 5.1%, LL is 36572 vs. 464, and the log odds ratio is 11.3 vs. 7.8. So both are attracted collocations, but peanut butter is by far the more strongly associated bigram.
Load all the libraries you will need for the Python section. You can also put in the functions for normalizing the text and calculating the top 5 related objects.
##python chunk
import nltk
# nltk.download('punkt')      #uncomment on the first run if the tokenizer data is missing
# nltk.download('stopwords')  #uncomment on the first run if the stopword list is missing
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.summarization.bm25 import get_bm25_weights #requires gensim < 4.0
Import the completed_clean_data file and convert it to a pandas dataframe. This dataset includes a list of scientific research articles that all appeared when I searched for “databases”, “corpus”, and “linguistic norms”.
##python chunk
#import data
df = pd.read_csv("C:\\Users\\rohan\\OneDrive\\540\\completed_clean_data.csv")
#show you one of the rows (row 2!)
df.iloc[[1]]
## AUTHOR ... ABSTRACT
## 1 Brendan T. JohnsRandall K. Jamieson ... We measured and documented the influence of co...
##
## [1 rows x 5 columns]
Use the text-normalizing function to clean up the corpus - specifically, focus on the ABSTRACT column as our text to match.
## python chunk
stop_words = nltk.corpus.stopwords.words('english')
def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(list(df['ABSTRACT']))
len(norm_corpus)
## 2875
Calculate the cosine similarity between the abstracts of the attached documents.
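For two TF-IDF document vectors x and y, cosine similarity is the dot product divided by the product of the vector lengths, cos(x, y) = (x · y) / (||x|| ||y||). Because TF-IDF weights are non-negative, the values in the matrix below run from 0 (no shared terms between two abstracts) to 1 (on the diagonal, each abstract compared with itself).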
##python chunk
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape
## (2875, 30660)
doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()
## 0 1 2 ... 2872 2873 2874
## 0 1.000000 0.022779 0.032152 ... 0.047954 0.023440 0.048615
## 1 0.022779 1.000000 0.024258 ... 0.051089 0.032311 0.011365
## 2 0.032152 0.024258 1.000000 ... 0.013188 0.035566 0.028252
## 3 0.028440 0.013722 0.043408 ... 0.004610 0.001366 0.000000
## 4 0.062236 0.025037 0.040312 ... 0.028702 0.013295 0.020856
##
## [5 rows x 2875 columns]
Using our movie recommender adapted for articles - pick a single article (under TITLE) and recommend five other related articles.
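The function below looks up the row of the similarity matrix for the chosen article, sorts that row in descending order, skips the first entry (the article itself, with similarity 1), and returns the titles of the next five most similar articles.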
##python chunk
def scientificresearch_article(article_title, titles, doc_sims):
    # find the index of the chosen article by its title
    article_idx = np.where(titles == article_title)[0][0]
    # get that article's similarities to every other article
    article_similarities = doc_sims.iloc[article_idx].values
    # get the indices of the top 5 most similar articles (skipping the article itself)
    similar_article_idxs = np.argsort(-article_similarities)[1:6]
    # get the top 5 article titles
    similar_articles = titles[similar_article_idxs]
    # return the top 5 article titles
    return similar_articles

scientificresearch_article("chinese lexical database cld", #title must appear in the TITLE column
                           df["TITLE"].values, #all article titles
                           doc_sim_df #pd dataframe of similarity values
                           )
## array(['chinese lexical database cld a large scale lexical database for simplified mandarin chinese',
## 'meld sch a megastudy of lexical decision in simplified chinese',
## 'the chinese lexicon project a megastudy of lexical decision performance for 25000 traditional chinese two character compound words',
## 'the use of film subtitles to estimate word frequencies',
## 'speechreading and the structure of the lexicon computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness'],
## dtype=object)
##python chunk
#BM25 weights as an alternative similarity measure, left commented out
#(gensim.summarization was removed in gensim 4.0, so this needs an older gensim)
#doc_sim = get_bm25_weights(norm_corpus[0:10].tolist(), n_jobs = -1)
#doc_sim_df = pd.DataFrame(doc_sim)
#doc_sim_df.head()
#scientificresearch_article("introduction to eurowordnet",
#                           df["TITLE"].values[0:10],
#                           doc_sim_df)