Use one of the corpora here: https://corpus.byu.edu/corpora.asp
Collect two bigrams that you can compare using the association measures listed (they do not have to be X-Y, Z-Y pairs, but that would help you compare them).
Create a dataframe like the one from lecture for those bigrams.
I took the TV Corpus and selected two bigrams, "peanut butter" and "almond butter", and calculated the relation between them using a basic contingency table, where:
A: Co-occurrence of X and Y
B: Number of occurrences of X without Y
C: Number of occurrences of Y without X
D: Number of occurrences that are not X or Y
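To make it explicit where these four cells come from, one row of the table can be built from just three corpus searches (the bigram frequency and the two single-word frequencies) plus the corpus size. The helper below is a minimal sketch of that bookkeeping; the function name is mine, and the example call simply reproduces the peanut butter row that is constructed in the next chunk.
##r chunk
contingencyRow = function(bigramFreq, xFreq, yFreq, corpusSize){
  a = bigramFreq               #co-occurrence of X and Y
  b = xFreq - bigramFreq       #X without Y
  cc = yFreq - bigramFreq      #Y without X
  d = corpusSize - a - b - cc  #neither X nor Y
  c(a = a, b = b, c = cc, d = d)
}
#should reproduce the "peanut butter" row: 1886, 1505, 4922, 326192963
contingencyRow(bigramFreq = 1886, xFreq = 1886 + 1505, yFreq = 1886 + 4922,
               corpusSize = 326201276)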
##r chunk
#a, b, c, d counts for "peanut butter" (N = 326,201,276)
peanutButter = c(1886, 1505, 4922, (326201276 - 1886 - 1505 - 4922))
#a, b, c, d counts for "almond butter"
almondButter = c(34, 639, 6774, (326201276 - 34 - 639 - 6774))
PeanutAlmond_Butter = as.data.frame(rbind(peanutButter, almondButter))
colnames(PeanutAlmond_Butter) = c("a", "b", "c", "d")
PeanutAlmond_Butter
## a b c d
## peanutButter 1886 1505 4922 326192963
## almondButter 34 639 6774 326193829
Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
##r chunk
library(stringdist)
library(Rling)
library(reticulate)
py_install('nltk')
py_install('pandas')
py_install('numpy')
py_install('scipy')
py_install('scikit-learn')
py_install('gensim')
The Python libraries themselves are imported in the Python section below.
Calculate the attraction for your bigrams.
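Concretely, the chunk below computes attraction as a / (a + c) * 100: out of every occurrence of the second word (butter), the percentage that occurs with the node word (peanut or almond).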
##r chunk
attraction = PeanutAlmond_Butter$a/(PeanutAlmond_Butter$a+PeanutAlmond_Butter$c)*100
attraction
## [1] 27.7027027 0.4994125
rownames(PeanutAlmond_Butter)
## [1] "peanutButter" "AlomndButter"
Calculate the reliance for your bigrams.
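Reliance flips the denominator: a / (a + b) * 100, i.e., out of every occurrence of the node word (peanut or almond), the percentage that occurs with butter.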
##r chunk
reliance = PeanutAlmond_Butter$a/(PeanutAlmond_Butter$a+PeanutAlmond_Butter$b)*100
reliance
## [1] 55.617812 5.052006
rownames(PeanutAlmond_Butter)
## [1] "peanutButter" "AlomndButter"
Calculate the LL values for your bigrams.
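Two pieces are computed first in the chunk below: the expected co-occurrence aExp = (a + b)(a + c) / N, which is how often the bigram would occur if the two words were independent, and the Fisher exact p-value for the table. The log10 p-value and the LL value are then given a positive sign when the observed a is above aExp (attraction) and a negative sign when it is below (repulsion).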
##r chunk
#expected frequency
aExp = (PeanutAlmond_Butter$a + PeanutAlmond_Butter$b)*(PeanutAlmond_Butter$a + PeanutAlmond_Butter$c)/
(PeanutAlmond_Butter$a + PeanutAlmond_Butter$b + PeanutAlmond_Butter$c + PeanutAlmond_Butter$d)
#p values
pvF = pv.Fisher.collostr(PeanutAlmond_Butter$a, PeanutAlmond_Butter$b, PeanutAlmond_Butter$c, PeanutAlmond_Butter$d)
#log based on expected frequency
logpvF = ifelse(PeanutAlmond_Butter$a < aExp, log10(pvF), -log10(pvF))
logpvF
## [1] Inf 101.8631
LL = LL.collostr(PeanutAlmond_Butter$a, PeanutAlmond_Butter$b, PeanutAlmond_Butter$c, PeanutAlmond_Butter$d)
LL1 = ifelse(PeanutAlmond_Butter$a < aExp, -LL, LL)
LL1
## [1] 36572.2689 463.7855
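LL.collostr comes from the Rling package; as a rough cross-check, the same log-likelihood (G-squared) statistic can be computed directly from the observed and expected cell counts. The sketch below assumes the standard formula, 2 times the sum of observed * log(observed / expected) over the four cells, and should come out very close to the LL values reported above.
##r chunk
#cross-check of LL using the standard G-squared (log-likelihood ratio) formula
obs = as.matrix(PeanutAlmond_Butter)
N = rowSums(obs)
expCounts = cbind((obs[, "a"] + obs[, "b"]) * (obs[, "a"] + obs[, "c"]) / N,  #expected a
                  (obs[, "a"] + obs[, "b"]) * (obs[, "b"] + obs[, "d"]) / N,  #expected b
                  (obs[, "c"] + obs[, "d"]) * (obs[, "a"] + obs[, "c"]) / N,  #expected c
                  (obs[, "c"] + obs[, "d"]) * (obs[, "b"] + obs[, "d"]) / N)  #expected d
G2 = 2 * rowSums(obs * log(obs / expCounts))
G2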
Calculate the PMI for your bigrams.
##r chunk
PMI = log(PeanutAlmond_Butter$a / aExp)^2
PMI
## [1] 103.84639 60.71194
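Note that because this chunk squares the log of a / aExp, the PMI values reported here are always non-negative: their size reflects how far the observed co-occurrence is from the expected value, but their sign cannot distinguish attraction from repulsion. That distinction comes from the signed LL and Fisher values above.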
Calculate the OR for your bigrams.
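The odds ratio compares the odds of seeing butter after the node word to the odds of seeing butter elsewhere, (a / b) / (c / d) = ad / bc; the chunk below takes its log, so 0 is the neutral point and positive values indicate attraction.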
##r chunk
logOR = log(PeanutAlmond_Butter$a*PeanutAlmond_Butter$d/(PeanutAlmond_Butter$b*PeanutAlmond_Butter$c))
logOR
## [1] 11.327195 7.848611
Given the statistics you have calculated above, what is the relation of your bigrams? Write a short summary of the results, making sure to answer the following:
Both bigrams show attraction rather than repulsion: the observed co-occurrence (a) is well above the expected frequency, the log-transformed Fisher p-values and LL values are positive, and the log odds ratios are positive. The association is much stronger for peanut butter than for almond butter: attraction is 27.7% vs. 0.5%, reliance is 55.6% vs. 5.1%, LL is 36572 vs. 464, and the log odds ratio is 11.3 vs. 7.8. So both are attracted collocations, but peanut butter is by far the more strongly associated bigram.
Load all the libraries you will need for the Python section. You can also put in the functions for normalizing the text and calculating the top 5 related objects.
##python chunk
import nltk
# nltk.download('punkt')      #uncomment on the first run if the tokenizer data is missing
# nltk.download('stopwords')  #uncomment on the first run if the stopword list is missing
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.summarization.bm25 import get_bm25_weights #requires gensim < 4.0
Import the completed_clean_data file and convert it to a pandas dataframe. This dataset includes a list of scientific research articles that all appeared when I searched for “databases”, “corpus”, and “linguistic norms”.
##python chunk
#import data
df = pd.read_csv("C:\\Users\\rohan\\OneDrive\\540\\completed_clean_data.csv")
#show you one of the rows (row 2!)
df.iloc[[1]]
## AUTHOR ... ABSTRACT
## 1 Brendan T. JohnsRandall K. Jamieson ... We measured and documented the influence of co...
##
## [1 rows x 5 columns]
Use the text-normalizing function to clean up the corpus - specifically, focus on the ABSTRACT column as our text to match.
## python chunk
stop_words = nltk.corpus.stopwords.words('english')
def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(list(df['ABSTRACT']))
len(norm_corpus)
## 2875
Calculate the cosine similarity between the abstracts of the attached documents.
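For two TF-IDF document vectors x and y, cosine similarity is the dot product divided by the product of the vector lengths, cos(x, y) = (x · y) / (||x|| ||y||). Because TF-IDF weights are non-negative, the values in the matrix below run from 0 (no shared terms between two abstracts) to 1 (on the diagonal, each abstract compared with itself).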
##python chunk
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape
## (2875, 30660)
doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()
## 0 1 2 ... 2872 2873 2874
## 0 1.000000 0.022779 0.032152 ... 0.047954 0.023440 0.048615
## 1 0.022779 1.000000 0.024258 ... 0.051089 0.032311 0.011365
## 2 0.032152 0.024258 1.000000 ... 0.013188 0.035566 0.028252
## 3 0.028440 0.013722 0.043408 ... 0.004610 0.001366 0.000000
## 4 0.062236 0.025037 0.040312 ... 0.028702 0.013295 0.020856
##
## [5 rows x 2875 columns]
Using our movie recommender adapted for articles - pick a single article (under TITLE) and recommend five other related articles.
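The function below looks up the row of the similarity matrix for the chosen article, sorts that row in descending order, skips the first entry (the article itself, with similarity 1), and returns the titles of the next five most similar articles.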
##python chunk
def scientificresearch_article(article_title, titles, doc_sims):
    # find the index of the chosen article by its title
    article_idx = np.where(titles == article_title)[0][0]
    # get that article's similarities to every other article
    article_similarities = doc_sims.iloc[article_idx].values
    # get the indices of the top 5 most similar articles (skipping the article itself)
    similar_article_idxs = np.argsort(-article_similarities)[1:6]
    # get the top 5 article titles
    similar_articles = titles[similar_article_idxs]
    # return the top 5 article titles
    return similar_articles

scientificresearch_article("chinese lexical database cld", #title must appear in the TITLE column
                           df["TITLE"].values, #all article titles
                           doc_sim_df #pd dataframe of similarity values
                           )
## array(['chinese lexical database cld a large scale lexical database for simplified mandarin chinese',
## 'meld sch a megastudy of lexical decision in simplified chinese',
## 'the chinese lexicon project a megastudy of lexical decision performance for 25000 traditional chinese two character compound words',
## 'the use of film subtitles to estimate word frequencies',
## 'speechreading and the structure of the lexicon computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness'],
## dtype=object)
##python chunk
#BM25 weights as an alternative similarity measure, left commented out
#(gensim.summarization was removed in gensim 4.0, so this needs an older gensim)
#doc_sim = get_bm25_weights(norm_corpus[0:10].tolist(), n_jobs = -1)
#doc_sim_df = pd.DataFrame(doc_sim)
#doc_sim_df.head()
#scientificresearch_article("introduction to eurowordnet",
#                           df["TITLE"].values[0:10],
#                           doc_sim_df)