Python Application

Import the completed_clean_data file and convert it to a pandas dataframe. This dataset is a list of scientific research articles that all appeared when I searched for “databases”, “corpus”, and “linguistic norms”.

library(reticulate)
library(tm)
## Loading required package: NLP
##python chunk

import pandas as pd

#import data
df = pd.read_csv('completed_clean_data.csv')
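
Before moving on, it can help to glance at the imported dataframe; the TITLE and ABSTRACT columns are the ones used below.

##python chunk
#quick look at the data (expecting one row per article)
df.shape
df.head()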

Load all the libraries you will need for the Python section. You can also put in the functions for normalizing the text and calculating the top 5 related objects.

##python chunk
import string
import nltk
import re
import numpy as np

stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower case and remove special characters/whitespace
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    #remove punctuation
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
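
As a quick sanity check, the normalizer can be run on a single made-up sentence (a hypothetical example, not taken from the dataset) to confirm that it lowercases, strips punctuation and special characters, and drops stopwords.

##python chunk
#hypothetical example sentence, just to check the cleaning steps
normalize_document("The Corpus includes 2,875 abstracts -- and most are about linguistic norms!")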

Use the text-normalizing function to clean up the corpus - specifically, focus on the ABSTRACT column as our text to match.

##python chunk
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['ABSTRACT']))
len(norm_corpus)
## 2875
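
Comparing the first raw abstract to its normalized version is an optional way to see what the cleaning removed.

##python chunk
#compare one raw abstract to its cleaned version
df['ABSTRACT'].iloc[0]
norm_corpus[0]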

Calculate the cosine similarity between the abstracts of the documents in the dataset.

##python chunk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape
## (2875, 30660)
doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
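
The result is a square 2875 x 2875 matrix of pairwise similarities, one row and one column per abstract. A quick optional check of its shape:

##python chunk
#one row and one column per abstract
doc_sim_df.shape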

Using our movie recommender, pick a single article (under TITLE) and recommend five other related articles.

##python chunk
def movie_recommender(movie_title, movies, doc_sims):
    # find movie id
    movie_idx = np.where(movies == movie_title)[0][0]
    # get movie similarities
    movie_similarities = doc_sims.iloc[movie_idx].values
    # get top 5 similar movie IDs
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    # get top 5 movies
    similar_movies = movies[similar_movie_idxs]
    # return the top 5 movies
    return similar_movies

movie_recommender("chinese lexical database cld", #name of film must be in dataset
                  df["TITLE"].values, #all film names
                  doc_sim_df #pd dataframe of similarity values
                  )
                
## array(['chinese lexical database cld a large scale lexical database for simplified mandarin chinese',
##        'meld sch a megastudy of lexical decision in simplified chinese',
##        'the chinese lexicon project a megastudy of lexical decision performance for 25000 traditional chinese two character compound words',
##        'the use of film subtitles to estimate word frequencies',
##        'speechreading and the structure of the lexicon computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness'],
##       dtype=object)

Make a Change to the Model

Using the methods shown in class, make one change to the model to see how it impacts the outcome. Pick one of the following: use a different similarity metric, use phrases instead of single words (e.g. change ngram_range), use only more frequent terms (e.g. change min_df), or lemmatize the words in the processing step.

##python chunk

from sklearn.metrics import pairwise_distances

# euclidean distance as an alternative metric (these are distances, so smaller = more similar)

doc_sim_euc = pairwise_distances(tfidf_matrix, metric = 'euclidean')
doc_sim_df_euc = pd.DataFrame(doc_sim_euc)
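
# note: pairwise_distances returns distances rather than similarities, so
# smaller values mean more similar. To recommend with this matrix, the
# recommender would need to sort ascending; a rough sketch (the helper name
# below is ours, not from the class code):
def article_recommender_euclidean(article_title, articles, doc_dists):
    # find the article's row in the distance matrix
    article_idx = np.where(articles == article_title)[0][0]
    article_distances = doc_dists.iloc[article_idx].values
    # sort ascending and skip position 0 (the article itself, at distance 0)
    similar_idxs = np.argsort(article_distances)[1:6]
    return articles[similar_idxs]

# example call (output not shown):
# article_recommender_euclidean("chinese lexical database cld",
#                               df["TITLE"].values,
#                               doc_sim_df_euc)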


# using phrases instead of single words (and a higher min_df)
tf = TfidfVectorizer(ngram_range=(2, 6), min_df=6)
tfidf_matrix = tf.fit_transform(norm_corpus)

doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)

movie_recommender("chinese lexical database cld",
                  df["TITLE"].values,
                  doc_sim_df)
                  
## array(['chinese lexical database cld a large scale lexical database for simplified mandarin chinese',
##        'reviews',
##        'k span a lexical database of korean surface phonetic forms and phonological neighborhood density statistics',
##        'toward a scalable holographic word form representation',
##        'sublexical frequency measures for orthographic and phonological units in german'],
##       dtype=object)

# lemmatizing
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

def normalize_document(doc):
    # lower case and remove special characters/whitespace
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    #remove punctuation
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    # tokenize and lemmatize the document
    tokens = LemTokens(nltk.word_tokenize(doc))
    # filter stopwords out of the lemmatized tokens
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

# re-normalize the corpus with lemmatization
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['ABSTRACT']))

# rebuild the tf-idf matrix and similarity matrix on the lemmatized corpus
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)

doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
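
With the lemmatized corpus in place, the same recommender call can be rerun to see whether the top five recommendations shift (a sketch of the call; output omitted).

##python chunk
#rerun the recommender on the lemmatized model
movie_recommender("chinese lexical database cld",
                  df["TITLE"].values,
                  doc_sim_df)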
                  
                

Discussion Questions