Python Application

Import completed_clean_data and convert it to a pandas DataFrame. The dataset lists scientific research articles returned by searches for “databases”, “corpus”, and “linguistic norms”.

library(reticulate)
library(readr)
completed_clean_data <- read_csv("completed_clean_data.csv")
## Parsed with column specification:
## cols(
##   AUTHOR = col_character(),
##   JOURNAL = col_character(),
##   TITLE = col_character(),
##   YEAR = col_double(),
##   ABSTRACT = col_character()
## )

Load all the libraries you will need for the Python section. You can also put in the functions for normalizing the text and calculating the top 5 related objects.

import pandas as pd
import string
import nltk
import re
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

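# reticulate exposes objects from the R session through `r`;
# an R data frame arrives in Python as a pandas DataFrame
df = r.completed_clean_data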

Use the text-normalizing function to clean up the corpus, focusing on the ABSTRACT column as the text to match.

stop_words = nltk.corpus.stopwords.words('english')

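def normalize_document(doc):
    # Keep only letters, digits, and whitespace. flags must be passed by
    # keyword: the fourth positional argument of re.sub is count, not flags.
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # Safety net for ASCII punctuation; a no-op once the regex above has run
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(doc)
    # Remove English stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    doc = ' '.join(filtered_tokens)
    return doc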

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['ABSTRACT']))
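
To see what normalization does to a single abstract, here is a minimal sketch on a made-up sentence. It is not the function above: the stop-word set `TOY_STOP_WORDS` and the helper `toy_normalize` are hypothetical stand-ins, and tokens are split on whitespace rather than with nltk.word_tokenize, so the snippet runs without downloading any NLTK data.

```python
import re

# Hypothetical, tiny stop-word list; the real code uses NLTK's English list.
TOY_STOP_WORDS = {"the", "a", "of", "for", "and", "is", "in"}

def toy_normalize(doc):
    # Keep only ASCII letters, digits, and whitespace
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc, flags=re.A)
    doc = doc.lower().strip()
    # Whitespace split stands in for nltk.word_tokenize (an assumption
    # made only to keep this sketch dependency-free)
    tokens = doc.split()
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

example = toy_normalize("The Chinese Lexical Database (CLD): a large-scale database!")
print(example)
```

Note that deleting the hyphen joins "large-scale" into the single token "largescale", which is worth keeping in mind when reading the bigram features later on.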

Calculate the cosine similarity between the abstracts of the attached documents.

tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape
## (2875, 30660)
doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()
##        0         1         2     ...      2872      2873      2874
## 0  1.000000  0.022779  0.032152  ...  0.047954  0.023440  0.048615
## 1  0.022779  1.000000  0.024258  ...  0.051089  0.032311  0.011365
## 2  0.032152  0.024258  1.000000  ...  0.013188  0.035566  0.028252
## 3  0.028440  0.013722  0.043408  ...  0.004610  0.001366  0.000000
## 4  0.062236  0.025037  0.040312  ...  0.028702  0.013295  0.020856
## 
## [5 rows x 2875 columns]
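
Each cell of doc_sim is the cosine of the angle between two TF-IDF row vectors: their dot product divided by the product of their lengths. The following is a minimal sketch using only the standard library, on made-up three-term vectors. The function name `cosine` and the numbers are illustrative and not taken from the dataset, and returning 0.0 for an all-zero vector is a convention assumed here.

```python
import math

def cosine(u, v):
    # Dot product of the two vectors
    dot = sum(a * b for a, b in zip(u, v))
    # Euclidean length of each vector
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumed convention: an empty (all-zero) document is similar to nothing
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a, just longer
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

sim_aa = cosine(doc_a, doc_a)   # a document compared with itself
sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(sim_aa, sim_ab, sim_ac)
```

This is why the diagonal of the table above is 1.000000, and why two abstracts with no features in common score 0. Because TF-IDF weights are non-negative, every score falls between 0 and 1.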

Using our movie recommender, pick a single article (under TITLE) and recommend five other related articles.

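def movie_recommender(movie_title, movies, doc_sims):
    # Row index of the first exact match for the title
    movie_idx = np.where(movies == movie_title)[0][0]
    # Similarity of that document to every document in the corpus
    movie_similarities = doc_sims.iloc[movie_idx].values
    # Sort descending; position 0 is normally the query itself
    # (similarity 1.0), so take positions 1 through 5
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    similar_movies = movies[similar_movie_idxs]
    return similar_movies

movie_recommender("chinese lexical database cld", df["TITLE"].values, doc_sim_df)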
## array(['chinese lexical database cld a large scale lexical database for simplified mandarin chinese',
##        'meld sch a megastudy of lexical decision in simplified chinese',
##        'the chinese lexicon project a megastudy of lexical decision performance for 25000 traditional chinese two character compound words',
##        'the use of film subtitles to estimate word frequencies',
##        'speechreading and the structure of the lexicon computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness'],
##       dtype=object)
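
The slice [1:6] in the recommender relies on the query ranking first in its own similarity row. Here is a minimal sketch of that ranking step on a made-up row of scores, written in plain Python rather than NumPy. The names `top_k_similar`, `titles`, and `row` are hypothetical, and the sketch excludes the query by index instead of by position, which is a variation on the code above and not the original method.

```python
def top_k_similar(query_idx, similarities, k=5):
    # Rank all indices by descending similarity
    ranked = sorted(range(len(similarities)), key=lambda i: -similarities[i])
    # Drop the query by index rather than by position, so an article that
    # ties with the query at 1.0 cannot change which entry gets skipped
    return [i for i in ranked if i != query_idx][:k]

titles = ["a", "b", "c", "d"]
row = [1.0, 0.2, 0.9, 0.5]   # made-up similarities of article 0 to all articles
top = [titles[i] for i in top_k_similar(0, row, k=2)]
print(top)
```

The two approaches should differ only when another article ties with the query, for example a duplicate abstract stored under a different title.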

Discussion Questions