Import the completed_clean_data file and convert it to a pandas dataframe. This dataset contains scientific research articles that all appeared when I searched for “databases”, “corpus”, and “linguistic norms”.
library(readr)
# read in the article metadata (AUTHOR, JOURNAL, TITLE, YEAR, ABSTRACT)
data <- read_csv('completed_clean_data.csv')
##
## -- Column specification --------------------------------------------------------
## cols(
## AUTHOR = col_character(),
## JOURNAL = col_character(),
## TITLE = col_character(),
## YEAR = col_double(),
## ABSTRACT = col_character()
## )
library(reticulate)
#conda_create("r-reticulate")
use_condaenv('r-reticulate')
## Warning in normalizePath(path.expand(path), winslash, mustWork): path[1]="C:
## \Users\JIANWEI LI\.conda\envs\r-reticulate/python.exe": The system cannot find
## the file specified
# run once to install the Python packages used below
# py_install("pandas")
# py_install("scikit-learn")
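If a warning like the one above appears, reticulate may not have found the expected environment. A quick optional check is py_config() (a standard reticulate function), which reports the Python interpreter actually bound to the session:
# optional: confirm which Python interpreter reticulate is using
py_config()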
Load all the libraries you will need for the Python section. You can also define the functions for normalizing the text and finding the top five related articles.
##python chunk
import pandas as pd
import string
import nltk
import re
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
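Note that nltk.word_tokenize and the English stopword list depend on NLTK data packages ('punkt' and 'stopwords'). If they are not already installed, a one-time download is needed; a minimal sketch (very recent NLTK versions may also require 'punkt_tab'):
##python chunk
# one-time downloads of the NLTK data used below
nltk.download('punkt')
nltk.download('stopwords')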
##python chunk
# pull the R data frame into Python as a pandas DataFrame
df = r.data
Use the normalizing text function to clean up the corpus - specifically, focus on the ABSTRACT column as our text to match.
##python chunk
df['ABSTRACT'][0]
## 'We present the Chinese Lexical Database (CLD): a large-scale lexical database for simplified Chinese. The CLD provides a wealth of lexical information for 3913 one-character words, 34,233 two-character words, 7143 three-character words, and 3355 four-character words, and is publicly available through http://www.chineselexicaldatabase.com. For each of the 48,644 words in the CLD, we provide a wide range of categorical predictors, as well as an extensive set of frequency measures, complexity measures, neighborhood density measures, orthography-phonology consistency measures, and information-theoretic measures. We evaluate the explanatory power of the lexical variables in the CLD in the context of experimental data through analyses of lexical decision latencies for one-character, two-character, three-character and four-character words, as well as word naming latencies for one-character and two-character words. The results of these analyses are discussed.'
stop_words = nltk.corpus.stopwords.words('english')
def normalize_document(doc):
    # keep only letters, digits, and whitespace; flags must be passed by
    # keyword (as a positional argument, re.sub treats them as a count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    # tokenize and remove English stopwords
    tokens = nltk.word_tokenize(doc)
    filtered_tokens = [t for t in tokens if t not in stop_words]
    doc = ' '.join(filtered_tokens)
    return doc
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(list(df['ABSTRACT']))
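Before vectorizing, it is worth eyeballing one cleaned abstract against the raw version shown above; a minimal check:
##python chunk
# the first abstract after cleaning (compare with the raw text above)
norm_corpus[0][:200]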
Calculate the cosine similarity between the abstracts of the attached documents.
##python chunk
# unigrams through trigrams; drop terms appearing in fewer than 2 abstracts
tf = TfidfVectorizer(ngram_range=(1, 3), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape
## (2875, 35446)
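As a quick check on what the vectorizer learned, we can peek at the n-gram vocabulary. The sketch below assumes scikit-learn 1.0+ (older versions use get_feature_names() instead):
##python chunk
# sample a few of the learned n-grams
vocab = tf.get_feature_names_out()
print(len(vocab), vocab[:10])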
Using our movie recommender, pick a single article (under TITLE) and recommend five other related articles.
##python chunk
doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()
## 0 1 2 ... 2872 2873 2874
## 0 1.000000 0.018582 0.025854 ... 0.035056 0.015692 0.036150
## 1 0.018582 1.000000 0.023115 ... 0.044258 0.025632 0.010014
## 2 0.025854 0.023115 1.000000 ... 0.011261 0.027811 0.024538
## 3 0.023046 0.013177 0.041085 ... 0.003967 0.001077 0.000000
## 4 0.050743 0.024190 0.038391 ... 0.024851 0.010541 0.018367
##
## [5 rows x 2875 columns]
Using the methods shown in class, make one change to the model to see how it impacts the outcome. Pick one of the following: use a different similarity metric, use phrases instead of single words (e.g. change ngram_range), use only more frequent terms (e.g. change min_df), or lemmatize the words in the processing step. A sketch of one such change appears after the recommendations below.
##python chunk
def movie_recommender(movie_title, movies, doc_sims):
    # find the row for the target title, rank all documents by similarity,
    # and return the five most similar titles (skipping the title itself)
    movie_idx = np.where(movies == movie_title)[0][0]
    movie_similarities = doc_sims.iloc[movie_idx].values
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    similar_movies = movies[similar_movie_idxs]
    return similar_movies
movie_recommender("chinese lexical database cld", df["TITLE"].values, doc_sim_df)
## array(['chinese lexical database cld a large scale lexical database for simplified mandarin chinese',
## 'meld sch a megastudy of lexical decision in simplified chinese',
## 'the chinese lexicon project a megastudy of lexical decision performance for 25000 traditional chinese two character compound words',
## 'speechreading and the structure of the lexicon computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness',
## 'the use of film subtitles to estimate word frequencies'],
## dtype=object)
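Here is a minimal sketch of one such change (an illustration under assumed settings, not the graded run): raising min_df from 2 to 5 keeps only more frequent terms, after which we can regenerate the recommendations for the same article and compare the two lists.
##python chunk
# variation: keep only n-grams appearing in at least 5 abstracts
tf2 = TfidfVectorizer(ngram_range=(1, 3), min_df=5)
tfidf_matrix2 = tf2.fit_transform(norm_corpus)
doc_sim_df2 = pd.DataFrame(cosine_similarity(tfidf_matrix2))
movie_recommender("chinese lexical database cld", df["TITLE"].values, doc_sim_df2)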
Did you get the articles expected? Do the suggestions make sense? Did your change to the model improve the recommendations? What else might improve the recommendation algorithm?
ANSWER: The suggestions make sense: they resemble what we would expect when we use a search engine to look for similar journal articles.
Describe a set of texts and research question that interests you that could be explored using this method. Basically, what is a potential application of this method to another area of research?
ANSWER: I am interested in the topic of tweets and their relationship to stock market performance. Tweets mentioning a company would be the set of texts, and the research question would be whether similar tweets are associated with similar market performance.