- Data Science Capstone Project
Created by Linghuan Zeng on 8/20/2015
Capstone link: https://zlhtao.shinyapps.io/Data_Science_Capstone
The Dictionary Index is a list stored with the Shiny App. It contains 191889 unique words extracted from the “en_US.blogs.txt” file, and the name of each list element refers to the list No. of the corresponding Words Dictionary file (“*_list.RData”);
The Words Dictionary is stored on GitHub and can be accessed in R through “https://raw.githubusercontent.com/zlhTao2012/DS_Capstone_files/master/Data/?_list.RData” (replace “?” with the list No.);
Each list element in a Words Dictionary “*_list.RData” file stores, for one unique word, a list of the line numbers where the word occurs and the word's location within each of those text lines (sentences) in the “en_US.blogs.txt” file;
Words Dictionary in browser view: https://github.com/zlhTao2012/DS_Capstone_files/tree/master/Data
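The lookup path described above (index, then file URL, then word locations) might be sketched as follows. Python is used for illustration only, since the project itself stores these structures as R lists. The sketch assumes the index maps each list No. (the element name) to the words held in that file, and that each word entry is a list of (line No., position) pairs. Both layouts, the sample words, and the helper names `dictionary_url` and `find_list_no` are hypothetical.

```python
# Illustrative sketch only; the real data lives in R lists and .RData files.
URL_TEMPLATE = ("https://raw.githubusercontent.com/zlhTao2012/"
                "DS_Capstone_files/master/Data/?_list.RData")


def dictionary_url(list_no):
    """Build a Words Dictionary URL by replacing "?" with the list No."""
    return URL_TEMPLATE.replace("?", str(list_no), 1)


# Assumed index layout: element name (list No.) -> the five words in that file.
dictionary_index = {
    "1": ["about", "after", "again", "all", "also"],
    "2": ["day", "time", "year", "way", "work"],
}


def find_list_no(word, index):
    """Return the list No. of the file holding `word`, or None if absent."""
    for list_no, words in index.items():
        if word in words:
            return list_no
    return None


# Assumed content of one Words Dictionary file:
# word -> list of (line No. in en_US.blogs.txt, position within that line).
words_dictionary_2 = {
    "day": [(17, 4), (52, 1)],
    "time": [(17, 9)],
}

list_no = find_list_no("day", dictionary_index)
print(list_no, dictionary_url(list_no))
print(words_dictionary_2["day"])
```

A dictionary keyed by word would make the index lookup constant-time; the linear scan here only mirrors the "name refers to the list No." description.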
I. N-gram Process (Word Existence and Location Match):
II. Words-Match of the Input Text Process (Word Existence Match):
III. Sum the N-gram weighted scores and the Words-Match weighted scores
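Step III could look roughly like the sketch below, again in Python for illustration only. The source does not state the actual weights or score values, so the weights (`ngram_weight`, `match_weight`), the sample scores, and the function name `combine_scores` are all assumptions.

```python
from collections import Counter


def combine_scores(ngram_scores, match_scores,
                   ngram_weight=1.0, match_weight=1.0):
    """Sum weighted N-gram and Words-Match scores per candidate word.

    The weights are hypothetical placeholders; the project's real
    weighting scheme is not given in this document.
    """
    total = Counter()
    for word, score in ngram_scores.items():
        total[word] += ngram_weight * score
    for word, score in match_scores.items():
        total[word] += match_weight * score
    # Highest total first; alphabetical order on ties is used only to make
    # the output deterministic, not as a recommended tie-breaking rule.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores for three candidate next words.
ranked = combine_scores({"day": 3, "time": 1}, {"time": 1, "year": 2})
print(ranked)
```

In this example "time" and "year" end up with the same total, which is exactly the tie situation a real ranking rule would have to resolve.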
Currently, each “*_list.RData” file (a list) contains FIVE words (list elements), but each word element may hold a varying number of entries (a “list of lists” structure), so the file sizes are uneven; -Solution: Average the size of the Words Dictionary files;
Currently, the “en_US.blogs.txt” file, with 899288 records of sentences or paragraphs, is used as the sample text file, but sometimes the prediction algorithm still cannot find enough sample sentences to make a decent prediction; -Solution: Combine other files such as “en_US.news.txt” or “en_US.twitter.txt” with “en_US.blogs.txt”;
How to prioritize words with the same total weighted score; no solution yet, and any suggestions are welcome;
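For the first issue, one possible way to even out the Words Dictionary file sizes is a greedy assignment: place the words with the most entries first, each into the file that is currently smallest. This is only a sketch in Python under stated assumptions. It drops the fixed five-words-per-file rule, measures size as the number of entries per word, and uses a hypothetical helper `balance_words` with made-up counts.

```python
import heapq


def balance_words(entry_counts, n_files):
    """Greedily assign words to files so the total entries per file are similar.

    entry_counts: word -> number of (line No., position) entries (assumed).
    Returns a list of word lists, one per file.
    """
    # Min-heap of (current total entries, file index).
    heap = [(0, i) for i in range(n_files)]
    heapq.heapify(heap)
    files = [[] for _ in range(n_files)]
    # Largest words first, so small words can fill the remaining gaps.
    for word, count in sorted(entry_counts.items(),
                              key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        files[idx].append(word)
        heapq.heappush(heap, (total + count, idx))
    return files


# Hypothetical entry counts: frequent words dominate the file size.
counts = {"the": 100, "of": 60, "and": 50, "zebra": 5, "quark": 5}
files = balance_words(counts, 2)
print(files)
print([sum(counts[w] for w in f) for f in files])
```

The Dictionary Index would need to be regenerated after such a regrouping, since the list No. of each word changes.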