The dataset that Persado sent is shown here. For convenience, the first six rows are displayed. Three variables are provided: the company’s name, its ticker symbol on the stock market, and the sector it belongs to.
library(readr)
company_dataset_original<- read_csv("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/company_dataset.csv")
head(company_dataset_original)
## # A tibble: 6 x 3
## CompanyName CompanyTicker Sector
## <chr> <chr> <chr>
## 1 1347 Property Insurance Holdings, Inc. PIH Finance
## 2 180 Degree Capital Corp. TURN Finance
## 3 1-800 FLOWERS.COM, Inc. FLWS Consumer Services
## 4 1st Constitution Bancorp (NJ) FCCY Finance
## 5 1st Source Corporation SRCE Finance
## 6 21Vianet Group, Inc. VNET Technology
The dataset used from here on is an altered version of the original: a description column, filled in by web scraping, has been added.
There are 662 observations for which we do not have a description.
We will show below how the scraping is done.
All the tickers are extracted from the second column of the given dataset.
# list of different handles which you want to extract
library(readr)
company_dataset<- read_csv("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/company_dataset_c.csv",col_types = cols(X1 = col_skip()))
company_dataset<-company_dataset[which(!is.na(company_dataset$description)),]
head(company_dataset$description)
## [1] "1347 Property Insurance Holdings, Inc., through its subsidiaries, provides property and casualty insurance products to individuals in Louisiana and Texas. The company offers homeowners<U+0092> insurance, manufactured home insurance, dwelling fire insurance, and wind/hail insurance products, as well as reinsurance products. It offers its insurance policies through a network of independent agents. The company was formerly known as Maison Insurance Holdings, Inc. and changed its name to 1347 Property Insurance Holdings, Inc. in November 2013. 1347 Property Insurance Holdings, Inc. was founded in 2012 and is based in Tampa, Florida."
## [2] "180 Degree Capital Corp. is a business development company specializing in early stage investments. The firm seeks to invest in tiny technology including microsystems and transformative nanotechnology companies and applications in the cleantech, biotechnology, energy, healthcare, and electronic sectors. It prefers to invest in biology innovation, where intersecting with innovations in areas such as electronics, physics, materials science, chemistry, information technology, engineering and mathematics. It seeks companies which employ or intend to employ technology that is at the microscale or smaller and if the employment of that technology is material to its business plan. The firm may make follow-on investments and seeks to co-invest. It prefers to hold membership on boards of directors or serve as observers to the boards of directors on its portfolio companies. 180 Degree Capital Corp. was founded in August 1981 and is based in Montclair, New Jersey with addition office in Los Angeles, California."
## [3] "1-800-Flowers.com, Inc., together with its subsidiaries, provides gourmet food and floral gifts for various occasions in the United States. It operates in three segments: Consumer Floral; Gourmet Food and Gift Baskets; and BloomNet Wire Service. The company offers a range of products, including fresh-cut flowers, floral and fruit arrangements and plants, gifts, popcorn, gourmet foods and gift baskets, cookies, chocolates, candies, wine, and gift-quality fruits, as well as balloons, candles, keepsake gifts, and plush stuffed animals. It also provides floral wire service through mybloomnet.net; gourmet gifts, such as fruits and other gourmet items through 1-877-322-1200 or harryanddavid.com; popcorn and specialty treats through 1-800-541-2676 or thepopcornfactory.com; cookies and baked gifts from 1-800-443-8124 or cheryls.com; gift baskets and towers from 1800baskets.com; English muffins and other breakfast treats from 1-800-999-1910 or wolfermans.com; carved fresh fruit arrangements from fruitbouquets.com; and steaks and chops from stockyards.com. The company offers its products to consumers and wholesalers. 1-800-FLOWERS.COM, Inc. was founded in 1976 and is headquartered in Carle Place, New York."
## [4] "1st Constitution Bancorp operates as the bank holding company for 1st Constitution Bank that provides commercial and retail banking services in the central and northeastern New Jersey areas. The company offers deposit products, including interest bearing demand deposits, such as interest-bearing checking accounts and money market accounts; and non-interest bearing demand, savings, and time deposits, as well as certificates of deposit. It also provides commercial loans, including term loans, lines of credit, and loans secured by equipment and receivables; secured and unsecured short-to-medium term commercial loans to businesses for working capital, business expansion, and the purchase of equipment and machinery; and construction loans to real estate developers for the acquisition, development, and construction of residential and commercial properties. In addition, the company offers residential first mortgage loans secured by owner-occupied property; construction loans; reverse mortgages; second mortgage home improvement loans; home equity lines of credit; and non-residential consumer loans for automobiles, recreation vehicles, and boats, as well as secured and unsecured personal loans, and deposit account secured loans. It serves corporations, individuals, partnerships, and other community organizations, as well as small businesses and not-for-profit organizations. As of February 2, 2017, the company operated through 19 banking offices in Cranbury, Fort Lee, Hamilton, Hightstown, Hillsborough, Hopewell, Jamesburg, Lawrenceville, Perth Amboy, Plainsboro, Rocky Hill, West Windsor, Princeton, Rumson, Fair Haven, Shrewsbury, Little Silver, and Asbury Park, New Jersey. 1st Constitution Bancorp was founded in 1989 and is based in Cranbury, New Jersey."
## [5] "1st Source Corporation operates as the bank holding company for 1st Source Bank that provides commercial and consumer banking services, trust and investment management services, and insurance to individual and business clients. Its consumer banking services include checking and savings accounts; certificates of deposit; individual retirement accounts; on-line and mobile banking products; automated teller machine services; consumer loans, real estate loans, and lines of credit; and financial planning, financial literacy, and other consultative services. The company also offers commercial, small business, agricultural, and real estate loans for general corporate purposes, including financing for industrial and commercial properties, equipment, inventories, accounts receivables, and acquisition financing; and commercial leasing, treasury management, and retirement planning services. In addition, it provides a range of trust, investment, agency, and custodial services comprising administration of estates and personal trusts; and manages investment accounts for individuals, employee benefit plans, and charitable foundations. Further, the company offers equipment loan and lease finance products for auto and light trucks, medium and heavy duty trucks, new and used general aviation aircraft, and construction equipment, as well as leases construction equipment, medium and heavy duty trucks, automobiles, and other equipment. Additionally, it provides corporate and personal property, casualty, and individual and group health and life insurance products and services; and investment advisory services to trust and investment clients. As of February 14, 2017, the company operated through 81 banking centers and 23 specialty finance group locations in the United States. 1st Source Corporation was founded in 1863 and is headquartered in South Bend, Indiana."
## [6] "21Vianet Group, Inc. provides carrier-neutral Internet data center services to Internet companies, government entities, blue-chip enterprises, and small-to mid-sized enterprises in the People<U+0092>s Republic of China. It offers hosting and related services to house servers and networking equipment in its data centers, and connects them through a data transmission network; and other hosting related value-added services. Its hosting and related services include managed hosting services that offer data center space to house its customers<U+0092> servers and networking equipment, and provide tailored server administration services; and interconnectivity services that enable customers to connect their servers with Internet backbones and other networks through its border gateway protocol network or single-line, dual-line, or multiple-line network. Its hosting and related services also comprise content delivery network services that optimize the speed and security of data transmission; cloud services that enable businesses to run their applications over the Internet using its IT infrastructure; virtual private network services; and other value-added services, such as firewall, server load balancing, data backup and recovery, data center management, server management, and backup server services. In addition, the company provides traffic charts and analysis, gateway monitoring for servers, domain name system setup, defense mechanism against distributed denial of service attacks, basic setting of switches and routers, and virus protections; and managed network services consisting of hosting area network, route optimization, and last-mile wired broadband services. As of December 31, 2016, it operated 18 self-built and 63 partnered data centers located in approximately 30 cities with 26,830 cabinets. It has a strategic partnership with Microsoft Corporation. The company was founded in 1999 and is headquartered in Beijing, the People<U+0092>s Republic of China."
Exploratory data analysis
The dataset has 3276 observations and 3 variables, of which the second and third are the important ones.
The second variable is the ticker and the third is the sector, which is the category to be predicted.
library("dplyr")
# Create an empty variable called description, which will be filled once web scraping is up and running
company_dataset<-mutate(.data = company_dataset,description="")
To do the web scraping we will use the RSelenium package; rvest is another option.
RSelenium gives more power to the user, since APIs restrict what a user can download. Furthermore, automating the process through a browser helps. The inline comments in the code give an idea of how RSelenium works.
A video capturing the automation and web scraping is attached to the document.
library("RSelenium")
library("dplyr")
# Important: rsDriver() downloads everything that is needed (geckodriver, chromedriver, PhantomJS)
remDr<-rsDriver(port = 4444L,browser = "chrome",version = "latest",chromever = "latest",geckover = "latest",check = TRUE)
Client<-remDr$client
for(i in 1:3276){
Client$open()
Client$navigate(paste0("https://finance.yahoo.com/quote/",company_dataset[i,2],"/profile?p=",company_dataset[i,2],"/"))
Css<-Client$findElements(using = "css selector",value = ".Mt\\(30px\\) .Lh\\(1\\.6\\)")
if (length(Css)>0){
CSS_DATA_FRAME<- unlist(lapply(Css, function(x){x$getElementText()}))
company_dataset[i,4]<-CSS_DATA_FRAME
Client$close()
}
else{
company_dataset[i,4]<-NA
Client$close()
}
}
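The URL assembled inside the loop can be factored into a small helper, which makes it easy to check the addresses before any browser is started. The sketch below is only an illustration: `build_profile_url` is a hypothetical name that does not appear in the original script, and the trailing slash of the original URL is dropped on the assumption that it is not required.

```r
# Hypothetical helper (not part of the original script):
# builds the Yahoo Finance profile URL for a single ticker.
build_profile_url <- function(ticker) {
  # Trim stray whitespace and percent-encode any reserved characters
  ticker <- utils::URLencode(trimws(ticker), reserved = TRUE)
  paste0("https://finance.yahoo.com/quote/", ticker, "/profile?p=", ticker)
}

# Build the URLs for the first three tickers of the dataset
urls <- vapply(c("PIH", "TURN", "FLWS"), build_profile_url,
               character(1), USE.NAMES = FALSE)
print(urls)
```

Inside the scraping loop, `Client$navigate(build_profile_url(company_dataset$CompanyTicker[i]))` would then replace the nested `paste0()` call.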
Some tickers had no description available, and tickers with a W appended were not available either. All such cases have been put under unclassified.
The following code is optional and may be used if the browser loses its internet connectivity or some other mishap occurs.
sum(is.na(company_dataset$description[771:1671]))
# Add the indices of the data if you are searching from within a certain range of the data
# NotFound<-company_dataset[which(is.na(company_dataset$description[771:1671]))+770,]
NotFound<-company_dataset[which(is.na(company_dataset$description[771:1671]))+770,]
# The Selenium code was modified by
Transferto<-NotFound[which(!is.na(NotFound$description)),]
# InsertatRows<-which(company_dataset$CompanyTicker %in% company_dataset$CompanyTicker)
# company_dataset[InsertatRows,]$description<-Transferto$description
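An alternative to re-running the scraper by hand over a range of rows is to retry each ticker a few times before giving up. The following is a minimal sketch of that idea, not the method used for this report: `scrape_with_retry`, `scrape_fun`, and `flaky_scraper` are hypothetical names, and the stand-in scraper only simulates a lost connection.

```r
# Hypothetical helper: call scrape_fun(ticker) up to max_tries times
# and return NA if every attempt fails or returns nothing.
scrape_with_retry <- function(ticker, scrape_fun, max_tries = 3, wait_seconds = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(scrape_fun(ticker), error = function(e) NULL)
    if (!is.null(result) && length(result) > 0) {
      # Several page elements are collapsed into one description string
      return(paste(result, collapse = " "))
    }
    Sys.sleep(wait_seconds)  # pause before the next attempt
  }
  NA_character_
}

# Stand-in scraper for illustration only: fails twice, then succeeds
calls <- 0
flaky_scraper <- function(ticker) {
  calls <<- calls + 1
  if (calls < 3) stop("connection lost")
  paste("Description for", ticker)
}

description <- scrape_with_retry("PIH", flaky_scraper)
print(description)
```

In the real script, `scrape_fun` would wrap the `navigate` and `findElements` calls, so that a dropped connection leaves an NA in the description column rather than stopping the loop.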
Let’s see the number of unique sectors in the given dataset
unique(company_dataset$Sector)
## [1] "Finance" "Consumer Services"
## [3] "Technology" "Public Utilities"
## [5] "Capital Goods" "Basic Industries"
## [7] "Health Care" "Energy"
## [9] "Miscellaneous" "Transportation"
## [11] "Consumer Non-Durables" "Consumer Durables"
## [13] "n/a" NA
length(unique(company_dataset$Sector))
## [1] 14
We can see that there are 14 distinct values, of which two (“n/a” and NA) mark a missing sector. Below is the frequency distribution of the sectors.
# Converting from character to factor
company_dataset$Sector<-as.factor(company_dataset$Sector)
library("ggplot2")
g1<-ggplot(data = company_dataset,aes(x = Sector))+
geom_bar(aes(fill = Sector))+
theme(axis.text.x = element_blank())
plot(g1)
We can see that the top categories are Health Care and Finance. Now we will calculate the frequencies of their occurrences.
library(knitr)
SectorFrequencies<-as.data.frame(sort(table(company_dataset$Sector),decreasing = T))
colnames(SectorFrequencies)<-c("Sector","Frequency")
kable(SectorFrequencies)
Sector | Frequency |
---|---|
Health Care | 619 |
Finance | 525 |
Technology | 429 |
Consumer Services | 332 |
Capital Goods | 167 |
Consumer Non-Durables | 108 |
Miscellaneous | 86 |
Basic Industries | 76 |
Consumer Durables | 73 |
Energy | 62 |
Public Utilities | 60 |
Transportation | 48 |
n/a | 24 |
We can use the sentences in each description to build a word frequency distribution and use it to predict the sector.
The frequency table lists 13 sectors, of which one is n/a. Therefore we have 12 categories to be predicted.
We will use 70% of the data for training and the rest for testing.
We have created subsets of the training portion of the master data, one per sector, and written them to files so that the code can be reused later if we want to look at a single sector. The files are created in the project folder.
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(45)
ind=sample(2,nrow(company_dataset),replace=T,prob=c(0.7,0.3))
company_dataset_1=company_dataset[ind==1,]
company_test=company_dataset[ind==2,]
Finance_dataset<-company_dataset_1[company_dataset_1$Sector=="Finance",]
HealthCare<-company_dataset_1[company_dataset_1$Sector=="Health Care",]
Technology_dataset<-company_dataset_1[company_dataset_1$Sector=="Technology",]
Consumer_Services<-company_dataset_1[company_dataset_1$Sector=="Consumer Services",]
Capital_Goods<-company_dataset_1[company_dataset_1$Sector=="Capital Goods",]
Consumer_Non_Durables<-company_dataset_1[company_dataset_1$Sector=="Consumer Non-Durables",]
Consumer_Durables<-company_dataset_1[company_dataset_1$Sector=="Consumer Durables",]
Miscelleaneous<-company_dataset_1[company_dataset_1$Sector=="Miscellaneous",]
Basic_Industries<-company_dataset_1[company_dataset_1$Sector=="Basic Industries",]
Energy<-company_dataset_1[company_dataset_1$Sector=="Energy",]
Public_Utilities<-
company_dataset_1[company_dataset_1$Sector=="Public Utilities",]
Transportation<-company_dataset_1[company_dataset_1$Sector=="Transportation",]
Unclassified<-company_dataset_1[company_dataset_1$Sector=="n/a",]
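The thirteen per-sector subsets above can also be produced in one step with base R's `split()`. The sketch below uses a small made-up data frame (`toy`) rather than the real training set, so the tickers and the NA row are only for illustration.

```r
# Toy stand-in for company_dataset_1 (values are illustrative only)
toy <- data.frame(
  CompanyTicker = c("PIH", "TURN", "FLWS", "VNET", "XXXX"),
  Sector = c("Finance", "Finance", "Consumer Services", "Technology", NA),
  stringsAsFactors = FALSE
)

# One data frame per sector, returned as a named list;
# rows whose Sector is NA are left out by split()
by_sector <- split(toy, toy$Sector)

print(names(by_sector))
print(sapply(by_sector, nrow))
```

A named list such as `by_sector` can be looped over directly with `names(by_sector)`, which would avoid looking up objects by name with `get()`.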
We will construct two models: one based on term frequencies (tf) and the other based on tf-idf (term frequency and inverse document frequency).
The text mining approach groups together the text from the same sector and then creates a bag of words that appear frequently in that sector.
The approach is to count the frequency of the words that matter, removing common English words and words that carry no information content (from our perspective and the client’s perspective).
##### Text Mining
library("stringr")
Sector_Names<-data.frame(Sector=c("Finance_dataset","HealthCare","Basic_Industries","Capital_Goods","Consumer_Services","Consumer_Non_Durables","Consumer_Durables","Energy","Miscelleaneous","Public_Utilities","Transportation","Unclassified","Technology_dataset"))
Sector_Names$Sector=as.character(Sector_Names$Sector)
library(tm)
for(i in 1:(nrow(Sector_Names))){
text<-paste(get(Sector_Names[i,])$description,collapse = "")
write(text,file=paste0(x=Sector_Names[i,],".txt"))
# punctuation replacement
text2<-gsub(pattern = "\\W",replacement = " ",text)
# While scraping, a box-type character (shown as <U+0092>) is picked up, so that type of character has to be deleted from our text
# digits replacement
text2<-gsub(pattern="\\d",replacement = "",text2)
# We will delete words that are a single letter long ([A-Za-z] rather than [A-z], which also matches some punctuation characters)
text2<-gsub(pattern = "\\b[A-Za-z]\\b",replacement = "",text2)
# remove words that carry no information (stop words)
text2<-removeWords(text2,stopwords())
text2<-stripWhitespace(text2)
# Stemming the document makes matching of the words easier and faster for the system; whether to stem depends on the situation of the analysis
text2<-stemDocument(text2)
text2<-removeWords(text2,c("servic","compani","The","product","provid","Inc","It","includ","well","also","found","In","As","Its","develop","known","sell","offer","use","various","serv","busi","This"))
setwd("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/CleanedText")
write(text2,file=paste0(x=Sector_Names[i,],"withcleaning",".txt"))
setwd("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist")
textbagWithoutCleaning<-str_split(text2,pattern="\\s+")
textbag<-unlist(textbagWithoutCleaning)
write.csv(head(sort(table(textbag),decreasing = T),n=200),file =paste0(x=Sector_Names[i,],"withoutcleaning",".csv"))
}
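The cleaning steps in the loop above can be tried out on a single sentence without the tm package. This is a simplified sketch under stated assumptions: `clean_description` is a hypothetical function, the stop word list is a tiny made-up stand-in for `stopwords()`, and stemming is left out.

```r
# Hypothetical helper: cleans one description and returns its tokens.
# The default stop word list is a small illustrative stand-in.
clean_description <- function(text,
                              stop_words = c("the", "and", "of", "in", "is", "its", "to")) {
  text <- gsub("\\W", " ", text, perl = TRUE)              # punctuation to spaces
  text <- gsub("\\d", "", text, perl = TRUE)               # drop digits
  text <- gsub("\\b[A-Za-z]\\b", "", text, perl = TRUE)    # drop single letters
  tokens <- unlist(strsplit(tolower(text), "\\s+"))        # split on whitespace
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]       # drop empties and stop words
}

tokens <- clean_description("The company offers 19 loans, deposits and loans in New Jersey.")
freq <- sort(table(tokens), decreasing = TRUE)
print(freq)
```

The sorted table plays the same role as the per-sector frequency files written by the loop, only for one sentence.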
Now we will look at the dataset and identify which words occur in most of the documents. After deleting these words we will create a tf-idf matrix, using the tm package and its DocumentTermMatrix function.
First we create a corpus from the cleaned text files.
library(tm)
filePath<-("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/CleanedText/")
Corpus_Yahoo<-Corpus(DirSource(filePath), readerControl = list(language="english"))
dtm_yahoo<-DocumentTermMatrix(Corpus_Yahoo)
# Drop sparse terms, then calculate and sort the word frequencies
dtm_yahoo<-tm::removeSparseTerms(x = dtm_yahoo,sparse = 0.5)
word.freq<-sort(colSums(as.matrix(dtm_yahoo)),decreasing = T)
# relative frequency = share of all word occurrences
table_word_freq<-data.frame(word=names(word.freq),
abs.freq=word.freq,
relative.frequency=word.freq/sum(word.freq))
rownames(table_word_freq)<-NULL
kable(x = table_word_freq[1:200,1:2],caption = "Top 200 Frequent words")
word | abs.freq |
---|---|
oper | 1891 |
loan | 1694 |
manag | 1532 |
commerci | 1416 |
headquart | 1401 |
segment | 1384 |
system | 1334 |
market | 1267 |
solut | 1210 |
bank | 1168 |
base | 1054 |
state | 1043 |
unite | 991 |
manufactur | 964 |
corpor | 899 |
treatment | 858 |
deposit | 828 |
industri | 826 |
account | 821 |
technolog | 818 |
custom | 807 |
clinic | 788 |
consum | 749 |
addit | 733 |
network | 683 |
applic | 682 |
equip | 656 |
retail | 635 |
new | 630 |
real | 627 |
design | 623 |
compris | 616 |
hold | 598 |
secur | 593 |
name | 588 |
financi | 586 |
data | 566 |
estat | 563 |
mobil | 547 |
brand | 523 |
invest | 514 |
home | 513 |
insur | 513 |
devic | 495 |
sale | 492 |
softwar | 492 |
line | 482 |
process | 480 |
subsidiari | 477 |
platform | 467 |
patient | 465 |
cancer | 454 |
medic | 453 |
pharmaceut | 451 |
credit | 450 |
primarili | 443 |
residenti | 441 |
distribut | 437 |
former | 430 |
direct | 429 |
relat | 429 |
small | 428 |
offic | 419 |
chang | 418 |
construct | 406 |
control | 403 |
store | 400 |
california | 399 |
individu | 398 |
program | 395 |
group | 394 |
properti | 388 |
mortgag | 385 |
non | 375 |
onlin | 372 |
center | 369 |
power | 368 |
health | 362 |
integr | 357 |
support | 352 |
person | 348 |
electron | 346 |
york | 345 |
intern | 340 |
energi | 332 |
locat | 327 |
enabl | 319 |
worldwid | 315 |
research | 309 |
focus | 308 |
digit | 308 |
internet | 305 |
cell | 301 |
plan | 296 |
test | 294 |
engag | 292 |
own | 292 |
distributor | 289 |
ltd | 289 |
financ | 287 |
famili | 286 |
specialti | 286 |
drug | 285 |
gas | 281 |
licens | 281 |
servic | 280 |
check | 279 |
engin | 276 |
inform | 276 |
enterpris | 271 |
further | 271 |
video | 268 |
rang | 267 |
oil | 266 |
transport | 266 |
cloud | 265 |
communic | 263 |
time | 263 |
approxim | 261 |
stage | 257 |
agreement | 255 |
vehicl | 255 |
equiti | 255 |
profession | 249 |
food | 248 |
independ | 247 |
two | 246 |
money | 245 |
connect | 243 |
facil | 243 |
organ | 240 |
autom | 236 |
access | 236 |
care | 235 |
govern | 235 |
healthcar | 234 |
high | 233 |
north | 230 |
america | 229 |
china | 226 |
monitor | 226 |
collabor | 225 |
consist | 223 |
content | 223 |
compon | 222 |
storag | 222 |
suppli | 217 |
card | 211 |
togeth | 210 |
tool | 208 |
asset | 206 |
media | 205 |
decemb | 204 |
texa | 204 |
incorpor | 201 |
produc | 200 |
game | 199 |
branch | 198 |
hospit | 198 |
origin | 198 |
europ | 197 |
institut | 197 |
comput | 197 |
activ | 195 |
interest | 194 |
leas | 194 |
client | 194 |
accessori | 193 |
com | 193 |
canada | 191 |
deliveri | 191 |
mainten | 191 |
user | 191 |
corp | 189 |
portfolio | 189 |
packag | 188 |
wireless | 187 |
acquisit | 186 |
channel | 185 |
instal | 185 |
compani | 184 |
protect | 184 |
target | 184 |
life | 184 |
counti | 183 |
privat | 183 |
capit | 178 |
payment | 174 |
one | 173 |
internat | 170 |
multi | 170 |
websit | 170 |
area | 169 |
machin | 167 |
term | 167 |
perform | 166 |
restaur | 166 |
demand | 166 |
build | 165 |
natur | 165 |
# Find association in this way we can create anagrams
# tm::findAssocs(dtm_yahoo,c("treatment"),0.8)
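The table above holds raw term counts, which is what `DocumentTermMatrix()` stores by default. For the tf-idf variant mentioned earlier, the counts have to be reweighted so that terms occurring in every sector document count for less. The sketch below shows one common definition of tf-idf in base R on made-up token lists; it is not the weighting used to produce the table, and the tm package's own `weightTfIdf` uses a somewhat different formula.

```r
# Toy sector documents (already cleaned and stemmed; illustrative only)
docs <- list(
  Finance    = c("loan", "bank", "loan", "oper"),
  HealthCare = c("clinic", "drug", "oper"),
  Energy     = c("oil", "gas", "oper")
)

vocab <- sort(unique(unlist(docs)))

# Term counts: one column per document, one row per term
counts <- vapply(docs,
                 function(tokens) as.numeric(table(factor(tokens, levels = vocab))),
                 numeric(length(vocab)))
rownames(counts) <- vocab
tf <- t(counts)                  # documents in rows, terms in columns

doc_freq <- colSums(tf > 0)      # number of documents containing each term
idf <- log(nrow(tf) / doc_freq)  # inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)  # weight every term column by its idf

print(round(tfidf, 3))
```

In this toy example the stem `oper` occurs in all three documents, so its weight drops to zero, while `loan` keeps a high weight for Finance.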
We will create a bag of words and then apply the approach of sentiment analysis. Conventional sentiment analysis classifies text as positive or negative. We will use the same approach, but rather than a binary classification we will classify each description into one of 13 different sectors.
Before we check a description, we clean it and stem it, because all the sector documents are stemmed.
match_sector<-data.frame()
for(i in 1:786){
match_sector[i,1]<-gsub(pattern="\\d",replacement = "",x=company_test[i,4])
match_sector[i,1]<-gsub(pattern = "\\W",replacement = " ",x=match_sector[i,1])
match_sector[i,1]<-gsub(pattern = "\\b[A-Za-z]\\b",replacement = "",x=match_sector[i,1])
match_sector[i,1]<-removeWords(as.character(match_sector[i,1]),stopwords())
match_sector[i,1]<-stemDocument(match_sector[i,1])
# match_sector[i,1]<-sectortest(match_sector[i,1])
desc_split<-unlist(str_split(match_sector[i,1],pattern="\\s+"))
setwd("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/Cleanedtext")
match_files<-as.data.frame(list.files())
match_files$`list.files()`<-as.character(match_files$`list.files()`)
for(j in 1:13){
assign(value=sum(!is.na(match(desc_split,scan(match_files[j,1],what = 'character',sep = "",nmax=3000)))),x = paste0(match_files[j,1],"match_test"))
}
# We want to retrieve the names from the global environment
Basic_Industries_Final<-as.data.frame(Basic_Industrieswithcleaning.txtmatch_test)
Capital_Goods_Final<-as.data.frame(Capital_Goodswithcleaning.txtmatch_test)
Consumer_Durables_Final<-as.data.frame(Consumer_Durableswithcleaning.txtmatch_test)
Consumer_Non_Durables_Final<-as.data.frame(Consumer_Non_Durableswithcleaning.txtmatch_test)
Consumer_Services_Final<-as.data.frame(Consumer_Serviceswithcleaning.txtmatch_test)
Energy_Final<-as.data.frame(Energywithcleaning.txtmatch_test)
Finance_dataset_Final<-as.data.frame(Finance_datasetwithcleaning.txtmatch_test)
HealthCare_Final<-as.data.frame(HealthCarewithcleaning.txtmatch_test)
Miscelleaneous_Final<-as.data.frame(Miscelleaneouswithcleaning.txtmatch_test)
Public_Utilities_Final<-as.data.frame(Public_Utilitieswithcleaning.txtmatch_test)
Technology_dataset_Final<-as.data.frame(Technology_datasetwithcleaning.txtmatch_test)
Transportation_Final<-as.data.frame(Transportationwithcleaning.txtmatch_test)
Unclassified_Final<-as.data.frame(Unclassifiedwithcleaning.txtmatch_test)
test_1<-data.frame()
test_1<-cbind.data.frame(Basic_Industries_Final,Capital_Goods_Final,Consumer_Durables_Final,Consumer_Non_Durables_Final,Consumer_Services_Final,Energy_Final,Finance_dataset_Final,HealthCare_Final,Miscelleaneous_Final,Public_Utilities_Final,Technology_dataset_Final,Transportation_Final,Unclassified_Final)
match_sector[i,1]<-names(test_1)[which.max(test_1[1,])]
}
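The core of the loop above is a simple scoring rule: count how many words of a description appear in each sector's word list and pick the sector with the highest count. A compact sketch of that rule follows; `predict_sector` and the three short vocabularies are hypothetical and only stand in for the cleaned sector files.

```r
# Toy sector vocabularies (stemmed words; illustrative only)
sector_vocab <- list(
  Finance    = c("loan", "bank", "deposit", "mortgag"),
  HealthCare = c("clinic", "drug", "patient", "cancer"),
  Technology = c("softwar", "cloud", "data", "network")
)

# Hypothetical helper: score a tokenised description against every
# sector vocabulary and return the name of the best-matching sector.
predict_sector <- function(tokens, sector_vocab) {
  scores <- vapply(sector_vocab,
                   function(vocab) sum(tokens %in% vocab),
                   numeric(1))
  names(scores)[which.max(scores)]
}

predicted <- predict_sector(c("bank", "hold", "loan", "deposit", "data"), sector_vocab)
print(predicted)
```

Note that `which.max()` returns the first maximum, so in the case of a tie the sector listed first wins; the same holds for the loop above.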
Next we clean up the names of the predicted sectors and compare the results with the original labels.
After that we will break the data into training and testing sets and then compare the results.
library(knitr)
# gsub() is vectorised, so the whole column can be cleaned without a loop
match_sector[,1]<-gsub(pattern = "withcleaning.txtmatch_test",replacement="",x=match_sector[,1])
match_sector[,1]<-gsub(pattern = "_dataset",replacement="",x=match_sector[,1])
Results_Sector<-as.data.frame(sort(table(match_sector$V1),decreasing = T))
kable(Results_Sector,caption = "which_Sector")
Var1 | Freq |
---|---|
HealthCare | 142 |
Finance | 123 |
Technology | 93 |
Consumer_Services | 85 |
Consumer_Durables | 65 |
Miscelleaneous | 65 |
Capital_Goods | 58 |
Basic_Industries | 49 |
Consumer_Non_Durables | 41 |
Public_Utilities | 32 |
Transportation | 14 |
Energy | 13 |
Unclassified | 6 |
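The table above shows how often each sector was predicted, but not how often the prediction was right. One way to compare predictions with the original labels is a confusion matrix and an overall accuracy. The sketch below uses made-up label vectors, so the resulting numbers say nothing about the actual model; `normalise` is a hypothetical helper.

```r
# Made-up labels for illustration only
actual    <- c("Health Care", "Finance", "Technology", "Finance", "Energy")
predicted <- c("HealthCare",  "Finance", "Finance",    "Finance", "Energy")

# Predicted names come from file names without spaces,
# so strip everything except letters before comparing
normalise <- function(x) gsub("[^A-Za-z]", "", x)
actual_n    <- normalise(actual)
predicted_n <- normalise(predicted)

confusion <- table(Actual = actual_n, Predicted = predicted_n)
accuracy  <- mean(actual_n == predicted_n)

print(confusion)
print(accuracy)
```

With the real data, labels such as `Miscelleaneous` (spelled differently from the `Miscellaneous` sector) and `Unclassified` (for `n/a`) would need an explicit mapping, because stripping punctuation alone does not make them match.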
We will build a decision tree using the most frequent terms. In this case we take n=200, and the frequencies can be used as weights.
library(readr)
# list.files() takes a regular expression, so anchor the file name ending explicitly
decision_trees_file_list<-list.files(path=getwd(),pattern ="withoutcleaning\\.csv$")
for(i in seq_along(decision_trees_file_list)){
assign(x=decision_trees_file_list[i],value =read_csv(paste0("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/",decision_trees_file_list[i]), col_types = cols(X1 = col_skip())))
}
Delete the frequency column from them
decision_trees_file_list<-as.data.frame(decision_trees_file_list)
decision_trees_file_list$decision_trees_file_list<-as.character(decision_trees_file_list$decision_trees_file_list)
for(i in 1:nrow(decision_trees_file_list))
{
assign(value=get(decision_trees_file_list[i,])[,1],x=decision_trees_file_list[i,])
}
for(i in 1:nrow(decision_trees_file_list))
{
assign(value=as.data.frame(t(get(decision_trees_file_list[i,]))),x=paste0("t_",decision_trees_file_list[i,]))
}
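One way the frequent words could feed a decision tree is as binary features: one column per word, set to 1 when the word occurs in a description. The sketch below builds such a table from made-up data; the four words stand in for the top-200 lists, and the layout is an assumption about how the tree might be trained, not the method of this report.

```r
# Toy stand-in for the most frequent words across sectors
top_words <- c("loan", "bank", "clinic", "softwar")

# Toy tokenised descriptions with their sectors (illustrative only)
descriptions <- list(
  c("bank", "hold", "loan", "loan"),
  c("clinic", "stage", "drug"),
  c("softwar", "cloud", "bank")
)
sectors <- c("Finance", "Health Care", "Technology")

# One row per description, one 0/1 column per frequent word
features <- t(vapply(descriptions,
                     function(tokens) as.integer(top_words %in% tokens),
                     integer(length(top_words))))
colnames(features) <- top_words
features <- as.data.frame(features)
features$Sector <- factor(sectors)

print(features)
```

A table in this shape could then be passed to a tree learner, for example `rpart::rpart(Sector ~ ., data = features, method = "class")`, assuming the rpart package is installed.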
Merging all the datasets together
library(knitr)
words_decision_build<-rbind(t_Basic_Industrieswithoutcleaning.csv,t_Capital_Goodswithoutcleaning.csv,t_Consumer_Durableswithoutcleaning.csv,t_Consumer_Non_Durableswithoutcleaning.csv,t_Consumer_Serviceswithoutcleaning.csv,t_Energywithoutcleaning.csv,t_Finance_datasetwithoutcleaning.csv,t_HealthCarewithoutcleaning.csv,t_Miscelleaneouswithoutcleaning.csv,t_Public_Utilitieswithoutcleaning.csv,t_Technology_datasetwithoutcleaning.csv,t_Transportationwithoutcleaning.csv,t_Unclassifiedwithoutcleaning.csv)
kable(x=words_decision_build,caption = "Comparison of frequent words across sectors")
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | V41 | V42 | V43 | V44 | V45 | V46 | V47 | V48 | V49 | V50 | V51 | V52 | V53 | V54 | V55 | V56 | V57 | V58 | V59 | V60 | V61 | V62 | V63 | V64 | V65 | V66 | V67 | V68 | V69 | V70 | V71 | V72 | V73 | V74 | V75 | V76 | V77 | V78 | V79 | V80 | V81 | V82 | V83 | V84 | V85 | V86 | V87 | V88 | V89 | V90 | V91 | V92 | V93 | V94 | V95 | V96 | V97 | V98 | V99 | V100 | V101 | V102 | V103 | V104 | V105 | V106 | V107 | V108 | V109 | V110 | V111 | V112 | V113 | V114 | V115 | V116 | V117 | V118 | V119 | V120 | V121 | V122 | V123 | V124 | V125 | V126 | V127 | V128 | V129 | V130 | V131 | V132 | V133 | V134 | V135 | V136 | V137 | V138 | V139 | V140 | V141 | V142 | V143 | V144 | V145 | V146 | V147 | V148 | V149 | V150 | V151 | V152 | V153 | V154 | V155 | V156 | V157 | V158 | V159 | V160 | V161 | V162 | V163 | V164 | V165 | V166 | V167 | V168 | V169 | V170 | V171 | V172 | V173 | V174 | V175 | V176 | V177 | V178 | V179 | V180 | V181 | V182 | V183 | V184 | V185 | V186 | V187 | V188 | V189 | V190 | V191 | V192 | V193 | V194 | V195 | V196 | V197 | V198 | V199 | V200 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
textbag | segment | industri | manufactur | headquart | construct | oil | oper | market | base | chemic | wast | gas | system | facil | produc | State | Unite | energi | custom | fuel | mine | name | manag | process | water | brand | distribut | power | mainten | primarili | Product | specialti | storag | subsidiari | commerci | solut | Corpor | applic | retail | plant | treatment | addit | build | design | sale | technolog | compris | engin | home | intern | medic | control | ethanol | food | grain | metal | renew | Servic | Compani | dispos | System | transport | communic | electr | equip | fabric | own | pipe | steel | two | agricultur | chang | distil | instal | materi | network | pharmaceut | store | biomass | California | Canada | center | collect | corn | diesel | distributor | engag | former | lubric | natur | petroleum | program | rang | rehabilit | togeth | util | worldwid | agent | approxim | cabl | coal | coat | direct | drill | explor | health | independ | Industrial | liquid | regul | relat | solar | wood | anim | care | Chemic | China | clean | consum | contractor | Energi | enhanc | generat | gold | hardwar | infrastructur | locat | PC | Plain | protect | Republ | surfac | Texa | trait | acid | coin | complianc | fertil | fiber | general | govern | Green | hazard | healthcar | high | Home | hydrogen | Illinoi | integr | line | non | paper | part | recycl | roof | secur | Specialti | tank | wire | acr | agenc | American | cabinet | carbon | corros | crop | destruct | devic | feed | forc | hous | Industri | ingredi | investor | leas | Metal | parti | Peopl | Pharmaceut | pipelin | precious | recal | reduc | spray | vacuum | valu | work | Africa | aluminum | balloon | cell | cleaner | clinic | compon | consult | dealer | disord | door | dri | emiss |
textbag1 | system | manufactur | segment | industri | market | equip | design | headquart | oper | compon | control | process | applic | engin | power | electron | solut | Corpor | sale | commerci | devic | Unite | custom | State | vehicl | gas | electr | test | brand | primarili | direct | part | distributor | measur | name | addit | instrument | automot | machin | America | base | medic | assembl | high | build | subsidiari | worldwid | autom | distribut | energi | independ | integr | manag | Technolog | compris | cabl | construct | materi | metal | oil | relat | semiconductor | support | laser | monitor | repres | technolog | circuit | plastic | softwar | Europ | Group | intern | packag | suppli | tool | heat | repair | steel | truck | aerospac | China | imag | inspect | network | precis | consum | data | optic | research | air | generat | non | origin | two | end | former | instal | light | residenti | System | unit | accessori | aluminum | Ltd | militari | North | railcar | sensor | Servic | Asia | chang | defens | produc | transport | comput | fabric | forc | Hold | pipe | specialti | structur | water | chemic | concret | metrolog | New | program | pump | detect | digit | door | fluid | laboratori | storag | three | analysi | board | California | cell | coat | govern | Industri | line | secur | togeth | wafer | window | wire | Canada | connector | Energi | engag | home | inject | liquid | multi | perform | real | time | valv | blood | car | food | general | mainten | panel | plant | rang | treatment | aircraft | center | dealer | deliveri | environment | famili | field | fuel | mechan | molecular | organ | Peopl | platform | Republ | singl | user | York | dispens | facil | flow | hopper | hous | HTG | institut | leas | mobil | pharmaceut | protect | retail | |
textbag2 | system | manufactur | segment | market | oper | control | headquart | brand | retail | industri | equip | commerci | design | distribut | vehicl | furnitur | power | store | light | Unite | State | applic | base | electr | custom | food | solut | mobil | suppli | video | display | repair | addit | sale | center | distributor | independ | network | primarili | compris | Corpor | data | devic | pool | relat | specialti | subsidiari | direct | home | LED | pet | truck | worldwid | America | chemic | communic | compon | engin | outdoor | chair | detect | electron | manag | metal | monitor | name | part | process | residenti | access | accessori | Pet | pump | togeth | care | digit | Furnitur | Industri | instal | intern | mainten | offic | restaur | secur | supplement | treatment | bird | Canada | consum | conveni | end | facil | govern | graphic | integr | rang | seat | steel | tabl | technolog | two | area | automot | California | door | finish | fuel | Further | garden | interior | Internet | motor | non | nutrit | oil | packag | Solut | System | Technolog | water | wood | chang | cleaner | Compani | construct | energi | Garden | gas | Ltd | materi | North | protect | repres | room | sport | tire | traffic | user | util | voic | wholesal | approxim | architectur | chain | clinic | dealer | desk | diesel | fixtur | formul | health | high | inform | Internat | item | lawn | New | panel | person | plastic | privat | support | surveil | three | transmiss | transport | travel | agricultur | camera | Center | CN | comput | driver | edibl | enabl | Europ | filter | furnish | generat | grid | Hold | Hooker | insur | label | laser | Light | merchant | modul | Ohio | onlin | optic | pharmaceut | Pizza | produc | replac | RoadSquad | satellit | sensor | signag | site |
textbag3 | store | brand | retail | segment | market | oper | food | manufactur | distribut | Unite | headquart | State | name | base | accessori | invest | Corpor | specialti | direct | distributor | beverag | industri | apparel | consum | Food | fruit | Group | licens | custom | subsidiari | produc | Canada | equip | outlet | primarili | intern | Internat | Ltd | chain | locat | New | wholesal | York | privat | relat | Brand | China | Compani | compris | frozen | snack | chang | design | former | Madden | million | independ | organ | process | solut | addit | approxim | California | control | electron | firm | footwear | groceri | own | devic | mass | packag | sale | system | women | com | label | Monster | technolog | togeth | toy | veget | data | depart | engag | men | Retail | cabl | channel | coffe | commerc | compon | conveni | high | merchandis | North | pack | Peopl | power | protein | Republ | Technolog | applic | dri | drink | line | militari | network | nutrit | outdoor | sport | three | water | worldwid | America | audio | center | display | game | Girl | manag | natur | portfolio | rang | region | rice | Steve | Wholesal | can | energi | fashion | Fresh | home | item | juic | non | seek | shirt | vehicl | wireless | activ | casino | chemic | club | entertain | Europ | good | hardwar | health | Limit | medic | mix | nut | onlin | Operat | part | prefer | protect | secur | shop | softwar | supermarket | suppli | video | agreement | aircraft | Asia | chees | chip | commerci | consist | countri | currenc | dairi | Decemb | digit | Energi | engin | facil | fresh | ice | Industri | integr | plant | Product | restaur | school | Seneca | shoe | small | state | two | Websit | agent | American | bottl | bran | Calavo | Co | Columbia |
textbag4 | oper | segment | Unite | State | headquart | base | manag | own | store | program | New | brand | retail | distribut | name | subsidiari | com | restaur | market | consum | network | addit | Group | approxim | televis | custom | entertain | video | home | York | Corpor | content | compris | onlin | relat | Internet | solut | properti | real | technolog | center | food | invest | primarili | data | former | digit | intern | accessori | locat | Websit | industri | system | applic | chang | mobil | equip | estat | game | sport | commerci | suppli | franchis | Hold | channel | inform | Internat | secur | togeth | educ | facil | manufactur | merchandis | sale | state | engag | Servic | Decemb | media | residenti | engin | health | advertis | electron | platform | satellit | broadcast | item | Liberti | licens | shop | book | cabl | California | Canada | direct | Further | profession | design | two | America | casino | mortgag | connect | hotel | leas | live | produc | client | Colorado | incom | North | radio | specialti | non | privat | China | construct | devic | Entertain | Network | public | Republ | support | consist | financi | agenc | build | general | good | REIT | communic | gift | govern | Hotel | Media | travel | American | comput | Corp | tax | tool | Citi | loan | mainten | news | offic | packag | Texa | worldwid | access | apparel | children | Compani | Europ | feder | healthcar | project | protect | Resort | trust | wireless | consult | infrastructur | line | organ | school | space | squar | storag | asset | casual | energi | financ | integr | plan | rang | region | rental | research | room | softwar | activ | area | demand | Educat | enterpris | feet | independ | Mexico | million | Peopl | publish | repair | station | test | vehicl | corpor | Florida | involv |
textbag5 | oil | gas | natur | Texa | system | oper | Energi | headquart | fuel | State | Unite | Basin | power | approxim | control | reserv | properti | segment | crude | Decemb | locat | explor | interest | primarili | produc | own | market | engag | industri | Oklahoma | Partner | acquisit | acr | barrel | prove | Permian | cell | LP | net | West | Corpor | independ | manag | storag | applic | custom | gross | Oil | solut | Colorado | data | drill | equip | equival | field | Houston | manufactur | North | Resourc | technolog | million | packag | process | autom | compris | focus | pipelin | Servic | Shale | transport | activ | addit | distribut | East | electr | Gas | gather | liquid | mine | monitor | oilfield | Product | region | train | base | Canada | chang | coal | compon | compressor | consist | Counti | design | engin | former | Indiana | Mid | miner | name | partner | South | Virginia | area | Blueknight | energi | estim | general | hold | instal | subsidiari | Technolog | acquir | asphalt | Denver | Hold | incorpor | inject | land | onshor | Pennsylvania | project | sale | state | water | work | chemic | Compressco | Contin | Crude | Eagl | exploit | generat | GP | LLC | Ltd | mainten | Midland | mile | non | part | pressur | pump | relat | Reserv | rig | royalti | stationari | truck | vehicl | Westport | Wyom | aggreg | America | case | Citi | CNG | collect | complet | CSI | Dakota | digit | facil | Gulf | handl | hydrogen | Illinoi | intern | Legaci | low | Mountain | parti | PEM | petroleum | purchas | Rocki | seismic | station | System | tank | third | three | unit | assembl | Australia | batteri | burner | Centenni | China | Coast | combin | compress | condens | convers | counti | crew | Develop | divis | drive | Ford | Format |
textbag6 | loan | commerci | bank | deposit | account | oper | estat | real | insur | manag | credit | consum | invest | market | individu | mortgag | Bank | residenti | offic | save | hold | headquart | home | line | check | secur | segment | construct | financi | addit | money | equiti | person | famili | New | compris | certif | properti | Corpor | Financi | trust | branch | financ | base | Unite | small | retail | plan | Bancorp | card | State | locat | interest | automobil | bear | retir | custom | time | demand | industri | fund | non | York | one | cash | corpor | term | asset | brokerag | Counti | onlin | equip | Hold | liabil | mobil | primarili | subsidiari | trade | four | lend | rang | full | purchas | coverag | Corp | life | First | land | autom | capit | relat | unsecur | Group | Further | accept | administr | portfolio | Internet | vehicl | exchang | wealth | institut | multi | network | agricultur | transfer | California | client | health | safe | casualti | profession | advisori | Carolina | size | leas | Pennsylvania | teller | consist | payment | solut | Decemb | direct | Insuranc | medium | NOW | Commun | fix | annuiti | communiti | debit | Ohio | Washington | acquisit | box | instal | machin | South | Bancshar | facil | Jersey | reinsur | Florida | improv | name | Texa | Virginia | agenc | former | owner | polici | privat | remot | area | broker | mutual | Nation | stock | letter | North | origin | chang | transact | two | benefit | center | general | program | counti | investor | agent | bond | telephon | captur | incom | Line | work | sale | Trust | wire | bill | Compani | municip | Georgia | Massachusett | protect | Servic | Citi | electron | process | Capit | tax | ATM | Commerci | depositori | engag | homeown | Illinoi | Maryland | treasuri |
textbag7 | treatment | clinic | Phase | trial | diseas | patient | cancer | headquart | candid | commerci | treat | Pharmaceut | drug | base | stage | therapi | cell | system | Unite | focus | medic | agreement | State | Therapeut | II | therapeut | biopharmaceut | collabor | market | target | inhibitor | oral | research | manufactur | name | segment | test | technolog | disord | pharmaceut | devic | former | III | Corpor | program | small | chang | addit | licens | tumor | lead | non | California | manag | acut | antibodi | molecul | care | compris | oncolog | studi | human | diagnost | oper | design | hospit | complet | chronic | gene | engag | pain | infect | receptor | novel | New | anti | preclin | health | Ltd | solut | platform | proprietari | inject | activ | relat | combin | protein | breast | deliveri | discoveri | prevent | tissu | Massachusett | solid | primarili | imag | intern | specialti | Pharma | monitor | sale | syndrom | formul | adult | direct | immun | surgic | genet | Medic | medicin | sever | associ | blood | distributor | multipl | indic | York | center | procedur | advanc | condit | discov | inflammatori | vaccin | agent | immunotherapi | select | surgeri | type | virus | worldwid | lung | physician | laboratori | leukemia | San | strateg | dental | diabet | healthcar | Limit | refractori | two | biotechnolog | factor | Hold | organ | acid | Compani | enabl | kinas | rare | custom | metastat | Research | implant | incorpor | liver | pre | subsidiari | biolog | Europ | invas | applic | detect | equip | molecular | Servic | System | bone | cardiac | compound | consist | DNA | malign | monoclon | need | skin | Bioscienc | Co | facil | resist | area | enhanc | generat | generic | growth | hematolog | support | control | Health | home | lymphoma | pharmaci | pipelin | prostat | risk | Univers | analysi | brand |
textbag8 | manag | solut | applic | market | oper | technolog | onlin | base | headquart | mobil | brand | patent | platform | custom | system | enabl | Unite | data | manufactur | process | State | advertis | Corpor | addit | sale | payment | softwar | consum | design | industri | inform | segment | power | retail | com | digit | real | compris | devic | equip | network | integr | relat | China | estat | financi | client | user | chang | licens | name | subsidiari | commerci | direct | electr | former | health | intern | store | support | cloud | govern | home | account | ad | analyt | control | game | perform | California | channel | commerc | corpor | deliveri | energi | generat | healthcar | media | New | portfolio | America | content | CoStar | end | enterpris | Further | Internet | marketplac | properti | risk | Technolog | tool | valu | asset | electron | optim | primarili | program | secur | sourc | Web | Websit | worldwid | bank | distribut | facilit | gas | insur | medic | merchandis | own | photo | search | togeth | transfer | York | call | card | Europ | Peopl | person | profession | promot | report | Republ | transact | wast | access | autom | care | communic | distributor | Energi | financ | Group | Internat | label | Manag | new | North | plan | Servic | storag | suppli | vehicl | wireless | buy | connect | high | instal | institut | interact | invest | kiosk | laser | order | organ | project | rang | transport | allow | Analyt | approxim | audit | batteri | clinic | collect | complianc | Connect | consist | display | film | incorpor | Liberti | light | machin | monitor | offic | record | research | save | South | strategi | tax | ticket | video | app | Asia | bill | buyer | compon | consult | educ | engag | enhanc | good | heat | Hold | individu | |
textbag9 | custom | oper | network | communic | voic | segment | data | headquart | Internet | mobil | equip | system | access | base | solut | manag | market | subsidiari | Unite | wireless | industri | broadband | residenti | State | water | gas | approxim | commerci | fiber | natur | optic | own | video | connect | devic | distribut | manufactur | Commun | design | telecommun | addit | enterpris | line | retail | applic | small | integr | name | sale | solar | brand | energi | Ltd | messag | call | cloud | digit | facil | power | togeth | carrier | Corpor | Decemb | Ethernet | project | secur | subscrib | support | chang | compris | distanc | fix | former | independ | magicJack | transport | center | Compani | direct | Energi | intern | local | long | million | point | softwar | TV | user | Virginia | VoIP | wast | wastewat | accessori | corpor | Hold | medium | Mobil | premis | rang | relat | technolog | telephon | two | wholesal | audio | California | conferenc | content | Counti | electr | end | infrastructur | locat | mainten | parti | Servic | smartphon | termin | third | util | aggreg | cabl | Canada | capac | construct | consum | distributor | Further | govern | high | home | inform | Internat | land | LNG | media | Partner | phone | plant | primarili | privat | protect | store | suppli | US | vehicl | WiFi | advertis | allow | engin | generat | hazard | host | Israel | municip | New | onlin | origin | partner | protocol | Retail | San | state | storag | tablet | televis | worldwid | ad | backup | cellular | channel | collect | contract | copper | dealer | enabl | enclosur | engag | forc | fuel | Jersey | leas | Network | North | passiv | Pennsylvania | platform | satellit | smart | Solut | Spark | station | suit | System | Texa | transfer | treatment | valu | virtual | Water |
textbag10 | solut | manag | system | applic | softwar | data | manufactur | segment | headquart | custom | network | oper | base | market | design | technolog | mobil | platform | integr | equip | industri | process | enabl | support | devic | secur | cloud | sale | addit | enterpris | consum | Unite | communic | digit | direct | inform | Corpor | electron | comput | onlin | user | video | worldwid | State | name | control | connect | content | California | compris | power | game | tool | access | organ | profession | storag | distributor | Internet | Ltd | retail | chang | media | semiconductor | engin | former | wireless | financi | Servic | IT | Web | govern | monitor | analyt | infrastructur | relat | healthcar | train | brand | perform | Solut | center | subsidiari | autom | advertis | Further | origin | intern | optim | plan | Technolog | test | distribut | high | Europ | primarili | mainten | allow | consult | end | channel | medic | project | Websit | America | China | health | automot | packag | resel | Asia | educ | display | IP | deliv | line | analysi | implement | Manag | social | optic | chip | energi | Hold | account | hardwar | payment | suit | ad | circuit | client | rang | System | commerci | engag | memori | modul | server | small | time | voic | creat | Group | telecommun | protect | consist | instal | technic | compon | forc | intellig | program | suppli | valu | light | construct | Network | signal | activ | QAD | togeth | across | employe | enhanc | experi | New | point | real | travel | build | com | deliveri | function | licens | person | report | specif | embed | Enterpris | insur | North | partner | publish | store | care | clinic | host | offic | solar | transport | collabor | environ | interact | parti | track | two | complianc | desktop | independ | |
textbag11 | transport | oper | segment | aircraft | fleet | freight | truckload | headquart | carrier | custom | dri | Unite | air | State | own | trailer | cargo | logist | manag | base | Transport | Decemb | leas | deliveri | subsidiari | truck | vessel | oil | contain | rail | shipment | temperatur | airlin | approxim | equip | industri | intermod | manufactur | ocean | tractor | charter | compris | ground | mainten | retail | brokerag | bulk | chemic | control | region | solut | time | Canada | Corpor | expedit | Express | haul | Mexico | ship | special | van | airport | America | consum | former | good | less | name | North | parti | third | Air | chang | commerci | commod | fuel | Greec | Group | intern | Logist | passeng | petroleum | tanker | worldwid | addit | consolid | contract | CryoPort | de | distribut | flight | Hold | independ | load | long | part | two | Worldwid | Airlin | automot | Boe | consist | crude | destin | drybulk | Florida | forward | general | jet | liquid | Ltd | materi | medium | non | produc | rang | shipper | togeth | American | arrang | center | coal | contractor | dedic | definit | drayag | facil | fee | food | handl | insur | iron | network | New | order | primarili | Servic | Shipper | storag | suppli | support | track | travel | Truck | Washington | account | ad | Airbus | ArcBest | Arkansa | asset | citi | com | Corp | diesel | driver | dwt | electron | line | local | locat | March | Maroussi | metal | militari | ore | packag | protect | railroad | refin | relat | schedul | Ship | Student | technolog | Top | Truckload | vapor | wareh | Werner | York | YRC | Aeroportuario | agent | agreement | Asia | border | capac | Capit | Caribbean | carri | cement | Centro | combin | commerc | connect | contractu | crew | cross | Cryoport |
textbag12 | million | invest | loan | firm | debt | real | commerci | estat | New | equip | secur | small | term | process | York | deposit | design | headquart | manag | market | oper | account | year | equiti | Busi | data | electron | financ | platform | relat | seek | Technolog | addit | bank | capit | mobil | Newtek | Servic | technolog | base | check | Corp | credit | financi | former | individu | payment | solut | Websit | acquisit | brand | broadcast | card | industri | lien | offic | softwar | Bank | California | certif | Citi | client | CLO | construct | consum | debit | focus | fund | Further | handset | insur | investig | Louisiana | Ltd | Massachusett | matur | non | owner | program | rang | residenti | Secur | special | State | subordin | Unite | unsecur | work | administr | AquaBounti | area | cash | cell | China | compris | Credit | DHX | eCommerc | enabl | Feder | Financ | first | forens | Fund | high | host | inform | intern | Internet | line | make | name | net | origin | payrol | prefer | primarili | profession | purpos | second | segment | senior | size | stem | tax | Union | Web | accept | ancillari | approxim | Capit | chang | consult | content | Corpor | distribut | document | enterpris | famili | Famili | file | Garden | Group | hardwar | incom | institut | licens | Limit | litig | Manag | manufactur | Media | middl | money | month | Nation | onlin | own | part | phone | portfolio | privat | properti | public | retail | save | select | stock | support | televis | TICC | tool | transact | within | access | Account | ACH | acquir | Advantag | agent | allianc | American | Angele | anim | approv | Asian | asset | Associat | averag | backup | balanc | benefit | biotechnolog | branch | Capco | captur | cart | Chines | cloud | Coast |
Create a column that holds the target sector for each row
row.names(words_decision_build)<-NULL
words_decision_build<-cbind(words_decision_build,decision_trees_file_list)
# Strip the file-name suffix from the target column (column 201) so only the sector name remains
words_decision_build[,201]<-gsub(pattern = 'withoutcleaning.csv',replacement = "",x = words_decision_build[,201])
# Renaming the target variable
colnames(words_decision_build)[colnames(words_decision_build)=="decision_trees_file_list"] <- "Target_Sector"
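The hard-coded index 201 above only works while the data frame has exactly 200 word columns. As a minimal sketch, the same cleanup can be done by column name. The toy data frame and its file names below are hypothetical stand-ins for `words_decision_build`, not the real data.

```r
# Hypothetical stand-in for words_decision_build: two word columns plus file names
toy_build <- data.frame(
  word1 = c("loan", "oil"),
  word2 = c("bank", "gas"),
  decision_trees_file_list = c("Financewithoutcleaning.csv",
                               "Energywithoutcleaning.csv"),
  stringsAsFactors = FALSE
)

# Strip the suffix literally; fixed = TRUE stops "." acting as a regex wildcard
toy_build$Target_Sector <- sub("withoutcleaning.csv", "",
                               toy_build$decision_trees_file_list,
                               fixed = TRUE)

# Drop the raw file-name column now that the label has been derived
toy_build$decision_trees_file_list <- NULL
print(toy_build$Target_Sector)
```

Addressing the column by name keeps the step correct if the number of word columns changes later.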
We will create a data frame with 13 target sectors as the response variable and 200 words as the predictors
A few things to keep in mind:
1. The dataset will contain more instances of Finance and Healthcare
2. Each description is treated in isolation
Therefore, from 1 and 2, even though Finance and Healthcare have more samples, the frequency of words within a sample should be similar across all sectors
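library("tree")
# Fit a classification tree on the first 99 word columns plus the target (column 201)
tree_model=tree(Target_Sector~.,words_decision_build[,c(1:99,201)])
plot(tree_model)
# Note: called on its own, tree.control() only returns a list of settings;
# it affects a fit only when passed to tree() through the control argument
tree.control(nobs=100,mincut = 10)
set.seed(3)
# 10-fold cross-validation over the cost-complexity pruning sequence
cv_tree=cv.tree(tree_model,FUN=prune.tree,K = 10)
# Deviance against tree size, used to choose the size to prune to
plot(cv_tree$size,cv_tree$dev,type="b")
plot(tree_model)
# Prune to 7 terminal nodes and label the splits
pruned_model<-prune.misclass(tree_model,best = 7)
plot(pruned_model)
text(pruned_model,pretty=0)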
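The pruning size of 7 is read off the cross-validation plot by eye. A minimal sketch of making that choice in code follows. The helper `pick_best_size` and the numbers in `toy_cv` are hypothetical; `toy_cv` only mimics the `size` and `dev` components that `cv.tree()` returns.

```r
# Pick the tree size with the lowest cross-validated deviance;
# ties go to the smallest (simplest) tree
pick_best_size <- function(sizes, devs) {
  min(sizes[devs == min(devs)])
}

# Hypothetical values shaped like cv_tree$size and cv_tree$dev
toy_cv <- list(size = c(12, 9, 7, 4, 1),
               dev  = c(310, 295, 288, 288, 402))

best_size <- pick_best_size(toy_cv$size, toy_cv$dev)
print(best_size)
```

With the objects above, the call would be `pick_best_size(cv_tree$size, cv_tree$dev)`, and the result could replace the fixed `best = 7`. Note that the cross-validation here scores trees by deviance (`FUN=prune.tree`) while the final pruning uses misclassification (`prune.misclass`); passing `FUN=prune.misclass` to `cv.tree()` would make the two criteria agree.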
Findings. We will compare the predicted sector with the original sector given in the dataset
compare_sector_original_test<-cbind.data.frame(company_test,match_sector)
# gsub() is vectorised, so the whole column can be cleaned without a row loop
sector_labels<-c("Basic_Industries"="Basic Industries","HealthCare"="Health Care","Capital_Goods"="Capital Goods","Consumer_Durables"="Consumer Durables","Consumer_Non_Durables"="Consumer Non-Durables","Consumer_Services"="Consumer Services","Public_Utilities"="Public Utilities")
for(old in names(sector_labels)){
compare_sector_original_test[[5]]<-gsub(pattern = old,replacement = sector_labels[[old]],x=compare_sector_original_test[[5]])
}
# Count the rows where the original sector (column 3) equals the predicted sector (column 5);
# the counter is named correct so that it does not mask base::sum()
correct<-sum(compare_sector_original_test[[3]]==compare_sector_original_test[[5]])
print(correct)
## [1] 466
accuracy<-correct/nrow(compare_sector_original_test)
print(accuracy)
## [1] 0.5928753
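A single accuracy figure hides which sectors are confused with each other. As a sketch, a confusion table breaks the comparison down by sector. The two vectors below are hypothetical stand-ins for columns 3 and 5 of `compare_sector_original_test`, not the real results.

```r
# Hypothetical original and predicted sectors, standing in for columns 3 and 5
original  <- c("Finance", "Finance", "Health Care",
               "Technology", "Health Care", "Finance")
predicted <- c("Finance", "Health Care", "Health Care",
               "Technology", "Finance", "Finance")

# Use one shared set of levels so the table is square
sectors <- sort(unique(c(original, predicted)))

# Rows are the original sector, columns the predicted sector
confusion <- table(Original  = factor(original,  levels = sectors),
                   Predicted = factor(predicted, levels = sectors))
print(confusion)

# Share of each original sector that was predicted correctly
per_sector_accuracy <- diag(confusion) / rowSums(confusion)
print(round(per_sector_accuracy, 2))

# Overall accuracy, the same quantity as the accuracy computed above
overall_accuracy <- sum(diag(confusion)) / sum(confusion)
print(overall_accuracy)
```

Given that Finance and Healthcare dominate the dataset, the per-sector rates would show whether the overall accuracy is driven mainly by those two sectors.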
The code can be improved. One chunk is poorly written because I could not load the files into the global environment and therefore could not retrieve the variables.
Time constraint. Given more time, I could build a decision tree that predicts which words lead to which sector. The model is currently at a nascent stage, and I would need two more weeks to build a full decision tree model.
Certain companies did not have a description on Yahoo Finance, so they were removed from the dataset. Companies with n/a in the Sector column but with an available description were kept in the data.
Cross-validation can be performed on the number of words read from each file, so that the optimal number of words is retrieved and computing time is reduced.
Word associations have not been tested; they are a useful way to examine correlations between words.
Other techniques, such as SVM and KNN, can be tried, and the best-performing technique can then be used.