This notebook documents my solution to the data science challenge. Comments and documentation have been added wherever necessary.

Instructions for the zip folder

  1. It contains the generated csv and txt files.
  2. The file names are self-explanatory. Files which have been cleaned have withcleaning appended to the file name, e.g. Basic_Industrieswithcleaning.txt
  3. To reproduce what I have done, the project hierarchy and folders would either have to be recreated or the paths changed in the code. For example, I am retrieving the data from my local disk, so the code has paths like D:/UIC/…..
  4. After the cleaning operation the files were written to a folder called CleanedText, and the R code sometimes changes the working directory to that folder.
  5. Inside there is also a clip of my web scraping process.

Importing the dataset

The dataset which Persado sent can be seen here. For convenience the first six rows are shown. As you can see, three variables have been provided: the company’s name, its ticker on the stock market, and the sector that it belongs to.

library(readr)
company_dataset_original<- read_csv("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/company_dataset.csv")

head(company_dataset_original)
## # A tibble: 6 x 3
##                              CompanyName CompanyTicker            Sector
##                                    <chr>         <chr>             <chr>
## 1 1347 Property Insurance Holdings, Inc.           PIH           Finance
## 2               180 Degree Capital Corp.          TURN           Finance
## 3                1-800 FLOWERS.COM, Inc.          FLWS Consumer Services
## 4          1st Constitution Bancorp (NJ)          FCCY           Finance
## 5                 1st Source Corporation          SRCE           Finance
## 6                   21Vianet Group, Inc.          VNET        Technology

Working dataset

This dataset has been loaded after altering the original dataset. Altering the dataset here means:

  1. Adding an extra column for the description
  2. Placing each scraped description in the row of the company it belongs to, so that it sits inside the data frame
  3. Deleting those rows which don’t have any description

There are 662 observations for which we don’t have a description.

We will show below how scraping is done

All the tickers will be extracted from the second column of the given dataset

# list of different handles which you want to extract
library(readr)
company_dataset<- read_csv("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/company_dataset_c.csv",col_types = cols(X1 = col_skip()))

company_dataset<-company_dataset[which(!is.na(company_dataset$description)),]

head(company_dataset$description)
## [1] "1347 Property Insurance Holdings, Inc., through its subsidiaries, provides property and casualty insurance products to individuals in Louisiana and Texas. The company offers homeowners<U+0092> insurance, manufactured home insurance, dwelling fire insurance, and wind/hail insurance products, as well as reinsurance products. It offers its insurance policies through a network of independent agents. The company was formerly known as Maison Insurance Holdings, Inc. and changed its name to 1347 Property Insurance Holdings, Inc. in November 2013. 1347 Property Insurance Holdings, Inc. was founded in 2012 and is based in Tampa, Florida."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [2] "180 Degree Capital Corp. is a business development company specializing in early stage investments. The firm seeks to invest in tiny technology including microsystems and transformative nanotechnology companies and applications in the cleantech, biotechnology, energy, healthcare, and electronic sectors. It prefers to invest in biology innovation, where intersecting with innovations in areas such as electronics, physics, materials science, chemistry, information technology, engineering and mathematics. It seeks companies which employ or intend to employ technology that is at the microscale or smaller and if the employment of that technology is material to its business plan. The firm may make follow-on investments and seeks to co-invest. It prefers to hold membership on boards of directors or serve as observers to the boards of directors on its portfolio companies. 180 Degree Capital Corp. was founded in August 1981 and is based in Montclair, New Jersey with addition office in Los Angeles, California."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [3] "1-800-Flowers.com, Inc., together with its subsidiaries, provides gourmet food and floral gifts for various occasions in the United States. It operates in three segments: Consumer Floral; Gourmet Food and Gift Baskets; and BloomNet Wire Service. The company offers a range of products, including fresh-cut flowers, floral and fruit arrangements and plants, gifts, popcorn, gourmet foods and gift baskets, cookies, chocolates, candies, wine, and gift-quality fruits, as well as balloons, candles, keepsake gifts, and plush stuffed animals. It also provides floral wire service through mybloomnet.net; gourmet gifts, such as fruits and other gourmet items through 1-877-322-1200 or harryanddavid.com; popcorn and specialty treats through 1-800-541-2676 or thepopcornfactory.com; cookies and baked gifts from 1-800-443-8124 or cheryls.com; gift baskets and towers from 1800baskets.com; English muffins and other breakfast treats from 1-800-999-1910 or wolfermans.com; carved fresh fruit arrangements from fruitbouquets.com; and steaks and chops from stockyards.com. The company offers its products to consumers and wholesalers. 1-800-FLOWERS.COM, Inc. was founded in 1976 and is headquartered in Carle Place, New York."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [4] "1st Constitution Bancorp operates as the bank holding company for 1st Constitution Bank that provides commercial and retail banking services in the central and northeastern New Jersey areas. The company offers deposit products, including interest bearing demand deposits, such as interest-bearing checking accounts and money market accounts; and non-interest bearing demand, savings, and time deposits, as well as certificates of deposit. It also provides commercial loans, including term loans, lines of credit, and loans secured by equipment and receivables; secured and unsecured short-to-medium term commercial loans to businesses for working capital, business expansion, and the purchase of equipment and machinery; and construction loans to real estate developers for the acquisition, development, and construction of residential and commercial properties. In addition, the company offers residential first mortgage loans secured by owner-occupied property; construction loans; reverse mortgages; second mortgage home improvement loans; home equity lines of credit; and non-residential consumer loans for automobiles, recreation vehicles, and boats, as well as secured and unsecured personal loans, and deposit account secured loans. It serves corporations, individuals, partnerships, and other community organizations, as well as small businesses and not-for-profit organizations. As of February 2, 2017, the company operated through 19 banking offices in Cranbury, Fort Lee, Hamilton, Hightstown, Hillsborough, Hopewell, Jamesburg, Lawrenceville, Perth Amboy, Plainsboro, Rocky Hill, West Windsor, Princeton, Rumson, Fair Haven, Shrewsbury, Little Silver, and Asbury Park, New Jersey. 1st Constitution Bancorp was founded in 1989 and is based in Cranbury, New Jersey."                                                                                                                                                                                                                 
## [5] "1st Source Corporation operates as the bank holding company for 1st Source Bank that provides commercial and consumer banking services, trust and investment management services, and insurance to individual and business clients. Its consumer banking services include checking and savings accounts; certificates of deposit; individual retirement accounts; on-line and mobile banking products; automated teller machine services; consumer loans, real estate loans, and lines of credit; and financial planning, financial literacy, and other consultative services. The company also offers commercial, small business, agricultural, and real estate loans for general corporate purposes, including financing for industrial and commercial properties, equipment, inventories, accounts receivables, and acquisition financing; and commercial leasing, treasury management, and retirement planning services. In addition, it provides a range of trust, investment, agency, and custodial services comprising administration of estates and personal trusts; and manages investment accounts for individuals, employee benefit plans, and charitable foundations. Further, the company offers equipment loan and lease finance products for auto and light trucks, medium and heavy duty trucks, new and used general aviation aircraft, and construction equipment, as well as leases construction equipment, medium and heavy duty trucks, automobiles, and other equipment. Additionally, it provides corporate and personal property, casualty, and individual and group health and life insurance products and services; and investment advisory services to trust and investment clients. As of February 14, 2017, the company operated through 81 banking centers and 23 specialty finance group locations in the United States. 1st Source Corporation was founded in 1863 and is headquartered in South Bend, Indiana."                                                                                                                 
## [6] "21Vianet Group, Inc. provides carrier-neutral Internet data center services to Internet companies, government entities, blue-chip enterprises, and small-to mid-sized enterprises in the People<U+0092>s Republic of China. It offers hosting and related services to house servers and networking equipment in its data centers, and connects them through a data transmission network; and other hosting related value-added services. Its hosting and related services include managed hosting services that offer data center space to house its customers<U+0092> servers and networking equipment, and provide tailored server administration services; and interconnectivity services that enable customers to connect their servers with Internet backbones and other networks through its border gateway protocol network or single-line, dual-line, or multiple-line network. Its hosting and related services also comprise content delivery network services that optimize the speed and security of data transmission; cloud services that enable businesses to run their applications over the Internet using its IT infrastructure; virtual private network services; and other value-added services, such as firewall, server load balancing, data backup and recovery, data center management, server management, and backup server services. In addition, the company provides traffic charts and analysis, gateway monitoring for servers, domain name system setup, defense mechanism against distributed denial of service attacks, basic setting of switches and routers, and virus protections; and managed network services consisting of hosting area network, route optimization, and last-mile wired broadband services. As of December 31, 2016, it operated 18 self-built and 63 partnered data centers located in approximately 30 cities with 26,830 cabinets. It has a strategic partnership with Microsoft Corporation. The company was founded in 1999 and is headquartered in Beijing, the People<U+0092>s Republic of China."

Exploratory data analysis

As we can see, the original dataset has 3276 observations and 3 variables, of which the second and third are the important ones.

The second variable is the ticker and the third is the sector, which is the category to be predicted.

library("dplyr")
# Create an empty variable called description, which will be filled in once web scraping is up and running
company_dataset<-mutate(.data = company_dataset,description="")

Web Scraping

To do the web scraping we will use the RSelenium package. rvest is another option.

RSelenium gives more power to the user, since APIs restrict what a user can download. Furthermore, automating the process through a browser helps. The inline comments in the code give an idea of how RSelenium works.

A video is attached to the document capturing the automation and web scraping

library("RSelenium")
library("dplyr")

# important because it downloads all the things(geckodriver,chrome driver,phantom js)
remDr<-rsDriver(port = 4444L,browser = "chrome",version = "latest",chromever = "latest",geckover = "latest",check = TRUE)

Client<-remDr$client


for(i in seq_len(nrow(company_dataset))){
# The ticker is used to build the URL of the profile page
ticker<-company_dataset$CompanyTicker[i]
Client$open()
Client$navigate(paste0("https://finance.yahoo.com/quote/",ticker,"/profile?p=",ticker,"/"))

Css<-Client$findElements(using = "css selector",value = ".Mt\\(30px\\) .Lh\\(1\\.6\\)")

if (length(Css)>0){
# Collapse the matched elements into one string so that it fits a single cell
company_dataset$description[i]<-paste(unlist(lapply(Css, function(x){x$getElementText()})),collapse = " ")
}else{
# No description was found on the page
company_dataset$description[i]<-NA
}
Client$close()

}

Post-Scraping

There were some tickers whose descriptions were not available, and descriptions for tickers which had a W attached were not available either. All such cases have been put under unclassified.

The following code is optional and may be used if the browser loses its internet connection or the scraping is interrupted by some other mishap.

sum(is.na(company_dataset$description[771:1671]))
# Add the indices of the data if you are searching from within a certain range of the data

# NotFound<-company_dataset[which(is.na(company_dataset$description[771:1671]))+770,]

NotFound<-company_dataset[which(is.na(company_dataset$description[771:1671]))+770,]


# The Selenium code was modified by 

Transferto<-NotFound[which(!is.na(NotFound$description)),]
# InsertatRows<-which(company_dataset$CompanyTicker %in% company_dataset$CompanyTicker)
# company_dataset[InsertatRows,]$description<-Transferto$description
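
Once the missing rows have been scraped again, the recovered descriptions have to go back into the main data frame. The following is a minimal sketch of that step on toy data, matching rows on CompanyTicker. The helper name `fill_descriptions` and the toy rows are hypothetical and are not part of the original code.

```r
# Hypothetical helper: copy descriptions recovered in a second scraping pass
# back into the main data frame, matching rows on CompanyTicker.
fill_descriptions <- function(main, recovered) {
  idx <- match(recovered$CompanyTicker, main$CompanyTicker)
  found <- !is.na(idx)
  main$description[idx[found]] <- recovered$description[found]
  main
}

# Toy data standing in for company_dataset and Transferto
main <- data.frame(CompanyTicker = c("PIH", "TURN", "FLWS"),
                   description = c("Insurance products.", NA, NA),
                   stringsAsFactors = FALSE)
recovered <- data.frame(CompanyTicker = "FLWS",
                        description = "Gourmet food and floral gifts.",
                        stringsAsFactors = FALSE)

main <- fill_descriptions(main, recovered)
sum(is.na(main$description))  # rows that still have no description
```

Matching on the ticker rather than on row positions keeps the step valid even when only a slice of the data (such as rows 771 to 1671) was scraped again.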

Analysis of the variable sector

Let’s see the number of unique sectors in the given dataset

unique(company_dataset$Sector)
##  [1] "Finance"               "Consumer Services"    
##  [3] "Technology"            "Public Utilities"     
##  [5] "Capital Goods"         "Basic Industries"     
##  [7] "Health Care"           "Energy"               
##  [9] "Miscellaneous"         "Transportation"       
## [11] "Consumer Non-Durables" "Consumer Durables"    
## [13] "n/a"                   NA
length(unique(company_dataset$Sector))
## [1] 14

We can see that there are 14 distinct values, two of which (“n/a” and NA) mark a missing sector. Below is the frequency distribution of the sectors.

# Converting from character to factor 

company_dataset$Sector<-as.factor(company_dataset$Sector)
library("ggplot2")
g1<-ggplot(data = company_dataset,aes(x = Sector))+
geom_bar(aes(fill = Sector))+
theme(axis.text.x = element_blank())
plot(g1)

We can see that the top categories are Health Care and Finance. Now we will calculate the frequency of each sector.

library(knitr)
SectorFrequencies<-as.data.frame(sort(table(company_dataset$Sector),decreasing = T))
colnames(SectorFrequencies)<-c("Sector","Frequency")
kable(SectorFrequencies)
Sector Frequency
Health Care 619
Finance 525
Technology 429
Consumer Services 332
Capital Goods 167
Consumer Non-Durables 108
Miscellaneous 86
Basic Industries 76
Consumer Durables 73
Energy 62
Public Utilities 60
Transportation 48
n/a 24

Model

We can use the words in each description, building a word frequency distribution per sector, to predict the sector.

There are 13 sectors, of which one is n/a. Therefore we have 12 categories to be predicted.

We will use 70% of the data for training and the rest for testing.

We have created one subset of the training data per sector and written each to a file, so that the code can be reused later if we want to look at a single sector. The files created are in the project folder.

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(45)
ind=sample(2,nrow(company_dataset),replace=T,prob=c(0.7,0.3))
company_dataset_1=company_dataset[ind==1,]
company_test=company_dataset[ind==2,]


Finance_dataset<-company_dataset_1[company_dataset_1$Sector=="Finance",]

HealthCare<-company_dataset_1[company_dataset_1$Sector=="Health Care",]

Technology_dataset<-company_dataset_1[company_dataset_1$Sector=="Technology",]

Consumer_Services<-company_dataset_1[company_dataset_1$Sector=="Consumer Services",]

Capital_Goods<-company_dataset_1[company_dataset_1$Sector=="Capital Goods",]

Consumer_Non_Durables<-company_dataset_1[company_dataset_1$Sector=="Consumer Non-Durables",]

Consumer_Durables<-company_dataset_1[company_dataset_1$Sector=="Consumer Durables",]

Miscelleaneous<-company_dataset_1[company_dataset_1$Sector=="Miscellaneous",]

Basic_Industries<-company_dataset_1[company_dataset_1$Sector=="Basic Industries",]

Energy<-company_dataset_1[company_dataset_1$Sector=="Energy",]

Public_Utilities<-
company_dataset_1[company_dataset_1$Sector=="Public Utilities",]

Transportation<-company_dataset_1[company_dataset_1$Sector=="Transportation",]

Unclassified<-company_dataset_1[company_dataset_1$Sector=="n/a",]
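
The per-sector subsets can also be produced in one call. The sketch below uses base R `split()` on toy data; the ticker "XXXX" is a made-up placeholder. It also shows one thing to watch for: comparing a column that contains NA with `==` yields NA in the row index, which adds an all-NA row to the subset, whereas `split()` drops rows whose sector is NA.

```r
# Toy data; "XXXX" is a hypothetical ticker with a missing sector
toy <- data.frame(
  CompanyTicker = c("PIH", "FLWS", "VNET", "XXXX"),
  Sector = c("Finance", "Consumer Services", "Technology", NA),
  stringsAsFactors = FALSE
)

# Logical indexing with == keeps an all-NA row for the missing sector
with_eq <- toy[toy$Sector == "Finance", ]
nrow(with_eq)

# split() returns one data frame per sector and drops rows with an NA sector
by_sector <- split(toy, toy$Sector)
names(by_sector)
nrow(by_sector[["Finance"]])
```

Wrapping the comparison in `which()`, as is done elsewhere in this notebook, is another way to avoid the extra NA rows.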

We will construct two models: one based on term frequencies and the other based on tf-idf.

The text mining approach groups the text from the same sector together, and then creates a bag of the words which appear frequently in that sector.

The approach is to count the frequency of the words that matter, removing common English words and words which carry no information ( from our perspective and the client’s perspective ).

  1. Step 1 : Collapse all the descriptions into a paragraph
  2. Step 2 : Replace punctuation such as ,_:/ etc.
  3. Step 3 : Remove all the digits
  4. Step 4 : Remove single-letter words
  5. Step 5 : Remove stopwords such as a,and,of,the
  6. Step 6 : Delete additional words in a retrospective manner
  7. Step 7 : Make a bag of words and after that create a frequency table
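
To make the steps concrete, here is a small sketch that applies steps 2 to 5 and step 7 to a single sentence using base R only. The function name `clean_description`, the short stopword list and the sample sentence are assumptions made for illustration; the notebook itself uses `tm::stopwords()` and also stems the text, which is left out here.

```r
# A tiny stopword list for illustration only; the notebook uses tm::stopwords()
toy_stopwords <- c("a", "and", "of", "the", "to", "in", "its", "is")

# Hypothetical helper covering steps 2 to 5 for one piece of text
clean_description <- function(text, stopwords) {
  text <- gsub("\\W", " ", text)             # Step 2: punctuation becomes a space
  text <- gsub("\\d", "", text)              # Step 3: digits are dropped
  text <- gsub("\\b[A-Za-z]\\b", "", text)   # Step 4: single-letter words are dropped
  words <- unlist(strsplit(text, "\\s+"))    # split the text into words
  words <- words[words != ""]
  # Step 5: drop stopwords (compared in lower case, unlike tm::removeWords)
  words[!(tolower(words) %in% stopwords)]
}

sample_text <- "The company offers loans, deposits and 19 banking offices in New Jersey."
bag <- clean_description(sample_text, toy_stopwords)
bag

# Step 7: frequency table of the remaining words
sort(table(bag), decreasing = TRUE)
```

Because `tm::removeWords` is case sensitive, capitalised words such as "The" survive the stopword step in the notebook, which is why they appear in the retrospective removal list of step 6.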
##### Text Mining

library("stringr")
Sector_Names<-data.frame(Sector=c("Finance_dataset","HealthCare","Basic_Industries","Capital_Goods","Consumer_Services","Consumer_Non_Durables","Consumer_Durables","Energy","Miscelleaneous","Public_Utilities","Transportation","Unclassified","Technology_dataset"))

Sector_Names$Sector=as.character(Sector_Names$Sector)


library(tm)


for(i in 1:(nrow(Sector_Names))){

text<-paste(get(Sector_Names[i,])$description,collapse = "")



write(text,file=paste0(x=Sector_Names[i,],".txt"))

#punctuation replacement
text2<-gsub(pattern = "\\W",replacement = " ",text)

#While scraping, a box-type character with code <U+0092> shows up in place of apostrophes; the replacements in this loop remove it from our text

# digits replacement

text2<-gsub(pattern="\\d",replacement = "",text2)

#We will delete words which are a single letter long

text2<-gsub(pattern = "\\b[A-Za-z]\\b",replacement = "",text2)

# words removed with no information
text2<-removeWords(text2,stopwords())

                   

text2<-stripWhitespace(text2)

# Stemming the document makes the matching of words easier and faster for the system to process

text2<-stemDocument(text2)

text2<-removeWords(text2,c("servic","compani","The","product","provid","Inc","It","includ","well","also","found","In","As","Its","develop","known","sell","offer","use","various","serv","busi","This"))

# Write the cleaned text into the CleanedText folder without changing the working directory
write(text2,file=file.path("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/CleanedText",paste0(Sector_Names[i,],"withcleaning",".txt")))



textbagWithoutCleaning<-str_split(text2,pattern="\\s+")
textbag<-unlist(textbagWithoutCleaning)

write.csv(head(sort(table(textbag),decreasing = T),n=200),file =paste0(x=Sector_Names[i,],"withoutcleaning",".csv"))

}

Now we will look at the dataset and identify which words occur in most of the sector documents. After deleting these words we will create a tf-idf matrix, using the DocumentTermMatrix function from the tm package.

Creating a corpus which can help us

  1. tf-idf
  2. frequent words
  3. comparison table
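
For reference, tf-idf down-weights terms that occur in many documents. With tf(t, d) the share of term t in document d, N the number of documents and df(t) the number of documents containing t, one common form is tf(t, d) * log(N / df(t)). The sketch below computes it in base R on a toy document-term matrix. The counts are invented, and the exact variant (log base, normalisation) is an assumption; the weighting functions in tm may differ in these details.

```r
# Toy document-term matrix: rows are sector documents, columns are stemmed terms.
# The counts are invented for illustration.
dtm <- matrix(c(5, 0, 3,
                0, 4, 3,
                1, 0, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Finance", "HealthCare", "Technology"),
                              c("loan", "patient", "oper")))

tf    <- dtm / rowSums(dtm)        # term frequency within each document
df    <- colSums(dtm > 0)          # number of documents containing each term
idf   <- log(nrow(dtm) / df)       # inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)    # multiply each column by its idf

round(tfidf, 3)
```

A term such as "oper", which appears in every sector document, gets a weight of zero, while a term confined to one sector, such as "patient", keeps a high weight.
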
library(tm)

filePath<-("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/CleanedText/")

Corpus_Yahoo<-Corpus(DirSource(filePath), readerControl = list(language="english"))
dtm_yahoo<-DocumentTermMatrix(Corpus_Yahoo)

# Calculate and sort by word frequencies

dtm_yahoo<-tm::removeSparseTerms(x = dtm_yahoo,sparse = 0.5)
word.freq<-sort(colSums(as.matrix(dtm_yahoo)),decreasing = T)
table_word_freq<-data.frame(word=names(word.freq),
abs.freq=word.freq,
relative.frequency=word.freq/sum(word.freq))
rownames(table_word_freq)<-NULL


kable(x = table_word_freq[1:200,1:2],caption = "Top 200 Frequent words")
Top 200 Frequent words
word abs.freq
oper 1891
loan 1694
manag 1532
commerci 1416
headquart 1401
segment 1384
system 1334
market 1267
solut 1210
bank 1168
base 1054
state 1043
unite 991
manufactur 964
corpor 899
treatment 858
deposit 828
industri 826
account 821
technolog 818
custom 807
clinic 788
consum 749
addit 733
network 683
applic 682
equip 656
retail 635
new 630
real 627
design 623
compris 616
hold 598
secur 593
name 588
financi 586
data 566
estat 563
mobil 547
brand 523
invest 514
home 513
insur 513
devic 495
sale 492
softwar 492
line 482
process 480
subsidiari 477
platform 467
patient 465
cancer 454
medic 453
pharmaceut 451
credit 450
primarili 443
residenti 441
distribut 437
former 430
direct 429
relat 429
small 428
offic 419
chang 418
construct 406
control 403
store 400
california 399
individu 398
program 395
group 394
properti 388
mortgag 385
non 375
onlin 372
center 369
power 368
health 362
integr 357
support 352
person 348
electron 346
york 345
intern 340
energi 332
locat 327
enabl 319
worldwid 315
research 309
focus 308
digit 308
internet 305
cell 301
plan 296
test 294
engag 292
own 292
distributor 289
ltd 289
financ 287
famili 286
specialti 286
drug 285
gas 281
licens 281
servic 280
check 279
engin 276
inform 276
enterpris 271
further 271
video 268
rang 267
oil 266
transport 266
cloud 265
communic 263
time 263
approxim 261
stage 257
agreement 255
vehicl 255
equiti 255
profession 249
food 248
independ 247
two 246
money 245
connect 243
facil 243
organ 240
autom 236
access 236
care 235
govern 235
healthcar 234
high 233
north 230
america 229
china 226
monitor 226
collabor 225
consist 223
content 223
compon 222
storag 222
suppli 217
card 211
togeth 210
tool 208
asset 206
media 205
decemb 204
texa 204
incorpor 201
produc 200
game 199
branch 198
hospit 198
origin 198
europ 197
institut 197
comput 197
activ 195
interest 194
leas 194
client 194
accessori 193
com 193
canada 191
deliveri 191
mainten 191
user 191
corp 189
portfolio 189
packag 188
wireless 187
acquisit 186
channel 185
instal 185
compani 184
protect 184
target 184
life 184
counti 183
privat 183
capit 178
payment 174
one 173
internat 170
multi 170
websit 170
area 169
machin 167
term 167
perform 166
restaur 166
demand 166
build 165
natur 165
# Find association in this way we can create anagrams 

# tm::findAssocs(dtm_yahoo,c("treatment"),0.8)

Model 1

We will create a bag of words and then apply an approach borrowed from sentiment analysis. Conventional sentiment analysis classifies a text as positive or negative. We use the same approach, but rather than a binary classification we classify into 13 different sectors.

Before checking a description we clean it and stem it, because all the sector documents are stemmed.

match_sector<-data.frame()

# Read the cleaned sector files once, instead of on every iteration
cleaned_dir<-"D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/CleanedText"
match_files<-list.files(cleaned_dir)
sector_words<-lapply(file.path(cleaned_dir,match_files),scan,what = 'character',sep = "",nmax=3000)
names(sector_words)<-paste0(match_files,"match_test")

for(i in seq_len(nrow(company_test))){

  # Clean and stem the test description in the same way as the sector documents
  desc<-gsub(pattern="\\d",replacement = "",x=company_test$description[i])
  desc<-gsub(pattern = "\\W",replacement = " ",x=desc)
  desc<-gsub(pattern = "\\b[A-Za-z]\\b",replacement = "",x=desc)
  desc<-removeWords(as.character(desc),stopwords())
  desc<-stemDocument(desc)

  desc_split<-unlist(str_split(desc,pattern="\\s+"))

  # Count how many words of the description occur in each sector document
  match_counts<-sapply(sector_words,function(words){sum(!is.na(match(desc_split,words)))})

  # The predicted sector is the one with the most matching words
  match_sector[i,1]<-names(match_counts)[which.max(match_counts)]

}

Cleaning up the names of the predicted sectors and comparing the results with the original

After that, the predictions for the test set can be compared with the original sectors.

library(knitr)
# gsub is vectorised, so the whole column can be cleaned in one call
match_sector$V1<-gsub(pattern = "withcleaning.txtmatch_test",replacement="",x=match_sector$V1,fixed = TRUE)
match_sector$V1<-gsub(pattern = "_dataset",replacement="",x=match_sector$V1,fixed = TRUE)




Results_Sector<-as.data.frame(sort(table(match_sector$V1),decreasing = T))

kable(Results_Sector,caption = "which_Sector")
which_Sector
Var1 Freq
HealthCare 142
Finance 123
Technology 93
Consumer_Services 85
Consumer_Durables 65
Miscelleaneous 65
Capital_Goods 58
Basic_Industries 49
Consumer_Non_Durables 41
Public_Utilities 32
Transportation 14
Energy 13
Unclassified 6
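
The comparison with the original labels can be sketched as a confusion table plus an overall accuracy. The vectors below are toy values standing in for `company_test$Sector` and `match_sector$V1`, and `normalise` is a hypothetical helper. The predicted names come from file names (for example "HealthCare" or "Consumer_Services") while the original labels contain spaces and hyphens, so both are normalised first. Labels that differ in more than punctuation, such as "Miscelleaneous" versus "Miscellaneous" or "Unclassified" versus "n/a", would still need a manual mapping.

```r
# Toy vectors standing in for company_test$Sector and match_sector$V1
actual    <- c("Finance", "Health Care", "Technology", "Finance", "Energy")
predicted <- c("Finance", "HealthCare", "Finance", "Finance", "Energy")

# Hypothetical helper: strip spaces, underscores and hyphens from the labels
normalise <- function(x) gsub("[ _-]", "", x)

actual_n    <- normalise(actual)
predicted_n <- normalise(predicted)

# Confusion table: rows are the true sectors, columns are the predictions
confusion <- table(actual = actual_n, predicted = predicted_n)
confusion

# Overall accuracy: share of test companies whose sector was predicted correctly
accuracy <- mean(actual_n == predicted_n)
accuracy
```

The diagonal of the confusion table counts correct predictions per sector, which is more informative than the overall accuracy when the sectors are as unbalanced as they are here.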

Model 2

We will build a decision tree using the most frequent terms. In this case we will take n=200 and weights can be given using the frequencies
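
As an illustration of the idea, the sketch below fits a classification tree on counts of a few frequent terms. It assumes the rpart package is available (it is usually installed with R as a recommended package); the term counts, labels and the new company are invented, and the control settings are loosened only because the toy data set is tiny. It is not the tree built in this notebook.

```r
library(rpart)

# Toy training data: one row per company, one column per frequent stemmed term.
# Counts and labels are invented for illustration.
train <- data.frame(
  loan    = c(3, 2, 4, 0, 0, 0, 0, 0, 0),
  patient = c(0, 0, 0, 2, 3, 1, 0, 0, 0),
  softwar = c(0, 0, 0, 0, 0, 0, 2, 1, 3),
  Sector  = factor(rep(c("Finance", "Health Care", "Technology"), each = 3))
)

# minsplit and cp are relaxed, and cross-validation is switched off,
# only because the toy data set has nine rows
fit <- rpart(Sector ~ ., data = train, method = "class",
             control = rpart.control(minsplit = 2, cp = 0, xval = 0))

# Predict the sector of a new company from its term counts
new_company <- data.frame(loan = 0, patient = 2, softwar = 0)
predicted <- predict(fit, new_company, type = "class")
as.character(predicted)
```

With real data each row would hold the counts of the top n terms in one company description, so the tree learns which terms separate the sectors.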

library(readr)
decision_trees_file_list<-list.files(path=getwd(),pattern ="withoutcleaning\\.csv$")

# Load the top 200 word table of each sector into a variable named after its file
for(i in seq_along(decision_trees_file_list)){
assign(x=decision_trees_file_list[i],value =read_csv(paste0("D:/UIC/MyResearch/Mwrd/Persado_Data_Scientist/",decision_trees_file_list[i]), col_types = cols(X1 = col_skip())))
}

Delete the frequency column from them

decision_trees_file_list<-as.data.frame(decision_trees_file_list)

decision_trees_file_list$decision_trees_file_list<-as.character(decision_trees_file_list$decision_trees_file_list)

for(i in 1:nrow(decision_trees_file_list))
{
assign(value=get(decision_trees_file_list[i,])[,1],x=decision_trees_file_list[i,])
}
for(i in 1:nrow(decision_trees_file_list))
{
assign(value=as.data.frame(t(get(decision_trees_file_list[i,]))),x=paste0("t_",decision_trees_file_list[i,]))
  
}

Merging all the datasets together

library(knitr)

words_decision_build<-rbind(t_Basic_Industrieswithoutcleaning.csv,t_Capital_Goodswithoutcleaning.csv,t_Consumer_Durableswithoutcleaning.csv,t_Consumer_Non_Durableswithoutcleaning.csv,t_Consumer_Serviceswithoutcleaning.csv,t_Energywithoutcleaning.csv,t_Finance_datasetwithoutcleaning.csv,t_HealthCarewithoutcleaning.csv,t_Miscelleaneouswithoutcleaning.csv,t_Public_Utilitieswithoutcleaning.csv,t_Technology_datasetwithoutcleaning.csv,t_Transportationwithoutcleaning.csv,t_Unclassifiedwithoutcleaning.csv)

kable(x=words_decision_build,caption = "Comparison of frequent words across sectors")
Comparison of frequent words across sectors
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89 V90 V91 V92 V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 V113 V114 V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 V127 V128 V129 V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 V143 V144 V145 V146 V147 V148 V149 V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 V160 V161 V162 V163 V164 V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176 V177 V178 V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192 V193 V194 V195 V196 V197 V198 V199 V200
textbag segment industri manufactur headquart construct oil oper market base chemic wast gas system facil produc State Unite energi custom fuel mine name manag process water brand distribut power mainten primarili Product specialti storag subsidiari commerci solut Corpor applic retail plant treatment addit build design sale technolog compris engin home intern medic control ethanol food grain metal renew Servic Compani dispos System transport communic electr equip fabric own pipe steel two agricultur chang distil instal materi network pharmaceut store biomass California Canada center collect corn diesel distributor engag former lubric natur petroleum program rang rehabilit togeth util worldwid agent approxim cabl coal coat direct drill explor health independ Industrial liquid regul relat solar wood anim care Chemic China clean consum contractor Energi enhanc generat gold hardwar infrastructur locat PC Plain protect Republ surfac Texa trait acid coin complianc fertil fiber general govern Green hazard healthcar high Home hydrogen Illinoi integr line non paper part recycl roof secur Specialti tank wire acr agenc American cabinet carbon corros crop destruct devic feed forc hous Industri ingredi investor leas Metal parti Peopl Pharmaceut pipelin precious recal reduc spray vacuum valu work Africa aluminum balloon cell cleaner clinic compon consult dealer disord door dri emiss
textbag1 system manufactur segment industri market equip design headquart oper compon control process applic engin power electron solut Corpor sale commerci devic Unite custom State vehicl gas electr test brand primarili direct part distributor measur name addit instrument automot machin America base medic assembl high build subsidiari worldwid autom distribut energi independ integr manag Technolog compris cabl construct materi metal oil relat semiconductor support laser monitor repres technolog circuit plastic softwar Europ Group intern packag suppli tool heat repair steel truck aerospac China imag inspect network precis consum data optic research air generat non origin two end former instal light residenti System unit accessori aluminum Ltd militari North railcar sensor Servic Asia chang defens print produc transport comput fabric forc Hold pipe specialti structur water chemic concret metrolog New program pump detect digit door fluid laboratori storag three analysi board California cell coat govern Industri line secur togeth wafer window wire Canada connector Energi engag home inject liquid multi perform real time valv blood car food general mainten panel plant rang treatment aircraft center dealer deliveri environment famili field fuel mechan molecular organ Peopl platform Republ singl user York dispens facil flow hopper hous HTG institut leas mobil pharmaceut protect retail
textbag2 system manufactur segment market oper control headquart brand retail industri equip commerci design distribut vehicl furnitur power store light Unite State applic base electr custom food solut mobil suppli video display repair addit sale center distributor independ network primarili compris Corpor data devic pool relat specialti subsidiari direct home LED pet truck worldwid America chemic communic compon engin outdoor chair detect electron manag metal monitor name part process residenti access accessori Pet pump togeth care digit Furnitur Industri instal intern mainten offic restaur secur supplement treatment bird Canada consum conveni end facil govern graphic integr rang seat steel tabl technolog two area automot California door finish fuel Further garden interior Internet motor non nutrit oil packag Solut System Technolog water wood chang cleaner Compani construct energi Garden gas Ltd materi North protect repres room sport tire traffic user util voic wholesal approxim architectur chain clinic dealer desk diesel fixtur formul health high inform Internat item lawn New panel person plastic privat support surveil three transmiss transport travel agricultur camera Center CN comput driver edibl enabl Europ filter furnish generat grid Hold Hooker insur label laser Light merchant modul Ohio onlin optic pharmaceut Pizza produc replac RoadSquad satellit sensor signag site
textbag3 store brand retail segment market oper food manufactur distribut Unite headquart State name base accessori invest Corpor specialti direct distributor beverag industri apparel consum Food fruit Group licens custom subsidiari produc Canada equip outlet primarili intern Internat Ltd chain locat New wholesal York privat relat Brand China Compani compris frozen snack chang design former Madden million independ organ process solut addit approxim California control electron firm footwear groceri own devic mass packag sale system women com label Monster technolog togeth toy veget data depart engag men Retail cabl channel coffe commerc compon conveni high merchandis North pack Peopl power protein Republ Technolog applic dri drink line militari network nutrit outdoor sport three water worldwid America audio center display game Girl manag natur portfolio rang region rice Steve Wholesal can energi fashion Fresh home item juic non seek shirt vehicl wireless activ casino chemic club entertain Europ good hardwar health Limit medic mix nut onlin Operat part prefer protect secur shop softwar supermarket suppli video agreement aircraft Asia chees chip commerci consist countri currenc dairi Decemb digit Energi engin facil fresh ice Industri integr plant Product restaur school Seneca shoe small state two Websit agent American bottl bran Calavo Co Columbia
textbag4 oper segment Unite State headquart base manag own store program New brand retail distribut name subsidiari com restaur market consum network addit Group approxim televis custom entertain video home York Corpor content compris onlin relat Internet solut properti real technolog center food invest primarili data former digit intern accessori locat Websit industri system applic chang mobil equip estat game sport commerci suppli franchis Hold channel inform Internat secur togeth educ facil manufactur merchandis sale state engag Servic Decemb media residenti engin health advertis electron platform satellit broadcast item Liberti licens shop book cabl California Canada direct Further profession design two America casino mortgag connect hotel leas live produc client Colorado incom North radio specialti non privat China construct devic Entertain Network public Republ support consist financi agenc build general good REIT communic gift govern Hotel Media travel American comput Corp tax tool Citi loan mainten news offic packag Texa worldwid access apparel children Compani Europ feder healthcar project protect Resort trust wireless consult infrastructur line organ school space squar storag asset casual energi financ integr plan rang region rental research room softwar activ area demand Educat enterpris feet independ Mexico million Peopl publish repair station test vehicl corpor Florida involv
textbag5 oil gas natur Texa system oper Energi headquart fuel State Unite Basin power approxim control reserv properti segment crude Decemb locat explor interest primarili produc own market engag industri Oklahoma Partner acquisit acr barrel prove Permian cell LP net West Corpor independ manag storag applic custom gross Oil solut Colorado data drill equip equival field Houston manufactur North Resourc technolog million packag process autom compris focus pipelin Servic Shale transport activ addit distribut East electr Gas gather liquid mine monitor oilfield Product region train base Canada chang coal compon compressor consist Counti design engin former Indiana Mid miner name partner South Virginia area Blueknight energi estim general hold instal subsidiari Technolog acquir asphalt Denver Hold incorpor inject land onshor Pennsylvania project sale state water work chemic Compressco Contin Crude Eagl exploit generat GP LLC Ltd mainten Midland mile non part pressur pump relat Reserv rig royalti stationari truck vehicl Westport Wyom aggreg America case Citi CNG collect complet CSI Dakota digit facil Gulf handl hydrogen Illinoi intern Legaci low Mountain parti PEM petroleum purchas Rocki seismic station System tank third three unit assembl Australia batteri burner Centenni China Coast combin compress condens convers counti crew Develop divis drive Ford Format
textbag6 loan commerci bank deposit account oper estat real insur manag credit consum invest market individu mortgag Bank residenti offic save hold headquart home line check secur segment construct financi addit money equiti person famili New compris certif properti Corpor Financi trust branch financ base Unite small retail plan Bancorp card State locat interest automobil bear retir custom time demand industri fund non York one cash corpor term asset brokerag Counti onlin equip Hold liabil mobil primarili subsidiari trade four lend rang full purchas coverag Corp life First land autom capit relat unsecur Group Further accept administr portfolio Internet vehicl exchang wealth institut multi network agricultur transfer California client health safe casualti profession advisori Carolina size leas Pennsylvania teller consist payment solut Decemb direct Insuranc medium NOW Commun fix annuiti communiti debit Ohio Washington acquisit box instal machin South Bancshar facil Jersey reinsur Florida improv name Texa Virginia agenc former owner polici privat remot area broker mutual Nation stock letter North origin chang transact two benefit center general program counti investor agent bond telephon captur incom Line work sale Trust wire bill Compani municip Georgia Massachusett protect Servic Citi electron process Capit tax ATM Commerci depositori engag homeown Illinoi Maryland treasuri
textbag7 treatment clinic Phase trial diseas patient cancer headquart candid commerci treat Pharmaceut drug base stage therapi cell system Unite focus medic agreement State Therapeut II therapeut biopharmaceut collabor market target inhibitor oral research manufactur name segment test technolog disord pharmaceut devic former III Corpor program small chang addit licens tumor lead non California manag acut antibodi molecul care compris oncolog studi human diagnost oper design hospit complet chronic gene engag pain infect receptor novel New anti preclin health Ltd solut platform proprietari inject activ relat combin protein breast deliveri discoveri prevent tissu Massachusett solid primarili imag intern specialti Pharma monitor sale syndrom formul adult direct immun surgic genet Medic medicin sever associ blood distributor multipl indic York center procedur advanc condit discov inflammatori vaccin agent immunotherapi select surgeri type virus worldwid lung physician laboratori leukemia San strateg dental diabet healthcar Limit refractori two biotechnolog factor Hold organ acid Compani enabl kinas rare custom metastat Research implant incorpor liver pre subsidiari biolog Europ invas applic detect equip molecular Servic System bone cardiac compound consist DNA malign monoclon need skin Bioscienc Co facil resist area enhanc generat generic growth hematolog support control Health home lymphoma pharmaci pipelin prostat risk Univers analysi brand
textbag8 manag solut applic market oper technolog onlin base headquart mobil brand patent platform custom system enabl Unite data manufactur process State advertis Corpor addit sale payment softwar consum design industri inform segment power retail com digit real compris devic equip network integr relat China estat financi client user chang licens name subsidiari commerci direct electr former health intern store support cloud govern home account ad analyt control game perform California channel commerc corpor deliveri energi generat healthcar media New portfolio America content CoStar end enterpris Further Internet marketplac properti risk Technolog tool valu asset electron optim primarili program secur sourc Web Websit worldwid bank distribut facilit gas insur medic merchandis own photo search togeth transfer York call card Europ Peopl person print profession promot report Republ transact wast access autom care communic distributor Energi financ Group Internat label Manag new North plan Servic storag suppli vehicl wireless buy connect high instal institut interact invest kiosk laser order organ project rang transport allow Analyt approxim audit batteri clinic collect complianc Connect consist display film incorpor Liberti light machin monitor offic record research save South strategi tax ticket video app Asia bill buyer compon consult educ engag enhanc good heat Hold individu
textbag9 custom oper network communic voic segment data headquart Internet mobil equip system access base solut manag market subsidiari Unite wireless industri broadband residenti State water gas approxim commerci fiber natur optic own video connect devic distribut manufactur Commun design telecommun addit enterpris line retail applic small integr name sale solar brand energi Ltd messag call cloud digit facil power togeth carrier Corpor Decemb Ethernet project secur subscrib support chang compris distanc fix former independ magicJack transport center Compani direct Energi intern local long million point softwar TV user Virginia VoIP wast wastewat accessori corpor Hold medium Mobil premis rang relat technolog telephon two wholesal audio California conferenc content Counti electr end infrastructur locat mainten parti Servic smartphon termin third util aggreg cabl Canada capac construct consum distributor Further govern high home inform Internat land LNG media Partner phone plant primarili privat protect store suppli US vehicl WiFi advertis allow engin generat hazard host Israel municip New onlin origin partner protocol Retail San state storag tablet televis worldwid ad backup cellular channel collect contract copper dealer enabl enclosur engag forc fuel Jersey leas Network North passiv Pennsylvania platform satellit smart Solut Spark station suit System Texa transfer treatment valu virtual Water
textbag10 solut manag system applic softwar data manufactur segment headquart custom network oper base market design technolog mobil platform integr equip industri process enabl support devic secur cloud sale addit enterpris consum Unite communic digit direct inform Corpor electron comput onlin user video worldwid State name control connect content California compris power game tool access organ profession storag distributor Internet Ltd retail chang media semiconductor engin former wireless financi Servic IT Web govern monitor analyt infrastructur relat healthcar train brand perform Solut center subsidiari autom advertis Further origin intern optim plan Technolog test distribut high Europ primarili mainten allow consult end channel medic project Websit America China health automot packag resel Asia educ display IP deliv line analysi implement Manag social optic chip energi Hold account hardwar payment suit ad circuit client rang System commerci engag memori modul server small time voic creat Group telecommun protect consist instal technic compon forc intellig program suppli valu light construct Network signal activ QAD togeth across employe enhanc experi New point real travel build com deliveri function licens person report specif embed Enterpris insur North partner print publish store care clinic host offic solar transport collabor environ interact parti track two complianc desktop independ
textbag11 transport oper segment aircraft fleet freight truckload headquart carrier custom dri Unite air State own trailer cargo logist manag base Transport Decemb leas deliveri subsidiari truck vessel oil contain rail shipment temperatur airlin approxim equip industri intermod manufactur ocean tractor charter compris ground mainten retail brokerag bulk chemic control region solut time Canada Corpor expedit Express haul Mexico ship special van airport America consum former good less name North parti third Air chang commerci commod fuel Greec Group intern Logist passeng petroleum tanker worldwid addit consolid contract CryoPort de distribut flight Hold independ load long part two Worldwid Airlin automot Boe consist crude destin drybulk Florida forward general jet liquid Ltd materi medium non produc rang shipper togeth American arrang center coal contractor dedic definit drayag facil fee food handl insur iron network New order primarili Servic Shipper storag suppli support track travel Truck Washington account ad Airbus ArcBest Arkansa asset citi com Corp diesel driver dwt electron line local locat March Maroussi metal militari ore packag protect railroad refin relat schedul Ship Student technolog Top Truckload vapor wareh Werner York YRC Aeroportuario agent agreement Asia border capac Capit Caribbean carri cement Centro combin commerc connect contractu crew cross Cryoport
textbag12 million invest loan firm debt real commerci estat New equip secur small term process York deposit design headquart manag market oper account year equiti Busi data electron financ platform relat seek Technolog addit bank capit mobil Newtek Servic technolog base check Corp credit financi former individu payment solut Websit acquisit brand broadcast card industri lien offic softwar Bank California certif Citi client CLO construct consum debit focus fund Further handset insur investig Louisiana Ltd Massachusett matur non owner program rang residenti Secur special State subordin Unite unsecur work administr AquaBounti area cash cell China compris Credit DHX eCommerc enabl Feder Financ first forens Fund high host inform intern Internet line make name net origin payrol prefer primarili profession purpos second segment senior size stem tax Union Web accept ancillari approxim Capit chang consult content Corpor distribut document enterpris famili Famili file Garden Group hardwar incom institut licens Limit litig Manag manufactur Media middl money month Nation onlin own part phone portfolio privat properti public retail save select stock support televis TICC tool transact within access Account ACH acquir Advantag agent allianc American Angele anim approv Asian asset Associat averag backup balanc benefit biotechnolog branch Capco captur cart Chines cloud Coast

Creating a column and putting in the target sector

row.names(words_decision_build)<-NULL

words_decision_build<-cbind(words_decision_build,decision_trees_file_list)

# Stripping the unnecessary file suffix from the target column

words_decision_build[,201]<-gsub(pattern = 'withoutcleaning.csv',replacement = "",x = words_decision_build[,201])

# Renaming the target variable

colnames(words_decision_build)[colnames(words_decision_build)=="decision_trees_file_list"] <- "Target_Sector"

The result is a data frame with 13 rows, one per target sector, where the 200 word columns are the predictors and Target_Sector is the response.

Building a decision tree

A few things to keep in mind

  1. The dataset will contain more instances of Finance and Health Care than of the other sectors.

  2. Each description is considered in isolation.

Therefore, from 1 and 2, even though Finance and Health Care have more samples, the frequency of words within a single sample should be similar across all sectors.

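library("tree")

# Fit a classification tree on the first 99 word columns plus the target (column 201)
tree_model=tree(Target_Sector~.,words_decision_build[,c(1:99,201)])
plot(tree_model)

# Note: called on its own, tree.control() only returns a list of settings.
# It does not change tree_model unless it is passed to tree() through the control argument.
tree.control(nobs=100,mincut = 10)

# 10-fold cross-validation over the pruning sequence (deviance based)
set.seed(3)
cv_tree=cv.tree(tree_model,FUN=prune.tree,K = 10)

# Cross-validated deviance against tree size
plot(cv_tree$size,cv_tree$dev,type="b")
plot(tree_model)

# Prune to 7 terminal nodes using misclassification error
pruned_model<-prune.misclass(tree_model,best = 7)

plot(pruned_model)
text(pruned_model,pretty=0)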
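A minimal sketch of how control settings reach the fit and how a cross-validated size can feed the pruning step. It assumes the tree package is installed and uses R's built-in iris data as a stand-in, not the challenge data; ctrl, fit, cv_fit, best_size and pruned are illustrative names.

```r
library(tree)

# Stand-in data with a factor response, so tree() fits a classification tree.
data(iris)

# Control settings only take effect when passed to tree() via `control`.
ctrl <- tree.control(nobs = nrow(iris), mincut = 10, minsize = 20)
fit <- tree(Species ~ ., data = iris, control = ctrl)

# Cross-validate with the same criterion that is later used for pruning.
set.seed(3)
cv_fit <- cv.tree(fit, FUN = prune.misclass, K = 10)

# Pick the size with the lowest cross-validated misclassification count.
best_size <- cv_fit$size[which.min(cv_fit$dev)]
pruned <- prune.misclass(fit, best = best_size)

plot(pruned)
text(pruned, pretty = 0)
```

Choosing best from the cross-validation result, rather than fixing it by hand, keeps the pruning step tied to the plotted deviance curve.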

Findings, Issues and Future Improvements

Findings. We compare the predicted sector with the original sector

compare_sector_original_test<-cbind.data.frame(company_test,match_sector)


# gsub() is vectorised, so each replacement is applied to the whole column at once
compare_sector_original_test[[5]]<-gsub(pattern = "Basic_Industries",replacement = "Basic Industries",x=compare_sector_original_test[[5]])

compare_sector_original_test[[5]]<-gsub(pattern = "HealthCare",replacement = "Health Care",x=compare_sector_original_test[[5]])

compare_sector_original_test[[5]]<-gsub(pattern = "Capital_Goods",replacement = "Capital Goods",x=compare_sector_original_test[[5]])

compare_sector_original_test[[5]]<-gsub(pattern = "Consumer_Durables",replacement = "Consumer Durables",x=compare_sector_original_test[[5]])

compare_sector_original_test[[5]]<-gsub(pattern = "Consumer_Non_Durables",replacement = "Consumer Non-Durables",x=compare_sector_original_test[[5]])

compare_sector_original_test[[5]]<-gsub(pattern = "Consumer_Services",replacement = "Consumer Services",x=compare_sector_original_test[[5]])

compare_sector_original_test[[5]]<-gsub(pattern = "Public_Utilities",replacement = "Public Utilities",x=compare_sector_original_test[[5]])
# Count the rows where column 3 (original sector) equals column 5 (matched sector)
correct<-sum(compare_sector_original_test[[3]]==compare_sector_original_test[[5]])

print(correct)
## [1] 466
accuracy=correct/nrow(compare_sector_original_test)
print(accuracy)
## [1] 0.5928753
## [1] 0.5928753

We correctly predicted 466 observations, for an accuracy of 59.29%
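A single accuracy figure hides which sectors are confused with each other. The sketch below shows one way to break the comparison down with a confusion table. The label vectors are made-up examples, not the challenge data, and all variable names are illustrative.

```r
# Hypothetical labels, for illustration only.
actual    <- c("Finance", "Finance", "Energy", "Health Care", "Energy", "Technology")
predicted <- c("Finance", "Energy",  "Energy", "Health Care", "Finance", "Technology")

# Overall accuracy: share of rows where prediction and label agree.
overall_accuracy <- mean(predicted == actual)

# Confusion table: rows are actual sectors, columns are predicted sectors.
confusion <- table(actual = actual, predicted = predicted)

# Per-sector recall: correct predictions divided by the number of actual cases.
sectors <- rownames(confusion)
per_sector <- sapply(sectors, function(s) {
  if (s %in% colnames(confusion)) confusion[s, s] / sum(confusion[s, ]) else 0
})

print(confusion)
print(round(per_sector, 2))
```

With imbalanced sectors such as Finance and Health Care, the per-sector figures show whether the overall accuracy is driven mainly by the largest classes.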

Issues

  1. The code can be improved. One chunk is written poorly because I was not able to get the files into the global environment and hence could not retrieve the variables.

  2. Time constraint. Given more time, I could build a decision tree that predicts which words lead to which sector. Right now the model is at a nascent stage, and I would need two more weeks to build a proper decision tree model.

  3. Certain entries had no description on Yahoo Finance, so they were removed from the dataset. Entries whose Sector was n/a but whose description was available were kept in the data.

Future-Improvements

  1. Cross-validation can be performed on the number of words scanned from a file, so that the optimum number of words is retrieved and computing time is shortened.

  2. Word associations haven't been tested, and they are a convenient way of checking correlations between words.

  3. Other techniques such as SVM and KNN can be tried, and the best-performing technique can then be used.
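As a rough illustration of the word-association idea in item 2, correlations between word counts can be computed with base R. The document-term counts below are invented for the example, and dtm, word_cor and oil_assoc are hypothetical names.

```r
# Hypothetical document-term counts: rows are descriptions, columns are stemmed words.
dtm <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 1,
    0, 0, 4, 2,
    0, 1, 3, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("oil", "gas", "loan", "bank"))
)

# Pairwise Pearson correlation between the word columns.
word_cor <- cor(dtm)

# Words associated with "oil", strongest first, excluding "oil" itself.
oil_assoc <- sort(word_cor["oil", colnames(word_cor) != "oil"], decreasing = TRUE)

print(round(oil_assoc, 2))
```

Words that rise and fall together across descriptions get a positive correlation, which is the kind of signal a sector classifier could use beyond raw frequency.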