12/9/2020

Background

Question(s):

  • What are the best Data Science companies to work for, and
  • what characteristics make them so?

Data:

  • Top Tech Companies Stock Price (.csv) and
  • Glassdoor Company Reviews (web scrape).

Data sources are cited in Appendix A.

Approach:

  • Acquire and filter.
  • Tidy and transform.
  • Visualize and analyze.

Acquire and Filter (1)

We read in the Top Tech Companies Stock Price data (.csv) and then filter for companies that (1) have the highest rate of growth or (2) have the highest market value:

#load required packages
library(tidyverse) #read_csv(), filter(), tibbles

#read in the csv file and store it as a tibble
tech_data <- read_csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Final-Project/tech_sector_list.csv")
tech_table <- as_tibble(tech_data)
#filter 1: highest rate of growth via "% Change" (top 3)
high_growth <- filter(tech_table, `% Change` > 0.06)
#filter 2: highest value via "Market Cap" (top 2)
high_val <- filter(tech_table, `Market Cap (Billions)` > 1000)

Filtering provides us with the following short list of companies:

Symbol  Name                                 % Change  Market Cap (Billions)  Sector
SQ      Square Inc.                            0.0609                 93.754  Financial Services
PLTR    Palantir Technologies Inc.             0.1592                 39.512  Big Data
UMC     United Microelectronics Corporation    0.2024                 15.834  Semiconductor
AAPL    Apple Inc.                            -0.0297               1936.000  Big Tech
MSFT    Microsoft Corporation                 -0.0013               1589.000  Big Tech
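
The table above combines both filtered sets. The original code doesn't show how they were joined; a minimal sketch, assuming dplyr's bind_rows() (the short_list name is ours, not from the original):

#combine the two filtered tibbles into the short list shown above (assumed step)
library(dplyr)
short_list <- bind_rows(high_growth, high_val)
short_list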

Acquire and Filter (2)

To gain insight into why employees like these particular companies, and what might differentiate them from others, we scrape Glassdoor for each company’s “Pros”:

#load the scraping package
library(rvest)

#1. Download HTML and convert to XML with read_html()
a <- read_html("https://www.glassdoor.com/Reviews/Apple-Reviews-E1138.htm")
#2. Extract the "Pros" nodes with html_nodes() and a CSS selector
a_ext <- html_nodes(a, '.v2__EIReviewDetailsV2__fullWidth:nth-child(1) span')
#3. Extract the review text from the HTML (the "Pros" of the first 10 reviews)
a_pros <- html_text(a_ext)

The same procedure was applied to Microsoft, Palantir, and Square, providing the first 10 reviews for each company, with each “Pros” section stored as a list of strings (a sketch of that repetition follows).
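
That repetition can be written as one function mapped over the remaining review URLs. A minimal sketch, assuming the purrr package; the employer IDs in these URLs (EXXXX) are placeholders, not verified Glassdoor links:

#apply the same scrape to each remaining company (URLs are placeholders)
library(rvest)
library(purrr)

scrape_pros <- function(url) {
  read_html(url) %>%
    html_nodes('.v2__EIReviewDetailsV2__fullWidth:nth-child(1) span') %>%
    html_text()
}

urls <- c(
  microsoft = "https://www.glassdoor.com/Reviews/Microsoft-Reviews-EXXXX.htm",
  palantir  = "https://www.glassdoor.com/Reviews/Palantir-Technologies-Reviews-EXXXX.htm",
  square    = "https://www.glassdoor.com/Reviews/Square-Reviews-EXXXX.htm"
)
all_pros <- map(urls, scrape_pros) #named list of "Pros" strings, one per company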

Tidy and Transform

We want to whittle our list of strings down to only those that are useful, defining characteristics of our top companies.

We tidy the text via regular expressions (removing non-alphanumeric characters and digits, and compressing white space) and remove common stop words.

Removing common stop words via the built-in stopwords() function didn’t remove all non-words and non-descriptors, so we filter twice more: once for low-frequency entries (n < 4) and again (manually) for non-word, non-descriptive entries. A sketch of the full tidying step follows.
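
A minimal sketch of that pipeline, assuming stringr and dplyr for the text work, the tm package’s stopwords(), and the all_pros list from the scrape sketch above; object names and the manual-removal examples are illustrative, not the original code:

#tidy the scraped "Pros" text (illustrative object names)
library(stringr)
library(dplyr)
library(tm) #for its built-in stopwords()

pros_clean <- all_pros %>%
  unlist() %>%
  str_to_lower() %>%
  str_replace_all("[^[:alpha:][:space:]]", " ") %>% #drop non-alphanumerics and digits
  str_squish() #compress white space

refined <- tibble(value = unlist(str_split(pros_clean, " "))) %>%
  filter(!value %in% stopwords("en")) %>% #remove common stop words
  count(value, sort = TRUE) %>%
  filter(n >= 4) %>% #drop low-frequency entries (n < 4)
  filter(!value %in% c("dont", "youre")) #manual pass for non-words (hypothetical examples)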

Our resulting table is …

Table

#load the table-formatting package
library(kableExtra) #kbl(), kable_minimal()

#output the refined word counts as a kable table
refined %>%
  kbl() %>%
  kable_minimal()
value         n
work         19
people       14
company      12
benefits      7
life          6
perks         6
working       6
culture       5
love          5
balance       4
compensation  4
employees     4
job           4
mission       4
place         4
smart         4

Visualize and Analyze (1)

#load the plotting package (part of the tidyverse)
library(ggplot2)

#visualize the frequency count as a horizontal bar chart
ggplot(refined) +
  geom_col(aes(x = reorder(value, n), y = n, fill = value), width = 1) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Word Count Frequency", x = "", y = "", fill = "Source")

Visualize and Analyze (2)

#load the word cloud package
library(wordcloud2)

#render the refined word counts as a word cloud
wordcloud2(data = refined, color = "random-light", backgroundColor = "grey")

Conclusion

“The first principle is that you mustn’t fool yourself and you are the easiest person to fool.”

— Richard Feynman

……………………………………………………………………

Successive filtering (by growth, by market cap, and by word frequency) led to our short list of top companies: Apple, Microsoft, Palantir, and Square, while tidying and transforming the scraped reviews with regular expressions sifted out their differentiating characteristics: meaningful work, a sense of belonging, employer care for employees, and work-life balance.

……………………………………………………………………

How is this useful?

  • Employers can align with what employees are looking for,
  • Employees can align with great employers, and
  • A similar approach could be applied elsewhere to gain similar insight.

Data Citation [Appendix A]