12/9/2020

Background

Question(s):

  • What are the best Data Science companies to work for, and
  • what characteristics make them so?

Data:

  • Top Tech Companies Stock Price (.csv) and
  • Glassdoor Company Reviews (web scrape).

Data sources are cited in Appendix A.

Approach:

  • Acquire and filter.
  • Tidy and transform.
  • Visualize and analyze.

Acquire and Filter (1)

We read in the Top Tech Companies Stock Price data (.csv) and then filter for companies that (1) have the highest rate of growth or (2) have the highest market value:

#load required packages
library(tidyverse) #read_csv(), filter(), tibbles

#read in the csv file and store it as a tibble
tech_data <- read_csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Final-Project/tech_sector_list.csv")
tech_table <- as_tibble(tech_data)
#filter 1: highest rate of growth via "% Change" (top 3)
high_growth <- filter(tech_table, `% Change` > 0.06)
#filter 2: highest value via "Market Cap" (top 2)
high_val <- filter(tech_table, `Market Cap (Billions)` > 1000)

Filtering provides us with the following short list of companies:

Symbol  Name                                 % Change  Market Cap (Billions)  Sector
SQ      Square Inc.                            0.0609                 93.754  Financial Services
PLTR    Palantir Technologies Inc.             0.1592                 39.512  Big Data
UMC     United Microelectronics Corporation    0.2024                 15.834  Semiconductor
AAPL    Apple Inc.                            -0.0297               1936.000  Big Tech
MSFT    Microsoft Corporation                 -0.0013               1589.000  Big Tech
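
The table above combines both filtered sets. The original code doesn't show how they were joined; a minimal sketch, assuming dplyr's bind_rows() (the short_list name is ours, not from the original):

#combine the two filtered tibbles into the short list shown above (assumed step)
library(dplyr)
short_list <- bind_rows(high_growth, high_val)
short_list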

Acquire and Filter (2)

To gain insight into why employees like these particular companies, and what might differentiate them from others, we scrape Glassdoor for each company’s “Pros”:

#load the scraping package
library(rvest)

#1. Download HTML and convert to XML with read_html()
a <- read_html("https://www.glassdoor.com/Reviews/Apple-Reviews-E1138.htm")
#2. Extract the "Pros" nodes with html_nodes() and a CSS selector
a_ext <- html_nodes(a, '.v2__EIReviewDetailsV2__fullWidth:nth-child(1) span')
#3. Extract the review text from the HTML (the "Pros" of the first 10 reviews)
a_pros <- html_text(a_ext)

The same procedure was applied to Microsoft, Palantir, and Square, providing the first 10 reviews for each company, with each “Pros” section stored as a list of strings (a sketch of that repetition follows).
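
That repetition can be written as one function mapped over the remaining review URLs. A minimal sketch, assuming the purrr package; the employer IDs in these URLs (EXXXX) are placeholders, not verified Glassdoor links:

#apply the same scrape to each remaining company (URLs are placeholders)
library(rvest)
library(purrr)

scrape_pros <- function(url) {
  read_html(url) %>%
    html_nodes('.v2__EIReviewDetailsV2__fullWidth:nth-child(1) span') %>%
    html_text()
}

urls <- c(
  microsoft = "https://www.glassdoor.com/Reviews/Microsoft-Reviews-EXXXX.htm",
  palantir  = "https://www.glassdoor.com/Reviews/Palantir-Technologies-Reviews-EXXXX.htm",
  square    = "https://www.glassdoor.com/Reviews/Square-Reviews-EXXXX.htm"
)
all_pros <- map(urls, scrape_pros) #named list of "Pros" strings, one per company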

Tidy and Transform

We want to whittle our list of strings down to only those that are useful, defining characteristics of our top companies.

We tidy the text via regular expressions (removing non-alphanumeric characters and digits, and compressing white space) and remove common stop words.

Removing common stop words via the built-in stopwords() function didn’t remove all non-words and non-descriptors, so we filter twice more: once for low-frequency entries (n < 4) and again (manually) for non-word, non-descriptive entries. A sketch of the full tidying step follows.
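
A minimal sketch of that pipeline, assuming stringr and dplyr for the text work, the tm package’s stopwords(), and the all_pros list from the scrape sketch above; object names and the manual-removal examples are illustrative, not the original code:

#tidy the scraped "Pros" text (illustrative object names)
library(stringr)
library(dplyr)
library(tm) #for its built-in stopwords()

pros_clean <- all_pros %>%
  unlist() %>%
  str_to_lower() %>%
  str_replace_all("[^[:alpha:][:space:]]", " ") %>% #drop non-alphanumerics and digits
  str_squish() #compress white space

refined <- tibble(value = unlist(str_split(pros_clean, " "))) %>%
  filter(!value %in% stopwords("en")) %>% #remove common stop words
  count(value, sort = TRUE) %>%
  filter(n >= 4) %>% #drop low-frequency entries (n < 4)
  filter(!value %in% c("dont", "youre")) #manual pass for non-words (hypothetical examples)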

Our resulting table is …

Table

#load the table-formatting package
library(kableExtra) #kbl(), kable_minimal()

#output the refined word counts as a kable table
refined %>%
  kbl() %>%
  kable_minimal()
value         n
work         19
people       14
company      12
benefits      7
life          6
perks         6
working       6
culture       5
love          5
balance       4
compensation  4
employees     4
job           4
mission       4
place         4
smart         4

Visualize and Analyze (1)

#load the plotting package (part of the tidyverse)
library(ggplot2)

#visualize the frequency count as a horizontal bar chart
ggplot(refined) +
  geom_col(aes(x = reorder(value, n), y = n, fill = value), width = 1) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Word Count Frequency", x = "", y = "", fill = "Source")

Visualize and Analyze (2)

#load the word cloud package
library(wordcloud2)

#render the refined word counts as a word cloud
wordcloud2(data = refined, color = "random-light", backgroundColor = "grey")

Conclusion

“The first principle is that you mustn’t fool yourself and you are the easiest person to fool.”

— Richard Feynman

……………………………………………………………………

Successive filtering (by growth, by market cap, and by word frequency) led to our short list of top companies: Apple, Microsoft, Palantir, and Square, while tidying and transforming the scraped reviews with regular expressions sifted out their differentiating characteristics: meaningful work, a sense of belonging, employer care for employees, and work-life balance.

……………………………………………………………………

How is this useful?

  • Employers can align with what employees are looking for,
  • Employees can align with great employers, and
  • A similar approach could be applied elsewhere to gain similar insight.

Data Citation [Appendix A]