library(XML)
library(xml2)
library(tidyverse)
library(rvest)
library(stringr)
library(RMySQL)
library(knitr)

GitHub and project documentation

The datasets and .Rmd file used in this project are available at https://github.com/forhadakbar/data607fall2019/tree/master/Week%2008 and https://github.com/ShovanBiswas/DATA607/tree/master/Week8

Overview

In this project, we identify who the thought leaders of data science are, what they care about most, and how the field has changed over the years. To do so, we surveyed the web, including social media, job sites, LinkedIn profiles of influential data scientists, and data science articles, and extracted information to build a database. We scraped a number of pages, analyzed the data, and presented the results visually.

Motivation

Team FABS chose “Alternative Project 3 – Data Science Thought Leadership” because we felt it would be challenging to scrape various web pages, analyze the information, and present it in a visually clear format.

Team Work & Communication

Initially, we tried to form a team, but found that many of the members were already taken. We eventually formed a team of two, called FABS. We communicated over email, phone, and WhatsApp to chart out a plan, a research procedure, and a division of work. We then worked collaboratively using TeamViewer.

Approach

To answer the key questions of this assignment, we needed a list of thought leaders in data science. We therefore scraped websites that met the following criteria:

  1. Provides a professional profile of the thought leader, on LinkedIn or other social media.
  2. Provides their geographical location.
  3. Provides a description of their expertise.

Based on these criteria, we charted the steps displayed in the following workflow.

setwd('D:/CUNY-DataScience/Fall 2019/Data Acquisition and Management DATA 607/Week 08')

Tools

The tools used for this project are R, Tableau, MySQL, and SelectorGadget (a Chrome extension).

Who are today’s “thought leaders” in data science?

We scraped a site listing the top 20 thought leaders in data science and imported them into a MySQL database, from which we extracted and analyzed them. To handle the more complex scraping of Twitter links and a host of child URLs, we used a separate supporting RMD, in which we first wrote the extracted information to a CSV file and then imported it into the MySQL database.

# Parse the ranking page; read_html() has no stringsAsFactors argument, so it is dropped
page <- read_html("https://www.thinkers360.com/top-20-global-thought-leaders-on-analytics-july-2018/")
readurl <- page %>%
        html_nodes("table") %>%
        .[1] %>%
        html_table(fill = TRUE)

top20 <- as_tibble(readurl[[1]])
top20
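
The table-scraping step above can be tried without network access. The following is a minimal sketch, assuming a small inline HTML fragment (hypothetical, with two rows copied from the profile table) in place of the live page; `demo_html` and `demo_tbl` are illustrative names, not part of the original project.

```r
library(rvest)

# Hypothetical inline HTML standing in for the scraped ranking page
demo_html <- read_html('
  <table>
    <tr><th>Rank</th><th>Name</th><th>Score</th></tr>
    <tr><td>1</td><td>Eric Wilson</td><td>100.00</td></tr>
    <tr><td>2</td><td>Sandeep Raut</td><td>76.92</td></tr>
  </table>')

# Select every <table> node, keep the first one, and convert it to a data frame
demo_tbl <- demo_html %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

demo_tbl
```

Selecting the first node with `.[[1]]` passes a single table to `html_table()`, so the result is a data frame directly and does not need to be unwrapped from a list afterwards.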

Load data into R data structure

# `passwd` is assumed to be defined earlier (e.g., in a hidden setup chunk)
con <- dbConnect(MySQL(),                                # database connection
                 user="Data607", password = passwd,
                 dbname="project3", host="localhost")

sql <- "SELECT * FROM profile"                           # Fetch data from MySQL instance into a R dataframe
profile <- dbGetQuery(con, sql)
kable(profile)
Rank  Twitter Handle   Name                 Score  profileURL                                        tweeterURL
----  ---------------  ------------------  ------  ------------------------------------------------  ----------------------------------
   1  @ewilson1776     Eric Wilson         100.00  https://www.thinkers360.com/tl/profiles/view/126  https://twitter.com/ewilson1776
   2  @rautsan         Sandeep Raut         76.92  https://www.thinkers360.com/tl/SandeepRaut        https://twitter.com/rautsan
   3  @charlesaraujo   Charles Araujo       62.31  https://www.thinkers360.com/tl/profiles/view/107  https://twitter.com/charlesaraujo
   4  @tkspeaks        Thomas Koulopoulos   61.54  https://www.thinkers360.com/tl/profiles/view/109  https://twitter.com/tkspeaks
   6  @ihilgefort      Ingo Hilgefort       60.77  https://www.thinkers360.com/tl/profiles/view/92   https://twitter.com/ihilgefort
   7  @Ronald_vanLoon  Ronald van Loon      48.46  https://www.thinkers360.com/tl/profiles/view/110  https://twitter.com/Ronald_vanLoon
   8  @HarryTuttleOne  Ed Wakelam           39.23  https://www.thinkers360.com/tl/profiles/view/29   https://twitter.com/HarryTuttleOne
   9  @TrippBraden     Tripp Braden         36.92  https://www.thinkers360.com/tl/profiles/view/100  https://twitter.com/TrippBraden
  10  @dhinchcliffe    Dion Hinchclife      34.62  https://www.thinkers360.com/tl/profiles/view/141  https://twitter.com/dhinchcliffe
  11  @MarkDataDriven  Mark Reynolds        33.08  https://www.thinkers360.com/tl/profiles/view/281  https://twitter.com/MarkDataDriven
  12  @tomraftery      Tom Raftery          32.31  https://www.thinkers360.com/tl/profiles/view/225  https://twitter.com/tomraftery
  13  @jimmarous       Jim Marous           32.31  https://www.thinkers360.com/tl/profiles/view/132  https://twitter.com/jimmarous
  14  @hessiejones     Hessie Jones         31.54  https://www.thinkers360.com/tl/profiles/view/147  https://twitter.com/hessiejones
  15  @mkrigsman       Michael Krigsman     31.54  https://www.thinkers360.com/tl/profiles/view/137  https://twitter.com/mkrigsman
  16  @SetDeep         Deepak Seth          31.54  https://www.thinkers360.com/tl/profiles/view/157  https://twitter.com/SetDeep
  17  @mabstep         Matt Stephens        30.77  https://www.thinkers360.com/tl/profiles/view/8    https://twitter.com/mabstep
  18  @vsolank1        Vinay Solanki        30.77  https://www.thinkers360.com/tl/profiles/view/257  https://twitter.com/vsolank1
  19  @DaveAWhitney    Dave Whitney         30.77  https://www.thinkers360.com/tl/daveawhitney       https://twitter.com/DaveAWhitney
  20  @sallyeaves      Sally Eaves          30.77  https://www.thinkers360.com/tl/profiles/view/151  https://twitter.com/sallyeaves
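
The CSV-to-database-to-data-frame round trip described earlier can be sketched without a MySQL server. This example assumes an in-memory SQLite database (via the RSQLite package) as a stand-in for the project's MySQL instance, and a hypothetical two-row sample copied from the profile table; `sample_profile`, `demo_con`, and `top_scores` are illustrative names only.

```r
library(DBI)

# Hypothetical two-row sample copied from the profile table
sample_profile <- data.frame(
  Rank  = c(1L, 2L),
  Name  = c("Eric Wilson", "Sandeep Raut"),
  Score = c(100.00, 76.92),
  stringsAsFactors = FALSE
)

# Step 1: write the scraped data to a CSV file (a temporary file here)
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_profile, csv_path, row.names = FALSE)

# Step 2: import the CSV into a database table
# (assumption: in-memory SQLite replaces the MySQL connection)
demo_con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(demo_con, "profile", read.csv(csv_path, stringsAsFactors = FALSE))

# Step 3: fetch rows back into an R data frame with a parameterized query
top_scores <- dbGetQuery(
  demo_con,
  "SELECT Name, Score FROM profile WHERE Score > ? ORDER BY Score DESC",
  params = list(50)
)

dbDisconnect(demo_con)
top_scores
```

Because the DBI calls are the same across backends, swapping the `dbConnect()` line for the MySQL connection should leave the rest of the sketch unchanged, although the placeholder syntax for parameters can differ between drivers.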

Analysis

The thought leaders considered here were scored by top authorities. Based on those scores, we present a visual comparison using ggplot2.

# Horizontal bar chart of scores, ordered from lowest to highest
ggplot(profile, aes(x = reorder(Name, Score), y = Score)) +
    geom_bar(stat = "identity", position = position_dodge(), fill = "steelblue") +
    geom_text(aes(label = Score), vjust = .5, hjust = 1, position = position_dodge(width = 0.9), color = "black") +
    ggtitle("Top 20 Global Thought Leaders on Analytics") +
    xlab("Name") + ylab("Score") +
    coord_flip()

Visualization of LinkedIn Profiles

We have visualized their LinkedIn profiles and geographical locations.

sql <- "SELECT * FROM linkedin"                    # Store the SQL query in a variable
linkedin <- dbGetQuery(con, sql)                   # Fetch data from the MySQL instance into an R data frame
kable(linkedin)
Name               linkedinURL                                     DataLink$Location             linkSkill
-----------------  ----------------------------------------------  ----------------------------  -------------------------------------------------------------------------------
Charles Araujo     https://www.linkedin.com/in/charlesaraujo       New York, United States       IT Service Management | IT Strategy | Enterprise Software
Dave Whitney       https://www.linkedin.com/in/daveawhitney        Arizona, United States        Innovation | Healthcare Information Exchange | Radiology
Deepak Seth        https://www.linkedin.com/in/deepaksethitleader  Texas, United States          Business Intelligence | Strategy | Cross-functional Team Leadership
Dion Hinchcliffe   https://www.linkedin.com/in/dhinchcliffe        Washington DC, United States  Strategy | Enterprise Software | Enterprise Architecture
Eric Wilson        https://www.linkedin.com/in/wilsondemand        Kentucky, United States       Cross-functional Team Leadership | Forecasting | Process Improvement
Evan Sinar         https://www.linkedin.com/in/evansinar/          Pennsylvania, United States   Talent Management | Psychometrics | Leadership Development
Gregory Piatetsky  https://www.linkedin.com/in/gpiatetsky/         Massachusetts, United States  Data Mining | Business Analytics | Data Science | Predictive Analytics
Hessie Jones       https://www.linkedin.com/in/hessiejones1        Pickering, Canada             Digital Marketing | Social Media Marketing | Online Advertising
Jim Marous         https://www.linkedin.com/in/jimmarous           Ohio, United States           Fintech | Financial Services | Banking
Mark Reynolds      https://www.linkedin.com/in/profreynolds        Texas, United States          Digital Transformation | Enterprise Solution Architect | IoT and Edge Computing
Matt Stephens      https://www.linkedin.com/in/matt-stephens       Atlanta, United States        Communication | Community Outreach | Ideation
Michael Krigsman   https://www.linkedin.com/in/mkrigsman           Boston, United States         Enterprise Software | Strategy | CRM
Sally Eaves        https://www.linkedin.com/in/sally-eaves         London, United Kingdom        Social Media | Change Management | Start-ups
Sandeep Raut       https://www.linkedin.com/in/rautsandeep         Mumbai, India                 Business Intelligence | Analytics | Digital Transformation
Tamara Dull        https://www.linkedin.com/in/tamaradull/         California, United States     Product Management | Enterprise Software | Business Intelligence
Tripp Braden       https://www.linkedin.com/in/trippbraden         Michigan, United States       Strategy | Business Development | Strategic Partnerships
Vinay Solanki      https://www.linkedin.com/in/vinaysolanki        Delhi NCR, India              Team Management | Java | Business Analysis
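
The `linkSkill` column packs several skills into one pipe-delimited string, so one way to summarize it is to split each string and tally the individual skills. The sketch below is an illustration under stated assumptions, not the project's own analysis: it uses three rows copied from the table above, and `demo_linkedin`, `skill_list`, and `skill_counts` are hypothetical names.

```r
# Three rows copied from the linkedin table
demo_linkedin <- data.frame(
  Name = c("Charles Araujo", "Dion Hinchcliffe", "Michael Krigsman"),
  linkSkill = c(
    "IT Service Management | IT Strategy | Enterprise Software",
    "Strategy | Enterprise Software | Enterprise Architecture",
    "Enterprise Software | Strategy | CRM"
  ),
  stringsAsFactors = FALSE
)

# Split on the literal "|" and trim surrounding whitespace,
# which also handles entries written without spaces around the pipe
skill_list <- trimws(unlist(strsplit(demo_linkedin$linkSkill, "|", fixed = TRUE)))

# Count how often each skill appears, most frequent first
skill_counts <- sort(table(skill_list), decreasing = TRUE)
skill_counts
```

In this three-row sample, "Enterprise Software" appears three times and "Strategy" twice; running the same steps on the full `linkedin` data frame would give the tally across all listed profiles.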

What are the topics that data scientists care most about?

This section covers what data science thought leaders care about, such as skill sets, where data science jobs are or will be, and technology. Some of those points are listed directly below.

url_ds_cares <- 'https://www.innoarchitech.com/blog/what-is-data-science-does-data-scientist-do'
DS_Cares_HTML <- read_html(url_ds_cares)
DS_Cares_HTML_Para_Nodes <- DS_Cares_HTML %>% html_nodes('p')  %>% .[26:37] %>% html_text()
kable(DS_Cares_HTML_Para_Nodes, format = "html")
Prediction (predict a value based on inputs)
Classification (e.g., spam or not spam)
Recommendations (e.g., Amazon and Netflix recommendations)
Pattern detection and grouping (e.g., classification without known classes)
Anomaly detection (e.g., fraud detection)
Recognition (image, text, audio, video, facial, …)
Actionable insights (via dashboards, reports, visualizations, …)
Automated processes and decision-making (e.g., credit card approval)
Scoring and ranking (e.g., FICO score)
Segmentation (e.g., demographic-based marketing)
Optimization (e.g., risk management)
Forecasts (e.g., sales and revenue)
library(magick)
joblistings <- image_read('https://miro.medium.com/max/1094/1*3K7QnzBXI0Ys3NZgNRTezA.png')
print(joblistings)
## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      700    450 sRGB       FALSE    17850 72x72

generalskills <- image_read('https://miro.medium.com/max/1094/1*-oG0j_wGSW_9cNNs4_qgFQ.png')
print(generalskills)
## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      700    450 sRGB       FALSE    32080 72x72

skills <- image_read('https://miro.medium.com/max/1094/1*jnZT4gFAzScOJ_VnYsni0g.png')
print(skills)
## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      700    450 sRGB       FALSE    27391 72x72

technology <- image_read('https://miro.medium.com/max/1094/1*iueZKOOBidZtr-FTYyf6QA.png')
print(technology)
## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      700    450 sRGB       FALSE    28889 72x72

cloud <- image_read('https://www.octoparse.com/media/6032/blog.png?width=699&height=397')
print(cloud)
## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      699    397 sRGB       TRUE    263804 38x38

datascience <- image_read('https://blog.datasciencedojo.com/content/images/2019/06/Hippocratic-Oath-of-a-data-scientist-2.png')
print(datascience)
## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      518    800 sRGB       TRUE    233690 38x38

Here, we extracted the text from an online image using OCR. A more detailed analysis of this text could yield further information.

library(tesseract)
cat(image_ocr(datascience))
## OF A DATA SCIENTIST
## ON DATA
## {wll remember that data beats algorithm.
## ‘The quality of my model is going to be
## impacted by the quality, quantity, and
## variety of data vsed to buildit. | ON SIMPLICITY
## 1 ull remember that big data is just
## {00 | wil remember that 2 simpler model
## is better than a complex model, Unless |
## can Justify, | wll nt use complex tools,
## ON MODELING ‘techniques, or models.
## {will remember that my mode! will be
## used in ways | never intended. {will not
## ive people false comfort about the
## correctness of my mode ON BEAUTY
## 1 wll remember that | didn’t make this
## world and it doesnt satisfy my equations.
## The beauty of equations, theorems, and
## lemmas i deceptive
## ON BUSINESS VALUE ‘
## | will remember thatthe world does not
## care about my model unless Wt adds
## business value
## ON PREDICTIONS
## {ull remember that, 90% everything can be
## predicted. Even the "best" model wil lead
## to problems
## ON IMPACT
## My models may impact lives, society, and
## ‘the economy. | will ensure everyone is
## aware of the possible pitfalls of my model. | ON ETHICS
## | will remember that | may face ethical
## dilemmas in my pursuit for a better
## model. will ensure that ll use my best
## “sgment in abtaining data, building the
## model, understanding and communicating
## ON HUMILITY | 2nvbies, and defining metrics
## Lull remember thatthe business of data
## science is complex, | will accept that my
## model can, and will be wrong
## datasciencedojo

How do these change over time, and across geographical location?

Here is a brief history of how data science has evolved over the years. The geographical locations of some of the data scientists were shown earlier, in the LinkedIn table.

url_history <- 'https://www.dataversity.net/brief-history-data-science/'
DS_History_HTML <- read_html(url_history)
DS_History_HTML_Para_Nodes <- DS_History_HTML %>% html_nodes('p')  %>% .[13:23] %>% html_text()
kable(DS_History_HTML_Para_Nodes, format = "html")
In 2001, Software-as-a-Service (SaaS) was created. This was the pre-cursor to using Cloud-based applications.
In 2001, William S. Cleveland laid out plans for training Data Scientists to meet the needs of the future. He presented an action plan titled, Data Science: An Action Plan for Expanding the Technical Areas of the field of Statistics. It described how to increase the technical experience and range of data analysts and specified six areas of study for university departments. It promoted developing specific resources for research in each of the six areas. His plan also applies to government and corporate research.
In 2002, the International Council for Science: Committee on Data for Science and Technology began publishing the Data Science Journal, a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues.
In 2006, Hadoop 0.1.0, an open-source, non-relational database, was released. Hadoop was based on Nutch, another open-source database.
In 2008, the title, “Data Scientist” became a buzzword, and eventually a part of the language. DJ Patil and Jeff Hammerbacher, of LinkedIn and Facebook, are given credit for initiating its use as a buzzword.
In 2009, the term NoSQL was reintroduced (a variation had been used since 1998) by Johan Oskarsson, when he organized a discussion on “open-source, non-relational databases”.
In 2011, job listings for Data Scientists increased by 15,000%. There was also an increase in seminars and conferences devoted specifically to Data Science and Big Data. Data Science had proven itself to be a source of profits and had become a part of corporate culture.
In 2011, James Dixon, CTO of Pentaho promoted the concept of Data Lakes, rather than Data Warehouses. Dixon stated the difference between a Data Warehouse and a Data Lake is that the Data Warehouse pre-categorizes the data at the point of entry, wasting time and energy, while a Data Lake accepts the information using a non-relational database (NoSQL) and does not categorize the data, but simply stores it.
In 2013, IBM shared statistics showing 90% of the data in the world had been created within the last two years.
In 2015, using Deep Learning techniques, Google’s speech recognition, Google Voice, experienced a dramatic performance jump of 49 percent.
In 2015, Bloomberg’s Jack Clark, wrote that it had been a landmark year for Artificial Intelligence (AI). Within Google, the total of software projects using AI increased from “sporadic usage” to more than 2,700 projects over the year.
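
Since each scraped paragraph starts with "In <year>,", the timeline can be turned into structured data by extracting the year with a regular expression. The following is a minimal sketch, assuming a few excerpts from the paragraphs above (first sentences only); `demo_history`, `event_year`, and `events_per_year` are hypothetical names.

```r
# Excerpts (first sentences) copied from the scraped history paragraphs
demo_history <- c(
  "In 2001, Software-as-a-Service (SaaS) was created.",
  "In 2001, William S. Cleveland laid out plans for training Data Scientists to meet the needs of the future.",
  "In 2006, Hadoop 0.1.0, an open-source, non-relational database, was released.",
  "In 2013, IBM shared statistics showing 90% of the data in the world had been created within the last two years."
)

# Capture the four-digit year that follows the leading "In "
event_year <- as.integer(sub("^In ([0-9]{4}),.*$", "\\1", demo_history))

# Tally the number of events recorded for each year
events_per_year <- table(event_year)
events_per_year
```

Applied to the full `DS_History_HTML_Para_Nodes` vector, the same pattern would give a year for every paragraph that begins this way; paragraphs with a different opening would pass through `sub()` unchanged and become `NA` after `as.integer()`, with a coercion warning.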

Conclusion

In conclusion, we were able to build a table of more than 20 of today’s thought leaders in data science. These experts are actively keeping up with new developments in:

  • Analysis,
  • Machine Learning,
  • Statistics,
  • Computer Science,
  • Artificial Intelligence,
  • Visual Analytics

Moreover, some of the non-technical skills that are greatly needed and valued include:

  • excellent strategy,
  • communication,
  • research,
  • leadership skills, and so on.

Some of the important technical skills are:

  • Python,
  • R,
  • SQL,
  • Hadoop,
  • Spark,
  • Business Intelligence, and more.