Project 3

library(XML)
library(xml2)
library(tidyverse)
library(rvest)
library(stringr)
library(RMySQL)
library(knitr)

GitHub and project documentation

The datasets and .rmd file used in this project can be found at: https://github.com/forhadakbar/data607fall2019/tree/master/Week%2008 and https://github.com/ShovanBiswas/DATA607/tree/master/Week8

Overview

This project identifies today’s thought leaders in data science, the topics they care most about, and how the field has changed over the years. To do this, we surveyed the web, including social media, job sites, LinkedIn profiles of influential data scientists, and data science articles, and extracted the information into a database. We scraped a number of pages, analyzed the data, and presented the results visually.

Motivation

Team FABS chose “Alternative Project 3 – Data Science Thought Leadership” because we felt it would be challenging to scrape various web pages, analyze the information, and present it in a visual format.

Team Work & Communication

Initially, we tried to form a larger team, but found that many of the members were already taken. We therefore formed a team of two, called FABS. We communicated over email, phone, and WhatsApp to chart out a plan, a research procedure, and a division of work. We then worked collaboratively using TeamViewer.

Approach

To answer the key questions of this assignment, we needed a list of thought leaders in data science. We therefore scraped websites that met the following criteria:

Provides a professional profile of the thought leader, on LinkedIn or other social media.
Provides their geographical location.
Provides a description of their expertise.

Based on these criteria, we charted the steps displayed in the following workflow.

setwd('D:/CUNY-DataScience/Fall 2019/Data Acquisition and Management DATA 607/Week 08')

Tools

The tools used for this project are R, Tableau, MySQL, and SelectorGadget, which is a Chrome extension.

Who are today’s “thought leaders” in data science?

We scraped a site listing the top 20 thought leaders in data science, then imported the list into a MySQL database, from which we extracted and analyzed the data. Scraping the Twitter links and a host of child URLs was more complex, so we handled it in a separate supporting RMD, which first wrote the extracted information to a CSV file that we then imported into the MySQL database.

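# Download the ranking page and parse its first HTML table
# (read_html() has no stringsAsFactors argument, so it is dropped here)
page <- read_html("https://www.thinkers360.com/top-20-global-thought-leaders-on-analytics-july-2018/")
readurl <- page %>%
        html_nodes("table") %>%
        .[1] %>%
        html_table(fill = TRUE)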

top20 <- as_tibble(readurl[[1]])
top20
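The supporting RMD is not reproduced here, so the following is only a minimal sketch of the CSV step described above. It uses two rows copied from the profile table shown later in this report; the column name `Handle`, the object name `top20_sample`, and the rule for building `tweeterURL` (dropping the leading "@" from the handle) are our assumptions for illustration, not necessarily what the original code did.

```r
# Two sample rows taken from the profile table in this report
top20_sample <- data.frame(
  Rank   = c(1, 2),
  Handle = c("@ewilson1776", "@rautsan"),   # hypothetical column name
  Name   = c("Eric Wilson", "Sandeep Raut"),
  stringsAsFactors = FALSE
)

# Assumed rule: the Twitter URL is the handle without its leading "@"
top20_sample$tweeterURL <- paste0("https://twitter.com/",
                                  sub("^@", "", top20_sample$Handle))

# Write to a temporary CSV, then read it back to confirm the round trip
csv_path <- tempfile(fileext = ".csv")
write.csv(top20_sample, csv_path, row.names = FALSE)
reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
print(reloaded$tweeterURL)

# With an open DBI connection `con`, the data frame could then be loaded with:
# dbWriteTable(con, "profile", reloaded, row.names = FALSE, overwrite = TRUE)
```

Going through a CSV file keeps the scraping step and the database load independent, so the load can be repeated without scraping again.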

Load data into R data structure

# passwd is assumed to be defined earlier in the session; it is not set in this document
con <- dbConnect(MySQL(),                                # database connection
                 user="Data607", password = passwd,
                 dbname="project3", host="localhost")

sql <- "SELECT * FROM profile"                           # Fetch data from the MySQL instance into an R data frame
profile <- dbGetQuery(con, sql)
kable(profile)

Rank	Twitter Handle	Name	Score	profileURL	tweeterURL
1	@ewilson1776	Eric Wilson	100.00	https://www.thinkers360.com/tl/profiles/view/126	https://twitter.com/ewilson1776
2	@rautsan	Sandeep Raut	76.92	https://www.thinkers360.com/tl/SandeepRaut	https://twitter.com/rautsan
3	@charlesaraujo	Charles Araujo	62.31	https://www.thinkers360.com/tl/profiles/view/107	https://twitter.com/charlesaraujo
4	@tkspeaks	Thomas Koulopoulos	61.54	https://www.thinkers360.com/tl/profiles/view/109	https://twitter.com/tkspeaks
6	@ihilgefort	Ingo Hilgefort	60.77	https://www.thinkers360.com/tl/profiles/view/92	https://twitter.com/ihilgefort
7	@Ronald_vanLoon	Ronald van Loon	48.46	https://www.thinkers360.com/tl/profiles/view/110	https://twitter.com/Ronald_vanLoon
8	@HarryTuttleOne	Ed Wakelam	39.23	https://www.thinkers360.com/tl/profiles/view/29	https://twitter.com/HarryTuttleOne
9	@TrippBraden	Tripp Braden	36.92	https://www.thinkers360.com/tl/profiles/view/100	https://twitter.com/TrippBraden
10	@dhinchcliffe	Dion Hinchclife	34.62	https://www.thinkers360.com/tl/profiles/view/141	https://twitter.com/dhinchcliffe
11	@MarkDataDriven	Mark Reynolds	33.08	https://www.thinkers360.com/tl/profiles/view/281	https://twitter.com/MarkDataDriven
12	@tomraftery	Tom Raftery	32.31	https://www.thinkers360.com/tl/profiles/view/225	https://twitter.com/tomraftery
13	@jimmarous	Jim Marous	32.31	https://www.thinkers360.com/tl/profiles/view/132	https://twitter.com/jimmarous
14	@hessiejones	Hessie Jones	31.54	https://www.thinkers360.com/tl/profiles/view/147	https://twitter.com/hessiejones
15	@mkrigsman	Michael Krigsman	31.54	https://www.thinkers360.com/tl/profiles/view/137	https://twitter.com/mkrigsman
16	@SetDeep	Deepak Seth	31.54	https://www.thinkers360.com/tl/profiles/view/157	https://twitter.com/SetDeep
17	@mabstep	Matt Stephens	30.77	https://www.thinkers360.com/tl/profiles/view/8	https://twitter.com/mabstep
18	@vsolank1	Vinay Solanki	30.77	https://www.thinkers360.com/tl/profiles/view/257	https://twitter.com/vsolank1
19	@DaveAWhitney	Dave Whitney	30.77	https://www.thinkers360.com/tl/daveawhitney	https://twitter.com/DaveAWhitney
20	@sallyeaves	Sally Eaves	30.77	https://www.thinkers360.com/tl/profiles/view/151	https://twitter.com/sallyeaves
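The table above skips rank 5, so it has 19 rows rather than 20. A check along the following lines could flag such gaps right after loading. This is a sketch that uses the ranks as printed above instead of a live query; `ranks` is a stand-in for `profile$Rank`.

```r
# Ranks as they appear in the profile table above (stand-in for profile$Rank)
ranks <- c(1:4, 6:20)

# Ranks expected in a top-20 list but absent from the loaded data
missing_ranks <- setdiff(1:20, ranks)
print(missing_ranks)

# Number of rows actually loaded
print(length(ranks))
```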

Analysis

Each thought leader on the list carries a score assigned by the ranking source. We used these scores to build a visual comparison with ggplot2.

ggplot(profile, aes(x = reorder(Name, Score), y = Score)) +
    geom_bar(stat = "identity", position = position_dodge(), fill = "steelblue") +
    geom_text(aes(label = Score), vjust = .5, hjust = 1, position = position_dodge(width = 0.9), color = "black") +
    ggtitle("Top 20 Global Thought Leaders on Analytics") +
    xlab("Name") + ylab("Score") +
    coord_flip()

Visualization of LinkedIn Profiles

Next, we looked at their LinkedIn profiles and geographical locations.

sql <- "SELECT * FROM linkedin"                    # Store the SQL query in a variable
linkedin <- dbGetQuery(con, sql)                   # Fetch data from the MySQL instance into an R data frame
kable(linkedin)

Name	linkedinURL	DataLink$Location	linkSkill
Charles Araujo	https://www.linkedin.com/in/charlesaraujo	New York, United States	IT Service Management | IT Strategy | Enterprise Software
Dave Whitney	https://www.linkedin.com/in/daveawhitney	Arizona, United States	Innovation | Healthcare Information Exchange | Radiology
Deepak Seth	https://www.linkedin.com/in/deepaksethitleader	Texas, United States	Business Intelligence | Strategy | Cross-functional Team Leadership
Dion Hinchcliffe	https://www.linkedin.com/in/dhinchcliffe	Washington DC, United States	Strategy | Enterprise Software | Enterprise Architecture
Eric Wilson	https://www.linkedin.com/in/wilsondemand	Kentucky, United States	Cross-functional Team Leadership | Forecasting | Process Improvement
Evan Sinar	https://www.linkedin.com/in/evansinar/	Pennsylvania, United States	Talent Management | Psychometrics | Leadership Development
Gregory Piatetsky	https://www.linkedin.com/in/gpiatetsky/	Massachusetts, United States	Data Mining | Business Analytics | Data Science | Predictive Analytics
Hessie Jones	https://www.linkedin.com/in/hessiejones1	Pickering, Canada	Digital Marketing | Social Media Marketing | Online Advertising
Jim Marous	https://www.linkedin.com/in/jimmarous	Ohio, United States	Fintech | Financial Services | Banking
Mark Reynolds	https://www.linkedin.com/in/profreynolds	Texas, United States	Digital Transformation | Enterprise Solution Architect | IoT and Edge Computing
Matt Stephens	https://www.linkedin.com/in/matt-stephens	Atlanta, United States	Communication | Community Outreach | Ideation
Michael Krigsman	https://www.linkedin.com/in/mkrigsman	Boston, United States	Enterprise Software | Strategy | CRM
Sally Eaves	https://www.linkedin.com/in/sally-eaves	London, United Kingdom	Social Media | Change Management | Start-ups
Sandeep Raut	https://www.linkedin.com/in/rautsandeep	Mumbai, India	Business Intelligence | Analytics | Digital Transformation
Tamara Dull	https://www.linkedin.com/in/tamaradull/	California, United States	Product Management | Enterprise Software | Business Intelligence
Tripp Braden	https://www.linkedin.com/in/trippbraden	Michigan, United States	Strategy | Business Development | Strategic Partnerships
Vinay Solanki	https://www.linkedin.com/in/vinaysolanki	Delhi NCR, India	Team Management | Java | Business Analysis
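The location and skill columns can be summarized with a few lines of base R. The sketch below is illustrative only. The `location` vector copies the 17 locations from the table above, `link_skill` copies four of the skill strings, and we assume the stored values separate skills with a plain `|` character. Against the live data, `linkedin$linkSkill` and the location column would take their place.

```r
# Locations copied from the linkedin table above ("Place, Country")
location <- c(
  "New York, United States", "Arizona, United States", "Texas, United States",
  "Washington DC, United States", "Kentucky, United States",
  "Pennsylvania, United States", "Massachusetts, United States",
  "Pickering, Canada", "Ohio, United States", "Texas, United States",
  "Atlanta, United States", "Boston, United States", "London, United Kingdom",
  "Mumbai, India", "California, United States", "Michigan, United States",
  "Delhi NCR, India"
)

# Keep only the part after the last comma, i.e. the country
country <- trimws(sub("^.*,", "", location))
country_counts <- sort(table(country), decreasing = TRUE)
print(country_counts)

# Four skill strings copied from the same table (assumed "|" separator)
link_skill <- c(
  "IT Service Management | IT Strategy | Enterprise Software",
  "Strategy | Enterprise Software | Enterprise Architecture",
  "Enterprise Software | Strategy | CRM",
  "Product Management | Enterprise Software | Business Intelligence"
)

# Split each string on the pipe, trim whitespace, and tally the skills
skills <- trimws(unlist(strsplit(link_skill, "|", fixed = TRUE)))
skill_counts <- sort(table(skills), decreasing = TRUE)
print(skill_counts)
```

The country tally shows how strongly the list leans toward the United States, which bears on the geographical question asked later in this report.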

What are the topics that data scientists care most about?

Below, we discuss what data science thought leaders care about, e.g. skill sets, where data science jobs are or will be, and technology. We start with a list of some of these points.

url_ds_cares <- 'https://www.innoarchitech.com/blog/what-is-data-science-does-data-scientist-do'
DS_Cares_HTML <- read_html(url_ds_cares)
DS_Cares_HTML_Para_Nodes <- DS_Cares_HTML %>% html_nodes('p')  %>% .[26:37] %>% html_text()
kable(DS_Cares_HTML_Para_Nodes, format = "html")

x
Prediction (predict a value based on inputs)
Classification (e.g., spam or not spam)
Recommendations (e.g., Amazon and Netflix recommendations)
Pattern detection and grouping (e.g., classification without known classes)
Anomaly detection (e.g., fraud detection)
Recognition (image, text, audio, video, facial, …)
Actionable insights (via dashboards, reports, visualizations, …)
Automated processes and decision-making (e.g., credit card approval)
Scoring and ranking (e.g., FICO score)
Segmentation (e.g., demographic-based marketing)
Optimization (e.g., risk management)
Forecasts (e.g., sales and revenue)

library(magick)

## Linking to ImageMagick 6.9.9.14
## Enabled features: cairo, freetype, fftw, ghostscript, lcms, pango, rsvg, webp
## Disabled features: fontconfig, x11

joblistings <- image_read('https://miro.medium.com/max/1094/1*3K7QnzBXI0Ys3NZgNRTezA.png')
print(joblistings)

## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      700    450 sRGB       FALSE    17850 72x72

generalskills <- image_read('https://miro.medium.com/max/1094/1*-oG0j_wGSW_9cNNs4_qgFQ.png')
print(generalskills)

## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      700    450 sRGB       FALSE    32080 72x72

skills <- image_read('https://miro.medium.com/max/1094/1*jnZT4gFAzScOJ_VnYsni0g.png')
print(skills)

## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      700    450 sRGB       FALSE    27391 72x72

technology <- image_read('https://miro.medium.com/max/1094/1*iueZKOOBidZtr-FTYyf6QA.png')
print(technology)

## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      700    450 sRGB       FALSE    28889 72x72

cloud <- image_read('https://www.octoparse.com/media/6032/blog.png?width=699&height=397')
print(cloud)

## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      699    397 sRGB       TRUE    263804 38x38

datascience <- image_read('https://blog.datasciencedojo.com/content/images/2019/06/Hippocratic-Oath-of-a-data-scientist-2.png')
print(datascience)

## # A tibble: 1 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 PNG      518    800 sRGB       TRUE    233690 38x38

Here, we extracted the text from an online image; a more detailed analysis of this text could yield further information.

library(tesseract)
cat(image_ocr(datascience))

## OF A DATA SCIENTIST
## ON DATA
## {wll remember that data beats algorithm.
## ‘The quality of my model is going to be
## impacted by the quality, quantity, and
## variety of data vsed to buildit. | ON SIMPLICITY
## 1 ull remember that big data is just
## {00 | wil remember that 2 simpler model
## is better than a complex model, Unless |
## can Justify, | wll nt use complex tools,
## ON MODELING ‘techniques, or models.
## {will remember that my mode! will be
## used in ways | never intended. {will not
## ive people false comfort about the
## correctness of my mode ON BEAUTY
## 1 wll remember that | didn’t make this
## world and it doesnt satisfy my equations.
## The beauty of equations, theorems, and
## lemmas i deceptive
## ON BUSINESS VALUE ‘
## | will remember thatthe world does not
## care about my model unless Wt adds
## business value
## ON PREDICTIONS
## {ull remember that, 90% everything can be
## predicted. Even the "best" model wil lead
## to problems
## ON IMPACT
## My models may impact lives, society, and
## ‘the economy. | will ensure everyone is
## aware of the possible pitfalls of my model. | ON ETHICS
## | will remember that | may face ethical
## dilemmas in my pursuit for a better
## model. will ensure that ll use my best
## “sgment in abtaining data, building the
## model, understanding and communicating
## ON HUMILITY | 2nvbies, and defining metrics
## Lull remember thatthe business of data
## science is complex, | will accept that my
## model can, and will be wrong
## datasciencedojo
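The OCR output is noisy, but the section headings ("ON DATA", "ON ETHICS", and so on) survive reasonably well. As one possible first step toward the more detailed analysis mentioned above, the sketch below pulls those headings out with a regular expression. It runs on a few lines copied from the output above and stored in `ocr_lines`, a name we introduce for illustration. In practice the input would be the result of `image_ocr(datascience)` split on newlines.

```r
# A few lines copied from the OCR output above
ocr_lines <- c(
  "ON DATA",
  "variety of data vsed to buildit. | ON SIMPLICITY",
  "correctness of my mode ON BEAUTY",
  "ON PREDICTIONS",
  "ON IMPACT",
  "aware of the possible pitfalls of my model. | ON ETHICS",
  "ON HUMILITY | 2nvbies, and defining metrics"
)

# A heading is "ON" followed by one or more all-caps words
pattern <- "\\bON [A-Z]+( [A-Z]+)*\\b"
headings <- unlist(regmatches(ocr_lines, gregexpr(pattern, ocr_lines, perl = TRUE)))
print(headings)
```

The same pattern would miss a heading that the OCR step misread, so the result should be checked against the image.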

How do these change over time, and across geographical location?

Here is a brief history of how data science has evolved over the years. We already saw the geographical locations of some of the data scientists above.

url_history <- 'https://www.dataversity.net/brief-history-data-science/'
DS_History_HTML <- read_html(url_history)
DS_History_HTML_Para_Nodes <- DS_History_HTML %>% html_nodes('p')  %>% .[13:23] %>% html_text()
kable(DS_History_HTML_Para_Nodes, format = "html")

x
In 2001, Software-as-a-Service (SaaS) was created. This was the pre-cursor to using Cloud-based applications.
In 2001, William S. Cleveland laid out plans for training Data Scientists to meet the needs of the future. He presented an action plan titled, Data Science: An Action Plan for Expanding the Technical Areas of the field of Statistics. It described how to increase the technical experience and range of data analysts and specified six areas of study for university departments. It promoted developing specific resources for research in each of the six areas. His plan also applies to government and corporate research.
In 2002, the International Council for Science: Committee on Data for Science and Technology began publishing the Data Science Journal, a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues.
In 2006, Hadoop 0.1.0, an open-source, non-relational database, was released. Hadoop was based on Nutch, another open-source database.
In 2008, the title, “Data Scientist” became a buzzword, and eventually a part of the language. DJ Patil and Jeff Hammerbacher, of LinkedIn and Facebook, are given credit for initiating its use as a buzzword.
In 2009, the term NoSQL was reintroduced (a variation had been used since 1998) by Johan Oskarsson, when he organized a discussion on “open-source, non-relational databases”.
In 2011, job listings for Data Scientists increased by 15,000%. There was also an increase in seminars and conferences devoted specifically to Data Science and Big Data. Data Science had proven itself to be a source of profits and had become a part of corporate culture.
In 2011, James Dixon, CTO of Pentaho promoted the concept of Data Lakes, rather than Data Warehouses. Dixon stated the difference between a Data Warehouse and a Data Lake is that the Data Warehouse pre-categorizes the data at the point of entry, wasting time and energy, while a Data Lake accepts the information using a non-relational database (NoSQL) and does not categorize the data, but simply stores it.
In 2013, IBM shared statistics showing 90% of the data in the world had been created within the last two years.
In 2015, using Deep Learning techniques, Google’s speech recognition, Google Voice, experienced a dramatic performance jump of 49 percent.
In 2015, Bloomberg’s Jack Clark, wrote that it had been a landmark year for Artificial Intelligence (AI). Within Google, the total of software projects using AI increased from “sporadic usage” to more than 2,700 projects over the year.
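Because every scraped paragraph begins with "In <year>,", the text can be turned into a small timeline table. The following is a sketch under that assumption. The `events` vector holds the first sentence of five of the paragraphs above. With the live scrape, `DS_History_HTML_Para_Nodes` would be used instead.

```r
# First sentences of five of the scraped paragraphs above
events <- c(
  "In 2001, Software-as-a-Service (SaaS) was created.",
  "In 2001, William S. Cleveland laid out plans for training Data Scientists to meet the needs of the future.",
  "In 2006, Hadoop 0.1.0, an open-source, non-relational database, was released.",
  "In 2011, job listings for Data Scientists increased by 15,000%.",
  "In 2013, IBM shared statistics showing 90% of the data in the world had been created within the last two years."
)

# Extract the four-digit year that follows the leading "In "
years <- as.integer(sub("^In ([0-9]{4}),.*$", "\\1", events))

# One row per event, plus a count of events per year
timeline <- data.frame(year = years, event = events, stringsAsFactors = FALSE)
events_per_year <- table(timeline$year)
print(events_per_year)
```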

Conclusion

In conclusion, we were able to build a table of more than 20 of today’s thought leaders in data science. These experts actively keep up with new developments in:

Analysis,
Machine Learning,
Statistics,
Computer Science,
Artificial Intelligence, and
Visual Analytics.

Moreover, some of the non-technical skills that are greatly needed and valued include:

excellent strategy,
communication,
research,
leadership skills, and so on.

Some of the important technical skills are:

Python,
R,
SQL,
Hadoop,
Spark,
Business Intelligence, and more.