library(XML)        # XML parsing
library(xml2)       # reading HTML/XML documents
library(tidyverse)  # data manipulation and visualization
library(rvest)      # web scraping
library(stringr)    # string handling
library(RMySQL)     # MySQL database access
library(knitr)      # table rendering with kable()
The datasets and .Rmd files used in this project can be found at: https://github.com/forhadakbar/data607fall2019/tree/master/Week%2008 and https://github.com/ShovanBiswas/DATA607/tree/master/Week8
In this project, we show who the thought leaders of Data Science are, what they primarily care about, and how the field has changed over the years. To implement this project, we surveyed the web, including social media, job sites, LinkedIn profiles of influential data scientists, and data-science-related articles, and extracted information to create a database. We scraped many pages, analyzed the data, and presented the content in a visually appealing form.
Team FABS chose “Alternative Project 3 – Data Science Thought Leadership” because we felt it would be challenging to scrape various web pages, analyze the information, and put it together in a visually presentable format.
Initially, we tried to form a larger team, but found that many prospective members were already taken. So we eventually formed a team of two, called FABS. We communicated over email, phone, and WhatsApp and charted out a plan, a research procedure, and a division of work. We then worked collaboratively using TeamViewer.
To answer the key questions of this assignment, we needed a list of thought leaders in data science. So we scraped websites that met the following criteria:
Based on these criteria, we charted certain steps, which are displayed in the following workflow.
setwd('D:/CUNY-DataScience/Fall 2019/Data Acquisition and Management DATA 607/Week 08')  # local working directory; adjust for your environment
The tools used for this project are R, Tableau, SelectorGadget (a Chrome extension), and a MySQL database.
We scraped a site and culled the top 20 thought leaders in data science, then imported them into a MySQL database, from which we extracted and analyzed them. However, for the more complex scraping tasks, pulling Twitter links and a host of child URLs, we used a separate supporting RMD, in which we first wrote the extracted information to a CSV file and then imported it into the MySQL database.
# Scrape the "Top 20 Global Thought Leaders on Analytics" table from Thinkers360
url <- read_html("https://www.thinkers360.com/top-20-global-thought-leaders-on-analytics-july-2018/")
readurl <- url %>%
  html_nodes("table") %>%
  .[1] %>%                      # keep the first (and only relevant) table
  html_table(fill = TRUE)
top20 <- as_tibble(readurl[[1]])
top20
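The supporting RMD itself is not reproduced here. A minimal sketch of the write-to-CSV-then-import workflow it describes might look like the following; the file name top20_thought_leaders.csv and the staging table name top20_staging are illustrative assumptions, not the project's actual names.

# Sketch only: assumes the `project3` schema exists and `passwd` is already defined
write_csv(top20, "top20_thought_leaders.csv")               # persist the scraped table to CSV

con_stage <- dbConnect(MySQL(), user = "Data607", password = passwd,
                       dbname = "project3", host = "localhost")
staged <- read_csv("top20_thought_leaders.csv")             # read the CSV back in
dbWriteTable(con_stage, name = "top20_staging", value = as.data.frame(staged),
             row.names = FALSE, overwrite = TRUE)           # import the CSV into MySQL
dbDisconnect(con_stage)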
con <- dbConnect(MySQL(),                  # database connection (passwd is assumed to be defined earlier)
                 user = "Data607", password = passwd,
                 dbname = "project3", host = "localhost")
sql <- "SELECT * FROM profile"             # fetch data from the MySQL instance into an R data frame
profile <- dbGetQuery(con, sql)
kable(profile)
The thought leaders who were considered were scored by top authorities. Based on their scores, we present a visual analysis with ggplot2.
ggplot(profile, aes(x = reorder(Name,Score), y = Score)) +
geom_bar(stat = "identity", position = position_dodge(), fill="steelblue") +
geom_text(aes(label = Score), vjust = .5, hjust = 1, position = position_dodge(width = 0.9), color = "black") +
ggtitle("Top 20 Global Thought Leaders on Analytics") +
xlab("Name") + ylab("Score") +
coord_flip()
We have visualized their LinkedIn profiles and geographical locations.
sql<- "SELECT * FROM linkedin" #Store sql query in a variable
linkedin<- dbGetQuery(con, sql) #Fetch data from MySQL instance into a R dataframe
kable(linkedin)
| Name | LinkedIn URL | Location | Skills |
|---|---|---|---|
| Charles Araujo | https://www.linkedin.com/in/charlesaraujo | New York, United States | IT Service Management; IT Strategy; Enterprise Software |
| Dave Whitney | https://www.linkedin.com/in/daveawhitney | Arizona, United States | Innovation; Healthcare Information Exchange; Radiology |
| Deepak Seth | https://www.linkedin.com/in/deepaksethitleader | Texas, United States | Business Intelligence; Strategy; Cross-functional Team Leadership |
| Dion Hinchcliffe | https://www.linkedin.com/in/dhinchcliffe | Washington DC, United States | Strategy; Enterprise Software; Enterprise Architecture |
| Eric Wilson | https://www.linkedin.com/in/wilsondemand | Kentucky, United States | Cross-functional Team Leadership; Forecasting; Process Improvement |
| Evan Sinar | https://www.linkedin.com/in/evansinar/ | Pennsylvania, United States | Talent Management; Psychometrics; Leadership Development |
| Gregory Piatetsky | https://www.linkedin.com/in/gpiatetsky/ | Massachusetts, United States | Data Mining; Business Analytics; Data Science; Predictive Analytics |
| Hessie Jones | https://www.linkedin.com/in/hessiejones1 | Pickering, Canada | Digital Marketing; Social Media Marketing; Online Advertising |
| Jim Marous | https://www.linkedin.com/in/jimmarous | Ohio, United States | Fintech; Financial Services; Banking |
| Mark Reynolds | https://www.linkedin.com/in/profreynolds | Texas, United States | Digital Transformation; Enterprise Solution Architect; IoT and Edge Computing |
| Matt Stephens | https://www.linkedin.com/in/matt-stephens | Atlanta, United States | Communication; Community Outreach; Ideation |
| Michael Krigsman | https://www.linkedin.com/in/mkrigsman | Boston, United States | Enterprise Software; Strategy; CRM |
| Sally Eaves | https://www.linkedin.com/in/sally-eaves | London, United Kingdom | Social Media; Change Management; Start-ups |
| Sandeep Raut | https://www.linkedin.com/in/rautsandeep | Mumbai, India | Business Intelligence; Analytics; Digital Transformation |
| Tamara Dull | https://www.linkedin.com/in/tamaradull/ | California, United States | Product Management; Enterprise Software; Business Intelligence |
| Tripp Braden | https://www.linkedin.com/in/trippbraden | Michigan, United States | Strategy; Business Development; Strategic Partnerships |
| Vinay Solanki | https://www.linkedin.com/in/vinaysolanki | Delhi NCR, India | Team Management; Java; Business Analysis |
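As a possible extension (not part of the original analysis), the multi-valued skills column could be split into one row per skill and tallied to see which skills recur most among these thought leaders. A minimal sketch, assuming the linkedin data frame has a linkSkill column with "|"-separated values:

# Sketch only: split the pipe-separated skills and count their frequency
skill_counts <- linkedin %>%
  separate_rows(linkSkill, sep = "\\|") %>%    # one row per individual skill
  mutate(linkSkill = str_trim(linkSkill)) %>%  # trim surrounding whitespace
  count(linkSkill, sort = TRUE)                # most common skills first
head(skill_counts)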
Below, we discuss what data science thought leaders care about, e.g., skill sets, where the DS jobs are now and will be in the future, technology, and so on. The list that follows captures some of those points.
# Scrape the paragraphs (nodes 26-37) that list what data science is used for
url_ds_cares <- 'https://www.innoarchitech.com/blog/what-is-data-science-does-data-scientist-do'
DS_Cares_HTML <- read_html(url_ds_cares)
DS_Cares_HTML_Para_Nodes <- DS_Cares_HTML %>% html_nodes('p') %>% .[26:37] %>% html_text()
kable(DS_Cares_HTML_Para_Nodes, format = "html")
| Data science use cases |
|---|
| Prediction (predict a value based on inputs) |
| Classification (e.g., spam or not spam) |
| Recommendations (e.g., Amazon and Netflix recommendations) |
| Pattern detection and grouping (e.g., classification without known classes) |
| Anomaly detection (e.g., fraud detection) |
| Recognition (image, text, audio, video, facial, …) |
| Actionable insights (via dashboards, reports, visualizations, …) |
| Automated processes and decision-making (e.g., credit card approval) |
| Scoring and ranking (e.g., FICO score) |
| Segmentation (e.g., demographic-based marketing) |
| Optimization (e.g., risk management) |
| Forecasts (e.g., sales and revenue) |
library(magick)
## Linking to ImageMagick 6.9.9.14
## Enabled features: cairo, freetype, fftw, ghostscript, lcms, pango, rsvg, webp
## Disabled features: fontconfig, x11
joblistings <- image_read('https://miro.medium.com/max/1094/1*3K7QnzBXI0Ys3NZgNRTezA.png')    # image of data science job listings
print(joblistings)
## # A tibble: 1 x 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 PNG 700 450 sRGB FALSE 17850 72x72
generalskills <- image_read('https://miro.medium.com/max/1094/1*-oG0j_wGSW_9cNNs4_qgFQ.png')  # image of general skills sought in data scientists
print(generalskills)
## # A tibble: 1 x 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 PNG 700 450 sRGB FALSE 32080 72x72
skills <- image_read('https://miro.medium.com/max/1094/1*jnZT4gFAzScOJ_VnYsni0g.png')         # image of data science skills
print(skills)
## # A tibble: 1 x 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 PNG 700 450 sRGB FALSE 27391 72x72
technology <- image_read('https://miro.medium.com/max/1094/1*iueZKOOBidZtr-FTYyf6QA.png')     # image of technologies used in data science
print(technology)
## # A tibble: 1 x 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 PNG 700 450 sRGB FALSE 28889 72x72
cloud <- image_read('https://www.octoparse.com/media/6032/blog.png?width=699&height=397')     # supporting image from the Octoparse blog
print(cloud)
## # A tibble: 1 x 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 PNG 699 397 sRGB TRUE 263804 38x38
datascience <- image_read('https://blog.datasciencedojo.com/content/images/2019/06/Hippocratic-Oath-of-a-data-scientist-2.png')  # "Hippocratic Oath of a Data Scientist" infographic (OCR'd below)
print(datascience)
## # A tibble: 1 x 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 PNG 518 800 sRGB TRUE 233690 38x38
Here, we extracted the text from an online image via OCR; with more detailed analysis, this text could yield further information.
library(tesseract)
cat(image_ocr(datascience))
## OF A DATA SCIENTIST
## ON DATA
## {wll remember that data beats algorithm.
## ‘The quality of my model is going to be
## impacted by the quality, quantity, and
## variety of data vsed to buildit. | ON SIMPLICITY
## 1 ull remember that big data is just
## {00 | wil remember that 2 simpler model
## is better than a complex model, Unless |
## can Justify, | wll nt use complex tools,
## ON MODELING ‘techniques, or models.
## {will remember that my mode! will be
## used in ways | never intended. {will not
## ive people false comfort about the
## correctness of my mode ON BEAUTY
## 1 wll remember that | didn’t make this
## world and it doesnt satisfy my equations.
## The beauty of equations, theorems, and
## lemmas i deceptive
## ON BUSINESS VALUE ‘
## | will remember thatthe world does not
## care about my model unless Wt adds
## business value
## ON PREDICTIONS
## {ull remember that, 90% everything can be
## predicted. Even the "best" model wil lead
## to problems
## ON IMPACT
## My models may impact lives, society, and
## ‘the economy. | will ensure everyone is
## aware of the possible pitfalls of my model. | ON ETHICS
## | will remember that | may face ethical
## dilemmas in my pursuit for a better
## model. will ensure that ll use my best
## “sgment in abtaining data, building the
## model, understanding and communicating
## ON HUMILITY | 2nvbies, and defining metrics
## Lull remember thatthe business of data
## science is complex, | will accept that my
## model can, and will be wrong
## datasciencedojo
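The raw OCR output above is noisy. A minimal sketch of how it could be tidied before any further text analysis (whitespace normalization and empty-line removal only; no other assumptions about downstream use):

# Sketch only: normalize the raw OCR text into clean, non-empty lines
ocr_text  <- image_ocr(datascience)
ocr_lines <- ocr_text %>%
  str_split("\n") %>%    # one element per OCR'd line
  unlist() %>%
  str_squish()           # collapse repeated whitespace
ocr_lines <- ocr_lines[ocr_lines != ""]   # drop empty lines
head(ocr_lines)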
We have already seen the geographical locations of some of these data scientists. Here is a brief history of how data science itself has evolved over the years.
# Scrape the paragraphs (nodes 13-23) that chronicle milestones in the history of data science
url_history <- 'https://www.dataversity.net/brief-history-data-science/'
DS_History_HTML <- read_html(url_history)
DS_History_HTML_Para_Nodes <- DS_History_HTML %>% html_nodes('p') %>% .[13:23] %>% html_text()
kable(DS_History_HTML_Para_Nodes, format = "html")
| Milestone |
|---|
| In 2001, Software-as-a-Service (SaaS) was created. This was the pre-cursor to using Cloud-based applications. |
| In 2001, William S. Cleveland laid out plans for training Data Scientists to meet the needs of the future. He presented an action plan titled, Data Science: An Action Plan for Expanding the Technical Areas of the field of Statistics. It described how to increase the technical experience and range of data analysts and specified six areas of study for university departments. It promoted developing specific resources for research in each of the six areas. His plan also applies to government and corporate research. |
| In 2002, the International Council for Science: Committee on Data for Science and Technology began publishing the Data Science Journal, a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues. |
| In 2006, Hadoop 0.1.0, an open-source, non-relational database, was released. Hadoop was based on Nutch, another open-source database. |
| In 2008, the title, “Data Scientist” became a buzzword, and eventually a part of the language. DJ Patil and Jeff Hammerbacher, of LinkedIn and Facebook, are given credit for initiating its use as a buzzword. |
| In 2009, the term NoSQL was reintroduced (a variation had been used since 1998) by Johan Oskarsson, when he organized a discussion on “open-source, non-relational databases”. |
| In 2011, job listings for Data Scientists increased by 15,000%. There was also an increase in seminars and conferences devoted specifically to Data Science and Big Data. Data Science had proven itself to be a source of profits and had become a part of corporate culture. |
| In 2011, James Dixon, CTO of Pentaho promoted the concept of Data Lakes, rather than Data Warehouses. Dixon stated the difference between a Data Warehouse and a Data Lake is that the Data Warehouse pre-categorizes the data at the point of entry, wasting time and energy, while a Data Lake accepts the information using a non-relational database (NoSQL) and does not categorize the data, but simply stores it. |
| In 2013, IBM shared statistics showing 90% of the data in the world had been created within the last two years. |
| In 2015, using Deep Learning techniques, Google’s speech recognition, Google Voice, experienced a dramatic performance jump of 49 percent. |
| In 2015, Bloomberg’s Jack Clark, wrote that it had been a landmark year for Artificial Intelligence (AI). Within Google, the total of software projects using AI increased from “sporadic usage” to more than 2,700 projects over the year. |
In conclusion, we were able to build a table of more than 20 of today’s Data Science thought leaders. These experts are actively keeping up with all new developments in:
Moreover, some of the non-technical skills that are greatly needed and valued include:
While some of the important technical skills are: