Each data has a time constraint and had time box, but the efforts to have raw data to provide downstream on this project. The following data resource from Google, Indeed and Kaggle Job Boards generalize the approach to initiate a search for “data science” terms to collect the results, identify the skills and a preliminary filtering of noise and irrelevant data. The results set is provided to transforming the data with the following variables: Skill Type, Skill name, rating and company name.
# 2. variables Settings
db_user <- 'root'
db_password <- ''
db_name <- 'skills'
db_host <- '127.0.0.1' # localhost for local access
db_port <- 3306
# 3. Read data from db
mydb <- dbConnect(MySQL(), user = db_user, password = db_password,
dbname = db_name, host = db_host, port = db_port)
s <- paste0(" SELECT SkillType, SkillName, Rating,
CASE
WHEN Company = 1 THEN 'Google'
ELSE 'NA'
END AS Company
FROM GoogleData
UNION
SELECT SkillType, SkillName, Rating,
CASE
WHEN Company = 2 THEN 'Indeed'
ELSE 'NA'
END AS Company
FROM IndeedData
UNION
SELECT SkillType, SkillName, Rating,
CASE
WHEN Company = 3 THEN 'Kaggle'
ELSE 'NA'
END AS Company
FROM KaggleData;")
rs <- dbSendQuery(mydb, s)
df <- fetch(rs, n = -1)
on.exit(dbDisconnect(mydb))## Warning: Closing open result sets
#uf <- df %>% gather(SkillName, n, Rating:Company)
kable(df) %>%
kable_styling() %>%
scroll_box(height = "500px")| SkillType | SkillName | Rating | Company |
|---|---|---|---|
| communication | ai law | 7 | |
| business | analysis | 10 | |
| math | analytics | 11 | |
| programming | big data | 42 | |
| programming | cloud | 2 | |
| business | creativity | 3 | |
| programming | data engineering | 4 | |
| programming | data mining | 3 | |
| programming | data natives | 8 | |
| programming | devops | 2 | |
| business | entrepreneurship | 2 | |
| programming | fintech | 2 | |
| programming | galaxql | 2 | |
| business | government | 3 | |
| programming | Hadoop | 12 | |
| business | information | 9 | |
| programming | IOS | 2 | |
| programming | Java | 2 | |
| programming | machine learning | 15 | |
| business | manufacturing | 2 | |
| programming | MATLAB | 2 | |
| visualization | media | 6 | |
| programming | mobile development | 2 | |
| programming | network | 2 | |
| programming | NOSQL | 5 | |
| math | pattern discovery | 2 | |
| business | policy | 3 | |
| programming | Python | 7 | |
| programming | r | 7 | |
| business | retail | 2 | |
| programming | RSS | 2 | |
| programming | Ruby | 2 | |
| programming | SAS | 2 | |
| business | security and risk strategy | 2 | |
| programming | SQL | 6 | |
| programming | storage | 2 | |
| programming | tableau | 2 | |
| programming | web development | 2 | |
| programming | agent-based Simulation | 32 | Indeed |
| programming | ANOVA | 138 | Indeed |
| math | Bayes networks | 74 | Indeed |
| math | Bayesian Statistics | 175 | Indeed |
| programming | big data | 44 | Indeed |
| programming | BUGS | 160 | Indeed |
| programming | C | 80 | Indeed |
| math | calculus | 91 | Indeed |
| programming | cloud tech | 106 | Indeed |
| programming | clustering | 165 | Indeed |
| math | continuous Simulation | 149 | Indeed |
| math | convex Optimization | 17 | Indeed |
| programming | CSS | 93 | Indeed |
| programming | DBA | 143 | Indeed |
| business | decision trees | 117 | Indeed |
| visualization | design | 15 | Indeed |
| programming | discrete Simulation | 94 | Indeed |
| programming | Distributed Data | 94 | Indeed |
| visualization | experimental design | 152 | Indeed |
| business | forecasting | 49 | Indeed |
| math | general linear model | 133 | Indeed |
| math | geographic covariates | 0 | Indeed |
| programming | GIS | 233 | Indeed |
| business | global Optimization | 94 | Indeed |
| visualization | Graphical Models | 122 | Indeed |
| programming | Hadoop | 76 | Indeed |
| programming | Hive | 106 | Indeed |
| programming | HTML | 73 | Indeed |
| math | integer Optimization | 96 | Indeed |
| programming | JAVA | 70 | Indeed |
| programming | JavaScript | 83 | Indeed |
| programming | JSON | 146 | Indeed |
| math | linear algebra | 69 | Indeed |
| programming | Machine Learning | 98 | Indeed |
| visualization | Map/Reduce | 109 | Indeed |
| visualization | mapping | 134 | Indeed |
| math | Math | 76 | Indeed |
| programming | MATLAB | 108 | Indeed |
| math | MCMC | 68 | Indeed |
| math | Monte-Carlo Statistics | 170 | Indeed |
| math | multinomial modeling | 16 | Indeed |
| programming | MySQL | 114 | Indeed |
| programming | neural nets | 64 | Indeed |
| programming | nix | 6 | Indeed |
| programming | NoSQL | 89 | Indeed |
| programming | Objective C | 129 | Indeed |
| programming | Oracle | 121 | Indeed |
| programming | Perl | 96 | Indeed |
| programming | Pig | 155 | Indeed |
| programming | PostgreSQL | 124 | Indeed |
| business | Product Development | 41 | Indeed |
| business | project management | 25 | Indeed |
| programming | Python | 52 | Indeed |
| programming | R | 64 | Indeed |
| programming | Rails | 132 | Indeed |
| business | real analysis | 70 | Indeed |
| programming | RegEx | 86 | Indeed |
| programming | Ruby | 102 | Indeed |
| programming | SAS | 52 | Indeed |
| programming | Simulation | 136 | Indeed |
| communication | social networks | 172 | Indeed |
| math | Spatial Statistics | 109 | Indeed |
| programming | SPSS | 122 | Indeed |
| programming | SQL | 44 | Indeed |
| programming | Stata | 107 | Indeed |
| visualization | statistical graphics | 183 | Indeed |
| math | Statistics | 85 | Indeed |
| programming | Structured Data | 75 | Indeed |
| business | Surveys and Marketing | 130 | Indeed |
| programming | SVM | 144 | Indeed |
| business | Systems Administration | 109 | Indeed |
| communication | technical writing/publishing | 4 | Indeed |
| programming | Teradata | 127 | Indeed |
| programming | text mining | 174 | Indeed |
| math | time-series analysis | 167 | Indeed |
| programming | UNIX | 89 | Indeed |
| programming | Unstructured Data | 92 | Indeed |
| visualization | visualization | 80 | Indeed |
| visualization | web-based data-viz | 0 | Indeed |
| programming | XML | 185 | Indeed |
| programming | Python | 28 | Kaggle |
| math | analytics | 23 | Kaggle |
| programming | machine learning | 19 | Kaggle |
| business | business | 14 | Kaggle |
| programming | programming | 14 | Kaggle |
| communication | team | 14 | Kaggle |
| math | statistics | 13 | Kaggle |
| programming | SQL | 12 | Kaggle |
| communication | communication | 11 | Kaggle |
| programming | modeling | 9 | Kaggle |
| communication | interpersonal | 9 | Kaggle |
| communication | management | 9 | Kaggle |
| programming | big data | 7 | Kaggle |
| visualization | design | 6 | Kaggle |
| programming | Hadoop | 6 | Kaggle |
| programming | Java | 6 | Kaggle |
| business | research | 4 | Kaggle |
| communication | creative | 4 | Kaggle |
| math | math | 3 | Kaggle |
| programming | MATLAB | 3 | Kaggle |
| programming | predictive analytics | 3 | Kaggle |
| programming | SAS | 3 | Kaggle |
| communication | leadership | 3 | Kaggle |
| business | analysis | 2 | Kaggle |
| visualization | presentation | 1 | Kaggle |
| programming | R | 1 | Kaggle |
summary(df)## SkillType SkillName Rating
## Length:144 Length:144 Length:144
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Company
## Length:144
## Class :character
## Mode :character
## [1] "business" "communication" "math" "programming"
## [5] "visualization"
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
## argument is not numeric or logical: returning NA
The purpose of this exercise is to import three data sets in mysql db and tidy them with above libraries.
Please email to: kleber.perez@live.com for any suggestion.