Project 3 Part 1 Prompt

Create a short document with the names of group members. You should briefly describe the collaboration tool(s) you'll use as a group, including for communication, code sharing, and project documentation. You should have identified your data sources, where the data can be found, and how to load it. And you should have created at least a logical model for your normalized database and produced an Entity-Relationship (ER) diagram documenting your database design.

Team Members

Beshkia Kvarnstrom

Susanna Wong

Gabriel Castellanos

Mohammad Rahman

Laura Puebla Aguila

Collaboration Tools

We used WhatsApp for regular communication and Zoom for online meetings as needed. We used a GitHub repository for code sharing and Markdown files for write-ups.

Project 3 Approach

The goal of the project is to use data to answer the question, “Which are the most valued data science skills?” We decided to web scrape a major job search engine (e.g., Indeed, LinkedIn, Glassdoor) for job postings and create a large CSV file containing the job title, job URL, company name, salary, and job description. Then we will build lists of technical skills and soft skills and find the five most frequent skills of each type across the postings. The most valued data science skills are those that appear most often in job postings.
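
As a rough sketch of the frequency count, the R code below tallies how many postings mention each skill. It assumes a data frame named postings with a job_description column and a hand-built skill list; both names are placeholders, not our final code.

library(stringr)

# Placeholder skill list; word boundaries keep "r" from matching inside words
skills <- c("python", "r", "sql", "machine learning", "communication")
skill_counts <- sapply(skills, function(s) {
  pattern <- regex(paste0("\\b", s, "\\b"), ignore_case = TRUE)
  sum(str_detect(postings$job_description, pattern))  # postings mentioning the skill
})
head(sort(skill_counts, decreasing = TRUE), 5)  # top 5 most frequent skills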

Data Source

Plan A: We decided to web scrape major job search engines (Indeed, LinkedIn, Glassdoor, etc.). Initially, we used the rvest package in R to scrape Indeed for job postings. However, we ran into HTTP 403 errors because the site restricts automated requests. So, we used ParseHub and Octoparse to scrape data from the job search engines.
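
For reference, a minimal sketch of the rvest attempt is below; the search URL is illustrative, and read_html() fails with an HTTP 403 error because Indeed blocks automated requests.

library(rvest)

# Example Indeed search URL (illustrative)
url <- "https://www.indeed.com/jobs?q=data+scientist&l=United+States"
page <- read_html(url)  # fails with: HTTP error 403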

https://github.com/BeshkiaKvarnstrom/MSDS-DATA607/blob/main/Data_Scientist_Jobs_United%20States%20_Indeed.csv.xlsx

https://github.com/BeshkiaKvarnstrom/MSDS-DATA607/blob/main/Data_Scientist_Jobs_United%20States_glassdoor.xlsx.xlsx

https://github.com/suswong/DATA-607-Project-3/blob/main/Job%20details%20by%20search_LinkedIn.csv

Plan B: Use a dataset we found on Kaggle. The dataset contains job listings in the US scraped from Glassdoor. The CSV file includes the job title, salary estimate, job description, company name, company rating, company headquarters, company size, and more. The Kaggle dataset was based on Ken Jee’s dataset.

https://www.kaggle.com/datasets/nikhilbhathi/data-scientist-salary-us-glassdoor?resource=download

We plan to use the raw file compiled by Ken Jee to do our own tidying and analysis. https://github.com/PlayingNumbers/ds_salary_proj

https://raw.githubusercontent.com/PlayingNumbers/ds_salary_proj/master/glassdoor_jobs.csv
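
Loading the raw file into R is straightforward, since read.csv() can read directly from a URL:

glassdoor_raw <- read.csv("https://raw.githubusercontent.com/PlayingNumbers/ds_salary_proj/master/glassdoor_jobs.csv")
str(glassdoor_raw)  # inspect the columns before tidying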

ER Diagram

# Display the ER diagram stored in our GitHub repository
url <- "https://raw.githubusercontent.com/suswong/DATA-607-Project-3/main/Data%20607%20ER%20Diagram.png"
knitr::include_graphics(url)

Database Plan:

Once we acquire our data using the above-mentioned tools (Plan A or Plan B), we plan to load it into normalized tables in a database instance on SQL Server. Once the data is loaded, we will connect the database to R using the appropriate packages/libraries (e.g., DBI and odbc; a connection sketch follows the model summary below). From there, we will load the data into a data frame to begin tidying and analysis. The ER diagram is shown above, and a summary of the database model is as follows:

One industry (IT, Analytics, Biotech, etc.) has many positions (i.e., jobs).

One position can have many position skills, and one skill can have many position skills (a many-to-many relationship resolved by a junction table).

One skill group has many skills.

One job level (junior, senior, associate) can have many position skills, and one level can have many positions associated with it.

One employer can have many open positions, and one employer can have many locations.

One salary (or range) can belong to many open positions.

One education requirement can belong to many open positions.
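
As a sketch of this plan, the R code below connects to a SQL Server instance with DBI and odbc and creates the junction table linking positions to skills. The server, database, and credential values are placeholders, and the table and column names are assumptions loosely following the diagram.

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "localhost",  # placeholder server
                 Database = "project3",   # placeholder database
                 UID      = "user",       # placeholder credentials
                 PWD      = "password")

# Junction table resolving the many-to-many relationship between
# positions and skills (names are illustrative)
dbExecute(con, "
  CREATE TABLE position_skill (
    position_id INT NOT NULL REFERENCES position(position_id),
    skill_id    INT NOT NULL REFERENCES skill(skill_id),
    PRIMARY KEY (position_id, skill_id)
  );")

positions <- dbGetQuery(con, "SELECT * FROM position;")  # load into a data frame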