Project 3
Team - Data Domination
Authors: Nfn Tenzin Dakar, Nwe Oo Mon, Crystal Quezada
Date: 2024-10-20
Our group, DATA DOMINATION, will use Zoom and text messages for communication. All files, including data and documentation, are stored in a GitHub repo. We’ve created an entity-relationship diagram (ERD) using Mysql DataBase and identified our data sources, which will be loaded via CSV files from GitHub.
Our team leverages several tools to streamline collaboration and project development:
R Markdown in RStudio Cloud: We use R Markdown to document our project, and publish the results through RPubs (rpubs.com), making our work easily accessible.
MySQL : This tool was employed to design the entity-relationship diagram (ERD) for the project.
GitHub Repository : All source CSV files and R Markdown (RMD) files are stored centrally on GitHub, allowing the team to access and manage files efficiently.
Communication: Our discussions and updates happen over Zoom for meetings and text messages for daily communication, ensuring smooth coordination.
Kaggle ML & DS Survey: Our dataset is job listings based on software, which can be found on Kaggle, it is also stored in our collaborative GitHub repository.
The dataset provides insights into the job market for data scientists in 2018, focusing on the demand for various programming languages and data tools across major job platforms, including LinkedIn, Indeed, SimplyHired, and Monster. It details the number of job postings that mention each keyword (e.g., Python, R, SQL) and calculates the percentage of total postings that each keyword represents on these platforms. Additionally, the dataset includes a comparison with 2017 data from Glassdoor.
The Keyword table stores the list of programming languages or technologies (e.g., Python, SQL, R), while the JobSource table holds different job platforms (e.g., LinkedIn, Indeed, Monster). The Metrics table acts as an associative entity that breaks down the many-to-many relationship between Keyword and Job Source into two one-to-many relationships.
For instance, “Python” can appear in multiple metrics entries, each representing a different job source (e.g., LinkedIn, Indeed).
For example, LinkedIn can have metrics for various keywords (e.g., Python, SQL, R).
Metrics has a foreign key to Keyword (keyword_id) and a foreign key to JobSource (source_id).
Data from MySQL will be imported to R in following steps:
if (!requireNamespace("DBI", quietly = TRUE)) install.packages("DBI")
if (!requireNamespace("RMariaDB", quietly = TRUE)) install.packages("RMariaDB")
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("knitr", quietly = TRUE)) install.packages("knitr")
if (!requireNamespace("scales", quietly = TRUE)) install.packages("scales")
library(DBI)
library(RMariaDB)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
Store the Database Password Securely: To avoid exposing the database password in the code, the Password was stored in an environment file (.Renviron).
Connect to the MySQL Database: Using dbConnect(), a connection will established to the MySQL database by providing the necessary credentials (user, password, database name, host, and port).
Load Data from the Database: Once connected, SQL queries will be used to fetch data from specific tables. The result is stored in an R dataframe.
Close the Connection: After loading the data, the database will be disconnected using dbDisconnect() to free up resources.
con <- dbConnect(
RMariaDB::MariaDB(),
user = 'root',
password = Sys.getenv("MYSQL_PWD"),
dbname = 'job_listing',
host = 'localhost',
port = 3306
)
# Check if the connection is successful
if (!is.null(con)) {
print("Connected successfully!")
} else {
print("Connection failed.")
}
## [1] "Connected successfully!"
# Load the data from the MySQL table into an R dataframe
jobsource <- dbGetQuery(con, "SELECT * FROM jobsource")
keyword <- dbGetQuery(con, "SELECT * FROM keyword")
metrics <- dbGetQuery(con, "SELECT * FROM metrics")
# Execute the SQL query and fetch the results into an R data frame
query <- "SELECT keyword.keyword_name,
jobsource.source_name,
metrics.count,
metrics.percentage
FROM metrics
JOIN jobsource ON metrics.source_id = jobsource.source_id
JOIN keyword ON metrics.keyword_id = keyword.keyword_id;"
# Fetch the data
metrics_combined <- dbGetQuery(con, query)
# Display the results
glimpse (metrics_combined)
## Rows: 148
## Columns: 4
## $ keyword_name <chr> "Python", "R", "SQL", "Spark", "Hadoop", "Java", "SAS", "…
## $ source_name <chr> "Indeed", "Indeed", "Indeed", "Indeed", "Indeed", "Indeed…
## $ count <int> 3818, 3106, 2628, 1551, 1578, 1377, 1134, 1012, 830, 739,…
## $ percentage <dbl> 0.74, 0.60, 0.51, 0.30, 0.31, 0.27, 0.22, 0.20, 0.16, 0.1…
# Close the database connection
dbDisconnect(con)
Database Design: The SQL database was structured with normalized tables, using appropriate data types—VARCHAR for text fields, INT for numerical fields, and DECIMAL for percentage values. This approach ensured efficient and accurate data storage and retrieval.
Data Cleaning: Before loading data into SQL, the CSV file underwent thorough pre-processing in Excel. This included removing commas from numeric fields, converting percentage formats to numbers, filling empty cells with -1 to denote missing values, and removing redundant rows like “Total.” These steps ensured that the data was clean and aligned with the database requirements.
Data Loading: The cleaned data was then imported into SQL tables. The NULLIF function was used to handle missing or invalid data, converting any -1 entries to NULL during the import process, which maintained data integrity.
Normalization and Relationships: The database design incorporated normalization techniques to minimize data redundancy, enhancing data efficiency. Foreign keys were used to establish clear relationships between tables, improving data integration and consistency.
ggplot(jobsource, aes(x = reorder(source_name, job_postings), y = job_postings, fill = source_name)) +
geom_bar(stat = "identity") +
labs(x = "Job Source", y = "Total Job Postings", title = "Total Job Postings by Job Source") +
theme_minimal() +
geom_text(aes(label = job_postings), vjust = -0.5, size = 3)
# Sort the data by 'percentage' in descending order
metrics_combined <- metrics_combined %>%
arrange(desc(percentage))
# Convert 'keyword_name' to a factor based on the sorted order
metrics_combined$keyword_name <- factor(metrics_combined$keyword_name, levels = unique(metrics_combined$keyword_name))
# Create the heatmap with the corrected order and color gradient
ggplot(metrics_combined, aes(x = source_name, y = keyword_name, fill = percentage)) +
geom_tile(color = "white") +
scale_fill_gradientn(colors = c("green", "blue", "red"),
values = scales::rescale(c(0, 0.5, 1)),
labels = scales::percent_format(accuracy = 1)) +
labs(title = "Demand for Tools Across each Platform (%)",
x = "Platforms",
y = "Tools") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Filter top 5 and bottom 5 skills by average percentage
top_skills <- keyword %>%
arrange(desc(aveg_percentage)) %>%
head(5)
bottom_skills <- keyword %>%
arrange(aveg_percentage) %>%
head(5)
# Combine top and bottom skills into one data frame
top_bottom_skills <- rbind(top_skills, bottom_skills)
# Create a new column to indicate the color
top_bottom_skills$color <- ifelse(top_bottom_skills$aveg_percentage %in% top_skills$aveg_percentage, "Top 5", "Bottom 5")
# Plotting
ggplot(top_bottom_skills, aes(x = reorder(keyword_name, aveg_percentage), y = aveg_percentage, fill = color)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("Top 5" = "green", "Bottom 5" = "red")) +
labs(x = "Skill", y = "Average Percentage across Job Sources", title = "Top 5 and Bottom 5 Skills by Average Percentage", fill = "") +
theme_minimal() +
geom_text(aes(label = percent(aveg_percentage, accuracy = 1)), vjust = ifelse(top_bottom_skills$color == "Top 5", -0.5, 1.5), size = 3) +
coord_flip()
# Sort the data to get top tools per platform
top_tools <- metrics_combined %>%
group_by(source_name) %>%
top_n(5, wt = count) %>%
ungroup()
# Create the faceted bar chart
ggplot(top_tools, aes(x = reorder(keyword_name, -count), y = count, fill = source_name)) +
geom_bar(stat = "identity") +
facet_wrap(~ source_name, scales = "free_y") + # Facet by source_name (platform)
coord_flip() + # Flip coordinates for horizontal bars
labs(title = "Top Tools by Platform",
x = "Tool",
y = "Count") +
theme_minimal() +
theme(legend.position = "none") # Remove the legend since we already use facets
# Filter out rows with NA values in the 'percentage_dif' column
data_filtered <- keyword %>%
filter(!is.na(percentage_dif))
# Display the data frame
kable(data_filtered, caption = "Filtered Data")
| keyword_id | keyword_name | aveg_percentage | Glassdoor_percentage | percentage_dif |
|---|---|---|---|---|
| 1 | Python | 0.73 | 0.72 | 0.01 |
| 2 | R | 0.60 | 0.64 | -0.04 |
| 3 | SQL | 0.50 | 0.51 | -0.01 |
| 4 | Spark | 0.29 | 0.27 | 0.02 |
| 5 | Hadoop | 0.30 | 0.39 | -0.09 |
| 6 | Java | 0.26 | 0.33 | -0.07 |
| 7 | SAS | 0.23 | 0.30 | -0.07 |
| 8 | Tableau | 0.19 | 0.14 | 0.05 |
| 9 | Hive | 0.16 | 0.17 | -0.01 |
| 14 | Matlab | 0.12 | 0.20 | -0.08 |
# Plotting
ggplot(data_filtered, aes(x = reorder(keyword_name, percentage_dif), y = percentage_dif, fill = percentage_dif > 0)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("TRUE" = "green", "FALSE" = "red"),
labels = c("Negative Change", "Positive Change")) +
labs(x = "Skill", y = "Percentage Difference from 2017 Glassdoor Postings",
title = "Percentage Difference (2018 Average vs 2017 Glassdoor)") +
theme_minimal() +
geom_text(aes(label = scales::percent(percentage_dif, accuracy = 1)),
vjust = 0.5, hjust = ifelse(data_filtered$percentage_dif > 0, -0.1, 1.1), size = 3.0) +
coord_flip()
The table presents the correlation matrix among job postings for the four platforms: Indeed, LinkedIn, Monster, and SimplyHired.
The correlation values are very high, ranging from 0.98 to 0.99, indicating a strong positive correlation between job postings on these platforms.
This matrix indicates that the job market for data science roles exhibits similar trends across the four job platforms
metrics_count <- metrics_combined %>%
select(-percentage)
# Pivot the data to a wide format based on 'source_name'
wide_data_count <- metrics_count %>%
pivot_wider(names_from = source_name, values_from = count)
# View the reshaped data
print(wide_data_count)
## # A tibble: 37 × 5
## keyword_name SimplyHire Indeed LinkedIn Monster
## <fct> <int> <int> <int> <int>
## 1 Python 2888 3818 6347 2544
## 2 R 2393 3106 4553 2365
## 3 SQL 2056 2628 3879 1841
## 4 Hadoop 1164 1578 2142 1200
## 5 Spark 1167 1551 2169 1062
## 6 Java 1059 1377 1944 1002
## 7 SAS 910 1134 1713 978
## 8 Tableau 780 1012 1216 744
## 9 Hive 637 830 1182 619
## 10 AWS 607 791 947 467
## # ℹ 27 more rows
# Compute the Pearson correlation matrix
cor_matrix <- cor(wide_data_count [,-1], use = "complete.obs", method = "pearson")
# Print the correlation matrix
print(cor_matrix)
## SimplyHire Indeed LinkedIn Monster
## SimplyHire 1.0000000 0.9993663 0.9941885 0.9933402
## Indeed 0.9993663 1.0000000 0.9948814 0.9928074
## LinkedIn 0.9941885 0.9948814 1.0000000 0.9858190
## Monster 0.9933402 0.9928074 0.9858190 1.0000000
# Add a new column 'Subcategory' to categorize each skill
metrics_combined$Subcategory <- ifelse(metrics_combined$keyword_name %in% c("Python", "R", "Java", "C++", "C"), "Programming Languages",
ifelse(metrics_combined$keyword_name %in% c("Hadoop", "Spark", "Hive", "Pig", "Hbase"), "Big Data Technologies",
ifelse(metrics_combined$keyword_name %in% c("SQL", "NoSQL", "MySQL", "MongoDB", "Cassandra"), "Data Management & Databases",
ifelse(metrics_combined$keyword_name %in% c("Tableau", "D3", "Matlab"), "Data Visualization",
ifelse(metrics_combined$keyword_name %in% c("TensorFlow", "Keras", "PyTorch", "Scikit-learn", "Pandas", "Numpy"), "Machine Learning & Deep Learning",
ifelse(metrics_combined$keyword_name %in% c("AWS", "Azure", "Docker", "Git"), "Cloud & DevOps Tools",
ifelse(metrics_combined$keyword_name %in% c("SAS", "SPSS"), "Statistical Analysis & Tools",
"Other")))))))
# Summing job listings for each platform across subcategories
subcat_summary <- metrics_combined %>%
group_by(Subcategory) %>%
summarise(total_listings = sum(count)) %>%
arrange(desc(total_listings))
# Create the bar plot for subcategories and their popularity
ggplot(subcat_summary, aes(x = reorder(Subcategory, total_listings), y = total_listings, fill = Subcategory)) +
geom_bar(stat = "identity") +
labs(title = "Popularity of Data Science Subcategories (Total Job Listings)",
x = "Subcategory", y = "Total Job Listings") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
coord_flip()
The bar plot illustrates the demand for data visualization skills, specifically comparing Tableau and D3 across four job platforms: Indeed, LinkedIn, Monster, and SimplyHire.
Tableau has a significantly higher demand than D3 across all platforms, with percentages ranging from 14% on LinkedIn to 20% on Indeed, Monster, and SimplyHire.
D3 remains relatively consistent but lower across platforms, maintaining a demand between 3% to 4%.
This trend suggests that Tableau continues to be the preferred tool for data visualization roles.
# Filter the data to include only D3 and Tableau skills
data_visual_filtered <- metrics_combined %>%
filter(keyword_name %in% c("D3", "Tableau"))
ggplot(data_visual_filtered, aes(x = source_name, y = percentage, fill = keyword_name)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = percent(percentage, accuracy = 1)), # Convert to percentage format
position = position_dodge(width = 0.9),
vjust = -0.5, size = 3) + # Adjust label position and size
labs(title = "Demand for Data Visualization Skills Across Platforms",
x = "Job source", y = "Percentage", fill = "") +
theme_minimal() +
ylim(0, 0.25) # Set the y-axis limit to ensure label visibility