The goal of this project is to find most valuable data science skills. I collected datasets from kaggle.com and uploaded it into my github and to MySql database. I will load the dataset into MySQL and find the top 15 Data Scientist Skill, Data Analysts and Data engineer. I will also try to find data science skills across different sectors.
The data I collected from kaggle.com and collected into my github account.
https://github.com/mrahman234/Data-607/tree/main/Project%203/Dataset
I will use R Studio for my collaboration and code development. This allows us to view and share code within the project. I am using R Markdown within RStudio Cloud for project documentation to publish through RPubs. To create the ER Diagram, I will use Lucid App website. The source file of my project and datasets will be uploaded to my github account.
ER Diagram:
I tried to normalize the ER Diagram a little bit. This normalization reduces data redundancy by ensuring that each data item is stored only once. Also it ensures data consistency, as updates, deletions, and insertions only have to be performed in one place.
mydb = dbConnect(MySQL(), user='mohammed.rahman76', password='mohammed.rahman76', dbname='mohammed.rahman76', port=3306, host='cunydata607sql.mysql.database.azure.com')
summary(mydb)
## <MySQLConnection:0,0>
## User: mohammed.rahman76
## Host: cunydata607sql.mysql.database.azure.com
## Dbname: mohammed.rahman76
## Connection type: cunydata607sql.mysql.database.azure.com via TCP/IP
##
## Results:
#showing tables list
dbListTables(mydb)
## [1] "general_skills" "indeed_dataset" "job_dataset" "movies_ratings"
## [5] "tb"
Loading database to variable
indeed <- dbReadTable(mydb, "job_dataset")
After the database was loaded into R from MySQL, the Skill column tab required some organization. The dataset provides data of job list from Indeed. From there, I removed all strings, trimmed, and split the string into pieces with commas. By doing this, I will eventually use the unlist function to turn the list into a vector. After creating a table to count the frequencies, I proceeded to trim the function once more and set it as a data frame. I wanted to put the frequency of skills in descending order, therefore I mutated new column names. To see the ratios between the frequency and the total of the frequencies for each talent, I lastly mutated another new column.
Now, I’m creating a graph of top 15 skills for Data Scientists.
indeed_skills <- indeed$Skill %>%
str_remove_all("\\[|\\]|\\'") %>%
str_trim() %>%
str_split(",") %>%
unlist() %>%
str_trim() %>%
table() %>%
as.data.frame() %>%
set_colnames(c("skill", "frequency")) %>%
arrange(desc(frequency)) %>%
mutate(proportion = frequency / sum(frequency))
#graph of top 15 skills overall
indeed_skills %>%
slice(1:15) %>%
ggplot(., aes(x = reorder(skill, frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "#739ade") +
coord_flip() +
ggtitle("Top 15 Skills for in the dataset") +
xlab("skill")
Firstly, I am subsetting the dataframe. Next, I used regular expressions to sanitize the talents column and removed []. After that, I divided the string and reduced the data. Then, I cleaned the job type columns and unnested the skills. After that, I determined the frequencies and proportions by eliminating the white spaces. Finding the top valuable skills only for Data Scientists. The findings were then plotted using ggplot2.
#subsetting data
job_vs_skill <- indeed %>%
select(Job_Title, Job_Type, Skill)
#cleaning skills column
job_vs_skill$Skill <- job_vs_skill$Skill %>%
str_remove_all("\\[|\\]|\\'") %>%
str_trim() %>%
str_split(",")
#cleaning job type columns
job_vs_skill$Job_Type <- job_vs_skill$Job_Type %>%
str_replace_all("_", " ")
#unnesting
job_vs_skill <- job_vs_skill %>%
unnest(Skill)
#removing white space
job_vs_skill$Skill <- trimws(job_vs_skill$Skill, which = c("both"))
#finding frequencies and proportions
job_vs_skill <- job_vs_skill %>%
group_by(Job_Type, Skill) %>%
summarise(freq = n()) %>%
mutate(proportion = freq/ sum(freq),
label = round(proportion * 100, 2))
## `summarise()` has grouped output by 'Job_Type'. You can override using the
## `.groups` argument.
#top 10 skills by job type
job_vs_skill2 <- job_vs_skill %>%
group_by(Job_Type) %>%
top_n(10, proportion) %>%
ungroup() %>%
arrange(Job_Type, desc(proportion))
job_vs_skill %>%
filter(Job_Type == "data scientist") %>%
ggplot(aes(x = reorder(Skill, freq), y = freq)) +
geom_bar(stat = "identity", fill = "#4e58f1") +
coord_flip() +
ggtitle("Top 15 Most Valued Skills for Data Scientists") +
xlab("skill") +
ylab("frequency")
Machine Learning, R, Python, and Statistical Software top the pack, while AI and ArcGIs rank last among the most valued talents for data scientists.