1 Sources

-Link to Dataset: https://www.kaggle.com/phuchuynguyen/datarelated-developers-survey-by-stack-overflow?select=processed_data_toDummies.csv

  • Provide your presentation title. It can be the same as the title of your article. Or, you can modify the title.

“Factors Correlated with Careers in Data Science”

2 Data

  • Describe your dataset that is relevant, related to, and informative of the article.

The chosen Kaggle dataset includes data from 2017, 2018, 2019, and 2020 on aspects of jobs in data science (education level, country, salary, organization size, job satisfaction, etc.). The data originally come from the Stack Overflow Annual Developer Survey (80,000 responses from 180 countries and territories), which is conducted each year to assess such factors in tech-related developer careers. The dataset has 14 variables and about 33,000 observations (employees across the globe). After rows with missing values are removed, approximately 18,000 employees remain, all from the 2019 and 2020 surveys. The data can be used to examine correlations between company size, years of experience, salary, and other variables as they relate to job satisfaction and job availability.

Such information can be useful to undergraduate students in our class as we begin our search for employment and start to understand what kinds of data jobs may best suit our preferences. The dataset is closely related to the chosen article, since both present data on data science job satisfaction as it relates to job title, location, company size, and salary, and both draw their data from the Stack Overflow Annual Developer Survey.

3 Article

The chosen article is from the “Towards Data Science” website. The article was chosen because it analyzes factors (data job type, years of coding experience, salary, etc.) related to developer and data fields, just as the dataset does. Both the dataset and the article rely on the Stack Overflow Annual Developer Survey in order to present correlations between all of these career-related variables.

There are two key arguments made in the article (of many) that caught my interest. The first involves job satisfaction: overall, smaller companies are correlated with higher job satisfaction scores than larger companies. The second is that data-related jobs (data scientist or ML specialist) decreased from 2019 to 2020. My Exploratory Data Analysis in this project supported both points. The evidence for the second argument is especially strong, because the same trend in job availability was still observed after dividing the data into groups by Country (see the Plots section for more detail).

4 Initial Look at Dataset

  • Print the first 6 rows using function head().
#first set working directory to location of file
#install.packages("kableExtra")
require("kableExtra")
## Loading required package: kableExtra
dsdata<-read.csv("processed_data_toDummies.csv")
str(dsdata) #str() prints the structure itself and returns NULL, so it is not wrapped in kable()
## 'data.frame':    33601 obs. of  14 variables:
##  $ Year                                         : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ Hobbyist                                     : chr  "Yes, both" "Yes, I program as a hobby" "No" "Yes, I program as a hobby" ...
##  $ ConvertedComp                                : num  43750 51282 25000 100000 27000 ...
##  $ Country                                      : chr  "United Kingdom" "Denmark" "Israel" "United States" ...
##  $ EdLevel                                      : chr  "Bachelor's degree" "Some college/university study without earning a bachelor's degree" "Some college/university study without earning a bachelor's degree" "Some college/university study without earning a bachelor's degree" ...
##  $ Employment                                   : chr  "Employed full-time" "Employed part-time" "Employed full-time" "Employed full-time" ...
##  $ JobSat                                       : int  4 10 6 5 7 10 4 8 10 10 ...
##  $ OrgSize                                      : chr  "2 to 9 employees" "100 to 499 employees" "5,000 to 9,999 employees" "20 to 99 employees" ...
##  $ UndergradMajor                               : chr  "Computer science" "Computer science" "Computer science" "Computer science" ...
##  $ YearsCodePro                                 : int  2 3 4 15 5 5 10 9 1 16 ...
##  $ Data.scientist.or.machine.learning.specialist: int  1 1 1 0 0 0 0 0 0 0 ...
##  $ Database.administrator                       : int  1 0 0 1 1 1 1 1 1 1 ...
##  $ Data.or.business.analyst                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Engineer..data                               : int  NA NA NA NA NA NA NA NA NA NA ...
kable(head(dsdata))
| Year | Hobbyist | ConvertedComp | Country | EdLevel | Employment | JobSat | OrgSize | UndergradMajor | YearsCodePro | Data.scientist.or.machine.learning.specialist | Database.administrator | Data.or.business.analyst | Engineer..data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | Yes, both | 43750.00 | United Kingdom | Bachelor’s degree | Employed full-time | 4 | 2 to 9 employees | Computer science | 2 | 1 | 1 | NA | NA |
| 2017 | Yes, I program as a hobby | 51282.05 | Denmark | Some college/university study without earning a bachelor’s degree | Employed part-time | 10 | 100 to 499 employees | Computer science | 3 | 1 | 0 | NA | NA |
| 2017 | No | 25000.00 | Israel | Some college/university study without earning a bachelor’s degree | Employed full-time | 6 | 5,000 to 9,999 employees | Computer science | 4 | 1 | 0 | NA | NA |
| 2017 | Yes, I program as a hobby | 100000.00 | United States | Some college/university study without earning a bachelor’s degree | Employed full-time | 5 | 20 to 99 employees | Computer science | 15 | 0 | 1 | NA | NA |
| 2017 | Yes, both | 27000.00 | Ukraine | Master’s degree | Employed full-time | 7 | 100 to 499 employees | Computer science | 5 | 0 | 1 | NA | NA |
| 2017 | Yes, I program as a hobby | 120000.00 | United States | Bachelor’s degree | Employed full-time | 10 | 20 to 99 employees | Computer science | 5 | 0 | 1 | NA | NA |
kable(summary(dsdata))
| Year | Hobbyist | ConvertedComp | Country | EdLevel | Employment | JobSat | OrgSize | UndergradMajor | YearsCodePro | Data.scientist.or.machine.learning.specialist | Database.administrator | Data.or.business.analyst | Engineer..data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :2017 | Length:33601 | Min. : 0.18 | Length:33601 | Length:33601 | Length:33601 | Min. : 0.000 | Length:33601 | Length:33601 | Min. : 1.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 |
| 1st Qu.:2018 | Class :character | 1st Qu.: 24576.00 | Class :character | Class :character | Class :character | 1st Qu.: 5.000 | Class :character | Class :character | 1st Qu.: 3.000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.000 |
| Median :2019 | Mode :character | Median : 53000.00 | Mode :character | Mode :character | Mode :character | Median : 6.000 | Mode :character | Mode :character | Median : 6.000 | Median :0.0000 | Median :1.0000 | Median :0.0000 | Median :0.000 |
| Mean :2019 | NA | Mean : 62593.09 | NA | NA | NA | Mean : 6.124 | NA | NA | Mean : 8.756 | Mean :0.3217 | Mean :0.5401 | Mean :0.3215 | Mean :0.304 |
| 3rd Qu.:2019 | NA | 3rd Qu.: 87036.00 | NA | NA | NA | 3rd Qu.: 8.000 | NA | NA | 3rd Qu.:12.000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.000 |
| Max. :2020 | NA | Max. :299436.00 | NA | NA | NA | Max. :10.000 | NA | NA | Max. :30.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.000 |
| NA | NA | NA | NA | NA | NA | NA’s :75 | NA | NA | NA’s :83 | NA | NA | NA’s :2564 | NA’s :13355 |

5 Data Validation

5.1 Data Types and Ranges

The data types for most fields are appropriate. For example, ConvertedComp (salary) is num (double), JobSat (job satisfaction) is int, and YearsCodePro (years of professional coding experience) is int. Some variables were converted to a different type for analysis (see further below).

Additionally, the ranges of the variables in the dataset seem reasonable: some variables take binary (0/1) values, while others take a limited set of integer values (YearsCodePro, JobSat) or categorical values.
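The type and range checks described in this subsection can be sketched in a few lines of R. The toy data frame below is a hypothetical stand-in for dsdata: the column names match the dataset, but the rows are illustrative only.

```r
# Toy stand-in for dsdata (hypothetical rows, real column names)
toy <- data.frame(
  ConvertedComp          = c(43750, 51282.05, 25000),
  JobSat                 = c(4L, 10L, 6L),
  YearsCodePro           = c(2L, 3L, 4L),
  Database.administrator = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Storage class of each column
col_types <- sapply(toy, class)
print(col_types)

# Minimum and maximum of each numeric column, ignoring missing values
col_ranges <- sapply(toy, range, na.rm = TRUE)
print(col_ranges)

# Confirm that a dummy column holds only 0/1 values
is_binary <- all(toy$Database.administrator %in% c(0, 1))
print(is_binary)
```

On the full dataset, the same calls would need to be restricted to the numeric columns before range() is applied, since the character columns have no meaningful numeric range.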

5.2 Checking for Duplicates (Variables of Interest)

#5 unique responses are included for the "Hobbyist" variable
unique(dsdata$Hobbyist)
## [1] "Yes, both"                                
## [2] "Yes, I program as a hobby"                
## [3] "No"                                       
## [4] "Yes, I contribute to open source projects"
## [5] "Yes"
#180 countries are represented by the data
head(unique(dsdata$Country))
## [1] "United Kingdom" "Denmark"        "Israel"         "United States" 
## [5] "Ukraine"        "Canada"
#10 different education levels (plus an empty string for missing responses) are represented by the data
unique(dsdata$EdLevel)
##  [1] "Bachelor's degree"                                                
##  [2] "Some college/university study without earning a bachelor's degree"
##  [3] "Master's degree"                                                  
##  [4] "Secondary school"                                                 
##  [5] "Doctoral degree"                                                  
##  [6] "Professional degree"                                              
##  [7] "I prefer not to answer"                                           
##  [8] "Primary/elementary school"                                        
##  [9] "I never completed any formal education"                           
## [10] "Associate degree"                                                 
## [11] ""
#11 different job satisfaction levels (0-10, plus NA) are represented by the data
unique(dsdata$JobSat)
##  [1]  4 10  6  5  7  8  1  3  9  2 NA  0
#11 different organization size responses (plus an empty string for missing responses) are represented by the data
unique(dsdata$OrgSize)
##  [1] "2 to 9 employees"                                  
##  [2] "100 to 499 employees"                              
##  [3] "5,000 to 9,999 employees"                          
##  [4] "20 to 99 employees"                                
##  [5] "1,000 to 4,999 employees"                          
##  [6] "10 to 19 employees"                                
##  [7] "500 to 999 employees"                              
##  [8] "10,000 or more employees"                          
##  [9] "I don't know"                                      
## [10] ""                                                  
## [11] "I prefer not to answer"                            
## [12] "Just me - I am a freelancer, sole proprietor, etc."
#Years of Professional Coding Experience represented by the data
unique(dsdata$YearsCodePro)
##  [1]  2  3  4 15  5 10  9  1 16  7 20 18 17 11  6 14  8 19 NA 13 12 23 26 30 29
## [26] 22 25 21 24 27 28
#Data Scientist/ML Specialist/Database Admin./Analyst Job Title
unique(dsdata$Data.scientist.or.machine.learning.specialist)
## [1] 1 0
unique(dsdata$Database.administrator)
## [1] 1 0
unique(dsdata$Data.or.business.analyst)
## [1] NA  0  1
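The per-variable checks above can be condensed into one call, and fully duplicated rows can be counted directly. This is a minimal sketch on a hypothetical toy data frame, not the survey data itself.

```r
# Toy stand-in for dsdata (hypothetical rows)
toy <- data.frame(
  Country = c("Denmark", "Israel", "Denmark", "Denmark"),
  EdLevel = c("Master's degree", "", "Master's degree", "Doctoral degree"),
  JobSat  = c(4L, NA, 4L, 10L),
  stringsAsFactors = FALSE
)

# Number of distinct values per column (NA and "" each count as one value)
n_distinct <- sapply(toy, function(x) length(unique(x)))
print(n_distinct)

# Number of rows that exactly repeat an earlier row
n_dup_rows <- sum(duplicated(toy))
print(n_dup_rows)
```

In survey data, identical rows are not necessarily errors, since two respondents can give the same answers, so a nonzero count is a prompt for inspection rather than automatic deletion.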

5.3 Missing Values

The missing values present in this data set take on both the forms NA (as in YearsCodePro and JobSat) and "" (empty strings as in UndergradMajor and EdLevel).

The rows for all missing data (NA) were deleted, using na.omit().

dsdata<-na.omit(dsdata) #deleting NA's

Rows containing empty strings ("") in the variables of interest were deleted using the following code.

dsdata <- dsdata[dsdata$EdLevel != "", ] #subsetting out empty strings; logical indexing stays correct even when no row matches
dsdata <- dsdata[dsdata$OrgSize != "", ]
dsdata <- dsdata[dsdata$UndergradMajor != "", ]
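Before deleting rows, it can help to count how many values of each kind are missing. The sketch below does this on a hypothetical toy data frame and also illustrates a known pitfall of the -which() idiom: when no row matches the condition, which() returns an empty vector and the negative index selects zero rows instead of all of them.

```r
# Toy stand-in for dsdata (hypothetical rows)
toy <- data.frame(
  EdLevel = c("Master's degree", "", "Doctoral degree", "Bachelor's degree"),
  OrgSize = c("2 to 9 employees", "10 to 19 employees",
              "20 to 99 employees", "2 to 9 employees"),
  JobSat  = c(4L, 6L, NA, 8L),
  stringsAsFactors = FALSE
)

# Count NA values and empty strings separately for each column
na_counts    <- colSums(is.na(toy))
empty_counts <- sapply(toy, function(x) sum(x == "", na.rm = TRUE))
print(na_counts)
print(empty_counts)

# Drop rows with NA, then rows with "" in the columns of interest
clean <- na.omit(toy)
clean <- clean[clean$EdLevel != "" & clean$OrgSize != "", ]
print(nrow(clean))

# Pitfall: OrgSize has no "" left, so -which() removes every row
no_match <- clean[-which(clean$OrgSize == ""), ]
print(nrow(no_match))
```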

5.4 Data Types and Descriptive Statistics for Variables of Interest

#Job Satisfaction (JobSat) 
dsdata$JobSat<-as.numeric(dsdata$JobSat) 
summary(dsdata$JobSat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   6.000   5.962   8.000   8.000

JobSat was used in the key plots both as an x-axis variable and as a response variable, so it was switched between types factor and num according to its use.
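Switching a variable between factor and numeric has one subtlety worth illustrating: as.numeric() applied to a factor returns the integer level codes, not the original values. The scores below are hypothetical and only serve to show the behavior.

```r
# Hypothetical job satisfaction scores
scores <- c(2, 4, 5, 6, 8, 8)
f <- as.factor(scores)

# Level codes: positions of each value among the sorted levels
codes <- as.numeric(f)
print(codes)

# Original scores: convert to character first, then to numeric
restored <- as.numeric(as.character(f))
print(restored)
```

Converting through as.character() keeps the original scale, which matters whenever the numeric values are plotted on an axis or averaged.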

#Data Scientist/ML Specialist/Database Admin./Analyst Job Title
dsdata$Data.scientist.or.machine.learning.specialist<-as.factor(dsdata$Data.scientist.or.machine.learning.specialist)
dsdata$Database.administrator<-as.factor(dsdata$Database.administrator)
dsdata$Data.or.business.analyst<-as.factor(dsdata$Data.or.business.analyst)
summary(dsdata$Data.scientist.or.machine.learning.specialist)
##     0     1 
## 12170  5806
summary(dsdata$Database.administrator)
##    0    1 
## 9504 8472
summary(dsdata$Data.or.business.analyst)
##     0     1 
## 12403  5573

These Job Title variables should be encoded as factors instead of ints, since each has only two levels (binary representation).

6 Plots

6.1 Company Size versus Job Satisfaction

require(ggplot2)
## Loading required package: ggplot2
dsdata$JobSat<-as.factor(dsdata$JobSat) #change to factor for bar graph x-axis
company_data<-dsdata[dsdata$OrgSize!="Just me - I am a freelancer, sole proprietor, etc.",] #remove single-person companies (lots of variation)
ggplot(company_data, aes(x=JobSat))+
  geom_bar()+
  facet_wrap(~OrgSize)+ #one panel per organization size
  labs(title='Job Satisfaction by Company Size', x='Job Satisfaction', y='Count')

#note: as.numeric() on a factor returns the integer level codes (1, 2, ...), not the original scores; as.numeric(as.character(dsdata$JobSat)) would recover the original scale
dsdata$JobSat<-as.numeric(dsdata$JobSat) #change back to numeric for violin plot (to use on y-axis as response variable)
company_data<-dsdata[dsdata$OrgSize!="Just me - I am a freelancer, sole proprietor, etc.",]
ggplot(company_data, aes(x=OrgSize, y=JobSat))+geom_violin()+ #violin plot; widths of plots reflect densities
  theme(axis.text.x=element_text(angle=15))+
  labs(title='Job Satisfaction by Company Size', x='Company Size', y='Job Satisfaction')

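Variables used in analysis: JobSat (job satisfaction, with higher values meaning more satisfied; the bar graphs show the original scores, 2-8, while the violin plots show the factor level codes, 1-5, that result from converting the factor back to numeric) and OrgSize (size of the organization).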

Argument 1 of the article concerns Company Size versus Job Satisfaction. The article argues that although it is hard to draw overall conclusions about the relationship between company size and job satisfaction, smaller companies have higher job satisfaction ratings than larger ones. To investigate, I first removed the “freelancer” level of the OrgSize variable, because freelance/single-person positions involve many external factors besides company size that may affect job satisfaction. For the remaining company size groups, the plots are consistent with the article's claim.

The bar graphs show this trend. Overall, the graphs share the same left-skewed, heavy-tailed shape, with more responses in the higher job satisfaction score ranges. Only the smaller companies (2-9 employees, 10-19 employees, 20-99 employees) have a taller bar at a score of 8 than at 6. Larger companies have bars at 6 that are as tall as or taller than those at 8, indicating that a score of “very satisfied” is less common at large companies.

The general trend is the same across the violin plots: density is low at lower job satisfaction scores and increases in the higher ranges. Additionally, for smaller companies (2-9 employees, 10-19 employees, 20-99 employees) the violin is wider at a score of 5 (the top of the 1-5 scale used in these plots) than at a score of 4. Meanwhile, for large companies (5,000-9,999 or 10,000 or more employees) the highest job satisfaction scores often have narrower densities than more moderate scores.
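The visual comparison of bar heights and violin widths can be complemented with a numeric summary. The sketch below, run on hypothetical toy rows rather than the survey data, computes the share of each score within each company size, so that groups with different numbers of respondents are directly comparable.

```r
# Toy stand-in for company_data (hypothetical rows)
toy <- data.frame(
  OrgSize = c("2 to 9 employees", "2 to 9 employees", "2 to 9 employees",
              "10,000 or more employees", "10,000 or more employees",
              "10,000 or more employees"),
  JobSat  = c(8, 8, 6, 6, 6, 8),
  stringsAsFactors = FALSE
)

# Counts of each score within each company size
counts <- table(toy$OrgSize, toy$JobSat)

# Row-wise proportions: each company size sums to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))

# Mean score per company size
mean_by_size <- tapply(toy$JobSat, toy$OrgSize, mean)
print(round(mean_by_size, 2))
```

Comparing the share of top scores across rows of this table is one way to check whether the pattern seen in the plots also holds in the underlying proportions.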

6.2 Job Availability by Country and Year

#require(ggplot2)

country_7<-c('United States', 'Spain', 'Germany', 
             'Australia', 'Ireland', 'United Kingdom', 'India') #a sample of 7 countries from the Country column

dsdata_subset<-dsdata[(dsdata$Country %in% country_7 & dsdata$Data.scientist.or.machine.learning.specialist=='1'),]  #subset the data frame to only these countries and to only the observations with a '1' for the job title category

dsdata_subset$Country<-factor(dsdata_subset$Country, levels=c('United States', 'Spain', 'Germany', 
             'Australia', 'Ireland', 'United Kingdom', 'India'), labels=c('US', 'Spain', 'Germany', 'Australia', 'Ireland', 'UK', 'India')) #Changing the levels of the Country variable to shorter names

ggplot(dsdata_subset, aes(x=Data.scientist.or.machine.learning.specialist))+
  geom_bar(aes(fill=Country))+
  facet_grid(Year~Country)+
  labs(title='Data Scientist and ML Specialist Jobs by Year and Country', x='Data Scientist and ML Specialist Jobs', y='Count') #facet_grid() on both Year and Country

Variables used in analysis: Data.scientist.or.machine.learning.specialist (1 if yes, 0 if not); Country (a subset of 7 major countries); Year (2019 or 2020).

Argument 2 of the article concerns the availability of Data Scientist and ML Specialist jobs by year and country. The article claims that when all countries are taken together, the number of Data Scientist and ML Specialist jobs noticeably decreased from 2019 to 2020.

The article's claim holds up, and the exploratory data analysis above provides strong evidence for it. For each of the 7 countries analyzed, facet_grid() was used to display the data with Country as the columns and Year as the rows. From 2019 to 2020, every country showed a decline in the number (count) of Data Scientist or ML Specialist employees, except Ireland, which stayed about the same. Thus, even when the data are grouped into smaller categories, the overall downward trend remains.
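The counts behind a Year-by-Country comparison can also be tabulated directly. This is a minimal sketch on hypothetical toy rows, where each row stands for one respondent flagged as a Data Scientist or ML Specialist; the numbers are not taken from the survey.

```r
# Toy stand-in for dsdata_subset (hypothetical rows)
toy <- data.frame(
  Year    = c(2019, 2019, 2019, 2020, 2020, 2019, 2020),
  Country = c("US", "US", "US", "US", "US", "Ireland", "Ireland"),
  stringsAsFactors = FALSE
)

# Respondent counts by country (rows) and year (columns)
counts <- table(toy$Country, toy$Year)
print(counts)

# Change in count from 2019 to 2020 for each country
change <- counts[, "2020"] - counts[, "2019"]
print(change)
```

A negative value in change marks a country whose count fell between the two years, and a zero marks one that stayed the same.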

This result is surprising considering the rise of data science as a field over the past decade. There are two possible reasons this trend may have been observed, and they can inform future analysis for this project: 1) There is a simultaneous rise in specialized data science jobs, careers in analytics, and careers in machine learning that are not classified under the title “Data Scientist” or “ML Specialist” as encoded in the data. The rise in these other job titles may account for the observed decline. 2) While grouping by country does not reveal a different trend, grouping by YearsCodePro (years of professional coding), EdLevel (educational level), etc. may show different trends in the availability of these 2 job titles over 2019-2020.
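As a starting point for the second idea, years of professional coding can be binned into experience bands with cut() and then tabulated by year. The rows and the band boundaries below are hypothetical choices made for illustration.

```r
# Toy stand-in for dsdata_subset (hypothetical rows)
toy <- data.frame(
  Year         = c(2019, 2019, 2020, 2020, 2020),
  YearsCodePro = c(1, 12, 3, 7, 25)
)

# Bin experience into bands; intervals are (0,5], (5,10], (10,30]
toy$ExpBand <- cut(toy$YearsCodePro,
                   breaks = c(0, 5, 10, 30),
                   labels = c("1-5", "6-10", "11-30"))

# Respondent counts by experience band (rows) and year (columns)
band_counts <- table(toy$ExpBand, toy$Year)
print(band_counts)
```

The same table could be passed to a faceted bar chart, with ExpBand taking the place of Country.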