Project 1: Data Scientist Jobs in 2021

Author

Qian He

Data Scientist Jobs in 2021

Introduction:

Dataset Topic: Data Scientist Jobs in 2021

Variables: Average Salary (K), Lower Salary, Rating

Source: https://aijobs.net

I want to explore the relationships among average salary, lower salary, and company rating. Understanding these relationships can help me see how salary levels vary across different jobs. Additionally, analyzing company ratings can give me insight into the relationship between rating and high salary, which I can use as a reference for future work.

Load the dataset and filter NAs and outliers

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

p1 <- read_csv("Data scientist jobs in 2021.csv")

Rows: 742 Columns: 41
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Job Title, Salary Estimate, Company Name, Location, Headquarters, ...
dbl (25): index, Rating, Founded, Hourly, Employer provided, Lower Salary, U...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

view(p1)

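# Keep rows with non-missing average salary, rating, and lower salary.
# Rating > 0 also removes invalid (non-positive) ratings.
# Column names with spaces need backticks; a quoted name is a text constant,
# so is.na() on it is always FALSE and nothing would be filtered.
jobs1 <- p1 |>
  filter(!is.na(`Avg Salary(K)`),
         !is.na(Rating),
         !is.na(`Lower Salary`),
         Rating > 0) |>
  # Convert Rating from a numerical to a categorical variable (for graphing).
  # breaks = 3 makes three equal-width bins over the observed range,
  # so the labels below are approximate.
  mutate(categorical_Rating = cut(Rating, breaks = 3, labels = c("2-3", "3-4", "4-5")))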
jobs1

# A tibble: 731 × 42
   index `Job Title`            `Salary Estimate` Rating `Company Name` Location
   <dbl> <chr>                  <chr>              <dbl> <chr>          <chr>   
 1     0 Data Scientist         $53K-$91K (Glass…    3.8 "Tecolote Res… Albuque…
 2     1 Healthcare Data Scien… $63K-$112K (Glas…    3.4 "University o… Linthic…
 3     2 Data Scientist         $80K-$90K (Glass…    4.8 "KnowBe4\n4.8" Clearwa…
 4     3 Data Scientist         $56K-$97K (Glass…    3.8 "PNNL\n3.8"    Richlan…
 5     4 Data Scientist         $86K-$143K (Glas…    2.9 "Affinity Sol… New Yor…
 6     5 Data Scientist         $71K-$119K (Glas…    3.4 "CyrusOne\n3.… Dallas,…
 7     6 Data Scientist         $54K-$93K (Glass…    4.1 "ClearOne Adv… Baltimo…
 8     7 Data Scientist         $86K-$142K (Glas…    3.8 "Logic20/20\n… San Jos…
 9     8 Research Scientist     $38K-$84K (Glass…    3.3 "Rochester Re… Rochest…
10     9 Data Scientist         $120K-$160K (Gla…    4.6 "<intent>\n4.… New Yor…
# ℹ 721 more rows
# ℹ 36 more variables: Headquarters <chr>, Size <chr>, Founded <dbl>,
#   `Type of ownership` <chr>, Industry <chr>, Sector <chr>, Revenue <chr>,
#   Competitors <chr>, Hourly <dbl>, `Employer provided` <dbl>,
#   `Lower Salary` <dbl>, `Upper Salary` <dbl>, `Avg Salary(K)` <dbl>,
#   company_txt <chr>, `Job Location` <chr>, Age <dbl>, Python <dbl>,
#   spark <dbl>, aws <dbl>, excel <dbl>, sql <dbl>, sas <dbl>, keras <dbl>, …
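
Two details of the cleaning step are easy to check on toy data. The sketch below uses a made-up tibble (`toy` and its values are hypothetical, not taken from the dataset). It shows that a quoted column name inside `is.na()` is treated as a text constant and so filters nothing. It also shows that `cut()` with `breaks = 3` builds three equal-width bins over the observed range, not bins with edges at 3 and 4.

```r
library(dplyr)

# Hypothetical toy data; the values are illustrative only
toy <- tibble(
  `Avg Salary(K)` = c(72, NA, 110, 95, 130),
  Rating          = c(1.9, 2.95, 3.5, 4.0, 5.0)
)

# A quoted name is a character constant, so is.na() is FALSE and no row is removed
n_quoted <- toy |> filter(!is.na('Avg Salary(K)')) |> nrow()

# Backticks refer to the column, so the row with the missing salary is removed
n_backtick <- toy |> filter(!is.na(`Avg Salary(K)`)) |> nrow()

# breaks = 3: three equal-width bins over the observed range of Rating
equal_width <- cut(toy$Rating, breaks = 3, labels = c("2-3", "3-4", "4-5"))

# Explicit edges: the bins are (0,3], (3,4] and (4,5]
fixed_edges <- cut(toy$Rating, breaks = c(0, 3, 4, 5),
                   labels = c("<=3", "3-4", "4-5"))

print(c(quoted = n_quoted, backtick = n_backtick))
print(data.frame(Rating = toy$Rating, equal_width, fixed_edges))
```

With explicit edges, a rating of 2.95 lands in the lowest group, while the equal-width version places it in the group labelled 3-4.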

Regression & Modeling

# y = Lower Salary, x = Avg Salary(K)
# cor() returns the correlation between the two salary variables
library(ggfortify)
cor(jobs1$`Avg Salary(K)`, jobs1$`Lower Salary`)

[1] 0.9801454

# Fit a linear regression of Lower Salary on Avg Salary(K) and Rating
p1 <- lm(`Lower Salary` ~ `Avg Salary(K)`+ Rating, data=jobs1)
summary(p1)


Call:
lm(formula = `Lower Salary` ~ `Avg Salary(K)` + Rating, data = jobs1)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.672  -3.477   0.602   2.441  32.321 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -13.190266   1.492199  -8.839  < 2e-16 ***
`Avg Salary(K)`   0.796005   0.005956 133.638  < 2e-16 ***
Rating            1.875276   0.387904   4.834 1.63e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.924 on 728 degrees of freedom
Multiple R-squared:  0.9619,    Adjusted R-squared:  0.9618 
F-statistic:  9192 on 2 and 728 DF,  p-value: < 2.2e-16

#diagnostic plots
autoplot(p1,1:4,nrow=2,ncol=2)

Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

options(scipen=0)
# Remove rows 77 and 326, identified as outliers from the Cook's distance plot
jobs2 <- jobs1[-c(77, 326), ]
p2 <- lm(`Lower Salary` ~ `Avg Salary(K)` + Rating, data = jobs2)
summary(p2)


Call:
lm(formula = `Lower Salary` ~ `Avg Salary(K)` + Rating, data = jobs2)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.481  -3.448   0.515   2.392  32.069 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -13.595063   1.458457  -9.322  < 2e-16 ***
`Avg Salary(K)`   0.797639   0.005837 136.644  < 2e-16 ***
Rating            1.953745   0.378805   5.158 3.23e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.782 on 726 degrees of freedom
Multiple R-squared:  0.9636,    Adjusted R-squared:  0.9635 
F-statistic:  9607 on 2 and 726 DF,  p-value: < 2.2e-16

#re-run diagnostic plots
autoplot(p2,1:4,nrow=2,ncol=2)

The regression model is: Lower Salary = -13.5951 + 0.7976 × Avg Salary(K) + 1.9537 × Rating

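The coefficient for Avg Salary is 0.7976, which means Avg Salary has a positive relationship with Lower Salary: holding Rating constant, each 1K increase in Avg Salary is associated with an increase of about 0.80K in Lower Salary. The correlation of 0.98 between the two variables shows that this relationship is very strong.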

The p-value for Avg Salary is less than 2e-16 (three asterisks), which suggests it is a highly significant predictor of Lower Salary. The adjusted R-squared of 0.9635 means that about 96% of the variation in Lower Salary is explained by the model.

The coefficient for Rating is 1.9537 and its p-value is 3.23e-07 (three asterisks), which means Rating is also a statistically significant predictor of Lower Salary in this model. Holding Avg Salary constant, a one-point higher company rating is associated with a Lower Salary that is about 1.95K higher.
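
To make the coefficients concrete, the fitted equation can be evaluated by hand. This is a small sketch: `predict_lower_salary()` is a hypothetical helper written for illustration, the coefficients are the rounded values from the summary above, and the inputs (an average salary of 100K and a rating of 4) are made-up example values.

```r
# Rounded coefficients from the second model's summary
b0       <- -13.5951  # intercept
b_avg    <- 0.7976    # slope for Avg Salary(K)
b_rating <- 1.9537    # slope for Rating

# Hypothetical helper: plug values into the regression equation
predict_lower_salary <- function(avg_salary_k, rating) {
  b0 + b_avg * avg_salary_k + b_rating * rating
}

pred_base        <- predict_lower_salary(100, 4)
pred_plus_rating <- predict_lower_salary(100, 5)  # one more rating point
pred_plus_salary <- predict_lower_salary(101, 4)  # 1K more average salary

print(pred_base)                     # about 73.98
print(pred_plus_rating - pred_base)  # equals the Rating coefficient
print(pred_plus_salary - pred_base)  # equals the Avg Salary coefficient
```

Changing one input by a single unit while holding the other fixed moves the prediction by exactly that input's coefficient, which is what "holding the other variable constant" means here.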

From the diagnostic plots, most residuals are located around zero, which suggests the model is a mostly reasonable fit. In the Normal Q-Q plot, there are a few outliers at the right end. From the Cook's distance plot, we can conclude that observations 194 and 469 have a relatively strong effect on the model.
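
Influential observations can also be listed numerically instead of being read off the plot. The sketch below is an illustration under stated assumptions: it uses R's built-in `mtcars` data as a stand-in for the jobs data, and the cutoff of 4/n is a common rule of thumb, not a threshold taken from this analysis.

```r
# Stand-in model on built-in data (not the jobs dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption): flag points with distance above 4 / n
cutoff  <- 4 / nrow(mtcars)
flagged <- which(cd > cutoff)
print(sort(cd[flagged], decreasing = TRUE))

# Refit without the flagged rows and compare the coefficients
if (length(flagged) > 0) {
  fit_trimmed <- lm(mpg ~ wt + hp, data = mtcars[-flagged, ])
} else {
  fit_trimmed <- fit
}
print(rbind(all_rows = coef(fit), trimmed = coef(fit_trimmed)))
```

Each refit changes the distances of the remaining points, which is one reason new points can look influential after others have been removed.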

Overall, this model shows a very strong linear relationship between average salary and lower salary, with rating adding a smaller but statistically significant effect.

Data Visualisation

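# Create a scatterplot (stored under its own name so the model p2 is not overwritten)
salary_plot <- ggplot(jobs2, aes(x = `Avg Salary(K)`,
                                 y = `Lower Salary`,
                                 color = categorical_Rating)) +
  # Define one color per rating group
  scale_color_manual(values = c("pink", "lightblue", "yellow")) +
  # Add the points
  geom_point(alpha = 0.5, size = 1) +
  labs(title = "Average Salary vs. Lower Salary",
       subtitle = "for Data Scientist Jobs in 2021",
       caption = "Source: https://aijobs.net",
       x = "Average Salary ($K)",
       y = "Lower Salary ($K)",
       # Legend title
       color = "Rating") +
  # Restrict the axes to leave out a few outliers;
  # rows outside these limits are dropped from the points and from the fit
  xlim(40, 180) +
  ylim(25, 125) +
  theme_minimal(base_size = 16) +
  # Center the title and subtitle instead of padding them with spaces
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  # Add a linear regression line; linewidth replaces the deprecated size argument
  geom_smooth(method = "lm",
              se = FALSE,
              linewidth = 1,
              color = "lightgreen")

salary_plot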

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 45 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 45 rows containing missing values or values outside the scale range
(`geom_point()`).

Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_smooth()`).
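
The warnings above appear because `xlim()` and `ylim()` drop rows outside the limits before the regression line is fitted. As a hedged alternative, `coord_cartesian()` only zooms the view and keeps every row in the fit. The sketch below uses simulated data (`toy` is hypothetical, not the jobs dataset) to count the affected rows and build both versions of the plot.

```r
library(ggplot2)

# Simulated stand-in for the salary data (made-up values)
set.seed(1)
toy <- data.frame(avg = runif(200, 20, 250))
toy$lower <- 0.8 * toy$avg - 10 + rnorm(200, sd = 5)

# Rows that fall outside the limits used in the plot above
outside   <- with(toy, avg < 40 | avg > 180 | lower < 25 | lower > 125)
n_outside <- sum(outside)
print(n_outside)

base_plot <- ggplot(toy, aes(x = avg, y = lower)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# xlim()/ylim(): out-of-range rows are removed, so the line is fitted
# to the remaining rows only (this is what triggers the warnings)
p_limits <- base_plot + xlim(40, 180) + ylim(25, 125)

# coord_cartesian(): same visible window, but the line uses all rows
p_zoom <- base_plot + coord_cartesian(xlim = c(40, 180), ylim = c(25, 125))

# Call print(p_limits) or print(p_zoom) to draw either version
```

Which version is preferable depends on the goal: dropping rows matches the choice made in this project, while zooming keeps the plotted line fitted to the full data.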

Essay

In Project 1, I first loaded the dataset using “read_csv()” and inspected it with “view()”. Then I cleaned the dataset by removing missing values (NAs) and filtering out unwanted values, such as negative ratings. I also converted Rating from a numerical variable into a categorical variable for graphing purposes. I used “cor()” and “lm()” to examine the relationships among Lower Salary, Avg Salary, and Rating. After that, I used “autoplot()” to examine the diagnostic plots and removed outliers so that the residuals would be closer to normally distributed. In addition, I limited the plot to x from 40 to 180 and y from 25 to 125 to make the visualization bigger and easier to interpret.

From the regression model, we can see a strong positive relationship between average salary and lower salary: as the average salary increases, the lower salary increases as well. One interesting observation from the scatterplot above is that rating does not show a clear visual impact on salary despite its statistical significance. This suggests that rating's impact on lower salary is much smaller than that of average salary.

From the diagnostic plots, a few observations (such as 194 and 469) might have a relatively strong influence on the model. These points slightly affect the residual distribution, but after I deleted some outliers, new ones kept appearing, which shows that the residuals are not perfectly normally distributed.

One limitation of this analysis is that I did not have time to explore other visualisations such as alluvial diagrams. With more time, I would build alluvial diagrams and compare them with the scatterplot for this dataset.

Citation

All code references are based on previous Data 110 handouts and homework.

Google search: how to change continuous variables into categorical variables in R ggplot? (code “cut” and “breaks”, used for the Rating legend)