Project 1: Data Scientist Jobs in 2021

Author

Qian He

Data Scientist Jobs in 2021

Introduction:

Dataset Topic: Data Scientist Jobs in 2021

Variables: Average Salary (K), Lower Salary, Rating

Source: https://aijobs.net

I want to explore the relationships between a job's average salary, lower salary, and rating.

Load the dataset and filter NAs and outliers

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
p1 <- read_csv("Data scientist jobs in 2021.csv")
Rows: 742 Columns: 41
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Job Title, Salary Estimate, Company Name, Location, Headquarters, ...
dbl (25): index, Rating, Founded, Hourly, Employer provided, Lower Salary, U...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(p1)

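# Keep rows with non-missing average salary, lower salary, and rating;
# Rating > 0 drops invalid ratings (e.g. negative values).
# Column names containing spaces need backticks: a quoted name is a string, which is never NA.
jobs1 <- p1 |>
  filter(!is.na(`Avg Salary(K)`),
         !is.na(Rating),
         !is.na(`Lower Salary`),
         Rating > 0)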
jobs1
# A tibble: 731 × 41
   index `Job Title`            `Salary Estimate` Rating `Company Name` Location
   <dbl> <chr>                  <chr>              <dbl> <chr>          <chr>   
 1     0 Data Scientist         $53K-$91K (Glass…    3.8 "Tecolote Res… Albuque…
 2     1 Healthcare Data Scien… $63K-$112K (Glas…    3.4 "University o… Linthic…
 3     2 Data Scientist         $80K-$90K (Glass…    4.8 "KnowBe4\n4.8" Clearwa…
 4     3 Data Scientist         $56K-$97K (Glass…    3.8 "PNNL\n3.8"    Richlan…
 5     4 Data Scientist         $86K-$143K (Glas…    2.9 "Affinity Sol… New Yor…
 6     5 Data Scientist         $71K-$119K (Glas…    3.4 "CyrusOne\n3.… Dallas,…
 7     6 Data Scientist         $54K-$93K (Glass…    4.1 "ClearOne Adv… Baltimo…
 8     7 Data Scientist         $86K-$142K (Glas…    3.8 "Logic20/20\n… San Jos…
 9     8 Research Scientist     $38K-$84K (Glass…    3.3 "Rochester Re… Rochest…
10     9 Data Scientist         $120K-$160K (Gla…    4.6 "<intent>\n4.… New Yor…
# ℹ 721 more rows
# ℹ 35 more variables: Headquarters <chr>, Size <chr>, Founded <dbl>,
#   `Type of ownership` <chr>, Industry <chr>, Sector <chr>, Revenue <chr>,
#   Competitors <chr>, Hourly <dbl>, `Employer provided` <dbl>,
#   `Lower Salary` <dbl>, `Upper Salary` <dbl>, `Avg Salary(K)` <dbl>,
#   company_txt <chr>, `Job Location` <chr>, Age <dbl>, Python <dbl>,
#   spark <dbl>, aws <dbl>, excel <dbl>, sql <dbl>, sas <dbl>, keras <dbl>, …
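
When filtering on column names that contain spaces, it matters whether the name is wrapped in backticks or quotes. The sketch below uses a small made-up tibble (the values are hypothetical, not taken from the jobs dataset) to show the difference: a quoted name is just a string, so `is.na()` on it is never TRUE and no rows are dropped.

```r
library(dplyr)

# Toy data with hypothetical values, only to illustrate the quoting rule
toy <- tibble(
  `Lower Salary` = c(50, NA, 70),
  Rating = c(3.8, 4.1, -1)
)

# Quoted name: is.na("Lower Salary") tests a string, which is never NA,
# so this condition keeps every row
kept_quoted <- toy |> filter(!is.na("Lower Salary"))

# Backticked name: tests the column values, so the NA row is dropped,
# and Rating > 0 drops the row with a negative rating
kept_backtick <- toy |> filter(!is.na(`Lower Salary`), Rating > 0)

nrow(kept_quoted)    # 3
nrow(kept_backtick)  # 1
```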

Regression & Modeling

#y=Lower Salary,x=Avg Salary(K)
#cor:correlation
library(ggfortify)
cor(jobs1$`Avg Salary(K)`,jobs1$`Lower Salary`)
[1] 0.9801454
# fit a linear regression of Lower Salary on Avg Salary and Rating
# (note: this reuses the name p1, so the raw data frame loaded earlier is replaced by the model)
p1 <- lm(`Lower Salary` ~ `Avg Salary(K)`+ Rating, data=jobs1)
summary(p1)

Call:
lm(formula = `Lower Salary` ~ `Avg Salary(K)` + Rating, data = jobs1)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.672  -3.477   0.602   2.441  32.321 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -13.190266   1.492199  -8.839  < 2e-16 ***
`Avg Salary(K)`   0.796005   0.005956 133.638  < 2e-16 ***
Rating            1.875276   0.387904   4.834 1.63e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.924 on 728 degrees of freedom
Multiple R-squared:  0.9619,    Adjusted R-squared:  0.9618 
F-statistic:  9192 on 2 and 728 DF,  p-value: < 2.2e-16
#diagnostic plots
autoplot(p1,1:4,nrow=2,ncol=2)
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

options(scipen=0)
# remove the flagged observations in rows 54 and 110
jobs2<-jobs1[-c(54,110),]
p2<-lm(`Lower Salary` ~ `Avg Salary(K)`+ Rating, data=jobs2)
summary(p2)

Call:
lm(formula = `Lower Salary` ~ `Avg Salary(K)` + Rating, data = jobs2)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.666  -3.531   0.597   2.443  32.329 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -13.178918   1.496762  -8.805  < 2e-16 ***
`Avg Salary(K)`   0.795956   0.005972 133.273  < 2e-16 ***
Rating            1.872983   0.388836   4.817 1.78e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.932 on 726 degrees of freedom
Multiple R-squared:  0.9618,    Adjusted R-squared:  0.9617 
F-statistic:  9134 on 2 and 726 DF,  p-value: < 2.2e-16
#re-run diagnostic plots
autoplot(p2,1:4,nrow=2,ncol=2)

The regression model is: Lower Salary = 0.7960 (Avg Salary(K)) + 1.8730 (Rating) - 13.1789

The coefficient for Avg Salary is 0.7960, which means Avg Salary has a strong positive relationship with Lower Salary. As Avg Salary increases by 1K, the predicted Lower Salary increases by about 0.80K, holding Rating constant.

The p-value for Avg Salary is less than 2e-16 (three asterisks), which suggests it is a highly significant predictor of Lower Salary. The adjusted R-squared indicates that about 96% of the variation in Lower Salary is explained by the model.
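
As a rough check on how much of that 96% comes from Avg Salary alone, the correlation printed earlier can be squared, since for a regression with a single predictor R-squared equals the squared correlation. The sketch below only redoes that arithmetic with numbers copied from the `cor()` and `summary(p1)` output; the variable names are illustrative.

```r
# Values copied from the cor() and summary(p1) output above
r_avg_lower <- 0.9801454   # cor(Avg Salary(K), Lower Salary) on jobs1
r2_full     <- 0.9619      # Multiple R-squared with Avg Salary(K) + Rating

# R-squared of a one-predictor regression is the squared correlation
r2_avg_only <- r_avg_lower^2
round(r2_avg_only, 4)

# Extra share of variance explained once Rating is added
round(r2_full - r2_avg_only, 4)
```

The two results are about 0.9607 and 0.0012, so nearly all of the explained variation is already captured by Avg Salary.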

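The coefficient for Rating is 1.8730 and its p-value is 1.78e-06, so Rating is statistically significant in this model. Its effect is small next to Avg Salary, though: a one-point higher rating is associated with a predicted Lower Salary about 1.87K higher, holding Avg Salary constant.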
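
To make the coefficients more concrete, here is a minimal sketch that plugs the estimates printed by `summary(p2)` into the regression equation. The helper `predict_lower()` and the example posting (100K average salary, rating 4.0) are hypothetical and only there for illustration.

```r
# Coefficients copied from the summary(p2) output above
b_intercept <- -13.178918
b_avg       <- 0.795956
b_rating    <- 1.872983

# Hypothetical helper: predicted Lower Salary (in K) for a posting
predict_lower <- function(avg_salary_k, rating) {
  b_intercept + b_avg * avg_salary_k + b_rating * rating
}

# Illustrative posting: Avg Salary of 100K at a company rated 4.0
predict_lower(100, 4)

# Change in the prediction from one extra rating point vs. 10K more average salary
predict_lower(100, 5) - predict_lower(100, 4)
predict_lower(110, 4) - predict_lower(100, 4)
```

The prediction for the example posting is about 73.9K. One extra rating point moves it by about 1.87K, while 10K more average salary moves it by about 7.96K.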

In autoplot(p1), the Residuals vs Fitted and Scale-Location plots both flag observations 54 and 110, which have large residuals and high scale-location values. In autoplot(p2), the Residuals vs Fitted plot shows residuals mostly scattered around zero, and the Scale-Location plot shows values mostly around 0.6, which suggests the residuals are roughly centered with fairly even spread. In the Normal Q-Q plot, a few points deviate at the right end. From the Cook’s distance plot, observations 135 and 193 have the strongest influence on the model.
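
Influential observations can also be listed numerically instead of being read off the Cook's distance panel. The sketch below uses R's built-in `mtcars` data as a stand-in for the jobs data, and the 4/n cutoff is only a common rule of thumb, so both are assumptions rather than part of the original analysis.

```r
# Stand-in model on R's built-in mtcars data (not the jobs dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation
cooks <- cooks.distance(fit)

# Rule-of-thumb cutoff of 4/n (an assumption; other cutoffs are also used)
cutoff <- 4 / nrow(mtcars)
influential <- cooks[cooks > cutoff]

# Most influential observations first
sort(influential, decreasing = TRUE)
```

Applied to this report, the same calls on the fitted model would give a ranked list to compare against the observation numbers labeled in the diagnostic plots.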

Overall, this model shows a very strong linear relationship between average salary and lower salary. Rating is statistically significant in the regression output, but it adds little explanatory power beyond average salary.

Data Visualisation

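# create a scatterplot of average salary vs lower salary, colored by rating
# (named salary_plot so the fitted model p2 is not overwritten)
salary_plot <- ggplot(jobs2, aes(x = `Avg Salary(K)`,
                                 y = `Lower Salary`,
                                 color = factor(Rating))) +
  # add the points
  geom_point(alpha = 0.5, size = 1) +
  labs(title = "Average Salary vs Lower Salary",
       subtitle = "for Data Scientist Jobs in 2021",
       caption = "Source: https://aijobs.net",
       x = "Average Salary ($K)",
       y = "Lower Salary ($K)",
       # set the legend title so it does not read "factor(Rating)"
       color = "Rating") +
  # restrict the axes; rows outside these limits are dropped before plotting and fitting
  xlim(40, 180) +
  ylim(25, 125) +
  theme_minimal(base_size = 16) +
  # add a linear trend line; linewidth replaces the deprecated size argument for lines
  geom_smooth(method = "lm",
              se = FALSE,
              linewidth = 1,
              color = "lightgreen")

salary_plot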
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 46 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 46 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_smooth()`).
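
The warnings above report rows removed from `stat_smooth()`, which means the trend line was fit only to the rows inside the axis limits. As a hedged sketch of the alternative, the example below uses the built-in `mtcars` data (a stand-in, not the jobs data) to contrast `xlim()`, which drops out-of-range rows before fitting, with `coord_cartesian()`, which only zooms the view.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# xlim() turns out-of-range values into NA, so the line is fit to cars with wt <= 4 only
clipped <- base_plot + xlim(1, 4)

# coord_cartesian() only zooms the view; the line is still fit to all rows
zoomed <- base_plot + coord_cartesian(xlim = c(1, 4))

# The same difference, shown numerically with plain lm()
slope_subset <- coef(lm(mpg ~ wt, data = subset(mtcars, wt <= 4)))[["wt"]]
slope_all    <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]
c(subset = slope_subset, all = slope_all)
```

Which behavior is preferable depends on the goal: dropping rows is fine when the excluded points are meant to be ignored, while zooming keeps the fitted line consistent with the regression on the full data.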

Essay

In Project 1, I first loaded the dataset with “read_csv()” and inspected it with “view()”. Then I cleaned the dataset by removing missing values (NAs) and filtering out unwanted values, such as negative ratings. I also used “cor()” and “lm()” to measure the correlation and fit a regression relating Lower Salary to Avg Salary and Rating. After that, I used “autoplot()” to check the diagnostic plots and removed outliers so that the residuals would be closer to normally distributed. In addition, I limited the plot range to 40 to 180 on the x-axis and 25 to 125 on the y-axis to make the main cluster of points larger and easier to interpret.

From the regression line in the visualization, we can easily see a strong positive relationship between average salary and lower salary. As the average salary increases, the lower salary increases as well. One interesting observation is that the points colored by rating do not show a clear pattern in the plot, which was surprising given my expectations; in the regression output, Rating has only a small effect compared with average salary.

From the diagnostic plots, a few observations (such as 54 and 110) might have a stronger influence on the model. These points slightly affect the residual distribution, but after I deleted them, new outliers appeared in the re-run plots.

One limitation of this analysis is that I did not have time to explore other visualisations, such as alluvial diagrams. If I had more time, I would build alluvial diagrams and compare them with the scatterplot for this dataset.

Citation

All code is based on previous Data 110 handouts and homework.