I want to explore the relationships between a job's average salary, lower salary, and Rating.
Load the dataset and filter NAs and outliers
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
p1 <- read_csv("Data scientist jobs in 2021.csv")
Rows: 742 Columns: 41
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Job Title, Salary Estimate, Company Name, Location, Headquarters, ...
dbl (25): index, Rating, Founded, Hourly, Employer provided, Lower Salary, U...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(p1)
# Filter out NAs in average salary, lower salary, and Rating, and drop non-positive ratings.
# Column names that contain spaces need backticks; a quoted string is not a column reference.
jobs1 <- p1 |>
  filter(
    !is.na(`Avg Salary(K)`),
    !is.na(Rating),
    !is.na(`Lower Salary`),
    Rating > 0
  )
jobs1
# A tibble: 731 × 41
index `Job Title` `Salary Estimate` Rating `Company Name` Location
<dbl> <chr> <chr> <dbl> <chr> <chr>
1 0 Data Scientist $53K-$91K (Glass… 3.8 "Tecolote Res… Albuque…
2 1 Healthcare Data Scien… $63K-$112K (Glas… 3.4 "University o… Linthic…
3 2 Data Scientist $80K-$90K (Glass… 4.8 "KnowBe4\n4.8" Clearwa…
4 3 Data Scientist $56K-$97K (Glass… 3.8 "PNNL\n3.8" Richlan…
5 4 Data Scientist $86K-$143K (Glas… 2.9 "Affinity Sol… New Yor…
6 5 Data Scientist $71K-$119K (Glas… 3.4 "CyrusOne\n3.… Dallas,…
7 6 Data Scientist $54K-$93K (Glass… 4.1 "ClearOne Adv… Baltimo…
8 7 Data Scientist $86K-$142K (Glas… 3.8 "Logic20/20\n… San Jos…
9 8 Research Scientist $38K-$84K (Glass… 3.3 "Rochester Re… Rochest…
10 9 Data Scientist $120K-$160K (Gla… 4.6 "<intent>\n4.… New Yor…
# ℹ 721 more rows
# ℹ 35 more variables: Headquarters <chr>, Size <chr>, Founded <dbl>,
# `Type of ownership` <chr>, Industry <chr>, Sector <chr>, Revenue <chr>,
# Competitors <chr>, Hourly <dbl>, `Employer provided` <dbl>,
# `Lower Salary` <dbl>, `Upper Salary` <dbl>, `Avg Salary(K)` <dbl>,
# company_txt <chr>, `Job Location` <chr>, Age <dbl>, Python <dbl>,
# spark <dbl>, aws <dbl>, excel <dbl>, sql <dbl>, sas <dbl>, keras <dbl>, …
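Because several columns in this dataset have spaces in their names, the way they are quoted inside filter() matters. The following is a minimal sketch on a made-up tibble (the values and the name `toy` are hypothetical, not taken from the jobs data) showing how a quoted name and a backticked name behave differently.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical values, not from the jobs dataset)
toy <- tibble(
  `Lower Salary` = c(50, NA, 80),
  Rating = c(3.8, 4.1, -1)
)

# A quoted name is a character constant, so is.na() is always FALSE
# and the condition keeps every row.
kept_quoted <- toy |> filter(!is.na("Lower Salary"))

# Backticks refer to the column itself, so the row with NA is dropped.
kept_backtick <- toy |> filter(!is.na(`Lower Salary`))

nrow(kept_quoted)    # 3
nrow(kept_backtick)  # 2
```

If a column truly has no missing values, both forms return the same rows, so the difference only shows up when NAs are present.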
# Find the relationship between Lower Salary, Avg Salary, and Rating
# (note: this reuses the name p1, which previously held the raw data)
p1 <- lm(`Lower Salary` ~ `Avg Salary(K)` + Rating, data = jobs1)
summary(p1)
Call:
lm(formula = `Lower Salary` ~ `Avg Salary(K)` + Rating, data = jobs1)
Residuals:
Min 1Q Median 3Q Max
-35.672 -3.477 0.602 2.441 32.321
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.190266 1.492199 -8.839 < 2e-16 ***
`Avg Salary(K)` 0.796005 0.005956 133.638 < 2e-16 ***
Rating 1.875276 0.387904 4.834 1.63e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.924 on 728 degrees of freedom
Multiple R-squared: 0.9619, Adjusted R-squared: 0.9618
F-statistic: 9192 on 2 and 728 DF, p-value: < 2.2e-16
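One way to avoid copying numbers by hand from a summary like the one above is to pull them from the model object. The sketch below uses the built-in mtcars data as a stand-in (not the jobs data), and the names `fit`, `b`, and `eq` are illustrative only.

```r
# Stand-in data: the built-in mtcars dataset, NOT the jobs data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Named vector of estimates: (Intercept), wt, hp
b <- coef(fit)

# Assemble the fitted equation directly from the estimates;
# %+.4f prints an explicit sign in front of each term.
eq <- sprintf(
  "mpg = %.4f %+.4f(wt) %+.4f(hp)",
  b[["(Intercept)"]], b[["wt"]], b[["hp"]]
)
print(eq)

# The p-values are the "Pr(>|t|)" column of the coefficient table.
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]
print(signif(p_values, 3))
```

Building the equation from coef() keeps the written interpretation in step with the model that was actually fitted, even after the data are re-filtered.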
# Diagnostic plots (autoplot() for lm objects comes from ggfortify)
library(ggfortify)
autoplot(p1, which = 1:4, nrow = 2, ncol = 2)
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
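The diagnostic plots label influential observations visually, and the same information can be listed numerically. The sketch below uses base R's cooks.distance() on the built-in mtcars data as a stand-in for the jobs model. The 4/n cutoff is a common rule of thumb and an assumption here, not something taken from this project.

```r
# Stand-in model on built-in data, NOT the jobs data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation (named by row).
cooks <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption; other thresholds are also used).
cutoff <- 4 / nrow(mtcars)

# Observations above the cutoff, largest first.
influential <- sort(cooks[cooks > cutoff], decreasing = TRUE)
print(influential)
```

Listing the values makes it easier to see whether the flagged points are far above the cutoff or only marginally so.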
The regression model is: Lower Salary = 0.7960(Avg Salary(K)) + 1.8753(Rating) - 13.1903
The coefficient for Avg Salary is 0.7960, which means Avg Salary has a strong positive relationship with Lower Salary. As Avg Salary increases by 1 (K), the predicted Lower Salary increases by about 0.80 (K), holding Rating constant.
The p-value of Avg Salary is less than 2e-16 and has 3 asterisks, which suggests it is a highly meaningful variable for explaining the linear increase in Lower Salary. The Adjusted R-squared value indicates that about 96% of the variation in Lower Salary is explained by the model.
The coefficient for Rating is 1.8753 and its p-value is 1.63e-06, so Rating is statistically significant in this fit. However, its t value (4.834) is far smaller than that of Avg Salary (133.638), so its contribution is small by comparison.
In autoplot(p1), the Residuals vs Fitted plot and the Scale-Location plot both flag observations 54 and 110, which have large residuals and high scale-location values. According to autoplot(p2), the Residuals vs Fitted and Scale-Location plots show values mostly distributed around zero and 0.6 respectively, which suggests that the linearity and constant-variance assumptions roughly hold. In the Normal Q-Q plot, a few outliers appear at the right end. From the Cook's distance plot, we can conclude that observations 135 and 193 have a strong effect on the model.
Overall, this model shows a very strong linear relationship between average salary and lower salary, while Rating plays at most a minor role by comparison.
Data Visualisation
# Create a scatterplot
library(ggfortify)
p2 <- ggplot(jobs2, aes(x = `Avg Salary(K)`, y = `Lower Salary`, color = factor(Rating))) +
  # add the points
  geom_point(alpha = 0.5, size = 1) +
  labs(
    title = "Average Salary vs Lower Salary",
    subtitle = "for Data Scientist Jobs in 2021",
    caption = "Source: https://aijobs.net",
    x = "Average Salary (K$)",
    y = "Lower Salary (K$)",
    # remove "factor" before Rating in the legend
    color = "Rating"
  ) +
  # exclude a few outliers from the plot
  xlim(40, 180) +
  ylim(25, 125) +
  theme_minimal(base_size = 16) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1, color = "lightgreen")
p2
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 46 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 46 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_smooth()`).
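The "Removed ... rows" warnings above arise because xlim() and ylim() set scale limits: values outside the range are turned into NA and dropped before geom_smooth() fits its line. If the goal is only to zoom in, coord_cartesian() is an alternative that keeps all rows in the fit. The sketch below illustrates the difference on the built-in mtcars data, which is a stand-in and not the jobs data. The plot names are illustrative.

```r
library(ggplot2)

# Stand-in data: built-in mtcars, NOT the jobs data.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# xlim() sets scale limits: points outside the range become NA and are
# dropped before the regression line is fitted.
p_drop <- base_plot + xlim(2, 4)

# coord_cartesian() only zooms the view: the line is still fitted to all rows.
p_zoom <- base_plot + coord_cartesian(xlim = c(2, 4))
```

Printing `p_drop` and `p_zoom` side by side shows whether the excluded points change the slope of the fitted line.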
Essay
In Project 1, I first loaded the dataset using “read_csv()” and inspected it with “view()”. Then I cleaned the dataset by removing missing values (NAs) and filtering out unwanted values, such as negative ratings. I also used “cor()” and “lm(y ~ x)” to examine the relationship between Lower Salary, Avg Salary, and Rating. After that, I used “autoplot()” to check the diagnostic plots and removed outliers so that the residuals would be closer to a normal distribution. In addition, I limited the salary range with “xlim(40, 180)” and “ylim(25, 125)” in the plot to make the visualization larger and easier to interpret.
From the regression line in the visualization, we can easily see a strong positive relationship between average salary and lower salary. As the average salary increases, the lower salary increases as well. One interesting observation is that, in the plot, ratings do not show a clear impact on salary, which was surprising given my expectations.
From the diagnostic plots, a few observations (such as 54 and 110) might have a stronger influence on the model. These points slightly affect the residual distribution, but after deleting some outliers, new ones just kept appearing.
One limitation of this analysis is that I did not have time to explore other visualisations, such as alluvial diagrams. If I had more time, I would use alluvial diagrams and compare them with scatterplots for this dataset.
Citation
All code references are based on previous Data 110 handouts and homework.