##packages
library(readr)
library(dplyr)
library(tidyr)
library(Hmisc)
library(forecast)
# Dataset 1
Data1<-read_csv("C:/Users/Pooja/Downloads/DS.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
X1 = col_double(),
title = col_character(),
company = col_character(),
cpage = col_character(),
ratings = col_double(),
location = col_character(),
days_ago = col_double(),
summary = col_character()
)
head(Data1)
# Dataset 2
Data2<-read_csv("C:/Users/Pooja/Downloads/listings.csv")
Parsed with column specification:
cols(
.default = col_double(),
jobTitle = col_character(),
jobClassification = col_character(),
jobSubClassification = col_character(),
advertiserName = col_character(),
companyName = col_character(),
listingDate = col_datetime(format = ""),
expiryDate = col_datetime(format = ""),
teaser = col_character(),
nation = col_character(),
state = col_character(),
city = col_character(),
area = col_character(),
suburb = col_character(),
workType = col_character(),
salary_string = col_character(),
isRightToWorkRequired = col_logical(),
desktopAdTemplate = col_character(),
mobileAdTemplate = col_character(),
companyProfileUrl = col_character(),
seekJobListingUrl = col_character()
# ... with 2 more columns
)
See spec(...) for full column specifications.
head(Data2)
# merged dataset
Data3<-left_join(Data2,Data1,by=c("jobTitle"="title"))
head(Data3)
Two datasets are used in this study.
The first dataset covers data scientist jobs in Australia as listed on Indeed.com on October 25, 2019.
Data scientist is one of the most in-demand jobs right now, and this dataset is the result of a job search for that role.
It was web-scraped using BeautifulSoup and Selenium.
It contains 8 columns.
The data was obtained from Kaggle.com.
It includes variables of several types.
The variables are:
X1 (the unnamed first column of the CSV): running count of the data scientist jobs in the dataset
title: title of the job (role)
company: company name
cpage: URL of the company page on Indeed
ratings: rating of the company given by its employees
location: location of the company
days_ago: number of days since the job was posted
summary: a brief summary of the role provided by the company
This dataset can be found at https://www.kaggle.com/santokalayil/data-scientist-jobs-in-australia-october-25-2019.
The second dataset is also taken from Kaggle.com.
It contains IT job listings in Australia, including data scientist jobs, for 2019-2020.
It was scraped from seek.com.
It adds further information about data science jobs that complements the first dataset, such as current trends for the role,
for example which programming languages to learn or specialise in for a data science role.
It is the collection of every search result for "data scientist" combined with each of the programming languages and applications.
It consists of over 50 columns of numeric, text and geographic data.
It is beginner-friendly and well suited to visualisation.
Some of its more than 50 variables are:
jobId: ID given to the job
jobTitle: title (role) of the job
jobClassification: category of the job
advertiserName: institution that offered the job
listingDate: date on which the job was posted
expiryDate: last date for applying for the job
teaser: a brief description of the role
nation: country where the role is available
state, city, area, suburb: location of the workplace
workType: type of job, e.g. Full Time, Contract/Temp
salary_string: salary expected for the role
isRightToWorkRequired: TRUE/FALSE, whether the role requires right-to-work (government visa) conditions
desktopAdTemplate: brief template of the job as displayed on the desktop
seekJobListingUrl: URL of the job listing on seek.com
R, Python, SQL and the other language and tool columns: 0/1 indicators for the programming languages required for the job
recruiter: 0/1 flag relating to the person or institution recruiting for the job
This dataset can be found at https://www.kaggle.com/nomilk/data-science-job-listings-australia-20192020.
The first dataset is read with the read_csv() function and stored in the data frame Data1, and its first rows are displayed with head(). The second dataset is handled the same way and stored in the data frame Data2.
To meet requirement #1 of the assignment, the two datasets are merged with left_join(): Data2 is left-joined with Data1 on the common variable, the job title (jobTitle in Data2, title in Data1).
A left join keeps every row of Data2 and attaches the Data1 columns wherever the titles match; rows without a match get NA in the Data1 columns, and a row that matches several Data1 rows is repeated once per match.
This merged dataset, Data3, is used in the following sections of the study.
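The behaviour of the left join described above can be checked on a small example. This is only a sketch: the two toy tables and their values are hypothetical stand-ins for Data2 and Data1, not rows from the real files.

```r
library(dplyr)

# Hypothetical toy tables standing in for Data2 (seek listings) and Data1 (Indeed jobs)
listings <- tibble(
  jobTitle = c("Data Scientist", "Data Engineer", "Data Analyst"),
  jobId    = c(1, 2, 3)
)
indeed <- tibble(
  title   = c("Data Scientist", "Data Scientist", "Data Analyst"),
  company = c("A", "B", "C")
)

merged <- left_join(listings, indeed, by = c("jobTitle" = "title"))

# Every listings row is kept: "Data Scientist" is repeated once per match,
# and "Data Engineer" gets NA in the column that came from `indeed`.
print(merged)

# anti_join() lists the left-hand rows that found no match at all
unmatched <- anti_join(listings, indeed, by = c("jobTitle" = "title"))
print(unmatched)
```

Comparing nrow() before and after the join is a quick way to see whether non-unique titles have multiplied rows.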
summary(Data3)
jobId jobTitle jobClassification jobSubClassification
Min. :38098375 Length:29751 Length:29751 Length:29751
1st Qu.:38965720 Class :character Class :character Class :character
Median :39597982 Mode :character Mode :character Mode :character
Mean :39617436
3rd Qu.:40270299
Max. :41037334
advertiserName advertiserId companyId companyName companyRating
Length:29751 Min. : 1742 Min. :432299 Length:29751 Min. :2.600
Class :character 1st Qu.:22062501 1st Qu.:432419 Class :character 1st Qu.:3.200
Mode :character Median :26734030 Median :432709 Mode :character Median :3.400
Mean :26346559 Mean :509449 Mean :3.483
3rd Qu.:32860434 3rd Qu.:434719 3rd Qu.:3.700
Max. :44645027 Max. :834866 Max. :4.800
NA's :26395 NA's :26395
listingDate expiryDate teaser
Min. :2019-01-16 12:17:41 Min. :2019-03-06 23:59:59 Length:29751
1st Qu.:2019-05-07 17:16:09 1st Qu.:2019-06-06 23:59:59 Class :character
Median :2019-07-31 16:38:17 Median :2019-08-30 23:59:59 Mode :character
Mean :2019-08-04 20:04:29 Mean :2019-09-04 02:22:18
3rd Qu.:2019-10-29 16:00:25 3rd Qu.:2019-11-28 23:59:59
Max. :2020-02-25 15:55:48 Max. :2020-03-27 00:00:00
nation state city area suburb
Length:29751 Length:29751 Length:29751 Length:29751 Length:29751
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
workType salary_string isRightToWorkRequired desktopAdTemplate
Length:29751 Length:29751 Mode :logical Length:29751
Class :character Class :character FALSE:19579 Class :character
Mode :character Mode :character TRUE :10172 Mode :character
mobileAdTemplate companyProfileUrl seekJobListingUrl R Python
Length:29751 Length:29751 Length:29751 Min. :0.0000 Min. :0.0000
Class :character Class :character Class :character 1st Qu.:1.0000 1st Qu.:1.0000
Mode :character Mode :character Mode :character Median :1.0000 Median :1.0000
Mean :0.7955 Mean :0.8671
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
Matlab SQL Stata Minitab SPSS
Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00e+00 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00e+00 1st Qu.:0.00000
Median :0.0000 Median :1.0000 Median :0.00000 Median :0.00e+00 Median :0.00000
Mean :0.0882 Mean :0.6727 Mean :0.01079 Mean :3.36e-05 Mean :0.03244
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00e+00 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00e+00 Max. :1.00000
Ruby C Scala Tableau Java
Min. :0.000000 Min. :0.00000 Min. :0.0000 Min. :0.000 Min. :0.0000
1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
Median :0.000000 Median :0.00000 Median :0.0000 Median :0.000 Median :0.0000
Mean :0.002286 Mean :0.09536 Mean :0.2488 Mean :0.221 Mean :0.1328
3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.0000
Max. :1.000000 Max. :1.00000 Max. :1.0000 Max. :1.000 Max. :1.0000
Hadoop SAS Julia Knime D3
Min. :0.0000 Min. :0.000 Min. :0.00000 Min. :0.000000 Min. :0.000000
1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.0000 Median :0.000 Median :0.00000 Median :0.000000 Median :0.000000
Mean :0.2565 Mean :0.201 Mean :0.02555 Mean :0.002084 Mean :0.005613
3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
Max. :1.0000 Max. :1.000 Max. :1.00000 Max. :1.000000 Max. :1.000000
Clojure Haskell Lisp Golang Spark
Min. :0.00e+00 Min. :0.000000 Min. :0 Min. :0.000000 Min. :0.0000
1st Qu.:0.00e+00 1st Qu.:0.000000 1st Qu.:0 1st Qu.:0.000000 1st Qu.:0.0000
Median :0.00e+00 Median :0.000000 Median :0 Median :0.000000 Median :0.0000
Mean :3.36e-05 Mean :0.001647 Mean :0 Mean :0.001546 Mean :0.1349
3rd Qu.:0.00e+00 3rd Qu.:0.000000 3rd Qu.:0 3rd Qu.:0.000000 3rd Qu.:0.0000
Max. :1.00e+00 Max. :1.000000 Max. :0 Max. :1.000000 Max. :1.0000
Javascript F# Fortran first_seen
Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :2019-03-06
1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:2019-05-07
Median :0.000000 Median :0.000000 Median :0.000000 Median :2019-07-31
Mean :0.005412 Mean :0.001513 Mean :0.003092 Mean :2019-08-05
3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:2019-10-29
Max. :1.000000 Max. :1.000000 Max. :1.000000 Max. :2020-02-25
last_seen recruiter X1 company cpage
Min. :2019-03-06 Min. :0.0000 Min. : 1.0 Length:29751 Length:29751
1st Qu.:2019-06-01 1st Qu.:0.0000 1st Qu.: 18.0 Class :character Class :character
Median :2019-08-21 Median :1.0000 Median : 76.0 Mode :character Mode :character
Mean :2019-08-27 Mean :0.5409 Mean :145.1
3rd Qu.:2019-11-21 3rd Qu.:1.0000 3rd Qu.:352.0
Max. :2020-02-26 Max. :1.0000 Max. :510.0
NA's :599
ratings location days_ago summary
Min. :0.0000 Length:29751 Min. : 1.00 Length:29751
1st Qu.:0.0000 Class :character 1st Qu.: 6.00 Class :character
Median :0.0000 Mode :character Median :14.00 Mode :character
Mean :0.7778 Mean :14.63
3rd Qu.:0.0000 3rd Qu.:23.00
Max. :4.1000 Max. :30.00
NA's :599 NA's :599
class(Data3$listingDate)
[1] "POSIXct" "POSIXt"
Data3$listingDate<-as.Date(Data3$listingDate)
class(Data3$expiryDate)
[1] "POSIXct" "POSIXt"
Data3$expiryDate<-as.Date(Data3$expiryDate)
class(Data3$last_seen)
[1] "Date"
class(Data3$jobId)
[1] "numeric"
Data3$jobId<-as.character(Data3$jobId)
Data3$workType<-factor(Data3$workType,labels = c("Casual/Vacation","Contract/Temp", "Full Time" , "Part Time"))
class(Data3$workType)
[1] "factor"
levels(Data3$workType)
[1] "Casual/Vacation" "Contract/Temp" "Full Time" "Part Time"
The variables in Data3 are summarised using the summary() function.
The dataset contains variables of different types, including numeric, character, logical and date-time.
The summary shows character variables such as summary and jobClassification, and numeric variables such as companyRating.
The data types are then checked using the class() function.
The listingDate and expiryDate variables are of class "POSIXct" "POSIXt", so they are converted to Date using the as.Date() function.
The jobId variable, which is numeric, is converted to character to show that variables can be converted to other data types; as an identifier rather than a quantity, it is also better stored as text.
The assignment requires at least one factor variable, and workType is a natural choice, so it is converted with factor() and labelled.
The class and levels of the resulting factor are checked using class() and levels().
This section therefore meets requirements #2 to #4.
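The conversions described above can be sketched in base R as follows. The values are hypothetical, and the explicit `levels` argument is a suggestion rather than what the original code does: factor(x, labels = ...) attaches the labels to the sorted unique values of x, so it relies on all four categories being present and sorting in the expected order.

```r
# Hypothetical timestamp, not taken from the dataset
listed <- as.POSIXct("2019-10-29 16:00:25", tz = "UTC")
class(listed)                                # "POSIXct" "POSIXt"

# as.Date() drops the time of day; tz controls which calendar day is used
listed_date <- as.Date(listed, tz = "UTC")
class(listed_date)                           # "Date"

# Hypothetical work types; "Casual/Vacation" is deliberately absent
work_type <- c("Full Time", "Contract/Temp", "Full Time", "Part Time")

# Naming the levels explicitly keeps all four categories in a known order,
# even when one of them does not occur in the data
work_type_f <- factor(
  work_type,
  levels = c("Casual/Vacation", "Contract/Temp", "Full Time", "Part Time")
)
levels(work_type_f)
table(work_type_f)
```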
Data1<-Data1%>%separate(location, into = c("City", "State"),sep =" ")
Expected 2 pieces. Additional pieces discarded in 47 rows [9, 10, 52, 54, 61, 76, 83, 104, 108, 131, 133, 135, 139, 145, 156, 158, 174, 176, 180, 188, ...].
Expected 2 pieces. Missing pieces filled with `NA` in 34 rows [7, 13, 16, 21, 28, 29, 34, 39, 44, 45, 53, 79, 143, 152, 175, 178, 183, 206, 213, 244, ...].
head(Data1$City)
[1] "Sydney" "Canberra" "Sydney" "Melbourne" "Taringa" "Sydney"
head(Data1$State)
[1] "NSW" "ACT" "NSW" "VIC" "QLD" "NSW"
Data1 does not fully conform to tidy data principles, because the location variable is not atomic: it holds two values (city and state) in a single cell.
In tidy data: 1. Each variable must have its own column. 2. Each observation must have its own row. 3. Each value must have its own cell.
The location variable is therefore split into City and State using the separate() function. The warnings show that splitting on a single space is imperfect: 47 rows had more than two pieces (the extra pieces were discarded) and 34 rows had only one (State was filled with NA).
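The two warnings can be reproduced, and one possible workaround illustrated, with a small sketch. The location strings below are hypothetical and only assume the "City STATE" pattern visible in the head() output. The `extra` and `fill` arguments of separate() make the handling of surplus and missing pieces explicit, and the base R part shows one alternative that splits on the last space only.

```r
library(tidyr)

# Hypothetical locations: two-word city, normal case, and no state at all
jobs <- data.frame(
  location = c("Sydney NSW", "North Sydney NSW", "Australia"),
  stringsAsFactors = FALSE
)

# extra = "merge" keeps surplus pieces in the last column instead of dropping them;
# fill = "right" pads short rows with NA on the right, without a warning.
split_first <- separate(jobs, location, into = c("City", "State"),
                        sep = " ", extra = "merge", fill = "right")
print(split_first)   # "North Sydney NSW" still splits after "North"

# Alternative (base R): split on the LAST space so multi-word cities stay intact
has_space  <- grepl(" ", jobs$location)
jobs$City  <- ifelse(has_space, sub(" [^ ]+$", "", jobs$location), jobs$location)
jobs$State <- ifelse(has_space, sub("^.* ", "", jobs$location), NA_character_)
print(jobs)
```

Whether splitting on the last space is right for every row depends on the real location strings, which would need to be inspected first.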
class(Data3$listingDate)
[1] "Date"
class(Data3$expiryDate)
[1] "Date"
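# Both columns are already Date objects, so as.Date() returns them unchanged and no
# format string is needed ("%yyyy/%mm/%dd" is not a valid format; "%Y-%m-%d" would be).
Data3$listingDate<-as.Date(Data3$listingDate)
Data3$expiryDate<-as.Date(Data3$expiryDate)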
Data3<-mutate(Data3,Duration=(expiryDate-listingDate))
Data3$Duration
Time differences in days
[1] 58 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[31] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[61] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[91] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[121] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[151] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[181] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[211] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[241] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[271] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[301] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[331] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[361] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[391] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 41 30 30 30 30 30 30 30 30 30 30 30
[421] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[451] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[481] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[511] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[541] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[571] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[601] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[631] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[661] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[691] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[721] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[751] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[781] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[811] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[841] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[871] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[901] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[931] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[961] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
[991] 30 30 30 30 30 30 30 30 30 30
[ reached getOption("max.print") -- omitted 28751 entries ]
In this section, a new variable is created with mutate().
First, the classes of listingDate and expiryDate are checked using class(); both are already Date from the earlier conversion.
The as.Date() calls shown above therefore leave them unchanged.
mutate() then creates the Duration variable, the number of days between the date the job was posted on the website and the date it expires there.
In the values printed above, almost every listing runs for 30 days.
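As a small worked example of the date arithmetic behind Duration (a sketch with hypothetical dates, using base R rather than mutate()):

```r
# Hypothetical listing and expiry dates
listing <- as.Date(c("2019-01-16", "2019-05-07"))
expiry  <- as.Date(c("2019-03-15", "2019-06-06"))

# Subtracting two Date vectors gives a difftime measured in days
duration <- expiry - listing
class(duration)                      # "difftime"

# Convert to plain numbers for summaries and plots
duration_days <- as.numeric(duration, units = "days")
print(duration_days)

# A format string is only needed when parsing text that is not in ISO order
parsed <- as.Date("25/10/2019", format = "%d/%m/%Y")
print(parsed)
```

For a long column, table(duration_days) or summary(duration_days) gives a more compact view than printing every value.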
sum(is.na(Data3))
[1] 145422
colSums(is.na(Data3))
jobId jobTitle jobClassification jobSubClassification
0 0 0 0
advertiserName advertiserId companyId companyName
0 0 26395 21589
companyRating listingDate expiryDate teaser
26395 0 0 0
nation state city area
0 0 0 11302
suburb workType salary_string isRightToWorkRequired
8820 0 19617 0
desktopAdTemplate mobileAdTemplate companyProfileUrl seekJobListingUrl
5522 0 21589 0
R Python Matlab SQL
0 0 0 0
Stata Minitab SPSS Ruby
0 0 0 0
C Scala Tableau Java
0 0 0 0
Hadoop SAS Julia Knime
0 0 0 0
D3 Clojure Haskell Lisp
0 0 0 0
Golang Spark Javascript F#
0 0 0 0
Fortran first_seen last_seen recruiter
0 0 0 0
X1 company cpage ratings
599 599 599 599
location days_ago summary Duration
599 599 599 0
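Data3$days_ago<- impute(Data3$days_ago, fun = mean)
Data3$ratings<-impute(Data3$ratings,fun=mean)
Data3<-select(Data3,-companyName:-companyRating)
Data3<-select(Data3,-companyProfileUrl)
Data3<-select(Data3,-cpage,-company,-location,-summary)
# Base R's mode() returns the storage mode of an object (e.g. "character"),
# not its most frequent value, so a small helper computes the statistical mode.
most_frequent<-function(x) names(which.max(table(x)))
Data3$area<-impute(Data3$area,fun=most_frequent)
Data3$suburb<-impute(Data3$suburb,fun=most_frequent)
Data3$salary_string<-impute(Data3$salary_string,fun=most_frequent)
Data3$desktopAdTemplate<-impute(Data3$desktopAdTemplate,fun=most_frequent)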
head(Data3)
sapply(Data3,is.infinite)
jobId jobTitle jobClassification jobSubClassification advertiserName advertiserId
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE FALSE
companyId listingDate expiryDate teaser nation state city area suburb workType
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
salary_string isRightToWorkRequired desktopAdTemplate mobileAdTemplate
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE
seekJobListingUrl R Python Matlab SQL Stata Minitab SPSS Ruby C Scala
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Tableau Java Hadoop SAS Julia Knime D3 Clojure Haskell Lisp Golang Spark
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Javascript F# Fortran first_seen last_seen recruiter X1 ratings days_ago
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Duration
[1,] FALSE
[2,] FALSE
[3,] FALSE
[4,] FALSE
[5,] FALSE
[6,] FALSE
[7,] FALSE
[8,] FALSE
[9,] FALSE
[10,] FALSE
[11,] FALSE
[12,] FALSE
[13,] FALSE
[14,] FALSE
[15,] FALSE
[16,] FALSE
[17,] FALSE
[18,] FALSE
[ reached getOption("max.print") -- omitted 29733 rows ]
Next, the missing values in the dataset are checked.
sum(is.na()) gives the total number of NAs, and colSums(is.na()) gives the number per column.
Since there are many missing values, they are imputed using a suitable method.
days_ago and ratings are imputed with their mean values using impute(), since they are numeric variables.
Some variables, such as the company-related ones, are incomplete and are dropped instead, since imputing them with arbitrary values would be meaningless.
This is done with select() in the code above.
The remaining categorical variables with missing values (area, suburb, salary_string and desktopAdTemplate) are imputed with impute() using their most frequent value (the mode).
After these steps, no obvious errors or inconsistencies were found in the variables.
Finally, special values are checked with sapply() and is.infinite(), which finds no infinite values.
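The steps in this section can be sketched on a toy data frame in base R. The helper name `most_frequent` and the data are hypothetical. The sketch also shows why the base R function mode() is not a substitute for the statistical mode, and a more compact way to report infinite values than printing a full logical matrix.

```r
# Hypothetical helper: most frequent non-missing value of a vector
most_frequent <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Hypothetical data with one missing value per column
df <- data.frame(
  days_ago = c(1, NA, 5, 30),
  suburb   = c("CBD", NA, "CBD", "Parramatta"),
  stringsAsFactors = FALSE
)

colSums(is.na(df))                   # missing values per column

# Mean imputation for the numeric column, most frequent value for the text column
df$days_ago[is.na(df$days_ago)] <- mean(df$days_ago, na.rm = TRUE)
df$suburb[is.na(df$suburb)]     <- most_frequent(df$suburb)
print(df)

# One count of infinite values per column instead of a row-by-row matrix
inf_counts <- sapply(df, function(col) sum(is.infinite(col)))
print(inf_counts)

# mode() describes how the vector is stored, not its most common value
mode(df$suburb)                      # "character"
```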
class(Data3$Duration)
[1] "difftime"
Data3$Duration<-as.numeric(Data3$Duration)
Data3$ratings<-as.numeric(Data3$ratings)
Data3$days_ago<-as.numeric(Data3$days_ago)
boxplot(Data3$Duration)
boxplot(Data3$ratings)
boxplot(Data3$days_ago)
To check for outliers, the boxplot() function is used.
The boxplots of Duration, days_ago and ratings show no unusual outliers, so the data is left as it is.
The ratings variable contains ratings given by employees, which naturally vary.
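A boxplot can be complemented by listing the flagged points numerically. The sketch below uses boxplot.stats(), which applies the same 1.5 x IQR whisker rule that boxplot() uses by default. The vector is hypothetical, loosely modelled on a duration column where most values are 30.

```r
# Hypothetical durations in days
x <- c(30, 30, 30, 30, 30, 30, 41, 58)

stats <- boxplot.stats(x)   # coef = 1.5 by default, as in boxplot()
stats$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
out <- stats$out            # values beyond the whiskers
print(out)
```

When nearly all values are identical the box is very narrow, so even moderately different values can be flagged; whether they are genuinely unusual is a judgement about the data rather than something the rule decides.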
hist(Data3$ratings)
sq_ratings<-sqrt(Data3$ratings)
hist(sq_ratings)
recep<-1/Data3$ratings
hist(recep)
Data3$ratings<-as.numeric(Data3$ratings)
hist(Data3$ratings)
Data3$days_ago<-as.numeric(Data3$days_ago)
hist(Data3$Duration)
hist(Data3$days_ago)
box_rat<-BoxCox(Data3$ratings,lambda = "auto")
hist(box_rat)
Here a transformation is applied to one of the variables in the dataset.
The ratings variable is chosen, and its scale is transformed to make its distribution easier to interpret.
hist() is used to show the distribution of the variable.
sqrt() is used to take the square root of the values, which are then plotted as a histogram.
A reciprocal transformation (1/ratings) is tried next and appears to give a better result; note that ratings of 0 become infinite under this transformation and are left out of the histogram.
Finally, the Box-Cox transformation is applied to ratings with BoxCox(), and its result is shown in the histogram.
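The three transformations can be compared on a few numbers. This is a sketch with hypothetical ratings. The `box_cox` helper is an illustrative base R version of the standard one-parameter Box-Cox formula for positive values, not the implementation in the forecast package.

```r
# Hypothetical ratings, including zeros as in the real column
ratings <- c(0, 0, 3.2, 3.7, 4.1)

sqrt_r <- sqrt(ratings)              # defined for zeros
recip  <- 1 / ratings                # zeros become Inf
sum(is.infinite(recip))              # number of infinite values created

# Restricting to positive ratings avoids the infinite values
positive  <- ratings[ratings > 0]
recip_pos <- 1 / positive

# Illustrative Box-Cox helper for positive x:
# log(x) when lambda is 0, otherwise (x^lambda - 1) / lambda
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
bc_half <- box_cox(positive, lambda = 0.5)
print(bc_half)
```

hist() silently drops non-finite values, so a histogram of `recip` would show only the positive ratings.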
References:
Thomas, S., 2019. Data Scientist Jobs In Australia October 25 2019. [online] Kaggle.com. Available at: https://www.kaggle.com/santokalayil/data-scientist-jobs-in-australia-october-25-2019 [Accessed 6 June 2020].
Kaggle.com. 2020. Data Science Job Listings - Australia - 2019-2020. [online] Available at: https://www.kaggle.com/nomilk/data-science-job-listings-australia-20192020 [Accessed 6 June 2020].