This project conducts preprocessing, exploratory data analysis, and predictive data analysis on the “SalaryData” dataset provided by the instructor. The quality of the data provided was generally high. An analysis scope was difficult to craft because the dataset came with no supporting context, no business needs to be solved, and no plan for adoption of results. Nonetheless, the vision of a successful analysis included: creating analyses that are realistic and relevant to a real-world business application, thoroughly exploring the dataset via graphical representations, and building at least a handful of predictive models with a high degree of accuracy.
Originally, the data had a structure of 10 variables and 200 entries. Preprocessing, which screened the data for duplicates, flagged an entry-error outlier as missing, dropped rows with NA values, and discarded the “id” variable, reduced the data structure to 9 variables and 192 entries. Multiple variables were then converted into factor form for graphical analysis.
Graphical exploratory data analysis yielded many significant insights, including: both salaries and bonuses are higher in the IT industry than in the Non-IT industry; both industries are fairly evenly composed of males and females, although the IT industry has roughly 5% more males than females; gender does not play any significant role in determining an individual's salary or bonus; both salary and bonus increase as years of experience increase; and salary and bonus appear to be highly correlated with the number of certifications an individual has attained.
Predictive data analysis was then conducted. The data was partitioned into a training dataset and a testing dataset in a 70/30 split, using sampling without replacement. Models to predict each variable were created, with only those for industry, marital status, bonus, and salary yielding results of any significant accuracy.
Models for industry and marital status were created using logistic regression, due to the discrete class nature of the predicted variables. Models for bonus and salary were created using multiple linear regression, due to their continuous numerical nature. Once trained on the training dataset, the models were fed the testing dataset to evaluate accuracy. The industry model had an accuracy of 57% on the testing partition, the marital status model an accuracy of 76%, the bonus model a predicted-versus-actual correlation of 0.92 (and an adjusted R^2 of 0.919), and the salary model a correlation of 0.85 (and an adjusted R^2 of 0.734).
It is recommended that only the models for industry, marital status, bonus, and salary be considered for predictive use cases. In a business application, predictive models for industry or marital status are unlikely to be of any use to a company. However, a predictive model for salary can be extremely helpful, for example in determining the proper salary to offer an incoming employee based on factors such as education, experience, and certifications. A predictive model for bonus can likewise help determine the proper bonus for an existing employee.
The quality of the data provided in the “SalaryData” dataset was generally high. No duplicate entries were found, and the minor issues that were found, such as major outliers and NA values, were easily corrected.
Unfortunately, for scoping, since the dataset was provided by the instructor with no supporting context, no business needs to address, and no plan for adoption of results, it was difficult to craft an analysis scope. Nonetheless, the vision of a successful analysis includes: creating analyses that are realistic and relevant to a real-world business application, thoroughly exploring the dataset via graphical representations, and building at least a handful of predictive models with a high degree of accuracy.
Issues with the data included an entry-error outlier in the experience variable, a handful of NA values, and an uninformative “id” column.
The following is an outline and justification of the steps taken in preprocessing the dataset:
library(readxl)
salary <-
read_excel("C:/Users/cbado/OneDrive/Documents/R/SalaryData.xlsx")
head(salary,3)
## # A tibble: 3 x 10
## id gender industry education satisfied married experience certification
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 2 0 1 6 3
## 2 2 1 1 3 1 1 4 3
## 3 3 1 1 3 0 1 7 4
## # ... with 2 more variables: bonus <dbl>, salary <dbl>
options(scipen=999)
sapply(salary, function(col) boxplot.stats(col)$out)
## $id
## numeric(0)
##
## $gender
## numeric(0)
##
## $industry
## numeric(0)
##
## $education
## numeric(0)
##
## $satisfied
## numeric(0)
##
## $married
## numeric(0)
##
## $experience
## [1] 21
##
## $certification
## numeric(0)
##
## $bonus
## [1] 2147.082 3080.000 2089.164 1989.036 2353.000 2308.150 2497.600 1961.564
## [9] 2263.248 2586.880 2656.830 2775.465
##
## $salary
## numeric(0)
boxplot(salary$bonus,
ylab = "Bonus",
main = "Boxplot of Bonus")
The bonus outliers above are plausible values and are retained. However, the outlier in the experience column (21 years, far beyond the rest of the 1 to 8 range) is most likely an entry error and must be removed.
boxplot(salary$experience,
ylab = "Experience",
main = "Boxplot of Experience")
mtext(paste("Outliers: ", paste(boxplot.stats(salary$experience)$out, collapse = ", ")))
out <- boxplot.stats(salary$experience)$out   # the outlier value flagged above (21)
out_ind <- which(salary$experience %in% c(out))   # row index of the outlier
salary[out_ind, 7] = NA   # column 7 is experience; mark the bad entry as missing
sapply(salary, function(salary) sum(is.na(salary)))
## id gender industry education satisfied
## 0 0 1 2 2
## married experience certification bonus salary
## 0 1 0 2 0
library(tidyr)
salary<-drop_na(salary)
sapply(salary, function(salary) sum(is.na(salary)))
## id gender industry education satisfied
## 0 0 0 0 0
## married experience certification bonus salary
## 0 0 0 0 0
length(unique(salary$id)) == nrow(salary)
## [1] TRUE
salary$id <- NULL
library(DataExplorer)
plot_correlation(na.omit(salary), maxcat = 5L)
salary$gender <-ifelse(
test = salary$gender==0,yes = "Male", no="Female")
salary$gender <-as.factor(salary$gender)
salary$industry <-ifelse(test = salary$industry==0,yes = "Non-IT", no="IT")
salary$industry <-as.factor(salary$industry)
salary$education <-ifelse(test = salary$education==1,yes = "Undergrad", no=(ifelse(test = salary$education==2,yes = "Grad", no="Advanced")))
salary$education <- as.factor(salary$education)
salary$education <- factor(salary$education, levels = c("Undergrad", "Grad", "Advanced"))
salary$satisfied <-ifelse(test = salary$satisfied==1,yes = "Yes", no= "No")
salary$satisfied <- as.factor(salary$satisfied)
salary$married <-ifelse(test = salary$married==1,yes = "Yes", no= "No")
salary$married <- as.factor(salary$married)
salary$certification <- as.factor(salary$certification)
head(salary,4)
## # A tibble: 4 x 9
## gender industry education satisfied married experience certification bonus
## <fct> <fct> <fct> <fct> <fct> <dbl> <fct> <dbl>
## 1 Male IT Grad No Yes 6 3 2147.
## 2 Female IT Advanced Yes Yes 4 3 1565.
## 3 Female IT Advanced No Yes 7 4 3080
## 4 Female IT Advanced Yes Yes 6 3 2089.
## # ... with 1 more variable: salary <dbl>
library(ggplot2)
library(cowplot)
summary(salary)
## gender industry education satisfied married experience
## Female:93 IT :107 Advanced :44 No :101 No :106 Min. :1.00
## Male :99 Non-IT: 85 Grad :88 Yes: 91 Yes: 86 1st Qu.:1.00
## Undergrad:60 Median :2.00
## Mean :3.12
## 3rd Qu.:5.00
## Max. :8.00
## certification bonus salary
## 0:54 Min. : 26.93 Min. : 26928
## 1:64 1st Qu.: 127.91 1st Qu.: 43850
## 2:48 Median : 269.74 Median : 53186
## 3:23 Mean : 548.91 Mean : 56878
## 4: 3 3rd Qu.: 774.39 3rd Qu.: 66298
## Max. :3080.00 Max. :102242
The table above is a summary of the “SalaryData” dataset after preprocessing. For factor variables, it provides a count of instances. For numerical variables, it provides min/max, measures of central tendency, and quartiles. The summary is helpful in creating a preliminary understanding of the data.
salary$education <- factor(salary$education,levels = c("Undergrad", "Grad", "Advanced"))
edusalary <- ggplot(salary, aes(x=education, y=salary, fill=education)) + geom_boxplot()+ggtitle("Salary by Education Level")+xlab("Education Level")+ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Education Level")
edubonus <- ggplot(salary, aes(x=education, y=bonus, fill=education)) + geom_boxplot()+ggtitle("Bonus by Education Level")+xlab("Education Level")+ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Education Level")
plot_grid(edusalary, edubonus, labels = "AUTO")
Both salary and bonus generally increase with advancement in education level, particularly when measured by the mean. Bonus appears more strongly dependent on education level than salary, while salary shows a greater range within each education level. Outliers exist among graduate salaries, some exceeding any salary received by those with advanced degrees. Only an individual with an advanced degree is likely to receive a bonus over $2,250.
salary$education <- factor(salary$education,levels = c("Advanced", "Grad", "Undergrad"))
ggplot() + geom_col(data = salary, aes(x="", y = frequency(education), fill = education), position = "fill")+ggtitle("Education Level Composition by Industry") + xlab("") + ylab("Composition Percentage")+labs(fill = "Education")+facet_grid(. ~ industry)+scale_y_continuous(labels=scales::percent_format())
The education composition of the IT and Non-IT industries is evaluated. The Non-IT industry has a greater share of individuals with undergrad degrees, while the IT industry has the greater share of grad and advanced degrees, with only just over 25% of IT individuals holding only an undergrad degree.
ggplot() + geom_col(data = salary, aes(x=experience, y = frequency(education), fill = education), position = "fill")+ggtitle("Education Level Composition by Years of Experience") + xlab("Years of Experience") + ylab("Composition Percentage")+labs(fill = "Education")+scale_y_continuous(labels=scales::percent_format())
An analysis of education level composition by years of experience displays a trend of upward progression in education level as more years of experience are gained. After seven years, over 75% of individuals have an advanced degree, and after eight years, 100%. However, the 100% rate at eight years likely occurs only because the dataset contains just two individuals with eight years of experience.
indsalary <- ggplot(salary, aes(x=industry, y=salary, fill=industry)) + geom_boxplot()+ggtitle("Salary by Industry")+xlab("Industry")+ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Industry")
indbonus <- ggplot(salary, aes(x=industry, y=bonus, fill=industry)) + geom_boxplot()+ggtitle("Bonus by Industry")+xlab("Industry")+ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Industry")
plot_grid(indsalary, indbonus, labels = "AUTO")
An analysis of salary and bonus by industry reveals that both are higher in the IT industry than in the Non-IT industry. However, in the IT industry both salary and bonus have a greater range. The Non-IT industry also has many more high-bonus outliers, meaning IT-level bonuses can still be received in the Non-IT industry.
ggplot() + geom_col(data = salary, aes(x="", y = frequency(gender), fill = gender), position = "fill")+ggtitle("Gender by Industry") + xlab("") + ylab("Composition Percentage")+labs(fill = "Gender")+scale_y_continuous(labels=scales::percent_format())+facet_grid(. ~ industry)+scale_fill_manual(values=c("#f06aa4", "#6ca0dc"))
An analysis of gender by industry reveals that both the IT and Non-IT industries are fairly evenly composed of males and females, although the IT industry does have roughly 5% more males than females. Gender would therefore be a very poor predictor of industry in any predictive analysis.
ggplot(salary, aes(x=salary, y=bonus, color=industry)) + geom_point()+ggtitle("Bonus by Salary")+xlab("Salary")+ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+scale_x_continuous(labels=scales::dollar_format())+scale_color_manual(values=c("#FFCB05","#00274C"))+labs(color = "Industry")
An analysis of salary and bonus across both industries reveals that bonus grows more than proportionally as salary increases: the higher one’s salary, the larger one’s bonus, generally. However, as salary increases, the range of received bonuses also widens. Most individuals earning over $60,000 per year are in the IT industry, even after accounting for the fact that the dataset contains roughly 25% more IT individuals.
library(plyr)
mc <- ddply(salary, "industry", summarise, grp.mean=mean(salary))
a <- ggplot(salary, aes(x=salary, color=industry))+
geom_histogram(aes(y=..density..), color="black", fill="grey")+geom_density(alpha=.2, fill=NA)+facet_grid(industry ~ .)+
geom_vline(data=mc, aes(xintercept=grp.mean, color=industry),
linetype="dashed")+scale_color_manual(values=c("orange", "blue"))+scale_x_continuous(labels=scales::dollar_format())+ggtitle("Salary Distribution by Industry")+xlab("Salary")+ylab("Density")+labs(color = "Salary")
mg <- ddply(salary, "industry", summarise, grp.mean=mean(bonus))
b <- ggplot(salary, aes(x=bonus, color=industry))+
geom_histogram(aes(y=..density..), color="black", fill="grey")+geom_density(alpha=.2, fill=NA)+facet_grid(industry ~ .)+
geom_vline(data=mg, aes(xintercept=grp.mean, color=industry),
linetype="dashed")+scale_color_manual(values=c("orange", "blue"))+scale_x_continuous(labels=scales::dollar_format())+ggtitle("Bonus Distribution by Industry")+xlab("Bonus")+ylab("Density")+labs(color = "Bonus")
plot_grid(a, b, labels = "AUTO")
A distribution analysis of both salary and bonus by industry further confirms the insights from the previous chart. Salary in the IT industry is bimodally distributed, whereas salary in the Non-IT industry and bonus in both industries are unimodal. All distributions are right-skewed, since high salaries and bonuses are less common given the cost they impose on the business.
mu <- ddply(salary, "gender", summarise, grp.mean=mean(salary))
c <- ggplot(salary, aes(x=salary, color=gender))+
geom_histogram(aes(y=..density..), color="black", fill="grey")+geom_density(alpha=.2, fill=NA)+facet_grid(gender ~ .)+
geom_vline(data=mu, aes(xintercept=grp.mean, color=gender),
linetype="dashed")+scale_color_manual(values=c("#f06aa4", "#6ca0dc"))+scale_x_continuous(labels=scales::dollar_format())+ggtitle("Salary Distribution by Gender")+xlab("Salary")+ylab("Density")+labs(color = "Salary")
mb <- ddply(salary, "gender", summarise, grp.mean=mean(bonus))
d <- ggplot(salary, aes(x=bonus, color=gender))+
geom_histogram(aes(y=..density..), color="black", fill="grey")+geom_density(alpha=.2, fill=NA)+facet_grid(gender ~ .)+
geom_vline(data=mb, aes(xintercept=grp.mean, color=gender),
linetype="dashed")+scale_color_manual(values=c("#f06aa4", "#6ca0dc"))+scale_x_continuous(labels=scales::dollar_format())+ggtitle("Bonus Distribution by Gender")+xlab("Bonus")+ylab("Density")+labs(color = "Bonus")
plot_grid(c, d, labels = "AUTO")
A distribution analysis of both salary and bonus by gender reveals that gender does not play any significant role in determining an individual's salary or bonus (which is definitely a good thing). Males do earn slightly more in both salary and bonus. Salary and bonus are similarly distributed across genders, and both are right-skewed.
ggplot(salary, aes(x=factor(""), stat="bin", fill=married)) + geom_bar(position="fill") +scale_y_continuous(labels = c("0", "25%", "50%", "75%", "100%")) +
ggtitle("Marriage Rate by Industry") + xlab("") + ylab("") +
facet_grid(facets=. ~ industry)+
coord_polar(theta="y")+
scale_fill_manual(values=c("#999999", "#E69F00"))+labs(fill = "Married")
marit<-table(salary$married[salary$industry=="IT"])
marnit<-table(salary$married[salary$industry=="Non-IT"])
round(prop.table(marit),digits=2)
##
## No Yes
## 0.43 0.57
round(prop.table(marnit),digits=2)
##
## No Yes
## 0.71 0.29
An analysis of marriage rate by industry reveals that 57% of individuals in the IT industry are married, compared to only 29% in the Non-IT industry. However, used alone, marital status would be a poor predictor of industry.
ggplot() + geom_col(data = salary, aes(x=gender, y = frequency(married), fill = married), position = "fill")+ggtitle("Marriage Rate by Gender") + xlab("Gender") + ylab("Marriage Rate")+labs(fill = "Married")+facet_grid(facets=. ~ industry)+scale_y_continuous(labels=scales::percent_format())
Evaluating marriage rate by gender again confirms that gender does not play a significant role in predicting other variables. The marriage rates for males and females in both the IT and Non-IT industries are relatively similar, aside from the fact that individuals in the IT industry are more likely to be married, as confirmed by the previous pie charts.
ggplot() + geom_col(data = salary, aes(x=experience, y = frequency(married), fill = married), position = "fill")+ggtitle("Marriage Rate by Years of Experience") + xlab("Years of Experience") + ylab("Marriage Rate")+labs(fill = "Married")+scale_y_continuous(labels=scales::percent_format())
Evaluating marriage rate by years of experience reveals that the likelihood an individual is married increases with years of experience, until year seven, where the rate plateaus at roughly 50%. This plateau may simply reflect the small number of individuals in the sample with seven or more years of experience, making the analysis unrepresentative of actual marriage rates beyond that point. Or perhaps, after seven years of experience, individuals begin to get divorced.
ggplot(salary, aes(x=factor(married), y=salary, fill=married))+geom_boxplot()+ggtitle("Salary by Marital Status") + xlab("Married") + ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Married")
An analysis of salary by marital status shows that salaries are likely to be higher for married individuals. However, this is rather misleading: since the likelihood of being married increases with years of experience, the higher salaries are likely due to greater experience rather than to marriage itself.
ggplot(salary, aes(x=factor(experience), y=salary))+geom_boxplot()+ggtitle("Salary by Years of Experience") + xlab("Years of Experience") + ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+facet_grid(facets=. ~ industry)
Generally, both salary and bonus increase as years of experience increase. This further supports the postulation that marital status does not itself influence salary; rather, salary is higher among married individuals simply because the likelihood of being married increases with experience. The salary ranges for each year of experience vary much more widely in the IT industry than in the Non-IT industry.
salary$education <- factor(salary$education,levels = c("Undergrad", "Grad", "Advanced"))
ggplot() + geom_col(data = salary, aes(x=education, y = frequency(satisfied), fill = satisfied), position = "fill")+ggtitle("Satisfaction Rate by Education Level") + xlab("Education Level") + ylab("Satisfaction Rate")+labs(fill = "Satisfied")+facet_grid(facets=. ~ industry)+scale_y_continuous(labels=scales::percent_format())
An analysis of satisfaction rate by education level and industry reveals that in the IT industry, satisfaction increases with education level. In the Non-IT industry, the opposite is true: satisfaction decreases as education level increases.
ggplot() + geom_col(data = salary, aes(x=experience, y = frequency(satisfied), fill = satisfied), position = "fill")+ggtitle("Satisfaction Rate by Years of Experience") + xlab("Years of Experience") + ylab("Satisfaction Rate")+labs(fill = "Satisfied")+scale_y_continuous(labels=scales::percent_format())
No consistent trend can be derived from the graph above, implying that years of experience do not influence satisfaction.
ggplot() + geom_col(data = salary, aes(x=married, y = frequency(satisfied), fill = satisfied), position = "fill")+ggtitle("Satisfaction Rate by Marital Status") + xlab("Married") + ylab("Satisfaction Rate")+labs(fill = "Satisfied")+scale_y_continuous(labels=scales::percent_format())
Those who are married have a higher likelihood of being satisfied, while those who are not married are likely to be unsatisfied.
ggplot() + geom_col(data = salary, aes(x=gender, y = frequency(satisfied), fill = satisfied), position = "fill")+ggtitle("Satisfaction Rate by Gender") + xlab("Gender") + ylab("Satisfaction Rate")+labs(fill = "Satisfied")+scale_y_continuous(labels=scales::percent_format())+facet_grid(facets=. ~ industry)
Males in both the IT and Non-IT industries are marginally more likely to be satisfied than females. Neither males nor females are likely to be satisfied in the Non-IT industry.
e <- ggplot(salary, aes(x=satisfied, y=salary, fill=satisfied)) + geom_boxplot()+ggtitle("Salary by Job Satisfaction")+xlab("Satisfied")+ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Satisfied")
f <- ggplot(salary, aes(x=satisfied, y=bonus, fill=satisfied)) + geom_boxplot()+ggtitle("Bonus by Job Satisfaction")+xlab("Satisfied")+ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Satisfied")
plot_grid(e, f, labels = "AUTO")
While all metrics on the above plot have very large ranges, those who are satisfied generally have higher salaries and bonuses. It is likely that individuals are satisfied because they feel adequately compensated for their work, rather than being compensated based on whether they are satisfied.
ggplot() + geom_col(data = salary, aes(x="", y = frequency(certification), fill = factor(certification)), position = position_fill(reverse = TRUE))+ggtitle("# of Certifications Attained Composition by Industry") + xlab("") + ylab("Composition Percentage")+labs(fill = "# of Certifications Attained")+facet_grid(. ~ industry)+scale_y_continuous(labels=scales::percent_format())
An analysis of the composition of certifications attained, by industry, shows that those in the Non-IT industry are most likely to hold zero or one certification. Over half of those in the IT industry hold two or more certifications, though one remains the single most common count.
ggplot(salary, aes(x=factor(certification), y=salary))+geom_boxplot()+ggtitle("Salary by Number of Certifications Attained") + xlab("Number of Certifications Attained") + ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+facet_grid(facets=. ~ industry)
In both the IT and Non-IT industries, salary appears to be highly correlated with the number of certifications attained. However, salary increases faster with certifications in the IT industry.
ggplot(salary, aes(x=factor(certification), y=bonus))+geom_boxplot()+ggtitle("Bonus by Number of Certifications Attained") + xlab("Number of Certifications Attained") + ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+facet_grid(facets=. ~ industry)
In both industries, bonus appears to grow roughly exponentially with the number of certifications attained. However, bonuses tied to certifications are less likely to be given in the Non-IT industry.
salary <- read_excel("C:/Users/cbado/OneDrive/Documents/R/SalaryData.xlsx")
out <- boxplot.stats(salary$experience)$out
out_ind <- which(salary$experience %in% c(out))
salary[out_ind, 7] = NA
library(tidyr)
salary<-drop_na(salary)
salary$id <- NULL
This step re-loads the original “SalaryData” dataset and re-applies the preprocessing. The reason is that it is easier to re-load and keep the variables in their original numeric form than to undo the factor conversions performed in the previous section. Admittedly this is a crude approach, but it yields the correct data nonetheless; a cleaner alternative is sketched below.
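A less disruptive alternative, sketched here under the assumption that it would be placed immediately after the NA handling in the preprocessing section, is to snapshot the still-numeric data before any factor conversion (salary_numeric is a hypothetical name):
salary_numeric <- salary   # keep a numeric copy before the ifelse()/as.factor() recodings
# ... the factor conversions for plotting then operate on `salary` as before ...
salary <- salary_numeric   # when modeling begins, restore the numeric version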
dt = sort(sample(nrow(salary), nrow(salary)*.7))   # 70% of row indices, drawn without replacement
train<-salary[dt,]   # training partition (70%)
test<-salary[-dt,]   # testing partition (remaining 30%)
This step partitions the data into a training and a testing dataset. The training dataset comprises 70% of the observed instances, whereas the testing dataset comprises the remaining 30%. Sampling is done without replacement.
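Note that no random seed is set, so every run of this chunk produces a different split (and slightly different model results than those printed below). A minimal reproducible variant, assuming the seed value itself is arbitrary:
set.seed(123)   # any fixed value; makes the random split repeatable
dt <- sort(sample(nrow(salary), nrow(salary) * 0.7))
train <- salary[dt, ]
test <- salary[-dt, ]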
A logistic regression model is appropriate for predicting gender, since gender is a binary class variable.
gen <- glm(gender ~ ., data = train, family = "binomial")
gen
##
## Call: glm(formula = gender ~ ., family = "binomial", data = train)
##
## Coefficients:
## (Intercept) industry education satisfied married
## 1.04155468 -0.02764653 0.11026936 -0.09580969 0.16969102
## experience certification bonus salary
## -0.00605253 -0.59886184 0.00126216 -0.00002218
##
## Degrees of Freedom: 133 Total (i.e. Null); 125 Residual
## Null Deviance: 185.6
## Residual Deviance: 180.9 AIC: 198.9
ll.null <- gen$null.deviance/-2   # log-likelihood of the intercept-only model
ll.proposed <- gen$deviance/-2   # log-likelihood of the fitted model
(ll.null - ll.proposed) / ll.null   # McFadden's pseudo R-squared
1 - pchisq(2*(ll.proposed - ll.null), df=(length(gen$coefficients)-1))   # likelihood-ratio test p-value
## [1] 0.7890994
From the deviances above, this model has a McFadden pseudo R^2 of only about 0.025 (likelihood-ratio p-value 0.789), so it explains almost none of the variation in gender. In any case, R^2 alone says little about how well the model predicts new input data, since a high R^2 can simply reflect overfitting to the training data.
predictions <- predict(gen, test, type="response")
prediction.rd <- ifelse(predictions > 0.5, 1, 0)
table(prediction.rd, test$gender)
##
## prediction.rd 0 1
## 0 20 19
## 1 10 9
accuracy <- table(prediction.rd, test$gender)
sum(diag(accuracy))/sum(accuracy)
## [1] 0.5
This model has an accuracy rate of only 50% when fed testing data, no better than a coin flip.
predicted.gen.data <- data.frame(probability.of.gender=gen$fitted.values,gender=train$gender)
predicted.gen.data <- predicted.gen.data[order(predicted.gen.data$probability.of.gender, decreasing = FALSE),]
predicted.gen.data$rank <- 1:nrow(predicted.gen.data)
ggplot(data = predicted.gen.data,aes(x=rank, y=probability.of.gender))+geom_point(aes(colour=gender), alpha=1, shape=4, stroke=2)+ggtitle("Predicting Gender")+xlab("Index")+ylab("Prob. of Gender")+labs(color = "Gender")+scale_colour_gradient(low = "#CC6594",high = "#347DC1")
Based on the 50% accuracy rate when predicting gender from new input data, as well as the visualization displayed in the above plot, gender cannot be accurately predicted.
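For completeness, the confusion matrix above can be summarized beyond raw accuracy. A short sketch using the prediction.rd and test objects from this section, treating class 1 as the positive class purely for illustration:
cm <- table(prediction.rd, test$gender)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])   # true positives / all actual 1s
specificity <- cm["0", "0"] / sum(cm[, "0"])   # true negatives / all actual 0s
c(sensitivity = sensitivity, specificity = specificity)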
A logistic regression model is appropriate for predicting industry, since industry is a binary class variable.
indu <- glm(industry ~ ., data = train, family = "binomial")
indu
##
## Call: glm(formula = industry ~ ., family = "binomial", data = train)
##
## Coefficients:
## (Intercept) gender education satisfied married
## -0.49159288 -0.06268086 0.21882641 1.00422943 0.32428259
## experience certification bonus salary
## 0.10187521 -0.32169114 0.00178695 -0.00001743
##
## Degrees of Freedom: 133 Total (i.e. Null); 125 Residual
## Null Deviance: 182.8
## Residual Deviance: 156.6 AIC: 174.6
ll.null <- indu$null.deviance/-2
ll.proposed <- indu$deviance/-2
(ll.null - ll.proposed) / ll.null
1 - pchisq(2*(ll.proposed - ll.null), df=(length(indu$coefficients)-1))
## [1] 0.0009719201
From the deviances above, this model has a McFadden pseudo R^2 of roughly 0.14; the likelihood-ratio p-value of 0.00097 indicates the predictors are jointly significant, but the explanatory power is modest.
predictions <- predict(indu, test, type="response")
prediction.rd <- ifelse(predictions > 0.5, 1, 0)
table(prediction.rd, test$industry)
##
## prediction.rd 0 1
## 0 12 9
## 1 16 21
accuracy <- table(prediction.rd, test$industry)
sum(diag(accuracy))/sum(accuracy)
## [1] 0.5689655
This model has an accuracy rate of 57% when fed testing data, only modestly better than chance.
predicted.indu.data <- data.frame(probability.of.indu=indu$fitted.values,industry=train$industry)
predicted.indu.data <- predicted.indu.data[order(predicted.indu.data$probability.of.indu, decreasing = FALSE),]
predicted.indu.data$rank <- 1:nrow(predicted.indu.data)
ggplot(data = predicted.indu.data,aes(x=rank, y=probability.of.indu))+geom_point(aes(colour=industry), alpha=1, shape=4, stroke=2)+ggtitle("Predicting Industry")+xlab("Index")+ylab("Prob. of Industry")+labs(color = "Industry")+scale_colour_gradient(low = "#FFCB05",high = "#00274C")
Based on the 57% accuracy rate when predicting industry from new input data, as well as the visualization displayed in the above plot, the generated model does only a fair job of predicting industry.
A k-nearest neighbors (KNN) algorithm is used to predict education level, since it is a supervised machine learning algorithm well suited to classification problems, and education level is a class variable.
library(class)
educ <- knn(train,test,cl=train$education,k=11)
tab <- table(educ,test$education)
tab
##
## educ 1 2 3
## 1 8 5 1
## 2 7 17 5
## 3 3 5 7
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
## [1] 55.17241
Based on the 55% accuracy rate when predicting education level from new input data, this model cannot accurately predict education level, though it does considerably better than randomly guessing among the three levels.
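One caveat about the knn() call above: the full train and test data frames are passed as the feature matrices, so the education column itself is included among the distance features, and unscaled variables such as salary dominate the Euclidean distance. A leakage-free, scaled variant might look like the following sketch (the same caveat applies to the certification model below):
library(class)
feat <- setdiff(names(train), "education")   # drop the target from the features
tr_x <- scale(train[, feat])                 # z-score the training features
te_x <- scale(test[, feat],
              center = attr(tr_x, "scaled:center"),
              scale = attr(tr_x, "scaled:scale"))   # reuse the training scaling
educ2 <- knn(tr_x, te_x, cl = train$education, k = 11)
tab2 <- table(educ2, test$education)
sum(diag(tab2)) / sum(tab2)   # holdout accuracy without target leakage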
A logistic regression model is appropriate for predicting satisfaction, since satisfaction is a binary class variable.
sat <- glm(satisfied ~ industry+salary+married, data = train, family = "binomial")
sat
##
## Call: glm(formula = satisfied ~ industry + salary + married, family = "binomial",
## data = train)
##
## Coefficients:
## (Intercept) industry salary married
## -1.59933214 0.94476212 0.00001257 0.54212355
##
## Degrees of Freedom: 133 Total (i.e. Null); 130 Residual
## Null Deviance: 185.5
## Residual Deviance: 170.1 AIC: 178.1
ll.null <- sat$null.deviance/-2
ll.proposed <- sat$deviance/-2
(ll.null - ll.proposed) / ll.null
1 - pchisq(2*(ll.proposed - ll.null), df=(length(sat$coefficients)-1))
## [1] 0.001478168
From the deviances above, this model has a McFadden pseudo R^2 of roughly 0.083, with a likelihood-ratio p-value of 0.0015.
predictions <- predict(sat, test, type="response")
prediction.rd <- ifelse(predictions > 0.5, 1, 0)
table(prediction.rd, test$satisfied)
##
## prediction.rd 0 1
## 0 18 17
## 1 13 10
accuracy <- table(prediction.rd, test$satisfied)
sum(diag(accuracy))/sum(accuracy)
## [1] 0.4827586
This model has an accuracy rate of only 48% when fed testing data, no better than randomly tossing a coin to predict the outcome.
predicted.sat.data <- data.frame(probability.of.satisfied=sat$fitted.values,satisfied=train$satisfied)
predicted.sat.data <- predicted.sat.data[order(predicted.sat.data$probability.of.satisfied, decreasing = FALSE),]
predicted.sat.data$rank <- 1:nrow(predicted.sat.data)
ggplot(data = predicted.sat.data,aes(x=rank, y=probability.of.satisfied))+geom_point(aes(colour=satisfied), alpha=1, shape=4, stroke=2)+ggtitle("Predicting Satisfaction based on Industry, Salary, & Marital Status")+xlab("Index")+ylab("Prob. of Satisfied")+labs(color = "Satisfied")+scale_colour_gradient(low = "#FFCB05",high = "#00274C")
Based on the 48% accuracy rate when predicting satisfaction from new input data, as well as the visualization displayed in the above plot, the generated model cannot predict satisfaction reliably, performing no better than a coin flip.
A logistic regression model is appropriate for predicting marital status, since marital status is a binary class variable.
marr <- glm(married ~ salary+industry, data = train, family = "binomial")
marr
##
## Call: glm(formula = married ~ salary + industry, family = "binomial",
## data = train)
##
## Coefficients:
## (Intercept) salary industry
## -3.76362284 0.00005759 0.49448835
##
## Degrees of Freedom: 133 Total (i.e. Null); 131 Residual
## Null Deviance: 184.3
## Residual Deviance: 153.5 AIC: 159.5
ll.null <- marr$null.deviance/-2
ll.proposed <- marr$deviance/-2
(ll.null - ll.proposed) / ll.null
1 - pchisq(2*(ll.proposed - ll.null), df=(length(marr$coefficients)-1))
## [1] 0.0000002073406
From the deviances above, this model has a McFadden pseudo R^2 of roughly 0.17; the likelihood-ratio p-value of roughly 0.0000002 indicates the predictors are highly significant.
predictions <- predict(marr, test, type="response")
prediction.rd <- ifelse(predictions > 0.5, 1, 0)
table(prediction.rd, test$married)
##
## prediction.rd 0 1
## 0 28 10
## 1 4 16
accuracy <- table(prediction.rd, test$married)
sum(diag(accuracy))/sum(accuracy)
## [1] 0.7586207
This model has an accuracy rate of 76% when fed testing data, which is fairly good.
predicted.marr.data <- data.frame(probability.of.married=marr$fitted.values,married=train$married)
predicted.marr.data <- predicted.marr.data[order(predicted.marr.data$probability.of.married, decreasing = FALSE),]
predicted.marr.data$rank <- 1:nrow(predicted.marr.data)
ggplot(data = predicted.marr.data,aes(x=rank, y=probability.of.married))+geom_point(aes(colour=married), alpha=1, shape=4, stroke=2)+ggtitle("Predicting Marital Status based on Salary & Industry ")+xlab("Index")+ylab("Prob. of Married")+labs(color = "Married")+scale_colour_gradient(low = "#FFCB05",high = "#00274C")
Based on the 76% accuracy rate when predicting marital status from new input data, as well as the visualization displayed in the above plot, the generated model does a fair job of predicting marital status.
A k-nearest neighbors (KNN) algorithm is used to predict the number of certifications attained, since it is a supervised machine learning algorithm well suited to classification problems, and the number of certifications is a class variable.
certi <- knn(train,test,cl=train$certification,k=12)
tab <- table(certi,test$certification)
tab
##
## certi 0 1 2 3 4
## 0 5 5 1 0 0
## 1 9 8 5 0 0
## 2 0 7 11 1 1
## 3 0 1 1 2 1
## 4 0 0 0 0 0
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
## [1] 44.82759
Based on the 45% accuracy rate when predicting the number of certifications attained from new input data, this model cannot accurately predict certifications, though it performs better than randomly guessing among the five possible counts.
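Also, the choice of k = 12 is not justified in the text. A simple sweep, sketched below, shows how k could be chosen; strictly, tuning should be done on a separate validation split or via cross-validation rather than on the test set, and the target column is excluded from the features per the earlier caveat:
library(class)
feat <- setdiff(names(train), "certification")   # features without the target
accs <- sapply(1:20, function(k) {
  pred <- knn(train[, feat], test[, feat], cl = train$certification, k = k)
  mean(as.character(pred) == as.character(test$certification))   # holdout accuracy
})
which.max(accs)   # the k giving the highest holdout accuracy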
A linear regression model is appropriate for predicting bonus, since bonus is a continuous numerical variable. The log of bonus was used to build the model, since a non-linear relationship was found between bonus and the predictor attributes; modeling the log of bonus allows a linear model to capture it (a quick diagnostic for this is sketched below).
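That non-linearity can be checked directly. A sketch comparing the residual patterns of raw and log models, fit on the same train data frame used below:
raw_fit <- lm(bonus ~ certification + salary + experience, data = train)
log_fit <- lm(log(bonus) ~ certification + salary + experience, data = train)
# Curvature in the first plot, and its absence in the second,
# would support using the log transform
plot(fitted(raw_fit), resid(raw_fit), main = "Raw bonus: residuals vs. fitted")
plot(fitted(log_fit), resid(log_fit), main = "Log bonus: residuals vs. fitted")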
logbonus <-log(train$bonus)
bonfit = lm(logbonus~certification+salary+experience, data=train)
summary(bonfit)
##
## Call:
## lm(formula = logbonus ~ certification + salary + experience,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19638 -0.10474 0.05824 0.21090 0.74390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.250664082 0.121362255 26.785 < 0.0000000000000002 ***
## certification 0.548236288 0.043725929 12.538 < 0.0000000000000002 ***
## salary 0.000023520 0.000003193 7.365 0.0000000000181 ***
## experience 0.124517321 0.027926289 4.459 0.0000176301755 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3418 on 130 degrees of freedom
## Multiple R-squared: 0.9209, Adjusted R-squared: 0.919
## F-statistic: 504.3 on 3 and 130 DF, p-value: < 0.00000000000000022
All variables have p-values less than 0.05, meaning they are statistically significant contributors to the model. The generated model has an adjusted R^2 value of 0.919, which is very good.
predbon <- predict(bonfit, test)
actuals_preds <- data.frame(cbind(actuals=test$bonus, predicteds=predbon))
cor(actuals_preds)
## actuals predicteds
## actuals 1.0000000 0.9151683
## predicteds 0.9151683 1.0000000
The predicted and actual bonuses have a correlation of 0.915 on the testing data, which is very good. Based on this high correlation on new input data, as well as the high adjusted R^2 value, the generated model does a very good job of predicting bonus.
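One caveat: predict(bonfit, test) returns predictions on the log scale, while test$bonus is in dollars, so the correlation above mixes the two scales. A sketch of the comparison on the original dollar scale, where exponentiation undoes the log transform used in training:
predbon_log <- predict(bonfit, test)       # predictions of log(bonus)
predbon_dollars <- exp(predbon_log)        # back-transform to dollars
cor(test$bonus, predbon_dollars)           # correlation on the dollar scale
mean(abs(predbon_dollars - test$bonus) / test$bonus)   # mean absolute percentage error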
A linear regression model is appropriate for predicting salary, since salary is a continuous numerical variable. Bonus cannot be used as a predictor in this model: if the model is used to set the salary of an incoming employee, that employee will not yet have received a bonus, and their bonus will itself depend partly on their salary.
salfit = lm(salary~education+married+experience, data=train)
summary(salfit)
##
## Call:
## lm(formula = salary ~ education + married + experience, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22502.4 -6189.4 556.7 4935.2 27909.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26524.8 2290.4 11.581 < 0.0000000000000002 ***
## education 4517.1 1138.1 3.969 0.000119 ***
## married 4461.9 1736.0 2.570 0.011291 *
## experience 6335.6 441.6 14.347 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9141 on 130 degrees of freedom
## Multiple R-squared: 0.74, Adjusted R-squared: 0.734
## F-statistic: 123.3 on 3 and 130 DF, p-value: < 0.00000000000000022
All variables have p-values less than 0.05, meaning they are statistically significant contributors to the model. The generated model has an adjusted R^2 value of 0.734, which is good.
predsal <- predict(salfit, test)
actuals_preds <- data.frame(cbind(actuals=test$salary, predicteds=predsal))
cor(actuals_preds)
## actuals predicteds
## actuals 1.0000000 0.8528089
## predicteds 0.8528089 1.0000000
The predicted and actual salaries have a correlation of 0.853 on the testing data. Based on this strong correlation on new input data, as well as the solid adjusted R^2 value, the generated model does a good job of predicting salary.
It is recommended that only the models for industry, marital status, bonus, and salary be used for predictive purposes. In a business application, predictive models for industry or marital status are unlikely to be of use to a company. However, a predictive model for salary can be extremely helpful, for example in determining the proper salary to offer an incoming employee based on factors such as education, experience, and certifications. A predictive model for bonus can be similarly helpful in determining the proper bonus for an existing employee.
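As an illustration of the salary use case, a hypothetical incoming employee can be scored with the salary model fitted above (here named salfit); newhire is an invented example, and the numeric codes follow the dataset's original encoding (education 2 = Grad, married 1 = Yes):
newhire <- data.frame(education = 2, married = 1, experience = 5)   # hypothetical hire
predict(salfit, newhire, interval = "prediction")   # salary estimate plus a plausible range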