Create a linear regression model from salary-dataf.csv and predict salary of the data as given in salary-test .csv.
Data is saved as a csv file in /Users/Charu/Desktop/Machine Learning
These are the salary data used in Weisberg’s book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:
sx = Sex, coded 1 for female and 0 for male rk = Rank, coded 1 for assistant professor, 2 for associate professor, and 3 for full professor yr = Number of years in current rank dg = Highest degree, coded 1 if doctorate, 0 if masters yd = Number of years since highest degree was earned sl = Academic year salary, in dollars.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
Functions
Read Dataset
datasal <-read.csv("/Users/Charu/Desktop/Machine Learning/data-2.csv")
head(datasal)
## sx rk yr dg yd sl
## 1 male full 25 doctorate 35 36350
## 2 male full 13 doctorate 22 35350
## 3 male full 10 doctorate 23 28200
## 4 female full 7 doctorate 27 26775
## 5 male full 19 masters 30 33696
## 6 male full 16 doctorate 21 28516
Observation The above data set doesn’t contain any unknown data. The data is clean and complete.
sum(is.na(datasal))
## [1] 0
Summary
#summary(dataf)
lapply(datasal, FUN=summary)
## $sx
## female male
## 14 38
##
## $rk
## assistant associate full
## 18 14 20
##
## $yr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 7.000 7.481 11.000 25.000
##
## $dg
## doctorate masters
## 34 18
##
## $yd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.75 15.50 16.12 23.25 35.00
##
## $sl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15000 18250 23720 23800 27260 38040
detect_outliers(datasal$yr)
## [1] 25
detect_outliers(datasal$yd)
## integer(0)
detect_outliers(datasal$sl)
## integer(0)
Observation
Outliers present only in one feature that is in yr.
But Outlier count is low.
For this model we will work with the outliers.
plotgraph <- function(inp, na.rm=TRUE) {
outplot <- ggplot(datasal, aes(x="", y=inp)) +
geom_boxplot(aes(fill=inp), color="blue") +
labs(title="Salary Outliers")
outplot
}
lapply(datasal, FUN=plotgraph)
## $sx
##
## $rk
##
## $yr
##
## $dg
##
## $yd
##
## $sl
datasal$sx <- as.factor(datasal$sx)
levels(datasal$sx) <- c("1", "0")
datasal$rk <- as.factor(datasal$rk)
levels(datasal$rk) <- c("1", "2","3")
datasal$dg <- as.factor(datasal$dg)
levels(datasal$dg) <- c("1", "0")
head(datasal)
## sx rk yr dg yd sl
## 1 0 3 25 1 35 36350
## 2 0 3 13 1 22 35350
## 3 0 3 10 1 23 28200
## 4 1 3 7 1 27 26775
## 5 0 3 19 0 30 33696
## 6 0 3 16 1 21 28516
Correlation
vctCorr = numeric(0)
for (i in names(datasal)){
cor.result <- cor(datasal$sl, as.numeric(datasal[,i]))
vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(datasal)
dfrCorr
## sx rk yr dg yd sl
## 0.25278237 0.86748836 0.70066899 0.06972576 0.67485416 1.00000000
Observation The correlation output shows that sl has correlation with rk,yr and fairly less with the other factors. This also tells us that in estimating salary these 2 highly correlated factors will play a major role.
Visualize
dfrGraph <- gather(datasal, variable, value, -sl)
## Warning: attributes are not identical across measure variables; they will
## be dropped
head(dfrGraph)
## sl variable value
## 1 36350 sx 0
## 2 35350 sx 0
## 3 28200 sx 0
## 4 26775 sx 1
## 5 33696 sx 0
## 6 28516 sx 0
ggplot(dfrGraph) +
geom_jitter(aes(value,sl, colour=variable)) +
geom_smooth(aes(value,sl, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of salary With Other Features")
Observation
There is some impact of all the features with sl correlating sl with other variables we see that - rk has the highest positive correlation
Dataset Split
set.seed(808)
TrainRecords <- createDataPartition(y=datasal$sl, p=0.7, list=FALSE)
TrainData <- datasal[TrainRecords,]
TestData <- datasal[-TrainRecords,]
Spilting the dataset into Training and Test Dataset
70% of the data is spilt as train data and 30 % as test data
Training Dataset RowCount & ColCount
dim(TestData)
## [1] 12 6
Testing Dataset Head
head(TestData)
## sx rk yr dg yd sl
## 3 0 3 10 1 23 28200
## 5 0 3 19 0 30 33696
## 11 0 3 12 1 22 27025
## 16 0 3 7 1 15 29342
## 27 0 2 11 1 14 24800
## 29 0 2 3 0 7 26182
Training Dataset Summary
lapply(TrainData, FUN=summary)
## $sx
## 1 0
## 10 30
##
## $rk
## 1 2 3
## 13 11 16
##
## $yr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 7.500 7.725 11.000 25.000
##
## $dg
## 1 0
## 26 14
##
## $yd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 8.5 17.0 16.8 24.0 35.0
##
## $sl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16090 18250 23720 23920 27070 38040
Testing Dataset Summary
lapply(TestData, FUN=summary)
## $sx
## 1 0
## 4 8
##
## $rk
## 1 2 3
## 5 3 4
##
## $yr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 5.500 6.667 10.250 19.000
##
## $dg
## 1 0
## 8 4
##
## $yd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.75 14.00 13.83 22.25 33.00
##
## $sl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15000 19350 24260 23400 27320 33700
Finding Best Multi Linear Model AIC provides a means for model selection.
#?step()
stpModel=step(lm(data=TrainData, sl~.), trace=0, steps=36)
stpSummary <- summary(stpModel)
stpSummary
##
## Call:
## lm(formula = sl ~ rk + yr, data = TrainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3583.6 -1004.6 47.3 935.4 9246.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15819.45 743.12 21.288 < 2e-16 ***
## rk2 4276.94 986.77 4.334 0.000112 ***
## rk3 9920.36 976.07 10.164 4.02e-12 ***
## yr 382.30 76.11 5.023 1.40e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2350 on 36 degrees of freedom
## Multiple R-squared: 0.8576, Adjusted R-squared: 0.8458
## F-statistic: 72.29 on 3 and 36 DF, p-value: 2.623e-15
Observation
Best results given by sl ~ rk + yr
Make Final Multi Linear Model
x1 <- TrainData$rk
x2 <- TrainData$yr
y <- TrainData$sl
slmModel <- lm(y~x1+x2)
Observation
No errors. Model successfully created.
Show Model
# print summary
summary(slmModel)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3583.6 -1004.6 47.3 935.4 9246.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15819.45 743.12 21.288 < 2e-16 ***
## x12 4276.94 986.77 4.334 0.000112 ***
## x13 9920.36 976.07 10.164 4.02e-12 ***
## x2 382.30 76.11 5.023 1.40e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2350 on 36 degrees of freedom
## Multiple R-squared: 0.8576, Adjusted R-squared: 0.8458
## F-statistic: 72.29 on 3 and 36 DF, p-value: 2.623e-15
This model is the best to predict salary because it gives a fairly high r value which shows that these factors are highly related to salary. Also because the p value is significant and shows that all factors in this model are equally significant in change of salary.
PredData <- select(TestData, -c(sl,sx,dg,yd))
colnames(PredData) <- c("x1","x2")
head(PredData)
## x1 x2
## 3 3 10
## 5 3 19
## 11 3 12
## 16 3 7
## 27 2 11
## 29 2 3
dim(PredData)
## [1] 12 2
pred_sl <- predict(slmModel, PredData)
pred_sl
## 3 5 11 16 27 29 33 34
## 29562.78 33003.46 30327.38 28415.89 24301.66 21243.29 19260.13 21625.58
## 45 48 51 52
## 16584.05 16584.05 16201.75 15819.45
TestData <- cbind(TestData, pred_sl)
Observation
The above output will show the pridicted of the salary(sl) based on the regression model.