Problem Definition

Create a linear regression model from salary-dataf.csv and predict salary of the data as given in salary-test .csv.

Data Location

Data is saved as a csv file in /Users/Charu/Desktop/Machine Learning

Data Description

These are the salary data used in Weisberg’s book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:

sx = Sex, coded 1 for female and 0 for male rk = Rank, coded 1 for assistant professor, 2 for associate professor, and 3 for full professor yr = Number of years in current rank dg = Highest degree, coded 1 if doctorate, 0 if masters yd = Number of years since highest degree was earned sl = Academic year salary, in dollars.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: lattice

Functions

Read Dataset

datasal <-read.csv("/Users/Charu/Desktop/Machine Learning/data-2.csv")
head(datasal)
##       sx   rk yr        dg yd    sl
## 1   male full 25 doctorate 35 36350
## 2   male full 13 doctorate 22 35350
## 3   male full 10 doctorate 23 28200
## 4 female full  7 doctorate 27 26775
## 5   male full 19   masters 30 33696
## 6   male full 16 doctorate 21 28516

Observation The above data set doesn’t contain any unknown data. The data is clean and complete.

sum(is.na(datasal))
## [1] 0

Summary

#summary(dataf)
lapply(datasal, FUN=summary)
## $sx
## female   male 
##     14     38 
## 
## $rk
## assistant associate      full 
##        18        14        20 
## 
## $yr
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   7.000   7.481  11.000  25.000 
## 
## $dg
## doctorate   masters 
##        34        18 
## 
## $yd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.75   15.50   16.12   23.25   35.00 
## 
## $sl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15000   18250   23720   23800   27260   38040
detect_outliers(datasal$yr)
## [1] 25
detect_outliers(datasal$yd)
## integer(0)
detect_outliers(datasal$sl)
## integer(0)

Observation
Outliers present only in one feature that is in yr.
But Outlier count is low.
For this model we will work with the outliers.

plotgraph <- function(inp, na.rm=TRUE) {
outplot <- ggplot(datasal, aes(x="", y=inp)) +
            geom_boxplot(aes(fill=inp), color="blue") +
            labs(title="Salary Outliers")
outplot
}
lapply(datasal, FUN=plotgraph)
## $sx

## 
## $rk

## 
## $yr

## 
## $dg

## 
## $yd

## 
## $sl

datasal$sx <- as.factor(datasal$sx)
levels(datasal$sx) <- c("1", "0")
datasal$rk <- as.factor(datasal$rk)
levels(datasal$rk) <- c("1", "2","3")
datasal$dg <- as.factor(datasal$dg)
levels(datasal$dg) <- c("1", "0")

head(datasal)
##   sx rk yr dg yd    sl
## 1  0  3 25  1 35 36350
## 2  0  3 13  1 22 35350
## 3  0  3 10  1 23 28200
## 4  1  3  7  1 27 26775
## 5  0  3 19  0 30 33696
## 6  0  3 16  1 21 28516

Correlation

vctCorr = numeric(0)
for (i in names(datasal)){
    cor.result <- cor(datasal$sl, as.numeric(datasal[,i]))
    vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(datasal)
dfrCorr
##         sx         rk         yr         dg         yd         sl 
## 0.25278237 0.86748836 0.70066899 0.06972576 0.67485416 1.00000000

Observation The correlation output shows that sl has correlation with rk,yr and fairly less with the other factors. This also tells us that in estimating salary these 2 highly correlated factors will play a major role.

Visualize

dfrGraph <- gather(datasal, variable, value, -sl)
## Warning: attributes are not identical across measure variables; they will
## be dropped
head(dfrGraph)
##      sl variable value
## 1 36350       sx     0
## 2 35350       sx     0
## 3 28200       sx     0
## 4 26775       sx     1
## 5 33696       sx     0
## 6 28516       sx     0
ggplot(dfrGraph) +
    geom_jitter(aes(value,sl, colour=variable)) + 
    geom_smooth(aes(value,sl, colour=variable), method=lm, se=FALSE) +
    facet_wrap(~variable, scales="free_x") +
    labs(title="Relation Of salary With Other Features")

Observation
There is some impact of all the features with sl correlating sl with other variables we see that - rk has the highest positive correlation

Dataset Split

set.seed(808)
TrainRecords <- createDataPartition(y=datasal$sl, p=0.7, list=FALSE)
TrainData <- datasal[TrainRecords,]
TestData <- datasal[-TrainRecords,]

Spilting the dataset into Training and Test Dataset
70% of the data is spilt as train data and 30 % as test data

Training Dataset RowCount & ColCount

dim(TestData)
## [1] 12  6

Testing Dataset Head

head(TestData)
##    sx rk yr dg yd    sl
## 3   0  3 10  1 23 28200
## 5   0  3 19  0 30 33696
## 11  0  3 12  1 22 27025
## 16  0  3  7  1 15 29342
## 27  0  2 11  1 14 24800
## 29  0  2  3  0  7 26182

Training Dataset Summary

lapply(TrainData, FUN=summary)
## $sx
##  1  0 
## 10 30 
## 
## $rk
##  1  2  3 
## 13 11 16 
## 
## $yr
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   7.500   7.725  11.000  25.000 
## 
## $dg
##  1  0 
## 26 14 
## 
## $yd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     8.5    17.0    16.8    24.0    35.0 
## 
## $sl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16090   18250   23720   23920   27070   38040

Testing Dataset Summary

lapply(TestData, FUN=summary)
## $sx
## 1 0 
## 4 8 
## 
## $rk
## 1 2 3 
## 5 3 4 
## 
## $yr
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.500   6.667  10.250  19.000 
## 
## $dg
## 1 0 
## 8 4 
## 
## $yd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.75   14.00   13.83   22.25   33.00 
## 
## $sl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15000   19350   24260   23400   27320   33700

Finding Best Multi Linear Model AIC provides a means for model selection.

#?step()
stpModel=step(lm(data=TrainData, sl~.), trace=0, steps=36)
stpSummary <- summary(stpModel)
stpSummary 
## 
## Call:
## lm(formula = sl ~ rk + yr, data = TrainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3583.6 -1004.6    47.3   935.4  9246.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15819.45     743.12  21.288  < 2e-16 ***
## rk2          4276.94     986.77   4.334 0.000112 ***
## rk3          9920.36     976.07  10.164 4.02e-12 ***
## yr            382.30      76.11   5.023 1.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2350 on 36 degrees of freedom
## Multiple R-squared:  0.8576, Adjusted R-squared:  0.8458 
## F-statistic: 72.29 on 3 and 36 DF,  p-value: 2.623e-15

Observation
Best results given by sl ~ rk + yr

Make Final Multi Linear Model

x1 <- TrainData$rk
x2 <- TrainData$yr
y <- TrainData$sl
slmModel <- lm(y~x1+x2)

Observation
No errors. Model successfully created.

Show Model

# print summary
summary(slmModel)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3583.6 -1004.6    47.3   935.4  9246.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15819.45     743.12  21.288  < 2e-16 ***
## x12          4276.94     986.77   4.334 0.000112 ***
## x13          9920.36     976.07  10.164 4.02e-12 ***
## x2            382.30      76.11   5.023 1.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2350 on 36 degrees of freedom
## Multiple R-squared:  0.8576, Adjusted R-squared:  0.8458 
## F-statistic: 72.29 on 3 and 36 DF,  p-value: 2.623e-15

This model is the best to predict salary because it gives a fairly high r value which shows that these factors are highly related to salary. Also because the p value is significant and shows that all factors in this model are equally significant in change of salary.

PredData <- select(TestData, -c(sl,sx,dg,yd))
colnames(PredData) <- c("x1","x2")
head(PredData)
##    x1 x2
## 3   3 10
## 5   3 19
## 11  3 12
## 16  3  7
## 27  2 11
## 29  2  3
dim(PredData)
## [1] 12  2
pred_sl <- predict(slmModel, PredData)
pred_sl
##        3        5       11       16       27       29       33       34 
## 29562.78 33003.46 30327.38 28415.89 24301.66 21243.29 19260.13 21625.58 
##       45       48       51       52 
## 16584.05 16584.05 16201.75 15819.45
TestData <- cbind(TestData, pred_sl)

Observation
The above output will show the pridicted of the salary(sl) based on the regression model.