Let’s Start with Setting up Our Datasheet

We fabricated our CT water datasheet from an existing IBM datasheet. We selected 1200 entries to demonstrate our project. All of our R codes are shown below in chunks if anyone is interested to use reproduce any results/charts that we make here.

Before we start the demonstration, we need to load our data.We will be using CT Water.csv that we created. You can certainly use your own datasheet, if you need a copy of our datasheet, please visit the end of this page under the Additional Resources

We also need to install the following packages:

install.packages('DataExplorer')

install.packages("psych")

install.packages("plotrix")

install.packages("sm")

install.packages("scatterplot3d")

install.packages("corrplot")

install.packages("ggcorrplot")

install.packages("ggplot2")

# Copy and paste the following packages names to your install command box:
# DataExplorer, psych, plotrix, sm, scatterplot3d, corrplot, ggcorrplot, ggplot2
# It will take a about 40 seconds wait until it says "C:\Users\yduan3\AppData\Local\Temp\RtmpojWFVV\downloaded_packages"

  getwd () # Just in case if your working directory is not on the desk top, you can set it to desktop

## [1] "C:/Users/yduan3/Desktop"

  setwd("C:/Users/yduan3/Desktop")  # Now we know we are under the correct working directory 
  CT <-read.csv("CT_Water.csv") # Let's take a look of the dataset, which is is assigend to "CT" variable  
  summary(CT) # We now know the Mean, Median, Max and Min for all our variables

##       Age           Gender    JobInvolvement  JobSatisfaction
##  Min.   :18.00   Female:478   Min.   :1.000   Min.   :1.000  
##  1st Qu.:30.00   Male  :722   1st Qu.:2.000   1st Qu.:2.000  
##  Median :36.00                Median :3.000   Median :3.000  
##  Mean   :37.04                Mean   :2.723   Mean   :3.019  
##  3rd Qu.:43.00                3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :60.00                Max.   :4.000   Max.   :4.000  
##  MonthlyIncome   PerformanceRating RelationshipSatisfaction
##  Min.   : 1009   Min.   :3.000     Min.   :1.000           
##  1st Qu.: 2942   1st Qu.:3.000     1st Qu.:2.000           
##  Median : 5000   Median :3.000     Median :3.000           
##  Mean   : 6564   Mean   :3.154     Mean   :2.702           
##  3rd Qu.: 8380   3rd Qu.:3.000     3rd Qu.:4.000           
##  Max.   :19999   Max.   :4.000     Max.   :4.000           
##     Training     WorkLifeBalance
##  Min.   :0.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000  
##  Mean   :3.382   Mean   :2.874  
##  3rd Qu.:5.000   3rd Qu.:3.000  
##  Max.   :6.000   Max.   :4.000

To make sure we can easily call our datasheet, we will attach our datasheet to RStudio so that we do not have to use the $ sign every time we call a variable

  attach(CT)
  str(CT) # Now, let's take a peak of all the variables in our dataset

## 'data.frame':    1200 obs. of  9 variables:
##  $ Age                     : int  28 58 25 43 33 33 34 36 36 31 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 1 2 1 1 ...
##  $ JobInvolvement          : int  2 2 2 3 2 4 3 2 4 3 ...
##  $ JobSatisfaction         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ MonthlyIncome           : int  4724 10008 4256 9985 2799 3143 6500 7587 2743 4148 ...
##  $ PerformanceRating       : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ RelationshipSatisfaction: int  3 4 1 1 2 2 2 2 3 3 ...
##  $ Training                : int  0 0 1 1 1 1 1 1 1 1 ...
##  $ WorkLifeBalance         : int  1 1 1 1 1 1 1 1 1 1 ...

  head(CT, 3) # We can also see some sample entries by using head and tail function

##   Age Gender JobInvolvement JobSatisfaction MonthlyIncome
## 1  28   Male              2               1          4724
## 2  58 Female              2               1         10008
## 3  25   Male              2               1          4256
##   PerformanceRating RelationshipSatisfaction Training WorkLifeBalance
## 1                 3                        3        0               1
## 2                 3                        4        0               1
## 3                 3                        1        1               1

  tail(CT, 3) # We can also see some sample entries by using head and tail function

##      Age Gender JobInvolvement JobSatisfaction MonthlyIncome
## 1198  59   Male              3               4         10512
## 1199  32 Female              3               4         11159
## 1200  40   Male              3               4         10322
##      PerformanceRating RelationshipSatisfaction Training WorkLifeBalance
## 1198                 3                        4        6               4
## 1199                 3                        4        6               4
## 1200                 4                        4        6               4

Descriptive Statistics with Charts

We will begin with running some descriptive statistics of our data and create some charts to visualize our data

library(psych)
# The * Sign tells you something weird is going on.Since Gender is a factor/categorial, we can't rely on some"
# stats shown here"
  describe(CT)

##                          vars    n    mean      sd median trimmed     mad
## Age                         1 1200   37.04    9.18     36   36.58    8.90
## Gender*                     2 1200    1.60    0.49      2    1.63    0.00
## JobInvolvement              3 1200    2.72    0.72      3    2.73    0.00
## JobSatisfaction             4 1200    3.02    1.06      3    3.15    1.48
## MonthlyIncome               5 1200 6564.43 4711.43   5000 5736.82 3346.97
## PerformanceRating           6 1200    3.15    0.36      3    3.07    0.00
## RelationshipSatisfaction    7 1200    2.70    1.09      3    2.75    1.48
## Training                    8 1200    3.38    1.67      3    3.38    1.48
## WorkLifeBalance             9 1200    2.87    0.85      3    2.92    1.48
##                           min   max range  skew kurtosis     se
## Age                        18    60    42  0.41    -0.45   0.27
## Gender*                     1     2     1 -0.41    -1.83   0.01
## JobInvolvement              1     4     3 -0.49     0.21   0.02
## JobSatisfaction             1     4     3 -0.69    -0.83   0.03
## MonthlyIncome            1009 19999 18990  1.35     0.96 136.01
## PerformanceRating           3     4     1  1.91     1.66   0.01
## RelationshipSatisfaction    1     4     3 -0.29    -1.23   0.03
## Training                    0     6     6  0.14    -0.89   0.05
## WorkLifeBalance             1     4     3 -0.33    -0.57   0.02

library(DataExplorer)
  # We want to get some visuals of course 
  plot_density(CT)

** It’s understandable that Gender is not plotted. However, we couldn’t figure out why Performance rating is not plotted. So, we are going to create a density plot just for Performance Rating. **

  # Kernel Density Plot
    d <- density(PerformanceRating)
    plot(d, 
         xlab="N = 300 Bandwidth = 0.1065",
         ylab="Density", #y-axis label
         main="CT Water Employee Performance Rating")
    polygon(d, col="cyan1", border="red", lwd =2)

  # 3D Exploded Pie Chart for Gender
    summary(Gender)

## Female   Male 
##    478    722

    slices <- c(478, 722) 
    lbls <- c("Female", "Male")
    pct <- round(slices/sum(slices)*100)
    lbls <- paste(lbls, pct) # add percents to labels 
    lbls <- paste(lbls,"%",sep="") # ad % to label
    pie3D(slices,labels=lbls,explode=0.1,
          main="CT Water Employee Gender")

library(sm)

## Package 'sm', version 2.2-5.6: type help(sm) for summary information

  # Compare Male and Female, Performance Ratings 
    # create value labels 
    GenderPR <- factor(Gender, levels= c("Female","Male"))
    # plot densities 
    sm.density.compare(PerformanceRating,Gender,xlab="Employee Performance Ratings ")
    title(main="Gender Differences in Performance")
    legend("topright",legend = c("Female","Male"),title = "Gender",
                 fill = c("red","green"))

  # Let's make a plot for Age and Income to see the relationship between them. 
    plot(Age,MonthlyIncome, main = "CT Water Employee Age and Income \n Scatterplot", 
         pch=19,col=rainbow(7, alpha = .9))

library(scatterplot3d) 
  # 3D Scatterplot with Coloring and Vertical Drop Lines
     
      scatterplot3d(Age,JobSatisfaction,MonthlyIncome, pch=16, highlight.3d=TRUE,
                    type="h", main="3D Scatterplot")

Based on this graph, we can safely conclude: 1) Older workers seem to be make more money than younger workers 2) Employees who are in their 40s and 50s seem to be more satisfied with their work than those who are in their 20s and 30s

  # plot densities 
    # create value labels 
    cyl.f <- factor(RelationshipSatisfaction, levels= c(1,2,3,4),
                    labels = c("Very Unsatisfied", "Unsatisfied", "Satisfied", "Very Satisfied")) 
      # plot densities 
          sm.density.compare(JobSatisfaction, RelationshipSatisfaction, xlab="Job Satisfaction")
          title(main="JobSatisfaction Distribution by Relationship Satisfaction")
          legend("topleft",legend = c("Very Unsatisfied", "Unsatisfied", "Satisfied", "Very Satisfied"),
                 fill = c("red","green","aquamarine","blue"))

As we can see, Job Satisfaction seems to be related to Relationship Satisfaction (Our scale for Job Satisfaction only goes up to 4, that’s why all the lines seems to decline after 4.

   # Grouped Bar Plot
    counts <- table(JobSatisfaction, RelationshipSatisfaction)
    barplot(counts, main="Job Satisfaction vs. Relationship Satisfaction",
          col=c("aquamarine","coral"),
            legend = NULL, beside=T, border = T)
    legend("topleft",legend = c("Job Satisfaction", "Relationship Satisfaction"),title = "Relationship Satisfaction",
           fill = c("aquamarine","coral"))

Inferential Statistics with Charts

library(ggcorrplot)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(corrplot)

## corrplot 0.84 loaded

  # Correlation Matrix 
    source("http://www.sthda.com/upload/rquery_cormat.r")
    # since we can only run correlation for numeric values, we exculuded "gender"
    NEW_CT <- data.frame(Age,JobInvolvement,JobSatisfaction,MonthlyIncome,PerformanceRating,
                         RelationshipSatisfaction,Training,WorkLifeBalance)
    col<- colorRampPalette(c("blue", "white", "red"))(20)
    cormat<-rquery.cormat(NEW_CT, type="full", col=col)

library(DataExplorer)
  # If you rather to have all numbers on your correlation, try these two code
    # plot_correlation(CT, type = 'continuous','Review.Date')
                 
    # Correlation matrix
    corr <- round(cor(NEW_CT), 1)
    # Plot
    ggcorrplot(corr, hc.order = TRUE, 
               type = "lower", 
               lab = TRUE, 
               lab_size = 3, 
               method="circle", 
               colors = c("tomato2", "white",
                          "springgreen3"), 
               title="Correlogram of CT Water", 
               ggtheme=theme_bw)

Based on our charts above, we know that Age and monthly income seem to be possitively correlated. We can get additional stats by using cor.test() and cov() function

  # Correlation and Covariance 
    # You can also do (use = "complete.obs", method="kendall") or (" method="spearman")
    cor.test(Age, MonthlyIncome, use ='complete.obs',method = "pearson") # default test is pearson

## 
##  Pearson's product-moment correlation
## 
## data:  Age and MonthlyIncome
## t = 19.887, df = 1198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.454408 0.539565
## sample estimates:
##       cor 
## 0.4981869

    cov(NEW_CT)

##                                    Age JobInvolvement JobSatisfaction
## Age                       8.427222e+01    0.246182930     -0.23423130
## JobInvolvement            2.461829e-01    0.517219905     -0.01053934
## JobSatisfaction          -2.342313e-01   -0.010539338      1.12140047
## MonthlyIncome             2.154704e+04  -62.215929942   -257.65452460
## PerformanceRating        -9.827634e-03   -0.007353350      0.14383167
## RelationshipSatisfaction  6.948513e-01    0.028318043      0.09579789
## Training                  1.711482e-01   -0.027762024      0.66740756
## WorkLifeBalance           3.547679e-02    0.002688351      0.45612524
##                          MonthlyIncome PerformanceRating
## Age                       2.154704e+04      -0.009827634
## JobInvolvement           -6.221593e+01      -0.007353350
## JobSatisfaction          -2.576545e+02       0.143831665
## MonthlyIncome             2.219761e+07     -29.495934112
## PerformanceRating        -2.949593e+01       0.130508062
## RelationshipSatisfaction  1.661094e+02      -0.009848485
## Training                  6.593274e+01       0.132103142
## WorkLifeBalance          -2.115332e+01       0.036930081
##                          RelationshipSatisfaction    Training
## Age                                   0.694851265  0.17114818
## JobInvolvement                        0.028318043 -0.02776202
## JobSatisfaction                       0.095797887  0.66740756
## MonthlyIncome                       166.109382819 65.93273561
## PerformanceRating                    -0.009848485  0.13210314
## RelationshipSatisfaction              1.195326661  0.20570197
## Training                              0.205701974  2.79332499
## WorkLifeBalance                       0.013304142  0.18985126
##                          WorkLifeBalance
## Age                          0.035476786
## JobInvolvement               0.002688351
## JobSatisfaction              0.456125243
## MonthlyIncome              -21.153315263
## PerformanceRating            0.036930081
## RelationshipSatisfaction     0.013304142
## Training                     0.189851265
## WorkLifeBalance              0.725603976

  # One Sample t test 
    t.test(MonthlyIncome,mu=5000) # Ho: mu=5000

## 
##  One Sample t-test
## 
## data:  MonthlyIncome
## t = 11.503, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 5000
## 95 percent confidence interval:
##  6297.586 6831.264
## sample estimates:
## mean of x 
##  6564.425

Now, we are going to make our first Multiple Regression Model

  # First Model of Multiple Regression 
    Model_One <- lm (JobSatisfaction ~ PerformanceRating + Age + Training + JobInvolvement+ WorkLifeBalance + MonthlyIncome, data = CT)
      summary(Model_One)

## 
## Call:
## lm(formula = JobSatisfaction ~ PerformanceRating + Age + Training + 
##     JobInvolvement + WorkLifeBalance + MonthlyIncome, data = CT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57118 -0.48577  0.08718  0.58096  2.39958 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.459e+00  2.490e-01  -5.858 6.06e-09 ***
## PerformanceRating  7.779e-01  6.642e-02  11.711  < 2e-16 ***
## Age               -7.222e-04  2.932e-03  -0.246   0.8054    
## Training           1.653e-01  1.438e-02  11.496  < 2e-16 ***
## JobInvolvement    -4.122e-03  3.247e-02  -0.127   0.8990    
## WorkLifeBalance    5.455e-01  2.772e-02  19.678  < 2e-16 ***
## MonthlyIncome     -9.855e-06  5.710e-06  -1.726   0.0846 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8068 on 1193 degrees of freedom
## Multiple R-squared:  0.4225, Adjusted R-squared:  0.4196 
## F-statistic: 145.5 on 6 and 1193 DF,  p-value: < 2.2e-16

         plot(Model_One)

         termplot(Model_One)

This is our second Multiple Regression Model

  Model_Two <- lm (JobSatisfaction ~ PerformanceRating  + Training + WorkLifeBalance, data = CT)
      summary(Model_Two)

## 
## Call:
## lm(formula = JobSatisfaction ~ PerformanceRating + Training + 
##     WorkLifeBalance, data = CT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.52035 -0.48324  0.09547  0.60072  2.35184 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.56960    0.21208  -7.401 2.54e-13 ***
## PerformanceRating  0.78073    0.06643  11.752  < 2e-16 ***
## Training           0.16492    0.01438  11.465  < 2e-16 ***
## WorkLifeBalance    0.54573    0.02774  19.673  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8073 on 1196 degrees of freedom
## Multiple R-squared:  0.4203, Adjusted R-squared:  0.4188 
## F-statistic:   289 on 3 and 1196 DF,  p-value: < 2.2e-16

         plot(Model_Two)

         termplot(Model_Two)

ANOVA

anova(Model_One,Model_Two)

## Analysis of Variance Table
## 
## Model 1: JobSatisfaction ~ PerformanceRating + Age + Training + JobInvolvement + 
##     WorkLifeBalance + MonthlyIncome
## Model 2: JobSatisfaction ~ PerformanceRating + Training + WorkLifeBalance
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1193 776.48                           
## 2   1196 779.49 -3   -3.0103 1.5417  0.202

This shows us that reducing Age, JobInlovment and MonthlyIncome didn’t really improve our model any better, so we can use either model to predit Job Satisfaction. (If you have more models, ANOVA analysis will show you which ones are the bad ones)

Using Model 1 to predict Job Satisfaction:

round(predict(Model_One, data.frame(Age = 60, MonthlyIncome = 5000, JobInvolvement = 4, PerformanceRating = 4, Training = 4, WorkLifeBalance =1), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 2.75 2.52 2.98

round(predict(Model_One, data.frame(Age = 25, MonthlyIncome = 8500, JobInvolvement = 1, PerformanceRating = 3, Training = 5, WorkLifeBalance =2), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 2.69 2.53 2.84

round(predict(Model_One, data.frame(Age = 35, MonthlyIncome = 2500, JobInvolvement = 2, PerformanceRating = 2, Training = 3, WorkLifeBalance =3), interval = 'confidence'),2)

##    fit lwr  upr
## 1 2.17   2 2.34

round(predict(Model_One, data.frame(Age = 33, MonthlyIncome = 4050, JobInvolvement = 4,  PerformanceRating = 3, Training = 2, WorkLifeBalance =4), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 3.31 3.18 3.43

round(predict(Model_One, data.frame(Age = 50, MonthlyIncome = 2800, JobInvolvement = 3, PerformanceRating = 1, Training = 1, WorkLifeBalance =3), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 1.05 0.75 1.34

round(predict(Model_One, data.frame(Age = 45, MonthlyIncome = 7200, JobInvolvement = 3, PerformanceRating = 4, Training = 2, WorkLifeBalance =2), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 2.96 2.81 3.11

round(predict(Model_One, data.frame(Age = 60, MonthlyIncome = 9800, JobInvolvement = 2, PerformanceRating = 2, Training = 3, WorkLifeBalance =1), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 0.99 0.77 1.21

Using Model 2 to predict Job Satisfaction:

round(predict(Model_Two, data.frame(PerformanceRating = 4, Training = 4, WorkLifeBalance =1), interval = 'confidence'),2)

##    fit lwr  upr
## 1 2.76 2.6 2.92

round(predict(Model_Two, data.frame(PerformanceRating = 3, Training = 5, WorkLifeBalance =2), interval = 'confidence'),2)

##    fit lwr  upr
## 1 2.69 2.6 2.78

round(predict(Model_Two, data.frame(PerformanceRating = 2, Training = 3, WorkLifeBalance =3), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 2.12 1.97 2.28

round(predict(Model_Two, data.frame(PerformanceRating = 3, Training = 2, WorkLifeBalance =4), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 3.29 3.19 3.38

round(predict(Model_Two, data.frame(PerformanceRating = 1, Training = 1, WorkLifeBalance =3), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 1.01 0.73 1.29

round(predict(Model_Two, data.frame(PerformanceRating = 4, Training = 2, WorkLifeBalance =2), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 2.97 2.83 3.12

round(predict(Model_Two, data.frame(PerformanceRating = 2, Training = 3, WorkLifeBalance =1), interval = 'confidence'),2)

##    fit  lwr  upr
## 1 1.03 0.86 1.21

Additional Resources

We’ve put together some useful resources for learning R, if you are interested, please visit this shared folder on Google Drive: https://drive.google.com/drive/folders/1ujTdvpKxAlnysKi4gY4H3sQES4tcJQ0c?usp=sharing.

Thank you for paying attention:)

Yay! We are graduating! MAIOP 2019

CT Water Project

Taylor Becchetti, Nick Boccuzzi, Patrick McBrearty, Ethan Duan

5/9/2019