We fabricated our CT water datasheet from an existing IBM datasheet. We selected 1200 entries to demonstrate our project. All of our R codes are shown below in chunks if anyone is interested to use reproduce any results/charts that we make here.
Before we start the demonstration, we need to load our data.We will be using CT Water.csv that we created. You can certainly use your own datasheet, if you need a copy of our datasheet, please visit the end of this page under the Additional Resources
We also need to install the following packages:
install.packages('DataExplorer')
install.packages("psych")
install.packages("plotrix")
install.packages("sm")
install.packages("scatterplot3d")
install.packages("corrplot")
install.packages("ggcorrplot")
install.packages("ggplot2")
# Copy and paste the following packages names to your install command box:
# DataExplorer, psych, plotrix, sm, scatterplot3d, corrplot, ggcorrplot, ggplot2
# It will take a about 40 seconds wait until it says "C:\Users\yduan3\AppData\Local\Temp\RtmpojWFVV\downloaded_packages"
getwd () # Just in case if your working directory is not on the desk top, you can set it to desktop
## [1] "C:/Users/yduan3/Desktop"
setwd("C:/Users/yduan3/Desktop") # Now we know we are under the correct working directory
CT <-read.csv("CT_Water.csv") # Let's take a look of the dataset, which is is assigend to "CT" variable
summary(CT) # We now know the Mean, Median, Max and Min for all our variables
## Age Gender JobInvolvement JobSatisfaction
## Min. :18.00 Female:478 Min. :1.000 Min. :1.000
## 1st Qu.:30.00 Male :722 1st Qu.:2.000 1st Qu.:2.000
## Median :36.00 Median :3.000 Median :3.000
## Mean :37.04 Mean :2.723 Mean :3.019
## 3rd Qu.:43.00 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :60.00 Max. :4.000 Max. :4.000
## MonthlyIncome PerformanceRating RelationshipSatisfaction
## Min. : 1009 Min. :3.000 Min. :1.000
## 1st Qu.: 2942 1st Qu.:3.000 1st Qu.:2.000
## Median : 5000 Median :3.000 Median :3.000
## Mean : 6564 Mean :3.154 Mean :2.702
## 3rd Qu.: 8380 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :19999 Max. :4.000 Max. :4.000
## Training WorkLifeBalance
## Min. :0.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000
## Mean :3.382 Mean :2.874
## 3rd Qu.:5.000 3rd Qu.:3.000
## Max. :6.000 Max. :4.000
To make sure we can easily call our datasheet, we will attach our datasheet to RStudio so that we do not have to use the $ sign every time we call a variable
attach(CT)
str(CT) # Now, let's take a peak of all the variables in our dataset
## 'data.frame': 1200 obs. of 9 variables:
## $ Age : int 28 58 25 43 33 33 34 36 36 31 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 1 2 1 1 ...
## $ JobInvolvement : int 2 2 2 3 2 4 3 2 4 3 ...
## $ JobSatisfaction : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MonthlyIncome : int 4724 10008 4256 9985 2799 3143 6500 7587 2743 4148 ...
## $ PerformanceRating : int 3 3 3 3 3 3 3 3 3 3 ...
## $ RelationshipSatisfaction: int 3 4 1 1 2 2 2 2 3 3 ...
## $ Training : int 0 0 1 1 1 1 1 1 1 1 ...
## $ WorkLifeBalance : int 1 1 1 1 1 1 1 1 1 1 ...
head(CT, 3) # We can also see some sample entries by using head and tail function
## Age Gender JobInvolvement JobSatisfaction MonthlyIncome
## 1 28 Male 2 1 4724
## 2 58 Female 2 1 10008
## 3 25 Male 2 1 4256
## PerformanceRating RelationshipSatisfaction Training WorkLifeBalance
## 1 3 3 0 1
## 2 3 4 0 1
## 3 3 1 1 1
tail(CT, 3) # We can also see some sample entries by using head and tail function
## Age Gender JobInvolvement JobSatisfaction MonthlyIncome
## 1198 59 Male 3 4 10512
## 1199 32 Female 3 4 11159
## 1200 40 Male 3 4 10322
## PerformanceRating RelationshipSatisfaction Training WorkLifeBalance
## 1198 3 4 6 4
## 1199 3 4 6 4
## 1200 4 4 6 4
We will begin with running some descriptive statistics of our data and create some charts to visualize our data
library(psych)
# The * Sign tells you something weird is going on.Since Gender is a factor/categorial, we can't rely on some"
# stats shown here"
describe(CT)
## vars n mean sd median trimmed mad
## Age 1 1200 37.04 9.18 36 36.58 8.90
## Gender* 2 1200 1.60 0.49 2 1.63 0.00
## JobInvolvement 3 1200 2.72 0.72 3 2.73 0.00
## JobSatisfaction 4 1200 3.02 1.06 3 3.15 1.48
## MonthlyIncome 5 1200 6564.43 4711.43 5000 5736.82 3346.97
## PerformanceRating 6 1200 3.15 0.36 3 3.07 0.00
## RelationshipSatisfaction 7 1200 2.70 1.09 3 2.75 1.48
## Training 8 1200 3.38 1.67 3 3.38 1.48
## WorkLifeBalance 9 1200 2.87 0.85 3 2.92 1.48
## min max range skew kurtosis se
## Age 18 60 42 0.41 -0.45 0.27
## Gender* 1 2 1 -0.41 -1.83 0.01
## JobInvolvement 1 4 3 -0.49 0.21 0.02
## JobSatisfaction 1 4 3 -0.69 -0.83 0.03
## MonthlyIncome 1009 19999 18990 1.35 0.96 136.01
## PerformanceRating 3 4 1 1.91 1.66 0.01
## RelationshipSatisfaction 1 4 3 -0.29 -1.23 0.03
## Training 0 6 6 0.14 -0.89 0.05
## WorkLifeBalance 1 4 3 -0.33 -0.57 0.02
library(DataExplorer)
# We want to get some visuals of course
plot_density(CT)
** It’s understandable that Gender is not plotted. However, we couldn’t figure out why Performance rating is not plotted. So, we are going to create a density plot just for Performance Rating. **
# Kernel Density Plot
d <- density(PerformanceRating)
plot(d,
xlab="N = 300 Bandwidth = 0.1065",
ylab="Density", #y-axis label
main="CT Water Employee Performance Rating")
polygon(d, col="cyan1", border="red", lwd =2)
# 3D Exploded Pie Chart for Gender
summary(Gender)
## Female Male
## 478 722
slices <- c(478, 722)
lbls <- c("Female", "Male")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to label
pie3D(slices,labels=lbls,explode=0.1,
main="CT Water Employee Gender")
library(sm)
## Package 'sm', version 2.2-5.6: type help(sm) for summary information
# Compare Male and Female, Performance Ratings
# create value labels
GenderPR <- factor(Gender, levels= c("Female","Male"))
# plot densities
sm.density.compare(PerformanceRating,Gender,xlab="Employee Performance Ratings ")
title(main="Gender Differences in Performance")
legend("topright",legend = c("Female","Male"),title = "Gender",
fill = c("red","green"))
# Let's make a plot for Age and Income to see the relationship between them.
plot(Age,MonthlyIncome, main = "CT Water Employee Age and Income \n Scatterplot",
pch=19,col=rainbow(7, alpha = .9))
library(scatterplot3d)
# 3D Scatterplot with Coloring and Vertical Drop Lines
scatterplot3d(Age,JobSatisfaction,MonthlyIncome, pch=16, highlight.3d=TRUE,
type="h", main="3D Scatterplot")
Based on this graph, we can safely conclude: 1) Older workers seem to be make more money than younger workers 2) Employees who are in their 40s and 50s seem to be more satisfied with their work than those who are in their 20s and 30s
# plot densities
# create value labels
cyl.f <- factor(RelationshipSatisfaction, levels= c(1,2,3,4),
labels = c("Very Unsatisfied", "Unsatisfied", "Satisfied", "Very Satisfied"))
# plot densities
sm.density.compare(JobSatisfaction, RelationshipSatisfaction, xlab="Job Satisfaction")
title(main="JobSatisfaction Distribution by Relationship Satisfaction")
legend("topleft",legend = c("Very Unsatisfied", "Unsatisfied", "Satisfied", "Very Satisfied"),
fill = c("red","green","aquamarine","blue"))
As we can see, Job Satisfaction seems to be related to Relationship Satisfaction (Our scale for Job Satisfaction only goes up to 4, that’s why all the lines seems to decline after 4.
# Grouped Bar Plot
counts <- table(JobSatisfaction, RelationshipSatisfaction)
barplot(counts, main="Job Satisfaction vs. Relationship Satisfaction",
col=c("aquamarine","coral"),
legend = NULL, beside=T, border = T)
legend("topleft",legend = c("Job Satisfaction", "Relationship Satisfaction"),title = "Relationship Satisfaction",
fill = c("aquamarine","coral"))
library(ggcorrplot)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(corrplot)
## corrplot 0.84 loaded
# Correlation Matrix
source("http://www.sthda.com/upload/rquery_cormat.r")
# since we can only run correlation for numeric values, we exculuded "gender"
NEW_CT <- data.frame(Age,JobInvolvement,JobSatisfaction,MonthlyIncome,PerformanceRating,
RelationshipSatisfaction,Training,WorkLifeBalance)
col<- colorRampPalette(c("blue", "white", "red"))(20)
cormat<-rquery.cormat(NEW_CT, type="full", col=col)
library(DataExplorer)
# If you rather to have all numbers on your correlation, try these two code
# plot_correlation(CT, type = 'continuous','Review.Date')
# Correlation matrix
corr <- round(cor(NEW_CT), 1)
# Plot
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white",
"springgreen3"),
title="Correlogram of CT Water",
ggtheme=theme_bw)
Based on our charts above, we know that Age and monthly income seem to be possitively correlated. We can get additional stats by using cor.test() and cov() function
# Correlation and Covariance
# You can also do (use = "complete.obs", method="kendall") or (" method="spearman")
cor.test(Age, MonthlyIncome, use ='complete.obs',method = "pearson") # default test is pearson
##
## Pearson's product-moment correlation
##
## data: Age and MonthlyIncome
## t = 19.887, df = 1198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.454408 0.539565
## sample estimates:
## cor
## 0.4981869
cov(NEW_CT)
## Age JobInvolvement JobSatisfaction
## Age 8.427222e+01 0.246182930 -0.23423130
## JobInvolvement 2.461829e-01 0.517219905 -0.01053934
## JobSatisfaction -2.342313e-01 -0.010539338 1.12140047
## MonthlyIncome 2.154704e+04 -62.215929942 -257.65452460
## PerformanceRating -9.827634e-03 -0.007353350 0.14383167
## RelationshipSatisfaction 6.948513e-01 0.028318043 0.09579789
## Training 1.711482e-01 -0.027762024 0.66740756
## WorkLifeBalance 3.547679e-02 0.002688351 0.45612524
## MonthlyIncome PerformanceRating
## Age 2.154704e+04 -0.009827634
## JobInvolvement -6.221593e+01 -0.007353350
## JobSatisfaction -2.576545e+02 0.143831665
## MonthlyIncome 2.219761e+07 -29.495934112
## PerformanceRating -2.949593e+01 0.130508062
## RelationshipSatisfaction 1.661094e+02 -0.009848485
## Training 6.593274e+01 0.132103142
## WorkLifeBalance -2.115332e+01 0.036930081
## RelationshipSatisfaction Training
## Age 0.694851265 0.17114818
## JobInvolvement 0.028318043 -0.02776202
## JobSatisfaction 0.095797887 0.66740756
## MonthlyIncome 166.109382819 65.93273561
## PerformanceRating -0.009848485 0.13210314
## RelationshipSatisfaction 1.195326661 0.20570197
## Training 0.205701974 2.79332499
## WorkLifeBalance 0.013304142 0.18985126
## WorkLifeBalance
## Age 0.035476786
## JobInvolvement 0.002688351
## JobSatisfaction 0.456125243
## MonthlyIncome -21.153315263
## PerformanceRating 0.036930081
## RelationshipSatisfaction 0.013304142
## Training 0.189851265
## WorkLifeBalance 0.725603976
# One Sample t test
t.test(MonthlyIncome,mu=5000) # Ho: mu=5000
##
## One Sample t-test
##
## data: MonthlyIncome
## t = 11.503, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 5000
## 95 percent confidence interval:
## 6297.586 6831.264
## sample estimates:
## mean of x
## 6564.425
# First Model of Multiple Regression
Model_One <- lm (JobSatisfaction ~ PerformanceRating + Age + Training + JobInvolvement+ WorkLifeBalance + MonthlyIncome, data = CT)
summary(Model_One)
##
## Call:
## lm(formula = JobSatisfaction ~ PerformanceRating + Age + Training +
## JobInvolvement + WorkLifeBalance + MonthlyIncome, data = CT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57118 -0.48577 0.08718 0.58096 2.39958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.459e+00 2.490e-01 -5.858 6.06e-09 ***
## PerformanceRating 7.779e-01 6.642e-02 11.711 < 2e-16 ***
## Age -7.222e-04 2.932e-03 -0.246 0.8054
## Training 1.653e-01 1.438e-02 11.496 < 2e-16 ***
## JobInvolvement -4.122e-03 3.247e-02 -0.127 0.8990
## WorkLifeBalance 5.455e-01 2.772e-02 19.678 < 2e-16 ***
## MonthlyIncome -9.855e-06 5.710e-06 -1.726 0.0846 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8068 on 1193 degrees of freedom
## Multiple R-squared: 0.4225, Adjusted R-squared: 0.4196
## F-statistic: 145.5 on 6 and 1193 DF, p-value: < 2.2e-16
plot(Model_One)
termplot(Model_One)
Model_Two <- lm (JobSatisfaction ~ PerformanceRating + Training + WorkLifeBalance, data = CT)
summary(Model_Two)
##
## Call:
## lm(formula = JobSatisfaction ~ PerformanceRating + Training +
## WorkLifeBalance, data = CT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.52035 -0.48324 0.09547 0.60072 2.35184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.56960 0.21208 -7.401 2.54e-13 ***
## PerformanceRating 0.78073 0.06643 11.752 < 2e-16 ***
## Training 0.16492 0.01438 11.465 < 2e-16 ***
## WorkLifeBalance 0.54573 0.02774 19.673 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8073 on 1196 degrees of freedom
## Multiple R-squared: 0.4203, Adjusted R-squared: 0.4188
## F-statistic: 289 on 3 and 1196 DF, p-value: < 2.2e-16
plot(Model_Two)
termplot(Model_Two)
anova(Model_One,Model_Two)
## Analysis of Variance Table
##
## Model 1: JobSatisfaction ~ PerformanceRating + Age + Training + JobInvolvement +
## WorkLifeBalance + MonthlyIncome
## Model 2: JobSatisfaction ~ PerformanceRating + Training + WorkLifeBalance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1193 776.48
## 2 1196 779.49 -3 -3.0103 1.5417 0.202
This shows us that reducing Age, JobInlovment and MonthlyIncome didn’t really improve our model any better, so we can use either model to predit Job Satisfaction. (If you have more models, ANOVA analysis will show you which ones are the bad ones)
round(predict(Model_One, data.frame(Age = 60, MonthlyIncome = 5000, JobInvolvement = 4, PerformanceRating = 4, Training = 4, WorkLifeBalance =1), interval = 'confidence'),2)
## fit lwr upr
## 1 2.75 2.52 2.98
round(predict(Model_One, data.frame(Age = 25, MonthlyIncome = 8500, JobInvolvement = 1, PerformanceRating = 3, Training = 5, WorkLifeBalance =2), interval = 'confidence'),2)
## fit lwr upr
## 1 2.69 2.53 2.84
round(predict(Model_One, data.frame(Age = 35, MonthlyIncome = 2500, JobInvolvement = 2, PerformanceRating = 2, Training = 3, WorkLifeBalance =3), interval = 'confidence'),2)
## fit lwr upr
## 1 2.17 2 2.34
round(predict(Model_One, data.frame(Age = 33, MonthlyIncome = 4050, JobInvolvement = 4, PerformanceRating = 3, Training = 2, WorkLifeBalance =4), interval = 'confidence'),2)
## fit lwr upr
## 1 3.31 3.18 3.43
round(predict(Model_One, data.frame(Age = 50, MonthlyIncome = 2800, JobInvolvement = 3, PerformanceRating = 1, Training = 1, WorkLifeBalance =3), interval = 'confidence'),2)
## fit lwr upr
## 1 1.05 0.75 1.34
round(predict(Model_One, data.frame(Age = 45, MonthlyIncome = 7200, JobInvolvement = 3, PerformanceRating = 4, Training = 2, WorkLifeBalance =2), interval = 'confidence'),2)
## fit lwr upr
## 1 2.96 2.81 3.11
round(predict(Model_One, data.frame(Age = 60, MonthlyIncome = 9800, JobInvolvement = 2, PerformanceRating = 2, Training = 3, WorkLifeBalance =1), interval = 'confidence'),2)
## fit lwr upr
## 1 0.99 0.77 1.21
round(predict(Model_Two, data.frame(PerformanceRating = 4, Training = 4, WorkLifeBalance =1), interval = 'confidence'),2)
## fit lwr upr
## 1 2.76 2.6 2.92
round(predict(Model_Two, data.frame(PerformanceRating = 3, Training = 5, WorkLifeBalance =2), interval = 'confidence'),2)
## fit lwr upr
## 1 2.69 2.6 2.78
round(predict(Model_Two, data.frame(PerformanceRating = 2, Training = 3, WorkLifeBalance =3), interval = 'confidence'),2)
## fit lwr upr
## 1 2.12 1.97 2.28
round(predict(Model_Two, data.frame(PerformanceRating = 3, Training = 2, WorkLifeBalance =4), interval = 'confidence'),2)
## fit lwr upr
## 1 3.29 3.19 3.38
round(predict(Model_Two, data.frame(PerformanceRating = 1, Training = 1, WorkLifeBalance =3), interval = 'confidence'),2)
## fit lwr upr
## 1 1.01 0.73 1.29
round(predict(Model_Two, data.frame(PerformanceRating = 4, Training = 2, WorkLifeBalance =2), interval = 'confidence'),2)
## fit lwr upr
## 1 2.97 2.83 3.12
round(predict(Model_Two, data.frame(PerformanceRating = 2, Training = 3, WorkLifeBalance =1), interval = 'confidence'),2)
## fit lwr upr
## 1 1.03 0.86 1.21
We’ve put together some useful resources for learning R, if you are interested, please visit this shared folder on Google Drive: https://drive.google.com/drive/folders/1ujTdvpKxAlnysKi4gY4H3sQES4tcJQ0c?usp=sharing.
Yay! We are graduating! MAIOP 2019