#Import Data: HR_Onboarding_and_Performance
setwd("/cloud/project")
library(readxl)
HR_Onboarding_and_Performance <- read_excel("HR Onboarding and Performance.xlsx")
View(HR_Onboarding_and_Performance)
Install the tidyverse, psych, and readxl packages
install.packages("tidyverse")
Installing package into ‘/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'http://package-proxy/src/contrib/tidyverse_1.3.0.tar.gz'
Content type 'application/x-tar' length 434637 bytes (424 KB)
==================================================
downloaded 424 KB
* installing *binary* package ‘tidyverse’ ...
* DONE (tidyverse)
The downloaded source packages are in
‘/tmp/RtmpPDJ73z/downloaded_packages’
install.packages("psych")
Installing package into ‘/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'http://package-proxy/src/contrib/psych_1.9.12.31.tar.gz'
Content type 'application/x-tar' length 3809302 bytes (3.6 MB)
==================================================
downloaded 3.6 MB
* installing *binary* package ‘psych’ ...
* DONE (psych)
The downloaded source packages are in
‘/tmp/RtmpPDJ73z/downloaded_packages’
install.packages("readxl")
Error in install.packages : Updating loaded packages
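The "Updating loaded packages" message above appears because readxl was already loaded when install.packages("readxl") ran. One way to avoid this is to install only the packages that are missing. The following is a minimal sketch of that idea; `missing_packages` is a hypothetical helper written for illustration and is not part of any package.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "psych", "readxl")
to_install <- missing_packages(needed)

# Install only what is absent, and only in an interactive session,
# so packages that are already loaded are left untouched.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Because the check uses requireNamespace(), it also works for packages that are installed but not yet attached with library().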
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.2.1     ✓ purrr   0.3.3
✓ tibble  2.1.3     ✓ dplyr   0.8.4
✓ tidyr   1.0.2     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(psych)
Attaching package: ‘psych’
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
library(readxl)
Review the Data
view(HR_Onboarding_and_Performance)
#2.Describe the Data
# a.Rename the Variables
HR_Data<-HR_Onboarding_and_Performance
emails<-HR_Data$`Number of e-mails sent or received within the first 90 days`
sales<-HR_Data$`Sales within the first 90 days`
early_knowledge<-HR_Data$`Early employee knowledge attainment scores (out of 10)`
mentor_time<-HR_Data$`Time with mentor/new hire buddy (hours)`
social_events<-HR_Data$`Onboarding social events attended (out of 10)`
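The lines above copy each column into a separate vector. An alternative is to rename the columns inside the data frame itself, so the short names stay attached to the data. The sketch below shows this with base R on a small stand-in data frame; the values are made up, and only the column names are taken from the notebook.

```r
# Illustrative stand-in for HR_Data: the values are invented,
# only the column names come from the notebook.
hr <- data.frame(
  `Sales within the first 90 days` = c(3, 7, 5),
  `Time with mentor/new hire buddy (hours)` = c(2, 6, 4),
  check.names = FALSE
)

# Map each short name to the original column name.
short_names <- c(
  sales = "Sales within the first 90 days",
  mentor_time = "Time with mentor/new hire buddy (hours)"
)

# Replace the long names with the short ones, matching by original name.
names(hr)[match(short_names, names(hr))] <- names(short_names)

str(hr)
```

With renamed columns, later calls can use the `data` argument, for example lm(sales ~ mentor_time, data = hr), instead of free-standing vectors.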
#b.Create Visualizations
First, let’s review the data for sales. There are 30 observed employees. Average sales in the first 90 days are 5.73, with a standard deviation of 2.95. The median and mode are both 7. The range is 9 (min = 1, max = 10). The distribution is slightly left-skewed (skewness = -0.23).
HR_Data%>%ggplot(aes(sales))+geom_histogram()
describe(sales)
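To make the numbers reported by describe() less of a black box, here is a rough sketch that computes the same kinds of statistics with base R. The vector is made up and is not the notebook's data. The `skewness` helper is hypothetical; it uses the mean cubed deviation divided by the cubed sample standard deviation, which is assumed here to correspond to the default skew estimator in psych.

```r
# Made-up example scores; not the notebook's data.
x <- c(1, 2, 2, 3, 4, 7, 7, 7, 9, 10)

# Hypothetical helper: mean cubed deviation divided by sd^3.
# Negative values indicate a longer left tail, positive a longer right tail.
skewness <- function(v) {
  n <- length(v)
  sum((v - mean(v))^3) / (n * sd(v)^3)
}

summary_stats <- c(
  n = length(x),
  mean = mean(x),
  sd = sd(x),
  median = median(x),
  min = min(x),
  max = max(x),
  range = max(x) - min(x),
  skew = skewness(x)
)

round(summary_stats, 2)
```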
Now, let’s review the data for emails. The average number of emails is 38.3, with a standard deviation of 24.94. The minimum number of emails is 12, and the maximum is 80. The distribution is right-skewed (skewness = 0.52).
HR_Data%>%ggplot(aes(emails))+geom_histogram()
describe(emails)
The data for onboarding social events attended are bimodal. The mean is 5.2, and the standard deviation is 2.76. The exact minimum and maximum cannot be read clearly from the histogram.
HR_Data%>%ggplot(aes(social_events))+geom_histogram()
describe(social_events)
Early knowledge attainment scores have a mean of 4.67 and a standard deviation of 2.44. The range is 8 (min = 1, max = 9). The distribution is nearly symmetric, with a very slight positive skew (0.07).
HR_Data%>%ggplot(aes(early_knowledge))+geom_histogram()
describe(early_knowledge)
The time with a mentor/new hire buddy (in hours) has a mean of 4.5 and a standard deviation of 1.31. The minimum number of hours is 2, and the maximum is 6. The two modes are 4 and 6 hours.
HR_Data%>%ggplot(aes(mentor_time))+geom_histogram()
describe(mentor_time)
Multivariate Description
# a.How do they relate to each other?
The pairwise relationships between the variables are displayed in the pairs() scatterplot matrix below.
pairs(HR_Data)
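A scatterplot matrix shows the shape of each relationship, and a correlation matrix puts a number on its strength. The following is a minimal sketch using simulated data in place of HR_Data. The short variable names are reused from above, but every value, and the `id` column, is an assumption made for illustration.

```r
# Simulated stand-in for HR_Data; none of these values come from the notebook.
set.seed(1)
emails <- round(runif(30, 12, 80))
mentor_time <- sample(2:6, 30, replace = TRUE)
sales <- 0.09 * emails + mentor_time + rnorm(30)
toy <- data.frame(sales, emails, mentor_time, id = sprintf("E%02d", 1:30))

# Keep the numeric columns only, then compute pairwise Pearson correlations.
numeric_cols <- toy[vapply(toy, is.numeric, logical(1))]
cor_matrix <- cor(numeric_cols)

round(cor_matrix, 2)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 a weak one, which gives a quick check on the hypotheses before any models are fitted.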
Simple Linear Regressions
Hypothesis 1: Emails have a strong positive correlation with sales.
Do emails have a strong positive correlation with sales?
emails_and_sales<-lm(HR_Data$`Sales within the first 90 days`~HR_Data$`Number of e-mails sent or received within the first 90 days`)
summary(emails_and_sales)
Call:
lm(formula = HR_Data$`Sales within the first 90 days` ~ HR_Data$`Number of e-mails sent or received within the first 90 days`)
Residuals:
Min 1Q Median 3Q Max
-4.3726 -0.2976 -0.1322 1.0773 3.1467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.18625 0.63080 3.466 0.00172 **
HR_Data$`Number of e-mails sent or received within the first 90 days` 0.09261 0.01387 6.677 3.02e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.863 on 28 degrees of freedom
Multiple R-squared: 0.6142, Adjusted R-squared: 0.6004
F-statistic: 44.58 on 1 and 28 DF, p-value: 3.023e-07
What we learned: The model predicts 2.18625 sales within the first 90 days when no emails are sent or received (the intercept). Each additional email is associated with an increase of 0.09261 in predicted sales. The standard error is 0.63080 for the intercept and 0.01387 for the email slope. The slope is statistically significant (p < .001), and the intercept is significant at p < .01. The model explains 61.42% of the variance in sales within the first 90 days (adjusted R-squared: 60.04%). The residual standard error is 1.863, so predictions are typically off by about 1.9 sales.
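The figures in the summary() output are linked by simple arithmetic, which can be checked by hand. The sketch below uses only numbers copied from the output above (with n = 30 employees and one predictor) to reproduce the t value and the adjusted R-squared, and to build an approximate 95% confidence interval for the email slope.

```r
# Values copied from the summary() output above.
estimate <- 0.09261   # slope for e-mails
std_error <- 0.01387  # standard error of that slope
df_resid <- 28        # residual degrees of freedom
r_squared <- 0.6142   # multiple R-squared
n <- 30               # number of employees
p <- 1                # number of predictors

# The t value is the estimate divided by its standard error.
t_value <- estimate / std_error

# Approximate 95% confidence interval for the slope.
ci <- estimate + c(-1, 1) * qt(0.975, df_resid) * std_error

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(c(t = t_value, lower = ci[1], upper = ci[2], adj_r2 = adj_r_squared), 4)
```

The standard error therefore describes the uncertainty in a coefficient estimate, not an error in the emails or sales data themselves.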
Hypothesis 2: Social_events have a weak positive correlation with sales.
Do social events affect sales in the first 90 days?
#2. Simple Linear Regression
Social_events_and_sales<-lm(HR_Data$`Sales within the first 90 days`~HR_Data$`Onboarding social events attended (out of 10)`)
summary(Social_events_and_sales)
Call:
lm(formula = HR_Data$`Sales within the first 90 days` ~ HR_Data$`Onboarding social events attended (out of 10)`)
Residuals:
Min 1Q Median 3Q Max
-4.3225 -2.1757 0.7971 1.6775 4.3841
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4964 1.1726 5.540 6.35e-06 ***
HR_Data$`Onboarding social events attended (out of 10)` -0.1467 0.1999 -0.734 0.469
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.971 on 28 degrees of freedom
Multiple R-squared: 0.01888, Adjusted R-squared: -0.01616
F-statistic: 0.5387 on 1 and 28 DF, p-value: 0.4691
What we learned: With no onboarding social events attended, the model predicts sales of 6.4964. Each additional event attended is associated with a decrease of 0.1467 in predicted sales, but this slope is not statistically significant (p = 0.469, well above .05), so the data do not support the hypothesized positive relationship. The standard error is 1.1726 for the intercept and 0.1999 for the slope. The residual standard error is 2.971. The model explains only 1.89% of the variance in sales within the first 90 days (adjusted R-squared: -1.62%).
Hypothesis 3: Mentor_time has a strong and positive correlation with sales.
Does time spent with a mentor affect the number of sales in the first 90 days?
#2. Simple Linear Regression
mentor_time_and_sales<-lm(sales~mentor_time)
summary(mentor_time_and_sales)
Call:
lm(formula = sales ~ mentor_time)
Residuals:
Min 1Q Median 3Q Max
-3.3091 -2.0818 -0.4606 1.8424 4.9939
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5515 1.7154 0.322 0.75021
mentor_time 1.1515 0.3666 3.141 0.00395 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.579 on 28 degrees of freedom
Multiple R-squared: 0.2606, Adjusted R-squared: 0.2342
F-statistic: 9.869 on 1 and 28 DF, p-value: 0.003947
What we learned: With no time spent with a mentor, the model predicts sales within the first 90 days of 0.5515, although this intercept is not significantly different from zero (p = 0.750) and lies outside the observed range of 2 to 6 hours. Each additional hour of mentoring is associated with an increase of 1.1515 in predicted sales. The standard error is 1.7154 for the intercept and 0.3666 for the slope. The slope is statistically significant (p = 0.004). The model explains 26.06% of the variance in sales within the first 90 days (adjusted R-squared: 23.42%). The residual standard error is 2.579.
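To see what the fitted line implies in practice, the coefficients can be turned into predictions. This sketch plugs the intercept and slope reported above into the regression equation for the observed range of mentor time (2 to 6 hours). The choice to predict only inside that range is an assumption made here to avoid extrapolating.

```r
# Coefficients copied from the summary() output above.
intercept <- 0.5515
slope <- 1.1515

# Mentor time values within the observed range (hours).
hours <- 2:6

# Predicted sales = intercept + slope * hours.
predicted_sales <- intercept + slope * hours

data.frame(hours, predicted_sales)
```

Each extra hour raises the prediction by the slope, so the gap between consecutive rows is always 1.1515.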
Hypothesis 4: Early_knowledge has a strong and positive correlation with sales.
Does knowledge developed prior to hiring affect the sales in the first 90 days?
#2.Simple Linear Regression
early_knowledge_and_sales<-lm(sales~early_knowledge)
summary(early_knowledge_and_sales)
Call:
lm(formula = sales ~ early_knowledge)
Residuals:
Min 1Q Median 3Q Max
-4.6992 -1.0699 0.3886 2.3630 4.1154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6162 1.1100 3.258 0.00294 **
early_knowledge 0.4537 0.2115 2.145 0.04080 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.78 on 28 degrees of freedom
Multiple R-squared: 0.1411, Adjusted R-squared: 0.1104
F-statistic: 4.6 on 1 and 28 DF, p-value: 0.0408
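What we learned: With an early knowledge attainment score of zero, the model predicts sales within the first 90 days of 3.6162. Each additional point on the early employee knowledge attainment score is associated with an increase of 0.4537 in predicted sales, which is statistically significant at the .05 level (p = 0.041). This model explains 14.11% of the variance in sales within the first 90 days (adjusted R-squared: 11.04%). The residual standard error is 2.78.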
Full Linear Model
To review how all four predictors (emails, social events, mentor time, and early knowledge) affect sales when combined, the following full linear model was created.
summary(lm(sales~emails+social_events+mentor_time+early_knowledge))
Call:
lm(formula = sales ~ emails + social_events + mentor_time + early_knowledge)
Residuals:
Min 1Q Median 3Q Max
-4.3003 -0.8707 0.3104 0.7461 2.7616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.43533 1.39185 -1.750 0.0924 .
emails 0.09275 0.01167 7.945 2.67e-08 ***
social_events 0.07644 0.13195 0.579 0.5676
mentor_time 0.99801 0.20464 4.877 5.12e-05 ***
early_knowledge -0.05829 0.16494 -0.353 0.7268
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.373 on 25 degrees of freedom
Multiple R-squared: 0.813, Adjusted R-squared: 0.7831
F-statistic: 27.17 on 4 and 25 DF, p-value: 8.831e-09
What we learned: When emails, onboarding social events, mentor time, and early employee knowledge attainment scores are combined in one model, the overall model is statistically significant (F = 27.17 on 4 and 25 DF, p < .001). Emails (p < .001) and mentor time (p < .001) remain significant predictors, while social events (p = 0.568) and early knowledge (p = 0.727) do not. The estimates differ from those in the simple linear regression models above, most notably for early knowledge, which suggests that the predictors are correlated with one another. The model explains 81.3% of the variance in sales in the first 90 days (adjusted R-squared: 78.31%).
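Picking out the significant predictors by eye from a long summary() printout is error-prone, so it can help to pull the coefficient table out as a matrix. The sketch below does this on simulated data, not the notebook's data. By construction only emails and mentor_time influence the simulated sales, and the 0.05 cut-off is an assumption.

```r
# Simulated stand-in for the HR data; every value here is invented.
set.seed(42)
n <- 30
toy <- data.frame(
  emails = round(runif(n, 12, 80)),
  social_events = sample(0:10, n, replace = TRUE),
  mentor_time = sample(2:6, n, replace = TRUE),
  early_knowledge = sample(1:9, n, replace = TRUE)
)

# By construction, only emails and mentor_time drive the simulated sales.
toy$sales <- 0.09 * toy$emails + 1.0 * toy$mentor_time + rnorm(n, sd = 1)

fit <- lm(sales ~ emails + social_events + mentor_time + early_knowledge,
          data = toy)

# Coefficient table: estimate, standard error, t value, and p-value per term.
coef_table <- coef(summary(fit))

# Names of the terms whose p-value falls below the chosen cut-off.
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
significant

# Adjusted R-squared of the full model.
summary(fit)$adj.r.squared
```

Passing the data frame through the `data` argument also keeps the Call line in the output short and readable.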