To do this, I will first load in the data.
company_A <- read.csv("Acumen_Data_Analysis_Exercise.csv", header = TRUE)
Second, I will look at its structure using str().
str(company_A)
## 'data.frame': 19103 obs. of 9 variables:
## $ ï..Observation.Number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Quarter : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Employee.Id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex..Male.1. : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Race : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Age : int 27 28 28 28 29 29 29 29 30 30 ...
## $ Hospital.Visit.This.Quarter..1.Yes.: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Salary : chr "$36,907" "$37,907" "$38,907" "$39,907" ...
## $ Health.Score : num 3.7 5 4 2.3 2.1 1.5 4.7 2.3 2.8 2.8 ...
Most of the variables are integers. Several of them (Sex, Race, and Hospital Visit) look categorical and should probably be classed as factors, and Salary has been read in as character because of the dollar signs and commas. I won’t change these just yet, but I’m making a note of it.
Next, I’ll check whether the data are reasonable by looking for missing values and outliers. I’ll use max() and min() on the variables where outliers would be obvious.
max(company_A$Age)
## [1] 172
min(company_A$Age)
## [1] 7
Now, I’m not saying you can’t be 7 or 172 years old and working, but it is fairly likely that these values, along with any other extremes, were entered incorrectly. I’ll check for more age outliers using plot().
plot(company_A$Employee.Id, company_A$Age)
It looks like there are a few more outliers for age. I’ll now check Health Score and Salary.
max(company_A$Salary)
## [1] "$68,826"
min(company_A$Salary)
## [1] "$28,351"
max(company_A$Health.Score)
## [1] 10
min(company_A$Health.Score)
## [1] 0.6
It looks like Age and Health Score have outliers. Per the instructions, the highest possible Health Score is 6, so anything above that is not reasonable. Salary is stored as text, so its max() and min() are string comparisons and say little about salary outliers.
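To compare salaries as amounts rather than as text, the strings would first need to be converted to numbers. Here is a minimal sketch of that step. The vector salary_chr is a small made-up stand-in for company_A$Salary, and salary_num is a name introduced only for this illustration.

```r
# Toy stand-in for company_A$Salary (values are made up for illustration)
salary_chr <- c("$36,907", "$9,500", "$120,000", "$68,826")

# Strip the "$" and "," characters, then convert to numeric
salary_num <- as.numeric(gsub("[$,]", "", salary_chr))

# Smallest and largest salary as amounts, not as text
range(salary_num)
```

The same gsub() and as.numeric() step could be applied to the real column before looking for extremes.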
Now I’ll check whether there are missing values using colSums().
colSums(is.na(company_A))
## ï..Observation.Number Quarter
## 0 0
## Employee.Id Sex..Male.1.
## 0 71
## Race Age
## 2123 0
## Hospital.Visit.This.Quarter..1.Yes. Salary
## 0 0
## Health.Score
## 0
sum(is.na(company_A))
## [1] 2194
There are 2194 missing values in total: 71 in the Sex..Male.1. column and 2123 in the Race column.
To do this, I will run separate models for each quarter.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Drop rows with missing Sex or Race, then treat the coded columns as factors
company_A <- na.omit(company_A)
company_A$Race <- as.factor(company_A$Race)
company_A$Sex..Male.1. <- as.factor(company_A$Sex..Male.1.)
company_A$Hospital.Visit.This.Quarter..1.Yes. <- as.factor(company_A$Hospital.Visit.This.Quarter..1.Yes.)
# One data frame per quarter; filter() looks up Quarter inside company_A
qt1 <- company_A %>% filter(Quarter == 1)
qt2 <- company_A %>% filter(Quarter == 2)
qt3 <- company_A %>% filter(Quarter == 3)
qt4 <- company_A %>% filter(Quarter == 4)
qt5 <- company_A %>% filter(Quarter == 5)
qt6 <- company_A %>% filter(Quarter == 6)
qt7 <- company_A %>% filter(Quarter == 7)
qt8 <- company_A %>% filter(Quarter == 8)
qt9 <- company_A %>% filter(Quarter == 9)
qt10 <- company_A %>% filter(Quarter == 10)
qt11 <- company_A %>% filter(Quarter == 11)
qt12 <- company_A %>% filter(Quarter == 12)
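As an aside, the per-quarter subsets could also be built in one step with base R’s split(), which returns a named list with one data frame per quarter. This is only a sketch on made-up data; toy and by_quarter are hypothetical names, not objects from the analysis.

```r
# Toy stand-in for company_A (made-up values)
toy <- data.frame(
  Quarter      = c(1, 1, 2, 2, 3),
  Health.Score = c(3.7, 2.1, 5.0, 1.5, 4.0)
)

# One data frame per quarter, stored in a list named "1", "2", "3"
by_quarter <- split(toy, toy$Quarter)

# by_quarter[["2"]] plays the role that qt2 plays in the analysis
nrow(by_quarter[["2"]])
```

Keeping the subsets in a list makes it possible to loop over quarters later instead of repeating each call twelve times.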
qt_model1 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt1)
# Employees by Race in the First Quarter
ggplot(qt1, aes(Race)) +
  geom_bar()
qt_model2 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt2)
# Employees by Race in the Second Quarter
ggplot(qt2, aes(Race)) +
  geom_bar()
qt_model3 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt3)
# Employees by Race in the Third Quarter
ggplot(qt3, aes(Race)) +
  geom_bar()
qt_model4 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt4)
# Employees by Race in the Fourth Quarter
ggplot(qt4, aes(Race)) +
  geom_bar()
qt_model5 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt5)
# Employees by Race in the Fifth Quarter
ggplot(qt5, aes(Race)) +
  geom_bar()
qt_model6 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt6)
# Employees by Race in the Sixth Quarter
ggplot(qt6, aes(Race)) +
  geom_bar()
qt_model7 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt7)
# Employees by Race in the Seventh Quarter
ggplot(qt7, aes(Race)) +
  geom_bar()
qt_model8 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt8)
# Employees by Race in the Eighth Quarter
ggplot(qt8, aes(Race)) +
  geom_bar()
qt_model9 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt9)
# Employees by Race in the Ninth Quarter
ggplot(qt9, aes(Race)) +
  geom_bar()
qt_model10 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt10)
# Employees by Race in the Tenth Quarter
ggplot(qt10, aes(Race)) +
  geom_bar()
qt_model11 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt11)
# Employees by Race in the Eleventh Quarter
ggplot(qt11, aes(Race)) +
  geom_bar()
qt_model12 <- lm(Health.Score ~ Age + Race + Sex..Male.1., data = qt12)
# Employees by Race in the Twelfth Quarter
ggplot(qt12, aes(Race)) +
  geom_bar()
Across all quarters, most employees appear to be of Race 1. Next I’ll check age, sex, and hospital visits.
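The twelve bar charts could also be drawn as a single figure with one panel per quarter, which makes the quarters easier to compare side by side. The sketch below uses ggplot2’s facet_wrap() on a small made-up data frame; toy and p are hypothetical names, and only three quarters are simulated.

```r
library(ggplot2)

# Toy stand-in for company_A (made-up values, three quarters only)
toy <- data.frame(
  Quarter = rep(1:3, each = 4),
  Race    = factor(c(1, 1, 2, 3,  1, 1, 1, 2,  1, 2, 3, 3))
)

# One bar-chart panel per quarter instead of one plot call per quarter
p <- ggplot(toy, aes(Race)) +
  geom_bar() +
  facet_wrap(~ Quarter) +
  labs(title = "Employees by Race, by Quarter", y = "Employees")
p
```

With the real data, the same call on company_A would give twelve panels on a shared y axis.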
For the First Quarter:
ggplot(qt1, aes(Age))+
geom_bar()
ggplot(qt1, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt1, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Second:
ggplot(qt2, aes(Age))+
geom_bar()
ggplot(qt2, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt2, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Third:
ggplot(qt3, aes(Age))+
geom_bar()
ggplot(qt3, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt3, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Fourth:
ggplot(qt4, aes(Age))+
geom_bar()
ggplot(qt4, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt4, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Fifth:
ggplot(qt5, aes(Age))+
geom_bar()
ggplot(qt5, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt5, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Sixth:
ggplot(qt6, aes(Age))+
geom_bar()
ggplot(qt6, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt6, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Seventh:
ggplot(qt7, aes(Age))+
geom_bar()
ggplot(qt7, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt7, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Eighth:
ggplot(qt8, aes(Age))+
geom_bar()
ggplot(qt8, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt8, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Ninth:
ggplot(qt9, aes(Age))+
geom_bar()
ggplot(qt9, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt9, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Tenth:
ggplot(qt10, aes(Age))+
geom_bar()
ggplot(qt10, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt10, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Eleventh:
ggplot(qt11, aes(Age))+
geom_bar()
ggplot(qt11, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt11, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
Twelfth:
ggplot(qt12, aes(Age))+
geom_bar()
ggplot(qt12, aes(Sex..Male.1.))+
geom_bar()
ggplot(qt12, aes(Hospital.Visit.This.Quarter..1.Yes.))+
geom_bar()
By observation, across quarters most employees are of Race 1, are mostly male (shifting to mostly female in the later quarters), and are mostly in their mid-20s. The number of hospital visits increases by the last quarter.
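Readings taken from bar charts can be checked against numbers. A minimal sketch of a per-quarter summary follows, assuming the sex and hospital-visit columns are still coded 0/1 as numbers (that is, before the as.factor() conversion). The data frame toy and its shortened column names Male and Visit are made up for illustration.

```r
# Toy stand-in for company_A with 0/1 columns kept numeric (made-up values)
toy <- data.frame(
  Quarter = c(1, 1, 2, 2, 2),
  Male    = c(1, 0, 1, 1, 0),
  Age     = c(25, 31, 26, 40, 28),
  Visit   = c(0, 0, 1, 0, 0)
)

# The mean of a 0/1 column is a proportion, so this gives, per quarter:
# share of male employees, mean age, and share with a hospital visit
by_q <- aggregate(cbind(Male, Age, Visit) ~ Quarter, data = toy, FUN = mean)
by_q
```

A table like this would show directly whether the share of male employees or the hospital-visit rate moves across quarters.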
To do this, I will run a model for each quarter in which Health Score is the response variable and sex, race, age, and hospital visits are the explanatory variables.
For the first quarter:
model1<-lm(qt1$Health.Score~qt1$Sex..Male.1.+qt1$Race+qt1$Age+qt1$Hospital.Visit.This.Quarter..1.Yes.)
summary(model1)
##
## Call:
## lm(formula = qt1$Health.Score ~ qt1$Sex..Male.1. + qt1$Race +
## qt1$Age + qt1$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4271 -1.1700 -0.4795 0.4959 7.3892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.32256 0.38037 3.477 0.000544
## qt1$Sex..Male.1.1 0.26732 0.15957 1.675 0.094421
## qt1$Race2 -0.29185 0.18491 -1.578 0.115021
## qt1$Race3 -0.01444 0.22467 -0.064 0.948769
## qt1$Age 0.06870 0.01242 5.531 4.78e-08
## qt1$Hospital.Visit.This.Quarter..1.Yes.1 0.71168 0.30290 2.350 0.019122
##
## (Intercept) ***
## qt1$Sex..Male.1.1 .
## qt1$Race2
## qt1$Race3
## qt1$Age ***
## qt1$Hospital.Visit.This.Quarter..1.Yes.1 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.948 on 594 degrees of freedom
## Multiple R-squared: 0.06269, Adjusted R-squared: 0.0548
## F-statistic: 7.946 on 5 and 594 DF, p-value: 2.922e-07
Second:
model2<-lm(qt2$Health.Score~qt2$Sex..Male.1.+qt2$Race+qt2$Age+qt2$Hospital.Visit.This.Quarter..1.Yes.)
summary(model2)
##
## Call:
## lm(formula = qt2$Health.Score ~ qt2$Sex..Male.1. + qt2$Race +
## qt2$Age + qt2$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4476 -1.1972 -0.4763 0.5630 7.1651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.72001 0.34165 5.034 5.96e-07
## qt2$Sex..Male.1.1 0.45455 0.13632 3.335 0.000895
## qt2$Race2 -0.16840 0.15796 -1.066 0.286696
## qt2$Race3 -0.30775 0.19111 -1.610 0.107729
## qt2$Age 0.05472 0.01132 4.832 1.63e-06
## qt2$Hospital.Visit.This.Quarter..1.Yes.1 0.79373 0.24270 3.270 0.001121
##
## (Intercept) ***
## qt2$Sex..Male.1.1 ***
## qt2$Race2
## qt2$Race3
## qt2$Age ***
## qt2$Hospital.Visit.This.Quarter..1.Yes.1 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.905 on 778 degrees of freedom
## Multiple R-squared: 0.05809, Adjusted R-squared: 0.05204
## F-statistic: 9.596 on 5 and 778 DF, p-value: 6.697e-09
Third:
model3<-lm(qt3$Health.Score~qt3$Sex..Male.1.+qt3$Race+qt3$Age+qt3$Hospital.Visit.This.Quarter..1.Yes.)
summary(model3)
##
## Call:
## lm(formula = qt3$Health.Score ~ qt3$Sex..Male.1. + qt3$Race +
## qt3$Age + qt3$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5108 -1.1601 -0.4715 0.5064 7.0781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.20205 0.31669 6.953 6.42e-12
## qt3$Sex..Male.1.1 0.40871 0.12000 3.406 0.000685
## qt3$Race2 -0.10816 0.13719 -0.788 0.430683
## qt3$Race3 0.04669 0.17237 0.271 0.786538
## qt3$Age 0.03600 0.01048 3.434 0.000620
## qt3$Hospital.Visit.This.Quarter..1.Yes.1 0.43096 0.20857 2.066 0.039060
##
## (Intercept) ***
## qt3$Sex..Male.1.1 ***
## qt3$Race2
## qt3$Race3
## qt3$Age ***
## qt3$Hospital.Visit.This.Quarter..1.Yes.1 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.902 on 1004 degrees of freedom
## Multiple R-squared: 0.02657, Adjusted R-squared: 0.02173
## F-statistic: 5.482 on 5 and 1004 DF, p-value: 5.502e-05
Fourth:
model4<-lm(qt4$Health.Score~qt4$Sex..Male.1.+qt4$Race+qt4$Age+qt4$Hospital.Visit.This.Quarter..1.Yes.)
summary(model4)
##
## Call:
## lm(formula = qt4$Health.Score ~ qt4$Sex..Male.1. + qt4$Race +
## qt4$Age + qt4$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4909 -1.1346 -0.4534 0.4464 7.1420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.818195 0.276359 6.579 6.88e-11
## qt4$Sex..Male.1.1 0.444742 0.103321 4.304 1.80e-05
## qt4$Race2 -0.143200 0.117463 -1.219 0.2230
## qt4$Race3 -0.287713 0.151520 -1.899 0.0578
## qt4$Age 0.049290 0.009114 5.408 7.59e-08
## qt4$Hospital.Visit.This.Quarter..1.Yes.1 0.843352 0.170753 4.939 8.89e-07
##
## (Intercept) ***
## qt4$Sex..Male.1.1 ***
## qt4$Race2
## qt4$Race3 .
## qt4$Age ***
## qt4$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.851 on 1281 degrees of freedom
## Multiple R-squared: 0.05777, Adjusted R-squared: 0.05409
## F-statistic: 15.71 on 5 and 1281 DF, p-value: 4.944e-15
Fifth:
model5<-lm(qt5$Health.Score~qt5$Sex..Male.1.+qt5$Race+qt5$Age+qt5$Hospital.Visit.This.Quarter..1.Yes.)
summary(model5)
##
## Call:
## lm(formula = qt5$Health.Score ~ qt5$Sex..Male.1. + qt5$Race +
## qt5$Age + qt5$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7432 -1.1099 -0.4350 0.4134 7.0905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.024436 0.223236 9.069 < 2e-16
## qt5$Sex..Male.1.1 0.532961 0.095706 5.569 3.04e-08
## qt5$Race2 0.078398 0.109119 0.718 0.47258
## qt5$Race3 -0.322904 0.138730 -2.328 0.02007
## qt5$Age 0.038480 0.007097 5.422 6.87e-08
## qt5$Hospital.Visit.This.Quarter..1.Yes.1 0.619372 0.159760 3.877 0.00011
##
## (Intercept) ***
## qt5$Sex..Male.1.1 ***
## qt5$Race2
## qt5$Race3 *
## qt5$Age ***
## qt5$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.843 on 1479 degrees of freedom
## Multiple R-squared: 0.05269, Adjusted R-squared: 0.04949
## F-statistic: 16.45 on 5 and 1479 DF, p-value: 7.836e-16
Sixth:
model6<-lm(qt6$Health.Score~qt6$Sex..Male.1.+qt6$Race+qt6$Age+qt6$Hospital.Visit.This.Quarter..1.Yes.)
summary(model6)
##
## Call:
## lm(formula = qt6$Health.Score ~ qt6$Sex..Male.1. + qt6$Race +
## qt6$Age + qt6$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4295 -1.1511 -0.4506 0.5174 7.0724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.578877 0.221915 7.115 1.70e-12
## qt6$Sex..Male.1.1 0.436259 0.094196 4.631 3.93e-06
## qt6$Race2 -0.009635 0.107652 -0.090 0.9287
## qt6$Race3 -0.251302 0.136359 -1.843 0.0655
## qt6$Age 0.056197 0.006880 8.168 6.36e-16
## qt6$Hospital.Visit.This.Quarter..1.Yes.1 0.805841 0.154338 5.221 2.01e-07
##
## (Intercept) ***
## qt6$Sex..Male.1.1 ***
## qt6$Race2
## qt6$Race3 .
## qt6$Age ***
## qt6$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.869 on 1570 degrees of freedom
## Multiple R-squared: 0.06834, Adjusted R-squared: 0.06537
## F-statistic: 23.03 on 5 and 1570 DF, p-value: < 2.2e-16
Seventh:
model7<-lm(qt7$Health.Score~qt7$Sex..Male.1.+qt7$Race+qt7$Age+qt7$Hospital.Visit.This.Quarter..1.Yes.)
summary(model7)
##
## Call:
## lm(formula = qt7$Health.Score ~ qt7$Sex..Male.1. + qt7$Race +
## qt7$Age + qt7$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5312 -1.1344 -0.4216 0.5627 6.9224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.463281 0.213282 11.549 < 2e-16
## qt7$Sex..Male.1.1 0.299004 0.090896 3.290 0.00103
## qt7$Race2 -0.033734 0.103820 -0.325 0.74528
## qt7$Race3 -0.119032 0.130836 -0.910 0.36307
## qt7$Age 0.029335 0.006513 4.504 7.14e-06
## qt7$Hospital.Visit.This.Quarter..1.Yes.1 0.735199 0.150650 4.880 1.16e-06
##
## (Intercept) ***
## qt7$Sex..Male.1.1 **
## qt7$Race2
## qt7$Race3
## qt7$Age ***
## qt7$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.836 on 1633 degrees of freedom
## Multiple R-squared: 0.0336, Adjusted R-squared: 0.03064
## F-statistic: 11.35 on 5 and 1633 DF, p-value: 8.69e-11
Eighth:
model8<-lm(qt8$Health.Score~qt8$Sex..Male.1.+qt8$Race+qt8$Age+qt8$Hospital.Visit.This.Quarter..1.Yes.)
summary(model8)
##
## Call:
## lm(formula = qt8$Health.Score ~ qt8$Sex..Male.1. + qt8$Race +
## qt8$Age + qt8$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2620 -1.1425 -0.5202 0.4574 7.0693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.936734 0.217865 8.890 < 2e-16
## qt8$Sex..Male.1.1 0.411611 0.093008 4.426 1.02e-05
## qt8$Race2 -0.008874 0.106299 -0.083 0.933
## qt8$Race3 -0.211888 0.134054 -1.581 0.114
## qt8$Age 0.044662 0.006533 6.836 1.14e-11
## qt8$Hospital.Visit.This.Quarter..1.Yes.1 0.668579 0.154596 4.325 1.62e-05
##
## (Intercept) ***
## qt8$Sex..Male.1.1 ***
## qt8$Race2
## qt8$Race3
## qt8$Age ***
## qt8$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.898 on 1664 degrees of freedom
## Multiple R-squared: 0.04901, Adjusted R-squared: 0.04615
## F-statistic: 17.15 on 5 and 1664 DF, p-value: < 2.2e-16
Ninth:
model9<-lm(qt9$Health.Score~qt9$Sex..Male.1.+qt9$Race+qt9$Age+qt9$Hospital.Visit.This.Quarter..1.Yes.)
summary(model9)
##
## Call:
## lm(formula = qt9$Health.Score ~ qt9$Sex..Male.1. + qt9$Race +
## qt9$Age + qt9$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5720 -1.2288 -0.4834 0.4941 6.8264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.399441 0.224089 10.708 < 2e-16
## qt9$Sex..Male.1.1 0.337130 0.095869 3.517 0.000449
## qt9$Race2 -0.013128 0.109691 -0.120 0.904748
## qt9$Race3 -0.043340 0.138358 -0.313 0.754131
## qt9$Age 0.032257 0.006606 4.883 1.14e-06
## qt9$Hospital.Visit.This.Quarter..1.Yes.1 0.522906 0.146075 3.580 0.000354
##
## (Intercept) ***
## qt9$Sex..Male.1.1 ***
## qt9$Race2
## qt9$Race3
## qt9$Age ***
## qt9$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.972 on 1689 degrees of freedom
## Multiple R-squared: 0.02736, Adjusted R-squared: 0.02449
## F-statistic: 9.504 on 5 and 1689 DF, p-value: 5.941e-09
Tenth:
model10<-lm(qt10$Health.Score~qt10$Sex..Male.1.+qt10$Race+qt10$Age+qt10$Hospital.Visit.This.Quarter..1.Yes.)
summary(model10)
##
## Call:
## lm(formula = qt10$Health.Score ~ qt10$Sex..Male.1. + qt10$Race +
## qt10$Age + qt10$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4953 -1.1779 -0.4867 0.4682 7.0730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.896010 0.213842 8.866 < 2e-16
## qt10$Sex..Male.1.1 0.292131 0.092089 3.172 0.00154
## qt10$Race2 -0.246912 0.105475 -2.341 0.01935
## qt10$Race3 0.105706 0.133305 0.793 0.42791
## qt10$Age 0.047331 0.006213 7.618 4.23e-14
## qt10$Hospital.Visit.This.Quarter..1.Yes.1 0.847819 0.155999 5.435 6.28e-08
##
## (Intercept) ***
## qt10$Sex..Male.1.1 **
## qt10$Race2 *
## qt10$Race3
## qt10$Age ***
## qt10$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.905 on 1709 degrees of freedom
## Multiple R-squared: 0.05789, Adjusted R-squared: 0.05514
## F-statistic: 21 on 5 and 1709 DF, p-value: < 2.2e-16
Eleventh:
model11<-lm(qt11$Health.Score~qt11$Sex..Male.1.+qt11$Race+qt11$Age+qt11$Hospital.Visit.This.Quarter..1.Yes.)
summary(model11)
##
## Call:
## lm(formula = qt11$Health.Score ~ qt11$Sex..Male.1. + qt11$Race +
## qt11$Age + qt11$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3847 -1.1330 -0.3995 0.4783 6.9514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.103525 0.201395 10.445 < 2e-16
## qt11$Sex..Male.1.1 0.424476 0.087485 4.852 1.33e-06
## qt11$Race2 -0.104412 0.099828 -1.046 0.296
## qt11$Race3 -0.165749 0.126616 -1.309 0.191
## qt11$Age 0.038871 0.005779 6.726 2.37e-11
## qt11$Hospital.Visit.This.Quarter..1.Yes.1 1.101326 0.147221 7.481 1.17e-13
##
## (Intercept) ***
## qt11$Sex..Male.1.1 ***
## qt11$Race2
## qt11$Race3
## qt11$Age ***
## qt11$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.816 on 1723 degrees of freedom
## Multiple R-squared: 0.06951, Adjusted R-squared: 0.06681
## F-statistic: 25.74 on 5 and 1723 DF, p-value: < 2.2e-16
Twelfth:
model12<-lm(qt12$Health.Score~qt12$Sex..Male.1.+qt12$Race+qt12$Age+qt12$Hospital.Visit.This.Quarter..1.Yes.)
summary(model12)
##
## Call:
## lm(formula = qt12$Health.Score ~ qt12$Sex..Male.1. + qt12$Race +
## qt12$Age + qt12$Hospital.Visit.This.Quarter..1.Yes.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8911 -1.2972 -0.5224 0.4670 6.9522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.960318 0.223075 8.788 < 2e-16
## qt12$Sex..Male.1.1 0.255122 0.096510 2.643 0.00828
## qt12$Race2 -0.148343 0.110257 -1.345 0.17866
## qt12$Race3 -0.171447 0.139322 -1.231 0.21865
## qt12$Age 0.049432 0.006311 7.833 8.26e-15
## qt12$Hospital.Visit.This.Quarter..1.Yes.1 0.967204 0.119710 8.080 1.21e-15
##
## (Intercept) ***
## qt12$Sex..Male.1.1 **
## qt12$Race2
## qt12$Race3
## qt12$Age ***
## qt12$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.009 on 1731 degrees of freedom
## Multiple R-squared: 0.07384, Adjusted R-squared: 0.07117
## F-statistic: 27.6 on 5 and 1731 DF, p-value: < 2.2e-16
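Based on the p-values in the model summaries, the variables most consistently associated with Health Score are age, hospital visits, and sex. Age is significant at the 0.001 level in all twelve quarters, hospital visits are significant at the 0.05 level in all twelve, and sex is significant at the 0.05 level in every quarter except the first (p = 0.094). Race is significant at the 0.05 level only for Race 3 in the fifth quarter and Race 2 in the tenth. The R-squared values are low throughout (about 0.03 to 0.07), so these variables explain only a small share of the variation in Health Score.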
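Rather than reading the p-values off twelve separate summaries, they could be collected into one table. The sketch below fits the same kind of model per quarter on simulated data and pulls out the "Pr(>|t|)" column from each summary. The data, the shortened column name Male, and the objects fits and p_values are all made up for illustration.

```r
set.seed(1)

# Simulated stand-in for company_A: three quarters, 40 rows each
toy <- data.frame(
  Quarter = rep(1:3, each = 40),
  Age     = round(runif(120, 20, 60)),
  Male    = factor(rep(c(0, 1), times = 60))
)
toy$Health.Score <- 1.5 + 0.05 * toy$Age + rnorm(120)

# Fit one linear model per quarter
fits <- lapply(split(toy, toy$Quarter), function(d) {
  lm(Health.Score ~ Age + Male, data = d)
})

# One row per quarter, one column per model term
p_values <- t(sapply(fits, function(f) coef(summary(f))[, "Pr(>|t|)"]))
round(p_values, 4)
```

Scanning down a column of such a table shows at a glance in which quarters a term is significant.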
I’ll first use the variables most strongly associated with Health Score to evaluate whether employees are getting sicker over time. For the sake of this exercise, we will ignore Health Scores above 6.
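Ignoring the scores above 6 can be made explicit by filtering them out before fitting or plotting anything. The sketch below assumes the upper bound of 6 from the exercise instructions; toy, max_score, and valid are made-up names and values, not objects from the analysis.

```r
# Toy stand-in for company_A (made-up values)
toy <- data.frame(
  Employee.Id  = 1:5,
  Health.Score = c(3.7, 5.0, 10, 2.3, 0.6)
)

# Upper bound on Health Score, as stated in the exercise instructions
max_score <- 6

# Keep only rows whose Health Score is within the valid range
valid <- subset(toy, Health.Score <= max_score)

# Number of rows dropped by the filter
nrow(toy) - nrow(valid)
```

Applying the same subset() to the real data before the per-quarter split would keep the out-of-range scores out of every model and plot.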
First:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt1$Health.Score~qt1$Age)
##
## Call:
## lm(formula = qt1$Health.Score ~ qt1$Age)
##
## Coefficients:
## (Intercept) qt1$Age
## 1.48993 0.06641
ggplot(qt1, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt1, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt1, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Second:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt2$Health.Score~qt2$Age)
##
## Call:
## lm(formula = qt2$Health.Score ~ qt2$Age)
##
## Coefficients:
## (Intercept) qt2$Age
## 1.88267 0.05579
ggplot(qt2, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt2, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt2, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Third:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt3$Health.Score~qt3$Age)
##
## Call:
## lm(formula = qt3$Health.Score ~ qt3$Age)
##
## Coefficients:
## (Intercept) qt3$Age
## 2.43639 0.03551
ggplot(qt3, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt3, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt3, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Fourth:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt4$Health.Score~qt4$Age)
##
## Call:
## lm(formula = qt4$Health.Score ~ qt4$Age)
##
## Coefficients:
## (Intercept) qt4$Age
## 2.01842 0.05024
ggplot(qt4, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt4, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt4, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Fifth:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt5$Health.Score~qt5$Age)
##
## Call:
## lm(formula = qt5$Health.Score ~ qt5$Age)
##
## Coefficients:
## (Intercept) qt5$Age
## 2.34498 0.03806
ggplot(qt5, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt5, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt5, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Sixth:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt6$Health.Score~qt6$Age)
##
## Call:
## lm(formula = qt6$Health.Score ~ qt6$Age)
##
## Coefficients:
## (Intercept) qt6$Age
## 1.87905 0.05507
ggplot(qt6, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt6, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt6, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Seventh:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt7$Health.Score~qt7$Age)
##
## Call:
## lm(formula = qt7$Health.Score ~ qt7$Age)
##
## Coefficients:
## (Intercept) qt7$Age
## 2.67012 0.02913
ggplot(qt7, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt7, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt7, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Eighth:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt8$Health.Score~qt8$Age)
##
## Call:
## lm(formula = qt8$Health.Score ~ qt8$Age)
##
## Coefficients:
## (Intercept) qt8$Age
## 2.22071 0.04337
ggplot(qt8, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt8, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt8, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Ninth:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt9$Health.Score~qt9$Age)
##
## Call:
## lm(formula = qt9$Health.Score ~ qt9$Age)
##
## Coefficients:
## (Intercept) qt9$Age
## 2.65358 0.03136
ggplot(qt9, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt9, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt9, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Tenth:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt10$Health.Score~qt10$Age)
##
## Call:
## lm(formula = qt10$Health.Score ~ qt10$Age)
##
## Coefficients:
## (Intercept) qt10$Age
## 2.09652 0.04655
ggplot(qt10, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt10, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt10, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Eleventh:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt11$Health.Score~qt11$Age)
##
## Call:
## lm(formula = qt11$Health.Score ~ qt11$Age)
##
## Coefficients:
## (Intercept) qt11$Age
## 2.38942 0.03832
ggplot(qt11, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt11, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt11, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Twelfth:
#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt12$Health.Score~qt12$Age)
##
## Call:
## lm(formula = qt12$Health.Score ~ qt12$Age)
##
## Coefficients:
## (Intercept) qt12$Age
## 2.2090 0.0497
ggplot(qt12, aes(Employee.Id, Health.Score, color = Race))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt12, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(qt12, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Overall, it seems that the employees getting sicker are those of Race 1, those who are male, and those who are older (employee age and Health Score have a positive correlation). So it isn’t incorrect to say that employees are getting sicker, but it might be more worthwhile to hire more employees outside of Race 1, more female employees, employees who stay out of the hospital (though this might be a causation/correlation situation), and younger employees in an effort to decrease the health scores.
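The per-quarter models describe associations within each quarter, so on their own they do not show a change over time. One possible way to look at the time trend directly is a single pooled model with Quarter as a predictor. This is only a sketch on simulated data, where an upward drift of 0.1 per quarter is built in on purpose. It also treats every row as independent, even though the same employees appear in several quarters, which a fuller analysis would need to account for. The names toy and trend_fit are hypothetical.

```r
set.seed(42)

# Simulated stand-in for company_A: 12 quarters, 30 rows each
toy <- data.frame(
  Quarter = rep(1:12, each = 30),
  Age     = round(runif(360, 20, 60))
)

# Simulated scores with a built-in upward drift of 0.1 per quarter
toy$Health.Score <- 1.5 + 0.04 * toy$Age + 0.1 * toy$Quarter + rnorm(360, sd = 1)

# Pooled model: the Quarter coefficient estimates the change in
# Health Score per quarter, holding Age fixed
trend_fit <- lm(Health.Score ~ Quarter + Age, data = toy)
coef(summary(trend_fit))["Quarter", ]
```

On the real data, a positive and significant Quarter coefficient would support the claim that health scores are rising over time, and a coefficient near zero would not.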