Acumen_Data_Analysis_Exercise_Jaffe

Question One

a) Are all the values in the data reasonable? Are there missing values?

To do this, I will first load in the data.

company_A <- read.csv("Acumen_Data_Analysis_Exercise.csv", header = TRUE)

Second, I will look at its structure using str().

str(company_A)

## 'data.frame':    19103 obs. of  9 variables:
##  $ ï..Observation.Number              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Quarter                            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Employee.Id                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex..Male.1.                       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Race                               : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Age                                : int  27 28 28 28 29 29 29 29 30 30 ...
##  $ Hospital.Visit.This.Quarter..1.Yes.: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Salary                             : chr  "$36,907" "$37,907" "$38,907" "$39,907" ...
##  $ Health.Score                       : num  3.7 5 4 2.3 2.1 1.5 4.7 2.3 2.8 2.8 ...

Most of the variables are integers. Some of them (Sex, Race, and Hospital Visit) look categorical, so they should probably be classed as factors. I won’t change these just yet, but I’m making a note of it.
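The `ï..` prefix on the first column name in the str() output is most likely a byte-order mark (BOM) at the start of the CSV file being read as part of the header. Here is a minimal sketch of one way to avoid it, assuming the file is UTF-8 with a BOM. The temporary file and its two columns are made up for illustration.

```r
# Build a tiny CSV that starts with a UTF-8 byte-order mark (illustrative data).
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("Observation Number,Quarter\n1,1\n2,2\n")), tmp)

# fileEncoding = "UTF-8-BOM" asks R to drop the BOM before parsing the header.
demo <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
names(demo)
```

If the real file does carry a BOM, the same fileEncoding argument in the read.csv() call above should give a clean Observation.Number column name.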

I’m now going to check whether the data are reasonable by looking for missing values and outliers. I will use max() and min() on the variables where outliers would be obvious.

max(company_A$Age)

## [1] 172

min(company_A$Age)

## [1] 7

Now, I’m not saying you can’t be 7 or 172 years old and be working, but it is very likely that these values, along with any other extremes, were entered incorrectly. I’ll check for more age outliers using plot().

plot(company_A$Employee.Id, company_A$Age)

It looks like there are a few more outliers for age. I’ll now check Health Score and Salary.

max(company_A$Salary)

## [1] "$68,826"

min(company_A$Salary)

## [1] "$28,351"
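A numeric check on Salary needs the "$" and "," stripped first. Here is a minimal sketch with made-up salary strings (the vector salary_chr is illustrative, not taken from the data):

```r
# Illustrative salary strings in the same "$36,907" format as the data.
salary_chr <- c("$36,907", "$9,500", "$68,826", "$120,000")

# On character data, max() sorts the strings, so it need not be the largest amount.
max(salary_chr)

# Strip "$" and "," and convert to numeric before looking for extremes.
salary_num <- as.numeric(gsub("[$,]", "", salary_chr))
range(salary_num)
```

The same gsub() and as.numeric() step could be applied to company_A$Salary before calling max() and min().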

max(company_A$Health.Score)

## [1] 10

min(company_A$Health.Score)

## [1] 0.6

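It looks like Age and Health Score have outliers. For Health Score, as per the instructions, the highest possible value is 6, so anything above that is not reasonable. Salary is stored as text, so the max() and min() results above are alphabetical comparisons of strings rather than numeric extremes, and they do not yet rule out salary outliers.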
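To count how many rows are affected rather than only seeing the extremes, the out-of-range values can be flagged. This is a minimal sketch on toy data. The age limits are my own assumption, and only the Health Score maximum of 6 comes from the exercise instructions.

```r
# Toy data standing in for company_A (values are made up).
demo <- data.frame(
  Age          = c(27, 7, 45, 172, 33),
  Health.Score = c(3.7, 5.0, 10.0, 2.3, 0.6)
)

# Assumed plausible working ages; adjust to whatever the company allows.
age_min <- 16
age_max <- 80
# Upper bound on Health Score taken from the exercise instructions.
score_max <- 6

demo$age_flag   <- demo$Age < age_min | demo$Age > age_max
demo$score_flag <- demo$Health.Score > score_max

# Number of flagged rows per check.
colSums(demo[, c("age_flag", "score_flag")])
```

Flagging instead of deleting keeps the option open to correct the values later or to rerun the analysis with and without them.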

Now I’ll check for missing values using colSums().

colSums(is.na(company_A))

##               ï..Observation.Number                             Quarter 
##                                   0                                   0 
##                         Employee.Id                        Sex..Male.1. 
##                                   0                                  71 
##                                Race                                 Age 
##                                2123                                   0 
## Hospital.Visit.This.Quarter..1.Yes.                              Salary 
##                                   0                                   0 
##                        Health.Score 
##                                   0

sum(is.na(company_A))

## [1] 2194

It looks like there are 2194 missing values: 71 in the Sex..Male.1. column and 2123 in the Race column.
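Because rows with missing values are dropped later with na.omit(), it may be worth checking whether the missing values are spread evenly or concentrated in particular quarters. This is a minimal sketch on toy data, with made-up column values:

```r
# Toy data standing in for company_A (values are made up).
demo <- data.frame(
  Quarter = c(1, 1, 2, 2, 3, 3),
  Race    = c(1, NA, 2, 2, NA, NA),
  Sex     = c(0, 1, NA, 1, 0, 0)
)

# Share of missing values in each column.
colMeans(is.na(demo))

# Share of missing Race values within each quarter.
race_missing_by_quarter <- tapply(is.na(demo$Race), demo$Quarter, mean)
race_missing_by_quarter
```

If the shares differ a lot between quarters, dropping those rows could change the quarter-to-quarter comparisons.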

b) What are the characteristics of employees at Company A? Do these demographics change over time?

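To do this, I will first drop the rows with missing values and convert the categorical variables to factors. Then I will split the data by quarter, fit a separate model for each quarter, and plot the demographics.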

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.8     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.3

## Warning: package 'tidyr' was built under R version 4.1.2

## Warning: package 'readr' was built under R version 4.1.2

## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Drop rows with missing values and treat the categorical variables as factors
company_A <- na.omit(company_A)
company_A$Race <- as.factor(company_A$Race)
company_A$Sex..Male.1. <- as.factor(company_A$Sex..Male.1.)
company_A$Hospital.Visit.This.Quarter..1.Yes. <- as.factor(company_A$Hospital.Visit.This.Quarter..1.Yes.)

# One data frame per quarter
qt1 <- company_A %>%
  filter(Quarter == 1)

qt2 <- company_A %>%
  filter(Quarter == 2)

qt3 <- company_A %>%
  filter(Quarter == 3)

qt4 <- company_A %>%
  filter(Quarter == 4)

qt5 <- company_A %>%
  filter(Quarter == 5)

qt6 <- company_A %>%
  filter(Quarter == 6)

qt7 <- company_A %>%
  filter(Quarter == 7)

qt8 <- company_A %>%
  filter(Quarter == 8)

qt9 <- company_A %>%
  filter(Quarter == 9)

qt10 <- company_A %>%
  filter(Quarter == 10)

qt11 <- company_A %>%
  filter(Quarter == 11)

qt12 <- company_A %>%
  filter(Quarter == 12)
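The twelve per-quarter data frames could also be held in one list, which makes it easier to apply the same step to every quarter. This is a minimal sketch using base R split() on toy data. The name by_quarter is my own and is not used elsewhere in this write-up.

```r
# Toy data standing in for company_A (values are made up).
demo <- data.frame(
  Quarter      = rep(1:3, each = 2),
  Health.Score = c(3.7, 5.0, 4.0, 2.3, 2.1, 1.5)
)

# One data frame per quarter, stored in a list named by quarter.
by_quarter <- split(demo, demo$Quarter)

# by_quarter[["2"]] plays the role of qt2.
nrow(by_quarter[["2"]])

# Apply the same summary to every quarter at once.
sapply(by_quarter, function(d) mean(d$Health.Score))
```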


qt_model1<-lm(qt1$Health.Score~qt1$Age+qt1$Race+qt1$Sex..Male.1.)
# Employees by Race in the First Quarter
ggplot(qt1, aes(Race))+
  geom_bar()

qt_model2<-lm(qt2$Health.Score~qt2$Age+qt2$Race+qt2$Sex..Male.1.)
# Employees by Race in the Second Quarter
ggplot(qt2, aes(Race))+
  geom_bar()

qt_model3<-lm(qt3$Health.Score~qt3$Age+qt3$Race+qt3$Sex..Male.1.)
# Employees by Race in the Third Quarter
ggplot(qt3, aes(Race))+
  geom_bar()

qt_model4<-lm(qt4$Health.Score~qt4$Age+qt4$Race+qt4$Sex..Male.1.)
# Employees by Race in the Fourth Quarter
ggplot(qt4, aes(Race))+
  geom_bar()

qt_model5<-lm(qt5$Health.Score~qt5$Age+qt5$Race+qt5$Sex..Male.1.)
# Employees by Race in the Fifth Quarter
ggplot(qt5, aes(Race))+
  geom_bar()

qt_model6<-lm(qt6$Health.Score~qt6$Age+qt6$Race+qt6$Sex..Male.1.)
# Employees by Race in the Sixth Quarter
ggplot(qt6, aes(Race))+
  geom_bar()

qt_model7<-lm(qt7$Health.Score~qt7$Age+qt7$Race+qt7$Sex..Male.1.)
# Employees by Race in the Seventh Quarter
ggplot(qt7, aes(Race))+
  geom_bar()

qt_model8<-lm(qt8$Health.Score~qt8$Age+qt8$Race+qt8$Sex..Male.1.)
# Employees by Race in the Eighth Quarter
ggplot(qt8, aes(Race))+
  geom_bar()

qt_model9<-lm(qt9$Health.Score~qt9$Age+qt9$Race+qt9$Sex..Male.1.)
# Employees by Race in the Ninth Quarter
ggplot(qt9, aes(Race))+
  geom_bar()

qt_model10<-lm(qt10$Health.Score~qt10$Age+qt10$Race+qt10$Sex..Male.1.)
# Employees by Race in the Tenth Quarter
ggplot(qt10, aes(Race))+
  geom_bar()

qt_model11<-lm(qt11$Health.Score~qt11$Age+qt11$Race+qt11$Sex..Male.1.)
# Employees by Race in the Eleventh Quarter
ggplot(qt11, aes(Race))+
  geom_bar()

qt_model12<-lm(qt12$Health.Score~qt12$Age+qt12$Race+qt12$Sex..Male.1.)
# Employees by Race in the Twelfth Quarter
ggplot(qt12, aes(Race))+
  geom_bar()

It seems that most employees over time are of Race 1. I’ll check next for age, sex, and hospital visits.

For the First Quarter:

ggplot(qt1, aes(Age))+
  geom_bar()

ggplot(qt1, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt1, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Second:

ggplot(qt2, aes(Age))+
  geom_bar()

ggplot(qt2, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt2, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Third:

ggplot(qt3, aes(Age))+
  geom_bar()

ggplot(qt3, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt3, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Fourth:

ggplot(qt4, aes(Age))+
  geom_bar()

ggplot(qt4, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt4, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Fifth:

ggplot(qt5, aes(Age))+
  geom_bar()

ggplot(qt5, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt5, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Sixth:

ggplot(qt6, aes(Age))+
  geom_bar()

ggplot(qt6, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt6, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Seventh:

ggplot(qt7, aes(Age))+
  geom_bar()

ggplot(qt7, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt7, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Eighth:

ggplot(qt8, aes(Age))+
  geom_bar()

ggplot(qt8, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt8, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Ninth:

ggplot(qt9, aes(Age))+
  geom_bar()

ggplot(qt9, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt9, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Tenth:

ggplot(qt10, aes(Age))+
  geom_bar()

ggplot(qt10, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt10, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Eleventh:

ggplot(qt11, aes(Age))+
  geom_bar()

ggplot(qt11, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt11, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

Twelfth:

ggplot(qt12, aes(Age))+
  geom_bar()

ggplot(qt12, aes(Sex..Male.1.))+
  geom_bar()

ggplot(qt12, aes(Hospital.Visit.This.Quarter..1.Yes.))+
  geom_bar()

So, by observation, most employees are of Race 1 in every quarter. Employees are mostly male in the early quarters, changing to mostly female in the later quarters. Most are around their mid-20s, and the number of hospital visits has increased by the last quarter.
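The plots can be backed up with one table of averages per quarter. This is a minimal sketch on toy data, assuming the 0/1 indicators are still numeric (that is, before the as.factor() conversion). The short column names Male and Hospital.Visit are illustrative stand-ins for the longer names in the data.

```r
# Toy data standing in for company_A (values are made up).
demo <- data.frame(
  Quarter        = c(1, 1, 1, 2, 2, 2),
  Age            = c(25, 27, 30, 26, 28, 40),
  Male           = c(1, 1, 0, 0, 0, 1),
  Hospital.Visit = c(0, 0, 1, 1, 1, 0)
)

# Mean age, share male, and share with a hospital visit, per quarter.
summary_by_quarter <- aggregate(
  cbind(Age, Male, Hospital.Visit) ~ Quarter,
  data = demo,
  FUN  = mean
)
summary_by_quarter
```

Reading down a column of this table shows directly whether a demographic share is drifting over the quarters.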

Question Two

a) Which characteristics are associated with the health score?

To do this, I will fit a model for each quarter with the health score as the response variable and sex, race, age, and hospital visits as explanatory variables.

For the first quarter:

model1<-lm(qt1$Health.Score~qt1$Sex..Male.1.+qt1$Race+qt1$Age+qt1$Hospital.Visit.This.Quarter..1.Yes.)
summary(model1)

## 
## Call:
## lm(formula = qt1$Health.Score ~ qt1$Sex..Male.1. + qt1$Race + 
##     qt1$Age + qt1$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4271 -1.1700 -0.4795  0.4959  7.3892 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               1.32256    0.38037   3.477 0.000544
## qt1$Sex..Male.1.1                         0.26732    0.15957   1.675 0.094421
## qt1$Race2                                -0.29185    0.18491  -1.578 0.115021
## qt1$Race3                                -0.01444    0.22467  -0.064 0.948769
## qt1$Age                                   0.06870    0.01242   5.531 4.78e-08
## qt1$Hospital.Visit.This.Quarter..1.Yes.1  0.71168    0.30290   2.350 0.019122
##                                             
## (Intercept)                              ***
## qt1$Sex..Male.1.1                        .  
## qt1$Race2                                   
## qt1$Race3                                   
## qt1$Age                                  ***
## qt1$Hospital.Visit.This.Quarter..1.Yes.1 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.948 on 594 degrees of freedom
## Multiple R-squared:  0.06269,    Adjusted R-squared:  0.0548 
## F-statistic: 7.946 on 5 and 594 DF,  p-value: 2.922e-07

Second:

model2<-lm(qt2$Health.Score~qt2$Sex..Male.1.+qt2$Race+qt2$Age+qt2$Hospital.Visit.This.Quarter..1.Yes.)
summary(model2)

## 
## Call:
## lm(formula = qt2$Health.Score ~ qt2$Sex..Male.1. + qt2$Race + 
##     qt2$Age + qt2$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4476 -1.1972 -0.4763  0.5630  7.1651 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               1.72001    0.34165   5.034 5.96e-07
## qt2$Sex..Male.1.1                         0.45455    0.13632   3.335 0.000895
## qt2$Race2                                -0.16840    0.15796  -1.066 0.286696
## qt2$Race3                                -0.30775    0.19111  -1.610 0.107729
## qt2$Age                                   0.05472    0.01132   4.832 1.63e-06
## qt2$Hospital.Visit.This.Quarter..1.Yes.1  0.79373    0.24270   3.270 0.001121
##                                             
## (Intercept)                              ***
## qt2$Sex..Male.1.1                        ***
## qt2$Race2                                   
## qt2$Race3                                   
## qt2$Age                                  ***
## qt2$Hospital.Visit.This.Quarter..1.Yes.1 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.905 on 778 degrees of freedom
## Multiple R-squared:  0.05809,    Adjusted R-squared:  0.05204 
## F-statistic: 9.596 on 5 and 778 DF,  p-value: 6.697e-09

Third:

model3<-lm(qt3$Health.Score~qt3$Sex..Male.1.+qt3$Race+qt3$Age+qt3$Hospital.Visit.This.Quarter..1.Yes.)
summary(model3)

## 
## Call:
## lm(formula = qt3$Health.Score ~ qt3$Sex..Male.1. + qt3$Race + 
##     qt3$Age + qt3$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5108 -1.1601 -0.4715  0.5064  7.0781 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               2.20205    0.31669   6.953 6.42e-12
## qt3$Sex..Male.1.1                         0.40871    0.12000   3.406 0.000685
## qt3$Race2                                -0.10816    0.13719  -0.788 0.430683
## qt3$Race3                                 0.04669    0.17237   0.271 0.786538
## qt3$Age                                   0.03600    0.01048   3.434 0.000620
## qt3$Hospital.Visit.This.Quarter..1.Yes.1  0.43096    0.20857   2.066 0.039060
##                                             
## (Intercept)                              ***
## qt3$Sex..Male.1.1                        ***
## qt3$Race2                                   
## qt3$Race3                                   
## qt3$Age                                  ***
## qt3$Hospital.Visit.This.Quarter..1.Yes.1 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.902 on 1004 degrees of freedom
## Multiple R-squared:  0.02657,    Adjusted R-squared:  0.02173 
## F-statistic: 5.482 on 5 and 1004 DF,  p-value: 5.502e-05

Fourth:

model4<-lm(qt4$Health.Score~qt4$Sex..Male.1.+qt4$Race+qt4$Age+qt4$Hospital.Visit.This.Quarter..1.Yes.)
summary(model4)

## 
## Call:
## lm(formula = qt4$Health.Score ~ qt4$Sex..Male.1. + qt4$Race + 
##     qt4$Age + qt4$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4909 -1.1346 -0.4534  0.4464  7.1420 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               1.818195   0.276359   6.579 6.88e-11
## qt4$Sex..Male.1.1                         0.444742   0.103321   4.304 1.80e-05
## qt4$Race2                                -0.143200   0.117463  -1.219   0.2230
## qt4$Race3                                -0.287713   0.151520  -1.899   0.0578
## qt4$Age                                   0.049290   0.009114   5.408 7.59e-08
## qt4$Hospital.Visit.This.Quarter..1.Yes.1  0.843352   0.170753   4.939 8.89e-07
##                                             
## (Intercept)                              ***
## qt4$Sex..Male.1.1                        ***
## qt4$Race2                                   
## qt4$Race3                                .  
## qt4$Age                                  ***
## qt4$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.851 on 1281 degrees of freedom
## Multiple R-squared:  0.05777,    Adjusted R-squared:  0.05409 
## F-statistic: 15.71 on 5 and 1281 DF,  p-value: 4.944e-15

Fifth:

model5<-lm(qt5$Health.Score~qt5$Sex..Male.1.+qt5$Race+qt5$Age+qt5$Hospital.Visit.This.Quarter..1.Yes.)
summary(model5)

## 
## Call:
## lm(formula = qt5$Health.Score ~ qt5$Sex..Male.1. + qt5$Race + 
##     qt5$Age + qt5$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7432 -1.1099 -0.4350  0.4134  7.0905 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               2.024436   0.223236   9.069  < 2e-16
## qt5$Sex..Male.1.1                         0.532961   0.095706   5.569 3.04e-08
## qt5$Race2                                 0.078398   0.109119   0.718  0.47258
## qt5$Race3                                -0.322904   0.138730  -2.328  0.02007
## qt5$Age                                   0.038480   0.007097   5.422 6.87e-08
## qt5$Hospital.Visit.This.Quarter..1.Yes.1  0.619372   0.159760   3.877  0.00011
##                                             
## (Intercept)                              ***
## qt5$Sex..Male.1.1                        ***
## qt5$Race2                                   
## qt5$Race3                                *  
## qt5$Age                                  ***
## qt5$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.843 on 1479 degrees of freedom
## Multiple R-squared:  0.05269,    Adjusted R-squared:  0.04949 
## F-statistic: 16.45 on 5 and 1479 DF,  p-value: 7.836e-16

Sixth:

model6<-lm(qt6$Health.Score~qt6$Sex..Male.1.+qt6$Race+qt6$Age+qt6$Hospital.Visit.This.Quarter..1.Yes.)
summary(model6)

## 
## Call:
## lm(formula = qt6$Health.Score ~ qt6$Sex..Male.1. + qt6$Race + 
##     qt6$Age + qt6$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4295 -1.1511 -0.4506  0.5174  7.0724 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               1.578877   0.221915   7.115 1.70e-12
## qt6$Sex..Male.1.1                         0.436259   0.094196   4.631 3.93e-06
## qt6$Race2                                -0.009635   0.107652  -0.090   0.9287
## qt6$Race3                                -0.251302   0.136359  -1.843   0.0655
## qt6$Age                                   0.056197   0.006880   8.168 6.36e-16
## qt6$Hospital.Visit.This.Quarter..1.Yes.1  0.805841   0.154338   5.221 2.01e-07
##                                             
## (Intercept)                              ***
## qt6$Sex..Male.1.1                        ***
## qt6$Race2                                   
## qt6$Race3                                .  
## qt6$Age                                  ***
## qt6$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.869 on 1570 degrees of freedom
## Multiple R-squared:  0.06834,    Adjusted R-squared:  0.06537 
## F-statistic: 23.03 on 5 and 1570 DF,  p-value: < 2.2e-16

Seventh:

model7<-lm(qt7$Health.Score~qt7$Sex..Male.1.+qt7$Race+qt7$Age+qt7$Hospital.Visit.This.Quarter..1.Yes.)
summary(model7)

## 
## Call:
## lm(formula = qt7$Health.Score ~ qt7$Sex..Male.1. + qt7$Race + 
##     qt7$Age + qt7$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5312 -1.1344 -0.4216  0.5627  6.9224 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               2.463281   0.213282  11.549  < 2e-16
## qt7$Sex..Male.1.1                         0.299004   0.090896   3.290  0.00103
## qt7$Race2                                -0.033734   0.103820  -0.325  0.74528
## qt7$Race3                                -0.119032   0.130836  -0.910  0.36307
## qt7$Age                                   0.029335   0.006513   4.504 7.14e-06
## qt7$Hospital.Visit.This.Quarter..1.Yes.1  0.735199   0.150650   4.880 1.16e-06
##                                             
## (Intercept)                              ***
## qt7$Sex..Male.1.1                        ** 
## qt7$Race2                                   
## qt7$Race3                                   
## qt7$Age                                  ***
## qt7$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.836 on 1633 degrees of freedom
## Multiple R-squared:  0.0336, Adjusted R-squared:  0.03064 
## F-statistic: 11.35 on 5 and 1633 DF,  p-value: 8.69e-11

Eighth:

model8<-lm(qt8$Health.Score~qt8$Sex..Male.1.+qt8$Race+qt8$Age+qt8$Hospital.Visit.This.Quarter..1.Yes.)
summary(model8)

## 
## Call:
## lm(formula = qt8$Health.Score ~ qt8$Sex..Male.1. + qt8$Race + 
##     qt8$Age + qt8$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2620 -1.1425 -0.5202  0.4574  7.0693 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               1.936734   0.217865   8.890  < 2e-16
## qt8$Sex..Male.1.1                         0.411611   0.093008   4.426 1.02e-05
## qt8$Race2                                -0.008874   0.106299  -0.083    0.933
## qt8$Race3                                -0.211888   0.134054  -1.581    0.114
## qt8$Age                                   0.044662   0.006533   6.836 1.14e-11
## qt8$Hospital.Visit.This.Quarter..1.Yes.1  0.668579   0.154596   4.325 1.62e-05
##                                             
## (Intercept)                              ***
## qt8$Sex..Male.1.1                        ***
## qt8$Race2                                   
## qt8$Race3                                   
## qt8$Age                                  ***
## qt8$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.898 on 1664 degrees of freedom
## Multiple R-squared:  0.04901,    Adjusted R-squared:  0.04615 
## F-statistic: 17.15 on 5 and 1664 DF,  p-value: < 2.2e-16

Ninth:

model9<-lm(qt9$Health.Score~qt9$Sex..Male.1.+qt9$Race+qt9$Age+qt9$Hospital.Visit.This.Quarter..1.Yes.)
summary(model9)

## 
## Call:
## lm(formula = qt9$Health.Score ~ qt9$Sex..Male.1. + qt9$Race + 
##     qt9$Age + qt9$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5720 -1.2288 -0.4834  0.4941  6.8264 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               2.399441   0.224089  10.708  < 2e-16
## qt9$Sex..Male.1.1                         0.337130   0.095869   3.517 0.000449
## qt9$Race2                                -0.013128   0.109691  -0.120 0.904748
## qt9$Race3                                -0.043340   0.138358  -0.313 0.754131
## qt9$Age                                   0.032257   0.006606   4.883 1.14e-06
## qt9$Hospital.Visit.This.Quarter..1.Yes.1  0.522906   0.146075   3.580 0.000354
##                                             
## (Intercept)                              ***
## qt9$Sex..Male.1.1                        ***
## qt9$Race2                                   
## qt9$Race3                                   
## qt9$Age                                  ***
## qt9$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.972 on 1689 degrees of freedom
## Multiple R-squared:  0.02736,    Adjusted R-squared:  0.02449 
## F-statistic: 9.504 on 5 and 1689 DF,  p-value: 5.941e-09

Tenth:

model10<-lm(qt10$Health.Score~qt10$Sex..Male.1.+qt10$Race+qt10$Age+qt10$Hospital.Visit.This.Quarter..1.Yes.)
summary(model10)

## 
## Call:
## lm(formula = qt10$Health.Score ~ qt10$Sex..Male.1. + qt10$Race + 
##     qt10$Age + qt10$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4953 -1.1779 -0.4867  0.4682  7.0730 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                1.896010   0.213842   8.866  < 2e-16
## qt10$Sex..Male.1.1                         0.292131   0.092089   3.172  0.00154
## qt10$Race2                                -0.246912   0.105475  -2.341  0.01935
## qt10$Race3                                 0.105706   0.133305   0.793  0.42791
## qt10$Age                                   0.047331   0.006213   7.618 4.23e-14
## qt10$Hospital.Visit.This.Quarter..1.Yes.1  0.847819   0.155999   5.435 6.28e-08
##                                              
## (Intercept)                               ***
## qt10$Sex..Male.1.1                        ** 
## qt10$Race2                                *  
## qt10$Race3                                   
## qt10$Age                                  ***
## qt10$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.905 on 1709 degrees of freedom
## Multiple R-squared:  0.05789,    Adjusted R-squared:  0.05514 
## F-statistic:    21 on 5 and 1709 DF,  p-value: < 2.2e-16

Eleventh:

model11<-lm(qt11$Health.Score~qt11$Sex..Male.1.+qt11$Race+qt11$Age+qt11$Hospital.Visit.This.Quarter..1.Yes.)
summary(model11)

## 
## Call:
## lm(formula = qt11$Health.Score ~ qt11$Sex..Male.1. + qt11$Race + 
##     qt11$Age + qt11$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3847 -1.1330 -0.3995  0.4783  6.9514 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                2.103525   0.201395  10.445  < 2e-16
## qt11$Sex..Male.1.1                         0.424476   0.087485   4.852 1.33e-06
## qt11$Race2                                -0.104412   0.099828  -1.046    0.296
## qt11$Race3                                -0.165749   0.126616  -1.309    0.191
## qt11$Age                                   0.038871   0.005779   6.726 2.37e-11
## qt11$Hospital.Visit.This.Quarter..1.Yes.1  1.101326   0.147221   7.481 1.17e-13
##                                              
## (Intercept)                               ***
## qt11$Sex..Male.1.1                        ***
## qt11$Race2                                   
## qt11$Race3                                   
## qt11$Age                                  ***
## qt11$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.816 on 1723 degrees of freedom
## Multiple R-squared:  0.06951,    Adjusted R-squared:  0.06681 
## F-statistic: 25.74 on 5 and 1723 DF,  p-value: < 2.2e-16

Twelfth:

model12<-lm(qt12$Health.Score~qt12$Sex..Male.1.+qt12$Race+qt12$Age+qt12$Hospital.Visit.This.Quarter..1.Yes.)
summary(model12)

## 
## Call:
## lm(formula = qt12$Health.Score ~ qt12$Sex..Male.1. + qt12$Race + 
##     qt12$Age + qt12$Hospital.Visit.This.Quarter..1.Yes.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8911 -1.2972 -0.5224  0.4670  6.9522 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                1.960318   0.223075   8.788  < 2e-16
## qt12$Sex..Male.1.1                         0.255122   0.096510   2.643  0.00828
## qt12$Race2                                -0.148343   0.110257  -1.345  0.17866
## qt12$Race3                                -0.171447   0.139322  -1.231  0.21865
## qt12$Age                                   0.049432   0.006311   7.833 8.26e-15
## qt12$Hospital.Visit.This.Quarter..1.Yes.1  0.967204   0.119710   8.080 1.21e-15
##                                              
## (Intercept)                               ***
## qt12$Sex..Male.1.1                        ** 
## qt12$Race2                                   
## qt12$Race3                                   
## qt12$Age                                  ***
## qt12$Hospital.Visit.This.Quarter..1.Yes.1 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.009 on 1731 degrees of freedom
## Multiple R-squared:  0.07384,    Adjusted R-squared:  0.07117 
## F-statistic:  27.6 on 5 and 1731 DF,  p-value: < 2.2e-16

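By observation, the variables most associated with health score are sex, age, and hospital visits. Their coefficients are positive in every quarter and statistically significant (p < 0.05) in nearly all of them, the one exception being sex in the first quarter (p = 0.094). Race is rarely significant. The R-squared values stay below 0.08 in every quarter, so these variables explain only a small share of the variation in health score.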
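Comparing twelve summaries by eye is error-prone, so the coefficients can be collected into one table. This is a minimal sketch on simulated data with arbitrary effect sizes. The column name Male is a stand-in for Sex..Male.1., and passing data = to lm() keeps the coefficient names short.

```r
set.seed(1)
n <- 300

# Simulated data standing in for company_A.
demo <- data.frame(
  Quarter = rep(1:3, each = n / 3),
  Age     = sample(20:60, n, replace = TRUE),
  Male    = factor(sample(0:1, n, replace = TRUE))
)
# Simulated outcome; the effect sizes are arbitrary.
demo$Health.Score <- 2 + 0.05 * demo$Age + 0.4 * (demo$Male == "1") + rnorm(n)

# Fit the same model within each quarter.
fits <- lapply(split(demo, demo$Quarter), function(d) {
  lm(Health.Score ~ Male + Age, data = d)
})

# One row per quarter, one column per coefficient.
coef_table <- t(sapply(fits, coef))
round(coef_table, 3)
```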

Question Three

a) Using the information from Questions 1 and 2, describe how you would evaluate InsurAHealth’s claim that employees are getting sicker. How would you evaluate the claim? Implement the steps you’ve suggested.

I’ll first use the variables most strongly associated with Health Score to evaluate whether employees are getting sicker over time. For the sake of this exercise, we will ignore the Health Scores that are above 6.
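One direct way to look at the claim is to put Quarter itself into a model, so that its coefficient measures the average change in Health Score per quarter once age is held fixed. This is a minimal sketch on simulated data. The data have no built-in time trend, and the cutoff of 6 is the maximum stated in the exercise instructions.

```r
set.seed(42)
n <- 600

# Simulated data standing in for company_A.
demo <- data.frame(
  Quarter = sample(1:12, n, replace = TRUE),
  Age     = sample(20:60, n, replace = TRUE)
)
# Simulated scores kept within 0 to 6, plus a few impossible values.
demo$Health.Score <- round(pmin(pmax(1 + 0.05 * demo$Age + rnorm(n), 0), 6), 1)
demo$Health.Score[1:5] <- 10

# Drop scores above the maximum of 6.
valid <- demo[demo$Health.Score <= 6, ]

# Raw average score per quarter.
tapply(valid$Health.Score, valid$Quarter, mean)

# Quarter coefficient: average change in score per quarter at a fixed age.
trend <- lm(Health.Score ~ Quarter + Age, data = valid)
summary(trend)$coefficients
```

A positive and significant Quarter coefficient would support the claim. A coefficient near zero would suggest that any rise in the raw averages comes from the workforce getting older rather than sicker at a given age.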

First:

#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.

lm(qt1$Health.Score~qt1$Age)

## 
## Call:
## lm(formula = qt1$Health.Score ~ qt1$Age)
## 
## Coefficients:
## (Intercept)      qt1$Age  
##     1.48993      0.06641

ggplot(qt1, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt1, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt1, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Second:

#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt2$Health.Score~qt2$Age)

## 
## Call:
## lm(formula = qt2$Health.Score ~ qt2$Age)
## 
## Coefficients:
## (Intercept)      qt2$Age  
##     1.88267      0.05579

ggplot(qt2, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt2, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt2, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Third:

#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt3$Health.Score~qt3$Age)

## 
## Call:
## lm(formula = qt3$Health.Score ~ qt3$Age)
## 
## Coefficients:
## (Intercept)      qt3$Age  
##     2.43639      0.03551

ggplot(qt3, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt3, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt3, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Fourth:

#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt4$Health.Score~qt4$Age)

## 
## Call:
## lm(formula = qt4$Health.Score ~ qt4$Age)
## 
## Coefficients:
## (Intercept)      qt4$Age  
##     2.01842      0.05024

ggplot(qt4, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt4, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt4, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Fifth:

#Using coefficients, looking for negative or positive values, where negative means better health score and positive means worse.
lm(qt5$Health.Score~qt5$Age)

## 
## Call:
## lm(formula = qt5$Health.Score ~ qt5$Age)
## 
## Coefficients:
## (Intercept)      qt5$Age  
##     2.34498      0.03806

ggplot(qt5, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt5, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt5, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Sixth:

# A positive Age coefficient means health scores rise (worsen) with age; a negative one means they fall (improve).
lm(qt6$Health.Score~qt6$Age)

## 
## Call:
## lm(formula = qt6$Health.Score ~ qt6$Age)
## 
## Coefficients:
## (Intercept)      qt6$Age  
##     1.87905      0.05507

ggplot(qt6, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt6, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt6, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Seventh:

# A positive Age coefficient means health scores rise (worsen) with age; a negative one means they fall (improve).
lm(qt7$Health.Score~qt7$Age)

## 
## Call:
## lm(formula = qt7$Health.Score ~ qt7$Age)
## 
## Coefficients:
## (Intercept)      qt7$Age  
##     2.67012      0.02913

ggplot(qt7, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt7, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt7, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Eighth:

# A positive Age coefficient means health scores rise (worsen) with age; a negative one means they fall (improve).
lm(qt8$Health.Score~qt8$Age)

## 
## Call:
## lm(formula = qt8$Health.Score ~ qt8$Age)
## 
## Coefficients:
## (Intercept)      qt8$Age  
##     2.22071      0.04337

ggplot(qt8, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt8, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt8, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Ninth:

# A positive Age coefficient means health scores rise (worsen) with age; a negative one means they fall (improve).
lm(qt9$Health.Score~qt9$Age)

## 
## Call:
## lm(formula = qt9$Health.Score ~ qt9$Age)
## 
## Coefficients:
## (Intercept)      qt9$Age  
##     2.65358      0.03136

ggplot(qt9, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt9, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt9, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Tenth:

# A positive Age coefficient means health scores rise (worsen) with age; a negative one means they fall (improve).
lm(qt10$Health.Score~qt10$Age)

## 
## Call:
## lm(formula = qt10$Health.Score ~ qt10$Age)
## 
## Coefficients:
## (Intercept)     qt10$Age  
##     2.09652      0.04655

ggplot(qt10, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt10, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt10, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Eleventh:

# A positive Age coefficient means health scores rise (worsen) with age; a negative one means they fall (improve).
lm(qt11$Health.Score~qt11$Age)

## 
## Call:
## lm(formula = qt11$Health.Score ~ qt11$Age)
## 
## Coefficients:
## (Intercept)     qt11$Age  
##     2.38942      0.03832

ggplot(qt11, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt11, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt11, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

Twelfth:

# A positive Age coefficient means health scores rise (worsen) with age; a negative one means they fall (improve).
lm(qt12$Health.Score~qt12$Age)

## 
## Call:
## lm(formula = qt12$Health.Score ~ qt12$Age)
## 
## Coefficients:
## (Intercept)     qt12$Age  
##      2.2090       0.0497

ggplot(qt12, aes(Employee.Id, Health.Score, color = Race))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt12, aes(Employee.Id, Health.Score, color = Sex..Male.1.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

ggplot(qt12, aes(Employee.Id, Health.Score, color = Hospital.Visit.This.Quarter..1.Yes.))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'
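The per-quarter fits above repeat the same call twelve times. The same slopes could be collected into one table by splitting the data on Quarter and fitting the model in a loop, which makes them easier to compare side by side. This is a minimal sketch on simulated data: demo and its generating values are hypothetical, and the real analysis would split company_A instead.

```r
set.seed(42)
# Simulated long-format data standing in for company_A (hypothetical values)
demo <- data.frame(
  Quarter = rep(1:12, each = 100),
  Age     = sample(22:65, 1200, replace = TRUE)
)
demo$Health.Score <- 2.3 + 0.04 * demo$Age + rnorm(1200, sd = 1)

# Split by quarter, fit the same model to each piece, and keep the coefficients
fits <- lapply(split(demo, demo$Quarter), function(d) {
  coef(lm(Health.Score ~ Age, data = d))
})

# One row per quarter: intercept and Age slope
slopes <- do.call(rbind, fits)
slopes <- data.frame(Quarter   = as.integer(rownames(slopes)),
                     Intercept = slopes[, "(Intercept)"],
                     Age.Slope = slopes[, "Age"],
                     row.names = NULL)
print(slopes)
```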

Overall, it seems that the only employees getting sicker are those of race one, those who are male, and those who are older. Age and health score are positively correlated: the Age coefficient is positive in each quarterly fit, for example between about 0.03 and 0.06 in quarters three through twelve. So the claim that employees are getting sicker is not wrong, but it holds for specific groups rather than for the workforce as a whole. To bring health scores down, it might be more worthwhile to hire more employees outside of race one, more female employees, younger employees, and employees who stay out of the hospital (though the hospital-visit link may be a case of correlation rather than causation).
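One way to put the claim to a more direct test is a single pooled regression with Quarter as a predictor alongside the demographic variables. The Quarter coefficient then estimates how the health score changes per quarter after adjusting for age, sex, race and hospital visits. The sketch below is an assumption-laden illustration on simulated data: demo, the shortened column names Male and Hospital.Visit, and every generating value are hypothetical.

```r
set.seed(7)
# Simulated panel (hypothetical values): 150 employees followed for 12 quarters
n_emp <- 150
demo <- data.frame(
  Employee.Id = rep(seq_len(n_emp), each = 12),
  Quarter     = rep(1:12, times = n_emp)
)
start_age <- sample(22:60, n_emp, replace = TRUE)
demo$Age  <- start_age[demo$Employee.Id] + (demo$Quarter - 1) %/% 4  # ages one year every 4 quarters
demo$Male <- rbinom(n_emp, 1, 0.5)[demo$Employee.Id]
demo$Race <- factor(sample(1:3, n_emp, replace = TRUE)[demo$Employee.Id])
demo$Hospital.Visit <- rbinom(nrow(demo), 1, 0.1)
demo$Health.Score <- 2 + 0.04 * demo$Age + 0.03 * demo$Quarter +
  0.8 * demo$Hospital.Visit + rnorm(nrow(demo), sd = 1)

# Pooled model: the Quarter term is the time trend net of the other variables
pooled <- lm(Health.Score ~ Quarter + Age + Male + Race + Hospital.Visit,
             data = demo)

# Estimate, standard error, t value and p-value for the time trend
quarter_row <- summary(pooled)$coefficients["Quarter", ]
print(quarter_row)
```

Because each employee appears in several quarters, the observations are not independent, so the standard errors from a plain lm() are likely too optimistic. A model with a per-employee effect would be a more careful follow-up.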

Acumen_Data_Analsis_Exercise_Jaffe_Benjamin

2022-08-04

Question One

a) Are all the values in the data reasonable? Are there missing values?

b) What are the characteristics of employees at Company A? Do these demographics change over time?

Question Two

a) Which characteristics are associated with the health score?

Question Three

a) Using the information from Questions 1 and 2, describe how you would evaluate InsurAHealth’s claim that employees are getting sicker. How would you evaluate the claim? Implement the steps you’ve suggested.