The data used in this recipe was found using the “100+ interesting data sets” webpage. It is part of the R dataset package called “Ecdat,” which various datasets for the purpose of econometrics.
This dataset in particular has data of number of physician visits by individuals in the United States.
# Read in dataset from Ecdat package
#install.packages("Ecdat")
# library(Ecdat)
data(OFP, package = "Ecdat")
x <- OFP
The original data being examined has 4,406 observations of 19 variables.
head(x)
## ofp ofnp opp opnp emr hosp numchron adldiff age black sex maried
## 1 5 0 0 0 0 1 2 0 6.9 yes male yes
## 2 1 0 2 0 2 0 2 0 7.4 no female yes
## 3 13 0 0 0 3 3 4 1 6.6 yes female no
## 4 16 0 5 0 1 1 2 1 7.6 no male yes
## 5 3 0 0 0 0 0 2 1 7.9 no female yes
## 6 17 0 0 0 0 0 5 1 6.6 no female no
## school faminc employed privins medicaid region hlth
## 1 6 2.8810 yes yes no other other
## 2 10 2.7478 no yes no other other
## 3 10 0.6532 no no yes other poor
## 4 3 0.6588 no yes no other poor
## 5 6 0.6588 no yes no other other
## 6 7 0.3301 no no yes other poor
summary(x)
## ofp ofnp opp opnp
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 4.000 Median : 0.000 Median : 0.0000 Median : 0.0000
## Mean : 5.774 Mean : 1.618 Mean : 0.7508 Mean : 0.5361
## 3rd Qu.: 8.000 3rd Qu.: 1.000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :89.000 Max. :104.000 Max. :141.0000 Max. :155.0000
## emr hosp numchron adldiff
## Min. : 0.0000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000
## Median : 0.0000 Median :0.000 Median :1.000 Median :0.000
## Mean : 0.2635 Mean :0.296 Mean :1.542 Mean :0.204
## 3rd Qu.: 0.0000 3rd Qu.:0.000 3rd Qu.:2.000 3rd Qu.:0.000
## Max. :12.0000 Max. :8.000 Max. :8.000 Max. :1.000
## age black sex maried school
## Min. : 6.600 no :3890 female:2628 no :2000 Min. : 0.00
## 1st Qu.: 6.900 yes: 516 male :1778 yes:2406 1st Qu.: 8.00
## Median : 7.300 Median :11.00
## Mean : 7.402 Mean :10.29
## 3rd Qu.: 7.800 3rd Qu.:12.00
## Max. :10.900 Max. :18.00
## faminc employed privins medicaid region
## Min. :-1.0125 no :3951 no : 985 no :4004 other :1614
## 1st Qu.: 0.9122 yes: 455 yes:3421 yes: 402 noreast: 837
## Median : 1.6982 midwest:1157
## Mean : 2.5271 west : 798
## 3rd Qu.: 3.1728
## Max. :54.8351
## hlth
## other :3509
## excellent: 343
## poor : 554
##
##
##
# As seen below, there are 452 observations of 12 variables, including: 5 factors and 7 numeric variables.
str(x)
## 'data.frame': 4406 obs. of 19 variables:
## $ ofp : int 5 1 13 16 3 17 9 3 1 0 ...
## $ ofnp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ opp : int 0 2 0 5 0 0 0 0 0 0 ...
## $ opnp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ emr : int 0 2 3 1 0 0 0 0 0 0 ...
## $ hosp : int 1 0 3 1 0 0 0 0 0 0 ...
## $ numchron: int 2 2 4 2 2 5 0 0 0 0 ...
## $ adldiff : int 0 0 1 1 1 1 0 0 0 0 ...
## $ age : num 6.9 7.4 6.6 7.6 7.9 6.6 7.5 8.7 7.3 7.8 ...
## $ black : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 1 1 ...
## $ maried : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 1 1 1 ...
## $ school : int 6 10 10 3 6 7 8 8 8 8 ...
## $ faminc : num 2.881 2.748 0.653 0.659 0.659 ...
## $ employed: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ privins : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 2 2 2 2 ...
## $ medicaid: Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 1 1 1 1 ...
## $ region : Factor w/ 4 levels "other","noreast",..: 1 1 1 1 1 1 3 3 3 3 ...
## $ hlth : Factor w/ 3 levels "other","excellent",..: 1 1 3 3 1 3 1 1 1 1 ...
x <- x[,c(1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)] # eliminate unused variables
x$school <- as.numeric(x$school) # assign as a numeric variable
x$school <- cut(x$school, c(0, 12, 19)) #create intervals out of continuous variable
x$school <- as.factor(x$school) #reassign as a factor
x <- x[x$hlth %in% c("excellent", "poor"),] # eliminate "other" level
x$hlth <- droplevels(x$hlth) # drop unused level
x <- x[, c(1,2,3,4,5,7,11)] # eliminate all unused variables
# Assign all factors as binary
x$black <- as.character(x$black)
x$black[x$black == "no"] <- 1
x$black[x$black == "yes"] <- -1
x$sex <- as.character(x$sex)
x$sex[x$sex == "female"] <- 1
x$sex[x$sex == "male"] <- -1
x$employed <- as.character(x$employed)
x$employed[x$employed == "no"] <- 1
x$employed[x$employed == "yes"] <- -1
x$school <- as.character(x$school)
x$school[x$school == "(0,12]"] <- 1
x$school[x$school == "(12,19]"] <- -1
x$maried <- as.character(x$maried)
x$maried[x$maried == "no"] <- 1
x$maried[x$maried == "yes"] <- -1
x$hlth <- as.character(x$hlth)
x$hlth[x$hlth == "excellent"] <- 1
x$hlth[x$hlth == "poor"] <- -1
x$hlth <- as.integer(x$hlth)
x$sex <- as.integer(x$sex)
x$black <- as.integer(x$black)
x$maried <- as.integer(x$maried)
x$school <- as.integer(x$school)
x$employed <- as.integer(x$employed)
x$ofp <- as.numeric(x$ofp)
The only continuous variable in this dataset is the number of physician visits by the individual.
The response variable being analyzed in this dataset is the number of physician visits. As seen below, the range of unemployment is between 0 and 117 weeks, with a mean of 18.51 weeks.
summary(x$ofp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 5.000 6.806 9.000 89.000
The data is organized into 19 variables, but was narrowed down to 7, including: 6 factors and 1 numeric variable.
There was no randomization in this recipe because the data points are based on observations about physician visits among American individuals.
A fractional factorial design will be used to complete analysis. Each factor will be narrowed down to two levels through either subsetting intervals, or by excluding certain levels.
A 2^(6-1) design matrix is used to look at all the possible combinations of factors.
The experiment was based on observations, and was therefore not randomized.
The experiment had no repeated measures.
There was no blocking in this recipe.
x_plot <- x # rename so that we can rename as factors without screwing up original data
summary(x_plot)
## ofp black sex maried
## Min. : 0.000 Min. :-1.0000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.: 1.000 1st Qu.: 1.0000 1st Qu.:-1.0000 1st Qu.:-1.00000
## Median : 5.000 Median : 1.0000 Median : 1.0000 Median :-1.00000
## Mean : 6.806 Mean : 0.7547 Mean : 0.1929 Mean :-0.02341
## 3rd Qu.: 9.000 3rd Qu.: 1.0000 3rd Qu.: 1.0000 3rd Qu.: 1.00000
## Max. :89.000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
##
## school employed hlth
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.: 1.0000 1st Qu.: 1.0000 1st Qu.:-1.0000
## Median : 1.0000 Median : 1.0000 Median :-1.0000
## Mean : 0.5972 Mean : 0.7904 Mean :-0.2352
## 3rd Qu.: 1.0000 3rd Qu.: 1.0000 3rd Qu.: 1.0000
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## NA's :33
x_plot$black <- as.factor(x_plot$black)
x_plot$sex <- as.factor(x_plot$sex)
x_plot$maried <- as.factor(x_plot$maried)
x_plot$school <- as.factor(x_plot$school)
x_plot$employed <- as.factor(x_plot$employed)
x_plot$hlth <- as.factor(x_plot$hlth)
The boxplot below shows the number of doctor visits versus race. There does not seem to be any distinct difference in medians between black and non black individuals. The range of number of doctor visits for black individuals does, however have a larger range.
par(mfrow=c(1,1))
plot(x_plot$ofp ~ x_plot$black, xlab = "Black (1) or Non-Black (-1)", ylab = "Number of Doctors Visits", main="Number of Doctor Visits versus Race" )
The boxplot below shows the frequency of doctor visits by the gender of the individual. The median frequency for females is slightly greater than females, but the males have a larger range in frequencies.
plot(x_plot$ofp ~ x_plot$sex, xlab = "Male (-1) or Female (1)", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Gender")
The boxplot below shows frequency of docor visits of employed vs. unemployed individuals. The employed individuals has a larger median, and seems to be right scewed.
plot(x_plot$ofp ~ x_plot$employed, xlab = "Unemployed (-1) or Employed (1)", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Employment Status")
The boxplot below shows the number of doctor visits versus the marital status of the individual. Married individuals have a larger range, but the medians of the two groups seem relatively consistent.
plot(x_plot$ofp ~ x_plot$maried, xlab = "Married (-1) or Single (1))", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Married Status")
The boxplot below shows the frequency of doctor visits and their relationship to the health of the individual. As expected, poor health individuals have greater medians, along with a much larger range - indicating that that poor health individuals have various levels of severity.
plot(x_plot$ofp ~ x_plot$hlth, xlab = "Excellent (1) or Poor (-1)", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Health")
The boxplot below shows the frequency of doctor visits related to the years of school the individual has had. There doesn’t seem to be any signidicant difference in means or ranges of the two groups.
plot(x_plot$ofp ~ x_plot$school, xlab = "Low (12 or less years) (1) or High (greater than 12 years) (-1)", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Years of School")
The ANOVA below is a 6 factor, 2 level analysis of various. The null hypothesis states that the variation in the number of annual doctors visits (ofp) cannot be explained by anything other than randomization. The alternative, however, states that it is possible that something other than randomization can explain the variation in annual doctors visits.
model <- aov(ofp~black + sex + maried + school + employed + hlth, data=x)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## black 1 15 15 0.239 0.625
## sex 1 68 68 1.105 0.293
## maried 1 16 16 0.261 0.610
## school 1 104 104 1.694 0.193
## employed 1 9 9 0.143 0.706
## hlth 1 6303 6303 102.772 <2e-16 ***
## Residuals 857 52560 61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 33 observations deleted due to missingness
In this case, the p-values for all of the main effects of the six factors are greater than an alpha of 0.05. Therefore they are insignificant, and we must fail to reject the null. Therefore, it is possible that frequency of doctor visits can still not be explained by anything other than randomization.
# install.packages("FrF2")
library(FrF2)
## Warning: package 'FrF2' was built under R version 3.1.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
attach(x)
# Constructing the design matrix
design <- FrF2(32, nfactors=6, estimable = formula("~black + sex + maried + school + employed + hlth +hlth:(~black + sex + maried + school + employed + hlth)"), factor.names = c("black", "sex", "maried", "school", "employed","hlth"), res5=TRUE, clear=FALSE)
design
## black sex maried school employed hlth
## 1 -1 -1 -1 1 1 -1
## 2 -1 -1 -1 -1 -1 -1
## 3 -1 -1 -1 -1 1 1
## 4 1 1 -1 1 1 -1
## 5 -1 -1 -1 1 -1 1
## 6 -1 1 -1 1 1 1
## 7 -1 -1 1 -1 -1 1
## 8 1 1 1 1 1 1
## 9 1 -1 -1 -1 -1 1
## 10 1 -1 -1 -1 1 -1
## 11 -1 -1 1 1 -1 -1
## 12 1 -1 1 1 -1 1
## 13 -1 1 1 -1 1 1
## 14 -1 1 -1 -1 1 -1
## 15 1 -1 1 -1 1 1
## 16 1 -1 1 1 1 -1
## 17 1 1 1 1 -1 -1
## 18 -1 1 1 1 -1 1
## 19 1 1 1 -1 1 -1
## 20 -1 1 1 1 1 -1
## 21 1 -1 -1 1 1 1
## 22 1 1 1 -1 -1 1
## 23 -1 1 1 -1 -1 -1
## 24 1 1 -1 1 -1 1
## 25 1 1 -1 -1 1 1
## 26 -1 -1 1 -1 1 -1
## 27 1 -1 -1 1 -1 -1
## 28 1 -1 1 -1 -1 -1
## 29 -1 1 -1 1 -1 -1
## 30 -1 1 -1 -1 -1 1
## 31 -1 -1 1 1 1 1
## 32 1 1 -1 -1 -1 -1
## class=design, type= FrF2.estimable
aliasprint(design)
## $legend
## [1] A=black B=sex C=maried D=school E=employed F=hlth
##
## [[2]]
## [1] no aliasing among main effects and 2fis
# Creating new datafram which include the combined data of the original, along with the factor level combinations
new = merge(design, x, by=c("black", "sex", "maried", "school", "employed", "hlth"), all = FALSE)
head(new)
## black sex maried school employed hlth ofp
## 1 -1 -1 -1 1 -1 1 0
## 2 -1 -1 -1 1 -1 1 2
## 3 -1 -1 -1 1 -1 1 2
## 4 -1 -1 -1 1 1 -1 9
## 5 -1 -1 -1 1 1 -1 0
## 6 -1 -1 -1 1 1 -1 7
# Run the same ANOVA with the new dataset
model2 <- aov(ofp ~black + sex + maried + school + employed + hlth, data=new)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## black 1 227 227 3.006 0.0837 .
## sex 1 48 48 0.634 0.4263
## maried 1 89 89 1.173 0.2794
## school 1 10 10 0.126 0.7225
## employed 1 240 240 3.177 0.0755 .
## hlth 1 4379 4379 58.020 1.87e-13 ***
## Residuals 403 30417 75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Running the ANOVA for the new dataset result in different p-values from our original model. Now, race and health are both almost significant, with p-values of 0.0837 and 0.0755 respectively, with an alpha of 0.05. However, since these p-values are still not less than 0.05, we must fail to reject the null hypothesis - that the variation in number of doctor visists cannot be explained by anything other than randomization.
Assumptions for ANOVA tests include that the data is normally distributed.
shapiro.test(x$ofp)
##
## Shapiro-Wilk normality test
##
## data: x$ofp
## W = 0.711, p-value < 2.2e-16
The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the frequency of doctor visits data is not normally distributed.
par(mfrow = c(1,1))
qqnorm(residuals(model))
qqline(residuals(model))
The Q-Q Normality Plots of the residuals for the original model do not appear to be normal. When normal, the data points would lay linearly, and in this case, they do not seem to lie along the line. This suggests that the data may not be parametric.
par(mfrow = c(1,1))
qqnorm(residuals(model2))
qqline(residuals(model2))
Once again, the Q-Q normality plot of the residuals, this time for the second model, do not appear normal.
plot(fitted(model),residuals(model))
The plots of the fitted values versus the residual values are supposed to appear symmetric over a residual of 0. The datapoints of the original model do not appear to be symmetric over the length of the dynamic range, and are not symmetric over 0.
plot(fitted(model2),residuals(model2))
As before, the plots of the fitted values versus the residual values for the the new model do not appear to be normal.
Based on the model adequacy testing, it appears that the data is not normally distributed, and that this may not be the best fit of the data.
A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.
With the kruskal-wallis test, we could only analyze the main effect of one factor at a time. I chose to analyze race and health.
kruskal.test(ofp ~ black, data=new)
##
## Kruskal-Wallis rank sum test
##
## data: ofp by black
## Kruskal-Wallis chi-squared = 3.0358, df = 1, p-value = 0.08145
The p-value of this Kruskal-Wallis test is greater than an alpha of 0.05. Therefore, you must fail to reject the null - that variation in frequency of doctor visits cannot be explained by anything other than randomization.
kruskal.test(ofp ~ hlth, data=new)
##
## Kruskal-Wallis rank sum test
##
## data: ofp by hlth
## Kruskal-Wallis chi-squared = 65.8292, df = 1, p-value = 4.918e-16
The p-value for the factor, health, of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in frequency of doctor visits cannot be explained by something other than randomization, and suggest an alternative - that the randomization can be explained by something other than randomization.