For this recipe, a dataset of childrens’ health status in the U.S. in 1986 is analyzed. The health metric is a continuous Dependent Variable, which is dependent on factors including number of doctor visits, number of children per household, and access to healthcare. There were 485 observations. This data can be found under the Doctor section library in the Ecdat Package.
#reading in data
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Warning: package 'Ecdat' was built under R version 3.1.3
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.1.3
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
library("Ecfun", lib.loc="~/R/win-library/3.1")
data(Doctor)
summary(Doctor)
## doctor children access health
## Min. : 0.00 Min. :1.000 Min. :0.0000 Min. :-1.524000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.:0.2500 1st Qu.:-1.066000
## Median : 1.00 Median :2.000 Median :0.3500 Median :-0.421000
## Mean : 1.61 Mean :2.264 Mean :0.3812 Mean :-0.000041
## 3rd Qu.: 2.00 3rd Qu.:3.000 3rd Qu.:0.5000 3rd Qu.: 0.657000
## Max. :48.00 Max. :9.000 Max. :0.9200 Max. : 7.217000
The three factors being considered for the experiment are:
Doctor: the number of doctor visits.
Children: the number of children in the household.
Access: a measure of access to health care
In the Experimental Design portion of this Project, I will explore the factors, and decide which IV to block accordingly.
#setting factors
Doctor$doctor=as.factor(Doctor$doctor)
Doctor$children=as.factor(Doctor$children)
Doctor$access=as.factor(Doctor$access)
The response variable, health, is considered a continuous dependent variable. We hypothesize that the metric is dependent on the factors listed above. It can assume all values, not just integer values, over a range (-1.524<= x <= 7.217). That range is simply what was experienced in this study, but if further studies were conducted, the range could certainly broaden.
Initially, my intuition was telling me to use doctor (the number of doctor visits) as the response variable. This is a viable option as it very well could be dependent on the other three attributes. I would expect that with an increased number of children and increased access to health care, the number of visits would increase as well. With a larger “health” number, (larger positive numbers are associated with poorer health), I would expect the number of doctor visits to increase as well. This analysis could be performed. But, after critically evaluating the assignment, I realized that my dependent variable, the doctor visits, would only produce discrete integer variables. The assignment calls for continuous DV. So, I flipped my argument to analyze the effects of the factors on the “health” metric. I hypothesize that with increased access to healthcare and increased number of doctor visits, the health metric will go down. As for number of children, I do not forsee a correlation, but I will be performing further analysis under Experimental Design. I believe that number of children will prove to be a nuisance factor, and then I will have to select which factor, of the remaining two, to block.
The structure of the Doctor data is as follows:
#analyzing structure
str(Doctor)
## 'data.frame': 485 obs. of 4 variables:
## $ doctor : Factor w/ 17 levels "0","1","2","3",..: 1 2 1 1 12 4 1 7 2 1 ...
## $ children: Factor w/ 8 levels "1","2","3","4",..: 1 3 4 2 1 1 2 1 1 1 ...
## $ access : Factor w/ 18 levels "0","0.08","0.17",..: 11 3 10 7 15 4 11 15 4 15 ...
## $ health : num 0.495 0.52 -1.227 -1.524 0.173 ...
First and last 6 observations of the dataset:
head(Doctor)
## doctor children access health
## 1 0 1 0.5 0.495
## 2 1 3 0.17 0.520
## 3 0 4 0.42 -1.227
## 4 0 2 0.33 -1.524
## 5 11 1 0.67 0.173
## 6 3 1 0.25 -0.905
tail(Doctor)
## doctor children access health
## 480 0 2 0.5 -1.524
## 481 0 2 0.5 -0.421
## 482 0 2 0.36 -0.930
## 483 9 1 0.33 -1.066
## 484 0 2 0.33 -1.363
## 485 1 3 0.5 -0.099
The contributing factors and the response variables were selected naturally, just by following the guidelines outlined in the assignment. Health is certainly going to be the response variable, since it is a continuous dependent variable. Now, figuring out which independent variables to withhold depends on some inferrential statistics. Using box plots, below, I will eliminate one of the three factors, and consider it a nuisance factor.
#checking for relationships
boxplot(Doctor$health~Doctor$doctor, xlab="the number of doctor visits
+ ", ylab="Health")
boxplot(Doctor$health~Doctor$children, xlab="The number of children in the household", ylab="Health")
boxplot(Doctor$health~Doctor$access, xlab="Access to Health Care
+ ", ylab="Health")
As the boxplots highlight, the Number of children in the houselhold does not truly correlate with a significant variation in health level. The means are relatively even as the number of children increases. The other Independent variable, access and doctor, show much more variation. Thereby, I will define children as a nuisance factor, and focus on doctor and access as my “pair of categorical IVs.”
As we proceed, I will be blocking the access factor, to investigate the effect that number of doctor visits has on health.
A randomized, incomplete block design will be the approach to examine the effects of doctor on health. ANOVA testing will be performed to further examine the impacts.
To determine the proper sample size for our experiment, it is imperative that we set type I and type II error metrics. This will allow us to determine the effects of the relevant factor. Typically, one would use Cohen’s d to find the effect size, and then utilize the G Power tool to calculate total sample size. This is highly dependent on the notion that the factor would only have two levels. Since my factor, doctor, has 48 levels, i could not simply use Cohens d to estimate effect and sample size. I had to perfrom a sensitivity analysis within G Power to estimate the sample size. Our actual sample size is relatively small, so I was unsure if G power would increase or decrease the suggested sample size. Using the typical Type I and Type II errors to be used are .05 and .95, respectively, I received a total sample size of 528, and an effect size of .0.3049884. (This is surprising to me, given previous experiences where our original sample size acted as a population, and our G power sample size was a smaller portion of that populaton.)
This design allows us to evaluate the main effects of the targeted factor on the response variable, without the interaction effects of other factors weighing in.
This data is random because it is drawn from 485 random households across the United States. By looking at the physical data, we can observe an assortment of “children,” “access,” “health,” and “doctor” fields. We did not have any say during the gathering of the sample, so we have to accept the data for what it is, with its current level of randomness.
There are replicates, study observed the exact same factors on each participating household. Each household is an observation, but also a replicate. They would be “repeated” if the same study were performed on the same household more than once.
There is blocking in this model. The “children factor” was blocked immediately, since it is a nuisance factor at this point. The more significant blocking was on “access.” This block allowed us to analyze the effects of “doctor” directly on “health,” without interaction effects from “access.”
The first tet we will focus on is Null Hypothesis testing. Other tests will be explored in the next section.
Summary Stats for the entire dataset:
summary(Doctor)
## doctor children access health
## 0 :241 1 :171 0.33 : 86 Min. :-1.524000
## 1 : 96 2 :142 0.5 : 84 1st Qu.:-1.066000
## 2 : 55 3 : 93 0.17 : 56 Median :-0.421000
## 3 : 30 4 : 49 0.42 : 48 Mean :-0.000041
## 4 : 14 5 : 19 0.67 : 40 3rd Qu.: 0.657000
## 6 : 12 6 : 6 0.58 : 37 Max. : 7.217000
## (Other): 37 (Other): 5 (Other):134
Mean Health Metric:
mean(Doctor$health)
## [1] -4.123711e-05
We must first draw a histogram, to learn more about the distribution.
hist(Doctor$health)
As you can see, the distribution is not normal, but almost more exponential. We cannot assume normality while conducting further analysis.
Boxplot for doctor visits.
boxplot(Doctor$health~Doctor$doctor, xlab="Doctor Visits", ylab="Health")
After analyzing the box plot for the following factors, there are a lot of things we can conclude. It seems that the trend is as doctor visits increase, the health metric increases as well. The health metric is quite couter-intuative though, as we have to keep in mind that the higher the number is, the less healthy the patient. So from the boxplot we can infer that unhealthy patients are taking more trips to the doctor. This hypothesis does make sense, according to the data. But, one could also argue that going to the doctor more will keep you healthier. This theor is not supported by the boxplot.
Model 1: ANOVA for the impact of number of doctor visits on health. This is taking into account the entire sample size of 485.
#anova test
model1=aov(Doctor$health~Doctor$doctor)
anova(model1)
## Analysis of Variance Table
##
## Response: Doctor$health
## Df Sum Sq Mean Sq F value Pr(>F)
## Doctor$doctor 16 151.58 9.4737 5.2592 3.071e-10 ***
## Residuals 468 843.03 1.8013
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The probability that the variation in health is directly caused by randomization is 0.000000000307. The F statistic came out to 5.2592. This being said, we reject our null hypothesis. The number of doctor visits could potentially be enough to predict the overall health of an individual.
The following qqnorm plot confirms the notion that our sample does not fit a normal distribution.
#model normality check
qqnorm(residuals(model1))
qqline(residuals(model1))
testset=Doctor[(Doctor),528]
model2=aov(Doctor$health~Doctor$doctor, data=testset)
anova(model2)
## Analysis of Variance Table
##
## Response: Doctor$health
## Df Sum Sq Mean Sq F value Pr(>F)
## Doctor$doctor 16 151.58 9.4737 5.2592 3.071e-10 ***
## Residuals 468 843.03 1.8013
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
eval=FALSE
This output suggest that the probability that the variation in health is directly caused by randomization is 0.00001174. The F statistic came out to 5.994. This being said, we reject our null hypothesis. The number of doctor visits could potentially be enough to predict the overall health of an individual.
The difference between our G Power n and our original n, in terms of significant impact on the model, is almost negligible.
model3=aov(Doctor$health~Doctor$doctor + Doctor$access)
anova(model3)
## Analysis of Variance Table
##
## Response: Doctor$health
## Df Sum Sq Mean Sq F value Pr(>F)
## Doctor$doctor 16 151.58 9.4737 5.2397 3.709e-10 ***
## Doctor$access 17 27.59 1.6229 0.8976 0.5772
## Residuals 451 815.44 1.8081
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This ANOVA suggests that our blocking variable, access, does not have a significant effect on health. THis further solidifies the hypothesis that visits to the doctor is highly correlated with health.
The only other model that we can consider for main effects, while still blocking “access,” would be ANOVA for the “children” factor. The results can be seen in the following ANOVA. It is safe to say that the number of children in a household does not have a significant impact on the Health of a child.
model4=aov(Doctor$health~Doctor$children)
anova(model4)
## Analysis of Variance Table
##
## Response: Doctor$health
## Df Sum Sq Mean Sq F value Pr(>F)
## Doctor$children 7 11.42 1.6315 0.7915 0.5946
## Residuals 477 983.19 2.0612
We can also model interaction effects between the factors. The first ANOVA will model the interaction between “children”" and “doctor.”
#alternate interaction models
model5=aov(Doctor$health~Doctor$children*Doctor$doctor)
anova(model5)
## Analysis of Variance Table
##
## Response: Doctor$health
## Df Sum Sq Mean Sq F value Pr(>F)
## Doctor$children 7 11.42 1.6315 0.8905 0.5138
## Doctor$doctor 16 151.37 9.4604 5.1635 6.273e-10 ***
## Doctor$children:Doctor$doctor 30 42.16 1.4053 0.7670 0.8096
## Residuals 431 789.66 1.8322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(Doctor$children, Doctor$doctor, Doctor$health)
According to our ANOVA, the INteraction effects are insignificant, but according to our interaction plot, we can see that there certainly are interactions amongsth the datafields. They are just negligible in our case apparently.
The second ANOVA for interaction will be between “children” and “access.”
model6=aov(Doctor$health~Doctor$children*Doctor$access)
anova(model6)
## Analysis of Variance Table
##
## Response: Doctor$health
## Df Sum Sq Mean Sq F value Pr(>F)
## Doctor$children 7 11.42 1.6315 0.7749 0.6087
## Doctor$access 17 34.25 2.0147 0.9568 0.5067
## Doctor$children:Doctor$access 44 73.03 1.6598 0.7883 0.8328
## Residuals 416 875.91 2.1056
interaction.plot(Doctor$children, Doctor$access, Doctor$health)
Again, our ANOVA proves that the main effects for access to healthcare are above .05, deeming it insignificant. We also notice again that there are interactions between the factors, but the ANOVA claims that they are insignificant in retrospect.
Our last ANOVA for interaction is between “access” and “doctor.”
model7=aov(Doctor$health~Doctor$access*Doctor$doctor)
anova(model7)
## Analysis of Variance Table
##
## Response: Doctor$health
## Df Sum Sq Mean Sq F value Pr(>F)
## Doctor$access 17 33.39 1.9640 1.1813 0.276559
## Doctor$doctor 16 145.78 9.1113 5.4800 1.372e-10 ***
## Doctor$access:Doctor$doctor 62 168.67 2.7206 1.6363 0.003054 **
## Residuals 389 646.77 1.6626
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(Doctor$access, Doctor$doctor, Doctor$health)
This ANOVA and interaction chart suggest that there is a significant interaction effect between the access to healthcare, doctor visits, and overall health. This makes sense when applied to the real world as well.
Assuming the STDEV(health) of the population to be q.43352, the margin of error for our health metric at a 95% confidence level is 0.1275795. The confidence interval is between -0.1276208 and 0.1275383.
#interval estimation construction
sd(Doctor$health)
## [1] 1.43352
n=485
sigma=1.43352
sem=sigma/sqrt(n); sem
## [1] 0.06509279
E=qnorm(.975)*sem; E
## [1] 0.1275795
xbar=mean(Doctor$health)
xbar+c(-E, E)
## [1] -0.1276208 0.1275383
To best estimate the health metric, one could use the “doctor” factor. As seen in the parameters below, there is a relationship between “doctor” visits and “health.” The higher the number if visits, generally speaking, the higher the health metric (which ultimately signifies poor health.) This is a sound estimation strategy.
coef(model1)
## (Intercept) Doctor$doctor1 Doctor$doctor2 Doctor$doctor3
## -0.3943444 0.5181881 0.5272535 1.1424444
## Doctor$doctor4 Doctor$doctor5 Doctor$doctor6 Doctor$doctor7
## 0.7937015 0.4282194 1.6350111 1.6524873
## Doctor$doctor8 Doctor$doctor9 Doctor$doctor10 Doctor$doctor11
## 1.2325444 1.9497194 0.8980111 0.5673444
## Doctor$doctor12 Doctor$doctor15 Doctor$doctor16 Doctor$doctor24
## -0.3746556 -0.0766556 -0.6716556 6.1683444
## Doctor$doctor48
## 2.4263444
The qualms that I experienced include the following. First off, the sensitivity analysis in GPower kept on giving me a number higher than my population size. This forces me to re-use random datum in my ANOVA. The second issue was that in my Interval Estimation I used Normal distribution, but I am not sure if this is accurate since originally the distribution was more exponential.
If we take a look at the qqnorm plots for the doctor vs health and the children vs health, we can realize that the doctor factor did not have a perfect fit, but it was the closest of the factors.
qqnorm(residuals(model1))
qqline(residuals(model1))
qqnorm(residuals(model4))
qqline(residuals(model4))
N/A
Raw Data: Ecdat Doctor data
All Complete R code included in the RMarkDown.