The HELPrct dataset in the mosaicData package includes data from the Health Evaluation and Linkage to Primary Care study, which was conducted in Boston 10 years ago. One of the study variables is a measure of physical function, with higher scores being better (possible scores range from 0 to 100 points). Describe the sample size plus CENTER, SPREAD and SHAPE of this distribution, providing only a single measure of center and a single measure of spread. Be sure to provide an interpretation in the context of the problem. Could you provide a different graph to describe the distribution of this variable?
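library(mosaic)      # provides favstats() and attaches lattice for densityplot()
library(mosaicData)  # provides the HELPrct data
favstats(~ pcs, data=HELPrct)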
##       min       Q1   median       Q3      max     mean      sd   n missing
##  14.07429 40.38438 48.87681 56.95329 74.80633 48.04854 10.7846 453       0
densityplot(~ pcs,
main="Figure 1: Density plot\nof Physical Component Scores from HELP study",
data=HELPrct)
SOLUTION:
SAMPLE SIZE: n = 453 participants
CENTER: median = 48.88 points on the physical component score
-the median is slightly higher than the mean (48.05), which is consistent with a slight left skew; since the median is less affected by the tail, it is a better representation of the center.
SPREAD: sd = 10.78 points
-standard deviation is a reasonable measure of spread for this data set because the distribution is nearly normal; it describes how far a typical participant's score falls from the mean.
SHAPE: skewed with a slight tail to the left
-If I had to make another graphical representation of this data set I would recommend a box plot because it will show the slight left skew as well as the minimum and maximum physical component scores clearly. It will also clearly display the IQR.
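The box plot suggested above could be drawn as follows. This is a minimal sketch that uses base R's boxplot() rather than the formula interface used elsewhere in this document; it assumes the mosaicData package is installed, and the title and axis label are illustrative choices.

```r
# Load the HELPrct data from the mosaicData package (assumed to be installed).
data(HELPrct, package = "mosaicData")

# Horizontal box plot of the physical component scores.
# boxplot() invisibly returns the summary statistics it draws.
bp <- boxplot(HELPrct$pcs,
              horizontal = TRUE,
              xlab = "Physical component score (points)",
              main = "Box plot of Physical Component Scores from HELP study")

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
bp$stats
```

A longer lower whisker than upper whisker in the resulting plot would match the slight left skew described above.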
The faithful dataset contains the waiting time (in minutes) to the next eruption of the Old Faithful geyser in Yellowstone National Park in Wyoming. Describe the sample size plus CENTER, SPREAD and SHAPE of this distribution, providing only a single measure of center and a single measure of spread. Be sure to provide an interpretation in the context of the problem (and don't forget to specify units). Could you provide a different graph to describe the distribution of this variable?
favstats(~ waiting, data=faithful)
##  min Q1 median Q3 max     mean       sd   n missing
##   43 58     76 82  96 70.89706 13.59497 272       0
densityplot(~ waiting,
xlab="Waiting time to next eruption (in mins)",
main="Figure 2: Density plot of Old Faithful geyser dataset", data=faithful)
SOLUTION:
SAMPLE SIZE: n = 272 observations
CENTER: median = 76 minutes
-median is a better measure of central tendency for this data set since it is shifted toward the higher-density peak of this bimodal distribution. The mean is pulled between the two peaks at approximately 50 minutes and 80 minutes, and since the 80-minute peak has nearly double the density of the 50-minute peak, the mean (70.90 minutes) is a less accurate representation of the data. Overall, the median does a better job, but ideally you would report a measure of central tendency for each of the two peaks, for example by fitting a mixture model.
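To illustrate reporting a center for each peak, the sketch below splits the waiting times at a cutoff and summarizes each group separately. The 65-minute cutoff is an assumption chosen by eye from the dip between the two peaks, and this simple split is only a rough stand-in for a fitted mixture model.

```r
# faithful ships with base R (datasets package).
cutoff <- 65  # assumed cutoff in minutes, chosen by eye; not part of the original analysis

# Label each observation by which side of the cutoff it falls on.
peak <- ifelse(faithful$waiting < cutoff, "short wait", "long wait")

# Number of observations and median waiting time (minutes) in each group.
group_n      <- table(peak)
group_median <- tapply(faithful$waiting, peak, median)

group_n
group_median
```

A different cutoff would shift a few observations between groups, so the group medians should be read as approximate locations of the two peaks.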
SPREAD: IQR = 24 minutes (Q3 - Q1 = 82 - 58)
-IQR is a better measure of spread for this data set because the distribution is bimodal.
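The interquartile range follows from the quartiles in the favstats output above (Q3 = 82 and Q1 = 58, a difference of 24 minutes). A minimal sketch of the same calculation in base R, using the default quantile method, might look like this; the variable names are illustrative.

```r
# Quartiles of the waiting times (minutes), using R's default quantile method.
quartiles <- quantile(faithful$waiting, probs = c(0.25, 0.75))

# Interquartile range: Q3 - Q1.
iqr_waiting <- IQR(faithful$waiting)

quartiles
iqr_waiting
```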
SHAPE: bimodal
-If I had to make another graphical representation of this data set I would recommend a histogram, as it is still capable of showing the bimodal nature of the distribution.
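The histogram suggested above could be sketched in base R as follows. The 5-minute bin width and the plot title are assumptions made for illustration; a much wider bin could blur the two peaks together.

```r
# Histogram of waiting times with 5-minute bins (bin width is an assumed choice).
h <- hist(faithful$waiting,
          breaks = seq(40, 100, by = 5),
          xlab = "Waiting time to next eruption (in mins)",
          main = "Histogram of Old Faithful geyser dataset")

# Bin counts, to inspect the two peaks numerically.
h$counts
```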