The data set is from a case-control study of smoking and Alzheimer’s disease. The data set has two variables of main interest:
- smoking: a factor with four levels “None”, “<10”, “10-20”, and “>20” (cigarettes per day)
- disease: a factor with three levels “Alzheimer”, “Other dementias”, and “Other diagnoses”

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Import data
data("alzheimer", package = "coin")
# create a table
tbl <- xtabs(~disease + smoking, alzheimer)
ftable(tbl)
##                 smoking None <10 10-20 >20
## disease
## Alzheimer                126  15    30  27
## Other dementias           79   8    33  44
## Other diagnoses          104   5    47  20
# create a mosaic plot from the table
library(vcd)
## Loading required package: grid
mosaic(tbl,
       shade = TRUE,
       legend = TRUE,
       labeling_args = list(set_varnames = c(disease = "")),
       set_labels = list(disease = c("Alzheimer", "Other\ndementias", "Other\ndiagnoses")))
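The shading in a mosaic plot is driven by Pearson residuals under independence. As a rough check on what the plot shows, the sketch below rebuilds the table from the counts printed in the `ftable()` output and uses base R's `chisq.test()` to get the expected counts and residuals. The `counts` matrix is typed in by hand here, so it is an assumption of this sketch rather than the `coin` data itself.

```r
# Counts copied by hand from the ftable output (assumption: no typos)
counts <- matrix(
  c(126, 15, 30, 27,
     79,  8, 33, 44,
    104,  5, 47, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    disease = c("Alzheimer", "Other dementias", "Other diagnoses"),
    smoking = c("None", "<10", "10-20", ">20")
  )
)

# Chi-squared test of independence between disease and smoking
fit <- chisq.test(counts)

# Expected counts if disease and smoking were independent
expected <- fit$expected

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- fit$residuals

round(expected, 1)
round(pearson, 2)
```

Cells with a large positive residual have more cases than independence would predict, and cells with a large negative residual have fewer.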
The largest group that has dementia is the non-smokers, who smoke zero cigarettes per day.

## Q2 Describe one group that has more cases than expected given independence (by chance). Discuss it by number of cigarettes per day.

The group that has more cases than expected given independence is people with other dementias who smoke more than 20 cigarettes per day.

## Q3 Does smoking seem to matter in determining other dementias? Discuss your reason using the mosaic chart above.

Based on the data, smoking does not seem to matter in determining dementia, as people who smoke get Alzheimer's at nearly the same rate as those who don't. Also, the Pearson residuals for the Alzheimer cells are close to 0, meaning those cases occur at about the rate expected.

## Q4 Create correlation plot for RailTrail. Hint: The RailTrail data set is from the mosaicData package.
# Import data: RailTrail ships with the mosaicData package
data("RailTrail", package = "mosaicData")
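The correlation plot itself is not shown in the output above. A minimal sketch of one way to draw it, assuming the `mosaicData` package is installed and using a plain `ggplot2` heatmap (the names `num_vars`, `r`, and `r_long` are chosen here for illustration):

```r
library(ggplot2)

# Assumes the mosaicData package is installed
data("RailTrail", package = "mosaicData")

# Keep numeric columns only; cor() cannot handle factors or characters
num_vars <- RailTrail[sapply(RailTrail, is.numeric)]

# Pairwise Pearson correlations
r <- cor(num_vars, use = "complete.obs")

# Reshape the matrix to long form: one row per pair of variables
r_long <- as.data.frame(as.table(r))
names(r_long) <- c("var1", "var2", "corr")

# Heatmap of the correlation matrix with the coefficient printed in each tile
ggplot(r_long, aes(x = var1, y = var2, fill = corr)) +
  geom_tile() +
  geom_text(aes(label = round(corr, 2)), size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red") +
  labs(x = NULL, y = NULL, fill = "Correlation") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Correlations with the number of trail users, smallest to largest
sort(r[, "volume"])
```

The `volume` column of the matrix is the one to read for questions about the number of trail users.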
Variables that have a positive relationship with the number of trail users are hightemp, avgtemp, lowtemp, and summer.

## Q6 What season seems to be least popular for trail users?

The season that seems to be the most popular for trail users is summer, because it has the highest positive correlation of all the seasons.

## Q7 The correlation coefficient between hightemp and cloudcover is quite small. Would you be sure that the two variables are not related at all? Hint: One word answer (e.g., yes or no) is NOT enough. Explain why.
# hightemp and cloudcover are columns of RailTrail, not separate data sets
data("RailTrail", package = "mosaicData")
cor(RailTrail$hightemp, RailTrail$cloudcover)
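One reason a small coefficient cannot settle the question on its own is that Pearson's correlation measures only linear association. A minimal sketch with made-up data (not the RailTrail data), where `y` is completely determined by `x` and the correlation is still essentially zero:

```r
# Made-up data for illustration only
x <- seq(-3, 3, by = 0.5)
y <- x^2   # y depends entirely on x, but not linearly

# Pearson correlation is essentially 0 because the curve is symmetric around x = 0
cor(x, y)

# A scatterplot makes the curved relationship obvious
plot(x, y)
```

Plotting hightemp against cloudcover in the same way would show whether a nonlinear pattern is hiding behind the small coefficient.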