ANLY-545_Lab#3

The goal is to generate a dataset which shows the price people paid for three products

set.seed(234)           
prod <- expand.grid(sex = c("M", "F"), prod = c("A","B","C"))
#prod <- expand.grid(sex = c("M", "F"), prod = LETTERS[1:3]) same as above line of code
poin<-rpois(n=6, lambda=4)
poin

## [1] 5 5 1 5 1 5

prod$price <- poin
prod

##   sex prod price
## 1   M    A     5
## 2   F    A     5
## 3   M    B     1
## 4   F    B     5
## 5   M    C     1
## 6   F    C     5

Find the best distribution for the data

library(vcd)

## Loading required package: grid

library(grid)
library(gnm)
library(vcdExtra)
WomenQueue

## nWomen
##  0  1  2  3  4  5  6  7  8  9 10 
##  1  3  4 23 25 19 18  5  1  1  0

#lecture#3
Ord_plot(WomenQueue, main = "Queue", gp=gpar(cex=1), pch=16)

Using Ord_plot command from R, we found that the best distribution for our data is Binomal distribution with no. of frequencies=10 and estimated probability =0.53.

Find the row proportions for this table, i.e., the proportions of patients with differing length of stay for each value of visit frequency. What do you see there that relates to the question of independence of the table variables?

data("Hospital", package = "vcd")
str(Hospital)

##  table [1:3, 1:3] 43 6 9 16 11 18 3 10 16
##  - attr(*, "dimnames")=List of 2
##   ..$ Visit frequency: chr [1:3] "Regular" "Less than monthly" "Never"
##   ..$ Length of stay : chr [1:3] "2-9" "10-19" "20+"

prop.table (Hospital,1)

##                    Length of stay
## Visit frequency           2-9     10-19       20+
##   Regular           0.6935484 0.2580645 0.0483871
##   Less than monthly 0.2222222 0.4074074 0.3703704
##   Never             0.2093023 0.4186047 0.3720930

Out results indicate that the frequency of visits is reduced over the length of stay: For people staying 2-9 years, people will get more frequent visitors (69.35%) For people staying 10-19 years, the frequency of regular visits will be very low (25.8%), they may however get visitors in less than a month approx 40% of the time For people staying 20+ years, the frequency of regular visits is reduced greatly (only 4.8%).

Carry out a ??2 test for association between the two variables. What do you conclude?

chisq.test(Hospital)

## 
##  Pearson's Chi-squared test
## 
## data:  Hospital
## X-squared = 35.171, df = 4, p-value = 4.284e-07

Our results indicate that since p <0.05 there is a an association between the frequency of visits and the length of stay.

Say you don’t trust the asymptotic p-value from the test because the sample size is relatively small. See help(chisq.test)for how to get a Monte Carlo simulated p-value.

chisq.test(Hospital, simulate=TRUE)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  Hospital
## X-squared = 35.171, df = NA, p-value = 0.0004998

Our results indicate that since p value is still below 0.05, there is no change in the results.

Use MASS::loglm() to carry out the standard ??2 test.

library(MASS)
H_lm <- loglm(~1+2, data=Hospital, fitted=TRUE)
H_lm

## Call:
## loglm(formula = ~1 + 2, data = Hospital, fitted = TRUE)
## 
## Statistics:
##                       X^2 df     P(> X^2)
## Likelihood Ratio 38.35297  4 9.475535e-08
## Pearson          35.17109  4 4.284198e-07

Since p-value for both of the chi square statistics in our model are <0.05 this indicates an association between both the variables.

Use assocstats() to compute association statistics. How would you describe the strength of association here? Review Strength of association: ?? coefficient, contingency coefficient (C), Cramer’s V (0???V???1)

assocstats(Hospital)

##                     X^2 df   P(> X^2)
## Likelihood Ratio 38.353  4 9.4755e-08
## Pearson          35.171  4 4.2842e-07
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.459 
## Cramer's V        : 0.365

Cramer’s V is positively related with Chi-square statistics. However the value is only 0.3 which mean there is a weak association. If the Cramer’s V was close to 1 it would indicate a strong association between the variables.

Both variables can be considered ordinal, so CMHtest() may be useful here. Carry out that analysis. Do any of the tests lead to different conclusions?

CMHtest(Hospital)

## Cochran-Mantel-Haenszel Statistics for Visit frequency by Length of stay 
## 
##                  AltHypothesis  Chisq Df       Prob
## cor        Nonzero correlation 29.138  1 6.7393e-08
## rmeans  Row mean scores differ 34.391  2 3.4044e-08
## cmeans  Col mean scores differ 29.607  2 3.7233e-07
## general    General association 34.905  4 4.8596e-07

Our results indicate that the p-value is still less than 0.05 and both row and column factors are ordered factors so the association between the two variables i.e. frequency of visits and length of stay still holds true.

Try mosaic() function for visualizing two-way contingency tables with this data. Give a simple description of the pattern of association between visits and length of stay.

library(vcd)
mosaic(Hospital, shade=TRUE, labeling=labeling_residuals())

From Mosaic Plot, we find association between the visit and length of stay, especially about the corner effect.

If one stays for a very short time, he can expect to get more frequent visits which is also confirmed by the negative association with short stay and never visited. Similarly, if one stays very long (i.e. 20+), there is a higher higher likelihood of no visitors which is also confirmed by the negative association with long stay and regular visitors.

Conduct correspondence analysis to see if it confirms your previous result

library(ca)
hosp <- ca(Hospital)
str(hosp)

## List of 16
##  $ sv        : num [1:2] 0.51615 0.00633
##  $ nd        : logi NA
##  $ rownames  : chr [1:3] "Regular" "Less than monthly" "Never"
##  $ rowmass   : num [1:3] 0.47 0.205 0.326
##  $ rowdist   : num [1:3] 0.548 0.473 0.494
##  $ rowinertia: num [1:3] 0.1412 0.0457 0.0795
##  $ rowcoord  : num [1:3, 1:2] -1.0624 0.916 0.9568 0.0154 -1.7464 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "Regular" "Less than monthly" "Never"
##   .. ..$ : chr [1:2] "Dim1" "Dim2"
##  $ rowsup    : logi(0) 
##  $ colnames  : chr [1:3] "2-9" "10-19" "20+"
##  $ colmass   : num [1:3] 0.439 0.341 0.22
##  $ coldist   : num [1:3] 0.544 0.229 0.734
##  $ colinertia: num [1:3] 0.1303 0.0179 0.1183
##  $ colcoord  : num [1:3, 1:2] -1.055 0.443 1.422 -0.404 1.318 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "2-9" "10-19" "20+"
##   .. ..$ : chr [1:2] "Dim1" "Dim2"
##  $ colsup    : logi(0) 
##  $ N         : num [1:3, 1:3] 43 6 9 16 11 18 3 10 16
##  $ call      : language ca.matrix(obj = obj)
##  - attr(*, "class")= chr "ca"

plot(hosp)

Our correspondence analysis confirms that the patients with shorter stay are more likely to get regular visitors as compared to patients with longer stay of more than 20+ years

One slight complication here is that there are 8 cells with zero frequencies. Four of these (male and female children in 1st and 2nd class who died) should be considered sampling zeros, but 4 (children among the crew) should probably be considered structural zeros (cells where data could not occur. Suggest a method to take care of these zeros.

Titanic

## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20

Titanic <- Titanic+0.5
Titanic

## , , Age = Child, Survived = No
## 
##       Sex
## Class   Male Female
##   1st    0.5    0.5
##   2nd    0.5    0.5
##   3rd   35.5   17.5
##   Crew   0.5    0.5
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class   Male Female
##   1st  118.5    4.5
##   2nd  154.5   13.5
##   3rd  387.5   89.5
##   Crew 670.5    3.5
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class   Male Female
##   1st    5.5    1.5
##   2nd   11.5   13.5
##   3rd   13.5   14.5
##   Crew   0.5    0.5
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class   Male Female
##   1st   57.5  140.5
##   2nd   14.5   80.5
##   3rd   75.5   76.5
##   Crew 192.5   20.5

It is natural to consider Survival as the natural response variable, and the remaining variables as explanatory. Therefore, all models should include the high-order term among age, gender and class. Therefore, the minimal null model is [AGC][S], which asserts that survival is jointly independent of Age, Sex and Class. Fit this model, and obtain a mosaic plot. Interpret the pattern of the residuals in this mosaic plot.

#Finding Fit
titanic.loglm1 <- loglm(Freq~  (Age * Sex * Class) + Survived, data= Titanic)
summary(titanic.loglm1)

## Formula:
## Freq ~ (Age * Sex * Class) + Survived
## attr(,"variables")
## list(Freq, Age, Sex, Class, Survived)
## attr(,"factors")
##          Age Sex Class Survived Age:Sex Age:Class Sex:Class Age:Sex:Class
## Freq       0   0     0        0       0         0         0             0
## Age        1   0     0        0       1         1         0             1
## Sex        0   1     0        0       1         0         1             1
## Class      0   0     1        0       0         1         1             1
## Survived   0   0     0        1       0         0         0             0
## attr(,"term.labels")
## [1] "Age"           "Sex"           "Class"         "Survived"     
## [5] "Age:Sex"       "Age:Class"     "Sex:Class"     "Age:Sex:Class"
## attr(,"order")
## [1] 1 1 1 1 2 2 2 3
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
## 
## Statistics:
##                       X^2 df P(> X^2)
## Likelihood Ratio 659.3162 15        0
## Pearson          643.1600 15        0

#Obtaining a Mosaic Plot and interpreting the pattern of residuals
mosaic(titanic.loglm1, main ="model [AGC][S]",labeling = labeling_residuals())

Our mosiac plot shows that overall 1st and 2nd class adult females are most likely to survive, however except for class 1st, males are less likely to survive. Female child in class 2nd is more likely to survive than male child in the same class.

Fit a main effects model for survival, [AGC][AS][GS][CS], that includes an association of survival with each of age, gender and class. Is this an adequate fit? What does the pattern of residuals tell you about remaining associations?

#Finding Fit
(titanic.loglm2 <- loglm(~ (Class * Age * Sex) + Survived*(Class + Age + Sex), data=Titanic))

## Call:
## loglm(formula = ~(Class * Age * Sex) + Survived * (Class + Age + 
##     Sex), data = Titanic)
## 
## Statistics:
##                       X^2 df P(> X^2)
## Likelihood Ratio 105.5531 10        0
## Pearson          101.4396 10        0

#Obtaining a Mosaic Plot and interpreting the pattern of residuals
mosaic(titanic.loglm2, main ="model[AGC][AS][GS][CS]", labeling = labeling_residuals())

our mosaic plot shows that for class 2nd male child is more likely to survive while in the model above this same was true for female child. Overall, Class 3rd male adult has a higher survival rate and female child has a higher non survival rate. However in this analysis, overall the number of associations have significantly reduced compared the model above without interactions of survival with class, Age and Sex.

What model would you use to allow an interaction of Age and Gender in their effect on Survival? Fit this model as above, and obtain the mosaic plot.Compare the three models using anova()

#Finding Fit
(titanic.loglm3 <- loglm(~ (Class * Age * Sex) + Survived*(Class + Age * Sex), data=Titanic))

## Call:
## loglm(formula = ~(Class * Age * Sex) + Survived * (Class + Age * 
##     Sex), data = Titanic)
## 
## Statistics:
##                       X^2 df     P(> X^2)
## Likelihood Ratio 85.51508  9 1.287859e-14
## Pearson          78.83273  9 2.755574e-13

summary(titanic.loglm3)

## Formula:
## ~(Class * Age * Sex) + Survived * (Class + Age * Sex)
## attr(,"variables")
## list(Class, Age, Sex, Survived)
## attr(,"factors")
##          Class Age Sex Survived Class:Age Class:Sex Age:Sex Class:Survived
## Class        1   0   0        0         1         1       0              1
## Age          0   1   0        0         1         0       1              0
## Sex          0   0   1        0         0         1       1              0
## Survived     0   0   0        1         0         0       0              1
##          Age:Survived Sex:Survived Class:Age:Sex Age:Sex:Survived
## Class               0            0             1                0
## Age                 1            0             1                1
## Sex                 0            1             1                1
## Survived            1            1             0                1
## attr(,"term.labels")
##  [1] "Class"            "Age"              "Sex"             
##  [4] "Survived"         "Class:Age"        "Class:Sex"       
##  [7] "Age:Sex"          "Class:Survived"   "Age:Survived"    
## [10] "Sex:Survived"     "Class:Age:Sex"    "Age:Sex:Survived"
## attr(,"order")
##  [1] 1 1 1 1 2 2 2 2 2 2 3 3
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 0
## attr(,".Environment")
## <environment: R_GlobalEnv>
## 
## Statistics:
##                       X^2 df     P(> X^2)
## Likelihood Ratio 85.51508  9 1.287859e-14
## Pearson          78.83273  9 2.755574e-13

#Obtaining a Mosaic Plot and interpreting the pattern of residuals
mosaic(titanic.loglm3,  main="Model [AGC][AS][GS][CS][AGS]")

# compare the models using Anova
anova(titanic.loglm1,titanic.loglm2, titanic.loglm3, test= "Chisq")

## LR tests for hierarchical log-linear models
## 
## Model 1:
##  Freq ~ (Age * Sex * Class) + Survived 
## Model 2:
##  ~(Class * Age * Sex) + Survived * (Class + Age + Sex) 
## Model 3:
##  ~(Class * Age * Sex) + Survived * (Class + Age * Sex) 
## 
##            Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1   659.31623 15                                    
## Model 2   105.55308 10  553.76314         5          0e+00
## Model 3    85.51508  9   20.03800         1          1e-05
## Saturated   0.00000  0   85.51508         9          0e+00

After we allowed more interactions between the variables, mosaic plot showed that remaining association reduced a lot.This plot indicates that overall male in 3rd class has higher survival rate overall compared to females. Our model comparison using ANOVA shows that the association between variables reduced as we increased the no. of interactions between different variables.