library(tidyverse)
library(knitr)
library(Hmisc)

options(scipen=999)

Categorical Data Analysis for Goodness of Fit Tests, Test of Independence, and Comparison of Multiple Proportions

Questions

  1. The Wall Street Journal has metrics by which exchange-traded funds can be called “winners” and “losers.” The above analysts (A–E) choose exchange-traded funds to add to their portfolios, which previously consisted only of stocks. Here is the summary table:

    In R, the above table can be entered as follows using the concatenate function, c(), which creates a vector of values.

      Analysts <- c(A=5, B=8, C=15, D=20, E=12)
      Analysts
    ##  A  B  C  D  E 
    ##  5  8 15 20 12

    That said, normally we will have more than a single variable, so a data frame would usually be entered instead. Here, though, because the data are already summarized (i.e., frequencies are given), we can create a table.

    # Build the data frame directly; wrapping the columns in cbind() would
    # coerce Frequency to character
    Analysts_df <- data.frame(
                 Analyst=c("A", "B", "C", "D", "E"),
                 Frequency=c(5, 8, 15, 20, 12))
    Analysts_df

    Are there any differences among the traders in the number of exchange-traded funds added to their portfolios?

    1. What type of statistical procedure is needed to evaluate the question (be as specific as possible)?
      Answer:

         We need a chi-square goodness-of-fit test across the five categories, with equal expected proportions under the null hypothesis.
      
         ```r
         obs <- c(5, 8, 15, 20, 12)
         prob <- obs/sum(obs)
      
         chisq.test(c(5, 8, 15, 20, 12), p=rep(.2, 5))
         ```
      
         ```
         ## 
         ##    Chi-squared test for given probabilities
         ## 
         ## data:  c(5, 8, 15, 20, 12)
         ## X-squared = 11.5, df = 4, p-value = 0.02148
         ```
      
         ```r
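         # Sanity check only: testing against the (rounded) observed proportions
         # themselves gives a statistic near 0 and a p-value of 1
         chisq.test(c(5, 8, 15, 20, 12), p=c(.0834,.1333,.25,.3333,.2))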
         ```
      
         ```
         ## 
         ##    Chi-squared test for given probabilities
         ## 
         ## data:  c(5, 8, 15, 20, 12)
         ## X-squared = 0.0000038976, df = 4, p-value = 1
         ```
    2. What is the value of the test statistic?
      Answer:

         X-squared = 11.5  
    3. What is the \(p\)-value to test the appropriate null hypothesis?
      Answer:

         p-value = 0.02148  
    4. What is the conclusion?
      Answer:

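         Reject the null hypothesis at the .05 level: the analysts differ in the number of exchange-traded funds added to their portfolios.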
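
    As a check on the output above, the same statistic can be computed from the definitional formula. This is a minimal sketch in base R; the variable names (`expected`, `x2`, `p_value`) are illustrative and not part of the original answer.

    ```r
    # Observed counts for analysts A-E (from the summary above)
    obs <- c(A = 5, B = 8, C = 15, D = 20, E = 12)

    # Under the null hypothesis every analyst has the same expected count
    expected <- rep(sum(obs) / length(obs), length(obs))

    # Definitional chi-square statistic: sum of (O - E)^2 / E
    x2 <- sum((obs - expected)^2 / expected)
    df <- length(obs) - 1

    # Upper-tail probability from the chi-square distribution
    p_value <- pchisq(x2, df = df, lower.tail = FALSE)

    round(c(statistic = x2, df = df, p_value = p_value), 5)
    ```

    The statistic and p-value should agree with the `chisq.test()` output for equal probabilities (11.5 on 4 degrees of freedom).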
  2. The tabled data here show the frequencies for the number of meetings the CEO has with the different types of employees about retention issues. Employees fall into these categories in the following percentages: C-Suite (7%), Midlevel (20%), New Managers (25%), Senior Staff (30%), and Other Staff (18%). Does the CEO meet with the different types of employees at a rate consistent with the number of such employees in those positions?

    | Employee Type | Frequency |
    |---|---|
    | C-Suite | 5 |
    | Midlevel | 8 |
    | New Managers | 15 |
    | Senior Staff | 20 |
    | Other Staff | 12 |

    1. What type of statistical procedure is needed to evaluate the question (be as specific as possible)?
      Answer: A chi-square goodness-of-fit test with specified (unequal) expected probabilities across the categories.

      chisq.test(c(5,8,15,20,12), p=c(.07,.2,.25,.3,.18))
      ## Warning in chisq.test(c(5, 8, 15, 20, 12), p = c(0.07, 0.2, 0.25, 0.3, 0.18)):
      ## Chi-squared approximation may be incorrect
      ## 
      ##   Chi-squared test for given probabilities
      ## 
      ## data:  c(5, 8, 15, 20, 12)
      ## X-squared = 1.8413, df = 4, p-value = 0.7649
    2. What is the value of the test statistic?
      Answer:

         X-squared = 1.8413
    3. What is the \(p\)-value to test the appropriate null hypothesis?
      Answer:

         p-value = 0.7649
    4. What is the conclusion?
      Answer:
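      Fail to reject the null hypothesis. There is no statistical evidence that the observed frequencies differ from the expected frequencies, so the CEO's meetings are consistent with the proportions of employees in those positions.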
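
    The warning in the output above arises because one expected count falls below 5. The sketch below, in base R, shows the expected counts and the definitional statistic, and then reruns the test with a Monte Carlo p-value as one possible response to the warning. The seed and variable names are illustrative assumptions, not part of the original answer.

    ```r
    # Observed meeting counts and the stated population proportions
    obs <- c(5, 8, 15, 20, 12)
    p0  <- c(.07, .20, .25, .30, .18)
    names(obs) <- c("C-Suite", "Midlevel", "New Managers", "Senior Staff", "Other Staff")

    # Expected counts under the null hypothesis: n * p
    expected <- sum(obs) * p0
    names(expected) <- names(obs)
    expected

    # Cells with an expected count below 5 trigger the approximation warning
    expected[expected < 5]

    # Definitional chi-square statistic
    x2 <- sum((obs - expected)^2 / expected)
    x2

    # Optional: simulated p-value, which does not rely on the chi-square approximation
    set.seed(1)  # arbitrary seed, for reproducibility only
    sim <- chisq.test(obs, p = p0, simulate.p.value = TRUE)
    sim$p.value
    ```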

  3. Social networking is hugely influential in today’s society. The Pew Research Center used a survey of adults in different countries to determine the percentage of adults who use social networking sites. The results are given below in tabular form. Do this question with the definitional formulas and paste the relevant parts of your spreadsheet below for part a. Of course, you may use a program to check your answers.

    | Country | Yes | No |
    |---|---|---|
    | Great Britain | 344 | 456 |
    | Israel | 265 | 235 |
    | Russia | 301 | 399 |
    | United States | 500 | 500 |

    1. Conduct a hypothesis test to determine if the proportion of social networking users is the same across the four countries.

      yes <- c(344,265,301,500)
      no <- c(456,235,399,500)
      
      country <- as.table(cbind(yes,no))
      
      # Label rows by country and columns by social networking use
      dimnames(country) <- list(Country = c('GB','Israel','Russia','US'),social = c('yes','no'))
      
      chisq.test(country)
      ## 
      ##    Pearson's Chi-squared test
      ## 
      ## data:  country
      ## X-squared = 20.474, df = 3, p-value = 0.0001354
      1. From your above calculations, what is the value of the test statistic?
        Answer:

           X-squared = 20.474  
      2. What is the \(p\)-value?
        Answer:

           p-value = 0.0001354
      3. Using a .05 level of significance, what is your conclusion and brief summary?
        Answer: Reject the null hypothesis at the .05 level. The proportion of social networking users is not the same across the four countries.
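
    Since this question asks for the definitional formulas, the calculation can also be laid out step by step. The following is a sketch in base R rather than the spreadsheet the question refers to; the object names (`counts`, `expected`, `x2`) are illustrative.

    ```r
    # Observed counts: rows are countries, columns are social networking use
    counts <- rbind(GB     = c(344, 456),
                    Israel = c(265, 235),
                    Russia = c(301, 399),
                    US     = c(500, 500))
    colnames(counts) <- c("yes", "no")

    # Expected count for each cell: (row total * column total) / grand total
    n <- sum(counts)
    expected <- outer(rowSums(counts), colSums(counts)) / n
    expected

    # Definitional chi-square statistic and degrees of freedom
    x2 <- sum((counts - expected)^2 / expected)
    df <- (nrow(counts) - 1) * (ncol(counts) - 1)
    p_value <- pchisq(x2, df = df, lower.tail = FALSE)

    c(statistic = x2, df = df, p_value = p_value)
    ```

    The result should match the `chisq.test(country)` output above.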

  4. The Wall Street Journal Corporate Perceptions Study surveyed readers and asked how each rated the quality of management and the reputation of the company for over 250 worldwide corporations. Both the quality of management and the reputation of the company were rated on an excellent, good, and fair categorical scale. The sample included 200 respondents.

    | Management (rows) / Reputation (columns) | Excellent | Good | Fair |
    |---|---|---|---|
    | Excellent | 40 | 25 | 5 |
    | Good | 35 | 35 | 10 |
    | Fair | 25 | 10 | 15 |

    1. Use a .05 level of significance and test for independence of the quality of management and the reputation of the company.

      reputation <- as.table(rbind(
        c(40,25,5),
        c(35,35,10),
        c(25,10,15)))
      
      
      dimnames(reputation) <- list(management = c('Excellent','Good','Fair'),
                                   reputation = c('Excellent','Good','Fair'))
      
      chisq.test(reputation)
      ## 
      ##  Pearson's Chi-squared test
      ## 
      ## data:  reputation
      ## X-squared = 17.028, df = 4, p-value = 0.001909
      1. What is the value of the test statistic?
        Answer:

           X-squared = 17.028  
      2. What is the \(p\)-value?
        Answer:

           p-value = 0.001909 
      3. What is the conclusion?
        Answer: Reject the null hypothesis at the .05 level; quality of management and reputation of the company are not independent.

    2. If there is a lack of independence (i.e., some dependence or association) between the two ratings, discuss and use probabilities to justify your answer.

      40+35+25+25+35+10+5+10+15
      ## [1] 200
      excellent <- c(40,35,25)
      good <- c(25,35,10)
      fair <- c(5,10,15)
      data <- cbind(excellent,good, fair)

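      Answer: The cell counts sum to 200, so the joint probabilities sum to 1. The dependence shows up in the conditional probabilities, which would be equal across rows under independence. For example, P(reputation Fair | management Fair) = 15/50 = .30, compared with 5/70 ≈ .07 when management is rated Excellent and 30/200 = .15 overall.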
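
    One way to obtain the conditional probabilities for this kind of justification is `prop.table()` with row margins. This is a minimal sketch in base R; the object names `cond` and `marginal` are illustrative.

    ```r
    # Observed counts: rows are management ratings, columns are reputation ratings
    reputation <- matrix(c(40, 25,  5,
                           35, 35, 10,
                           25, 10, 15),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(management = c("Excellent", "Good", "Fair"),
                                         reputation = c("Excellent", "Good", "Fair")))

    # P(reputation | management): each row is divided by its row total
    cond <- prop.table(reputation, margin = 1)
    round(cond, 3)

    # Marginal P(reputation), the value every row would share under independence
    marginal <- colSums(reputation) / sum(reputation)
    round(marginal, 3)
    ```

    Rows of `cond` that differ noticeably from `marginal` indicate where the association comes from.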

  5. Three cities – Anchorage, Atlanta, and Minneapolis – were chosen to examine the proportion of married-couple households in which both spouses were in the workforce (instead of only one). Couples in which neither spouse works are not considered in this analysis.

    | City | Both Work | One Works |
    |---|---|---|
    | Anchorage | 57 | 33 |
    | Atlanta | 70 | 50 |
    | Minneapolis | 63 | 27 |

    both <- c(57,70,63) 
    one <- c(33,50,27)
    
    work <- cbind(both,one)
    chisq.test(work)
    ## 
    ##    Pearson's Chi-squared test
    ## 
    ## data:  work
    ## X-squared = 3.0144, df = 2, p-value = 0.2215
    1. Conduct a hypothesis test to determine if the population proportion of married couples with both husband and wife in the workforce is the same for the three cities using a .05 Type I error rate.
      1. What is the value of the test statistic?
        Answer:

           X-squared = 3.0144 
      2. What is the \(p\)-value?
        Answer:

           p-value = .2215
      3. What is the conclusion?
        Answer: We fail to reject the null hypothesis; there is no statistical evidence that the proportion of couples with both spouses in the workforce differs across the three cities.

    2. Consider the proportion of married couples with both husband and wife in the workforce.
      The confidence interval below uses inf_for_prop(), a function brought in from the proportions assignment.
      1. What is the point estimate?

        (57+70+63)/(57+33+70+50+63+27)
        ## [1] 0.6333333
      2. What is the 95% confidence interval?

          inf_for_prop(.6333,300,.33,.05,2,'equal')
        ## Standard error Null True:  0.02714774 
        ## Standard error Null False:  0.02782272 
        ## Critical Values:  -1.959964 1.959964 
        ## Margin of Error:  0.05453153 
        ## Confidence Interval:  0.5787685 0.6878315 
        ## Z-value:  11.1722 
        ## P-Value 0
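
    Because `inf_for_prop()` comes from another assignment, the point estimate and interval can also be reproduced in a few lines of base R. This sketch assumes a standard Wald (normal approximation) interval, which appears consistent with the output above; the variable names are illustrative.

    ```r
    # Pooled counts across the three cities
    both <- c(57, 70, 63)   # couples with both spouses working
    one  <- c(33, 50, 27)   # couples with one spouse working
    n    <- sum(both) + sum(one)

    # Point estimate of the proportion with both spouses in the workforce
    p_hat <- sum(both) / n

    # Wald standard error and 95% confidence interval
    se <- sqrt(p_hat * (1 - p_hat) / n)
    z  <- qnorm(0.975)
    ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)

    p_hat
    ci
    ```

    Small differences from the output above in the fourth decimal place are expected, since that call used the rounded estimate .6333.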
  6. Do question “by hand” (i.e., through the definitional formulas).
    Answer:

       See Above
  7. An evaluation of smart phones was conducted by a major manufacturer of smart phones to better understand the wants and needs of consumers, specifically people who do not currently use a smart phone but are likely to adopt one. Four age categories were used for participants, and a total of 1909 participants took part in the evaluation. Smart phone categories were Apple iOS, Android, BlackBerry, Other, and Unsure. The data file is available as a CSV file and as an SPSS file.

    device <- read.csv('https://www.dropbox.com/s/fp643ivxmfl8u36/SmartPhone.csv?dl=1')
    
    chisq.test(xtabs(~Site + Desired, data=device))
    ## 
    ##    Pearson's Chi-squared test
    ## 
    ## data:  xtabs(~Site + Desired, data = device)
    ## X-squared = 9.4605, df = 5, p-value = 0.09205
    1. Is there statistical evidence that participants in the Midwest and West differ with regards to their preference for smart phone operating systems? Explain.
      Answer: No. With a p-value of 0.09205, which exceeds .05, the difference in operating system preference between the sites is not statistically significant.

    For the questions that follow, ignore the variable Site

    1. What are the relevant descriptive statistics of the data?
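      Answer: Because Age, Sex, and Desired are categorical, the relevant descriptive statistics are frequency counts and proportions for each category (with the mode as the measure of central tendency), along with cross-tabulations of the variables.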

    2. Is there a statistically significant difference in the pattern of desired smart phone operating systems by age? Explain.

       chisq.test(xtabs(~Age + Desired, data=device))
      ## 
      ##    Pearson's Chi-squared test
      ## 
      ## data:  xtabs(~Age + Desired, data = device)
      ## X-squared = 164.61, df = 15, p-value < 0.00000000000000022

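      Yes. With p < .001, there is a statistically significant difference in the pattern of desired operating systems across the age categories.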

    3. Is there a statistical difference between males and females with regards to their preference for Apple iOS? Explain.

       ```r
       chisq.test(xtabs(~Sex + Desired, data=device))
       ```
      
       ```
       ## 
       ##  Pearson's Chi-squared test
       ## 
       ## data:  xtabs(~Sex + Desired, data = device)
       ## X-squared = 45.22, df = 5, p-value = 0.00000001309
       ```
      
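       Yes. With p < .001, males and females differ significantly in the overall pattern of desired operating systems. This omnibus test covers all operating system categories; the difference specific to Apple iOS is estimated in the parts that follow.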
      1. What is the estimate for the difference between males and females with regard to preference for Apple iOS?

        apple <- device %>%
          mutate(Desired = ifelse(Desired == 'Apple iOS','Yes','No')) %>%
          group_by(Desired, Sex) %>%
          summarise(count =n()) %>%
          mutate(Desired = as.factor(Desired),
                Sex = as.factor(Sex)) %>%
          spread(Desired, count) %>%
          mutate(ratio = Yes/(No+Yes))
        ## `summarise()` regrouping output by 'Desired' (override with `.groups` argument)
          apple
          apple$ratio[1]-apple$ratio[2]
        ## [1] 0.0865612
      2. What is the corresponding 95% confidence interval?
        Answer:

           [0.047, 0.131]
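
        The source does not show how this interval was obtained. One common approach is a Wald interval for the difference between two independent proportions, sketched below in base R. The helper `diff_prop_ci()` and the counts in the usage line are hypothetical, chosen only to illustrate the call; they are not the smart phone data.

        ```r
        # Hypothetical helper: Wald confidence interval for p1 - p2
        # x1, x2 = number of "successes" in each group; n1, n2 = group sizes
        diff_prop_ci <- function(x1, n1, x2, n2, conf_level = 0.95) {
          p1 <- x1 / n1
          p2 <- x2 / n2
          # Standard error of the difference (unpooled)
          se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
          z  <- qnorm(1 - (1 - conf_level) / 2)
          c(estimate = p1 - p2,
            lower    = (p1 - p2) - z * se,
            upper    = (p1 - p2) + z * se)
        }

        # Usage with made-up counts (NOT from the data set)
        example_ci <- diff_prop_ci(x1 = 60, n1 = 100, x2 = 45, n2 = 100)
        round(example_ci, 3)
        ```

        With the real data, `x1` and `x2` would be the numbers of each sex choosing Apple iOS and `n1` and `n2` the group totals.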
    4. Considering only the data from participants who chose one of the “Big 3” operating systems (i.e., Apple iOS, Android, and BlackBerry, thus making the data set based on 1,267 participants), are the three operating systems equally selected as the operating system of choice by likely smart phone upgraders? Explain.

      # Keep only respondents who chose one of the "Big 3" operating systems
      big_three <- device %>%
        filter(Desired %in% c('Apple iOS', 'BlackBerry 10', 'Android'))
      chisq.test(xtabs(~ Desired, data=big_three))
      ## 
      ##     Chi-squared test for given probabilities
      ## 
      ## data:  xtabs(~Desired, data = big_three)
      ## X-squared = 172.05, df = 2, p-value < 0.00000000000000022

      Answer:

         Because the p-value is less than 0.00000000000000022, we conclude that the three operating systems are not equally selected; there is a statistically significant difference among the "Big 3."
  8. Find or collect data that is appropriate for a goodness of fit, test of independence, or test of multiple proportions. Analyze the data and tell the story of what you have learned from the data. For the omnibus question that you address, supplement it with more targeted questions that help to explain some of the nuance. Provide all aspects necessary to fully understand the data, the questions, and your answers to the questions. Please supplement your answers with computer output.
    Answer:

     This particular survey looked at patient preparedness based on the specific surgical procedure each patient was about to undergo (e.g., heart failure, reconstructive knee surgery, childbirth). For our chi-square tests, we explored whether patients' answers to the survey question about managing their health prior to surgery were associated with their answers to subsequent questions about how prepared they were for the operation, the recovery, and the post-op follow-up appointment. Based on the chi-square tests of these three questions, responses about preparedness for the operation, the recovery, and the post-op follow-up appear to differ depending on whether the patient was already maintaining a healthy lifestyle.
    
     For a health data solutions company, these results are important because they can help us target patients who need more specific and detailed care for their upcoming operation. For example, these data suggest that hospitals should identify their more vulnerable patient populations ahead of time. Based on these results, it would be important to ask patients about their lifestyle and whether they feel it is being managed prior to their operation. If this population can be identified much earlier, we would be able to provide more relevant and detailed patient engagement efforts to prepare these patients for their surgery, recovery, and follow-up appointments.