Advanced Statistical Inference Midterm Exam

Section 1

Instructions

Using the following data, produce a visualization with patient_days_admitted on the x-axis and dollar_spent_per_patient on the y-axis. After producing this visualization, explain what this relationship means and how this relationship might guide a decision.

After looking at the bivariate relationship in the previous visualization, add department to the visualization as a grouping variable. Does this change your interpretation of the relationship or add any expanded understanding? If so, how? Does this visualization offer any additional explanatory power over the more simple visualization? Explain.

library(tidyverse)  # also attaches readr, dplyr, ggplot2, and purrr

sec1Link <- read_csv("https://www.nd.edu/~sberry5/data/visualizationData.csv")

Linear Regression

library(dplyr)

ggplot(sec1Link, aes(patient_days_admitted, dollar_spent_per_patient)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship Between Dollar Spent Per Patient & Days of Patient Admitted",
       x = "Days of Patient Admitted", y = "Dollar Spent per Patient") +
  scale_x_continuous(breaks = seq(0, 30, by = 5))

To read the plot more easily, I set the x-axis tick marks to every 5 days. I also tried every 2 days and a few other spacings, but none revealed an obvious trend linking days admitted to higher or lower spending. The regression line shows a moderate positive relationship between the two variables, but many points lie far from the line, so days admitted alone leaves much of the variation in spending unexplained. Breaking the data into subcategories should allow a better evaluation.
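One way to put a number on "many points lie far from the line" is to fit the same pooled regression with lm() and read off the slope and R-squared. The sketch below is only an illustration: the data frame sim and every value in it are simulated stand-ins, not the exam data, and the department names, baselines, and slopes are assumptions.

```r
# Hypothetical stand-in for sec1Link: every value below is simulated.
set.seed(42)
n <- 300
department <- sample(c("dept_a", "dept_b", "dept_c"), n, replace = TRUE)
days <- runif(n, 1, 30)

# Assumed department-specific baselines and per-day costs.
baseline <- c(dept_a = 500, dept_b = 1500, dept_c = 2500)
per_day  <- c(dept_a = 60, dept_b = 30, dept_c = 15)
dollars  <- baseline[department] + per_day[department] * days + rnorm(n, sd = 150)

sim <- data.frame(department,
                  patient_days_admitted = days,
                  dollar_spent_per_patient = unname(dollars))

# Pooled fit that ignores department, like the first plot.
pooled <- lm(dollar_spent_per_patient ~ patient_days_admitted, data = sim)
pooled_slope <- unname(coef(pooled)["patient_days_admitted"])
pooled_r2    <- summary(pooled)$r.squared

c(slope = pooled_slope, r_squared = pooled_r2)
```

A positive slope paired with a low R-squared is the numeric version of a trend line surrounded by widely scattered points.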

ggplot(sec1Link, aes(patient_days_admitted, dollar_spent_per_patient,
                     color = department, shape = department)) +
  geom_point(size = 2.5, alpha = 0.5) +
  geom_point(color = "black", size = 1.5, alpha = 0.6) +
  geom_smooth(method = "lm") +
  labs(title = "Relationship Between Dollar Spent Per Patient & Days of Patient Admitted",
       x = "Days of Patient Admitted", y = "Dollar Spent per Patient") +
  scale_x_continuous(breaks = seq(0, 30, by = 5))

Looking at the graph again, we get a much clearer picture of how dollars spent changes with days admitted. Each department treats its patients differently, so we cannot generalize across all the data without looking at each department separately. Within departments, we can now conclude that the longer a patient is admitted, the more is spent on that patient.
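The grouped plot is equivalent to fitting one regression line per department. As a rough sketch of that idea, the code below fits a separate slope for each department and compares the pooled and grouped R-squared. The data are simulated and the department names and parameters are assumptions, so the numbers say nothing about the real data.

```r
# Hypothetical stand-in for sec1Link: every value below is simulated.
set.seed(42)
n <- 300
department <- sample(c("dept_a", "dept_b", "dept_c"), n, replace = TRUE)
days <- runif(n, 1, 30)
baseline <- c(dept_a = 500, dept_b = 1500, dept_c = 2500)  # assumed
per_day  <- c(dept_a = 60, dept_b = 30, dept_c = 15)       # assumed
dollars  <- baseline[department] + per_day[department] * days + rnorm(n, sd = 150)
sim <- data.frame(department,
                  patient_days_admitted = days,
                  dollar_spent_per_patient = unname(dollars))

# One slope per department, mirroring the per-group geom_smooth() lines.
slopes <- sapply(split(sim, sim$department), function(d) {
  unname(coef(lm(dollar_spent_per_patient ~ patient_days_admitted, data = d))[2])
})

# Pooled model versus a model with department-specific intercepts and slopes.
pooled  <- lm(dollar_spent_per_patient ~ patient_days_admitted, data = sim)
grouped <- lm(dollar_spent_per_patient ~ patient_days_admitted * department, data = sim)
r2 <- c(pooled = summary(pooled)$r.squared, grouped = summary(grouped)$r.squared)

slopes
r2
```

Because the grouped model contains the pooled model as a special case, its R-squared can never be lower. The useful question is how much higher it is.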

Section 2

Instructions

Using the following data, formulate a hypothesis for training.sessions.attended.year’s effect on customer.satisfaction.scores.year. Please clearly state the relationship that you would expect to find. Using an appropriate technique from the general linear model, test your hypothesis and report your findings – interpretation of your model’s coefficients is critical. Describe your rationale for any data processing (e.g., centering) that you might undertake.

After reporting and interpreting your findings, conduct a post hoc power analysis to determine if you had a sufficient sample size to detect an effect. Discuss the results from your power analysis.

sec2Link <- read_csv("https://www.nd.edu/~sberry5/data/glmData.csv")

Hypothesis

Null hypothesis: There is no relationship between customer satisfaction scores and number of training sessions attended.

Alternative hypothesis: There is a positive relationship between customer satisfaction scores and number of training sessions attended.

sec2Test = lm(customer.satisfaction.scores.year ~ training.sessions.attended.year, data = sec2Link)
summary(sec2Test)


Call:
lm(formula = customer.satisfaction.scores.year ~ training.sessions.attended.year, 
    data = sec2Link)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.9118  -6.3126  -0.6134   7.0420  28.8038 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      61.3040     0.8570   71.53   <2e-16 ***
training.sessions.attended.year   2.4731     0.1218   20.30   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.47 on 527 degrees of freedom
Multiple R-squared:  0.4388,    Adjusted R-squared:  0.4377 
F-statistic: 412.1 on 1 and 527 DF,  p-value: < 2.2e-16

ggplot(sec2Link, aes(training.sessions.attended.year, customer.satisfaction.scores.year)) +
  geom_point(size = 2, color = "pink", alpha = 0.5) +
  geom_point(size = 0.5, color = "black", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Relationship Between Customer Satisfaction & # of Training Sessions Attended",
       x = "Number of Training Sessions Attended", y = "Customer Satisfaction Score") +
  scale_x_continuous(breaks = seq(0, 13, by = 1))

Analysis

The slope in the linear model has a significant p-value (< 2e-16): if there were truly no relationship between the predictor (number of training sessions attended) and the response (customer satisfaction score), a relationship this strong would be very unlikely to arise by chance. The t-value for the coefficient is also large (20.30), which is strong evidence against my null hypothesis. I therefore reject the null hypothesis in favor of my alternative hypothesis of a positive relationship. The model explains about 44% of the variance in satisfaction scores (R-squared = 0.4388).

The estimated intercept is 61.3040. This means that for people who attended no training sessions at all, the average satisfaction score was roughly 61. Each additional training session attended is associated with an average increase of 2.4731 points in the satisfaction score.
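To make the coefficient interpretation concrete, the fitted line can be evaluated by hand. The sketch below copies the two coefficients from the summary output above; the helper predict_score() is a hypothetical name introduced only for this illustration.

```r
# Coefficients copied from the summary(sec2Test) output.
intercept <- 61.3040
slope     <- 2.4731

# Hypothetical helper: expected satisfaction score for a given number of sessions.
predict_score <- function(sessions) intercept + slope * sessions

predict_score(c(0, 1, 5, 10))
# 61.3040 63.7771 73.6695 86.0350
```

Each step of one session moves the expected score by the slope, so five sessions add 5 * 2.4731 = 12.3655 points to the intercept.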

Centering the Data

sec2Center <- sec2Link %>% 
  mutate(trainingCenter = training.sessions.attended.year - mean(training.sessions.attended.year, na.rm = TRUE)) %>%
  lm(customer.satisfaction.scores.year ~ trainingCenter, data = .)

summary(sec2Center)


Call:
lm(formula = customer.satisfaction.scores.year ~ trainingCenter, 
    data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.9118  -6.3126  -0.6134   7.0420  28.8038 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     76.0441     0.4552   167.0   <2e-16 ***
trainingCenter   2.4731     0.1218    20.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.47 on 527 degrees of freedom
Multiple R-squared:  0.4388,    Adjusted R-squared:  0.4377 
F-statistic: 412.1 on 1 and 527 DF,  p-value: < 2.2e-16

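Because a value of zero on my continuous predictor is not especially meaningful (how informative are the ratings for people who attended no training sessions at all?), I centered the predictor by subtracting its mean.

After centering the independent variable, the estimated intercept changed to 76.0441, which is the expected satisfaction score for someone who attended the average number of training sessions. The slope (2.4731), its standard error, and the overall model fit are unchanged.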
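The effect of centering can be checked directly: the slope stays the same, and the centered intercept equals the raw intercept plus the slope times the predictor mean (which is also the mean of the response). The sketch below shows this on simulated data. The variables sessions and score are hypothetical stand-ins, not the exam data.

```r
# Simulated stand-ins for the two exam variables (all values are assumptions).
set.seed(1)
sessions <- rpois(200, lambda = 6)
score    <- 60 + 2.5 * sessions + rnorm(200, sd = 10)

raw      <- lm(score ~ sessions)
centered <- lm(score ~ I(sessions - mean(sessions)))

slope_raw          <- unname(coef(raw)[2])
slope_centered     <- unname(coef(centered)[2])
intercept_centered <- unname(coef(centered)[1])

# Centering shifts only the intercept.
all.equal(slope_raw, slope_centered)
all.equal(intercept_centered, unname(coef(raw)[1]) + slope_raw * mean(sessions))
all.equal(intercept_centered, mean(score))

# Applying the same identity to the reported coefficients gives the implied
# mean number of sessions in the exam data (about 5.96).
implied_mean <- (76.0441 - 61.3040) / 2.4731
implied_mean
```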

Power Analysis

library(pwr)

# Cohen's f2 effect size, computed here from the adjusted R-squared reported above
f2 = 0.4377 / (1 - 0.4377)
round(f2, digits = 4)

[1] 0.7784

pwr.f2.test(u = 1, v = 527, f2 = 0.7784, power = NULL)


     Multiple regression power calculation 

              u = 1
              v = 527
             f2 = 0.7784
      sig.level = 0.05
          power = 1

### in a post hoc power analysis, we supply v and solve for power: the probability of detecting an effect of this size if it exists
### v = # of rows in data - # of terms (intercept + predictors) = 529 - 2 = 527
### here power rounds to 1, so the sample was more than large enough to detect this effect

The sample size needed is only around 13, so with 529 observations we have more than sufficient data. The small required sample size also tells us that the effect size is large: f2 = 0.7784 is well above the conventional threshold of 0.35 for a large effect, so training sessions attended is strongly associated with satisfaction scores.
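The power figure and the required sample size can both be approximated in base R from the noncentral F distribution, using Cohen's noncentrality parameter lambda = f2 * (u + v + 1). This is a sketch under stated assumptions, not necessarily how the figure of 13 was obtained: the helper f2_power() is a hypothetical name, and the 0.80 target power is an assumption that the report does not state.

```r
# Hypothetical helper: power of the overall F test for a regression with
# u numerator and v denominator degrees of freedom and effect size f2.
f2_power <- function(v, f2, u = 1, sig.level = 0.05) {
  crit <- qf(1 - sig.level, df1 = u, df2 = v)
  pf(crit, df1 = u, df2 = v, ncp = f2 * (u + v + 1), lower.tail = FALSE)
}

# Post hoc power for the fitted model (v = 527, f2 = 0.7784).
post_hoc <- f2_power(v = 527, f2 = 0.7784)
post_hoc

# Smallest denominator df that reaches the ASSUMED 0.80 target power.
v_needed <- uniroot(function(v) f2_power(v, f2 = 0.7784) - 0.80,
                    interval = c(2, 100))$root

# Rows needed = v + number of model terms (intercept + one predictor).
n_needed <- ceiling(v_needed) + 2
n_needed
```

With an effect this large, the required sample is a small fraction of the 529 rows available, which is why the post hoc power is indistinguishable from 1.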

Section 3

Instructions

Consider the following A/B testing data. This data tracks a user’s time on page (timeOnPage) and the UI design (design). In A/B testing, we are concerned with the difference between the two groups on the outcome. Select the appropriate technique from the general linear model and determine if any significant differences exist between the two competing page designs. Describe your rationale for any data processing that you might undertake.

Discuss your results and indicate any actionable decision that comes from your analysis. Additionally, determine if your analyses were sufficiently powered.

sec3Link <- read_csv("https://www.nd.edu/~sberry5/data/abData.csv")

Hypothesis

Null hypothesis: There is no difference in mean time on page between the two competing page designs.

Alternative hypothesis: There is a difference in mean time on page between the two competing page designs.

I chose a two-sample t-test for this hypothesis test. R's t.test() runs the Welch version by default, which does not assume equal variances in the two groups:

sec3Test = t.test(sec3Link$MinutesOnPage ~ sec3Link$PageConfiguration,
       alternative = "two.sided")
sec3Test


    Welch Two Sample t-test

data:  sec3Link$MinutesOnPage by sec3Link$PageConfiguration
t = 94.396, df = 1043.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.885949 3.008477
sample estimates:
mean in group A mean in group B 
       5.974857        3.027644

Analysis

The test has a significant p-value (< 2.2e-16), indicating there is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. The mean time on page is clearly larger in group A (5.974857 minutes) than in group B (3.027644 minutes), and the 95% confidence interval for the difference runs from 2.885949 to 3.008477 minutes.

To make sure missing values (NAs) are not distorting the results, I checked a summary of the data frame.

summary(sec3Link)

 PageConfiguration  MinutesOnPage  
 Length:1059        Min.   :1.671  
 Class :character   1st Qu.:2.993  
 Mode  :character   Median :3.869  
                    Mean   :4.433  
                    3rd Qu.:5.918  
                    Max.   :7.378

glimpse(sec3Link)

Observations: 1,059
Variables: 2
$ PageConfiguration <chr> "B", "B", "B", "A", "B", "A", "A", "B", "A", "…
$ MinutesOnPage     <dbl> 2.687553, 2.102057, 3.172279, 6.863614, 3.3126…

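No missing values, so no further processing was needed. The Welch t-test reported 1043.9 degrees of freedom from 1,059 observations, and the confidence interval for the difference is very narrow relative to the difference itself, so the analysis appears to be sufficiently powered. The A/B test shows that, of the 2 competing page designs, users in group A spent about 2.9 more minutes on the page on average than users in group B. If longer time on page is the goal, design A is the one to adopt.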
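Degrees of freedom alone do not establish power, so a sensitivity check is a useful complement: how small a true difference could a study of this size detect? The sketch below uses power.t.test() from base R. The group standard deviation is not reported anywhere above, so it is back-calculated from the t statistic under two assumptions (equal group sizes and equal group SDs), and the 0.80 target power is also an assumption.

```r
# Values copied from the t-test output.
mean_diff <- 5.974857 - 3.027644   # observed difference in minutes
t_stat    <- 94.396
se_diff   <- mean_diff / t_stat    # standard error implied by the output

# ASSUMPTIONS (not in the report): equal group sizes and equal group SDs.
n_per_group <- floor(1059 / 2)
sd_assumed  <- se_diff / sqrt(2 / n_per_group)

# Smallest true difference detectable with the ASSUMED 80% power.
detectable <- power.t.test(n = n_per_group, sd = sd_assumed,
                           power = 0.80, sig.level = 0.05)$delta

c(sd_assumed = sd_assumed, detectable = detectable, observed = mean_diff)
```

If the detectable difference is far smaller than the observed one, the sample was more than adequate for an effect of this size.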

Section 4

Instructions

Using the following data, determine if there are any differences in the daily net profit of three different store locations. Select the appropriate test from the general linear model and determine if any significant differences exist. Describe your rationale for any data processing that you might undertake.

Discuss your results.

sec4Link <- read_csv("https://www.nd.edu/~sberry5/data/performanceData.csv")

Hypothesis

Null hypothesis: There are no differences in mean daily net profit among the three store locations.

Alternative hypothesis: At least one store location differs from the others in mean daily net profit.

ANOVA test

I used a one-way ANOVA to test whether daily net profit differs significantly among the 3 store locations.

sec4Test = aov(daily_net_profit_thousand ~ facility_location, data = sec4Link)
summary(sec4Test)

                    Df Sum Sq Mean Sq F value Pr(>F)    
facility_location    2   2986  1492.9   674.6 <2e-16 ***
Residuals         1092   2417     2.2                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value below 2e-16, we conclude that daily net profit differs significantly among the 3 locations. However, the ANOVA does not tell us which pairs of locations differ. We need a follow-up test to determine whether the mean differences between specific pairs of locations are statistically significant.
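The ANOVA table also gives the size of the location effect, not just its significance. The short calculation below uses only the sums of squares and degrees of freedom printed in the table above. The choice of eta-squared as the effect size measure is mine, and the variable names are illustrative.

```r
# Values copied from the ANOVA table.
ss_between <- 2986   # Sum Sq for facility_location
ss_within  <- 2417   # Sum Sq for Residuals
df_between <- 2
df_within  <- 1092

# Eta-squared: share of total variation in daily net profit explained by location.
eta_sq <- ss_between / (ss_between + ss_within)

# Rebuilding F from the mean squares as a consistency check against the table.
f_stat <- (ss_between / df_between) / (ss_within / df_within)

round(c(eta_squared = eta_sq, F = f_stat), 3)
```

Location accounts for roughly 55% of the variation in daily net profit, and the rebuilt F statistic agrees with the reported 674.6 up to rounding of the printed sums of squares.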

Tukey HSD

I computed Tukey's HSD to perform pairwise comparisons among the means of the 3 groups.

TukeyHSD(sec4Test)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = daily_net_profit_thousand ~ facility_location, data = sec4Link)

$facility_location
                           diff        lwr       upr     p adj
403 Barr-10 Maple     3.4505044  3.1920592  3.708950 0.0000000
710 Oakland-10 Maple -0.1025902 -0.3610354  0.155855 0.6203836
710 Oakland-403 Barr -3.5530947 -3.8115399 -3.294649 0.0000000

Analysis

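As the output shows, two of the three pairwise differences are highly significant, with adjusted p-values very close to 0: mean daily net profit at 403 Barr is about 3.45 thousand higher than at 10 Maple and about 3.55 thousand higher than at 710 Oakland. The difference between 710 Oakland and 10 Maple (-0.10) is not significant (adjusted p = 0.62).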
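Rather than reading the significant pairs off the printed table, they can be pulled out of the TukeyHSD() result programmatically. The sketch below does this on simulated data: the location names are reused from the output above, but the profit values, group sizes, and means are all assumptions.

```r
# Simulated stand-in for sec4Link (all profit values are assumptions).
set.seed(3)
sim <- data.frame(
  facility_location = factor(rep(c("10 Maple", "403 Barr", "710 Oakland"), each = 100)),
  daily_net_profit_thousand = c(rnorm(100, mean = 5,   sd = 1.5),
                                rnorm(100, mean = 8.5, sd = 1.5),
                                rnorm(100, mean = 5,   sd = 1.5))
)

fit <- aov(daily_net_profit_thousand ~ facility_location, data = sim)

# TukeyHSD() returns one matrix per factor; rows are pairs, columns include "p adj".
pairs <- TukeyHSD(fit)$facility_location

# Keep only the pairs whose adjusted p-value falls below 0.05.
significant <- pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]
rownames(significant)
```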

Now we can plot the data to show the differences among the locations:

library("ggpubr")
ggboxplot(sec4Link, x = "facility_location", y = "daily_net_profit_thousand", 
          color = "facility_location", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("403 Barr", "10 Maple", "710 Oakland"),
          ylab = "Daily Net Profit in Thousand", xlab = "Facility Location")

This boxplot supports the conclusion made earlier.

Section 5

Instructions

Using the following data, determine what variables influence a franchise’s ultimate outcome – failure or success. Using any variables available to you, select the appropriate method and test your model. Discuss your results and describe your rationale for any data processing that you might undertake.

sec5Link <- read_csv("https://www.nd.edu/~sberry5/data/outcomeData.csv")

First, I checked for missing values and inappropriate data types.

summary(sec5Link)

    storeID      outcomeClosedOpen employeeCount   
 Min.   :  1.0   Min.   :0.0000    Min.   : 3.623  
 1st Qu.:119.5   1st Qu.:0.0000    1st Qu.: 9.502  
 Median :238.0   Median :0.0000    Median :12.347  
 Mean   :238.0   Mean   :0.4758    Mean   :12.325  
 3rd Qu.:356.5   3rd Qu.:1.0000    3rd Qu.:15.028  
 Max.   :475.0   Max.   :1.0000    Max.   :22.114  
 dailyNetProfitThousands quartersWithHealthViolations peoplePerSqMile 
 Min.   : 2.495          Min.   :0.0000               Min.   : 63.91  
 1st Qu.: 3.967          1st Qu.:0.0000               1st Qu.:195.15  
 Median : 4.650          Median :0.0000               Median :240.73  
 Mean   : 6.348          Mean   :0.5874               Mean   :244.51  
 3rd Qu.: 8.944          3rd Qu.:1.0000               3rd Qu.:295.11  
 Max.   :10.298          Max.   :3.0000               Max.   :393.97

storeID is an identifier rather than a quantity, so I changed it to character.

sec5Link$storeID = as.character(sec5Link$storeID)

Correlation

library(corrplot)

sec5Link %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot(method = "number")

Because the variable outcomeClosedOpen indicates whether a business is still running or has closed, I treat it as the dependent variable and look for the variables most strongly correlated with it. Based on the graph, dailyNetProfitThousands is the most strongly correlated variable, with a correlation coefficient of 0.99; peoplePerSqMile is second, with a correlation coefficient of 0.87; and employeeCount is third, with a correlation coefficient of 0.64.
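A Pearson correlation between a numeric variable and a 0/1 outcome is a point-biserial correlation, and testing it is equivalent to a two-sample t-test with pooled variance. The sketch below demonstrates the equivalence on simulated data. The variables outcome and profit are hypothetical stand-ins, not the exam data.

```r
# Simulated stand-ins (all values are assumptions).
set.seed(11)
outcome <- rbinom(200, size = 1, prob = 0.5)        # 0/1 outcome
profit  <- 4 + 5 * outcome + rnorm(200, sd = 0.5)   # numeric predictor

r_test <- cor.test(profit, outcome)                    # tests the correlation
t_test <- t.test(profit ~ outcome, var.equal = TRUE)   # pooled-variance t-test

t_from_cor  <- unname(r_test$statistic)
t_from_test <- unname(t_test$statistic)

# Same magnitude; the sign differs only because t.test() reports group 0 minus group 1.
c(from_cor = t_from_cor, from_t_test = t_from_test)
all.equal(abs(t_from_cor), abs(t_from_test))
```

t.test() defaults to the Welch version, which does not pool variances, so its t statistics will generally differ slightly from the correlation-based ones.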

T-test

Now that I have those 3 variables, I want to test whether their relationships with the outcome are statistically significant. I decided to examine each variable separately, using a t-test to compare the two outcome groups.

test1 = t.test(dailyNetProfitThousands ~ outcomeClosedOpen, data = sec5Link)
test1


    Welch Two Sample t-test

data:  dailyNetProfitThousands by outcomeClosedOpen
t = -125.77, df = 466.93, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.053172 -4.897697
sample estimates:
mean in group 0 mean in group 1 
       3.980373        8.955807

test2 = t.test(peoplePerSqMile ~ outcomeClosedOpen, data = sec5Link)
test2


    Welch Two Sample t-test

data:  peoplePerSqMile by outcomeClosedOpen
t = -38.056, df = 472.98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -101.97249  -91.95904
sample estimates:
mean in group 0 mean in group 1 
       198.3724        295.3382

test3 = t.test(employeeCount ~ outcomeClosedOpen, data = sec5Link)
test3


    Welch Two Sample t-test

data:  employeeCount by outcomeClosedOpen
t = -18.341, df = 470.75, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.339915 -4.306419
sample estimates:
mean in group 0 mean in group 1 
       10.03001        14.85318

Analysis

All 3 tests have significant p-values, meaning the differences between the two outcome groups on these 3 variables are unlikely to be due to chance.

Based on the results, I conclude that daily net profit is key to a business. At the end of the day, how successful a business is comes down to money: the more dollars you make, the more successful your business can be. That is why, in test1, the mean daily net profit of open businesses (8.96 thousand) is more than double that of closed businesses (3.98 thousand).

Test2 tells me that store traffic is important too. Open businesses are located in more densely populated areas (about 295 versus 198 people per square mile), and the more customers you have in your store, the more money they spend there. Franchises should develop strategies that attract more potential customers to walk in, for example an attractive window display that persuades people who are just walking by to stop in.

Lastly, having more employees available in the store is helpful: open businesses average about 15 employees, compared with about 10 for closed businesses. As a customer, I find it less pleasant to shop in a store where I cannot find a clerk to help me when I need something. Customer service is essential for a successful business.