Advanced Statistical Inference Midterm Exam
Section 1
Instructions
Using the following data, produce a visualization with patient_days_admitted on the x-axis and dollar_spent_per_patient on the y-axis. After producing this visualization, explain what this relationship means and how this relationship might guide a decision.
After looking at the bivariate relationship in the previous visualization, add department to the visualization as a grouping variable. Does this change your interpretation of the relationship or add any expanded understanding? If so, how? Does this visualization offer any additional explanatory power over the more simple visualization? Explain.
Linear Regression
library(dplyr)
library(ggplot2)
ggplot(sec1Link,aes(patient_days_admitted, dollar_spent_per_patient))+
geom_point()+
geom_smooth(method = "lm")+
labs(title = "Relationship Between Dollar Spent Per Patient & Days of Patient Admitted",
x = "Days of Patient Admitted", y = "Dollar Spent per Patient")+
scale_x_continuous(breaks = seq(0, 30, by = 5))
To read the data more easily, I set the x-axis tick marks to every 5 days. I also tried every 2 days and a few other intervals, but I could not see an obvious trend linking the number of days admitted to higher or lower spending. Although the regression line shows a moderate positive relationship between the two variables, far too many points lie far from the line. Thus, we need to break the data down into subcategories for a better evaluation.
ggplot(sec1Link,aes(patient_days_admitted, dollar_spent_per_patient, color = department, shape = department))+
geom_point(size = 2.5, alpha = 0.5)+
geom_point(color = "black", size = 1.5, alpha = 0.6)+
geom_smooth(method = "lm")+
labs(title = "Relationship Between Dollar Spent Per Patient & Days of Patient Admitted",
x = "Days of Patient Admitted", y = "Dollar Spent per Patient")+
scale_x_continuous(breaks = seq(0, 30, by = 5))
Looking at the graph again, we get a much clearer picture of how dollars spent change with days admitted. Each department treats its patients differently, so we cannot generalize across all the data without looking at each department separately. We can now conclude that the longer a patient is admitted, the more is spent on that patient.
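The per-department reading of the plot can also be checked numerically. The following is a minimal sketch, not the analysis run for this exam: the data are simulated, and the department names (dept_a, dept_b, dept_c), slopes, and noise level are all hypothetical stand-ins for sec1Link.

```r
# Simulated stand-in for sec1Link; department names and all numbers are hypothetical.
set.seed(42)
n_per_dept <- 50
sim <- data.frame(
  department = rep(c("dept_a", "dept_b", "dept_c"), each = n_per_dept),
  patient_days_admitted = runif(3 * n_per_dept, min = 1, max = 30)
)

# Give each department its own (assumed) cost per additional day.
true_slope <- c(dept_a = 50, dept_b = 150, dept_c = 300)
sim$dollar_spent_per_patient <- 500 +
  unname(true_slope[sim$department]) * sim$patient_days_admitted +
  rnorm(nrow(sim), sd = 200)

# Fit one simple regression per department and keep only the slope.
dept_slopes <- sapply(split(sim, sim$department), function(d) {
  fit <- lm(dollar_spent_per_patient ~ patient_days_admitted, data = d)
  coef(fit)[["patient_days_admitted"]]
})

# Slope of the pooled model that ignores department, for comparison.
pooled_fit <- lm(dollar_spent_per_patient ~ patient_days_admitted, data = sim)
pooled_slope <- coef(pooled_fit)[["patient_days_admitted"]]

print(round(dept_slopes, 1))
print(round(pooled_slope, 1))
```

If the department slopes differ noticeably from the pooled slope, the single regression line is averaging over groups that behave differently, which is what the grouped plot suggests.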
Section 2
Instructions
Using the following data, formulate a hypothesis for training.sessions.attended.year’s effect on customer.satisfaction.scores.year. Please clearly state the relationship that you would expect to find. Using an appropriate technique from the general linear model, test your hypothesis and report your findings – interpretation of your model’s coefficients is critical. Describe your rationale for any data processing (e.g., centering) that you might undertake.
After reporting and interpreting your findings, conduct a post hoc power analysis to determine if you had a sufficient sample size to detect an effect. Discuss the results from your power analysis.
Hypothesis
Null hypothesis: There is no relationship between customer satisfaction scores and number of training sessions attended.
Alternative hypothesis: There is a positive relationship between customer satisfaction scores and number of training sessions attended.
sec2Test = lm(customer.satisfaction.scores.year ~ training.sessions.attended.year, data = sec2Link)
summary(sec2Test)
Call:
lm(formula = customer.satisfaction.scores.year ~ training.sessions.attended.year,
data = sec2Link)
Residuals:
Min 1Q Median 3Q Max
-30.9118 -6.3126 -0.6134 7.0420 28.8038
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 61.3040 0.8570 71.53 <2e-16 ***
training.sessions.attended.year 2.4731 0.1218 20.30 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.47 on 527 degrees of freedom
Multiple R-squared: 0.4388, Adjusted R-squared: 0.4377
F-statistic: 412.1 on 1 and 527 DF, p-value: < 2.2e-16
ggplot(sec2Link, aes(training.sessions.attended.year, customer.satisfaction.scores.year))+
geom_point(size = 2, color = "pink", alpha = 0.5)+
geom_point(size = 0.5, color = "black", alpha = 0.6)+
geom_smooth(method = "lm", color = "black")+
labs(title = "Relationship Between Customer Satisfaction & # of Training Sessions Attended",
x = "Number of Training Sessions Attended", y = "Customer Satisfaction Score")+
scale_x_continuous(breaks = seq(0, 13, by = 1))
Analysis
The model output has a significant p-value (< 2e-16): if there were truly no relationship between the predictor (number of training sessions attended) and the response (customer satisfaction score), a slope this large would be extremely unlikely to arise by chance. The t-value for the coefficient is also large (20.30), which is strong evidence against my null hypothesis. Thus, I reject the null hypothesis in favor of my alternative hypothesis.
The estimated intercept is 61.3040: people who attended no training sessions are expected to have a satisfaction score of roughly 61. The slope is 2.4731: each additional training session is associated with an increase of about 2.47 points in the expected satisfaction score. The model explains about 44% of the variance in satisfaction scores (R-squared = 0.4388).
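To make the coefficient interpretation concrete, here is a small sketch that plugs the reported estimates into the fitted line. The coefficients come from the summary output above; the helper name predict_score and the session counts are illustrative choices, not part of the original analysis.

```r
# Coefficients copied from the summary(sec2Test) output.
intercept <- 61.3040
slope <- 2.4731

# Expected satisfaction score for a given number of training sessions
# (hypothetical helper for illustration).
predict_score <- function(sessions) {
  intercept + slope * sessions
}

sessions <- c(0, 1, 5, 10)  # illustrative values only
print(data.frame(sessions = sessions, predicted_score = predict_score(sessions)))
```

Moving from 0 to 1 session changes the prediction by exactly the slope, which is what "each additional session adds about 2.47 points" means.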
Centering the Data
sec2Center <- sec2Link %>%
mutate(trainingCenter = training.sessions.attended.year - mean(training.sessions.attended.year, na.rm = TRUE)) %>%
lm(customer.satisfaction.scores.year ~ trainingCenter, data = .)
summary(sec2Center)
Call:
lm(formula = customer.satisfaction.scores.year ~ trainingCenter,
data = .)
Residuals:
Min 1Q Median 3Q Max
-30.9118 -6.3126 -0.6134 7.0420 28.8038
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 76.0441 0.4552 167.0 <2e-16 ***
trainingCenter 2.4731 0.1218 20.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.47 on 527 degrees of freedom
Multiple R-squared: 0.4388, Adjusted R-squared: 0.4377
F-statistic: 412.1 on 1 and 527 DF, p-value: < 2.2e-16
Since a value of zero for my continuous predictor is not especially meaningful (for people who attended no training sessions, how informative would their ratings be?), I centered the predictor by subtracting its mean.
After centering, the intercept changed to 76.0441, which is the expected satisfaction score at the average number of training sessions. The slope, its standard error, and the overall model fit are unchanged.
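The effect of centering can be verified on any data set: the slope stays the same and the intercept shifts by slope times the predictor mean. The sketch below uses simulated data (the Poisson session counts and noise level are assumptions, not the exam data). Applying the same identity to the reported output, (76.0441 - 61.3040) / 2.4731 implies a mean of about 5.96 training sessions.

```r
# Simulated data; all numbers are hypothetical.
set.seed(1)
n <- 200
x <- rpois(n, lambda = 6)              # stand-in for training sessions attended
y <- 60 + 2.5 * x + rnorm(n, sd = 10)  # stand-in for satisfaction scores

raw_fit <- lm(y ~ x)

# Center the predictor by subtracting its mean, then refit.
x_centered <- x - mean(x)
centered_fit <- lm(y ~ x_centered)

slope_raw <- coef(raw_fit)[["x"]]
slope_centered <- coef(centered_fit)[["x_centered"]]
intercept_shift <- coef(centered_fit)[["(Intercept)"]] - coef(raw_fit)[["(Intercept)"]]

print(c(slope_raw = slope_raw, slope_centered = slope_centered))
print(c(intercept_shift = intercept_shift, slope_times_mean = slope_raw * mean(x)))
```

In a simple regression the centered intercept also equals the mean of the response, which is why it reads naturally as "the expected score for an average attendee".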
Power Analysis
[1] 0.7784
Multiple regression power calculation
u = 1
v = 527
f2 = 0.7784
sig.level = 0.05
power = 1
### in a post hoc power analysis, we supply v and solve for power, i.e. the probability of detecting an effect of this size
### v = # of rows in data - # of terms (intercept + predictors)
### here we have power = 1, so the test was almost certain to detect an effect this large
The sample size needed is around 13, so we have more than sufficient data. Because the required sample size is so small, the effect size must be large (f2 = 0.7784), which suggests that the training is very effective.
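The code that produced this output is not shown, so the following is only a sketch of how the numbers could be reproduced with base R. It assumes f2 was computed from the adjusted R-squared (0.4377 / (1 - 0.4377) gives 0.7784, matching the output), and it uses the common noncentrality convention lambda = f2 * (u + v + 1); I believe pwr::pwr.f2.test follows the same convention, but treat that as an assumption. The 80% power target for the sample-size search is also assumed.

```r
adj_r2 <- 0.4377             # Adjusted R-squared from the model summary
f2 <- adj_r2 / (1 - adj_r2)  # Cohen's f2 effect size
u <- 1                       # numerator df (number of predictors)
v <- 527                     # denominator df (rows - terms)
alpha <- 0.05

# Power of the F test under a noncentral F distribution
# (hypothetical helper; lambda convention is an assumption, see lead-in).
f2_power <- function(u, v, f2, alpha = 0.05) {
  lambda <- f2 * (u + v + 1)
  f_crit <- qf(1 - alpha, df1 = u, df2 = v)
  pf(f_crit, df1 = u, df2 = v, ncp = lambda, lower.tail = FALSE)
}

post_hoc_power <- f2_power(u, v, f2, alpha)

# Smallest v that reaches the assumed 80% power target, and the implied sample size.
candidate_v <- 2:100
reaches_target <- sapply(candidate_v, function(vv) f2_power(u, vv, f2, alpha) >= 0.80)
v_needed <- candidate_v[which(reaches_target)[1]]
n_needed <- v_needed + u + 1

print(round(f2, 4))
print(post_hoc_power)
print(n_needed)
```

Note that post hoc power computed from the observed effect is closely tied to the p-value, so it mostly restates that the observed effect is large relative to the sample size.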
Section 3
Instructions
Consider the following A/B testing data. This data tracks a user’s time on page (timeOnPage) and the UI design (design). In A/B testing, we are concerned with the difference between the two groups on the outcome. Select the appropriate technique from the general linear model and determine if any significant differences exist between the two competing page designs. Describe your rationale for any data processing that you might undertake.
Discuss your results and indicate any actionable decision that comes from your analysis. Additionally, determine if your analyses were sufficiently powered.
Hypothesis
Null hypothesis: There is no difference in mean time on page between the two competing page designs.
Alternative hypothesis: There is a difference in mean time on page between the two competing page designs.
I chose a two-sample t-test for this hypothesis test:
sec3Test = t.test(sec3Link$MinutesOnPage ~ sec3Link$PageConfiguration,
alternative = "two.sided")
sec3Test
Welch Two Sample t-test
data: sec3Link$MinutesOnPage by sec3Link$PageConfiguration
t = 94.396, df = 1043.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.885949 3.008477
sample estimates:
mean in group A mean in group B
5.974857 3.027644
Analysis
The test has a significant p-value, so there is sufficient evidence to reject the null hypothesis in favor of the alternative. The mean for group A is clearly larger than the mean for group B (5.974857 and 3.027644 minutes, respectively).
To make sure missing values are not skewing the results, I summarized the data frame.
PageConfiguration MinutesOnPage
Length:1059 Min. :1.671
Class :character 1st Qu.:2.993
Mode :character Median :3.869
Mean :4.433
3rd Qu.:5.918
Max. :7.378
Observations: 1,059
Variables: 2
$ PageConfiguration <chr> "B", "B", "B", "A", "B", "A", "A", "B", "A", "…
$ MinutesOnPage <dbl> 2.687553, 2.102057, 3.172279, 6.863614, 3.3126…
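No missing values, so no cleaning was needed. The Welch t-test reported about 1043.9 degrees of freedom (not the 1,057 that a pooled-variance test on 1,059 observations would give). With a sample this large and a difference this large relative to its standard error (t = 94.396), the analysis was sufficiently powered. The A/B test shows that, of the two competing page designs, users in group A spend more time on the page than users in group B, by about 2.9 to 3.0 minutes on average (95% CI 2.886 to 3.008).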
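The fractional degrees of freedom come from the Welch-Satterthwaite approximation, which t.test uses by default because it does not assume equal variances. The sketch below reproduces that calculation on simulated data; the group sizes (530 and 529), means, and standard deviations are hypothetical, chosen only so the total matches the 1,059 observations.

```r
# Simulated stand-in for the two page designs; all numbers are hypothetical.
set.seed(1)
a <- rnorm(530, mean = 6, sd = 0.6)  # minutes on page, design A
b <- rnorm(529, mean = 3, sd = 0.4)  # minutes on page, design B

# Welch-Satterthwaite degrees of freedom computed by hand.
va <- var(a) / length(a)
vb <- var(b) / length(b)
welch_df <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

# t.test() uses the Welch test unless var.equal = TRUE is set.
fit <- t.test(a, b, alternative = "two.sided")

pooled_df <- length(a) + length(b) - 2

print(c(welch_df = welch_df, reported_df = unname(fit$parameter), pooled_df = pooled_df))
```

The Welch value can never exceed the pooled value of n1 + n2 - 2, and it drops further below it as the two group variances diverge.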
Section 4
Instructions
Using the following data, determine if there are any differences in the daily net profit of three different store locations. Select the appropriate test from the general linear model and determine if any significant differences exist. Describe your rationale for any data processing that you might undertake.
Discuss your results.
Hypothesis
Null hypothesis: There are no differences in mean daily net profit among the three store locations.
Alternative hypothesis: At least one store location differs from the others in mean daily net profit.
ANOVA test
I used a one-way ANOVA to see whether any significant differences in daily net profit exist among the 3 store locations.
Df Sum Sq Mean Sq F value Pr(>F)
facility_location 2 2986 1492.9 674.6 <2e-16 ***
Residuals 1092 2417 2.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With the p-value being less than 2e-16, we can conclude that there are significant differences in daily net profit among the 3 locations. However, the ANOVA does not tell us which pairs of locations differ. We need a follow-up test to determine whether the mean differences between specific pairs of groups are statistically significant.
Tukey HSD
I computed Tukey's HSD to perform multiple pairwise comparisons between the means of the 3 groups.
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = daily_net_profit_thousand ~ facility_location, data = sec4Link)
$facility_location
diff lwr upr p adj
403 Barr-10 Maple 3.4505044 3.1920592 3.708950 0.0000000
710 Oakland-10 Maple -0.1025902 -0.3610354 0.155855 0.6203836
710 Oakland-403 Barr -3.5530947 -3.8115399 -3.294649 0.0000000
Analysis
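As the output shows, the mean differences between 403 Barr and 10 Maple and between 710 Oakland and 403 Barr are highly significant, with adjusted p-values very close to 0. The difference between 710 Oakland and 10 Maple is not significant (adjusted p = 0.62). In other words, 403 Barr earns about 3.5 thousand more in daily net profit than either of the other two locations, which do not differ detectably from each other.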
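The ANOVA and Tukey code is not shown in this section, so here is a sketch of the two steps on simulated data. The model formula and location names match the "Fit:" line in the output; the group means, standard deviation, and 365 observations per location are assumptions (1,092 residual degrees of freedom would be consistent with 3 x 365 rows, but that is an inference).

```r
# Simulated stand-in for sec4Link; means, sd, and n_days are hypothetical.
set.seed(7)
n_days <- 365
locations <- c("10 Maple", "403 Barr", "710 Oakland")
group_means <- c(5.0, 8.45, 4.9)

sim <- data.frame(
  facility_location = factor(rep(locations, each = n_days)),
  daily_net_profit_thousand = rnorm(3 * n_days,
                                    mean = rep(group_means, each = n_days),
                                    sd = 1.5)
)

# Step 1: omnibus one-way ANOVA. Is there any difference among the means?
fit <- aov(daily_net_profit_thousand ~ facility_location, data = sim)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Step 2: Tukey HSD. Which pairs differ, at a 95% family-wise confidence level?
tukey <- TukeyHSD(fit, "facility_location", conf.level = 0.95)
pairwise <- tukey$facility_location

print(anova_p)
print(round(pairwise, 4))
```

Each row of the pairwise table is "second location minus first", so a positive diff in the "403 Barr-10 Maple" row means 403 Barr has the higher mean.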
Now we can plot the data in several different ways to show the differences among them:
library("ggpubr")
ggboxplot(sec4Link, x = "facility_location", y = "daily_net_profit_thousand",
color = "facility_location", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
order = c("403 Barr", "10 Maple", "710 Oakland"),
ylab = "Daily Net Profit in Thousand", xlab = "Facility Location")
This boxplot supports the conclusion made earlier.
Section 5
Instructions
Using the following data, determine what variables influence a franchise’s ultimate outcome – failure or success. Using any variables available to you, select the appropriate method and test your model. Discuss your results and describe your rationale for any data processing that you might undertake.
First things first: I checked for missing values and inappropriate data types.
storeID outcomeClosedOpen employeeCount
Min. : 1.0 Min. :0.0000 Min. : 3.623
1st Qu.:119.5 1st Qu.:0.0000 1st Qu.: 9.502
Median :238.0 Median :0.0000 Median :12.347
Mean :238.0 Mean :0.4758 Mean :12.325
3rd Qu.:356.5 3rd Qu.:1.0000 3rd Qu.:15.028
Max. :475.0 Max. :1.0000 Max. :22.114
dailyNetProfitThousands quartersWithHealthViolations peoplePerSqMile
Min. : 2.495 Min. :0.0000 Min. : 63.91
1st Qu.: 3.967 1st Qu.:0.0000 1st Qu.:195.15
Median : 4.650 Median :0.0000 Median :240.73
Mean : 6.348 Mean :0.5874 Mean :244.51
3rd Qu.: 8.944 3rd Qu.:1.0000 3rd Qu.:295.11
Max. :10.298 Max. :3.0000 Max. :393.97
Changing storeID to character.
Correlation
Since the variable outcomeClosedOpen indicates whether the business is still running or has closed, I treat it as the dependent (outcome) variable and look for the variables most highly correlated with it. Based on the graph, dailyNetProfitThousands is the most correlated variable, with a correlation coefficient of 0.99; peoplePerSqMile is second, with a correlation coefficient of 0.87; and employeeCount is third, with a correlation coefficient of 0.64.
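The correlation graph itself is not reproduced here. As a sketch of how such a ranking could be computed, the code below correlates every numeric column with the 0/1 outcome (a point-biserial correlation, which is simply Pearson's r with a binary variable). The data are simulated; only the column names and the 475 rows come from the summary above, and every distribution parameter is hypothetical.

```r
# Simulated stand-in for the franchise data; all distribution parameters are hypothetical.
set.seed(3)
n <- 475
outcome <- rbinom(n, size = 1, prob = 0.48)

sim <- data.frame(
  outcomeClosedOpen = outcome,
  employeeCount = rnorm(n, mean = 10 + 5 * outcome, sd = 3),
  dailyNetProfitThousands = rnorm(n, mean = 4 + 5 * outcome, sd = 0.5),
  quartersWithHealthViolations = rpois(n, lambda = 0.6),
  peoplePerSqMile = rnorm(n, mean = 200 + 95 * outcome, sd = 30)
)

# Correlation of each predictor with the outcome, dropping the outcome itself.
cors <- cor(sim)[, "outcomeClosedOpen"]
cors <- cors[names(cors) != "outcomeClosedOpen"]

# Rank predictors by absolute correlation, strongest first.
ranked <- sort(abs(cors), decreasing = TRUE)
print(round(ranked, 2))
```

A correlation near 1 with a binary outcome means the two groups barely overlap on that variable, which is worth keeping in mind when interpreting the 0.99 reported for daily net profit.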
T-test
Now that I have those 3 variables, I want to test whether they are statistically significant. I examined each of them separately against the outcome with a t-test.
Welch Two Sample t-test
data: dailyNetProfitThousands by outcomeClosedOpen
t = -125.77, df = 466.93, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.053172 -4.897697
sample estimates:
mean in group 0 mean in group 1
3.980373 8.955807
Welch Two Sample t-test
data: peoplePerSqMile by outcomeClosedOpen
t = -38.056, df = 472.98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-101.97249 -91.95904
sample estimates:
mean in group 0 mean in group 1
198.3724 295.3382
Welch Two Sample t-test
data: employeeCount by outcomeClosedOpen
t = -18.341, df = 470.75, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.339915 -4.306419
sample estimates:
mean in group 0 mean in group 1
10.03001 14.85318
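The three Welch tests above could be produced along these lines. This is a sketch on simulated data, not the code that generated the output: the column names come from the data summary, while the distribution parameters and the use of reformulate() to build each formula are my own choices.

```r
# Simulated stand-in for the franchise data; all distribution parameters are hypothetical.
set.seed(3)
n <- 475
outcome <- rbinom(n, size = 1, prob = 0.48)

sim <- data.frame(
  outcomeClosedOpen = outcome,
  employeeCount = rnorm(n, mean = 10 + 5 * outcome, sd = 3),
  dailyNetProfitThousands = rnorm(n, mean = 4 + 5 * outcome, sd = 0.5),
  peoplePerSqMile = rnorm(n, mean = 200 + 95 * outcome, sd = 30)
)

predictors <- c("dailyNetProfitThousands", "peoplePerSqMile", "employeeCount")

# Run one Welch two-sample t-test per predictor, grouping by the 0/1 outcome.
tests <- lapply(predictors, function(v) {
  t.test(reformulate("outcomeClosedOpen", response = v), data = sim)
})
names(tests) <- predictors

# Collect the p-values and the two group means from each test.
p_values <- sapply(tests, function(tt) tt$p.value)
group_means <- t(sapply(tests, function(tt) unname(tt$estimate)))
colnames(group_means) <- c("mean_group_0", "mean_group_1")

print(p_values)
print(round(group_means, 2))
```

Because the difference is computed as group 0 minus group 1, a negative t statistic (as in all three outputs above) means the open franchises have the higher mean.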
Analysis
All 3 tests have a significant p-value, meaning the differences in means between open and closed franchises are statistically significant for all three variables and are unlikely to be due to chance.
Based on the results, I conclude that daily net profit is key to a business. At the end of the day, how successful a business is comes down to money: the more you make, the more successful the business can be. That is why, in the Test 1 result, the mean daily profit of the open franchises (8.96 thousand) is more than double that of the closed franchises (3.98 thousand).
Test 2 suggests that store traffic is important too. Franchises in more densely populated areas (higher peoplePerSqMile) are more likely to be open, presumably because more potential customers pass by. The more customers you have in your store, the more money is spent there, so a franchise should develop strategies that attract more potential customers to walk in. For example, an attractive window display can persuade people who are just walking by to stop in.
Lastly, Test 3 suggests that having more employees available in the store is helpful. As a customer, I find it less pleasant to shop in a store where I cannot find a clerk to help me when I need something. Customer service is essential for a successful business.