Indicate the simplest and most appropriate test for each of the following situations (2.5% of exam). Please also include your reasoning for choosing the test for partial/full credit:
a. We wish to compare the speed of a new processor to an older processor for which the mean speed is known.
One-sample t-test: the old processor’s mean speed is a known constant, so we only sample the new processor and test whether its true mean speed differs from that known value.
b. We would like to test whether age groups (child, teen, young adult, adult, senior) differ in their enjoyment (a continuous variable) of a commercial.
One-way ANOVA: tests whether mean enjoyment (a continuous DV) differs across the levels of a single categorical IV (age group, with five levels).
c. We are interested in whether men and women differ in their rates of driving foreign or domestic vehicles.
Chi-square test of independence: both variables are categorical (gender: men or women; vehicle type: foreign or domestic), so the data are counts in a 2 x 2 table rather than continuous scores. The test checks whether vehicle type is independent of gender. An ANOVA is not appropriate here because there is no continuous DV.
d. We have run an experiment looking at the effects of sugar and salt content (both taken at three levels) on reported appetitiveness (i.e., enjoyment of taste: a continuous variable).
Two-way (3 x 3) factorial ANOVA: there are two categorical IVs (sugar and salt content, three levels each) and one continuous DV (appetitiveness). The model tests the main effect of each factor and their interaction.
e. Focus groups were randomly assigned to one of two commercials and their likelihood of buying the product advertised was taken on a scale from 0 to 100. We wish to know if the commercials differed in their effectiveness.
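Independent two-sample t-test: there are only two independent groups (commercials) and one continuous DV (purchase likelihood), so the t-test is the simplest choice. A one-way ANOVA with two groups would give an equivalent result when equal variances are assumed.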
f. We have been observing the outcomes of dice thrown on the craps table at our local casino and we have a feeling the dice are loaded (set to land on certain numbers more frequently). We want to test if they are fair dice (equally likely to land on all sides).
Chi-square goodness-of-fit test: tests whether the observed data are consistent with a population that has a specified set of probabilities. The expected frequencies (a fair die has probability 1/6 for every side) are compared with the observed frequencies (loaded dice would make specific sides come up more often). The dice are judged to be loaded if the difference between expected and observed frequencies is large.
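g. We have a 20-factor experiment each with three levels (a 3^20 design). We can only perform one replicate, but we want to know what effects, including interactions, seem to have an impact.
3^k factorial design (single replicate): every factor has three levels, so this is a 3^k rather than a 2^k design, with categorical predictors and a continuous DV. With only one replicate there are no residual degrees of freedom to estimate error, so effects and interactions are screened rather than formally tested, for example with a normal probability plot of the estimated effects. Factorial designs of this kind are frequently used in factor screening experiments.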
h. Patients’ ratings of pain (on a scale of 1 to 100) were taken before and after a drug treatment was given and we want to know if the drug reduced pain significantly.
Paired t-test: each patient is measured twice, so the test checks whether the mean within-patient change in the DV (pain rating) from before to after treatment differs from zero.
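As a rough illustration of how some of these tests are called in R, the sketch below covers items (a), (b), (f) and (h). All data are simulated or hypothetical (the known mean of 3.0, the die counts, and every variable name are assumptions made for this example), so only the function calls matter, not the numbers.

```r
set.seed(123)  # simulated data only, for illustration

# (a) One-sample t-test: new processor speeds vs. a known mean (3.0 is hypothetical)
new_speed <- rnorm(30, mean = 3.2, sd = 0.4)
one_sample <- t.test(new_speed, mu = 3.0)

# (b) One-way ANOVA: continuous enjoyment across five age groups
age_group <- factor(rep(c("child", "teen", "young adult", "adult", "senior"),
                        each = 20))
enjoyment <- rnorm(100, mean = 50, sd = 10)
one_way <- summary(aov(enjoyment ~ age_group))

# (f) Chi-square goodness of fit: observed face counts vs. equal probabilities
rolls <- c(18, 22, 19, 21, 17, 43)  # hypothetical counts for faces 1 to 6
dice_fit <- chisq.test(rolls, p = rep(1 / 6, 6))

# (h) Paired t-test: pain before vs. after treatment for the same patients
before <- round(runif(25, min = 40, max = 90))
after <- before - round(rnorm(25, mean = 8, sd = 5))
paired <- t.test(before, after, paired = TRUE, alternative = "greater")

one_sample
one_way
dice_fit
paired
```

The one-sided alternative in (h) reflects the question of whether the drug reduced pain; a two-sided test would drop the `alternative` argument.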
Your food cart sells a meat, a vegetarian, and a vegan dish. Your cart is located near a very busy convention center acting nearly every day as a venue for conferences related to business and entertainment. You have an intuition that there are differences in what each group of attendees prefers for lunch and you would like to see if your intuitions are correct so you can better prepare for each type of event (e.g., prep for more meat dishes during business conferences). You collect data on the numbers of each meal type being bought and what kind of event occurred that day over the past 30 days. The data is below. Perform the most suitable/appropriate analysis and summarize the results (2.5% of exam).
Is there a significant difference between business and entertainment lunch preferences?
First, let’s enter the variables and build the table:
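# Enter the counts of meals sold by conference type and food type
Conference_type = c("Business", "Business", "Business", "Entertainment", "Entertainment", "Entertainment")
Food_type = c("Meat", "Vegetarian", "Vegan", "Meat", "Vegetarian", "Vegan")
Preference = c(155, 120, 5, 200, 300, 100)
# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0
FoodCart = data.frame(Conference_type, Food_type, Preference, stringsAsFactors = TRUE)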
str(FoodCart)
## 'data.frame': 6 obs. of 3 variables:
## $ Conference_type: Factor w/ 2 levels "Business","Entertainment": 1 1 1 2 2 2
## $ Food_type : Factor w/ 3 levels "Meat","Vegan",..: 1 3 2 1 3 2
## $ Preference : num 155 120 5 200 300 100
With only six counts, the density plot below gives a rough view of their spread, but it is not enough to judge normality.
plot(density(FoodCart$Preference))
The boxplots below give a quick visual of meal demand by event type and by food type.
Entertainment events consume more meals than business events.
Demand also appears to vary by food type (ignoring conference type).
boxplot(FoodCart$Preference~FoodCart$Conference_type, main= "Meals per Event Type",
ylab="Meals", col=c(2,3), lwd=2)
boxplot(FoodCart$Preference~FoodCart$Food_type, main= "Meals per Food Type",
ylab="Preferences", col=c(3,4,5), lwd=2)
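The proportions of meal types computed below differ between the two event types: business attendees buy mostly meat (about 55%) and almost no vegan meals (about 2%), whereas entertainment attendees favour vegetarian (50%) and buy about 17% vegan. Entertainment events also consume more meals overall, but here we compare proportions regardless of the number of meals.
Both events sell more vegetarian and meat than vegan.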
business=FoodCart[1:3,]
business$Preference/sum(business$Preference)
## [1] 0.55357143 0.42857143 0.01785714
pie(business$Preference/sum(business$Preference), labels = c("Meat","Vegetarian","Vegan"),
main="Proportion of preferences for Business")
Entertainment=FoodCart[4:6,]
Entertainment$Preference/sum(Entertainment$Preference)
## [1] 0.3333333 0.5000000 0.1666667
pie(Entertainment$Preference/sum(Entertainment$Preference), labels = c("Meat","Vegetarian","Vegan"),
main="Proportion of preferences for Entertainment")
Let’s fit the model to test for a difference in mean meal counts between conference types:
summary(aov(Preference~Conference_type,data = FoodCart))
## Df Sum Sq Mean Sq F value Pr(>F)
## Conference_type 1 17067 17067 2.112 0.22
## Residuals 4 32317 8079
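The ANOVA does not find a significant difference in mean meal counts between entertainment and business events (F = 2.112, p = 0.22). Note that this model treats the six cell counts as observations and compares overall volume by event type. It does not test whether the mix of meal types differs between event types, which is a question for a chi-square test of independence on the 2 x 3 table of counts.
Let’s verify our assumptions graphically: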
plot(aov(Preference~Conference_type,data = FoodCart))
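There is no obvious pattern between the residuals and the fitted values, and the Q-Q plot is roughly consistent with normal residuals, although with only six observations these diagnostics carry little weight. The last plot shows no signs of influential values that would distort the parameter estimates.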
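Because the question concerns whether the mix of meal types depends on the event type, and the data are counts, a chi-square test of independence on the 2 x 3 table is a natural alternative to the ANOVA. The sketch below uses the six counts entered earlier. The object names `meals` and `independence_test` are introduced here for illustration, and treating each meal sold as an independent observation is an assumption.

```r
# Counts taken from the FoodCart data above, arranged as a 2 x 3 table
meals <- matrix(c(155, 120, 5,
                  200, 300, 100),
                nrow = 2, byrow = TRUE,
                dimnames = list(Conference = c("Business", "Entertainment"),
                                Food = c("Meat", "Vegetarian", "Vegan")))

# Row proportions: the meal mix within each conference type
round(prop.table(meals, margin = 1), 3)

# Chi-square test of independence between conference type and food type
independence_test <- chisq.test(meals)
independence_test

# Expected counts under independence, useful for checking that none is very small
independence_test$expected
```

With 2 degrees of freedom the 5% critical value is about 5.99, so a statistic well above that would indicate that the meal mix differs between business and entertainment events.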
Your firm is involved in creating energy (efficient) saving lightbulbs using LEDs. One of the major hurdles to getting customers to buy the new lightbulbs is that many complain that the bulbs put out light that is very unfriendly to the eyes. Two common factors are generally associated with light seen as being unfriendly/uncomfortable: One is the level of luminance (i.e., how bright it is) and the other is the amount of “blue” light present (e.g., more blue light puts more strain on the eyes). Your firm has run an experiment with 3 levels of luminance (“low”, “medium”, and “high”) and 3 levels of blue (“no blue”, “low blue”, and “moderate blue”). 100 focus groups took part in the experiment and each focus group rated their impressions of the light for each of the 9 possible combinations in random order. Ratings went from -100 (Hated the Light) to +100 (Loved the Light). The data set is titled Exam1Q2.xlsx. Perform the appropriate analysis and summarize the results (5% of exam).
Analyze the different luminance and blue light levels to determine which combination yields the highest impression ratings.
library(readxl)
LightData <- read_excel("C:/Users/Enrique/OneDrive/Documents/HU/ANLY510_Principles7Applicaitons02/Data/LightData.xlsx")
str(LightData)
## Classes 'tbl_df', 'tbl' and 'data.frame': 900 obs. of 4 variables:
## $ BlueLevel : chr "None" "Low" "Moderate" "None" ...
## $ Lum : chr "Low" "Low" "Low" "Medium" ...
## $ Impression: num 20 33 -9 36 55 -2 35 97 13 15 ...
## $ FocusGroup: num 1 1 1 1 1 1 1 1 1 2 ...
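Let’s convert the character columns into factors with ordered levels, and store the focus group ID as a character label (aov() will treat it as a factor).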
LightData$BlueLevel=factor(LightData$BlueLevel,levels = c("None","Low","Moderate"))
LightData$Lum=factor(LightData$Lum,levels = c("Low","Medium","High"))
LightData$FocusGroup=as.character(LightData$FocusGroup)
str(LightData)
## Classes 'tbl_df', 'tbl' and 'data.frame': 900 obs. of 4 variables:
## $ BlueLevel : Factor w/ 3 levels "None","Low","Moderate": 1 2 3 1 2 3 1 2 3 1 ...
## $ Lum : Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Impression: num 20 33 -9 36 55 -2 35 97 13 15 ...
## $ FocusGroup: chr "1" "1" "1" "1" ...
The data look slightly skewed; let’s test normality with the Shapiro-Wilk test and skewness with the D’Agostino test.
plot(density(LightData$Impression))
Normality result: failed
shapiro.test(LightData$Impression)
##
## Shapiro-Wilk normality test
##
## data: LightData$Impression
## W = 0.9439, p-value < 2.2e-16
Skew result: failed
library(moments)
agostino.test(LightData$Impression)
##
## D'Agostino skewness test
##
## data: LightData$Impression
## skew = 0.44069, z = 5.21200, p-value = 1.868e-07
## alternative hypothesis: data have a skewness
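The data failed the skewness test, so let’s try to reduce the skew by taking the log. Note that ratings range from -100 to +100 and the log is undefined for negative values, so those observations become NaN (hence the warning that follows the call).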
LightData$Impression=log(LightData$Impression)
## Warning in log(LightData$Impression): NaNs produced
agostino.test(LightData$Impression)
##
## D'Agostino skewness test
##
## data: LightData$Impression
## skew = -0.34085, z = -3.62270, p-value = 0.0002916
## alternative hypothesis: data have a skewness
Even after the log transform, the skewness test still rejects symmetry (p = 0.0002916), although the skew is smaller in magnitude (-0.34 versus 0.44). We continue by testing homogeneity of variance.
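One way to avoid losing the negative ratings is to shift the scale before taking the log. The sketch below is only an illustration on hypothetical ratings (not the exam data); the shift of 101 is an assumption chosen so that the minimum possible rating of -100 maps to log(1) = 0, and whether the shifted log actually reduces skew would need to be rechecked with the skewness test.

```r
# Hypothetical ratings on the -100 to +100 scale (not the exam data)
set.seed(1)
ratings <- round(runif(50, min = -100, max = 100))

# Plain log: negative ratings become NaN and would be dropped by aov()
plain_log <- suppressWarnings(log(ratings))

# Shifted log: every rating is moved to the range 1 to 201 first
shifted_log <- log(ratings + 101)

sum(is.nan(plain_log))    # number of observations lost to the plain log
sum(is.na(shifted_log))   # number of observations lost to the shifted log
```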
Variance result: Failed
library(vcdExtra)
bartlett.test(LightData$Impression, LightData$BlueLevel)
##
## Bartlett test of homogeneity of variances
##
## data: LightData$Impression and LightData$BlueLevel
## Bartlett's K-squared = 214.55, df = 2, p-value < 2.2e-16
bartlett.test(LightData$Impression, LightData$Lum)
##
## Bartlett test of homogeneity of variances
##
## data: LightData$Impression and LightData$Lum
## Bartlett's K-squared = 327.18, df = 2, p-value < 2.2e-16
bartlett.test(LightData$Impression, LightData$FocusGroup)
##
## Bartlett test of homogeneity of variances
##
## data: LightData$Impression and LightData$FocusGroup
## Bartlett's K-squared = 5.0653, df = 99, p-value = 1
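Bartlett’s test rejects equal variances across blue levels and across luminance levels (both p < 2.2e-16), but not across focus groups (p = 1). Let’s check how large the differences are: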
tapply(LightData$Impression, LightData$BlueLevel, var)
## None Low Moderate
## 0.1843453 0.1274930 NA
tapply(LightData$Impression, LightData$Lum, var)
## Low Medium High
## NA NA 0.712987
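Where the variances can be computed, the ratio of the largest to the smallest is below the 4-fold rule of thumb for acceptable heterogeneity (0.184 versus 0.127 for blue level). The NA entries are caused by the NaNs introduced by the log transform, so this check is incomplete. Let’s continue with the model.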
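The NA entries arise because `var()` returns NA when a group contains missing values. A minimal sketch of how the group variances could be computed anyway, using hypothetical values (the vectors `impression` and `blue` are made up for this example), passes `na.rm = TRUE` through `tapply()`:

```r
# Hypothetical values; NaN mimics what log() returns for a negative rating
impression <- c(2.9, 3.4, NaN, 3.6, 4.0, 1.2, 3.5, 4.5, 2.5)
blue <- factor(rep(c("None", "Low", "Moderate"), times = 3),
               levels = c("None", "Low", "Moderate"))

# Default behaviour: any group containing NaN gets an NA variance
tapply(impression, blue, var)

# na.rm = TRUE is forwarded to var(), so missing values are dropped per group
group_var <- tapply(impression, blue, var, na.rm = TRUE)
group_var

# Ratio of largest to smallest variance, for the 4-fold rule of thumb
max(group_var) / min(group_var)
```

Dropping the missing values only makes the variances computable; it does not restore the observations that the log transform removed.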
ANOVA Model
LightModel= aov(LightData$Impression~LightData$FocusGroup*LightData$BlueLevel+
LightData$Lum, data = LightData)
summary(LightModel)
## Df Sum Sq Mean Sq F value
## LightData$FocusGroup 99 0.16 0.00 0.051
## LightData$BlueLevel 2 199.17 99.59 3074.940
## LightData$Lum 2 79.73 39.87 1230.972
## LightData$FocusGroup:LightData$BlueLevel 198 1.15 0.01 0.179
## Residuals 398 12.89 0.03
## Pr(>F)
## LightData$FocusGroup 1
## LightData$BlueLevel <2e-16 ***
## LightData$Lum <2e-16 ***
## LightData$FocusGroup:LightData$BlueLevel 1
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 200 observations deleted due to missingness
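The model shows significant main effects of blue level (F = 3074.94, p < 2e-16) and luminance (F = 1230.97, p < 2e-16).
Neither the focus group effect nor the focus group by blue level interaction is significant (p = 1). Note that this model does not include a blue level by luminance interaction term, and that 200 observations were dropped because of the NaNs produced by the log transform.
Let’s plot the interaction for this model: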
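Since every focus group rated all nine combinations, the design is a two-way repeated-measures layout, and a common way to express that in `aov()` is with an `Error()` term for the focus group. The sketch below is an assumption about how such a model could be specified, run on simulated data (the data frame `design` and its values are hypothetical, not the exam data); it also includes the blue level by luminance interaction.

```r
# Simulated balanced design: 10 focus groups x 3 blue levels x 3 luminance levels
set.seed(42)
design <- expand.grid(
  FocusGroup = factor(1:10),
  BlueLevel  = factor(c("None", "Low", "Moderate"),
                      levels = c("None", "Low", "Moderate")),
  Lum        = factor(c("Low", "Medium", "High"),
                      levels = c("Low", "Medium", "High"))
)
design$Impression <- rnorm(nrow(design), mean = 20, sd = 15)

# Both factors vary within focus group, so each gets its own error stratum
rm_model <- aov(Impression ~ BlueLevel * Lum +
                  Error(FocusGroup / (BlueLevel * Lum)),
                data = design)
summary(rm_model)
```

Referring to columns by name together with `data = design`, rather than with `design$...` inside the formula, keeps the term labels in the summary short.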
library(lsmeans)
## The 'lsmeans' package is being deprecated.
## Users are encouraged to switch to 'emmeans'.
## See help('transition') for more information, including how
## to convert 'lsmeans' objects and scripts to work with 'emmeans'.
lsmip(object = LightModel,formula = LightData$BlueLevel~LightData$Lum,
main="Interaction: Blue Level & Luminance", col = c(2,3,4))
The figure above portrays the predicted impression (on the log scale) across luminance levels, with separate lines for blue light levels. One can see a clear increase in predicted impression as luminance and blue light levels increase from low to high.
Based on these results, the firm should consider maintaining their LEDs at high levels of luminance and blue light. Because the negative ratings were excluded by the log transform, this recommendation should be rechecked against the untransformed ratings.