Introduction to R Markdown

R Markdown, is an extremely useful tool that professional data scientists and business analysts use in their day-to-day work.

You can open R Markdown documents in RStudio as well. You should see a little command called “Knit”, which allows you to “knit” the entire R Markdown file into a HTML document, or a PDF document, or a MS Word Document (note, for MS Word, you’ll need MS Word installed on your system; for PDF, you need to have Tex/Latex distribution installed).

R Markdown is nice to use simply because it allows you to embed both code and write-up into the same document, and it produces presentable output, so you can use it to generate reports from your homework, and, when you eventually go out to work in a company, for your projects.

Here’s how you embed a “chunk” of R code.

1+1
## [1] 2

After the three apostrophes, you’ll need r, then you can give the chunk a name. Please note that CHUNK NAMES HAVE TO BE A SINGLE-WORD, NO SPACE ALLOWED. Also, names have to be unique, that is, every chunk needs a different name (this has led to rendering failures in previous final exams). You can give chunks names like:

or, what will help you with homework:

These names are for you to help organize your code. (In practice it will be very useful when you have files with thousands of lines of code…). After the name of the chunk, you can give it certain options, separated by commas. I will highlight one important option.

There is a lot to syntax to learn using the R Markdown, but we don’t need you to be an expert in R Markdown (although we do expect proficiency in R!). Hopefully, you can copy all the R Markdown syntax you need from the templates we provide.

Note about working directories in R Markdown. If you do not specify your working directory via setwd('...'), and you hit Knit, the document will assume that the working directory is the directory that the .rmd file is in. Thus, if your rmd is in XYZ/folder1/code.rmd and your dataset is XYZ/folder1/data.csv, then you can simply run d0 <- read.csv('data.csv') without running setwd().

Submission Instructions

Tutorial 5

# install required packages if you have not (suggested packages: rcompanion, rstatix, Rmisc, dplyr, tidyr, rpivotTable, knitr, psych)
# install.packages("dplyr") #only need to run this code once to install the package
# load required packages 
# library("xxxx")

library("rcompanion") #this package is required for transformTukey function
library("rstatix")
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library("Rmisc") 
## Loading required package: lattice
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:rstatix':
## 
##     desc, mutate
library("dplyr") #need to call the library before you use the package
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("tidyr")
library("rpivotTable")
library("knitr")
library("psych")
## 
## Attaching package: 'psych'
## The following object is masked from 'package:rcompanion':
## 
##     phi

In the last tutorial, you were tasked to help the store manager develop dashboards that will enable him to gain better insights of the data. In this tutorial, you will use the data to conduct sampling estimation and hypotheses testing. Where necessary, check the distribution for the variables and for the presence of outliers.

You may use the following guideline to round off your final answers: If the answer is greater than 1, round off to 2 decimal places. If the answer is less than 1, round off to 3 significant numbers. When rounding, also take note of the natural rounding points, for example, costs in dollars would round off to 2 decimal places.

Tutorial 5 Question 2: CardioGood Fitness (25 marks - for self-grading)

(To be discussed in Week 10)

Context: Your market research team at AdRight is assigned to study the profile of the typical customer for each treadmill product offered by CardioGood Fitness. The team decides to collect data on individuals who purchased a treadmill at a CardioGoodFitness retail store during the prior three months.

The data are stored in the CardioGoodFitness.csv file. The team identifies the following customer variables to study:

  • Product: product purchased (TM195, TM498, or TM798)
  • Age: in years
  • Gender: Female or Male
  • Education: number of years of education
  • MaritalStatus: Single or Partnered
  • Usage: average number of times the customer plans to use the treadmill each week
  • Fitness: self-rated fitness on an 1-to-5 scale, where 1 is poor shape and 5 is excellent shape.
  • Income: annual household income ($)
  • Miles: average number of miles the customer expects to walk/run each week
## Please check that the .csv file is in the same directory as your Rmd file.
# setwd("  ") (Make sure to set your directory if you get an error "cannot open connection")

d1 <- read.csv("CardioGoodFitness.csv")

checking of Data

head(d1,10)
##    Product Age Gender Education MaritalStatus Usage Fitness Income Miles
## 1    TM195  18   Male        14        Single     3       4  29562   112
## 2    TM195  19   Male        15        Single     2       3  31836    75
## 3    TM195  19 Female        14     Partnered     4       3  30699    66
## 4    TM195  19   Male        12        Single     3       3  32973    85
## 5    TM195  20   Male        13     Partnered     4       2  35247    47
## 6    TM195  20 Female        14     Partnered     3       3  32973    66
## 7    TM195  21 Female        14     Partnered     3       3  35247    75
## 8    TM195  21   Male        13        Single     3       3  32973    85
## 9    TM195  21   Male        15        Single     5       4  35247   141
## 10   TM195  21 Female        15     Partnered     2       3  37521    85
summary(d1)
##    Product               Age           Gender            Education    
##  Length:180         Min.   :18.00   Length:180         Min.   :12.00  
##  Class :character   1st Qu.:24.00   Class :character   1st Qu.:14.00  
##  Mode  :character   Median :26.00   Mode  :character   Median :16.00  
##                     Mean   :28.79                      Mean   :15.57  
##                     3rd Qu.:33.00                      3rd Qu.:16.00  
##                     Max.   :50.00                      Max.   :21.00  
##  MaritalStatus          Usage          Fitness          Income      
##  Length:180         Min.   :2.000   Min.   :1.000   Min.   : 29562  
##  Class :character   1st Qu.:3.000   1st Qu.:3.000   1st Qu.: 44059  
##  Mode  :character   Median :3.000   Median :3.000   Median : 50597  
##                     Mean   :3.456   Mean   :3.311   Mean   : 53720  
##                     3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.: 58668  
##                     Max.   :7.000   Max.   :5.000   Max.   :104581  
##      Miles      
##  Min.   : 21.0  
##  1st Qu.: 66.0  
##  Median : 94.0  
##  Mean   :103.2  
##  3rd Qu.:114.8  
##  Max.   :360.0
str(d1)
## 'data.frame':    180 obs. of  9 variables:
##  $ Product      : chr  "TM195" "TM195" "TM195" "TM195" ...
##  $ Age          : int  18 19 19 19 20 20 21 21 21 21 ...
##  $ Gender       : chr  "Male" "Male" "Female" "Male" ...
##  $ Education    : int  14 15 14 12 13 14 14 13 15 15 ...
##  $ MaritalStatus: chr  "Single" "Single" "Partnered" "Single" ...
##  $ Usage        : int  3 2 4 3 4 3 3 3 5 2 ...
##  $ Fitness      : int  4 3 3 3 2 3 3 3 4 3 ...
##  $ Income       : int  29562 31836 30699 32973 35247 32973 35247 32973 35247 37521 ...
##  $ Miles        : int  112 75 66 85 47 66 75 85 141 85 ...

Q2.(a) Computing Interval Estimates

Using the data that the team has collected:

    1. Find the 95% prediction interval for Age. Explain this finding to the team. (3 marks)
    1. Develop the 95% confidence interval for mean customer age. Based on this confidence interval, explain to the team if there is sufficient evidence to conclude that its customers’ average age is 30? (2 marks)
    1. Develop the 95% confidence interval for the true proportion of male customers. From this interval estimate, should the team believe that the company has more male customers than female customers?) (2 marks)

Type your findings and explanations in the space below.

For tutorial discussion: Repeat the above with 90% intervals. Compare the widths of the 90% intervals vs 95% intervals. Which intervals provide better precision?

BEGIN: YOUR ANSWER

    1. Find the 95% prediction interval for Age. Explain this finding to the team. (3 marks)
h1<-hist(d1$Age, main="Histogram of Customer Age",xlab="Customer Age",ylab="No. of Customers", col=c("darkorange"),ylim=c(0,75), labels=TRUE)

qqnorm(d1$Age, ylab="Sample Quantities for 'Age' for Customers")
qqline(d1$Age,col="red")

# Shapiro-Wilk normality test

shapiro.test(d1$Age)
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Age
## W = 0.91577, p-value = 1.183e-08

Finding The test statistic W is 0.91577 and the p-value is 1.183e-08 (very small). This result suggests that the sample does not come from a normally distributed population, as the p-value is less than the commonly used significance level of 0.05. Therefore, we reject the null hypothesis and conclude that the “Age” variable is not normally distributed.

#transform of 'Age' variable since the data is normally distributed

d1$Age.t = transformTukey(d1$Age, plotit=TRUE)

## 
##     lambda      W Shapiro.p.value
## 351  -1.25 0.9808         0.01379
## 
## if (lambda >  0){TRANS = x ^ lambda} 
## if (lambda == 0){TRANS = log(x)} 
## if (lambda <  0){TRANS = -1 * x ^ lambda}

BEGIN: YOUR ANSWER

#using t-distribution to calculate the 95% CI of upper and lower Interval of true age population
mnage.t <- mean(d1$Age.t)
sdage.t <- sd(d1$Age.t)

lPI95.aget <- mnage.t + (qt(0.025, df = (nrow(d1)-1))*sdage.t*sqrt(1+1/nrow(d1)))
uPI95.aget <- mnage.t - (qt(0.025, df = (nrow(d1)-1))*sdage.t*sqrt(1+1/nrow(d1)))

#reverse transform
#using y=-1*x^lamda  where lambda(-1.25) less than 0

lPT95.age <- (-1/lPI95.aget) ^(1/1.25)
uPT95.age <- (-1/uPI95.aget) ^(1/1.25)

kable(cbind(lPT95.age, uPT95.age), digits=2, caption="Reverse Transformation Interval 95%CI")
Reverse Transformation Interval 95%CI
lPT95.age uPT95.age
19.41 49.26

Reverse transform; to derive the formula y=-1*x^lamda -y = x^-1.25 = 1/(x^1,25) x^1,25 = -1/y x = (-1/y)^(1/1.25) where y can be with lPI95.aget or uPI95.aget Hence, 95% PI of the transformed Interval of age of the true population lies between 19.41 to 49.26

#using t-distribution to calculate the 90% CI of upper and lower Interval of true age population
mnage.t <- mean(d1$Age.t)
sdage.t <- sd(d1$Age.t)

lPI90.aget <- mnage.t + (qt(0.05, df = (nrow(d1)-1))*sdage.t*sqrt(1+1/nrow(d1)))
uPI90.aget <- mnage.t - (qt(0.05, df = (nrow(d1)-1))*sdage.t*sqrt(1+1/nrow(d1)))
#kable(cbind(lPI90.aget, uPI95.aget), caption = "90% Prediction Interval for Age for Customer")

#reverse transform
#using y=-1*x^lamda  where lambda(-1.25) less than 0

lPT90.age <- (-1/lPI90.aget) ^(1/1.25)
uPT90.age <- (-1/uPI90.aget) ^(1/1.25)

kable(cbind(lPT90.age, uPT90.age), digits=2, caption="Reverse Transformation Interval 90%CI")
Reverse Transformation Interval 90%CI
lPT90.age uPT90.age
20.33 43.19

Reverse transform; to derive the formula y=-1*x^lamda -y = x^-1.25 = 1/(x^1,25) x^1,25 = -1/y x = (-1/y)^(1/1.25) where y can be with lPI90.aget or uPI90.aget Hence, 90% PI of the transformed Interval of age of the true population lies between 20.33 to 43.19

    1. Develop the 95% confidence interval for mean customer age. Based on this confidence interval, explain to the team if there is sufficient evidence to conclude that its customers’ average age is 30? (2 marks)

Using One Sample t-test

#using t.test function
t.test(d1$Age,
       alternative="two.sided",
       mu=30,
       conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  d1$Age
## t = -2.3401, df = 179, p-value = 0.02038
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
##  27.76763 29.81015
## sample estimates:
## mean of x 
##  28.78889
#using t.test function
t.test(d1$Age,alternative="two.sided",
       mu=30,
       conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  d1$Age
## t = -2.3401, df = 179, p-value = 0.02038
## alternative hypothesis: true mean is not equal to 30
## 90 percent confidence interval:
##  27.93319 29.64459
## sample estimates:
## mean of x 
##  28.78889

Finding

Using One Sample t-test on both tests, we set the null hypothesis(H0) is that the true mean age is equal to 30, and the alternative hypothesis(H1) is that it is not equal to 30.

The p-value for both tests is 0.02038, which is less than the significance level of 0.05, indicating that we reject the null hypothesis in favor of the alternative hypothesis.

In 95% confidence interval of (27.76763, 29.81015), which means that we are 95% confident that the true mean age falls between 27.76763 and 29.81015, while 90% confidence interval of (27.93319, 29.64459), which means that we are 90% confident that the true mean age falls between 27.93319 and 29.64459. Hence, we can conclude that we can reject the claim since in both 95% CI and 90% CI are the true mean age is less than 30.

dMale <- d1 %>% filter(d1$Gender=="Male")

## 90% confidence interval for proportion of Male customers

prop.test(x=nrow(dMale),n=nrow(d1),conf.level = 0.95)
## 
##  1-sample proportions test with continuity correction
## 
## data:  nrow(dMale) out of nrow(d1), null probability 0.5
## X-squared = 4.05, df = 1, p-value = 0.04417
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.5019570 0.6502243
## sample estimates:
##         p 
## 0.5777778
dMale <- d1 %>% filter(d1$Gender=="Male")

## 90% confidence interval for proportion of Male customers

prop.test(x=nrow(dMale),n=nrow(d1),conf.level = 0.90)
## 
##  1-sample proportions test with continuity correction
## 
## data:  nrow(dMale) out of nrow(d1), null probability 0.5
## X-squared = 4.05, df = 1, p-value = 0.04417
## alternative hypothesis: true p is not equal to 0.5
## 90 percent confidence interval:
##  0.5137230 0.6394242
## sample estimates:
##         p 
## 0.5777778

Finding

Using one-sample proportions test, he null hypothesis is that the true proportion of males in the sample is 0.5 (i.e., an equal number of males and females), while the alternative hypothesis is that the true proportion is different from 0.5 (i.e., there are more males or females in the sample).

The sample proportion is 0.5777778, which suggests that there are slightly more males than females in the sample. The 95% confidence interval is [0.5019570, 0.6502243], which means that we can be 95% confident that the true proportion of males in the population lies between 0.5019570 and 0.6502243. The 90% confidence interval is [0.5137230, 0.6394242], which means that we can be 90% confident that the true proportion of males in the population lies between 0.5137230 and 0.6394242.

Type your answer here.

END: YOUR ANSWER

Q2.(b) Understanding Customer Income data

Could you assist the team to assess whether there is any difference in mean income between

    1. female versus male customers.
    1. customers who purchased TM195, TM498 versus TM798.

Plot a graph for each of the comparisons, then set up the appropriate hypotheses and conduct the hypotheses tests using a 5% significance level. Type the hypotheses and your findings in the space below.

(8 marks)

BEGIN: YOUR ANSWER

i. female versus male customers.

Comparison of incomes of the genders

Hypotheses:

Null hypothesis (H0): There is no significant difference in mean income between male and female customers.
Alternative hypothesis (Ha): There is a significant difference in mean income between male and female customers.
𝐻0: 𝜇income of Male − 𝜇income of Female = 0
𝐻𝑎: 𝜇income of Male − 𝜇income of Female ≠ 0

Findings:

p-value = 0.004021, which is less than the 5% significance level
Therefore, we reject the null hypothesis
There is a significant difference in mean income between male and female customers
95% confidence interval for the difference in means is (2174.392, 11293.312) and does not contain 0, providing additional evidence for the significant difference.
## Type your codes here

dMale <- d1 %>% filter(d1$Gender=="Male")
dFemale <- d1 %>% filter(d1$Gender=="Female")

lapply(list(dMale, dFemale), nrow)
## [[1]]
## [1] 104
## 
## [[2]]
## [1] 76
# combine all gender(Male, Female) into one dataframe
BD.less<-rbind(dMale,dFemale)

#Conduct ANOVA on Age
# plot histogram for Income 
par(mfcol=c(1,2))
hist(dMale$Income, main="Histogram for Male Customers", xlab="Income", ylab="No of customers", ylim = c(0,35), col= "blue", labels=TRUE, xaxp=c(0,110000,20), cex.axis=0.8)
hist(dFemale$Income, main="Histogram for Female Customer", xlab="Income", ylab="No of customers", ylim = c(0,35), col= "red", labels=TRUE, xaxp=c(0,110000,20), cex.axis=0.8)

t.test(dMale$Income, dFemale$Income)
## 
##  Welch Two Sample t-test
## 
## data:  dMale$Income and dFemale$Income
## t = 2.9146, df = 177.23, p-value = 0.004021
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2174.392 11293.312
## sample estimates:
## mean of x mean of y 
##  56562.76  49828.91

Finding The p-value of the test is 0.004021, which is less than the typical significance level of 0.05. This indicates strong evidence against the null hypothesis that there is no difference in mean income between males and females.

The alternative hypothesis, which states that the true difference in means is not equal to 0, is supported by the results of the test. The 95 percent confidence interval for the difference in means is (2174.392, 11293.312), which indicates that we can be 95% confident that the true difference in means lies within this interval.

The sample estimates for the mean income of males and females are $56,562.76 and $49,828.91, respectively. The difference between these means is $6,733.85, which is positive and suggests that males have a higher mean income than females in the sample

ii. customers who purchased TM195, TM498 versus TM798.

𝐻0: 𝜇1 = 𝜇2 = 𝜇3 𝐻𝑎: at least one mean is different from the others ANOVA 1. are randomly and independently obtained, 2. are normally distributed, and 3. have equal variances. ANOVA requires assumptions that the m groups or factor levels being studied represent populations whose outcome measures: What statistical test do we use to test the differences in means? Where: • 𝜇1 = Mean income for purchased TM195 • 𝜇2 = Mean income for purchased TM498 • 𝜇3 = Mean income for purchased TM798

## Type your codes here

dTM195 <- d1 %>% filter(d1$Product=="TM195")
dTM498 <- d1 %>% filter(d1$Product=="TM498")
dTM798 <- d1 %>% filter(d1$Product=="TM798")

lapply(list(dTM195, dTM498, dTM798), nrow)
## [[1]]
## [1] 80
## 
## [[2]]
## [1] 60
## 
## [[3]]
## [1] 40
# combine all gender(Male, Female) into one dataframe
BD.less<-rbind(dTM195,dTM498,dTM798)

#Conduct ANOVA on Age
# plot histogram for Income 
par(mfcol=c(2,2))
hist(dTM195$Income, main="Histogram for Purchased TM195", xlab="Income", ylab="No of customers", ylim = c(0,25), labels=TRUE, col = "Orange")
hist(dTM498$Income, main="Histogram for Purchased TM498", xlab="Income", ylab="No of customers", ylim = c(0,25),labels=TRUE, col = "Orange")
hist(dTM798$Income, main="Histogram for Purchased TM798", xlab="Income", ylab="No of customers",  ylim = c(0,25), labels=TRUE, col = "Orange")

## check equality of variance of data

## check equality of variance

par(mfcol=c(2,2))
qqnorm(dTM195$Income, main="QQplot for purchading TM195", xlab="Income", ylab = "No of customers")
qqline(dTM195$Income)
qqnorm(dTM498$Income, main="QQplot for purchading TM498", xlab="Income", ylab = "No of customers")
qqline(dTM498$Income)
qqnorm(dTM798$Income, main="QQplot for purchading TM798", xlab="Income", ylab = "No of customers")
qqline(dTM798$Income)

##boxplot
boxplot(d1$Income ~ d1$Product)

# Shapiro Wilk Test
lapply(list(dTM195, dTM498, dTM798), 
       function(d1)
       {
          shapiro.test(d1$Income)
       })
## [[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Income
## W = 0.97657, p-value = 0.1489
## 
## 
## [[2]]
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Income
## W = 0.97486, p-value = 0.2503
## 
## 
## [[3]]
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Income
## W = 0.9128, p-value = 0.004605

Finding In output [1], the test statistic (W) is 0.97657, and the p-value is 0.1489. In output [2], the test statistic (W) is 0.97486, and the p-value is 0.2503. In output [3], the test statistic (W) is 0.9128, and the p-value is 0.004605.

In all 3 cases, Since the p-value is less than the significance level of 0.05, we reject the null hypothesis. Thus, we can assume that the data does not follow a normal distribution.

# Since sample sizes are not equal, check if equal variance assumption is met
fligner.test(Income~ Product, d1)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Income by Product
## Fligner-Killeen:med chi-squared = 53.891, df = 2, p-value = 1.985e-12

Finding

The p-value associated with this test is 1.985e-12, which is less than the typical alpha level of 0.05. Therefore, we can reject the null hypothesis that the variances across the groups are equal, and conclude that there is evidence of heterogeneity of variances.

# check if sample sizes are equal to see if equal variance assumption is important to meet
table(d1$Product)
## 
## TM195 TM498 TM798 
##    80    60    40
# Since sample sizes are not equal, check if equal variance assumption is met
fligner.test(Income ~ Product,d1)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Income by Product
## Fligner-Killeen:med chi-squared = 53.891, df = 2, p-value = 1.985e-12

The p-value associated with the test statistic is very small (1.985e-12), which indicates strong evidence against the null hypothesis of homogeneity of variances across the three groups. Therefore, you can conclude that there is a significant difference in the variances of income across the three products

#fligner.test gave p=0.1016> 0.05. Hence we cannot reject H0 that variances are equal. Based on this result, we proceed to conduct ANOVA test.

#ANOVA on Products

aov.usage<-aov(d1$Income ~ as.factor(d1$Product)) #note the group variable should be a factor
summary(aov.usage)
##                        Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(d1$Product)   2 2.449e+10 1.225e+10   89.26 <2e-16 ***
## Residuals             177 2.428e+10 1.372e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “F value” of 89.26 indicates that the variance between the groups (based on the means of the usage) is much larger than the variance within the groups. The p-value (<2e-16) of less than 0.001 indicates strong evidence against the null hypothesis of equal means, thus we can conclude that there is a significant difference in the mean usage between at least two of the product groups.

In summary, the ANOVA results suggest that there is a significant effect of the “Product” variable on the “Income” variable, indicating that the mean usage of at least one of the product groups is significantly different from the others.

Type your answer here.

END: YOUR ANSWER

Q2.(c) Comparing customer usage and exercise pattern

There are two variables that capture the customers usage and exercise pattern: Usage and Miles.

    1. Test if there is any significant difference in mean Usage across the 3 products - TM195, TM498 versus TM798 (3 marks)
    1. Repeat the above test (i) for Miles.(3 marks)
    1. Test if the mean Miles for customers is higher for customers with “High” fitness versus those with “Low” fitness. “High” fitness is defined by customers with a self-rated fitness score that is greater than the median while “Low” fitness is defined by customer with a self-rated fitness score equal or lower than the median. (Hint: You can create a categorical variable for the two levels of Fitness.) (4 marks)

Set up the appropriate hypotheses and conduct the hypotheses tests using 5% significance level. Type your hypotheses and findings in the space below.

For tutorial discussion: Based on your analyses of the CardioGoodFitness restore, discuss how you think the 3 types of treadmills may differ (e.g. in terms of the features or customers that are attracted to the product).

BEGIN: YOUR ANSWER

##- i. Test if there is any significant difference in mean Usage across the 3 products - TM195, TM498 versus TM798 (3 marks)

## Type your codes here

dTM195 <- d1 %>% filter(d1$Product=="TM195")
dTM498 <- d1 %>% filter(d1$Product=="TM498")
dTM798 <- d1 %>% filter(d1$Product=="TM798")

lapply(list(dTM195, dTM498, dTM798), nrow)
## [[1]]
## [1] 80
## 
## [[2]]
## [1] 60
## 
## [[3]]
## [1] 40

𝐻0: 𝜇1 = 𝜇2 = 𝜇3 𝐻𝑎: at least one mean is different from the others ANOVA 1. are randomly and independently obtained, 2. are normally distributed, and 3. have equal variances. ANOVA requires assumptions that the m groups or factor levels being studied represent populations whose outcome measures: What statistical test do we use to test the differences in means? Where: • 𝜇1 = Mean Usage for purchased TM195 • 𝜇2 = Mean Usage for purchased TM498 • 𝜇3 = Mean Usage for purchased TM798

# combine all product(TM195,TM498,TM798) into one dataframe
BD.less<-rbind(dTM195,dTM498,dTM798)

#Conduct ANOVA on Age
# plot histogram for Income 
par(mfcol=c(2,2))
hist(dTM195$Usage, main="Histogram for Purchased TM195", xlab="Usage", ylab="No of customers",  ylim=c(0,40), labels=TRUE, col = "Orange")
hist(dTM498$Usage, main="Histogram for Purchased TM498", xlab="Usage", ylab="No of customers",  ylim=c(0,40), labels=TRUE, col = "Orange")
hist(dTM798$Usage, main="Histogram for Purchased TM798", xlab="Usage", ylab="No of customers",  ylim=c(0,40), labels=TRUE, col = "Orange")

# plot QQplot

par(mfcol=c(2,2))
qqnorm(dTM195$Usage, main="QQplot for TM195`", xlab="Usage")
qqline(dTM195$Usage)
qqnorm(dTM498$Usage, main="QQplot for TM498`", xlab="Usage")
qqline(dTM498$Usage)
qqnorm(dTM798$Usage, main="QQplot for TM798`", xlab="Usage")
qqline(dTM798$Usage)

# plot boxplot
boxplot(d1$Usage ~ d1$Product, cex.axis=0.8)

# Shapiro Wilk Test
lapply(list(dTM195, dTM498, dTM798), 
       function(d1)
       {
          shapiro.test(d1$Usage)
       })
## [[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Usage
## W = 0.84728, p-value = 1.313e-07
## 
## 
## [[2]]
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Usage
## W = 0.84205, p-value = 1.789e-06
## 
## 
## [[3]]
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Usage
## W = 0.85138, p-value = 9.721e-05

Finding In all three cases, the p-value is less than 0.05, which is a common significance level used in hypothesis testing. This means that we reject the null hypothesis that the data is normally distributed. Therefore, we can conclude that the variable “Usage” is not normally distributed in this data set.

# check if sample sizes are equal to see if equal variance assumption is important to meet
table(d1$Product)
## 
## TM195 TM498 TM798 
##    80    60    40
# Since sample sizes are not equal, check if equal variance assumption is met
fligner.test(Usage ~ Product,d1)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Usage by Product
## Fligner-Killeen:med chi-squared = 4.5734, df = 2, p-value = 0.1016

The null hypothesis of the test is that the variances of the groups are equal, while the alternative hypothesis is that at least one group has a different variance than the others.

Based on the p-value of 0.1016, we fail to reject the null hypothesis at a significance level of 0.05, indicating that there is not enough evidence to conclude that the variances of the groups are significantly different. However, since the p-value is relatively close to the significance level, it may be worth investigating further to see if the result is practically significant.

Overall, the Fligner-Killeen test suggests that there is no strong evidence of a difference in variances between the groups in the “Usage by Product” data.

#fligner.test gave p=0.1016> 0.05. Hence we cannot reject H0 that variances are equal. Based on this result, we proceed to conduct ANOVA test.

#ANOVA on Products

aov.usage<-aov(d1$Usage ~ as.factor(d1$Product)) #note the group variable should be a factor
summary(aov.usage)
##                        Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(d1$Product)   2  89.55   44.77   65.44 <2e-16 ***
## Residuals             177 121.10    0.68                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “F value” of 65.44 indicates that the variance between the groups (based on the means of the usage) is much larger than the variance within the groups. The p-value (<2e-16) of less than 0.001 indicates strong evidence against the null hypothesis of equal means, thus we can conclude that there is a significant difference in the mean usage between at least two of the product groups.

In summary, the ANOVA results suggest that there is a significant effect of the “Product” variable on the “Usage” variable, indicating that the mean usage of at least one of the product groups is significantly different from the others.

ii. Repeat the above test (i) for Miles.(3 marks)

𝐻0: 𝜇1 = 𝜇2 = 𝜇3 𝐻𝑎: at least one mean is different from the others ANOVA 1. are randomly and independently obtained, 2. are normally distributed, and 3. have equal variances. ANOVA requires assumptions that the m groups or factor levels being studied represent populations whose outcome measures: What statistical test do we use to test the differences in means? Where: • 𝜇1 = Mean Miles for purchased TM195 • 𝜇2 = Mean Miles for purchased TM498 • 𝜇3 = Mean Miles for purchased TM798

# combine all product(TM195,TM498,TM798) into one dataframe
BD.less<-rbind(dTM195,dTM498,dTM798)

#Conduct ANOVA on Age
# plot histogram for Income 
par(mfcol=c(2,2))
hist(dTM195$Miles, main="Histogram for Purchased TM195", xlab="Miles", ylab="No of customers",  ylim=c(0,30), labels=TRUE, col = "Orange")
hist(dTM498$Miles, main="Histogram for Purchased TM498", xlab="Miles", ylab="No of customers",  ylim=c(0,30), labels=TRUE, col = "Orange")
hist(dTM798$Miles, main="Histogram for Purchased TM798", xlab="Miles", ylab="No of customers",  ylim=c(0,30), labels=TRUE, col = "Orange")

# plot QQplot

par(mfcol=c(2,2))
qqnorm(dTM195$Miles, main="QQplot for TM195`", xlab="Miles")
qqline(dTM195$Miles)
qqnorm(dTM498$Miles, main="QQplot for TM498`", xlab="Miles")
qqline(dTM498$Miles)
qqnorm(dTM798$Miles, main="QQplot for TM798`", xlab="Miles")
qqline(dTM798$Miles)

# plot boxplot
boxplot(d1$Miles ~ d1$Product, cex.axis=0.8)

# Shapiro Wilk Test
lapply(list(dTM195, dTM498, dTM798), 
       function(d1)
       {
          shapiro.test(d1$Miles)
       })
## [[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Miles
## W = 0.93074, p-value = 0.0003177
## 
## 
## [[2]]
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Miles
## W = 0.91622, p-value = 0.000542
## 
## 
## [[3]]
## 
##  Shapiro-Wilk normality test
## 
## data:  d1$Miles
## W = 0.90513, p-value = 0.002706
# check if sample sizes are equal to see if equal variance assumption is important to meet
table(d1$Product)
## 
## TM195 TM498 TM798 
##    80    60    40
# Since sample sizes are not equal, check if equal variance assumption is met
fligner.test(Miles ~ Product,d1)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Miles by Product
## Fligner-Killeen:med chi-squared = 17.002, df = 2, p-value = 0.0002033

The reported test statistic is 17.002 and the p-value is 0.0002033. The null hypothesis of the test is that the variances of the “Miles” variable are equal across all three products. The low p-value suggests that there is strong evidence to reject the null hypothesis and conclude that the variances are not equal across all products.

In summary, based on the results of the Fligner-Killeen test, there is evidence to suggest that the variances of the “Miles” variable differ across the three products being compared.

#fligner.test gave p=0.1016> 0.05. Hence we cannot reject H0 that variances are equal. Based on this result, we proceed to conduct ANOVA test.

#ANOVA on Products

aov.usage<-aov(d1$Miles ~ as.factor(d1$Product)) #note the group variable should be a factor
summary(aov.usage)
##                        Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(d1$Product)   2 209625  104813   68.24 <2e-16 ***
## Residuals             177 271855    1536                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “F value” of 68.24 indicates that the variance between the groups (based on the means of the usage) is much larger than the variance within the groups. The p-value (<2e-16) of less than 0.001 indicates strong evidence against the null hypothesis of equal means, thus we can conclude that there is a significant difference in the mean usage between at least two of the product groups.

In summary, the ANOVA results suggest that there is a significant effect of the “Product” variable on the “Miles” variable, indicating that the mean usage of at least one of the product groups is significantly different from the others.

##- iii. Test if the mean Miles for customers is higher for customers with “High” fitness versus those with “Low” fitness. “High” fitness is defined by customers with a self-rated fitness score that is greater than the median while “Low” fitness is defined by customer with a self-rated fitness score equal or lower than the median. (Hint: You can create a categorical variable for the two levels of Fitness.) (4 marks)

# Compute the median of Fitness
med_fitness <- median(d1$Fitness)

# Create a new variable for Fitness_level
d1$fitness_level <- NA
d1$fitness_level[d1$Fitness>med_fitness] <- "High"
d1$fitness_level[d1$Fitness<=med_fitness] <- "Low"

d1$fitness_factor<-factor(d1$fitness_level, levels = c("High","Low"))
t.test(d1$Fitness~d1$fitness_factor,alternative="greater")
## 
##  Welch Two Sample t-test
## 
## data:  d1$Fitness by d1$fitness_factor
## t = 22.753, df = 96.712, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High and group Low is greater than 0
## 95 percent confidence interval:
##  1.671988      Inf
## sample estimates:
## mean in group High  mean in group Low 
##           4.563636           2.760000

Finding Based on the output, we can see that the t-statistic is 22.753, and the degrees of freedom are 96.712. The p-value is less than 2.2e-16, which indicates strong evidence against the null hypothesis that there is no difference in means between the two groups.

The alternative hypothesis is that the true difference in means between the High and Low groups is greater than 0. The 95 percent confidence interval for the difference in means is 1.671988 to infinity, which suggests that the mean for the High group is significantly larger than the mean for the Low group.

The sample estimates show that the mean for the High group is 4.563636, and the mean for the Low group is 2.760000. This indicates that the High group has a higher mean than the Low group. Overall, these results suggest that there is a significant difference in the means between the two groups, with the High group having a significantly higher mean than the Low group.

Type your answer here.

END: YOUR ANSWER