R Markdown, is an extremely useful tool that professional data scientists and business analysts use in their day-to-day work.
You can open R Markdown documents in RStudio as well. You should see a little command called “Knit”, which allows you to “knit” the entire R Markdown file into a HTML document, or a PDF document, or a MS Word Document (note, for MS Word, you’ll need MS Word installed on your system; for PDF, you need to have Tex/Latex distribution installed).
R Markdown is nice to use simply because it allows you to embed both code and write-up into the same document, and it produces presentable output, so you can use it to generate reports from your homework, and, when you eventually go out to work in a company, for your projects.
Here’s how you embed a “chunk” of R code.
1+1
## [1] 2
After the three apostrophes, you’ll need r, then you can
give the chunk a name. Please note that CHUNK NAMES HAVE TO BE A
SINGLE-WORD, NO SPACE ALLOWED. Also, names have to be unique,
that is, every chunk needs a different name (this has
led to rendering failures in previous final exams). You can give chunks
names like:
chunk1read-in-datarun-regressionor, what will help you with homework:
load-libraryq1.(a) etcThese names are for you to help organize your code. (In practice it will be very useful when you have files with thousands of lines of code…). After the name of the chunk, you can give it certain options, separated by commas. I will highlight one important option.
echo=TRUE means the code chunk will be copied into the
output file. For homework purposes, ALWAYS set
echo=TRUE so we know what code you wrote. When you go out
to work in a company and you want to produce nice looking reports, feel
free to set it to FALSE.There is a lot to syntax to learn using the R Markdown, but we don’t need you to be an expert in R Markdown (although we do expect proficiency in R!). Hopefully, you can copy all the R Markdown syntax you need from the templates we provide.
Note about working directories in R Markdown. If you do not
specify your working directory via setwd('...'), and you
hit Knit, the document will assume that the working
directory is the directory that the .rmd file is in. Thus,
if your rmd is in XYZ/folder1/code.rmd and your dataset is
XYZ/folder1/data.csv, then you can simply run
d0 <- read.csv('data.csv') without running
setwd().
output: html_document.html format for assignments/exam. You
may want to play with PDF file using pdf_document for your own
benefit.echo=TRUE in all
chunks.library('package_name')
statements that you’ll need to run your code and future
reproduction.T[X]_[MatricNumber].rmd,
and the output will automatically be
T[X]_[MatricNumber].html.
T5_12345.html# install required packages if you have not (suggested packages: rcompanion, rstatix, Rmisc, dplyr, tidyr, rpivotTable, knitr, psych)
# install.packages("dplyr") #only need to run this code once to install the package
# load required packages
# library("xxxx")
library("rcompanion") #this package is required for transformTukey function
library("rstatix")
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library("Rmisc")
## Loading required package: lattice
## Loading required package: plyr
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:rstatix':
##
## desc, mutate
library("dplyr") #need to call the library before you use the package
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("tidyr")
library("rpivotTable")
library("knitr")
library("psych")
##
## Attaching package: 'psych'
## The following object is masked from 'package:rcompanion':
##
## phi
In the last tutorial, you were tasked to help the store manager develop dashboards that will enable him to gain better insights of the data. In this tutorial, you will use the data to conduct sampling estimation and hypotheses testing. Where necessary, check the distribution for the variables and for the presence of outliers.
You may use the following guideline to round off your final answers: If the answer is greater than 1, round off to 2 decimal places. If the answer is less than 1, round off to 3 significant numbers. When rounding, also take note of the natural rounding points, for example, costs in dollars would round off to 2 decimal places.Context: Your market research team at AdRight is assigned to study the profile of the typical customer for each treadmill product offered by CardioGood Fitness. The team decides to collect data on individuals who purchased a treadmill at a CardioGoodFitness retail store during the prior three months.
The data are stored in the CardioGoodFitness.csv file.
The team identifies the following customer variables to study:
Product: product purchased (TM195, TM498, or
TM798)Age: in yearsGender: Female or MaleEducation: number of years of educationMaritalStatus: Single or PartneredUsage: average number of times the customer plans to
use the treadmill each weekFitness: self-rated fitness on an 1-to-5 scale, where 1
is poor shape and 5 is excellent shape.Income: annual household income ($)Miles: average number of miles the customer expects to
walk/run each week## Please check that the .csv file is in the same directory as your Rmd file.
# setwd(" ") (Make sure to set your directory if you get an error "cannot open connection")
d1 <- read.csv("CardioGoodFitness.csv")
head(d1,10)
## Product Age Gender Education MaritalStatus Usage Fitness Income Miles
## 1 TM195 18 Male 14 Single 3 4 29562 112
## 2 TM195 19 Male 15 Single 2 3 31836 75
## 3 TM195 19 Female 14 Partnered 4 3 30699 66
## 4 TM195 19 Male 12 Single 3 3 32973 85
## 5 TM195 20 Male 13 Partnered 4 2 35247 47
## 6 TM195 20 Female 14 Partnered 3 3 32973 66
## 7 TM195 21 Female 14 Partnered 3 3 35247 75
## 8 TM195 21 Male 13 Single 3 3 32973 85
## 9 TM195 21 Male 15 Single 5 4 35247 141
## 10 TM195 21 Female 15 Partnered 2 3 37521 85
summary(d1)
## Product Age Gender Education
## Length:180 Min. :18.00 Length:180 Min. :12.00
## Class :character 1st Qu.:24.00 Class :character 1st Qu.:14.00
## Mode :character Median :26.00 Mode :character Median :16.00
## Mean :28.79 Mean :15.57
## 3rd Qu.:33.00 3rd Qu.:16.00
## Max. :50.00 Max. :21.00
## MaritalStatus Usage Fitness Income
## Length:180 Min. :2.000 Min. :1.000 Min. : 29562
## Class :character 1st Qu.:3.000 1st Qu.:3.000 1st Qu.: 44059
## Mode :character Median :3.000 Median :3.000 Median : 50597
## Mean :3.456 Mean :3.311 Mean : 53720
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.: 58668
## Max. :7.000 Max. :5.000 Max. :104581
## Miles
## Min. : 21.0
## 1st Qu.: 66.0
## Median : 94.0
## Mean :103.2
## 3rd Qu.:114.8
## Max. :360.0
str(d1)
## 'data.frame': 180 obs. of 9 variables:
## $ Product : chr "TM195" "TM195" "TM195" "TM195" ...
## $ Age : int 18 19 19 19 20 20 21 21 21 21 ...
## $ Gender : chr "Male" "Male" "Female" "Male" ...
## $ Education : int 14 15 14 12 13 14 14 13 15 15 ...
## $ MaritalStatus: chr "Single" "Single" "Partnered" "Single" ...
## $ Usage : int 3 2 4 3 4 3 3 3 5 2 ...
## $ Fitness : int 4 3 3 3 2 3 3 3 4 3 ...
## $ Income : int 29562 31836 30699 32973 35247 32973 35247 32973 35247 37521 ...
## $ Miles : int 112 75 66 85 47 66 75 85 141 85 ...
Using the data that the team has collected:
Age. Explain this
finding to the team. (3 marks)Type your findings and explanations in the space below.
For tutorial discussion: Repeat the above with 90% intervals. Compare the widths of the 90% intervals vs 95% intervals. Which intervals provide better precision?
BEGIN: YOUR ANSWER
Age. Explain this
finding to the team. (3 marks)h1<-hist(d1$Age, main="Histogram of Customer Age",xlab="Customer Age",ylab="No. of Customers", col=c("darkorange"),ylim=c(0,75), labels=TRUE)
qqnorm(d1$Age, ylab="Sample Quantities for 'Age' for Customers")
qqline(d1$Age,col="red")
# Shapiro-Wilk normality test
shapiro.test(d1$Age)
##
## Shapiro-Wilk normality test
##
## data: d1$Age
## W = 0.91577, p-value = 1.183e-08
Finding The test statistic W is 0.91577 and the p-value is 1.183e-08 (very small). This result suggests that the sample does not come from a normally distributed population, as the p-value is less than the commonly used significance level of 0.05. Therefore, we reject the null hypothesis and conclude that the “Age” variable is not normally distributed.
#transform of 'Age' variable since the data is normally distributed
d1$Age.t = transformTukey(d1$Age, plotit=TRUE)
##
## lambda W Shapiro.p.value
## 351 -1.25 0.9808 0.01379
##
## if (lambda > 0){TRANS = x ^ lambda}
## if (lambda == 0){TRANS = log(x)}
## if (lambda < 0){TRANS = -1 * x ^ lambda}
BEGIN: YOUR ANSWER
#using t-distribution to calculate the 95% CI of upper and lower Interval of true age population
mnage.t <- mean(d1$Age.t)
sdage.t <- sd(d1$Age.t)
lPI95.aget <- mnage.t + (qt(0.025, df = (nrow(d1)-1))*sdage.t*sqrt(1+1/nrow(d1)))
uPI95.aget <- mnage.t - (qt(0.025, df = (nrow(d1)-1))*sdage.t*sqrt(1+1/nrow(d1)))
#reverse transform
#using y=-1*x^lamda where lambda(-1.25) less than 0
lPT95.age <- (-1/lPI95.aget) ^(1/1.25)
uPT95.age <- (-1/uPI95.aget) ^(1/1.25)
kable(cbind(lPT95.age, uPT95.age), digits=2, caption="Reverse Transformation Interval 95%CI")
| lPT95.age | uPT95.age |
|---|---|
| 19.41 | 49.26 |
Reverse transform; to derive the formula y=-1*x^lamda -y = x^-1.25 = 1/(x^1,25) x^1,25 = -1/y x = (-1/y)^(1/1.25) where y can be with lPI95.aget or uPI95.aget Hence, 95% PI of the transformed Interval of age of the true population lies between 19.41 to 49.26
#using t-distribution to calculate the 90% CI of upper and lower Interval of true age population
mnage.t <- mean(d1$Age.t)
sdage.t <- sd(d1$Age.t)
lPI90.aget <- mnage.t + (qt(0.05, df = (nrow(d1)-1))*sdage.t*sqrt(1+1/nrow(d1)))
uPI90.aget <- mnage.t - (qt(0.05, df = (nrow(d1)-1))*sdage.t*sqrt(1+1/nrow(d1)))
#kable(cbind(lPI90.aget, uPI95.aget), caption = "90% Prediction Interval for Age for Customer")
#reverse transform
#using y=-1*x^lamda where lambda(-1.25) less than 0
lPT90.age <- (-1/lPI90.aget) ^(1/1.25)
uPT90.age <- (-1/uPI90.aget) ^(1/1.25)
kable(cbind(lPT90.age, uPT90.age), digits=2, caption="Reverse Transformation Interval 90%CI")
| lPT90.age | uPT90.age |
|---|---|
| 20.33 | 43.19 |
Reverse transform; to derive the formula y=-1*x^lamda -y = x^-1.25 = 1/(x^1,25) x^1,25 = -1/y x = (-1/y)^(1/1.25) where y can be with lPI90.aget or uPI90.aget Hence, 90% PI of the transformed Interval of age of the true population lies between 20.33 to 43.19
#using t.test function
t.test(d1$Age,
alternative="two.sided",
mu=30,
conf.level = 0.95)
##
## One Sample t-test
##
## data: d1$Age
## t = -2.3401, df = 179, p-value = 0.02038
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
## 27.76763 29.81015
## sample estimates:
## mean of x
## 28.78889
#using t.test function
t.test(d1$Age,alternative="two.sided",
mu=30,
conf.level = 0.90)
##
## One Sample t-test
##
## data: d1$Age
## t = -2.3401, df = 179, p-value = 0.02038
## alternative hypothesis: true mean is not equal to 30
## 90 percent confidence interval:
## 27.93319 29.64459
## sample estimates:
## mean of x
## 28.78889
Finding
Using One Sample t-test on both tests, we set the null hypothesis(H0) is that the true mean age is equal to 30, and the alternative hypothesis(H1) is that it is not equal to 30.
The p-value for both tests is 0.02038, which is less than the significance level of 0.05, indicating that we reject the null hypothesis in favor of the alternative hypothesis.
In 95% confidence interval of (27.76763, 29.81015), which means that we are 95% confident that the true mean age falls between 27.76763 and 29.81015, while 90% confidence interval of (27.93319, 29.64459), which means that we are 90% confident that the true mean age falls between 27.93319 and 29.64459. Hence, we can conclude that we can reject the claim since in both 95% CI and 90% CI are the true mean age is less than 30.
dMale <- d1 %>% filter(d1$Gender=="Male")
## 90% confidence interval for proportion of Male customers
prop.test(x=nrow(dMale),n=nrow(d1),conf.level = 0.95)
##
## 1-sample proportions test with continuity correction
##
## data: nrow(dMale) out of nrow(d1), null probability 0.5
## X-squared = 4.05, df = 1, p-value = 0.04417
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.5019570 0.6502243
## sample estimates:
## p
## 0.5777778
dMale <- d1 %>% filter(d1$Gender=="Male")
## 90% confidence interval for proportion of Male customers
prop.test(x=nrow(dMale),n=nrow(d1),conf.level = 0.90)
##
## 1-sample proportions test with continuity correction
##
## data: nrow(dMale) out of nrow(d1), null probability 0.5
## X-squared = 4.05, df = 1, p-value = 0.04417
## alternative hypothesis: true p is not equal to 0.5
## 90 percent confidence interval:
## 0.5137230 0.6394242
## sample estimates:
## p
## 0.5777778
Finding
Using one-sample proportions test, he null hypothesis is that the true proportion of males in the sample is 0.5 (i.e., an equal number of males and females), while the alternative hypothesis is that the true proportion is different from 0.5 (i.e., there are more males or females in the sample).
The sample proportion is 0.5777778, which suggests that there are slightly more males than females in the sample. The 95% confidence interval is [0.5019570, 0.6502243], which means that we can be 95% confident that the true proportion of males in the population lies between 0.5019570 and 0.6502243. The 90% confidence interval is [0.5137230, 0.6394242], which means that we can be 90% confident that the true proportion of males in the population lies between 0.5137230 and 0.6394242.
Type your answer here.
END: YOUR ANSWER
Could you assist the team to assess whether there is any difference in mean income between
Plot a graph for each of the comparisons, then set up the appropriate hypotheses and conduct the hypotheses tests using a 5% significance level. Type the hypotheses and your findings in the space below.
(8 marks)
BEGIN: YOUR ANSWER
Hypotheses:
Null hypothesis (H0): There is no significant difference in mean income between male and female customers.
Alternative hypothesis (Ha): There is a significant difference in mean income between male and female customers.
𝐻0: 𝜇income of Male − 𝜇income of Female = 0
𝐻𝑎: 𝜇income of Male − 𝜇income of Female ≠ 0
Findings:
p-value = 0.004021, which is less than the 5% significance level
Therefore, we reject the null hypothesis
There is a significant difference in mean income between male and female customers
95% confidence interval for the difference in means is (2174.392, 11293.312) and does not contain 0, providing additional evidence for the significant difference.
## Type your codes here
dMale <- d1 %>% filter(d1$Gender=="Male")
dFemale <- d1 %>% filter(d1$Gender=="Female")
lapply(list(dMale, dFemale), nrow)
## [[1]]
## [1] 104
##
## [[2]]
## [1] 76
# combine all gender(Male, Female) into one dataframe
BD.less<-rbind(dMale,dFemale)
#Conduct ANOVA on Age
# plot histogram for Income
par(mfcol=c(1,2))
hist(dMale$Income, main="Histogram for Male Customers", xlab="Income", ylab="No of customers", ylim = c(0,35), col= "blue", labels=TRUE, xaxp=c(0,110000,20), cex.axis=0.8)
hist(dFemale$Income, main="Histogram for Female Customer", xlab="Income", ylab="No of customers", ylim = c(0,35), col= "red", labels=TRUE, xaxp=c(0,110000,20), cex.axis=0.8)
t.test(dMale$Income, dFemale$Income)
##
## Welch Two Sample t-test
##
## data: dMale$Income and dFemale$Income
## t = 2.9146, df = 177.23, p-value = 0.004021
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2174.392 11293.312
## sample estimates:
## mean of x mean of y
## 56562.76 49828.91
Finding The p-value of the test is 0.004021, which is less than the typical significance level of 0.05. This indicates strong evidence against the null hypothesis that there is no difference in mean income between males and females.
The alternative hypothesis, which states that the true difference in means is not equal to 0, is supported by the results of the test. The 95 percent confidence interval for the difference in means is (2174.392, 11293.312), which indicates that we can be 95% confident that the true difference in means lies within this interval.
The sample estimates for the mean income of males and females are $56,562.76 and $49,828.91, respectively. The difference between these means is $6,733.85, which is positive and suggests that males have a higher mean income than females in the sample
𝐻0: 𝜇1 = 𝜇2 = 𝜇3 𝐻𝑎: at least one mean is different from the others ANOVA 1. are randomly and independently obtained, 2. are normally distributed, and 3. have equal variances. ANOVA requires assumptions that the m groups or factor levels being studied represent populations whose outcome measures: What statistical test do we use to test the differences in means? Where: • 𝜇1 = Mean income for purchased TM195 • 𝜇2 = Mean income for purchased TM498 • 𝜇3 = Mean income for purchased TM798
## Type your codes here
dTM195 <- d1 %>% filter(d1$Product=="TM195")
dTM498 <- d1 %>% filter(d1$Product=="TM498")
dTM798 <- d1 %>% filter(d1$Product=="TM798")
lapply(list(dTM195, dTM498, dTM798), nrow)
## [[1]]
## [1] 80
##
## [[2]]
## [1] 60
##
## [[3]]
## [1] 40
# combine all gender(Male, Female) into one dataframe
BD.less<-rbind(dTM195,dTM498,dTM798)
#Conduct ANOVA on Age
# plot histogram for Income
par(mfcol=c(2,2))
hist(dTM195$Income, main="Histogram for Purchased TM195", xlab="Income", ylab="No of customers", ylim = c(0,25), labels=TRUE, col = "Orange")
hist(dTM498$Income, main="Histogram for Purchased TM498", xlab="Income", ylab="No of customers", ylim = c(0,25),labels=TRUE, col = "Orange")
hist(dTM798$Income, main="Histogram for Purchased TM798", xlab="Income", ylab="No of customers", ylim = c(0,25), labels=TRUE, col = "Orange")
## check equality of variance of data
## check equality of variance
par(mfcol=c(2,2))
qqnorm(dTM195$Income, main="QQplot for purchading TM195", xlab="Income", ylab = "No of customers")
qqline(dTM195$Income)
qqnorm(dTM498$Income, main="QQplot for purchading TM498", xlab="Income", ylab = "No of customers")
qqline(dTM498$Income)
qqnorm(dTM798$Income, main="QQplot for purchading TM798", xlab="Income", ylab = "No of customers")
qqline(dTM798$Income)
##boxplot
boxplot(d1$Income ~ d1$Product)
# Shapiro Wilk Test
lapply(list(dTM195, dTM498, dTM798),
function(d1)
{
shapiro.test(d1$Income)
})
## [[1]]
##
## Shapiro-Wilk normality test
##
## data: d1$Income
## W = 0.97657, p-value = 0.1489
##
##
## [[2]]
##
## Shapiro-Wilk normality test
##
## data: d1$Income
## W = 0.97486, p-value = 0.2503
##
##
## [[3]]
##
## Shapiro-Wilk normality test
##
## data: d1$Income
## W = 0.9128, p-value = 0.004605
Finding In output [1], the test statistic (W) is 0.97657, and the p-value is 0.1489. In output [2], the test statistic (W) is 0.97486, and the p-value is 0.2503. In output [3], the test statistic (W) is 0.9128, and the p-value is 0.004605.
In all 3 cases, Since the p-value is less than the significance level of 0.05, we reject the null hypothesis. Thus, we can assume that the data does not follow a normal distribution.
# Since sample sizes are not equal, check if equal variance assumption is met
fligner.test(Income~ Product, d1)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Income by Product
## Fligner-Killeen:med chi-squared = 53.891, df = 2, p-value = 1.985e-12
Finding
The p-value associated with this test is 1.985e-12, which is less than the typical alpha level of 0.05. Therefore, we can reject the null hypothesis that the variances across the groups are equal, and conclude that there is evidence of heterogeneity of variances.
# check if sample sizes are equal to see if equal variance assumption is important to meet
table(d1$Product)
##
## TM195 TM498 TM798
## 80 60 40
# Since sample sizes are not equal, check if equal variance assumption is met
fligner.test(Income ~ Product,d1)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Income by Product
## Fligner-Killeen:med chi-squared = 53.891, df = 2, p-value = 1.985e-12
The p-value associated with the test statistic is very small (1.985e-12), which indicates strong evidence against the null hypothesis of homogeneity of variances across the three groups. Therefore, you can conclude that there is a significant difference in the variances of income across the three products
#fligner.test gave p=0.1016> 0.05. Hence we cannot reject H0 that variances are equal. Based on this result, we proceed to conduct ANOVA test.
#ANOVA on Products
aov.usage<-aov(d1$Income ~ as.factor(d1$Product)) #note the group variable should be a factor
summary(aov.usage)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(d1$Product) 2 2.449e+10 1.225e+10 89.26 <2e-16 ***
## Residuals 177 2.428e+10 1.372e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The “F value” of 89.26 indicates that the variance between the groups (based on the means of the usage) is much larger than the variance within the groups. The p-value (<2e-16) of less than 0.001 indicates strong evidence against the null hypothesis of equal means, thus we can conclude that there is a significant difference in the mean usage between at least two of the product groups.
In summary, the ANOVA results suggest that there is a significant effect of the “Product” variable on the “Income” variable, indicating that the mean usage of at least one of the product groups is significantly different from the others.
Type your answer here.
END: YOUR ANSWER
There are two variables that capture the customers usage and exercise
pattern: Usage and Miles.
Usage across the 3 products - TM195, TM498 versus TM798 (3
marks)Miles.(3 marks)Miles for customers is higher for
customers with “High” fitness versus those with “Low” fitness. “High”
fitness is defined by customers with a self-rated fitness score that is
greater than the median while “Low” fitness is defined by customer with
a self-rated fitness score equal or lower than the median. (Hint: You
can create a categorical variable for the two levels of Fitness.) (4
marks)Set up the appropriate hypotheses and conduct the hypotheses tests using 5% significance level. Type your hypotheses and findings in the space below.
For tutorial discussion: Based on your analyses of the CardioGoodFitness restore, discuss how you think the 3 types of treadmills may differ (e.g. in terms of the features or customers that are attracted to the product).
BEGIN: YOUR ANSWER
##- i. Test if there is any significant difference in mean
Usage across the 3 products - TM195, TM498 versus TM798 (3
marks)
## Type your codes here
dTM195 <- d1 %>% filter(d1$Product=="TM195")
dTM498 <- d1 %>% filter(d1$Product=="TM498")
dTM798 <- d1 %>% filter(d1$Product=="TM798")
lapply(list(dTM195, dTM498, dTM798), nrow)
## [[1]]
## [1] 80
##
## [[2]]
## [1] 60
##
## [[3]]
## [1] 40
𝐻0: 𝜇1 = 𝜇2 = 𝜇3 𝐻𝑎: at least one mean is different from the others ANOVA 1. are randomly and independently obtained, 2. are normally distributed, and 3. have equal variances. ANOVA requires assumptions that the m groups or factor levels being studied represent populations whose outcome measures: What statistical test do we use to test the differences in means? Where: • 𝜇1 = Mean Usage for purchased TM195 • 𝜇2 = Mean Usage for purchased TM498 • 𝜇3 = Mean Usage for purchased TM798
# combine all product(TM195,TM498,TM798) into one dataframe
BD.less<-rbind(dTM195,dTM498,dTM798)
#Conduct ANOVA on Age
# plot histogram for Income
par(mfcol=c(2,2))
hist(dTM195$Usage, main="Histogram for Purchased TM195", xlab="Usage", ylab="No of customers", ylim=c(0,40), labels=TRUE, col = "Orange")
hist(dTM498$Usage, main="Histogram for Purchased TM498", xlab="Usage", ylab="No of customers", ylim=c(0,40), labels=TRUE, col = "Orange")
hist(dTM798$Usage, main="Histogram for Purchased TM798", xlab="Usage", ylab="No of customers", ylim=c(0,40), labels=TRUE, col = "Orange")
# plot QQplot
par(mfcol=c(2,2))
qqnorm(dTM195$Usage, main="QQplot for TM195`", xlab="Usage")
qqline(dTM195$Usage)
qqnorm(dTM498$Usage, main="QQplot for TM498`", xlab="Usage")
qqline(dTM498$Usage)
qqnorm(dTM798$Usage, main="QQplot for TM798`", xlab="Usage")
qqline(dTM798$Usage)
# plot boxplot
boxplot(d1$Usage ~ d1$Product, cex.axis=0.8)
# Shapiro Wilk Test
lapply(list(dTM195, dTM498, dTM798),
function(d1)
{
shapiro.test(d1$Usage)
})
## [[1]]
##
## Shapiro-Wilk normality test
##
## data: d1$Usage
## W = 0.84728, p-value = 1.313e-07
##
##
## [[2]]
##
## Shapiro-Wilk normality test
##
## data: d1$Usage
## W = 0.84205, p-value = 1.789e-06
##
##
## [[3]]
##
## Shapiro-Wilk normality test
##
## data: d1$Usage
## W = 0.85138, p-value = 9.721e-05
Finding In all three cases, the p-value is less than 0.05, which is a common significance level used in hypothesis testing. This means that we reject the null hypothesis that the data is normally distributed. Therefore, we can conclude that the variable “Usage” is not normally distributed in this data set.
# check if sample sizes are equal to see if equal variance assumption is important to meet
table(d1$Product)
##
## TM195 TM498 TM798
## 80 60 40
# Since sample sizes are not equal, check if equal variance assumption is met
fligner.test(Usage ~ Product,d1)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Usage by Product
## Fligner-Killeen:med chi-squared = 4.5734, df = 2, p-value = 0.1016
The null hypothesis of the test is that the variances of the groups are equal, while the alternative hypothesis is that at least one group has a different variance than the others.
Based on the p-value of 0.1016, we fail to reject the null hypothesis at a significance level of 0.05, indicating that there is not enough evidence to conclude that the variances of the groups are significantly different. However, since the p-value is relatively close to the significance level, it may be worth investigating further to see if the result is practically significant.
Overall, the Fligner-Killeen test suggests that there is no strong evidence of a difference in variances between the groups in the “Usage by Product” data.
#fligner.test gave p=0.1016> 0.05. Hence we cannot reject H0 that variances are equal. Based on this result, we proceed to conduct ANOVA test.
#ANOVA on Products
aov.usage<-aov(d1$Usage ~ as.factor(d1$Product)) #note the group variable should be a factor
summary(aov.usage)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(d1$Product) 2 89.55 44.77 65.44 <2e-16 ***
## Residuals 177 121.10 0.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The “F value” of 65.44 indicates that the variance between the groups (based on the means of the usage) is much larger than the variance within the groups. The p-value (<2e-16) of less than 0.001 indicates strong evidence against the null hypothesis of equal means, thus we can conclude that there is a significant difference in the mean usage between at least two of the product groups.
In summary, the ANOVA results suggest that there is a significant effect of the “Product” variable on the “Usage” variable, indicating that the mean usage of at least one of the product groups is significantly different from the others.
Miles.(3 marks)𝐻0: 𝜇1 = 𝜇2 = 𝜇3 𝐻𝑎: at least one mean is different from the others ANOVA 1. are randomly and independently obtained, 2. are normally distributed, and 3. have equal variances. ANOVA requires assumptions that the m groups or factor levels being studied represent populations whose outcome measures: What statistical test do we use to test the differences in means? Where: • 𝜇1 = Mean Miles for purchased TM195 • 𝜇2 = Mean Miles for purchased TM498 • 𝜇3 = Mean Miles for purchased TM798
# combine all product(TM195,TM498,TM798) into one dataframe
BD.less<-rbind(dTM195,dTM498,dTM798)
#Conduct ANOVA on Age
# plot histogram for Income
par(mfcol=c(2,2))
hist(dTM195$Miles, main="Histogram for Purchased TM195", xlab="Miles", ylab="No of customers", ylim=c(0,30), labels=TRUE, col = "Orange")
hist(dTM498$Miles, main="Histogram for Purchased TM498", xlab="Miles", ylab="No of customers", ylim=c(0,30), labels=TRUE, col = "Orange")
hist(dTM798$Miles, main="Histogram for Purchased TM798", xlab="Miles", ylab="No of customers", ylim=c(0,30), labels=TRUE, col = "Orange")
# plot QQplot
par(mfcol=c(2,2))
qqnorm(dTM195$Miles, main="QQplot for TM195`", xlab="Miles")
qqline(dTM195$Miles)
qqnorm(dTM498$Miles, main="QQplot for TM498`", xlab="Miles")
qqline(dTM498$Miles)
qqnorm(dTM798$Miles, main="QQplot for TM798`", xlab="Miles")
qqline(dTM798$Miles)
# plot boxplot
boxplot(d1$Miles ~ d1$Product, cex.axis=0.8)
# Shapiro Wilk Test
lapply(list(dTM195, dTM498, dTM798),
function(d1)
{
shapiro.test(d1$Miles)
})
## [[1]]
##
## Shapiro-Wilk normality test
##
## data: d1$Miles
## W = 0.93074, p-value = 0.0003177
##
##
## [[2]]
##
## Shapiro-Wilk normality test
##
## data: d1$Miles
## W = 0.91622, p-value = 0.000542
##
##
## [[3]]
##
## Shapiro-Wilk normality test
##
## data: d1$Miles
## W = 0.90513, p-value = 0.002706
# check if sample sizes are equal to see if equal variance assumption is important to meet
table(d1$Product)
##
## TM195 TM498 TM798
## 80 60 40
# Since sample sizes are not equal, check if equal variance assumption is met
fligner.test(Miles ~ Product,d1)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Miles by Product
## Fligner-Killeen:med chi-squared = 17.002, df = 2, p-value = 0.0002033
The reported test statistic is 17.002 and the p-value is 0.0002033. The null hypothesis of the test is that the variances of the “Miles” variable are equal across all three products. The low p-value suggests that there is strong evidence to reject the null hypothesis and conclude that the variances are not equal across all products.
In summary, based on the results of the Fligner-Killeen test, there is evidence to suggest that the variances of the “Miles” variable differ across the three products being compared.
#fligner.test gave p=0.1016> 0.05. Hence we cannot reject H0 that variances are equal. Based on this result, we proceed to conduct ANOVA test.
#ANOVA on Products
aov.usage<-aov(d1$Miles ~ as.factor(d1$Product)) #note the group variable should be a factor
summary(aov.usage)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(d1$Product) 2 209625 104813 68.24 <2e-16 ***
## Residuals 177 271855 1536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The “F value” of 68.24 indicates that the variance between the groups (based on the means of the usage) is much larger than the variance within the groups. The p-value (<2e-16) of less than 0.001 indicates strong evidence against the null hypothesis of equal means, thus we can conclude that there is a significant difference in the mean usage between at least two of the product groups.
In summary, the ANOVA results suggest that there is a significant effect of the “Product” variable on the “Miles” variable, indicating that the mean usage of at least one of the product groups is significantly different from the others.
##- iii. Test if the mean Miles for customers is higher
for customers with “High” fitness versus those with “Low” fitness.
“High” fitness is defined by customers with a self-rated fitness score
that is greater than the median while “Low” fitness is defined by
customer with a self-rated fitness score equal or lower than the median.
(Hint: You can create a categorical variable for the two levels of
Fitness.) (4 marks)
# Compute the median of Fitness
med_fitness <- median(d1$Fitness)
# Create a new variable for Fitness_level
d1$fitness_level <- NA
d1$fitness_level[d1$Fitness>med_fitness] <- "High"
d1$fitness_level[d1$Fitness<=med_fitness] <- "Low"
d1$fitness_factor<-factor(d1$fitness_level, levels = c("High","Low"))
t.test(d1$Fitness~d1$fitness_factor,alternative="greater")
##
## Welch Two Sample t-test
##
## data: d1$Fitness by d1$fitness_factor
## t = 22.753, df = 96.712, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High and group Low is greater than 0
## 95 percent confidence interval:
## 1.671988 Inf
## sample estimates:
## mean in group High mean in group Low
## 4.563636 2.760000
Finding Based on the output, we can see that the t-statistic is 22.753, and the degrees of freedom are 96.712. The p-value is less than 2.2e-16, which indicates strong evidence against the null hypothesis that there is no difference in means between the two groups.
The alternative hypothesis is that the true difference in means between the High and Low groups is greater than 0. The 95 percent confidence interval for the difference in means is 1.671988 to infinity, which suggests that the mean for the High group is significantly larger than the mean for the Low group.
The sample estimates show that the mean for the High group is 4.563636, and the mean for the Low group is 2.760000. This indicates that the High group has a higher mean than the Low group. Overall, these results suggest that there is a significant difference in the means between the two groups, with the High group having a significantly higher mean than the Low group.
Type your answer here.
END: YOUR ANSWER