Module: 208251 Regression Analysis and Non-Parametric
Statistics
Instructor: Wisunee Puggard
Affiliation: Department of Statistics, Faculty of Science,
Chiang Mai University.
Objectives: Students are able to
perform descriptive statistics
apply appropriate non-parametric statistical tests to answer
research questions of interest.
Import data
dt <- read.csv('/Users/wisuneepuggard/Desktop/LAB208251/AStore_Data.csv',header=TRUE)
head(dt)#To view first 6 rows of the data set
## Customer MethodOfPayment ItemsPerchased DiscountAmount Sales Gender
## 1 1 Cash 1 0 39.5 male
## 2 2 Cash 1 25 102.4 female
## 3 3 Cash 1 0 22.5 male
## 4 4 Cash 5 12 100.4 male
## 5 5 Cash 2 0 54.0 male
## 6 6 Cash 1 0 39.5 female
## MaritalStatus Age
## 1 single 32
## 2 married 36
## 3 divorced 32
## 4 widowed 28
## 5 single 34
## 6 single 44
str(dt) #To view type of variable in the data
## 'data.frame': 30 obs. of 8 variables:
## $ Customer : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MethodOfPayment: chr "Cash" "Cash" "Cash" "Cash" ...
## $ ItemsPerchased : int 1 1 1 5 2 1 9 3 4 4 ...
## $ DiscountAmount : int 0 25 0 12 0 0 11 34 18 4 ...
## $ Sales : num 39.5 102.4 22.5 100.4 54 ...
## $ Gender : chr "male" "female" "male" "male" ...
## $ MaritalStatus : chr "single" "married" "divorced" "widowed" ...
## $ Age : int 32 36 32 28 34 44 30 52 30 34 ...
Method of payment, Gender, and Marital Status are qualitative variables, so they need to be converted to factors with as.factor().
dt$MethodOfPayment= as.factor(dt$MethodOfPayment)
dt$Gender = as.factor(dt$Gender)
dt$MaritalStatus = as.factor(dt$MaritalStatus)
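A quick way to confirm the conversion worked is to inspect the levels and counts of each factor. The sketch below rebuilds only the six rows shown by head(dt) so that it runs without the CSV file; the name dt_head is an assumption introduced for illustration.

```r
# Six rows copied from the head(dt) output (not the full data set)
dt_head <- data.frame(
  MethodOfPayment = rep("Cash", 6),
  Gender          = c("male", "female", "male", "male", "male", "female"),
  MaritalStatus   = c("single", "married", "divorced", "widowed", "single", "single"),
  stringsAsFactors = FALSE
)

# Convert all qualitative columns to factors in one step
cat_cols <- c("MethodOfPayment", "Gender", "MaritalStatus")
dt_head[cat_cols] <- lapply(dt_head[cat_cols], as.factor)

str(dt_head)                    # the three columns are now reported as Factor
levels(dt_head$Gender)          # categories found in the column
table(dt_head$MaritalStatus)    # frequency of each category
```

The same lapply() call should work on dt itself, provided the column names match those in the file.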
A. Is the average age of customers equal to 45?
summary(dt$Age) #calculate summary statistics of Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 31.25 38.00 40.27 49.75 68.00
par(mfrow=c(1,2))#create plots layout with 1 row and 2 columns
boxplot(dt$Age) #create boxplot of age
hist(dt$Age) #create histogram of age
#perform Wilcoxon test for one sample
wilcox.test(x=dt$Age, mu=45, alternative = "two.sided")
## Warning in wilcox.test.default(x = dt$Age, mu = 45, alternative = "two.sided"):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(x = dt$Age, mu = 45, alternative = "two.sided"):
## cannot compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: dt$Age
## V = 116.5, p-value = 0.04996
## alternative hypothesis: true location is not equal to 45
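The two warnings appear because the data contain tied values and observations equal to mu, so R falls back to a normal approximation instead of an exact p-value. A minimal sketch, assuming only the first ten ages printed by str(dt), shows how to request that approximation explicitly with exact = FALSE and then read off the p-value; the results will not match the full-data output above.

```r
# First ten ages from the str(dt) output (illustration only, not the full sample)
age <- c(32, 36, 32, 28, 34, 44, 30, 52, 30, 34)

# exact = FALSE asks for the normal approximation directly;
# correct = TRUE (the default) applies the continuity correction
res <- wilcox.test(age, mu = 45, alternative = "two.sided",
                   exact = FALSE, correct = TRUE)

res$method      # name of the test that was run
res$statistic   # signed rank statistic V
res$p.value     # compare with the chosen significance level
```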
B. Is the median number of purchased items less than 7?
summary(dt$ItemsPerchased)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.250 8.000 7.967 10.750 20.000
par(mfrow=c(1,2)) #to set layout of plots to be 1 row 2 columns
boxplot(dt$ItemsPerchased,ylab="Items purchased")
hist(dt$ItemsPerchased,xlab="Items purchased")
#perform .........for testing mean of one sample
wilcox.test(x=dt$ItemsPerchased,mu=7,alternative = "less")
## Warning in wilcox.test.default(x = dt$ItemsPerchased, mu = 7, alternative =
## "less"): cannot compute exact p-value with ties
## Warning in wilcox.test.default(x = dt$ItemsPerchased, mu = 7, alternative =
## "less"): cannot compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: dt$ItemsPerchased
## V = 225, p-value = 0.6964
## alternative hypothesis: true location is less than 7
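Since the question is stated in terms of the median, a sign test can serve as a cross-check. It is a different technique from the signed rank test used above and does not assume a symmetric distribution. The sketch below is only an illustration on the first ten values printed by str(dt), so its p-value is not comparable with the full-data result; the names items, m0, n_above and n_used are assumptions.

```r
# First ten values of ItemsPerchased from the str(dt) output (illustration only)
items <- c(1, 1, 1, 5, 2, 1, 9, 3, 4, 4)
m0 <- 7                          # hypothesised median

n_above <- sum(items > m0)       # observations above the hypothesised median
n_used  <- sum(items != m0)      # observations equal to m0 are dropped

# H0: median = 7 vs H1: median < 7.
# Under H0 each remaining value lies above m0 with probability 0.5,
# so H1 corresponds to a proportion above m0 that is less than 0.5.
sign_res <- binom.test(n_above, n_used, p = 0.5, alternative = "less")
sign_res$p.value
```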
1. Import data
dt2 <- read.csv('/Users/wisuneepuggard/Desktop/LAB208251/Music_Data.csv',header=TRUE)
head(dt2)#To view first 6 rows of the data set
## Runner Type Scale
## 1 1 None 8
## 2 2 None 7
## 3 3 None 6
## 4 4 None 8
## 5 5 None 5
## 6 6 None 9
str(dt2) #To view type of variable in the data
## 'data.frame': 36 obs. of 3 variables:
## $ Runner: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Type : chr "None" "None" "None" "None" ...
## $ Scale : int 8 7 6 8 5 9 7 8 8 7 ...
Type is a qualitative variable, so convert it to a factor with as.factor().
dt2$Type = as.factor(dt2$Type)
Explore data
#find summary statistics of Scale for each Type
tapply(dt2$Scale, dt2$Type, summary)
## $Classical
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.000 6.750 7.500 7.417 8.000 9.000
##
## $Dance
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 6.0 6.5 6.5 7.0 8.0
##
## $None
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.000 7.500 7.417 8.000 9.000
#create boxplot of Scale by Type
par(mfrow=c(1,1))
boxplot(dt2$Scale ~ dt2$Type,xlab="Type of music",ylab="Scale")
A) Is there a difference between None and Classical?
wilcox.test(x=dt2$Scale[dt2$Type=="None"], y=dt2$Scale[dt2$Type=="Classical"],
paired = TRUE, alternative = "two.sided")
## Warning in wilcox.test.default(x = dt2$Scale[dt2$Type == "None"], y =
## dt2$Scale[dt2$Type == : cannot compute exact p-value with ties
## Warning in wilcox.test.default(x = dt2$Scale[dt2$Type == "None"], y =
## dt2$Scale[dt2$Type == : cannot compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: dt2$Scale[dt2$Type == "None"] and dt2$Scale[dt2$Type == "Classical"]
## V = 23, p-value = 1
## alternative hypothesis: true location shift is not equal to 0
B) Is the scale for Classical higher than for Dance?
wilcox.test(x=dt2$Scale[dt2$Type=="Classical"], y=dt2$Scale[dt2$Type=="Dance"],
paired = TRUE, alternative = "greater")
## Warning in wilcox.test.default(x = dt2$Scale[dt2$Type == "Classical"], y =
## dt2$Scale[dt2$Type == : cannot compute exact p-value with ties
## Warning in wilcox.test.default(x = dt2$Scale[dt2$Type == "Classical"], y =
## dt2$Scale[dt2$Type == : cannot compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: dt2$Scale[dt2$Type == "Classical"] and dt2$Scale[dt2$Type == "Dance"]
## V = 24.5, p-value = 0.04231
## alternative hypothesis: true location shift is greater than 0
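The paired calls above match observations by position, so they rely on the rows of each Type appearing in the same Runner order. A minimal sketch of a safer approach is to reshape the data to one row per runner before testing. The None values below come from the head(dt2) output; the Classical values are made up for illustration, and the object names long and wide are assumptions.

```r
# Long format, with the Classical rows deliberately in reverse runner order
long <- data.frame(
  Runner = c(1:6, 6:1),
  Type   = rep(c("None", "Classical"), each = 6),
  Scale  = c(8, 7, 6, 8, 5, 9,    # None: from head(dt2)
             8, 6, 7, 7, 7, 9),   # Classical: hypothetical values
  stringsAsFactors = FALSE
)

# One row per runner, one column per type: Scale.None, Scale.Classical
wide <- reshape(long, idvar = "Runner", timevar = "Type", direction = "wide")
wide <- wide[order(wide$Runner), ]
wide

# Each pair now belongs to the same runner
wilcox.test(wide$Scale.None, wide$Scale.Classical,
            paired = TRUE, alternative = "two.sided", exact = FALSE)
```

If the file is already sorted by Runner within each Type, the direct subsetting used above gives the same pairs.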
You must submit:
An R file with your code, and
An answer sheet in your own handwriting
Submit on Mango; see the deadline there.
For each question, write 1) the hypotheses, 2) the non-parametric test used, 3) the test statistic and p-value, and 4) the decision to reject or fail to reject H0.
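Items 2) to 4) can be read directly from the object returned by wilcox.test(). The helper below is a hypothetical convenience function, not part of the lab material, and the sample vector x is made up for illustration; the hypotheses in item 1) still have to be written by hand.

```r
# Hypothetical helper: collects the test name, statistic, p-value and decision
report_test <- function(res, alpha = 0.05) {
  decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"
  data.frame(
    test      = res$method,
    statistic = unname(res$statistic),
    p_value   = res$p.value,
    decision  = decision
  )
}

# Made-up sample with no ties and no zeroes, so R computes an exact p-value
x <- c(1.1, 2.2, 3.3, 4.4, 5.5)
out <- report_test(wilcox.test(x, mu = 0, alternative = "two.sided"))
print(out)
```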
Data were collected for a sample of 50 in-store transactions during one day in 2021. The store's manager wants to use this sample data to learn about customers' behavior. Use the methods of descriptive statistics and non-parametric statistics presented in this module to analyze the data, and report any findings that appear interesting and of potential value to the store's manager.
Data file: AStore_Data.csv
## Customer MethodOfPayment ItemsPerchased DiscountAmount Sales Gender
## 1 1 Cash 1 0 39.5 male
## 2 2 Cash 1 25 102.4 female
## 3 3 Cash 1 0 22.5 male
## 4 4 Cash 5 12 100.4 male
## 5 5 Cash 2 0 54.0 male
## 6 6 Cash 1 0 39.5 female
## MaritalStatus Age
## 1 single 32
## 2 married 36
## 3 divorced 32
## 4 widowed 28
## 5 single 34
## 6 single 44
At the significance level of 0.05, use an appropriate nonparametric test to answer the following questions:
Is the average age of customers more than 45?
Is there a difference between average sales of female and male customers?
A researcher wants to examine whether music has an effect on the perceived psychological effort required to perform an exercise session. To test this, the researcher recruited 12 runners who each ran three times on a treadmill for 30 minutes. For consistency, the treadmill speed was the same for all three runs. In a random order, each subject ran: (a) listening to no music at all; (b) listening to classical music; and (c) listening to dance music. At the end of each run, subjects were asked to record how hard the running session felt on a scale of 1 to 10, with 1 being easy and 10 extremely hard.
Data file: Music_Data.csv
## Runner Type Scale
## 1 1 None 8
## 2 2 None 7
## 3 3 None 6
## 4 4 None 8
## 5 5 None 5
## 6 6 None 9
At the significance level of 0.05, use an appropriate nonparametric test to answer the following questions:
Is the average scale while running with no music less than 7?
Is the average scale while running with classical music higher than running with no music?