Module: 208251 Regression Analysis and Non-Parametric Statistics

Instructor: Wisunee Puggard

Affiliation: Department of Statistics, Faculty of Science, Chiang Mai University.

Objectives: Students are able to

  1. perform descriptive statistics

  2. apply appropriate non-parametric statistical tests to answer research questions of interest.

Practice 1: AStore_Data

Import data

dt <- read.csv('/Users/wisuneepuggard/Desktop/LAB208251/AStore_Data.csv',header=TRUE)
head(dt)  # view the first 6 rows of the data set
##   Customer MethodOfPayment ItemsPerchased DiscountAmount Sales Gender
## 1        1            Cash              1              0  39.5   male
## 2        2            Cash              1             25 102.4 female
## 3        3            Cash              1              0  22.5   male
## 4        4            Cash              5             12 100.4   male
## 5        5            Cash              2              0  54.0   male
## 6        6            Cash              1              0  39.5 female
##   MaritalStatus Age
## 1        single  32
## 2       married  36
## 3      divorced  32
## 4       widowed  28
## 5        single  34
## 6        single  44
str(dt)  # view the type of each variable in the data
## 'data.frame':    30 obs. of  8 variables:
##  $ Customer       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MethodOfPayment: chr  "Cash" "Cash" "Cash" "Cash" ...
##  $ ItemsPerchased : int  1 1 1 5 2 1 9 3 4 4 ...
##  $ DiscountAmount : int  0 25 0 12 0 0 11 34 18 4 ...
##  $ Sales          : num  39.5 102.4 22.5 100.4 54 ...
##  $ Gender         : chr  "male" "female" "male" "male" ...
##  $ MaritalStatus  : chr  "single" "married" "divorced" "widowed" ...
##  $ Age            : int  32 36 32 28 34 44 30 52 30 34 ...

MethodOfPayment, Gender, and MaritalStatus are qualitative variables, but they were read in as character (chr) columns. Convert them to factors with as.factor so that R treats them as categorical.

dt$MethodOfPayment <- as.factor(dt$MethodOfPayment)
dt$Gender          <- as.factor(dt$Gender)
dt$MaritalStatus   <- as.factor(dt$MaritalStatus)
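
The conversion can be checked, and done for several columns at once, along the following lines. This is a minimal sketch rather than part of the lab solution: `demo` and `qual_cols` are names chosen here, and `demo` is a small data frame typed in from the first six rows shown by head(dt), so the file itself is not needed to run it.

```r
# Hypothetical mini data set copied from the first six rows of head(dt);
# in the lab, dt comes from read.csv() instead.
demo <- data.frame(
  MethodOfPayment = c("Cash", "Cash", "Cash", "Cash", "Cash", "Cash"),
  Gender          = c("male", "female", "male", "male", "male", "female"),
  MaritalStatus   = c("single", "married", "divorced", "widowed", "single", "single"),
  stringsAsFactors = FALSE  # keep the columns as character, as read.csv did above
)

# Convert every qualitative column to a factor in one step
qual_cols <- c("MethodOfPayment", "Gender", "MaritalStatus")
demo[qual_cols] <- lapply(demo[qual_cols], as.factor)

str(demo)                  # the three columns are now reported as Factor
levels(demo$Gender)        # the categories of Gender
table(demo$MaritalStatus)  # frequency of each category
```

Calling table() on a factor is also a quick way to get the descriptive statistics (counts) for a qualitative variable.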

Q1

A. Is the average age of customers equal to 45?

summary(dt$Age)  #calculate summary statistics of Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   31.25   38.00   40.27   49.75   68.00
par(mfrow=c(1,2))#create plots layout with 1 row and 2 columns
boxplot(dt$Age)  #create boxplot of age
hist(dt$Age)     #create histogram of age

#perform Wilcoxon test for one sample 
wilcox.test(x=dt$Age, mu=45, alternative = "two.sided")
## Warning in wilcox.test.default(x = dt$Age, mu = 45, alternative = "two.sided"):
## cannot compute exact p-value with ties
## Warning in wilcox.test.default(x = dt$Age, mu = 45, alternative = "two.sided"):
## cannot compute exact p-value with zeroes
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  dt$Age
## V = 116.5, p-value = 0.04996
## alternative hypothesis: true location is not equal to 45
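
In the output above, the p-value of 0.04996 is just below 0.05, so H0 would be rejected at that level. The same decision can be read off in code by storing the test result and inspecting its components. The sketch below is illustrative only: it uses the first ten ages shown by str(dt) rather than the full file, so its numbers differ from the output above, and `age`, `alpha`, `res` and `decision` are names chosen here. Setting exact = FALSE requests the normal approximation directly, which is what R falls back to anyway when ties or zeroes are present, so the warnings do not appear.

```r
# First ten ages as printed by str(dt); replace with dt$Age in the lab
age   <- c(32, 36, 32, 28, 34, 44, 30, 52, 30, 34)
alpha <- 0.05  # significance level

# One-sample Wilcoxon signed rank test of H0: location = 45
res <- wilcox.test(x = age, mu = 45, alternative = "two.sided", exact = FALSE)

res$statistic  # the test statistic V
res$p.value    # the p-value

# Decision rule: reject H0 when the p-value is below alpha
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"
decision
```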

Q2

B. Is the median number of purchased items less than 7?

summary(dt$ItemsPerchased)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.250   8.000   7.967  10.750  20.000
par(mfrow=c(1,2)) #to set layout of plots to be 1 row 2 columns
boxplot(dt$ItemsPerchased, ylab="Items purchased")
hist(dt$ItemsPerchased, xlab="Items purchased")

#perform ......... for testing the median of one sample
wilcox.test(x=dt$ItemsPerchased,mu=7,alternative = "less")
## Warning in wilcox.test.default(x = dt$ItemsPerchased, mu = 7, alternative =
## "less"): cannot compute exact p-value with ties
## Warning in wilcox.test.default(x = dt$ItemsPerchased, mu = 7, alternative =
## "less"): cannot compute exact p-value with zeroes
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  dt$ItemsPerchased
## V = 225, p-value = 0.6964
## alternative hypothesis: true location is less than 7
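
The signed rank test assumes that the distribution is roughly symmetric about its median. When the histogram makes that doubtful, the sign test is a common alternative for a one-sample median question. It is swapped in here purely as an illustration and is not the method used in this lab. The sketch uses the first ten values of ItemsPerchased shown by str(dt), which are not representative of the full data, and the names `items`, `m0`, `n_above`, `n_below`, `n_used` and `sign_res` are chosen here.

```r
# First ten values as printed by str(dt); replace with dt$ItemsPerchased in the lab
items <- c(1, 1, 1, 5, 2, 1, 9, 3, 4, 4)
m0    <- 7  # hypothesised median

# Count observations above and below m0; values equal to m0 are dropped
n_above <- sum(items > m0)
n_below <- sum(items < m0)
n_used  <- n_above + n_below

# Under H0 (median = 7) the count above m0 is Binomial(n_used, 0.5).
# H1: median < 7 means fewer values above m0 than expected, hence "less".
sign_res <- binom.test(n_above, n_used, p = 0.5, alternative = "less")
sign_res$p.value
```

The sign test uses only the direction of each difference, not its size, so it usually has less power than the signed rank test when the symmetry assumption does hold.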

Practice 2: Music and exercise

1. Import data

dt2 <- read.csv('/Users/wisuneepuggard/Desktop/LAB208251/Music_Data.csv',header=TRUE)
head(dt2)  # view the first 6 rows of the data set
##   Runner Type Scale
## 1      1 None     8
## 2      2 None     7
## 3      3 None     6
## 4      4 None     8
## 5      5 None     5
## 6      6 None     9
str(dt2)  # view the type of each variable in the data
## 'data.frame':    36 obs. of  3 variables:
##  $ Runner: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Type  : chr  "None" "None" "None" "None" ...
##  $ Scale : int  8 7 6 8 5 9 7 8 8 7 ...

Type is a qualitative variable, so convert it to a factor with as.factor.

dt2$Type <- as.factor(dt2$Type)

Explore data

#find summary statistics of Scale for each Type
tapply(dt2$Scale, dt2$Type, summary)
## $Classical
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.000   6.750   7.500   7.417   8.000   9.000 
## 
## $Dance
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0     6.0     6.5     6.5     7.0     8.0 
## 
## $None
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.000   7.500   7.417   8.000   9.000
#create boxplot of Scale by Type
par(mfrow=c(1,1))
boxplot(dt2$Scale ~ dt2$Type,xlab="Type of music",ylab="Scale")

Q1

A) Is there a difference in Scale between running with no music (None) and running with classical music (Classical)?

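# paired = TRUE matches the i-th "None" value with the i-th "Classical" value,
# so the rows must be in the same runner order within both groups
wilcox.test(x=dt2$Scale[dt2$Type=="None"], y=dt2$Scale[dt2$Type=="Classical"], 
            paired = TRUE, alternative = "two.sided")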
## Warning in wilcox.test.default(x = dt2$Scale[dt2$Type == "None"], y =
## dt2$Scale[dt2$Type == : cannot compute exact p-value with ties
## Warning in wilcox.test.default(x = dt2$Scale[dt2$Type == "None"], y =
## dt2$Scale[dt2$Type == : cannot compute exact p-value with zeroes
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  dt2$Scale[dt2$Type == "None"] and dt2$Scale[dt2$Type == "Classical"]
## V = 23, p-value = 1
## alternative hypothesis: true location shift is not equal to 0
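
A paired test is only valid when each x value is matched with the y value from the same runner. Subsetting by Type relies on the rows already being in runner order within each group. One way to make the pairing explicit is to match the two groups on Runner before testing. The sketch below is an assumption-laden illustration: `toy` is made-up data in the same layout as dt2, with the runners deliberately out of order, and `none`, `classical` and `paired` are names chosen here.

```r
# Hypothetical long-format data in the same layout as dt2 (values are made up)
toy <- data.frame(
  Runner = c(3, 1, 2, 4, 5, 2, 4, 1, 3, 5),
  Type   = rep(c("None", "Classical"), each = 5),
  Scale  = c(6, 8, 7, 8, 5, 6, 9, 7, 7, 6)
)

# Split by music type, then match the two measurements of each runner
none      <- toy[toy$Type == "None",      c("Runner", "Scale")]
classical <- toy[toy$Type == "Classical", c("Runner", "Scale")]
paired    <- merge(none, classical, by = "Runner",
                   suffixes = c(".none", ".classical"))
paired  # one row per runner, sorted by Runner

# Paired Wilcoxon signed rank test on the matched columns
wilcox.test(x = paired$Scale.none, y = paired$Scale.classical,
            paired = TRUE, alternative = "two.sided", exact = FALSE)
```

If the data are already sorted by Runner within each Type, this gives the same result as the subsetting approach.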

Q2

B) Is Scale higher when running with classical music (Classical) than with dance music (Dance)?

wilcox.test(x=dt2$Scale[dt2$Type=="Classical"], y=dt2$Scale[dt2$Type=="Dance"], 
            paired = TRUE, alternative = "greater")
## Warning in wilcox.test.default(x = dt2$Scale[dt2$Type == "Classical"], y =
## dt2$Scale[dt2$Type == : cannot compute exact p-value with ties
## Warning in wilcox.test.default(x = dt2$Scale[dt2$Type == "Classical"], y =
## dt2$Scale[dt2$Type == : cannot compute exact p-value with zeroes
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  dt2$Scale[dt2$Type == "Classical"] and dt2$Scale[dt2$Type == "Dance"]
## V = 24.5, p-value = 0.04231
## alternative hypothesis: true location shift is greater than 0
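
Here the p-value of 0.04231 is below 0.05, so H0 would be rejected at that level. For a one-sided paired test the direction of alternative refers to the differences x - y, so the order of the two arguments matters. The sketch below illustrates this with made-up values (not the lab data): testing Classical against Dance with "greater" is the same test as Dance against Classical with "less". The names `classical`, `dance`, `a` and `b` are chosen here.

```r
# Made-up paired scores for six runners (illustration only)
classical <- c(7, 8, 6, 9, 8, 7)
dance     <- c(6, 6, 7, 7, 5, 6)

# H1: classical - dance > 0
a <- wilcox.test(x = classical, y = dance, paired = TRUE,
                 alternative = "greater", exact = FALSE)

# Same hypothesis with the arguments swapped: H1: dance - classical < 0
b <- wilcox.test(x = dance, y = classical, paired = TRUE,
                 alternative = "less", exact = FALSE)

all.equal(a$p.value, b$p.value)  # the two p-values agree
```

Swapping x and y without also changing alternative would test the opposite direction.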

Assignment Lab4

You must submit:

  1. An R file with your code, and

  2. A handwritten answer sheet.

Submit both on Mango; see the deadline there.

For each question, write 1) the hypotheses, 2) the non-parametric test used, 3) the test statistic and p-value, and 4) your decision (reject or fail to reject H0).

Practice 1

Data were collected for a sample of 50 in-store transactions during one day in 2021. The store’s manager wants to use this sample data to learn about customers’ behavior. Use the methods of descriptive statistics and non-parametric statistics presented in this module to analyze the data, and report any findings that appear interesting and of potential value to the store’s manager.

Data file: AStore_Data.csv

##   Customer MethodOfPayment ItemsPerchased DiscountAmount Sales Gender
## 1        1            Cash              1              0  39.5   male
## 2        2            Cash              1             25 102.4 female
## 3        3            Cash              1              0  22.5   male
## 4        4            Cash              5             12 100.4   male
## 5        5            Cash              2              0  54.0   male
## 6        6            Cash              1              0  39.5 female
##   MaritalStatus Age
## 1        single  32
## 2       married  36
## 3      divorced  32
## 4       widowed  28
## 5        single  34
## 6        single  44

At the significance level of 0.05, use an appropriate nonparametric test to answer the following questions:

  1. Is the average age of customers more than 45?

  2. Is there a difference between average sales of female and male customers?

Practice 2

A researcher wants to examine whether music has an effect on the perceived psychological effort required to perform an exercise session. To test this, the researcher recruited 12 runners who each ran three times on a treadmill for 30 minutes. For consistency, the treadmill speed was the same for all three runs. In a random order, each subject ran: (a) listening to no music at all; (b) listening to classical music; and (c) listening to dance music. At the end of each run, subjects were asked to record how hard the running session felt on a scale of 1 to 10, with 1 being easy and 10 extremely hard.

Data file: Music_Data.csv

##   Runner Type Scale
## 1      1 None     8
## 2      2 None     7
## 3      3 None     6
## 4      4 None     8
## 5      5 None     5
## 6      6 None     9

At the significance level of 0.05, use an appropriate nonparametric test to answer the following questions:

  1. Is the average scale while running with no music less than 7?

  2. Is the average scale while running with classical music higher than running with no music?