1. ONE WAY ANOVA

Step 1: Preliminaries (state the hypotheses H0 and H1)

library(readr)

# ONE WAY ANOVA
# The null and alternative hypotheses of this ANOVA are:
# H0: male and female students have the same mean evaluation
# H1: male and female students have different mean evaluations

# Import the data and look at the first six rows

evaluation <- read_csv("~/Downloads/Evaluation.csv")
## Rows: 90 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Gender, Study
## dbl (1): Evaluation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(evaluation)
## # A tibble: 6 × 3
##   Evaluation Gender Study
##        <dbl> <chr>  <chr>
## 1         61 Female BA   
## 2         65 Female BA   
## 3         86 Female BA   
## 4         68 Female BA   
## 5         69 Female BA   
## 6         88 Female BA
summary(evaluation)
##    Evaluation        Gender             Study          
##  Min.   : 36.00   Length:90          Length:90         
##  1st Qu.: 61.00   Class :character   Class :character  
##  Median : 68.00   Mode  :character   Mode  :character  
##  Mean   : 68.09                                        
##  3rd Qu.: 75.00                                        
##  Max.   :100.00
res_aov <- aov(Evaluation ~ Gender,
               data = evaluation
)

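Step 2: Check the normality of the dependent variable's residuals

Dependent variable: the response variable (e.g. in y = ax + b, y is the dependent variable)

Residuals: the differences between the observed values and the values fitted by the model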

# Test the normality of residuals via a histogram and a QQ-plot, 
# and Shapiro-Wilk test.

# The null and alternative hypotheses for the Shapiro-Wilk test are:
# H0: data come from a normal distribution
# H1: data do not come from a normal distribution

par(mfrow = c(1, 2)) # combine plots
# histogram
hist(res_aov$residuals)
# QQ-plot
library(car)
## Loading required package: carData
qqPlot(res_aov$residuals,
       id = FALSE # id = FALSE to remove point identification
)

# Shapiro-Wilk test
#shapiro.test(res_aov$residuals)
res_step1<-shapiro.test(res_aov$residuals)
res_step1
## 
##  Shapiro-Wilk normality test
## 
## data:  res_aov$residuals
## W = 0.99056, p-value = 0.771
# The p-value of the Shapiro-Wilk test on the residuals is larger than 
# the usual significance level of α = 5%, so we do not reject the 
# hypothesis that residuals follow a normal distribution (p-value = 0.771).
# Because the p-value is greater than 0.05,
# the normality assumption is considered met.
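
The decision rule above can also be applied programmatically, because `shapiro.test()` returns an object whose `p.value` element holds the p-value. A minimal sketch, assuming simulated data in place of Evaluation.csv (the data frame `sim` and the names `fit`, `sw`, `alpha`, and `normality_ok` are hypothetical):

```r
# Hypothetical data: 90 simulated scores standing in for Evaluation.csv.
set.seed(1)
sim <- data.frame(
  Evaluation = rnorm(90, mean = 68, sd = 12),
  Gender = factor(rep(c("Female", "Male"), each = 45))
)

fit <- aov(Evaluation ~ Gender, data = sim)

# shapiro.test() returns an "htest" object; its p-value is in $p.value.
sw <- shapiro.test(residuals(fit))

alpha <- 0.05
# TRUE means normality is not rejected at level alpha.
normality_ok <- sw$p.value > alpha
normality_ok
```

On the real data, the same comparison could be made with `res_step1$p.value`.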

Step 3: Verify homogeneity of variances visually via a boxplot and dotplot, and via a statistical test (Levene's test, among others).

# Verify homogeneity of variances visually via a boxplot and dotplot, and 
# via a statistical test (Levene's test, among others).

# The null and alternative hypotheses for Levene's test are:
# H0: variances are equal
# H1: at least one variance is different

# Boxplot
boxplot(Evaluation ~ Gender,
        data = evaluation
)
# There is one outlier in the group Male, as defined by the interquartile range criterion. 
# This point is, however, not seen as a significant outlier so we can assume that 
# the assumption of no significant outliers is met.


# Dotplot
library("lattice")

dotplot(Evaluation ~ Gender,
        data = evaluation
)

# Levene's test
leveneTest(Evaluation ~ Gender,
           data = evaluation
)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  2.6337 0.1082
##       88
# The p-value being larger than the significance level of 0.05, 
# we do not reject the null hypothesis that variances are equal 
# between the two genders (p-value = 0.1082).
# Because p > 0.05, we do not reject the null hypothesis and treat the variances as equal.
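
Levene's test is only one of several homogeneity tests. As an illustration of the "among others", base R's `bartlett.test()` and `fligner.test()` accept the same formula interface. This sketch is not the test used above; it runs on simulated data (`sim` is hypothetical). Converting `Gender` to a factor up front is also what avoids the "group coerced to factor" warning from `leveneTest()`.

```r
# Hypothetical data: simulated scores, not the Evaluation.csv values.
set.seed(2)
sim <- data.frame(
  Evaluation = rnorm(90, mean = 68, sd = 12),
  Gender = rep(c("Female", "Male"), each = 45)
)

# Make the grouping variable an explicit factor.
sim$Gender <- factor(sim$Gender)

# Bartlett's test: sensitive to departures from normality.
bart <- bartlett.test(Evaluation ~ Gender, data = sim)

# Fligner-Killeen test: rank-based and more robust to non-normality.
flig <- fligner.test(Evaluation ~ Gender, data = sim)

c(bartlett = bart$p.value, fligner = flig$p.value)
```

As with Levene's test, a p-value above 0.05 means equal variances are not rejected.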

# Run ANOVA
summary(res_aov)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Gender       1      1    1.11   0.007  0.934
## Residuals   88  14256  162.00

Conclusion:

# Given that the p-value is larger than 0.05, we cannot reject the hypothesis

# that all means are equal. Therefore, we find no evidence that male and female

# students differ in terms of evaluation (p-value = 0.934).
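
With only two groups, a one-way ANOVA is equivalent to a two-sample t-test with pooled variance: the F statistic equals the squared t statistic, and the p-values coincide. A quick check on simulated data (the data frame `sim` is hypothetical, not the course data):

```r
# Hypothetical data: simulated scores for two groups of 45.
set.seed(3)
sim <- data.frame(
  Evaluation = rnorm(90, mean = 68, sd = 12),
  Gender = factor(rep(c("Female", "Male"), each = 45))
)

# One-way ANOVA: pull F and p from the first row of the summary table.
aov_tab <- summary(aov(Evaluation ~ Gender, data = sim))[[1]]
f_value <- aov_tab[["F value"]][1]
p_aov <- aov_tab[["Pr(>F)"]][1]

# Two-sample t-test with equal variances assumed (pooled variance).
tt <- t.test(Evaluation ~ Gender, data = sim, var.equal = TRUE)

# Both comparisons should report TRUE.
all.equal(f_value, unname(tt$statistic)^2)
all.equal(p_aov, tt$p.value)
```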

2. TWO WAY ANOVA

# TWO WAY ANOVA
# Question: Does Evaluation vary according to Gender and Study?

table(evaluation$Gender, evaluation$Study) # Make frequency tables
##         
##          BA CE IDE
##   Female 15 15  15
##   Male   15 15  15
# Color box plot by a second group 'Gender'
library("ggpubr")
## Loading required package: ggplot2
ggboxplot(evaluation, x = "Study", y = "Evaluation", color = "Gender",
          palette = c("#00AFBB", "#E7B800"))

# Add error bars: mean_se
ggline(evaluation, x = "Study", y = "Evaluation", color = "Gender",
       add = c("mean_se", "dotplot"),
       palette = c("#00AFBB", "#E7B800"))
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

# Compute a two-way ANOVA test
res.aov2 <- aov(Evaluation ~ Gender * Study, data = evaluation)
summary(res.aov2)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Gender        1      1     1.1   0.010    0.922    
## Study         2    418   209.0   1.798    0.172    
## Gender:Study  2   4073  2036.5  17.518 4.38e-07 ***
## Residuals    84   9765   116.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion:

# The Gender:Study interaction has a p-value < 0.05 (p = 4.38e-07, significance code ***), indicating
# that the relationship between Study and Evaluation depends on Gender. Neither main effect (Gender, Study) is significant on its own.

Based on the box plot, female IDE students give higher evaluation scores. From this, we hypothesize that female students in the IDE program prefer the course most. A definite conclusion cannot be drawn, as female students may give higher ratings to be polite or for other reasons.
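
A significant interaction is usually followed up by looking at the individual Gender:Study cells. The sketch below, on simulated data, tabulates the cell means and runs Tukey's HSD on the interaction term. The cell means in `cell_mu` are invented for illustration and are not taken from Evaluation.csv, and `sim`, `design`, and `fit2` are hypothetical names.

```r
set.seed(42)

# Hypothetical design: 2 genders x 3 study programs.
design <- expand.grid(
  Gender = c("Female", "Male"),
  Study = c("BA", "CE", "IDE")
)
# Invented cell means (same row order as `design`) with a crossing pattern.
cell_mu <- c(68, 68, 60, 76, 82, 66)

n_per_cell <- 15
sim <- design[rep(seq_len(nrow(design)), each = n_per_cell), ]
sim$Evaluation <- rnorm(nrow(sim),
                        mean = rep(cell_mu, each = n_per_cell),
                        sd = 10)

# Gender x Study table of observed cell means.
cell_means <- with(sim, tapply(Evaluation, list(Gender, Study), mean))
round(cell_means, 1)

# Pairwise comparisons between the six Gender:Study cells.
fit2 <- aov(Evaluation ~ Gender * Study, data = sim)
tukey <- TukeyHSD(fit2, which = "Gender:Study")
head(tukey[["Gender:Study"]])
```

`TukeyHSD()` needs the grouping variables to be factors, so on the real data `Gender` and `Study` would have to be converted with `factor()` before fitting.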

3. ONE WAY ANOVA FOR IDE

# ONE WAY ANOVA FOR IDE STUDENTS

# The null and alternative hypothesis of an ANOVA are:
# H0 = male and female are equal in terms of evaluation
# H1 = male and female are different in terms of evaluation
IDE_evaluation <- read_csv("~/Downloads/Evaluation.csv")
## Rows: 90 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Gender, Study
## dbl (1): Evaluation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
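# filter(IDE_evaluation, Study == "IDE") kept failing to find the Study column
# (possibly because dplyr was not loaded, in which case filter() resolves to stats::filter()),
# so the following approach is used instead: which() first extracts the indices of the required rows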

IDE1data<- read_csv("~/Downloads/Evaluation.csv") 
## Rows: 90 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Gender, Study
## dbl (1): Evaluation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
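# Create a table IDE1data
IDE1 <- which(IDE_evaluation$Study == "IDE") # IDE1 is an object holding the indices of all rows that match the condition
IDE1data <- IDE1data[IDE1, ]
# Extract all rows indexed by IDE1 and assign them to IDE1data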
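
The same row selection can be written more compactly. A minimal sketch on a small made-up data frame (`sim` is hypothetical), comparing the `which()` approach with base R's `subset()`; the `dplyr::filter()` line is left commented out because it assumes dplyr is installed:

```r
# Hypothetical six-row stand-in for Evaluation.csv.
sim <- data.frame(
  Evaluation = c(80, 71, 61, 65, 86, 70),
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Study = c("IDE", "IDE", "BA", "CE", "IDE", "BA")
)

# Approach used above: row indices from which(), then index the rows.
by_which <- sim[which(sim$Study == "IDE"), ]

# Base R alternative: subset() evaluates the condition inside the data frame.
by_subset <- subset(sim, Study == "IDE")

# With dplyr installed, the namespaced call avoids picking up stats::filter():
# by_filter <- dplyr::filter(sim, Study == "IDE")

all.equal(by_which, by_subset)
```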

head(IDE1data)
## # A tibble: 6 × 3
##   Evaluation Gender Study
##        <dbl> <chr>  <chr>
## 1         80 Female IDE  
## 2         71 Female IDE  
## 3         86 Female IDE  
## 4         86 Female IDE  
## 5         85 Female IDE  
## 6         83 Female IDE
res_aov3 <- aov(Evaluation ~ Gender, data = IDE1data)

# Boxplot
boxplot(Evaluation ~ Gender, data = IDE1data)

# Run ANOVA
summary(res_aov3)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Gender       1   2001  2000.8   24.09 3.56e-05 ***
## Residuals   28   2326    83.1                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion:

# Based on the box plot, female students give higher scores for the evaluation, and the ANOVA shows that the difference is significant (p-value = 3.56e-05).

# Thus, we hypothesize that female students in the IDE program prefer the course.

# A definite conclusion cannot be drawn, as female students may give higher ratings to be polite or for other reasons.

4. Are there problems with the results of the first analysis? If so, please explain why. Keep your answer short and precise.

First analysis:

boxplot(Evaluation ~ Gender,
        data = evaluation
)

The evaluations depend on the discipline:

ggboxplot(evaluation, x = "Study", y = "Evaluation", color = "Gender",
          palette = c("#00AFBB", "#E7B800"))

Conclusion:

As the plots show, there are problems with the results of the first analysis.

In the first analysis, the scores from female students are more widely distributed than those of male students, while both groups have similar means.

However, the other two charts, which analyze the IDE and CE students respectively, tell a different story: among IDE students, female students score higher than male students. Interestingly, the situation is almost entirely reversed among CE students.

This means that analyzing all students together, regardless of discipline, can lead to misleading results compared with analyzing each discipline separately.
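
The masking effect described above can be reproduced with a small constructed example. The scores below are invented so that the gender difference points in opposite directions in two disciplines; they are not the course data, and `base_scores`, `sim`, and `gender_p` are hypothetical names.

```r
# Constructed (not real) scores: within IDE women score 16 points higher,
# within CE men score 16 points higher, so the pooled gender means coincide.
base_scores <- seq(60, 76, length.out = 15)
sim <- data.frame(
  Evaluation = c(base_scores + 8, base_scores - 8,   # IDE: Female, Male
                 base_scores - 8, base_scores + 8),  # CE:  Female, Male
  Gender = factor(rep(c("Female", "Male", "Female", "Male"), each = 15)),
  Study = factor(rep(c("IDE", "CE"), each = 30))
)

# Helper (hypothetical name): p-value of the Gender effect in a one-way ANOVA.
gender_p <- function(data) {
  summary(aov(Evaluation ~ Gender, data = data))[[1]][["Pr(>F)"]][1]
}

p_pooled <- gender_p(sim)
p_ide <- gender_p(subset(sim, Study == "IDE"))
p_ce <- gender_p(subset(sim, Study == "CE"))

c(pooled = p_pooled, IDE = p_ide, CE = p_ce)
```

In this constructed case the pooled p-value is close to 1, while both within-discipline p-values are far below 0.05. The same pattern appears above when the pooled analysis is compared with the IDE-only analysis.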