1. Setting

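This analysis examines how several factors affect the number of days a student is absent from school. The data set classifies children from Walgett, New South Wales, Australia by ethnicity, age, sex, and learner status, and records the number of days each child was absent from school in a particular school year. The factors, with the level codes used in the data, are as follows:

eth: ethnic background, Aboriginal (A) or not (N)
sex: female (F) or male (M)
age: age group, primary (F0) or forms F1, F2, or F3
learn: learner status, average learner (AL) or slow learner (SL)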

The response variable for this study is the number of days absent from school in the year of study.

The raw data are organized into rows, with each row an observation of one student. The row index, the four factors, and the response variable make up the six columns, and there are 146 students in the study. All variables in the data set are used in this study except the observation index column.

First, we establish usable column names corresponding to each variable in the data set.

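# stringsAsFactors = TRUE keeps the text columns as factors (the default is FALSE from R 4.0.0 onward)
data <- read.csv("quine.csv", header = TRUE, stringsAsFactors = TRUE)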
columns <- c("X", "eth", "sex", "age", "learn", "days")
colnames(data) <- columns
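
If quine.csv is not at hand, the same table appears to ship with R's MASS package as quine, which is the data set documented at the address in the References section. The following is a minimal sketch under two assumptions: that MASS is installed, and that its columns (Eth, Sex, Age, Lrn, Days) match the CSV columns apart from the index column X. The object name quine_df is hypothetical and is not used elsewhere in this report.

```r
# Assumption: MASS::quine holds the same observations as quine.csv,
# without the row-index column X.
library(MASS)

quine_df <- quine
# Rename to the short column names used in this report.
colnames(quine_df) <- c("eth", "sex", "age", "learn", "days")

dim(quine_df)            # number of rows and columns
sapply(quine_df, class)  # class of each column
```

Loading from the package also sidesteps the question of how read.csv treats text columns, since the packaged data frame already stores the four predictors as factors.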

The following shows the first 10 observations.

head(data, n=10)
##     X eth sex age learn days
## 1   1   A   M  F0    SL    2
## 2   2   A   M  F0    SL   11
## 3   3   A   M  F0    SL   14
## 4   4   A   M  F0    AL    5
## 5   5   A   M  F0    AL    5
## 6   6   A   M  F0    AL   13
## 7   7   A   M  F0    AL   20
## 8   8   A   M  F0    AL   22
## 9   9   A   M  F1    SL    6
## 10 10   A   M  F1    SL    6

The factors are all recognized as such by R, and the response variable is an integer. The following code shows the structure as well as a summary of the data.

str(data)
## 'data.frame':    146 obs. of  6 variables:
##  $ X    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ eth  : Factor w/ 2 levels "A","N": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex  : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ age  : Factor w/ 4 levels "F0","F1","F2",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ learn: Factor w/ 2 levels "AL","SL": 2 2 2 1 1 1 1 1 2 2 ...
##  $ days : int  2 11 14 5 5 13 20 22 6 6 ...
summary(data)
##        X          eth    sex    age     learn        days      
##  Min.   :  1.00   A:69   F:80   F0:27   AL:83   Min.   : 0.00  
##  1st Qu.: 37.25   N:77   M:66   F1:46   SL:63   1st Qu.: 5.00  
##  Median : 73.50                 F2:40           Median :11.00  
##  Mean   : 73.50                 F3:33           Mean   :16.46  
##  3rd Qu.:109.75                                 3rd Qu.:22.75  
##  Max.   :146.00                                 Max.   :81.00

2. Experimental Design

This experimental design will use multifactor Analysis of Variance (ANOVA) to test whether the factors are significant in determining the number of days a student was absent from school. This design is used because it allows for insight into the main effects of each factor, along with interaction effects between the factors. Using multifactor ANOVA, statistical significance of each of the main and interaction effects can be determined.

The null hypothesis is that none of the factors (ethnicity, sex, age, or learner status) or their two-way interactions affects the number of days absent.
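
Written out, the model behind this hypothesis can be sketched as below. The notation is introduced here for illustration only: alpha, beta, gamma, and delta stand for the effects of ethnicity, sex, age, and learner status, paired symbols stand for two-way interactions, and m indexes the students observed with a given combination of levels. Normally distributed errors with constant variance are the usual ANOVA assumption, not something stated by the data source.

```latex
y_{ijklm} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_l
  + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\alpha\delta)_{il}
  + (\beta\gamma)_{jk} + (\beta\delta)_{jl} + (\gamma\delta)_{kl}
  + \varepsilon_{ijklm},
\qquad \varepsilon_{ijklm} \sim N(0, \sigma^2)

H_0:\ \alpha_i = \beta_j = \gamma_k = \delta_l = 0
  \ \text{and}\ (\alpha\beta)_{ij} = \dots = (\gamma\delta)_{kl} = 0
  \ \text{for all } i, j, k, l
```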

In this experiment, there are no replicates or repeated measures, and there is no blocking in the design because the factors are all categorical. While no information was provided indicating how the data was collected, the data set does not seem to show any specific sampling or strong groupings. Additionally, because the data was obtained from an existing source, it is unknown why the factors in the set were chosen from all of the possible determinants of days absent. However, in this study, only the effect of the given factors will be analyzed.
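
How evenly the factor combinations are represented can be checked by counting students per cell. This is a minimal sketch that assumes MASS::quine is the same data as quine.csv; quine_df and cell_counts are hypothetical names.

```r
# Assumption: MASS::quine matches quine.csv (minus the index column X).
library(MASS)
quine_df <- setNames(quine, c("eth", "sex", "age", "learn", "days"))

# Count students in every combination of the four factors.
cell_counts <- xtabs(~ eth + sex + age + learn, data = quine_df)
ftable(cell_counts)

# Unequal counts mean the design is unbalanced; zeros are empty cells.
range(cell_counts)
sum(cell_counts == 0)
```

Unequal cell counts matter for the analysis that follows, because the order in which terms enter an ANOVA affects the results when the design is unbalanced.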

3. Statistical Analysis

First, we look at the distribution of the response variable. The histogram below shows that the data are right-skewed, with small values the most frequent. Only a few potential outliers lie well beyond the bulk of the data.

hist(data$days, breaks = 20, main = "Days Absent from School", xlab = "Days Absent")
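
The visual impression from the histogram can be backed with a few numbers. The sketch below assumes MASS::quine is the same data as quine.csv and uses the common boxplot rule (more than 1.5 times the interquartile range beyond the quartiles) as one possible definition of a potential outlier; quine_df and days are hypothetical names.

```r
# Assumption: MASS::quine matches quine.csv (minus the index column X).
library(MASS)
quine_df <- setNames(quine, c("eth", "sex", "age", "learn", "days"))
days <- quine_df$days

# A mean well above the median is a simple indicator of right skew.
c(mean = mean(days), median = median(days))

# Values flagged by the 1.5 * IQR boxplot rule.
boxplot.stats(days)$out
```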

plot(data$days ~ data$eth + data$sex + data$age + data$learn)

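These boxplots suggest a clear difference in days absent between the ethnicity groups, and at least slight differences across the sex and age groups. However, they show little difference by learner status. Because boxplots display medians and spread rather than means, these impressions are only a first look; the ANOVA below tests the differences formally.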
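
The group differences read off the boxplots can be tabulated directly. This is a minimal sketch, assuming MASS::quine is the same data as quine.csv; quine_df, factors, and group_summary are hypothetical names.

```r
# Assumption: MASS::quine matches quine.csv (minus the index column X).
library(MASS)
quine_df <- setNames(quine, c("eth", "sex", "age", "learn", "days"))

factors <- c("eth", "sex", "age", "learn")

# For each factor, report the count, mean, and median of days per level.
group_summary <- lapply(setNames(factors, factors), function(f) {
  sapply(split(quine_df$days, quine_df[[f]]),
         function(x) c(n = length(x), mean = mean(x), median = median(x)))
})
group_summary
```

Reporting both the mean and the median is a deliberate choice here: the response is right-skewed, so the two can differ noticeably within a group.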

model <- aov(data$days ~ data$eth + data$sex + data$age + data$learn + data$eth*data$sex + data$eth*data$age + data$eth*data$learn + data$sex*data$age + data$sex*data$learn + data$age*data$learn)
anova(model)
## Analysis of Variance Table
## 
## Response: data$days
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## data$eth              1  2980.5 2980.51 14.6267 0.0002038 ***
## data$sex              1   279.0  279.01  1.3692 0.2441228    
## data$age              3  2118.8  706.27  3.4660 0.0182641 *  
## data$learn            1   689.3  689.29  3.3826 0.0682038 .  
## data$eth:data$sex     1   190.6  190.60  0.9354 0.3352946    
## data$eth:data$age     3  2807.3  935.76  4.5922 0.0043537 ** 
## data$eth:data$learn   1    10.2   10.21  0.0501 0.8232080    
## data$sex:data$age     3  2396.6  798.86  3.9203 0.0102322 *  
## data$sex:data$learn   1   175.1  175.10  0.8593 0.3556870    
## data$age:data$learn   2   574.0  287.01  1.4085 0.2482820    
## Residuals           128 26082.9  203.77                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA of this model indicates that ethnicity is significant at the 0.001 level (p = 0.0002) and that the interaction between ethnicity and age is significant at the 0.01 level (p = 0.0044). Age (p = 0.0183) and the interaction between sex and age (p = 0.0102) are significant at the 0.05 level. Learner status is only marginally significant (p = 0.0682), and sex and the remaining interactions are not significant. A significant term indicates that the factor or interaction is associated with the number of days absent.
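
The long model formula can be written more compactly. In R's formula syntax, a*b already expands to a + b + a:b, and (a + b + c + d)^2 expands to all main effects plus all two-way interactions. The sketch below assumes MASS::quine is the same data as quine.csv and checks that both spellings give the same sums of squares; quine_df, fit_long, and fit_short are hypothetical names.

```r
# Assumption: MASS::quine matches quine.csv (minus the index column X).
library(MASS)
quine_df <- setNames(quine, c("eth", "sex", "age", "learn", "days"))

# Long form, mirroring the formula used in this report.
fit_long <- aov(days ~ eth + sex + age + learn +
                  eth*sex + eth*age + eth*learn +
                  sex*age + sex*learn + age*learn,
                data = quine_df)

# Compact form: all main effects and all two-way interactions.
fit_short <- aov(days ~ (eth + sex + age + learn)^2, data = quine_df)

# Both fits should produce the same ANOVA table.
all.equal(anova(fit_long)[["Sum Sq"]], anova(fit_short)[["Sum Sq"]])
```

Note that anova() on a fit like this reports sequential sums of squares, so each term is tested after the terms listed before it. With unequal group sizes, reordering the terms can change the individual p-values.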

We can look further at the interactions between the factors through interaction plots.

interaction.plot(data$eth, data$sex, data$days)

The plot shows no clear interaction between ethnicity and sex, in agreement with the ANOVA p-value (0.3353).

interaction.plot(data$eth, data$age, data$days)

There is an interaction between ethnicity and age, as the p-value from the ANOVA output suggested.
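
An interaction plot draws the mean response for each pair of levels, so the same information can be read from a table of cell means. This is a minimal sketch for ethnicity and age, assuming MASS::quine is the same data as quine.csv; quine_df and cell_means are hypothetical names.

```r
# Assumption: MASS::quine matches quine.csv (minus the index column X).
library(MASS)
quine_df <- setNames(quine, c("eth", "sex", "age", "learn", "days"))

# Mean days absent for every ethnicity-by-age cell.
cell_means <- with(quine_df, tapply(days, list(eth = eth, age = age), mean))
round(cell_means, 1)

# Gap between the two ethnicity groups within each age group.
# A roughly constant gap would indicate no interaction.
cell_means["N", ] - cell_means["A", ]
```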

interaction.plot(data$eth, data$learn, data$days)

While there appears to be an interaction, the p-value from the ANOVA output was too high to conclude that this interaction is significant.

interaction.plot(data$sex, data$age, data$days)

There is an interaction between sex and age, as the p-value from the ANOVA output (0.0102) suggested.

interaction.plot(data$sex, data$learn, data$days)

The plot shows no clear interaction between sex and learner status, in agreement with the ANOVA p-value (0.3557).

interaction.plot(data$age, data$learn, data$days)

The plot shows no clear interaction between age and learner status, in agreement with the ANOVA p-value (0.2483).

qqnorm(residuals(model))
qqline(residuals(model))

The Normal Q-Q plot above shows how the residuals are distributed, which gives insight into whether the normality assumption of the model holds. Because the residuals deviate from the line in the lower and upper tails, they are not normally distributed there. This suggests that there may be a better model for days absent from school.
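
The visual reading of the Q-Q plot can be complemented by a formal test. The sketch below applies the Shapiro-Wilk test to the residuals; this test is an addition for illustration and is not part of the original analysis. It assumes MASS::quine is the same data as quine.csv, and quine_df, fit, res, and sw are hypothetical names.

```r
# Assumption: MASS::quine matches quine.csv (minus the index column X).
library(MASS)
quine_df <- setNames(quine, c("eth", "sex", "age", "learn", "days"))

# Same model structure as in this report: main effects plus two-way interactions.
fit <- aov(days ~ (eth + sex + age + learn)^2, data = quine_df)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals.
sw <- shapiro.test(res)
sw
```

A formal test and the plot answer slightly different questions: the test gives a single yes-or-no style signal, while the plot shows where the departure from normality occurs.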

4. References

Data obtained from http://vincentarelbundock.github.io/Rdatasets/doc/MASS/quine.html

5. Appendices

Complete R code

# read in data and rename columns
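# stringsAsFactors = TRUE keeps the text columns as factors (the default is FALSE from R 4.0.0 onward)
data <- read.csv("quine.csv", header = TRUE, stringsAsFactors = TRUE)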
columns <- c("X", "eth", "sex", "age", "learn", "days")
colnames(data) <- columns

#show first 10 rows of data
head(data, n=10)

#show the structure and summary of data
str(data)
summary(data)

#create a histogram for response- days absent
hist(data$days, breaks = 20, main = "Days Absent from School", xlab = "Days Absent")

#boxplots for the factors
plot(data$days ~ data$eth + data$sex + data$age + data$learn)

#create ANOVA model
model <- aov(data$days ~ data$eth + data$sex + data$age + data$learn + data$eth*data$sex + data$eth*data$age + data$eth*data$learn + data$sex*data$age + data$sex*data$learn + data$age*data$learn)
anova(model)

#create interaction plots
interaction.plot(data$eth, data$sex, data$days)
interaction.plot(data$eth, data$age, data$days)
interaction.plot(data$eth, data$learn, data$days)
interaction.plot(data$sex, data$age, data$days)
interaction.plot(data$sex, data$learn, data$days)
interaction.plot(data$age, data$learn, data$days)

#QQ normal plot for residuals
qqnorm(residuals(model))
qqline(residuals(model))