Class Survey Data: Background

The first survey was the original one I developed for the course. I wanted to collect different types of data to help students understand the differences between quantitative and qualitative (categorical) data, and to practise dealing with Likert data, which is a very common data type in health-care and nutrition questionnaires.

The questions were:

  • What is your gender? (categorical)

  • What is your height in metres? (quantitative)

  • What is your favourite colour? (categorical)

  • In what month were you born? (categorical)

  • How many siblings do you have? (quantitative - integer)

  • How important do you think statistics is to your job prospects? (Likert)

  • How important do you think learning statistics is within your degree? (Likert)

  • How positive do you feel about studying statistics? (Likert)

Reasoning

Even the first question has issues in how we ask it. I left it as male, female or prefer not to say, but I could have offered a better set of choices. Since my first teaching class in the late 1990s I have had trans students in my class. They are one of the most discriminated-against groups and it is VERY important that we develop inclusive language for the question of gender. This is an area that is being actively researched, especially with respect to population survey questions on gender [@bauer2017].

I asked for height as this is the least contentious measurement you are likely to ask about. I have had students with eating disorders and I avoid discussions about weight if at all possible. Height can also be an issue for trans students, who can feel self-conscious about it.

Favourite colour should cause no issues; the only problem is the huge variety of colour names. Some students said nude, which puzzled me as this was a multi-ethnic class and one person’s nude is not another person’s nude. I suggest defining a limited list of colours.
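
If free-text answers have already been collected, a limited list can still be applied afterwards. The sketch below is only an illustration: the `answers` vector is hypothetical, and the allowed list is simply the twelve colours that appear in the tabulated results later in this chapter. Anything outside the list becomes `NA`.

```r
# Allowed colours (the twelve that appear in the tabulated survey results).
allowed <- c("Black", "Blue", "Brown", "Gold", "Green", "Orange",
             "Pink", "Purple", "Red", "Silver", "White", "Yellow")

# Hypothetical free-text answers, for illustration only.
answers <- c("blue", " Red", "Nude", "PURPLE", "teal")

# Trim whitespace and standardise capitalisation before matching.
tidy <- trimws(answers)
tidy <- paste0(toupper(substr(tidy, 1, 1)), tolower(substring(tidy, 2)))

# Values that are not in the allowed list become NA.
colour <- factor(tidy, levels = allowed)
table(colour, useNA = "ifany")
```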

In what month were you born? In the first year I used a short-answer response and this was a major issue for data cleaning; you can’t imagine how many different ways there are of typing February. Always use a dropdown list for this sort of question to reduce the amount of data cleaning needed.
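
To give a feel for the cleaning involved, here is a minimal sketch that maps typed month names onto R's built-in `month.name`. The `typed` vector is hypothetical, and matching on the first three letters is my assumption; it will not rescue numeric entries or misspellings in the first three letters, which end up as `NA`.

```r
# Hypothetical free-text month entries, for illustration only.
typed <- c("feb", "February ", "FEBRUARY", "Febuary", "Feb.", "2")

# Compare the first three letters (lower case) with R's built-in month.abb.
key <- substr(tolower(trimws(typed)), 1, 3)
idx <- match(key, tolower(month.abb))

# Recognised entries map to full month names; anything else becomes NA.
month <- factor(month.name[idx], levels = month.name)
month
```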

How many siblings do you have? This looks simple but again it isn’t straightforward and even I have issues. I have one brother and one half-brother. Does that mean I answer 2, 1 or 1.5? What about step-siblings, adoption, fostering or any of the other complex family relationships?

The three Likert scales are all simple. The only complexity here is whether Likert responses can be treated as quantitative or qualitative. I am firmly in the qualitative camp, which means that they cannot be summed or averaged. They do not have a mean, and they do not have a standard deviation or variance. There are others who disagree, for example the National Student Survey (even more ridiculous, as they combine categories to create a binary measure) and the end-of-module feedback forms, which average the scores.
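
A minimal sketch of what treating Likert responses as ordered categories can look like in R. The `scores` vector is hypothetical and the 1 to 5 coding is an assumption based on the range seen in the survey data; the summaries used are counts, proportions and a (lower) median category rather than a mean.

```r
# Hypothetical Likert responses coded 1 (low) to 5 (high).
scores <- c(5, 4, 4, 3, 5, 2, 4)

# Store them as an ordered factor so R treats them as ordinal categories.
likert <- factor(scores, levels = 1:5, ordered = TRUE)

# Counts and proportions per category are meaningful for ordinal data.
counts <- table(likert)
props <- prop.table(counts)

# A (lower) median category: the first level whose cumulative
# proportion reaches 0.5.
median_cat <- levels(likert)[which(cumsum(props) >= 0.5)[1]]
median_cat
```

Calling `mean(likert)` on the factor returns `NA` with a warning, which is a useful guard against averaging the categories by accident.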

library("dplyr")

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Read in the csv file to the dataframe dfss2
dfss2 <- read.csv("Class_Survey.csv")

# Simplify the column names
colnames(dfss2) <- c("Gender", "Height", "Col.", "Month","Sib.","Imp1", "Imp2", "Pos")

# Summarise the data
summary(dfss2)
    Gender              Height          Col.              Month          
 Length:1626        Min.   :1.000   Length:1626        Length:1626       
 Class :character   1st Qu.:1.600   Class :character   Class :character  
 Mode  :character   Median :1.650   Mode  :character   Mode  :character  
                    Mean   :1.654                                        
                    3rd Qu.:1.710                                        
                    Max.   :2.500                                        
      Sib.             Imp1            Imp2            Pos       
 Min.   : 0.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.: 1.000   1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.000  
 Median : 2.000   Median :4.000   Median :4.000   Median :3.000  
 Mean   : 2.457   Mean   :3.971   Mean   :4.272   Mean   :3.213  
 3rd Qu.: 3.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:4.000  
 Max.   :16.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
# Put the months in order by defining a factor so that it is tabulated correctly.
dfss2$Month <- factor(dfss2$Month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

# Tabulate the Gender Column
table(dfss2$Gender)

           Female              Male Prefer not to say 
             1247               366                13 
# Tabulate the Favourite Colour Column
table(dfss2$Col.)

 Black   Blue  Brown   Gold  Green Orange   Pink Purple    Red Silver  White 
   292    416     26     34    180     35    156    195    176     13     57 
Yellow 
    46 
# Tabulate the Birth Month Column
table(dfss2$Month)

  January  February     March     April       May      June      July    August 
      158       129       144       145       121       114       134       133 
September   October  November  December 
      141       147       119       141 

All the data columns have the correct types. There are 1626 rows of data collected over about 6 years from classes. Men make up only about 22.5% of the dataset (366 of 1626 responses). This is lower than the actual proportion of men enrolled in Life Sciences, which reflects a different response rate between men and women when filling in the survey.
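
The gender split can be checked directly from the tabulated counts. This is a minimal sketch with the counts typed in by hand rather than read from `dfss2`; on the full data frame, `prop.table(table(dfss2$Gender))` should give the same proportions.

```r
# Counts copied from the gender table above.
gender_counts <- c(Female = 1247, Male = 366, "Prefer not to say" = 13)

# Percentage of all responses in each category, to one decimal place.
gender_pct <- round(100 * gender_counts / sum(gender_counts), 1)
gender_pct
```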

There are still more quality control steps that we need to take to verify that the data is actually valid.

boxplot(dfss2$Height)

The boxplot of Heights suggests some unusually tall and short students. There are a few students who are over 2.4 m tall and a few students who are about 1.0 m tall. These look to be unlikely values.

The current tallest living person is 2.51 m tall. There was nobody in the class who I recognised as being exceptionally tall and it is almost certain that any value above 2.2 m is false.

The shortest person alive is currently 66 cm tall or 0.66 m. I did not see anyone this short and I think that anyone below 1.3 m tall is also likely an incorrectly recorded data point.
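
Before changing anything, it can help to count and list the values that fall outside these limits. A minimal sketch with a hypothetical `height` vector standing in for `dfss2$Height`; the limits are the 1.3 m and 2.2 m cut-offs argued for above.

```r
# Hypothetical heights in metres, for illustration only.
height <- c(1.62, 1.75, 1.00, 2.45, 1.58, NA, 1.81)

# Plausibility limits taken from the reasoning above.
lower <- 1.3
upper <- 2.2

# TRUE where a recorded height falls outside the plausible range.
# Missing values are treated as "not flagged" rather than as NA.
implausible <- !is.na(height) & (height < lower | height > upper)

sum(implausible)      # number of suspect values
height[implausible]   # the suspect values themselves
```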

boxplot(dfss2$Height ~ dfss2$Gender, xlab="Gender", ylab="Height (m)")

When I create the boxplot by Gender these outliers become even more suspicious and need to be discounted.

dfss2 <- dfss2 %>% mutate(Height = replace(Height, Height >= 2.2, NA))
dfss2 <- dfss2 %>% mutate(Height = replace(Height, Height <= 1.3, NA))
boxplot(dfss2$Height ~ dfss2$Gender, xlab="Gender", ylab="Height (m)")

The only other column where I could have incorrect data is the siblings column as the other values were selected from pull-down or multiple choice menus.

The number of siblings should be an integer. If it is a fraction then there is a problem. I should see this with a barplot of the counts for each of the different values.

barplot(table(dfss2$Sib.))

It is clear that some people have entered misleading data, which needs to be replaced with missing values.

dfss2 <- dfss2 %>% mutate(Sib. = replace(Sib., Sib.%%1 != 0, NA))
table(dfss2$Sib.)

  0   1   2   3   4   5   6   7   8   9  10  11  13  16 
119 464 419 280 147  92  34  29  13   4   2   2   1  11 

This has removed the obvious errors in the data so that we now only have integer values. I am still concerned about the 11 respondents reporting 16 siblings, and I suspect that those are also inventions given the shape of the rest of the data and the fact that nobody reported 14 or 15 siblings.
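
The gaps in the upper tail can be listed rather than spotted by eye. A minimal sketch, with the counts typed in from the table above instead of being computed from `dfss2`:

```r
# Sibling counts copied from the table above (name = siblings, value = respondents).
sib_counts <- c("0" = 119, "1" = 464, "2" = 419, "3" = 280, "4" = 147,
                "5" = 92, "6" = 34, "7" = 29, "8" = 13, "9" = 4,
                "10" = 2, "11" = 2, "13" = 1, "16" = 11)

observed <- as.integer(names(sib_counts))

# Values between 0 and the maximum that nobody reported.
missing_values <- setdiff(0:max(observed), observed)
missing_values

# Respondents reporting 10 or more siblings, by value.
sib_counts[observed >= 10]
```

The spike at 16 after a tail that has otherwise thinned to one or two respondents per value is what makes those answers look invented.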

If I order the data by siblings from highest to lowest I see that most of the respondents reporting 16 siblings are male or prefer not to say, and that 3 already have an NA from lying about their heights. Given my prior knowledge of male students my feeling is that all of their data is unreliable, and I think that these male and prefer-not-to-say rows should be removed completely.
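
The ordering step can be sketched in base R as follows. The `demo` data frame is hypothetical and only stands in for `dfss2`; with dplyr loaded, `arrange(dfss2, desc(Sib.))` would be an equivalent way to sort.

```r
# Hypothetical rows standing in for dfss2, for illustration only.
demo <- data.frame(
  Gender = c("Female", "Male", "Male", "Prefer not to say", "Female"),
  Height = c(1.65, NA, 1.80, NA, 1.58),
  Sib.   = c(2, 16, 1, 16, 16)
)

# Sort by number of siblings, largest first; NA values go last by default.
demo_sorted <- demo[order(demo$Sib., decreasing = TRUE), ]
head(demo_sorted)

# Tabulate gender for the rows reporting 16 siblings.
top_gender <- table(demo$Gender[demo$Sib. == 16])
top_gender
```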

library("readr")
# Keep rows with fewer than 16 siblings, plus any row from a female respondent
dfss2a <- dfss2 %>% filter(Sib. < 16 | Gender == "Female")
table(dfss2a$Sib.)

  0   1   2   3   4   5   6   7   8   9  10  11  13  16 
119 464 419 280 147  92  34  29  13   4   2   2   1   1 
table(dfss2a$Gender)

           Female              Male Prefer not to say 
             1247               358                 8 
summary(dfss2a)
    Gender              Height          Col.                 Month    
 Length:1613        Min.   :1.400   Length:1613        January  :157  
 Class :character   1st Qu.:1.600   Class :character   October  :147  
 Mode  :character   Median :1.650   Mode  :character   April    :145  
                    Mean   :1.659                      March    :143  
                    3rd Qu.:1.710                      September:140  
                    Max.   :2.080                      December :140  
                    NA's   :22                         (Other)  :741  
      Sib.            Imp1           Imp2            Pos       
 Min.   : 0.00   Min.   :1.00   Min.   :1.000   Min.   :1.000  
 1st Qu.: 1.00   1st Qu.:3.00   1st Qu.:4.000   1st Qu.:3.000  
 Median : 2.00   Median :4.00   Median :4.000   Median :3.000  
 Mean   : 2.37   Mean   :3.97   Mean   :4.277   Mean   :3.212  
 3rd Qu.: 3.00   3rd Qu.:5.00   3rd Qu.:5.000   3rd Qu.:4.000  
 Max.   :16.00   Max.   :5.00   Max.   :5.000   Max.   :5.000  
 NA's   :6                                                     
boxplot(dfss2a$Height ~ dfss2a$Gender, xlab="Gender", ylab="Height (m)")

barplot(table(dfss2a$Gender))

barplot(table(dfss2a$Sib.))

write_csv(dfss2a, "Class_Survey1.csv")
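
One detail of the `filter()` step above is worth checking. dplyr's `filter()` keeps only rows where the condition evaluates to `TRUE`, so rows where it evaluates to `NA` are dropped as well. The base R sketch below uses hypothetical rows to show how the same condition behaves when `Sib.` is missing.

```r
# Hypothetical rows, for illustration only.
demo <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Male"),
  Sib.   = c(16, 16, NA, NA, 2)
)

# The same condition as in the filter() call.
# NA | TRUE is TRUE, but NA | FALSE is NA.
keep <- demo$Sib. < 16 | demo$Gender == "Female"
keep

# Mimic filter(): keep only rows where the condition is TRUE.
kept <- demo[which(keep), ]
kept
```

In this sketch the male row with a missing sibling count is dropped along with the male row reporting 16 siblings. That would account for the fall from 1626 to 1613 rows: ten rows reporting 16 siblings, plus three rows from male or prefer-not-to-say respondents whose sibling count had already been set to NA.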