The first survey was the original one I developed for the course. I wanted to collect different types of data to help students understand the differences between quantitative and qualitative/categorical data, and to practise dealing with Likert data, which is a very common data type in health-care and nutrition questionnaires.
The questions were:
What is your gender? (categorical)
What is your height in metres? (quantitative)
What is your favourite colour? (categorical)
In what month were you born? (categorical)
How many siblings do you have? (quantitative - integer)
How important do you think statistics is to your job prospects? (Likert)
How important do you think learning statistics is within your degree? (Likert)
How positive do you feel about studying statistics? (Likert)
Reasoning
Even the first question has issues in how we ask it. I left it as male, female, prefer not to say, but I could have offered a better set of choices. Since my first teaching class in the late 1990s I have had trans students in my class. They are one of the most discriminated-against groups and it is VERY important that we develop inclusive language for the question of gender. This is an area that is being actively researched, especially with respect to population survey questions on gender [@bauer2017].
I asked for height as this is the least contentious measurement you are likely to ask about. I have had students with eating disorders and I avoid discussions about weight if at all possible. Height can also be a sensitive topic for trans students, who can feel self-conscious about it.
Favourite colour should cause no issues; the only problem is the huge variety of colour names. Some students said nude, which puzzled me as this was a multi-ethnic class and one person’s nude is not another person’s nude. I suggest defining a limited list of colours.
In what month were you born? In the first year I used a short-answer response, and this was a major issue for data cleaning, as you can’t imagine how many different ways there are of typing February. Always use a dropdown list for this sort of question to reduce the amount of data cleaning needed.
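If free-text answers have already been collected, a rough clean-up might look like the sketch below. It is only an illustration: the answers in `raw_month` are made up, and matching on the first three letters is an assumed rule that will not catch every variant.

```r
# Made-up free-text answers (illustration only, not the survey data)
raw_month <- c("February", "feb", " FEB.", "Febuary", "2", "sept", "June")

# Trim spaces, lower-case, and keep the first three characters as a matching key
key <- substr(tolower(trimws(raw_month)), 1, 3)

# Match the key against R's built-in English month abbreviations
idx <- match(key, tolower(month.abb))

# Unmatched entries (such as "2") become NA and need checking by hand
clean_month <- factor(month.name[idx], levels = month.name)
table(clean_month, useNA = "ifany")
```

Anything left as `NA` still has to be inspected manually, which is why a dropdown list is the better option.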
How many siblings do you have? This looks simple but again it isn’t straightforward, and even I have issues answering it. I have one brother and one half-brother. Does that mean I answer 2, 1 or 1.5? What about step-siblings, adoption, fostering or any of the other complex family relationships?
The three Likert scales are all simple. The only complexity here is whether Likert questions can be treated as quantitative or qualitative. I am firmly in the qualitative camp, which means that they cannot be summed or averaged. They do not have a mean, and they do not have a standard deviation or variance. There are others who disagree, for example the National Student Survey (even more ridiculous, as they combine categories to create a binary measure) and the end-of-module feedback forms, which average the scores.
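Treating Likert responses as categories can be sketched in R by storing them as an ordered factor and reporting counts and proportions rather than a mean. The responses and the five labels below are assumptions made for illustration; the wording of the actual response options is not given here.

```r
# Made-up responses on a 1-5 scale (illustration only, not the survey data)
imp <- c(4, 5, 3, 4, 2, 5, 4)

# Assumed labels for the five points of the scale
likert_labels <- c("Not at all", "Slightly", "Moderately", "Very", "Extremely")

# An ordered factor keeps the order of the categories without treating them as numbers
imp_f <- factor(imp, levels = 1:5, labels = likert_labels, ordered = TRUE)

# Report counts and proportions per category instead of a mean
counts <- table(imp_f)
props <- round(prop.table(counts), 2)
print(counts)
print(props)
```

Once the column is a factor, `summary()` reports counts per category rather than a mean and quartiles.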
library("dplyr")
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# Read in the csv file to the dataframe dfss2
dfss2 <- read.csv("Class_Survey.csv")
# Simplify the column names
colnames(dfss2) <- c("Gender", "Height", "Col.", "Month", "Sib.", "Imp1", "Imp2", "Pos")
# Summarise the data
summary(dfss2)
Gender Height Col. Month
Length:1626 Min. :1.000 Length:1626 Length:1626
Class :character 1st Qu.:1.600 Class :character Class :character
Mode :character Median :1.650 Mode :character Mode :character
Mean :1.654
3rd Qu.:1.710
Max. :2.500
Sib. Imp1 Imp2 Pos
Min. : 0.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.: 1.000 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:3.000
Median : 2.000 Median :4.000 Median :4.000 Median :3.000
Mean : 2.457 Mean :3.971 Mean :4.272 Mean :3.213
3rd Qu.: 3.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:4.000
Max. :16.000 Max. :5.000 Max. :5.000 Max. :5.000
# Put the months in order by defining a factor so that it is tabulated correctly.
dfss2$Month <- factor(dfss2$Month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
# Tabulate the Gender Column
table(dfss2$Gender)
Female Male Prefer not to say
1247 366 13
# Tabulate the Favourite Colour Column
table(dfss2$Col.)
Black Blue Brown Gold Green Orange Pink Purple Red Silver White
292 416 26 34 180 35 156 195 176 13 57
Yellow
46
# Tabulate the Birth Month Column
table(dfss2$Month)
January February March April May June July August
158 129 144 145 121 114 134 133
September October November December
141 147 119 141
All the data columns have the correct types. There are 1626 rows of data collected over about 6 years from classes. Men make up only about 22.5% of the dataset (366 of 1626 responses). This is lower than the actual proportion of men enrolled in Life Sciences and reflects a different response rate between men and women in filling in the survey.
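The share of each response can be checked from the tabulated gender counts. This is a minimal sketch using the three counts from the table; the variable names are illustrative.

```r
# Counts copied from the gender tabulation
gender_counts <- c(Female = 1247, Male = 366, "Prefer not to say" = 13)

# Convert the counts to percentages of all responses
gender_pct <- round(100 * gender_counts / sum(gender_counts), 1)
print(gender_pct)
```

On the full data frame the same figures should come from `round(100 * prop.table(table(dfss2$Gender)), 1)`.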
There are still more quality control steps that we need to take to verify that the data is actually valid.
boxplot(dfss2$Height)
The boxplot of Heights suggests some unusually tall and short students. There are a few students who are over 2.4 m tall and a few students who are about 1.0 m tall. These look to be unlikely values.
The current tallest living person is 2.51 m tall. There was nobody in the class who I recognised as being exceptionally tall and it is almost certain that any value above 2.2 m is false.
The shortest person alive is currently 66 cm tall or 0.66 m. I did not see anyone this short and I think that any height below 1.3 m is also likely to be an incorrectly recorded data point.
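One way to apply these two limits is to replace implausible heights with `NA` rather than deleting the rows. The sketch below uses made-up heights, and the names `min_height` and `max_height` are illustrative.

```r
# Made-up heights in metres (illustration only, not the survey data)
heights <- c(1.65, 1.00, 1.72, 2.50, 1.58, 2.45, 1.80)

# Plausibility limits taken from the text: below 1.3 m or above 2.2 m is treated as invalid
min_height <- 1.3
max_height <- 2.2

# Flag values outside the limits, leaving any existing NA untouched
implausible <- !is.na(heights) & (heights < min_height | heights > max_height)

# Replace the flagged values with NA so the rest of each row is kept
heights_clean <- replace(heights, implausible, NA)
print(heights_clean)
```

Applied to the survey, `dfss2$Height` would take the place of `heights`.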
The only other column where I could have incorrect data is the siblings column as the other values were selected from pull-down or multiple choice menus.
The number of siblings should be an integer. If it is a fraction then there is a problem. I should see this with a barplot of the counts for each of the different values.
barplot(table(dfss2$Sib.))
It is clear that some people have entered misleading data, which needs to be replaced with a missing value.
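A minimal sketch of this replacement, assuming that "misleading" means negative or fractional counts, is shown here on made-up values.

```r
# Made-up sibling counts (illustration only, not the survey data)
sib <- c(0, 1, 2, 1.5, 3, 0.5, 16, 2)

# Flag values that are negative or not whole numbers (assumed definition of an obvious error)
not_whole <- !is.na(sib) & (sib < 0 | sib != floor(sib))

# Replace the flagged values with NA
sib_clean <- replace(sib, not_whole, NA)

# Tabulate the remaining values, showing how many are now missing
table(sib_clean, useNA = "ifany")
```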
This has removed the obvious errors in the data so that we now only have integer values. I am still concerned about the number of 17-member families and I suspect that those are also inventions, given the shape of the rest of the data and the fact that there are no 14- or 15-sibling families.
If I order the data by siblings from highest to lowest, I see that most of them are male or prefer not to say, and that 3 already have an NA from lying about their heights. Given my prior knowledge of male students, my feeling is that all of their data is unreliable and I think that all of these rows should be removed completely.
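The ordering and inspection step can be sketched as follows. The rows are made up, and the number of rows inspected (three) is an arbitrary choice for illustration; the rule used to decide which rows to drop is not stated, so the sketch stops at inspection.

```r
# Made-up rows (illustration only, not the survey data)
df <- data.frame(
  Gender = c("Female", "Male", "Male", "Prefer not to say", "Female", "Male"),
  Height = c(1.62, NA, 1.80, NA, 1.70, 1.75),
  Sib.   = c(2, 16, 1, 13, 3, 12)
)

# Sort from most to fewest siblings
df_sorted <- df[order(df$Sib., decreasing = TRUE), ]

# Look at the rows with the largest claimed families (3 is a hypothetical choice)
top_rows <- head(df_sorted, 3)

# Summarise who these rows belong to and how many already have a missing height
gender_top <- table(top_rows$Gender)
n_missing_height <- sum(is.na(top_rows$Height))
print(top_rows)
```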
Gender Height Col. Month
Length:1613 Min. :1.400 Length:1613 January :157
Class :character 1st Qu.:1.600 Class :character October :147
Mode :character Median :1.650 Mode :character April :145
Mean :1.659 March :143
3rd Qu.:1.710 September:140
Max. :2.080 December :140
NA's :22 (Other) :741
Sib. Imp1 Imp2 Pos
Min. : 0.00 Min. :1.00 Min. :1.000 Min. :1.000
1st Qu.: 1.00 1st Qu.:3.00 1st Qu.:4.000 1st Qu.:3.000
Median : 2.00 Median :4.00 Median :4.000 Median :3.000
Mean : 2.37 Mean :3.97 Mean :4.277 Mean :3.212
3rd Qu.: 3.00 3rd Qu.:5.00 3rd Qu.:5.000 3rd Qu.:4.000
Max. :16.00 Max. :5.00 Max. :5.000 Max. :5.000
NA's :6