Very often, data falls into a limited number of categories. For example, human hair color can be categorized as black/brown/blonde/red/grey/white (and perhaps a few more options for people who dye it). In R, categorical data is stored in factors. Given the importance of these factors in data analysis, you should start learning how to create, subset and compare them now!
What’s a factor and why would you use it?
What’s a factor and why would you use it? (2)
What’s a factor and why would you use it? (3)
Factor levels
Summarizing a factor
Battle of the sexes
Ordered factors
Ordered factors (2)
Comparing ordered factors
What’s a factor and why would you use it? (3)
There are two types of categorical variables: nominal categorical variables and ordinal categorical variables.
A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that ‘one is worth more than the other’. For example, think of the categorical variable animals_vector with the categories “Elephant”, “Giraffe”, “Donkey” and “Horse”. Here, it is impossible to say that one stands above or below the other. (Note that some of you might disagree ;-) ).
In contrast, ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: “Low”, “Medium” and “High”. Here it is obvious that “Medium” stands above “Low”, and “High” stands above “Medium”.
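As a minimal sketch of the distinction above, the two kinds of factor can be built as follows. The element values of temperature_vector and the names factor_animals_vector and factor_temperature_vector are assumptions chosen for illustration; the text only names the categories.

```r
# Nominal: the categories have no implied order
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
print(factor_animals_vector)

# Ordinal: levels are listed from lowest to highest
# (the five observations below are made up for this example)
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector,
                                    ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))
print(factor_temperature_vector)
```

When printed, the ordered factor shows its levels as Low < Medium < High, while the nominal factor lists its levels without any ordering symbol.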
Instructions: Click ‘Submit Answer’ to check how R constructs and prints nominal and ordinal variables. Do not worry if you do not understand all the code just yet; we will get to that.
Hint: Just click the ‘Submit Answer’ button and look at the console. Notice how R indicates the ordering of the factor levels for ordinal categorical variables.
Factor levels
When you first get a data set, you will often notice that it contains factors with specific factor levels. However, sometimes you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function levels():
levels(factor_vector) <- c("name1", "name2", ...)
A good illustration is the raw data that is provided to you by a survey. A common question for every questionnaire is the sex of the respondent. Here, for simplicity, just two categories were recorded, “M” and “F”. (You usually need more categories for survey data; either way, you use a factor to store the categorical data.)
survey_vector <- c("M", "F", "F", "M", "M")
Recording the sex with the abbreviations “M” and “F” can be convenient if you are collecting data with pen and paper, but it can introduce confusion when analyzing the data. At that point, you will often want to change the factor levels to “Male” and “Female” instead of “M” and “F” for clarity.
Watch out: the order in which you assign the levels is important. If you type levels(factor_survey_vector), you’ll see that it outputs [1] "F" "M". If you don’t specify the levels of the factor when creating it, R will automatically assign them alphabetically. To correctly map "F" to "Female" and "M" to "Male", the levels should be set to c("Female", "Male"), in this order.
Instructions: Check out the code that builds a factor vector from survey_vector. You should use factor_survey_vector in the next instruction. Change the factor levels of factor_survey_vector to c("Female", "Male"). Mind the order of the vector elements here.
Hint: Mind the order in which you have to type in the factor levels. Look at the order in which the levels are printed when typing levels(factor_survey_vector).
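One way the renaming could look is sketched below. It assumes factor_survey_vector is created by calling factor() on survey_vector with no extra arguments, which is what the alphabetical default described above implies.

```r
survey_vector <- c("M", "F", "F", "M", "M")

# Without explicit levels, R sorts them alphabetically: "F" comes before "M"
factor_survey_vector <- factor(survey_vector)
print(levels(factor_survey_vector))

# Rename the levels in that same order: "F" becomes "Female", "M" becomes "Male"
levels(factor_survey_vector) <- c("Female", "Male")
print(factor_survey_vector)
```

Had the new names been given as c("Male", "Female"), every "F" response would have been relabelled "Male", which is why the order matters.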
Summarizing a factor
After finishing this course, one of your favorite functions in R will be summary(). This will give you a quick overview of the contents of a variable:
summary(my_var)
Going back to our survey, you would like to know how many “Male” responses you have in your study, and how many “Female” responses. The summary() function gives you the answer to this question.
Instructions: Call summary() on both survey_vector and factor_survey_vector. Interpret the results of both vectors. Are they both equally useful in this case?
Hint: Call the summary() function on both survey_vector and factor_survey_vector; it’s as simple as that!
Have a look at the output. The fact that you identified “Male” and “Female” as factor levels in factor_survey_vector enables R to show the number of elements for each category.
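The difference between the two summaries can be seen in a short sketch, which rebuilds the survey objects so that it runs on its own. The name counts is an assumption introduced here for illustration.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Character vector: summary() only describes the vector itself
print(summary(survey_vector))

# Factor: summary() counts the observations in each level
counts <- summary(factor_survey_vector)
print(counts)
```

For the character vector, summary() has no notion of categories, so only the factor version answers the question of how many responses fall into each group.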
Battle of the sexes
You might wonder what happens when you try to compare elements of a factor. In factor_survey_vector you have a factor with two levels: “Male” and “Female”. But how does R value these relative to each other?
Instructions: Read the code in the editor and click ‘Submit Answer’ to test if male is greater than (>) female.
Hint: Just click ‘Submit Answer’ and have a look at the output that gets printed to the console.
Ordered factors
Since “Male” and “Female” are unordered (or nominal) factor levels, R returns a warning message, telling you that the greater than operator is not meaningful. As seen before, R attaches an equal value to the levels for such factors.
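The warning can be reproduced with a short sketch. It rebuilds factor_survey_vector from survey_vector so that it runs on its own; the names male, female and result are assumptions chosen for illustration.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

male <- factor_survey_vector[1]    # first respondent, "Male"
female <- factor_survey_vector[2]  # second respondent, "Female"

# R warns that '>' is not meaningful for factors and returns NA
result <- male > female
print(result)
```

Because the factor is unordered, R does not pick a winner: the comparison evaluates to NA rather than TRUE or FALSE.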
But this is not always the case! Sometimes you will also deal with factors that do have a natural ordering between their categories. If this is the case, we have to make sure that we pass this information to R…
Let us say that you are leading a research team of five data analysts and that you want to evaluate their performance. To do this, you track their speed, evaluate each analyst as “slow”, “medium” or “fast”, and save the results in speed_vector.
Instructions: As a first step, assign speed_vector a vector with 5 entries, one for each analyst. Each entry should be either “slow”, “medium”, or “fast”. Use the list below:
Analyst 1 is medium, Analyst 2 is slow, Analyst 3 is slow, Analyst 4 is medium and Analyst 5 is fast. No need to specify these are factors yet.
Hint: Assign to speed_vector a vector containing the character strings “slow”, “medium”, or “fast”.
Ordered factors (2)
speed_vector should be converted to an ordinal factor since its categories have a natural ordering. By default, the function factor() transforms speed_vector into an unordered factor. To create an ordered factor, you have to add two additional arguments: ordered and levels.
factor(some_vector, ordered = TRUE, levels = c("lev1", "lev2" ...))
By setting the argument ordered to TRUE in the function factor(), you indicate that the factor is ordered. With the argument levels you give the values of the factor in the correct order.
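Applied to the analyst data, the call might look like the sketch below. The contents of speed_vector follow the list of analysts given earlier; unordered_speed is a hypothetical name added only to show the default behaviour for comparison.

```r
# One entry per analyst, in the order listed earlier
speed_vector <- c("medium", "slow", "slow", "medium", "fast")

# Default: an unordered factor with alphabetically sorted levels
unordered_speed <- factor(speed_vector)
print(levels(unordered_speed))

# Ordered factor: levels are given explicitly from lowest to highest
factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))
print(factor_speed_vector)
```

Printing the ordered factor shows its levels as slow < medium < fast, which confirms that R has recorded the ordering.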
Instructions: From speed_vector, create an ordered factor vector: factor_speed_vector. Set ordered to TRUE, and set levels to c("slow", "medium", "fast").
Hint: Use the function factor() to create factor_speed_vector based on speed_vector. The argument ordered should be set to TRUE since there is a natural ordering. Also, set levels = c("slow", "medium", "fast").
Comparing ordered factors
Having a bad day at work, ‘data analyst number two’ enters your office and starts complaining that ‘data analyst number five’ is slowing down the entire project. Since you know that ‘data analyst number two’ has the reputation of being a smarty-pants, you first decide to check if the statement is true.
The fact that factor_speed_vector is now ordered enables us to compare different elements (the data analysts in this case). You can simply do this by using the well-known operators.
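A minimal sketch of such a comparison is shown below. It rebuilds factor_speed_vector so that it runs on its own, and uses da2 and da5 as names for the second and fifth analysts.

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

da2 <- factor_speed_vector[2]  # second analyst: "slow"
da5 <- factor_speed_vector[5]  # fifth analyst: "fast"

# With an ordered factor, > compares positions in the level order
print(da2 > da5)  # FALSE, since "slow" ranks below "fast"
```

The same idea works for the other comparison operators, such as < and ==, as long as the factor is ordered.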
Instructions: Use [2] to select from factor_speed_vector the factor value for the second data analyst. Store it as da2. Use [5] to select from factor_speed_vector the factor value for the fifth data analyst. Store it as da5. Check if da2 is greater than da5; simply print out the result. Remember that you can use the > operator to check whether one element is larger than the other.
Hint: To select the factor value for the third data analyst, you’d need factor_speed_vector[3]. To compare two values, you can use >. For example: da3 > da4.
What does the result tell you? Data analyst two is complaining about data analyst five, while in fact they are the one slowing everything down! This concludes the chapter on factors. With a solid basis in vectors, matrices and factors, you’re ready to dive into the wonderful world of data frames, a very important data structure in R!