Once your data is in R you need to understand how R sees it. Just because you know a question has categorical or continuous data does not mean R sees it this way. There are multiple ways to understand how R sees your data. We share two below.
Now that you’ve got your data into R, let’s get some practice looking at it. There are a few ways to view a dataframe object like our new data file. One is to click it in the Environment tab; it will open in a new window that can do some useful things (e.g., sort by columns or filter). Another way is to use the head() command (and its opposite, the tail() command).
head(data)
## Stu.ID Intent Q1 Q2 Q3
## 1 1001 8 Disagree Disagree Disagree
## 2 1002 6 Prefer not to respond Disagree Agree
## 3 1003 5 Somewhat disagree Somewhat disagree Somewhat disagree
## 4 1004 9 Agree Agree Strongly Agree
## 5 1005 7 Strongly Agree Strongly Agree Strongly Agree
## 6 1006 8 Disagree Disagree Disagree
## Q4 Q5 Gender Par1.Educ
## 1 Agree Agree Agender High School
## 2 Agree Agree Female &/or Feminine &/or Woman High School
## 3 Agree Agree Female &/or Feminine &/or Woman High School
## 4 Strongly Agree Strongly Agree NA Bachelor's
## 5 Strongly Agree Strongly Agree Female &/or Feminine &/or Woman High School
## 6 Somewhat agree Somewhat agree Male &/or Masculine &/or Man Bachelor's
## Par2.Educ Age Rac.Eth
## 1 Some College 18 White
## 2 High School 19 Middle Eastern or North African, White
## 3 High School 25 I prefer to identify as [Columbian]
## 4 PhD 16 White
## 5 Not applicable 19 White
## 6 PhD 18 Black or African American
tail(data)
## Stu.ID Intent Q1 Q2
## 19 1019 9 <NA> <NA>
## 20 1020 8 Prefer not to respond Prefer not to respond
## 21 1021 5 Agree Agree
## 22 1022 5 Disagree Disagree
## 23 1023 7 Agree Agree
## 24 1024 6 Agree Agree
## Q3 Q4 Q5
## 19 <NA> <NA> <NA>
## 20 Prefer not to respond Prefer not to respond Prefer not to respond
## 21 Strongly Agree Strongly Agree Strongly Agree
## 22 Disagree Somewhat disagree Somewhat disagree
## 23 Strongly Agree Strongly Agree Strongly Agree
## 24 Strongly Agree Strongly Agree Strongly Agree
## Gender Par1.Educ Par2.Educ Age
## 19 NA Bachelor's Bachelor's 18
## 20 NA NA NA 18
## 21 Male &/or Masculine &/or Man High School High School 30
## 22 Female &/or Feminine &/or Woman High School High School 19
## 23 I don't understand the question Bachelor's Bachelor's 18
## 24 Male &/or Masculine &/or Man Bachelor's Bachelor's 18
## Rac.Eth
## 19 Black or African American
## 20 Black or African American
## 21 Hispanic; Latino; or Spanish origin
## 22 Asian
## 23 Asian
## 24 White
As you can see, these commands show you the first and last six rows of the specified dataframe. This is often a quick and easy way to see how your data looks without opening it in a new window.
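If six rows is more (or less) than you want, both commands accept an n argument. The sketch below uses a small made-up dataframe called toy (not the workshop data) so that it runs on its own.

```r
# Toy dataframe standing in for the workshop data (values are made up)
toy <- data.frame(
  Stu.ID = 1001:1010,
  Age    = c(18, 19, 25, 16, 19, 18, 20, 21, 18, 19)
)

first_three <- head(toy, n = 3)  # first 3 rows instead of the default 6
last_two    <- tail(toy, n = 2)  # last 2 rows instead of the default 6

print(first_three)
print(last_two)
print(dim(toy))  # number of rows, then number of columns
```

The dim() call is a handy companion: it tells you how much data sits between the rows that head() and tail() show.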
Another useful command is the colnames() command. This shows you a list of all the columns in a dataframe, and can be very helpful if you have a lot of variables and need to see how one is spelled. But more than that, it can be useful if your current column names aren’t serving your purposes. For instance, maybe we want our column names to be lowercase:
colnames(data) # see what the current column names are
## [1] "Stu.ID" "Intent" "Q1" "Q2" "Q3" "Q4"
## [7] "Q5" "Gender" "Par1.Educ" "Par2.Educ" "Age" "Rac.Eth"
old_columns <- colnames(data) # save the old column names in a new object
new_columns <- c("stu.id","intent","q1","q2","q3","q4","q5","gender","par1.educ","par2.educ","age","race.eth") # create the new column names and place them in a vector
colnames(data) <- new_columns # apply the new column names to the dataframe
colnames(data) # see the new column names
## [1] "stu.id" "intent" "q1" "q2" "q3" "q4"
## [7] "q5" "gender" "par1.educ" "par2.educ" "age" "race.eth"
colnames(data) <- old_columns # apply the old column names to the dataframe
colnames(data) # confirm the original column names are restored
## [1] "Stu.ID" "Intent" "Q1" "Q2" "Q3" "Q4"
## [7] "Q5" "Gender" "Par1.Educ" "Par2.Educ" "Age" "Rac.Eth"
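If all you want is lowercase names, you don’t have to type the new names out by hand. Here is a minimal sketch, assuming a made-up dataframe called toy; the replacement name parent1.educ is also just an illustration.

```r
# Toy dataframe with made-up values
toy <- data.frame(
  Stu.ID    = 1001:1003,
  Intent    = c(8, 6, 5),
  Par1.Educ = c("High School", "PhD", "Bachelor's"),
  stringsAsFactors = FALSE
)

old_columns <- colnames(toy)             # keep a copy in case we want to go back
colnames(toy) <- tolower(colnames(toy))  # lowercase every column name at once
print(colnames(toy))

# Rename a single column by looking it up by its current name
colnames(toy)[colnames(toy) == "par1.educ"] <- "parent1.educ"
print(colnames(toy))
```

Looking a column up by name rather than by position means the rename still works if the column order changes later.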
Sometimes you need to view or manipulate only part of an object, such as a single variable in a dataset. To do this you use the $ operator. If we wanted to preview the first six values of the Q1 column in our data, we could use the code below.
head(data$Q1)
## [1] "Disagree" "Prefer not to respond" "Somewhat disagree"
## [4] "Agree" "Strongly Agree" "Disagree"
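The $ operator has a close cousin, double square brackets, which is useful when the column name is stored in another object. The sketch below uses a made-up dataframe called toy to show that the two approaches pull out the same values.

```r
# Toy dataframe with made-up responses
toy <- data.frame(
  Stu.ID = 1001:1004,
  Q1     = c("Agree", "Disagree", "Agree", "Strongly Agree"),
  stringsAsFactors = FALSE
)

q1_dollar  <- toy$Q1      # pull out the column with $
q1_bracket <- toy[["Q1"]] # pull out the same column with [[ ]]
print(identical(q1_dollar, q1_bracket))

col_name  <- "Q1"             # the column name can live in a variable
q1_by_var <- toy[[col_name]]  # [[ ]] accepts that variable; $ would not
print(q1_by_var)
```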
Last but not least, you can use your new understanding of data structures in R to subset your data. Take our imported data, for instance. Suppose we want to use some of the variables (Stu.ID, Intent, and Age) but not the others, and find that they’re cluttering things up. Subsetting the current dataframe will let us create a new dataframe with the specified information without altering the previous dataframe.
data2 <- subset(data, select=c(Stu.ID,Intent,Age))
head(data2)
## Stu.ID Intent Age
## 1 1001 8 18
## 2 1002 6 19
## 3 1003 5 25
## 4 1004 9 16
## 5 1005 7 19
## 6 1006 8 18
Perhaps we want to filter our dataframe so that only 19-year-olds are included and we only see the columns relating to IDs, intent, and age. Age is in quotation marks because R currently reads that column as character data.
data3 <- subset(data, subset = Age == "19", select = c(Stu.ID, Intent, Age))
head(data3)
## Stu.ID Intent Age
## 2 1002 6 19
## 5 1005 7 19
## 10 1010 9 19
## 12 1012 NA 19
## 15 1015 10-Definitely will 19
## 18 1018 7 19
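You will also see people subset with square brackets instead of subset(). Here is a rough sketch of the same kind of filter written both ways, using a made-up dataframe called toy. One difference worth knowing: subset() quietly drops rows where the condition is NA, while plain bracket indexing does not, which is why the bracket version below wraps the condition in which().

```r
# Toy dataframe with made-up values; Age is stored as text, with one missing value
toy <- data.frame(
  Stu.ID = 1001:1005,
  Intent = c(8, 6, 5, 9, 7),
  Age    = c("18", "19", "25", NA, "19"),
  Q1     = c("Agree", "Disagree", "Agree", "Agree", "Disagree"),
  stringsAsFactors = FALSE
)

# Rows where Age is "19", keeping three columns, using subset()
with_subset <- subset(toy, subset = Age == "19", select = c(Stu.ID, Intent, Age))

# The same request with brackets: rows before the comma, columns after it
with_brackets <- toy[which(toy$Age == "19"), c("Stu.ID", "Intent", "Age")]

print(with_subset)
print(with_brackets)
```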
ON YOUR OWN. If you look at the data, you’ll notice we have a participant under the age of 18, which means we can’t use their data without their guardian’s permission, which we didn’t collect. Try subsetting your data to drop those below the age of 18 and see what happens!
One way to understand how R is reading your data is the str (structure) command.
str(data)
## 'data.frame': 24 obs. of 12 variables:
## $ Stu.ID : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ Intent : chr "8" "6" "5" "9" ...
## $ Q1 : chr "Disagree" "Prefer not to respond" "Somewhat disagree" "Agree" ...
## $ Q2 : chr "Disagree" "Disagree" "Somewhat disagree" "Agree" ...
## $ Q3 : chr "Disagree" "Agree" "Somewhat disagree" "Strongly Agree" ...
## $ Q4 : chr "Agree" "Agree" "Agree" "Strongly Agree" ...
## $ Q5 : chr "Agree" "Agree" "Agree" "Strongly Agree" ...
## $ Gender : chr "Agender" "Female &/or Feminine &/or Woman " "Female &/or Feminine &/or Woman " "NA" ...
## $ Par1.Educ: chr "High School" "High School" "High School" "Bachelor's" ...
## $ Par2.Educ: chr "Some College" "High School" "High School" "PhD" ...
## $ Age : chr "18" "19" "25" "16" ...
## $ Rac.Eth : chr "White" "Middle Eastern or North African, White" "I prefer to identify as [Columbian]" "White" ...
It’s similar to the head() command, but provides two additional bits of information. The first line of the output reports the size of the dataframe (here, 24 observations of 12 variables). Then, between the name of each column (e.g., Stu.ID) and the excerpt of the data in that column (e.g., 1001 1002 1003, etc.) is a short code that tells you what type of data is in the column, or what type of variable it is. The Stu.ID variable is an integer type variable, while the remaining variables are character types. The most commonly used types are:
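A few of the types you are likely to run into can be checked directly with the class() command. The values below are made up purely for illustration, and this is not meant to be a complete list.

```r
# Ask R how it sees a handful of made-up values
types <- c(
  whole_number = class(1001L),                          # the L suffix marks an integer
  decimal      = class(7.5),                            # numbers with decimals are numeric
  text         = class("Agree"),                        # anything in quotes is character
  true_false   = class(TRUE),                           # TRUE/FALSE values are logical
  category     = class(factor(c("Agree", "Disagree")))  # categorical data is a factor
)
print(types)
```

You can run class() on a single column in the same way, for example on data$Stu.ID.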
Another way to understand how R is reading your data is the summary() command. This function gives you a summary of each column of your data. For numeric data you see the spread of the data (minimum, quartiles, mean, and maximum), and for categorical (factor) data you see how many individuals are in each category. For character data, summary() reports only the length, class, and mode of the column. Generally, when you see character data you will want to convert it to categorical data (unless you have other plans for that data).
summary(data)
## Stu.ID Intent Q1 Q2
## Min. :1001 Length:24 Length:24 Length:24
## 1st Qu.:1007 Class :character Class :character Class :character
## Median :1012 Mode :character Mode :character Mode :character
## Mean :1012
## 3rd Qu.:1018
## Max. :1024
## Q3 Q4 Q5 Gender
## Length:24 Length:24 Length:24 Length:24
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Par1.Educ Par2.Educ Age Rac.Eth
## Length:24 Length:24 Length:24 Length:24
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
You might find that your data has been imported as one type (e.g., character) when you want it to be something else (e.g., integer). There’s a very easy way to tell R to convert it from one type to the other!
data$Gender <- as.factor(data$Gender) # recode gender variable from character to factor
This code overwrites the Gender column in your existing dataframe, converting it from character data to a factor. If you rerun the summary() or str() commands you can see the change. Instead of Class :character, you should see the categories for each column you converted, as well as how many participants chose each category.
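The same pattern works for other conversions, such as as.integer(), but it pays to check the result. In this sketch the vector intent_chr is made up: any entry that R cannot read as a number becomes NA (R also prints a warning, which is silenced here with suppressWarnings()).

```r
# Made-up character vector: two clean numbers, one messy entry, one missing value
intent_chr <- c("8", "6", "9 or 10", NA)

# Convert to integer; the messy entry cannot be read as a number and becomes NA
intent_int <- suppressWarnings(as.integer(intent_chr))
print(intent_int)

# Count how many values were lost in the conversion
lost <- sum(is.na(intent_int)) - sum(is.na(intent_chr))
print(lost)
```

Comparing the number of NA values before and after is a quick way to spot entries that need cleaning before you convert for real.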
Some columns may have so many categories that R doesn’t show them all when you run a global summary. You can tell levels are missing if you see (Other) among your factors (for an example of this, look at Gender). To see all the levels, run the summary() command on that particular column.
summary(data$Gender)
## Agender
## 1
## Agender, Female &/or Feminine &/or Woman
## 1
## female
## 1
## Female &/or Feminine &/or Woman
## 9
## Genderfluid
## 1
## I don't understand the question
## 1
## Male &/or Masculine &/or Man
## 6
## Male &/or Masculine &/or Man, Transgender
## 1
## NA
## 3
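If you only want the category names, or a quick count of how many there are, the levels() and nlevels() commands can help. The sketch below uses a short made-up vector rather than the workshop data.

```r
# Made-up responses, converted to a factor
gender <- as.factor(c(
  "Agender",
  "female",
  "Female &/or Feminine &/or Woman",
  "female"
))

print(levels(gender))   # the distinct categories R found
print(nlevels(gender))  # how many categories there are

counts <- table(gender) # how many responses fall in each category
print(counts)
```

Notice that R treats "female" and "Female &/or Feminine &/or Woman" as separate categories, because factor levels are matched on the exact text.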
To change multiple columns at the same time, you can use the following code, which uses the lapply() function. The lapply() function is powerful but a bit beyond this workshop, so we won’t go into depth, but the code is presented here for your use. Instead of naming each column you want to convert, you can just give the column numbers. To find the column numbers, use the names() command, which prints the column names in order; the bracketed number at the start of each output line gives the position of the first name on that line.
names(data)
## [1] "Stu.ID" "Intent" "Q1" "Q2" "Q3" "Q4"
## [7] "Q5" "Gender" "Par1.Educ" "Par2.Educ" "Age" "Rac.Eth"
data[,3:7] <- lapply(data[,3:7],as.factor)
With this command we are telling R to make columns 3 through 7 (questions Q1 through Q5) factors. You can run the summary() command again to see your data.
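Column numbers can shift if columns are added or removed, so some people prefer to list the columns by name. Here is a minimal sketch of that variation on a made-up dataframe called toy; the object name question_cols is our own choice.

```r
# Toy dataframe with made-up values
toy <- data.frame(
  Stu.ID = 1001:1003,
  Q1     = c("Agree", "Disagree", "Agree"),
  Q2     = c("Disagree", "Disagree", "Strongly Agree"),
  Age    = c("18", "19", "25"),
  stringsAsFactors = FALSE
)

question_cols <- c("Q1", "Q2")  # columns to convert, listed by name

# Apply as.factor to each listed column and store the results back in place
toy[question_cols] <- lapply(toy[question_cols], as.factor)

print(sapply(toy, class))  # check the type of every column
```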
ON YOUR OWN: Gender isn’t the only variable that would be better as a factor. What about Par1.Educ? Try converting it on your own and see how it goes!
After completing Step 3, what issues do you see in this data set that will need to be addressed before the data can be analyzed? Refer back to the project scenario, the survey below, the summaries you made in R, and the raw data if you need to.