Run this first code block by clicking on the green arrow to the right—————>
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.This is what you will do when you have finished filling out the assignment and are ready to save and upload it to canvas.
For this lab all of your instructions will be in this text segment of the file and everything that you type in, either commands or answers to questions that I ask will be in the darker grey boxes called “code chunks”. Inside of a code chunk, you can type plain text (that wont be run as code) by putting “#” in front of it. Any number of “#” signs indicate that R should not run that line as a code. When I want you to input a text answer for your assignment I will but a “###” before some text in the code chunk and you should answer the question by typing after the question when indicated. For example, in the following code chunk please type your name where I have indicated:
3+3
## [1] 6
### Type your name here: Colby
Run the block above by clicking on the green arrow. Note how R only gave you the answer to the command “3+3” that is in the code block, it ignored the text because we had “#” before it.
Lets make sure that we are in the correct folder for your working directory. Based on what you learned in the last lab, enter the code below that gives you the file path for your current working directory:
getwd()
## [1] "C:/Users/colby/OneDrive/Desktop/PSY 260/Lab 6"
You should be in the “lab5” folder that you created in your “PSY260_R” folder on your desktop. If you are not first make sure that this file is saved to that folder then click on “session” in the tabs at the top of this page -> “Set Working Directory” -> “To Source File Location”. Once you have done this run the code block above again to double-check it.
We need to load our packages. Remember that we did this during our first lab playing around with R. Using what you learned from the first lab, install and load the packages “ggplot2”, “Hmisc”, “readr”, and “psych”. Refer back to our Lab 1 materials if you are unsure how to do this.
### type in the code to load the listed packages below, one on each line. Remember that they must be installed first in the "Packages" tab to the right before they can be loaded:
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:Hmisc':
##
## describe
library(readr)
Now we are going to upload our data file. You can do this by running the code below. If you have correctly set up your directory this should run fine. If you get an error please go back to the video that I linked in Lab 4 on “Demystifying file paths in R” and see if you can troubleshoot what went wrong.
class_data <- read.csv("class_data_S26.csv")
Now that we have our data uploaded, we can play with it!
…but first some setup.
Run the code block below:
45/467534657
## [1] 9.624955e-08
notice how the number that R gave is presented in scientific notation (thats what the little “e-” means). This is the default for R when presenting really small numbers, but it can be a problem when we want to interpret exact values so we are going to change it. Run the code below to tell R to turn off scientific notation and then run the code above again to see what happens
options(scipen=999)
Okay, now we can start working with the data. First, we want to take a look at our data. If you click on the object in our “environment” panel in the top right, a spreadsheet of the data will open in its own tab. Just click on the “Lab6_assignment” tab again to get back here.You will be able to see all of the data we collected as a class during our first lab. Its in a slightly different format now. I put it into a .csv file which is the best file type for uploading data into R. It is also stripped down, the variable names are simplified and the data is either a single number or a short character string. Because I have stripped down the data to work with it in R, we need some way to remember what each of the variables are and what the numbers represent. This is where a “codebook” comes into play. The other file I had you put into our “Lab6” folder is a pdf called “Codebook_class_data_s26” please open this and take a look. This file is basically a cheat sheet for what all of our variables are. This will become more important later in the semester when you are working with more complex data sets and collecting your own data. For now this file should help you with the next step if you have forgotten what any of the variables are.
The first step to working with our data is to identify what kind of data scale each variable is on. As I mentioned in class, we treat variables differently in statistics depending on whether they are ordinal, nominal, variable, or ratio. We’ve already categorized these particular variables in class but just to make sure you remember fill out the following code block:
### In the following lines after each variable type "ordinal", "nominal", "interval" or "ratio" to indicate what each variable's scale is:
# name:nominal
# cat_dog:nominal
# coffee:nominal
# zombie_apo:interval
# morning_night:interval
# num_sib:ratio
# num_countries:ratio
When it comes to doing descriptive statistics, the distinction between nominal and ordinal and between interval and ratio doesn’t matter as much. What matters is whether the data is categorical (nominal and ordinal) or continuous (interval and ratio)
When you load in a dataset, R automatically assigns variable types to each variable. Lets take a look at what it did. Run the following codeblock to see the structure of the dataset (variable types)
str(class_data)
## 'data.frame': 30 obs. of 7 variables:
## $ name : chr "Bella" "Kyla" "Lucas" "Colby" ...
## $ cat_dog : chr "dog" "both" "dog" "dog" ...
## $ coffee : chr "yes" "no" "yes" "yes" ...
## $ zombie_apo : int 2 4 4 5 3 5 2 2 3 2 ...
## $ morning_night: int 5 3 3 5 3 3 3 1 2 4 ...
## $ num_sib : int 0 0 1 1 1 1 1 1 1 1 ...
## $ num_countries: int 1 3 4 2 13 0 3 2 3 4 ...
You will notice that R categorized the first three as “chr” - this means “character”, the data is made up of text or characters. R categorized the last 4 as “int” - this stands for “integer” meaning the data is made up of whole numbers.The last 4 being labeled as “int” is fine, we will be able to do descriptive stats with these. However, if we want to do some descriptives with “cat_dog” and “coffee” we need R to recognize that these are variables with multiple factors (e.g. cat_dog represents a question where the answer is one of three options, “cat”, “dog”, or “both”) and not just a bunch or random text. To do this we will need to tell R that these two variables are a “factor” and not “character”. This is how we do this:
class_data$cat_dog <- as.factor(class_data$cat_dog)
# Notice how the first part of this command is "class_data$cat_dog", in R we can reference a specific variable inside of a dataframe (which is what our class_data object is) by using the symbol "$" and then the name of the variable. So this is telling R to take just that one variable "cat_dog" which is in the dataframe " class_data" and then change it to a factor using the function "as.factor()".
### Now I want you to edit the code below to tell R to do the exact same thing but with our other categorical variable "coffee":(class_data$cofee)
class_data$cat_dog <- as.factor(class_data$cat_dog)
If you followed the instructions above and ran the code then when you run this next block you should see that both “cat_dog” and “coffee” have been changed to factors.
str(class_data)
## 'data.frame': 30 obs. of 7 variables:
## $ name : chr "Bella" "Kyla" "Lucas" "Colby" ...
## $ cat_dog : Factor w/ 4 levels "both","cat","dog",..: 3 1 3 3 3 3 3 3 3 3 ...
## $ coffee : chr "yes" "no" "yes" "yes" ...
## $ zombie_apo : int 2 4 4 5 3 5 2 2 3 2 ...
## $ morning_night: int 5 3 3 5 3 3 3 1 2 4 ...
## $ num_sib : int 0 0 1 1 1 1 1 1 1 1 ...
## $ num_countries: int 1 3 4 2 13 0 3 2 3 4 ...
Notice how R has automatically detected the number of levels in each of the factors, “cat_dog” has 3 levels and “coffee” has 2.
Lets keep working with our categorical variables. There are not as many descriptives that we can run on categorical variables. We can’t calculate any measures of central tendency or variability because the data is not continuous. However, what we can do is calculate frequency (specifically the frequency of each level of your factor) and we can visualize our data with a bar chart showing the number of items in each category/level.
The first thing we are going to ask R for is a frequency table. We can get a frequency table for n in each level of each variable like this:
table(class_data$cat_dog)
##
## both cat dog neither
## 5 2 22 1
# Edit the next code block to get the table for the variable "coffee"
table(class_data$coffee)
##
## no yes
## 9 21
### REFLECTION: What is the most frequent score for each variable? What does this information tell you about our class as a whole (how would you interpret this data)?
### type your answer here: Dog was the most frequent variable. It was at 22 which is far more than the next most frequent answer of 5. The vast majority of people prefer dogs.
Another thing we can do is get the breakdown of frequencies by both variables “cat_dog” and “coffee”:
table(class_data$cat_dog, class_data$coffee)
##
## no yes
## both 2 3
## cat 0 2
## dog 7 15
## neither 0 1
# in order to add another variable to the same function we just include it inside of the parenthesis separated by a comma
### REFLECTION: In the table you just generated, what does the number "15" represent? What does this tell you about our class?
### type your answer here:It means that 15 people are dog people as well as cofee drinkers.
So those are the frequency tables, the other thing we can generate are visualizations of the data. For categorical data the only good way to visualize it is with a bar graph:
barplot(table(class_data$cat_dog),
main = "Cat vs Dog Preference",
ylab = "Count")
In the function above we are asking R to create a barplot using the “barplot” function. Notice how the function we used before to create a table is inside of the barplot function, this is because we are basically just taking that frequency table and making a bar graph out of it. The argument “main” is telling R what you want the title of the graph to be and the argument “ylab” is telling it that your y axis should be labelled “count” - because it is the n of each level. Edit the next block of code to get the bar graph for the variable “coffee”. Don’t forget to change your graph title!
barplot(table(class_data$coffee),
main = "Cat vs Dog Preference",
ylab = "Count")
So thats pretty much all we can do with categorical data for now, lets move on to our other variables. There are specific functions to run pretty much every measure of central tendency, for example:
median(class_data$num_sib)
## [1] 2
Using the logic above, in the following code block try to write a line that will give you the “mean” of “zombie_apo”
mean(class_data$zombie_apo)
## [1] 2.866667
Try to run the following code block to find the mode:
mode(class_data$num_countries)
## [1] "numeric"
Oh no! what happened! lol. R has some fun little quirks. One of them is that the function “mode()” is actually referring to the storage mode of an object (like “numeric” or “character”), not the statistical mode. In order to tell R we want the statistical mode we need to use “get_mode()”:
#This block of code is defining the function, dont worry about it for right now
get_mode <- function(x) {
uniq_vals <- unique(x)
uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}
get_mode(class_data$num_countries)
## [1] 2
So this is all well and good but it would be kind of annoying to have to run a separate line for every descriptive statistic we want to run on every variable. Good thing there are functions that let us do all the descriptives at once!
describe(class_data)
## vars n mean sd median trimmed mad min max range skew
## name* 1 30 15.50 8.80 15.5 15.50 11.12 1 30 29 0.00
## cat_dog* 2 30 2.63 0.81 3.0 2.75 0.00 1 4 3 -1.18
## coffee* 3 30 1.70 0.47 2.0 1.75 0.00 1 2 1 -0.83
## zombie_apo 4 30 2.87 1.01 3.0 2.79 1.48 1 5 4 0.45
## morning_night 5 30 3.70 1.21 4.0 3.79 1.48 1 5 4 -0.46
## num_sib 6 30 1.83 1.15 2.0 1.71 1.48 0 5 5 0.84
## num_countries 7 30 3.70 3.87 3.0 2.96 1.48 0 18 18 2.11
## kurtosis se
## name* -1.32 1.61
## cat_dog* 0.10 0.15
## coffee* -1.35 0.09
## zombie_apo -0.73 0.18
## morning_night -1.04 0.22
## num_sib 0.43 0.21
## num_countries 4.59 0.71
#Note: This function will only work if you have successfully installed and loaded the package "psych" because it is part of that package and does not exist in the base R
Notice how there is an asterisk next to our first three variables. This is because these are categorical “character” variables. So R is basically telling you that you can’t rely on the descriptives for these variables. Lets clean this up by only including the variables that we want full descriptive stats for:
describe(class_data[, 4:7])
## vars n mean sd median trimmed mad min max range skew
## zombie_apo 1 30 2.87 1.01 3 2.79 1.48 1 5 4 0.45
## morning_night 2 30 3.70 1.21 4 3.79 1.48 1 5 4 -0.46
## num_sib 3 30 1.83 1.15 2 1.71 1.48 0 5 5 0.84
## num_countries 4 30 3.70 3.87 3 2.96 1.48 0 18 18 2.11
## kurtosis se
## zombie_apo -0.73 0.18
## morning_night -1.04 0.22
## num_sib 0.43 0.21
## num_countries 4.59 0.71
Take a look at the structure of the code above. Looking at what R gave us what do you think “4:7” represents (what is this tell R to do)?
### Type your answer here: It represents the fourth through seventh variables.
Looking at the mean and standard deviation of “morning_night”, what can you say about this data?
### Type your answer here:That there is an outlier.
Looking at the mean and standard deviation of “num_countries”, what can you say about this data?
### Type your answer here: That people who have visited more than three countries have visited much more than that rather than just one or two more.
Lets visualize our data! For our likert scale data like “zombie_apo” and “morning_night” the best way to visualize it is a barplot (like we did for our categorical data):
barplot(table(class_data$zombie_apo),
main = "Likelihood of Surviving a Zombie Apocalypse",
xlab = "Rating (1 = low, 5 = high)",
ylab = "Count")
If you take a look at the codebook pdf you will see that for our zombie data a “1” represented “i would be the first zombie” so we have this labeled as “low” whereas a “5” represented “Final survivor!!” so we have labeled this as “high”.
Using what you learned before with our categorical data, please edit this code to instead make the barplot for “morning_night”. Our ratings for this one should be “1 = morning person” and “5 = night owl”. Don’t forget to change your title!
barplot(table(class_data$zombie_apo),
main = "Likelihood of Surviving a Zombie Apocalypse",
xlab = "Rating (1 = low, 5 = high)",
ylab = "Count")
Read the following short article: https://www.statisticshowto.com/shapes-of-distributions/. How would you describe the shape of the two barcharts above?:
### Type your answer here:Most people are somewhat confident and not very confident or not confident at all. Not many extreme answers.
Lets look at our two ratio variables now. Take a look at the descriptives for “num_siblings” again. Our range for this variable was 5 (min=0, max=5), since our range is so small and we are working with discrete numbers (because you can’t have 1/4 of a sibling), it makes sense to visualize this data with another bar plot:
barplot(table(class_data$num_sib),
main = "Number of Siblings",
xlab = "Number of Siblings",
ylab = "Count")
How would you describe the shape of this distribution?
### Type your answer here: Most people have 1-2 siblings and anything different is rare.
Now, we can visually inspect our graphs to decide if they are normally distributed however, this can be pretty subjective - different people might look at the same graph and come to very different conclusions. Good thing we have statistics! There is a statistic we can run to check the normality of a variable. There are actually a lot of options but the one we are going to use is called the Shapiro-Wilk Test:
shapiro.test(class_data$num_sib)
##
## Shapiro-Wilk normality test
##
## data: class_data$num_sib
## W = 0.88002, p-value = 0.002823
The number that tells you whether your data is normally distributed or not is the “p-value”. We are going to talk more about p-values later but for now if your p-value is greater than .05 then you can assume that your data is a normal distribution. If your p-value is less than .05 then your data is not normally distributed (it is skewed). In the test we did on “num_sib” our p-value is just barely greater than .05, but it still passes the test! Now I want you to edit the code below to get the skewness statistic on “num_countries” and then interpret it:
###: edit the code below:
shapiro.test(class_data$num_sib)
##
## Shapiro-Wilk normality test
##
## data: class_data$num_sib
## W = 0.88002, p-value = 0.002823
###: What did this test tell you about the shape of the distribution of this variable?
###: Type your answer here:P value sahows us the shape has a normal distribution.
Now lets actually look at the graphs for number of countries:
barplot(table(class_data$num_countries),
main = "Countries Visited",
xlab = "Number of Countries",
ylab = "Count")
The chart above technically works however, it looks kind of messy and is hard to interpret. Notice on the bottom how we don’t have a bar for each number (like 1)? This is because no one answered that specific number. There’s also a large gap between 12 and 24. So this bar chart is probably not the best way to visualize this data. Something that would be better is what we call a boxplot:
boxplot(class_data$num_countries,
main = "Countries Visited",
ylab = "Number of Countries")
The box plot is especially good for showing the spread of the data and any outliers. The dark line in the middle of the box represents the mean and the dark grey box around it represents the standard deviation. You will notice that in the very top of the chart space there is a small circle. Looking at the data, what do you think this represents?
### Type your answer here:The middle space in between each number listed. 7.5,12.5,17.5
The final thing I want to do with this lab is to give you a little sneak peak into what is coming next. We are going to try to visualize a relationship between our two purely numeric variables (number of siblings and countries visited).
plot(class_data$num_sib, class_data$num_countries,
main = "Siblings vs Countries Visited",
xlab = "Number of Siblings", ylab = "Countries Visited")
Now this looks kind of silly, that’s because our range of number of siblings is so small, we will look at some better scatterplots later. However, is there any kind of relationship that you can see here between number of siblings and countries visited? (hint: its okay if the answer is no or is tentative, we are just exploring the data right now)
### Type your answer here:People with more siblings were less likely to visit more countries.
As a final reflection, what is one thing that you found in this dataset that you found interesting?
### Type your answer here:People with one or two siblings were more likely to visit more countries.
Congrats! You made it to the end of the lab! Before you knit and upload your file, are there any remaining questions you have about R that you want me to answer before our next lab or comments about this one?
### Type anything here:
Make sure to click the “Knit” button and be sure to check that your html file was saved to your “Lab6” folder before closing out of this window. Also, make sure to save this R markdown file before you go so that just in case anything goes wrong, you at least have your typed answers in here.