R is a free software environment for statistical computing and graphics, and is one of the most widely used programs in both industry and research. For more information on the R programming language and how to download R, visit https://www.r-project.org/. After clicking "download R", you will be taken to the list of CRAN mirrors. You can choose any mirror in the USA (such as Indiana University), and then choose the correct version for your operating system.

For an easier interface to program in R that also allows the creation of reproducible documents, we suggest downloading and using RStudio, available here: https://www.rstudio.com/. Make sure you download the free RStudio Desktop version. In fact, this document was created in RStudio using a specific document type called R Markdown.

To be able to use RStudio, you will also need to download R using the first link above.

Introduction to the R Interface

This is what the main coding area in R Markdown looks like. Here is where you will write your code and execute it by using the Run button in the top right corner.

The top left corner is where you can save your work and access any other R files you have open as tabs.

This is the console. This is where code executed from the main area will appear, along with its output.

Correctly run code with no errors will be blue in the console, while error messages will be red.

This is the environment section.

Created variables and inputted datasets will be available to view in the environment section.

Lastly, this section has the help tab and the plots tab.

The plots tab will have any plots/graphs that you create, and you can switch between plots using the blue arrows in the top left. The help tab is a very important part of the R interface that allows you to search functions in R and learn how to use them correctly. Whenever you are having a coding issue, the help tab is a great place to start.

Now that you have been introduced to the R interface, you can begin to code to help you better understand the chapters from your PSYCH 2220 class.

R Coding Basics

A great way to understand the basics of R is to imagine it working like a phone. The base installation of R is like the basic functions of the phone, such as sending texts and calling. Installing packages in R is just like downloading apps on your phone. Loading a package with library() and calling its functions is like opening an app and using it.
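To make the analogy concrete, here is a minimal sketch of the three steps. The package name "ggplot2" is only an example and is not required by this document. The install line is left commented out because it needs an internet connection.

```r
# Step 1: download a package from CRAN once (like installing an app).
# "ggplot2" is only an example package name.
# install.packages("ggplot2")

# Step 2: load an installed package for this session (like opening the app).
# The "stats" package ships with R, so this line works on any installation.
library(stats)

# Step 3: use a function that the loaded package provides.
median(c(3, 8, 1, 5, 2, 5, 4))
## [1] 4
```

You only need to install a package once, but you need to load it with library() every time you start a new R session.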

Notebooks

An R Notebook allows you to keep your code organized and type in notes in regular non-code language. To create a new notebook, open up RStudio and click the plus sign in the top left corner. R Notebooks are what you will use for homework assignments that require coding.

Chunks

Within each notebook you make, you will have chunks, which are a way to divide up your code into sections so it doesn’t get sloppy. For example, within one notebook, you may keep pieces of code that are related to each other in the same chunk. To manually add a chunk, press the green “C” with a plus sign, and press “R.” To insert a chunk using the keyboard, press ctrl+alt+i for PCs, and command+option+i for Macs.

Objects

This is a term you need to be familiar with because you will be creating objects all the time when using R. In simple terms, an object in R can be anything, whether it is a number, a column of numbers, or text. Objects also have many purposes, such as making a table with data.

Making Objects

# Create the object
Hello <- "hi there" 
 
# Print object 
Hello 
## [1] "hi there"
# An object as a number 
Number <- 10 

# Print object 
Number 
## [1] 10
# An object as a list of numbers. Let's imagine we are at a park and count how 
# many people we see at each ride
ParkPeople <- c(3, 8, 1, 5, 2, 5, 4) 
 
# print object 
ParkPeople 
## [1] 3 8 1 5 2 5 4

Dataframes

A dataframe is a data set stored as a table: each column is a variable and each row is one observation. This will make a lot more sense as you continue using R. You are able to run simple statistics with objects you create, and much more that will be introduced throughout your time in the course.
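As a rough illustration, the sketch below builds a small dataframe from the ParkPeople counts used earlier. The ride names and the object name park are invented for this example.

```r
# Recreate the vector from the Making Objects section
ParkPeople <- c(3, 8, 1, 5, 2, 5, 4)

# Combine two columns of equal length into a dataframe.
# The ride names are made up for illustration.
park <- data.frame(
  ride   = c("Carousel", "Coaster", "Swings", "Slide", "Train", "Maze", "Boats"),
  people = ParkPeople
)

park         # print the whole table
nrow(park)   # number of rows (one per ride)
park$people  # pull out a single column with $

# Which ride had the most people?
park$ride[which.max(park$people)]
```

Each column of a dataframe is itself a vector, so everything you can do with an object like ParkPeople also works on park$people.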

Some Advice

Learning to code takes time and patience - it’s okay if you don’t understand it the first time around! Helping each other out will also help you understand coding better. Additionally, experiment with your code or data! This can help you see where errors may be in your code. Furthermore, keep your code clean and organized. Keep in mind the order in which you complete each step, because it matters! Finally, do not freak out! This is a new process for most of you!

# This is a comment within a chunk. If you want to type notes in your code, 
# put a hashtag before you type and R will not read it as code. This is a great 
# way to organize steps in your code and you can refer back to it! 
# It is also a great way to keep track of the steps you complete. 

Calculations in R

Another cool feature in R is that you can do mathematical computations. Below are some common computations you can do.

# Calculations with numbers 
5+5 
## [1] 10
36/4 
## [1] 9
5*5 
## [1] 25
sqrt(144) 
## [1] 12
3^2 
## [1] 9
# Calculations with objects 
five <- 5 
5+five 
## [1] 10
five+five 
## [1] 10
# Calculations using the object "ParkPeople" from the objects section above 
ParkPeople+4 
## [1]  7 12  5  9  6  9  8
ParkPeople+five 
## [1]  8 13  6 10  7 10  9
ParkPeople*3 
## [1]  9 24  3 15  6 15 12
sqrt(ParkPeople) 
## [1] 1.732051 2.828427 1.000000 2.236068 1.414214 2.236068 2.000000
# rounding 
round(sqrt(ParkPeople)) 
## [1] 2 3 1 2 1 2 2
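The same idea extends to simple summary statistics. The sketch below uses only functions built into R and recreates ParkPeople so that it runs on its own.

```r
# Recreate the vector from the Making Objects section
ParkPeople <- c(3, 8, 1, 5, 2, 5, 4)

sum(ParkPeople)      # total number of people counted
## [1] 28
length(ParkPeople)   # how many rides we counted at
## [1] 7
mean(ParkPeople)     # average number of people per ride
## [1] 4
median(ParkPeople)   # middle value once the counts are sorted
## [1] 4
max(ParkPeople)      # largest count
## [1] 8
round(sd(ParkPeople), 2)  # standard deviation, rounded to 2 decimals
## [1] 2.31
```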

Using R For Statistics

Now that you’ve learned how to use R and some of its basic tools and functions, you can use R for what it’s meant to do: statistics! Below we’ve written some code to explore various concepts in statistics that you may or may not have learned about. Using R you can play around with these concepts and hopefully develop a much richer understanding of them!

Representing Data Visually

Graphing, and the visual representation of data, is an essential component of statistics. Thus, understanding how to convert data into visual graphics within R is an extremely valuable skill. The aim of this section is to provide a short and simple introduction to graphing using RStudio.

Identifying Data

Before creating a graph, you are going to need to identify the data you are working with. You can obtain your data simply by creating a vector that holds the values or you could load a dataset from your desktop or from R itself.

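Creating a vector may look something like this:

# a vector typed in by hand
x <- c(1,2,3,4,5,6,7,8,9)

# a vector of 10 random values drawn from a normal distribution
# with a mean of 0 and a standard deviation of 1
y <- rnorm(n = 10, mean = 0, sd = 1)

# in this case x and y are the vectors 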
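Loading a dataset instead of typing a vector might look like the sketch below. It uses mtcars, one of the example datasets that ships with R. The file path in the second option is hypothetical, so that line is left commented out.

```r
# Option 1: load a dataset that comes with R
data(mtcars)
head(mtcars)  # look at the first six rows
dim(mtcars)   # number of rows, then number of columns

# Option 2: read a file from your computer.
# The path below is hypothetical; replace it with the location of your own file.
# my_data <- read.csv("~/Desktop/my_data.csv")

# Pull one column out of the dataset as a vector to plot or analyze
mpg <- mtcars$mpg
mean(mpg)
```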

Deciding Which Graph to Use

After identifying your data, you will now have to determine which graph to use. Many times your professor will ask for a specific plot, but occasionally, you will have to decide for yourself what graph to use. In the case that you have to determine which graph is best, it may be a good idea to think back on the data you are working with.

Are you working with…

  1. Scale variables or nominal variables?

  2. One variable? Two variables? Maybe even three variables?

  3. Are you looking at the relationship between your variables? Or using one to predict the other?

These are all super important questions when determining what graph to use with which set of data. So, I will run through a few common graphs and the situations in which you would want to use them.

Scatterplots

Scatterplots are best to use when depicting the relation between two scale variables.

ex. Looking at the relationship between hours of sleep before a big exam and the grades received on that exam.

Line graphs

Line graphs are pretty similar to scatterplots, except that the points are connected by a line. They are usually used to display how a single scale variable changes over time, or, with a line of best fit, to show the trend between two variables. In other words, line graphs are great to use when a question asks you to predict information about one variable based on the other.

ex. Can we predict a person’s stress levels by looking at how often they exercise?

Bar Graphs

Bar graphs are generally used when we are working with one nominal variable and one scale variable. The x-axis generally represents the nominal (categorical) variable, while the y-axis generally represents the scale variable.

ex. Percentage of the population that owns pets (scale) in each of six different neighborhoods (nominal).

Histograms

Histograms can look similar to bar graphs, but they depict just one variable (usually a scale variable). Additionally, they are used to visualize the frequency of numerical data, so they are great to show how often an event takes place.

ex. You want to see how much squirrels’ weights vary in a particular area.

Creating/ Coding Graphs

This section will run through how to code for 3 different types of graphs, including basic graphs using the plot() function, histograms using the hist() function, and bar plots using the barplot() function.

simple graphs - plot() function

We’ll first plot a super simple graph

# first, define your vector

x <- c(1,5,4,2,8)

# next we can plot this vector using the plot() function

plot(x)

You’ve officially plotted a graph (specifically a scatterplot) using R! Now it’s time to add some titles.

# use main=" " to label the main title, or header
# use xlab=" " to label the x-axis
# use ylab=" " to label the y-axis

# So, say the variable x represents inches of snow

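# with a single vector, plot() puts each value's position (1st, 2nd, ...) 
# on the x-axis
plot(x, main = "Inches of Snow", 
     xlab = "Observation", 
     ylab = "Amount of Snow (in Inches)")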

Now that we have some labels, let’s add another variable!

# now say we have 
# x = inches of snow 
# y = number of reported car accidents 

x <- c(1,2,4,5,8)
y <- c(10,11,20,23,31)

plot(x,y, main = "Relationship between Inches of Snow and Car accidents", 
     xlab = "Inches of Snow", 
     ylab = "Number of Car Accidents")

If the question was asking us to predict the number of car accidents based on the inches of snow, then a line graph would be more appropriate. So, let’s connect our points!

# in order to change a scatterplot into a line graph we simply add
# type = 'l'

# don't forget your commas though!

plot(x, y, type = 'l', 
     main = "Relationship between Inches of Snow and Car accidents", 
     xlab = "Inches of Snow", 
     ylab = "Number of Car Accidents")

We can also change the color and width of our line if we want

# add col="blue" or "green" or "red" etc. 
# add lwd = # to change the width

plot(x, y, type = 'l', col = "red", lwd=4,
     main = "Relationship between Inches of Snow and Car accidents", 
     xlab = "Inches of Snow", 
     ylab = "Number of Car Accidents")

Lastly, if we want to compare multiple variables to see how they differ, we can add additional lines to our plot.

# x = inches of snow in Ohio
# y = car accidents in Ohio
# z = inches of snow in Tennessee
# a = car accidents in Tennessee

z <- c(1,1,2,4,8)
a <- c(15,16,25,32,56)
  
plot(x, y, type = 'l', col = "red", lwd=4,
     main = "Relationship between Inches of Snow and Car accidents", 
     xlab = "Inches of Snow", 
     ylab = "Number of Car Accidents")
lines(z, a, type = 'l', col = "blue", lwd=4)

Since our new line goes above the limit of our y-axis, it is probably a good idea to increase the limits of that axis.

# ylim=c() specifies limits of y axis
# xlim=c() specifies limits of x axis

plot(x, y, type = 'l', col = "red", lwd=4,
     ylim = c(0,60),
     main = "Relationship between Inches of Snow and Car accidents", 
     xlab = "Inches of Snow", 
     ylab = "Number of Car Accidents")
lines(z, a, type = 'l', col = "blue", lwd=4)

One last step that would help make this graph more clear to an observer would be to add a legend. Legends are great tools to help clarify which set of data is being represented by each line.

# first decide where your legend should go 
# (e.g., "topleft", "bottomleft", "topright", "bottomright")

# then add in what you want your lines to be labeled

# lastly match the lines to the correct color

plot(x, y, type = 'l', col = "red", lwd=4,
     ylim = c(0,60),
     main = "Relationship between Inches of Snow and Car accidents", 
     xlab = "Inches of Snow", 
     ylab = "Number of Car Accidents")
lines(z, a, type = 'l', col = "blue", lwd=4)

# lwd is set to 4 here so the legend lines match the plotted lines
legend("topleft", legend=c("Tennessee", "Ohio"), 
       col=c("blue", "red"), lwd=c(4,4))

Congratulations, you just made a line graph in R!
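Connecting the points is not quite the same thing as drawing a line of best fit. If you want to predict accidents from snowfall, one common approach is to fit a straight line with lm() and draw it with abline(). The sketch below is one way to do that with the same x and y values. The object names fit and predicted are just labels chosen for this example, and the 6 inches of snow is a made-up value.

```r
# x = inches of snow, y = number of reported car accidents
x <- c(1, 2, 4, 5, 8)
y <- c(10, 11, 20, 23, 31)

# Fit a straight line that predicts y from x
fit <- lm(y ~ x)
coef(fit)  # intercept and slope of the fitted line

# Scatterplot of the data with the fitted line drawn on top
plot(x, y, main = "Relationship between Inches of Snow and Car accidents",
     xlab = "Inches of Snow",
     ylab = "Number of Car Accidents")
abline(fit, col = "red", lwd = 2)

# Predicted number of accidents for a hypothetical 6 inches of snow
predicted <- predict(fit, newdata = data.frame(x = 6))
predicted
```

Passing a fitted model to abline() draws the line using the model's intercept and slope, so you do not have to type them in yourself.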

bar plots - barplot() function

Now that we’ve used the plot() function to make a scatterplot and a line graph, we’ll go over how to use the barplot() function to make a bar graph.

# let's use x = inches of snow again
x <- c(1,2,4,5,8)

# now plug x into barplot()
barplot(x)

Awesome! Now that you have a super simple bar graph, let’s add some more information.

x <- c(1,2,4,5,8)
# for this example, let's assume x = inches of snow for each weekday
# because of this, we want our x-axis to be labeled Monday through Friday
# use the names.arg argument of barplot() to do this

barplot(x, main = "Daily Inches of Snow",
        ylab = "Snowfall (in Inches)", 
        xlab = "Days of the Week",
        names.arg=c("Mon","Tue","Wed","Thu","Fri"))

That looks pretty good, but if we want to change the color of the bars we can simply add col = to our code

barplot(x, main = "Daily Inches of Snow",
        ylab = "Snowfall (in Inches)", 
        xlab = "Days of the Week",
        names.arg=c("Mon","Tue","Wed","Thu","Fri"), 
        col="lightblue")

That looks like a pretty solid bar graph!

histograms - hist() function

Last, but not least, we’ll walk through how to create a histogram.

# let's work with inches of snow again
# x = inches of snow 

x <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8)

#now, plug x into the hist() function

hist(x)

Now we can add some labels.

hist(x, main = "Histogram of Snowfall (in Inches)", 
     xlab = "Snowfall (in Inches)", 
     col = "lightblue")

That looks like a pretty solid histogram, but let’s go ahead and alter the y-axis limits.

hist(x, main = "Histogram of Snowfall (in Inches)", 
     xlab = "Snowfall (in Inches)", 
     col = "lightblue", 
     ylim = c(0,10))
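By default, hist() chooses the bin edges for you. With whole-number data like this, it can help to set the edges yourself using the breaks argument so that each bar covers exactly one value. This is a sketch of that idea, where the object name h is just a label for the information that hist() returns.

```r
# x = inches of snow
x <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8)

# Put the bin edges halfway between whole numbers (0.5, 1.5, ..., 8.5)
# so each bar counts exactly one snowfall value
h <- hist(x, breaks = seq(0.5, 8.5, by = 1),
          main = "Histogram of Snowfall (in Inches)",
          xlab = "Snowfall (in Inches)",
          col = "lightblue")

# hist() also returns the counts it plotted, one per bar
h$counts
## [1] 4 3 3 3 2 1 1 1
```

You can also give breaks a single number, such as breaks = 30, to suggest roughly how many bars you would like.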

There you go, you’ve now made a line graph, bar graph, and a histogram :)

The Central Limit Theorem

The Central Limit Theorem states that even when the original population is not normally distributed, the distribution of sample means will be approximately normal and centered on the true population mean, as long as the samples are large enough. This means that we can conduct statistical inference on the distribution of means since it is approximately normal (think Z-scores).

Let’s see how this works - here is a population that is definitely not normally distributed:

set.seed(614)
x <- rchisq(100000,5)
hist(x, main = "Non-Normal Distribution (with mean line)")
abline(v=mean(x), col = "red")

Even with the red line, it would be difficult to make inferences (like determining percentiles) using this distribution because it is so skewed to the right. Why don’t we take random samples of the same size (I chose a sample size of 7) from this distribution and see if the means of those samples tell us anything?

set.seed(614)
# take 10 random samples of size 7 and store the mean of each one
# (replicate() repeats the same code 10 times, one after another)
sample_means <- replicate(10, mean(sample(x,7)))

sample_means
##  [1] 3.014966 4.636534 4.333393 5.914719 4.825164 5.700166 6.277536 4.466371
##  [9] 6.212693 5.124813

Here are the means of 10 samples, each randomly selected with a sample size of 7. It looks like the means of these samples lie between about 3 and 7, and most of them fall between 4 and 6. What would happen if we graphed the means of a bunch of samples from this distribution?

set.seed(614)
clm <- numeric(10000)  # empty vector to hold the 10000 sample means
for (i in 1:10000) {
  clm[i] <- mean(sample(x,7))
}
hist(clm, main = "Distribution of 10000 Means of Sample Size 7", xlab = "Means", 
     breaks = 30)
abline(v=mean(x), col = "red")

Now this is a pretty normal-looking distribution. But if you have a keen eye you may notice that there is still a bit of a right skew (this is because the original distribution was right skewed). Should we increase the number of samples even more? We’re at 10,000 samples right now, so that may not be the best idea. Maybe we should try increasing the sample size. Let’s try one last distribution:

set.seed(614)
clm <- numeric(10000)  # empty vector to hold the 10000 sample means
for (i in 1:10000) {
  clm[i] <- mean(sample(x,15))
}
hist(clm, main = "Distribution of 10000 Means of Sample Size 15", xlab = "Means",
     breaks = 30)
abline(v=mean(x), col = "red")

Wow. This distribution is not only very close to normal, but it also has less variability around the true population mean of 5. With this type of distribution, we could easily make inferences about our data - all we need is the mean of the group. For example, say the original (non-normal) distribution was the distribution of IQ scores for all American college students. Your PSYCH 2220 professor could take the mean of 15 of their students and compare it to this distribution of means (using Z-scores and such) to determine whether their students have a significantly high average!
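Here is a rough sketch of what that comparison could look like in code. It assumes the usual formula for the standard deviation of a distribution of means (the population standard deviation divided by the square root of the sample size). The group mean of 6.5 is a made-up value, and the object names are just labels for this example.

```r
# Recreate the skewed population from above
set.seed(614)
x <- rchisq(100000, 5)

n  <- 15                  # size of each sample
mu <- mean(x)             # population mean
se <- sd(x) / sqrt(n)     # standard deviation of the distribution of means

# Check the formula against simulated sample means
sim_means <- replicate(10000, mean(sample(x, n)))
sd(sim_means)             # should be close to se
se

# Z-score for a hypothetical group of 15 with a mean of 6.5
group_mean <- 6.5
z <- (group_mean - mu) / se
z

# Proportion of sample means expected to fall below this group's mean,
# treating the distribution of means as normal
pnorm(z)
```

A Z-score well above 0 with a proportion close to 1 would suggest that the group's mean is unusually high compared with most samples of the same size.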

Glossary

The purpose of some of the code used in the previous sections may not be clear, so here is a glossary of selected code and what each piece does.

lwd: This stands for line width. Set it equal to a number to change the width of a line in a plot. Higher numbers make thicker lines!

names.arg: This is an argument of the barplot() function. Give it a vector of names, one for each bar, and they will be printed below the bars as labels.

set.seed: When you generate random values in R, every time you run that code you will get different results (since the values generated are random). To get the same numbers each time, call this function with any number before generating them. Then any person who runs the same code with the same “seed” will get the same results!

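abline: This adds a straight line to an existing plot. Set “a” equal to the intercept and “b” equal to the slope of the line you would like to add. To draw a vertical line at x = “value”, use v = value instead (as in the examples above). To draw a horizontal line at y = “value”, use h = value.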

rchisq: This generates random values from the chi-squared distribution. The first argument is the number of random values to be generated and the second argument is the degrees of freedom for the distribution.
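To see a couple of these glossary entries in action, here is a short sketch. The seed value 42 and the object names first and second are arbitrary choices for this example.

```r
# set.seed: the same seed gives the same "random" numbers
set.seed(42)
first <- rnorm(3)
set.seed(42)
second <- rnorm(3)
identical(first, second)
## [1] TRUE

# abline and lwd: three different ways to add a line to a plot
plot(1:10, 1:10, main = "Examples of abline()")
abline(a = 0, b = 1, col = "red", lwd = 2)  # intercept 0 and slope 1
abline(v = 5, col = "blue")                 # vertical line at x = 5
abline(h = 5, col = "darkgreen")            # horizontal line at y = 5
```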