R is a powerful, free tool for statistical analysis and graphics. It is driven by typed commands rather than menus. R is known for its professional-looking graphics, which allow for complete customization. It is written and maintained by a group of volunteers known as the R Core Team. The base software is supplemented by many add-on packages developed by R users around the world. These packages cover a wide range of topics in statistics and are hosted on the Comprehensive R Archive Network (CRAN).
Anyone can contribute add-on packages, which are checked for quality before being added to the collection.
For Windows:
- Click on “Download R for Windows” from the CRAN website.
- Click on “base” to download the current version of R.
- Save the file, double-click it, and follow the on-screen instructions.
For Mac:
- Click on “Download for (Mac) OS X” from the CRAN website.
- Download the latest .pkg file.
- Double-click on the file and follow the on-screen instructions.
RStudio is an integrated environment with four panes:
1. Source window
2. Console
3. Environment/History pane
4. Window for displaying plots, files, packages, help, and viewer.
You can download RStudio from RStudio’s website. Install R first, as RStudio depends on it.
To create a new project folder in RStudio:
1. Go to File > New Project...
2. Select New Directory > New Project.
3. Choose or create a directory, enter the project name, and click “Create Project”.
Commands can be typed in either the source window or the console. For example:
# Example of a basic arithmetic command
15 + 2 - 8
## [1] 9
A package is a collection of functions and data that implement specific tasks. To install a package in RStudio, use the following command:
install.packages("package_name")
To use the package, use the following command:
library(package_name)
For example, to install the BSDA (Basic Statistics and Data Analysis) package, we type
install.packages("BSDA")
wait for R to finish installing the package, and then type the command
library(BSDA)
to load the package.
To find the current working directory:
getwd()
To change the working directory:
setwd("C:/desired/file/path")
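As a minimal sketch of how these two commands work together, the snippet below saves the current working directory, switches to another folder, and then switches back. The folder used here is R's temporary directory, tempdir(), which is only a stand-in for a real project path; the variable names old_dir and new_dir are illustrative.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# Switch to a different folder; tempdir() is only a stand-in for a real path
setwd(tempdir())
new_dir <- getwd()
print(new_dir)

# Return to the original working directory
setwd(old_dir)
print(getwd())
```

Saving the original directory first makes it easy to undo the change, which is helpful when a script should not leave the session pointing at a different folder.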
To import Excel data into R, save the data as a comma-separated values (.csv) or tab-delimited text (.txt) file. To import the .csv file into R, use the read.csv( ) command. Normally, we would type the path to the file inside the parentheses. A more convenient option is the file.choose( ) command: typing it inside the parentheses opens a window in which we can browse to the file on our computer and click on it. Thus, we type the command
read.csv(file.choose( ))
to import the .csv file to R. To import a .txt file into R, we use the command
read.delim(file.choose( ))
Once data have been created in R, we can export them in several file formats, one of which is comma-separated values (.csv). To do so, we use the R command
write.csv(data_name, file = "file_path.csv")
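The export and import commands can be tried together without choosing a file by hand. The sketch below writes a small data frame to a temporary .csv file and reads it back. The data frame scores, its values, and the use of tempfile() are assumptions made for illustration; in practice you would supply your own data and file path. Note that the result of read.csv( ) is stored in a variable so the imported data can be used later.

```r
# A small data frame to export (illustrative values only)
scores <- data.frame(name = c("A", "B", "C"), score = c(90, 85, 77))

# tempfile() gives a throwaway path; replace it with a real path in practice
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE keeps R's row numbers out of the file
write.csv(scores, file = csv_path, row.names = FALSE)

# Read the file back in and store the result in a variable
scores_back <- read.csv(csv_path)
print(scores_back)
```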
To get more information on any specific named function, for example mean, the command is help(mean). Alternatively, we can type a question mark in front of the word: ?mean
We can assign a value to a variable by combining the symbols < and - to form <-, like an arrow pointing to the variable. For example, we can assign the number 10 to the variable x by typing
x <- 10
We can add, subtract, multiply, divide, and take exponents of numbers or expressions using the +, -, *, /, and ^ signs respectively. For example:
3*4 + 5^2 -6/2
## [1] 34
A vector is an ordered collection of values of the same type, such as numbers or characters. It is equivalent to a column in an Excel worksheet. We can create a vector using the c( ) command, entering the terms of the vector inside the parentheses, separated by commas.
vector1 <- c(3, 5, 1, 8, 2)
vector2 <- c("January", "February", "March", "April", "May")
vector1
## [1] 3 5 1 8 2
vector2
## [1] "January" "February" "March" "April" "May"
Notice that for vector2, each term is in quotation marks since they are characters.
A data frame is a special kind of list that is two-dimensional (rows and columns) and is usually used to store data read from an Excel or .csv file. It is equivalent to an Excel worksheet. A data frame can store columns of different types. We use the following syntax to declare a data frame:
Variable <- data.frame (vector_1, vector_2, … vector_n )
For example, if we have the vectors
Student_Name <- c('James', 'Jane', 'John', 'Steve', 'Angela')
Student_Age <- c(18, 21, 19, 15, 20)
Student_GPA <- c(3.8, 3.5, 3.7, 3.9, 4.0)
we can create a data frame that combines these vectors as follows:
Student_Data <- data.frame(Student_Name, Student_Age, Student_GPA)
Student_Data
## Student_Name Student_Age Student_GPA
## 1 James 18 3.8
## 2 Jane 21 3.5
## 3 John 19 3.7
## 4 Steve 15 3.9
## 5 Angela 20 4.0
If the data frame has no column titles, we can add a title for each column either by using the names( ) function or by typing the names when assigning the vectors to the data frame.
For example, suppose that in the above example our vectors are vector1, vector2, and vector3, and we want to change the names to “Student Name”, “Student Age”, and “Student GPA” respectively. We first create the data frame:
vector1 <- c("James", "Jane", "John", "Steve", "Angela")
vector2 <- c(18, 21, 19, 15, 20)
vector3 <- c(3.8, 3.5, 3.7, 3.9, 4.0)
Student_Data2 <- data.frame(vector1, vector2, vector3)
Student_Data2
## vector1 vector2 vector3
## 1 James 18 3.8
## 2 Jane 21 3.5
## 3 John 19 3.7
## 4 Steve 15 3.9
## 5 Angela 20 4.0
We can then use the names( ) command to obtain our desired title names as follows:
names(Student_Data2) <- c("Student Name", "Student Age", "Student GPA")
Student_Data2
## Student Name Student Age Student GPA
## 1 James 18 3.8
## 2 Jane 21 3.5
## 3 John 19 3.7
## 4 Steve 15 3.9
## 5 Angela 20 4.0
Alternatively, we can simply type the names within the parentheses when assigning the vectors to the data frame as follows:
Student_Data3 <- data.frame("Student_Name"=c("James", "Jane", "John", "Steve", "Angela"), "Student_Age"=c(18, 21, 19, 15, 20), "Student_GPA"=c(3.8, 3.5, 3.7, 3.9, 4.0))
Student_Data3
## Student_Name Student_Age Student_GPA
## 1 James 18 3.8
## 2 Jane 21 3.5
## 3 John 19 3.7
## 4 Steve 15 3.9
## 5 Angela 20 4.0
To select a particular column in a data frame, attach the symbol $ to the data frame name, followed by the name of the column. In RStudio, once you type the $ symbol next to the name of the data frame, all column names appear in a drop-down menu, so you can simply click on the name of the desired column. For example, typing the command Student_Data3$Student_GPA in the example above selects the Student_GPA column.
If you want to reference the columns of a data frame directly by name, without typing the data frame name and the $ symbol each time, you can attach the data frame using the command attach(name_of_dataframe). You can detach the data frame when you are done by using the command detach(name_of_dataframe).
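To make column selection concrete, here is a short sketch using the student data from above. It selects a column with $ and also shows with( ), a base R function that is not covered in the text but gives the same by-name access as attach( ) for a single expression, without changing the search path. The variable names gpa_mean and age_mean are illustrative.

```r
Student_Data3 <- data.frame(
  Student_Name = c("James", "Jane", "John", "Steve", "Angela"),
  Student_Age  = c(18, 21, 19, 15, 20),
  Student_GPA  = c(3.8, 3.5, 3.7, 3.9, 4.0)
)

# Select one column with $ and summarize it
gpa_mean <- mean(Student_Data3$Student_GPA)

# with() evaluates an expression using the data frame's columns by name,
# without attaching anything to the search path
age_mean <- with(Student_Data3, mean(Student_Age))

print(gpa_mean)  # 3.78
print(age_mean)  # 18.6
```

Because with( ) only affects the one expression it wraps, there is nothing to detach afterwards.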
The types of data in R are:
• Numeric, which contains all kinds of numbers
• Logical, which consists of logical values (TRUE/FALSE)
• Character, which consists of text
We can check the data type of a vector by using a function of the form is.<type>( ), such as is.numeric( ), is.logical( ), or is.character( ).
For example, we can check if the Student_Name vector is a numeric data type by typing
is.numeric(Student_Name)
## [1] FALSE
The result “FALSE” means that Student_Name is not numeric. In contrast, the command
is.character(Student_Name)
## [1] TRUE
returns the value “TRUE”, which means Student_Name is a character vector.
We can convert vectors to different data types using functions of the form as.<type>( ), such as as.numeric( ), as.character( ), and as.logical( ). For example, converting the vector (0, 1, 0, 3, 7, 0, 8) to logical form yields the following:
as.logical(c(0, 1, 0, 3, 7, 0, 8))
## [1] FALSE TRUE FALSE TRUE TRUE FALSE TRUE
Notice that all zeros are converted to the logical value FALSE, and all other values are converted to TRUE.
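The same pattern works for other conversions. The sketch below, using a made-up vector ages_text, converts numbers stored as text into numeric values so that arithmetic can be done on them, and uses class( ) to report the type before and after.

```r
# Digits stored as text, as often happens with imported data (illustrative)
ages_text <- c("18", "21", "19")
print(class(ages_text))             # "character"

# Convert the text to numbers so arithmetic is possible
ages_num <- as.numeric(ages_text)
print(class(ages_num))              # "numeric"
print(sum(ages_num))                # 58

# Convert logical values to text
flags_text <- as.character(c(TRUE, FALSE))
print(flags_text)                   # "TRUE" "FALSE"
```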
R comes with several built-in data sets, which are commonly used as practice data by beginning R learners. Examples include mtcars, iris, ToothGrowth, and USArrests. To see a complete list of all built-in data sets in R, type the command data( ).
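A quick way to get familiar with a built-in data set is to look at its first rows, its size, and its column types. The following sketch uses the base R functions head( ), dim( ), names( ), and str( ), which are not introduced in the text above; the variable name mtcars_dim is illustrative.

```r
# First three rows of the mtcars data set
print(head(mtcars, 3))

# Number of rows and columns
mtcars_dim <- dim(mtcars)
print(mtcars_dim)

# Column names of the iris data set
print(names(iris))

# Compact overview of the columns and their types
str(ToothGrowth)
```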
Frequency tables summarize categorical data by displaying the number of observations belonging to each category. Frequency tables for two categorical variables summarize the relationships between those variables by showing the number of observations that fall into each combination of categories. The R command for creating a frequency table from a single variable is table( ).
For example, we can obtain the frequency table for the Species column in the built-in iris data frame as follows:
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
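The table( ) command also accepts two variables, which gives the two-way frequency table described above. As a sketch, the example below cross-tabulates the number of cylinders and the number of gears in the built-in mtcars data. The choice of these two columns and the name cyl_by_gear are illustrative; the dnn argument only supplies labels for the rows and columns.

```r
# Cross-tabulate cylinders (rows) against gears (columns)
cyl_by_gear <- table(mtcars$cyl, mtcars$gear, dnn = c("cyl", "gear"))
print(cyl_by_gear)

# The cell counts add up to the number of cars in the data set
print(sum(cyl_by_gear))
```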
A two-way table of counts can also be entered directly as a matrix. We can input the table as follows:
Two_Way_Table <- matrix(c(1,3,2,5, 4, 2, 9, 7, 8), nrow=3, byrow=TRUE)
Two_Way_Table
## [,1] [,2] [,3]
## [1,] 1 3 2
## [2,] 5 4 2
## [3,] 9 7 8
We can transpose a matrix using the t( ) command, as shown below:
t(Two_Way_Table)
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 3 4 7
## [3,] 2 2 8
The R command prop.table( ) displays the table with each cell expressed as a proportion of the total count as shown below:
prop.table(Two_Way_Table)
## [,1] [,2] [,3]
## [1,] 0.02439024 0.07317073 0.04878049
## [2,] 0.12195122 0.09756098 0.04878049
## [3,] 0.21951220 0.17073171 0.19512195
We can display the above table as percentages by multiplying by 100 and rounding the result with the round( ) command as follows:
round(prop.table(Two_Way_Table)*100)
## [,1] [,2] [,3]
## [1,] 2 7 5
## [2,] 12 10 5
## [3,] 22 17 20
We can use the addmargins( ) command to display the table with row and column totals as shown below:
addmargins(Two_Way_Table)
## Sum
## 1 3 2 6
## 5 4 2 11
## 9 7 8 24
## Sum 15 14 12 41
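Proportions can also be computed within each row or each column rather than over the whole table. The sketch below uses the margin argument of prop.table( ), which is not discussed above: margin = 1 divides each cell by its row total and margin = 2 divides by its column total. The names row_props and col_props are illustrative.

```r
Two_Way_Table <- matrix(c(1, 3, 2, 5, 4, 2, 9, 7, 8), nrow = 3, byrow = TRUE)

# margin = 1: each cell divided by its row total, so every row sums to 1
row_props <- prop.table(Two_Way_Table, margin = 1)

# margin = 2: each cell divided by its column total, so every column sums to 1
col_props <- prop.table(Two_Way_Table, margin = 2)

print(round(row_props, 3))
print(round(col_props, 3))
```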
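A frequency distribution shows how data are partitioned among several categories by listing the categories along with the frequency of data values associated with each of them. To create a frequency distribution, we first use either the summary( ) command or the range( ) command to identify the minimum and maximum values. We then define the break points using the command
breaks <- seq(min, max, by = interval_size)
where min and max are chosen so that the breaks cover the whole range of the data. Next, we use the function cut(Data, breaks, right = FALSE) to assign each value to a bin. Setting right = FALSE makes each bin include its left endpoint and exclude its right endpoint, as in [50,60). Finally, table( ) counts the values in each bin.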
Lead_and_IQ <- c(61, 82, 70, 72, 72, 95, 89, 57, 116, 95, 82, 116, 99, 74, 100,72, 126, 80, 86, 94, 100, 72, 63, 101, 85, 85, 124, 105, 81, 87)
summary(Lead_and_IQ)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.50 85.50 88.03 99.75 126.00
breaks <- seq(50, 130, by = 10)
IQ.cut <- cut(Lead_and_IQ, breaks, right= FALSE)
IQ.freq <- table(IQ.cut)
cbind(IQ.freq)
## IQ.freq
## [50,60) 1
## [60,70) 2
## [70,80) 6
## [80,90) 9
## [90,100) 4
## [100,110) 4
## [110,120) 2
## [120,130) 2
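The counts above can be extended to relative and cumulative frequencies with two more base R functions. This is a sketch that goes slightly beyond the text: prop.table( ) turns the counts into proportions and cumsum( ) gives running totals. The names IQ.relfreq and IQ.cumfreq are illustrative.

```r
Lead_and_IQ <- c(61, 82, 70, 72, 72, 95, 89, 57, 116, 95, 82, 116, 99, 74, 100,
                 72, 126, 80, 86, 94, 100, 72, 63, 101, 85, 85, 124, 105, 81, 87)

# Same bins as in the frequency distribution: [50,60), [60,70), ...
breaks <- seq(50, 130, by = 10)
IQ.freq <- table(cut(Lead_and_IQ, breaks, right = FALSE))

# Proportion of observations in each bin
IQ.relfreq <- prop.table(IQ.freq)

# Running total of observations up to and including each bin
IQ.cumfreq <- cumsum(IQ.freq)

print(cbind(Frequency = IQ.freq,
            Relative = round(IQ.relfreq, 3),
            Cumulative = IQ.cumfreq))
```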
Data visualization is an important part of statistics. Graphs help us to spot trends and relationships that could easily be missed just by looking at the raw data. This chapter deals with how to explore data with tables and graphs. One of the hardest parts of an analysis is producing quality supporting graphics. Conversely, a good graph is one of the best ways to present findings. Fortunately, R provides excellent graphing capabilities, both in the base installation and with add-on packages such as lattice and ggplot2. We will briefly present some simple graphs using base graphics and then show their counterparts in ggplot2.
Here are some R commands for creating common graphs in Base R
The base R graphics functions are a good starting point for beginners. If you want more flexibility for customization, use ggplot2. The “gg” stands for “grammar of graphics”, a set of rules for combining graphical components to produce graphs. To use ggplot2, we first have to install it. The easiest way is to install the whole tidyverse, which contains ggplot2 and other packages, using the command install.packages("tidyverse"). Alternatively, we can install only the ggplot2 package using the command install.packages("ggplot2"). A plot in ggplot2 can be divided into 3 parts: Data + Aesthetics + Geometry
Data: a dataframe
Aesthetics: used to indicate the x and y variables. It can also be used to control the color, size, and shape of points, among other properties. It appears in the R command as aes( )
Geometry: corresponds to the type of graphic (histogram, box plot, line plot, etc.). It appears in the R command as a function whose name begins with geom_
In Base R:
# Use the 'mtcars' dataset, which is built into R
data <- mtcars$mpg # Extract the 'mpg' (miles per gallon) column
# Create a histogram
hist(data,
     main = "Histogram of Miles Per Gallon (mpg)",
     xlab = "Miles Per Gallon",
     ylab = "Frequency",
     col = "lightblue", # Color of the bars
     border = "darkblue") # Color of the bar borders
In ggplot2:
# Load the ggplot2 package
library(ggplot2)
# Create a histogram
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2,
                 fill = "lightblue",
                 color = "darkblue") +
  labs(
    title = "Histogram of Miles Per Gallon (mpg)",
    x = "Miles Per Gallon",
    y = "Frequency"
  ) +
  theme_minimal()
Create a dot plot for ‘mpg’ (miles per gallon)
p <- ggplot(mtcars, aes(mpg)) +
  geom_dotplot(binwidth = 1.5, fill = "blue") +
  scale_y_continuous(NULL, breaks = NULL) + # Remove y-axis labels and ticks
  xlab("Miles Per Gallon (mpg)") + # Label for x-axis
  theme_minimal() # Minimalistic theme
# Display the plot
print(p)
Create a stem-and-leaf plot
Each “stem” represents a range of values, and the “leaves” represent individual data points within that range. The default behavior splits data into two parts: The leading digit(s) form the “stem.” The trailing digit(s) form the “leaf.”
# Use the built-in mtcars dataset
data <- mtcars$mpg # Extract the 'mpg' (miles per gallon) column
# Create a stem-and-leaf plot
stem(data)
##
## The decimal point is at the |
##
## 10 | 44
## 12 | 3
## 14 | 3702258
## 16 | 438
## 18 | 17227
## 20 | 00445
## 22 | 88
## 24 | 4
## 26 | 03
## 28 |
## 30 | 44
## 32 | 49
Create a Pie Chart
Freq <- c(640, 195, 95, 63, 111)
pie(Freq, labels = c("Pilot Error", "Mechanical", "Sabotage", "Weather", "Other"),
    main = "Pie Chart of Causes of Plane Crashes")
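Pie charts are often easier to read when each slice shows its percentage. As a sketch, the code below computes the percentage for each cause from the frequencies above and pastes it into the slice labels. The names causes, pct, and pie_labels are illustrative, and rounding to one decimal place is an arbitrary choice.

```r
Freq <- c(640, 195, 95, 63, 111)
causes <- c("Pilot Error", "Mechanical", "Sabotage", "Weather", "Other")

# Percentage of all crashes attributed to each cause
pct <- round(100 * Freq / sum(Freq), 1)

# Build labels such as "Weather (5.7%)"
pie_labels <- paste0(causes, " (", pct, "%)")
print(pie_labels)

pie(Freq, labels = pie_labels,
    main = "Pie Chart of Causes of Plane Crashes")
```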
This section provides a brief introduction to the concepts of scatterplots, correlation, correlation coefficient, and linear regression.
Scatterplot and Correlation
A correlation is said to exist between two variables when the values of one variable are somehow associated with the values of the other variable. A linear correlation exists between two variables when there is a relationship and the plotted points of paired data result in a pattern that can be approximated by a straight line. A scatterplot is a plot of paired (x, y) quantitative data with a horizontal x-axis and a vertical y-axis. The horizontal axis is used for the first variable (x), and the vertical axis is used for the second variable (y).
The R command for the scatterplot of paired (x, y) quantitative data is plot(x, y).
Listed below are the overhead widths (cm) of seals measured from aerial photographs and the weights (kg) of these same seals (based on “Mass Estimation of Weddell Seals Using Techniques of Photogrammetry,” by R. Garrott of Montana State University). The purpose of the study was to determine if weights of seals could be determined from overhead photographs.
Scatterplot Using Base R:
Overhead_Width <- c(7.2, 7.4, 9.8, 9.4, 8.8, 8.4)
Weight <- c(116, 154, 245, 202, 200, 191)
plot(Overhead_Width, Weight, xlab = "Overhead Width (cm)", ylab = "Weight (kg)")
Scatterplot Using ggplot2:
Overhead_Width <- c(7.2, 7.4, 9.8, 9.4, 8.8, 8.4)
Weight <- c(116, 154, 245, 202, 200, 191)
Data1 <- data.frame(Overhead_Width, Weight)
ggplot(Data1, aes(x=Overhead_Width, y=Weight)) + geom_point() +
  labs(x = "Overhead Width (cm)", y = "Weight (kg)") +
theme_light()
Linear Correlation Coefficient r
The linear correlation coefficient is denoted by r, and it measures the strength of the linear association between two variables. The R command for the linear correlation between x and y is cor(x, y). The R command for testing the significance of the Pearson correlation coefficient between x and y is cor.test(x, y).
Consider the “Foot and Height” data.
The scatterplot and the linear correlation coefficient are shown below.
Shoe_Print_Length <- c(29.7, 29.7, 31.4, 31.8, 27.6)
Height <- c(175.3, 177.8, 185.4, 175.3, 172.7)
plot(Shoe_Print_Length, Height, xlab = "Shoe Print Length (cm)", ylab = "Height (cm)")
cor.test(Shoe_Print_Length, Height)
##
## Pearson's product-moment correlation
##
## data: Shoe_Print_Length and Height
## t = 1.2699, df = 3, p-value = 0.2937
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6083443 0.9683757
## sample estimates:
## cor
## 0.5912691
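The individual numbers in this output can also be pulled out and used in later calculations. The sketch below stores the result of cor.test( ) and extracts the correlation and the p-value from it. The 0.05 significance level and the variable names r, result, r_from_test, and p_value are assumptions made for illustration.

```r
Shoe_Print_Length <- c(29.7, 29.7, 31.4, 31.8, 27.6)
Height <- c(175.3, 177.8, 185.4, 175.3, 172.7)

# Correlation coefficient on its own
r <- cor(Shoe_Print_Length, Height)

# Store the full test result, then extract pieces from it
result <- cor.test(Shoe_Print_Length, Height)
r_from_test <- unname(result$estimate)  # same value as cor(x, y)
p_value <- result$p.value

print(r)
print(p_value)

# Is the correlation significant at the (assumed) 0.05 level?
print(p_value < 0.05)
```

Here the p-value is larger than 0.05, so with only five pairs of observations the data do not provide sufficient evidence of a linear correlation.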
Given a collection of paired data, the regression line is the straight line that best fits the scatterplot of the data. The regression equation is \[ \hat{y} = b_0 + b_1x \] where \(b_0\) is the y-intercept of the line and \(b_1\) is the slope of the line. This equation algebraically describes the regression line. The R commands for obtaining the least squares regression equation are model <- lm(y ~ x, data = name_of_dataframe) followed by summary(model). The R command for adding the least squares regression line to an existing scatterplot is abline(model), where model is as defined earlier.
# Load the built-in mtcars dataset
data <- mtcars
# Fit a linear regression model: mpg (miles per gallon) predicted by hp (horsepower)
model <- lm(mpg ~ hp, data = data)
# Summarize the model
summary(model)
##
## Call:
## lm(formula = mpg ~ hp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
# Plot the data and regression line
plot(data$hp, data$mpg,
     xlab = "Horsepower (hp)",
     ylab = "Miles Per Gallon (mpg)",
     main = "Regression of MPG on Horsepower")
abline(model, col = "red") # Add the regression line
Regression Equation: The least squares regression equation has the form
\[ \hat{y} = b_0 + b_1x \]
where:
- \(b_0\) is the intercept.
- \(b_1\) is the slope.
In our case, with \(x\) the horsepower (hp) and \(\hat{y}\) the predicted miles per gallon (mpg), rounding the estimates in the summary gives:
\[ \hat{y} = 30.1 - 0.0682x \]
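Once the model is fitted, the regression equation can be used to predict mpg for a given horsepower. The sketch below does this in two ways: by hand from the coefficients, and with the predict( ) function. The value of 100 hp is an arbitrary illustration, and the names b0, b1, by_hand, and by_predict are hypothetical.

```r
# Fit the same model as above: mpg predicted by hp
model <- lm(mpg ~ hp, data = mtcars)

# Extract the intercept and slope from the fitted model
b0 <- unname(coef(model)[1])
b1 <- unname(coef(model)[2])

# Prediction by hand for a car with 100 hp (illustrative value)
by_hand <- b0 + b1 * 100

# The same prediction using predict(); the column name must match the model
by_predict <- unname(predict(model, newdata = data.frame(hp = 100)))

print(by_hand)
print(by_predict)
```

Both approaches give the same answer, roughly 23.3 mpg. Predictions are most trustworthy for horsepower values within the range observed in the data.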