


Dee Chiluiza, PhD
Northeastern University
Introduction to data analysis using R, R Studio and R Markdown
Short manual series: Vectors and Matrix


1 Libraries used in this document

# Libraries used in this document
library(tidyverse) 
library(gridExtra)   # For grid.arrange()
library(grid)        # For grid tables
library(DT)          # For datatables
library(knitr) 
library(modeest)

# Data sets

data("faithful")


2 Working with Vectors


A vector is the most basic data structure you can create and use in R. Vectors are simple and easy to create and manipulate, and at the same time they are critical elements of your data analysis processes. Some applications of vectors are:

  • To create a data set for analysis. For example, a vector for some people’s names, a vector for their salaries, and a vector for their number of pets.

  • To organize other R objects in order to create tables. For example, obtain descriptive statistics from a data set (means, medians, standard deviations, etc.) and present them using tables.

  • To describe parameters needed inside some codes. For example, x- and y-axis limits are defined by a minimum and a maximum value, so they are indicated with a vector containing two elements. The borders of a figure require four measures (bottom, left, top and right), so they are indicated with a vector containing four elements.

Imagine you have the following data: You are on a night out with 5 of your friends. You want to know how much money you all have to spend and the average amount of money each person contributes. The amounts of money each one has to spend are: $55, $45, $60, $70, $50, $65. At the end of the night you ask them how much money they had when they returned to their respective homes ($5, $8, $12, $25, $18, and $58, respectively).

For this data, you can easily identify three variables: names, money before you all spend it, and money at the end of the night.

Following their corresponding orders from left to right, each group of data is organized into vectors. Therefore, we will create a vector for the variable Name and two vectors for the variables Money.



3 How do you create a vector?

Before you start to work on this section, be sure to clean the environment.

Use code <MARK>ls()</MARK> to see a full list of the objects stored in the Environment tab of your R Studio (vectors, tables, data sets, etc.), then clean the environment using code <MARK>remove(list = ls())</MARK>. The keyboard key combination <MARK>CTRL+L</MARK> clears only the console, not the environment. Cleaning the environment is for learning purposes; you do not have to do it all the time.
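
The effect of these two codes can be sketched as follows. To avoid wiping your own workspace, this sketch assumes a separate scratch environment called demoEnv; that name and the objects stored in it are hypothetical and are not part of this manual.

```r
# demoEnv is a hypothetical scratch environment used only for this illustration
demoEnv = new.env()

# Store two objects inside it
assign("ages", c(22, 28, 15), envir = demoEnv)
assign("pets", c(1, 0, 3), envir = demoEnv)

# List the objects currently stored (names are returned in alphabetical order)
objectsBefore = ls(demoEnv)
objectsBefore

# Remove every listed object, then list again: nothing is left
remove(list = ls(demoEnv), envir = demoEnv)
objectsAfter = ls(demoEnv)
objectsAfter
```

Running remove(list = ls()) without the envir part does the same thing to your own Environment tab.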

Vectors are created using the code c(- , -). The values are placed inside the parentheses and separated by commas.

Notice that quotation marks are used for categorical variables, such as names. Also, notice that the values are inserted in an orderly manner. Be careful with that order, since each position corresponds to the value in the same position in adjacent vectors. For example:

A = c("M", "P", "Q")
B = c(22, 28, 15)

In this example, M is 22, P is 28, and Q is 15.
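
A minimal sketch of how that position matching can be checked, using the same two vectors. The object name valueForP is only an illustration.

```r
# Two parallel vectors: position 1 of A goes with position 1 of B, and so on
A = c("M", "P", "Q")
B = c(22, 28, 15)

class(A)    # "character": text values need quotation marks
class(B)    # "numeric"
length(A)   # 3

# Look up the value of B that sits in the same position as "P" in A
valueForP = B[A == "P"]
valueForP   # 28
```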

Important: it is always good practice to give a name to every vector you create, especially when you are learning to code in R. The name can be anything; it is up to you (but names cannot start with a number or contain empty spaces). Also notice that I like to use the equal symbol (=) instead of the commonly used arrow (<-); this is also a choice.
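
A short sketch of both points. The object names are made up for illustration, and make.names() is used here only to show how R itself would repair names that break the rules; it is not needed in your own analysis.

```r
# Both assignment styles create the same object
petsEqual = c(2, 0, 1)
petsArrow <- c(2, 0, 1)

sameObject = identical(petsEqual, petsArrow)
sameObject      # TRUE

# Names cannot start with a number or contain spaces.
# make.names() shows how R would fix such names.
repairedNames = make.names(c("2pets", "my pets", "myPets"))
repairedNames   # "X2pets" "my.pets" "myPets"
```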

Observe the vectors inside the R chunk below, and remember that the names you see are not codes; they are words I chose to create objects, and objects are used to store data. You can use any name of your preference.

name_of_object = code or vector used to create the object

friendsNames = c("Anne", "Mary", "Sri", "Ma", "Tom", "Rick")
moneyBefore = c(55, 45, 60, 70, 50, 65)
moneyAfter = c(5, 8, 12, 25, 18, 58)

After running the codes, observe that three objects now appear in your Environment tab.
Now, some basic calculations and vector manipulations are explained in the R Chunk below.

Notice the use of hashtags (#) to create annotations (non-coding text used to organize your codes or to enter additional information)

  • The total amount of money you all have (sum).

  • The total amount of money at the end of the night (sum).

  • How much money every person spent (subtraction).

# The total amount of money you all have (sum)
totalBefore = sum(moneyBefore)

# The total amount of money at the end of the night (sum)
totalAfter = sum(moneyAfter)

# How much money every person spent (subtraction)
eachSpent = moneyBefore - moneyAfter


The outcomes of those codes are presented here using inline R Codes.

Total before: 345

Total after: 126

Money spent per person: 50, 37, 48, 45, 32, 7
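
The subtraction works position by position, which is why the order of the values matters. A minimal sketch using the same amounts; the object totalSpent is an extra name added here for illustration.

```r
moneyBefore = c(55, 45, 60, 70, 50, 65)
moneyAfter  = c(5, 8, 12, 25, 18, 58)

# Element-wise subtraction: 55 - 5, 45 - 8, 60 - 12, and so on
eachSpent = moneyBefore - moneyAfter
eachSpent                            # 50 37 48 45 32 7

# The total spent by the group, obtained in two equivalent ways
totalSpent = sum(eachSpent)
totalSpent                           # 219
sum(moneyBefore) - sum(moneyAfter)   # 219
```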

Yeah, Rick was a little cheap in this outing. Let’s connect each person with their corresponding money amounts.

  • Use code <MARK>names()</MARK>: the vector with the values is entered inside the parentheses, and the vector with the names is entered after the equal sign.
names(moneyBefore) = friendsNames


After this, call again the vector moneyBefore and observe how each friend’s name is now connected to their corresponding money amounts.

moneyBefore
## Anne Mary  Sri   Ma  Tom Rick 
##   55   45   60   70   50   65
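
Once a vector has names, a value can also be requested by name instead of by position. A minimal sketch; the objects sriMoney and twoFriends are hypothetical names used only here.

```r
friendsNames = c("Anne", "Mary", "Sri", "Ma", "Tom", "Rick")
moneyBefore  = c(55, 45, 60, 70, 50, 65)
names(moneyBefore) = friendsNames

# Ask for one friend by name (the name goes inside quotation marks)
sriMoney = moneyBefore["Sri"]
sriMoney            # Sri 60

# Ask for several friends at once using a small vector of names
twoFriends = moneyBefore[c("Anne", "Rick")]
twoFriends          # Anne 55, Rick 65

# unname() drops the label and keeps only the number
unname(sriMoney)    # 60
```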


  • If you want to know the amount of money spent per friend, just repeat the same strategy, this time use the vector eachSpent.

  • We will improve the presentation of the table by using library knitr, code kable (I add all libraries in my first R chunk, at the beginning of the document).

  • Compare the table below with the outcome of the previous R Chunk.

names(eachSpent) = friendsNames

kable(eachSpent, 
      format = "html",
      table.attr = "style='width:40%;'")
        x
Anne   50
Mary   37
Sri    48
Ma     45
Tom    32
Rick    7



That was easy, right? Let’s see some other questions and vector manipulations:

  • What is the mean money per friend?
    Simply use code mean(moneyBefore)

  • Who spent more than $30.00?
    Use the greater-than symbol > to build a logical test: eachSpent > 30

  • Who spent the most money?
    Use code <MARK>max( )</MARK> on the object eachSpent.

  • Who spent the least money?
    Use code <MARK>min( )</MARK> on the object eachSpent.

Observe and enter the following codes on your practice file. Also, notice the use of the print(paste()) code combination to present an outcome accompanied by information text.

# What is the mean money left per friend at the end of the night?
print(paste("The mean money left was $", mean(moneyAfter)))
## [1] "The mean money left was $ 21"

I can remove the quotation marks if needed:

# What is the mean money left per friend at the end of the night?
print(paste("The mean money left was $", mean(moneyAfter)), quote = FALSE)
## [1] The mean money left was $ 21
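
The questions about who spent the most and the least can be sketched as follows. max() and min() return only the amounts; to recover the friend's name, this sketch assumes the helpers which.max() and which.min(), which this manual does not otherwise use. The object names topSpender, lowSpender and messageText are made up for illustration.

```r
friendsNames = c("Anne", "Mary", "Sri", "Ma", "Tom", "Rick")
moneyBefore  = c(55, 45, 60, 70, 50, 65)
moneyAfter   = c(5, 8, 12, 25, 18, 58)

eachSpent = moneyBefore - moneyAfter
names(eachSpent) = friendsNames

max(eachSpent)   # 50
min(eachSpent)   # 7

# which.max() and which.min() give the position of the largest and smallest value
topSpender = names(eachSpent)[which.max(eachSpent)]
lowSpender = names(eachSpent)[which.min(eachSpent)]
topSpender       # "Anne"
lowSpender       # "Rick"

# paste0() joins the pieces without adding spaces between them
messageText = paste0(topSpender, " spent the most: $", max(eachSpent))
print(messageText, quote = FALSE)
```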



4 Select specific observations

You can isolate the names of the friends that spent more than $30.00 by using the following code strategy:

1. Create an object containing the logical question

2. Create a new object to isolate the required observations. Notice the use of square brackets [ ]. The square brackets read more or less as follows: from the vector “eachSpent”, isolate the observations marked TRUE in the vector “morethan30”.

3. Call the new object to display it.

# 1. Create an object containing the logical question

morethan30 = eachSpent > 30

# 2. Create a new object to isolate required observations. 

friendsMore30 = eachSpent[morethan30]

# 3. Activate and present the new object

friendsMore30
## Anne Mary  Sri   Ma  Tom 
##   50   37   48   45   32


As you can see, only five friends are listed. Rick, who spent only 7 dollars, is not included in the list.
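
The same selection can also be written in a single step, and the logical test can be counted. A minimal sketch; the objects namesMore30 and howMany are hypothetical names, and the named vector is created directly so the chunk runs on its own.

```r
eachSpent = c(Anne = 50, Mary = 37, Sri = 48, Ma = 45, Tom = 32, Rick = 7)

# The logical test placed directly inside the square brackets
namesMore30 = names(eachSpent[eachSpent > 30])
namesMore30   # "Anne" "Mary" "Sri" "Ma" "Tom"

# TRUE counts as 1 and FALSE as 0, so sum() counts how many friends pass the test
howMany = sum(eachSpent > 30)
howMany       # 5
```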

5 Vectors used inside codes


During your data analysis projects, you will notice the presence of vectors almost everywhere. They are used, for example, to indicate a list of colors. The bar graphs below are in sequence; notice how the code col is used to add colors, first just one color, then three colors using a little vector. Don’t worry if you don’t understand bar plots yet; for now, focus on the col code. There is a link to bar plots in the home page.

par(mfrow=c(1,3))

dataBar = c(A=20, C=25, D=13)

barplot(dataBar)

barplot(dataBar, col = "blue")

barplot(dataBar, col = c("blue", "yellow", "pink"))


Vectors are also used to indicate the limits of the x- or y-axis. Continuing the example above, let’s use the code ylim to change the limits of the y-axis. In this case, the vector c( - , - ) contains only two elements, a minimum and a maximum value. Notice the examples below.

  • Bar plot 1 displays the default limits chosen by R.

  • Plot 2, limits extended to 40.

  • Plot 3, limits extended to 100.

par(mfrow=c(1,3))

barplot(dataBar, 
        col = c("blue", "yellow", "pink"))

barplot(dataBar, 
        ylim = c(0,40),
        col = c("blue", "yellow", "pink"))

barplot(dataBar, 
        ylim = c(0,100),
        col = c("blue", "yellow", "pink"))
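
The two-element vector for ylim does not have to be typed by hand; it can be built from the data. A minimal sketch, assuming you want the axis to end 20% above the tallest bar (the 1.2 factor and the name yLimits are arbitrary choices made for this illustration).

```r
dataBar = c(A = 20, C = 25, D = 13)

# Limits vector: minimum 0, maximum 20% above the tallest bar
yLimits = c(0, max(dataBar) * 1.2)
yLimits   # 0 30

barplot(dataBar,
        ylim = yLimits,
        col = c("blue", "yellow", "pink"))
```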


6 Matrix


A matrix is used to collect data elements arranged in a two-dimensional layout (rows and columns). Matrices are considered basic two-dimensional data structures.

Similar to data frames (another basic data structure), matrices are very common in the R language. You can use them to present data as tables. We will use the public data set faithful to obtain some basic descriptive statistics from its two variables. Now imagine that you have a list of values containing the mean, the median, the standard deviation, etc., and you want to present them in an orderly manner using a table.
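
Before computing statistics, it can help to take a quick look at the data set. A minimal sketch using base R codes only:

```r
data("faithful")

# Number of rows (observations) and columns (variables)
dim(faithful)       # 272 2

# Names of the two variables
names(faithful)     # "eruptions" "waiting"

# First three observations
head(faithful, 3)
```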

  • Start by obtaining and storing the descriptive statistics as objects.

  • Using vectors, as you learnt above, create objects to store those values. Picture the table you want to create: you can create one vector per row, or one vector per column (most appropriate for the values of actual variables). In this case, we will create one vector per row, and each row presents the statistics for one variable.

  • Pay attention to the sequence you use to enter the objects.

  • Create the matrix. First, I decide that I want to present the values in the same direction as they are stored in the vectors. Therefore, on code byrow = I select TRUE because the values must be organized by row.

  • And since I have only two groups to present, waiting and eruptions, I select only two rows in the nrow = code.

  • Observe the raw table presented using the name we used to create the matrix, faith_Table. Based on the table we visualized above, are the values correct?

# 1. Store information

mean_Wait   = mean(faithful$waiting)
sd_Wait     = sd(faithful$waiting)
median_Wait = median(faithful$waiting)

mean_Eruption   = mean(faithful$eruptions)
sd_Eruption     = sd(faithful$eruptions)
median_Eruption = median(faithful$eruptions)

# 2. Create the vectors.

wait_Vector     = c(mean_Wait, sd_Wait, median_Wait)
eruption_Vector = c(mean_Eruption, sd_Eruption, median_Eruption)

# 3. Create the matrix.

faith_Table = matrix(c(wait_Vector,eruption_Vector), nrow = 2, byrow = TRUE)

# 4. Present a raw table

faith_Table
##           [,1]      [,2] [,3]
## [1,] 70.897059 13.594974   76
## [2,]  3.487783  1.141371    4
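
The effect of byrow is easier to see with small numbers. A minimal sketch with the values 1 to 6; the object names are hypothetical and used only for this comparison.

```r
values = 1:6

# byrow = TRUE fills the first row completely, then the second row
byRowTable = matrix(values, nrow = 2, byrow = TRUE)
byRowTable
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

# byrow = FALSE (the default) fills the first column, then the next one
byColTable = matrix(values, nrow = 2, byrow = FALSE)
byColTable
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```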


Since the values contain too many decimals, we reduce them using code round(data, digits = 2). Here, data is the name of the object, and the number of digits can be changed.

Important: Do not reduce the number of decimals when you create objects, only and ONLY when you are presenting the data.

round(faith_Table, 2)
##       [,1]  [,2] [,3]
## [1,] 70.90 13.59   76
## [2,]  3.49  1.14    4
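
The reason for rounding only at presentation time can be sketched as follows: round() returns a new, shortened value and leaves the stored object untouched, so later calculations still use full precision. The object names meanWait and shownValue are hypothetical.

```r
data("faithful")

# Stored with full precision
meanWait = mean(faithful$waiting)

# Rounded copy, used only for display
shownValue = round(meanWait, digits = 2)
shownValue               # 70.9

# The stored object still holds the unrounded value
meanWait == shownValue   # FALSE
```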



7 Columns and row names

Using the same strategy you learnt above, let’s add names to the columns and rows.

  • Use vectors to create objects to store the names you want to use.

  • Then use codes colnames() and rownames() in the same way you used names() above.

  • Notice that on kable() you don’t need to use code round() since kable() has the option digits.

# 1. Create vectors for column and row names

col_names = c("Mean", "StDev", "Median")
row_names = c("Waiting", "Eruption")

# 2. Apply names to columns and rows

colnames(faith_Table) = col_names
rownames(faith_Table) = row_names

# 3. Present table using a nice code such as kable()

knitr::kable(faith_Table,
             align = "c", 
             digits = 2,
             format = "html",
             table.attr = "style='width:40%;'")
            Mean    StDev   Median
Waiting     70.90   13.59   76
Eruption    3.49    1.14    4
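
Row and column names are not only for presentation; they can also be used inside square brackets to pick values from the matrix. A minimal sketch that rebuilds a table like the one above under the hypothetical name statsTable so it runs on its own:

```r
data("faithful")

statsTable = matrix(c(mean(faithful$waiting), sd(faithful$waiting), median(faithful$waiting),
                      mean(faithful$eruptions), sd(faithful$eruptions), median(faithful$eruptions)),
                    nrow = 2, byrow = TRUE)

colnames(statsTable) = c("Mean", "StDev", "Median")
rownames(statsTable) = c("Waiting", "Eruption")

# One cell: [row, column]
statsTable["Waiting", "Median"]   # 76

# A whole column: leave the row position empty
statsTable[ , "Mean"]

# A whole row: leave the column position empty
statsTable["Eruption", ]
```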



8 Change direction of the table


For this type of data, it makes more sense to add the variables by columns and their descriptive statistics in the rows. To perform this:

  • Simply change the number of rows you need when you create the matrix.

  • Also change the direction in which the data is arranged. In this case, it must be arranged in columns, so change byrow to FALSE.

  • Do not forget to change the names of the vectors used to create the column and row names.

# 1. Create the matrix.

faith_Table_2 = matrix(c(wait_Vector,eruption_Vector), nrow = 3, byrow = FALSE)

# 2. Create vectors for row and column names

row_names_2 = c("Mean", "StDev", "Median")
col_names_2 = c("Waiting", "Eruption")

# 3. Apply names to columns and rows

colnames(faith_Table_2) = col_names_2
rownames(faith_Table_2) = row_names_2

# 4. Present table using a nice code such as kable()

knitr::kable(faith_Table_2,
             align = "c", 
             digits = 2,
             format = "html",
             table.attr = "style='width:40%;'")
          Waiting   Eruption
Mean      70.90     3.49
StDev     13.59     1.14
Median    76.00     4.00
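
As an alternative to rebuilding the matrix, the base R code t() (transpose) swaps rows and columns of an existing matrix and carries the names along. This is a different route from the one used above, sketched here with hypothetical object names.

```r
data("faithful")

# Table with one variable per row, names supplied through the dimnames option
statsByRow = matrix(c(mean(faithful$waiting), sd(faithful$waiting), median(faithful$waiting),
                      mean(faithful$eruptions), sd(faithful$eruptions), median(faithful$eruptions)),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("Waiting", "Eruption"),
                                    c("Mean", "StDev", "Median")))

# Transpose: rows become columns and columns become rows
statsByColumn = t(statsByRow)

dim(statsByColumn)        # 3 2
round(statsByColumn, 2)
```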



Disclaimer: This short manual series project is a work in progress. Until otherwise clearly stated, this material is considered to be a draft version.



Dee Chiluiza, PhD
June 2021
Last update: 05 March, 2022
Boston, Massachusetts, USA
