Data Structures

Data structures are used to store data. R supports many data structures, but there are 4 fundamental ones that an R practitioner needs to know. They are:

  1. Vectors: The fundamental (atomic) data structure. Most R operations are vectorized, i.e. they apply element-wise to whole vectors. Unlike MATLAB, where a vector is by default a row (a 1-by-n matrix), a plain R vector has no dimensions, although functions such as as.matrix() treat it as a column. A vector can hold only one data type: a character vector, for example, cannot also hold numeric values, and mixing types coerces every element to a common type.
x <- c(55, 65, 90, 86) # create a vector through concatenation by using c()
x
## [1] 55 65 90 86
y <- c(75, 60, 92, 80)
y
## [1] 75 60 92 80
z <- x + y # vector addition
z
## [1] 130 125 182 166
u <- c(200, 300, 400, 450)
z-u
## [1]  -70 -175 -218 -284
Names <- c("Rob", "James", "Alberto", "Maya") # Character vectors
Names
## [1] "Rob"     "James"   "Alberto" "Maya"
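The one-type rule can be checked directly. The sketch below uses only base R; the vector name mixed is an illustrative assumption, not part of the original example. It shows that combining numbers, strings, and logicals with c() coerces everything to character, and that a plain vector has a length but no dimensions.

```r
# Mixing types in c() coerces every element to the most general type
mixed <- c(55, "Rob", TRUE)   # 'mixed' is an illustrative name
class(mixed)                  # "character"
mixed

# A plain vector has a length but no dim attribute
x <- c(55, 65, 90, 86)
length(x)
dim(x)                        # NULL

# as.matrix() turns the vector into a one-column matrix
dim(as.matrix(x))
```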
  2. Matrices: Matrices are built by combining vectors either row-wise (rbind()) or column-wise (cbind()). Matrices can also be created with the matrix() function (use ?matrix to look up the arguments). Like vectors, a matrix can hold only a single data type.
m <- matrix(1:12, nrow = 3, ncol = 4, byrow = FALSE)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
n <- cbind(x, y)
n
##       x  y
## [1,] 55 75
## [2,] 65 60
## [3,] 90 92
## [4,] 86 80
v <- rbind(x, y)
v
##   [,1] [,2] [,3] [,4]
## x   55   65   90   86
## y   75   60   92   80
# Some basic matrix operations:
t(n) # create a transpose
##   [,1] [,2] [,3] [,4]
## x   55   65   90   86
## y   75   60   92   80
n %*% t(n) # create a square matrix
##       [,1]  [,2]  [,3]  [,4]
## [1,]  8650  8075 11850 10730
## [2,]  8075  7825 11370 10390
## [3,] 11850 11370 16564 15100
## [4,] 10730 10390 15100 13796
solve(t(n) %*% n) # inverse of the 2 x 2 matrix t(n) %*% n
##              x            y
## x  0.002319781 -0.002232726
## y -0.002232726  0.002190450
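One way to sanity-check the inverse above is to multiply it back by the original matrix, which should return the identity matrix up to floating-point rounding. The sketch below rebuilds n from x and y so that it runs on its own; the names A and A_inv are illustrative assumptions.

```r
x <- c(55, 65, 90, 86)
y <- c(75, 60, 92, 80)
n <- cbind(x, y)

A <- t(n) %*% n          # 2 x 2 cross-product matrix
A_inv <- solve(A)        # its inverse

# A matrix times its inverse should be the identity matrix,
# up to floating-point rounding
round(A_inv %*% A, 10)

# n %*% t(n) is 4 x 4 but is built from only two columns,
# so its rank is 2 and solve() cannot invert it
dim(n %*% t(n))
qr(n %*% t(n))$rank
```

This is also why the example inverts t(n) %*% n rather than n %*% t(n).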

3. Lists: Lists are the building blocks of data frames. A single list can hold elements of different data types. Use as.list() to convert an existing object to a list, or list() to combine vectors and matrices of different data types into a single list.

# Unlike vectors and matrices, lists can hold multiple datatypes
list.1 <- list(Names, x, y) # note the capital N: names is a base R function
list.1; str(list.1)
## [[1]]
## [1] "Rob"     "James"   "Alberto" "Maya"
## 
## [[2]]
## [1] 55 65 90 86
## 
## [[3]]
## [1] 75 60 92 80
## List of 3
##  $ : chr [1:4] "Rob" "James" "Alberto" "Maya"
##  $ : num [1:4] 55 65 90 86
##  $ : num [1:4] 75 60 92 80
list.2 <- list(Names, n, m) # a character vector and two matrices
list.2; str(list.2)
## [[1]]
## [1] "Rob"     "James"   "Alberto" "Maya"
## 
## [[2]]
##       x  y
## [1,] 55 75
## [2,] 65 60
## [3,] 90 92
## [4,] 86 80
## 
## [[3]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## List of 3
##  $ : chr [1:4] "Rob" "James" "Alberto" "Maya"
##  $ : num [1:4, 1:2] 55 65 90 86 75 60 92 80
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "x" "y"
##  $ : int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
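Once a list exists, its elements usually need to be pulled back out. The sketch below shows the three common access forms in base R; the list name list.3 and the element names students and scores are illustrative assumptions.

```r
Names <- c("Rob", "James", "Alberto", "Maya")
x <- c(55, 65, 90, 86)

# Naming the elements makes them easier to retrieve
# ('list.3', 'students' and 'scores' are illustrative names)
list.3 <- list(students = Names, scores = x)

list.3[["students"]]   # double brackets return the element itself
list.3$scores          # $ is shorthand for [[ ]] with a name
list.3["scores"]       # single brackets return a list of length 1

class(list.3[["scores"]])  # "numeric"
class(list.3["scores"])    # "list"
```

The difference between [ ] and [[ ]] matters in practice: arithmetic works on list.3[["scores"]] but not on list.3["scores"], because the latter is still a list.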

4. Data Frames: Data frames are the standard structure for tabular data. Each column is a vector of a single type, but different columns can hold different types. Base R works well with data frames; the modern “tidyverse” packages use a variant called a “tibble”.

Year <- seq(1981, 1997, by = 1)
Observations <- sample(1000, 17, replace = TRUE) # Draw 17 random integers from 1 to 1000, with replacement
df.1 <- data.frame(Year, Observations) # Create a data frame
Year
##  [1] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## [16] 1996 1997
Observations
##  [1] 161 814 497 315 765 588  61 494 458 514 263 238 943 650 936  50 612
class(df.1) # check the class of object
## [1] "data.frame"
str(df.1) # check the structure of the object and its respective columns
## 'data.frame':    17 obs. of  2 variables:
##  $ Year        : num  1981 1982 1983 1984 1985 ...
##  $ Observations: int  161 814 497 315 765 588 61 494 458 514 ...
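Because a data frame is a list of equal-length columns, the list tools carry over and each column keeps its own type. The following sketch assumes a small illustrative data frame named df.2 built from the earlier vectors; stringsAsFactors = FALSE is passed explicitly so the behavior does not depend on the R version.

```r
Names <- c("Rob", "James", "Alberto", "Maya")
x <- c(55, 65, 90, 86)

# 'df.2' is an illustrative name; stringsAsFactors = FALSE keeps Names
# as character in R versions before 4.0 as well
df.2 <- data.frame(Names, x, stringsAsFactors = FALSE)

is.list(df.2)        # TRUE: a data frame is a list of columns
sapply(df.2, class)  # each column keeps its own type
df.2$x               # extract one column as a vector
df.2[df.2$x > 60, ]  # keep only the rows where x exceeds 60
dim(df.2)            # number of rows, then number of columns
```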

Now, let’s look at doing some preliminary statistical analysis.

linear.model.1 <- lm(Observations ~ Year, data = df.1) # perform linear regression
summary(linear.model.1) # check the summary and the coefficients
## 
## Call:
## lm(formula = Observations ~ Year, data = df.1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -487.26 -241.72   15.79  125.76  425.26 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12451.419  28433.850  -0.438    0.668
## Year             6.507     14.296   0.455    0.655
## 
## Residual standard error: 288.8 on 15 degrees of freedom
## Multiple R-squared:  0.01363,    Adjusted R-squared:  -0.05213 
## F-statistic: 0.2072 on 1 and 15 DF,  p-value: 0.6555
plot(df.1$Year, df.1$Observations, col = "red", xlab = "Year", ylab = "Observations")
abline(linear.model.1) # Create a Linear Regression line using abline()

par(mfrow = c(2, 2)) # arrange the next plots in a 2 x 2 grid
plot(linear.model.1) # draw the four diagnostic plots, including Residuals vs Fitted
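Because Observations is drawn with sample(), the numbers and the regression output shown above will differ on every run. A minimal sketch of making the draw reproducible with set.seed() follows; the seed value 42 and the names obs.a, obs.b, and fit are arbitrary assumptions, and the exact values drawn for a given seed can differ between R versions.

```r
set.seed(42)  # arbitrary seed; any fixed integer works
obs.a <- sample(1000, 17, replace = TRUE)

set.seed(42)  # resetting the seed repeats the same draw
obs.b <- sample(1000, 17, replace = TRUE)

identical(obs.a, obs.b)  # TRUE

# Individual results can be pulled out of the fitted model
Year <- seq(1981, 1997, by = 1)
fit <- lm(obs.a ~ Year)
coef(fit)                # intercept and slope as a named vector
summary(fit)$r.squared   # multiple R-squared
```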

Exercises

Perform the following exercises in base R.

1. Create a data frame called “Results” by combining the vectors “Names”, “x”, and “y”. Rename “x” as “Result_1” and “y” as “Result_2”.

Results <- data.frame(Names, x, y) # combine the three vectors column-wise
colnames(Results) <- c("Names", "Result_1", "Result_2") # rename the columns
Results

2. Create a scatter plot with “Result_1” on the x-axis and “Result_2” on the y-axis.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Scatter plot of Result_2 (y-axis) against Result_1 (x-axis)
ggplot(data = Results, aes(x = Result_1, y = Result_2)) + geom_point()
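Since the exercises ask for base R, the same scatter plot can also be drawn without loading the tidyverse. The sketch below rebuilds Results so that it runs on its own; the axis labels and title are illustrative choices.

```r
Names <- c("Rob", "James", "Alberto", "Maya")
x <- c(55, 65, 90, 86)
y <- c(75, 60, 92, 80)
Results <- data.frame(Names, Result_1 = x, Result_2 = y)

# Base R scatter plot: first argument on the x-axis, second on the y-axis
plot(Results$Result_1, Results$Result_2,
     xlab = "Result_1", ylab = "Result_2",
     main = "Result_2 against Result_1")
```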

3. Perform a Linear Regression by regressing “Result_2” on “Result_1”. (Hint: use the lm() function)

linear.model.1 <- lm(Result_2 ~ Result_1, data = Results) # perform linear regression

4. Plot the residuals and summarize the results.

# Plotting the residuals
par(mfrow = c(2, 2)) # arrange the next plots in a 2 x 2 grid
plot(linear.model.1) # draw the four diagnostic plots, including Residuals vs Fitted

# Summary of the linear regression model
summary(linear.model.1) # check the summary and the coefficients
## 
## Call:
## lm(formula = Result_2 ~ Result_1, data = Results)
## 
## Residuals:
##       1       2       3       4 
##   8.788 -11.758   6.376  -3.406 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  35.7072    30.0681   1.188    0.357
## Result_1      0.5546     0.3987   1.391    0.299
## 
## Residual standard error: 11.57 on 2 degrees of freedom
## Multiple R-squared:  0.4917, Adjusted R-squared:  0.2376 
## F-statistic: 1.935 on 1 and 2 DF,  p-value: 0.2988

5. Interpret your results.

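Solution: The fitted line is Result_2 = 35.71 + 0.55 × Result_1, but the fit is weak. In the Residuals vs Fitted plot from question 4 the points do not lie close to the fitted line: the residuals range from about -11.8 to 8.8, and the residual standard error is 11.57 on only 2 degrees of freedom. The p-value of the model is 0.2988, so we cannot reject the hypothesis that the slope is zero at conventional significance levels. The multiple R-squared of 0.4917 drops to an adjusted R-squared of 0.2376. With only four observations, these results do not support a reliable linear relationship, and the model should not be relied on for prediction.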
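The numbers used in this interpretation can also be extracted from the fitted model instead of being read off the printed summary. The sketch below refits the model under the illustrative name fit and adds a 95% confidence interval for each coefficient with confint(), which the exercise itself does not ask for.

```r
x <- c(55, 65, 90, 86)
y <- c(75, 60, 92, 80)
Results <- data.frame(Result_1 = x, Result_2 = y)

fit <- lm(Result_2 ~ Result_1, data = Results)  # 'fit' is an illustrative name

coef(summary(fit))          # estimates, standard errors, t values, p-values
ci <- confint(fit, level = 0.95)
ci                          # 95% confidence intervals for intercept and slope

# TRUE when the slope's interval contains zero, i.e. the data are
# compatible with no linear relationship
ci["Result_1", 1] < 0 && ci["Result_1", 2] > 0
```

An interval for the slope that spans zero tells the same story as the p-value of 0.2988: four points are too few to pin down the relationship.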