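Data structures store and organize data. R supports many data structures, but there are four fundamental ones that every R practitioner needs to know: vectors, matrices, lists, and data frames.
1. Vectors: A vector holds an ordered sequence of elements that all share one data type. Use c() to create one.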
x <- c(55, 65, 90, 86) # create a vector through concatenation by using c()
x
## [1] 55 65 90 86
y <- c(75, 60, 92, 80)
y
## [1] 75 60 92 80
z <- x + y # vector addition
z
## [1] 130 125 182 166
u <- c(200, 300, 400, 450)
z-u
## [1] -70 -175 -218 -284
Names <- c("Rob", "James", "Alberto", "Maya") # Character vectors
Names
## [1] "Rob" "James" "Alberto" "Maya"
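The vector examples above rely on two behaviours that are easier to see in isolation: arithmetic is applied element by element, and an atomic vector holds a single type, so mixing types triggers coercion. The following is a minimal sketch; the names `scores`, `bonus`, and `mixed` are illustrative and do not come from the text above.

```r
# Hypothetical vectors, used only for illustration
scores <- c(55, 65, 90, 86)
bonus  <- c(5, 5, 10, 0)

total  <- scores + bonus   # element-wise addition: 60 70 100 86
second <- scores[2]        # indexing is 1-based, so this is 65

# Mixing numbers, strings, and logicals coerces everything to character
mixed <- c(1, "a", TRUE)
class(mixed)               # "character"
```

This coercion is the reason a single vector cannot store names and scores together without turning the scores into text.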
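2. Matrices: A matrix is a two-dimensional arrangement of elements of a single data type. Use matrix() to create one, or cbind() and rbind() to bind vectors together as columns or rows.
m <- matrix(1:12, nrow = 3, ncol = 4, byrow = FALSE) # fill a 3 x 4 matrix column by column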
m
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
n <- cbind(x, y) # bind the vectors together as columns
n
## x y
## [1,] 55 75
## [2,] 65 60
## [3,] 90 92
## [4,] 86 80
v <- rbind(x, y) # bind the vectors together as rows
v
## [,1] [,2] [,3] [,4]
## x 55 65 90 86
## y 75 60 92 80
# Some basic matrix operations:
t(n) # create a transpose
## [,1] [,2] [,3] [,4]
## x 55 65 90 86
## y 75 60 92 80
n %*% t(n) # matrix multiplication: (4 x 2) times (2 x 4) gives a 4 x 4 square matrix
## [,1] [,2] [,3] [,4]
## [1,] 8650 8075 11850 10730
## [2,] 8075 7825 11370 10390
## [3,] 11850 11370 16564 15100
## [4,] 10730 10390 15100 13796
solve(t(n) %*% n) # inverse of the 2 x 2 matrix t(n) %*% n
## x y
## x 0.002319781 -0.002232726
## y -0.002232726 0.002190450
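One way to sanity-check the inverse computed above is to multiply it back against the original matrix and compare the result with the identity matrix. The sketch below rebuilds `n` from the same `x` and `y` values; the names `A`, `A_inv`, and `check` are introduced here for illustration only.

```r
x <- c(55, 65, 90, 86)
y <- c(75, 60, 92, 80)
n <- cbind(x, y)

A     <- t(n) %*% n   # 2 x 2 cross-product matrix
A_inv <- solve(A)     # its inverse

# A matrix times its inverse should be the identity,
# up to floating-point rounding
check <- A %*% A_inv
isTRUE(all.equal(check, diag(2), check.attributes = FALSE))

dim(n %*% t(n))       # 4 4
```

Note that the 4 x 4 product `n %*% t(n)` is built from only two columns, so it is rank-deficient and has no inverse; that is why the inverse is taken of `t(n) %*% n` instead.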
3. Lists: Lists are the building blocks of data frames. Unlike vectors and matrices, a single list can hold elements of different data types. Use list() to combine objects such as vectors and matrices of different types into one list, or as.list() to convert an existing object into a list.
# Unlike vectors and matrices, lists can hold multiple datatypes
list.1 <- list(Names, x, y) # use the character vector Names, not the base function names()
list.1; str(list.1)
## [[1]]
## [1] "Rob" "James" "Alberto" "Maya"
##
## [[2]]
## [1] 55 65 90 86
##
## [[3]]
## [1] 75 60 92 80
## List of 3
## $ : chr [1:4] "Rob" "James" "Alberto" "Maya"
## $ : num [1:4] 55 65 90 86
## $ : num [1:4] 75 60 92 80
list.2 <- list(Names, n, m) # use the character vector Names, not the base function names()
list.2; str(list.2)
## [[1]]
## [1] "Rob" "James" "Alberto" "Maya"
##
## [[2]]
## x y
## [1,] 55 75
## [2,] 65 60
## [3,] 90 92
## [4,] 86 80
##
## [[3]]
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
## List of 3
## $ : chr [1:4] "Rob" "James" "Alberto" "Maya"
## $ : num [1:4, 1:2] 55 65 90 86 75 60 92 80
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "x" "y"
## $ : int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
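The difference between list() and as.list() described in the Lists section can be sketched as follows. The object names (`nums`, `wrapped`, `split_up`, `record`) are hypothetical and chosen only for this illustration.

```r
nums <- c(55, 65, 90)

wrapped  <- list(nums)     # one element that holds the whole vector
split_up <- as.list(nums)  # three elements, one per value

length(wrapped)    # 1
length(split_up)   # 3

# A named list mixing a character vector, a numeric vector, and a matrix
record <- list(who = c("Rob", "James"), marks = nums, grid = matrix(1:4, nrow = 2))

record$who[1]           # "Rob"
record[["marks"]]       # double brackets return the element itself
class(record["marks"])  # "list": single brackets keep the list wrapper
```

Naming the elements, as in `record`, also avoids the unnamed `[[1]]`, `[[2]]` labels seen in the printed output above.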
4. Data Frames: A data frame is the fundamental structure for tabular data: each column can hold a different data type, but all columns must have the same length. Base R works very well with data frames; the modern “tidyverse” ecosystem uses a variant called a “tibble”.
Year <- seq(1981, 1997, by = 1)
Observations <- sample(1000, 17, replace = TRUE) # Draw 17 random integers between 1 and 1000, with replacement
df.1 <- data.frame(Year, Observations) # Create a data frame
Year
## [1] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## [16] 1996 1997
Observations
## [1] 161 814 497 315 765 588 61 494 458 514 263 238 943 650 936 50 612
class(df.1) # check the class of object
## [1] "data.frame"
str(df.1) # check the structure of the object and its respective columns
## 'data.frame': 17 obs. of 2 variables:
## $ Year : num 1981 1982 1983 1984 1985 ...
## $ Observations: int 161 814 497 315 765 588 61 494 458 514 ...
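Because sample() is random, the values shown above will differ on every run. The sketch below fixes a seed so the draw is repeatable and shows that a data frame behaves like a list of equal-length columns. The seed value and the names `df.demo` and `tb.demo` are arbitrary choices made for this illustration, and the tibble conversion only runs if the tibble package happens to be installed.

```r
set.seed(42)  # arbitrary seed, used only so that reruns give the same draw
df.demo <- data.frame(
  Year         = seq(1981, 1997, by = 1),
  Observations = sample(1000, 17, replace = TRUE)
)

is.list(df.demo)   # TRUE: each column is one list element
names(df.demo)     # "Year" "Observations"
dim(df.demo)       # 17 rows, 2 columns

df.demo[df.demo$Year >= 1990, ]      # keep rows that satisfy a condition
df.demo[["Observations"]][1:3]       # extract a column, list-style

# Optional: convert to a tibble when the package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  tb.demo <- tibble::as_tibble(df.demo)
  class(tb.demo)   # "tbl_df" "tbl" "data.frame"
}
```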
Now, let’s look at doing some preliminary statistical analysis.
linear.model.1 <- lm(Observations ~ Year, data = df.1) # perform linear regression
summary(linear.model.1) # check the summary and the coefficients
##
## Call:
## lm(formula = Observations ~ Year, data = df.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -487.26 -241.72 15.79 125.76 425.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12451.419 28433.850 -0.438 0.668
## Year 6.507 14.296 0.455 0.655
##
## Residual standard error: 288.8 on 15 degrees of freedom
## Multiple R-squared: 0.01363, Adjusted R-squared: -0.05213
## F-statistic: 0.2072 on 1 and 15 DF, p-value: 0.6555
plot(df.1$Year, df.1$Observations, col = "red", xlab = "Year", ylab = "Observations")
abline(linear.model.1) # Create a Linear Regression line using abline()
par(mfrow = c(2, 2)) # arrange the next plots in a 2 x 2 grid
plot(linear.model.1) # draw the four diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)
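Beyond printing the summary, a fitted model can be queried directly. The sketch below refits the same kind of model on a seeded random sample and pulls out the coefficients, a prediction, and the residuals. The seed, the prediction year 1998, and the names `df.demo`, `fit.demo`, `pred`, and `res` are assumptions made for this example, not part of the analysis above.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
df.demo <- data.frame(
  Year         = 1981:1997,
  Observations = sample(1000, 17, replace = TRUE)
)
fit.demo <- lm(Observations ~ Year, data = df.demo)

coef(fit.demo)   # named vector: (Intercept) and Year (the slope)

# Predicted value for a hypothetical new year
pred <- predict(fit.demo, newdata = data.frame(Year = 1998))

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero
res <- residuals(fit.demo)
sum(res)
```

Since the observations are drawn at random, no real trend over the years should be expected, which is consistent with the large p-value in the summary output above.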
Perform the following exercises in base R.
Results <- data.frame(Names, x, y) # combine the three vectors into a data frame
colnames(Results) <- c("Names", "Result_1", "Result_2") # rename the columns
Results
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Scatter plot of Result_1 against Result_2
ggplot(data = Results, aes(x = Result_1, y = Result_2)) + geom_point()
linear.model.1 <- lm(Result_2 ~ Result_1, data = Results) # perform linear regression
# Plotting the model diagnostics
par(mfrow = c(2, 2)) # arrange the next plots in a 2 x 2 grid
plot(linear.model.1) # draw the four diagnostic plots, including Residuals vs Fitted
# Summary of the linear regression model
summary(linear.model.1) # check the summary and the coefficients
##
## Call:
## lm(formula = Result_2 ~ Result_1, data = Results)
##
## Residuals:
## 1 2 3 4
## 8.788 -11.758 6.376 -3.406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.7072 30.0681 1.188 0.357
## Result_1 0.5546 0.3987 1.391 0.299
##
## Residual standard error: 11.57 on 2 degrees of freedom
## Multiple R-squared: 0.4917, Adjusted R-squared: 0.2376
## F-statistic: 1.935 on 1 and 2 DF, p-value: 0.2988
Solution: The Residuals vs Fitted plot from question 4 above shows that the regression line does not fit the actual data points well. The summary statistics agree: the residual standard error is 11.57 on 2 degrees of freedom, and the p-value of the model is 0.2988, well above the conventional 0.05 threshold, so there is no statistically significant linear relationship between Result_1 and Result_2. With only four observations, the regression cannot be expected to predict the actual points accurately.
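The figures quoted in the solution can be extracted from the model object rather than read off the printed summary. The sketch below refits the model on the same four pairs of values and also recomputes the slope by hand from the least-squares formula. The names `fit`, `s`, `slope_manual`, `slope_p`, and `r2` are introduced here for illustration.

```r
x <- c(55, 65, 90, 86)  # Result_1
y <- c(75, 60, 92, 80)  # Result_2

fit <- lm(y ~ x)
s   <- summary(fit)

# Least-squares slope: covariation of x and y over variation of x
slope_manual <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

slope_p <- s$coefficients["x", "Pr(>|t|)"]  # p-value of the slope, about 0.299
r2      <- s$r.squared                      # about 0.4917
s$sigma                                     # residual standard error, about 11.57

slope_p > 0.05  # TRUE: not significant at the 5% level
```

With a single predictor, the p-value of the slope equals the p-value of the overall F-test, which is why both appear as roughly 0.299 in the summary.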