Lab 2 covers the creation of vectors, matrices, lists, and data frames. Finally, we will investigate the association between two variables with a simple linear regression model and a scatter plot for visual inspection.
Data structures are used to store data. R supports multiple data structures, but there are 4 fundamental data structures that an R practitioner needs to know. They are:
1. Vectors: The fundamental (atomic)
data structure. R vectorizes elements and stores them in
vectors. Unlike MATLAB, which stores data in row vectors, R
treats a plain vector as a column when it is converted to a matrix.
A vector can store only one data type, i.e. it can hold character data
or numeric data, but not a mixture of the two.
x <- c(55, 65, 90, 86) # create a vector through concatenation by using c()
x
## [1] 55 65 90 86
y <- c(75, 60, 92, 80)
y
## [1] 75 60 92 80
z <- x + y # vector addition
z
## [1] 130 125 182 166
u <- c(200, 300, 400, 450)
u
## [1] 200 300 400 450
z-u
## [1] -70 -175 -218 -284
Names<- c("Rob", "James", "Alberto", "Maya") # Character vectors
Names
## [1] "Rob" "James" "Alberto" "Maya"
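The single-type rule can be checked directly. The sketch below is only an illustration (the names `mixed`, `nums`, and `scores` are made up for this example): it shows that R does not raise an error when types are mixed in c(), but silently coerces every element to the most flexible type present.

```r
# Mixing types in c() does not fail; every element is coerced
# to the most flexible type present (here: character).
mixed <- c(55, "Rob", TRUE)
class(mixed)                  # "character"

# Logical values mixed with numbers become 1 and 0
nums <- c(1.5, TRUE, FALSE)
class(nums)                   # "numeric"

# Arithmetic on vectors is element-wise
scores <- c(55, 65, 90, 86)   # same values as x above
scores * 2
mean(scores)                  # 74
```

Because of this coercion, it is worth checking class() after building a vector from mixed inputs.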
2. Matrices: Matrices can be created by combining vectors by rows
(rbind()) or by columns (cbind()).
Matrices can also be created by using the matrix() function
(use ?matrix to look up the arguments). Similar to vectors, matrices
also support only a single data type in a single matrix.
m <- matrix(1:12, nrow = 3, ncol = 4, byrow = FALSE)
m
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
n <- cbind(x, y)
n
## x y
## [1,] 55 75
## [2,] 65 60
## [3,] 90 92
## [4,] 86 80
v<- rbind(x, y)
v
## [,1] [,2] [,3] [,4]
## x 55 65 90 86
## y 75 60 92 80
t(n) # create a transpose
## [,1] [,2] [,3] [,4]
## x 55 65 90 86
## y 75 60 92 80
n %*% t(n) # create a square matrix
## [,1] [,2] [,3] [,4]
## [1,] 8650 8075 11850 10730
## [2,] 8075 7825 11370 10390
## [3,] 11850 11370 16564 15100
## [4,] 10730 10390 15100 13796
solve((t(n) %*% n)) # inverse
## x y
## x 0.002319781 -0.002232726
## y -0.002232726 0.002190450
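As a quick sanity check on the matrix operations above, the following sketch rebuilds n from x and y and verifies that the matrix returned by solve() behaves like an inverse. The names `A`, `A_inv`, and `m_row` are introduced here only for illustration.

```r
x <- c(55, 65, 90, 86)
y <- c(75, 60, 92, 80)
n <- cbind(x, y)

dim(x)                 # NULL: a plain vector has no dimensions
dim(as.matrix(x))      # 4 1: converted to a one-column matrix
dim(n)                 # 4 2

A <- t(n) %*% n        # 2 x 2 cross-product matrix
A_inv <- solve(A)

# A matrix times its inverse should give the identity matrix,
# up to floating-point rounding
round(A %*% A_inv, 10)

# byrow = TRUE fills the matrix row by row instead of column by column
m_row <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
m_row[1, ]             # 1 2 3 4
```

Note that solve() only works on square, non-singular matrices, which is why t(n) %*% n is inverted rather than n itself.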
3. Lists: Lists are the building blocks of Data
Frames. They can hold multiple different data types in a single list.
Use as.list() to convert an object to a list, or list() to combine
vectors/matrices of different data types into a single list.
# Unlike vectors and matrices, lists can hold multiple data types
list.1 <- list(Names, x, y)
list.1; str(list.1)
## [[1]]
## [1] "Rob" "James" "Alberto" "Maya"
##
## [[2]]
## [1] 55 65 90 86
##
## [[3]]
## [1] 75 60 92 80
## List of 3
## $ : chr [1:4] "Rob" "James" "Alberto" "Maya"
## $ : num [1:4] 55 65 90 86
## $ : num [1:4] 75 60 92 80
list.2 <- list(Names, n, m)
list.2; str(list.2)
## [[1]]
## [1] "Rob" "James" "Alberto" "Maya"
##
## [[2]]
## x y
## [1,] 55 75
## [2,] 65 60
## [3,] 90 92
## [4,] 86 80
##
## [[3]]
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
## List of 3
## $ : chr [1:4] "Rob" "James" "Alberto" "Maya"
## $ : num [1:4, 1:2] 55 65 90 86 75 60 92 80
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "x" "y"
## $ : int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
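Once a list exists, its elements usually need to be pulled back out. The sketch below, which uses a hypothetical named list `results_list` built from the same values as Names and x, illustrates the difference between single and double brackets.

```r
Names <- c("Rob", "James", "Alberto", "Maya")
x <- c(55, 65, 90, 86)

# Naming the elements makes them easier to retrieve later
results_list <- list(student = Names, score = x)

results_list[["student"]]        # double brackets return the element itself
results_list$score               # $ is shorthand for [[ ]] with a name
class(results_list["score"])     # "list": single brackets return a sub-list
class(results_list[["score"]])   # "numeric"
```

Using [[ ]] or $ is usually what is wanted when the element will be passed to a function such as mean().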
4. Data Frames: Data frames are the fundamental data
structure for tabular data whose columns can hold different data types.
Base R works very well with data frames; the modern “tidyverse”
ecosystem uses a variant called a “tibble”.
Year <- seq(1981, 1997, by = 1)
Observations <- sample(1000, 17, replace = TRUE) # Draw 17 random integers between 1 and 1000 (values differ on each run unless a seed is set)
df.1 <- data.frame(Year, Observations) # Create a data frame
Year
## [1] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## [16] 1996 1997
Observations
## [1] 864 70 888 741 785 570 788 740 697 182 297 425 905 264 294 873 703
df.1
## Year Observations
## 1 1981 864
## 2 1982 70
## 3 1983 888
## 4 1984 741
## 5 1985 785
## 6 1986 570
## 7 1987 788
## 8 1988 740
## 9 1989 697
## 10 1990 182
## 11 1991 297
## 12 1992 425
## 13 1993 905
## 14 1994 264
## 15 1995 294
## 16 1996 873
## 17 1997 703
class(df.1) # check the class of object
## [1] "data.frame"
str(df.1) # check the structure of the object and its respective columns
## 'data.frame': 17 obs. of 2 variables:
## $ Year : num 1981 1982 1983 1984 1985 ...
## $ Observations: int 864 70 888 741 785 570 788 740 697 182 ...
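Because sample() draws new values on every run, the numbers shown above will not be reproduced exactly. A minimal sketch of one way to make the draw repeatable and to inspect the resulting data frame follows; the seed value 42 and the name `df.demo` are arbitrary choices for this illustration, not part of the lab.

```r
set.seed(42)   # arbitrary seed; any fixed integer makes the draw repeatable
Year <- seq(1981, 1997, by = 1)
Observations <- sample(1000, 17, replace = TRUE)
df.demo <- data.frame(Year, Observations)

is.list(df.demo)        # TRUE: a data frame is a list of equal-length columns
dim(df.demo)            # 17 2
head(df.demo$Year, 3)   # 1981 1982 1983

# Logical indexing selects rows; here the years 1995 to 1997
df.demo[df.demo$Year >= 1995, ]
```

Setting the same seed before calling sample() again returns the same 17 values, which makes results easier to compare between runs.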
Now, let’s look at doing some preliminary statistical analysis.
linear.model.1 <- lm(Observations ~ Year, data = df.1) # perform linear regression
summary(linear.model.1) # check the summary and the coefficients
##
## Call:
## lm(formula = Observations ~ Year, data = df.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -576.7 -253.5 109.6 179.5 342.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15759.419 27970.539 0.563 0.581
## Year -7.625 14.063 -0.542 0.596
##
## Residual standard error: 284.1 on 15 degrees of freedom
## Multiple R-squared: 0.01922, Adjusted R-squared: -0.04616
## F-statistic: 0.294 on 1 and 15 DF, p-value: 0.5956
plot(df.1$Year, df.1$Observations, col = "red", xlab = "Year", ylab = "Observations")
abline(linear.model.1) # Create a Linear Regression line using abline()
par(mfrow = c(2,2)) # arrange the plotting area as a 2 x 2 grid
plot(linear.model.1) # draw the four diagnostic plots of the fitted model (residuals, Q-Q, etc.)
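To see what lm() estimates in the simple one-predictor case, the coefficients can be reproduced by hand from the least-squares formulas: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The sketch below uses small made-up data (`hours` and `marks` are hypothetical and not part of the lab).

```r
# Made-up illustrative data
hours <- c(1, 2, 3, 4, 5)
marks <- c(52, 55, 61, 64, 70)

fit <- lm(marks ~ hours)

# Least-squares estimates computed directly from the data
slope <- cov(hours, marks) / var(hours)
intercept <- mean(marks) - slope * mean(hours)

coef(fit)               # intercept and slope as estimated by lm()
c(intercept, slope)     # the same values, computed by hand
all.equal(unname(coef(fit)), c(intercept, slope))   # TRUE
```

coef() is a convenient way to extract the estimates without reading them off the summary() printout.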
Create a data-frame called “Results” by combining the vectors “Names”, “x”, and “y”. Rename “x” as “Result_1” and “y” as “Result_2”.
Answer-1:
Results<-data.frame(Names,x, y)
Results
## Names x y
## 1 Rob 55 75
## 2 James 65 60
## 3 Alberto 90 92
## 4 Maya 86 80
Rename the columns of the data frame:
# way-1
colnames(Results)<-c("Names", "Result_1", "Result_2")
Results
## Names Result_1 Result_2
## 1 Rob 55 75
## 2 James 65 60
## 3 Alberto 90 92
## 4 Maya 86 80
# way-2
Results<-data.frame(Names,x, y)
names(Results)[2] <- "Result_1"
names(Results)[3] <- "Result_2"
Results
## Names Result_1 Result_2
## 1 Rob 55 75
## 2 James 65 60
## 3 Alberto 90 92
## 4 Maya 86 80
# way-3: by using dplyr
Results<-data.frame(Names,x, y)
library(dplyr)
Results <- Results %>%
  rename(
    Result_1 = x,
    Result_2 = y
  )
Results
## Names Result_1 Result_2
## 1 Rob 55 75
## 2 James 65 60
## 3 Alberto 90 92
## 4 Maya 86 80
Create a scatter plot with “Result_1” on the x-axis and “Result_2” on the y-axis.
Answer-2:
plot(Results$Result_1, Results$Result_2,
     main = "Scatter Plot of Result_1 and Result_2",
     xlab = "Result_1",
     ylab = "Result_2")
Perform a Linear Regression by regressing “Result_2” on “Result_1”. (Hint: use the lm() function)
Answer-3:
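# Note: this formula treats Result_1 as the response. Regressing Result_2 on
# Result_1, as the question is worded, would be written Result_2 ~ Result_1.
model <- lm(Result_1 ~ Result_2, data = Results)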
Plot the residuals and summarize the results.
Answer-4:
par(mfrow = c(2,2)) # arrange the plotting area as a 2 x 2 grid
plot(model) # draw the four diagnostic plots of the fitted model (residuals, Q-Q, etc.)
summary(model)
##
## Call:
## lm(formula = Result_1 ~ Result_2, data = Results)
##
## Residuals:
## 1 2 3 4
## -17.449 5.850 2.480 9.119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9559 49.4618 0.120 0.915
## Result_2 0.8866 0.6374 1.391 0.299
##
## Residual standard error: 14.63 on 2 degrees of freedom
## Multiple R-squared: 0.4917, Adjusted R-squared: 0.2376
## F-statistic: 1.935 on 1 and 2 DF, p-value: 0.2988
Interpret your results.
Answer-5:
From the summary of the linear model, we find that
Result_2 is positively associated with
Result_1 because the estimated coefficient is positive
(0.8866). However, the association is not statistically significant (p-value: 0.299).
According to the fitted model, just under half of the variability in Result_1 is
explained by Result_2 (R-squared: 0.4917, Adjusted R-squared: 0.2376).
With only four observations, these estimates are very imprecise.
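In a simple linear regression the R-squared equals the squared correlation between the two variables, so it does not depend on which variable is treated as the response. The sketch below checks this on the same scores; the names `fit_a`, `fit_b`, and the `r2_*` variables are introduced only for this illustration.

```r
Result_1 <- c(55, 65, 90, 86)
Result_2 <- c(75, 60, 92, 80)

fit_a <- lm(Result_1 ~ Result_2)   # the model fitted above
fit_b <- lm(Result_2 ~ Result_1)   # response and predictor swapped

r2_a <- summary(fit_a)$r.squared
r2_b <- summary(fit_b)$r.squared
r2_cor <- cor(Result_1, Result_2)^2

round(c(r2_a, r2_b, r2_cor), 4)    # all three are 0.4917
```

The slope and intercept do change when the variables are swapped; only the strength of the linear association stays the same.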