Description

Lab 2 covers the creation of vectors, matrices, lists, and data frames. Finally, we investigate the association between two variables with a simple linear regression model, along with a scatter plot for visual inspection.

Data Structures

Data structures are used to store data. R supports many data structures, but there are four fundamental ones that every R practitioner needs to know. They are:

1. Vectors: The fundamental (atomic) data structure. R vectorizes elements and stores them in vectors. Unlike MATLAB, where a bare vector is a row vector, an R vector has no fixed row or column orientation. A vector can only store one data type, e.g. it can hold character data or numeric data, but not a mixture of the two (a short coercion sketch follows the vector examples below).

x <- c(55, 65, 90, 86) # create a vector through concatenation by using c()
x
## [1] 55 65 90 86
y <- c(75, 60, 92, 80)
y
## [1] 75 60 92 80
z <- x + y # vector addition
z
## [1] 130 125 182 166
u <- c(200, 300, 400, 450)
u
## [1] 200 300 400 450
z - u # vector subtraction
## [1]  -70 -175 -218 -284
Names <- c("Rob", "James", "Alberto", "Maya") # character vector
Names
## [1] "Rob"     "James"   "Alberto" "Maya"
2. Matrices: Matrices are built by combining vectors row-wise (rbind()) or column-wise (cbind()), or by using the matrix() function (use ?matrix to look up its arguments). Like vectors, a single matrix can only hold one data type (an indexing sketch follows the matrix operations below).

m <- matrix(1:12, nrow = 3, ncol = 4, byrow = FALSE)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
n <- cbind(x, y)
n
##       x  y
## [1,] 55 75
## [2,] 65 60
## [3,] 90 92
## [4,] 86 80
v <- rbind(x, y)
v
##   [,1] [,2] [,3] [,4]
## x   55   65   90   86
## y   75   60   92   80

Some basic matrix operations:

t(n) # create a transpose
##   [,1] [,2] [,3] [,4]
## x   55   65   90   86
## y   75   60   92   80
n %*% t(n) # matrix multiplication: n times its transpose gives a 4 x 4 square matrix
##       [,1]  [,2]  [,3]  [,4]
## [1,]  8650  8075 11850 10730
## [2,]  8075  7825 11370 10390
## [3,] 11850 11370 16564 15100
## [4,] 10730 10390 15100 13796
solve(t(n) %*% n) # inverse of the 2 x 2 matrix t(n) %*% n
##              x            y
## x  0.002319781 -0.002232726
## y -0.002232726  0.002190450
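
Individual entries of a matrix are read with [row, column] indexing; the short sketch below reuses the matrices m and n defined above (the particular indices are arbitrary examples).

dim(m) # dimensions: 3 rows and 4 columns
m[2, 3] # a single element: row 2, column 3
m[, 2] # the entire second column
n[1, "y"] # columns created by cbind(x, y) keep their names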

3. Lists: Lists are the building blocks of data frames. A single list can hold multiple different data types. Use as.list() to convert an object to a list, or list() to combine vectors and matrices of different data types into a single list (a sketch of extracting list elements follows the output below).

# Unlike vectors and matrices, lists can hold multiple datatypes
list.1 <- list(Names, x, y)
list.1; str(list.1)
## [[1]]
## [1] "Rob"     "James"   "Alberto" "Maya"
## 
## [[2]]
## [1] 55 65 90 86
## 
## [[3]]
## [1] 75 60 92 80
## List of 3
##  $ : chr [1:4] "Rob" "James" "Alberto" "Maya"
##  $ : num [1:4] 55 65 90 86
##  $ : num [1:4] 75 60 92 80
list.2 <- list(Names, n, m)
list.2; str(list.2)
## [[1]]
## [1] "Rob"     "James"   "Alberto" "Maya"
## 
## [[2]]
##       x  y
## [1,] 55 75
## [2,] 65 60
## [3,] 90 92
## [4,] 86 80
## 
## [[3]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## List of 3
##  $ : chr [1:4] "Rob" "James" "Alberto" "Maya"
##  $ : num [1:4, 1:2] 55 65 90 86 75 60 92 80
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "x" "y"
##  $ : int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
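
Elements of a list are extracted with [[ ]] or, for named components, with $. The sketch below builds a named list from the objects created above; the names scores, result_1, and result_2 are chosen purely for illustration.

scores <- list(names = Names, result_1 = x, result_2 = y) # a named list
scores[[2]] # extract the second component (the vector x)
scores$result_1 # the same component, accessed by name
str(scores) # str() now shows a name next to each component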

4. Data Frames: Data frames are the fundamental data structure for holding columns of different data types. Base R works well with data frames; the modern "tidyverse" ecosystem uses a close relative called a "tibble" (a short sketch of column access and reproducible sampling follows the output below).

Year <- seq(1981, 1997, by = 1)
Observations <- sample(1000, 17, replace = TRUE) # draw 17 random integers between 1 and 1000
df.1 <- data.frame(Year, Observations) # Create a data frame
Year
##  [1] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## [16] 1996 1997
Observations
##  [1] 864  70 888 741 785 570 788 740 697 182 297 425 905 264 294 873 703
df.1
##    Year Observations
## 1  1981          864
## 2  1982           70
## 3  1983          888
## 4  1984          741
## 5  1985          785
## 6  1986          570
## 7  1987          788
## 8  1988          740
## 9  1989          697
## 10 1990          182
## 11 1991          297
## 12 1992          425
## 13 1993          905
## 14 1994          264
## 15 1995          294
## 16 1996          873
## 17 1997          703
class(df.1) # check the class of object
## [1] "data.frame"
str(df.1) # check the structure of the object and its respective columns
## 'data.frame':    17 obs. of  2 variables:
##  $ Year        : num  1981 1982 1983 1984 1985 ...
##  $ Observations: int  864 70 888 741 785 570 788 740 697 182 ...
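
Two practical notes, sketched below: a single column of a data frame is extracted with $ (and rows can be subset with [ , ]), and because sample() was called without a fixed seed, the Observations values above change on every run; calling set.seed() first makes the draw reproducible (the seed value 123 is arbitrary).

df.1$Observations # extract one column as a plain vector
df.1[df.1$Year > 1990, ] # subset the rows where Year exceeds 1990
set.seed(123) # arbitrary seed; fixes the random number stream
sample(1000, 17, replace = TRUE) # now returns the same 17 integers on every run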

Preliminary Statistical Analysis

Now, let’s carry out some preliminary statistical analysis.

linear.model.1 <- lm(Observations ~ Year, data = df.1) # perform linear regression
summary(linear.model.1) # check the summary and the coefficients
## 
## Call:
## lm(formula = Observations ~ Year, data = df.1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -576.7 -253.5  109.6  179.5  342.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15759.419  27970.539   0.563    0.581
## Year           -7.625     14.063  -0.542    0.596
## 
## Residual standard error: 284.1 on 15 degrees of freedom
## Multiple R-squared:  0.01922,    Adjusted R-squared:  -0.04616 
## F-statistic: 0.294 on 1 and 15 DF,  p-value: 0.5956
plot(df.1$Year, df.1$Observations, col = "red", xlab = "Year", ylab = "Observations")
abline(linear.model.1) # add the fitted regression line with abline()

par(mfrow = c(2,2)) # arrange the next plots in a 2 x 2 grid
plot(linear.model.1) # the four residual diagnostic plots
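
Beyond summary(), a few accessor functions pull individual pieces out of a fitted lm object; the sketch below applies them to linear.model.1.

coef(linear.model.1) # the estimated intercept and slope
confint(linear.model.1) # 95% confidence intervals for the coefficients
head(residuals(linear.model.1)) # the first few residuals
head(fitted(linear.model.1)) # the first few fitted values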

Exercises

Exercise-1:

Create a data frame called “Results” by combining the vectors “Names”, “x”, and “y”. Rename “x” as “Result_1” and “y” as “Result_2”.

Answer-1:

Results <- data.frame(Names, x, y)
Results
##     Names  x  y
## 1     Rob 55 75
## 2   James 65 60
## 3 Alberto 90 92
## 4    Maya 86 80

Rename the columns of the data frame:

# way-1
colnames(Results) <- c("Names", "Result_1", "Result_2")
Results
##     Names Result_1 Result_2
## 1     Rob       55       75
## 2   James       65       60
## 3 Alberto       90       92
## 4    Maya       86       80
# way-2
Results <- data.frame(Names, x, y)
names(Results)[2] <- "Result_1"
names(Results)[3] <- "Result_2"
Results
##     Names Result_1 Result_2
## 1     Rob       55       75
## 2   James       65       60
## 3 Alberto       90       92
## 4    Maya       86       80
# way-3: by using dplyr
Results <- data.frame(Names, x, y)

library(dplyr)

Results <- Results %>% 
  rename(
    Result_1 = x,
    Result_2 = y
  )

Results
##     Names Result_1 Result_2
## 1     Rob       55       75
## 2   James       65       60
## 3 Alberto       90       92
## 4    Maya       86       80

Exercise-2:

Create a scatter plot with “Result_1” on the x-axis and “Result_2” on the y-axis.

Answer-2:

plot(Results$Result_1, Results$Result_2,
     main = "Scatter Plot of Result_1 and Result_2",
     xlab = "Result_1",
     ylab = "Result_2")

Exercise-3:

Perform a linear regression by regressing “Result_2” on “Result_1”. (Hint: use the lm() function.)

Answer-3:

model <- lm(Result_2 ~ Result_1, data = Results)

Exercise-4:

Plot the residuals and summarize the results.

Answer-4:

par(mfrow = c(2,2)) # arrange the next plots in a 2 x 2 grid
plot(model) # the four residual diagnostic plots

summary(model)
## 
## Call:
## lm(formula = Result_2 ~ Result_1, data = Results)
## 
## Residuals:
##       1       2       3       4 
##   8.788 -11.758   6.376  -3.406 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  35.7072    30.0681   1.188    0.357
## Result_1      0.5546     0.3987   1.391    0.299
## 
## Residual standard error: 11.57 on 2 degrees of freedom
## Multiple R-squared:  0.4917, Adjusted R-squared:  0.2376 
## F-statistic: 1.935 on 1 and 2 DF,  p-value: 0.2988

Exercise-5:

Interpret your results.

Answer-5:

From the summary of the fitted linear model, we find that Result_2 is positively associated with Result_1, because the estimated slope coefficient is positive (0.5546). However, the association is not statistically significant (p-value: 0.299). The fitted model explains less than half of the variability in Result_2 (R-squared: 0.4917, Adjusted R-squared: 0.2376).