Instructions

Exercise 1: Factors

Use the data frame mtcars, which is a built in data set in R, to answer the questions below. Display the final objects created/adjusted for each question.

1(a) Create a factor vector called vs containing the entries of the vs variable of mtcars. Make sure it has two levels: 0, 1. After you create the vector vs, add at the end the line print(vs) to display the resulting vector.

data("mtcars")
#head(mtcars)
vs= as.factor(mtcars$vs)
print(vs)
##  [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
## Levels: 0 1

1(b) Create a factor vector called carb from the carb column of mtcars. Have this be an ordered factor with levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8. HINT: Some of these levels do not appear automatically. After you create the vector carb, add at the end the line print(carb) to display the resulting vector.

carb= factor(mtcars$carb, levels = 1:8, ordered = TRUE)
print(carb)
##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8
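
As a small illustration of the hint, here is a sketch on a made-up vector x (an assumption for illustration, not part of the assignment data). Levels that never occur have to be listed explicitly, and ordered = TRUE is what makes comparisons between levels meaningful.

```r
# Hypothetical toy vector; the value 3 never occurs
x <- c(1, 2, 2, 4)

# Listing the levels explicitly keeps the unused level 3;
# ordered = TRUE records that 1 < 2 < 3 < 4
f <- factor(x, levels = 1:4, ordered = TRUE)

levels(f)      # all four levels, including the unused "3"
is.ordered(f)  # TRUE
f < "4"        # element-wise comparison, allowed because f is ordered
```

Without ordered = TRUE, the last comparison would return NA with a warning, because an unordered factor has no notion of "less than".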

1(c) Create a frequency table for the column mpg.

table(mtcars$mpg)
## 
## 10.4 13.3 14.3 14.7   15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.7   21 
##    2    1    1    1    1    2    1    1    1    1    1    1    1    2    1    2 
## 21.4 21.5 22.8 24.4   26 27.3 30.4 32.4 33.9 
##    2    1    2    1    1    1    2    1    1

1(d) Create a two way frequency table for the factor vectors vs and carb.

table(vs,carb)
##    carb
## vs  1 2 3 4 5 6 7 8
##   0 0 5 3 8 0 1 0 1
##   1 7 5 0 2 0 0 0 0

1(e) Redefine the vectors vs and carb to be character vectors.

vs<- as.character(vs)
carb<- as.character(carb)

1(f) Create a two-way frequency table for vs and carb again.

table(vs,carb)
##    carb
## vs  1 2 3 4 6 8
##   0 0 5 3 8 1 1
##   1 7 5 0 2 0 0

1(g) Comment on the difference between the frequency table in 1d and 1f.

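The two tables differ because in 1d carb is a factor with levels 1 through 8, so table() includes a column for every level, including the unused levels 5 and 7 (with counts of zero). In 1f, vs and carb are character vectors, so table() lists only the values that actually occur in the mtcars data set.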
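
The same behavior can be reproduced on a tiny made-up factor (a sketch; the vector below is hypothetical and not taken from mtcars):

```r
# Hypothetical toy factor with an unused level "3"
f <- factor(c(1, 2, 2, 4), levels = 1:4)

names(table(f))                # every level, used or not
names(table(as.character(f)))  # only the values that occur
names(table(droplevels(f)))    # same effect while staying a factor
```

droplevels() is one way to remove unused levels without giving up the factor type.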

1(h) Display everything inside the vector carb except the second to last element.

carb[-31]
##  [1] "4" "4" "1" "1" "2" "1" "4" "2" "2" "4" "4" "3" "3" "3" "4" "4" "4" "1" "2"
## [20] "1" "1" "2" "2" "4" "2" "1" "2" "2" "4" "6" "2"
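
The index 31 works here because carb has 32 elements. A sketch of a version that does not hard-code the position, shown on a hypothetical toy vector v:

```r
# Hypothetical toy vector
v <- c("a", "b", "c", "d", "e")

# Drop the second-to-last element, whatever the length of v
v[-(length(v) - 1)]
```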

Exercise 2: Calculations Review and Factors

Use the data frame mtcars, which is a built in data set in R, to answer the questions below. Display the final objects created/adjusted for each question.

2(a) Create a numeric vector called cyl from the cyl column of mtcars. Create a frequency table for the values of this vector.

cyl<- mtcars$cyl
table(cyl)
## cyl
##  4  6  8 
## 11  7 14

2(b) First, we will be interested in calculating the central behavior of the variable cyl. For this, calculate the mean and median for the vector cyl. (You can use any base R function for this.)

mean(cyl)
## [1] 6.1875
median(cyl)
## [1] 6

2(c) Store the results of the mean and median for cyl in a list called measures_of_center. Ensure you add names to the elements of the list indicating which statistic is which. Then, use the function str() to display the newly created list.

measures_of_center <- list(data_mu= mean(cyl),data_mid= median(cyl))
str(measures_of_center)
## List of 2
##  $ data_mu : num 6.19
##  $ data_mid: num 6

2(d) Now, we are interested in understanding the variation (spread) of the variable cyl. For this, calculate the standard deviation and the interquartile range (also known as IQR) for the vector cyl. (You can use any base R function for this.)

stand_dev <-sd(cyl)
stand_dev
## [1] 1.785922
output <- IQR(cyl)
output
## [1] 4

2(e) Store the results of the standard deviation and IQR for the vector cyl in a list called measures_of_spread. Ensure you add names to the elements of the list indicating which statistic is which. Then, use the function str() to display the newly created list.

measures_of_spread <-list(standard_deviation=stand_dev, IQR=output)
str(measures_of_spread)
## List of 2
##  $ standard_deviation: num 1.79
##  $ IQR               : num 4

2(f) Create and store a frequency table for cyl in an object called cyl_freq. Then, print the cyl_freq table.

cyl_freq= table(cyl)
print(cyl_freq)
## cyl
##  4  6  8 
## 11  7 14
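
A stored frequency table can be reused for further summaries. The following sketch (not required by the exercise) turns the counts into proportions and reads off the most frequent value:

```r
cyl_freq <- table(mtcars$cyl)

# Relative frequencies (proportions that sum to 1)
prop.table(cyl_freq)

# Most frequent value (the mode), returned as a character label
names(cyl_freq)[which.max(cyl_freq)]
```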

Exercise 3: Matrices and Data Frames

3(a) Create a numeric matrix called mymat that contains 0.1, 2, 4 in the first column, 7, 3, 100 in the second column, and 1, 0.9, 8 in the third column. Type the name of the variable mymat so that it is displayed.

mymat=c(0.1,2,4,7,3,100,1,0.9,8) 
mymat=matrix(mymat, nrow = 3, ncol = 3)
mymat
##      [,1] [,2] [,3]
## [1,]  0.1    7  1.0
## [2,]  2.0    3  0.9
## [3,]  4.0  100  8.0

3(b) Display the first row of the matrix.

mymat[1,]
## [1] 0.1 7.0 1.0

3(c) Use operators to generate and display a matrix of TRUE/FALSE values, where TRUE means the values in mymat are strictly greater than 1 AND less than or equal to pi.

mymat>1 & mymat<=pi
##       [,1]  [,2]  [,3]
## [1,] FALSE FALSE FALSE
## [2,]  TRUE  TRUE FALSE
## [3,] FALSE FALSE FALSE
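
The logical matrix can also be used for counting and indexing. A short sketch, rebuilding mymat as defined in 3(a); the name in_range is introduced here only for illustration:

```r
mymat <- matrix(c(0.1, 2, 4, 7, 3, 100, 1, 0.9, 8), nrow = 3, ncol = 3)

in_range <- mymat > 1 & mymat <= pi   # & works element by element

sum(in_range)                         # number of entries in (1, pi]
mymat[in_range]                       # the entries themselves, in column order
which(in_range, arr.ind = TRUE)       # their row/column positions
```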

3(d) Display the element in the second row and third column.

mymat[2,3]
## [1] 0.9

3(e) Replace the second column of the matrix by the column multiplied by 10. Display the resulting matrix.

mymat[,2]= 10*mymat[,2]
mymat
##      [,1] [,2] [,3]
## [1,]  0.1   70  1.0
## [2,]  2.0   30  0.9
## [3,]  4.0 1000  8.0

3(f) Use the function data.frame() to create a data frame object called mydata using the following sets of data for each column (3 rows, 4 columns). The column names (in order) should be: age, height, gender, and smoker. Make the class of each column the following (in order): numeric, numeric, factor, logical.

  • 22, 25, 28

  • 66, 71, 64

  • F, M, F

  • FALSE, TRUE, TRUE

mydata=data.frame(age= c(22,25,28), height= c(66,71,64), gender= factor(c('F','M','F')), smoker= c(FALSE, TRUE, TRUE))
mydata
##   age height gender smoker
## 1  22     66      F  FALSE
## 2  25     71      M   TRUE
## 3  28     64      F   TRUE

3(g) Change the row names of the data frame in 3f to be three names of your choice.

row.names(mydata)= c("John","Jack", "Joe")
mydata
##      age height gender smoker
## John  22     66      F  FALSE
## Jack  25     71      M   TRUE
## Joe   28     64      F   TRUE

3(h) Discuss two differences between a matrix and a data frame.

Answer here: First, a matrix, like a vector, can hold data of only one type, whereas a data frame can hold a different type in each column (for instance logical, numeric, or character). Second, a matrix is indexed with square brackets, whereas a data frame can be indexed with square brackets and also with the dollar sign to extract a column by name.
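
Both differences can be checked directly. A minimal sketch with made-up values (the objects m and d are hypothetical):

```r
# A matrix coerces mixed input to a single type (here: character)
m <- cbind(age = c(22, 25), smoker = c("no", "yes"))
typeof(m)

# A data frame keeps one type per column
d <- data.frame(age = c(22, 25), smoker = c(FALSE, TRUE))
sapply(d, class)

# Both support [row, column] indexing; only the data frame supports $
m[, "age"]
d$age
```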

Exercise 4: Indexing with Pokemon

4(a) Download the data set Pokemon.csv from Canvas. This data set comes from Kaggle.com. Display the first 15 lines of this data set.

pokemon <- read.csv("C:/Users/pplec/Downloads/Pokemon.csv")
head(pokemon, 15)
##    X.                      Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1   1                 Bulbasaur  Grass Poison   318 45     49      49      65
## 2   2                   Ivysaur  Grass Poison   405 60     62      63      80
## 3   3                  Venusaur  Grass Poison   525 80     82      83     100
## 4   3     VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5   4                Charmander   Fire          309 39     52      43      60
## 6   5                Charmeleon   Fire          405 58     64      58      80
## 7   6                 Charizard   Fire Flying   534 78     84      78     109
## 8   6 CharizardMega Charizard X   Fire Dragon   634 78    130     111     130
## 9   6 CharizardMega Charizard Y   Fire Flying   634 78    104      78     159
## 10  7                  Squirtle  Water          314 44     48      65      50
## 11  8                 Wartortle  Water          405 59     63      80      65
## 12  9                 Blastoise  Water          530 79     83     100      85
## 13  9   BlastoiseMega Blastoise  Water          630 79    103     120     135
## 14 10                  Caterpie    Bug          195 45     30      35      20
## 15 11                   Metapod    Bug          205 50     20      55      25
##    Sp..Def Speed Generation Legendary
## 1       65    45          1     False
## 2       80    60          1     False
## 3      100    80          1     False
## 4      120    80          1     False
## 5       50    65          1     False
## 6       65    80          1     False
## 7       85   100          1     False
## 8       85   100          1     False
## 9      115   100          1     False
## 10      64    43          1     False
## 11      80    58          1     False
## 12     105    78          1     False
## 13     115    78          1     False
## 14      20    45          1     False
## 15      25    30          1     False

4(b) Use commands in R to determine how many Pokemon have both HP > 100 and Defense > 100.

sum(pokemon[,"HP"]>100 & pokemon[,"Defense"] >100)
## [1] 13

4(c) Display the 37th smallest value in the HP column using R commands.

sort(pokemon$HP, decreasing=FALSE) [37]
## [1] 35
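
For reference, a sketch of the same idea on a small made-up vector (hp and k are hypothetical stand-ins for the HP column and the rank 37). Base R's sort() also accepts a partial argument, which only guarantees that position k holds the value a full sort would put there:

```r
# Hypothetical toy data standing in for the HP column
hp <- c(45, 60, 80, 39, 58, 78, 44, 59, 79)

k <- 3
sort(hp)[k]                # k-th smallest via a full sort
sort(hp, partial = k)[k]   # same value; only position k is guaranteed sorted
```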

4(d) Create a new factor column in the Pokemon data set called Mentality.

  • If the variable Attack is larger than the variable Defense, assign this character’s Mentality = Aggressive.
  • If Attack is smaller than Defense then assign Mentality = Protective.
  • If they are equal assign Mentality = Balanced.

Display the first 15 rows of the Pokemon data set with this new variable. HINT: Create a new column and then use indexing and operators.

pokemon$Mentality= "Protective"
pokemon$Mentality[pokemon$Attack==pokemon$Defense] = "Balanced"
pokemon$Mentality[pokemon$Attack> pokemon$Defense] = "Aggressive"
pokemon$Mentality= factor(pokemon$Mentality)  # the question asks for a factor column
head(pokemon,15)
##    X.                      Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1   1                 Bulbasaur  Grass Poison   318 45     49      49      65
## 2   2                   Ivysaur  Grass Poison   405 60     62      63      80
## 3   3                  Venusaur  Grass Poison   525 80     82      83     100
## 4   3     VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5   4                Charmander   Fire          309 39     52      43      60
## 6   5                Charmeleon   Fire          405 58     64      58      80
## 7   6                 Charizard   Fire Flying   534 78     84      78     109
## 8   6 CharizardMega Charizard X   Fire Dragon   634 78    130     111     130
## 9   6 CharizardMega Charizard Y   Fire Flying   634 78    104      78     159
## 10  7                  Squirtle  Water          314 44     48      65      50
## 11  8                 Wartortle  Water          405 59     63      80      65
## 12  9                 Blastoise  Water          530 79     83     100      85
## 13  9   BlastoiseMega Blastoise  Water          630 79    103     120     135
## 14 10                  Caterpie    Bug          195 45     30      35      20
## 15 11                   Metapod    Bug          205 50     20      55      25
##    Sp..Def Speed Generation Legendary  Mentality
## 1       65    45          1     False   Balanced
## 2       80    60          1     False Protective
## 3      100    80          1     False Protective
## 4      120    80          1     False Protective
## 5       50    65          1     False Aggressive
## 6       65    80          1     False Aggressive
## 7       85   100          1     False Aggressive
## 8       85   100          1     False Aggressive
## 9      115   100          1     False Aggressive
## 10      64    43          1     False Protective
## 11      80    58          1     False Protective
## 12     105    78          1     False Protective
## 13     115    78          1     False Protective
## 14      20    45          1     False Protective
## 15      25    30          1     False Protective
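
An alternative to the three indexed assignments is to map the sign of Attack minus Defense onto labels in one step. This is only a sketch on made-up values; the vectors attack and defense are hypothetical stand-ins for the data set columns.

```r
# Hypothetical toy values standing in for the Attack and Defense columns
attack  <- c(49, 62, 52, 30)
defense <- c(49, 63, 43, 35)

# sign() gives -1, 0 or 1; factor() maps those to the three labels
mentality <- factor(sign(attack - defense),
                    levels = c(-1, 0, 1),
                    labels = c("Protective", "Balanced", "Aggressive"))
mentality
```

This version produces a factor directly and keeps all three levels even if one of them happens not to occur.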

4(e) Notice that some Pokemon have more than one type (for example, grass and poison), while others have exclusively one type. Create a new logical variable named double_type that is TRUE if a Pokemon has 2 types, and FALSE otherwise. Use the function head() to show the first 10 lines of our data, only including the variables Type.1, Type.2 and double_type. Then answer: how many Pokemon do not have a second type?

pokemon$double_type= (pokemon$Type.2) != ""
head(pokemon[,c("Type.1", "Type.2", "double_type")], 10)
##    Type.1 Type.2 double_type
## 1   Grass Poison        TRUE
## 2   Grass Poison        TRUE
## 3   Grass Poison        TRUE
## 4   Grass Poison        TRUE
## 5    Fire              FALSE
## 6    Fire              FALSE
## 7    Fire Flying        TRUE
## 8    Fire Dragon        TRUE
## 9    Fire Flying        TRUE
## 10  Water              FALSE
sum(!pokemon$double_type)
## [1] 386
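
The comparison with "" works because read.csv() kept the missing second type as an empty string. A sketch of both situations on a made-up column (type2 and type2_na are hypothetical names):

```r
# Hypothetical toy column: "" marks a missing second type, as in the CSV
type2 <- c("Poison", "", "Flying", "")

double_type <- type2 != ""
sum(!double_type)            # count of entries without a second type

# If blanks had been read as NA instead (e.g. read.csv(..., na.strings = "")),
# the test would be is.na() rather than a comparison with ""
type2_na <- ifelse(type2 == "", NA, type2)
sum(is.na(type2_na))
```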

Exercise 5: Linear Algebra For Data Analysis

In science, a very common problem consists of determining the relationships between two variables \(X\) and \(Y\). In this exercise, we will explore how one can use linear algebra operations to calculate the linear relationship between two variables.

5(a) Load the dataset Possum.csv saving it to an object named possum_df, which contains data corresponding to different possums, measuring their head length and body length. When you import the data, it will be imported as a data.frame object. Change the name of the columns to be headLength and bodyLength respectively. Print using R functions: 1) The number of observations in the dataset. 2) The number of variables in the dataset. 3) The top 6 observations in the dataset.

library(readxl)  # read_excel() lives in the readxl package
possum_df <- read_excel("C:/Users/pplec/OneDrive/Desktop/Stat107/HW2/Possum .xlsx")
possum_df <- as.data.frame(possum_df)  # read_excel() returns a tibble; convert to a plain data frame
colnames(possum_df)= c("headLength","bodyLength")
nrow(possum_df)
ncol(possum_df)
head(possum_df, 6)

5(b) Use the function plot() to generate a scatterplot with bodyLength as the X-axis of the plot, and headLength as the Y-axis of the plot. Make sure to change the x-axis label to “Possum Body Length (cm)” and the y-axis label to “Possum Head Length (mm)”, give an appropriate title and caption. Below your plot, discuss: is there a relationship between these two variables? What type of relationship do you think they have?

plot(x= possum_df$bodyLength, y= possum_df$headLength, xlab = "Possum Body Length (cm)", ylab= "Possum Head Length (mm)", main = "Graph of Possum Body and Head Length", sub = "Positively Correlated")

Answer here: There is a positive relationship between body length and head length: as one increases, the other tends to increase as well, so the two variables are positively correlated.

5(c) Let's consider the relationship between the possum head/body length with the linear model:

\[ HL \sim b \cdot BL + a \]

where \(HL\) and \(BL\) represent the head and body length. Here, \(b\) represents the sign and magnitude of the linear relationship, and \(a\) represents the intercept. In the following, we explore how linear algebra operations can be used to estimate the linear relationship between these two variables.

Create a (104 x 2) matrix BL, which contains 2 columns. The first column contains the value 1 repeated 104 times. The second column contains the values of body-length from the dataset. Furthermore, create a matrix HL of dimensions (104 x 1) which contains the values of the head-length of the possums in the dataset.

BL = matrix(c(rep(1,104),possum_df$bodyLength),nrow = 104, ncol = 2)
HL= matrix(possum_df$headLength)

5(d) Use a combination of R matrix multiplication and transposition to calculate the matrix \(A = (BL)^T \cdot (BL)\). Print the resulting dimensions of the matrix \(A\).

A= t(BL)%*%(BL)
print(dim(A))
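
For context, a brief sketch of where \(A\) comes from. This is standard least-squares reasoning and is not stated in the assignment itself. Writing \(c = (a, b)^T\), the model predicts \(BL \cdot c\), and the least-squares estimate minimizes the squared error

\[ S(c) = (HL - BL \cdot c)^T (HL - BL \cdot c). \]

Setting the gradient with respect to \(c\) to zero gives

\[ -2 \, (BL)^T (HL - BL \cdot c) = 0 \quad \Longrightarrow \quad (BL)^T (BL) \cdot c = (BL)^T \cdot HL, \]

so that, when \(A = (BL)^T \cdot (BL)\) is invertible,

\[ c = A^{-1} \cdot (BL)^T \cdot HL. \]

Since \(BL\) has dimensions (104 x 2), \(A\) has dimensions (2 x 2).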

5(e) Use a combination of the matrix multiplication operator (example: mat1 %*% mat2), matrix transposition (example: t(mat1)), and matrix inversion (example: solve(mat1)) to calculate the following: \[ c = A^{-1} \cdot (BL)^T \cdot HL \] What are the dimensions of \(c\)? Print the object c.

# The assignment names this object c. It shadows base c() only as a variable; calls such as c(1, 2) still work.
c= solve(A)%*% t(BL)%*%(HL)
dim(c)
c
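
The calculation can be checked against lm() on synthetic data. The sketch below uses simulated values (an assumption for illustration, not the possum measurements) and stores the result in coefs rather than c to avoid reusing the name of a base function:

```r
# Synthetic data (an assumption for illustration; not the possum measurements)
set.seed(1)
x <- runif(50, min = 75, max = 95)
y <- 40 + 0.6 * x + rnorm(50)

X <- cbind(1, x)                  # column of ones, then the predictor
A <- t(X) %*% X                   # 2 x 2
coefs <- solve(A) %*% t(X) %*% y  # 2 x 1: intercept, then slope

dim(coefs)
cbind(normal_eq = as.vector(coefs), lm = unname(coef(lm(y ~ x))))

# Equivalent and numerically preferable: solve the system without inverting A
solve(A, t(X) %*% y)
```

The two columns of the comparison should agree up to floating-point error, which confirms that the matrix formula reproduces the least-squares fit.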

5(f) The vector in part (e) corresponds to the estimated coefficients of the linear regression. More specifically, the entry c[1] corresponds to our estimation of \(a\), and c[2] represents our estimation of \(b\). To verify that the obtained values in c correctly represent the linear relationship between body length and head length, recreate the plot you created in part (b), but now use the function abline(a, b) to add the linear regression line. If you need to learn how to use the function abline(), feel free to use the help functionality of Rstudio. Color the line with "red".

plot(x= possum_df$bodyLength, y= possum_df$headLength, xlab = "Possum Body Length (cm)", ylab= "Possum Head Length (mm)", main = "Graph of Possum Body and Head Length", sub = "Positively Correlated")
# Use the coefficients estimated in part (e) instead of hard-coded values (originally a = 42.7, b = 0.57)
abline(a= c[1], b= c[2], col= "red")

Answer the following: Does the line you added represent the linear relationship between the variables? Do you think there are intuitive reasons why these two variables would be linearly related?

Answer here: Yes, I think the line represents the linear relationship between the variables, because a possum's head and body need to be proportional to each other. A possum's features constrain both its body length and its head length, so a longer body would be expected to come with a longer head. Body and head are proportional.