josesa@ucr.edu) or Jericho Lawson
(jlaws011@ucr.edu) if you have any questions, using “[STAT
107]” in the subject line.
Use the data frame mtcars, which is a built-in data set
in R, to answer the questions below. Display the final objects
created/adjusted for each question.
1(a) Create a factor vector called vs
containing the entries of the vs variable of
mtcars. Make sure it has two levels: 0, 1. After you create
the vector vs, add at the end the line
print(vs) to display the resulting vector.
data("mtcars")
#head(mtcars)
vs= as.factor(mtcars$vs)
print(vs)
## [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
## Levels: 0 1
1(b) Create a factor vector called carb
from the carb column of mtcars. Have this be
an ordered factor with levels: 1 < 2 < 3 < 4 < 5 < 6 <
7 < 8. HINT: Some of these levels do not appear automatically. After
you create the vector carb, add at the end the line
print(carb) to display the resulting vector.
carb= factor(mtcars$carb,levels = 1:8, ordered = TRUE)
print(carb)
## [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8
1(c) Create a frequency table for the column
mpg.
table(mtcars$mpg)
##
## 10.4 13.3 14.3 14.7 15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.7 21
## 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2
## 21.4 21.5 22.8 24.4 26 27.3 30.4 32.4 33.9
## 2 1 2 1 1 1 2 1 1
1(d) Create a two way frequency table for the factor
vectors vs and carb.
table(vs,carb)
## carb
## vs 1 2 3 4 5 6 7 8
## 0 0 5 3 8 0 1 0 1
## 1 7 5 0 2 0 0 0 0
1(e) Redefine the vectors vs and
carb to be character vectors.
vs<- as.character(vs)
carb<- as.character(carb)
1(f) Create a two-way frequency
table for vs and carb again.
table(vs,carb)
## carb
## vs 1 2 3 4 6 8
## 0 0 5 3 8 1 1
## 1 7 5 0 2 0 0
1(g) Comment on the difference between the frequency table in 1d and 1f.
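The two tables differ because in 1(d) carb is a factor with levels 1 through 8, so table() prints a column for every level, including the levels 5 and 7 that have zero counts. In 1(f) carb is a character vector with no levels, so table() prints only the values that actually occur in the ‘mtcars’ data set.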
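The difference described in 1(g) can be reproduced on a small made-up vector. This is only a sketch: the values in x are hypothetical and are not taken from mtcars.

```r
# Hypothetical values; the levels 2 and 4 never occur in x
x <- c(1, 3, 3, 5)

# A factor remembers every declared level, so table() also reports zero counts
x_factor <- factor(x, levels = 1:5)
tab_factor <- table(x_factor)

# A character vector has no levels, so table() reports only observed values
x_char <- as.character(x_factor)
tab_char <- table(x_char)

print(tab_factor)
print(tab_char)
```

The factor table has five columns and the character table has three, which mirrors the 8 versus 6 columns seen for carb.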
1(h) Display everything inside the vector
carb except the second to last element.
carb[-31]
## [1] "4" "4" "1" "1" "2" "1" "4" "2" "2" "4" "4" "3" "3" "3" "4" "4" "4" "1" "2"
## [20] "1" "1" "2" "2" "4" "2" "1" "2" "2" "4" "6" "2"
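The index 31 above works because carb has 32 elements. A minimal sketch of computing the position from the vector length instead of hard-coding it, using a hypothetical vector v in place of carb:

```r
# Hypothetical vector standing in for carb
v <- c("a", "b", "c", "d", "e")

# Position of the second-to-last element, computed rather than hard-coded
drop_pos <- length(v) - 1

# Negative indexing drops that position and keeps everything else
v_without <- v[-drop_pos]
print(v_without)
```

Written this way, the same line keeps working if the length of the vector changes.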
Use the data frame mtcars, which is a built in data set
in R, to answer the questions below. Display the final objects
created/adjusted for each question.
2(a) Create a numeric vector called cyl
from the cyl column of mtcars. Create a
frequency table for the values of this vector.
cyl<- mtcars$cyl
table(cyl)
## cyl
## 4 6 8
## 11 7 14
2(b) First, we will be interested in calculating the
central behavior of the variable cyl. For
this, calculate the mean and median for the vector cyl.
(You can use any base R function for this.)
mean(cyl)
## [1] 6.1875
median(cyl)
## [1] 6
2(c) Store the results of the mean and median for
cyl in a list called measures_of_center.
Ensure you add names to the elements of the list indicating which
statistic is which. Then, use the function str() to display
the newly created list.
measures_of_center <- list(data_mu= mean(cyl),data_mid= median(cyl))
str(measures_of_center)
## List of 2
## $ data_mu : num 6.19
## $ data_mid: num 6
2(d) Now, we are interested in understanding the
variation (spread) of the variable cyl.
For this, calculate the standard deviation and the interquartile range
(also known as IQR) for the vector cyl. (You can use any
base R function for this.)
stand_dev <-sd(cyl)
stand_dev
## [1] 1.785922
output <- IQR(cyl)
output
## [1] 4
2(e) Store the results of the standard deviation and
IQR for the vector cyl in a list called
measures_of_spread. Ensure you add names to the elements of
the list indicating which statistic is which. Then, use the function
str() to display the newly created list.
measures_of_spread <-list(standard_deviation=stand_dev, IQR=output)
str(measures_of_spread)
## List of 2
## $ standard_deviation: num 1.79
## $ IQR : num 4
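Naming the list elements makes it possible to pull out each statistic by name later. A small sketch on hypothetical values (vals and spread_demo are illustration names, not part of the assignment):

```r
# Hypothetical values used only for illustration
vals <- c(4, 6, 8, 8)

# Build a named list, mirroring measures_of_spread
spread_demo <- list(standard_deviation = sd(vals), IQR = IQR(vals))

# Elements can be extracted by name with $ or with [[ ]]
iqr_value <- spread_demo$IQR
sd_value <- spread_demo[["standard_deviation"]]

print(names(spread_demo))
print(iqr_value)
```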
2(f) Create and store a frequency table for
cyl in an object called cyl_freq. Then, print
the cyl_freq table.
cyl_freq= table(cyl)
print(cyl_freq)
## cyl
## 4 6 8
## 11 7 14
3(a) Create a numeric matrix called
mymat that contains 0.1, 2, 4 in the first column, 7, 3,
100 in the second column, and 1, 0.9, 8 in the third column. Type the
name of the variable mymat so that it is displayed.
mymat=c(0.1,2,4,7,3,100,1,0.9,8)
mymat=matrix(mymat, nrow = 3, ncol = 3)
mymat
## [,1] [,2] [,3]
## [1,] 0.1 7 1.0
## [2,] 2.0 3 0.9
## [3,] 4.0 100 8.0
3(b) Display the first row of the matrix.
mymat[1,]
## [1] 0.1 7.0 1.0
3(c) Use operators to generate and display a matrix
of TRUE/FALSE values, where TRUE means the values in mymat
are strictly greater than 1 AND less than or equal to
pi.
mymat>1 & mymat<=pi
## [,1] [,2] [,3]
## [1,] FALSE FALSE FALSE
## [2,] TRUE TRUE FALSE
## [3,] FALSE FALSE FALSE
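A logical matrix like the one above can also be used to count or extract the matching entries. A minimal sketch that rebuilds the same matrix under the illustration name m:

```r
# Same values as mymat in 3(a), filled column by column
m <- matrix(c(0.1, 2, 4, 7, 3, 100, 1, 0.9, 8), nrow = 3, ncol = 3)

# Element-wise comparison: strictly greater than 1 AND at most pi
in_range <- m > 1 & m <= pi

# TRUE counts as 1, so sum() gives the number of matching entries
n_in_range <- sum(in_range)

# Logical indexing returns the matching values in column order
values_in_range <- m[in_range]

print(n_in_range)
print(values_in_range)
```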
3(d) Display the element in the second row and third column.
mymat[2,3]
## [1] 0.9
3(e) Replace the second column of the matrix by the column multiplied by 10. Display the resulting matrix.
mymat[,2]= 10*mymat[,2]
mymat
## [,1] [,2] [,3]
## [1,] 0.1 70 1.0
## [2,] 2.0 30 0.9
## [3,] 4.0 1000 8.0
3(f) Use the function data.frame() to
create a data frame object called mydata using the
following sets of data for each column (3 rows, 4 columns). The column
names (in order) should be: age, height, gender, and smoker. Make the
class of each column the following (in order): numeric, numeric, factor,
logical.
22, 25, 28
66, 71, 64
F, M, F
FALSE, TRUE, TRUE
mydata=data.frame(age= c(22,25,28), height= c(66,71,64), gender= factor(c('F','M','F')), smoker= c(FALSE, TRUE, TRUE))
mydata
## age height gender smoker
## 1 22 66 F FALSE
## 2 25 71 M TRUE
## 3 28 64 F TRUE
3(g) Change the row names of the data frame in 3f to be three names of your choice.
row.names(mydata)= c("John","Jack", "Joe")
mydata
## age height gender smoker
## John 22 66 F FALSE
## Jack 25 71 M TRUE
## Joe 28 64 F TRUE
3(h) Discuss two differences between a matrix and a data frame.
ANSWER: First, a matrix, like a vector, can hold data of only one type, whereas a data frame can hold a different type in each column (for instance logical, numeric, or character). Second, a matrix is indexed with square brackets, whereas a data frame can be indexed with square brackets and also with the dollar sign to extract a column by name.
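The first difference can be seen directly when numeric and character data are combined. A short sketch with hypothetical values (m_demo and d_demo are illustration names):

```r
# Combining numeric and character columns into a matrix coerces everything
# to a single type, here character
m_demo <- cbind(age = c(22, 25), gender = c("F", "M"))
matrix_type <- typeof(m_demo)

# A data frame keeps a separate type per column
d_demo <- data.frame(age = c(22, 25), gender = c("F", "M"))
age_is_numeric <- is.numeric(d_demo$age)

# Both support square brackets; only the data frame supports $
age_from_matrix <- m_demo[, "age"]
age_from_df <- d_demo$age

print(matrix_type)
print(age_is_numeric)
```

In the matrix the ages come back as text, while the data frame returns them as numbers.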
4(a) Download the data set Pokemon.csv
from Canvas. This data set comes from Kaggle.com. Display the
first 15 lines of this data set.
pokemon <- read.csv("C:/Users/pplec/Downloads/Pokemon.csv")
head(pokemon, 15)
## X. Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## 7 6 Charizard Fire Flying 534 78 84 78 109
## 8 6 CharizardMega Charizard X Fire Dragon 634 78 130 111 130
## 9 6 CharizardMega Charizard Y Fire Flying 634 78 104 78 159
## 10 7 Squirtle Water 314 44 48 65 50
## 11 8 Wartortle Water 405 59 63 80 65
## 12 9 Blastoise Water 530 79 83 100 85
## 13 9 BlastoiseMega Blastoise Water 630 79 103 120 135
## 14 10 Caterpie Bug 195 45 30 35 20
## 15 11 Metapod Bug 205 50 20 55 25
## Sp..Def Speed Generation Legendary
## 1 65 45 1 False
## 2 80 60 1 False
## 3 100 80 1 False
## 4 120 80 1 False
## 5 50 65 1 False
## 6 65 80 1 False
## 7 85 100 1 False
## 8 85 100 1 False
## 9 115 100 1 False
## 10 64 43 1 False
## 11 80 58 1 False
## 12 105 78 1 False
## 13 115 78 1 False
## 14 20 45 1 False
## 15 25 30 1 False
4(b) Use commands in R to determine how many Pokemon
have both HP > 100 and
Defense > 100.
sum(pokemon[,"HP"]>100 & pokemon[,"Defense"] >100)
## [1] 13
4(c) Display the 37th smallest value in the
HP column using R commands.
sort(pokemon$HP, decreasing=FALSE) [37]
## [1] 35
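The same idea on a tiny hypothetical vector shows how sort() handles ties and how order() can recover which row holds the k-th smallest value (hp_demo and k are illustration names):

```r
# Hypothetical HP values with a tie at 35
hp_demo <- c(50, 35, 80, 35, 20)
k <- 3

# sort() is increasing by default, so position k is the k-th smallest value
kth_smallest <- sort(hp_demo)[k]

# order() returns the original positions in sorted order
row_of_kth <- order(hp_demo)[k]

print(kth_smallest)
print(row_of_kth)
```

Tied values occupy separate positions in the sorted vector, so the k-th smallest value may repeat an earlier one.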
4(d) Create a new factor column in the Pokemon data
set called Mentality.
If the variable Attack is larger than the variable
Defense, assign this character’s Mentality = Aggressive.
If Attack is smaller than Defense, then assign Mentality = Protective.
Otherwise, assign Mentality = Balanced.
Display the first 15 rows of the Pokemon data set with this new variable. HINT: Create a new column and then use indexing and operators.
pokemon$Mentality= "Protective"
pokemon$Mentality[pokemon$Attack==pokemon$Defense] = "Balanced"
pokemon$Mentality[pokemon$Attack> pokemon$Defense] = "Aggressive"
# Convert the character column to a factor, as the question requires
pokemon$Mentality= factor(pokemon$Mentality)
head(pokemon,15)
## X. Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## 7 6 Charizard Fire Flying 534 78 84 78 109
## 8 6 CharizardMega Charizard X Fire Dragon 634 78 130 111 130
## 9 6 CharizardMega Charizard Y Fire Flying 634 78 104 78 159
## 10 7 Squirtle Water 314 44 48 65 50
## 11 8 Wartortle Water 405 59 63 80 65
## 12 9 Blastoise Water 530 79 83 100 85
## 13 9 BlastoiseMega Blastoise Water 630 79 103 120 135
## 14 10 Caterpie Bug 195 45 30 35 20
## 15 11 Metapod Bug 205 50 20 55 25
## Sp..Def Speed Generation Legendary Mentality
## 1 65 45 1 False Balanced
## 2 80 60 1 False Protective
## 3 100 80 1 False Protective
## 4 120 80 1 False Protective
## 5 50 65 1 False Aggressive
## 6 65 80 1 False Aggressive
## 7 85 100 1 False Aggressive
## 8 85 100 1 False Aggressive
## 9 115 100 1 False Aggressive
## 10 64 43 1 False Protective
## 11 80 58 1 False Protective
## 12 105 78 1 False Protective
## 13 115 78 1 False Protective
## 14 20 45 1 False Protective
## 15 25 30 1 False Protective
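The three Mentality rules can also be written as one nested ifelse() call wrapped in factor(). This is an alternative sketch, not the approach the hint asks for, and the data frame toy below is a hypothetical stand-in for the Pokemon data:

```r
# Hypothetical Attack/Defense values covering all three cases
toy <- data.frame(Attack = c(49, 62, 52, 100), Defense = c(49, 63, 43, 123))

# Nested ifelse(): Aggressive if Attack > Defense, Protective if smaller,
# Balanced when the two are equal
toy$Mentality <- factor(
  ifelse(toy$Attack > toy$Defense, "Aggressive",
         ifelse(toy$Attack < toy$Defense, "Protective", "Balanced")),
  levels = c("Aggressive", "Balanced", "Protective")
)

print(toy)
print(class(toy$Mentality))
```

Declaring the levels explicitly guarantees that all three categories exist in the factor even if one of them happens not to occur in the data.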
4(e) Notice that some pokemons have more than one
type (for example, grass and poison), while others have exclusively one
type. Create a new logical variable named
double_type that is TRUE if a pokemon has 2
types, and FALSE otherwise. Use the function
head() to show the first 10 lines of our data, only
including the variables Type.1, Type.2 and
double_type. Then answer: how many Pokemons do not have a
second type?
pokemon$double_type= (pokemon$Type.2) != ""
head(pokemon[,c("Type.1", "Type.2", "double_type")], 10)
## Type.1 Type.2 double_type
## 1 Grass Poison TRUE
## 2 Grass Poison TRUE
## 3 Grass Poison TRUE
## 4 Grass Poison TRUE
## 5 Fire FALSE
## 6 Fire FALSE
## 7 Fire Flying TRUE
## 8 Fire Dragon TRUE
## 9 Fire Flying TRUE
## 10 Water FALSE
sum(!pokemon$double_type)
## [1] 386
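The comparison with "" works here because read.csv() stores a missing second type as an empty string. If a file were imported so that missing types became NA instead (an assumption, for example when na.strings = "" is used), the comparison would return NA for those rows. A defensive sketch on a hypothetical vector type2_demo:

```r
# Hypothetical second-type column with both kinds of "missing"
type2_demo <- c("Poison", "", "Flying", NA, "")

# TRUE only when a second type is actually present
double_type_demo <- !is.na(type2_demo) & type2_demo != ""

# Number of entries without a second type
n_single <- sum(!double_type_demo)

print(double_type_demo)
print(n_single)
```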
In science, a very common problem consists of determining the relationships between two variables \(X\) and \(Y\). In this exercise, we will explore how one can use linear algebra operations to calculate the linear relationship between two variables.
5(a) Load the dataset Possum.csv saving
it to an object named possum_df, which contains data
corresponding to different possums, measuring their head length and body
length. When you import the data, it will be imported as a
data.frame object. Change the name of the columns to be
headLength and bodyLength respectively. Print
using R functions: 1) The number of observations in the dataset. 2) The
number of variables in the dataset. 3) The top 6 observations in the
dataset.
library(readxl) # read_excel() comes from the readxl package, which must be attached first
possum_df <- read_excel("C:/Users/pplec/OneDrive/Desktop/Stat107/HW2/Possum .xlsx")
colnames(possum_df)= c("headLength","bodyLength")
nrow(possum_df)
ncol(possum_df)
head(possum_df, 6)
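Since the question names a CSV file, base R's read.csv() can load it without any extra package. The sketch below is self-contained: it writes a tiny temporary file with hypothetical measurements (not the real Possum data) and then performs the steps asked for in 5(a) on an object called possum_demo.

```r
# Write a tiny hypothetical CSV so the sketch runs on its own
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(hl = c(94.1, 92.5, 94.0), bl = c(89.0, 91.5, 95.5)),
          tmp, row.names = FALSE)

# Load the file as a data.frame and rename the columns
possum_demo <- read.csv(tmp)
colnames(possum_demo) <- c("headLength", "bodyLength")

# 1) number of observations, 2) number of variables, 3) top 6 rows
n_obs <- nrow(possum_demo)
n_vars <- ncol(possum_demo)
print(n_obs)
print(n_vars)
print(head(possum_demo, 6))

# Remove the temporary file
unlink(tmp)
```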
5(b) Use the function plot() to
generate a scatterplot with bodyLength as the X-axis of the
plot, and headLength as the Y-axis of the plot. Make sure
to change the x-axis label to “Possum Body Length (cm)” and the
y-axis label to “Possum Head Length (mm)”, give an appropriate
title and caption. Below your plot, discuss: is there a relationship
between these two variables? What type of relationship do you think they
have?
plot(x= possum_df$bodyLength, y= possum_df$headLength, xlab = "Possum Body Length (cm)", ylab= "Possum Head Length (mm)", main = "Graph of Possum Body and Head Length", sub = "Positively Correlated")
Answer here: There is a positive relationship between the body and head length. As you can see, the x and y variables are increasing simultaneously. Due to this relationship, there is a positive correlation between x and y.
5(c) Let’s consider the relationship between the possum head/body length with the linear model:
\[ HL \sim b \cdot BL + a \]
where \(HL\) and \(BL\) represent the head and body length. Here, \(b\) represents the sign and magnitude of the linear relationship, and \(a\) represents the intercept. In the following, we explore how linear algebra operations can be used to estimate the linear relationship between these two variables.
Create a (104 x 2) matrix BL, which contains 2 columns.
The first column contains the value 1 repeated 104 times. The second
column contains the values of body-length from the dataset. Furthermore,
create a matrix HL of dimensions (104 x 1) which contains
the values of the head-length of the possums in the dataset.
BL = matrix(c(rep(1,104),possum_df$bodyLength),nrow = 104, ncol = 2)
HL= matrix(possum_df$headLength)
5(d) Use a combination of R matrix multiplication and transposition to calculate the matrix \(A = (BL)^T \cdot (BL)\). Print the resulting dimensions of the matrix \(A\).
A= t(BL)%*%(BL)
print(dim(A))
5(e) Use a combination of the matrix multiplication
operator (example: mat1 %*% mat2), matrix transposition
(example: t(mat1)), and matrix inversion (example:
solve(mat1)) to calculate the following: \[
c = A^{-1} \cdot (BL)^T \cdot HL
\] What are the dimensions of \(c\)? Print the object c.
c= solve(A)%*% t(BL)%*%(HL)
dim(c)
c
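The formula in 5(e) can be checked against R's built-in lm() on simulated data. This is a sketch under stated assumptions: the data below are randomly generated with a made-up intercept of 40 and slope of 0.6, not the possum measurements, and the result is stored as coef_hat rather than c so that the base function c() is not masked.

```r
set.seed(1)

# Simulated body and head lengths (hypothetical intercept 40 and slope 0.6)
n <- 50
body_len <- runif(n, 75, 95)
head_len <- 40 + 0.6 * body_len + rnorm(n, sd = 2)

# Design matrix with a column of ones, and the response as an n x 1 matrix
X <- cbind(1, body_len)
y <- matrix(head_len)

# Same steps as 5(d) and 5(e): A = t(X) X, then solve(A) t(X) y
A_demo <- t(X) %*% X
coef_hat <- solve(A_demo) %*% t(X) %*% y

# Reference fit from lm() for comparison
coef_lm <- coef(lm(head_len ~ body_len))

print(dim(coef_hat))
print(all.equal(as.numeric(coef_hat), as.numeric(coef_lm)))
```

The result is a 2 x 1 matrix whose first entry estimates the intercept and whose second entry estimates the slope, and it agrees with lm() up to numerical tolerance.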
5(f) The vector in part (e) corresponds to the
estimated coefficients of the linear regression. More specifically, the
entry c[1] corresponds to our estimation of \(a\), and c[2] represents our
estimation of \(b\). To verify that the
obtained values in c correctly represent the linear
relationship between body length and head length, recreate the plot you
created in part (b), but now use the function abline(a, b)
to add the linear regression line. If you need to learn how to use the
function abline(), feel free to use the help functionality
of Rstudio. Color the line with "red".
plot(x= possum_df$bodyLength, y= possum_df$headLength, xlab = "Possum Body Length (cm)", ylab= "Possum Head Length (mm)", main = "Graph of Possum Body and Head Length", sub = "Positively Correlated")
abline(a= 42.7,b= 0.57, col= "red")
Answer the following: Does the line you added represent the linear relationship between the variables? Do you think there are intuitive reasons why these two variables would be linearly related?
Answer here: Yes, I think the line represents the linear relationship between the variables, because a possum’s head and body need to be proportional to each other. A possum with a longer body tends to have a longer head, so head length grows roughly in step with body length.