Manipulating vectors in several ways is important in data science. The main ways to manipulate vectors are:
- Selecting and displaying certain parts
- Sorting and rearranging
- Returning logical values
Create a simple vector vec of numbers: 3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9
Run the code below and comment on each command and its output.
# Create vec
vec=c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)
# Code Comments
#-------------------------------------------------------------------------#
vec[1] # Code reads and prints the first item in the vector.
## [1] 3
vec[3] # Code reads and prints the third item in the vector.
## [1] 7
vec[4:7] # Code reads and prints the fourth to seventh items in the vector.
## [1] 5 3 2 6
vec[-7] # Code prints every item except the seventh, in order; vec itself is not modified.
## [1] 3 5 7 5 3 2 8 5 6 9
vec[c(1, 3, 5, 7, 9)] # Code reads and prints the first, third, fifth, seventh, and ninth items in the vector.
## [1] 3 7 3 6 5
# Code Comments
#-------------------------------------------------------------------------#
vec[c(-3, -8)] # Code prints every item except the third and eighth; the remaining items keep their order.
## [1] 3 5 5 3 2 6 5 6 9
vec[vec >3 ] # Code prints only the items greater than 3; items less than or equal to 3 are left out.
## [1] 5 7 5 6 8 5 6 9
vec[ vec <5 | vec >7 ] # Code prints every value in the vector that is less than 5 or greater than 7.
## [1] 3 3 2 8 9
A= vec[vec != 5 ] # Keeps every value that is not equal to 5 (the 5s are dropped) and assigns the result to A; the remaining values keep their order.
A
## [1] 3 7 3 2 6 8 6 9
B= vec[seq(1,length(vec), 2)] # Assigns to B every second item of vec, starting from the first: seq() generates the positions 1, 3, 5, ... up to the length of the vector.
B
## [1] 3 7 3 6 5 9
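The indexing forms above can be cross-checked against each other. The sketch below reuses the same `vec` and illustrates two points that are easy to miss: indexing returns a copy and leaves `vec` unchanged, and a logical condition inside the brackets selects the same elements as the positions returned by `which()`. The names `without_seventh`, `mask`, and `odd_positions` are illustrative only and are not part of the exercise.

```r
# Same vector as in the exercise above
vec <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)

# Negative indexing returns a shortened copy; vec itself still has 11 elements
without_seventh <- vec[-7]
length(without_seventh)  # 10
length(vec)              # 11

# A logical mask and which() select the same elements
mask <- vec > 3
identical(vec[mask], vec[which(mask)])  # TRUE

# seq(1, length(vec), by = 2) generates the odd positions 1, 3, 5, 7, 9, 11
odd_positions <- seq(1, length(vec), by = 2)
vec[odd_positions]
```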
You can use other commands on your object to help you extract various parts.
Create a simple vector x of the following:
NA, 3, 5, 7, 5, 3, 2, NA, 6, 8, 5, 6, 9
Run the code below and comment on each command and its output.
# Create x
x=c(NA, 3, 5, 7, 5, 3, 2, NA, 6, 8, 5, 6, 9)
# Code Comments
#-------------------------------------------------------------------------#
length(x) # Code reads and prints the number of elements/items in the vector, which in this case is 13.
## [1] 13
x[(length(x)-5) : length(x)] # Code slices the vector (:), extracting the last 6 elements of the vector.
## [1] NA 6 8 5 6 9
# Write the code that eliminates NAs from x and names the new vector y. Show y.
na.omit(x)
## [1] 3 5 7 5 3 2 6 8 5 6 9
## attr(,"na.action")
## [1] 1 8
## attr(,"class")
## [1] "omit"
y=(na.omit(x))
y
## [1] 3 5 7 5 3 2 6 8 5 6 9
## attr(,"na.action")
## [1] 1 8
## attr(,"class")
## [1] "omit"
# Write the code that gets the largest value of y using the built-in function max():
max(y)
## [1] 9
which(y==max(y)) # Code reads the vector and prints the index (position) of the largest value, which is 9, and its position is 11.
## [1] 11
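As the output above shows, the result of `na.omit()` carries `na.action` and `class` attributes along with the values. If a plain numeric vector is preferred, logical indexing with `is.na()` is one common alternative. The sketch below assumes the same `x`; the name `y_plain` is hypothetical and chosen only to avoid confusion with `y`.

```r
x <- c(NA, 3, 5, 7, 5, 3, 2, NA, 6, 8, 5, 6, 9)

# Logical indexing drops the NAs and returns a plain numeric vector,
# without the "na.action" attribute that na.omit() attaches
y_plain <- x[!is.na(x)]
y_plain
attributes(y_plain)  # NULL

# max() can skip NAs directly, so a cleaned copy is not strictly needed here
max(x, na.rm = TRUE)

# which.max() returns the position of the first maximum
which.max(y_plain)
```

Note that `which.max()` reports only the first position when the maximum occurs more than once, whereas `which(y == max(y))` reports all of them.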
Run the code below and comment on each command and its output.
# Code Comments
#-------------------------------------------------------------------------#
A=c(8, 9, 7, 9, NA) # Assigns the vector to variable A.
sort(A) # Sorts the values of A from smallest to largest; NAs are removed by default.
## [1] 7 8 9 9
order(A) # Returns the positions (indices) of the items of A in ascending order of value; the position of the NA (5) is placed last by default.
## [1] 3 1 2 4 5
sort(A, na.last = NA) # Sorts the values of A from smallest to largest; na.last = NA removes the NA.
## [1] 7 8 9 9
order(A, na.last=NA) # Returns the sorting positions of A and omits the position of the NA (5).
## [1] 3 1 2 4
sort(A, na.last = TRUE) # Sorts the values of A from smallest to largest, keeping the NA and placing it last.
## [1] 7 8 9 9 NA
order(A, na.last = TRUE) # Returns the sorting positions of A, with the position of the NA (5) placed last.
## [1] 3 1 2 4 5
?sort
# Code Comments
#-------------------------------------------------------------------------#
sort(A, na.last = FALSE) # Sorts the values of A from smallest to largest, keeping the NA and placing it first.
## [1] NA 7 8 9 9
order(A, na.last = FALSE) # Returns the sorting positions of A, with the position of the NA (5) placed first.
## [1] 5 3 1 2 4
sort(A, decreasing = TRUE) # Sorts the values of A from largest to smallest; the NA is removed.
## [1] 9 9 8 7
order(A, decreasing = TRUE) # Returns the positions of the items of A in decreasing order of value; the tied 9s (positions 2 and 4) keep their original order and the position of the NA (5) comes last.
## [1] 2 4 1 3 5
sort(A, decreasing = FALSE) # Sorts the values of A from smallest to largest (the default direction); the NA is removed.
## [1] 7 8 9 9
order(A, decreasing = FALSE) # Returns the positions of the items of A in ascending order of value, with the position of the NA (5) placed last.
## [1] 3 1 2 4 5
?order
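The difference between `sort()` and `order()` is easiest to see by feeding the result of `order()` back into the vector. The following is a minimal sketch using the same `A`; the name `idx` is illustrative.

```r
A <- c(8, 9, 7, 9, NA)

# order() returns positions, not values: position 3 holds the smallest value (7),
# position 1 the next (8), and so on; the NA at position 5 comes last by default
idx <- order(A)
idx

# Indexing A by those positions gives the same result as sort() with the NA kept last
A[idx]
identical(A[idx], sort(A, na.last = TRUE))  # TRUE

# With na.last = NA, order() drops the missing value, matching the default of sort()
identical(A[order(A, na.last = NA)], sort(A))  # TRUE
```

This is why `order()` is the function used to sort data frames: its positions can reorder whole rows, while `sort()` only returns the sorted values of a single vector.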
# Code Comments
#-------------------------------------------------------------------------#
X=c(16, NA, 22, 14, 21) # Creates new vector and assigns it to variable X.
Y=c("a", "b", "c", "d", NA) # Creates a second vector and assigns it to variable Y.
DF= data.frame(X,Y ) # Creates a data frame (table) with the values in both X and Y.
DF
## X Y
## 1 16 a
## 2 NA b
## 3 22 c
## 4 14 d
## 5 21 <NA>
sorted_DF = DF[order(X, Y),] # Sorts the rows of DF by X in ascending order, using Y to break ties; the row with an NA in X (row 2) is placed last.
sorted_DF
## X Y
## 4 14 d
## 1 16 a
## 5 21 <NA>
## 3 22 c
## 2 NA b
sorted_DF1 = DF[order(X, na.last=NA, Y),] # Sorts the rows by X, then Y; na.last = NA drops every row with an NA in either X or Y (rows 2 and 5).
sorted_DF1
## X Y
## 4 14 d
## 1 16 a
## 3 22 c
sorted_DF2 = DF[order(X, na.last=NA),] # Sorts the rows by X only and drops the row with an NA in X (row 2); row 5 stays because only X is checked.
sorted_DF2
## X Y
## 4 14 d
## 1 16 a
## 5 21 <NA>
## 3 22 c
sorted_DF3 = DF[order(Y, na.last = NA),] # Sorts the rows alphabetically by Y and drops the row with an NA in Y (row 5).
sorted_DF3
## X Y
## 1 16 a
## 2 NA b
## 3 22 c
## 4 14 d
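The calls above pass the free-standing vectors `X` and `Y` to `order()`, which works only as long as those vectors still exist in the workspace and match the data frame. A somewhat safer pattern, sketched below under the same data, refers to the columns through `DF$` and uses `complete.cases()` to drop incomplete rows. The names `by_X`, `complete_DF`, and `complete_sorted` are hypothetical.

```r
X <- c(16, NA, 22, 14, 21)
Y <- c("a", "b", "c", "d", NA)
DF <- data.frame(X, Y)

# Sort by column X, referring to the column inside DF rather than the global X
by_X <- DF[order(DF$X), ]
by_X

# Keep only rows with no missing value in any column, then sort by X
complete_DF <- DF[complete.cases(DF), ]
complete_sorted <- complete_DF[order(complete_DF$X), ]
complete_sorted
```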
# Code Comments
#-------------------------------------------------------------------------#
B= c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9) # Assigns new vector to variable B.
which(B==6) # Reads and returns the indexes or positions of the value 6 in the vector.
## [1] 7 10
B == 5 # Reads and indicates the positions of the value 5 in the vector as 'TRUE', where every other position is indicated as 'FALSE'.
## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
B > 5 # Reads and indicates the positions of values in the vector greater than 5 as 'TRUE' and the positions of values less than or equal to 5 as 'FALSE'.
## [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE
# Code Comments
#-------------------------------------------------------------------------#
B < 5 # Reads and prints the positions of values in the vector that are less than 5 as 'TRUE' and the values greater than or equal to 5 as 'FALSE'.
## [1] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
B > 5 & B < 8 # Returns 'TRUE' at the positions of values greater than 5 and less than 8, and 'FALSE' at the positions of values less than or equal to 5 or greater than or equal to 8.
## [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
B==3 | B != 3 # Returns 'TRUE' where a value equals 3 or does not equal 3; since every value satisfies one of the two conditions, the result is 'TRUE' at every position.
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
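Logical vectors like the ones above are often summarised rather than printed in full. The sketch below, using the same `B`, shows a few base R functions that reduce a logical vector to a count or a single answer.

```r
B <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)

# TRUE counts as 1, so sum() gives the number of matching elements
sum(B == 5)          # 3
sum(B > 5 & B < 8)   # 3

# any() and all() reduce a logical vector to a single value
any(B > 8)           # TRUE, because of the 9
all(B == 3 | B != 3) # TRUE, the condition holds for every element

# The same logical vector can be used to extract the matching values
B[B > 5 & B < 8]
```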
Look at the data table:
\[\begin{array}{ccccc}
\hline \\
Length & Speed & Algae & NO3 & BOD \\
\hline \\
20 & 12 & 40 & 2.25 & 200\\
21 & 14 & 45 & 2.15 & 180\\
22 & 12 & 45 & 1.75 & 135\\
23 & 16 & 80 & 1.95 & 120\\
21 & 20 & 75 & 1.95 & 110 \\
20 & 21 & 65 & 2.75 & 120\\
\hline
\end{array}\]
1. Use R to form a data.frame DF of this data set.
How do you check the type of this data table using R code?
# Form the columns as vectors
x=c(20, 21, 22, 23, 21, 20)
y=c(12, 14, 12, 16, 20, 21)
z=c(40, 45, 45, 80, 75, 65)
a=c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75)
b=c(200, 180, 135, 120, 110, 120)
DF=data.frame(x, y, z, a, b)
# Assign names to the columns and define the data frame DF
DF=data.frame(length=x, speed=y, algae=z, NO3=a, BOD=b)
# Show the data frame DF
DF
## length speed algae NO3 BOD
## 1 20 12 40 2.25 200
## 2 21 14 45 2.15 180
## 3 22 12 45 1.75 135
## 4 23 16 80 1.95 120
## 5 21 20 75 1.95 110
## 6 20 21 65 2.75 120
# Check the type of DF using the class( ) function.
class(DF) #type is "data.frame".
## [1] "data.frame"
2. How many variables are described ?
How many measurements do we have for each variable ?
Use R code to find these numbers.
# dimension of the data.frame
dim(DF)
## [1] 6 5
# number of columns
ncol(DF)
## [1] 5
# number of rows
nrow(DF)
## [1] 6
Comments
#The data frame has dimensions of 6 by 5 (6 rows and 5 columns).
#There are 5 variables described, one for each of the 5 columns in the data frame.
#There are 6 measurements for each variable, one for each of the 6 rows in the data frame.
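Beyond counting rows and columns, it can help to confirm what each column contains. The following sketch rebuilds the same data frame so that it runs on its own and uses standard base R inspection functions; the name `dims` is illustrative.

```r
DF <- data.frame(length = c(20, 21, 22, 23, 21, 20),
                 speed  = c(12, 14, 12, 16, 20, 21),
                 algae  = c(40, 45, 45, 80, 75, 65),
                 NO3    = c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75),
                 BOD    = c(200, 180, 135, 120, 110, 120))

# dim() reports the number of rows first, then the number of columns
dims <- dim(DF)
dims[1] == nrow(DF)
dims[2] == ncol(DF)

# names() lists the variables; sapply(DF, class) gives the type of each column
names(DF)
sapply(DF, class)

# str() prints a compact summary of dimensions, names, and types in one call
str(DF)
```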
3. Pick out the item from a row m and a column n.
DF[4,1] #Picks out the individual item from the fourth row of the first column of the data frame.
## [1] 23
4. Display a row p of the data frame.
Select the row p and display the columns one to four.
# Display a specific row p of the data frame.
head(DF, n=1) #Displays the first row of the data frame; a general row p is selected with DF[p, ].
## length speed algae NO3 BOD
## 1 20 12 40 2.25 200
# Display selected columns for a specific row
DF[3, 1:4] #identifies and displays the columns 1 to 4 for row 3.
## length speed algae NO3
## 3 22 12 45 1.75
5. Display all rows while selecting a specific column alone.
DF[1:6, 4]
## [1] 2.25 2.15 1.75 1.95 1.95 2.75
# Identifies and displays the measurements on all 6 rows for column 4.
6. Specify several rows and display all columns.
DF[1:4, 1:5]
## length speed algae NO3 BOD
## 1 20 12 40 2.25 200
## 2 21 14 45 2.15 180
## 3 22 12 45 1.75 135
## 4 23 16 80 1.95 120
# Identifies and displays the rows 1 to 4 for all 5 columns of the data frame.
7. Specify several rows and display all columns except the second.
DF[1:4,-2]
## length algae NO3 BOD
## 1 20 40 2.25 200
## 2 21 45 2.15 180
## 3 22 45 1.75 135
## 4 23 80 1.95 120
# Displays rows 1 to 4, omitting the second column while displaying the rest.
8. Specify several rows and one column using its name rather than a simple value.
# Comment: Displays all rows for the first column of the data.frame.
DF[, c("length")]
## [1] 20 21 22 23 21 20
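Selecting by name can be written in several equivalent ways, and it can be combined with a row range when only some rows are wanted. The sketch below rebuilds the data frame so that it is self-contained; it is one way to illustrate the idea, not the only one.

```r
DF <- data.frame(length = c(20, 21, 22, 23, 21, 20),
                 speed  = c(12, 14, 12, 16, 20, 21),
                 algae  = c(40, 45, 45, 80, 75, 65),
                 NO3    = c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75),
                 BOD    = c(200, 180, 135, 120, 110, 120))

# Three equivalent ways to pull one column out as a plain vector
DF[, "length"]
DF$length
DF[["length"]]

# Several rows (2 to 4) of one named column
DF[2:4, "length"]

# drop = FALSE keeps the result as a one-column data frame instead of a vector
DF[2:4, "length", drop = FALSE]
```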
9. Specify a single value with the data frame.
DF[3, 2]
## [1] 12
# Comment: Displays the single value at the third row, second column of the data frame.
10. Select a single row and a single column.
DF[2, 4]
## [1] 2.15
# Comment: Displays the value for the selected row (2) and column (4)
11. Use an R command to check if there are NA values.
is.na(DF)
## length speed algae NO3 BOD
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
#Reads the data frame and checks for NAs. In this case, there are none.
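Reading a full TRUE/FALSE matrix works for six rows but becomes impractical for larger tables. The sketch below shows base R functions that summarise the same check; it rebuilds the data frame so that it runs on its own.

```r
DF <- data.frame(length = c(20, 21, 22, 23, 21, 20),
                 speed  = c(12, 14, 12, 16, 20, 21),
                 algae  = c(40, 45, 45, 80, 75, 65),
                 NO3    = c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75),
                 BOD    = c(200, 180, 135, 120, 110, 120))

# anyNA() answers with a single TRUE or FALSE
anyNA(DF)

# colSums() over the logical matrix counts the missing values per column
colSums(is.na(DF))

# Total number of missing values in the whole data frame
sum(is.na(DF))
```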
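12. Rearrange the data frame by the 1st column in descending order. Display the sorted data.frame.
# order(length) fails to render because length here refers to the base R function, not the column; the column must be referenced as DF$length.
sorted_DF = DF[order(DF$length, decreasing = TRUE),]
sorted_DF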
13. Remove 2 columns and 3 rows from the data frame. Display the new data frame.
DF[-2:-4, -3:-4]
## length speed BOD
## 1 20 12 200
## 5 21 20 110
## 6 20 21 120
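The expression `-2:-4` can look ambiguous at first glance. The sketch below, which rebuilds the data frame so that it is self-contained, shows how R reads it and gives two alternative spellings of the same removal. The names `reduced` and `reduced_by_name` are hypothetical.

```r
DF <- data.frame(length = c(20, 21, 22, 23, 21, 20),
                 speed  = c(12, 14, 12, 16, 20, 21),
                 algae  = c(40, 45, 45, 80, 75, 65),
                 NO3    = c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75),
                 BOD    = c(200, 180, 135, 120, 110, 120))

# -2:-4 is parsed as (-2):(-4), i.e. the vector -2, -3, -4
identical(-2:-4, c(-2L, -3L, -4L))  # TRUE

# Writing the range first and negating it removes the same rows and columns
# and may be easier to read
reduced <- DF[-(2:4), -(3:4)]
reduced

# Columns can also be dropped by name instead of by position
reduced_by_name <- DF[-(2:4), !(names(DF) %in% c("algae", "NO3"))]
reduced_by_name
```

Dropping columns by name has the advantage that the code keeps working if the column order changes.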
14. Plot a graph that shows the variation of Speed with respect to Length. Display the graph with a title and labels for the axes.
library(ggplot2)
ggplot(DF, aes(x = length, y = speed)) +
  geom_point(color = "darkred") +
  labs(title = "Variation of Speed with respect to Length",
       x = "Length",
       y = "Speed")