R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

I. LEARN (30pts)

Manipulating vectors is an important skill in data science. The main ways to manipulate a vector are:

- Selecting and displaying certain parts
- Sorting and rearranging
- Returning logical values


I.1 (20pts)

Create a simple vector vec of numbers: 3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9

Run the code and comment on each line and its output.

# Create vec

vec=c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)
# Code            Comments
#-------------------------------------------------------------------------#

vec[1]          # Code reads and prints the first item in the vector.
## [1] 3
vec[3]          # Code reads and prints the third item in the vector.
## [1] 7
vec[4:7]        # Code reads and prints the fourth to seventh items in the vector.
## [1] 5 3 2 6
vec[-7]         # Code returns the vector without its seventh item and prints the remaining items in order; vec itself is not modified.
##  [1] 3 5 7 5 3 2 8 5 6 9
vec[c(1, 3, 5, 7, 9)]            # Code reads and prints the first, third, fifth, seventh, and ninth items in the vector.
## [1] 3 7 3 6 5
# Code                             Comments
#-------------------------------------------------------------------------#


vec[c(-3, -8)]                   # Code reads the vector and drops the third and eighth items, then prints the remaining items in order.
## [1] 3 5 5 3 2 6 5 6 9
vec[vec > 3]                     # Code keeps only the values greater than 3, dropping every value less than or equal to 3, and prints them in order.
## [1] 5 7 5 6 8 5 6 9
vec[vec < 5 | vec > 7]           # Code reads and prints every value in the vector that is less than 5 or greater than 7.
## [1] 3 3 2 8 9
A= vec[vec != 5 ]                # Assigns to A the values of vec that are not equal to 5; every 5 is dropped and the remaining values keep their order.
A
## [1] 3 7 3 2 6 8 6 9
B= vec[seq(1,length(vec), 2)]    # Assigns to B every second item of vec, starting at position 1 and stopping at the length of the vector (positions 1, 3, 5, 7, 9, 11).
B
## [1] 3 7 3 6 5 9
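
As a quick check on the comments above, the following sketch confirms that indexing returns a new vector and leaves `vec` unchanged, and contrasts `|` (either condition) with `&` (both conditions). The names `kept` and `outside` are illustrative and not part of the assignment.

```r
vec <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)

# Negative indexing returns a copy without the seventh item; vec is unchanged
kept <- vec[-7]
length(kept)   # 10
length(vec)    # 11

# `|` keeps values that satisfy either condition
outside <- vec[vec < 5 | vec > 7]
outside        # 3 3 2 8 9

# `&` requires both conditions; no value is both below 5 and above 7
vec[vec < 5 & vec > 7]   # numeric(0)

# Every second position from seq() matches the explicit index vector
identical(vec[seq(1, length(vec), 2)], vec[c(1, 3, 5, 7, 9, 11)])  # TRUE
```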

I.2 (10pts)

You can use other commands on your object to help you extract various parts.

Create a simple vector x of the following:

NA, 3, 5, 7, 5, 3, 2, NA, 6, 8, 5, 6, 9

Run the code and comment on each line and its output.

# Create x

x=c(NA, 3, 5, 7, 5, 3, 2, NA,  6, 8, 5, 6, 9)
# Code                                           Comments
#-------------------------------------------------------------------------#

length(x)                              # Code reads and prints the number of elements/items in the vector, which in this case is 13.
## [1] 13
x[(length(x)-5) : length(x)]           # Code slices the vector (:), extracting the last 6 elements of the vector.
## [1] NA  6  8  5  6  9
# Write the code that eliminates the NAs from x and names the new vector y. Show y.

na.omit(x)
##  [1] 3 5 7 5 3 2 6 8 5 6 9
## attr(,"na.action")
## [1] 1 8
## attr(,"class")
## [1] "omit"
y=(na.omit(x))
y
##  [1] 3 5 7 5 3 2 6 8 5 6 9
## attr(,"na.action")
## [1] 1 8
## attr(,"class")
## [1] "omit"
# Write the code that gets the largest value of y using the built-in function max( ):


max(y)
## [1] 9
which(y==max(y))                      # Code reads the vector and prints the index (position) of the largest value, which is 9, and its position is 11.
## [1] 11
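
The output of `na.omit(x)` above carries extra `na.action` and `class` attributes. A minimal sketch of an alternative, assuming a plain numeric vector is wanted, is to index with `!is.na(x)`; the names `y_plain` and `y_omit` are illustrative only.

```r
x <- c(NA, 3, 5, 7, 5, 3, 2, NA, 6, 8, 5, 6, 9)

# Logical indexing drops the NAs and returns a plain numeric vector
y_plain <- x[!is.na(x)]
y_plain                 # 3 5 7 5 3 2 6 8 5 6 9
attributes(y_plain)     # NULL

# na.omit() keeps the same values but adds attributes; as.vector() strips them
y_omit <- na.omit(x)
identical(as.vector(y_omit), y_plain)   # TRUE

# max() can also skip the NAs directly on x
max(x, na.rm = TRUE)    # 9

# which.max() returns the position of the first maximum
which.max(y_plain)      # 11
```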

II. Practice (40pts)

You can rearrange the items in a vector using the sort( ) command.

You can locate the position of each item along the vector using the order() command.

Run the code and comment on each line and its output.


II.1 (10 pts)

# Code                                           Comments
#-------------------------------------------------------------------------#


A=c(8, 9, 7, 9, NA)                   # Assigns the vector to variable A.



sort(A)                               # Rearranges the items in vector A by smallest to largest, removing the NAs.
## [1] 7 8 9 9
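order(A)                              # Returns the positions (indices) of the items of A in ascending order of value: the smallest value, 7, is at position 3, and the NA, at position 5, is placed last.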
## [1] 3 1 2 4 5
sort(A, na.last = NA)                 # Rearranges the items in A from smallest to largest; na.last = NA removes the NA.
## [1] 7 8 9 9
order(A, na.last=NA)                  # Returns the positions of the items of A in ascending order of value, leaving out position 5 because it holds the NA.
## [1] 3 1 2 4
sort(A, na.last = TRUE)               # Rearranges the items in A from smallest value to largest, keeping the NA in the last spot.
## [1]  7  8  9  9 NA
order(A, na.last = TRUE)              # Returns the positions of the items of A in ascending order of value, with the position of the NA (5) placed last.
## [1] 3 1 2 4 5
?sort
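
The link between the two commands can be sketched as follows: `order()` returns positions, and indexing the vector by those positions gives the sorted values. The name `idx` is illustrative only.

```r
A <- c(8, 9, 7, 9, NA)

# order() returns positions: the smallest value sits at position 3, the NA at 5
idx <- order(A)
idx       # 3 1 2 4 5

# Indexing A by those positions gives the sorted values, NA last
A[idx]    # 7 8 9 9 NA

# With na.last = NA the result matches sort(A), which drops the NA by default
identical(A[order(A, na.last = NA)], sort(A))   # TRUE
```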

II.2 (10 pts)

# Code                                           Comments
#-------------------------------------------------------------------------#




sort(A, na.last = FALSE)               # Rearranges the items in A from smallest to largest, and changes the NA's position from last in the vector to first.
## [1] NA  7  8  9  9
order(A, na.last = FALSE)              # Returns the positions of the items of A in ascending order of value, with the position of the NA (5) placed first.
## [1] 5 3 1 2 4
sort(A, decreasing = TRUE)             # Rearranges the items in A in decreasing order, from largest to smallest.
## [1] 9 9 8 7
order(A, decreasing = TRUE)            # Returns the positions of the items of A in descending order of value; the position of the NA (5) is still placed last.
## [1] 2 4 1 3 5
sort(A, decreasing = FALSE)             # Rearranges the items in A, ordered from smallest value to largest, excluding the NA.
## [1] 7 8 9 9
order(A, decreasing = FALSE)            # Returns the positions of the items of A in ascending order of value (the default), with the position of the NA (5) placed last.
## [1] 3 1 2 4 5
?order

II.3 (10pts)

# Code                                           Comments
#-------------------------------------------------------------------------#


X=c(16, NA, 22, 14, 21)                # Creates new vector and assigns it to variable X.


Y=c("a", "b", "c", "d", NA)            # Creates a second vector and assigns it to variable Y.


DF= data.frame(X,Y )                   # Creates a data frame (table) with the values in both X and Y.
DF                                 
##    X    Y
## 1 16    a
## 2 NA    b
## 3 22    c
## 4 14    d
## 5 21 <NA>
sorted_DF = DF[order(X, Y),]              # Sorts the rows of the data frame in ascending order of X, using Y to break ties; the row with an NA in X is placed last.

sorted_DF
##    X    Y
## 4 14    d
## 1 16    a
## 5 21 <NA>
## 3 22    c
## 2 NA    b
sorted_DF1 = DF[order(X, na.last=NA, Y),]              # Sorts the rows by X, then by Y; na.last = NA drops every row with an NA in either X or Y (rows 2 and 5).

sorted_DF1
##    X Y
## 4 14 d
## 1 16 a
## 3 22 c
sorted_DF2 = DF[order(X, na.last=NA),]              # Sorts the rows by X only, dropping row 2, which has an NA in X; row 5 is kept because its NA is in Y.

sorted_DF2
##    X    Y
## 4 14    d
## 1 16    a
## 5 21 <NA>
## 3 22    c
sorted_DF3 = DF[order(Y, na.last = NA),]              # Sorts the rows alphabetically by Y, dropping row 5, which has an NA in Y.

sorted_DF3
##    X Y
## 1 16 a
## 2 NA b
## 3 22 c
## 4 14 d
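
The calls above pass the standalone vectors `X` and `Y` to `order()`. A minimal sketch of an equivalent approach, assuming one prefers to refer to the columns through the data frame and to filter NAs explicitly with `complete.cases()`, is shown below; the name `complete_DF` is illustrative only.

```r
X <- c(16, NA, 22, 14, 21)
Y <- c("a", "b", "c", "d", NA)
DF <- data.frame(X, Y)

# Referring to the column as DF$X avoids depending on a separate X vector
DF[order(DF$X), ]                      # rows 4, 1, 5, 3, 2 (NA in X last)

# complete.cases() flags the rows that have no NA in any column
complete_DF <- DF[complete.cases(DF), ]

# Sorting the complete rows by X gives rows 4, 1, 3
complete_DF[order(complete_DF$X), ]
```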

Returning Logical values from a Vector

II.4 (10pts)

# Code                                           Comments
#-------------------------------------------------------------------------#


B= c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)          # Assigns new vector to variable B.


which(B==6)                                    # Reads and returns the indexes or positions of the value 6 in the vector.
## [1]  7 10
B == 5                                        # Reads and indicates the positions of the value 5 in the vector as 'TRUE', where every other position is indicated as 'FALSE'.
##  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
B > 5                                         # Reads and indicates the positions of values in the vector greater than 5 as 'TRUE' and the positions of values less than or equal to 5 as 'FALSE'.
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE

# Code                                           Comments
#-------------------------------------------------------------------------#




B <  5                                # Reads and prints the positions of values in the vector that are less than 5 as 'TRUE' and the values greater than or equal to 5 as 'FALSE'.
##  [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
B > 5 & B < 8                         # Reads and indicates the positions of values greater than 5 and less than 8 as 'TRUE', and the positions of values less than or equal to 5, or greater than or equal to 8, as 'FALSE'.
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
B==3 |  B != 3                        # Every value either equals 3 or does not equal 3, so the combined condition is 'TRUE' at every position.
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
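
Logical vectors like the ones above can also be summarized. The sketch below uses the base functions `sum()`, `any()`, `all()`, and `which()` on the same vector `B`; it is an illustration, not part of the assignment.

```r
B <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)

# TRUE counts as 1, so sum() counts the matches
sum(B == 5)            # 3

# any() and all() reduce a logical vector to a single value
any(B > 8)             # TRUE, because 9 is greater than 8
all(B == 3 | B != 3)   # TRUE, the condition holds for every value

# which() converts TRUE positions to indices
which(B > 5 & B < 8)   # 3 7 10
```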

III. Solve (50pts)


Look at the data table:

\[\begin{array}{ccccc} \hline \\ Length & Speed & Algae & NO3 & BOD \\ \hline \\ 20 & 12 & 40 & 2.25 & 200\\ 21 & 14 & 45 & 2.15 & 180\\ 22 & 12 & 45 & 1.75 & 135\\ 23 & 16 & 80 & 1.95 & 120\\ 21 & 20 & 75 & 1.95 & 110 \\ 20 & 21 & 65 & 2.75 & 120\\ \hline \end{array}\]


1. Use R to form a data.frame DF of this data set.

How do you check the type of this data table using R code?

# Form the columns as vectors

x=c(20, 21, 22, 23, 21, 20)

y=c(12, 14, 12, 16, 20, 21)

z=c(40, 45, 45, 80, 75, 65)

a=c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75)
 
b=c(200, 180, 135, 120, 110, 120)

DF=data.frame(x, y, z, a, b)
# Assign names to the columns and define the data frame DF

DF=data.frame(length=x, speed=y, algae=z, NO3=a, BOD=b)
# Show the data frame  DF

DF
##   length speed algae  NO3 BOD
## 1     20    12    40 2.25 200
## 2     21    14    45 2.15 180
## 3     22    12    45 1.75 135
## 4     23    16    80 1.95 120
## 5     21    20    75 1.95 110
## 6     20    21    65 2.75 120
# Check the type of DF using the class( ) function.

class(DF) #type is "data.frame".
## [1] "data.frame"

2. How many variables are described?

How many measurements do we have for each variable?

Use R code to find these numbers.

# dimension of the data.frame

dim(DF)
## [1] 6 5
# number of columns

ncol(DF)
## [1] 5
# number of rows

nrow(DF) 
## [1] 6

Comments

# The data frame has dimensions of 6 by 5 (6 rows and 5 columns).
# There are 5 variables described, one for each of the 5 columns in the data frame.
# There are 6 measurements for each variable, one for each of the 6 rows.


3. Pick out the item from a row m and a column n.

DF[4,1] #Picks out the individual item from the fourth row of the first column of the data frame.
## [1] 23

4. Display a row p of the data frame.

Select the row p and display the columns one to four.

# Display a specific row  p of the data frame.

head(DF, n=1) #Displays the first row of the data frame; more generally, DF[p, ] displays row p.
##   length speed algae  NO3 BOD
## 1     20    12    40 2.25 200
# Display selected columns for a specific row

DF[3, 1:4] #identifies and displays the columns 1 to 4 for row 3.
##   length speed algae  NO3
## 3     22    12    45 1.75

5. Display all rows while selecting a specific column alone.

DF[1:6, 4]
## [1] 2.25 2.15 1.75 1.95 1.95 2.75
# Identifies and displays the measurements on all 6 rows for column 4.

6. Specify several rows and display all columns.

DF[1:4, 1:5]
##   length speed algae  NO3 BOD
## 1     20    12    40 2.25 200
## 2     21    14    45 2.15 180
## 3     22    12    45 1.75 135
## 4     23    16    80 1.95 120
# Identifies and displays the rows 1 to 4 for all 5 columns of the data frame.

7. Specify several rows except the second column.

DF[1:4,-2]
##   length algae  NO3 BOD
## 1     20    40 2.25 200
## 2     21    45 2.15 180
## 3     22    45 1.75 135
## 4     23    80 1.95 120
# Displays rows 1 to 4, omitting the second column while displaying the rest.

8. Specify several rows and one column using its name rather than a simple value.

# Comment: Displays all rows for the first column of the data.frame.

DF[, c("length")]
## [1] 20 21 22 23 21 20
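
The call above returns all rows. To select only several rows by name-based column indexing, a row index can be added, as in this sketch; the row choice `c(1, 3, 5)` and the name `one_col` are arbitrary illustrations.

```r
DF <- data.frame(length = c(20, 21, 22, 23, 21, 20),
                 speed  = c(12, 14, 12, 16, 20, 21),
                 algae  = c(40, 45, 45, 80, 75, 65),
                 NO3    = c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75),
                 BOD    = c(200, 180, 135, 120, 110, 120))

# Rows 1, 3, and 5 of the column named "length", returned as a vector
DF[c(1, 3, 5), "length"]          # 20 22 21

# drop = FALSE keeps the result as a one-column data frame
one_col <- DF[c(1, 3, 5), "length", drop = FALSE]
class(one_col)                    # "data.frame"

# The $ operator selects the same column by name
DF$length[c(1, 3, 5)]             # 20 22 21
```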

9. Specify a single value with the data frame.

DF[3, 2]
## [1] 12
# Comment: Displays the single value at the third row, second column of the data frame.

10. Select a single row and a single column.

DF[2, 4]
## [1] 2.15
# Comment: Displays the value for the selected row (2) and column (4)

11. Use an R command to check if there are NA values.

is.na(DF)
##      length speed algae   NO3   BOD
## [1,]  FALSE FALSE FALSE FALSE FALSE
## [2,]  FALSE FALSE FALSE FALSE FALSE
## [3,]  FALSE FALSE FALSE FALSE FALSE
## [4,]  FALSE FALSE FALSE FALSE FALSE
## [5,]  FALSE FALSE FALSE FALSE FALSE
## [6,]  FALSE FALSE FALSE FALSE FALSE
#Reads the data frame and checks for NAs. In this case, there are none.
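
For larger data frames the full `is.na()` matrix is hard to scan. A minimal sketch of more compact checks follows. `DF_na` is a hypothetical copy with one value deliberately set to NA, used only to show what a positive result looks like; it is not part of the data set.

```r
DF <- data.frame(length = c(20, 21, 22, 23, 21, 20),
                 speed  = c(12, 14, 12, 16, 20, 21),
                 algae  = c(40, 45, 45, 80, 75, 65),
                 NO3    = c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75),
                 BOD    = c(200, 180, 135, 120, 110, 120))

# A single TRUE/FALSE answer for the whole data frame
any(is.na(DF))          # FALSE

# Number of NAs in each column
colSums(is.na(DF))      # 0 for every column

# Hypothetical copy with one missing value, for illustration only
DF_na <- DF
DF_na[2, "NO3"] <- NA
any(is.na(DF_na))       # TRUE
colSums(is.na(DF_na))   # 1 for NO3, 0 elsewhere
```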

12. Rearrange the data frame by the 1st column in a descending order. Display the sorted data.frame

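# order(length) fails because length there refers to the base function length(), not the column.
# Referring to the column as DF$length and setting decreasing = TRUE sorts the rows in descending order.

sorted_DF = DF[order(DF$length, decreasing = TRUE),]
sorted_DF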

13. Remove 2 columns and 3 rows from the data frame. Display the new data frame.

DF[-2:-4, -3:-4]
##   length speed BOD
## 1     20    12 200
## 5     21    20 110
## 6     20    21 120
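
The same removal can be written with column names instead of positions, which may be easier to read. This is a sketch under that assumption; `drop_rows`, `keep_cols`, and `by_name` are illustrative names.

```r
DF <- data.frame(length = c(20, 21, 22, 23, 21, 20),
                 speed  = c(12, 14, 12, 16, 20, 21),
                 algae  = c(40, 45, 45, 80, 75, 65),
                 NO3    = c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75),
                 BOD    = c(200, 180, 135, 120, 110, 120))

# -2:-4 expands to c(-2, -3, -4); -(2:4) states the same thing more explicitly
drop_rows <- -(2:4)

# Keep every column whose name is not in the set to remove
keep_cols <- !(names(DF) %in% c("algae", "NO3"))

by_name <- DF[drop_rows, keep_cols]
by_name   # rows 1, 5, 6 with columns length, speed, BOD
```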

14. Plot a graph that shows the variation of Speed with respect to Length. Display the graph with a title and labels for the axes.

library(ggplot2)

ggplot(DF, aes(x = length, y = speed)) +
  geom_point(color = "darkred") +                    
  labs(title = "Variation of Speed with respect to Length",
       x = "Length",
       y = "Speed")