Data Mining with R Assignment 4

Instructions: This is an independent reading assignment. Read pages 87 to 96. Stop at “Data Visualization.” Remember to work through all algorithms/codes.

1. What is a model?

This word is used in different contexts. According to Wikipedia: Scientific modelling is an activity that produces models representing empirical objects, phenomena, and physical processes, to make a particular part or feature of the world easier to understand, define, quantify, visualize, or simulate. It requires selecting and identifying relevant aspects of a situation in the real world and then developing a model to replicate a system with those features.

2. What are the five groups of tasks of modeling in data mining?

Exploratory data analysis
Dependency modeling
Clustering
Anomaly detection
Predictive analytics

3. Typically, what does a data miner do?

A data miner searches for interesting, unexpected, and useful relationships in a dataset. This helps to find unusual patterns in the data or its key characteristics.

4. Most data mining techniques can be bifurcated into groups. What are those techniques?

Searching for relationships among the features/columns of the dataset
Searching for relationships among the observations/rows of the dataset

5. What is a main goal of exploratory data analysis?

To provide useful summaries of a dataset that highlight characteristics of the data the user may find relevant.

6. Most datasets have a dimensionality that makes it very difficult for a standard user to inspect the full data and find interesting properties of these data. TRUE or FALSE?

TRUE

7. What are data summaries?

Providing overviews of key properties of the data and describing important properties of the distribution of the values across the observations in a dataset.

8. The summarise() function is a function of which package?

dplyr

9.

Run the following code (below). Study the data. Explain the dataset.

library (DMwR2)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library (dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data (algae)
algae

## # A tibble: 200 × 18
##    season size  speed   mxPH  mnO2    Cl    NO3   NH4  oPO4   PO4  Chla    a1
##    <fct>  <fct> <fct>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 winter small medium  8      9.8  60.8  6.24  578   105   170   50      0  
##  2 spring small medium  8.35   8    57.8  1.29  370   429.  559.   1.3    1.4
##  3 autumn small medium  8.1   11.4  40.0  5.33  347.  126.  187.  15.6    3.3
##  4 spring small medium  8.07   4.8  77.4  2.30   98.2  61.2 139.   1.4    3.1
##  5 autumn small medium  8.06   9    55.4 10.4   234.   58.2  97.6 10.5    9.2
##  6 winter small high    8.25  13.1  65.8  9.25  430    18.2  56.7 28.4   15.1
##  7 summer small high    8.15  10.3  73.2  1.54  110    61.2 112.   3.2    2.4
##  8 autumn small high    8.05  10.6  59.1  4.99  206.   44.7  77.4  6.9   18.2
##  9 winter small medium  8.7    3.4  22.0  0.886 103.   36.3  71    5.54  25.4
## 10 winter small high    7.93   9.9   8    1.39    5.8  27.2  46.6  0.8   17  
## # ℹ 190 more rows
## # ℹ 6 more variables: a2 <dbl>, a3 <dbl>, a4 <dbl>, a5 <dbl>, a6 <dbl>,
## #   a7 <dbl>

data (iris)
iris

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

summary (algae)

##     season       size       speed         mxPH            mnO2       
##  autumn:40   large :45   high  :84   Min.   :5.600   Min.   : 1.500  
##  spring:53   medium:84   low   :33   1st Qu.:7.700   1st Qu.: 7.725  
##  summer:45   small :71   medium:83   Median :8.060   Median : 9.800  
##  winter:62                           Mean   :8.012   Mean   : 9.118  
##                                      3rd Qu.:8.400   3rd Qu.:10.800  
##                                      Max.   :9.700   Max.   :13.400  
##                                      NA's   :1       NA's   :2       
##        Cl               NO3              NH4                oPO4       
##  Min.   :  0.222   Min.   : 0.050   Min.   :    5.00   Min.   :  1.00  
##  1st Qu.: 10.981   1st Qu.: 1.296   1st Qu.:   38.33   1st Qu.: 15.70  
##  Median : 32.730   Median : 2.675   Median :  103.17   Median : 40.15  
##  Mean   : 43.636   Mean   : 3.282   Mean   :  501.30   Mean   : 73.59  
##  3rd Qu.: 57.824   3rd Qu.: 4.446   3rd Qu.:  226.95   3rd Qu.: 99.33  
##  Max.   :391.500   Max.   :45.650   Max.   :24064.00   Max.   :564.60  
##  NA's   :10        NA's   :2        NA's   :2          NA's   :2       
##       PO4              Chla               a1              a2        
##  Min.   :  1.00   Min.   :  0.200   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 41.38   1st Qu.:  2.000   1st Qu.: 1.50   1st Qu.: 0.000  
##  Median :103.29   Median :  5.475   Median : 6.95   Median : 3.000  
##  Mean   :137.88   Mean   : 13.971   Mean   :16.92   Mean   : 7.458  
##  3rd Qu.:213.75   3rd Qu.: 18.308   3rd Qu.:24.80   3rd Qu.:11.375  
##  Max.   :771.60   Max.   :110.456   Max.   :89.80   Max.   :72.600  
##  NA's   :2        NA's   :12                                        
##        a3               a4               a5               a6        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 1.550   Median : 0.000   Median : 1.900   Median : 0.000  
##  Mean   : 4.309   Mean   : 1.992   Mean   : 5.064   Mean   : 5.964  
##  3rd Qu.: 4.925   3rd Qu.: 2.400   3rd Qu.: 7.500   3rd Qu.: 6.925  
##  Max.   :42.800   Max.   :44.600   Max.   :44.400   Max.   :77.600  
##                                                                     
##        a7        
##  Min.   : 0.000  
##  1st Qu.: 0.000  
##  Median : 1.000  
##  Mean   : 2.495  
##  3rd Qu.: 2.400  
##  Max.   :31.600  
##

summary (iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

The algae dataset has 200 observations/rows (water samples) and 18 variables/columns. The variables are 3 contextual variables (season, size, and speed) that describe each water sample, 8 chemical concentration measurements (mxPH, mnO2, Cl, NO3, NH4, oPO4, PO4, Chla), and 7 frequency measurements of harmful algae (a1 to a7).

What is the algae dataset about?

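It is about water samples described by 3 contextual variables (season, size, and speed), 8 chemical concentration measurements, and the frequencies of 7 harmful algae (a1 to a7). The summary() function in the code above shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each numeric variable, the count of each level for the factor variables, and the number of NA's in each column.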

Explain the characteristics of the iris dataset

The iris dataset has 150 observations and 5 variables. It describes 150 iris flowers (50 from each species: setosa, versicolor, and virginica) and lists 4 measurements (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for each. The summary() function in the code above shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each numeric variable, and the number of flowers per species.

Did you discover any correlation between any pair of features in the iris dataset?

Yes, based on the result of the following code, petal length and petal width are highly correlated (correlation coefficient = 0.9629). The next-highest correlation is between sepal length and petal length (0.8718).

iris_cor <- cor(iris[,1:4])
iris_cor

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

10. What does the summarise() function do?

It is used to apply any function that produces a scalar value to any column of a data frame table. For example we can find the mean value of a column (or several columns) of the data.
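As a minimal sketch of this idea, the chunk below applies two scalar-valued functions to one column of the built-in iris data. The output column names (avgPetalLength, sdPetalLength) are made up for illustration.

```r
library(dplyr)

# Each argument of summarise() must evaluate to a single value per (group of) rows.
iris_summary <- summarise(iris,
                          avgPetalLength = mean(Petal.Length),
                          sdPetalLength  = sd(Petal.Length))
print(iris_summary)
```

The result is a one-row data frame with one column per requested statistic.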

11. We can use the functions, summarise_each() and funs(), to perform what kind of task?

To apply a set of functions to all columns of a data frame table.
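Note that summarise_each() and funs() are deprecated in current dplyr releases. The sketch below shows what is probably the closest modern equivalent, assuming dplyr 1.0 or later, where across() takes the columns and a named list of functions. The object name all_stats is only illustrative.

```r
library(dplyr)

# Apply mean() and median() to the four numeric columns of iris.
# With a named list, the output columns are named <column>_<function>.
all_stats <- iris %>%
  summarise(across(Sepal.Length:Petal.Width,
                   list(mean = mean, median = median)))
print(all_stats)
```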

12. What is the task of the group_by() function? This function is included in which package?

It forms sub-groups of a dataset using all combinations of the values of one or more nominal variables. It is in the dplyr package.

13. Which function will you use if you want to study potential differences among the sub-groups?

The summarise() (or summarize()) function, applied after the group_by() function.
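A short sketch of this combination on the iris data, comparing the median petal length across species. The names by_species and medPetalLength are illustrative only.

```r
library(dplyr)

# Group the rows by Species, then compute one summary row per group.
by_species <- iris %>%
  group_by(Species) %>%
  summarise(medPetalLength = median(Petal.Length),
            n = n())
print(by_species)
```

Each species gets its own row, which makes differences among the sub-groups easy to read off.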

14. The top algorithm/code chunk on page 90 (Code 4) gives us a way to create a function to obtain the mode of a variable. Go through this algorithm.

Mode <- function(x,na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x,ux)))])
}
Mode(algae$mxPH, na.rm = TRUE)

## [1] 8

Mode(algae$season)

## [1] winter
## Levels: autumn spring summer winter

Steps done in the Mode function:

If the na.rm argument is TRUE, the NA values are removed from x.
The unique values of x are found and stored in ux.
match(x, ux) gives, for each value of x, its position in ux.
tabulate() counts how many times each position (i.e. each value of ux) occurs in x.
which.max() finds the position in ux with the highest count (the first such position if there is a tie).
The value of ux at that position is returned as the mode.
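The steps above can be traced one at a time on a small, made-up vector (the values below are not from the book). The comments show the intermediate result of each step.

```r
x <- c(3, 1, 3, NA, 2, 3, 1)   # hypothetical input with one NA

x      <- x[!is.na(x)]         # step 1: 3 1 3 2 3 1
ux     <- unique(x)            # step 2: 3 1 2
idx    <- match(x, ux)         # step 3: 1 2 1 3 1 2
counts <- tabulate(idx)        # step 4: 3 2 1
pos    <- which.max(counts)    # step 5: 1
mode_value <- ux[pos]          # step 6: 3
print(mode_value)
```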

Now, replace

“algae$mxPH” with “iris$Sepal.Length” and “algae$season” with “iris$Petal.Length”

Copy and/or take a screenshot of your results for both and include them in this assignment (I only need the first 20 to 40 rows of each sub-group).

Mode <- function(x,na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x,ux)))])
}
head(iris,20)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa

cat('The mode of sepal length is:\n')

## The mode of sepal length is:

Mode(iris$Sepal.Length, na.rm = TRUE)

## [1] 5

cat('The mode of petal length is:\n')

## The mode of petal length is:

Mode(iris$Petal.Length)

## [1] 1.4

15. Explain the centralValue() function. What does it do?

It obtains a statistic of centrality of a variable given a sample of its values. It returns the median in the case of numeric variables and the mode for nominal variables.

16.

Explain the inter-quartile range (IQR).

It is the width of the interval that contains the central 50% of the values of a continuous variable, i.e. IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles.

Explain the x-quantile.

It is the value below which there are x% of the observed values.

What does a large value of the IQR mean?

That the central values of the data are spread over a large range.

What does a small value of the IQR mean?

That the central values of the data are spread over a small range, i.e. the values are tightly packed.
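As a small illustration on iris$Sepal.Length, the IQR can be computed by hand from the quartiles and compared with the built-in IQR() function. The variable names are illustrative.

```r
# Quartiles of sepal length (25% and 75% quantiles).
q <- quantile(iris$Sepal.Length, probs = c(0.25, 0.75))

iqr_manual  <- unname(q[2] - q[1])          # Q3 - Q1
iqr_builtin <- IQR(iris$Sepal.Length)       # same quantity from base R
full_range  <- diff(range(iris$Sepal.Length))  # max - min, for comparison

print(c(IQR = iqr_manual, range = full_range))
```

The IQR only depends on the central half of the data, whereas the range is determined entirely by the two most extreme values.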

17. Which measure of spread, or variability, is more susceptible to outliers?

range

18.

Using the Iris dataset, obtain the quantiles of the variable (or feature), Length, by Species.

data(iris)
sepal_len_q <- aggregate(iris$Sepal.Length, list(Species= iris$Species), quantile)
cat('Sepal length quantiles with respect to species are:\n')

## Sepal length quantiles with respect to species are:

sepal_len_q

##      Species  x.0% x.25% x.50% x.75% x.100%
## 1     setosa 4.300 4.800 5.000 5.200  5.800
## 2 versicolor 4.900 5.600 5.900 6.300  7.000
## 3  virginica 4.900 6.225 6.500 6.900  7.900

petal_len_q <- aggregate(iris$Petal.Length, list(Species= iris$Species), quantile)
cat('Petal length quantiles with respect to species are:\n')

## Petal length quantiles with respect to species are:

petal_len_q

##      Species  x.0% x.25% x.50% x.75% x.100%
## 1     setosa 1.000 1.400 1.500 1.575  1.900
## 2 versicolor 3.000 4.000 4.350 4.600  5.100
## 3  virginica 4.500 5.100 5.550 5.875  6.900

Which package provides the better grouping facilities, baseR or dplyr? Which function is the best to use?

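dplyr generally provides the better grouping facilities.

aggregate() or by(). Both are base R functions; they are the ones to use when the function applied to each group returns a vector rather than a scalar, such as quantile() in the code above.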

19. Find the Mode of the subgroup, “iris$Species.”

The following code gives setosa as the mode. However, each species occurs 50 times in this dataset, so setosa, versicolor, and virginica are tied and there is no unique mode for this sub-group. The function returns setosa only because which.max() picks the first of the tied values.

Mode <- function(x,na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x,ux)))])
}
Mode(iris$Species, na.rm = TRUE)

## [1] setosa
## Levels: setosa versicolor virginica
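The tie can be checked directly with a frequency table. The sketch below lists every level that reaches the maximum count; the names species_counts and tied are illustrative.

```r
# Count how often each species occurs.
species_counts <- table(iris$Species)

# Keep every level whose count equals the maximum, not just the first one.
tied <- names(species_counts)[species_counts == max(species_counts)]

print(species_counts)
print(tied)
```

When more than one level is returned, the data has no unique mode.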

20.

What are “pipes?”

Pipes are a way to create a sequence of multiple operations. In other words, we chain multiple operations in a concise and expressive way. The output of one expression/operation is used as the input of the next operation. The pipe operator is part of the magrittr package.

What is the “piping syntax?”

We place %>% between each pair of consecutive operations (or sometimes expressions). This way we do not need to fill in the first argument of each subsequent operation.

What is the “pipe operator” (%>%)? Use other resources (Google, …) to help you answer these questions.

It indicates that the output of the expression/operation on the left side of %>% will be used as the first argument of the operation on the right side of it.
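A minimal sketch comparing a nested call with the equivalent pipe, using magrittr directly. The particular chain (square root, mean, rounding) is arbitrary and only meant to show the syntax.

```r
library(magrittr)

# Nested form: read from the inside out.
nested <- round(mean(sqrt(iris$Sepal.Length)), 2)

# Piped form: read left to right; each result becomes the first
# argument of the next call.
piped <- iris$Sepal.Length %>% sqrt() %>% mean() %>% round(2)

print(c(nested = nested, piped = piped))
```

Both forms compute the same value; the piped version simply lists the operations in the order they are applied.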

21. In Code 9, the second chunk of code from the top of page 92, interpret “Species = iris$Species,” which is in the second argument of the aggregate ( ) function. What does it all mean?

It means that we calculate the quantiles of the variable Sepal.Length (the first argument of the aggregate() function) by Species (the second argument of the aggregate() function). The name on the left of “=” becomes the label of the grouping column in the result.

22. In Code 10, the third chunk of code from the top of p. 92, interpret all three arguments of the aggregate ( ) function. What do they all mean?

The first argument is a formula that indicates which variable (Sepal.Length) we want the quantiles of and which variable (Species) defines the groups. The second argument indicates the dataset. The third argument indicates the function we want to apply to the variable within each group.
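The chunk below is a sketch of such a call, assuming Code 10 uses the formula interface of aggregate(); the exact code in the book may differ. The object name q_by_species is illustrative.

```r
# Formula: response ~ grouping variable; data: the data frame; FUN: the function.
q_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = quantile)
print(q_by_species)

# The quantiles are stored as a matrix column; extract the medians by name.
medians <- q_by_species$Sepal.Length[, "50%"]
print(medians)
```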

23. In some datasets a column (or a feature, or a variable) may contain symbols such as “?” in some of its rows (look at Section 3.3.1.4 on pp. 60 and 61). If we use the class ( ) function on that column, we are sure to get the column labeled as “factor.” However, assume we want this column to be labeled “integer.” Which function can we use to parse a column, or a vector of values, from “factors” to “integers?”

parse_integer(), from the readr package.

We specify “?” as NA using parse_integer(data, na = "?")
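A small sketch on a made-up factor column (the values are hypothetical, not from the book). It assumes readr is installed and converts the factor with as.character() first, since the readr parsers are designed to work on character vectors.

```r
library(readr)

# Hypothetical column read in as a factor because of the "?" entries.
raw <- factor(c("12", "?", "7", "30"))
print(class(raw))

# Treat "?" as a missing value and parse the rest as integers.
parsed <- parse_integer(as.character(raw), na = "?")
print(class(parsed))
print(parsed)
```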

24.

What is the following code used for?

data (algae, package = "DMwR2")
nasRow <- apply (algae, 1, function(r) sum(is.na(r)))
cat ("The algae dataset contains", sum (nasRow), "NA values.\n")

## The algae dataset contains 33 NA values.

It counts the NA values in the algae dataset and prints the total. For each row, is.na() flags the missing values as TRUE and sum() counts them; the per-row counts are then added together.

What results are we looking for?

We look for the number of NA values (and prefer to have no NA values in a dataset).
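The same counting logic can be followed on a tiny, made-up data frame (toy is a hypothetical example, not part of DMwR2), where the per-row counts are easy to verify by eye.

```r
# Hypothetical data frame with three missing values.
toy <- data.frame(a = c(1, NA, 3),
                  b = c(NA, NA, 6),
                  c = c(7, 8, 9))

# Number of NA values in each row: 1, 2, 0.
nas_per_row <- apply(toy, 1, function(r) sum(is.na(r)))

# Total over all rows; sum(is.na(toy)) gives the same number in one step.
total_nas <- sum(nas_per_row)
cat("The toy dataset contains", total_nas, "NA values.\n")
```

Keeping the per-row counts is useful when the goal is to find rows with too many missing values, not just the overall total.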

25.

What method is used to detect a univariate outlier?

boxplot rule

What does that method state?

An outlier is a value that lies outside the interval [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], where Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1.
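A minimal sketch of this rule as a helper function. The name boxplot_outliers and the example vector are made up for illustration, and the quartiles come from quantile() with its default settings.

```r
# Return the values of x that fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
boxplot_outliers <- function(x) {
  x   <- x[!is.na(x)]                                   # ignore missing values
  q   <- unname(quantile(x, probs = c(0.25, 0.75)))     # Q1 and Q3
  iqr <- q[2] - q[1]
  x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
}

# Hypothetical data: Q1 = 3, Q3 = 5, so the interval is [0, 8].
out <- boxplot_outliers(c(2, 3, 3, 4, 4, 5, 5, 6, 40))
print(out)
```

Only the value 40 lies outside the interval here, so it is the only one flagged.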

26. What sort of results does the summary ( ) function yield when applied to a dataset?

Minimum
1st quartile
Median
Mean
3rd quartile
Maximum
For factor columns, the count of each level; for columns with missing values, the number of NA's

27.

For what is the function, describe ( ) used?

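It is another way to get a summary of a dataset, with more detail per variable than summary() gives (for example, the number of missing and distinct values).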

Which package contains the function, describe ( )?

Hmisc

28. Give a definition of the term, parse.

In general it means analyzing (a sentence) into its parts and describing their syntactic roles. In statistics it refers to the process of analyzing a string of data to extract meaningful information. For example, breaking down a complex dataset or a string of text into smaller components that are easier to manage, understand, and analyze. Parsing is helpful for preprocessing of raw data.