This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
## This is Homework #4, following pages 87-96 in the textbook.
# #1. What is a model?
#Scientific modelling is a scientific activity, the aim of which is to make a particular part or feature of the world easier to understand, define, quantify, visualize, or simulate by referencing it to existing and usually commonly accepted knowledge. It requires selecting and identifying relevant aspects of a situation in the real world and then using different types of models for different aims, such as conceptual models to better understand, operational models to operationalize, mathematical models to quantify, and graphical models to visualize the subject.
# #2. What are the five groups of tasks of modeling in data mining?
#1. Exploratory data analysis;
#2. Dependency modeling;
#3. Clustering;
#4. Anomaly detection; and
#5. Predictive analytics
# #3. Typically, what does a data miner do?
#Data miners analyze large datasets to identify patterns, trends, and relationships, transforming raw data into actionable insights. They use statistical techniques and software to extract meaningful information that can be used to make predictions, improve business strategies, and solve problems.
# #4. Most data mining techniques can be bifurcated into groups. What are those techniques?
#Data miners typically are: (i) searching for relationships among the features (columns) describing the cases in a dataset or (ii) searching for relationships among the observations (rows) of the dataset.
# #5. What is a main goal of exploratory data analysis?
#Exploratory data analysis includes a series of techniques that have as the main goal to provide useful summaries of a dataset that highlight some characteristics of the data that the users may find useful. Two types of summaries are textual and visual.
# #6. Most datasets have a dimensionality that makes it very difficult for a standard user to inspect the full data and find interesting properties of these data. TRUE or FALSE?
#True
# #7. What are data summaries?
#Data summaries try to provide overviews of key properties of the data. More specifically, they try to describe important properties of the distribution of the values across the observations in a dataset. Examples of these properties include answers to questions like:
#• What is the “most common value” of a variable?
#• Do the values of a variable “vary” a lot?
#• Are there “strange” / unexpected values in the dataset?
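#As a minimal sketch, the three questions above can be answered in base R on the built-in iris data. The choice of Sepal.Width and of median(), IQR() and boxplot.stats() is an illustrative assumption, not the textbook's own code.

```r
# Illustrative only: one numeric column of the built-in iris data.
x <- iris$Sepal.Width

# "Most common value": the median is a robust statistic of centrality.
central <- median(x)

# "Do the values vary a lot?": the IQR measures the spread of the central 50%.
spread <- IQR(x)

# "Strange values": boxplot.stats() reports points beyond the boxplot whiskers.
strange <- boxplot.stats(x)$out

central
spread
strange
```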
# #8. The summarise() function is a function of which package?
#summarise() is a function of the dplyr package.
# #9. (a) Run the following code (below). Study the data. Explain the dataset.
library (DMwR2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library (dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data (algae)
algae
## # A tibble: 200 × 18
## season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla a1
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 winter small medium 8 9.8 60.8 6.24 578 105 170 50 0
## 2 spring small medium 8.35 8 57.8 1.29 370 429. 559. 1.3 1.4
## 3 autumn small medium 8.1 11.4 40.0 5.33 347. 126. 187. 15.6 3.3
## 4 spring small medium 8.07 4.8 77.4 2.30 98.2 61.2 139. 1.4 3.1
## 5 autumn small medium 8.06 9 55.4 10.4 234. 58.2 97.6 10.5 9.2
## 6 winter small high 8.25 13.1 65.8 9.25 430 18.2 56.7 28.4 15.1
## 7 summer small high 8.15 10.3 73.2 1.54 110 61.2 112. 3.2 2.4
## 8 autumn small high 8.05 10.6 59.1 4.99 206. 44.7 77.4 6.9 18.2
## 9 winter small medium 8.7 3.4 22.0 0.886 103. 36.3 71 5.54 25.4
## 10 winter small high 7.93 9.9 8 1.39 5.8 27.2 46.6 0.8 17
## # ℹ 190 more rows
## # ℹ 6 more variables: a2 <dbl>, a3 <dbl>, a4 <dbl>, a5 <dbl>, a6 <dbl>,
## # a7 <dbl>
data (iris)
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
summary (algae)
## season size speed mxPH mnO2
## autumn:40 large :45 high :84 Min. :5.600 Min. : 1.500
## spring:53 medium:84 low :33 1st Qu.:7.700 1st Qu.: 7.725
## summer:45 small :71 medium:83 Median :8.060 Median : 9.800
## winter:62 Mean :8.012 Mean : 9.118
## 3rd Qu.:8.400 3rd Qu.:10.800
## Max. :9.700 Max. :13.400
## NA's :1 NA's :2
## Cl NO3 NH4 oPO4
## Min. : 0.222 Min. : 0.050 Min. : 5.00 Min. : 1.00
## 1st Qu.: 10.981 1st Qu.: 1.296 1st Qu.: 38.33 1st Qu.: 15.70
## Median : 32.730 Median : 2.675 Median : 103.17 Median : 40.15
## Mean : 43.636 Mean : 3.282 Mean : 501.30 Mean : 73.59
## 3rd Qu.: 57.824 3rd Qu.: 4.446 3rd Qu.: 226.95 3rd Qu.: 99.33
## Max. :391.500 Max. :45.650 Max. :24064.00 Max. :564.60
## NA's :10 NA's :2 NA's :2 NA's :2
## PO4 Chla a1 a2
## Min. : 1.00 Min. : 0.200 Min. : 0.00 Min. : 0.000
## 1st Qu.: 41.38 1st Qu.: 2.000 1st Qu.: 1.50 1st Qu.: 0.000
## Median :103.29 Median : 5.475 Median : 6.95 Median : 3.000
## Mean :137.88 Mean : 13.971 Mean :16.92 Mean : 7.458
## 3rd Qu.:213.75 3rd Qu.: 18.308 3rd Qu.:24.80 3rd Qu.:11.375
## Max. :771.60 Max. :110.456 Max. :89.80 Max. :72.600
## NA's :2 NA's :12
## a3 a4 a5 a6
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.550 Median : 0.000 Median : 1.900 Median : 0.000
## Mean : 4.309 Mean : 1.992 Mean : 5.064 Mean : 5.964
## 3rd Qu.: 4.925 3rd Qu.: 2.400 3rd Qu.: 7.500 3rd Qu.: 6.925
## Max. :42.800 Max. :44.600 Max. :44.400 Max. :77.600
##
## a7
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 1.000
## Mean : 2.495
## 3rd Qu.: 2.400
## Max. :31.600
##
summary (iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
#(b) What is the algae dataset about?
#It is a data frame containing measurements from various European rivers. It includes 11 predictor variables, such as water sample characteristics (season, size, speed) and chemical concentrations (nitrates, phosphates, etc.), and also includes the frequency of 7 different types of harmful algae as target variables.
#(c) Explain the characteristics of the iris dataset
#This dataset contains information about 150 samples of iris flowers. Each sample includes four numerical measurements: the length and width of the sepal (a leaf-like structure enclosing the petals), and the length and width of the petal (the colored part of the flower). Alongside these numerical attributes, each sample is also labeled with the species of the iris flower, which can be one of three types: setosa, versicolor, or virginica.
#(d) Did you discover any correlation between any pair of features in the iris dataset?
cor(iris[, 1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
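#Yes. From the correlation printout, Sepal.Length is strongly positively correlated with both Petal.Length (0.87) and Petal.Width (0.82), and Petal.Length and Petal.Width form the most strongly correlated pair (0.96). Sepal.Width is only weakly to moderately negatively correlated with the other three features (-0.12 to -0.43).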
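#The strongest pair can also be picked out programmatically rather than by eye. The following is a sketch in base R; the variable names cm, idx and strongest are illustrative.

```r
# Correlation matrix of the four numeric iris columns.
cm <- cor(iris[, 1:4])

# Blank out the diagonal and the lower triangle so each pair is seen once.
cm[lower.tri(cm, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation.
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
strongest
```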
# #10. What does the summarise() function do?
#The summarise() function reduces a data frame or tibble to a single row (or multiple rows if grouped) by calculating summary statistics for specified columns. It collapses multiple rows into a single row representing a summary of the original data, like calculating the mean, sum, or other statistics.
# #11. We can use the functions, summarise_each() and funs(), to perform what kind of task?
#summarise() can be used to apply any function that produces a scalar value to any column of a data frame table (see question #10). summarise_each(), used together with funs(), applies a set of functions to all columns of a data frame table.
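#In current dplyr releases summarise_each() and funs() are deprecated, and across() inside summarise() is the documented replacement. The sketch below assumes dplyr 1.0.0 or later and shows the same idea (several functions applied to many columns) on iris; the object name iris_stats is illustrative.

```r
library(dplyr)

# Apply two summary functions to every numeric column of iris.
# across() is used here in place of summarise_each()/funs().
iris_stats <- iris %>%
  summarise(across(where(is.numeric),
                   list(mean = mean, median = median)))
iris_stats
```

#The result is a single row with one column per variable and function, named like Sepal.Length_mean.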
# #12. What is the task of the group_by() function? This function is included in which package?
#The group_by() function can be used to form sub-groups of a dataset using all combinations of the values of one or more nominal variables (for example, season and size in the algae dataset).
#The group_by() function is in the dplyr package and is used to group data based on one or more variables, allowing for subsequent operations to be performed on these groups separately. This function essentially transforms a data frame or table into a grouped structure, where operations are applied to each group independently.
# #13. Which function will you use if you want to study potential differences among the sub-groups?
#Use group_by(), then summarise().
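#A short sketch of that two-step pattern on iris, assuming dplyr is installed. The summary columns nObs and medPetal are illustrative names.

```r
library(dplyr)

# Form one sub-group per species, then compute per-group summaries.
by_species <- iris %>%
  group_by(Species) %>%
  summarise(nObs = n(), medPetal = median(Petal.Length))
by_species
```

#Each row of the result describes one sub-group, so differences among species can be read off directly.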
# #14. The top algorithm/code chunk on page 90 (Code 4) gives us a way to create a function to obtain the mode of a variable. Go through this algorithm.
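Mode <- function(x, na.rm = FALSE) {
  # Optionally drop missing values first
  if (na.rm) x <- x[!is.na(x)]
  # Distinct values, in order of first appearance
  ux <- unique(x)
  # Count occurrences of each distinct value and return the most frequent
  return(ux[which.max(tabulate(match(x, ux)))])
}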
Mode(algae$mxPH, na.rm=TRUE)
## [1] 8
Mode(algae$season)
## [1] winter
## Levels: autumn spring summer winter
algae$season
## [1] winter spring autumn spring autumn winter summer autumn winter winter
## [11] spring summer winter summer winter autumn winter spring summer spring
## [21] winter spring autumn winter spring summer winter autumn winter spring
## [31] autumn winter summer autumn winter summer winter spring winter spring
## [41] winter summer winter spring winter spring autumn winter spring autumn
## [51] winter summer winter spring winter spring autumn spring summer autumn
## [61] spring summer autumn spring summer winter spring summer winter spring
## [71] summer autumn summer winter spring spring autumn winter spring winter
## [81] spring autumn winter summer autumn winter summer winter summer winter
## [91] spring summer winter spring summer autumn winter spring winter summer
## [101] winter spring winter summer autumn winter summer autumn winter summer
## [111] winter spring summer spring summer winter summer autumn spring summer
## [121] winter spring summer winter spring summer winter spring summer winter
## [131] spring summer winter spring autumn spring autumn winter summer winter
## [141] summer spring autumn winter spring autumn winter autumn winter spring
## [151] autumn summer autumn spring autumn spring summer spring autumn winter
## [161] spring spring summer autumn winter autumn spring autumn winter autumn
## [171] winter summer winter spring spring winter autumn spring summer spring
## [181] summer winter summer winter summer winter winter summer autumn spring
## [191] autumn winter spring autumn summer autumn spring autumn winter summer
## Levels: autumn spring summer winter
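#To go through the Mode() algorithm step by step, the one-line body can be unpacked on a small made-up vector. This is only a walk-through sketch; the vector x is illustrative.

```r
# A small illustrative vector; "b" occurs most often.
x <- c("a", "b", "b", "c", "b", "a")

ux <- unique(x)           # distinct values in order of first appearance
pos <- match(x, ux)       # index into ux for every element of x
counts <- tabulate(pos)   # how many times each distinct value occurs
ux[which.max(counts)]     # the value with the highest count
```

#Because which.max() returns the first maximum, ties are resolved in favour of the value that appears first in the data.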
#Now, replace “algae$mxPH” with “iris$Sepal.Length” and “algae$season” with “iris$Petal.Length”.
#Copy and/or take a screenshot of your results for both and include them in this assignment (I only need the first 20 to 40 rows of each sub-group).
Mode <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}
Mode(iris$Sepal.Length, na.rm=TRUE)
## [1] 5
Mode(iris$Petal.Length)
## [1] 1.4
iris$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
# #15. Explain the centralValue() function. What does it do?
#centralValue(), from the DMwR2 package, returns the statistic of centrality best suited to a given sample of values: the median for numeric variables and the mode for nominal variables.
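#The behaviour described above can be imitated in a few lines of base R. The helper below, central_value_sketch(), is hypothetical and is not the DMwR2 implementation; it only mirrors the median-or-mode rule.

```r
# Hypothetical helper, NOT the DMwR2 implementation: it only mirrors the
# behaviour described above (median for numeric, mode for nominal data).
central_value_sketch <- function(x) {
  x <- x[!is.na(x)]
  if (is.numeric(x)) {
    median(x)
  } else {
    ux <- unique(x)
    as.character(ux[which.max(tabulate(match(x, ux)))])
  }
}

central_value_sketch(iris$Sepal.Length)
central_value_sketch(factor(c("low", "high", "high")))
```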
# #16.
#(a) Explain the inter-quartile range (IQR).
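#The inter-quartile range (IQR) is the width of the interval that contains the central 50% of the values of a continuous variable, that is, the difference between the third quartile (the 75% quantile) and the first quartile (the 25% quantile).
#(b) Explain the x-quantile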
# The x-quantile is the value below which there are x% of the observed values.
#(c) What does a large value of the IQR mean?
#A large value of the IQR means that these central values are spread over a large range.
#(d) What does a small value of the IQR mean?
# A small value represents a very packed set of values.
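#A brief sketch of these definitions in base R, using iris$Sepal.Length as an illustrative variable: the IQR is computed from the quartiles by hand and then compared with the built-in IQR() function.

```r
x <- iris$Sepal.Length

# First and third quartiles (the 25% and 75% quantiles).
q <- quantile(x, probs = c(0.25, 0.75))

# The IQR is the distance between them; IQR() computes the same thing.
iqr_manual <- unname(q["75%"] - q["25%"])
iqr_builtin <- IQR(x)
c(manual = iqr_manual, builtin = iqr_builtin)

# Per-species spread: a smaller IQR means a more tightly packed group.
tapply(iris$Sepal.Length, iris$Species, IQR)
```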
# #17. Which measure of spread, or variability, is more susceptible to outliers?
# The range is more susceptible to outliers.
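#A small experiment illustrates the point. The data below are made up for this sketch: one extreme value is appended to ten ordinary values, and the change in the range is compared with the change in the IQR.

```r
# Illustrative data: ten ordinary values, then the same values plus one outlier.
clean <- c(4, 5, 5, 6, 6, 6, 7, 7, 8, 9)
dirty <- c(clean, 100)

# Width of the range (max minus min) before and after adding the outlier.
range_change <- diff(range(dirty)) - diff(range(clean))

# IQR before and after adding the outlier.
iqr_change <- IQR(dirty) - IQR(clean)

c(range_change = range_change, iqr_change = iqr_change)
```

#The range grows by 91 while the IQR moves by only 0.25, because the IQR ignores the tails of the sample.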
# #18.
#(a) Using the Iris dataset, obtain the quantiles of the variable (or feature), Length, by Species.
#Group by species and calculate quantiles for Sepal.Length
quantiles_by_species <- iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarize(
    `25%` = quantile(Sepal.Length, 0.25),
    `50%` = quantile(Sepal.Length, 0.50),
    `75%` = quantile(Sepal.Length, 0.75)
  )
quantiles_by_species
## # A tibble: 3 × 4
## Species `25%` `50%` `75%`
## <fct> <dbl> <dbl> <dbl>
## 1 setosa 4.8 5 5.2
## 2 versicolor 5.6 5.9 6.3
## 3 virginica 6.22 6.5 6.9
#Which package provides the better grouping facilities, baseR or dplyr?
#Which function is the best to use?
# #19. Find the Mode of the subgroup, “iris$Species.”
# Count the occurrences of each species in the iris dataset.
species_counts <- table(iris$Species)
# Find the species with the highest count (the mode).
mode_species <- names(species_counts)[which.max(species_counts)]
mode_species
## [1] "setosa"
# #20.
#(a) What are “pipes?”
# Pipes refer to operators that allow you to chain together multiple operations in a sequence, making your code more readable and easier to understand, especially when performing a sequence of data manipulations or transformations.
#(b) What is the “piping syntax?”
# The piping syntax refers to the use of a special operator (or symbols) that allows you to chain multiple functions together, passing the output of one function directly as the input to the next. This makes your code more readable and concise.
#(c) What is the “pipe operator” (%>%)?
# The pipe operator is in effect a simple re-writing operator and this can be applied to any R function. The idea is that the left-hand side of the operator is passed as the first argument of the function on the right side of the operator.
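#The re-writing can be checked directly: x %>% f(y) should give the same result as f(x, y). A minimal sketch, assuming dplyr is installed (it re-exports %>%); the names nested and piped are illustrative.

```r
library(dplyr)  # dplyr re-exports the %>% operator

# Nested call: the data is the first argument of quantile().
nested <- quantile(iris$Sepal.Length, 0.5)

# Piped call: the left-hand side is inserted as that first argument.
piped <- iris$Sepal.Length %>% quantile(0.5)

identical(nested, piped)
```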
# #21. In Code 9, the second chunk of code from the top of page 92, interpret “Species = iris$Species,” which is in the second argument of the aggregate ( ) function. What does it all mean?
# The code instructs the function to calculate statistics separately for each unique species in the iris dataset.
# #22. In Code 10, the third chunk of code from the top of P.92, interpret all three arguments of the aggregate ( ) What do they all mean?
#The first argument indicates which variable (Sepal.Length) we want to find the quantiles for and by which grouping variable (Species).
#The second argument indicates the dataset in which those variables are found.
#The third argument indicates the function we want to apply to the variable within each group, which is quantile.
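#Assuming the call uses aggregate()'s formula interface, as the description above suggests, the three arguments map onto a call like the following sketch. The object name q_by_species is illustrative.

```r
# Formula: quantiles of Sepal.Length, computed separately for each Species.
# data: where the variables named in the formula are looked up.
# FUN: the function applied to each sub-group.
q_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = quantile)
q_by_species
```

#Because quantile() returns five values, the Sepal.Length column of the result is a matrix with one column per quantile.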
# #23. In some datasets a column (or a feature, or a variable) may contain symbols such as “?” in some of its rows (Look at Section 3.3.1.4 on Pp. 60 and 61). If we use the class ( ) function on that column, we are sure to get the column labeled as “factor.” However, assume we want this column to be labeled “integer.” Which function can we use to parse a column, or a vector of values, from “factors” to “integers?”
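# parse_integer(), from the readr package. It works on character vectors, so a factor column is first converted with as.character().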
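#The sketch below shows why a dedicated parsing step is needed. The factor f is made-up data; the base R route is an alternative shown for comparison, and the readr call only runs if readr is installed.

```r
# Illustrative column read as a factor because of the "?" entries.
f <- factor(c("10", "20", "?", "10"))

# Pitfall: as.integer() on a factor returns the internal level codes,
# not the numbers printed in the column.
codes <- as.integer(f)

# Base R route: go through character, mark "?" as missing, then convert.
chr <- as.character(f)
chr[chr == "?"] <- NA
vals <- as.integer(chr)
vals

# readr route (only run if readr is installed): parse_integer() takes a
# character vector and an `na` argument naming the missing-value strings.
if (requireNamespace("readr", quietly = TRUE)) {
  print(readr::parse_integer(as.character(f), na = "?"))
}
```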
# #24.
#(a) What is the following code used for?
#> data(algae, package = "DMwR2")
#> nasRow <- apply(algae, 1, function(r) sum(is.na(r)))
#> cat("The algae dataset contains", sum(nasRow), "NA values.\n")
# The first line loads the algae dataset from the DMwR2 package. The second line applies a function to every row (the 1 selects rows) that counts the NA values in that row, so nasRow holds one count per observation. The third line prints the total number of missing values in the dataset, which is sum(nasRow).
#(b) What results are we looking for?
# It is looking for the total number of NA values in the dataset.
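#The same row-wise counting pattern can be tried without DMwR2. In this sketch the built-in airquality dataset stands in for algae (an assumption made only to keep the example self-contained), and the row-wise total is compared with a direct count.

```r
# airquality ships with R and contains missing values, so it stands in
# for algae here (an assumption made to keep the example self-contained).
nasRow <- apply(airquality, 1, function(r) sum(is.na(r)))

# Total number of NA values, computed two ways.
total_by_row <- sum(nasRow)
total_direct <- sum(is.na(airquality))
cat("The airquality dataset contains", total_by_row, "NA values.\n")

# Rows with at least one NA can be located from the same vector.
head(which(nasRow > 0))
```

#Keeping the per-row vector is useful beyond the total, since it identifies which observations are incomplete.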
# #25.
#(a) What method is used to detect a univariate outlier?
#The Boxplot method.
#(b) What does that method state?
# It uses the inter-quartile range (IQR): a value is flagged as an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles.
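#A minimal sketch of the boxplot rule in base R, assuming the usual factor of 1.5 on the IQR. The helper boxplot_outliers() is hypothetical and uses quantile() for the quartiles, so its fences can differ slightly from those of boxplot.stats(), which is based on hinges.

```r
# Hypothetical helper implementing the boxplot rule with the usual 1.5 factor.
boxplot_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, probs = c(0.25, 0.75))
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  x[x < lower | x > upper]
}

boxplot_outliers(c(1, 2, 3, 4, 5, 100))
```

#For this made-up vector the fences are -1.5 and 8.5, so only the value 100 is flagged.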
# #26. What sort of results does the summary( ) function yield when applied to a dataset?
#For numeric columns, it returns the minimum, first quartile, median, mean, third quartile, and maximum, plus the number of NA values when any are present. For categorical (factor) columns, it shows the frequency count of each level.
# #27.
#(a) For what is the function, describe( ) used?
#describe() is used to generate descriptive statistics for data. It provides a comprehensive overview of the variables within a data frame, matrix, or vector, including measures of central tendency, variability, and distribution. This function is particularly useful for exploratory data analysis and understanding the characteristics of your data.
#(b) Which package contains the function, describe( )?
# It is in the Hmisc package.
# #28. Give a definition of the term, parse.
# Parsing refers to the process of analyzing text and converting it into a structured, internal form that the R interpreter can understand and process.
# In general it means analyzing (a sentence) into its parts and describing their syntactic roles. In statistics it refers to the process of analyzing a string of data to extract meaningful information. For example, breaking down a complex dataset or a string of text into smaller components that are easier to manage, understand, and analyze. Parsing is helpful for preprocessing of raw data.