Data Engineering and Mining
Summer 2022
Instructor: C. Pierre, Ph.D., M.Sc. in Analytics
Name: Paul Brown
ASSIGNMENT 4 – Section 3.4 Modeling
1. What is a model?
• According to the author, it is a scientific activity based on
observations of a phenomenon in the form of a dataset
2. What are the five groups of tasks of modeling in data
mining?
• Exploratory data analysis
• Dependency modeling
• Clustering
• Anomaly detection
• Predictive analytics
3. Typically, what does a data miner do?
• Search for interesting, unexpected, and useful relationships in a
dataset
4. Most data mining techniques can be bifurcated into groups. What
are those techniques?
• Search for relationships among the features (columns) that describe
the cases in a dataset
• Search for relationships among the observations (rows) of the
dataset
5. What is a main goal of exploratory data analysis?
• To provide useful summaries of a dataset that highlight some
characteristics of the data that the users may find useful
6. Most datasets have a dimensionality that makes it very difficult
for a standard user to inspect the full data and find interesting
properties of these data. TRUE or FALSE?
• True
7. What are data summaries?
• They provide overviews of key properties of the data. They try to
describe important properties of the distribution of the values across
the observations in a dataset
8. The summarise() function is a function of which package?
• dplyr package
Question 9
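library(DMwR2)                    # package that ships the algae dataset
data(algae, package = "DMwR2")    # load algae
data(iris)                        # load iris (base R datasets)
View(algae)                       # open the data viewer; View(DMwR2) fails because DMwR2 is a package, not an object
algae
iris
summary(algae)
summary(iris)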
(b) What is the algae dataset about?
• This dataset contains observations on 11 predictor variables as well
as the concentration levels of 7 harmful algae. Values were measured in
several European rivers. The 11 predictor variables include 3 contextual
variables (season, size and speed) describing the water sample, plus 8
chemical concentration measurements.
(c) Explain the characteristics of the iris dataset
• The dataset provides measurements for three species of the iris
flower: setosa, versicolor and virginica. For 50 flowers of each species
it records the length and width of the sepals and of the petals
10. What does the summarise() function do?
• The summarise() function applies any function that produces a scalar
value to any column of a data frame table
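As a small illustration of this answer, the sketch below applies two scalar-valued functions to one column of iris. It assumes the dplyr package is installed; the output column names avgSL and medSL are made up for this example.

```r
library(dplyr)

data(iris)

# Every expression inside summarise() must return a single value,
# so the result is a one-row table with one column per summary.
sl_summary <- summarise(iris,
                        avgSL = mean(Sepal.Length),
                        medSL = median(Sepal.Length))
print(sl_summary)
```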
11. We can use the functions, summarise_each() and funs(), to
perform what kind of task?
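• When we want to apply a set of functions to every selected column of
a data frame table, summarise_each() provides a more compact and tidy
notation than one summarise() entry per column
• funs() supplies the list of functions to apply, for example sum,
mean, min or max.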
12. What is the task of the group_by() function? This function is
included in which package?
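• Groups the rows of a data frame into sub-groups defined by the values
of one or more variables, so that later operations run once per group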
• dplyr
13. Which function will you use if you want to study potential
differences among the sub-groups?
• group_by(), usually followed by summarise(); in base R, aggregate()
does a similar job
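One way to study differences among sub-groups is to form the groups with group_by() and then summarise each one. This is a minimal sketch that assumes dplyr is installed; the column names nObs and medSL are invented for the example.

```r
library(dplyr)

data(iris)

# Form one sub-group per species, then compute the summaries
# separately inside each sub-group.
by_species <- iris %>%
  group_by(Species) %>%
  summarise(nObs = n(), medSL = median(Sepal.Length))
print(by_species)
```

The result has one row per species, which makes the sub-groups easy to compare side by side.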
14. The top algorithm/code chunk on page 90 (Code 4) gives us a way
to create a function to obtain the mode of a variable. Go through this
algorithm. Now, replace
“algae$mxPH” with
“iris$Sepal.Length” and
“algae$season” with
“iris$Petal.Length”
15. Explain the centralValue() function. What does it do?
• Used to obtain the most adequate statistic of centrality for a given
sample of values: it returns the median for numeric variables and the
mode for nominal variables
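The behaviour described above can be imitated in base R. The helper central_value() below is a hypothetical stand-in written for this example, not the DMwR2 source code, and it returns the mode of a nominal variable as a character string.

```r
# Hypothetical helper: median for numeric input, mode otherwise.
central_value <- function(x) {
  if (is.numeric(x)) {
    median(x, na.rm = TRUE)
  } else {
    x <- x[!is.na(x)]                  # ignore missing values
    counts <- table(x)                 # frequency of each distinct value
    names(counts)[which.max(counts)]   # most frequent value (first one on ties)
  }
}

data(iris)
cv_num <- central_value(iris$Sepal.Length)       # numeric variable: median
cv_nom <- central_value(c("a", "b", "b", NA))    # nominal variable: mode
print(cv_num)
print(cv_nom)
```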
16. (a) Explain the inter-quartile range (IQR).
• The difference between the third and first quartiles (Q3 - Q1), that
is, the range covered by the middle 50% of the values
(b) Explain the x-quartile
• The value below which there are x% of the observed values
(c) What does a large value of the IQR mean?
• The central values are spread over a large range
(d) What does a small value of the IQR mean?
• The central values are tightly packed within a small range
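The definitions in (a) and (b) can be checked with base R. This sketch uses Sepal.Length from iris purely as an example variable and relies on the default quantile type in R.

```r
data(iris)

# First and third quartiles: 25% and 75% of the values lie below them.
q <- quantile(iris$Sepal.Length, probs = c(0.25, 0.75))

# IQR computed by hand as Q3 - Q1, and with the built-in function.
iqr_manual  <- unname(q["75%"] - q["25%"])
iqr_builtin <- IQR(iris$Sepal.Length)

# A general x-quantile: the value below which 90% of the values lie.
p90 <- quantile(iris$Sepal.Length, probs = 0.9)

print(q)
print(iqr_builtin)
print(p90)
```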
17. Which measure of spread, or variability, is more susceptible to
outliers?
• The range, standard deviation and variance are all susceptible to
outliers; the IQR is more robust
18. (a) Using the Iris dataset, obtain the quantiles of the variable
(or feature), Length, by Species.
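# summary(iris) only gives quantiles for the whole dataset, not by Species.
# Sepal.Length is assumed here; the question does not say which Length.
aggregate(iris$Sepal.Length, list(Species = iris$Species), quantile)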
19. Find the Mode of the subgroup, “iris$Species.”
Mode <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]            # optionally drop missing values
  ux <- unique(x)                         # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]   # most frequent value; ties go to the first one
}
Mode(iris$Sepal.Length, na.rm = TRUE)
Mode(iris$Petal.Length)
# Find the mode of the subgroup variable. iris has no column named
# Species.Length, so Mode(iris$Species.Length) returns NULL.
Mode(iris$Species)
• setosa (each species has 50 observations, so the tie goes to the first
value)
20. (a) What are “pipes?”
(b) What is the “piping syntax?”
• Used to pass the output of one function to another function, thereby
enabling functions to be chained together.
(c) What is the “pipe operator” (%>%)?
• It passes the left-hand side of the operator as the first argument
of the function on the right-hand side of the operator.
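To make the piping syntax concrete, the sketch below writes the same computation twice: once as nested calls and once as a pipe chain. It assumes dplyr is installed (dplyr makes %>% available); medSL is an invented column name.

```r
library(dplyr)

data(iris)

# Nested calls: read from the inside out.
nested <- summarise(filter(iris, Species == "setosa"),
                    medSL = median(Sepal.Length))

# Piped calls: each result becomes the first argument of the next function.
piped <- iris %>%
  filter(Species == "setosa") %>%
  summarise(medSL = median(Sepal.Length))

print(nested)
print(piped)
```

Both forms give the same result; the piped version reads in the order in which the steps are carried out.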
21. In Code 9, the second chunk of code from the top of page 92,
interpret
“Species = iris$Species,” which is in the second argument of the
aggregate ( ) function. What does it all mean?
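• The second argument of aggregate() takes a list of factors that form
the sub-groups of the data. Here, Species = iris$Species groups the rows
by species and names the grouping column “Species” in the output. The
result provides the quartiles, including the minimum and maximum
lengths, for the three species: setosa, versicolor and virginica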
22. In Code 10, the third chunk of code from the top of P.92,
interpret all three arguments of the aggregate ( ) What do they all
mean?
• The sepal length for setosa goes from 4.3 to 5.8 with a first
quartile of 4.8, a median of 5.0 and a third quartile of 5.2
• The sepal length for versicolor goes from 4.9 to 7.0 with a first
quartile of 5.6, a median of 5.9 and a third quartile of 6.3
• The sepal length for virginica goes from 4.9 to 7.9 with a first
quartile of 6.2, a median of 6.5 and a third quartile of 6.9
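The figures above can be reproduced with a call of the form quoted in question 21. This is a sketch under the assumption that the call looks like the one below; the exact text of Code 10 is not reproduced here, and sl_quantiles is a name chosen for this example.

```r
data(iris)

# First argument:  the values to summarise (sepal lengths).
# Second argument: a named list of factors that define the sub-groups.
# Third argument:  the function applied to each sub-group.
sl_quantiles <- aggregate(iris$Sepal.Length,
                          list(Species = iris$Species),
                          quantile)
print(sl_quantiles)
```

Each row holds the minimum, first quartile, median, third quartile and maximum for one species.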
23. In some datasets a column (or a feature, or a variable) may
contain symbols such as “?” in some of its rows (Look at Section 3.3.1.4
on Pp. 60 and 61). If we use the class ( ) function on that column, we
are sure to get the column labeled as “factor.” However, assume we
want this column to be labeled “integer.” Which function can we use to
parse a column, or a vector of values, from “factors” to
“integers?”
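One base R way to do this conversion is sketched below. The vector raw is hypothetical data made up for the example, and the section cited in the question may use a different function; this sketch only shows the general idea of going through character values and treating “?” as missing.

```r
# Hypothetical column where "?" marks an unknown value.
raw <- factor(c("12", "?", "7", "30"))
class(raw)                  # "factor"

chr <- as.character(raw)    # go through character, not straight to integer
chr[chr == "?"] <- NA       # turn the placeholder into a proper missing value
parsed <- as.integer(chr)
class(parsed)               # "integer"

# Pitfall: as.integer() applied directly to a factor returns the internal
# level codes, not the numbers that the labels spell out.
codes <- as.integer(raw)
print(parsed)
print(codes)
```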
24. (a) What is the following code used for?
> data(algae, package = "DMwR2")
• Loads the algae dataset, the training data for predicting algae
blooms, from the DMwR2 package
> nasRow <- apply(algae, 1, function(r) sum(is.na(r)))
• Counts the NA values in each row of the algae dataset
> cat("The algae dataset contains", sum(nasRow), "NA values.")
• Prints the total number of NA values in the dataset; the number of
rows that have NA values would be sum(nasRow > 0)
(b) What results are we looking for?
• This helps find “strange” values in the dataset, in this case
unknown values
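The same counting pattern can be tried without the DMwR2 package. The data frame toy below is hypothetical and only stands in for algae; it separates the total number of NA values from the number of rows that contain at least one.

```r
# Hypothetical toy data frame standing in for algae.
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c("x", "y", NA, NA),
                  c = c(10, 20, 30, 40))

# Number of NA values in each row (MARGIN = 1 means "by row").
nas_row <- apply(toy, 1, function(r) sum(is.na(r)))

total_nas     <- sum(nas_row)        # all NA values in the data frame
rows_with_nas <- sum(nas_row > 0)    # rows with at least one NA

cat("The toy data frame contains", total_nas, "NA values in",
    rows_with_nas, "rows.\n")
```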
25. (a) What method is used to detect a univariate outlier?
• The boxplot rule, which states that values are outliers if they lie
more than a certain distance above the third quartile or below the first
quartile
(b) What does that method state?
• A value is an outlier if it lies outside the interval
[Q1 - 1.5 × IQR, Q3 + 1.5 × IQR]
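The boxplot rule can be sketched in a few lines of base R. The function name boxplot_outliers() and the sample vector are made up for this example; quartiles come from quantile() with its default settings, so the limits can differ slightly from those used by boxplot(), which works with hinges.

```r
# Hypothetical helper: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
boxplot_outliers <- function(x) {
  x <- x[!is.na(x)]                          # ignore missing values
  q <- quantile(x, probs = c(0.25, 0.75))    # Q1 and Q3
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x[x < lower | x > upper]
}

vals <- c(1, 2, 3, 4, 5, 100)
out <- boxplot_outliers(vals)
print(out)
```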
26. What sort of results does the summary ( ) function yield when
applied to a dataset?
• It quickly summarizes the values in a vector, data frame, regression
model, or ANOVA model. For a data frame it reports, for each numeric
column, the minimum, quartiles, median, mean and maximum, and for each
factor column the counts of its levels
27. (a) For what is the function, describe ( ) used?
• Provides basic descriptive statistics
(b) Which package contains the function, describe ( )?
• psych package (the Hmisc package also provides a function named
describe())
28. Give a definition of the term, parse.
• Process of analyzing a string of symbols, either in natural
language, computer languages or data structures, conforming to the rules
of a formal grammar.