Data Engineering and Mining

Summer 2022

Instructor: C. Pierre, Ph.D., M.Sc. in Analytics

Name: Paul Brown

ASSIGNMENT 4 – Section 3.4 Modeling

1. What is a model?

• According to the author, it is some scientific activity based on observations of a phenomenon in the form of a dataset

2. What are the five groups of tasks of modeling in data mining?

• Exploratory data analysis

• Dependency modeling

• Clustering

• Anomaly detection

• Predictive analytics

3. Typically, what does a data miner do?

• Search for interesting, unexpected, and useful relationships in a dataset

4. Most data mining techniques can be bifurcated into groups. What are those techniques?

• Search for relationships among the features describing the cases in a dataset described by some feature values

• Search for relationships among the observations of the dataset

5. What is a main goal of exploratory data analysis?

• To provide useful summaries of a dataset that highlight some characteristics of the data that the users may find useful

6. Most datasets have a dimensionality that makes it very difficult for a standard user to inspect the full data and find interesting properties of these data. TRUE or FALSE?

• True

7. What are data summaries?

• They provide overviews of key properties of the data. They try to describe important properties of the distribution of the values across the observations in a dataset

8. The summarise() function is a function of which package?

• dplyr package

Question 9

library(DMwR2)

data(algae, package = "DMwR2")

View(algae)

algae

data(iris)

iris

summary(algae)

summary(iris)

(b) What is the algae dataset about?

• This dataset contains observations on 11 variables as well as the concentration levels of 7 harmful algae. Values were measured in several European rivers. The 11 predictor variables include 3 contextual variables (season, size and speed) describing the water sample, plus 8 chemical concentration measurements.

(c) Explain the characteristics of the iris dataset

• The dataset provides four measurements (sepal length, sepal width, petal length, and petal width, in centimeters) for 50 flowers from each of three iris species: setosa, versicolor, and virginica

10. What does the summarise() function do?

• The summarise() function can be used to apply any function that produces a scalar value to any column of a data frame table
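As a minimal sketch of the answer above, assuming the dplyr package is installed, summarise() can reduce one column of iris to scalar values. The column names avgSL and medSL are made up for illustration.

```r
library(dplyr)

# summarise() collapses the data frame into a single row;
# each argument must evaluate to one scalar value.
# avgSL and medSL are illustrative names only.
res <- summarise(iris,
                 avgSL = mean(Sepal.Length),
                 medSL = median(Sepal.Length))
print(res)
```

Any function that returns a single value (sd, min, max, IQR, and so on) could be used in place of mean and median.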

11. We can use the functions, summarise_each() and funs(), to perform what kind of task?

• When we want to apply one or more functions to every column of a data frame table, summarise_each() provides a more compact and tidy notation

• funs() specifies the set of functions to apply, such as sum, mean, min, or max

12. What is the task of the group_by() function? This function is included in which package?

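• Groups the rows of a data frame by the values of one or more variables, so that later operations such as summarise() are applied to each group separately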

• dplyr
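A short sketch of grouping, assuming dplyr is installed: the rows of iris are grouped by Species and a summary is then computed once per group. The names by_species, count and medSL are illustrative.

```r
library(dplyr)

# Form one group per species, then summarise each group separately.
by_species <- group_by(iris, Species)
res <- summarise(by_species,
                 count = n(),                    # rows in each group
                 medSL = median(Sepal.Length))   # per-group median
print(res)
```

The result has one row per species rather than one row for the whole dataset, which is what makes comparisons between sub-groups possible.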

13. Which function will you use if you want to study potential differences among the sub-groups?

• Maybe subsets

14. The top algorithm/code chunk on page 90 (Code 4) gives us a way to create a function to obtain the mode of a variable. Go through this algorithm. Now, replace

“algae$mxPh” with “iris$Sepal.Length” and

“algae$season” with “iris$Petal.Length”

15. Explain the centralValue() function. What does it do?

• Used to obtain the most adequate statistic of centrality for a given sample of values. It returns the median in the case of numeric variables and the mode for nominal variables
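The behavior described above can be approximated in base R. This is a simplified stand-in, not the DMwR2 implementation: the name central_value is hypothetical, and it returns the mode of a nominal variable as a character string.

```r
# Hypothetical helper mimicking the idea behind centralValue():
# median for numeric input, most frequent value otherwise.
central_value <- function(x) {
  x <- x[!is.na(x)]                      # ignore missing values
  if (is.numeric(x)) {
    median(x)
  } else {
    counts <- table(as.factor(x))        # frequency of each level
    names(counts)[which.max(counts)]     # first most frequent level
  }
}

num_center <- central_value(iris$Sepal.Length)
nom_center <- central_value(c("a", "b", "b"))
```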

16. (a) Explain the inter-quartile range (IQR).

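• The range covered by the middle 50% of the values in a dataset, i.e., the difference between the third and first quartiles (Q3 - Q1)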

(b) Explain the x-quartile

• The value below which there are x% of the observed values

(c) What does a large value of the IQR mean?

• The central values are spread over a large range

(d) What does a small value of the IQR mean?

• The central values are tightly packed together
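To make the definitions in question 16 concrete, here is a small base R sketch that computes the quartiles of iris$Sepal.Length and checks that the built-in IQR() equals Q3 minus Q1. The variable names are illustrative.

```r
x <- iris$Sepal.Length

# First and third quartiles: the values below which 25% and 75%
# of the observations fall.
q <- quantile(x, probs = c(0.25, 0.75))

iqr_manual  <- unname(q[2] - q[1])   # Q3 - Q1
iqr_builtin <- IQR(x)                # same quantity from stats::IQR

print(q)
print(c(manual = iqr_manual, builtin = iqr_builtin))
```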

17. Which measure of spread, or variability, is more susceptible to outliers?

• Range, standard deviation and variance
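The sensitivity to outliers can be illustrated with a small made-up sample: adding one extreme value inflates the range, standard deviation, and variance, while the IQR barely moves. The data and the helper name spread are invented for this sketch.

```r
x     <- c(10, 11, 12, 13, 14, 15, 16)   # made-up sample
x_out <- c(x, 100)                       # same sample plus one outlier

# Hypothetical helper collecting four measures of spread.
spread <- function(v) {
  c(range = diff(range(v)), sd = sd(v), var = var(v), IQR = IQR(v))
}

s     <- spread(x)
s_out <- spread(x_out)
print(rbind(clean = s, with_outlier = s_out))
```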

18. (a) Using the Iris dataset, obtain the quantiles of the variable (or feature), Length, by Species.

aggregate(iris$Sepal.Length, list(Species = iris$Species), quantile)

19. Find the Mode of the subgroup, “iris$Species.”

Mode <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}

Mode(iris$Sepal.Length, na.rm = TRUE)

Mode(iris$Petal.Length)

# find mode of subgroup
Mode(iris$Species, na.rm = TRUE)

• setosa (all three species occur 50 times each, so the tie is resolved in favor of the first value found)

20. (a) What are “pipes?”

(b) What is the “piping syntax?”

• Pipes are used to pass the output of one function to another function, thereby enabling functions to be chained together.

(c) What is the “pipe operator” (%>%)?

• The pipe operator passes the left-hand side of the operator as the first argument of the function on the right-hand side of the operator.
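A brief sketch of the piping syntax, assuming dplyr is installed (it makes the %>% operator from magrittr available). The nested call and the piped chain below compute the same value.

```r
library(dplyr)   # provides the %>% operator

# Nested form: read from the inside out.
nested <- round(mean(iris$Sepal.Length), 2)

# Piped form: read left to right; each result becomes the
# first argument of the next function.
piped <- iris$Sepal.Length %>% mean() %>% round(2)

print(identical(nested, piped))
```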

21. In Code 9, the second chunk of code from the top of page 92, interpret

“Species = iris$Species,” which is in the second argument of the aggregate() function. What does it all mean?

• The second argument gives aggregate() a list of factors used to form the sub-groups of the data, here the three species: setosa, versicolor, and virginica. The call then returns the quartiles, including the minimum and maximum lengths, for each species

22. In Code 10, the third chunk of code from the top of P.92, interpret all three arguments of the aggregate() function. What do they all mean?

• The sepal length for setosa goes from 4.3 to 5.8 with a first quartile of 4.8, a median of 5.0 and a third quartile of 5.2

• The sepal length for versicolor goes from 4.9 to 7.0 with a first quartile of 5.6, a median of 5.9 and a third quartile of 6.3

• The sepal length for virginica goes from 4.9 to 7.9 with a first quartile of 6.2, a median of 6.5 and a third quartile of 6.9
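The figures above can be reproduced in base R. The exact form of Code 10 is not shown in this assignment, so the three-argument call below (formula, data, function) is an assumption about what it looks like.

```r
# Argument 1: formula "summarise Sepal.Length for each Species"
# Argument 2: the data frame holding those columns
# Argument 3: the function applied to each sub-group
q_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = quantile)
print(q_by_species)

# The Sepal.Length column is a matrix with one row per species
# and one column per quantile (0%, 25%, 50%, 75%, 100%).
setosa_q     <- q_by_species$Sepal.Length[q_by_species$Species == "setosa", ]
versicolor_q <- q_by_species$Sepal.Length[q_by_species$Species == "versicolor", ]
```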

23. In some datasets a column (or a feature, or a variable) may contain symbols such as “?” in some of its rows (Look at Section 3.3.1.4 on Pp. 60 and 61). If we use the class() function on that column, we are sure to get the column labeled as “factor.” However, assume we want this column to be labeled “integer.” Which function can we use to parse a column, or a vector of values, from “factors” to “integers?”

24. (a) What is the following code used for?

> data(algae, package = "DMwR2")

• Loads the algae dataset from the DMwR2 package

> nasRow <- apply(algae, 1, function(r) sum(is.na(r)))

• Counts the number of NA values in each row of the algae dataset

> cat("The algae dataset contains", sum(nasRow), "NA values.")

• Prints the total number of NA values in the dataset

(b) What results are we looking for?

• This helps find “strange” values in the dataset, in this case unknown values
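The same pattern can be tried without DMwR2 on a small made-up data frame. The data frame df and the name nas_row are invented for this sketch.

```r
# Made-up data with some unknown values.
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c("x", "y", NA, NA))

# Apply over rows (MARGIN = 1) and count the NA values in each row.
nas_row <- apply(df, 1, function(r) sum(is.na(r)))

total_nas     <- sum(nas_row)       # NA values in the whole dataset
rows_with_nas <- sum(nas_row > 0)   # rows containing at least one NA

cat("The dataset contains", total_nas, "NA values in",
    rows_with_nas, "rows.\n")
```

Here sum(nas_row) is the total count of NA values, while sum(nas_row > 0) is the number of rows that contain at least one NA.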

25. (a) What method is used to detect a univariate outlier?

• The boxplot rule, which flags values as outliers if they lie more than a certain distance above the third quartile or below the first quartile

(b) What does that method state?

• A value is an outlier if it falls outside the interval [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR]
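A minimal base R sketch of the boxplot rule follows. The function name boxplot_outliers and the sample values are hypothetical; quartiles come from quantile() with its default settings, which can differ slightly from the hinges used when drawing a boxplot.

```r
# Hypothetical helper: return the values outside
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
boxplot_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- unname(quantile(x, probs = c(0.25, 0.75)))
  iqr   <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x[x < lower | x > upper]
}

out <- boxplot_outliers(c(10, 11, 12, 13, 14, 15, 16, 100))
print(out)
```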

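26. What sort of results does the summary() function yield when applied to a dataset?

• It can be used to quickly summarize the values in a vector, data frame, regression model, or ANOVA model. For a data frame it reports the minimum, first quartile, median, mean, third quartile, and maximum of each numeric column, and the count of each level for factor columns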

27. (a) For what is the function, describe(), used?

• It provides basic descriptive statistics

(b) Which package contains the function, describe()?

• psych package

28. Give a definition of the term, parse.

• Process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.

Assignment 4 - Paul Brown

2022-07-22