Data Engineering and Mining

Summer 2022

Instructor: C. Pierre, Ph.D., M.Sc. in Analytics

Paul Brown

ASSIGNMENT 3

Assignment Due Date: July 8th 2022

1. What is temporal data?

• Temporal data is simply data that represents a state in time, such as the land-use patterns of Hong Kong in 1990, or total rainfall in Honolulu on July 1, 2009. Temporal data is collected to analyze weather patterns and other environmental variables; such data comes from many sources, ranging from manual data entry to data collected using observational sensors or generated from simulation models.

2. What is the “bag of words” approach used for?

• The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.

3. What is the approach, TF-IDF, used for?

• TF-IDF defines the importance of a term by taking into consideration the importance of that term in a single document and scaling it by its importance across all documents, so terms that appear in many documents are down-weighted.
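
A common formulation (standard notation, not taken from the assignment text) is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t. In R, the tm package can apply this weighting when building a document-term matrix; the snippet below is a minimal sketch assuming a tm corpus object named corpus has already been created:

    library(tm)

    # Build a document-term matrix weighted by TF-IDF instead of raw counts
    dtm_tfidf <- DocumentTermMatrix(corpus,
                                    control = list(weighting = weightTfIdf))

    inspect(dtm_tfidf)  # terms frequent in one document but rare overall score highest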

4. What is the tm package in R used for? What sorts of tasks does it support?

• The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the usage of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands.
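
As an illustration of the sorts of tasks tm handles, the sketch below builds a small corpus from an in-memory character vector and turns it into a bag-of-words document-term matrix; the example documents are made up for demonstration:

    library(tm)

    docs <- c("data mining finds patterns in data",
              "text mining applies data mining to text")

    # Build a corpus from an in-memory character vector
    corpus <- VCorpus(VectorSource(docs))

    # Bag-of-words representation: one row per document, one column per term
    dtm <- DocumentTermMatrix(corpus)
    inspect(dtm)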

5. What are some of the preprocessing steps that are applied to a corpus using the tm_map() function?

• tm_map() applies transformation functions to every document in a corpus. Typical preprocessing steps include converting text to lowercase, removing punctuation, removing numbers, removing stopwords, stripping extra whitespace, and stemming.
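
A minimal sketch of such a preprocessing pipeline, assuming the corpus object built earlier (this particular set of steps is a common choice, not one prescribed by the assignment):

    library(tm)

    corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
    corpus <- tm_map(corpus, removePunctuation)                   # drop punctuation
    corpus <- tm_map(corpus, removeNumbers)                       # drop digits
    corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop stopwords
    corpus <- tm_map(corpus, stripWhitespace)                     # collapse extra spaces
    corpus <- tm_map(corpus, stemDocument)                        # reduce words to stems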

6. What is word stemming used for?

• Stemming reduces inflected words to a common root (stem) form so that variants of the same word are unified across documents.
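
For instance, using the SnowballC package (which tm's stemDocument relies on):

    library(SnowballC)

    # All three inflections reduce to the same stem, "run"
    wordStem(c("run", "runs", "running"))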

7. What are some of the dimensionality challenges to many analysis tools?

• Data can be too large for existing hardware to handle (e.g., it may not fit into main memory)

• Data can go against the assumptions of some modeling tools

8. List some things a data scientist/data miner can do to fit data properly into the central memory of computers.

• One can use random samples of the data to reduce the amount of data that has to be held in memory

9. Why does a data engineer/data miner use random sampling of a subset of rows?

• It can become difficult to fit all the data in memory due to the fast growth rate of data

10. List some challenges that prompt a data miner/analytics professional to trim the dataset (i.e., reduce dimensionality).

• The amount and diversity of data sources are challenges. Dimensionality can go against the assumptions of modeling tools.

11. Go through Code 10 (P.79). Then describe the results.

• The maximum percentage is set at 0.5. If the percentage is less than one and greater than the maximum percentage (0.5), the value is not used; otherwise the value is picked…

12. Follow the link given in the assignment (an external site). Describe what you see. Describe some of the content.

• Someone states that he has a CSV file to be processed but that it does not fit into memory. He then asks, “How can one read 20K random lines of it to do basic statistics on the selected data frame?”

Three responses are:

o You can also just do it in the terminal with Perl.

o perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt

o This won’t necessarily get you exactly 20,000 lines. (Here it’ll grab about .01 or 1% of the total lines.) It will, however, be really fast, and you’ll have a nice copy of both files in your directory. You can then load the smaller file into R however you want.
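
A minimal follow-up sketch for that last step (subset.txt is the file produced by the Perl command above; whether a header row survives the random sampling is not guaranteed, hence header = FALSE):

    # Read the sampled lines back into R as a data frame
    subset_df <- read.csv("subset.txt", header = FALSE)

    nrow(subset_df)      # roughly 1% of the original file's lines
    summary(subset_df)   # basic statistics on the sampled rows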

13. If you are sampling rows of a dataset, give an example of an algorithm/code in R that you would use on a small dataset to program random sampling of a subset of rows of the dataset.

• On a small dataset that already fits in memory, the sample() function can be used to draw a random subset of row indices, as shown in the sketch below.
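
A minimal sketch using the built-in iris data frame (the data set and the sample size of 20 are arbitrary choices for illustration):

    set.seed(42)                    # make the sample reproducible
    idx <- sample(nrow(iris), 20)   # draw 20 random row indices
    iris_sample <- iris[idx, ]      # keep only the sampled rows

    dim(iris_sample)                # 20 rows, same columns as iris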

14. In the section, “Sampling Rows,” the authors give their strategy for picking lines of large files to reduce dimensionality (hint: they draw a random number between 0 and 1). What is that strategy?

• The authors suggest going through each line of the original large file and drawing a random number between 0 and 1. If the number is below a certain percentage, the line is selected for the final sample; otherwise, move on to the next line.
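
A minimal sketch of that strategy in R (the file names and the 1% threshold are placeholders, not from the text):

    pct <- 0.01                                   # keep roughly 1% of the lines
    con_in  <- file("big_file.csv", open = "r")
    con_out <- file("sample.csv", open = "w")

    while (length(line <- readLines(con_in, n = 1)) > 0) {
      if (runif(1) < pct) {                       # random draw between 0 and 1
        writeLines(line, con_out)                 # keep this line in the sample
      }
    }

    close(con_in)
    close(con_out)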

15. Explain feature/variable selection: what are we doing when we perform feature selection?

• Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables both to reduce the computational cost of modeling and, in some cases, to improve the performance of the model. It involves evaluating the relationship between each input variable and the target variable using statistics, and selecting those input variables that have the strongest relationship with the target variable.

16. What are the two main types of feature selection methods?

• Filter methods and wrapper methods

• Filter methods involve looking at variables individually and asserting their value using some metric, which is then used to rank them and remove the less relevant ones. This is usually a one-shot approach.

• Wrapper methods take into consideration the objectives of the analysis you plan to carry out with the data set. They search for the subset of variables that is most adequate in terms of the criteria used to evaluate the results of the posterior modeling stages. This involves an iterative search process.

17. List two ways of grouping existing feature selection methods.

• Supervised and unsupervised are two ways to group existing feature selection methods.

18. Explain unsupervised

• Unsupervised methods look at each feature individually and calculate its relevance using only the values of the variable.

• Unsupervised feature selection techniques ignore the target variable; examples include methods that remove redundant variables using correlation.
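
A minimal sketch of that idea, using the built-in mtcars data and an arbitrary 0.9 cutoff:

    data(mtcars)                                   # example data; no target variable is used
    cor_mat <- cor(mtcars)
    cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0  # keep each variable pair only once

    # Flag one variable from each highly correlated pair (|cor| > 0.9) as redundant
    redundant <- names(which(apply(abs(cor_mat) > 0.9, 2, any)))
    redundant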

19. Explain supervised

• Supervised methods explore the existence of a special variable in the dataset, the so-called target variable, which forms the basis of predictive analytics. They evaluate each feature by looking at its relationship with the target variable.

• Supervised machine learning using text data involves building a statistical model to estimate some output from input that includes language.

20. When comparing one or more features of a dataset to a target feature, and looking at each feature’s relationship with the target variable, which method should I use to select features?

• Use the wrapper method because it uses some predictive model to assert the value of a set of candidate features

21. Filter methods are very similar to unsupervised methods. True or False?

• Filtering is the act of choosing a subset of your current data that fits some criterion.

22. Fill in the blank:

“Wrapper methods are most of the time _______________ __________ because they typically use some predictive model to assert the value of a set of candidate features.”

• Supervised methods

23. List two examples of simple unsupervised filter methods.

• Check for constant variables

• Check for ID-like variables
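
A minimal sketch of both checks on a generic data frame df (the name df is a placeholder):

    # Constant variables: only one distinct value, so they carry no information
    constant_cols <- names(df)[sapply(df, function(x) length(unique(x)) == 1)]

    # ID-like variables: a different value on every row (e.g., row numbers or keys)
    id_like_cols <- names(df)[sapply(df, function(x) length(unique(x)) == nrow(df))]

    constant_cols
    id_like_cols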

24. In some cases, when we are trying to reduce the dimensionality of a dataset with 100,000 rows and 140,000 columns, we may eliminate one or more highly correlated features. True or False?

• True. We want to reduce the redundancy in a dataset.

25. Explain the Principal Component Analysis (PCA) task when selecting features and reducing dimensionality.

• PCA searches for a new, smaller set of variables that can be used to explain most of the variability of the original data, which allows one to carry out the analysis using only that subset.
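
A minimal sketch using prcomp() on the built-in mtcars data (the data set and the choice to keep three components are illustrative only):

    pca <- prcomp(mtcars, scale. = TRUE)   # standardize variables before extracting components

    summary(pca)                           # proportion of variance explained by each component
    scores <- pca$x[, 1:3]                 # keep the first three principal components as new features
    head(scores)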