Data Engineering and Mining
Summer 2022
Instructor: C. Pierre, Ph.D., M.Sc. in Analytics
Paul Brown
ASSIGNMENT 3
Assignment Due Date: July 8th 2022
1. What is temporal data?
• Temporal data is simply data that represents a state in time, such as the land-use patterns of Hong Kong in 1990, or total rainfall in Honolulu on July 1, 2009. Temporal data is collected to analyze weather patterns and other environmental variables, monitor traffic conditions, study demographic trends, and so on. This data comes from many sources, ranging from manual data entry to data collected using observational sensors or generated from simulation models.
2. What is the “bag of words” approach used for?
• The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification (a small sketch follows below).
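As a rough illustration, a bag-of-words representation can be built in base R by counting term occurrences per document (the two toy documents below are hypothetical):

    docs  <- c("the cat sat on the mat", "the dog sat")
    words <- strsplit(docs, " ")                 # tokenize on spaces
    vocab <- sort(unique(unlist(words)))         # shared vocabulary
    # one row per document, one column per term, cells = term counts
    bow <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
    rownames(bow) <- c("doc1", "doc2")
    bow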
3. What is the approach, TF-IDF, used for?
• TF-IDF defines the importance of a term by taking into consideration the frequency of that term within a single document, and scaling it by how rare the term is across all documents (the inverse document frequency), so that words common to every document are down-weighted (see the sketch below).
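A minimal sketch of TF-IDF weighting with the tm package (the toy corpus is a hypothetical example):

    library(tm)
    corp <- VCorpus(VectorSource(c("the cat sat on the mat",
                                   "the dog chased the cat")))
    # weightTfIdf scales each term frequency by its inverse document frequency
    dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
    inspect(dtm)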
4. The package, tm, is a popular R package. It has several functions for carrying out what sorts of tasks?
• The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the usage of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands.
5. What are some of the preprocessing steps that are applied to a
corpus using the
tm_map() function?
• tm_map() applies a transformation to every document in a corpus. Typical preprocessing steps include converting text to lower case, removing punctuation and numbers, removing stopwords, stripping extra whitespace, and stemming (a sketch follows below).
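A sketch of common tm_map() preprocessing steps, using a small hypothetical corpus:

    library(tm)
    corp <- VCorpus(VectorSource(c("Some RAW text, with 3 numbers & Punctuation!",
                                   "Another   short document here.")))
    corp <- tm_map(corp, content_transformer(tolower))       # lower-case everything
    corp <- tm_map(corp, removePunctuation)                  # drop punctuation
    corp <- tm_map(corp, removeNumbers)                      # drop digits
    corp <- tm_map(corp, removeWords, stopwords("english"))  # remove common stopwords
    corp <- tm_map(corp, stripWhitespace)                    # collapse repeated spaces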
6. What is word stemming used for?
• Stemming reduces inflected word forms to a common root (stem), so that variants of the same word are unified across documents (see the example below).
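For example, the Porter stemmer (via the SnowballC package, which tm's stemDocument() relies on) maps several inflected forms to one stem:

    library(SnowballC)
    wordStem(c("connect", "connected", "connection", "connecting"))
    # -> "connect" "connect" "connect" "connect"
    # within a tm corpus: corp <- tm_map(corp, stemDocument)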
7. What are some of the dimensionality challenges to many analysis
tools?
• Data can be too large to handle by existing hardware
• Data can go against the assumptions of some modeling tools
8. List some things a data scientist/data miner can do to fit data properly into the central memory of computers.
• Can use random samples of the data to reduce the amount of data in a dataset
9. Why does a data engineer/data miner use random sampling of a
subset of rows?
• It can become difficult to fit all the data in memory due to the fast growth rate of data
10. List some challenges that prompt a data miner/analytics professional to trim the dataset (i.e., reduce dimensionality).
• The amount and diversity of data sources are challenges. Dimensionality can go against the assumptions of modeling tools.
11. Go through Code 10 (P.79). Then describe the results.
• The maximum percentage is set at 0.5: if the percent is less than one and the percent is greater than the max percent (0.5), then the value is not used; otherwise the value is picked…
12. Open the link given in the assignment (an external site). Describe what you see. Describe some of the content.
• Someone states that he has a csv file to be processed but that it does not fit into memory. He then asks, “How can one read 20K random lines of it to do basic statistics on the selected data frame?”
Three responses are:
o You can also just do it in the terminal with perl.
o perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt
o This won't necessarily get you exactly 20,000 lines. (Here it'll grab about .01 or 1% of the total lines.) It will, however, be really fast, and you'll have a nice copy of both files in your directory. You can then load the smaller file into R however you want.
13. If you are sampling rows of a dataset, give an example of an
algorithm/code in R
that you would use on a small dataset to program random sampling of
a subset of
rows of the dataset.
• For a small dataset that fits in memory, base R's sample() can draw a random subset of row indices (a sketch follows below).
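One minimal sketch in base R, using sample() on a hypothetical small data frame:

    set.seed(42)                                          # make the sampling reproducible
    df  <- data.frame(x = rnorm(1000), y = runif(1000))   # hypothetical small dataset
    idx <- sample(nrow(df), size = 20)                    # 20 random row indices, no replacement
    sub <- df[idx, ]                                      # the sampled subset of rows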
14. In the section, “Sampling Rows,” the authors give their strategy for picking lines of large files to reduce dimensionality (hint: they draw a random number between 0 and 1). What is that strategy?
• The authors suggest going through each line of an original large file and drawing a random number between 0 and 1. If the number is below a certain percentage, the line is selected for the final sample; otherwise, move on to the next line (a rough sketch follows below).
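A rough R sketch of that strategy (the file name big_file.csv and the 1% threshold are placeholders):

    perc <- 0.01                                   # fraction of lines to keep
    con  <- file("big_file.csv", open = "r")
    keep <- character(0)
    while (length(line <- readLines(con, n = 1)) > 0) {
      if (runif(1) < perc) keep <- c(keep, line)   # keep the line if the draw falls below perc
    }
    close(con)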
15. Explain feature/variable selection: what are we doing when we
perform feature
selection?
• Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables both to reduce the computational cost of modeling and, in some cases, to improve the performance of the model. Feature selection involves evaluating the relationship between each input variable and the target variable using statistics, and selecting those input variables that have the strongest relationship with the target variable.
16. List and explain two popular methods we use to select
features.
• Filter methods and wrapper methods
• Filter methods involve looking at variables individually and asserting their value using some metric, which is then used to rank them and remove the less relevant ones. This is usually a one-shot approach.
• Wrapper methods take into consideration the objectives of the analysis you plan to carry out with the data set. They search for the subset of variables that is most adequate in terms of the criteria used to evaluate the results of the posterior modeling stages. This involves an iterative search process.
17. List two ways of grouping existing feature selection
methods.
• Supervised and unsupervised are two ways to group existing feature
selection
methods.
18. Explain unsupervised
• Unsupervised methods look at each feature individually and calculate its relevance using only the values of that variable.
• Unsupervised feature selection techniques ignore the target variable, such as methods that remove redundant variables using correlation.
19. Explain supervised
• Supervised methods explore the existence of a special variable in the dataset, the so-called target variable, which forms the basis of predictive analytics. They evaluate each feature by looking at its relationship with the target variable.
• Supervised machine learning using text data involves building a statistical model to estimate some output from input that includes language.
20. When comparing one or more features of a dataset to a target feature, and looking at each feature's relationship with the target variable, which method should I use to select features?
• Use the wrapper method, because it uses some predictive model to assert the value of a set of candidate features.
21. Filter methods are very similar to unsupervised methods. True or False?
• True. Filter methods assert the value of each variable individually using some metric, without relying on a predictive model, which is essentially how unsupervised feature selection methods work.
22. Fill in the blank:
“Wrapper methods are most of the time _______________ __________ because they typically use some predictive model to assert the value of a set of candidate features.”
• Supervised methods
23. List two examples of simple unsupervised filter methods.
• Check for constant variables
• Check for ID-like variables
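Both checks can be written as simple unsupervised filters in base R (the small data frame below is hypothetical):

    df <- data.frame(id = 1:5, flag = rep("A", 5), score = c(3, 5, 2, 5, 4))
    is_constant <- sapply(df, function(col) length(unique(col)) == 1)
    is_id_like  <- sapply(df, function(col) length(unique(col)) == nrow(df))
    names(df)[is_constant]   # "flag" - no variability, carries no information
    names(df)[is_id_like]    # "id"   - a distinct value per row, ID-like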
24. In some cases, when we are trying to reduce the dimensionality of a dataset with 100,000 rows and 140,000 columns, we may eliminate one or more highly correlated features. True or False?
• True. We want to reduce the redundancy in a dataset; when two features are highly correlated, one of them can be dropped with little loss of information (see the sketch below).
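For example, highly correlated pairs can be flagged with cor() (the simulated data and the 0.9 cutoff are illustrative):

    set.seed(1)
    x1 <- rnorm(100)
    x2 <- x1 + rnorm(100, sd = 0.05)   # nearly a copy of x1, hence redundant
    x3 <- rnorm(100)
    cm <- cor(data.frame(x1, x2, x3))
    # report pairs whose absolute correlation exceeds 0.9
    which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)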
25. Explain the Principal Component Analysis (PCA) task when selecting features and reducing dimensionality.
• PCA searches for a new, smaller set of variables that can be used to explain most of the variability of the original data, which allows one to carry out an analysis using only that subset (a minimal sketch follows below).
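A minimal sketch with base R's prcomp(), using the built-in iris measurements as stand-in data:

    pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
    summary(pca)              # proportion of variance explained by each component
    reduced <- pca$x[, 1:2]   # keep only the first two principal components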