What is temporal data?
What is the “bag of words” approach used for?
What is the TF-IDF approach used for?
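As a concrete reference for the two questions above, here is a minimal base-R sketch of bag-of-words counts and a TF-IDF weighting. The toy documents are made up, and the formula (term frequency times the natural log of inverse document frequency) is one common variant, assumed here for illustration; libraries may weight or normalize differently.

```r
# Toy corpus (hypothetical example documents)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat on the log",
          d3 = "cats and dogs")

# Bag of words: split on whitespace and count terms per document, ignoring word order
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# TF-IDF: term frequency scaled by the log of inverse document frequency
tf    <- counts / rowSums(counts)
idf   <- log(nrow(counts) / colSums(counts > 0))
tfidf <- sweep(tf, 2, idf, `*`)
```

Rows of `counts` and `tfidf` are documents and columns are terms. A term that appears in most documents, such as "the", gets a low weight, while a term confined to one document gets a higher one.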
The tm package is popular in R. What sorts of tasks do its functions carry out?
What are some of the preprocessing steps that are applied to a corpus using the tm_map() function?
removing punctuation
removing numbers
transforming to lowercase
removing extra whitespace
eliminating "stop words"
mapping words that are linguistic variations of one another to the same term (stemming)
What is word stemming used for?
Word stemming maps words that are linguistic variations of one another to the same term.
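The preprocessing steps listed above can be sketched with tm_map() roughly as follows. This assumes the tm package is installed, and that SnowballC is installed as well, since stemDocument() relies on it. The two example texts are made up, and the order of the steps is one reasonable choice rather than a required one.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package installed

# Hypothetical two-document corpus for illustration
texts  <- c("The 3 Cats are RUNNING,   quickly!", "Dogs ran in 2019; dogs run daily.")
corpus <- VCorpus(VectorSource(texts))

corpus <- tm_map(corpus, content_transformer(tolower))       # transform to lowercase
corpus <- tm_map(corpus, removePunctuation)                  # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # eliminate stop words
corpus <- tm_map(corpus, stemDocument)                       # stem the remaining words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse runs of whitespace

# Pull the cleaned text back out as a character vector
cleaned <- sapply(corpus, content)
```

Lowercasing comes before stop-word removal here because the built-in English stop-word list is lowercase.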
What are some of the dimensionality challenges that many analysis tools face?
List some things a data scientist/data miner can do to make the data fit into a computer's main memory.
Random sampling of a subset of rows
Principal Component Analysis (PCA)
Probabilistic sampling: draw a random number between 0 and 1 for each line and keep the line only if the number is below a chosen threshold (for example, below 0.1 to keep roughly 10% of the lines).
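Two of the options above, random sampling of rows and the per-line random draw, can be sketched in R for data that is already in memory. The data frame and the 10% rate are assumptions chosen for illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical data set: 1,000 rows, 3 columns
df <- data.frame(id = 1:1000, x = rnorm(1000), y = runif(1000))

# Option 1: exact-size random sample of rows (here 10%), without replacement
idx       <- sample(nrow(df), size = round(0.1 * nrow(df)))
sample_df <- df[idx, ]

# Option 2: probabilistic sample, one uniform draw per row, keep the row if below 0.1
keep    <- runif(nrow(df)) < 0.1
prob_df <- df[keep, ]
```

Option 1 returns exactly 100 rows. Option 2 returns about 100 rows, with the exact count varying from run to run, but it does not need to know the total number of rows in advance.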
Why does a data engineer/data miner use random sampling of a subset of rows?
List some challenges that prompt a data miner/analytics professional to trim the dataset (i.e., reduce dimensionality).
Data too large
More columns than rows
Go through Code 10 (P.79). Then describe the results.
Go to the site, http://stackoverflow.com/questions/22261082. Describe what you see. Describe some of the content.
If you are sampling rows of a dataset, give an example of an algorithm/code in R that you would use on a small dataset to program random sampling of a subset of rows of the dataset.
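RowsInCSV <- 1000000  # or however many rows there are
# Read 20,000 single rows, each at a random offset (assumes YourFile.csv has no header line)
List <- lapply(1:20000, function(x) read.csv("YourFile.csv", nrows = 1, skip = sample(RowsInCSV, 1) - 1, header = FALSE))
DF <- do.call(rbind, List)  # stack the sampled rows into one data frame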
In the section “Sampling Rows,” the authors give their strategy for picking lines of large files to reduce dimensionality (hint: they draw a random number between 0 and 1). What is that strategy?
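The idea in the hint, drawing a random number per line, can be applied while streaming a file so that the full file never has to sit in memory. The following is a minimal sketch under stated assumptions, not the authors' code: the temporary file, the chunk size of 1,000 lines, and the 10% rate `p` are all made up for illustration.

```r
set.seed(1)

# Build a small stand-in for a large file (hypothetical data, written to a temp file)
path <- tempfile(fileext = ".csv")
writeLines(c("id,value", paste(1:5000, round(rnorm(5000), 3), sep = ",")), path)

p      <- 0.1                      # assumed sampling rate
con    <- file(path, open = "r")
header <- readLines(con, n = 1)    # always keep the header line
kept   <- character(0)
repeat {
  chunk <- readLines(con, n = 1000)   # read in chunks to bound memory use
  if (length(chunk) == 0) break
  # One uniform draw per line; keep the line only if the draw is below p
  kept <- c(kept, chunk[runif(length(chunk)) < p])
}
close(con)

# Parse only the sampled lines
sampled <- read.csv(text = c(header, kept))
```

Only one chunk plus the kept lines are held in memory at a time, so the approach scales to files much larger than the sample.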
Explain feature/variable selection: what are we doing when we perform feature selection?
List and explain two popular methods we use to select features.
Filter methods
Wrapper methods
List two ways of grouping existing feature selection methods.
Unsupervised
Supervised
Explain unsupervised
Explain supervised
When comparing one or more features of a dataset to a target feature, and looking at each feature’s relationship with the target variable, which method should I use to select features?
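Scoring each feature against the target, as the question above describes, can be sketched as a simple supervised filter. The example uses R's built-in mtcars data with mpg treated as the target. The choice of absolute Pearson correlation as the score and of k = 3 are assumptions for illustration.

```r
# Built-in mtcars data; mpg is treated as the target (an assumption for illustration)
target   <- mtcars$mpg
features <- mtcars[, setdiff(names(mtcars), "mpg")]

# Score each feature by its absolute Pearson correlation with the target
scores <- sapply(features, function(col) abs(cor(col, target)))
ranked <- sort(scores, decreasing = TRUE)

# Keep the top k features (k = 3 is an arbitrary choice)
selected <- names(ranked)[1:3]
```

Each feature is scored on its own, so this kind of filter is cheap to run but cannot detect features that are only useful in combination.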
Filter methods are very similar to unsupervised methods. True or False?
Fill in the blank:
List two examples of simple unsupervised filter methods.
In some cases, when we are trying to reduce the dimensionality of a dataset with 100,000 rows and 140,000 columns, we may eliminate one or more highly correlated features. True or False?
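Eliminating highly correlated features can be sketched as follows on a small synthetic data set. The 0.9 cutoff and the rule of dropping the later column of each correlated pair are assumptions. At the scale mentioned in the question (140,000 columns) the full correlation matrix would likely not fit in memory, so this is illustrative only.

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly a copy of x1
x3 <- rnorm(n)                   # unrelated to x1 and x2
X  <- data.frame(x1, x2, x3)

cutoff <- 0.9                    # assumed correlation threshold
cm <- abs(cor(X))
cm[upper.tri(cm, diag = TRUE)] <- 0   # count each pair once, ignore self-correlation

# Drop any feature that is highly correlated with an earlier column
to_drop <- names(which(apply(cm, 1, max) > cutoff))
reduced <- X[, setdiff(names(X), to_drop), drop = FALSE]
```

This filter never looks at a target variable, which is what makes it an unsupervised method.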
Explain the role of Principal Component Analysis (PCA) in selecting features and reducing dimensionality.
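A minimal sketch of PCA-based reduction with base R's prcomp(), using the built-in iris measurements. The 95% variance threshold is an assumption chosen for illustration. PCA builds new features as linear combinations of the originals instead of picking a subset of them.

```r
# Built-in iris measurements (four numeric columns)
X   <- iris[, 1:4]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the fewest components whose cumulative variance reaches an assumed 95% threshold
k       <- which(cumsum(var_explained) >= 0.95)[1]
reduced <- pca$x[, 1:k, drop = FALSE]
```

`reduced` has the same number of rows as the input but only `k` columns, one for each retained component.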