1. What is temporal data?

    • Temporal data is data related to time, whether past, present, or future. Entries can be recorded at granularities such as seconds, minutes, hours, days, or years.
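
    • A minimal sketch in R (the dates below are invented for illustration) of temporal entries recorded at different granularities:

      as.Date("2015-03-21")                                       # day granularity
      as.POSIXct("2015-03-21 14:30:05")                           # second granularity
      seq(as.Date("2015-01-01"), by = "month", length.out = 6)    # a monthly sequence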
  2. What is the “bag of words” approach used for?

    • It is used to transform text documents into a vector of features.
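
    • A minimal sketch in R using the tm package (the example documents are invented):

      library(tm)
      docs   <- c("I love data mining", "data mining needs data")   # invented documents
      corpus <- VCorpus(VectorSource(docs))
      dtm    <- DocumentTermMatrix(corpus)   # each document becomes a vector of term counts
      inspect(dtm)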
  3. What is the approach, TF-IDF, used for?

    • It is used to weight the terms of a document rather than just count them: TF (term frequency) measures how often a term (word) occurs in a document, and IDF (inverse document frequency) down-weights terms that appear in many documents, so terms that are frequent in a document but rare across the collection get the highest weights.
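
    • A common formulation is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t. A minimal sketch in R, assuming the tm package (the documents are invented):

      library(tm)
      corpus    <- VCorpus(VectorSource(c("I love data mining", "data mining needs data")))   # invented documents
      dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))         # tf-idf weights instead of raw counts
      inspect(dtm_tfidf)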
  4. The package, tm, is a popular text-mining package in R. It has several functions for carrying out what sorts of tasks?

    • The tm package has functions for reading text documents from different sources and formats, for pre-processing those documents, and for analyzing them. In other words, the tm package has functions that read, clean, and analyze documents.
  5. What are some of the preprocessing steps that are applied to a corpus using the tm_map() function?

    • removing punctuation

    • removing numbers

    • transforming to lowercase

    • removing extra whitespace

    • eliminating ‘stop words’

    • mapping words with the same linguistic root to a single term (word stemming)

      • For example: love loving lovingly loved lover lovely love would be changed to love love love love lover love love
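
    • A minimal sketch in R of these preprocessing steps, assuming the tm and SnowballC packages (the example text is invented):

      library(tm)          # corpus handling and tm_map()
      library(SnowballC)   # used by stemDocument() for word stemming

      corpus <- VCorpus(VectorSource("Love, loving and lovingly... 3 lovers loved lovely!"))   # invented text
      corpus <- tm_map(corpus, removePunctuation)                   # remove punctuation
      corpus <- tm_map(corpus, removeNumbers)                       # remove numbers
      corpus <- tm_map(corpus, content_transformer(tolower))        # transform to lowercase
      corpus <- tm_map(corpus, stripWhitespace)                     # remove extra whitespace
      corpus <- tm_map(corpus, removeWords, stopwords("english"))   # eliminate stop words
      corpus <- tm_map(corpus, stemDocument)                        # map words with the same root to one term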
  6. What is word stemming used for?

    • Word stemming reduces words that are variations of the same linguistic root to a single term.

      • For example: love loving lovingly loved lover lovely love would be changed to love love love love lover love love
  7. What are some of the dimensionality challenges to many analysis tools?

    • The data is too large to fit into the central memory of the computer
    • There are more columns (features) than rows (observations)
  8. List some things a data scientist/data miner can do to fit data properly into the central memory of computers?

    • Random sampling of a subset of rows

    • Principal Component Analysis (PCA)

      • Explained in question 25.
    • Using a probabilistic approach: assign a random number between 0 and 1 to each line and keep the lines whose random number is less than a chosen threshold, e.g. 0.1 to keep roughly 10% of the lines (see question 14).

  9. Why does a data engineer/data miner use random sampling of a subset of rows?

    • When the data is too large to fit into the central memory of the computer.
  10. List some challenges that prompt a data miner/analytics professional to trim the dataset (i.e., reduce dimensionality).

    • The data is too large to fit into the central memory of the computer

    • There are more columns (features) than rows (observations)

  11. Go through Code 10 (P.79). Then describe the results.

    • The code does two things: the first part determines the number of lines the text file has, and the second part obtains the random sample by selecting rows.
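
    • Not the book’s Code 10 (which is not reproduced here), just a small-scale sketch in R of those two steps; the file name is hypothetical, and this version reads the whole file into memory, so it only illustrates the idea:

      # Part 1: determine how many lines the text file has
      allLines <- readLines("bigfile.txt")   # hypothetical file name
      nLines   <- length(allLines)

      # Part 2: obtain a random sample by selecting a subset of the rows
      keep    <- sort(sample(nLines, size = round(0.1 * nLines)))   # roughly 10% of the line numbers
      sampled <- allLines[keep]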
  12. Go to the site, http://stackoverflow.com/questions/22261082. Describe what you see. Describe some of the content.

    • Someone posted a question about using random sampling to obtain a subset because the file was too large for memory. There are four answers to the question, and each seems to take a different approach: one uses perl (which is what the book uses), another uses the read.csv.sql function, another uses IDs to generate a sample, and the last one uses the function lapply.
  13. If you are sampling rows of a dataset, give an example of an algorithm/code in R that you would use on a small dataset to program random sampling of a subset of rows of the dataset.

    • Option 1
      • perl -ne 'print if (rand() < 0.01)' biglist.txt > subset.txt
    • Option 2
      • RowsInCSV <- 1000000  # or however many rows there are

        List <- lapply(1:20000, function(x) read.csv("YourFile.csv", nrows = 1, skip = sample(RowsInCSV, 1), header = FALSE))
        DF <- do.call(rbind, List)

  14. In the section, “Sampling Rows,” the authors give their strategy for picking lines of large files to reduce dimensionality (hint: they draw a random number between 0 and 1). What is that strategy?

    • The strategy is to assign each line of the file a random number between 0 and 1 and keep the lines whose random number is less than a chosen threshold, such as 0.1. This will not always give exactly 10% of the lines, so we can increase the threshold slightly and then take exactly 10% of the lines from that larger sample. We do not have to use 10%, but if the dataset is very large, 10% may be a reasonable limit.
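
    • A minimal sketch in R of this strategy (the file name is hypothetical):

      lines  <- readLines("bigfile.txt")   # hypothetical large text file
      u      <- runif(length(lines))       # one random number between 0 and 1 per line
      subset <- lines[u < 0.1]             # keep the lines whose random number falls below the threshold
      length(subset) / length(lines)       # close to, but usually not exactly, 10%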
  15. Explain feature/variable selection: what are we doing when we perform feature selection?

    • When we perform feature selection, we select a subset of the variables by removing irrelevant variables or variables that are highly correlated with others. This also helps reduce the dimensionality of the dataset.
  16. List and explain two popular methods we use to select features.

    • Filter methods

      • They assign each variable a score using some metric, rank the variables, and remove the least relevant ones (a small sketch of this idea appears after this list).
    • Wrapper methods

      • They search for the subset of variables that optimizes the analysis carried out in the modeling stage.
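
    • A minimal sketch in R of the filter idea above, scoring each feature by its absolute correlation with the target, ranking, and keeping the top ones (the data frame and column names are invented):

      set.seed(1)
      d <- data.frame(matrix(rnorm(500), ncol = 5))   # invented data: 100 rows, 5 numeric features
      names(d) <- paste0("x", 1:5)
      d$y <- 2 * d$x1 + rnorm(100)                    # invented target, driven mostly by x1

      scores <- sapply(d[ , paste0("x", 1:5)], function(f) abs(cor(f, d$y)))   # score each feature
      ranked <- sort(scores, decreasing = TRUE)                                # rank the features
      keep   <- names(ranked)[1:2]                                             # keep the most relevant ones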
  17. List two ways of grouping existing feature selection methods.

    • Unsupervised

    • Supervised

  18. Explain unsupervised feature selection methods.

    • Unsupervised methods “look at each feature individually and calculates its relevance using only the values of the variable”.
  19. Explain supervised feature selection methods.

    • Supervised methods evaluate each feature by looking at its relationship with the special/target variable. We need to assume that such a special variable exists in the dataset.
  20. When comparing one or more features of a dataset to a target feature, and looking at each feature’s relationship with the target variable, which method should I use to select features?

    • Supervised Method
  21. Filter methods are very similar to unsupervised methods. True or False?

    • True
  22. Fill in the blank:

    • “Wrapper methods are most of the time supervised methods because they typically use some predictive model to assert the value of a set of candidate features.”
  23. List two examples of simple unsupervised filter methods.

    • Checking for constant-like variables
    • Checking for ID-like variables
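
    • A minimal sketch in R of these two checks (the data frame is invented):

      d <- data.frame(id    = 1:100,                           # ID-like: every value is unique
                      const = rep(1, 100),                     # constant-like: no variability
                      x     = sample(1:5, 100, replace = TRUE))

      constant_like <- names(d)[sapply(d, function(v) length(unique(v)) == 1)]
      id_like       <- names(d)[sapply(d, function(v) length(unique(v)) == nrow(d))]
      constant_like   # "const"
      id_like         # "id"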
  24. In some cases, when we are trying to reduce the dimensionality of a dataset with 100,000 rows and 140,000 columns, we may eliminate one or more highly correlated features. True or False?

    • True
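
    • A minimal sketch in R of eliminating one feature from a highly correlated pair (the data are invented; x2 is built to be almost identical to x1):

      set.seed(1)
      x1 <- rnorm(100)
      d  <- data.frame(x1 = x1, x2 = x1 + rnorm(100, sd = 0.01), x3 = rnorm(100))

      cm <- abs(cor(d))
      cm[!lower.tri(cm)] <- 0                    # look at each pair of features only once
      high <- which(cm > 0.95, arr.ind = TRUE)   # pairs with very high correlation
      d_reduced <- d[ , -high[ , "row"]]         # drop one feature of each such pair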
  25. Explain the task of Principal Component Analysis (PCA) when selecting features and reducing dimensionality.

    • PCA builds a smaller set of new variables that keeps the same, or close to the same, variability as the original set. Each of the new variables (principal components) is a linear combination of the original variables.
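
    • A minimal sketch in R using base R’s prcomp() on invented data:

      set.seed(1)
      d   <- as.data.frame(matrix(rnorm(1000), ncol = 10))   # invented data: 100 rows, 10 columns
      pca <- prcomp(d, scale. = TRUE)   # each principal component is a linear combination of the columns
      summary(pca)                      # proportion of the variability captured by each component
      reduced <- pca$x[ , 1:3]          # keep the first few components as a lower-dimensional feature set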