Problem 1.
Temporal data, also known as time-series data, represents a state in time and is indexed in time order. It’s often used to analyze trends, patterns, and anomalies over time.
Problem 2.
It concerns deciding which attributes/properties should be used to represent a text document.
Problem 3.
It combines term frequency with inverse document frequency, which tries to measure the amount of information carried by each term.
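As a rough illustration of this weighting (my own toy example, not from the text), the following R snippet computes TF-IDF values on a tiny term-count matrix:
tc <- matrix(c(2, 0, 1,
               0, 3, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("doc1", "doc2"), c("data", "mining", "the")))
tf    <- tc / rowSums(tc)                  # term frequency within each document
idf   <- log(nrow(tc) / colSums(tc > 0))   # rarer terms carry more information
tfidf <- sweep(tf, 2, idf, "*")            # weight each term frequency by its IDF
tfidf                                      # note "the" gets weight 0: it occurs in every document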
Problem 4.
It includes several functions for reading text documents from many different sources and formats, functions for carrying out the most frequent pre-processing steps on these documents, and also other functions for analyzing the documents.
Problem 5.
They include removing punctuation and numbers, transforming everything to lowercase, stripping white space, eliminating from further analysis the stop words of the document's language, and carrying out word stemming on the document.
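A minimal sketch of these steps, assuming the tm package is the one being discussed (the example corpus is my own):
library(tm)
docs <- VCorpus(VectorSource(c("This is 1 sample Document!",
                               "Another   document, with 2 numbers.")))
docs <- tm_map(docs, removePunctuation)                  # remove punctuation
docs <- tm_map(docs, removeNumbers)                      # remove numbers
docs <- tm_map(docs, content_transformer(tolower))       # transform to lowercase
docs <- tm_map(docs, stripWhitespace)                    # strip extra white space
docs <- tm_map(docs, removeWords, stopwords("english"))  # eliminate stop words
docs <- tm_map(docs, stemDocument)                       # word stemming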
Problem 6.
Stemming is the process of reducing a word to its stem (base form) by removing affixes such as suffixes and prefixes; the resulting root forms of words are sometimes referred to as "lemmas". Stemming is important in natural language understanding (NLU) and natural language processing (NLP).
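For illustration (my own example), the Porter stemmer available through the SnowballC package:
library(SnowballC)
wordStem(c("running", "runs", "runner"))  # "running" and "runs" are both reduced to the stem "run"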
Problem 7.
The data can be too large to be handled by the available hardware, but it can also have a dimensionality that goes against the assumptions of some modeling tools. That is the case of datasets where there are many more variables than observations, as is often the case in text mining, for instance. While these datasets may fit perfectly well in our available hardware, they can still be problematic for some tools due to this imbalance.
Problem 8.
The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality for data sets stored outside of R's main memory (a minimal usage sketch follows below).
The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them. This permits transparent access from R without bumping against R’s internal memory limits. Several R processes on the same computer can also share big memory objects.
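A minimal sketch of the biglm idea of incremental (chunked) fitting; the data chunks and the formula below are hypothetical, purely for illustration:
library(biglm)
chunk1 <- data.frame(y = rnorm(1000), x = rnorm(1000))
chunk2 <- data.frame(y = rnorm(1000), x = rnorm(1000))
fit <- biglm(y ~ x, data = chunk1)  # fit on the first chunk of the data
fit <- update(fit, chunk2)          # incrementally update with further chunks
summary(fit)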
Problem 9.
As datasets grow in size, it becomes difficult to fit all the data in memory. One approach is to not use all available rows of a very large dataset, but only a subset of them.
Problem 10.
The data can be too large to be handled by the available hardware. It can also have a dimensionality that goes against the assumptions of some modeling tools, as in datasets where there are many more variables than observations.
Problem 11.
They split the solution into two functions: one that determines the number of lines of a text file, and another that obtains the random sample. The first uses the Unix command wc, which can compute the number of lines very efficiently. The second function does the heavy lifting using the Perl scripting language. The selected rows are then read into a data frame using the function read_csv from package readr.
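A minimal sketch of the line-counting part, assuming a Unix-like system and a hypothetical file name; the sampling step, done with Perl in the original solution, is not reproduced here:
nrLinesFile <- function(f) {
  res <- system(paste("wc -l", f), intern = TRUE)  # e.g. "  1500000 hugedata.csv"
  as.integer(strsplit(trimws(res), " +")[[1]][1])  # extract the line count
}
## Usage: nrLinesFile("hugedata.csv")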
Problem 12.
It's about trying to load a CSV file with 20,000 lines of data and carry out some basic statistical analysis, but the file would not fit in memory because it is too big for the machine of the person who asked the question.
The answers given used Perl and Python. However, only a small percentage of the data would be loaded for analysis.
Problem 13.
data(iris)
prop2sample <- 0.5
rowIDs <- sample(1:nrow(iris), as.integer(prop2sample*nrow(iris)))
iris.sample <- iris[rowIDs, ]
head(iris.sample)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 47 5.1 3.8 1.6 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 43 4.4 3.2 1.3 0.2 setosa
## 100 5.7 2.8 4.1 1.3 versicolor
## 113 6.8 3.0 5.5 2.1 virginica
## 95 5.6 2.7 4.2 1.3 versicolor
Problem 14.
The strategy they use essentially goes through each line of the original large file and draws a random number between 0 and 1. If the number is below a certain percentage, the line is selected for the final sample; otherwise, they move on to the next line.
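A minimal sketch of this line-by-line strategy in R (my own illustration, not their exact code; the file name and the presence of a header row are assumptions):
sampleLines <- function(f, perc = 0.01) {
  con <- file(f, "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)            # keep the header row
  sel <- character(0)
  repeat {
    l <- readLines(con, n = 1)
    if (length(l) == 0) break                # end of file reached
    if (runif(1) < perc) sel <- c(sel, l)    # keep the line with probability perc
  }
  c(header, sel)
}
## Usage: sampleLines("hugedata.csv", perc = 0.05)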
Problem 15.
Removing irrelevant variables or variables that are highly correlated with others.
Reducing the dimensionality of the dataset.
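As an illustration of the correlation-based part (my own example), using the findCorrelation() function from the caret package on the iris measurements:
library(caret)
data(iris)
corMat <- cor(iris[, 1:4])
toDrop <- findCorrelation(corMat, cutoff = 0.9)           # columns too correlated with others
if (length(toDrop) > 0) iris.filtered <- iris[, -toDrop]  # drop them from the dataset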
Problem 16.
Filter methods, wrapper methods
Problem 17.
Unsupervised methods, supervised methods
Problem 18.
Unsupervised methods look at each feature individually and calculate its relevance using only the values of the variable.
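A small example of such an unsupervised filter (my own illustration): scoring each numeric feature by its variance and dropping near-constant ones (the threshold is an arbitrary assumption):
data(iris)
relevance <- sapply(iris[, 1:4], var)  # score computed from the feature values alone
keep <- relevance > 0.1                # arbitrary relevance threshold
iris.reduced <- iris[, c(keep, TRUE)]  # also keep the Species column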
Problem 19.
Supervised methods explore the existence of a “special” variable in the dataset, the so-called target variable. These supervised methods evaluate each feature by looking at its relationship with the target variable.
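A small example of a supervised filter (my own illustration), scoring each feature by the absolute value of its correlation with a numeric target variable, here mpg in the mtcars dataset:
data(mtcars)
scores <- sapply(mtcars[, -1], function(x) abs(cor(x, mtcars$mpg)))  # relevance w.r.t. the target
sort(scores, decreasing = TRUE)                                      # most relevant features first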
Problem 20.
A supervised method
Problem 21.
False. The answer could arguably be true or false, but there are definitely some differences, so false was chosen.
Problem 22.
Supervised methods
Problem 23.
Checking for constant variables. Checking for ID-like variables.
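A minimal sketch of these two checks on a toy data frame (the data frame d is my own example):
d <- data.frame(id = paste0("row", 1:5),   # ID-like variable
                const = rep(1, 5),         # constant variable
                y = c(2.3, 1.1, 5.4, 0.7, 3.3))
constant.vars <- names(which(sapply(d, function(x) length(unique(x)) == 1)))
nominal <- sapply(d, function(x) is.character(x) || is.factor(x))
id.like.vars <- names(which(sapply(d[, nominal, drop = FALSE],
                                   function(x) length(unique(x)) == nrow(d))))
constant.vars  # "const"
id.like.vars   # "id"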
Problem 24.
True
Problem 25.
The method searches for a set of "new" variables, each being a linear combination of the original variables. The idea is that a smaller set of these new variables may be able to "explain" most of the variability of the original data, and if that is the case we can carry out our analysis using only this subset.
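A minimal sketch of this idea using principal component analysis (prcomp) on the iris measurements; the choice of dataset and of keeping two components is my own illustration:
data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # new variables = linear combinations of the originals
summary(pca)                               # proportion of variance explained by each component
iris.reduced <- pca$x[, 1:2]               # keep only the first two new variables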