R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

# 1. What is temporal data?
     ## Temporal data, also known as time-series data, is data in which the observations are tied to points in time. It is often challenging to work with because of its dynamic nature and because multiple events can unfold in parallel.
# 2. What is the “bag of words” approach used for?
     ## The "bag of words" approach is used to decide which attributes or properties should be used to represent a text document. The document is transformed into a vector of features, where each feature is associated with one word of the language the document is written in. For example, each feature can be a count of the number of occurrences of the word, or a binary indicator of whether the word appears in the document. A toy illustration is sketched below.
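A minimal, self-contained sketch of the idea (my own toy documents, not from the book): each document becomes a vector of word counts over the shared vocabulary.

```r
# Build a tiny bag-of-words matrix in base R.
docs   <- c(d1 = "the cat sat on the mat", d2 = "the dog sat")
tokens <- strsplit(tolower(docs), "\\s+")               # split on whitespace
vocab  <- sort(unique(unlist(tokens)))                  # shared vocabulary
bow    <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
bow    # rows = documents, columns = word counts
```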
# 3. What is the approach, TF-IDF, used for?
     ## TF stands for "term frequency" and IDF for "inverse document frequency", so the approach is term frequency weighted by inverse document frequency. It is used to measure the amount of information brought by each term. In particular, it copes with words that are used a lot but do not bring much information, by giving them a low weight. A toy calculation is sketched below.
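A hand-rolled sketch of the weighting (my own toy counts; in practice the tm package provides weightTfIdf for this). A term that appears in every document, like "the", gets an IDF of zero and therefore zero weight.

```r
# TF-IDF = term frequency * inverse document frequency.
counts <- rbind(d1 = c(the = 2, cat = 1, sat = 1),
                d2 = c(the = 1, cat = 0, sat = 1))
idf    <- log(nrow(counts) / colSums(counts > 0))   # idf = log(N / docs containing the term)
tfidf  <- sweep(counts, 2, idf, `*`)                # weight each count by its idf
round(tfidf, 2)
```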
# 4. The package, tm, is a popular R package. It has several functions for carrying out what sorts of tasks?
    ## This package provides a general text mining framework within R. It includes functions for reading text documents from many different sources and formats, and functions for carrying out the most frequent pre-processing steps on those documents.
# 5. What are some of the preprocessing steps that are applied to a corpus using the tm_map() function?
     ## tm_map(corpus, removePunctuation)
     ## tm_map(corpus, removeNumbers)
     ## tm_map(corpus, stripWhitespace)
     ## tm_map(corpus, removeWords, stopwords("english"))
     ## tm_map(corpus, stemDocument)
     ## A minimal pipeline combining these steps is sketched below.
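This sketch strings the tm_map() steps above into a runnable pipeline; the example sentences are made up, and the stemming step relies on the SnowballC package being installed.

```r
library(tm)

docs   <- c("Text mining in R!", "R makes mining 100 texts easy.")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))      # lower-case
corpus <- tm_map(corpus, removePunctuation)                 # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                     # strip digits
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                   # collapse extra blanks
corpus <- tm_map(corpus, stemDocument)                      # stem (uses SnowballC)

dtm <- DocumentTermMatrix(corpus)   # bag-of-words document-term matrix
inspect(dtm)
```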
# 6. What is the word stemming used for?
     ## Stemming reduces the different inflected forms of a word to a common stem, so that all variations of the word are treated as the same term.
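A tiny illustration, assuming the SnowballC package (which tm uses for stemming) is installed:

```r
library(SnowballC)
# All four forms are reduced to the same stem ("argu").
wordStem(c("argue", "argued", "argues", "arguing"))
```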
# 7. What are some of the dimensionality challenges to many analysis tools?
     ## The amount and diversity of data sources often lead to challenges. For example, the data can be too large to be handled by the available hardware, such as not fitting into the computer's main memory.
# 8. List some things a data scientist/data miner can do to fit data properly into the central memory of computers.
     ## Data scientists can work with random samples of the rows, which reduces the amount of data that has to be held in memory at any one time.
# 9. Why does a data engineer/data miner use random sampling of a subset of rows?
      ## As implied by the previous question, it can be difficult to fit large amounts of data into a computer's memory, and datasets keep growing at an increasing rate. Random sampling of a subset of rows produces a smaller dataset that does fit.
# 10. List some challenges that prompt a data miner/ analytics professional to trim the dataset (i.e., reduce dimensionality).
      ## The amount and diversity of data sources are challenges in themselves, and the data can be too large for the computer's memory or too large to fit the assumptions of the modeling tools.
# 11. Go through Code 10 (P.79). Then describe the results.
      ## The code reads the data line by line and draws a random number between 0 and 1 for each line. If the number is below a pre-selected threshold the line is selected; if not, the code moves on to the next line. The drawback is that the total number of selected lines may not match the desired count exactly, in which case the code produces an error message. In this code, the threshold is set at 0.5. A minimal sketch of the idea follows below.
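This is only a sketch of the line-sampling idea, not the book's actual Code 10 (the function and file names here are made up): the file is read one line at a time, so it never has to be loaded into memory in full, and each line is kept with probability equal to the threshold.

```r
sample_lines <- function(filename, threshold = 0.5) {
  con <- file(filename, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break            # reached end of file
    if (runif(1) < threshold) kept <- c(kept, line)
  }
  kept
}

# selected <- sample_lines("hugeFile.csv", threshold = 0.5)
```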
# 12. Go to the site, http://stackoverflow.com/questions/22261082. Describe what you see. Describe some of the content.
      ## Someone has a large dataset and knows it will not fit into the computer's memory. He wants to know how to pull a random sample of roughly 20K rows on which to do basic statistics.

## One suggestion is to use "perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt", which will pull approximately, though perhaps not exactly, 20K lines. Other suggestions include the following:

## Using sqldf to let SQLite draw the sample:
   #library(sqldf)
   #DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")

## Reading 20,000 randomly chosen single rows directly from the file:
   #RowsInCSV <- 10000000  # or however many rows there are
   #List <- lapply(1:20000, function(x)
   #  read.csv("YourFile.csv", nrows = 1, skip = sample(RowsInCSV, 1), header = FALSE))
   #DF <- do.call(rbind, List)
# 13. If you are sampling rows of a dataset, give an example of an algorithm/code in R that you would use on a small dataset to program random sampling of a subset of rows of the dataset.
      ## Sampling a proportion of the rows without replacement:
         #data(iris)
         #prop2sample <- 0.5
         #rowIDs <- sample(1:nrow(iris), as.integer(prop2sample * nrow(iris)))
         #iris.sample <- iris[rowIDs, ]

      ## The same sampling, but with replacement (a row may be picked more than once):
         #data(iris)
         #prop2sample <- 0.5
         #rowIDs <- sample(1:nrow(iris), as.integer(prop2sample * nrow(iris)), replace = TRUE)
         #iris.sample <- iris[rowIDs, ]
# 14. In the section, “Sampling Rows,” the authors give their strategy for picking lines of large files to reduce dimensionality (hint:  they draw a random number between 0 and 1).  What is that strategy? 
      ## As in one of the previous responses, each line is assigned a random number between 0 and 1; if that number is less than a set threshold, the line is selected, and otherwise it is skipped.
# 15. Explain feature/variable selection: what are we doing when we perform feature selection?
      ## Feature selection is the process of reducing the number of input variables used when developing a predictive model, for example by removing variables that are irrelevant or highly correlated with others. It is desirable to reduce the number of input variables both to lower the computational cost of modeling and, in some cases, to improve the performance of the model.
# 16. List and explain two popular methods we use to select features.
      ## Filter methods and wrapper methods. Filter methods look at each variable individually and assess its value using some metric, which is then used to rank the variables and remove the less relevant ones in terms of that metric. Wrapper methods work by taking into consideration the objectives of the analysis we plan to carry out with the dataset, typically using a predictive model to evaluate candidate sets of features. A toy example of a filter metric is sketched below.
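This is my own toy illustration of a filter method (not taken from the book): each candidate predictor is scored individually by the absolute value of its correlation with the target, and the scores are used to rank the predictors.

```r
data(iris)
target     <- iris$Sepal.Length
predictors <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")]
scores     <- sapply(predictors, function(x) abs(cor(x, target)))
sort(scores, decreasing = TRUE)   # higher score = more relevant by this metric
```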
# 17. List two ways of grouping existing feature selection methods.
      ## One way is to group them into one-shot approaches and iterative-search approaches (filter and wrapper methods, respectively).
      ## Another way of grouping existing feature selection methods is into unsupervised and supervised methods.
# 18. Explain unsupervised
      ## Unsupervised means that the data does not have labels or predefined categories; the methods rely only on the properties of the features themselves.
# 19. Explain supervised
      ## Supervised means that the data has labels and there is a dependent/target variable.
# 20. When comparing one or more features of a dataset to a target feature, and looking at each feature’s relationship with the target variable, which method should I use to select features?
      ## Supervised
# 21. Filter methods are very similar to unsupervised methods. True or False?
      ## False. Filter methods, particularly those used in supervised learning, assess features based on their relationship with a defined target variable. Unsupervised methods, on the other hand, explore the intrinsic structure of the data itself, without needing a target variable.
# 22. Fill in the blank:
# “Wrapper methods are most of the time _______________ __________ because they typically use some predictive model to assert the value of a set of candidate features.”
      ## Supervised methods
# 23. List two examples of simple unsupervised filter methods.
      ## Checking for constant variables, i.e., features that have the same value on all observations and therefore carry no information.
      ## Eliminating features that have low variability, as measured by some statistic of spread. More broadly, unsupervised techniques such as clustering (grouping similar data points without predefined labels) and dimensionality reduction (compressing the features while retaining important information) also work without a target variable. A toy sketch of the two simple checks is given below.
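A toy sketch of those two simple checks (my own example; the 0.5 cut-off is an arbitrary choice for illustration):

```r
data(iris)
num <- iris[, sapply(iris, is.numeric)]                 # numeric columns only

is_constant <- sapply(num, function(x) length(unique(x)) == 1)
low_spread  <- sapply(num, function(x) sd(x) < 0.5)     # arbitrary spread cut-off

names(num)[is_constant]   # constant features (none in iris)
names(num)[low_spread]    # low-variability candidates for removal
```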
# 24. In some cases, when we are trying to reduce the dimensionality of a dataset with 100,000 rows and 140,000 columns, we may eliminate one or more highly correlated features. True or False?
      ## True. Highly correlated features carry largely redundant information, so dropping one of each correlated pair is a way to truly reduce the dimensionality. A toy correlation filter is sketched below.
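A toy correlation filter (my own example; the 0.9 threshold is an arbitrary choice): flag pairs of numeric features whose absolute pairwise correlation exceeds the threshold, so that one feature of each pair can be dropped.

```r
data(iris)
cm <- abs(cor(iris[, 1:4]))
cm[upper.tri(cm, diag = TRUE)] <- 0          # inspect each pair only once
high <- which(cm > 0.9, arr.ind = TRUE)      # pairs above the threshold
data.frame(feature1 = rownames(cm)[high[, 1]],
           feature2 = colnames(cm)[high[, 2]])
```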
# 25. Explain the Principal Component Analysis (PCA) task when selecting features and reducing dimensionality.
      ## PCA simplifies complex datasets by reducing dimensionality through feature extraction: it creates new features (the principal components) that are combinations of the original ones and that capture most of the variation in the data. This aids faster processing, can improve model performance, and helps with data visualization. A minimal example follows below.
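A minimal PCA sketch with base R (my own example): project the four numeric iris measurements onto the first two principal components.

```r
data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)   # centre and scale, then rotate
summary(pca)                                # proportion of variance explained
reduced <- pca$x[, 1:2]                     # keep only the first two components
head(reduced)
```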