HW #3

#1. What is temporal data? Data that contains information related to time.

#2. What is the bag-of-words approach used for? It is a natural language processing
# technique that extracts features or characteristics of text data and turns them into
# vectors.
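#
#A minimal sketch of the bag-of-words idea in base R, assuming two made-up
#documents. The names docs, tokens, vocab and bow are illustrative only. Each
#document becomes a vector of word counts and word order is ignored.

```r
# Hypothetical toy documents (not from the homework)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat")

# Split each document into lower-case word tokens
tokens <- strsplit(tolower(docs), "[^a-z]+")
names(tokens) <- names(docs)

# Vocabulary: every distinct word across all documents
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word, cell = word count
bow <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
print(bow)
```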

#3. What is TF-IDF used for? TF-IDF stands for Term Frequency-Inverse Document Frequency. It is used
# to determine how relevant a word is to a document within a collection of documents.
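#
#A minimal sketch of TF-IDF in base R on three made-up documents. It assumes one
#common weighting, tf = count / document length and idf = log(N / df). Other
#variants exist, and the variable names are illustrative only.

```r
# Hypothetical toy documents (not from the homework)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat",
          d3 = "the bird flew")

tokens <- strsplit(tolower(docs), "[^a-z]+")
names(tokens) <- names(docs)
vocab <- sort(unique(unlist(tokens)))

# Document-by-word count matrix
counts <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

tf    <- counts / rowSums(counts)      # term frequency within each document
df    <- colSums(counts > 0)           # number of documents containing each word
idf   <- log(nrow(counts) / df)        # inverse document frequency (assumed variant)
tfidf <- sweep(tf, 2, idf, `*`)        # scale each word column by its idf

print(round(tfidf, 3))
```

#A word that occurs in every document (here "the") gets idf = log(1) = 0, so its
#TF-IDF weight is zero, while a word found in a single document is weighted up.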

#4. The package tm is a popular R package. It has several functions for carrying out what sorts of tasks? General text mining, such as reading documents from different sources or formats.

#5. What are some of the preprocessing steps that are applied to a corpus using the tm_map() function? removePunctuation, tolower (convert to lower case), removeNumbers, removeWords (remove stop words), stripWhitespace, stemDocument.
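#
#A sketch of how these steps might be chained, assuming the tm package (and
#SnowballC, which stemDocument() relies on) is installed. The two example
#sentences are made up, and the order of the steps is one reasonable choice,
#not the only one.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package installed

# Hypothetical toy corpus (not from the homework)
texts <- c("The 3 cats were RUNNING, quickly!",
           "Dogs   run faster than 2 cats.")
corpus <- VCorpus(VectorSource(texts))

corpus <- tm_map(corpus, content_transformer(tolower))       # all lower case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stemDocument)                       # reduce words to stems
corpus <- tm_map(corpus, stripWhitespace)                    # collapse repeated spaces

# Inspect the cleaned text of each document
print(sapply(corpus, as.character))
```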

#6. What is word stemming used for? It reduces the variants of a word (for example, its different tenses) to a common stem, so that one form is used within a document and all others are replaced by it.

#7. What are some of the dimensionality challenges to many analysis tools? The data is too large, or it goes against
#modeling assumptions, e.g. more variables than observations.


#8. List some things a data scientist/data miner can do to fit data properly into the central memory of computers. Do not use all the rows (sampling), or use parallel computing.


#9.Why does a data engineer/data miner use random sampling of a subset of rows?
# to allow the data to fit into modeling tools

#10. List some challenges that prompt a data miner/analytics professional to trim the dataset (i.e., reduce dimensionality). The data is too large to be put in memory, or it does not fit the assumptions of the modeling tools.

#11.Go through Code 10 (P.79). Then describe the results.
#The code assumes a large dataset in CSV format and randomly selects rows. For each line it draws a random
#number between 0 and 1, and the line is chosen if that number is less than the sample percentage. Because each line is chosen independently, the sample holds only approximately the requested percentage. You can also just select the first lines of the dataset, up to the desired percentage, to ensure the set has the actual percentage of data you want.
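#
#A minimal base R sketch of this line-by-line strategy. It is not the book's
#Code 10. The function name sample_lines, its arguments, and the temporary CSV
#built from iris are assumptions made so that the example is self-contained.

```r
# Write a small hypothetical CSV so the sketch can run on its own
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Hypothetical helper: keep each data line with probability perc
sample_lines <- function(path, perc, seed = 1234) {
  set.seed(seed)
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)    # always keep the header line
  kept <- character(0)
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break     # end of file
    # Keep the line when its random number in [0, 1) is below perc
    if (runif(1) < perc) kept <- c(kept, line)
  }
  read.csv(text = c(header, kept))
}

iris_sample <- sample_lines(csv_path, perc = 0.2)
print(nrow(iris_sample))   # close to 20% of 150 rows, but not guaranteed to be 30
```

#Only one line is held in memory at a time while reading, which is the point of
#the approach for files too large to load in full.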

#12. Go to the site http://stackoverflow.com/questions/22261082. Describe what you see. Describe some of the content.
#A request to load a random sample of 20K rows from a large CSV dataset into an R data frame.
#Several suggested code snippets to load the data, such as:
#
#library(sqldf)
#DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")
#
#RowsInCSV = 10000000 #Or however many rows there are

#List <- lapply(1:20000, function(x) read.csv("YourFile.csv", nrows=1, skip = sample(RowsInCSV, 1), header=FALSE))
#DF = do.call(rbind, List)

#13.If you are sampling rows of a dataset, give an example of an algorithm/code in R that you would use on a small dataset to program random sampling of a subset of rows of the dataset.
#Sampling without replacement:
#data(iris)
#prop2sample <- 0.5
#rowIDs <- sample(1:nrow(iris), as.integer(prop2sample*nrow(iris)))
#iris.sample <- iris[rowIDs,]

#Sampling with replacement (a row can be drawn more than once):
#data(iris)
#prop2sample <- 0.5
#rowIDs <- sample(1:nrow(iris), as.integer(prop2sample*nrow(iris)), replace=TRUE)
#iris.sample <- iris[rowIDs,]
#

#14. In the section, “Sampling Rows,” the authors give their strategy for picking lines of large files to reduce dimensionality (hint: they draw a random number between 0 and 1). What is that strategy?
#Each row is assigned a random number between 0 and 1. If that number is below the requested sampling percentage, the row is kept.

#15. Explain feature/variable selection: what are we doing when we perform feature selection?
#We remove variables that are irrelevant or highly correlated with others, in order to reduce dimensionality.

#16. List and explain two popular methods we use to select features.
#Filter method: look at each feature individually, assign it a metric, and remove those with the "lower" rankings.
#Wrapper method: search for subsets of variables and evaluate each subset in a model.
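#
#A small sketch contrasting the two methods on the built-in mtcars data, with
#mpg treated as the target. The choice of absolute correlation as the filter
#metric, the top-3 cut-off, and forward stepwise search with lm() as the
#wrapper are all assumptions made for illustration.

```r
data(mtcars)   # built-in dataset; mpg plays the role of the target
predictors <- setdiff(names(mtcars), "mpg")

# Filter: score each feature on its own, then keep the top-ranked ones
scores <- sapply(predictors, function(v) abs(cor(mtcars[[v]], mtcars$mpg)))
filter_keep <- names(sort(scores, decreasing = TRUE))[1:3]
print(filter_keep)

# Wrapper: search over subsets, judging each candidate set by a fitted model
null_model <- lm(mpg ~ 1, data = mtcars)
full_scope <- reformulate(predictors, response = "mpg")
wrapper_fit <- step(null_model, scope = full_scope,
                    direction = "forward", trace = 0)
wrapper_keep <- names(coef(wrapper_fit))[-1]   # drop the intercept
print(wrapper_keep)
```

#The filter never fits a model, so it is cheap. The wrapper refits the model
#for each candidate subset, which costs more but accounts for how the features
#work together.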


#17.List two ways of grouping existing feature selection methods.
#One-shot approach
#Iterative search method

#18. Explain unsupervised. It looks at each individual feature and calculates its relevance using only the values of the variables.

#19. Explain supervised. It uses a "target" variable for predictive analysis, and evaluates features
#based on their relationship with other features and the "target".

#20. When comparing one or more features of a dataset to a target feature, and looking at each feature’s relationship with the target variable, which method should I use to select features? Supervised.


#21. Filter methods are very similar to unsupervised methods. True or False?

#22. Fill in the blank:

#“Wrapper methods are most of the time _______________ __________ because they typically use some predictive model to assert the value of a set of candidate features.”
#Supervised

#23.List two examples of simple unsupervised filter methods
#Checking for constant variables, i.e. features that have the same value on all observations.
#Eliminating features that have low variability, based on a statistic of spread.
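#
#A minimal base R sketch of both filters on made-up data. The data frame toy,
#the use of the IQR as the spread statistic, and the 0.1 threshold are
#assumptions chosen only for this example.

```r
# Hypothetical toy data: 'flag' is constant, 'noise' barely varies
set.seed(42)
toy <- data.frame(x     = rnorm(100, sd = 5),
                  y     = runif(100, 0, 10),
                  flag  = rep(1, 100),
                  noise = rnorm(100, mean = 3, sd = 0.001))

# Filter 1: drop features that take a single value on all observations
is_constant <- sapply(toy, function(col) length(unique(col)) == 1)
toy1 <- toy[, !is_constant, drop = FALSE]

# Filter 2: drop features whose spread (here the IQR) is below a threshold
min_iqr <- 0.1   # assumed cut-off, chosen only for this example
spread <- sapply(toy1, IQR)
toy2 <- toy1[, spread >= min_iqr, drop = FALSE]

print(names(toy2))
```

#Neither filter looks at a target variable, which is what makes them
#unsupervised. A spread threshold depends on the scale of each feature, so in
#practice it would need to be chosen with the units in mind.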

#24. In some cases, when we are trying to reduce the dimensionality of a dataset with 100,000 rows and 140,000 columns, we may eliminate one or more highly correlated features. True or False?

#25. Explain the task of Principal Component Analysis (PCA) when selecting features and reducing dimensionality.
Walter James

2025-02-17