Read all pages in the packet in their entirety
Go through and run every algorithm
List which functions are being used
- I will answer this question in question 4.
List which package contains each function
RCurl
- getURL()
- downloads the file from the specified URL
mlbench
- data()
- loads a dataset shipped with a package (the function itself is in base R's utils; mlbench supplies the datasets)
lattice
- supplies the trellis graphics system that caret's featurePlot() draws with
Amelia
- missmap()
- plots a missingness map showing where missing values occur in the dataset passed to it
corrplot
- cor()
- calculates the correlation between each pair of numeric attributes (cor() itself comes from base R's stats package)
- corrplot()
- draws a graphical display (correlogram) of a correlation matrix
caret
- featurePlot()
- a shortcut to produce lattice graphs
- preProcess()
- Pre-processing transformations (centering, scaling, etc.) can be estimated from the training data and applied to any data set with the same variables.
- Non-numeric predictors are allowed but will be ignored.
- predict()
- generates predictions from the results of various model-fitting functions (including preProcess objects)
No package needed
- read.csv()
- imports a CSV file into a data frame
- head()
- returns the first six rows of the dataset by default
- textConnection()
- creates a connection from a character string, so text already in memory can be read as if it were a file
- dim()
- displays dimensions of the dataset
- sapply()
- applies a function over each column; for example, sapply(data, class) lists the type of each attribute (integer, numeric, character, etc.)
- cbind(my_data, new_column)
- combines vectors, matrices and/or data frames by columns.
- par(mfrow=c(1,4))
- creates multiple plots in the same window
- mfrow as argument
- pass a vector with the number of rows and columns for the grid as the value of mfrow; this divides the plotting frame into a grid with that many rows and columns (note that c(1,4) means 1 row and 4 columns, while c(1:4) is the vector 1 2 3 4 and would be a mistake)
- hist()
- computes a histogram of the given data values
- boxplot()
- draws a box-and-whisker plot of the given data (a combined sketch of these functions in use follows this list)
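A minimal sketch tying these functions together; the URL and the Class column are placeholders I am assuming, not from the packet, and the file is assumed to have at least four numeric columns:

library(RCurl)
library(Amelia)
library(corrplot)
library(caret)

raw <- getURL("https://example.com/data.csv")   # hypothetical URL
my_data <- read.csv(textConnection(raw))        # parse the downloaded text

head(my_data)            # first 6 rows by default
dim(my_data)             # rows and columns
sapply(my_data, class)   # type of each attribute

missmap(my_data)         # map of missing values

nums <- sapply(my_data, is.numeric)
corrplot(cor(my_data[, nums]))        # correlogram of the numeric attributes

pp <- preProcess(my_data, method = c("center", "scale"))
scaled <- predict(pp, my_data)        # apply the estimated transform

featurePlot(x = my_data[, nums], y = my_data$Class)  # assumes a factor column named Class

par(mfrow = c(1, 4))                  # 1 row, 4 columns of plots
for (i in which(nums)[1:4]) hist(my_data[, i], main = names(my_data)[i])
for (i in which(nums)[1:4]) boxplot(my_data[, i], main = names(my_data)[i])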
Describe what each algorithm/code/model is doing or is trying to
do.
- Each dataset's R Markdown provided in problem 2 has a description of each step.
What is standardization? What is normalization?
Standardized data have a mean of 0 and a standard deviation of 1. This is often done so that the standard normal distribution can be used when computing hypothesis tests.
Normalized values are scaled into the range [0,1].
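A quick base-R illustration of the difference (x is made-up data):

x <- c(2, 4, 6, 8, 10)
z <- (x - mean(x)) / sd(x)             # standardized: mean 0, sd 1
n <- (x - min(x)) / (max(x) - min(x))  # normalized: range [0,1]
c(mean(z), sd(z))                      # 0 and 1
range(n)                               # 0 and 1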
List the pre-processing methods/techniques
Instance-based methods
- Most effective when the input attributes have the same scale
Regression methods
- Most effective when the input attributes are standardized
List all the transform methods.
Scaling
Centering
Standardization
Normalization
BoxCox
YeoJohnson
expoTrans
zv
nzv
PCA
ICA
spatialSign
Which transform method is supposedly the best?
- Each transform method has its advantages; which one is best depends on the type of data you have and what you are trying to accomplish.
Explain what each transform method does.
Scaling
- The scale transform calculates the standard deviation of an attribute and divides each value in that column by that standard deviation.
Centering
- The center transform calculates the mean of an attribute and subtracts that mean from each value in the column.
Standardization
- Standardization applies centering and scaling to each attribute, giving the data a mean of 0 and a standard deviation of 1.
Normalization
- Normalization transforms the data values to lie in the range [0,1].
BoxCox
- Reduces the skewness of an attribute to make it more Gaussian-like.
- Classically used to transform response variables.
- Assumes all values are positive.
YeoJohnson
- Like BoxCox, but also handles zero and negative values.
expoTrans
- Applies a power transform, like BoxCox and YeoJohnson.
- Can be used for positive and negative data.
- Assumes a common mean for the data.
zv
- Removes attributes with a zero variance (all the same value)
nzv
- Removes attributes with a near zero variance (close to the same
value)
PCA
- Transforms the data to the principal components: linearly uncorrelated directions ordered by the amount of variance they explain
ICA
- Transforms the data to the independent components
spatialSign
- Projects data onto a unit sphere (a unit circle in two dimensions); a preProcess() sketch using these method names follows
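A sketch of how these method names are passed to caret's preProcess(); the particular combinations are just examples, and the Pima predictors come from mlbench:

library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
x <- PimaIndiansDiabetes[, 1:8]   # the eight numeric predictors

pp_std  <- preProcess(x, method = c("center", "scale"))  # standardization
pp_norm <- preProcess(x, method = "range")               # normalization to [0,1]
pp_skew <- preProcess(x, method = "YeoJohnson")          # skew reduction; allows zeros
pp_pca  <- preProcess(x, method = c("nzv", "pca"))       # drop near-zero variance, then PCA

transformed <- predict(pp_pca, x)   # apply the estimated transforms
summary(transformed)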
These data transforms are more likely to be useful for which
algorithms/codes/models?
- These transforms are most useful for instance-based methods (such as k-nearest neighbors) and regression methods, which are sensitive to attribute scale; tree and rule-based methods are largely insensitive to them.
Define scaling data.
- The scale transform calculates the standard deviation of an attribute and divides each value in that column by that standard deviation.
Scaling, normalization and standardization are all part of the
pre-processing process. What does the center transform do?
- The center transform calculates the mean of an attribute and subtracts that mean from each value in the column.
What is a Gaussian like distribution?
- A Gaussian-like distribution is symmetric about its mean, meaning it is bell-shaped.
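One quick way to see the shape, using simulated data rather than anything from the packet:

set.seed(1)
hist(rnorm(10000), breaks = 50,
     main = "Symmetric, bell-shaped (Gaussian-like) sample")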
Load the following datasets. Explain the characteristics of each.
How many columns does the Pima Indian Diabetes dataset have? How many
rows does the Iris dataset have?
Pima Indians Diabetes
Characteristics:
- The dataset has 9 variables, 768 observations, 0 missing cells, 0 duplicate rows, and takes 54.1KB of memory. There are two variable types: numeric and boolean.
Columns:
- The dataset has 9 columns (8 numeric predictors plus the diabetes class).
Iris
Characteristics:
- The dataset contains four features (the length and width of the sepals and petals) for 50 samples of each of three species (Iris setosa, Iris virginica, and Iris versicolor).
Rows:
- The dataset has 150 rows (50 samples of each of the three species); a loading sketch for both datasets follows.
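A sketch of loading both datasets and checking their dimensions (iris ships with base R, PimaIndiansDiabetes with mlbench):

library(mlbench)
data(PimaIndiansDiabetes)
dim(PimaIndiansDiabetes)   # 768 rows, 9 columns
str(PimaIndiansDiabetes)   # 8 numeric predictors plus the diabetes factor

data(iris)
dim(iris)                  # 150 rows, 5 columns
head(iris)                 # first 6 rows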
Explain both PCA and ICA.
- PCA (principal component analysis) transforms the data into a set of linearly uncorrelated components, ordered by how much of the variance each one explains; it is often used to reduce dimensionality while keeping most of the information. ICA (independent component analysis) transforms the data into components that are statistically independent rather than merely uncorrelated, which is useful for separating mixed signals.
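A minimal sketch of both on the iris measurements; the ICA step assumes the fastICA package is installed, since caret calls it for method = "ica":

pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # PCA with base R
summary(pc)   # proportion of variance explained by each component

library(caret)
pp_pca <- preProcess(iris[, 1:4], method = "pca")
pp_ica <- preProcess(iris[, 1:4], method = "ica", n.comp = 2)
head(predict(pp_ica, iris[, 1:4]))   # the two independent components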
Give your understanding of the pre-processing process.
- Data can be messy, perhaps because someone else collected them or because you had no say in how they were collected. In order to use machine learning techniques, the data need to be in a clean and organized format. We use pre-processing to clean the data, modify it, and transform it into a useful format.