1. Read all pages in the packet in their entirety

  2. Go through and run every algorithm

    • Pima Indians Diabetes Algorithms

      • Link to the R markdown of the algorithms for Pima Indians Diabetes
    • Iris Dataset Algorithms

      • Link to the R markdown of the algorithms for Iris Dataset
  3. List which functions are being used

    • See question 4, where each function is listed with the package that contains it.
  4. List which package contains each function

    • RCurl

      • getURL()
        • downloads a file from the specified URL as text
    • mlbench

      • data()
        • loads a specified dataset from a package (here, mlbench) into the workspace
    • lattice

      • plot()

        • creates a plot; plot() itself is a base R generic, and lattice supplies the methods that draw the trellis objects produced by functions like featurePlot()
    • Amelia

      • missmap()
        • plots a missingness map showing where missing values occur in the data frame or amelia output passed to it
    • corrplot

      • cor()
        • calculates the correlation between each pair of numeric attributes (cor() itself ships with base R's stats package)
      • corrplot()
        • creates correlation plot
    • caret

      • featurePlot()

        • a shortcut to produce lattice graphs
      • preProcess()

        • Pre-processing transformations (centering, scaling, etc.) can be estimated from the training data and applied to any data set with the same variables.

        • Non-numeric predictors are allowed but will be ignored

      • predict()

        • generates predictions from the results of various model-fitting functions
    • No package needed

      • read.csv()
        • imports a CSV file into a data frame
      • head()
        • returns the first rows of a dataset (six by default; the count is set with n)
      • textConnection()
        • creates a connection from a character vector so that functions such as read.csv() can read text as if it were a file
      • dim()
        • displays the dimensions of the dataset
      • sapply()
        • applies a function over a list or data frame; used here with class() to list the type of each attribute (e.g., integer, numeric, character)
      • cbind(my_data, new_column)
        • combines vectors, matrices and/or data frames by columns.
      • par(mfrow=c(1,4))
        • creates multiple plots in the same window
        • mfrow takes a vector giving the number of rows and columns for the grid, dividing the frame into that many panels
        • note: c(1,4) is correct here; c(1:4) would be the four-element vector 1 2 3 4, not a 1 x 4 grid
      • hist()
        • computes a histogram of the given data values
      • boxplot()
        • creates a box plot of the given data values (a sketch combining several of these functions follows this list)
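
Taken together, a minimal sketch of how these functions combine; the URL here is a placeholder, not the packet's actual link:

```r
library(RCurl)

url <- "https://example.com/pima-indians-diabetes.csv"  # placeholder URL
raw <- getURL(url)                        # fetch the raw CSV text
pima <- read.csv(textConnection(raw),     # read the text as if it were a file
                 header = FALSE)

head(pima)           # first six rows
dim(pima)            # number of rows and columns
sapply(pima, class)  # type of each attribute

par(mfrow = c(1, 4))                      # 1 x 4 grid of plots
for (i in 1:4) hist(pima[, i], main = names(pima)[i])
```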
  5. Describe what each algorithm/code/model is doing or is trying to do.

    • Each dataset's R markdown linked in question 2 includes a description of each step.
  6. What is standardization? What is normalization?

    • Standardized data has a mean of 0 and a standard deviation of 1. This is often done so that the standard normal distribution can be used when computing hypothesis tests.

    • Normalized values are rescaled into the range [0, 1].
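
A minimal sketch of both transforms on a toy vector:

```r
x <- c(2, 4, 6, 8, 10)

# Standardization: subtract the mean, divide by the standard deviation
standardized <- (x - mean(x)) / sd(x)   # now has mean 0, sd 1

# Normalization: rescale into [0, 1]
normalized <- (x - min(x)) / (max(x) - min(x))
```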

  7. List the pre-processing methods/techniques

    • Instance-based methods

      • Most effective when the input attributes have the same scale (a kNN sketch follows this list)
    • Regression methods

      • Most effective when the input attributes are standardized
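
To illustrate the scale sensitivity of an instance-based method, a small kNN sketch on iris; the split and k are arbitrary choices for illustration:

```r
library(class)  # provides knn()

set.seed(1)
train <- sample(nrow(iris), 100)   # arbitrary 100/50 split
labels <- iris$Species

accuracy <- function(x) {
  pred <- knn(x[train, ], x[-train, ], labels[train], k = 5)
  mean(pred == labels[-train])
}

accuracy(iris[, 1:4])         # raw attributes
accuracy(scale(iris[, 1:4]))  # standardized attributes
```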
  8. List all the transform methods.

    • Scaling

    • Centering

    • Standardization

    • Normalization

    • BoxCox

    • YeoJohnson

    • expoTrans

    • zv

    • nzv

    • PCA

    • ICA

    • spatialSign
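
In caret, these correspond to method strings passed to preProcess(); a minimal sketch of the mapping, using iris only as a convenient example:

```r
library(caret)

# Scaling -> "scale"; centering -> "center";
# standardization -> c("center", "scale"); normalization -> "range";
# the rest map directly: "BoxCox", "YeoJohnson", "expoTrans",
# "zv", "nzv", "pca", "ica", "spatialSign"
pp <- preProcess(iris[, 1:4], method = c("center", "scale"))
standardized <- predict(pp, iris[, 1:4])
```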

  9. Which transform method is supposedly the best?

    • Each transform method has its advantages; which is best depends on the type of data you have and what you are trying to accomplish.
  10. Explain what each transform method does.

    • Scaling

      • The scaling transform calculates the standard deviation of an attribute and divides each value in that column by the standard deviation.
    • Centering

      • The center transform calculates the mean of an attribute and subtracts the mean from each value in the column.
    • Standardization

      • Standardization applies both centering and scaling to each attribute, giving the data a mean of 0 and a standard deviation of 1.
    • Normalization

      • Normalization is the process of transforming the data values to be in the range of [0,1]
    • BoxCox

      • Reduces the skew of an attribute so that its distribution is more Gaussian-like.

      • Transforms response variables.

      • Assumes all values are positive.

    • YeoJohnson

      • Reduces the skew of an attribute so that its distribution is more Gaussian-like.

      • Values can be zero or negative.

    • expoTrans

      • Applies a power transform like BoxCox and YeoJohnson.

      • Can be used for positive and negative data.

      • Assumes a common mean for the data

    • zv

      • Removes attributes with a zero variance (all the same value)
    • nzv

      • Removes attributes with a near zero variance (close to the same value)
    • PCA

      • Transforms data to principal components

      • Used in multivariate statistics and linear algebra

    • ICA

      • Transforms data to the independent components
    • spatialSign

      • Projects each sample onto the unit sphere (each row is divided by its norm)
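
The same preProcess()/predict() pattern applies to any of these methods; a sketch with Box-Cox on two Pima columns (the column choice is mine, made because Box-Cox assumes strictly positive values):

```r
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

# age and pedigree are strictly positive, satisfying the Box-Cox assumption
pp <- preProcess(PimaIndiansDiabetes[, c("age", "pedigree")],
                 method = "BoxCox")
transformed <- predict(pp, PimaIndiansDiabetes[, c("age", "pedigree")])
summary(transformed)  # skew is reduced relative to the raw columns
```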
  11. These data transforms are more likely to be useful for which algorithms/codes/models?

    • These transforms are most useful for instance-based methods (e.g., kNN) and regression methods, as noted in question 7; tree and rule-based methods are largely invariant to monotonic transforms.
  12. Define scaling data.

    • The scaling transform calculates the standard deviation of an attribute and divides each value in that column by the standard deviation.
  13. Scaling, normalization and standardization are all part of the pre-processing process. What does the center transform do?

    • The center transform calculates the mean of an attribute and subtracts the mean from each value in the column.
  14. What is a Gaussian like distribution?

    • A Gaussian-like distribution is symmetric about its mean, giving the familiar bell shape.
  15. Load the following datasets. Explain the characteristics of each. How many columns does the Pima Indians Diabetes dataset have? How many rows does the Iris dataset have?

    1. Pima Indians Diabetes

      • Characteristics:

        • The dataset has 9 variables, 768 observations, 0 missing cells, 0 duplicate rows, and takes 54.1 KB of memory. There are two variable types: numeric and boolean.
      • Columns:

        • There are 9 columns.
    2. Iris

      • Characteristics:

        • The dataset contains four features (length and width of sepals and petals) for 50 samples of each of three species (Iris setosa, Iris virginica, and Iris versicolor).
      • Rows:

        • The dataset has 150 rows (50 samples of each of the three species).
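
A quick way to confirm both shapes:

```r
library(mlbench)
data(PimaIndiansDiabetes)
dim(PimaIndiansDiabetes)  # 768 rows, 9 columns

data(iris)                # iris ships with base R
dim(iris)                 # 150 rows, 5 columns (4 features + Species)
```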
  16. Explain both PCA and ICA.

    • PCA: transforms the data into orthogonal principal components, ordered by how much variance each explains

    • ICA: transforms the data into statistically independent components
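
A minimal sketch of both, with PCA via base R's prcomp() and both available through caret (ICA additionally requires the fastICA package to be installed):

```r
library(caret)

# PCA: orthogonal components ordered by variance explained
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance per component

# caret equivalents; n.comp sets the number of ICA components
pp_pca <- preProcess(iris[, 1:4], method = "pca")
pp_ica <- preProcess(iris[, 1:4],
                     method = c("center", "scale", "ica"), n.comp = 2)
```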

  17. Give your understanding of the pre-processing process.

    • Data can be messy, perhaps because someone else collected it or because you had no say in how it was collected. To use machine learning techniques, the data needs to be in a clean, organized format. Pre-processing is how we clean the data, modify it, and transform it into a useful form.