New England College

Data Mining

Week 1

Weekly Objectives:

Successfully install R
Successfully install R Studio
Successfully download data and examples
Describe R Studio windows and tabs
Define the ‘Big’ in Big Data

Assignment: Install R and R Studio

Install R and R Studio.

Submit a Word or PDF document with screenshots of R Studio showing that you have created a vector of three words. Whenever you are asked to submit a screenshot, include either a sliver of your desktop or a timestamp from your desktop. Always repeat the question you are answering.
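As a minimal sketch of what the screenshot should show, a vector of three words can be created with the c() function. The words and the name my_words are placeholders chosen for illustration, not required values.

```r
# Create a character vector of three words (the words are arbitrary examples)
my_words <- c("data", "mining", "basics")

# Printing the vector shows its contents in the Console window
print(my_words)

# Confirm the number of elements and the type of the vector
length(my_words)
class(my_words)
```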

Week 2

Weekly Objectives:

Successfully install R
Successfully install R Studio
Successfully download data and examples
Describe R Studio windows and tabs
Define the ‘Big’ in Big Data

Assignment: Install R and R Studio

Install R and R Studio.

Machine Learning

Week 1

Week One | Introduction to Machine Learning

Weekly Learning Objective:

Describe machine learning

Assignment

Introduction to ML

You are expected to be able to program in R before taking this class. Use the Titanic dataset and perform EDA on various columns. Without using any modeling algorithms, and using only basic methods such as frequency distributions, describe the most important predictors of survival among Titanic passengers. For example: were males or females more likely to survive? Were young, rich females more likely to survive than old, poor males?
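The kind of frequency-based comparison asked for here can be sketched with base R alone. This example assumes R's built-in Titanic contingency table (Class x Sex x Age x Survived) rather than the passenger-level file you may be given in class, so treat it as an illustration of the method, not the expected answer.

```r
# Built-in 4-way contingency table: Class x Sex x Age x Survived
data(Titanic)

# Collapse the table to Sex x Survived counts
by_sex <- margin.table(Titanic, margin = c(2, 4))

# Row-wise proportions give the survival rate within each sex
sex_rates <- prop.table(by_sex, margin = 1)
print(round(sex_rates, 3))

# Survival rate within each Class x Sex combination
by_class_sex <- margin.table(Titanic, margin = c(1, 2, 4))
class_sex_rates <- prop.table(by_class_sex, margin = c(1, 2))
print(round(class_sex_rates[, , "Yes"], 3))
```

Each table should be followed by a sentence or two in your own words describing what it shows.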

Submit your responses in a fully “knit” R Markdown file.

Important note: Please make sure to describe your thinking and analysis results in words. Don’t leave anything to my interpretation while grading.

Week 2

Week Two | Fundamentals of Machine Learning and Introduction to Case Studies Part One

Weekly Learning Objective:

Perform data tidying, manipulating and plotting on Santander Bank Case Study data

Assignment

Fundamentals of ML

Write a fully executed R-Markdown program and submit a pdf / word or html file. The program should merge / join the data files given to you as part of the Santander Bank Case Study. You are then required to perform Exploratory Data Analysis (EDA) along with describing the features of the response variable, visualizing a few predictors, and clearly explaining the findings.

Points will be deducted if the explanation and commentary are missing from the submission.

Week 3

Week Three | Classification with Logistic Regression

Weekly Learning Objective:

Perform binary classification ML task.

Assignment

Classification with logistic regression

Write a fully executed R-Markdown program and submit a pdf / word or html file performing a classification task on the binary response variable from the Santander Bank Case Study. Make sure to try several variations of the model before settling on the best available model.

You are required to clearly display and explain the models that were run for this task.

Points will be deducted in case you fail to explain the output.
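One way to organize several candidate logistic regression models and compare them is sketched below. The data are simulated stand-ins (sim, x1, x2, x3 and y are made-up names, not Santander columns), and log_loss is a small helper written for this example; the comparison criteria you use in the assignment may differ.

```r
set.seed(42)
n <- 500

# Simulated stand-in for the case-study data (NOT the Santander files)
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

# Simple 70/30 train/test split
train_idx <- sample(n, size = 0.7 * n)
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Candidate models with different sets of predictors
formulas <- list(
  m1 = y ~ x1,
  m2 = y ~ x1 + x2,
  m3 = y ~ x1 + x2 + x3
)

# Hypothetical helper: binary cross-entropy (log loss) on held-out data
log_loss <- function(y, p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)  # avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Fit each model and collect AIC, test log loss and test accuracy
results <- do.call(rbind, lapply(names(formulas), function(nm) {
  fit <- glm(formulas[[nm]], family = binomial, data = train)
  p <- predict(fit, newdata = test, type = "response")
  data.frame(
    model = nm,
    aic = AIC(fit),
    test_log_loss = log_loss(test$y, p),
    test_accuracy = mean((p > 0.5) == test$y)
  )
}))
print(results)
```

A table like this makes it easier to explain in words why one model was preferred over the others.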

Week 4

Week Four | Classifying with SVM

Weekly Learning Objective:

Apply support vector machine on Case Study 1 data

Assignment

Fundamentals of linear predictive modeling

Write a fully executed R-Markdown program and submit a pdf / word or html file performing a classification task on the binary response variable from the Santander Bank Case Study. Make sure to try various hyperparameters of the SVM algorithm to find the best available model.

You are required to clearly display and explain the models that were run for this task and their effect on the reduction of the Cost Function.

Points will be deducted in case you fail to explain the output.

Week 5

Week Five | Classifying with Decision Trees and Improving with Random Forest and Boosting

Weekly Learning Objective:

Apply ensemble techniques on the case study

Assignment

Fundamentals of ensemble modeling

Write a fully executed R-Markdown program and submit a pdf / word or html file performing a classification task using Random Forest and XGBoost algorithms on the binary response variable from the Santander Bank Case Study. Make sure to try various hyperparameter settings of the two algorithms to find the best available models.

You are required to clearly display and explain the models that were run for this task and their effect on the reduction of the Cost Function.

Points will be deducted in case you fail to explain the output.

Week 6

Week Six | Completion of Case Study One

Weekly Learning Objective:

Complete Santander Bank Case Study and perform data cleaning and EDA on Case Study 2: Nutrition

Assignment

Santander Bank Case Study

Write a formal report on your findings from the last several weeks for the classification of the Santander Bank Case Study.

The main objective is to write a fully executed R-Markdown program performing classification using the best models found for logistic regression, SVM, Random Forest and XGBoost algorithms, and comparing the values of their cost functions and accuracy scores.

Make sure to describe the final hyperparameter settings of all algorithms that were used for comparison purposes.

Clean and merge all the files using proper IDs discussed in the second week. Create one master data file to be analyzed for case study 2. Now perform EDA and share your findings in the form of an R Markdown report.

Week 7

Week Seven | Non-Linear Regression with Generalized Additive Models

Weekly Learning Objective:

Apply generalized additive model techniques on the Nutrition label case study.

Assignment

Fundamentals of generalized additive models

Write a fully executed R-Markdown program and submit a pdf / word or html file performing a regression task using GAM algorithms on the primary response variable in the Nutrition Case Study. Make sure to try various hyperparameter settings to find the best available models.

You are required to clearly display and explain the models that were run for this task and their effect on the reduction of the Cost Function.

Points will be deducted in case you fail to explain the output.

Week 8

Week Eight | Preventing Overfitting for Regression Problems

Weekly Learning Objective:

Reduce overfitting by applying techniques such as LASSO, Ridge and Elastic Net to the data from Case Study 2: Nutrition

Assignment

Nutrition Case Study

The main objective is to write a fully executed R-Markdown program that predicts the response variable in the Nutrition case study using the best models found for the LASSO, Ridge and Elastic Net techniques. Make sure to describe the final hyperparameter settings of all algorithms that were used for comparison purposes.

You are required to clearly display and explain the models that were run for this task and their effect on the reduction of the Cost Function.

Points will be deducted in case you fail to explain the output.

Week 9

Week Nine | Regression using Partition Methods

Weekly Learning Objective:

Apply various partitioning techniques and tune hyperparameters to produce the best regression model for the Nutrition case study

Assignment

Nutrition Case Study

The main objective is to write a fully executed R-Markdown program that predicts the response variable in the Nutrition case study using the best models found for the kNN, Random Forest and XGBoost techniques. Make sure to describe the final hyperparameter settings of all algorithms that were used for comparison purposes.

You are required to clearly display and explain the models that were run for this task and their effect on the reduction of the Cost Function.

Points will be deducted in case you fail to explain the output.

Week 10

Week Ten | Completion of Case Study 2

Weekly Learning Objective:

Complete Nutrition Case Study and perform EDA on Case Study 3: MNIST and Fashion MNIST image classification

Assignment

Nutrition Case Study

Write a formal report on your findings from the last several weeks for the regression problem of the Nutrition Case Study.

The main objective is to write a fully executed R-Markdown program performing regression prediction using the best models found in the previous weeks and comparing their cost functions and R-squared values.

Make sure to describe the final hyperparameter settings of all algorithms that were used for comparison purposes.

Week 11

Week Eleven | Unsupervised Learning using Dimension Reduction

Weekly Learning Objective:

Learn various dimension reduction techniques and apply them to an image classification dataset.

Assignment

MNIST image data (You will be able to download the data from the Kaggle website using the URL given in the Word document)

The main objective is to write a fully executed R-Markdown program performing dimension reduction on high-dimensional image data, using MNIST (digits) images with a resolution of 28 x 28 pixels.

Make sure to describe the final hyperparameter settings of all algorithms that were used for comparison purposes.

Points will be deducted in case you fail to explain the output.
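As one possible starting point, principal component analysis (PCA) can be run with base R's prcomp(). The sketch below uses randomly generated pixel values as a stand-in for the MNIST file (pixels, n_images and the 90% threshold are assumptions made for this example), so the numbers it produces say nothing about the real data.

```r
set.seed(1)
n_images <- 200
n_pixels <- 28 * 28  # each 28 x 28 image flattens to 784 columns

# Random intensities in [0, 255] standing in for real image rows (NOT MNIST)
pixels <- matrix(runif(n_images * n_pixels, min = 0, max = 255),
                 nrow = n_images, ncol = n_pixels)

# Center the columns; scaling is left off because real image data often
# contain constant (all-zero) pixel columns that cannot be scaled
pca <- prcomp(pixels, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component, and cumulatively
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching 90% of the variance (assumed cutoff)
k <- which(cum_var >= 0.90)[1]

# Reduced representation: one row per image, k columns
reduced <- pca$x[, 1:k, drop = FALSE]
dim(reduced)
```

The number of components kept (k here) is the kind of hyperparameter setting the report should state and justify.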

Week 12

Week Twelve | Unsupervised Learning using Clustering Part One

Weekly Learning Objective:

Learn and apply various unsupervised learning techniques SOM and LLE on the image data.

Assignment

MNIST image data

The main objective for this week is to write a fully executed R-Markdown program performing clustering using SOM and LLE on the image data containing MNIST (digits) images that are 28 x 28 pixels resolution.

Make sure to describe the final hyperparameter settings of all algorithms that were used for comparison purposes.

Points will be deducted in case you fail to explain the output.

Week 13

Week Thirteen | Unsupervised Learning using Clustering Part Two

Weekly Learning Objective:

Learn and apply density-based and distribution-based clustering techniques on the MNIST / Fashion MNIST image data.

Assignment

MNIST / Fashion MNIST image data

The main objective is to write a fully executed R-Markdown program performing clustering using DBSCAN and Mixture model on the MNIST / Fashion MNIST (apparel) images that are 28 x 28 pixels resolution.

Make sure to describe the final hyperparameter settings of all algorithms that were used for comparison purposes.

You are required to clearly display and explain the models that were run for this task and their effect on the reduction of the Cost Function.

Points will be deducted in case you fail to explain the output.

Week 14

Week Fourteen | Human Activity Recognition Final Case Analysis

Weekly Learning Objective:

Apply all machine learning techniques learnt in the course on a case study based on data collected from smart watch sensors.

Assignment

Human Activity Recognition (HAR) Case Analysis

The main objective is to write a fully executed R-Markdown program performing EDA on the given data on the human activity recognition experiment collected through smart watches.

Points will be deducted in case you fail to explain the output.

Week 15

Week Fifteen | Human Activity Recognition Case Analysis

Weekly Learning Objective:

Apply various ML algorithms and optimization techniques on the sensor data.

Assignment

Human Activity Recognition Case Study

The objective is to write a formal analysis report finishing the objectives set forth in the final case analysis document. A sample template for the final report is provided that contains minimum requirements for the report including the following sections: Introduction, Analysis and Results, Methodology, Limitations and Conclusion.

You are required to follow the report template and explain the findings, including the final models run for the task.

Points will be deducted in case you fail to explain the output.

Visual Analytics

Week 1

Week One | Installs

Objectives:

successfully install R
successfully install R Studio
successfully install packages

Assignment

Get the most recent version of R. It is free and runs on Windows, Mac, and Linux. Once R is installed, download and install R Studio, which is an integrated development environment for R. Run R Studio. In the Script window, type the following code, which creates a variable called my_packages containing a character vector built by the c() function (you will see the results in the Console window below the Script window):

my_packages <- c("tidyverse", "broom", "coefplot", "cowplot", "gapminder", "GGally", "ggrepel", "ggridges", "gridExtra", "here", "interplot", "margins", "maps", "mapproj", "mapdata", "MASS", "quantreg", "rlang", "scales", "survey", "srvyr", "viridis", "viridisLite", "devtools")

With the cursor anywhere in this code, click ‘Run’ or press Ctrl+Enter (Cmd+Enter on a Mac) to execute the code.

Then type and ‘run’ the following:

install.packages(my_packages, repos = "http://cran.rstudio.com")

The install.packages() function will contact the R Studio mirror of CRAN (the Comprehensive R Archive Network) and download these packages. If you encounter an error when executing this command, you may have to install the packages one at a time.
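If the bulk install fails, one approach is to find out which packages are still missing and install only those, one at a time. The sketch below uses a shortened, illustrative package list and the name missing_pkgs, both chosen for this example; the installation loop is guarded so that it only runs in an interactive session.

```r
# Shortened illustrative list; substitute the full my_packages vector
my_packages <- c("tidyverse", "broom", "gapminder")

# Names of packages already installed in this R library
installed <- rownames(installed.packages())

# Packages from the list that are not installed yet
missing_pkgs <- setdiff(my_packages, installed)

if (length(missing_pkgs) == 0) {
  message("All listed packages are already installed.")
} else if (interactive()) {
  # Install one at a time so a single failure does not stop the rest
  for (pkg in missing_pkgs) {
    tryCatch(
      install.packages(pkg, repos = "https://cran.rstudio.com"),
      error = function(e) message("Could not install ", pkg, ": ", conditionMessage(e))
    )
  }
} else {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
}
```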

Finally, install the author’s library with the following command:

devtools::install_github("kjhealy/socviz")

This command uses the install_github() function from the devtools library, which you installed previously, to download and install the socviz library. Use the GetHelp discussion board to get help if you need it, or simply report that you were successful.

On the right hand lower pane of RStudio, you will see a ‘packages’ tab. Scroll down till you see any of the packages you just installed and take a screen shot for your submission.

By Sunday at midnight, submit a Word document with screenshots showing your successful installations.

Week 3

Week Three | Getting Started

Objectives:

begin learning R syntax
create vectors
use functions
navigate R Studio
use R Markdown

Assignment

Follow the instructions in the book to create a new project. Give it a name and an author. Execute the 3 code snippets from the preface to bring all the packages into this new project, then create a new R Markdown document as instructed by the book. Load the tidyverse and socviz libraries.

In order to execute code in an R Markdown document, the code must be enclosed in a code chunk: the chunk begins with a line containing three backticks followed by {r} and ends with a line containing three backticks.

Use R studio to complete the following:

create a vector of the names of the members of your immediate family
display it
create a vector of numbers
find the mean of the numbers in the vector you created
assign the results of a function to an object
display the object
determine the class of the vector
change the class of the vector to character
show the new class
show the titanic dataframe
show the class of titanic
convert the titanic dataframe to a tibble
show the structure of various objects
show the structure of the mpg dataframe (mpg is an example dataset showing mileage for cars of various models and with different numbers of cylinders)
show the contents of the mpg dataframe
create a scatterplot of the mpg dataset
create an object called ‘url’ with the organ donation file in it
create an object with the organ donation data
find the structure of that object
load the gapminder dataset (watch this video to learn more about gapminder)
make a scatterplot of gdpPercap and lifeExp
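The first several steps in this list can be sketched with base R alone. The names and numbers below are placeholders chosen for this example; use your own values in the assignment.

```r
# A character vector of names (placeholder names, not real people)
family <- c("Ana", "Ben", "Cleo")
print(family)

# A numeric vector and its mean
nums <- c(4, 8, 15, 16, 23, 42)

# Assign the result of a function to an object, then display the object
avg <- mean(nums)
print(avg)

# Determine the class of the vector
class(nums)

# Change the class to character and show the new class
nums_chr <- as.character(nums)
class(nums_chr)

# str() shows the structure of any object
str(nums_chr)
```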

Submit an R Markdown document by Sunday at midnight with this work in it. Show the graphs, not the code. Comment on the submissions of two classmates.

Week 4

Week Four | Make a Plot

Objectives:

use the str() (structure) function
use summary() function
demonstrate different regression methods
experiment with the aes() function
show different types of labels

Assignment

Show meta data from the mpg dataframe using summary().

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

Show metadata from the gapminder dataframe

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
##

assign ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) to the variable ‘p’
find the structure of the p object.
add geom_point() to the p object. Show p.
replace geom_point() with geom_smooth(). Show p.
return to geom_point() and add geom_smooth(). Show p.
add the linear method (method = "lm") to the geom_smooth() function. Show p.
change the x axis scale to log10. Show p.
try scale_y_log10(). Show p.
change the method to gam from lm. Show p.
replace scientific notation on the x axis with dollar signs
identify the continent of each point with color. Show p.
add labels to the plot. Show p.
change the method to loess. Show p.
use fill to change the appearance of lines, points, and the interior of the smoother’s standard error ribbon. Show p.
limit the figure size in R markdown to 8 x 5.
save one of your plots in its own file.
experiment with saving files in different formats and different locations.
map different attributes from gapminder to see what they look like. Show the result.
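Several of these steps can be combined into one layered plot, sketched below. It assumes the ggplot2, gapminder and scales packages are installed; the object name p_final, the label text and the output file name are choices made for this example only.

```r
library(ggplot2)
library(gapminder)

# Base plot object: data and aesthetic mappings only, no geoms yet
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Add layers: colored points, a loess smoother, a log10 x axis labelled
# in dollars, and descriptive labels
p_final <- p +
  geom_point(mapping = aes(color = continent), alpha = 0.3) +
  geom_smooth(method = "loess") +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP per capita",
       y = "Life expectancy in years",
       title = "Economic growth and life expectancy",
       color = "Continent")

# Show the plot
print(p_final)

# Save the plot in its own file (8 x 5 inches; a temporary folder is used here)
out_file <- file.path(tempdir(), "gapminder_plot.pdf")
ggsave(filename = out_file, plot = p_final, width = 8, height = 5)
```

Changing the file extension in out_file (for example to .png) is one way to experiment with different formats.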

Submit a Word document by Sunday at midnight with your screen shots of your work. Explain what each image is.

Week 5

Week Five | Showing the Right Numbers

Objectives:

add faceting to graphs
use select() function
use subset() function
create histograms

Assignment

According to the author, ‘ggplot is an implementation of the grammar of graphics’ which is a set of rules for producing visualizations of data. In this first plot, we will track the trajectory of life expectancy over time for each country in the data.

map year to x and lifeExp to y.
use geom_line to show how lifeExp changes over time. (did you notice a mistaken assignment to the y parameter in the book?)
use grouping to make each line refer to a specific country in the dataset
facet the data on continent
try moving the facets around on the page - 5 across and 5 down
add a smoother, change the y scale to a log scale and add the dollar sign
add the labels as described in the book
try using facet_grid
be sure that you know what categorical variables, ordered and unordered, are. compare continuous variables.
use the glimpse function on the gss_sm data set. Try this gss_sm %>% glimpse(). What do you think the pipe operator (%>%) does?
make a smoothed scatterplot of the relationship between age of the respondent and the number of children. What did you learn?
facet the result with the sex and race of the respondent (the person who responded to the survey)
experiment with the alpha attribute
experiment with combining attributes in the facet wrapper
geom_bar uses count by default
describe how the geom_bar used the count function to determine how much water had been used.
Use the prop function and group by ‘1’ to show by region
show just the religion column in a table
create a bar chart showing the frequency of religious distribution in the data
use fill to highlight the different religions
create a stacked bar chart to show the frequency of religions by region
use the position = "dodge" argument to create individual religious frequency bars
facet the religious frequency chart by region
create a histogram showing midwest regions by size
experiment with the number of bins in the previous histogram
Where does the count variable come from?
use subset to show the data in a histogram just from Ohio and Wisconsin
create a kernel density plot of the area of the midwest states
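The last few items (histogram, bins, subset and kernel density) can be sketched using the midwest dataset that ships with ggplot2. The choice of percollege for the subset histogram and the bin counts are assumptions made for this example.

```r
library(ggplot2)

# Histogram of county area; the count on the y axis is computed by
# stat_bin(), it is not a column in the midwest data
p_area <- ggplot(data = midwest, mapping = aes(x = area)) +
  geom_histogram(bins = 10)

# Keep only counties in Ohio and Wisconsin
oh_wi <- subset(midwest, subset = state %in% c("OH", "WI"))

# Histogram for the two states, distinguished by fill color
p_oh_wi <- ggplot(data = oh_wi,
                  mapping = aes(x = percollege, fill = state)) +
  geom_histogram(alpha = 0.4, bins = 20)

# Kernel density of county area, one curve per state
p_density <- ggplot(data = midwest,
                    mapping = aes(x = area, fill = state, color = state)) +
  geom_density(alpha = 0.3)

print(p_area)
print(p_oh_wi)
print(p_density)
```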

Submit a Word document by Sunday at midnight with screen shots of your work and text. Explain what each image is.

Week 6

Week Six | Data Transformation

Objectives:

build layered visualizations
swap coordinate systems
create boxplots and whisker plots
explore summary statistics

Assignment

get the structure of the gss_sm dataframe. What is the data type of race, sex, region and income? What do the levels refer to?
create a graph that shows a count of religious preferences grouped by region
turn the region counts into percentages
use position = "dodge2" to put the religious affiliations side by side within regions
show the religious preferences by region, faceted version with the coordinate system swapped
using pipes show 10 random instances of the first six columns in the organdata data set
create a scatterplot of donors vs. year
create a faceted set of line chart graphs showing donors for year for different countries
create a boxplot of the data with coordinates swapped (because the mean is calculated in every boxplot and because R throws an error when trying to calculate means when there is missing data, add the na.rm = TRUE parameter to remove the NA’s).
replace the boxplot with points
jitter the points
reduce the amount of jitter
using organdata, create a table of summary statistics by country called by_country (show the mean of donors, gdp, health, roads, cerebvas, and the standard deviation of donors)
what is the cerebvas column referring to?
What conclusions can you draw from the previous plot?
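The grouped summary table can be sketched with dplyr as shown below. To keep the example self-contained it uses a tiny made-up data frame (toy, with invented values) instead of organdata; the same pipeline applies to organdata with country as the grouping column.

```r
library(dplyr)

# Tiny made-up data for illustration only (NOT organdata)
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  donors  = c(10, 12, NA, 20, 22, 24),
  gdp     = c(100, 110, 120, 200, NA, 220)
)

# Group by country, then summarize; na.rm = TRUE drops missing values
# so that mean() and sd() do not return NA
by_country_toy <- toy %>%
  group_by(country) %>%
  summarize(
    donors_mean = mean(donors, na.rm = TRUE),
    donors_sd   = sd(donors, na.rm = TRUE),
    gdp_mean    = mean(gdp, na.rm = TRUE)
  )

print(by_country_toy)
```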

Submit a Word document by Sunday at midnight with screen shots of your work and text. Explain what each image is.

Week 7

Week Seven | Plotting Text

Objectives:

add horizontal and vertical lines to graphs
use various repel functions to keep points and text from obscuring each other
add rectangles and labels to plots

Assignment

produce a scatterplot of the by_country data with the points colored by consent_law
Using facet_wrap() split the consent_law variable into two panels and rank the countries by donation rate within the panels
Use geom_pointrange() to create a dot and whisker plot showing the mean of donors and a confidence interval.
Create a scatterplot of roads_mean v. donors_mean with the labels identifying the country sitting to the right or left of the point
load the ggrepel library
using the elections_historic data, plot the presidents’ popular vote percentage v. electoral college vote percentage. Draw axes at 50% for each attribute and use geom_text_repel() to keep the labels from obscuring the points.
What is the electoral college?
create a new binary value column in organdata called ‘ind’ populated by determining whether the ccode is “Spa” or “Ita” and the year is after 1998.
create an organdata plot of Roads v. Donors and map the ind attribute to the color aesthetic. Label those points with the ccode and suppress the legends.
Add a label in a rectangle to the previous plot that says “Spa = Spain & Ita = Italy”.

Submit a Word document by Sunday at midnight with screen shots of your work and text. Explain what each image is.

Week 8

Week Eight | dplyr

Objectives:

use the dplyr library
calculate summary statistics
create violin plots

Assignment

Return to the visualization for Presidential Elections: Popular and Electoral College margins, subset by party, and use that to add color to your points.
Recreate figure 5.28 using functions from the dplyr library.
Using gss_sm data, calculate the mean and median number of children by degree
Using gapminder data, create a boxplot of life expectancy over time
Using gapminder data, create a violin plot of population over time.

Submit a Word document by Sunday at midnight with screen shots of your work and text. Explain what each image is.

Week 9

Week Nine | Working with Models 1

Objectives:

learn when to use logarithms for plotting
combine two data frames together
use OLS regression models
build loess regression models

Assignment

Using the gapminder data, create a plot comparing log(gdpPercap) with lifeExp and show three different smoothers in three different colors with a legend showing each smoother type.
In a paragraph, compare and contrast the smoother types: LOESS, Cubic Spline, and OLS.
Look at the gapminder data with str()
Create a linear model of the gapminder data with life expectancy as the target of a multifactor model built from gdpPercap, pop, and continent. Store it in a variable called ‘out’.
print a summary of out.
notice that printing a summary of gapminder will produce different results because summary() knows that out is the output of a linear model.
Use min() and max() to get the minimum and maximum values of per capita GDP and create a vector of 100 evenly spaced elements between them while holding population constant at its median and showing the values by continent using a vector.
use predict() to calculate the fitted values for every row in the dataframe and show the upper and lower bounds of a 95% confidence interval. Store the result in a variable predi_out.
Use cbind() to bind the two data frames together by column.
Look at the top six rows of the result with head()
make an OLS plot of the combined dataframes after subsetting continent to Africa and Europe using geom_ribbon to show the prediction intervals.
What does the alpha aesthetic do?
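The modeling steps above can be sketched as follows, assuming the gapminder package is installed. The names out and predi_out come from the assignment; pred_df is a name chosen for this example. The interval type is set to "confidence" to match the wording of the predict() step; interval = "prediction" would give the wider prediction intervals mentioned in the plotting step.

```r
library(gapminder)

# Linear model of life expectancy on GDP per capita, population and continent
out <- lm(lifeExp ~ gdpPercap + pop + continent, data = gapminder)
summary(out)

# Grid of new data: 100 evenly spaced GDP values for each continent,
# with population held constant at its median
min_gdp <- min(gapminder$gdpPercap)
max_gdp <- max(gapminder$gdpPercap)
med_pop <- median(gapminder$pop)

pred_df <- expand.grid(
  gdpPercap = seq(from = min_gdp, to = max_gdp, length.out = 100),
  pop = med_pop,
  continent = levels(gapminder$continent)
)

# Fitted values with lower and upper bounds of a 95% interval
predi_out <- predict(out, newdata = pred_df,
                     interval = "confidence", level = 0.95)

# Bind the grid and the predictions together by column
pred_df <- cbind(pred_df, predi_out)
head(pred_df)
```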

Submit a Word document by Sunday at midnight with screen shots of your work and text. Explain what each image is.

Week 10

Week Ten | Working with Models 2

Objectives:

Install the broom library
Use the tidy() function
Use the glance() function
Implement the augment() function

Assignment

load the broom library
use tidy() on the out model object to produce a new dataframe of component level information. Store the result in out_comp.
round all the columns to two decimal places using round_df().
Produce a flipped scatter plot of Term v. Estimate
Produce a new tidy output of out including confidence intervals. Store it in a variable called out_conf after rounding the dataframe to two decimals.
Remove the intercept column and the term continent from the label and make a plot of points with whiskers to show the coefficients with a confidence range and order the output from smallest to largest.
use the head function to see the first six rows after applying the augment function to out. Store the result in out_aug.
Add the data back into out_aug with the data = argument.
plot the .fitted data v. the .resid data
What does this graph show?
using the pipe round the output of glance(out)
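The three broom functions can be sketched together as shown below, assuming the broom and gapminder packages are installed and that out is the linear model from the previous week. Rounding is done here with base R rather than round_df(), and model_fit is a name chosen for this example.

```r
library(broom)
library(gapminder)

# The linear model from the previous week
out <- lm(lifeExp ~ gdpPercap + pop + continent, data = gapminder)

# Component-level information (one row per term), with confidence intervals
out_conf <- tidy(out, conf.int = TRUE)

# Round every numeric column to two decimals using base R
num_cols <- sapply(out_conf, is.numeric)
out_conf[num_cols] <- lapply(out_conf[num_cols], round, 2)
print(out_conf)

# Observation-level information (fitted values, residuals) joined to the data
out_aug <- augment(out, data = gapminder)
head(out_aug)

# Model-level information (R-squared, AIC and so on) in a one-row data frame
model_fit <- glance(out)
print(model_fit)
```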

Submit a Word document by Sunday at midnight with screen shots of your work and text. Explain what each image is.

Week 11

Week Eleven | Working with Models 3

Objectives:

use SQL-like functions such as group_by()
use mutate() to add columns to data
use nest()
explore the tidy library

Assignment

Take a slice of the gapminder data showing only 1977
create a linear model with lifeExp as the target and the log of gdpPercap as the predictor. Save it in a variable called fit and show the summary.
Group the entire data set by continent and year, pipe it through the nest() function and store it in a variable called out_le.
The result will be a tibble of columns of data and columns of tibbles called list columns with data in them.
Use the filter() and unnest() functions to see Europe 1977
create a function, function(df), that fits a linear model of lifeExp on log(gdpPercap). Save it in fit.ols
then use group_by(), nest(), and mutate() to create a tibble called out_le and show the contents of the tibble.
type fit.ols to see what it looks like after you have created it
type fit.ols(df = gapminder) to see what the new function does.
using the tidy() function, extract summary statistics from each model by mapping the tidy() to the model list column. Unnest the result and drop the other columns. Filter out all the Intercept terms and drop the few observations from Oceania. Save five rows of this pipeline in out_tidy.
Plot the output in a dot and whisker diagram (geom_pointrange()) grouped and colored by continent. Use position_dodge within geom_pointrange to ensure that the results will be nearby but will not obscure each other.
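The nest-and-map pipeline described above can be sketched as follows, assuming the dplyr, tidyr, purrr, broom and gapminder packages are installed. The names fit.ols, out_le and out_tidy come from the assignment; the column names model and tidied are choices made for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(gapminder)

# Function that fits the same linear model to whatever data frame it is given
fit.ols <- function(df) {
  lm(lifeExp ~ log(gdpPercap), data = df)
}

# One row per continent-year, with the matching rows stored in a list
# column called data and the fitted model stored in a list column called model
out_le <- gapminder %>%
  group_by(continent, year) %>%
  nest() %>%
  mutate(model = map(data, fit.ols))

# Tidy each model, unnest the results, and keep only the slope terms
# for continents other than Oceania
out_tidy <- out_le %>%
  mutate(tidied = map(model, tidy)) %>%
  ungroup() %>%
  select(continent, year, tidied) %>%
  unnest(cols = tidied) %>%
  filter(term != "(Intercept)", continent != "Oceania")

head(out_tidy)
```

The estimate and std.error columns of out_tidy are what geom_pointrange() needs for the dot and whisker plot.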

Submit by Sunday at midnight a Word document with screen shots of your work showing a slice of your desktop and text. Explain what each image is.

Week 12

Week Twelve | Working with Models 4

Objectives:

load margins library
model data with logistic regression
use default plot function

Assignment

load the margins library
create a new column called polviews_m to use Moderate as a reference category using relevel on the polviews column of the gss_sm data.
use glm() to create a logistic regression model called out_bo that predicts obama from polviews_m, sex and race, including a sex-by-race interaction: glm(obama ~ polviews_m + sex*race, family = "binomial", data = gss_sm).
use summary() on out_bo to see what the results look like
calculate the marginal effects of each variable and store that in a variable called bo_m.
plot(bo_m) to see a graph of the results
create a tibble called bo_gg of the summary() of bo_m, create a vector of the prefixes ‘polviews_m’ and ‘sex’. And, remove the prefixes from the factor column and replace ‘race’ with ‘Race:’ in the factor column. Finally, limit the contents of the bo_gg attributes to ‘factor’, ‘AME’, ‘lower’, and ‘upper’.
Plot the average marginal effects with a point and the upper and lower bounds with whiskers

Submit by Sunday at midnight a Word document with screen shots of your work with a slice of your desktop and text. Explain what each image is.

Week 13

Week Thirteen | Mapping

Objectives:

discover spatial data
make choropleth maps
recognize relationship between data and maps
merge social data and map data to make socially informative maps

Assignment

pipe the election data through the select() function to pick out the following columns - state, total_vote, r_points, pct_trump, party, census. Pipe that through sample_n() to see five random rows.
Create a state level dotplot of election data except the District of Columbia faceted by region. Colorize the dots by party and insert a vertical line dividing the parties, scale the x axis from -30 to +40, put the states on the y axis and label each facet by region and the entire set by “Point Margin”.
create a colorized map of the states of the United States without a legend using the Albers projection.
remove the grid lines and axis labels, if any, and color the previous map according to the 2016 election results - red for Trump, blue for Clinton
create a red colored gradient map of Trump voters by percentage in each state with the deeper intensity of the color reflecting the higher the percentage
create map of Trump v. Clinton with purple showing the midpoint between red (Trump) and blue (Clinton).
create a map of opiate related adjusted death rates per state faceted by year from 2000 to 2014. Use a viridis scale color gradient and an Albers projection.
What regions stand out to you as bearing watching? Why?
show the opiate data as a time series, faceted by region showing all states within the region and adding a smoothing curve.
Interpret the features of this set of graphs.

Submit by Sunday at midnight a Word document with screen shots (showing a slice of your desktop) of your work and text. Explain what each image is.

Week 14

Week Fourteen | Refinements in ggplot

Objectives:

use head() to see the first n rows of a data set
include confidence intervals in plots
compare faceted graphs to pie charts

Assignment

look at the first six rows of the asasec dataset
plot members v revenue for 2014 in a scatterplot with a confidence interval
switch from loess to ols and add the Journal variable
show the first six rows of studebt
create a faceted comparison of the two distributions - percent of all borrowers and Percent of all balances to show how student loan debt is distributed.
Compare this pair of graphs to the pie charts in figure 8.24. With which visualization do you find it easier to make comparisons? Why?

Submit by Sunday at midnight a Word document with screen shots of your work showing a slice of your desktop and text. Explain what each image is.

New England College

Saraswathi Analytics
