STAT 452 Homework1 Ulziibat Tserenbat February 2, 2019
Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis.
Predictive analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Predictive analytics does not tell you what will happen in the future; it estimates what is likely to happen.
In the field of computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.
Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.
This book provides clear and intuitive guidance on how to implement cutting edge statistical and machine learning methods.
This is an R Notebook with the code from Machine Learning with R, Lantz.
Libraries
library(here)
library(lattice)
library(corrgram)
library(gmodels)
here::here()
[1] "C:/Users/Harold.KCG-HAROLD/Downloads/Chap02/Chap02"
Vectors
create vectors of data for three medical patients
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)
access the second element in body temperature vector
temperature[2]
[1] 98.6
examples of accessing items in vector
include items in the range 2 to 3
temperature[2:3]
[1] 98.6 101.4
exclude item 2 using the minus sign
temperature[-2]
[1] 98.1 101.4
use a vector to indicate whether to include item
temperature[c(TRUE, TRUE, FALSE)]
[1] 98.1 98.6
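The logical vector used for indexing does not have to be typed by hand; a comparison produces one. The following is a minimal sketch using the vectors defined above. The fever threshold of 100 is an assumption chosen only for illustration.

```r
# Vectors as defined earlier in the notebook
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)

# A comparison returns a logical vector of the same length
# (the threshold 100 is an illustrative assumption)
fever <- temperature > 100

# Use the logical vector to keep only the matching elements
temperature[fever]
subject_name[fever]

# which() converts the logical vector into integer positions
which(fever)
```

Because the logical vector is computed from the data, the same expression keeps working if more patients are added.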
add gender factor
gender <- factor(c("MALE", "FEMALE", "MALE"))
gender
[1] MALE FEMALE MALE
Levels: FEMALE MALE
add blood type factor
blood <- factor(c("O", "AB", "A"),
levels = c("A", "B", "AB", "O"))
blood
[1] O AB A
Levels: A B AB O
add ordered factor
symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
levels = c("MILD", "MODERATE", "SEVERE"),
ordered = TRUE)
symptoms
[1] SEVERE MILD MODERATE
Levels: MILD < MODERATE < SEVERE
check for symptoms greater than moderate
symptoms > "MODERATE"
[1] TRUE FALSE FALSE
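The comparison above works because an ordered factor stores its levels in rank order. A short sketch of what that ordering enables, using only base R:

```r
symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
                   levels = c("MILD", "MODERATE", "SEVERE"),
                   ordered = TRUE)

# The underlying integer codes follow the level order (MILD = 1, ..., SEVERE = 3)
as.integer(symptoms)

# Comparison operators respect the level order
symptoms >= "MODERATE"

# Sorting and max() also use the level order rather than the alphabet
sort(symptoms)
max(symptoms)
```

An unordered factor such as `gender` would return `NA` with a warning for the same comparison, which is one reason to declare the order explicitly.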
display information for a patient
subject_name[1]
[1] "John Doe"
temperature[1]
[1] 98.1
flu_status[1]
[1] FALSE
gender[1]
[1] MALE
Levels: FEMALE MALE
blood[1]
[1] O
Levels: A B AB O
symptoms[1]
[1] SEVERE
3 Levels: MILD < ... < SEVERE
create list for a patient and display the patient
subject1 <- list(fullname = subject_name[1],
temperature = temperature[1],
flu_status = flu_status[1],
gender = gender[1],
blood = blood[1],
symptoms = symptoms[1])
subject1
$`fullname`
[1] "John Doe"
$temperature
[1] 98.1
$flu_status
[1] FALSE
$gender
[1] MALE
Levels: FEMALE MALE
$blood
[1] O
Levels: A B AB O
$symptoms
[1] SEVERE
Levels: MILD < MODERATE < SEVERE
methods for accessing a list
get a single list value by position (returns a sub-list)
subject1[2]
$`temperature`
[1] 98.1
get a single list value by position (returns a numeric vector)
subject1[[2]]
[1] 98.1
get a single list value by name
subject1$temperature
[1] 98.1
get several list items by specifying a vector of names
subject1[c("temperature", "flu_status")]
$`temperature`
[1] 98.1
$flu_status
[1] FALSE
access a list like a vector: get values 2 and 3
subject1[2:3]
$temperature
[1] 98.1
$flu_status
[1] FALSE
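The difference between single and double brackets matters as soon as the value is used in a calculation. The sketch below uses a reduced version of the patient list (three of the six fields, an assumption made to keep it short) and checks what each accessor returns.

```r
# Reduced version of the patient list, for illustration only
subject1 <- list(fullname = "John Doe",
                 temperature = 98.1,
                 flu_status = FALSE)

# Single brackets return a list of length one
class(subject1[2])

# Double brackets and $ return the element itself
class(subject1[[2]])
class(subject1$temperature)

# Arithmetic works on the element, not on the sub-list
temp_plus_one <- subject1[["temperature"]] + 1
temp_plus_one

# subject1["temperature"] + 1 would fail with a
# "non-numeric argument to binary operator" error
```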
create a data frame from medical patient data and display the data frame
pt_data <- data.frame(subject_name, temperature, flu_status, gender,
blood, symptoms, stringsAsFactors = FALSE)
pt_data
accessing a data frame
get a single column
pt_data$subject_name
[1] "John Doe" "Jane Doe" "Steve Graves"
get several columns by specifying a vector of names
pt_data[c("temperature", "flu_status")]
this is the same as above, extracting temperature and flu_status
pt_data[2:3]
accessing by row and column
pt_data[1, 2]
[1] 98.1
accessing several rows and several columns using vectors
pt_data[c(1, 3), c(2, 4)]
Leave a row or column blank to extract all rows or columns
# column 1, all rows
pt_data[, 1]
[1] "John Doe" "Jane Doe" "Steve Graves"
# row 1, all columns
pt_data[1, ]
# all rows and all columns
pt_data[ , ]
the following are equivalent
pt_data[c(1, 3), c("temperature", "gender")]
pt_data[-2, c(-1, -3, -5, -6)]
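Whether two index expressions select the same cells can be checked directly with `all.equal()`. The sketch below rebuilds the patient data frame from the values shown earlier, compares selecting columns 2 and 4 by name against excluding the other columns by negative position, and then shows row selection with a logical condition.

```r
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = factor(c("MALE", "FEMALE", "MALE")),
  blood        = factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O")),
  symptoms     = factor(c("SEVERE", "MILD", "MODERATE"),
                        levels = c("MILD", "MODERATE", "SEVERE"),
                        ordered = TRUE),
  stringsAsFactors = FALSE
)

# Rows 1 and 3, columns 2 and 4: once by inclusion, once by exclusion
by_name      <- pt_data[c(1, 3), c("temperature", "gender")]
by_exclusion <- pt_data[-2, c(-1, -3, -5, -6)]
all.equal(by_name, by_exclusion)

# Rows can also be selected with a logical condition on a column
flu_rows <- pt_data[pt_data$flu_status, ]
flu_rows$subject_name
```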
create a 2x2 matrix
m <- matrix(c(1, 2, 3, 4), nrow = 2)
m
[,1] [,2]
[1,] 1 3
[2,] 2 4
equivalent to the above
m <- matrix(c(1, 2, 3, 4), ncol = 2)
m
[,1] [,2]
[1,] 1 3
[2,] 2 4
create a 2x3 matrix
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
create a 3x2 matrix
m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
extract values from matrices
m[1, 1]
[1] 1
m[3, 2]
[1] 6
extract rows
m[1, ]
[1] 1 4
extract columns
m[, 1]
[1] 1 2 3
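All of the matrices above are filled column by column, which is the default in R. A brief sketch of the alternative, using the `byrow` argument of `matrix()`:

```r
# Default: values fill the first column, then the second, and so on
m_col <- matrix(1:6, nrow = 2)
m_col

# byrow = TRUE fills the first row, then the second
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_row

# Filling a 2x3 matrix by row gives the transpose of the 3x2 column-filled matrix
identical(m_row, t(matrix(1:6, ncol = 2)))
```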
saving, loading, and removing R data structures
show all data structures in memory
ls()
[1] "blood" "flu_status"
[3] "gender" "m"
[5] "model_table" "pt_data"
[7] "subject_name" "symptoms"
[9] "temperature" "usedcars"
remove the m and subject1 objects
rm(m, subject1)
object 'subject1' not found
ls()
[1] "blood" "flu_status"
[3] "gender" "model_table"
[5] "pt_data" "subject_name"
[7] "symptoms" "temperature"
[9] "usedcars"
rm(list=ls())
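The steps above cover removing objects; saving and loading can be sketched as follows. The file locations come from `tempfile()` and are placeholders, not paths used in the original notebook.

```r
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)

# save() writes one or more named objects to a single file
rda_path <- tempfile(fileext = ".RData")  # placeholder location
save(temperature, flu_status, file = rda_path)

# Remove the objects, then restore them under their original names
rm(temperature, flu_status)
load(rda_path)
temperature

# saveRDS()/readRDS() store a single object and let you choose the name on reading
rds_path <- tempfile(fileext = ".rds")    # placeholder location
saveRDS(temperature, rds_path)
temps_copy <- readRDS(rds_path)
temps_copy
```

`load()` silently overwrites any existing objects with the same names, so `readRDS()` is often the safer choice for a single object.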
data exploration example using used car data
usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)
get structure of used car data
str(usedcars)
'data.frame': 150 obs. of 6 variables:
$ year : int 2011 2011 2011 2011 2012 2010 2011 2010 2011 2010 ...
$ model : chr "SEL" "SEL" "SEL" "SEL" ...
$ price : int 21992 20995 19995 17809 17500 17495 17000 16995 16995 16995 ...
$ mileage : int 7413 10926 7351 11613 8367 25125 27393 21026 32655 36116 ...
$ color : chr "Yellow" "Gray" "Silver" "Gray" ...
$ transmission: chr "AUTO" "AUTO" "AUTO" "AUTO" ...
summarize numeric variables
summary(usedcars$year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2000 2008 2009 2009 2010 2012
summary(usedcars[c("price", "mileage")])
price mileage
Min. : 3800 Min. : 4867
1st Qu.:10995 1st Qu.: 27200
Median :13592 Median : 36385
Mean :12962 Mean : 44261
3rd Qu.:14904 3rd Qu.: 55124
Max. :21992 Max. :151479
calculate the mean income
(36000 + 44000 + 56000) / 3
[1] 45333.33
mean(c(36000, 44000, 56000))
[1] 45333.33
the median income
median(c(36000, 44000, 56000))
[1] 44000
the min/max of used car prices
range(usedcars$price)
[1] 3800 21992
the difference of the range
diff(range(usedcars$price))
[1] 18192
IQR for used car prices
IQR(usedcars$price)
[1] 3909.5
use quantile to calculate five-number summary
quantile(usedcars$price)
0% 25% 50% 75% 100%
3800.0 10995.0 13591.5 14904.5 21992.0
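The IQR is the distance between the 75% and 25% quantiles, which matches the output above (14904.5 - 10995.0 = 3909.5). The sketch below checks this relationship on a small hypothetical price vector, used here because `usedcars.csv` may not be at hand.

```r
# Hypothetical toy sample of prices, for illustration only
prices <- c(3800, 10995, 13592, 14904, 21992)

# Quartiles from quantile(), then the IQR by subtraction
q <- quantile(prices, probs = c(0.25, 0.75))
iqr_manual <- unname(q["75%"] - q["25%"])
iqr_manual

# The built-in function gives the same value
IQR(prices)

# fivenum() is an alternative five-number summary based on Tukey's hinges
fivenum(prices)
```

For some sample sizes `fivenum()` and `quantile()` differ slightly at the quartiles, because they use different interpolation rules.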
the 1st and 99th percentiles
quantile(usedcars$price, probs = c(0.01, 0.99))
1% 99%
5428.69 20505.00
quintiles
quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
0% 20% 40% 60% 80% 100%
3800.0 10759.4 12993.8 13992.0 14999.0 21992.0
boxplot of used car prices and mileage
boxplot(usedcars$price, main="Boxplot of Used Car Prices",
ylab="Price ($)")
boxplot(usedcars$price ~ usedcars$transmission, main="Boxplot of Used Car Prices by Transmission",
ylab="Price ($)")
using the lattice package
lattice::bwplot(usedcars$price ~ usedcars$transmission, ylab = "Price", xlab = "Transmission", main = "Price by Transmission")
usedcars$year <- as.character(usedcars$year)
lattice::bwplot(usedcars$price ~ usedcars$transmission | usedcars$year, ylab = "Price", xlab = "Transmission", main = "Price by Transmission and Year", layout = c(5, 3))
boxplot(usedcars$mileage, main = "Boxplot of Used Car Mileage", ylab = "Odometer (mi.)")
boxplot(usedcars$mileage ~ usedcars$transmission, main = "Boxplot of Used Car Mileage by Transmission", ylab = "Odometer (mi.)")
histograms of used car prices and mileage
hist(usedcars$price, main = "Histogram of Used Car Prices", xlab = "Price ($)")
hist(usedcars$mileage, main = "Histogram of Used Car Mileage", xlab = "Odometer (mi.)")
lattice::histogram(~ usedcars$price, xlab = "Price", main = "Histogram of Price")
usedcars$year <- as.character(usedcars$year)
lattice::histogram(~ usedcars$price | usedcars$year, xlab = "Price", main = "Histogram of Price by Year", layout = c(5, 3))
lattice::histogram(~ usedcars$mileage, xlab = "Mileage", main = "Histogram of Mileage")
usedcars$year <- as.character(usedcars$year)
lattice::histogram(~ usedcars$mileage | usedcars$year, xlab = "Mileage", main = "Histogram of Mileage by Year", layout = c(5, 3))
variance and standard deviation of the used car data
var(usedcars$price)
[1] 9749892
sd(usedcars$price)
[1] 3122.482
var(usedcars$mileage)
[1] 728033954
sd(usedcars$mileage)
[1] 26982.1
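The standard deviation is the square root of the variance, and both `var()` and `sd()` use the sample formula with n - 1 in the denominator. A minimal sketch that verifies this by hand on the three incomes from the mean and median example:

```r
income <- c(36000, 44000, 56000)

# Sample variance: sum of squared deviations divided by (n - 1)
n <- length(income)
var_manual <- sum((income - mean(income))^2) / (n - 1)
var_manual

# Standard deviation is the square root of the variance
sd_manual <- sqrt(var_manual)
sd_manual

# Both agree with the built-in functions
all.equal(var_manual, var(income))
all.equal(sd_manual, sd(income))
```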
one-way tables for the used car data
table(usedcars$year)
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
3 1 1 1 3 2 6 11 14 42 49 16 1
table(usedcars$model)
SE SEL SES
78 23 49
table(usedcars$color)
Black Blue Gold Gray Green Red Silver White Yellow
35 17 1 16 5 25 32 16 3
compute table proportions
model_table <- table(usedcars$model)
prop.table(model_table)
SE SEL SES
0.5200000 0.1533333 0.3266667
convert the color proportions to percentages and round to one decimal place
color_table <- table(usedcars$color)
color_pct <- prop.table(color_table) * 100
round(color_pct, digits = 1)
Black Blue Gold Gray Green Red Silver White Yellow
23.3 11.3 0.7 10.7 3.3 16.7 21.3 10.7 2.0
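`prop.table()` simply divides each count by the table total. The sketch below rebuilds the model counts shown above (78 SE, 23 SEL, 49 SES) as a stand-in for the `usedcars$model` column and checks the division by hand.

```r
# Recreate the model column from the counts printed above
model <- rep(c("SE", "SEL", "SES"), times = c(78, 23, 49))
model_table <- table(model)

# Proportions from prop.table()
model_prop <- prop.table(model_table)
model_prop

# The same proportions computed by hand
manual_prop <- model_table / sum(model_table)
all.equal(as.numeric(model_prop), as.numeric(manual_prop))

# Percentages rounded to one decimal place
round(model_prop * 100, digits = 1)
```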
correlation
cor(x = usedcars$mileage, y = usedcars$price)
[1] -0.8061494
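Pearson's correlation, the default method of `cor()`, is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks that identity on small hypothetical mileage and price vectors; the values are made up for illustration and are not taken from `usedcars.csv`.

```r
# Hypothetical values, for illustration only
mileage <- c(10000, 30000, 50000, 90000)
price   <- c(20000, 15000, 13000, 8000)

# Correlation computed from its definition
r_manual <- cov(mileage, price) / (sd(mileage) * sd(price))
r_manual

# Built-in Pearson correlation
r_builtin <- cor(x = mileage, y = price)
all.equal(r_manual, r_builtin)
```

A negative value, as in the used car data, means that price tends to fall as mileage rises.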
scatterplot of price vs. mileage
plot(x = usedcars$mileage, y = usedcars$price, main = "Scatterplot of Price vs. Mileage", xlab = "Used Car Odometer (mi.)", ylab = "Used Car Price ($)")
The corrgram package provides the corrgram function, which is useful for examining relationships between numeric variables.
corrgram::corrgram(usedcars, lower.panel = panel.ellipse, upper.panel = panel.pts)
new variable indicating conservative colors
usedcars$conservative <- usedcars$color %in% c("Black", "Gray", "Silver", "White")
checking our variable
table(usedcars$conservative)
FALSE TRUE
51 99
Crosstab of conservative by model
gmodels::CrossTable(x = usedcars$model, y = usedcars$conservative)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 150
| usedcars$conservative
usedcars$model | FALSE | TRUE | Row Total |
---------------|-----------|-----------|-----------|
SE | 27 | 51 | 78 |
| 0.009 | 0.004 | |
| 0.346 | 0.654 | 0.520 |
| 0.529 | 0.515 | |
| 0.180 | 0.340 | |
---------------|-----------|-----------|-----------|
SEL | 7 | 16 | 23 |
| 0.086 | 0.044 | |
| 0.304 | 0.696 | 0.153 |
| 0.137 | 0.162 | |
| 0.047 | 0.107 | |
---------------|-----------|-----------|-----------|
SES | 17 | 32 | 49 |
| 0.007 | 0.004 | |
| 0.347 | 0.653 | 0.327 |
| 0.333 | 0.323 | |
| 0.113 | 0.213 | |
---------------|-----------|-----------|-----------|
Column Total | 51 | 99 | 150 |
| 0.340 | 0.660 | |
---------------|-----------|-----------|-----------|
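The "Chi-square contribution" in each cell is (observed - expected)^2 / expected, where the expected count is row total times column total divided by the table total. The sketch below recomputes these values from the counts printed in the table, so it runs without `usedcars.csv`; it relies only on base R.

```r
# Observed counts copied from the cross-tabulation above
counts <- matrix(c(27, 51,
                    7, 16,
                   17, 32),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(model = c("SE", "SEL", "SES"),
                                 conservative = c("FALSE", "TRUE")))

# Expected counts under independence of model and color group
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Per-cell contributions to the chi-square statistic
contrib <- (counts - expected)^2 / expected
round(contrib, 3)

# Their sum is the Pearson chi-square statistic reported by chisq.test()
chi_sq <- sum(contrib)
chi_sq
chisq.test(counts)$statistic
```

The contributions are all close to zero, which suggests little association between car model and the conservative color indicator.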