STAT 452 Homework1 Ulziibat Tserenbat February 2, 2019
Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis.
Predictive analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Predictive analytics does not tell you what will happen in the future; it estimates what is likely to happen, with some degree of uncertainty.
In the field of computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.
Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.
This book provides clear and intuitive guidance on how to implement cutting edge statistical and machine learning methods.
This is an R Notebook with the code from Machine Learning with R, Lantz.
Libraries
library(here)
library(lattice)
library(corrgram)
library(gmodels)
here::here()
[1] "C:/Users/Harold.KCG-HAROLD/Downloads/Chap02/Chap02"
Vectors
create vectors of data for three medical patients
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)
access the second element in body temperature vector
temperature[2]
[1] 98.6
examples of accessing items in vector
include items in the range 2 to 3
temperature[2:3]
[1] 98.6 101.4
exclude item 2 using the minus sign
temperature[-2]
[1] 98.1 101.4
use a vector to indicate whether to include item
temperature[c(TRUE, TRUE, FALSE)]
[1] 98.1 98.6
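In practice the logical vector is usually computed from a condition instead of typed by hand. A minimal sketch, assuming a hypothetical fever threshold of 100 degrees (the threshold and the `has_fever` name are not from the original notebook):

```r
temperature <- c(98.1, 98.6, 101.4)
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")

# hypothetical cutoff, chosen only for illustration
fever_threshold <- 100

# comparing a vector to a scalar yields a logical vector of the same length
has_fever <- temperature > fever_threshold
has_fever                  # FALSE FALSE TRUE

# the logical vector can index any vector of matching length
subject_name[has_fever]    # "Steve Graves"

# which() converts the logical vector to integer positions
which(has_fever)           # 3
```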
add gender factor
gender <- factor(c("MALE", "FEMALE", "MALE"))
gender
[1] MALE FEMALE MALE
Levels: FEMALE MALE
add blood type factor
blood <- factor(c("O", "AB", "A"),
levels = c("A", "B", "AB", "O"))
blood
[1] O AB A
Levels: A B AB O
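Declaring all four levels matters even though no patient has type B. A short sketch of what the explicit `levels` argument changes, using the same `blood` factor:

```r
blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))

# the unused level B is still counted, with a frequency of zero
blood_counts <- table(blood)
blood_counts

# internally a factor stores integer codes that index into levels()
as.integer(blood)   # 4 3 1
levels(blood)       # "A" "B" "AB" "O"
```

Without the `levels` argument, R would infer only the three observed levels, in alphabetical order, and type B would be missing from any table built from the factor.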
add ordered factor
symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
levels = c("MILD", "MODERATE", "SEVERE"),
ordered = TRUE)
symptoms
[1] SEVERE MILD MODERATE
Levels: MILD < MODERATE < SEVERE
check for symptoms greater than moderate
symptoms > "MODERATE"
[1] TRUE FALSE FALSE
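The comparison works only because the factor is ordered. A brief sketch contrasting the ordered `symptoms` factor with the unordered `gender` factor:

```r
symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
                   levels = c("MILD", "MODERATE", "SEVERE"),
                   ordered = TRUE)
gender <- factor(c("MALE", "FEMALE", "MALE"))

# ordered factors support comparison operators and max()/min()
symptoms >= "MODERATE"        # TRUE FALSE TRUE
as.character(max(symptoms))   # "SEVERE"

# unordered factors do not: R warns and returns NA for every element
unordered_result <- suppressWarnings(gender > "FEMALE")
unordered_result              # NA NA NA
```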
display information for a patient
subject_name[1]
[1] "John Doe"
temperature[1]
[1] 98.1
flu_status[1]
[1] FALSE
gender[1]
[1] MALE
Levels: FEMALE MALE
blood[1]
[1] O
Levels: A B AB O
symptoms[1]
[1] SEVERE
3 Levels: MILD < ... < SEVERE
create list for a patient and display the patient
subject1 <- list(fullname = subject_name[1],
temperature = temperature[1],
flu_status = flu_status[1],
gender = gender[1],
blood = blood[1],
symptoms = symptoms[1])
subject1
$`fullname`
[1] "John Doe"
$temperature
[1] 98.1
$flu_status
[1] FALSE
$gender
[1] MALE
Levels: FEMALE MALE
$blood
[1] O
Levels: A B AB O
$symptoms
[1] SEVERE
Levels: MILD < MODERATE < SEVERE
methods for accessing a list
get a single list value by position (returns a sub-list)
subject1[2]
$`temperature`
[1] 98.1
get a single list value by position (returns a numeric vector)
subject1[[2]]
[1] 98.1
get a single list value by name
subject1$temperature
[1] 98.1
get several list items by specifying a vector of names
subject1[c("temperature", "flu_status")]
$`temperature`
[1] 98.1
$flu_status
[1] FALSE
access a list like a vector: get values 2 and 3
subject1[2:3]
$temperature
[1] 98.1
$flu_status
[1] FALSE
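The difference between `[ ]` and `[[ ]]` matters as soon as the value is used in a calculation. A minimal sketch with a shortened version of `subject1` (only three of the six fields, to keep it brief):

```r
subject1 <- list(fullname = "John Doe",
                 temperature = 98.1,
                 flu_status = FALSE)

# single brackets keep the list wrapper; double brackets unwrap the element
class(subject1[2])      # "list"
class(subject1[[2]])    # "numeric"

# arithmetic works on the unwrapped value
subject1[["temperature"]] + 1

# arithmetic on the sub-list fails, so the error is caught here for illustration
sublist_result <- tryCatch(subject1[2] + 1,
                           error = function(e) "error")
sublist_result          # "error"
```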
create a data frame from medical patient data and display the data frame
pt_data <- data.frame(subject_name, temperature, flu_status, gender,
blood, symptoms, stringsAsFactors = FALSE)
pt_data
accessing a data frame
get a single column
pt_data$subject_name
[1] "John Doe" "Jane Doe" "Steve Graves"
get several columns by specifying a vector of names
pt_data[c("temperature", "flu_status")]
this is the same as above, extracting temperature and flu_status
pt_data[2:3]
accessing by row and column
pt_data[1, 2]
[1] 98.1
accessing several rows and several columns using vectors
pt_data[c(1, 3), c(2, 4)]
Leave a row or column blank to extract all rows or columns
# column 1, all rows
pt_data[, 1]
[1] "John Doe" "Jane Doe" "Steve Graves"
# row 1, all columns
pt_data[1, ]
# all rows and all columns
pt_data[ , ]
the following are equivalent
pt_data[c(1, 3), c("temperature", "gender")]
pt_data[-2, c(-1, -3, -5, -6)]
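The equivalence can be checked directly. This sketch rebuilds `pt_data` so that it runs on its own, and assumes the name-based form selects columns 2 and 4 (`temperature` and `gender`), which is what excluding columns 1, 3, 5, and 6 leaves:

```r
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = factor(c("MALE", "FEMALE", "MALE")),
  blood        = factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O")),
  symptoms     = factor(c("SEVERE", "MILD", "MODERATE"),
                        levels = c("MILD", "MODERATE", "SEVERE"),
                        ordered = TRUE),
  stringsAsFactors = FALSE
)

# select rows 1 and 3 by position, columns by name
by_name <- pt_data[c(1, 3), c("temperature", "gender")]

# drop row 2 and columns 1, 3, 5, 6 with negative indices
by_exclusion <- pt_data[-2, c(-1, -3, -5, -6)]

# both forms should describe the same 2 x 2 data frame
all.equal(by_name, by_exclusion)
```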
create a 2x2 matrix
m <- matrix(c(1, 2, 3, 4), nrow = 2)
m
[,1] [,2]
[1,] 1 3
[2,] 2 4
equivalent to the above
m <- matrix(c(1, 2, 3, 4), ncol = 2)
m
[,1] [,2]
[1,] 1 3
[2,] 2 4
create a 2x3 matrix
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
create a 3x2 matrix
m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
extract values from matrices
m[1, 1]
[1] 1
m[3, 2]
[1] 6
extract rows
m[1, ]
[1] 1 4
extract columns
m[, 1]
[1] 1 2 3
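All of the matrices above are filled column by column, which is R's default. A short sketch of the `byrow = TRUE` alternative, in case row-wise filling is wanted:

```r
# default: values fill down each column first
by_column <- matrix(1:6, nrow = 2)

# byrow = TRUE: values fill across each row first
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# a row-filled 2x3 matrix is the transpose of a column-filled 3x2 matrix
identical(by_row, t(matrix(1:6, nrow = 3)))   # TRUE
```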
saving, loading, and removing R data structures
show all data structures in memory
ls()
[1] "blood" "flu_status"
[3] "gender" "m"
[5] "model_table" "pt_data"
[7] "subject_name" "symptoms"
[9] "temperature" "usedcars"
remove the m and subject1 objects
rm(m, subject1)
object 'subject1' not found
ls()
[1] "blood" "flu_status"
[3] "gender" "model_table"
[5] "pt_data" "subject_name"
[7] "symptoms" "temperature"
[9] "usedcars"
rm(list=ls())
data exploration example using used car data
usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)
get structure of used car data
str(usedcars)
'data.frame': 150 obs. of 6 variables:
$ year : int 2011 2011 2011 2011 2012 2010 2011 2010 2011 2010 ...
$ model : chr "SEL" "SEL" "SEL" "SEL" ...
$ price : int 21992 20995 19995 17809 17500 17495 17000 16995 16995 16995 ...
$ mileage : int 7413 10926 7351 11613 8367 25125 27393 21026 32655 36116 ...
$ color : chr "Yellow" "Gray" "Silver" "Gray" ...
$ transmission: chr "AUTO" "AUTO" "AUTO" "AUTO" ...
summarize numeric variables
summary(usedcars$year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2000 2008 2009 2009 2010 2012
summary(usedcars[c("price", "mileage")])
price mileage
Min. : 3800 Min. : 4867
1st Qu.:10995 1st Qu.: 27200
Median :13592 Median : 36385
Mean :12962 Mean : 44261
3rd Qu.:14904 3rd Qu.: 55124
Max. :21992 Max. :151479
calculate the mean income
(36000 + 44000 + 56000) / 3
[1] 45333.33
mean(c(36000, 44000, 56000))
[1] 45333.33
the median income
median(c(36000, 44000, 56000))
[1] 44000
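The mean and median respond very differently to extreme values. A small sketch, adding one hypothetical income of 1,000,000 to the three incomes above (the fourth value is invented for illustration):

```r
incomes <- c(36000, 44000, 56000)

# hypothetical outlier appended to the original three incomes
with_outlier <- c(incomes, 1000000)

mean(with_outlier)     # 284000
median(with_outlier)   # 50000
```

The single outlier moves the mean from about 45,333 to 284,000, while the median moves only from 44,000 to 50,000.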
the min/max of used car prices
range(usedcars$price)
[1] 3800 21992
the difference of the range
diff(range(usedcars$price))
[1] 18192
IQR for used car prices
IQR(usedcars$price)
[1] 3909.5
use quantile to calculate five-number summary
quantile(usedcars$price)
0% 25% 50% 75% 100%
3800.0 10995.0 13591.5 14904.5 21992.0
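The IQR is the distance between the 25th and 75th percentiles. This sketch checks that against the quantile values printed above (typed in by hand, since `usedcars.csv` is not assumed to be available) and against a small made-up vector:

```r
# 25th and 75th percentiles of price, copied from the quantile() output
price_q1 <- 10995.0
price_q3 <- 14904.5
price_q3 - price_q1    # 3909.5, matching IQR(usedcars$price)

# the same relationship on a toy vector
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
manual_iqr <- unname(diff(quantile(x, probs = c(0.25, 0.75))))
all.equal(manual_iqr, IQR(x))
```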
the 1st and 99th percentiles
quantile(usedcars$price, probs = c(0.01, 0.99))
1% 99%
5428.69 20505.00
quintiles
quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
0% 20% 40% 60% 80% 100%
3800.0 10759.4 12993.8 13992.0 14999.0 21992.0
boxplot of used car prices and mileage
boxplot(usedcars$price, main="Boxplot of Used Car Prices",
ylab="Price ($)")
boxplot(usedcars$price ~ usedcars$transmission, main="Boxplot of Used Car Prices by Transmission",
ylab="Price ($)")
using the lattice package
lattice::bwplot(usedcars$price ~ usedcars$transmission, ylab = "Price", xlab = "Transmission", main = "Price by Transmission")
usedcars$year <- as.character(usedcars$year)
lattice::bwplot(usedcars$price ~ usedcars$transmission | usedcars$year, ylab = "Price", xlab = "Transmission", main = "Price by Transmission and Year", layout = c(5, 3))
boxplot(usedcars$mileage, main = "Boxplot of Used Car Mileage", ylab = "Odometer (mi.)")
boxplot(usedcars$mileage ~ usedcars$transmission, main = "Boxplot of Used Car Mileage by Transmission", ylab = "Odometer (mi.)")
histograms of used car prices and mileage
hist(usedcars$price, main = "Histogram of Used Car Prices", xlab = "Price ($)")
hist(usedcars$mileage, main = "Histogram of Used Car Mileage", xlab = "Odometer (mi.)")
lattice::histogram(~ usedcars$price, xlab = "Price", main = "Histogram of Price")
usedcars$year <- as.character(usedcars$year)
lattice::histogram(~ usedcars$price | usedcars$year, xlab = "Price", main = "Histogram of Price by Year", layout = c(5, 3))
lattice::histogram(~ usedcars$mileage, xlab = "Mileage", main = "Histogram of Mileage")
usedcars$year <- as.character(usedcars$year)
lattice::histogram(~ usedcars$mileage | usedcars$year, xlab = "Mileage", main = "Histogram of Mileage by Year", layout = c(5, 3))
variance and standard deviation of the used car data
var(usedcars$price)
[1] 9749892
sd(usedcars$price)
[1] 3122.482
var(usedcars$mileage)
[1] 728033954
sd(usedcars$mileage)
[1] 26982.1
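The standard deviation is the square root of the variance, and `var()` uses the sample formula with n - 1 in the denominator. A minimal sketch on the three incomes used earlier, plus a check against the price variance printed above:

```r
x <- c(36000, 44000, 56000)
n <- length(x)

# sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))        # TRUE
all.equal(sqrt(var(x)), sd(x))       # TRUE

# square root of the printed price variance; close to the printed sd of 3122.482
sqrt(9749892)
```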
one-way tables for the used car data
table(usedcars$year)
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
3 1 1 1 3 2 6 11 14 42 49 16 1
table(usedcars$model)
SE SEL SES
78 23 49
table(usedcars$color)
Black Blue Gold Gray Green Red Silver White Yellow
35 17 1 16 5 25 32 16 3
compute table proportions
model_table <- table(usedcars$model)
prop.table(model_table)
SE SEL SES
0.5200000 0.1533333 0.3266667
round the data
color_table <- table(usedcars$color)
color_pct <- prop.table(color_table) * 100
round(color_pct, digits = 1)
Black Blue Gold Gray Green Red Silver White Yellow
23.3 11.3 0.7 10.7 3.3 16.7 21.3 10.7 2.0
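`prop.table()` with no margin simply divides each count by the total. A short sketch using the model counts printed above, entered by hand so it runs without the CSV file:

```r
# counts copied from table(usedcars$model)
model_counts <- c(SE = 78, SEL = 23, SES = 49)

# each count divided by the total of 150
model_share <- model_counts / sum(model_counts)

# same result as prop.table()
all.equal(model_share, prop.table(model_counts))

# as percentages rounded to one decimal place
round(model_share * 100, digits = 1)
#   SE  SEL  SES 
# 52.0 15.3 32.7 
```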
correlation
cor(x = usedcars$mileage, y = usedcars$price)
[1] -0.8061494
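By default `cor()` returns Pearson's correlation, which is the covariance scaled by both standard deviations. A sketch on made-up mileage and price values (these five points are hypothetical, not taken from `usedcars`):

```r
# hypothetical data: mileage in thousands of miles, price in thousands of dollars
mileage <- c(10, 20, 30, 40, 50)
price   <- c(20, 18, 15, 11, 9)

# Pearson's r = cov(x, y) / (sd(x) * sd(y))
manual_r <- cov(mileage, price) / (sd(mileage) * sd(price))

all.equal(manual_r, cor(mileage, price))   # TRUE
manual_r < 0                               # TRUE: price falls as mileage rises
```

A value near -1, like the -0.806 reported for the used car data, indicates a fairly strong negative linear relationship.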
scatterplot of price vs. mileage
plot(x = usedcars$mileage, y = usedcars$price, main = "Scatterplot of Price vs. Mileage", xlab = "Used Car Odometer (mi.)", ylab = "Used Car Price ($)")
The corrgram package provides the corrgram function, which is useful for examining relationships between numeric variables.
corrgram::corrgram(usedcars, lower.panel = panel.ellipse, upper.panel = panel.pts)
new variable indicating conservative colors
usedcars$conservative <- usedcars$color %in% c("Black", "Gray", "Silver", "White")
checking our variable
table(usedcars$conservative)
FALSE TRUE
51 99
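The `%in%` operator returns one TRUE or FALSE per element of its left-hand side. A small sketch on a made-up color vector, assuming the conservative set is Black, Gray, Silver, and White; that assumption is consistent with the count of 99, since the one-way color table gives 35 + 16 + 32 + 16 for those four colors:

```r
# hypothetical sample of colors, not the full usedcars column
colors <- c("Black", "Red", "Silver", "Yellow", "White")

# assumed set of conservative colors
conservative_set <- c("Black", "Gray", "Silver", "White")

is_conservative <- colors %in% conservative_set
is_conservative          # TRUE FALSE TRUE FALSE TRUE

# logical values sum as 0/1, so sum() counts and mean() gives the proportion
sum(is_conservative)     # 3
mean(is_conservative)    # 0.6

# counts from table(usedcars$color) for the four assumed colors
sum(c(Black = 35, Gray = 16, Silver = 32, White = 16))   # 99
```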
Crosstab of conservative by model
gmodels::CrossTable(x = usedcars$model, y = usedcars$conservative)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 150
| usedcars$conservative
usedcars$model | FALSE | TRUE | Row Total |
---------------|-----------|-----------|-----------|
SE | 27 | 51 | 78 |
| 0.009 | 0.004 | |
| 0.346 | 0.654 | 0.520 |
| 0.529 | 0.515 | |
| 0.180 | 0.340 | |
---------------|-----------|-----------|-----------|
SEL | 7 | 16 | 23 |
| 0.086 | 0.044 | |
| 0.304 | 0.696 | 0.153 |
| 0.137 | 0.162 | |
| 0.047 | 0.107 | |
---------------|-----------|-----------|-----------|
SES | 17 | 32 | 49 |
| 0.007 | 0.004 | |
| 0.347 | 0.653 | 0.327 |
| 0.333 | 0.323 | |
| 0.113 | 0.213 | |
---------------|-----------|-----------|-----------|
Column Total | 51 | 99 | 150 |
| 0.340 | 0.660 | |
---------------|-----------|-----------|-----------|
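The chi-square contribution in each cell is (observed - expected)^2 / expected, where the expected count is row total times column total divided by the table total. A sketch that recomputes those contributions from the counts printed above, entered by hand so it runs without the data file:

```r
# observed counts copied from the CrossTable output
observed <- matrix(c(27, 51,
                      7, 16,
                     17, 32),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(model = c("SE", "SEL", "SES"),
                                   conservative = c("FALSE", "TRUE")))

# expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# per-cell chi-square contributions, as shown in the second line of each cell
contribution <- (observed - expected)^2 / expected
round(contribution, 3)

# their sum is the Pearson chi-square statistic for the table
sum(contribution)
chisq.test(observed)$statistic
```

The contributions are all small (the largest is about 0.086), which suggests that the share of conservative colors differs little across the three models.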