STAT 452 Homework1 Ulziibat Tserenbat February 2, 2019

Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis.

Predictive analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Predictive analytics does not tell you what will happen in the future.

In the field of computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

This book provides clear and intuitive guidance on how to implement cutting edge statistical and machine learning methods.

Chapter 2

This is an R Notebook with the code from Machine Learning with R, Lantz.

Blog post about Projects and Notebooks

Prime Hints for Running A Data Project in R

Good website for learning about R.

Quick-R

Chapter 2: Managing and Understanding Data

Libraries

library(here)
library(lattice)
library(corrgram)
library(gmodels)
here::here()
[1] "C:/Users/Harold.KCG-HAROLD/Downloads/Chap02/Chap02"

R data structures

Vectors

create vectors of data for three medical patients

subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)

access the second element in body temperature vector

temperature[2]
[1] 98.6

examples of accessing items in vector

include items in the range 2 to 3

temperature[2:3]
[1]  98.6 101.4

exclude item 2 using the minus sign

temperature[-2]
[1]  98.1 101.4

use a vector to indicate whether to include item

temperature[c(TRUE, TRUE, FALSE)]
[1] 98.1 98.6

Factors

add gender factor

gender <- factor(c("MALE", "FEMALE", "MALE"))
gender
[1] MALE   FEMALE MALE  
Levels: FEMALE MALE

add blood type factor

blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))
blood
[1] O  AB A 
Levels: A B AB O

add ordered factor

symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
                   levels = c("MILD", "MODERATE", "SEVERE"),
                   ordered = TRUE)
symptoms
[1] SEVERE   MILD     MODERATE
Levels: MILD < MODERATE < SEVERE

check for symptoms greater than moderate

symptoms > "MODERATE"
[1]  TRUE FALSE FALSE

Lists

display information for a patient

subject_name[1]
[1] "John Doe"
temperature[1]
[1] 98.1
flu_status[1]
[1] FALSE
gender[1]
[1] MALE
Levels: FEMALE MALE
blood[1]
[1] O
Levels: A B AB O
symptoms[1]
[1] SEVERE
3 Levels: MILD < ... < SEVERE

create list for a patient and display the patient

subject1 <- list(fullname = subject_name[1], 
                 temperature = temperature[1],
                 flu_status = flu_status[1],
                 gender = gender[1],
                 blood = blood[1],
                 symptoms = symptoms[1])
subject1
$`fullname`
[1] "John Doe"

$temperature
[1] 98.1

$flu_status
[1] FALSE

$gender
[1] MALE
Levels: FEMALE MALE

$blood
[1] O
Levels: A B AB O

$symptoms
[1] SEVERE
Levels: MILD < MODERATE < SEVERE

methods for accessing a list

get a single list value by position (returns a sub-list)

subject1[2]
$`temperature`
[1] 98.1

get a single list value by position (returns a numeric vector)

subject1[[2]]
[1] 98.1

get a single list value by name

subject1$temperature
[1] 98.1

get several list items by specifying a vector of names

subject1[c("temperature", "flu_status")]
$`temperature`
[1] 98.1

$flu_status
[1] FALSE

access a list like a vector: get values 2 and 3

subject1[2:3]

$temperature
[1] 98.1

$flu_status
[1] FALSE

Data frames

create a data frame from medical patient data and display the data frame

pt_data <- data.frame(subject_name, temperature, flu_status, gender,
                      blood, symptoms, stringsAsFactors = FALSE)
pt_data

accessing a data frame

get a single column

pt_data$subject_name
[1] "John Doe"     "Jane Doe"     "Steve Graves"

get several columns by specifying a vector of names

pt_data[c("temperature", "flu_status")]

this is the same as above, extracting temperature and flu_status

pt_data[2:3]

accessing by row and column

pt_data[1, 2]
[1] 98.1

accessing several rows and several columns using vectors

pt_data[c(1, 3), c(2, 4)]

Leave a row or column blank to extract all rows or columns

# column 1, all rows
pt_data[, 1]
[1] "John Doe"     "Jane Doe"     "Steve Graves"
# row 1, all columns
pt_data[1, ]
# all rows and all columns
pt_data[ , ]

the following are equivalent

pt_data[c(1, 3), c("temperature", "gender")]
pt_data[-2, c(-1, -3, -5, -6)]

Matrixes

create a 2x2 matrix

m <- matrix(c(1, 2, 3, 4), nrow = 2)
m
     [,1] [,2]
[1,]    1    3
[2,]    2    4

equivalent to the above

m <- matrix(c(1, 2, 3, 4), ncol = 2)
m
     [,1] [,2]
[1,]    1    3
[2,]    2    4

create a 2x3 matrix

m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

create a 3x2 matrix

m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

extract values from matrixes

m[1, 1]
[1] 1
m[3, 2]
[1] 6

extract rows

m[1, ]
[1] 1 4

extract columns

m[, 1]
[1] 1 2 3

Managing data with R

saving, loading, and removing R data structures

show all data structures in memory

ls()
 [1] "blood"        "flu_status"  
 [3] "gender"       "m"           
 [5] "model_table"  "pt_data"     
 [7] "subject_name" "symptoms"    
 [9] "temperature"  "usedcars"    

remove the m and subject1 objects

rm(m, subject1)
object 'subject1' not found
ls()
[1] "blood"        "flu_status"  
[3] "gender"       "model_table" 
[5] "pt_data"      "subject_name"
[7] "symptoms"     "temperature" 
[9] "usedcars"    
rm(list=ls())

Exploring and understanding data

data exploration example using used car data

usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)

get structure of used car data

str(usedcars)
'data.frame':   150 obs. of  6 variables:
 $ year        : int  2011 2011 2011 2011 2012 2010 2011 2010 2011 2010 ...
 $ model       : chr  "SEL" "SEL" "SEL" "SEL" ...
 $ price       : int  21992 20995 19995 17809 17500 17495 17000 16995 16995 16995 ...
 $ mileage     : int  7413 10926 7351 11613 8367 25125 27393 21026 32655 36116 ...
 $ color       : chr  "Yellow" "Gray" "Silver" "Gray" ...
 $ transmission: chr  "AUTO" "AUTO" "AUTO" "AUTO" ...

Exploring numeric variables

summarize numeric variables

summary(usedcars$year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2000    2008    2009    2009    2010    2012 

summary(usedcars[c("price", "mileage")])
     price          mileage      
 Min.   : 3800   Min.   :  4867  
 1st Qu.:10995   1st Qu.: 27200  
 Median :13592   Median : 36385  
 Mean   :12962   Mean   : 44261  
 3rd Qu.:14904   3rd Qu.: 55124  
 Max.   :21992   Max.   :151479  

calculate the mean income

(36000 + 44000 + 56000) / 3
[1] 45333.33
mean(c(36000, 44000, 56000))
[1] 45333.33

the median income

median(c(36000, 44000, 56000))
[1] 44000

the min/max of used car prices

range(usedcars$price)
[1]  3800 21992

the difference of the range

diff(range(usedcars$price))
[1] 18192

IQR for used car prices

IQR(usedcars$price)
[1] 3909.5

use quantile to calculate five-number summary

quantile(usedcars$price)
     0%     25%     50%     75%    100% 
 3800.0 10995.0 13591.5 14904.5 21992.0 

the 1st and 99th percentiles

quantile(usedcars$price, probs = c(0.01, 0.99))
      1%      99% 
 5428.69 20505.00 

quintiles

quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
     0%     20%     40%     60%     80%    100% 
 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0 

boxplot of used car prices and mileage

boxplot(usedcars$price, main="Boxplot of Used Car Prices",
      ylab="Price ($)")

boxplot(usedcars$price ~ usedcars$transmission, main="Boxplot of Used Car Prices by Transmission",
      ylab="Price ($)")

using the lattice package

lattice::bwplot(usedcars$price~usedcars$transmission,
   ylab="Price", xlab="Transmission",
   main="Price by Transmission")

usedcars$year <- as.character(usedcars$year)
lattice::bwplot(usedcars$price~usedcars$transmission|usedcars$year,
   ylab="Price", xlab="Transmission",
   main="Price by Transmission and Year", layout=(c(5,3)))

boxplot(usedcars$mileage, main="Boxplot of Used Car Mileage",
      ylab="Odometer (mi.)")

boxplot(usedcars$mileage ~ usedcars$transmission, main="Boxplot of Used Car Mileage by Transmission", ylab="Odometer (mi.)")

histograms of used car prices and mileage

hist(usedcars$price, main = "Histogram of Used Car Prices",
     xlab = "Price ($)")

hist(usedcars$mileage, main = "Histogram of Used Car Mileage",
     xlab = "Odometer (mi.)")

lattice::histogram(~ usedcars$price,
   xlab="Price",
   main="Distribution of Price")

usedcars$year <- as.character(usedcars$year)
lattice::histogram(~ usedcars$price | usedcars$year,
   ylab="Percent of Total", xlab="Price",
   main="Distribution of Price by Year", layout=(c(5,3)))

lattice::histogram(~ usedcars$mileage,
   xlab="Mileage",
   main="Distribution of Mileage")

usedcars$year <- as.character(usedcars$year)
lattice::histogram(~ usedcars$mileage | usedcars$year,
   xlab="Mileage",
   main="Distribution of Mileage by Year", layout=(c(5,3)))

variance and standard deviation of the used car data

var(usedcars$price)
[1] 9749892
sd(usedcars$price)
[1] 3122.482
var(usedcars$mileage)
[1] 728033954
sd(usedcars$mileage)
[1] 26982.1

Exploring categorical variables

one-way tables for the used car data

table(usedcars$year)

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
   3    1    1    1    3    2    6   11   14   42   49   16    1 
table(usedcars$model)

 SE SEL SES 
 78  23  49 
table(usedcars$color)

 Black   Blue   Gold   Gray  Green    Red Silver  White Yellow 
    35     17      1     16      5     25     32     16      3 

compute table proportions

model_table <- table(usedcars$model)
prop.table(model_table)

       SE       SEL       SES 
0.5200000 0.1533333 0.3266667 

round the data

color_table <- table(usedcars$color)
color_pct <- prop.table(color_table) * 100
round(color_pct, digits = 1)

 Black   Blue   Gold   Gray  Green    Red Silver  White Yellow 
  23.3   11.3    0.7   10.7    3.3   16.7   21.3   10.7    2.0 

Exploring relationships between variables

correlation

cor(x = usedcars$mileage, y = usedcars$price)
[1] -0.8061494

scatterplot of price vs. mileage

plot(x = usedcars$mileage, y = usedcars$price,
     main = "Scatterplot of Price vs. Mileage",
     xlab = "Used Car Odometer (mi.)",
     ylab = "Used Car Price ($)")

The corrgram package has the corrgram function that is nice for looking at relationships between numeric variables.

corrgram::corrgram(usedcars,lower.panel=panel.ellipse,
  upper.panel=panel.pts)

new variable indicating conservative colors

usedcars$conservative <-
  usedcars$color %in% c("Black", "Gray", "Silver", "White")

checking our variable

table(usedcars$conservative)

FALSE  TRUE 
   51    99 

Crosstab of conservative by model

gmodels::CrossTable(x = usedcars$model, y = usedcars$conservative)


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  150 

 
               | usedcars$conservative 
usedcars$model |     FALSE |      TRUE | Row Total | 
---------------|-----------|-----------|-----------|
            SE |        27 |        51 |        78 | 
               |     0.009 |     0.004 |           | 
               |     0.346 |     0.654 |     0.520 | 
               |     0.529 |     0.515 |           | 
               |     0.180 |     0.340 |           | 
---------------|-----------|-----------|-----------|
           SEL |         7 |        16 |        23 | 
               |     0.086 |     0.044 |           | 
               |     0.304 |     0.696 |     0.153 | 
               |     0.137 |     0.162 |           | 
               |     0.047 |     0.107 |           | 
---------------|-----------|-----------|-----------|
           SES |        17 |        32 |        49 | 
               |     0.007 |     0.004 |           | 
               |     0.347 |     0.653 |     0.327 | 
               |     0.333 |     0.323 |           | 
               |     0.113 |     0.213 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        51 |        99 |       150 | 
               |     0.340 |     0.660 |           | 
---------------|-----------|-----------|-----------|

 
---
title: "Homework 1"
output:
  html_notebook: default
  html_document:
    df_print: paged
  word_document: default
  pdf_document: default
---
STAT 452 Homework1
Ulziibat Tserenbat
February 2, 2019 


Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis.

Predictive analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Predictive analytics does not tell you what will happen in the future.

In the field of computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

This book provides clear and intuitive guidance on how to implement cutting edge statistical and machine learning methods.

# Chapter 2

This is an R Notebook with the code from Machine Learning with R, Lantz.


## Blog post about Projects and Notebooks 

[Prime Hints for Running A Data Project in R](https://kkulma.github.io/2018-03-18-Prime-Hints-for-Running-a-data-project-in-R/)

## Good website for learning about R.

[Quick-R](https://www.statmethods.net/)

# Chapter 2: Managing and Understanding Data

**Libraries**

```{r}
library(here)
library(lattice)
library(corrgram)
library(gmodels)
```

```{r}
here::here()
```


## R data structures 

**Vectors**

create vectors of data for three medical patients

```{r}
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)
```

access the second element in body temperature vector

```{r}
temperature[2]
```

examples of accessing items in vector

include items in the range 2 to 3

```{r}
temperature[2:3]
```

exclude item 2 using the minus sign

```{r}
temperature[-2]
```

use a vector to indicate whether to include item

```{r}
temperature[c(TRUE, TRUE, FALSE)]
```
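The logical vector used for indexing does not have to be typed by hand; it can come from a comparison. A minimal sketch, assuming a small named vector `temps_demo` that is made up for illustration (it is not the patient data above):

```{r}
# hypothetical readings with element names (assumed for illustration)
temps_demo <- c(a = 98.1, b = 98.6, c = 101.4)

# a comparison returns a logical vector, one value per element
fever <- temps_demo > 100

# keep only the elements where the condition is TRUE
temps_demo[fever]

# which() gives the positions of the TRUE elements
which(fever)

# named elements can also be selected by name
temps_demo[c("a", "c")]
```

Building the logical vector from a condition keeps the selection correct when the data change, which a hard-coded `c(TRUE, TRUE, FALSE)` would not.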

## Factors 

add gender factor

```{r}
gender <- factor(c("MALE", "FEMALE", "MALE"))
gender
```

add blood type factor

```{r}

blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))
blood
```

add ordered factor

```{r}
symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
                   levels = c("MILD", "MODERATE", "SEVERE"),
                   ordered = TRUE)
symptoms
```

check for symptoms greater than moderate

```{r}
symptoms > "MODERATE"
```
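An ordered factor stores each value as an integer code that follows the order of its levels, which is why comparisons such as `>` work. A small sketch, using a separate variable `severity_demo` (a name assumed here so the `symptoms` object above is not overwritten):

```{r}
# same levels as the symptoms factor, stored under a hypothetical name
severity_demo <- factor(c("SEVERE", "MILD", "MODERATE"),
                        levels = c("MILD", "MODERATE", "SEVERE"),
                        ordered = TRUE)

# the levels in their declared order
levels(severity_demo)

# the integer codes behind the labels
as.integer(severity_demo)

# comparisons use the level order, not alphabetical order
severity_demo >= "MODERATE"

# max() is defined for ordered factors
max(severity_demo)
```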

## Lists 

display information for a patient

```{r}
subject_name[1]
temperature[1]
flu_status[1]
gender[1]
blood[1]
symptoms[1]

```

create list for a patient and display the patient

```{r}

subject1 <- list(fullname = subject_name[1], 
                 temperature = temperature[1],
                 flu_status = flu_status[1],
                 gender = gender[1],
                 blood = blood[1],
                 symptoms = symptoms[1])
subject1

```

methods for accessing a list

get a single list value by position (returns a sub-list)

```{r}
subject1[2]

```

get a single list value by position (returns a numeric vector)

```{r}
subject1[[2]]
```

get a single list value by name

```{r}
subject1$temperature
```

get several list items by specifying a vector of names

```{r}
subject1[c("temperature", "flu_status")]
```

access a list like a vector: get values 2 and 3

```{r}
subject1[2:3]
```
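The difference between `[ ]` and `[[ ]]` matters as soon as the extracted value is used in a calculation. A short sketch on a made-up list `patient_demo` (an assumed name, separate from `subject1`):

```{r}
# hypothetical patient record for illustration
patient_demo <- list(fullname = "John Doe",
                     temperature = 98.1,
                     flu_status = FALSE)

# single brackets return a list of length one
class(patient_demo["temperature"])

# double brackets return the element itself
class(patient_demo[["temperature"]])

# arithmetic works on the extracted element
patient_demo[["temperature"]] + 1

# patient_demo["temperature"] + 1 would stop with an error,
# because a list is not a numeric value
```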

## Data frames 

create a data frame from medical patient data and display the data frame

```{r}
pt_data <- data.frame(subject_name, temperature, flu_status, gender,
                      blood, symptoms, stringsAsFactors = FALSE)
pt_data
```

accessing a data frame

get a single column

```{r}
pt_data$subject_name

```

get several columns by specifying a vector of names

```{r}
pt_data[c("temperature", "flu_status")]
```

this is the same as above, extracting temperature and flu_status

```{r}
pt_data[2:3]
```

accessing by row and column

```{r}
pt_data[1, 2]
```

accessing several rows and several columns using vectors

```{r}
pt_data[c(1, 3), c(2, 4)]
```

Leave a row or column blank to extract all rows or columns

```{r}
# column 1, all rows
pt_data[, 1]
# row 1, all columns
pt_data[1, ]
# all rows and all columns
pt_data[ , ]
```

the following are equivalent

```{r}
pt_data[c(1, 3), c("temperature", "gender")]
pt_data[-2, c(-1, -3, -5, -6)]
```
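Selecting a single column with `[ , ]` returns a plain vector unless `drop = FALSE` is given, and rows can be filtered with a logical condition. A minimal sketch on a made-up data frame `df_demo` (names and values are assumptions for illustration):

```{r}
# hypothetical data frame, not pt_data
df_demo <- data.frame(name = c("A", "B", "C"),
                      temp = c(98.1, 98.6, 101.4),
                      stringsAsFactors = FALSE)

# one column by name: the result is a numeric vector
df_demo[, "temp"]

# drop = FALSE keeps the result as a one-column data frame
df_demo[, "temp", drop = FALSE]

# rows where the condition holds, all columns
df_demo[df_demo$temp > 98.5, ]
```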

## Matrixes 

create a 2x2 matrix

```{r}
m <- matrix(c(1, 2, 3, 4), nrow = 2)
m
```

equivalent to the above

```{r}
m <- matrix(c(1, 2, 3, 4), ncol = 2)
m
```

create a 2x3 matrix

```{r}
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m
```

create a 3x2 matrix

```{r}
m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
m
```

extract values from matrixes

```{r}
m[1, 1]
m[3, 2]
```

extract rows

```{r}
m[1, ]
```

extract columns

```{r}
m[, 1]
```
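The matrices above are filled column by column, which is the default in R. A short sketch of the `byrow` argument of `matrix()`, using the assumed names `m_col` and `m_row` so that `m` is left unchanged:

```{r}
# default: values fill the first column, then the second, and so on
m_col <- matrix(1:6, nrow = 2)

# byrow = TRUE: values fill the first row, then the second
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)

m_col
m_row

# dimensions as (rows, columns)
dim(m_row)

# t() transposes, turning the 2x3 matrix into a 3x2 matrix
t(m_col)
```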

## Managing data with R 

saving, loading, and removing R data structures

show all data structures in memory

```{r}
ls()
```

remove the m and subject1 objects

```{r}
rm(m, subject1)
ls()
```

```{r}
rm(list=ls())
```
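The chunks above only remove objects. Saving and loading could look roughly like the sketch below, which uses base R `save()` and `load()`; the object names and the temporary file path are assumptions made for illustration.

```{r}
# hypothetical objects to save
demo_x <- c(1, 2, 3)
demo_y <- c("a", "b")

# a temporary file stands in for a real .RData path
demo_path <- tempfile(fileext = ".RData")

# write both objects to the file
save(demo_x, demo_y, file = demo_path)

# remove them from memory
rm(demo_x, demo_y)

# load() restores the objects under their original names
load(demo_path)
demo_x
demo_y

# delete the temporary file
unlink(demo_path)
```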

## Exploring and understanding data 

data exploration example using used car data

```{r}
usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)
```

get structure of used car data

```{r}
str(usedcars)
```

## Exploring numeric variables 

summarize numeric variables

```{r}
summary(usedcars$year)
summary(usedcars[c("price", "mileage")])
```

calculate the mean income

```{r}
(36000 + 44000 + 56000) / 3
mean(c(36000, 44000, 56000))
```

the median income

```{r}
median(c(36000, 44000, 56000))
```
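The mean and median can diverge sharply when one value is extreme. A small sketch that adds a made-up fourth income of 1,000,000 to the three incomes above:

```{r}
incomes_demo <- c(36000, 44000, 56000)

# hypothetical extreme value added for illustration
incomes_outlier <- c(incomes_demo, 1000000)

# the mean is pulled toward the extreme value
mean(incomes_outlier)    # 284000

# the median moves only slightly
median(incomes_outlier)  # 50000
```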

the min/max of used car prices

```{r}
range(usedcars$price)
```

the difference of the range

```{r}
diff(range(usedcars$price))
```

IQR for used car prices

```{r}
IQR(usedcars$price)
```
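The IQR is the distance between the first and third quartiles, and a common rule of thumb flags values more than 1.5 times the IQR beyond those quartiles. A sketch on made-up prices (`prices_demo` is an assumption, not the used car data); note that `boxplot()` computes its whiskers from hinges, which can differ slightly from `quantile()`:

```{r}
# hypothetical prices for illustration
prices_demo <- c(3800, 10995, 12500, 13592, 14904, 15500, 21992)

# first and third quartiles
q_demo <- quantile(prices_demo, probs = c(0.25, 0.75))

# IQR by hand matches the built-in function
iqr_demo <- unname(q_demo[2] - q_demo[1])
all.equal(iqr_demo, IQR(prices_demo))

# 1.5 x IQR fences
lower_demo <- q_demo[1] - 1.5 * iqr_demo
upper_demo <- q_demo[2] + 1.5 * iqr_demo

# values outside the fences
prices_demo[prices_demo < lower_demo | prices_demo > upper_demo]
```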

use quantile to calculate five-number summary

```{r}
quantile(usedcars$price)

```

the 1st and 99th percentiles

```{r}
quantile(usedcars$price, probs = c(0.01, 0.99))
```

quintiles

```{r}
quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
```

boxplot of used car prices and mileage

```{r}
boxplot(usedcars$price, main="Boxplot of Used Car Prices",
      ylab="Price ($)")
boxplot(usedcars$price ~ usedcars$transmission, main="Boxplot of Used Car Prices by Transmission",
      ylab="Price ($)")

```

using the lattice package

```{r}
lattice::bwplot(usedcars$price~usedcars$transmission,
   ylab="Price", xlab="Transmission",
   main="Price by Transmission")
```


```{r}
usedcars$year <- as.character(usedcars$year)

lattice::bwplot(usedcars$price~usedcars$transmission|usedcars$year,
   ylab="Price", xlab="Transmission",
   main="Price by Transmission and Year", layout=(c(5,3)))
```


```{r}
boxplot(usedcars$mileage, main="Boxplot of Used Car Mileage",
      ylab="Odometer (mi.)")

boxplot(usedcars$mileage ~ usedcars$transmission, main="Boxplot of Used Car Mileage by Transmission", ylab="Odometer (mi.)")
```



histograms of used car prices and mileage

```{r}

hist(usedcars$price, main = "Histogram of Used Car Prices",
     xlab = "Price ($)")

hist(usedcars$mileage, main = "Histogram of Used Car Mileage",
     xlab = "Odometer (mi.)")
```

```{r}
lattice::histogram(~ usedcars$price,
   xlab="Price",
   main="Distribution of Price")
```


```{r}

usedcars$year <- as.character(usedcars$year)

lattice::histogram(~ usedcars$price | usedcars$year,
   ylab="Percent of Total", xlab="Price",
   main="Distribution of Price by Year", layout=(c(5,3)))
```

```{r}
lattice::histogram(~ usedcars$mileage,
   xlab="Mileage",
   main="Distribution of Mileage")
```


```{r}

usedcars$year <- as.character(usedcars$year)

lattice::histogram(~ usedcars$mileage | usedcars$year,
   xlab="Mileage",
   main="Distribution of Mileage by Year", layout=(c(5,3)))
```

variance and standard deviation of the used car data

```{r}
var(usedcars$price)
sd(usedcars$price)
var(usedcars$mileage)
sd(usedcars$mileage)
```
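`var()` in R uses the sample formula with an `n - 1` denominator, and `sd()` is its square root. A quick check on made-up numbers (`x_demo` is an assumed example vector):

```{r}
# hypothetical values for illustration
x_demo <- c(2, 4, 4, 4, 5, 5, 7, 9)
n_demo <- length(x_demo)

# sample variance: squared deviations from the mean, divided by n - 1
var_manual <- sum((x_demo - mean(x_demo))^2) / (n_demo - 1)

# both comparisons should be TRUE
all.equal(var_manual, var(x_demo))
all.equal(sqrt(var_manual), sd(x_demo))
```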

## Exploring categorical variables

one-way tables for the used car data

```{r}

table(usedcars$year)
table(usedcars$model)
table(usedcars$color)
```

compute table proportions

```{r}
model_table <- table(usedcars$model)
prop.table(model_table)
```

round the data

```{r}
color_table <- table(usedcars$color)
color_pct <- prop.table(color_table) * 100
round(color_pct, digits = 1)
```
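The same table can be used to sort categories by share and to find the most common value (the mode). A small sketch on made-up colors (`colors_demo` is an assumption, not the used car data):

```{r}
# hypothetical color values for illustration
colors_demo <- c("Black", "Red", "Black", "Silver", "Black", "Red")

# counts per category
tab_demo <- table(colors_demo)

# percentages rounded to one decimal place
pct_demo <- round(prop.table(tab_demo) * 100, 1)

# largest share first
sort(pct_demo, decreasing = TRUE)

# the mode: the category (or categories) with the highest count
names(tab_demo)[tab_demo == max(tab_demo)]
```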

## Exploring relationships between variables

correlation

```{r}
cor(x = usedcars$mileage, y = usedcars$price)
```
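Pearson's correlation is the covariance of the two variables divided by the product of their standard deviations. A sketch that checks this against `cor()` on made-up mileage and price values (both vectors are assumptions for illustration):

```{r}
# hypothetical values: price falls as mileage rises
mileage_demo <- c(10000, 30000, 50000, 80000, 120000)
price_demo   <- c(21000, 17500, 14000, 11000, 6500)

# correlation from its definition
r_manual <- cov(mileage_demo, price_demo) /
  (sd(mileage_demo) * sd(price_demo))

# should agree with the built-in function
all.equal(r_manual, cor(mileage_demo, price_demo))

# a negative value indicates that price decreases with mileage
r_manual
```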



scatterplot of price vs. mileage

```{r}
plot(x = usedcars$mileage, y = usedcars$price,
     main = "Scatterplot of Price vs. Mileage",
     xlab = "Used Car Odometer (mi.)",
     ylab = "Used Car Price ($)")
```

The corrgram package has the corrgram function that is nice for looking at relationships between numeric variables.

```{r}
corrgram::corrgram(usedcars,lower.panel=panel.ellipse,
  upper.panel=panel.pts)
```


new variable indicating conservative colors

```{r}
usedcars$conservative <-
  usedcars$color %in% c("Black", "Gray", "Silver", "White")
```

checking our variable

```{r}
table(usedcars$conservative)
```

Crosstab of conservative by model

```{r}
gmodels::CrossTable(x = usedcars$model, y = usedcars$conservative)
```
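A similar two-way table can be built with base R alone, which may help when reading the `CrossTable()` output. A sketch on made-up data (`model_demo` and `color_demo` are assumptions, not the used car data); it shows counts, row proportions, and totals, but not the chi-square contributions:

```{r}
# hypothetical models and colors for illustration
model_demo <- c("SE", "SE", "SEL", "SES", "SE", "SES", "SEL", "SE")
color_demo <- c("Black", "Red", "Gray", "Blue",
                "White", "Silver", "Yellow", "Green")

# TRUE when the color is in the conservative set
conservative_demo <- color_demo %in% c("Black", "Gray", "Silver", "White")

# two-way table of counts
xt_demo <- table(model = model_demo, conservative = conservative_demo)
xt_demo

# row proportions (each row sums to 1)
round(prop.table(xt_demo, margin = 1), 3)

# counts with row and column totals
addmargins(xt_demo)
```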










