Exercises

3.1 Exercise

Randomly sample a training data set that contains 80% of the original data points.

Sol:

#splitting 80:20
set.seed(13960406)
train_values <- sample(nrow(iris), .80*nrow(iris))
iris_train <- iris[train_values,]

Let us check the dimensions of the sample:

dim(iris)
## [1] 150   5
dim(iris_train)
## [1] 120   5

As the output shows, the training data set contains 120 of the 150 rows, i.e. 80% of the original data.
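The rows that were not sampled can serve as a test set. A minimal sketch, assuming the same seed as above; the name `iris_test` is introduced here for illustration and is not part of the exercise:

```r
# Reproduce the 80% training sample, then keep the remaining rows for testing
set.seed(13960406)
train_values <- sample(nrow(iris), 0.80 * nrow(iris))
iris_train <- iris[train_values, ]
iris_test <- iris[-train_values, ]   # rows not drawn for training

# The two parts should be disjoint and together cover all rows
nrow(iris_train) + nrow(iris_test)
length(intersect(rownames(iris_train), rownames(iris_test)))
```

Negative indexing drops the sampled rows, so no observation ends up in both sets.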

3.2 Exercise

Economic Order Quantity Model: D = 5000 (annual demand quantity), K = $4 (fixed cost per order), h = $0.5 (holding cost per unit). Q = ?

Sol:

#annual demand quantity
D <- 5000
#fixed cost per order
K <- 4
#holding cost per unit
h <- 0.5
#Calculate Economic Order Quantity Model
Q <- sqrt((2*D*K)/h)
Q
## [1] 282.8427

The value of Q is calculated to be 282.8427.
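The closed form comes from minimizing the annual ordering cost plus holding cost, K*D/Q + h*Q/2, which is the standard EOQ cost function. As a sanity check, the sketch below minimizes that cost numerically; `total_cost` is a name chosen for this illustration, and the search interval c(1, 5000) is an assumption:

```r
D <- 5000   # annual demand quantity
K <- 4      # fixed cost per order
h <- 0.5    # holding cost per unit

# Annual cost of ordering in batches of size Q
total_cost <- function(Q) K * D / Q + h * Q / 2

# Numerical minimum over an assumed search interval
opt <- optimize(total_cost, interval = c(1, 5000))
opt$minimum            # should be close to the closed-form value
sqrt(2 * D * K / h)    # closed-form EOQ
```

Both values should agree up to the numerical tolerance of `optimize()`.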

3.3 Exercise

1. Define a vector with values (5, 2, 11, 19, 3, -9, 8, 20, 1). Calculate the sum, mean, and standard deviation.

Sol:

vector <- c(5, 2, 11, 19, 3, -9, 8, 20, 1)
#Sum (stored under a name that does not mask the function sum)
vec_sum <- sum(vector)
vec_sum
## [1] 60

The sum of the vector is 60.

#mean
vec_mean <- mean(vector)
vec_mean
## [1] 6.666667

The mean of the vector is 6.666667.

#Standard deviation
vec_sd <- sd(vector)
vec_sd
## [1] 9.124144

The standard deviation of the vector is 9.124144.

2. Reorder the vector from largest to smallest, and store it as a new vector.

Sol:

#Reordering vector
v_ordered <- vector[order(vector,decreasing = TRUE)]
v_ordered
## [1] 20 19 11  8  5  3  2  1 -9

The vector has been reordered from largest to smallest.
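`sort()` gives the same result more directly. A small sketch, where `v` and `v_sorted` are illustrative names standing in for the vector defined above:

```r
v <- c(5, 2, 11, 19, 3, -9, 8, 20, 1)

# sort() returns the sorted values; order() returns the index permutation
v_sorted <- sort(v, decreasing = TRUE)
identical(v_sorted, v[order(v, decreasing = TRUE)])   # TRUE
```

`order()` is still the better choice when the same permutation has to be applied to other vectors or to the rows of a data frame.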

3. Convert the vector to a 3x3 matrix ordered by column. What is the sum of the first column? What is the number in column 2, row 3? What are the column sums?

Sol:

#Creating Matrix
mat <- matrix(data = vector, nrow = 3, ncol = 3)
mat
##      [,1] [,2] [,3]
## [1,]    5   19    8
## [2,]    2    3   20
## [3,]   11   -9    1

The vector has been converted into a 3x3 matrix, filled column by column.

#sum of the first column
sum_1 <- sum(mat[,1])
sum_1
## [1] 18

The sum of the first column is 18.

mat[3,2]
## [1] -9

The number in column 2 row 3 is -9.

colSums(mat)
## [1] 18 13 29

The column sums are 18 for column 1, 13 for column 2, and 29 for column 3.
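`matrix()` fills column by column by default, which is what "ordered by column" asks for. For contrast, a small sketch with `byrow = TRUE`; `v`, `mat_col`, and `mat_row` are illustrative names:

```r
v <- c(5, 2, 11, 19, 3, -9, 8, 20, 1)

mat_col <- matrix(v, nrow = 3)                 # default: fill column by column
mat_row <- matrix(v, nrow = 3, byrow = TRUE)   # fill row by row instead

colSums(mat_col)   # 18 13 29
rowSums(mat_row)   # 18 13 29, because mat_row is the transpose of mat_col
```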

4. Use the following code to load the CustomerData into R.

Sol:

Let us load the data:

#loading customerData
customer <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/CustomerData.csv")

Let us check the dimensions of the data:

#dimensions of data
dim(customer)
## [1] 5000   59

There are 5000 rows and 59 columns in the dataset.

Let us display all variable names of the dataset:

#variablenames
names(customer)
##  [1] "CustomerID"          "Region"              "TownSize"           
##  [4] "Gender"              "Age"                 "EducationYears"     
##  [7] "JobCategory"         "UnionMember"         "EmploymentLength"   
## [10] "Retired"             "HHIncome"            "DebtToIncomeRatio"  
## [13] "CreditDebt"          "OtherDebt"           "LoanDefault"        
## [16] "MaritalStatus"       "HouseholdSize"       "NumberPets"         
## [19] "NumberCats"          "NumberDogs"          "NumberBirds"        
## [22] "HomeOwner"           "CarsOwned"           "CarOwnership"       
## [25] "CarBrand"            "CarValue"            "CommuteTime"        
## [28] "PoliticalPartyMem"   "Votes"               "CreditCard"         
## [31] "CardTenure"          "CardItemsMonthly"    "CardSpendMonth"     
## [34] "ActiveLifestyle"     "PhoneCoTenure"       "VoiceLastMonth"     
## [37] "VoiceOverTenure"     "EquipmentRental"     "EquipmentLastMonth" 
## [40] "EquipmentOverTenure" "CallingCard"         "WirelessData"       
## [43] "DataLastMonth"       "DataOverTenure"      "Multiline"          
## [46] "VM"                  "Pager"               "Internet"           
## [49] "CallerID"            "CallWait"            "CallForward"        
## [52] "ThreeWayCalling"     "EBilling"            "TVWatchingHours"    
## [55] "OwnsPC"              "OwnsMobileDevice"    "OwnsGameSystem"     
## [58] "OwnsFax"             "NewsSubscriber"

Let us calculate the average “Debt to Income Ratio”:

#average DebtToIncomeRatio
mean(customer$DebtToIncomeRatio)
## [1] 9.95416

The average “Debt to Income Ratio” is 9.95416.

Let us find the proportion of “Married” customers, expressed as a percentage:

#proportion of married customers
mean(customer$MaritalStatus == "Married")*100
## [1] 48.02

The proportion of “Married” customers is 48.02%.

3.4 Exercise

1. Do you think 1 + 1/(2^2) + 1/(3^2) + … + 1/(n^2) converges or diverges as n→∞? Use R to verify your answer.

#partial sum of the series 1 + 1/2^2 + ... + 1/n^2
Series = function(n){
  z <- 0
  for(i in seq_len(n)){
    z <- z + 1/(i*i)
  }
  z
}
Series(100)
## [1] 1.634984
Series(10000)
## [1] 1.644834
Series(25000)
## [1] 1.644894
Series(30000)
## [1] 1.644901

The output shows that the partial sums keep increasing, but by smaller and smaller amounts: going from n = 10000 to n = 30000 changes the sum by less than 0.0001. This numerical evidence suggests that the series converges as n → ∞; the known limit is π²/6 ≈ 1.644934.
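The same partial sums can be computed without a loop and compared with π²/6. A minimal sketch; `series_sum`, `n_values`, and `partial` are names chosen for this illustration:

```r
# Vectorized partial sum of 1/k^2 for k = 1..n
series_sum <- function(n) sum(1 / seq_len(n)^2)

n_values <- c(100, 10000, 1000000)
partial <- sapply(n_values, series_sum)
partial

# Gap to the known limit; it shrinks roughly like 1/n
pi^2 / 6 - partial
```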

2. Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13,… What is the next number? What is the 50th number? Create a vector of the first 30 Fibonacci numbers.

Sol:

Let us generate the first 50 Fibonacci numbers:

#fibonacci series
x <- c(1,1)
for(i in 3:50){
  x[i] <- x[(i-1)] + x[(i-2)]
}
x[8]
## [1] 21

The next number in the sequence (the 8th) is 21.

Now check the 50th number in the sequence:

x[50]
## [1] 12586269025

The 50th number in the sequence is 12586269025.

Let us create a vector of the first 30 Fibonacci numbers:
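The 50th number is larger than R's maximum integer value, so it matters that the vector is stored as double. Doubles represent whole numbers exactly up to 2^53, so the value is still exact. A short check, written as a self-contained sketch:

```r
x <- c(1, 1)
for (i in 3:50) x[i] <- x[i - 1] + x[i - 2]

typeof(x)                        # "double", because c(1, 1) is double
x[50] > .Machine$integer.max     # TRUE: would not fit in an R integer
x[50] < 2^53                     # TRUE: still exactly representable
format(x[50], big.mark = ",")    # print with thousands separators
```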

x1 <- x[1:30]
x1
##  [1]      1      1      2      3      5      8     13     21     34     55
## [11]     89    144    233    377    610    987   1597   2584   4181   6765
## [21]  10946  17711  28657  46368  75025 121393 196418 317811 514229 832040

3. Write a function that can either calculate the sum of the series in Question 1 or generate and print the Fibonacci sequence in Question 2.

Sol:

Here is a function that generates and prints the Fibonacci sequence:

#function to generate the first n Fibonacci numbers (n >= 1)
fibonacci = function(n){
  y <- c(1,1)
  if(n > 2){
    for(i in 3:n){
      y[i] <- y[(i-1)] + y[(i-2)]
    }
  }
  y[seq_len(n)]
}
fibonacci(30)
##  [1]      1      1      2      3      5      8     13     21     34     55
## [11]     89    144    233    377    610    987   1597   2584   4181   6765
## [21]  10946  17711  28657  46368  75025 121393 196418 317811 514229 832040
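The exercise asks for one function that can do either task. One possible way to combine them is a `type` argument; this is a sketch only, and the function name `series_or_fib` and its argument names are assumptions made for illustration:

```r
# type = "series":    return 1 + 1/2^2 + ... + 1/n^2
# type = "fibonacci": print and return the first n Fibonacci numbers
series_or_fib <- function(n, type = c("series", "fibonacci")) {
  type <- match.arg(type)
  if (type == "series") {
    return(sum(1 / seq_len(n)^2))
  }
  y <- c(1, 1)
  if (n > 2) {
    for (i in 3:n) y[i] <- y[i - 1] + y[i - 2]
  }
  y <- y[seq_len(n)]   # also handles n = 1 and n = 2
  print(y)
  invisible(y)
}

series_or_fib(100, "series")
series_or_fib(10, "fibonacci")
```

`match.arg()` restricts `type` to the two listed values and defaults to the first one.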

Lab Replicas

Introduction to Data Mining

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
dim(iris)
## [1] 150   5
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
class(iris[,1])
## [1] "numeric"
class(iris[,5])
## [1] "factor"
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
#Split the data by species (rbind() around a single data frame is not needed)
setosa <- iris[iris$Species=="setosa",]
versicolor <- iris[iris$Species=="versicolor",]
virginica <- iris[iris$Species=="virginica",]
ind <- 1:30
iris_train <- rbind(setosa[ind,], versicolor[ind,], virginica[ind,])
iris_test <- rbind(setosa[-ind,], versicolor[-ind,], virginica[-ind,])
#Training the model
library(class)
knn_iris <- knn(train = iris_train[, -5], test = iris_test[, -5], cl=iris_train[,5], k=5)
#Prediction Accuracy
table(iris_test[,5], knn_iris, dnn = c("True", "Predicted"))
##             Predicted
## True         setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         19         1
##   virginica       0          0        20
sum(iris_test[,5] != knn_iris)
## [1] 1
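The confusion matrix above corresponds to 59 correct predictions out of 60. The accuracy can also be computed directly, as in the sketch below. It rebuilds the same split so that it runs on its own; the seed value and the names `by_species`, `pred`, and `accuracy` are assumptions for illustration:

```r
library(class)

set.seed(1)   # knn() breaks ties at random; the seed value is arbitrary

# First 30 rows of each species for training, the remaining 20 for testing
ind <- 1:30
by_species <- split(iris, iris$Species)
iris_train <- do.call(rbind, lapply(by_species, function(d) d[ind, ]))
iris_test  <- do.call(rbind, lapply(by_species, function(d) d[-ind, ]))

pred <- knn(train = iris_train[, -5], test = iris_test[, -5],
            cl = iris_train[, 5], k = 5)

# Share of test rows whose predicted species matches the true species
accuracy <- mean(pred == iris_test[, 5])
accuracy
```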
#K-means Clustering

library(fpc)
fit <- kmeans(iris[,1:4], 5)
plotcluster(iris[,1:4], fit$cluster)

fit2 <- kmeans(iris[,1:4], 3)
plotcluster(iris[,1:4], fit2$cluster)
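`kmeans()` starts from random centers, so repeated runs can give different clusterings. A hedged sketch that fixes the seed, uses several random starts, and compares the clusters with the known species; the seed, `nstart = 20`, and the name `fit3` are choices made for this illustration:

```r
set.seed(123)   # arbitrary seed for reproducibility

# nstart = 20 runs 20 random initializations and keeps the best one
fit3 <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Cross-tabulate cluster labels against species
table(fit3$cluster, iris$Species)

# Total within-cluster sum of squares (lower means tighter clusters)
fit3$tot.withinss
```

Cluster numbers are arbitrary labels, so only the pattern of the table matters, not which cluster is called 1, 2, or 3.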

#Hierarchical Clustering
hc_result <- hclust(dist(iris[,1:4]))
plot(hc_result)
#Cut Dendrogram into 3 Clusters
rect.hclust(hc_result, k=3)
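`rect.hclust()` only draws the three groups on the dendrogram. To obtain the cluster membership of each observation, `cutree()` can be used. A small self-contained sketch; the name `groups` is illustrative:

```r
hc_result <- hclust(dist(iris[, 1:4]))

# Assign each of the 150 observations to one of 3 clusters
groups <- cutree(hc_result, k = 3)

# Compare the clusters with the known species
table(groups, iris$Species)
```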

Introduction to R

x <- 10
y = 5
x+y
## [1] 15
log(x)
## [1] 2.302585
exp(y)
## [1] 148.4132
cos(x)
## [1] -0.8390715
x == y
## [1] FALSE
x > y
## [1] TRUE
#Data Structure
# Define numerical vector z
z<- c(3,5,7,9)
# Define character vector zz, where numerical operations cannot be directly applied.
zz<- c("cup", "plate", "pen", "paper")
#Average
mean(z)
## [1] 6
#Standard deviation
sd(z)
## [1] 2.581989
#Median
median(z)
## [1] 6
#Max
max(z)
## [1] 9
#Min
min(z)
## [1] 3
#Summary Stats
summary(z)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     4.5     6.0     6.0     7.5     9.0
z
## [1] 3 5 7 9
z+2
## [1]  5  7  9 11
z/10
## [1] 0.3 0.5 0.7 0.9
# define vector z1
z1 <- c(2,4,6,8)
# Elementwise operations (use vectors of the same length; otherwise R recycles the shorter one)
z+z1
## [1]  5  9 13 17
z*z1
## [1]  6 20 42 72
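When the lengths differ, R recycles the shorter vector instead of stopping. A short sketch of that behavior; `z3` is a hypothetical vector introduced only for this example:

```r
z  <- c(3, 5, 7, 9)
z3 <- c(10, 20)

# z3 is reused as 10, 20, 10, 20
z + z3   # 13 25 17 29
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.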
# define vector z2
z2 <- c(z, z1)
z2
## [1] 3 5 7 9 2 4 6 8
z2[2]
## [1] 5
z2[z2>3]
## [1] 5 7 9 4 6 8
z2[z2>3 & z2<6]
## [1] 5 4
z2[order(z2)]
## [1] 2 3 4 5 6 7 8 9
#Exercise
#dotproduct
dot_product = sum(z*z1)
dot_product
## [1] 140
z2[z2<3 | z2>7]
## [1] 9 2 8
#Matrix
z = c(3,5,7,9)

A = matrix(data = c(1,2,3,4,5,6), nrow = 2)
A <- matrix(data = c(1,2,3,4,5,6), nrow = 2, ncol = 3)
A <- matrix(c(1,2,3,4,5,6), 2, 3)
A
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
A <- matrix(data = c(1,2,3,4,5,6), ncol = 2)
A
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
# 6 values do not fill 4 columns evenly, so R recycles the data (and issues a warning)
A <- matrix(data = c(1,2,3,4,5,6), ncol = 4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    1
## [2,]    2    4    6    2
A <- matrix(data = z2, nrow = 4, ncol = 2, byrow = TRUE)
A
##      [,1] [,2]
## [1,]    3    5
## [2,]    7    9
## [3,]    2    4
## [4,]    6    8
A+2
##      [,1] [,2]
## [1,]    5    7
## [2,]    9   11
## [3,]    4    6
## [4,]    8   10
# Dimensions of A
dim(A)
## [1] 4 2
# Transpose
t(A)
##      [,1] [,2] [,3] [,4]
## [1,]    3    7    2    6
## [2,]    5    9    4    8
# The product A1 %*% A2 is defined only if the number of columns of A1 equals the number of rows of A2
t(A) %*% A
##      [,1] [,2]
## [1,]   98  134
## [2,]  134  186
# New matrix with dimension 4*2
A2 <- A * 2
# Matrix calculation should satisfy the rules of matrix algebra
A + A2
##      [,1] [,2]
## [1,]    9   15
## [2,]   21   27
## [3,]    6   12
## [4,]   18   24
# A %*% A2 fails with the error "non-conformable arguments" because both matrices are 4x2
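Transposing one of the two factors makes the dimensions match. A minimal sketch that rebuilds `A` and `A2` so it runs on its own and captures the error message instead of stopping; `res` is an illustrative name:

```r
A  <- matrix(c(3, 5, 7, 9, 2, 4, 6, 8), nrow = 4, ncol = 2, byrow = TRUE)
A2 <- A * 2

dim(A %*% t(A2))   # 4 4: (4x2) times (2x4)
dim(t(A) %*% A2)   # 2 2: (2x4) times (4x2)

# (4x2) times (4x2) is not defined; capture the error message
res <- tryCatch(A %*% A2, error = function(e) conditionMessage(e))
res   # "non-conformable arguments"
```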

A[2,2]
## [1] 9
A[1, ]
## [1] 3 5
A[,1:2]
##      [,1] [,2]
## [1,]    3    5
## [2,]    7    9
## [3,]    2    4
## [4,]    6    8
t(A) %*% A
##      [,1] [,2]
## [1,]   98  134
## [2,]  134  186
diag(t(A) %*% A)
## [1]  98 186
#Dataframes
mydf <- data.frame(A) 
class(mydf)
## [1] "data.frame"
mydata_csv <- read.csv("C:/Users/pallatan/Desktop/storks.csv", header=TRUE)
mydata_txt <- read.table("C:/Users/pallatan/Desktop/storks.txt", header=TRUE, sep = "\t")
data(cars)
#Dimension 
dim(cars)
## [1] 50  2
#Preview the first few rows
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
#Variable names
names(cars)
## [1] "speed" "dist"
#Summary
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
#Structure
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
#First 2 obs of the variable dist in cars
cars$dist[1:2]
## [1]  2 10
cars1<- cars
cars1$time<- cars$dist/cars$speed
# since "time" is the third column, we can do
cars2<- cars1[,-3]
# we can also drop "time" by keeping the other two variables
cars3<- cars1[c("speed", "dist")]

#List
mylist<- list(myvector=z, mymatrix=A, mydata=cars)
# Load car dataset that comes with R
data(cars)
#fit a simple linear regression between braking distance and speed
lm(dist~speed, data=cars)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
reg <- lm(dist~speed, data = cars)
reg[[1]]
## (Intercept)       speed 
##  -17.579095    3.932409
reg[["coefficients"]]
## (Intercept)       speed 
##  -17.579095    3.932409
reg$coefficients
## (Intercept)       speed 
##  -17.579095    3.932409
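Components of a fitted model are looked up by name, and an unknown name returns NULL rather than an error. The extractor function `coef()` avoids that pitfall. A short sketch:

```r
reg <- lm(dist ~ speed, data = cars)

# List the components stored in the fitted model object
names(reg)

# Extract the coefficients with the accessor function
coef(reg)

# Pick out a single coefficient by name
coef(reg)[["speed"]]
```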
#Plotting
plot(cars)

dbinom(x=4,size=10,prob=0.5)
## [1] 0.2050781
pnorm(1.86)
## [1] 0.9685572
qnorm(0.975)
## [1] 1.959964
rnorm(10)
##  [1]  0.17414687  1.91844282 -0.05864362 -0.42274687 -0.06505727 -1.43295417
##  [7] -0.59042676 -0.80500581 -0.75875805 -1.84142519
rnorm(n=10,mean=100,sd=20)
##  [1] 106.65911  97.28761 102.48570 115.07361 114.33993  88.05207  76.80812
##  [8] 149.31537 106.84272  88.93507
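The random draws above change on every run. Setting a seed makes them reproducible, and the other functions can be checked against their definitions. A minimal sketch; the seed value 42 and the names `a` and `b` are arbitrary:

```r
# Same seed, same draws
set.seed(42)
a <- rnorm(5)
set.seed(42)
b <- rnorm(5)
identical(a, b)   # TRUE

# qnorm() is the inverse of pnorm()
pnorm(qnorm(0.975))   # 0.975

# dbinom() matches the binomial formula choose(n, x) * p^x * (1 - p)^(n - x)
all.equal(dbinom(x = 4, size = 10, prob = 0.5), choose(10, 4) * 0.5^10)   # TRUE
```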

Advanced Techniques in R

#functions
abs_val = function(x){
  if(x >= 0){
    return(x)
  }
  else{
    return(-x)
  }
}
abs_val(-5)
## [1] 5
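`abs_val()` is written for a single number: `if()` expects a condition of length one, and recent versions of R raise an error when it is given a longer vector. A vectorized variant can be sketched with `ifelse()`; the name `abs_val_vec` is an assumption for illustration:

```r
# Elementwise absolute value using ifelse()
abs_val_vec <- function(x) {
  ifelse(x >= 0, x, -x)
}

abs_val_vec(c(-5, 0, 3))   # 5 0 3
```

In practice the built-in `abs()` already does this; the sketch only shows how the branching logic carries over to vectors.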
mytruncation<- function(v, lower, upper){
  v[which(v<lower)]<- lower
  v[which(v>upper)]<- upper
  return(v)
}

mytruncation(v = c(1:9), lower = 3, upper = 7)
## [1] 3 3 3 4 5 6 7 7 7
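The same truncation can be expressed with the elementwise minimum and maximum functions. A sketch, where `mytruncation2` is a hypothetical name for the alternative version:

```r
# Clamp each element of v to the interval [lower, upper]
mytruncation2 <- function(v, lower, upper) {
  pmin(pmax(v, lower), upper)
}

mytruncation2(v = c(1:9), lower = 3, upper = 7)   # 3 3 3 4 5 6 7 7 7
```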
i<- 1
x<- 1
while(i<100){
  i<- i+1
  x<- x+1/i
}
x
## [1] 5.187378
x<- 1
for(i in 2:100){
  x<- x+1/i
}
x
## [1] 5.187378
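Both loops compute the sum 1 + 1/2 + … + 1/100. In R the same sum can be written without a loop; a short sketch, with `x_vec` and `running` as illustrative names:

```r
# Vectorized version of the two loops above
x_vec <- sum(1 / (1:100))
x_vec

# Running partial sums, if the intermediate values are of interest
running <- cumsum(1 / (1:100))
running[100]
```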