Randomly sample a training data set that contains 80% of the original data points.
Sol:
#splitting 80:20 (the seed makes the random sample reproducible)
set.seed(13960406)
train_values <- sample(nrow(iris), 0.80 * nrow(iris))
iris_train <- iris[train_values, ]
Let us check the dimensions of the sample:
dim(iris)
## [1] 150 5
dim(iris_train)
## [1] 120 5
The output confirms that the training data set contains 120 of the 150 rows, that is, 80% of the original data set.
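The rows that were not sampled can serve as a held-out test set. A minimal sketch, assuming the same seed as above; the name `iris_holdout` is introduced here for illustration and is not part of the original solution:

```r
set.seed(13960406)
train_values <- sample(nrow(iris), 0.80 * nrow(iris))
iris_train <- iris[train_values, ]
# Negative indexing drops the sampled rows and keeps the remaining 20%
iris_holdout <- iris[-train_values, ]
dim(iris_holdout)
```

Because the two subsets are built from the same index vector, no row can appear in both.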
Economic Order Quantity Model:
D = 5000: annual demand quantity
K = $4: fixed cost per order
h = $0.5: holding cost per unit
Q = ?
Sol:
#annual demand quantity
D <- 5000
#fixed cost per order
K <- 4
#holding cost per unit
h <- 0.5
#Economic order quantity: Q = sqrt(2*D*K/h)
Q <- sqrt((2*D*K)/h)
Q
## [1] 282.8427
The economic order quantity Q is 282.8427 units per order.
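As a sanity check, the closed-form Q can be compared with a numerical minimum of the annual cost. This sketch assumes the standard EOQ cost function (ordering cost K*D/Q plus holding cost h*Q/2); the helper `total_cost` and the search interval are illustrative choices, not part of the original solution:

```r
D <- 5000; K <- 4; h <- 0.5
# Annual cost as a function of the order quantity Q (assumed standard EOQ form)
total_cost <- function(Q) K * D / Q + h * Q / 2
Q_closed <- sqrt(2 * D * K / h)
# Numerical minimization over an arbitrary search interval
Q_numeric <- optimize(total_cost, interval = c(1, 5000))$minimum
c(closed_form = Q_closed, numeric = Q_numeric)
```

Both values should agree up to the tolerance of `optimize()`.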
1. Define a vector with values (5, 2, 11, 19, 3, -9, 8, 20, 1). Calculate the sum, mean, and standard deviation.
Sol:
vector <- c(5, 2, 11, 19, 3, -9, 8, 20, 1)
#Sum (stored as vec_sum so that the built-in sum() is not masked)
vec_sum <- sum(vector)
vec_sum
## [1] 60
The sum of the vector is 60.
#Mean
vec_mean <- mean(vector)
vec_mean
## [1] 6.666667
The mean of the vector is 6.666667.
#Standard deviation
vec_sd <- sd(vector)
vec_sd
## [1] 9.124144
The standard deviation of the vector is 9.124144.
2. Reorder the vector from largest to smallest, and store it as a new vector.
Sol:
#Reordering vector
v_ordered <- vector[order(vector,decreasing = TRUE)]
v_ordered
## [1] 20 19 11 8 5 3 2 1 -9
The vector has been reordered from largest to smallest.
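The same result can be obtained with `sort()`, which returns the sorted values directly instead of the index permutation that `order()` produces. A short sketch (the names `v` and `v_sorted` are used here only for illustration):

```r
v <- c(5, 2, 11, 19, 3, -9, 8, 20, 1)
v_sorted <- sort(v, decreasing = TRUE)
# Check that both approaches give the same vector
identical(v_sorted, v[order(v, decreasing = TRUE)])
```

`order()` remains the better choice when the same permutation has to be applied to other objects, such as the rows of a data frame.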
3. Convert the vector to a 3x3 matrix filled by column. What is the sum of the first column? What is the number in column 2, row 3? What are the column sums?
Sol:
#Creating Matrix
mat <- matrix(data = vector, nrow = 3, ncol = 3)
mat
## [,1] [,2] [,3]
## [1,] 5 19 8
## [2,] 2 3 20
## [3,] 11 -9 1
The vector has been converted into a 3x3 matrix, filled column by column.
#sum of the first column
sum_1 <- sum(mat[,1])
sum_1
## [1] 18
The sum of the first column is 18.
mat[3,2]
## [1] -9
The number in column 2 row 3 is -9.
colSums(mat)
## [1] 18 13 29
The column sums are: Col 1: 18, Col 2: 13, Col 3: 29.
4. Use the following code to load the CustomerData into R.
Sol:
Let us load the data:
#loading customerData
customer <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/CustomerData.csv")
Let us check the dimensions of the data:
#dimensions of data
dim(customer)
## [1] 5000 59
There are 5000 rows and 59 columns in the dataset.
Let us display all variable names of the dataset.
#variablenames
names(customer)
## [1] "CustomerID" "Region" "TownSize"
## [4] "Gender" "Age" "EducationYears"
## [7] "JobCategory" "UnionMember" "EmploymentLength"
## [10] "Retired" "HHIncome" "DebtToIncomeRatio"
## [13] "CreditDebt" "OtherDebt" "LoanDefault"
## [16] "MaritalStatus" "HouseholdSize" "NumberPets"
## [19] "NumberCats" "NumberDogs" "NumberBirds"
## [22] "HomeOwner" "CarsOwned" "CarOwnership"
## [25] "CarBrand" "CarValue" "CommuteTime"
## [28] "PoliticalPartyMem" "Votes" "CreditCard"
## [31] "CardTenure" "CardItemsMonthly" "CardSpendMonth"
## [34] "ActiveLifestyle" "PhoneCoTenure" "VoiceLastMonth"
## [37] "VoiceOverTenure" "EquipmentRental" "EquipmentLastMonth"
## [40] "EquipmentOverTenure" "CallingCard" "WirelessData"
## [43] "DataLastMonth" "DataOverTenure" "Multiline"
## [46] "VM" "Pager" "Internet"
## [49] "CallerID" "CallWait" "CallForward"
## [52] "ThreeWayCalling" "EBilling" "TVWatchingHours"
## [55] "OwnsPC" "OwnsMobileDevice" "OwnsGameSystem"
## [58] "OwnsFax" "NewsSubscriber"
Let us calculate the average “Debt to Income Ratio”
#average DebtToIncomeRatio
mean(customer$DebtToIncomeRatio)
## [1] 9.95416
The average “Debt to Income Ratio” is 9.95416.
Let us find the proportion of “Married” customers
#proportion of married customers
mean(customer$MaritalStatus == "Married")*100
## [1] 48.02
The proportion of “Married” customers is 48.02%.
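The expression above works because the comparison returns a logical vector, and R treats TRUE as 1 and FALSE as 0 when averaging. A small sketch on made-up data (the vector `status` is hypothetical and not taken from CustomerData):

```r
# Hypothetical marital status values, for illustration only
status <- c("Married", "Unmarried", "Married", "Unmarried", "Unmarried")
# Percentage of one category via the mean of a logical vector
pct_married <- mean(status == "Married") * 100
pct_married
# Percentages for every category at once
prop.table(table(status)) * 100
```

`prop.table(table(...))` is convenient when the shares of all categories are needed, not just one.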
1. Do you think 1 + 1/(2^2) + 1/(3^2) + … + 1/(n^2) converges or diverges as n→∞? Use R to verify your answer.
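#partial sum of the series 1 + 1/2^2 + ... + 1/n^2
Series <- function(n){
  z <- 1
  for(i in seq_len(n)[-1]){  # i = 2, ..., n; the loop is skipped when n = 1
    z <- z + 1/(i*i)
  }
  z
}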
Series(100)
## [1] 1.634984
Series(10000)
## [1] 1.644834
Series(25000)
## [1] 1.644894
Series(30000)
## [1] 1.644901
The output shows that the partial sums change less and less as n increases: going from n = 10000 to n = 30000 changes the sum only in the fifth decimal place. This numerical evidence indicates that the series converges as n→∞; the partial sums approach π²/6 ≈ 1.644934.
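The partial sums can also be computed without an explicit loop and compared with the known limit π²/6. A minimal sketch; the helper name `partial_sum` is an assumption made here, not part of the original solution:

```r
# Vectorized partial sum of 1/i^2 for i = 1, ..., n
partial_sum <- function(n) sum(1 / (1:n)^2)
n_values <- c(100, 10000, 25000, 30000)
sums <- sapply(n_values, partial_sum)
# The gap to pi^2/6 shrinks as n grows
data.frame(n = n_values, partial_sum = sums, gap_to_limit = pi^2 / 6 - sums)
```

The gap column stays positive and decreases, which is consistent with convergence from below.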
2. Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13, … What is the next number? What is the 50th number? Create a vector of the first 30 Fibonacci numbers.
Sol:
Let us generate the first 50 terms of the Fibonacci sequence:
#Fibonacci sequence
x <- c(1,1)
for(i in 3:50){
x[i] <- x[(i-1)] + x[(i-2)]
}
x[8]
## [1] 21
The next number in the sequence (the 8th term) is 21.
Check the 50th number in the sequence:
x[50]
## [1] 12586269025
The 50th number in the sequence is 12586269025.
Let us create a vector of the first 30 Fibonacci numbers:
x1 <- x[1:30]
x1
## [1] 1 1 2 3 5 8 13 21 34 55
## [11] 89 144 233 377 610 987 1597 2584 4181 6765
## [21] 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040
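Note that the 50th Fibonacci number is larger than the largest value R can store as an integer. The loop still works because `c(1,1)` creates a numeric (double) vector, and doubles represent whole numbers exactly up to 2^53. A short check (the name `fib` is used here for illustration):

```r
fib <- c(1, 1)                      # numeric (double), not integer
for (i in 3:50) fib[i] <- fib[i - 1] + fib[i - 2]
fib[50] > .Machine$integer.max      # TRUE: too large for a 32-bit integer
fib[50] < 2^53                      # TRUE: still exactly representable as a double
format(fib[50], big.mark = ",")
```

Starting from `c(1L, 1L)` instead would produce NA with an integer overflow warning for the later terms.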
3. Write a function that can either calculate the summation of the series in Question 1 or generate and print the Fibonacci sequence in Question 2.
Sol:
Creating a function to generate and print the Fibonacci sequence:
#function to generate the first n Fibonacci numbers
fibonacci <- function(n){
  y <- c(1,1)
  if(n > 2){  # guard: 3:n would count downwards for n < 3
    for(i in 3:n){
      y[i] <- y[(i-1)] + y[(i-2)]
    }
  }
  y[seq_len(n)]
}
fibonacci(30)
## [1] 1 1 2 3 5 8 13 21 34 55
## [11] 89 144 233 377 610 987 1597 2584 4181 6765
## [21] 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040
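Question 3 asks for a single function that can perform either task. One possible sketch is given here; the function name `series_or_fib`, the argument `type`, and its two values are assumptions made for illustration:

```r
series_or_fib <- function(n, type = c("series", "fibonacci")) {
  type <- match.arg(type)
  if (type == "series") {
    # Partial sum 1 + 1/2^2 + ... + 1/n^2 from Question 1
    return(sum(1 / seq_len(n)^2))
  }
  # First n Fibonacci numbers from Question 2
  fib <- c(1, 1)
  if (n > 2) {
    for (i in 3:n) fib[i] <- fib[i - 1] + fib[i - 2]
  }
  fib <- fib[seq_len(n)]
  print(fib)
  invisible(fib)
}
series_or_fib(100, "series")
series_or_fib(10, "fibonacci")
```

`match.arg()` restricts `type` to the two listed values and uses "series" as the default.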
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
dim(iris)
## [1] 150 5
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
class(iris[,1])
## [1] "numeric"
class(iris[,5])
## [1] "factor"
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
#Split by species; use the first 30 rows of each species for training and the remaining 20 for testing
setosa <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
virginica <- iris[iris$Species == "virginica", ]
ind <- 1:30
iris_train <- rbind(setosa[ind,], versicolor[ind,], virginica[ind,])
iris_test <- rbind(setosa[-ind,], versicolor[-ind,], virginica[-ind,])
#k-nearest neighbours (k = 5): classify each test row from its nearest training rows
library(class)
knn_iris <- knn(train = iris_train[, -5], test = iris_test[, -5], cl=iris_train[,5], k=5)
#Confusion matrix of true vs. predicted species
table(iris_test[,5], knn_iris, dnn = c("True", "Predicted"))
## Predicted
## True setosa versicolor virginica
## setosa 20 0 0
## versicolor 0 19 1
## virginica 0 0 20
#Number of misclassified test observations
sum(iris_test[,5] != knn_iris)
## [1] 1
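With 1 error among the 60 test rows, the accuracy in the output above is 59/60, or about 98.3%. The accuracy can also be computed directly. A self-contained sketch that rebuilds the same split with `split()` (the names `by_species`, `train`, `test`, `pred`, and `accuracy` are illustrative):

```r
library(class)
ind <- 1:30
by_species <- split(iris, iris$Species)
# First 30 rows of each species for training, remaining 20 for testing
train <- do.call(rbind, lapply(by_species, function(d) d[ind, ]))
test  <- do.call(rbind, lapply(by_species, function(d) d[-ind, ]))
pred <- knn(train = train[, -5], test = test[, -5], cl = train[, 5], k = 5)
# Share of test rows whose predicted species matches the true species
accuracy <- mean(pred == test[, 5])
accuracy
```

`knn()` breaks voting ties at random, so the exact value may differ slightly between runs.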
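#K-means Clustering
library(fpc)  # provides plotcluster()
# kmeans() starts from random centers, so the clusters can differ between runs
fit <- kmeans(iris[,1:4], 5)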
plotcluster(iris[,1:4], fit$cluster)
fit2 <- kmeans(iris[,1:4], 3)
plotcluster(iris[,1:4], fit2$cluster)
#Hierarchical Clustering
hc_result <- hclust(dist(iris[,1:4]))
plot(hc_result)
#Cut Dendrogram into 3 Clusters
rect.hclust(hc_result, k=3)
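`rect.hclust()` only draws the three clusters on the dendrogram. To obtain the cluster membership of each observation, `cutree()` can be used. A short sketch that compares the resulting clusters with the known species (the name `clusters` is illustrative):

```r
hc_result <- hclust(dist(iris[, 1:4]))
# Cluster label (1, 2 or 3) for each of the 150 observations
clusters <- cutree(hc_result, k = 3)
# Cross-tabulate the clusters against the true species
table(Species = iris$Species, Cluster = clusters)
```

The cluster numbers are arbitrary labels, so only the grouping matters, not which number a species receives.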
x <- 10
y <- 5
x+y
## [1] 15
log(x)
## [1] 2.302585
exp(y)
## [1] 148.4132
cos(x)
## [1] -0.8390715
x == y
## [1] FALSE
x > y
## [1] TRUE
#Data Structure
# Define numerical vector z
z<- c(3,5,7,9)
# Define character vector zz, where numerical operations cannot be directly applied.
zz<- c("cup", "plate", "pen", "paper")
#Average
mean(z)
## [1] 6
#Standard deviation
sd(z)
## [1] 2.581989
#Median
median(z)
## [1] 6
#Max
max(z)
## [1] 9
#Min
min(z)
## [1] 3
#Summary Stats
summary(z)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 4.5 6.0 6.0 7.5 9.0
z
## [1] 3 5 7 9
z+2
## [1] 5 7 9 11
z/10
## [1] 0.3 0.5 0.7 0.9
# define vector z1
z1 <- c(2,4,6,8)
# Elementwise operations (the vectors should have the same length; otherwise R recycles the shorter one)
z+z1
## [1] 5 9 13 17
z*z1
## [1] 6 20 42 72
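When the two vectors differ in length, R does not stop with an error but recycles the shorter vector. A small sketch with made-up vectors `a` and `b`:

```r
a <- c(3, 5, 7, 9)
b <- c(10, 20)
# b is reused as 10, 20, 10, 20 to match the length of a
recycled_sum <- a + b
recycled_sum
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, so it is usually safer to check the lengths first.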
# define vector z2
z2 <- c(z, z1)
z2
## [1] 3 5 7 9 2 4 6 8
z2[2]
## [1] 5
z2[z2>3]
## [1] 5 7 9 4 6 8
z2[z2>3 & z2<6]
## [1] 5 4
z2[order(z2)]
## [1] 2 3 4 5 6 7 8 9
#Exercise
#dot product of z and z1
dot_product <- sum(z*z1)
dot_product
## [1] 140
z2[z2<3 | z2>7]
## [1] 9 2 8
#Matrix
z <- c(3,5,7,9)
# Three equivalent ways to define the same 2x3 matrix (filled by column)
A <- matrix(data = c(1,2,3,4,5,6), nrow = 2)
A <- matrix(data = c(1,2,3,4,5,6), nrow = 2, ncol = 3)
A <- matrix(c(1,2,3,4,5,6), 2, 3)
A
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
A <- matrix(data = c(1,2,3,4,5,6), ncol = 2)
A
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
# 6 values cannot fill a 2x4 matrix evenly, so R recycles them (and issues a warning)
A <- matrix(data = c(1,2,3,4,5,6), ncol = 4)
A
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 1
## [2,] 2 4 6 2
A <- matrix(data = z2, nrow = 4, ncol = 2, byrow = TRUE)
A
## [,1] [,2]
## [1,] 3 5
## [2,] 7 9
## [3,] 2 4
## [4,] 6 8
A+2
## [,1] [,2]
## [1,] 5 7
## [2,] 9 11
## [3,] 4 6
## [4,] 8 10
# Dimensions of A
dim(A)
## [1] 4 2
# Transpose
t(A)
## [,1] [,2] [,3] [,4]
## [1,] 3 7 2 6
## [2,] 5 9 4 8
# Matrix multiplication X %*% Y is defined if and only if the number of columns of X equals the number of rows of Y
t(A) %*% A
## [,1] [,2]
## [1,] 98 134
## [2,] 134 186
# New matrix with dimension 4x2
A2 <- A * 2
# Elementwise addition requires matrices of the same dimensions
A + A2
## [,1] [,2]
## [1,] 9 15
## [2,] 21 27
## [3,] 6 12
## [4,] 18 24
# A %*% A2 fails with the error "non-conformable arguments": A has 2 columns but A2 has 4 rows
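Transposing one of the two matrices makes the product conformable. The sketch below rebuilds A and A2 so that it runs on its own, and uses `tryCatch()` to capture the error message of the invalid product instead of stopping (the name `result` is illustrative):

```r
A  <- matrix(c(3, 5, 7, 9, 2, 4, 6, 8), nrow = 4, ncol = 2, byrow = TRUE)
A2 <- A * 2
dim(A %*% t(A2))   # (4x2) times (2x4) gives a 4x4 matrix
dim(t(A) %*% A2)   # (2x4) times (4x2) gives a 2x2 matrix
# (4x2) times (4x2) is not defined; capture the error message
result <- tryCatch(A %*% A2, error = function(e) conditionMessage(e))
result
```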
A[2,2]
## [1] 9
A[1, ]
## [1] 3 5
A[,1:2]
## [,1] [,2]
## [1,] 3 5
## [2,] 7 9
## [3,] 2 4
## [4,] 6 8
t(A) %*% A
## [,1] [,2]
## [1,] 98 134
## [2,] 134 186
diag(t(A) %*% A)
## [1] 98 186
#Dataframes
mydf <- data.frame(A)
class(mydf)
## [1] "data.frame"
mydata_csv <- read.csv("C:/Users/pallatan/Desktop/storks.csv", header=TRUE)
mydata_txt <- read.table("C:/Users/pallatan/Desktop/storks.txt", header=TRUE, sep = "\t")
data(cars)
#Dimension
dim(cars)
## [1] 50 2
#Preview the first few rows
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
#Variable names
names(cars)
## [1] "speed" "dist"
#Summary
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
#Structure
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
#First 2 obs of the variable dist in cars
cars$dist[1:2]
## [1] 2 10
cars1<- cars
cars1$time<- cars$dist/cars$speed
# since "time" is the third column, we can do
cars2<- cars1[,-3]
# we can also drop "time" by keeping the other two variables
cars3<- cars1[c("speed", "dist")]
#List
mylist<- list(myvector=z, mymatrix=A, mydata=cars)
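List elements can be accessed by name with `$` or `[[ ]]`, while single brackets return a sub-list. A short sketch that rebuilds the list so that it runs on its own:

```r
z <- c(3, 5, 7, 9)
A <- matrix(c(3, 5, 7, 9, 2, 4, 6, 8), nrow = 4, ncol = 2, byrow = TRUE)
mylist <- list(myvector = z, mymatrix = A, mydata = cars)
mylist$myvector              # element by name
mylist[["mymatrix"]][2, 2]   # element by name, then matrix indexing
class(mylist["mydata"])      # single brackets return a list
class(mylist[["mydata"]])    # double brackets return the element itself
```

The same distinction explains why `reg[[1]]` on a fitted model returns the coefficient vector itself rather than a one-element list.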
# Load car dataset that comes with R
data(cars)
#fit a simple linear regression between braking distance and speed
lm(dist~speed, data=cars)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
reg <- lm(dist~speed, data = cars)
reg[[1]]
## (Intercept) speed
## -17.579095 3.932409
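# The element name must be spelled correctly; a misspelled name such as "coeffcients" silently returns NULL
reg[["coefficients"]]
## (Intercept) speed
## -17.579095 3.932409
reg$coefficients
## (Intercept) speed
## -17.579095 3.932409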
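The accessor function `coef()` avoids relying on the exact spelling of the list element, and `predict()` applies the fitted line to new data. A minimal sketch; the speed value of 20 is an arbitrary choice for illustration:

```r
reg <- lm(dist ~ speed, data = cars)
# Extract the fitted intercept and slope
reg_coef <- coef(reg)
reg_coef
# Predicted braking distance at a hypothetical speed of 20
pred_20 <- predict(reg, newdata = data.frame(speed = 20))
pred_20
```

The prediction equals the intercept plus 20 times the slope.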
#Plotting
plot(cars)
dbinom(x=4,size=10,prob=0.5)
## [1] 0.2050781
pnorm(1.86)
## [1] 0.9685572
qnorm(0.975)
## [1] 1.959964
rnorm(10)
## [1] 0.17414687 1.91844282 -0.05864362 -0.42274687 -0.06505727 -1.43295417
## [7] -0.59042676 -0.80500581 -0.75875805 -1.84142519
rnorm(n=10,mean=100,sd=20)
## [1] 106.65911 97.28761 102.48570 115.07361 114.33993 88.05207 76.80812
## [8] 149.31537 106.84272 88.93507
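The random draws above change on every run. Setting a seed makes them reproducible, and `qnorm()` is the inverse of `pnorm()`. A short sketch; the seed value 1 is arbitrary:

```r
set.seed(1)              # arbitrary seed, chosen for illustration
draws_a <- rnorm(10)
set.seed(1)              # resetting the seed reproduces the same draws
draws_b <- rnorm(10)
identical(draws_a, draws_b)
# qnorm() returns the quantile, pnorm() maps it back to the probability
round_trip <- pnorm(qnorm(0.975))
round_trip
```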
#functions
abs_val <- function(x){
  if(x >= 0){
    return(x)
  } else {
    return(-x)
  }
}
abs_val(-5)
## [1] 5
# Truncate the values of v to the interval [lower, upper]
mytruncation<- function(v, lower, upper){
v[which(v<lower)]<- lower
v[which(v>upper)]<- upper
return(v)
}
mytruncation(v = c(1:9), lower = 3, upper = 7)
## [1] 3 3 3 4 5 6 7 7 7
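The same truncation can be written with the built-in `pmax()` and `pmin()`, which take elementwise maxima and minima. A sketch under the assumption that `lower <= upper`; the name `truncate_vec` is illustrative:

```r
# Raise values below lower, then cap values above upper
truncate_vec <- function(v, lower, upper) pmin(pmax(v, lower), upper)
truncated <- truncate_vec(1:9, lower = 3, upper = 7)
truncated
```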
# Sum 1 + 1/2 + ... + 1/100 with a while loop
i<- 1
x<- 1
while(i<100){
i<- i+1
x<- x+1/i
}
x
## [1] 5.187378
# The same sum with a for loop
x<- 1
for(i in 2:100){
x<- x+1/i
}
x
## [1] 5.187378
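Both loops compute the sum 1 + 1/2 + … + 1/100. Since arithmetic in R is vectorized, the same value can be obtained in one line (the name `harmonic_100` is illustrative):

```r
# Divide 1 by each of 1, 2, ..., 100 and add up the results
harmonic_100 <- sum(1 / (1:100))
harmonic_100
```

For simple sums like this one, the vectorized form is usually shorter and faster than an explicit loop.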