The purpose of this demo is to show how to identify data structures when analysing a dataset to draw out useful insights. You can download the sample dataset (Bank Customer Data.csv) from this link: https://www.kaggle.com/datasets/kidoen/bank-customers-data First, read the .csv file into the R environment with the command below. If your data is in .xls format instead, run install.packages("readxl") and then attach the library with library(readxl).
BankCustomer <- read.csv("/home/user/Desktop/Analytcis with R/DataSets/BankCustomerData.csv") # read the .csv file into a data frame
setwd("/home/user/Desktop/Analytcis with R/Workshops/") # change the working directory
# you can check the current working directory with getwd()
head(BankCustomer) # preview the first six rows of the imported dataset
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome term_deposit
## 1   may      261        1    -1        0  unknown           no
## 2   may      151        1    -1        0  unknown           no
## 3   may       76        1    -1        0  unknown           no
## 4   may       92        1    -1        0  unknown           no
## 5   may      198        1    -1        0  unknown           no
## 6   may      139        1    -1        0  unknown           no
str(BankCustomer) # compact view of the structure; see help(str) or ?str in RStudio
## 'data.frame': 42639 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education : chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ term_deposit: chr "no" "no" "no" "no" ...
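Once the structure is known, a common follow-up is to list the class of every column and turn the text columns into factors. The sketch below shows one way to do this; it uses a small hypothetical data frame named toy (a few values copied from the head() output above) instead of the full file, so it runs without the CSV.

```r
# Hypothetical miniature of the bank data; values copied from the head() output
toy <- data.frame(
  age     = c(58L, 44L, 33L),
  job     = c("management", "technician", "entrepreneur"),
  balance = c(2143L, 29L, 2L),
  housing = c("yes", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_classes <- sapply(toy, class)
print(col_classes)

# Convert every character column to a factor so R treats it as categorical
chr_cols <- names(toy)[col_classes == "character"]
toy[chr_cols] <- lapply(toy[chr_cols], factor)
str(toy)
```

Whether to convert to factors depends on the analysis; character columns are fine for plain inspection, while factors are convenient for grouping and modelling.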
Now that we have identified the data structures, the next step is to assign values to them. This is done by importing data from files and exporting data to files.
data("mtcars") # Loading mtcars data
write.table(mtcars, file = "mtcars.txt", sep = "\t", row.names = TRUE, col.names = NA)
write.csv(mtcars, file = "mtcars.csv")
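To check that an export worked, the files can be read back in. The sketch below repeats the two exports into a temporary folder (the tempfile() paths are an assumption made so the example does not touch your working directory) and imports them again with read.csv() and read.delim().

```r
data("mtcars")

# Temporary file paths, used here only to keep the example self-contained
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")

# Export, as in the commands above
write.csv(mtcars, file = csv_path)
write.table(mtcars, file = txt_path, sep = "\t", row.names = TRUE, col.names = NA)

# Import again; row.names = 1 restores the car names from the first column
cars_csv <- read.csv(csv_path, row.names = 1)
cars_txt <- read.delim(txt_path, row.names = 1)

dim(cars_csv)
head(rownames(cars_txt))
```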
Fun: We can also export plots from RStudio to .pdf, .jpeg and .png files. The sample below saves a .jpeg file; .pdf and .png files are exported the same way with pdf() and png().
# Step 1: Call the jpeg command to start the plot
jpeg(filename = "/home/user/Desktop/Analytcis with R/My Plot.jpeg", # The path of the file to create
width = 400, # The width of the plot in pixels (the default unit)
height = 400) # The height of the plot in pixels (the default unit)
# Step 2: Create the plot with R code
plot(x = 1:10,
y = 1:10)
abline(v = 0) # Additional low-level plotting commands
text(x = 0, y = 1, labels = "Random text")
dev.off() # Step 3: Run dev.off() to create the file!
## png
## 2
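The same three steps work for the other formats. This is a minimal sketch that writes a .png and a .pdf into the session's temporary folder; the file names are hypothetical, and png() assumes your R build has PNG support. Note that png() sizes are in pixels by default, while pdf() sizes are in inches.

```r
# Hypothetical output paths inside the session's temporary folder
png_path <- file.path(tempdir(), "my_plot.png")
pdf_path <- file.path(tempdir(), "my_plot.pdf")

# PNG: width and height are in pixels by default
png(filename = png_path, width = 400, height = 400)
plot(x = 1:10, y = 1:10)
invisible(dev.off())  # close the device so the file is written

# PDF: width and height are in inches
pdf(file = pdf_path, width = 5, height = 5)
plot(x = 1:10, y = 1:10)
invisible(dev.off())

file.exists(c(png_path, pdf_path))
```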
-Data manipulation is required to improve the accuracy of the data.
-The R base package has “apply” functions in it, which help to manipulate the data.
-The apply() functions are used to perform a specific operation on each row or column of an object.
-Types of apply function are: apply(), lapply(), sapply(), tapply(), mapply() and so on.
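Before looking at apply() itself, here is a small sketch of how the other members of the family behave. The inputs (scores, ages, marital) are made-up illustration values, with the ages and marital statuses taken from the first four rows of the head() output.

```r
# A named list of numeric vectors (illustrative values)
scores <- list(a = c(1, 2, 3), b = c(10, 20))

# lapply(): applies FUN to each element and always returns a list
means_list <- lapply(scores, mean)

# sapply(): same idea, but simplifies the result to a vector when possible
means_vec <- sapply(scores, mean)

# tapply(): applies FUN to groups of a vector defined by a second vector
ages    <- c(58, 44, 33, 47)
marital <- c("married", "single", "married", "married")
mean_age_by_status <- tapply(ages, marital, mean)

# mapply(): applies FUN to several vectors in parallel, element by element
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)

means_list
means_vec
mean_age_by_status
pair_sums
```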
apply() function
apply() takes a data frame or matrix as input and returns a vector, list or array. It is primarily used to avoid explicit loop constructs. It is the most basic function of the family and can be used over a matrix.
This function takes 3 arguments:
apply(X, MARGIN, FUN)
-X: an array or matrix
-MARGIN: takes the value 1, 2 or c(1,2) to define where to apply the function:
-MARGIN=1: the manipulation is performed on rows
-MARGIN=2: the manipulation is performed on columns
-MARGIN=c(1,2): the manipulation is performed on rows and columns, that is, on each individual element
-FUN: tells which function to apply. Built-in functions like mean, median, sum, min, max and even user-defined functions can be applied
Example:
m1 <- matrix(1:10, nrow = 5, ncol = 6) # the 10 values are recycled to fill all 30 cells
m1
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    6    1    6    1    6
## [2,]    2    7    2    7    2    7
## [3,]    3    8    3    8    3    8
## [4,]    4    9    4    9    4    9
## [5,]    5   10    5   10    5   10
a_m1 <- apply(m1, 2, sum)
a_m1
## [1] 15 40 15 40 15 40
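The example above uses MARGIN=2. As a sketch of the other options, the same matrix can be processed by rows, with a user-defined function, and element by element; the variable names row_sums, col_range and doubled are only illustrative.

```r
m1 <- matrix(1:10, nrow = 5, ncol = 6)

# MARGIN = 1: one sum per row
row_sums <- apply(m1, 1, sum)

# MARGIN = 2 with a user-defined function: range width (max - min) of each column
col_range <- apply(m1, 2, function(col) max(col) - min(col))

# MARGIN = c(1, 2): the function is called on every single element
doubled <- apply(m1, c(1, 2), function(v) v * 2)

row_sums
col_range
dim(doubled)
```

For plain row and column sums, rowSums() and colSums() give the same numbers and are usually faster; apply() is the more general tool.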
Now let's do some manipulation with our Bank Customer Data.
library(plyr) # provides rename()
BankCustomer <- rename(BankCustomer, c("age" = "Age")) # rename column age to Age (old name = new name)
str(BankCustomer)
## 'data.frame': 42639 obs. of 17 variables:
## $ Age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education : chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ term_deposit: chr "no" "no" "no" "no" ...
max(BankCustomer$Age)
## [1] 95
# Now let's add a column that groups customers into generations by age
BankCustomerGen<-transform(BankCustomer, Generation=ifelse(Age<22,"Z",
ifelse(Age<41,"Y",
ifelse(Age<53,"X", "AB"))))
head(BankCustomerGen)
##   Age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome term_deposit Generation
## 1   may      261        1    -1        0  unknown           no         AB
## 2   may      151        1    -1        0  unknown           no          X
## 3   may       76        1    -1        0  unknown           no          Y
## 4   may       92        1    -1        0  unknown           no          X
## 5   may      198        1    -1        0  unknown           no          Y
## 6   may      139        1    -1        0  unknown           no          Y
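Nested ifelse() calls become hard to read as the number of groups grows. As an alternative sketch (not the method used above), cut() can produce the same labels from a vector of break points. The ages vector below is copied from the six rows shown in the output, and right = FALSE makes each interval closed on the left so that the boundaries match Age<22, Age<41 and Age<53.

```r
# Ages of the six customers shown in the head() output
ages <- c(58, 44, 33, 47, 33, 35)

# Intervals: [-Inf, 22) = Z, [22, 41) = Y, [41, 53) = X, [53, Inf) = AB
generation <- cut(ages,
                  breaks = c(-Inf, 22, 41, 53, Inf),
                  labels = c("Z", "Y", "X", "AB"),
                  right  = FALSE)

as.character(generation)
```

Unlike ifelse(), cut() returns a factor, so the result keeps all four levels even when a group has no members.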
# Two-way frequency table: generation by outcome of the previous campaign
table(BankCustomerGen$Generation,BankCustomerGen$poutcome)
##
##      failure other success unknown
##   AB     577   178     137    5789
##   X     1199   404     183   10862
##   Y     2486   929     440   19325
##   Z        9     6       6     109
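Raw counts are hard to compare because the generations differ a lot in size. One possible next step is to convert each row to proportions with prop.table(). The sketch below re-enters the counts from the table above as a matrix named counts (a hypothetical name) so that it runs without the data file; with the real data you could pass the table() result directly.

```r
# Counts copied from the two-way frequency table above
counts <- matrix(c( 577, 178, 137,  5789,
                   1199, 404, 183, 10862,
                   2486, 929, 440, 19325,
                      9,   6,   6,   109),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Generation = c("AB", "X", "Y", "Z"),
                                 poutcome   = c("failure", "other", "success", "unknown")))

# margin = 1: each row is divided by its own total, so every row sums to 1
row_share <- prop.table(counts, margin = 1)
round(row_share, 3)
```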