Before we begin, read the dataset ‘telco.csv’ into your R session. There are two ways to do it: either pick the file interactively,
telco_data <- read.csv(file.choose(), stringsAsFactors = TRUE)
or, if the file is saved in your working directory, check the working directory first:
getwd()
## [1] "C:/Users/Asmui/Documents/Online Class"
For example, my working directory is “C:/Users/Asmui/Documents/Online Class”. Next, read the “telco.csv” dataset from that directory.
telco_data <- read.csv ("telco.csv", stringsAsFactors = TRUE)
3. Check the structure of the dataset to make sure that R has read it correctly.
str(telco_data)
## 'data.frame': 45 obs. of 6 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 1 2 2 1 1 ...
## $ Programs : Factor w/ 4 levels "Account","Business",..: 4 2 3 4 3 3 1 4 4 3 ...
## $ Car_Ownership: Factor w/ 2 levels "No","Yes": 2 2 1 1 1 2 2 1 2 1 ...
## $ Telco_Prefer : Factor w/ 4 levels "Celcom","DiGi",..: 1 2 1 3 4 1 3 2 2 3 ...
## $ Usage_GB : num 14.6 15.7 14.8 15.4 12.9 22.4 28 19.2 25.4 25.3 ...
## $ Hour_Perday : num 3 3.8 4 3.5 3 6 6.5 5.5 6 6 ...
head (telco_data)
## Gender Programs Car_Ownership Telco_Prefer Usage_GB Hour_Perday
## 1 Male Statistics Yes Celcom 14.6 3.0
## 2 Female Business Yes DiGi 15.7 3.8
## 3 Female Sciences No Celcom 14.8 4.0
## 4 Female Statistics No Maxis 15.4 3.5
## 5 Female Sciences No U-Mobile 12.9 3.0
## 6 Female Sciences Yes Celcom 22.4 6.0
Note that for this example, we will attach telco_data so that its columns can be referred to directly by name:
attach (telco_data)
Revision:
What is the difference between using and not using the attach() function?
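As a hint for the revision question, here is a minimal sketch comparing three ways of reaching a column. The data frame `toy` is a hypothetical stand-in built from the first three rows shown by head(telco_data); it is not part of the original exercise.

```r
# Hypothetical toy data frame (first three rows shown by head(telco_data))
toy <- data.frame(
  Gender   = c("Male", "Female", "Female"),
  Usage_GB = c(14.6, 15.7, 14.8)
)

# Without attach(): name the data frame explicitly with $
mean_dollar <- mean(toy$Usage_GB)

# with() evaluates the expression inside the data frame, no attach() needed
mean_with <- with(toy, mean(Usage_GB))

# With attach(): columns are found by name until detach() is called
attach(toy)
mean_attached <- mean(Usage_GB)
detach(toy)

print(c(mean_dollar, mean_with, mean_attached))
```

All three calls return the same value. If telco_data is already attached, R will report that Gender and Usage_GB are masked while toy is attached; the message disappears after detach(toy).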
Frequency and Contingency Table
table() produces a frequency table of counts for each level of the specified factors. With two factors, it produces a contingency table (cross-tabulation).
table (Gender)
## Gender
## Female Male
## 29 16
table (Programs)
## Programs
## Account Business Sciences Statistics
## 11 8 13 13
table (Gender, Programs)
## Programs
## Gender Account Business Sciences Statistics
## Female 8 5 10 6
## Male 3 3 3 7
table (Gender, Car_Ownership)
## Car_Ownership
## Gender No Yes
## Female 16 13
## Male 7 9
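To see how table() arrives at these counts, the sketch below rebuilds only the six rows printed by head(telco_data) and tabulates them. The name `first_six` is an assumption made for this illustration, and the counts are for those six rows only, not the full 45 observations.

```r
# The six rows shown by head(telco_data), typed in by hand
first_six <- data.frame(
  Gender        = factor(c("Male", "Female", "Female", "Female", "Female", "Female")),
  Car_Ownership = factor(c("Yes", "Yes", "No", "No", "No", "Yes"))
)

# One factor: frequency table
gender_counts <- table(Gender = first_six$Gender)

# Two factors: contingency table (rows = Gender, columns = Car_Ownership)
gender_by_car <- table(Gender = first_six$Gender,
                       Car_Ownership = first_six$Car_Ownership)

print(gender_counts)
print(gender_by_car)
```

Each cell is simply the number of rows that share that combination of levels; a combination that never occurs, such as a male student without a car in these six rows, is reported as 0.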
Contingency Table for More than Two Variables by using ftable()
ftable (Gender, Programs, Car_Ownership)
## Car_Ownership No Yes
## Gender Programs
## Female Account 5 3
## Business 1 4
## Sciences 6 4
## Statistics 4 2
## Male Account 1 2
## Business 1 2
## Sciences 1 2
## Statistics 4 3
pie (table (Telco_Prefer))
As we can see, by default R assigns pastel colours to the pie chart. We can customize the chart so that it conveys more information than the plain default.
pie (table (Telco_Prefer), col = c("Blue", "Yellow", "Red", "Orange"))
We represent each telco operator by using their own brand colour.
Note that you can also give the col = argument as numeric values, e.g.:
pie (table (Telco_Prefer), col = c(1, 2, 3, 4))
and it will give us colours according to R's default palette, where 1 = black, 2 = red, 3 = green, 4 = blue.
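If you are unsure which colour a number refers to, you can ask R directly. This is a small sketch using palette() and col2rgb() from base R; the exact shades differ between R versions, so no output is shown here.

```r
# Make sure the default palette is active (it may have been changed earlier)
palette("default")

# Colours that col = 1, 2, 3, 4 refer to
first_four <- palette()[1:4]
print(first_four)

# Red/green/blue components of each colour, one column per colour
print(col2rgb(first_four))
```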
b. adding main title, Telco Preference by Students
pie (table (Telco_Prefer), col = c("Blue", "Yellow", "Red", "Orange"), main = "Telco Preference by Students")
Figure 1: Pie Chart for Telco Operator Preference by Students
2. Bar Chart
To plot a bar chart in R, we can use the barplot() function. Again, we need the table() function first, so that barplot() can plot the counts for each level of the factor.
barplot (table (Programs))
barplot (table (Programs), col = c ("lawngreen", "indian red", "khaki", "midnightblue"))
barplot (table (Programs), col = c ("lawngreen", "indian red", "khaki", "midnightblue"),
ylim = c(0, 20))
barplot (table (Programs), col = c ("lawngreen", "indian red", "khaki", "midnightblue"),
ylim = c(0, 20), xlab = "Programs", ylab = "Number of Students", main = "Number of Students by Program")
Figure 2: Bar Chart for the Number of Students by Programs
3. Cluster Bar Chart
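We can combine two variables (Gender and Programs) in one bar chart. By default, barplot() stacks the bars for each program; adding beside = TRUE places them side by side, which gives a cluster bar chart.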
barplot (table (Gender, Programs), col = c("deeppink", "cyan"), ylab = "Numbers of Students", xlab = "Programs",
main = "Numbers of Students by Gender and Programs")
Figure 3 (a): Stacked Bar Chart for the Number of Students by Gender and Programs
barplot (table (Gender, Programs), col = c("deeppink", "cyan"), ylab = "Numbers of Students",
xlab = "Programs", main = "Numbers of Students by Gender and Programs", legend = TRUE, beside = TRUE)
Figure 3 (b): Cluster Bar Chart for the Number of Students by Gender and Programs
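The effect of beside = TRUE can be checked without the CSV file. The sketch below types in the counts from the table(Gender, Programs) output shown earlier as a matrix named `counts` (a name chosen for this illustration) and draws both versions.

```r
# Counts copied from the table(Gender, Programs) output
counts <- matrix(
  c(8, 5, 10, 6,
    3, 3,  3, 7),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender   = c("Female", "Male"),
                  Programs = c("Account", "Business", "Sciences", "Statistics"))
)

# When run as a script, draw to a null device instead of writing a file
if (!interactive()) pdf(NULL)

# Default: one stacked bar per program
mid_stacked <- barplot(counts, col = c("deeppink", "cyan"), legend.text = TRUE)

# beside = TRUE: one bar per gender within each program
mid_beside <- barplot(counts, col = c("deeppink", "cyan"), legend.text = TRUE,
                      beside = TRUE)
```

barplot() invisibly returns the bar midpoints: a vector with one value per program for the stacked chart, and a matrix with one value per gender and program for the clustered chart.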
Some important functions that you need to know when dealing with numerical data.
We can get summary statistics for a variable by using the summary() function:
summary (Usage_GB)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 13.70 17.80 18.13 23.20 32.40
or we can get a specific statistic by using a specific function, such as:
a. finding the average/mean value
mean (Usage_GB)
## [1] 18.12889
b. finding the median value
median (Usage_GB)
## [1] 17.8
c. finding the minimum value
min (Usage_GB)
## [1] 5
d. finding the maximum value
max (Usage_GB)
## [1] 32.4
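e. finding the range. Note that range() returns the minimum and the maximum rather than their difference; for max - min use diff(range(Usage_GB)), which here is 32.4 - 5.0 = 27.4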
range (Usage_GB)
## [1] 5.0 32.4
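A short sketch of the two meanings of "range", using only the first ten Usage_GB values printed by str(telco_data); the vector name `usage_first_ten` is made up for this illustration.

```r
# First ten Usage_GB values as shown in the str() output
usage_first_ten <- c(14.6, 15.7, 14.8, 15.4, 12.9, 22.4, 28, 19.2, 25.4, 25.3)

# range() returns two numbers: the minimum and the maximum
usage_limits <- range(usage_first_ten)

# diff() of those two numbers gives the statistical range (max - min)
usage_spread <- diff(usage_limits)

print(usage_limits)
print(usage_spread)
```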
f. finding the variance value
var (Usage_GB)
## [1] 42.07528
g. finding the standard deviation value
sd (Usage_GB)
## [1] 6.486546
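The variance and the standard deviation are linked: sd() is the square root of var(), and both use the sample formula with n - 1 in the denominator. The sketch below checks this on the first ten Usage_GB values from the str() output, so its numbers will differ from the full-data results above.

```r
# First ten Usage_GB values as shown in the str() output
usage_first_ten <- c(14.6, 15.7, 14.8, 15.4, 12.9, 22.4, 28, 19.2, 25.4, 25.3)

n <- length(usage_first_ten)

# Sample variance computed by hand: sum of squared deviations / (n - 1)
var_by_hand <- sum((usage_first_ten - mean(usage_first_ten))^2) / (n - 1)

usage_var <- var(usage_first_ten)
usage_sd  <- sd(usage_first_ten)

print(all.equal(var_by_hand, usage_var))   # the hand formula matches var()
print(all.equal(sqrt(usage_var), usage_sd)) # sd() is the square root of var()
```

The same relationship holds for the full dataset: the square root of 42.07528 is approximately 6.486546.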
Adding a description to our statistics by using print(paste0()). paste0() joins its arguments with no separator, so any spaces must be typed inside the quoted text.
print (paste0("On average, the students use their internet quota by ", round (mean (Usage_GB), 2), "GB" ))
## [1] "On average, the students use their internet quota by 18.13GB"
round() rounds a number; in the example above, it rounds the mean to 2 decimal places.
print (paste0("On maximum usage, the students use their internet quota by ", max (Usage_GB), "GB"))
## [1] "On maximum usage, the students use their internet quota by 32.4GB"
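The spacing in these messages depends on whether paste0() or paste() is used. The sketch below uses the mean reported earlier (18.12889) as a plain number so that it runs without the dataset; the variable names are chosen for this illustration.

```r
# Mean of Usage_GB as reported by mean(Usage_GB), rounded to 2 decimal places
avg_usage <- round(18.12889, 2)

# paste0(): no separator, so spaces are written inside the strings
msg_paste0 <- paste0("On average, the students use ", avg_usage, " GB")

# paste(): inserts a single space between arguments by default
msg_paste <- paste("On average, the students use", avg_usage, "GB")

print(msg_paste0)
print(msg_paste)
```

Both calls produce the same sentence, so the choice is mostly a matter of which spacing style you find easier to read.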
1. Stem and Leaf Plot
stem (Usage_GB)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 5889
## 1 | 00022334
## 1 | 555566677788999
## 2 | 0002223444
## 2 | 556888
## 3 | 02
Key: 0 | 8 means approximately 8 GB, and 1 | 5 means approximately 15 GB.
2. Box and Whisker Plot
boxplot (Usage_GB)
a. adding y-axis label and main label
boxplot (Usage_GB, ylab = "Internet Quota Usage in GB", main = "Figure 5")
Figure 5 (a): Box Plot for Usage GB
b. adding xlab = and horizontal = TRUE arguments
boxplot (Usage_GB, xlab = "Internet Quota Usage in GB", main = "Figure 5", horizontal = TRUE)
Figure 5 (b): Box Plot for Usage GB
c. adding par (mfrow = c(1, 2)) before plotting, so that the plotting area is split into 1 row and 2 columns and both box plots appear side by side
par (mfrow = c(1, 2))
boxplot (Usage_GB, ylab = "Internet Quota Usage in GB", main = "Figure 5")
boxplot (Usage_GB, xlab = "Internet Quota Usage in GB", main = "Figure 5", horizontal = TRUE)
Figure 5 (c): Box Plot for Usage GB
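The mfrow setting stays in effect for every later plot until it is changed again. One common pattern, sketched here on the first ten Usage_GB values from the str() output, is to save the old settings and restore them afterwards.

```r
# First ten Usage_GB values as shown in the str() output
usage_first_ten <- c(14.6, 15.7, 14.8, 15.4, 12.9, 22.4, 28, 19.2, 25.4, 25.3)

# When run as a script, draw to a null device instead of writing a file
if (!interactive()) pdf(NULL)

# par() returns the previous values of the settings it changes
old_par <- par(mfrow = c(1, 2))

boxplot(usage_first_ten, ylab = "Internet Quota Usage in GB")
boxplot(usage_first_ten, xlab = "Internet Quota Usage in GB", horizontal = TRUE)

# Restore the previous layout (one plot per page)
par(old_par)
layout_after <- par("mfrow")
print(layout_after)
```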
hist(Usage_GB, prob=TRUE) #if probability instead of frequency is desired
lines(density(Usage_GB),lwd=4,col="red") #low level function
plot(Hour_Perday, Usage_GB, main="Hour per Day vs Usage in GB",pch=19, col="blue")
fit=lm(Usage_GB~Hour_Perday) #fitting the linear model
abline(fit,col="red",lwd=4) #low-level function
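The object returned by lm() holds more than the line drawn by abline(). As a rough sketch, the code below fits the same kind of model to the first ten rows shown in the str() output and reads off the intercept and slope; the vectors `hours` and `usage` are stand-ins for Hour_Perday and Usage_GB, and the estimates will differ from those for the full dataset.

```r
# First ten values of Hour_Perday and Usage_GB from the str() output
hours <- c(3, 3.8, 4, 3.5, 3, 6, 6.5, 5.5, 6, 6)
usage <- c(14.6, 15.7, 14.8, 15.4, 12.9, 22.4, 28, 19.2, 25.4, 25.3)

# Fit the simple linear model usage = intercept + slope * hours
fit_small <- lm(usage ~ hours)

# Intercept and slope; abline(fit_small) would draw exactly this line
print(coef(fit_small))

# Predicted usage for a student who is online 5 hours per day
predicted_5h <- predict(fit_small, newdata = data.frame(hours = 5))
print(predicted_5h)
```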
par(mfrow=c(2,2))
plot(Hour_Perday,type="l", main="lines")
plot(Hour_Perday,type="b", main="both")
plot(Hour_Perday,type="s", main="steps")
plot(Hour_Perday,type="h", main="high density")
par(mfrow=c(2,2))
plot(Hour_Perday,type="l",lty=1,col=1,lwd=3,main="lines")
plot(Hour_Perday,type="b",lty=2,col=2,lwd=3,main="both")
plot(Hour_Perday,type="s",lty=3,col=3,lwd=3,main="steps")
plot(Hour_Perday,type="h",lty=4,col=4,lwd=3,main="high density")
Try to replicate all the graphs using your own customizations (colours, labels, titles).
Reply in the comments.
edited : 23 June 2020
by: Muhammad Asmui Abdul Rahim
email: asmui@tmsk.uitm.edu.my
created using: rmarkdown