Problem Set 3: Graphics
Write a mydotchart() function that should:

a) Have the following arguments: mydotchart(data, labels, colors, main, xlab, ylab, xlim, ylim, lty, normalize, col, pch, cex, subsets)
b) Take a matrix of values rather than just a single series. Each column should represent a different series, so data <- cbind(c(1,3,5), c(10,2,3)) should work.
c) Permit normalization when the normalize parameter is TRUE. If so, rescale everything to the range 0 to 1.0, based on the smallest versus largest value in the data matrix. Assume all values are positive; for a data matrix this can be done as follows: data <- (data - min(data))/(max(data) - min(data))
d) col, pch, cex, lty, etc. should do something reasonable, like controlling the size of individual points, entire series, or the whole chart (see the sketch below).
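A minimal sketch of such a wrapper, built on matplot(), is shown below. This is only one possible implementation; in particular, the treatment of labels, colors, and subsets is an assumption about how those arguments are meant to behave.

mydotchart <- function(data, labels = NULL, colors = NULL, main = "", xlab = "", ylab = "",
                       xlim = NULL, ylim = NULL, lty = 1, normalize = FALSE,
                       col = NULL, pch = 19, cex = 1, subsets = NULL, ...)
{
  data <- as.matrix(data)                            # accept a single series or a matrix of series
  if (!is.null(subsets)) data <- data[, subsets, drop = FALSE]   # plot only the selected series
  if (normalize)                                     # rescale to [0, 1] using the formula above
    data <- (data - min(data)) / (max(data) - min(data))
  if (is.null(col)) col <- if (!is.null(colors)) colors else 1:ncol(data)
  if (is.null(xlim)) xlim <- c(1, nrow(data))
  if (is.null(ylim)) ylim <- range(data)
  matplot(data, type = "o", main = main, xlab = xlab, ylab = ylab,
          xlim = xlim, ylim = ylim, lty = lty, col = col, pch = pch, cex = cex,
          xaxt = if (is.null(labels)) "s" else "n", ...)
  if (!is.null(labels)) axis(1, at = 1:nrow(data), labels = labels)
  segments(xlim[1], pretty(ylim), xlim[2], pretty(ylim), lty = 3)  # dotted guide lines, dotchart-style
}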
Demonstrate how this works with the following data and the set of examples given, as well as any others you feel are relevant.
data <- cbind(c(11,13,15),c(17,19,21), c(12,14,16), c(18,20,22))
#This is a matrix of values rather than just a single series.
data <- (data - min(data))/(max(data)-min(data))
# When the normalize parameter is TRUE, rescale the data to the 0-1 range with the given formula.
matplot(data, main = "Add the main Title", xlab = "Add X-axis label", ylab = "Add Y-Axis label",
        col = "blue", pch = 19, cex = 3, sub = "Add the Subtitle", lty = 1:2, type = 'o')
# Building the dotchart on top of matplot(); main, xlab, ylab, sub, col, pch, cex, and lty
# are passed through and behave as expected.
segments(1, seq(0, 1, by = 0.25), nrow(data), seq(0, 1, by = 0.25), lty = 3)
# segments() adds dotted horizontal guide lines behind the points, as base dotchart() does;
# the positions here match the normalized 0-1 range of the data.
print(data)
## [,1] [,2] [,3] [,4]
## [1,] 0.0000000 0.5454545 0.09090909 0.6363636
## [2,] 0.1818182 0.7272727 0.27272727 0.8181818
## [3,] 0.3636364 0.9090909 0.45454545 1.0000000
# The normalized data, printed above.
# Demonstrating that the approach also works with the following data:
set.seed(100)
data2 <- data.frame(q1=sample(letters[1:10],100,replace=T),
q2=sample(letters[1:10],100,replace=T),
q3=sample(letters[1:10],100,replace=T),
q4=sample(letters[1:10],100,replace=T),
q5=sample(letters[1:10],100,replace=T))
# Applying table() to each column of data2 and verifying that the plotting calls work on the result.
datatable2<-apply(data2,2,table)
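# Note (an assumption worth checking): apply(data2, 2, table) returns a matrix here only
# because every column happens to contain all ten letters; if a letter were missing from
# any column, the result would be a list and the matrix subsetting below would fail.
str(datatable2)   # expected: an integer matrix with 10 rows (letters) and 5 columns (q1-q5)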
matplot(datatable2,main="Everything",xlab="Value", ylab="Category")
matplot(datatable2[,1:2],main="Everything",xlab="Value", ylab="Category")
matplot(datatable2[,1],main="Everything",xlab="Value", ylab="Category")
matplot(as.matrix(datatable2[,1]),main="Everything",xlab="Value", ylab="Category")
matplot(datatable2[,1:3],main="Everything",xlab="Value", ylab="Category")
matplot(datatable2,col=1:5,main="Everything",xlab="Value", ylab="Category")
matplot(datatable2,col=1:5,pch=16,main="Everything",xlab="Value", ylab="Category")
matplot(datatable2,col=1:5,pch=16,cex=2.5,main="Everything",xlab="Value", ylab="Category")
matplot(datatable2,col=1:5,pch=16,cex=2.5,main="Everything normalized",xlab="Value", ylab="Category",normalize=T)
## Warning in plot.window(...): "normalize" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "normalize" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "normalize" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "normalize" is not
## a graphical parameter
## Warning in box(...): "normalize" is not a graphical parameter
## Warning in title(...): "normalize" is not a graphical parameter
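# The warnings above appear because normalize is not a graphical parameter that matplot()
# recognizes, so it gets passed down to the low-level plotting functions. A wrapper has to
# consume the argument itself before calling matplot(). For example, with the mydotchart()
# sketch given earlier (a hypothetical wrapper, not part of base R), the same call works
# without warnings:
mydotchart(datatable2, col = 1:5, pch = 16, cex = 2.5,
           main = "Everything normalized", xlab = "Value", ylab = "Category",
           normalize = TRUE)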
# End of Question 1.
# The following data frame specifies the frequency of each English letter, the points it earns in Scrabble, and the number of Scrabble tiles carrying it.
lf <- c(8.167,1.492,2.782,4.253,12.702,2.228,2.015,6.094,
6.966,0.153,0.772,4.025,2.406,6.749,7.507,1.929,
0.095,5.987,6.327,9.056,2.758,0.978,2.36,0.15,1.974,0.074)/100
pts <- c(1,3,3,2,1,4,2,4,1,8,5,1,3,1,1,3,10,1,1,1,1,4,4,8,4,10)
tiles <- c(9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1)
lf.table <- data.frame(LETTERS,
freq=lf,
points=pts,
ntiles=tiles)
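# Optional sanity checks (a quick sketch): the letter frequencies should sum to roughly 1,
# and the lettered Scrabble tiles (the standard 100-tile set minus its two blanks) should total 98.
sum(lf.table$freq)     # approximately 1
sum(lf.table$ntiles)   # 98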
# The scoreme() function: for any word, split it into its letters and compute statistics based
# on this scoring. It returns the sum of the inverse letter frequencies of the letters, the total
# Scrabble points, the mean number of tiles of the letters in the word, and the length of the word.
scoreme <- function(word)
{
  lets <- strsplit(toupper(word), "")[[1]]      # split the uppercased word into single letters
  data <- matrix(0, ncol = 4, nrow = length(lets))
  for (i in 1:length(lets))
  {
    index <- which(lets[i] == LETTERS)          # row of this letter in lf.table
    data[i, 1] <- lf.table$freq[index]
    data[i, 2] <- lf.table$points[index]
    data[i, 3] <- lf.table$ntiles[index]
  }
  list(suminvfreq = sum(1/data[, 1]),           # sum of inverse letter frequencies
       points = sum(data[, 2]),                 # total Scrabble points
       meantiles = mean(data[, 3]),             # mean number of tiles per letter
       length = length(lets))                   # word length
}
# End of the scoreme() function.
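# For example (the same values appear in the scored table printed further below):
scoreme("CUP")
# suminvfreq ~ 124.04, points = 7, meantiles ~ 2.67, length = 3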
# The following lists a set of words, along with their rank frequency (lower meaning more
# frequent) and their total frequency (number of occurrences in a large corpus). For each word,
# compute the four statistics above using the scoreme function. You can add the results of the
# scoreme function to each row of the test data frame, starting by adding empty variables:
cup <- scoreme("CUP")
found <- scoreme("FOUND")
butterfly <- scoreme("BUTTERFLY")
brew <- scoreme("brew")
cumbersome <- scoreme("CUMBERSOME")
useable <- scoreme("useable")
whittle <- scoreme("WHITTLE")
spiny <- scoreme("SPINY")
uppercase <- scoreme("uppercase")
halfnaked <- scoreme("halfnaked")
bellhop <- scoreme("bellhop")
tetherball <- scoreme("tetherball")
attic <- scoreme("attic")
tearful <- scoreme("tearful")
tailgate <- scoreme("tailgate")
hydraulically <- scoreme("hydraulically")
unsparing <- scoreme("unsparing")
embryogenesis <- scoreme("embryogenesis")
# Building the test table of words, rank frequencies, and raw frequencies:
test <- read.table(text='
rank word frequency
1081 CUP 1441306
2310 FOUND 573305
5285 BUTTERFLY 171410
7371 brew 94904
11821 CUMBERSOME 39698
17331 useable 17790
18526 WHITTLE 15315
25416 SPINY 7207
27381 uppercase 5959
37281 halfnaked 2459
47381 bellhop 1106
57351 tetherball 425
7309 attic 2711
17311 tearful 542
27303 tailgate 198
37310 hydraulically 78
47309 unsparing 35
57309 embryogenesis 22 ',header=T)[,c(2,1,3)]
# Collect the scoreme() results in the same order as the rows of test, then add the four
# statistics as new columns of the data frame.
scores <- list(cup, found, butterfly, brew, cumbersome, useable, whittle, spiny,
               uppercase, halfnaked, bellhop, tetherball, attic, tearful, tailgate,
               hydraulically, unsparing, embryogenesis)
test$meantiles  <- sapply(scores, function(s) s$meantiles)
test$suminvfreq <- sapply(scores, function(s) s$suminvfreq)
test$points     <- sapply(scores, function(s) s$points)
test$length     <- sapply(scores, function(s) s$length)
# Printing the table with all of the computed values:
print(test)
## word rank frequency meantiles suminvfreq points length
## 1 CUP 1081 1441306 2.666667 124.04385 7 3
## 2 FOUND 2310 573305 4.800000 132.79219 9 5
## 3 BUTTERFLY 5285 171410 4.888889 270.32931 17 9
## 4 brew 7371 94904 5.500000 133.97264 9 4
## 5 CUMBERSOME 11821 39698 5.400000 283.92776 18 10
## 6 useable 17331 17790 6.714286 171.92224 9 7
## 7 WHITTLE 18526 15315 5.857143 127.94021 13 7
## 8 SPINY 25416 7207 4.600000 147.47662 10 5
## 9 uppercase 27381 5959 5.888889 236.38227 15 9
## 10 halfnaked 37281 2459 5.444444 286.36268 20 9
## 11 bellhop 47381 1106 4.857143 206.15716 14 7
## 12 tetherball 57351 425 6.300000 199.90076 15 10
## 13 attic 7309 2711 6.400000 84.63001 7 5
## 14 tearful 17311 542 6.142857 153.84862 10 7
## 15 tailgate 27303 198 7.250000 143.27433 9 8
## 16 hydraulically 37310 78 4.692308 343.52430 25 13
## 17 unsparing 47309 35 5.444444 226.46828 12 9
## 18 embryogenesis 57309 22 6.307692 323.29833 21 13
#Making a set of plots showing both rank frequency and total frequency by each of the four statistics based on individual letters (a total of 8 figures).
plot(test$rank, test$meantiles, pch=23, col="green", xlab="Rank", ylab="Meantiles", main="Rank vs. Meantiles (correlation 0.29)")
plot(test$rank, test$suminvfreq, pch=19, col="red", xlab="Rank", ylab="Suminvfreq", main="Rank vs. Suminvfreq (correlation 0.52)")
plot(test$rank, test$points, pch=25, col="blue", xlab="Rank", ylab="Points", main="Rank vs. Points (correlation 0.53)")
plot(test$rank, test$length, pch=16, col="orange", xlab="Rank", ylab="Length", main="Rank vs. Length (correlation 0.67)")
plot(test$frequency, test$meantiles, pch=23, col="green", xlab="Frequency", ylab="Meantiles", main="Frequency vs. Meantiles (correlation -0.74)")
plot(test$frequency, test$suminvfreq, pch=19, col="red", xlab="Frequency", ylab="Suminvfreq", main="Frequency vs. Suminvfreq (correlation -0.30)")
plot(test$frequency, test$points, pch=25, col="blue", xlab="Frequency", ylab="Points", main="Frequency vs. Points (correlation -0.36)")
plot(test$frequency, test$length, pch=16, col="orange", xlab="Frequency", ylab="Length", main="Frequency vs. Length (correlation -0.51)")
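# A more compact alternative (a sketch, assuming the test data frame built above): loop over
# the two corpus statistics and the four letter-based statistics, computing each correlation
# with cor() so the titles cannot drift out of sync with the data.
xvars <- c("rank", "frequency")
yvars <- c("meantiles", "suminvfreq", "points", "length")
par(mfrow = c(2, 4))                      # arrange the eight panels in a 2 x 4 grid
for (xv in xvars) {
  for (yv in yvars) {
    r <- cor(test[[xv]], test[[yv]])      # correlation for this pairing
    plot(test[[xv]], test[[yv]], pch = 16, xlab = xv, ylab = yv,
         main = sprintf("%s vs. %s (r = %.2f)", xv, yv, r))
  }
}
par(mfrow = c(1, 1))                      # restore the default single-panel layout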
#Computing the correlation between each pairing using the cor() function:
cor(test$frequency,test$rank)
## [1] -0.4801456
cor(test$frequency,test$meantiles)
## [1] -0.7381602
cor(test$frequency,test$suminvfreq)
## [1] -0.3046734
cor(test$frequency,test$points)
## [1] -0.3609458
cor(test$frequency,test$length)
## [1] -0.5092806
cor(test$rank,test$frequency)
## [1] -0.4801456
cor(test$rank,test$meantiles)
## [1] 0.2916469
cor(test$rank,test$suminvfreq)
## [1] 0.5217017
cor(test$rank,test$points)
## [1] 0.5312453
cor(test$rank,test$length)
## [1] 0.6679547
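# The same pairings can also be read off a single correlation matrix (a compact alternative;
# output omitted here):
round(cor(test[, c("rank", "frequency", "meantiles", "suminvfreq", "points", "length")]), 2)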
Question: Discuss whether you see any relationships between either of the two corpus statistics on word frequency and any of the three letter-based statistics:
Answer: The correlations between raw frequency and the letter-based statistics are all negative, while the correlations between rank and the letter-based statistics are all positive. In other words, as a word's frequency increases, its letter-based scores (sum of inverse letter frequency, points, mean tiles, and length) tend to decrease; as its rank increases (i.e., the word becomes rarer), those scores tend to increase. The opposite signs are expected because rank and raw frequency are themselves inversely related: a larger rank number means a less frequent word. Note that the sign only gives the direction of a relationship, not its strength; judging by magnitude, the rank correlations (about 0.29 to 0.67) are mostly somewhat stronger than the frequency correlations, with the exception of frequency versus mean tiles (-0.74).
Question: If you see a relationship, try to suggest why you might be seeing it. Explain why it is positive or negative.
Answer: A negative correlation means the variables are inversely related, and a positive correlation means they are directly related. Here, raw frequency is inversely related to the sum of inverse letter frequencies, length, points, and mean tiles, while rank is directly related to them. A likely reason is word length: the frequent words in this list tend to be short, and short words accumulate fewer Scrabble points and a smaller sum of inverse letter frequencies, whereas the rare (high-rank) words tend to be longer, which pushes those totals up. And because rank rises as frequency falls, the rank correlations necessarily carry the opposite sign of the frequency correlations.
Question: Discuss relative advantages of looking at rank frequency versus raw frequency.
Answer: Rank frequency is bounded and spreads the words fairly evenly along the axis, so differences between individual words stay visible in a plot. Raw frequency is heavily skewed: it spans several orders of magnitude (from 22 up to about 1.4 million in this list), so most words pile up near zero and their exact positions are hard to read. On the other hand, raw frequency preserves the actual size of the differences between words, which the rank ordering throws away.
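# A quick way to see this skew (a sketch; output not shown): the raw frequencies span roughly
# 22 to 1.4 million, while the ranks span roughly 1,081 to 57,351 and are far more evenly
# spread. Plotting rank against log10(frequency) makes the individual words legible again.
summary(test$frequency)
summary(test$rank)
plot(test$rank, log10(test$frequency), pch = 16,
     xlab = "Rank", ylab = "log10(frequency)", main = "Rank vs. log10 raw frequency")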