Problem Set 3

Problem Set 3: Graphics

Create a new dotchart function based on the matplot version we did in class (do not use the built-in dotchart function). It should:

a) Have the following arguments: mydotchart(data, labels, colors, main, xlab, ylab, xlim, ylim, lty, normalize, col, pch, cex, subsets)

b) Take a matrix of values rather than just a single series. Each column should represent a different series, so data <- cbind(c(1,3,5),c(10,2,3)) should work.

c) Permit normalization if the normalize parameter is TRUE. If so, it will rescale everything to the range 0 to 1.0, based on the smallest versus largest value in the data matrix. Assume all values will be positive; you can do this for a data matrix as follows: data <- (data - min(data))/(max(data)-min(data))

d) col, pch, cex, lty, etc. should do something reasonable, like control the size of individual points, entire series, or the whole chart.

Demonstrate how this works with the following data and the set of examples given, as well as others you feel are relevant.

data <- cbind(c(11,13,15),c(17,19,21), c(12,14,16), c(18,20,22)) 
#This is a matrix of values rather than just a single series.

data <- (data - min(data))/(max(data)-min(data))
#Normalizing as if the normalize parameter were TRUE, using the given equation.

matplot(data,main = "Add the main Title", xlab ="Add X-axis label",ylab ="\n Add Y-Axis label",col = "blue",pch = 19,cex = 3,sub = "Add the Subtitle", lty=1:2, type = 'o')
#Plotting the normalized matrix with matplot, the basis for the new dotchart.
#col, pch, cex, and lty set the color, symbol, size, and line type of the series.


segments(0,1:10,100,1:10,lty=3)

# segments() adds dotted horizontal lines at y = 1, ..., 10.

print(data)

##           [,1]      [,2]       [,3]      [,4]
## [1,] 0.0000000 0.5454545 0.09090909 0.6363636
## [2,] 0.1818182 0.7272727 0.27272727 0.8181818
## [3,] 0.3636364 0.9090909 0.45454545 1.0000000

#Printing Data


#Demonstrating that it works with the following:

set.seed(100)
data2 <- data.frame(q1=sample(letters[1:10],100,replace=T),
                   q2=sample(letters[1:10],100,replace=T),
                   q3=sample(letters[1:10],100,replace=T),
                   q4=sample(letters[1:10],100,replace=T), 
                   q5=sample(letters[1:10],100,replace=T))

#Applying the given functions and verifying if they are working. 

datatable2<-apply(data2,2,table)
matplot(datatable2,main="Everything",xlab="Value", ylab="Category")

matplot(datatable2[,1:2],main="Everything",xlab="Value", ylab="Category")

matplot(datatable2[,1],main="Everything",xlab="Value", ylab="Category")  
matplot(as.matrix(datatable2[,1]),main="Everything",xlab="Value", ylab="Category")

matplot(datatable2[,1:3],main="Everything",xlab="Value", ylab="Category")

matplot(datatable2,col=1:5,main="Everything",xlab="Value", ylab="Category")

matplot(datatable2,col=1:5,pch=16,main="Everything",xlab="Value", ylab="Category")

matplot(datatable2,col=1:5,pch=16,cex=2.5,main="Everything",xlab="Value", ylab="Category")

matplot(datatable2,col=1:5,pch=16,cex=2.5,main="Everything normalized",xlab="Value", ylab="Category",normalize=T)

## Warning in plot.window(...): "normalize" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "normalize" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "normalize" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "normalize" is not
## a graphical parameter

## Warning in box(...): "normalize" is not a graphical parameter

## Warning in title(...): "normalize" is not a graphical parameter
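
The warnings above arise because matplot has no normalize argument, so the value is passed on to the low-level plotting functions, which do not recognise it. One way to get a normalized version of the plot is to rescale the matrix before calling matplot. The sketch uses a small hypothetical count matrix standing in for datatable2:

```r
# Hypothetical counts standing in for datatable2
# (rows = categories, columns = questions)
counts <- cbind(q1 = c(8, 12, 10), q2 = c(11, 9, 14))

# Rescale to the range 0 to 1 using the global minimum and maximum
normalized <- (counts - min(counts)) / (max(counts) - min(counts))

matplot(normalized, col = 1:2, pch = 16, cex = 2.5,
        main = "Everything normalized", xlab = "Value", ylab = "Category")
```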

#This is the end of Question 1.

Correlating word frequency with SCRABBLE scores

#The following data frame specifies, for each letter, its frequency in English, the points it earns in Scrabble, and its number of Scrabble tiles.

lf <- c(8.167,1.492,2.782,4.253,12.702,2.228,2.015,6.094,
        6.966,0.153,0.772,4.025,2.406,6.749,7.507,1.929,
        0.095,5.987,6.327,9.056,2.758,0.978,2.36,0.15,1.974,0.074)/100
pts <- c(1,3,3,2,1,4,2,4,1,8,5,1,3,1,1,3,10,1,1,1,1,4,4,8,4,10)
tiles <- c(9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1)
lf.table <- data.frame(LETTERS,
                       freq=lf,
                       points=pts,
                       ntiles=tiles)

#Adding the scoreme function: for any word, you can split it into its letters and then compute some statistics based on this scoring. The following computes the sum of the inverse letter frequencies of the letters, the total Scrabble points, the mean number of tiles of the letters in the word, and the length of the word:


scoreme <- function(word)
{
  # Split the upper-cased word into single letters (base R toupper, no extra package needed)
  lets <- strsplit(toupper(word),"")[[1]]
  # One row per letter; columns hold frequency, points, and tile count
  data <- matrix(0,ncol=3,nrow=length(lets))

  for(i in seq_along(lets))
  {
    # Row of lf.table that matches this letter
    index <- which(lets[i]==LETTERS)
    data[i,1] <- lf.table$freq[index]
    data[i,2] <- lf.table$points[index]
    data[i,3] <- lf.table$ntiles[index]
  }
  list(suminvfreq= sum(1/data[,1]),
       points=sum(data[,2]),
       meantiles=mean(data[,3]),
       length=length(lets))
}
#This is the end of the scoreme function.
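
The letter lookup in the loop can also be done in one step with match(), which returns the position of each letter in LETTERS. The sketch below repeats the letter tables so that it runs on its own. The name scoreme_vec is hypothetical, and base R's toupper handles the upper-casing.

```r
lf <- c(8.167,1.492,2.782,4.253,12.702,2.228,2.015,6.094,
        6.966,0.153,0.772,4.025,2.406,6.749,7.507,1.929,
        0.095,5.987,6.327,9.056,2.758,0.978,2.36,0.15,1.974,0.074)/100
pts <- c(1,3,3,2,1,4,2,4,1,8,5,1,3,1,1,3,10,1,1,1,1,4,4,8,4,10)
tiles <- c(9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1)

# Hypothetical vectorized variant of scoreme()
scoreme_vec <- function(word) {
  lets <- strsplit(toupper(word), "")[[1]]
  index <- match(lets, LETTERS)   # position of each letter in the alphabet
  list(suminvfreq = sum(1 / lf[index]),
       points = sum(pts[index]),
       meantiles = mean(tiles[index]),
       length = length(lets))
}

str(scoreme_vec("CUP"))
```

For "CUP" the points work out as C (3) + U (1) + P (3) = 7, which matches the first row of the table printed further down.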



# The following lists a set of words, along with their rank frequency (lower meaning more frequent), and their total frequency (number of occurrences in a large corpus). For each word, compute the four statistics above using the scoreme function. You can add the results of the scoreme function to each row of the test data frame, starting by adding empty variables:
cup <- scoreme("CUP")
found <- scoreme("FOUND")
butterfly <- scoreme("BUTTERFLY")
brew <- scoreme("brew")
cumbersome <- scoreme("CUMBERSOME")
useable <- scoreme("useable")
whittle <- scoreme("WHITTLE")
spiny <- scoreme("SPINY")
uppercase <- scoreme("uppercase")
halfnaked <- scoreme("halfnaked")
bellhop <- scoreme("bellhop")
tetherball <- scoreme("tetherball")
attic <- scoreme("attic")
tearful <- scoreme("tearful")
taligate <- scoreme("tailgate")
hydraulically <- scoreme("hydraulically")
unsparing <- scoreme("unsparing")
embryogenesis <- scoreme("embryogenesis")

#Adding the table here. 

test <- read.table(text='
rank    word          frequency
1081    CUP           1441306
2310    FOUND         573305
5285    BUTTERFLY     171410
7371    brew          94904    
11821   CUMBERSOME    39698
17331   useable       17790 
18526   WHITTLE       15315
25416   SPINY         7207
27381   uppercase     5959
37281   halfnaked     2459
47381   bellhop       1106 
57351   tetherball    425
7309    attic         2711    
17311   tearful       542 
27303   tailgate      198 
37310   hydraulically 78  
47309   unsparing     35  
57309   embryogenesis 22 ',header=T)[,c(2,1,3)]


# Fill the meantiles column with the values computed by the scoreme function
test$meantiles<-c(cup$meantiles,found$meantiles,butterfly$meantiles,brew$meantiles,cumbersome$meantiles,useable$meantiles,whittle$meantiles,spiny$meantiles,uppercase$meantiles,halfnaked$meantiles,bellhop$meantiles,tetherball$meantiles,attic$meantiles,tearful$meantiles,taligate$meantiles,hydraulically$meantiles,unsparing$meantiles,embryogenesis$meantiles) 

# Fill the suminvfreq column with the values computed by the scoreme function
test$suminvfreq<-c(cup$suminvfreq,found$suminvfreq,butterfly$suminvfreq,brew$suminvfreq,cumbersome$suminvfreq,useable$suminvfreq,whittle$suminvfreq,spiny$suminvfreq,uppercase$suminvfreq,halfnaked$suminvfreq,bellhop$suminvfreq,tetherball$suminvfreq,attic$suminvfreq,tearful$suminvfreq,taligate$suminvfreq,hydraulically$suminvfreq,unsparing$suminvfreq,embryogenesis$suminvfreq)

# Fill the points column with the values computed by the scoreme function
test$points<- c(cup$points,found$points,butterfly$points,brew$points,cumbersome$points,useable$points,whittle$points,spiny$points,uppercase$points,halfnaked$points,bellhop$points,tetherball$points,attic$points,tearful$points,taligate$points,hydraulically$points,unsparing$points,embryogenesis$points)

# Fill the length column with the values computed by the scoreme function
test$length<- c(cup$length,found$length,butterfly$length,brew$length,cumbersome$length,useable$length,whittle$length,spiny$length,uppercase$length,halfnaked$length,bellhop$length,tetherball$length,attic$length,tearful$length,taligate$length,hydraulically$length,unsparing$length,embryogenesis$length)
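
Since scoreme returns a list with the same four names for every word, the columns can also be filled without one variable per word. The sketch below shows the pattern on three of the words. To run on its own, it uses a hypothetical stand-in scorer, word_stats, that computes only points and length.

```r
pts <- c(1,3,3,2,1,4,2,4,1,8,5,1,3,1,1,3,10,1,1,1,1,4,4,8,4,10)
names(pts) <- LETTERS

# Hypothetical stand-in for scoreme(): returns two of the four statistics
word_stats <- function(word) {
  lets <- strsplit(toupper(word), "")[[1]]
  list(points = sum(pts[lets]), length = length(lets))
}

words <- data.frame(word = c("CUP", "FOUND", "brew"), stringsAsFactors = FALSE)

# One list per word, then one column per statistic
stats <- lapply(words$word, word_stats)
words$points <- sapply(stats, function(s) s$points)
words$length <- sapply(stats, function(s) s$length)
print(words)
```

With the full data, as.character(test$word) would take the place of words$word, and the same sapply pattern would fill all four columns.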

#Printing the table with all the values here:

print(test)

##             word  rank frequency meantiles suminvfreq points length
## 1            CUP  1081   1441306  2.666667  124.04385      7      3
## 2          FOUND  2310    573305  4.800000  132.79219      9      5
## 3      BUTTERFLY  5285    171410  4.888889  270.32931     17      9
## 4           brew  7371     94904  5.500000  133.97264      9      4
## 5     CUMBERSOME 11821     39698  5.400000  283.92776     18     10
## 6        useable 17331     17790  6.714286  171.92224      9      7
## 7        WHITTLE 18526     15315  5.857143  127.94021     13      7
## 8          SPINY 25416      7207  4.600000  147.47662     10      5
## 9      uppercase 27381      5959  5.888889  236.38227     15      9
## 10     halfnaked 37281      2459  5.444444  286.36268     20      9
## 11       bellhop 47381      1106  4.857143  206.15716     14      7
## 12    tetherball 57351       425  6.300000  199.90076     15     10
## 13         attic  7309      2711  6.400000   84.63001      7      5
## 14       tearful 17311       542  6.142857  153.84862     10      7
## 15      tailgate 27303       198  7.250000  143.27433      9      8
## 16 hydraulically 37310        78  4.692308  343.52430     25     13
## 17     unsparing 47309        35  5.444444  226.46828     12      9
## 18 embryogenesis 57309        22  6.307692  323.29833     21     13

#Making a set of plots showing both rank frequency and total frequency by each of the four statistics based on individual letters (a total of 8 figures).

plot(test$rank,test$meantiles, pch=23, col="green", xlab = "Rank",ylab = "Meantiles", main="Rank vs. Meantiles with Correlation 0.29")

plot(test$rank,test$suminvfreq, pch=19,col="red",xlab = "Rank",ylab = "Sum", main="Rank vs. Sum with Correlation 0.52")

plot(test$rank,test$points, pch=25, col="blue",xlab = "Rank",ylab = "Points" , main="Rank vs. Points with Correlation 0.53")

plot(test$rank,test$length, pch=16, col="orange",xlab = "Rank",ylab = "Length", main="Rank vs. Length with Correlation 0.67")

plot(test$frequency,test$meantiles, pch=23, col="green", xlab = "Frequency",ylab = "Meantiles", main="Frequency vs. Meantiles with Correlation -0.74")

plot(test$frequency,test$suminvfreq, pch=19,col="red",xlab = "Frequency",ylab = "Sum", main="Frequency vs. Sum with Correlation -0.30")

plot(test$frequency,test$points, pch=25, col="blue",xlab = "Frequency",ylab = "Points", main="Frequency vs. Points with Correlation -0.36")

plot(test$frequency,test$length, pch=16, col="orange",xlab = "Frequency",ylab = "Length", main="Frequency vs. Length with Correlation -0.51")
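
Typing each correlation into the plot title by hand makes it easy for a title to drift from the computed value. As a sketch, the title can be built from cor() directly. The helper names cor_title and plot_with_cor are hypothetical, and the data are only the first five rows of the table printed above, so the correlation differs from the full-data value.

```r
# First five rows of the printed table (rank and length only)
rank <- c(1081, 2310, 5285, 7371, 11821)
len  <- c(3, 5, 9, 4, 10)

# Build a title that always matches the plotted data
cor_title <- function(xname, yname, x, y) {
  sprintf("%s vs. %s (r = %.2f)", xname, yname, cor(x, y))
}

# Scatterplot whose title carries the computed correlation
plot_with_cor <- function(x, y, xname, yname, ...) {
  plot(x, y, xlab = xname, ylab = yname,
       main = cor_title(xname, yname, x, y), ...)
}

plot_with_cor(rank, len, "Rank", "Length", pch = 16, col = "orange")
```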

#Computing the correlation between each pairing using the cor() function:
cor(test$frequency,test$rank)

## [1] -0.4801456

cor(test$frequency,test$meantiles)

## [1] -0.7381602

cor(test$frequency,test$suminvfreq)

## [1] -0.3046734

cor(test$frequency,test$points)

## [1] -0.3609458

cor(test$frequency,test$length)

## [1] -0.5092806

cor(test$rank,test$frequency)

## [1] -0.4801456

cor(test$rank,test$meantiles)

## [1] 0.2916469

cor(test$rank,test$suminvfreq)

## [1] 0.5217017

cor(test$rank,test$points)

## [1] 0.5312453

cor(test$rank,test$length)

## [1] 0.6679547
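
The separate calls above can be collapsed into one: cor() accepts two data frames or matrices and returns every column-by-column correlation. A sketch using the rank, frequency, points, and length columns copied from the printed table (the data frame name words is hypothetical):

```r
words <- data.frame(
  rank = c(1081, 2310, 5285, 7371, 11821, 17331, 18526, 25416, 27381,
           37281, 47381, 57351, 7309, 17311, 27303, 37310, 47309, 57309),
  frequency = c(1441306, 573305, 171410, 94904, 39698, 17790, 15315, 7207, 5959,
                2459, 1106, 425, 2711, 542, 198, 78, 35, 22),
  points = c(7, 9, 17, 9, 18, 9, 13, 10, 15, 20, 14, 15, 7, 10, 9, 25, 12, 21),
  length = c(3, 5, 9, 4, 10, 7, 7, 5, 9, 9, 7, 10, 5, 7, 8, 13, 9, 13)
)

# Rows: corpus statistics; columns: letter-based statistics
cor_table <- cor(words[, c("rank", "frequency")], words[, c("points", "length")])
print(round(cor_table, 2))
```

Passing method = "spearman" to the same call would give rank-based correlations instead of the default Pearson ones.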

Question: Discuss whether you see any relationships between either of the two corpus statistics on word frequency and any of the three letter-based statistics:

Answer: The correlations between frequency and the letter-based statistics are all negative (from -0.30 for the sum of inverse letter frequencies to -0.74 for mean tiles), while the correlations between rank and the letter-based statistics are all positive (from 0.29 for mean tiles to 0.67 for length). In other words, as frequency increases the letter-based statistics tend to decrease, and as rank increases (the word becomes rarer) they tend to increase. The sign gives only the direction of a relationship; its strength is given by the magnitude. The strongest relationships here are frequency with mean tiles (-0.74) and rank with length (0.67), and the weakest are rank with mean tiles (0.29) and frequency with the sum (-0.30).

Question: If you see a relationship, try to suggest why you might be seeing it. Explain why it is positive or negative.

Answer: A negative correlation means the two variables move in opposite directions, and a positive correlation means they move in the same direction. Frequency is therefore inversely related to the sum, length, points, and mean tiles, while rank is directly related to them. The signs are opposite because rank and frequency are themselves inversely related (correlation -0.48): a lower rank means a more frequent word.

Question: Discuss relative advantages of looking at rank frequency versus raw frequency.

Answer: Rank frequency takes smaller values in a narrower range (here from 1081 to 57351), which are easier to read and compare, while raw frequencies span a much wider range (here from 22 to 1441306), which makes it hard to pick out individual values on a plot.

Sneha Nimmagadda

22 September 2018