Chapter 2

Section 2.1: Scatterplots

Example: Botnets

A botnet is a remotely and silently controlled collection of networked computers. Botnets are illicitly created through the use of viruses, Trojans, and other malware to assimilate computers, or bots, into the botnet, generally without the knowledge of the computer owner. The data set below includes information on botnets that are used to send spam messages.

spam<-read.file("/home/emesekennedy/Data/Ch2/spam.txt")
## Reading data with read.table()

Create a scatterplot to examine the relationship between the number of bots (in thousands) in a botnet and the number of spam messages per day (in billions) that the botnet can send. Here, we will use the number of bots as the explanatory variable and the number of spam messages as the response variable.

xyplot(SpamsPerDay~Bots, data=spam)

The scatterplot has a cluster of points in the bottom left corner, and there appears to be an outlier.

Let’s look at the boxplots for each of the two variables to verify that there is an outlier.

bwplot(~Bots, data=spam)

bwplot(~SpamsPerDay, data=spam)

The boxplots indicate that the variable SpamsPerDay has an outlier. Let’s create a new data set without the outlier and look at the corresponding scatterplot.

spam2<-subset(spam, SpamsPerDay<40)
xyplot(SpamsPerDay~Bots, data=spam2)
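
Box-and-whisker plots usually flag a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. The sketch below applies that rule with base R's boxplot.stats(). The vector spams is made up for illustration and does not come from spam.txt.

```r
# Hypothetical values standing in for SpamsPerDay (not the real data)
spams <- c(0.6, 1.5, 2, 3, 5, 9, 10, 16, 30, 60)

# boxplot.stats() applies the 1.5 * IQR rule used by box-and-whisker plots
# and returns the flagged points in the $out component
flagged <- boxplot.stats(spams)$out
flagged

# Keep only the observations that were not flagged
spams_clean <- spams[!spams %in% flagged]
spams_clean
```

A fixed cutoff such as SpamsPerDay<40 works equally well once the outlier has been identified; the rule-based version is only useful when the cutoff is not obvious from the plot.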

The form of the original scatterplot appears to be fairly linear, so let’s look at a scatterplot that also draws a straight line through the data.

xyplot(SpamsPerDay~Bots, data=spam, panel=panel.lmbands)
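
To see the intercept and slope of a fitted straight line as numbers, one option is base R's lm(). This is a minimal sketch assuming the plotted line is an ordinary least-squares fit. The data frame toy is hypothetical and is not the spam data.

```r
# Hypothetical data: bots (in thousands) and spam per day (in billions)
toy <- data.frame(
  Bots        = c(10, 20, 30, 40, 50),
  SpamsPerDay = c(1.2, 2.9, 5.1, 6.8, 9.0)
)

# Fit SpamsPerDay as a linear function of Bots by least squares
fit <- lm(SpamsPerDay ~ Bots, data = toy)

# Intercept and slope of the fitted line
coef(fit)
```

The same call with data=spam would give the coefficients for the real data set.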

Example: Debt of Countries in Two Consecutive Years

debt<-read.file("/home/emesekennedy/Data/Ch2/debt.txt", sep="\t", header=TRUE)
## Reading data with read.table()
xyplot(Debt2007~Debt2006, data=debt)

As expected, the scatterplot indicates that there is a very strong linear relationship between the debt of a country in 2006 and the debt in 2007. The two variables are positively associated.

Example: Forbes Best Countries for Business

best<-read.file("/home/emesekennedy/Data/Ch2/bestcountries.txt", sep="\t", header=TRUE)
## Reading data with read.table()

Let’s create a scatterplot to examine the relationship between a country’s unemployment rate and its gross domestic product (GDP) per capita.

xyplot(GDPPerCap~Unemployment, data=best)

The relationship between these two variables is clearly not linear. Most of the data points are in the lower left of the plot, so the data are skewed toward larger values. Let’s apply a log transformation to both variables (here, log means the natural logarithm, with base e). This type of transformation can only be used if all values are positive.

best<-transform(best, LogGDPPerCap=log(GDPPerCap))
best<-transform(best, LogUnemployment=log(Unemployment))
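
Because the log transformation requires positive values, it can help to check a variable before transforming it. The sketch below uses a made-up vector, rates, that deliberately includes a zero to show what goes wrong. It is not taken from bestcountries.txt.

```r
# Hypothetical unemployment rates (percent); the zero is included on purpose
rates <- c(2.5, 5, 9.8, 0)

# TRUE only if every value is strictly positive
all_positive <- all(rates > 0)
all_positive

# log() of zero is -Inf, which cannot be shown on a scatterplot
log(rates)
```

For the real data, the analogous checks would be all(best$GDPPerCap > 0) and all(best$Unemployment > 0), assuming neither column contains missing values.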

Now, look at the scatterplot for the transformed variables.

xyplot(LogGDPPerCap~LogUnemployment, data=best)

There does not seem to be much of a relationship between the two variables when the log of the unemployment rate is less than approximately 1.6. When the log of the unemployment rate is greater than 1.6, there seems to be a negative linear relationship between the two transformed variables.

When the log of the unemployment rate is 1.6, we can find the unemployment rate (i.e., undo the log transformation) by computing e^(1.6). We can do this with the following command:

exp(1.6)
## [1] 4.953032

So a log unemployment rate of 1.6 corresponds to an unemployment rate of approximately 5% (the unemployment rate is measured in percent).
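
The back-transformation works because exp() and log() are inverse functions. A short sketch of the round trip, starting from an unemployment rate of 5% (a value chosen here for illustration):

```r
# log() and exp() are inverses: exp(log(x)) returns x
rate <- 5                # unemployment rate in percent
log_rate <- log(rate)    # natural log, roughly 1.609
back <- exp(log_rate)    # recovers the original rate
back
```

Note that this only holds for the natural logarithm; a value produced with log10() would be undone with 10^x instead.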