WWWusage dataset
WWWusage was the dataset used for the following analysis. This dataset is a time series of 100 minutes recording how many users were connected to the internet through a server during each minute. The source of the data is J. Durbin and S.J. Koopman’s book Time Series Analysis by State Space Methods.
Because this is a fairly straightforward dataset, let’s start off with some simple analysis. The number below is the mean of the data. In this case the value is the average number of users on this server per minute.
mean(WWWusage)
## [1] 137.08
The output below shows the smallest number of people who were on the server in any one minute.
min(WWWusage)
## [1] 83
Now the output shows the maximum number of people that were on this server during a minute.
max(WWWusage)
## [1] 228
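Finally, the output below shows the sum of the per-minute user counts over the whole 100 minutes. Because a user who stays connected is counted again in every minute, this is a total of user-minutes rather than a count of distinct people.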
sum(WWWusage)
## [1] 13708
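The individual calls above can also be gathered into one named vector, which makes the numbers easier to compare side by side. This is only a minimal sketch: the name `stats` is a placeholder introduced here, and WWWusage is assumed to be available from base R's datasets package.

```r
# WWWusage ships with base R (package 'datasets'), so nothing needs to be loaded.
# 'stats' is a hypothetical name for the collected summary values.
stats <- c(
  mean  = mean(WWWusage),
  min   = min(WWWusage),
  max   = max(WWWusage),
  total = sum(WWWusage)
)

# Round for display; the values are the same ones printed one at a time above.
print(round(stats, 2))
```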
To look at the amount of spread within the data, we can compute the standard deviation for WWWusage, which is shown below. This gives us a better idea of how spread out the data is. We will see this represented visually in the next section.
sd(WWWusage)
## [1] 39.99941
This number means that about 65% of our data points fall within 39.999 of our mean of 137.08 visitors, so 65% fall between roughly 97 and 177 visitors. This will match up with our visuals in the next section.
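The share of minutes within one standard deviation of the mean can be counted directly instead of being read off a rule of thumb. The sketch below is one way to do that; the variable names (`x`, `m`, `s`, `within_one_sd`) are placeholders introduced for illustration.

```r
# Work on a plain numeric vector so the comparison below is an ordinary logical vector.
x <- as.numeric(WWWusage)

m <- mean(x)  # centre of the data
s <- sd(x)    # sample standard deviation

# TRUE for every minute whose user count lies within one sd of the mean.
within_one_sd <- abs(x - m) <= s

# The mean of a logical vector is the proportion of TRUE values.
print(mean(within_one_sd))
```

For comparison, a perfectly normal distribution would put about 68% of its values inside this band, so the result also gives a rough sense of how far the data is from bell-shaped.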
An easier way to interpret this data is through visuals. Various graphics can be created to help understand this data. For the bar graph below, WWWusage was divided into four time segments. These increments were each 25 minutes long, and the graph displays the number of visitors that the internet server had during each 25 minute period. 25 minute increments were chosen so the 100 minutes could be broken up into even fourths, for easier analysis.
Users<-c(2823,3707,2967,4211)
barplot(Users,ylim=c(0,4500), xlim = c(0,4.5), main="Internet Users over 100 minutes \n in 25 Minute Increments", ylab="Number of Internet Users",names.arg = c("1 to 25 Minutes","26 to 50 Minutes","51 to 75 Minutes","76 to 100 Minutes"),col="lightblue")
text(.7,3000,"2823")
text(1.9,3850,"3707")
text(3.1,3150,"2967")
text(4.3,4400,"4211")
As you can see on the graph above, the busiest time for the internet server was the last 25 minute increment, with 4211 visitors. The slowest time for the server was the first 25 minutes, with only 2823 visitors. Also, having the graph in light blue is just fun.
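The four totals typed into Users do not have to be entered by hand. The sketch below shows one way to compute them from WWWusage and to place the labels using the bar midpoints that barplot() returns. The names `labels`, `segment` and `bp` are placeholders introduced for this example, and the label offset of 150 is an arbitrary choice.

```r
# Labels for the four 25-minute blocks (same wording as the bar graph above).
labels <- c("1 to 25 Minutes", "26 to 50 Minutes",
            "51 to 75 Minutes", "76 to 100 Minutes")

# Tag each of the 100 minutes with its block; fixing the factor levels keeps
# the blocks in chronological order.
segment <- factor(rep(labels, each = 25), levels = labels)

# Sum the per-minute counts inside each block.
Users <- as.vector(tapply(as.numeric(WWWusage), segment, sum))

# barplot() returns the x position of each bar's midpoint, so the totals can be
# placed above the bars without guessing coordinates.
bp <- barplot(Users, ylim = c(0, 4500), names.arg = labels, col = "lightblue",
              main = "Internet Users over 100 minutes \n in 25 Minute Increments",
              ylab = "Number of Internet Users")
text(bp, Users + 150, labels = Users)
```

Computing the totals this way keeps the graph in step with the data if the block length is ever changed.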
A line chart provides a visual of what the data looks like in sequential order. Because the dataset is a time-series this also means that this line chart is in chronological order. Each of the circles on the line chart represents a different minute during the time-series.
Below there is a black line that represents the mean at 137.08. The two blue lines show one standard deviation above and below the mean.
plot(WWWusage,type="o",col="red",xlab="Minute",ylab="Number of Visitors",main="Number of Visitors to the Server \n by Each Minute")
abline(h=137.08,col='black')
abline(h=177.08,col='blue')
abline(h=97.08,col='blue')
As you can see, the data fluctuates in the same way as the bar graph with the 25 minute increments does. However, we get a more exact picture by looking at it from a minute to minute perspective. You can see that each of the most visited times lines up with the larger increments. The peak of the traffic happened in the last 20 or so minutes, which explains why the largest increment was the last one in the bar graph above.
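The reference lines in the line chart use rounded values typed in by hand (137.08, 177.08 and 97.08). A small variation, sketched below, computes them from the data so they stay exact. The names `m` and `s` are placeholders introduced here.

```r
m <- mean(WWWusage)  # height of the black line
s <- sd(WWWusage)    # distance from the mean to each blue line

plot(WWWusage, type = "o", col = "red",
     xlab = "Minute", ylab = "Number of Visitors",
     main = "Number of Visitors to the Server \n by Each Minute")

# abline() accepts a vector for h, so both blue lines are drawn in one call.
abline(h = m, col = "black")
abline(h = m + c(-1, 1) * s, col = "blue")
```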
Perhaps we would like to look at just how many busy minutes there were. The histogram below shows how many minutes there were with visitors in varying categories.
hist(WWWusage, breaks=4,labels = TRUE,col="violet",ylim=c(0,50))
As you can see above, there were 27 separate minutes with 50 to 100 visitors. There were 41 different minutes with 100 to 150 visitors, 23 minutes with 150 to 200 visitors, and only 9 minutes with over 200 visitors. As you can see from the line chart, all 9 of these high traffic minutes were in the last 25 minute increment of the time series.
A dataset can and should be analyzed in many different ways. In the first section the raw numbers are summarized to give information about the dataset. The mean, maximum users, minimum users, total users, and standard deviation all give useful information. However, the visuals in the second section are needed to better communicate that information. As stated above, each of the visuals is useful in different contexts and for different information. These visuals allow an audience to easily see what story the data tells. Although these are seemingly easy graphs to use, the R code needed to make them useful gets fairly long. These are just some ways to analyze the simple dataset WWWusage.
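The bar heights in the histogram can also be read as numbers instead of off the plot. The sketch below asks hist() for its bins without drawing anything, then lists which minutes had more than 200 visitors. The names `h` and `busy_minutes` are placeholders introduced for this example. Note that hist() by default counts a value that sits exactly on a break in the bin below it, so 150 visitors falls in the 100 to 150 bin.

```r
# plot = FALSE returns the histogram object instead of drawing it.
h <- hist(WWWusage, breaks = 4, plot = FALSE)

print(h$breaks)  # edges of the bins that R chose
print(h$counts)  # number of minutes in each bin

# Positions (minute numbers) of the minutes with more than 200 visitors.
busy_minutes <- which(as.numeric(WWWusage) > 200)
print(busy_minutes)
```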