A note: I write labs (and many other things for the class) in RMarkdown. Markdown in general is a widely used mark-up language that allows enriched text and can be compiled into html, pdf, and more. RMarkdown has the additional benefit of working as a code notebook where you can insert and run R code. See this student’s writeup about RMarkdown for guidance on RMarkdown if you want to learn more.
mean(X) as a measure of centerquantile(X) as indicators of common vs. extremal behaviorsum(<X fits criteria?>)/length(X) as a measure of frequencysd(X) as a measure of how consistent or spread out the data is##Importing the data set into RStudio
Download the csv (comma-seperated-value) data file for our (fake) Music App. CSV is a standard text file format for working with data on your local machine.
In RStudio, locate the “ImportDataset” option in the “Environment” pane. Use this to select the file type. For this class, we’ll be using standard csv (comma-seperated-value) files, so use the “text(base)” option.
This should open a file explorer. Navigate to and select the csv file you downloaded. A preview window opens. You can rename the table so that it’s easy to reference in code – I’ll give notes throughout as if you’d renamed it data. Preview the import to make sure it’s being properly translated and then import. Note that the command for this import will appear in the console: if you prefer to use the command in future, go ahead!
View(data)head(data)nrow(data) (resp. ncol(data) for column count)X with the name of the column: data$Xi with the row index: data[i,]
data[,j] for column jdata[1:10, ] for the first 10 rows or data[ , c(2,4,5)] for columns 2, 4, 5.subset(data, condition) (use ?subset in the console for full details)
subset(data, X > 10) where X is a column name, keeps only rows where the X column is over 10&& (and), || (or). Also may be helpful to know that in R ! indicates not, so X!=0 will take only the data where X isn’t 0. \(\star\) Task Play around with each of the above methods in the console (no need to show this in your lab file). Then, in the space for R code below, create a variable sessionsWithErrorsOrLatency equal to the subset of the data including only rows where there are at least 3 errors or the latency exceeds 600. Use methods from above to print the “head” of this subset, and to compute the number of rows in this subset.
Insert your code here (follow this same pattern for each task):
songAppData <- read.csv("C:/Users/Leo/Desktop/MATH&270/songAppData.csv")
sessionsWithErrorsOrLatency = subset(songAppData, subset = (songAppData$nErrors > 2 | songAppData$avgLatencies > 600))
head(sessionsWithErrorsOrLatency)
## X UserId nSongs nErrors avgLatencies daySessionStarted hourSessionStarted
## 1 1 1 38 3 649.8478 2021-04-07 19
## 4 4 1 34 4 583.0707 2021-04-09 3
## 13 13 3 3 3 1320.7270 2021-04-08 10
## 14 14 3 28 3 420.1560 2021-04-11 22
## 16 16 3 1 0 1033.7489 2021-04-08 21
## 34 34 7 26 4 623.7457 2021-04-09 16
## nSongsSkipped DeviceOS
## 1 4 android
## 4 2 android
## 13 0 iOS
## 14 2 iOS
## 16 0 iOS
## 34 3 iOS
And if you have any comments/notes, write them in text here:
As a data analyst (or anyone working on a set of data), this is usually the first step when you get a new data set. There’s no faster way to start understanding patterns in data than to plot it in a few different ways.
The “distribution” of data refers to how frequently the data takes on different values. We’ll also talk about probability distributions, which are an exact/formal version of this. For quickly understanding the distribution of real data, nothing beats a histogram (or the equivalent “bar chart” for categorical data). Histograms group nearby values into bins, and count the number of values that are in each bin. The result is displayed as a bar chart, where taller bars indicate more data in that range.
Histograms can be used for any numeric column of data.
If X is your numeric column name, hist(data$X) to plot a basic histogram. You can use ?hist in the console to see all the optional arguments. [One that I commonly use is the breaks argument, which lets you specify the location and sizes of bins. Another useful argument is main which refers to the plot title, and freq which changes from counts to proportions.]
\(\star\) Task Plot histograms to show the distribution on number of Sessions, number of Songs, latencies and number of errors.
hist(songAppData$UserId, main = "Frequency vs. Number of Sessions", ylab = "Frequency of Sessions", xlab = "Number of User Sessions")
hist(songAppData$nSongs, main = "Frequency vs. Number of Songs", ylab = "Frequency of Songs", xlab = "Number of Songs")
hist(songAppData$avgLatencies, main = "Frequency vs. Number of Average Latencies", ylab = "Freqency of Latency", xlab = "Number of Average Latencies")
hist(songAppData$nErrors, main = "Frequency vs. Number of Errors", ylab = "Frequency of Errors", xlab = "Number of Errors")
A note: when we look at a histogram, we can get a sense of the “shape” of the distribution. Are values close together vs. spread apart? Is the distribution symmetric vs. skewed? [Skewed distributions have a peak and then data spread out mainly in one direction. “Left skew” means the data is spread out mainly toward lower (left) values. “Right skew” means the data is spread out mainly toward higher (right) values. Symmetric distributions are balanced– there’s no main direction of skew.]
Boxplots (or box and whisker plots) are a second way of quickly looking at the distribution of numeric data. * Boxplots split the data into quarters: the minimum value, the value with 25% of data less than it (aka the 25th percentile), the value with 50% of data less than it (aka the 50th percentile, the 75th percentile, and the max. These are then represented in a figure.
boxplot(data$X) to plot a single boxplot for a column.
boxplot(data[, 2:4]).Y is a numeric column and X is categorical, boxplot(data$Y ~ data$X) plots the boxplot distribution of Y, but split out between the different values for X.\(\star\) Task Plot a side-by-side boxplot that shows the distribution of latency, split out by the device operating system.
boxplot(songAppData$avgLatencies ~ songAppData$DeviceOS, main = "Distribution of Latency", ylab = "Average Amount of Latency", xlab = "Operating Devices")
IfX and Y are two numeric columns, plot(data$X, data$Y) will generate a simple scatterplot.
\(\star\) Task Create a scatterplot of the number of songs vs. the number of skipped songs. From this plot, describe the relationship (or non-relationship) between these two variables.
plot(songAppData$nSongs, songAppData$nSongsSkipped, main = "Number of Songs vs. Number of Skipped Songs", ylab = "Number of Songs Skipped", xlab = "Number of Songs")
#Relationship: An increase in songs being played will increase the songs being skipped
If X is your categorical column name, table(data$X) will produce a summarized table of counts in each value of X.
\(\star\) Task Create a table summarizing the number of users per device type. Then create a table summarizing the number of users that had no errors vs. one error. The error version will require you to use a call of the form table(data$X > 0), where this makes a table for the boolean error presence.
table(songAppData$DeviceOS)
##
## android iOS windows
## 32333 23970 8239
table(songAppData$nErrors > 0)
##
## FALSE TRUE
## 33968 30574
\(\star\) TaskNow wrap one of the table commands you did above in prop.table(): that is, run a command like prop.table(table(data$X)). You should see that instead of counts, the entries of the table have been converted to proportions.
prop.table(table(songAppData$DeviceOS))
##
## android iOS windows
## 0.5009606 0.3713861 0.1276533
prop.table(table(songAppData$nErrors > 0))
##
## FALSE TRUE
## 0.526293 0.473707
Rs table function also allows you to create so-called “contingency tables” which allow you to look at frequency of events in two conditions simultaneously. This produces information of the form:
| Error: FALSE | Error: TRUE | |
|---|---|---|
| iOS | # or % | # or % |
| android | # or % | # or % |
| windows | # or % | # or % |
\(\star\) Task Build a contingency table that will fill out the info #s in the one above. To do this, supply the two columns of interest as arguments (e.g. table(data$X,data$Y > 0)).
table(songAppData$DeviceOS, songAppData$nErrors > 0)
##
## FALSE TRUE
## android 22002 10331
## iOS 6287 17683
## windows 5679 2560
\(\star\) Task Now build a contingency table that will fill out the cells from the above with proportions instead of counts. To do this, wrap the table() call you made above in the method prop.table().
prop.table(table(songAppData$DeviceOS, songAppData$nErrors > 0))
##
## FALSE TRUE
## android 0.34089430 0.16006631
## iOS 0.09740944 0.27397664
## windows 0.08798922 0.03966409
To do prettier/more advanced plotting in R, you can make use of the ggplot library. * Install by running this line once install.packages("ggplot2") * Add the following line to your script (before your plot) library(ggplot2) * If you want to learn more check out this ggplot tutorial
As an example, plot the following, but replace data with your main data variable, X with the name of the column for latency of song load and Y with the name of the column column for device type.
library(ggplot2)
ggplot(songAppData, aes(x=songAppData$avgLatencies, fill = songAppData$DeviceOS)) +
geom_histogram(alpha=0.4, aes(),binwidth = 1, position = "identity") + geom_vline(xintercept = median(x))
ggplot(songAppData, aes(x=songAppData$avgLatencies, y=songAppData$DeviceOS, fill = songAppData$DeviceOS)) +
geom_boxplot(alpha=0.5) + theme_light()
In general basic plotting functions in R are “good enough” for your own use to understand data, but plots from ggplot are highly configurable and are great options for work that you share, post, or publish.
We’ll cover some frequently used summary statistics for data here. You’ll learn how to compute these using built-in R methods. You’ll also be asked to reflect on what each summary statistic indicates.
In this section you’ll also make use of the subset method to consider just specific portions of the data.
The most common summary statistic, this estimates the center of the distribution. It is also an estimate of the expected value (we’ll talk about that later). 42.945 million * can be skewed by outliers (consider that the average compensation in a room of 400 Microsoft retail store employees is around $48,000 but jumps to over $150,000 if the Microsoft CEO walks into the room) * Simple computation: take the sum, divide by the count: \[\overline{X} = \sum_{i=1}^n \frac{1}{n}\cdot X_i\] * To compute in R, use mean(V) for a vector or column V. Checking ?mean reveals two optional arguments: + trim gives you the option of ignoring the highest/lowest x/% of values before taking the mean. This can help reduce the impact of outliers, but used too aggressively can also skew your results. We’ll use it only sparingly. + na.rm if there are null/na values, the mean will throw an error. This argument allows you to discard any rows of data that are na before taking the mean (also reduces the count n). + note: instead of computing for each column separately, you can use the shortcut colMeans(data).
\(\star\) Task Compute all numeric column means for the data set. Go through and for each put it into a sentence: “The average __________ (insert column name) in this sample of data was ____________ (insert computed mean). If we want to improve our product we should aim to __________ (increase? or decrease?) that average.”
mean(songAppData$UserId)
## [1] 5911.738
#The average UserID (insert column name) in this sample of data was 5911.74 (insert computed mean). If we want to improve our product we should aim to increase (increase? or decrease?) that average.
mean(songAppData$nSongs)
## [1] 10.60728
#The average nSong (insert column name) in this sample of data was 10.61 (insert computed mean). If we want to improve our product we should aim to increase (increase? or decrease?) that average.
mean(songAppData$nErrors)
## [1] 1.040749
#The average nErrors (insert column name) in this sample of data was 1.04 (insert computed mean). If we want to improve our product we should aim to decrease (increase? or decrease?) that average.
mean(songAppData$avgLatencies)
## [1] 364.5242
#The average avgLatencies (insert column name) in this sample of data was 364.52 (insert computed mean). If we want to improve our product we should aim to decrease (increase? or decrease?) that average.
The median (or 50th percentile) is found by ordering all of the data and taking the middle value. Note that this is a different measure of “center” than the mean: it can result in a number close to the mean, but can also be really far away. * Unlike the mean, the median isn’t skewed by outliers. In the Microsoft example, the median income would be closer to $48,000 – the salary of the person with 200 people earning less and 200 people earning more. * In R, compute the median by taking median(V) for the data column or vector V.
\(\star\) Task Compute medians for 3 different numeric columns. Which are close to the mean? Which are far off – and in what direction? This tells you about skew of the distribution – that there are a relatively small number of values that have big enough values to pull the mean in a particular direction. This should line up with the skew you may have observed in plotting the histograms.
#median(colMeans(sessionsData))
#median(songAppData$UserId)
#N/A
median(songAppData$nSongs)
## [1] 4
#number of songs median is quite a bit far off from it's mean as it's six songs apart.
median(songAppData$nErrors)
## [1] 0
#errors median is very close to the mean as it's only one error off from being the exact same value.
median(songAppData$avgLatencies)
## [1] 302.7649
#latencies median is somewhat close to the mean in the bigger picture. It's 50 off each other, but the latency ranges from 0 to 1000.
You saw the 25th and 75th percentiles in box plots. In the tech industry, percentile measures are frequently monitored to keep an eye on (and hopefully improve) ``worst" performance, as this can typically be much worse than the average performance. For example, a team might set a goal to reduce the 95th percentile of latency by a quarter.
quantile function. For example, quantile(data$X, c(0.50,0.95)) would compute both the median and 95th percentile of data column X.\(\star\) Task Compute the 80th and 95th percentile of latency. If your team is setting a goal to reduce the 95th percentile of latency by a quarter, what’s the target latency?
quantile(songAppData$avgLatencies, c(0.80, 0.95))
## 80% 95%
## 546.7407 854.5136
Data can have the same center but very different spread. It’s important to understand and track both!
As an example, consider the amount of money in people’s bank accounts, there’s a very wide spread: some have $0 balance, and others have tens of thousands of dollars. A reasonable guess could be that the average is around $800 and the standard deviation (a standardized measure of how much values vary from the center) might be $2000. If the bank made an error that redistributed account balance so that everyone had the same amount, the average would still be $800, but the standard deviation would now be $0. If the bank was tracking the average balance, but not the standard deviation (or other measures of consistency), they might totally miss this error!
V by sd(V)\(\star\) Task Compute the standard deviation for all numeric columns. In particular, compute the so-called ``coefficient of variation" by taking the standard deviation divided by the the mean. Which column has the highest standard deviation relative to the mean? Which is most consistent relative to the mean?
#sd(colMeans(sessionsData))
sd(songAppData$UserId)
## [1] 3409.827
sd(songAppData$nSongs)
## [1] 12.7332
sd(songAppData$nErrors)
## [1] 1.502238
sd(songAppData$avgLatencies)
## [1] 250.5817
You looked at counts and proportiosn meeting a specific criteria when you built tables. Since counts vary depending on how much data you collect, it’s usually better to look at the proportion (aka percentage or fraction) of data.
Although tables are useful, direct computations are more flexible. So let’s practice how to do this.
sum(data$X > 5 && data$Y == 0 )/nrow(data) would first evaluate the condition (X value is more than 5 AND Y value is 0) as TRUE (1) or FALSE (0). The sum gives the number of trues (since true counts as 1 and false as 0), and then we divide by the total number of rows.
&& for AND, || for OR, != for NOT EQUAL in your conditions.\(\star\) Task You are seeing more errors than you’d like in the product, and think there’s a bug. The bug could be in a change that was specific to iOS, or it could be in a change that impacted all device types. First find the proportion of all users that had at least one error.
Then make a data subset to just the iOS sessions: find the proportion among this data.
Based on this info, where do you think you should start looking for the bug?
sum(songAppData$nErrors > 1)/nrow(songAppData)
## [1] 0.2573053
iosSubset = subset(songAppData, songAppData$DeviceOS == "iOS")
sum(iosSubset$nErrors > 1)/nrow(iosSubset)
## [1] 0.4604506
head(iosSubset)
## X UserId nSongs nErrors avgLatencies daySessionStarted hourSessionStarted
## 6 6 2 3 2 273.8098 2021-04-09 9
## 7 7 2 11 2 260.0373 2021-04-05 17
## 8 8 2 1 1 173.7470 2021-04-09 8
## 9 9 2 1 1 405.3958 2021-04-09 4
## 10 10 3 43 1 167.5177 2021-04-09 8
## 11 11 3 1 1 267.1666 2021-04-05 22
## nSongsSkipped DeviceOS
## 6 0 iOS
## 7 3 iOS
## 8 0 iOS
## 9 0 iOS
## 10 1 iOS
## 11 0 iOS