Under the File tab, use Save As… to make a version of this file with a new name. In case things go sideways, we can go back to the original.
At the top of this document, put your name between the quotes after author. This is now your notebook.
R provides many data sets to work with, so we can learn new analysis skills before scaling up. mtcars is a classic go-to R data frame. It was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 design and performance features for 32 automobiles (1973–74 models).
We can create a table of the entire data set in a new tab with the View() function.
# Check out the full data set
View(mtcars)
Each row of mtcars is an automobile, and each column is a design or performance feature. For example:
mpg is miles per (US) gallon
wt is weight (in units of 1,000 lbs)
The function help() provides information on R functions and data. We can find out what all the features are:
# what exactly is in mtcars?
help(mtcars)
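Besides View() and help(), a few base R functions give a quick look at the data directly in the console. The following is a minimal sketch. The choice of dim(), names(), head(), and str() is our suggestion rather than part of the original steps.

```r
# Number of rows (automobiles) and columns (features)
dim(mtcars)

# Names of the features
names(mtcars)

# First six automobiles
head(mtcars)

# Type of each column, with a few example values
str(mtcars)
```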
We learned that heatmap() plots a numeric matrix of values. So our first step is to convert the data from a table, or data frame, to a matrix of numeric values. We will do most of our analysis on data in matrix form.
The symbol <- is the assignment operator. It assigns a value on the right side of the operator to a variable on the left side. It functions, for us, like an equals (=) sign.
# Convert mtcars into a matrix of numbers
# Assign the output to the variable data
data <- as.matrix(mtcars)
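To confirm that the conversion did what we wanted, we can ask R what kind of object we have before and after. This is a small sketch using base R checks. The variable data_copy is a hypothetical name used only to illustrate that = also assigns at the top level.

```r
# Convert the data frame to a matrix (same step as above)
data <- as.matrix(mtcars)

# Before: a data frame; after: a numeric matrix
class(mtcars)
is.matrix(data)
is.numeric(data)

# Row and column names carry over to the matrix
dim(data)
head(rownames(data))
colnames(data)

# = also assigns at the top level, but <- is the usual R style
data_copy = as.matrix(mtcars)   # data_copy is a hypothetical name
identical(data, data_copy)
```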
Heatmaps are a way to colorize, visualize, and organize a data set with the goal of finding relationships among observations and features.
The scale() function normalizes the features so they are comparable. By default, it centers each column on its mean and divides by its standard deviation.
# Let's change the range of each feature so they are comparable
# We'll assign the output to a new variable data_scaled
data_scaled <- scale(data)
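One way to see what scale() did is to check the mean and standard deviation of each scaled column, and to repeat the calculation by hand for a single feature. This is a sketch under the default settings of scale(). The name mpg_by_hand is hypothetical.

```r
data <- as.matrix(mtcars)
data_scaled <- scale(data)

# Each scaled column should have a mean of about 0 and a standard deviation of 1
round(colMeans(data_scaled), 10)
apply(data_scaled, 2, sd)

# The same result by hand for one feature (mpg):
# subtract the mean, then divide by the standard deviation
mpg_by_hand <- (data[, "mpg"] - mean(data[, "mpg"])) / sd(data[, "mpg"])  # hypothetical name
all.equal(unname(mpg_by_hand), unname(data_scaled[, "mpg"]))
```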
We found that the heatmap for the scaled data reveals patterns in the data.
# A heat map is a color image of our data with dendrograms
heatmap(data_scaled)
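The dendrograms on the heat map come from hierarchical clustering. By default, heatmap() computes distances with dist() and clusters with hclust(), so the sketch below reproduces roughly that step on its own. Note also that the scale argument of heatmap() defaults to "row", which standardizes each row again before coloring. Setting scale = "none" colors the cells with the column-scaled values as they are. The names car_clusters and feature_clusters are hypothetical.

```r
data_scaled <- scale(as.matrix(mtcars))

# Row dendrogram: cluster the automobiles on their scaled features
car_clusters <- hclust(dist(data_scaled))          # hypothetical name
plot(car_clusters, cex = 0.6)

# Column dendrogram: cluster the features, so transpose first
feature_clusters <- hclust(dist(t(data_scaled)))   # hypothetical name
plot(feature_clusters)

# Heat map colored by the column-scaled values, with no extra row scaling
heatmap(data_scaled, scale = "none")
```

Comparing this plot with the default heatmap(data_scaled) shows how much the extra row scaling changes the colors.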
Scatterplots plot one variable against another. They work best for continuous data.
# Make a data matrix of the continuous variables
sub_data_scaled <- data_scaled[,c(1,3:7)]
# Do an all-on-all scatter plot
pairs(sub_data_scaled)
We can see in greater detail which features have relationships with others. Scatterplots help us find correlations.
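The relationships seen in the scatterplots can also be put into numbers with cor(), which computes pairwise Pearson correlations by default. This is a minimal sketch, and cor_matrix is a hypothetical name.

```r
data_scaled <- scale(as.matrix(mtcars))
sub_data_scaled <- data_scaled[, c(1, 3:7)]

# Pairwise Pearson correlations between the continuous features
cor_matrix <- cor(sub_data_scaled)   # hypothetical name
round(cor_matrix, 2)

# One pair: heavier automobiles tend to get fewer miles per gallon
cor_matrix["mpg", "wt"]

# Scaling does not change correlations
all.equal(cor_matrix, cor(as.matrix(mtcars)[, c(1, 3:7)]))
```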
Boxplots are a simple way to see the distribution of features.
# boxplots show the range of each feature
boxplot(x = as.list(mtcars))
# Features with large values stretch the axis and squash the others
# Let's remove mpg, disp, hp, and qsec
boxplot(x = as.list(mtcars[,-c(1,3,4,7)]))
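To see why these four features were removed, we can list the minimum and maximum of every feature. This is a small sketch, and feature_ranges is a hypothetical name.

```r
# Minimum and maximum of every feature, one column per feature
feature_ranges <- sapply(mtcars, range)   # hypothetical name
rownames(feature_ranges) <- c("min", "max")
feature_ranges

# Features ordered by their maximum value, largest first
sort(feature_ranges["max", ], decreasing = TRUE)
```

The features with the largest maximum values are the ones that dominate the axis of the first boxplot.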
We can create boxplots for the scaled features as well to ensure our scaled data have similar distributions.
# boxplots of the scaled features
boxplot(x = as.list(as.data.frame(data_scaled)))
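The numbers behind a boxplot can be inspected without drawing it by passing plot = FALSE to boxplot(). The sketch below prints, for each scaled feature, the five values the box and whiskers are drawn from. The names box_stats and five_numbers are hypothetical.

```r
data_scaled <- scale(as.matrix(mtcars))

# Compute the boxplot statistics without drawing the plot
box_stats <- boxplot(as.data.frame(data_scaled), plot = FALSE)  # hypothetical name

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
five_numbers <- box_stats$stats                                 # hypothetical name
colnames(five_numbers) <- box_stats$names
round(five_numbers, 2)
```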
We will learn about other visualizations when we do analyses with our data.