In this activity, we are going to install and load R
packages; practice using functions to view, clean, and visualize data;
and learn more about using R Markdown to document our
analysis. R is a powerful tool that can do a lot of
different things; this sandbox activity will help us get more
comfortable using R while demonstrating some of its
functions in action.
R packages
Packages are a key part of working with
R. They contain bundles of code called
functions that allow us to perform a wide range of tasks in
R. Some of them even contain datasets that we can use to
practice the skills we have been learning throughout this course.
Some packages are installed by default, but many others
can be downloaded from an external source such as the Comprehensive R
Archive Network, or CRAN.
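As a rough illustration of the difference between default and downloaded packages, the sketch below lists the packages that ship with R and checks whether tidyverse is already installed. It relies only on base R's installed.packages() function; the variable names are our own.

```r
# List every package installed in the current R library paths.
installed <- installed.packages()

# Packages that ship with R itself are marked with Priority "base".
base_pkgs <- rownames(installed)[installed[, "Priority"] %in% "base"]
print(base_pkgs)

# Check whether a specific package is already installed (TRUE or FALSE).
"tidyverse" %in% rownames(installed)
```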
In this activity, we will be using a package called
tidyverse. The tidyverse package is actually a
collection of individual packages that can help us perform a
wide variety of analysis tasks.
To install the tidyverse package, execute the code in
the code chunk below by clicking on the green arrow button in the top
right corner. When we execute a code chunk in R Markdown, the output will
appear in the .rmd area and our console.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
Once a package is installed, we can load it by running the
library() function with the package name inside the
parentheses, like this:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Installing and loading the tidyverse package may take a
few minutes, so be sure to wait for it to finish running before moving on
to the next steps!
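Because installation can be slow, we may prefer to install a package only when it is missing. The sketch below is one possible way to do that, not part of the original activity; the explicit repos URL is an assumption (the CRAN cloud mirror) that keeps the call from asking us to pick a mirror.

```r
# Install tidyverse only if it is not already available.
# The repos argument is an assumption: it points at the CRAN cloud mirror.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Loading is still needed in every new R session.
library(tidyverse)
```

Installing happens once per library; loading with library() happens once per session.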
Once the chunk above has finished running, we will get a report that
summarizes what packages were loaded because we ran the
library() function. The report will also let us know if
there are any functions that have a conflict, but we don’t
need to worry about that for now.
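For the curious, the conflict report means that two packages define a function with the same name, and the most recently loaded one wins. A minimal sketch of how we could choose explicitly, using the package::function form, might look like this (the variable names are our own):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# After loading dplyr, a bare filter() call refers to dplyr::filter().
# Writing the package name before :: makes the choice explicit.
ideal_diamonds <- dplyr::filter(diamonds, cut == "Ideal")
head(ideal_diamonds)

# The masked function from the stats package is still available
# under its full name. Here it computes a 3-point moving average.
moving_avg <- stats::filter(1:10, rep(1 / 3, 3))
print(moving_avg)
```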
Now that we have loaded an R package, we can start
exploring some data.
Many of the tidyverse packages contain sample datasets
that we can use to practice our R skills. The
diamonds dataset in the ggplot2 package is a
great example for previewing R functions.
Because we already loaded this package in the last step, the
diamonds dataset is ready for us to use.
One common function we can use to preview the data is the
head() function, which displays the columns and the first
several rows of data. We can test out how the head()
function works by running the chunk below:
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
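By default head() returns six rows. As a small sketch, we can ask for a different number of rows with the n argument, or look at the end of the data with the companion tail() function:

```r
library(ggplot2)  # provides the diamonds dataset

# Show only the first three rows instead of the default six.
first_three <- head(diamonds, n = 3)
print(first_three)

# tail() works the same way but starts from the end of the data.
last_three <- tail(diamonds, n = 3)
print(last_three)
```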
In addition to head(), there are a number of other useful
functions we can use to summarize or preview the data. For example, the
str() and glimpse() functions will both return
summaries of each column in our data arranged horizontally. We can try
out these two functions by running the code chunks below:
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
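If we only need the size of the dataset rather than a full summary, base R offers a few shortcuts. The sketch below should report the same 53,940 rows and 10 columns that str() and glimpse() show above:

```r
library(ggplot2)  # provides the diamonds dataset

# Number of rows and number of columns, separately.
n_rows <- nrow(diamonds)
n_cols <- ncol(diamonds)
print(n_rows)
print(n_cols)

# dim() returns both values at once: rows first, then columns.
dim(diamonds)
```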
Another simple function that we may use regularly is the
colnames() function. It returns the column names of
our dataset as a character vector. We can check out this function by
running the code chunk below:
colnames(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
After running the code chunk, we may have noticed a number in
brackets at the start of each output line. This number is the position
of the first item printed on that line. If we have data with lots of
columns and colnames() prints the results on multiple lines, each line
will start with a number in brackets indicating which column comes
first on that line. So, for example, “carat” is the first column in
the diamonds dataset. On the second line, there is the
number eight in brackets; “x” is the eighth column.
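The same position numbers can be used to pull out individual names. As a brief sketch, storing the result of colnames() lets us index it with square brackets:

```r
library(ggplot2)  # provides the diamonds dataset

# Store the column names so we can look them up by position.
column_names <- colnames(diamonds)

column_names[1]       # the first column name, "carat"
column_names[8]       # the eighth column name, "x"
length(column_names)  # how many columns there are in total
```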
One of the most frequent tasks we will have to perform as analysts
is to clean and organize our data. R makes this easy! There
are many functions we can use to help us perform important tasks easily
and quickly.
For example, we might need to rename the columns, or variables, in
our data. There is a function for that: rename(). We can
check out how it works in the chunk below:
rename(diamonds, carat_new = carat)
## # A tibble: 53,940 × 10
## carat_new cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
Here, the function is being used to change the name of
carat to carat_new. This is a pretty basic
change, but rename() has many options that can help us make
more complex changes across all of the variables in our data.
For example, we can rename more than one variable in the same
rename() code. The code below demonstrates how:
rename(diamonds, carat_new = carat, cut_new = cut)
## # A tibble: 53,940 × 10
## carat_new cut_new color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
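Note that rename() returns a new table and leaves diamonds itself unchanged, so we need to assign the result if we want to keep it. The sketch below shows that, and also uses dplyr's rename_with() as one way to apply a change to every column name at once. The object names are our own.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Save the renamed table under a new name; diamonds is not modified.
diamonds_renamed <- rename(diamonds, carat_new = carat)
"carat" %in% colnames(diamonds)              # still TRUE
"carat_new" %in% colnames(diamonds_renamed)  # TRUE

# rename_with() applies a function to the column names,
# here converting all of them to upper case.
diamonds_upper <- rename_with(diamonds, toupper)
colnames(diamonds_upper)
```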
Another handy function for summarizing our data is
summarize(). We can use it to generate a wide range of
summary statistics for our data. For example, if we wanted to know what
the mean for carat was in this dataset, we could run the
code in the chunk below:
summarize(diamonds, mean_carat = mean(carat))
## # A tibble: 1 × 1
## mean_carat
## <dbl>
## 1 0.798
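A common next step is to compute the same statistic for each group in the data. As a sketch of that idea, dplyr's group_by() can be combined with summarize() to get the mean carat and the number of diamonds for each cut; the column names mean_carat and n_diamonds are our own choices.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Group the rows by cut, then summarize each group separately.
diamonds_by_cut <- group_by(diamonds, cut)
cut_summary <- summarize(
  diamonds_by_cut,
  mean_carat = mean(carat),  # average carat within each cut
  n_diamonds = n()           # number of rows within each cut
)
print(cut_summary)
```

The result has one row per cut instead of a single row for the whole dataset.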
These functions are a great way to get more familiar with our data
and start making observations about it. But sometimes, previewing tables
isn’t enough to understand a dataset. Luckily, R has
visualization tools built in.
With R, we can create data visualizations that are
simple and easy to understand or complicated and beautiful just by
changing a bit of code. R empowers us to present the same
data in so many different ways, which can help us create new insights or
highlight important data findings. One of the most commonly used
visualization packages is the ggplot2 package, which is
loaded automatically when we install and load tidyverse.
The diamonds dataset that we have been using so far is a
ggplot2 dataset.
To build a visualization with ggplot2, we layer plot
elements together with a + symbol. We will learn a lot more
about using ggplot2 later in the course, but here is a
preview of how easy and flexible it is to make visuals using code:
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point()
The code above takes the diamonds data, plots the carat
column on the X-axis, the price column on the Y-axis, and represents the
data as a scatter plot using the geom_point() command.
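Because a plot is built up in layers, we can also store it in a variable and keep adding to it. The following is a minimal sketch of that idea; the title and axis labels are illustrative, and the alpha value (point transparency) is an arbitrary choice to make overlapping points easier to see.

```r
library(ggplot2)

# Store the basic scatter plot in a variable.
base_plot <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2)  # semi-transparent points

# Add another layer with + to set the title and axis labels.
labeled_plot <- base_plot +
  labs(title = "Diamond price by carat", x = "Carat", y = "Price")

# Printing the object draws the plot.
print(labeled_plot)
```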
ggplot2 makes it easy to modify or improve our visuals.
For example, if we wanted to change the color of each point so that it
represented another variable, such as the cut of the diamond, we could
change the code like this:
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
geom_point()
Wow, that’s a busy visual! Sometimes when we are trying to represent
many different aspects of our data in a visual, it can help to separate
out some of the components. For example, we could create a different
plot for each type of cut. ggplot2 makes it easy to do this
with the facet_wrap() function:
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
geom_point() +
facet_wrap(~cut)
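With more than 50,000 points, even the faceted plot can be slow to draw and hard to read. One option, sketched below under our own assumptions, is to plot a random sample of rows and control the grid layout with the ncol argument of facet_wrap(). The sample size of 5,000 and the seed are arbitrary.

```r
library(dplyr)
library(ggplot2)

# Take a reproducible random sample of rows to reduce overplotting.
set.seed(42)
diamonds_sample <- slice_sample(diamonds, n = 5000)

# Same faceted plot as before, on the sample, arranged in 3 columns.
faceted_plot <- ggplot(
  data = diamonds_sample,
  aes(x = carat, y = price, color = cut)
) +
  geom_point(alpha = 0.4) +
  facet_wrap(~cut, ncol = 3)

print(faceted_plot)
```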
We will learn many other ways of working with ggplot2 to
make functional and beautiful visuals later on. For now, hopefully we
understand that it is both flexible and powerful!
We have been working in an R Markdown file, which allows
us to put code and writing in the same place. Markdown is a simple
language for adding formatting to text documents. For example, all of
the section headers have been formatted by adding ## to the
beginning of the line. Markdown can be used to format the text in other
ways, such as creating bulleted lists.
When we have written, executed, and documented our code in an
R Markdown document like this, we can use the
Knit button in the menu bar at the top of the editing pane
to export our work to a beautiful, readable document for others.
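Knitting can also be done from code with the render() function in the rmarkdown package. The sketch below writes a tiny, hypothetical R Markdown file (with a ## header, a bulleted list, and one code chunk) to a temporary folder and renders it only if rmarkdown and pandoc are available. The file name and contents are made up for illustration.

```r
# Three backticks open and close a code chunk in R Markdown.
fence <- strrep("`", 3)

# A hypothetical, minimal R Markdown document as a character vector.
rmd_lines <- c(
  "---",
  "title: \"Diamonds sandbox\"",
  "output: html_document",
  "---",
  "",
  "## Preview the data",
  "",
  "* Bulleted lists start with an asterisk",
  "* Each item goes on its own line",
  "",
  paste0(fence, "{r}"),
  "head(ggplot2::diamonds)",
  fence
)

# Write the document to a temporary file.
rmd_path <- file.path(tempdir(), "sandbox_example.Rmd")
writeLines(rmd_lines, rmd_path)

# Render the file if the required tools are installed.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  output_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(output_path)
}
```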
We have had a chance to explore more R tools that we can
start using on our own. We learned how to install and load
R packages; use functions for viewing, cleaning, and
visualizing data; and use R Markdown to export our work.
Feel free to continue exploring these functions in the rmd file, or use
this code in our own RStudio project space. As we practice on our own,
consider how R is similar to and different from the tools we
have already learned in this program, and how we might start using it
for our own data analysis projects. R provides a lot of
flexibility and utility that can make it a key tool in a data analyst’s
tool kit.
Make sure to mark this activity as complete in Coursera.