Welcome to the sandbox! This activity is going to provide you with
the opportunity to preview some of the cool things you can do in
R
that you will be learning in this course. You will learn
more about working with packages and data and try out some important
functions.
In this activity, you are going to install and load R
packages; practice using functions to view, clean, and visualize data;
and learn more about using R markdown
to document your
analysis. R
is a powerful tool that can do a lot of
different things; this sandbox activity will help you get more
comfortable using R
while demonstrating some of its
functions in action. In later activities, you will also get the
opportunity to write your own R code!
R packages
Packages
are a key part of working with
R.
They contain bundles of code called
functions
that allow you to perform a wide range of tasks
in R.
Some of them even contain datasets that you can use
to practice the skills you have been learning throughout this
course.
Some packages
are installed by default, but many others
can be downloaded from an external source such as the Comprehensive R
Archive Network, or CRAN.
In this activity, you will be using a package called
tidyverse.
The tidyverse
package is actually a
collection individual packages
that can help you perform a
wide variety of analysis tasks.
To install the tidyverse
package, execute the code in
the code chunk below by clicking on the green arrow button in the top
right corner. When you execute a code chunk in RMarkdown, the output
will appear in the .rmd area and your console.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
Once a package is installed, you can load it by running the
library()
function with the package name inside the
parentheses, like this:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Installing and loading the tidyverse
package may take a
few minutes– be sure to wait for it to finish running before moving on
to the next steps!
Once the chunk above has finished running, you will get a report that
summarizes what packages were loaded because you ran the
library()
function. The report will also let you know you
if there are any functions
that have a conflict, but you
don’t need to worry about that for now.
Now that you have loaded an R package,
you can start
exploring some data.
Many of the tidyverse
packages contain sample datasets
that you can use to practice your R
skills. The
diamonds
dataset in the ggplot2
package is a
great example for previewing R
functions.
Because you already loaded this package in the last step, the
diamonds
dataset is ready for you to use.
One common function you can use to preview the data is the
head()
function, which displays the columns and the first
several rows of data. You can test out how the head()
function works by running the chunk below:
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
In addition to head()
there are a number of other useful
functions you can use to summarize or preview the data. For example, the
str()
and glimpse()
functions will both return
summaries of each column in your data arranged horizontally. You can try
out these two functions by running the code chunks below:
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
Another simple function that you may use regularly is the
colnames()
function. It returns a list of column names from
your dataset. You can check out this function by running the code chunk
below:
colnames(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
After running the code chunk, you may have noticed a number in
brackets. This number helps you count the number of columns in your
dataset. If you have data with lots of columns and
colnames()
prints the results on multiple lines, each line
will have a number in brackets at the start of the line indicating what
number column that is! So, for example, “carat” is the first column in
the diamonds
dataset. On the second line, there is the
number seven in brackets; “price” is the seventh column.
One of the most frequent tasks you will have to perform as an analyst
is to clean and organize your data. R
makes this easy!
There are many functions you can use to help you perform important tasks
easily and quickly.
For example, you might need to rename the columns, or variables, in
your data. There is a function for that: rename().
You can
check out how it works in the chunk below:
rename(diamonds, carat_new = carat)
## # A tibble: 53,940 × 10
## carat_new cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
Here, the function is being used to change the name of
carat
to carat_new
. This is a pretty basic
change, but rename()
has many options that can help you do
more complex changes across all of the variables in your data.
For example, you can rename more than one variable in the same
rename()
code. The code below demonstrates how:
rename(diamonds, carat_new = carat, cut_new = cut)
## # A tibble: 53,940 × 10
## carat_new cut_new color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
Another handy function for summarizing your data is
summarize().
You can use it to generate a wide range of
summary statistics for your data. For example, if you wanted to know
what the mean for carat
was in this dataset, you could run
the code in the chunk below:
summarize(diamonds, mean_carat = mean(carat))
## # A tibble: 1 × 1
## mean_carat
## <dbl>
## 1 0.798
These functions are a great way to get more familiar with your data
and start making observations about it. But sometimes, previewing tables
isn’t enough to understand a dataset. Luckily, R
has
visualization tools built in.
With R,
you can create data visualizations that are
simple and easy to understand or complicated and beautiful just by
changing a bit of code. R
empowers you to present the same
data in so many different ways, which can help you create new insights
or highlight important data findings. One of the most commonly used
visualization packages is the ggplot2
package, which is
loaded automatically when you install and load tidyverse.
The diamonds
dataset that you have been using so far is a
ggplot2
dataset.
To build a visualization with ggplot2
you layer plot
elements together with a +
symbol. You will learn a lot
more about using ggplot2
later in the course, but here is a
preview of how easy and flexible it is to make visuals using code:
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point()
The code above takes the diamonds
data, plots the carat
column on the X-axis, the price column on the Y-axis, and represents the
data as a scatter plot using the geom_point()
command.
ggplot2
makes it easy to modify or improve your visuals.
For example, if you wanted to change the color of each point so that it
represented another variable, such as the cut of the diamond, you can
change the code like this:
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
geom_point()
Wow, that’s a busy visual! Sometimes when you are trying to represent
many different aspects of your data in a visual, it can help to separate
out some of the components. For example, you could create a different
plot for each type of cut. ggplot2
makes it easy to do this
with the facet_wrap()
function:
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
geom_point() +
facet_wrap(~cut)
You will learn many other ways of working with ggplot2
to make functional and beautiful visuals later on. For now, hopefully
you understand that it is both flexible and powerful!
You have been working in an R markdown
file, which
allows you to put code and writing in the same place. Markdown is a
simple language for adding formatting to text documents. For example,
all of the section headers have been formatted by adding ##
to the beginning of the line. Markdown can be used to format the text in
other ways, such as creating bulleted lists:
When you have written, executed, and documented your code in an
R markdown
document like this, you can use the
knit
button in the menu bar at the top of the editing pane
to export your work to a beautiful, readable document for others.
You have had a chance to explore more R
tools that you
can start using on your own. You learned how to install and load
R packages
; functions for viewing, cleaning, and
visualizing data; and using R markdown
to export your work.
Feel free to continue exploring these functions in the rmd file, or use
this code in your own RStudio project space. As you practice on your
own, consider how R
is similar and different from the tools
you have already learned in this program, and how you might start using
it for your own data analysis projects. R
provides a lot of
flexibility and utility that can make it a key tool in a data analyst’s
tool kit.
Make sure to mark this activity as complete in Coursera.