Getting Started with R


Welcome to R! Whether you’re just starting or have some experience, R is an amazing tool for diving into data. Its straightforward syntax and powerful features make it easy to get started and grow your skills. With R, you can explore, analyze, and visualize data in ways that are both fun and impactful. Let’s get started and see where your data journey takes you!

Powering Data with R & Markdown


Understanding R Markdown

  • White sections in .Rmd files are for narrative text. Add headers using # and plain text for paragraphs.

    Example: “This is my first narrative.”

  • Gray sections are code chunks for R scripts.

Adding Comments in Code Chunks:

Use # to add comments.

# Example: Adding a comment
1+1 #Addition
1*2 #Multiplication
3/2 #Division

Mathematical Operations:

  • + Addition
  • - Subtraction
  • * Multiplication
  • / Division
  • ^ Exponentiation

Useful Shortcuts in RStudio:

  • Create a New Chunk: Ctrl + Alt + I
  • Run Selected Line(s): Ctrl + Enter
  • Run Current Chunk: Ctrl + Shift + Enter
  • Add Comments: Ctrl + Shift + C
  • Assign Object <-: Alt + -

Case Sensitivity in R:

"Data" == "data"  # FALSE
"Data" == "Data"  # TRUE

Comparison and Logical Operators:

  • > Greater than
  • < Less than
  • >= Greater than or Equal to
  • <= Less than or Equal to
  • == Equals
  • != Not equal
  • & AND
  • | OR
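
The operators above can be tried out with a short sketch. The variable names and values (age, income, ages) are made up purely for illustration and are not part of the original material:

```r
# Hypothetical values chosen only for illustration
age <- 25
income <- 4000

age > 18                  # TRUE
age <= 20                 # FALSE
age != 25                 # FALSE
both   <- age > 18 & income > 5000  # FALSE: both conditions must be TRUE
either <- age > 18 | income > 5000  # TRUE: one TRUE condition is enough

# The operators are vectorized: each element is compared in turn
ages <- c(15, 22, 30)
adult <- ages >= 18       # FALSE TRUE TRUE
adult
```

Because the result is a logical vector, it can later be used to filter data, for example ages[adult].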

Data Types

  • Character: Strings or text.

Example:

names <- c("Alice", "Bob")
class(names)
## [1] "character"


  • Numeric: Continuous or discrete numbers

Example:

values <- c(1.5, 2, 3.7)
class(values)
## [1] "numeric"


  • Integer: Whole numbers with no decimal part, typically found in ID data (customer ID, transaction ID, etc.). To store a number as an integer rather than numeric, add the suffix L after it.

Example:

values <- c(3L,2L,4L)
class(values) 
## [1] "integer"


  • Logical: Boolean values (TRUE, FALSE).

Example:

status <- c(TRUE, FALSE, T, F)
class(status)
## [1] "logical"


  • Complex: Numbers with imaginary components.

Example:

comp <- c(3+2i, 4+2i)
class(comp)
## [1] "complex"


  • Special Data Type: Factor

Factors represent categorical data efficiently.

gender <- c("Female", "Male", "Female") # character data type
gender <- factor(gender) #changed to factor

gender
## [1] Female Male   Female
## Levels: Female Male
class(gender)
## [1] "factor"


Data Structures

  • Vector: One-dimensional and homogeneous. Example:
ages <- c(23, 24, 25)
length(ages)
## [1] 3


  • Matrix: Two-dimensional and homogeneous. There are several ways to create a matrix:

Method 1: using the matrix() function

matrix1 <- matrix(1:6, nrow = 2, ncol = 3)
matrix1 # by default, values will be filled in per column
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
matrix2 <- matrix(data = 1:6, nrow = 2, ncol = 3, byrow = T)
matrix2 # use the `byrow=T` argument so that the values are filled in row by row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

Method 2: using the cbind() and rbind() functions

  • cbind (column bind) to combine several vectors into several columns
  • rbind (row bind) to combine several vectors into several rows

Note: When rbind/cbind is used to create a matrix, the names of the input vectors become the row names/column names of the matrix.
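
A minimal sketch of method 2 follows. The vectors height and weight are hypothetical and serve only to show how the vector names carry over to the matrix:

```r
# Hypothetical vectors used only for illustration
height <- c(160, 172, 181)
weight <- c(55, 68, 80)

by_col <- cbind(height, weight)  # each vector becomes a column
by_row <- rbind(height, weight)  # each vector becomes a row

dim(by_col)       # 3 rows, 2 columns
colnames(by_col)  # "height" "weight"
dim(by_row)       # 2 rows, 3 columns
rownames(by_row)  # "height" "weight"
```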

  • List: One-dimensional but can contain mixed data types. Example:
list1 <- list(c(1, 2, 3), "text", TRUE)
list1
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "text"
## 
## [[3]]
## [1] TRUE
class(list1)
## [1] "list"

A list can be subset with [] or [[]].

Use [] to get a sub-list: the result is still a list that wraps the selected element. Use [[]] to extract the element itself, without the list wrapper.

# subset the first list element
list1[1]
## [[1]]
## [1] 1 2 3
class(list1[1]) # data type is still a list
## [1] "list"
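
For comparison, here is a small sketch of the [[]] form on the same list. The helper names first_as_list and first_element are introduced here only for illustration:

```r
list1 <- list(c(1, 2, 3), "text", TRUE)

first_as_list <- list1[1]     # single brackets keep the list wrapper
first_element <- list1[[1]]   # double brackets extract the element itself

class(first_as_list)   # "list"
class(first_element)   # "numeric"
first_element[2]       # 2, the vector can now be indexed directly
```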


  • Data Frame: Tabular structure where columns can have mixed types. Example:
df <- data.frame(Name = c("A", "B"), Age = c(20, 21))

Exploring Data Frame:

class(df)
## [1] "data.frame"
dim(df)
## [1] 2 2
colnames(df) # view column names
## [1] "Name" "Age"
rownames(df) # view row names; here they are still the default row numbers
## [1] "1" "2"
names(df) # another way to view column names
## [1] "Name" "Age"

Working with Data in R

Before starting a project in R, it’s important to work within a single folder or working directory. This ensures that all data files you process and their resulting outputs are saved in one organized location. There are specific functions in R to get and set the working directory:

#get working directory
getwd()
## [1] "C:/Users/user/Documents/Data Science/Fran/P4DS"
#set working directory
setwd(dir = "C:/Users/user/Documents/Data Science/Fran/P4DS")

Import Data

R supports reading various file formats. For instance, to read a CSV file, you can use read.csv():

retail <- read.csv("data_input/retail.csv")
retail
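
Since data_input/retail.csv may not be available on every machine, the following sketch shows the same idea on a tiny temporary file. The file contents, the object name orders, and the column headers are invented for illustration only:

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The column names and values are made up for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Order ID,Sales,Ship Mode",
             "A-001,10.5,First Class",
             "A-002,20,Second Class"), tmp)

orders <- read.csv(tmp)

names(orders)  # spaces in headers become dots: "Order.ID" "Sales" "Ship.Mode"
str(orders)    # text columns are read as character in R >= 4.0.0

unlink(tmp)    # remove the temporary file
```

This also hints at why a column header such as "Order ID" would show up as Order.ID after import: by default read.csv() turns headers into syntactically valid R names.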

The data is already tidy/clean if:

  • Each column is a variable.
  • Each row is an observation.
  • Each cell contains a single value.

Next, we can proceed to check the structure of the data.

Checking Data Structure

Before processing, it’s important to understand the dataset. Use these commands:

  1. Check dimensions (rows and columns):
dim(retail)
## [1] 9994   15
nrow(retail)
## [1] 9994
ncol(retail)
## [1] 15
  2. View column names:
colnames(retail)
##  [1] "Row.ID"       "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"   
##  [6] "Customer.ID"  "Segment"      "Product.ID"   "Category"     "Sub.Category"
## [11] "Product.Name" "Sales"        "Quantity"     "Discount"     "Profit"
names(retail)
##  [1] "Row.ID"       "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"   
##  [6] "Customer.ID"  "Segment"      "Product.ID"   "Category"     "Sub.Category"
## [11] "Product.Name" "Sales"        "Quantity"     "Discount"     "Profit"
  3. Preview the first and last rows:
head(retail) # first 6 rows of the data by default 
##   Row.ID       Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID
## 1      1 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 2      2 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 3      3 CA-2016-138688    6/12/16   6/16/16   Second Class    DV-13045
## 4      4 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 5      5 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 6      6 CA-2014-115812     6/9/14   6/14/14 Standard Class    BH-11710
##     Segment      Product.ID        Category Sub.Category
## 1  Consumer FUR-BO-10001798       Furniture    Bookcases
## 2  Consumer FUR-CH-10000454       Furniture       Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies       Labels
## 4  Consumer FUR-TA-10000577       Furniture       Tables
## 5  Consumer OFF-ST-10000760 Office Supplies      Storage
## 6  Consumer FUR-FU-10001487       Furniture  Furnishings
##                                                       Product.Name    Sales
## 1                                Bush Somerset Collection Bookcase 261.9600
## 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
## 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5                                   Eldon Fold 'N Roll Cart System  22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
##   Quantity Discount    Profit
## 1        2     0.00   41.9136
## 2        3     0.00  219.5820
## 3        2     0.00    6.8714
## 4        5     0.45 -383.0310
## 5        2     0.20    2.5164
## 6        7     0.00   14.1694
tail(retail) # last 6 rows of the data by default
##      Row.ID       Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID
## 9989   9989 CA-2017-163629   11/17/17  11/21/17 Standard Class    RA-19885
## 9990   9990 CA-2014-110422    1/21/14   1/23/14   Second Class    TB-21400
## 9991   9991 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060
## 9992   9992 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060
## 9993   9993 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060
## 9994   9994 CA-2017-119914     5/4/17    5/9/17   Second Class    CC-12220
##        Segment      Product.ID        Category Sub.Category
## 9989 Corporate TEC-PH-10004006      Technology       Phones
## 9990  Consumer FUR-FU-10001889       Furniture  Furnishings
## 9991  Consumer FUR-FU-10000747       Furniture  Furnishings
## 9992  Consumer TEC-PH-10003645      Technology       Phones
## 9993  Consumer OFF-PA-10004041 Office Supplies        Paper
## 9994  Consumer OFF-AP-10002684 Office Supplies   Appliances
##                                                                   Product.Name
## 9989                                           Panasonic KX - TS880B Telephone
## 9990                                                    Ultra Door Pull Handle
## 9991                        Tenex B1-RE Series Chair Mats for Low Pile Carpets
## 9992                                                     Aastra 57i VoIP phone
## 9993                         It's Hot Message Books with Stickers, 2 3/4" x 5"
## 9994 Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
##        Sales Quantity Discount  Profit
## 9989 206.100        5      0.0 55.6470
## 9990  25.248        3      0.2  4.1028
## 9991  91.960        2      0.0 15.6332
## 9992 258.576        2      0.2 19.3932
## 9993  29.600        4      0.0 13.3200
## 9994 243.160        2      0.0 72.9480
  4. Check for missing values (NA):
anyNA(retail) # check NA of all data
## [1] FALSE
colSums(is.na(retail)) # check NA per column
##       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
##            0            0            0            0            0            0 
##      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
##            0            0            0            0            0            0 
##     Quantity     Discount       Profit 
##            0            0            0
  5. View the whole data structure:
str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...
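
The retail data happens to contain no missing values, so the missing-value check from the list above returns FALSE and zeros. To see what the same functions report when NA is present, here is a small sketch on a hypothetical data frame (df_na and its contents are assumptions made for illustration):

```r
# Hypothetical data frame with two missing values
df_na <- data.frame(
  id    = 1:4,
  sales = c(10.5, NA, 7.2, 3.0),
  city  = c("Jakarta", "Bandung", NA, "Medan")
)

has_na <- anyNA(df_na)               # TRUE: at least one NA somewhere
na_counts <- colSums(is.na(df_na))   # number of NA per column
has_na
na_counts                            # id 0, sales 1, city 1
```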

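Notice: For factor columns, str() shows the number of levels (unique categories), a few level labels, and some integer values. These integers are indexes that point to the level of each observation. In the output above, however, the text columns were read in as character (chr), which is the default behavior of read.csv() since R 4.0.0, so Ship.Mode has no levels yet:

levels(retail$Ship.Mode) # NULL: Ship.Mode is still character; as a factor it would have 4 levels (4 unique categories)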
## NULL
str(retail$Ship.Mode)
##  chr [1:9994] "Second Class" "Second Class" "Second Class" "Standard Class" ...

So, for the Ship.Mode column in rows 1-3, the label that appears is “Second Class.”
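
To see levels and their integer indexes, the column has to be converted to a factor first. The sketch below uses a short stand-in vector (ship_mode, built from the four values visible in the output above) rather than the full retail data:

```r
# Small stand-in for the Ship.Mode column (values taken from the output above)
ship_mode <- c("Second Class", "Second Class", "Second Class", "Standard Class")

levels(ship_mode)             # NULL: a character vector has no levels

ship_factor <- as.factor(ship_mode)
levels(ship_factor)           # "Second Class" "Standard Class"
as.integer(ship_factor)       # 1 1 1 2: index of the level for each observation
str(ship_factor)              # shows the level count, labels, and integer codes
```

On the real data, the equivalent step would presumably be retail$Ship.Mode <- as.factor(retail$Ship.Mode).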

To look at the distribution and content of the data, you can also use summary(). It displays:

  • Factor type: Count per category.
  • Numeric type: Five-number summary (minimum, first quartile, median, third quartile, maximum) plus the mean.
  • Date type: Date range.
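
A short sketch of summary() on each of these types is shown below. The data frame toy and its values are hypothetical and exist only to make the example self-contained:

```r
# Hypothetical data frame with one column of each type
toy <- data.frame(
  mode  = factor(c("First Class", "Second Class", "Second Class")),
  sales = c(10.5, 20, 7.2),
  date  = as.Date(c("2016-11-08", "2016-06-12", "2015-10-11"))
)

mode_counts <- summary(toy$mode)   # count per category
sales_stats <- summary(toy$sales)  # Min, 1st Qu., Median, Mean, 3rd Qu., Max
mode_counts
sales_stats
summary(toy$date)                  # earliest to latest date
summary(toy)                       # all columns at once
```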

Data Transformation

After checking the structure, make sure each column’s data type is correct. If it is not, convert the column explicitly (explicit coercion).

Convert columns to character type using as.character for:

  • Order ID
  • Customer ID
  • Product ID
  • Product Name
retail$Order.ID <- as.character(retail$Order.ID)
retail$Customer.ID <- as.character(retail$Customer.ID)
retail$Product.ID <- as.character(retail$Product.ID)
retail$Product.Name <- as.character(retail$Product.Name)

Check using class() or str():

class(retail$Order.ID)
## [1] "character"

Convert columns to Date type using as.Date:

  • Order Date
  • Ship Date
# Check the initial date format
head(retail$Order.Date)
## [1] "11/8/16"  "11/8/16"  "6/12/16"  "10/11/15" "10/11/15" "6/9/14"
retail$Order.Date <- as.Date(retail$Order.Date, "%m/%d/%y")
retail$Ship.Date <- as.Date(retail$Ship.Date, "%m/%d/%y")
# Check again
str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...
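
The format string passed to as.Date() has to describe how the text is written. The sketch below isolates that step on three date strings copied from the Order.Date output above; the names raw_dates, parsed, and gap are introduced only for illustration:

```r
# Date strings in month/day/two-digit-year form, as in the Order.Date column
raw_dates <- c("11/8/16", "6/12/16", "10/11/15")

# %m = month, %d = day, %y = two-digit year
parsed <- as.Date(raw_dates, format = "%m/%d/%y")
parsed            # "2016-11-08" "2016-06-12" "2015-10-11"
class(parsed)     # "Date"

# A format that does not match the strings yields NA instead of an error
mismatch <- as.Date("11/8/16", format = "%Y-%m-%d")
mismatch          # NA

# Once parsed, dates support arithmetic
gap <- parsed[1] - parsed[2]
gap               # Time difference of 149 days
```

Checking for NA after conversion (for example with anyNA()) is a quick way to spot a format string that does not match the data.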

Sampling Data