Getting Started with R
Welcome to R! Whether you’re just starting or have some
experience, R is an amazing tool for diving into data. Its
straightforward syntax and powerful features make it easy to get started
and grow your skills. With R, you can explore, analyze, and visualize
data in ways that are both fun and impactful. Let’s get started and see
where your data journey takes you!
Powering Data with R & Markdown
Understanding R Markdown
White sections in .Rmd files are for narrative text. Add headers using # and plain text for paragraphs. Example: “This is my first narrative.”
Gray sections are code chunks for R scripts.
Adding Comments in Code Chunks:
Use # to add comments.
Mathematical Operations:
+ Addition, - Subtraction, * Multiplication, / Division, ^ Exponentiation
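As a quick illustration, the sketch below applies each arithmetic operator to the same pair of numbers. The values are arbitrary and chosen only for the example.

```r
# Arithmetic operators applied to arbitrary example values
sum_result  <- 7 + 2   # addition
diff_result <- 7 - 2   # subtraction
prod_result <- 7 * 2   # multiplication
quot_result <- 7 / 2   # division (the result keeps its decimal part)
pow_result  <- 7 ^ 2   # exponentiation

print(c(sum_result, diff_result, prod_result, quot_result, pow_result))
```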
Useful Shortcuts in RStudio:
- Create a New Chunk: Ctrl + Alt + I
- Run Selected Line(s): Ctrl + Enter
- Run Current Chunk: Ctrl + Shift + Enter
- Add Comments: Ctrl + Shift + C
- Assign Object <-: Alt + -
Case Sensitivity in R:
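R is case sensitive: names that differ only in letter case refer to different objects. A minimal sketch, using the hypothetical object names sales and Sales:

```r
# Two objects whose names differ only in case (hypothetical names)
sales <- 100
Sales <- 250

# They are stored separately and hold different values
print(sales)
print(Sales)

# Comparing them confirms they are not the same object
same_value <- identical(sales, Sales)
print(same_value)
```

The same rule applies to function names and to the logical constants, which must be written in upper case as TRUE and FALSE.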
Logical Operators:
> Greater than, < Less than, >= Greater than or equal to, <= Less than or equal to, == Equals, != Not equal, & AND, | OR
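The following sketch shows how these operators behave on two example numbers. The variable names and values are assumptions made for illustration.

```r
# Comparison operators return logical values (TRUE or FALSE)
x <- 5
y <- 8

is_greater   <- x > y    # FALSE
is_less_eq   <- x <= y   # TRUE
is_equal     <- x == 5   # TRUE
is_not_equal <- x != y   # TRUE

# & (AND) is TRUE only when both sides are TRUE
both_true <- (x < y) & (y > 10)     # FALSE, because y > 10 is FALSE

# | (OR) is TRUE when at least one side is TRUE
either_true <- (x < y) | (y > 10)   # TRUE, because x < y is TRUE

print(c(is_greater, is_less_eq, is_equal, is_not_equal, both_true, either_true))
```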
Data Types
- Character: Strings or text.
Example:
## [1] "character"
- Numeric: Continuous or discrete numbers
Example:
## [1] "numeric"
- Integer: Whole numbers with no decimal part. They typically come from ID data (customer ID, transaction ID, etc.). To force a number to be stored as an integer, append L to it (for example, 5L).
Example:
## [1] "integer"
- Logical: Boolean values (TRUE, FALSE).
Example:
## [1] "logical"
- Complex: Numbers with imaginary components.
## [1] "complex"
- Special Data Type: Factor
Factors represent categorical data efficiently.
gender <- c("Female", "Male", "Female") # character data type
gender <- factor(gender) # converted to factor
gender
## [1] Female Male Female
## Levels: Female Male
class(gender)
## [1] "factor"
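The example code for the basic data types did not survive on this page, so here is a sketch of how each of the outputs above could be produced with class(). The values are illustrative assumptions, not the original examples.

```r
# class() reports the data type of a value (the values are illustrative)
chr_type <- class("Hello")   # "character"
num_type <- class(3.14)      # "numeric"
int_type <- class(5L)        # "integer": the L suffix forces integer storage
lgl_type <- class(TRUE)      # "logical"
cpl_type <- class(2 + 3i)    # "complex"

# Without the L suffix, a whole number is still stored as numeric
plain_five <- class(5)       # "numeric"

print(c(chr_type, num_type, int_type, lgl_type, cpl_type, plain_five))
```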
Data Structures
- Vector: One-dimensional and homogeneous (all elements share one data type). Example:
## [1] 3
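A minimal sketch of a vector, with values chosen for illustration. Because a vector is homogeneous, mixing types makes R coerce every element to a single common type.

```r
# A numeric vector created with c()
numbers <- c(10, 20, 30)
n_elements   <- length(numbers)   # 3
numbers_type <- class(numbers)    # "numeric"

# Mixing types triggers implicit coercion to one common type
mixed <- c(1, "a", TRUE)
mixed_type <- class(mixed)        # "character"

print(mixed)   # every element is now a string: "1" "a" "TRUE"
```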
- Matrix: Two-dimensional and homogeneous. There are several ways to create a matrix:
Method 1: using the matrix() function
matrix1 <- matrix(1:6, nrow = 2, ncol = 3)
matrix1 # by default, values are filled in column by column
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
matrix2 <- matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
matrix2 # use the `byrow = TRUE` argument so that the values are filled in row by row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
Method 2: using the cbind() and rbind() functions
- cbind (column bind) combines several vectors as columns
- rbind (row bind) combines several vectors as rows
Notice: When using rbind/cbind to create a matrix, the original vector names become the row names/column names of the matrix.
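The binding functions can be sketched as follows. The vector names height and weight and their values are assumptions made for this example.

```r
# Two example vectors (names and values are assumptions)
height <- c(150, 160, 170)
weight <- c(50, 60, 70)

# cbind: each vector becomes a column, named after the vector
by_col <- cbind(height, weight)
print(by_col)
col_labels <- colnames(by_col)   # "height" "weight"

# rbind: each vector becomes a row, named after the vector
by_row <- rbind(height, weight)
print(by_row)
row_labels <- rownames(by_row)   # "height" "weight"
```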
- List: One-dimensional but can contain mixed data types. Example:
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "text"
##
## [[3]]
## [1] TRUE
## [1] "list"
A list can be subset with [] or [[]].
Use [] to get a sub-list: the result is still a list and keeps its index.
Use [[]] to extract the element itself, without the list wrapper.
## [[1]]
## [1] 1 2 3
## [1] "list"
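The difference between the two subsetting forms can be sketched like this. The list contents mirror the output shown above, but the exact original code is not preserved, so treat the object name my_list as an assumption.

```r
# A list holding three different data types
my_list <- list(c(1, 2, 3), "text", TRUE)
list_type <- class(my_list)        # "list"

# Single brackets return a sub-list (still wrapped in a list)
sub_list <- my_list[1]
sub_list_type <- class(sub_list)   # "list"

# Double brackets extract the element itself
element <- my_list[[1]]
element_type <- class(element)     # "numeric"

# The extracted element behaves like an ordinary vector
second_value <- element[2]
print(second_value)
```

Use [[]] whenever the next step needs the stored value itself, for example to do arithmetic on it.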
- Data.Frame: Tabular structure where columns can have mixed types. Example:
Exploring Data Frame:
## [1] "data.frame"
## [1] 2 2
## [1] "Name" "Age"
## [1] "1" "2"
## [1] "Name" "Age"
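The outputs above are consistent with the following sketch. The column names Name and Age come from the output; the object name df and the row values are made up for illustration.

```r
# A small data frame with one character and one numeric column
df <- data.frame(
  Name = c("Ana", "Budi"),   # made-up values
  Age  = c(25, 30)           # made-up values
)

df_class <- class(df)      # "data.frame"
df_dim   <- dim(df)        # 2 rows, 2 columns
df_names <- names(df)      # "Name" "Age"
df_rows  <- rownames(df)   # "1" "2"
df_cols  <- colnames(df)   # "Name" "Age"

print(df)
```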
Working with Data in R
Before starting a project in R, it’s important to work within a single folder or working directory. This ensures that all data files you process and their resulting outputs are saved in one organized location. There are specific functions in R to get and set the working directory:
## [1] "C:/Users/user/Documents/Data Science/Fran/P4DS"
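The two functions are getwd() and setwd(). In this sketch, tempdir() stands in for a real project folder so that the example runs on any machine; replace it with your own path.

```r
# Show the current working directory
original_dir <- getwd()
print(original_dir)

# Switch to another folder; tempdir() is a placeholder for a project folder
setwd(tempdir())
print(getwd())

# Return to the starting folder so later code is not affected
setwd(original_dir)
restored_dir <- getwd()
```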
Import Data
R supports reading various file formats. For instance, to read a CSV file, you can use the following methods:
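The original import code and file name are not preserved here, so the sketch below writes a tiny CSV to a temporary file and reads it back with read.csv(). The path, the object names, and the reduced set of columns are assumptions; the two rows reuse values from the dataset preview later on this page.

```r
# Create a tiny CSV in a temporary file so the example runs on its own.
# The path and the reduced column set are placeholders.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Order ID,Ship Mode,Sales",
    "CA-2016-152156,Second Class,261.96",
    "CA-2016-138688,Second Class,14.62"),
  csv_path
)

# read.csv() returns a data frame; by default it turns spaces in
# column names into dots (check.names = TRUE)
retail_sample <- read.csv(csv_path)
print(names(retail_sample))

# stringsAsFactors = TRUE reads text columns as factors instead of character
retail_factor <- read.csv(csv_path, stringsAsFactors = TRUE)
ship_mode_type <- class(retail_factor$Ship.Mode)   # "factor"
```

With a real file, pass its path relative to the working directory in place of csv_path.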
The data is already tidy/clean if:
- Each column is a variable.
- Each row is an observation.
- Each cell contains a single value.
Next, we can proceed to check the structure of the data.
Checking Data Structure
Before processing, it’s important to understand the dataset. Use these commands:
- Check dimensions (rows and columns):
## [1] 9994 15
## [1] 9994
## [1] 15
- View Column Names:
## [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
## [6] "Customer.ID" "Segment" "Product.ID" "Category" "Sub.Category"
## [11] "Product.Name" "Sales" "Quantity" "Discount" "Profit"
## [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
## [6] "Customer.ID" "Segment" "Product.ID" "Category" "Sub.Category"
## [11] "Product.Name" "Sales" "Quantity" "Discount" "Profit"
- Preview the first and last rows:
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 11/8/16 11/11/16 Second Class CG-12520
## 2 2 CA-2016-152156 11/8/16 11/11/16 Second Class CG-12520
## 3 3 CA-2016-138688 6/12/16 6/16/16 Second Class DV-13045
## 4 4 US-2015-108966 10/11/15 10/18/15 Standard Class SO-20335
## 5 5 US-2015-108966 10/11/15 10/18/15 Standard Class SO-20335
## 6 6 CA-2014-115812 6/9/14 6/14/14 Standard Class BH-11710
## Segment Product.ID Category Sub.Category
## 1 Consumer FUR-BO-10001798 Furniture Bookcases
## 2 Consumer FUR-CH-10000454 Furniture Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies Labels
## 4 Consumer FUR-TA-10000577 Furniture Tables
## 5 Consumer OFF-ST-10000760 Office Supplies Storage
## 6 Consumer FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit
## 1 2 0.00 41.9136
## 2 3 0.00 219.5820
## 3 2 0.00 6.8714
## 4 5 0.45 -383.0310
## 5 2 0.20 2.5164
## 6 7 0.00 14.1694
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 9989 9989 CA-2017-163629 11/17/17 11/21/17 Standard Class RA-19885
## 9990 9990 CA-2014-110422 1/21/14 1/23/14 Second Class TB-21400
## 9991 9991 CA-2017-121258 2/26/17 3/3/17 Standard Class DB-13060
## 9992 9992 CA-2017-121258 2/26/17 3/3/17 Standard Class DB-13060
## 9993 9993 CA-2017-121258 2/26/17 3/3/17 Standard Class DB-13060
## 9994 9994 CA-2017-119914 5/4/17 5/9/17 Second Class CC-12220
## Segment Product.ID Category Sub.Category
## 9989 Corporate TEC-PH-10004006 Technology Phones
## 9990 Consumer FUR-FU-10001889 Furniture Furnishings
## 9991 Consumer FUR-FU-10000747 Furniture Furnishings
## 9992 Consumer TEC-PH-10003645 Technology Phones
## 9993 Consumer OFF-PA-10004041 Office Supplies Paper
## 9994 Consumer OFF-AP-10002684 Office Supplies Appliances
## Product.Name
## 9989 Panasonic KX - TS880B Telephone
## 9990 Ultra Door Pull Handle
## 9991 Tenex B1-RE Series Chair Mats for Low Pile Carpets
## 9992 Aastra 57i VoIP phone
## 9993 It's Hot Message Books with Stickers, 2 3/4" x 5"
## 9994 Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
## Sales Quantity Discount Profit
## 9989 206.100 5 0.0 55.6470
## 9990 25.248 3 0.2 4.1028
## 9991 91.960 2 0.0 15.6332
## 9992 258.576 2 0.2 19.3932
## 9993 29.600 4 0.0 13.3200
## 9994 243.160 2 0.0 72.9480
- Check for missing values (NA):
## [1] FALSE
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 0 0 0 0 0 0
## Segment Product.ID Category Sub.Category Product.Name Sales
## 0 0 0 0 0 0
## Quantity Discount Profit
## 0 0 0
- View whole data structure:
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
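The commands behind the outputs above are not shown on this page. The sketch below lists base R functions that produce this kind of output, run on a small stand-in data frame. The object name retail_demo and its values are assumptions, and one NA is included on purpose to show the missing-value checks.

```r
# Stand-in data frame (made-up values; the real data has 9994 rows)
retail_demo <- data.frame(
  Row.ID  = 1:3,
  Segment = c("Consumer", "Corporate", "Consumer"),
  Sales   = c(261.96, 14.62, NA)
)

dim(retail_demo)       # number of rows and columns
nrow(retail_demo)      # number of rows only
ncol(retail_demo)      # number of columns only
names(retail_demo)     # column names (colnames() gives the same result)
head(retail_demo, 2)   # first rows (6 by default)
tail(retail_demo, 2)   # last rows (6 by default)

has_missing    <- anyNA(retail_demo)            # TRUE if any value is NA
missing_by_col <- colSums(is.na(retail_demo))   # NA count per column
print(missing_by_col)

str(retail_demo)       # type and preview of every column
```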
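Notice: For factor columns, the structure shows the number of levels (unique values), a few level labels, and some integer values. These integers are indexes into the levels for each observation. In the output below, Ship.Mode is a character column rather than a factor, so no levels are reported (NULL) and the text values are shown directly: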
## NULL
## chr [1:9994] "Second Class" "Second Class" "Second Class" "Standard Class" ...
So, for the Ship.Mode column in rows 1-3, the label that
appears is “Second Class.”
To look at data distribution and content, you can also use summary. It displays:
- Factor type: Count per category.
- Numeric type: 5-number summary, plus the mean.
- Date type: Date range.
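A minimal sketch of summary() on a data frame that contains all three column types. The object name demo and its values are assumptions made for illustration.

```r
# Small data frame with a factor, a numeric, and a Date column
demo <- data.frame(
  Segment  = factor(c("Consumer", "Corporate", "Consumer")),
  Sales    = c(10, 20, 60),
  ShipDate = as.Date(c("2016-11-11", "2016-06-16", "2015-10-18"))
)

# One call summarises every column according to its type
print(summary(demo))

# Individual columns can be summarised as well
segment_counts <- summary(demo$Segment)   # count per category
sales_summary  <- summary(demo$Sales)     # min, quartiles, median, mean, max
date_range     <- range(demo$ShipDate)    # earliest and latest date
```

Character columns only report their length and class, which is one reason to convert categorical text columns to factors before calling summary().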
Data Transformation
After checking the structure, make sure each column’s data type is correct. If not, do explicit coercion.
Convert these columns to character type using as.character():
- Order ID
- Customer ID
- Product ID
- Product Name
retail$Order.ID <- as.character(retail$Order.ID)
retail$Customer.ID <- as.character(retail$Customer.ID)
retail$Product.ID <- as.character(retail$Product.ID)
retail$Product.Name <- as.character(retail$Product.Name)
Check using class() or str():
## [1] "character"
Convert columns to Date type using as.Date:
- Order Date
- Ship Date
## [1] "11/8/16" "11/8/16" "6/12/16" "10/11/15" "10/11/15" "6/9/14"
retail$Order.Date <- as.Date(retail$Order.Date, "%m/%d/%y")
retail$Ship.Date <- as.Date(retail$Ship.Date, "%m/%d/%y")
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
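To see what the format string does, the sketch below parses three of the date strings shown earlier. The object names are assumptions; the format codes are %m for month, %d for day, and %y for a two-digit year.

```r
# Date strings in month/day/two-digit-year form, as in the Order.Date column
raw_dates <- c("11/8/16", "6/12/16", "10/11/15")
raw_type <- class(raw_dates)      # "character"

# The format string must match the layout of the text exactly
parsed <- as.Date(raw_dates, format = "%m/%d/%y")
parsed_type <- class(parsed)      # "Date"
print(parsed)

# A format that does not match the text yields NA instead of an error
mismatched <- as.Date("11/8/16", format = "%Y-%m-%d")

# Once parsed, dates support arithmetic such as the days between two dates
days_apart <- as.numeric(parsed[1] - parsed[2])
print(days_apart)
```

Checking for NA values after conversion is a quick way to confirm that the chosen format matched every row.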