Get Started with R for Data Science

This section will introduce students to R programming for Data Science.

Installing R, RStudio and R packages

Installation is needed only once, unless you need to update the software. Install R before installing RStudio, because RStudio is an interface that needs an existing R installation. An R package is a collection of R functions and data contributed by independent researchers and programmers.

Install R then RStudio.
Install packages.

install.packages("DataExplorer")
install.packages("tidyverse")
install.packages("data.table")
install.packages("ggplot2")
install.packages("readxl")
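
Because installation is needed only once, it can help to check which packages are already present before installing. The following is a minimal sketch using base R only; the helper name missing_packages is made up for this example, and the install line is left commented out so nothing is downloaded by accident.

```r
# Packages used in this section
pkgs <- c("DataExplorer", "tidyverse", "data.table", "ggplot2", "readxl")

# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```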

Loading packages

Load packages only when you need them in your current code.

library(DataExplorer)
library(tidyverse)
library(data.table)
library(ggplot2)
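
If only one function from a package is needed, it can be called with the package::function form without loading the whole package. The sketch below uses the tools package, which ships with R, so it runs without any extra installation; the file name is only an example string.

```r
# tools ships with R but is not attached by default
ext <- tools::file_ext("Students3rdyr.csv")
print(ext)

# The same pattern works for installed add-on packages, for example:
# dt <- data.table::fread("Students3rdyr.csv")
```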

Reading tables

Read a comma-delimited (CSV) file into a data.table

A data.table is an enhanced version of a data.frame, which is R's standard table structure. A data.table inherits from data.frame, so it works wherever a data.frame is expected, and it adds functionality such as fast file reading and concise syntax for filtering and grouping. The class(dt) command shows that dt is both a data.table and a data.frame.

# read csv data using data.table::fread
dt <- fread("D:/ISU/Cur/Data Science/RM/Students3rdyr.csv")
class(dt)

## [1] "data.table" "data.frame"
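
To try this without a file on disk, fread can also parse text passed through its text argument. The sketch below uses made-up sample data (the name and score columns are assumptions for illustration) and shows both data.frame-style and data.table-style access on the same object.

```r
library(data.table)

# Made-up sample data, passed as text instead of a file path
csv_text <- "name,score\nAna,90\nBen,85\nCara,78"
dt <- fread(text = csv_text)

# data.frame-style access still works
mean(dt$score)

# data.table-style filtering: rows with score above 80
passed <- dt[score > 80]
print(passed)
```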

Reading an Excel file

library(readxl)
dt <- read_excel("D:/ISU/Cur/Students1stsem2022_23.xlsx", sheet = "Data Science 3")
class(dt)

## [1] "tbl_df"     "tbl"        "data.frame"

# convert to data.table
dt <- data.table(dt)
class(dt)

## [1] "data.table" "data.frame"
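
When the sheet name is not known in advance, excel_sheets lists the sheets in a workbook. The sketch below uses the example workbook bundled with readxl (located with readxl_example) instead of the course file, and converts the result with as.data.table, which is an alternative to calling data.table on the tibble.

```r
library(readxl)
library(data.table)

# Example workbook that ships with readxl
path <- readxl_example("datasets.xlsx")

# List the sheet names before choosing one
sheets <- excel_sheets(path)
print(sheets)

# Read one sheet by name and convert the tibble to a data.table
dt <- as.data.table(read_excel(path, sheet = "mtcars"))
class(dt)
```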

Reading Excel file from Google Drive

To read an Excel file from Google Drive:
a. Log in to your Google Drive account.
b. Locate the Excel file you want to open and share it with anyone who has the link.
c. Open your Excel file using Google Sheets.
d. Copy the URL of your Excel file.
e. Use the URL as the first argument of the read_sheet function, as shown in the sample code below.

# reference: https://www.digitalocean.com/community/tutorials/google-sheets-in-r
# install.packages("googlesheets4")
# install.packages("googledrive")
library(googlesheets4)
library(googledrive)

## 
## Attaching package: 'googledrive'

## The following objects are masked from 'package:googlesheets4':
## 
##     request_generate, request_make

dt <- read_sheet("https://docs.google.com/spreadsheets/d/1TJSH3e7JoYEv4JBJ-C2-51wO2-gWbJ5KQtSPdUamweI/edit#gid=1260842998", sheet = "Data Science 3")

## ! Using an auto-discovered, cached token.

##   To suppress this message, modify your code or options to clearly consent to
##   the use of a cached token.

##   See gargle's "Non-interactive auth" vignette for more details:

##   <https://gargle.r-lib.org/articles/non-interactive-auth.html>

## ℹ The googlesheets4 package is using a cached token for 'jfolledo@gmail.com'.

## Auto-refreshing stale OAuth token.

## ✔ Reading from "Folledo Students 1st sem 2022_23".

## ✔ Range ''Data Science 3''.

dt <- data.table(dt)
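
Since the file is shared with anyone who has the link, signing in should not be required. The sketch below wraps the steps in a helper whose name, read_public_sheet, is made up for this example; it assumes the googlesheets4 and data.table packages are installed and calls gs4_deauth so that no cached token is used.

```r
# Hypothetical helper: read one tab of a link-shared Google Sheet without signing in
read_public_sheet <- function(url, sheet) {
  googlesheets4::gs4_deauth()  # turn off authentication for this session
  sheet_data <- googlesheets4::read_sheet(url, sheet = sheet)
  data.table::as.data.table(sheet_data)
}
```

Calling read_public_sheet with the URL used above and sheet = "Data Science 3" should return the same table as a data.table. It needs an internet connection, and it fails if the sheet is not shared publicly.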

Exploratory data analysis (EDA) using library(DataExplorer)

Reference: https://youtu.be/ssVEoj54rx4

1. A short glimpse of the gss_cat data (from the forcats package, loaded with tidyverse)

The two commands below are equivalent. Each shows the number of rows and columns of the gss_cat data, the type of each column, and the first few values of each column.

gss_cat %>% glimpse()

## Rows: 21,483
## Columns: 9
## $ year    <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
## $ marital <fct> Never married, Divorced, Widowed, Never married, Divorced, Mar…
## $ age     <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51, 52, 40…
## $ race    <fct> White, White, White, White, White, White, White, White, White,…
## $ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not applicable, …
## $ partyid <fct> "Ind,near rep", "Not str republican", "Independent", "Ind,near…
## $ relig   <fct> Protestant, Protestant, Protestant, Orthodox-christian, None, …
## $ denom   <fct> "Southern baptist", "Baptist-dk which", "No denomination", "No…
## $ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, NA, 3, 3…

# does the same without using %>%
glimpse(gss_cat)

## Rows: 21,483
## Columns: 9
## $ year    <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
## $ marital <fct> Never married, Divorced, Widowed, Never married, Divorced, Mar…
## $ age     <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51, 52, 40…
## $ race    <fct> White, White, White, White, White, White, White, White, White,…
## $ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not applicable, …
## $ partyid <fct> "Ind,near rep", "Not str republican", "Independent", "Ind,near…
## $ relig   <fct> Protestant, Protestant, Protestant, Orthodox-christian, None, …
## $ denom   <fct> "Southern baptist", "Baptist-dk which", "No denomination", "No…
## $ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, NA, 3, 3…
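
The equivalence holds for any function: x %>% f() passes x as the first argument of f. A small sketch with plain numbers, assuming the magrittr package (which defines %>%) is installed:

```r
library(magrittr)  # defines %>%; library(tidyverse) makes the same operator available

x <- c(1, 4, 9)
piped  <- x %>% sqrt()
direct <- sqrt(x)
identical(piped, direct)

# Extra arguments follow the piped value: same as round(3.14159, digits = 2)
rounded <- 3.14159 %>% round(digits = 2)
print(rounded)
```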

2. Introduction to the data

gss_cat %>% plot_intro()
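
DataExplorer also provides introduce, which returns basic counts about a data set as a table rather than a chart. A minimal sketch, assuming DataExplorer and forcats are installed:

```r
library(DataExplorer)

# gss_cat ships with the forcats package (part of tidyverse)
info <- introduce(forcats::gss_cat)

# One-row table: counts of rows, columns, missing values, and so on
print(t(info))
```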

3. Check missing values

gss_cat %>% plot_missing()

gss_cat %>% profile_missing()

## # A tibble: 9 × 3
##   feature num_missing pct_missing
##   <fct>         <int>       <dbl>
## 1 year              0     0      
## 2 marital           0     0      
## 3 age              76     0.00354
## 4 race              0     0      
## 5 rincome           0     0      
## 6 partyid           0     0      
## 7 relig             0     0      
## 8 denom             0     0      
## 9 tvhours       10146     0.472
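
The same counts can be cross-checked with base R, which makes it clearer what profile_missing reports. A short sketch, assuming only that forcats is installed for the gss_cat data:

```r
gss <- forcats::gss_cat

# is.na() marks every missing cell; column sums and means give counts and shares
n_missing   <- colSums(is.na(gss))
pct_missing <- colMeans(is.na(gss))

# Show only the columns that have missing values
print(n_missing[n_missing > 0])
print(round(pct_missing[pct_missing > 0], 3))
```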

4. Plot continuous variables

gss_cat %>% plot_density()

gss_cat %>% plot_histogram()
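
To look at a single continuous variable in more detail, a histogram can also be built directly with ggplot2. This is a sketch rather than what DataExplorer does internally; the bin width of 1 hour and the axis labels are choices made for this example.

```r
library(ggplot2)

# Histogram of one continuous column; rows with missing tvhours are dropped
p <- ggplot(forcats::gss_cat, aes(x = tvhours)) +
  geom_histogram(binwidth = 1, na.rm = TRUE) +
  labs(x = "Hours of TV per day", y = "Number of respondents")

print(p)
```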

5. Plot categorical variables

gss_cat %>% plot_bar()
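
The bars are frequency counts of each category. A base R sketch that produces the counts for one column and the number of distinct values in every categorical column, assuming forcats is installed for the gss_cat data:

```r
gss <- forcats::gss_cat

# Frequency of each marital status, most common first
marital_counts <- sort(table(gss$marital), decreasing = TRUE)
print(marital_counts)

# Number of distinct values in every factor column
is_factor <- vapply(gss, is.factor, logical(1))
n_categories <- vapply(gss[is_factor], function(col) length(unique(col)), integer(1))
print(n_categories)
```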

6. Plot the correlation between variables

gss_cat %>% plot_correlation()

## 1 features with more than 20 categories ignored!
## denom: 30 categories
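
For the numeric columns alone, the correlations can be computed with base R. The sketch below is not a replacement for plot_correlation, which also covers categorical columns (hence the message about denom); it assumes forcats is installed and handles missing values pair by pair.

```r
gss <- forcats::gss_cat

# Keep only the numeric columns: year, age, tvhours
num_cols <- gss[vapply(gss, is.numeric, logical(1))]

# Pairwise correlations, ignoring missing values pair by pair
cor_matrix <- cor(num_cols, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```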