This section introduces students to R programming for data science.
Install R first, then RStudio. An R package is a collection of R functions and data, typically contributed by independent researchers and programmers. Each package needs to be installed only once, unless you need to update it.
install.packages("DataExplorer")
install.packages("tidyverse")
install.packages("data.table")
install.packages("ggplot2")
install.packages("readxl")
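Because installation is needed only once, a script can first check which packages are already present. The following is a minimal sketch rather than part of the original lesson; the variable names pkgs, installed, and missing_pkgs are illustrative, and the install call is left commented out so that running the block does not modify your library.

```r
# Packages used in this section
pkgs <- c("DataExplorer", "tidyverse", "data.table", "ggplot2", "readxl")

# installed.packages() returns a matrix whose row names are the installed packages
installed <- rownames(installed.packages())

# Keep only the packages that are not installed yet
missing_pkgs <- setdiff(pkgs, installed)
print(missing_pkgs)

# Uncomment to install whatever is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

An empty result (character(0)) means every package in the list is already installed.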
Unlike installation, loading must be repeated in every new R session. Load only the packages that your current code needs.
library(DataExplorer)
library(tidyverse)
library(data.table)
library(ggplot2)
A data.frame is R's built-in structure for tabular data. A data.table is an enhanced version of a data.frame: it inherits from data.frame, so it can be used wherever a data.frame is expected, and it adds functionality such as fast file reading with fread. The “class(dt)” command shows that dt is both a data.table and a data.frame.
# read csv data using data.table::fread
dt <- fread("D:/ISU/Cur/Data Science/RM/Students3rdyr.csv")
class(dt)
## [1] "data.table" "data.frame"
# read Excel data using readxl::read_excel, which returns a tibble
library(readxl)
dt <- read_excel("D:/ISU/Cur/Students1stsem2022_23.xlsx", sheet = "Data Science 3")
class(dt)
## [1] "tbl_df" "tbl" "data.frame"
# convert the tibble to a data.table
dt <- as.data.table(dt)
class(dt)
## [1] "data.table" "data.frame"
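The inheritance described above can be checked on a small in-memory table. This is a sketch under stated assumptions: the student and grade columns are made-up values, not the contents of the files read above, and the data.table package is assumed to be installed.

```r
library(data.table)

# Hypothetical student records (not the data from the files above)
df <- data.frame(
  student = c("A", "B", "C"),
  grade   = c(85, 90, 78)
)

# as.data.table() converts a data.frame (or tibble) to a data.table
dt_small <- as.data.table(df)

# A data.table is still a data.frame, so data.frame functions accept it
print(class(dt_small))
print(is.data.frame(dt_small))
print(nrow(dt_small))

# data.table syntax: filter rows in the first slot, compute in the second
high <- dt_small[grade > 80, .(mean_grade = mean(grade))]
print(high)
```

The last call keeps the rows with a grade above 80 and returns their mean grade as a one-row data.table.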
To read an Excel file from Google Drive:
a. Log in to your Google Drive account.
b. Locate the Excel file you want to open, and make it shareable to anyone with the link.
c. Open your Excel file using Google Sheets.
d. Copy the URL of your Excel file.
e. Use the URL as the first argument of the read_sheet function, as shown in the sample code below.
# reference: https://www.digitalocean.com/community/tutorials/google-sheets-in-r
# install.packages("googlesheets4")
# install.packages("googledrive")
library(googlesheets4)
library(googledrive)
##
## Attaching package: 'googledrive'
## The following objects are masked from 'package:googlesheets4':
##
## request_generate, request_make
dt <- read_sheet("https://docs.google.com/spreadsheets/d/1TJSH3e7JoYEv4JBJ-C2-51wO2-gWbJ5KQtSPdUamweI/edit#gid=1260842998", sheet = "Data Science 3")
## ! Using an auto-discovered, cached token.
## To suppress this message, modify your code or options to clearly consent to
## the use of a cached token.
## See gargle's "Non-interactive auth" vignette for more details:
## <https://gargle.r-lib.org/articles/non-interactive-auth.html>
## ℹ The googlesheets4 package is using a cached token for 'jfolledo@gmail.com'.
## Auto-refreshing stale OAuth token.
## ✔ Reading from "Folledo Students 1st sem 2022_23".
## ✔ Range ''Data Science 3''.
# convert the result of read_sheet (a tibble) to a data.table
dt <- data.table(dt)
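When a sheet is shared with anyone who has the link, it can usually be read without signing in. googlesheets4 provides gs4_deauth() for this purpose. The helper below is a sketch only: the name read_public_sheet is hypothetical, and it assumes that the sheet is link-shared and that googlesheets4 and data.table are installed.

```r
# Hypothetical helper: read a link-shared Google Sheet without signing in
read_public_sheet <- function(url, sheet = NULL) {
  # Skip authentication; this only works for sheets shared with anyone with the link
  googlesheets4::gs4_deauth()
  tbl <- googlesheets4::read_sheet(url, sheet = sheet)
  # Return a data.table, as in the rest of this section
  data.table::as.data.table(tbl)
}

# Usage (requires internet access; replace the placeholder with your own sheet URL):
# dt <- read_public_sheet("<url of your sheet>", sheet = "Data Science 3")
```

With authentication turned off, the messages about cached tokens should no longer appear, but private sheets will fail to load.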
Reference: https://youtu.be/ssVEoj54rx4
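The two calls below are equivalent: the pipe operator %>% passes gss_cat as the first argument of glimpse(). Both show the number of rows and columns of the gss_cat data (a sample dataset from the forcats package, which is loaded with tidyverse), along with the type and first few values of each column.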
gss_cat %>% glimpse()
## Rows: 21,483
## Columns: 9
## $ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
## $ marital <fct> Never married, Divorced, Widowed, Never married, Divorced, Mar…
## $ age <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51, 52, 40…
## $ race <fct> White, White, White, White, White, White, White, White, White,…
## $ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not applicable, …
## $ partyid <fct> "Ind,near rep", "Not str republican", "Independent", "Ind,near…
## $ relig <fct> Protestant, Protestant, Protestant, Orthodox-christian, None, …
## $ denom <fct> "Southern baptist", "Baptist-dk which", "No denomination", "No…
## $ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, NA, 3, 3…
# does the same without using %>%
glimpse(gss_cat)
## Rows: 21,483
## Columns: 9
## $ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
## $ marital <fct> Never married, Divorced, Widowed, Never married, Divorced, Mar…
## $ age <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51, 52, 40…
## $ race <fct> White, White, White, White, White, White, White, White, White,…
## $ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not applicable, …
## $ partyid <fct> "Ind,near rep", "Not str republican", "Independent", "Ind,near…
## $ relig <fct> Protestant, Protestant, Protestant, Orthodox-christian, None, …
## $ denom <fct> "Southern baptist", "Baptist-dk which", "No denomination", "No…
## $ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, NA, 3, 3…
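The equivalence between the piped and the direct call holds for any function, not only glimpse(). A small sketch, assuming the magrittr package (the source of %>%, installed along with tidyverse) is available; the values are arbitrary.

```r
library(magrittr)  # provides %>%; tidyverse packages re-export the same operator

x <- c(1, 4, 9)

# x %>% f() is evaluated as f(x)
piped  <- x %>% sqrt()
direct <- sqrt(x)
print(identical(piped, direct))

# Extra arguments stay in place: x %>% f(y) is evaluated as f(x, y)
piped_round  <- 3.14159 %>% round(2)
direct_round <- round(3.14159, 2)
print(identical(piped_round, direct_round))
```

The pipe is a matter of readability: it lets a chain of operations be read from left to right instead of inside out.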
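# plot basic information about the data: share of discrete and continuous columns, complete rows, and missing values
gss_cat %>% plot_intro()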
### 3. Check missing values
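# plot the share of missing values in each column
gss_cat %>% plot_missing()
# tabulate the number and proportion of missing values in each column
gss_cat %>% profile_missing()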
## # A tibble: 9 × 3
##   feature num_missing pct_missing
##   <fct>         <int>       <dbl>
## 1 year              0     0
## 2 marital           0     0
## 3 age              76     0.00354
## 4 race              0     0
## 5 rincome           0     0
## 6 partyid           0     0
## 7 relig             0     0
## 8 denom             0     0
## 9 tvhours       10146     0.472
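Note that pct_missing is a proportion rather than a percentage: it is num_missing divided by the number of rows, for example 76 / 21483 for age. The sketch below reproduces that computation in base R on a small made-up data frame; the column values are hypothetical and DataExplorer is not required.

```r
# Hypothetical data with a few missing values
df <- data.frame(
  age     = c(26, NA, 67, 39),
  tvhours = c(12, NA, NA, 4),
  marital = c("Married", "Divorced", "Widowed", "Married")
)

# Count NA values per column, then divide by the number of rows
num_missing <- colSums(is.na(df))
pct_missing <- num_missing / nrow(df)

profile <- data.frame(
  feature     = names(df),
  num_missing = as.integer(num_missing),
  pct_missing = as.numeric(pct_missing)
)
print(profile)

# The same arithmetic applied to the gss_cat counts shown above
print(signif(76 / 21483, 3))
print(signif(10146 / 21483, 3))
```

Nearly half of the tvhours values are missing, which is worth keeping in mind before summarising that column.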
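# density estimates of the continuous columns
gss_cat %>% plot_density()
# histograms of the continuous columns
gss_cat %>% plot_histogram()
# bar charts of the category frequencies in the discrete columns
gss_cat %>% plot_bar()
# correlation heatmap; discrete columns with more than 20 categories are skipped
gss_cat %>% plot_correlation()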
## 1 features with more than 20 categories ignored!
## denom: 30 categories
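The message above means that denom was left out of the correlation plot because it has 30 categories, more than the limit of 20. To see in advance which columns exceed such a limit, the number of categories per column can be counted in base R. This is a sketch on made-up data; the columns id and group are hypothetical, and the threshold of 20 is taken from the message. In DataExplorer the limit is controlled by the maxcat argument of plot_correlation().

```r
# Hypothetical data: "id" has 25 categories, "group" has 5
df <- data.frame(
  id    = factor(1:25),
  group = factor(rep(c("a", "b", "c", "d", "e"), times = 5))
)

# Number of categories (factor levels) in each column
n_cat <- vapply(df, nlevels, integer(1))
print(n_cat)

# Columns exceeding the threshold of 20 taken from the message above
too_many <- names(n_cat)[n_cat > 20]
print(too_many)
```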