Introduction to Using R Software in Data Processing

October 25, 2023

Approval of Central Mindanao University to join the training program by DOST-PCIEERD Scholarship and MOOCSX Philippines through COURSERA

Objectives

Provide a history and overview of R
Introduce basic commands in R
Introduce R Script and R Markdown
Install some R packages
Illustrate generating data in R, working with data in R, and importing Excel data into R

History and Overview of R

1976: The origins of R can be traced to a programming language called “S,” whose development began at Bell Laboratories under John Chambers and his colleagues. S was designed for data analysis and graphics.
1993: Ross Ihaka and Robert Gentleman, both statisticians at the University of Auckland in New Zealand, made R publicly available as an implementation of the S language. Their goal was to create a free, accessible, and extensible statistical software tool.
1995: R was released as free and open-source software under the GNU General Public License.
1997: The R Core Team was formed to steer the development of R, and the Comprehensive R Archive Network (CRAN) was established. CRAN provides a centralized platform for developers to share and distribute their R packages.
2000: R version 1.0.0 was released, the first version considered stable enough for production use. The R community continued to grow, and contributions from developers worldwide enhanced the language’s capabilities and package ecosystem.
2004: R version 2.0.0 was released, introducing significant improvements and new features.
2011: The RStudio Integrated Development Environment (IDE) was released. RStudio offers a user-friendly interface, code editing features, debugging tools, and enhanced data visualization capabilities, making it a popular choice among R users.
2016: By this time, industry surveys regularly ranked R among the most widely used programming languages for data science, solidifying its position as a leading tool in the field.
Present: R continues to evolve and thrive, with regular updates and new releases. The R community remains active, contributing to the development of new packages, improving performance, and expanding the language’s capabilities.

The R Installation

Obtain a copy of the R installer from a dependable source or directly from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/
At the time of this presentation (October 2023), the latest version of R is 4.3.1.

Once the installation is done, start R by clicking the Desktop icon for R

The R Console

When you open an R session (i.e., start the R program), the R console opens.
Along the top of the window is a limited set of menus, which can be used for various tasks, including opening, loading, and saving script windows; loading and saving your workspace; and installing packages.

The RStudio

RStudio is an integrated development environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.
RStudio is available in open-source and commercial editions and runs on the desktop (Windows, macOS, and Linux).

You can download the latest version of RStudio at https://www.rstudio.com/products/rstudio/

Basic R commands

R can be used as an interactive calculator.

Addition

5+7

## [1] 12

Subtraction

10-5

## [1] 5

Storing result to a variable

x<-5+7

Call the variable x

x

## [1] 12
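The same interactive style extends to the other arithmetic operators. The following is a minimal sketch rather than part of the original demonstration; the variable name `y` is chosen only for illustration.

```r
# Multiplication, division, and exponentiation work like addition and subtraction
4 * 3      # multiplication
20 / 4     # division
2^3        # exponentiation

# Store results in variables and reuse them in later calculations
x <- 5 + 7
y <- x * 2

# Typing a variable name at the console, or calling print(), displays its value
print(y)
```

Assigning with `<-` stores the value without printing it, which is why the variable must be called afterward to see the result.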

Introduction to R Script and R Markdown

R Script
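An R script is a plain-text file, conventionally saved with the `.R` extension, that stores commands so they can be rerun later instead of being retyped in the console. The sketch below is only an illustration of that idea: it writes a tiny script to a temporary file (the file location and the variable names `total` and `difference` are assumptions made for this example) and then runs it with `source()`.

```r
# Write a tiny script to a temporary file (the location is illustrative only)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "# Commands in a script run from top to bottom",
  "total <- 5 + 7",
  "difference <- 10 - 5"
), script_path)

# source() runs every command in the script;
# local = TRUE creates the variables in the environment where source() is called
source(script_path, local = TRUE)

print(total)
print(difference)
```

In RStudio, the same result is usually obtained by opening a new script in the editor, typing the commands, and running them from there.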

R Markdown

Installing packages
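Packages are installed once with `install.packages()` and then loaded in each session with `library()`. The sketch below assumes an internet connection and a configured CRAN mirror; `install_if_missing()` is a hypothetical helper written for this example, not a function from R or any package.

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  # requireNamespace() returns FALSE when a package is not installed
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!installed]
  if (length(to_install) > 0) {
    install.packages(to_install)   # downloads from CRAN; needs internet access
  }
  invisible(to_install)            # names of the packages that were missing
}

# Packages used later in this presentation (uncomment to run):
# install_if_missing(c("readxl", "dplyr", "ggpubr"))

# After installation, load a package at the start of each session:
# library(readxl)
```

Checking first avoids reinstalling packages that are already present, which keeps repeated runs of a script fast.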

Working with data in R: generated data and Excel data

library(readxl)

## Warning: package 'readxl' was built under R version 4.2.3

Sample <- read_excel("D:/PSQ 2023/CMU Webinar/Data.xlsx")
Sample

## # A tibble: 15 × 2
##    Time    Accuracy
##    <chr>      <dbl>
##  1 Ten           95
##  2 Ten           90
##  3 Ten           65
##  4 Ten           95
##  5 Ten           85
##  6 Fifteen       70
##  7 Fifteen       65
##  8 Fifteen       50
##  9 Fifteen       55
## 10 Fifteen       70
## 11 Twenty        45
## 12 Twenty        55
## 13 Twenty        45
## 14 Twenty        40
## 15 Twenty        70
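Readers who do not have the Excel file can generate the same data directly in R. The sketch below rebuilds the 15 rows from the values printed above; it produces a base R data frame rather than a tibble, so the printed layout differs slightly.

```r
# Rebuild the Sample data from the values printed above
Sample <- data.frame(
  Time = rep(c("Ten", "Fifteen", "Twenty"), each = 5),
  Accuracy = c(95, 90, 65, 95, 85,   # Ten
               70, 65, 50, 55, 70,   # Fifteen
               45, 55, 45, 40, 70)   # Twenty
)

# Inspect the structure and the first rows
str(Sample)
head(Sample)
```

`rep(..., each = 5)` repeats each category label five times in a row, matching the order of the accuracy values.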

Some Statistics

Computation for Mean

mean(Sample$Accuracy)

## [1] 66.33333

Computation for Standard Deviation

sd(Sample$Accuracy)

## [1] 18.36793
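`sd()` returns the sample standard deviation, which divides the sum of squared deviations by n - 1 rather than n. As a check, the value can be reproduced by hand. This is a sketch for illustration; the names `accuracy`, `n`, `m`, and `s` are chosen for this example.

```r
# Accuracy values from the Sample data shown earlier
accuracy <- c(95, 90, 65, 95, 85, 70, 65, 50, 55, 70, 45, 55, 45, 40, 70)

n <- length(accuracy)
m <- sum(accuracy) / n                        # mean computed by hand
s <- sqrt(sum((accuracy - m)^2) / (n - 1))    # sample SD divides by n - 1

# Compare the hand computations with the built-in functions
all.equal(m, mean(accuracy))
all.equal(s, sd(accuracy))
```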

Other Summary Statistics

summary(Sample$Accuracy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.00   52.50   65.00   66.33   77.50   95.00

Mean and standard deviation computed by category

(aggregate(Sample$Accuracy, list(Sample$Time), mean))

##   Group.1  x
## 1 Fifteen 62
## 2     Ten 86
## 3  Twenty 51

(aggregate(Sample$Accuracy, list(Sample$Time), sd))

##   Group.1         x
## 1 Fifteen  9.082951
## 2     Ten 12.449900
## 3  Twenty 11.937336
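The output above labels its columns `Group.1` and `x`. One possible alternative, sketched here, is the formula interface of `aggregate()`, which keeps the original column names. The data frame is rebuilt inline so that the example runs on its own; `group_means` and `group_sds` are names chosen for this example.

```r
# Sample data rebuilt from the values shown earlier
Sample <- data.frame(
  Time = rep(c("Ten", "Fifteen", "Twenty"), each = 5),
  Accuracy = c(95, 90, 65, 95, 85, 70, 65, 50, 55, 70, 45, 55, 45, 40, 70)
)

# Formula interface: read "Accuracy ~ Time" as "Accuracy by Time"
group_means <- aggregate(Accuracy ~ Time, data = Sample, FUN = mean)
group_sds   <- aggregate(Accuracy ~ Time, data = Sample, FUN = sd)

print(group_means)
print(group_sds)
```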

Example of Graphical Presentation

library(ggpubr)
ggboxplot(Sample, x = "Time", y = "Accuracy", fill="Time")

Using dplyr package

library(dplyr)
Sample %>%
  group_by(Time) %>%
  summarize(Mean = mean(Accuracy), SD = sd(Accuracy))

## # A tibble: 3 × 3
##   Time     Mean    SD
##   <chr>   <dbl> <dbl>
## 1 Fifteen    62  9.08
## 2 Ten        86 12.4 
## 3 Twenty     51 11.9
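Because `Time` is stored as text, the groups are sorted alphabetically (Fifteen, Ten, Twenty) rather than in their natural order. One way to control the order, sketched here with base R so that the example runs on its own, is to convert the column to a factor with explicit levels.

```r
# Sample data rebuilt from the values shown earlier
Sample <- data.frame(
  Time = rep(c("Ten", "Fifteen", "Twenty"), each = 5),
  Accuracy = c(95, 90, 65, 95, 85, 70, 65, 50, 55, 70, 45, 55, 45, 40, 70)
)

# Declare the intended order of the categories
Sample$Time <- factor(Sample$Time, levels = c("Ten", "Fifteen", "Twenty"))

# Group summaries now follow the declared order
group_means <- tapply(Sample$Accuracy, Sample$Time, mean)
print(group_means)
```

Grouped summaries and plots that use the factor generally follow the level order, so the same conversion also arranges boxplots from Ten to Twenty.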

Additional Example

library(readxl)
Sample2 <- read_excel("D:/PSQ 2023/CMU Webinar/Data1.xlsx")

## New names:
## • `Relax3` -> `Relax3...40`
## • `Relax3` -> `Relax3...41`
## • `Education` -> `Education...50`
## • `` -> `...51`
## • `Income` -> `Income...52`
## • `` -> `...71`
## • `Education` -> `Education...72`
## • `` -> `...73`
## • `` -> `...74`
## • `` -> `...75`
## • `Income` -> `Income...76`
## • `` -> `...77`
## • `` -> `...79`
## • `` -> `...80`
## • `` -> `...81`
## • `` -> `...82`

Sample2

## # A tibble: 145 × 82
##      No. Gender   age Reappraisal1 Reappraisal2 Reappraisal3 Reappraisal4
##    <dbl> <chr>  <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
##  1     1 male      43            4            2            2            2
##  2     2 male      40            3            2            4            1
##  3     3 male      60            3            2            2            2
##  4     4 male      50            2            2            4            2
##  5     5 male      42            4            4            2            4
##  6     6 female    42            2            4            4            3
##  7     7 male      54            4            2            3            3
##  8     8 male      40            2            2            2            2
##  9     9 male      56            2            2            3            2
## 10    10 male      43            4            2            4            4
## # ℹ 135 more rows
## # ℹ 75 more variables: Reappraisal5 <dbl>, ReappraisalMean <dbl>,
## #   SocialSupport1 <dbl>, SocialSupport2 <dbl>, SocialSupport3 <dbl>,
## #   SocialSupportMean <dbl>, ProbSolving1 <dbl>, ProbSolving2 <dbl>,
## #   ProbSolving3 <dbl>, ProbSolving4 <dbl>, ProbSolvingMean <dbl>, Rel1 <dbl>,
## #   Rel2 <dbl>, Rel3 <dbl>, Rel4 <dbl>, RelMean <dbl>, Tol1 <dbl>, Tol2 <dbl>,
## #   TolMean <dbl>, Emo1 <dbl>, Emo2 <dbl>, Emo3 <dbl>, Emo4 <dbl>, …

Number of rows and columns in a dataset

dim(Sample2)

## [1] 145  82

Mean and Standard Deviation of Age

library(dplyr)
Sample2 %>% 
    summarize(`Mean Age` = mean(age), `SD of Age` = sd(age))

## # A tibble: 1 × 2
##   `Mean Age` `SD of Age`
##        <dbl>       <dbl>
## 1       49.3        8.39

Age classified by Gender

library(dplyr)
Sample2 %>% 
    group_by(Gender)%>%
    summarize(`Mean Age` = mean(age), `SD of Age` = sd(age))

## # A tibble: 2 × 3
##   Gender `Mean Age` `SD of Age`
##   <chr>       <dbl>       <dbl>
## 1 female       46.0        8.82
## 2 male         50.5        7.92

Distribution of Age

library(dplyr)
Sample2%>%
  mutate(Agecode=ifelse(age<=50, "at most 50 years old", "More than 50 years old"))%>%
  group_by(Agecode)%>%
  summarise(count=n())%>%
  mutate(Percentage =round((count/sum(count)*100),2))

## # A tibble: 2 × 3
##   Agecode                count Percentage
##   <chr>                  <int>      <dbl>
## 1 More than 50 years old    69       47.6
## 2 at most 50 years old      76       52.4
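`ifelse()` works well for two categories. For three or more, `cut()` is a common alternative. The sketch below uses only the first ten ages visible in the printout of `Sample2`, and the break points and labels are illustrative assumptions, not categories from the original analysis.

```r
# First ten ages from the Sample2 printout, used only for illustration
age <- c(43, 40, 60, 50, 42, 42, 54, 40, 56, 43)

# Two groups, mirroring the ifelse() call above
age_two <- ifelse(age <= 50, "at most 50 years old", "More than 50 years old")

# Three groups with cut(); the break points are illustrative assumptions.
# Intervals are (-Inf, 40], (40, 50], (50, Inf) because right = TRUE by default.
# The labels assume ages are recorded as whole numbers.
age_three <- cut(age,
                 breaks = c(-Inf, 40, 50, Inf),
                 labels = c("40 or younger", "41 to 50", "Older than 50"))

# Counts and percentages per category
counts <- table(age_three)
percentages <- round(100 * prop.table(counts), 2)
print(counts)
print(percentages)
```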

Socio-Demographic Profile

Gender

library(dplyr)
Sample2%>%
  group_by(Gender)%>%
  summarise(count=n())%>%
  mutate(Percentage =round((count/sum(count)*100),2))

## # A tibble: 2 × 3
##   Gender count Percentage
##   <chr>  <int>      <dbl>
## 1 female    40       27.6
## 2 male     105       72.4

Education

library(dplyr)
Sample2%>%
  group_by(Education)%>%
  summarise(count=n())%>%
  mutate(Percentage =round((count/sum(count)*100),2))

## # A tibble: 14 × 3
##    Education             count Percentage
##    <chr>                 <int>      <dbl>
##  1 Colege graduate           1       0.69
##  2 College graduate         21      14.5 
##  3 College level            16      11.0 
##  4 Elementary graduate      17      11.7 
##  5 Elementary level         21      14.5 
##  6 Ementary level            1       0.69
##  7 High School graduate      4       2.76
##  8 High chool graduate       1       0.69
##  9 High schoo graduate       1       0.69
## 10 High school graduate     22      15.2 
## 11 High school level        37      25.5 
## 12 High scool level          1       0.69
## 13 High sschool graduate     1       0.69
## 14 Highschool level          1       0.69

Sample3 <- Sample2 %>%
  mutate(Educationcode = recode(Education,
                                "Colege graduate" = "College graduate",
                                "Ementary level" = "Elementary level",
                                "High schoo graduate" = "High school graduate",
                                "High School graduate" = "High school graduate",
                                "High sschool graduate" = "High school graduate",
                                "High chool graduate" = "High school graduate",
                                "High scool level" = "High school level",
                                "Highschool level" = "High school level",
                                "Highschool level " = "High school level")) %>%
  # Order the categories from lowest to highest educational attainment
  mutate(Educationcode = factor(Educationcode,
                                levels = c("Elementary level", "Elementary graduate",
                                           "High school level", "High school graduate",
                                           "College level", "College graduate")))

library(dplyr)
Sample3%>%
  group_by(Educationcode)%>%
  summarise(count=n())%>%
  mutate(Percentage =round((count/sum(count)*100),2))

## # A tibble: 6 × 3
##   Educationcode        count Percentage
##   <fct>                <int>      <dbl>
## 1 Elementary level        22       15.2
## 2 Elementary graduate     17       11.7
## 3 High school level       39       26.9
## 4 High school graduate    29       20  
## 5 College level           16       11.0
## 6 College graduate        22       15.2
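Several of the misspelled labels differ only in letter case or stray spaces. A possible complement to `recode()`, sketched below in base R, is to normalize case and whitespace first and then correct only the remaining typos. The vector `education` holds a few labels copied from the table above, and `fixes` is a lookup vector defined for this example. Note that this approach yields lowercase labels, unlike the recoded labels shown above.

```r
# A few of the raw labels from the Education table above
education <- c("High School graduate", "High school graduate",
               "Highschool level ", "High school level",
               "Colege graduate", "College graduate")

# Step 1: trim stray spaces and unify letter case
cleaned <- tolower(trimws(education))

# Step 2: fix remaining misspellings with a named lookup vector
# (names are the wrong spellings, values are the corrections)
fixes <- c("highschool level" = "high school level",
           "colege graduate"  = "college graduate")
needs_fix <- cleaned %in% names(fixes)
cleaned[needs_fix] <- fixes[cleaned[needs_fix]]

# Count the cleaned categories
print(table(cleaned))
```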

Reference

Peng, Roger D., R Programming for Data Science, 2014, Leanpub.
http://cran.r-project.org/
https://www.rstudio.com/products/rstudio/

Sponsored scholarship by DOST-PCIEERD and MOOCSX Philippines

Thank you and God bless