Introduction to R Statistical Analysis

Author: Renato A. Folledo, Jr.

Affiliation: Isabela State University

R and RStudio

R (https://cran.r-project.org/) is a programming language for statistical computing and graphics

Similar to the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues
R is free and open-source software
Runs on MS Windows, Linux, and macOS operating systems

RStudio is an integrated development environment (IDE) for R, where you write and manage R code

Arithmetic in R

Addition, subtraction, multiplication, division, exponentiation, roots, etc.

2 + 1   # addition

[1] 3

9 - 6   # subtraction

[1] 3

8 * 9   # multiplication

[1] 72

99 / 11 # division

[1] 9

3^2     # exponent

[1] 9

27^(1/3)  # cube root

[1] 3

tan(45)   # trigonometric function (argument in radians, not degrees)

[1] 1.619775
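R's trigonometric functions take angles in radians, which is why tan(45) does not return 1. A minimal sketch of converting degrees to radians first; the helper name deg2rad is made up for illustration and is not a built-in function:

```r
# Hypothetical helper (not part of base R): convert degrees to radians
deg2rad <- function(deg) deg * pi / 180

tan(deg2rad(45))   # tangent of 45 degrees, approximately 1
sin(deg2rad(30))   # sine of 30 degrees, approximately 0.5
```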

R data types

Basic types

numeric
character

date
factor

Multi-dimensional

vector
matrix

Tabular/database

data.frame
data.table

varnum <- 3.1416          # numeric
class(varnum)

[1] "numeric"

varchr <- "Loveliness"    # character
class(varchr)

[1] "character"

Note

Use the class() function to determine a variable’s data type, e.g. class(variableName)
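The same check works for the other basic types listed earlier. A short sketch with made-up values (the variable names are illustrative and do not come from the quiz dataset):

```r
# Illustrative values only; none of these come from the quiz dataset
vardate <- as.Date("2024-09-09")                # Date built from an ISO-formatted string
varfct  <- factor(c("Rent", "Home", "Rent"))    # categorical variable
varlgl  <- TRUE                                 # logical value

class(vardate)   # "Date"
class(varfct)    # "factor"
class(varlgl)    # "logical"
levels(varfct)   # the distinct categories stored in the factor
```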

R functions

Built-in and library functions

sin(45)   # argument in radians

[1] 0.8509035

pi

[1] 3.141593

sqrt(49)

[1] 7

class(pi)

[1] "numeric"

User-defined functions

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# single mode
vectNum <- c(10, 13, 9, 9, 11, 9, 8)
Mode(vectNum)

[1] 9

# two modes
vectNum <- c(10, 13, 9, 9, 11, 9, 8, 11, 11)
Modes(vectNum)

[1]  9 11
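The two functions differ only when there is a tie. The sketch below repeats both definitions so it runs on its own, then applies them to the tied vector and to a made-up character vector:

```r
# Same definitions as above, repeated so the example is self-contained
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

tied <- c(10, 13, 9, 9, 11, 9, 8, 11, 11)   # 9 and 11 each occur three times
Mode(tied)     # 9: which.max() keeps only the first maximum
Modes(tied)    # 9 11: every value that reaches the maximum count

housing <- c("Rent", "Home", "Rent")        # illustrative character vector
Mode(housing)  # "Rent"
```

Because both functions rely on unique() and match(), they work for character data as well as numbers.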

Install library packages

# you must have internet access when installing R packages

install.packages("readxl")
install.packages("data.table")
install.packages("ggplot2")

Note

Run install.packages() only once per package. After that, load the package with library() in every new R session.
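One way to avoid reinstalling is to check which packages are already present. This is a sketch, not the only approach; the variable names are illustrative, and the install line is left commented out because it needs internet access:

```r
# Packages used in this tutorial
pkgs <- c("readxl", "data.table", "ggplot2")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)   # character(0) when everything is already installed

# Uncomment to install only what is missing (needs internet access):
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```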

Load libraries from installed packages

# Open libraries
library(ggplot2)
library(data.table)
library(readxl)

Import an Excel file

Important

Change the path “C:/jun/FirstSem24-25/Stat 2024/QUIZ STAT A2.xlsx” to where you saved the Excel file, e.g. “C:/stat/QUIZ STAT A2.xlsx”

Use forward slashes (/) instead of backslashes (\) in the path

d <- read_xlsx("C:/jun/FirstSem24-25/Stat 2024/QUIZ STAT A2.xlsx")
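A wrong path is the most common cause of an import error. The sketch below checks that the file exists before reading it; the path is a hypothetical location and should be replaced with your own:

```r
# Hypothetical location; replace with the folder where you saved the file
xlsx_path <- "C:/stat/QUIZ STAT A2.xlsx"

if (file.exists(xlsx_path)) {
  # same call as above, written with the package prefix
  d <- readxl::read_xlsx(xlsx_path)
} else {
  message("File not found: ", xlsx_path,
          "\nCurrent working directory: ", getwd())
}
```

If the message appears, compare the path with the folder shown by getwd() and correct it.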

Note

Download the Excel file from here

Convert dataset into a data.table

d <- data.table(d)

Describe the dataset

View(d)       # browse the dataset
colnames(d)   # check the column names

 [1] "ZipID"                 "Gender"                "Age"                  
 [4] "Birthday"              "Height0"               "Height"               
 [7] "Weight"                "WeeklyAllowance"       "WeekdayHousing"       
[10] "Weeklytranspoexpenses" "ModeofDailytranspo"    "Status"               
[13] "BirthPlace"            "Hometown"

# display the data type per variable
# how many observations (number of rows) are there?
# how many variables (number of columns)?
# how many types of statistical data (qualitative and quantitative) can you see?
str(d)

Classes 'data.table' and 'data.frame':  39 obs. of  14 variables:
 $ ZipID                : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Gender               : chr  "Female" "Male" "Female" "Female" ...
 $ Age                  : num  22 20 21 21 21 21 21 21 21 22 ...
 $ Birthday             : chr  "September 9, 2001" "October 14, 2003" "July 9, 2003" "Febraury 28, 2003" ...
 $ Height0              : num  152 160 157 157 163 ...
 $ Height               : num  150 161 158 166 153 ...
 $ Weight               : num  58 59 41 48 58 50 45 60 73 72 ...
 $ WeeklyAllowance      : num  500 900 500 500 600 1000 1000 1000 1300 750 ...
 $ WeekdayHousing       : chr  "Rent" "Rent" "Home" "Home" ...
 $ Weeklytranspoexpenses: num  NA 150 400 400 500 NA 1000 1000 200 250 ...
 $ ModeofDailytranspo   : chr  "Walk" "Walk" "Public Transportation" "Motorcycle" ...
 $ Status               : chr  "With partner" "Single" "Single" "Single" ...
 $ BirthPlace           : chr  "Pinopoc, Alcala, Cagayan" "Turad Yeban Norte, Benito Soliven, Isa." "Canogan Abajo Norte, Sto. Tomas" "Binuang, San Pablo, Isabela" ...
 $ Hometown             : chr  "Alcala, Cagayan" "Yeban Norte, Benito Soliven" "Canogan Abajo Norte, Sto. Tomas" "Binguang, San Pablo Isabela" ...
 - attr(*, ".internal.selfref")=<externalptr>
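The counts reported by str() can also be retrieved one at a time. A minimal sketch on a made-up data frame (the values are illustrative; only the column names mirror the quiz dataset):

```r
# Toy data frame standing in for the quiz dataset (values are made up)
toy <- data.frame(
  Gender = c("Female", "Male", "Female"),
  Age    = c(22, 20, 21),
  Height = c(150.0, 161.5, 158.0)
)

nrow(toy)            # number of observations (rows): 3
ncol(toy)            # number of variables (columns): 3
dim(toy)             # both at once: 3 3
sapply(toy, class)   # data type of each column
```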

# what information is displayed for each statistical data type?
summary(d)

     ZipID         Gender               Age          Birthday        
 Min.   : 1.0   Length:39          Min.   :20.00   Length:39         
 1st Qu.:11.0   Class :character   1st Qu.:21.00   Class :character  
 Median :21.0   Mode  :character   Median :21.00   Mode  :character  
 Mean   :22.1                      Mean   :21.23                     
 3rd Qu.:33.5                      3rd Qu.:21.00                     
 Max.   :43.0                      Max.   :26.00                     
                                                                     
    Height0          Height          Weight      WeeklyAllowance 
 Min.   :150.0   Min.   :143.0   Min.   :40.00   Min.   : 100.0  
 1st Qu.:156.2   1st Qu.:154.4   1st Qu.:48.00   1st Qu.: 500.0  
 Median :160.0   Median :160.4   Median :54.50   Median : 750.0  
 Mean   :162.1   Mean   :162.2   Mean   :56.05   Mean   : 735.9  
 3rd Qu.:167.3   3rd Qu.:169.6   3rd Qu.:62.25   3rd Qu.:1000.0  
 Max.   :180.3   Max.   :186.0   Max.   :78.00   Max.   :1300.0  
                                 NA's   :1                       
 WeekdayHousing     Weeklytranspoexpenses ModeofDailytranspo    Status         
 Length:39          Min.   : 100.0        Length:39          Length:39         
 Class :character   1st Qu.: 187.5        Class :character   Class :character  
 Mode  :character   Median : 250.0        Mode  :character   Mode  :character  
                    Mean   : 304.6                                             
                    3rd Qu.: 350.0                                             
                    Max.   :1000.0                                             
                    NA's   :12                                                 
  BirthPlace          Hometown        
 Length:39          Length:39         
 Class :character   Class :character  
 Mode  :character   Mode  :character
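The NA's rows in the summary mark missing values. The sketch below, on made-up data, shows how to count them per column and why functions such as mean() need na.rm = TRUE when a column contains them:

```r
# Toy data with missing values (made up, not the quiz data)
toy <- data.frame(
  Weight = c(58, NA, 41, 48),
  Weeklytranspoexpenses = c(NA, 150, 400, NA)
)

colSums(is.na(toy))                             # NAs per column: 1 and 2
mean(toy$Weeklytranspoexpenses)                 # NA, because missing values propagate
mean(toy$Weeklytranspoexpenses, na.rm = TRUE)   # 275, NAs are dropped first
```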

Examine the height of participants

# what is the data type of Height variable?
# interpret the values displayed by the `summary` function
# what is the range of the heights?
# what is the interquartile range (IQR) of the heights?
# what does the IQR mean?
summary(d[, Height])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  143.0   154.4   160.4   162.2   169.6   186.0
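The range is the maximum minus the minimum, and the IQR is the third quartile minus the first quartile. A short sketch on made-up heights shows the base R functions for both:

```r
heights <- c(150, 155, 160, 165, 170)   # made-up heights in cm

range(heights)                     # 150 170 (minimum and maximum)
diff(range(heights))               # 20, the range as a single number
quantile(heights, c(0.25, 0.75))   # first and third quartiles: 155 and 165
IQR(heights)                       # 10, i.e. Q3 - Q1
```

The same arithmetic applies to the summary values shown above: subtract Min. from Max. for the range, and 1st Qu. from 3rd Qu. for the IQR.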

# get the average height and number of observations by gender
d[, list(average = mean(Height), obs = .N), by="Gender"]

   Gender  average   obs
   <char>    <num> <int>
1: Female 161.1648    21
2:   Male 163.4911    18

# what is the average height of males and females?
# how about the number of observations per gender?

Generate a box plot with height on the y-axis and gender on the x-axis

boxplot(d$Height ~ d$Gender, xlab="Gender", ylab = "Height (cm)")

# what does the box plot show?
# based on the box plot, are males taller than females or vice versa?
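To read a box plot, it helps to see the numbers behind it. The sketch below uses made-up heights and asks boxplot() to return its statistics instead of drawing; the formula-plus-data form is an alternative way to write the call used above:

```r
# Made-up data; the column names mirror the quiz dataset
toy <- data.frame(
  Gender = rep(c("Female", "Male"), each = 5),
  Height = c(150, 155, 160, 165, 170, 152, 158, 164, 170, 176)
)

# plot = FALSE returns the numbers behind the plot instead of drawing it
b <- boxplot(Height ~ Gender, data = toy, plot = FALSE)
b$names   # group labels, in the order of the columns of b$stats
b$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```

The thick line inside each box is the median (third row of b$stats), and the box spans the lower and upper hinges, which are approximately the first and third quartiles.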

Compare male and female heights

# get the average height and number of observations of males only
d[Gender=="Male", list(average = mean(Height), obs = .N)]

    average   obs
      <num> <int>
1: 163.4911    18

# get the average height and number of observations of females only
d[Gender=="Female", list(average = mean(Height), obs = .N)]

    average   obs
      <num> <int>
1: 161.1648    21

Display the central tendency and spread of height for males and females

d[, list(average=mean(Height),
         min=min(Height),
         max=max(Height),
         median=median(Height),
         stdev=sd(Height),
         obs=.N), by="Gender"]

   Gender  average   min   max median     stdev   obs
   <char>    <num> <num> <num>  <num>     <num> <int>
1: Female 161.1648 143.4   174 158.48  8.699673    21
2:   Male 163.4911 143.0   186 161.56 12.629607    18

Compare the average height of males and females

# use a t-test to determine whether male participants are significantly taller than female participants.
# Hypotheses:
# Null hypothesis: the mean height of males is not greater than the mean height of females
# Research hypothesis: males are taller than females, on average
t.test(d[Gender=="Male", Height], d[Gender=="Female", Height],
       alternative = "greater")


    Welch Two Sample t-test

data:  d[Gender == "Male", Height] and d[Gender == "Female", Height]
t = 0.6589, df = 29.493, p-value = 0.2575
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -3.669372       Inf
sample estimates:
mean of x mean of y 
 163.4911  161.1648

# What is the resulting p-value?
# Based on the t-test result, are the males taller than the females? Why?
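The numbers reported by t.test() can be recomputed by hand from the group means, standard deviations, and sample sizes shown earlier. The sketch below applies the Welch formulas to those summary statistics; the variable names are illustrative, and the 0.05 significance level is an assumption suggested by the 95 percent confidence interval in the output:

```r
# Summary statistics copied from the grouped output shown earlier
m_male <- 163.4911; s_male <- 12.629607; n_male <- 18
m_fem  <- 161.1648; s_fem  <- 8.699673;  n_fem  <- 21

v_male <- s_male^2 / n_male      # squared standard error, males
v_fem  <- s_fem^2 / n_fem        # squared standard error, females
se     <- sqrt(v_male + v_fem)   # standard error of the difference in means

t_stat <- (m_male - m_fem) / se
# Welch-Satterthwaite approximation for the degrees of freedom
df <- (v_male + v_fem)^2 / (v_male^2 / (n_male - 1) + v_fem^2 / (n_fem - 1))
# One-sided p-value for the alternative "greater"
p_value <- pt(t_stat, df, lower.tail = FALSE)
# Lower bound of the one-sided 95% confidence interval
ci_lower <- (m_male - m_fem) - qt(0.95, df) * se

round(c(t = t_stat, df = df, p = p_value, lower = ci_lower), 4)

alpha <- 0.05        # assumed significance level
p_value < alpha      # FALSE: the null hypothesis is not rejected
```

The results agree with the t.test() output up to rounding. Since the p-value is larger than 0.05, the data do not provide sufficient evidence that males are taller than females on average.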