R (https://cran.r-project.org/) is a programming language for statistical computing and graphics
R is similar to the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues
R is free and open-source software
Runs on MS Windows, Linux, and macOS
RStudio is an integrated development environment (IDE) for R, where you write and manage R code
Addition, subtraction, multiplication, division, exponentiation, roots, etc.
2 + 1 # addition
[1] 3
9 - 6 # subtraction
[1] 3
8 * 9 # multiplication
[1] 72
99 / 11 # division
[1] 9
3^2 # exponent
[1] 9
27^(1/3) # cube root
[1] 3
tan(45) # trigonometric function (the angle is in radians)
[1] 1.619775
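R's trigonometric functions read their argument as radians, which is why tan(45) above is not 1. As a small sketch, an angle in degrees can be converted first; `deg2rad` is a hypothetical helper written for this example and is not part of base R.

```r
# Hypothetical helper (not in base R): convert degrees to radians
deg2rad <- function(deg) deg * pi / 180

tan(45)            # 45 radians, as in the example above
tan(deg2rad(45))   # 45 degrees: approximately 1
sin(deg2rad(30))   # 30 degrees: approximately 0.5
```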
numeric
character
date
factor
vector
matrix
data.frame
data.table
varnum <- 3.1416 # numeric
class(varnum)
[1] "numeric"
varchr <- "Loveliness" # character
class(varchr)
[1] "character"
Use the class() function to determine a variable’s data type, e.g. class(variableName)
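The same check works for the other types in the list above. The sketch below builds one small object per type in base R; the variable names and values are made up for illustration, and data.table is left out because it needs an extra package.

```r
# Illustrative objects; names and values are assumptions for this example
vardate <- as.Date("2024-09-09")                            # date
varfct  <- factor(c("Rent", "Home", "Rent"))                # factor
varvec  <- c(10, 13, 9)                                     # vector (numeric)
varmat  <- matrix(1:6, nrow = 2)                            # matrix: 2 rows, 3 columns
vardf   <- data.frame(id = 1:3, height = c(150, 161, 158))  # data.frame

class(vardate)     # "Date"
class(varfct)      # "factor"
class(varvec)      # "numeric"
class(vardf)       # "data.frame"
is.matrix(varmat)  # TRUE
levels(varfct)     # "Home" "Rent" (levels are sorted alphabetically)
```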
sin(45)
[1] 0.8509035
pi
[1] 3.141593
sqrt(49)
[1] 7
class(pi)
[1] "numeric"
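# Mode(): return the most frequent value (the first one found if there are ties)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Modes(): return every value tied for the highest frequency
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}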
# single mode
vectNum <- c(10, 13, 9, 9, 11, 9, 8)
Mode(vectNum)
[1] 9
# two modes
vectNum <- c(10, 13, 9, 9, 11, 9, 8, 11, 11)
Modes(vectNum)
[1] 9 11
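To see why the mode functions work, it may help to run their steps one at a time on the single-mode vector. This is only a walk-through of the same base R calls; the intermediate variable names `pos` and `counts` are chosen for this example.

```r
x <- c(10, 13, 9, 9, 11, 9, 8)

ux     <- unique(x)      # distinct values in order of appearance: 10 13 9 11 8
pos    <- match(x, ux)   # position of each element within ux:     1 2 3 3 4 3 5
counts <- tabulate(pos)  # how often each position occurs:          1 1 3 1 1

which.max(counts)        # 3, the index of the largest count
ux[which.max(counts)]    # 9, the mode
```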
# you must have internet access when installing R packages
install.packages("readxl")
install.packages("data.table")
install.packages("ggplot2")
Run install.packages only once per package library!
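One way to avoid reinstalling packages on every run is to install only the ones that are missing. The function below is a sketch; `install_if_missing` is a hypothetical helper name, not something provided by R or by these packages.

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  # requireNamespace() returns TRUE when the package can be loaded
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)  # needs internet access
  }
  invisible(missing_pkgs)           # return what had to be installed
}

# Usage (commented out so that nothing is downloaded by accident):
# install_if_missing(c("readxl", "data.table", "ggplot2"))
```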
# Open libraries
library(ggplot2)
library(data.table)
library(readxl)
Change the path “C:/jun/FirstSem24-25/Stat 2024/QUIZ STAT A2.xlsx” to where you saved the Excel file, e.g. “C:/stat/QUIZ STAT A2.xlsx”
Don’t forget to use forward slashes (/) instead of backslashes (\)
d <- read_xlsx("C:/jun/FirstSem24-25/Stat 2024/QUIZ STAT A2.xlsx")
Download the Excel file from here
d <- data.table(d)
View(d) # browse the dataset
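The slash rule exists because R treats a backslash inside a string as the start of an escape sequence. The sketch below uses the example folder C:/stat mentioned above; the path is only a placeholder and the file may not exist on your computer.

```r
good    <- "C:/stat/QUIZ STAT A2.xlsx"       # forward slashes: works as written
also_ok <- "C:\\stat\\QUIZ STAT A2.xlsx"     # doubled backslashes also work
# bad   <- "C:\stat\QUIZ STAT A2.xlsx"       # error: '\s' is an unrecognized escape

# Convert a backslash path to the forward-slash form
converted <- gsub("\\", "/", also_ok, fixed = TRUE)
identical(converted, good)   # TRUE

# Check that the file is really there before reading it
file.exists(good)            # TRUE only if the file exists at that location
```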
colnames(d) # check the column names
 [1] "ZipID"                 "Gender"                "Age"
 [4] "Birthday"              "Height0"               "Height"
 [7] "Weight"                "WeeklyAllowance"       "WeekdayHousing"
[10] "Weeklytranspoexpenses" "ModeofDailytranspo"    "Status"
[13] "BirthPlace"            "Hometown"
# display the data type per variable
# how many observations (number of rows) are there?
# how many variables (number of columns)?
# how many types of statistical data (qualitative and quantitative) can you see?
str(d)
Classes 'data.table' and 'data.frame': 39 obs. of 14 variables:
$ ZipID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : chr "Female" "Male" "Female" "Female" ...
$ Age : num 22 20 21 21 21 21 21 21 21 22 ...
$ Birthday : chr "September 9, 2001" "October 14, 2003" "July 9, 2003" "Febraury 28, 2003" ...
$ Height0 : num 152 160 157 157 163 ...
$ Height : num 150 161 158 166 153 ...
$ Weight : num 58 59 41 48 58 50 45 60 73 72 ...
$ WeeklyAllowance : num 500 900 500 500 600 1000 1000 1000 1300 750 ...
$ WeekdayHousing : chr "Rent" "Rent" "Home" "Home" ...
$ Weeklytranspoexpenses: num NA 150 400 400 500 NA 1000 1000 200 250 ...
$ ModeofDailytranspo : chr "Walk" "Walk" "Public Transportation" "Motorcycle" ...
$ Status : chr "With partner" "Single" "Single" "Single" ...
$ BirthPlace : chr "Pinopoc, Alcala, Cagayan" "Turad Yeban Norte, Benito Soliven, Isa." "Canogan Abajo Norte, Sto. Tomas" "Binuang, San Pablo, Isabela" ...
$ Hometown : chr "Alcala, Cagayan" "Yeban Norte, Benito Soliven" "Canogan Abajo Norte, Sto. Tomas" "Binguang, San Pablo Isabela" ...
- attr(*, ".internal.selfref")=<externalptr>
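The questions above can also be answered with a few direct calls. As a sketch, the toy data frame below copies three columns from the first three rows shown by str(d); on the real dataset the same calls would be applied to d.

```r
# Toy copy of the first three rows of three columns from the str(d) output
toy <- data.frame(
  Gender = c("Female", "Male", "Female"),
  Age    = c(22, 20, 21),
  Height = c(150, 161, 158),
  stringsAsFactors = FALSE
)

nrow(toy)   # number of observations: 3
ncol(toy)   # number of variables: 3

col_types <- sapply(toy, class)        # data type of each column
col_types
names(toy)[col_types == "character"]   # qualitative columns: "Gender"
names(toy)[col_types == "numeric"]     # quantitative columns: "Age" "Height"
```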
# what information is displayed for each statistical data type?
summary(d)
ZipID Gender Age Birthday
Min. : 1.0 Length:39 Min. :20.00 Length:39
1st Qu.:11.0 Class :character 1st Qu.:21.00 Class :character
Median :21.0 Mode :character Median :21.00 Mode :character
Mean :22.1 Mean :21.23
3rd Qu.:33.5 3rd Qu.:21.00
Max. :43.0 Max. :26.00
Height0 Height Weight WeeklyAllowance
Min. :150.0 Min. :143.0 Min. :40.00 Min. : 100.0
1st Qu.:156.2 1st Qu.:154.4 1st Qu.:48.00 1st Qu.: 500.0
Median :160.0 Median :160.4 Median :54.50 Median : 750.0
Mean :162.1 Mean :162.2 Mean :56.05 Mean : 735.9
3rd Qu.:167.3 3rd Qu.:169.6 3rd Qu.:62.25 3rd Qu.:1000.0
Max. :180.3 Max. :186.0 Max. :78.00 Max. :1300.0
NA's :1
WeekdayHousing Weeklytranspoexpenses ModeofDailytranspo Status
Length:39 Min. : 100.0 Length:39 Length:39
Class :character 1st Qu.: 187.5 Class :character Class :character
Mode :character Median : 250.0 Mode :character Mode :character
Mean : 304.6
3rd Qu.: 350.0
Max. :1000.0
NA's :12
BirthPlace Hometown
Length:39 Length:39
Class :character Class :character
Mode :character Mode :character
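The summary reports 12 missing values (NA's) for Weeklytranspoexpenses. Functions such as mean() return NA when the input contains missing values unless na.rm = TRUE is set. The sketch below uses the first six values of that column as shown by str(d).

```r
# First six values of Weeklytranspoexpenses from the str(d) output
expenses <- c(NA, 150, 400, 400, 500, NA)

sum(is.na(expenses))           # 2 missing values
mean(expenses)                 # NA, because of the missing values
mean(expenses, na.rm = TRUE)   # 362.5, the mean of the four observed values
```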
# what is the data type of Height variable?
# interpret the values displayed by the `summary` function
# what is the range of height?
# what are their IQR values?
# What does IQR mean?
summary(d[, Height])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  143.0   154.4   160.4   162.2   169.6   186.0
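The range and the interquartile range (IQR) can be worked out from the numbers printed above: the range is the maximum minus the minimum, and the IQR is the third quartile minus the first quartile. The sketch below uses the rounded values from the summary output, so results computed on the full data may differ slightly.

```r
# Rounded values copied from the summary output above
h_min <- 143.0
h_q1  <- 154.4
h_q3  <- 169.6
h_max <- 186.0

h_max - h_min   # range: 43
h_q3 - h_q1     # IQR: 15.2, the spread of the middle 50% of heights

# On the dataset itself the equivalent calls would be:
# diff(range(d$Height))
# IQR(d$Height)
```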
# get the average height and number of observations by gender
d[, list(average = mean(Height), obs = .N), by="Gender"]
   Gender  average   obs
   <char>    <num> <int>
1: Female 161.1648    21
2:   Male 163.4911    18
# what is the average height of males and females?
# how about the number of observations per gender?
boxplot(d$Height ~ d$Gender, xlab="Gender", ylab = "Height (cm)")
# what does the boxplot show?
# based on the boxplot, are males taller than females or vice versa?
# get average height and number of observations of males only
d[Gender=="Male", list(average = mean(Height), obs = .N)]
    average   obs
      <num> <int>
1: 163.4911    18
# get average height and number of observations of females only
d[Gender=="Female", list(average = mean(Height), obs = .N)]
    average   obs
      <num> <int>
1: 161.1648    21
d[, list(average=mean(Height),
         min=min(Height),
         max=max(Height),
         median=median(Height),
         stdev=sd(Height),
         obs=.N), by="Gender"]
   Gender  average   min   max median     stdev   obs
   <char>    <num> <num> <num>  <num>     <num> <int>
1: Female 161.1648 143.4   174 158.48  8.699673    21
2:   Male 163.4911 143.0   186 161.56 12.629607    18
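A boxplot such as the one drawn earlier shows five numbers per group: lower whisker, lower hinge (about the first quartile), median, upper hinge (about the third quartile), and upper whisker. As a sketch, calling boxplot() with plot = FALSE returns those numbers without drawing anything. The heights below are made-up values for illustration, not taken from the dataset.

```r
# Hypothetical heights, five per group, for illustration only
toy <- data.frame(
  Gender = rep(c("Female", "Male"), each = 5),
  Height = c(150, 155, 158, 160, 166,
             153, 160, 163, 170, 180)
)

b <- boxplot(Height ~ Gender, data = toy, plot = FALSE)
b$names   # group labels: "Female" "Male"
b$stats   # one column per group; row 3 is the median (158 and 163 here)
b$n       # observations per group: 5 5
```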
# use a t-test to determine whether there is a significant difference in the heights of male and female participants.
# Hypotheses:
# Null hypothesis: heights of males and females are not significantly different
# Research hypothesis: males are taller than females
t.test(d[Gender=="Male", Height], d[Gender=="Female", Height],
       alternative = "greater")
Welch Two Sample t-test
data: d[Gender == "Male", Height] and d[Gender == "Female", Height]
t = 0.6589, df = 29.493, p-value = 0.2575
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-3.669372 Inf
sample estimates:
mean of x mean of y
163.4911 161.1648
# What is the resulting p-value?
# Based on the t-test result, are males taller than females? Why?
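To connect the t-test output with the group summaries shown earlier, the Welch statistic can be recomputed by hand from the means, standard deviations, and group sizes. This is a sketch using the printed (rounded) values, so small rounding differences are possible. The 0.05 significance level is a common convention assumed here; it is not stated in the original.

```r
# Group summaries copied from the by-gender table
mean_m <- 163.4911; sd_m <- 12.629607; n_m <- 18   # males
mean_f <- 161.1648; sd_f <- 8.699673;  n_f <- 21   # females

# Squared standard error of each group mean
se2_m <- sd_m^2 / n_m
se2_f <- sd_f^2 / n_f

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat   <- (mean_m - mean_f) / sqrt(se2_m + se2_f)
welch_df <- (se2_m + se2_f)^2 /
  (se2_m^2 / (n_m - 1) + se2_f^2 / (n_f - 1))

# One-sided p-value for alternative = "greater"
p_value <- pt(t_stat, welch_df, lower.tail = FALSE)

round(c(t = t_stat, df = welch_df, p = p_value), 4)  # compare with the t.test output

alpha <- 0.05      # assumed significance level
p_value < alpha    # FALSE: not enough evidence that males are taller
```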