Descriptive statistics help researchers describe the central tendency (mean, median, mode), spread or dispersion (range, variance, and standard deviation), and shape of the distribution of data. They also involve making graphs and summary tables to help visualize and understand the data. The main objective of descriptive statistics is to summarize and describe the characteristics of a data set effectively, providing an overview of the data and helping to identify patterns and relationships between variables. They provide a useful starting point for data analysis because they can reveal outliers and help in choosing an appropriate statistical technique for inferential analysis.
At the end of this exercise, the students
The sample R code in this exercise examines the relationship between the Gender and Height of the participants. You then need to replace Height with Weight in the same code to generate descriptive summary statistics, a box plot, and summary tables. Aside from deriving the descriptive statistics for your data, you also need to answer the guide questions and incorporate your answers in your report.
Measures of central tendency include the mean, median, and mode. The mean is applicable only to quantitative data. The median is applicable to ordinal and quantitative data, but not to nominal data. Finally, the mode can be computed for any type of data, but it is most meaningful for qualitative data.
This activity is an exercise on descriptive statistical analysis of the heights of male and female participants, using the Gender and Height variables from the student data. Download your “Stat 2024/QUIZ STAT A2.xlsx” file.
# open the excel file using the read_xlsx function from readxl library
dt <- readxl::read_xlsx("C:/jun/FirstSem24-25/Stat 2024/QUIZ STAT A2.xlsx")
# convert to data.table
dt <- data.table::data.table(dt)
class(dt)
[1] "data.table" "data.frame"
Copy the variables Gender and Height to a new table named HtGender
# We copy all the rows, denoted by `dt[,`
# which has no (blank) parameter before the comma `,`
# whereas if we use `dt[Gender == "Male"`,
# only the male participants are copied.
# The parameter `list(Gender, Height)` says that only column
# variables Height and Gender are copied to `HtGender` table
HtGender <- dt[, list(Gender, Height)]
## display a summary of all the variables in HtGender
summary(HtGender)
    Gender              Height
 Length:39          Min.   :143.0
 Class :character   1st Qu.:154.4
 Mode  :character   Median :160.4
                    Mean   :162.2
                    3rd Qu.:169.6
                    Max.   :186.0
The mean function in R computes the mean, or average, of a vector. It can be run on its own or within a data.table object, as illustrated in the code below.
# display the Height values, then compute their mean
HtGender$Height # display the Height as vector
 [1] 150.40 161.02 158.48 166.48 152.56 154.40 160.40 173.10 175.64 180.34
[11] 178.26 168.02 143.40 162.10 157.10 169.10 172.10 157.64 174.80 155.48
[21] 157.48 169.18 169.02 151.94 156.00 159.00 166.00 170.00 150.48 168.00
[31] 173.00 146.40 154.02 154.48 148.48 186.00 160.00 143.00 174.00
# the dollar symbol uses the `data.frame`
# functionality to extract Height variable
mean(HtGender$Height) # compute the mean of a data.frame vector
[1] 162.2385
# equivalent to HtGender$Height, but this time
# it uses the `data.table` syntax to extract the Height vector
mean(HtGender[, Height])
[1] 162.2385
# use data.table functionality to compute the mean
HtGender[, mean(Height)] # mean height of all participants
[1] 162.2385
# It is easier to extract selected rows in a data.table
HtGender[Gender == "Male", mean(Height)] # mean height of Male
[1] 163.4911
HtGender[Gender == "Female", mean(Height)] # mean height of Female
[1] 161.1648
# use data.table functionality to
# create a summary table of mean height of participants
# grouped by male and female
HtGender[, list(mean = mean(Height)), by = "Gender"]
   Gender     mean
   <char>    <num>
1: Female 161.1648
2:   Male 163.4911
# compute median
# display the median right away
median(HtGender[, Height])
[1] 160.4
# use data.table functionality
HtGender[, median(Height)] # median height of all participants
[1] 160.4
HtGender[Gender == "Male", median(Height)] # median height of Male
[1] 161.56
HtGender[Gender == "Female", median(Height)] # median height of Female
[1] 158.48
# summary table of median height grouped by male and female
HtGender[, list(median = median(Height)), by = "Gender"]
   Gender median
   <char>  <num>
1: Female 158.48
2:   Male 161.56
# summary table of mean and median heights by gender
HtGender[, list(mean = mean(Height), median = median(Height)), by = "Gender"]
   Gender     mean median
   <char>    <num>  <num>
1: Female 161.1648 158.48
2:   Male 163.4911 161.56
# create a summary table of mean and median heights,
# including the number of observations per gender
HtGender[, list(mean = mean(Height),
                median = median(Height),
                n = .N), by = "Gender"]
   Gender     mean median     n
   <char>    <num>  <num> <int>
1: Female 161.1648 158.48    21
2:   Male 163.4911 161.56    18
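For readers who want to see the same grouped summary without data.table, here is a minimal base R sketch. The `toy` data frame and its values are hypothetical and are not the class data; the point is only to show that `split()` plus `lapply()` produces one row per gender, much like `by = "Gender"` does.

```r
# Hypothetical toy data (NOT the class data set)
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male"),
  Height = c(150, 160, 170, 165, 175)
)

# Split the rows by gender, summarise each piece, then stack the results
by_gender <- do.call(rbind, lapply(split(toy, toy$Gender), function(g) {
  data.frame(Gender = g$Gender[1],
             mean   = mean(g$Height),
             median = median(g$Height),
             n      = nrow(g))
}))
rownames(by_gender) <- NULL

print(by_gender)
```

The data.table form used in this exercise is shorter; the base R version is shown only to make the grouping step explicit.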
What are the mean and median heights of females, males, and the whole sample (males and females combined)?
The mode is most meaningful for qualitative data. R has no built-in function for the statistical mode (the built-in mode() function reports the storage mode of an object instead), so we use two user-defined functions, Mode and Modes, which I downloaded. Mode returns the first mode of the data. If there are multiple modes, we need to use the Modes function instead.
# define function Mode
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# define the 2nd function Modes that can get multiple modes if any
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}
# display mode of a qualitative data
Mode(HtGender[, Gender])
[1] "Female"
Modes(HtGender[, Gender])
[1] "Female"
# attempt to compute the mode of quantitative data
Modes(HtGender[, Height])
 [1] 150.40 161.02 158.48 166.48 152.56 154.40 160.40 173.10 175.64 180.34
[11] 178.26 168.02 143.40 162.10 157.10 169.10 172.10 157.64 174.80 155.48
[21] 157.48 169.18 169.02 151.94 156.00 159.00 166.00 170.00 150.48 168.00
[31] 173.00 146.40 154.02 154.48 148.48 186.00 160.00 143.00 174.00
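The behaviour of Modes can be checked on small vectors where the answer is obvious. The sketch below repeats the Modes definition so that it runs on its own; the `blood_type` and `heights` vectors are hypothetical examples, not part of the class data.

```r
# Same Modes definition as in the exercise, repeated so this block is self-contained
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# Hypothetical categorical data with a two-way tie
blood_type <- c("A", "O", "A", "B", "O")
tied_modes <- Modes(blood_type)   # "A" and "O" each appear twice
print(tied_modes)

# Hypothetical measurements in which every value is unique
heights <- c(150.4, 161.0, 158.5, 166.5)
all_values <- Modes(heights)      # every value has frequency 1
print(all_values)

# A frequency table shows the counts behind both results
print(table(blood_type))
```

When every value occurs exactly once, the highest frequency is 1 and every value ties for it, which is why Modes returns the whole vector for such data.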
What is the mode of the Gender variable? Why is it that the mode of the Height variable is not meaningful?
# use built-in functions and display results immediately
min(HtGender[, Height]) # minimum
[1] 143
max(HtGender[, Height]) # maximum
[1] 186
max(HtGender[, Height]) - min(HtGender[, Height]) # range
[1] 43
# create summary table using data.table capability
HtGender[, list(minHt = min(Height),
                maxHT = max(Height),
                rangeHt = max(Height) - min(Height))]
   minHt maxHT rangeHt
   <num> <num>   <num>
1:   143   186      43
# separate by gender
HtGender[, list(minHt = min(Height),
                maxHT = max(Height),
                rangeHt = max(Height) - min(Height)),
         by = "Gender"]
   Gender minHt maxHT rangeHt
   <char> <num> <num>   <num>
1: Female 143.4   174    30.6
2:   Male 143.0   186    43.0
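As a side note, base R also has a `range()` function that returns the minimum and maximum together, so the range as a single number is `diff(range(x))`. A minimal sketch on a hypothetical vector `x` (not the class data):

```r
# Hypothetical heights in cm (NOT the class data)
x <- c(150, 152, 155, 158, 160, 162, 165, 168, 210)

r <- range(x)          # c(minimum, maximum)
spread <- diff(r)      # maximum - minimum

print(r)
print(spread)
```

This gives the same value as `max(x) - min(x)`, which is the form used in the summary tables of this exercise.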
Compute Quartile 1, median (Quartile 2), Quartile 3, Interquartile range (IQR).
Quartile 1 (Q1): 25% of the observations are below it, while 75% are above it.
Quartile 2 (Q2): the median of the observations; 50% are below it and 50% are above it.
Quartile 3 (Q3): 75% of the observations are below it, while 25% are above it.
IQR = Q3 - Q1
IQR minimum = Q1 - 1.5(IQR)
IQR maximum = Q3 + 1.5(IQR)
The IQR minimum and maximum are useful for detecting outliers. Any observation with a value above the IQR maximum or below the IQR minimum is called an outlier; outliers are shown as asterisks (*) in Figure 1.
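The formulas above can be traced step by step on a small example. The vector `x` below is hypothetical (it is not the class data) and includes one deliberately extreme value so that the rule has something to flag; `lower` and `upper` are just illustrative names for the IQR minimum and IQR maximum.

```r
# Hypothetical heights in cm (NOT the class data); 210 is deliberately extreme
x <- c(150, 152, 155, 158, 160, 162, 165, 168, 210)

q1  <- unname(quantile(x, 0.25))   # Quartile 1
q3  <- unname(quantile(x, 0.75))   # Quartile 3
iqr <- q3 - q1                     # same value as IQR(x)

lower <- q1 - 1.5 * iqr            # "IQR minimum"
upper <- q3 + 1.5 * iqr            # "IQR maximum"

# Observations outside [lower, upper] are flagged as outliers
outliers <- x[x < lower | x > upper]

print(c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower, upper = upper))
print(outliers)
```

With these values, Q1 is 155 and Q3 is 165, so the IQR is 10, the limits are 140 and 180, and only 210 falls outside them.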
Use the code below to compute the different quartile values.
# display output right away
quantile(HtGender[, Height], 0.25) # quartile 1
   25%
154.44
quantile(HtGender[, Height], 0.5) # median is also quartile 2
  50%
160.4
quantile(HtGender[, Height], 0.75) # quartile 3
   75%
169.59
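The three calls above can also be written as one call, because `quantile()` accepts a vector of probabilities through its `probs` argument. A minimal sketch on a hypothetical vector `x` (not the class data):

```r
# Hypothetical heights in cm (NOT the class data)
x <- c(150, 152, 155, 158, 160, 162, 165, 168, 210)

# All three quartiles in a single call
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(quartiles)

# The middle quartile is the median
print(median(x))
```

Note that `quantile()` has a `type` argument that selects the interpolation rule (the default is type 7), so other software may report slightly different quartiles for the same data.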
Instead of computing the quartiles individually, we can generate a single table of combined summary statistics.
# create a summary table for all the observations
HtGender[, list(minHt = min(Height),
                maxHT = max(Height),
                rangeHt = max(Height) - min(Height),
                Q1 = quantile(Height, 0.25),
                median = median(Height),
                Q3 = quantile(Height, 0.75))]
   minHt maxHT rangeHt     Q1 median     Q3
   <num> <num>   <num>  <num>  <num>  <num>
1:   143   186      43 154.44  160.4 169.59
# create a summary table by gender
# note that we added `by="Gender"` parameter
HtGender[, list(minHt = min(Height),
                maxHT = max(Height),
                rangeHt = max(Height) - min(Height),
                Q1 = quantile(Height, 0.25),
                median = median(Height),
                Q3 = quantile(Height, 0.75)),
         by = "Gender"]
   Gender minHt maxHT rangeHt     Q1 median     Q3
   <char> <num> <num>   <num>  <num>  <num>  <num>
1: Female 143.4   174    30.6 154.48 158.48 168.02
2:   Male 143.0   186    43.0 153.23 161.56 173.60
Compute the interquartile range (IQR)
# compute the interquartile range of all observations
IQR(HtGender[, Height])
[1] 15.15
# include IQR on the summary table
HtGender[, list(minHt = min(Height),
                maxHT = max(Height),
                rangeHt = max(Height) - min(Height),
                Q1 = quantile(Height, 0.25),
                median = median(Height),
                Q3 = quantile(Height, 0.75),
                IQR = IQR(Height),
                minIQR = quantile(Height, 0.25) - 1.5 * IQR(Height),
                maxIQR = quantile(Height, 0.75) + 1.5 * IQR(Height)),
         by = "Gender"]
   Gender minHt maxHT rangeHt     Q1 median     Q3   IQR  minIQR  maxIQR
   <char> <num> <num>   <num>  <num>  <num>  <num> <num>   <num>   <num>
1: Female 143.4   174    30.6 154.48 158.48 168.02 13.54 134.170 188.330
2:   Male 143.0   186    43.0 153.23 161.56 173.60 20.37 122.675 204.155
Did you see any observation whose Height is less than minIQR or greater than maxIQR?
What do we call the observations that are less than minimum IQR or greater than maximum IQR?
Variance is a numerical measure of how the data values are dispersed around the mean. Standard deviation is simply the square root of the variance. Unlike variance, which is expressed in squared units, standard deviation has the same unit of measure as the variable being studied; e.g., if the Height variable is in cm, the standard deviation is also in cm.
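To make the relationship between the two measures concrete, the sketch below computes the sample variance by hand and compares it with R's `var()` and `sd()`. The `heights_cm` vector is a hypothetical three-value example chosen so that the arithmetic is easy to follow; it is not the class data.

```r
# Hypothetical heights in cm (NOT the class data)
heights_cm <- c(150, 160, 170)

n          <- length(heights_cm)
deviations <- heights_cm - mean(heights_cm)    # -10, 0, 10

# Sample variance: sum of squared deviations divided by (n - 1); unit is cm^2
manual_var <- sum(deviations^2) / (n - 1)

# Standard deviation: square root of the variance; unit is cm
manual_sd  <- sqrt(manual_var)

print(c(variance = manual_var, sd = manual_sd))

# R's built-in functions use the same (n - 1) definition
print(c(variance = var(heights_cm), sd = sd(heights_cm)))
```

Here the squared deviations sum to 200, so the variance is 200 / 2 = 100 cm² and the standard deviation is 10 cm.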
# display height variance
var(HtGender[, Height])
[1] 112.5726
# standard deviation of height variable
sd(HtGender[, Height])
[1] 10.61002
# standard deviation of male participants
sd(HtGender[Gender == "Male", Height])
[1] 12.62961
# standard deviation of female participants
sd(HtGender[Gender == "Female", Height])
[1] 8.699673
# add standard deviation on the summary table and
# save the summary table to sumTable object
sumTable <- HtGender[, list(meanHt = mean(Height),
                            medianHt = median(Height),
                            minHt = min(Height),
                            maxHT = max(Height),
                            rangeHt = max(Height) - min(Height),
                            Q1 = quantile(Height, 0.25),
                            Q2 = median(Height),
                            Q3 = quantile(Height, 0.75),
                            IQR = IQR(Height),
                            stdev = sd(Height),
                            obs = .N),
                     by = "Gender"]
# display the summary table
sumTable
   Gender   meanHt medianHt minHt maxHT rangeHt     Q1     Q2     Q3   IQR
   <char>    <num>    <num> <num> <num>   <num>  <num>  <num>  <num> <num>
1: Female 161.1648   158.48 143.4   174    30.6 154.48 158.48 168.02 13.54
2:   Male 163.4911   161.56 143.0   186    43.0 153.23 161.56 173.60 20.37
       stdev   obs
       <num> <int>
1:  8.699673    21
2: 12.629607    18
The box plot generated displays the spread of the height data along the y-axis, and grouped by Gender along the x-axis.
boxplot(Height ~ Gender, data = HtGender, xlab = "Gender", ylab = "Height (cm)")
Figure 2. Box plot comparing the heights of male and female participants
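When describing a box plot, it can help to see the numbers it is drawn from. Base R's `boxplot.stats()` returns them without plotting. The sketch below uses a hypothetical vector `x` (not the class data) with one deliberately extreme value.

```r
# Hypothetical heights in cm (NOT the class data); 210 is deliberately extreme
x <- c(150, 152, 155, 158, 160, 162, 165, 168, 210)

bs <- boxplot.stats(x)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# out: points beyond 1.5 times the box length from the box, drawn individually
print(bs$out)
```

The hinges used by `boxplot.stats()` come from `fivenum()` and can differ slightly from the quartiles returned by `quantile()`, so the box edges may not match a Q1 and Q3 computed with `quantile()` exactly.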
We can export our summary table to a format that is compatible with most programs. In the example below, we export the summary table to a comma-delimited (CSV) file for use in MS Excel or Word.
write.csv(sumTable, "C:/stat/SumTable.csv", row.names = FALSE)
For your exercise, use the Weight variable to generate the same descriptive statistics as in the R code above:
Generate a summary table of the measures of central tendency and dispersion. Copy and paste it as a table in MS Word format.
Describe the resulting box plot.
Summary/reflection/lesson learned from this exercise