Descriptive statistics describe the basic features of a data set by providing simple summaries of the sample and its measures. Unlike inferential statistics, descriptive statistics simply describe what the data show. They also help to reduce large amounts of data to a sensible summary (https://socialresearchmethods.net/kb/statdesc.php).
In this activity, I first run the code below to remove all objects from the R workspace (the global environment). This does not delete any files in the working directory.
code
rm(list=ls())
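To illustrate what rm(list=ls()) does, here is a minimal sketch. It uses a separate scratch environment named demo.env, which is a hypothetical name and not part of the original activity, so that running it does not clear the reader's own workspace.

```r
# demo.env is a hypothetical scratch environment used only for illustration
demo.env <- new.env()
assign("x", 1:3, envir = demo.env)
assign("y", "text", envir = demo.env)
ls(demo.env)                               # names of the objects it holds

# Same idea as rm(list=ls()), but restricted to demo.env
rm(list = ls(demo.env), envir = demo.env)
remaining <- length(ls(demo.env))          # number of objects left
remaining

# Only R objects are removed; files in the working directory are untouched
files.in.wd <- list.files()
```

Calling rm(list=ls()) at the top of a script applies the same removal to the global environment.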
To set the working directory, I run the code below so that the HEIGHT and WEIGHT data of the class can be read and used in the analyses for this activity.
setwd("C:\\Users\\April Mae Tabonda\\Documents\\MS Marine Science\\Biostat\\PLP\\RMDs\\PLP_5 Descriptive Statistics")
getwd()
## [1] "C:/Users/April Mae Tabonda/Documents/MS Marine Science/Biostat/PLP/RMDs/PLP_5 Descriptive Statistics"
Next, create an object named htwt.data. The object htwt.data is a data frame of the HEIGHT and WEIGHT data of the class: a table, or two-dimensional array-like structure, in which each column contains the values of one variable and each row contains one set of values from each column (https://www.tutorialspoint.com/r/r_data_frames.htm). The data file must be saved in .csv format in the working directory. To read the htwt data of the class, I run the code below:
codes
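# Read the class data; stringsAsFactors=TRUE keeps SEX a factor in R >= 4.0
htwt.data <- read.csv(file="htwt.csv",
header=TRUE, sep=",", stringsAsFactors=TRUE)
# attach() lets the columns be referenced by name (e.g. HEIGHT, WEIGHT)
attach(htwt.data)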
str(htwt.data)
## 'data.frame': 16 obs. of 4 variables:
## $ SUBJECT: int 1 2 3 4 5 6 7 8 9 10 ...
## $ SEX : Factor w/ 2 levels "F","M": 1 1 2 1 2 1 1 2 2 1 ...
## $ WEIGHT : int 101 112 200 110 123 125 129 165 185 100 ...
## $ HEIGHT : num 61 61 68 59 65 62 62 65 73.2 65 ...
nrow(htwt.data)
## [1] 16
ncol(htwt.data)
## [1] 4
dim(htwt.data)
## [1] 16 4
names(htwt.data)
## [1] "SUBJECT" "SEX" "WEIGHT" "HEIGHT"
colnames(htwt.data)
## [1] "SUBJECT" "SEX" "WEIGHT" "HEIGHT"
rownames(htwt.data)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16"
head(htwt.data, n=10)
## SUBJECT SEX WEIGHT HEIGHT
## 1 1 F 101 61.0
## 2 2 F 112 61.0
## 3 3 M 200 68.0
## 4 4 F 110 59.0
## 5 5 M 123 65.0
## 6 6 F 125 62.0
## 7 7 F 129 62.0
## 8 8 M 165 65.0
## 9 9 M 185 73.2
## 10 10 F 100 65.0
tail(htwt.data)
## SUBJECT SEX WEIGHT HEIGHT
## 11 11 M 201 70.0
## 12 12 M 160 65.0
## 13 13 F 94 58.0
## 14 14 M 176 66.0
## 15 15 F 95 61.0
## 16 16 M 158 64.5
Codes and their uses:
str(htwt.data) to show the structure of the data frame
nrow(htwt.data) to give the number of rows
ncol(htwt.data) to give the number of columns
dim(htwt.data) to show the dimensions of the data frame
names(htwt.data) to show the variable names
colnames(htwt.data) to show the column names
rownames(htwt.data) to show the row names
head(htwt.data, n=10) to show the first 10 observations (the default is 6)
tail(htwt.data) to show the last 6 observations
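For readers without access to htwt.csv, the following sketch rebuilds the same 16 rows by hand from the head() and tail() output shown above. The object name htwt.demo is an assumption made for this example so that it does not overwrite htwt.data.

```r
# Values copied from the head(htwt.data, n=10) and tail(htwt.data) output above
htwt.demo <- data.frame(
  SUBJECT = 1:16,
  SEX     = factor(c("F", "F", "M", "F", "M", "F", "F", "M",
                     "M", "F", "M", "M", "F", "M", "F", "M")),
  WEIGHT  = as.integer(c(101, 112, 200, 110, 123, 125, 129, 165,
                         185, 100, 201, 160, 94, 176, 95, 158)),
  HEIGHT  = c(61, 61, 68, 59, 65, 62, 62, 65,
              73.2, 65, 70, 65, 58, 66, 61, 64.5)
)

str(htwt.demo)            # structure: 16 obs. of 4 variables
dim(htwt.demo)            # rows, then columns
nrow(head(htwt.demo))     # head() returns 6 rows unless n is given
levels(htwt.demo$SEX)     # the factor labels behind the codes 1 and 2
```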
class(SUBJECT)
## [1] "integer"
class(htwt.data$HEIGHT)
## [1] "numeric"
class(htwt.data$WEIGHT)
## [1] "integer"
str(htwt.data)
## 'data.frame': 16 obs. of 4 variables:
## $ SUBJECT: int 1 2 3 4 5 6 7 8 9 10 ...
## $ SEX : Factor w/ 2 levels "F","M": 1 1 2 1 2 1 1 2 2 1 ...
## $ WEIGHT : int 101 112 200 110 123 125 129 165 185 100 ...
## $ HEIGHT : num 61 61 68 59 65 62 62 65 73.2 65 ...
duplicated(htwt.data$SUBJECT)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE
Description
The data frame contains 16 observations of 4 variables (SUBJECT, SEX, WEIGHT, HEIGHT). Weight is in pounds (lb) and height is in inches. SEX is a factor with two levels, "F" (female) and "M" (male), which R stores internally as the integer codes 1 and 2. SUBJECT is an identification number for each student who participated in the data gathering; duplicated() returns FALSE for every row, so no subject number is repeated.
To better visualize and understand the class data, histograms and index plots were drawn using the code below.
codes
hist(HEIGHT)
plot(HEIGHT)
hist(WEIGHT)
plot(WEIGHT)
Descriptions and Insights
A histogram shows how many students have heights that fall within the same range. On the other hand, plot() shows the height of each of the 16 students individually, in the order in which they appear in the data. In my understanding, the histogram shows which height range most students in the class belong to, while the plot gives a glimpse of the minimum and maximum heights within the class. The original activity did not include a histogram and plot of weight, but in this section I also ran the code for both using the weight data.
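The bins behind a histogram can also be inspected as numbers. The sketch below assumes the HEIGHT values copied from the output above and uses the plot = FALSE argument of hist() to return the bin edges and counts without drawing anything. The name h is chosen here for illustration.

```r
# HEIGHT values copied from the head()/tail() output above
HEIGHT <- c(61, 61, 68, 59, 65, 62, 62, 65,
            73.2, 65, 70, 65, 58, 66, 61, 64.5)

# Compute the histogram bins without plotting
h <- hist(HEIGHT, plot = FALSE)

# One row per bin: lower edge, upper edge, number of students
bins <- data.frame(lower = head(h$breaks, -1),
                   upper = tail(h$breaks, -1),
                   count = h$counts)
bins
sum(h$counts)   # every student falls in exactly one bin
```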
plot(density(HEIGHT))
Description
Density plots are often a more effective way to view the distribution of a variable. They plot probability densities instead of frequencies.
Use the code below to create boxplots.
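Because a density curve shows probability density rather than counts, the area under it should be close to 1. The following sketch checks this numerically for HEIGHT, using values copied from the output above and a simple rectangle-rule approximation. The names d and area are chosen for this example.

```r
# HEIGHT values copied from the head()/tail() output above
HEIGHT <- c(61, 61, 68, 59, 65, 62, 62, 65,
            73.2, 65, 70, 65, 58, 66, 61, 64.5)

d <- density(HEIGHT)   # kernel density estimate with default settings
d$bw                   # bandwidth chosen by R

# Approximate the area under the curve: density values times grid spacing
area <- sum(d$y) * diff(d$x)[1]
area                   # should be approximately 1
```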
code
boxplot(HEIGHT)
boxplot(WEIGHT)
boxplot.stats(HEIGHT)
## $stats
## [1] 58.00 61.00 64.75 65.50 70.00
##
## $n
## [1] 16
##
## $conf
## [1] 62.9725 66.5275
##
## $out
## [1] 73.2
boxplot.stats(WEIGHT)
## $stats
## [1] 94.0 105.5 127.0 170.5 201.0
##
## $n
## [1] 16
##
## $conf
## [1] 101.325 152.675
##
## $out
## integer(0)
Description
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It shows whether there are outliers and what their values are. It also shows whether the data are symmetrical, how tightly they are grouped, and whether and how they are skewed.
Figure 1. Different parts of boxplot
To better understand the different parts of a boxplot, Figure 1 shows how each part is interpreted:
median (Q2/50th percentile): the middle value of the dataset.
first quartile (Q1/25th percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset.
third quartile (Q3/75th percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.
interquartile range (IQR): the distance from the 25th to the 75th percentile (Q3 - Q1).
whiskers (shown in blue): the lines extending from the box toward the “minimum” and “maximum”.
outliers (shown as green circles): the points beyond the “minimum” or “maximum”.
“maximum”: Q3 + 1.5 × IQR
“minimum”: Q1 - 1.5 × IQR
source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
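The "minimum" and "maximum" limits can be computed by hand and compared with the boxplot.stats(HEIGHT) output above. This is a sketch using HEIGHT values copied from the earlier output. The names five, box.iqr, lower.fence, upper.fence, outliers and whisker.ends are chosen for illustration. R's boxplot uses the hinges returned by fivenum() as the box edges.

```r
# HEIGHT values copied from the head()/tail() output above
HEIGHT <- c(61, 61, 68, 59, 65, 62, 62, 65,
            73.2, 65, 70, 65, 58, 66, 61, 64.5)

five <- fivenum(HEIGHT)        # min, lower hinge, median, upper hinge, max
box.iqr <- five[4] - five[2]   # width of the box

lower.fence <- five[2] - 1.5 * box.iqr   # the "minimum" limit
upper.fence <- five[4] + 1.5 * box.iqr   # the "maximum" limit

# Points beyond the fences are drawn as outliers
outliers <- HEIGHT[HEIGHT < lower.fence | HEIGHT > upper.fence]
outliers

# Whiskers stop at the most extreme data points inside the fences
whisker.ends <- range(HEIGHT[HEIGHT >= lower.fence & HEIGHT <= upper.fence])
whisker.ends
```

The outlier found this way is the 73.2 reported under $out, and the whisker ends match the first and last values of $stats.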
Stem-and-leaf plot
stem(HEIGHT)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 89
## 6 | 11122
## 6 | 5555568
## 7 | 03
stem(WEIGHT)
##
## The decimal point is 2 digit(s) to the right of the |
##
## 0 | 9
## 1 | 00011233
## 1 | 66789
## 2 | 00
Stripchart
A strip chart is one of the most basic types of plot. It plots the data in order along a line, with each data point represented as a box. The strip chart code is below:
code
stripchart(HEIGHT)
stripchart(WEIGHT)
Dotchart
dotchart(HEIGHT)
dotchart(WEIGHT)
length() returns the number of elements (N) in a vector.
length(HEIGHT)
## [1] 16
length(WEIGHT)
## [1] 16
The code below indicates whether each indexed value is missing. A value is missing if the result is TRUE and not missing if the result is FALSE.
code
is.na(HEIGHT)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE
The code below also checks for missing values, but with the opposite coding: a value is not missing (the case is complete) if the result is TRUE, and missing if the result is FALSE.
code
complete.cases(HEIGHT)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE
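The class data have no missing values, so both checks above return the same answer for every student. To show what they look like when a value is missing, here is a small sketch with a made-up vector, x.with.na, which is hypothetical and not part of the class data.

```r
# Hypothetical vector with one missing value (not from the class data)
x.with.na <- c(61, NA, 68, 59)

is.na(x.with.na)            # TRUE marks the missing value
complete.cases(x.with.na)   # FALSE marks the missing value
n.missing <- sum(is.na(x.with.na))   # count of missing values
n.missing

mean(x.with.na)                              # NA: missing values propagate
mean.no.na <- mean(x.with.na, na.rm = TRUE)  # mean of the 3 observed values
mean.no.na
```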
summary() gives descriptive statistics and also reports the number of NAs, if any. The code below creates a summary of the data for easy understanding.
code
summary(HEIGHT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58.00 61.00 64.75 64.11 65.25 73.20
summary(WEIGHT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 94.0 107.8 127.0 139.6 167.8 201.0
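The 3rd Qu. value from summary(HEIGHT) (65.25) differs slightly from the upper box edge in boxplot.stats(HEIGHT) (65.5). The sketch below, using HEIGHT values copied from the output above, suggests why: summary() relies on quantile(), while the boxplot uses the hinges from fivenum(). The names q3.quantile and q3.hinge are chosen for this example.

```r
# HEIGHT values copied from the head()/tail() output above
HEIGHT <- c(61, 61, 68, 59, 65, 62, 62, 65,
            73.2, 65, 70, 65, 58, 66, 61, 64.5)

# Third quartile as reported by summary(): interpolated quantile
q3.quantile <- unname(quantile(HEIGHT, 0.75))
q3.quantile

# Upper hinge as used by boxplot(): median of the upper half of the data
q3.hinge <- fivenum(HEIGHT)[4]
q3.hinge
```

Both are reasonable definitions of a quartile. They tend to agree more closely as the sample grows.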
Mean or arithmetic average
code
mean(HEIGHT)
## [1] 64.10625
mean(WEIGHT)
## [1] 139.625
Standard deviation
sd(HEIGHT)
## [1] 4.005907
sd(WEIGHT)
## [1] 37.92075
Variance
var(HEIGHT)
## [1] 16.04729
var(WEIGHT)
## [1] 1437.983
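To make the link between variance and standard deviation explicit, this sketch recomputes both by hand for HEIGHT, using values copied from the output above. The names var.manual and sd.manual are chosen for this example. var() and sd() in R use the sample formula with n - 1 in the denominator.

```r
# HEIGHT values copied from the head()/tail() output above
HEIGHT <- c(61, 61, 68, 59, 65, 62, 62, 65,
            73.2, 65, 70, 65, 58, 66, 61, 64.5)

n <- length(HEIGHT)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var.manual <- sum((HEIGHT - mean(HEIGHT))^2) / (n - 1)

# Standard deviation is the square root of the variance
sd.manual <- sqrt(var.manual)

all.equal(var.manual, var(HEIGHT))   # TRUE if the manual formula agrees
all.equal(sd.manual, sd(HEIGHT))
```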
Median or Midpoint
median(HEIGHT)
## [1] 64.75
median(WEIGHT)
## [1] 127
Range (Minimum and maximum)
range(HEIGHT)
## [1] 58.0 73.2
range(WEIGHT)
## [1] 94 201
Minimum or the lowest value within the data
min(HEIGHT)
## [1] 58
min(WEIGHT)
## [1] 94
Location of the first occurrence of the minimum value. The result of the code below shows that student 13 has the minimum HEIGHT and WEIGHT within the class data.
which.min(HEIGHT)
## [1] 13
which.min(WEIGHT)
## [1] 13
Maximum or the highest value within the data
max(HEIGHT)
## [1] 73.2
max(WEIGHT)
## [1] 201
Location of the first occurrence of the maximum value. The results of the code below show that student 9 has the maximum HEIGHT and student 11 has the maximum WEIGHT within the class data.
which.max(HEIGHT)
## [1] 9
which.max(WEIGHT)
## [1] 11
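The index returned by which.min() or which.max() can be used to pull out the whole row for that student. A minimal sketch follows, with the data rebuilt from the output above into a data frame named htwt.demo (a name assumed for this example). It also shows that only the first position is returned when a value is tied.

```r
# Values copied from the head()/tail() output above
htwt.demo <- data.frame(
  SUBJECT = 1:16,
  WEIGHT  = c(101, 112, 200, 110, 123, 125, 129, 165,
              185, 100, 201, 160, 94, 176, 95, 158),
  HEIGHT  = c(61, 61, 68, 59, 65, 62, 62, 65,
              73.2, 65, 70, 65, 58, 66, 61, 64.5)
)

# Full record of the tallest and the lightest student
tallest  <- htwt.demo[which.max(htwt.demo$HEIGHT), ]
lightest <- htwt.demo[which.min(htwt.demo$WEIGHT), ]
tallest
lightest

# With ties, which.max() reports only the first position;
# which() with a comparison reports all of them
first.only <- which.max(c(3, 7, 7))
all.ties   <- which(c(3, 7, 7) == 7)
```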
Contingency table of counts for each distinct value of a vector or each combination of factor levels.
table(HEIGHT)
## HEIGHT
## 58 59 61 62 64.5 65 66 68 70 73.2
## 1 1 3 2 1 4 1 1 1 1
table(WEIGHT)
## WEIGHT
## 94 95 100 101 110 112 123 125 129 158 160 165 176 185 200 201
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Calculate the means of the variables in the data frame htwt.data, excluding missing values. SEX is a factor, not a numeric variable, so mean() returns NA for it with a warning.
sapply(htwt.data,mean,na.rm=TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## SUBJECT SEX WEIGHT HEIGHT
## 8.50000 NA 139.62500 64.10625
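One way to avoid the warning is to keep only the numeric columns before calling mean(). The sketch below assumes a data frame htwt.demo rebuilt from the output above. The names htwt.demo, num.cols and col.means are chosen for this example.

```r
# Values copied from the head()/tail() output above
htwt.demo <- data.frame(
  SUBJECT = 1:16,
  SEX     = factor(c("F", "F", "M", "F", "M", "F", "F", "M",
                     "M", "F", "M", "M", "F", "M", "F", "M")),
  WEIGHT  = c(101, 112, 200, 110, 123, 125, 129, 165,
              185, 100, 201, 160, 94, 176, 95, 158),
  HEIGHT  = c(61, 61, 68, 59, 65, 62, 62, 65,
              73.2, 65, 70, 65, 58, 66, 61, 64.5)
)

# Logical vector: TRUE for numeric columns, FALSE for the factor SEX
num.cols <- sapply(htwt.demo, is.numeric)

# Means of the numeric columns only, so no warning is raised
col.means <- sapply(htwt.demo[num.cols], mean, na.rm = TRUE)
col.means
```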
A simple way of generating summary statistics, overall or by a grouping variable, is available in the psych package. To install the packages, run install.packages("psych") and install.packages("Hmisc") once, then load psych with library("psych") to generate the summary statistics.
codes
library("psych")
describe(htwt.data)
## vars n mean sd median trimmed mad min max range skew
## SUBJECT 1 16 8.50 4.76 8.50 8.50 5.93 1 16.0 15.0 0.00
## SEX* 2 16 1.50 0.52 1.50 1.50 0.74 1 2.0 1.0 0.00
## WEIGHT 3 16 139.62 37.92 127.00 138.50 46.70 94 201.0 107.0 0.30
## HEIGHT 4 16 64.11 4.01 64.75 63.89 4.45 58 73.2 15.2 0.54
## kurtosis se
## SUBJECT -1.43 1.19
## SEX* -2.12 0.13
## WEIGHT -1.53 9.48
## HEIGHT -0.41 1.00
describeBy(htwt.data,SEX,skew=FALSE,ranges=FALSE)
##
## Descriptive statistics by group
## group: F
## vars n mean sd se
## SUBJECT 1 8 7.25 5.06 1.79
## SEX* 2 8 1.00 0.00 0.00
## WEIGHT 3 8 108.25 13.24 4.68
## HEIGHT 4 8 61.12 2.10 0.74
## --------------------------------------------------------
## group: M
## vars n mean sd se
## SUBJECT 1 8 9.75 4.40 1.56
## SEX* 2 8 2.00 0.00 0.00
## WEIGHT 3 8 171.00 25.61 9.06
## HEIGHT 4 8 67.09 3.11 1.10
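The group means and standard errors reported by describeBy() can also be reproduced in base R, which may help show where the numbers come from. This is a sketch under the assumption that the data are rebuilt from the output above into htwt.demo. The names grp.means and se.weight are chosen for illustration.

```r
# Values copied from the head()/tail() output above
htwt.demo <- data.frame(
  SEX    = factor(c("F", "F", "M", "F", "M", "F", "F", "M",
                    "M", "F", "M", "M", "F", "M", "F", "M")),
  WEIGHT = c(101, 112, 200, 110, 123, 125, 129, 165,
             185, 100, 201, 160, 94, 176, 95, 158),
  HEIGHT = c(61, 61, 68, 59, 65, 62, 62, 65,
             73.2, 65, 70, 65, 58, 66, 61, 64.5)
)

# Mean WEIGHT and HEIGHT for each level of SEX
grp.means <- aggregate(cbind(WEIGHT, HEIGHT) ~ SEX, data = htwt.demo, FUN = mean)
grp.means

# Standard error of the mean: sd divided by the square root of n
se.weight <- tapply(htwt.demo$WEIGHT, htwt.demo$SEX,
                    function(x) sd(x) / sqrt(length(x)))
se.weight
```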
The code below shows pairwise scatter plots of all variables, including histograms, locally smoothed regressions, and Pearson correlations.
code
pairs.panels(htwt.data)
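The Pearson correlation that pairs.panels() prints for HEIGHT and WEIGHT can be computed directly with cor() from base R. The sketch below uses values copied from the output above. The name r.hw is chosen for this example.

```r
# Values copied from the head()/tail() output above
WEIGHT <- c(101, 112, 200, 110, 123, 125, 129, 165,
            185, 100, 201, 160, 94, 176, 95, 158)
HEIGHT <- c(61, 61, 68, 59, 65, 62, 62, 65,
            73.2, 65, 70, 65, 58, 66, 61, 64.5)

# Pearson correlation (the default method of cor())
r.hw <- cor(HEIGHT, WEIGHT)
r.hw

# Same value from the definition: covariance divided by both standard deviations
r.manual <- cov(HEIGHT, WEIGHT) / (sd(HEIGHT) * sd(WEIGHT))
all.equal(r.hw, r.manual)
```

A positive value here indicates that taller students in the class tend to be heavier.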
In this activity, we evaluated the normality of a variable using a Q-Q plot (quantile-quantile plot) with the code below:
codes
qqnorm(HEIGHT,
main = "Normal Q-Q Plot of Height of Biostat students",
xlab="Theoretical Quantiles of the Normal Distribution",
ylab = "Sample Quantiles of the Height of Biostat students")
qqline(HEIGHT)
In this section, we perform a Shapiro-Wilk (SW) test of normality using the code below:
code
shapiro.test(htwt.data$HEIGHT)
##
## Shapiro-Wilk normality test
##
## data: htwt.data$HEIGHT
## W = 0.94823, p-value = 0.4622
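The test result is a list, so the statistic and p-value can be extracted and compared with a significance level in code. Here is a minimal sketch using HEIGHT values copied from the output above. The names sw, alpha and decision, and the 0.05 threshold, are assumptions made for this example.

```r
# HEIGHT values copied from the head()/tail() output above
HEIGHT <- c(61, 61, 68, 59, 65, 62, 62, 65,
            73.2, 65, 70, 65, 58, 66, 61, 64.5)

sw <- shapiro.test(HEIGHT)
sw$statistic   # the W statistic
sw$p.value     # the p-value

# Decision at the 5% significance level (alpha is an assumed threshold)
alpha <- 0.05
decision <- if (sw$p.value < alpha) "reject H0" else "fail to reject H0"
decision
```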
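This section shows the code for normality tests using the Kolmogorov-Smirnov (KS) test. Note that mean=HEIGHT[1] and sd=HEIGHT[2] pass the first two observations (61 and 61) as the parameters of the reference normal distribution, not the sample mean and standard deviation, and that mean=0, sd=1 compares the data with a standard normal distribution. The small p-values below therefore reflect the choice of reference parameters rather than non-normality of the data; mean(HEIGHT) and sd(HEIGHT) (and likewise for WEIGHT) are the appropriate arguments.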
code
ks.test(htwt.data$HEIGHT,"pnorm", mean=HEIGHT[1], sd=HEIGHT[2])
## Warning in ks.test(htwt.data$HEIGHT, "pnorm", mean = HEIGHT[1], sd =
## HEIGHT[2]): ties should not be present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: htwt.data$HEIGHT
## D = 0.48039, p-value = 0.001241
## alternative hypothesis: two-sided
ks.test(htwt.data$HEIGHT,"pnorm", mean=0, sd=1)
## Warning in ks.test(htwt.data$HEIGHT, "pnorm", mean = 0, sd = 1): ties
## should not be present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: htwt.data$HEIGHT
## D = 1, p-value = 2.531e-14
## alternative hypothesis: two-sided
ks.test(htwt.data$WEIGHT,"pnorm", mean=WEIGHT[1], sd=WEIGHT[2])
##
## One-sample Kolmogorov-Smirnov test
##
## data: htwt.data$WEIGHT
## D = 0.47508, p-value = 0.0007725
## alternative hypothesis: two-sided
ks.test(htwt.data$WEIGHT,"pnorm", mean=0, sd=1)
##
## One-sample Kolmogorov-Smirnov test
##
## data: htwt.data$WEIGHT
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
Description
Both the SW and KS tests use the same hypotheses:
H0: There is no difference between the distribution of the data set and a normal distribution.
HA: There is a difference between the distribution of the data set and a normal distribution.
R provides the p-value; if it is below 0.05, reject H0. For example, the SW test for HEIGHT gives p = 0.4622, which is above 0.05, so H0 is not rejected.
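The ks.test() calls above use HEIGHT[1] and HEIGHT[2] (the first two observations) or 0 and 1 as the mean and standard deviation of the reference normal distribution. A sketch of the same test with the sample mean and standard deviation is given below, using HEIGHT values copied from the earlier output. The name ks.height is chosen for this example. Because the parameters are estimated from the same data, the p-value should be read as approximate. The tie warning is suppressed here only to keep the output short.

```r
# HEIGHT values copied from the head()/tail() output above
HEIGHT <- c(61, 61, 68, 59, 65, 62, 62, 65,
            73.2, 65, 70, 65, 58, 66, 61, 64.5)

# Compare HEIGHT with a normal distribution that has the sample mean and sd
ks.height <- suppressWarnings(
  ks.test(HEIGHT, "pnorm", mean = mean(HEIGHT), sd = sd(HEIGHT))
)
ks.height$statistic   # D: largest gap between the empirical and normal CDFs
ks.height$p.value     # approximate p-value
```

With these reference parameters the D statistic is much smaller than in the calls above, and the conclusion for HEIGHT agrees with the SW test.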
SEXcounts<-table(htwt.data$SEX)
SEXcounts
##
## F M
## 8 8
barplot(SEXcounts, main="Number of Males & Females",xlab="SEX",ylab="N")
In this section, a boxplot was used to compare HEIGHT between males and females using the code below:
code
boxplot(htwt.data$HEIGHT~SEX, main="HEIGHT by SEX",
ylab="HEIGHT(in)",
xlab = "SEX")
Run the code below to create a boxplot comparing WEIGHT between males and females.
code
boxplot(htwt.data$WEIGHT~SEX, main="WEIGHT by SEX",
ylab = "WEIGHT(lb)",
xlab="SEX")
The code below creates boxplots of HEIGHT (1) and WEIGHT (2) side by side, with males and females combined.
code
boxplot(htwt.data$HEIGHT,htwt.data$WEIGHT)
title("HEIGHTs(1) and WEIGHTs(2), GENDER Combined",
xlab = "GENDER",
ylab = "Ht and Wt")
hist(htwt.data$HEIGHT, main="Histogram of HEIGHT", xlab="HEIGHT")
hist(htwt.data$WEIGHT, main = "Histogram of WEIGHT", xlab="WEIGHT")