PLP#5 Descriptive Statistics

Descriptive statistics

Descriptive statistics describe the basic features of a data set by providing simple summaries of the sample and its measures. Unlike inferential statistics, descriptive statistics simply describe what the data show. They also help condense large amounts of data into a sensible form (https://socialresearchmethods.net/kb/statdesc.php).

Housekeeping-Use for All Analyses

In this activity, I run the code below to remove all objects from the R workspace. It does not delete files in the working directory.

code

rm(list=ls())
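The effect of this idiom can be seen without wiping the current session by trying it on a throwaway environment. This is only a minimal sketch; the names scratch, a and b are made up for illustration.

```r
# Illustrative only: 'scratch', 'a' and 'b' are hypothetical names.
scratch <- new.env()
assign("a", 1:3, envir = scratch)
assign("b", "text", envir = scratch)

ls(scratch)                              # objects currently in the environment
rm(list = ls(scratch), envir = scratch)  # same idiom as rm(list = ls())
n_left <- length(ls(scratch))            # no objects remain after the call
n_left

# Files on disk are a separate matter; list.files() only lists them.
files_here <- list.files()
```

Only objects held in memory are removed; anything saved in the working directory stays where it is.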

To set the working directory, I run the code below so that the HEIGHT and WEIGHT data of the class can be read and used in the different analyses for this activity.

setwd("C:\\Users\\April Mae Tabonda\\Documents\\MS Marine Science\\Biostat\\PLP\\RMDs\\PLP_5 Descriptive Statistics")
getwd()

## [1] "C:/Users/April Mae Tabonda/Documents/MS Marine Science/Biostat/PLP/RMDs/PLP_5 Descriptive Statistics"

DATA IMPORT OF A .CSV TYPE DATA FILE INTO R

The imported data are stored in an object named htwt.data. The object htwt.data is a data frame holding the HEIGHT and WEIGHT data of the class: a table or two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column (https://www.tutorialspoint.com/r/r_data_frames.htm). The data file must be saved as a .csv file in the working directory. To read the htwt data of the class, I run the codes below:

codes

htwt.data<- read.csv(file="htwt.csv",
                     header=TRUE, sep=",")
attach(htwt.data)
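Two details of the import may be worth a hedged sketch. Since R 4.0.0, read.csv() no longer converts text columns to factors by default, so stringsAsFactors = TRUE is needed to get SEX as a factor in current versions. Columns can also be reached with $ or with() instead of attach(). The three-row CSV text below is typed from the first rows of the class data printed in this activity, and the object name demo is hypothetical.

```r
# First three rows of the class data, typed inline so no file is needed.
csv_text <- "SUBJECT,SEX,WEIGHT,HEIGHT
1,F,101,61
2,F,112,61
3,M,200,68"

# stringsAsFactors = TRUE makes SEX a factor regardless of the R version.
demo <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
str(demo)

# Access columns without attach(): demo$HEIGHT or with(demo, ...).
mean_height <- with(demo, mean(HEIGHT))
mean_height
```

Using with() or $ keeps it explicit which data frame a column comes from, which helps when several data sets are loaded at once.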

HEIGHT AND WEIGHT DATA OF THE CLASS

str(htwt.data)

## 'data.frame':    16 obs. of  4 variables:
##  $ SUBJECT: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SEX    : Factor w/ 2 levels "F","M": 1 1 2 1 2 1 1 2 2 1 ...
##  $ WEIGHT : int  101 112 200 110 123 125 129 165 185 100 ...
##  $ HEIGHT : num  61 61 68 59 65 62 62 65 73.2 65 ...

nrow(htwt.data)

## [1] 16

ncol(htwt.data)

## [1] 4

dim(htwt.data)

## [1] 16  4

names(htwt.data)

## [1] "SUBJECT" "SEX"     "WEIGHT"  "HEIGHT"

colnames(htwt.data)

## [1] "SUBJECT" "SEX"     "WEIGHT"  "HEIGHT"

rownames(htwt.data)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16"

head(htwt.data, n=10)

##    SUBJECT SEX WEIGHT HEIGHT
## 1        1   F    101   61.0
## 2        2   F    112   61.0
## 3        3   M    200   68.0
## 4        4   F    110   59.0
## 5        5   M    123   65.0
## 6        6   F    125   62.0
## 7        7   F    129   62.0
## 8        8   M    165   65.0
## 9        9   M    185   73.2
## 10      10   F    100   65.0

tail(htwt.data)

##    SUBJECT SEX WEIGHT HEIGHT
## 11      11   M    201   70.0
## 12      12   M    160   65.0
## 13      13   F     94   58.0
## 14      14   M    176   66.0
## 15      15   F     95   61.0
## 16      16   M    158   64.5

Codes and their uses:

str(htwt.data) to identify the structure
nrow(htwt.data) to list the number of rows
ncol(htwt.data) to list the number of columns
dim(htwt.data) to show the dimensions of the data frame
names(htwt.data) to identify names
colnames(htwt.data) to show column names
rownames(htwt.data) to show row names
head(htwt.data, n=10) to show the first 10 observations
tail(htwt.data) to show the last 6 observations

DATA ORGANIZATION AND INFORMATION

class(SUBJECT)

## [1] "integer"

class(htwt.data$HEIGHT)

## [1] "numeric"

class(htwt.data$WEIGHT)

## [1] "integer"

str(htwt.data)

## 'data.frame':    16 obs. of  4 variables:
##  $ SUBJECT: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SEX    : Factor w/ 2 levels "F","M": 1 1 2 1 2 1 1 2 2 1 ...
##  $ WEIGHT : int  101 112 200 110 123 125 129 165 185 100 ...
##  $ HEIGHT : num  61 61 68 59 65 62 62 65 73.2 65 ...

duplicated(htwt.data$SUBJECT)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE

Description

The data frame has 16 observations of 4 variables (SUBJECT, SEX, WEIGHT, HEIGHT). Weight is in lbs. and height is in inches. SEX is a factor with the levels F (female) and M (male), which R stores internally as the codes 1 and 2. SUBJECT is an identification number for each student who took part in the data gathering; duplicated() returns FALSE for every row, so no subject appears twice.
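The link between the labels F and M and the numbers 1 and 2 shown by str() can be sketched as follows. The SEX values are copied from the printed class data; the object names sex, codes and counts are illustrative.

```r
# SEX values for the 16 students, copied from the printed data.
sex <- factor(c("F", "F", "M", "F", "M", "F", "F", "M",
                "M", "F", "M", "M", "F", "M", "F", "M"))

levels(sex)                # the labels, in alphabetical order
codes <- as.integer(sex)   # the internal integer code of each observation
head(codes)
counts <- table(sex)       # number of students per level
counts
```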

VISUAL DATA CHECK

To better visualize and understand the class data, histograms and plots were produced using the codes below.

codes

hist(HEIGHT)

plot(HEIGHT)

hist(WEIGHT)

plot(WEIGHT)

Descriptions and Insights

The histogram shows how many students have heights in the same range. On the other hand, the plot shows the height of each of the 16 students individually, in the order in which they appear in the data. In my understanding, the histogram gives an idea of which height range most students in the class belong to, while the plot gives a glimpse of the minimum and maximum heights within the class. The original activity did not include a histogram and plot of weight, but in this section I also ran the codes for the histogram and plot using the weight data.

plot(density(HEIGHT))

Description

A density plot is usually a more effective way to view the distribution of a variable. It plots probability densities instead of frequencies.
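The difference between frequencies and densities can be sketched by drawing the histogram on the density scale and overlaying the density curve. The HEIGHT values are copied from the printed class data; the plot titles and object names are my own choices.

```r
# HEIGHT values (inches) copied from the printed class data.
height <- c(61, 61, 68, 59, 65, 62, 62, 65, 73.2, 65, 70, 65, 58, 66, 61, 64.5)

# freq = FALSE puts the histogram on the density scale (bar areas sum to 1).
h <- hist(height, freq = FALSE, main = "HEIGHT: histogram and density",
          xlab = "HEIGHT (in)")
d <- density(height)   # kernel density estimate with the default bandwidth
lines(d)               # overlay the smooth curve on the bars

# Total area of the histogram bars on the density scale.
bar_area <- sum(h$density * diff(h$breaks))
bar_area
```

On the density scale the y-axis no longer counts students, which makes the histogram and the smooth curve directly comparable.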

BOXPLOT

Use the code below to create boxplots.

code

boxplot(HEIGHT)

boxplot(WEIGHT)

boxplot.stats(HEIGHT)

## $stats
## [1] 58.00 61.00 64.75 65.50 70.00
## 
## $n
## [1] 16
## 
## $conf
## [1] 62.9725 66.5275
## 
## $out
## [1] 73.2

boxplot.stats(WEIGHT)

## $stats
## [1]  94.0 105.5 127.0 170.5 201.0
## 
## $n
## [1] 16
## 
## $conf
## [1] 101.325 152.675
## 
## $out
## integer(0)

Description

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It shows whether there are outliers and what their values are. It also shows whether the data are symmetrical, how tightly they are grouped, and whether and how they are skewed.

Figure 1. Different parts of boxplot

To better understand the different parts of a boxplot, Figure 1 shows the interpretation of each part:

median (Q2/50th percentile): the middle value of the dataset.
first quartile (Q1/25th percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset.
third quartile (Q3/75th percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.
interquartile range (IQR): the distance from the 25th to the 75th percentile (Q3 - Q1).
whiskers (shown in blue)
outliers (shown as green circles)
“maximum”: Q3 + 1.5 × IQR
“minimum”: Q1 - 1.5 × IQR

source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
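The fences described above can be worked out by hand for HEIGHT and compared with the boxplot.stats() output. This is a sketch under one stated assumption: R's boxplot uses the hinges returned by fivenum() as Q1 and Q3, which can differ slightly from the quartiles reported by summary() or quantile(). The object names are illustrative.

```r
# HEIGHT values (inches) copied from the printed class data.
height <- c(61, 61, 68, 59, 65, 62, 62, 65, 73.2, 65, 70, 65, 58, 66, 61, 64.5)

five <- fivenum(height)          # min, lower hinge, median, upper hinge, max
box_width <- five[4] - five[2]   # spread between the hinges (the boxplot's IQR)
lower_fence <- five[2] - 1.5 * box_width
upper_fence <- five[4] + 1.5 * box_width

# Points beyond the fences are drawn individually as outliers.
outliers <- height[height < lower_fence | height > upper_fence]
# Whiskers stop at the most extreme observations inside the fences.
whiskers <- range(height[height >= lower_fence & height <= upper_fence])

outliers   # 73.2, matching $out for HEIGHT
whiskers   # 58 and 70, matching the ends of $stats for HEIGHT
```

So the whiskers end at actual observations (58 and 70), while Q1 - 1.5 × IQR and Q3 + 1.5 × IQR only mark the limits beyond which a point is flagged as an outlier.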

Stem-and-leaf plot

stem(HEIGHT)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   5 | 89
##   6 | 11122
##   6 | 5555568
##   7 | 03

stem(WEIGHT)

## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   0 | 9
##   1 | 00011233
##   1 | 66789
##   2 | 00

Stripchart

A strip chart is the most basic type of plot available. It plots the data in order along a line with each data point represented as a box. Stripchart code below:

code

stripchart(HEIGHT)

stripchart(WEIGHT)

Dotchart

dotchart(HEIGHT)

dotchart(WEIGHT)

DESCRIPTIVE ANALYSIS OF THE DATA

length() returns N, the number of elements in a vector.

length(HEIGHT)

## [1] 16

length(WEIGHT)

## [1] 16

The code below indicates whether each indexed value is missing. A value is missing if the result is TRUE and present if the result is FALSE.

code

is.na(HEIGHT)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE

The code below also indicates whether each indexed value is missing, but with the opposite coding: a value is present if the result is TRUE and missing if the result is FALSE.

code

complete.cases(HEIGHT)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE
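Because the class data have no missing values, both checks return the same answer for every student. A small made-up vector shows how the two functions behave when a value is missing; x is hypothetical and not part of the class data.

```r
# Hypothetical vector with one missing height; not part of the class data.
x <- c(61, NA, 68)

is.na(x)                       # TRUE marks the missing entry
complete.cases(x)              # TRUE marks the entries that are present
n_missing <- sum(is.na(x))     # count of missing values
n_missing

mean(x)                            # NA, because one value is missing
mean_ok <- mean(x, na.rm = TRUE)   # mean of the observed values only
mean_ok
```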

The codes below create a summary of the data (descriptive statistics, with a count of NAs if there are any) for easy understanding.

code

summary(HEIGHT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   58.00   61.00   64.75   64.11   65.25   73.20

summary(WEIGHT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    94.0   107.8   127.0   139.6   167.8   201.0

Mean or arithmetic average

code

mean(HEIGHT)

## [1] 64.10625

mean(WEIGHT)

## [1] 139.625

Standard deviation

sd(HEIGHT)

## [1] 4.005907

sd(WEIGHT)

## [1] 37.92075

Variance

var(HEIGHT)

## [1] 16.04729

var(WEIGHT)

## [1] 1437.983
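The relation between the standard deviation and the variance can be checked by hand. The sketch below assumes the usual sample formulas with an n - 1 denominator, which is what var() and sd() use; the object names are illustrative.

```r
# HEIGHT values (inches) copied from the printed class data.
height <- c(61, 61, 68, 59, 65, 62, 62, 65, 73.2, 65, 70, 65, 58, 66, 61, 64.5)

n <- length(height)
dev <- height - mean(height)         # deviations from the mean
var_manual <- sum(dev^2) / (n - 1)   # sample variance, n - 1 denominator
sd_manual <- sqrt(var_manual)        # standard deviation is its square root

all.equal(var_manual, var(height))   # TRUE
all.equal(sd_manual, sd(height))     # TRUE
```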

Median or Midpoint

median(HEIGHT)

## [1] 64.75

median(WEIGHT)

## [1] 127

Range (Minimum and maximum)

range(HEIGHT)

## [1] 58.0 73.2

range(WEIGHT)

## [1]  94 201

Minimum or the lowest value within the data

min(HEIGHT)

## [1] 58

min(WEIGHT)

## [1] 94

Location of the first occurrence of the minimum value. The result of the code below shows that student 13 has the minimum HEIGHT and WEIGHT within the class data.

which.min(HEIGHT)

## [1] 13

which.min(WEIGHT)

## [1] 13

Maximum or the highest value within the data

max(HEIGHT)

## [1] 73.2

max(WEIGHT)

## [1] 201

Location of the first occurrence of the maximum value. The results of the codes below show that student 9 has the maximum value for HEIGHT and student 11 has the maximum value for WEIGHT within the class data.

which.max(HEIGHT)

## [1] 9

which.max(WEIGHT)

## [1] 11

Frequency table of counts for each distinct value of a vector (or each level of a factor).

table(HEIGHT)

## HEIGHT
##   58   59   61   62 64.5   65   66   68   70 73.2 
##    1    1    3    2    1    4    1    1    1    1

table(WEIGHT)

## WEIGHT
##  94  95 100 101 110 112 123 125 129 158 160 165 176 185 200 201 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
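A frequency table can also be used to pick out the most frequent value (the mode) and to turn counts into proportions. This is a minimal sketch; the object names are illustrative, and which.max() returns only the first value if several are tied for the highest count.

```r
# HEIGHT values (inches) copied from the printed class data.
height <- c(61, 61, 68, 59, 65, 62, 62, 65, 73.2, 65, 70, 65, 58, 66, 61, 64.5)

tab <- table(height)                                   # counts per distinct value
mode_value <- as.numeric(names(tab)[which.max(tab)])   # most frequent height
mode_count <- max(tab)                                 # how often it occurs
prop <- prop.table(tab)                                # counts as proportions of N

mode_value   # 65
mode_count   # 4
```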

Calculate the means for the variables in the data frame htwt.data, excluding missing values. SEX is a factor, not a numeric variable, so R returns NA for it with a warning.

sapply(htwt.data,mean,na.rm=TRUE)

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

##   SUBJECT       SEX    WEIGHT    HEIGHT 
##   8.50000        NA 139.62500  64.10625
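One way to avoid the warning is to take means of the numeric columns only. The sketch below rebuilds the data frame from the values printed in this activity, under the hypothetical name htwt, so that it runs without the .csv file.

```r
# Class data rebuilt from the printed head() and tail() output.
htwt <- data.frame(
  SUBJECT = 1:16,
  SEX     = factor(c("F", "F", "M", "F", "M", "F", "F", "M",
                     "M", "F", "M", "M", "F", "M", "F", "M")),
  WEIGHT  = c(101, 112, 200, 110, 123, 125, 129, 165,
              185, 100, 201, 160, 94, 176, 95, 158),
  HEIGHT  = c(61, 61, 68, 59, 65, 62, 62, 65,
              73.2, 65, 70, 65, 58, 66, 61, 64.5)
)

num_cols <- sapply(htwt, is.numeric)                   # TRUE for numeric columns
means <- sapply(htwt[num_cols], mean, na.rm = TRUE)    # SEX is skipped
means
```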

SUMMARY STATISTICS BY GROUP USING THE PSYCH PACKAGE

A simple way of generating summary statistics by grouping variable is available in the psych package. To set it up, run install.packages("psych") and install.packages("Hmisc") once, then load the package with library("psych") to generate summary statistics.

codes

library("psych")
describe(htwt.data)

##         vars  n   mean    sd median trimmed   mad min   max range skew
## SUBJECT    1 16   8.50  4.76   8.50    8.50  5.93   1  16.0  15.0 0.00
## SEX*       2 16   1.50  0.52   1.50    1.50  0.74   1   2.0   1.0 0.00
## WEIGHT     3 16 139.62 37.92 127.00  138.50 46.70  94 201.0 107.0 0.30
## HEIGHT     4 16  64.11  4.01  64.75   63.89  4.45  58  73.2  15.2 0.54
##         kurtosis   se
## SUBJECT    -1.43 1.19
## SEX*       -2.12 0.13
## WEIGHT     -1.53 9.48
## HEIGHT     -0.41 1.00

describeBy(htwt.data,SEX,skew=FALSE,ranges=FALSE)

## 
##  Descriptive statistics by group 
## group: F
##         vars n   mean    sd   se
## SUBJECT    1 8   7.25  5.06 1.79
## SEX*       2 8   1.00  0.00 0.00
## WEIGHT     3 8 108.25 13.24 4.68
## HEIGHT     4 8  61.12  2.10 0.74
## -------------------------------------------------------- 
## group: M
##         vars n   mean    sd   se
## SUBJECT    1 8   9.75  4.40 1.56
## SEX*       2 8   2.00  0.00 0.00
## WEIGHT     3 8 171.00 25.61 9.06
## HEIGHT     4 8  67.09  3.11 1.10
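If the psych package is not available, similar group means and standard deviations can be obtained with base R. This is an alternative technique, not what describeBy() does internally; the data frame htwt is rebuilt from the printed class data and its name is hypothetical.

```r
# SEX, WEIGHT and HEIGHT rebuilt from the printed class data.
htwt <- data.frame(
  SEX    = factor(c("F", "F", "M", "F", "M", "F", "F", "M",
                    "M", "F", "M", "M", "F", "M", "F", "M")),
  WEIGHT = c(101, 112, 200, 110, 123, 125, 129, 165,
             185, 100, 201, 160, 94, 176, 95, 158),
  HEIGHT = c(61, 61, 68, 59, 65, 62, 62, 65,
             73.2, 65, 70, 65, 58, 66, 61, 64.5)
)

# One row per level of SEX, one column per summarized variable.
group_means <- aggregate(cbind(WEIGHT, HEIGHT) ~ SEX, data = htwt, FUN = mean)
group_sds   <- aggregate(cbind(WEIGHT, HEIGHT) ~ SEX, data = htwt, FUN = sd)

group_means   # WEIGHT: 108.25 (F), 171 (M); HEIGHT: 61.125 (F), 67.0875 (M)
group_sds
```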

The code below shows pairwise scatter plots of all variables, including histograms, locally smoothed regressions, and Pearson correlations.

code

pairs.panels(htwt.data)
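The Pearson correlation shown in the panels can also be computed directly with base R. A minimal sketch for WEIGHT and HEIGHT, with values copied from the printed class data and illustrative object names:

```r
# WEIGHT (lbs.) and HEIGHT (inches) copied from the printed class data.
weight <- c(101, 112, 200, 110, 123, 125, 129, 165,
            185, 100, 201, 160, 94, 176, 95, 158)
height <- c(61, 61, 68, 59, 65, 62, 62, 65,
            73.2, 65, 70, 65, 58, 66, 61, 64.5)

r <- cor(weight, height, method = "pearson")   # Pearson correlation coefficient
r

# cor.test() adds a p-value and a confidence interval for the same coefficient.
ct <- cor.test(weight, height)
ct$conf.int
```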

UNIVARIATE NORMALITY

In this activity, we evaluated the normality of a variable with a Q-Q (quantile-quantile) plot using the codes below:

codes

qqnorm(HEIGHT,
       main = "Normal Q-Q Plot of Height of Biostat students",
       xlab="Theoretical Quantiles of the Height of Biostat students",
       ylab = "Sample Quantiles of the Height of Biostat students")

qqline(HEIGHT)

UNIVARIATE NORMALITY: Shapiro-Wilk (SW), Kolmogorov-Smirnov (KS)

In this section, we perform a Shapiro-Wilk (SW) test of normality using the code below:

code

shapiro.test(htwt.data$HEIGHT)

## 
##  Shapiro-Wilk normality test
## 
## data:  htwt.data$HEIGHT
## W = 0.94823, p-value = 0.4622

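This section shows the code for normality tests using the Kolmogorov-Smirnov (KS) test. Note that the arguments mean=HEIGHT[1], sd=HEIGHT[2] (and likewise for WEIGHT) pass the first two observations of the variable as the mean and standard deviation of the reference normal distribution, not the sample mean and standard deviation. Similarly, mean=0 and sd=1 compare the raw data with a standard normal distribution. The small p-values below therefore reflect a poorly matched reference distribution rather than evidence against normality.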

code

ks.test(htwt.data$HEIGHT,"pnorm", mean=HEIGHT[1], sd=HEIGHT[2])

## Warning in ks.test(htwt.data$HEIGHT, "pnorm", mean = HEIGHT[1], sd =
## HEIGHT[2]): ties should not be present for the Kolmogorov-Smirnov test

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  htwt.data$HEIGHT
## D = 0.48039, p-value = 0.001241
## alternative hypothesis: two-sided

ks.test(htwt.data$HEIGHT,"pnorm", mean=0, sd=1)

## Warning in ks.test(htwt.data$HEIGHT, "pnorm", mean = 0, sd = 1): ties
## should not be present for the Kolmogorov-Smirnov test

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  htwt.data$HEIGHT
## D = 1, p-value = 2.531e-14
## alternative hypothesis: two-sided

ks.test(htwt.data$WEIGHT,"pnorm", mean=WEIGHT[1], sd=WEIGHT[2])

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  htwt.data$WEIGHT
## D = 0.47508, p-value = 0.0007725
## alternative hypothesis: two-sided

ks.test(htwt.data$WEIGHT,"pnorm", mean=0, sd=1)

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  htwt.data$WEIGHT
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

Description

Both SW and KS tests use the same hypotheses:

H0: There is no difference between the distribution of the data set and a normal distribution.
HA: There is a difference between the distribution of the data set and a normal distribution.

R provides the p-value; if it is below 0.05, reject H0. For HEIGHT, the SW p-value of 0.4622 is above 0.05, so H0 is not rejected.
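For the KS test to compare the data with a normal distribution of matching location and spread, the sample mean and standard deviation can be supplied as the reference parameters. This is a sketch, not the original activity's code. One caveat: when the parameters are estimated from the same data, the standard KS p-value is only approximate, and R will still warn about ties in HEIGHT.

```r
# HEIGHT values (inches) copied from the printed class data.
height <- c(61, 61, 68, 59, 65, 62, 62, 65, 73.2, 65, 70, 65, 58, 66, 61, 64.5)

# Shapiro-Wilk test: needs no reference parameters.
sw <- shapiro.test(height)
sw$p.value

# KS test against a normal with the sample mean and sd of the data.
ks_fit <- ks.test(height, "pnorm", mean = mean(height), sd = sd(height))
ks_fit$statistic   # largest gap between the empirical and the normal CDF
ks_fit$p.value
```

With a matched reference distribution, the KS statistic D is much smaller than the values obtained with mean=HEIGHT[1], sd=HEIGHT[2] or with mean=0, sd=1.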

CREATING SIMPLE BAR PLOT WITH LABELS

SEXcounts<-table(htwt.data$SEX)
SEXcounts

## 
## F M 
## 8 8

barplot(SEXcounts, main="Number of Males & Females",xlab="SEX",ylab="N")

CREATING GRAPHS

In this section, a boxplot was used to compare HEIGHT between male and female using the code below:

code

boxplot(htwt.data$HEIGHT~SEX, main="HEIGHT by SEX",
        ylab="HEIGHT(in)",
        xlab = "SEX")

Run the code below to create a boxplot comparing WEIGHT between male and female.

code

boxplot(htwt.data$WEIGHT~SEX, main="WEIGHT by SEX",
        ylab = "WEIGHT(lb)",
        xlab="SEX")

The code below creates boxplots of HEIGHT and WEIGHT side by side, with males and females combined.

code

boxplot(htwt.data$HEIGHT,htwt.data$WEIGHT)
title("HEIGHTs(1) and WEIGHTs(2), GENDER Combined",
      xlab = "GENDER",
      ylab = "Ht and Wt")

HISTOGRAM

hist(htwt.data$HEIGHT, main="Histogram of HEIGHT", xlab="HEIGHT")

hist(htwt.data$WEIGHT, main = "Histogram of WEIGHT", xlab="WEIGHT")