Creating Variables and Basic Statistics
In this lab, we (1) create variables and (2) display basic descriptive statistics in the entire dataset and in subsets.
Relevant functions: read.csv(), table(), mean(), median(), sd(), aggregate(), group_by(), summarize_at()
1. Importing Data
We begin by loading the dataset named “Flower_Data.csv”, which is available on OWL (Resources -> Exercises -> Week 3). Please don’t use the “import dataset” function in the environment window. The script alone should be sufficient to replicate your work.
We read the CSV file using the read.csv() function. The following code will work if the dataset and your R script are located within the same folder. As an alternative, you can skip the setwd() function and set the full path to your file directly within the read.csv() function (e.g., read.csv("/Users/evelynebrie/Desktop/myFolder/Flower_Data.csv")).
# Setting the folder where the script is as our working directory (requires RStudio and the rstudioapi package)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# Loading the dataset
FlowerData <- read.csv("Flower_Data.csv")
The data frame should now appear within your environment (upper right window). Let’s take a look at its content.
# Looking at the dimensions of the dataset
dim(FlowerData)
## [1] 150 5
# Printing the column names of the dataset
colnames(FlowerData)
## [1] "sepal.length" "sepal.width" "petal.length" "petal.width" "species"
# Printing the first 10 observations of the dataset
head(FlowerData,10)
##    sepal.length sepal.width petal.length petal.width     species
## 1           5.1         3.5          1.4         0.2 Iris-setosa
## 2           4.9         3.0          1.4         0.2 Iris-setosa
## 3           4.7         3.2          1.3         0.2 Iris-setosa
## 4           4.6         3.1          1.5         0.2 Iris-setosa
## 5           5.0         3.6          1.4         0.2 Iris-setosa
## 6           5.4         3.9          1.7         0.4 Iris-setosa
## 7           4.6         3.4          1.4         0.3 Iris-setosa
## 8           5.0         3.4          1.5         0.2 Iris-setosa
## 9           4.4         2.9          1.4         0.2 Iris-setosa
## 10          4.9         3.1          1.5         0.1 Iris-setosa
The sepals are the leaves protecting the petals. The petals are the leaves used to attract pollinators. Petals are usually located closer to the inside of the flower, and sepals are located more to the outside of the flower.
2. Creating Variables
2.1 Creating Dummy Variables
Let’s say we want to create a new indicator (or dummy) variable called longSepal which takes a value of 1 every time the sepal.length variable has a value higher than 6, and takes a value of 0 otherwise.
# Create a new variable called longSepal
FlowerData$longSepal<- NA
FlowerData$longSepal[FlowerData$sepal.length>6] <- 1
FlowerData$longSepal[FlowerData$sepal.length<=6] <- 0
# Sanity Check
table(FlowerData$longSepal)
##
## 0 1
## 89 61
# Printing out that vector
FlowerData$longSepal
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 1
## [75] 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 1 0 1 1 1 1
## [112] 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1
## [149] 1 0
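The same indicator can also be built in a single step, because a logical comparison converts directly to 1 and 0. The sketch below is only an illustration: it uses R's built-in iris data frame as a stand-in for FlowerData (an assumption; its columns are named Sepal.Length, Species, etc., and its values may differ slightly from Flower_Data.csv), so the counts need not match the ones above.

```r
# Stand-in data: R's built-in iris data frame (assumption, not Flower_Data.csv)
flowers <- iris

# Three-step approach used in this lab
flowers$longSepal <- NA
flowers$longSepal[flowers$Sepal.Length > 6] <- 1
flowers$longSepal[flowers$Sepal.Length <= 6] <- 0

# One-step alternatives
flowers$longSepal_ifelse <- ifelse(flowers$Sepal.Length > 6, 1, 0)
flowers$longSepal_num <- as.numeric(flowers$Sepal.Length > 6)

# Sanity check: all three versions should agree
all(flowers$longSepal == flowers$longSepal_ifelse)
all(flowers$longSepal == flowers$longSepal_num)
```

The three-step version makes any observation that matches neither condition (for example a missing sepal length) stay visibly NA, which is why it is a reasonable default when learning.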
Exercise 1
Create a numeric variable called shortSepal which takes a value of 1 every time the sepal.length variable has a value smaller than 5, and takes a value of 0 otherwise.
2.2 Creating Categorical Variables
Let’s say we want to create a new categorical variable called petalSize which takes a value of “Small” every time the petal.width variable has a value of 1 or smaller, takes a value of “Medium” every time that variable has a value higher than 1 and no larger than 2, and takes a value of “Large” every time that variable has a value higher than 2.
# Create a new empty variable called petalSize
FlowerData$petalSize<- NA
FlowerData$petalSize[FlowerData$petal.width<=1] <- "Small"
FlowerData$petalSize[FlowerData$petal.width>1 & FlowerData$petal.width<=2] <- "Medium"
FlowerData$petalSize[FlowerData$petal.width>2] <- "Large"
# Sanity Check
table(FlowerData$petalSize)
##
## Large Medium Small
## 23 70 57
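As a possible shortcut, base R's cut() function can assign all three categories at once. The sketch below again uses the built-in iris data frame as a stand-in for FlowerData (an assumption; column names differ from Flower_Data.csv) and checks that cut() reproduces the step-by-step recoding.

```r
# Stand-in data: R's built-in iris data frame (assumption, not Flower_Data.csv)
flowers <- iris

# Step-by-step recoding, as in this lab
flowers$petalSize <- NA
flowers$petalSize[flowers$Petal.Width <= 1] <- "Small"
flowers$petalSize[flowers$Petal.Width > 1 & flowers$Petal.Width <= 2] <- "Medium"
flowers$petalSize[flowers$Petal.Width > 2] <- "Large"

# cut() intervals are right-closed by default: (-Inf, 1], (1, 2], (2, Inf]
flowers$petalSize_cut <- cut(flowers$Petal.Width,
                             breaks = c(-Inf, 1, 2, Inf),
                             labels = c("Small", "Medium", "Large"))

# Sanity check: cross-tabulate the two versions
table(flowers$petalSize, flowers$petalSize_cut)
```

Note that cut() returns a factor whose levels follow the order of the labels (Small, Medium, Large), whereas table() on a character variable sorts the categories alphabetically.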
2.3 Creating a Numeric Variable
Let’s say we want to create a new numeric variable called ps_ratio which holds, for each flower, the ratio of its petal length to its sepal length.
# Create a new empty variable called ps_ratio
FlowerData$ps_ratio <- NA
FlowerData$ps_ratio <- FlowerData$petal.length/FlowerData$sepal.length
# Sanity Check
summary(FlowerData$ps_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2069 0.3148 0.7089 0.6182 0.8127 0.9524
# Printing three variables only (first 10 observations only)
FlowerData[1:10,c("petal.length","sepal.length","ps_ratio")]
##    petal.length sepal.length  ps_ratio
## 1           1.4          5.1 0.2745098
## 2           1.4          4.9 0.2857143
## 3           1.3          4.7 0.2765957
## 4           1.5          4.6 0.3260870
## 5           1.4          5.0 0.2800000
## 6           1.7          5.4 0.3148148
## 7           1.4          4.6 0.3043478
## 8           1.5          5.0 0.3000000
## 9           1.4          4.4 0.3181818
## 10          1.5          4.9 0.3061224
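To see what the division does row by row, here is a small sketch on a toy data frame (the name toy is hypothetical) that contains only the first three observations printed above. The division is vectorized, so each petal length is divided by the sepal length on the same row.

```r
# Toy data frame with the first three observations shown above
toy <- data.frame(petal.length = c(1.4, 1.4, 1.3),
                  sepal.length = c(5.1, 4.9, 4.7))

# Element-wise division: one ratio per row
toy$ps_ratio <- toy$petal.length / toy$sepal.length

# Rounded to 7 decimals to compare with the printed output
round(toy$ps_ratio, 7)
## [1] 0.2745098 0.2857143 0.2765957
```

Because the assignment fills the entire column at once, creating the column with NA first is optional in this case.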
3. Basic Descriptive Statistics
3.1 Full Dataset
Imagine that you are a biologist interested in the length of petals. You might want to check out the mean(), median() and sd() of petal lengths within this dataset.
# Calculating the mean
mean(FlowerData$petal.length)
## [1] 3.758667
# Calculating the median
median(FlowerData$petal.length)
## [1] 4.35
# Calculating the standard deviation
sd(FlowerData$petal.length)
## [1] 1.76442
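To make clear what sd() computes, the sketch below reproduces the sample standard deviation by hand on a short toy vector (the first five petal lengths printed earlier). It also shows the na.rm argument, which matters as soon as a variable contains missing values; the vector names are hypothetical.

```r
# First five petal lengths from the output shown earlier
x <- c(1.4, 1.4, 1.3, 1.5, 1.4)

# Sample standard deviation by hand: divide by n - 1, not n
n <- length(x)
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))
all.equal(manual_sd, sd(x))

# With a missing value, mean() returns NA unless we drop it explicitly
x_na <- c(x, NA)
mean(x_na)
mean(x_na, na.rm = TRUE)
```

The median() and sd() functions accept the same na.rm = TRUE argument.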
3.2 Subsets
Now, imagine that you are interested in the characteristics by species of iris: either iris setosa, iris versicolor or iris virginica. In other words, you want to see if the species of an iris flower (independent variable) has an impact on the characteristics of its petals and sepals (dependent variables).
table(FlowerData$species)
##
## Iris-setosa Iris-versicolor Iris-virginica
## 50 50 50
You can calculate basic statistics within each of these subgroups in several ways. Here is one way to do it using the tidyverse grammar.
library(dplyr)
FlowerData %>%
  group_by(species) %>%
  summarize_at(vars(petal.length, petal.width),
               .funs = mean)
## # A tibble: 3 × 3
##   species         petal.length petal.width
##   <chr>                  <dbl>       <dbl>
## 1 Iris-setosa             1.46       0.244
## 2 Iris-versicolor         4.26       1.33
## 3 Iris-virginica          5.55       2.03
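In recent versions of dplyr, summarize_at() still works but is marked as superseded in favour of across(). The sketch below shows one possible equivalent that also computes several statistics at once. It assumes dplyr 1.0.0 or later is installed, and it uses the built-in iris data frame as a stand-in for FlowerData (an assumption; column names differ from Flower_Data.csv).

```r
library(dplyr)

# Stand-in data: R's built-in iris data frame (assumption, not Flower_Data.csv)
by_species <- iris %>%
  group_by(Species) %>%
  summarise(across(c(Petal.Length, Petal.Width),
                   list(mean = mean, sd = sd)))

# One row per species; columns are named <variable>_<statistic>
by_species
```

Passing a named list of functions produces one column per variable and statistic, such as Petal.Length_mean and Petal.Width_sd.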
But you could also do this in base R in the following way.
aggregate(petal.width ~ species, data=FlowerData, FUN=mean, na.action = na.omit)
##           species petal.width
## 1     Iris-setosa       0.244
## 2 Iris-versicolor       1.326
## 3  Iris-virginica       2.026
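To summarize more than one variable with aggregate(), one option is to wrap the variables in cbind() on the left-hand side of the formula. For a single variable, tapply() is another base R route. Both are sketched below on the built-in iris data frame, used here as a stand-in for FlowerData (an assumption; column names differ from Flower_Data.csv).

```r
# Mean of two variables by species, using cbind() in the formula
means <- aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
                   data = iris, FUN = mean)
means

# Standard deviation of one variable by species, using tapply()
sds <- tapply(iris$Petal.Width, iris$Species, sd)
sds
```

aggregate() returns a data frame with one row per group, while tapply() returns a named vector, which can be handier for quick look-ups such as sds["setosa"].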
And you could even subset the statistics to a specific part of the data, for instance only to flowers with a petal-sepal ratio higher than 0.3.
aggregate(petal.width ~ species, data=FlowerData, FUN=mean, na.action = na.omit, subset = ps_ratio>0.3)
##           species petal.width
## 1     Iris-setosa   0.2578947
## 2 Iris-versicolor   1.3260000
## 3  Iris-virginica   2.0260000
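The subset argument is a convenience: filtering the rows of the data frame first and then aggregating should give the same group means. The sketch below checks this on the built-in iris data frame, used as a stand-in for FlowerData (an assumption; column names differ from Flower_Data.csv).

```r
# Stand-in data: R's built-in iris data frame (assumption, not Flower_Data.csv)
flowers <- iris
flowers$ps_ratio <- flowers$Petal.Length / flowers$Sepal.Length

# Option 1: let aggregate() do the filtering through its subset argument
via_arg <- aggregate(Petal.Width ~ Species, data = flowers,
                     FUN = mean, subset = ps_ratio > 0.3)

# Option 2: filter the rows first with logical indexing, then aggregate
kept <- flowers[flowers$ps_ratio > 0.3, ]
via_filter <- aggregate(Petal.Width ~ Species, data = kept, FUN = mean)

# Sanity check: both options give the same group means
all.equal(via_arg$Petal.Width, via_filter$Petal.Width)
```

Filtering first is more verbose, but the intermediate data frame (kept) can be inspected with dim() or table() to confirm how many observations remain in each group.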
Exercise 2
Calculate the standard deviation of sepal length, but only for the iris virginica.
Once you’re done, the value returned by sd() should match the following.
answer # I stored the answer in a numeric element called "answer"
## [1] 0.6358796