POLISCI 3325G - Data Science for Political Science

Evelyne Brie

Winter 2023

Creating Variables and Basic Statistics

In this lab, we (1) create variables and (2) display basic descriptive statistics in the entire dataset and in subsets.

Relevant functions: read.csv(), table(), mean(), median(), sd(), aggregate(), group_by(), summarize_at()

1. Importing Data

We begin by loading the dataset named “Flower_Data.csv”, which is available on OWL (Resources -> Exercises -> Week 3). Please don’t use the “import dataset” function in the environment window. The script alone should be sufficient to replicate your work.

We input the CSV file using the read.csv() command. The following code will work if the dataset and your R script are located within the same folder.

As an alternative, you can skip the setwd() function and set the path to your file directly within the read.csv() function (e.g., read.csv("/Users/evelynebrie/Desktop/myFolder/Flower_Data.csv")).

# Setting the folder where the script is as our working directory
# (this line requires RStudio and the rstudioapi package)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

# Loading the dataset
FlowerData <- read.csv("Flower_Data.csv")

The data frame should now appear within your environment (upper right window). Let’s take a look at its content.

# Looking at the dimensions of the dataset
dim(FlowerData)
## [1] 150   5
# Printing the column names of the dataset 
colnames(FlowerData)
## [1] "sepal.length" "sepal.width"  "petal.length" "petal.width"  "species"
# Printing the first 10 observations of the dataset
head(FlowerData,10)
##    sepal.length sepal.width petal.length petal.width     species
## 1           5.1         3.5          1.4         0.2 Iris-setosa
## 2           4.9         3.0          1.4         0.2 Iris-setosa
## 3           4.7         3.2          1.3         0.2 Iris-setosa
## 4           4.6         3.1          1.5         0.2 Iris-setosa
## 5           5.0         3.6          1.4         0.2 Iris-setosa
## 6           5.4         3.9          1.7         0.4 Iris-setosa
## 7           4.6         3.4          1.4         0.3 Iris-setosa
## 8           5.0         3.4          1.5         0.2 Iris-setosa
## 9           4.4         2.9          1.4         0.2 Iris-setosa
## 10          4.9         3.1          1.5         0.1 Iris-setosa

The sepals are the leaf-like structures that protect the petals, and the petals are the structures that attract pollinators. Petals usually sit closer to the center of the flower, while sepals sit toward the outside.

 

2. Creating Variables

2.1 Creating Dummy Variables

Let’s say we want to create a new indicator (or dummy) variable called longSepal which takes a value of 1 every time the sepal.length variable has a value higher than 6, and takes a value of 0 otherwise.

# Create a new variable called longSepal
FlowerData$longSepal <- NA
FlowerData$longSepal[FlowerData$sepal.length > 6] <- 1
FlowerData$longSepal[FlowerData$sepal.length <= 6] <- 0

# Sanity Check
table(FlowerData$longSepal)
## 
##  0  1 
## 89 61
# Printing out that vector
FlowerData$longSepal
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 1
##  [75] 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 1 0 1 1 1 1
## [112] 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1
## [149] 1 0
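
The same kind of dummy can also be built in a single step. The sketch below is only an illustration: the data frame `toy` and its values are made up (they are not taken from Flower_Data.csv), and the column `longSepal2` is a hypothetical name used to compare the two approaches.

```r
# Toy data frame standing in for FlowerData (made-up values, including one NA)
toy <- data.frame(sepal.length = c(5.1, 6.0, 6.3, 7.2, NA))

# ifelse() gives 1 where the condition is TRUE, 0 where it is FALSE,
# and NA where sepal.length is missing
toy$longSepal <- ifelse(toy$sepal.length > 6, 1, 0)

# Shortcut: convert the logical comparison directly to numeric (TRUE -> 1, FALSE -> 0)
toy$longSepal2 <- as.numeric(toy$sepal.length > 6)

toy
```

Both columns contain the same values. The three-line version used above is longer, but it makes each condition explicit, which helps when checking your work.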
 

Exercise 1

Create a numeric variable called shortSepal which takes a value of 1 every time the sepal.length variable has a value smaller than 5, and takes a value of 0 otherwise.

2.2 Creating Categorical Variables

Let’s say we want to create a new categorical variable called petalSize which takes a value of “Small” every time the petal.width variable has a value of 1 or smaller, takes a value of “Medium” every time that variable has a value higher than 1 and no greater than 2, and takes a value of “Large” every time that variable has a value higher than 2.

# Create a new empty variable called petalSize
FlowerData$petalSize <- NA

FlowerData$petalSize[FlowerData$petal.width <= 1] <- "Small"
FlowerData$petalSize[FlowerData$petal.width > 1 & FlowerData$petal.width <= 2] <- "Medium"
FlowerData$petalSize[FlowerData$petal.width > 2] <- "Large"

# Sanity Check
table(FlowerData$petalSize)
## 
##  Large Medium  Small 
##     23     70     57
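
As a possible alternative, base R's cut() can assign the three categories in one call. This is a minimal sketch on a made-up vector of petal widths chosen to sit on and around the cutoffs; it assumes the same boundaries as the code above (1 and 2, each included in the lower category).

```r
# Made-up petal widths (not from Flower_Data.csv), including the boundary values 1 and 2
petal.width <- c(0.2, 1.0, 1.5, 2.0, 2.3)

# With right = TRUE (the default), each interval is closed on the right:
# (-Inf, 1] -> "Small", (1, 2] -> "Medium", (2, Inf] -> "Large"
petalSize <- cut(petal.width,
                 breaks = c(-Inf, 1, 2, Inf),
                 labels = c("Small", "Medium", "Large"))

# cut() returns a factor; as.character() converts it to plain text
data.frame(petal.width, petalSize = as.character(petalSize))
```

Because cut() returns a factor with levels in the order given, table() would list the categories as Small, Medium, Large rather than alphabetically.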

2.3 Creating a Numeric Variable

Let’s say we want to create a new numeric variable called ps_ratio which takes the value of the ratio of each flower’s petal length to its sepal length (petal.length divided by sepal.length).

# Create a new empty variable called ps_ratio
FlowerData$ps_ratio <- NA

# Fill it with the petal length divided by the sepal length
FlowerData$ps_ratio <- FlowerData$petal.length / FlowerData$sepal.length

# Sanity Check
summary(FlowerData$ps_ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2069  0.3148  0.7089  0.6182  0.8127  0.9524
# Printing three variables only (first 10 observations only)
FlowerData[1:10,c("petal.length","sepal.length","ps_ratio")]
##    petal.length sepal.length  ps_ratio
## 1           1.4          5.1 0.2745098
## 2           1.4          4.9 0.2857143
## 3           1.3          4.7 0.2765957
## 4           1.5          4.6 0.3260870
## 5           1.4          5.0 0.2800000
## 6           1.7          5.4 0.3148148
## 7           1.4          4.6 0.3043478
## 8           1.5          5.0 0.3000000
## 9           1.4          4.4 0.3181818
## 10          1.5          4.9 0.3061224
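
The division is applied element by element, so each flower's petal length is divided by its own sepal length. As a quick check, the sketch below recomputes the ratio for the first three flowers shown above; the vectors are typed in by hand rather than read from the dataset.

```r
# Petal and sepal lengths of the first three observations, copied from the printout above
petal.length <- c(1.4, 1.4, 1.3)
sepal.length <- c(5.1, 4.9, 4.7)

# Element-wise division: first petal by first sepal, second by second, and so on
ps_ratio <- petal.length / sepal.length

# Rounded to 7 decimals for comparison with the printed ps_ratio column
round(ps_ratio, 7)
```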

3. Basic Descriptive Statistics

3.1 Full Dataset

Imagine that you are a biologist interested in the length of petals. You might want to compute the mean, median and standard deviation of petal lengths in this dataset, using mean(), median() and sd().

# Calculating the mean
mean(FlowerData$petal.length)
## [1] 3.758667
# Calculating the median
median(FlowerData$petal.length)
## [1] 4.35
# Calculating the standard deviation
sd(FlowerData$petal.length)
## [1] 1.76442
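
These three functions return NA as soon as the vector contains a missing value. The following sketch, on a made-up vector `x` (not taken from the dataset), shows how the `na.rm = TRUE` argument drops missing values before computing the statistic.

```r
# Made-up vector with one missing value
x <- c(1.4, 1.5, NA, 4.7, 5.1)

# Without na.rm, the result is NA
mean(x)

# With na.rm = TRUE, the statistic is computed on the four non-missing values
mean_x   <- mean(x, na.rm = TRUE)
median_x <- median(x, na.rm = TRUE)
sd_x     <- sd(x, na.rm = TRUE)

c(mean = mean_x, median = median_x, sd = sd_x)
```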

3.2 Subsets

Now, imagine that you are interested in flower characteristics by species of iris: Iris setosa, Iris versicolor or Iris virginica. In other words, you want to see if the species of an iris flower (independent variable) has an impact on the characteristics of its petals and sepals (dependent variables).

table(FlowerData$species)
## 
##     Iris-setosa Iris-versicolor  Iris-virginica 
##              50              50              50

You can compute any basic statistic within these subgroups in several ways. Here is one way to do it with the tidyverse grammar (the dplyr package).

library(dplyr)
FlowerData %>%
  group_by(species) %>%
  summarize_at(vars(petal.length,petal.width),
               .funs=mean)
## # A tibble: 3 × 3
##   species         petal.length petal.width
##   <chr>                  <dbl>       <dbl>
## 1 Iris-setosa             1.46       0.244
## 2 Iris-versicolor         4.26       1.33 
## 3 Iris-virginica          5.55       2.03
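
If you want several statistics at once, one option in recent versions of dplyr (1.0.0 or later) is summarize() combined with across(). The sketch below is an illustration only: the data frame `toy`, its species labels "A" and "B", and its values are made up.

```r
library(dplyr)

# Made-up data with two groups of two flowers each
toy <- data.frame(
  species      = c("A", "A", "B", "B"),
  petal.length = c(1, 3, 4, 6),
  petal.width  = c(0.2, 0.4, 1.0, 2.0)
)

# Mean and standard deviation of both petal variables, by species.
# The named list produces columns such as petal.length_mean and petal.length_sd.
res <- toy %>%
  group_by(species) %>%
  summarize(across(c(petal.length, petal.width),
                   list(mean = mean, sd = sd)))

res
```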

But you could also do this in base R in the following way.

aggregate(petal.width ~ species, data=FlowerData, FUN=mean, na.action = na.omit)
##           species petal.width
## 1     Iris-setosa       0.244
## 2 Iris-versicolor       1.326
## 3  Iris-virginica       2.026

You can also restrict the statistics to a specific part of the data, for instance to flowers with a petal-to-sepal length ratio (ps_ratio) higher than 0.3.

aggregate(petal.width ~ species, data=FlowerData, FUN=mean, na.action = na.omit, subset = ps_ratio>0.3)
##           species petal.width
## 1     Iris-setosa   0.2578947
## 2 Iris-versicolor   1.3260000
## 3  Iris-virginica   2.0260000
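
The subset argument should give the same result as filtering the rows first and then aggregating. The sketch below checks this on a made-up data frame `toy` (hypothetical values and species labels, not the lab dataset).

```r
# Made-up data: two species, one flower per species below the 0.3 cutoff or above it
toy <- data.frame(
  species     = c("A", "A", "B", "B"),
  petal.width = c(0.2, 0.4, 1.0, 2.0),
  ps_ratio    = c(0.25, 0.35, 0.70, 0.80)
)

# Option 1: let aggregate() filter the rows through its subset argument
res1 <- aggregate(petal.width ~ species, data = toy, FUN = mean,
                  subset = ps_ratio > 0.3)

# Option 2: keep only the rows of interest, then aggregate
kept <- toy[toy$ps_ratio > 0.3, ]
res2 <- aggregate(petal.width ~ species, data = kept, FUN = mean)

res1
res2
```

In both cases the first flower of species "A" is dropped before the mean is computed, so the two tables contain the same group means.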
 

Exercise 2

Calculate the standard deviation of sepal length, but only for the Iris virginica flowers.

Once you’re done, your result should match the following.

answer # I stored the answer in a numeric element called "answer"
## [1] 0.6358796