This chapter covers basic programming concepts and setting up your R session: setting the working directory, installing packages, importing and exporting data, and viewing data.
This section will provide a brief overview of the most fundamental programming concepts, which is a good place to start if you’ve never used or have forgotten how to use R.
Everything is an object in R. For simplicity’s sake, we’ll say that an object is a name given to something so we can reference it later. Highways are given numbers so they can be referenced; for example, Interstate 15. In much the same way, objects are names used to reference a value or a set of values.
y <- 2 # This symbol "<-" is how you assign a value to an object
3 -> x # This way "->" works equally well
R has 5 basic data types, which are known as classes, and to manipulate your data it is a good idea to have a basic understanding of these classes.
0.5, 2 ## 1. Numeric class
TRUE, FALSE ## 2. Logical class
"hat", "dog" ## 3. Character class
5L, 12L ## 4. Integer class ("L" stores this as integer)
1+0i, 2+4i ## 5. Complex class
These classes are the basic building blocks, and all values in R belong to one of these classes. Multiple values can be concatenated together and assigned to an object:
## [1] 1 2 3 4 5
In the above example, numeric values were concatenated together and assigned to the object y. Concatenating values together and assigning them to one object creates a data structure.
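The call that produced this output is not shown. A minimal sketch, assuming the values were combined with c() and stored in y as the text describes:

```r
# Combine five numeric values into one object with c()
y <- c(1, 2, 3, 4, 5)
y           # prints the values stored in y
class(y)    # class shared by every value in y
length(y)   # number of values in y
```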
R has many data structures. The most relevant data structures include:
Vectors: all values within a vector must be of the same class. In this example, all values are of the numeric class.
## [1] 3 23 56 12 54
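A vector like the one printed above could be created as follows. The object names x and mixed are assumptions made for illustration; the second half shows what the "same class" rule means in practice.

```r
# A numeric vector (the name x is hypothetical)
x <- c(3, 23, 56, 12, 54)
x
class(x)

# Mixing classes makes R coerce every value to one common class
mixed <- c(3, "hat", TRUE)
mixed
class(mixed)
```

Because a vector can hold only one class, the number and the logical value in mixed are stored as character strings.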
Lists: contain values that can belong to any class. Lists can even contain other data structures, such as another list.
## [[1]]
## [1] 0.5
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] "hat"
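One way to build a list like the one printed above is sketched below. The names my_list and nested are hypothetical.

```r
# Each element of a list keeps its own class
my_list <- list(0.5, TRUE, "hat")
my_list
element_classes <- sapply(my_list, class)
element_classes

# A list can hold other data structures, including another list
nested <- list(1:3, my_list)
length(nested)
```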
Factors: look similar to vectors, except that they contain levels that group the data. In this example, a vector is first created and then converted to a factor.
## [1] 3 3 7 7 7 3 5 5 3 7
## Levels: 3 5 7
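A sketch of the two steps described above, first creating a vector and then converting it to a factor. The names v and f are assumptions.

```r
# Step 1: create a plain numeric vector
v <- c(3, 3, 7, 7, 7, 3, 5, 5, 3, 7)

# Step 2: convert it to a factor; the unique values become the levels
f <- factor(v)
f
levels(f)

# Count how many observations fall in each level
level_counts <- table(f)
level_counts
```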
Data frames: a special type of list where every element of the list has the same length. A data frame is usually what’s created when importing a data file into R.
## id x y
## 1 a 1 6
## 2 b 2 7
## 3 c 3 8
## 4 d 4 9
## 5 e 5 10
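A data frame matching the printed output could be built like this. The name df is hypothetical; the column names are taken from the output above.

```r
# Three columns of equal length: one character, two integer
df <- data.frame(id = c("a", "b", "c", "d", "e"),
                 x = 1:5,
                 y = 6:10)
df
dim(df)              # rows, then columns
sapply(df, length)   # every column has the same length
```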
Before you start importing files and coding, you need to set the working directory so that R looks for and saves files in the correct folder (which can be whatever you choose). You’ll want to set the working directory to wherever your data files are located.
getwd() # Shows your current working directory
setwd("~/Desktop/Code") # Sets the working directory to the listed file path
dir() # Shows files in the working directory
Install any packages you want to use in your code, such as the tidyverse package. The tidyverse package is a collection of packages that includes ggplot2, dplyr, tidyr, readr, and several others.
Packages only need to be installed once, but they need to be loaded every time you start an R session. Loading a package puts it into the working memory; that way R knows you want to use those functions from those packages. Requiring the user to load packages is a way for R to save memory, which allows for more efficient and faster processing.
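The install and load steps can be sketched as follows. The install line is commented out so the example does not download anything on its own, and the requireNamespace() check is an addition for illustration rather than something the original notebook necessarily used.

```r
# Install once per machine (uncomment to run; needs an internet connection)
# install.packages("tidyverse")

# Load at the start of every R session
pkg_available <- requireNamespace("tidyverse", quietly = TRUE)
if (pkg_available) {
  library(tidyverse)
} else {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") first")
}
```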
Make sure that your working directory is correctly set before attempting to import files. If your data files are contained within your working directory folder, you simply need to type the name of the data file with quotes around it:
1. Import a csv file
mydata <- read.csv("data.csv")
# If the data is contained within a subfolder of the working directory, you need to specify the file path when importing:
mydata <- read.csv("data_files/example.csv")
2. Import an Excel file
# The readxl package is needed to use this function
library(readxl)
mydata <- read_excel("data.xlsx")
3. Import a text file
# Reads a file in table format and creates a data frame from it
mydata <- read.table("file.txt", sep="\t")
Functions to export your data to a text or csv file:
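The export calls themselves are not shown in the text. A minimal sketch using base R's write.csv() and write.table(), with made-up example data written to a temporary folder so no real files are overwritten:

```r
# Hypothetical example data
mydata <- data.frame(id = c("a", "b", "c"), score = c(1.5, 2.5, 3.5))

# Build file paths inside R's temporary folder
csv_path <- file.path(tempdir(), "mydata.csv")
txt_path <- file.path(tempdir(), "mydata.txt")

# Comma-separated file; row.names = FALSE drops the 1, 2, 3 row labels
write.csv(mydata, csv_path, row.names = FALSE)

# Tab-delimited text file
write.table(mydata, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the csv back in to confirm the round trip
check <- read.csv(csv_path)
check
```

To write into your working directory instead, replace the temporary paths with a plain file name such as "mydata.csv".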
Listed below are a few useful functions for getting a feel for your data after it has been imported.
We’ll use iris, one of R’s built-in datasets, as an example.
The str() function shows the overall structure of the data, which consists of the data structure and the class of each variable, as well as the number of observations and variables. It also shows a preview of the data for each variable.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Here are some other useful functions to feel out your data:
head(iris) # First 6 rows of dataset
head(iris, n=3) # First 3 rows of dataset
head(iris, n= -50) # All rows but the last 50
tail(iris) # Last 6 rows
tail(iris, n=10) # Last 10 rows
tail(iris, n= -10) # All rows but the first 10
iris[1:10, ] # First 10 rows (brackets are used to subset)
iris[ ,1:2] # First two columns
iris[1:10,1:3] # First 10 rows of data of the first 3 columns
Here are some basic plotting functions to start exploring variables and the relationships between variables. More on plotting outliers and assessing normality will be covered in Chapter 4.
# Dollar sign "$" is a way to subset a dataframe by the column name
plot(iris$Sepal.Length, iris$Sepal.Width)
plot(iris$Species)
boxplot(iris$Sepal.Length, iris$Sepal.Width)
hist(iris$Sepal.Length)
The goal of this chapter is to help you compile your data into a single data file that can be manipulated in R, and to then clean that data and prepare it for data transformation.
Packages needed for this chapter:
Oftentimes, you will probably need to stitch together rows and columns from different data files. This can be accomplished using several different functions in R, including (but not limited to) cbind(), rbind(), and a custom loop function if more complex merging is required.
Below are examples of the cbind() and rbind() functions. As you can see, the data is added sequentially, in the order that the objects are listed in the function call. If you need to concatenate many files together, however, this is not a very efficient method. Another way to add data files together is by using a loop function.
example_data1 <- c(1:5)
example_data2 <- c(5:1)
example_data3 <- c(6:10)
cbind(example_data1, example_data2, example_data3)
## example_data1 example_data2 example_data3
## [1,] 1 5 6
## [2,] 2 4 7
## [3,] 3 3 8
## [4,] 4 2 9
## [5,] 5 1 10
rbind(example_data1, example_data2, example_data3)
## [,1] [,2] [,3] [,4] [,5]
## example_data1 1 2 3 4 5
## example_data2 5 4 3 2 1
## example_data3 6 7 8 9 10
Here is a custom loop function to compile all of the files within a directory into a single file, courtesy of Dr. Lohse. Keep in mind that this loop function is specific to the dataset that was used in this example (hence, custom loop function), meaning you cannot simply plug your own data into it. You will need to specify exactly how the data should be merged together.
# Set the working directory to the location of the files
setwd("~/Desktop/Code/KINES_6770/data/eeg_data")
# Create an object that references the list of files
file_list <- list.files()
file_list
for (file in file_list){
  if (!exists("MASTER")){
    # If the merged dataset doesn't exist, create it from the first file
    MASTER <- read.table(file, header=TRUE, sep=",")
    # Take the subject ID from the file name and make it a variable:
    MASTER$subID <- factor(substr(file, start = 1, stop = 3))
    # Extract the block from the file name and make it a variable:
    MASTER$block <- factor(substr(file, start = 25, stop = 27))
  } else {
    # If the merged dataset does exist, append to it
    # (using else keeps the first file from being appended a second time)
    # Create the temporary data file:
    temp_dataset <- read.table(file, header=TRUE, sep=",")
    # Take the subject ID from the file name and make it a variable:
    temp_dataset$subID <- factor(substr(file, start = 1, stop = 3))
    # Extract the block from the file name and make it a variable:
    temp_dataset$block <- factor(substr(file, start = 25, stop = 27))
    # Row bind the temporary data to the master data
    MASTER <- rbind(MASTER, temp_dataset)
    # Remove or "empty" the temporary data set
    rm(temp_dataset)
  }
  print(file)
}
The “tidyr” package is a great package for cleaning data. Only two of the functions will be discussed in this notebook: spread and gather. If you find yourself needing more data cleaning functions, take a look at R for Data Science by Hadley Wickham.
What makes a dataset tidy?
Problem: column names are not names of variables, but values of a variable.
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
Solution: Gather those columns into a new pair of variables
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
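The call that produced this output is not shown. A sketch of a gather() call that yields this shape, assuming the table4a example data that ships with tidyr (the name tidy4a is hypothetical):

```r
library(tidyr)   # provides gather() and the table4a example data
library(dplyr)   # provides the %>% pipe

# Gather the 1999 and 2000 columns into a year/cases pair of variables
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4a
```

Recent tidyr releases describe gather() as superseded by pivot_longer(), although gather() still works.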
# This "%>%" is called a pipe and it is used extensively in the tidyverse package. When you see the pipe, think of the word "then". For example, call on table4a, then use the gather function to manipulate the table.
Problem: an observation is scattered across multiple rows.
## # A tibble: 8 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
Solution: Create new variables and spread them across multiple columns
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
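A sketch of a spread() call that produces this result, assuming the table2 example data that ships with tidyr (the name tidy2 is hypothetical):

```r
library(tidyr)   # provides spread() and the table2 example data
library(dplyr)   # provides the %>% pipe

# Spread the values of "type" into their own columns, filled from "count"
tidy2 <- table2 %>%
  spread(key = type, value = count)
tidy2
```

Recent tidyr releases describe spread() as superseded by pivot_wider(), although spread() still works.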
It was discussed in Chapter 1 that there are distinct data types (classes) and data structures in R. You can only do certain manipulations to an R object depending on its class and structure. Sometimes you’ll need to convert the class or structure of an object so that you can manipulate it the way you want.
For example, here is the structure of the mtcars dataset:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The mtcars dataset is a data frame and all of its variables are numeric vectors. It makes sense to treat the cylinder variable as a factor, since there are only 3 possible values: 4, 6, and 8 cylinders. That way we can graph the data as a categorical variable rather than a continuous variable. This can be accomplished using the following function:
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8
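The conversion call is not shown above. A sketch using factor(), with hypothetical object names; as.factor() would work the same way here.

```r
# Convert the cyl column to a factor without changing mtcars itself
cyl_factor <- factor(mtcars$cyl)
cyl_factor
levels(cyl_factor)

# To keep the change, store it in a copy of the data frame
cars <- mtcars
cars$cyl <- factor(cars$cyl)
is.factor(cars$cyl)
```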
Here are some useful functions for checking and converting the class and structure of objects:
# Check to see what the object is
is.numeric()
is.character()
is.logical()
is.data.frame()
is.factor()
# Change the object if you need to
as.integer()
as.data.frame()
as.factor()
In Chapter 3 we’ll go over how to create new variables with the dplyr package. In contrast, this section will be about changing specific values within a given variable. For example, what if we had a numeric vector and wanted to recode all values below 5 to 0? Here’s an example of how that could be accomplished:
data <- 1:10 # Data is a vector containing the numbers 1-10
data[data < 5] <- 0 # Values < 5 will be assigned a new value, 0
data
## [1] 0 0 0 0 5 6 7 8 9 10
Or, let’s say that all values below 10 should be assigned a missing value, NA.
## [1] NA NA NA NA NA NA NA NA NA 10 11 12 13 14 15 16 17 18 19 20
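A sketch of code that gives this output, assuming the vector holds the numbers 1 through 20:

```r
data <- 1:20            # Data is a vector containing the numbers 1-20
data[data < 10] <- NA   # Values < 10 are replaced with the missing value NA
data
n_missing <- sum(is.na(data))   # Count how many values are now missing
n_missing
```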
What if we had a variable named “gender” that contained character values and we wanted to re-code those values as 0 for males and 1 for females:
gender <- c("MALE","FEMALE","FEMALE", "UNKNOWN", "MALE")
as.numeric(gender) # as.numeric doesn't work in this case
## Warning: NAs introduced by coercion
## [1] NA NA NA NA NA
The as.numeric() function doesn’t work in this case because R doesn’t know how to assign numerical values to characters. Instead, we could use an if/else statement:
## [1] 0 1 1 2 0
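The recoding call is not shown. One way to get this output is with nested ifelse() calls, sketched below; coding "UNKNOWN" as 2 is inferred from the printed result, and the name gender_code is hypothetical.

```r
gender <- c("MALE", "FEMALE", "FEMALE", "UNKNOWN", "MALE")

# MALE becomes 0, FEMALE becomes 1, anything else becomes 2
gender_code <- ifelse(gender == "MALE", 0,
                      ifelse(gender == "FEMALE", 1, 2))
gender_code
```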
Data might not be in the most workable form even after it has been cleaned. Oftentimes, we’ll still need to create new variables, rename variables, reorder observations, and group and summarize the data. The dplyr package (a package within the tidyverse package) has several very useful functions for accomplishing these tasks.
Packages needed for this chapter:
Listed below are a few of the more common functions in the dplyr package that can assist in solving a wide array of data-manipulation challenges.
filter(): Selects observations based upon pre-determined criteria
arrange(): Reorders the rows of your data
select(): Picks variables by their names
mutate(): Creates a new variable with functions of existing variables
group_by(): Changes the scope of each function from operating on the entire dataset to operating on it group-by-group
summarize(): Collapses many values down to a single summary
The filter() function allows you to subset observations based on their values.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
## Ferrari Dino 19.7 6 145 175 3.62 2.770 15.50 0 1 5 6
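The filter() call is not shown. The three rows printed above all have hp equal to 175, so a call along these lines would return them; the exact condition the author used is an assumption.

```r
library(dplyr)

# Keep only the cars whose horsepower is exactly 175
hp_175 <- filter(mtcars, hp == 175)
hp_175
```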
More complex filters require boolean operators to combine conditions:
& = and
| = or
! = not
For example:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15.50 0 1 5 6
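The call behind this output is not shown. The four rows are the 6-cylinder cars with mpg below 20, so the first filter below reproduces them; the exact condition the author used is an assumption, and the other two lines illustrate the remaining operators.

```r
library(dplyr)

# & (and): 6-cylinder cars that also get fewer than 20 mpg
six_cyl_low_mpg <- filter(mtcars, cyl == 6 & mpg < 20)
six_cyl_low_mpg

# | (or): cars with either 4 or 8 cylinders
four_or_eight <- filter(mtcars, cyl == 4 | cyl == 8)

# ! (not): every car that does not have 6 cylinders
not_six <- filter(mtcars, !(cyl == 6))
```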
The arrange() function changes the order of the rows (observations) in a data frame. By default, observations are listed in ascending order of the selected variable.
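A short sketch of arrange() on the mtcars dataset; the object names are hypothetical.

```r
library(dplyr)

# Ascending order of mpg (the default)
by_mpg <- arrange(mtcars, mpg)
head(by_mpg, 3)

# Descending order: wrap the variable in desc()
by_mpg_desc <- arrange(mtcars, desc(mpg))
head(by_mpg_desc, 3)

# Sort by cyl first, then by descending mpg within each cyl value
by_cyl_mpg <- arrange(mtcars, cyl, desc(mpg))
```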
If you have a big dataset, you probably don’t want to look at the whole thing all at once because it’s too much information to display on a screen. Or, you might want to compare and contrast variables but their columns are so far apart that it’s difficult to compare them. To get around this, you can use the select() function to select only the variables that you want to keep in the dataset.
## mpg wt carb
## Mazda RX4 21.0 2.620 4
## Mazda RX4 Wag 21.0 2.875 4
## Datsun 710 22.8 2.320 1
## Hornet 4 Drive 21.4 3.215 1
## Hornet Sportabout 18.7 3.440 2
## mpg cyl disp hp drat wt
## Mazda RX4 21.0 6 160 110 3.90 2.620
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
## Datsun 710 22.8 4 108 93 3.85 2.320
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215
## Hornet Sportabout 18.7 8 360 175 3.15 3.440
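The select() calls are not shown. The two outputs above are consistent with the calls sketched here, each limited to the first 5 rows; the use of head() and the object names are assumptions.

```r
library(dplyr)

# Pick individual columns by name
three_cols <- select(mtcars, mpg, wt, carb)
head(three_cols, 5)

# Pick a range of adjacent columns with the colon
mpg_to_wt <- select(mtcars, mpg:wt)
head(mpg_to_wt, 5)
```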
The mutate() function allows you to add new columns that are functions of existing columns, and it adds those new columns to the end of the dataset.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Ratio
## 1 5.1 3.5 1.4 0.2 setosa 1.457143
## 2 4.9 3.0 1.4 0.2 setosa 1.633333
## 3 4.7 3.2 1.3 0.2 setosa 1.468750
## 4 4.6 3.1 1.5 0.2 setosa 1.483871
## 5 5.0 3.6 1.4 0.2 setosa 1.388889
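A sketch of the mutate() call that would add this column; showing only the first 5 rows with head() is an assumption, as are the object names.

```r
library(dplyr)

# Add Sepal_Ratio as a new column at the end of the dataset
iris_ratio <- mutate(iris, Sepal_Ratio = Sepal.Length / Sepal.Width)
head(iris_ratio, 5)

# transmute() keeps only the newly created column
ratio_only <- transmute(iris, Sepal_Ratio = Sepal.Length / Sepal.Width)
head(ratio_only, 5)
```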
# Sepal_Ratio is a new variable created by dividing Sepal.Length by Sepal.Width for each observation
# If you only want to keep the new variables, you can use the transmute() function instead
The group_by() function changes the unit of analysis from the complete dataset to individual groups.
## # A tibble: 32 x 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # … with 22 more rows
It’s a great function, but it doesn’t really do anything by itself. Instead, we can pair group_by() with the other functions, such as the summarize() function, to do a lot of useful tasks.
The summarize() function (or summarise()) can summarize data based on the groups created by the group_by() function.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## cyl avg_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
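A sketch of the group_by() and summarize() pairing that produces this table; the object names by_cyl and avg are hypothetical.

```r
library(dplyr)

# Group the data by number of cylinders
by_cyl <- group_by(mtcars, cyl)

# Collapse each group down to its average mpg
avg <- summarize(by_cyl, avg_mpg = mean(mpg))
avg
```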
# Data was grouped by the cylinder variable, and then the average mpg was computed for each cylinder size
These functions can be paired together to quickly and easily make all kinds of transformations to a dataset:
mtcars[1:5,] %>%
  filter(hp > 100) %>%
  select(mpg, cyl, hp, wt) %>%
  arrange(mpg) %>%
  mutate(hp_wt= hp/wt)
## mpg cyl hp wt hp_wt
## 1 18.7 8 175 3.440 50.87209
## 2 21.0 6 110 2.620 41.98473
## 3 21.0 6 110 2.875 38.26087
## 4 21.4 6 110 3.215 34.21462
In this section, basic graphing functions will be covered, including types of plots, basic arguments, and aesthetics. Additionally, plots for detecting outliers or determining normality of data will be provided. At the end, applied examples are provided that can be modified for personal use.
Packages needed for this chapter:
We’ll use the ggplot() function along with the mpg dataset from the tidyverse package for the graphing examples. Let’s try to graph the mpg dataset:
If you try graphing this, you will get a blank graph. That’s because we haven’t told the ggplot function what we want it to do with the data. We need to tell ggplot what type of graph to make. We’ll start with the geom_point() function.
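The two steps can be sketched as follows. Mapping displ to x and hwy to y is an assumption based on the later examples, and the object names are hypothetical.

```r
library(ggplot2)   # provides ggplot() and the mpg dataset

# Data only, no geom: this draws an empty panel
p_blank <- ggplot(data = mpg)
p_blank

# Add a geom and map variables to the x and y axes to get a scatter plot
p_points <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
p_points
```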
Now we have told ggplot that we want to make a geom_point(), which is a scatter plot, and we’ve also specified the x and y values. There are a couple different ways to write this code, but to keep it consistent we’ll continue writing the code in this format. Here are a few graphs that can be plotted with ggplot():
geom_point(): Plots individual data points according to the x and y values assigned
geom_jitter(): Much like geom_point(), but a small amount of random variation is added to the position of each point, so overlapping data points are shifted slightly apart and can all be seen on the graph.
geom_line(): Plots a line going through each individual data point. Note that with this particular data frame, this code does not produce an interpretable graph. For an advanced example, please see Advanced Examples.
geom_smooth(): Plots line of best fit for data points. “lm” and “loess” are the common arguments for method, which will plot a straight or a curved line, respectively. Below are examples of “lm” (top) and “loess” (bottom) methods.
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
geom_bar(): Creates a bar chart for specified independent variables.
geom_boxplot(): Creates a box plot for specified variables, displaying the median and quartiles for each level of the independent variable. Note that the black dots represent outliers in the data frame.
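The geoms listed above can be tried on the mpg dataset as sketched below. The variable choices (displ, hwy, class, drv) are assumptions made for illustration, not necessarily the ones used for the original figures.

```r
library(ggplot2)

# Scatter plot with jittered points
p_jitter <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))

# Straight line of best fit ("lm") and curved line ("loess")
p_lm <- ggplot(data = mpg) +
  geom_smooth(method = "lm", mapping = aes(x = displ, y = hwy))
p_loess <- ggplot(data = mpg) +
  geom_smooth(method = "loess", mapping = aes(x = displ, y = hwy))

# Bar chart: counts of cars in each vehicle class
p_bar <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# Box plot: highway mileage for each drive type
p_box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = drv, y = hwy))

p_box   # Print any of the objects to draw it
```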
The best way to learn the basic arguments of ggplot is to play with different values on the same plot and then see what changes. Listed below are some basic arguments of the ggplot() function:
alpha: adjusts the transparency of a specified object.
size: adjusts the size of data points or lines.
fill: adjusts the color inside of data points.
weight: adjusts the thickness of a line or data point.
linetype: adjusts the type of line displayed in plots with lines; enables categorical variables to be on same graph with different visual representation
shape: adjusts the type of shape displayed in plots with points; enables categorical variables to be on same graph with different visual representation
color: adjusts the color displayed in plots; enables categorical variables to be on same graph with different visual representation
Here are 3 comparison examples that highlight the functionality of some of these arguments:
ggplot(data = mpg) +
geom_smooth(method = "loess", size = 1.5,
mapping = aes(x = displ, y = hwy))+
geom_point(size = 2, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using formula 'y ~ x'
ggplot(data = mpg) +
geom_smooth(method = "loess", size = 1.5, mapping = aes(
x = displ, y = hwy, color = drv), se = FALSE)+
geom_point(size = 2, mapping = aes(x = displ, y = hwy,
color = drv))
## `geom_smooth()` using formula 'y ~ x'
The two graphs above have very similar code but different results. Both display the relationship between highway mileage (“hwy”) and engine displacement (“displ”). The main difference is the addition of color = drv, which created a separate line for each level of the categorical variable “drv”. The same idea would work with linetype for the geom_smooth code, and with shape for the geom_point code. Finally, se = FALSE removed the shading around the lines, which showed the confidence interval around each line.
ggplot(data = mpg) +
geom_jitter(size = 2.3, mapping = aes(
x = drv, y = cty, shape = drv, color = drv))+
coord_cartesian(ylim = c(10,25))
These two graphs display the same data, which is the city mileage for each drv data point. However, we made use of the coord_cartesian function which can zoom in on the axes. We also adjusted the size of the data points. The plot on the right, although it is missing some outlier data points, looks clearer.
ggplot(data = mpg) +
geom_smooth(method = "lm", mapping = aes(x = displ, y = cty, linetype = drv), color = "purple", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
ggplot(data = mpg) +
geom_smooth(method = "lm", mapping = aes(x = displ, y = cty), color = "purple", se = FALSE)+
facet_wrap(~drv, nrow = 1, ncol = 3)
## `geom_smooth()` using formula 'y ~ x'
These graphs show the same data, but in different visual forms. The first code differentiates the different “drv” levels through linetypes, while the second code displays the different levels of “drv” in separate panels. The latter is accomplished through the facet_wrap function. The facet_grid function can also be used, but facet_wrap enables you to customize the layout of the panels (through the ncol and nrow arguments).
In this section, a template is laid out for developing basic aesthetics in graphs. Please note that there are more adjustments that can be made to a graph or figure, but this provides a basic template that you can use to make graphs. Below, we display two graphs with identical data, but the second graph has been cleaned up by taking advantage of ggplot’s additional arguments.
ggplot(data = mpg) +
geom_smooth(method = "loess", size = 1.5, mapping = aes(x = displ, y = hwy, color = drv), se = FALSE)+
geom_point(size = 2, mapping = aes(x = displ, y = hwy, color = drv))
## `geom_smooth()` using formula 'y ~ x'
g1 <- ggplot(data = mpg) +
geom_smooth(method = "loess", size = 1.5, mapping = aes(x = displ, y = hwy, color = drv), se = FALSE)+
geom_point(size = 2, mapping = aes(x = displ, y = hwy, color = drv))
g2 <- g1 + scale_x_continuous(name = "Engine Displacement (litres)")+
scale_y_continuous(name = "Highway Miles per Gallon", breaks = c(15,20,25,30,35,40,45))
g3 <- g2 + theme_bw() +
theme(axis.text=element_text(size=18, colour="black"),
axis.title=element_text(size=18,face="bold")) +
theme(strip.text.x = element_text(size = 18))+
theme(axis.title.y = element_text(margin = margin(t = 0, r = 20, b = 0, l = 15)))+
theme(axis.title.x = element_text(margin = margin(t = 15, r = 0, b = 10, l = 0)))+
theme(legend.position = c(0.80,0.9))+
theme(legend.title = element_blank())+
theme(legend.text = element_text(size = 16))+
theme(plot.title = element_blank())
g4 <- g3+scale_color_discrete(name=element_blank(),
labels = c("4-Wheel Drive",
"Front-Wheel Drive",
"Rear-Wheel Drive"))
plot(g4)
## `geom_smooth()` using formula 'y ~ x'
There are several things to note in the code:
scale_x_continuous and scale_y_continuous allow the axes to be automatically scaled to our data, but we can also name the axes and set the breaks, which specify which numbers should be displayed on an axis.
theme_bw() sets the general color scheme of the figure to black and white. However, we did input a color argument before, so this won’t apply to the different lines. The background is now white instead of grey.
axis.text and axis.title allow us to specify the size and font of the text. This goes for legend.text and legend.title as well.
margin allows us to adjust the spacing between the axis title and the figure itself. legend.position allows us to put the legend where we wish on the figure.
scale_color_discrete allows us to name the different levels of the “drv” variable as it will appear in the legend.
There are a variety of ways to check normality of data, but visualizations provide a quick, easy way to determine if assumptions of normality have been met or if outliers are present.
hist(mpg$cty,probability=T,
main="Histogram of skewed data", xlab="Skewed data")
lines(density(mpg$cty),col=2)
Above is a histogram and density curve showing the distribution of the “cty” variable in the mpg dataset. This graph displays a slight right skew.
A Q-Q plot of the same variable compares the quantiles of your sample distribution with those of a theoretical normal distribution. The dots should generally hug the reference line drawn through the plot. This plot indicates that the data may not be normally distributed, since the ends tail off, and there are 6 outliers at each end of the plot.
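The code for the Q-Q plot is not shown. A minimal sketch using base R's qqnorm() and qqline() on the “cty” variable; the plot title and the name qq are assumptions.

```r
library(ggplot2)   # only needed here for the mpg dataset

# Plot sample quantiles against theoretical normal quantiles
qq <- qqnorm(mpg$cty, main = "Normal Q-Q plot of cty")

# Add the reference line that the points should hug
qqline(mpg$cty, col = 2)
```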
Another useful way to detect outliers is using Cook’s distance. This is applicable when determining if any observations will have a significant impact on a predicted y value in an analysis. For this example, we use the simple lm function to determine if “hwy” can predict “cty.” We then use this model to plot the cook’s distance for each observation.
mod <- lm(cty ~ hwy, data=mpg) # Create a linear model
cooksd <- cooks.distance(mod) # Cook's distance
plot(cooksd, pch="*", cex=2,
main="Influential Obs by Cooks distance") # Plot Cook's distance
abline(h = 4*mean(cooksd, na.rm=T), col="red") # Add cutoff line
text(x=1:length(cooksd)+1,
y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),
names(cooksd),""), col="red")
The red line in this plot marks the cutoff of 4 times the mean Cook’s distance, and any observation above this line has a more influential effect on the analysis (and may be an outlier). As a common rule of thumb, observations with a Cook’s distance greater than 1 are candidates for removal. However, this graph shows the greatest value to be 0.30 (look at the y-axis). Therefore, data probably doesn’t need to be removed, unless the detected data points (as indicated in red text) are subjectively interpreted as outliers.
A plot of residuals against fitted values checks whether the residuals from an analysis (such as a linear model) are evenly spread (homoscedastic) or bunch together at specific values (heteroscedastic). Homoscedastic residuals should look like a shotgun blast, which they do in this case. Heteroscedastic data will show residuals bunching near a specific value, forming a “cone” shape in this graph.
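The code for this plot is not shown. A sketch in base R, assuming the same cty ~ hwy linear model used for Cook's distance:

```r
library(ggplot2)   # only needed here for the mpg dataset

# Fit the linear model
mod <- lm(cty ~ hwy, data = mpg)

# Plot each residual against its fitted value
plot(fitted(mod), resid(mod),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted values")

# Reference line at zero; points should scatter evenly around it
abline(h = 0, lty = 2)
```

Base R also offers a built-in version of this diagnostic through plot(mod, which = 1).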
For these example plots you will need access to the Box folder for the KIN 6770 class. To follow along, download the “data_TECH_DI.csv” and “TRIM_T_DI.csv” files from the Box folder.
# Download the files and setwd() to the location of the files
setwd("~/Desktop/Code/Lab notebook/data")
# Read in file
TECH<-read.csv("data_TECH_DI.csv", header = TRUE, sep=",",
na.strings=c("NA","NaN"," ",""))
# This creates a new dataframe, filtering out data from before 2014
Test<- subset(TECH, !Year < 2014)
# Creating the plot
g1 <- ggplot(data = subset(Test, !is.na(Test$SMTQ_total)), mapping = aes(x = Year, y = NAT_TECH, na.rm = TRUE))+
geom_smooth(method = "lm", size = 2, se = FALSE, aes(color = SMTQ_total > 40))+
geom_jitter(size = 2, width = 0.25, pch = 21, aes(fill = SMTQ_total > 40))+
ggtitle(element_blank())+
coord_cartesian(xlim=c(2013.95,2018.2))
g2 <- g1+scale_x_continuous (name="Years", breaks = pretty(TECH$Year, n=3))+
scale_y_continuous(name = "National Technical Points (Rank)")
g3 <- g2 + theme_bw() +
theme(axis.text=element_text(size=26, colour="black"),
axis.title=element_text(size=26,face="bold")) +
theme(strip.text.x = element_text(size = 18))+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
theme(axis.title.y = element_text(margin = margin(t = 0, r = 20, b = 0, l = 15)))+
theme(axis.title.x = element_text(margin = margin(t = 20, r = 0, b = 10, l = 0)))+
theme(legend.position = c(0.3,0.92))+
theme(legend.text = element_text(size = 22))
g4 <- g3+scale_color_manual(name="",
labels = c("Low Mental Toughness (<40)",
"High Mental Toughness (>40)"),
values = c("FALSE" = "black",
"TRUE" = "grey"))+
scale_fill_manual(name="",
labels = c("Low Mental Toughness (<40)",
"High Mental Toughness (>40)"),
values = c("FALSE" = "black",
"TRUE" = "grey"))
plot(g4)
## `geom_smooth()` using formula 'y ~ x'
This code created distinct lines and dots by using a conditional argument. In this case, the condition was whether the variable “SMTQ_total” was greater than 40. The rest of the code uses aesthetics that have already been covered, slightly tweaked.
setwd("~/Desktop/Code/Lab notebook/data")
# Read in data file
TRIM_T<-read.csv("./TRIM_T_DI.csv", header = TRUE, sep=",",
na.strings=c("NA","NaN"," ",""))
# Remove unspecified gender from data
TRIM_T <- subset(TRIM_T, !is.na(Gender))
# Create average hours variable grouping by age, gender, and country, and create a variable for the standard error
# (AgeHoursSD holds the standard error: the SD divided by the square root of the number of non-missing values)
AvgAge<-summarize(group_by(TRIM_T, AgeYear, Country, Gender),
AgeHours = mean(HOURS_TOTAL, na.rm=TRUE),
AgeHoursSD = sd(HOURS_TOTAL, na.rm=TRUE)/sqrt(sum(!is.na(HOURS_TOTAL))))
## `summarise()` regrouping output by 'AgeYear', 'Country' (override with `.groups` argument)
# Filtering out ages below 3 and above 18
AvgAge <- filter(AvgAge, AgeYear <19)
AvgAge <- filter(AvgAge, AgeYear >2)
# Rescaling hours variable
AvgAge <- mutate(AvgAge, Hours = AgeHours * 100,
SE = AgeHoursSD * 100)
# Making a variable so error bars won't overlap
dodge = position_dodge2(width = 0.4)
# Renaming country variables
levels(AvgAge$Country) <- c("Austria", "United States")
g1<-ggplot(AvgAge) +
geom_line(color = "black", size = 1.5, mapping = aes(x = AgeYear, y = Hours, linetype = Gender), position = dodge)+
geom_errorbar(aes(x = AgeYear, ymin=Hours-SE, ymax=Hours+SE), width = 0.4, position = dodge)+
facet_wrap(~Country, nrow = 1, ncol = 2)+
coord_cartesian(xlim = c(2.5,18),ylim = c(50,800))
g2<-g1+scale_x_continuous(name = "Age", breaks = c(3,6,9,12,15,18)) +
scale_y_continuous(name = "Average Engagement Hours/Year", breaks = c(100,200,300,400,500,600,700,800))
g3 <- g2 + theme_bw() +
theme(axis.text=element_text(size=18, colour="black"),
axis.title=element_text(size=23,face="bold"),
strip.text = element_text(size = 30)) +
theme(strip.text.x = element_text(size = 23))+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
theme(axis.title.y = element_text(margin = margin(t = 0, r = 20, b = 0, l = 15)))+
theme(axis.title.x = element_text(margin = margin(t = 15, r = 0, b = 10, l = 0)))+
theme(legend.position = c(0.9,0.1))+
theme(legend.title = element_blank())+
theme(legend.text = element_text(size = 21))+
theme(plot.title = element_blank())+
theme(panel.spacing = unit(4, "lines"))
g4 <- g3+scale_linetype_discrete(name=element_blank(),
labels = c("Female",
"Male"))
plot(g4)
This plot has some extra steps for formatting the data to plot. We want to create a graph that shows the average amount of practice for each age group when grouped by gender and country. First, we removed anyone with unspecified gender. Next, we used the summarize function to create an average hours variable for each grouping. Next, we filtered out ages we don’t want displayed in the graph and rescaled the hours variable. Finally, we created a “dodge” variable so that error bars don’t overlap in the graph. The levels() function allows the names of the Country levels to be changed to whatever we want displayed on the graph.