GGPLOT2 Tutorial

When working with ggplot2, it is essential to remember that it works in layers, i.e., we keep adding layers until the visualization meets our needs. Let’s take a journey:

Invoking ggplot2 Library and Setting Dataset

We are using the ‘cars’ dataset. I have loaded the data and assigned it the local name ‘mydata’. Once that is done, I want to check that the process was successful.

library(ggplot2)
mydata<-cars
head(mydata) #Gives first six rows as sample

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

summary(mydata) #Provides the overall summary of the dataset

     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00

str(mydata) #Gives the number of observations and variables, the data type of each variable, and the first few values.

'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

It looks like the cars dataset (now assigned to ‘mydata’) has 50 rows and 2 columns. The first column gives the ‘speed’, while the second gives the distance, named ‘dist’. Remember, we have to refer to these variables exactly as they are spelled, including upper and lower case, because R is case-sensitive.
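Because R is case-sensitive, a quick check of the column names before plotting can save some confusion. The short sketch below uses only base R on the built-in cars data.

```r
# Names must match exactly, including case
mydata <- cars
names(mydata)                 # the exact spellings to use inside aes()
dim(mydata)                   # number of rows and columns

"speed" %in% names(mydata)    # TRUE: correct spelling
"Speed" %in% names(mydata)    # FALSE: upper and lower case are different names
```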

Creating a Simple Plot

In this code I am not specifying the type of plot, so R will just set up the canvas. However, I am telling R that I want to use both variables: speed on the x-axis and dist on the y-axis.

ggplot(mydata, aes(x=speed, y=dist))

It’s a big stage for the upcoming figure. Now, let’s assign our figure the name ‘a’ and create a scatter plot. We simply add ‘geom_point()’. Here it is.

a<- ggplot(mydata, aes(x=speed, y=dist))+
  geom_point()
a

Here’s the plot. It looks like these variables have a positive linear relationship. Okay, what if I want a line diagram instead of the scatter plot? We simply replace ‘geom_point()’ with ‘geom_line()’. Does it work? Let’s give it the name ‘b’.

b<- ggplot(mydata, aes(x=speed, y=dist))+
  geom_line()
b

Woohoo! It did work. These variables still show the same relationship, but the fluctuation is more visible in this picture.
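Since ggplot2 works in layers, the two geoms do not have to live in separate plots. As a small sketch (the object names ‘base’ and ‘ab’ are illustrative, not part of the steps above), both layers can be stacked on the same base plot:

```r
library(ggplot2)

# Start from the empty canvas, then stack a point layer and a line layer
base <- ggplot(cars, aes(x = speed, y = dist))
ab <- base +
  geom_point() +   # layer 1: the scatter plot
  geom_line()      # layer 2: the connecting line
ab

# Each geom adds one entry to the plot's list of layers
length(base$layers)
length(ab$layers)
```

The empty canvas has no layers at all, which is why nothing is drawn until a geom is added.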

I am now going to build upon plot a. I am going to zoom in and see only values between 10 and 15 on the x-axis and between 25 and 50 on the y-axis.

a<- a + coord_cartesian(xlim=c(10,15), ylim=c(25,50))
a

It definitely worked.
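One detail worth knowing about this zoom: coord_cartesian() only changes the visible window and leaves the data untouched, whereas the xlim() and ylim() helpers set scale limits, so values outside the range are treated as missing. The sketch below compares the two on the cars data. The object names and the helper ‘usable’ are illustrative, and layer_data() is used only to inspect what each plot would draw.

```r
library(ggplot2)

p <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

zoomed  <- p + coord_cartesian(xlim = c(10, 15), ylim = c(25, 50))
limited <- p + xlim(10, 15) + ylim(25, 50)

# Count the points that still have usable x and y values in a plot
usable <- function(plot) {
  d <- suppressWarnings(layer_data(plot, 1))
  sum(complete.cases(d[, c("x", "y")]))
}

usable(zoomed)    # all 50 points are kept; only the view changes
usable(limited)   # fewer points: values outside the limits become missing
```

This matters mostly when a later layer computes something from the data, such as a smoothing line, because scale limits change what that layer sees.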

Now, let’s create some histograms. For this, let’s use the mtcars dataset.

Histogram

data(mtcars)
nirmal<-mtcars
head(nirmal)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

str(nirmal)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

a. Basic Histogram

Once the data is loaded, I am going to create a histogram of the variable ‘wt’. I provide the required information and ask for a ‘geom_histogram()’. Let’s give it the name r.

r <- ggplot(nirmal, aes(x=wt))+
  geom_histogram()
r

b. Slightly More Advanced

Let’s define the width of the bars.

r<-ggplot(nirmal, aes(x=wt))+
  geom_histogram(binwidth = 0.5)
r
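To get a feel for what binwidth does, the sketch below compares the default histogram (ggplot2 falls back to 30 bins when nothing is specified) with the 0.5-wide bins used above. It works on mtcars directly and uses layer_data() to look at the computed bars. The object names are only for illustration.

```r
library(ggplot2)

h_default <- ggplot(mtcars, aes(x = wt)) + geom_histogram()
h_wide    <- ggplot(mtcars, aes(x = wt)) + geom_histogram(binwidth = 0.5)

# One row per bar, with its boundaries (xmin, xmax) and height (count)
bars_default <- suppressMessages(layer_data(h_default, 1))
bars_wide    <- layer_data(h_wide, 1)

nrow(bars_default)                                   # bars with the default bins
nrow(bars_wide)                                      # fewer, wider bars
unique(round(bars_wide$xmax - bars_wide$xmin, 6))    # every bar is 0.5 wide
sum(bars_wide$count)                                 # each of the 32 cars falls in one bar
```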

c. More Advanced

Defining colors. The bars will be filled with blue, and the outline will be black.

r<- ggplot(nirmal, aes(x=wt))+
  geom_histogram(binwidth = 0.5, color="black", fill="blue")
r

d. Adding Vertical Line on the Histogram

r<- r+geom_vline(aes(xintercept=mean(wt)),
                 color="red", linetype="dashed", size=1)
r
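The same line can be drawn by computing the mean first and passing it as a fixed value, which makes it easy to print the number as well. A minimal sketch on mtcars (the name ‘avg_wt’ is illustrative):

```r
library(ggplot2)

avg_wt <- mean(mtcars$wt)   # average weight; wt is recorded in 1000 lbs
avg_wt

ggplot(mtcars, aes(x = wt)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "blue") +
  geom_vline(xintercept = avg_wt, color = "red", linetype = "dashed")
```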

e. Drawing a Density Plot on Top of the Histogram (note that alpha is the transparency level)

r<- ggplot(nirmal, aes(x=wt))+
  # after_stat(density) replaces the older ..density.. notation (needs ggplot2 >= 3.3.0)
  geom_histogram(binwidth = 0.5, aes(y = after_stat(density)),
                 alpha = 0.3, color = "black", fill = "yellow")+
  geom_density(alpha = 0.2, fill = "pink")
r
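The reason for putting the histogram on the density scale is that a density curve has a total area of 1, so the bars need a total area of 1 as well to be comparable. As a quick check, assuming ggplot2 3.3.0 or later for after_stat(), the bar areas can be added up:

```r
library(ggplot2)

h <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(binwidth = 0.5, aes(y = after_stat(density)))

bars <- layer_data(h, 1)

# Area of each bar = height (density) times width; the areas sum to 1
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```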

Reading my SPSS File

The haven package is required to read SPSS files into R. Let’s load the library and read the data. I assign the same local name, ‘nirmal’, to this data file. Remember, if you pass only a file name instead of a full path as below, the data file must be in the current working directory.

library(haven)
nirmal<-read_spss("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/14.310x-Data Analysis for Social Scientists/Programs/Longitudinal_Differential Study.sav")
head(nirmal)

## # A tibble: 6 x 58
##   TI_ME_ID TIM_E_ID TIME_ID  SP2017   F2017  SP2018   F2018 PS_T_ID PST_ID
##   <chr>    <dbl+lb>   <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <chr>    <dbl>
## 1 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1   1001
## 2 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1   1001
## 3 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1   1001
## 4 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1   1001
## 5 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1   1001
## 6 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1   1001
## # ... with 49 more variables: Program <chr>, MAJOR <dbl+lbl>, EM_AE <dbl+lbl>,
## #   EM_TH <dbl+lbl>, EM_IE <dbl+lbl>, EM_ED <dbl+lbl>, EM_TE <dbl+lbl>,
## #   EM_FL <dbl+lbl>, Sub <chr>, SUBJECT <dbl+lbl>, LA_MA <dbl+lbl>,
## #   LA_SC <dbl+lbl>, LA_SS <dbl+lbl>, LA_VA <dbl+lbl>, LA_WL <dbl+lbl>,
## #   Grd_Lvl <chr>, GType <chr>, GRADE <dbl+lbl>, ELEM_MID <dbl+lbl>,
## #   ELEM_HI <dbl+lbl>, ST_D_ID <chr>, STD_ID <dbl>, SID <dbl>, C_SIZE <dbl>,
## #   CL_SIZE <dbl+lbl>, Gender <chr>, MALE <dbl+lbl>, Eth <chr>,
## #   MINORITY <dbl+lbl>, ETHNICITY <dbl+lbl>, W_B <dbl+lbl>, W_H <dbl+lbl>,
## #   W_A <dbl+lbl>, W_AI <dbl+lbl>, W_O <dbl+lbl>, EconDis <chr>,
## #   FRPL <dbl+lbl>, ESE_A <chr>, ESE <dbl+lbl>, No_Yes <dbl+lbl>,
## #   NO_GIFT <dbl>, SDES <dbl+lbl>, ESOL <chr>, EL <dbl+lbl>, NEL_EL <dbl+lbl>,
## #   NEL_EXIT <dbl+lbl>, ELL <dbl+lbl>, PRE_SCR <dbl>, POST_SCR <dbl>
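The <dbl+lbl> columns in this output are SPSS labelled variables: numeric codes that carry value labels. The sketch below shows one way to work with them, using a tiny made-up data set written to a temporary .sav file so that it runs without the original file. The toy values, the labels ‘Fall 2016’ and ‘Spring 2017’, and the column name ‘term’ are assumptions for illustration only, not taken from the real data.

```r
library(haven)

# Hypothetical toy data standing in for the real file
toy <- data.frame(PRE_SCR = c(10, 12, NA), POST_SCR = c(14, 15, 18))
toy$TIM_E_ID <- labelled(c(1, 2, 1),
                         labels = c("Fall 2016" = 1, "Spring 2017" = 2))

# Round trip through a temporary .sav file
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
back <- read_spss(tmp)

# as_factor() turns the numeric codes into their value labels
back$term <- as_factor(back$TIM_E_ID)
levels(back$term)
```

A factor built this way is usually easier to use for grouping and coloring in ggplot2 than the raw labelled column.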

Creating a Simple Scatter Plot Using PRE_SCR and POST_SCR

g<-ggplot(nirmal, aes(PRE_SCR, POST_SCR))+
  geom_point(color=nirmal$TIM_E_ID)
g

## Warning: Removed 1 rows containing missing values (geom_point).

w<-ggplot(nirmal, aes(PRE_SCR, POST_SCR))+
  geom_line(color=nirmal$TIM_E_ID)
w
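In the two plots above the color is passed as a fixed vector outside aes(), so ggplot2 does not build a legend for it. A more common pattern is to map the variable inside aes() as a factor. Here is a sketch of that pattern on mtcars, so that it runs without the SPSS file; for the data above, the analogous mapping would presumably be color = factor(TIM_E_ID), which is untested here.

```r
library(ggplot2)

# Mapping a grouping variable to color inside aes() adds a legend automatically
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "Cylinders")
p
```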

Adding Title to the Plot

w <- w + ggtitle('Correlation Between Pretest and Posttest Scores')
w

Adding a Line

x <- w + geom_line()
x

Customizing the Title

x <- x + theme(plot.title=element_text(
  size = 20, face = "bold", margin = margin(10,0,10,0)
  )
)
x

## Warning: Removed 1 rows containing missing values (geom_point).

Changing Spacing in Multi-line Text

# \n splits the title into two lines so that lineheight has something to act on
w <- w + ggtitle('Correlation Between Pretest and Posttest Scores\nAmong All Students')
w <- w + theme(plot.title = element_text(size = 20, face = "bold", vjust = 1, lineheight = 0.6))
w

Customizing X and Y Axis Labels

w <- w + labs(x = "Students' Pretest Scores", 
      y = "Students' Corresponding Posttest Scores")
w

Changing the Size of and Rotating Tick Text

w <- w + theme(axis.text.x = element_text(
                angle = 50, size = 20, vjust = 0.5))
w
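The title, axis-label, and tick-text steps can also be written in one go. The sketch below applies the same kind of settings to the cars scatter plot so that it runs without the SPSS file. The label texts and sizes are placeholders.

```r
library(ggplot2)

p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  labs(title = "Stopping Distance\nby Speed",   # \n starts a second title line
       x = "Speed", y = "Stopping distance") +
  theme(plot.title  = element_text(size = 20, face = "bold", lineheight = 0.8),
        axis.text.x = element_text(angle = 50, size = 12, vjust = 0.5))
p
```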

Checking for Incomplete Cases in nirmal and Omitting Them

nirmal[!complete.cases(nirmal),] #Showing the rows that have at least one missing value

## # A tibble: 454 x 58
##    TI_ME_ID TIM_E_ID TIME_ID  SP2017   F2017  SP2018   F2018 PS_T_ID PST_ID
##    <chr>    <dbl+lb>   <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <chr>    <dbl>
##  1 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
##  2 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
##  3 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
##  4 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
##  5 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
##  6 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
##  7 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
##  8 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
##  9 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
## 10 F2016    1 [Fall~       0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~   1012
## # ... with 444 more rows, and 49 more variables: Program <chr>,
## #   MAJOR <dbl+lbl>, EM_AE <dbl+lbl>, EM_TH <dbl+lbl>, EM_IE <dbl+lbl>,
## #   EM_ED <dbl+lbl>, EM_TE <dbl+lbl>, EM_FL <dbl+lbl>, Sub <chr>,
## #   SUBJECT <dbl+lbl>, LA_MA <dbl+lbl>, LA_SC <dbl+lbl>, LA_SS <dbl+lbl>,
## #   LA_VA <dbl+lbl>, LA_WL <dbl+lbl>, Grd_Lvl <chr>, GType <chr>,
## #   GRADE <dbl+lbl>, ELEM_MID <dbl+lbl>, ELEM_HI <dbl+lbl>, ST_D_ID <chr>,
## #   STD_ID <dbl>, SID <dbl>, C_SIZE <dbl>, CL_SIZE <dbl+lbl>, Gender <chr>,
## #   MALE <dbl+lbl>, Eth <chr>, MINORITY <dbl+lbl>, ETHNICITY <dbl+lbl>,
## #   W_B <dbl+lbl>, W_H <dbl+lbl>, W_A <dbl+lbl>, W_AI <dbl+lbl>, W_O <dbl+lbl>,
## #   EconDis <chr>, FRPL <dbl+lbl>, ESE_A <chr>, ESE <dbl+lbl>,
## #   No_Yes <dbl+lbl>, NO_GIFT <dbl>, SDES <dbl+lbl>, ESOL <chr>, EL <dbl+lbl>,
## #   NEL_EL <dbl+lbl>, NEL_EXIT <dbl+lbl>, ELL <dbl+lbl>, PRE_SCR <dbl>,
## #   POST_SCR <dbl>

nirmal<-na.omit(nirmal) #Removing every row that has a missing value in any column
nirmal[!complete.cases(nirmal),] #Checking that it worked (should return zero rows)

## # A tibble: 0 x 58
## # ... with 58 variables: TI_ME_ID <chr>, TIM_E_ID <dbl+lbl>, TIME_ID <dbl>,
## #   SP2017 <dbl+lbl>, F2017 <dbl+lbl>, SP2018 <dbl+lbl>, F2018 <dbl+lbl>,
## #   PS_T_ID <chr>, PST_ID <dbl>, Program <chr>, MAJOR <dbl+lbl>,
## #   EM_AE <dbl+lbl>, EM_TH <dbl+lbl>, EM_IE <dbl+lbl>, EM_ED <dbl+lbl>,
## #   EM_TE <dbl+lbl>, EM_FL <dbl+lbl>, Sub <chr>, SUBJECT <dbl+lbl>,
## #   LA_MA <dbl+lbl>, LA_SC <dbl+lbl>, LA_SS <dbl+lbl>, LA_VA <dbl+lbl>,
## #   LA_WL <dbl+lbl>, Grd_Lvl <chr>, GType <chr>, GRADE <dbl+lbl>,
## #   ELEM_MID <dbl+lbl>, ELEM_HI <dbl+lbl>, ST_D_ID <chr>, STD_ID <dbl>,
## #   SID <dbl>, C_SIZE <dbl>, CL_SIZE <dbl+lbl>, Gender <chr>, MALE <dbl+lbl>,
## #   Eth <chr>, MINORITY <dbl+lbl>, ETHNICITY <dbl+lbl>, W_B <dbl+lbl>,
## #   W_H <dbl+lbl>, W_A <dbl+lbl>, W_AI <dbl+lbl>, W_O <dbl+lbl>, EconDis <chr>,
## #   FRPL <dbl+lbl>, ESE_A <chr>, ESE <dbl+lbl>, No_Yes <dbl+lbl>,
## #   NO_GIFT <dbl>, SDES <dbl+lbl>, ESOL <chr>, EL <dbl+lbl>, NEL_EL <dbl+lbl>,
## #   NEL_EXIT <dbl+lbl>, ELL <dbl+lbl>, PRE_SCR <dbl>, POST_SCR <dbl>
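Note that na.omit() on the whole data frame drops a row if any of its 58 columns is missing, which is why 454 rows were flagged even though the scatter plot warning mentioned only one row with missing values. If only the two score columns matter for the plots, the check can be restricted to them. Here is a sketch with made-up toy data (the values are placeholders, and ‘ESOL’ simply stands in for any unrelated column):

```r
# Toy data: one missing score and two missing values in an unrelated column
toy <- data.frame(
  PRE_SCR  = c(10, 12, NA, 15),
  POST_SCR = c(14, 15, 18, 20),
  ESOL     = c("Y", NA, "N", NA)
)

sum(!complete.cases(toy))    # rows with a missing value anywhere
nrow(na.omit(toy))           # rows left after dropping all of them

# Keep rows that are complete in the two score columns only
scores_ok  <- complete.cases(toy[, c("PRE_SCR", "POST_SCR")])
toy_scores <- toy[scores_ok, ]
nrow(toy_scores)
```

Restricting the check this way keeps far more rows when the missing values sit in columns the plot never uses.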