When working with ggplot, it is essential to remember that it works in layers, i.e., we keep adding layers until the visualization meets our needs. Let’s take a journey:
We are using the ‘cars’ dataset. I load the data and assign it a local name, ‘mydata’. Once that is done, I want to check that the process was successful.
library(ggplot2)
mydata<-cars
head(mydata) #Gives first six rows as sample
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
summary(mydata) #Provides the overall summary of the dataset
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
str(mydata) #Gives the number of observations and variables, the data type of each variable, and the first few values
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
It looks like the cars dataset (now assigned to ‘mydata’) has 50 rows and 2 columns. The first column, ‘speed’, gives the speed, while the second column, ‘dist’, gives the distance. Remember, we have to type these variable names exactly as they appear, including spelling and upper or lower case.
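To see why exact spelling and case matter, here is a small sketch that checks candidate names against the columns of the data. The helper variables `has_speed` and `has_Speed` are names I made up for illustration.

```r
# Column names in R are case-sensitive and must match exactly.
mydata <- cars

names(mydata)                              # the exact names to use inside aes()

has_speed <- "speed" %in% names(mydata)    # lower case, as stored in the data
has_Speed <- "Speed" %in% names(mydata)    # capitalised, not a column of mydata

has_speed
has_Speed
```

If a name does not match, ggplot cannot find the column and the plot fails when it is drawn.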
In this call I am not specifying the type of figure, so R will just set up the plotting area. However, I am telling R that I want to use both variables: speed on the x-axis and dist on the y-axis.
ggplot(mydata, aes(x=speed, y=dist))
It’s a big, empty stage for the upcoming figure. Now, let’s assign our figure the name ‘a’ and create a scatter plot. We simply add ‘geom_point()’. Here it is.
a<- ggplot(mydata, aes(x=speed, y=dist))+
geom_point()
a
Here’s the plot. It looks like these variables have a positive, roughly linear relationship. Okay, what if I want a line diagram instead of the scatter plot? We simply replace ‘geom_point()’ with ‘geom_line()’. Does it work? Let’s give it the name ‘b’.
b<- ggplot(mydata, aes(x=speed, y=dist))+
geom_line()
b
Woohoo! It did work. The variables still show the same relationship, but the fluctuation is more visible in this picture.
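The visual impression of a positive linear relationship can also be checked numerically. This is only a rough sketch using base R; the object names `r_speed_dist`, `fit`, and `slope` are my own and not part of the walkthrough.

```r
mydata <- cars

# Pearson correlation between speed and stopping distance
r_speed_dist <- cor(mydata$speed, mydata$dist)

# Simple linear regression of dist on speed; the slope gives the direction of the trend
fit <- lm(dist ~ speed, data = mydata)
slope <- unname(coef(fit)["speed"])

r_speed_dist
slope
```

A positive correlation together with a positive slope would be consistent with the upward trend seen in both the scatter plot and the line diagram.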
I am now going to build upon plot a. I am going to zoom in and show only values between 10 and 15 on the x-axis and between 25 and 50 on the y-axis.
a<- a + coord_cartesian(xlim=c(10,15), ylim=c(25,50))
a
It definitely worked.
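It may help to know that there are two ways to restrict what a plot shows, and they behave differently. The sketch below contrasts them on the same window; the names `zoomed`, `clipped`, `n_inside`, and `n_total` are mine, chosen for illustration.

```r
library(ggplot2)
mydata <- cars

# How many rows actually fall inside the zoom window?
inside <- mydata$speed >= 10 & mydata$speed <= 15 &
  mydata$dist >= 25 & mydata$dist <= 50
n_inside <- sum(inside)
n_total <- nrow(mydata)

# coord_cartesian() only changes the visible window; every row stays in the plot data
zoomed <- ggplot(mydata, aes(x = speed, y = dist)) +
  geom_point() +
  coord_cartesian(xlim = c(10, 15), ylim = c(25, 50))

# xlim() and ylim() set scale limits instead; rows outside them are removed with a warning
clipped <- ggplot(mydata, aes(x = speed, y = dist)) +
  geom_point() +
  xlim(10, 15) + ylim(25, 50)

n_inside
n_total
```

For a plain scatter plot the two look alike, but for layers that compute something from the data (a smoothed line, a boxplot) removing rows changes the result, while zooming does not.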
Now, let’s create some histograms. For this, let’s use the mtcars dataset.
data(mtcars)
nirmal<-mtcars
head(nirmal)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(nirmal)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Once the data is loaded, I am going to create a histogram of the variable ‘wt’. I provide the required information and ask for a ‘geom_histogram()’. Let’s give it the name r.
r <- ggplot(nirmal, aes(x=wt))+
geom_histogram()
r
Let’s define the width of the bars.
r<-ggplot(nirmal, aes(x=wt))+
geom_histogram(binwidth = 0.5)
r
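To get a feel for what a bar width of 0.5 means for this variable, the sketch below tabulates ‘wt’ in 0.5-wide intervals using base R. The break points are my own choice for illustration, so ggplot may place its bin edges differently.

```r
wt <- mtcars$wt
binwidth <- 0.5

range(wt)                                           # smallest and largest weight

# Rough number of bars needed to cover the range at this width
approx_bins <- ceiling(diff(range(wt)) / binwidth)

# Tabulate the values in 0.5-wide intervals (break points assumed for illustration)
breaks <- seq(floor(min(wt)), ceiling(max(wt)), by = binwidth)
counts <- table(cut(wt, breaks = breaks, include.lowest = TRUE))

approx_bins
counts
```

A smaller width gives more, narrower bars; a larger width gives fewer, wider ones.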
Now let’s define the colors. The bars will be filled with blue, and the outline will be black.
r<- ggplot(nirmal, aes(x=wt))+
geom_histogram(binwidth = 0.5, color="black", fill="blue")
r
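#Adding a dashed red vertical line at the mean of wt
r<- r+geom_vline(aes(xintercept=mean(wt)),
color="red", linetype="dashed", linewidth=1) #linewidth needs ggplot2 3.4.0 or later; older versions use size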
r
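The reference line can also be built from a value computed beforehand, which makes it easy to add a second line for comparison. This is a sketch only; `wt_mean`, `wt_median`, and `r2` are names I introduced, and the median line is my addition rather than part of the original figure.

```r
library(ggplot2)

# Compute the reference values first so they can be inspected or reused
wt_mean <- mean(mtcars$wt)
wt_median <- median(mtcars$wt)

r2 <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "blue") +
  # dashed red line at the mean
  geom_vline(xintercept = wt_mean, color = "red", linetype = "dashed") +
  # dotted green line at the median (added here for comparison)
  geom_vline(xintercept = wt_median, color = "darkgreen", linetype = "dotted")
r2
```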
#Drawing a density plot on top of the histogram (alpha sets the transparency level)
r<- ggplot(nirmal, aes(x=wt))+
geom_histogram(binwidth = 0.5, aes(y = after_stat(density)), #after_stat() replaces the older ..density.. notation
alpha = 0.3, color = "black", fill = "yellow")+
geom_density(alpha = 0.2, fill = "pink")
r
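The histogram has to be on the density scale for the density curve to sit on top of it: on that scale the bar areas add up to 1, just like the area under the curve. The sketch below checks this with base R’s `hist()`; the break points are assumed for illustration and need not match the ones ggplot picks.

```r
wt <- mtcars$wt

# 0.5-wide bins covering the whole range of wt (break points assumed for illustration)
breaks <- seq(1, 6, by = 0.5)
h <- hist(wt, breaks = breaks, plot = FALSE)

# On the density scale, height times width summed over all bars equals 1
area <- sum(h$density * diff(h$breaks))
area
```

With raw counts on the y-axis the bars would be far taller than the curve, and the overlay would be unreadable.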
The haven package is required to read SPSS files into R. Let’s load the library and read the data. I reuse the same local name, ‘nirmal’, for this data file. Remember, if you give only a file name, the data file must be in the current working directory; here I pass the full path instead.
library(haven)
nirmal<-read_spss("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/14.310x-Data Analysis for Social Scientists/Programs/Longitudinal_Differential Study.sav")
head(nirmal)
## # A tibble: 6 x 58
## TI_ME_ID TIM_E_ID TIME_ID SP2017 F2017 SP2018 F2018 PS_T_ID PST_ID
## <chr> <dbl+lb> <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <chr> <dbl>
## 1 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1 1001
## 2 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1 1001
## 3 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1 1001
## 4 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1 1001
## 5 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1 1001
## 6 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_1 1001
## # ... with 49 more variables: Program <chr>, MAJOR <dbl+lbl>, EM_AE <dbl+lbl>,
## # EM_TH <dbl+lbl>, EM_IE <dbl+lbl>, EM_ED <dbl+lbl>, EM_TE <dbl+lbl>,
## # EM_FL <dbl+lbl>, Sub <chr>, SUBJECT <dbl+lbl>, LA_MA <dbl+lbl>,
## # LA_SC <dbl+lbl>, LA_SS <dbl+lbl>, LA_VA <dbl+lbl>, LA_WL <dbl+lbl>,
## # Grd_Lvl <chr>, GType <chr>, GRADE <dbl+lbl>, ELEM_MID <dbl+lbl>,
## # ELEM_HI <dbl+lbl>, ST_D_ID <chr>, STD_ID <dbl>, SID <dbl>, C_SIZE <dbl>,
## # CL_SIZE <dbl+lbl>, Gender <chr>, MALE <dbl+lbl>, Eth <chr>,
## # MINORITY <dbl+lbl>, ETHNICITY <dbl+lbl>, W_B <dbl+lbl>, W_H <dbl+lbl>,
## # W_A <dbl+lbl>, W_AI <dbl+lbl>, W_O <dbl+lbl>, EconDis <chr>,
## # FRPL <dbl+lbl>, ESE_A <chr>, ESE <dbl+lbl>, No_Yes <dbl+lbl>,
## # NO_GIFT <dbl>, SDES <dbl+lbl>, ESOL <chr>, EL <dbl+lbl>, NEL_EL <dbl+lbl>,
## # NEL_EXIT <dbl+lbl>, ELL <dbl+lbl>, PRE_SCR <dbl>, POST_SCR <dbl>
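If the file cannot be found, the read fails, so it can be worth checking the path first. Below is a minimal sketch of such a check. The helper `read_sav_if_present()` and the file name `my_study.sav` are hypothetical, made up for this example, and are not part of haven.

```r
# Hypothetical helper: read an SPSS file only if it can be found
read_sav_if_present <- function(path) {
  if (!file.exists(path)) {
    # Report where R is currently looking, which helps with relative paths
    message("File not found: ", path, " (working directory: ", getwd(), ")")
    return(NULL)
  }
  haven::read_spss(path)
}

# Hypothetical file name; a bare name like this is looked up in the working directory
study <- read_sav_if_present("my_study.sav")
```

`getwd()` shows the current working directory, and `setwd()` changes it if the file lives elsewhere.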
g<-ggplot(nirmal, aes(PRE_SCR, POST_SCR))+
geom_point(aes(color=factor(TIM_E_ID))) #mapping color inside aes() instead of passing nirmal$TIM_E_ID
g
## Warning: Removed 1 rows containing missing values (geom_point).
w<-ggplot(nirmal, aes(PRE_SCR, POST_SCR))+
geom_line(color=nirmal$TIM_E_ID)
w
w <- w + ggtitle('Correlation Between Pretest and Posttest Scores')
w
x <- w + geom_line()
x
x <- x + theme(
plot.title = element_text(size = 20, face = "bold", margin = margin(10,0,10,0))
)
g
## Warning: Removed 1 rows containing missing values (geom_point).
w <- w + ggtitle('Correlation Between Pretest and Posttest Scores Among All Students')
w <- w + theme(plot.title = element_text(size = 20, face = "bold", vjust = 1, lineheight = 0.6))
w
#Customizing X and Y Axes
w <- w + labs(x = "Students' Pretest Scores",
y = "Students' Corresponding Posttest Scores")
w
#Change Size of and Rotate Tick Text
w <- w + theme(axis.text.x = element_text(
angle = 50, size = 20, vjust = 0.5))
w
#Listing the incomplete cases in nirmal before omitting them
nirmal[!complete.cases(nirmal),] #Rows with at least one missing value
## # A tibble: 454 x 58
## TI_ME_ID TIM_E_ID TIME_ID SP2017 F2017 SP2018 F2018 PS_T_ID PST_ID
## <chr> <dbl+lb> <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <chr> <dbl>
## 1 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## 2 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## 3 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## 4 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## 5 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## 6 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## 7 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## 8 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## 9 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## 10 F2016 1 [Fall~ 0 0 [Fal~ 0 [Oth~ 0 [Fal~ 0 [Fal~ F2016_~ 1012
## # ... with 444 more rows, and 49 more variables: Program <chr>,
## # MAJOR <dbl+lbl>, EM_AE <dbl+lbl>, EM_TH <dbl+lbl>, EM_IE <dbl+lbl>,
## # EM_ED <dbl+lbl>, EM_TE <dbl+lbl>, EM_FL <dbl+lbl>, Sub <chr>,
## # SUBJECT <dbl+lbl>, LA_MA <dbl+lbl>, LA_SC <dbl+lbl>, LA_SS <dbl+lbl>,
## # LA_VA <dbl+lbl>, LA_WL <dbl+lbl>, Grd_Lvl <chr>, GType <chr>,
## # GRADE <dbl+lbl>, ELEM_MID <dbl+lbl>, ELEM_HI <dbl+lbl>, ST_D_ID <chr>,
## # STD_ID <dbl>, SID <dbl>, C_SIZE <dbl>, CL_SIZE <dbl+lbl>, Gender <chr>,
## # MALE <dbl+lbl>, Eth <chr>, MINORITY <dbl+lbl>, ETHNICITY <dbl+lbl>,
## # W_B <dbl+lbl>, W_H <dbl+lbl>, W_A <dbl+lbl>, W_AI <dbl+lbl>, W_O <dbl+lbl>,
## # EconDis <chr>, FRPL <dbl+lbl>, ESE_A <chr>, ESE <dbl+lbl>,
## # No_Yes <dbl+lbl>, NO_GIFT <dbl>, SDES <dbl+lbl>, ESOL <chr>, EL <dbl+lbl>,
## # NEL_EL <dbl+lbl>, NEL_EXIT <dbl+lbl>, ELL <dbl+lbl>, PRE_SCR <dbl>,
## # POST_SCR <dbl>
nirmal<-na.omit(nirmal) #Dropping every row that has a missing value in any column
nirmal[!complete.cases(nirmal),] #Checking that no incomplete rows remain
## # A tibble: 0 x 58
## # ... with 58 variables: TI_ME_ID <chr>, TIM_E_ID <dbl+lbl>, TIME_ID <dbl>,
## # SP2017 <dbl+lbl>, F2017 <dbl+lbl>, SP2018 <dbl+lbl>, F2018 <dbl+lbl>,
## # PS_T_ID <chr>, PST_ID <dbl>, Program <chr>, MAJOR <dbl+lbl>,
## # EM_AE <dbl+lbl>, EM_TH <dbl+lbl>, EM_IE <dbl+lbl>, EM_ED <dbl+lbl>,
## # EM_TE <dbl+lbl>, EM_FL <dbl+lbl>, Sub <chr>, SUBJECT <dbl+lbl>,
## # LA_MA <dbl+lbl>, LA_SC <dbl+lbl>, LA_SS <dbl+lbl>, LA_VA <dbl+lbl>,
## # LA_WL <dbl+lbl>, Grd_Lvl <chr>, GType <chr>, GRADE <dbl+lbl>,
## # ELEM_MID <dbl+lbl>, ELEM_HI <dbl+lbl>, ST_D_ID <chr>, STD_ID <dbl>,
## # SID <dbl>, C_SIZE <dbl>, CL_SIZE <dbl+lbl>, Gender <chr>, MALE <dbl+lbl>,
## # Eth <chr>, MINORITY <dbl+lbl>, ETHNICITY <dbl+lbl>, W_B <dbl+lbl>,
## # W_H <dbl+lbl>, W_A <dbl+lbl>, W_AI <dbl+lbl>, W_O <dbl+lbl>, EconDis <chr>,
## # FRPL <dbl+lbl>, ESE_A <chr>, ESE <dbl+lbl>, No_Yes <dbl+lbl>,
## # NO_GIFT <dbl>, SDES <dbl+lbl>, ESOL <chr>, EL <dbl+lbl>, NEL_EL <dbl+lbl>,
## # NEL_EXIT <dbl+lbl>, ELL <dbl+lbl>, PRE_SCR <dbl>, POST_SCR <dbl>
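Note that `na.omit()` looks at every column, so a row is dropped even when the missing value sits in a column the plot never uses. The toy data below are invented purely to show the difference; `toy`, `plot_cols`, and `kept` are names I made up.

```r
# Invented toy data: one row misses a score, another misses an unrelated field
toy <- data.frame(
  PRE_SCR  = c(50, 60, NA, 80),
  POST_SCR = c(55, 65, 75, 85),
  other    = c("a", NA, "c", "d")
)

# na.omit() drops a row if ANY column is missing
n_all_cols <- nrow(na.omit(toy))

# Restricting the check to the plotted columns keeps more rows
plot_cols <- c("PRE_SCR", "POST_SCR")
kept <- toy[complete.cases(toy[, plot_cols]), ]
n_plot_cols <- nrow(kept)

n_all_cols    # 2 rows survive
n_plot_cols   # 3 rows survive
```

Which approach is appropriate depends on the analysis; the narrower check is one option when only a few columns matter.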
w <- w + theme(
axis.title.x = element_text(color = "chocolate", vjust = 0.35, angle = 0),
axis.text.x = element_text(angle = 0, size = 10, vjust = 0.5),
axis.title.y = element_text(color = "green", vjust = 0.35)
)
w
#Defining Y Limit
w <- w + ylim(c(0,130))
w
i <- ggplot(nirmal, aes(ETHNICITY, PRE_SCR, color = factor(ETHNICITY)))+
geom_boxplot(fill = "tomato3")
i
## Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/double. Defaulting to continuous.
#Overlaying jittered points on the boxplots
i <- i + geom_jitter(alpha = 0.5, aes(color = factor(ETHNICITY)),
position = position_jitter(width = 0.2))
i
## Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/double. Defaulting to continuous.
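The scale message appears because ETHNICITY is a labelled SPSS variable, which ggplot treats as a number on the x-axis. One way that may avoid it is to convert the variable to a factor with haven’s `as_factor()` before plotting. The sketch below uses invented toy data, and the labels GroupA, GroupB, and GroupC are placeholders rather than the study’s real categories.

```r
library(haven)
library(ggplot2)

# Invented toy data standing in for the real file
toy <- data.frame(PRE_SCR = c(40, 55, 60, 72, 65, 58))
toy$ETHNICITY <- labelled(
  c(1, 2, 1, 3, 2, 3),
  labels = c(GroupA = 1, GroupB = 2, GroupC = 3)   # placeholder labels
)

# Convert the labelled column to an ordinary factor that uses the value labels
toy$eth_f <- as_factor(toy$ETHNICITY)
levels(toy$eth_f)

# A discrete x variable gives one box per group
i2 <- ggplot(toy, aes(x = eth_f, y = PRE_SCR, color = eth_f)) +
  geom_boxplot()
i2
```

With a discrete x variable, each group gets its own box and the axis shows the labels instead of the numeric codes.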