MCD Workshop 2

Ibrahim Inal

Data Handling

R vs. Packages

We have already seen data types and some of their basic properties. We will now learn more about data handling and the related tools. Please note that any data task can be done in one of two ways:

  • with base R only

  • with packages

When it comes to packages, some are used more commonly than others. We will try both approaches. I encourage you to check out alternative packages. There is always something cool!

Getting Data I: Packages

One of the easiest ways to get data is to use the data sets that are already built into R. We will use one of the most popular built-in data sets, if not the most popular: mtcars. To see all the others, run data() in your console.

#Loading mtcars data
data("mtcars")
?mtcars #to learn more about the dataset


#To get some observations
head(mtcars)
tail(mtcars)

# Some functions we already know from vectors and matrices
nrow(mtcars)
ncol(mtcars)
dim(mtcars)

#variables
names(mtcars)

#some data sets store identifiers (here, the car models) as row names
row.names(mtcars)

# to understand the structure of the data
str(mtcars)

#If you'd like to view in a traditional way
View(mtcars)

Some data handling

When you have the data, you may want to check whether it makes sense. For instance, the help page shows that vs (engine shape) and am (transmission type) are coded as 0/1, so they are really categorical variables. It therefore makes more sense to convert them to factors.

#Converting numerical variables to categorical variables. 
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am<-as.factor(mtcars$am)

You can look at the structure of the data again, or alternatively use the class() function on a single column. If you would like to see the class of every variable, a better way is

lapply(mtcars,class)
sapply(mtcars,class)

When we have data, the first thing to check is whether there are missing values. One way to do this is

sum(is.na(mtcars)) #to count the missing values

mtcars has no missing values, so let's delete a couple of values first

#creating missing values
mtcars[1,1]<-NA  
mtcars[10,5]<-NA
sum(is.na(mtcars)) #checking the count of missing values
colSums(is.na(mtcars)) #counting missing values column by column
which(is.na(mtcars)) #getting the (linear) positions of the missing values
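The positions returned by which() above are linear indices, which are hard to read for a data frame. As a minimal sketch, the arr.ind = TRUE argument of which() returns row and column numbers instead. The names cars, na_pos and na_where are illustrative only, and the sketch works on a fresh copy of the data so that your own mtcars is left untouched.

```r
# Work on a copy so the workshop's mtcars object is not modified
cars <- datasets::mtcars
cars[1, 1]  <- NA
cars[10, 5] <- NA

# arr.ind = TRUE returns a matrix with one row per missing value
# and the columns "row" and "col"
na_pos <- which(is.na(cars), arr.ind = TRUE)
na_pos

# Translate the numeric positions into car and variable names
na_where <- data.frame(
  car      = rownames(cars)[na_pos[, "row"]],
  variable = names(cars)[na_pos[, "col"]]
)
na_where
```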

Now we have a data set with missing values. There are different ways to handle missing data, and you will learn more about them in other modules. One basic, or naive, way is to replace each missing value with the mean of its variable.

mtcars[1,1]<-round(mean(mtcars$mpg,na.rm = TRUE),2)#replacing the missing value with the mean of mpg

mtcars[10,5]<-round(mean(mtcars$drat,na.rm = TRUE),2)#replacing the missing value with the mean of drat
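Replacing values cell by cell becomes tedious when many columns have gaps. The following is a minimal sketch of the same naive mean imputation applied to every numeric column at once. The helper impute_mean() and the name cars are hypothetical names introduced for illustration, and unlike the lines above the sketch does not round the mean.

```r
# Hypothetical helper: replace the NAs of a numeric vector with its mean
impute_mean <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  }
  x  # non-numeric columns are returned unchanged
}

# Work on a copy and recreate the two missing values
cars <- datasets::mtcars
cars[1, 1]  <- NA
cars[10, 5] <- NA

# cars[] keeps the data frame structure while replacing each column
cars[] <- lapply(cars, impute_mean)

sum(is.na(cars))  # no missing values should remain
```

Swapping mean() for median() inside the helper would give median imputation instead.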

Alternatively, we could remove the incomplete rows from our data set (the most common way to handle missing data). To do so, we can use either

complete.cases(mtcars)  # this gives one logical value per row: TRUE if the row has no missing values
mtcars[complete.cases(mtcars),]

or

na.omit(mtcars)
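Neither approach changes the data set unless the result is assigned to a name. As a small sketch on a copy of the data (the names cars, kept_by_index and kept_by_omit are illustrative), both approaches keep the same rows:

```r
# Copy of mtcars with two missing values in two different rows
cars <- datasets::mtcars
cars[1, 1]  <- NA
cars[10, 5] <- NA

kept_by_index <- cars[complete.cases(cars), ]  # logical row indexing
kept_by_omit  <- na.omit(cars)                 # same rows, plus an "na.action" attribute

nrow(cars)           # rows before removal
nrow(kept_by_omit)   # rows after removal

# Both approaches keep exactly the same rows
identical(rownames(kept_by_index), rownames(kept_by_omit))
```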

Practice

  1. Get the built-in data set airquality. How many missing values are in this data set?
  2. Which variables are the missing values concentrated in?
  3. How would you impute the mean or median for these values?
  4. How would you omit all rows containing missing values?

Getting Data II

There are many sources of data. You will learn about some of them in this module and about others in other modules. R is quite good at accessing data in various ways. The most common practice is to download the data from a data source. We will use the World Bank DataBank https://databank.worldbank.org/ for this purpose.

Please download the following indicators from World Development Indicators

  • Access to electricity (% of population)
  • Compulsory education, duration (years)
  • Educational attainment, at least Bachelor’s or equivalent, population, 25+, total (%) (cumulative)
  • Educational attainment, at least primary, population, 25+, total (%) (cumulative)
  • GDP per capita, PPP (current international $)
  • Government expenditure on education, total (% of GDP)
  • Population ages, 0-14 (% of total population)
  • Population, total

Choose the years from 2009 onward. Once you have made these selections, click Download options in the upper-right corner. Choose CSV or Excel as the format (the two most common data formats).

Loading data into the R environment

For Excel files, one way is to use Environment > Import Dataset > From Excel in RStudio. This is essentially the same as writing the following code

library(readxl)
wb_data_raw <- read_excel("mydata/wb_data_raw.xlsx")
View(wb_data_raw)

Whether you downloaded the Excel or the CSV file, the raw data is not very easy to use. So, we need to get the data into a shape that is easy to work with. This process is called tidying the data. In this part we will use the data in tidy form. Later we will see how to clean and tidy our data.

wb_data<-read.csv("mydata/wb_data_tidy.csv",stringsAsFactors = FALSE) #Loading data in csv form
?read.csv
str(wb_data)#structure of data
head(wb_data)

Note that you can reuse the code from the previous section (head(), str(), is.na(), and so on) on this data set.

Here is a short summary of the variables in the dataset (other than the obvious ones):

  • elecAccess: % of population with access to electricity.
  • gdpPerCap: GDP per capita based on purchasing power parity. This is a proxy for how rich a country is.
  • compEduc: Number of years that children are legally obliged to attend school.
  • educPri: % of population aged 25 and over that attained or completed primary education.
  • educTer: % of population aged 25 and over that attained or completed Bachelor’s or equivalent.
  • govEducExp: General government expenditure on education as a % of GDP.
  • popYoung: Population between the ages of 0 to 14 as a % of total population.
  • pop: Total population.

Getting Data III

Another way of getting data into R, if available, is to use packages. Luckily, there are many useful packages for this purpose. Just to name a few:

  • WDI allows accessing World Bank data.
  • NHSRdatasets allows accessing NHS data.
  • fredr allows accessing Federal Reserve Bank of St. Louis (FRED) data.
  • ecb allows accessing European Central Bank (ECB) data.
  • pdfetch allows accessing public economic and financial data.

To access the World Bank data we could use WDI. The advantage (or disadvantage, depending on your workflow) is that there is no need to save the data to a file first.

#install.packages("WDI")

library("WDI")

WDIsearch("GDP per capita")  # to search terms 

WDIsearch() is generally useful once you have some experience with WB data. If you are new to this data, it makes more sense to use the website https://data.worldbank.org/ to learn more about the indicator IDs.

rawdata<- WDI(indicator = c("EG.ELC.ACCS.ZS", # access to electricity
                  "SE.COM.DURS", # compulsory education years
                  "SE.TER.CUAT.BA.ZS", # tertiary education
                  "SE.PRM.CUAT.ZS", # primary education
                  "NY.GDP.PCAP.PP.CD", # gdppercap
                  "SE.XPD.TOTL.GD.ZS", # government expenditure on educ
                  "SP.POP.0014.TO.ZS", # population 0-14
                  "SP.POP.TOTL" ), # population total
    start = 2009, 
    end = 2023)

WDI() returns the data as a data frame (technically a list of columns) with one row per country and year. If you would like to work with another layout, you can reshape it, for example with the tidyverse packages. If you want a particular country, you can pass it to the WDI() function through the country argument.

WDI(country = "GB",                 # ISO-2 country code
    indicator = "EG.ELC.ACCS.ZS",   # access to electricity
    start = 2009,
    end = 2023)

Note that you need to use the right code for the country (the ISO 2-letter code). If you use UK instead of GB, you will get an error message.
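The columns returned by WDI() are named after the indicator codes, which are hard to read. As a rough sketch, the codes can be mapped to the short variable names used in the tidy data set. Since WDI() needs an internet connection, the sketch runs on raw_mock, a hypothetical stand-in with the same kind of layout; its countries and values are made-up placeholders, not real data. The code-to-name pairs come from the indicator list and the variable summary in this workshop.

```r
# Hypothetical stand-in for a WDI() result: identifier columns plus one
# column per indicator code. All values are placeholders, not real data.
raw_mock <- data.frame(
  country        = c("Country A", "Country A", "Country B"),
  iso2c          = c("AA", "AA", "BB"),
  year           = c(2009, 2010, 2009),
  EG.ELC.ACCS.ZS = c(50, 55, 90),
  SP.POP.TOTL    = c(1000, 1100, 2000)
)

# Lookup table: indicator code -> short name used in the tidy data set
short_names <- c(
  EG.ELC.ACCS.ZS    = "elecAccess",
  SE.COM.DURS       = "compEduc",
  SE.TER.CUAT.BA.ZS = "educTer",
  SE.PRM.CUAT.ZS    = "educPri",
  NY.GDP.PCAP.PP.CD = "gdpPerCap",
  SE.XPD.TOTL.GD.ZS = "govEducExp",
  SP.POP.0014.TO.ZS = "popYoung",
  SP.POP.TOTL       = "pop"
)

# Rename only the columns that appear in the lookup table
is_indicator <- names(raw_mock) %in% names(short_names)
names(raw_mock)[is_indicator] <- unname(short_names[names(raw_mock)[is_indicator]])

names(raw_mock)
```

The same three renaming lines should work on the real result, for example on rawdata from the call above, assuming its indicator columns carry the codes as names.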

Graphs and visualization

Why does visualization matter?

      Group 1        Group 2        Group 3        Group 4
ID    x    y         x    y         x    y         x    y
1     10   8.04      10   9.14      10   7.46      8    6.58
2     8    6.95      8    8.14      8    6.77      8    5.76
3     13   7.58      13   8.74      13   12.74     8    7.71
4     9    8.81      9    8.77      9    7.11      8    8.84
5     11   8.33      11   9.26      11   7.81      8    8.47
6     14   9.96      14   8.10      14   8.84      8    7.04
7     6    7.24      6    6.13      6    6.08      8    5.25
8     4    4.26      4    3.10      4    5.39      19   12.50
9     12   10.84     12   9.13      12   8.15      8    5.56
10    7    4.82      7    7.26      7    6.42      8    7.91
11    5    5.68      5    4.74      5    5.73      8    6.89

Visualization: Anscombe’s quartet
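The four groups can be explored directly in R, because the quartet ships with R as the built-in data set anscombe (columns x1 to x4 and y1 to y4). The sketch below computes the usual summary statistics for each group and then plots the groups side by side; the names quartet and summary_stats are illustrative.

```r
quartet <- datasets::anscombe

# Mean, variance and correlation for each of the four groups
summary_stats <- sapply(1:4, function(i) {
  x <- quartet[[paste0("x", i)]]
  y <- quartet[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x  = var(x),  var_y  = var(y),
    cor_xy = cor(x, y))
})
colnames(summary_stats) <- paste("Group", 1:4)
round(summary_stats, 2)

# The plots tell a very different story from the summary table
old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of plots
for (i in 1:4) {
  plot(quartet[[paste0("x", i)]], quartet[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Group", i))
}
par(old_par)                      # restore the previous layout
```

The summary statistics are nearly identical across the groups, while the scatter plots look completely different, which is why plotting the data is worth the effort.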

Base R Plotting

Base R plotting functions come with every R installation and do not require any additional visualization packages.

Scatter Plot

Probably the most basic and common graph used in data analysis is the scatter plot.

# base R
plot(x = mtcars$wt, y = mtcars$mpg) #scatter plot; the argument names x= and y= can be omitted

plot(x=mtcars$wt,y=mtcars$mpg,xlab="Weight", ylab="Miles per gallon", main="Weight vs. Miles per Gallon", col="skyblue")# A better version of the graph

We could also draw a scatter plot matrix to observe several plots at once. The following takes columns 4, 5 and 6 and plots each of them against the others

# base R
plot(mtcars[,4:6])

Line Plot

The plot function can draw line charts as well

# base R
plot(x = mtcars$wt, y = mtcars$mpg, type="l")

Obviously this is less useful than the scatter plot, because the rows are not ordered by weight and the line jumps back and forth.

# base R
plot(x = mtcars$gear, y = mtcars$carb, type="l")
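With type = "l", plot() connects the points in the order in which they appear in the data. A minimal sketch of one way to get a readable line is to sort the rows by the x variable first; the names cars and cars_sorted are illustrative.

```r
cars <- datasets::mtcars

# Order the rows by weight so the line moves from left to right
cars_sorted <- cars[order(cars$wt), ]

plot(x = cars_sorted$wt, y = cars_sorted$mpg, type = "l",
     xlab = "Weight", ylab = "Miles per gallon",
     main = "Line plot after sorting by weight")
```

Line plots are most natural when x has an inherent order, such as time.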

Bar Plot

barplot() draws a bar chart of the values it is given.

barplot(mtcars$cyl)#simple bar chart: one bar per car

Alternatively, we may want the bars to show counts. In that case we can use the table() function together with barplot()

barplot(table(mtcars$cyl))

Box Plot

One way to draw box plots is to use the plot function.

plot(as.factor(mtcars$cyl), mtcars$mpg) # note that x is converted to a factor; otherwise this would give a scatter plot

The boxplot function gives us the same graph.

boxplot(mpg ~ cyl,data=mtcars)

Pie Chart

The pie() function produces pie charts. However, it takes a frequency table as input (see barplot above). You can either create the table first and then pass it to pie(), or create the table directly inside the pie() call.

pie(table(mtcars$cyl))

Histogram

hist(mtcars$mpg)  #histogram with default bins

hist(mtcars$mpg, breaks=10) #histogram with adjusted bins
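When breaks is a single number, hist() treats it only as a suggestion and picks "pretty" break points, so the number of bins may differ from what was requested. As a small sketch, passing a vector of bin edges fixes the bins exactly. The edges below (10 to 35 in steps of 2.5) are an illustrative choice that covers the range of mpg, and plot = FALSE returns the histogram object without drawing it.

```r
mpg <- datasets::mtcars$mpg

# A single number is only a suggestion for the number of bins
h_suggested <- hist(mpg, breaks = 10, plot = FALSE)
length(h_suggested$counts)   # number of bins R actually used

# A vector of edges is used exactly as given: 11 edges make 10 bins
edges   <- seq(10, 35, by = 2.5)
h_exact <- hist(mpg, breaks = edges, plot = FALSE)
length(h_exact$counts)

# Draw the histogram with the fixed bins
plot(h_exact, main = "Histogram of mpg with fixed bins", xlab = "Miles per gallon")
```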

Practice

  1. Using the built-in data set airquality, create a scatter plot comparing the Temp and Ozone variables. What can you say about the graph?
  2. Create a histogram of the Temp variable. Try to adjust bins so that there are (approximately) 20 bins.
  3. Plot the frequency of observations in each Month. What can you see?
  4. Create a boxplot to view the distribution of Ozone for each month.

ggplot

ggplot2 is a package developed to give visualization a consistent structure. The official cheat sheet is available from the ggplot2 website.

Plot = data + Aesthetics + Geometry

  • Geometries: the visual elements used to draw our data, e.g., points, lines, histograms.
    • Example geom: point
  • Aesthetics: the data columns that control various aspects of the geom, e.g., x, y, color, fill, size.
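The difference between an aesthetic and a fixed setting is easiest to see side by side. In the sketch below, p_mapped maps color to a data column inside aes(), so each cylinder group gets its own color and a legend, while p_fixed sets one color for all points outside aes(). The names cars, p_mapped and p_fixed are illustrative, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

cars <- datasets::mtcars

# Aesthetic: color depends on the data (one color per number of cylinders)
p_mapped <- ggplot(data = cars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3)

# Fixed setting: every point gets the same color, no legend is drawn
p_fixed <- ggplot(data = cars, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue", size = 3)

p_mapped
p_fixed
```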

ggplot Graphs

Scatterplot

library(ggplot2) # load ggplot2 before the first ggplot() call

# ggplot base syntax
ggplot(data= mtcars, aes(x = wt, y = mpg)) +  
  geom_point()  #Scatterplot with ggplot




#ggplot adding options to  make graph better
ggplot(mtcars, aes( wt,  mpg)) +
  geom_point(color = "steelblue", size = 3) + #geom point allows change in color and size
   labs(title ="Weight vs Miles per Gallon",  #labs allows labelling
        x="Weight",
        y="Miles per Gallon")+
  theme_grey() #theme grey is one of themes available you can try minimal or classic as well
  



qplot(x = mtcars$wt, y = mtcars$mpg)# an older shortcut in the ggplot2 package (deprecated since ggplot2 3.4.0)

To create the matrix of scatter plots we saw earlier, we need an additional package, GGally. In the base R version this was not necessary.

library(GGally)

ggpairs(mtcars[, 4:6],
              title = "Pairwise Scatter Plots for Selected Variables",
              columns = c("wt", "hp", "drat"),
              mapping = aes(color = NULL))

Line Plot
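A minimal sketch of a ggplot2 line plot, mirroring the base R example with weight and miles per gallon. Unlike plot() with type = "l", geom_line() connects the observations in order of the x variable, so no sorting is needed. The name p_line is illustrative, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

# Line plot with ggplot; geom_line() orders the points by x before connecting them
p_line <- ggplot(datasets::mtcars, aes(x = wt, y = mpg)) +
  geom_line(color = "steelblue") +
  labs(title = "Weight vs Miles per Gallon",
       x = "Weight",
       y = "Miles per Gallon") +
  theme_minimal()

p_line
```

To connect the points in the order in which they appear in the data instead, geom_path() can be used in place of geom_line().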

Bar Plot

# Create a bar plot using ggplot
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Bar Plot of Cylinder Frequencies",
       x = "Cylinders",
       y = "Frequency") +
  theme_minimal()+
  coord_flip()

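Note that we pass cyl as a factor so that it is treated as a categorical variable; geom_bar() then counts the observations in each category. coord_flip() turns the bars horizontal.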

Box Plot

library(ggplot2)


#Box plot with ggplot
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +  # note that we pass cyl as a factor so that we get one box per cylinder group
  geom_boxplot(fill = "steelblue", color = "black") + # this part makes box plot with some optional coloring
  labs(title = "Box Plot of MPG by Cylinder",
       x = "Cylinders",
       y = "MPG") +  # labs allows labelling
  theme_minimal() # this is one of the themes that gives a general structure to the graph

Pie Chart

Creating a pie chart with ggplot2 is rather indirect. We need to create a stacked bar plot first and then convert it to a pie chart.

We will use geom_bar() in a different way this time. Note that the previous version would also be fine.

#Step1: Making data compatible with ggplot structure. ggplot works with data frames

df<-data.frame(table(mtcars$cyl))


#Step2: Creating the bar plot. The trick is to stack everything in a single bar.

bar_chart<-ggplot(df, aes(x="", y=Freq,fill=Var1)) +
  geom_bar(width=1, stat="identity")  # x="" stacks all groups in one bar; width=1 makes the bar fill the full width


bar_chart + 
  coord_polar("y", start=0)+ #converting bar plot to pie chart
  theme_minimal()

Histogram

#histogram with ggplot
ggplot(mtcars,aes(x=mpg))+
  geom_histogram(binwidth=4)

Visualization Practice

Revisiting WB data

Before continuing you could try to clean your environment.

rm(list=ls()) #to clean your environment


df<- read.csv("mydata/wb_data_tidy.csv", stringsAsFactors = FALSE)

Here is a short summary of the variables in the dataset (other than the obvious ones):

  • elecAccess: % of population with access to electricity.
  • gdpPerCap: GDP per capita based on purchasing power parity. This is a proxy for how rich a country is.
  • compEduc: Number of years that children are legally obliged to attend school.
  • educPri: % of population aged 25 and over that attained or completed primary education.
  • educTer: % of population aged 25 and over that attained or completed Bachelor’s or equivalent.
  • govEducExp: General government expenditure on education as a % of GDP.
  • popYoung: Population between the ages of 0 to 14 as a % of total population.
  • pop: Total population.

  1. Plot the distribution of GDP per capita. Try with different bins.
  2. What is the relationship between primary education attainment and GDP per capita?
  3. Plot the same relationship with a logarithmic transformation, i.e., log GDP per capita. What can you tell about the relationship?
  4. Plot the relationship for tertiary education, but change the shape of the points.
  5. Plot the relationship between years of compulsory education and GDP per capita. Do you see overplotting? How could you improve the graph?
  6. Plot the same relationship as a box plot. Try geom_violin() as well.
  7. Try to plot population over the years with a line graph. Hint: first try a basic line graph, then try to use group to group the data by country.
  8. Is there a relationship between the total population of a country and the proportion of its population that is less than 15 years old?
  9. Is there a relationship between a country’s GDP per capita and the percentage of its population with access to electricity?
  10. What does the distribution of tertiary education look like?

Optional material

Additional graphs

Below you can find some other common graph types. Note that I have used the ggplot2 package, but you may want to learn how to make these in base R.

# Some unusually shaped box plots

#library(ggplot2) #load the library
#data(mtcars)  # load mtcars if it is not in your environment


ggplot(mtcars, aes(x=as.factor(cyl),y=mpg))+
  geom_boxplot(notch = TRUE)    # notch = TRUE narrows the box around the median, which helps to compare medians across groups


ggplot(mtcars, aes(x=as.factor(cyl),y=mpg))+
  geom_violin()   


#QQ graphs

ggplot(mtcars, aes(sample=mpg))+stat_qq()

Further tools

  • You could find available colors by running colors() in your console.
  • We covered only some basic optional arguments, i.e., graphical parameters. You can see all of them by running ?par.

Adding points to an existing graph

Since ggplot2 works well with layers, we could add the additional data points with +.

library(ggplot2)


ggplot(data=df )+
  geom_point( mapping=aes(x= educTer , y=govEducExp), col="steelblue3" ) +
  geom_point(mapping = aes(x = educPri, y = govEducExp), col = "violetred") +
  labs( title= "Government Expenditure and Education",
        x="Education",
        y="Government Expenditure")+
  theme_minimal()

Multigraphs side-by-side

There are multiple ways to combine graphs, but the gridExtra package is very convenient for this task.

library(ggplot2)
library(gridExtra)

# Create the first ggplot plot
plot1 <- ggplot(data = df) +
  geom_point(mapping = aes(x = educTer, y = govEducExp), col = "steelblue3") +
  labs(title = "Government Expenditure and Education", x = "Education", y = "Government Expenditure") +
  theme_minimal()

# Create the second ggplot plot
plot2 <- ggplot(data = df) +
  geom_point(mapping = aes(x = educTer, y = educPri), col = "tomato3") +
  labs(title = "Tertiary and Primary Education", x = "Tertiary Education", y = "Primary Education") +
  theme_minimal()

grid.arrange(plot1, plot2, ncol = 2)

Takeaways

  • Getting data in different forms
  • Basic usage of ggplot2 in the tidyverse ecosystem
