Q1(a)

Find and get a dataset from the datasets available within R. Perform exploratory data analysis (EDA) and prepare a codebook on that dataset using a newer method in R.


Ans: I am going to be using the dataset LifeCycleSavings for 1(a)

First I am going to make the codebook then I will do the EDA (Exploratory Data Analysis)


Codebook (coding)

help("LifeCycleSavings")

As this dataset is already available in R, using the function help() I got decent information to work with, all the information in the codebook is from the help() function about the LifeCycleSavings dataset.


Newer Method:

A newer and easier method of creating codebooks is through the library dataMaid. This library contains the function makeCodebook(), which accepts a data frame and returns a summary of its variables and also a graphic visualization for each variable. The output of the makeCodebook() function is a Rmarkdown file.

library(dataMaid)
makeCodebook(LifeCycleSavings)

Now, with the help of the makeCodebook() function from the library dataMaid, I got detailed information on each variable in a Rmarkdown file which I edited accordingly and attached with the Description of the data and its source and references.


Some Basic Edits:

Firstly, what I did was I added labels for each column, I got the information from the help() function:

attr(LifeCycleSavings$sr, "labels") <- "aggregate personal savings"
attr(LifeCycleSavings$pop15, "labels") <- "% of population under 15"
attr(LifeCycleSavings$pop75, "labels") <- "% of population over 75"
attr(LifeCycleSavings$dpi, "labels") <- "real per-capita disposable income"
attr(LifeCycleSavings$ddpi, "labels") <- "% growth rate of dpi"

Then I also added short description for some important variables:

attr(LifeCycleSavings$sr, "shortDescription") <- "savings ratio (aggregate personal saving divided by disposable income)"
attr(LifeCycleSavings$dpi, "shortDescription") <- "It is calculated by taking income earned from all sources minus taxes, savings and non-tax payments and dividing by the total population."

Codebook (start)

This is the head of the table
sr pop15 pop75 dpi ddpi
Australia 11.43 29.35 2.87 2329.68 2.87
Austria 12.07 23.32 4.41 1507.99 3.93
Belgium 13.17 23.80 4.43 2108.47 3.82
Bolivia 5.75 41.89 1.67 189.13 0.22
Brazil 12.88 42.19 0.83 728.47 4.56
Canada 8.79 31.72 2.85 2982.88 2.43


Data report overview

The dataset examined has the following dimensions:

Feature Result
Number of observations 50
Number of variables 6

Codebook summary table

Label Variable Class # unique values Missing Description
Name of the country [row.names] character 50 0.00 %
aggregate personal savings [sr] numeric 49 0.00 % savings ratio (aggregate personal saving divided by disposable income)
% of population under 15 [pop15] numeric 50 0.00 %
% of population over 75 [pop75] numeric 46 0.00 %
real per-capita disposable income [dpi] numeric 50 0.00 % It is calculated by taking income earned from all sources minus taxes, savings and non-tax payments and dividing by the total population.
% growth rate of dpi [ddpi] numeric 48 0.00 %

Variable list


row.names

  • The variable is a key (distinct values for each observation).

sr

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 49
Median 10.51
1st and 3rd quartiles 6.97; 12.62
Min. and max. 0.6; 21.1


pop15

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 50
Median 32.58
1st and 3rd quartiles 26.22; 44.06
Min. and max. 21.44; 47.64


pop75

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 46
Median 2.17
1st and 3rd quartiles 1.12; 3.33
Min. and max. 0.56; 4.7


dpi

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 50
Median 695.66
1st and 3rd quartiles 288.21; 1795.62
Min. and max. 88.94; 4001.89


ddpi

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 48
Median 3
1st and 3rd quartiles 2; 4.48
Min. and max. 0.22; 16.71

Description

Data on the savings ratio 1960–1970. Under the life-cycle savings hypothesis as developed by Franco Modigliani, the savings ratio (aggregate personal saving divided by disposable income) is explained by per-capita disposable income, the percentage rate of change in per-capita disposable income, and two demographic variables: the percentage of population less than 15 years old and the percentage of the population over 75 years old. The data are averaged over the decade 1960–1970 to remove the business cycle or other short-term fluctuations.

Source

The data were obtained from Belsley, Kuh and Welsch (1980). They in turn obtained the data from Sterling (1977).

Refrences

Sterling, Arnie (1977) Unpublished BS Thesis. Massachusetts Institute of Technology.

Belsley, D. A., Kuh. E. and Welsch, R. E. (1980) Regression Diagnostics. New York: Wiley.

Codebook(end)


EDA (Exploratory Data Analysis)

Firstly, I looked at how the table looks:

data <- head(LifeCycleSavings)
knitr::kable(data,caption = "This is the head of the table")
This is the head of the table
sr pop15 pop75 dpi ddpi
Australia 11.43 29.35 2.87 2329.68 2.87
Austria 12.07 23.32 4.41 1507.99 3.93
Belgium 13.17 23.80 4.43 2108.47 3.82
Bolivia 5.75 41.89 1.67 189.13 0.22
Brazil 12.88 42.19 0.83 728.47 4.56
Canada 8.79 31.72 2.85 2982.88 2.43

Then, I checked for any missing values in LifeCycleSavings:

sum(is.na(LifeCycleSavings))
## [1] 0

As can be seen there are no missing values.


Then I checked for Outliers using boxplot

boxplot(LifeCycleSavings$sr, ylab = "sr")

boxplot(LifeCycleSavings$pop15, ylab = "pop15")

boxplot(LifeCycleSavings$pop75, ylab = "pop75")

boxplot(LifeCycleSavings$dpi, ylab = "dpi")

boxplot(LifeCycleSavings$ddpi, ylab = "ddpi")

Only the column ddpi seems to have two outliers.


Another problem I had was the column for countries, it has no variable, so I copied the original data set and assigned the name country to the first column;

LifeCycleSavings0 <- data.frame(LifeCycleSavings)
library(data.table)
setDT(LifeCycleSavings0, keep.rownames = "country")[]
head(LifeCycleSavings0)
##      country    sr pop15 pop75     dpi ddpi
## 1: Australia 11.43 29.35  2.87 2329.68 2.87
## 2:   Austria 12.07 23.32  4.41 1507.99 3.93
## 3:   Belgium 13.17 23.80  4.43 2108.47 3.82
## 4:   Bolivia  5.75 41.89  1.67  189.13 0.22
## 5:    Brazil 12.88 42.19  0.83  728.47 4.56
## 6:    Canada  8.79 31.72  2.85 2982.88 2.43

Now after all the corrections, the summary of the data frame looks like this:

summary(LifeCycleSavings0)
##    country                sr             pop15           pop75      
##  Length:50          Min.   : 0.600   Min.   :21.44   Min.   :0.560  
##  Class :character   1st Qu.: 6.970   1st Qu.:26.21   1st Qu.:1.125  
##  Mode  :character   Median :10.510   Median :32.58   Median :2.175  
##                     Mean   : 9.671   Mean   :35.09   Mean   :2.293  
##                     3rd Qu.:12.617   3rd Qu.:44.06   3rd Qu.:3.325  
##                     Max.   :21.100   Max.   :47.64   Max.   :4.700  
##       dpi               ddpi       
##  Min.   :  88.94   Min.   : 0.220  
##  1st Qu.: 288.21   1st Qu.: 2.002  
##  Median : 695.66   Median : 3.000  
##  Mean   :1106.76   Mean   : 3.758  
##  3rd Qu.:1795.62   3rd Qu.: 4.478  
##  Max.   :4001.89   Max.   :16.710

Now the data is ready to be visualized

Firstly, I looked at which country has the highest per-capita disposable income (I flipped the graph so it was easier to see the country names):

ggplot(data=LifeCycleSavings0,aes(x=reorder(country,dpi),y=dpi)) + 
  geom_bar(stat ='identity')+
  theme_grey() +
  coord_flip()+
  labs(title = 'Ranking of Countries by Per-capita disposable income',y='Per-capita disposable income',x='Countries')


Then I wanted to know in which country is the income rising at the fastest rate (ddpi)

ggplot(data=LifeCycleSavings0,aes(x=reorder(country,ddpi),y=ddpi)) + 
  geom_bar(stat ='identity')+
  theme_grey() +
  coord_flip()+
  labs(title = 'Ranking of Countries by growth rate of per-capita disposable income',y='growth rate',x='Countries')

In both the above graphs, the top 10 countries were different. So, I hypothesize that dpi and ddpi and not positively correlated.


To check if my hypothesis is correct or not. I looked for a correlation between per-capita disposable income and its growth rate:

ggplot(data=LifeCycleSavings0, aes(x=dpi, y=ddpi)) +
  geom_point() + geom_smooth()

After looking at this graph it seems to me that countries with lower income have a slightly higher growth rate, when compared to countries with high income. Therefore, there is a slightly negative relationship between dpi and ddpi, this means my hypothesis is correct.


Another interesting relation I found was that countries with higher income show to have less children (as pop15 is less compared to others) as you can see in the graph below:

ggplot(data=LifeCycleSavings0, aes(x=dpi, y=pop15)) +
  geom_point() + geom_smooth()


One more interesting thing I found was that higher income countries have more elders (% population over 75) (as can be seen from the graph below). I assume that this is due to better health care.

ggplot(data=LifeCycleSavings0, aes(x=dpi, y=pop75)) +
  geom_point() + geom_smooth()


Conclusion from the EDA

In this EDA I looked for missing values and mistakes and tried to correct them if they were present. I also looked for outliers in the data. I looked at two bar graphs and made a hypothesis, and made a scatter plot to evaluate if my hypothesis was correct or not. I also gained insight on the relationship between age group population percentage and per-capita income by looking at two different scatterplots.


EDA(end)


Q1(b)

Demonstrate these FIVE (5) functions of dplyr for data manipulation:
i. filter ( )
ii. arrange ( )
iii. mutate ( )
iv. select ( )
v. summarise ( )
You can create your own sensible dataset in certain context for this question with at least 15 observations (rows) and 4 features (columns) or you can get any suitable dataset online.
Show the R code and provide a short explanation on what each function does for each produced output.

I created the following table containing the marks of the students for the 3 subjects maths, chemistry, and physics. It also contains their respective student ID and gender.

library(dplyr)
st_id = c("st001", "st002", "st003", "st004", "st005", "st006", "st007", "st008", "st009", "st010", "st011", "st012", "st013", "st014", "st015", "st016")
st_gender = c("M", "M", "F", "F", "M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "F")
maths = c(56 , 75, 56, 87 ,49, 78 ,56 ,78, 48, 89, 56, 78, 56, 76, 78, 85)
chemistry = c(56, 78, 56, 76, 78, 85, 56 , 75 ,56 ,78, 48, 89, 56, 87 ,49, 78)
physics = c(48, 89, 56, 87, 49, 78, 67, 75, 56, 45, 98, 56, 87, 76, 68, 76)
student_marks = data.frame(st_id, st_gender, maths, chemistry, physics)
student_marks
##    st_id st_gender maths chemistry physics
## 1  st001         M    56        56      48
## 2  st002         M    75        78      89
## 3  st003         F    56        56      56
## 4  st004         F    87        76      87
## 5  st005         M    49        78      49
## 6  st006         F    78        85      78
## 7  st007         M    56        56      67
## 8  st008         M    78        75      75
## 9  st009         F    48        56      56
## 10 st010         F    89        78      45
## 11 st011         M    56        48      98
## 12 st012         M    78        89      56
## 13 st013         F    56        56      87
## 14 st014         M    76        87      76
## 15 st015         F    78        49      68
## 16 st016         F    85        78      76

1. filter()

The filter() function subsets a data frame, keeping only the rows that meet a given condition. The row must give a TRUE value for the condition in order to be kept. If the row doesn’t meet the condition it will be dropped. So it will only show rows that you want to see.


For example when I enter the code below:

filter(student_marks, maths>70)
##   st_id st_gender maths chemistry physics
## 1 st002         M    75        78      89
## 2 st004         F    87        76      87
## 3 st006         F    78        85      78
## 4 st008         M    78        75      75
## 5 st010         F    89        78      45
## 6 st012         M    78        89      56
## 7 st014         M    76        87      76
## 8 st015         F    78        49      68
## 9 st016         F    85        78      76

I get a data frame as a result where all students marks are greater than 70.


I can also enter another condition and attach it using & (AND) or | (OR). Look at the code below:

filter(student_marks, maths>70 & st_gender == "M")
##   st_id st_gender maths chemistry physics
## 1 st002         M    75        78      89
## 2 st008         M    78        75      75
## 3 st012         M    78        89      56
## 4 st014         M    76        87      76

This output gives only male students who scored more than 70 in maths.


You can also enter functions into filter(). Look at the code below:

filter(student_marks, maths>mean(maths))
##   st_id st_gender maths chemistry physics
## 1 st002         M    75        78      89
## 2 st004         F    87        76      87
## 3 st006         F    78        85      78
## 4 st008         M    78        75      75
## 5 st010         F    89        78      45
## 6 st012         M    78        89      56
## 7 st014         M    76        87      76
## 8 st015         F    78        49      68
## 9 st016         F    85        78      76

This output gives students who scored more than average in maths.


2. arrange()

The arrange() function, arranges the data frame according to a given column variable. It is is used to reorganize table rows according to the variable given. This means it changes the order of the rows in the table (either ascending or descending) according to the column you gave it.


Example: This function arranges the data frame in ascending order of chemistry marks scored.

arrange(student_marks, chemistry)
##    st_id st_gender maths chemistry physics
## 1  st011         M    56        48      98
## 2  st015         F    78        49      68
## 3  st001         M    56        56      48
## 4  st003         F    56        56      56
## 5  st007         M    56        56      67
## 6  st009         F    48        56      56
## 7  st013         F    56        56      87
## 8  st008         M    78        75      75
## 9  st004         F    87        76      87
## 10 st002         M    75        78      89
## 11 st005         M    49        78      49
## 12 st010         F    89        78      45
## 13 st016         F    85        78      76
## 14 st006         F    78        85      78
## 15 st014         M    76        87      76
## 16 st012         M    78        89      56

You can also arrange the data frame in descending order by using the function desc(). Example below:

arrange(student_marks, desc(chemistry))
##    st_id st_gender maths chemistry physics
## 1  st012         M    78        89      56
## 2  st014         M    76        87      76
## 3  st006         F    78        85      78
## 4  st002         M    75        78      89
## 5  st005         M    49        78      49
## 6  st010         F    89        78      45
## 7  st016         F    85        78      76
## 8  st004         F    87        76      87
## 9  st008         M    78        75      75
## 10 st001         M    56        56      48
## 11 st003         F    56        56      56
## 12 st007         M    56        56      67
## 13 st009         F    48        56      56
## 14 st013         F    56        56      87
## 15 st015         F    78        49      68
## 16 st011         M    56        48      98

You can also arrange by using two variables, for this the values in one of the variable needs to be repetitive, for example student gender. The below code first arranges by the gender, then for each gender it arranges by marks obtained in chemistry.

arrange(student_marks, st_gender,  chemistry)
##    st_id st_gender maths chemistry physics
## 1  st015         F    78        49      68
## 2  st003         F    56        56      56
## 3  st009         F    48        56      56
## 4  st013         F    56        56      87
## 5  st004         F    87        76      87
## 6  st010         F    89        78      45
## 7  st016         F    85        78      76
## 8  st006         F    78        85      78
## 9  st011         M    56        48      98
## 10 st001         M    56        56      48
## 11 st007         M    56        56      67
## 12 st008         M    78        75      75
## 13 st002         M    75        78      89
## 14 st005         M    49        78      49
## 15 st014         M    76        87      76
## 16 st012         M    78        89      56

3. mutate()

The mutate() function is a function for creating new variables. What this means is that it can be used to create and add new columns to an existing data frame. There are many ways of doing this.


Firstly, let us add a new column comp_sc which contains the mark of the students for the subject of Computer science.

comp_sc = c(76, 64, 56, 45, 67, 87, 67, 60, 78, 76, 86, 67, 89, 56, 67, 78)
mutate(student_marks,  comp_sc)
##    st_id st_gender maths chemistry physics comp_sc
## 1  st001         M    56        56      48      76
## 2  st002         M    75        78      89      64
## 3  st003         F    56        56      56      56
## 4  st004         F    87        76      87      45
## 5  st005         M    49        78      49      67
## 6  st006         F    78        85      78      87
## 7  st007         M    56        56      67      67
## 8  st008         M    78        75      75      60
## 9  st009         F    48        56      56      78
## 10 st010         F    89        78      45      76
## 11 st011         M    56        48      98      86
## 12 st012         M    78        89      56      67
## 13 st013         F    56        56      87      89
## 14 st014         M    76        87      76      56
## 15 st015         F    78        49      68      67
## 16 st016         F    85        78      76      78

You can also enter a single value to be assigned to all the rows, example:

result = "pass"
mutate(student_marks,  result)
##    st_id st_gender maths chemistry physics result
## 1  st001         M    56        56      48   pass
## 2  st002         M    75        78      89   pass
## 3  st003         F    56        56      56   pass
## 4  st004         F    87        76      87   pass
## 5  st005         M    49        78      49   pass
## 6  st006         F    78        85      78   pass
## 7  st007         M    56        56      67   pass
## 8  st008         M    78        75      75   pass
## 9  st009         F    48        56      56   pass
## 10 st010         F    89        78      45   pass
## 11 st011         M    56        48      98   pass
## 12 st012         M    78        89      56   pass
## 13 st013         F    56        56      87   pass
## 14 st014         M    76        87      76   pass
## 15 st015         F    78        49      68   pass
## 16 st016         F    85        78      76   pass

You can also enter a function or an equation in the mutate() function:

mutate(student_marks,  average = (maths+chemistry+physics)/3)
##    st_id st_gender maths chemistry physics  average
## 1  st001         M    56        56      48 53.33333
## 2  st002         M    75        78      89 80.66667
## 3  st003         F    56        56      56 56.00000
## 4  st004         F    87        76      87 83.33333
## 5  st005         M    49        78      49 58.66667
## 6  st006         F    78        85      78 80.33333
## 7  st007         M    56        56      67 59.66667
## 8  st008         M    78        75      75 76.00000
## 9  st009         F    48        56      56 53.33333
## 10 st010         F    89        78      45 70.66667
## 11 st011         M    56        48      98 67.33333
## 12 st012         M    78        89      56 74.33333
## 13 st013         F    56        56      87 66.33333
## 14 st014         M    76        87      76 79.66667
## 15 st015         F    78        49      68 65.00000
## 16 st016         F    85        78      76 79.66667

4. select()

The select() function allows users to select variables (columns) in R. The select() function selects the columns based on a given condition by the user. With this function you can display only the columns you want to be displayed.


Firstly, You can select any number of columns using this function, example: This selects all columns between st_gender and chemistry.

select(student_marks, st_gender:chemistry)
##    st_gender maths chemistry
## 1          M    56        56
## 2          M    75        78
## 3          F    56        56
## 4          F    87        76
## 5          M    49        78
## 6          F    78        85
## 7          M    56        56
## 8          M    78        75
## 9          F    48        56
## 10         F    89        78
## 11         M    56        48
## 12         M    78        89
## 13         F    56        56
## 14         M    76        87
## 15         F    78        49
## 16         F    85        78

You can also pass functions to select columns, for the example below the function selects all columns that start with “s”:

select(student_marks, starts_with("s"))
##    st_id st_gender
## 1  st001         M
## 2  st002         M
## 3  st003         F
## 4  st004         F
## 5  st005         M
## 6  st006         F
## 7  st007         M
## 8  st008         M
## 9  st009         F
## 10 st010         F
## 11 st011         M
## 12 st012         M
## 13 st013         F
## 14 st014         M
## 15 st015         F
## 16 st016         F

You can also select all columns and ignore a few, for the example below the function selects all columns except for st_gender:

select(student_marks, -st_gender)
##    st_id maths chemistry physics
## 1  st001    56        56      48
## 2  st002    75        78      89
## 3  st003    56        56      56
## 4  st004    87        76      87
## 5  st005    49        78      49
## 6  st006    78        85      78
## 7  st007    56        56      67
## 8  st008    78        75      75
## 9  st009    48        56      56
## 10 st010    89        78      45
## 11 st011    56        48      98
## 12 st012    78        89      56
## 13 st013    56        56      87
## 14 st014    76        87      76
## 15 st015    78        49      68
## 16 st016    85        78      76

5. summarise()

The summarise() function allows users to summarise observations in a dataframe, it takes the dataframe and the summary function as parameters. This is very helpful if you wanted to know about the maximum, minimum or average values in a given column.


For example for our data frame students_marksyou can easily use the function to find the maximum value scored in maths:

summarise(student_marks, max(maths))
##   max(maths)
## 1         89

But now you want to find out which gender scored the highest in maths, so you type the following function. Here, the maximum marks are grouped by the student’s gender using the function group_by().

summarise(group_by(student_marks, st_gender), max(maths))
## # A tibble: 2 x 2
##   st_gender `max(maths)`
##   <chr>            <dbl>
## 1 F                   89
## 2 M                   78

You can also use the summarise function to check for logical expression, for example we need to check if anyone scored above 90 in maths. The output is either True or False.

summarise(student_marks, any(maths>90))
##   any(maths > 90)
## 1           FALSE

Q1(b) (ended)