Information
Summary
Interpretation
In this section, I will be using the Nile data set, which is built into R.
Nile
## Time Series:
## Start = 1871
## End = 1970
## Frequency = 1
## [1] 1120 1160 963 1210 1160 1160 813 1230 1370 1140 995 935 1110 994 1020
## [16] 960 1180 799 958 1140 1100 1210 1150 1250 1260 1220 1030 1100 774 840
## [31] 874 694 940 833 701 916 692 1020 1050 969 831 726 456 824 702
## [46] 1120 1100 832 764 821 768 845 864 862 698 845 744 796 1040 759
## [61] 781 865 845 944 984 897 822 1010 771 676 649 846 812 742 801
## [76] 1040 860 874 848 890 744 749 838 1050 918 986 797 923 975 815
## [91] 1020 906 901 1170 912 746 919 718 714 740
Metadata
The data contains measurements of the annual flow of the river Nile at Aswan (formerly Assuan), recorded from 1871 to 1970.
It is a time series containing 100 data points, one for each year.
Source: Durbin, J. and Koopman, S. J. (2001). Time Series Analysis by State Space Methods. Oxford University Press. http://www.ssfpack.com/DKbook.html
| Information | Value |
|---|---|
| Field Label | Flow |
| Variable | Flow Rate (in 10^8 m^3) |
| Variable Type | Numeric |
| Allowable Values | 456-1370 |
First, I want to visualize the data set to make it easier to read. Because it is a time series, a line chart is appropriate.
plot(Nile, col = "red", lwd=2, xlab = "Year", ylab = "Flow Rate (in 10^8 m^3)", main = "Annual Flow of River Nile, 1871-1970")
In addition to the defaults, I added a chart title and axis titles with units. I also made the line thicker and changed its color to red so that it stands out. Although the flow fluctuates sharply from year to year, the chart appears to show an overall decreasing trend over the observation period.
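The apparent downward trend can also be checked numerically. The sketch below is one possible check and is not part of the original analysis: it fits a straight line of flow against year with base R's lm and overlays it, together with a lowess smoother, on the same chart. The names year, flow, trend_fit, and slope are illustrative.

```r
# Illustrative trend check; variable names are assumptions, not from the original
year <- as.numeric(time(Nile))   # 1871, 1872, ..., 1970
flow <- as.numeric(Nile)

trend_fit <- lm(flow ~ year)         # ordinary least-squares straight line
slope <- coef(trend_fit)[["year"]]   # change in flow per year; negative means decreasing

plot(Nile, col = "red", lwd = 2, xlab = "Year", ylab = "Flow Rate (in 10^8 m^3)",
     main = "Annual Flow of River Nile, 1871-1970")
abline(trend_fit, lty = 2)                # dashed linear trend
lines(lowess(year, flow), col = "blue")   # smooth curve, less tied to a straight-line shape
slope
```

A straight line is only a crude summary of a series like this, so the smoother is drawn alongside it to show whether the decline is gradual or concentrated in part of the period.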
Second, I show a summary of the data set to present its key statistics.
library(memisc)
codebook(Nile)
## ================================================================================
##
## Nile
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: 456.0000000
## Max: 1370.0000000
## Mean: 919.3500000
## Std.Dev.: 168.3792371
## Skewness: 0.3223697
## Kurtosis: -0.3049068
fivenum(Nile)
## [1] 456.0 798.0 893.5 1035.0 1370.0
I used the memisc package to automatically create a codebook, which summarizes the data by displaying the minimum and maximum values, mean, standard deviation, skewness, and kurtosis. To supplement that information, I used the fivenum function to display, in order, the minimum, lower hinge (roughly Q1), median (Q2), upper hinge (roughly Q3), and maximum. Together, this information gives a rough picture of the location, spread, and shape of the distribution.
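Much of the same summary can be reproduced in base R without extra packages. The following is a minimal sketch; the moment-based formulas for skewness and excess kurtosis are a common convention assumed here, not a statement of how memisc computes them, and nile_summary is an illustrative name.

```r
x <- as.numeric(Nile)

# Central moments with an n denominator
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

nile_summary <- c(
  min = min(x),
  max = max(x),
  mean = mean(x),
  sd_sample = sd(x),               # n - 1 denominator (base R default)
  sd_population = sqrt(m2),        # n denominator
  skewness = m3 / m2^1.5,          # moment-based skewness
  excess_kurtosis = m4 / m2^2 - 3  # 0 for a normal distribution
)
round(nile_summary, 4)

fivenum(x)    # Tukey's five-number summary (hinges)
quantile(x)   # default quantiles; Q1 and Q3 can differ slightly from the hinges
```

Because sd() divides by n - 1, it is always slightly larger than the n-denominator value; comparing both against the codebook's Std.Dev. shows which convention the codebook used.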
library(dplyr)
First, I am going to create a simple data set.
Name <- c("A","B","C","D","E","F","G","H","I","J")
Weight <- c(50, 79, 59, 66, 57, 73, 64, 56, 52, 60)            # kg, stored as numbers
Height <- c(159, 180, 163, 171, 165, 173, 160, 165, 175, 167)  # cm, stored as numbers
df <- data.frame(Name,Weight,Height)
df
## Name Weight Height
## 1 A 50 159
## 2 B 79 180
## 3 C 59 163
## 4 D 66 171
## 5 E 57 165
## 6 F 73 173
## 7 G 64 160
## 8 H 56 165
## 9 I 52 175
## 10 J 60 167
Next, I demonstrate several dplyr data manipulation functions: rename, filter, mutate, and left_join.
rename(df,"Weight(kg)"=Weight,"Height(cm)"=Height)
## Name Weight(kg) Height(cm)
## 1 A 50 159
## 2 B 79 180
## 3 C 59 163
## 4 D 66 171
## 5 E 57 165
## 6 F 73 173
## 7 G 64 160
## 8 H 56 165
## 9 I 52 175
## 10 J 60 167
filter(df, Weight<60)
## Name Weight Height
## 1 A 50 159
## 2 C 59 163
## 3 E 57 165
## 4 H 56 165
## 5 I 52 175
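One caveat about a condition such as Weight < 60 is that the comparison follows the column's type. The sketch below uses a hypothetical data frame, demo, that is not part of the original data, to show how the same condition behaves on a text column versus a numeric column.

```r
library(dplyr)

# Hypothetical data: the same weights stored as text and as numbers
demo <- data.frame(
  Name = c("A", "B", "C"),
  Weight_chr = c("50", "100", "79"),
  Weight_num = c(50, 100, 79),
  stringsAsFactors = FALSE
)

# With a text column, 60 is coerced to "60" and strings are compared
# character by character, so "100" counts as less than "60"
by_text <- filter(demo, Weight_chr < 60)

# With a numeric column, values are compared by size
by_number <- filter(demo, Weight_num < 60)

by_text$Name
by_number$Name
```

Storing measurements as numbers avoids this pitfall, which would matter as soon as a weight reached three digits.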
BMI <- c(19.78, 24.38, 22.21, 22.57, 20.94, 24.39, 25.00, 20.57, 16.98, 21.51)  # stored as numbers
mutate(df,BMI)
## Name Weight Height BMI
## 1 A 50 159 19.78
## 2 B 79 180 24.38
## 3 C 59 163 22.21
## 4 D 66 171 22.57
## 5 E 57 165 20.94
## 6 F 73 173 24.39
## 7 G 64 160 25.00
## 8 H 56 165 20.57
## 9 I 52 175 16.98
## 10 J 60 167 21.51
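The BMI column above is supplied as a literal vector. As an alternative sketch, mutate can derive it from existing columns, assuming the usual definition of BMI as weight in kilograms divided by the square of height in metres, and assuming the measurements are stored as numbers. The data frame people is an illustrative copy of the first three rows.

```r
library(dplyr)

# Illustrative subset of the data, with numeric columns
people <- data.frame(
  Name = c("A", "B", "C"),
  Weight = c(50, 79, 59),     # kg
  Height = c(159, 180, 163),  # cm
  stringsAsFactors = FALSE
)

# Convert height to metres, then compute and round BMI
people_bmi <- mutate(people, BMI = round(Weight / (Height / 100)^2, 2))
people_bmi
```

Deriving the column this way keeps it consistent with Weight and Height if either value is later corrected.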
Name <- c("A","B","C","D","E","F","G","H","I","J")
Age <- c(20, 22, 49, 18, 35, 29, 39, 32, 45, 36)  # years, stored as numbers
Hobby <- c("Exercising","Watching Movies","Dancing","Hiking","Swimming","Reading","Painting","Singing","Studying","Travelling")
df2 <- data.frame(Name,Age,Hobby)
df2
## Name Age Hobby
## 1 A 20 Exercising
## 2 B 22 Watching Movies
## 3 C 49 Dancing
## 4 D 18 Hiking
## 5 E 35 Swimming
## 6 F 29 Reading
## 7 G 39 Painting
## 8 H 32 Singing
## 9 I 45 Studying
## 10 J 36 Travelling
left_join(df,df2)
## Joining, by = "Name"
## Name Weight Height Age Hobby
## 1 A 50 159 20 Exercising
## 2 B 79 180 22 Watching Movies
## 3 C 59 163 49 Dancing
## 4 D 66 171 18 Hiking
## 5 E 57 165 35 Swimming
## 6 F 73 173 29 Reading
## 7 G 64 160 39 Painting
## 8 H 56 165 32 Singing
## 9 I 52 175 45 Studying
## 10 J 60 167 36 Travelling
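The message Joining, by = "Name" indicates that left_join chose the shared column automatically. The short sketch below, using hypothetical data frames left_tbl and right_tbl, shows how to state the key explicitly and what happens when a row on the left has no match on the right.

```r
library(dplyr)

# Hypothetical tables; "K" has no counterpart in right_tbl
left_tbl <- data.frame(Name = c("A", "B", "K"), Weight = c(50, 79, 68),
                       stringsAsFactors = FALSE)
right_tbl <- data.frame(Name = c("A", "B"), Age = c(20, 22),
                        stringsAsFactors = FALSE)

# Naming the key avoids the message and guards against accidental matches
joined <- left_join(left_tbl, right_tbl, by = "Name")
joined
```

Every row of the left table is kept, and columns from the right table are filled with NA where no match exists.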