Introduction to Data Science: Alternative Assessment 2


a. Interpreting Confusion Matrix

Information

  • Total cases (n) = 195
  • True Positives (TP) = 120
  • False Positives (FP) = 10
  • False Negatives (FN) = 15
  • True Negatives (TN) = 50

Summary

  • Correct predictions = TP + TN = 120 + 50 = 170
  • Incorrect predictions = FP + FN = 10 + 15 = 25
  • Accuracy = (TP + TN) / n = (120 + 50) / 195 = 0.87
  • Error rate = (FP + FN) / n = (10 + 15) / 195 = 0.13
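
As a quick check, the summary figures can be reproduced in R from the four counts. This is only a sketch: the object names (cm, accuracy, error_rate) are my own, and placing predictions in rows and actual classes in columns is an assumption about the layout of the matrix.

```r
# Confusion matrix built from the counts listed under "Information".
# Assumed layout: rows = predicted class, columns = actual class.
cm <- matrix(c(120, 10,    # predicted positive: TP, FP
                15, 50),   # predicted negative: FN, TN
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("Positive", "Negative"),
                             Actual    = c("Positive", "Negative")))

n          <- sum(cm)              # total number of cases
correct    <- sum(diag(cm))        # TP + TN sit on the diagonal
accuracy   <- correct / n
error_rate <- 1 - accuracy         # same as (FP + FN) / n

round(c(accuracy = accuracy, error_rate = error_rate), 2)
```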

Interpretation

  • Precision = TP / (TP + FP) = 120 / (120 + 10) = 0.92
  • Negative Predictive Value = TN / (TN + FN) = 50 / (50 + 15) = 0.77
  • Sensitivity = TP / (TP + FN) = 120 / (120 + 15) = 0.89
  • Specificity = TN / (TN + FP) = 50 / (50 + 10) = 0.83
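
The four ratios above can also be computed directly in R. The sketch below uses lower-case variable names of my own choosing (tp, fp, fn, tn) and simply applies the formulas listed above.

```r
# Counts taken from the confusion matrix in this section
tp <- 120; fp <- 10; fn <- 15; tn <- 50

metrics <- c(
  precision   = tp / (tp + fp),  # share of predicted positives that are correct
  npv         = tn / (tn + fn),  # share of predicted negatives that are correct
  sensitivity = tp / (tp + fn),  # share of actual positives that are detected
  specificity = tn / (tn + fp)   # share of actual negatives that are detected
)

round(metrics, 2)
```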

b. Exploratory Data Analysis and Codebook

In this section, I use the Nile data set, which is built into R.

Nile
## Time Series:
## Start = 1871 
## End = 1970 
## Frequency = 1 
##   [1] 1120 1160  963 1210 1160 1160  813 1230 1370 1140  995  935 1110  994 1020
##  [16]  960 1180  799  958 1140 1100 1210 1150 1250 1260 1220 1030 1100  774  840
##  [31]  874  694  940  833  701  916  692 1020 1050  969  831  726  456  824  702
##  [46] 1120 1100  832  764  821  768  845  864  862  698  845  744  796 1040  759
##  [61]  781  865  845  944  984  897  822 1010  771  676  649  846  812  742  801
##  [76] 1040  860  874  848  890  744  749  838 1050  918  986  797  923  975  815
##  [91] 1020  906  901 1170  912  746  919  718  714  740

Metadata

The data contains the measurements of the annual flow of the river Nile at Aswan (formerly Assuan), recorded from 1871–1970.

It is a time series containing 100 data points, one for each year.

Source: Durbin, J. and Koopman, S. J. (2001). Time Series Analysis by State Space Methods. Oxford University Press. http://www.ssfpack.com/DKbook.html

Information       Value
Field Label       Flow
Variable          Flow Rate (in 10^8 m^3)
Variable Type     Numeric
Allowable Values  456-1370

First, I visualize the data set to make it easier to read. Because it is a time series, a line chart is appropriate.

plot(Nile, col = "red", lwd=2, xlab = "Year", ylab = "Flow Rate (in 10^8 m^3)", main = "Annual Flow of River Nile, 1871-1970")

Compared with the default plot, I added a chart title and axis titles that include the units. I also made the line thicker and colored it red so that it stands out. Although the flow fluctuates sharply from year to year, the chart appears to show an overall decreasing trend over the observation period.
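
The visual impression of a downward trend can be checked with a couple of rough calculations. The sketch below is only one possible check, not a formal time series analysis: it fits a straight line to flow against year and compares the mean flow in the two halves of the record. The variable names and the split at 1920 are my own choices.

```r
# Nile ships with base R (datasets package), so no extra packages are needed
years <- as.numeric(time(Nile))   # 1871, 1872, ..., 1970
flow  <- as.numeric(Nile)         # annual flow in 10^8 m^3

# Straight-line fit: the slope is the average change in flow per year
fit   <- lm(flow ~ years)
slope <- coef(fit)[["years"]]
slope < 0                         # TRUE indicates a decreasing linear trend

# Compare the average flow in the first and second half of the record
mean_first  <- mean(window(Nile, end = 1920))    # 1871-1920
mean_second <- mean(window(Nile, start = 1921))  # 1921-1970
c(first_half = mean_first, second_half = mean_second)
```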

Second, I summarize the data set to show its key statistics.

library(memisc)
codebook(Nile)
## ================================================================================
## 
##    Nile
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min:  456.0000000
##         Max: 1370.0000000
##        Mean:  919.3500000
##    Std.Dev.:  168.3792371
##    Skewness:    0.3223697
##    Kurtosis:   -0.3049068
fivenum(Nile)
## [1]  456.0  798.0  893.5 1035.0 1370.0

I used the memisc package to automatically create a codebook, which summarizes the data by displaying the minimum and maximum values, mean, standard deviation, skewness, and kurtosis. To supplement that information, I used the fivenum function to display, in order, the minimum, lower hinge (Q1), median (Q2), upper hinge (Q3), and maximum values. Together, this information gives a rough picture of how the data are distributed.
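
The same statistics can be cross-checked with base R alone. The sketch below uses variable names of my own. One point worth hedging: the Std.Dev. shown in the codebook output appears to match the formula that divides by n, whereas sd() divides by n - 1, so the two values differ slightly. Likewise, the hinges from fivenum() need not equal the quartiles from quantile(), which uses a different definition by default.

```r
x <- as.numeric(Nile)   # Nile ships with base R
n <- length(x)

mean_x    <- mean(x)
sample_sd <- sd(x)                           # denominator n - 1
pop_sd    <- sqrt(mean((x - mean_x)^2))      # denominator n

hinges    <- fivenum(x)    # min, lower hinge, median, upper hinge, max
quartiles <- quantile(x)   # default quantile definition (type 7)

c(mean = mean_x, sample_sd = sample_sd, pop_sd = pop_sd)
hinges
quartiles
```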


c. Dplyr Functions

library(dplyr)

First, I am going to create a simple data set.

Name <- c("A","B","C","D","E","F","G","H","I","J")
Weight <- c(50, 79, 59, 66, 57, 73, 64, 56, 52, 60)  # kg, stored as numbers so comparisons are numeric
Height <- c(159, 180, 163, 171, 165, 173, 160, 165, 175, 167)  # cm
df <- data.frame(Name,Weight,Height)
df
##    Name Weight Height
## 1     A     50    159
## 2     B     79    180
## 3     C     59    163
## 4     D     66    171
## 5     E     57    165
## 6     F     73    173
## 7     G     64    160
## 8     H     56    165
## 9     I     52    175
## 10    J     60    167

Then, I am going to demonstrate several data manipulation functions.

  1. Change existing column names: I will rename two columns so that their names include the units.
rename(df,"Weight(kg)"=Weight,"Height(cm)"=Height)
##    Name Weight(kg) Height(cm)
## 1     A         50        159
## 2     B         79        180
## 3     C         59        163
## 4     D         66        171
## 5     E         57        165
## 6     F         73        173
## 7     G         64        160
## 8     H         56        165
## 9     I         52        175
## 10    J         60        167
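
Two details of rename() are easy to miss, so here is a small sketch using a shortened, made-up version of the data frame (the object names people, renamed, and light are my own). First, rename() returns a new data frame and leaves the original untouched unless the result is assigned. Second, names containing parentheses are non-syntactic, so they need backticks when used in later code.

```r
library(dplyr)

# Shortened illustrative data frame (hypothetical subset of the data above)
people <- data.frame(Name = c("A", "B"), Weight = c(50, 79), Height = c(159, 180))

# rename() does not modify `people`; the result must be assigned to keep it
renamed <- rename(people, "Weight(kg)" = Weight, "Height(cm)" = Height)

names(people)    # still Name, Weight, Height
names(renamed)   # Name, Weight(kg), Height(cm)

# Non-syntactic column names are referenced with backticks
light <- filter(renamed, `Weight(kg)` < 60)
light
```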
  2. Pick rows based on their values: I will extract the rows where the person weighs less than 60 kg.
filter(df, Weight<60)
##   Name Weight Height
## 1    A     50    159
## 2    C     59    163
## 3    E     57    165
## 4    H     56    165
## 5    I     52    175
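
A comparison such as Weight < 60 only behaves as intended when the column holds numbers. The sketch below uses a made-up three-row data frame (all names are my own) to show what can happen when measurements are stored as text, for example after importing a file: the comparison is then done character by character, so "100" counts as smaller than "60".

```r
library(dplyr)

# Hypothetical data with the weights stored as text
people <- data.frame(Name = c("A", "B", "C"),
                     Weight_chr = c("50", "79", "100"),
                     stringsAsFactors = FALSE)

# Convert once, then filter on the numeric column
people <- mutate(people, Weight_num = as.numeric(Weight_chr))

by_text   <- filter(people, Weight_chr < 60)  # text comparison: keeps A and C
by_number <- filter(people, Weight_num < 60)  # numeric comparison: keeps only A

by_text$Name
by_number$Name
```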
  3. Add new columns to the data frame: I will add a column containing each person's Body Mass Index (BMI).
BMI <- c(19.78, 24.38, 22.21, 22.57, 20.94, 24.39, 25.00, 20.57, 16.98, 21.51)
mutate(df,BMI)
##    Name Weight Height   BMI
## 1     A     50    159 19.78
## 2     B     79    180 24.38
## 3     C     59    163 22.21
## 4     D     66    171 22.57
## 5     E     57    165 20.94
## 6     F     73    173 24.39
## 7     G     64    160 25.00
## 8     H     56    165 20.57
## 9     I     52    175 16.98
## 10    J     60    167 21.51
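
Instead of typing the BMI values by hand, mutate() can derive them from the existing columns. The sketch below assumes the usual formula, weight in kilograms divided by the square of height in metres, and uses a shortened, made-up copy of the data (the names people and with_bmi are my own).

```r
library(dplyr)

# First three people from the data above, with numeric measurements
people <- data.frame(Name   = c("A", "B", "C"),
                     Weight = c(50, 79, 59),      # kg
                     Height = c(159, 180, 163))   # cm

# BMI = weight (kg) / height (m)^2, rounded to two decimals
with_bmi <- mutate(people, BMI = round(Weight / (Height / 100)^2, 2))
with_bmi
```

Computing the column this way keeps the BMI values consistent with the measurements if a weight or height is later corrected.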
  4. Combine data across two or more data frames: I will combine the original data frame (df) with a second data frame containing personal information. First I construct the new data frame, then I join the two using Name as the matching key.
Name <- c("A","B","C","D","E","F","G","H","I","J")
Age <- c(20, 22, 49, 18, 35, 29, 39, 32, 45, 36)  # years, stored as numbers
Hobby <- c("Exercising","Watching Movies","Dancing","Hiking","Swimming","Reading","Painting","Singing","Studying","Travelling")
df2 <- data.frame(Name,Age,Hobby)
df2
##    Name Age           Hobby
## 1     A  20      Exercising
## 2     B  22 Watching Movies
## 3     C  49         Dancing
## 4     D  18          Hiking
## 5     E  35        Swimming
## 6     F  29         Reading
## 7     G  39        Painting
## 8     H  32         Singing
## 9     I  45        Studying
## 10    J  36      Travelling
left_join(df,df2)
## Joining, by = "Name"
##    Name Weight Height Age           Hobby
## 1     A     50    159  20      Exercising
## 2     B     79    180  22 Watching Movies
## 3     C     59    163  49         Dancing
## 4     D     66    171  18          Hiking
## 5     E     57    165  35        Swimming
## 6     F     73    173  29         Reading
## 7     G     64    160  39        Painting
## 8     H     56    165  32         Singing
## 9     I     52    175  45        Studying
## 10    J     60    167  36      Travelling
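
In the example above every name appears in both data frames, so the choice of join does not matter. The sketch below uses two small made-up data frames (measurements and details are my own names) in which the keys only partly overlap, and it states the key explicitly with by = "Name" rather than relying on the automatic match. It shows that left_join() keeps every row of the first data frame and fills the missing values with NA, while inner_join() keeps only the names found in both.

```r
library(dplyr)

# Hypothetical data: "C" has no personal details, "D" has no measurements
measurements <- data.frame(Name = c("A", "B", "C"), Weight = c(50, 79, 59))
details      <- data.frame(Name = c("A", "B", "D"), Age = c(20, 22, 18))

# Keep all rows of `measurements`; unmatched names get NA for Age
left  <- left_join(measurements, details, by = "Name")

# Keep only names present in both data frames
inner <- inner_join(measurements, details, by = "Name")

left
inner
```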