This document consists of three data science projects. Its primary objective is to showcase a number of essential data science skills to prospective employers.
Data Scientists and analysts can add value to an organisation in a number of ways:
To demonstrate the areas above, it is important to have a combination of the following skills:
This portfolio’s aim is to demonstrate the skills above with three projects which will focus on different areas of the data analysis process.
The first will be a data cleaning project which involves taking messy or unstructured data and cleaning it up in order to perform analysis.
The second is a data storytelling project, with the aim of extracting insights from a data set and using them to persuade others.
The third is an end-to-end project which takes in and processes data and then generates an output. This demonstrates customer-facing code that can be run repeatedly with different data to generate different outputs.
The entirety of this portfolio will be in the R programming language, created using RStudio. No repositories such as GitHub were created for this work, as all of the code used is visible and documented in this portfolio. However, the R Markdown version can be requested from jacksandom@gmail.com.
The purpose of this project is to demonstrate collecting, working with and cleaning a data set. The goal is to prepare tidy data that can be used for later analysis. The brief for this project is taken from the Getting and Cleaning Data course project in the Data Science Specialisation from Coursera. They define tidy data as requiring the following criteria:
One of the most exciting areas in all of data science right now is wearable computing. Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users. The data for this project is collected from the accelerometers from the Samsung Galaxy S smartphone and the primary aim of the project will be to create a tidy data set from the data provided and to perform an analysis using a machine learning technique to classify the data and determine the most important variables.
The data used in this project can be found at the following link:
https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
The data set consists of recordings from 30 participants over a period of time. Each person performed six activities (walking, walking upstairs, walking downstairs, sitting, standing, laying) wearing a smartphone on their waist. Using the embedded accelerometer and gyroscope, a number of measurements were taken. For each record in the data set, the following information is provided:
More information about the data is available from the UCI Machine Learning Repository where the data was obtained:
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
The data set obtained was split into training and test sets with the identifier data also separated out. The aim of the first section of this analysis is to:
The second section of the project will involve using a statistical modelling technique to classify the movements and identify the most important measurements.
The R packages required for this section are data.table, reshape2, randomForest and caret.
library(data.table)
library(reshape2)
library(randomForest)
library(caret)
The next step is to load the data into R. The following code assumes the data is already downloaded and unzipped into the working directory.
# Load train data
trainData <- read.table("./UCI HAR Dataset/train/X_train.txt")
trainLabel <- read.table("./UCI HAR Dataset/train/y_train.txt")
trainSubject <- read.table("./UCI HAR Dataset/train/subject_train.txt")
# Load test data
testData <- read.table("./UCI HAR Dataset/test/X_test.txt")
testLabel <- read.table("./UCI HAR Dataset/test/y_test.txt")
testSubject <- read.table("./UCI HAR Dataset/test/subject_test.txt")
The ‘rbind’ function is used to join the training and test data sets. This results in a table with 10299 rows and 561 columns.
joinData <- rbind(trainData, testData)
# remove no longer required data sets
remove(trainData); remove(testData)
# Show dimensions of new data table
dim(joinData)
## [1] 10299 561
The train and test labels are also joined and finally the train and test subject identifiers.
joinLabel <- rbind(trainLabel, testLabel)
joinSubject <- rbind(trainSubject, testSubject)
remove(trainLabel); remove(testLabel); remove(trainSubject); remove(testSubject)
The ‘head’ function can be used to preview the data. Only the first 5 columns are previewed to reduce the size of the output. The preview shows that the column headers do not currently describe the data in the table. This will be addressed in the next sections.
head(joinData,5)[,c('V1', 'V2', 'V3', 'V4', 'V5')]
## V1 V2 V3 V4 V5
## 1 0.2885845 -0.02029417 -0.1329051 -0.9952786 -0.9831106
## 2 0.2784188 -0.01641057 -0.1235202 -0.9982453 -0.9753002
## 3 0.2796531 -0.01946716 -0.1134617 -0.9953796 -0.9671870
## 4 0.2791739 -0.02620065 -0.1232826 -0.9960915 -0.9834027
## 5 0.2766288 -0.01656965 -0.1153619 -0.9981386 -0.9808173
To find the mean and standard deviation measurements, first the features data needs to be loaded into R. This is a list of all features recorded by the accelerometers.
features <- read.table("./UCI HAR Dataset/features.txt")
The list of features loaded contains a set of strings and therefore a regular expression query needs to be used to find the mean and standard deviation features.
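# match features whose names contain the literal strings "mean()" or "std()"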
meanSD <- grep("mean\\(\\)|std\\(\\)", features[, 2])
The joined data can then be subset using the above.
joinData <- joinData[, meanSD]
dim(joinData)
## [1] 10299 66
This has reduced the number of columns to 66. The columns can also be renamed using the features list with some additional cleaning.
names(joinData) <- gsub("\\(\\)", "", features[meanSD, 2]) # remove "()"
names(joinData) <- gsub("mean", "Mean", names(joinData)) # capitalise M
names(joinData) <- gsub("std", "Std", names(joinData)) # capitalise S
names(joinData) <- gsub("-", "", names(joinData)) # remove "-"
remove(features)
# preview data
head(joinData,5)[,c('tBodyAccMeanX', 'tBodyAccMeanY', 'tBodyAccMeanZ', 'tBodyAccStdX',
                    'tBodyAccStdY')]
## tBodyAccMeanX tBodyAccMeanY tBodyAccMeanZ tBodyAccStdX tBodyAccStdY
## 1 0.2885845 -0.02029417 -0.1329051 -0.9952786 -0.9831106
## 2 0.2784188 -0.01641057 -0.1235202 -0.9982453 -0.9753002
## 3 0.2796531 -0.01946716 -0.1134617 -0.9953796 -0.9671870
## 4 0.2791739 -0.02620065 -0.1232826 -0.9960915 -0.9834027
## 5 0.2766288 -0.01656965 -0.1153619 -0.9981386 -0.9808173
The column names are now a lot more descriptive and all of the non-relevant data has been removed.
The activity labels are also loaded from the original data into R.
activity <- read.table("./UCI HAR Dataset/activity_labels.txt")
activity
## V1 V2
## 1 1 WALKING
## 2 2 WALKING_UPSTAIRS
## 3 3 WALKING_DOWNSTAIRS
## 4 4 SITTING
## 5 5 STANDING
## 6 6 LAYING
These labels will be cleaned up by converting them to lower case, removing the underscore and re-capitalising the second word so that the names read as camelCase (e.g. walkingUpstairs).
activity[, 2] <- tolower(gsub("_", "", activity[, 2]))
substr(activity[2, 2], 8, 8) <- toupper(substr(activity[2, 2], 8, 8))
substr(activity[3, 2], 8, 8) <- toupper(substr(activity[3, 2], 8, 8))
These labels can then be joined to the label data that was loaded previously.
activityLabel <- activity[joinLabel[, 1], 2]
joinLabel[, 1] <- activityLabel
names(joinLabel) <- "activity"
head(joinLabel)
## activity
## 1 standing
## 2 standing
## 3 standing
## 4 standing
## 5 standing
## 6 standing
Finally the full data will be bound together with the subject and activity labels using ‘cbind’. The activity labels will become the dependent variable for the statistical modelling task in the next section of the project.
names(joinSubject) <- "subject"
cleanData <- cbind(joinSubject, joinLabel, joinData)
remove(joinData); remove(activity)
remove(joinLabel); remove(joinSubject)
remove(activityLabel); remove(meanSD)
head(cleanData,5)[,c('subject', 'activity', 'tBodyAccMeanX', 'tBodyAccMeanY',
                     'tBodyAccMeanZ')]
## subject activity tBodyAccMeanX tBodyAccMeanY tBodyAccMeanZ
## 1 1 standing 0.2885845 -0.02029417 -0.1329051
## 2 1 standing 0.2784188 -0.01641057 -0.1235202
## 3 1 standing 0.2796531 -0.01946716 -0.1134617
## 4 1 standing 0.2791739 -0.02620065 -0.1232826
## 5 1 standing 0.2766288 -0.01656965 -0.1153619
This section will use the clean data set from the previous section and apply a statistical modelling method to classify the data and determine the most important variables in that classification.
As the primary focus of this project was data cleaning, this section will employ only one model (random forests); other models will be utilised later in the portfolio.
A simple explanation of random forests is that they operate by constructing a number of decision trees on a training data set and outputting the class that is the mode of the classes across those trees. A more detailed explanation will be provided in the Explanatory Post section of the portfolio.
Random forests were chosen for this problem because some of the activities are hard to distinguish, e.g. walking vs walking upstairs. By averaging over many trees, a random forest is more robust than a single decision tree, which can suffer from high variance or bias.
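As a toy illustration of the majority vote (a minimal sketch; the randomForest package handles this internally), suppose five trees each classify the same observation:
# hypothetical class predictions from five individual trees for one observation
treeVotes <- c("walking", "walkingUpstairs", "walking", "sitting", "walking")
# the forest's prediction is the modal class across the trees
names(which.max(table(treeVotes)))
## [1] "walking"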
First the clean data will be re-split into train and test sets using a 70/30 split. The data will be split on subject in order to get an even distribution of activities in each set.
cleanData.train <- cleanData[cleanData$subject %in% c(1,2,3,5,6,7,11,12,13,14,15,16,17,
                                                      21,22,23,24,25,26,27),]
cleanData.test <- cleanData[cleanData$subject %in% c(8,9,10,18,19,20,28,29,30),]
Next the random forest model can be performed on the training data with activity as outcome.
activity.rf <- randomForest(as.factor(cleanData.train$activity)~., data=cleanData.train)
The confusion matrix for the training set shows 100% accuracy using the random forest model. Note that accuracy measured on the same data the model was trained on is optimistic, so the evaluation on the held-out test set below gives a fairer picture of performance.
training.cm <- confusionMatrix(cleanData.train$activity,
                               predict(activity.rf, cleanData.train, type="class"))
training.cm[2]
## $table
## Reference
## Prediction laying sitting standing walking walkingDownstairs
## laying 1293 0 0 0 0
## sitting 0 1187 0 0 0
## standing 0 0 1285 0 0
## walking 0 0 0 1178 0
## walkingDownstairs 0 0 0 0 948
## walkingUpstairs 0 0 0 0 0
## Reference
## Prediction walkingUpstairs
## laying 0
## sitting 0
## standing 0
## walking 0
## walkingDownstairs 0
## walkingUpstairs 1041
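Because accuracy on the training data is optimistic, the out-of-bag (OOB) error reported by the fitted model object offers a less biased internal estimate, since each tree is evaluated only on the samples it did not see during training. Printing the model is a quick way to check this (output omitted):
# the printed model summary includes an OOB estimate of the error rate
print(activity.rf)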
Using the ‘varImpPlot’ function, the importance of variables can be ranked.
par(mfrow=c(1,1))
varImpPlot(activity.rf, pch=1, main="Random Forest Model Variables Importance")
Finally the model is run on the test set to further determine its accuracy.
test.cm <- confusionMatrix(cleanData.test$activity,
                           predict(activity.rf, cleanData.test, type="class"))
test.cm[2]
## $table
## Reference
## Prediction laying sitting standing walking walkingDownstairs
## laying 597 0 0 0 0
## sitting 15 468 57 0 0
## standing 0 51 514 0 0
## walking 0 0 0 421 30
## walkingDownstairs 0 0 0 2 339
## walkingUpstairs 0 0 0 17 27
## Reference
## Prediction walkingUpstairs
## laying 0
## sitting 0
## standing 0
## walking 33
## walkingDownstairs 72
## walkingUpstairs 407
The accuracy of the random forests model on the test set is 90.13%. This indicates that it is a good model for predicting activities based on data collected from the smartphones.
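As a quick check, the overall accuracy quoted above can be read directly from the caret confusion matrix object:
# extract the overall accuracy from the confusionMatrix object
test.cm$overall['Accuracy']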
This project aimed to demonstrate the importance of creating a clean and structured data set in order to perform meaningful analysis. There are a number of other important cleaning techniques that were not required for this project but should be considered in any data cleaning task. These include filling in missing values, correcting erroneous values and standardising formats.
The clean data set allowed statistical analysis to be performed in a few simple steps with meaningful results. A further step in this analysis could be to assess the random forests model against other statistical methods.
The purpose of this project is to demonstrate how to extract insights from data in order to persuade or inform others. Making heavy use of visualisations, it will attempt to take a set of data and tell a compelling narrative with it. To do this the project will use the following approach:
The approach the project will take is a fluid one with a broad opening question. This allows the data exploration itself to identify the most compelling specific questions that can be asked.
Every year the Academy Awards (known commonly as The Oscars) recognise excellence in cinema as assessed by the Academy’s voting membership. It is commonly recognised as the most prestigious of the movie awards and draws a lot of global interest. However, this project is interested in how the views of the Academy differ from those of the general film-viewing public. In this analysis, the general public will be represented by the Internet Movie Database (IMDb), which is the world’s most extensive source of movie information and also the largest source of film ratings from the general public.
The data used for this project will be taken from three sources. The first is the IMDB 5000 Movie Dataset from the Kaggle website, which consists of data scraped from IMDb as of 2016. More information can be found at:
https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
The Oscar data for best picture winning movies is also taken from a Kaggle data set which consists of winners and nominees from all of the Oscar ceremonies up to 2016. The link for this data is found at:
https://www.kaggle.com/theacademy/academy-awards
A further set of data will be parsed from Wikipedia: a table of all films that have won at least one Oscar. This is a good example of using multiple sources of data and of extracting data from an HTML page, which is a useful skill. The Wikipedia table can be found at:
https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films
The first step of the analysis is to read all of the data sets into R and do some initial exploration.
The R packages required for this project are ggplot2, data.table, rvest, stringr and cowplot.
library(ggplot2)
library(data.table)
library(rvest)
library(stringr)
library(cowplot)
After downloading the two Kaggle sets of data to the working directory and unzipping, the IMDb data can be loaded.
IMDb <- read.csv("movie_metadata.csv", stringsAsFactors = FALSE, encoding = 'UTF-8')
colnames(IMDb)
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
The columns of interest from this data are the title, year and IMDb score so the data can be subset accordingly.
IMDb <- IMDb[,c('movie_title', 'title_year', 'imdb_score')]
head(IMDb, 5)
## movie_title title_year
## 1 Avatar 2009
## 2 Pirates of the Caribbean: At World's End 2007
## 3 Spectre 2015
## 4 The Dark Knight Rises 2012
## 5 Star Wars: Episode VII - The Force Awakens NA
## imdb_score
## 1 7.9
## 2 7.1
## 3 6.8
## 4 8.5
## 5 7.1
There is white space at the end of the movie titles which is removed using the stringr package.
IMDb$movie_title <- str_trim(IMDb$movie_title)
As the Oscar data is up to the 2016 ceremony, only films up to 2015 are included. Therefore any films beyond that can be removed from the data.
IMDb$title_year <- as.numeric(IMDb$title_year)
IMDb <- subset(IMDb, title_year < 2016)
Next the Oscars data can be loaded.
oscars <- read.csv("database.csv", stringsAsFactors=FALSE)
oscars$Name <- str_trim(oscars$Name)
colnames(oscars)
## [1] "Year" "Ceremony" "Award" "Winner" "Name" "Film"
The only column not needed from this data is the ceremony identifier column.
oscars <- oscars[,c('Year', 'Award', 'Winner', 'Name', 'Film')]
head(oscars, 5)
## Year Award Winner Name Film
## 1 1927/1928 Actor NA Richard Barthelmess The Noose
## 2 1927/1928 Actor 1 Emil Jannings The Last Command
## 3 1927/1928 Actress NA Louise Dresser A Ship Comes In
## 4 1927/1928 Actress 1 Janet Gaynor 7th Heaven
## 5 1927/1928 Actress NA Gloria Swanson Sadie Thompson
From this data, the aim is to extract the list of films that won the Best Picture award. The award for best picture has been known under five different names: Outstanding Picture, Outstanding Production, Outstanding Motion Picture, Best Motion Picture and Best Picture.
The data will be subset to reflect these name changes.
bestPictureNoms <- subset(oscars, Award %in% c('Outstanding Picture', 'Outstanding Production', 'Outstanding Motion Picture', 'Best Motion Picture', 'Best Picture'))
One mistake in the data is that the film names and production companies are switched around for the 1928 and 1929 nominees. This is fixed below.
bestPictureNoms[1:8, 4] <- bestPictureNoms[1:8, 5]
bestPictureNoms <- bestPictureNoms[,1:4]
# remove whitespace
bestPictureNoms$Name <- str_trim(bestPictureNoms$Name)
Finally the data will be filtered by the ‘Winner’ column to return the best picture winners.
bestPicture <- subset(bestPictureNoms, Winner == 1)
head(bestPicture, 5)
## Year Award Winner Name
## 22 1927/1928 Outstanding Picture 1 Wings
## 65 1928/1929 Outstanding Picture 1 The Broadway Melody
## 101 1929/1930 Outstanding Production 1 All Quiet on the Western Front
## 141 1930/1931 Outstanding Production 1 Cimarron
## 179 1931/1932 Outstanding Production 1 Grand Hotel
The final data source is the table in Wikipedia. This can be scraped using the rvest package.
url <- "https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films"
oscarWinners <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table') %>%
html_table()
oscarWinners <- oscarWinners[[1]]
colnames(oscarWinners) <- tolower(colnames(oscarWinners))
head(oscarWinners, 5)
## film year awards nominations
## 1 The Shape of Water 2017 4 13
## 2 Darkest Hour 2017 2 6
## 3 Three Billboards Outside Ebbing, Missouri 2017 2 7
## 4 I, Tonya 2017 1 3
## 5 Get Out 2017 1 4
Again, films from 2016 onwards will be removed. There are also some values in “()”, which indicate honorary awards, and in “[]”, which indicate citations. Both are removed below and the columns are converted to numeric types.
oscarWinners <- subset(oscarWinners, year < 2016)
oscarWinners$awards <- gsub("\\s*\\([^\\)]+\\)","", as.character(oscarWinners$awards))
oscarWinners$nominations <- gsub("\\s*\\([^\\)]+\\)","", as.character(oscarWinners$nominations))
oscarWinners$nominations <- gsub("\\s*\\[[^\\)]+\\]","", as.character(oscarWinners$nominations))
oscarWinners$awards <- as.numeric(oscarWinners$awards)
oscarWinners$nominations <- as.numeric(oscarWinners$nominations)
Taking out the honorary awards means there are some films left with no awards. These can also be removed.
oscarWinners <- subset(oscarWinners, awards > 0)
Another issue with the Wikipedia data is that films whose titles begin with ‘The’ are loaded incorrectly, with the ‘The’ duplicated, due to the difference between the film title format and the Wikipedia URL title. This is rectified below.
oscarWinners$film <- with(oscarWinners,
                          ifelse(grepl('TheThe', film),
                                 substr(film, as.numeric(gregexpr(pattern='TheThe', film)) + 3,
                                        nchar(film)),
                                 film))
That concludes the data gathering and cleaning section of the project and now exploratory analysis can be started.
The first step is to look at the top ranking films separately. For the Oscars, this is determined by total number of awards and the top 20 movies are pulled out.
oscarWinners20 <- oscarWinners[order(-oscarWinners$awards),][1:20,]
g <- ggplot(oscarWinners20, aes(reorder(film, -awards), awards)) +
geom_bar(stat="identity", fill='#bba267') +
geom_text(aes(label = awards), nudge_y = -1,color = "white", fontface = "bold") +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
theme(axis.text.y=element_blank()) +
theme(axis.ticks.y=element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(plot.margin = unit(c(1,1,1,1), "cm")) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 23)) +
ggtitle("Top 20 Oscar Winning Movies")
g
The top 20 IMDb movies are found by IMDb score.
IMDb$imdb_score <- as.numeric(IMDb$imdb_score)
IMDb20 <- IMDb[order(-IMDb$imdb_score),][1:20,]
g <- ggplot(IMDb20, aes(reorder(movie_title, -imdb_score), imdb_score)) +
geom_bar(stat="identity", fill='#f5de50') +
geom_text(aes(label=imdb_score), nudge_y = -1, color="black", fontface="bold") +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
theme(axis.text.y=element_blank()) +
theme(axis.ticks.y=element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(plot.margin = unit(c(1,1,1,1), "cm")) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 23)) +
ggtitle("Top 20 Movies on IMDb")
g
The code below shows that only The Lord of the Rings: The Return of the King and Schindler’s List were in the top 20 films most appreciated by both the Academy and IMDb users. This is an early indication of the contrasting ways in which awards committees and the general public rate movies.
subset(oscarWinners20, film %in% IMDb20$movie_title)
## film year awards nominations
## 210 The Lord of the Rings: The Return of the King 2003 11 11
## 351 Schindler's List 1993 7 12
A histogram can be used to compare the release years of the top movies.
imdbyears <- as.data.frame(IMDb20$title_year)
colnames(imdbyears) <- 'year'
oscarWinners20$year <- as.numeric(oscarWinners20$year)
g <- ggplot(oscarWinners20, aes(year)) +
geom_histogram(bins=10, fill = "#bba267", alpha=0.6) +
scale_x_continuous(limits = c(1930, 2020), breaks=seq(1935,2015,10)) +
geom_histogram(data=imdbyears, bins=10, fill="#f5de50", alpha=0.6) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
theme(plot.margin = unit(c(1,1,1,1), "cm")) +
ggtitle("Top Oscar and IMDb Movies Distribution by Year")
g
This seems to suggest that IMDb users favour more recent movies, with 13 of the 20 highest rated movies from 1985 onward. Three movies pre-dating 1955 won eight Oscars each: Gone with the Wind (1939), From Here to Eternity (1953) and On the Waterfront (1954), but none of these feature in the IMDb top 20.
Of course, the number of wins is not the only way to measure Oscar success. Some years may feature greater competition, and receiving a nomination is itself a recognition of excellence from the Academy. Therefore the next analysis will take the top 20 Oscar films again, this time including nominations. Then the IMDb scores will be added to see how viewers rated them.
To perform this analysis, it is a good idea to merge the data into one data frame by left joining the IMDb data to the top 20 Oscar movies.
oscarWinners20 <- merge(x = oscarWinners20, y = IMDb, by.x = "film", by.y = "movie_title", all.x = TRUE)
oscarWinners20 <- oscarWinners20[-9]
Looking at the data shows that three films were not matched in the IMDb data (Ben-Hur, Gigi and Cabaret). The easiest way to rectify this is to fill in the scores manually from the IMDb website.
oscarWinners20[2,6] <- 8.1
oscarWinners20[3,6] <- 7.8
oscarWinners20[7,6] <- 6.9
Now the plot can be created.
# prepare data for plotting
oscarWinners20 <- oscarWinners20[order(-oscarWinners20$awards, -oscarWinners20$nominations),]
awardsCount <- as.data.frame(rep(oscarWinners20$film, oscarWinners20$awards))
colnames(awardsCount) <- 'count'
nomsCount <- as.data.frame(rep(oscarWinners20$film, oscarWinners20$nominations))
colnames(nomsCount) <- 'count'
# create plot
g <- ggplot(nomsCount, aes(count)) +
geom_dotplot(bins=20) +
geom_dotplot(data=awardsCount, bins=20, fill="#bba267") +
scale_x_discrete(limits=as.vector(oscarWinners20$film), labels = function(x) str_wrap(x, width = 25)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 15)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(plot.margin = unit(c(1,1,1,1), "cm")) +
ggtitle("Top 20 Oscar Winning Movies and IMDb")
# add IMDb labels
g <- g + geom_point(aes(y=13.75), shape = 23, size = 12, fill = "#f5de50")
for(i in 1:length(oscarWinners20$imdb_score)){g <- g + geom_text(x=i, y=13.75, label=oscarWinners20$imdb_score[i])}
remove(awardsCount); remove(nomsCount)
g
mean(oscarWinners20$imdb_score)
## [1] 7.89
The chart shows that the Oscar winning movies are generally rated well with an average score of 7.89 across the 20 movies. The most overrated film according to IMDb users is Gigi with a score of only 6.9. The most highly regarded of the Oscar decorated films are Lord of the Rings: Return of the King and Schindler’s List with scores of 8.9 (the only two films to make the IMDb top 20).
Another interesting analysis is to look at how the winners of the Oscar award for best picture performed with IMDb users. Considering this is the top recognition at the awards, these movies in particular would be expected to perform well. To do this, the IMDb database first needs to be merged with the best picture data frame which was created earlier from the Oscars Kaggle data.
colnames(bestPicture) <- tolower(colnames(bestPicture))
bestPicture <- merge(x = bestPicture, y = IMDb, by.x = "name", by.y = "movie_title", all.x = TRUE)
bestPicture <- bestPicture[,-9]
# remove any duplicate rows
bestPicture <- bestPicture[!duplicated(bestPicture), ]
Due to gaps in the IMDb set, not all of the Oscar winning movies have entries in the IMDb data, so these scores can be entered manually as the number of missing entries is small.
bestPicture[which(is.na(bestPicture$imdb_score), arr.ind=TRUE),6] <- c('8.3', '8.1', '7.2', '8.1', '6', '6', '8', '6.9', '7.2', '7.6', '8', '7.8', '7.6', '7.7', '7.6', '7.2', '8.7', '7.4', '9', '6.8', '7.3', '7.8')
# fix join error
bestPicture[12,6] <- 6.8
To show the most and least appreciated of the best picture winning movies, two graphs will be created of the ten highest and lowest IMDb rated movies.
worst10 <- bestPicture[order(bestPicture$imdb_score),][1:10,]
worst10$imdb_score <- as.numeric(worst10$imdb_score)
g <- ggplot(worst10, aes(reorder(name, imdb_score), imdb_score)) +
geom_bar(stat="identity", fill='#f5de50') +
geom_text(aes(label=imdb_score), nudge_y = -1, color="black", fontface="bold") +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
theme(axis.text.y=element_blank()) +
theme(axis.ticks.y=element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(plot.margin = unit(c(1,1,1,1), "cm")) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 23)) +
ggtitle("Worst 10 Oscar Winning Movies on IMDb")
remove(worst10)
g
best10 <- bestPicture[order(bestPicture$imdb_score, decreasing = TRUE),][1:10,]
best10$imdb_score <- as.numeric(best10$imdb_score)
g <- ggplot(best10, aes(reorder(name, -imdb_score), imdb_score)) +
geom_bar(stat="identity", fill='#f5de50') +
geom_text(aes(label=imdb_score), nudge_y = -1, color="black", fontface="bold") +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
theme(axis.text.y=element_blank()) +
theme(axis.ticks.y=element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(plot.margin = unit(c(1,1,1,1), "cm")) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 23)) +
ggtitle("Best 10 Oscar Winning Movies on IMDb")
remove(best10)
g
The above shows a real range of scores for Best Picture winning movies. The Godfather (Best Picture 1972) sits at the top with a rating of 9.2 while Cimarron and Cavalcade (Outstanding Production 1931 and 1933 respectively) are at the bottom with 6. Again this suggests that IMDb users do not appreciate earlier movies in the same way the Academy does (or that the standard of cinema then was much lower).
The next piece of analysis will be to look again at the top 20 rated movies on IMDb and examine their performance at the Oscars i.e. number of awards and nominations (if any). This will give an indication of which movies were most underrated by the Academy.
First the IMDb top 20 is merged with the Oscar winners data.
IMDb20 <- merge(x = IMDb20, y = oscarWinners, by.x = "movie_title", by.y = "film", all.x = TRUE)
# fix join errors
IMDb20[7,4:6] <- oscarWinners[581,2:4]
IMDb20[10,4:6] <- oscarWinners[556,2:4]
IMDb20[14,4:6] <- oscarWinners[594,2:4]
head(IMDb20, 5)
## movie_title title_year imdb_score year awards nominations
## 1 12 Angry Men 1957 8.9 <NA> NA NA
## 2 City of God 2002 8.7 <NA> NA NA
## 3 Fight Club 1999 8.8 <NA> NA NA
## 4 Forrest Gump 1994 8.8 1994 6 13
## 5 Goodfellas 1990 8.7 1990 1 6
The output of the merge suggests that 6 of the 20 films in the list did not receive any Oscars. The next step is to find out if any of them were nominated by searching through the original Oscars data.
# Films with no oscars
IMDb20[which(is.na(IMDb20$awards), arr.ind=TRUE),1]
## [1] "12 Angry Men"
## [2] "City of God"
## [3] "Fight Club"
## [4] "Star Wars: Episode V - The Empire Strikes Back"
## [5] "The Good, the Bad and the Ugly"
## [6] "The Shawshank Redemption"
# count nominations for the films that won no awards
for (i in which(is.na(IMDb20$awards))){
  IMDb20[i,6] <- sum(oscars$Name == IMDb20[i,1]) + sum(oscars$Film == IMDb20[i,1])
}
IMDb20[20,6] <- 7
The remaining NAs indicate no nominations and can be replaced with 0s.
IMDb20[is.na(IMDb20)] <- 0
The plot below is similar to the one above and shows the movie awards against the IMDb rating.
# prepare data for plotting
IMDb20 <- IMDb20[order(-IMDb20$imdb_score),]
awardsCount <- as.data.frame(rep(IMDb20$movie_title, IMDb20$awards))
colnames(awardsCount) <- 'count'
nomsCount <- as.data.frame(rep(IMDb20$movie_title, IMDb20$nominations))
colnames(nomsCount) <- 'count'
# create plot
g <- ggplot(nomsCount, aes(count)) +
geom_dotplot(bins=20) +
geom_dotplot(data=awardsCount, bins=20, fill="#bba267") +
scale_x_discrete(limits=as.vector(IMDb20$movie_title), labels = function(x) str_wrap(x, width = 23)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 20)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(plot.margin = unit(c(1,1,1,1), "cm")) +
ggtitle("Top 20 IMDb Movies and Oscar Wins/Nominations")
# add IMDb labels
for(i in 1:length(IMDb20$imdb_score)){
g <- g + geom_point(x=i, y=18, shape = 23, size = 12, fill = "#f5de50") + geom_text(x=i, y=18, label=IMDb20$imdb_score[i])
}
remove(awardsCount); remove(nomsCount)
g
The chart gives some very interesting results. A number of films received no Oscars, including the top overall film on IMDb, The Shawshank Redemption. Two of the films also received no nominations at all (The Good, the Bad and the Ugly and Star Wars: Episode V).
It is important to understand the limitations of the analysis so far and to try to find other ways in which to approach the task. One limitation is that ranking films by number of Oscar awards/nominations does not represent how they are judged by the Academy over time, only how they compared in a single year. A movie is only judged against the other movies released that year, so a weak year for movies can lead to an abundance of awards for the one good release, and the reverse also holds. For example, The Shawshank Redemption (number 1 on IMDb) came out in the same year as Pulp Fiction and Forrest Gump, two other films that have stood the test of time, which could explain its lack of real recognition by the Oscars.
To address this issue, the final piece of exploratory analysis will examine how many of the Best Picture winning movies were also the best IMDb-rated movies of their respective years. To do this, the IMDb data needs to be grouped by release year, keeping the highest-scoring film(s) of each year.
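# .SD holds the subset of rows for each title_year group; which.max(imdb_score)
# keeps the highest-rated film of each year (the first row in the case of ties)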
IMDbByYear <- as.data.table(IMDb)
IMDbByYear <- IMDbByYear[ , .SD[which.max(imdb_score)], by = title_year]
IMDbByYear <- IMDbByYear[order(IMDbByYear$title_year),]
IMDbByYear <- subset(IMDbByYear, title_year > 1926)
head(IMDbByYear, 10)
## title_year movie_title imdb_score
## 1: 1927 Metropolis 8.3
## 2: 1929 Pandora's Box 8.0
## 3: 1930 Hell's Angels 7.8
## 4: 1932 A Farewell to Arms 6.6
## 5: 1933 42nd Street 7.7
## 6: 1934 It Happened One Night 8.2
## 7: 1935 Top Hat 7.8
## 8: 1936 Modern Times 8.6
## 9: 1937 The Prisoner of Zenda 7.8
## 10: 1938 You Can't Take It with You 8.0
An inner join of the two data frames shows the films that appear in both lists.
merge(IMDbByYear, bestPicture, by.x = 'movie_title', by.y = 'name')[order(year),c(1,4)]
## movie_title year
## 1: It Happened One Night 1934
## 2: You Can't Take It with You 1938
## 3: Gone with the Wind 1939
## 4: Rebecca 1940
## 5: How Green Was My Valley 1941
## 6: Casablanca 1943
## 7: The Lost Weekend 1945
## 8: From Here to Eternity 1953
## 9: Lawrence of Arabia 1962
## 10: The Godfather 1972
## 11: The Sting 1973
## 12: The Deer Hunter 1978
## 13: The Silence of the Lambs 1991
## 14: Schindler's List 1993
## 15: Gladiator 2000
## 16: The Lord of the Rings: The Return of the King 2003
## 17: The Departed 2006
This analysis shows that only 17 of the 89 best picture winning movies are the highest rated movie of their year on IMDb. In other words, IMDb users agree with the Oscars only 19.1% of the time.
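The proportion can be verified with a quick calculation using the objects created above:
# share of Best Picture winners that were also the top IMDb film of their year
round(nrow(merge(IMDbByYear, bestPicture, by.x = 'movie_title', by.y = 'name')) /
        nrow(bestPicture) * 100, 1)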
The data has shown that although there are certainly a number of films seemingly loved by both viewers and the Academy, there are also many films where IMDb users felt the Academy got it wrong. This is hardly a surprising result, but it is still interesting to see the specific movies that divide opinion.
The final piece of analysis addressed one of the limitations of looking at the Oscars as an indicator of how good a film is. However, there are other problems which also make comparison difficult. For example, IMDb was founded in 1990, so for many films the ratings reflect how they have stood the test of time, whereas the Oscars reflected the views of the time. Another issue is that Oscars are awarded in a number of technical categories such as make-up and sound. Although these are important aspects of film-making, winning such awards does not necessarily indicate a good film, which makes using total awards/nominations as an indicator of greatness difficult. For example, The Lord of the Rings: The Return of the King receiving so many awards does not indicate that the Academy views the movie as the best of all time; rather, it views it as technically brilliant in addition to having a good plot, acting and directing.
Nonetheless, the data still gives a lot of insight and future avenues to explore. One of the limitations of the IMDb data set is that it does not have accurate acting data (i.e. who the main actors are in each movie). One area of interest would be to look at how the films of individual actors, actresses and directors rank compared to how many awards those individuals have won. Furthermore, IMDb is not the only source of internet reviews, and it could also be interesting to add data from other sites such as Rotten Tomatoes or Metacritic for further insight.
The end-to-end project is the third and final one in this portfolio. Similar to the previous projects, it will look to extract insights from data in order to present them to others. However, it will then build on this to create a customer-facing system that can be run multiple times with different pieces of data to generate different outputs. It will do this by applying a machine learning model to an original set of data and then creating a user-friendly application for inputting new data into the model. The following steps will be used to achieve this:
Credit risk is the risk of a borrower defaulting on a debt, such that the lender may lose the principal of the loan or the associated interest. When a bank receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. If the bank deems the applicant a bad credit risk, the applicant is considered unlikely to repay the loan and approving it could result in financial loss to the bank.
The purpose of this project is to take a data set of loan applications and build a predictive model for making a decision as to whether to approve a loan based on the applicant’s profile. An application will then be built which is intended to provide guidance to a bank manager for making this decision.
The data for the analysis is a set of 1000 German credit applications with 20 different attributes of the applicant. The original data is from the UCI Machine Learning Repository but the CSV version used in this analysis can be found from the Penn State University website (https://onlinecourses.science.psu.edu/stat857/node/215). Further information about the data set can be found at:
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Note: the data is taken from 1994 and therefore all monetary amounts are valued in Deutsche Mark (DM).
As always, the first step of the analysis is to read the data into R and clean it in preparation for exploratory analysis.
library(ggplot2)
library(cowplot)
library(caret)
library(ROCR)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
The data set used is the numeric version, in which each categorical attribute is encoded as an integer; this format is more suitable for algorithms that cannot cope with string-valued categorical variables.
credit <- read.csv('german_credit.csv', header = TRUE)
str(credit)
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
## $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
## $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
## $ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
## $ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
## $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
## $ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
## $ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
More information about the attributes and what the numeric values mean can be found at the UCI link above. One thing to note immediately is that three of the columns contain continuous variables rather than categorical data (duration of credit, credit amount and age). These potentially carry important information for deciding credit risk, and one solution is to transform them into categorical variables using the cut function.
credit$Duration.of.Credit..month. <- cut(credit$Duration.of.Credit..month., c(0,12,18,24,Inf), labels = c(1:4))
credit$Credit.Amount <- cut(credit$Credit.Amount, c(0,1000,5000,10000,Inf), labels = c(1:4))
credit$Age..years. <- cut(credit$Age..years., c(18,25,40,60,Inf), labels = c(1:4))
head(credit[,c(3,6,14)],5)
## Duration.of.Credit..month. Credit.Amount Age..years.
## 1 2 2 1
## 2 1 2 2
## 3 1 1 1
## 4 1 2 2
## 5 1 2 2
The new structure of the three columns can be seen above. The categories have changed as follows.
Duration of Credit (month): 1 = 0-12 months, 2 = 13-18 months, 3 = 19-24 months, 4 = over 24 months.
Credit Amount: 1 = 0-1,000 DM, 2 = 1,001-5,000 DM, 3 = 5,001-10,000 DM, 4 = over 10,000 DM.
Age: 1 = 18-25 years, 2 = 26-40 years, 3 = 41-60 years, 4 = over 60 years.
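The distribution of applicants across the new categories can be checked with the table function, for example:
# counts of applicants in each of the new age bands
table(credit$Age..years.)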
Finally, the remaining columns can be converted to factors.
for(i in 1:21) credit[, i] <- as.factor(credit[, i])
Before starting the modeling phase, it is important to explore the data to get an idea of any patterns or areas of interest.
The first thing is to examine how many examples of good and bad credit risk there are.
g <- ggplot(credit, aes(Creditability)) +
geom_bar(fill = "#4EB25A") +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
scale_y_continuous(breaks=seq(0,700,100)) +
scale_x_discrete(labels = c("Bad","Good")) +
ggtitle("Count of Good and Bad Credit Risks")
g
The plot shows 300 examples of bad credit risk applicants versus 700 good ones. This imbalance should be borne in mind later when splitting the data set into training and test sets.
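The exact counts behind the plot can be confirmed with a simple tabulation:
# 0 = bad credit risk, 1 = good credit risk
table(credit$Creditability)
## 
##   0   1 
## 300 700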
The next step is to explore some of the variables in the data. For example, it might be a fair assumption that the amount of total savings is strongly linked to the credit risk of the applicant, i.e. an applicant with little money in their account is a higher credit risk than one with substantial savings. A plot can be produced to test this.
g <- ggplot(credit, aes(Value.Savings.Stocks, fill = Creditability)) +
geom_bar() +
scale_fill_manual(values = c("#D3D6D4", "#4EB25A"), labels=c("Bad","Good")) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
scale_y_continuous(breaks=seq(0,700,100)) +
scale_x_discrete(labels = c("< 100 DM", "100-500 DM", "500-1000 DM", "> 1000 DM", "Unknown")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(axis.text.y = element_text(size = 10)) +
theme(legend.text=element_text(size=10)) +
theme(legend.title=element_text(size=12)) +
ggtitle("Good and Bad Credit Risks by Savings")
g
The plot seems to back up the rationale: a higher proportion of applicants with lower savings are deemed to be bad credit risks.
Another area to explore is how credit risk relates to employment status. There are four categories for occupation in the data: unemployed/unskilled non-resident, unskilled resident, skilled employee/official, and management/self-employed/highly qualified employee.
g <- ggplot(credit, aes(Occupation, fill = Creditability)) +
geom_bar() +
scale_fill_manual(values = c("#D3D6D4", "#4EB25A"), labels=c("Bad","Good")) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
scale_y_continuous(breaks=seq(0,700,100)) +
scale_x_discrete(labels = c("Unemployed", "Unskilled", "Skilled", "Management")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(axis.text.y = element_text(size = 10)) +
theme(legend.text=element_text(size=10)) +
theme(legend.title=element_text(size=12)) +
ggtitle("Good and Bad Credit Risks by Occupation")
g
There appears to be less of a link with occupation. Most of the applicants come under ‘skilled employee’, but the creditability of unskilled employees and management/highly qualified employees does not appear significantly different. Further statistical modelling is needed to support this initial observation.
Finally, some exploration can be performed on one of the new categorical variables created above. This example will look at age.
g <- ggplot(credit, aes(Age..years., fill = Creditability)) +
geom_bar() +
scale_fill_manual(values = c("#D3D6D4", "#4EB25A"), labels=c("Bad","Good")) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
scale_y_continuous(breaks=seq(0,700,100)) +
scale_x_discrete(labels = c("18-25", "26-40", "41-60", "60+")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(axis.text.y = element_text(size = 10)) +
theme(legend.text=element_text(size=10)) +
theme(legend.title=element_text(size=12)) +
ggtitle("Good and Bad Credit Risks by Age")
g
This analysis indicates perhaps some decrease in credit risk with age. However, age is likely correlated with other factors such as savings and property, so further statistical analysis is needed.
The first step before applying models is to create training and test data sets. The data will be split 70/30, with good and bad credit risks spread proportionally between the sets, using the createDataPartition function in the caret package.
set.seed(2828)
inTraining <- createDataPartition(credit$Creditability, p=0.7, list=FALSE)
train <- credit[inTraining,]
test <- credit[-inTraining,]
The first model is logistic regression using the glm() function.
set.seed(2828)
lmModel <- glm(Creditability ~ ., family = binomial, data = train)
# Fit model to test set
lmFit <- predict(lmModel, type = "response", test)
# Compare predictions to test set
lmPred <- prediction(lmFit, test$Creditability)
# Create Area Under the Curve (AUC) plot
plot(performance(lmPred, 'tpr', 'fpr'))
performance(lmPred, measure = 'auc')@y.values[[1]]
## [1] 0.732963
The AUC of the model is 0.73. This measures the model’s performance by evaluating the trade-off between the true positive and false positive rates, i.e. how good the model is at identifying good credit risks without falsely identifying bad risks as good.
This is a fairly good score, but the next sections will look at classification trees and random forests to try to improve on it.
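For intuition, each point on the ROC curve corresponds to a single classification threshold. For example, classifying applicants as good credit risks when the predicted probability exceeds 0.5 gives a single confusion table (a quick illustration using the objects above):
# one threshold corresponds to one point on the ROC curve
table(Actual = test$Creditability, Predicted = lmFit > 0.5)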
Decision trees use a tree-like model of decisions and their outcomes to create a prediction model.
set.seed(28)
dtModel <- rpart(Creditability ~ ., data=train)
fancyRpartPlot(dtModel)
As before, the model is fit to the test data to analyse the performance.
dtFit <- predict(dtModel, test, type = 'prob')[, 2]
dtPred <- prediction(dtFit, test$Creditability)
plot(performance(dtPred, 'tpr', 'fpr'))
performance(dtPred, measure = 'auc')@y.values[[1]]
## [1] 0.6823545
This model has performed less well than logistic regression.
The final model is Random Forest. Random forests operate by constructing a number of decision trees on the training data set and outputting the class that is the mode of the classes across the decision trees.
set.seed(2828)
rfModel <- randomForest(Creditability ~ ., data=train)
rfFit <- predict(rfModel, test, type = 'prob')[,2]
rfPred <- prediction(rfFit, test$Creditability)
plot(performance(rfPred, 'tpr', 'fpr'))
performance(rfPred, measure = 'auc')@y.values[[1]]
## [1] 0.7521164
The Random Forest model returns an AUC of 0.752 which is slightly better than the logistic regression model. This is the model which will be used for the final application.
The plot below shows the ranked importance of the variables in the Random Forest model. Account balance is the most significant variable in the model, with purpose second. Purpose identifies the reason for the applicant’s request for credit, e.g. car, education, business etc.
par(mfrow=c(1,1))
varImpPlot(rfModel, pch=1, main="Random Forest Model Variables Importance")
The confusion matrix below shows the split between prediction success of good (1) and bad (0) credit risks and an overall accuracy of 75.33%.
rfCM <- confusionMatrix(test$Creditability,
                        predict(rfModel, test, type="class"))
rfCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 33 57
## 1 17 193
##
## Accuracy : 0.7533
## 95% CI : (0.7005, 0.8011)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.3273
## Mcnemar's Test P-Value : 5.797e-06
##
## Sensitivity : 0.6600
## Specificity : 0.7720
## Pos Pred Value : 0.3667
## Neg Pred Value : 0.9190
## Prevalence : 0.1667
## Detection Rate : 0.1100
## Detection Prevalence : 0.3000
## Balanced Accuracy : 0.7160
##
## 'Positive' Class : 0
##
The next and final step is to build the application designed to aid bank managers.
The application will be built using the Shiny package in RStudio. Two separate scripts are created to produce the application: the ui.R script sets out the look and feel of the application, while the server.R script contains the underlying code that provides the functionality. The app features drop-down boxes in the sidebar where the user can input the applicant’s data, and the accept/reject decision is shown in the middle of the page.
The ui.R script is shown below:
library(shiny)
library(shinythemes)
shinyUI(fluidPage(
theme = shinytheme("slate"),
titlePanel("Credit Risk Evaluator"),
sidebarLayout(
sidebarPanel(
helpText("This app evaluates whether a loan applicant should be considered a good or bad credit risk and delivers an accept or reject recommendation"),
selectInput("account_balance",
label = h6("Account Balance"),
choices = list("< 0 DM" = 1,
"0 - 200 DM" = 2,
"> 200 DM / Salary assignments for at least 1 year" =3,
"No checking account" = 4),
selected = 1
),
selectInput("duration_of_credit",
label = h6("Duration of Credit (Months)"),
choices = list("0 - 12 months" = 1,
"13 - 18 months" = 2,
"19 - 24 months" =3,
"> 24 months" = 4),
selected = 1
),
selectInput("payment_status_previous_credit",
label = h6("Payment Status of Previous Credit"),
choices = list("No credits taken / All credits paid back duly" = 0,
"All credits at this bank paid back duly" = 1,
"Existing credits paid back duly till now" =2,
"Delay in paying off in the past" = 3,
"Critical account / Other credits existing (not at this bank)" = 4),
selected = 1
),
selectInput("purpose",
label = h6("Purpose"),
choices = list("Car (new)" = 0,
"Car (used)" = 1,
"Furniture / Equipment" = 2,
"Radio / Television" = 3,
"Domestic Appliances" = 4,
"Repairs" = 5,
"Education" = 6,
"Re-training" = 7,
"Business" =8,
"Other" = 9),
selected = 1
),
selectInput("credit_amount",
label = h6("Credit Amount"),
choices = list("0 - 1,000 DM" = 1,
"1,001 - 5,000 DM" = 2,
"5,001 - 10,000 DM" =3,
"> 10,000 DM" = 4),
selected = 1
),
selectInput("value_savings",
label = h6("Value of Savings/Stocks"),
choices = list("< 100 DM" = 1,
"100 - 500 DM" = 2,
"500 - 1000 DM" = 3,
"> 1000 DM" =4,
"Unknown / No Savings Account" = 5),
selected = 1
),
selectInput("length_employment",
label = h6("Length of Current Employment"),
choices = list("Unemployed" = 1,
"< 1 Year" = 2,
"1 - 4 Years" = 3,
"4 - 7 Years" =4,
"> 7 Years" = 5),
selected = 1
),
selectInput("instalment_percent",
label = h6("Instalment (%) of Disposable Income"),
choices = list("1%" = 1,
"2%" = 2,
"3%" =3,
"4% and above" = 4),
selected = 1
),
selectInput("sex_marital_status",
label = h6("Sex and Marital Status"),
choices = list("Male : Divorced / Separated" = 1,
"Female : Divorced / Separated / Married " = 2,
"Male : Single" = 3,
"Male : Married / Widowed" = 4,
"Female : Single" = 5),
selected = 1
),
selectInput("guarantors",
label = h6("Guarantors"),
choices = list("None" = 1,
"Co-applicant" = 2,
"Guarantor" = 3),
selected = 1
),
selectInput("duration_current_address",
label = h6("Time in Current Address"),
choices = list("< 1 Year" = 1,
"1 - 2 Years" = 2,
"2 - 3 Years" =3,
"> 3 Years" = 4),
selected = 1
),
selectInput("most_valuable_asset",
label = h6("Most Valuable Available Asset"),
choices = list("Real Estate" = 1,
"Building Society Savings Agreement / Life Insurance" = 2,
"Car or Other" =3,
"Unknown / No Property" = 4),
selected = 1
),
selectInput("age",
label = h6("Age"),
choices = list("18 - 25" = 1,
"26 - 40" = 2,
"41 - 60" =3,
"Over 60" = 4),
selected = 1
),
selectInput("concurrent_credits",
label = h6("Concurrent Credits"),
choices = list("Bank" = 1,
"Stores" =2,
"None" = 3),
selected = 1
),
selectInput("type_apartment",
label = h6("Type of Apartment"),
choices = list("Rent" = 1,
"Own" = 3,
"For Free" = 4),
selected = 1
),
selectInput("num_credits_this_bank",
label = h6("Number of Credits at This Bank"),
choices = list("1" = 1,
"2" = 2,
"3" =3,
"4 or more" = 4),
selected = 1
),
selectInput("occupation",
label = h6("Occupation"),
choices = list("Unemployed / Unskilled - Non-resident " = 1,
"Unskilled - Resident " = 2,
"Skilled Employee / Official " =3,
"Management / Self-employed /
Highly Qualified Employee / Officer " = 4),
selected = 1
),
selectInput("num_dependents",
label = h6("Number of Dependents"),
choices = list("0 - 1" = 1,
"2 or more" = 2),
selected = 1
),
selectInput("telephone",
label = h6("Telephone"),
choices = list("None" = 1,
"Yes, Registered Under the Customer's Name" = 4),
selected = 1
),
selectInput("foreign_worker",
label = h6("Foreign Worker"),
choices = list("Yes" = 1,
"No" = 2),
selected = 1
),
br(),
actionButton("action_Calc", label = "Evaluate")
),
mainPanel(
tabsetPanel(
tabPanel("Output",
p(h4("Recommendation:")),
textOutput("decision"),
tags$head(tags$style("#decision{color: white; font-size: 60px;}")
)
),
tabPanel("Documentation",
p(h4("Credit Risk Evaluator:")),
br(),
helpText("This application evaluates whether or not a credit applicant has good or bad credit risk and delivers an accept or reject recommendation. The model was developed using a random forests model which was run on a data set of 1000 German credit applications."),
HTML("")
)
)
)
)
))
The server.R script:
library(shiny)
library(caret)
library(randomForest)
shinyServer(function(input, output) {
credit <- read.csv('german_credit.csv', header = TRUE)
credit$Duration.of.Credit..month. <- cut(credit$Duration.of.Credit..month., c(0,12,18,24,Inf), labels = c(1:4))
credit$Credit.Amount <- cut(credit$Credit.Amount, c(0,1000,5000,10000,Inf), labels = c(1:4))
credit$Age..years. <- cut(credit$Age..years., c(18,25,40,60,Inf), labels = c(1:4))
for(i in 1:21) credit[, i] <- as.factor(credit[, i])
set.seed(2828)
inTraining <- createDataPartition(credit$Creditability, p=0.7, list=FALSE)
train <- credit[inTraining,]
set.seed(2828)
rfModel <- randomForest(Creditability ~ ., data=train)
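# keep a one-row data frame so new inputs inherit the factor levels used in training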
inputDF <- credit[-(2:1000),]
# Return accept/reject decision
values <- reactiveValues()
observe({
input$action_Calc
inputDF[1,-1] <- isolate(c(input$account_balance, input$duration_of_credit,
input$payment_status_previous_credit, input$purpose,
input$credit_amount, input$value_savings,
input$length_employment, input$instalment_percent,
input$sex_marital_status, input$guarantors,
input$duration_current_address, input$most_valuable_asset,
input$age, input$concurrent_credits,
input$type_apartment, input$num_credits_this_bank,
input$occupation, input$num_dependents,
input$telephone, input$foreign_worker))
decision <- predict(rfModel, inputDF, type = 'prob')
if(decision[2] > decision[1]){
decision = 'ACCEPT'
} else {
decision = 'REJECT'
}
values$decision <- decision
})
# Display decision
output$decision <- renderText({
if(input$action_Calc == 0) ""
else
out <- paste(values$decision)
})
})
The final application is hosted on https://jssandom.shinyapps.io/credit_risk_evaluator/. A screenshot is shown below.
Credit Risk Evaluator
The application’s functionality was tested by inputting a couple of obvious dummy cases. The dummy applicant with a seemingly good credit profile received an ‘Accept’ recommendation, whilst the seemingly bad credit applicant received a ‘Reject’.
Dummy Examples
The above example shows the way in which a bank manager could use the application to guide decision making on credit applications. Although the application is quite basic, it gives an indication as to how value can be extracted from analysing data.
One limitation of the analysis is the fairly small and dated data set used in this example. A good model would draw on data from thousands (if not millions) of customers and constantly adapt to the flow of new information. However, this was not possible here due to limited access to relevant data and computational constraints.
Furthermore, more sophisticated models could certainly be developed to produce more accurate predictions. This example aims to show that fairly accurate models can be produced using only a few lines of code and well-known statistical modelling methods.