Data Science: X-Ray Chest Classification

Jonah Winninghoff

12/14/20

What is the Architecture Decision Record?

Architecture Decision Record

An Architecture Decision Record (ADR) captures every key decision made in each relevant part of the coding process. It does not describe what the code does or how it executes; it records the reasoning behind it. For this project, the ADR documents the whys, so that readers can better understand why the image classification is architected this way.

Data Source

Definition

A data source is the location from which the data in use originates. The term most often refers to databases, in particular the relational database management systems commonly used by data scientists and analysts.

Technology Choice

Pranav Raikote published on Kaggle the images released by the University of Montreal. Kaggle is a platform that hosts openly available datasets. The Kaggle API is used from a Jupyter Notebook in IBM Watson Studio to retrieve the dataset. The dataset is 153 MB in total and is divided into two parts, training and testing. Each part has three subsets: Covid-19, viral pneumonia, and normal lungs.
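The retrieval step can be sketched as follows. This is a minimal sketch rather than the project's actual notebook code: it assumes the official kaggle command-line client is installed and an API token is configured. The function names are hypothetical, and the dataset slug and destination folder are left as arguments because this document does not state them.

```python
import subprocess
from pathlib import Path


def kaggle_download_command(dataset_slug, dest):
    """Build the Kaggle CLI call that downloads and unzips a dataset."""
    return [
        "kaggle", "datasets", "download",
        "-d", dataset_slug,  # "<owner>/<dataset-name>" as shown on Kaggle
        "-p", str(dest),     # destination folder
        "--unzip",           # extract the archive after download
    ]


def download_dataset(dataset_slug, dest):
    """Run the download; needs the kaggle client and valid credentials."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    subprocess.run(kaggle_download_command(dataset_slug, dest), check=True)
    return dest
```

Calling download_dataset with the dataset's slug and a target folder would fetch and extract the archive in one step. Keeping the command builder separate makes it possible to inspect the command without touching the network.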

Justification

Hosting the data openly on Kaggle makes the work transparent and makes updates to the dataset easy to follow. For example, the dataset may be updated if new Covid-19 variants emerge. However, whether to do so is entirely at the owner's discretion.

Enterprise Data

Definition

Enterprise data consists of datasets or databases shared by users across institutions or geographic regions, with a main focus on the resilience of data storage.

Technology Choice

A Jupyter Notebook in IBM Watson Studio connects to the Kaggle platform, and the dataset is also persisted in IBM Cloud Pak for Data as an asset. The process is entirely cloud-based.

Justification

The advantage of using the cloud is the resilience of the dataset, which is highly accessible and free of geographic restrictions. Processing the dataset does not require local computing power. In general, this technology raises ethical concerns about data privacy and security. In this case, however, there is no indication that the dataset contains personal information.

Data Integration

Definition

Data integration is the process of combining data from a variety of sources, and it commonly comes up when querying relational databases. For example, several related datasets may be spread across a database and need to be joined using primary and foreign keys.

Technology Choice

Once the data is extracted, pathlib is used to represent the filesystem paths as objects and to separate the training and testing datasets. Finally, flow_from_directory, available through the Keras API in TensorFlow, automatically labels each subset of the training and testing datasets based on the directory structure.
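A minimal sketch of this step is shown here, under stated assumptions. The function names, the "test" folder name, the image size, and the batch size are illustrative choices and not values confirmed by this project. The Keras import is placed inside the function so that the path helper can be used without TensorFlow installed.

```python
from pathlib import Path


def split_dirs(data_root):
    """Return the training and testing directories under the data root."""
    root = Path(data_root)
    # "train" appears in this document; "test" is an assumed folder name.
    return root / "train", root / "test"


def make_generator(directory, target_size=(224, 224), batch_size=32):
    """Create a Keras iterator that labels images by their subdirectory."""
    # Imported lazily so split_dirs works without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(rescale=1.0 / 255)  # scale pixels to [0, 1]
    return datagen.flow_from_directory(
        str(directory),
        target_size=target_size,   # assumed image size
        batch_size=batch_size,     # assumed batch size
        class_mode="categorical",  # one-hot labels, one per class folder
    )
```

With this layout, make_generator(train_dir) would yield batches of images together with one-hot labels derived from the folder names, so no separate label file is needed.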

Justification

This approach is justified because the filesystem directory is organized by class, as follows:

…/data/train

      …/Covid
      
            …/001.png
            
            …/002.jpg
            
      …/Pneumonia
      
            …/001.jpeg
            
            …/002.jpg
            
      …/Normal
      
            …/001.jpg
            
            …/002.png
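Because each class lives in its own folder, the labels can be read directly from the folder names. The sketch below illustrates that convention using only pathlib. It is not the Keras implementation; the function name and the set of accepted file extensions are assumptions based on the extensions shown in the tree above.

```python
from pathlib import Path

# Extensions seen in the directory listing; assumed to cover the dataset.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images_per_class(split_dir):
    """Map each class subdirectory name to the number of image files in it."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

Running count_images_per_class on the training folder would return one entry per class folder, which is also a quick way to check how evenly the classes are represented.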

Discovery and Exploration

Definition

Discovery and exploration is another term for Exploratory Data Analysis (EDA), which refers to the components that provide visualizations and summary statistics. In other words, this procedure is a form of data mining that searches for patterns and correlations in the data.

Technology Choice

Several explorations have been carried out, as follows:

Average and Difference

Standard Deviation

Justification

Every analysis has a purpose. The first analysis is to become familiar with what the dataset looks like. The second analysis, a bar chart, checks that the number of images in each subset is not too uneven. The result shows that the subsets are relatively even, so class imbalance is unlikely to affect the image classification model. The chart also implies, however, that the sample size is not large enough, a problem that ImageDataGenerator can mitigate. The average, difference, and standard deviation images test whether the classes are clearly distinguishable from each other. Fortunately, they suggest that an image classification model is trainable.
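The average, difference, and standard deviation images can be sketched as pixel-wise statistics. This sketch assumes grayscale images of identical size, each loaded as a nested list of rows. The function names are hypothetical, and in practice a NumPy array with np.mean and np.std along the image axis would do the same job faster.

```python
from statistics import fmean, pstdev


def pixelwise(images, stat):
    """Apply `stat` across all images at every pixel position."""
    return [[stat(pixels) for pixels in zip(*rows)] for rows in zip(*images)]


def difference(image_a, image_b):
    """Subtract image_b from image_a pixel by pixel."""
    return [
        [a - b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
```

For example, pixelwise(covid_images, fmean) gives the average Covid-19 image, pixelwise(covid_images, pstdev) gives its standard deviation image, and difference applied to two class averages highlights the regions where the classes differ most.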

Actionable Insight

Technology Choice

The diagram demonstrates how a Convolutional Neural Network (CNN) is trained for image classification:

Justification

The CNN technique makes this classification more efficient and agile. Without it, computation may be time-consuming because of the large number of parameters. As mentioned earlier, the sample size of this dataset is too small. ImageDataGenerator augments the images, which helps mitigate potential biases. Not only does this improve the algorithm's performance, it also makes the model more resistant to overfitting. The generator also stabilizes the convergence of the categorical cross-entropy loss. The out-of-sample accuracy is 92.42% and the loss is 0.22. The Adam optimizer, a variant of stochastic gradient descent, tends to perform well for CNNs.
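To make the loss figure easier to interpret, the following sketch computes a mean categorical cross-entropy from one-hot labels and predicted class probabilities. It is an illustration in plain Python and not the Keras implementation; the function name and the clipping constant eps are assumptions.

```python
import math


def categorical_cross_entropy(one_hot_labels, predicted_probs, eps=1e-7):
    """Mean categorical cross-entropy over a batch of samples."""
    total = 0.0
    for label, probs in zip(one_hot_labels, predicted_probs):
        # Clip probabilities away from zero so the logarithm stays finite.
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(label, probs))
    return total / len(one_hot_labels)
```

If the reported loss is this plain mean, a value of 0.22 corresponds to a geometric-mean probability of about exp(-0.22) ≈ 0.80 assigned to the correct class.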