National Vulnerability Database Analysis

9/11/2022

Team members

Suhasini Masti

Sylvia Chopde

Shrawani Misra

Thanusha Reddy Pappu

Vulnerability

Data Sources

The primary data source for this project is the National Vulnerability Database, which provides comprehensive information about software vulnerabilities.

The National Vulnerability Database holds data on software vulnerabilities disclosed over the past two decades.

The data provided on the website is in the form of JSON data feeds.

Since our primary data source contains over 50,000 vulnerability disclosures, we do not plan to collect data from any secondary sources.
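
As a rough illustration of what working with the JSON data feeds could involve, the sketch below flattens feed records into tabular rows and writes them out as CSV. It is written in Python purely for illustration and is not the project's actual pipeline. The field names (CVE_Items, CVE_data_meta, description_data, baseMetricV3, publishedDate) are an assumption about the feed layout, and the two sample records and the helper names flatten_item, feed_to_rows and rows_to_csv are hypothetical.

```python
import csv
import io
import json

# Hypothetical sample that mimics the ASSUMED layout of a National
# Vulnerability Database JSON feed; real feeds may differ.
SAMPLE_FEED = json.dumps({
    "CVE_Items": [
        {
            "cve": {
                "CVE_data_meta": {"ID": "CVE-0000-0001"},
                "description": {"description_data": [
                    {"lang": "en", "value": "Example buffer overflow in a parser."}
                ]},
            },
            "impact": {"baseMetricV3": {"cvssV3": {
                "baseScore": 9.8, "baseSeverity": "CRITICAL"}}},
            "publishedDate": "2022-01-15T10:15Z",
        },
        {
            "cve": {
                "CVE_data_meta": {"ID": "CVE-0000-0002"},
                "description": {"description_data": [
                    {"lang": "en", "value": "Example cross-site scripting in a web form."}
                ]},
            },
            "impact": {},  # records without a score are kept, with blank fields
            "publishedDate": "2022-02-01T08:00Z",
        },
    ]
})

FIELDS = ["cve_id", "published", "severity", "score", "description"]


def flatten_item(item):
    """Turn one nested feed record into a flat dict (one future CSV row)."""
    cve = item.get("cve", {})
    descriptions = cve.get("description", {}).get("description_data", [])
    # Keep the first English description, or an empty string if none exists.
    text = next((d.get("value", "") for d in descriptions
                 if d.get("lang") == "en"), "")
    cvss = item.get("impact", {}).get("baseMetricV3", {}).get("cvssV3", {})
    return {
        "cve_id": cve.get("CVE_data_meta", {}).get("ID", ""),
        "published": item.get("publishedDate", ""),
        "severity": cvss.get("baseSeverity", ""),
        "score": cvss.get("baseScore", ""),
        "description": text,
    }


def feed_to_rows(feed_text):
    """Parse the feed text and flatten every record it contains."""
    return [flatten_item(i) for i in json.loads(feed_text).get("CVE_Items", [])]


def rows_to_csv(rows):
    """Serialize the flat rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = feed_to_rows(SAMPLE_FEED)
csv_text = rows_to_csv(rows)
```

Missing severity data is kept as blank fields rather than dropped, so the textual description of every disclosure remains available for later analysis.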

Problem Statement and Why It Is Interesting

By analyzing the vulnerabilities disclosed over the last couple of years, we can identify how frequently they occur and the level of impact they pose. This will help us identify focus areas for minimizing critical vulnerabilities.
Analyzing data from the last 2-3 years gives a picture of the most recent software vulnerabilities, which can help organizations target these areas through their security standards and minimize the consequences of software breaches.

Analytics plan

1. Since our data is in JSON format, we will convert it to a data frame and then to a CSV file for easier analysis.

2. We will visualize the data in R using ggplot2 and quanteda textplots to better understand relationships among the topics.

3. Quantitative analysis of textual data using the quanteda package: create a corpus, remove stop words, generate tokens and a document-feature matrix (DFM), extract top features, and build a feature co-occurrence matrix (FCM).

4. LDA (Latent Dirichlet Allocation).

5. Advanced topic modelling such as STM (Structural Topic Modelling), which improves inference and qualitative interpretability and allows us to use document-level variables.
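
The text-preparation step in the plan above (tokens, stop-word removal, document-feature matrix, top features, co-occurrence) can be sketched in a few lines. This is a minimal Python illustration of the ideas only, not the quanteda implementation the plan calls for. The three sample descriptions and the tiny stop-word list are made up, and the co-occurrence count used here (number of documents in which two features both appear) is a simplifying assumption.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and"}

# Hypothetical vulnerability descriptions standing in for the corpus.
docs = {
    "doc1": "Buffer overflow in the parser allows remote code execution.",
    "doc2": "SQL injection in the login form allows remote attackers to read data.",
    "doc3": "Buffer overflow in the image decoder.",
}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]


# Tokens per document, then a document-feature matrix stored as
# {document name: Counter mapping feature -> count}.
tokens = {name: tokenize(text) for name, text in docs.items()}
dfm = {name: Counter(toks) for name, toks in tokens.items()}


def top_features(dfm, n):
    """Return the n features with the highest total count across documents."""
    totals = Counter()
    for counts in dfm.values():
        totals.update(counts)
    return totals.most_common(n)


def fcm(tokens):
    """Count, for each feature pair, the documents that contain both."""
    pairs = Counter()
    for toks in tokens.values():
        # sorted(set(...)) gives each unordered pair one canonical key.
        for a, b in combinations(sorted(set(toks)), 2):
            pairs[(a, b)] += 1
    return pairs


co_occurrence = fcm(tokens)
most_frequent = top_features(dfm, 4)
```

Pairs such as ("buffer", "overflow") that recur across documents are the kind of relationship a co-occurrence matrix is meant to surface before any topic model is fitted.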

Evaluation Plan

1. Perplexity, a common metric for evaluating NLP and language models; a lower perplexity indicates a better-fitting model.

2. Cross-validation measures such as CaoJuan2009, Arun2010, and Deveaud2014 to find the number of topics best suited to our model. We will then fit the resulting LDA model and compute topic-specific diagnostics using the topicdoc package.

3. topicQuality - to measure exclusivity versus semantic coherence of different topics.

4. LDAvis - to explore the different topics and their key words, and to plot Expected Topic Performance of Top Topics.
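
To make the perplexity criterion in the list above concrete, here is a minimal Python sketch assuming the usual definition: the exponential of the negative average log-likelihood per token on held-out text. The function name perplexity and the toy probabilities are hypothetical. In practice the per-token log-probabilities would come from the fitted topic model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / n)


# Toy held-out sets of 10 tokens each (hypothetical values).
# A model assigning every token probability 0.25 is as uncertain as a
# uniform choice among 4 words; one assigning 0.5 is less uncertain.
weaker_model = perplexity([math.log(0.25)] * 10)
stronger_model = perplexity([math.log(0.5)] * 10)
```

The model that assigns higher probability to the held-out tokens ends up with the lower perplexity, which is why a lower value is read as a better fit.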