Dependency Modeling Quiz

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

## This is the Dependency Modeling Quiz Assignment.
library(ggplot2)

# #1. What is data mining? Give a brief overview (no more than two pages) of data mining.

# According to our text, data mining is the discovery of useful, valid, unexpected, and understandable knowledge from data.

# Data mining is the process of extracting useful patterns and knowledge from large datasets. It uses techniques from machine learning, statistics, and database systems to analyze data and discover previously unknown, interesting, and actionable information.

# In practice, this means examining large volumes of data to extract (or "mine") insight that can help companies solve problems, predict trends, mitigate risks, and identify new opportunities. The analogy with traditional mining is that, in both cases, one sifts through a large amount of raw material in search of something valuable.

# #2. List some topics, concerns, and some of the tools of data mining.

# One of the most important distinguishing issues in data mining is size. With the widespread use of computer technology and information systems, the amount of data available for exploration has increased exponentially. This poses difficult challenges for the standard data analysis disciplines: one has to consider issues like computational efficiency, limited memory resources, and interfaces to databases. Other key distinguishing features are the diversity of data sources that one frequently encounters in data mining projects, as well as the diversity of data types (text, sound, video, etc.). All these issues turn data mining into a highly interdisciplinary subject involving not only typical data analysts but also people working with databases, high-dimensional data visualization, and related areas.

# Some topics include data preprocessing and cleaning, data visualization, and data transformation.
# R and Python are examples of data mining tools.

# #3. What is data noise?

# Noise refers to random errors or variance in a dataset. It can arise from faulty sensors, data entry errors, or inconsistencies in data collection, and it can obscure patterns or reduce model accuracy. A noisy dataset is often poorly structured and cluttered with irrelevant information.

# #4. What is the concern about outliers?

# Outliers in data mining can significantly impact the accuracy and reliability of results, leading to misleading interpretations and poor model performance. They can skew statistical measures like the mean, distort relationships between variables, and negatively affect the ability of algorithms to learn from data.
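
# The effect of a single outlier on the mean can be seen in a small sketch. The numbers below are invented purely for illustration; the point is only that the mean shifts sharply while the median barely moves.

```r
# Hypothetical measurements, invented for this illustration
values <- c(10, 12, 11, 13, 12)
with_outlier <- c(values, 100)   # append one extreme value

mean(values)          # 11.6
mean(with_outlier)    # about 26.33, pulled toward the outlier
median(values)        # 12
median(with_outlier)  # 12, the median is unchanged in this example
```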

# #5. Discuss noise vs outliers.

# Noise and outliers are both deviations from the expected data patterns, but they differ in their nature and implications. Noise refers to random errors or irrelevant variations in data that obscure the true signal. Outliers, on the other hand, are data points that significantly deviate from the norm but may still be valid data points representing genuine phenomena or events.

# #6. Explain the following: “%>%”.

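# %>% is the pipe operator, provided by the magrittr package and re-exported by dplyr. It passes the result of the left-hand expression as the first argument of the right-hand function, so a chain of operations can be read from left to right.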
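
# A minimal sketch of the pipe, assuming the magrittr package is installed. The vector x is made up for illustration; the nested and piped forms compute the same thing.

```r
library(magrittr)  # provides %>%; dplyr also re-exports it

x <- c(1, 4, 9)  # hypothetical input vector

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: read left to right. x becomes the first argument of sqrt(),
# and that result becomes the first argument of sum().
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)  # TRUE; both equal 6
```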

# #7. What is the difference between supervised learning and unsupervised learning?

# Supervised learning: Uses labeled data; aims to predict outcomes. Examples: classification, regression.
# Unsupervised learning: No labels; aims to discover hidden structure. Examples: clustering, association rules.
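
# The contrast can be sketched with R's built-in iris dataset. The choice of dataset, of a linear regression as the supervised model, and of k-means with 3 centers as the unsupervised model are all assumptions made for illustration, not part of the quiz.

```r
data(iris)  # built-in: 4 numeric measurements plus a Species column

# Supervised: the target (Petal.Width) is known, and the model learns to predict it
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
predict(fit, newdata = data.frame(Petal.Length = 4.5))

# Unsupervised: use only the 4 numeric columns, with no labels supplied
set.seed(42)  # k-means uses random starts; fix the seed for repeatability
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Compare the discovered clusters with the species that were withheld
table(clusters$cluster, iris$Species)
```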

# #8. What is the difference between structured and unstructured data?

# Structured data: Organized in rows and columns (e.g., databases, spreadsheets).
# Unstructured data: No fixed format (e.g., text, images, audio, video).

# #9. Tell us about the company, Nvidia (No more than one page).

# Nvidia Corporation is an American technology company founded in 1993, best known for designing graphics processing units (GPUs). Initially focused on gaming, Nvidia became a leader in PC graphics with its GeForce line and later expanded into AI, data centers, and autonomous vehicles. Its CUDA platform allows GPUs to be used for general-purpose parallel computing, which is vital for deep learning and data science. Nvidia GPUs accelerate popular AI frameworks such as TensorFlow and PyTorch, and the company is a major supplier in the AI chip market.

# #10. TRUE or FALSE. RStudio is the IDE (Integrated Development Environment) for R.

# True

# #11. Explain a classification problem. 

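# A classification problem involves assigning a data point to one of a set of predefined categories or classes. With exactly two classes it is a binary classification problem, for example "lives in the US" or not; with more than two it is a multiclass problem.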

# #12. Explain a clustering problem.  

# Clustering involves grouping data points into clusters such that items in the same group are more similar to each other than to those in other groups. It is unsupervised.

# #13. What does the acronym CRISP-DM stand for?

# CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
# Phases include: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

# #14. What is predictive analysis?

# Predictive analysis uses historical data and machine learning techniques to predict future outcomes. 
# Example: forecasting sales, predicting churn.

# #15. What is descriptive analysis?

# Descriptive analysis summarizes past data to understand patterns or trends. It answers the “what happened?” question.

# #16. What is the elbow method?

# The elbow method is used in clustering, most commonly with k-means, to choose a suitable number of clusters. It plots the within-cluster sum of squares against the number of clusters and identifies the "elbow" point, where adding more clusters yields diminishing returns.
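
# A minimal sketch of the elbow method using base R's kmeans(). The data are synthetic (three invented, well-separated groups), so the elbow would be expected near k = 3; real data rarely give such a clean bend.

```r
set.seed(1)  # make the synthetic data and random starts repeatable

# Synthetic 2-D points in three groups centred near (0,0), (5,5), and (10,10)
pts <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(pts, centers = k, nstart = 10)$tot.withinss
})

# Look for the k after which the curve flattens out
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```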

# #17. What is another name for item frequency?

# Another term is support count, which refers to how often an item or itemset appears in a dataset.

# #18. Explain support count.

# Support count is the number of transactions in which a particular itemset appears.

# #19. Explain support.

# Support is the proportion of transactions that contain a particular itemset. It is the support count divided by the total number of transactions, and it is often expressed as a percentage.

# #20. Explain confidence

# In association rules, confidence measures the likelihood that item B is bought given that item A is bought. It is akin to a conditional probability: confidence(A --> B) = support(A and B together) / support(A).

# #21. How many subsets does a 5-itemset have?

# Using the formula 2^k - 1 with k = 5, there are 2^5 - 1 = 31 non-empty subsets. The "- 1" excludes the empty set; counting the empty set as well gives 2^5 = 32.
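
# The count can be checked in two ways in R: with the formula directly, and by adding up the number of subsets of each size from 1 to 5. The variable names are only illustrative.

```r
k <- 5

# Formula: all subsets (2^k) minus the empty set
n_formula <- 2^k - 1

# Cross-check: subsets of size 1, 2, ..., k are 5 + 10 + 10 + 5 + 1
n_by_size <- sum(choose(k, 1:k))

n_formula  # 31
n_by_size  # 31
```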

# #22. Use the following table (below) and find the confidence of the following association rule,
#  {Milk, Bread}  -->  {Diapers}.

# TID         ITEMS
# 1           {Bread, Milk}
# 2           {Bread, Diapers, Beer, Eggs}
# 3           {Milk, Diapers, Beer, Cola}
# 4           {Bread, Milk, Diapers, Beer} 
# 5           {Bread, Milk, Diapers, Cola}

# Since confidence is akin to conditional probability, I am using the formula P(diapers | milk & bread) = P(milk & bread & diapers) / P(milk & bread). Here A = milk & bread and B = diapers. The 3 items occur together in 2 of the 5 transactions (TIDs 4 and 5), so 2/5 = 0.4, and the 2 items occur together in 3 of the 5 transactions (TIDs 1, 4, and 5), so 3/5 = 0.6. So the confidence is 0.4/0.6 = 0.6667 = 66.67%, which means "given that milk and bread are purchased, there is a 66.67% probability that diapers will also be purchased."
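
# The calculation can be reproduced in base R as a rough check. The transactions are copied from the table in the question; support_count() is a hypothetical helper written for this sketch, not a function from any package.

```r
# The five transactions from the table (TID 1 to 5)
transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diapers", "Beer", "Eggs"),
  c("Milk", "Diapers", "Beer", "Cola"),
  c("Bread", "Milk", "Diapers", "Beer"),
  c("Bread", "Milk", "Diapers", "Cola")
)

# Hypothetical helper: number of transactions containing every item in itemset
support_count <- function(itemset) {
  sum(sapply(transactions, function(items) all(itemset %in% items)))
}

lhs <- c("Milk", "Bread")  # left-hand side of the rule
rhs <- "Diapers"           # right-hand side of the rule

n_lhs  <- support_count(lhs)          # 3 (TIDs 1, 4, 5)
n_both <- support_count(c(lhs, rhs))  # 2 (TIDs 4, 5)

support_both <- n_both / length(transactions)  # 0.4
confidence   <- n_both / n_lhs                 # 2/3, about 0.6667
confidence
```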

LLJ

June 22, 2025