Introduction

In this assignment we explore “the good, the bad and the ugly” of using Decision Trees. We look into the bias and variance issues associated with decision trees and see whether Random Forest provides a way around some of them.

Link to Good, Bad and Ugly Decision Tree Article

Load in Data

I am using a Kaggle data set that I have used before: a ramen ratings data set.

ramenDf <- read.csv("G:/Documents/DATA622_HW1/ramen-ratings.csv")

Remove the review number (an unnecessary key) and cast the star rating as numeric.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ramenDf <- ramenDf %>%
  select(-c('Review..')) %>%
  mutate(Stars = as.numeric(Stars)) %>%
  relocate(Stars) %>%
  filter(!is.na(Stars))

Here I am generating some new features. I am creating dummy variables for the style (pack, tray, cup, bowl). I am also creating features based on the ramen name: does it include the word “Instant”, “Spicy” or “Curry”? I am creating a dummy variable for the most popular producer, Nissin. Finally, I am creating the target variable for my classification task, ‘east_asia’, indicating whether the product was produced in China, Japan, South Korea or Taiwan.

ramenDf <-
  ramenDf %>% mutate(pack = ifelse(Style == 'Pack', 1, 0),
                     tray = ifelse(Style == 'Tray', 1, 0),
                     cup = ifelse(Style == 'Cup', 1, 0),
                     bowl = ifelse(Style == 'Bowl', 1, 0))

ramenDf <-
  ramenDf %>% mutate(nissin = ifelse(Brand == 'Nissin', 1, 0))

ramenDf <- ramenDf %>% mutate(east_asia = as.numeric(ifelse(Country %in% c('China', 'Japan', 'South Korea', 'Taiwan'), 1, 0)))

ramenDf <- ramenDf %>%
  mutate(spicy = as.numeric(grepl('Spicy', Variety)),
         curry = as.numeric(grepl('Curry', Variety)),
         instant = as.numeric(grepl('Instant', Variety)))

ramenDf <- ramenDf %>% select(Stars, nissin, east_asia, spicy, curry, instant, pack, tray, cup, bowl)
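
Finally, a quick sanity check of the engineered target’s class balance, since the no-information rate in the confusion matrices later on depends on it. A minimal sketch:

# Sketch: class balance of the engineered target
# (0 = produced elsewhere, 1 = produced in East Asia)
table(ramenDf$east_asia)
round(prop.table(table(ramenDf$east_asia)), 2)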

Data Exploration

Here we are using the ‘scan_data’ function from the pointblank package to explore the missingness, distributions and correlations in the data set.

Some interesting correlations are revealed. The target ‘east_asia’ is positively correlated with the number of stars a product received and with the ‘bowl’ style. It is negatively correlated with the producer Nissin and with having the word ‘Instant’ or ‘Curry’ in the name. These negative relationships make sense: Nissin produces ramen for a Western audience, and the adjectives ‘Instant’ and ‘Curry’ are probably most appealing to those same consumers.

Because this dataset came from Kaggle, it is very clean and has almost no missingness, which is rarely the case with real-world data.

pointblank::scan_data(ramenDf)

Overview of ramenDf

Table Overview

Columns: 10
Rows: 2,577
NAs: 0
Duplicate Rows: 2,049 (79.51%)
Column Types: numeric (10)

Reproducibility Information

Scan Build Time: 2024-04-04 11:56:45
pointblank Version: 0.11.4
R Version: R version 4.3.2 (2023-10-31 ucrt) "Eye Holes"
Operating System: x86_64-w64-mingw32

Variables (one row per column of ramenDf, in order)

Column      Distinct   NAs   Inf/-Inf   Mean   Minimum   Maximum
Stars       42         0     0          3.65   0         5
nissin      2          0     0          0.15   0         1
east_asia   2          0     0          0.41   0         1
spicy       2          0     0          0.10   0         1
curry       2          0     0          0.05   0         1
instant     2          0     0          0.18   0         1
pack        2          0     0          0.59   0         1
tray        2          0     0          0.04   0         1
cup         2          0     0          0.17   0         1
bowl        2          0     0          0.19   0         1

(The scan’s Interactions, Correlations and Missing Values panels are plots and are omitted here.)

Sample

Stars nissin east_asia spicy curry instant pack tray cup bowl
1 3.75 0 1 0 0 0 0 0 1 0
2 1.00 0 1 1 0 0 1 0 0 0
3 2.25 1 0 0 0 0 0 0 1 0
4 2.75 0 1 0 0 0 1 0 0 0
5 3.75 0 0 0 1 0 1 0 0 0
6..2572
2573 3.50 0 0 0 0 1 0 0 0 1
2574 1.00 0 0 0 0 1 1 0 0 0
2575 2.00 0 0 0 0 0 1 0 0 0
2576 2.00 0 0 0 0 0 1 0 0 0
2577 0.50 0 0 0 0 0 1 0 0 0
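
As a quick cross-check of the correlations reported by scan_data, the base cor() matrix can be printed directly; at this point every column of ramenDf is still numeric. A minimal sketch:

# Sketch: pairwise Pearson correlations; the east_asia row shows the
# relationships discussed above (positive with Stars and bowl,
# negative with nissin, instant and curry)
round(cor(ramenDf), 2)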

Decision Trees

We have to cast our target ‘east_asia’ to a factor for the classification decision tree to run. Here we create our first classification decision tree.

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tibble       3.2.1 
## ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
## ✔ infer        1.0.7      ✔ tune         1.2.0 
## ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
## ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
## ✔ purrr        1.0.2      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
ramenDf$east_asia <- as.factor(ramenDf$east_asia )

set.seed(123)
data_split <- initial_split(ramenDf, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

tree_spec <- decision_tree(engine = "rpart", mode = "classification", tree_depth = 7)

# Fit the model to the training data
tree_fit1 <- tree_spec %>%
 fit(east_asia ~ ., data = train_data, model=TRUE) 

Our first node is not surprising: if the word “Instant” appears in the ramen name, the product is classified as not from East Asia.

The far right branch says that if the name does not include “Instant”, the ramen is not in a cup, it has more than 3.7 stars and it is in a bowl, then it is classified as from East Asia.

# Load the library
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
# Plot the decision tree
rpart.plot(tree_fit1$fit, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto", roundint=FALSE)
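
If tracing the diagram is awkward, rpart.plot can also print each leaf as a plain-language rule; a minimal sketch using the rpart object stored inside the parsnip fit:

# Sketch: print the tree's leaves as human-readable rules
rpart.plot::rpart.rules(tree_fit1$fit, roundint = FALSE)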

This model has an accuracy of 66% on the training data.

predictions <- tree_fit1 %>%
 predict(train_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(train_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 863 365
##          1 292 412
##                                           
##                Accuracy : 0.6599          
##                  95% CI : (0.6383, 0.6811)
##     No Information Rate : 0.5978          
##     P-Value [Acc > NIR] : 1.083e-08       
##                                           
##                   Kappa : 0.2818          
##                                           
##  Mcnemar's Test P-Value : 0.00497         
##                                           
##             Sensitivity : 0.7472          
##             Specificity : 0.5302          
##          Pos Pred Value : 0.7028          
##          Neg Pred Value : 0.5852          
##              Prevalence : 0.5978          
##          Detection Rate : 0.4467          
##    Detection Prevalence : 0.6356          
##       Balanced Accuracy : 0.6387          
##                                           
##        'Positive' Class : 0               
## 

This model has an accuracy of 64% on the test data.

predictions <- tree_fit1 %>%
 predict(test_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 273 138
##          1  97 137
##                                           
##                Accuracy : 0.6357          
##                  95% CI : (0.5972, 0.6729)
##     No Information Rate : 0.5736          
##     P-Value [Acc > NIR] : 0.0007718       
##                                           
##                   Kappa : 0.2406          
##                                           
##  Mcnemar's Test P-Value : 0.0090724       
##                                           
##             Sensitivity : 0.7378          
##             Specificity : 0.4982          
##          Pos Pred Value : 0.6642          
##          Neg Pred Value : 0.5855          
##              Prevalence : 0.5736          
##          Detection Rate : 0.4233          
##    Detection Prevalence : 0.6372          
##       Balanced Accuracy : 0.6180          
##                                           
##        'Positive' Class : 0               
## 

Next we want to build a Decision Tree excluding the feature ‘instant’ that was used in the first node.

# Fit the model to the training data


set.seed(123)
data_split <- initial_split(ramenDf %>% select(-instant), prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

tree_spec <- decision_tree(engine = "rpart", mode = "classification", tree_depth = 7)

tree_fit2 <- tree_spec %>%
 fit(east_asia ~ ., data = train_data, model=TRUE)

With ‘instant’ excluded from the features, ‘cup’ is now the feature used in the first node.

The model has an accuracy of 63% on the training data and 60% on the test data.

Even though the accuracies of the two models are very similar, the underlying sensitivity and specificity change dramatically. Keeping in mind that the positive class in these confusion matrices is 0 (not East Asia), the second decision tree has very high sensitivity but very poor specificity: it predicts ‘not East Asia’ for almost everything, so it catches nearly all of the non-East-Asian ramen while missing most of the East Asian products.

# Load the library
library(rpart.plot)

# Plot the decision tree
rpart.plot(tree_fit2$fit, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto", roundint=FALSE)

predictions <- tree_fit2 %>%
 predict(train_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(train_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1089  641
##          1   66  136
##                                           
##                Accuracy : 0.6341          
##                  95% CI : (0.6121, 0.6556)
##     No Information Rate : 0.5978          
##     P-Value [Acc > NIR] : 0.0005967       
##                                           
##                   Kappa : 0.1341          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9429          
##             Specificity : 0.1750          
##          Pos Pred Value : 0.6295          
##          Neg Pred Value : 0.6733          
##              Prevalence : 0.5978          
##          Detection Rate : 0.5637          
##    Detection Prevalence : 0.8954          
##       Balanced Accuracy : 0.5589          
##                                           
##        'Positive' Class : 0               
## 
predictions <- tree_fit2 %>%
 predict(test_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 341 224
##          1  29  51
##                                           
##                Accuracy : 0.6078          
##                  95% CI : (0.5689, 0.6456)
##     No Information Rate : 0.5736          
##     P-Value [Acc > NIR] : 0.04308         
##                                           
##                   Kappa : 0.1178          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9216          
##             Specificity : 0.1855          
##          Pos Pred Value : 0.6035          
##          Neg Pred Value : 0.6375          
##              Prevalence : 0.5736          
##          Detection Rate : 0.5287          
##    Detection Prevalence : 0.8760          
##       Balanced Accuracy : 0.5535          
##                                           
##        'Positive' Class : 0               
## 
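
As a cross-check of the sensitivity/specificity discussion above, the same numbers can be computed with tidymodels’ own yardstick package. A minimal sketch; the calls are namespaced to avoid caret’s masking of yardstick’s metrics, and event_level = "first" mirrors caret’s choice of 0 as the positive class:

# Sketch: sensitivity and specificity of the second tree on the test set
preds2 <- tree_fit2 %>% predict(test_data) %>% pull(.pred_class)
results2 <- tibble::tibble(truth = test_data$east_asia, estimate = preds2)
yardstick::sens(results2, truth, estimate, event_level = "first")
yardstick::spec(results2, truth, estimate, event_level = "first")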

Random Forest

Here we split the ramen dataset into training and test sets for our random forest.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(datasets)
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(222)
data_split <- initial_split(ramenDf, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

Here I train my random forest model on the ramen training data.

rf <- randomForest(east_asia ~., data=train_data,
                       keep.forest=TRUE, importance=TRUE)
print(rf)
## 
## Call:
##  randomForest(formula = east_asia ~ ., data = train_data, keep.forest = TRUE,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 35.56%
## Confusion matrix:
##     0   1 class.error
## 0 795 338    0.298323
## 1 349 450    0.436796
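
Because the forest was trained with importance = TRUE, we can also look at which features it leans on; a minimal sketch using randomForest’s built-in importance tools:

# Sketch: rank features by mean decrease in accuracy and in Gini impurity
randomForest::importance(rf)
randomForest::varImpPlot(rf, main = "Variable importance")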

Here I evaluate the model’s accuracy on the test data.

predictions <- rf %>%
 predict(test_data)


caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 279 126
##          1 113 127
##                                           
##                Accuracy : 0.6295          
##                  95% CI : (0.5909, 0.6668)
##     No Information Rate : 0.6078          
##     P-Value [Acc > NIR] : 0.1380          
##                                           
##                   Kappa : 0.2157          
##                                           
##  Mcnemar's Test P-Value : 0.4376          
##                                           
##             Sensitivity : 0.7117          
##             Specificity : 0.5020          
##          Pos Pred Value : 0.6889          
##          Neg Pred Value : 0.5292          
##              Prevalence : 0.6078          
##          Detection Rate : 0.4326          
##    Detection Prevalence : 0.6279          
##       Balanced Accuracy : 0.6069          
##                                           
##        'Positive' Class : 0               
## 
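
If we wanted to squeeze more out of the forest, mtry (the number of variables tried at each split, 3 by default here) is the obvious knob, and randomForest ships a simple out-of-bag search for it. A minimal sketch, assuming the same train_data split as above:

# Sketch: search over mtry using the out-of-bag error estimate
set.seed(222)
randomForest::tuneRF(x = train_data %>% dplyr::select(-east_asia),
                     y = train_data$east_asia,
                     ntreeTry = 500, stepFactor = 1.5,
                     improve = 0.01, trace = TRUE)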

Results

The two decision trees had similar performance, but the second tree, with ‘instant’ excluded, did worse on both the training data (66% vs 63% accuracy) and the test data (64% vs 60%). Both models show some variance, losing accuracy when moving from training to test data, the second model slightly more so.

The random forest had about the same test accuracy as the decision trees but lower variance: the out-of-bag training accuracy was 64% and the test accuracy was 63%, a much smaller gap. So the random forest gives us a more generalizable classification model.

Random Forest solves some of the ‘Bad’ parts of Decision Trees that the article talks about. Random Forest abstracts away the ‘Complexity’: you aren’t trying to understand a visual representation of the decision making, you are just given a classification prediction, which may ultimately be all you need to make a decision. You also don’t have to worry about data ‘Evolution’: as you get more data you just retrain your trees and get new predictions.

Random Forest also deals with some of the ‘Ugly’ parts of Decision Trees. A Random Forest model has high ‘Usability’: it isn’t a static diagram like a Decision Tree. It is ‘Mobile’ and ‘Everywhere’: you can use it anywhere, regardless of the device, and it can be ‘Integrated’ into the back end of any application. It gives you good out-of-the-box performance, so you don’t need to be a ‘Coding’ expert to use it, and the accuracy of your models gives you a ‘Measure’ of performance.

You do loose some of the things that make Decision Trees great with Random Forest. Decision trees are easy to ‘Understand’ and ‘Come naturally’ because it is similar to how people thing about complex choices. It as can serve as documentation which to me is very attractive. I often work on long running project and the ability to look at code and see the decision making baked into reduces the cognitive burden of revisiting an old project. I think that there are use cases for both, it just depends on which benefit is most important for a project.