#1(a)
library(MASS)
data(Boston)
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Problem 1(a)
The Boston dataset contains 506 rows and 14 columns.
Each row represents a census tract in the Boston area. Each column is a variable describing housing conditions, environmental factors, or socioeconomic characteristics.
For example, the dataset includes the per capita crime rate (crim), average number of rooms per dwelling (rm), property-tax rate (tax), pupil-teacher ratio (ptratio), and median home value (medv).
#1(b)
pairs(Boston)
Problem 1(b) The pairwise scatterplots show several clear relationships
between the variables. There is a strong negative relationship between
lstat and medv: tracts with a larger percentage of lower-status
residents tend to have lower median home values. There is also a
positive relationship between rm and medv, indicating that tracts with
more rooms per dwelling tend to have higher home values. In addition,
tax and rad appear to be positively related, suggesting that tracts
with better access to highways tend to have higher tax rates. Overall,
the scatterplots show that several predictors are strongly associated
with home values and with one another.
#1(c)
cor(Boston$crim, Boston)
## crim zn indus chas nox rm age
## [1,] 1 -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343
## dis rad tax ptratio black lstat medv
## [1,] -0.3796701 0.6255051 0.5827643 0.2899456 -0.3850639 0.4556215 -0.3883046
Problem 1(c) Yes, several predictors are associated with the per capita crime rate (crim). The strongest correlations are with rad (0.626) and tax (0.583), which suggests that tracts with better highway access and higher tax rates tend to have higher crime rates. There is also a moderate positive correlation between crim and lstat (0.456), indicating that tracts with a larger percentage of lower-status residents tend to have higher crime rates. In the other direction, crim is negatively correlated with medv (-0.388), black (-0.385), and dis (-0.380), so tracts with higher home values tend to have lower crime rates. The only predictor with essentially no linear association is chas (-0.056). These correlations capture linear association only and do not establish causation.
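The correlations are easier to compare when they are sorted by absolute value. The following is a minimal sketch of one way to do that; the object names `crim_cor` and `crim_cor_sorted` are illustrative and not part of the original analysis.

```r
library(MASS)  # provides the Boston data frame used above

# Correlation of crim with every other column (crim itself is dropped)
crim_cor <- cor(Boston$crim, Boston[, names(Boston) != "crim"])[1, ]

# Order by strength of association, ignoring the sign
crim_cor_sorted <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
round(crim_cor_sorted, 3)
```

With the values printed above, rad and tax appear first in this ordering, followed by lstat.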
#1(d)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Problem 1(d)
Yes, some census tracts have particularly high crime rates. The crime rate (crim) ranges from 0.00632 to 88.97620, while the median is only 0.25651 and the third quartile is 3.67708, so a small number of tracts have crime rates far above the rest.
The tax rate (tax) ranges from 187 to 711. The third quartile (666) is close to the maximum, which suggests that a sizable group of tracts share a high tax rate rather than a few isolated outliers.
The pupil-teacher ratio (ptratio) has a much narrower range, from 12.60 to 22.00, with a median of 19.05, so no tract stands out as extreme on this variable.
Other predictors also vary widely. For example, the average number of rooms (rm) ranges from 3.561 to 8.780, and the percentage of lower-status population (lstat) ranges from 1.73 to 37.97, indicating substantial variation in housing and socioeconomic conditions across census tracts.
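The ranges quoted here can be collected in one table, and unusually high crime rates can be picked out with a simple rule. The sketch below assumes the common 1.5 * IQR rule as the definition of "particularly high"; that threshold is an assumption made for illustration, not something the exercise specifies, and the names `ranges`, `crim_cutoff`, and `high_crim` are hypothetical.

```r
library(MASS)  # provides the Boston data frame used above

# Minimum and maximum of every column as a 2 x 14 matrix
ranges <- sapply(Boston, range)
rownames(ranges) <- c("min", "max")
ranges[, c("crim", "tax", "ptratio")]

# Assumed rule: a tract is "high crime" if crim exceeds Q3 + 1.5 * IQR
crim_cutoff <- quantile(Boston$crim, 0.75) + 1.5 * IQR(Boston$crim)
high_crim <- Boston[Boston$crim > crim_cutoff, c("crim", "tax", "ptratio")]
nrow(high_crim)  # number of tracts above the cutoff
```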
#1(e)
table(Boston$chas)
##
## 0 1
## 471 35
Problem 1(e) There are 35 census tracts that bound the Charles River. This can be seen from the variable chas, where a value of 1 indicates that the tract bounds the river and a value of 0 indicates that it does not.
#1(f)
median(Boston$ptratio)
## [1] 19.05
Problem 1(f) The median pupil-teacher ratio in the Boston dataset is 19.05, so the typical census tract has about 19 students per teacher.
#2(a)
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
table(Soybean$Class)
##
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
table(Soybean$temp)
##
## 0 1 2
## 80 374 199
table(Soybean$precip)
##
## 0 1 2
## 74 112 459
table(Soybean$hail)
##
## 0 1
## 435 127
table(Soybean$crop.hist)
##
## 0 1 2 3
## 65 165 219 218
Problem 2(a) The frequency distributions of the categorical predictors show that the data is not evenly distributed across all categories.
For example, the temperature (temp) variable has frequencies of 80, 374, and 199 across its levels, showing that level 1 occurs most frequently. Similarly, the precipitation (precip) variable has frequencies of 74, 112, and 459, indicating that level 2 is much more common than the others.
The hail variable also shows imbalance, with 435 observations in level 0 and only 127 in level 1.
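These results show that the predictors tabulated here are unevenly distributed, but none of them is degenerate, since every level still has a substantial number of observations. Note that table() drops missing values by default, which is why the counts for each predictor sum to fewer than 683.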
#2(b)
colSums(is.na(Soybean))
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
Problem 2(b)
Yes, some predictors are much more likely to have missing values than others.
For example, hail, sever, seed.tmt, and lodging each have 121 missing values out of 683 observations (about 18%), and germ has 112. Several others, such as fruit.spots, seed.discolor, shriveling, and fruiting.bodies, have 106 each.
In contrast, Class and leaves have no missing values, and date and area.dam have only one each.
Many predictors share exactly the same count (121, 106, 84, or 38), which suggests that values are missing together for the same observations rather than at random.
The counts above do not show whether the missingness is related to the class labels. Answering that requires tabulating the missing values by Class.
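One way to check whether missingness depends on the class is to compute, for each class, the share of rows that have at least one missing predictor. The sketch below uses a hypothetical helper, `na_by_class`, and a small made-up data frame so that it runs without extra packages; neither is part of the original analysis.

```r
# Hypothetical helper: proportion of rows with at least one NA, per class
na_by_class <- function(df, class_col) {
  predictors <- df[, names(df) != class_col, drop = FALSE]
  has_na <- !complete.cases(predictors)  # TRUE if any predictor is missing
  tapply(has_na, df[[class_col]], mean)
}

# Made-up example data: class "b" has missing values, class "a" does not
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "b")),
  x1 = c(1, 2, NA, NA, 5),
  x2 = c("u", "v", NA, "w", "x")
)
result <- na_by_class(toy, "Class")
result
```

Applied to the real data, the call would be `na_by_class(Soybean, "Class")`. Classes with a proportion near 1 would indicate that missingness is concentrated in particular diseases.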
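#2(c)
# Return the most frequent non-missing value of x
get_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}
# Replace each missing value with the mode of its column
Soybean_imputed <- Soybean
for (i in seq_along(Soybean_imputed)) {
  missing <- is.na(Soybean_imputed[[i]])
  Soybean_imputed[missing, i] <- get_mode(Soybean_imputed[[i]])
}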
Problem 2(c)
One strategy for handling missing data is imputation. Since most of the predictors in the Soybean dataset are categorical variables, replacing missing values with the mode (most frequent category) is a reasonable approach.
This method preserves the dataset size and avoids losing information that would occur if observations or predictors were removed.
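Mode imputation is simple and allows the dataset to be used for modeling without missing values, although it inflates the frequency of the most common category, which matters most for predictors with many missing values such as hail (121 of 683).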
Alternatively, predictors with excessive missing values could be removed, but imputation is generally preferred when the dataset is not very large.
#3(a)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(BloodBrain)
str(logBBB)
## num [1:208] 1.08 -0.4 0.22 0.14 0.69 0.44 -0.43 1.38 0.75 0.88 ...
str(bbbDescr)
## 'data.frame': 208 obs. of 134 variables:
## $ tpsa : num 12 49.3 50.5 37.4 37.4 ...
## $ nbasic : int 1 0 1 0 1 1 1 1 1 1 ...
## $ negative : int 0 0 0 0 0 0 0 0 0 0 ...
## $ vsa_hyd : num 167.1 92.6 295.2 319.1 299.7 ...
## $ a_aro : int 0 6 15 15 12 11 6 12 12 6 ...
## $ weight : num 156 151 366 383 326 ...
## $ peoe_vsa.0 : num 76.9 38.2 58.1 62.2 74.8 ...
## $ peoe_vsa.1 : num 43.4 25.5 124.7 124.7 118 ...
## $ peoe_vsa.2 : num 0 0 21.7 13.2 33 ...
## $ peoe_vsa.3 : num 0 8.62 8.62 21.79 0 ...
## $ peoe_vsa.4 : num 0 23.3 17.4 0 0 ...
## $ peoe_vsa.5 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ peoe_vsa.6 : num 17.24 0 8.62 8.62 8.62 ...
## $ peoe_vsa.0.1 : num 18.7 49 83.8 83.8 83.8 ...
## $ peoe_vsa.1.1 : num 43.5 0 49 68.8 36.8 ...
## $ peoe_vsa.2.1 : num 0 0 0 0 0 ...
## $ peoe_vsa.3.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ peoe_vsa.4.1 : num 0 0 5.68 5.68 5.68 ...
## $ peoe_vsa.5.1 : num 0 13.567 2.504 0 0.137 ...
## $ peoe_vsa.6.1 : num 0 7.9 2.64 2.64 2.5 ...
## $ a_acc : int 0 2 2 2 2 2 2 2 0 2 ...
## $ a_acid : int 0 0 0 0 0 0 0 0 0 0 ...
## $ a_base : int 1 0 1 1 1 1 1 1 1 1 ...
## $ vsa_acc : num 0 13.57 8.19 8.19 8.19 ...
## $ vsa_acid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ vsa_base : num 5.68 0 0 0 0 ...
## $ vsa_don : num 5.68 5.68 5.68 5.68 5.68 ...
## $ vsa_other : num 0 28.1 43.6 28.3 19.6 ...
## $ vsa_pol : num 0 13.6 0 0 0 ...
## $ slogp_vsa0 : num 18 25.4 14.1 14.1 14.1 ...
## $ slogp_vsa1 : num 0 23.3 34.8 34.8 34.8 ...
## $ slogp_vsa2 : num 3.98 23.86 0 0 0 ...
## $ slogp_vsa3 : num 0 0 76.2 76.2 76.2 ...
## $ slogp_vsa4 : num 4.41 0 3.19 3.19 3.19 ...
## $ slogp_vsa5 : num 32.9 0 9.51 0 0 ...
## $ slogp_vsa6 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ slogp_vsa7 : num 0 70.6 148.1 144 140.7 ...
## $ slogp_vsa8 : num 113.2 0 75.5 75.5 75.5 ...
## $ slogp_vsa9 : num 33.3 41.3 28.3 55.5 26 ...
## $ smr_vsa0 : num 0 23.86 12.63 3.12 3.12 ...
## $ smr_vsa1 : num 18 25.4 27.8 27.8 27.8 ...
## $ smr_vsa2 : num 4.41 0 0 0 0 ...
## $ smr_vsa3 : num 3.98 5.24 8.43 8.43 8.43 ...
## $ smr_vsa4 : num 0 20.8 29.6 21.4 20.3 ...
## $ smr_vsa5 : num 113.2 70.6 235.1 235.1 234.6 ...
## $ smr_vsa6 : num 0 5.26 76.25 76.25 76.25 ...
## $ smr_vsa7 : num 66.2 33.3 0 31.3 0 ...
## $ tpsa.1 : num 16.6 49.3 51.7 38.6 38.6 ...
## $ logp.o.w. : num 2.948 0.889 4.439 5.254 3.8 ...
## $ frac.anion7. : num 0 0.001 0 0 0 0 0.001 0 0 0 ...
## $ frac.cation7. : num 0.999 0 0.986 0.986 0.986 0.986 0.996 0.946 0.999 0.976 ...
## $ andrewbind : num 3.4 -3.3 12.8 12.8 10.3 10 10.4 15.9 12.9 9.5 ...
## $ rotatablebonds : int 3 2 8 8 8 8 8 7 4 5 ...
## $ mlogp : num 2.5 1.06 4.66 3.82 3.27 ...
## $ clogp : num 2.97 0.494 5.137 5.878 4.367 ...
## $ mw : num 155 151 365 382 325 ...
## $ nocount : int 1 3 5 4 4 4 4 3 2 4 ...
## $ hbdnr : int 1 2 1 1 1 1 2 1 1 0 ...
## $ rule.of.5violations : int 0 0 1 1 0 0 0 0 1 0 ...
## $ alert : int 0 0 0 0 0 0 0 0 0 0 ...
## $ prx : int 0 1 6 2 2 2 1 0 0 4 ...
## $ ub : num 0 3 5.3 5.3 4.2 3.6 3 4.7 4.2 3 ...
## $ pol : int 0 2 3 3 2 2 2 3 4 1 ...
## $ inthb : int 0 0 0 0 0 0 1 0 0 0 ...
## $ adistm : num 0 395 1365 703 746 ...
## $ adistd : num 0 10.9 25.7 10 10.6 ...
## $ polar_area : num 21.1 117.4 82.1 65.1 66.2 ...
## $ nonpolar_area : num 379 248 638 668 602 ...
## $ psa_npsa : num 0.0557 0.4743 0.1287 0.0974 0.11 ...
## $ tcsa : num 0.0097 0.0134 0.0111 0.0108 0.0118 0.0111 0.0123 0.0099 0.0106 0.0115 ...
## $ tcpa : num 0.1842 0.0417 0.0972 0.1218 0.1186 ...
## $ tcnp : num 0.0103 0.0198 0.0125 0.0119 0.013 0.0125 0.0162 0.011 0.0109 0.0122 ...
## $ ovality : num 1.1 1.12 1.3 1.3 1.27 ...
## $ surface_area : num 400 365 720 733 668 ...
## $ volume : num 656 555 1224 1257 1133 ...
## $ most_negative_charge: num -0.617 -0.84 -0.801 -0.761 -0.857 ...
## $ most_positive_charge: num 0.307 0.497 0.541 0.48 0.455 ...
## $ sum_absolute_charge : num 3.89 4.89 7.98 7.93 7.85 ...
## $ dipole_moment : num 1.19 4.21 3.52 3.15 3.27 ...
## $ homo : num -9.67 -8.96 -8.63 -8.56 -8.67 ...
## $ lumo : num 3.4038 0.1942 0.0589 -0.2651 0.3149 ...
## $ hardness : num 6.54 4.58 4.34 4.15 4.49 ...
## $ ppsa1 : num 349 223 518 508 509 ...
## $ ppsa2 : num 679 546 2066 2013 1999 ...
## $ ppsa3 : num 31 42.3 64 61.7 61.6 ...
## $ pnsa1 : num 51.1 141.8 202 225.4 158.8 ...
## $ pnsa2 : num -99.3 -346.9 -805.9 -894 -623.3 ...
## $ pnsa3 : num -10.5 -44 -43.8 -42 -39.8 ...
## $ fpsa1 : num 0.872 0.611 0.719 0.693 0.762 ...
## $ fpsa2 : num 1.7 1.5 2.87 2.75 2.99 ...
## $ fpsa3 : num 0.0774 0.1159 0.0888 0.0842 0.0922 ...
## $ fnsa1 : num 0.128 0.389 0.281 0.307 0.238 ...
## $ fnsa2 : num -0.248 -0.951 -1.12 -1.22 -0.933 ...
## $ fnsa3 : num -0.0262 -0.1207 -0.0608 -0.0573 -0.0596 ...
## $ wpsa1 : num 139.7 81.4 372.7 372.1 340.1 ...
## $ wpsa2 : num 272 199 1487 1476 1335 ...
## $ wpsa3 : num 12.4 15.4 46 45.2 41.1 ...
## $ wnsa1 : num 20.4 51.8 145.4 165.3 106 ...
## $ wnsa2 : num -39.8 -126.6 -580.1 -655.3 -416.3 ...
## [list output truncated]
Problem 3(a)
The BloodBrain dataset contains 208 observations.
The outcome variable logBBB is a numeric vector with 208 values, representing the ability of chemical compounds to permeate the blood-brain barrier.
The predictor dataset bbbDescr contains 208 observations and 134 predictor variables. These predictors describe various chemical and structural properties of the compounds, such as molecular weight, polar surface area, and charge-related measurements.
#3(b)
library(caret)
nzv <- nearZeroVar(bbbDescr, saveMetrics = TRUE)
nzv[nzv$zeroVar == TRUE, ]
## [1] freqRatio percentUnique zeroVar nzv
## <0 rows> (or 0-length row.names)
Problem 3(b) No predictors have zero variance. Filtering the nearZeroVar metrics on zeroVar == TRUE returns no rows, so every predictor takes at least two distinct values. This check alone does not rule out degenerate distributions, however: predictors that are nearly constant are flagged in the nzv column rather than the zeroVar column, so that column should also be inspected before concluding that every predictor is informative.
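To make the difference between the zeroVar and nzv columns concrete, here is a base-R sketch of the near-zero-variance rule. It is a re-implementation written for illustration, not caret's own code: the helper `nzv_metrics` is hypothetical, the cutoffs mirror the documented nearZeroVar defaults (freqCut = 95/5, uniqueCut = 10), and the example data are made up.

```r
# Hypothetical re-implementation of the near-zero-variance metrics
nzv_metrics <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  out <- t(sapply(df, function(x) {
    counts <- sort(table(x), decreasing = TRUE)
    # Ratio of the most common value's count to the second most common
    freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
    pct_unique <- 100 * length(counts) / length(x)
    zero_var <- length(counts) <= 1
    c(freqRatio = freq_ratio,
      percentUnique = pct_unique,
      zeroVar = zero_var,
      nzv = (freq_ratio > freq_cut && pct_unique <= unique_cut) || zero_var)
  }))
  as.data.frame(out)
}

# Made-up data: one varied column, one nearly constant, one constant
toy <- data.frame(
  varied = seq_len(200) / 10,
  rare = c(rep(0, 198), 1, 1),  # 198 zeros vs 2 ones: ratio 99, 1% unique
  constant = rep(3, 200)
)
m <- nzv_metrics(toy)
m[m$nzv == 1, ]
```

In this toy example the `rare` column is flagged by nzv even though its zeroVar value is 0, which is exactly the case that a zeroVar-only filter misses.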
#3(c)
cor_matrix <- cor(bbbDescr)
highCorr <- findCorrelation(cor_matrix, cutoff = 0.9)
length(highCorr)
## [1] 36
Problem 3(c) Yes, there are strong relationships between some of the predictors.
Using a correlation cutoff of 0.9, findCorrelation flagged 36 of the 134 predictors for removal because they are highly correlated with at least one other predictor. This indicates that many predictors carry largely redundant information.
High correlation between predictors can cause problems in modeling, such as multicollinearity and reduced model interpretability.
Removing these predictors reduces the total number of predictors and simplifies the model, while retaining most of the important information.
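findCorrelation returns only the column indices to drop. To see which pairs of predictors are responsible, the correlation matrix can be scanned directly. The sketch below uses a hypothetical helper, `high_cor_pairs`, and made-up data so that it runs with base R alone.

```r
# Hypothetical helper: list each pair of columns with |r| above the cutoff
high_cor_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r = cm[idx])
}

# Made-up data: b is almost a linear function of a, z is unrelated
x <- 1:50
toy <- data.frame(a = x, b = 2 * x + rep(c(1, -1), 25), z = sin(x))
flagged <- high_cor_pairs(toy)
flagged
```

For the real data the call would be `high_cor_pairs(bbbDescr)`, which lists the correlated pairs behind the predictors that findCorrelation flags.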
#4(a)
library(caret)
data(oil)
set.seed(123)
sample_index <- sample(1:nrow(fattyAcids), 60)
oil_sample <- oilType[sample_index]
table(oil_sample)
## oil_sample
## A B C D E F G
## 24 17 3 3 6 5 2
Problem 4(a)
A completely random sample of 60 oils was selected from the dataset.
The class proportions in the random sample are similar to those in the original dataset, although not identical. For example, oil type A makes up 24 of 60 samples (40%) compared with 37 of 96 (about 39%) in the original dataset, and oil type B makes up 17 of 60 (about 28%) compared with 26 of 96 (about 27%).
Smaller classes such as type C and type G remained rare in both the original and sampled data.
Overall, the random sample roughly reflects the original distribution but includes some variation.
#4(b)
set.seed(123)
strat_index <- createDataPartition(oilType, p = 60/length(oilType), list = FALSE)
oil_strat_sample <- oilType[strat_index]
table(oil_strat_sample)
## oil_strat_sample
## A B C D E F G
## 24 17 2 5 7 7 2
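Problem 4(b) Using stratified sampling with createDataPartition, the frequency distribution of oil types in the sample matches the original dataset more closely than a completely random sample is guaranteed to. Because the sampling is done within each oil type, every class is represented roughly in proportion to its size, and small classes such as type C and type G cannot be left out by chance. This reduces the risk of an unrepresentative sample. Note that the stratified sample contains 64 oils rather than the requested 60 (24 + 17 + 2 + 5 + 7 + 7 + 2), which is consistent with the per-class counts being rounded up. Compared to completely random sampling, stratified sampling gives more stable class proportions from one draw to the next.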
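The idea behind stratified sampling can be sketched in base R. The helper below, `stratified_sample`, is hypothetical and is not caret's implementation; it samples the same fraction within every class and rounds each class count up, which is an assumption chosen because it matches the counts in the table above. The labels are made up.

```r
# Hypothetical helper: sample a fraction p of the indices within each class
stratified_sample <- function(labels, p) {
  idx_by_class <- split(seq_along(labels), labels)
  unlist(lapply(idx_by_class, function(idx) {
    n_take <- ceiling(length(idx) * p)  # round up so no class is dropped
    idx[sample.int(length(idx), n_take)]
  }), use.names = FALSE)
}

set.seed(123)
# Made-up labels with one large, one medium, and one small class
labels <- factor(rep(c("A", "B", "C"), times = c(40, 16, 4)))
picked <- stratified_sample(labels, p = 0.5)
table(labels[picked])
```

Whatever the seed, each class contributes exactly half of its members here (20, 8, and 2), whereas a completely random draw of 30 could miss class C entirely.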
#4(d)
binom.test(16, 20)
##
## Exact binomial test
##
## data: 16 and 20
## number of successes = 16, number of trials = 20, p-value = 0.01182
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.563386 0.942666
## sample estimates:
## probability of success
## 0.8
Problem 4(d) Using binom.test with 16 correct predictions out of 20 test samples, the estimated accuracy is 0.8.
The 95% confidence interval for the accuracy is approximately (0.563, 0.943). This interval is relatively wide, indicating a high level of uncertainty in the accuracy estimate due to the small test sample size.
This demonstrates that smaller test sets result in greater uncertainty in model performance estimates.
Increasing the test set size would produce a narrower confidence interval and more reliable performance estimates.
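Therefore, with a fixed amount of data there is a trade-off: a larger test set narrows the confidence interval around the accuracy estimate but leaves fewer samples for training the model.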
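The effect of test set size on the interval width can be checked by repeating binom.test at the same 80% accuracy with larger test sets. This is a minimal sketch; the helper `ci_width` and the chosen sizes are illustrative assumptions.

```r
# Width of the exact 95% confidence interval at a given test set size,
# holding the observed accuracy fixed at 80%
ci_width <- function(n, accuracy = 0.8) {
  ci <- binom.test(round(accuracy * n), n)$conf.int
  ci[2] - ci[1]
}

sizes <- c(20, 50, 100, 500)
widths <- sapply(sizes, ci_width)
data.frame(test_size = sizes, ci_width = round(widths, 3))
```

For a test set of 20 the width is about 0.379, matching the interval (0.563, 0.943) reported above, and it shrinks steadily as the test set grows.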
#Problem 5
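# Stylized curves for illustration only; they are not estimated from data
model_complexity <- 1:10
bias <- 10 / model_complexity      # falls as the model becomes more flexible
variance <- model_complexity / 2   # rises as the model becomes more flexible
total_error <- bias + variance
# Take ylim from the data so total error (10.5 at complexity 1) is not clipped
plot(model_complexity, bias, type = "l", col = "blue",
     ylim = c(0, max(total_error)),
     xlab = "Model Complexity", ylab = "Error",
     main = "Bias-Variance Tradeoff")
lines(model_complexity, variance, col = "red")
lines(model_complexity, total_error, col = "black")
legend("topright",
       legend = c("Bias", "Variance", "Total Error"),
       col = c("blue", "red", "black"),
       lty = 1)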
Problem 5
The bias–variance tradeoff describes the balance between bias and variance in predictive modeling.
Bias refers to the error caused by overly simple models that cannot capture the true relationship in the data. High bias leads to underfitting.
Variance refers to the error caused by overly complex models that are too sensitive to the training data. High variance leads to overfitting.
As model complexity increases, bias decreases because the model becomes more flexible. However, variance increases because the model becomes more sensitive to small changes in the data.
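In the standard decomposition, the expected prediction error is the sum of squared bias, variance, and irreducible noise; the plot above shows a stylized version with only the first two terms. Initially, increasing model complexity reduces total error because bias falls faster than variance rises. After a certain point, further complexity increases total error because the growth in variance outweighs the reduction in bias.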
Therefore, the optimal model complexity is the point where the total error is minimized.
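For the stylized curves used in the plot (bias = 10 / complexity and variance = complexity / 2), the minimizing complexity can be located numerically. The sketch below treats complexity as continuous, which is an assumption made for illustration; the function name `total_error_fn` is hypothetical.

```r
# Total error for the stylized curves, as a function of complexity
total_error_fn <- function(complexity) 10 / complexity + complexity / 2

# Search for the minimum over the plotted range of complexities
best <- optimize(total_error_fn, interval = c(1, 10))
best$minimum    # complexity at which total error is smallest
best$objective  # total error at that complexity
```

Setting the derivative -10 / c^2 + 1/2 to zero gives c = sqrt(20), about 4.47, which agrees with the numerical result and with the flat bottom of the black curve between complexities 4 and 5.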