homework1

#1(a)

library(MASS)
data(Boston)

dim(Boston)

## [1] 506  14

str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Problem 1(a)

The Boston dataset contains 506 rows and 14 columns.

Each row represents a census tract in Boston. Each column represents a different variable related to housing conditions, environmental factors, and socioeconomic characteristics.

For example, the dataset includes variables such as crime rate (crim), number of rooms (rm), tax rate (tax), pupil-teacher ratio (ptratio), and median house value (medv).

#1(b)

pairs(Boston)

Problem 1(b) The pairwise scatterplots show several clear relationships. lstat and medv have a strong negative relationship: tracts with a larger share of lower-status residents tend to have lower median home values. rm and medv are positively related: tracts with more rooms per dwelling on average tend to have higher home values. tax and rad are also positively related, so tracts with better access to radial highways tend to have higher property tax rates. Overall, several predictors are strongly associated with housing prices and with one another.
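A 14-by-14 scatterplot matrix is hard to read, so the relationships described above can be checked on a smaller set of variables. This is a minimal sketch; the choice of columns in `vars` is an assumption made for illustration.

```r
library(MASS)  # provides the Boston data frame

# Hypothetical subset: only the variables discussed in the answer
vars <- c("crim", "rm", "lstat", "rad", "tax", "medv")
pairs(Boston[, vars])

# Numeric check of the same pairwise relationships
round(cor(Boston[, vars]), 2)
```

The correlation matrix gives a sign and rough strength for each panel of the plot, which makes the visual impressions easier to verify.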

#1(c)

cor(Boston$crim, Boston)

##      crim         zn     indus        chas       nox         rm       age
## [1,]    1 -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343
##             dis       rad       tax   ptratio      black     lstat       medv
## [1,] -0.3796701 0.6255051 0.5827643 0.2899456 -0.3850639 0.4556215 -0.3883046

Problem 1(c) Yes, several predictors are associated with the per capita crime rate (crim). The strongest correlations are with rad (0.626) and tax (0.583), so tracts with better highway access and higher tax rates tend to have higher crime rates. crim is also positively correlated with lstat (0.456) and nox (0.421), and negatively correlated with medv (-0.388), black (-0.385), and dis (-0.380), so tracts with higher home values tend to have lower crime rates. These are moderate linear associations between crime and socioeconomic or environmental factors; they do not by themselves indicate causation.
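The single-row output of cor() is easier to interpret when the predictors are ranked by the strength of their association with crim. A short sketch of one way to do that (the object name `crim_cor` is illustrative):

```r
library(MASS)

crim_cor <- cor(Boston)["crim", ]                  # correlation of crim with every column
crim_cor <- crim_cor[names(crim_cor) != "crim"]    # drop the trivial self-correlation
crim_cor <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
round(crim_cor, 3)
```

Ordering by absolute value keeps strong negative associations, such as the one with medv, near the top of the list instead of at the bottom.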

#1(d)

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Problem 1(d)

Yes, some census tracts have particularly high crime rates. The crime rate (crim) ranges from 0.00632 to 88.97620, while the median is only 0.25651 and the third quartile is 3.67708. The distribution is therefore strongly right-skewed, with a small number of tracts far above the rest.

The tax rate (tax) ranges from 187 to 711. Its third quartile is 666, so at least a quarter of the tracts are taxed near the top of the range. The pupil-teacher ratio (ptratio) has a much narrower range, from 12.60 to 22.00.

Overall, many predictors have large ranges. For example, the number of rooms (rm) ranges from 3.561 to 8.780, and the percentage of lower-status population (lstat) ranges from 1.73 to 37.97. These wide ranges indicate substantial variation in housing and socioeconomic conditions across census tracts.
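To look at the extreme tracts directly rather than only at the summary, the rows above a chosen cutoff can be extracted. This is a sketch; the 95th-percentile cutoff is an arbitrary assumption, not part of the original analysis.

```r
library(MASS)

# Ranges of crim, tax and ptratio in one table
sapply(Boston[, c("crim", "tax", "ptratio")], range)

# Hypothetical cutoff: tracts above the 95th percentile of crim
cutoff <- quantile(Boston$crim, 0.95)
high_crime <- Boston[Boston$crim > cutoff, c("crim", "tax", "ptratio", "medv")]
nrow(high_crime)
summary(high_crime$crim)
```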

1(e)

table(Boston$chas)

## 
##   0   1 
## 471  35

Problem 1(e) There are 35 census tracts that bound the Charles River. This can be seen from the variable chas, where a value of 1 indicates that the tract bounds the river and a value of 0 indicates that it does not.

1(f)

median(Boston$ptratio)

## [1] 19.05

Problem 1(f) The median pupil-teacher ratio in the Boston dataset is 19.05, so a typical census tract has about 19 students per teacher.

#2(a)

library(mlbench)
data(Soybean)

str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

table(Soybean$Class)

## 
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20

table(Soybean$temp)

## 
##   0   1   2 
##  80 374 199

table(Soybean$precip)

## 
##   0   1   2 
##  74 112 459

table(Soybean$hail)

## 
##   0   1 
## 435 127

table(Soybean$crop.hist)

## 
##   0   1   2   3 
##  65 165 219 218

Problem 2(a) The frequency distributions of the categorical predictors show that the observations are not evenly distributed across categories.

For example, the temperature (temp) variable has frequencies of 80, 374, and 199 across its levels, so level 1 occurs most frequently. Similarly, the precipitation (precip) variable has frequencies of 74, 112, and 459, so level 2 is much more common than the others.

The hail variable is also imbalanced, with 435 observations in level 0 and only 127 in level 1.

These results show that some predictors have uneven distributions, but none of the predictors tabulated here is completely degenerate, since every level has some observations. Only four of the 35 predictors were tabulated, and table() omits missing values by default, so the remaining predictors would need the same check before the conclusion can be extended to the whole dataset.
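The same check can be extended to all predictors at once. The sketch below tabulates every predictor with missing values counted as their own category, then applies caret's nearZeroVar filter; the object names `predictors`, `freq_tables`, and `nzv_soy` are illustrative.

```r
library(mlbench)
library(caret)
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# Frequency table for every predictor, counting NA as its own category
freq_tables <- lapply(predictors, table, useNA = "ifany")
freq_tables[["mycelium"]]   # inspect any single predictor by name

# caret's filter flags predictors that are dominated by a single level
nzv_soy <- nearZeroVar(predictors, saveMetrics = TRUE)
nzv_soy[nzv_soy$nzv, ]
```

Any rows returned by the last line are predictors where one level accounts for nearly all observations, which is the usual working definition of a degenerate distribution.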

#2(b)

colSums(is.na(Soybean))

##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

Problem 2(b)

Yes, some predictors are more likely to have missing values than others.

For example, the hail, sever, seed.tmt, and lodging variables each have 121 missing values, which is relatively high compared to other predictors. Other variables such as fruit.spots, seed.discolor, and shriveling also have a large number of missing values.

In contrast, some variables such as Class and leaves have no missing values.

This indicates that missing data is not evenly distributed across predictors. Some predictors are more prone to missing values than others.

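Because the amount of missing data differs across predictors, the missingness could affect the modeling process. It may also be related to the class labels, for example if the missing values are concentrated in a few disease classes. The column totals alone cannot confirm or rule this out.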
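Whether missingness is related to the class can be checked by cross-tabulating incomplete rows against Class. A minimal sketch (the name `has_na` is illustrative):

```r
library(mlbench)
data(Soybean)

# TRUE for rows that contain at least one missing value
has_na <- !complete.cases(Soybean)

# Rows with and without missing values, split by disease class
table(Soybean$Class, has_na)

# Share of incomplete rows within each class
round(tapply(has_na, Soybean$Class, mean), 2)
```

If a few classes show shares near 1 while most are near 0, the missing values are informative about the class, and that should influence the choice between imputing and removing data.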

2(c)

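# Return the most frequent non-missing value of x
get_mode <- function(x) {
  ux <- na.omit(unique(x))                # distinct observed values, NA removed
  ux[which.max(tabulate(match(x, ux)))]   # value with the highest count
}

Soybean_imputed <- Soybean

# Replace the missing entries in each column with that column's mode
for (i in seq_len(ncol(Soybean_imputed))) {
  Soybean_imputed[is.na(Soybean_imputed[, i]), i] <- get_mode(Soybean_imputed[, i])
}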

Problem 2(c)

One strategy for handling missing data is imputation. Since most of the predictors in the Soybean dataset are categorical variables, replacing missing values with the mode (most frequent category) is a reasonable approach.

This method preserves the dataset size and avoids losing information that would occur if observations or predictors were removed.

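Mode imputation is simple and allows the dataset to be used for modeling without missing values. Its limitations are that it ignores relationships among the predictors and with the class, and that it inflates the frequency of the most common level in every imputed column.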

Alternatively, predictors with excessive missing values could be removed, but imputation is generally preferred when the dataset is not very large.
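After imputing, it is worth confirming that no missing values remain and that the dimensions are unchanged. The sketch below is a self-contained variant of mode imputation; `impute_mode` and `Soybean_imputed2` are hypothetical names, and this is an alternative formulation rather than the function defined above.

```r
library(mlbench)
data(Soybean)

# Hypothetical helper: fill NA with the most frequent level of a factor
impute_mode <- function(x) {
  counts <- table(x)                               # table() drops NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]  # first level with the top count
  x
}

Soybean_imputed2 <- Soybean
Soybean_imputed2[] <- lapply(Soybean, impute_mode)  # keeps the data frame structure

anyNA(Soybean_imputed2)   # should be FALSE if every NA was filled
dim(Soybean_imputed2)
```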

#3(a)

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

data(BloodBrain)

str(logBBB)

##  num [1:208] 1.08 -0.4 0.22 0.14 0.69 0.44 -0.43 1.38 0.75 0.88 ...

str(bbbDescr)

## 'data.frame':    208 obs. of  134 variables:
##  $ tpsa                : num  12 49.3 50.5 37.4 37.4 ...
##  $ nbasic              : int  1 0 1 0 1 1 1 1 1 1 ...
##  $ negative            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ vsa_hyd             : num  167.1 92.6 295.2 319.1 299.7 ...
##  $ a_aro               : int  0 6 15 15 12 11 6 12 12 6 ...
##  $ weight              : num  156 151 366 383 326 ...
##  $ peoe_vsa.0          : num  76.9 38.2 58.1 62.2 74.8 ...
##  $ peoe_vsa.1          : num  43.4 25.5 124.7 124.7 118 ...
##  $ peoe_vsa.2          : num  0 0 21.7 13.2 33 ...
##  $ peoe_vsa.3          : num  0 8.62 8.62 21.79 0 ...
##  $ peoe_vsa.4          : num  0 23.3 17.4 0 0 ...
##  $ peoe_vsa.5          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ peoe_vsa.6          : num  17.24 0 8.62 8.62 8.62 ...
##  $ peoe_vsa.0.1        : num  18.7 49 83.8 83.8 83.8 ...
##  $ peoe_vsa.1.1        : num  43.5 0 49 68.8 36.8 ...
##  $ peoe_vsa.2.1        : num  0 0 0 0 0 ...
##  $ peoe_vsa.3.1        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ peoe_vsa.4.1        : num  0 0 5.68 5.68 5.68 ...
##  $ peoe_vsa.5.1        : num  0 13.567 2.504 0 0.137 ...
##  $ peoe_vsa.6.1        : num  0 7.9 2.64 2.64 2.5 ...
##  $ a_acc               : int  0 2 2 2 2 2 2 2 0 2 ...
##  $ a_acid              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ a_base              : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ vsa_acc             : num  0 13.57 8.19 8.19 8.19 ...
##  $ vsa_acid            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ vsa_base            : num  5.68 0 0 0 0 ...
##  $ vsa_don             : num  5.68 5.68 5.68 5.68 5.68 ...
##  $ vsa_other           : num  0 28.1 43.6 28.3 19.6 ...
##  $ vsa_pol             : num  0 13.6 0 0 0 ...
##  $ slogp_vsa0          : num  18 25.4 14.1 14.1 14.1 ...
##  $ slogp_vsa1          : num  0 23.3 34.8 34.8 34.8 ...
##  $ slogp_vsa2          : num  3.98 23.86 0 0 0 ...
##  $ slogp_vsa3          : num  0 0 76.2 76.2 76.2 ...
##  $ slogp_vsa4          : num  4.41 0 3.19 3.19 3.19 ...
##  $ slogp_vsa5          : num  32.9 0 9.51 0 0 ...
##  $ slogp_vsa6          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ slogp_vsa7          : num  0 70.6 148.1 144 140.7 ...
##  $ slogp_vsa8          : num  113.2 0 75.5 75.5 75.5 ...
##  $ slogp_vsa9          : num  33.3 41.3 28.3 55.5 26 ...
##  $ smr_vsa0            : num  0 23.86 12.63 3.12 3.12 ...
##  $ smr_vsa1            : num  18 25.4 27.8 27.8 27.8 ...
##  $ smr_vsa2            : num  4.41 0 0 0 0 ...
##  $ smr_vsa3            : num  3.98 5.24 8.43 8.43 8.43 ...
##  $ smr_vsa4            : num  0 20.8 29.6 21.4 20.3 ...
##  $ smr_vsa5            : num  113.2 70.6 235.1 235.1 234.6 ...
##  $ smr_vsa6            : num  0 5.26 76.25 76.25 76.25 ...
##  $ smr_vsa7            : num  66.2 33.3 0 31.3 0 ...
##  $ tpsa.1              : num  16.6 49.3 51.7 38.6 38.6 ...
##  $ logp.o.w.           : num  2.948 0.889 4.439 5.254 3.8 ...
##  $ frac.anion7.        : num  0 0.001 0 0 0 0 0.001 0 0 0 ...
##  $ frac.cation7.       : num  0.999 0 0.986 0.986 0.986 0.986 0.996 0.946 0.999 0.976 ...
##  $ andrewbind          : num  3.4 -3.3 12.8 12.8 10.3 10 10.4 15.9 12.9 9.5 ...
##  $ rotatablebonds      : int  3 2 8 8 8 8 8 7 4 5 ...
##  $ mlogp               : num  2.5 1.06 4.66 3.82 3.27 ...
##  $ clogp               : num  2.97 0.494 5.137 5.878 4.367 ...
##  $ mw                  : num  155 151 365 382 325 ...
##  $ nocount             : int  1 3 5 4 4 4 4 3 2 4 ...
##  $ hbdnr               : int  1 2 1 1 1 1 2 1 1 0 ...
##  $ rule.of.5violations : int  0 0 1 1 0 0 0 0 1 0 ...
##  $ alert               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ prx                 : int  0 1 6 2 2 2 1 0 0 4 ...
##  $ ub                  : num  0 3 5.3 5.3 4.2 3.6 3 4.7 4.2 3 ...
##  $ pol                 : int  0 2 3 3 2 2 2 3 4 1 ...
##  $ inthb               : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ adistm              : num  0 395 1365 703 746 ...
##  $ adistd              : num  0 10.9 25.7 10 10.6 ...
##  $ polar_area          : num  21.1 117.4 82.1 65.1 66.2 ...
##  $ nonpolar_area       : num  379 248 638 668 602 ...
##  $ psa_npsa            : num  0.0557 0.4743 0.1287 0.0974 0.11 ...
##  $ tcsa                : num  0.0097 0.0134 0.0111 0.0108 0.0118 0.0111 0.0123 0.0099 0.0106 0.0115 ...
##  $ tcpa                : num  0.1842 0.0417 0.0972 0.1218 0.1186 ...
##  $ tcnp                : num  0.0103 0.0198 0.0125 0.0119 0.013 0.0125 0.0162 0.011 0.0109 0.0122 ...
##  $ ovality             : num  1.1 1.12 1.3 1.3 1.27 ...
##  $ surface_area        : num  400 365 720 733 668 ...
##  $ volume              : num  656 555 1224 1257 1133 ...
##  $ most_negative_charge: num  -0.617 -0.84 -0.801 -0.761 -0.857 ...
##  $ most_positive_charge: num  0.307 0.497 0.541 0.48 0.455 ...
##  $ sum_absolute_charge : num  3.89 4.89 7.98 7.93 7.85 ...
##  $ dipole_moment       : num  1.19 4.21 3.52 3.15 3.27 ...
##  $ homo                : num  -9.67 -8.96 -8.63 -8.56 -8.67 ...
##  $ lumo                : num  3.4038 0.1942 0.0589 -0.2651 0.3149 ...
##  $ hardness            : num  6.54 4.58 4.34 4.15 4.49 ...
##  $ ppsa1               : num  349 223 518 508 509 ...
##  $ ppsa2               : num  679 546 2066 2013 1999 ...
##  $ ppsa3               : num  31 42.3 64 61.7 61.6 ...
##  $ pnsa1               : num  51.1 141.8 202 225.4 158.8 ...
##  $ pnsa2               : num  -99.3 -346.9 -805.9 -894 -623.3 ...
##  $ pnsa3               : num  -10.5 -44 -43.8 -42 -39.8 ...
##  $ fpsa1               : num  0.872 0.611 0.719 0.693 0.762 ...
##  $ fpsa2               : num  1.7 1.5 2.87 2.75 2.99 ...
##  $ fpsa3               : num  0.0774 0.1159 0.0888 0.0842 0.0922 ...
##  $ fnsa1               : num  0.128 0.389 0.281 0.307 0.238 ...
##  $ fnsa2               : num  -0.248 -0.951 -1.12 -1.22 -0.933 ...
##  $ fnsa3               : num  -0.0262 -0.1207 -0.0608 -0.0573 -0.0596 ...
##  $ wpsa1               : num  139.7 81.4 372.7 372.1 340.1 ...
##  $ wpsa2               : num  272 199 1487 1476 1335 ...
##  $ wpsa3               : num  12.4 15.4 46 45.2 41.1 ...
##  $ wnsa1               : num  20.4 51.8 145.4 165.3 106 ...
##  $ wnsa2               : num  -39.8 -126.6 -580.1 -655.3 -416.3 ...
##   [list output truncated]

Problem 3(a)

The BloodBrain dataset contains 208 observations.

The outcome variable logBBB is a numeric vector with 208 values, representing the ability of chemical compounds to permeate the blood-brain barrier.

The predictor dataset bbbDescr contains 208 observations and 134 predictor variables. These predictors describe various chemical and structural properties of the compounds, such as molecular weight, polar surface area, and charge-related measurements.

3(b)

library(caret)

nzv <- nearZeroVar(bbbDescr, saveMetrics = TRUE)
nzv[nzv$zeroVar == TRUE, ]

## [1] freqRatio     percentUnique zeroVar       nzv          
## <0 rows> (or 0-length row.names)

Problem 3(b) No predictors have zero variance. The nearZeroVar output was filtered on the zeroVar column and no rows were returned, so every predictor takes at least two distinct values. This check does not cover near-zero variance: predictors dominated by a single value are flagged in the nzv column, which was not examined here. The conclusion is therefore limited to strictly constant predictors, and a predictor can still have a degenerate distribution without being constant.
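The near-zero-variance check can be sketched by filtering on the nzv column of the same metrics table. The names `nzv_cols` and `bbb_filtered` are illustrative.

```r
library(caret)
data(BloodBrain)

nzv <- nearZeroVar(bbbDescr, saveMetrics = TRUE)

# Predictors flagged as near-zero variance (constant predictors would be flagged too)
nzv[nzv$nzv, ]

# Column indices only, convenient for filtering
nzv_cols <- nearZeroVar(bbbDescr)
bbb_filtered <- if (length(nzv_cols) > 0) bbbDescr[, -nzv_cols] else bbbDescr
dim(bbb_filtered)
```

The length guard matters because indexing with an empty negative vector would drop every column rather than none.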

#3(c)

cor_matrix <- cor(bbbDescr)

highCorr <- findCorrelation(cor_matrix, cutoff = 0.9)

length(highCorr)

## [1] 36

Problem 3(c) Yes, there are strong relationships between some of the predictors.

Using a correlation cutoff of 0.9, findCorrelation flagged 36 predictors. These are the columns it recommends removing to reduce pairwise correlations, not a count of every predictor involved in a highly correlated pair. The result indicates that many predictors carry largely redundant information.

High correlation between predictors can cause problems in modeling, such as multicollinearity and reduced model interpretability.

Removing these 36 predictors would leave 98 of the original 134, which simplifies the model while retaining most of the important information.
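The removal step described above can be sketched as follows, together with a check of the largest correlation that remains. The names `bbb_reduced` and `reduced_cor` are illustrative.

```r
library(caret)
data(BloodBrain)

cor_matrix <- cor(bbbDescr)
highCorr <- findCorrelation(cor_matrix, cutoff = 0.9)

# Drop the flagged columns (guarding against an empty result)
bbb_reduced <- if (length(highCorr) > 0) bbbDescr[, -highCorr] else bbbDescr
dim(bbb_reduced)

# Largest remaining absolute pairwise correlation
reduced_cor <- cor(bbb_reduced)
max(abs(reduced_cor[upper.tri(reduced_cor)]))
```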

#4(a)

library(caret)
data(oil)

set.seed(123)
sample_index <- sample(1:nrow(fattyAcids), 60)

oil_sample <- oilType[sample_index]

table(oil_sample)

## oil_sample
##  A  B  C  D  E  F  G 
## 24 17  3  3  6  5  2

Problem 4(a)

A completely random sample of 60 oils was selected from the dataset.

The class proportions in the random sample were similar to those in the original dataset, although not identical. Because the sample and the original data differ in size, proportions are the right basis for comparison. Oil type A made up 24 of the 60 sampled oils (40.0%) compared with 37 of 96 (38.5%) in the original dataset, and type B made up 17 of 60 (28.3%) compared with 26 of 96 (27.1%).

Smaller classes such as type C and type G remained rare in both the original and sampled data.

Overall, the random sample roughly reflects the original distribution but includes some variation.
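Since the sample and the full dataset have different sizes, the comparison is clearest in terms of proportions. A short sketch that places the two distributions side by side, using the same seed as above:

```r
library(caret)
data(oil)

set.seed(123)
sample_index <- sample(1:nrow(fattyAcids), 60)
oil_sample <- oilType[sample_index]

# Class proportions side by side, in percent
round(100 * cbind(original = prop.table(table(oilType)),
                  random   = prop.table(table(oil_sample))), 1)
```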

4(b)

set.seed(123)

strat_index <- createDataPartition(oilType, p = 60/length(oilType), list = FALSE)

oil_strat_sample <- oilType[strat_index]

table(oil_strat_sample)

## oil_strat_sample
##  A  B  C  D  E  F  G 
## 24 17  2  5  7  7  2

Problem 4(b) With createDataPartition, the class proportions in the stratified sample match the original dataset more closely than those from completely random sampling. The function samples within each oil type, so every type is represented roughly in proportion to its frequency, and small classes such as C and G are still included. The counts in the table sum to 64 rather than 60, which is consistent with the number drawn from each class being rounded up. Across repeated draws, stratified sampling gives more stable class proportions than completely random sampling.
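The stability claim can be examined by repeating both sampling schemes many times and comparing how much the share of one class varies. This is a sketch; the number of repetitions (`n_rep`) and the focus on type A are assumptions chosen for illustration.

```r
library(caret)
data(oil)

set.seed(123)
n_rep <- 200                      # hypothetical number of repeated draws
p <- 60 / length(oilType)

# Share of type A in each completely random sample of 60 oils
random_A <- replicate(n_rep, mean(sample(oilType, 60) == "A"))

# Share of type A in each stratified sample (one column of indices per draw)
strat_idx <- createDataPartition(oilType, p = p, times = n_rep, list = FALSE)
strat_A <- apply(strat_idx, 2, function(idx) mean(oilType[idx] == "A"))

# Spread of the type A share under each scheme
sd(random_A)
sd(strat_A)
```

A smaller standard deviation for the stratified draws would support the claim that stratification yields more stable class proportions.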

4(c) Problem 4(b)

Stratified sampling also reduces the sampling bias that arises when rare classes are over- or under-drawn by chance, so it provides a more representative sample of the overall dataset than completely random sampling.

4(d)

binom.test(16, 20)

## 
##  Exact binomial test
## 
## data:  16 and 20
## number of successes = 16, number of trials = 20, p-value = 0.01182
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.563386 0.942666
## sample estimates:
## probability of success 
##                    0.8

Problem 4(d) Using binom.test with 16 correct predictions out of 20 test samples, the estimated accuracy is 0.8.

The 95% confidence interval for the accuracy is approximately (0.563, 0.943). This interval is relatively wide, indicating a high level of uncertainty in the accuracy estimate due to the small test sample size.

This demonstrates that smaller test sets result in greater uncertainty in model performance estimates.

Increasing the test set size would produce a narrower confidence interval and more reliable performance estimates.

Therefore, with a fixed amount of data there is a trade-off: a larger test set gives a more precise accuracy estimate but leaves fewer samples for training the model.
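The effect of test set size on uncertainty can be illustrated by holding the observed accuracy at 80% and varying the number of test samples. The sizes below are hypothetical; only the first row (16 of 20) corresponds to the result reported above.

```r
# Hypothetical test-set sizes, each with the same observed accuracy of 80%
n_test  <- c(20, 50, 100, 500)
correct <- c(16, 40, 80, 400)

# Exact binomial 95% confidence interval for each size
ci <- t(sapply(seq_along(n_test),
               function(i) binom.test(correct[i], n_test[i])$conf.int))

result <- data.frame(n_test,
                     lower = ci[, 1],
                     upper = ci[, 2],
                     width = ci[, 2] - ci[, 1])
round(result, 3)
```

The width column shrinks as the test set grows, which shows how much precision is gained for each additional block of held-out samples.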

#Problem 5

model_complexity <- 1:10

bias <- 10 / model_complexity
variance <- model_complexity / 2
total_error <- bias + variance

plot(model_complexity, bias, type = "l", col = "blue",
     ylim = c(0, max(total_error)),  # tall enough to show the whole total-error curve
     xlab = "Model Complexity", ylab = "Error",
     main = "Bias-Variance Tradeoff")

lines(model_complexity, variance, col = "red")
lines(model_complexity, total_error, col = "black")

legend("topright",
       legend = c("Bias", "Variance", "Total Error"),
       col = c("blue", "red", "black"),
       lty = 1)

Problem 5

The bias–variance tradeoff describes the balance between bias and variance in predictive modeling.

Bias refers to the error caused by overly simple models that cannot capture the true relationship in the data. High bias leads to underfitting.

Variance refers to the error caused by overly complex models that are too sensitive to the training data. High variance leads to overfitting.

As model complexity increases, bias decreases because the model becomes more flexible. However, variance increases because the model becomes more sensitive to small changes in the data.

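In the plot, the total error is the sum of the bias and variance curves. These curves are illustrative functions of complexity (bias = 10 / complexity, variance = complexity / 2), not quantities estimated from data. More formally, expected squared prediction error decomposes into squared bias, variance, and irreducible noise. Initially, increasing model complexity reduces total error. However, after a certain point, increasing complexity causes total error to increase due to high variance.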

Therefore, the optimal model complexity is the point where the total error is minimized.
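For the illustrative curves used in the plot, the minimizing complexity can be located numerically. A small sketch using the same definitions (the name `best` is illustrative):

```r
model_complexity <- 1:10
bias <- 10 / model_complexity
variance <- model_complexity / 2
total_error <- bias + variance

# Complexity with the smallest total error
best <- model_complexity[which.min(total_error)]
best
min(total_error)

# Complexities 4 and 5 tie at 4.5 for these curves; which.min() returns the first
model_complexity[total_error == min(total_error)]
```

With real models the same idea applies, except that total error would be estimated by resampling or on a held-out set instead of being computed from assumed curves.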

Wei You

2026-02-22