Exercise 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

Explain

Classification is used when the response is qualitative, aiming to assign inputs to categories.
Regression is used when the response is quantitative, predicting continuous numeric values.
Inference focuses on understanding the relationship between predictors and the response.
Prediction focuses on accurately estimating the response for new data.

We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

This scenario is a regression problem because CEO salary is a continuous quantity.
The interest is inference: the goal is to understand how the recorded factors relate to CEO salary.
n = 500 firms, p = 3 (profit, number of employees, and industry)

We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price and ten other variables.

This scenario is a classification problem because the response takes one of two categories (success or failure).
The interest is prediction: the goal is to predict the outcome for the new product.
n = 20 (previously launched products), p = 13 (price, marketing budget, competition price, and ten other recorded variables)

We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

This scenario is a regression problem because the response (% change in the USD/Euro exchange rate) is quantitative.
The interest is prediction: the goal is to predict the % change.
n = 52 weeks, p = 3 (% change in the US, British, and German markets)

Exercise 5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

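The advantage of a very flexible approach is lower bias than a less flexible approach, because it can fit a wider range of shapes for the true relationship. The disadvantages are higher variance, a greater risk of overfitting, and reduced interpretability.
A very flexible approach is preferred when the training data set is large and a less flexible approach would have high bias, for example when the true relationship is highly non-linear.
A less flexible approach is preferred when the training data set is small or when the goal is inference, because less flexible approaches are more interpretable than very flexible ones.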
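
A small simulation can make this trade-off concrete. The sketch below is only an illustration under assumed settings: the true curve sin(x), the noise level, the training sizes (15 and 500), and the polynomial degrees (1 as "less flexible", 8 as "very flexible") are all choices made for the demo, not part of the exercise.

```r
set.seed(1)

# Assumed true relationship for the demo (mildly non-linear)
f <- function(x) sin(x)

# Average test MSE of a degree-`degree` polynomial fit over `reps` simulated training sets
test_mse <- function(n_train, degree, reps = 200) {
  x_test <- seq(0, 3, length.out = 200)
  y_test_true <- f(x_test)
  mse <- replicate(reps, {
    x <- runif(n_train, 0, 3)
    y <- f(x) + rnorm(n_train, sd = 0.5)
    fit <- lm(y ~ poly(x, degree))
    pred <- predict(fit, newdata = data.frame(x = x_test))
    mean((pred - y_test_true)^2)
  })
  mean(mse)
}

results <- data.frame(
  n_train = c(15, 15, 500, 500),
  degree  = c(1, 8, 1, 8)
)
results$test_mse <- mapply(test_mse, results$n_train, results$degree)
print(results)
```

Under these assumptions the degree-8 fit should do worse than the straight line when only 15 points are available (variance dominates) and better when 500 points are available (bias dominates).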

Exercise 6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

Parametric statistical learning approaches assume a specific functional form for the relationship between predictors and the response. Non-parametric approaches, like K-nearest neighbors or decision trees, make no such assumptions, allowing the model to flexibly adapt to the data’s structure without a predefined form.
Advantage: Parametric approaches tend to have lower variance because they are less sensitive to changes in the training data, which leads to more stable predictions across different data sets. Parametric models are also easier to interpret, since each parameter relates directly to the effect of a predictor. In addition, they need fewer observations to estimate the parameters well, so they perform well on smaller data sets.
Disadvantage: If the assumed functional form is far from the true relationship, a parametric model has high bias, which leads to systematic errors and a poor fit. It does not handle complex relationships, such as non-linear ones, well.
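
The contrast can be sketched on simulated data. In the example below the quadratic truth, the sample sizes, and the hand-written helper knn_predict() are assumptions made for illustration; lm() stands in for the parametric approach and K-nearest neighbours for the non-parametric one.

```r
set.seed(2)

# Simulated data with a clearly non-linear truth (assumed for illustration)
n <- 300
x <- runif(n, -3, 3)
y <- x^2 + rnorm(n, sd = 0.5)
train <- 1:200
test  <- 201:300

# Parametric: assumes y = b0 + b1 * x, so only two numbers are estimated
lin_fit  <- lm(y ~ x, subset = train)
lin_pred <- predict(lin_fit, newdata = data.frame(x = x[test]))

# Non-parametric: K-nearest-neighbours regression written by hand (hypothetical helper)
knn_predict <- function(x_train, y_train, x_new, k = 10) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[1:k]  # indices of the k closest training points
    mean(y_train[nearest])
  })
}
knn_pred <- knn_predict(x[train], y[train], x[test], k = 10)

mse <- c(linear = mean((y[test] - lin_pred)^2),
         knn    = mean((y[test] - knn_pred)^2))
print(round(mse, 3))
print(coef(lin_fit))  # the whole parametric model is summarised by these two values
```

Here the linear model is easy to read off from two coefficients but misses the curvature, while KNN follows the curve without giving any such summary.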

Exercise 8

Exercise 8 - (a)

Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

File_path <- "~/Downloads/College.csv"
College_df <- read.csv(File_path)   # Load csv file -> data.frame

Exercise 8 - (b)

Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.

rownames(College_df) <- College_df[, 1] # use the university names as row names
#View(College_df)

College_df <- College_df[, -1] # remove the university-name column
#View(College_df)
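
The same result can probably be reached in one step with the row.names argument of read.csv(). The sketch below uses a tiny temporary file whose two rows and values are made up for the demo, so it runs without College.csv.

```r
# Tiny stand-in file; the rows and values are invented for this example
tmp <- tempfile(fileext = ".csv")
writeLines(c(",Private,Apps",
             "College A,Yes,1660",
             "College B,No,2186"), tmp)

# row.names = 1 turns the first column into row names;
# stringsAsFactors = TRUE reads text columns such as Private as factors
demo_df <- read.csv(tmp, row.names = 1, stringsAsFactors = TRUE)
print(demo_df)
print(rownames(demo_df))
```

With the actual file, the equivalent call would be read.csv(File_path, row.names = 1, stringsAsFactors = TRUE).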

Exercise 8 - (c)

Use the summary() function to produce a numerical summary of the variables in the data set.

summary(College_df)

##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

pairs(College_df[, 2:11]) # Column 1 (Private) is not numeric, so columns 2 to 11 are used

Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

College_df$Private <- as.factor(College_df$Private)  # Convert chr -> factor
plot(Outstate ~ Private, data = College_df)          # produce side-by-side boxplots

Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %.

Elite <- rep("No", nrow(College_df))        # start with every entry set to "No"
Elite[College_df$Top10perc > 50] <- "Yes"   # set "Yes" where Top10perc exceeds 50
Elite <- as.factor(Elite)                   # convert to factor
College_df <- data.frame(College_df, Elite) # append Elite as a new column
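
As a compact alternative, the binning can be written with ifelse(), and table() then counts the two groups. The vector top10perc below is made up for the demo; note that a value of exactly 50 does not count as elite because the condition is strictly greater than 50.

```r
top10perc <- c(12, 55, 50, 78, 23)   # made-up percentages

# One-step binning; fixing the levels keeps "No" first even if a group is empty
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))
print(table(elite))
```

With the real data, table(College_df$Elite) or summary(College_df$Elite) would give the corresponding counts.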

Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

par(mfrow = c(2, 2)) # Set the print window

#View(College_df)
hist(College_df$Apps     , breaks = 20, main = "Apps Histogram"     , xlab = "Apps")     # Histogram of Apps
hist(College_df$Accept   , breaks = 20, main = "Accept Histogram"   , xlab = "Accept")
hist(College_df$Outstate , breaks = 20, main = "Outstate Histogram" , xlab = "Outstate")
hist(College_df$PhD      , breaks = 20, main = "PhD Histogram"      , xlab = "PhD")

Continue exploring the data, and provide a brief summary of what you discover.

To examine whether there is a relationship between room and board cost and enrollment rate, first compute the enrollment rate (Enroll / Accept) and then plot it against Room.Board.

College_df$Enroll_Rate <- College_df$Enroll / College_df$Accept # Calc Enroll rate and attach
plot(College_df$Room.Board, College_df$Enroll_Rate,
     main = "Enroll Rate vs Room & Board Cost",
     xlab = "Room and Board Cost",
     ylab = "Enrollment Rate")

Exercise 9

Auto_df <- read.csv("~/Downloads/Auto.csv")
#View(Auto_df)

Which of the predictors are quantitative, and which are qualitative?

Quantitative: mpg, displacement, horsepower, weight and acceleration

Qualitative: origin, name, cylinders and year

# set Quantitative Vector and Qualitative vector
Quantitative <- c("mpg", "displacement", "horsepower", "weight", "acceleration")
Qualitative <- c("origin", "name", "cylinders", "year")

What is the range of each quantitative predictor? You can answer this using the range() function.

## Treat "?" entries as missing values (NA)
Auto_df[Auto_df == "?"] <- NA  # replace "?" with NA

for (Value in Quantitative){
  Auto_df[[Value]] <- as.numeric(Auto_df[[Value]])              # Convert to numeric
  cat(Value,": ", range(Auto_df[[Value]], na.rm = TRUE), "\n")  # Print the range of each predictor
}

## mpg :  9 46.6 
## displacement :  68 455 
## horsepower :  46 230 
## weight :  1613 5140 
## acceleration :  8 24.8

What is the mean and standard deviation of each quantitative predictor?

for (Value in Quantitative){
  Mean_Quan = mean(Auto_df[[Value]], na.rm = TRUE)  # Calc Mean Value of each predictor
  std_Quan  = sd(Auto_df[[Value]]  , na.rm = TRUE)  # Calc std value of each predictor
  cat("Mean of " , Value, ":  ", Mean_Quan, "\n")
  cat("Std  of " , Value, ":  ", std_Quan, "\n\n")
}

## Mean of  mpg :   23.51587 
## Std  of  mpg :   7.825804 
## 
## Mean of  displacement :   193.5327 
## Std  of  displacement :   104.3796 
## 
## Mean of  horsepower :   104.4694 
## Std  of  horsepower :   38.49116 
## 
## Mean of  weight :   2970.262 
## Std  of  weight :   847.9041 
## 
## Mean of  acceleration :   15.55567 
## Std  of  acceleration :   2.749995
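
The range, mean, and standard deviation can also be collected into a single table with sapply(). This is a sketch on a made-up data frame (toy), not on Auto.csv.

```r
# Made-up values, including one NA to show the na.rm handling
toy <- data.frame(mpg = c(18, 15, NA, 16), weight = c(3504, 3693, 3436, 3433))

# One column of summary statistics per variable
stats <- sapply(toy, function(col) {
  c(min  = min(col, na.rm = TRUE),
    max  = max(col, na.rm = TRUE),
    mean = mean(col, na.rm = TRUE),
    sd   = sd(col, na.rm = TRUE))
})
print(round(t(stats), 2))  # transpose so that each variable is a row
```

Applied to the real data, the first argument would be Auto_df[Quantitative].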

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

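## Make a subset without rows 10 to 85.
## The trailing comma is required: Auto_df[-(10:85)] indexes columns and leaves every row in place,
## which is why the figures printed below are identical to the full-data results.
Auto_Sub = Auto_df[-(10:85), ]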

for (Value in Quantitative){
  Range_Quan = range(Auto_Sub[[Value]], na.rm = TRUE)    # Calc Range of each predictor
  Mean_Quan  = mean(Auto_Sub[[Value]] , na.rm = TRUE)    # Calc Mean Value of each predictor
  std_Quan   = sd(Auto_Sub[[Value]]   , na.rm = TRUE)    # calc std value of each predictor
  cat("Range of ", Value, ":  ", Range_Quan, "\n")
  cat("Mean of " , Value, ":  ", Mean_Quan , "\n")
  cat("Std  of " , Value, ":  ", std_Quan  , "\n\n")
}

## Range of  mpg :   9 46.6 
## Mean of  mpg :   23.51587 
## Std  of  mpg :   7.825804 
## 
## Range of  displacement :   68 455 
## Mean of  displacement :   193.5327 
## Std  of  displacement :   104.3796 
## 
## Range of  horsepower :   46 230 
## Mean of  horsepower :   104.4694 
## Std  of  horsepower :   38.49116 
## 
## Range of  weight :   1613 5140 
## Mean of  weight :   2970.262 
## Std  of  weight :   847.9041 
## 
## Range of  acceleration :   8 24.8 
## Mean of  acceleration :   15.55567 
## Std  of  acceleration :   2.749995
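
A note on the subsetting step: for a data frame, a single index such as df[-(10:85)] refers to columns, while df[-(10:85), ] refers to rows. The sketch below shows the difference on a made-up six-row data frame (toy).

```r
toy <- data.frame(id = 1:6, value = c(10, 20, 30, 40, 50, 60))

by_column <- toy[-(3:5)]      # no comma: column positions 3 to 5 do not exist, so nothing is dropped
by_row    <- toy[-(3:5), ]    # with comma: drops rows 3 to 5

print(dim(by_column))
print(dim(by_row))
print(by_row$id)
```

Statistics computed on by_column would therefore match those of the full data, whereas by_row is a genuine subset.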

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

Observation: Horsepower tends to increase as the number of cylinders increases, while acceleration tends to decrease as the number of cylinders increases.

## set print windows
par(mfrow = c(1, 2))

## Num of cylinders vs HorsePower
plot(Auto_df$cylinders, Auto_df$horsepower,
     main = "1. Cylinders vs Horsepower",
     xlab = "Cylinders",
     ylab = "Horsepower")

## Num of cylinders vs Acceleration
plot(Auto_df$cylinders, Auto_df$acceleration,
     main = "2. Cylinders vs Acceleration",
     xlab = "Cylinders",
     ylab = "Acceleration")
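
Because cylinders takes only a few distinct values, group summaries and side-by-side boxplots may show the pattern more clearly than a scatterplot. The sketch below uses mtcars, which ships with base R, purely as a stand-in for Auto.csv; its columns cyl and hp play the roles of cylinders and horsepower.

```r
# mtcars is used here only as a stand-in data set
med_hp <- tapply(mtcars$hp, mtcars$cyl, median)  # median horsepower per cylinder count
print(med_hp)

# Converting cyl to a factor makes plot() draw one box per group
plot(hp ~ factor(cyl), data = mtcars,
     xlab = "Cylinders", ylab = "Horsepower",
     main = "Horsepower by cylinder count (mtcars)")
```

The same calls should work on Auto_df with horsepower and cylinders in place of hp and cyl.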

Exercise 10

To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library. How many rows are in this data set? How many columns? What do the rows and columns represent?

library(ISLR2)
#Boston
?Boston

The data frame has 506 rows and 13 columns. Each row is one Boston suburb (census tract), and each column is a variable recorded for it:

crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.

Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

pairs(Boston[, c(1, 2, 5, 7, 8, 10, 11, 12)])

Observation:
- tax shows no clear relationship with the other predictors in these plots.
- nox is positively correlated with age and negatively correlated with dis.
- crim is positively correlated with lstat and negatively correlated with dis.
- age is positively correlated with nox, crim and lstat, and negatively correlated with dis and zn.

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

Based on the scatterplots above, crim is positively correlated with lstat and negatively correlated with dis.
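
A correlation table can back up what the scatterplots suggest. The helper cor_with() below is hypothetical (it is not part of ISLR2), and it is demonstrated on simulated stand-in data rather than on Boston.

```r
# Correlation of every numeric column with one target column, strongest first
cor_with <- function(df, target) {
  num <- df[sapply(df, is.numeric)]
  r <- cor(num, use = "pairwise.complete.obs")[, target]
  r <- r[names(r) != target]              # drop the target's correlation with itself
  r[order(abs(r), decreasing = TRUE)]
}

# Simulated stand-in data (not Boston): 'a' rises with y, 'b' falls with y
set.seed(3)
y <- rnorm(100)
toy <- data.frame(y = y, a = y + rnorm(100, sd = 0.5), b = -y + rnorm(100, sd = 2))
print(round(cor_with(toy, "y"), 2))
```

After library(ISLR2), calling cor_with(Boston, "crim") would rank the predictors by the strength of their linear association with the crime rate.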

Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

range(Boston$crim)       # Crime rates

## [1]  0.00632 88.97620

range(Boston$tax)        # Tax rates

## [1] 187 711

range(Boston$ptratio)    # Pupil-teacher ratios

## [1] 12.6 22.0

## Summary of each predictor
print("Summary of Crime rates:")

## [1] "Summary of Crime rates:"

summary(Boston$crim)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08205  0.25651  3.61352  3.67708 88.97620

print("Summary of Tax rates:")

## [1] "Summary of Tax rates:"

summary(Boston$tax)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

print("Summary of Pupil-teacher ratios:")

## [1] "Summary of Pupil-teacher ratios:"

summary(Boston$ptratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

#View(Boston)

Crime rates: The mean is 3.61 and the median is 0.26, but the maximum is 88.98. Some tracts have particularly high crime rates, which is why the mean is far above the median.
Tax rates: The mean is 408.2 and the median is 330.0, and the maximum is 711.0. The 3rd quartile is already 666.0, so a sizeable group of tracts has high tax rates, which pulls the mean above the median.
Pupil-teacher ratios: The median (19.05) and mean (18.46) are close, and the maximum (22.00) is not far above the 3rd quartile (20.20), so no tract stands out as particularly high.
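
One way to make "particularly high" more precise is the boxplot whisker rule, which flags values above Q3 + 1.5 * IQR. The helper high_outliers() and the vector toy below are hypothetical and only illustrate the idea.

```r
# Indices of values above Q3 + 1.5 * IQR (the usual boxplot whisker rule)
high_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  upper <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
  which(x > upper)
}

toy <- c(0.1, 0.2, 0.3, 0.25, 0.15, 9.0)  # made-up rates with one extreme value
print(high_outliers(toy))
```

For the Boston data, length(high_outliers(Boston$crim)) would count how many tracts exceed that threshold.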

How many of the census tracts in this data set bound the Charles river?

sum(Boston$chas == 1)  # Sum of 1 in chas predictor

## [1] 35

What is the median pupil-teacher ratio among the towns in this data set?

median(Boston$ptratio) # Calc Median value of ptratio

## [1] 19.05

Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

Lowest_Idx <- which.min(Boston$medv)        # Index of the observation with the lowest medv.
Lowest_Observation <- Boston[Lowest_Idx,]   # Get all predictors of the observation
#print(Lowest_Observation)

Ranges <- apply(Boston, MARGIN = 2, range)  # Apply range to all predictors.
#print(Ranges)

for(Value in names(Boston)){
  cat(Value, "=> Lowest: ", Lowest_Observation[[Value]], "  Min: ", Ranges[1, Value], "  Max: ", Ranges[2, Value], "\n")
}

## crim => Lowest:  38.3518   Min:  0.00632   Max:  88.9762 
## zn => Lowest:  0   Min:  0   Max:  100 
## indus => Lowest:  18.1   Min:  0.46   Max:  27.74 
## chas => Lowest:  0   Min:  0   Max:  1 
## nox => Lowest:  0.693   Min:  0.385   Max:  0.871 
## rm => Lowest:  5.453   Min:  3.561   Max:  8.78 
## age => Lowest:  100   Min:  2.9   Max:  100 
## dis => Lowest:  1.4896   Min:  1.1296   Max:  12.1265 
## rad => Lowest:  24   Min:  1   Max:  24 
## tax => Lowest:  666   Min:  187   Max:  711 
## ptratio => Lowest:  20.2   Min:  12.6   Max:  22 
## lstat => Lowest:  30.59   Min:  1.73   Max:  37.97 
## medv => Lowest:  5   Min:  5   Max:  50

Observation: Relative to the overall ranges, this tract has a high crime rate, tax rate, index of accessibility to radial highways, percentage of lower-status population, and pupil-teacher ratio, and its age value (100) equals the maximum. It has a relatively low number of rooms per dwelling and is located close to employment centers.
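
Comparing against the minimum and maximum alone can hide where a value sits within the distribution. A possible refinement is to compute, for each predictor, the share of tracts at or below the chosen tract's value. The helper percentile_of_row() and the data frame toy are hypothetical.

```r
# Share of rows whose value is <= the chosen row's value, per column
percentile_of_row <- function(df, idx) {
  sapply(df, function(col) mean(col <= col[idx]))
}

toy <- data.frame(crime = c(1, 2, 3, 40), rooms = c(6, 7, 8, 5))  # made-up values
print(percentile_of_row(toy, 4))
```

Values near 1 mean the tract is among the highest for that predictor and values near 0 mean it is among the lowest; percentile_of_row(Boston, Lowest_Idx) would apply this to the tract found above.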

In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

cat("more than 7: ", sum(Boston$rm > 7), "\n")

## more than 7:  64

cat("more than 8: ", sum(Boston$rm > 8), "\n")

## more than 8:  13

Boston_8      <- Boston[Boston$rm > 8, ]
Boston_others <- Boston[Boston$rm <= 8, ]


summary(Boston_8)[4,]

##                crim                  zn               indus                chas 
## "Mean   :0.71880  "   "Mean   :13.62  "  "Mean   : 7.078  "  "Mean   :0.1538  " 
##                 nox                  rm                 age                 dis 
##  "Mean   :0.5392  "   "Mean   :8.349  "   "Mean   :71.54  "   "Mean   :3.430  " 
##                 rad                 tax             ptratio               lstat 
##  "Mean   : 7.462  "   "Mean   :325.1  "   "Mean   :16.36  "    "Mean   :4.31  " 
##                medv 
##    "Mean   :44.2  "

summary(Boston_others)[4,]

##                 crim                   zn                indus 
## "Mean   : 3.68986  "    "Mean   : 11.3  "    "Mean   :11.24  " 
##                 chas                  nox                   rm 
##  "Mean   :0.06694  "   "Mean   :0.5551  "    "Mean   :6.230  " 
##                  age                  dis                  rad 
##    "Mean   : 68.5  "   "Mean   : 3.805  "   "Mean   : 9.604  " 
##                  tax              ptratio                lstat 
##    "Mean   :410.4  "    "Mean   :18.51  "    "Mean   :12.87  " 
##                 medv 
##    "Mean   :21.96  "

Census tracts that average more than 8 rooms per dwelling have lower mean crim, ptratio, and lstat, and a higher mean medv, than the other tracts.
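
Since summary()[4, ] returns the means as text, a numeric comparison may be easier to read with colMeans(). The sketch below uses a made-up data frame (toy) with only two columns.

```r
# Made-up values for illustration
toy <- data.frame(rm = c(8.2, 8.5, 6.1, 5.9, 6.4), medv = c(45, 50, 21, 19, 24))

large <- toy$rm > 8                       # TRUE for rows with more than 8 rooms
comparison <- rbind(more_than_8 = colMeans(toy[large, ]),
                    others      = colMeans(toy[!large, ]))
print(round(comparison, 2))
```

The same pattern with Boston in place of toy would put the two groups' means side by side as numbers.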

Predictive_Modeling_HW1

Jongbum Choi - pnt 130

2025-06-08