Explain
Classification is used when the response is qualitative, aiming to assign inputs to categories.
Regression is used when the response is quantitative, predicting continuous numeric values.
Inference focuses on understanding the relationship between predictors and the response.
Prediction focuses on accurately estimating the response for new data.
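The difference between inference and prediction can be shown with a single fitted model. The following is a minimal sketch that assumes R's built-in mtcars data as a stand-in (it is not part of this exercise), and `new_cars` is a hypothetical set of new observations.

```r
# Illustrative only: mtcars ships with base R and is not used elsewhere in this document.
fit <- lm(mpg ~ wt, data = mtcars)

# Inference: examine the estimated relationship between predictor and response.
coef(summary(fit))   # estimates, standard errors, t-statistics, p-values

# Prediction: estimate the response for new, unseen predictor values.
new_cars <- data.frame(wt = c(2.5, 3.5))   # hypothetical new observations
pred <- predict(fit, newdata = new_cars)
pred
```

The same model object serves both goals; what changes is whether the coefficients or the predicted values are the quantity of interest.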
Parametric statistical learning approaches assume a specific functional form for the relationship between predictors and the response. Non-parametric approaches, like K-nearest neighbors or decision trees, make no such assumptions, allowing the model to flexibly adapt to the data’s structure without a predefined form.
Advantage: Parametric approaches tend to have lower variance because they are less sensitive to changes in the training data, which leads to more stable predictions across different datasets. Parametric models are also easier to interpret, since each parameter relates directly to the effect of a predictor. In addition, they need fewer observations to estimate their parameters, so they can perform well on smaller datasets.
Disadvantage: If the assumed functional form is far from the true relationship, a parametric model has high bias, which leads to systematic errors and a poor fit. Parametric models therefore handle complex, non-linear relationships poorly unless the functional form is chosen to accommodate them.
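The bias trade-off described above can be sketched on simulated data. This example assumes a made-up non-linear truth, y = sin(x) plus noise, and compares a straight-line fit with `loess()` as one possible non-parametric method. The seed, sample sizes, and noise level are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data with a non-linear truth: y = sin(x) + noise (an assumption for illustration).
n <- 200
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.3)
train <- data.frame(x = x[1:100],   y = y[1:100])
test  <- data.frame(x = x[101:200], y = y[101:200])

# Parametric: a straight line, i.e. a fixed functional form.
fit_lm <- lm(y ~ x, data = train)

# Non-parametric: local regression, no global functional form assumed.
# surface = "direct" lets predict() evaluate points slightly outside the training range.
fit_lo <- loess(y ~ x, data = train, control = loess.control(surface = "direct"))

# Test-set mean squared error for a fitted model.
mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)
mse_lm <- mse(fit_lm)
mse_lo <- mse(fit_lo)
c(linear = mse_lm, loess = mse_lo)
```

On data like these the linear model cannot follow the curvature, so its test error is dominated by bias. With a truly linear relationship the ranking would be expected to reverse or even out.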
File_path <- "~/Downloads/College.csv"
College_df <- read.csv(File_path) # Load csv file -> data.frame
Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.
rownames(College_df) <- College_df[, 1] # Use the university names as row names
#View(College_df)
College_df <- College_df[, -1] # Drop the university-name column
#View(College_df)
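As an alternative to the two-step approach above, `read.csv()` can assign the row names while reading. The sketch below uses a tiny inline CSV with made-up values in place of College.csv, so the column names and contents are assumptions.

```r
# A tiny inline CSV standing in for College.csv (the values are made up).
csv_text <- "Name,Private,Apps
Alpha College,Yes,1200
Beta University,No,5400"

# row.names = 1 uses the first column as row names and drops it from the data.
toy_df <- read.csv(text = csv_text, row.names = 1)

rownames(toy_df)
names(toy_df)
```

Both routes give the same result. The explicit version above makes each step visible, while `row.names = 1` is shorter.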
summary(College_df)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
pairs(College_df[, 2:11]) # Columns 2 to 11 are numeric; column 1 (Private) is not
College_df$Private <- as.factor(College_df$Private) # Convert chr -> factor
plot(Outstate ~ Private, data = College_df) # produce side-by-side boxplots
Elite <- rep("No", nrow(College_df)) # Start with "No" for every university
Elite[College_df$Top10perc > 50] <- "Yes" # Mark "Yes" where Top10perc exceeds 50
Elite <- as.factor(Elite) # Convert to factor
College_df <- data.frame(College_df, Elite) # Append Elite as a new column
par(mfrow = c(2, 2)) # Split the plotting window into a 2 x 2 grid
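The same binning can be written in one step with `ifelse()`. This is a sketch on a hypothetical vector `top10`, which stands in for `College_df$Top10perc`.

```r
# Hypothetical Top10perc values, standing in for College_df$Top10perc.
top10 <- c(12, 55, 50, 78, 30)

# ifelse() applies the threshold elementwise; factor() fixes the level order.
elite_toy <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite_toy)   # counts per level
```

Note that a value of exactly 50 is labeled "No", because the condition is a strict inequality, as in the code above.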
#View(College_df)
hist(College_df$Apps , breaks = 20, main = "Apps Histogram" , xlab = "Apps") # Histogram of Apps
hist(College_df$Accept , breaks = 20, main = "Accept Histogram" , xlab = "Accept")
hist(College_df$Outstate , breaks = 20, main = "Outstate Histogram" , xlab = "Outstate")
hist(College_df$PhD , breaks = 20, main = "PhD Histogram" , xlab = "PhD")
College_df$Enroll_Rate <- College_df$Enroll / College_df$Accept # Enrollment rate: share of accepted applicants who enroll
plot(College_df$Room.Board, College_df$Enroll_Rate,
main = "Enroll Rate vs Room & Board Cost",
xlab = "Room and Board Cost",
ylab = "Enrollment Rate")
Auto_df <- read.csv("~/Downloads/Auto.csv")
#View(Auto_df)
# Names of the quantitative and qualitative variables
Quantitative <- c("mpg", "displacement", "horsepower", "weight", "acceleration")
Qualitative <- c("origin", "name", "cylinders", "year")
## Treat "?" entries as missing values
Auto_df[Auto_df == "?"] <- NA # Replace "?" with NA
for (Value in Quantitative){
Auto_df[[Value]] <- as.numeric(Auto_df[[Value]]) # Convert to numeric
cat(Value,": ", range(Auto_df[[Value]], na.rm = TRUE), "\n") # Check the range of each predictor
}
## mpg : 9 46.6
## displacement : 68 455
## horsepower : 46 230
## weight : 1613 5140
## acceleration : 8 24.8
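The "?" markers can also be handled at read time, which avoids the separate replacement and conversion steps. The sketch below uses a small inline CSV with made-up values in place of Auto.csv.

```r
# Inline CSV standing in for Auto.csv (the values are made up).
csv_text <- "mpg,horsepower
18,130
25,?
31,65"

# na.strings = "?" converts the marker to NA while reading,
# so horsepower is parsed as a numeric column straight away.
toy_auto <- read.csv(text = csv_text, na.strings = "?")

str(toy_auto)
range(toy_auto$horsepower, na.rm = TRUE)
```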
for (Value in Quantitative){
Mean_Quan = mean(Auto_df[[Value]], na.rm = TRUE) # Calc Mean Value of each predictor
std_Quan = sd(Auto_df[[Value]] , na.rm = TRUE) # Calc std value of each predictor
cat("Mean of " , Value, ": ", Mean_Quan, "\n")
cat("Std of " , Value, ": ", std_Quan, "\n\n")
}
## Mean of mpg : 23.51587
## Std of mpg : 7.825804
##
## Mean of displacement : 193.5327
## Std of displacement : 104.3796
##
## Mean of horsepower : 104.4694
## Std of horsepower : 38.49116
##
## Mean of weight : 2970.262
## Std of weight : 847.9041
##
## Mean of acceleration : 15.55567
## Std of acceleration : 2.749995
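The per-variable loops can also be expressed with `sapply()`, which collects all statistics in one matrix. The sketch below uses a toy data frame with made-up values in place of the quantitative Auto columns.

```r
# Toy numeric data frame standing in for the quantitative Auto columns (made-up values).
toy <- data.frame(mpg = c(18, 25, NA, 31), weight = c(3500, 2600, 2900, 2100))

# One function call per column; each call returns a named vector of statistics.
stats <- sapply(toy, function(col) {
  c(min  = min(col, na.rm = TRUE),
    max  = max(col, na.rm = TRUE),
    mean = mean(col, na.rm = TRUE),
    sd   = sd(col, na.rm = TRUE))
})

round(stats, 2)   # rows are statistics, columns are variables
```

With the real data, the analogous call would presumably be applied to `Auto_df[Quantitative]`.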
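## Make a subset without rows 10 to 85
Auto_Sub = Auto_df[-(10:85), ] # The comma selects rows; without it the index applies to columns and nothing is removed
# NOTE: the output below came from the version without the comma, so it repeats the full-data statistics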
for (Value in Quantitative){
Range_Quan = range(Auto_Sub[[Value]], na.rm = TRUE) # Calc Range of each predictor
Mean_Quan = mean(Auto_Sub[[Value]] , na.rm = TRUE) # Calc Mean Value of each predictor
std_Quan = sd(Auto_Sub[[Value]] , na.rm = TRUE) # calc std value of each predictor
cat("Range of ", Value, ": ", Range_Quan, "\n")
cat("Mean of " , Value, ": ", Mean_Quan , "\n")
cat("Std of " , Value, ": ", std_Quan , "\n\n")
}
## Range of mpg : 9 46.6
## Mean of mpg : 23.51587
## Std of mpg : 7.825804
##
## Range of displacement : 68 455
## Mean of displacement : 193.5327
## Std of displacement : 104.3796
##
## Range of horsepower : 46 230
## Mean of horsepower : 104.4694
## Std of horsepower : 38.49116
##
## Range of weight : 1613 5140
## Mean of weight : 2970.262
## Std of weight : 847.9041
##
## Range of acceleration : 8 24.8
## Mean of acceleration : 15.55567
## Std of acceleration : 2.749995
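A data frame indexed without a comma is treated as a list of columns, so removing observations requires the form `df[-rows, ]`. The difference is easy to check on a made-up toy data frame.

```r
# Toy data frame with 5 rows and 2 columns (made-up values).
toy <- data.frame(a = 1:5, b = 6:10)

# No comma: the index refers to columns. Columns 3 and 4 do not exist,
# so nothing is dropped and all 5 rows remain.
no_comma <- toy[-(3:4)]

# With a comma: the index refers to rows, so rows 3 and 4 are removed.
with_comma <- toy[-(3:4), ]

dim(no_comma)
dim(with_comma)
```

A quick `dim()` or `nrow()` check after subsetting is a cheap way to confirm that the intended rows were actually removed.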
Observation: Horsepower tends to increase as the number of cylinders increases, while acceleration tends to decrease as the number of cylinders increases.
## set print windows
par(mfrow = c(1, 2))
## Num of cylinders vs HorsePower
plot(Auto_df$cylinders, Auto_df$horsepower,
main = "1. Cylinders vs Horsepower",
xlab = "Cylinders",
ylab = "Horsepower")
## Num of cylinders vs Acceleration
plot(Auto_df$cylinders, Auto_df$acceleration,
main = "2. Cylinders vs Acceleration",
xlab = "Cylinders",
ylab = "Acceleration")
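A numeric companion to these plots is the mean of a variable within each cylinder group. The sketch below assumes R's built-in mtcars data as a stand-in for Auto, with `hp` playing the role of horsepower. It illustrates the technique rather than results for the Auto data.

```r
# mtcars (built into R) is a stand-in for Auto; hp plays the role of horsepower.
hp_by_cyl <- tapply(mtcars$hp, mtcars$cyl, mean)   # mean horsepower per cylinder count
round(hp_by_cyl, 1)
```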
library(ISLR2)
#Boston
?Boston
A data frame with 506 rows and 13 variables.
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
pairs(Boston[, c(1, 2, 5, 7, 8, 10, 11, 12)])
Observation:
- tax shows no clear relationship with the other predictors in these scatterplots.
- nox is positively correlated with age and negatively correlated with dis.
- crim is positively correlated with lstat and negatively correlated with dis.
- age is positively correlated with nox, crim, and lstat, and negatively correlated with dis and zn.
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
- Based on the scatterplot matrix above, crim is positively correlated with lstat and negatively correlated with dis.
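Visual impressions from a scatterplot matrix can be backed up with a correlation table. The sketch below uses R's built-in mtcars data as a stand-in, with `mpg` playing the role of `crim`. With ISLR2 loaded, the analogous call would be `cor(Boston)[, "crim"]`.

```r
# mtcars is a stand-in here; with ISLR2 loaded, cor(Boston)[, "crim"] is the analogous call.
cors <- cor(mtcars)[, "mpg"]          # correlation of every variable with mpg
cors <- cors[names(cors) != "mpg"]    # drop the trivial self-correlation

# Order by absolute strength so the strongest associations come first.
sorted <- cors[order(abs(cors), decreasing = TRUE)]
round(sorted, 2)
```

Pearson correlation only measures linear association, so it complements the plots rather than replacing them.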
range(Boston$crim) # Crime rates
## [1] 0.00632 88.97620
range(Boston$tax) # Tax rates
## [1] 187 711
range(Boston$ptratio) # Pupil-teacher ratios
## [1] 12.6 22.0
## Summary of each predictor
print("Summary of Crime rates:")
## [1] "Summary of Crime rates:"
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08205 0.25651 3.61352 3.67708 88.97620
print("Summary of Tax rates:")
## [1] "Summary of Tax rates:"
summary(Boston$tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
print("Summary of pupil-teacher ratios:")
## [1] "Summary of pupil-teacher ratios:"
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
#View(Boston)
sum(Boston$chas == 1) # Number of census tracts that bound the Charles River
## [1] 35
median(Boston$ptratio) # Calc Median value of ptratio
## [1] 19.05
Lowest_Idx <- which.min(Boston$medv) # Find the observation with the lowest medv (the first one if there are ties)
Lowest_Observation <- Boston[Lowest_Idx,] # Get all predictors of the observation
#print(Lowest_Observation)
Ranges <- apply(Boston, MARGIN = 2, range) # Apply range to all predictors.
#print(Ranges)
for(Value in names(Boston)){
cat(Value, "=> Lowest: ", Lowest_Observation[[Value]], " Min: ", Ranges[1, Value], " Max: ", Ranges[2, Value], "\n")
}
## crim => Lowest: 38.3518 Min: 0.00632 Max: 88.9762
## zn => Lowest: 0 Min: 0 Max: 100
## indus => Lowest: 18.1 Min: 0.46 Max: 27.74
## chas => Lowest: 0 Min: 0 Max: 1
## nox => Lowest: 0.693 Min: 0.385 Max: 0.871
## rm => Lowest: 5.453 Min: 3.561 Max: 8.78
## age => Lowest: 100 Min: 2.9 Max: 100
## dis => Lowest: 1.4896 Min: 1.1296 Max: 12.1265
## rad => Lowest: 24 Min: 1 Max: 24
## tax => Lowest: 666 Min: 187 Max: 711
## ptratio => Lowest: 20.2 Min: 12.6 Max: 22
## lstat => Lowest: 30.59 Min: 1.73 Max: 37.97
## medv => Lowest: 5 Min: 5 Max: 50
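Comparing one observation against the minimum and maximum can be complemented by its percentile within each variable. This is a sketch using `ecdf()`, again with mtcars as a stand-in for Boston and `mpg` playing the role of `medv`.

```r
# mtcars stands in for Boston; mpg plays the role of medv.
idx <- which.min(mtcars$mpg)   # first row with the smallest mpg

# For each variable, the share of observations less than or equal to this row's value.
pct <- sapply(names(mtcars), function(v) ecdf(mtcars[[v]])(mtcars[idx, v]))
round(100 * pct, 1)
```

A percentile near 100 marks a value that is high relative to the rest of the data, and a percentile near 0 marks a low one. This is harder to judge from a range alone when the distribution is skewed.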
Observation: This tract has a high crime rate, property-tax rate, index of accessibility to radial highways, percentage of lower-status population, and pupil-teacher ratio. It has a relatively low number of rooms per dwelling and is located close to employment centers.
cat("more than 7: ", sum(Boston$rm > 7), "\n")
## more than 7: 64
cat("more than 8: ", sum(Boston$rm > 8), "\n")
## more than 8: 13
Boston_8 <- Boston[Boston$rm > 8, ]
Boston_others <- Boston[Boston$rm <= 8, ]
summary(Boston_8)[4,]
## crim zn indus chas
## "Mean :0.71880 " "Mean :13.62 " "Mean : 7.078 " "Mean :0.1538 "
## nox rm age dis
## "Mean :0.5392 " "Mean :8.349 " "Mean :71.54 " "Mean :3.430 "
## rad tax ptratio lstat
## "Mean : 7.462 " "Mean :325.1 " "Mean :16.36 " "Mean :4.31 "
## medv
## "Mean :44.2 "
summary(Boston_others)[4,]
## crim zn indus
## "Mean : 3.68986 " "Mean : 11.3 " "Mean :11.24 "
## chas nox rm
## "Mean :0.06694 " "Mean :0.5551 " "Mean :6.230 "
## age dis rad
## "Mean : 68.5 " "Mean : 3.805 " "Mean : 9.604 "
## tax ptratio lstat
## "Mean :410.4 " "Mean :18.51 " "Mean :12.87 "
## medv
## "Mean :21.96 "
Census tracts averaging more than eight rooms per dwelling have lower mean crim, ptratio, and lstat, and a higher mean medv, than the remaining tracts.
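Group means can also be computed as plain numbers with `colMeans()`, which is easier to compare than the formatted strings returned by `summary()`. The sketch below uses mtcars as a stand-in for Boston, with the hypothetical split `wt > 4` playing the role of `rm > 8`.

```r
# mtcars stands in for Boston; "wt > 4" plays the role of "rm > 8".
heavy <- mtcars$wt > 4

# Numeric group means, one column per group, instead of formatted summary() strings.
group_means <- cbind(
  heavy  = colMeans(mtcars[heavy, ]),
  others = colMeans(mtcars[!heavy, ])
)
round(group_means, 2)
```

Because the result is a numeric matrix, differences between the groups can be computed directly, for example with `group_means[, "heavy"] - group_means[, "others"]`.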