For each of the following scenarios, indicate whether it is a regression or classification problem, whether the goal is inference or prediction, and the values of n and p.
(a) CEO Salary Analysis:
(b) New Product Success Prediction:
(c) USD/Euro Exchange Rate Prediction:
Explain the difference between a flexible and inflexible statistical learning approach. In what situations would a flexible approach be preferred over a less flexible approach? In what situations would a less flexible approach be preferred?
A very flexible approach in regression or classification refers to models that can adapt to complex, non-linear relationships in the data.
Advantages:
Disadvantages:
When to Use a Flexible Approach:
When to Use a Less Flexible Approach:
Describe the differences between parametric and non-parametric statistical learning approaches. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
Differences Between Parametric and Non-Parametric Approaches:
Advantages of Parametric Models:
Disadvantages of Parametric Models:
a) Load the College dataset
college <- read.csv("data/College.csv")
b) Set the row names to be the college names
rownames(college) <- college[, 1]
View(college)
college <- college [, -1]
View(college)
c) Exploratory Data Analysis
i) Summary of the data set
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
ii) Create a pairs plot of the first 10 columns
college$Private <- college$Private == "Yes"
pairs(college[, 1:10], cex = 0.2)
iii) Side-by-side boxplots of Outstate vs Private
# Side-by-side boxplots of Outstate vs Private
plot(factor(college$Private), college$Outstate, main="Outstate Tuition by Private/Public Universities",
xlab="Private", ylab="Outstate Tuition", col=c("lightblue", "pink"))
iv) Boxplot of Outstate vs Elite
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, main="Outstate Tuition by Elite Status",
xlab="Elite Status", ylab="Outstate Tuition", col=c("lightgreen", "orange"))
Inference:
v) Histograms of few Numeric Variables varying bins
par(mfrow = c(2, 2))
hist(college$Apps, breaks = 30, main = "Histogram of Applications", col = "skyblue", xlab = "Applications")
hist(college$Accept, breaks = 25, main = "Histogram of Acceptances", col = "lightgreen", xlab = "Acceptances")
hist(college$Enroll, breaks = 20, main = "Histogram of Enrollments", col = "lightpink", xlab = "Enrollments")
hist(college$Top10perc, breaks = 10, main = "Histogram of Top 10% Students", col = "lightyellow", xlab = "Top 10%")
Inference:
(vi) Further exploration:
Boxplots for comparison of quantitative variables by Private Status
par(mfrow = c(2, 2))
boxplot(Outstate ~ Private, data = college,
xlab = "Private", ylab = "Top10perc", col = c("lightblue", "lightgreen"))
boxplot(Enroll ~ Private, data = college,
xlab = "Private", ylab = "Enrollments", col = c("lightblue", "lightgreen"))
boxplot(Accept ~ Private, data = college,
xlab = "Private", ylab = "Acceptances", col = c("lightblue", "lightgreen"))
boxplot(Apps ~ Private, data = college,
xlab = "Private", ylab = "Applications", col = c("lightblue", "lightgreen"))
# Reset the plotting window
par(mfrow = c(1, 1))
Correlation analysis between key quantitative variables
cor_matrix <- cor(college[, sapply(college, is.numeric)], use = "complete.obs")
heatmap(cor_matrix, col = colorRampPalette(c("blue", "white", "red"))(256), scale = "none", margins = c(5,5),cexRow = 0.8, cexCol = 0.8)
Key Findings:
High Correlations:
Moderate Correlations:
Negative Correlations:
Weak Correlations:
Overall Summary: Universities with more applications and full-time undergraduates tend to have higher enrollments. Higher tuition (Outstate) is associated with higher spending and lower student-faculty ratios. Selective schools (Top10perc) tend to have higher spending.
auto_data <- read.table("data/Auto.data", header = TRUE, na.strings = "?")
auto_data <- na.omit(auto_data)
(a) Identify which of the predictors are quantitative, and which are qualitative
sapply(auto_data, class)
## mpg cylinders displacement horsepower weight acceleration
## "numeric" "integer" "numeric" "numeric" "numeric" "numeric"
## year origin name
## "integer" "integer" "character"
numeric <- which(sapply(auto_data, class) == "numeric")
names(numeric)
## [1] "mpg" "displacement" "horsepower" "weight" "acceleration"
Quantitative Predictors (Numeric):
Qualitative Predictors (Categorical):
auto_data$cylinders = as.factor(auto_data$cylinders)
auto_data$year = as.factor(auto_data$year)
auto_data$origin = as.factor(auto_data$origin)
(b) Range of quantitative predictors:
sapply(auto_data[, numeric], function(auto_data) diff(range(auto_data)))
## mpg displacement horsepower weight acceleration
## 37.6 387.0 184.0 3527.0 16.8
(c) Mean and standard deviation of quantitative predictors:
Mean
sapply(auto_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], mean)
## mpg displacement horsepower weight acceleration
## 23.44592 194.41199 104.46939 2977.58418 15.54133
Standard deviation
sapply(auto_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], sd)
## mpg displacement horsepower weight acceleration
## 7.805007 104.644004 38.491160 849.402560 2.758864
(d) Remove few observations (10:85) and recalculate range , mean and standard deviation:
subset_data <- auto_data[-(10:85), ] # Remove rows 10 to 85
print("Range")
## [1] "Range"
sapply(subset_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], range)
## mpg displacement horsepower weight acceleration
## [1,] 11.0 68 46 1649 8.5
## [2,] 46.6 455 230 4997 24.8
print("Mean")
## [1] "Mean"
sapply(subset_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], mean)
## mpg displacement horsepower weight acceleration
## 24.40443 187.24051 100.72152 2935.97152 15.72690
print("Standard deviation")
## [1] "Standard deviation"
sapply(subset_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], sd)
## mpg displacement horsepower weight acceleration
## 7.867283 99.678367 35.708853 811.300208 2.693721
(e) Exploratory Data Analysis through visualizations:
numeric_vars <- sapply(auto_data, is.numeric)
pairs(auto_data[, numeric_vars], main = "Scatterplot Matrix")
Inference:
# Boxplot of mpg by cylinders
boxplot(mpg ~ cylinders, data = auto_data, main = "mpg by Cylinders",xlab = "Cylinders", ylab = "mpg", col = "lightblue")
# Boxplot of mpg by year
boxplot(mpg ~ year, data = auto_data, main = "mpg by Year", xlab = "Year", ylab = "mpg", col = "lightcoral")
# Boxplot of mpg by origin
boxplot(mpg ~ origin, data = auto_data, main = "mpg by Origin", xlab = "Origin", ylab = "mpg", col = c("lightgreen", "lightpink", "lightyellow"))
# Scatter plot of horsepower vs mpg, colored by origin
ggplot(auto_data, aes(x = horsepower, y = mpg, color = factor(origin))) +
geom_point() +
labs(title = "Horsepower vs mpg by Origin",
x = "Horsepower", y = "mpg")
Inference
(f) Predictors of mpg:
Based on the relationships observed in the plots:
- Strong Predictors of mpg: Displacement, Horsepower, Weight, and Cylinders are the most significant predictors of mpg due to their strong correlations (either negative or positive).
- Moderate Predictors of mpg: Acceleration and Year may be moderately useful, with acceleration having a weak positive correlation and year showing a clear trend over time.
- Categorical Variable: Origin is another useful predictor, especially given that cars from different regions (e.g., Japanese cars) exhibit varying fuel efficiencies.
(a) Load the data and check dimensions:
data(Boston)
dim(Boston)
## [1] 506 13
-There are 506 Rows and 13 columns in this dataset
(b) Pairwise scatter plots and correlation analysis:
pairs(Boston)
Key Insights from Pairwise Scatterplot:
medv and rm: Strong positive relationship. As the number of rooms increases, the median home value (medv) also increases.
medv and lstat: Strong negative relationship. Higher percentage of lower-status population (lstat) is associated with lower median home values.
medv and tax: Weak negative relationship. Higher property tax rates tend to be associated with slightly lower home values.
medv and age: Weak negative trend. Older homes (age) tend to have lower median values.
nox and indus: Moderate positive correlation. More industrial areas tend to have higher nitrogen oxide levels.
zn and medv: Weak positive relationship. More residential land zoned for large lots correlates with higher home values.
crim, rad, and dis: Weaker relationships with medv. Crime rate (crim) shows a slight negative trend, while dis and rad show weak positive trends.
(c) Predictors associated with per capita crime rate
# Correlation between crime rate and other predictors
cor(Boston$crim, Boston[, -13]) # Exclude MEDV from correlation calculation
## crim zn indus chas nox rm age
## [1,] 1 -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343
## dis rad tax ptratio lstat
## [1,] -0.3796701 0.6255051 0.5827643 0.2899456 0.4556215
Positive Correlations:
Negative Correlations:
Weak or No Significant Correlation:
(d) Census tracts with particularly high crime rates, tax rates, pupil-teacher ratios
Boston |>
pivot_longer(cols = 1:13) |>
filter(name %in% c("crim", "tax", "ptratio")) |>
ggplot(aes(value)) +
geom_histogram(bins = 20) +
facet_wrap(~name, scales = "free", ncol = 1)
print("Summary statistics of crime rate")
## [1] "Summary statistics of crime rate"
summary(Boston$crim) # Summary statistics of crime rate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
print("Summary statistics of pupil-teacher ratio")
## [1] "Summary statistics of pupil-teacher ratio"
summary(Boston$ptratio) # Summary statistics of pupil-teacher ratio
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
print("Summary statistics of property tax")
## [1] "Summary statistics of property tax"
summary(Boston$tax) # Summary statistics of property tax
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
Key Descriptive Statistics and Observations:
crim): Significant
variation.tax): Considerable range.ptratio): Narrower range
compared to crime and tax rates.(e) Census tracts bounding the Charles River
print("count of census tracts bound charles river")
## [1] "count of census tracts bound charles river"
sum(Boston$chas)
## [1] 35
(f) Median pupil-teacher ratio
print(" median pupil-teacher ratio")
## [1] " median pupil-teacher ratio"
median(Boston$ptratio)
## [1] 19.05
(g) Census tract with the lowest median home value
min_medv_row <- Boston[which.min(Boston$medv), ] # Row with the minimum median home value
str(min_medv_row) # View the row
## 'data.frame': 1 obs. of 13 variables:
## $ crim : num 38.4
## $ zn : num 0
## $ indus : num 18.1
## $ chas : int 0
## $ nox : num 0.693
## $ rm : num 5.45
## $ age : num 100
## $ dis : num 1.49
## $ rad : int 24
## $ tax : num 666
## $ ptratio: num 20.2
## $ lstat : num 30.6
## $ medv : num 5
summary(Boston) # summary statistics for the entire dataset
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Analysis of Census Tract with Lowest Median Home Value:
crim): Extremely high.indus): Above the
median.rm): Below the median.tax): Among the highest.lstat): Markedly
high.(h) Census tracts averaging more than seven rooms per dwelling, and more than eight rooms per dwelling
print("count of census tract average more than 7 rooms per dwelling")
## [1] "count of census tract average more than 7 rooms per dwelling"
sum(Boston$rm > 7)
## [1] 64
print("count of census tract average more than 8 rooms per dwelling")
## [1] "count of census tract average more than 8 rooms per dwelling"
sum(Boston$rm > 8)
## [1] 13
large_homes <- Boston[Boston$rm > 8, ]
summary(large_homes)
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio lstat medv
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :2.47 Min. :21.9
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:3.32 1st Qu.:41.7
## Median : 7.000 Median :307.0 Median :17.40 Median :4.14 Median :48.3
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :4.31 Mean :44.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :7.44 Max. :50.0
Analysis of Census Tracts with More Than Eight Rooms per Dwelling:
crim): Lower average crime
rates.dis): Tendency
to be farther from employment centers.tax): Generally, lower
property tax rates.lstat): Show lower
percentages of lower-status population.medv): Substantially
elevated.