library(readr)
college <- read.csv("College.csv", row.names = 1)
Assignment 1
0.1 Problem 2
0.1.1 Part A
Problem type: Regression
The response variable (CEO salary) is continuous, so it’s a regression problem.
Goal: Inference
Since the interest is in understanding which factors affect salary, this suggests inference (examining relationships and significance of predictors).
n (number of observations): 500 firms.
p (number of predictors): 3
Profit
Number of employees
Industry (likely categorical, may be represented by multiple dummy variables in practice, but conceptually one variable).
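To make the note about Industry concrete, the sketch below shows how R expands a single categorical predictor into several dummy columns. The data frame `firms` and all of its values are invented for illustration; they are not part of the assignment.

```r
# Hypothetical toy data: five firms, values invented for illustration only
firms <- data.frame(
  profit    = c(1.2, 0.4, 3.1, 2.0, 0.9),
  employees = c(150, 40, 900, 320, 75),
  industry  = factor(c("Tech", "Retail", "Energy", "Tech", "Retail"))
)

# model.matrix() builds the design matrix that lm() would use internally
X <- model.matrix(~ profit + employees + industry, data = firms)

# One conceptual predictor (industry, 3 levels) becomes 2 dummy columns;
# the first level alphabetically ("Energy") is absorbed into the intercept
colnames(X)
```

So p = 3 conceptually, even though a fitted model would estimate more than three slope coefficients once Industry is encoded.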
0.1.2 Part B
Problem type: Classification
The response variable is categorical (success or failure), so it’s a classification problem.
Goal: Prediction
The interest is in forecasting success/failure of a new product.
n (number of observations): 20 previous products.
p (number of predictors): 13
Price, marketing budget, competition price, plus 10 additional variables.
0.1.3 Part C
Problem type: Regression
The response variable is % change in the USD/Euro exchange rate (a continuous value).
Goal: Prediction
Since the aim is to predict future exchange rates, it’s a prediction problem.
n (number of observations): 52 weeks in a year.
p (number of predictors): 3
% change in US, British, and German markets.
0.2 Problem 5
Flexible models can capture complex patterns and improve prediction but may overfit and are harder to interpret. Less flexible models are simpler, more interpretable, and better for inference, though they may miss nonlinear relationships. Use flexible models for prediction with large data, and less flexible models when interpretability or inference is the goal.
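The trade-off can be seen in a small simulation. This is only a sketch on made-up data: the sine-shaped signal, the noise level, the sample size, and the polynomial degrees are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated data: a smooth nonlinear signal plus noise (not assignment data)
n <- 60
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.4)
dat <- data.frame(x = x, y = y)
train <- sample(n, 30)  # half the rows for fitting, the rest for testing

# Mean squared error of a fitted model on a chosen set of rows
mse <- function(fit, rows) mean((dat$y[rows] - predict(fit, dat[rows, ]))^2)

# Fit polynomials of increasing flexibility and record both errors
degrees <- c(1, 3, 10)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = dat, subset = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, -train))
}))
print(round(results, 3))
```

Training error can only fall as the degree rises, because each polynomial contains the lower-degree ones as special cases. Test error has no such guarantee, which is where overfitting shows up.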
0.3 Problem 6
A parametric approach assumes a specific form for the model (like linear), making it simpler, faster, and easier to interpret, especially with small datasets. Its main drawback is that it can introduce bias if the assumptions are wrong. Non-parametric approaches are more flexible and can model complex patterns but require more data and are harder to interpret.
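As a rough illustration of the bias a wrong parametric assumption can introduce, the sketch below fits a straight line and a loess smoother to simulated data whose true shape is a sine curve. The data, the seed, and the `span` value are assumptions made for this example.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Simulated data with a curved true relationship (not assignment data)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

# Parametric: assumes y = b0 + b1 * x, so only two numbers are estimated
fit_lm <- lm(y ~ x)

# Non-parametric: loess() makes no global assumption about the shape
fit_lo <- loess(y ~ x, span = 0.3)

# Compare how closely each fit tracks the true curve sin(x)
err_lm <- mean((predict(fit_lm) - sin(x))^2)
err_lo <- mean((predict(fit_lo) - sin(x))^2)
c(linear = err_lm, loess = err_lo)
```

The linear fit is easy to summarize (an intercept and a slope) but cannot follow the curve, while the smoother follows it at the cost of having no compact set of coefficients to interpret.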
0.4 Problem 8
View(college)
summary(college)
Private Apps Accept Enroll
Length:777 Min. : 81 Min. : 72 Min. : 35
Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
Mode :character Median : 1558 Median : 1110 Median : 434
Mean : 3002 Mean : 2019 Mean : 780
3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
Max. :48094 Max. :26330 Max. :6392
Top10perc Top25perc F.Undergrad P.Undergrad
Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
Outstate Room.Board Books Personal
Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
Median : 9990 Median :4200 Median : 500.0 Median :1200
Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
PhD Terminal S.F.Ratio perc.alumni
Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
Expend Grad.Rate
Min. : 3186 Min. : 10.00
1st Qu.: 6751 1st Qu.: 53.00
Median : 8377 Median : 65.00
Mean : 9660 Mean : 65.46
3rd Qu.:10830 3rd Qu.: 78.00
Max. :56233 Max. :118.00
pairs(college[, 2:11])
college <- as.data.frame(college)
college$Private <- as.factor(college$Private)
plot(Outstate ~ Private, data = college,
     main = "Out-of-State Tuition by School Type",
     xlab = "Private School?",
     ylab = "Out-of-State Tuition ($)")
# Flag schools where more than 50% of new students come from the top 10% of their class
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college)
Private Apps Accept Enroll Top10perc
No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
Median : 1558 Median : 1110 Median : 434 Median :23.00
Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
Max. :48094 Max. :26330 Max. :6392 Max. :96.00
Top25perc F.Undergrad P.Undergrad Outstate
Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
Room.Board Books Personal PhD
Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
Median :4200 Median : 500.0 Median :1200 Median : 75.00
Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
Terminal S.F.Ratio perc.alumni Expend
Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
Median : 82.0 Median :13.60 Median :21.00 Median : 8377
Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
Grad.Rate Elite
Min. : 10.00 No :699
1st Qu.: 53.00 Yes: 78
Median : 65.00
Mean : 65.46
3rd Qu.: 78.00
Max. :118.00
# Set up a 2x2 plotting area
par(mfrow = c(2, 2))
hist(college$Apps,
main = "Histogram of Applications",
xlab = "Applications",
col = "lightblue",
breaks = 20)
hist(college$Enroll,
main = "Histogram of Enrollments",
xlab = "Enroll",
col = "lightgreen",
breaks = 15)
hist(college$Outstate,
main = "Histogram of Outstate Tuition",
xlab = "Outstate",
col = "lightpink",
breaks = 25)
hist(college$Room.Board,
main = "Histogram of Room & Board",
xlab = "Room.Board",
col = "lightyellow",
breaks = 10)
par(mfrow = c(1, 1))
0.5 Problem 9
auto <- read.csv("Auto.csv", na.strings = "?")
auto <- na.omit(auto)
sum(is.na(auto)) # Should return 0
[1] 0
str(auto)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
$ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin : int 1 1 1 1 1 1 1 1 1 1 ...
$ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
- attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
names(auto)[sapply(auto, is.numeric)]
[1] "mpg" "cylinders" "displacement" "horsepower" "weight"
[6] "acceleration" "year" "origin"
names(auto)[sapply(auto, function(x) is.factor(x) || is.character(x))]
[1] "name"
auto$origin <- as.factor(auto$origin)
names(auto)[sapply(auto, function(x) is.factor(x) || is.character(x))]
[1] "origin" "name"
0.5.1 Part A
0.5.1.1 Quantitative Predictors
mpg – miles per gallon (target variable)
displacement – engine displacement
horsepower – engine horsepower
weight – vehicle weight
acceleration – acceleration rate
year – model year (numeric)
cylinders – number of cylinders (numeric)
0.5.1.2 Qualitative Predictors
origin – country of origin (e.g., USA, Europe, Japan)
name – car name (a text label)
0.5.2 Part B
range(auto$mpg) # range of mpg
[1] 9.0 46.6
range(auto$displacement) # range of displacement
[1] 68 455
range(auto$horsepower) # range of horsepower
[1] 46 230
range(auto$weight) # range of weight
[1] 1613 5140
range(auto$acceleration) # range of acceleration
[1] 8.0 24.8
range(auto$year) # range of year
[1] 70 82
range(auto$cylinders) # range of cylinders
[1] 3 8
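The separate `range()` calls can also be collapsed into one `sapply()` call. The sketch below uses R's built-in `mtcars` data as a stand-in, only because `Auto.csv` does not ship with R; the column names therefore differ from those in `auto`.

```r
# Stand-in data: built-in mtcars, used only because Auto.csv is not bundled with R.
# For the assignment's data frame the analogous call would be
# sapply(auto[, c("mpg", "displacement", "horsepower")], range)
quant_vars <- c("mpg", "disp", "hp", "wt", "qsec")

# sapply() applies range() to each column and returns a 2-row matrix
ranges <- sapply(mtcars[, quant_vars], range)
rownames(ranges) <- c("min", "max")
ranges
```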
0.5.3 Part C
# Identify numeric variables
numeric_vars <- sapply(auto, is.numeric)
# Calculate mean and standard deviation
mean_values <- sapply(auto[, numeric_vars], mean)
sd_values <- sapply(auto[, numeric_vars], sd)
# Combine into a labeled table
summary_stats <- data.frame(
  Variable = names(mean_values),
  Mean = round(mean_values, 2),
  SD = round(sd_values, 2)
)
# View the result
print(summary_stats)
Variable Mean SD
mpg mpg 23.45 7.81
cylinders cylinders 5.47 1.71
displacement displacement 194.41 104.64
horsepower horsepower 104.47 38.49
weight weight 2977.58 849.40
acceleration acceleration 15.54 2.76
year year 75.98 3.68
0.5.4 Part D
# Step 1: Remove rows 10 to 85
auto_subset <- auto[-(10:85), ]
# Step 2: Identify numeric variables
numeric_vars <- sapply(auto_subset, is.numeric)
# Step 3: Compute mean and standard deviation
means <- sapply(auto_subset[, numeric_vars], mean)
sds <- sapply(auto_subset[, numeric_vars], sd)
# Step 4: Combine into a summary table
summary_subset <- data.frame(
  Variable = names(means),
  Mean = round(means, 2),
  SD = round(sds, 2)
)
# View the result
print(summary_subset)
Variable Mean SD
mpg mpg 24.40 7.87
cylinders cylinders 5.37 1.65
displacement displacement 187.24 99.68
horsepower horsepower 100.72 35.71
weight weight 2935.97 811.30
acceleration acceleration 15.73 2.69
year year 77.15 3.11
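To see what dropping rows 10 to 85 changed, the means from the two printed tables can be placed side by side. The sketch below copies those printed values by hand rather than recomputing them from `Auto.csv`, so it runs without the data file.

```r
# Means copied from the two printed tables (full data vs. rows 10-85 removed)
vars <- c("mpg", "cylinders", "displacement", "horsepower",
          "weight", "acceleration", "year")
full_mean   <- c(23.45, 5.47, 194.41, 104.47, 2977.58, 15.54, 75.98)
subset_mean <- c(24.40, 5.37, 187.24, 100.72, 2935.97, 15.73, 77.15)

comparison <- data.frame(
  Variable = vars,
  Full     = full_mean,
  Subset   = subset_mean,
  Change   = round(subset_mean - full_mean, 2)
)
print(comparison)
```

After the removal, mean mpg and mean year go up while displacement, horsepower, and weight go down, which is consistent with the dropped rows being mostly older, heavier cars.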
0.5.5 Part E
plot(auto$horsepower, auto$mpg,
main = "MPG vs Horsepower",
xlab = "Horsepower",
ylab = "MPG")
plot(auto$weight, auto$mpg,
main = "MPG vs Weight",
xlab = "Weight",
ylab = "MPG")
pairs(auto[, c("mpg", "horsepower", "weight", "acceleration", "displacement")])
auto$origin <- factor(auto$origin,
                      levels = c(1, 2, 3),
                      labels = c("USA", "Europe", "Japan"))
plot(auto$weight, auto$mpg,
col = auto$origin,
pch = 19,
main = "MPG vs Weight Colored by Origin",
xlab = "Weight",
ylab = "MPG")
legend("topright", legend = levels(auto$origin),
col = 1:3, pch = 19)
0.5.5.1 Observations
From the scatterplots, you can see that as horsepower and weight increase, mpg tends to drop—so heavier and more powerful cars use more fuel. The pair plot shows a strong relationship between displacement, horsepower, and weight, which suggests they’re all tracking similar characteristics. The boxplot by cylinder count makes it clear: cars with fewer cylinders usually get better mileage. When the scatterplot is colored by origin, Japanese and European cars stand out as more fuel-efficient compared to American models.
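The observations refer to a boxplot of mpg by cylinder count, but the code shown above does not produce one. A minimal sketch of such a plot follows, using the built-in `mtcars` data as a stand-in because `Auto.csv` does not ship with R.

```r
# Stand-in data: built-in mtcars (its cylinder column is named cyl).
# For the assignment's data frame the analogous call would be
# boxplot(mpg ~ cylinders, data = auto)
bp <- boxplot(mpg ~ cyl, data = mtcars,
              main = "MPG by Number of Cylinders (mtcars stand-in)",
              xlab = "Cylinders",
              ylab = "MPG")

# Row 3 of bp$stats holds the median of each group
setNames(bp$stats[3, ], bp$names)
```

In the stand-in data the median mpg falls as the cylinder count rises, the same pattern the observations describe for the Auto data.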
0.5.6 Part F
Yes, several of the variables look useful for predicting mpg. The scatterplots show strong negative relationships between mpg and both horsepower and weight: as those go up, fuel efficiency drops. The boxplot also shows that cars with more cylinders usually get lower mpg, so cylinders could be a solid categorical predictor. Coloring by origin highlights that Japanese and European cars tend to have better fuel efficiency than American ones, so origin might also help the model. Overall, horsepower, weight, cylinders, and origin all seem like solid predictors.
0.6 Problem 10
library(MASS)
data("Boston")
#?Boston
0.5.7 Part A
How many rows are in this data set? How many columns? What do the rows and columns represent?
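The data set has 506 rows and 14 columns. Each row is one Boston-area suburb (census tract), and each column is a variable recorded for that tract, such as crim, rm, and medv.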
0.5.8 Part B
pairs(Boston[, c("medv", "lstat", "rm", "crim", "nox", "tax")],
main = "Key Predictors vs Median Home Value")
The plots suggest that lstat and rm are particularly strong predictors of medv, while crim, nox, and tax may still be useful but potentially with transformations or in combination with other predictors.
0.5.9 Part C
Some predictors do seem related to crime rate. For example, crime is higher in areas with higher taxes and pollution (tax and nox), and a bit higher where more lower-income residents live (lstat). On the other hand, places with bigger houses (rm) tend to have less crime. These trends suggest that crim is connected to socioeconomic and environmental conditions.
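These pairwise relationships can be checked numerically with correlations. The sketch below assumes the `Boston` data frame from the MASS package, as loaded earlier in this problem, and looks only at the four predictors named in this answer.

```r
library(MASS)  # provides the Boston data frame used throughout this problem

# Pearson correlation of crim with each predictor discussed above
vars <- c("tax", "nox", "lstat", "rm")
crim_cor <- sapply(Boston[, vars], function(v) cor(Boston$crim, v))
round(sort(crim_cor, decreasing = TRUE), 2)
```

Correlation only measures linear association, so it is a rough check on the scatterplots rather than a replacement for them, especially given how skewed crim is.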
0.5.10 Part D
summary(Boston$crim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00632 0.08205 0.25651 3.61352 3.67708 88.97620
Most values are well under 1 (the median is about 0.26), but a few reach as high as 89, so the distribution is strongly right-skewed with clear outliers.
summary(Boston$tax)
Min. 1st Qu. Median Mean 3rd Qu. Max.
187.0 279.0 330.0 408.2 666.0 711.0
The tax variable is the full-value property-tax rate per $10,000. Most tracts have moderate rates, but the upper quarter sits at 666 or above (more than 6%), which stands out as unusually high.
summary(Boston$ptratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.60 17.40 19.05 18.46 20.20 22.00
The range is smaller here, but values above 20 are on the higher end and may signal larger class sizes or underfunded schools.
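One way to put numbers on "unusually high" is to count how many tracts exceed a cut-off and to list the most extreme rows. The cut-offs below are arbitrary illustrative choices, not values from the assignment; the sketch assumes the `Boston` data frame from MASS.

```r
library(MASS)

# Share of tracts above two illustrative cut-offs (chosen arbitrarily here)
mean(Boston$crim > 10)      # proportion of tracts with crime rate above 10
mean(Boston$ptratio > 20)   # proportion with pupil-teacher ratio above 20

# The five tracts with the highest crime rates, with tax and ptratio alongside
top_crim <- Boston[order(Boston$crim, decreasing = TRUE)[1:5],
                   c("crim", "tax", "ptratio")]
top_crim
```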
0.5.11 Part E
table(Boston$chas)
0 1
471 35
35 census tracts in the dataset bound the Charles River.
0.5.12 Part F
median(Boston$ptratio)
[1] 19.05
The median pupil-teacher ratio is 19.05 students per teacher.
0.5.13 Part G
which.min(Boston$medv)
[1] 399
Boston[399, ]
crim zn indus chas nox rm age dis rad tax ptratio black lstat
399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9 30.59
medv
399 5
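The tract with the lowest median home value is row 399, where medv is 5 ($5,000). Its crime rate is high at 38.35, its tax rate (666) is also high, and its pupil-teacher ratio (20.2) is above the median of 19.05.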
0.5.14 Part H
sum(Boston$rm > 7)
[1] 64
sum(Boston$rm > 8)
[1] 13
Boston[Boston$rm > 8, ]
crim zn indus chas nox rm age dis rad tax ptratio black lstat
98 0.12083 0 2.89 0 0.4450 8.069 76.0 3.4952 2 276 18.0 396.90 4.21
164 1.51902 0 19.58 1 0.6050 8.375 93.9 2.1620 5 403 14.7 388.45 3.32
205 0.02009 95 2.68 0 0.4161 8.034 31.9 5.1180 4 224 14.7 390.55 2.88
225 0.31533 0 6.20 0 0.5040 8.266 78.3 2.8944 8 307 17.4 385.05 4.14
226 0.52693 0 6.20 0 0.5040 8.725 83.0 2.8944 8 307 17.4 382.00 4.63
227 0.38214 0 6.20 0 0.5040 8.040 86.5 3.2157 8 307 17.4 387.38 3.13
233 0.57529 0 6.20 0 0.5070 8.337 73.3 3.8384 8 307 17.4 385.91 2.47
234 0.33147 0 6.20 0 0.5070 8.247 70.4 3.6519 8 307 17.4 378.95 3.95
254 0.36894 22 5.86 0 0.4310 8.259 8.4 8.9067 7 330 19.1 396.90 3.54
258 0.61154 20 3.97 0 0.6470 8.704 86.9 1.8010 5 264 13.0 389.70 5.12
263 0.52014 20 3.97 0 0.6470 8.398 91.5 2.2885 5 264 13.0 386.86 5.91
268 0.57834 20 3.97 0 0.5750 8.297 67.0 2.4216 5 264 13.0 384.54 7.44
365 3.47428 0 18.10 1 0.7180 8.780 82.9 1.9047 24 666 20.2 354.55 5.29
medv
98 38.7
164 50.0
205 50.0
225 44.8
226 50.0
227 37.6
233 41.7
234 48.3
254 42.8
258 50.0
263 48.8
268 50.0
365 21.9
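Tracts averaging more than eight rooms per dwelling have no extreme crime rates (all 13 are below the overall mean of 3.61) and mostly moderate tax rates (12 of the 13 are at or below 403). They also tend to have a lower pupil-teacher ratio, typically 13.0 to 17.4 against an overall median of 19.05, which means fewer students per teacher. Median home values are high, between 37.6 and 50.0 for all but one tract.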
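A direct way to check this kind of comparison is to put group means next to the overall means. The sketch below assumes the `Boston` data frame from MASS; the choice of which columns to compare is an illustrative one.

```r
library(MASS)

big  <- Boston[Boston$rm > 8, ]   # tracts averaging more than 8 rooms
vars <- c("crim", "tax", "ptratio", "lstat", "medv")

# Means for the large-home tracts next to means for all tracts.
# crim is heavily right-skewed, so its mean should be read with care.
comparison <- rbind(
  rm_over_8  = sapply(big[, vars], mean),
  all_tracts = sapply(Boston[, vars], mean)
)
round(comparison, 2)
```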