Assignment 1

Author

Jonathan McCanlas

0.1 Problem 2

0.1.1 Part A

Problem type: Regression

The response variable (CEO salary) is continuous, so it’s a regression problem.

Goal: Inference

Since the interest is in understanding which factors affect salary, this suggests inference (examining relationships and significance of predictors).

n (number of observations): 500 firms.

p (number of predictors): 3

Profit

Number of employees

Industry (likely categorical, may be represented by multiple dummy variables in practice, but conceptually one variable).

0.1.2 Part B

Problem type: Classification

The response variable is categorical (success or failure), so it’s a classification problem.

Goal: Prediction

The interest is in forecasting success/failure of a new product.

n (number of observations): 20 previous products.

p (number of predictors): 13

Price, marketing budget, competition price, plus 10 additional variables.

0.1.3 Part C

Problem type: Regression

The response variable is % change in the USD/Euro exchange rate (a continuous value).

Goal: Prediction

Since the aim is to predict future exchange rates, it’s a prediction problem.

n (number of observations): 52 weeks in a year.

p (number of predictors): 3

% change in US, British, and German markets.

0.2 Problem 5

Flexible models can capture complex patterns and improve prediction but may overfit and are harder to interpret. Less flexible models are simpler, more interpretable, and better for inference, though they may miss nonlinear relationships. Use flexible models for prediction with large data, and less flexible models when interpretability or inference is the goal.
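
The trade-off described above can be sketched on simulated data. Everything in this block is an assumption made for illustration (the sine-shaped true relationship, the noise level, and the polynomial degrees); it is not part of the assignment data. Polynomial degree stands in for flexibility, and training error is compared against error on held-out points.

```r
# Simulated data: a sine curve plus noise (illustrative assumption).
set.seed(1)
n <- 200
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.3)
sim <- data.frame(x = x, y = y)

# Split into training and test halves.
train_idx <- sample(n, n / 2)
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Fit polynomials of increasing degree (higher degree = more flexible).
degrees <- c(1, 3, 15)
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d,
    train_mse = mean((train$y - fitted(fit))^2),
    test_mse = mean((test$y - predict(fit, newdata = test))^2))
}))
print(round(mse, 3))
```

Because the three models are nested, training error can only fall as the degree rises. Test error is the quantity to watch: a straight line underfits the curve, while a very high degree may start to chase noise.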

0.3 Problem 6

A parametric approach assumes a specific form for the model (like linear), making it simpler, faster, and easier to interpret, especially with small datasets. Its main drawback is that it can introduce bias if the assumptions are wrong. Non-parametric approaches are more flexible and can model complex patterns but require more data and are harder to interpret.
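
A small sketch of the bias point, under assumed conditions: the true relationship is quadratic (an assumption chosen for illustration), the parametric model wrongly assumes a straight line, and `loess()` from base R serves as one example of a non-parametric fit. Errors are measured against the known true curve on a grid inside the training range.

```r
# Simulated data with a quadratic truth (illustrative assumption).
set.seed(42)
n <- 200
x <- runif(n, -3, 3)
y <- x^2 + rnorm(n, sd = 0.5)
train <- data.frame(x = x, y = y)

fit_lm <- lm(y ~ x, data = train)        # parametric: assumes a straight line
fit_loess <- loess(y ~ x, data = train)  # non-parametric: local regression

# Evaluate both fits against the true curve on a grid inside the data range.
grid <- data.frame(x = seq(-2.5, 2.5, length.out = 101))
truth <- grid$x^2

err_lm <- mean((predict(fit_lm, newdata = grid) - truth)^2)
err_loess <- mean((predict(fit_loess, newdata = grid) - truth)^2)
print(c(linear = err_lm, loess = err_loess))
```

The linear fit has only two coefficients to estimate and is easy to read, but it cannot follow the curve. The local fit follows it without being told its shape, at the cost of needing enough nearby points everywhere and offering no simple coefficients to interpret.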

0.4 Problem 8

library(readr)
college <- read.csv("College.csv", row.names = 1)
View(college)
summary(college)
   Private               Apps           Accept          Enroll    
 Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
 Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
 Mode  :character   Median : 1558   Median : 1110   Median : 434  
                    Mean   : 3002   Mean   : 2019   Mean   : 780  
                    3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
                    Max.   :48094   Max.   :26330   Max.   :6392  
   Top10perc       Top25perc      F.Undergrad     P.Undergrad     
 Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
 1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
 Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
 Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
 3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
 Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
    Outstate       Room.Board       Books           Personal   
 Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
 1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
 Median : 9990   Median :4200   Median : 500.0   Median :1200  
 Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
 3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
 Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
      PhD            Terminal       S.F.Ratio      perc.alumni   
 Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
 1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
 Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
 Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
 3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
 Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
     Expend        Grad.Rate     
 Min.   : 3186   Min.   : 10.00  
 1st Qu.: 6751   1st Qu.: 53.00  
 Median : 8377   Median : 65.00  
 Mean   : 9660   Mean   : 65.46  
 3rd Qu.:10830   3rd Qu.: 78.00  
 Max.   :56233   Max.   :118.00  
pairs(college[, 2:11])

college <- as.data.frame(college)
college$Private <- as.factor(college$Private)
plot(Outstate ~ Private, data = college,
     main = "Out-of-State Tuition by School Type",
     xlab = "Private School?",
     ylab = "Out-of-State Tuition ($)")

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college)
 Private        Apps           Accept          Enroll       Top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   Top25perc      F.Undergrad     P.Undergrad         Outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   Room.Board       Books           Personal         PhD        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    Terminal       S.F.Ratio      perc.alumni        Expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   Grad.Rate      Elite    
 Min.   : 10.00   No :699  
 1st Qu.: 53.00   Yes: 78  
 Median : 65.00            
 Mean   : 65.46            
 3rd Qu.: 78.00            
 Max.   :118.00            
# Set up a 2x2 plotting area
par(mfrow = c(2, 2))
hist(college$Apps,
     main = "Histogram of Applications",
     xlab = "Applications",
     col = "lightblue",
     breaks = 20)

hist(college$Enroll,
     main = "Histogram of Enrollments",
     xlab = "Enroll",
     col = "lightgreen",
     breaks = 15)

hist(college$Outstate,
     main = "Histogram of Outstate Tuition",
     xlab = "Outstate",
     col = "lightpink",
     breaks = 25)

hist(college$Room.Board,
     main = "Histogram of Room & Board",
     xlab = "Room.Board",
     col = "lightyellow",
     breaks = 10)

par(mfrow = c(1, 1))

0.5 Problem 9

auto <- read.csv("Auto.csv", na.strings = "?")
auto <- na.omit(auto)
sum(is.na(auto))   # Should return 0
[1] 0
str(auto)
'data.frame':   392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
 - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
  ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
names(auto)[sapply(auto, is.numeric)]
[1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
[6] "acceleration" "year"         "origin"      
names(auto)[sapply(auto, function(x) is.factor(x) || is.character(x))]
[1] "name"
auto$origin <- as.factor(auto$origin)
names(auto)[sapply(auto, function(x) is.factor(x) || is.character(x))]
[1] "origin" "name"  

0.5.1 Part A

0.5.1.1 Quantitative Predictors

mpg – miles per gallon (target variable)

displacement – engine displacement

horsepower – engine horsepower

weight – vehicle weight

acceleration – acceleration rate

year – model year (numeric)

cylinders – number of cylinders (numeric)

0.5.1.2 Qualitative Predictors

origin – country of origin (e.g., USA, Europe, Japan)

name – car name (a text label)

0.5.2 Part B

range(auto$mpg)           # range of mpg
[1]  9.0 46.6
range(auto$displacement)  # range of displacement
[1]  68 455
range(auto$horsepower)    # range of horsepower
[1]  46 230
range(auto$weight)        # range of weight
[1] 1613 5140
range(auto$acceleration)  # range of acceleration
[1]  8.0 24.8
range(auto$year)          # range of year
[1] 70 82
range(auto$cylinders)     # range of cylinders
[1] 3 8

0.5.3 Part C

# Identify numeric variables
numeric_vars <- sapply(auto, is.numeric)

# Calculate mean and standard deviation
mean_values <- sapply(auto[, numeric_vars], mean)
sd_values <- sapply(auto[, numeric_vars], sd)

# Combine into a labeled table
summary_stats <- data.frame(
  Variable = names(mean_values),
  Mean = round(mean_values, 2),
  SD = round(sd_values, 2)
)

# View the result
print(summary_stats)
                 Variable    Mean     SD
mpg                   mpg   23.45   7.81
cylinders       cylinders    5.47   1.71
displacement displacement  194.41 104.64
horsepower     horsepower  104.47  38.49
weight             weight 2977.58 849.40
acceleration acceleration   15.54   2.76
year                 year   75.98   3.68

0.5.4 Part D

# Step 1: Remove rows 10 to 85
auto_subset <- auto[-(10:85), ]

# Step 2: Identify numeric variables
numeric_vars <- sapply(auto_subset, is.numeric)

# Step 3: Compute mean and standard deviation
means <- sapply(auto_subset[, numeric_vars], mean)
sds <- sapply(auto_subset[, numeric_vars], sd)

# Step 4: Combine into a summary table
summary_subset <- data.frame(
  Variable = names(means),
  Mean = round(means, 2),
  SD = round(sds, 2)
)

# View the result
print(summary_subset)
                 Variable    Mean     SD
mpg                   mpg   24.40   7.87
cylinders       cylinders    5.37   1.65
displacement displacement  187.24  99.68
horsepower     horsepower  100.72  35.71
weight             weight 2935.97 811.30
acceleration acceleration   15.73   2.69
year                 year   77.15   3.11
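
Parts B through D repeat the same steps (select numeric columns, then compute range, mean, and standard deviation). One way to avoid the repetition is a small helper, sketched below. The function name `summarise_numeric` is hypothetical, and the built-in `mtcars` data is used only as a stand-in so the block runs without Auto.csv; with the assignment data the calls would be `summarise_numeric(auto)` and `summarise_numeric(auto[-(10:85), ])`.

```r
# Hypothetical helper: one row per numeric column with min, max, mean and SD.
summarise_numeric <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  data.frame(
    Min  = sapply(num, min),
    Max  = sapply(num, max),
    Mean = round(sapply(num, mean), 2),
    SD   = round(sapply(num, sd), 2)
  )
}

# Stand-in data (assumption): a few numeric columns from the built-in mtcars.
demo <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]

full_stats <- summarise_numeric(demo)
subset_stats <- summarise_numeric(demo[-(10:20), ])  # drop a block of rows
print(full_stats)
print(subset_stats)
```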

0.5.5 Part E

plot(auto$horsepower, auto$mpg,
     main = "MPG vs Horsepower",
     xlab = "Horsepower",
     ylab = "MPG")

plot(auto$weight, auto$mpg,
     main = "MPG vs Weight",
     xlab = "Weight",
     ylab = "MPG")

pairs(auto[, c("mpg", "horsepower", "weight", "acceleration", "displacement")])

auto$origin <- factor(auto$origin,
                      levels = c(1, 2, 3),
                      labels = c("USA", "Europe", "Japan"))

plot(auto$weight, auto$mpg,
     col = auto$origin,
     pch = 19,
     main = "MPG vs Weight Colored by Origin",
     xlab = "Weight",
     ylab = "MPG")

legend("topright", legend = levels(auto$origin), 
       col = 1:3, pch = 19)

0.5.5.1 Observations

The scatterplots show that mpg tends to drop as horsepower and weight increase, so heavier and more powerful cars use more fuel. The pair plot shows a strong positive relationship among displacement, horsepower, and weight, which suggests they all track similar characteristics of the car. A boxplot of mpg by cylinder count shows that cars with fewer cylinders usually get better mileage. When the scatterplot is colored by origin, Japanese and European cars stand out as more fuel-efficient than American models.
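
The observations refer to a boxplot of mpg by cylinder count, which does not appear among the plotting calls shown. A minimal sketch of how such a plot could be produced follows. The built-in `mtcars` data is an assumption used as a stand-in so the block runs without Auto.csv; with the assignment data the call would be `boxplot(mpg ~ factor(cylinders), data = auto)`.

```r
# Stand-in data (assumption): built-in mtcars, with cyl renamed to cylinders.
cars <- data.frame(mpg = mtcars$mpg, cylinders = factor(mtcars$cyl))

# Boxplot of mpg by number of cylinders; boxplot() also returns the plotted statistics.
bp <- boxplot(mpg ~ cylinders, data = cars,
              main = "MPG by Number of Cylinders",
              xlab = "Cylinders", ylab = "MPG")

# Median mpg per cylinder group (third row of bp$stats holds the medians).
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```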

0.5.6 Part F

Yes, several of the variables look useful for predicting mpg. The scatterplots show strong negative relationships between mpg and both horsepower and weight: as those go up, fuel efficiency drops. The boxplot also shows that cars with more cylinders usually get lower mpg, so cylinders could be a solid predictor, whether it is treated as numeric (as in Part A) or as a category. Coloring by origin highlights that Japanese and European cars tend to have better fuel efficiency than American ones, so origin might also help the model. Overall, horsepower, weight, cylinders, and origin all seem like solid predictors.

Problem 10

library(MASS)
data("Boston")
#?Boston

0.5.7 Part A

How many rows are in this data set? How many columns? What do the rows and columns represent?

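The data set has 506 rows and 14 columns. Each row represents one census tract in the Boston area, and each column is a variable recorded for that tract (for example crim, rm, and medv).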

0.5.8 Part B

pairs(Boston[, c("medv", "lstat", "rm", "crim", "nox", "tax")],
      main = "Key Predictors vs Median Home Value")

The plots suggest that lstat and rm are particularly strong predictors of medv, while crim, nox, and tax may still be useful but potentially with transformations or in combination with other predictors.

0.5.9 Part C

Some predictors do seem related to crime rate. For example, crime is higher in areas with higher taxes and pollution (tax and nox), and a bit higher where more lower-income residents live (lstat). On the other hand, places with bigger houses (rm) tend to have less crime. These trends suggest that crim is connected to socioeconomic and environmental conditions.

0.5.10 Part D

summary(Boston$crim)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.00632  0.08205  0.25651  3.61352  3.67708 88.97620 

Half of the tracts have a crime rate below 0.26 and three quarters are below 3.68, but the maximum is about 89, which indicates strong right skew and outliers.
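
The skew can be checked more directly than with `summary()` alone. The sketch below assumes the `Boston` data from the MASS package, as loaded earlier, and looks at the share of tracts below 1, the upper quantiles, and a histogram on the log scale. The cut-off of 1 and the choice of quantiles are illustrative.

```r
library(MASS)  # provides the Boston data set

crim <- Boston$crim

# Share of tracts with a per-capita crime rate below 1.
share_below_1 <- mean(crim < 1)

# Upper quantiles show how far the tail extends beyond the bulk of the data.
upper_q <- quantile(crim, probs = c(0.5, 0.9, 0.95, 0.99))

print(share_below_1)
print(upper_q)

# A log scale spreads out the small values that dominate the raw histogram.
hist(log(crim), breaks = 25,
     main = "Log of Per-Capita Crime Rate", xlab = "log(crim)")
```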

summary(Boston$tax)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  187.0   279.0   330.0   408.2   666.0   711.0 

Most tax rates are moderate, but some areas have rates above 6% (tax is recorded per $10,000 of property value, so a value of 666 corresponds to 6.66%), which stand out as unusually high.

summary(Boston$ptratio)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12.60   17.40   19.05   18.46   20.20   22.00 

The range is smaller here, but values above 20 are on the higher end and may signal larger class sizes or underfunded schools.

0.5.11 Part E

table(Boston$chas)

  0   1 
471  35 

35 census tracts in the dataset bound the Charles River.

0.5.12 Part F

median(Boston$ptratio)
[1] 19.05

The median pupil-teacher ratio is 19.05 students per teacher.

0.5.13 Part G

which.min(Boston$medv)
[1] 399
Boston[399, ]
       crim zn indus chas   nox    rm age    dis rad tax ptratio black lstat
399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.9 30.59
    medv
399    5

The lowest median home value is $5,000 (medv = 5, row 399). This tract has a high crime rate (38.35), a high tax rate (666), and a high pupil-teacher ratio (20.2).
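
Two follow-up checks may be useful here, sketched under the assumption that `Boston` comes from the MASS package as above. First, `which.min()` returns only the first index of the minimum, so a tie would go unnoticed; `which()` with `==` lists every tract at the minimum. Second, "high" can be made concrete by computing, for each variable, the share of all tracts whose value is less than or equal to this tract's value.

```r
library(MASS)  # provides the Boston data set

# which.min() returns only the first index; which() with == returns every tie.
lowest <- which(Boston$medv == min(Boston$medv))
print(Boston[lowest, ])

# Percentile of the first lowest-medv tract on each variable:
# the share of all tracts with a value less than or equal to it.
tract <- Boston[lowest[1], ]
percentile <- sapply(names(Boston), function(v) mean(Boston[[v]] <= tract[[v]]))
print(round(percentile, 2))
```

A percentile close to 1 means the tract sits near the top of the distribution for that variable, and a value close to 0 means it sits near the bottom.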

0.5.14 Part H

sum(Boston$rm > 7)
[1] 64
sum(Boston$rm > 8)
[1] 13
Boston[Boston$rm > 8, ]
       crim zn indus chas    nox    rm  age    dis rad tax ptratio  black lstat
98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0 396.90  4.21
164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7 388.45  3.32
205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7 390.55  2.88
225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4 385.05  4.14
226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4 382.00  4.63
227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4 387.38  3.13
233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4 385.91  2.47
234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4 378.95  3.95
254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1 396.90  3.54
258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0 389.70  5.12
263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0 386.86  5.91
268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0 384.54  7.44
365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2 354.55  5.29
    medv
98  38.7
164 50.0
205 50.0
225 44.8
226 50.0
227 37.6
233 41.7
234 48.3
254 42.8
258 50.0
263 48.8
268 50.0
365 21.9

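Tracts averaging more than 8 rooms per dwelling have low lstat (all below 8) and high median home values (12 of the 13 are at least 37.6, and five are at 50). Their crime rates are all below the overall mean of 3.61, and tax values fall between 224 and 403 except in tract 365 (666). In 11 of the 13 tracts the pupil-teacher ratio is below the overall median of 19.05, so these areas generally have fewer students per teacher, not more.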
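
Reading 13 rows by eye is error-prone, so a side-by-side comparison of medians is a more reliable basis for these statements. The sketch below assumes `Boston` from the MASS package; the choice of variables and of the median as the summary is illustrative.

```r
library(MASS)  # provides the Boston data set

# Tracts averaging more than 8 rooms per dwelling.
big <- Boston[Boston$rm > 8, ]
vars <- c("crim", "tax", "ptratio", "lstat", "medv")

# Median of each variable in the subset next to the median over all tracts.
comparison <- data.frame(
  rm_over_8 = sapply(big[, vars], median),
  all_tracts = sapply(Boston[, vars], median)
)
print(round(comparison, 2))
```

Medians are used instead of means so that a single unusual tract, such as row 365, does not dominate the comparison.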