Question 2

For each of the following scenarios, indicate whether it is a regression or classification problem, whether the goal is inference or prediction, and the values of n and p.

(a) CEO Salary Analysis:

(b) New Product Success Prediction:

(c) USD/Euro Exchange Rate Prediction:

Question 5

Explain the difference between a flexible and inflexible statistical learning approach. In what situations would a flexible approach be preferred over a less flexible approach? In what situations would a less flexible approach be preferred?

A very flexible approach in regression or classification refers to models that can adapt to complex, non-linear relationships in the data.

Question 6

Describe the differences between parametric and non-parametric statistical learning approaches. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

Question 8

a) Load the College dataset

college <- read.csv("data/College.csv")

b) Set the row names to be the college names

rownames(college) <- college[, 1]
View(college)
college <- college [, -1]
View(college)

c) Exploratory Data Analysis

i) Summary of the data set

summary(college)
##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

ii) Create a pairs plot of the first 10 columns

college$Private <- college$Private == "Yes"
pairs(college[, 1:10], cex = 0.2)

iii) Side-by-side boxplots of Outstate vs Private

# Side-by-side boxplots of Outstate vs Private
plot(factor(college$Private), college$Outstate, main="Outstate Tuition by Private/Public Universities",
     xlab="Private", ylab="Outstate Tuition", col=c("lightblue", "pink"))

iv) Boxplot of Outstate vs Elite

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college$Elite)
##  No Yes 
## 699  78
plot(college$Elite, college$Outstate, main="Outstate Tuition by Elite Status",
     xlab="Elite Status", ylab="Outstate Tuition", col=c("lightgreen", "orange"))

v) Histograms of few Numeric Variables varying bins

par(mfrow = c(2, 2))  

hist(college$Apps, breaks = 30, main = "Histogram of Applications", col = "skyblue", xlab = "Applications")
hist(college$Accept, breaks = 25, main = "Histogram of Acceptances", col = "lightgreen", xlab = "Acceptances")
hist(college$Enroll, breaks = 20, main = "Histogram of Enrollments", col = "lightpink", xlab = "Enrollments")
hist(college$Top10perc, breaks = 10, main = "Histogram of Top 10% Students", col = "lightyellow", xlab = "Top 10%")

(vi) Further exploration:

Boxplots for comparison of quantitative variables by Private Status

par(mfrow = c(2, 2))  
boxplot(Outstate ~ Private, data = college, 
        xlab = "Private", ylab = "Top10perc", col = c("lightblue", "lightgreen"))
boxplot(Enroll ~ Private, data = college, 
        xlab = "Private", ylab = "Enrollments", col = c("lightblue", "lightgreen"))
boxplot(Accept ~ Private, data = college, 
        xlab = "Private", ylab = "Acceptances", col = c("lightblue", "lightgreen"))
boxplot(Apps ~ Private, data = college,
        xlab = "Private", ylab = "Applications", col = c("lightblue", "lightgreen"))

# Reset the plotting window
par(mfrow = c(1, 1))

Correlation analysis between key quantitative variables

cor_matrix <- cor(college[, sapply(college, is.numeric)], use = "complete.obs")
heatmap(cor_matrix, col = colorRampPalette(c("blue", "white", "red"))(256), scale = "none", margins = c(5,5),cexRow = 0.8, cexCol = 0.8) 

Question 9

auto_data <- read.table("data/Auto.data", header = TRUE, na.strings = "?")
auto_data <- na.omit(auto_data)

(a) Identify which of the predictors are quantitative, and which are qualitative

sapply(auto_data, class)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    "numeric"    "integer"    "numeric"    "numeric"    "numeric"    "numeric" 
##         year       origin         name 
##    "integer"    "integer"  "character"
numeric <- which(sapply(auto_data, class) == "numeric")
names(numeric)
## [1] "mpg"          "displacement" "horsepower"   "weight"       "acceleration"
auto_data$cylinders = as.factor(auto_data$cylinders)
auto_data$year = as.factor(auto_data$year)
auto_data$origin = as.factor(auto_data$origin)

(b) Range of quantitative predictors:

sapply(auto_data[, numeric], function(auto_data) diff(range(auto_data)))
##          mpg displacement   horsepower       weight acceleration 
##         37.6        387.0        184.0       3527.0         16.8

(c) Mean and standard deviation of quantitative predictors:

Mean

sapply(auto_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], mean)
##          mpg displacement   horsepower       weight acceleration 
##     23.44592    194.41199    104.46939   2977.58418     15.54133

Standard deviation

sapply(auto_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], sd)
##          mpg displacement   horsepower       weight acceleration 
##     7.805007   104.644004    38.491160   849.402560     2.758864

(d) Remove few observations (10:85) and recalculate range , mean and standard deviation:

subset_data <- auto_data[-(10:85), ]  # Remove rows 10 to 85
print("Range")
## [1] "Range"
sapply(subset_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], range)
##       mpg displacement horsepower weight acceleration
## [1,] 11.0           68         46   1649          8.5
## [2,] 46.6          455        230   4997         24.8
print("Mean")
## [1] "Mean"
sapply(subset_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], mean)
##          mpg displacement   horsepower       weight acceleration 
##     24.40443    187.24051    100.72152   2935.97152     15.72690
print("Standard deviation")
## [1] "Standard deviation"
sapply(subset_data[, c("mpg", "displacement", "horsepower", "weight", "acceleration")], sd)
##          mpg displacement   horsepower       weight acceleration 
##     7.867283    99.678367    35.708853   811.300208     2.693721

(e) Exploratory Data Analysis through visualizations:

numeric_vars <- sapply(auto_data, is.numeric)
pairs(auto_data[, numeric_vars], main = "Scatterplot Matrix") 

# Boxplot of mpg by cylinders
boxplot(mpg ~ cylinders, data = auto_data, main = "mpg by Cylinders",xlab = "Cylinders", ylab = "mpg", col = "lightblue")

# Boxplot of mpg by year
boxplot(mpg ~ year, data = auto_data, main = "mpg by Year", xlab = "Year", ylab = "mpg", col = "lightcoral")

# Boxplot of mpg by origin
boxplot(mpg ~ origin, data = auto_data, main = "mpg by Origin", xlab = "Origin", ylab = "mpg", col = c("lightgreen", "lightpink", "lightyellow"))

# Scatter plot of horsepower vs mpg, colored by origin
ggplot(auto_data, aes(x = horsepower, y = mpg, color = factor(origin))) +
  geom_point() +
  labs(title = "Horsepower vs mpg by Origin", 
       x = "Horsepower", y = "mpg")

(f) Predictors of mpg:

Based on the relationships observed in the plots:

- Strong Predictors of mpg: Displacement, Horsepower, Weight, and Cylinders are the most significant predictors of mpg due to their strong correlations (either negative or positive).
- Moderate Predictors of mpg: Acceleration and Year may be moderately useful, with acceleration having a weak positive correlation and year showing a clear trend over time.
- Categorical Variable: Origin is another useful predictor, especially given that cars from different regions (e.g., Japanese cars) exhibit varying fuel efficiencies.

Question 10

(a) Load the data and check dimensions:

data(Boston)   
dim(Boston)
## [1] 506  13

-There are 506 Rows and 13 columns in this dataset

(b) Pairwise scatter plots and correlation analysis:

pairs(Boston) 

(c) Predictors associated with per capita crime rate

# Correlation between crime rate and other predictors
cor(Boston$crim, Boston[, -13])  # Exclude MEDV from correlation calculation
##      crim         zn     indus        chas       nox         rm       age
## [1,]    1 -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343
##             dis       rad       tax   ptratio     lstat
## [1,] -0.3796701 0.6255051 0.5827643 0.2899456 0.4556215

(d) Census tracts with particularly high crime rates, tax rates, pupil-teacher ratios

Boston |>
  pivot_longer(cols = 1:13) |>
  filter(name %in% c("crim", "tax", "ptratio")) |>
  ggplot(aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~name, scales = "free", ncol = 1)

print("Summary statistics of crime rate")
## [1] "Summary statistics of crime rate"
summary(Boston$crim)  # Summary statistics of crime rate
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620
print("Summary statistics of pupil-teacher ratio")
## [1] "Summary statistics of pupil-teacher ratio"
summary(Boston$ptratio)  # Summary statistics of pupil-teacher ratio
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00
print("Summary statistics of property tax")
## [1] "Summary statistics of property tax"
summary(Boston$tax)  # Summary statistics of property tax
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

Key Descriptive Statistics and Observations:

(e) Census tracts bounding the Charles River

print("count of census tracts bound charles river")
## [1] "count of census tracts bound charles river"
sum(Boston$chas)
## [1] 35

(f) Median pupil-teacher ratio

print(" median pupil-teacher ratio")
## [1] " median pupil-teacher ratio"
median(Boston$ptratio)
## [1] 19.05

(g) Census tract with the lowest median home value

min_medv_row <- Boston[which.min(Boston$medv), ]  # Row with the minimum median home value
str(min_medv_row)  # View the row
## 'data.frame':    1 obs. of  13 variables:
##  $ crim   : num 38.4
##  $ zn     : num 0
##  $ indus  : num 18.1
##  $ chas   : int 0
##  $ nox    : num 0.693
##  $ rm     : num 5.45
##  $ age    : num 100
##  $ dis    : num 1.49
##  $ rad    : int 24
##  $ tax    : num 666
##  $ ptratio: num 20.2
##  $ lstat  : num 30.6
##  $ medv   : num 5
summary(Boston)  # summary statistics for the entire dataset
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

Analysis of Census Tract with Lowest Median Home Value:

(h) Census tracts averaging more than seven rooms per dwelling, and more than eight rooms per dwelling

print("count of census tract average more than 7 rooms per dwelling")
## [1] "count of census tract average more than 7 rooms per dwelling"
sum(Boston$rm > 7)
## [1] 64
print("count of census tract average more than 8 rooms per dwelling")
## [1] "count of census tract average more than 8 rooms per dwelling"
sum(Boston$rm > 8)
## [1] 13
large_homes <- Boston[Boston$rm > 8, ]
summary(large_homes)  
##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          lstat           medv     
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :2.47   Min.   :21.9  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:3.32   1st Qu.:41.7  
##  Median : 7.000   Median :307.0   Median :17.40   Median :4.14   Median :48.3  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :4.31   Mean   :44.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :7.44   Max.   :50.0

Analysis of Census Tracts with More Than Eight Rooms per Dwelling: