library(tidyverse)
library(openintro)
This is a regression problem as the response variable is numeric and
continuous. We are interested in inference as we want to understand the
relationship of the firm’s characteristics to the CEO salary. The
\(n\) (number of firms) is 500 and the \(p\) (number of characteristics)
is 3.
This is a classification problem as the response variable is
categorical. We are interested in prediction as we wish to know if the
product will be a success or a failure. The \(n\) (number of products) is 20 and \(p\) (number of characteristics) is 13.
This is a regression problem as the response variable is numeric and
continuous. We are interested in prediction as we wish to predict the %
change in the USD/Euro exchange rate in relation to the weekly changes
in the world stock markets. The \(n\)
(number of weeks) is 52 and \(p\)
(number of predictors: the % changes in the US, British, and German markets) is 3.
The advantages of a very flexible approach are that it may give a better
fit when the true relationship is non-linear and that it decreases bias.
The disadvantages of a very flexible approach are that it requires estimating a greater number of parameters, it is prone to overfitting (follows the noise closely) and it increases the variance.
A more flexible approach would be preferred to a less flexible approach when we are interested in prediction and not the interpretability of the results.
A less flexible approach would be preferred to a more flexible approach when we are interested in inference and the interpretability of the results.
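The trade-off described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated (a sine curve plus noise), polynomial degree stands in for "flexibility", and none of the names or values come from the exercises.

```r
# Simulated data: the true f is non-linear (a sine curve) plus noise.
# All names and values here are illustrative assumptions.
set.seed(1)
n <- 200
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.3)
sim <- data.frame(x = x, y = y)

# Hold out half of the observations as a test set
train_idx <- sample(n, n / 2)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit polynomials of increasing degree (increasing flexibility)
degrees <- 1:10
mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - fitted(fit))^2),
    test  = mean((test$y - predict(fit, newdata = test))^2))
})
colnames(mse) <- paste0("deg", degrees)
round(mse, 3)
```

Training error can only fall as the degree grows, because each model nests the previous one. Test error need not follow, which is the overfitting risk noted above.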
A parametric approach reduces the problem of estimating \(f\) down to one of estimating a set of
parameters because it assumes a form for \(f\).
A non-parametric approach does not assume a particular form of \(f\) and so requires a very large sample to accurately estimate \(f\).
The advantages of a parametric approach to regression or classification are that modeling \(f\) is simplified to estimating a few parameters and that fewer observations are required than for a non-parametric approach.
The disadvantages of a parametric approach to regression or classification are that the estimate of \(f\) may be inaccurate if the assumed form of \(f\) is wrong, and that choosing a more flexible model to compensate can overfit the observations.
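As a rough illustration of the contrast, the sketch below fits a parametric model (`lm`, which assumes a straight line) and a non-parametric one (`loess`, which smooths locally) to the same simulated data. The data, the `span` value and all object names are assumptions made for the example.

```r
# Simulated data with a curved true relationship (illustrative only)
set.seed(2)
x <- seq(0, 10, length.out = 100)
y <- log1p(x) + rnorm(100, sd = 0.2)
toy <- data.frame(x, y)

# Parametric: assumes f(x) = b0 + b1 * x, so only two numbers are estimated
fit_lm <- lm(y ~ x, data = toy)
coef(fit_lm)

# Non-parametric: loess assumes no global form for f and smooths locally
fit_lo <- loess(y ~ x, data = toy, span = 0.5)

# Training mean squared residual of each fit
c(lm = mean(resid(fit_lm)^2), loess = mean(resid(fit_lo)^2))
```

The linear fit is summarised by two coefficients, while the loess fit has no such compact description and depends on having enough nearby observations at every value of `x`.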
• Private : Public/private indicator
• Apps : Number of applications received
• Accept : Number of applicants accepted
• Enroll : Number of new students enrolled
• Top10perc : New students from top 10 % of high school class
• Top25perc : New students from top 25 % of high school class
• F.Undergrad : Number of full-time undergraduates
• P.Undergrad : Number of part-time undergraduates
• Outstate : Out-of-state tuition
• Room.Board : Room and board costs
• Books : Estimated book costs
• Personal : Estimated personal spending
• PhD : Percent of faculty with Ph.D.’s
• Terminal : Percent of faculty with terminal degree
• S.F.Ratio : Student/faculty ratio
• perc.alumni : Percent of alumni who donate
• Expend : Instructional expenditure per student
• Grad.Rate : Graduation rate
library(ISLR)
data(College)
college <- read.csv("College.csv", stringsAsFactors = TRUE)
rownames(college) <- college[, 1]
#View(college)
head(college)
college <- college[, -1]
#View(college)
head(college)
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(college[, 1:10])
plot(college$Private, college$Outstate, xlab = "Private University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = "Elite University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
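The Elite indicator built above can also be created in one step with `ifelse()`. The sketch below checks that both routes give the same factor; the `top10` vector is a hypothetical stand-in for `college$Top10perc`, not real data.

```r
# Hypothetical Top10perc values standing in for college$Top10perc
top10 <- c(23, 16, 60, 89, 50, 51)

# Step-by-step version, as in the code above
elite_a <- rep("No", length(top10))
elite_a[top10 > 50] <- "Yes"
elite_a <- as.factor(elite_a)

# One-step equivalent using ifelse()
elite_b <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

identical(elite_a, elite_b)
table(elite_b)
```

Note that the comparison is strict, so a value of exactly 50 is labelled "No" in both versions.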
par(mfrow = c(2,2))
hist(college$Books, col = 2, xlab = "Books", ylab = "Count")
hist(college$PhD, col = 3, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, col = 4, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, col = 6, xlab = "% alumni", ylab = "Count")
The histogram of the percent of faculty with Ph.D.s is strongly left
skewed. Estimated book costs cluster around $500. The graduation rate
is centred at roughly 65%. Most alumni do not donate to their
universities: the percentage of alumni who donate is roughly 20% on
average.
summary(college$PhD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 62.00 75.00 72.66 85.00 103.00
Upon investigating further, the maximum of PhD is 103, meaning that 103% of faculty would hold Ph.D.s, which is impossible for a percentage.
highphd <- college[college$PhD == 103, ]
print(highphd)
## Private Apps Accept Enroll Top10perc
## Texas A&M University at Galveston No 529 481 243 22
## Top25perc F.Undergrad P.Undergrad Outstate
## Texas A&M University at Galveston 47 1206 134 4860
## Room.Board Books Personal PhD Terminal
## Texas A&M University at Galveston 3122 600 650 103 88
## S.F.Ratio perc.alumni Expend Grad.Rate Elite
## Texas A&M University at Galveston 17.4 16 6415 43 No
Texas A&M University at Galveston seems to be the only university to have 103% of faculty that hold Ph.Ds. It is possible that this was a data entry error.
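A check of this kind can be generalised to any set of percentage columns. The helper below is a hypothetical sketch (the name `flag_bad_percent` and the toy values are assumptions) that returns every row with a value outside 0 to 100.

```r
# Return the rows where any of the named percentage columns falls outside [0, 100]
flag_bad_percent <- function(df, pct_cols) {
  vals <- as.matrix(df[pct_cols])
  bad <- vals < 0 | vals > 100
  df[rowSums(bad, na.rm = TRUE) > 0, , drop = FALSE]
}

# Toy data (hypothetical values) with one impossible PhD entry
toy <- data.frame(PhD = c(70, 103, 85), Terminal = c(78, 88, 90),
                  row.names = c("College A", "College B", "College C"))
flag_bad_percent(toy, c("PhD", "Terminal"))
```

On the data in this section the analogous call would be `flag_bad_percent(college, c("PhD", "Terminal", "Grad.Rate"))`; the summary above also shows a Grad.Rate maximum of 118, so that column may be worth the same scrutiny.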
All predictors, except origin and name, are quantitative.
auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = TRUE)
auto <- na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
sapply(auto[, -c(8, 9)], range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
sapply(auto[, -c(8, 9)], mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
sapply(auto[, -c(8, 9)], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
subset <- auto[-c(10:85), -c(8,9)]
sapply(subset, range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0 3 68 46 1649 8.5 70
## [2,] 46.6 8 455 230 4997 24.8 82
sapply(subset, mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
sapply(subset, sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
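The three separate `sapply()` calls can be folded into one helper so that the full and reduced data are easier to compare. This is a sketch only: `describe` is a hypothetical name, and the built-in `mtcars` data is used as a stand-in for Auto so that the example runs without the CSV file.

```r
# Compute min, max, mean and sd for every column in one call
describe <- function(df) {
  t(sapply(df, function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))))
}

# mtcars (shipped with base R) stands in for the Auto data here
full    <- describe(mtcars[, c("mpg", "hp", "wt")])
reduced <- describe(mtcars[-(10:20), c("mpg", "hp", "wt")])

# Change in each statistic after dropping rows 10 to 20
round(reduced - full, 3)
```

With the Auto data, the equivalent comparison would be `describe(auto[-c(10:85), -c(8, 9)]) - describe(auto[, -c(8, 9)])`.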
mpg tends to be higher for 4-cylinder vehicles than for the others.
Weight, displacement and horsepower are each inversely related to mpg.
There is an overall increase in mpg over the years. Japanese cars have
higher mpg than US or European cars.
auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
pairs(auto)
Cylinders, horsepower, year and origin can be used as predictors of mpg.
Displacement and weight are left out because they are highly correlated
with horsepower and with each other.
auto$horsepower <- as.numeric(auto$horsepower)
cor(auto$weight, auto$horsepower)
## [1] 0.8645377
cor(auto$weight, auto$displacement)
## [1] 0.9329944
cor(auto$displacement, auto$horsepower)
## [1] 0.897257
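The pairwise checks above can also be run in a single call by passing several columns to `cor()`. The sketch below uses the built-in `mtcars` data as a stand-in for Auto, and the 0.8 cutoff is an arbitrary assumption chosen for illustration.

```r
# mtcars stands in for Auto: wt, hp and disp play the roles of
# weight, horsepower and displacement
vars <- c("wt", "hp", "disp")
cor_mat <- cor(mtcars[, vars])
round(cor_mat, 3)

# List the pairs whose absolute correlation exceeds a (hypothetical) 0.8 cutoff
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, "row"]],
           var2 = colnames(cor_mat)[high[, "col"]],
           r    = cor_mat[high])
```

For the data in this section the same pattern would be `cor(auto[, c("weight", "horsepower", "displacement")])`.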
This exercise involves the Boston housing data set.
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
Boston$chas <- as.factor(Boston$chas)
?Boston
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 13
The relationship between crim and nox or rm is hard to discern. High
crime rates are concentrated in tracts with a large proportion of older
homes (high age) and in tracts close to employment centres (low dis).
par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
Most suburbs have very low crime rates (only 18 of the 506 tracts have
crim > 20). There may be a relationship between crim and nox, rm, age,
dis, lstat and medv.
hist(Boston$crim, breaks = 50)
pairs(Boston[Boston$crim < 20, ])
The crime rate varies widely across census tracts, with a small number
of tracts experiencing much higher crime rates than the rest. This could
reflect different socio-economic conditions and levels of urbanization
across different parts of the city.
A total of 132 census tracts share a tax rate of exactly 666, which is unusual. This could point to a data completeness issue (for example, a placeholder for missing or censored values) or to a genuinely shared rate. Further investigation is needed to find the reason behind this pattern.
Similarly, the range in pupil-teacher ratios reflects variations in educational resources and class sizes between different areas of Boston. Larger pupil-teacher ratios may suggest overcrowding or resource limitations in schools within particular census tracts.
hist(Boston$crim, breaks = 50)
nrow(Boston[Boston$crim > 20, ])
## [1] 18
hist(Boston$tax, breaks = 50)
nrow(Boston[Boston$tax == 666, ])
## [1] 132
hist(Boston$ptratio, breaks = 50)
nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
nrow(Boston[Boston$chas == 1, ])
## [1] 35
median(Boston$ptratio)
## [1] 19.05
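Counts like the ones above are easier to interpret next to the share of tracts they represent. Because `sum()` of a logical vector counts the `TRUE` values and `mean()` gives their proportion, both can be reported together. In this sketch `count_and_share` is a hypothetical helper, and `MASS::Boston` is assumed as a stand-in for `ISLR2::Boston` (it carries one extra column, `black`, but the columns used here are assumed to match).

```r
# MASS::Boston is assumed here as a stand-in for ISLR2::Boston
boston <- MASS::Boston

# Count of TRUE values and their share of all observations
count_and_share <- function(cond) {
  c(count = sum(cond), share = round(mean(cond), 3))
}

rbind(
  crim_gt_20    = count_and_share(boston$crim > 20),
  tax_eq_666    = count_and_share(boston$tax == 666),
  ptratio_gt_20 = count_and_share(boston$ptratio > 20)
)
```

Using `sum(cond)` also avoids building a subsetted data frame just to count its rows.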
Census tracts 399 and 406 have the lowest median value of owner-occupied
homes. Both have high crime rates (crim), well above average. Neither
has land zoned for large residential lots (zn = 0), but both have a high
industrial presence (indus is above average). These tracts are not
located near the Charles River (chas = 0), have high air pollution
(nox is above average, most likely due to industrial activity), have
smaller homes than average (rm is less than 6.28), and consist entirely
of older construction (age = 100). They are very close to employment
hubs (dis is 1.48 for 399 and 1.42 for 406), have maximum highway
accessibility (rad = 24), high tax rates (tax = 666), and
higher-than-average pupil-teacher ratios (ptratio = 20.2, close to the
upper end of the range), which may indicate limited educational
resources. Tract 399 has a high percentage of low-income residents
(lstat = 30.59, which is in the upper range); tract 406 also has a high
percentage, though lower than tract 399.
lowest <- min(Boston$medv)
min_row <- which(Boston$medv == lowest)
low_tract <- Boston[min_row,]
low_tract
summary(Boston)
## crim zn indus chas nox
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471 Min. :0.3850
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1: 35 1st Qu.:0.4490
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.5380
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.5547
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :0.8710
## rm age dis rad
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
## tax ptratio lstat medv
## Min. :187.0 Min. :12.60 Min. : 1.73 Min. : 5.00
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95 1st Qu.:17.02
## Median :330.0 Median :19.05 Median :11.36 Median :21.20
## Mean :408.2 Mean :18.46 Mean :12.65 Mean :22.53
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :711.0 Max. :22.00 Max. :37.97 Max. :50.00
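Judging whether a tract's value is "high" or "low" against the summary table can be made more systematic by computing, for each variable, the share of all tracts at or below that tract's value. The sketch below does this for the lowest-medv tracts. It assumes `MASS::Boston` as a stand-in for `ISLR2::Boston`, and the object names are illustrative.

```r
# MASS::Boston is assumed here as a stand-in for ISLR2::Boston
boston <- MASS::Boston

# Rows with the lowest median home value
low_rows <- which(boston$medv == min(boston$medv))

# chas is a 0/1 indicator, so it is left out of the percentile comparison
num_cols <- setdiff(names(boston), "chas")

# For each tract and variable: share of all tracts with a value <= the tract's value
pct_rank <- t(sapply(low_rows, function(i) {
  sapply(num_cols, function(v) mean(boston[[v]] <= boston[i, v]))
}))
rownames(pct_rank) <- low_rows
round(pct_rank, 2)
```

Values near 1 mean the tract sits at the top of the distribution for that variable, and values near 0 mean it sits at the bottom.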
64 suburbs average more than seven rooms per dwelling. 13 suburbs
average more than eight rooms per dwelling. Tracts 98, 164, 205, 225,
226, 227, 233, 234, 254, 258, 263, 268, and 365 have more than eight
rooms per dwelling. Most of these census tracts with more than 8 rooms
per dwelling appear to be in wealthier, suburban areas with lower crime
rates, lower pollution levels, and more residential zoning. A few of
these tracts, like 365, have higher crime rates and are closer to
industrial or commercial areas, suggesting that some larger homes are
still located in more mixed-use or even urban settings. The relatively
low pupil-teacher ratios and the low percentages of lower-status
populations further suggest that these areas are more affluent with
better educational resources.
nrow(Boston[Boston$rm > 7, ])
## [1] 64
nrow(Boston[Boston$rm > 8, ])
## [1] 13
Boston[Boston$rm > 8, ]
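The comparison between the large-home tracts and the rest can be made explicit by putting their medians side by side. This is a sketch under the assumption that `MASS::Boston` is an acceptable stand-in for `ISLR2::Boston`; the choice of variables to compare is also an assumption.

```r
# MASS::Boston is assumed here as a stand-in for ISLR2::Boston
boston <- MASS::Boston
big <- boston$rm > 8
vars <- c("crim", "nox", "zn", "ptratio", "lstat", "medv")

# Medians for the large-home tracts next to medians for all tracts
rbind(rm_gt_8 = sapply(boston[big, vars], median),
      all     = sapply(boston[, vars], median))

# Tract (row) indices with more than eight rooms per dwelling
which(big)
```

Medians are used rather than means because a single unusual tract can pull the mean of such a small group noticeably.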