(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
Response: CEO Salary - this is a regression problem.
Since we want to understand which factors affect CEO Salary, we are most interested in inference.
n = 500, p = 3 (profit, number of employees, and industry)
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
Response: result (whether a product is a success or failure) - this is a classification problem.
Since we want to know whether the new product will be a success or a failure, we are most interested in prediction.
n = 20, p = 13 (price charged, marketing budget, competition price, and 10 other variables)
(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
Response: % change in the USD/Euro exchange rate - this is a regression problem.
Since the stated goal is to predict the % change in the USD/Euro exchange rate, we are most interested in prediction.
n = 52, p = 3 (% change in the US market, % change in the British market, and % change in the German market)
This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
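The code that produced the summary above is not shown. A minimal sketch of how such a summary could be obtained follows. It assumes the first CSV column holds the college names; the tiny temporary file and its values are hypothetical stand-ins for College.csv. Converting Private to a factor is an extra step not reflected in the output above, where Private is still a character column.

```r
# Hypothetical three-row stand-in for College.csv, written to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",Private,Apps,Accept",
  "College A,Yes,1660,1232",
  "College B,No,2186,1924",
  "College C,Yes,1428,1097"
), csv_path)

# The first column holds the college names, so use it as row names
college <- read.csv(csv_path, row.names = 1)

# Recent R versions read text columns as character; a factor is
# needed later for side-by-side boxplots
college$Private <- as.factor(college$Private)

summary(college)
```

With the real file, only the path passed to read.csv() would change.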
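Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
# Exercise 8c-3
# Private is stored as character (see the summary above), so convert it to a factor to get boxplots
plot(as.factor(college$Private), college$Outstate, xlim = c(0, 3), col = c('lightsteelblue', 'lightgrey'), xlab = 'Private', ylab = 'Outstate')
Create a new qualitative variable, called Elite, by binning the Top10perc variable. Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
# Exercise 8c-4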
Elite = rep("No",nrow(college))
Elite[college$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)
summary(Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = 'Elite', ylab = 'Outstate', xlim = c(0, 3), col = c('lightsteelblue', 'lightgrey'))
# Exercise 8c-5
par(mfrow=c(2,2))
hist(college$Accept, main="Number of Applications Accepted", col="lightsteelblue", xlab = 'Accepted')
hist(college$Enroll, main="Number of New Students Enrolled", col="lightgrey", xlab = 'Enroll')
hist(college$PhD, main="Percent of Faculty with a PhD", col="lightsteelblue", xlab = 'PhD' )
hist(college$Grad.Rate, main="Graduation Rate", col="lightgrey", xlab = 'Grad.Rate')
# Exercise 8c-6
plot(college$Accept, college$Enroll,
xlab = 'Number of Applicants Accepted',
ylab = 'Number of New Students Enrolled',
col = 'steelblue')
plot(college$PhD, college$Grad.Rate,
xlab = 'Percent of Faculty with a PhD',
ylab = 'Graduation Rate',
col = 'black')
The histograms and scatterplots above cover Accept, Enroll, PhD, and Grad.Rate. Based on these visuals, there appears to be a positive correlation between the number of applications accepted and the number of new students enrolled. Additionally, there appears to be a positive correlation between the percent of faculty with a PhD and the graduation rate of students.
## [1] "Amherst College" "Cazenovia College"
## [3] "College of Mount St. Joseph" "Grove City College"
## [5] "Harvard University" "Harvey Mudd College"
## [7] "Lindenwood College" "Missouri Southern State College"
## [9] "Santa Clara University" "Siena College"
## [11] "University of Richmond"
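The correlations read off the scatterplots above can also be checked numerically with cor(). The sketch below uses simulated stand-in vectors (hypothetical values, not the College data) purely to show the call. With the real data frame the arguments would be college$Accept and college$Enroll, or college$PhD and college$Grad.Rate.

```r
set.seed(1)

# Simulated stand-ins: enrollment rises with acceptances, plus noise
accept <- runif(50, min = 100, max = 5000)
enroll <- 0.4 * accept + rnorm(50, sd = 200)

# Pearson correlation coefficient, between -1 and 1
r <- cor(accept, enroll)
r
```

A value near 1 would support the strong positive relationship suggested by the scatterplot, while a value near 0 would indicate little linear association.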
This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
# Exercise 9a
auto = read.csv('Auto.csv', header = TRUE, na.strings = "?")
auto = na.omit(auto)
names(auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year, and origin.
Qualitative predictor: name
## mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 9.0 3 68 46 1613 8.0 70 1
## [2,] 46.6 8 455 230 5140 24.8 82 3
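The code that defines quant_pred and produces the range table above is not shown. One way it could be done is sketched below on a small hypothetical data frame (auto_toy and its values are made up for illustration). The sketch assumes quant_pred is simply the set of numeric columns, which in the real data would also pick up the numerically coded origin column.

```r
# Hypothetical miniature of the Auto data
auto_toy <- data.frame(
  mpg = c(18, 15, 36.1),
  cylinders = c(8, 8, 4),
  name = c("car a", "car b", "car c")
)

# Names of the numeric columns, treated here as the quantitative predictors
quant_pred <- names(auto_toy)[sapply(auto_toy, is.numeric)]

# range() returns c(min, max); sapply() stacks these into a 2-row matrix
ranges <- sapply(auto_toy[, quant_pred], range)
ranges
```

Row 1 of the result holds the minima and row 2 the maxima, which matches the unlabeled [1,] and [2,] rows in the output above.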
# Exercise 9c
# Round the mean and standard deviation to one significant digit
stats_pred = sapply(auto[, quant_pred], function(x) signif(c(mean(x), sd(x)), 1))
rownames(stats_pred) <- c("mean", "sd")
stats_pred
## mpg cylinders displacement horsepower weight acceleration year origin
## mean 20 5 200 100 3000 20 80 2.0
## sd 8 2 100 40 800 3 4 0.8
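The table above is rounded to one significant digit with signif(), while the subset table that follows uses round() with zero decimal places. The difference matters for large values, as this small sketch with illustrative numbers shows.

```r
x <- c(194.412, 104.469, 2977.584)

# One significant digit: only the leading digit is kept
one_sig <- signif(x, 1)   # 200 100 3000

# Zero decimal places: rounds to the nearest whole number
whole <- round(x, 0)      # 194 104 2978
```

signif() is convenient for a coarse overview, but round() preserves more detail when the values are in the hundreds or thousands.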
observation_subset = sapply(auto[-10:-85, quant_pred], function(x) round(c(range(x), mean(x), sd(x)), 0))
rownames(observation_subset) <- c("min", "max", "mean", "sd")
observation_subset
## mpg cylinders displacement horsepower weight acceleration year origin
## min 11 3 68 46 1649 8 70 1
## max 47 8 455 230 4997 25 82 3
## mean 24 5 187 101 2936 16 77 2
## sd 8 2 100 36 811 3 3 1
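The row index -10:-85 used above drops the 10th through 85th observations. A short sketch on a hypothetical 100-row data frame shows that it is the same as -(10:85) and which rows survive.

```r
# Hypothetical data frame with 100 numbered rows
df <- data.frame(id = 1:100)

# Negative indices remove rows; drop = FALSE keeps the result a data frame
kept <- df[-10:-85, , drop = FALSE]

same_index <- identical(-10:-85, -(10:85))  # TRUE
nrow(kept)                                  # 24 rows remain
```

The remaining rows are 1 to 9 and 86 to 100, so 76 observations are excluded from the statistics.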
The scatterplots compile the pairwise relationships among the variables in the Auto dataset. From this compilation, we can see correlations between variables like weight and horsepower.
weight and horsepower: the third scatterplot showcases a strong positive correlation between the two variables. As horsepower increases, the weight of the vehicle also increases. Heavier or larger vehicles, like trucks, therefore tend to have more powerful engines.
acceleration and mpg: the second scatterplot showcases a moderate positive correlation between the two variables. As acceleration increases, the mpg of the vehicle also increases. Note that acceleration in this data set is the time in seconds to go from 0 to 60 mph, so a higher value means a slower car. Vehicles that take longer to reach 60 mph typically also have a higher miles per gallon.
# Exercise 9f
plot(auto$weight, auto$mpg,
xlab = 'Weight',
ylab = 'Miles per Gallon',
col = 'black')
weight and mpg: the scatterplot above showcases a negative correlation between the two variables. As weight increases, the mpg of the vehicle decreases, so a vehicle that weighs more typically has a lower miles per gallon. This suggests that weight is a useful predictor of mpg.
This exercise involves the Boston housing data set.
To begin, load in the Boston data set. The Boston data set is part of the MASS library in R. How many rows are in this data set? How many columns? What do the rows and columns represent?
## [1] 506 14
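The call that produced the dimensions above is not shown. A minimal sketch, assuming the data is taken from the MASS package as the exercise describes:

```r
library(MASS)  # Boston ships with the MASS package

# dim() returns c(number of rows, number of columns)
boston_dim <- dim(Boston)
boston_dim

# Column names, one per recorded variable
names(Boston)
```

Each of the 506 rows is one Boston suburb, and each of the 14 columns is a variable recorded for that suburb, such as crim, rm, or medv.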
par(mfrow = c(2, 2))
plot(Boston$crim, Boston$medv,
xlab = 'Per Capita Crime Rate',
ylab = 'medv',
col = 'steelblue')
plot(Boston$rm, Boston$medv,
xlab = 'Average number of rooms per dwelling',
ylab = 'medv',
col = 'steelblue')
plot(Boston$lstat, Boston$medv,
xlab = 'Lower status of the population (percent)',
ylab = 'medv',
col = 'black')
plot(Boston$ptratio, Boston$medv,
xlab = 'Pupil-teacher ratio by town',
ylab = 'medv',
col = 'black')
medv and crim: the scatterplot showcases a negative correlation between the two variables. As crim increases, the medv decreases. Therefore, as the Per Capita Crime Rate rises, the median value of owner-occupied homes drops. This makes sense, as lower demand for homes in more dangerous areas pushes home prices down.
medv and rm: the scatterplot showcases a positive correlation between the two variables. As rm increases, the medv also increases. Therefore, as the average number of rooms per dwelling grows, the median value of owner-occupied homes rises. This makes sense, as homes with more space are valued higher than homes with less space.
medv and lstat: the scatterplot showcases a negative correlation between the two variables. As lstat increases, the medv decreases. Therefore, as the lower status of the population (percent) increases, the median value of owner-occupied homes drops.
# Exercise 10c
par(mfrow = c(2, 2))
plot(Boston$crim ~ Boston$zn,
log = 'xy',
col = 'steelblue')
plot(Boston$crim ~ Boston$age,
log = 'xy',
col = 'steelblue')
plot(Boston$crim ~ Boston$dis,
log = 'xy',
col = 'black')
plot(Boston$crim ~ Boston$lstat,
log = 'xy',
col = 'black')
The following predictors appear to be associated with the Per Capita Crime Rate, crim.
age: As the proportion of owner-occupied units built prior to 1940 increases, the Per Capita Crime Rate increases.
dis: As the weighted mean of distances to five Boston employment centres increases, the Per Capita Crime Rate decreases.
lstat: As the lower status of the population (percent) increases, the Per Capita Crime Rate increases.
# Exercise 10d
hist(Boston$crim, breaks=25, col = "steelblue", main = "Histogram of Per Capita Crime Rate")
hist(Boston$tax, breaks=25, col = "black", main = "Histogram of Full-value Property-tax Rate per $10,000")
hist(Boston$ptratio, breaks=25, col = "darkgrey", main = "Histogram of Pupil-teacher Ratio by Town")
## [1] 35
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9 30.59
## medv
## 399 5
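The calls behind the three outputs above are not shown. The sketch below shows one way they could be reproduced, assuming that the 35 is the count of suburbs bounding the Charles River, that the six-number summary is for ptratio, and that the printed row is the suburb with the lowest medv.

```r
library(MASS)

# chas is 1 when the suburb bounds the Charles River
n_charles <- sum(Boston$chas == 1)

# Six-number summary of the pupil-teacher ratio
summary(Boston$ptratio)

# Row index of the suburb with the lowest median home value;
# which.min() returns the first index if several suburbs tie
lowest <- which.min(Boston$medv)
Boston[lowest, ]
```

Because which.min() reports only the first minimum, which(Boston$medv == min(Boston$medv)) would be the call to use if all tied suburbs are wanted.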
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
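The summary above gives the overall range of each variable. One way to place the lowest-medv suburb within those ranges, rather than comparing by eye, is sketched below. The names pct_rank and above_mean are introduced here for illustration only.

```r
library(MASS)

lowest <- which.min(Boston$medv)

# Share of suburbs whose value is at or below this suburb's value,
# computed column by column (1 means it sits at the maximum)
pct_rank <- sapply(Boston, function(col) mean(col <= col[lowest]))
round(pct_rank, 2)

# TRUE where the suburb's value exceeds the column mean
above_mean <- unlist(Boston[lowest, ]) > colMeans(Boston)
above_mean
```

Percentile ranks are more informative than a comparison with the mean for skewed variables such as crim and zn, where the mean sits far from the median.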
crim: Suburb #399 (38.3518) - This suburb has a Per Capita Crime Rate that is approximately 10 times the average of Boston suburbs.
zn: Suburb #399 (0) - This suburb is below the average of Boston suburbs in the proportion of residential land zoned for lots over 25,000 sq.ft.
indus: Suburb #399 (18.1) - This suburb is above the average of Boston suburbs in the proportion of non-retail business acres per town.
chas: Suburb #399 (0) - This suburb does not bound the Charles River, which is in line with most Boston suburbs.
nox: Suburb #399 (0.693) - This suburb is above the average of Boston suburbs in the nitrogen oxides concentration (parts per 10 million).
rm: Suburb #399 (5.453) - This suburb is below the average of Boston suburbs in the average number of rooms per dwelling.
age: Suburb #399 (100) - This suburb is above the average of Boston suburbs in the proportion of owner-occupied units built prior to 1940.
dis: Suburb #399 (1.4896) - This suburb is below the average of Boston suburbs in the weighted mean of distances to five Boston employment centres.
rad: Suburb #399 (24) - This suburb is above the average of Boston suburbs in the index of accessibility to radial highways.
tax: Suburb #399 (666) - This suburb is above the average of Boston suburbs in the full-value property-tax rate per $10,000.
ptratio: Suburb #399 (20.2) - This suburb is above the average of Boston suburbs in the pupil-teacher ratio by town.
black: Suburb #399 (396.9) - This suburb is above the average of Boston suburbs in the black index, defined as 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town.
lstat: Suburb #399 (30.59) - This suburb is above the average of Boston suburbs in the lower status of the population (percent).
medv: Suburb #399 (5) - This suburb is below the average of Boston suburbs in the median value of owner-occupied homes in $1000s.
## [1] 64
## [1] 13
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio black
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :354.6
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:384.5
## Median : 7.000 Median :307.0 Median :17.40 Median :386.9
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :385.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:389.7
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :396.9
## lstat medv
## Min. :2.47 Min. :21.9
## 1st Qu.:3.32 1st Qu.:41.7
## Median :4.14 Median :48.3
## Mean :4.31 Mean :44.2
## 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :7.44 Max. :50.0
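The two counts and the summary above are shown without the code that produced them. A minimal sketch follows, assuming the counts are the numbers of suburbs averaging more than seven and more than eight rooms per dwelling, and that the summary covers the latter group.

```r
library(MASS)

# A logical comparison sums to the number of TRUE values
more_than_7 <- sum(Boston$rm > 7)
more_than_8 <- sum(Boston$rm > 8)

# Keep only the suburbs averaging more than eight rooms per dwelling
large_homes <- Boston[Boston$rm > 8, ]
summary(large_homes)
```

Comparing this summary with the full-data summary shows that these suburbs have much lower lstat and much higher medv than Boston suburbs overall.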