In collaboration with Bryce O’Connor and Dakota Barksdale
This exercise involves the Auto data set that we studied during lab. Make sure that the missing values have been removed from the data.
A. Which of the predictors are quantitative, and which are qualitative?
Auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data",
header=TRUE,
na.strings = "?",
stringsAsFactors = TRUE) # read name as a factor (no longer the default in R >= 4.0)
str(Auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Auto <- na.omit(Auto) # drop the rows that contain missing values
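The qualitative predictors are origin and name; all other variables are quantitative. Although origin is stored as an integer, it is a code for the car's region of origin rather than a measured quantity.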
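One way to check a classification like this is to inspect the column classes after converting integer-coded categories to factors. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the Auto data) and assumes the integer codes in `origin` represent categories.

```r
# Hypothetical toy data standing in for a few Auto-like columns
toy <- data.frame(
  mpg    = c(18, 15, 27.5),
  origin = c(1L, 1L, 3L),                 # integer codes for a category
  name   = c("car a", "car b", "car c"),
  stringsAsFactors = TRUE
)

# Treat the integer-coded category as qualitative
toy$origin <- factor(toy$origin)

# TRUE for quantitative columns, FALSE for qualitative ones
is_quantitative <- sapply(toy, is.numeric)
is_quantitative
```

Converting the coded column first matters: without it, `is.numeric()` would report `origin` as quantitative.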
B. What is the range of each quantitative predictor?
apply(Auto[,1:6], 2, range)
## mpg cylinders displacement horsepower weight acceleration
## [1,] 9.0 3 68 46 1613 8.0
## [2,] 46.6 8 455 230 5140 24.8
C. What is the mean and standard deviation of each quantitative predictor?
#mean
options(width = 95)
apply(Auto[,1:6], 2, mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
#SD
options(width = 95)
apply(Auto[,1:6], 2, sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
D. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
apply(Auto[-c(10:85),1:6], 2, range)
## mpg cylinders displacement horsepower weight acceleration
## [1,] 11.0 3 68 46 1649 8.5
## [2,] 46.6 8 455 230 4997 24.8
options(width = 95)
apply(Auto[-c(10:85),1:6], 2, mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
apply(Auto[-c(10:85),1:6], 2, sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
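The three `apply()` calls above could also be collected into a single table. This is a minimal sketch on made-up data (`toy` and `summarise_columns` are hypothetical names, not part of the assignment), using the same negative-index row removal as Part D.

```r
# Hypothetical toy data (not the Auto data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(10, 20, 30, 40, 50))

# Drop rows 2 through 4 with negative indexing, as Part D does for rows 10 to 85
toy_subset <- toy[-c(2:4), ]

# Return one row per statistic and one column per predictor
summarise_columns <- function(df) {
  sapply(df, function(col) {
    c(min = min(col), max = max(col), mean = mean(col), sd = sd(col))
  })
}

stats <- summarise_columns(toy_subset)
stats
```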
E. Using the full data set, investigate the predictors graphically, using scatterplots and other tools of your choice. Create some plots (at least 3) highlighting the relationships among the predictors. Comment on your findings.
pairs(Auto)
plot(Auto$acceleration, Auto$weight)
There appears to be a weak negative correlation: heavier cars tend to have lower acceleration values.
plot(Auto$horsepower, Auto$mpg)
There appears to be a negative correlation: cars with higher horsepower tend to have lower mpg.
plot(Auto$year, Auto$mpg)
There appears to be a positive correlation: mpg tends to increase with model year.
F. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
Yes. The plots show that several variables are related to mpg, some positively and some negatively. For example, mpg and horsepower have a negative relationship (as horsepower increases, mpg decreases), while year and mpg have a positive relationship (as year increases, so does mpg). Both variables would therefore be useful for predicting mpg.
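Visual impressions like these can be backed up with correlation coefficients. The sketch below uses R's built-in `mtcars` data as a stand-in so that it runs without a download; this is an assumption made for illustration, and the values it produces are not those of the Auto data.

```r
# mtcars ships with R; it stands in for the Auto data here
predictors <- c("hp", "wt", "disp")

# Correlation of each candidate predictor with mpg
cors <- sapply(mtcars[, predictors], function(col) cor(mtcars$mpg, col))

# List predictors from strongest to weakest linear relationship
cors[order(abs(cors), decreasing = TRUE)]
```

The same pattern should carry over to the Auto data by swapping in its data frame and column names.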
A. Construct a matrix, where rows represent each movie. Name this matrix starWars and output it.
# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
# Construct matrix
starWars <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
starWars
## [,1] [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8
B. Rename the rows and columns of the matrix you created in Part A with the vector region for columns and the vector titles for rows. Then print the matrix.
# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
# Name the columns with region
colnames(starWars) <- region
# Name the rows with titles
rownames(starWars) <- titles
# Print out starWars
starWars
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
C. Calculate the worldwide box office figures for each movie using the rowSums() function. Name and output this vector.
# Calculate worldwide box office figures
worldwide_vector <- rowSums(starWars)
worldwide_vector
## A New Hope The Empire Strikes Back Return of the Jedi
## 775.398 538.375 475.106
D. Now we want to add a column to our matrix for worldwide sales. You can do this by using the cbind() function. This function binds columns together.
# Bind the new variable worldwide_vector as a column to starWars
all_wars_matrix <- cbind(starWars, worldwide_vector)
all_wars_matrix
## US non-US worldwide_vector
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
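In the output above, the new column takes the name of the variable passed to `cbind()`. A minimal sketch of giving it a shorter label instead, where `worldwide` and `all_wars_named` are illustrative names not used elsewhere in this exercise:

```r
# Rebuild the original trilogy matrix with row and column names
starWars <- matrix(c(460.998, 314.4,
                     290.475, 247.9,
                     309.306, 165.8),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("A New Hope",
                                     "The Empire Strikes Back",
                                     "Return of the Jedi"),
                                   c("US", "non-US")))

# A named argument to cbind() becomes the column name
all_wars_named <- cbind(starWars, worldwide = rowSums(starWars))
colnames(all_wars_named)
```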
E. Create another matrix for the prequels and name it starWars2. Don’t forget to name the rows and the columns (similar to above).
# Prequels
phantom_menace<-c(474.5, 552.5)
attack_clones<-c(310.7, 338.7)
revenge_sith<-c(380.3, 468.5)
starWars2<-matrix(c(phantom_menace, attack_clones, revenge_sith), nrow = 3, byrow = TRUE)
# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith")
# Name the columns with region
colnames(starWars2) <- region
# Name the rows with titles
rownames(starWars2) <- titles
starWars2
## US non-US
## The Phantom Menace 474.5 552.5
## Attack of the Clones 310.7 338.7
## Revenge of the Sith 380.3 468.5
F. Make one big matrix that combines all the movies (from starWars and starWars2) using rbind(). This function binds rows, or in this case can be used to combine two matrices. Name this new matrix allStarWars.
# Combine both Star Wars trilogies in one matrix
allStarWars <- rbind(starWars, starWars2)
allStarWars
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## The Phantom Menace 474.500 552.5
## Attack of the Clones 310.700 338.7
## Revenge of the Sith 380.300 468.5
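`rbind()` stacks matrices that have the same number of columns, and it takes the column names from its first argument. A small sketch, assuming one-row matrices with hypothetical names (`original`, `prequel`) and a guard that the column labels agree before stacking:

```r
original <- matrix(c(460.998, 314.4), nrow = 1,
                   dimnames = list("A New Hope", c("US", "non-US")))
prequel <- matrix(c(474.5, 552.5), nrow = 1,
                  dimnames = list("The Phantom Menace", c("US", "non-US")))

# Stop early if the two matrices do not describe the same regions
stopifnot(identical(colnames(original), colnames(prequel)))

# Stack the rows; row names are carried over from both matrices
combined <- rbind(original, prequel)
dim(combined)
```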
G. Find the total non-US revenue for all the movies using the colSums() function.
# Total non-US revenue for all movies
totals <- colSums(allStarWars)
totals[2]
## non-US
## 2087.8
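Selecting the total by position (`totals[2]`) depends on the column order. A sketch of selecting by name instead, rebuilding the combined matrix from the figures shown above so the example is self-contained:

```r
# US and non-US revenue for all six movies, in millions
us     <- c(460.998, 290.475, 309.306, 474.5, 310.7, 380.3)
non_us <- c(314.4, 247.9, 165.8, 552.5, 338.7, 468.5)
allStarWars <- cbind(US = us, "non-US" = non_us)

# Index the column totals by name rather than by position
totals <- colSums(allStarWars)
totals["non-US"]

# Equivalent: sum only the column of interest
sum(allStarWars[, "non-US"])
```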
This is the setup code for this question
college<-read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/College.csv",header=TRUE,stringsAsFactors=TRUE) # read Private as a factor (no longer the default in R >= 4.0)
rownames(college) <- college[,1]
#View(college)
college <- college[,-1]
#View(college)
# View(college) is commented out so that the data set loads without displaying all of its rows in this markdown document
summary(college)
## Private Apps Accept Enroll Top10perc Top25perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00 Min. : 9.0
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0
## Median : 1558 Median : 1110 Median : 434 Median :23.00 Median : 54.0
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56 Mean : 55.8
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00 Max. :100.0
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Min. : 139 Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0
## 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0
## Median : 1707 Median : 353.0 Median : 9990 Median :4200 Median : 500.0
## Mean : 3700 Mean : 855.3 Mean :10441 Mean :4358 Mean : 549.4
## 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0
## Max. :31643 Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0
## Personal PhD Terminal S.F.Ratio perc.alumni
## Min. : 250 Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median :1200 Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean :1341 Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :6800 Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
college<-college[,1:10]
pairs(college)
plot(college$Private, college$Outstate)
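# Flag schools where more than 50% of new students come from the top 10% of their high school class
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
# Add the new factor to the data frame
college <- data.frame(college, Elite)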
summary(Elite)
## No Yes
## 699 78
plot(Elite)
plot(college$Elite, college$Outstate)
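The Elite variable created above can also be built in a single step with `ifelse()`. This is a sketch on made-up percentages (`top10perc` and its values are hypothetical), applying the same rule of strictly more than 50.

```r
# Hypothetical Top10perc values for five schools
top10perc <- c(12, 55, 50, 90, 30)

# "Yes" only when strictly more than 50; fix the level order explicitly
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))

# Count schools in each group
table(elite)
```

Setting `levels` explicitly keeps both categories in the table even if one of them happens to be empty.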