In collaboration with Bryce O’Connor and Dakota Barksdale

Problem 1: Auto Data

This exercise involves the Auto data set that we studied during lab. Make sure that the missing values have been removed from the data.

A. Which of the predictors are quantitative, and which are qualitative?

Auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data", 
                   header = TRUE,
                   na.strings = "?",
                   stringsAsFactors = TRUE)  # keeps name as a factor on R >= 4.0
str(Auto)
## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Auto <- na.omit(Auto)  # drop the rows that contain missing values

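The qualitative predictors are origin and name. All of the other variables (mpg, cylinders, displacement, horsepower, weight, acceleration, and year) are quantitative.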
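
One way to sanity-check a classification like this is to look at each column's class together with its number of distinct values: a numeric column with only a handful of distinct values (such as origin) is usually a coded category. The sketch below runs on a small made-up data frame, `toy`, standing in for Auto; its values are assumptions for illustration only.

```r
# Hypothetical stand-in for the Auto data (values are made up for illustration)
toy <- data.frame(
  mpg    = c(18, 15, 26, 31, 24, NA),
  weight = c(3504, 3693, 2130, 1985, 2430, 2900),
  origin = c(1, 1, 3, 3, 2, 1),
  name   = c("a", "b", "c", "d", "e", "f")
)

# Drop rows that contain missing values, as the exercise requires
toy <- na.omit(toy)

# Class and number of distinct values for every column
col_info <- data.frame(
  class    = sapply(toy, class),
  n_unique = sapply(toy, function(x) length(unique(x)))
)
print(col_info)

# Store the coded category as a factor so R treats it as qualitative
toy$origin <- factor(toy$origin)
```

On the real data the same two `sapply()` calls could be run on `Auto` directly.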

B. What is the range of each quantitative predictor?

apply(Auto[,1:6], 2, range)
##       mpg cylinders displacement horsepower weight acceleration
## [1,]  9.0         3           68         46   1613          8.0
## [2,] 46.6         8          455        230   5140         24.8

C. What is the mean and standard deviation of each quantitative predictor?

#mean
options(width = 95)
apply(Auto[,1:6], 2, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327
#SD
options(width = 95)
apply(Auto[,1:6], 2, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864
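
The range, mean, and standard deviation can also be collected in a single table with one `sapply()` call. This is only a sketch on a made-up data frame, `toy`, whose values are assumptions; on the real data the same function could be applied to whichever columns of `Auto` are treated as quantitative.

```r
# Hypothetical stand-in for the quantitative columns of Auto
toy <- data.frame(
  mpg    = c(18, 15, 26, 31, 24),
  weight = c(3504, 3693, 2130, 1985, 2430)
)

# One row per statistic, one column per variable
stats <- sapply(toy, function(x) {
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
})
print(round(stats, 2))
```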

D. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

apply(Auto[-c(10:85),1:6], 2, range)
##       mpg cylinders displacement horsepower weight acceleration
## [1,] 11.0         3           68         46   1649          8.5
## [2,] 46.6         8          455        230   4997         24.8
options(width = 95)
apply(Auto[-c(10:85),1:6], 2, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899
apply(Auto[-c(10:85),1:6], 2, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721
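
The negative index used above drops rows by position. A minimal sketch on a made-up data frame (the names `x`, `rows_to_drop`, and `remaining` are illustrative, not part of the assignment) shows the pattern and why the parentheses in `-c(10:85)` matter:

```r
# Hypothetical ten-row data frame
x <- data.frame(id = 1:10, value = seq(10, 100, by = 10))

# Drop the 3rd through 5th rows by position
rows_to_drop <- 3:5
remaining <- x[-rows_to_drop, ]
print(remaining)

# Pitfall: without parentheses, -3:5 is the sequence -3, -2, ..., 5,
# which mixes negative and positive subscripts and cannot be used as an index
bad_index <- -3:5
print(bad_index)
```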

E. Using the full data set, investigate the predictors graphically, using scatterplots and other tools of your choice. Create some plots (at least 3) highlighting the relationships among the predictors. Comment on your findings.

pairs(Auto)

plot(Auto$acceleration, Auto$weight)

There appears to be a weak negative correlation between the two: as weight increases, acceleration tends to decrease.

plot(Auto$horsepower, Auto$mpg)

There appears to be a negative correlation: cars with higher horsepower tend to have lower mpg.

plot(Auto$year, Auto$mpg)

There appears to be a positive correlation: as year increases, mpg tends to increase as well.

F. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Yes. The plots show that several variables are related to mpg, some positively and some negatively. For example, mpg and horsepower have a negative relationship (as horsepower increases, mpg decreases), while year and mpg have a positive relationship (as year increases, so does mpg). Variables with a clear relationship like these are likely to be useful for predicting mpg.
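
Visual impressions like these can be backed up with `cor()`, which reports the direction and strength of each linear relationship. The sketch below uses the built-in `mtcars` data purely as a stand-in (an assumption made so the example runs without downloading anything); on the assignment data the analogous call would be along the lines of `cor(Auto$mpg, Auto$horsepower)`.

```r
# mtcars ships with base R and is used here only as a stand-in for Auto
data(mtcars)

# Pearson correlation between mpg and two candidate predictors
r_hp <- cor(mtcars$mpg, mtcars$hp)  # analogous to horsepower in Auto
r_wt <- cor(mtcars$mpg, mtcars$wt)  # analogous to weight in Auto

# Correlation matrix for several variables at once
cor_mat <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])
print(round(cor_mat, 2))
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0 indicates a weak one.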

Problem 2: Working with vectors and matrices

A. Construct a matrix, where rows represent each movie. Name this matrix starWars and output it.

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Construct matrix
starWars <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
starWars
##         [,1]  [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8

B. Rename the rows and columns of the matrix you created in Part A with the vector region for columns and the vector titles for rows. Then print the matrix.

# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

# Name the columns with region
colnames(starWars) <- region

# Name the rows with titles
rownames(starWars) <- titles

# Print out starWars
starWars
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
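
As an alternative to naming the matrix after it has been built, the `dimnames` argument of `matrix()` sets the row and column names in the same call. This is a sketch of that variant; `starWars_alt` is a hypothetical name chosen so it does not overwrite `starWars`.

```r
# Build the matrix and name its rows and columns in one step
starWars_alt <- matrix(
  c(460.998, 314.4,
    290.475, 247.9,
    309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# Named dimensions allow lookups by title and region
starWars_alt["A New Hope", "non-US"]
```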

C. Calculate the worldwide box office figures for each movie using the rowSums() function. Name and output this vector.

# Calculate worldwide box office figures
worldwide_vector <- rowSums(starWars)
worldwide_vector
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                 775.398                 538.375                 475.106

D. Now we want to add a column to our matrix for worldwide sales. You can do this by using the cbind() function. This function binds columns together.

# Bind the new variable worldwide_vector as a column to starWars
all_wars_matrix <- cbind(starWars, worldwide_vector)
all_wars_matrix
##                              US non-US worldwide_vector
## A New Hope              460.998  314.4          775.398
## The Empire Strikes Back 290.475  247.9          538.375
## Return of the Jedi      309.306  165.8          475.106
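
By default `cbind()` names the new column after the variable passed in, which is why the header above reads `worldwide_vector`. Passing the vector as a named argument gives a cleaner header. A minimal sketch, where `sw`, `sw_all`, and the column name `worldwide` are illustrative choices:

```r
# Rebuild the original-trilogy matrix so the example is self-contained
sw <- matrix(
  c(460.998, 314.4,
    290.475, 247.9,
    309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# The argument name becomes the column name
sw_all <- cbind(sw, worldwide = rowSums(sw))
print(sw_all)
```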

E. Create another matrix for the prequels and name it starWars2. Don’t forget to name the rows and the columns (similar to above).

# Prequels
phantom_menace<-c(474.5, 552.5) 
attack_clones<-c(310.7, 338.7) 
revenge_sith<-c(380.3, 468.5)

starWars2<-matrix(c(phantom_menace, attack_clones, revenge_sith), nrow = 3, byrow = TRUE)

starWars2
##       [,1]  [,2]
## [1,] 474.5 552.5
## [2,] 310.7 338.7
## [3,] 380.3 468.5
# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith")

# Name the columns with region
colnames(starWars2) <- region

# Name the rows with titles
rownames(starWars2) <- titles
starWars2
##                         US non-US
## The Phantom Menace   474.5  552.5
## Attack of the Clones 310.7  338.7
## Revenge of the Sith  380.3  468.5

F. Make one big matrix that combines all the movies (from starWars and starWars2) using rbind(). This binds rows, or in this case can be used to combine two matrices. Name this new matrix allStarWars.

# Combine both Star Wars trilogies in one matrix
allStarWars <- rbind(starWars, starWars2)

allStarWars
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
## The Phantom Menace      474.500  552.5
## Attack of the Clones    310.700  338.7
## Revenge of the Sith     380.300  468.5

G. Find the total non-US revenue for all the movies using the colSums() function.

# Total non-US revenue for all movies
totals <- colSums(allStarWars)
totals[2]
## non-US 
## 2087.8
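
Selecting the total by column name rather than by position keeps the code correct even if the column order changes. A sketch of that variant, with the six films re-entered so the block is self-contained (`allStarWars_alt`, `non_us_total`, and `same_total` are hypothetical names):

```r
# US and non-US revenue for all six films, in millions
allStarWars_alt <- matrix(
  c(460.998, 314.4,
    290.475, 247.9,
    309.306, 165.8,
    474.5,   552.5,
    310.7,   338.7,
    380.3,   468.5),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("US", "non-US"))
)

# Pick the column total by name
non_us_total <- colSums(allStarWars_alt)[["non-US"]]

# Equivalent: sum the named column directly
same_total <- sum(allStarWars_alt[, "non-US"])
print(non_us_total)
```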

Problem 3: College Data

This is the setup code for this question:

college <- read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/College.csv",
                    header = TRUE,
                    stringsAsFactors = TRUE)  # keeps Private as a factor on R >= 4.0

rownames(college) <- college[,1] 
#View(college)

college <- college[,-1] 
#View(college)
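
The two steps above (copy the first column into the row names, then drop it) can also be done while reading the file by passing `row.names = 1` to `read.csv()`. The sketch below demonstrates this on a tiny inline CSV; the college names and values in `csv_text` are made up for illustration.

```r
# Hypothetical two-row CSV with the college name in the first column
csv_text <- "X,Private,Apps
College A,Yes,1660
College B,No,2186"

# row.names = 1 uses the first column as row names and removes it from the data
toy <- read.csv(text = csv_text, header = TRUE, row.names = 1,
                stringsAsFactors = TRUE)
print(toy)
```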

# View(college) is commented out in order to load the dataset without having to show all the data on this markdown document
  1. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc       Top25perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00   Min.   :  9.0  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00   1st Qu.: 41.0  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00   Median : 54.0  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56   Mean   : 55.8  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00   3rd Qu.: 69.0  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00   Max.   :100.0  
##   F.Undergrad     P.Undergrad         Outstate       Room.Board       Books       
##  Min.   :  139   Min.   :    1.0   Min.   : 2340   Min.   :1780   Min.   :  96.0  
##  1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0  
##  Median : 1707   Median :  353.0   Median : 9990   Median :4200   Median : 500.0  
##  Mean   : 3700   Mean   :  855.3   Mean   :10441   Mean   :4358   Mean   : 549.4  
##  3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0  
##  Max.   :31643   Max.   :21836.0   Max.   :21700   Max.   :8124   Max.   :2340.0  
##     Personal         PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   : 250   Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median :1200   Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   :1341   Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :6800   Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00
  2. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables in the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
college <- college[, 1:10]  # keep only the first ten columns from here on
pairs(college)

  3. Use the plot() or ggplot() function to produce side-by-side boxplots of Outstate vs Private.
plot(college$Private, college$Outstate)

  4. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school class exceeds 50%.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
  5. Use the summary() function to see how many elite universities there are. Now use the plot() or ggplot() function to produce side-by-side boxplots of Outstate vs Elite.
summary(Elite)
##  No Yes 
## 699  78
plot(Elite)

plot(college$Elite, college$Outstate)
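
The binning and the boxplot can also be written more compactly with `ifelse()` and the formula interface of `boxplot()`, which labels the axes explicitly. This is a sketch on a made-up data frame, `toy`; its tuition and Top10perc values are assumptions for illustration, not rows from the College data.

```r
# Hypothetical stand-in for the College data
toy <- data.frame(
  Outstate  = c(7440, 12280, 11250, 12960, 7560, 13500, 19760, 18200),
  Top10perc = c(23, 16, 22, 50, 16, 60, 89, 51)
)

# Bin Top10perc in one step; exactly 50 counts as "No" because the test is > 50
toy$Elite <- factor(ifelse(toy$Top10perc > 50, "Yes", "No"),
                    levels = c("No", "Yes"))
print(table(toy$Elite))

# Side-by-side boxplots of Outstate for each Elite group
bp <- boxplot(Outstate ~ Elite, data = toy,
              xlab = "Elite", ylab = "Out-of-state tuition")
```

With a factor on the x axis, `plot()` draws the same kind of boxplot, so the formula version mainly adds readable axis labels.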