MATH 239: Homework #1

Problem 1:

This exercise involves the “AUTO” data set we studied during lab. Make sure that the missing values have been removed from the data.

a. Which of the predictors are quantitative and which are qualitative?

auto<-read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data", 
                   header=TRUE,
                   na.strings = "?")
library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

str(auto)

## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

The quantitative predictors are mpg, cylinders, displacement, horsepower, weight, and acceleration. Year could reasonably be treated as quantitative as well.

The clearly qualitative predictor is name. Origin, year, and cylinders could also be considered qualitative, since each takes only a few distinct numeric values that act as category labels.
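The problem statement asks for rows with missing values to be removed, whereas the code above only marks "?" entries as NA and later passes na.rm = TRUE where needed. A minimal sketch of dropping incomplete rows with na.omit(), using a small made-up data frame (toy and its values are assumptions, not the Auto data):

```r
# Made-up stand-in for the Auto data; the values are illustrative only
toy <- data.frame(mpg = c(18, 15, 25, 30),
                  horsepower = c(130, NA, 95, NA))

# na.omit() drops every row that has an NA in any column
toy_clean <- na.omit(toy)

nrow(toy)        # 4 rows before
nrow(toy_clean)  # 2 rows after
```

Applied to the real data, the analogous call would be auto <- na.omit(auto), after which the na.rm = TRUE arguments are no longer needed.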

b. What is the range of each quantitative predictor?

range(auto$mpg)

## [1]  9.0 46.6

range(auto$cylinders)

## [1] 3 8

range(auto$displacement)

## [1]  68 455

range(auto$horsepower,
      na.rm = TRUE)

## [1]  46 230

range(auto$weight)

## [1] 1613 5140

range(auto$acceleration)

## [1]  8.0 24.8

range(auto$year)

## [1] 70 82
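The separate range() calls above could also be collapsed into one call over all numeric columns. A sketch on a made-up data frame (toy, its columns, and its values are assumptions for illustration):

```r
toy <- data.frame(mpg = c(18, 15, 25),
                  weight = c(3504, 3693, 2200),
                  name = c("car a", "car b", "car c"))

# Pick out the numeric columns, then apply range() to each one
is_num <- sapply(toy, is.numeric)
ranges <- sapply(toy[, is_num], range, na.rm = TRUE)
ranges  # one column per variable: row 1 = minimum, row 2 = maximum
```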

c. What is the mean and standard deviation of each quantitative predictor?

mean(auto$mpg)

## [1] 23.51587

mean(auto$cylinders)

## [1] 5.458438

mean(auto$displacement)

## [1] 193.5327

mean(auto$horsepower,
     na.rm = TRUE)

## [1] 104.4694

mean(auto$weight)

## [1] 2970.262

mean(auto$acceleration)

## [1] 15.55567

The mean for mpg: 23.52

cylinders: 5.458

displacement: 193.5

horsepower: 104.5

weight: 2970

acceleration: 15.56

sd(auto$mpg)

## [1] 7.825804

sd(auto$cylinders)

## [1] 1.701577

sd(auto$displacement)

## [1] 104.3796

sd(auto$horsepower,
   na.rm = TRUE)

## [1] 38.49116

sd(auto$weight)

## [1] 847.9041

sd(auto$acceleration)

## [1] 2.749995

The standard deviation for mpg: 7.825804

cylinders: 1.701577

displacement: 104.3796

horsepower: 38.49116

weight: 847.9041

acceleration: 2.749995
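For reference, sd() computes the sample standard deviation, which divides by n - 1 rather than n. A short sketch that reproduces it by hand on the first five mpg values shown in the str() output (sd_by_hand is a hypothetical name):

```r
x <- c(18, 15, 18, 16, 17)  # first five mpg values from the str() output

n <- length(x)
# Sample standard deviation: sum of squared deviations divided by n - 1
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1))

all.equal(sd_by_hand, sd(x))  # TRUE
```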

d. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

auto2<-auto[-(10:85),]
str(auto2)

## 'data.frame':    321 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 13 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 350 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 175 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 13 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 73 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 25 ...

# Finding range of values
range(auto2$mpg)

## [1] 11.0 46.6

range(auto2$cylinders)

## [1] 3 8

range(auto2$displacement)

## [1]  68 455

range(auto2$horsepower,
      na.rm = TRUE)

## [1]  46 230

range(auto2$weight)

## [1] 1649 4997

range(auto2$acceleration)

## [1]  8.5 24.8

range(auto2$year)

## [1] 70 82

range(auto2$origin)

## [1] 1 3

# Finding Mean of Values
mean(auto2$mpg)

## [1] 24.43863

mean(auto2$cylinders)

## [1] 5.370717

mean(auto2$displacement)

## [1] 187.0498

mean(auto2$horsepower,
     na.rm = TRUE)

## [1] 100.9558

mean(auto2$weight)

## [1] 2933.963

mean(auto2$acceleration)

## [1] 15.72305

# Finding standard deviation of values
sd(auto2$mpg)

## [1] 7.908184

sd(auto2$cylinders)

## [1] 1.653486

sd(auto2$displacement)

## [1] 99.63539

sd(auto2$horsepower,
   na.rm = TRUE)

## [1] 35.89557

sd(auto2$weight)

## [1] 810.6429

sd(auto2$acceleration)

## [1] 2.680514
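As a sanity check on the row removal, a negative index drops exactly the listed rows, so removing rows 10 through 85 from 397 rows should leave 321, which matches the str(auto2) output. A sketch on a stand-in data frame (toy is hypothetical and only mimics the row count of auto):

```r
toy <- data.frame(id = 1:397)          # 397 rows, matching the size of auto
toy2 <- toy[-(10:85), , drop = FALSE]  # drop rows 10 through 85

length(10:85)  # 76 rows removed
nrow(toy2)     # 321 rows remain
```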

e. Using the full data set, investigate the predictors graphically, using scatterplots and other tools of your choice. Create some plots (at least 3) highlighting the relationships among the predictors. Comment on your findings.

auto$cylinders<-as.factor(auto$cylinders)

plot(auto$cylinders, auto$mpg, col="red")

Here, “x” represents the Cylinders column, and “y” represents the MPG column. The boxplots show that cars with 4 or 5 cylinders tend to have higher gas mileage.

# horsepower is quantitative, so keep it numeric and draw a scatterplot
plot(auto$horsepower, auto$mpg, col="red")

Here, “x” is Horsepower and “y” is MPG. The plot shows that gas mileage tends to decrease as horsepower increases.

plot(auto$cylinders, auto$acceleration, col="red")

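Here, “x” is Cylinders and “y” is Acceleration. It seems that a moderate number of cylinders (around 6?) goes with the highest acceleration values. In this data set, acceleration is the time in seconds to go from 0 to 60 mph, so higher values mean a slower car.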

f. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting MPG? Justify your answer.

I believe that any of the other variables plotted above could help predict MPG. The three plots I just made already point to a trend in which the best gas mileage goes with a moderate amount of acceleration, horsepower, and cylinders.
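One way to back up what the plots suggest is a correlation coefficient. The sketch below uses R's built-in mtcars data as a stand-in (an assumption for illustration: it is a different, smaller data set than Auto, with columns mpg, hp, and wt), so the numbers do not apply to the homework data:

```r
# mtcars ships with R; hp is horsepower and wt is weight in 1000 lbs
r_hp <- cor(mtcars$mpg, mtcars$hp)
r_wt <- cor(mtcars$mpg, mtcars$wt)

# Negative values mean mpg tends to fall as the predictor rises
c(horsepower = r_hp, weight = r_wt)
```

On the Auto data the analogous call would be cor(auto$mpg, auto$horsepower, use = "complete.obs"), provided horsepower is still numeric.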

Problem 2

The following data are from the box office sales of the Star Wars movies.

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of
the Jedi")

Provide the code and output for each of the following tasks:

a. Construct a matrix, where rows represent each movie. Name this matrix “starWars” and output it.

starWars<-matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
starWars

##         [,1]  [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8

b. Rename the rows and columns of the matrix you created in Part A with the vector “region” for columns and the vector “titles” for rows. Then print the matrix.

colnames(starWars)<-region
rownames(starWars)<-titles

starWars

##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of\nthe Jedi     309.306  165.8
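The "\n" in the third row name comes from the line break inside the "Return of the Jedi" string where titles is defined, so R stores a newline character in the name. A sketch of one way to clean it up after the fact (titles_fixed is a hypothetical name):

```r
titles <- c("A New Hope", "The Empire Strikes Back", "Return of\nthe Jedi")

# Replace the embedded newline with a space
titles_fixed <- gsub("\n", " ", titles, fixed = TRUE)
titles_fixed[3]
```

Writing the title on a single line in the original c(...) call avoids the problem at the source.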

c. Calculate the worldwide box office figures for each movie using the rowSums() function. Name and output this vector.

world_wide_box_office<-rowSums(starWars)
world_wide_box_office

##              A New Hope The Empire Strikes Back     Return of\nthe Jedi 
##                 775.398                 538.375                 475.106

d. Now we want to add a column to our matrix for worldwide sales. You can do this by using the cbind() function. This function binds columns together.

# To do this, I bind the original matrix with the Worldwide vector.
cbind(starWars, world_wide_box_office)

##                              US non-US world_wide_box_office
## A New Hope              460.998  314.4               775.398
## The Empire Strikes Back 290.475  247.9               538.375
## Return of\nthe Jedi     309.306  165.8               475.106
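The cbind() call above prints the combined matrix but does not store it. A sketch that keeps the result and gives the new column a shorter label (starWarsTotal and Worldwide are hypothetical names, and the matrix is rebuilt here so the block runs on its own):

```r
starWars <- matrix(c(460.998, 314.4,
                     290.475, 247.9,
                     309.306, 165.8),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("A New Hope",
                                     "The Empire Strikes Back",
                                     "Return of the Jedi"),
                                   c("US", "non-US")))

# Naming the argument sets the name of the new column
starWarsTotal <- cbind(starWars, Worldwide = rowSums(starWars))
starWarsTotal
```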

e. Create another matrix for the prequels and name it “starWars2”. Don’t forget to name the rows and the columns.

# Prequels
phantom_menace <- c(474.5, 552.5)
attack_clones <- c(310.7, 338.7)
revenge_sith <- c(380.3, 468.5)

titles2<- c("The Phantom Menace", "Attack of the Clones",
"Revenge of the Sith")
starWars2<-matrix(c(phantom_menace, attack_clones, revenge_sith), nrow = 3, byrow = TRUE)

colnames(starWars2)<-region
rownames(starWars2)<-titles2

starWars2

##                         US non-US
## The Phantom Menace   474.5  552.5
## Attack of the Clones 310.7  338.7
## Revenge of the Sith  380.3  468.5

f. Make one big matrix that combines all the movies (from starWars and starWars2) using rbind(). This binds rows or in this case can be used to combine two matrices. Name the new matrix “allStarWars”.

allStarWars<-rbind(starWars, starWars2)
allStarWars

##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of\nthe Jedi     309.306  165.8
## The Phantom Menace      474.500  552.5
## Attack of the Clones    310.700  338.7
## Revenge of the Sith     380.300  468.5

g. Find the total non-US revenue for all the movies using the “colSums()” function.

colSums(allStarWars)

##       US   non-US 
## 2226.279 2087.800

sum(allStarWars[, 2])

## [1] 2087.8

# How do I use colSums() for only non-US? I tried using it, but could only find the answer using sum().
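On the question of using colSums() for non-US only: colSums() returns a named vector, so one option is to index the result by column name. A sketch that rebuilds the combined matrix from the figures above (row names are left out for brevity; non_us_total is a hypothetical name):

```r
allStarWars <- matrix(c(460.998, 314.4,
                        290.475, 247.9,
                        309.306, 165.8,
                        474.5,   552.5,
                        310.7,   338.7,
                        380.3,   468.5),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(NULL, c("US", "non-US")))

# colSums() returns a named vector, so index it by column name
non_us_total <- colSums(allStarWars)["non-US"]
non_us_total
```

This should give the same total as sum(allStarWars[, 2]).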

Problem 3

a. Use the read.csv() function to read the data into R. You can download the data from the book’s website (don’t forget to set the working directory) or you can use the URL.

college<-read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/College.csv",header=TRUE)

b. Use the view() function to look at the data. You should notice that the first column is just the name of each university.

We don’t really want R to treat this as a variable. However, it may be handy to have these names for later. Try the following commands:

view(college)
rownames(college)<-college[, 1]
view(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on row names. However, we still need to eliminate the first column in the data where the names are stored. Try:

college <- college[,-1]
view(college)

Now you should see that the first data column is “Private”
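The two steps above (copying the first column into the row names, then dropping it) can also be done while reading the file, using the row.names argument of read.csv(). A sketch on a tiny inline CSV with made-up colleges (the names and numbers are hypothetical):

```r
csv_text <- "Name,Private,Apps
College A,Yes,1660
College B,No,2186"

# row.names = 1 uses the first column as row names and leaves it out of the data
toy <- read.csv(text = csv_text, row.names = 1)
toy
```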

c. Complete the following tasks and provide the code:

a. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

b. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables in the data. Recall that you can reference the first ten columns of a matrix A using A[, 1:10].

pairs(college[, 1:10])

c. Use the plot() or ggplot() function to produce side-by-side boxplots of “Outstate vs. Private”.

college$Private<-as.factor(college$Private)
plot(college$Private, college$Outstate, col="yellow")
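The same kind of side-by-side boxplot can be written with the formula interface of boxplot(), which takes the grouping column directly. A sketch with made-up tuition values (toy is hypothetical data, not the College data):

```r
toy <- data.frame(Private  = c("Yes", "Yes", "Yes", "No", "No", "No"),
                  Outstate = c(12000, 15000, 11000, 6000, 7500, 9000))

# y ~ group draws one box per level of the grouping variable
bp <- boxplot(Outstate ~ Private, data = toy, col = "yellow",
              xlab = "Private", ylab = "Outstate")
bp$names  # group labels in plotting order
```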

d. Create a new qualitative variable, called “Elite”, by binning the “Top10perc” variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school class exceeds 50%.

Elite<-rep("No", nrow(college))
Elite[college$Top10perc>50]="Yes"
Elite<-as.factor(Elite)
college<-data.frame(college, Elite)

Use the summary() function to see how many elite universities there are.

summary(Elite)

##  No Yes 
## 699  78
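The binning step can also be written in one line with ifelse(), and table() gives the same kind of count as summary() on a factor. A sketch on a made-up vector (top10 and its values are assumptions, not the College data):

```r
top10 <- c(23, 16, 60, 89, 38, 51)  # made-up Top10perc values

# "Yes" when more than 50% of students come from the top 10% of their class
Elite <- factor(ifelse(top10 > 50, "Yes", "No"))
table(Elite)
```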

Now use the plot() or ggplot() function to produce side-by-side boxplots of Outstate vs. Elite.

plot(college$Elite, college$Outstate, col="forest green")