MATH 239: Problem Set 1

1. Auto Data: 1A) Which predictors are qualitative, and which are quantitative?

auto<-read.csv("Auto.csv",
               header=TRUE,
               na.strings = "?")
str(auto)

## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

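#Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year
#Qualitative predictors: name and origin (origin is stored as integer codes 1 to 3, but the codes label where the car was made rather than measure anything, so its range, mean, and standard deviation below are not meaningful)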
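
One way to double-check the classification is to look at each column's class and recode label-like integer columns as factors. The sketch below uses a small made-up data frame (`toy`, not the real Auto data); the label names passed to `factor()` are illustrative assumptions, not values taken from the data set.

```r
# Hypothetical stand-in for two Auto columns (values are made up).
toy <- data.frame(
  mpg    = c(18, 15, 26, 31.5),
  origin = c(1L, 1L, 2L, 3L)
)

# Both columns are stored as numbers, but only mpg is a measurement.
sapply(toy, class)

# Recode the integer codes as a factor; the label names are assumptions.
toy$origin <- factor(toy$origin, levels = 1:3,
                     labels = c("region1", "region2", "region3"))

# After recoding, summary() counts rows per level instead of averaging codes.
summary(toy$origin)
```

Once a column is a factor, R refuses to average it, which guards against treating codes as quantities by accident.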

1B) What is the range of each quantitative predictor?

range(auto$mpg)

## [1]  9.0 46.6

range(auto$cylinders)

## [1] 3 8

range(auto$displacement)

## [1]  68 455

range(auto$horsepower) #Returns NA because horsepower contains missing values: the "?" entries in the file were read as NA through na.strings. range(), mean(), and sd() all return NA for this column unless na.rm = TRUE is passed.

## [1] NA NA

range(auto$weight)

## [1] 1613 5140

range(auto$acceleration)

## [1]  8.0 24.8

range(auto$year)

## [1] 70 82

range(auto$origin)

## [1] 1 3
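
The NA results for horsepower come from missing values propagating through the summary functions. A minimal sketch of the usual remedy, using a made-up vector `hp` as a stand-in for `auto$horsepower`:

```r
# Hypothetical stand-in column with missing values (made-up numbers).
hp <- c(130, 165, NA, 150, 95, NA)

range(hp)                 # NA NA, because any NA propagates
range(hp, na.rm = TRUE)   # 95 165, computed on the non-missing values
mean(hp, na.rm = TRUE)    # 135
sd(hp, na.rm = TRUE)

# Count the missing values before deciding how to handle them.
sum(is.na(hp))            # 2
```

The same `na.rm = TRUE` argument should apply to `range(auto$horsepower)`, `mean(auto$horsepower)`, and `sd(auto$horsepower)`.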

1C) What is the mean and standard deviation of each quantitative predictor?

mean(auto$mpg)

## [1] 23.51587

sd(auto$mpg)

## [1] 7.825804

mean(auto$cylinders)

## [1] 5.458438

sd(auto$cylinders)

## [1] 1.701577

mean(auto$displacement)

## [1] 193.5327

sd(auto$displacement)

## [1] 104.3796

mean(auto$horsepower)

## [1] NA

sd(auto$horsepower)

## [1] NA

mean(auto$weight)

## [1] 2970.262

sd(auto$weight)

## [1] 847.9041

mean(auto$acceleration)

## [1] 15.55567

sd(auto$acceleration)

## [1] 2.749995

mean(auto$year)

## [1] 75.99496

sd(auto$year)

## [1] 3.690005

mean(auto$origin)

## [1] 1.574307

sd(auto$origin)

## [1] 0.8025495
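
The repeated mean() and sd() calls can be collapsed into one pass over the columns. This is a sketch on a made-up data frame (`toy`), assuming the same idea carries over to the quantitative columns of `auto`:

```r
# Hypothetical stand-in for three Auto columns (values are made up).
toy <- data.frame(
  mpg        = c(18, 15, 26, 31.5),
  horsepower = c(130, NA, 90, 65),
  weight     = c(3504, 3693, 2234, 1990)
)

# One row per statistic, one column per predictor; NAs are skipped.
stats <- sapply(toy, function(x) c(mean = mean(x, na.rm = TRUE),
                                   sd   = sd(x, na.rm = TRUE)))
round(stats, 2)
```

`sapply()` returns a matrix here because every call yields a named vector of the same length.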

1D) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

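#Note: the index -c(10:85,1) used throughout 1D drops observation 1 as well as observations 10 through 85 (the 1 may have been intended as a column index), so the results below are computed on 320 rows rather than the 321 rows that auto[-(10:85), ] would keep.
#mpg
Automatmpg<-matrix(auto$mpg, 397, 1)
Automatmpg2<-Automatmpg[-c(10:85,1),drop = FALSE]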
range(Automatmpg2)

## [1] 11.0 46.6

mean(Automatmpg2)

## [1] 24.45875

sd(Automatmpg2)

## [1] 7.912336

#cylinders
Automatcyl<-matrix(auto$cylinders,397,1)
Automatcyl2<-Automatcyl[-c(10:85,1),drop = FALSE]
range(Automatcyl2)

## [1] 3 8

mean(Automatcyl2)

## [1] 5.3625

sd(Automatcyl2)

## [1] 1.649499

#displacement
Automatdis<-matrix(auto$displacement,397,1)
Automatdis2<-Automatdis[-c(10:85,1),drop = FALSE]
range(Automatdis2)

## [1]  68 455

mean(Automatdis2)

## [1] 186.675

sd(Automatdis2)

## [1] 99.56448

#horsepower
Automathor<-matrix(auto$horsepower,397,1)
Automathor2<-Automathor[-c(10:85,1),drop = FALSE]
range(Automathor2)

## [1] NA NA

mean(Automathor2)

## [1] NA

sd(Automathor2)

## [1] NA

#weight
Automatwei<-matrix(auto$weight,397,1)
Automatwei2<-Automatwei[-c(10:85,1),drop = FALSE]
range(Automatwei2)

## [1] 1649 4997

mean(Automatwei2)

## [1] 2932.181

sd(Automatwei2)

## [1] 811.283

#acceleration
Automatacc<-matrix(auto$acceleration,397,1)
Automatacc2<-Automatacc[-c(10:85,1),drop = FALSE]
range(Automatacc2)

## [1]  8.5 24.8

mean(Automatacc2)

## [1] 15.73469

sd(Automatacc2)

## [1] 2.676582

#year
Automatyea<-matrix(auto$year,397,1)
Automatyea2<-Automatyea[-c(10:85,1),drop = FALSE]
range(Automatyea2)

## [1] 70 82

mean(Automatyea2)

## [1] 77.175

sd(Automatyea2)

## [1] 3.090181

#origin
Automatori<-matrix(auto$origin,397,1)
Automatori2<-Automatori[-c(10:85,1),drop = FALSE]
range(Automatori2)

## [1] 1 3

mean(Automatori2)

## [1] 1.6

sd(Automatori2)

## [1] 0.8167525
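
Rows can also be removed directly from the data frame, without building a matrix per column. The sketch below uses a made-up 100-row data frame (`toy`) to show how negative indices behave and how an extra index inside `c()` silently removes one more row:

```r
# Hypothetical stand-in with 100 rows (made-up values).
toy <- data.frame(id = 1:100, mpg = seq(10, 39.7, length.out = 100))

# Negative indices drop rows; the trailing comma keeps every column.
kept <- toy[-(10:85), ]
nrow(kept)                 # 24 rows remain: 1-9 and 86-100

# Including 1 inside c() removes row 1 as well, which is easy to miss.
kept_extra <- toy[-c(10:85, 1), ]
nrow(kept_extra)           # 23
head(kept_extra$id, 3)     # 2 3 4
```

Applied to the real data, `auto[-(10:85), ]` would presumably keep 321 of the 397 observations.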

1E) Using the full data set, investigate the predictors graphically, using the scatterplots and other tools of your choice. Create some plots (at least 3) highlighting the relationships among the predictors. Comment on your findings.

plot(auto$mpg, auto$horsepower)

plot(auto$mpg,auto$acceleration)

plot(auto$mpg,auto$year)

#Mpg and horsepower appear to have an inverse relationship (the higher the mpg, the lower the horsepower). Mpg and acceleration appear to be only slightly correlated; for the most part, acceleration changes little as mpg changes. Mpg and year appear to have a positive relationship (e.g., around 1970 mpg ranges from about 10 to 30, while around 1980 it ranges from about 15 to 40).

1F) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

#Based on the plots above, horsepower and year look useful for predicting mpg: horsepower has an inverse relationship with mpg, and year has a positive one. Acceleration looks less useful because its relationship with mpg is only slight.
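
A correlation coefficient can back up what the scatterplots suggest. The sketch below uses `mtcars`, a data set that ships with R, purely as a stand-in for Auto (its columns `hp` and `wt` are not the Auto columns), and then shows how missing values, like those in Auto's horsepower, can be excluded pairwise.

```r
# mtcars ships with R; it is used here only as a stand-in for Auto.
r_hp <- cor(mtcars$mpg, mtcars$hp)
r_wt <- cor(mtcars$mpg, mtcars$wt)
c(hp = r_hp, wt = r_wt)    # both are negative in mtcars

# With missing values, restrict the calculation to complete pairs.
x <- c(1, 2, NA, 4)
y <- c(2, 4, 6, 8)
cor(x, y)                         # NA
cor(x, y, use = "complete.obs")   # 1, since y = 2x on the complete pairs
```

A coefficient near zero would support leaving a variable out, though it only measures linear association.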

2. Working with vectors and matrices: 2A) Construct a matrix, where rows represent each movie. Name this matrix starWars and output it.

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

starWars<-matrix(c(new_hope,empire_strikes,return_jedi),nrow=3,byrow=TRUE)
starWars

##         [,1]  [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8

2B) Rename the rows and columns of the matrix you created in Part A with the vector region for columns and the vector titles for rows. Then print the matrix.

rownames(starWars)<-titles
colnames(starWars)<-region
starWars

##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
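
The matrix can also be built and named in a single call. This is a sketch of an alternative, not the approach the assignment requires; `starWars_alt` is a hypothetical name chosen to avoid overwriting `starWars`.

```r
new_hope       <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi    <- c(309.306, 165.8)

# byrow = TRUE fills one movie per row; dimnames takes row names, then column names.
starWars_alt <- matrix(
  c(new_hope, empire_strikes, return_jedi),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# Named dimensions allow lookup by label instead of position.
starWars_alt["Return of the Jedi", "non-US"]   # 165.8
```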

2C) Calculate the worldwide box office figures for each movie using the rowSums() function. Name and output this vector.

Worldwide<-rowSums(starWars)

2D) Now we want to add a column to our matrix for worldwide sales. You can do this by using the cbind() function. This function binds columns together.

starWars<-cbind(starWars,Worldwide)
starWars

##                              US non-US Worldwide
## A New Hope              460.998  314.4   775.398
## The Empire Strikes Back 290.475  247.9   538.375
## Return of the Jedi      309.306  165.8   475.106

2E) Create another matrix for the prequels and name it starWars2. Don’t forget to name the rows and the columns (similar to above).

phantom_menace <- c(474.5, 552.5)
attack_clones <- c(310.7, 338.7)
revenge_sith <- c(380.3, 468.5)
titles2<- c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith")
starWars2<-matrix(c(phantom_menace,attack_clones,revenge_sith),nrow=3,byrow=TRUE)
rownames(starWars2)<-titles2
colnames(starWars2)<-region
Worldwide2<-rowSums(starWars2)
starWars2<-cbind(starWars2,Worldwide2)
starWars2

##                         US non-US Worldwide2
## The Phantom Menace   474.5  552.5     1027.0
## Attack of the Clones 310.7  338.7      649.4
## Revenge of the Sith  380.3  468.5      848.8

2F) Make one big matrix that combines all the movies (from starWars and starWars2) using rbind(). This binds rows or, in this case, can be used to combine two matrices. Name this new matrix allStarWars.

allStarWars<-rbind(starWars, starWars2)
allStarWars

##                              US non-US Worldwide
## A New Hope              460.998  314.4   775.398
## The Empire Strikes Back 290.475  247.9   538.375
## Return of the Jedi      309.306  165.8   475.106
## The Phantom Menace      474.500  552.5  1027.000
## Attack of the Clones    310.700  338.7   649.400
## Revenge of the Sith     380.300  468.5   848.800

2G) Find the total non-US revenue for all the movies using the colSums() function.

colSums(allStarWars) #The non-US total is the second entry of the output, 2087.8 (in millions).

##        US    non-US Worldwide 
##  2226.279  2087.800  4314.079
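
Since the question asks only for the non-US total, the named result of colSums() can be indexed by column name. A self-contained sketch that rebuilds the two regional columns from the figures above (`m` and `totals` are hypothetical names):

```r
# US and non-US figures for the six movies, in millions, in the order shown above.
us     <- c(460.998, 290.475, 309.306, 474.5, 310.7, 380.3)
non_us <- c(314.4, 247.9, 165.8, 552.5, 338.7, 468.5)
m <- cbind(US = us, "non-US" = non_us)

# colSums() returns a named vector, so one total can be picked out by name.
totals <- colSums(m)
totals["non-US"]   # 2087.8
```

With the full matrix, the equivalent call would be `colSums(allStarWars)["non-US"]`.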

3. College: 3A) Use the read.csv() function to read the data into R. You can download the data from the book’s website (don’t forget to set the working directory) or you can use the URL

college<-read.csv("College.csv", header=TRUE)

3B) Use the View() function to look at the data. You should notice that the first column is just the name of each university. We don’t really want R to treat this as a variable. However, it may be handy to have these names for later. Try the following commands: rownames(college) <- college[,1] and then View(college).

View(college)
rownames(college) <- college[,1] 
View(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on row names. However, we still need to eliminate the first column in the data where the names are stored. Try college <- college[,-1] and then View(college).

college <- college[,-1]
View(college)
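
The two steps above (copy the first column into the row names, then drop it) can likely be folded into the read itself with the `row.names` argument. The sketch below reads a tiny inline CSV in place of College.csv; the column header `X` and the college names and values are made up. It also sets `stringsAsFactors = TRUE`, because from R 4.0.0 onward read.csv() leaves text columns as character by default, whereas the summary output in 3C shows Private as a factor.

```r
# Tiny inline CSV standing in for College.csv (names and values are made up).
csv_text <- "X,Private,Apps
College A,Yes,1660
College B,No,2186"

# row.names = 1 uses the first column as row names and removes it from the data.
toy <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = TRUE)

rownames(toy)   # "College A" "College B"
names(toy)      # "Private" "Apps"
class(toy$Private)
```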

3C) Perform the following tasks and provide the code:

#A. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

#B. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables in the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[,1:10])

#C. Use the plot() or ggplot() function to produce side-by-side boxplots of Outstate vs Private.
#The two boxes are labeled by the levels of Private (No and Yes); xlab and ylab name the axes.
plot(college$Private,college$Outstate,xlab="Private",ylab="Outstate")
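
An equivalent way to request grouped boxplots is the formula interface of boxplot(), which reads as "Outstate by Private". A sketch on a made-up data frame (`toy`), assuming Private is stored as a factor:

```r
# Hypothetical stand-in for two College columns (values are made up).
toy <- data.frame(
  Private  = factor(c("Yes", "Yes", "No", "No", "Yes", "No")),
  Outstate = c(12000, 15000, 6000, 7000, 13500, 6500)
)

# One box per level of Private; the returned list holds the plotted statistics.
bp <- boxplot(Outstate ~ Private, data = toy,
              xlab = "Private", ylab = "Outstate")

bp$names        # "No" "Yes", the labels drawn under the boxes
bp$stats[3, ]   # group medians: 6500 13500
```

The formula form also works when the grouping column is character rather than factor, which can help on R 4.0.0 and later.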

#D. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school class exceeds 50%.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate      Elite    
##  Min.   : 10.00   No :699  
##  1st Qu.: 53.00   Yes: 78  
##  Median : 65.00            
##  Mean   : 65.46            
##  3rd Qu.: 78.00            
##  Max.   :118.00

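#Side-by-side boxplots of Outstate vs Elite; the boxes are labeled by the levels of Elite (No and Yes).
plot(college$Elite,college$Outstate,xlab="Elite",ylab="Outstate")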
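
The binning in part D can also be written in one step with ifelse(). A minimal sketch on a made-up vector (`top10`, standing in for `college$Top10perc`); note that a value of exactly 50 does not exceed 50, so it falls in the "No" group.

```r
# Hypothetical stand-in for Top10perc (made-up percentages).
top10 <- c(23, 60, 51, 50, 12)

# Label each value, then fix the level order so "No" comes first.
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)   # No: 3, Yes: 2
```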

Claire Verstrate (with Alexis Barela)