Problem Set #1

Problem 1: Auto Data

A. Which of the predictors are quantitative, and which are qualitative?

auto2<-read.csv("Auto (2).csv", 
                 header=TRUE, 
                 na.strings = "?")
auto3<-na.omit(auto2)
str(auto3)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int  33 127 331 337 355
##   ..- attr(*, "names")= chr  "33" "127" "331" "337" ...

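Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year.

Qualitative: origin, name.

Several quantitative variables (cylinders, horsepower, weight, year) are stored as integers, but they still measure numeric quantities. origin is an integer code for a category, so it is qualitative.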
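Storage type alone does not settle the question, since an integer column can hold either a measurement or a coded category. A minimal sketch, using a small made-up data frame (`toy` and its values are hypothetical, not the Auto data), that reports each column's class and its number of distinct values:

```r
# Hypothetical miniature data frame; the values are illustrative only.
toy <- data.frame(
  mpg        = c(18, 15, 26.5, 31),
  horsepower = c(130L, 165L, 90L, 65L),
  origin     = c(1L, 1L, 2L, 3L),
  name       = factor(c("a", "b", "c", "d"))
)

# Storage class and number of distinct values for each column.
col_info <- data.frame(
  class    = sapply(toy, class),
  n_unique = sapply(toy, function(x) length(unique(x)))
)
col_info

# A coded category such as origin is better stored as a factor.
toy$origin <- factor(toy$origin)
```

A column with only a handful of distinct integer codes is a candidate for a factor, but the final call depends on what the variable means.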

B. What is the range of each quantitative predictor?

range(auto3$mpg)

## [1]  9.0 46.6

range(auto3$displacement)

## [1]  68 455

range(auto3$acceleration)

## [1]  8.0 24.8

C. What is the mean and standard deviation of each quantitative predictor?

mean(auto3$mpg)

## [1] 23.44592

sd(auto3$mpg)

## [1] 7.805007

mean(auto3$displacement)

## [1] 194.412

sd(auto3$displacement)

## [1] 104.644

mean(auto3$acceleration)

## [1] 15.54133

sd(auto3$acceleration)

## [1] 2.758864
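The per-variable calls above can be collapsed into one pass over every numeric column. A minimal sketch, assuming a small made-up data frame (`toy` is hypothetical, not the Auto data):

```r
# Hypothetical stand-in for the Auto data; values are illustrative only.
toy <- data.frame(
  mpg          = c(18, 15, 26.5, 31),
  displacement = c(307, 350, 120, 68),
  name         = c("a", "b", "c", "d")
)

# Keep only the numeric columns, then compute all statistics per column.
num_cols <- sapply(toy, is.numeric)
stats <- sapply(toy[num_cols], function(x) {
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
})
round(stats, 3)
```

The result is a matrix with one column per variable and one row per statistic, which is easier to compare than separate outputs.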

D. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

auto4<-auto3[-c(10:85),]

range(auto4$mpg)

## [1] 11.0 46.6

mean(auto4$mpg)

## [1] 24.40443

sd(auto4$mpg)

## [1] 7.867283

range(auto4$displacement)

## [1]  68 455

mean(auto4$displacement)

## [1] 187.2405

sd(auto4$displacement)

## [1] 99.67837

range(auto4$acceleration)

## [1]  8.5 24.8

mean(auto4$acceleration)

## [1] 15.7269

sd(auto4$acceleration)

## [1] 2.693721
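Negative indices drop rows rather than select them, so -c(10:85) removes 76 observations and leaves 316 of the 392 rows. A small sketch of the same indexing on a made-up 100-row data frame (`toy` is hypothetical):

```r
# Hypothetical data frame with 100 rows.
toy <- data.frame(id = 1:100, value = seq(0.5, 50, by = 0.5))

# Drop rows 10 through 85 by position.
subset_rows <- toy[-(10:85), ]

nrow(subset_rows)        # 100 - 76 rows remain
range(subset_rows$id)    # rows before 10 and after 85 are kept
```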

E. Using the full data set, investigate the predictors graphically, using scatterplots and other tools of your choice. Create some plots (at least 3) highlighting the relationships among the predictors. Comment on your findings.

plot(auto3$mpg, auto3$displacement)

There seems to be evidence for a strong, negative association between mpg and displacement that could potentially be linear.

plot(auto3$acceleration, auto3$name)

The points are scattered with no discernible pattern, so there appears to be no association between acceleration and name.

plot(auto3$weight, auto3$horsepower)

There seems to be evidence for a strong, positive, linear association between weight and horsepower.

F. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg?

plot(auto3$displacement, auto3$mpg)

Since the scatterplot (with x = displacement and y = mpg) shows a relatively strong, negative association, a linear regression of mpg on displacement could help predict mpg for a given displacement. We would then examine the residual plot to check that a linear model is appropriate.
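The fit-then-check-residuals idea can be sketched as follows. The data here are simulated with a negative linear trend (the coefficients 40 and -0.06 and the data frame `toy` are assumptions for illustration, not estimates from the Auto data):

```r
# Simulated data with a roughly linear, negative trend (not the Auto data).
set.seed(1)
displacement <- seq(70, 450, length.out = 50)
mpg <- 40 - 0.06 * displacement + rnorm(50, sd = 2)
toy <- data.frame(displacement, mpg)

# Simple linear regression of mpg on displacement.
fit <- lm(mpg ~ displacement, data = toy)
coef(fit)

# Residuals against fitted values: curvature or a funnel shape
# would argue against a straight-line model.
plot(fitted(fit), resid(fit), xlab = "Fitted mpg", ylab = "Residual")
abline(h = 0, lty = 2)

# Predicted mpg at a chosen displacement.
predict(fit, newdata = data.frame(displacement = 200))
```

On the real data, the same calls with data = auto3 would apply; whether the residuals look acceptable is something to judge from the plot.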

Problem 2: Working with vectors and matrices

A. Construct a matrix, where rows represent each movie. Name this matrix starWars and output it.

# Box office Star Wars (in millions!)

new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.9)
return_jedi <- c(309.306, 165.8)

starWars <- matrix(data=c(new_hope, empire_strikes, return_jedi), nrow=3, ncol=2, byrow=TRUE)
starWars

##         [,1]  [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8

B. Rename the rows and columns of the matrix you created in Part A with the vector ‘region’ for columns and the vector ‘titles’ for rows.

# Vectors region and titles, used for naming

region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

dimnames(starWars) <- list(titles, region)
starWars

##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8

C. Calculate the worldwide box office figures for each movie using the rowSums() function. Name and output this vector.

worldwide<-rowSums(starWars)
worldwide

##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                 775.398                 538.375                 475.106

D. Now we want to add a column to our matrix for worldwide sales. You can do this by using the cbind() function. This function binds columns together.

world_sales<-cbind(starWars, worldwide)
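A self-contained sketch of this step, rebuilding the Part A matrix from the figures above. cbind() binds the objects it is given, so the row totals are passed alongside the matrix; passing the column names as strings would instead produce a 1 x 2 character matrix. The argument name `worldwide =` is used here only to label the new column.

```r
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
starWars <- matrix(c(460.998, 314.4,
                     290.475, 247.9,
                     309.306, 165.8),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(titles, region))

# Append the row totals as a third column named "worldwide".
world_sales <- cbind(starWars, worldwide = rowSums(starWars))
world_sales

# For contrast: binding two strings yields a character matrix, not sales data.
names_only <- cbind("US", "non-US")
dim(names_only)
```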

E. Create another matrix for the prequels and name it starWars2. Don’t forget to name the rows and the columns (similar to above)

# Prequels

phantom_menace <- c(474.5, 552.5)
attack_clones <- c(310.7, 338.7)
revenge_sith <- c(380.3, 468.5)
starWars2 <- matrix(data=c(phantom_menace, attack_clones, revenge_sith), nrow=3, ncol=2, byrow=TRUE)
starWars2

##       [,1]  [,2]
## [1,] 474.5 552.5
## [2,] 310.7 338.7
## [3,] 380.3 468.5

titles2 <- c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith")
dimnames(starWars2) <- list(titles2, region)
starWars2

##                         US non-US
## The Phantom Menace   474.5  552.5
## Attack of the Clones 310.7  338.7
## Revenge of the Sith  380.3  468.5

F. Make one big matrix that combines all the movies (from starWars and starWars2) using rbind(). This binds rows or in this case can be used to combine two matrices. Name this new matrix allStarWars.

allStarWars<-rbind(starWars, starWars2)
allStarWars

##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
## The Phantom Menace      474.500  552.5
## Attack of the Clones    310.700  338.7
## Revenge of the Sith     380.300  468.5

G. Find the total non-US revenue for all the movies using the colSums() function.

colSums(allStarWars)

##       US   non-US 
## 2226.279 2087.800

The total non-US revenue for all the movies is $2,087.8 million.
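Because colSums() returns a named vector, the non-US total can be pulled out by name rather than read off the printed output. A short sketch that rebuilds the combined matrix from the figures above (row names are omitted for brevity):

```r
# Combined box office figures, in millions, one row per movie.
allStarWars <- matrix(c(460.998, 314.4,
                        290.475, 247.9,
                        309.306, 165.8,
                        474.5,   552.5,
                        310.7,   338.7,
                        380.3,   468.5),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(NULL, c("US", "non-US")))

totals <- colSums(allStarWars)

# Extract a single total by its column name.
totals[["non-US"]]
```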

Problem 3: College Data

A. Use the read.csv() function to read the data into R.

college<-read.csv("College.csv", header=TRUE, na.strings="?")

B. Use the View() function to look at the data.

# Use the first column (the college names) as row names, then drop it
rownames(college) <- college[,1]
college <- college[,-1]
# Show the first six rows
head(college)

##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55

C. Perform the following tasks and provide the code: a. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables in the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

pairs(college[,1:10])

Use the plot() or ggplot() function to produce side-by-side boxplots of Outstate vs Private.

plot(college$Private, college$Outstate)
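plot() draws boxplots here because its first argument is a factor. In R 4.0 and later, read.csv() leaves strings as character vectors by default, so Private may first need as.factor(). A sketch with made-up tuition figures (`toy` is hypothetical), using the equivalent boxplot() formula interface:

```r
# Hypothetical data; tuition values are illustrative only.
toy <- data.frame(
  Private  = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  Outstate = c(12000, 15000, 11000, 6000, 7500, 9000)
)

# One box per level of Private, with labeled axes.
bp <- boxplot(Outstate ~ Private, data = toy,
              xlab = "Private", ylab = "Out-of-state tuition")

# boxplot() invisibly returns the statistics it plotted.
bp$names
```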

Create a new qualitative variable, called Elite, by binning the Top10perc variable.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate      Elite    
##  Min.   : 10.00   No :699  
##  1st Qu.: 53.00   Yes: 78  
##  Median : 65.00            
##  Mean   : 65.46            
##  3rd Qu.: 78.00            
##  Max.   :118.00

There are 78 elite universities (those with Top10perc > 50).
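The same binning can be written in one step with ifelse(), and table() gives the count directly. A sketch on a made-up vector (`top10` is hypothetical, not the College data):

```r
# Hypothetical Top10perc values.
top10 <- c(10, 50, 51, 96, 23)

# Strictly greater than 50 counts as elite, so 50 itself is "No".
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

# Count of schools in each bin.
table(elite)
```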

plot(college$Elite, college$Outstate)

Jack Boydell

9/9/2019
