MATH239: Homework #1

##Problem 1: Auto Data

#Part A

Auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data", 
                   header=TRUE,
                   na.strings = "?")
str(Auto)

## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

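The predictors mpg, cylinders, displacement, horsepower, weight, acceleration, and year are quantitative. The predictor name is qualitative. The predictor origin is stored as integer codes from 1 to 3 and is treated as quantitative in the parts below, although it labels a category rather than measuring a quantity.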
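
A quick way to cross-check this classification is to ask R which columns it stores as numbers. The sketch below uses a small hypothetical data frame (`toy`, with made-up values, not the Auto data) so that it runs without a network connection. Numeric storage alone does not make a variable quantitative, since coded categories are stored as integers too.

```r
# Hypothetical stand-in for a few columns of Auto; values are illustrative only
toy <- data.frame(
  mpg    = c(18, 15, 24),
  origin = c(1L, 1L, 3L),
  name   = factor(c("car a", "car b", "car c"))
)

# TRUE for columns stored as numbers, FALSE for factors and characters
is_num <- sapply(toy, is.numeric)

names(toy)[is_num]   # columns R stores numerically
names(toy)[!is_num]  # columns R stores as categories
```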

#Part B

#Range of each quantitative predictor:
range(Auto$mpg)

## [1]  9.0 46.6

range(Auto$cylinders)

## [1] 3 8

range(Auto$displacement)

## [1]  68 455

range(Auto$horsepower, na.rm=TRUE)

## [1]  46 230

range(Auto$weight)

## [1] 1613 5140

range(Auto$acceleration)

## [1]  8.0 24.8

range(Auto$year)

## [1] 70 82

range(Auto$origin)

## [1] 1 3

#Part C

#Mean of each quantitative predictor:
mean(Auto$mpg)

## [1] 23.51587

mean(Auto$cylinders)

## [1] 5.458438

mean(Auto$displacement)

## [1] 193.5327

mean(Auto$horsepower, na.rm=TRUE)

## [1] 104.4694

mean(Auto$weight)

## [1] 2970.262

mean(Auto$acceleration)

## [1] 15.55567

mean(Auto$year)

## [1] 75.99496

mean(Auto$origin)

## [1] 1.574307

#Standard deviation of each quantitative predictor:
sd(Auto$mpg)

## [1] 7.825804

sd(Auto$cylinders)

## [1] 1.701577

sd(Auto$displacement)

## [1] 104.3796

sd(Auto$horsepower, na.rm=TRUE)

## [1] 38.49116

sd(Auto$weight)

## [1] 847.9041

sd(Auto$acceleration)

## [1] 2.749995

sd(Auto$year)

## [1] 3.690005

sd(Auto$origin)

## [1] 0.8025495
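
The column-by-column calls in Parts B and C can also be collapsed into a single `sapply()` call. This is a minimal sketch on a hypothetical data frame (`toy`, with made-up values and one missing horsepower entry), not a rerun of the Auto results above.

```r
# Hypothetical stand-in data frame with one missing value
toy <- data.frame(
  mpg        = c(18, 15, 24, 30),
  horsepower = c(130, NA, 95, 70),
  weight     = c(3504, 3693, 2372, 2130)
)

# Compute min, max, mean, and sd for every column at once,
# dropping missing values in each column
stats <- sapply(toy, function(x) {
  c(min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE))
})

round(stats, 2)
```

The result is a matrix with one row per statistic and one column per variable, which is easier to scan than sixteen separate outputs.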

#Part D

#Subset of data with the 10th through 85th observations removed.
Auto2<-Auto[-c(10:85),]
#Range of each quantitative predictor in the data subset:
range(Auto2$mpg)

## [1] 11.0 46.6

range(Auto2$cylinders)

## [1] 3 8

range(Auto2$displacement)

## [1]  68 455

range(Auto2$horsepower, na.rm=TRUE)

## [1]  46 230

range(Auto2$weight)

## [1] 1649 4997

range(Auto2$acceleration)

## [1]  8.5 24.8

range(Auto2$year)

## [1] 70 82

range(Auto2$origin)

## [1] 1 3

#Mean of each quantitative predictor in the data subset:
mean(Auto2$mpg)

## [1] 24.43863

mean(Auto2$cylinders)

## [1] 5.370717

mean(Auto2$displacement)

## [1] 187.0498

mean(Auto2$horsepower, na.rm=TRUE)

## [1] 100.9558

mean(Auto2$weight)

## [1] 2933.963

mean(Auto2$acceleration)

## [1] 15.72305

mean(Auto2$year)

## [1] 77.15265

mean(Auto2$origin)

## [1] 1.598131

#Standard deviation of each quantitative predictor in the data subset:
sd(Auto2$mpg)

## [1] 7.908184

sd(Auto2$cylinders)

## [1] 1.653486

sd(Auto2$displacement)

## [1] 99.63539

sd(Auto2$horsepower, na.rm=TRUE)

## [1] 35.89557

sd(Auto2$weight)

## [1] 810.6429

sd(Auto2$acceleration)

## [1] 2.680514

sd(Auto2$year)

## [1] 3.11123

sd(Auto2$origin)

## [1] 0.8161627
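
It can be worth confirming that the negative index in Part D drops exactly the intended rows. The sketch below does this on a hypothetical 100-row data frame (`toy`) whose `id` column records each row's original position. Applied to Auto, the same check would be expected to leave 397 - 76 = 321 rows.

```r
# Hypothetical stand-in: 100 rows identified by their original position
toy <- data.frame(id = 1:100)

# Negative indices drop rows 10 through 85 and keep everything else
toy2 <- toy[-(10:85), , drop = FALSE]

nrow(toy2)      # 100 - 76 = 24 rows remain
range(toy2$id)  # the first and last rows are still present
```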

#Part E

plot(Auto$mpg, Auto$weight)

The scatterplot comparing mpg and weight suggests that as the weight of a car decreases, the miles per gallon increases. This makes sense because a heavier car requires more gas to accelerate its weight than a lighter car does.

plot(Auto$year, Auto$mpg)

The scatterplot comparing model year and mpg suggests that mpg, and hence fuel efficiency, has increased over the years as newer manufacturing methods were developed and implemented.

plot(Auto$cylinders, Auto$horsepower)

The scatterplot comparing the number of cylinders and horsepower shows a trend suggesting that the more cylinders a car has, the greater its horsepower.

#Part F

pairs(Auto)

When observing the pairwise scatterplots between mpg and the other variables, the plots of mpg against displacement, horsepower, and weight each show a trend in which mpg decreases as the x variable increases. These findings suggest that if you knew the displacement, horsepower, and weight of a car, it could be possible to predict its gas mileage. The plots of mpg against cylinders and origin have similar structures, suggesting that these variables may also help predict mpg. In general, it would be difficult to predict the mpg of a car given only these variables because several other confounding variables can affect the prediction.
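
The visual trends can be backed up with correlation coefficients. The sketch below is an illustration on `mtcars`, a different car dataset that ships with R, used as a stand-in so the code runs offline. Its column names (`disp`, `hp`, `wt`) differ from those in Auto, and for Auto itself `cor()` would need `use = "complete.obs"` because horsepower contains missing values.

```r
# mtcars ships with R; it is used here only as a stand-in for Auto
vars <- c("disp", "hp", "wt")

# Pearson correlation of mpg with displacement, horsepower, and weight
cors <- sapply(mtcars[vars], function(x) cor(mtcars$mpg, x))

round(cors, 2)
```

Negative values for all three variables would match the downward trends seen in the scatterplots.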

##Problem 2: Working with vectors and matrices

#Data

#Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

#Vectors region and titles, used for naming
region <-c("US", "non-US")
titles <- c("A New Hope", "The Empire Stikes Back", "Return of the Jedi")

#Part A

#Construct a matrix
starWars <- matrix(c(new_hope, empire_strikes, return_jedi), nrow=3, byrow = TRUE)
starWars

##         [,1]  [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8

#Part B

#Rename the row and columns of the matrix
colnames(starWars) <- region
rownames(starWars) <- titles
starWars

##                             US non-US
## A New Hope             460.998  314.4
## The Empire Stikes Back 290.475  247.9
## Return of the Jedi     309.306  165.8

#Part C

#Worldwide box office figures for each movie
worldwide_figures<- rowSums(starWars)
worldwide_figures

##             A New Hope The Empire Stikes Back     Return of the Jedi 
##                775.398                538.375                475.106

#Part D

#Add column to matrix for worldwide sales
worldwide_wars <- cbind(starWars, worldwide_figures)
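
As an alternative sketch, Parts A through D can be folded into two statements by passing `dimnames` directly to `matrix()` and naming the new column inside `cbind()`. The column name `worldwide` is a choice made for this example; the figures are the same ones used above.

```r
# Same box office figures as above (in millions of dollars)
starWars <- matrix(
  c(460.998, 314.4,
    290.475, 247.9,
    309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# Append the row totals as a third column named "worldwide"
worldwide_wars <- cbind(starWars, worldwide = rowSums(starWars))
worldwide_wars
```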

#Prequel Data

phantom_menace <- c(474.5, 552.5)
attack_clones <- c(310.7, 338.7)
revenge_sith <- c(380.3, 468.5)
titles2<-c("The Phantom Menace", "Attack of the Clones", "Revent of the Sith")

#Part E

#Create a matrix for the prequels with row and column names
starWars2 <- matrix(c(phantom_menace, attack_clones, revenge_sith), 
                    nrow=3, byrow = TRUE)

colnames(starWars2) <- region
rownames(starWars2) <- titles2
starWars2

##                         US non-US
## The Phantom Menace   474.5  552.5
## Attack of the Clones 310.7  338.7
## Revent of the Sith   380.3  468.5

#Part F

#Combine starWars and starWars2 into one matrix
allStarWars <- rbind(starWars,starWars2)
allStarWars

##                             US non-US
## A New Hope             460.998  314.4
## The Empire Stikes Back 290.475  247.9
## Return of the Jedi     309.306  165.8
## The Phantom Menace     474.500  552.5
## Attack of the Clones   310.700  338.7
## Revent of the Sith     380.300  468.5

#Part G

#Total non-US revenue for all movies
colSums(allStarWars)

##       US   non-US 
## 2226.279 2087.800

The total non-US revenue for all the movies is $2,087.8 million (about $2.09 billion).
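
If only the non-US total is needed, the column can be selected by name rather than read off the full `colSums()` output. The sketch below rebuilds the combined matrix from the figures above without row names, so that it runs on its own.

```r
# US and non-US figures for all six films, in millions of dollars
allStarWars <- rbind(
  c(460.998, 314.4),
  c(290.475, 247.9),
  c(309.306, 165.8),
  c(474.5,   552.5),
  c(310.7,   338.7),
  c(380.3,   468.5)
)
colnames(allStarWars) <- c("US", "non-US")

# Sum only the non-US column, selected by name
non_us_total <- sum(allStarWars[, "non-US"])
non_us_total
```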

##Problem 3: College data

#Part A

#Reading data into R:
college<-read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/College.csv",header=TRUE)

#Part B

#Create row names column with the name of each university recorded
rownames(college)<- college[,1]
#First data column becomes 'Private'
college <- college[,-1]

#Part C

#Numerical summary of the variables in the data set
summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

#Scatterplot matrix of the first ten variables in the data
pairs(college[,1:10])

#Boxplot of Outstate vs Private
plot(college$Private, college$Outstate, main = "Outstate vs Private", xlab = "Private", ylab = "Outstate")

#Create Elite variable based on whether or not the proportion of students coming from the top 10% of their high school class exceeds 50%
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate      Elite    
##  Min.   : 10.00   No :699  
##  1st Qu.: 53.00   Yes: 78  
##  Median : 65.00            
##  Mean   : 65.46            
##  3rd Qu.: 78.00            
##  Max.   :118.00

There are 78 elite universities.
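
The count can also be read directly, without scanning the full `summary()` output. The sketch below uses a hypothetical vector of `Top10perc` values (`top10`, made-up numbers) and builds the labels with `ifelse()`. Because the comparison is strict, a school at exactly 50% is not labelled elite.

```r
# Hypothetical Top10perc values for five schools
top10 <- c(23, 76, 50, 51, 12)

# ifelse() builds the labels in one step; exactly 50 does not count as elite
elite <- factor(ifelse(top10 > 50, "Yes", "No"))

table(elite)     # counts per level
sum(top10 > 50)  # number of elite schools: 2
```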

#Side-by-side boxplots of Outstate vs Elite
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

ggplot(college, aes(x=Elite, y=Outstate, fill=Elite))+
         geom_boxplot()

Bethany Fletcher

1/29/2020