HW 1

1. Consider the mtcars data set.

Which cars have 4 forward gears?

mtcars[mtcars$gear == 4,]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

The above list has all the cars with 4 forward gears.

Which cars have 4 forward gears and manual transmission?

mtcars[mtcars$gear == 4 & mtcars$am == 1,]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

The above list has all of the cars with both 4 forward gears and manual transmission.

Which cars have 4 forward gears or manual transmission?

mtcars[mtcars$gear == 4 | mtcars$am == 1,]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

The above list has all the cars with 4 forward gears, manual transmission, or both.
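The three subsets above differ only in how the two logical conditions are combined. As a rough check, the size of each subset can be counted by summing the conditions, since TRUE counts as 1. The variable names below are only illustrative.

```r
# Logical conditions on the built-in mtcars data frame
four_gears <- mtcars$gear == 4
manual     <- mtcars$am == 1

# sum() counts the TRUE values, i.e. the number of matching cars
n_gears <- sum(four_gears)            # cars with 4 forward gears
n_both  <- sum(four_gears & manual)   # both conditions hold
n_any   <- sum(four_gears | manual)   # at least one condition holds

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
stopifnot(n_any == n_gears + sum(manual) - n_both)
c(gears = n_gears, both = n_both, either = n_any)
```

The counts should match the number of rows printed in each of the three tables.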

Find the mean mpg of the cars with 2 carburetors.

mean(mtcars[mtcars$carb == 2,]$mpg)

## [1] 22.4

The expression mtcars[mtcars$carb == 2,] limits the data to cars with 2 carburetors, $mpg selects the mpg column of those rows, and mean gives the average of that column.
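The same average can be reached in other ways. The sketch below indexes the mpg column directly and then uses tapply to get the mean mpg for every carburetor count at once; the variable names are made up for illustration.

```r
# Index the mpg column directly instead of subsetting the whole data frame
two_carb_mean <- mean(mtcars$mpg[mtcars$carb == 2])

# Mean mpg for each carburetor count (names of the result are the carb values)
by_carb <- tapply(mtcars$mpg, mtcars$carb, mean)

two_carb_mean
by_carb
```

The entry of by_carb named "2" should agree with the single mean computed above.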

R has a built-in vector rivers which contains the lengths of major North American rivers.

Use ?rivers to learn about the data set.

?rivers

## starting httpd help server ...

##  done

Putting a question mark in front of the name of a data set opens the help page for that data set. The output above only shows the help server starting; the information about the rivers data appears in the help viewer.

Find the mean and sd of the rivers data.

mean(rivers)

## [1] 591.1844

sd(rivers)

## [1] 493.8708

Mean and sd give the average and standard deviation of the rivers data set.

Make a histogram of the rivers data.

hist(rivers)

The hist function displays a histogram of the given data set, in this case the rivers data.

Get the five number summary of rivers data.

summary(rivers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   135.0   310.0   425.0   591.2   680.0  3710.0

The summary function gives the five number summary of the rivers data (minimum, first quartile, median, third quartile and maximum), along with the mean.
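Base R also has fivenum, which returns only the five numbers. It reports Tukey's hinges rather than quartiles, so for some sample sizes its middle values can differ slightly from summary. A minimal sketch, with illustrative labels added by hand:

```r
# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(rivers)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
five
```

For the rivers data the hinges coincide with the quartiles shown by summary.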

Find the longest and shortest river in the set.

min(rivers)

## [1] 135

max(rivers)

## [1] 3710

The min and max functions show the minimum and maximum values of a given data set. The rivers vector has no names, so only the lengths can be reported: the shortest river is 135 miles and the longest is 3710 miles.

Make a list of all the lengths of the rivers longer than 1000 miles.

rivers[rivers > 1000]

##  [1] 1459 1450 1243 2348 1171 3710 2315 2533 1306 1054 1270 1885 1100 1205
## [15] 1038 1770

Above is a list of the lengths of all the rivers longer than 1000 miles.
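The comparison rivers > 1000 is itself a logical vector, so it can be reused to count the long rivers or find their positions. A small sketch, with variable names chosen only for illustration:

```r
long <- rivers > 1000        # logical vector, one entry per river
n_long <- sum(long)          # number of rivers longer than 1000 miles
positions <- which(long)     # where those rivers sit in the rivers vector
sorted_long <- sort(rivers[long], decreasing = TRUE)  # longest first

n_long
sorted_long
```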

Let x <- c(1,2,3) and y <- c(6,5,4). Predict what will happen when the following pieces of code are run. Check your answer.

x <- c(1,2,3)
y <- c(6,5,4)

The above code assigns the vector (1,2,3) to the variable x and the vector (6,5,4) to the variable y.

x*2

## [1] 2 4 6

Each number in the vector will be multiplied by 2.

x*y

## [1]  6 10 12

The two vectors will be multiplied element by element: 1*6, 2*5 and 3*4. This is neither a cross product nor a dot product.
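To make the distinction concrete, the sketch below compares the element-by-element product with the dot product of the same two vectors. The variable names are illustrative.

```r
x <- c(1, 2, 3)
y <- c(6, 5, 4)

elementwise <- x * y       # multiplies matching positions: 1*6, 2*5, 3*4
dot         <- sum(x * y)  # dot (inner) product, a single number
dot_matrix  <- x %*% y     # same value, returned as a 1 x 1 matrix

elementwise
dot
```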

x[1]*y[2]

## [1] 5

The first number in x (1) and the second number in y (5) will be multiplied together.

Use R to calculate the sum of the squares of all numbers from 1 to 100:

sum((1:100)^2)

## [1] 338350

A number followed by a colon and another number creates a vector of all the integers from the first number to the second. Each value in the vector is squared, and then the squares are added together.
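The result can be checked against the closed form for the sum of the first n squares, n(n + 1)(2n + 1)/6. A short sketch:

```r
n <- 100
direct  <- sum((1:n)^2)                    # square each of 1..n, then add
formula <- n * (n + 1) * (2 * n + 1) / 6   # closed form for the same sum

stopifnot(direct == formula)
direct
```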

Consider the built-in data frame airquality.

How many observations of how many variables are there?

str(airquality)

## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

Str shows that there are 153 observations of 6 variables.

What are the names of the variables?

Looking at the str output above, we can see that it also shows the column names of the airquality data set, which are Ozone, Solar.R, Wind, Temp, Month and Day.

What type of data is each variable?

Again looking at the str output, Ozone, Solar.R, Temp, Month and Day are stored as integers, and Wind is the only variable stored as a decimal number, which str labels num.

Do you agree with the data type that has been given to each variable? What would have been some alternate choices?

Most of the variables could be stored as either integer or numeric, since the recorded values are whole numbers. This makes little practical difference, so I agree with most of the choices made. However, Month could easily have been stored as a character data type with the name of each month instead of its number.
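As a sketch of what an alternate choice for Month might look like, the code below works on a copy of airquality and relabels Month using the built-in month.name vector. It uses a factor rather than plain character so the months keep their calendar order; that choice, and the names aq and month_counts, are assumptions made for illustration.

```r
# Storage class of each column in the built-in airquality data frame
col_classes <- sapply(airquality, class)

# Hypothetical alternative: store Month as a labelled factor (on a copy)
aq <- airquality
aq$Month <- factor(aq$Month, levels = 1:12, labels = month.name)

# Count observations per month, dropping months that never occur
month_counts <- table(droplevels(aq$Month))

col_classes
month_counts
```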

There is a built-in data set state, which is really seven separate variables with names such as state.name, state.region and state.area.

What are the possible regions a state can be in? How many states are in each region?

summary(state.region)

##     Northeast         South North Central          West 
##             9            16            12            13

The summary shows there are 4 regions states can be found in, Northeast, South, North Central and West. There are 9 states in the Northeast, 16 in the South, 12 in the North Central and 13 in the West.

Which states have area less than 10000 square miles?

state.name[state.area < 10000]

## [1] "Connecticut"   "Delaware"      "Hawaii"        "Massachusetts"
## [5] "New Hampshire" "New Jersey"    "Rhode Island"  "Vermont"

Above are the names of the 8 states with an area of less than 10000 square miles.
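Because state.name and state.area are parallel vectors, the same logical index can pull out both the names and the areas. A minimal sketch, with illustrative variable names:

```r
small <- state.area < 10000                     # TRUE for the small states

# Attach the state names to the matching areas, then sort smallest first
small_areas <- setNames(state.area[small], state.name[small])
sorted_small <- sort(small_areas)
sorted_small
```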

Which state’s geographic center is furthest south? (Hint: use which.min)

state.name[which.min(state.center$y)]

## [1] "Florida"

Since y is the north-south coordinate, the minimum y value belongs to the state whose center is furthest south. The which.min function returns the position of that value (the 9th state in the list), and indexing state.name with it gives Florida.
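The difference between min and which.min is that the first returns a value and the second returns a position. A short sketch of how the two relate, with illustrative variable names:

```r
south_lat <- min(state.center$y)         # the smallest y value itself
south_idx <- which.min(state.center$y)   # position of that value in the vector
southernmost <- state.name[south_idx]    # use the position to look up the name

# The value found at the returned position is the minimum
stopifnot(state.center$y[south_idx] == south_lat)
southernmost
```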

Install the package Lahman by clicking on Install under the Packages tab. Type in Lahman. (Or, use the command install.packages("Lahman", repos="http://R-Forge.R-project.org").) Then, load the library into memory by typing library(Lahman). Consider the data set Batting, which should now be available. It contains batting statistics of all major league players broken down by season since 1871. We will be using this data set extensively in the data wrangling chapter of this book.

library(Lahman)

Since knitting runs the document in a fresh R session, the library must be loaded inside the document itself.

a. How many observations of how many variables are there?

str(Batting)

## 'data.frame':    101332 obs. of  22 variables:
##  $ playerID: chr  "abercda01" "addybo01" "allisar01" "allisdo01" ...
##  $ yearID  : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID  : Factor w/ 149 levels "ALT","ANA","ARI",..: 136 111 39 142 111 56 111 24 56 24 ...
##  $ lgID    : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ G       : int  1 25 29 27 25 12 1 31 1 18 ...
##  $ AB      : int  4 118 137 133 120 49 4 157 5 86 ...
##  $ R       : int  0 30 28 28 29 9 0 66 1 13 ...
##  $ H       : int  0 32 40 44 39 11 1 63 1 13 ...
##  $ X2B     : int  0 6 4 10 11 2 0 10 1 2 ...
##  $ X3B     : int  0 0 5 2 3 1 0 9 0 1 ...
##  $ HR      : int  0 0 0 2 0 0 0 0 0 0 ...
##  $ RBI     : int  0 13 19 27 16 5 2 34 1 11 ...
##  $ SB      : int  0 8 3 1 6 0 0 11 0 1 ...
##  $ CS      : int  0 1 1 1 2 1 0 6 0 0 ...
##  $ BB      : int  0 4 2 0 2 0 1 13 0 0 ...
##  $ SO      : int  0 0 5 2 1 1 0 1 0 0 ...
##  $ IBB     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HBP     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SH      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ GIDP    : int  NA NA NA NA NA NA NA NA NA NA ...

Str shows that there are 101332 observations of 22 variables in the Batting data set.

Use the command head(Batting) to get a look at the first six lines of data.

head(Batting)

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
## 6  1  NA  NA NA NA   NA

Head shows the first six lines of data and the headings for the Batting data set.

What is the most number of triples (X3B) that have been hit in a single season? (Hint: how do you get R to ignore the NA values?)

max(Batting$X3B, na.rm = TRUE)

## [1] 36

Max shows the largest number of triples hit in a single season, 36. Setting na.rm = TRUE tells max to ignore the NA values.
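The effect of na.rm is easiest to see on a small vector. The values below are hypothetical and are not taken from Batting.

```r
triples <- c(3, NA, 7, 2)    # hypothetical values with one missing entry

with_na    <- max(triples)                 # NA: the missing value propagates
without_na <- max(triples, na.rm = TRUE)   # NA is dropped before taking the max
n_missing  <- sum(is.na(triples))          # how many values are missing

with_na
without_na
```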

What is the playerID(s) of the person(s) who hit the most number of triples in a single season? In what year did it happen?

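Batting[which(Batting$X3B == max(Batting$X3B, na.rm = TRUE)), c("playerID", "yearID")]

The max function returns the value 36 rather than a row position, so using it directly as a row index, as in Batting[max(Batting$X3B, na.rm = TRUE),], selects row 36 of the data frame (ewellge01 in 1871), which is not the record season. Comparing X3B to the maximum and passing the result to which (which ignores the NA values) selects every row where the maximum occurs, and the column selection shows the playerID and yearID together. The 36-triple season is Chief Wilson's in 1912.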

Bonus: Which player hit the most number of triples in a single season since 1960? (Hint: it was not radatdi01.)
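One way to approach the bonus is to filter by year first and then look for the maximum within the filtered rows. The sketch below uses a small made-up data frame with the same column names as Batting, so the player IDs and numbers are hypothetical. A row position computed on the filtered rows should be used on the filtered rows; applying it to the full data frame would point at an unrelated player.

```r
# Hypothetical toy data with the same column names as Batting
toy <- data.frame(
  playerID = c("aaa01", "bbb01", "ccc01", "ddd01"),
  yearID   = c(1950, 1965, 1980, 1999),
  X3B      = c(30, 12, NA, 21),
  stringsAsFactors = FALSE
)

recent <- toy[toy$yearID >= 1960, ]     # keep seasons from 1960 onward
top <- max(recent$X3B, na.rm = TRUE)    # largest value, not a row number

# Compare against the maximum to find the matching row(s) of 'recent'
best <- recent[which(recent$X3B == top), c("playerID", "yearID")]
best
```

With the real data, replacing toy with Batting should give the answer to the bonus question.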

Ben Schmidt

January 25, 2017