CS 698 – Homework 1

1.20 The built-in data set islands contains the size of the world’s land masses that exceed 10,000 square miles. Use sort() with the argument decreasing=TRUE to find the seven largest land masses.

data("islands")
sort(islands, decreasing=TRUE)[1:7]

##          Asia        Africa North America South America    Antarctica 
##         16988         11506          9390          6795          5500 
##        Europe     Australia 
##          3745          2968
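
A slightly more defensive variant, as a sketch: head() returns at most the requested number of elements, whereas indexing with [1:7] would pad the result with NA if the vector held fewer than seven values. The name top7 is introduced here only for illustration.

```r
# islands ships with base R's datasets package
data("islands")

# sort from largest to smallest, then keep at most the first seven entries
top7 <- head(sort(islands, decreasing = TRUE), 7)
names(top7)
```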

1.21 Load the data set primes (UsingR). This is the set of prime numbers in [1,2003]. How many are there? How many in the range [1,100]? [100,1000]?

# Load the UsingR package, which provides the primes data set
library(UsingR)

data("primes")
# How many elements in this data set
length(primes)

## [1] 304

# How many elements in the range [1,100]
length(primes[primes <= 100])

## [1] 25

# How many elements in the range [100,1000]
length(primes[primes>=100 & primes <= 1000])

## [1] 143
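
The three counts can be cross-checked without UsingR. The sketch below rebuilds the primes up to 2003 with a simple sieve of Eratosthenes; the helper sieve() is hypothetical (it is not part of UsingR) and assumes n is at least 4.

```r
# Hypothetical helper: all primes <= n via the sieve of Eratosthenes (n >= 4)
sieve <- function(n) {
  is_prime <- rep(TRUE, n)
  is_prime[1] <- FALSE                      # 1 is not prime
  for (i in 2:floor(sqrt(n))) {
    if (is_prime[i]) {
      # cross out every multiple of i, starting at i^2
      is_prime[seq(i * i, n, by = i)] <- FALSE
    }
  }
  which(is_prime)
}

p <- sieve(2003)
length(p)                    # primes in [1, 2003]
sum(p <= 100)                # primes in [1, 100]
sum(p >= 100 & p <= 1000)    # primes in [100, 1000]
```

Summing a logical vector counts its TRUE values, so sum(p <= 100) gives the same count as length(p[p <= 100]).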

1.22 Load the data set primes (UsingR). We wish to find all the twin primes. These are numbers \(p\) and \(p+2\), where both are prime.

Explain what primes[-1] returns.

It returns all but the first element of primes.

head(primes[-1])

## [1]  3  5  7 11 13 17

If you set n=length(primes), explain what primes[-n] returns.

It returns all but the last element of primes.

n <- length(primes)
tail(primes[-n])

## [1] 1973 1979 1987 1993 1997 1999

Why might primes[-1]-primes[-n] give clues as to what the twin primes are? How many twin primes are there in the data set?

primes[-1]-primes[-n] gives a vector of the differences between neighbouring primes. Every element equal to 2 marks a pair of twin primes, so counting those elements gives the number of twin-prime pairs in the data set.

sum(primes[-1]-primes[-n]==2)

## [1] 61

Another way is to create a new vector equal to primes plus two and compare it with primes position by position, using [-1] and [-n] to line each prime up with its successor.

new_primes <- primes+2
sum(primes[-1]==new_primes[-n])

## [1] 61
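
The same idea can be written with diff(), which computes exactly the neighbour differences p[-1] - p[-n]. The sketch below uses a short hand-typed vector of primes (not the UsingR data set) so that the pairs themselves can be listed; the names gaps, lower and twins are illustrative.

```r
# The primes up to 31, typed in by hand for illustration
p <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31)

gaps <- diff(p)                  # same values as p[-1] - p[-length(p)]
lower <- p[which(gaps == 2)]     # smaller member of each twin pair

# one row per twin-prime pair
twins <- cbind(p = lower, p_plus_2 = lower + 2)
twins
nrow(twins)                      # number of twin-prime pairs
```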

1.23 For the data set treering, which contains tree-ring widths in dimension-less units, use an R function to answer the following:

How many observations are there?

data("treering")
length(treering)

## [1] 7980

Find the smallest observation.

min(treering)

## [1] 0

Find the largest observation.

max(treering)

## [1] 1.908

How many are bigger than 1.5?

sum(treering>1.5)

## [1] 219

1.24 The data set mandms (UsingR) contains the targeted color distribution in a bag of M&Ms as percentages for various types of packaging. Answer these questions.

Which packaging is missing one of the six colors?

# load the data set
data("mandms")
# find the packaging with exactly one missing color (a value of 0)
names(which(rowSums(mandms==0)==1))

## [1] "Peanut Butter"

# an alternative way using subset function
rownames(subset(mandms,mandms[,1]==0|mandms[,2]==0|
         mandms[,3]==0|mandms[,4]==0|
         mandms[,5]==0|mandms[,6]==0 ))

## [1] "Peanut Butter"

Which types of packaging have an equal distribution of colors?

# find the packaging whose colors all have the same proportion
names(which(rowSums(mandms==rowMeans(mandms))==6))

## [1] "Almond"    "kid minis"

# an alternative way using subset function
rownames(subset(mandms,mandms[,1]==mandms[,2] &
         mandms[,2]==mandms[,3] & 
         mandms[,3]==mandms[,4] & 
         mandms[,4]==mandms[,5] & 
         mandms[,5]==mandms[,6] & 
         mandms[,6]==mandms[,1]))

## [1] "Almond"    "kid minis"
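
Both checks above compare percentages with ==, which works here but can be fragile for values such as 100/6 that have no exact floating-point representation. A tolerance-based sketch is shown below; the matrix mix holds made-up percentages (not the real mandms values) and the 1e-8 tolerance is an arbitrary assumption.

```r
# Hypothetical percentages, NOT the real mandms data
mix <- rbind(
  plain  = c(blue = 10,      brown = 30,      green = 10,      red = 50),
  even   = c(blue = 25,      brown = 25,      green = 25,      red = 25),
  thirds = c(blue = 100 / 3, brown = 100 / 3, green = 100 / 3, red = 0)
)

# a row is uniform when its largest and smallest values (nearly) coincide
is_uniform <- apply(mix, 1, function(row) diff(range(row)) < 1e-8)
names(which(is_uniform))
```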

Which packaging has a single color that is more likely than all the others? What color is this?

# return the packaging in which the largest value in the whole table occurs exactly once
names(which(rowSums(mandms==max(mandms))==1))

## [1] "milk chocolate"

# return the color name
names(which(colSums(mandms==max(mandms))==1))

## [1] "brown"

# an alternative way using subset function
package <- subset(mandms,
(mandms[,1]>mandms[,2] & mandms[,1]>mandms[,3] & mandms[,1]>mandms[,4] &
   mandms[,1]>mandms[,5] & mandms[,1]>mandms[,6])|
(mandms[,2]>mandms[,1] & mandms[,2]>mandms[,3] & mandms[,2]>mandms[,4] &
   mandms[,2]>mandms[,5] & mandms[,2]>mandms[,6])|
(mandms[,3]>mandms[,1] & mandms[,3]>mandms[,2] & mandms[,3]>mandms[,4] &
   mandms[,3]>mandms[,5] & mandms[,3]>mandms[,6])| 
(mandms[,4]>mandms[,1] & mandms[,4]>mandms[,2] & mandms[,4]>mandms[,3] &
   mandms[,4]>mandms[,5] & mandms[,4]>mandms[,6])|
(mandms[,5]>mandms[,1] & mandms[,5]>mandms[,2] & mandms[,5]>mandms[,3] &
   mandms[,5]>mandms[,4] & mandms[,5]>mandms[,6])| 
(mandms[,6]>mandms[,1] & mandms[,6]>mandms[,2] & mandms[,6]>mandms[,3] &
   mandms[,6]>mandms[,4] & mandms[,6]>mandms[,5]))
# return the packaging satisfying the condition
rownames(package)

## [1] "milk chocolate"

# return the color name that is more likely than all the others
names(which.max(package))

## [1] "brown"
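
The first approach above looks for the largest value in the whole table, which gives the right answer for this data set but would miss a packaging whose unique top color is smaller than the overall maximum. A per-row sketch follows; the matrix mix contains made-up percentages (not the real mandms values), and the names has_unique_max, winners and top_color are illustrative.

```r
# Hypothetical percentages, NOT the real mandms data
mix <- rbind(
  plain = c(blue = 10, brown = 30, green = 10, red = 50),
  mild  = c(blue = 40, brown = 30, green = 20, red = 10),
  tied  = c(blue = 40, brown = 40, green = 10, red = 10)
)

# TRUE for rows whose maximum value is attained by exactly one color
has_unique_max <- apply(mix, 1, function(row) sum(row == max(row)) == 1)
winners <- names(which(has_unique_max))

# for each such row, look up the column holding the maximum
top_index <- apply(mix[has_unique_max, , drop = FALSE], 1, which.max)
top_color <- colnames(mix)[top_index]

winners
top_color
```

Here the row mild qualifies even though its top value (40) is below the table-wide maximum (50), which a comparison against max(mix) would not detect.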

1.25 The time variable in the data set nym.2002 (UsingR) contains the time to finish for several participants in the 2002 New York City Marathon. Answer these questions.

How many times are stored in the data set?

data("nym.2002")
length(nym.2002$time)

## [1] 1000

What was the fastest time in minutes? Convert this into hours and minutes using R.

fastest <- min(nym.2002$time)
# return the fastest time in minutes
fastest

## [1] 147.3333

# convert the value into format 'Hours : Minutes'
paste(fastest %/% 60, round(fastest %% 60),sep=":")

## [1] "2:27"

What was the slowest time in minutes? Convert this into hours and minutes using R.

slowest <- max(nym.2002$time)
# return the slowest time in minutes
slowest

## [1] 566.7833

# convert the value into format 'Hours : Minutes'
paste(slowest %/% 60, round(slowest %% 60), sep=":")

## [1] "9:27"
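
Rounding the remainder on its own can produce a minutes field of 60 (for example at 119.7 minutes) and does not zero-pad single-digit minutes. A possible helper, named to_hm here purely for illustration, rounds to whole minutes first and then formats the result with sprintf(). The two finishing times are copied from the output above.

```r
# Hypothetical helper: format a duration in minutes as "H:MM"
to_hm <- function(minutes) {
  total <- round(minutes)                  # round once, on the full value
  sprintf("%d:%02d",
          as.integer(total %/% 60),        # whole hours
          as.integer(total %% 60))         # remaining minutes, zero-padded
}

to_hm(147.3333)   # fastest time reported above
to_hm(566.7833)   # slowest time reported above
to_hm(119.7)      # rolls over to the next hour instead of showing 60 minutes
```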

1.26 For the data set rivers, which is the longest river? The shortest?

# load the data set
data("rivers")
# return the length of the longest river in miles (rivers is an unnamed vector, so only lengths are available)
max(rivers)

## [1] 3710

# return the length of the shortest river in miles
min(rivers)

## [1] 135

1.27 The data set uspop contains decade-by-decade population figures for the United States from 1790 to 1970.

Use names() and seq() to add the year names to the data vector.

# load the data set
data("uspop")
# add the year names to the data vector
names(uspop) <- seq(1790,1970,by=10)
# print the data with the corresponding years
uspop

## Time Series:
## Start = 1790 
## End = 1970 
## Frequency = 0.1 
##   1790   1800   1810   1820   1830   1840   1850   1860   1870   1880 
##   3.93   5.31   7.24   9.64  12.90  17.10  23.20  31.40  39.80  50.20 
##   1890   1900   1910   1920   1930   1940   1950   1960   1970 
##  62.90  76.00  92.00 105.70 122.80 131.70 151.30 179.30 203.20

Use diff() to find the inter-decade differences. Which decade had the greatest increase?

# compute the population change between consecutive decades
delta <- diff(uspop)
# find the decade with the greatest increase
uspop[c(which.max(delta),which.max(delta)+1)]

##  1950  1960 
## 151.3 179.3

max(delta)

## [1] 28

So the decade from 1950 to 1960 had the greatest increase (28 million).

Explain why you could reasonably expect that the difference will always increase with each decade. Is this the case with the data?

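A larger population has more people to have children, and living standards in the United States improved over the past two hundred years, so each decade's increase could reasonably be expected to exceed the last. The post-World War II baby boom is one example of such a surge. The check below shows that every inter-decade difference is positive, that is, the population grew in every decade (TRUE means increase, FALSE means decrease). The differences themselves do not always grow, however: from the figures printed above, the increase fell in 1910 to 1920 (13.7 after 16.0), in 1930 to 1940 (8.9 after 17.1), and in 1960 to 1970 (23.9 after 28.0).

# TRUE where the population increased over the decade (difference > 0)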
delta > 0

## Time Series:
## Start = 1800 
## End = 1970 
## Frequency = 0.1 
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE
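
Note that delta > 0 tests whether the population grew, not whether the growth itself increased. A sketch of the second check applies diff() a second time; the names pop, years, accel and labels are introduced here for illustration only.

```r
# uspop ships with base R's datasets package (values in millions)
data("uspop")
pop   <- as.numeric(uspop)            # drop the ts class for plain indexing
years <- seq(1790, 1970, by = 10)

delta <- diff(pop)                    # 18 inter-decade increases
accel <- diff(delta)                  # 17 changes from one increase to the next

# accel[k] compares the decade years[k+1]..years[k+2] with the decade before it
labels <- paste(years[2:18], years[3:19], sep = "-")

all(delta > 0)        # did the population grow in every decade?
all(accel > 0)        # did the increase itself grow every decade?
labels[accel <= 0]    # decades whose increase was smaller than the previous one
```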

Yalin Zhu

February 8, 2016