Yining Feng Programming with R Assignment #1

Instructions

R markdown is a plain-text file format for integrating text and R code, and creating transparent, reproducible and interactive reports. An R markdown file (.Rmd) contains metadata, markdown and R code “chunks”, and can be “knit” into numerous output types. Answer the test questions by adding R code to the fenced code areas below each item. Once completed, you will “knit” and submit the resulting .html file, as well the .Rmd file. The .html will include your R code and the output.

Before proceeding, look to the top of the .Rmd for the (YAML) metadata block, where the title and output are given. Please change title from ‘Programming with R Assignment #1’ to your name, with the format ‘lastName_firstName.’

If you encounter issues with knitting the .html, please send an email via Canvas to your TA.

Each code chunk is delineated by six (6) backticks; three (3) at the start and three (3) at the end. After the opening ticks, arguments are passed to the code chunk and in curly brackets. Please do not add or remove backticks, or modify the arguments or values inside the curly brackets. An example code chunk is included here:

# Comments are included in each code chunk, simply as prompts

...R code placed here

...R code placed here

You need only enter text inside the code chunks for each test item.

Depending on the problem, grading will be based on: 1) the correct result, 2) coding efficiency and 3) graphical presentation features (labeling, colors, size, legibility, etc). I will be looking for well-rendered displays. In the “knit” document, only those results specified in the problem statements should be displayed. For example, do not output - i.e. send to the Console - the contents of vectors or data frames unless requested by the problem. You should be able to code for each solution in fewer than ten lines; though code for your visualizations may exceed this.

Submit both the .Rmd and .html files for grading

Example Problem with Solution: Use rbinom() to generate two random samples of size 10,000 from the binomial distribution. For the first sample, use p = 0.45 and n = 10. For the second sample, use p = 0.55 and n = 10.

Convert the sample frequencies to sample proportions and compute the mean number of successes for each sample. Present these statistics.

set.seed(123)
sample.one <- table(rbinom(10000, 10, 0.45)) / 10000
sample.two <- table(rbinom(10000, 10, 0.55)) / 10000

successes <- seq(0, 10)

sum(sample.one * successes) # [1] 4.4827

## [1] 4.4827

sum(sample.two * successes) # [1] 5.523

## [1] 5.523

Present the proportions in a vertical, side-by-side barplot color coding the two samples.

counts <- rbind(sample.one, sample.two)

barplot(counts, main = "Comparison of Binomial Sample Proportions", 
  ylab = "Frequency", ylim = c(0,0.3),xlab = "Number of Successes",
  beside = TRUE, col = c("darkblue","red"),legend = rownames(counts),
  names.arg = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Please delete the Instructions and Examples shown above prior to submitting your .Rmd and .html files.

Test Items starts from here - There are 5 sections - 50 points total

Read each question carefully and address each element. Do not output contents of vectors or data frames unless requested.

Section 1: (8 points) This problem deals with vector manipulations.

(1)(a) Create a vector that contains the following, in this order, and output the final, resulting vector. Do not round any values, unless requested. * A sequence of integers from 0 to 4, inclusive. * The number 13 * Three repetitions of the vector c(2, -5.1, -23). * The arithmetic sum of 7/42, 3 and 35/42

vector<-c(c(0,1,2,3,4),13,rep(c(2,-5.1,-23),3),c(7/42+3+35/42))
vector

##  [1]   0.0   1.0   2.0   3.0   4.0  13.0   2.0  -5.1 -23.0   2.0  -5.1
## [12] -23.0   2.0  -5.1 -23.0   4.0

(1)(b) Sort the vector created in (1)(a) in ascending order. Output this result. Determine the length of the resulting vector and assign to “L”. Output L. Generate a descending sequence starting with L and ending with 1. Add this descending sequence arithmetically the sorted vector. This is vector addition, not vector combination. Output the contents. Do not round any values.

vector1<-sort(vector,decreasing = FALSE)
vector1

##  [1] -23.0 -23.0 -23.0  -5.1  -5.1  -5.1   0.0   1.0   2.0   2.0   2.0
## [12]   2.0   3.0   4.0   4.0  13.0

L<-length(vector)
L

## [1] 16

vector0<-L:1
vector0

##  [1] 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1

vector2<-rbind(vector1,vector0)
vector2

##         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## vector1  -23  -23  -23 -5.1 -5.1 -5.1    0    1    2     2     2     2
## vector0   16   15   14 13.0 12.0 11.0   10    9    8     7     6     5
##         [,13] [,14] [,15] [,16]
## vector1     3     4     4    13
## vector0     4     3     2     1

(1)(c) Extract the first and last elements of the vector you have created in (1)(b) to form another vector of the extracted elements. Form a third vector from the elements not extracted. Output these vectors.

part1<- vector1[1]
part2<- vector1[16]
vector3<-(c(part1, part2))
vector3

## [1] -23  13

vector4<-vector1[2:15]
vector4

##  [1] -23.0 -23.0  -5.1  -5.1  -5.1   0.0   1.0   2.0   2.0   2.0   2.0
## [12]   3.0   4.0   4.0

(1)(d) Use the vectors from (c) to reconstruct the vector in (b). Output this vector. Sum the elements and round to two decimal places.

vector1_recombined<-(c(part1, vector4, part2))
vector1_recombined

##  [1] -23.0 -23.0 -23.0  -5.1  -5.1  -5.1   0.0   1.0   2.0   2.0   2.0
## [12]   2.0   3.0   4.0   4.0  13.0

round(sum(vector1_recombined),digits = 2)

## [1] -51.3

Section 2: (10 points) The expression y = sin(x/2) + cos(x/2) is a trigonometric function.

(2)(a) Create a user-defined function - via function() - that implements the trigonometric function above, accepts numeric values, “x,” calculates and returns values “y.”

trigonometric <-function(x){y<-sin(x/2) + cos(x/2) 
return(y)}

(2)(b) Create a vector, x, of 4001 equally-spaced values from -2 to 2, inclusive. Compute values for y using the vector x and your function from (2)(a). Do not output x or y. Find the value in the vector x that corresponds to the maximum value in the vector y. Restrict attention to only the values of x and y you have computed; i.e. do not interpolate. Round to 3 decimal places and output both the maximum y and corresponding x value.

Finding the two desired values can be accomplished in as few as two lines of code. Do not use packages or programs you may find on the internet or elsewhere. Do not output the other elements of the vectors x and y. Relevant coding methods are given in the Quick Start Guide for R.

x<-seq(-2,2,length.out = 4001)
p<-x[which.max(trigonometric(x))]
p#the corresponding x value of the maximum y value

## [1] 1.571

round(trigonometric(p),digits=3) #the maximum y value of the trigonometric function

## [1] 1.414

(2)(c) Plot y versus x in color, with x on the horizontal axis. Show the location of the maximum value of y determined in 2(b). Show the values of x and y corresponding to the maximum value of y in the display. Add a title and other features such as text annotations. Text annotations may be added via text() for base R plots and geom_text() or geom_label() for ggplots.

x<-seq(-2,2,length.out = 4001)
y<-trigonometric(x)
plot(x,y,type="l",col="blue",main = "Trigonometric Function y = sin(x/2) + cos(x/2)")
p<-x[which.max(trigonometric(x))]
q<-round(trigonometric(p),digits=3)
points(p,q,col=6,pch=20)
text(p,q,labels="x=1.571   y=1.414")

Section 3: (8 points) This problem requires finding the point of intersection of two functions. Using the function y = cos(x/2)*sin(x/2), find where the curved line y = -(x/2)^3 intersects it within the range of values used in part (2) (i.e. 4001 equally-spaced values from -2 to 2). Plot both functions on the same display, and show the point of intersection. Present the coordinates of this point as text in the display.

x<-seq(-2,2,length.out = 4001)
func1 <- cos(x/2) * sin(x/2)
func2 <- -(x/2)^3
which(func1 == func2)

## [1] 2001

x[which(func1 == func2)]

## [1] 0

func1[which(func1 == func2)]

## [1] 0

func2[which(func1 == func2)]

## [1] 0

# two_functions <- cos(x/2)*sin(x/2) + (x/2)**3

#index <- sum((cos(x/2)*sin(x/2) + (x/2)**3) == 0)

# which.min(abs(two_functions))
# x_value <- x[which.min(abs(two_functions))]
# y_value <- -x_value
# x_value
# y_value
# cos(x_value/2)*sin(x_value/2)  + (x_value/2)**3

plot(x, cos(x/2)*sin(x/2), main = "Plot of Functions", col = "red",
     xlab = " x values", 
     ylab = "y values",
     type = "l",
     xlim = c(-2.5, 2.5))
points(func1[which(func1 == func2)],
       func2[which(func1 == func2)],
       col = rgb(0,0, 1,0.9),
       pch = 16,
       cex = 2)
lines(x, -(x/2)**3, col = "green")
text(func1[which(func1 == func2)] - 1.5,
     func2[which(func1 == func2)] - 0.1,
     "Intersection at Origin")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)
text(x = -1.5, y = -0.2, paste("x = ",
                               func1[which(func1 == func2)],
                               "  y =  ",
                               func2[which(func1 == func2)]))

Section 4: (12 points) Use the “trees” dataset for the following items. This dataset has three variables (Girth, Height, Volume) on 31 felled black cherry trees.

(4)(a) Use data(trees) to load the dataset. Check and output the structure with str(). Use apply() to return the median values for the three variables. Output these values. Using R and logicals, output the row number and the three measurements - Girth, Height and Volume - of any trees with Girth equal to median Girth. It is possible to accomplish this last request with one line of code.

data(trees)
str(trees)

## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

apply(X=trees, MARGIN=2, FUN=median, na.rm=TRUE)

##  Girth Height Volume 
##   12.9   76.0   24.2

trees[which(trees$Girth==12.9), ] #Output the corresponding row numbers and the three measurements of any trees with Girth equal to median Girth

##    Girth Height Volume
## 16  12.9     74   22.2
## 17  12.9     85   33.8

(4)(b) Girth is defined as the diameter of a tree taken at 4 feet 6 inches from the ground. Convert each diameter to a radius, r. Calculate the cross-sectional area of each tree using pi times the squared radius. Present a stem-and-leaf plot of the radii, and a histogram of the radii in color. Plot Area (y-axis) versus Radius (x-axis) in color showing the individual data points. Label appropriately.

radius<-array(trees$Girth/2)
area<-array(pi*radius^2)
stem(radius, scale = 1)

## 
##   The decimal point is at the |
## 
##    4 | 234
##    5 | 34455667779
##    6 | 055799
##    7 | 013
##    8 | 0278
##    9 | 000
##   10 | 3

cells <- seq(from = 4.0, to = 11.0, by = 0.5)
hist(radius, breaks = cells, col = "blue", right = FALSE,
     main = "Histogram of Radii", xlab = "Radius")

plot(radius,area, col=4,pch=20, cex=1.25, main = "Cross-sectional Area of Tree", xlab="radius",ylab="cross-sectional area")

(4)(c) Present a horizontal, notched, colored boxplot of the areas calculated in (b). Title and label the axis.

boxplot(area, col = 2, range = 1.5, main = "Boxplot of Cross-sectional Area of Tree", 
        ylab = "cross-sectional area", notch = TRUE,horizontal = TRUE)

(4)(d) Demonstrate that the outlier revealed in the boxplot of Volume is not an extreme outlier. It is possible to do this with one line of code using boxplot.stats() or ‘manual’ calculation and logicals. Identify the tree with the largest area and output on one line its row number and three measurements.

boxplot.stats(trees$Volume)

## $stats
## [1] 10.2 19.4 24.2 37.3 58.3
## 
## $n
## [1] 31
## 
## $conf
## [1] 19.1204 29.2796
## 
## $out
## [1] 77

#the outlier revealed in the boxplot of Volume is not an extreme outlier
summary(trees$Volume)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.20   19.40   24.20   30.17   37.30   77.00

#Q1=19.40, Q3=37.30, IQR=17.90  19.40-3*17.90=-34.4   37.30+3*17.90=91
#Since -34.4<77<91, 77 is not an extreme outlier
crossarea<-function(radius){area<-pi*radius^2 
return(area)}
trees[which(trees$Girth==2*radius[which.max(crossarea(radius))]), ]

##    Girth Height Volume
## 31  20.6     87     77

#The corresponding row number and three measurements of the tree with the largest area

Section 5: (12 points) The exponential distribution is an example of a right-skewed distribution with outliers. This problem involves comparing it with a normal distribution which typically has very few outliers.

5(a) Use set.seed(124) and rexp() with n = 100, rate = 5.5 to generate a random sample designated as y. Generate a second random sample designated as x with set.seed(127) and rnorm() using n = 100, mean = 0 and sd = 0.15.

Generate a new object using cbind(x, y). Do not output this object; instead, assign it to a new name. Pass this object to apply() and compute the inter-quartile range (IQR) for each column: x and y. Use the function IQR() for this purpose. Round the results to four decimal places and present (this exercise shows the similarity of the IQR values.).

For information about rexp(), use help(rexp) or ?rexp(). Do not output x or y.

set.seed(124)
y <- rexp(n=100, rate =5.5)
set.seed(127)
x <- rnorm(n=100, mean=0, sd=0.15)
round(apply(X=cbind(x,y),MARGIN=2, FUN=IQR, na.rm=TRUE), digits=4)

##      x      y 
## 0.2041 0.2164

#The interquartile range of x is 0.2041, whereas the interquartile range of y is 0.2164

(5)(b) This item will illustrate the difference between a right-skewed distribution and a symmetric one. For base R plots, use par(mfrow = c(2, 2)) to generate a display with four diagrams; grid.arrange() for ggplots. On the first row, for the normal results, present a histogram and a horizontal boxplot for x in color. For the exponential results, present a histogram and a horizontal boxplot for y in color.

par(mfrow =c(2,2))
plot1<-hist(x, prob=T, col=3)
plot2<-boxplot(x, col = 7, range = 1.5, main = "Boxplot of x",notch = TRUE,horizontal = TRUE)
plot3<-hist(y, prob=T, col=4)
plot4<-boxplot(y, col = 5, range = 1.5, main = "Boxplot of y",notch = TRUE,horizontal = TRUE)

(5)(c) QQ plots are useful for detecting the presence of heavy-tailed distributions. Present side-by-side QQ plots, one for each sample, using qqnorm() and qqline(). Add color and titles. In base R plots, “cex” can be used to control the size of the plotted data points and text. Lastly, determine if there are any extreme outliers in either sample.Remember extreme outliers are based on 3.0IQR in the box plot. R uses a default value of 1.5IQR to define outliers (not extreme) in both boxplot and boxplot stats.

par(mfrow =c(1,2))
plot5 <- qqnorm(x,main = "Normal Q-Q Plot of x",col=2,pch=20,cex=1.25)
plot6 <- qqline(x)
plot7 <- qqnorm(y,main = "Normal Q-Q Plot of y",col=6,pch=20,cex=1.25)
plot8 <- qqline(y)

boxplot.stats(x)

## $stats
## [1] -0.2976325808 -0.1007240230  0.0003706968  0.1088532648  0.3717619627
## 
## $n
## [1] 100
## 
## $conf
## [1] -0.03274251  0.03348391
## 
## $out
## [1] 0.4310593

#the outlier revealed in the boxplot of Volume is not an extreme outlier
summary(x)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.2976326 -0.0971944  0.0003707  0.0056183  0.1068962  0.4310593

#Q1=-1, Q3=1, IQR=2  -1-3*2=-7   1+3*2=7
#Since -7<0<7, 0 is not an extreme outlier
boxplot.stats(y)

## $stats
## [1] 0.003880211 0.053278194 0.152793270 0.271774062 0.576966003
## 
## $n
## [1] 100
## 
## $conf
## [1] 0.1182709 0.1873156
## 
## $out
## [1] 1.4486792 0.6677194 0.6216506

#the outlier revealed in the boxplot of Volume is not an extreme outlier
summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00388 0.05493 0.15279 0.18816 0.27135 1.44868

#Q1=0.05493, Q3=0.27135, IQR=0.21642  0.05493-3*0.21642=-0.59433   0.27135+3*0.21642=0.92061
#Since 1.4486792>0.92061, 1.4486792 is an extreme outlier in sample y
#Since -0.59433<0.6677194<0.92061, 0.6677194 is not an extreme outlier
#Since -0.59433<0.6216506<0.92061, 0.6216506 is not an extreme outlier

#In conclusion, there is no extreme outlier in sample x, whereas there is one extreme outlier 1.4486792 in sample y