Assignment 1 (10%)

[Elliot Currie]

[CIND123 D30 D40 - 500276446]

Instructions

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com.

Use RStudio for this assignment. Complete the assignment by inserting your code wherever you see the string “#INSERT YOUR ANSWER HERE”.

When you click the Knit button, a document (PDF, Word, or HTML format) will be generated that includes both the assignment content as well as the output of any embedded R code chunks.

NOTE: YOU SHOULD NEVER HAVE install.packages IN YOUR CODE; OTHERWISE, THE Knit OPTION WILL GIVE AN ERROR. COMMENT OUT ALL PACKAGE INSTALLATIONS.

Submit both the rmd and generated output files. Failing to submit both files will be subject to mark deduction. PDF or HTML is preferred.

Sample Question and Solution

Use seq() to create the vector \((3, 5, \ldots, 29)\).

seq(3, 30, 2)

##  [1]  3  5  7  9 11 13 15 17 19 21 23 25 27 29

seq(3, 29, 2)

##  [1]  3  5  7  9 11 13 15 17 19 21 23 25 27 29

Question 1 (32 points)

Q1a (8 points)

Create and print a vector x with all integers from 4 to 115 and a vector y containing multiples of 4 in the same range. Hint: use the seq() function. Calculate the difference in lengths of the vectors x and y. Hint: use length().

x<-c(4:115)
x

##   [1]   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
##  [19]  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39
##  [37]  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
##  [55]  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75
##  [73]  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93
##  [91]  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
## [109] 112 113 114 115

y<-seq(4,115,by=4)
length(x)-length(y)

## [1] 84
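As a cross-check on the answer above, the multiples of 4 can also be picked out of x with the modulo operator instead of seq(). This is only a sketch; the names y_seq and y_mod are introduced here for illustration and are not part of the assignment.

```r
# Integers from 4 to 115
x <- 4:115

# Multiples of 4 built with seq(), as in the answer above
y_seq <- seq(4, 115, by = 4)

# The same multiples selected from x with logical indexing
y_mod <- x[x %% 4 == 0]

# Both approaches should give the same values
all(y_seq == y_mod)

# Difference in lengths of the two vectors
length(x) - length(y_mod)
```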

Q1b (8 points)

Create a new vector, y_square, with the square of elements at indices 1, 3, 7, 12, 17, 20, 22, and 24 from the variable y. Hint: Use indexing rather than a for loop. Calculate the mean and median of the FIRST five values from y_square.

y_square<-y[c(1,3,7,12,17,20,22,24)]^2
mean(y_square[1:5])

## [1] 1574.4

median(y_square[1:5])

## [1] 784

Q1c (8 points)

For a given factor variable of factorVar <- factor(c(1, 6, 5.4, 3.2)), would it be correct to use the following command to convert the factor to numeric values?

as.numeric(factorVar)

If not, explain your answer and provide the correct one.

No. This command returns only the integer codes of the factor levels. Because the levels are sorted in increasing order (1, 3.2, 5.4, 6), the result is 1 4 3 2 rather than the original values. To convert correctly, first convert the factor to character and then convert the result to numeric.

factorVar <- factor(c(1,6,5.4,3.2))
as.numeric(as.character(factorVar))

## [1] 1.0 6.0 5.4 3.2
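The difference between level codes and level labels can be seen side by side in the sketch below. The variable names codes, via_char, and via_levels are illustrative only; the second conversion, which indexes the numeric levels by the factor, is an alternative idiom and not something the assignment requires.

```r
factorVar <- factor(c(1, 6, 5.4, 3.2))

# Levels are stored as sorted character labels
levels(factorVar)

# as.numeric() on the factor returns the integer codes of those levels
codes <- as.numeric(factorVar)
codes

# Converting through character recovers the original values
via_char <- as.numeric(as.character(factorVar))
via_char

# Alternative: convert the levels once, then index them by the factor codes
via_levels <- as.numeric(levels(factorVar))[factorVar]
via_levels
```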

Q1d (8 points)

A comma-separated values file dataset.csv contains missing values represented by Not A Number (null) and question mark (?). How can you read this type of file in R? NOTE: Please make sure you have saved the dataset.csv file in your current working directory.

Passing "null" and "?" to the na.strings argument of read.csv() tells R to treat those strings as NA (missing values) when the file is read.

null_dataset<-read.csv("dataset.csv",
                       na.strings = c("null","?"))
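Since dataset.csv itself is not shown here, the sketch below writes a small made-up file to a temporary location and reads it back with the same na.strings setting. The column names and values are hypothetical and exist only to show the effect.

```r
# Write a tiny hypothetical CSV that uses "null" and "?" for missing values
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score,grade",
             "1,90,A",
             "2,null,B",
             "3,75,?"), tmp)

# Read it back, mapping both markers to NA
demo <- read.csv(tmp, na.strings = c("null", "?"))
demo

# Count the missing values per column
colSums(is.na(demo))

# Remove the temporary file
unlink(tmp)
```

Because "null" is converted to NA before type detection, the score column is read as numeric rather than character.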

Question 2 (32 points)

Q2a (8 points)

Compute: \[\sum_{n=5}^{20}\frac{(-1)^{n}}{(n!)^2}\] Hint: Use factorial(n) to compute \(n!\).

sum((-1)^(5:20) / factorial(5:20)^2)

## [1] -6.755419e-05

Q2b (8 points)

Compute: \[\prod_{n=1}^{5} \left( 4n + \frac{1}{2^n} \right)\] NOTE: The symbol \(\Pi\) represents multiplication.

prod(4*(1:5)+1/(2^(1:5)))

## [1] 144833.6
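The vectorised answers to Q2a and Q2b can be checked against plain loops that accumulate one term at a time. This is a verification sketch only; the names vec_sum, loop_sum, vec_prod, and loop_prod are introduced here.

```r
# Q2a: alternating sum for n = 5, ..., 20
n <- 5:20
vec_sum <- sum((-1)^n / factorial(n)^2)

loop_sum <- 0
for (k in n) {
  loop_sum <- loop_sum + (-1)^k / factorial(k)^2
}

# Q2b: product for n = 1, ..., 5
m <- 1:5
vec_prod <- prod(4 * m + 1 / 2^m)

loop_prod <- 1
for (k in m) {
  loop_prod <- loop_prod * (4 * k + 1 / 2^k)
}

# The two approaches should agree up to floating-point tolerance
all.equal(vec_sum, loop_sum)
all.equal(vec_prod, loop_prod)
```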

Q2c (8 points)

Describe what the following R command does: c(0:5)[NA]

c(0:5) creates an integer vector with the values 0 to 5, and [] is the indexing operator, which extracts the elements at the positions given inside the brackets. A bare NA is a logical value, so it is treated as a logical index and recycled to the length of the vector. Because every index is then unknown, R returns NA for each of the six positions.

c(0:5)[NA]

## [1] NA NA NA NA NA NA
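The role of the type of NA can be illustrated by comparing a logical NA index with an integer NA index. The vector name v is used here only for the example.

```r
v <- c(0:5)

# A bare NA is of type logical
typeof(NA)

# Logical NA is recycled to the length of v, giving six NAs
v[NA]

# An integer NA refers to a single unknown position, giving one NA
v[NA_integer_]

# A mixed logical index is recycled too: known TRUEs select, NAs return NA
v[c(TRUE, NA)]
```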

Q2d (8 points)

Describe the purpose of is.vector(), is.character(), is.numeric(), and is.na() functions? Please use x <- c("a", "b", NA, 2) to explain your description.

Each of these functions is a way to check a particular property of an object.

is.vector() checks whether the object is a vector: TRUE means it is a vector, FALSE means it is not. Because x was created with c(), it is an atomic vector, so all of its elements are stored with the same mode (type).

is.character() checks whether the vector is of character type: TRUE means the elements are character, FALSE means they are not. For x it returns TRUE, which tells us that every element is stored as a character string. This matters because the number 2 is stored as the string "2" rather than as a number, which affects later computations.

is.numeric() checks whether the vector is of numeric type: TRUE means the elements are numeric, FALSE means they are not. For x it returns FALSE because the elements were coerced to character.

is.na() reports which elements are missing. It returns a logical vector of the same length as its input, where TRUE means the corresponding element is NA and FALSE means it is not. For x, only the third element is NA.

x <- c("a", "b", NA, 2)
is.vector(x)

## [1] TRUE

is.character(x)

## [1] TRUE

is.numeric(x)

## [1] FALSE

is.na(x)

## [1] FALSE FALSE  TRUE FALSE
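The coercion described above can be made visible with a few extra checks. This is a small sketch; the name num and the list and matrix examples are additions for illustration, not part of the question.

```r
x <- c("a", "b", NA, 2)

# All elements were coerced to character when the vector was built
typeof(x)
x[4]

# Position of the missing value
which(is.na(x))

# Converting back to numeric keeps only values that parse as numbers;
# "a" and "b" become NA (warnings suppressed for the demonstration)
num <- suppressWarnings(as.numeric(x))
num

# is.vector() is also TRUE for lists, but FALSE once a dim attribute is present
is.vector(list(1, "a"))
is.vector(matrix(1:4, nrow = 2))
```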

Question 3 (36 points)

The airquality dataset contains daily air quality measurements in New York from May to September 1973. The variables include Ozone level, Solar radiation, wind speed, temperature in Fahrenheit, month, and day. Please see the detailed description using help("airquality").

Install the airquality data set on your computer using the command install.packages("datasets"). Then load the datasets package into your session.

library(datasets)
airquality<-read.csv("airquality.csv")

Q3a (4 points)

Display the first 10 rows of the airquality data set.

head(airquality, n=10)

##     X Ozone Solar.R Wind Temp Month Day
## 1   1    41     190  7.4   67     5   1
## 2   2    36     118  8.0   72     5   2
## 3   3    12     149 12.6   74     5   3
## 4   4    18     313 11.5   62     5   4
## 5   5    NA      NA 14.3   56     5   5
## 6   6    28      NA 14.9   66     5   6
## 7   7    23     299  8.6   65     5   7
## 8   8    19      99 13.8   59     5   8
## 9   9     8      19 20.1   61     5   9
## 10 10    NA     194  8.6   69     5  10

Q3b (8 points)

Compute the average of the first four variables (Ozone, Solar.R, Wind and Temp) for the fifth month using the sapply() function. Hint: You might need to consider removing the NA values; otherwise, the average will not be computed.

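The expression x[!is.na(x)] keeps only the non-NA values of a vector x. Passing na.rm = TRUE to mean() has the same effect, so the NA values are dropped before each average is computed. The four variables are selected by name so that the row-index column X from the CSV file is not included.

averages_may <- sapply(airquality[airquality$Month == 5, c("Ozone", "Solar.R", "Wind", "Temp")],
                       mean, na.rm = TRUE)
averages_may

##     Ozone   Solar.R      Wind      Temp 
##  23.61538 181.29630  11.62258  65.54839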

Q3c (8 points)

Construct a boxplot for the Wind and Temp variables, then display the values of all the outliers which lie beyond the whiskers.

boxplot(airquality[, c("Wind", "Temp")],
        main = "Boxplots of Wind and Temp", col = "blue")

boxplot(airquality$Wind, main="Boxplot of Wind", col="blue")

boxplot(airquality$Temp, main="Boxplot of Temp", col="red")

bp_wind <- boxplot(airquality$Wind, plot = FALSE)
wind_outliers <- bp_wind$out
print(wind_outliers)

## [1] 20.1 18.4 20.7

bp_temp <- boxplot(airquality$Temp, plot = FALSE)
temp_outliers <- bp_temp$out
print(temp_outliers)

## numeric(0)
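The outliers reported by boxplot() follow the usual rule of 1.5 times the hinge spread beyond the hinges. The sketch below reproduces that rule by hand for Wind and compares it with boxplot.stats(), using the built-in airquality data (an assumption; the assignment reads a CSV file). The names hinges, fence_low, fence_high, and manual_out are introduced for the example.

```r
wind <- datasets::airquality$Wind

# boxplot.stats() returns the same statistics boxplot() uses (coef = 1.5 by default)
stats <- boxplot.stats(wind)
stats$out

# Manual version: lower and upper hinges from Tukey's five-number summary
hinges <- fivenum(wind)[c(2, 4)]
spread <- diff(hinges)

# Fences beyond which a point counts as an outlier
fence_low  <- hinges[1] - 1.5 * spread
fence_high <- hinges[2] + 1.5 * spread

manual_out <- wind[wind < fence_low | wind > fence_high]
manual_out
```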

Q3d (8 points)

Compute the upper quartile of the Wind variable with two different methods. HINT: Only show the upper quartile using indexing. For the type of quartile, please see https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile.

quantile(airquality$Wind)[4]

##  75% 
## 11.5

quantile(airquality$Wind, 0.75)

##  75% 
## 11.5

quantile(airquality$Wind, probs = 0.75, na.rm = FALSE,
         names = TRUE, type = 3)

##  75% 
## 11.5
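Other routes to the upper quartile can be compared in one place. The sketch below assumes the built-in airquality data and adds two approaches that are not in the answer above: indexing the result of summary(), and taking an order statistic directly from the sorted values. The variable names are illustrative.

```r
wind <- datasets::airquality$Wind

# Default quantile definition (type 7)
q_default <- quantile(wind, probs = 0.75)

# A discontinuous definition based on order statistics (type 3)
q_type3 <- quantile(wind, probs = 0.75, type = 3)

# Indexing the third quartile out of summary()
q_summary <- summary(wind)[["3rd Qu."]]

# Direct order statistic: the value at position ceiling(0.75 * n) in sorted order
q_sorted <- sort(wind)[ceiling(0.75 * length(wind))]

c(type7 = unname(q_default), type3 = unname(q_type3),
  summary = q_summary, sorted = q_sorted)
```

The methods agree for this variable, but different quantile types can give different values on other data.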

Q3e (8 points)

Construct a pie chart to describe the number of entries by Month. HINT: use the table() function to count and tabulate the number of entries within a Month.

table(airquality$Month)

## 
##  5  6  7  8  9 
## 31 30 31 31 30

# Use the tabulated counts directly rather than retyping them
air.month <- table(airquality$Month)
names(air.month) <- month.name[5:9]
pie(air.month, main = "Number of entries by month")
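A variant of the pie chart can show each month's share in its label. This sketch assumes the built-in airquality data; the names counts, pct, and slice_labels are illustrative, and the label format is just one possible choice.

```r
# Tabulate entries per month, labelling months 5 to 9 by name
counts <- table(factor(datasets::airquality$Month,
                       levels = 5:9, labels = month.name[5:9]))
counts

# Percentage of all entries that fall in each month
pct <- round(100 * prop.table(counts), 1)

# Labels such as "May (20.3%)"
slice_labels <- paste0(names(counts), " (", pct, "%)")

pie(counts, labels = slice_labels, main = "Number of entries by month")
```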

END of Assignment #1.

CIND 123 - Data Analytics: Basic Methods