Assignment 1 (10%)

Laura Jamikeshova

CIND 123 DHA DHT & 501019811

Instructions

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. Review this website for more details on using R Markdown http://rmarkdown.rstudio.com.

Use RStudio for this assignment. Complete the assignment by inserting your code wherever you see the string “#INSERT YOUR ANSWER HERE”.

When you click the Knit button, a document (PDF, Word, or HTML format) will be generated that includes both the assignment content as well as the output of any embedded R code chunks.

NOTE: YOU SHOULD NEVER HAVE install.packages IN YOUR CODE; OTHERWISE, THE Knit OPTION WILL GIVE AN ERROR. COMMENT OUT ALL PACKAGE INSTALLATIONS.

Submit both the rmd and generated output files. Failing to submit both files will be subject to mark deduction. PDF or HTML is preferred.

Sample Question and Solution

Use seq() to create the vector \((3,5\ldots,29)\).

seq(3, 30, 2)

##  [1]  3  5  7  9 11 13 15 17 19 21 23 25 27 29

seq(3, 29, 2)

##  [1]  3  5  7  9 11 13 15 17 19 21 23 25 27 29

Question 1 (32 points)

Q1a (8 points)

Create and print a vector x with all integers from 15 to 100 and a vector y containing multiples of 5 in the same range. Hint: use the seq() function. Calculate the difference in lengths of the vectors x and y. Hint: use length().

x <- seq(15, 100)
y <- seq(15, 100, by = 5)

length_diff <- length(x) - length(y)

print(x)

##  [1]  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33
## [20]  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52
## [39]  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
## [58]  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
## [77]  91  92  93  94  95  96  97  98  99 100

print(y)

##  [1]  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90  95 100

print(length_diff)

## [1] 68
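As a cross-check on the result above, the multiples of 5 can also be selected from x with the modulo operator rather than generated with a step size. This is only a sketch of an alternative; the name y_alt is introduced here for illustration and is not part of the assignment.

```r
# Recreate the vectors from Q1a so the block runs on its own
x <- seq(15, 100)
y <- seq(15, 100, by = 5)

# Alternative: keep the elements of x that are divisible by 5
y_alt <- x[x %% 5 == 0]

# Both approaches should give the same values and the same length difference
print(all(y_alt == y))
print(length(x) - length(y_alt))
```

Note that all(y_alt == y) is used instead of identical(), because seq(15, 100) returns integers while seq(15, 100, by = 5) returns doubles.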

Q1b (8 points)

Create a new vector, x_square, with the square of elements at indices 1, 11, 21, 31, 41, 51, 61, and 71 from the variable x. Hint: Use indexing rather than a for loop. Calculate the mean and median of the FIRST five values from x_square.

indices <- seq(1, 71, by = 10)
x_square <- x[indices]^2

mean_first_five <- mean(x_square[1:5])
median_first_five <- median(x_square[1:5])

print(x_square)

## [1]  225  625 1225 2025 3025 4225 5625 7225

print(mean_first_five)

## [1] 1425

print(median_first_five)

## [1] 1225

Q1c (8 points)

For a given factor variable factorVar <- factor(c(10.8, 2.7, 5.0, 3.5)), converting the factor to numbers requires either: 1) using levels() to extract the level labels, then as.numeric() to convert the labels to numbers, or 2) using as.character() to convert the values in factorVar, then as.numeric() to convert those values to numbers.

Please provide both solutions

# Method 1
factorVar <- factor(c(10.8, 2.7, 5.0, 3.5))
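# Extract the level labels (named level_labels to avoid masking the levels() function)
level_labels <- levels(factorVar)
# Convert the labels to numbers, then index by the factor to restore the original order
numericVar <- as.numeric(level_labels)[factorVar]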

print(numericVar)

## [1] 10.8  2.7  5.0  3.5

# Method 2
factorVar <- factor(c(10.8, 2.7, 5.0, 3.5))
characterVar <- as.character(factorVar)
numericVar <- as.numeric(characterVar)

print(numericVar)

## [1] 10.8  2.7  5.0  3.5
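The reason both methods go through the labels can be seen by calling as.numeric() on the factor directly. The sketch below assumes the same factorVar as in the question; the names codes and values are illustrative only.

```r
factorVar <- factor(c(10.8, 2.7, 5.0, 3.5))

# Calling as.numeric() directly returns the internal integer codes,
# i.e. the position of each value among the sorted levels
codes <- as.numeric(factorVar)
print(codes)

# Converting the level labels first recovers the original numbers
values <- as.numeric(levels(factorVar))[factorVar]
print(values)
```

The first call yields positions among the levels rather than the stored numbers, which is why a direct conversion silently gives the wrong answer.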

Q1d (8 points)

A comma-separated values file dataset.csv contains missing values represented by Not A Number (null) and question mark (?). How can you read this type of file in R? NOTE: Please make sure you have saved the dataset.csv file in your current working directory.

# Set the file path
file_path <- "dataset.csv"

# Read the CSV file, specifying the missing values
data <- read.csv(file_path, na.strings = c("null", "?"))

# Print the data
print(data)

##     X1  X2  X3  X4  X5  X6  X7  X8  X9 X10
## 1   11  12  13  14  15  16  17  18  19  20
## 2   21  22  23  24  25  26  27  28  29  30
## 3   31  32  33  34  35  36  37  38  39  40
## 4   41  42  43  44  45  NA  47  48  49  50
## 5   51  52  53  NA  55  56  57  NA  59  60
## 6   61  62  63  64  65  66  67  68  69  70
## 7   71  72  NA  74  75  76  77  78  79  80
## 8   81  82  83  84  85  86  87  88  89  NA
## 9   91  92  93  94  95  96  97  98  99 100
## 10  NA 102 103 104 105 106 107 108 109 110
## 11 111 112 113 114 115 116 117 118 119 120
## 12 121 122 123 124 125 126 127 128 129 130
## 13 131 132 133 134 135 136 137 138 139  NA
## 14 141 142 143 144 145 146 147 148 149 150
## 15 151 152 153 154 155 156 157 158 159 160
## 16 161 162 163 164  NA 166 167 168 169 170
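To see what na.strings changes, the following sketch reads a small made-up CSV twice, once without and once with the argument. The file contents and the names tmp, raw, and clean are hypothetical and stand in for the real dataset.csv, which is not reproduced here.

```r
# Write a small hypothetical CSV to a temporary file (not the real dataset.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "1,null,3",
             "?,5,6",
             "7,8,null"), tmp)

# Without na.strings, the markers are kept as text, so the affected
# columns are read as character (or factor in R versions before 4.0)
raw <- read.csv(tmp)
print(sapply(raw, class))

# With na.strings, the markers become NA and the columns stay numeric
clean <- read.csv(tmp, na.strings = c("null", "?"))
print(sapply(clean, class))

# Number of missing values per column
print(colSums(is.na(clean)))
```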

Question 2 (32 points)

Q2a (8 points)

Compute: \[\frac{1}{4!} \sum_{n=10}^{40}3^{n}\] Hint: Use factorial(n) to compute \(n!\).

# Compute the factorial of 4
factorial_4 <- factorial(4)

# Initialize the sum variable
sum_value <- 0

# Iterate from 10 to 40
for (n in 10:40) {
  sum_value <- sum_value + 3^n
}

# Compute the final result
result <- (1 / factorial_4) * sum_value

# Print the result
print(result)

## [1] 7.598541e+17
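The same value can be obtained without a loop, since arithmetic in R is vectorised. As a sketch, the block below also checks the result against the closed form of a geometric series, \(\sum_{n=10}^{40} 3^n = (3^{41} - 3^{10})/2\). The names result_vec and result_closed are illustrative.

```r
# Vectorised form: 3^(10:40) builds all 31 terms at once
result_vec <- sum(3^(10:40)) / factorial(4)

# Closed form of the geometric series: (3^41 - 3^10) / (3 - 1)
result_closed <- (3^41 - 3^10) / (2 * factorial(4))

print(result_vec)

# Compare with a tolerance, since both are floating-point results
print(all.equal(result_vec, result_closed))
```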

Q2b (8 points)

Compute: \[\prod_{n=1}^{20} \left( 3n + \frac{1}{n} \right)\] NOTE: The symbol \(\Pi\) represents multiplication.

# Initialize the product variable
product_value <- 1

# Iterate from 1 to 20
for (n in 1:20) {
  term <- 3*n + 1/n
  product_value <- product_value * term
}

# Print the result
print(product_value)

## [1] 1.373708e+28
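A loop-free version is also possible here, assuming the same formula: prod() multiplies all elements of a vector, just as sum() adds them. The name product_vec is illustrative.

```r
# Build all 20 terms at once, then multiply them together
n <- 1:20
product_vec <- prod(3 * n + 1 / n)

print(product_vec)
```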

Q2c (8 points)

Describe what the following R command does: c(0:5)[NA]

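# The R command c(0:5)[NA] creates the integer vector 0, 1, 2, 3, 4, 5 and then indexes it with NA.
# Because NA is a logical constant, it is recycled to the length of the vector like any other logical index,
# and an NA index yields NA at each position, so the result is a vector of six NA values.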
# Create the vector with values from 0 to 5
vector <- c(0:5)

# Use NA as the index; the logical NA is recycled across all six positions
result <- vector[NA]

# Print the result
print(result)

## [1] NA NA NA NA NA NA
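The six NA values follow from the type of the index. A short sketch, using an illustrative vector v, contrasts the logical NA with an integer NA:

```r
v <- 0:5

# NA on its own is a logical constant
print(class(NA))

# A logical index is recycled to length(v), so every position yields NA
print(v[NA])

# An integer NA is treated as a single unknown position instead
print(v[NA_integer_])

# Ordinary logical indexing recycles the same way: TRUE selects every element
print(v[TRUE])
```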

Q2d (8 points)

Describe the purpose of is.vector(), is.character(), is.numeric(), and is.na() functions? Please use x <- c("a","b",NA,2) to explain your description.

# Here's the code that demonstrates the purpose of is.vector(), is.character(), is.numeric(), and is.na() functions using the vector x <- c("a", "b", NA, 2):
# Create the vector
x <- c("a", "b", NA, 2)

# Check if x is a vector
is_vector <- is.vector(x)
print(is_vector)

## [1] TRUE

# Check whether x as a whole is a character vector (returns a single value)
is_character <- is.character(x)
print(is_character)

## [1] TRUE

# Check whether x as a whole is a numeric vector (returns a single value)
is_numeric <- is.numeric(x)
print(is_numeric)

## [1] FALSE

# Check if the elements of x are missing values (NA)
is_na <- is.na(x)
print(is_na)

## [1] FALSE FALSE  TRUE FALSE

# In this code:
# is.vector(x) checks whether x is a vector. The result is TRUE since x is a vector.
# is.character(x) checks whether x as a whole is of character type and returns a single value. The result is TRUE because c() coerces every element to character, so the number 2 is stored as the string "2".
# is.numeric(x) checks whether x as a whole is of numeric type and returns a single value. The result is FALSE because x is a character vector.
# is.na(x) checks each element of x for a missing value (NA). The result is FALSE for the first two elements ("a" and "b"), TRUE for the third element (NA), and FALSE for the fourth element ("2").
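The outputs above depend on how c() handles mixed types. The following sketch, using the same x, shows the coercion and the difference between whole-vector and element-wise checks:

```r
x <- c("a", "b", NA, 2)

# c() coerces all elements to one common type; here that type is character
print(typeof(x))

# The fourth element is stored as the string "2", not the number 2
print(x[4])

# The is.*() type checks return one value for the whole vector
print(length(is.character(x)))

# is.na() is element-wise and returns one value per element
print(length(is.na(x)))

# A vector holding only numbers (and NA) stays numeric
print(is.numeric(c(1, NA, 2)))
```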

Question 3 (36 points)

The airquality dataset contains daily air quality measurements in New York from May to September 1973. The variables include Ozone level, Solar radiation, wind speed, temperature in Fahrenheit, month, and day. Please see the detailed description using help("airquality").

The airquality data set is part of the datasets package, which ships with base R, so it does not need to be installed with install.packages("datasets"). Load the datasets package into your session.

library(datasets)
help("airquality")

Q3a (4 points)

Display the first 6 rows of the airquality data set.

library(datasets)
head(airquality, n = 6)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Q3b (8 points)

Compute the average of the first four variables (Ozone, Solar.R, Wind and Temp) for the fifth month using the sapply() function. Hint: You might need to consider removing the NA values; otherwise, the average will not be computed.

library(datasets)
data(airquality)
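# Keep only the rows for May (Month == 5)
data_fifth_month <- subset(airquality, Month == 5)
# na.omit() drops every row with an NA in any of the four columns,
# so all four means are computed over the same complete rows
data_fifth_month <- na.omit(data_fifth_month[, c("Ozone", "Solar.R", "Wind", "Temp")])
average_values <- sapply(data_fifth_month, mean)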
print(average_values)

##     Ozone   Solar.R      Wind      Temp 
##  24.12500 182.04167  11.50417  66.45833
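The averages above use only the rows in which all four variables are present. If the intent is instead to use every available value of each variable (an assumption about the question, not something the assignment states), the NA values can be dropped column by column with na.rm = TRUE. The names may and may_means are illustrative.

```r
library(datasets)

# Keep the May rows and the four measurement columns
may <- airquality[airquality$Month == 5, c("Ozone", "Solar.R", "Wind", "Temp")]

# Drop NA values column by column instead of dropping whole rows
may_means <- sapply(may, mean, na.rm = TRUE)
print(may_means)

# Number of non-missing observations that each mean is based on
print(colSums(!is.na(may)))
```

With this approach the Wind and Temp means are based on every May observation, so they can differ from the complete-row averages.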

Q3c (8 points)

Construct a boxplot for the Wind and Temp variables, then display the values of all the outliers which lie beyond the whiskers.

library(datasets)
data(airquality)

boxplot(airquality$Wind, airquality$Temp, names = c("Wind", "Temp"), outline = TRUE)

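# boxplot.stats()$out returns the points that lie beyond the whiskers
outliers <- boxplot.stats(airquality$Wind)$out
# Append the outliers of Temp (if any) to those of Wind
outliers <- c(outliers, boxplot.stats(airquality$Temp)$out)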

print(outliers)

## [1] 20.1 18.4 20.7

# All three outliers come from Wind. Temp has no values beyond its whiskers, so boxplot.stats(airquality$Temp)$out is empty and adds nothing to the vector.
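Combining the outliers into one vector hides which variable each value belongs to. One way to keep them apart, sketched here with an illustrative name outliers_by_var, is to collect them in a named list:

```r
library(datasets)

# Collect the outliers of each variable separately so their source is clear
outliers_by_var <- lapply(airquality[c("Wind", "Temp")],
                          function(v) boxplot.stats(v)$out)
print(outliers_by_var)

# Count of outliers per variable
print(lengths(outliers_by_var))
```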

Q3d (8 points)

Compute the upper quartile of the Wind variable with two different methods. HINT: Only show the upper quartile using indexing. For the type of quartile, please see https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile.

library(datasets)
data(airquality)

# Method 1: Using quantile() function
upper_quartile_method1 <- quantile(airquality$Wind, probs = 0.75, type = 7)
print(upper_quartile_method1)

##  75% 
## 11.5

# Method 2: Using indexing into the sorted values (matches quantile() with type = 1)
sorted_wind <- sort(airquality$Wind)
n <- length(sorted_wind)
upper_quartile_method2 <- sorted_wind[ceiling(n * 0.75)]
print(upper_quartile_method2)

## [1] 11.5
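The hint about indexing can also be read as indexing the output of quantile() or summary(), which return all quartiles at once. The sketch below shows that reading, and checks that the manual approach agrees with quantile() when type = 1 (the inverse of the empirical distribution function). All variable names are illustrative.

```r
library(datasets)
wind <- airquality$Wind

# quantile() returns 0%, 25%, 50%, 75%, 100%; the 4th element is the upper quartile
q_default <- quantile(wind)[4]

# summary() reports Min, 1st Qu., Median, Mean, 3rd Qu., Max; the 5th is the upper quartile
q_summary <- summary(wind)[5]

# type = 1 picks an actual data value, the same one as sort(wind)[ceiling(n * 0.75)]
q_type1 <- quantile(wind, probs = 0.75, type = 1)
q_manual <- sort(wind)[ceiling(length(wind) * 0.75)]

print(c(q_default, q_summary, q_type1, manual = q_manual))
```

For this data set the default type 7 and type 1 happen to coincide; in general, different types can give slightly different quartiles.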

Q3e (8 points)

Construct a pie chart to describe the number of entries by Month. HINT: use the table() function to count and tabulate the number of entries within a Month.

library(datasets)
data(airquality)

# Count the number of entries by Month
month_counts <- table(airquality$Month)

# Create a pie chart
pie(month_counts, labels = names(month_counts), main = "Number of Entries by Month")
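Since the slices are labelled only with month numbers, the chart does not show the counts themselves. A possible refinement, sketched with an illustrative name month_labels, is to put the month name and count into each label using the built-in month.name constant:

```r
library(datasets)

# Count the number of entries by Month
month_counts <- table(airquality$Month)

# Build labels such as "May (31)" from the month number and its count
month_labels <- paste0(month.name[as.integer(names(month_counts))],
                       " (", month_counts, ")")
print(month_labels)

# Draw the pie chart with the descriptive labels
pie(month_counts, labels = month_labels, main = "Number of Entries by Month")
```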

END of Assignment #1.

Assignment 1 CIND 123 - Data Analytics: Basic Methods
