I pledge on my honor that I have not given or received any unauthorized assistance on this assignment or examination.

Part 1: R Functions

Write a function that takes as input a vector ‘x’ and outputs the mean of the entries of the vector. Call this function ‘my_mean’.

my_mean <- function(x){
   return (sum(x)/length(x))
}
test <- c(2, 3, 4, 8)
my_mean(test)
## [1] 4.25

Write a function that takes as input a vector ‘x’ and outputs the standard deviation of the entries of the vector. Call this function ‘my_sd’. (Note: you need to compute the mean in the process of computing the standard deviation; use your function my_mean.)

my_sd <- function(x) {
  # Square each deviation from the mean before summing, then divide by n - 1
  return(sqrt(sum((x - my_mean(x))^2) / (length(x) - 1)))
}
test <- c(2, 3, 4, 8)
my_sd(test)
## [1] 2.629956
sd(test)
## [1] 2.629956
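
The sample standard deviation is the square root of the sum of squared deviations from the mean, divided by n - 1. A step-by-step sketch on the same test vector (the intermediate names are illustrative) shows why each deviation has to be squared before summing:

```r
x <- c(2, 3, 4, 8)

deviations <- x - mean(x)      # distance of each entry from the mean
sum(deviations)                # zero up to rounding, so it cannot measure spread
squared_dev <- deviations^2    # square each deviation first
sample_var <- sum(squared_dev) / (length(x) - 1)
sample_sd <- sqrt(sample_var)

all.equal(sample_sd, sd(x))    # compare with the built-in sd()
```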

Write a function that takes as input a vector ‘x’ and outputs the inter-quartile range of the entries of the vector (you can use the R function quantile to get Q1 and Q3). Call this function ‘my_IQR’.

my_IQR <- function(x) {
  # Q1 and Q3 are the 25th and 75th percentiles of x
  q <- quantile(x, c(0.25, 0.75))
  # unname() drops the "75%" label that quantile() attaches
  return(unname(q[2] - q[1]))
}
test <- c(5, 5, 8, 12, 15, 16)
my_IQR(test)
## [1] 8.5
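
R's quantile() supports several interpolation rules through its `type` argument, and IQR() uses type 7 by default. A minimal sketch, using the sample vector from above and illustrative variable names, of how the choice of type changes the result:

```r
sample_vec <- c(5, 5, 8, 12, 15, 16)

# Default rule (type 7), the same one IQR() uses unless told otherwise
q7 <- quantile(sample_vec, c(0.25, 0.75))
iqr_default <- unname(q7[2] - q7[1])

# An alternative rule (type 6) interpolates differently
q6 <- quantile(sample_vec, c(0.25, 0.75), type = 6)
iqr_type6 <- unname(q6[2] - q6[1])

all.equal(iqr_default, IQR(sample_vec))          # default types agree
all.equal(iqr_type6, IQR(sample_vec, type = 6))  # matching types agree
iqr_default == iqr_type6                         # the two rules differ on this vector
```

So a hand-written IQR matches IQR() only when both use the same quantile type.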

R has built-in functions mean, sd, and IQR that compute the mean, standard deviation, and inter-quartile range, respectively. Test that the output from your functions my_mean, my_sd, and my_IQR matches the output of the respective R functions on the vector x below:

set.seed(7)
x <- rnorm(300, mean = 2, sd = 1.5)
my_mean <- function(data_set) {
  return(sum(data_set) / length(data_set))
}

my_mean(x)
## [1] 2.116994
mean(x)
## [1] 2.116994
my_sd <- function(data_set) {
  n <- length(data_set)
  # Population variance as mean of squares minus squared mean
  variance <- sum(data_set^2) / n - my_mean(data_set)^2
  # Rescale by n / (n - 1) to get the sample variance, then take the root
  return(sqrt(variance * n / (n - 1)))
}

my_sd(x)
## [1] 1.495104
sd(x)
## [1] 1.495104
my_IQR <- function(data_set) {
  Q1 <- quantile(data_set, 0.25)
  Q3 <- quantile(data_set, 0.75)
  return(Q3 - Q1)
}

my_IQR(x)
##      75% 
## 2.019907
IQR(x)
## [1] 2.019907
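
Printed values are rounded to seven significant digits, so comparing them by eye can hide small differences. A minimal sketch of the same comparison done programmatically with all.equal(), using compact one-line versions of the three functions (written here for illustration, not the definitions above):

```r
set.seed(7)
x <- rnorm(300, mean = 2, sd = 1.5)

# Compact stand-ins for the three functions
my_mean <- function(v) sum(v) / length(v)
my_sd <- function(v) sqrt(sum((v - my_mean(v))^2) / (length(v) - 1))
my_IQR <- function(v) unname(quantile(v, 0.75) - quantile(v, 0.25))

# all.equal() compares within a small numerical tolerance
checks <- c(
  mean = isTRUE(all.equal(my_mean(x), mean(x))),
  sd   = isTRUE(all.equal(my_sd(x), sd(x))),
  IQR  = isTRUE(all.equal(my_IQR(x), IQR(x)))
)
print(checks)
```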

BONUS (Optional): Write a function that returns all the outliers in a given vector x of sample data values.

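Outliers <- function(V) {
  Q1 <- quantile(V, 0.25)
  Q3 <- quantile(V, 0.75)
  iqr <- Q3 - Q1  # lower-case name avoids shadowing the built-in IQR()

  # Fences at 1.5 * IQR beyond the quartiles (the usual boxplot rule)
  UL <- Q3 + 1.5 * iqr
  LL <- Q1 - 1.5 * iqr

  o1 <- V[which(V < LL)]  # values below the lower fence
  o2 <- V[which(V > UL)]  # values above the upper fence

  return(c(o1, o2))
}

Test your function on the vector x below:

set.seed(7)
Outliers(c(rnorm(100), -4, -8.5, 10, 100))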
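
One way to sanity-check a hand-written outlier function is to compare it with boxplot.stats() from grDevices, which flags points lying more than `coef` (default 1.5) times the hinge spread beyond the hinges. Its hinges can differ slightly from the quartiles returned by quantile(), so the two are expected to agree on clear-cut cases but not necessarily on borderline ones. The vector `v` and the helper `iqr_outliers` below are illustrative:

```r
# Illustrative helper: values beyond 1.5 * IQR from the quartiles
iqr_outliers <- function(v) {
  q <- quantile(v, c(0.25, 0.75))
  spread <- q[2] - q[1]
  v[v < q[1] - 1.5 * spread | v > q[2] + 1.5 * spread]
}

v <- c(10, 12, 11, 13, 12, 11, 14, 12, 13, 40)

iqr_outliers(v)        # 40
boxplot.stats(v)$out   # 40
```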

Part 2: Measures of Location and Variability.

In this section we will explore the location and variability of the variables in the dataset iris. Extract the Petal.Length variable from each of the dataframes iris, setosa, versicolor, and virginica. Save it into the variables defined below:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(datasets)

data("iris")

head(iris)
setosa <- filter(iris, Species == "setosa")
 
head(setosa)
versicolor <- filter(iris, Species == "versicolor")

head(versicolor)
virginica <- filter(iris, Species == "virginica")

head(virginica)
petal_length_iris <- iris$Petal.Length
petal_length_set <- setosa$Petal.Length
petal_length_vir <- virginica$Petal.Length
petal_length_ver <- versicolor$Petal.Length
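
As an alternative sketch that does not need dplyr, base R's split() returns the petal lengths of all three species in one named list (the name `petal_by_species` is illustrative):

```r
data("iris")

# One numeric vector of petal lengths per species, named by species
petal_by_species <- split(iris$Petal.Length, iris$Species)

names(petal_by_species)
lengths(petal_by_species)   # number of flowers per species
```

Each element can then be used like the vectors above, for example `petal_by_species$setosa`.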

Now use the hist function to visualize the histograms for each of these variables.

par(mfrow = c(2, 2))

hist.iris <- hist(petal_length_iris)
hist.set <- hist(petal_length_set)
hist.vir <- hist(petal_length_vir)
hist.ver <- hist(petal_length_ver)
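
Because hist() chooses its bins separately for each call, the four panels can end up on different x-axis scales. A minimal sketch, assuming a shared set of breaks (the 0.5-wide bins from 0 to 7 are an illustrative choice, not part of the assignment), that makes the panels directly comparable:

```r
data("iris")

# Illustrative shared bins; they cover the full range of Petal.Length
common_breaks <- seq(0, 7, by = 0.5)

# Full dataset first, then one vector per species
groups <- c(list(iris = iris$Petal.Length),
            split(iris$Petal.Length, iris$Species))

par(mfrow = c(2, 2))
hists <- lapply(names(groups), function(g) {
  hist(groups[[g]], breaks = common_breaks, main = g,
       xlab = "Petal length (cm)")
})
names(hists) <- names(groups)
```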

summary.vir<-summary(petal_length_vir)
summary.vir 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.500   5.100   5.550   5.552   5.875   6.900
summary.ver<-summary(petal_length_ver)
summary.ver
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    4.00    4.35    4.26    4.60    5.10

Question: Do you notice any particular differences between the histograms?

We can observe that the histograms of petal length for the individual species (setosa, virginica, and versicolor) are each approximately normal, whereas the histogram for the complete iris data appears bimodal: there are clear spikes at the low end of the range, and after those first two bars the rest of the histogram also looks roughly normal.

Question: Why do you think there is a difference between the histogram of petal_length_iris and that of petal_length_set?

The petal length of the setosa species ranges between 1 and 2, while the petal length for the complete iris data ranges between 1 and 7. The full dataset also contains the two species with longer petals, so the histogram of petal_length_iris is bound to differ from that of petal_length_set.

Question: For each of the histograms, guess what the mean, median, and standard deviations must be.

By observing the histograms we can get a fair idea of the measures of center and spread of the data:

petal_length_iris: Mean = 3.8, Median = 4, sd = 1.5
petal_length_set: Mean = 1.5, Median = 1.4, sd = 0.2
petal_length_vir: Mean = 5.5, Median = 5.5, sd = 0.6
petal_length_ver: Mean = 4.25, Median = 4.25, sd = 0.5
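
Guesses like these can be checked against the data. A minimal sketch in base R that computes the three statistics for the full dataset and for each species (the names `groups` and `stats` are illustrative):

```r
data("iris")

# Full dataset first, then one vector of petal lengths per species
groups <- c(list(iris = iris$Petal.Length),
            split(iris$Petal.Length, iris$Species))

# One column per group, one row per statistic
stats <- sapply(groups, function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v))
})
round(stats, 3)
```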

Use the summary command to get the summary statistics of iris, setosa, virginica, and versicolor.

summary.iris<-summary(petal_length_iris)
summary.iris
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.600   4.350   3.758   5.100   6.900
summary.set<-summary(petal_length_set)
summary.set
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.400   1.500   1.462   1.575   1.900

Do you notice any differences in the summary statistics for the full dataset ‘iris’ vs the summary statistics for the datasets virginica, setosa, and versicolor? Mention at least two such observations.

From the summaries above, we can observe that the statistics for petal_length_set are very different from the other three summaries, which are somewhat closer to each other.

The summary statistics for petal_length_set are Mean = 1.462 and Median = 1.5, while those for petal_length_iris are Mean = 3.758 and Median = 4.350. The full dataset also spans a much wider range (1.0 to 6.9) than setosa alone (1.0 to 1.9).

Now compute box-plots for the variable Petal.Length of the dataframe iris (you will have three boxplots side by side).

boxplot.petal.length <- with(iris, boxplot(Petal.Length ~ Species))

Using the boxplot, what can you say about the petal length of the three species with respect to each other? Can you identify any overlapping data?

From the box plots above, we can observe that petal length is highest for the virginica species and lowest for the setosa species. The petal length of the versicolor species lies between the two.

We can also observe that some of the data for virginica and versicolor overlap.
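
The overlap seen in the box plots can also be checked numerically. A minimal sketch that compares the minimum and maximum petal length of each species (the names `ranges` and `overlaps` are illustrative; note that these ranges include any points a box plot would draw as outliers):

```r
data("iris")

# 2 x 3 matrix: smallest and largest petal length per species
ranges <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(ranges) <- c("min", "max")
ranges

# Two groups overlap when each one's minimum is at most the other's maximum
overlaps <- function(a, b) {
  ranges["min", a] <= ranges["max", b] && ranges["min", b] <= ranges["max", a]
}

overlaps("setosa", "versicolor")     # FALSE
overlaps("versicolor", "virginica")  # TRUE
```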