Write a function that takes as input a vector ‘x’ and outputs the mean of its entries. Call this function ‘my_mean’.
my_mean <- function(x){
return (sum(x)/length(x))
}
test <- c(2, 3, 4, 8)
my_mean(test)
## [1] 4.25
Write a function that takes as input a vector ‘x’ and outputs the standard deviation of its entries about the mean. Call this function ‘my_sd’. (Note: you need to compute the mean in the process of computing the standard deviation; use your function my_mean.)
my_sd <- function(x) {
# Sample standard deviation: square each deviation before summing
return(sqrt(sum((x - my_mean(x))^2)/(length(x) - 1)))
}
test <- c(2, 3, 4, 8)
my_sd(test)
## [1] 2.629956
sd(test)
## [1] 2.629956
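As a check on the formula, the computation for the test vector can be traced step by step. This is only an illustrative sketch; the intermediate names (deviations, squared, sample_var, sample_sd) are not part of the exercise.

```r
test <- c(2, 3, 4, 8)

m <- sum(test) / length(test)                    # mean: 4.25
deviations <- test - m                           # -2.25 -1.25 -0.25 3.75
squared <- deviations^2                          # 5.0625 1.5625 0.0625 14.0625
sample_var <- sum(squared) / (length(test) - 1)  # 20.75 / 3
sample_sd <- sqrt(sample_var)

# Compare against R's built-in sample standard deviation
print(all.equal(sample_sd, sd(test)))
```

Note that the deviations themselves always sum to zero, so they have to be squared before they are added up.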
Write a function that takes input a vector ‘x’, and outputs the inter-quartile-range of the entries of the vector (you can use the R function quantile to get Q1 and Q3). Call this function ‘my_IQR’.
my_IQR <- function(x)
{
# Interquartile range: third quartile minus first quartile
Q1 <- quantile(x, 0.25, names = FALSE)
Q3 <- quantile(x, 0.75, names = FALSE)
return (Q3 - Q1)
}
test <- c(5, 5, 8, 12, 15, 16)
my_IQR(test)
## [1] 8.5
R has built-in functions mean, sd, and IQR that compute the mean, standard deviation, and interquartile range, respectively. Test that the output of your functions my_mean, my_sd, and my_IQR matches the output of the respective R functions on the vector x below:
set.seed(7)
x <- rnorm(300, mean = 2, sd = 1.5)
my_mean <- function(data_set)
{
return(sum(data_set)/length(data_set))
}
my_mean(x)
## [1] 2.116994
mean(x)
## [1] 2.116994
my_sd <- function(data_set)
{
n = length(data_set)
variance = sum((data_set)^2)/n - (my_mean(data_set))^2
return(sqrt(variance*n/(n-1)))
}
my_sd(x)
## [1] 1.495104
sd(x)
## [1] 1.495104
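This version of my_sd uses the one-pass identity (mean of squares minus squared mean). It is algebraically equivalent to summing squared deviations, but it can lose precision when the values are large relative to their spread. The sketch below illustrates this under assumed inputs; one_pass_sd, two_pass_sd, small, and shifted are hypothetical names introduced only for this comparison.

```r
# One-pass form: mean of squares minus squared mean, rescaled to n - 1
one_pass_sd <- function(v) {
  n <- length(v)
  variance <- sum(v^2) / n - (sum(v) / n)^2
  # suppressWarnings: rounding error can make 'variance' slightly negative
  suppressWarnings(sqrt(variance * n / (n - 1)))
}

# Two-pass form: compute the mean first, then sum the squared deviations
two_pass_sd <- function(v) {
  n <- length(v)
  m <- sum(v) / n
  sqrt(sum((v - m)^2) / (n - 1))
}

small <- c(2, 3, 4, 8)
shifted <- c(1, 2, 3) + 1e9   # same spread as c(1, 2, 3), huge offset

print(c(one_pass = one_pass_sd(small), two_pass = two_pass_sd(small), builtin = sd(small)))
print(c(one_pass = one_pass_sd(shifted), two_pass = two_pass_sd(shifted), builtin = sd(shifted)))
```

On data like x above, whose values are of moderate size, both forms agree with sd to the printed precision.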
my_IQR <- function(data_set)
{
Q1 = quantile(data_set,0.25)
Q3 = quantile(data_set,0.75)
return(Q3 - Q1)
}
my_IQR(x)
## 75%
## 2.019907
IQR(x)
## [1] 2.019907
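Rather than comparing the printed values by eye, the three checks can also be run programmatically. The following is a minimal sketch that redefines compact versions of the functions so that it is self-contained; all.equal compares numbers up to a small tolerance, and the checks vector is a name introduced here.

```r
set.seed(7)
x <- rnorm(300, mean = 2, sd = 1.5)

my_mean <- function(data_set) sum(data_set) / length(data_set)

my_sd <- function(data_set) {
  sqrt(sum((data_set - my_mean(data_set))^2) / (length(data_set) - 1))
}

my_IQR <- function(data_set) {
  # names = FALSE drops the "25%" / "75%" labels from the result
  q <- quantile(data_set, c(0.25, 0.75), names = FALSE)
  q[2] - q[1]
}

checks <- c(
  mean = isTRUE(all.equal(my_mean(x), mean(x))),
  sd   = isTRUE(all.equal(my_sd(x), sd(x))),
  IQR  = isTRUE(all.equal(my_IQR(x), IQR(x)))
)
print(checks)
```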
BONUS (Optional): Write a function that returns all the outliers in a given vector x of sample data values.
Outliers = function(V)
{
# Fences follow the conventional 1.5 * IQR rule
Q1=quantile(V,0.25);Q3=quantile(V,0.75);iqr=Q3-Q1
UL=Q3+1.5*iqr;LL=Q1-1.5*iqr
o1=V[which(V<LL)]
o2=V[which(V>UL)]
return(c(o1,o2))
}
Test your function on the vector x below:
set.seed(7)
Outliers(c(rnorm(100),-4, -8.5, 10, 100))
# The result should include the four injected values -4, -8.5, 10, and 100, along with any rnorm draws that fall outside the fences.
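One way to sanity-check an outlier function is to compare it with boxplot.stats from base R, whose out component lists the points beyond the whiskers. The sketch below assumes the 1.5 * IQR rule; outliers_iqr and sample_values are hypothetical names. Note that boxplot.stats uses hinges rather than quantile, so the two can disagree on borderline points.

```r
# Quantile-based outlier rule; k is the fence multiplier (1.5 by convention)
outliers_iqr <- function(v, k = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  v[v < q[1] - k * iqr | v > q[2] + k * iqr]
}

sample_values <- c(1, 2, 3, 4, 5, 100)

print(outliers_iqr(sample_values))       # quantile-based fences
print(boxplot.stats(sample_values)$out)  # hinge-based fences
```

Unlike returning c(o1, o2), this version keeps the outliers in their original order within the vector.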
In this section we will explore the location and variability of the variables in the dataset iris. Extract the Petal.Length variable from each of the dataframes iris, setosa, versicolor, and virginica. Save it into the variables defined below:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(datasets)
data("iris")
head(iris)
setosa <- filter(iris, Species == "setosa")
head(setosa)
versicolor <- filter(iris, Species == "versicolor")
head(versicolor)
virginica <- filter(iris, Species == "virginica")
head(virginica)
petal_length_iris <- iris$Petal.Length
petal_length_set <- setosa$Petal.Length
petal_length_vir <- virginica$Petal.Length
petal_length_ver <- versicolor$Petal.Length
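The same extraction can be done without dplyr. As a sketch, base R's split groups Petal.Length by Species in one call; petal_by_species and petal_length_set_alt are names introduced here for illustration.

```r
library(datasets)

# One numeric vector per species, returned as a named list
petal_by_species <- split(iris$Petal.Length, iris$Species)

print(names(petal_by_species))
print(lengths(petal_by_species))

# Equivalent to filtering for setosa and taking Petal.Length
petal_length_set_alt <- petal_by_species[["setosa"]]
print(head(petal_length_set_alt))
```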
Now use the hist function to visualize the histograms for each of these variables.
par(mfrow=c(2,2))
hist.iris<-hist(petal_length_iris)
hist.set<- hist(petal_length_set)
hist.vir<- hist(petal_length_vir)
hist.ver<- hist(petal_length_ver)
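By default, hist picks different bins for each variable, which can make the four panels harder to compare. A possible variation, sketched below, is to give every histogram the same breaks. The bin width of 0.5 and the names common_breaks and petal_by_species are assumptions made for this example.

```r
library(datasets)

petal_by_species <- split(iris$Petal.Length, iris$Species)

# Shared bins covering the full range of petal lengths (1.0 to 6.9)
common_breaks <- seq(1, 7, by = 0.5)

par(mfrow = c(2, 2))
hist(iris$Petal.Length, breaks = common_breaks,
     main = "All species", xlab = "Petal length (cm)")
for (sp in names(petal_by_species)) {
  hist(petal_by_species[[sp]], breaks = common_breaks,
       main = sp, xlab = "Petal length (cm)")
}
```

With a common axis, the position of each species within the overall distribution is easier to see.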
Question: Do you notice any particular differences between the histograms?
Question: Why do you think there is a difference between the histogram of petal_length_iris and that of petal_length_set?
# The petal length of the setosa species ranges between 1 and 2, while the petal length for the complete iris data ranges between 1 and 7, so the histogram of petal_length_iris covers a much wider range than that of petal_length_set.
Question: For each of the histograms, guess what the mean, median, and standard deviation must be.
petal_length_iris: Mean = 3.8, Median = 4, sd = 1.5
petal_length_set: Mean = 1.5, Median = 1.4, sd = 0.2
petal_length_vir: Mean = 5.5, Median = 5.5, sd = 0.6
petal_length_ver: Mean = 4.25, Median = 4.25, sd = 0.5
Use the summary command to get the summary statistics of iris, setosa, virginica, and versicolor.
summary.iris<-summary(petal_length_iris)
summary.iris
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
summary.set<-summary(petal_length_set)
summary.set
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.400 1.500 1.462 1.575 1.900
summary.vir<-summary(petal_length_vir)
summary.vir
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.500 5.100 5.550 5.552 5.875 6.900
summary.ver<-summary(petal_length_ver)
summary.ver
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 4.00 4.35 4.26 4.60 5.10
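The per-species summaries can also be collected into a single table, which makes them easier to compare against the guesses. This is a sketch using base R only; petal_by_species, species_summary, and species_sd are names introduced here.

```r
library(datasets)

petal_by_species <- split(iris$Petal.Length, iris$Species)

# One column per species, one row per summary statistic
species_summary <- sapply(petal_by_species, summary)
print(round(species_summary, 3))

# summary() does not report the standard deviation, so compute it separately
species_sd <- sapply(petal_by_species, sd)
print(round(species_sd, 3))
```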
Do you notice any differences in the summary statistics for the full dataset ‘iris’ vs the summary statistics for the datasets virginica, setosa, and versicolor? Mention at least two such observations.
The summary statistics for petal_length_set are Mean = 1.462 and Median = 1.5, while those for petal_length_iris are Mean = 3.758 and Median = 4.350, so the setosa values sit well below the overall center. The full dataset is also far more spread out: it ranges from 1.0 to 6.9, whereas setosa ranges only from 1.0 to 1.9.
Now draw box plots for the variable Petal.Length of the dataframe iris (you will have three boxplots side by side).
boxplot.petal.length=with(iris,boxplot(Petal.Length~Species))
Using the boxplot, what can you say about the petal length of the three species with respect to each other? Can you identify any overlapping data?
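The setosa petals are the shortest and are clearly separated from the other two species, while the virginica petals are the longest. We can also observe that some data for virginica and versicolor overlap.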
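The overlap visible in the boxplot can be confirmed numerically by comparing the range of each species. The sketch below is one way to do this; the names ranges, overlap_ver_vir, and overlap_set_ver are hypothetical.

```r
library(datasets)

# 2 x 3 matrix: minimum and maximum petal length for each species
ranges <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(ranges) <- c("min", "max")
print(ranges)

# Two species overlap when the smaller one's max reaches the larger one's min
overlap_ver_vir <- ranges["max", "versicolor"] >= ranges["min", "virginica"]
overlap_set_ver <- ranges["max", "setosa"] >= ranges["min", "versicolor"]
print(c(versicolor_virginica = overlap_ver_vir, setosa_versicolor = overlap_set_ver))
```

This only compares the extremes, so it shows whether the ranges overlap, not how many observations fall in the shared interval.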