Write a function that takes a vector ‘x’ as input and outputs the mean of its entries. Call this function ‘my_mean’.
# Arithmetic mean: sum of the entries divided by their count.
my_mean <- function(x) {
  return(sum(x) / length(x))
}
test<-c(1,2,4,8,12,4,13,23,5)
my_mean(test)
## [1] 8
Write a function that takes a vector ‘x’ as input and outputs the standard deviation of its entries about their mean. Call this function ‘my_sd’. (Note: you need to compute the mean in the process of computing the standard deviation; use your function my_mean.)
# Sample standard deviation (n - 1 denominator), built on my_mean.
my_sd <- function(x) {
  return(sqrt(sum((x - my_mean(x))^2) / (length(x) - 1)))
}
test <- c(1, 2, 4, 8, 12, 4, 13, 23, 5)
my_sd(test)
## [1] 7
Write a function that takes a vector ‘x’ as input and outputs the inter-quartile range of its entries (you can use the R function quantile to get Q1 and Q3). Call this function ‘my_IQR’.
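# Inter-quartile range: third quartile minus first quartile.
my_IQR <- function(x) {
  quartile_3 <- quantile(x, probs = 0.75, names = FALSE)
  quartile_1 <- quantile(x, probs = 0.25, names = FALSE)
  return(quartile_3 - quartile_1)
}
test <- c(1, 2, 4, 8, 12, 4, 13, 23, 5)
my_IQR(test)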
## [1] 8
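As a quick check on the quartiles themselves, the sketch below prints Q1 and Q3 for the same test vector. It assumes the default interpolation rule of quantile (type 7), which is also what IQR uses by default; the variable name quartiles is only illustrative.

```r
test <- c(1, 2, 4, 8, 12, 4, 13, 23, 5)

# Q1 and Q3 with quantile()'s default method (type 7);
# names = FALSE drops the "25%" / "75%" labels.
quartiles <- quantile(test, probs = c(0.25, 0.75), names = FALSE)
quartiles

# The inter-quartile range is the distance between the two quartiles.
diff(quartiles)
```

For this vector Q1 is 4 and Q3 is 12, so the difference of 8 agrees with the output above.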
R has built-in functions mean, sd, and IQR that compute the mean, standard deviation, and inter-quartile range, respectively. Test that the output from your functions my_mean, my_sd, and my_IQR matches the output of the respective R functions on the vector x below:
set.seed(7)
x <- rnorm(300, mean = 2, sd = 1.5)
# Arithmetic mean: sum of the entries divided by their count.
my_mean <- function(x) {
  return(sum(x) / length(x))
}
test<-c(7,1,3,5,9,10)
my_mean(test)
## [1] 5.833333
# Sample standard deviation (n - 1 denominator), built on my_mean.
my_sd <- function(x) {
  return(sqrt(sum((x - my_mean(x))^2) / (length(x) - 1)))
}
test <- c(7, 1, 3, 5, 9, 10)
my_sd(test)
## [1] 3.488075
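# Inter-quartile range: third quartile minus first quartile.
my_IQR <- function(x) {
  quartile_3 <- quantile(x, probs = 0.75, names = FALSE)
  quartile_1 <- quantile(x, probs = 0.25, names = FALSE)
  return(quartile_3 - quartile_1)
}
test <- c(7, 1, 3, 5, 9, 10)
my_IQR(test)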
## [1] 5
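The exercise asks for the comparison on the vector x itself. One possible way to do that is sketched below: it re-creates x, defines compact copies of the three functions so the chunk runs on its own, and compares each result with the built-in function using all.equal, which tolerates tiny floating-point differences. The name checks is only illustrative.

```r
# Re-create the vector from the exercise.
set.seed(7)
x <- rnorm(300, mean = 2, sd = 1.5)

# Self-contained copies of the three functions.
my_mean <- function(x) sum(x) / length(x)
my_sd <- function(x) sqrt(sum((x - my_mean(x))^2) / (length(x) - 1))
my_IQR <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  q[2] - q[1]
}

# all.equal() compares numbers up to a small numerical tolerance;
# isTRUE() turns its result into a plain TRUE/FALSE.
checks <- c(
  mean = isTRUE(all.equal(my_mean(x), mean(x))),
  sd   = isTRUE(all.equal(my_sd(x), sd(x))),
  IQR  = isTRUE(all.equal(my_IQR(x), IQR(x)))
)
checks
```

Each entry of checks should be TRUE if the hand-written function agrees with its built-in counterpart.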
BONUS (Optional): Write a function that returns all the outliers in a given vector x of sample data values.
Test your function on the vector x below:
set.seed(7)
x <- c(rnorm(100), -4, -8.5, 10, 100)
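No solution is given for the bonus. A minimal sketch is shown below, assuming the common 1.5 × IQR rule (values below Q1 - 1.5·IQR or above Q3 + 1.5·IQR count as outliers); the exercise does not say which definition is intended, and the function name my_outliers and the parameter k are hypothetical.

```r
# Return the entries of x lying outside the k * IQR fences.
# k = 1.5 is an assumed convention; other outlier definitions exist.
my_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr  # lower fence
  upper <- q[2] + k * iqr  # upper fence
  x[x < lower | x > upper]
}

# The test vector from the exercise.
set.seed(7)
x <- c(rnorm(100), -4, -8.5, 10, 100)
my_outliers(x)
```

The fences are the same ones boxplot uses by default for its whiskers, so points flagged here should generally be the ones drawn as isolated dots in a boxplot of x.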
In this section we will explore the location and variability of the variables in the dataset iris. Extract the Petal.Length variable from each of the dataframes iris, setosa, versicolor, and virginica. Save it into the variables defined below:
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data("iris")
setosa<-filter(iris, Species == "setosa")
versicolor<-filter(iris, Species == "versicolor")
virginica<- filter(iris, Species == "virginica")
petal_length_iris <- iris$Petal.Length
petal_length_set <- setosa$Petal.Length
petal_length_vir <- virginica$Petal.Length
petal_length_ver <- versicolor$Petal.Length
Now use the hist function to visualize the histograms for each of these variables.
hist(petal_length_iris)
hist(petal_length_set)
hist(petal_length_vir)
hist(petal_length_ver)
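Separate hist calls choose their own axes and bins, which makes the species harder to compare by eye. The sketch below is one possible alternative: it stacks the three species histograms in one column with shared bin edges. The bin width of 0.5 and the names petal_lengths, breaks, and old_par are arbitrary choices for illustration.

```r
data("iris")

# One Petal.Length vector per species.
petal_lengths <- split(iris$Petal.Length, iris$Species)

# Common bin edges covering the full range of the data.
breaks <- seq(floor(min(iris$Petal.Length)),
              ceiling(max(iris$Petal.Length)),
              by = 0.5)

# Stack the three histograms in one column so their x-axes line up.
old_par <- par(mfrow = c(3, 1))
for (species in names(petal_lengths)) {
  hist(petal_lengths[[species]], breaks = breaks,
       main = species, xlab = "Petal.Length")
}
par(old_par)  # restore the previous plotting layout
```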
Question: Do you notice any particular differences between the histograms?
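The histogram of petal_length_iris has two separate clusters: a tall, narrow peak between 1 and 2 and a wider group between about 3 and 7, with a gap in between. Each single-species histogram has only one cluster and covers a much narrower range of values.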
Question: Why do you think there is a difference between the histogram of petal_length_iris and that of petal_length_set?
The histograms differ because petal_length_iris covers the data of all three species (setosa, versicolor, and virginica), while petal_length_set includes only the setosa data.
Question: For each of the histograms, guess what the mean, median, and standard deviation must be.
petal_length_iris mean: 3.3 median: 3.5 standard deviation: 4.6
petal_length_set mean: 1.5 median: 1.4 standard deviation: 2.6
petal_length_vir mean: 5.7 median: 5.2 standard deviation: 3.7
petal_length_ver mean: 4.2 median: 4.1 standard deviation: 1.2
Use the summary command to get the summary statistics of iris, setosa, virginica, and versicolor.
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
summary(setosa)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600
##        Species
##  setosa    :50
##  versicolor: 0
##  virginica : 0
##
##
##
summary(virginica)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    : 0
##  versicolor: 0
##  virginica :50
##
##
##
summary(versicolor)
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800
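The summary output reports means and medians but not standard deviations, so the guesses made from the histograms cannot be fully checked from it. A minimal sketch for computing all three statistics of Petal.Length, for the full data set and for each species, is given below; the helper name location_spread and the matrix name stats are hypothetical.

```r
data("iris")

# Mean, median, and standard deviation of a numeric vector.
location_spread <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v))
}

# One column for the full data set, then one column per species.
stats <- cbind(
  iris = location_spread(iris$Petal.Length),
  sapply(split(iris$Petal.Length, iris$Species), location_spread)
)
round(stats, 3)
```

The mean and median rows should agree with the Petal.Length columns of the summaries above; the sd row is the additional information.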
Do you notice any differences in the summary statistics for the full dataset ‘iris’ vs the summary statistics for the datasets virginica, setosa, and versicolor? Mention at least two such observations.
The first quartile of Petal.Length in iris (1.6) is close to that of setosa (1.4) but far below the first quartiles of versicolor (4.0) and virginica (5.1).
The third quartile of Petal.Length in iris (5.1) is considerably larger than the third quartile in setosa (1.575).
Now compute box-plots for the variable Petal.Length of the dataframe iris (you will have three boxplots side by side).
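# One call with all three vectors draws the boxplots side by side on a shared axis.
boxplot(petal_length_set, petal_length_ver, petal_length_vir,
        names = c("setosa", "versicolor", "virginica"),
        ylab = "Petal.Length")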
Using the boxplot, what can you say about the petal length of the three species with respect to each other? Can you identify any overlapping data?
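Every setosa petal length (1.0 to 1.9) is smaller than every versicolor (3.0 to 5.1) or virginica (4.5 to 6.9) value, so setosa does not overlap with the other two species. Versicolor and virginica overlap between 4.5 and 5.1, although the virginica median (5.55) is clearly higher than the versicolor median (4.35).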
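The overlap question can also be checked numerically rather than by eye. The sketch below compares the minimum and maximum Petal.Length of each species; the names ranges and overlaps are hypothetical helpers introduced only for this illustration.

```r
data("iris")

# Minimum and maximum Petal.Length for each species (one column per species).
ranges <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(ranges) <- c("min", "max")
ranges

# Two species overlap when the larger of their minima
# does not exceed the smaller of their maxima.
overlaps <- function(a, b) {
  max(ranges["min", c(a, b)]) <= min(ranges["max", c(a, b)])
}
overlaps("setosa", "versicolor")
overlaps("versicolor", "virginica")
```

Given the minima and maxima in the summaries above, the first call should return FALSE and the second TRUE.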