Part 1: R Functions

Write a function that takes a vector ‘x’ as input and outputs the mean of its entries. Call this function ‘my_mean’.

my_mean<-function(x){
  return(sum(x)/length(x))
}

test<-c(1,2,4,8,12,4,13,23,5)
my_mean(test)

## [1] 8

Write a function that takes a vector ‘x’ as input and outputs the standard deviation of its entries about their mean. Call this function ‘my_sd’. (Note: you need to compute the mean in the process of computing the standard deviation; use your function my_mean.)

my_sd<-function(x){
  # sample standard deviation: squared deviations from my_mean(x), divided by n - 1
  return(sqrt(sum((x-my_mean(x))^2)/(length(x)-1)))
}
test<-c(1,2,4,8,12,4,13,23,5)
my_sd(test)

## [1] 7

Write a function that takes a vector ‘x’ as input and outputs the inter-quartile range of its entries (you can use the R function quantile to get Q1 and Q3). Call this function ‘my_IQR’.

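my_IQR<-function(x){
  # names = FALSE drops the "75%" / "25%" labels from the quantile() result
  quartile_3<- quantile(x, probs = 0.75, names = FALSE)
  quartile_1<- quantile(x, probs = 0.25, names = FALSE)
  return(quartile_3 - quartile_1)
}
test<-c(1,2,4,8,12,4,13,23,5)
my_IQR(test)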

## [1] 8

R has built-in functions mean, sd, and IQR that compute the mean, standard deviation, and inter-quartile range, respectively. Test that the output of your functions my_mean, my_sd, and my_IQR matches the output of the respective R functions on the vector x below:

set.seed(7)
x <- rnorm(300, mean = 2, sd = 1.5)

my_mean<-function(x){
  return(sum(x)/length(x))
}

test<-c(7,1,3,5,9,10)
my_mean(test)

## [1] 5.833333

my_sd<-function(x){
  # sample standard deviation: squared deviations from my_mean(x), divided by n - 1
  return(sqrt(sum((x-my_mean(x))^2)/(length(x)-1)))
}
test<-c(7,1,3,5,9,10)
my_sd(test)

## [1] 3.488075

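my_IQR<-function(x){
  # names = FALSE drops the "75%" / "25%" labels from the quantile() result
  quartile_3<- quantile(x, probs = 0.75, names = FALSE)
  quartile_1<- quantile(x, probs = 0.25, names = FALSE)
  return(quartile_3 - quartile_1)
}
test<-c(7,1,3,5,9,10)
my_IQR(test)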

## [1] 5
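The checks above use a small hand-made vector rather than the x generated with set.seed(7). A minimal sketch of the same comparison on x itself could look like the following; the compact one-line definitions and the name `checks` are assumptions made for illustration, not part of the original answer.

```r
# Re-create the vector from the exercise.
set.seed(7)
x <- rnorm(300, mean = 2, sd = 1.5)

# Compact versions of the three functions, so this block runs on its own.
my_mean <- function(x) sum(x) / length(x)
my_sd   <- function(x) sqrt(sum((x - my_mean(x))^2) / (length(x) - 1))
my_IQR  <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  q[2] - q[1]
}

# all.equal() compares with a small numerical tolerance, which is safer
# than == for floating-point results.
checks <- c(
  mean = isTRUE(all.equal(my_mean(x), mean(x))),
  sd   = isTRUE(all.equal(my_sd(x), sd(x))),
  IQR  = isTRUE(all.equal(my_IQR(x), IQR(x)))
)
print(checks)
```

Each entry of `checks` should be TRUE if the hand-written function agrees with its built-in counterpart.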

BONUS(Optional): Write a function that returns all the outliers in a given vector x of sample data values.

Test your function on the vector x below:

set.seed(7)
x <- c(rnorm(100), -4, -8.5, 10, 100)
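No answer is given for the bonus, so here is one possible sketch. It assumes the common 1.5 × IQR rule (a value is an outlier if it lies more than 1.5 inter-quartile ranges below Q1 or above Q3); the function name `find_outliers` and the multiplier argument `k` are hypothetical choices, and other outlier definitions would be equally valid.

```r
# Hypothetical helper: returns the entries of x that fall outside the
# fences Q1 - k * IQR and Q3 + k * IQR (k = 1.5 by default).
find_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  # Drop missing values, then keep only the points beyond either fence.
  x[!is.na(x) & (x < lower | x > upper)]
}

set.seed(7)
x <- c(rnorm(100), -4, -8.5, 10, 100)
find_outliers(x)
```

The appended values -8.5, 10, and 100 sit far from the bulk of the standard normal sample, so they should be among the values returned.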

Part 2: Measures of Location and Variability.

In this section we will explore the location and variability of the variables in the dataset iris. Extract the Petal.Length variable from each of the dataframes iris, setosa, versicolor, and virginica. Save it into the variables defined below:

#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data("iris")
setosa<-filter(iris, Species == "setosa")
versicolor<-filter(iris, Species == "versicolor")
virginica<- filter(iris, Species == "virginica")
petal_length_iris <- iris$Petal.Length
petal_length_set <- setosa$Petal.Length
petal_length_vir <- virginica$Petal.Length
petal_length_ver <- versicolor$Petal.Length

Now use the hist function to visualize the histograms for each of these variables.

hist(petal_length_iris)

hist(petal_length_set)

hist(petal_length_vir)

hist(petal_length_ver)
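Separate hist calls pick their own bins and axis limits, which makes the four plots hard to compare by eye. As a sketch, the histograms can be drawn in one 2 × 2 grid with shared bin edges; the names `lengths_by_group` and `breaks` and the bin width of 0.5 are assumptions chosen for illustration.

```r
data("iris")

# The same four vectors as above, rebuilt here so the block runs on its own.
lengths_by_group <- list(
  iris       = iris$Petal.Length,
  setosa     = iris$Petal.Length[iris$Species == "setosa"],
  versicolor = iris$Petal.Length[iris$Species == "versicolor"],
  virginica  = iris$Petal.Length[iris$Species == "virginica"]
)

# Shared bin edges and x-axis so the four panels are directly comparable.
breaks <- seq(1, 7, by = 0.5)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of plots
for (name in names(lengths_by_group)) {
  hist(lengths_by_group[[name]], breaks = breaks, xlim = c(1, 7),
       main = name, xlab = "Petal.Length")
}
par(old_par)                     # restore the previous layout
```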

Question: Do you notice any particular differences between the histograms?

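The histogram of petal_length_iris has two separate clusters: a tall peak between 1 and 2, with a higher frequency than any bin in the other histograms, and a second, wider group of values from about 3 to 7. Each single-species histogram has only one cluster, spread over a much narrower range.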

Question: Why do you think there is a difference between the histogram of petal_length_iris and that of petal_length_set?

The histograms differ because petal_length_iris pools the petal lengths of all three species (setosa, versicolor, and virginica), while petal_length_set contains only the setosa values, which sit at the low end of that range.

Question: For each of the histograms, guess what the mean, median, and standard deviation must be.

petal_length_iris mean: 3.3 median: 3.5 standard deviation: 4.6

petal_length_set mean: 1.5 median: 1.4 standard deviation: 2.6

petal_length_vir mean: 5.7 median: 5.2 standard deviation: 3.7

petal_length_ver mean: 4.2 median: 4.1 standard deviation: 1.2

Use the summary command to get the summary statistics of iris, setosa, virginica, and versicolor.

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

summary(setosa)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
##                 
##                 
##

summary(virginica)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    : 0  
##  versicolor: 0  
##  virginica :50  
##                 
##                 
##

summary(versicolor)

##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800
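summary reports the mean and median but not the standard deviation, so the guesses made from the histograms can only be partly checked against the output above. A short sketch that computes all three statistics for each Petal.Length vector follows; the names `groups` and `stats` are assumptions introduced for illustration.

```r
data("iris")

# Petal.Length for the full data set and for each species.
groups <- list(
  petal_length_iris = iris$Petal.Length,
  petal_length_set  = iris$Petal.Length[iris$Species == "setosa"],
  petal_length_vir  = iris$Petal.Length[iris$Species == "virginica"],
  petal_length_ver  = iris$Petal.Length[iris$Species == "versicolor"]
)

# One column per vector, one row per statistic.
stats <- sapply(groups, function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v))
})
round(stats, 3)
```

Because every petal length lies between 1.0 and 6.9, no standard deviation here can be much more than half that range (roughly 3), which is a quick sanity check for guessed values.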

Do you notice any differences in the summary statistics for the full dataset ‘iris’ vs the summary statistics for the datasets virginica, setosa, and versicolor? Mention at least two such observations.

The first quartile of Petal.Length in iris (1.6) is far below the first quartiles for versicolor (4.0) and virginica (5.1), and only slightly above the first quartile for setosa (1.4).

The third quartile of Petal.Length in iris (5.1) is considerably larger than the third quartile for setosa (1.575).

Now draw box-plots for the variable Petal.Length of the dataframe iris, one per species (you will have three boxplots side by side).

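# one call with all three vectors draws the boxplots side by side on a shared axis
boxplot(petal_length_set, petal_length_ver, petal_length_vir,
        names = c("setosa", "versicolor", "virginica"),
        ylab = "Petal.Length")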

Using the boxplot, what can you say about the petal length of the three species with respect to each other? Can you identify any overlapping data?

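All setosa petal lengths (at most 1.9) are smaller than every versicolor and virginica value (at least 3.0), so setosa does not overlap with the other two species. Versicolor and virginica do overlap: versicolor reaches 5.1 while virginica starts at 4.5, although the virginica median (5.55) is clearly higher than the versicolor median (4.35).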
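The overlap question can also be checked numerically rather than read off the plot. The sketch below compares the minimum and maximum petal length of each species; the names `ranges` and `overlaps` are hypothetical helpers introduced for illustration.

```r
data("iris")

# Minimum and maximum Petal.Length per species (one column per species).
ranges <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(ranges) <- c("min", "max")
ranges

# Two intervals overlap when each one starts before the other one ends.
overlaps <- function(a, b) {
  ranges["min", a] <= ranges["max", b] && ranges["min", b] <= ranges["max", a]
}

overlaps("setosa", "versicolor")     # FALSE: setosa ends at 1.9, versicolor starts at 3.0
overlaps("versicolor", "virginica")  # TRUE: virginica starts at 4.5, versicolor ends at 5.1
```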

Data Exploration using R

Joshua Leeman/116475709
