Write a summarizing function to understand the distribution of a vector

  1. The function, call it ‘printVecInfo’, should take a vector as input
  2. The function should print the following information
    1. Mean
    2. Median
    3. Min and Max
    4. Standard deviation
    5. Quantile (at 0.05 and 0.95)
    6. Skewness
library(moments)

# The function takes a vector as input and prints its summary statistics.
printVectorInfo <- function(vectorInputs) {
  mean_value <- mean(vectorInputs)
  median_value <- median(vectorInputs)
  min_value <- min(vectorInputs)
  max_value <- max(vectorInputs)
  std_value <- sd(vectorInputs)
  qt_value <- quantile(vectorInputs, probs = c(0.05, 0.95))
  skw_value <- skewness(vectorInputs)
  cat('Mean : ',mean_value,'\n')
  cat('Median : ',median_value,'\n')
  cat('Min : ',min_value,'  ')
  cat('Max : ',max_value, '\n')
  cat('Std : ',std_value,'\n')
  cat('quatile : ',qt_value,'\n')
  cat('Skewness : ',skw_value,'\n\n')
}

Test the function

Test the function with the vector (1,2,3,4,5,6,7,8,9,10,50). You should see something like:

  1. mean: 9.54545454545454
  2. median: 6
  3. min: 1 max: 50
  4. sd: 13.7212509368762
  5. quantile (0.05 - 0.95): 1.5 – 30
  6. skewness: 2.62039633563579
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50)
printVectorInfo(x)
## Mean :  9.545455 
## Median :  6 
## Min :  1   Max :  50 
## Std :  13.72125 
## quatile :  1.5 30 
## Skewness :  2.620396
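
The skewness figure above comes from the `moments` package. As a cross-check, here is a base-R sketch that assumes the moment-based definition: the third central moment divided by the second central moment raised to the power 3/2, both computed with divisor n. The name `skewness_by_hand` is a hypothetical helper for illustration, not part of any package.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2),
# where m2 and m3 are central moments computed with divisor n (not n - 1).
skewness_by_hand <- function(v) {
  d <- v - mean(v)   # deviations from the mean
  m2 <- mean(d^2)    # second central moment
  m3 <- mean(d^3)    # third central moment
  m3 / m2^(3 / 2)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50)
skew_x <- skewness_by_hand(x)
print(skew_x)
```

For the test vector this should agree with the skewness printed above. The single large value 50 stretches the right tail, which is why the skewness is strongly positive and the mean sits well above the median.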

Creating Samples in a Jar

  1. Create a variable ‘jar’ that has 50 red and 50 blue marbles (hint: the jar can have strings as objects, with some of the strings being ‘red’ and some of the strings being ‘blue’)
  2. Confirm there are 50 reds by summing the samples that are red
  3. Sample 10 ‘marbles’ (really strings) from the jar. How many are red? What was the percentage of red marbles?
  4. Do the sampling 20 times, using the ‘replicate’ command. This should generate a list of 20 numbers. Each number is the mean of how many reds there were in 10 samples. Use your printVecInfo to see information of the samples. Also generate a histogram of the samples.
  5. Repeat step 4, but this time, sample the jar 100 times. You should get 20 numbers, this time each number represents the mean of how many reds there were in the 100 samples. Use your printVecInfo to see information of the samples. Also generate a histogram of the samples.
  6. Repeat step 5, but this time, replicate the sampling 100 times. You should get 100 numbers, this time each number represents the mean of how many reds there were in the 100 samples. Use your printVecInfo to see information of the samples. Also generate a histogram of the samples.
initial_jar <- c('red','blue')

jar <- rep(initial_jar,50)

sum(jar == 'red')
## [1] 50
length(jar)
## [1] 100
n_sam <- 10
samples <- sample(jar,n_sam, replace = TRUE)

red_prop <- length(which(samples == 'red'))/n_sam 
red_prop
## [1] 0.3
samples_prop <- replicate(20, mean(sample(jar, n_sam, replace = TRUE) == 'red'))

hist(samples_prop)

printVectorInfo(samples_prop)
## Mean :  0.535 
## Median :  0.5 
## Min :  0.2   Max :  0.8 
## Std :  0.1926956 
## quatile :  0.2 0.8 
## Skewness :  -0.05988477
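
The spread of these proportions has a simple theoretical benchmark. Assuming draws with replacement from a jar that is exactly half red, the proportion of reds in n draws has standard deviation sqrt(p(1 - p)/n) with p = 0.5. A short sketch for the two sample sizes used in this exercise (the variable names are illustrative only):

```r
p <- 0.5                 # share of red marbles in the jar
n_values <- c(10, 100)   # draws per sample in steps 4 and 5

# Theoretical standard deviation of the sample proportion for each n
se <- sqrt(p * (1 - p) / n_values)
names(se) <- paste0('n=', n_values)
print(round(se, 4))
```

This gives about 0.158 for 10 draws and 0.05 for 100 draws. With only 20 replicates, an observed standard deviation near 0.19 for 10 draws is plausible, and moving to 100 draws should narrow the spread by roughly a factor of sqrt(10).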
n_sam <- 100

samples_prop2 <- replicate(20, mean(sample(jar, n_sam, replace = TRUE) == 'red'))


hist(samples_prop2)

printVectorInfo(samples_prop2)
## Mean :  0.529 
## Median :  0.53 
## Min :  0.46   Max :  0.62 
## Std :  0.04037978 
## quatile :  0.46 0.601 
## Skewness :  0.2468321
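
Step 6 of the list (100 replicates of 100 draws) has no code above. A minimal sketch follows, written to be self-contained: it rebuilds the jar, uses base R only, and calls `set.seed(1)` purely as an assumption for reproducibility. The runs above were not seeded, so their numbers are not expected to match. The name `samples_prop3` is hypothetical.

```r
set.seed(1)  # assumption: fixed seed so the sketch is repeatable
jar <- rep(c('red', 'blue'), 50)
n_sam <- 100

# 100 replicates, each the proportion of reds in 100 draws with replacement
samples_prop3 <- replicate(100, mean(sample(jar, n_sam, replace = TRUE) == 'red'))

hist(samples_prop3)
summary(samples_prop3)
sd(samples_prop3)
```

Once `printVectorInfo` and the `moments` package are loaded, `printVectorInfo(samples_prop3)` can be used in place of `summary()` and `sd()`.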

Explore the airquality dataset

  1. Store the ‘airquality’ dataset into a temporary variable
  2. Clean the dataset (i.e. remove the NAs)
  3. Explore Ozone, Wind and Temp by doing a ‘printVecInfo’ on each as well as
    generating a histogram for each
air_temp <- airquality
summary(air_temp)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

Function to remove NAs from the data frame

# Keep only the rows of df with at most n missing values (n = 0 drops every row that has an NA).
remove_na <- function(df, n = 0) {
  df[rowSums(is.na(df)) <= n, ]
}

air_temp <- remove_na(air_temp)
summary(air_temp)
##      Ozone          Solar.R           Wind            Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
##  Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
##  Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
##  3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
##  Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
##      Month            Day       
##  Min.   :5.000   Min.   : 1.00  
##  1st Qu.:6.000   1st Qu.: 9.00  
##  Median :7.000   Median :16.00  
##  Mean   :7.216   Mean   :15.95  
##  3rd Qu.:9.000   3rd Qu.:22.50  
##  Max.   :9.000   Max.   :31.00
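
For the default `n = 0`, the same cleaning can be done with base R's `complete.cases()`. A short sketch, using a hypothetical variable name `air_clean`:

```r
# Keep only the rows that have no NA in any column
air_clean <- airquality[complete.cases(airquality), ]

nrow(airquality)       # rows before cleaning
nrow(air_clean)        # rows remaining after cleaning
sum(is.na(air_clean))  # number of NAs left in the cleaned data
```

Comparing the two row counts shows how many observations the cleaning step discards, which is worth checking before interpreting the summaries.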
printVectorInfo(air_temp$Ozone)
## Mean :  42.0991 
## Median :  31 
## Min :  1   Max :  168 
## Std :  33.27597 
## quatile :  8.5 109 
## Skewness :  1.248104
printVectorInfo(air_temp$Wind)
## Mean :  9.93964 
## Median :  9.7 
## Min :  2.3   Max :  20.7 
## Std :  3.557713 
## quatile :  4.6 15.5 
## Skewness :  0.4556414
printVectorInfo(air_temp$Temp)
## Mean :  77.79279 
## Median :  79 
## Min :  57   Max :  97 
## Std :  9.529969 
## quatile :  61 92.5 
## Skewness :  -0.2250959
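
The task also asks for a histogram of each variable, which is not shown above. A minimal sketch, assuming the cleaned data is rebuilt with `complete.cases()` so the block runs on its own (`air_clean`, `cols` and `hists` are hypothetical names):

```r
air_clean <- airquality[complete.cases(airquality), ]
cols <- c('Ozone', 'Wind', 'Temp')

# Draw one histogram per column; hist() also returns the binning it used.
hists <- lapply(cols, function(col) {
  hist(air_clean[[col]], main = paste('Histogram of', col), xlab = col)
})
names(hists) <- cols
```

The Ozone histogram should show a long right tail, consistent with its positive skewness above, while Wind and Temp should look much closer to symmetric.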

Use sapply to print the vector info for all columns

sapply(air_temp, printVectorInfo)
## Mean :  42.0991 
## Median :  31 
## Min :  1   Max :  168 
## Std :  33.27597 
## quatile :  8.5 109 
## Skewness :  1.248104 
## 
## Mean :  184.8018 
## Median :  207 
## Min :  7   Max :  334 
## Std :  91.1523 
## quatile :  22 310 
## Skewness :  -0.4862466 
## 
## Mean :  9.93964 
## Median :  9.7 
## Min :  2.3   Max :  20.7 
## Std :  3.557713 
## quatile :  4.6 15.5 
## Skewness :  0.4556414 
## 
## Mean :  77.79279 
## Median :  79 
## Min :  57   Max :  97 
## Std :  9.529969 
## quatile :  61 92.5 
## Skewness :  -0.2250959 
## 
## Mean :  7.216216 
## Median :  7 
## Min :  5   Max :  9 
## Std :  1.473434 
## quatile :  5 9 
## Skewness :  -0.2912679 
## 
## Mean :  15.94595 
## Median :  16 
## Min :  1   Max :  31 
## Std :  8.707194 
## quatile :  2 30 
## Skewness :  -0.01283216
## $Ozone
## NULL
## 
## $Solar.R
## NULL
## 
## $Wind
## NULL
## 
## $Temp
## NULL
## 
## $Month
## NULL
## 
## $Day
## NULL
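
The trailing `NULL` entries appear because `printVectorInfo` prints with `cat()` and has no useful return value, so `sapply()` collects one `NULL` per column. One way around this is sketched below with a hypothetical helper, `vectorInfo`, that returns a named numeric vector instead of printing. Skewness is computed with the moment-based formula in base R, an assumption made so the block has no package dependency.

```r
# Hypothetical helper: return the summary statistics instead of printing them
vectorInfo <- function(v) {
  d <- v - mean(v)
  c(mean = mean(v),
    median = median(v),
    min = min(v),
    max = max(v),
    sd = sd(v),
    quantile(v, probs = c(0.05, 0.95)),        # contributes the names "5%" and "95%"
    skewness = mean(d^3) / mean(d^2)^(3 / 2))  # moment-based skewness
}

air_clean <- airquality[complete.cases(airquality), ]

# One column of statistics per variable, collected into a matrix
info <- sapply(air_clean, vectorInfo)
round(info, 3)
```

Because every call returns a vector of the same length, `sapply()` simplifies the result to a matrix with one row per statistic. If only the printed output is wanted, wrapping the original call as `invisible(sapply(air_temp, printVectorInfo))` suppresses the list of `NULL`s.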