Week 3 Discussion

I. (a) Classes describe the type of the elements in the data. Elements can be numeric (integer or double precision, covering whole numbers and decimals), character, logical (two values, TRUE or FALSE), complex, or raw, so the data class corresponds to an attribute of the elements themselves.

Structures refer to how the data are arranged. A vector is the simplest data structure: it has one dimension and all elements share the same type. A matrix has two dimensions, with a single data type and columns of equal length. An array extends the matrix to more than two dimensions. A data frame is the most common structure; it allows a different data type in each column but requires all columns to have the same length. A list is the most flexible structure, placing no restriction on the type or length of its elements, but it is harder to work with.
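As a rough illustration of the structures described above, the sketch below builds one small object of each kind and reports its class. The object names and values are made up for this example.

```r
# One small object of each structure (names and values are illustrative only)
v  <- c(1.5, 2, 3.25)                                # vector: one dimension, one type
m  <- matrix(1:6, nrow = 2)                          # matrix: two dimensions, one type
ar <- array(1:24, dim = c(2, 3, 4))                  # array: three dimensions here
df <- data.frame(id = 1:3, name = c("a", "b", "c"))  # columns may differ in type
l  <- list(1, "two", c(TRUE, FALSE))                 # list: any types, any lengths

# First element of class() for each object, collected into one named vector
structure_classes <- sapply(list(v = v, m = m, ar = ar, df = df, l = l),
                            function(obj) class(obj)[1])
print(structure_classes)

# Dimensions of the array and of the data frame
dim(ar)
dim(df)
```

Taking only the first element of `class()` keeps the result a simple character vector, since some objects can carry more than one class.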

data(iris)
class(iris)

## [1] "data.frame"

typeof(iris)

## [1] "list"

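Using the iris dataset, ‘class’ returns "data.frame" and ‘typeof’ returns "list". ‘class’ describes the structure of the data as R treats it, while ‘typeof’ gives the underlying storage type: a data frame is stored internally as a list of equal-length columns.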
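The same distinction shows up on simpler objects. The sketch below compares ‘class’ and ‘typeof’ on a few small vectors; the names are chosen only for this example.

```r
# Illustrative objects (names are hypothetical)
x_int <- 1:3                          # integer vector
x_dbl <- c(1, 2, 3)                   # double-precision vector
f     <- factor(c("low", "high", "low"))

# Put class and storage type side by side for each object
comparison <- data.frame(
  object = c("x_int", "x_dbl", "f"),
  class  = c(class(x_int), class(x_dbl), class(f)),
  typeof = c(typeof(x_int), typeof(x_dbl), typeof(f))
)
print(comparison)

# A data frame passes the list check, consistent with typeof(iris)
is.list(iris)
```

A factor is a useful case: its class is "factor", but its values are stored as integer codes.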

a <- c(2,4,5,8,9,3,1)
R_StandardDeviation_InBuilt <- sd(a)
R_StandardDeviation_Hand <- sqrt(sum((a - mean(a))^2) / (length(a) - 1))
R_StandardDeviation_InBuilt

## [1] 2.992053

R_StandardDeviation_Hand

## [1] 2.992053

III. (a)

IQR

## function (x, na.rm = FALSE, type = 7) 
## diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE, 
##     type = type))
## <bytecode: 0x15052ebf0>
## <environment: namespace:stats>

sd

## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x126beccf0>
## <environment: namespace:stats>

b <- c(1,2,3,4,5,6)
IQR(b)

## [1] 2.5

sd(b)

## [1] 1.870829

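IQR calculates the interquartile range of a vector, that is, the difference between its 0.75 and 0.25 quantiles; sd calculates the standard deviation as the square root of the variance returned by var.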
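To check this reading of the two function bodies, the sketch below rebuilds both results from the calls they wrap, `quantile` and `var`. The names `iqr_by_hand` and `sd_by_hand` are introduced here for illustration.

```r
b <- c(1, 2, 3, 4, 5, 6)

# IQR: difference between the 0.75 and 0.25 quantiles (type 7 is the default)
q <- quantile(b, probs = c(0.25, 0.75), names = FALSE, type = 7)
iqr_by_hand <- q[2] - q[1]

# sd: square root of the sample variance
sd_by_hand <- sqrt(var(b))

# Compare against the built-in functions
all.equal(iqr_by_hand, IQR(b))
all.equal(sd_by_hand, sd(b))
```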

calculate_variance <- function(x){
  val <- sum((x - mean(x))^2)/(length(x)-1)
  return(val)
}
v <- c(5,8,9,2,5,8)  # named v rather than c to avoid masking base::c
calculate_variance(v)

## [1] 6.966667

var(v)

## [1] 6.966667

#install.packages('np')
library(np)

## Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-17)
## [vignette("np_faq",package="np") provides answers to frequently asked questions]
## [vignette("np",package="np") an overview]
## [vignette("entropy_np",package="np") an overview of entropy-based methods]

library(ggplot2)
data(wage1)
ggplot(data = wage1, aes(x = wage)) +
  geom_density(fill = 'blue')

ggplot(data = wage1, aes(x = educ)) +
  geom_density(fill = 'blue')

ggplot(data = wage1, aes(x = exper)) +
  geom_density(fill = 'blue')

From a visual inspection of the density plots, ‘wage’ is highly right-skewed, ‘educ’ is moderately left-skewed, and ‘exper’ is moderately right-skewed.

#install.packages('e1071')
library(e1071)
skewness(wage1$wage)

## [1] 2.001603

skewness(wage1$educ)

## [1] -0.6178081

skewness(wage1$exper)

## [1] 0.7048504

For the ‘wage’ variable, the skewness is 2.0016, which is greater than 1, so it is highly right-skewed. For ‘educ’, the skewness is -0.6178, between -1 and -0.5, so it is moderately left-skewed. For ‘exper’, the skewness is 0.7049, between 0.5 and 1, so it is moderately right-skewed.
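The classification rule used above (cut-offs at 0.5 and 1 in absolute value) can be written as a small helper. Both functions below are hypothetical sketches in base R, not part of e1071. `moment_skewness` uses the plain moment ratio, whereas `e1071::skewness` offers several estimator types, so its values may differ slightly from those reported above.

```r
# Moment-based skewness: third central moment over the second to the power 3/2
# (a simple sketch; not necessarily the estimator e1071 uses by default)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Label a skewness value using the 0.5 and 1 cut-offs from the text
describe_skew <- function(s) {
  if (abs(s) <= 0.5) return("approximately symmetric")
  degree    <- if (abs(s) > 1) "highly" else "moderately"
  direction <- if (s > 0) "right" else "left"
  paste(degree, direction, "skewed")
}

# Apply the labels to the skewness values reported above
reported <- c(wage = 2.001603, educ = -0.6178081, exper = 0.7048504)
labels <- sapply(reported, describe_skew)
print(labels)
```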

2023-09-18