Data structures define how data is stored and organized in R. They include data frames, matrices, vectors, lists, arrays, and factors. Each behaves differently, which determines how the data can be used. For example, a data frame allows a different class in each column, while a vector, a more basic data structure, allows only one class.
A class defines the type of an individual value, for example numeric, integer, logical, or character. As the name suggests, the numeric class holds numbers (1, 2, 3, 5), while the character class is used for text data. In short, data structures store and organize the data, while the class is the type of the data in R.
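As a small illustration of the classes named above, class() can be called on individual values. The values and the name mixed are arbitrary examples, not taken from any data set in this report.

```r
# class() reports the class of a value
class(1.5)     # "numeric"
class(1L)      # "integer" (the L suffix creates an integer)
class(TRUE)    # "logical"
class("text")  # "character"

# A vector holds one class only: mixing types coerces every element
mixed <- c(1, "A", TRUE)
class(mixed)   # "character"
```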
Vec1 <- c(1,2,3,4)
Vec2 <- c("A","B","C","D")
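Vec3 <- c(TRUE, FALSE, TRUE, TRUE) # logical constants must be written in uppercase
# cbind() combines the vectors into a matrix; a matrix holds a single type, so every value is coerced to character
Dfa1 <- cbind(Vec1, Vec2, Vec3)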
print(Dfa1)
##      Vec1 Vec2 Vec3
## [1,] "1"  "A"  "TRUE"
## [2,] "2"  "B"  "FALSE"
## [3,] "3"  "C"  "TRUE"
## [4,] "4"  "D"  "TRUE"
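The output above is a character matrix, because cbind() on plain vectors cannot keep a separate class per column. A minimal sketch of the contrast with data.frame() follows; the names Mat1 and Dfa2 are introduced here only for illustration.

```r
Vec1 <- c(1, 2, 3, 4)
Vec2 <- c("A", "B", "C", "D")
Vec3 <- c(TRUE, FALSE, TRUE, TRUE)

# cbind() returns a matrix, so all values share one type (character here)
Mat1 <- cbind(Vec1, Vec2, Vec3)
is.matrix(Mat1)
typeof(Mat1)

# data.frame() keeps a separate class for each column
Dfa2 <- data.frame(Vec1, Vec2, Vec3)
sapply(Dfa2, class)
```

With data.frame(), Vec1 stays numeric and Vec3 stays logical, which matches the description of data frames given earlier.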
# Using the built-in quakes data set
head(quakes,1)
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
class(quakes)
## [1] "data.frame"
typeof(quakes)
## [1] "list"
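The class() function tells us that the class of the quakes data set is “data.frame”, while typeof() reports its underlying storage type as “list”. This is because a data frame is stored internally as a list of equal-length column vectors.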
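The relationship between the two results can be checked directly. This is a short sketch using only base R functions on the built-in quakes data set.

```r
# A data frame is a list whose elements are the columns
is.list(quakes)        # TRUE
is.data.frame(quakes)  # TRUE

# The list has one element per column
length(quakes) == ncol(quakes)

# Each list element (column) has its own class
sapply(quakes, class)

# List-style access returns a single column as a vector
head(quakes[["mag"]], 3)
```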
V1 <- c(43, 55, 39, 89, 9, 32, 79)
R_StandardDeviation_InBuilt <- sd(V1)
print(R_StandardDeviation_InBuilt)
## [1] 27.56723
# Sample variance by hand: sum of squared deviations from the mean, divided by n - 1
Var1 <- sum((V1 - mean(V1))^2) / (length(V1) - 1)
R_StandardDeviation_Hand <- sqrt(Var1)
print(R_StandardDeviation_Hand)
## [1] 27.56723
IQR
## function (x, na.rm = FALSE, type = 7)
## diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE,
## type = type))
## <bytecode: 0x7ff29f7e27a0>
## <environment: namespace:stats>
Based on the code, the function computes the interquartile range: the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile). It coerces the input to numeric with as.numeric() and, when na.rm = TRUE, removes missing values before computing the quantiles.
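To confirm this reading of the source, the same calculation can be spelled out with quantile(). This is a sketch that reuses the V1 values from earlier; the names iqr_builtin, iqr_by_hand, and V1_na are introduced only for this example.

```r
V1 <- c(43, 55, 39, 89, 9, 32, 79)

# Built-in interquartile range
iqr_builtin <- IQR(V1)

# Same calculation spelled out: 3rd quartile minus 1st quartile
q <- quantile(V1, c(0.25, 0.75), names = FALSE)
iqr_by_hand <- q[2] - q[1]

all.equal(iqr_builtin, iqr_by_hand)  # TRUE

# With a missing value, na.rm = TRUE is needed to get a result
V1_na <- c(V1, NA)
IQR(V1_na, na.rm = TRUE)
```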
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x7ff2a0952830>
## <environment: namespace:stats>
We used sd() in part 1 to compute the standard deviation, so it is interesting to see the logic behind the code. The function takes the variance of x and then calculates the square root of the result. I also noticed that x is passed to var() unchanged when it is a vector or factor and is otherwise converted with as.double(), and that missing values can be removed with na.rm = TRUE before the calculation continues.
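The two observations above can be checked with a short sketch. It reuses the V1 values from part 1; V1_na is a hypothetical copy with one missing value added.

```r
V1 <- c(43, 55, 39, 89, 9, 32, 79)

# sd() is the square root of var()
all.equal(sd(V1), sqrt(var(V1)))  # TRUE

# With a missing value the default result is NA
V1_na <- c(V1, NA)
sd(V1_na)                # NA
sd(V1_na, na.rm = TRUE)  # same value as sd(V1)
```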
inches_to_feet <- function(inch) {
  # 12 inches in one foot
  feet <- inch / 12
  return(feet)
}
inches_to_feet(129)
## [1] 10.75
# Testing a scenario in which you have to calculate how many feet are in 129 inches
129/12
## [1] 10.75
summary(trees)
##      Girth           Height       Volume
##  Min.   : 8.30   Min.   :63   Min.   :10.20
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40
##  Median :12.90   Median :76   Median :24.20
##  Mean   :13.25   Mean   :76   Mean   :30.17
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30
##  Max.   :20.60   Max.   :87   Max.   :77.00
library(ggplot2)
Tr <- trees
Avg_girth <- mean(Tr$Girth)
# Density of girth with a dashed line and a label marking the mean
ggplot(Tr, aes(x = Girth)) +
  geom_density(color = "green", fill = "navy") +
  # linewidth replaces the size argument for lines, deprecated in ggplot2 3.4.0
  geom_vline(xintercept = Avg_girth, color = "white", linetype = "dashed", linewidth = 1) +
  # annotate() draws the label once instead of once per row of the data
  annotate("text", x = Avg_girth, y = 0.02, label = paste(" Mean Girth =", round(Avg_girth, 2)),
           color = "white", hjust = 0) +
  ggtitle("Density plot: Girth") +
  ylab("Density") +
  xlab("Girth")
Based on the above graph, the tree girth data appears positively skewed, since the distribution has a longer tail on the right. That said, this does not seem like a major issue for the data. The summary table shows that the values extend up to 20.6 and that the median (12.90) is less than the mean (13.25), which is consistent with larger values on the right side of the distribution pulling the mean upward.
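The mean and median comparison can also be made in code for every column at once. This is a small sketch; the name center is introduced here only for illustration.

```r
# Mean and median for each column of trees
center <- sapply(trees, function(x) c(mean = mean(x), median = median(x)))
round(center, 2)

# TRUE where the mean exceeds the median, hinting at a longer right tail
center["mean", ] > center["median", ]
```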
#install.packages("moments")
library(moments)
skewness(Tr)
## Girth Height Volume
## 0.5263163 -0.3748690 1.0643575
The value of ~0.53 for Girth shows that its skewness is moderate, since moderate skewness falls between -1 and -0.5 or between 0.5 and 1. A skewness of less than -1 or greater than 1 indicates a highly skewed distribution and could create a larger problem for the data; this can be observed for Volume (~1.06) in the trees data set. Height (~-0.37) falls between -0.5 and 0.5, so it is approximately symmetric and causes no problems.
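For reference, skewness can be computed by hand in base R and labeled with the cutoffs used above. This is a sketch under an assumption: it uses the moment-based definition (third central moment divided by the second central moment to the power 3/2), which I believe is what moments::skewness() computes, but that has not been verified here. The helper names skew_by_hand and skew_label are hypothetical.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
skew_by_hand <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Rule-of-thumb labels using the cutoffs from the text
skew_label <- function(s) {
  if (abs(s) > 1) {
    "high"
  } else if (abs(s) >= 0.5) {
    "moderate"
  } else {
    "approximately symmetric"
  }
}

skews <- sapply(trees, skew_by_hand)
round(skews, 4)
sapply(skews, skew_label)
```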