1. Draw a boxplot showing the area size distribution for each shape
##Because standard box plots hide the distribution of observations within each group, I used ggplot2's geom_jitter() function to plot the individual observations on top of the boxes and give a more complete picture.
data %>%
ggplot( aes(x=shape, y=area, fill=shape)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6) +
geom_jitter(color="black", size=0.2, alpha=0.9) +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=14)
) +
ggtitle("Area Size Distribution for Each Shape") +
xlab("Shape")
2. Calculate the mean, max, and standard deviation of the area size of each color
#Use the favstats() function from the mosaic package to build a data frame that includes the mean, max, and standard deviation.
summary_stats <- favstats(area ~ color, data = data)
##In addition to the requested statistics, favstats() returns "min", "Q1", "median", "Q3", "n" (number of observations), and "missing". For this question I dropped the statistics that were not requested.
summary_stats <- summary_stats[, c("color", "max", "mean", "sd")] #select the relevant statistics by name
summary_stats
## color max mean sd
## 1 blue 21642.4 3208.132 3039.213
## 2 green 27759.1 5761.119 6695.030
## 3 red 31415.9 3815.871 5092.678
## 4 yellow 31415.9 4538.208 5352.461
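The same three statistics can also be computed without mosaic. The following is a minimal base R sketch on a small made-up data frame (`demo` and its values are illustrative assumptions, not the assignment data), using `aggregate()` with a function that returns several named values.

```r
# Toy data standing in for the real dataset (values are invented)
demo <- data.frame(
  color = c("blue", "blue", "red", "red", "red"),
  area  = c(100, 300, 50, 150, 250)
)

# aggregate() applies the function to the areas of each color
stats <- aggregate(
  area ~ color, data = demo,
  FUN = function(x) c(max = max(x), mean = mean(x), sd = sd(x))
)

# The result holds a matrix column; flatten it into ordinary columns
stats <- do.call(data.frame, stats)
names(stats) <- c("color", "max", "mean", "sd")
stats
```

This avoids selecting columns after the fact, at the cost of writing out the statistics explicitly.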
3. What is the average area size of a yellow square?
#To obtain the average area of a yellow square, average the "area" column over the rows where the "shape" column is square and the "color" column is yellow
data %>% group_by(shape, color) %>% summarise(avg = mean(area)) %>% filter(shape == "square", color == "yellow")
## `summarise()` has grouped output by 'shape'. You can override using the `.groups` argument.
## # A tibble: 1 x 3
## # Groups: shape [1]
## shape color avg
## <chr> <chr> <dbl>
## 1 square yellow 3333.
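If the averages for every shape and color combination are of interest, a two-way `tapply()` gives them in one matrix that can then be indexed by name. This is a sketch on invented data (`demo` is an assumption for illustration only).

```r
# Toy data standing in for the real dataset (values are invented)
demo <- data.frame(
  shape = c("square", "square", "square", "circle"),
  color = c("yellow", "yellow", "blue", "yellow"),
  area  = c(100, 200, 50, 80)
)

# Mean area for every shape (rows) and color (columns)
avg <- tapply(demo$area, list(demo$shape, demo$color), mean)

# Look up a single combination by name
avg["square", "yellow"]
```

Combinations that do not occur in the data appear as `NA` in the matrix.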
4. Which shape is most likely to be green?
##Use base R to calculate the proportion of each shape among the green objects
prop.table(table(data$shape[data$color == "green"]))
##
## circle square
## 0.3974359 0.6025641
#prop.table() only returns circles and squares; verify the output by tabulating shape against color without the green filter
table(data$shape, data$color)
##
## blue green red yellow
## circle 9 31 30 50
## square 152 47 56 222
## triangle 199 0 204 0
#There are, in fact, no green triangles
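Among the green objects, a square is the most common shape (about 60%). Read the other way, as the share of each shape that is green, the table above gives 31/120 (about 26%) for circles and 47/477 (about 10%) for squares, so a circle is the shape most likely to be green.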
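As a cross-check, the counts in the table above can be normalised within each shape instead of within green. This is a minimal sketch that re-enters the printed counts by hand (the `counts` object is illustrative, not part of the original analysis) and uses base R's `prop.table()` with `margin = 1`.

```r
# Counts copied from the shape-by-color table printed above
counts <- matrix(
  c(  9, 31,  30,  50,
    152, 47,  56, 222,
    199,  0, 204,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(shape = c("circle", "square", "triangle"),
                  color = c("blue", "green", "red", "yellow"))
)

# margin = 1 divides each row by its row total:
# the share of each color within a shape
within_shape <- prop.table(counts, margin = 1)
green_share <- within_shape[, "green"]
round(green_share, 3)
```

Which normalisation is appropriate depends on how the question is read: `margin = 1` answers "what fraction of this shape is green", while filtering on green first answers "what fraction of green objects has this shape".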
5. Given that the object is red, with an area size larger than 3,000, what are the chances the object is a square? A triangle? A circle?
#Filter the data to the rows that satisfy both conditions, then compute the proportion of each shape among them
data %>% filter(color == "red" & area > 3000) %>%
count(color, shape) %>%
mutate(prop = n / sum(n))
## color shape n prop
## 1 red circle 20 0.160
## 2 red square 21 0.168
## 3 red triangle 84 0.672
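The proportions follow directly from the counts in the output above. A short sketch of the arithmetic, with the counts re-entered by hand (the vector `n` is illustrative):

```r
# Counts of red objects with area > 3000, copied from the output above
n <- c(circle = 20, square = 21, triangle = 84)

# Conditional probability of each shape = its count over the total
prop <- n / sum(n)
round(prop, 3)
```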
6. Write a function that calculates the side or radius of an object, depending on the shape and area of the object [for an equilateral triangle - area = (side ^ 2) * sqrt(3) / 4]
#Note: for the purpose of this exercise I am assuming that all polygons are regular (equilateral and equiangular)
#The function takes the number of sides (n) and the area (A) as input and returns the side length or radius (a) as output. I appended columns "n" and "A" to the dataframe: "n" gives the number of sides of the polygon based on the "shape" column (0 for a circle), and "A" is simply an alias of the "area" column.
data$n <- ifelse(data$shape == "circle", 0,
ifelse(data$shape == "triangle", 3,4)) #code each shape as numeric number of sides to direct function
data$A <- data$area #add "A" as "area" column to simplify function
In building the function I use an ifelse() statement that handles circles (rows where n = 0) and polygons (rows where n > 0) separately.
The two cases are combined in a single function (length), which I then wrap with Vectorize() so that it operates on every row without an explicit loop over the dataframe, making the code more concise and less error-prone.
length <- function(n,A){ #note: this name masks base::length() in the global environment
ifelse(n > 0, sqrt(4*A*tan(pi/n)/n), sqrt(A/pi)) #regular polygon: A = n*a^2/(4*tan(pi/n)); circle: A = pi*r^2
}
side <- Vectorize(length) #Vectorize() wraps the function so that it is applied element by element
7. Add a column to the dataset called “side” that shows the size matching the area in each row, round that number to the closest integer (shape side or radius)
data$side <- round(side(data$n, data$A)) #pass the columns as vectors, not as one-column data frames
head(data)
## shape color area n A side
## 1 square yellow 9409.0 4 9409.0 97
## 2 circle yellow 4071.5 0 4071.5 36
## 3 triangle blue 2028.0 3 2028.0 68
## 4 square blue 3025.0 4 3025.0 55
## 5 square blue 9216.0 4 9216.0 96
## 6 square yellow 4356.0 4 4356.0 66
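The side and radius values can be sanity-checked against the equilateral-triangle formula quoted in question 6. The sketch below uses a hypothetical helper, `side_from_area()` (my naming, not part of the original code), that inverts the regular-polygon area formula A = n * a^2 / (4 * tan(pi / n)) and only evaluates it for rows with n > 0, so that circles never reach tan(pi / 0). It assumes `n` and `A` have the same length.

```r
# Hypothetical helper: side length (polygons) or radius (circles, coded n = 0)
side_from_area <- function(n, A) {
  out <- sqrt(A / pi)            # radius; kept only where n == 0
  poly <- n > 0
  # Invert A = n * a^2 / (4 * tan(pi / n)) for the polygon rows
  out[poly] <- sqrt(4 * A[poly] * tan(pi / n[poly]) / n[poly])
  out
}

# First three rows of the data shown above: square, circle, triangle
n <- c(4, 0, 3)
A <- c(9409.0, 4071.5, 2028.0)
sides <- side_from_area(n, A)
round(sides)

# Plugging the triangle side back into area = side^2 * sqrt(3) / 4
triangle_area <- sides[3]^2 * sqrt(3) / 4
triangle_area
```

Recomputing the area from the side is a cheap way to confirm that the inversion is correct for each shape.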
8. Draw a boxplot showing the side size distribution for each shape - what can you infer from this plot?
box_plot <- ggplot(data, aes(x=side, y=shape, group= shape)) #basic boxplot of side distribution per shape
box_plot +
geom_boxplot() +
geom_point(shape = 5,
color = "steelblue") + #add individual observations to show distribution
theme(
legend.position="right",
plot.title = element_text(size=11) #specify position and size of legend and title
) +
ggtitle("Side Size Distribution for Each Shape")
Though there are fewer circles, their side (radius) size distribution is consistent, and they account for the majority of the area because area grows with the square of the side or radius.
9. Make a scatter plot with “side” on the x axis and “area” on the y axis, with a different color for each shape
ggplot(data, aes(x=side, y=area, color=shape)) + #map side to the x axis, area to the y axis, and point color to shape
geom_point() + #draw one point per object
theme_ipsum()
10. Create a dataframe, table or list that shows for each shape: a. The proportion of red objects within the shape
reddata <- as.data.frame(data %>% filter(color == "red") %>% #Filter data- retain red shapes only
count(color, shape) %>% #add/group number of red values by shape
mutate(prop = n / sum(n))) #divide number of grouped red shapes by total number of red shapes
reddata$prop <- scales::percent(reddata$prop) #Reformat decimals as percent and add column "prop" (proportion) to dataframe
reddata
## color shape n prop
## 1 red circle 30 10.3%
## 2 red square 56 19.3%
## 3 red triangle 204 70.3%
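The table above gives each shape's share of all red objects. If the question is instead read as the share of each shape that is red, the denominator becomes the shape total. The sketch below rebuilds one row per object from the shape-by-color counts printed under question 4 (the `counts`, `long`, and `objects` names are illustrative) and takes the mean of a logical vector within each shape.

```r
# Counts copied from the shape-by-color table printed under question 4
counts <- matrix(
  c(  9, 31,  30,  50,
    152, 47,  56, 222,
    199,  0, 204,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(shape = c("circle", "square", "triangle"),
                  color = c("blue", "green", "red", "yellow"))
)

# Expand the counts back into one row per object
long <- as.data.frame(as.table(counts))
objects <- long[rep(seq_len(nrow(long)), long$Freq), c("shape", "color")]

# The mean of a TRUE/FALSE vector is the proportion of TRUE values
red_share <- tapply(objects$color == "red", objects$shape, mean)
round(red_share, 3)
```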
10b. The proportion of blue area out of the shape’s total area (sum of square inch blue area of the shape over sum of all shape size).
ExpCustomStat(data = data,Cvar=c("color","shape"),Nvar=c("area"), stat = c('prop'), gpby= T, filt = "color == 'blue'")
## color shape Attribute Filter prop
## 1: blue triangle area color == 'blue' 55.28
## 2: blue square area color == 'blue' 42.22
## 3: blue circle area color == 'blue' 2.50
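The percentages above add up to 100 across shapes, which suggests they describe how the blue area is split between shapes. If the intended quantity is blue area divided by each shape's own total area, a base R sketch might look like the following (the `demo` data are invented for illustration).

```r
# Toy data standing in for the real dataset (values are invented)
demo <- data.frame(
  shape = c("circle", "circle", "square", "square", "square"),
  color = c("blue", "red", "blue", "blue", "yellow"),
  area  = c(10, 30, 20, 20, 60)
)

# Total area per shape
total_area <- tapply(demo$area, demo$shape, sum)

# Blue area per shape: non-blue rows contribute 0, so shapes
# without any blue objects get 0 instead of dropping out
blue_area <- tapply(demo$area * (demo$color == "blue"), demo$shape, sum)

blue_share <- blue_area / total_area
blue_share
```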
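11. Create a function that calculates 10. b. for a given shape and color
custom <- function(x, y){ExpCustomStat(data,Cvar=c("color","shape"),Nvar=c("area"), stat = c('prop'), gpby= TRUE, filt = paste0(x, " == '", y, "'"))} #build the filter string from the arguments; a literal "x == y" is never substituted
#x = name of the column to filter on ("shape" or "color")
#y = value to keep in that column, e.g. custom("color", "blue")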
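For comparison, here is a minimal base R sketch of a function that takes a shape and a color directly and returns that color's share of the shape's total area. The name `area_share()` and the `demo` data are assumptions made for illustration; the function expects a data frame with `shape`, `color`, and `area` columns.

```r
# Hypothetical helper: area of the given color within the given shape,
# divided by the total area of that shape
area_share <- function(df, shape, color) {
  in_shape <- df$shape == shape
  sum(df$area[in_shape & df$color == color]) / sum(df$area[in_shape])
}

# Toy data standing in for the real dataset (values are invented)
demo <- data.frame(
  shape = c("circle", "circle", "square", "square", "square"),
  color = c("blue", "red", "blue", "blue", "yellow"),
  area  = c(10, 30, 20, 20, 60)
)

area_share(demo, "square", "blue")
area_share(demo, "circle", "red")
```

A shape that does not occur in the data gives 0/0 and therefore `NaN`, which may be worth handling explicitly in real use.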