Analyst_Coding

1. Draw a boxplot showing the area size distribution for each shape

##Because standard box plots hide the distribution of observations within each group, I used ggplot2's geom_jitter() function to plot individual observations on top of the boxes to provide a more complete picture.

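library(dplyr)       #provides %>% and the data verbs used below
library(ggplot2)
library(viridis)     #provides scale_fill_viridis()
library(hrbrthemes)  #provides theme_ipsum()

data %>%
  ggplot(aes(x=shape, y=area, fill=shape)) +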
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  geom_jitter(color="black", size=0.2, alpha=0.9) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=14)
  ) +
  ggtitle("Area Size Distribution for Each Shape") +
  xlab("Shape")

2. Calculate the mean, max, and standard deviation of the area size of each color

#Use the favstats() function from the mosaic package to build a dataframe with the mean, max, and standard deviation.

summary_stats <- mosaic::favstats(area ~ color, data = data) #call via mosaic:: so dplyr functions are not masked

##In addition to the requested statistics, favstats() prints "min", "Q1", "median", "Q3", "n" (number of observations), and "missing". For the purpose of this question I excluded the statistics not requested.##

summary_stats <- summary_stats[, c("color", "max", "mean", "sd")] #select relevant statistics by name
summary_stats

##    color     max     mean       sd
## 1   blue 21642.4 3208.132 3039.213
## 2  green 27759.1 5761.119 6695.030
## 3    red 31415.9 3815.871 5092.678
## 4 yellow 31415.9 4538.208 5352.461
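
As a cross-check on the favstats() output, the same three statistics can be computed in base R. This is only a sketch: the data frame `toy` and its values are made up for illustration and are not the real dataset.

```r
# Toy data standing in for the real dataset (values are invented)
toy <- data.frame(
  color = c("blue", "blue", "red", "red", "red"),
  area  = c(100, 300, 50, 150, 250)
)

# Split the areas by color, then compute max, mean and sd for each group.
# sapply() returns one column per color, so t() puts one color on each row.
stats_by_color <- t(sapply(
  split(toy$area, toy$color),
  function(a) c(max = max(a), mean = mean(a), sd = sd(a))
))
stats_by_color
```

Agreement between this table and the favstats() result would suggest that the column selection picked up the intended statistics.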

3. What is the average area size of a yellow square?

#To obtain the average area of a yellow square, compute the mean of the "area" column for each shape and color, then keep the row where the "shape" column is square and the "color" column is yellow

data %>% group_by(shape, color) %>% summarise(avg = mean(area)) %>% filter(shape == "square", color == "yellow")

## `summarise()` has grouped output by 'shape'. You can override using the `.groups` argument.

## # A tibble: 1 x 3
## # Groups:   shape [1]
##   shape  color    avg
##   <chr>  <chr>  <dbl>
## 1 square yellow 3333.
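
The same average can be obtained without grouping, by subsetting first and averaging second. A minimal base R sketch, assuming a data frame with the same three column names; `toy` and its values are invented for illustration.

```r
# Toy data with the same columns as the real dataset (values are invented)
toy <- data.frame(
  shape = c("square", "square", "square", "circle"),
  color = c("yellow", "yellow", "blue", "yellow"),
  area  = c(100, 300, 50, 80)
)

# Logical mask for rows that are both square and yellow
is_yellow_square <- toy$shape == "square" & toy$color == "yellow"

# Mean area over the matching rows only
avg_yellow_square <- mean(toy$area[is_yellow_square])
avg_yellow_square
```

Filtering before averaging avoids computing means for groups that are then discarded.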

4. Which shape is most likely to be green?

##Use base R to calculate the proportion of each shape among the green objects
prop.table(table(data$shape[data$color == "green"]))

## 
##    circle    square 
## 0.3974359 0.6025641

#prop.table() only returns circles and squares; verify the output by removing the green filter
table(data$shape, data$color)

##           
##            blue green red yellow
##   circle      9    31  30     50
##   square    152    47  56    222
##   triangle  199     0 204      0

#There are, in fact, no green triangles

Squares account for the largest share of green objects (about 60%), so a green object is most likely to be a square.
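
The question can also be read the other way round: within each shape, what fraction of objects is green? A short sketch of that reading, rebuilding the cross-tab printed above by hand (the matrix name `counts` is just a label chosen here):

```r
# Counts copied from the shape-by-color table above
counts <- matrix(
  c(  9, 31,  30,  50,
    152, 47,  56, 222,
    199,  0, 204,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(shape = c("circle", "square", "triangle"),
                  color = c("blue", "green", "red", "yellow"))
)

# margin = 1 divides each row by its row total, giving P(color | shape)
green_within_shape <- prop.table(counts, margin = 1)[, "green"]
round(green_within_shape, 3)
```

Under this reading circles come out ahead: 31 of 120 circles are green (about 26%), against 47 of 477 squares (about 10%).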

5. Given the fact that the object is red, with an area size larger than 3,000 - what are the chances the object is a square? a triangle? a circle?

#Find the probability that an object is of any given shape when the conditions are satisfied by filtering the data to align with the conditions

data %>% filter(color == "red" & area > 3000) %>% 
  count(color, shape) %>%
  mutate(prop = n / sum(n))

##   color    shape  n  prop
## 1   red   circle 20 0.160
## 2   red   square 21 0.168
## 3   red triangle 84 0.672

6. Write a function that calculates the side or radius of an object, depending on the shape and area of the object [for an equilateral triangle - area = (side ^ 2) * sqrt(3) / 4]

#Note: for the purpose of this exercise I am making the assumption that all polygons are regular (equilateral and equiangular)

#The function takes the number of sides (n) and the area (A) as input and returns the length of the side or radius as the output. I appended columns "n" and "A" to the dataframe: "n" specifies the number of sides of the polygon based on the "shape" column, and "A" is simply an alias of the "area" column.#

data$n <- ifelse(data$shape == "circle", 0,
                 ifelse(data$shape == "triangle", 3, 4)) #code each shape as its number of sides (0 = circle) to direct the function
data$A <- data$area #add "A" as an alias of the "area" column to simplify the function

In building the function I use an ifelse statement that handles circles (data where n = 0) and polygons (data where n > 0) separately.

Combining Parts I & II gives a single ifelse function (side_length), which I then vectorize so that it can operate on all elements without needing to loop through the dataframe, making the code more concise and less error prone.

side_length <- function(n, A){ #named side_length so that it does not mask base::length()
  #regular polygon: A = n * s^2 / (4 * tan(pi/n)), so s = sqrt(4 * A * tan(pi/n) / n); circle: r = sqrt(A / pi)
  ifelse(n > 0, sqrt(4*A*tan(pi/n)/n), sqrt(A/pi))
}

side <- Vectorize(side_length) #Vectorize creates a function wrapper that vectorizes the action of its argument FUN.

7. Add a column to the dataset called “side” that shows the size matching the area in each row, round that number to the closest integer (shape side or radius)

data$side <- round(side(data$n, data$A)) #pass the columns as vectors; data['n'] would pass a one-column dataframe

head(data)

##      shape  color   area n      A side
## 1   square yellow 9409.0 4 9409.0   97
## 2   circle yellow 4071.5 0 4071.5   36
## 3 triangle   blue 2028.0 3 2028.0   68
## 4   square   blue 3025.0 4 3025.0   55
## 5   square   blue 9216.0 4 9216.0   96
## 6   square yellow 4356.0 4 4356.0   66
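
A quick way to sanity-check a side calculation like the one in question 6 is a round trip: compute the area from a known side, then recover the side from that area. The sketch below assumes the standard regular-polygon area formula A = n * s^2 / (4 * tan(pi / n)); the helper names `polygon_area` and `side_from_area` are hypothetical and used only for this check.

```r
# Area of a regular polygon with n sides of length s
polygon_area <- function(n, s) n * s^2 / (4 * tan(pi / n))

# Inverse of the formula above: side length from the area
side_from_area <- function(n, A) sqrt(4 * A * tan(pi / n) / n)

# For n = 3 the general formula should agree with the equilateral
# triangle formula quoted in the question: side^2 * sqrt(3) / 4
triangle_matches <- isTRUE(all.equal(polygon_area(3, 10), 10^2 * sqrt(3) / 4))

# Round trip: side -> area -> side should give back the original side
triangle_round_trip <- side_from_area(3, polygon_area(3, 10))
square_side <- side_from_area(4, 9409)

c(triangle_matches = triangle_matches,
  triangle_round_trip = triangle_round_trip,
  square_side = square_side)
```

For squares tan(pi / 4) is 1, so the formula reduces to sqrt(A). That means a square-only check would not catch a misplaced tan() term, which is why the triangle case is tested as well.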

8. Draw a boxplot showing the side size distribution for each shape - what can you infer from this plot?

box_plot <- ggplot(data, aes(x=side, y=shape, group=shape)) #basic boxplot of side distribution per shape
box_plot +
  geom_boxplot() +
  geom_point(shape = 5,
             color = "steelblue") + #add individual observations to show distribution
  theme(
    legend.position="right",
    plot.title = element_text(size=11) #specify legend position and title size
  ) +
  ggtitle("Side Size Distribution for Each Shape")

Though there are fewer circles, their side size distribution is consistent, and they account for the majority of area because area grows with the square of the side or radius.

9. Make a scatter plot with “side” on the x axis, “area” on the y axis with a different color for each shape

ggplot(data, aes(x=side, y=area, color=shape)) + #map side to the x axis and area to the y axis, with point color varying by shape
  geom_point() + #draw one point per object
  theme_ipsum()

10. Create a dataframe, table or list that shows for each shape: a. The proportion of red objects within the shape

reddata <- data %>%
  filter(color == "red") %>% #Filter data - retain red shapes only
  count(color, shape) %>% #count the red objects of each shape
  mutate(prop = n / sum(n)) %>% #divide each count by the total number of red objects
  as.data.frame()
reddata$prop <- scales::percent(reddata$prop) #Reformat the "prop" (proportion) column from decimals to percent
reddata

##   color    shape   n  prop
## 1   red   circle  30 10.3%
## 2   red   square  56 19.3%
## 3   red triangle 204 70.3%
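
The table above gives each shape's share of all red objects. If "within the shape" is instead read as the fraction of each shape's objects that are red, the counts from the cross-tab in question 4 are enough to compute it. A sketch under that reading, with the counts retyped by hand:

```r
# Counts copied from the shape-by-color table in question 4
counts <- matrix(
  c(  9, 31,  30,  50,
    152, 47,  56, 222,
    199,  0, 204,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(shape = c("circle", "square", "triangle"),
                  color = c("blue", "green", "red", "yellow"))
)

# Red objects of each shape divided by all objects of that shape
red_within_shape <- counts[, "red"] / rowSums(counts)
round(red_within_shape, 3)
```

Here the denominators are the shape totals (120 circles, 477 squares, 403 triangles) rather than the 290 red objects, so the proportions no longer sum to one across shapes.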

10b. The proportion of blue area out of the shape’s total area (sum of square inch blue area of the shape over sum of all shape size).

ExpCustomStat(data = data, Cvar = c("color","shape"), Nvar = c("area"), stat = c('prop'), gpby = TRUE, filt = "color == 'blue'")

##    color    shape Attribute          Filter  prop
## 1:  blue triangle      area color == 'blue' 55.28
## 2:  blue   square      area color == 'blue' 42.22
## 3:  blue   circle      area color == 'blue'  2.50
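
The output above splits the total blue area across shapes. A different reading of the question divides each shape's blue area by that same shape's total area. A base R sketch of that calculation follows; `toy` and its values are invented for illustration and are not the real dataset.

```r
# Toy data with the same columns as the real dataset (values are invented)
toy <- data.frame(
  shape = c("circle", "circle", "square", "square", "square"),
  color = c("blue", "red", "blue", "blue", "yellow"),
  area  = c(10, 30, 20, 20, 60)
)

# Total area per shape
total_area <- tapply(toy$area, toy$shape, sum)

# Blue area per shape: multiplying by the logical mask zeroes out non-blue rows,
# so shapes without any blue objects still appear with a blue area of 0
blue_area <- tapply(toy$area * (toy$color == "blue"), toy$shape, sum)

blue_share <- blue_area / total_area
blue_share
```

Because both tapply() calls group by the same column, the two vectors line up by shape and can be divided directly.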

11. Create a function that calculates 10. b. for a given shape and color

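custom <- function(x, y){
  filt <- paste0(x, " == '", y, "'") #build the filter string, e.g. "color == 'blue'"
  ExpCustomStat(data, Cvar=c("color","shape"), Nvar=c("area"), stat = c('prop'), gpby = TRUE, filt = filt)
}
#x = column to filter on ("shape" or "color")
#y = value to match in that column, e.g. custom("color", "blue") reproduces 10b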
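
For comparison, here is a base R sketch of a function that takes a shape and a color directly and returns that color's share of the shape's total area. The function name `area_share`, its argument names, and the toy data are all hypothetical; the only assumption about the real data is that it has "shape", "color" and "area" columns.

```r
# Share of a shape's total area that belongs to one color
area_share <- function(df, shape_name, color_name) {
  in_shape <- df$shape == shape_name
  in_both  <- in_shape & df$color == color_name
  sum(df$area[in_both]) / sum(df$area[in_shape])
}

# Toy data (values are invented)
toy <- data.frame(
  shape = c("circle", "circle", "square", "square", "square"),
  color = c("blue", "red", "blue", "blue", "yellow"),
  area  = c(10, 30, 20, 20, 60)
)

square_blue <- area_share(toy, "square", "blue")
circle_red  <- area_share(toy, "circle", "red")
c(square_blue = square_blue, circle_red = circle_red)
```

Passing the data frame as an argument keeps the function independent of any global object, at the cost of one extra parameter.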

Analyst_Coding_Test

Julie Phillips

7/27/2021