library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(patchwork)
library(aplpack)

Review Problem

  1. AXA Company is a car insurance company. Its analyst conducted a survey on the number of car thefts in Iloilo City over a period of 30 days last summer. The raw data are shown. Construct a stem and leaf plot using the classes 50–54, 55–59, 60–64, 65–69, 70–74, and 75–79.
# data
car_thefts <- c(52,62,51,50,69,
                58,77,66,53,57,
                75,56,55,67,73,
                79,59,68,65,72,
                57,51,63,69,75,
                65,53,78,66,55)
# stemplot

stem(car_thefts)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   5 | 011233
##   5 | 5567789
##   6 | 23
##   6 | 55667899
##   7 | 23
##   7 | 55789
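
As a cross-check on the stem plot, the number of leaves in each row can be counted directly by binning the data into the requested classes. This is a minimal sketch using base R's `cut()` and `table()`; the names `breaks` and `class_counts` are introduced here for illustration only.

```r
# data (the same 30 daily counts used for the stem plot)
car_thefts <- c(52,62,51,50,69,
                58,77,66,53,57,
                75,56,55,67,73,
                79,59,68,65,72,
                57,51,63,69,75,
                65,53,78,66,55)

# classes 50-54, 55-59, ..., 75-79 written as left-closed intervals [50,55), [55,60), ...
breaks <- seq(50, 80, by = 5)

# count how many observations fall into each class
class_counts <- table(cut(car_thefts, breaks = breaks, right = FALSE))
class_counts
```

Each count should equal the number of leaves on the corresponding row of the stem plot.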
  2. The number of stories in two selected samples of tall buildings in Makati and Cubao is shown. Construct a back-to-back stem and leaf plot, and compare the distributions.
library(aplpack)
# data
Makati <- c(55,70,44,36,40,
            63,40,44,34,38,
            60,47,52,32,32,
            50,53,32,28,31,
            52,32,34,32,50,
            26,29)
Cubao <- c(61,40,38,32,30,
           58,40,40,25,30,
           54,40,36,30,30,
           53,39,36,34,33,
           50,38,36,39,32)


stem.leaf.backback(Makati, Cubao, m = 1)
## __________________________________________________
##   1 | 2: represents 12, leaf unit: 1 
##                 Makati     Cubao             
## __________________________________________________
##    3               986| 2 |5                  1   
##   13        8644222221| 3 |000022346668899  (15)  
##   (5)            74400| 4 |0000               9   
##    9            532200| 5 |0348               5   
##    3                30| 6 |1                  1   
##    1                 0| 7 |                       
## __________________________________________________
## n:                  27     25                
## __________________________________________________

Both distributions are positively skewed. The buildings in Makati show a larger variation in the number of stories per building than those in Cubao. Although both distributions peak in the 30- to 39-story class, Cubao has more buildings in this class. Makati has more buildings with 40 or more stories than Cubao does.
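
The comparison read off the back-to-back plot can be backed by a few numbers. The sketch below is one possible way to do this in base R; the helper `describe_stories()` and the object `comparison` are hypothetical names used only for this illustration.

```r
# data (same samples as in the back-to-back stem and leaf plot)
Makati <- c(55,70,44,36,40,
            63,40,44,34,38,
            60,47,52,32,32,
            50,53,32,28,31,
            52,32,34,32,50,
            26,29)
Cubao <- c(61,40,38,32,30,
           58,40,40,25,30,
           54,40,36,30,30,
           53,39,36,34,33,
           50,38,36,39,32)

# hypothetical helper: sample size, spread, and counts for the classes discussed in the text
describe_stories <- function(x) {
  c(n           = length(x),
    range       = diff(range(x)),
    sd          = round(sd(x), 1),
    in_30s      = sum(x >= 30 & x <= 39),
    at_least_40 = sum(x >= 40))
}

# one row per city
comparison <- rbind(Makati = describe_stories(Makati),
                    Cubao  = describe_stories(Cubao))
comparison
```

The `in_30s` and `at_least_40` columns correspond to the class counts visible in the plot, while `range` and `sd` give two simple measures of variation.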

  3. The age at inauguration for each Philippine President is shown. Construct a stem and leaf plot and analyze the data. Find the range.
president_age <- c(57,54,52,55,51,56,
                   61,68,56,55,54,61,
                   57,51,46,54,51,52,
                   57,49,54,42,60,69,
                   58,64,49,51,62,64,
                   57,48,50,56,43,46,
                   61,65,47,55,55,54)

stem(president_age)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   4 | 23
##   4 | 667899
##   5 | 011112244444
##   5 | 555566677778
##   6 | 0111244
##   6 | 589
age_range <- max(president_age) - min(president_age)
age_range
## [1] 27

The distribution of the presidents' ages at inauguration is roughly symmetric and unimodal, with its peak in the 50s. The range of the ages is 27 years (69 - 42).
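
The range and the location of the peak can also be checked numerically. This is a small sketch in base R; `decade_counts` is a name introduced here for illustration, and grouping by decade is just one convenient choice of class width.

```r
# data (same ages as in the stem plot)
president_age <- c(57,54,52,55,51,56,
                   61,68,56,55,54,61,
                   57,51,46,54,51,52,
                   57,49,54,42,60,69,
                   58,64,49,51,62,64,
                   57,48,50,56,43,46,
                   61,65,47,55,55,54)

# range() returns c(min, max); diff() turns that pair into max - min
age_range <- diff(range(president_age))
age_range

# number of presidents per decade of age (40s, 50s, 60s)
decade_counts <- table(10 * (president_age %/% 10))
decade_counts
```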

  4. Twenty Days of Plant Growth: The growth (in centimeters) of two varieties of plant after 20 days is shown in this table. Construct a back-to-back stem and leaf plot for the data, and compare the distributions.

library(aplpack)

variety_1 <-   c(20,12,39,38,
                 41,43,51,52,
                 59,55,53,59,
                 50,58,35,38,
                 23,32,43,53)
variety_2 <-   c(18,45,62,59,
                 53,25,13,57,
                 42,55,13,57,
                 42,55,56,38,
                 41,36,50,62,
                 45,55)
stem.leaf.backback(variety_1, variety_2, m = 1)
## _____________________________________
##   1 | 2: represents 12, leaf unit: 1 
##        variety_1     variety_2   
## _____________________________________
##    1           2| 1 |338         3   
##    3          30| 2 |5           4   
##    8       98852| 3 |68          6   
##   (3)        331| 4 |12255      (5)  
##    9   998533210| 5 |035556779  (9)  
##                 | 6 |22          2   
##                 | 7 |                
## _____________________________________
## n:            20     22          
## _____________________________________

The distributions are somewhat similar in shape; however, the variation in the data for variety 2 is larger than the variation in the data for variety 1.
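
The statement about variation can be checked with two simple measures of spread. This is a minimal sketch; the object name `spread` is an assumption made for this example, and range and standard deviation are only two of several possible measures.

```r
# data (same samples as in the back-to-back stem and leaf plot)
variety_1 <- c(20,12,39,38,
               41,43,51,52,
               59,55,53,59,
               50,58,35,38,
               23,32,43,53)
variety_2 <- c(18,45,62,59,
               53,25,13,57,
               42,55,13,57,
               42,55,56,38,
               41,36,50,62,
               45,55)

# range (max - min) and standard deviation for each variety, one column per variety
spread <- sapply(list(variety_1 = variety_1, variety_2 = variety_2),
                 function(x) c(range = diff(range(x)), sd = sd(x)))
round(spread, 2)
```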

  5. A nutritionist is interested in comparing the sodium content of real cheese with the sodium content of a cheese substitute. The data for two random samples are shown. Compare the distributions, using boxplots.
# data

Real_cheese <- c(310,420,45,40,220,240,180,90)
Cheese_substitute <- c(270,180,250,290,130,260,340,310)

# horizontal boxplot of the sodium content of real cheese
real <- ggplot(mapping = aes(Real_cheese))+
  geom_boxplot(color = "green", fill = "green", alpha = .5)+
  theme_bw()+
  labs(title = "Real Cheese Boxplot",
       x = "Sodium Content")

# horizontal boxplot of the sodium content of the cheese substitute
substitute <- ggplot(mapping = aes(Cheese_substitute))+
  geom_boxplot(color = "yellow", fill = "yellow", alpha = .5)+
  theme_bw()+
  labs(title = "Cheese Substitute Boxplot",
       x = "Sodium Content")

real | substitute

The distribution for real cheese is skewed to the right (positively skewed) because the right whisker is much longer than the left one, while the distribution for the cheese substitute is negatively skewed. The median for the cheese substitute data is clearly higher than the median for the real cheese data. The variation or spread of the real cheese data is larger than that of the cheese substitute data.
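
The claims about the medians and the spread can be verified without the plot. The sketch below uses base R's `median()`, `IQR()`, and `range()` with their default settings; `cheese` and `cheese_summary` are names made up for this illustration.

```r
# data (same two samples as above)
Real_cheese <- c(310,420,45,40,220,240,180,90)
Cheese_substitute <- c(270,180,250,290,130,260,340,310)

# collect both samples so the same summary is applied to each
cheese <- list(real = Real_cheese, substitute = Cheese_substitute)

# median, interquartile range, and range (max - min) per sample
cheese_summary <- sapply(cheese, function(x) {
  c(median = median(x), IQR = IQR(x), range = diff(range(x)))
})
cheese_summary
```

A higher median in the `substitute` column and larger `IQR` and `range` values in the `real` column would agree with the reading of the boxplots.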

  6. The Noisy Workplace: The OSHA (Occupational Safety and Health Administration) of Iloilo City received complaints about noise levels from some of the workers at a state power plant. A survey was conducted at the power plant, with decibel readings taken at six different areas of the plant at different times of the day and week. The results of the data collection are listed. Use boxplots to initially explore the data and make recommendations about which plant areas workers must be provided with protective ear wear. The safe hearing level is approximately 120 decibels.
Noisy_Workingplace <- tibble(
  Area1  = c(30,12,35,65,24,59,68,57,100,61,32,45,92,56,44),
  Area2  = c(64,99,87,59,23,16,94,78,57,32,52,78,59,55,55),
  Area3  = c(100,59,78,97,84,64,53,59,89,88,94,66,57,62,64),
  Area4  = c(25,15,30,20,61,56,34,22,24,21,32,52,14,10,33),
  Area5  = c(59,63,81,110,65,112,132,145,163,120,84,99,105,68,75),
  Area6  = c(67,80,99,49,67,56,80,125,100,93,56,45,80,34,21)
)
names(Noisy_Workingplace)
## [1] "Area1" "Area2" "Area3" "Area4" "Area5" "Area6"
# reshape the data to long format: one row per (Area, reading) pair

Noisy_Workingplace_longer <- Noisy_Workingplace %>% 
  pivot_longer(everything(), values_to = "Decibels", names_to = "Area")

# boxplot of the decibel readings per area; the blue dot marks each area's mean
Noisy_Workingplace_longer %>% 
  ggplot(aes(x = Area, y = Decibels, fill = Area))+
  geom_boxplot(show.legend = FALSE)+
  stat_summary(fun = mean, geom = "point", shape = 20, size = 5, color = "blue", alpha = .5, show.legend = FALSE)+
  scale_fill_brewer(palette = "Dark2")+
  theme_bw() +
  labs(title = "The Noisy Workplace Boxplot")+
  coord_flip()

Based on the boxplot above, I recommend that workers in Areas 5 and 6 be given protective ear gear. Overall, the mean of each area, shown by a blue dot in the boxplot, is well below the safe hearing threshold of 120 decibels.

In this boxplot, about a fifth of the readings in Area 5 (3 of 15) are above the safe hearing level of 120 decibels, so workers in Area 5 should definitely have protective earwear. One of the readings in Area 6 is above the safe hearing level, so it might be a good idea to provide protective earwear to those workers as well. Areas 1-4 appear to be “safe” with respect to hearing level, with Area 4 being the safest.
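
The recommendation can be supported by counting, for each area, how many readings exceed the safe level. This is a minimal base R sketch; `noise`, `safe_level`, `n_above`, and `pct_above` are illustrative names, and a plain `data.frame` is used instead of a tibble so the block has no package dependencies.

```r
# data (same readings as in the tibble above)
noise <- data.frame(
  Area1 = c(30,12,35,65,24,59,68,57,100,61,32,45,92,56,44),
  Area2 = c(64,99,87,59,23,16,94,78,57,32,52,78,59,55,55),
  Area3 = c(100,59,78,97,84,64,53,59,89,88,94,66,57,62,64),
  Area4 = c(25,15,30,20,61,56,34,22,24,21,32,52,14,10,33),
  Area5 = c(59,63,81,110,65,112,132,145,163,120,84,99,105,68,75),
  Area6 = c(67,80,99,49,67,56,80,125,100,93,56,45,80,34,21)
)

# safe hearing level stated in the problem
safe_level <- 120

# number and percentage of readings strictly above the safe level, per area
n_above   <- colSums(noise > safe_level)
pct_above <- round(100 * colMeans(noise > safe_level), 1)

rbind(n_above, pct_above)
```

Areas with a nonzero count are the ones where protective earwear is worth considering.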

  7. Consider again the pizza data described on page 33. Create a boxplot for the delivery time. Discuss the distribution. What is the median delivery time?

pizza <- read_csv("C:\\Users\\user1\\Desktop\\R Data Science\\EDA-MODELING AND PRESENTATION\\pizza_delivery.csv", show_col_types = FALSE)
head(pizza)
## # A tibble: 6 x 12
##   day      date   time operator branch driver temperature  bill pizzas free_wine
##   <chr>    <chr> <dbl> <chr>    <chr>  <chr>        <dbl> <dbl>  <dbl>     <dbl>
## 1 Thursday 1-Ma~  35.1 Laura    East   Bruno         68.3  58.4      4         0
## 2 Thursday 1-Ma~  25.2 Melissa  East   Salva~        71.0  26.4      2         0
## 3 Thursday 1-Ma~  45.6 Melissa  West   Salva~        53.4  58.1      3         1
## 4 Thursday 1-Ma~  29.4 Melissa  East   Salva~        70.3  35.2      3         0
## 5 Thursday 1-Ma~  30.0 Melissa  West   Salva~        71.5  38.4      2         0
## 6 Thursday 1-Ma~  40.3 Melissa  Centre Bruno         60.8  61.8      4         1
## # ... with 2 more variables: got_wine <dbl>, discount_customer <dbl>
# create a boxplot of the delivery time; outliers are drawn in red

pizza %>% 
  ggplot(aes(time))+
  geom_boxplot(color = "blue", fill = "blue", alpha = .5, outlier.colour = "red")+
  theme_bw()+
  labs(title = "Pizza Delivery Time")

# median of time in pizza delivery 
median(pizza$time)
## [1] 34.38196

The boxplot for the delivery time is roughly symmetric, with a median delivery time of about 34 minutes. Most deliveries took between 30 and 40 minutes. The extreme values (outliers) indicate that there were some exceptionally short and long delivery times.
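
Extreme values like the ones mentioned above can also be listed rather than read off the plot. Because the pizza CSV file is not reproduced here, the sketch below uses a small made-up vector, `example_times`, which is purely hypothetical. It relies on base R's `boxplot.stats()`, which flags points more than 1.5 times the hinge spread beyond the hinges.

```r
# hypothetical delivery times in minutes (NOT the pizza data)
example_times <- c(12, 30, 31, 33, 34, 35, 36, 38, 40, 58)

# boxplot.stats() returns the five-number summary ($stats) and the flagged points ($out)
stats <- boxplot.stats(example_times, coef = 1.5)

# values lying beyond the whiskers
flagged <- stats$out
flagged

# median of the hypothetical times
median(example_times)
```

With the real data, the same call on `pizza$time` would list the unusually short and long deliveries, assuming the column is numeric as shown in the printed tibble.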

  8. A hiking enthusiast has a new app for his smartphone which summarizes his hikes by using a GPS device. Let us look at the distance hiked (in km) and maximum altitude (in m) for the last 10 hikes:
# data
distance = c(12.5,29.9,14.8,18.7,7.6,16.2,16.5,27.4,12.1,17.5)
altitude = c(342,1245,502,555,398,670,796,912,238,466)
  a. Calculate the arithmetic mean and median for both distance and altitude.
mean(distance)
## [1] 17.32
mean(altitude)
## [1] 612.4
median(distance)
## [1] 16.35
median(altitude)
## [1] 528.5
  b. Determine the first and third quartiles for both the distance and the altitude variables. How much larger is the difference between the median and the first quartile than the difference between the median and the third quartile?
# first quartile
quantile(distance, probs = .25, type = 5)
##  25% 
## 12.5
quantile(altitude, probs = .25, type = 5)
## 25% 
## 398
# third quartile
quantile(distance, probs = .75, type = 5)
##  75% 
## 18.7
quantile(altitude, probs = .75, type = 5)
## 75% 
## 796
# IQR
IQR(distance, type = 5)
## [1] 6.2
IQR(altitude, type = 5)
## [1] 398
quantile(distance,probs=seq(0,1,0.25),na.rm=FALSE, names=TRUE,type=5)
##    0%   25%   50%   75%  100% 
##  7.60 12.50 16.35 18.70 29.90
quantile(altitude,probs=seq(0,1,0.25),na.rm=FALSE, names=TRUE,type=5)
##     0%    25%    50%    75%   100% 
##  238.0  398.0  528.5  796.0 1245.0

First quartile for the distance = 12.50; third quartile for the distance = 18.70. First quartile for the altitude = 398; third quartile for the altitude = 796.

INTERPRETATION FOR THE DISTANCE The difference between the median and the first quartile (16.35 - 12.50 = 3.85) is larger than the difference between the third quartile and the median (18.70 - 16.35 = 2.35). This indicates a distribution that is skewed to the left.
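
The two differences compared in the interpretation can be computed for both variables at once. This is a minimal sketch that keeps the same `type = 5` quantile definition used above; the helper `quartile_gaps()` and the object `gaps` are hypothetical names introduced for this example.

```r
# data (same 10 hikes as above)
distance <- c(12.5,29.9,14.8,18.7,7.6,16.2,16.5,27.4,12.1,17.5)
altitude <- c(342,1245,502,555,398,670,796,912,238,466)

# hypothetical helper: distance from the median to each quartile
quartile_gaps <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 5, names = FALSE)
  c(median_minus_q1 = q[2] - q[1],
    q3_minus_median = q[3] - q[2])
}

# one row per variable
gaps <- rbind(distance = quartile_gaps(distance),
              altitude = quartile_gaps(altitude))
gaps
```

A larger `median_minus_q1` points toward left skew in the middle half of the data, and a larger `q3_minus_median` points toward right skew.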

  c. Create a boxplot. Looking at the distribution, would you say that it is symmetrical? Are there any extreme values?
Distance <- ggplot(mapping = aes(distance))+
  geom_boxplot(color = "green", fill = "green", alpha = .5)+
  theme_bw()+
  labs(title = "Distance Boxplot")

Altitude <- ggplot(mapping = aes(altitude))+
  geom_boxplot(color = "yellow", fill = "yellow", alpha = .5)+
  theme_bw()+
  labs(title = "Altitude Boxplot")

Distance | Altitude

The distribution of the distance is skewed to the left, while the distribution of the altitude is skewed to the right. There are also extreme values at the upper end of the distance data.
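
Whether a point counts as an extreme value depends on the rule used. The sketch below assumes the conventional 1.5 × IQR rule with R's default quartiles (`quantile()` type 7); a plotting function may compute its hinges slightly differently, so this is an approximation of what the boxplot shows. The helper `flag_extremes()` is a hypothetical name.

```r
# data (same 10 hikes as above)
distance <- c(12.5,29.9,14.8,18.7,7.6,16.2,16.5,27.4,12.1,17.5)
altitude <- c(342,1245,502,555,398,670,796,912,238,466)

# hypothetical helper: return the values lying more than coef * IQR beyond the quartiles
flag_extremes <- function(x, coef = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # default type 7 quartiles
  fence <- coef * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

extreme_distance <- flag_extremes(distance)
extreme_altitude <- flag_extremes(altitude)

extreme_distance
extreme_altitude
```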

DATA SET 1. Test scores of students in an entrance examination (%) & strand

test_score <- read_csv("C:\\Users\\user1\\Desktop\\R Data Science\\Machine Learning\\Machine Learning\\test_score.csv", show_col_types = FALSE)
head(test_score)
## # A tibble: 6 x 6
##   Student Strand   CAT Communication Science  Math
##     <dbl> <chr>  <dbl>         <dbl>   <dbl> <dbl>
## 1       1 HUMSS     52            54      50    42
## 2       2 HUMSS     51            56      50    42
## 3       3 HUMSS     42            62      36    24
## 4       4 HUMSS     52            64      52    36
## 5       5 HUMSS     48            62      42    30
## 6       6 HUMSS     49            60      42    28
  1. Create a stem and leaf plot for Science and Math scores. Compare the distribution.
stem.leaf.backback(test_score$Science, test_score$Math, m = 1)
## _____________________________________________________________________
##   1 | 2: represents 12, leaf unit: 1 
##               test_score$Science     test_score$Math             
## _____________________________________________________________________
##                                 | 1 |46688                       5   
##     4                       8886| 2 |000002222224444666666688  (24)  
##   (21)     888888886666644442000| 3 |00224446688                21   
##   (20)      66664444422222200000| 4 |022244468                  10   
##     5                      20000| 5 |0                           1   
##                                 | 6 |                                
## _____________________________________________________________________
## n:                            50     50                          
## _____________________________________________________________________

The distribution of the Science scores is roughly normal, while the distribution of the Math scores is skewed to the right.

MATH Skewed to the right. Turning the plot on its side shows that the distribution of the observations is skewed to the right. Most scores lie on the left side of the curve: there are many observations with low scores and very few observations with high scores.

SCIENCE Fairly Symmetric.

  2. What can be said of student scores in a positively skewed score distribution?
# reshape the scores to long format, then draw one boxplot per subject
test_score_longer <- test_score %>% 
  pivot_longer(.,-c(Student, Strand), names_to = "Subject", values_to = "Score")


test_score_longer %>% 
  ggplot(aes(x = Subject, y = Score, fill = Subject))+
  geom_boxplot()+
  theme_bw()+
  coord_flip()

In a positively skewed score distribution, most students earned low scores and only a few earned high scores; in other words, more students performed poorly.
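
One numerical sign of positive skew is that the mean is pulled above the median by a few high scores. The sketch below illustrates this with a small made-up vector, `example_scores`, which is hypothetical and not taken from the test score data.

```r
# hypothetical scores with a long right tail (NOT the test_score data)
example_scores <- c(20, 22, 22, 24, 26, 26, 28, 30, 36, 48)

score_mean   <- mean(example_scores)
score_median <- median(example_scores)

# in a positively skewed sample the mean typically exceeds the median
c(mean = score_mean, median = score_median)

# number of students scoring below the mean
below_mean <- sum(example_scores < score_mean)
below_mean
```

Here most of the hypothetical students fall below the mean, which is the pattern described in the answer above.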

  3. Referring to the data set above, count the number of students who belong to each high school strand stated in the table, then compute their percentages (%) with respect to the sample size n (Total).
# frequency and percentage of students per strand
result <- test_score %>% 
  group_by(Strand) %>% 
  summarise(Frequency = n(),
    Percentage = 100 * n() / nrow(test_score))

# append the total row
Total <- list("Total", 50, 100)
result <- rbind(result, Total)
result
## # A tibble: 5 x 3
##   Strand Frequency Percentage
##   <chr>      <dbl>      <dbl>
## 1 ABM           12         24
## 2 GAS           15         30
## 3 HUMSS         15         30
## 4 STEM           8         16
## 5 Total         50        100
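
The same table can be built in base R with `table()` and `prop.table()`. Since the CSV file is not reproduced here, this sketch rebuilds the strand column from the frequencies printed above; the vector `strand` and the object `summary_table` are names assumed for this illustration.

```r
# rebuild the strand labels from the printed frequencies (12, 15, 15, 8)
strand <- rep(c("ABM", "GAS", "HUMSS", "STEM"), times = c(12, 15, 15, 8))

# frequency of each strand and its share of the sample, in percent
freq <- table(strand)
pct  <- round(100 * prop.table(freq), 1)

# assemble the table and append a total row
summary_table <- data.frame(Strand     = names(freq),
                            Frequency  = as.integer(freq),
                            Percentage = as.numeric(pct))
summary_table <- rbind(summary_table,
                       data.frame(Strand     = "Total",
                                  Frequency  = sum(freq),
                                  Percentage = sum(pct)))
summary_table
```

Computing the total from the data rather than typing 50 keeps the table correct if the sample size changes.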