library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(patchwork)
library(aplpack)
# data
car_thefts <- c(52,62,51,50,69,
58,77,66,53,57,
75,56,55,67,73,
79,59,68,65,72,
57,51,63,69,75,
65,53,78,66,55)
# stemplot
stem(car_thefts)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 011233
## 5 | 5567789
## 6 | 23
## 6 | 55667899
## 7 | 23
## 7 | 55789
library(aplpack)
# data
Makati <- c(55,70,44,36,40,
63,40,44,34,38,
60,47,52,32,32,
50,53,32,28,31,
52,32,34,32,50,
26,29)
Cubao <- c(61,40,38,32,30,
58,40,40,25,30,
54,40,36,30,30,
53,39,36,34,33,
50,38,36,39,32)
stem.leaf.backback(Makati, Cubao, m = 1)
## __________________________________________________
## 1 | 2: represents 12, leaf unit: 1
## Makati Cubao
## __________________________________________________
## 3 986| 2 |5 1
## 13 8644222221| 3 |000022346668899 (15)
## (5) 74400| 4 |0000 9
## 9 532200| 5 |0348 5
## 3 30| 6 |1 1
## 1 0| 7 |
## __________________________________________________
## n: 27 25
## __________________________________________________
The distributions for both Makati and Cubao are positively skewed. The buildings in Makati show a larger variation in the number of stories per building. Although both distributions peak in the 30- to 39-story class, Cubao has more buildings in this class. Makati has more buildings with 40 or more stories than Cubao does.
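The comparison above can be checked numerically. The following is a minimal base R sketch, not part of the original solution; the helper names `stories`, `in_30s`, `forty_plus`, `story_range`, and `mean_minus_median` are introduced here only for illustration.

```r
# Story counts copied from the exercise data above
Makati <- c(55, 70, 44, 36, 40, 63, 40, 44, 34, 38, 60, 47, 52, 32, 32,
            50, 53, 32, 28, 31, 52, 32, 34, 32, 50, 26, 29)
Cubao <- c(61, 40, 38, 32, 30, 58, 40, 40, 25, 30, 54, 40, 36, 30, 30,
           53, 39, 36, 34, 33, 50, 38, 36, 39, 32)
stories <- list(Makati = Makati, Cubao = Cubao)

# Number of buildings in the 30- to 39-story class
in_30s <- sapply(stories, function(x) sum(x >= 30 & x <= 39))
# Number of buildings with 40 or more stories
forty_plus <- sapply(stories, function(x) sum(x >= 40))
# Range as a simple measure of variation
story_range <- sapply(stories, function(x) max(x) - min(x))
# A mean above the median is consistent with positive skew
mean_minus_median <- sapply(stories, function(x) mean(x) - median(x))

print(rbind(in_30s, forty_plus, story_range, mean_minus_median))
```

The range is only a rough measure of variation; `sd()` or `IQR()` could be used in the same way.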
president_age <- c(57,54,52,55,51,56,
61,68,56,55,54,61,
57,51,46,54,51,52,
57,49,54,42,60,69,
58,64,49,51,62,64,
57,48,50,56,43,46,
61,65,47,55,55,54)
stem(president_age)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 4 | 23
## 4 | 667899
## 5 | 011112244444
## 5 | 555566677778
## 6 | 0111244
## 6 | 589
age_range <-max(president_age) - min(president_age)
age_range
## [1] 27
The distribution of the presidents' ages is roughly symmetric (approximately normal) and unimodal, with a peak in the 50s. The ages span a range of 27 years (42 to 69).
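One way to confirm where the peak lies is to count the ages by decade. This is a small base R sketch added for illustration; `decade_counts` is a name introduced here, not something from the original solution.

```r
# Ages copied from the exercise data above
president_age <- c(57, 54, 52, 55, 51, 56, 61, 68, 56, 55, 54, 61,
                   57, 51, 46, 54, 51, 52, 57, 49, 54, 42, 60, 69,
                   58, 64, 49, 51, 62, 64, 57, 48, 50, 56, 43, 46,
                   61, 65, 47, 55, 55, 54)

# Integer-divide by 10 to get the decade (40, 50, 60), then tabulate
decade_counts <- table(president_age %/% 10 * 10)
print(decade_counts)

# Range as the difference between the largest and smallest age
print(diff(range(president_age)))
```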
4. Twenty Days of Plant Growth: The growth (in centimeters) of two varieties of plant after 20 days is shown in this table. Construct a back-to-back stem and leaf plot for the data, and compare the distributions.
library(aplpack)
variety_1 <- c(20,12,39,38,
41,43,51,52,
59,55,53,59,
50,58,35,38,
23,32,43,53)
variety_2 <- c(18,45,62,59,
53,25,13,57,
42,55,13,57,
42,55,56,38,
41,36,50,62,
45,55)
stem.leaf.backback(variety_1, variety_2, m = 1)
## _____________________________________
## 1 | 2: represents 12, leaf unit: 1
## variety_1 variety_2
## _____________________________________
## 1 2| 1 |338 3
## 3 30| 2 |5 4
## 8 98852| 3 |68 6
## (3) 331| 4 |12255 (5)
## 9 998533210| 5 |035556779 (9)
## | 6 |22 2
## | 7 |
## _____________________________________
## n: 20 22
## _____________________________________
The distributions are somewhat similar in shape; however, the variation in the data for variety 2 is larger than the variation in the data for variety 1.
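The claim about variation can be backed up with simple spread measures. The sketch below uses the range and the standard deviation as two possible choices; the names `growth`, `growth_range`, and `growth_sd` are illustrative and do not come from the original solution.

```r
# Growth data (cm) copied from the exercise above
variety_1 <- c(20, 12, 39, 38, 41, 43, 51, 52, 59, 55,
               53, 59, 50, 58, 35, 38, 23, 32, 43, 53)
variety_2 <- c(18, 45, 62, 59, 53, 25, 13, 57, 42, 55, 13,
               57, 42, 55, 56, 38, 41, 36, 50, 62, 45, 55)
growth <- list(variety_1 = variety_1, variety_2 = variety_2)

# Range: largest minus smallest observation
growth_range <- sapply(growth, function(x) max(x) - min(x))
# Sample standard deviation
growth_sd <- sapply(growth, sd)

print(rbind(growth_range, growth_sd))
```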
# data
Real_cheese = c(310,420,45,40,220,240,180,90)
Cheese_substitute = c(270,180,250,290,130,260,340,310)
real <- ggplot(mapping = aes(Real_cheese))+
geom_boxplot(color = "green", fill = "green", alpha = .5)+
theme_bw()+
labs(title = "Real Cheese Boxplot",
x = "Sodium Content")
substitute <- ggplot(mapping = aes(Cheese_substitute))+
geom_boxplot(color = "yellow", fill = "yellow", alpha = .5)+
theme_bw()+
labs(title = "Cheese Substitute Boxplot",
x = "Sodium Content")
real | substitute
The distribution for real cheese is skewed to the right (positively skewed) because the right whisker is longer than the left, while the distribution for the cheese substitute is negatively skewed. The cheese substitute data clearly have a higher median than the real cheese data. The variation (spread) of the real cheese distribution is larger than that of the cheese substitute distribution.
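The medians and spreads read off the boxplots can also be computed directly. This is a minimal sketch that uses R's default quantile definition (type 7), which is assumed here to match what the boxplot draws; `cheese`, `cheese_median`, `cheese_iqr`, and `whiskers` are names introduced for illustration.

```r
# Sodium data copied from the exercise above
Real_cheese <- c(310, 420, 45, 40, 220, 240, 180, 90)
Cheese_substitute <- c(270, 180, 250, 290, 130, 260, 340, 310)
cheese <- list(Real_cheese = Real_cheese, Cheese_substitute = Cheese_substitute)

# Median and interquartile range for each group
cheese_median <- sapply(cheese, median)
cheese_iqr <- sapply(cheese, IQR)

# Whisker lengths (no point lies beyond 1.5 * IQR in either group,
# so each whisker runs to the minimum or maximum)
whiskers <- sapply(cheese, function(x) {
  q <- quantile(x, probs = c(.25, .75), names = FALSE)
  c(left = q[1] - min(x), right = max(x) - q[2])
})

print(rbind(cheese_median, cheese_iqr))
print(whiskers)
```

A longer right whisker points to positive skew and a longer left whisker to negative skew.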
Noisy_Workingplace <- tibble(
Area1 = c(30,12,35,65,24,59,68,57,100,61,32,45,92,56,44),
Area2 = c(64,99,87,59,23,16,94,78,57,32,52,78,59,55,55),
Area3 = c(100,59,78,97,84,64,53,59,89,88,94,66,57,62,64),
Area4 = c(25,15,30,20,61,56,34,22,24,21,32,52,14,10,33),
Area5 = c(59,63,81,110,65,112,132,145,163,120,84,99,105,68,75),
Area6 = c(67,80,99,49,67,56,80,125,100,93,56,45,80,34,21)
)
names(Noisy_Workingplace)
## [1] "Area1" "Area2" "Area3" "Area4" "Area5" "Area6"
# reshape the data to long format
Noisy_Workingplace_longer <- Noisy_Workingplace %>%
  pivot_longer(everything(), values_to = "Decibels", names_to = "Area")
# make a boxplot; the blue dot marks the mean of each area
Noisy_Workingplace_longer %>%
  ggplot(aes(x = Area, y = Decibels, fill = Area)) +
  geom_boxplot(show.legend = FALSE) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 5, color = "blue", alpha = .5, show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "The Noisy Workplace Boxplot") +
  coord_flip()
Based on the boxplot above, I recommend that workers in Areas 5 and 6 be given protective ear gear. The mean of each area, shown by a blue dot in the boxplot, is below the safe hearing threshold of 120 decibels, but individual readings in these two areas exceed it.
In this boxplot, 3 of the 15 readings in Area 5 (20%) are above the safe hearing level of 120 decibels, so workers in Area 5 should definitely have protective earwear. One of the readings in Area 6 is also above the safe hearing level, so it might be a good idea to provide protective earwear to those workers as well. Areas 1-4 appear to be “safe” with respect to hearing level, with Area 4 being the safest.
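The readings above the threshold can be counted per area rather than estimated from the plot. The sketch below uses a plain base R `data.frame` in place of the tibble, and the names `noise`, `threshold`, `n_above`, and `share_above` are introduced here for illustration only.

```r
# Decibel readings copied from the exercise data above
noise <- data.frame(
  Area1 = c(30, 12, 35, 65, 24, 59, 68, 57, 100, 61, 32, 45, 92, 56, 44),
  Area2 = c(64, 99, 87, 59, 23, 16, 94, 78, 57, 32, 52, 78, 59, 55, 55),
  Area3 = c(100, 59, 78, 97, 84, 64, 53, 59, 89, 88, 94, 66, 57, 62, 64),
  Area4 = c(25, 15, 30, 20, 61, 56, 34, 22, 24, 21, 32, 52, 14, 10, 33),
  Area5 = c(59, 63, 81, 110, 65, 112, 132, 145, 163, 120, 84, 99, 105, 68, 75),
  Area6 = c(67, 80, 99, 49, 67, 56, 80, 125, 100, 93, 56, 45, 80, 34, 21)
)
threshold <- 120  # safe hearing level in decibels, as stated in the text

# Count and share of readings strictly above the threshold in each area
n_above <- colSums(noise > threshold)
share_above <- colMeans(noise > threshold)
# Highest reading per area, to see how far each area sits from the threshold
max_reading <- sapply(noise, max)

print(rbind(n_above, share_above, max_reading))
```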
7. Consider again the pizza data described in page 33. Create a boxplot for the delivery time. Discuss the distribution. What is the median delivery time?
pizza <- read_csv("C:\\Users\\user1\\Desktop\\R Data Science\\EDA-MODELING AND PRESENTATION\\pizza_delivery.csv", show_col_types = FALSE)
head(pizza)
## # A tibble: 6 x 12
## day date time operator branch driver temperature bill pizzas free_wine
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Thursday 1-Ma~ 35.1 Laura East Bruno 68.3 58.4 4 0
## 2 Thursday 1-Ma~ 25.2 Melissa East Salva~ 71.0 26.4 2 0
## 3 Thursday 1-Ma~ 45.6 Melissa West Salva~ 53.4 58.1 3 1
## 4 Thursday 1-Ma~ 29.4 Melissa East Salva~ 70.3 35.2 3 0
## 5 Thursday 1-Ma~ 30.0 Melissa West Salva~ 71.5 38.4 2 0
## 6 Thursday 1-Ma~ 40.3 Melissa Centre Bruno 60.8 61.8 4 1
## # ... with 2 more variables: got_wine <dbl>, discount_customer <dbl>
# create a boxplot of delivery time, with outliers drawn in red
pizza %>%
  ggplot(aes(time)) +
  geom_boxplot(color = "blue", fill = "blue", alpha = .5, outlier.colour = "red") +
  theme_bw() +
  labs(title = "Pizza Delivery Time")
# median of time in pizza delivery
median(pizza$time)
## [1] 34.38196
The boxplot for the delivery time is roughly symmetric, with a median delivery time of about 34 minutes. Most deliveries took between 30 and 40 minutes. The extreme values (outliers) indicate that there were some exceptionally short and long delivery times.
# data
distance = c(12.5,29.9,14.8,18.7,7.6,16.2,16.5,27.4,12.1,17.5)
altitude = c(342,1245,502,555,398,670,796,912,238,466)
mean(distance)
## [1] 17.32
mean(altitude)
## [1] 612.4
median(distance)
## [1] 16.35
median(altitude)
## [1] 528.5
# first quartile
quantile(distance, probs = .25, type = 5)
## 25%
## 12.5
quantile(altitude, probs = .25, type = 5)
## 25%
## 398
# third quartile
quantile(distance, probs = .75, type = 5)
## 75%
## 18.7
quantile(altitude, probs = .75, type = 5)
## 75%
## 796
# IQR
IQR(distance, type = 5)
## [1] 6.2
IQR(altitude, type = 5)
## [1] 398
quantile(distance,probs=seq(0,1,0.25),na.rm=FALSE, names=TRUE,type=5)
## 0% 25% 50% 75% 100%
## 7.60 12.50 16.35 18.70 29.90
quantile(altitude,probs=seq(0,1,0.25),na.rm=FALSE, names=TRUE,type=5)
## 0% 25% 50% 75% 100%
## 238.0 398.0 528.5 796.0 1245.0
First quartile for the distance = 12.50
Third quartile for the distance = 18.70
First quartile for the altitude = 398
Third quartile for the altitude = 796
INTERPRETATION FOR THE DISTANCE
The difference between the median and the first quartile (16.35 - 12.50 = 3.85) is larger than the difference between the third quartile and the median (18.70 - 16.35 = 2.35). This indicates a distribution that is skewed to the left.
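The quartile comparison used in this interpretation can be written as a small helper and applied to both variables. This is an illustrative sketch: `quartile_gaps` and `gaps` are hypothetical names, and `type = 5` is kept to match the quantile calls above.

```r
# Data copied from the exercise above
distance <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)
altitude <- c(342, 1245, 502, 555, 398, 670, 796, 912, 238, 466)

# Distances from the median to each quartile (type 5, as in the text)
quartile_gaps <- function(x) {
  q <- quantile(x, probs = c(.25, .5, .75), type = 5, names = FALSE)
  c(median_minus_q1 = q[2] - q[1], q3_minus_median = q[3] - q[2])
}

gaps <- sapply(list(distance = distance, altitude = altitude), quartile_gaps)
print(gaps)
```

A larger `median_minus_q1` suggests left skew in the middle half of the data, and a larger `q3_minus_median` suggests right skew. This rule looks only at the box, so it can disagree with what the whiskers and outliers suggest.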
Distance <- ggplot(mapping = aes(distance))+
geom_boxplot(color = "green", fill = "green", alpha = .5)+
theme_bw()+
labs(title = "Distance Boxplot")
Altitude <- ggplot(mapping = aes(altitude))+
geom_boxplot(color = "yellow", fill = "yellow", alpha = .5)+
theme_bw()+
labs(title = "Altitude Boxplot")
Distance | Altitude
The distribution of distance is skewed to the left, while the distribution of altitude is skewed to the right. There is also at least one extreme value at the high end of the distance data.
DATA SET 1. Test scores of students in an entrance examination (%) & strand
test_score <- read_csv("C:\\Users\\user1\\Desktop\\R Data Science\\Machine Learning\\Machine Learning\\test_score.csv", show_col_types = F)
head(test_score)
## # A tibble: 6 x 6
## Student Strand CAT Communication Science Math
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 HUMSS 52 54 50 42
## 2 2 HUMSS 51 56 50 42
## 3 3 HUMSS 42 62 36 24
## 4 4 HUMSS 52 64 52 36
## 5 5 HUMSS 48 62 42 30
## 6 6 HUMSS 49 60 42 28
stem.leaf.backback(test_score$Science, test_score$Math, m = 1)
## _____________________________________________________________________
## 1 | 2: represents 12, leaf unit: 1
## test_score$Science test_score$Math
## _____________________________________________________________________
## | 1 |46688 5
## 4 8886| 2 |000002222224444666666688 (24)
## (21) 888888886666644442000| 3 |00224446688 21
## (20) 66664444422222200000| 4 |022244468 10
## 5 20000| 5 |0 1
## | 6 |
## _____________________________________________________________________
## n: 50 50
## _____________________________________________________________________
The distribution of Science scores is roughly normal, while the distribution of Math scores is skewed to the right.
MATH: Skewed to the right. Turning the plot on its side shows that the distribution of the observations is skewed to the right. More scores are located on the left side of the curve: there are more observations with lower scores and very few observations with high scores.
SCIENCE: Fairly symmetric.
# create a boxplot of all score
# create a longer data
test_score_longer <- test_score %>%
pivot_longer(.,-c(Student, Strand), names_to = "Subject", values_to = "Score")
test_score_longer %>%
ggplot(aes(x = Subject, y = Score, fill = Subject))+
geom_boxplot()+
theme_bw()+
coord_flip()
Lower scores are more common than higher scores; in other words, more students performed poorly.
# frequency and proportion of students in each strand
result <- test_score %>%
  group_by(Strand) %>%
  summarise(Frequency = n(),
            Proportion = n() / 50)
# append a total row
Total <- list("Total", 50, 1)
result <- rbind(result, Total)
result
## # A tibble: 5 x 3
## Strand Frequency Proportion
## <chr> <dbl> <dbl>
## 1 ABM 12 0.24
## 2 GAS 15 0.3
## 3 HUMSS 15 0.3
## 4 STEM 8 0.16
## 5 Total 50 1
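The same summary can be built without hard-coding the sample size of 50. The sketch below is base R only and does not read the CSV file: the `strand` vector is a stand-in reconstructed from the frequencies printed above (an assumption in place of `test_score$Strand`), and `freq`, `prop`, and `summary_table` are illustrative names.

```r
# Stand-in for test_score$Strand, rebuilt from the printed frequencies
strand <- rep(c("ABM", "GAS", "HUMSS", "STEM"), times = c(12, 15, 15, 8))

# Frequency of each strand and its relative frequency
freq <- table(strand)
prop <- prop.table(freq)

summary_table <- data.frame(
  Strand = names(freq),
  Frequency = as.vector(freq),
  Proportion = as.vector(prop)
)
# Append a total row computed from the data rather than typed by hand
summary_table <- rbind(
  summary_table,
  data.frame(Strand = "Total", Frequency = sum(freq), Proportion = sum(prop))
)
print(summary_table)
```

Because `prop.table()` divides by the observed total, the proportions stay correct if rows are added to or removed from the data.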