The following table contains the ages of people who are frequent customers of a vegetarian restaurant. With this data, calculate:
#Libraries installed directly in the console, this for avoid that every time we "knit" the packages install again and again.
# install.packages("DescTools")
# install.packages("psych")
library(DescTools)
## Warning: package 'DescTools' was built under R version 4.5.3
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:DescTools':
##
## AUC, ICC, SD
#Create table with the dataset of the ages
ages <- c(28, 23, 31, 31,
21, 30, 28, 23,
29, 28, 36, 36,
31, 28, 28, 40,
22, 21, 28, 31)
#arrange the dataset
print(ages_order <- sort(ages))
## [1] 21 21 22 23 23 28 28 28 28 28 28 29 30 31 31 31 31 36 36 40
#mean
print(mean(ages_order))
## [1] 28.65
#median
print(median(ages_order))
## [1] 28
#mode
print(Mode(ages_order))
## [1] 28
## attr(,"freq")
## [1] 6
#frequency histogram
hist(ages_order)
With the data in the table above, determine the variance and standard deviation.
#variance
print(var(ages_order))
## [1] 25.71316
#std deviation
(print(sd(ages_order)))
## [1] 5.070814
## [1] 5.070814
Using the mean and two standard deviation, determine the Chebyshev interval and interpret its meaning in the context of the problem.
28.65-2(5.070814) = 18.508372 lower limit range
28.65+2(5.070814) = 38.791628 upper limit range
1/2^2 = .25 –> 25%
1-(1/2^2) = .75 –> 75%
Interpretation: From an analytical perspective, the results indicate that the majority of the restaurant’s frequent customers fall within an approximate age range of 19 to 39 years. This suggests that the main segment is composed of young adults.
The Std deviation of approximately 5 years reflects some variability in age; however, this variability is not excessive, so the data is not highly dispersed.So, only a small proportion of customers, at most 25%, fall outside this range.
Explain what the Central Tendency measures indicate and what the dispersion measures indicate.
Central tendency measures:
Measures of central tendency allow us to identify the value around which the data is concentrated. In this case, the mean is 28.65, the median is 28, and the mode is also 28 (repeating six times). This indicates that the distribution of ages is fairly centered around 28 years.
Furthermore, the frequency histogram shows a greater concentration of data in the values close to this range.This indicates that the distribution is relatively symmetrical and does not exhibit significant skewness.
Dispersion measures:
The standard deviation indicates that, on average, ages deviate by about 5 years from the mean. This suggests that, while there is variability in customer ages, it is not excessive. The data are neither too concentrated nor too dispersed, reflecting moderate variability.
Using the data from the database titled POP BEVERAGES CONSUMPTION SURVEY, use EXCEL to determine, for each age category of consumers, the following about the Glasses of soft drink per day:
#install all the necessary libraries. Again libraries installed directly in the console, this for avoid that every time we "knit" the packages install again and again.
#install.packages("readxl", repos = "https://cran.r-project.org")
#install.packages("dplyr", repos = "https://cran.r-project.org")
#install.packages("ggplot2", repos = "https://cran.r-project.org")
#install.packages("purrr", repos = "https://cran.r-project.org")
#install.packages("stringr", repos = "https://cran.r-project.org")
library(readxl)
## Warning: package 'readxl' was built under R version 4.5.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(purrr)
## Warning: package 'purrr' was built under R version 4.5.3
library(stringr)
## Warning: package 'stringr' was built under R version 4.5.3
file.choose()
## [1] "C:\\Users\\maria\\Downloads\\POP BEVERAGES CONSUMPTION SURVEY.xlsx"
data <- read_excel("C:\\Users\\maria\\Downloads\\POP BEVERAGES CONSUMPTION SURVEY.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
excel_sheets("C:\\Users\\maria\\Downloads\\POP BEVERAGES CONSUMPTION SURVEY.xlsx")
## [1] "Youth Rural" "Youth Urban" "20-40 Urban" "20-40 Rural" "40-60 Urban"
## [6] "40-60 Rural"
leer_hoja <- function(sheet_name, archivo) {
df <- read_excel(archivo, sheet = sheet_name, skip = 1)
# Nos quedamos solo con las 4 columnas importantes
df <- df[, 1:4]
names(df) <- c("Surveyed", "Glasses", "Age", "Occupation")
df <- df %>%
mutate(
Area = ifelse(str_detect(sheet_name, "Urban"), "Urban", "Rural"),
AgeGroup = case_when(
str_detect(sheet_name, "Youth") ~ "Youth",
str_detect(sheet_name, "20-40") ~ "20-40",
str_detect(sheet_name, "40-60") ~ "40-60"
)
)
return(df)
}
archivo <- "C:\\Users\\maria\\Downloads\\POP BEVERAGES CONSUMPTION SURVEY.xlsx"
hojas <- excel_sheets(archivo)
data_total <- map_dfr(hojas, ~leer_hoja(.x, archivo))
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## • `` -> `...5`
head(data_total)
## # A tibble: 6 × 6
## Surveyed Glasses Age Occupation Area AgeGroup
## <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 3 18 Student Rural Youth
## 2 2 1 16 Student Rural Youth
## 3 3 0 14 Student Rural Youth
## 4 4 3 18 Student Rural Youth
## 5 5 3 17 Student Rural Youth
## 6 6 2 14 Student Rural Youth
str(data_total)
## tibble [260 × 6] (S3: tbl_df/tbl/data.frame)
## $ Surveyed : num [1:260] 1 2 3 4 5 6 7 8 9 10 ...
## $ Glasses : num [1:260] 3 1 0 3 3 2 3 4 1 1 ...
## $ Age : num [1:260] 18 16 14 18 17 14 20 19 15 15 ...
## $ Occupation: chr [1:260] "Student" "Student" "Student" "Student" ...
## $ Area : chr [1:260] "Rural" "Rural" "Rural" "Rural" ...
## $ AgeGroup : chr [1:260] "Youth" "Youth" "Youth" "Youth" ...
ggplot(filter(data_total, Area == "Rural"), aes(x = Glasses)) +
geom_histogram(binwidth = 1, color = "black", fill = "lightblue") +
facet_wrap(~AgeGroup) +
labs(
title = "Histogram of soft drink consumption - Rural Area",
x = "Glasses of soft drink per day",
y = "Frequency"
) +
theme_minimal()
resumen <- data_total %>%
group_by(Area, AgeGroup) %>%
summarise(
Mean = mean(Glasses, na.rm = TRUE),
Median = median(Glasses, na.rm = TRUE),
.groups = "drop"
)
resumen <- resumen %>%
mutate(
Bias = case_when(
Mean > Median ~ "Right-skewed",
Mean < Median ~ "Left-skewed",
TRUE ~ "Approximately symmetric"
)
)
resumen
## # A tibble: 6 × 5
## Area AgeGroup Mean Median Bias
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Rural 20-40 2.44 2 Right-skewed
## 2 Rural 40-60 3.1 3 Right-skewed
## 3 Rural Youth 1.83 2 Left-skewed
## 4 Urban 20-40 1.34 1 Right-skewed
## 5 Urban 40-60 2 2 Approximately symmetric
## 6 Urban Youth 1 1 Approximately symmetric
With all the data related to the consumption in rural area and in urban area, find:
ggplot(filter(data_total, Area == "Rural"), aes(x = Glasses)) +
geom_histogram(binwidth = 1, color = "black", fill = "skyblue") +
labs(
title = "Overall histogram - Rural Area",
x = "Glasses of soft drink per day",
y = "Frequency"
) +
theme_minimal()
ggplot(filter(data_total, Area == "Urban"), aes(x = Glasses)) +
geom_histogram(binwidth = 1, color = "black", fill = "palegreen3") +
labs(
title = "Overall histogram - Urban Area",
x = "Glasses of soft drink per day",
y = "Frequency"
) +
theme_minimal()
comparacion <- data_total %>%
group_by(Area) %>%
summarise(
Mean = mean(Glasses, na.rm = TRUE),
Median = median(Glasses, na.rm = TRUE),
SD = sd(Glasses, na.rm = TRUE),
Min = min(Glasses, na.rm = TRUE),
Max = max(Glasses, na.rm = TRUE),
.groups = "drop"
)
comparacion
## # A tibble: 2 × 6
## Area Mean Median SD Min Max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Rural 2.55 2 1.31 0 5
## 2 Urban 1.52 1 1.32 0 5