Assignment

Central Tendency Measures, dispersion and sample distribution

Exercise 1.

The following table contains the ages of people who are frequent customers of a vegetarian restaurant. With this data, calculate:

  • Mean
  • Median
  • Mode
  • Frequency histogram
#Libraries installed directly in the console, this for avoid that every time we "knit" the packages install again and again.

# install.packages("DescTools")
# install.packages("psych")

library(DescTools)
## Warning: package 'DescTools' was built under R version 4.5.3
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:DescTools':
## 
##     AUC, ICC, SD
#Create table with the dataset of the ages
ages <- c(28, 23, 31, 31,
           21, 30, 28, 23,
           29, 28, 36, 36,
           31, 28, 28, 40,
           22, 21, 28, 31)

#arrange the dataset
print(ages_order <- sort(ages))
##  [1] 21 21 22 23 23 28 28 28 28 28 28 29 30 31 31 31 31 36 36 40
#mean
print(mean(ages_order))
## [1] 28.65
#median
print(median(ages_order))
## [1] 28
#mode
print(Mode(ages_order))
## [1] 28
## attr(,"freq")
## [1] 6
#frequency histogram
hist(ages_order)

Exercise 2.

With the data in the table above, determine the variance and standard deviation.

#variance
print(var(ages_order))
## [1] 25.71316
#std deviation
(print(sd(ages_order)))
## [1] 5.070814
## [1] 5.070814

Exercise 3.

Using the mean and two standard deviation, determine the Chebyshev interval and interpret its meaning in the context of the problem.

  • mean = 28.65
  • Std deviation = 5.070814
  • 2 std deviation

28.65-2(5.070814) = 18.508372 lower limit range

28.65+2(5.070814) = 38.791628 upper limit range

1/2^2 = .25 –> 25%

1-(1/2^2) = .75 –> 75%

Interpretation: From an analytical perspective, the results indicate that the majority of the restaurant’s frequent customers fall within an approximate age range of 19 to 39 years. This suggests that the main segment is composed of young adults.

The Std deviation of approximately 5 years reflects some variability in age; however, this variability is not excessive, so the data is not highly dispersed.So, only a small proportion of customers, at most 25%, fall outside this range.

Exercise 4.

Explain what the Central Tendency measures indicate and what the dispersion measures indicate.

Central tendency measures:

  • Mean –> 28.65
  • Median –> 28
  • Mode –> 28, six times
  • Frequency histogram –> 25-30 is the most frequent segment

Measures of central tendency allow us to identify the value around which the data is concentrated. In this case, the mean is 28.65, the median is 28, and the mode is also 28 (repeating six times). This indicates that the distribution of ages is fairly centered around 28 years.

Furthermore, the frequency histogram shows a greater concentration of data in the values close to this range.This indicates that the distribution is relatively symmetrical and does not exhibit significant skewness.

Dispersion measures:

  • Variance –> 25.71316
  • Standard deviation –> 5.070814

The standard deviation indicates that, on average, ages deviate by about 5 years from the mean. This suggests that, while there is variability in customer ages, it is not excessive. The data are neither too concentrated nor too dispersed, reflecting moderate variability.

Exercise 5.

Using the data from the database titled POP BEVERAGES CONSUMPTION SURVEY, use EXCEL to determine, for each age category of consumers, the following about the Glasses of soft drink per day:

#install all the necessary libraries. Again libraries installed directly in the console, this for avoid that every time we "knit" the packages install again and again.

#install.packages("readxl", repos = "https://cran.r-project.org")
#install.packages("dplyr", repos = "https://cran.r-project.org")
#install.packages("ggplot2", repos = "https://cran.r-project.org")
#install.packages("purrr", repos = "https://cran.r-project.org")
#install.packages("stringr", repos = "https://cran.r-project.org")

library(readxl)
## Warning: package 'readxl' was built under R version 4.5.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(purrr)
## Warning: package 'purrr' was built under R version 4.5.3
library(stringr)
## Warning: package 'stringr' was built under R version 4.5.3
file.choose()
## [1] "C:\\Users\\maria\\Downloads\\POP BEVERAGES CONSUMPTION SURVEY.xlsx"
data <- read_excel("C:\\Users\\maria\\Downloads\\POP BEVERAGES CONSUMPTION SURVEY.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
excel_sheets("C:\\Users\\maria\\Downloads\\POP BEVERAGES CONSUMPTION SURVEY.xlsx")
## [1] "Youth Rural" "Youth Urban" "20-40 Urban" "20-40 Rural" "40-60 Urban"
## [6] "40-60 Rural"
leer_hoja <- function(sheet_name, archivo) {
  
  df <- read_excel(archivo, sheet = sheet_name, skip = 1)
  
# Nos quedamos solo con las 4 columnas importantes
  df <- df[, 1:4]
  
  names(df) <- c("Surveyed", "Glasses", "Age", "Occupation")
  
  df <- df %>%
    mutate(
      Area = ifelse(str_detect(sheet_name, "Urban"), "Urban", "Rural"),
      AgeGroup = case_when(
        str_detect(sheet_name, "Youth") ~ "Youth",
        str_detect(sheet_name, "20-40") ~ "20-40",
        str_detect(sheet_name, "40-60") ~ "40-60"
      )
    )
  
  return(df)
}

archivo <- "C:\\Users\\maria\\Downloads\\POP BEVERAGES CONSUMPTION SURVEY.xlsx"

hojas <- excel_sheets(archivo)

data_total <- map_dfr(hojas, ~leer_hoja(.x, archivo))
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## • `` -> `...5`
head(data_total)
## # A tibble: 6 × 6
##   Surveyed Glasses   Age Occupation Area  AgeGroup
##      <dbl>   <dbl> <dbl> <chr>      <chr> <chr>   
## 1        1       3    18 Student    Rural Youth   
## 2        2       1    16 Student    Rural Youth   
## 3        3       0    14 Student    Rural Youth   
## 4        4       3    18 Student    Rural Youth   
## 5        5       3    17 Student    Rural Youth   
## 6        6       2    14 Student    Rural Youth
str(data_total)
## tibble [260 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Surveyed  : num [1:260] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Glasses   : num [1:260] 3 1 0 3 3 2 3 4 1 1 ...
##  $ Age       : num [1:260] 18 16 14 18 17 14 20 19 15 15 ...
##  $ Occupation: chr [1:260] "Student" "Student" "Student" "Student" ...
##  $ Area      : chr [1:260] "Rural" "Rural" "Rural" "Rural" ...
##  $ AgeGroup  : chr [1:260] "Youth" "Youth" "Youth" "Youth" ...
  • The histogram frequency for each age category in rural area and the bias presented by the data from the graph comparing the mean to the median
ggplot(filter(data_total, Area == "Rural"), aes(x = Glasses)) +
  geom_histogram(binwidth = 1, color = "black", fill = "lightblue") +
  facet_wrap(~AgeGroup) +
  labs(
    title = "Histogram of soft drink consumption - Rural Area",
    x = "Glasses of soft drink per day",
    y = "Frequency"
  ) +
  theme_minimal()

  • The histogram frequency for each age category in urban area and the bias presented by the data from the graph comparing the mean to the median
resumen <- data_total %>%
  group_by(Area, AgeGroup) %>%
  summarise(
    Mean = mean(Glasses, na.rm = TRUE),
    Median = median(Glasses, na.rm = TRUE),
    .groups = "drop"
  )

resumen <- resumen %>%
  mutate(
    Bias = case_when(
      Mean > Median ~ "Right-skewed",
      Mean < Median ~ "Left-skewed",
      TRUE ~ "Approximately symmetric"
    )
  )
resumen
## # A tibble: 6 × 5
##   Area  AgeGroup  Mean Median Bias                   
##   <chr> <chr>    <dbl>  <dbl> <chr>                  
## 1 Rural 20-40     2.44      2 Right-skewed           
## 2 Rural 40-60     3.1       3 Right-skewed           
## 3 Rural Youth     1.83      2 Left-skewed            
## 4 Urban 20-40     1.34      1 Right-skewed           
## 5 Urban 40-60     2         2 Approximately symmetric
## 6 Urban Youth     1         1 Approximately symmetric

Exercise 6.

With all the data related to the consumption in rural area and in urban area, find:

  • The frequency histogram of the rural area.
ggplot(filter(data_total, Area == "Rural"), aes(x = Glasses)) +
  geom_histogram(binwidth = 1, color = "black", fill = "skyblue") +
  labs(
    title = "Overall histogram - Rural Area",
    x = "Glasses of soft drink per day",
    y = "Frequency"
  ) +
  theme_minimal()

  • The frequency histogram of the urban area.
ggplot(filter(data_total, Area == "Urban"), aes(x = Glasses)) +
  geom_histogram(binwidth = 1, color = "black", fill = "palegreen3") +
  labs(
    title = "Overall histogram - Urban Area",
    x = "Glasses of soft drink per day",
    y = "Frequency"
  ) +
  theme_minimal()

  • What are the main differences between the urban and rural area?
comparacion <- data_total %>%
  group_by(Area) %>%
  summarise(
    Mean = mean(Glasses, na.rm = TRUE),
    Median = median(Glasses, na.rm = TRUE),
    SD = sd(Glasses, na.rm = TRUE),
    Min = min(Glasses, na.rm = TRUE),
    Max = max(Glasses, na.rm = TRUE),
    .groups = "drop"
  )
comparacion
## # A tibble: 2 × 6
##   Area   Mean Median    SD   Min   Max
##   <chr> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Rural  2.55      2  1.31     0     5
## 2 Urban  1.52      1  1.32     0     5