webvisitors
Step 1: Preliminaries
rm(list = ls())
library(tidyverse)
library(DescTools)  # if this gives an error, first run: install.packages("DescTools")
setwd("C:/BAQM/Block 01/lec02/project 01")
Step 2: Importing & Previewing the Data
# IMPORT DATA
webvisitors <- read.table("webvisitors.csv", sep = ";", header = TRUE)
This reads the CSV file into R and stores it in a variable called webvisitors. The ; tells R that columns are separated by semicolons (not commas), and header = TRUE means the first row contains column names.
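A minimal sketch of the same import idea on made-up data (the two columns and their values are hypothetical, not taken from webvisitors.csv). The text = argument lets read.table() read from a string instead of a file, so this runs without the CSV:

```r
# Hypothetical semicolon-separated data, standing in for a small CSV file
csv_text <- "age;amount
30;12.5
41;20.0
25;7.5"

# Same arguments as above: semicolon separator, first row holds the column names
demo <- read.table(text = csv_text, sep = ";", header = TRUE)

names(demo)  # column names taken from the first row
nrow(demo)   # number of data rows (the header is not counted)
```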
summary(webvisitors)
Gives you a quick statistical overview of each column — min, max, mean, etc.
str(webvisitors)
Shows the structure — how many rows/columns, and what data type each column is (number, text, etc.).
(webvisitors$gender <- factor(webvisitors$gender, labels = c("male", "female")))
The gender column is stored as numbers (1, 2). This converts it to a factor (a categorical variable) with the labels “male” and “female” — so 1 becomes “male” and 2 becomes “female”.
table(webvisitors$gender)
Counts how many males and females are in the dataset.
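To see the conversion in isolation, here is a small sketch with hypothetical codes (not the real gender column). It assumes the same coding as above, 1 = male and 2 = female:

```r
# Hypothetical gender codes: 1 = male, 2 = female
gender_code <- c(1, 2, 2, 1, 2)

# Labels are matched to the sorted unique codes: 1 -> "male", 2 -> "female"
gender <- factor(gender_code, labels = c("male", "female"))

gender         # shows labels instead of numbers
table(gender)  # counts per category
```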
Step 3: Univariate Measures
Univariate simply means we’re looking at one variable at a time — just exploring each column on its own.
nrow(webvisitors)
This counts how many rows (visitors) are in the dataset. Each row = one visitor.
mean(webvisitors$amount)
The $ sign means “go inside the data frame and grab this column”, so this line calculates the average of the amount column.
mean(webvisitors$amount, trim = 0.1)
trim = 0.1 means “cut off 10% of the values from each end of the sorted data before calculating the mean”.
This protects against outliers: extreme values that would distort your average are left out.
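A small sketch of what trimming does, using made-up numbers (hypothetical, not from the dataset). With 10 values, trim = 0.1 drops one value from each end:

```r
# Hypothetical amounts; 500 is an obvious outlier
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 13, 500)

plain_mean   <- mean(x)              # pulled up by the outlier
trimmed_mean <- mean(x, trim = 0.1)  # lowest and highest value removed first

# The same thing by hand: sort, drop the first and last value, then average
manual_trimmed <- mean(sort(x)[2:9])

plain_mean      # 60.4
trimmed_mean    # 11.75
manual_trimmed  # 11.75
```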
# Sample variance of amount
var(webvisitors$amount)
The mean tells you the center of your data. But variance tells you how far the values are scattered around that center.
sd(webvisitors$amount)
sd = standard deviation, which is simply the square root of variance:
# Alternative
sqrt(var(webvisitors$amount))
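mean - sd = lower end of the typical range
mean + sd = upper end of the typical range
For roughly bell-shaped data, about two thirds of the values fall between these two ends.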
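As a check on the definitions, this sketch computes the sample variance by hand and compares it with var() and sd(). The numbers are hypothetical:

```r
# Hypothetical values with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance: sum of squared distances from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
manual_sd  <- sqrt(manual_var)

# Typical range: one standard deviation below and above the mean
lower_end <- mean(x) - sd(x)
upper_end <- mean(x) + sd(x)

all.equal(manual_var, var(x))  # TRUE
all.equal(manual_sd, sd(x))    # TRUE
```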
sort(webvisitors$amount)
This simply arranges all 25 amounts from lowest to highest. It helps you see the full range of your data at a glance.
sort(webvisitors$amount, decreasing = TRUE)   # sorts from highest to lowest
sort(webvisitors$amount, decreasing = FALSE)  # sorts from lowest to highest (the default)
order(webvisitors$amount)
`sort()` gives you the **actual values** sorted
`order()` gives you the **row numbers** that would create that sorted order
rank(webvisitors$amount)
Gives the rank of each original row, where rank 1 is the smallest amount.
rank(-webvisitors$amount)
Putting a minus sign in front reverses the ranking, so rank 1 is now the largest amount.
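The difference between the three functions is easiest to see on a tiny vector. This sketch uses three hypothetical values:

```r
# Hypothetical amounts in their original row order
x <- c(30, 10, 20)

sort(x)    # 10 20 30 : the values themselves, sorted
order(x)   # 2 3 1    : row 2 comes first, then row 3, then row 1
rank(x)    # 3 1 2    : row 1 is the 3rd smallest, row 2 the smallest, ...
rank(-x)   # 1 3 2    : row 1 is the largest, row 2 the smallest, ...

# Indexing with order() reproduces sort()
x[order(x)]
```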
min(webvisitors$amount)    # the lowest amount
max(webvisitors$amount)    # the highest amount
range(webvisitors$amount)  # both (lowest & highest) in one line
median(webvisitors$amount)
The median is the *middle value* when all values are sorted from lowest to highest.
Odd number of values → the value at position (n + 1) / 2
5 values → (5 + 1) / 2 = position 3
Even number of values → the average of the values at positions n/2 and n/2 + 1
4 values → average of the values at positions 2 and 3
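The two position rules can be checked by hand against median(). The vectors below are hypothetical:

```r
# Odd number of values: take the value at position (n + 1) / 2
odd_x      <- c(7, 1, 5, 3, 9)
odd_sorted <- sort(odd_x)                            # 1 3 5 7 9
odd_manual <- odd_sorted[(length(odd_x) + 1) / 2]    # position 3

# Even number of values: average the values at positions n/2 and n/2 + 1
even_x      <- c(4, 1, 3, 2)
even_sorted <- sort(even_x)                          # 1 2 3 4
n           <- length(even_x)
even_manual <- mean(even_sorted[c(n / 2, n / 2 + 1)])  # positions 2 and 3

odd_manual   # 5,   same as median(odd_x)
even_manual  # 2.5, same as median(even_x)
```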
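quantile(webvisitors$amount, 0.25)
Quartiles divide the sorted data into 4 equal parts. 0.25 asks for the first quartile (Q1): the value below which about 25% of the amounts fall.
IQR(webvisitors$amount)
IQR = Interquartile Range = Q3 - Q1
It measures the spread of the **middle 50%** of your data, ignoring the extremes on both ends.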
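A short sketch showing that IQR() is Q3 minus Q1, on hypothetical values. Note that quantile() interpolates between data points by default, so for other data the result can differ slightly from a hand calculation that uses a different textbook rule:

```r
# Hypothetical values 1, 2, ..., 9
x <- 1:9

q1 <- unname(quantile(x, 0.25))  # first quartile
q3 <- unname(quantile(x, 0.75))  # third quartile

q1          # 3
q3          # 7
q3 - q1     # 4
IQR(x)      # 4, the same number
```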
table(webvisitors$type)  # counts how many times each category appears
webvisitors |>
count(type)  # tidyverse equivalent
# Relative frequencies of column type
prop.table(table(webvisitors$type))  # relative frequency = count / total (prop = proportion)
Instead of saying **“9 visitors like classic”** you say **“36% of visitors like classic”**. That’s a relative frequency — it’s the **proportion** out of the total
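The same calculation on a tiny hypothetical vector (five made-up visitors, not the real data), done once with prop.table() and once by hand:

```r
# Hypothetical music types for five visitors
type <- c("jazz", "rock", "classic", "rock", "rock")

counts <- table(type)          # absolute frequencies
rel    <- prop.table(counts)   # relative frequencies

# By hand: each count divided by the total number of visitors
rel_manual <- counts / sum(counts)

rel          # classic 0.2, jazz 0.2, rock 0.6
sum(rel)     # relative frequencies always add up to 1
```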
# tidyverse equivalent:
webvisitors |>
count(type) |>
mutate(rel_n = n / sum(n))  # creates a new column rel_n: each count divided by the total
salehi <- webvisitors |>
count(type) |>
mutate(rel_n = n / sum(n))  # stores the result in a separate table; I used the name salehi
webvisitors <- webvisitors |>
group_by(type) |>
mutate(rel_n = n() / nrow(webvisitors)) |>  # adds the new column to the same table
ungroup()
Step 4: Bivariate Measures
So far we looked at one variable at a time (univariate). Now we look at two variables together (bivariate) — how do they relate to each other?
# Average amount per type (Base R)
by(webvisitors$amount, webvisitors$type, mean)
It calculates the mean amount separately for each music type: jazz, rock, classic
webvisitors |>
summarise(avg_amount = mean(amount), .by = type)
It collapses data into a summary table. Instead of 25 rows you get one row per group
# or to add the gender
webvisitors |>
group_by(gender, type) |>
summarise(avg_amount = mean(amount))
This gives you the average amount for every combination of gender and type together
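A self-contained sketch of group averages on hypothetical data (six made-up visitors). To avoid needing any package it uses tapply() and aggregate(), which are base R relatives of the by() and summarise() calls above, not the same functions:

```r
# Hypothetical mini dataset
df <- data.frame(
  gender = c("male", "male", "female", "female", "female", "male"),
  type   = c("jazz", "rock", "jazz", "rock", "rock", "jazz"),
  amount = c(10, 20, 30, 40, 60, 20)
)

# Average amount per type: one number per group
by_type <- tapply(df$amount, df$type, mean)
by_type   # jazz 20, rock 40

# Average amount per combination of gender and type: one row per combination
by_both <- aggregate(amount ~ gender + type, data = df, FUN = mean)
by_both
```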
# Pearson correlation
## What is correlation?
It measures the **relationship between two numeric variables**, in this case amount and age.
It answers the question: **“Do older visitors spend more?”**
Example: r = 0.75 would indicate a fairly strong positive relationship (older visitors tend to spend more).
## The result is always between -1 and 1:
-1          0          +1
◄──────────┼──────────►
negative   none   positive
| Value | Meaning |
|-------|---------|
| close to +1 | as age increases, amount increases |
| close to -1 | as age increases, amount decreases |
| close to 0 | no linear relationship between age and amount |
Why “Pearson”?
There are different types of correlation. Pearson is the most common one — it measures linear relationships (straight line patterns).
cor(webvisitors$amount, webvisitors$age, method = "pearson")  # relationship between 2 numeric variables
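To show where the number comes from, this sketch computes Pearson's r by hand as the covariance divided by the product of the two standard deviations, then compares it with cor(). The age and amount values are hypothetical:

```r
# Hypothetical ages and amounts for five visitors
age    <- c(1, 2, 3, 4, 5)
amount <- c(2, 4, 5, 4, 5)

# Pearson's r = covariance / (sd of x * sd of y)
r_manual <- cov(age, amount) / (sd(age) * sd(amount))
r_cor    <- cor(age, amount, method = "pearson")

r_manual  # about 0.775
r_cor     # the same value
```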
# Contingency table: counting two categories together
It counts how many people fall into **each combination** of two categorical variables, in this case gender and type.
addmargins(xtabs(~ gender + type, data = webvisitors))
Breaking down the code:
| Part | Meaning |
|------|---------|
| gender + type | rows = gender, columns = type |
| addmargins() | adds the Sum row and column |
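The same two calls on a hypothetical mini dataset (six made-up visitors), so the output is small enough to check by eye:

```r
# Hypothetical mini dataset
df <- data.frame(
  gender = c("male", "male", "female", "female", "female", "male"),
  type   = c("jazz", "rock", "jazz", "rock", "rock", "jazz")
)

# Cross-table of counts: rows = gender, columns = type
tab <- xtabs(~ gender + type, data = df)

# Same table with a Sum row and a Sum column added
tab_margins <- addmargins(tab)
tab_margins

tab_margins["Sum", "Sum"]  # grand total = number of visitors
```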
# Cramér’s V - Measuring the strength of the relationship between two categorical variables
## The result is always between 0 and 1:
```
0                              1
◄──────────────────────────────►
no relationship   perfect relationship
```
CramerV(xtabs(~ gender + type, data = webvisitors))
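For intuition, here is a sketch of the usual textbook formula, V = sqrt(chi-squared / (n * (min(rows, columns) - 1))), written in base R. The helper name cramers_v is hypothetical and this is not the DescTools source code; for a plain table it is expected to give the same value as CramerV(). The two tables are made up to show the two extremes:

```r
# Hypothetical helper: Cramér's V from the chi-squared statistic
cramers_v <- function(tab) {
  # Chi-squared without continuity correction; warnings about small
  # expected counts are silenced because only the statistic is needed
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  n    <- sum(tab)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

# Hypothetical tables: rows = gender, columns = type
perfect     <- matrix(c(10, 0, 0, 10), nrow = 2)  # gender fully determines type
independent <- matrix(c(5, 5, 5, 5), nrow = 2)    # gender tells nothing about type

cramers_v(perfect)      # 1
cramers_v(independent)  # 0
```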
Step 5: Visualizing data