webvisitors

Step 1: Preliminaries

rm(list = ls())

library(tidyverse)

library(DescTools)  # if this gives an error, first run install.packages("DescTools")

setwd("C:/BAQM/Block 01/lec02/project 01")

Step 2: Importing & Previewing the Data

# IMPORT DATA

webvisitors <- read.table("webvisitors.csv", sep = ";", header = TRUE)

This reads the CSV file into R and stores it in a variable called webvisitors. The ; tells R that columns are separated by semicolons (not commas), and header = TRUE means the first row contains column names.
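To see what sep and header do without needing the file, here is a minimal sketch that reads a small semicolon-separated text through the text argument of read.table(). The column names and values are invented for illustration and are not taken from webvisitors.csv.

```r
# Hypothetical mini data set in the same semicolon-separated layout
raw <- "id;amount;type
1;12.5;jazz
2;30.0;rock
3;8.0;classic"

# text = lets read.table() parse a string instead of a file
toy <- read.table(text = raw, sep = ";", header = TRUE)

str(toy)     # 3 rows, 3 columns; the first line became the column names
toy$amount   # numeric column: 12.5 30.0 8.0
```

With sep = "," instead, every line would end up in a single column, because no commas are found to split on.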

summary(webvisitors)

Gives you a quick statistical overview of each column — min, max, mean, etc.

str(webvisitors)

Shows the structure — how many rows/columns, and what data type each column is (number, text, etc.).

(webvisitors$gender <- factor(webvisitors$gender, labels = c("male", "female")))

The gender column is stored as numbers (1, 2). This converts it to a factor (a categorical variable) with the labels “male” and “female” — so 1 becomes “male” and 2 becomes “female”.
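The conversion can be tried on a few made-up codes. This is only a sketch: the vector gender_code is hypothetical, and levels is spelled out here (the original call leaves it out) to make the mapping from codes to labels explicit.

```r
# Hypothetical gender codes as they might appear in the raw file
gender_code <- c(1, 2, 2, 1, 2)

# levels = which codes exist; labels = their names, in the same order
gender <- factor(gender_code, levels = c(1, 2), labels = c("male", "female"))

gender          # male female female male female
table(gender)   # male: 2, female: 3
```

Without levels, R assigns the labels to the sorted unique values, which gives the same result here as long as both codes 1 and 2 occur in the data.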

table(webvisitors$gender)

Counts how many males and females are in the dataset.

Step 3: Univariate Measures

Univariate simply means we’re looking at one variable at a time — just exploring each column on its own.

nrow(webvisitors)

This counts how many rows (visitors) are in the dataset. Each row = one visitor.

mean(webvisitors$amount)

The $ sign means: “go inside the webvisitors table and grab the amount column”.

mean(webvisitors$amount, trim = 0.1)

trim = 0.1 means “cut off the lowest 10% and the highest 10% of the values before calculating the mean”.

This protects the average against outliers, because the extreme values that would distort it are removed first.
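A small sketch of the effect, using ten invented values with one obvious outlier (the vector x is hypothetical, not the amount column):

```r
# Hypothetical amounts with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

mean(x)               # 14.5, pulled up by the outlier 100
mean(x, trim = 0.1)   # 5.5, the lowest (1) and the highest (100) value are dropped

# The same result by hand: sort, drop one value at each end, average the rest
mean(sort(x)[2:9])    # 5.5
```

With 10 values, 10% means exactly one value is removed from each end.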

# Sample variance of amount

var(webvisitors$amount)

The mean tells you the center of your data. But variance tells you how far the values are scattered around that center.

sd(webvisitors$amount)

sd = standard deviation, which is simply the square root of variance:

# Alternative

sqrt(var(webvisitors$amount))

mean - sd = lower end of the typical range around the mean

mean + sd = upper end of the typical range around the mean
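To make variance and standard deviation less of a black box, the sketch below rebuilds them by hand on eight invented values and compares the result with var() and sd(). The names x, var_manual and sd_manual are only for illustration.

```r
# Hypothetical amounts
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
m <- mean(x)                             # 5

# Sample variance: squared distances from the mean, divided by n - 1
var_manual <- sum((x - m)^2) / (n - 1)   # 32 / 7
sd_manual  <- sqrt(var_manual)

all.equal(var_manual, var(x))            # TRUE
all.equal(sd_manual, sd(x))              # TRUE

# Lower and upper end of the typical range around the mean
c(lower = m - sd(x), upper = m + sd(x))
```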

sort(webvisitors$amount)

This simply arranges all 25 amounts from lowest to highest. It helps you see the full range of your data at a glance.

sort(webvisitors$amount, decreasing = TRUE)   # sorts highest to lowest

sort(webvisitors$amount, decreasing = FALSE)  # sorts lowest to highest (the default)

order(webvisitors$amount)

`sort()` gives you the **actual values** sorted

`order()` gives you the **row numbers** that would create that sorted order

rank(webvisitors$amount)

Gives the rank of each original row, counted from smallest to largest (rank 1 = lowest amount).

rank(-webvisitors$amount)

Gives the rank of each original row, counted from largest to smallest (rank 1 = highest amount). The minus sign flips the order.
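The difference between sort(), order() and rank() is easiest to see side by side. A minimal sketch on three invented amounts (x is hypothetical):

```r
x <- c(30, 10, 20)   # hypothetical amounts for rows 1, 2, 3

sort(x)       # 10 20 30  -> the values themselves
order(x)      # 2 3 1     -> row 2 is smallest, then row 3, then row 1
rank(x)       # 3 1 2     -> row 1 holds the 3rd smallest value, row 2 the smallest
rank(-x)      # 1 3 2     -> row 1 holds the largest value

x[order(x)]   # 10 20 30  -> indexing with order() reproduces sort()
```

order() is the one to use when a whole table should be sorted by one column, since its row numbers can be used to index the data frame.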

min(webvisitors$amount)    # the lowest amount

max(webvisitors$amount)    # the highest amount

range(webvisitors$amount)  # both (lowest and highest) in one line

median(webvisitors$amount)

The median is the *middle value* when all values are sorted from lowest to highest.

Odd number of values → the value at position (n + 1) / 2

5 values → (5 + 1) / 2 = position 3

Even number of values → the average of the values at positions n/2 and n/2 + 1

4 values → average of the values at positions 2 and 3
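Both rules can be checked by hand against median(). The two vectors below are invented for illustration:

```r
odd  <- c(7, 1, 5, 3, 9)   # 5 hypothetical values
even <- c(8, 2, 6, 4)      # 4 hypothetical values

# Odd: take the value at position (n + 1) / 2 of the sorted data
sort(odd)[(length(odd) + 1) / 2]        # 5
median(odd)                             # 5

# Even: average the values at positions n/2 and n/2 + 1
n <- length(even)
mean(sort(even)[c(n / 2, n / 2 + 1)])   # 5
median(even)                            # 5
```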

quantile(webvisitors$amount, 0.25)

Quartiles divide the sorted data into 4 equal parts. 0.25 returns the first quartile (Q1): about 25% of the amounts lie below it.

IQR(webvisitors$amount)

Interquartile Range = Q3 - Q1

It measures the spread of the **middle 50%** of your data, ignoring the extremes on both ends.
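A short sketch showing that IQR() really is Q3 minus Q1, using the invented values 1 to 9 (names = FALSE only strips the "25%" label from the result):

```r
x <- 1:9   # hypothetical values, already sorted

q1 <- quantile(x, 0.25, names = FALSE)   # 3
q3 <- quantile(x, 0.75, names = FALSE)   # 7

q3 - q1       # 4
IQR(x)        # 4, the same thing in one call

quantile(x)   # all five cut points: 0%, 25%, 50%, 75%, 100%
```

R's default method interpolates between neighbouring values, so with other data the quartiles can fall between two observed amounts.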

table(webvisitors$type)  # counts how many times each category appears

# tidyverse equivalent:

webvisitors |>

count(type)

# Relative frequencies of column type

prop.table(table(webvisitors$type))  # relative frequency = count / total (prop = proportion)

Instead of saying **“9 visitors like classic”** you say **“36% of visitors like classic”**. That’s a relative frequency — it’s the **proportion** out of the total
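The step from counts to proportions can be followed on four invented answers (the vector type is hypothetical):

```r
# Hypothetical music preferences of 4 visitors
type <- c("jazz", "rock", "rock", "classic")

counts <- table(type)          # classic 1, jazz 1, rock 2
props  <- prop.table(counts)   # classic 0.25, jazz 0.25, rock 0.50

sum(props)                     # 1, proportions always add up to 1
round(100 * props)             # as percentages: 25, 25, 50

9 / 25                         # 0.36, the "9 visitors = 36%" from the text
```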

# tidyverse equivalent:

webvisitors |>

count(type) |>

mutate(rel_n = n / sum(n))  # creates a new column rel_n: each count divided by the total

salehi <- webvisitors |>

count(type) |>

mutate(rel_n = n / sum(n))  # stored under the name salehi so it appears as a separate table

webvisitors <- webvisitors |>

group_by(type) |>

mutate(rel_n = n() / nrow(webvisitors)) |>  # adds the new column to the original table

ungroup()
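The difference between the two approaches is that count() collapses the data while group_by() + mutate() keeps every row. A minimal sketch on invented data, assuming the dplyr package (part of the tidyverse) is installed; toy, total, summary_tbl and full_tbl are hypothetical names:

```r
library(dplyr)

# Hypothetical data: 4 visitors
toy <- tibble(type = c("jazz", "rock", "rock", "classic"))
total <- nrow(toy)

# count() collapses the data: one row per type
summary_tbl <- toy |>
  count(type) |>
  mutate(rel_n = n / sum(n))

# group_by() + mutate() keeps all rows and repeats each group's share
full_tbl <- toy |>
  group_by(type) |>
  mutate(rel_n = n() / total) |>
  ungroup()

nrow(summary_tbl)   # 3
nrow(full_tbl)      # 4
full_tbl$rel_n      # 0.25 0.50 0.50 0.25
```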

Step 4: Bivariate Measures

So far we looked at one variable at a time (univariate). Now we look at two variables together (bivariate) — how do they relate to each other?

# Average amount per type (Base R)

by(webvisitors$amount, webvisitors$type, mean)

It calculates the mean amount separately for each music type: jazz, rock, classic

webvisitors |>

summarise(avg_amount = mean(amount), .by = type)

It collapses data into a summary table. Instead of 25 rows you get one row per group
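The per-group mean can be checked by hand on a tiny invented table. The sketch uses base R only; tapply() is shown as one possible alternative to by() that returns the means in a compact named form (toy and group_means are hypothetical names).

```r
# Hypothetical data: 4 visitors
toy <- data.frame(
  amount = c(10, 20, 30, 40),
  type   = c("jazz", "rock", "jazz", "rock")
)

by(toy$amount, toy$type, mean)   # jazz: (10 + 30) / 2 = 20, rock: (20 + 40) / 2 = 30

# tapply() computes the same group means, indexed by group name
group_means <- tapply(toy$amount, toy$type, mean)
group_means[["jazz"]]   # 20
group_means[["rock"]]   # 30
```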

# or to add the gender

webvisitors |>

group_by(gender, type) |>

summarise(avg_amount = mean(amount))

This gives you the average amount for every combination of gender and type together

# Pearson correlation

## What is correlation?

It measures the **relationship between two numeric variables** — in this case amount and age.

It answers the question: **“Do older visitors spend more?”**

Example: r = 0.75 would mean a fairly strong positive relationship (older visitors tend to spend more).

## The result is always between -1 and 1:

-1         0         +1

◄──────────┼──────────►

negative  none positive

| Value | Meaning |
| --- | --- |
| close to +1 | as age increases, amount increases |
| close to -1 | as age increases, amount decreases |
| close to 0 | no linear relationship between age and amount |

Why “Pearson”?

There are different types of correlation. Pearson is the most common one — it measures linear relationships (straight line patterns).

cor(webvisitors$amount, webvisitors$age, method = "pearson")  # relationship between 2 numeric variables
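What cor() computes can be rebuilt from cov() and sd(): Pearson's r is the covariance divided by the product of the two standard deviations. The sketch below uses invented ages and amounts (not the real data) and also shows the two extreme cases.

```r
# Hypothetical ages and amounts
age    <- c(20, 30, 40, 50, 60)
amount <- c(15, 25, 20, 40, 45)

# Pearson's r = covariance / (sd of x * sd of y)
r_manual <- cov(age, amount) / (sd(age) * sd(amount))
all.equal(r_manual, cor(age, amount, method = "pearson"))   # TRUE

# Perfect straight-line patterns sit at the two ends of the scale
cor(c(1, 2, 3), c(2, 4, 6))   #  1
cor(c(1, 2, 3), c(6, 4, 2))   # -1
```

method = "pearson" is the default of cor(), so leaving it out gives the same result.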

# Contingency table: counting two categories together

It counts how many people fall into **each combination** of two categorical variables — in this case gender and type

addmargins(xtabs(~ gender + type, data = webvisitors))

Breaking down the code:

| Part | Meaning |
| --- | --- |
| xtabs(~ gender + type) | creates the cross table |
| gender + type | rows = gender, columns = type |
| addmargins() | adds the Sum row and column |
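The same call can be tried on five invented visitors to see how the cells and the Sum margins are filled. The data frame toy and the names tab and tab_sums are hypothetical.

```r
# Hypothetical data: 5 visitors
toy <- data.frame(
  gender = c("male", "female", "male", "female", "male"),
  type   = c("jazz", "jazz", "rock", "rock", "rock")
)

tab      <- xtabs(~ gender + type, data = toy)   # rows = gender, columns = type
tab_sums <- addmargins(tab)                      # same table plus Sum row and column

tab["male", "rock"]         # 2 male visitors prefer rock
tab_sums["female", "Sum"]   # 2 female visitors in total
tab_sums["Sum", "Sum"]      # 5, the total number of visitors
```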

# Cramer’s V - Measuring the strength of the relationship between two categorical variables

## The result is always between 0 and 1:

```
0                              1
◄──────────────────────────────►
no relationship   perfect relationship
```

CramerV(xtabs(~ gender + type, data = webvisitors))

Step 5: Visualizing data