getwd()
## [1] "C:/Users/osaze/Downloads/Winter 2026 Term/Marketing Analytics/R.Data"

Before we import data, we need to set a working directory. getwd() shows the current one, and setwd() changes it.

setwd("C:/Users/osaze/Downloads/Winter 2026 Term/Marketing Analytics/R.Data")

Then we import our dataset, the student data we created earlier, using the read.csv() function.

student_data <- read.csv("Students_data.csv")
head(student_data,5)
##   X   Name Age Score Work_type Income Satisfaction_score
## 1 1  Alice  25    85 Part.time  10000                  5
## 2 2  Brian  35    90 Part.time  12000                  6
## 3 3 Carlos  24    88 Full.time  34000                  7
## 4 4  Diana  25    92 Full.time  36000                  8
## 5 5  Emily  34    88 Part.time   9000                  2
head(student_data)
##   X   Name Age Score Work_type Income Satisfaction_score
## 1 1  Alice  25    85 Part.time  10000                  5
## 2 2  Brian  35    90 Part.time  12000                  6
## 3 3 Carlos  24    88 Full.time  34000                  7
## 4 4  Diana  25    92 Full.time  36000                  8
## 5 5  Emily  34    88 Part.time   9000                  2
## 6 6  Farah  33    86 Full.time  37000                  9
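
The X column in the output above is most likely the row names that write.csv() stored when the file was first saved. That is an assumption, since the export step is not shown here. A minimal sketch, using a hypothetical three-row data frame called demo and a temporary file, of how that column appears and how to avoid it:

```r
# Hypothetical miniature of the student data (values taken from the first rows above)
demo <- data.frame(
  Name   = c("Alice", "Brian", "Carlos"),
  Age    = c(25, 35, 24),
  Income = c(10000, 12000, 34000)
)

tmp <- tempfile(fileext = ".csv")

# Default row.names = TRUE writes the row names as an unnamed first column,
# which read.csv() then labels "X"
write.csv(demo, tmp)
with_x <- read.csv(tmp)
print(colnames(with_x))

# Writing without row names avoids the extra column
write.csv(demo, tmp, row.names = FALSE)
without_x <- read.csv(tmp)
print(colnames(without_x))

unlink(tmp)  # remove the temporary file
```

If the file already exists with that extra column, it can simply be ignored or dropped after import.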

Now let's perform some descriptive statistics: the mean, median, and standard deviation of Income and Age.

mean(student_data$Income)
## [1] 18035.71
median(student_data$Income)
## [1] 10500
sd(student_data$Income)
## [1] 12866.87
mean(student_data$Age)
## [1] 26.85714
median(student_data$Age)
## [1] 25.5
sd(student_data$Age)
## [1] 5.230931
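
When several columns need the same statistics, the six calls above can be collapsed into one. A minimal sketch, assuming a stand-in data frame called student_demo (a hypothetical name) whose Income and Age values are copied from the full data listing printed later in this document:

```r
# Income and Age copied from the dataset printout in this document
student_demo <- data.frame(
  Income = c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
             8000, 33000, 3000, 8500, 9000, 10000, 11000),
  Age    = c(25, 35, 24, 25, 34, 33, 34, 26, 27, 20, 20, 23, 22, 28)
)

# Apply the same three statistics to every column;
# the result has one row per statistic and one column per variable
desc_stats <- sapply(student_demo, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
})

print(round(desc_stats, 2))
```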

If you don't know the column names, use the colnames() function to see how they are written. You can also use the attach() function to attach the dataset, so you don't need to write the dataset name followed by $ each time you compute descriptive statistics or anything else.

colnames(student_data)
## [1] "X"                  "Name"               "Age"               
## [4] "Score"              "Work_type"          "Income"            
## [7] "Satisfaction_score"
attach(student_data)
mean(Income)
## [1] 18035.71
median(Income)
## [1] 10500
sd(Income)
## [1] 12866.87
mean(Age)
## [1] 26.85714
median(Age)
## [1] 25.5
sd(Age)
## [1] 5.230931
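
One caution with attach(): attached columns can be masked by, or can mask, other objects with the same name. An alternative is with(), which evaluates an expression inside the data frame without attaching it. A minimal sketch, where student_demo is a hypothetical copy of the Income and Age columns:

```r
# Hypothetical copy of two columns, values taken from this document's data listing
student_demo <- data.frame(
  Income = c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
             8000, 33000, 3000, 8500, 9000, 10000, 11000),
  Age    = c(25, 35, 24, 25, 34, 33, 34, 26, 27, 20, 20, 23, 22, 28)
)

# Column names are looked up inside student_demo, no $ and no attach() needed
income_mean <- with(student_demo, mean(Income))
age_summary <- with(student_demo,
                    c(mean = mean(Age), median = median(Age), sd = sd(Age)))

print(income_mean)
print(age_summary)

# If attach(student_data) was used earlier, detach(student_data) undoes it.
```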

Suppose we need to compare the variability of the two variables. We can't use the standard deviation alone, because the variables are not in the same units. To compare them, we compute the coefficient of variation (CV), which is the sd divided by the mean of each variable. Here we can see that Income (CV ≈ 0.71) is much more dispersed than Age (CV ≈ 0.19). As a rule of thumb: CV < 0.1 → excellent stability • 0.1–0.2 → acceptable • 0.2–0.3 → borderline • > 0.3 → highly dispersed, too much variability, may indicate poor reliability

cv.income <- sd(Income)/mean(Income)
cv.age <- sd(Age)/mean(Age)
cv.income
## [1] 0.7134106
cv.age
## [1] 0.1947687
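
Since the CV is the same formula for every variable, it can be wrapped in a small helper. A minimal sketch, where cv() and student_demo are hypothetical names introduced for illustration and the values are copied from this document's data listing:

```r
# Hypothetical copy of two columns from the student data
student_demo <- data.frame(
  Income = c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
             8000, 33000, 3000, 8500, 9000, 10000, 11000),
  Age    = c(25, 35, 24, 25, 34, 33, 34, 26, 27, 20, 20, 23, 22, 28)
)

# Coefficient of variation: standard deviation relative to the mean (unitless)
cv <- function(x) sd(x) / mean(x)

# One CV per column, returned as a named numeric vector
cv_values <- sapply(student_demo, cv)
print(round(cv_values, 3))
```

Because the CV divides by the mean, it is only meaningful for variables measured on a scale with a true zero and a mean that is not close to zero.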

We can also compute the correlation between our variables. The default method is Pearson unless you specify another one with the method = "" argument. In the same vein, if you also want the p-value, use the cor.test() function instead of cor().

cor.test(Income, Age)
## 
##  Pearson's product-moment correlation
## 
## data:  Income and Age
## t = 1.1867, df = 12, p-value = 0.2583
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2493688  0.7292685
## sample estimates:
##       cor 
## 0.3240912
cor.test(Income, Age, method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  Income and Age
## t = 1.1867, df = 12, p-value = 0.2583
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2493688  0.7292685
## sample estimates:
##       cor 
## 0.3240912
cor.test(Income,Satisfaction_score, method="spearman")
## Warning in cor.test.default(Income, Satisfaction_score, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  Income and Satisfaction_score
## S = 334.81, p-value = 0.3615
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## 0.264155
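
The object returned by cor.test() can be stored, so the coefficient and p-value can be pulled out individually instead of being read off the printout. A minimal sketch, assuming hypothetical vectors income, age and satisfaction copied from this document's data listing. Passing exact = FALSE asks for the approximate p-value directly, which avoids the ties warning shown above:

```r
# Values copied from the dataset printout in this document
income <- c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
            8000, 33000, 3000, 8500, 9000, 10000, 11000)
age <- c(25, 35, 24, 25, 34, 33, 34, 26, 27, 20, 20, 23, 22, 28)
satisfaction <- c(5, 6, 7, 8, 2, 9, 3, 8, 3, 9, 1, 2, 3, 4)

# Pearson is the default method
pearson <- cor.test(income, age)
r_value <- unname(pearson$estimate)  # correlation coefficient
p_value <- pearson$p.value           # two-sided p-value

# Spearman on data with ties: exact = FALSE requests the approximation
spearman <- cor.test(income, satisfaction, method = "spearman", exact = FALSE)
rho_value <- unname(spearman$estimate)

print(c(r = r_value, p = p_value, rho = rho_value))
```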

We can also create a correlation matrix to calculate the correlations between all the variables we are interested in. To do that, we first need to extract those variables by subsetting them from the main dataset. Subsetting uses the form [r, c], where r is rows and c is columns.

matrix <- student_data[ ,c("Income", "Age", "Score", "Satisfaction_score")]
matrix
##    Income Age Score Satisfaction_score
## 1   10000  25    85                  5
## 2   12000  35    90                  6
## 3   34000  24    88                  7
## 4   36000  25    92                  8
## 5    9000  34    88                  2
## 6   37000  33    86                  9
## 7   32000  34    87                  3
## 8    8000  26   100                  8
## 9   33000  27    89                  3
## 10   3000  20    84                  9
## 11   8500  20    89                  1
## 12   9000  23    91                  2
## 13  10000  22    87                  3
## 14  11000  28    93                  4
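
The [r, c] form works the same way for rows as for columns. A minimal sketch on a hypothetical four-row data frame called student_demo, built from the first rows of the student data:

```r
# Hypothetical four-row excerpt of the student data
student_demo <- data.frame(
  Name   = c("Alice", "Brian", "Carlos", "Diana"),
  Age    = c(25, 35, 24, 25),
  Income = c(10000, 12000, 34000, 36000)
)

# Empty row position: all rows, two named columns
all_rows <- student_demo[, c("Income", "Age")]

# Empty column position: rows 1 and 2, all columns
first_two <- student_demo[1:2, ]

# A logical condition in the row position keeps only matching rows
older <- student_demo[student_demo$Age > 24, c("Name", "Age")]

print(all_rows)
print(first_two)
print(older)
```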

Leaving the position before the comma empty, as in [ , c(...)], means we want all the rows of the dataset for the column names selected in c(). Remember to specify the method for cor(), since it defaults to Pearson.

cor.matrix<- cor(matrix, method="spearman")
cor.matrix
##                         Income        Age       Score Satisfaction_score
## Income              1.00000000 0.43756933 -0.01546962         0.26415501
## Age                 0.43756933 1.00000000  0.16261062         0.05000049
## Score              -0.01546962 0.16261062  1.00000000        -0.14555699
## Satisfaction_score  0.26415501 0.05000049 -0.14555699         1.00000000
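
A correlation matrix is easier to read when rounded. A minimal sketch, assuming a stand-in data frame with the four columns copied from the listing above. The name num_vars is hypothetical and is used here instead of matrix so that the base R function matrix() is not shadowed:

```r
# The four numeric columns, copied from the printout above
num_vars <- data.frame(
  Income = c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
             8000, 33000, 3000, 8500, 9000, 10000, 11000),
  Age    = c(25, 35, 24, 25, 34, 33, 34, 26, 27, 20, 20, 23, 22, 28),
  Score  = c(85, 90, 88, 92, 88, 86, 87, 100, 89, 84, 89, 91, 87, 93),
  Satisfaction_score = c(5, 6, 7, 8, 2, 9, 3, 8, 3, 9, 1, 2, 3, 4)
)

# Spearman rank correlations between every pair of columns
cor_spearman <- cor(num_vars, method = "spearman")

# Two decimals is usually enough for reading the matrix
print(round(cor_spearman, 2))

# A single pair can be picked out by row and column name
income_age <- cor_spearman["Income", "Age"]
print(income_age)
```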

We can also identify outliers in the variables in our dataset. Two common ways to do this in R are the boxplot method and the IQR method. The boxplot gives a visual presentation of the outliers but doesn't tell us which rows they are; we get that information from the IQR method.

boxplot(Income)

boxplot(Age)

boxplot(Score)
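
The points a boxplot draws as outliers can also be retrieved as numbers with boxplot.stats(), which is the function boxplot() relies on for its statistics. A minimal sketch, assuming a hypothetical vector score copied from this document's data listing. Note that boxplot.stats() works from hinges, which can differ slightly from the quartiles returned by quantile():

```r
# Score values copied from the dataset printout in this document
score <- c(85, 90, 88, 92, 88, 86, 87, 100, 89, 84, 89, 91, 87, 93)

box_info <- boxplot.stats(score)

# $out holds the values lying beyond the whiskers
outlier_values <- box_info$out

# Match those values back to their positions in the vector
outlier_rows <- which(score %in% outlier_values)

print(outlier_values)
print(outlier_rows)
```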

To get the outliers using the IQR method, we first compute Q1 and Q3 with the quantile() function. From those we get the IQR and then the lower and upper bounds. Using the which() function, we find the outliers: values lower than the lower bound or higher than the upper bound.

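Q1 <- quantile(Score, 0.25)    # first quartile
Q3 <- quantile(Score, 0.75)    # third quartile
IQR <- Q3 - Q1                 # interquartile range
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# positions (row numbers) of the values outside the bounds
outliers <- which(Score < lower_bound | Score > upper_bound)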

Q1
## 25% 
##  87
Q3
##   75% 
## 90.75
IQR
##  75% 
## 3.75
lower_bound
##    25% 
## 81.375
upper_bound
##    75% 
## 96.375
outliers
## [1] 8
typeof(outliers)
## [1] "integer"
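
The steps above can be wrapped in a function so the same check can be run on any column. A minimal sketch, where iqr_outlier_rows() and student_demo are hypothetical names introduced for illustration, and the multiplier k = 1.5 matches the bounds used above:

```r
# Return the positions of values outside [Q1 - k*IQR, Q3 + k*IQR]
iqr_outlier_rows <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  spread <- q[2] - q[1]                           # interquartile range
  which(x < q[1] - k * spread | x > q[2] + k * spread)
}

# Two columns copied from the dataset printout in this document
student_demo <- data.frame(
  Income = c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
             8000, 33000, 3000, 8500, 9000, 10000, 11000),
  Score  = c(85, 90, 88, 92, 88, 86, 87, 100, 89, 84, 89, 91, 87, 93)
)

score_rows  <- iqr_outlier_rows(student_demo$Score)
income_rows <- iqr_outlier_rows(student_demo$Income)

# Use the positions to display the full rows flagged as outliers
print(student_demo[score_rows, ])
print(length(income_rows))  # number of Income outliers under the same rule
```

Indexing the data frame with the returned positions shows the whole record, which is usually more useful than the row number alone.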