getwd()
## [1] "C:/Users/osaze/Downloads/Winter 2026 Term/Marketing Analytics/R.Data"
Before we import data, we need to set a working directory.
setwd("C:/Users/osaze/Downloads/Winter 2026 Term/Marketing Analytics/R.Data")
Then we import our dataset, the student data we created earlier, using the read.csv() function.
student_data <- read.csv("Students_data.csv")
head(student_data,5)
## X Name Age Score Work_type Income Satisfaction_score
## 1 1 Alice 25 85 Part.time 10000 5
## 2 2 Brian 35 90 Part.time 12000 6
## 3 3 Carlos 24 88 Full.time 34000 7
## 4 4 Diana 25 92 Full.time 36000 8
## 5 5 Emily 34 88 Part.time 9000 2
head(student_data)
## X Name Age Score Work_type Income Satisfaction_score
## 1 1 Alice 25 85 Part.time 10000 5
## 2 2 Brian 35 90 Part.time 12000 6
## 3 3 Carlos 24 88 Full.time 34000 7
## 4 4 Diana 25 92 Full.time 36000 8
## 5 5 Emily 34 88 Part.time 9000 2
## 6 6 Farah 33 86 Full.time 37000 9
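The X column in the output above is most likely the row-number column that write.csv() adds by default when a data frame is saved. The following is a minimal sketch of that behaviour, assuming the file was written with write.csv(); the two-row data frame `demo` and the temporary file are hypothetical stand-ins, not the real Students_data.csv.

```r
# Hypothetical two-row data frame, written to a temporary file for illustration
demo <- data.frame(Name = c("Alice", "Brian"), Age = c(25, 35))
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp)  # default row.names = TRUE adds an unnamed first column

# Read it back as-is: the unnamed column is imported as "X"
with_x <- read.csv(tmp)

# Treat the first column as row names instead, so no "X" column appears
without_x <- read.csv(tmp, row.names = 1)

colnames(with_x)
colnames(without_x)
```

If the extra column is not needed, reading with row.names = 1 (or saving with row.names = FALSE) keeps it out of the analysis.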
Now let's compute some descriptive statistics: the mean, median, and standard deviation of Income and Age.
mean(student_data$Income)
## [1] 18035.71
median(student_data$Income)
## [1] 10500
sd(student_data$Income)
## [1] 12866.87
mean(student_data$Age)
## [1] 26.85714
median(student_data$Age)
## [1] 25.5
sd(student_data$Age)
## [1] 5.230931
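The same six numbers can be produced in one call. This is a sketch rather than a required step: the helper name `describe` is made up for illustration, and the Income and Age values are retyped from the dataset printed in this document so the block runs without the CSV file.

```r
# Income and Age retyped from the printed dataset (14 students)
student_data <- data.frame(
  Income = c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
             8000, 33000, 3000, 8500, 9000, 10000, 11000),
  Age = c(25, 35, 24, 25, 34, 33, 34, 26, 27, 20, 20, 23, 22, 28)
)

# Hypothetical helper: returns the three statistics for one numeric vector
describe <- function(x) c(mean = mean(x), median = median(x), sd = sd(x))

# Apply the helper to each selected column; the result is a 3 x 2 matrix
desc_stats <- sapply(student_data[c("Income", "Age")], describe)
round(desc_stats, 2)
```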
If you don't know the column names, use the colnames() function to see how they are written. You can also use the attach() function to attach the dataset, so you don't need to write the dataset name followed by $ each time you compute descriptive statistics.
colnames(student_data)
## [1] "X" "Name" "Age"
## [4] "Score" "Work_type" "Income"
## [7] "Satisfaction_score"
attach(student_data)
mean(Income)
## [1] 18035.71
median(Income)
## [1] 10500
sd(Income)
## [1] 12866.87
mean(Age)
## [1] 26.85714
median(Age)
## [1] 25.5
sd(Age)
## [1] 5.230931
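An attached dataset stays on the search path until detach() is called, and its column names can clash with other objects. As an alternative sketch, with() evaluates a single expression inside the data frame without attaching it. The values below are retyped from the dataset printed in this document, and the result names `income_mean` and `age_cv` are illustrative.

```r
# Income and Age retyped from the printed dataset
student_data <- data.frame(
  Income = c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
             8000, 33000, 3000, 8500, 9000, 10000, 11000),
  Age = c(25, 35, 24, 25, 34, 33, 34, 26, 27, 20, 20, 23, 22, 28)
)

# Columns are visible only inside the with() call; nothing is attached
income_mean <- with(student_data, mean(Income))
age_cv <- with(student_data, sd(Age) / mean(Age))

income_mean
age_cv
```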
Suppose we need to compare the spread of the two variables. We can't compare the standard deviations directly, because the variables are not in the same units. Instead, we compute the coefficient of variation (CV), which is sd/mean for each variable. Here we can see that Income (CV ≈ 0.71) is more dispersed than Age (CV ≈ 0.19). As a rule of thumb:
• CV < 0.1 → excellent stability
• 0.1–0.2 → acceptable
• 0.2–0.3 → borderline
• > 0.3 → highly dispersed; too much variability, may indicate poor reliability
cv.income <- sd(Income)/mean(Income)
cv.age <- sd(Age)/mean(Age)
cv.income
## [1] 0.7134106
cv.age
## [1] 0.1947687
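The CV calculation and the rule-of-thumb bands can be wrapped up as follows. This is a sketch under stated assumptions: the function name `cv` and the band labels are illustrative, and how the exact boundary values (0.1, 0.2, 0.3) are assigned is a choice made here, since the rule of thumb does not specify it.

```r
# Income and Age retyped from the printed dataset
student_data <- data.frame(
  Income = c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
             8000, 33000, 3000, 8500, 9000, 10000, 11000),
  Age = c(25, 35, 24, 25, 34, 33, 34, 26, 27, 20, 20, 23, 22, 28)
)

# Hypothetical helper: coefficient of variation = sd / mean
cv <- function(x) sd(x) / mean(x)
cv_values <- sapply(student_data[c("Income", "Age")], cv)

# Map each CV onto the rule-of-thumb bands; intervals are right-closed,
# so a CV of exactly 0.1 falls in the first band (an assumption)
cv_bands <- cut(cv_values,
                breaks = c(0, 0.1, 0.2, 0.3, Inf),
                labels = c("excellent", "acceptable", "borderline",
                           "highly dispersed"))
cv_labels <- setNames(as.character(cv_bands), names(cv_values))

round(cv_values, 4)
cv_labels
```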
We can also compute the correlation between our variables. By default the correlation is Pearson's, unless you specify a different one with the method = "" argument. In the same vein, if you also want the p-value, use the cor.test() function.
cor.test(Income, Age)
##
## Pearson's product-moment correlation
##
## data: Income and Age
## t = 1.1867, df = 12, p-value = 0.2583
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2493688 0.7292685
## sample estimates:
## cor
## 0.3240912
cor.test(Income, Age, method="pearson")
##
## Pearson's product-moment correlation
##
## data: Income and Age
## t = 1.1867, df = 12, p-value = 0.2583
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2493688 0.7292685
## sample estimates:
## cor
## 0.3240912
cor.test(Income,Satisfaction_score, method="spearman")
## Warning in cor.test.default(Income, Satisfaction_score, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: Income and Satisfaction_score
## S = 334.81, p-value = 0.3615
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.264155
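The warning above appears because the data contain tied values, so R cannot compute an exact p-value and falls back to an approximation. One way to request the approximation directly, and so avoid the warning, is the exact = FALSE argument of cor.test(). A minimal sketch, with values retyped from the dataset printed in this document; the lowercase variable names are illustrative.

```r
# Income and Satisfaction_score retyped from the printed dataset
income <- c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
            8000, 33000, 3000, 8500, 9000, 10000, 11000)
satisfaction <- c(5, 6, 7, 8, 2, 9, 3, 8, 3, 9, 1, 2, 3, 4)

# exact = FALSE asks for the approximate p-value up front,
# which is what R falls back to anyway when there are ties
res <- cor.test(income, satisfaction, method = "spearman", exact = FALSE)

res$estimate  # Spearman's rho
res$p.value   # approximate p-value
```

The estimate and p-value should match the output above; only the warning goes away.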
We can also create a correlation matrix to calculate the correlations between all the variables we are interested in. To do that, first extract the variables we need by subsetting them from the main dataset. Subsetting uses the form [r, c], where r is rows and c is columns.
matrix <- student_data[ ,c("Income", "Age", "Score", "Satisfaction_score")]
matrix
## Income Age Score Satisfaction_score
## 1 10000 25 85 5
## 2 12000 35 90 6
## 3 34000 24 88 7
## 4 36000 25 92 8
## 5 9000 34 88 2
## 6 37000 33 86 9
## 7 32000 34 87 3
## 8 8000 26 100 8
## 9 33000 27 89 3
## 10 3000 20 84 9
## 11 8500 20 89 1
## 12 9000 23 91 2
## 13 10000 22 87 3
## 14 11000 28 93 4
Leaving the row position empty before the comma, as in [ , c(...)], selects all rows for the columns named in c(). Remember to specify the method in cor(); otherwise it defaults to Pearson.
cor.matrix<- cor(matrix, method="spearman")
cor.matrix
## Income Age Score Satisfaction_score
## Income 1.00000000 0.43756933 -0.01546962 0.26415501
## Age 0.43756933 1.00000000 0.16261062 0.05000049
## Score -0.01546962 0.16261062 1.00000000 -0.14555699
## Satisfaction_score 0.26415501 0.05000049 -0.14555699 1.00000000
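A correlation matrix is easier to read when rounded, and a single coefficient can be pulled out by row and column name. The sketch below rebuilds the four columns from the printed dataset so it runs on its own. The names `cor_vars` and `cor_mat` are illustrative, chosen so the subset does not share a name with R's built-in matrix() function.

```r
# The four numeric columns retyped from the printed dataset
student_data <- data.frame(
  Income = c(10000, 12000, 34000, 36000, 9000, 37000, 32000,
             8000, 33000, 3000, 8500, 9000, 10000, 11000),
  Age = c(25, 35, 24, 25, 34, 33, 34, 26, 27, 20, 20, 23, 22, 28),
  Score = c(85, 90, 88, 92, 88, 86, 87, 100, 89, 84, 89, 91, 87, 93),
  Satisfaction_score = c(5, 6, 7, 8, 2, 9, 3, 8, 3, 9, 1, 2, 3, 4)
)

# All rows, selected columns
cor_vars <- student_data[, c("Income", "Age", "Score", "Satisfaction_score")]
cor_mat <- cor(cor_vars, method = "spearman")

round(cor_mat, 2)         # rounded for display
cor_mat["Income", "Age"]  # one coefficient, looked up by name
```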
We can also identify outliers in the variables in our dataset. Two common ways to do this in R are the boxplot method and the IQR method. The boxplot gives a visual presentation of the outliers but doesn't show which rows they are; we can get this information with the IQR method.
boxplot(Income)
boxplot(Age)
boxplot(Score)
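Although the plot itself does not label rows, the values it draws as outlier points can be retrieved with boxplot.stats(), the function boxplot() uses for its calculations. A sketch with the Score values retyped from the printed dataset; the names `score`, `bp` and `flagged_rows` are illustrative. Note that boxplot.stats() uses hinges rather than quantile() quartiles, so on other data its cut-offs may differ slightly from a manual IQR calculation.

```r
# Score retyped from the printed dataset
score <- c(85, 90, 88, 92, 88, 86, 87, 100, 89, 84, 89, 91, 87, 93)

# Statistics behind the boxplot, computed without drawing anything
bp <- boxplot.stats(score)

bp$out                                    # values plotted as outlier points
flagged_rows <- which(score %in% bp$out)  # positions of those values
flagged_rows
```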
To get the outliers using the IQR method, we need Q1 and Q3 from the
quantile() function. With those we can compute the IQR as well as the
lower bound and upper bound. Using the which() function, we can then
find the outliers: values lower than the lower bound or higher than the
upper bound.
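# First and third quartiles of Score
Q1 <- quantile(Score, 0.25)
Q3 <- quantile(Score, 0.75)
# Interquartile range (this variable shares its name with the built-in IQR() function)
IQR <- Q3 - Q1
# Fences at 1.5 * IQR beyond the quartiles
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Positions (row numbers) of scores outside the fences
outliers <- which(Score < lower_bound | Score > upper_bound)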
Q1
## 25%
## 87
Q3
## 75%
## 90.75
IQR
## 75%
## 3.75
lower_bound
## 25%
## 81.375
upper_bound
## 75%
## 96.375
outliers
## [1] 8
typeof(outliers)
## [1] "integer"
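The IQR steps above can be collected into a small reusable function. This is a sketch, not a built-in: the name `find_outliers` and its `k` argument are made up for illustration, and the Score values are retyped from the printed dataset. Because which() returns positions, the result is a vector of row numbers, which can then be used to look up the values themselves.

```r
# Score retyped from the printed dataset
score <- c(85, 90, 88, 92, 88, 86, 87, 100, 89, 84, 89, 91, 87, 93)

# Hypothetical helper: positions of values outside Q1 - k*IQR and Q3 + k*IQR
find_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75)))  # Q1 and Q3
  spread <- q[2] - q[1]                    # interquartile range
  lower <- q[1] - k * spread
  upper <- q[2] + k * spread
  which(x < lower | x > upper)
}

outlier_rows <- find_outliers(score)
outlier_rows         # row numbers of the outliers
score[outlier_rows]  # the outlying values themselves
```

The same function can be applied to Income or Age by passing a different column.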