library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
library(moments)
campus <- read.csv("advanced_smart_campus_7500_rows.csv")

1 Level 1: UNDERSTANDING THE DATA

1.1 Question 1.1: What is the structure of the dataset?

str(campus)
## 'data.frame':    7500 obs. of  14 variables:
##  $ Room_ID      : chr  "R1" "R2" "R3" "R4" ...
##  $ Capacity     : int  139 96 122 128 132 47 113 136 87 116 ...
##  $ Students_Used: int  139 23 39 13 132 47 28 90 87 110 ...
##  $ Electricity  : int  11 30 12 16 18 12 47 54 61 51 ...
##  $ Time_Slot    : chr  "Afternoon" "Morning" "Afternoon" "Evening" ...
##  $ Date         : chr  "2024-01-01 00:00:00" "2024-01-01 01:00:00" "2024-01-01 02:00:00" "2024-01-01 03:00:00" ...
##  $ Department   : chr  "CE" "CS" "CS" "ME" ...
##  $ Floor        : int  1 1 5 4 3 3 2 7 7 6 ...
##  $ Utilization  : num  100 24 32 10.2 100 ...
##  $ Efficiency   : num  12.636 0.767 3.25 0.812 7.333 ...
##  $ Day          : chr  "2024-01-01" "2024-01-01" "2024-01-01" "2024-01-01" ...
##  $ Hour         : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Is_Weekend   : chr  "False" "False" "False" "False" ...
##  $ Load_Level   : chr  "High" "Low" "Low" "Low" ...
dim(campus)
## [1] 7500   14
head(campus)
##   Room_ID Capacity Students_Used Electricity Time_Slot                Date
## 1      R1      139           139          11 Afternoon 2024-01-01 00:00:00
## 2      R2       96            23          30   Morning 2024-01-01 01:00:00
## 3      R3      122            39          12 Afternoon 2024-01-01 02:00:00
## 4      R4      128            13          16   Evening 2024-01-01 03:00:00
## 5      R5      132           132          18   Morning 2024-01-01 04:00:00
## 6      R6       47            47          12 Afternoon 2024-01-01 05:00:00
##   Department Floor Utilization Efficiency        Day Hour Is_Weekend Load_Level
## 1         CE     1   100.00000 12.6363636 2024-01-01    0      False       High
## 2         CS     1    23.95833  0.7666667 2024-01-01    1      False        Low
## 3         CS     5    31.96721  3.2500000 2024-01-01    2      False        Low
## 4         ME     4    10.15625  0.8125000 2024-01-01    3      False        Low
## 5         CE     3   100.00000  7.3333333 2024-01-01    4      False       High
## 6        ECE     3   100.00000  3.9166667 2024-01-01    5      False       High

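Answer: The dataset has 7,500 observations of 14 variables. Each row describes one room at one timestamp: its capacity, the number of students using it, electricity consumption, time slot, department, and floor, plus the fields Utilization, Efficiency, and Load_Level. Date, Day, and Is_Weekend are stored as character strings, so they need conversion before any date-based or logical analysis.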
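Since str() reports Date and Is_Weekend as character columns, a conversion step along the following lines may be useful. This is a minimal sketch on a hypothetical three-row excerpt (campus_demo is not part of the report); the timestamp format and the UTC time zone are assumptions.

```r
# Hypothetical three-row excerpt with the column types that str() reports.
campus_demo <- data.frame(
  Date       = c("2024-01-01 00:00:00", "2024-01-01 01:00:00", "2024-01-06 09:00:00"),
  Is_Weekend = c("False", "False", "True"),
  stringsAsFactors = FALSE
)

# Parse the timestamp; the format string and the UTC time zone are assumptions.
campus_demo$Date <- as.POSIXct(campus_demo$Date,
                               format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Turn the "True"/"False" strings into a logical column.
campus_demo$Is_Weekend <- campus_demo$Is_Weekend == "True"

str(campus_demo)
```

After conversion, Is_Weekend can be used directly in conditions and Date supports time-based grouping.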

1.2 Question 1.2: Are there any missing values in the dataset?

colSums(is.na(campus))
##       Room_ID      Capacity Students_Used   Electricity     Time_Slot 
##             0             0             0             0             0 
##          Date    Department         Floor   Utilization    Efficiency 
##             0             0             0             0             0 
##           Day          Hour    Is_Weekend    Load_Level 
##             0             0             0             0
sum(is.na(campus))
## [1] 0

Answer: Every column has 0 missing values, so the dataset contains no NA entries and no imputation is needed.
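is.na() only detects true NA values. In character columns, empty strings can hide missing data, so an extra check like the one below may be worthwhile. The data frame demo is hypothetical and only illustrates the idea.

```r
# Hypothetical data: one true NA and one empty string.
demo <- data.frame(
  Room_ID    = c("R1", "R2", "R3"),
  Department = c("CE", "", NA),
  stringsAsFactors = FALSE
)

# is.na() counts only the real NA.
na_count <- colSums(is.na(demo))

# Count empty or whitespace-only strings in character columns.
blank_count <- sapply(demo, function(col) {
  if (is.character(col)) sum(trimws(col) == "", na.rm = TRUE) else 0L
})

na_count
blank_count
```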

1.3 Question 1.3: What is the average and median utilization?

mean(campus$Utilization, na.rm = TRUE)
## [1] 74.32348
median(campus$Utilization, na.rm = TRUE)
## [1] 88.80215

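Answer: The mean utilization is 74.3% and the median is 88.8%. The median is the middle value and is less affected by extreme values than the mean; a median this far above the mean indicates that a minority of low-utilization observations pulls the average down. na.rm = TRUE makes both functions ignore missing values, although this dataset has none.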
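The gap between mean and median can be reproduced on a small example. The vector util_demo below is hypothetical: mostly full rooms plus two nearly empty ones.

```r
# Hypothetical utilization values (percent): mostly full rooms, two nearly empty.
util_demo <- c(100, 100, 95, 90, 88, 85, 20, 10)

mean(util_demo)    # 73.5, pulled down by the two low values
median(util_demo)  # 89, stays near the typical room
```

The same pattern, a mean well below the median, is what the report's figures show.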

1.4 Question 1.4: What is the range of capacity and electricity usage?

range(campus$Capacity, na.rm = TRUE)
## [1]  30 149
range(campus$Electricity, na.rm = TRUE)
## [1]  5 79

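Answer: Capacity ranges from 30 to 149 and electricity usage from 5 to 79 (the dataset does not state the units). range() returns only the minimum and maximum, so it shows the extent of the data but not how the values are distributed in between.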

1.5 Question 1.5: Which departments appear most frequently?

sort(table(campus$Department), decreasing = TRUE)
## 
##   IT   ME  ECE   CE  BBA   CS 
## 1322 1293 1259 1226 1212 1188

Answer: table() counts the records for each department and sort() orders them from most to least frequent. IT appears most often (1,322 records) and CS least often (1,188); the six departments are fairly evenly represented.
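Converting the counts to percentages makes the comparison easier. The sketch below re-enters the counts from the table() output by hand so that it runs on its own; with the data loaded, prop.table(table(campus$Department)) should give the same proportions.

```r
# Counts copied from the table() output above.
dept_count <- c(IT = 1322, ME = 1293, ECE = 1259, CE = 1226, BBA = 1212, CS = 1188)

# Share of all records, in percent.
dept_share <- round(100 * dept_count / sum(dept_count), 1)
dept_share
```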

2 Level 2: DATA CLEANING & MANIPULATION

2.1 Question 2.1: Remove duplicate records

campus <- read.csv("advanced_smart_campus_7500_rows.csv")

sum(duplicated(campus))
## [1] 0
campus <- campus[!duplicated(campus), ]

dim(campus)
## [1] 7500   14

Answer: The data set was checked for duplicate records using the duplicated() function. The result shows that there are no duplicate rows in the data set. After applying duplicate removal, the number of rows remains unchanged, confirming that the data set is already clean.

2.2 Question 2.2: Top 10 rooms with highest utilization

library(dplyr)

campus %>%
  arrange(desc(Utilization)) %>%
  select(Room_ID, Department, Utilization, Efficiency) %>%
  head(10)
##    Room_ID Department Utilization Efficiency
## 1       R1         CE         100  12.636364
## 2       R5         CE         100   7.333333
## 3       R6        ECE         100   3.916667
## 4       R9        BBA         100   1.426230
## 5      R11         ME         100  10.583333
## 6      R17         CE         100   1.381818
## 7      R21         IT         100   2.511111
## 8      R26        BBA         100   1.256410
## 9      R27         IT         100   1.215385
## 10     R28         CS         100   1.734694

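Answer: These are ten observations at 100% utilization, meaning the room was filled to capacity. Many rows tie at 100%, and the output lists tied rows in their original order, so head(10) returns the first ten of them rather than a unique top ten. Full occupancy indicates high demand, but it is not the same as efficiency: the Efficiency values of these ten rows range from 1.2 to 12.6.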
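When many rows share the maximum, it can help to count the ties and to add a second sort key. The following base R sketch uses a hypothetical data frame (rooms) and, as an assumption, breaks ties by Efficiency.

```r
# Hypothetical rows; three rooms are tied at 100% utilization.
rooms <- data.frame(
  Room_ID     = c("R1", "R2", "R3", "R4"),
  Utilization = c(100, 100, 60, 100),
  Efficiency  = c(1.4, 12.6, 3.0, 7.3),
  stringsAsFactors = FALSE
)

# How many rows share the maximum utilization?
n_tied <- sum(rooms$Utilization == max(rooms$Utilization))
n_tied

# Sort by utilization, then break ties by efficiency (both descending).
ranked <- rooms[order(-rooms$Utilization, -rooms$Efficiency), ]
head(ranked, 3)
```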

Question 2.3: Identify underutilized rooms

campus %>%
  filter(Utilization < 0.3) %>%
  select(Room_ID, Capacity, Students_Used, Utilization)
## [1] Room_ID       Capacity      Students_Used Utilization  
## <0 rows> (or 0-length row.names)
range(campus$Utilization)
## [1]   6.756757 100.000000

Answer: The filter returns no rows because the threshold is on the wrong scale. Utilization is stored as a percentage (its range is 6.76 to 100), so the condition Utilization < 0.3 can never be true; selecting rooms below 30% occupancy requires Utilization < 30. The minimum of 6.76% shows that such observations exist. Identifying them highlights wasted space and supports better scheduling and allocation.
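Because range() shows Utilization on a 0 to 100 scale, the cutoff has to be expressed in percent. The sketch below uses four rows copied from the head(campus) output and recomputes Utilization from them; it is written in base R, and the data frame name rooms is only illustrative.

```r
# Rows R2 to R5 copied from the head(campus) output.
rooms <- data.frame(
  Room_ID       = c("R2", "R3", "R4", "R5"),
  Capacity      = c(96, 122, 128, 132),
  Students_Used = c(23, 39, 13, 132),
  stringsAsFactors = FALSE
)
rooms$Utilization <- 100 * rooms$Students_Used / rooms$Capacity

# A fraction-scale cutoff matches nothing on a percent-scale column.
n_fraction_scale <- nrow(rooms[rooms$Utilization < 0.3, ])
n_fraction_scale

# The percent-scale cutoff finds the rooms below 30% occupancy.
underused <- rooms[rooms$Utilization < 30, ]
underused
```

With dplyr, the equivalent condition would be filter(Utilization < 30).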

2.3 Question 2.4: Average utilization by department

campus %>%
  group_by(Department) %>%
  summarise(avg_util = mean(Utilization, na.rm = TRUE)) %>%
  arrange(desc(avg_util))
## # A tibble: 6 × 2
##   Department avg_util
##   <chr>         <dbl>
## 1 ME             75.2
## 2 ECE            74.9
## 3 CS             74.8
## 4 BBA            74.7
## 5 CE             73.6
## 6 IT             72.7

Answer: Average utilization is similar across departments, from 72.7% (IT) to 75.2% (ME), a spread of about 2.5 percentage points. The departments are sorted from highest to lowest for easy comparison, but the differences are too small to single out any department as using its rooms markedly better.

Question 2.5: Average electricity usage by floor

campus %>%
  group_by(Floor) %>%
  summarise(avg_electricity = mean(Electricity, na.rm = TRUE))
## # A tibble: 7 × 2
##   Floor avg_electricity
##   <int>           <dbl>
## 1     1            41.5
## 2     2            43.0
## 3     3            42.0
## 4     4            42.5
## 5     5            42.2
## 6     6            41.9
## 7     7            42.4

Answer: Average electricity consumption is almost the same on every floor, from 41.5 (floor 1) to 43.0 (floor 2). Floor alone does not separate high-consumption from low-consumption rooms in this dataset.

3 Level 3: DATA TRANSFORMATION

3.1 Question 3.1: Create utilization category

campus$Utilization_Level <- ifelse(campus$Utilization > 70, "High",
                           ifelse(campus$Utilization > 40, "Medium","Low"))

Answer: Each observation is labelled High (utilization above 70%), Medium (above 40% and up to 70%), or Low (40% or below).
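As an alternative to nested ifelse() calls, cut() can produce the same three bands as an ordered set of factor levels. This is a sketch on hypothetical values (util_demo), chosen to include both boundary cases.

```r
# Hypothetical utilization values, including both boundary cases (40 and 70).
util_demo <- c(10.2, 40, 40.1, 70, 70.1, 100)

# Right-closed intervals: (-Inf, 40], (40, 70], (70, Inf).
level_cut <- cut(util_demo,
                 breaks = c(-Inf, 40, 70, Inf),
                 labels = c("Low", "Medium", "High"))

# The nested ifelse() from the report, for comparison.
level_ifelse <- ifelse(util_demo > 70, "High",
                       ifelse(util_demo > 40, "Medium", "Low"))

table(level_cut)
```

One practical difference is that table() then lists the levels as Low, Medium, High instead of alphabetically.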

3.2 Question 3.2: Count rooms in each category

table(campus$Utilization_Level)
## 
##   High    Low Medium 
##   4621   1396   1483

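Answer: Of the 7,500 observations, 4,621 (61.6%) are High, 1,483 (19.8%) are Medium, and 1,396 (18.6%) are Low. Most room use is therefore in the high band, while roughly one observation in five falls in the low band. Note that table() lists the levels alphabetically, not in order of utilization.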

3.3 Question 3.3: Average efficiency by department

campus %>%
  group_by(Department) %>%
  summarise(avg_eff = mean(Efficiency, na.rm = TRUE)) %>%
  arrange(desc(avg_eff))
## # A tibble: 6 × 2
##   Department avg_eff
##   <chr>        <dbl>
## 1 ME            2.48
## 2 CS            2.48
## 3 CE            2.39
## 4 BBA           2.36
## 5 IT            2.32
## 6 ECE           2.29

Answer: Average efficiency ranges from 2.29 (ECE) to 2.48 (ME and CS). In the rows shown by head(), Efficiency equals Students_Used divided by Electricity, so a higher value means more students served per unit of electricity. The differences between departments are small.
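The dataset does not document how Utilization and Efficiency are computed, but the first six rows printed by head(campus) are consistent with two simple ratios. The check below is a sketch that re-enters those rows by hand; the formulas are inferred from these rows only and may not hold for the whole file.

```r
# First six rows, copied from the head(campus) output.
first_rows <- data.frame(
  Capacity      = c(139, 96, 122, 128, 132, 47),
  Students_Used = c(139, 23, 39, 13, 132, 47),
  Electricity   = c(11, 30, 12, 16, 18, 12),
  Utilization   = c(100, 23.95833, 31.96721, 10.15625, 100, 100),
  Efficiency    = c(12.6363636, 0.7666667, 3.25, 0.8125, 7.3333333, 3.9166667)
)

# Candidate formulas (inferred from these rows, not documented by the dataset).
util_check <- 100 * first_rows$Students_Used / first_rows$Capacity
eff_check  <- first_rows$Students_Used / first_rows$Electricity

all.equal(util_check, first_rows$Utilization, tolerance = 1e-6)
all.equal(eff_check,  first_rows$Efficiency,  tolerance = 1e-6)
```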

4 Level 4: DATA VISUALIZATION

4.1 Question 4.1: Histogram of utilization

library(ggplot2)

ggplot(campus, aes(x = Utilization)) +
  geom_histogram(bins = 30) +
  labs(title = "Utilization Distribution",
       x = "Utilization",
       y = "Count")

Answer: The histogram shows the distribution of utilization values. Most of the counts fall at the high end of the scale, so classrooms are generally well utilized; the smaller bars at the low end represent underutilized observations.

4.2 Question 4.2: Boxplot of electricity usage by department

ggplot(campus, aes(x = Department, y = Electricity)) +
  geom_boxplot() +
  labs(title = "Electricity Usage by Department",
       x = "Department",
       y = "Electricity")

Answer: This boxplot compares electricity usage across departments. Each box shows the median and interquartile range of consumption for one department; a higher box indicates greater typical usage and a taller box indicates greater variability.

4.3 Question 4.3: Scatter plot (Capacity vs Students Used)

ggplot(campus, aes(x = Capacity, y = Students_Used)) +
  geom_point() +
  labs(title = "Capacity vs Students Used",
       x = "Capacity",
       y = "Students Used")

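Answer: This scatter plot shows the relationship between room capacity and the number of students using the room. An upward pattern indicates a positive relationship; the correlation between the two variables is moderate (r = 0.48, from the matrix in Question 6.1). Utilization never exceeds 100%, so the points lie on or below the line Students_Used = Capacity.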

4.4 Question 4.4: Bar chart for load levels

ggplot(campus, aes(x = Load_Level)) +
  geom_bar() +
  labs(title = "Load Level Distribution",
       x = "Load Level",
       y = "Count")

Answer: This bar chart shows how many observations fall into each Load_Level category, which makes it easy to see which load level is most common on the campus.

4.5 Question 4.5: Bar Chart Showing Average Utilization by Department

ggplot(campus, aes(x = Department, y = Utilization)) +
  geom_bar(stat = "summary", fun = "mean") +
  theme_minimal()

Answer: A bar chart compares the average utilization across departments. Setting stat = "summary" with fun = "mean" makes ggplot2 compute the mean utilization for each department before drawing the bars.

5 Level 5: EXPLORATORY DATA ANALYSIS

5.1 Question 5.1: Outlier Detection using IQR Method

Q1 <- quantile(campus$Utilization, 0.25)
Q3 <- quantile(campus$Utilization, 0.75)
IQR_value <- Q3 - Q1

lower <- Q1 - 1.5 * IQR_value
upper <- Q3 + 1.5 * IQR_value

campus$Utilization[campus$Utilization < lower | campus$Utilization > upper]
## numeric(0)

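Answer: The IQR covers the middle 50% of the data. The result numeric(0) means that no utilization value lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, so this method finds no outliers.

Question 5.2: Detect Outliers using Z-score Method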

z_scores <- (campus$Utilization - mean(campus$Utilization, na.rm = TRUE)) / 
            sd(campus$Utilization, na.rm = TRUE)

outliers <- campus$Utilization[abs(z_scores) > 3]

outliers
## numeric(0)

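Answer: The Z-score method flags values whose absolute Z-score exceeds 3, that is, values more than three standard deviations from the mean. The result numeric(0) shows that no utilization value meets this criterion, in agreement with the IQR method.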
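The two outlier rules from Questions 5.1 and 5.2 can be wrapped in small reusable functions. The sketch below uses hypothetical names (iqr_outliers, z_outliers) and a made-up vector with one extreme value, so that both rules have something to find.

```r
# Flag values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

# Flag values more than `cutoff` standard deviations from the mean.
z_outliers <- function(x, cutoff = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  x[!is.na(z) & abs(z) > cutoff]
}

# Hypothetical data: twenty ordinary readings plus one extreme value.
demo <- c(rep(c(40, 45, 50, 55, 60), times = 4), 500)

iqr_outliers(demo)
z_outliers(demo)
```

Both functions return an empty numeric vector when nothing is flagged, which is the numeric(0) seen in the report.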

5.2 Question 5.3: Group-wise Analysis

library(dplyr)

campus %>%
  group_by(Department) %>%
  summarise(
    avg_util = mean(Utilization),
    avg_electricity = mean(Electricity)
  )
## # A tibble: 6 × 3
##   Department avg_util avg_electricity
##   <chr>         <dbl>           <dbl>
## 1 BBA            74.7            42.4
## 2 CE             73.6            42.7
## 3 CS             74.8            41.7
## 4 ECE            74.9            42.5
## 5 IT             72.7            42.6
## 6 ME             75.2            41.6

Answer: Both averages are nearly constant across departments: utilization lies between 72.7% and 75.2%, and electricity usage between 41.6 and 42.7.

5.3 Question 5.4: Detect skewness (Left or Right Skewed Data)

library(moments)
skewness(campus$Utilization)
## [1] -0.7341737

Answer: The skewness of -0.73 is negative, so the distribution is left-skewed: most values are concentrated at the high end, with a tail toward low utilization. This agrees with the median (88.8) being higher than the mean (74.3).
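For readers without the moments package, a moment-based skewness can be computed in base R. The function skew_moment below is a hypothetical helper; it is believed to follow the same definition as moments::skewness (third central moment divided by the 1.5 power of the second), but treat that equivalence as an assumption.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second.
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical utilization values with a long tail toward low values.
left_tailed <- c(10, 80, 90, 95, 100)

skew_moment(left_tailed)    # negative: left-skewed
skew_moment(-left_tailed)   # same magnitude, opposite sign
```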

5.4 Question 5.5: Density Plot (Distribution Shape)

ggplot(campus, aes(x = Utilization)) +
  geom_density(fill = "green") +
  labs(title = "Density Plot of Utilization") +
  theme_minimal()

Answer: The density plot shows the shape of the utilization distribution as a smooth curve. It provides a visual check of the skewness computed in Question 5.4: for left-skewed data, most of the area lies at high utilization, with a tail extending toward low values.

6 Level 6: CORRELATION ANALYSIS

6.1 Question 6.1: What is the correlation between utilization, electricity usage, and students used?

numeric_data <- campus[, c("Utilization",
                           "Electricity",
                           "Students_Used",
                           "Capacity",
                           "Efficiency")]

cor_matrix <- cor(numeric_data)

print(cor_matrix)
##                Utilization  Electricity Students_Used    Capacity Efficiency
## Utilization    1.000000000 -0.003481175    0.56978980 -0.39654103  0.2534893
## Electricity   -0.003481175  1.000000000    0.01315685  0.01502653 -0.6156781
## Students_Used  0.569789799  0.013156852    1.00000000  0.47945998  0.4199094
## Capacity      -0.396541029  0.015026529    0.47945998  1.00000000  0.1858072
## Efficiency     0.253489302 -0.615678100    0.41990943  0.18580717  1.0000000

Answer: The correlation matrix summarizes the linear relationships between the numerical variables. Electricity is essentially uncorrelated with Utilization (r = -0.003), Students_Used (r = 0.013), and Capacity (r = 0.015), so busier or larger rooms do not consume more electricity in this dataset. Utilization is moderately positively correlated with Students_Used (r = 0.57) and negatively correlated with Capacity (r = -0.40): larger rooms tend to be less full. The strongest relationship involving Electricity is with Efficiency (r = -0.62), which is expected if Efficiency is computed as Students_Used divided by Electricity.

6.2 Question 6.2: Visualize the correlation matrix using heatmap

library(corrplot)

corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.col = "black",
         tl.srt = 45)

Answer: The heatmap represents each correlation by color, with more intense colors for values further from zero. It makes the few sizeable relationships easy to spot (Utilization with Students_Used, Efficiency with Electricity) and shows that the cells linking Electricity to Utilization, Students_Used, and Capacity are close to zero.

6.3 Question 6.3: What does the correlation imply?

Answer: The correlations do not support the idea that electricity usage rises with utilization or student occupancy: the coefficients between Electricity and Utilization, Students_Used, and Capacity are all within 0.02 of zero. In this dataset, how full a room is says almost nothing about how much electricity it consumes, which suggests that consumption is driven by factors not captured by these variables. For administrators, this means energy is used at similar levels whether rooms are full or nearly empty, so matching power use to actual occupancy is a possible area for savings.
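Whether a correlation this small differs from zero can be checked with the usual t-test for a Pearson coefficient. The sketch below plugs in the values printed in the report (r from the correlation matrix, n from dim()); with the data loaded, cor.test(campus$Utilization, campus$Electricity) should run the same test.

```r
# Values copied from the correlation matrix and the dim() output.
r <- -0.003481175   # Utilization vs Electricity
n <- 7500

# t statistic and two-sided p-value for H0: true correlation = 0.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(t = t_stat, p = p_value)
```

A p-value well above 0.05 means the data give no evidence of a linear relationship between the two variables.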

7 Level 7: REGRESSION ANALYSIS

7.1 Question 7.1: Can we predict electricity usage using utilization and students used?

set.seed(123)

sample_rows <- sample(1:nrow(campus),
                      0.8 * nrow(campus))

train_data <- campus[sample_rows, ]

test_data <- campus[-sample_rows, ]

Answer: To test this, the dataset is split at random into a training set (80%, 6,000 rows) and a test set (20%, 1,500 rows). The model is fitted on the training set only and evaluated on the test set, so the error estimate reflects performance on data the model has not seen. set.seed(123) makes the split reproducible.
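A slightly more defensive version of the split uses seq_len() and floor(), so that it also behaves when 80% of the row count is not a whole number. This is a sketch on a hypothetical ten-row data frame (demo); the Electricity values are the first ten shown by str().

```r
# Hypothetical ten-row stand-in for campus.
demo <- data.frame(id = 1:10,
                   Electricity = c(11, 30, 12, 16, 18, 12, 47, 54, 61, 51))

set.seed(123)
n_train    <- floor(0.8 * nrow(demo))             # whole number of training rows
train_rows <- sample(seq_len(nrow(demo)), n_train)

train_demo <- demo[train_rows, ]
test_demo  <- demo[-train_rows, ]

# The two parts should be disjoint and together cover every row.
nrow(train_demo)
nrow(test_demo)
length(intersect(train_demo$id, test_demo$id))
```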

7.2 Build Regression Model

model <- lm(Electricity ~ Utilization + Students_Used + Capacity,
            data = train_data)

summary(model)
## 
## Call:
## lm(formula = Electricity ~ Utilization + Students_Used + Capacity, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.012 -18.784   0.477  18.347  37.358 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   44.43624    2.70390  16.434   <2e-16 ***
## Utilization   -0.02989    0.03083  -0.970    0.332    
## Students_Used  0.03133    0.03010   1.041    0.298    
## Capacity      -0.02046    0.02481  -0.825    0.410    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.51 on 5996 degrees of freedom
## Multiple R-squared:  0.000214,   Adjusted R-squared:  -0.0002862 
## F-statistic: 0.4278 on 3 and 5996 DF,  p-value: 0.7331

Answer: The model regresses electricity usage on Utilization, Students_Used, and Capacity. None of the three coefficients is statistically significant (p = 0.332, 0.298, and 0.410), and the overall F-test (p = 0.7331) gives no evidence that the predictors jointly explain electricity usage. Only the intercept (44.4) is significant, so the fitted model is close to predicting a constant value for every observation.

7.3 Make Predictions

predictions <- predict(model,
                       newdata = test_data)

results <- data.frame(
  Actual = test_data$Electricity,
  Predicted = predictions
)

head(results)
##    Actual Predicted
## 4      16  41.92116
## 10     51  42.67437
## 11     12  42.82728
## 12     76  42.05610
## 14     37  42.55362
## 15     62  42.47190

Answer: The trained model is used to estimate electricity usage for the test set, and the estimates are placed next to the actual values. In the rows shown, every prediction lies between 41.9 and 42.8, while the actual values range from 12 to 76: the model predicts roughly the same value for every observation.

7.4 Question 7.2: Compare actual vs predicted values

library(ggplot2)

ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue", size = 3) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(title = "Actual vs Predicted Electricity Consumption",
       x = "Actual Values",
       y = "Predicted Values") +
  theme_minimal()

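Answer: The scatter plot compares actual electricity usage with the model's predictions. Points on the red reference line (Predicted = Actual) would be perfect predictions. Because the predictions vary very little while the actual values vary widely, the points are expected to form a nearly horizontal band rather than follow the line, which indicates poor predictive performance.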

7.5 Question 7.3: What is the accuracy of the model?

rmse <- sqrt(mean((results$Actual -
                   results$Predicted)^2))

print(rmse)
## [1] 21.66092

Answer: RMSE (root mean square error) is the typical size of the prediction error, in the same units as electricity usage; lower is better. The test RMSE of 21.66 is large relative to the observed range of 5 to 79 and almost equal to the residual standard error on the training data (21.51), so the model is not overfitted, but it is not accurate either.
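An RMSE is easier to judge against a naive baseline that predicts the training mean for every test row. The sketch below shows the comparison on hypothetical numbers (train_y, test_y); the helper rmse() is defined here and is not part of any package.

```r
# Root mean square error between two numeric vectors of equal length.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical training and test targets.
train_y <- c(11, 30, 12, 16, 18, 12, 47, 54)
test_y  <- c(61, 51, 20, 40)

# Baseline: predict the training mean for every test row.
baseline_pred <- rep(mean(train_y), length(test_y))
baseline_rmse <- rmse(test_y, baseline_pred)
baseline_rmse
```

A regression model adds value only if its test RMSE is clearly below this baseline figure.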

7.6 Question 7.4: How much variation is explained by the model?

summary(model)$r.squared
## [1] 0.0002139938

Answer: R-squared is the proportion of the variance in electricity usage that the predictors explain. The value of 0.000214 means the model explains about 0.02% of the variance, which is effectively none; the adjusted R-squared is slightly negative (-0.0003).
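The adjusted R-squared and the F-statistic in the model summary both follow from R-squared, the number of training rows, and the number of predictors. The short calculation below reproduces them from the printed values; n = 6000 is taken from the 5996 residual degrees of freedom plus four estimated coefficients.

```r
# Values copied from summary(model).
r2 <- 0.0002139938
n  <- 6000   # training rows: 5996 residual df + 4 coefficients
p  <- 3      # number of predictors

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

c(adj_r2 = adj_r2, f_stat = f_stat)
```

Both results agree with the summary output (adjusted R-squared -0.0002862, F-statistic 0.4278) up to rounding.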

7.7 Question 7.5: Interpretation of Regression Results

Answer: The regression does not support the expectation that higher utilization and more students increase electricity consumption. No coefficient is statistically significant, R-squared is about 0.0002, and the test RMSE (21.66) is as large as the residual standard error of the training data. Utilization, student count, and capacity therefore cannot predict electricity usage in this dataset. Useful energy modelling would need other predictors, for example variables already in the dataset that were not tested here (Time_Slot, Hour, Floor, Department) or measurements that have not yet been collected.

8 Level 8: Important Visualisations

8.1 Pie Chart for Department Distribution

dept_count <- table(campus$Department)

pie(dept_count,
    main = "Department Distribution")

Answer: The pie chart shows each department's share of the records in the dataset. The shares are nearly equal, from 15.8% (CS, 1,188 records) to 17.6% (IT, 1,322 records), so no department dominates. The chart reflects record counts, not the amount of resources a department consumes.

8.2 Line Graph for Electricity Usage Trend

plot(campus$Electricity,
     type = "l",
     col = "blue",
     main = "Electricity Consumption Trend",
     xlab = "Observations",
     ylab = "Electricity")

Answer: The line graph plots electricity usage against the row index of the dataset. The rows appear to be ordered by timestamp, so the plot reads as a series over time. With 7,500 points it mainly shows how strongly consumption fluctuates between 5 and 79 and whether the overall level drifts up or down.

9 FINAL CONCLUSION

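Answer: Key Findings:
1. Electricity usage is essentially unrelated to utilization, student count, and capacity (all three correlations are within 0.02 of zero).
2. Departments and floors differ very little in average utilization (72.7% to 75.2%) and average electricity usage (41.5 to 43.0).
3. Neither the IQR method nor the Z-score method found outliers in utilization.
4. The regression model does not predict electricity usage (R-squared about 0.0002, test RMSE 21.66).
5. The largest correlations are Efficiency with Electricity (-0.62), Utilization with Students_Used (0.57), Students_Used with Capacity (0.48), and Utilization with Capacity (-0.40).
6. Utilization is left-skewed (skewness -0.73): 61.6% of observations exceed 70%.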

10 Future Improvements

Answer: Future Improvements:
1. Integrate real-time IoT sensor data.
2. Build a dashboard using Shiny.
3. Add machine learning models for advanced prediction.
4. Monitor classroom energy efficiency in real time.
5. Automate smart resource allocation.