STUDENT NAME: TUNG LINH HA
STUDENT ID: 14285872
COURSE: TRM - 32931
The Faculty of Engineering and IT at UTS offers the “Practical Data Analysis using R - Intermediate Level” module, one of the courses students can enrol in through the subject's hub. Through comprehensive step-by-step instructions and practical exercises, the module aims to give students an advanced understanding of how the R programming language is applied in data analytics.
This module has seven main objectives that students are expected to achieve by the end of the teaching period:
How to choose the right regression model for various outcomes;
How to examine continuous outcomes using linear regression;
How to examine binary outcomes using logistic regression;
How to account for confounding effects in your analysis;
How to comprehend and interpret the analysis results;
How to create and validate a prediction model; and
How to use R software for inferential data analysis.
The class ran over four weeks and was divided into six lessons, each requiring a reasonable amount of weekly attention. This gave me enough time to fully understand the material while still leaving time for my other classes and responsibilities.
One of the most exciting concepts I learnt in this class is how a predictive model is created and validated in R. I have a lot of experience with predictive models in Python, including Logistic Regression, Linear Regression and Decision Trees. When I first met these models in R, I found them particularly challenging because the syntax is so different from what I had learnt in Python. However, the concepts behind the models were very well explained, and I was soon able to grasp them. Learning how these models are applied in a different programming language was the most valuable thing I took from this module.
In my current research, I want to investigate how I can validate my results and which strategies I can use to better predict future outcomes. The research uses Electroencephalography (EEG) signals, a type of data that is extremely complicated and hard to analyse. My primary language for the data analytics in this research will be Python. However, I want to equip myself with several programming languages so that I am as well prepared as possible for the process.
As a university student majoring in AI and Data Analytics, I have to familiarise myself with a range of data analysis methods. I have a lot of experience using Python and Java to analyse datasets and develop AI models, but when it came to the Data Analytics component of my degree, I discovered that I only really knew the fundamentals. I had learned a little R in passing, and I think continuing to learn it will benefit me. My goal is to become as proficient in R as I am in my other main programming languages so that I can work on a project using R in the future.
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(rms)
## Loading required package: Hmisc
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:table1':
##
## label, label<-, units
## The following objects are masked from 'package:base':
##
## format.pval, units
library(BMA)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: leaps
## Loading required package: robustbase
##
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
##
## heart
## Loading required package: inline
## Loading required package: rrcov
## Scalable Robust Estimators with High Breakdown Point (version 1.7-5)
house = read.csv("C:\\Users\\DELL\\OneDrive\\ANDREAS\\ACADEMICS_UNIVERSITY\\Year_4_2024\\Autumn\\32931\\Classes\\R_Int\\Ass.task\\Housing prices data.csv")
head(house)
## ID crime zone industry river nox rooms age distance radial ptratio lstat
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 15.3 4.98
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 17.8 9.14
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 17.8 4.03
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 18.7 2.94
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 18.7 5.33
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 18.7 5.21
## price
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
set.seed(1234)
index = createDataPartition(house$price, p = 0.75, list = FALSE)
train = house[index, ]
dim(train)
## [1] 381 13
test = house[-index, ]
dim(test)
## [1] 125 13
table1(~ crime + zone + industry + river + nox + rooms + age + distance + radial + ptratio + lstat + price , data = train)
| | Overall (N=381) |
|---|---|
| crime | |
| Mean (SD) | 3.85 (9.36) |
| Median [Min, Max] | 0.250 [0.00906, 89.0] |
| zone | |
| Mean (SD) | 12.6 (24.5) |
| Median [Min, Max] | 0 [0, 100] |
| industry | |
| Mean (SD) | 10.9 (6.88) |
| Median [Min, Max] | 8.56 [0.460, 27.7] |
| river | |
| Mean (SD) | 0.0604 (0.238) |
| Median [Min, Max] | 0 [0, 1.00] |
| nox | |
| Mean (SD) | 0.552 (0.117) |
| Median [Min, Max] | 0.538 [0.385, 0.871] |
| rooms | |
| Mean (SD) | 6.29 (0.722) |
| Median [Min, Max] | 6.22 [3.56, 8.73] |
| age | |
| Mean (SD) | 68.1 (27.7) |
| Median [Min, Max] | 76.5 [6.00, 100] |
| distance | |
| Mean (SD) | 3.86 (2.16) |
| Median [Min, Max] | 3.36 [1.13, 10.7] |
| radial | |
| Mean (SD) | 9.41 (8.66) |
| Median [Min, Max] | 5.00 [1.00, 24.0] |
| ptratio | |
| Mean (SD) | 18.5 (2.15) |
| Median [Min, Max] | 19.0 [12.6, 22.0] |
| lstat | |
| Mean (SD) | 12.6 (7.20) |
| Median [Min, Max] | 10.9 [1.73, 38.0] |
| price | |
| Mean (SD) | 22.6 (9.43) |
| Median [Min, Max] | 21.2 [5.00, 50.0] |
table1(~ crime + zone + industry + river + nox + rooms + age + distance + radial + ptratio + lstat + price , data = test)
| | Overall (N=125) |
|---|---|
| crime | |
| Mean (SD) | 2.88 (5.69) |
| Median [Min, Max] | 0.318 [0.00632, 45.7] |
| zone | |
| Mean (SD) | 7.71 (18.8) |
| Median [Min, Max] | 0 [0, 90.0] |
| industry | |
| Mean (SD) | 11.8 (6.80) |
| Median [Min, Max] | 10.0 [1.38, 27.7] |
| river | |
| Mean (SD) | 0.0960 (0.296) |
| Median [Min, Max] | 0 [0, 1.00] |
| nox | |
| Mean (SD) | 0.562 (0.114) |
| Median [Min, Max] | 0.538 [0.398, 0.871] |
| rooms | |
| Mean (SD) | 6.28 (0.642) |
| Median [Min, Max] | 6.19 [4.52, 8.78] |
| age | |
| Mean (SD) | 70.0 (29.5) |
| Median [Min, Max] | 82.6 [2.90, 100] |
| distance | |
| Mean (SD) | 3.59 (1.93) |
| Median [Min, Max] | 2.89 [1.33, 12.1] |
| radial | |
| Mean (SD) | 9.96 (8.87) |
| Median [Min, Max] | 5.00 [1.00, 24.0] |
| ptratio | |
| Mean (SD) | 18.4 (2.23) |
| Median [Min, Max] | 19.1 [13.0, 21.2] |
| lstat | |
| Mean (SD) | 12.8 (6.98) |
| Median [Min, Max] | 12.3 [1.92, 37.0] |
| price | |
| Mean (SD) | 22.4 (8.49) |
| Median [Min, Max] | 21.2 [7.00, 50.0] |
# Histogram of the outcome variable (price) across the full dataset
ggplot(data = house, aes(x = price)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 30) +
  labs(title = "Distribution of Housing Prices", x = "Median Housing Price (x $1000)", y = "Frequency")
xvars = train[, c("crime", "zone", "industry", "river", "nox", "rooms", "age", "distance", "radial", "ptratio", "lstat")]
yvar = train[,c("price")]
bma = bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
summary(bma)
##
## Call:
## bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
##
##
## 10 models were selected
## Best 5 models (cumulative posterior probability = 0.8748 ):
##
## p!=0 EV SD model 1 model 2 model 3
## Intercept 100.0 3.340e+01 6.111503 35.33403 29.85940 36.71631
## crime 94.2 -1.121e-01 0.046309 -0.13580 -0.09735 -0.12806
## zone 79.0 3.430e-02 0.022296 0.04106 0.04716 .
## industry 6.2 -4.563e-03 0.024252 . . .
## river 100.0 3.967e+00 1.037131 3.91136 4.01809 3.90246
## nox 100.0 -1.660e+01 4.472467 -18.25315 -14.14962 -19.19456
## rooms 100.0 4.343e+00 0.463257 4.24145 4.41324 4.40063
## age 4.6 -5.851e-04 0.004301 . . .
## distance 100.0 -1.362e+00 0.246605 -1.41523 -1.43516 -1.12179
## radial 57.7 7.124e-02 0.070948 0.11754 . 0.13843
## ptratio 100.0 -9.408e-01 0.177222 -0.97625 -0.81426 -1.12488
## lstat 100.0 -5.640e-01 0.054430 -0.56674 -0.55997 -0.56341
##
## nVar 9 8 8
## r2 0.755 0.750 0.750
## BIC -481.86182 -481.39852 -480.16361
## post prob 0.359 0.285 0.154
## model 4 model 5
## Intercept 35.43341 32.65267
## crime -0.13716 .
## zone 0.04154 0.03891
## industry -0.08030 .
## river 3.90915 4.20615
## nox -16.42561 -15.55386
## rooms 4.14118 4.32036
## age . .
## distance -1.49468 -1.33535
## radial 0.12411 .
## ptratio -0.94513 -0.90516
## lstat -0.56139 -0.59340
##
## nVar 10 7
## r2 0.756 0.744
## BIC -477.45955 -477.28964
## post prob 0.040 0.037
imageplot.bma(bma)
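The `post prob` row in the BMA summary above can be related to the `BIC` row. The sketch below assumes that `bicreg` weights each model in proportion to exp(-BIC/2) under equal prior model probabilities, which is my understanding of the BIC approximation it uses. Only the three best models are included, so the weights are rescaled with the reported posterior probability of model 1 rather than normalised over all 10 selected models. The variable names `bic`, `rel_weight` and `approx_post_prob` are illustrative only.

```r
# BIC values of the three best models, copied from summary(bma) above.
bic <- c(model1 = -481.86182, model2 = -481.39852, model3 = -480.16361)

# Weight of each model relative to the best one: exp(-0.5 * BIC difference).
rel_weight <- exp(-0.5 * (bic - min(bic)))

# Rescale using the reported posterior probability of model 1 (0.359),
# because a proper normalisation would need the BIC of all 10 selected models.
approx_post_prob <- 0.359 * rel_weight / rel_weight[["model1"]]
round(approx_post_prob, 3)
```

Under that assumption, the values for models 2 and 3 should land close to the reported 0.285 and 0.154, which suggests that the small BIC gaps between the top models are what keep their posterior probabilities so similar.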
m1 = lm(price ~ zone + river + rooms + radial, data = train)
summary(m1)
##
## Call:
## lm(formula = price ~ zone + river + rooms + radial, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.158 -3.292 -0.442 2.696 32.023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.24713 2.92137 -9.669 < 2e-16 ***
## zone 0.04334 0.01390 3.118 0.00196 **
## river 5.61526 1.29895 4.323 1.97e-05 ***
## rooms 8.26980 0.45768 18.069 < 2e-16 ***
## radial -0.21587 0.03809 -5.667 2.90e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.026 on 376 degrees of freedom
## Multiple R-squared: 0.5958, Adjusted R-squared: 0.5915
## F-statistic: 138.6 on 4 and 376 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(m1)
fit.m1 = train(price ~ zone + river + rooms + radial, data = train, method = "lm", metric = "Rsquared")
summary(fit.m1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.158 -3.292 -0.442 2.696 32.023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.24713 2.92137 -9.669 < 2e-16 ***
## zone 0.04334 0.01390 3.118 0.00196 **
## river 5.61526 1.29895 4.323 1.97e-05 ***
## rooms 8.26980 0.45768 18.069 < 2e-16 ***
## radial -0.21587 0.03809 -5.667 2.90e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.026 on 376 degrees of freedom
## Multiple R-squared: 0.5958, Adjusted R-squared: 0.5915
## F-statistic: 138.6 on 4 and 376 DF, p-value: < 2.2e-16
pred.m1 = predict(fit.m1, test)
test.data = data.frame(obs = test$price, pred = pred.m1)
defaultSummary(test.data)
## RMSE Rsquared MAE
## 6.3390586 0.4537534 4.0215065
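To make the three hold-out metrics above easier to interpret, here is a minimal base R sketch of how they can be computed from observed and predicted values. It assumes, to my understanding of caret, that `defaultSummary` reports `Rsquared` as the squared correlation between observations and predictions, so the hold-out value of 0.454 is not calculated in the same way as the Multiple R-squared printed by `summary()`. The helper `holdout_metrics` and the toy vectors are hypothetical and are not taken from the housing data.

```r
# Hypothetical helper (not part of caret): recomputes the three metrics
# from a vector of observed values and a vector of predictions.
holdout_metrics <- function(obs, pred) {
  c(
    RMSE     = sqrt(mean((obs - pred)^2)),  # root mean squared error
    Rsquared = cor(obs, pred)^2,            # squared Pearson correlation
    MAE      = mean(abs(obs - pred))        # mean absolute error
  )
}

# Toy values for illustration only; they do not come from the housing data.
obs  <- c(24.0, 21.6, 34.7, 33.4)
pred <- c(25.0, 20.6, 32.7, 35.4)
holdout_metrics(obs, pred)
```

The same function could be applied to `test.data$obs` and `test.data$pred` as a cross-check on the caret output.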
xvars = house[, c("crime", "zone", "industry", "river", "nox", "rooms", "age", "distance", "radial", "ptratio", "lstat")]
yvar = house[,c("price")]
bma.2 = bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
summary(bma.2)
##
## Call:
## bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
##
##
## 10 models were selected
## Best 5 models (cumulative posterior probability = 0.8713 ):
##
## p!=0 EV SD model 1 model 2 model 3
## Intercept 100.0 38.066751 5.29477 39.98405 34.99981 40.80894
## crime 76.4 -0.077851 0.05413 -0.11854 -0.08042 -0.11061
## zone 67.1 0.025403 0.02106 0.03658 0.04109 .
## industry 6.8 -0.004933 0.02359 . . .
## river 100.0 3.175507 0.87748 3.13944 3.17045 3.09444
## nox 100.0 -19.756773 3.80180 -21.37566 -17.71878 -21.95511
## rooms 100.0 3.961303 0.42193 3.85056 4.00998 3.99022
## age 0.0 0.000000 0.00000 . . .
## distance 100.0 -1.356310 0.22229 -1.45079 -1.45011 -1.19964
## radial 49.7 0.054116 0.06207 0.10457 . 0.11877
## ptratio 100.0 -0.974062 0.15049 -1.00175 -0.85618 -1.11544
## lstat 100.0 -0.555997 0.04916 -0.55346 -0.54777 -0.55201
##
## nVar 9 8 8
## r2 0.727 0.724 0.723
## BIC -601.35620 -600.89285 -600.23419
## post prob 0.275 0.218 0.157
## model 4 model 5
## Intercept 37.12121 36.92263
## crime . .
## zone 0.03512 .
## industry . .
## river 3.30769 3.24430
## nox -18.84096 -18.74043
## rooms 3.93816 4.11181
## age . .
## distance -1.38632 -1.14459
## radial . .
## ptratio -0.91872 -1.00275
## lstat -0.57683 -0.56984
##
## nVar 7 6
## r2 0.720 0.716
## BIC -599.84863 -599.17436
## post prob 0.129 0.092
imageplot.bma(bma.2)
m2 = lm(price ~ zone + river + rooms + radial, data = house)
summary(m2)
##
## Call:
## lm(formula = price ~ zone + river + rooms + radial, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.669 -3.316 -0.471 2.508 42.016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.73498 2.62226 -9.814 < 2e-16 ***
## zone 0.04282 0.01272 3.365 0.000823 ***
## river 4.45939 1.07483 4.149 3.92e-05 ***
## rooms 7.90701 0.41170 19.206 < 2e-16 ***
## radial -0.23247 0.03303 -7.039 6.42e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.092 on 501 degrees of freedom
## Multiple R-squared: 0.5648, Adjusted R-squared: 0.5613
## F-statistic: 162.5 on 4 and 501 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(m2)
fit.m2 = ols(price ~ zone + river + rooms + radial, data = house, x = TRUE, y = TRUE)
fit.m2
## Linear Regression Model
##
## ols(formula = price ~ zone + river + rooms + radial, data = house,
## x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 506 LR chi2 420.91 R2 0.565
## sigma6.0918 d.f. 4 R2 adj 0.561
## d.f. 501 Pr(> chi2) 0.0000 g 7.490
##
## Residuals
##
## Min 1Q Median 3Q Max
## -20.6687 -3.3161 -0.4707 2.5080 42.0164
##
##
## Coef S.E. t Pr(>|t|)
## Intercept -25.7350 2.6223 -9.81 <0.0001
## zone 0.0428 0.0127 3.37 0.0008
## river 4.4594 1.0748 4.15 <0.0001
## rooms 7.9070 0.4117 19.21 <0.0001
## radial -0.2325 0.0330 -7.04 <0.0001
set.seed(1234)
m2.val = validate(fit.m2, B = 500)
m2.val
## index.orig training test optimism index.corrected n
## R-square 0.5648 0.5694 0.5574 0.0121 0.5527 500
## MSE 36.7434 36.4538 37.3673 -0.9135 37.6569 500
## g 7.4898 7.5109 7.4604 0.0504 7.4394 500
## Intercept 0.0000 0.0000 0.0951 -0.0951 0.0951 500
## Slope 1.0000 1.0000 0.9946 0.0054 0.9946 500
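The columns of the `validate()` output above are linked by simple arithmetic, which the sketch below reproduces for the R-square and MSE rows. It assumes the usual bootstrap optimism correction: `training` is the average performance of each bootstrap model on its own bootstrap sample, `test` is its average performance on the original data, `optimism` is their difference, and `index.corrected` is `index.orig` minus the optimism. The data frame `val` is only a container for the printed figures.

```r
# Values copied from the R-square and MSE rows of the validate() output above.
val <- data.frame(
  index.orig = c(0.5648, 36.7434),
  training   = c(0.5694, 36.4538),
  test       = c(0.5574, 37.3673),
  row.names  = c("R-square", "MSE")
)

# Optimism: average gap between performance on the bootstrap sample (training)
# and performance on the original data (test).
val$optimism <- val$training - val$test

# Optimism-corrected index: apparent performance minus the optimism.
val$index.corrected <- val$index.orig - val$optimism

round(val, 4)
```

The recomputed values should agree with the printed `optimism` and `index.corrected` columns up to rounding of the displayed figures. Read this way, the corrected R-square of about 0.553 is the estimate to report for performance on new data, and it is only slightly below the apparent 0.565.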