R Intermediate - Assignment

STUDENT NAME: TUNG LINH HA

STUDENT ID: 14285872

COURSE: TRM - 32931

1. Reflection

The Faculty of Engineering and IT at UTS offers the “Practical Data Analysis using R - Intermediate Level” module, one of the courses students can enrol in through the hub for the subject. Through step-by-step instructions and practical exercises, the module aims to give students an advanced understanding of how the R programming language is applied in data analytics.

This module has seven main objectives that students are expected to achieve by the end of the teaching period:

How to choose the right regression model for various outcomes;
How to examine continuous outcomes using linear regression;
How to examine binary outcomes using logistic regression;
How to account for confounding effects in your analysis;
How to comprehend and interpret the analysis results;
How to create and validate a prediction model; and
How to use R software for inferential data analysis.

The class ran for four weeks and was divided into six lessons, each requiring a reasonable amount of weekly attention. This gave me enough time to fully comprehend the material while still leaving enough time for my other classes and responsibilities.

One of the most exciting concepts I learnt in this class is how a predictive model is created and validated in R. I have a lot of experience with predictive models in Python, including Logistic Regression, Linear Regression and Decision Trees. When I first met these models in R during this course, I found them particularly challenging because the syntax is so different from what I learnt in Python. However, the concepts behind the models were very well explained, and I was soon able to grasp them. Learning how these models are applied in a different programming language was the most valuable thing I learnt in this module.

In my current research, I want to investigate how I can validate my results and which strategies I can use to better predict future outcomes. The research uses electroencephalography (EEG) signals, a type of data that is extremely complicated and hard to analyse. Python will be my primary language for the data analytics in this research. However, I want to equip myself with knowledge of several programming languages so that I am as prepared as possible for the process.

As a university student majoring in AI and Data Analytics, I have to familiarise myself with a variety of data analysis methods. I have a lot of experience using Python and Java to analyse datasets and develop AI models, but when it came to the Data Analytics component of my degree, I discovered that I only really knew the fundamentals. I had learnt a little R in passing, and I think continuing to learn it would be very good for me. My goal is to become as proficient in R as I am in my other main programming languages so that I can work on a project in R in the future.

2. Analysis Report

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(rms)

## Loading required package: Hmisc

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:table1':
## 
##     label, label<-, units

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(BMA)

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: leaps

## Loading required package: robustbase

## 
## Attaching package: 'robustbase'

## The following object is masked from 'package:survival':
## 
##     heart

## Loading required package: inline

## Loading required package: rrcov

## Scalable Robust Estimators with High Breakdown Point (version 1.7-5)

2.0: Import the dataset

house = read.csv("C:\\Users\\DELL\\OneDrive\\ANDREAS\\ACADEMICS_UNIVERSITY\\Year_4_2024\\Autumn\\32931\\Classes\\R_Int\\Ass.task\\Housing prices data.csv")
head(house)

##   ID   crime zone industry river   nox rooms  age distance radial ptratio lstat
## 1  1 0.00632   18     2.31     0 0.538 6.575 65.2   4.0900      1    15.3  4.98
## 2  2 0.02731    0     7.07     0 0.469 6.421 78.9   4.9671      2    17.8  9.14
## 3  3 0.02729    0     7.07     0 0.469 7.185 61.1   4.9671      2    17.8  4.03
## 4  4 0.03237    0     2.18     0 0.458 6.998 45.8   6.0622      3    18.7  2.94
## 5  5 0.06905    0     2.18     0 0.458 7.147 54.2   6.0622      3    18.7  5.33
## 6  6 0.02985    0     2.18     0 0.458 6.430 58.7   6.0622      3    18.7  5.21
##   price
## 1  24.0
## 2  21.6
## 3  34.7
## 4  33.4
## 5  36.2
## 6  28.7

2.1: Study Design

A cross-sectional study of housing conditions in 506 suburbs in the Boston area was conducted to develop and validate a model for predicting housing prices.

2.2: Describe the characteristics of the study sample

2.2.1: Split data

set.seed(1234)

index = createDataPartition(house$price, p = 0.75, list = FALSE)
train = house[index, ]
dim(train)

## [1] 381  13

test = house[-index, ]
dim(test)

## [1] 125  13
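createDataPartition() samples within percentile groups of a numeric outcome, so the training and testing sets end up with similar price distributions. The idea can be sketched in base R as below. This is an approximation for illustration only, not caret's exact implementation, and the simulated `price` vector is a hypothetical stand-in for `house$price`.

```r
set.seed(1234)

# Hypothetical outcome vector standing in for house$price (506 suburbs)
price <- round(rlnorm(506, meanlog = 3, sdlog = 0.4), 1)

# Bin the outcome into quintiles so each bin covers a similar price range
breaks <- unique(quantile(price, probs = seq(0, 1, length.out = 6)))
bins <- cut(price, breaks = breaks, include.lowest = TRUE)

# Sample about 75% of the row numbers within each bin
picked <- lapply(split(seq_along(price), bins), function(rows) {
  rows[sample.int(length(rows), size = ceiling(0.75 * length(rows)))]
})
train_idx <- sort(unlist(picked, use.names = FALSE))
test_idx <- setdiff(seq_along(price), train_idx)

# The two subsets should have similar outcome distributions
summary(price[train_idx])
summary(price[test_idx])
```

Because the sampling happens inside each bin, neither subset can end up with only cheap or only expensive suburbs, which is the main reason to prefer this over a plain random split on a skewed outcome.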

2.2.2: Describe the characteristics of the training dataset

table1(~ crime + zone + industry + river + nox + rooms + age + distance + radial + ptratio + lstat + price , data = train)

Overall (N=381)
Variable    Mean (SD)        Median [Min, Max]
crime       3.85 (9.36)      0.250 [0.00906, 89.0]
zone        12.6 (24.5)      0 [0, 100]
industry    10.9 (6.88)      8.56 [0.460, 27.7]
river       0.0604 (0.238)   0 [0, 1.00]
nox         0.552 (0.117)    0.538 [0.385, 0.871]
rooms       6.29 (0.722)     6.22 [3.56, 8.73]
age         68.1 (27.7)      76.5 [6.00, 100]
distance    3.86 (2.16)      3.36 [1.13, 10.7]
radial      9.41 (8.66)      5.00 [1.00, 24.0]
ptratio     18.5 (2.15)      19.0 [12.6, 22.0]
lstat       12.6 (7.20)      10.9 [1.73, 38.0]
price       22.6 (9.43)      21.2 [5.00, 50.0]

2.2.3: Describe the characteristics of the testing dataset

table1(~ crime + zone + industry + river + nox + rooms + age + distance + radial + ptratio + lstat + price , data = test)

Overall (N=125)
Variable    Mean (SD)        Median [Min, Max]
crime       2.88 (5.69)      0.318 [0.00632, 45.7]
zone        7.71 (18.8)      0 [0, 90.0]
industry    11.8 (6.80)      10.0 [1.38, 27.7]
river       0.0960 (0.296)   0 [0, 1.00]
nox         0.562 (0.114)    0.538 [0.398, 0.871]
rooms       6.28 (0.642)     6.19 [4.52, 8.78]
age         70.0 (29.5)      82.6 [2.90, 100]
distance    3.59 (1.93)      2.89 [1.33, 12.1]
radial      9.96 (8.87)      5.00 [1.00, 24.0]
ptratio     18.4 (2.23)      19.1 [13.0, 21.2]
lstat       12.8 (6.98)      12.3 [1.92, 37.0]
price       22.4 (8.49)      21.2 [7.00, 50.0]

2.3: Develop a model for predicting housing prices

2.3.1: Examine the distribution of housing prices and the correlation among its candidate predictors. Interpret the graph.

ggplot(data = house, aes(x = price)) + geom_histogram(fill = "lightblue", color = "black", bins = 30) + labs(title = "Distribution of Housing Prices", x = "Median Housing Price (x $1000)", y = "Frequency")
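The heading for this step also asks about the correlation among the candidate predictors. A minimal sketch of how that could be examined with base R's cor() is given below. The data frame `toy` and its simulated values are hypothetical stand-ins; with the real data, the predictor columns of `house` would be passed instead. The 0.7 cutoff is only an illustrative threshold.

```r
set.seed(1)
n <- 200

# Hypothetical predictors; lstat is built to be negatively related to rooms
rooms <- rnorm(n, mean = 6.3, sd = 0.7)
lstat <- 40 - 4.5 * rooms + rnorm(n, sd = 2)
nox <- runif(n, min = 0.38, max = 0.87)
toy <- data.frame(rooms, lstat, nox)

# Pairwise Pearson correlations, rounded for display
corr <- round(cor(toy), 2)
print(corr)

# List each pair once (upper triangle) whose absolute correlation exceeds 0.7
high <- which(abs(corr) > 0.7 & upper.tri(corr), arr.ind = TRUE)
data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r = corr[high]
)
```

Strongly correlated predictors carry overlapping information, which is one reason a model selection step such as BMA may keep one of a pair and drop the other.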

2.3.2: Develop a model for predicting housing prices using a Bayesian Model Averaging (BMA) approach.

xvars = train[, c("crime", "zone", "industry", "river", "nox", "rooms", "age", "distance", "radial", "ptratio", "lstat")]
yvar = train[,c("price")]
bma = bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
summary(bma)

## 
## Call:
## bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
## 
## 
##   10  models were selected
##  Best  5  models (cumulative posterior probability =  0.8748 ): 
## 
##            p!=0    EV         SD        model 1     model 2     model 3   
## Intercept  100.0   3.340e+01  6.111503    35.33403    29.85940    36.71631
## crime       94.2  -1.121e-01  0.046309    -0.13580    -0.09735    -0.12806
## zone        79.0   3.430e-02  0.022296     0.04106     0.04716       .    
## industry     6.2  -4.563e-03  0.024252       .           .           .    
## river      100.0   3.967e+00  1.037131     3.91136     4.01809     3.90246
## nox        100.0  -1.660e+01  4.472467   -18.25315   -14.14962   -19.19456
## rooms      100.0   4.343e+00  0.463257     4.24145     4.41324     4.40063
## age          4.6  -5.851e-04  0.004301       .           .           .    
## distance   100.0  -1.362e+00  0.246605    -1.41523    -1.43516    -1.12179
## radial      57.7   7.124e-02  0.070948     0.11754       .         0.13843
## ptratio    100.0  -9.408e-01  0.177222    -0.97625    -0.81426    -1.12488
## lstat      100.0  -5.640e-01  0.054430    -0.56674    -0.55997    -0.56341
##                                                                           
## nVar                                         9           8           8    
## r2                                         0.755       0.750       0.750  
## BIC                                     -481.86182  -481.39852  -480.16361
## post prob                                  0.359       0.285       0.154  
##            model 4     model 5   
## Intercept    35.43341    32.65267
## crime        -0.13716       .    
## zone          0.04154     0.03891
## industry     -0.08030       .    
## river         3.90915     4.20615
## nox         -16.42561   -15.55386
## rooms         4.14118     4.32036
## age             .           .    
## distance     -1.49468    -1.33535
## radial        0.12411       .    
## ptratio      -0.94513    -0.90516
## lstat        -0.56139    -0.59340
##                                  
## nVar            10          7    
## r2            0.756       0.744  
## BIC        -477.45955  -477.28964
## post prob     0.040       0.037

imageplot.bma(bma)
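The posterior probabilities in the summary can be related to the BIC values: under the BIC approximation used by bicreg(), a model's posterior weight is proportional to exp(-BIC/2). The sketch below checks this against the five models printed above. The BIC values and the cumulative probability of 0.8748 are copied from the output; the variable names are only illustrative.

```r
# BIC values of the five best models, copied from summary(bma)
bic <- c(-481.86182, -481.39852, -480.16361, -477.45955, -477.28964)

# Unnormalised weights relative to the best model (subtracting the minimum
# BIC keeps the exponentials well scaled)
w <- exp(-0.5 * (bic - min(bic)))

# The five printed models share 0.8748 of the total posterior probability,
# so rescale the weights to sum to that value
post_prob <- w / sum(w) * 0.8748

round(data.frame(model = 1:5, relative_weight = w, post_prob = post_prob), 3)
```

The rescaled values agree with the reported posterior probabilities (0.359, 0.285, 0.154, 0.040, 0.037) to within rounding. A BIC difference of less than 0.5 between models 1 and 2 therefore means the data do not strongly favour one over the other.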

2.3.3: Check the model’s assumptions

m1 = lm(price ~ zone + river + rooms + radial, data = train)
summary(m1)

## 
## Call:
## lm(formula = price ~ zone + river + rooms + radial, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.158  -3.292  -0.442   2.696  32.023 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -28.24713    2.92137  -9.669  < 2e-16 ***
## zone          0.04334    0.01390   3.118  0.00196 ** 
## river         5.61526    1.29895   4.323 1.97e-05 ***
## rooms         8.26980    0.45768  18.069  < 2e-16 ***
## radial       -0.21587    0.03809  -5.667 2.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.026 on 376 degrees of freedom
## Multiple R-squared:  0.5958, Adjusted R-squared:  0.5915 
## F-statistic: 138.6 on 4 and 376 DF,  p-value: < 2.2e-16

par(mfrow = c(2,2))
plot(m1)
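The four diagnostic plots can be complemented with numerical checks. The sketch below shows one possible set on simulated data: a Shapiro-Wilk test for normality of the residuals, a rank correlation between absolute residuals and fitted values as a rough check of constant variance, and Cook's distance for influential observations. The data frame `toy` is hypothetical, and the 4/n cutoff is a common rule of thumb rather than a choice made in this report.

```r
set.seed(42)
n <- 300

# Hypothetical data loosely shaped like the housing variables
rooms <- rnorm(n, mean = 6.3, sd = 0.7)
radial <- sample(1:24, n, replace = TRUE)
price <- -28 + 8 * rooms - 0.2 * radial + rnorm(n, sd = 6)
toy <- data.frame(price, rooms, radial)

fit <- lm(price ~ rooms + radial, data = toy)
res <- residuals(fit)

# Normality of residuals: a small p-value suggests departure from normality
sw <- shapiro.test(res)

# Constant variance: a clear association between |residuals| and fitted
# values would suggest heteroscedasticity
het <- cor.test(abs(res), fitted(fit), method = "spearman", exact = FALSE)

# Influential observations by Cook's distance
cd <- cooks.distance(fit)
influential <- which(cd > 4 / n)

cat("Shapiro-Wilk p-value:", format.pval(sw$p.value), "\n")
cat("Spearman rho (|res| vs fitted):", round(unname(het$estimate), 3), "\n")
cat("Observations above 4/n:", length(influential), "\n")
```

Applied to the real model, the same calls would take `m1` in place of `fit`.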

2.3.4: Present its mathematical equation

price = -28.25 + 0.04 × zone + 5.62 × river + 8.27 × rooms - 0.22 × radial
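As a worked example of the equation, the sketch below computes the predicted price for one suburb by hand, using the unrounded coefficients from summary(m1). The input values (no large-lot zoning, not bordering the river, 6 rooms, radial index 5) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from summary(m1)
coefs <- c(intercept = -28.24713, zone = 0.04334, river = 5.61526,
           rooms = 8.26980, radial = -0.21587)

# Hypothetical suburb; the leading 1 multiplies the intercept
suburb <- c(intercept = 1, zone = 0, river = 0, rooms = 6, radial = 5)

# Linear predictor: sum of coefficient times value
pred <- sum(coefs * suburb)
pred  # about 20.29
```

Since the histogram labels price in units of $1000, this corresponds to a predicted median price of roughly $20,300. With the fitted object available, predict(m1, newdata = ...) would return the same value without retyping the coefficients.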

2.4: Validate the model’s performance internally

2.4.1: Identify the optimal internal validation method used to validate the model’s performance. Explain the reason(s)

fit.m1 = train(price ~ zone + river + rooms + radial, data = train, method = "lm", metric = "Rsquared")
summary(fit.m1)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.158  -3.292  -0.442   2.696  32.023 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -28.24713    2.92137  -9.669  < 2e-16 ***
## zone          0.04334    0.01390   3.118  0.00196 ** 
## river         5.61526    1.29895   4.323 1.97e-05 ***
## rooms         8.26980    0.45768  18.069  < 2e-16 ***
## radial       -0.21587    0.03809  -5.667 2.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.026 on 376 degrees of freedom
## Multiple R-squared:  0.5958, Adjusted R-squared:  0.5915 
## F-statistic: 138.6 on 4 and 376 DF,  p-value: < 2.2e-16

pred.m1 = predict(fit.m1, test)
test.data = data.frame(obs = test$price, pred = pred.m1)
defaultSummary(test.data)

##      RMSE  Rsquared       MAE 
## 6.3390586 0.4537534 4.0215065
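The three metrics reported by defaultSummary() can be reproduced by hand: RMSE is the square root of the mean squared error, Rsquared is the squared Pearson correlation between observed and predicted values, and MAE is the mean absolute error. The sketch below uses small hypothetical vectors in place of `test$price` and `pred.m1`.

```r
# Hypothetical observed and predicted prices
obs <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 39)

err <- obs - pred

rmse <- sqrt(mean(err^2))   # penalises large errors more heavily
rsq <- cor(obs, pred)^2     # squared correlation, not 1 - SSE/SST
mae <- mean(abs(err))       # average size of the error

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

Both RMSE and MAE are in the units of the outcome, so the reported test RMSE of about 6.34 corresponds to a typical prediction error of roughly $6,300.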

xvars = house[, c("crime", "zone", "industry", "river", "nox", "rooms", "age", "distance", "radial", "ptratio", "lstat")]
yvar = house[,c("price")]
bma.2 = bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
summary(bma.2)

## 
## Call:
## bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
## 
## 
##   10  models were selected
##  Best  5  models (cumulative posterior probability =  0.8713 ): 
## 
##            p!=0    EV         SD       model 1     model 2     model 3   
## Intercept  100.0   38.066751  5.29477    39.98405    34.99981    40.80894
## crime       76.4   -0.077851  0.05413    -0.11854    -0.08042    -0.11061
## zone        67.1    0.025403  0.02106     0.03658     0.04109       .    
## industry     6.8   -0.004933  0.02359       .           .           .    
## river      100.0    3.175507  0.87748     3.13944     3.17045     3.09444
## nox        100.0  -19.756773  3.80180   -21.37566   -17.71878   -21.95511
## rooms      100.0    3.961303  0.42193     3.85056     4.00998     3.99022
## age          0.0    0.000000  0.00000       .           .           .    
## distance   100.0   -1.356310  0.22229    -1.45079    -1.45011    -1.19964
## radial      49.7    0.054116  0.06207     0.10457       .         0.11877
## ptratio    100.0   -0.974062  0.15049    -1.00175    -0.85618    -1.11544
## lstat      100.0   -0.555997  0.04916    -0.55346    -0.54777    -0.55201
##                                                                          
## nVar                                        9           8           8    
## r2                                        0.727       0.724       0.723  
## BIC                                    -601.35620  -600.89285  -600.23419
## post prob                                 0.275       0.218       0.157  
##            model 4     model 5   
## Intercept    37.12121    36.92263
## crime           .           .    
## zone          0.03512       .    
## industry        .           .    
## river         3.30769     3.24430
## nox         -18.84096   -18.74043
## rooms         3.93816     4.11181
## age             .           .    
## distance     -1.38632    -1.14459
## radial          .           .    
## ptratio      -0.91872    -1.00275
## lstat        -0.57683    -0.56984
##                                  
## nVar            7           6    
## r2            0.720       0.716  
## BIC        -599.84863  -599.17436
## post prob     0.129       0.092

imageplot.bma(bma.2)

2.4.2: Conduct an internal validation to assess the model’s performance

m2 = lm(price ~ zone + river + rooms + radial, data = house)
summary(m2)

## 
## Call:
## lm(formula = price ~ zone + river + rooms + radial, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.669  -3.316  -0.471   2.508  42.016 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -25.73498    2.62226  -9.814  < 2e-16 ***
## zone          0.04282    0.01272   3.365 0.000823 ***
## river         4.45939    1.07483   4.149 3.92e-05 ***
## rooms         7.90701    0.41170  19.206  < 2e-16 ***
## radial       -0.23247    0.03303  -7.039 6.42e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.092 on 501 degrees of freedom
## Multiple R-squared:  0.5648, Adjusted R-squared:  0.5613 
## F-statistic: 162.5 on 4 and 501 DF,  p-value: < 2.2e-16

par(mfrow = c(2,2))
plot(m2)

fit.m2 = ols(price ~ zone + river + rooms + radial, data = house, x = TRUE, y = TRUE)
fit.m2

## Linear Regression Model
## 
## ols(formula = price ~ zone + river + rooms + radial, data = house, 
##     x = TRUE, y = TRUE)
## 
##                 Model Likelihood    Discrimination    
##                       Ratio Test           Indexes    
## Obs     506    LR chi2    420.91    R2       0.565    
## sigma6.0918    d.f.            4    R2 adj   0.561    
## d.f.    501    Pr(> chi2) 0.0000    g        7.490    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -20.6687  -3.3161  -0.4707   2.5080  42.0164 
## 
## 
##           Coef     S.E.   t     Pr(>|t|)
## Intercept -25.7350 2.6223 -9.81 <0.0001 
## zone        0.0428 0.0127  3.37 0.0008  
## river       4.4594 1.0748  4.15 <0.0001 
## rooms       7.9070 0.4117 19.21 <0.0001 
## radial     -0.2325 0.0330 -7.04 <0.0001

set.seed(1234)

m2.val = validate(fit.m2, B = 500)
m2.val

##           index.orig training    test optimism index.corrected   n
## R-square      0.5648   0.5694  0.5574   0.0121          0.5527 500
## MSE          36.7434  36.4538 37.3673  -0.9135         37.6569 500
## g             7.4898   7.5109  7.4604   0.0504          7.4394 500
## Intercept     0.0000   0.0000  0.0951  -0.0951          0.0951 500
## Slope         1.0000   1.0000  0.9946   0.0054          0.9946 500
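The columns of the validate() output are linked by simple arithmetic: optimism is the average bootstrap training performance minus the average test performance, and the corrected index is the original index minus that optimism. The sketch below recomputes both from the values printed above. Small differences in the last digit are expected because the printed inputs are already rounded.

```r
# Values copied from the m2.val output (first three rows)
val <- data.frame(
  index    = c("R-square", "MSE", "g"),
  orig     = c(0.5648, 36.7434, 7.4898),
  training = c(0.5694, 36.4538, 7.5109),
  test     = c(0.5574, 37.3673, 7.4604)
)

# Optimism: how much better the model looks on the data it was fitted to
val$optimism <- val$training - val$test

# Optimism-corrected estimate of performance
val$corrected <- val$orig - val$optimism

print(val)
```

The corrected R-square of about 0.553 is only slightly below the apparent 0.565, and the calibration slope of 0.9946 is close to 1, which suggests little overfitting for this four-predictor model.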