Setup

Install and load the necessary packages to reproduce the report here:

# This is a chunk where you can load the necessary packages required to reproduce the report

library(readr) # Useful for importing data
library(knitr) # Useful for creating nice tables
library(openxlsx) # Used in this report to read the Excel workbook

# Here are some example packages, you may add others if you require

library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
library(rvest) # Useful for scraping HTML data
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding

Locate Data

Locate an open source of data from the web. This can be tabular or spreadsheet data (e.g., .txt, .csv, .xls, .xlsx files), data sets from other statistical software (e.g., SPSS, SAS, Stata), or HTML table data that you scrape.

Some sources for open data are provided below, but I encourage you to find others:

As a minimum, your data set should include:

There is no limit on the number of observations or variables. Keep in mind, however, that a very large data set will take longer to read in. A clear description of the data and its source should be provided in this section.

Data: Daily adjusted closing prices of the top 10 companies on the ASX (marketindex) as of 3-9-2021, covering the period 1-1-2020 to 3-9-2021.

Source: Yahoo Finance (https://finance.yahoo.com/)

Read/Import Data

Read/Import the data into R, then save it as a data frame. You can use Base R functions or the readr, xlsx, readxl, foreign or rvest packages for this purpose. You must also provide the R code, with outputs (e.g. head() of the data set), that you used to import/read/scrape the data set. You can provide the R code with outputs using R chunks like this:

# This is an R chunk for importing the data. Provide your R codes here:

DWA1 <- read.xlsx("data/DWA1X.xlsx", sheet = "A1Ret")

head(DWA1)
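As a hedged alternative to the openxlsx import above, the same read-then-inspect pattern can be sketched with Base R only. The file name `returns_sample.csv` is hypothetical, and the three rows are rounded values copied from the str() output in this report, so this is a stand-in rather than the real workbook.

```r
# Hypothetical stand-in file: rounded values copied from the report's str() output
csv_path <- file.path(tempdir(), "returns_sample.csv")
write.csv(
  data.frame(CBA.Ret = c(0.00537, -0.00675, 0.01777),
             CSL.Ret = c(0.00818, -0.0018, 0.02363)),
  csv_path,
  row.names = FALSE
)

# Guard against a wrong path before importing
stopifnot(file.exists(csv_path))

# Import as a data frame and show the first rows
imported <- read.csv(csv_path)
head(imported)
```

Checking the path first gives a clearer error than a failed import when the working directory is not the project folder.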

Explain everything that you do in this step using regular text outside the R chunks. You may use bulleted lists like this:

Here is an example of bulleted list:

Data description

Provide a clear description of the data and its source (i.e. URL of the web site). Provide variable descriptions.

Inspect dataset and variables

Inspect the data frame and variables using R functions. You should:

Provide your R code with outputs and explain everything that you do in this step.

# This is a chunk where you inspect the types of variables, data structures, check the attributes in the data.
str(DWA1)
## 'data.frame':    425 obs. of  10 variables:
##  $ CBA.Ret: num  0.00537 -0.00675 0.01777 -0.0037 0.00727 ...
##  $ CSL.Ret: num  0.00818 -0.0018 0.02363 0.00854 0.0182 ...
##  $ BHP.Ret: num  0.005122 0.006365 0.0038 0.000506 0.011558 ...
##  $ WBC.Ret: num  0.007003 -0.000821 0.011438 -0.003255 -0.002857 ...
##  $ NAB.Ret: num  0.006086 -0.006901 0.014156 -0.003218 0.000403 ...
##  $ ANZ.Ret: num  0.00731 -0.00405 0.01291 -0.00361 0.00121 ...
##  $ WES.Ret: num  0.00961 0.00857 0.02041 -0.00349 0.00858 ...
##  $ FMG.Ret: num  -0.00279 0.00186 -0.00746 -0.00469 0.01678 ...
##  $ MQG.Ret: num  0.00609 -0.00536 0.01357 -0.00727 0.00454 ...
##  $ WOW.Ret: num  0.001928 -0.001101 0.017202 0.008357 -0.000806 ...
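Beyond str(), a few more Base R checks cover the class, dimensions, names and variable types. The sketch below runs on a small stand-in data frame (`returns_sample`, a hypothetical name) built from the first five rounded values of two columns in the str() output above; on the real data the same calls would be applied to DWA1.

```r
# Stand-in for DWA1: rounded values copied from the str() output
returns_sample <- data.frame(
  CBA.Ret = c(0.00537, -0.00675, 0.01777, -0.0037, 0.00727),
  CSL.Ret = c(0.00818, -0.0018, 0.02363, 0.00854, 0.0182)
)

class(returns_sample)   # class of the object
dim(returns_sample)     # number of rows and columns
names(returns_sample)   # variable names

# Class of each variable
col_types <- sapply(returns_sample, class)
col_types

# Check for missing values
anyNA(returns_sample)
```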

Tidy data

Check if the data conforms to the tidy data principles. If your data is untidy, reshape your data into a tidy format. If the data is in a tidy format, you will be expected to explain why the data is originally ‘tidy’.

# This is a chunk where you check if the data conforms to the tidy data principles and reshape your data into a tidy format.

# The original price data are not in the format required for the analysis.
# The original price data were taken from 10 different companies.
# The prices were used to calculate the Return, the Standard Deviation and the Return per Standard Deviation.
# The sum of each return series is that security's return for the period from 1-1-2020 to 3-9-2021, because ln(X/Y) = ln X - ln Y, so the daily log returns telescope.
# Return per Standard Deviation measures the return per unit of risk. It allows securities to be compared on a risk-adjusted basis rather than on return alone, since a high-risk security usually has to offer a higher return as compensation.
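The log-return property described above can be checked numerically. This is a minimal sketch on hypothetical prices (the `price` vector is invented for illustration and is not taken from the data set). It also assumes Return per Standard Deviation is the mean daily return divided by its standard deviation; the report may define it with the total return instead.

```r
# Hypothetical closing prices, for illustration only
price <- c(100, 102, 101, 105, 104)

# Daily log return: ln(P_t / P_{t-1}) = ln(P_t) - ln(P_{t-1})
log_ret <- diff(log(price))

# Summing the daily log returns telescopes to ln(P_end / P_start)
total_ret <- sum(log_ret)
period_ret <- log(price[length(price)] / price[1])
all.equal(total_ret, period_ret)

# Assumed definition: mean daily return per unit of standard deviation
ret_per_sd <- mean(log_ret) / sd(log_ret)
ret_per_sd
```

The same telescoping does not hold for simple percentage returns, which is one reason log returns are convenient for period totals.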

Summary statistics

Provide summary statistics (mean, median, minimum, maximum, standard deviation) of numeric variables grouped by one of the qualitative (categorical) variables. For example, if your categorical variable is age group and your quantitative variable is income, provide summary statistics of income grouped by age group.

# This is a chunk where you provide summary statistics


Sumstat <- read.xlsx("data/DWA1X.xlsx", sheet = "A1Ret")

# Mean
sapply(Sumstat, mean, na.rm=TRUE)
##      CBA.Ret      CSL.Ret      BHP.Ret      WBC.Ret      NAB.Ret 
## 0.0007447446 0.0002823561 0.0005229419 0.0002645284 0.0004939509 
##      ANZ.Ret      WES.Ret      FMG.Ret      MQG.Ret      WOW.Ret 
## 0.0004298528 0.0009046196 0.0020127306 0.0006009233 0.0003920370
# Median

sapply(Sumstat, median, na.rm=TRUE)
##      CBA.Ret      CSL.Ret      BHP.Ret      WBC.Ret      NAB.Ret 
## 0.0002277326 0.0006123513 0.0000000000 0.0000000000 0.0008699501 
##      ANZ.Ret      WES.Ret      FMG.Ret      MQG.Ret      WOW.Ret 
## 0.0011553420 0.0010602796 0.0016096095 0.0011668180 0.0010860029
# Minimum

sapply(Sumstat, min, na.rm=TRUE)
##    CBA.Ret    CSL.Ret    BHP.Ret    WBC.Ret    NAB.Ret    ANZ.Ret 
## -0.1054275 -0.1092873 -0.1556533 -0.1256784 -0.1328331 -0.1335313 
##    WES.Ret    FMG.Ret    MQG.Ret    WOW.Ret 
## -0.1043073 -0.1123291 -0.1664131 -0.1671068
# Maximum
sapply(Sumstat, max, na.rm=TRUE)
##    CBA.Ret    CSL.Ret    BHP.Ret    WBC.Ret    NAB.Ret    ANZ.Ret 
## 0.12453260 0.11353842 0.11283246 0.08833145 0.09214871 0.11202514 
##    WES.Ret    FMG.Ret    MQG.Ret    WOW.Ret 
## 0.10694062 0.12464677 0.10383202 0.09275371
# Standard Deviation

sapply(Sumstat, sd, na.rm=TRUE)
##    CBA.Ret    CSL.Ret    BHP.Ret    WBC.Ret    NAB.Ret    ANZ.Ret 
## 0.02066647 0.02097091 0.02232177 0.02305426 0.02288352 0.02416212 
##    WES.Ret    FMG.Ret    MQG.Ret    WOW.Ret 
## 0.01767793 0.02775333 0.02382700 0.01772848
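The five statistics can also be produced in one grouped table. The sketch below first stacks the wide data into a long format, so that the company becomes a categorical variable, and then summarises the returns per company. It uses a stand-in data frame (`wide`, a hypothetical name) holding the first five rounded values of two columns from the report's str() output; the helper `five_stats` is also illustrative.

```r
# Stand-in for the return data: rounded values copied from the str() output
wide <- data.frame(
  CBA.Ret = c(0.00537, -0.00675, 0.01777, -0.0037, 0.00727),
  CSL.Ret = c(0.00818, -0.0018, 0.02363, 0.00854, 0.0182)
)

# Long format: one row per observation, company as a categorical variable
long <- stack(wide)
names(long) <- c("Return", "Company")

# Illustrative helper returning all five statistics at once
five_stats <- function(x) {
  c(Mean = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    Min = min(x, na.rm = TRUE),
    Max = max(x, na.rm = TRUE),
    SD = sd(x, na.rm = TRUE))
}

# Split the returns by company, summarise each group, one row per company
stats_by_company <- t(sapply(split(long$Return, long$Company), five_stats))
stats_by_company
```

Keeping the statistics in one matrix avoids retyping rounded values by hand in later steps.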

Create a list

Create a list that contains a numeric value for each response to the categorical variable. Typically, they are numbered from 1-n.

# This is a chunk where you create a list

company <- c("CBA", "CSL", "BHP", "WBC", "NAB", "ANZ", "WES", "FMG", "MQG", "WOW")

Mean <- c(0.000745, 0.000282, 0.000523, 0.000265, 0.000494, 0.000430, 0.000905, 0.00201, 0.000601, 0.000392)

Median <- c(0.000228, 0.000612, 0, 0, 0.000870, 0.00116, 0.00106, 0.00161, 0.00117, 0.00109)

Min <- c(-0.105, -0.109, -0.156, -0.126, -0.133, -0.134, -0.104, -0.112, -0.166, -0.167)

Max <- c(0.125, 0.114, 0.113, 0.088, 0.092, 0.112, 0.107, 0.125, 0.104, 0.093)

SD <- c(0.021, 0.021, 0.022, 0.023, 0.023, 0.024, 0.018, 0.028, 0.024, 0.018)

out_list1 <- list(company)
out_list2 <- list(Mean)
out_list3 <- list(Median)
out_list4 <- list(Min)
out_list5 <- list(Max)
out_list6 <- list(SD)


out_list1
## [[1]]
##  [1] "CBA" "CSL" "BHP" "WBC" "NAB" "ANZ" "WES" "FMG" "MQG" "WOW"
out_list2
## [[1]]
##  [1] 0.000745 0.000282 0.000523 0.000265 0.000494 0.000430 0.000905
##  [8] 0.002010 0.000601 0.000392
out_list3
## [[1]]
##  [1] 0.000228 0.000612 0.000000 0.000000 0.000870 0.001160 0.001060
##  [8] 0.001610 0.001170 0.001090
out_list4
## [[1]]
##  [1] -0.105 -0.109 -0.156 -0.126 -0.133 -0.134 -0.104 -0.112 -0.166 -0.167
out_list5
## [[1]]
##  [1] 0.125 0.114 0.113 0.088 0.092 0.112 0.107 0.125 0.104 0.093
out_list6
## [[1]]
##  [1] 0.021 0.021 0.022 0.023 0.023 0.024 0.018 0.028 0.024 0.018
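One possible reading of the task is a named list that assigns each company a numeric code from 1 to n. The sketch below shows that reading; the name `company_codes` is hypothetical, and the order of the codes simply follows the order of the company vector.

```r
company <- c("CBA", "CSL", "BHP", "WBC", "NAB", "ANZ", "WES", "FMG", "MQG", "WOW")

# Named list: one numeric code (1 to n) per company
company_codes <- as.list(setNames(seq_along(company), company))

# Look up a single code by company name
company_codes$BHP

# Compact view of the first few entries
str(company_codes[1:3])
```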

Join the list

Join this list to the data set using a join of your choice. Remember that the result has to keep the numeric variable and match it to your categorical variable.

# This is a chunk where you join the list

# Original data
my_list <- list(out_list1, out_list2, out_list3, out_list4, out_list5, out_list6)
named_list <- list(Company = out_list1, Mean = out_list2, Median = out_list3, Min = out_list4, 
                   Max = out_list5, SD = out_list6)


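# c() concatenates the two lists, so the output shows the unnamed copies first and the named copies second
c(my_list, named_list)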
## [[1]]
## [[1]][[1]]
##  [1] "CBA" "CSL" "BHP" "WBC" "NAB" "ANZ" "WES" "FMG" "MQG" "WOW"
## 
## 
## [[2]]
## [[2]][[1]]
##  [1] 0.000745 0.000282 0.000523 0.000265 0.000494 0.000430 0.000905
##  [8] 0.002010 0.000601 0.000392
## 
## 
## [[3]]
## [[3]][[1]]
##  [1] 0.000228 0.000612 0.000000 0.000000 0.000870 0.001160 0.001060
##  [8] 0.001610 0.001170 0.001090
## 
## 
## [[4]]
## [[4]][[1]]
##  [1] -0.105 -0.109 -0.156 -0.126 -0.133 -0.134 -0.104 -0.112 -0.166 -0.167
## 
## 
## [[5]]
## [[5]][[1]]
##  [1] 0.125 0.114 0.113 0.088 0.092 0.112 0.107 0.125 0.104 0.093
## 
## 
## [[6]]
## [[6]][[1]]
##  [1] 0.021 0.021 0.022 0.023 0.023 0.024 0.018 0.028 0.024 0.018
## 
## 
## $Company
## $Company[[1]]
##  [1] "CBA" "CSL" "BHP" "WBC" "NAB" "ANZ" "WES" "FMG" "MQG" "WOW"
## 
## 
## $Mean
## $Mean[[1]]
##  [1] 0.000745 0.000282 0.000523 0.000265 0.000494 0.000430 0.000905
##  [8] 0.002010 0.000601 0.000392
## 
## 
## $Median
## $Median[[1]]
##  [1] 0.000228 0.000612 0.000000 0.000000 0.000870 0.001160 0.001060
##  [8] 0.001610 0.001170 0.001090
## 
## 
## $Min
## $Min[[1]]
##  [1] -0.105 -0.109 -0.156 -0.126 -0.133 -0.134 -0.104 -0.112 -0.166 -0.167
## 
## 
## $Max
## $Max[[1]]
##  [1] 0.125 0.114 0.113 0.088 0.092 0.112 0.107 0.125 0.104 0.093
## 
## 
## $SD
## $SD[[1]]
##  [1] 0.021 0.021 0.022 0.023 0.023 0.024 0.018 0.028 0.024 0.018
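A keyed join, as opposed to concatenation with c(), can be sketched with Base R merge(). The example assumes the numeric codes run from 1 to n in the order of the company vector; the data frame names `codes`, `stats` and `joined` are hypothetical, and the Mean and SD values are the rounded figures from the previous step.

```r
company <- c("CBA", "CSL", "BHP", "WBC", "NAB", "ANZ", "WES", "FMG", "MQG", "WOW")

# Lookup table: one numeric code per company (assumed 1 to n)
codes <- data.frame(Company = company, Code = seq_along(company))

# Summary values copied from the Mean and SD vectors in the previous step
stats <- data.frame(
  Company = company,
  Mean = c(0.000745, 0.000282, 0.000523, 0.000265, 0.000494,
           0.000430, 0.000905, 0.00201, 0.000601, 0.000392),
  SD = c(0.021, 0.021, 0.022, 0.023, 0.023, 0.024, 0.018, 0.028, 0.024, 0.018)
)

# Left join on Company: keep every row of stats and attach its code
joined <- merge(stats, codes, by = "Company", all.x = TRUE)

# merge() sorts by the key, so restore the original company order
joined[order(joined$Code), ]
```

A left join keeps every summary row even if a company were missing from the lookup table, in which case its code would be NA.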

Subsetting I

Subset the data frame using the first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure.

# This is a chunk to subset your data and convert it to a matrix 

# Row 1 of the sheet holds the column names, so rows 1:11 give the first 10 observations
S1 <- read.xlsx("data/DWA1X.xlsx", sheet = "A1Ret", cols = 1:10, rows = 1:11)

str(S1)
## 'data.frame':    10 obs. of  10 variables:
##  $ CBA.Ret: num  0.00537 -0.00675 0.01777 -0.0037 0.00727 ...
##  $ CSL.Ret: num  0.00818 -0.0018 0.02363 0.00854 0.0182 ...
##  $ BHP.Ret: num  0.005122 0.006365 0.0038 0.000506 0.011558 ...
##  $ WBC.Ret: num  0.007003 -0.000821 0.011438 -0.003255 -0.002857 ...
##  $ NAB.Ret: num  0.006086 -0.006901 0.014156 -0.003218 0.000403 ...
##  $ ANZ.Ret: num  0.00731 -0.00405 0.01291 -0.00361 0.00121 ...
##  $ WES.Ret: num  0.00961 0.00857 0.02041 -0.00349 0.00858 ...
##  $ FMG.Ret: num  -0.00279 0.00186 -0.00746 -0.00469 0.01678 ...
##  $ MQG.Ret: num  0.00609 -0.00536 0.01357 -0.00727 0.00454 ...
##  $ WOW.Ret: num  0.001928 -0.001101 0.017202 0.008357 -0.000806 ...
# str() lists the variables vertically, but the data frame itself already has one column per company, as in the source data and the Excel sheet "A1Ret". Transposing twice with t(t(...)) coerces the data frame to a matrix and restores that original orientation, with the company names (name-Ret) across the first row. The matrix is numeric because every column of S1 is numeric.

# Convert the first 10 rows to a matrix (the first t() coerces to a matrix, the second restores the orientation)

t(t(S1[1:10,]))
##         CBA.Ret      CSL.Ret       BHP.Ret       WBC.Ret       NAB.Ret
## 1   0.005368665  0.008183389  0.0051215304  0.0070031118  0.0060864241
## 2  -0.006746547 -0.001804790  0.0063654646 -0.0008214224 -0.0069007535
## 3   0.017767699  0.023634762  0.0037997747  0.0114381180  0.0141559833
## 4  -0.003701408  0.008537224  0.0005056133 -0.0032546934 -0.0032180119
## 5   0.007266540  0.018199769  0.0115578676 -0.0028566892  0.0004027757
## 6   0.012318030  0.027779426 -0.0032528395  0.0073290296  0.0040193682
## 7  -0.000242517 -0.016167423 -0.0093165105 -0.0024369477  0.0000000000
## 8   0.008330764  0.006768229  0.0130687872 -0.0008136939  0.0052010328
## 9   0.005874383  0.004173394  0.0032407623  0.0048720927  0.0043798017
## 10  0.009517037  0.010524320 -0.0012452463  0.0084695003  0.0075203767
##         ANZ.Ret      WES.Ret      FMG.Ret      MQG.Ret       WOW.Ret
## 1   0.007305234  0.009606373 -0.002786896  0.006089294  0.0019275015
## 2  -0.004051945  0.008567390  0.001858835 -0.005362303 -0.0011009330
## 3   0.012908764  0.020406461 -0.007455910  0.013567292  0.0172018171
## 4  -0.003613760 -0.003488836 -0.004688046 -0.007266055  0.0083569304
## 5   0.001205925  0.008583815  0.016775796  0.004538653 -0.0008057017
## 6   0.009198311  0.013992665 -0.011152621  0.008730540  0.0159918594
## 7  -0.004388593 -0.006857156  0.008376167 -0.006074761 -0.0103655325
## 8   0.007568263 -0.003906750  0.018365845  0.011616888  0.0250638949
## 9   0.001982114  0.005510785 -0.012820626  0.002688922 -0.0018255685
## 10  0.006710109  0.007301140  0.013730240  0.013755526  0.0096117718
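The structure check asked for in this step can be sketched as follows. The example uses as.matrix() as a direct alternative to the double transpose, on a stand-in data frame (`S1_sample`, a hypothetical name) built from rounded values in the report's str() output. The `mixed` example with a character column is added only to illustrate coercion and is not part of the data set.

```r
# Stand-in for S1: rounded values copied from the str() output
S1_sample <- data.frame(
  CBA.Ret = c(0.00537, -0.00675, 0.01777, -0.0037, 0.00727),
  CSL.Ret = c(0.00818, -0.0018, 0.02363, 0.00854, 0.0182)
)

# Direct conversion; no transposition needed
m <- as.matrix(S1_sample)

class(m)        # object class
typeof(m)       # storage type of the elements
is.numeric(m)   # TRUE when every column was numeric
dim(m)          # rows and columns are preserved

# A matrix holds a single type, so one character column
# coerces every element to character
mixed <- cbind(S1_sample, Company = "CBA")
typeof(as.matrix(mixed))
```

Because a matrix can store only one type, an all-numeric data frame gives a numeric matrix, while any character column turns the whole matrix into character.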

Subsetting II

Subset the data frame to include only the first and the last variable in the data set, then save it as an R object file (.RData). Provide the R code with outputs and explain everything that you do in this step.

# This is a chunk to subset your data and convert it to an R object file 
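One way to approach this step is sketched below, under stated assumptions. `S2_source` is a hypothetical stand-in with three columns of rounded values from the report's str() output (in the full data set the first and last variables would be CBA.Ret and WOW.Ret). The file name `first_last.RData` and the use of a temporary directory are also illustrative choices.

```r
# Stand-in data: rounded values copied from the str() output
S2_source <- data.frame(
  CBA.Ret = c(0.00537, -0.00675, 0.01777, -0.0037, 0.00727),
  CSL.Ret = c(0.00818, -0.0018, 0.02363, 0.00854, 0.0182),
  BHP.Ret = c(0.005122, 0.006365, 0.0038, 0.000506, 0.011558)
)

# Keep only the first and the last variable
first_last <- S2_source[, c(1, ncol(S2_source))]
names(first_last)

# Save the object to an .RData file (hypothetical file name)
rdata_path <- file.path(tempdir(), "first_last.RData")
save(first_last, file = rdata_path)

# Reload into a separate environment to confirm the round trip
reloaded <- new.env()
load(rdata_path, envir = reloaded)
identical(reloaded$first_last, first_last)
```

Using ncol() rather than a hard-coded column number keeps the subset correct if the number of variables changes.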

Create a new Data Frame

Create a data frame with 2 variables. Your data frame has to contain one integer variable and one ordinal variable.

# This is a chunk to create a new data frame with the given specifications
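A minimal sketch of such a data frame follows. All names and values (`new_df`, `company_id`, `risk_level` and the risk labels) are hypothetical and are not drawn from the report's data. The ordinal variable is created as an ordered factor with explicit levels.

```r
# Hypothetical example data
new_df <- data.frame(
  company_id = 1:5,
  risk_level = factor(
    c("Low", "High", "Medium", "Low", "High"),
    levels = c("Low", "Medium", "High"),
    ordered = TRUE
  )
)

str(new_df)

# Confirm the two required variable types
is.integer(new_df$company_id)
is.ordered(new_df$risk_level)

# The level order defines the ranking used in comparisons
levels(new_df$risk_level)
```

Setting `levels` explicitly matters: without it, the factor levels would be sorted alphabetically and "High" would rank below "Low".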

Create another Data Frame

Create another data frame with a common variable to the dataset created in step 11.

# This is a chunk to create another data frame with the given specifications
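The sketch below shows a second data frame that shares a key with the first, and a merge on that key to confirm the two are linked. Both data frames are defined here so the example runs on its own; every name and value (`first_df`, `second_df`, `company_id`, `sector` and the sector labels) is hypothetical.

```r
# Hypothetical first data frame: integer key plus an ordinal variable
first_df <- data.frame(
  company_id = 1:5,
  risk_level = factor(
    c("Low", "High", "Medium", "Low", "High"),
    levels = c("Low", "Medium", "High"),
    ordered = TRUE
  )
)

# Hypothetical second data frame sharing the company_id variable
second_df <- data.frame(
  company_id = c(1L, 2L, 3L, 4L, 6L),
  sector = c("Banking", "Healthcare", "Mining", "Banking", "Retail")
)

# Inner join on the common variable: only ids present in both are kept
combined <- merge(first_df, second_df, by = "company_id")
combined
```

Ids 5 and 6 each appear in only one data frame, so the inner join drops them; `all = TRUE` would keep them with NA in the missing columns.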

IMPORTANT NOTE:

The report must be uploaded to the Assignment 1 section in Canvas as a PDF document with R code and outputs showing. The easiest way to achieve this is to run all R chunks first, then Preview your notebook in HTML (by clicking Preview), then Open in Browser (Chrome), then Right Click on the report in Chrome, then Click Print and Select the Destination Option to Save as PDF. Upload this PDF report as one single file via the Assignment 1 page in CANVAS.

DELETE the instructional text provided in the template. Failure to do this will INCREASE the SIMILARITY INDEX reported in TURNITIN. If you have any questions regarding the assignment instructions and the R template, please post them on Canvas discussion.