IMPORTANT NOTE:

All the instructions to complete this assignment are available on the MATH2349_1910 Assignment_3 Word file. Please read through this document carefully before submitting your report.

Groups

Students are permitted to work individually or in groups of up to 3 people for Assignment 3. Each group must fill out the group registration form before 2/06/2019 to register their group details.

All group members must submit a copy of the report! Group members that are not registered and do not submit a report will not be acknowledged.

You must use the headings and chunks provided in the template; you may add additional sections and R chunks if you require. In the report, all R chunks and outputs need to be visible. Failure to do so will result in a loss of marks.

This report must be uploaded to Turnitin as a PDF with your code chunks and outputs showing. The easiest way to achieve this is to Preview your notebook in HTML (by clicking Preview) → Open in Browser (Chrome) → Right click on the report in Chrome → Click Print and Select the Destination Option to Save as PDF.

You must also publish your report to RPubs (see here) and submit this RPubs link to the Google form given here. This online version of the report will be used for marking. Failure to submit your link will delay your feedback and risk late penalties.

Feel free to DELETE the instructional text provided in the template. If you have any questions regarding the assignment instructions and the R template, please post it on Slack under the #assignment3 channel.

Required packages

Provide the packages required to reproduce the report. Make sure you have fulfilled minimum requirement #10.

The dplyr package was installed and loaded here. The readxl package is loaded later, in the Data section, to import the Excel files.

install.packages("dplyr")
Installing package into ‘C:/Users/spirzada/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/dplyr_0.8.1.zip'
Content type 'application/zip' length 3231480 bytes (3.1 MB)
downloaded 3.1 MB
package ‘dplyr’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\spirzada\AppData\Local\Temp\RtmpOUeVAJ\downloaded_packages
library(dplyr)
package ‘dplyr’ was built under R version 3.5.3
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Executive Summary

In your own words, provide a brief summary of the preprocessing. Explain the steps that you have taken to preprocess your data. Write this section last after you have performed all data preprocessing. (Word count Max: 300 words)

First, I took two data sets from Kaggle.com, saved the CSV files as Excel workbooks and imported them into R. I installed and loaded the required package, which in this case was dplyr. The two data frames were train and tube. I combined them with a left join on tube_assembly_id. I inspected the data types and attributes of all the variables and converted one variable, bracket_pricing, from character to a factor with the levels “No” and “Yes”. The data was already tidy because each variable had its own column, each observation had its own row and each value had its own cell, so no reshaping was required at this stage. I mutated the data frame to create a new variable, total_price, which is the product of cost and quantity. I scanned the data for missing values and inconsistencies and none were found. I then used boxplots to check for outliers. I only treated the variables whose boxplots showed a small number of outliers; the ones with a large number of outliers were left alone. Based on the boxplots, other has 7 outliers, num_boss has 5 and num_bends has 8; bend_radius was treated in the same way. These outliers were handled with the capping method, which replaces values below the lower boxplot fence (Q1 - 1.5 × IQR) with the 5th percentile and values above the upper fence (Q3 + 1.5 × IQR) with the 95th percentile. The boxplots also suggested skewness in num_bends and diameter. To confirm this, histograms were created, and they show right skewness in both variables. I therefore applied a base-10 logarithmic transformation to make their distributions more symmetrical.

Data

A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (head of data sets) that you used to import/read/scrape the data set. You need to fulfil the minimum requirement #1 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.

Two data sets were taken from Kaggle.com (https://www.kaggle.com/arionai/caterpillar-tube-pricing-dataset#tube.csv). They were in CSV form; I saved them as Excel workbooks and then imported them into R. A left join on tube_assembly_id was used to combine the data sets. This data was used in the Caterpillar Tube Pricing competition that ran between June 2015 and September 2015. We are dealing with two files from this competition. The tube file contains information on tube assemblies, which are the primary focus of the competition. This includes the dimensions of the tube, the materials used, etc. The train file has information on the suppliers, pricing, quantity, etc. The combined data frame has been named dataset and has 23 variables; a 24th, total_price, is added later in the report.

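# readxl provides read_excel(); it only needs to be loaded once
library(readxl)
train <- read_excel("train.xlsx")
tube <- read_excel("tube.xlsx")
# Left join: keep every row of train and attach the matching tube attributes
dataset <- train %>% left_join(tube, by = "tube_assembly_id")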
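
The difference between a left join and an inner join can be seen on a small example. The sketch below is only an illustration: the data frames train_toy and tube_toy and all of their values are made up, and the CSV round trip is included simply to show that read.csv() can read such files without converting them to Excel first.

```r
library(dplyr)

# Toy stand-ins for the two Kaggle files (hypothetical values, not the real data)
train_toy <- data.frame(
  tube_assembly_id = c("TA-1", "TA-1", "TA-2", "TA-3"),
  quantity = c(1, 5, 1, 10),
  cost = c(20, 6, 15, 4),
  stringsAsFactors = FALSE
)
tube_toy <- data.frame(
  tube_assembly_id = c("TA-1", "TA-2"),
  diameter = c(6.35, 12.7),
  stringsAsFactors = FALSE
)

# Write the toy table to a temporary CSV file and read it straight back
train_path <- tempfile(fileext = ".csv")
write.csv(train_toy, train_path, row.names = FALSE)
train_csv <- read.csv(train_path, stringsAsFactors = FALSE)

# left_join keeps every row of the left table; unmatched keys get NA
left_toy <- left_join(train_csv, tube_toy, by = "tube_assembly_id")

# inner_join keeps only the rows whose key appears in both tables
inner_toy <- inner_join(train_csv, tube_toy, by = "tube_assembly_id")

head(left_toy)
nrow(left_toy)
nrow(inner_toy)
```

In this toy case the left join keeps the TA-3 row with a missing diameter, while the inner join drops it. When every key in the left table has a match, the two joins return the same rows.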

Understand

Summarise the types of variables and data structures, check the attributes in the data and apply data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.

Upon inspection, the data contains several types of variables, including character, numeric and date-time (POSIXct) columns. The bracket_pricing variable was converted to a factor with two levels, “No” and “Yes”.

str(dataset)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   30213 obs. of  23 variables:
 $ tube_assembly_id  : chr  "TA-00002" "TA-00002" "TA-00002" "TA-00002" ...
 $ supplier          : chr  "S-0066" "S-0066" "S-0066" "S-0066" ...
 $ quote_date        : POSIXct, format: "2013-07-07" "2013-07-07" "2013-07-07" "2013-07-07" ...
 $ annual_usage      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ min_order_quantity: num  0 0 0 0 0 0 0 0 0 0 ...
 $ bracket_pricing   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ quantity          : num  1 2 5 10 25 50 100 250 1 2 ...
 $ cost              : num  21.91 12.34 6.6 4.69 3.54 ...
 $ material_id       : chr  "SP-0019" "SP-0019" "SP-0019" "SP-0019" ...
 $ diameter          : num  6.35 6.35 6.35 6.35 6.35 6.35 6.35 6.35 6.35 6.35 ...
 $ wall              : num  0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 ...
 $ length            : num  137 137 137 137 137 137 137 137 137 137 ...
 $ num_bends         : num  8 8 8 8 8 8 8 8 9 9 ...
 $ bend_radius       : num  19.1 19.1 19.1 19.1 19.1 ...
 $ end_a_1x          : chr  "N" "N" "N" "N" ...
 $ end_a_2x          : chr  "N" "N" "N" "N" ...
 $ end_x_1x          : chr  "N" "N" "N" "N" ...
 $ end_x_2x          : chr  "N" "N" "N" "N" ...
 $ end_a             : chr  "EF-008" "EF-008" "EF-008" "EF-008" ...
 $ end_x             : chr  "EF-008" "EF-008" "EF-008" "EF-008" ...
 $ num_boss          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ num_bracket       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ other             : num  0 0 0 0 0 0 0 0 0 0 ...
dataset$bracket_pricing<-as.factor(dataset$bracket_pricing)
levels(dataset$bracket_pricing)
[1] "No"  "Yes"
attributes(dataset)
$`row.names`
   [1]    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21
  [22]   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42
  [43]   43   44   45   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
  [64]   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80   81   82   83   84
  [85]   85   86   87   88   89   90   91   92   93   94   95   96   97   98   99  100  101  102  103  104  105
 [106]  106  107  108  109  110  111  112  113  114  115  116  117  118  119  120  121  122  123  124  125  126
 [127]  127  128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144  145  146  147
 [148]  148  149  150  151  152  153  154  155  156  157  158  159  160  161  162  163  164  165  166  167  168
 [169]  169  170  171  172  173  174  175  176  177  178  179  180  181  182  183  184  185  186  187  188  189
 [190]  190  191  192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207  208  209  210
 [211]  211  212  213  214  215  216  217  218  219  220  221  222  223  224  225  226  227  228  229  230  231
 [232]  232  233  234  235  236  237  238  239  240  241  242  243  244  245  246  247  248  249  250  251  252
 [253]  253  254  255  256  257  258  259  260  261  262  263  264  265  266  267  268  269  270  271  272  273
 [274]  274  275  276  277  278  279  280  281  282  283  284  285  286  287  288  289  290  291  292  293  294
 [295]  295  296  297  298  299  300  301  302  303  304  305  306  307  308  309  310  311  312  313  314  315
 [316]  316  317  318  319  320  321  322  323  324  325  326  327  328  329  330  331  332  333  334  335  336
 [337]  337  338  339  340  341  342  343  344  345  346  347  348  349  350  351  352  353  354  355  356  357
 [358]  358  359  360  361  362  363  364  365  366  367  368  369  370  371  372  373  374  375  376  377  378
 [379]  379  380  381  382  383  384  385  386  387  388  389  390  391  392  393  394  395  396  397  398  399
 [400]  400  401  402  403  404  405  406  407  408  409  410  411  412  413  414  415  416  417  418  419  420
 [421]  421  422  423  424  425  426  427  428  429  430  431  432  433  434  435  436  437  438  439  440  441
 [442]  442  443  444  445  446  447  448  449  450  451  452  453  454  455  456  457  458  459  460  461  462
 [463]  463  464  465  466  467  468  469  470  471  472  473  474  475  476  477  478  479  480  481  482  483
 [484]  484  485  486  487  488  489  490  491  492  493  494  495  496  497  498  499  500  501  502  503  504
 [505]  505  506  507  508  509  510  511  512  513  514  515  516  517  518  519  520  521  522  523  524  525
 [526]  526  527  528  529  530  531  532  533  534  535  536  537  538  539  540  541  542  543  544  545  546
 [547]  547  548  549  550  551  552  553  554  555  556  557  558  559  560  561  562  563  564  565  566  567
 [568]  568  569  570  571  572  573  574  575  576  577  578  579  580  581  582  583  584  585  586  587  588
 [589]  589  590  591  592  593  594  595  596  597  598  599  600  601  602  603  604  605  606  607  608  609
 [610]  610  611  612  613  614  615  616  617  618  619  620  621  622  623  624  625  626  627  628  629  630
 [631]  631  632  633  634  635  636  637  638  639  640  641  642  643  644  645  646  647  648  649  650  651
 [652]  652  653  654  655  656  657  658  659  660  661  662  663  664  665  666  667  668  669  670  671  672
 [673]  673  674  675  676  677  678  679  680  681  682  683  684  685  686  687  688  689  690  691  692  693
 [694]  694  695  696  697  698  699  700  701  702  703  704  705  706  707  708  709  710  711  712  713  714
 [715]  715  716  717  718  719  720  721  722  723  724  725  726  727  728  729  730  731  732  733  734  735
 [736]  736  737  738  739  740  741  742  743  744  745  746  747  748  749  750  751  752  753  754  755  756
 [757]  757  758  759  760  761  762  763  764  765  766  767  768  769  770  771  772  773  774  775  776  777
 [778]  778  779  780  781  782  783  784  785  786  787  788  789  790  791  792  793  794  795  796  797  798
 [799]  799  800  801  802  803  804  805  806  807  808  809  810  811  812  813  814  815  816  817  818  819
 [820]  820  821  822  823  824  825  826  827  828  829  830  831  832  833  834  835  836  837  838  839  840
 [841]  841  842  843  844  845  846  847  848  849  850  851  852  853  854  855  856  857  858  859  860  861
 [862]  862  863  864  865  866  867  868  869  870  871  872  873  874  875  876  877  878  879  880  881  882
 [883]  883  884  885  886  887  888  889  890  891  892  893  894  895  896  897  898  899  900  901  902  903
 [904]  904  905  906  907  908  909  910  911  912  913  914  915  916  917  918  919  920  921  922  923  924
 [925]  925  926  927  928  929  930  931  932  933  934  935  936  937  938  939  940  941  942  943  944  945
 [946]  946  947  948  949  950  951  952  953  954  955  956  957  958  959  960  961  962  963  964  965  966
 [967]  967  968  969  970  971  972  973  974  975  976  977  978  979  980  981  982  983  984  985  986  987
 [988]  988  989  990  991  992  993  994  995  996  997  998  999 1000
 [ reached getOption("max.print") -- omitted 29213 entries ]

$names
 [1] "tube_assembly_id"   "supplier"           "quote_date"         "annual_usage"       "min_order_quantity"
 [6] "bracket_pricing"    "quantity"           "cost"               "material_id"        "diameter"          
[11] "wall"               "length"             "num_bends"          "bend_radius"        "end_a_1x"          
[16] "end_a_2x"           "end_x_1x"           "end_x_2x"           "end_a"              "end_x"             
[21] "num_boss"           "num_bracket"        "other"             

$class
[1] "tbl_df"     "tbl"        "data.frame"
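
As a compact alternative to reading the full str() and attributes() output, the column types can be collected into one named vector. The sketch below uses a made-up data frame called toy (its values are illustrative only) and also shows a factor conversion with the level order stated explicitly, rather than relying on alphabetical sorting.

```r
# Hypothetical three-column table standing in for the joined data
toy <- data.frame(
  supplier = c("S-0066", "S-0041", "S-0066"),
  bracket_pricing = c("Yes", "No", "Yes"),
  quantity = c(1, 2, 5),
  stringsAsFactors = FALSE
)

# One class name per column, returned as a named character vector
col_types <- sapply(toy, function(col) class(col)[1])
print(col_types)

# Stating the levels explicitly fixes their order
toy$bracket_pricing <- factor(toy$bracket_pricing, levels = c("No", "Yes"))
levels(toy$bracket_pricing)
table(toy$bracket_pricing)
```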

Tidy & Manipulate Data I

Check if the data conforms to the tidy data principles. If your data is untidy, reshape your data into a tidy format (minimum requirement #5). In addition to the R codes and outputs, explain everything that you do in this step.

No change needs to be made, as the data is already tidy for three reasons.

Each variable has its own column, each observation has its own separate row, and each value has its own cell.

Tidy & Manipulate Data II

Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.

We mutate the dataset by adding a new variable, which I have named total_price. I believe this will be useful as it gives the total price of all the tubes for each order. It is formed by multiplying quantity and cost.

dataset<-mutate(dataset,
       total_price = quantity*cost)
str(dataset)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   30213 obs. of  24 variables:
 $ tube_assembly_id  : chr  "TA-00002" "TA-00002" "TA-00002" "TA-00002" ...
 $ supplier          : chr  "S-0066" "S-0066" "S-0066" "S-0066" ...
 $ quote_date        : POSIXct, format: "2013-07-07" "2013-07-07" "2013-07-07" "2013-07-07" ...
 $ annual_usage      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ min_order_quantity: num  0 0 0 0 0 0 0 0 0 0 ...
 $ bracket_pricing   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ quantity          : num  1 2 5 10 25 50 100 250 1 2 ...
 $ cost              : num  21.91 12.34 6.6 4.69 3.54 ...
 $ material_id       : chr  "SP-0019" "SP-0019" "SP-0019" "SP-0019" ...
 $ diameter          : num  6.35 6.35 6.35 6.35 6.35 6.35 6.35 6.35 6.35 6.35 ...
 $ wall              : num  0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 ...
 $ length            : num  137 137 137 137 137 137 137 137 137 137 ...
 $ num_bends         : num  8 8 8 8 8 8 8 8 9 9 ...
 $ bend_radius       : num  19.1 19.1 19.1 19.1 19.1 ...
 $ end_a_1x          : chr  "N" "N" "N" "N" ...
 $ end_a_2x          : chr  "N" "N" "N" "N" ...
 $ end_x_1x          : chr  "N" "N" "N" "N" ...
 $ end_x_2x          : chr  "N" "N" "N" "N" ...
 $ end_a             : chr  "EF-008" "EF-008" "EF-008" "EF-008" ...
 $ end_x             : chr  "EF-008" "EF-008" "EF-008" "EF-008" ...
 $ num_boss          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ num_bracket       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ other             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ total_price       : num  21.9 24.7 33 46.9 88.5 ...
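
The new column can be checked by hand against the first few rows. The sketch below reuses the first five quantity and cost values as they are displayed (rounded) in the str() output above, so the products agree with the printed total_price values only up to that rounding.

```r
# First five values as displayed in the str() output (cost is rounded there)
quantity <- c(1, 2, 5, 10, 25)
cost <- c(21.91, 12.34, 6.6, 4.69, 3.54)

# Element-wise product, the same operation mutate() applies to whole columns
total_price <- quantity * cost
print(total_price)
```

The products are 21.91, 24.68, 33, 46.9 and 88.5, which match the 21.9, 24.7, 33, 46.9 and 88.5 shown for total_price.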

Scan I

Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values.

Upon scanning the data we find 0 missing values in every column. Thus no further steps need to be taken.

is.na(dataset)
         tube_assembly_id supplier quote_date annual_usage min_order_quantity bracket_pricing quantity  cost
    [1,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
    [2,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
    [3,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
    [4,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
    [5,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
    [6,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
    [7,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
    [8,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
    [9,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [10,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [11,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [12,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [13,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [14,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [15,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [16,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [17,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [18,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [19,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [20,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [21,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [22,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [23,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [24,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [25,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [26,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [27,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [28,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [29,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [30,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [31,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [32,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [33,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [34,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [35,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [36,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [37,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [38,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [39,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [40,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
   [41,]            FALSE    FALSE      FALSE        FALSE              FALSE           FALSE    FALSE FALSE
         material_id diameter  wall length num_bends bend_radius end_a_1x end_a_2x end_x_1x end_x_2x end_a
    [1,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
    [2,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
    [3,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
    [4,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
    [5,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
    [6,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
    [7,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
    [8,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
    [9,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [10,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [11,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [12,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [13,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [14,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [15,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [16,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [17,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [18,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [19,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [20,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [21,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [22,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [23,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [24,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [25,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [26,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [27,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [28,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [29,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [30,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [31,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [32,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [33,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [34,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [35,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [36,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [37,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [38,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [39,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [40,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
   [41,]       FALSE    FALSE FALSE  FALSE     FALSE       FALSE    FALSE    FALSE    FALSE    FALSE FALSE
         end_x num_boss num_bracket other total_price
    [1,] FALSE    FALSE       FALSE FALSE       FALSE
    [2,] FALSE    FALSE       FALSE FALSE       FALSE
    [3,] FALSE    FALSE       FALSE FALSE       FALSE
    [4,] FALSE    FALSE       FALSE FALSE       FALSE
    [5,] FALSE    FALSE       FALSE FALSE       FALSE
    [6,] FALSE    FALSE       FALSE FALSE       FALSE
    [7,] FALSE    FALSE       FALSE FALSE       FALSE
    [8,] FALSE    FALSE       FALSE FALSE       FALSE
    [9,] FALSE    FALSE       FALSE FALSE       FALSE
   [10,] FALSE    FALSE       FALSE FALSE       FALSE
   [11,] FALSE    FALSE       FALSE FALSE       FALSE
   [12,] FALSE    FALSE       FALSE FALSE       FALSE
   [13,] FALSE    FALSE       FALSE FALSE       FALSE
   [14,] FALSE    FALSE       FALSE FALSE       FALSE
   [15,] FALSE    FALSE       FALSE FALSE       FALSE
   [16,] FALSE    FALSE       FALSE FALSE       FALSE
   [17,] FALSE    FALSE       FALSE FALSE       FALSE
   [18,] FALSE    FALSE       FALSE FALSE       FALSE
   [19,] FALSE    FALSE       FALSE FALSE       FALSE
   [20,] FALSE    FALSE       FALSE FALSE       FALSE
   [21,] FALSE    FALSE       FALSE FALSE       FALSE
   [22,] FALSE    FALSE       FALSE FALSE       FALSE
   [23,] FALSE    FALSE       FALSE FALSE       FALSE
   [24,] FALSE    FALSE       FALSE FALSE       FALSE
   [25,] FALSE    FALSE       FALSE FALSE       FALSE
   [26,] FALSE    FALSE       FALSE FALSE       FALSE
   [27,] FALSE    FALSE       FALSE FALSE       FALSE
   [28,] FALSE    FALSE       FALSE FALSE       FALSE
   [29,] FALSE    FALSE       FALSE FALSE       FALSE
   [30,] FALSE    FALSE       FALSE FALSE       FALSE
   [31,] FALSE    FALSE       FALSE FALSE       FALSE
   [32,] FALSE    FALSE       FALSE FALSE       FALSE
   [33,] FALSE    FALSE       FALSE FALSE       FALSE
   [34,] FALSE    FALSE       FALSE FALSE       FALSE
   [35,] FALSE    FALSE       FALSE FALSE       FALSE
   [36,] FALSE    FALSE       FALSE FALSE       FALSE
   [37,] FALSE    FALSE       FALSE FALSE       FALSE
   [38,] FALSE    FALSE       FALSE FALSE       FALSE
   [39,] FALSE    FALSE       FALSE FALSE       FALSE
   [40,] FALSE    FALSE       FALSE FALSE       FALSE
   [41,] FALSE    FALSE       FALSE FALSE       FALSE
 [ reached getOption("max.print") -- omitted 30172 rows ]
colSums(is.na(dataset))
  tube_assembly_id           supplier         quote_date       annual_usage min_order_quantity 
                 0                  0                  0                  0                  0 
   bracket_pricing           quantity               cost        material_id           diameter 
                 0                  0                  0                  0                  0 
              wall             length          num_bends        bend_radius           end_a_1x 
                 0                  0                  0                  0                  0 
          end_a_2x           end_x_1x           end_x_2x              end_a              end_x 
                 0                  0                  0                  0                  0 
          num_boss        num_bracket              other        total_price 
                 0                  0                  0                  0 
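
Printing the full is.na() matrix is hard to read for a table of this size. A shorter check, sketched below on a made-up data frame called toy (hypothetical values with problems planted on purpose), summarises missing values, special values and one simple consistency rule. The rule that quantity should not be negative is an assumption used for illustration.

```r
# Hypothetical table with one NA per affected column and one negative quantity
toy <- data.frame(
  cost = c(21.9, NA, 6.6, 4.7),
  quantity = c(1, 2, -5, 10),
  supplier = c("S-0066", "S-0066", NA, "S-0041"),
  stringsAsFactors = FALSE
)

# Missing values: any at all, per column, and in total
any_missing <- anyNA(toy)
na_per_column <- colSums(is.na(toy))
total_na <- sum(is.na(toy))

# Special values (Inf, -Inf, NaN) can only occur in numeric columns
numeric_cols <- toy[sapply(toy, is.numeric)]
special_per_column <- sapply(
  numeric_cols,
  function(col) sum(is.infinite(col) | is.nan(col))
)

# Consistency rule (assumed): quantities should not be negative
negative_quantity <- sum(toy$quantity < 0, na.rm = TRUE)

print(na_per_column)
print(special_per_column)
print(negative_quantity)
```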

Scan II

Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain how you dealt with these values.

First I plotted a boxplot for the numeric variables to check for outliers. Some authors recommend that outliers can be excluded when they are few in number. Based on the boxplots, other has 7 outliers, num_boss has 5 and num_bends has 8; bend_radius was treated in the same way. All the other variables have far too many outliers, and thus those cannot be excluded. We can handle the outliers in these four variables using the capping method. This replaces values below the lower fence (Q1 - 1.5 × IQR) with the 5th percentile and values above the upper fence (Q3 + 1.5 × IQR) with the 95th percentile.

dataset$annual_usage %>% boxplot()

dataset$min_order_quantity %>% boxplot()

dataset$quantity %>% boxplot()

dataset$cost %>% boxplot()

dataset$diameter %>% boxplot()

dataset$wall %>% boxplot()

dataset$length %>% boxplot()

dataset$num_bends %>% boxplot()

dataset$bend_radius %>% boxplot()

dataset$num_boss %>% boxplot()

dataset$other %>% boxplot()

dataset$total_price %>% boxplot()

# Capping: values beyond the 1.5 * IQR fences are replaced by the 5th/95th percentiles
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}
dataset$other <- dataset$other %>% cap()
dataset$num_boss <- dataset$num_boss %>% cap()
dataset$num_bends <- dataset$num_bends %>% cap()
dataset$bend_radius <- dataset$bend_radius %>% cap()
dataset$other %>% boxplot()

dataset$num_boss %>% boxplot()

dataset$num_bends %>% boxplot()

dataset$bend_radius %>% boxplot()
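
Outlier counts do not have to be read off the plots. The sketch below, run on a made-up vector called toy, counts the points that boxplot() would draw as outliers using boxplot.stats(), and then applies a capping function of the same shape as the one used in this section. The name cap_sketch and the toy values are illustrative assumptions.

```r
# Hypothetical data with one extreme value
toy <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 40)

# boxplot.stats() returns the points that boxplot() marks as outliers
outliers <- boxplot.stats(toy)$out
n_outliers <- length(outliers)

# Capping: values beyond the 1.5 * IQR fences are replaced by the
# 5th (lower side) or 95th (upper side) percentile
cap_sketch <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  fence <- 1.5 * IQR(x)
  x[x < q[2] - fence] <- q[1]
  x[x > q[3] + fence] <- q[4]
  x
}

capped <- cap_sketch(toy)
print(n_outliers)
print(capped)
```

Capping changes only the flagged values and keeps the number of observations the same. In a small sample like this one the 95th percentile can itself lie above the upper fence, so a capped value may still appear as an outlier on a new boxplot.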

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.

In the previous section we examined the boxplots for the variables, and I observed some skewness in num_bends and diameter. To confirm this, histograms were created. They show right skewness in both variables. Thus we apply a base-10 logarithmic transformation to make their distributions more symmetrical.

hist(dataset$diameter)

hist(dataset$num_bends)

dataset$diameter <- log10(dataset$diameter)
dataset$num_bends <- log10(dataset$num_bends)
hist(dataset$diameter)

hist(dataset$num_bends)
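
One caveat with a logarithmic transformation of a count variable is that log10(0) is -Inf. The output shown in this report does not reveal whether num_bends contains zeros, so the sketch below only illustrates the issue on a made-up vector (num_bends_toy, hypothetical values) and shows one common workaround, adding 1 before taking the logarithm.

```r
# Hypothetical counts that include zeros
num_bends_toy <- c(0, 0, 1, 2, 2, 3, 5, 9, 17)

# A plain log10 turns every zero into -Inf
plain_log <- log10(num_bends_toy)
n_infinite <- sum(is.infinite(plain_log))

# Shifting by 1 keeps zeros at zero and all values finite
shifted_log <- log10(num_bends_toy + 1)

print(n_infinite)
print(all(is.finite(shifted_log)))
```

Whether the shift is appropriate depends on the variable; for a strictly positive measurement such as diameter the plain logarithm needs no adjustment.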

NOTE: Follow the order outlined above in the report. Make sure your code is visible (within the margin of the page). Do not use View() to show your data; instead give headers (using head()).

Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Please also provide the R codes, outputs and brief explanations on why and how you applied these tasks on the data.



