In the dynamic landscape of social media, staying at the forefront of innovation is crucial for any company seeking to engage and captivate its users. Our machine learning project delves into the captivating world of image recognition, with a singular focus on its application and business implications for our esteemed client, a leading social media company.
The primary objective of this project is to construct a diverse array of ten machine learning models designed to perform predictive classifications through cutting-edge image recognition techniques. By harnessing the power of artificial intelligence and advanced algorithms, our project aims to elevate our client’s platform by providing intelligent image recognition capabilities.
As a social media company, our client understands the pivotal role that visuals play in capturing the attention of its users. By integrating image recognition technology into their platform, our client seeks to offer its users an enhanced and personalized experience. The ability to accurately recognize and classify images opens up a myriad of exciting business implications that can revolutionize user engagement and content discovery.
One of the most significant impacts of our machine learning project is its potential to transform content discovery on the social media platform. By empowering the platform with image recognition capabilities, users can effortlessly find relevant content and connect with like-minded individuals, leading to increased user retention and satisfaction.
Moreover, the predictive classifications generated by our machine learning models enable our client to tailor content recommendations to individual preferences and interests. This level of personalization enhances user satisfaction and fosters stronger user-brand relationships, ultimately driving user loyalty and platform growth.
In the realm of content moderation, image recognition technology offers invaluable support in identifying and flagging inappropriate or harmful content. By automating the content moderation process, our client can ensure a safe and respectful online community, maintaining a positive user experience and protecting its brand reputation.
Beyond the user experience, image recognition also unlocks innovative marketing opportunities for our client. By analyzing user-generated content and trends, our machine learning models enable targeted and data-driven advertising strategies, optimizing ad placement and increasing ad relevance for users and advertisers alike.
While the potential benefits of image recognition are immense, we are also mindful of the ethical implications associated with AI and machine learning. Our project is committed to addressing privacy concerns and ensuring unbiased image analysis, safeguarding user data and maintaining the trust of the social media community.
In conclusion, this image recognition machine learning project is a transformative endeavor for our social media client. By deploying ten powerful predictive classification models, we aim to revolutionize content discovery, enhance user engagement, optimize content moderation, and unlock new marketing opportunities. Together, we embark on a journey that not only elevates the client’s social media platform but also sets new industry standards for innovation and user-centric experiences in the ever-evolving world of social media.
A Neural Network model is a deep learning model that takes inspiration from the biological neuron. It consists of networks of neurons organized into an input layer, one or more hidden layers, and an output layer. With our dataset of 7 x 7 images, i.e., 49 pixels per image, the neural network model has 49 inputs.
Advantages: a general and adaptive algorithm; works directly on raw data; captures complex non-linear relationships.
Disadvantages: limited flexibility and added complexity; tricky hyper-parameters; it is challenging to interpret the learned weights and understand the decision-making process.
The nnet package provides a basic neural network model that allows only a single hidden layer, which is the layer between the input layer and the output layer where the actual learning and feature extraction occur. There are several parameters to consider for the model.
The size parameter sets the number of hidden nodes, or neurons, in the neural network. The decay parameter, also known as weight decay or L2 regularization, is a regularization technique used to prevent over-fitting in neural network models: a higher decay value produces a stronger regularization effect and smaller weights, which prevents the model from memorizing noise in the training data and encourages it to learn more general patterns. The maximum-weights parameter (MaxNWts) sets an upper limit on the total number of weights allowed in the network; model fitting will fail if the chosen architecture requires more weights than this limit. The maximum-iteration parameter (maxit) controls the maximum number of iterations for training. Increasing maxit gives the neural network more opportunities to learn from the data, potentially leading to better convergence and a more accurate model; however, setting it too high also risks overfitting the training data.
Here, we set the parameters for a single layer neural network as follows:
size: 10, decay: 0.2, MaxNWts: 10000, maxit: 300
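For reference, a minimal sketch of what the underlying nnet call might look like with these settings is shown below; the project's actual wrapper function is defined elsewhere and not reproduced here, and train_dat and test_dat are hypothetical data frames holding the 49 pixel columns plus a factor column named label. The reported weight count of 610 is consistent with this architecture: (49 + 1) x 10 hidden-layer weights plus (10 + 1) x 10 output-layer weights, corresponding to a 10-class output.

library(nnet)

# Minimal sketch (assumed names): train_dat / test_dat are hypothetical data
# frames with 49 pixel columns and a factor column `label`.
fit_nnet <- nnet(label ~ ., data = train_dat,
                 size    = 10,     # neurons in the single hidden layer
                 decay   = 0.2,    # L2 weight decay (regularization)
                 MaxNWts = 10000,  # upper bound on the total number of weights
                 maxit   = 300)    # maximum training iterations

pred_nnet <- predict(fit_nnet, newdata = test_dat, type = "class")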
# weights: 610
initial value 2774.484680
iter 10 value 1329.090586
iter 20 value 1016.719348
iter 30 value 798.867900
iter 40 value 701.260546
iter 50 value 668.863613
iter 60 value 649.390439
iter 70 value 641.597423
iter 80 value 637.368077
iter 90 value 635.687325
iter 100 value 634.912516
iter 110 value 633.777295
iter 120 value 633.051445
iter 130 value 632.338245
iter 140 value 631.650511
iter 150 value 630.947008
iter 160 value 630.127352
iter 170 value 629.578567
iter 180 value 628.976374
iter 190 value 628.168483
iter 200 value 626.696657
iter 210 value 625.164084
iter 220 value 624.043113
iter 230 value 623.499843
iter 240 value 622.949543
iter 250 value 622.596924
iter 260 value 622.107510
iter 270 value 621.635305
iter 280 value 621.451368
iter 290 value 621.334965
iter 300 value 621.241916
final value 621.241916
stopped after 300 iterations
# weights: 610
initial value 2541.578810
iter 10 value 1204.115043
iter 20 value 909.266756
iter 30 value 761.330009
iter 40 value 685.286542
iter 50 value 650.041716
iter 60 value 637.638572
iter 70 value 631.696833
iter 80 value 626.439889
iter 90 value 618.039066
iter 100 value 609.251913
iter 110 value 606.285055
iter 120 value 604.420625
iter 130 value 603.469319
iter 140 value 602.757696
iter 150 value 602.337337
iter 160 value 601.898210
iter 170 value 601.580838
iter 180 value 601.378824
iter 190 value 601.263140
iter 200 value 601.172908
iter 210 value 601.109213
iter 220 value 601.068639
iter 230 value 600.280899
iter 240 value 599.379587
iter 250 value 599.165985
iter 260 value 599.018728
iter 270 value 598.988761
iter 280 value 598.976992
iter 290 value 598.971842
iter 300 value 598.967052
final value 598.967052
stopped after 300 iterations
# weights: 610
initial value 2469.686353
iter 10 value 1326.675917
iter 20 value 1088.694244
iter 30 value 822.477444
iter 40 value 720.975405
iter 50 value 694.705253
iter 60 value 669.965182
iter 70 value 660.123526
iter 80 value 653.836705
iter 90 value 649.164322
iter 100 value 646.282243
iter 110 value 644.755048
iter 120 value 643.549542
iter 130 value 642.451061
iter 140 value 641.760442
iter 150 value 641.424934
iter 160 value 641.197800
iter 170 value 641.108973
iter 180 value 641.061569
iter 190 value 641.051512
iter 200 value 641.049701
iter 210 value 641.048996
iter 220 value 641.048827
final value 641.048814
converged
# weights: 610
initial value 11944.075818
iter 10 value 5238.876264
iter 20 value 3959.623375
iter 30 value 3348.637150
iter 40 value 3103.399513
iter 50 value 2949.568342
iter 60 value 2870.811898
iter 70 value 2795.601551
iter 80 value 2750.655431
iter 90 value 2722.180377
iter 100 value 2698.132381
iter 110 value 2670.254783
iter 120 value 2651.038405
iter 130 value 2623.302225
iter 140 value 2598.587694
iter 150 value 2573.550642
iter 160 value 2551.967908
iter 170 value 2535.063444
iter 180 value 2519.107747
iter 190 value 2511.738653
iter 200 value 2505.325546
iter 210 value 2501.560330
iter 220 value 2498.402393
iter 230 value 2491.430967
iter 240 value 2481.529004
iter 250 value 2477.659258
iter 260 value 2475.784902
iter 270 value 2474.627328
iter 280 value 2473.517060
iter 290 value 2472.585357
iter 300 value 2471.934098
final value 2471.934098
stopped after 300 iterations
# weights: 610
initial value 13106.564617
iter 10 value 6731.584566
iter 20 value 4444.178759
iter 30 value 3742.758405
iter 40 value 3372.433788
iter 50 value 3143.220850
iter 60 value 2984.345057
iter 70 value 2875.821216
iter 80 value 2812.192743
iter 90 value 2772.456516
iter 100 value 2749.472392
iter 110 value 2705.617200
iter 120 value 2651.143643
iter 130 value 2604.909519
iter 140 value 2566.879847
iter 150 value 2539.529403
iter 160 value 2515.373549
iter 170 value 2498.340371
iter 180 value 2482.199920
iter 190 value 2472.050449
iter 200 value 2465.749218
iter 210 value 2460.832484
iter 220 value 2455.497100
iter 230 value 2450.557398
iter 240 value 2446.194898
iter 250 value 2443.561051
iter 260 value 2439.777076
iter 270 value 2435.435337
iter 280 value 2432.589206
iter 290 value 2427.917390
iter 300 value 2420.869194
final value 2420.869194
stopped after 300 iterations
# weights: 610
initial value 13017.066746
iter 10 value 6144.303536
iter 20 value 4152.415934
iter 30 value 3469.701318
iter 40 value 3147.357579
iter 50 value 2908.462279
iter 60 value 2760.576852
iter 70 value 2669.946676
iter 80 value 2615.328608
iter 90 value 2592.065020
iter 100 value 2580.864458
iter 110 value 2573.007661
iter 120 value 2560.262108
iter 130 value 2542.869800
iter 140 value 2530.383731
iter 150 value 2523.044035
iter 160 value 2517.479586
iter 170 value 2513.375432
iter 180 value 2505.031053
iter 190 value 2497.753511
iter 200 value 2492.698363
iter 210 value 2486.328661
iter 220 value 2481.217007
iter 230 value 2478.670339
iter 240 value 2476.373062
iter 250 value 2473.390624
iter 260 value 2470.332410
iter 270 value 2465.842106
iter 280 value 2460.882564
iter 290 value 2456.744115
iter 300 value 2452.688000
final value 2452.688000
stopped after 300 iterations
# weights: 610
initial value 23927.905259
iter 10 value 9716.776783
iter 20 value 7243.170353
iter 30 value 6076.117489
iter 40 value 5639.087305
iter 50 value 5359.385414
iter 60 value 5188.309096
iter 70 value 5083.511536
iter 80 value 5027.346681
iter 90 value 4978.107830
iter 100 value 4923.449119
iter 110 value 4863.969706
iter 120 value 4832.275330
iter 130 value 4809.628392
iter 140 value 4790.207231
iter 150 value 4777.718390
iter 160 value 4764.846090
iter 170 value 4743.893177
iter 180 value 4723.001151
iter 190 value 4711.687296
iter 200 value 4699.027214
iter 210 value 4687.618987
iter 220 value 4679.733212
iter 230 value 4672.234396
iter 240 value 4663.988711
iter 250 value 4652.406962
iter 260 value 4644.981505
iter 270 value 4636.521302
iter 280 value 4631.171033
iter 290 value 4627.597051
iter 300 value 4623.936775
final value 4623.936775
stopped after 300 iterations
# weights: 610
initial value 25293.058922
iter 10 value 11406.879075
iter 20 value 7706.171200
iter 30 value 6621.099743
iter 40 value 5869.173017
iter 50 value 5581.810415
iter 60 value 5384.811144
iter 70 value 5240.329143
iter 80 value 5151.030926
iter 90 value 5090.621592
iter 100 value 5042.088298
iter 110 value 5002.565401
iter 120 value 4968.229573
iter 130 value 4936.411656
iter 140 value 4918.401043
iter 150 value 4908.728469
iter 160 value 4901.411477
iter 170 value 4893.068574
iter 180 value 4884.180612
iter 190 value 4877.490637
iter 200 value 4868.845397
iter 210 value 4858.569797
iter 220 value 4851.312011
iter 230 value 4843.097700
iter 240 value 4837.552522
iter 250 value 4832.944672
iter 260 value 4828.282441
iter 270 value 4820.775420
iter 280 value 4813.915120
iter 290 value 4809.933231
iter 300 value 4806.795720
final value 4806.795720
stopped after 300 iterations
# weights: 610
initial value 25365.583754
iter 10 value 11943.466995
iter 20 value 8994.377853
iter 30 value 7875.040983
iter 40 value 7034.759331
iter 50 value 6305.302975
iter 60 value 6081.669126
iter 70 value 5703.922349
iter 80 value 5515.280179
iter 90 value 5315.665641
iter 100 value 5189.547129
iter 110 value 5112.061071
iter 120 value 5045.369499
iter 130 value 4981.216107
iter 140 value 4906.769359
iter 150 value 4825.263094
iter 160 value 4750.646753
iter 170 value 4704.653864
iter 180 value 4671.777300
iter 190 value 4653.218229
iter 200 value 4642.306360
iter 210 value 4628.073029
iter 220 value 4617.075628
iter 230 value 4607.537535
iter 240 value 4599.682218
iter 250 value 4592.495691
iter 260 value 4583.236723
iter 270 value 4573.548930
iter 280 value 4568.383321
iter 290 value 4564.881621
iter 300 value 4561.074490
final value 4561.074490
stopped after 300 iterations
For this second nnet neural network model, we still use the nnet package and the function defined above, but with different parameter settings.
For this model, we increase the size of the single hidden layer to 25 neurons. Adding more nodes to the hidden layer increases the model's capacity to learn complex patterns in the data; however, increasing the size too much can lead to over-fitting on the training data.
size: 25, decay: 0.2, MaxNWts: 10000, maxit: 300
# weights: 1510
initial value 2811.245073
iter 10 value 1178.171741
iter 20 value 816.414641
iter 30 value 708.873097
iter 40 value 656.810631
iter 50 value 629.719173
iter 60 value 613.289735
iter 70 value 602.686541
iter 80 value 597.698447
iter 90 value 593.712441
iter 100 value 590.271050
iter 110 value 587.832663
iter 120 value 586.442261
iter 130 value 585.123959
iter 140 value 583.865685
iter 150 value 583.166047
iter 160 value 582.627114
iter 170 value 582.310072
iter 180 value 581.942816
iter 190 value 581.301585
iter 200 value 580.964042
iter 210 value 580.784885
iter 220 value 580.627945
iter 230 value 580.511718
iter 240 value 580.420677
iter 250 value 580.371496
iter 260 value 580.332088
iter 270 value 580.292320
iter 280 value 580.268074
iter 290 value 580.249403
iter 300 value 580.228353
final value 580.228353
stopped after 300 iterations
# weights: 1510
initial value 2986.547943
iter 10 value 1216.212637
iter 20 value 817.295117
iter 30 value 706.314112
iter 40 value 637.248354
iter 50 value 604.961232
iter 60 value 591.810822
iter 70 value 585.111764
iter 80 value 580.647664
iter 90 value 574.443771
iter 100 value 571.002553
iter 110 value 568.420935
iter 120 value 567.228833
iter 130 value 566.571855
iter 140 value 566.145112
iter 150 value 565.620084
iter 160 value 565.087583
iter 170 value 564.769160
iter 180 value 564.601948
iter 190 value 564.423659
iter 200 value 564.292667
iter 210 value 564.182097
iter 220 value 564.073673
iter 230 value 564.018629
iter 240 value 563.910100
iter 250 value 563.838850
iter 260 value 563.789166
iter 270 value 563.717950
iter 280 value 563.668343
iter 290 value 563.645119
iter 300 value 563.619537
final value 563.619537
stopped after 300 iterations
# weights: 1510
initial value 2802.979759
iter 10 value 1196.501008
iter 20 value 913.911395
iter 30 value 804.490919
iter 40 value 711.485553
iter 50 value 654.871543
iter 60 value 632.152386
iter 70 value 621.495392
iter 80 value 615.311504
iter 90 value 611.198086
iter 100 value 609.090105
iter 110 value 607.560611
iter 120 value 605.869160
iter 130 value 603.463955
iter 140 value 602.292083
iter 150 value 601.879836
iter 160 value 601.588756
iter 170 value 601.340529
iter 180 value 601.207810
iter 190 value 601.114824
iter 200 value 600.999297
iter 210 value 600.782700
iter 220 value 600.599393
iter 230 value 600.378393
iter 240 value 600.231993
iter 250 value 600.120747
iter 260 value 600.043350
iter 270 value 600.000019
iter 280 value 599.951830
iter 290 value 599.894717
iter 300 value 599.842184
final value 599.842184
stopped after 300 iterations
# weights: 1510
initial value 14856.388037
iter 10 value 6204.200735
iter 20 value 3756.624764
iter 30 value 3196.988499
iter 40 value 2874.371097
iter 50 value 2695.617188
iter 60 value 2594.250974
iter 70 value 2504.135574
iter 80 value 2431.483354
iter 90 value 2388.413001
iter 100 value 2349.948863
iter 110 value 2322.434104
iter 120 value 2298.645554
iter 130 value 2278.917473
iter 140 value 2263.231760
iter 150 value 2250.019741
iter 160 value 2238.474576
iter 170 value 2228.508246
iter 180 value 2220.822370
iter 190 value 2215.633557
iter 200 value 2210.229227
iter 210 value 2202.448250
iter 220 value 2195.603458
iter 230 value 2190.342129
iter 240 value 2184.843778
iter 250 value 2180.554768
iter 260 value 2176.847913
iter 270 value 2173.618258
iter 280 value 2170.624899
iter 290 value 2167.427004
iter 300 value 2164.676386
final value 2164.676386
stopped after 300 iterations
# weights: 1510
initial value 14165.935593
iter 10 value 5746.559874
iter 20 value 3713.743844
iter 30 value 2966.391696
iter 40 value 2684.800362
iter 50 value 2560.624910
iter 60 value 2478.281106
iter 70 value 2395.022142
iter 80 value 2321.905133
iter 90 value 2281.815265
iter 100 value 2249.293021
iter 110 value 2219.745780
iter 120 value 2196.691388
iter 130 value 2180.746210
iter 140 value 2168.973238
iter 150 value 2160.549385
iter 160 value 2154.555491
iter 170 value 2149.803358
iter 180 value 2144.097372
iter 190 value 2139.587114
iter 200 value 2134.842284
iter 210 value 2130.334383
iter 220 value 2124.617937
iter 230 value 2120.187880
iter 240 value 2116.813244
iter 250 value 2114.848720
iter 260 value 2113.351543
iter 270 value 2111.829479
iter 280 value 2110.216555
iter 290 value 2108.165635
iter 300 value 2106.440977
final value 2106.440977
stopped after 300 iterations
# weights: 1510
initial value 13161.805032
iter 10 value 5768.201103
iter 20 value 3915.976675
iter 30 value 3261.216984
iter 40 value 3049.705016
iter 50 value 2730.548153
iter 60 value 2555.835557
iter 70 value 2484.112966
iter 80 value 2392.398612
iter 90 value 2334.611604
iter 100 value 2300.836341
iter 110 value 2272.861335
iter 120 value 2251.022639
iter 130 value 2232.272001
iter 140 value 2216.478870
iter 150 value 2201.476062
iter 160 value 2189.346137
iter 170 value 2179.050405
iter 180 value 2172.350759
iter 190 value 2168.329186
iter 200 value 2165.235664
iter 210 value 2162.928980
iter 220 value 2160.909086
iter 230 value 2158.264811
iter 240 value 2156.242336
iter 250 value 2154.174104
iter 260 value 2151.373707
iter 270 value 2149.720071
iter 280 value 2148.473546
iter 290 value 2147.357747
iter 300 value 2145.821039
final value 2145.821039
stopped after 300 iterations
# weights: 1510
initial value 26503.379979
iter 10 value 10628.303280
iter 20 value 7081.807392
iter 30 value 5916.309905
iter 40 value 5402.207533
iter 50 value 5058.756371
iter 60 value 4833.125574
iter 70 value 4690.070780
iter 80 value 4607.034552
iter 90 value 4555.558389
iter 100 value 4492.093718
iter 110 value 4447.044112
iter 120 value 4415.927287
iter 130 value 4378.905908
iter 140 value 4343.443519
iter 150 value 4309.850435
iter 160 value 4274.773583
iter 170 value 4238.383409
iter 180 value 4198.713163
iter 190 value 4167.130649
iter 200 value 4144.218970
iter 210 value 4126.370295
iter 220 value 4109.160823
iter 230 value 4087.733164
iter 240 value 4064.297313
iter 250 value 4044.426288
iter 260 value 4028.137060
iter 270 value 4016.476377
iter 280 value 4006.958397
iter 290 value 3998.368268
iter 300 value 3990.123621
final value 3990.123621
stopped after 300 iterations
# weights: 1510
initial value 32308.858163
iter 10 value 15984.108836
iter 20 value 9434.584349
iter 30 value 7716.123954
iter 40 value 6794.370700
iter 50 value 6105.193403
iter 60 value 5753.743633
iter 70 value 5451.267671
iter 80 value 5243.297364
iter 90 value 5005.466716
iter 100 value 4875.125826
iter 110 value 4783.364933
iter 120 value 4695.407978
iter 130 value 4635.274327
iter 140 value 4582.453403
iter 150 value 4525.760331
iter 160 value 4476.896131
iter 170 value 4437.253655
iter 180 value 4406.174676
iter 190 value 4375.017372
iter 200 value 4349.520394
iter 210 value 4317.538587
iter 220 value 4292.399336
iter 230 value 4268.317887
iter 240 value 4248.257757
iter 250 value 4229.286511
iter 260 value 4213.596588
iter 270 value 4196.365844
iter 280 value 4180.942912
iter 290 value 4165.238823
iter 300 value 4150.685336
final value 4150.685336
stopped after 300 iterations
# weights: 1510
initial value 28011.577632
iter 10 value 12000.606643
iter 20 value 7331.698935
iter 30 value 6027.776139
iter 40 value 5381.475872
iter 50 value 5021.984316
iter 60 value 4770.699281
iter 70 value 4614.987520
iter 80 value 4486.846474
iter 90 value 4392.306235
iter 100 value 4308.090884
iter 110 value 4231.811193
iter 120 value 4170.213732
iter 130 value 4122.667721
iter 140 value 4083.196671
iter 150 value 4047.690645
iter 160 value 4017.697563
iter 170 value 3994.035507
iter 180 value 3977.567081
iter 190 value 3964.202759
iter 200 value 3952.988618
iter 210 value 3943.508283
iter 220 value 3933.477389
iter 230 value 3925.220875
iter 240 value 3919.870931
iter 250 value 3915.071037
iter 260 value 3910.459970
iter 270 value 3906.714559
iter 280 value 3903.309702
iter 290 value 3899.822186
iter 300 value 3895.184365
final value 3895.184365
stopped after 300 iterations
Random Forest is a popular ensemble learning method used for both classification and regression tasks in machine learning. It belongs to the family of bagging algorithms, which combine multiple weak learners (usually decision trees) to create a strong and robust predictive model. The fundamental idea behind Random Forest is to aggregate the predictions of multiple individual models to achieve better generalization and reduce overfitting.
Random Forest creates multiple subsets (bootstrapped samples) of the original training data by randomly sampling with replacement, and each subset is used to train a separate decision tree. During the tree-building process, at each split the algorithm considers only a random subset of features rather than all the features; this randomness introduces diversity among the individual trees and reduces the correlation between them. For classification tasks, the final prediction is obtained by taking a majority vote across the individual trees; for regression tasks, the predictions are averaged.
Overall, Random Forest is a powerful and versatile algorithm that can be effective in a wide range of machine learning tasks. It is especially useful when dealing with complex data with many features, but it may not be the best choice for applications where interpretability is crucial or when dealing with extremely large datasets.
Advantages: High Accuracy: Random Forest often yields highly accurate predictions due to the combination of multiple decision trees, which reduces overfitting and improves generalization. Robustness: Random Forest is robust against noisy data and outliers since it aggregates the predictions from multiple models. Feature Importance: Random Forest can provide an estimate of feature importance, helping in feature selection and understanding the most influential variables in the data. No Data Splitting: Random Forest does not require a separate validation set for hyperparameter tuning, as it internally uses the out-of-bag samples (data not used during bootstrapped training) for validation. Parallelization: The training of individual trees can be done in parallel, making it computationally efficient for large datasets.
Disadvantages: Model Size: The ensemble of decision trees can lead to a large model size, especially when dealing with numerous trees or deep trees, which can increase memory requirements. Interpretability: While Random Forest can provide feature importance, the overall model can be challenging to interpret compared to individual decision trees. Bias in Imbalanced Datasets: Random Forest tends to favor the majority class in imbalanced datasets, which may require additional techniques to handle class imbalances. Hyperparameter Tuning: Although Random Forests have fewer hyperparameters to tune compared to individual decision trees, finding the optimal values for these hyperparameters can still be time-consuming.
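As a point of reference, a minimal sketch of a Random Forest fit with the randomForest package is shown below, reusing the hypothetical train_dat and test_dat data frames from the neural network sketch; the ntree and mtry values are illustrative defaults, not the settings used in the scoreboard runs.

library(randomForest)

# Minimal sketch (assumed names and illustrative parameter values).
fit_rf <- randomForest(label ~ ., data = train_dat,
                       ntree = 500,                # number of bootstrapped trees
                       mtry  = floor(sqrt(49)))    # features considered at each split

pred_rf <- predict(fit_rf, newdata = test_dat)     # majority vote over the trees
importance(fit_rf)                                 # per-pixel importance (mean decrease in Gini)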
K-Nearest Neighbors (KNN) is a simple and widely used supervised machine learning algorithm for both classification and regression tasks. In KNN, the output (class label or value) of an unseen data point is determined based on the majority class or average of its ‘k’ nearest neighbors in the feature space. The algorithm relies on the assumption that similar data points tend to have similar output values.
Advantages: Simple and Easy to Implement: KNN is straightforward to understand and implement, making it a good starting point for beginners in machine learning. No Training Phase: KNN is a lazy learning algorithm, meaning there is no explicit training phase. The model quickly adapts to new data as it arrives. Non-parametric: KNN makes no assumptions about the underlying data distribution, making it suitable for both linear and nonlinear relationships. No Model Complexity: Since KNN does not learn an explicit model, it can be more interpretable than complex models like neural networks. Versatile: KNN can handle multi-class classification, regression, and even outlier detection tasks.
Disadvantages: Computationally Intensive: Predicting the class or value for a new data point can be computationally expensive, especially with large datasets, as it requires calculating distances for all data points. Curse of Dimensionality: KNN’s performance can degrade when dealing with high-dimensional data, as the concept of distance becomes less meaningful in higher dimensions. Choosing the Right ‘k’: Selecting an appropriate value for ‘k’ is critical. A small ‘k’ can lead to noisy predictions, while a large ‘k’ may cause loss of local patterns. Sensitive to Outliers: KNN is sensitive to outliers since the nearest neighbors might be heavily influenced by them. Imbalanced Data: In classification tasks with imbalanced classes, KNN may be biased towards the majority class due to the voting mechanism.
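A minimal KNN sketch with the class package follows; because KNN has no training phase, the training pixels, test pixels, and training labels are all supplied at prediction time. Here train_x, test_x, and train_y are hypothetical objects (numeric pixel matrices and a factor of labels), and k = 5 is purely illustrative.

library(class)

# Minimal sketch (assumed names): train_x / test_x are numeric matrices of the
# 49 pixel columns; train_y is a factor of training labels.
pred_knn <- knn(train = train_x,
                test  = test_x,
                cl    = train_y,
                k     = 5)   # illustrative k; normally tuned by cross-validation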
Elastic Net is a linear regression model that combines L1 (Lasso) and L2 (Ridge) regularization to address multicollinearity and perform feature selection. It adds both the absolute value of the coefficients (L1 regularization) and the squared value of the coefficients (L2 regularization) to the loss function, allowing it to simultaneously perform variable selection and shrinkage.
Advantages: Elastic Net is particularly useful when dealing with datasets that have a large number of predictor variables and potential multicollinearity issues. It helps prevent overfitting, provides stable and interpretable model coefficients, and automatically selects relevant features by shrinking less important coefficients towards zero.
Disadvantages: While Elastic Net is effective in many cases, it may not perform well if the relationships between the predictors and the response are highly nonlinear. Additionally, finding the optimal hyperparameters for Elastic Net can be challenging.
Lasso, short for “Least Absolute Shrinkage and Selection Operator,” is a linear regression model that uses L1 regularization to perform feature selection and shrinkage. Lasso adds the absolute value of the coefficients to the loss function, penalizing large coefficient values and encouraging some coefficients to be exactly zero.
Advantages: Lasso is particularly useful when dealing with high-dimensional datasets, as it can handle large numbers of predictor variables and identify the most relevant features. It helps prevent overfitting by shrinking less important coefficients towards zero, leading to a simpler and more interpretable model. Additionally, the sparsity introduced by Lasso makes the model suitable for feature selection, as it naturally identifies and excludes irrelevant features from the model.
Disadvantages: Lasso may not perform well if the relationships between the predictors and the response variable are highly nonlinear. Furthermore, choosing the optimal regularization parameter (lambda) for Lasso can be challenging, similar to other regularization techniques, as it requires tuning hyperparameters.
Ridge Regression, also known as L2 regularization, is a linear regression technique that adds the squared values of the coefficients to the loss function. It is used to prevent overfitting and reduce the impact of multicollinearity in the dataset. Ridge Regression introduces a penalty term that encourages smaller coefficients, effectively shrinking them towards zero but not exactly to zero.
Advantages: Ridge Regression is beneficial when dealing with datasets that have multicollinearity issues, where predictor variables are highly correlated. The L2 regularization helps stabilize the model by shrinking correlated coefficients, reducing the impact of collinearity, and providing more reliable predictions. Ridge Regression can also handle high-dimensional datasets with many predictors, making it suitable for scenarios where feature selection is not the primary goal.
Disadvantages: Similar to Lasso and Elastic Net, Ridge Regression has hyperparameters that need to be tuned, such as the regularization parameter (lambda or alpha) and selecting the optimal value for the regularization parameter can be challenging, requiring cross-validation or other techniques. Additionally, Ridge Regression does not perform feature selection like Lasso does, as it rarely sets coefficients exactly to zero, which might be a limitation when feature reduction is desired.
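All three penalized regression models above can be fit in R with the glmnet package, where the alpha argument controls the mix of penalties: alpha = 1 gives the lasso, alpha = 0 gives ridge, and intermediate values give the elastic net. A minimal multinomial sketch under assumed names (train_x and test_x as pixel matrices, train_y as a factor of labels) is shown below; the alpha value is illustrative.

library(glmnet)

# Minimal sketch (assumed names and illustrative alpha).
cv_enet <- cv.glmnet(x = train_x, y = train_y,
                     family = "multinomial",   # multiclass image labels
                     alpha  = 0.5)             # 1 = lasso, 0 = ridge, 0.5 = elastic net mix

pred_enet <- predict(cv_enet, newx = test_x,
                     s    = "lambda.min",      # lambda with the lowest CV error
                     type = "class")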
XGBoost combines multiple decision trees to create an ensemble model that predicts which class an instance of the dataset belongs to. At each iteration, a new tree is built that tries to correct the mistakes of the previous ensemble using gradient descent. The final ensemble's prediction is the weighted sum of the predictions of the individual trees.
Advantages: XGBoost is highly optimized and efficient, and it has high predictive power because of its robust handling of different types of predictor variables, its regularization properties, and its tree-building algorithm.
Disadvantages: XGBoost can be slow to train, can overfit, and is more complex to tune because there are many hyperparameters that can be adjusted.
Details of selected parameters: The parameters were tuned using a sample size of 300 and are fitted here as the final model. nrounds - the number of boosting rounds, or trees, to fit. early_stopping_rounds - training stops if the validation error does not decrease for 10 consecutive rounds. objective - "multi:softmax" is used for multiclass classification problems. num_class - for multiclass problems, this is set to the number of classes in the dataset.
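A minimal sketch consistent with these parameter descriptions is shown below; train_x and test_x are hypothetical pixel matrices, train_y holds zero-based integer class labels (the xgboost convention), and the nrounds and num_class values are placeholders rather than the tuned values.

library(xgboost)

# Minimal sketch (assumed names and placeholder values).
dtrain <- xgb.DMatrix(data = train_x, label = train_y)   # labels coded 0, 1, 2, ...

fit_xgb <- xgb.train(
  params = list(objective = "multi:softmax",   # hard multiclass predictions
                num_class = 10),               # placeholder: number of classes in the data
  data = dtrain,
  nrounds = 100,                               # placeholder number of boosting rounds
  watchlist = list(train = dtrain),            # evaluation set required for early stopping
  early_stopping_rounds = 10,                  # stop after 10 rounds without improvement
  verbose = 0
)

pred_xgb <- predict(fit_xgb, newdata = test_x)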
SVM separates the classes with hyperplanes. The model tries to find the hyperplane with the maximum margin, i.e., the maximum distance to the nearest data points of each class.
Advantages: SVM works well with high-dimensional data and is efficient.
Disadvantages: Choosing a kernel function can be tricky, and the results are sensitive to the choice of kernel parameters. Noise and overlapping classes can also hurt the performance of SVM.
Details of selected parameters: The parameters were tuned using a sample size of 300 and are fitted here as the final model. kernel - the radial basis function kernel is one of the most popular choices, as it can handle complex, nonlinear classification problems. type - 'C-classification' refers to the type of SVM used for classification tasks. gamma - it defines how far the influence of a single training example reaches; the lower it is, the farther that influence extends. cost - a larger cost creates a narrower margin so as not to misclassify training data, but it might also overfit the data.
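A minimal sketch matching these choices with the e1071 package follows, again using the hypothetical train_dat and test_dat data frames; the gamma and cost values are illustrative, not the tuned values from the 300-sample runs.

library(e1071)

# Minimal sketch (assumed names and illustrative gamma / cost).
fit_svm <- svm(label ~ ., data = train_dat,
               type   = "C-classification",
               kernel = "radial",   # radial basis function kernel
               gamma  = 0.01,       # reach of each training example
               cost   = 1)          # margin width vs. misclassification trade-off

pred_svm <- predict(fit_svm, newdata = test_dat)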
Rpart works by recursively partitioning the data based on certain criteria to create a decision tree.
Advantages: The tree structure and decision rules are easy to understand and provide information about feature importance, i.e., which variables are most influential in making predictions.
Disadvantages: Decision trees can easily overfit the data. It is also very sensitive to changes in the data. A small change can result in a very different tree.
Details of selected parameters: method - the type of prediction problem to solve; "class" indicates a classification problem. control - the complexity parameter (cp) controls the size of the decision tree and helps prevent overfitting; the tree-building process stops if splitting a node would improve the model's fit by less than 0.01.
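A minimal sketch of the corresponding rpart call is shown below, using the same hypothetical train_dat and test_dat data frames; cp = 0.01 mirrors the stopping rule described above.

library(rpart)

# Minimal sketch (assumed names).
fit_tree <- rpart(label ~ ., data = train_dat,
                  method  = "class",                   # classification problem
                  control = rpart.control(cp = 0.01))  # stop when a split improves fit by < 0.01

pred_tree <- predict(fit_tree, newdata = test_dat, type = "class")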
An overall scoreboard of average results for the 30 combinations of Model and Sample Size:
Model Sample_Size A B C Points
1: model1 1000 0.0167 0.0288 0.2079 0.1613
2: model1 5000 0.0833 0.1397 0.1752 0.1579
3: model1 10000 0.1667 0.3403 0.1667 0.1841
4: model2 1000 0.0167 0.2007 0.2040 0.1756
5: model2 5000 0.0833 0.4016 0.1601 0.1727
6: model2 10000 0.1667 0.8085 0.1508 0.2189
7: model3 1000 0.0167 0.0170 0.2145 0.1651
8: model3 5000 0.0833 0.1036 0.1735 0.1530
9: model3 10000 0.1667 0.2634 0.1593 0.1708
10: model4 1000 0.0167 0.0146 0.2716 0.2077
11: model4 5000 0.0833 0.0570 0.2150 0.1794
12: model4 10000 0.1667 0.1472 0.1921 0.1838
13: model5 1000 0.0167 0.5259 0.2115 0.2137
14: model5 5000 0.0833 0.0357 0.1943 0.1618
15: model5 10000 0.1667 1.0000 0.1870 0.2653
16: model6 1000 0.0167 0.6984 0.2276 0.2430
17: model6 5000 0.0833 0.3654 0.1942 0.1947
18: model6 10000 0.1667 1.0000 0.1857 0.2643
19: model7 1000 0.0167 0.0933 0.2335 0.1869
20: model7 5000 0.0833 0.4601 0.2244 0.2268
21: model7 10000 0.1667 0.0187 0.2227 0.1939
22: model8 1000 0.0167 0.0566 0.2168 0.1708
23: model8 5000 0.0833 0.4565 0.1626 0.1801
24: model8 10000 0.1667 0.9929 0.1456 0.2335
25: model9 1000 0.0167 0.0228 0.2321 0.1789
26: model9 5000 0.0833 0.1563 0.1752 0.1595
27: model9 10000 0.1667 0.3846 0.1564 0.1807
28: model10 1000 0.0167 0.0041 0.3583 0.2716
29: model10 5000 0.0833 0.0174 0.3430 0.2715
30: model10 10000 0.1667 0.0308 0.3441 0.2862
Model Sample_Size A B C Points
Based on the findings from our dataset and the training of 90 models across the different modeling methods, the neural network, random forest, support vector machine (SVM), and XGBoost models achieved higher classification accuracy while taking less time to run. Several factors could explain this observation.
It’s important to note that the superiority of these models is based on our specific dataset and the experimental setup used in training the 90 models. Different datasets may yield different results, and the performance of each model can be influenced by factors like hyperparameter tuning and data preprocessing. Therefore, it’s always a good practice to experiment with multiple models and assess their performance on different datasets to select the most suitable model for a given task.
While we have fine-tuned some of the models, including the neural network and XGBoost, we expect to further adjust parameter settings to achieve better performance in the future.
The five Model and Sample Size combinations with the lowest Points values were:

Model Sample_Size A B C Points
1: model3 5000 0.0833 0.1036 0.1735 0.1530
2: model1 5000 0.0833 0.1397 0.1752 0.1579
3: model9 5000 0.0833 0.1563 0.1752 0.1595
4: model1 1000 0.0167 0.0288 0.2079 0.1613
5: model5 5000 0.0833 0.0357 0.1943 0.1618