Anitta Varghese, Arun Palani, Balasameeksha Pulaparthi, Harinishree Balaji, Hyndavi Nidadavolu
2023-12-06
We have covered data wrangling, data exploration and insights generation in this section.
We used three types of data in our project: user data, posts data, and tags data.
Each of the three datasets contains ~42,000 instances.
EDA on user data set:
Boxplot of reputation by user type
Word cloud of user’s locations
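As an illustration of the boxplot of reputation by user type, here is a minimal sketch in base R. It uses only the ten user rows printed in the raw user data of this report, so it does not reproduce the full figure. The data frame name `users` and the plot labels are assumptions made for this example.

```r
# The first ten user rows shown in the raw user data printout of this report
users <- data.frame(
  reputation = c(1437267, 1299740, 1248477, 1088739, 1064731,
                 1042332, 1033296, 1027164, 990863, 960187),
  user_type = c("registered", "registered", "registered", "registered",
                "moderator", "registered", "registered", "registered",
                "registered", "registered")
)

# Draw one box per user type; boxplot() also returns the plotted statistics
bp <- boxplot(reputation ~ user_type, data = users,
              main = "Reputation by user type",
              xlab = "User type", ylab = "Reputation")

# Group labels and the number of observations behind each box
bp$names
bp$n
```

With so few rows the moderator group holds a single user, so its box collapses to a line; the full dataset would give a more informative comparison.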
EDA on Tags data set:
Word cloud of popularly viewed topics:
EDA on Post data set:
Bar chart of Post Type based on their count
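A bar chart of post types by count can be sketched in base R as follows. The vector holds only the `post_type` values of the ten post rows printed in the raw posts data of this report, so the counts are for that sample, not for the full dataset. The variable names and plot labels are assumptions made for this example.

```r
# post_type values of the first ten post rows shown in this report
post_type <- c("question", "question", "question", "answer", "question",
               "answer", "question", "question", "answer", "question")

# Count how many posts fall into each type
type_counts <- table(post_type)

# Draw one bar per post type
barplot(type_counts,
        main = "Post type by count",
        xlab = "Post type", ylab = "Number of posts")
```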
Raw datasets:
1. User data:
## X account_id is_employee last_modified_date last_access_date
## 1 1 11683 FALSE 2023-11-29 16:05:26 2023-12-05 12:48:46
## 2 2 4243 FALSE 2023-12-01 18:10:12 2023-12-05 12:04:21
## 3 3 1165580 FALSE 2023-12-04 06:55:00 2023-06-16 18:48:45
## 4 4 52822 FALSE 2023-12-01 08:05:00 2023-12-05 13:03:28
## 5 5 35417 FALSE 2023-12-02 16:25:45 2023-12-04 14:31:52
## 6 6 52616 FALSE 2023-12-05 09:00:11 2023-12-05 13:18:06
## 7 7 11975 FALSE 2023-10-30 14:00:09 2023-12-05 12:32:07
## 8 8 14332 FALSE 2023-11-25 11:05:10 2023-12-05 09:52:02
## 9 9 39846 FALSE 2023-12-02 00:45:07 2023-12-05 12:11:53
## 10 10 680 FALSE 2023-12-02 18:55:06 2023-12-03 17:42:17
## reputation_change_year reputation_change_quarter reputation_change_month
## 1 58506 9791 795
## 2 119986 26718 1507
## 3 26699 4271 250
## 4 25359 4189 248
## 5 59341 11156 680
## 6 42785 7819 508
## 7 31394 4926 243
## 8 17234 2866 144
## 9 18766 3374 270
## 10 37449 6200 466
## reputation_change_week reputation_change_day reputation creation_date
## 1 525 215 1437267 2008-09-26 08:05:05
## 2 846 315 1299740 2008-09-13 18:22:33
## 3 170 70 1248477 2012-01-11 14:53:57
## 4 180 70 1088739 2009-08-17 12:42:02
## 5 470 80 1064731 2009-05-03 10:53:57
## 6 323 90 1042332 2009-08-16 07:00:22
## 7 175 80 1033296 2008-09-29 01:46:02
## 8 96 18 1027164 2008-10-19 12:07:47
## 9 140 30 990863 2009-05-31 12:20:08
## 10 238 120 960187 2008-08-10 04:27:00
## user_type user_id accept_rate location
## 1 registered 22656 86 Reading, United Kingdom
## 2 registered 6309 100 France
## 3 registered 1144035 NA New York, United States
## 4 registered 157882 93 Willemstad, Curaçao
## 5 moderator 100297 NA Cambridge, UK
## 6 registered 157247 91 United Kingdom
## 7 registered 23354 100 Forest of Dean, United Kingdom
## 8 registered 29407 86 Sofia, Bulgaria
## 9 registered 115145 84 Pennsylvania, United States
## 10 registered 893 84 Christchurch, New Zealand
## website_url
## 1 http://csharpindepth.com
## 2 https://devstory.fyi/vonc
## 3 http://www.data-miners.com
## 4 https://balusc.omnifaces.org
## 5 http://www.zopatista.com/
## 6 https://thenewtoys.dev
## 7 http://blog.marcgravell.com
## 8 http://stackoverflow.com/search?q=user%3a29407&tab=newest
## 9 https://commonsware.com
## 10 https://hewgill.com
## link
## 1 https://stackoverflow.com/users/22656/jon-skeet
## 2 https://stackoverflow.com/users/6309/vonc
## 3 https://stackoverflow.com/users/1144035/gordon-linoff
## 4 https://stackoverflow.com/users/157882/balusc
## 5 https://stackoverflow.com/users/100297/martijn-pieters
## 6 https://stackoverflow.com/users/157247/t-j-crowder
## 7 https://stackoverflow.com/users/23354/marc-gravell
## 8 https://stackoverflow.com/users/29407/darin-dimitrov
## 9 https://stackoverflow.com/users/115145/commonsware
## 10 https://stackoverflow.com/users/893/greg-hewgill
## profile_image
## 1 https://www.gravatar.com/avatar/6d8ebb117e8d83d74ea95fbdd0f87e13?s=256&d=identicon&r=PG
## 2 https://i.stack.imgur.com/I4fiW.jpg?s=256&g=1
## 3 https://www.gravatar.com/avatar/e514b017977ebf742a418cac697d8996?s=256&d=identicon&r=PG
## 4 https://www.gravatar.com/avatar/89927e2f4bde24991649b353a37678b9?s=256&d=identicon&r=PG
## 5 https://www.gravatar.com/avatar/24780fb6df85a943c7aea0402c843737?s=256&d=identicon&r=PG
## 6 https://i.stack.imgur.com/lUM5Z.jpg?s=256&g=1
## 7 https://i.stack.imgur.com/CrVFH.png?s=256&g=1
## 8 https://www.gravatar.com/avatar/e3a181e9cdd4757a8b416d93878770c5?s=256&d=identicon&r=PG
## 9 https://i.stack.imgur.com/wDnd8.png?s=256&g=1
## 10 https://www.gravatar.com/avatar/747ffa5da3538e66840ebc0548b8fd58?s=256&d=identicon&r=PG
## display_name badge_counts_bronze badge_counts_silver badge_counts_gold
## 1 Jon Skeet 9216 9166 873
## 2 VonC 5320 4479 533
## 3 Gordon Linoff 797 652 58
## 4 BalusC 3563 3623 374
## 5 Martijn Pieters 3375 4102 305
## 6 T.J. Crowder 1888 1940 188
## 7 Marc Gravell 2914 2582 267
## 8 Darin Dimitrov 2934 3296 273
## 9 CommonsWare 2514 2404 191
## 10 Greg Hewgill 1288 1155 185
## tags1 tags2
## 1 firebase-app-distribution google-cloud-profiler
## 2 topic-modeling spacy
## 3
## 4
## 5
## 6
## 7 azure-sdk-.net azure-http-trigger
## 8
## 9 ios android
## 10
## tags3 tags4 tags5
## 1 google-cloud-networking dialogflow-es-fulfillment google-cloud-tools
## 2 nlp-question-answering gensim tf-idf
## 3
## 4
## 5
## 6
## 7 azure-machine-learning-service azure-data-lake-gen2 azure-bastion
## 8
## 9 <NA> <NA> <NA>
## 10
## tags6 tags7 tags8
## 1 google-cloud-dns google-cloud-internal-load-balancer firebase-hosting
## 2 spacy-3 word-embedding word2vec
## 3
## 4
## 5
## 6
## 7 azure-ad-v2 azure-managed-database azure-notebooks
## 8
## 9 <NA> <NA> <NA>
## 10
## tags9 tags10
## 1 looker-studio google-cloud-node
## 2 nltk sentiment-analysis
## 3
## 4
## 5
## 6
## 7 microsoft-entra-internet-access azure-integration-account
## 8
## 9 <NA> <NA>
## 10
## tags11 tags12 tags13
## 1 google-cloud-spanner firebase-invites google-cloud-launcher
## 2 stanford-nlp bert-language-model nlp
## 3
## 4
## 5
## 6
## 7 azure-container-apps azure-emulator azure-managed-disk
## 8
## 9 <NA> <NA> <NA>
## 10
## tags14 tags15
## 1 google-cloud-functions google-container-builder
## 2 huggingface-transformers named-entity-recognition
## 3
## 4
## 5
## 6
## 7 azure-log-analytics-workspace azure-ad-domain-services
## 8
## 9 <NA> <NA>
## 10
## tags16 tags17
## 1 google-cloud-memorystore firebasesimplelogin
## 2 opennlp r-package
## 3
## 4
## 5
## 6
## 7 azure-web-app-firewall azure-dashboard
## 8
## 9 <NA> <NA>
## 10
2. Posts data:
## X post_state score last_activity_date creation_date post_type
## 1 1 Published 0 2023-12-05 13:15:43 2023-12-05 13:15:43 question
## 2 2 Published 0 2023-12-05 13:15:42 2023-12-05 12:52:39 question
## 3 3 Published 1 2023-12-05 13:15:39 2023-12-05 12:56:06 question
## 4 4 Published 0 2023-12-05 13:15:22 2023-11-29 13:40:17 answer
## 5 5 Published 1 2023-12-05 13:15:22 2019-05-06 10:14:02 question
## 6 6 Published 0 2023-12-05 13:15:19 2023-12-05 13:10:04 answer
## 7 7 Published 0 2023-12-05 13:15:19 2023-12-05 12:44:08 question
## 8 8 Published 1 2023-12-05 13:15:17 2022-04-06 09:34:16 question
## 9 9 Published 0 2023-12-05 13:15:02 2023-05-24 12:02:10 answer
## 10 10 Published 0 2023-12-05 13:15:02 2023-05-24 12:02:10 question
## post_id content_license link
## 1 77608436 CC BY-SA 4.0 https://stackoverflow.com/q/77608436
## 2 77608307 CC BY-SA 4.0 https://stackoverflow.com/q/77608307
## 3 77608326 CC BY-SA 4.0 https://stackoverflow.com/q/77608326
## 4 77573743 CC BY-SA 4.0 https://stackoverflow.com/a/77573743
## 5 56006978 CC BY-SA 4.0 https://stackoverflow.com/q/56006978
## 6 77608401 CC BY-SA 4.0 https://stackoverflow.com/a/77608401
## 7 77608262 CC BY-SA 4.0 https://stackoverflow.com/q/77608262
## 8 71767750 CC BY-SA 4.0 https://stackoverflow.com/q/71767750
## 9 76325447 CC BY-SA 4.0 https://stackoverflow.com/a/76325447
## 10 76325446 CC BY-SA 4.0 https://stackoverflow.com/q/76325446
## last_edit_date owner_account_id owner_reputation owner_user_id
## 1 <NA> 6150674 271 4795151
## 2 2023-12-05 13:15:42 239885 1393 509770
## 3 2023-12-05 13:15:39 452355 5669 848746
## 4 2023-12-05 13:15:22 12939096 1 9355485
## 5 2019-05-06 13:04:35 1344154 365 1285210
## 6 2023-12-05 13:15:19 12513488 950 9107694
## 7 2023-12-05 12:46:52 30074202 1 23047467
## 8 2023-12-05 13:15:17 14221191 53 10273462
## 9 2023-12-05 13:15:02 392777 3063 754254
## 10 <NA> 392777 3063 754254
## owner_user_type owner_accept_rate
## 1 registered 23
## 2 registered NA
## 3 registered 83
## 4 registered NA
## 5 registered 100
## 6 registered NA
## 7 registered NA
## 8 registered NA
## 9 registered 67
## 10 registered 67
## owner_profile_image
## 1 https://www.gravatar.com/avatar/7a76cce42f2ef8d5d54fbaf45ce6c8db?s=256&d=identicon&r=PG&f=y&so-version=2
## 2 https://i.stack.imgur.com/lNymg.png?s=256&g=1
## 3 https://www.gravatar.com/avatar/53f8a63ef3f02b3d02788f4a90fff3e3?s=256&d=identicon&r=PG
## 4 https://www.gravatar.com/avatar/11b0ca0f851786415f8d974048810ad6?s=256&d=identicon&r=PG&f=y&so-version=2
## 5 https://i.stack.imgur.com/0GaVG.jpg?s=256&g=1
## 6 https://www.gravatar.com/avatar/c8df0972eede360961e46d3df9a29b48?s=256&d=identicon&r=PG&f=y&so-version=2
## 7 https://i.stack.imgur.com/8hITY.jpg?s=256&g=1
## 8 https://www.gravatar.com/avatar/cbc92cae95fc36e52d7f73e2862be862?s=256&d=identicon&r=PG&f=y&so-version=2
## 9 https://www.gravatar.com/avatar/e40fffca9788769cd8ab1a9cd29c623f?s=256&d=identicon&r=PG&f=y&so-version=2
## 10 https://www.gravatar.com/avatar/e40fffca9788769cd8ab1a9cd29c623f?s=256&d=identicon&r=PG&f=y&so-version=2
## owner_display_name
## 1 atul mishra
## 2 jaybro
## 3 AJW
## 4 Rustamjon Akhmedov
## 5 DudiDude
## 6 SyndRain
## 7 yuyibruh
## 8 liner183
## 9 Felipe
## 10 Felipe
## owner_link
## 1 https://stackoverflow.com/users/4795151/atul-mishra
## 2 https://stackoverflow.com/users/509770/jaybro
## 3 https://stackoverflow.com/users/848746/ajw
## 4 https://stackoverflow.com/users/9355485/rustamjon-akhmedov
## 5 https://stackoverflow.com/users/1285210/dudidude
## 6 https://stackoverflow.com/users/9107694/syndrain
## 7 https://stackoverflow.com/users/23047467/yuyibruh
## 8 https://stackoverflow.com/users/10273462/liner183
## 9 https://stackoverflow.com/users/754254/felipe
## 10 https://stackoverflow.com/users/754254/felipe
## posted_by_collectives collectives
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
3. Tags data:
## X has_synonyms is_moderator_only is_required count name
## 1 1 TRUE FALSE FALSE 2519743 javascript
## 2 2 TRUE FALSE FALSE 2176854 python
## 3 3 TRUE FALSE FALSE 1912027 java
## 4 4 TRUE FALSE FALSE 1608171 c#
## 5 5 TRUE FALSE FALSE 1462912 php
## 6 6 TRUE FALSE FALSE 1413467 android
## 7 7 TRUE FALSE FALSE 1184004 html
## 8 8 TRUE FALSE FALSE 1034750 jquery
## 9 9 TRUE FALSE FALSE 802627 c++
## 10 10 TRUE FALSE FALSE 801143 css
## collectives
## 1
## 2
## 3
## 4
## 5 list("php"), list(list(type = "support", link = "https://stackoverflow.com/contact?topic=15")), A collective where developers working with PHP can learn and connect about the open source scripting language., /collectives/php, PHP, php
## 6 list(c("ios", "android")), list(list(type = "support", link = "https://stackoverflow.com/contact?topic=15")), A collective for developers who want to share their knowledge and learn more about mobile development practices and platforms, /collectives/mobile-dev, Mobile Development, mobile-dev
## 7
## 8
## 9
## 10
Steps:
1. User data:
## X account_id is_employee last_modified_date last_access_date
## 1 1 11683 FALSE 2023-11-29 16:05:26 2023-12-05 12:48:46
## 2 2 4243 FALSE 2023-12-01 18:10:12 2023-12-05 12:04:21
## 3 4 52822 FALSE 2023-12-01 08:05:00 2023-12-05 13:03:28
## 4 6 52616 FALSE 2023-12-05 09:00:11 2023-12-05 13:18:06
## 5 7 11975 FALSE 2023-10-30 14:00:09 2023-12-05 12:32:07
## 6 8 14332 FALSE 2023-11-25 11:05:10 2023-12-05 09:52:02
## reputation_change_year reputation_change_quarter reputation_change_month
## 1 58506 9791 795
## 2 119986 26718 1507
## 3 25359 4189 248
## 4 42785 7819 508
## 5 31394 4926 243
## 6 17234 2866 144
## reputation_change_week reputation_change_day reputation creation_date
## 1 525 215 1437267 2008-09-26 08:05:05
## 2 846 315 1299740 2008-09-13 18:22:33
## 3 180 70 1088739 2009-08-17 12:42:02
## 4 323 90 1042332 2009-08-16 07:00:22
## 5 175 80 1033296 2008-09-29 01:46:02
## 6 96 18 1027164 2008-10-19 12:07:47
## user_type user_id accept_rate location
## 1 registered 22656 86 Reading, United Kingdom
## 2 registered 6309 100 France
## 3 registered 157882 93 Willemstad, Curaçao
## 4 registered 157247 91 United Kingdom
## 5 registered 23354 100 Forest of Dean, United Kingdom
## 6 registered 29407 86 Sofia, Bulgaria
## website_url
## 1 http://csharpindepth.com
## 2 https://devstory.fyi/vonc
## 3 https://balusc.omnifaces.org
## 4 https://thenewtoys.dev
## 5 http://blog.marcgravell.com
## 6 http://stackoverflow.com/search?q=user%3a29407&tab=newest
## link
## 1 https://stackoverflow.com/users/22656/jon-skeet
## 2 https://stackoverflow.com/users/6309/vonc
## 3 https://stackoverflow.com/users/157882/balusc
## 4 https://stackoverflow.com/users/157247/t-j-crowder
## 5 https://stackoverflow.com/users/23354/marc-gravell
## 6 https://stackoverflow.com/users/29407/darin-dimitrov
## profile_image
## 1 https://www.gravatar.com/avatar/6d8ebb117e8d83d74ea95fbdd0f87e13?s=256&d=identicon&r=PG
## 2 https://i.stack.imgur.com/I4fiW.jpg?s=256&g=1
## 3 https://www.gravatar.com/avatar/89927e2f4bde24991649b353a37678b9?s=256&d=identicon&r=PG
## 4 https://i.stack.imgur.com/lUM5Z.jpg?s=256&g=1
## 5 https://i.stack.imgur.com/CrVFH.png?s=256&g=1
## 6 https://www.gravatar.com/avatar/e3a181e9cdd4757a8b416d93878770c5?s=256&d=identicon&r=PG
## display_name badge_counts_bronze badge_counts_silver badge_counts_gold
## 1 Jon Skeet 9216 9166 873
## 2 VonC 5320 4479 533
## 3 BalusC 3563 3623 374
## 4 T.J. Crowder 1888 1940 188
## 5 Marc Gravell 2914 2582 267
## 6 Darin Dimitrov 2934 3296 273
## tags1 tags2 tags3 tags4 tags5 tags6 tags7 tags8 tags9 tags10 tags11 tags12
## 1 5 7 5 6 7 7 7 6 6 6 8 7
## 2 10 9 8 7 10 11 11 11 8 8 11 6
## 3 1 1 1 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1
## 5 3 4 3 3 3 3 3 4 7 3 4 4
## 6 1 1 1 1 1 1 1 1 1 1 1 1
## tags13 tags14 tags15 tags16 tags17
## 1 7 7 7 6 7
## 2 9 9 9 9 13
## 3 1 1 1 1 1
## 4 1 1 1 1 1
## 5 4 4 3 4 5
## 6 1 1 1 1 1
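The cleaned printout above is consistent with two steps: rows containing missing values have been dropped (the users with an `NA` accept rate no longer appear), and the tag columns now hold integer codes instead of text. The report does not show the code that did this, so the following is only a sketch of one way to get that result, assuming `na.omit()` and factor encoding were used. The toy data frame reuses a few values from the raw user data, and the resulting codes differ from the printout because the full dataset has many more tag levels.

```r
# A few columns from the first eight raw user rows shown in this report
users <- data.frame(
  account_id  = c(11683, 4243, 1165580, 52822, 35417, 52616, 11975, 14332),
  accept_rate = c(86, 100, NA, 93, NA, 91, 100, 86),
  tags1       = c("firebase-app-distribution", "topic-modeling", "", "", "",
                  "", "azure-sdk-.net", "")
)

# Step 1 (assumed): drop every row that contains a missing value
users_clean <- na.omit(users)

# Step 2 (assumed): replace each tag column with its integer factor codes;
# the empty string sorts first, so users without a tag get code 1
tag_cols <- grep("^tags", names(users_clean), value = TRUE)
users_clean[tag_cols] <- lapply(users_clean[tag_cols],
                                function(x) as.integer(factor(x)))

users_clean
```

Integer codes of this kind impose an arbitrary ordering on the tags, which tree-based models tolerate better than models that treat the codes as magnitudes.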
This section focuses on the machine learning models and text mining.
We are implementing 2 modeling techniques as below:
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 20 hours 35 minutes
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.42.0.2
## H2O cluster version age: 4 months and 11 days
## H2O cluster name: H2O_started_from_R_arun_hmk386
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.96 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.3.1 (2023-06-16)
XAI Model 1 - Variable Importance of RF Model
# Load the saved Random Forest model
rf_model <- h2o.loadModel("Group4-StackExchange_RF_model.h20")
# Extract the variable importance table and plot it
var_imp <- h2o.varimp(rf_model)
h2o.varimp_plot(rf_model)
XAI Model 2 - Variable Importance of Deep Learning Model
The top three most important variables are:
- badge_counts_silver
- badge_counts_bronze
- badge_counts_gold

- Random Forest: The variable importance scores for the badge count variables are all relatively high, while the scores for the tag variables are all relatively low. This suggests that badge counts are more important than tags for predicting the Direct Referral Index.
- Deep Learning: The variable importance graph indicates that badge_counts_silver has the highest variable importance.

Overall Findings:
- Badge counts are often used to identify users who are engaged and active in the community. These users are more likely to be knowledgeable and experienced, and they are also more likely to be social and influential.
- Accordingly, users who are new to the community should focus on earning their first few badges as quickly as possible to increase their DRF.
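
A ranking of the most important variables can be read from a variable importance table of the kind `h2o.varimp()` returns, which has the columns `variable`, `relative_importance`, `scaled_importance` and `percentage`. The sketch below mimics that layout with a plain data frame so it runs without an H2O cluster. The importance numbers are hypothetical and chosen only for illustration; they are not the report's results.

```r
# Hypothetical importance values for illustration only (not the report's results)
var_imp <- data.frame(
  variable = c("tags1", "badge_counts_gold", "badge_counts_silver",
               "tags2", "badge_counts_bronze"),
  relative_importance = c(5, 60, 100, 5, 80)
)

# Scale so the top variable equals 1, and express each value as a share of the total
var_imp$scaled_importance <- var_imp$relative_importance /
  max(var_imp$relative_importance)
var_imp$percentage <- var_imp$relative_importance /
  sum(var_imp$relative_importance)

# Sort from most to least important and keep the three highest-ranked variables
var_imp <- var_imp[order(-var_imp$relative_importance), ]
top3 <- head(var_imp$variable, 3)
top3
```

With a real model, the same `head()` call on the `variable` column of the `h2o.varimp()` result should give the top three directly, since that table is already sorted by importance.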