Group4 - Stack Exchange Predictive Analysis

Anitta Varghese, Arun Palani, Balasameeksha Pulaparthi, Harinishree Balaji, Hyndavi Nidadavolu

2023-12-06

Data Analytics Plan:

Key Peer Comments:

  1. Diversified Model Techniques:
    • Use ensemble methods (Random Forest, GBMs) alongside regression for robust predictions.
  2. Handling Code and Non-Textual Data:
    • Apply text-processing tools like tidytext and stringr, plus lubridate for date fields, to process mixed content effectively.
  3. Data Extraction from Stack Exchange API:
    • Extract varied data types from Stack Exchange API for predictive analysis.
  4. Random Forest and Deep Learning Usage:
    • Plan to employ Random Forest and Deep Learning models for unspecified tasks.
  5. Label Encoding for Tags:
    • Apply label encoding to tags during data pre-processing, but methodology details are lacking.

Data Summary, Exploration and discussion:

We have covered data wrangling, data exploration and insights generation in this section.

There are 3 types of data that we have used in our project:

  1. Posts data
  2. Users data
  3. Tags data

Each of the above 3 datasets contains ~42,000 instances.
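As a rough illustration of how about 42,000 rows per dataset could be requested from the Stack Exchange API, the sketch below builds paged request URLs. The helper name `se_api_url`, the page size of 100, and the sort orders are assumptions chosen to match the ordering visible in the raw data; they are not taken from the project code, and the sketch only builds URLs without sending requests.

```r
# Sketch: build paged request URLs for the Stack Exchange API (v2.3).
# The helper name and its default arguments are illustrative assumptions.
se_api_url <- function(endpoint, page, sort, site = "stackoverflow",
                       pagesize = 100) {
  sprintf(
    "https://api.stackexchange.com/2.3/%s?page=%d&pagesize=%d&order=desc&sort=%s&site=%s",
    endpoint, as.integer(page), as.integer(pagesize), sort, site
  )
}

# Pages needed for roughly 42,000 rows at 100 rows per page
n_pages <- ceiling(42000 / 100)

# sprintf() is vectorised over `page`, so each call returns one URL per page
user_urls <- se_api_url("users", seq_len(n_pages), sort = "reputation")
tag_urls  <- se_api_url("tags",  seq_len(n_pages), sort = "popular")
post_urls <- se_api_url("posts", seq_len(n_pages), sort = "activity")

head(user_urls, 2)
```

Each URL would then be fetched and parsed as JSON; request quotas apply, so a real extraction would likely need throttling and an API key.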

EDA on user data set:

Boxplot of reputation by user type
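A minimal sketch of how such a boxplot could be drawn in base R, using only the ten user rows printed in the raw data section. The data frame name `users_sample` and the log-scaled axis are assumptions; the original figure may have been produced differently (for example with ggplot2).

```r
# Reputation and user type for the ten users shown in the raw data printout
users_sample <- data.frame(
  reputation = c(1437267, 1299740, 1248477, 1088739, 1064731,
                 1042332, 1033296, 1027164, 990863, 960187),
  user_type  = c("registered", "registered", "registered", "registered",
                 "moderator", "registered", "registered", "registered",
                 "registered", "registered")
)

# One box per user type; a log axis keeps heavy-tailed reputation readable
bp <- boxplot(reputation ~ user_type, data = users_sample, log = "y",
              xlab = "User type", ylab = "Reputation (log scale)",
              main = "Reputation by user type")

# Number of observations behind each box
setNames(bp$n, bp$names)
```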

Word cloud of users' locations

EDA on Tags data set:

Word cloud of the most popular topics:

EDA on Post data set:

Bar chart of post types by count
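The counts behind a chart like this can be obtained with `table()`. The sketch below uses the `post_type` values of the ten posts printed in the raw data section, so the counts are for that sample only, not for the full dataset.

```r
# post_type values of the ten posts shown in the raw data printout
post_type <- c("question", "question", "question", "answer", "question",
               "answer", "question", "question", "answer", "question")

# Count posts per type, largest first, and draw the bars
type_counts <- sort(table(post_type), decreasing = TRUE)
barplot(type_counts, xlab = "Post type", ylab = "Number of posts",
        main = "Posts by type")

type_counts
```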

Datasets:

Raw datasets:

1. User data:

##     X account_id is_employee  last_modified_date    last_access_date
## 1   1      11683       FALSE 2023-11-29 16:05:26 2023-12-05 12:48:46
## 2   2       4243       FALSE 2023-12-01 18:10:12 2023-12-05 12:04:21
## 3   3    1165580       FALSE 2023-12-04 06:55:00 2023-06-16 18:48:45
## 4   4      52822       FALSE 2023-12-01 08:05:00 2023-12-05 13:03:28
## 5   5      35417       FALSE 2023-12-02 16:25:45 2023-12-04 14:31:52
## 6   6      52616       FALSE 2023-12-05 09:00:11 2023-12-05 13:18:06
## 7   7      11975       FALSE 2023-10-30 14:00:09 2023-12-05 12:32:07
## 8   8      14332       FALSE 2023-11-25 11:05:10 2023-12-05 09:52:02
## 9   9      39846       FALSE 2023-12-02 00:45:07 2023-12-05 12:11:53
## 10 10        680       FALSE 2023-12-02 18:55:06 2023-12-03 17:42:17
##    reputation_change_year reputation_change_quarter reputation_change_month
## 1                   58506                      9791                     795
## 2                  119986                     26718                    1507
## 3                   26699                      4271                     250
## 4                   25359                      4189                     248
## 5                   59341                     11156                     680
## 6                   42785                      7819                     508
## 7                   31394                      4926                     243
## 8                   17234                      2866                     144
## 9                   18766                      3374                     270
## 10                  37449                      6200                     466
##    reputation_change_week reputation_change_day reputation       creation_date
## 1                     525                   215    1437267 2008-09-26 08:05:05
## 2                     846                   315    1299740 2008-09-13 18:22:33
## 3                     170                    70    1248477 2012-01-11 14:53:57
## 4                     180                    70    1088739 2009-08-17 12:42:02
## 5                     470                    80    1064731 2009-05-03 10:53:57
## 6                     323                    90    1042332 2009-08-16 07:00:22
## 7                     175                    80    1033296 2008-09-29 01:46:02
## 8                      96                    18    1027164 2008-10-19 12:07:47
## 9                     140                    30     990863 2009-05-31 12:20:08
## 10                    238                   120     960187 2008-08-10 04:27:00
##     user_type user_id accept_rate                       location
## 1  registered   22656          86        Reading, United Kingdom
## 2  registered    6309         100                         France
## 3  registered 1144035          NA        New York, United States
## 4  registered  157882          93            Willemstad, Curaçao
## 5   moderator  100297          NA                  Cambridge, UK
## 6  registered  157247          91                 United Kingdom
## 7  registered   23354         100 Forest of Dean, United Kingdom
## 8  registered   29407          86                Sofia, Bulgaria
## 9  registered  115145          84    Pennsylvania, United States
## 10 registered     893          84      Christchurch, New Zealand
##                                                  website_url
## 1                                   http://csharpindepth.com
## 2                                  https://devstory.fyi/vonc
## 3                                 http://www.data-miners.com
## 4                               https://balusc.omnifaces.org
## 5                                  http://www.zopatista.com/
## 6                                     https://thenewtoys.dev
## 7                                http://blog.marcgravell.com
## 8  http://stackoverflow.com/search?q=user%3a29407&tab=newest
## 9                                    https://commonsware.com
## 10                                       https://hewgill.com
##                                                      link
## 1         https://stackoverflow.com/users/22656/jon-skeet
## 2               https://stackoverflow.com/users/6309/vonc
## 3   https://stackoverflow.com/users/1144035/gordon-linoff
## 4           https://stackoverflow.com/users/157882/balusc
## 5  https://stackoverflow.com/users/100297/martijn-pieters
## 6      https://stackoverflow.com/users/157247/t-j-crowder
## 7      https://stackoverflow.com/users/23354/marc-gravell
## 8    https://stackoverflow.com/users/29407/darin-dimitrov
## 9      https://stackoverflow.com/users/115145/commonsware
## 10       https://stackoverflow.com/users/893/greg-hewgill
##                                                                              profile_image
## 1  https://www.gravatar.com/avatar/6d8ebb117e8d83d74ea95fbdd0f87e13?s=256&d=identicon&r=PG
## 2                                            https://i.stack.imgur.com/I4fiW.jpg?s=256&g=1
## 3  https://www.gravatar.com/avatar/e514b017977ebf742a418cac697d8996?s=256&d=identicon&r=PG
## 4  https://www.gravatar.com/avatar/89927e2f4bde24991649b353a37678b9?s=256&d=identicon&r=PG
## 5  https://www.gravatar.com/avatar/24780fb6df85a943c7aea0402c843737?s=256&d=identicon&r=PG
## 6                                            https://i.stack.imgur.com/lUM5Z.jpg?s=256&g=1
## 7                                            https://i.stack.imgur.com/CrVFH.png?s=256&g=1
## 8  https://www.gravatar.com/avatar/e3a181e9cdd4757a8b416d93878770c5?s=256&d=identicon&r=PG
## 9                                            https://i.stack.imgur.com/wDnd8.png?s=256&g=1
## 10 https://www.gravatar.com/avatar/747ffa5da3538e66840ebc0548b8fd58?s=256&d=identicon&r=PG
##       display_name badge_counts_bronze badge_counts_silver badge_counts_gold
## 1        Jon Skeet                9216                9166               873
## 2             VonC                5320                4479               533
## 3    Gordon Linoff                 797                 652                58
## 4           BalusC                3563                3623               374
## 5  Martijn Pieters                3375                4102               305
## 6     T.J. Crowder                1888                1940               188
## 7     Marc Gravell                2914                2582               267
## 8   Darin Dimitrov                2934                3296               273
## 9      CommonsWare                2514                2404               191
## 10    Greg Hewgill                1288                1155               185
##                        tags1                 tags2
## 1  firebase-app-distribution google-cloud-profiler
## 2             topic-modeling                 spacy
## 3                                                 
## 4                                                 
## 5                                                 
## 6                                                 
## 7             azure-sdk-.net    azure-http-trigger
## 8                                                 
## 9                        ios               android
## 10                                                
##                             tags3                     tags4              tags5
## 1         google-cloud-networking dialogflow-es-fulfillment google-cloud-tools
## 2          nlp-question-answering                    gensim             tf-idf
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6                                                                             
## 7  azure-machine-learning-service      azure-data-lake-gen2      azure-bastion
## 8                                                                             
## 9                            <NA>                      <NA>               <NA>
## 10                                                                            
##               tags6                               tags7            tags8
## 1  google-cloud-dns google-cloud-internal-load-balancer firebase-hosting
## 2           spacy-3                      word-embedding         word2vec
## 3                                                                       
## 4                                                                       
## 5                                                                       
## 6                                                                       
## 7       azure-ad-v2              azure-managed-database  azure-notebooks
## 8                                                                       
## 9              <NA>                                <NA>             <NA>
## 10                                                                      
##                              tags9                    tags10
## 1                    looker-studio         google-cloud-node
## 2                             nltk        sentiment-analysis
## 3                                                           
## 4                                                           
## 5                                                           
## 6                                                           
## 7  microsoft-entra-internet-access azure-integration-account
## 8                                                           
## 9                             <NA>                      <NA>
## 10                                                          
##                  tags11              tags12                tags13
## 1  google-cloud-spanner    firebase-invites google-cloud-launcher
## 2          stanford-nlp bert-language-model                   nlp
## 3                                                                
## 4                                                                
## 5                                                                
## 6                                                                
## 7  azure-container-apps      azure-emulator    azure-managed-disk
## 8                                                                
## 9                  <NA>                <NA>                  <NA>
## 10                                                               
##                           tags14                   tags15
## 1         google-cloud-functions google-container-builder
## 2       huggingface-transformers named-entity-recognition
## 3                                                        
## 4                                                        
## 5                                                        
## 6                                                        
## 7  azure-log-analytics-workspace azure-ad-domain-services
## 8                                                        
## 9                           <NA>                     <NA>
## 10                                                       
##                      tags16              tags17
## 1  google-cloud-memorystore firebasesimplelogin
## 2                   opennlp           r-package
## 3                                              
## 4                                              
## 5                                              
## 6                                              
## 7    azure-web-app-firewall     azure-dashboard
## 8                                              
## 9                      <NA>                <NA>
## 10

2. Posts data:

##     X post_state score  last_activity_date       creation_date post_type
## 1   1  Published     0 2023-12-05 13:15:43 2023-12-05 13:15:43  question
## 2   2  Published     0 2023-12-05 13:15:42 2023-12-05 12:52:39  question
## 3   3  Published     1 2023-12-05 13:15:39 2023-12-05 12:56:06  question
## 4   4  Published     0 2023-12-05 13:15:22 2023-11-29 13:40:17    answer
## 5   5  Published     1 2023-12-05 13:15:22 2019-05-06 10:14:02  question
## 6   6  Published     0 2023-12-05 13:15:19 2023-12-05 13:10:04    answer
## 7   7  Published     0 2023-12-05 13:15:19 2023-12-05 12:44:08  question
## 8   8  Published     1 2023-12-05 13:15:17 2022-04-06 09:34:16  question
## 9   9  Published     0 2023-12-05 13:15:02 2023-05-24 12:02:10    answer
## 10 10  Published     0 2023-12-05 13:15:02 2023-05-24 12:02:10  question
##     post_id content_license                                 link
## 1  77608436    CC BY-SA 4.0 https://stackoverflow.com/q/77608436
## 2  77608307    CC BY-SA 4.0 https://stackoverflow.com/q/77608307
## 3  77608326    CC BY-SA 4.0 https://stackoverflow.com/q/77608326
## 4  77573743    CC BY-SA 4.0 https://stackoverflow.com/a/77573743
## 5  56006978    CC BY-SA 4.0 https://stackoverflow.com/q/56006978
## 6  77608401    CC BY-SA 4.0 https://stackoverflow.com/a/77608401
## 7  77608262    CC BY-SA 4.0 https://stackoverflow.com/q/77608262
## 8  71767750    CC BY-SA 4.0 https://stackoverflow.com/q/71767750
## 9  76325447    CC BY-SA 4.0 https://stackoverflow.com/a/76325447
## 10 76325446    CC BY-SA 4.0 https://stackoverflow.com/q/76325446
##         last_edit_date owner_account_id owner_reputation owner_user_id
## 1                 <NA>          6150674              271       4795151
## 2  2023-12-05 13:15:42           239885             1393        509770
## 3  2023-12-05 13:15:39           452355             5669        848746
## 4  2023-12-05 13:15:22         12939096                1       9355485
## 5  2019-05-06 13:04:35          1344154              365       1285210
## 6  2023-12-05 13:15:19         12513488              950       9107694
## 7  2023-12-05 12:46:52         30074202                1      23047467
## 8  2023-12-05 13:15:17         14221191               53      10273462
## 9  2023-12-05 13:15:02           392777             3063        754254
## 10                <NA>           392777             3063        754254
##    owner_user_type owner_accept_rate
## 1       registered                23
## 2       registered                NA
## 3       registered                83
## 4       registered                NA
## 5       registered               100
## 6       registered                NA
## 7       registered                NA
## 8       registered                NA
## 9       registered                67
## 10      registered                67
##                                                                                         owner_profile_image
## 1  https://www.gravatar.com/avatar/7a76cce42f2ef8d5d54fbaf45ce6c8db?s=256&d=identicon&r=PG&f=y&so-version=2
## 2                                                             https://i.stack.imgur.com/lNymg.png?s=256&g=1
## 3                   https://www.gravatar.com/avatar/53f8a63ef3f02b3d02788f4a90fff3e3?s=256&d=identicon&r=PG
## 4  https://www.gravatar.com/avatar/11b0ca0f851786415f8d974048810ad6?s=256&d=identicon&r=PG&f=y&so-version=2
## 5                                                             https://i.stack.imgur.com/0GaVG.jpg?s=256&g=1
## 6  https://www.gravatar.com/avatar/c8df0972eede360961e46d3df9a29b48?s=256&d=identicon&r=PG&f=y&so-version=2
## 7                                                             https://i.stack.imgur.com/8hITY.jpg?s=256&g=1
## 8  https://www.gravatar.com/avatar/cbc92cae95fc36e52d7f73e2862be862?s=256&d=identicon&r=PG&f=y&so-version=2
## 9  https://www.gravatar.com/avatar/e40fffca9788769cd8ab1a9cd29c623f?s=256&d=identicon&r=PG&f=y&so-version=2
## 10 https://www.gravatar.com/avatar/e40fffca9788769cd8ab1a9cd29c623f?s=256&d=identicon&r=PG&f=y&so-version=2
##    owner_display_name
## 1         atul mishra
## 2              jaybro
## 3                 AJW
## 4  Rustamjon Akhmedov
## 5            DudiDude
## 6            SyndRain
## 7            yuyibruh
## 8            liner183
## 9              Felipe
## 10             Felipe
##                                                    owner_link
## 1         https://stackoverflow.com/users/4795151/atul-mishra
## 2               https://stackoverflow.com/users/509770/jaybro
## 3                  https://stackoverflow.com/users/848746/ajw
## 4  https://stackoverflow.com/users/9355485/rustamjon-akhmedov
## 5            https://stackoverflow.com/users/1285210/dudidude
## 6            https://stackoverflow.com/users/9107694/syndrain
## 7           https://stackoverflow.com/users/23047467/yuyibruh
## 8           https://stackoverflow.com/users/10273462/liner183
## 9               https://stackoverflow.com/users/754254/felipe
## 10              https://stackoverflow.com/users/754254/felipe
##    posted_by_collectives collectives
## 1                                   
## 2                                   
## 3                                   
## 4                                   
## 5                                   
## 6                                   
## 7                                   
## 8                                   
## 9                                   
## 10

3. Tags data:

##     X has_synonyms is_moderator_only is_required   count       name
## 1   1         TRUE             FALSE       FALSE 2519743 javascript
## 2   2         TRUE             FALSE       FALSE 2176854     python
## 3   3         TRUE             FALSE       FALSE 1912027       java
## 4   4         TRUE             FALSE       FALSE 1608171         c#
## 5   5         TRUE             FALSE       FALSE 1462912        php
## 6   6         TRUE             FALSE       FALSE 1413467    android
## 7   7         TRUE             FALSE       FALSE 1184004       html
## 8   8         TRUE             FALSE       FALSE 1034750     jquery
## 9   9         TRUE             FALSE       FALSE  802627        c++
## 10 10         TRUE             FALSE       FALSE  801143        css
##                                                                                                                                                                                                                                                                                             collectives
## 1                                                                                                                                                                                                                                                                                                      
## 2                                                                                                                                                                                                                                                                                                      
## 3                                                                                                                                                                                                                                                                                                      
## 4                                                                                                                                                                                                                                                                                                      
## 5                                                            list("php"), list(list(type = "support", link = "https://stackoverflow.com/contact?topic=15")), A collective where developers working with PHP can learn and connect about the open source scripting language., /collectives/php, PHP, php
## 6  list(c("ios", "android")), list(list(type = "support", link = "https://stackoverflow.com/contact?topic=15")), A collective for developers who want to share their knowledge and learn more about mobile development practices and platforms, /collectives/mobile-dev, Mobile Development, mobile-dev
## 7                                                                                                                                                                                                                                                                                                      
## 8                                                                                                                                                                                                                                                                                                      
## 9                                                                                                                                                                                                                                                                                                      
## 10

Preprocessed datasets:

Steps:

  1. Pre-processing
  2. Splitting the tags
  3. Converting text to numeric values using Label Encoding
  4. Handling NA values
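The steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the toy rows, the `;` delimiter, and the helper names `split_tags` and `label_encode` are hypothetical. Dropping rows with a missing `accept_rate` is also an assumption, although it is consistent with the preprocessed user table, where the rows with `NA` accept rates no longer appear.

```r
# Toy rows in the shape of the user data; the tag strings are illustrative
users_raw <- data.frame(
  user_id     = c(22656, 1144035, 157882),
  accept_rate = c(86, NA, 93),
  tags        = c("firebase-hosting;looker-studio", "sql", ""),
  stringsAsFactors = FALSE
)

# Step 2: split a delimited tag string into k columns, padding with ""
split_tags <- function(x, k, sep = ";") {
  parts <- strsplit(x, sep, fixed = TRUE)
  out <- t(vapply(parts, function(p) c(p, rep("", k))[seq_len(k)],
                  character(k)))
  colnames(out) <- paste0("tags", seq_len(k))
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Step 3: label encoding, i.e. each distinct string becomes an integer code.
# Levels are sorted, so the empty string always receives code 1.
label_encode <- function(x) as.integer(factor(x))

tag_cols <- split_tags(users_raw$tags, k = 2)
tag_cols[] <- lapply(tag_cols, label_encode)
users_enc <- cbind(users_raw[c("user_id", "accept_rate")], tag_cols)

# Step 4: keep only rows with a known accept_rate
users_clean <- users_enc[!is.na(users_enc$accept_rate), ]
users_clean
```

Label encoding assigns codes by sort order, so the integers carry no real ordering among tags. Tree-based models tolerate this fairly well, whereas neural networks may read the codes as magnitudes.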

1. User data:

##   X account_id is_employee  last_modified_date    last_access_date
## 1 1      11683       FALSE 2023-11-29 16:05:26 2023-12-05 12:48:46
## 2 2       4243       FALSE 2023-12-01 18:10:12 2023-12-05 12:04:21
## 3 4      52822       FALSE 2023-12-01 08:05:00 2023-12-05 13:03:28
## 4 6      52616       FALSE 2023-12-05 09:00:11 2023-12-05 13:18:06
## 5 7      11975       FALSE 2023-10-30 14:00:09 2023-12-05 12:32:07
## 6 8      14332       FALSE 2023-11-25 11:05:10 2023-12-05 09:52:02
##   reputation_change_year reputation_change_quarter reputation_change_month
## 1                  58506                      9791                     795
## 2                 119986                     26718                    1507
## 3                  25359                      4189                     248
## 4                  42785                      7819                     508
## 5                  31394                      4926                     243
## 6                  17234                      2866                     144
##   reputation_change_week reputation_change_day reputation       creation_date
## 1                    525                   215    1437267 2008-09-26 08:05:05
## 2                    846                   315    1299740 2008-09-13 18:22:33
## 3                    180                    70    1088739 2009-08-17 12:42:02
## 4                    323                    90    1042332 2009-08-16 07:00:22
## 5                    175                    80    1033296 2008-09-29 01:46:02
## 6                     96                    18    1027164 2008-10-19 12:07:47
##    user_type user_id accept_rate                       location
## 1 registered   22656          86        Reading, United Kingdom
## 2 registered    6309         100                         France
## 3 registered  157882          93            Willemstad, Curaçao
## 4 registered  157247          91                 United Kingdom
## 5 registered   23354         100 Forest of Dean, United Kingdom
## 6 registered   29407          86                Sofia, Bulgaria
##                                                 website_url
## 1                                  http://csharpindepth.com
## 2                                 https://devstory.fyi/vonc
## 3                              https://balusc.omnifaces.org
## 4                                    https://thenewtoys.dev
## 5                               http://blog.marcgravell.com
## 6 http://stackoverflow.com/search?q=user%3a29407&tab=newest
##                                                   link
## 1      https://stackoverflow.com/users/22656/jon-skeet
## 2            https://stackoverflow.com/users/6309/vonc
## 3        https://stackoverflow.com/users/157882/balusc
## 4   https://stackoverflow.com/users/157247/t-j-crowder
## 5   https://stackoverflow.com/users/23354/marc-gravell
## 6 https://stackoverflow.com/users/29407/darin-dimitrov
##                                                                             profile_image
## 1 https://www.gravatar.com/avatar/6d8ebb117e8d83d74ea95fbdd0f87e13?s=256&d=identicon&r=PG
## 2                                           https://i.stack.imgur.com/I4fiW.jpg?s=256&g=1
## 3 https://www.gravatar.com/avatar/89927e2f4bde24991649b353a37678b9?s=256&d=identicon&r=PG
## 4                                           https://i.stack.imgur.com/lUM5Z.jpg?s=256&g=1
## 5                                           https://i.stack.imgur.com/CrVFH.png?s=256&g=1
## 6 https://www.gravatar.com/avatar/e3a181e9cdd4757a8b416d93878770c5?s=256&d=identicon&r=PG
##     display_name badge_counts_bronze badge_counts_silver badge_counts_gold
## 1      Jon Skeet                9216                9166               873
## 2           VonC                5320                4479               533
## 3         BalusC                3563                3623               374
## 4   T.J. Crowder                1888                1940               188
## 5   Marc Gravell                2914                2582               267
## 6 Darin Dimitrov                2934                3296               273
##   tags1 tags2 tags3 tags4 tags5 tags6 tags7 tags8 tags9 tags10 tags11 tags12
## 1     5     7     5     6     7     7     7     6     6      6      8      7
## 2    10     9     8     7    10    11    11    11     8      8     11      6
## 3     1     1     1     1     1     1     1     1     1      1      1      1
## 4     1     1     1     1     1     1     1     1     1      1      1      1
## 5     3     4     3     3     3     3     3     4     7      3      4      4
## 6     1     1     1     1     1     1     1     1     1      1      1      1
##   tags13 tags14 tags15 tags16 tags17
## 1      7      7      7      6      7
## 2      9      9      9      9     13
## 3      1      1      1      1      1
## 4      1      1      1      1      1
## 5      4      4      3      4      5
## 6      1      1      1      1      1

AI/ML Procedures and Results:

This section focuses on the machine learning models and text mining.

We implement two modeling techniques:

  1. Random Forest
  2. Deep Learning
#install.packages("h2o")
library(h2o)
# Start (or connect to) a local H2O cluster
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         19 hours 49 minutes 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.42.0.2 
##     H2O cluster version age:    4 months and 11 days 
##     H2O cluster name:           H2O_started_from_R_arun_hmk386 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.96 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.3.1 (2023-06-16)
# Load the saved Random Forest model
rf_model <- h2o.loadModel("Group4-StackExchange_RF_model.h20")

Performance of the Random Forest Model:

## MSE:  3235834927
## RMSE: 56884.4
## MAE:  31362.41
## RMSLE:  0.4523724
## Mean Residual deviance:  3235834927
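For readers unfamiliar with these metrics, the sketch below shows how they are commonly defined for a regression model. The helper `regression_metrics` and the toy vectors are illustrative assumptions and are not the model's actual predictions. RMSLE is written with `log1p`, i.e. log(1 + x), which is the usual convention. The last line checks that the reported RMSE is the square root of the reported MSE.

```r
# Common regression error metrics (helper name is illustrative)
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(MSE   = mse,
    RMSE  = sqrt(mse),
    MAE   = mean(abs(err)),
    RMSLE = sqrt(mean((log1p(predicted) - log1p(actual))^2)))
}

# Toy values, not taken from the model
actual    <- c(100, 250, 400)
predicted <- c(110, 240, 430)
toy_metrics <- regression_metrics(actual, predicted)
toy_metrics

# Consistency check on the reported Random Forest figures
rf_rmse_check <- sqrt(3235834927)
rf_rmse_check
```

For a squared-error (Gaussian) loss, the mean residual deviance equals the MSE, which is why both lines in the output show the same value.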

XAI Model 1 - Variable Importance of RF Model

# Generate variable importance plot
var_imp <- h2o.varimp(rf_model)
h2o.varimp_plot(rf_model)

Performance of the Deep Learning model:

# Load the model
dl_model <- h2o.loadModel("Group4-StackExchange_DL_model.h20")
## MSE:  4763714533
## RMSE:  69019.67
## MAE:  43687.28
## RMSLE:  0.5574298
## Mean Residual deviance:  4763714533
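To put the two sets of figures side by side, the sketch below computes how much larger each deep learning error is relative to the Random Forest. The numbers are copied from the outputs above. It assumes that the fourth deep learning figure (0.5574298) is the RMSLE, mirroring the Random Forest output, and that both models were scored on the same data.

```r
# Reported metrics, copied from the two performance outputs
metrics <- data.frame(
  model = c("Random Forest", "Deep Learning"),
  RMSE  = c(56884.4, 69019.67),
  MAE   = c(31362.41, 43687.28),
  RMSLE = c(0.4523724, 0.5574298)
)

rf <- unlist(metrics[metrics$model == "Random Forest", -1])
dl <- unlist(metrics[metrics$model == "Deep Learning", -1])

# Percentage by which each Deep Learning error exceeds the Random Forest error
pct_increase <- round(100 * (dl / rf - 1), 1)
pct_increase
```

On these figures the deep learning errors are roughly 21% (RMSE), 39% (MAE), and 23% (RMSLE) higher, so the Random Forest fits better on every reported metric.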

XAI Model 2 - Variable Importance of Deep Learning Model

# Generate variable importance plot
var_imp <- h2o.varimp(dl_model)
h2o.varimp_plot(dl_model)

Key Takeaways:

The top three most important variables are:

  • badge_counts_silver
  • badge_counts_bronze
  • badge_counts_gold

Random Forest: The variable importance scores for the badge count variables are all relatively high, while those for the tag variables are all relatively low. This suggests that badge counts matter more than tags for predicting the target (Direct Referral Index).

Deep Learning: The variable importance plot shows that badge_counts_silver has the highest variable importance.

Overall Findings:

  • Badge counts are often used to identify users who are engaged and active in the community. These users are more likely to be knowledgeable and experienced, and they are also more likely to be social and influential.
  • Conversely, users who are new to the community should focus on earning their first few badges as quickly as possible to increase their DRF.

Thank you