doi: 10.17706/jsw.20.2.51-65
Understanding Regression Models on Stack Overflow Code: GBM Returns the Best Prediction Performance among Regression Techniques
2. Research Computing Services, University of Melbourne, 3010, Melbourne, Victoria, Australia.
*Corresponding author. Tel.: +643 479 8319; email: sherlock.licorish@otago.ac.nz (S.A.L.)
Manuscript submitted February 17, 2025; revised April 3, 2025; accepted May 8, 2025; published July 9, 2025.
Abstract—Practitioners often depend on Stack Overflow code during software development, where poor quality is occasionally reported. Research tends to focus on ranking content, identifying defects, and predicting future content, but less attention is dedicated to identifying the most suitable techniques for modelling and prediction. Contextualizing the Stack Overflow code quality problem as regression-based, we examined the variables that predict Stack Overflow (Java) code quality, and the regression approach that provides the best predictive power. We observed answer count (β = 0.138), code length (β = 0.382), code spaces (β = 0.099), and lines of code (β = 1.959) as the strongest predictors of code quality on Stack Overflow. Six regression approaches were considered in our evaluation, where Gradient Boosting Machine (GBM) achieved superior performance (RMSE = 2.77, R² = 0.99, MAE = 0.79) compared to other methods, including eXtreme Gradient Boosting (XGBoost) (RMSE = 3.12, R² = 0.97, MAE = 2.36) and Classification and Regression Trees (CART) (RMSE = 3.45, R² = 0.96, MAE = 1.77). In fact, even when evaluated against Deep Neural Networks (DeepNN), GBM's superior performance is maintained. Follow-up evaluations using two independent datasets on Electrical Grid Stability and USA Cancer Mortality confirm GBM's superior performance, supporting claims for the generalizability of our findings. Outcomes here point to the value of the GBM ensemble learning mechanism and the need for continued experimentation with modelling techniques.
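The metric-based comparison the abstract describes can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: the synthetic dataset, model settings, and choice of scikit-learn estimators are assumptions, used simply to show how regressors such as GBM and CART can be ranked by RMSE, R², and MAE.

```python
# Illustrative sketch: compare regressors by RMSE, R^2, and MAE
# (the evaluation metrics named in the abstract). Synthetic data only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Synthetic regression data standing in for the Stack Overflow features.
X, y = make_regression(n_samples=1000, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "GBM": GradientBoostingRegressor(random_state=0),
    "CART": DecisionTreeRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = {
        "RMSE": mean_squared_error(y_te, pred) ** 0.5,
        "R2": r2_score(y_te, pred),
        "MAE": mean_absolute_error(y_te, pred),
    }

for name, s in scores.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

Lower RMSE/MAE and higher R² indicate a better fit on held-out data, which is the basis on which the paper ranks the six regression approaches.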
Keywords—evaluation study, regression methods, stack overflow, code quality, electrical grid stability, USA cancer mortality
Cite: Sherlock A. Licorish, Brendon Woodford, Lakmal Kiyaduwa Vithanage, and Osayande Pascal Omondiagbe, "Understanding Regression Models on Stack Overflow Code: GBM Returns the Best Prediction Performance among Regression Techniques," Journal of Software, vol. 20, no. 2, pp. 51-65, 2025.
General Information
ISSN: 1796-217X (Online)
Abbreviated Title: J. Softw.
Frequency: Biannually
APC: USD 500
DOI: 10.17706/JSW
Editor-in-Chief: Prof. Antanas Verikas
Executive Editor: Ms. Cecilia Xie
Indexed by: Google Scholar, ProQuest, INSPEC (IET), ULRICH's Periodicals Directory, WorldCat, etc.
E-mail: jsweditorialoffice@gmail.com