On data mining for software repositories

Dmitry Namiot, Vladimir Romanov


This article discusses the application of data science and data mining methods to software repositories. It surveys technologies for program analysis that rely on static data extracted directly from source code or from code repositories. The works reviewed use deep learning methods (recurrent neural networks), classification methods based on other machine learning models, and clustering in software engineering. Practical applications of these methods include classifying and predicting errors, characterizing how code changes over time, searching for duplicate fragments, automatically detecting design errors, and recommending code refactorings.




ISSN: 2307-8162