Using Machine Learning Methods to Establish Program Authorship

Sergey Gorshkov, Maxim Nered, Eugene Ilyushin, Dmitry Namiot


This article examines the concept of “coding style” and the main approaches to detecting a programmer’s individual style. From this point of view, it analyzes the entire process of creating a software product and the main features of programming style. It emphasizes the relevance and commercial significance of the problem for product support, plagiarism, the work of a large developer community in a single repository, and the evolution of developer skills. It also considers issues of computational stylometry and the possibility of using programming paradigms as an additional factor in style identification. The article proposes the idea of creating a software tool that identifies the style of the author who wrote a particular program fragment and helps less experienced developers follow the rules that are accepted in most of the repository and determined by the coding style of “experts”, which brings the code to a uniform format that is easier to maintain and modify. More broadly, this stage of analyzing the original code (and then the modified code) makes it possible to improve existing algorithms for automatic program synthesis.




ISSN: 2307-8162