Source Code Change Vectorization Using Additive-Subtractive Embeddings
Abstract
This paper presents a novel method for the vector representation of source code changes for the task of automatic commit message generation. We propose an Additive-Subtractive Embeddings (ASE) algorithm based on git diff decomposition that accounts for the semantic contribution of added and removed code fragments. The methodology comprises a three-component decomposition of changes, vectorization of the components with the pre-trained CodeBERT model, and their subsequent integration through linear operations in the embedding space.
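The abstract does not give the exact combination formula, so the following is a minimal sketch of the additive-subtractive idea under stated assumptions: a toy deterministic embedding stands in for CodeBERT, the three components are taken to be added fragments, removed fragments, and unchanged context, and all components receive equal unit weights. The function names are illustrative, not the paper's.

```python
import hashlib
import numpy as np

def toy_embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a CodeBERT embedding: a deterministic unit vector
    seeded by the text. The paper encodes each component with the
    pre-trained CodeBERT model instead."""
    seed = int(hashlib.sha256(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def ase_vector(added: list, removed: list, context: str) -> np.ndarray:
    """Additive-subtractive combination of a change's components:
    embeddings of added fragments enter with a plus sign, removed
    fragments with a minus sign, on top of the unchanged context.
    Equal weighting of components is an assumption of this sketch."""
    v = toy_embed(context)
    for frag in added:
        v = v + toy_embed(frag)
    for frag in removed:
        v = v - toy_embed(frag)
    return v
```

One consequence of the linear form is a cancellation property: a diff that removes and re-adds the same fragment collapses to the context embedding alone, so only the net change contributes to the vector.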
The developed method is implemented as a modification of the T5 architecture, augmented with a projection layer for integrating change vectors. Experimental validation was conducted on a corpus of commits from open repositories using the BLEU metric. Results demonstrate improved generation quality: the model with integrated ASE mechanism achieves a BLEU score of 12.04% compared to 11.97% for the baseline architecture while maintaining computational efficiency.
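The abstract states only that T5 is augmented with a projection layer for the change vectors; how the projected vector is fused with the encoder is not described here. A minimal sketch of such a layer, with assumed dimensions (768 for the CodeBERT output, 512 for the T5 hidden size) and an illustrative function name:

```python
import numpy as np

# Assumed dimensions: CodeBERT produces 768-d vectors; the hidden
# size of the T5 variant is taken as 512 for illustration.
d_ase, d_model = 768, 512

rng = np.random.default_rng(0)
W = rng.standard_normal((d_model, d_ase)) * 0.02  # trained jointly with T5 in practice
b = np.zeros(d_model)

def project_change_vector(v: np.ndarray) -> np.ndarray:
    """Linear projection mapping an ASE change vector into the T5
    hidden space, where it can be combined with encoder states."""
    return W @ v + b

h = project_change_vector(rng.standard_normal(d_ase))
```

Because the layer is a single affine map, it adds only d_ase x d_model + d_model parameters, which is consistent with the reported preservation of computational efficiency.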
Analysis of the training process confirms the methodological validity of the proposed approach: training converges stably, no overfitting is observed, and inference speed is preserved. These results suggest that additive-subtractive embeddings are a promising tool for the semantic analysis of source code changes.
References
Zhang Y. et al. Automatic commit message generation: A critical review and directions for future work //IEEE Transactions on Software Engineering. – 2024.
Eliseeva A. et al. From commit message generation to history-aware commit message completion //2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). – IEEE, 2023. – pp. 723-735.
van Hal S. R. P., Post M., Wendel K. Generating commit messages from git diffs //arXiv preprint arXiv:1911.11690. – 2019.
Papineni K. et al. BLEU: a method for automatic evaluation of machine translation //Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. – 2002. – pp. 311-318.
Morin F., Bengio Y. Hierarchical probabilistic neural network language model //International Workshop on Artificial Intelligence and Statistics. – PMLR, 2005. – pp. 246-252.
Mikolov T., Yih W., Zweig G. Linguistic regularities in continuous space word representations //Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. – 2013. – pp. 746-751.
Champollion L. Distributivity in formal semantics //Annual Review of Linguistics. – 2019. – vol. 5. – no. 1. – pp. 289-308.
Feng Z. et al. CodeBERT: A pre-trained model for programming and natural languages //arXiv preprint arXiv:2002.08155. – 2020.
Temčinas T. Local homology of word embeddings //arXiv preprint arXiv:1810.10136. – 2018.
Jiang S., Armaly A., McMillan C. Automatically generating commit messages from diffs using neural machine translation //2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). – IEEE, 2017. – pp. 135-146.
Nugroho Y. S., Hata H., Matsumoto K. How different are different diff algorithms in Git? Use --histogram for code changes //Empirical Software Engineering. – 2020. – vol. 25. – pp. 790-823.
Feurer M., Hutter F. Hyperparameter optimization //Automated Machine Learning: Methods, Systems, Challenges. – 2019. – pp. 3-33.
Hawkins D. M. The problem of overfitting //Journal of Chemical Information and Computer Sciences. – 2004. – vol. 44. – no. 1. – pp. 1-12.
Imambi S., Prakash K. B., Kanagachidambaresan G. R. PyTorch //Programming with TensorFlow: Solution for Edge Computing Applications. – 2021. – pp. 87-104.
Wolf T. et al. Transformers: State-of-the-art natural language processing //Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. – 2020. – pp. 38-45.
Ni J. et al. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models //arXiv preprint arXiv:2108.08877. – 2021.
Grefenstette G. Tokenization //Syntactic Wordclass Tagging. – Dordrecht: Springer Netherlands, 1999. – pp. 117-133.
Loshchilov I., Hutter F. Decoupled weight decay regularization //arXiv preprint arXiv:1711.05101. – 2017.
Mao A., Mohri M., Zhong Y. Cross-entropy loss functions: Theoretical analysis and applications //International Conference on Machine Learning. – PMLR, 2023. – pp. 23803-23828.
Kondrak G. N-gram similarity and distance //International Symposium on String Processing and Information Retrieval. – Berlin, Heidelberg: Springer, 2005. – pp. 115-126.
Kosyanenko I. A., Bolbakov R. G. On automatic generation of commit messages in version control systems //International Journal of Open Information Technologies. – 2022. – vol. 10. – no. 4. – pp. 55-60.
ISSN: 2307-8162