Accuracy vs. Efficiency: Vectorization Methods for E-commerce Product Titles

Fedor Krasnov

Abstract


This paper presents an empirical study and comparative analysis of the effectiveness of modern term vectorization methods for Information Retrieval (IR) tasks, focusing on short textual data, specifically e-commerce product titles. The objective is to identify the method that most accurately reproduces the global structure of semantic connections in the corpus while maintaining high computational efficiency. The key evaluation criterion is the Frobenius norm of the difference between the normalized target term co-occurrence matrix and the cosine similarity matrix derived from the learned vector representations. The investigation was conducted in three sequential stages. The first experiment compared classical matrix factorization methods (SVD/LSA, NMF, LDA) and local-window models (Word2Vec, FastText) using basic whitespace tokenization. At this stage, the LDA algorithm demonstrated the minimum error (191.00), indicating the closest correspondence to the global structure of the corpus. In the second (main) experiment, BERT-compatible (BPE-like) tokenization was employed for all methods, and the pre-trained contextual transformer model BERT was added to the comparison. To ensure methodological rigor, BERT was evaluated in a static averaged-embedding mode (fixed vector representation). The experimental data confirmed that LDA maintained its lead with an error of 156.9, exhibiting higher accuracy in this task than BERT, which achieved an error of 253.17. The third experiment was dedicated to multi-objective optimization of the hyperparameters of the most effective method, LDA.
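The evaluation criterion can be sketched in a few lines of NumPy (a minimal illustration on toy data; the document-term matrix, the row-wise normalization, and the SVD-based term vectors below are stand-ins for illustration, not the paper's exact pipeline):

```python
import numpy as np

# Toy document-term matrix (hypothetical data): rows are short titles,
# columns are terms in the order: red, shoe, bag, blue.
X = np.array([
    [1, 1, 0, 0],   # "red shoe"
    [1, 0, 1, 0],   # "red bag"
    [0, 1, 0, 1],   # "blue shoe"
], dtype=float)

# Term co-occurrence counts with a simple row-wise normalization
# (the paper's exact normalization scheme may differ).
C = X.T @ X
C_norm = C / np.linalg.norm(C, axis=1, keepdims=True)

# Stand-in term vectors: here a truncated SVD of X; any vectorizer
# (LDA, NMF, Word2Vec, ...) would plug in at this step.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T * s                       # one row per term

# Cosine-similarity matrix of the term vectors
V_unit = V / np.linalg.norm(V, axis=1, keepdims=True)
S = V_unit @ V_unit.T

# Evaluation criterion: Frobenius norm of the difference
error = np.linalg.norm(C_norm - S, ord="fro")
print(round(float(error), 4))
```

A lower value of `error` means the vector space reproduces the corpus-wide co-occurrence structure more faithfully, which is the sense in which LDA "wins" in the experiments above.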
Using the Optuna library, a Pareto front of solutions was found, reflecting the optimal compromise between internal consistency (maximizing log-likelihood) and empirical accuracy (minimizing the Frobenius norm). The results confirm that for IR tasks that do not require deep contextual understanding, methods based on factorization of global co-occurrence statistics, such as LDA, are the most economically and technically justified, surpassing complex neural network models on the key metric of semantic structure reproduction.


References


Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788-791.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv preprint arXiv:2104.08821.

Wang, Y., Liu, S., Afzal, N., Rastegar-Mojarad, M., Wang, L., Shen, F., ... & Liu, H. (2018). A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics, 87, 12-20.





ISSN: 2307-8162