Comparative analysis of applied natural language processing technologies for improving the quality of digital document classification
Abstract
Full Text: PDF (Russian)
References
What is Natural Language Processing (NLP) // Amazon. URL: https://aws.amazon.com/ru/what-is/nlp/ (date of access: 10.2023).
Lane H., Hapke H., Howard C. Natural Language Processing in Action. – SPb.: Piter, 2020. – pp. 68-140.
Ganegedara T. Natural Language Processing with TensorFlow / transl. by V. S. Yatsenkov. – Moscow: DMK Press, 2020. – pp. 74-102.
Hickman L. et al. Text preprocessing for text mining in organizational research: Review and recommendations // Organizational Research Methods. – 2022. – Vol. 25. – No. 1. – pp. 114-146.
Kadhim A. I. An evaluation of preprocessing techniques for text classification // International Journal of Computer Science and Information Security (IJCSIS). – 2018. – Vol. 16. – No. 6. – pp. 22-32.
Denny M. J., Spirling A. Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it // Political Analysis. – 2018. – Vol. 26. – No. 2. – pp. 168-189.
Tabassum A., Patil R. R. A survey on text pre-processing & feature extraction techniques in natural language processing // International Research Journal of Engineering and Technology (IRJET). – 2020. – Vol. 7. – No. 06. – pp. 4864-4867.
Etaiwi W., Naymat G. The impact of applying different preprocessing steps on review spam detection // Procedia Computer Science. – 2017. – Vol. 113. – pp. 273-279.
Kashina M., Lenivtceva I. D., Kopanitsa G. D. Preprocessing of unstructured medical data: the impact of each preprocessing stage on classification // Procedia Computer Science. – 2020. – Vol. 178. – pp. 284-290.
Pak M. Y., Gunal S. The impact of text representation and preprocessing on author identification // Anadolu University Journal of Science and Technology A – Applied Sciences and Engineering. – 2017. – Vol. 18. – No. 1. – pp. 218-224.
Ideal preprocessing pipelines for NLP models // Temofeev.ru. URL: https://temofeev.ru/info/articles/idealnyy-preprotsessingovyy-payplayn-dlya-nlp-modeley/ (date of access: 23.10.2023).
A Gentle Introduction to the Bag-of-Words Model // Machine Learning Mastery. URL: https://machinelearningmastery.com/gentle-introduction-bag-words-model/ (date of access: 28.10.2023).
Gensim Word2Vec Tutorial // Kaggle. URL: https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial (date of access: 28.10.2023).
Pennington J., Socher R., Manning C. D. GloVe: Global Vectors for Word Representation. URL: https://www-nlp.stanford.edu/projects/glove/ (date of access: 02.11.2023).
Grapheme // Wikipedia. URL: https://ru.wikipedia.org/wiki/Графема (date of access: 03.11.2023).
Gunjal S. Tokenization in NLP // Kaggle. URL: https://www.kaggle.com/code/satishgunjal/tokenization-in-nlp (date of access: 04.11.2023).
How to Prepare Text Data for Deep Learning with Keras // Machine Learning Mastery. URL: https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/ (date of access: 04.11.2023).
McMahan B., Rao D. Getting to Know PyTorch. – SPb.: Piter, 2020. – pp. 88-101.
Porter stemmer // Wikipedia. URL: https://ru.wikipedia.org/wiki/Стеммер_Портера (date of access: 04.11.2023).
Porter Stemmer // Snowball: A language for stemming algorithms. URL: https://snowballstem.org/algorithms/porter/stemmer.html (date of access: 04.11.2023).
Stemming vs Lemmatization // Baeldung.com. URL: https://www.baeldung.com/cs/stemming-vs-lemmatization (date of access: 08.11.2023).
Stopwords-iso repository // GitHub.com. URL: https://github.com/stopwords-iso (date of access: 08.11.2023).
Stopwords-iso. List of stopwords for the Russian language // GitHub.com. URL: https://github.com/stopwords-iso/stopwords-ru/blob/master/stopwords-ru.txt (date of access: 08.11.2023).
NLP Preprocessing // Kaggle. URL: https://www.kaggle.com/code/abdallahwagih/nlp-preprocessing (date of access: 08.11.2023).
McMahan B., Rao D. Deep Learning in Natural Language Processing. – SPb.: Piter, 2020. – pp. 46-92.
Open access to scientific publications // Nkj.ru. URL: https://www.nkj.ru/open/36052/ (date of access: 08.11.2023).
Lample G., Ballesteros M., Subramanian S., Kawakami K., Dyer C. Neural Architectures for Named Entity Recognition // arXiv preprint arXiv:1603.01360. – 2016.
NLP Embeddings // Bayrell.org. URL: https://blog.bayrell.org/ru/iskusstvennyj-intellekt/495-nlp-embeddingi.html (date of access: 08.11.2023).
Soyalp G. et al. Improving Text Classification with Transformer // 2021 6th International Conference on Computer Science and Engineering (UBMK). – IEEE, 2021. – pp. 707-712.
Wang C., Banko M. Practical transformer-based multilingual text classification // Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. – 2021. – pp. 121-129.
Shaheen Z., Wohlgenannt G., Filtz E. Large scale legal text classification using transformer models // arXiv preprint arXiv:2010.12871. – 2020.
Tezgider M., Yildiz B., Aydin G. Text classification using improved bidirectional transformer // Concurrency and Computation: Practice and Experience. – 2022. – Vol. 34. – No. 9. – p. e6486.
Vaswani A. et al. Attention is all you need // Advances in Neural Information Processing Systems. – 2017. – Vol. 30.
Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv:1810.04805. – 2018.
Sun C. et al. How to fine-tune BERT for text classification? // Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18–20, 2019, Proceedings 18. – Springer International Publishing, 2019. – pp. 194-206.
Beltagy I., Peters M. E., Cohan A. Longformer: The long-document transformer //arXiv preprint arXiv:2004.05150. – 2020.
Longformer model designed for Russian language // Hugging Face. URL: https://huggingface.co/kazzand/ru-longformer-base-4096 (date of access: 08.11.2023).
Hossin M., Sulaiman M. N. A review on evaluation metrics for data classification evaluations // International Journal of Data Mining & Knowledge Management Process. – 2015. – Vol. 5. – No. 2. – p. 1.
Li Y. et al. A comparative study of pretrained language models for long clinical text // Journal of the American Medical Informatics Association. – 2023. – Vol. 30. – No. 2.
Wei F. et al. An Empirical Comparison of DistilBERT, Longformer and Logistic Regression for Predictive Coding // Proceedings - 2022 IEEE International Conference on Big Data, Big Data 2022. 2022.
Mamakas D. et al. Processing Long Legal Documents with Pre-trained Transformers: Modding LegalBERT and Longformer // NLLP 2022 - Natural Legal Language Processing Workshop 2022, Proceedings of the Workshop. 2022.
Khandelwal A. Fine-Tune Longformer for Jointly Predicting Rumor Stance and Veracity // ACM International Conference Proceeding Series. 2020.
ISSN: 2307-8162