Identification of Propaganda Documents in the News Text Corpora

Ravil Mukhamediev, Olga Filatova, Kirill Yakunin


This article demonstrates how topic modeling can be used to identify propaganda in the media. As information confrontation between countries intensifies, propaganda and counter-propaganda come to the forefront: states need to protect their citizens from informational threats and ensure their safety, which is a necessary condition for the further development of the state. To achieve this, research projects are needed to test methods for identifying propaganda. One such project, focused on the use of artificial intelligence systems in applied research at the intersection of machine learning, natural language processing and social studies, is presented in the article. The described approach to identifying a phenomenon as semantically fuzzy as propaganda is proposed for the first time. The following definition of political propaganda is suggested: a coordinated, systematic informational influence exerted by the subject of propaganda on target audiences to achieve political goals and promote political ideas.

The proposed method includes four main stages: formation of corpus sections; calculation of a topic model of the overall corpus; calculation of imbalance estimates of the corpora for each topic; and extrapolation of the imbalance estimates to all documents. The method was cross-checked on a subsample of 1000 news items labeled by an expert and showed fairly high classification quality: the F1 score (the harmonic mean of precision and recall) varies from 0.72 to 0.94 depending on the selected threshold.
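The four stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the normalized-difference imbalance measure, the weighted-sum extrapolation, and all function names (`topic_imbalance`, `document_scores`, `f1_at_threshold`) are assumptions made for illustration. The document-topic weights `theta` are assumed to come from any topic model fitted on the overall corpus.

```python
from typing import List, Sequence


def topic_imbalance(theta: Sequence[Sequence[float]],
                    section: Sequence[int]) -> List[float]:
    """Per-topic imbalance in [-1, 1] between two corpus sections.

    theta[d][t] is the weight of topic t in document d.
    section[d] is 1 for the section suspected of propaganda, 0 for the
    reference section. The normalized difference of mean topic weights
    is an assumed measure, not the one from the paper.
    """
    n1 = sum(section)
    n0 = len(section) - n1
    if n1 == 0 or n0 == 0:
        raise ValueError("both corpus sections must be non-empty")
    result = []
    for t in range(len(theta[0])):
        m1 = sum(row[t] for row, s in zip(theta, section) if s == 1) / n1
        m0 = sum(row[t] for row, s in zip(theta, section) if s == 0) / n0
        total = m1 + m0
        result.append((m1 - m0) / total if total > 0 else 0.0)
    return result


def document_scores(theta: Sequence[Sequence[float]],
                    imbalance: Sequence[float]) -> List[float]:
    """Extrapolate topic imbalance to documents as a topic-weighted sum."""
    return [sum(w * b for w, b in zip(row, imbalance)) for row in theta]


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int],
                    threshold: float) -> float:
    """F1 score when documents with score >= threshold are flagged."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): four documents, two topics, two sections.
theta = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
section = [1, 1, 0, 0]
expert_labels = [1, 1, 0, 0]

imbalance = topic_imbalance(theta, section)
scores = document_scores(theta, imbalance)
for thr in (0.0, 0.5):
    print(thr, f1_at_threshold(scores, expert_labels, thr))
```

Sweeping the threshold over the document scores, as in the final loop, is one way a range of F1 values like the one reported above could arise. The toy numbers here are unrelated to the reported results.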

Full Text:

PDF (Russian)





ISSN: 2307-8162