LiveJournal topic models and their improvement with contextualized representations for creating a model of hidden communities

Ivan Mamaev, Olga Mitrofanova

Abstract


Social networks reflect contemporary tendencies in our society. These tendencies allow users to form communities that have both explicit and hidden links. The latter are of current interest among scholars. Modern algorithms are effective, but they do not take the linguistic parameters of datasets into account. This gap can be filled by an algorithm that combines linguistic and quantitative data analysis. The purpose of the study is to detect hidden links among users' posts in the Russian segment of LiveJournal by means of topic modeling. The corpus currently contains more than 95,490 posts written by 132 users. The procedure for constructing a model of hidden communities has three stages. The first is to process the corpus data with the Stanza library, which handles tokenization and lemmatization of the posts in a single pipeline, together with the removal of manually selected stopwords. The second is to create contextualized topic models and annotate them manually. The third is to build a semantic network of users with Easy Linavis and Gephi. The resulting model of hidden communities is represented as a set of vertices connected by edges. The results of the study provide new information about possible social groups in the Russian segment of social networks, which can be analyzed further from a linguistic perspective.
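The final stage, in which users become vertices and shared topics become edges, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy topic labels, and the rule "connect two users once for each annotated topic they share" are assumptions made for illustration. The sketch writes a plain edge list with `Source`, `Target` and `Weight` columns, a layout that graph tools such as Gephi can typically import as a spreadsheet.

```python
from collections import defaultdict
from itertools import combinations


def build_user_edges(user_topics, min_shared=1):
    """Return sorted (user_a, user_b, weight) tuples.

    weight is the number of topic labels two users have in common
    (an assumed linking rule, not taken from the paper).
    """
    # Invert the mapping: topic label -> set of users who write on it.
    topic_users = defaultdict(set)
    for user, topics in user_topics.items():
        for topic in set(topics):
            topic_users[topic].add(user)

    # Every pair of users under the same topic gains one unit of weight.
    weights = defaultdict(int)
    for users in topic_users.values():
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1

    return sorted((a, b, w) for (a, b), w in weights.items() if w >= min_shared)


def to_edge_csv(edges):
    """Serialize edges as CSV text with Source, Target, Weight columns."""
    lines = ["Source,Target,Weight"]
    lines += [f"{a},{b},{w}" for a, b, w in edges]
    return "\n".join(lines)


# Hypothetical toy data: manually annotated topic labels per user.
user_topics = {
    "user_a": ["politics", "travel"],
    "user_b": ["travel", "cooking"],
    "user_c": ["politics", "travel"],
    "user_d": ["cinema"],
}

edges = build_user_edges(user_topics)
print(to_edge_csv(edges))
```

In this toy example `user_d` shares no topic with anyone and therefore stays outside the edge list, while `user_a` and `user_c` are linked more strongly than the other pairs. Raising `min_shared` would keep only the stronger links.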

DOI: 10.25559/INJOIT.2307-8162.10.202211.54-59

ISSN: 2307-8162