Frequency analysis of text using a computer

Robert Mayer

Abstract


This article considers a method of frequency analysis of text that yields spectral distributions of letters, words, and semantic segments. The aims of the article are: 1) to create computer programs that obtain the distribution spectra of words and individual characters in large texts; 2) to test them on V.G. Korolenko's novel "Children of the Dungeon"; 3) to build a probabilistic model of the writer. Three programs written in ABCPascal are presented; they compute: 1) the frequency distribution of letters and letter combinations; 2) the spectral distribution of words and semantic segments; 3) the number of transitions from semantic segments of length n to semantic segments of length m. The article presents: 1) the spectral distributions of the vowels "o", "a", "e", "i", "u", "ya", "yu" in the analyzed text; 2) the frequency distribution of words by length; 3) the spectrum of semantic segments of the text delimited by punctuation marks; 4) the matrix of transitions from semantic segments of length n to semantic segments of length m; 5) the table of probabilities of these transitions; 6) the graph of a probabilistic automaton that simulates the author's generation of text. The vertices of the graph correspond to the number of words in the semantic segments separated by punctuation marks, and its edges correspond to the most likely transitions. Together, these results characterize individual features of the author's style and can be used to establish authorship.
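The quantities listed in the abstract can be illustrated with a minimal Python sketch. It is not the author's ABCPascal code. The function names, the set of delimiting punctuation marks, and the use of `\w+` to define a word are all assumptions made here for illustration. Segment length is measured in words, following the abstract's description of the automaton vertices.

```python
import re
from collections import Counter

# Assumption: the abstract says only "punctuation marks"; this set is illustrative.
SEGMENT_DELIMITERS = r"[.,;:!?]"


def letter_frequencies(text):
    """Relative frequency of each alphabetic character, ignoring case."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: n / total for ch, n in Counter(letters).items()}


def word_length_spectrum(text):
    """Number of words of each length, measured in characters."""
    # Assumption: a "word" is a maximal run of \w characters (Unicode-aware).
    return Counter(len(w) for w in re.findall(r"\w+", text))


def segment_lengths(text):
    """Word counts of consecutive segments delimited by punctuation marks."""
    segments = re.split(SEGMENT_DELIMITERS, text)
    lengths = [len(re.findall(r"\w+", s)) for s in segments]
    return [n for n in lengths if n > 0]  # drop empty segments


def transition_counts(lengths):
    """Count transitions from a segment of n words to the next one of m words."""
    return Counter(zip(lengths, lengths[1:]))


def transition_probabilities(counts):
    """Row-normalise the counts: P(m | n) = count(n, m) / sum over k of count(n, k)."""
    row_totals = Counter()
    for (n, _m), c in counts.items():
        row_totals[n] += c
    return {(n, m): c / row_totals[n] for (n, m), c in counts.items()}


if __name__ == "__main__":
    sample = "One day, when it rained, we stayed home. It rained, we read."
    lengths = segment_lengths(sample)
    counts = transition_counts(lengths)
    print(lengths)
    print(transition_probabilities(counts))
```

For each segment length n, keeping only the most probable transition (or the few most probable) gives the edges of a probabilistic automaton of the kind the abstract describes. How many edges to keep per vertex is a design choice that the abstract does not specify.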

Full Text:

PDF (Russian)

References


Andrievskaja, N.K. Gibridnaja intellektual'naja mera ocenki semanticheskoj blizosti // Problemy iskusstvennogo intellekta. 2021. # 1. S. 4-17.

Belonogov, G.G. Komp'juternaja lingvistika i perspektivnye informacionnye tehnologii: teorija i praktika postroenija sistem avtomat. obrab. tekstovoj inform. / G.G. Belonogov, Ju.P. Kalinin, A.A. Horoshilov. Moskva: Rus. mir, 2004. 246 s.

Borodashhenko, A.Ju. Analiz tekstov na semanticheskoe shodstvo na osnove apparata teorii grafov // Izvestija OrelGTU. Serija "Informacionnye sistemy i tehnologii". 2008. # 1-2. S. 46-52.

Korolenko, V. G. Deti podzemel'ja. Mahaon, 2022. 96 s.

Kuusela, D.A. Analiz leksicheskih spektrov tekstov s pomoshh'ju matematicheskih metodov // StudArctic Forum. 2023. T. 8, # 2. S. 30-35.

Moskal'chuk, G.G., Manakov, N.A. Forma teksta kak mnogourovnevyj konstrukt // Znanie. Ponimanie. Umenie. 2014. # 4. S. 291-302.

Piotrovskij, R.G., Bektaev, K.B., Piotrovskaja, A.A. Matematicheskaja lingvistika: ucheb. posobie dlja ped. in-tov. M.: Vyssh. shk., 1977. 383 s.

Radbil', T.B., Markina, M.V. Verojatnostno-statisticheskie modeli v proizvodstve avtorovedcheskoj jekspertizy russkojazychnyh tekstov // Politicheskaja lingvistika. 2019. # 2 (74). S. 156-166.

Rudnev, D.V., Drugovejko-Dolzhanskaja, S.V. Raspredelenie znakov prepinanija v sovremennoj delovoj pis'mennosti // Jazyk i metod. Russkij jazyk v lingvisticheskih issledovanijah XXI veka. Vyp. 7. Russkaja punktuacija v kommunikativnom aspekte. Krakov: Izd-vo Jagellonskogo un-ta, 2021. S. 75-86.

Trubkina, A.I. Sistema periodicheskih konstrukcij v jazyke i diskurse: problema statusa // Izvestija Sochinskogo gosudarstvennogo universiteta. 2013. # 3 (26). S. 251-254.





ISSN: 2307-8162