Measuring Similarity of Fiction Texts Based on Distributional Semantic Models (Case Study of the Russian Original Text and English Translations of M. Bulgakov's Novel "The Master and Margarita")

E. V. Tretyak


The paper deals with the application of distributional semantic methods to the task of measuring similarity between several translations of the same original text. In particular, the Word2Vec neural network toolkit is employed to compare two translations. Moreover, from the standpoint of translation theory, descriptions of paraphrasing transformations, which are also used for testing plagiarism detection methods, suit the task of comparing translations. The experiments discussed in this paper are carried out on the Russian original and English translations of M. Bulgakov's novel "The Master and Margarita". The above-mentioned approaches are combined to contrast the translation by M. Glenny (1967) with the one by R. Pevear and L. Volokhonsky (1997). The hypothesis that parallel translations can be treated as paraphrases obtained through transformations is examined. The paper contains a detailed quantitative analysis of the data on the similarity between the two translations of the fiction text, as well as a discussion of particular contexts.
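One common way to compare two translated segments with word embeddings is to average the word vectors of each segment and take the cosine of the two averages. The sketch below illustrates that idea only; whether the paper uses exactly this procedure is an assumption, and the tiny vector table, the function names, and the example tokens are all hypothetical stand-ins for a real Word2Vec model.

```python
import math

# Hypothetical 3-dimensional "embeddings" for illustration only.
# A real experiment would load vectors from a trained Word2Vec model.
TOY_VECTORS = {
    "master": [0.9, 0.1, 0.0],
    "margarita": [0.1, 0.9, 0.0],
    "devil": [0.0, 0.2, 0.9],
    "satan": [0.1, 0.1, 0.9],
}


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens found in the table.

    Returns None when no token is known.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def segment_similarity(tokens_a, tokens_b, vectors=TOY_VECTORS):
    """Cosine similarity of the averaged vectors of two token lists."""
    vec_a = mean_vector(tokens_a, vectors)
    vec_b = mean_vector(tokens_b, vectors)
    if vec_a is None or vec_b is None:
        return None
    return cosine(vec_a, vec_b)


if __name__ == "__main__":
    # Two renderings that differ in one lexical choice.
    print(segment_similarity(["master", "devil"], ["master", "satan"]))
```

Averaging ignores word order, so this measure captures lexical substitutions between translations better than syntactic rearrangements; it is meant as a baseline illustration, not a reconstruction of the experiments.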




ISSN: 2307-8162