Method of Automated Address Data Extraction from Unstructured Text

A.V. Komarova, A.A. Menshchikov, A.V. Polev, Y.A. Gatchin

Abstract


This article presents a method for the automated extraction of address data from unstructured text on the Internet. The authors focus on extracting information from text that contains postal addresses and geographical landmarks. Two main techniques are emphasized: template analysis and statistical analysis using machine learning. The paper describes the advantages of automated search technologies for Smart Cities and for open data initiatives, which are growing in popularity. In addition, the authors developed software for collecting and retrieving information from text. The method can serve as a basis for an information analysis system for real estate web resources, as well as for building semantic web resources and knowledge management systems.
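As a rough illustration of the template analysis technique named in the abstract, the sketch below matches one hand-written address pattern with a regular expression. The abstract does not give the authors' actual templates, so the pattern, the street-type keywords, the function name `extract_addresses`, and the sample text are all assumptions made for this example.

```python
import re

# Hypothetical template (not taken from the paper): a capitalized street name,
# a street-type keyword, an optional comma, and a house number with an
# optional letter suffix. A practical system would need many such templates.
ADDRESS_TEMPLATE = re.compile(
    r"\b(?P<street>[A-Z][a-z]+(?: [A-Z][a-z]+)*) "
    r"(?P<kind>Street|Avenue|Lane|Prospekt),? "
    r"(?P<house>\d+[A-Za-z]?)\b"
)


def extract_addresses(text):
    """Return (street, kind, house) tuples for every template match in text."""
    return [
        (m.group("street"), m.group("kind"), m.group("house"))
        for m in ADDRESS_TEMPLATE.finditer(text)
    ]


# Illustrative input resembling a real estate listing.
sample = (
    "Two-room flat for rent at Nevsky Prospekt, 28 near the metro; "
    "office at Baker Street 221B."
)
print(extract_addresses(sample))
```

A single fixed template like this misses any address written in a different order or with an unlisted street type, which is one reason a template approach is often paired with a statistical one, as the abstract describes.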

Full Text:

PDF (Russian)






ISSN: 2307-8162