Method for automatic document verification for compliance with regulatory requirements

Authors

  • D.S. Verbovyi
  • I.O. Saiapina

DOI:

https://doi.org/10.34185/1562-9945-3-158-2025-18

Keywords:

automatic document validation, regulatory requirements, DOCX formatting, LanguageTool API, Word API, Levenshtein distance, structural text analysis, text validation optimization.

Abstract

This paper develops an efficient method for automatically verifying that documents meet specified formatting standards. Existing approaches based on rule-based systems and machine learning techniques are reviewed and analyzed, a modified method that integrates both structural and linguistic checks is proposed, and the proposed method is compared with existing approaches. Potential directions for further research are also outlined.

The review evaluates how effectively current approaches detect formatting inconsistencies. Rule-based methods offer precision and transparency but adapt poorly to complex document structures. Machine learning techniques are more flexible but often require extensive labeled datasets and are difficult to interpret. To address these limitations, a hybrid approach is proposed that combines structural analysis with linguistic verification, integrating predefined formatting rules with natural language processing (NLP) methods to improve accuracy and adaptability.

The proposed system uses the Word API for structural verification and the LanguageTool API to analyze the text for stylistic and linguistic deviations. The formatting aspects evaluated include font consistency, margins, line spacing, paragraph alignment, and numbering styles. Additionally, NLP responses are filtered using Levenshtein distance to discard false or meaningless results.
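The Levenshtein-based filtering step mentioned above can be illustrated with a minimal Python sketch. The abstract does not state how the distance is applied, so the following is an assumption rather than the authors' implementation: a suggested replacement is kept only if its edit distance to the flagged text, normalized by the longer of the two strings, stays under a threshold. The function names `levenshtein` and `filter_suggestions` and the `max_ratio` value of 0.5 are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the Wagner-Fischer dynamic program,
    keeping only two rows of the table in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def filter_suggestions(flagged: str, suggestions: list[str],
                       max_ratio: float = 0.5) -> list[str]:
    """Keep suggestions that stay close to the flagged text.

    max_ratio is an assumed threshold, not a value from the paper:
    a suggestion is dropped when the normalized distance exceeds it.
    """
    kept = []
    for candidate in suggestions:
        longest = max(len(flagged), len(candidate), 1)
        if levenshtein(flagged, candidate) / longest <= max_ratio:
            kept.append(candidate)
    return kept
```

With these assumptions, `filter_suggestions("formating", ["formatting", "xyz"])` keeps the near-identical correction and drops the unrelated one. Normalizing by length is one possible design choice; it keeps a fixed threshold from being too strict for long words and too lenient for short ones.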

References

Bergman, M., & Dourish, P. (2019). Document Formatting Standards and Compliance: A Comparative Study. DOI: 10.1145/3313831

Smith, J., & Taylor, K. (2021). Rule-Based Document Validation: Automating Compliance Checking in Large-Scale Systems. DOI: 10.1016/j.ijhcs.2021.102667

Nguyen, T., & Daumé, H. (2020). Natural Language Processing for Automated Document Review. DOI: 10.18653/v1/P19-1234

Jurafsky, D., & Martin, J. H. (2022). Speech and Language Processing (3rd Edition). DOI: 10.5555/3382195

Microsoft. Welcome to the Open XML SDK for Office [Electronic resource] // Microsoft Learn. – URL: https://learn.microsoft.com/en-us/office/open-xml/open-xml-sdk.

Naber, D. (2003). A Rule-Based Style and Grammar Checker. [Electronic resource]. URL: https://www.danielnaber.de/languagetool/download/style_and_grammar_checker.pdf

LanguageTool Official Site. How LanguageTool Compares to Other Grammar Checkers. [Electronic resource]. – URL: https://languagetool.org/

Navarro, G. (2001). A Guided Tour to Approximate String Matching. DOI: 10.1145/375360.375365

Wagner, R. A., & Fischer, M. J. (1974). "The String-to-String Correction Problem". Journal of the ACM, 21(1), 168–173. DOI: 10.1145/321796.321811

Published

2025-04-23