    OpenITI Starts Arabic-script OCR Catalyst Project

    By Elizabeth Garrett Christensen
    September 10, 2019

    Decorative Arabic calligraphy. Photo by Free Quran Pictures 4K, cropped, CC BY 2.0.

    Congratulations to the Open Islamicate Texts Initiative (OpenITI) on their new project, the Arabic-script OCR Catalyst Project (AOCP)! This project received funding from The Andrew W. Mellon Foundation this summer.

    End Point developer Kamil Ciemniewski will be serving the project as a Technology Integration Specialist. Kamil has been involved with OpenITI since 2018 and with the affiliated project, Corpus Builder, since 2017.

    Version 1.0 of Corpus Builder made it possible to collaborate on producing ground-truth datasets for training OCR models. The application acts as both a versioned database of text transcriptions and a full OCR pipeline. Its versioning closely follows the model used by Git.

    What is remarkable is that it can track revisions of documents that are not linear the way text in Git is. OCR needs not only the textual data but also the spatial data: where exactly on the page the text is found.

    A sophisticated mechanism for applying updates to those documents minimizes (with mathematical guarantees) the chance of introducing merge conflicts.
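
    To make the idea of versioning text together with its position more concrete, here is a minimal Python sketch of a three-way merge over page "zones", each holding a transcription and a bounding box. It is only an illustration under stated assumptions: the names `Zone`, `Page`, and `merge_pages` are hypothetical, and this is a plain per-zone three-way merge, not the mechanism Corpus Builder actually uses (which this post does not describe).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical types for illustration only; not Corpus Builder's real schema.
Box = Tuple[int, int, int, int]  # x, y, width, height in page pixels


@dataclass(frozen=True)
class Zone:
    text: str  # transcription of this region
    box: Box   # where the region sits on the page image


Page = Dict[str, Zone]  # zone id -> zone


def merge_pages(base: Page, ours: Page, theirs: Page) -> Tuple[Page, List[str]]:
    """Three-way merge keyed by zone id.

    Returns the merged page and the ids of zones that both sides
    changed in different ways (the conflicts).
    """
    merged: Page = {}
    conflicts: List[str] = []
    for zone_id in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(zone_id), ours.get(zone_id), theirs.get(zone_id)
        if o == t:
            chosen = o  # both sides agree (this includes both deleting)
        elif o == b:
            chosen = t  # only "theirs" changed this zone
        elif t == b:
            chosen = o  # only "ours" changed this zone
        else:
            conflicts.append(zone_id)  # both changed it differently
            chosen = o  # keep "ours" as a placeholder for manual review
        if chosen is not None:
            merged[zone_id] = chosen
    return merged, conflicts


# One editor corrects the text of z1, another adjusts the box of z2.
base = {"z1": Zone("kitab", (10, 10, 80, 20)), "z2": Zone("qalam", (10, 40, 60, 20))}
ours = {"z1": Zone("kitaab", (10, 10, 80, 20)), "z2": base["z2"]}
theirs = {"z1": base["z1"], "z2": Zone("qalam", (12, 40, 64, 20))}

merged, conflicts = merge_pages(base, ours, theirs)
```

    In this sketch, edits to different zones combine cleanly, and a conflict is reported only when two editors change the same zone in different ways. Keying the merge on zones rather than on lines of text is one simple way to account for the spatial side of the data.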

    The project also includes a great-looking user interface that lets non-technical editors work with this versioned data.

    CorpusBuilder works with both Tesseract and Kraken as its OCR backends and is capable of exporting datasets in their respective formats for further model training / retraining. Training of Tesseract models was covered last year in a blog post by Kamil.

    AOCP will rapidly expand on prior work, helping to establish a pipeline for digitizing texts and to create a set of tools for students and scholars of historic texts.

    End Point is really excited to be a part of such a cool integration of technology and the humanities!

    Read more at:


