Thursday, May 3, 2018

The Second Leg

We knew what problems we faced, and started tackling them one at a time. Our aim was to perfect our project by the end of the semester.

Inclusion of more fonts

We scanned through all the books we had and listed the different fonts used in them. Since we had already converted one font, the process went more smoothly for the new fonts we discovered. We also used the source code from some open-source converters we found online, which sped things up considerably. There were some extremely obscure fonts too, so rare that even their font files weren't available on the internet. Another kind we couldn't convert was Type 1 fonts, which Adobe discontinued years ago. DevLys 010 and Walkman Chanakya were the two major new fonts we added to our converter. At the end of this, we were able to convert a lot of the books provided to us. Our mentor also arranged more books from the Rajasthan board, the Chhattisgarh board, and the CBSE.

Font Sizing

We moved on to our next problem: the difference in font sizes. A closely related issue was the difference in the appearance of the fonts. We solved both in two steps: first identifying fonts that closely resembled each other, and then finding an appropriate size conversion between them. Each of these tasks was time-consuming, and not something we enjoyed, but it had to be done. The resulting conversions were added to our original script, taking it one step closer to completion.
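
To make the idea concrete, here is a minimal sketch in Python of what the size-conversion step amounts to. The font pairings and multipliers are purely illustrative, not the values our script actually uses:

    # Illustrative sketch of the size-conversion step (Python).
    # The target fonts and scale factors are hypothetical examples.
    SIZE_CONVERSION = {
        # source font       -> (visually similar Unicode font, size multiplier)
        "Chanakya":           ("Utsaah", 1.10),
        "DevLys 010":         ("Mangal", 0.95),
        "Walkman Chanakya":   ("Utsaah", 1.05),
    }

    def convert_run(font_name, size_pt):
        """Return the target font and the adjusted point size for one text run."""
        target_font, scale = SIZE_CONVERSION[font_name]
        return target_font, round(size_pt * scale, 2)

    print(convert_run("Chanakya", 12))   # ('Utsaah', 13.2)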

Headings

Our final task was to tag the text in the documents according to its position in the hierarchical structure. InDesign supports six levels of headings, h1 through h6, when exporting to various formats. These heading tags are important for a visually impaired reader to understand the structure of the book. We used a tree structure to decide heading levels, enforcing a proper descending order, i.e. h2 after h1, h3 after h2, and so on. One assumption we made was that the first heading-type object in the document would be h1. All the remaining text was categorized as paragraphs, and the document was ready for exporting.
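
The level-assignment logic is simple enough to sketch. Below is a minimal Python version of the idea, under the simplifying assumption that a heading's rank is given by its font size; the actual script worked off InDesign's style information instead:

    # A sketch of the heading-levelling logic (Python). Assumes a larger
    # font size means a higher position in the hierarchy.
    def assign_levels(heading_sizes):
        """Assign levels 1..6 so they never skip (h2 after h1, and so on)."""
        stack = []    # font sizes of the currently open heading levels
        levels = []
        for size in heading_sizes:
            # close every open level whose heading is not larger than this one
            while stack and stack[-1] <= size:
                stack.pop()
            stack.append(size)
            levels.append(min(len(stack), 6))   # InDesign supports h1 through h6
        return levels

    # The first heading-type object becomes h1, per our assumption:
    print(assign_levels([24, 18, 14, 18, 24]))   # [1, 2, 3, 2, 1]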

The Final Deliverable

Our final deliverable, published online as well as provided to NGOs, consisted of two scripts: one for the conversion of fonts with appropriate sizing, and the other for the tagging of headings. The only problem we couldn't resolve was the tagging of tables and lists, but these objects were rarely present in the books we converted. The output file generated by exporting the converted InDesign document can be read by any screen reader or document reader that supports Hindi.
The project directory can be found at https://github.com/prakhariitd/COP315-Hindi-fonts-to-Unicode

Our scripts run quite fast: a 300-page book can be converted in under two minutes.

Wednesday, May 2, 2018

The Font Conversion

We started working on the problem in December of 2017 under the guidance of Akashdeep sir, our PhD mentor. The first font that we had to convert was Chanakya, which was the original aim of the project.

The deliverable we had to produce was an InDesign script that would convert the non-Unicode text present in a document. The reason for using InDesign was that our primary goal was to get publishers to use our script, so that they could publish their books online directly in Unicode. InDesign also provides the most editing freedom over a document. The procurement of new master files from the various school boards was handled by our mentor, while we worked on the conversion.
  

The Dictionary

The first step in converting any font into another is to create a dictionary that maps characters in one font to the corresponding characters in the other. We used a piece of software called FontForge, which displays the mapping of a font's characters; with it, we could map two fonts to each other. But since each font has hundreds of characters, and for non-Unicode fonts this mapping is unordered, creating the dictionary was extremely tedious and time-consuming. To make things easier, we built a tool in Visual Basic which let us map each non-Unicode character to its Unicode counterpart and create the dictionary easily.

We selected the source non-Unicode font and the target Unicode font, picked a character code in the source font, and composed the character it renders using the Unicode Devanagari characters available to us, thereby creating the dictionary one entry at a time.
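
In Python terms, the dictionary and the conversion it drives look roughly like the sketch below. The entries shown are made-up illustrations, not rows from the real Chanakya dictionary:

    # A sketch of the font dictionary (Python). The mappings are
    # hypothetical; the real dictionary has hundreds of entries.
    CHANAKYA_TO_UNICODE = {
        "d": "क",    # a Latin code point the legacy font draws as a Devanagari glyph
        "[k": "ख",   # some glyphs are encoded as two source characters
        "x": "ग",
    }

    def convert(text, mapping=CHANAKYA_TO_UNICODE):
        """Replace mapped source sequences with their Unicode equivalents,
        trying longer sequences first and passing unmapped characters through."""
        keys = sorted(mapping, key=len, reverse=True)
        out, i = [], 0
        while i < len(text):
            for key in keys:
                if text.startswith(key, i):
                    out.append(mapping[key])
                    i += len(key)
                    break
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    print(convert("d[kx"))   # कखग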

The Exceptions

The dictionary alone wasn't sufficient for the complete conversion, because each font had some exceptional characters formed by unique combinations, which couldn't be mapped by the Dictionary tool. These exceptions had to be identified manually by proofreading the converted outputs, and then mapped to their correct destinations by hand.
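
Conceptually, the fix was a second table of exceptions applied before the ordinary dictionary, so that the unique combinations win. A minimal Python sketch, reusing the convert function from the sketch above; the entry shown is invented for illustration:

    # A sketch of exception handling (Python). The entry is invented;
    # real exceptions were discovered by proofreading converted output.
    EXCEPTIONS = {
        "fd": "कि",   # e.g. a pre-base i-matra: two source characters, reordered in Unicode
    }

    def convert_with_exceptions(text):
        # Apply the irregular multi-character combinations first...
        for src, dst in sorted(EXCEPTIONS.items(), key=lambda kv: -len(kv[0])):
            text = text.replace(src, dst)
        # ...then fall back to the ordinary dictionary mapping.
        return convert(text)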

The First Product

At the end of winter, we had completed a working InDesign script that could convert documents containing the Chanakya font to Unicode. However, there were still several problems left:
  • Most books contained more than just one font, so we weren't able to convert them completely.
  • Two different fonts rarely have the same character size at the same point size, so conversion caused the text to overflow: some of it ran past the boundaries of the page and, in effect, became invisible.
  • The books also needed proper tagging, so that exported versions could convey the hierarchy of the text to blind readers.
So, to solve these issues, we decided to expand our project and continue it through the upcoming semester.

Monday, April 30, 2018

The Problem

Knowledge should be available to everyone equally. But when it comes to the disabled, that is hardly ever true. We set out to make things better by improving accessibility to books, the most reliable source of knowledge.

The visually impaired can read books in two major ways: Braille copies, or the newer option of using a screen reader on ebooks. But when it comes to Hindi, the former is hardly available, and most of the online content for the latter is unusable. Why is this? Let me explain.

Almost all Hindi books published in print, and subsequently put online, were typeset long ago and have changed little since. This was before the 1990s, a time when computing was still developing and the now-universal Unicode standard was far from complete. Back then, it covered very few characters, and definitely not Devanagari, the script Hindi is written in. But books still needed to be printed and published, so Indian publishers used something of a quick fix, known as a non-Unicode font. The problem with this is that even though the text looks and reads exactly like Hindi, it is actually stored as Roman characters at the back-end. This poses a serious problem: when we use our trusted screen readers on this non-Unicode Hindi, they read the characters stored at the back-end, which is pure gibberish.


Front-End
स्वतंत्रता प्राप्ति के बाद विभिन्न विकसित राष्ट्रों के साथ विविध आयामी संपर्कों के बढ़ने से हिंदी साहित्य में नए-नए तत्त्व एवं तथ्य प्रविष्ट हुए।
Back-End
Lora=krk izkfIr Dsd ckn fofHkUu foDdflr jk"VQksa Dsd lkFk fofo/k vk;kkeh laiDdks¢ Dsd c<+us ls fganh lkfgR;k esa u,-u, rÁ~o ,oa rF;k izfo"V gq,` blls x|'kSyh ,oa x| fo/kkvksa Ddk rhozrk Dsd lkFk foDdkl g
(It looks like a mouse ran across your keyboard :P)
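
A quick way to see this for yourself is to inspect the code points. The first word of the back-end sample above is stored entirely as Latin characters, and that is exactly what the screen reader receives. A small Python check:

    # The back-end word is just Latin code points:
    backend = "Lora=krk"
    print([f"U+{ord(c):04X}" for c in backend])
    # ['U+004C', 'U+006F', 'U+0072', 'U+0061', 'U+003D', 'U+006B', 'U+0072', 'U+006B']

    # The same word in proper Unicode starts in the Devanagari block:
    unicode_word = "स्वतंत्रता"
    print(f"U+{ord(unicode_word[0]):04X}")   # U+0938, DEVANAGARI LETTER SA

So the screen reader dutifully spells out "L, o, r, a..." instead of reading Hindi.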

Almost all school boards (CBSE, the Rajasthan board, the CG board, and so on) had their books written in non-Unicode fonts.

Then came the 1990s and the new, improved version of Unicode, which included characters for almost every script in existence, including Devanagari. Screen readers became capable of reading these texts smoothly. But Indian publishers, lazy as they are, weren't going to re-typeset their books just to accommodate a small section of society. So we took up the task of converting the fonts used in these books to Unicode, so that they could be read automatically.
Then came the 1990s, and the new and improved version of Unicode was introduced, which included characters for almost every script in existence, including Devnagri. And now, screen readers became capable of reading these texts smoothly. But, Indian publishers, as lazy as they are, weren't going to write new books just to facilitate a minor section of the society. So, we took upon this task of converting the fonts used in these books to Unicode, so that these books could be used for automatic reading.