Atalasoft Imaging SDK Development Blog

Document Imaging and Developer Commentary


Recent blog posts

This is our new blog page. If you're looking for posts before 2012, see our archive.

Anatomy of a Feature Request

Creating a product that is an API presents many challenges for an architect. A number of trade-offs are always present when adding support for a particular feature. For example, you might have an easy-to-understand public abstraction at the cost of a challenging (or unreliable) private implementation. I’m going to take you through the process I went through in order to implement a feature in DotPdf for a customer. The back story is that the PDF specification includes a misfeature commonly known as “PDF Portfolios”. In the PDF specification, these are called “Portable Collections” (a portfolio in the real world is a collection of documents that you carry). This feature is a way in which a number of documents/files can be embedded within a single PDF file and accessed from within the viewer’s UI. The embedded documents need not be PDF, but could be a Word doc, email, text, images, etc. The resulting embedded files can be prese...


Posted by Steve Hawley on 06/25/2014 with 0 comments

Your Whole Programming Language is a Set of Domain-Specific-Languages

A Domain-Specific-Language (DSL) is a small language used to make routine tasks in a particular problem domain easier. Examples of DSLs include spreadsheet macros and the Unix software build utility known as Make; the virtual machine I wrote to parse PDF also implements a simple DSL. When you consider the syntax of most modern-ish programming languages (I’m looking at you C++, Java, C#, F#), nearly all of them are a hodge-podge of DSLs jammed together. This is sometimes a horrible thing, and unfortunately it’s our own fault. It stems from how we got here in the first place and how we saw our problem domain. The first thing that comes to mind is assignment, which is the first DSL. Value mutation is a direct reflection of the initial implementation of hardware. We had memory that was used to hold numbers and we needed a way to put/get values into/from cells. Thus was born the “move” instruction (or load/store instructions in accumulator ...


Posted by Steve Hawley on 05/05/2014 with 0 comments

Improving OCR Results: Adding Spellcheck

With the new Tesseract 3.02 engine available as an add-on for Atalasoft DotImage, I have been more interested in the quality of OCR results. When I scour the internet for OCRed documents, I find that many of them have words that are misspelled due to a misinterpreted character or omitted letter. I thought about spellcheck being able to solve this issue, and after experimentation I believe it can only make minor improvements to the overall OCR results without very sophisticated integration. With DotImage the OcrEngine object is set up to be very extensible, giving hooks into many major steps of the OCR process. Using DotImage and an open source .NET spell checking engine, I came up with two simple algorithms: “Missing Letter” and “Single Incorrect Letter”.

Missing Letter

In several of the raw OCR results from my sample set I noticed that there would be words that were completely missing a letter. The spell check engine provided good guesses when a let...
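The excerpt stops before the details of the “Missing Letter” algorithm, so the following is only a minimal sketch of the general idea, not the post's implementation. It assumes a plain in-memory word set as a stand-in for the spell checking engine, and the class and method names (`MissingLetter`, `candidates`) are hypothetical. It is written in Java rather than .NET and does not use any DotImage API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: recover a word that lost one letter during OCR.
class MissingLetter {
    // Inserts every lowercase letter at every position of the OCR word and
    // keeps the candidates that appear in the dictionary (assumed stand-in
    // for a real spell checking engine).
    static Set<String> candidates(String ocrWord, Set<String> dictionary) {
        Set<String> found = new TreeSet<>();
        for (int pos = 0; pos <= ocrWord.length(); pos++) {
            for (char c = 'a'; c <= 'z'; c++) {
                String candidate = ocrWord.substring(0, pos) + c + ocrWord.substring(pos);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("letter", "latter", "better"));
        // "leter" is "letter" with one 't' dropped.
        System.out.println(candidates("leter", dictionary));
    }
}
```

A real integration would more likely ask the spell checker for its ranked suggestions and then filter them by length, since a dictionary lookup alone cannot choose between several valid candidates.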


Posted by Kevin Hulse on 03/18/2014 with 0 comments

Some Introduction, Some Tesseract

Hi there! I’m Kevin Hulse, the newish Solutions Enablement Specialist at Atalasoft. You may have worked with me directly after I started working at Atalasoft as a Developer Support Engineer nearly six years ago. Since starting here, I have worked in almost every department from Support to Engineering and now Marketing (watch out Sales). I hope to begin a small series of blogs on all things OCR and plan on providing interesting, technical-minded posts on our products, our customers, and document imaging in general, as well as posts on things that I simply find interesting enough to talk about. Speaking of products, with the release of DotImage 10.4.1, our OCR libraries have been upgraded to handle version 3.02 of the Tesseract OCR Engine. This upgrade includes a few small improvements to speed and accuracy of processing as well as an increased ability to use new data packages to support more extended character sets.  Additionally, here’s a list of all the langua...


Posted by Kevin Hulse on 02/27/2014 with 0 comments Tags: OCR, Tesseract

How to Work With Library Developers/Support

In addition to writing code from the ground up, we also work with other library developers and we package the libraries in a C# or Java API which is typically easier to work with or more convenient for our customers.  Many of our customers aren’t comfortable working with C++ libraries or sometimes the C++ libraries have awkward interfacing and that’s fine.  We’re very good at taking this type of API and presenting it in a way that feels right for .NET or the JVM and integrates with the rest of our code base. Still, the same way that you work with us, we have to, at times, work with other library writers and we find bugs every now and again.  Here are five tips for working with library creators to get the most out of your interactions. Make sure that you are using the library correctly. The reason is that (hopefully) your library has a model of operation that lends itself to a particular model of usage.  For example, some librarie...


Posted by Steve Hawley on 01/14/2014 with 0 comments

When is boolean not a boolean?

I ran into a failing C# unit test today, with the following output:

    Expected: True
    But was:  True

Seriously. I stopped it in the debugger and the property that was being checked was “true”. I set it to a local and that was also “true” in the debugger. So when is it the case that true != true? The answer, to me, was straightforward: in C#, “true” is supposed to be the 32-bit value 0x00000001, but in many languages, “true” is defined as “anything that is not ‘false’” (aka 0). Since the code that was generating the value was C++/CLI interfacing with C, it seemed pretty clear where the issue was – I opened up a Memory window in the debugger and dropped the member onto it, which showed that the value for the boolean was 0x00002080 (or something like that). The culprit was C++ code that was calling a low-level C function, passing in two locals by reference.  ...
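The mismatch between the two definitions of “true” can be sketched as follows. This is an illustration only: a Java `boolean` cannot hold a stray bit pattern, so plain `int` values stand in for the raw memory, and the names (`NativeBool`, `sameBits`, `isTruthy`) are hypothetical. The value 0x00002080 is the approximate one quoted above.

```java
// Hypothetical model of the problem: ints stand in for the raw 32-bit storage
// of a boolean that was written by native code.
class NativeBool {
    // Approximate raw value reported in the post for the "true" from native code.
    static final int RAW_TRUE_FROM_NATIVE = 0x00002080;
    // The canonical representation the managed side expects for "true".
    static final int CANONICAL_TRUE = 0x00000001;

    // Bit-for-bit comparison: only identical representations are equal.
    static boolean sameBits(int a, int b) {
        return a == b;
    }

    // C-style truthiness: anything that is not zero counts as true.
    static boolean isTruthy(int raw) {
        return raw != 0;
    }

    public static void main(String[] args) {
        // Both values are "true" under the C rule...
        System.out.println(isTruthy(RAW_TRUE_FROM_NATIVE) == isTruthy(CANONICAL_TRUE));
        // ...but they are not the same bits, so a raw equality check fails.
        System.out.println(sameBits(RAW_TRUE_FROM_NATIVE, CANONICAL_TRUE));
    }
}
```

Normalizing at the interop boundary (for example with a `raw != 0` test) before the value reaches managed code is one common way to avoid this kind of surprise.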


Posted by Steve Hawley on 11/26/2013 with 0 comments

Devoxx 13 Belgium, a Quick Look

I spent last week in Antwerp, Belgium attending the Devoxx conference. It was an interesting experience, not least because I was there playing a trio of roles: developer, presenter, and evangelist. In the developer role, I was looking for new technologies that we could make use of in current or future work. Interesting things I saw included Genymotion (a high-speed Android emulator) and the Dart programming language. My presentation on parsing PDF in Java went well, although I went way too fast – it happens when you are both a little nervous and very passionate about your topic. Hopefully the talk will be up on parleys.com soon – at this writing, the channel page is empty. What struck me the most was working the trade show floor. It’s always a bit of a challenge at developer conferences in that we, as developers, can be introverted and do not want to interact with other people. Very often, people who walked by...


Posted by Steve Hawley on 11/20/2013 with 0 comments

Off to Devoxx 2013

I’ll be going to Devoxx in Antwerp, Belgium next week.  If you’re there, stop by the Atalasoft booth to say hi, or come see me present on PDF Parsing.


Posted by Steve Hawley on 11/08/2013 with 0 comments

Atalasoft 10.4 SDKs Released

Update: Atalasoft version 10.4.1 was released in early 2014 and brings support for Tesseract OCR 3.02, updated DICOM libraries, and bug fixes. See the release notes for more information.

What's new in 10.4.0?

Our engineers have been hard at work and we are excited to announce the latest release of our SDKs: Version 10.4. This is a major release just like 10.2 and 10.3. Let's dive into some new features for our products across the board:

Word Reader Add-on (for DotImage, any edition)

Customers have been asking to be able to open Microsoft Word files in our viewers. With this new product users will be able to rasterize basic Word documents to save in other formats or display in our viewers. Simply add the WordDecoder to the list of RegisteredDecoders. Any Word document opened with our controls is then rasterized automatically and displayed in the viewer. The .docx is turned into an AtalaImage that can be treated like many other formats we supp...


Posted by Eric Deutchman on 10/07/2013 with 0 comments

A Quick Java Gem, Borrowing from C#

I love the C# operator as. This is syntactic sugar to cast an object of one type to another type, but evaluate to null if the types are incompatible. This is equivalent to writing the following code:

    Bar bar = o is Bar ? (Bar)o : null;

It’s just neater to use as. I’ve been writing some Java code and I really wanted the as operator, but Java doesn’t have it. Instead I could write this equivalent:

    Bar bar = o instanceof Bar ? (Bar)o : null;

but embedding that ternary expression into my code is sloppy. Instead, I wrote it into this class:

    public class Cast {
        public static <T> T as(Object o, Class<T> cl) {
            if (o == null) return null;
            return cl.isInstance(o) ? cl.cast(o) : null;
        }
    }

which when used in context feels a lot neater to me:

    Bar bar = Cast.as(o, Bar.class);

Remember...
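To show how such a helper might read in practice, here is a small self-contained sketch. The `Cast` class is repeated (without the redundant null check, since `Class.isInstance` already returns false for null) so the example compiles on its own; `CastDemo` and `totalStringLength` are hypothetical names invented for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// The Cast helper described above, repeated so this example stands alone.
final class Cast {
    private Cast() {}

    // Returns o viewed as T, or null when o is null or not an instance of cl.
    static <T> T as(Object o, Class<T> cl) {
        return cl.isInstance(o) ? cl.cast(o) : null;
    }
}

class CastDemo {
    // Hypothetical helper: sums the lengths of the String items and
    // silently skips nulls and values of any other type.
    static int totalStringLength(List<Object> items) {
        int total = 0;
        for (Object item : items) {
            String s = Cast.as(item, String.class);
            if (s != null) {
                total += s.length();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Object> items = Arrays.asList("pdf", 42, null, "ocr", 3.14);
        System.out.println(totalStringLength(items));
    }
}
```

One limitation of this approach is that `Class<T>` carries no generic type arguments, so a call such as `Cast.as(o, List.class)` can only yield a raw `List`.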


Posted by Steve Hawley on 09/23/2013 with 0 comments Tags: C#, java