
Atalasoft Imaging SDK Development Blog

Document Imaging and Developer Commentary


Recent blog posts

This is our new blog page. If you're looking for posts before 2012, see our archive.

Improving OCR Results: Adding Spellcheck


With the new Tesseract 3.02 engine available as an add-on for Atalasoft DotImage, I have become more interested in the quality of OCR results. When I scour the internet for OCRed documents, I find that many of them contain words that are misspelled because of a misinterpreted character or an omitted letter. I thought spellcheck might solve this issue, but after some experimentation I believe that, without very sophisticated integration, it can make only minor improvements to the overall OCR results. In DotImage, the OcrEngine object is set up to be very extensible, giving hooks into many major steps of the OCR process. Using DotImage and an open source .NET spell checking engine, I came up with two simple algorithms: “Missing Letter” and “Single Incorrect Letter.”

Missing Letter

In several of the raw OCR results from my sample set I noticed words that were completely missing a letter. The spell check engine provided good guesses when a let...
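As a rough illustration of the two ideas named in this excerpt, the Java sketch below generates candidate corrections for a word by inserting one letter ("Missing Letter") or by replacing one letter ("Single Incorrect Letter") and keeps only the candidates found in a dictionary. This is not the post's implementation: the class name `OcrSpellSketch`, the in-memory dictionary, and the lowercase a-z alphabet are all assumptions made for the example, and it does not use the DotImage OcrEngine hooks or any particular spell checking engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of DotImage or any spell checking library.
final class OcrSpellSketch {

    // "Missing Letter": try every letter at every insertion point.
    static List<String> missingLetter(String word, Set<String> dictionary) {
        Set<String> found = new TreeSet<>(); // sorted, no duplicates
        for (int i = 0; i <= word.length(); i++) {
            for (char c = 'a'; c <= 'z'; c++) {
                String candidate = word.substring(0, i) + c + word.substring(i);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return new ArrayList<>(found);
    }

    // "Single Incorrect Letter": replace each position with every other letter.
    static List<String> singleIncorrectLetter(String word, Set<String> dictionary) {
        Set<String> found = new TreeSet<>();
        for (int i = 0; i < word.length(); i++) {
            for (char c = 'a'; c <= 'z'; c++) {
                if (c == word.charAt(i)) {
                    continue; // skip the unchanged word
                }
                String candidate = word.substring(0, i) + c + word.substring(i + 1);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return new ArrayList<>(found);
    }
}
```

With a dictionary containing "letter", `missingLetter("leter", dictionary)` and `singleIncorrectLetter("lettcr", dictionary)` would each propose "letter". A real integration would presumably rank several candidates, which is where a spell checking engine's own suggestions come in.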


Posted by Kevin Hulse on 03/18/2014 with 0 comments

Some Introduction, Some Tesseract


Hi there! I’m Kevin Hulse, the newish Solutions Enablement Specialist at Atalasoft. You may have worked with me directly after I started at Atalasoft as a Developer Support Engineer nearly six years ago. Since starting here, I have worked in almost every department, from Support to Engineering and now Marketing (watch out, Sales). I hope to begin a small series of blog posts on all things OCR, and I plan on providing interesting, technical-minded posts on our products, our customers, and document imaging in general, as well as posts on things that I simply find interesting enough to talk about. Speaking of products, with the release of DotImage 10.4.1, our OCR libraries have been upgraded to handle version 3.02 of the Tesseract OCR Engine. This upgrade includes a few small improvements to the speed and accuracy of processing, as well as an increased ability to use new data packages to support more extended character sets. Additionally, here’s a list of all the langua...


Posted by Kevin Hulse on 02/27/2014 with 0 comments Tags: OCR, Tesseract

How to Work With Library Developers/Support


In addition to writing code from the ground up, we also work with other library developers and package their libraries in a C# or Java API, which is typically easier to work with or more convenient for our customers. Many of our customers aren’t comfortable working with C++ libraries, or sometimes the C++ libraries have awkward interfaces, and that’s fine. We’re very good at taking this type of API and presenting it in a way that feels right for .NET or the JVM and integrates with the rest of our code base. Still, the same way that you work with us, we have to, at times, work with other library writers, and we find bugs every now and again. Here are five tips for working with library creators to get the most out of your interactions.

Make sure that you are using the library correctly. The reason is that (hopefully) your library has a model of operation that lends itself to a particular model of usage. For example, some librarie...


Posted by Steve Hawley on 01/14/2014 with 0 comments

When is boolean not a boolean?


I ran into a failing C# unit test today, with the following output:

  Expected: True
  But was:  True

Seriously. I stopped it in the debugger and the property that was being checked was “true”. I set it to a local and that was also “true” in the debugger. So when is it the case that true != true? The answer, to me, was straightforward: in C#, “true” is supposed to be the 32-bit value 0x00000001, but in many languages, “true” is defined as “anything that is not ‘false’” (aka 0). Since the code that was generating the value was C++/CLI interfacing with C, it seemed pretty clear where the issue was – I opened up a Memory window in the debugger and dropped the member onto it, which showed that the value for the boolean was 0x00002080 (or something like that). The culprit was C++ code that was calling a low-level C function, passing in two locals by reference.  ...
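The mismatch between "true is exactly 1" and "true is anything non-zero" can be sketched in a few lines. The excerpt is about C# and C++/CLI; this example uses Java with an `int` standing in for the raw native value, purely as an illustration. The class and method names are hypothetical and do not come from the post.

```java
// Hypothetical illustration: an int stands in for a raw value coming from native code.
final class NativeBool {

    // C convention: any non-zero value counts as true.
    static boolean fromNative(int raw) {
        return raw != 0;
    }

    // Strict check that only accepts the canonical value 1,
    // which is what a bitwise comparison against "true" amounts to.
    static boolean isCanonicalTrue(int raw) {
        return raw == 1;
    }
}
```

A value such as 0x2080 is true under the first rule and not under the second, which is how two values can both display as "True" yet fail an equality check. Normalizing at the native boundary (the `raw != 0` step) avoids the problem.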


Posted by Steve Hawley on 11/26/2013 with 0 comments

Devoxx 13 Belgium, a Quick Look


I spent last week in Antwerp, Belgium attending the Devoxx conference. It was an interesting experience, not least because I was there playing a trio of roles: developer, presenter, and evangelist. In the developer role, I was looking for new technologies that we could make use of in current or future work. Interesting things I saw included Genymotion (a high-speed Android emulator) and the Dart programming language. My presentation on parsing PDF in Java went well, although I went way too fast – it happens when you are both a little nervous and very passionate about your topic. Hopefully the talk will be up on parleys.com soon – at this writing, the channel page is empty. What struck me the most was working the trade show floor. It’s always a bit of a challenge at developer conferences in that we, as developers, can be introverted and do not want to interact with other people. Very often, people who walked by...


Posted by Steve Hawley on 11/20/2013 with 0 comments

Off to Devoxx 2013


I’ll be going to Devoxx in Antwerp, Belgium next week.  If you’re there, stop by the Atalasoft booth to say hi, or come see me present on PDF Parsing.


Posted by Steve Hawley on 11/08/2013 with 0 comments

Atalasoft 10.4 SDKs Released


Update: Atalasoft version 10.4.1 was released in early 2014 and brings support for Tesseract OCR 3.02, updated DICOM libraries, and bug fixes. See the release notes for more information:

What's new in 10.4.0?

Our engineers have been hard at work and we are excited to announce the latest release of our SDKs: Version 10.4. This is a major release just like 10.2 and 10.3. Let's dive into some new features for our products across the board:

Word Reader Add-on (for DotImage, any edition)

Customers have been asking to be able to open Microsoft Word files in our viewers. With this new product, users will be able to rasterize basic Word documents to save in other formats or display in our viewers. Simply add the WordDecoder to the list of RegisteredDecoders. Any Word document opened with our controls is then rasterized automatically and displayed in the viewer. The .docx is turned into an AtalaImage that can be treated like many other formats we supp...


Posted by Eric Deutchman on 10/07/2013 with 0 comments

A Quick Java Gem, Borrowing from C#


I love the C# operator as. This is syntactic sugar to cast an object of one type to another type, evaluating to null if the types are incompatible. It is equivalent to writing the following code:

    Bar bar = o is Bar ? (Bar)o : null;

It’s just neater to use as. I’ve been writing some Java code and I really wanted the as operator, but Java doesn’t have it. Instead I could write this equivalent:

    Bar bar = o instanceof Bar ? (Bar)o : null;

but embedding that ternary expression into my code is sloppy. Instead, I wrote it into this class:

    public class Cast {
        public static <T> T as(Object o, Class<T> cl) {
            if (o == null) return null;
            return cl.isInstance(o) ? cl.cast(o) : null;
        }
    }

which when used in context feels a lot neater to me:

    Bar bar = Cast.as(o, Bar.class);

Remember...
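To show how the `Cast.as` helper from this excerpt behaves with compatible, incompatible, and null inputs, here is a small self-contained sketch. The `Cast` class mirrors the one in the post; `CastDemo` and its `lengthOrMinusOne` method are hypothetical names added only for the example. The explicit null check is dropped here because `Class.isInstance` already returns false for null.

```java
// Mirrors the helper from the post; isInstance(null) is false, so null falls through to null.
final class Cast {
    static <T> T as(Object o, Class<T> cl) {
        return cl.isInstance(o) ? cl.cast(o) : null;
    }
}

// Hypothetical caller, added for illustration only.
final class CastDemo {

    // Returns the string length if o is a String, otherwise -1.
    static int lengthOrMinusOne(Object o) {
        String s = Cast.as(o, String.class);
        return s != null ? s.length() : -1;
    }

    public static void main(String[] args) {
        System.out.println(lengthOrMinusOne("abc")); // a String: length is used
        System.out.println(lengthOrMinusOne(42));    // an Integer: cast yields null
        System.out.println(lengthOrMinusOne(null));  // null input: cast yields null
    }
}
```

One design note: because the result can be null, callers still need a null check, just as they would after the C# as operator.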


Posted by Steve Hawley on 09/23/2013 with 0 comments Tags: C#, java

Remember that Writing Software is a Creative Process


Having been in software engineering for more than 20 years, I’ve met a lot of solid engineers, and creativity and imagination were part and parcel of their personalities. This is a great trait in that it helps us come up with solutions to challenging problems. One thing that doesn’t get talked about enough is that, I believe, no creativity comes without its share of problems. For example, creativity isn’t something that can be switched on if it isn’t there. I don’t think I know anyone who can choose to be creative on a particular day; instead, we need to make the best of the creativity that we have on any given day. Worse than being “not on,” you might instead suffer from depression or other issues that totally sap your abilities. Greg Baugues gave an interesting talk about dealing with ADHD and Type II Bipolar. I don’t have the same issues that Greg has, but I do have a list of things that migh...


Posted by Steve Hawley on 08/28/2013 with 0 comments

Compressing PDF Documents for Archive


An interesting article by David Kriesel, linked from Hacker News, describes an issue in which Xerox scanners produce PDF documents that have numbers randomly swapped. At this writing, Xerox is working on a patch – awesome – this is exactly what they should be doing. The issue, in brief, is that the software built into the unit that creates PDF documents uses JBIG2 compression. JBIG2 compression is a (potentially) lossy compression algorithm that looks for nearly identical tiles and replaces all the tiles that are nearly identical with one tile. This type of compression can create much smaller documents, but it can create content errors if the “close enough” predicate lets a ‘6’ be replaced with an ‘8’. Atalasoft, to my knowledge, does not supply Xerox with the software that they use in their scanners, but Atalasoft does provide a powerful set of PDF creation tools. So let’...
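The "nearly identical tiles" idea can be sketched with a toy matcher. This is a simplified illustration of the substitution step the excerpt describes, not the JBIG2 algorithm itself and not any Atalasoft or Xerox code: tiles are modeled as equal-length strings of '0' and '1' pixels, similarity is a plain pixel-difference count, and the names `TileMatchSketch`, `hammingDistance`, `assignRepresentatives`, and `maxDiff` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of lossy tile substitution; real JBIG2 symbol matching is far more involved.
final class TileMatchSketch {

    // Counts the pixels at which two same-sized tiles differ.
    static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("tiles must be the same size");
        }
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return diff;
    }

    // For each tile, returns the index of the earlier tile that will stand in for it.
    // A tile becomes its own representative when no earlier one is within maxDiff pixels.
    static int[] assignRepresentatives(List<String> tiles, int maxDiff) {
        List<Integer> representatives = new ArrayList<>();
        int[] assigned = new int[tiles.size()];
        for (int i = 0; i < tiles.size(); i++) {
            int match = -1;
            for (int r : representatives) {
                if (hammingDistance(tiles.get(r), tiles.get(i)) <= maxDiff) {
                    match = r;
                    break;
                }
            }
            if (match < 0) {
                representatives.add(i);
                match = i;
            }
            assigned[i] = match;
        }
        return assigned;
    }
}
```

With `maxDiff` set to 0 only pixel-identical tiles are merged, so nothing is lost. Any larger threshold trades size for the risk described above: two different glyphs that happen to fall within the threshold end up drawn with the same tile.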


Posted by Steve Hawley on 08/13/2013 with 0 comments