OCR on PDF files in Alfresco
Many tools can be integrated with Alfresco to perform PDF-to-PDF conversion with OCR. The best one I have used so far is pdfsandwich, which is open source, so anyone can download it. Another tool that can be integrated is Tesseract. pdfsandwich itself uses Tesseract internally, but I prefer pdfsandwich.
Technically, it doesn't matter which tool you select; only the output differs, and changing the tool requires little change in the development. The approach is to create a shell script that runs the OCR command and generates the output. We then fetch that output and overwrite the existing file in Alfresco.
Let's go further in development.
The first step is to create the shell script, execute it, and test it.
Create a file called transformation.sh. Before adding your command to it, add the line below. If you are using Windows, create a batch file accordingly.
unset LD_LIBRARY_PATH
If you do not set this in the script file, you will get an error during conversion. This is a registered issue in Alfresco.
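As a rough sketch of what transformation.sh might contain, the script below clears LD_LIBRARY_PATH and then runs pdfsandwich on a single file. It assumes the script is called with the source path and the target path as its two arguments. The run_ocr function and the OCR_TOOL variable are illustrative names, not part of Alfresco or pdfsandwich. The only pdfsandwich option used is -o, which sets the output file.

```shell
#!/bin/sh
# transformation.sh: sketch of an OCR wrapper script.
# Assumed usage: transformation.sh <source.pdf> <target.pdf>

# Clear the library path inherited from the Alfresco process so the
# OCR tool loads the system libraries instead.
unset LD_LIBRARY_PATH

# OCR_TOOL is a hypothetical override (handy for testing);
# it defaults to pdfsandwich.
OCR_TOOL="${OCR_TOOL:-pdfsandwich}"

run_ocr() {
    src="$1"
    dst="$2"
    if [ ! -f "$src" ]; then
        echo "Source file not found: $src" >&2
        return 2
    fi
    # -o tells pdfsandwich where to write the searchable PDF.
    "$OCR_TOOL" -o "$dst" "$src" || return 1
    # Treat a missing or empty output file as a failure.
    [ -s "$dst" ]
}

# Run the conversion only when both paths are supplied.
if [ "$#" -eq 2 ]; then
    run_ocr "$1" "$2"
fi
```

To test it outside Alfresco, run something like ./transformation.sh input.pdf output.pdf and check that the text in output.pdf can be selected and searched.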
Once you have created the script on Linux, create an action executor in Alfresco that fires the script. You can do so by following the steps below.
TransformerAction.java
package com.ocr;

import java.io.File;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.alfresco.repo.content.transform.ContentTransformerHelper;
import org.alfresco.repo.content.transform.ContentTransformerWorker;
import org.alfresco.service.cmr.repository.ContentIOException;
import org.alfresco.service.cmr.repository.ContentReader;
import org.alfresco.service.cmr.repository.ContentWriter;
import org.alfresco.service.cmr.repository.MimetypeService;
import org.alfresco.service.cmr.repository.TransformationOptions;
import org.alfresco.util.TempFileProvider;
import org.alfresco.util.exec.RuntimeExec;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class TransformerAction extends ContentTransformerHelper implements ContentTransformerWorker {

    private static final Log logger = LogFactory.getLog(TransformerAction.class);

    // Names of the variables passed to the configured command.
    private static final String VAR_SOURCE = "source";
    private static final String VAR_TARGET = "target";

    private RuntimeExec executer;
    private RuntimeExec checkCommand;

    // Injected
    private MimetypeService mimetypeService;

    private boolean available = true;
    private Date lastChecked = new Date(0L);
    private int checkFrequencyInSeconds = 120;

    // Timeout for the external command, in milliseconds.
    public long pdfTimeout = 60000L;

    public void setCheckFrequencyInSeconds(int frequency) {
        checkFrequencyInSeconds = frequency;
    }

    public boolean isAvailable() {
        // Re-run the check command only after the check interval has elapsed.
        Date refreshAvailabilityDate = new Date(lastChecked.getTime() + 1000L * checkFrequencyInSeconds);
        if (new Date().after(refreshAvailabilityDate)) {
            test();
        }
        return available;
    }

    public void setMimetypeService(MimetypeService ms) {
        mimetypeService = ms;
    }

    public void setTimeout(long timeout) {
        pdfTimeout = timeout;
    }

    public void setExecuter(RuntimeExec executer) {
        this.executer = executer;
    }

    public void setCheckCommand(RuntimeExec checkCommand) {
        this.checkCommand = checkCommand;
    }

    protected void test() {
        // Record the check time so the result is reused until the interval elapses.
        lastChecked = new Date();
        try {
            logger.debug("Testing availability");
            RuntimeExec.ExecutionResult result = checkCommand.execute();
            available = result.getSuccess();
            logger.info("Is tesseract available? " + available);
        } catch (Exception e) {
            available = false;
            logger.warn("Check command [" + checkCommand.getCommand() + "] failed. Registering transform as unavailable for the next " + checkFrequencyInSeconds + " seconds");
        }
    }

    public final void transform(ContentReader reader, ContentWriter writer, TransformationOptions options) throws Exception {
        try {
            logger.debug("Beginning transform for " + reader.getContentData().getContentUrl());
            String sourceMimetype = getMimetype(reader);
            String targetMimetype = getMimetype(writer);
            String sourceExtension = mimetypeService.getExtension(sourceMimetype);
            String targetExtension = mimetypeService.getExtension(targetMimetype);

            File sourceFile = TempFileProvider.createTempFile(getClass().getSimpleName() + "_source_", "." + sourceExtension);
            File targetFile = TempFileProvider.createTempFile(getClass().getSimpleName() + "_target_", "." + targetExtension);
            logger.debug("Temp files created");

            // Copy the repository content to the temp source file.
            reader.getContent(sourceFile);
            logger.debug("Source file written");

            Map<String, String> properties = new HashMap<String, String>(5);
            properties.put(VAR_SOURCE, sourceFile.getAbsolutePath());
            properties.put(VAR_TARGET, targetFile.getAbsolutePath());

            RuntimeExec.ExecutionResult result = executer.execute(properties, pdfTimeout);

            // The output path equals the temp target path here, so this rename has no effect.
            // Adjust it if your tool writes to a different file name.
            File actualLocation = new File(targetFile.getAbsolutePath());
            actualLocation.renameTo(targetFile);

            if (result.getExitValue() != 0 && result.getStdErr() != null && result.getStdErr().length() > 0) {
                throw new ContentIOException("Failed to perform OCR transformation: \n" + result);
            }
            logger.debug("Transform executed");

            // Write the OCR output back to the repository.
            writer.putContent(targetFile);
            logger.info("Transform complete");
        } catch (Exception e) {
            logger.error("Exception during transform", e);
            throw e;
        }
    }

    @Override
    public boolean isTransformable(String sourceMimetype, String targetMimetype, TransformationOptions options) {
        // Only PDF-to-PDF transformations are handled.
        return "application/pdf".equals(targetMimetype) && "application/pdf".equals(sourceMimetype);
    }

    @Override
    public String getVersionString() {
        return "Tesseract Transformer V1.0 - this method not implemented yet";
    }
}
transformer-context.xml
MIME type pairs (source, target):
image/png text/plain
image/jpeg text/plain
image/gif text/plain
OCR.properties
transformer.pdf.title=PDF OCR
transformer.pdf.description=Content file will be converted to PDF and OCR process will produce a searchable PDF file.
transformer.pdf.continue-on-error.display-label=Continue on error
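TransformerAction runs its checkCommand to decide whether the transformer is available. What that command is depends on your configuration. One possibility is a small script that succeeds only when the OCR tools are installed. A minimal sketch, where the script name check-ocr.sh and the function ocr_tools_available are hypothetical:

```shell
#!/bin/sh
# check-ocr.sh (hypothetical name): succeeds only when every tool
# named on the command line can be found on the PATH.

ocr_tools_available() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing tool: $tool" >&2
            return 1
        fi
    done
    return 0
}

# Check the tools given as arguments, for example:
#   check-ocr.sh pdfsandwich tesseract
if [ "$#" -gt 0 ]; then
    ocr_tools_available "$@"
fi
```

A zero exit code would mark the transformer as available, and any other exit code would mark it as unavailable until the next check.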
Once you have created these files in your project, you will see an action called PDF OCR under rules. Select it in a rule and configure the rest as per your requirements. You will then be able to perform PDF-to-PDF conversion.
Hope this helps.