August 10, 2014 - Phillip Sipe

PDF Generation in Node - Part 3

PDF Generation in Node - Part 3: Rendering the PDF with an HTML to PDF Converter

Here are the previous posts on the topic:

Part 1 - Drawing the PDF Directly.

Part 2 - Using a Headless Browser.

As a reminder, here is how I will ultimately rate the solutions (on a scale of 1 to 5):

Power - The level of control you have over what the resulting PDF looks like
Efficiency - How many resources the strategy consumes
Ease of Use - How easy the solution is to implement
Flexibility - How easy it is to alter the appearance of the PDF (for example due to a marketing redesign)

Everyone ready for some shameless self-promotion? Great! Let’s get started…

Using an HTML to PDF Converter

For the JavaScript examples in this discussion I’ll be using html-to-pdf. html-to-pdf is a Node module which uses a Java HTML to PDF conversion library called Flyingsaucer in the background. Full disclosure: I wrote html-to-pdf (the Node module parts anyway - Flyingsaucer was written by other developers). There are also a couple of pieces of Java code showing the use of Flyingsaucer and JTidy.

So, I had struggled with the past couple of solutions for generating PDFs in Node. Drawing directly worked fine, but had a huge development and maintenance overhead. Headless browsers were much easier, but generated files that were far too large for my purposes. Eventually, through a great deal of Google patience, I stumbled on Flyingsaucer. Flyingsaucer is a Java library that reads XML and CSS and transforms them into a PDF. Flyingsaucer has a few weaknesses though:

It is designed as an XML to PDF converter, which means it needs to be fed strict XHTML.
The CSS renderer is custom built and somewhat dated, meaning CSS has to conform to CSS2 standards.
It's in Java so we have to do some shenanigans to get it working in Node.

These were some hurdles, but I had some sweet Jade templates ready to go and wasn’t going to give up. So here’s a simple Java program using Flyingsaucer:

//Core Java imports needed by this example
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.xhtmlrenderer.pdf.ITextRenderer;

public class PDFRenderer {
    public static void main(String[] args) throws Exception {

        //Setup the inputs and outputs for the PDF rendering (both paths are placeholders)
        String url = new File("path/to/html/file.html").toURI().toURL().toString();
        String pdfFilePath = "path/to/output/file.pdf";
        OutputStream outputPDF = new FileOutputStream(pdfFilePath);

        //Create the renderer and point it to the HTML document
        ITextRenderer renderer = new ITextRenderer();
        renderer.setDocument(url);

        //Render the PDF document
        renderer.layout();
        renderer.createPDF(outputPDF);

        //Close the stream
        outputPDF.close();
    }
}

Now this works fine if your HTML is already strictly XHTML. But, since the files I was generating were based on dynamic data, I was using Jade templates and rendering them into HTML. This meant I couldn’t be sure I would get XHTML, so I needed to clean up the HTML before handing it off to Flyingsaucer. Fortunately, there is an HTML syntax library called JTidy which has a Tidy class that can clean up HTML into XHTML. Adding that took a little more code:

//Core Java imports needed by this example
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.xhtmlrenderer.pdf.ITextRenderer;
import org.w3c.tidy.Tidy;

public class PDFRenderer {
    public static void main(String[] args) throws Exception {

        //Set up input file and output file for cleaning up the HTML
        InputStream inputFileStream = new FileInputStream("path/to/html/file.html");
        String cleanHTMLFile = "temp.html";
        OutputStream cleanedFileStream = new FileOutputStream(cleanHTMLFile);

        //Clean the HTML
        Tidy htmlCleaner = new Tidy();
        htmlCleaner.setXHTML(true);
        htmlCleaner.parse(inputFileStream, cleanedFileStream);

        //Close the cleaning streams so the XHTML is fully written before rendering
        inputFileStream.close();
        cleanedFileStream.close();

        //Setup the inputs and outputs for the PDF rendering (output path is a placeholder)
        String url = new File(cleanHTMLFile).toURI().toURL().toString();
        String pdfFilePath = "path/to/output/file.pdf";
        OutputStream outputPDF = new FileOutputStream(pdfFilePath);

        //Create the renderer and point it to the XHTML document
        ITextRenderer renderer = new ITextRenderer();
        renderer.setDocument(url);

        //Render the PDF document
        renderer.layout();
        renderer.createPDF(outputPDF);

        //Close the output stream (and don't cross the streams!)
        outputPDF.close();

        //Clean up the temp file
        File tempFile = new File(cleanHTMLFile);
        tempFile.delete();
    }
}

This did the trick quite well. The only thing left was to get it into my Node server. After a small bit of tweaking on the Java code to get it to accept command line arguments instead of using hard-coded paths, all I had to do was use a child process in Node to launch the executable JAR of the Java program:

var child_process = require('child_process');

//Again for conciseness, let's save the tedium of constructing the arguments
var renderer = child_process.spawn('java', args);

renderer.on('error', function (error) {
    console.log(error);
});

renderer.on('exit', function (code) {
    console.log(code);
});
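
For the curious, the argument list I skipped above might be assembled along these lines. This is only a sketch: the JAR name `pdf-renderer.jar`, the helper name `buildRendererArgs`, and the order of the two path arguments are assumptions for illustration, not the actual command line of the Java program.

```javascript
var path = require('path');

// Hypothetical helper: builds the argument list passed to `java`.
// The JAR location and the argument order are assumptions for illustration.
function buildRendererArgs(jarPath, sourceHtml, destinationPdf) {
    return [
        '-jar', path.resolve(jarPath),
        path.resolve(sourceHtml),
        path.resolve(destinationPdf)
    ];
}

var args = buildRendererArgs('pdf-renderer.jar', 'report.html', 'report.pdf');
console.log(args.join(' '));
```

Resolving the paths up front means the Java process does not depend on whatever working directory the Node server happened to be started from.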

And that did the trick! However, I was unsatisfied with the overhead of dealing with all of the child process handling, so I wrapped it into a module. This ended with the following syntax:

var htmlToPdf = require('html-to-pdf');

htmlToPdf.convertHTMLFile('path/to/source.html', 'path/to/destination.pdf',
    function (error, success) {
        if (error) {
            console.log('Oh noes! Errorz!');
            console.log(error);
        } else {
            console.log('Woot! Success!');
            console.log(success);
        }
    }
);
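
A wrapper like this mostly has to translate child process events into a single Node-style callback. The following is a minimal sketch of that idea, not the actual source of html-to-pdf; the function name `runCommand` and the shape of the success value are hypothetical.

```javascript
var child_process = require('child_process');

// Hypothetical wrapper: runs a command and reports the outcome through
// a single Node-style callback, which is called exactly once.
function runCommand(command, args, callback) {
    var finished = false;
    function finish(error, result) {
        if (finished) { return; }
        finished = true;
        callback(error, result);
    }

    // Output is ignored here to keep the sketch short
    var child = child_process.spawn(command, args, { stdio: 'ignore' });

    // Fired when the process could not be started (e.g. java is not installed)
    child.on('error', function (error) {
        finish(error);
    });

    // Fired when the process ends; anything but exit code 0 counts as failure
    child.on('exit', function (code) {
        if (code === 0) {
            finish(null, { success: true });
        } else {
            finish(new Error('Process exited with code ' + code));
        }
    });
}

// Example: check that a Java runtime is available
runCommand('java', ['-version'], function (error) {
    console.log(error ? 'Java not available: ' + error.message : 'Java found');
});
```

The `finished` flag is there because a failed launch can be followed by further events, and a callback that fires twice is a nasty bug to track down.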

The Flyingsaucer converter puts actual text into the resulting document, so the sizes were similar to sizes obtained from direct drawing, but with the convenience of using HTML templates. However, this approach has its own set of drawbacks. There was nothing that could be done about the old version of CSS. As a result, looking at your HTML in the browser won’t always give you an accurate representation of how the PDF will look. It also requires you to have Java installed, which usually isn’t a big deal, but it can be relevant. I think the rating for html-to-pdf breaks down like this:

Power - 3
Efficiency - 4
Ease of Use - 5
Flexibility - 5

My own solution comes with a tradeoff as well. With the restriction to CSS2, you lose a bit of power in what you can accomplish, requiring you to jump through hoops at times to get the effect you are looking for. Still, html-to-pdf has a very simple and easy syntax and it renders the document very efficiently space-wise. I think each solution in this series has its own advantages and disadvantages. No one solution will be right for every application. Hopefully, this series will have helped you find whichever solution is right for you!

Let me know what you think of the solutions and my ratings of them in the comments, and thanks for reading!

July 16, 2014 - Phillip Sipe

Ramblings of a Chaotic Madman

Hello to all of you who stumbled here by mistake. This is the place where I talk about technology and stuff. This post doesn’t have any of that content, so if you are looking for that then too bad! This was mostly just a test post while I was getting the blog going, so if you are looking for great insights into the world of technology and JavaScript then I suggest you go to http://www.google.com and type your desired topic into the search bar and press the Enter or Return key. However, if you are looking for the thoughts and opinions of some random developer who figured out how to put words on the internet, then you might try just clicking on one of my other posts.

July 16, 2014 - Phillip Sipe

PDF Generation in Node - Part 2

PDF Generation in Node - Part 2: Rendering the PDF with a Headless Browser

If you missed part 1 of this series, you can find it here.

As a reminder, here is how I will ultimately rate the solutions (on a scale of 1 to 5):

Power - The level of control you have over what the resulting PDF looks like
Efficiency - How many resources the strategy consumes
Ease of Use - How easy the solution is to implement
Flexibility - How easy it is to alter the appearance of the PDF (for example due to a marketing redesign)

Alright, no one likes recaps, right? Enough of that, let’s get to…

Using a Headless Browser

For the examples in this discussion I will be using NodePDF. This node module uses PhantomJS in the background. It is very possible to just use Phantom directly. However, the syntax for doing so is a lot harder to grok at a first look. So, I’m going to use NodePDF to keep things simple and clear.

For those who may not be familiar with the term, headless browsers are pieces of software that behave like a web browser, but have no GUI (this is why they are “headless”). They have whole suites of functions that allow you to interact with a webpage programmatically. NodePDF will abstract most of the handling of our headless browser away from us, so that we can focus on the PDF, but it’ll be important to understand the concept of headless browsing when discussing the strengths and weaknesses of this approach.

So, clearly drawing the PDF directly is a powerful, but low-level solution to PDF generation. What if we wanted something that had less maintenance and code overhead? We’d also prefer to have a solution which utilizes web technologies that we are familiar with. Fortunately, headless browsers provide that solution. We can point a headless browser at a web page and then render it as a PDF file. This allows us to use HTML and CSS to stylize a document and then simply convert it to a PDF all automatically. Since we are using HTML, this also lets us use whatever templating engine we’d like (e.g. Jade) to make our lives easier as well.

Here’s how to generate a PDF using NodePDF and PhantomJS:

var NodePDF = require('nodepdf');

var document = new NodePDF('/path/to/HTML/file.html', '/path/to/output/PDF/file.pdf', {
    viewportSize: {
        width: 1440,
        height: 900
    },
    paperSize: {
        pageFormat: 'Letter',
        margin: {
            top: '1.5in',
            left: '1in',
            right: '1in',
            bottom: '1in'
        }
    },
    zoomFactor: 1.0
});

document.on('error', function (error) {
    console.log(error);
});

document.on('done', function (path_to_pdf) {
    console.log(path_to_pdf);
});

You may notice that this is a similar syntax to Node’s HTTP and child process modules. There are also events for the stdout and stderr streams. The options available are the ones that PhantomJS provides for its render function, and there are many more than are shown here, covering a wide range of things. For this example, I’ve just used a relatively simple baseline.

Fundamentally, we are opening an HTML file in a browser, taking a picture of it, and saving that picture as a PDF. Simple, clean, and easy. However, this solution comes with a significant drawback that may not be clear at first.

When we put a line of text into the PDF document using the direct drawing technique, raw text was placed into the document. But, if we did the same thing with the headless browser technique, then we are putting a picture of text into the document. This is a massive jump in space demands. For many applications, this won’t matter. For a single page PDF the differences will likely not be as important. However, for the application I was working on, the sizes of the documents could range from a couple of pages to hundreds of pages. Clients wanted their reports delivered by email, but some of the larger reports exceeded the attachment limit! I attempted a variety of hack-like workarounds, but I couldn’t bring the size down enough, and this solution proved unworkable as a result.
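
One cheap safeguard in a situation like this is to check the size of a generated report before trying to attach it. Here is a minimal sketch; the 25 MB limit and the helper names are assumptions for illustration, since real attachment limits vary by mail provider.

```javascript
var fs = require('fs');

// Assumed limit for illustration; real mail servers vary
var ATTACHMENT_LIMIT_BYTES = 25 * 1024 * 1024;

// Returns true when a file of the given size would be too large to attach
function exceedsAttachmentLimit(sizeInBytes, limitInBytes) {
    return sizeInBytes > limitInBytes;
}

// Hypothetical helper: checks a generated PDF that is already on disk
function isTooLargeToEmail(pdfPath) {
    var sizeInBytes = fs.statSync(pdfPath).size;
    return exceedsAttachmentLimit(sizeInBytes, ATTACHMENT_LIMIT_BYTES);
}
```

A check like this does not make the files any smaller, of course, but it at least lets the server fall back to something like a download link instead of failing at send time.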

So, this solution had a fatal weakness for me, but it could very easily be just what you are looking for. Here’s how this solution rates:

Power - 4
Efficiency - 1
Ease of Use - 4
Flexibility - 5

You don’t lose too much power with this solution, as you can do a lot with what is possible in modern browsers. You gain a ton in how easy this solution is to implement. It doesn’t get much easier than this, though there are still a couple of gotchas with using Phantom that can provide a few frustrating moments. However, the real loss here is in the size of the document. These PDFs will be relatively enormous, which can make this solution unusable for certain applications (like mine).

I was not ready to give up on using simple HTML templates, however. Designing and tweaking those templates had felt like heaven compared to tweaking the direct drawing technique. Undaunted, I continued searching for an alternative. I delved into the depths of Google until I found a new path. A third solution…

That I will discuss next time!

July 16, 2014 - Phillip Sipe

PDF Generation in Node - Part 1

PDF Generation in Node - Part 1: Directly Drawing the PDF

I’ve recently been neck-deep in refactoring a project. I just got to the automated report generation code in the project, and it made me all nostalgic. That’s the feeling where you need to suddenly run to the bathroom, right?

I’ve tried quite a variety of strategies in creating PDF reports on a Node server automatically. Each has a set of strengths and weaknesses, and there isn’t a single perfect solution. I’ve broken down the types of solutions into 3 distinct categories. I will be using a particular technology for each of the examples, but you should know that each category has a variety of technologies/libraries/etc. within it. There are subtleties between each of them, but for now I am just going to go over the larger categories and the major differences between them. Because everyone likes it when stuff is broken down into numbers, I’ll also be rating the categories on a scale of 1 to 5 on the following fields:

Power - The level of control you have over what the resulting PDF looks like
Efficiency - How many resources the strategy consumes
Ease of Use - How easy the solution is to implement
Flexibility - How easy it is to alter the appearance of the PDF (for example due to a marketing redesign)

With that, let’s get to the categories!

Draw the PDF Directly

For the examples in this section I’ll be using PDFKit.

Drawing the PDF is by far the most powerful solution. However, with great power comes great amounts of code:

var PDFDocument = require('pdfkit');
var document = new PDFDocument();
document.font('Times-Roman')
    .fontSize(20)
    .text('Hello World!', 100, 100);

Now obviously, this is not a “great amount” of code. Heck, I sort of made it seem larger with all those extra new lines and indentations!

If you program with a library like this for any significant amount of time, you’ll likely end up writing in a similar way. PDFKit’s API is very chainable, which significantly cuts down on lines of code, but does result in some very long lines. Styling your code like this will really help future readability.

Let’s add a bit of complexity:

document.rect(100, 20, 100, 100)
    .fill('gray');
document.font('Times-Roman')
    .fontSize(12)
    .fillColor('white')
    .text('White text on a gray background!', 110, 60);

This added complexity is making another potential problem with direct drawing apparent. There are a lot of magic numbers. It can be very hard to keep all of your hardcoded numbers organized. It is possible to keep everything together without using magic numbers, but it is a time-consuming development effort.
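
As a small illustration of what that effort looks like, the coordinates from the snippet above could be pulled into named values. This is a minimal sketch; the `layout` object and the `labelPosition` helper are hypothetical names of my own, not part of PDFKit.

```javascript
// Named layout values replace the bare numbers used in the drawing calls
var layout = {
    box: { x: 100, y: 20, width: 100, height: 100 },
    labelOffset: { x: 10, y: 40 } // label position relative to the box corner
};

// Computes the absolute position of the label inside the box
function labelPosition(box, offset) {
    return { x: box.x + offset.x, y: box.y + offset.y };
}

var label = labelPosition(layout.box, layout.labelOffset);
console.log(label); // { x: 110, y: 60 }

// With PDFKit, these values would be passed in place of the literals:
// document.rect(layout.box.x, layout.box.y, layout.box.width, layout.box.height)
//     .fill('gray');
// document.text('White text on a gray background!', label.x, label.y);
```

Moving the box now means changing one value instead of hunting down every number that depended on it, but you can see how quickly the bookkeeping grows for a full report.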

The real drawback of this solution, however, is the amount of effort it takes to change the document’s appearance. In my case, marketing came back with a design once that they wanted to try for a report we were generating. It was so different, we were forced to tell them that we would be (more or less) starting over! Given that the visual design of things often changes at a rapid pace, this is obviously a large drawback.

Now, this overview has been pretty negative so far. I won’t lie and say that this is my preferred solution. It isn’t. But, this solution is extremely powerful. If you want to draw some crazy mathematically calculated fractal and put it in a PDF, you can do that without jumping through hoops. This solution trades convenience for power, and it has power in spades.

The other benefit of this solution, which will become more apparent when we look at the next solution, is resource consumption. This is fast and disk efficient, and you are unlikely to run into any size barriers using this method.

Final Rating

Power - 5
Efficiency - 5
Ease of Use - 2
Flexibility - 1

Ultimately, drawing the PDF directly is a solution of extremes. If you need large amounts of precise control over the document’s appearance and you need it to be efficiently sized, then directly drawing it will serve you well. However, it will take a much greater effort to develop and change the PDF (relative to other solutions).

So that took up way more text than I expected when I started writing this post. So, I’m going to break it up into several posts. Next time we will cover the world of PDF generation through headless browsers!

Next: PDF Generation Part 2 - Using Headless Browsers

Secret Robot Internet
