Technocrazian Hack the Code

Using Python and odfpy to create Open Document Texts

Heya,
I have been doing a project for my college, which involved a question paper generation module. I initially relied on LaTeX for the job, compiling the question paper to PDF from it. But the staff may sometimes need to edit the questions on the fly before generating a PDF, which meant I had to give them an editable format. Most of the staff in my college are not well versed in LaTeX, especially the ones from non-engineering departments. So I had to go with a text document format like docx or odt. The project was written in Django, so I searched for docx wrappers in Python (as my college was still on Windows). I eventually found python-docx, but it lacked many features that I wanted, like tab stops to separate a question from its mark, or writing two words at the two ends of a single line. Yes, there were issues filed on these, but nothing seemed to be happening on that front. So I decided to drop the idea of using docx and go with odt, since newer versions of MS Word support it without much hassle. That's when I ended up with odfpy.

Like with any other library, I wanted to read the documentation to understand how the basic stuff works (Yeah, the RTFM part of me!!). Boy, was I disappointed. The documentation of odfpy is crap, to put it bluntly. Yes, there exists an API documentation, but it is too technical and doesn't convey the concepts well. So it was pretty difficult to get what I wanted out of it. Thanks to the hacker mindset, I was ready to read the code and understand it. The upstream had also added tests covering most of the features, which I could use as reference. I was able to get what I wanted, and I decided to blog about it so that it'll be a help to others and a note to myself for future reference. So here goes.

To install odfpy on Debian systems, one may use the command sudo apt-get install python-odf. You can also use pip for the job: sudo pip install odfpy.

Let the hacking begin

  1. Create a python file named odftest.py with the following contents
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    from odf.opendocument import OpenDocumentText
    from odf.style import (Style, TextProperties, ParagraphProperties, ListLevelProperties, TabStop, TabStops)
    from odf.text import (H, P, List, ListItem, ListStyle, ListLevelStyleNumber, ListLevelStyleBullet)
    from odf import teletype

    These are the common modules we need to import for a document. The first one, OpenDocumentText from odf.opendocument, creates the base document object. odf.style has the stuff needed to style the document, like properties for paragraphs, text, lists etc. odf.text handles the text objects like headings, paragraphs, lists etc. The teletype module deals with retrieving text from and inserting text into elements, with proper handling of line breaks and whitespace.

  2. First we need to create an OpenDocumentText object which will act as the base object which creates the document.

    textdoc = OpenDocumentText()

  3. Next we need to create the styles to be used in the document - headings, bold text, bullets, numbering etc. The following code is written for readability; it can be shortened by skipping the intermediate variables and using the objects directly.

    # For Level-1 Headings that are centered
    h1style = Style(name="CenterHeading 1", family="paragraph")
    h1style.addElement(ParagraphProperties(attributes={"textalign": "center"}))
    h1style.addElement(TextProperties(
        attributes={"fontsize": "18pt", "fontweight": "bold"}))

    Here, the first line of code initializes a Style object, which has a specific name and belongs to a specific family. Alignment of the text is a property of a paragraph, so it is given as an argument to ParagraphProperties in the second line. However, fontsize and fontweight apply to a specific run of text, not to a paragraph. Hence, they are given as arguments to a TextProperties object.

    # For Level-2 Headings that are centered
    h2style = Style(name="CenterHeading 2", family="paragraph")
    h2style.addElement(ParagraphProperties(attributes={"textalign": "center"}))
    h2style.addElement(TextProperties(
        attributes={"fontsize": "15pt", "fontweight": "bold"}))

    This block of code defines a smaller heading (note the change in the fontsize attribute).

    # For bold text
    boldstyle = Style(name="Bold", family="text")
    boldstyle.addElement(TextProperties(attributes={"fontweight": "bold"}))

    Here we are defining a style to make text bold. Since it has nothing to do with paragraphs, it only has a TextProperties object.

    # Justified style
    justifystyle = Style(name="justified", family="paragraph")
    justifystyle.addElement(ParagraphProperties(
        attributes={"textalign": "justify"}))

    Here, we specify the justified style, which applies to paragraphs. Hence it uses a ParagraphProperties element.

    # For numbered list
    numberedliststyle = ListStyle(name="NumberedList")
    level = 1
    numberedlistproperty = ListLevelStyleNumber(level=str(level), numsuffix=".", startvalue=1)
    numberedlistproperty.addElement(ListLevelProperties(minlabelwidth="%fcm" % (level)))
    numberedliststyle.addElement(numberedlistproperty)

    Here we define the style to be used for Level 1 numbering, which means no nesting of numbering. The first line defines a ListStyle object, which represents the style. ListLevelStyleNumber() is used to specify that the list is a numbered list and to define the level of the numbering, which in our case is 1. The attribute numsuffix defines the character that should be inserted after a numeral in the numbering scheme - which in our case is a period (.) - and startvalue defines the starting number of the list.
    Also, we add ListLevelProperties to impart some features to the items of the list. Here, it is minlabelwidth, which specifies how much width should be given to the numbering portion of the text, i.e. how much space is reserved for the number before the content begins.

    # For Bulleted list
    bulletedliststyle = ListStyle(name="BulletList")
    level = 1
    bulletlistproperty = ListLevelStyleBullet(level=str(level), bulletchar=u"•")
    bulletlistproperty.addElement(ListLevelProperties(
        minlabelwidth="%fcm" % level))
    bulletedliststyle.addElement(bulletlistproperty)

    A bulleted list is pretty similar to a numbered list. The difference is that it uses ListLevelStyleBullet, which takes bulletchar as an argument. The rest is pretty much the same.

    # Creating a tabstop at 10cm
    tabstops_list = TabStops()
    tabstop = TabStop(position="10cm")
    tabstops_list.addElement(tabstop)
    tabstoppar = ParagraphProperties()
    tabstoppar.addElement(tabstops_list)
    tabstyle = Style(name="Question", family="paragraph")
    tabstyle.addElement(tabstoppar)

    In this block, we specify the tab stops we may encounter in the document. The TabStops element is a collection of all the tab stops we define. Each TabStop element represents a single tab stop and, understandably, takes position as an argument. Since tab stops apply to a paragraph, we create a ParagraphProperties object to which we add the list of tab stops we have - tabstops_list. For it to be applied to a text, as seen below, we have to make it a style. So we create tabstyle for that purpose and add the ParagraphProperties element to it.

  4. So, now we have created all our necessary styles. But how do we associate them with the OpenDocumentText object we created? How do we specify that these are the styles that may be used in the document? For that, we have to add each of the styles we created to the document's style list. It is done as follows

    s = textdoc.styles
    s.addElement(h1style)
    s.addElement(h2style)
    s.addElement(boldstyle)
    s.addElement(numberedliststyle)
    s.addElement(bulletedliststyle)
    s.addElement(justifystyle)
    s.addElement(tabstyle)
  5. So, all our styles have been created and added to our Document. Now it is time to actually insert some text and apply these styles to it. Let's first add our main heading
    mymainheading_element = H(outlinelevel=1, stylename=h1style)
    mymainheading_text = "This is my main heading"
    teletype.addTextToElement(mymainheading_element, mymainheading_text)
    textdoc.text.addElement(mymainheading_element)

    odfpy has two main classes for text - H and P. H is for headings and P is for paragraphs. In this block, since we are adding a heading, we create an H object and specify the outlinelevel (which is normally 1) and the stylename that we created earlier, h1style. We then add some text to the object using the teletype.addTextToElement method. Like I said before, we use teletype to properly handle whitespace like tabs or newlines. Alternatively, we can pass the text directly as the text argument when initializing the H (or P) object, but then tabs and newlines are simply skipped instead of being handled. So I prefer using teletype. Finally, we add the created H object to our document using the textdoc.text.addElement method.

  6. Similarly, create a subheading using h2style as stylename

  7. Adding a paragraph is also similar

    paragraph_element = P(stylename=justifystyle)
    paragraph_text = """
    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem
    Ipsum has been the industry's standard dummy text ever since the 1500s, when an
    unknown printer took a galley of type and scrambled it to make a type specimen
    book. It has survived not only five centuries, but also the leap into electronic
    typesetting, remaining essentially unchanged. It was popularised in the 1960s
    with the release of Letraset sheets containing Lorem Ipsum passages, and more
    recently with desktop publishing software like Aldus PageMaker including
    versions of Lorem Ipsum.
    """
    teletype.addTextToElement(paragraph_element, paragraph_text)
    textdoc.text.addElement(paragraph_element)
  8. Let's see how to add a bulleted list with two items
    bulletlist = List(stylename=bulletedliststyle)
    listitemelement1 = ListItem()
    listitemelement1_paragraph = P()
    listitemelement1_content = "My first item"
    teletype.addTextToElement(listitemelement1_paragraph, listitemelement1_content)
    listitemelement1.addElement(listitemelement1_paragraph)
    bulletlist.addElement(listitemelement1)
    listitemelement2 = ListItem()
    listitemelement2_paragraph = P()
    listitemelement2_content = "My second item"
    teletype.addTextToElement(listitemelement2_paragraph, listitemelement2_content)
    listitemelement2.addElement(listitemelement2_paragraph)
    bulletlist.addElement(listitemelement2)
    textdoc.text.addElement(bulletlist)

    Here, we create a List object to represent the complete list, passing the bulletedliststyle that we defined earlier as the stylename argument. For each individual item of the list, we need a ListItem object which contains a P object that holds the text. We add text to the paragraph using teletype, then add the P object to the ListItem object and the ListItem object to the List object. This is repeated for the second ListItem. The List object is finally added to the document.

  9. Adding a numbered list is similar to a bulleted list. We just specify numberedliststyle as stylename.
  10. To use the tabstop style we defined, create a paragraph with that style

    newtext = "Testing\tTabstops"
    tabp = P(stylename=tabstyle)
    teletype.addTextToElement(tabp, newtext)
    textdoc.text.addElement(tabp)
  11. So, we have added all the necessary contents to our document. Now we should save it to an odt file. For that, we use the save method.
    textdoc.save(u"myfirstdocument.odt")

If we open the document 'myfirstdocument.odt' in LibreOffice, we can see something like the following.

(Screenshot: LO_preview)

The complete code is as follows