3.2. PDF Filter Example

This tutorial serves as an initial introduction to recipes and checklets. In this example, we will write a recipe that checks whether all submitted files are PDF files. This functionality can later be extended. The recipe is explained in Section 3.2.1 (The recipe) and the checklet in Section 3.2.3 (The checklet).

3.2.1. The recipe

The recipe describes the overall structure of the PDF filter we are trying to implement. It does this by referring to one or more checklets and describing which input to provide to them and how to interpret the output. To construct a PDF filter that only accepts submissions that consist of PDF files, we will write a recipe that calls a single checklet.

This recipe looks as follows:

<?xml version="1.0"?>
<rcp:recipe xmlns:rcp="http://peach3.nl/daemon/recipe"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://peach3.nl/daemon/recipe http://peach3.nl/daemon/schema/recipe-1.0.xsd"
    id="pdffilter-recipe">
  <meta>
    <name>PDFFilter</name>
    <description xml:lang="en">
      A recipe that determines whether all submitted files are PDF files or not.
    </description>
    <version>
      0.0
    </version>
  </meta>
  <options>
    <option name="num-files-min"
        description="Minimal number of PDF files the product should contain"
        value="1" type="int"/>
    <option name="num-files-max"
        description="Maximum number of PDF files the product should contain"
        value="1" type="int"/>
    <option name="num-pages-min"
        description="Minimal number of pages each PDF file should include"
        value="0" type="int"/>
    <option name="only-pdf"
        description="Whether all files should be PDF files" value="True" type="bool"/>
  </options>
  <steps>
    <checklets basename="mtrchk-org-momotor">
      <checklet id="pdffilter-checklet" name="pdffilter"/>
    </checklets>
    <step id="pdffilter">
      <checklet ref="pdffilter-checklet"/>
      <options>
        <option name="num-files-min" value="1"/>
        <option name="num-files-max" value="1"/>
        <option name="num-pages-min" value="0"/>
        <option name="only-pdf" value="True"/>
      </options>
    </step>
  </steps>
  <tests>
    <expectedResult id="pass">
      <expect step="pdffilter" outcome="pass"/>
    </expectedResult>
    <expectedResult id="fail">
      <expect step="pdffilter" outcome="fail"/>
    </expectedResult>
    <test id="test1">
      <product>
        <files basesrc="testdata/test1/files">
          <file name="test.pdf" src="test.pdf" type="application/pdf"/>
        </files>
      </product>
      <expectedResult ref="pass"/>
    </test>
    <test id="test2">
      <product>
        <files basesrc="testdata/test2/files">
          <file name="test.pdf" src="test.pdf" type="application/pdf"/>
          <file name="test.txt" src="test.txt" type="text/plain"/>
        </files>
      </product>
      <expectedResult ref="fail"/>
    </test>
    <test id="test3">
      <product>
        <files basesrc="testdata/test3/files">
          <file name="test.txt" src="test.txt" type="text/plain"/>
        </files>
      </product>
      <expectedResult ref="fail"/>
    </test>
    <test id="test1-embedded">
      <product>
        <files basesrc="file">
          <file class="" encoding="base64" name="test.pdf" type="application/pdf">JVBERi0xLjUKJdDUxdgKMyAwIG9iaiA8PAovTGVuZ3RoIDggICAgICAgICAKL0ZpbHRlciAvRmxhdGVEZWNvZGUKPj4Kc3RyZWFtCnjaAwAAAAABCmVuZHN0cmVhbQplbmRvYmoKNyAwIG9iaiA8PAovUHJvZHVjZXIgKHBkZlRlWC0xLjQwLjE0KQovQ3JlYXRvciAoVGVYKQovQ3JlYXRpb25EYXRlIChEOjIwMTQwNTI1MTQwNjI2KzAyJzAwJykKL01vZERhdGUgKEQ6MjAxNDA1MjUxNDA2MjYrMDInMDAnKQovVHJhcHBlZCAvRmFsc2UKL1BURVguRnVsbGJhbm5lciAoVGhpcyBpcyBwZGZUZVgsIFZlcnNpb24gMy4xNDE1OTI2LTIuNS0xLjQwLjE0IChUZVggTGl2ZSAyMDEzL0RlYmlhbikga3BhdGhzZWEgdmVyc2lvbiA2LjEuMSkKPj4gZW5kb2JqCjQgMCBvYmogPDwKL1R5cGUgL09ialN0bQovTiA0Ci9GaXJzdCAyMQovTGVuZ3RoIDE1OSAgICAgICAKL0ZpbHRlciAvRmxhdGVEZWNvZGUKPj4Kc3RyZWFtCnjaXY3BCoJAFEX37yveFzjOlJOCuMhoE4FYO3Ex6EOEcMIZof6+N0YIbS/3nKMwRomZxgSlilGj1AryHMT9/SQUlRkIRGknT5N3uON3DaImZ5e5I8foOlypH83RvrCJeUiyJFIHjeleRmnWAltmxjkRzkWx+qvZdjfy2HDkdMb2t29dF8ILcxLEZewdNioI/p6l8eZhB/giW+MDR7k3qgplbmRzdHJlYW0KZW5kb2JqCjggMCBvYmogPDwKL1R5cGUgL1hSZWYKL0luZGV4IFswIDldCi9TaXplIDkKL1cgWzEgMiAxXQovUm9vdCA2IDAgUgovSW5mbyA3IDAgUgovSUQgWzwxMDk5NTk0RjNBM0JEMEQ4OEM3NUY0MjAyOEQzMTUyND4gPDEwOTk1OTRGM0EzQkQwRDg4Qzc1RjQyMDI4RDMxNTI0Pl0KL0xlbmd0aCAzOCAgICAgICAgCi9GaWx0ZXIgL0ZsYXRlRGVjb2RlCj4+CnN0cmVhbQp42g3GsQ0AIADDsKSw8yUrO89DB0sGXpgWstBDnxqyMZcPMAYCgwplbmRzdHJlYW0KZW5kb2JqCnN0YXJ0eHJlZgo2MzQKJSVFT0YK</file>
        </files>
      </product>
      <expectedResult ref="pass"/>
    </test>
  </tests>
</rcp:recipe>

Download this recipe

Like any recipe, this is a standard XML document with a number of special tags. The first tag is <rcp:recipe>, which indicates that this is a recipe. One level lower, the <meta> contains information about the recipe, such as a name, description and version number. Within <steps>, the different steps of the recipe are described. Finally, the <tests> tag describes the test cases that can be run to verify the correctness of this recipe.
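As a rough illustration of this structure, the sketch below parses a trimmed-down copy of the recipe with Python's standard xml.etree.ElementTree module and lists its top-level sections. The helper name top_level_sections is hypothetical and not part of Momotor; the <options> section is left out of the trimmed copy for brevity.

```python
import xml.etree.ElementTree as ET

# A trimmed-down copy of the recipe above, kept only for illustration.
RECIPE = """<?xml version="1.0"?>
<rcp:recipe xmlns:rcp="http://peach3.nl/daemon/recipe" id="pdffilter-recipe">
  <meta>
    <name>PDFFilter</name>
    <version>0.0</version>
  </meta>
  <steps>
    <checklets basename="mtrchk-org-momotor">
      <checklet id="pdffilter-checklet" name="pdffilter"/>
    </checklets>
    <step id="pdffilter">
      <checklet ref="pdffilter-checklet"/>
    </step>
  </steps>
  <tests/>
</rcp:recipe>
"""


def top_level_sections(recipe_xml):
    """Return the recipe id and the names of its top-level child tags.

    Hypothetical helper, not part of Momotor.
    """
    root = ET.fromstring(recipe_xml)
    return root.get("id"), [child.tag for child in root]


recipe_id, sections = top_level_sections(RECIPE)
print(recipe_id, sections)
```

Only the root element carries the rcp namespace prefix; the child tags are unprefixed, which is why they can be looked up by their plain names.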

Note

The order in which tags and attributes occur in a recipe does not influence their meaning. However, we will use a consistent order for readability reasons. Any valid XML with the right tags is supported by Momotor.

3.2.1.1. <meta>

The <meta> tag contains general information about the recipe. The <name> tag specifies the name of the filter. The <description> tag holds a short (typically single-sentence) description of the recipe. Optionally, the xml:lang attribute can be used to specify the language of the description. Finally, <version> contains a version number for the recipe.

Note

It is recommended to include all meta-information in a single <meta> tag.

3.2.1.2. <options>

The <options> tag contains all the options that can be set for the recipe. Options are described in more detail in Configuration. The value that is specified for each option here is the default value, which should correspond with the default value in the checklet that uses the option.

Note

The <options> tag can be used in multiple places within the recipe. This specific occurrence of the tag defines the configurable options of the recipe.

3.2.1.3. <steps>

The <steps> tag contains two sub-tags: <checklets> and <step>. The <checklets> tag specifies a base name, which is the package that includes all checklets that are used. Within this, each checklet is declared using the <checklet> tag. This tag specifies an id by which the checklet is known, as well as the name of the checklet. Multiple <checklets> tags can be used to refer to different base names.

Note

Each id is of the type xsd:ID, meaning there are restrictions on the characters that can be used, and each id must be unique throughout the whole recipe. More details can be found in the xsd:ID definition.

The checklets that have been specified can now be used in one or more steps.

A step is an invocation of a checklet, optionally with options. It is specified using the <step> tag, which assigns an id to the step. The <checklet> tag contains a reference ref to the checklet that is used in this step. Note that this is the same id that was specified earlier. Optionally, options can be set for each step. Options are discussed in more detail in Configuration.
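The relation between declared checklets and the steps that refer to them can be checked mechanically. The sketch below is not how Momotor validates recipes; it is a minimal illustration using the standard library. The helper name dangling_checklet_refs and the deliberately broken second step are made up for this example.

```python
import xml.etree.ElementTree as ET

# The <steps> section of the recipe, plus one deliberately broken step.
STEPS = """<steps>
  <checklets basename="mtrchk-org-momotor">
    <checklet id="pdffilter-checklet" name="pdffilter"/>
  </checklets>
  <step id="pdffilter">
    <checklet ref="pdffilter-checklet"/>
  </step>
  <step id="broken">
    <checklet ref="no-such-checklet"/>
  </step>
</steps>
"""


def dangling_checklet_refs(steps_xml):
    """Return (step id, ref) pairs whose ref matches no declared checklet id.

    Hypothetical helper, not part of Momotor.
    """
    root = ET.fromstring(steps_xml)
    declared = {c.get("id") for c in root.findall("checklets/checklet")}
    dangling = []
    for step in root.findall("step"):
        for use in step.findall("checklet"):
            ref = use.get("ref")
            # A step may also name a checklet directly, so only check refs.
            if ref is not None and ref not in declared:
                dangling.append((step.get("id"), ref))
    return dangling


print(dangling_checklet_refs(STEPS))
```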

Each step is defined using its own <step> tag. This means that generally, there will be a single <steps> tag containing multiple <step> tags. As noted earlier, this can be deviated from, as long as the meaning in XML is equivalent. For instance, steps may be grouped in multiple <steps> tags to improve readability.

Note

Recipe editors can choose to omit the <checklets> tag, instead directly referring to the name of each checklet. The <checklets> tag simply sets up an alias to improve readability.

3.2.1.4. <tests>

The <tests> tag specifies tests that can be run to verify the correctness of the recipe. A test specifies a product and an expected outcome for the product when running the recipe on it.

The <expectedResult> tag defines a possible result of running the recipe. Each expected result has an id and lists the expected outcome for each step. In this case, there is only one step, for which we define both a pass and a fail result. If there were more steps, we could include an outcome for each of them.

Note

The <expectedResult> tag is optional here. Instead of defining expected results once and referring to them later, they can also be specified within a <test> tag. As a convention, we typically define expected results first, but if a result is only expected for a single test, the added overhead may decrease readability.

The <test> tag specifies a test. Within each test, the product and the expected result are specified. The <product> tag defines a product consisting of one or more files. The basesrc attribute of the <files> tag specifies the directory within which all files are contained. Products are discussed in more detail in Products.
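To make the link between tests and expected results concrete, the following sketch resolves each test to the outcome it expects per step. It handles both shared results (referenced with ref) and results written inline in a <test>. The function name expected_outcomes and the inline test3 variant are assumptions made for this illustration; products are omitted because they are not needed here.

```python
import xml.etree.ElementTree as ET

TESTS = """<tests>
  <expectedResult id="pass">
    <expect step="pdffilter" outcome="pass"/>
  </expectedResult>
  <expectedResult id="fail">
    <expect step="pdffilter" outcome="fail"/>
  </expectedResult>
  <test id="test1">
    <expectedResult ref="pass"/>
  </test>
  <test id="test2">
    <expectedResult ref="fail"/>
  </test>
  <test id="test3">
    <expectedResult>
      <expect step="pdffilter" outcome="fail"/>
    </expectedResult>
  </test>
</tests>
"""


def expected_outcomes(tests_xml):
    """Map each test id to a {step id: expected outcome} dictionary.

    Hypothetical helper, not part of Momotor.
    """
    root = ET.fromstring(tests_xml)
    # Expected results defined once, directly under <tests>.
    shared = {er.get("id"): er for er in root.findall("expectedResult")}
    outcomes = {}
    for test in root.findall("test"):
        expected = test.find("expectedResult")
        ref = expected.get("ref")
        if ref is not None:
            expected = shared[ref]  # follow the reference to the shared result
        outcomes[test.get("id")] = {
            e.get("step"): e.get("outcome") for e in expected.findall("expect")
        }
    return outcomes


print(expected_outcomes(TESTS))
```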

3.2.2. Configuration

An option specifies a value for a named parameter that can be provided to a checklet. The options are specified within the <options> tag. Each <option> consists of a name for the option and a value that is assigned to it. The checklet has access to the values specified here. More details on options can be found in mtrchk.org.momotor.base.stepbase.StepOption.

The options below serve as an example.

<?xml version='1.0' encoding='utf-8'?>
<Q:config xmlns:Q="http://peach3.nl/daemon/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://peach3.nl/daemon/config http://peach3.nl/schema/daemon/config-1.0.xsd">
  <options>
    <option domain="checklet" name="num-files-min" value="1"
        description="Minimal number of PDF files the product should contain" type="int"/>
    <option domain="checklet" name="num-files-max" value="1"
        description="Maximum number of PDF files the product should contain" type="int"/>
    <option domain="checklet" name="num-pages-min" value="0"
        description="Minimal number of pages each PDF file should include" type="int"/>
    <option domain="checklet" name="only-pdf" value="True"
        description="Whether all files should be PDF files" type="bool"/>
  </options>
</Q:config>

Download this configuration
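Each option carries its value as a string together with a type attribute. The sketch below shows one way such values could be read and converted. The conversion rules (in particular, treating "True" case-insensitively as the boolean true) and the helper names coerce and load_options are assumptions for illustration; Momotor's own parsing may differ.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the configuration above.
CONFIG = """<Q:config xmlns:Q="http://peach3.nl/daemon/config">
  <options>
    <option domain="checklet" name="num-files-min" value="1" type="int"/>
    <option domain="checklet" name="only-pdf" value="True" type="bool"/>
  </options>
</Q:config>
"""


def coerce(value, type_name):
    """Convert an option string to a Python value (assumed rules)."""
    if type_name == "int":
        return int(value)
    if type_name == "bool":
        return value.strip().lower() == "true"
    return value  # unknown or missing type: keep the raw string


def load_options(config_xml):
    """Return {option name: typed value} for all options in the document."""
    root = ET.fromstring(config_xml)
    return {
        opt.get("name"): coerce(opt.get("value"), opt.get("type"))
        for opt in root.findall("options/option")
    }


print(load_options(CONFIG))
```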

Options can be specified in the following locations:

  1. step: the option is defined in the step section of a recipe
  2. recipe: the option is defined in the global section of a recipe
  3. dependent: the option is provided by a step that depends on the current step
  4. product: the option is defined by a product
  5. config: the option is defined in the configuration

By default, a checklet will check the step, recipe and config in that order. This means that a value in the step overrides a value in the global section of a recipe, which in turn overrides a value in config.xml. Note that this default order may be different for other recipes.
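The default lookup order can be sketched as a simple chain of dictionaries, searched from the most specific location to the least specific one. The function resolve_option and the dictionary representation are hypothetical; they only illustrate the precedence described above and are not the Momotor API.

```python
def resolve_option(name, step, recipe, config, default=None):
    """Return the first value found for name, searching step, recipe, config.

    Hypothetical helper illustrating the default precedence only.
    """
    for source in (step, recipe, config):
        if name in source:
            return source[name]
    return default


step_options = {"num-files-max": "3"}
recipe_options = {"num-files-max": "1", "only-pdf": "True"}
config_options = {"only-pdf": "False", "num-pages-min": "2"}

# The step value wins over the recipe value.
print(resolve_option("num-files-max", step_options, recipe_options, config_options))
# The recipe value wins over config.xml.
print(resolve_option("only-pdf", step_options, recipe_options, config_options))
# Only config.xml defines this option.
print(resolve_option("num-pages-min", step_options, recipe_options, config_options))
```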

3.2.3. The checklet

The checklet takes the input provided by the recipe and produces a result that is sent back to the recipe. In the case of the PDF filter, the checklet should check if all of the files in the product are PDF files.

Checklets can be hosted in a central repository that Momotor can find, but they can also be included with a recipe. A recipe can find these embedded checklets if they are placed in the checklets directory, located in the same directory as recipe.xml.

Every checklet is a Python egg. It includes a file setup.py containing meta-information that looks as follows:

from setuptools import setup, find_packages

setup(
    name = 'mtrchk-org-momotor-pdffilter',
    version = '0.0',
    author = 'Maikel Steneker',
    author_email = 'm.p.j.steneker@student.tue.nl',
    description = "Momotor PDF filter",
    url = 'http://checklet.momotor.org/',
    install_requires = [
        'setuptools',
        'mtrchk-org-momotor-base',
        'pypdf',
    ],
    packages = find_packages('src'),
    package_dir = {'': 'src'},
    namespace_packages = [
        'mtrchk',
        'mtrchk.org',
        'mtrchk.org.momotor',
    ],
    zip_safe = True,
    entry_points = {
        'momotor.checklet' : [
            'mtrchk-org-momotor-pdffilter = mtrchk.org.momotor.pdffilter:PdfFilter',
        ],
    },
    classifiers = [
        'Programming Language :: Python :: 2.7',
        'License :: Other/Proprietary License',
    ]
)

Download setup.py

Most of the information in this file is self-explanatory. Note that the name that recipes refer to is specified here. The install_requires field is a list of other eggs that the checklet depends on. This includes setuptools and mtrchk-org-momotor-base, which are required for any checklet, as well as any other packages, in this case pypdf. A large selection of packages can be found in the Python Package Index. The entry_points field is a reference to the Python class that contains the core functionality of the checklet.
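An entry point string has the form "name = package.module:attribute". The sketch below shows roughly what resolving such a string involves; in practice setuptools performs this lookup, and load_entry_point here is a hypothetical helper written for illustration. Because the Momotor packages are not assumed to be installed, the demonstration resolves a standard-library class written in the same syntax.

```python
import importlib


def load_entry_point(spec):
    """Split 'name = package.module:attribute' and import the attribute.

    Hypothetical helper; setuptools does the real lookup.
    """
    name, _, target = spec.partition("=")
    module_name, _, attribute = target.strip().partition(":")
    module = importlib.import_module(module_name)
    return name.strip(), getattr(module, attribute)


# Same syntax as the checklet entry point, but pointing at the standard
# library so that the example runs without Momotor installed.
name, obj = load_entry_point("ordered-dict = collections:OrderedDict")
print(name, obj)
```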

The core functionality of the checklet is contained in the following code:

from mtrchk.org.momotor.base.stepbase import StepBase, StepError, StepOption
from pyPdf import PdfFileReader
from pyPdf.utils import PdfReadError

class PdfFilter(StepBase):
    checklet_options = StepBase.checklet_options + (
        StepOption('num-files-min',
            doc='Minimal number of PDF files the product should contain',
            default='1'
        ),
        StepOption('num-files-max',
            doc='Maximum number of PDF files the product should contain',
            default='1'
        ),
        StepOption('num-pages-min',
            doc='Minimal number of pages each PDF file should include',
            default='0'
        ),
        StepOption('only-pdf',
            doc='Whether all files should be PDF files',
            default='True'
        ),
        StepOption('files',
            doc="""The files to consider. The following values are legal:
    @recipe:<class>      Copy files of class from this recipe
    @product:<class>     Copy files of class from the product
    <resultid>:<class>   Copy files of class from result
    
    <class> can be empty indicating all files, regardless of class
    <resultid> can be empty, defaulting to '@recipe'
            
If supplied multiple times, the sources will be merged. If omitted, defaults to '@product:' """,
            default='@product:',
            multiple=True,
        ),
    )
    
    def run(self):        
        passed = True  # the test passes by default
        pdf_files = []
        other_files = []
        
        # Retrieve checklet options
        num_files_min = self.retrieve_option_int('num-files-min')
        num_files_max = self.retrieve_option_int('num-files-max')
        num_pages_min = self.retrieve_option_int('num-pages-min')
        only_pdf = self.retrieve_option_bool('only-pdf')
        sources = self.options.get('files')
        
        files = self.find_files(sources)
        for f in files:
            if f.attributes.get('type') == 'application/pdf':
                # file is marked as a PDF file, open using pyPDF
                with open(f.src.absolute(), 'rb') as pdf:  # PDF data is binary
                    try:
                        rdr = PdfFileReader(pdf)
                        pdf_files.append(f)
                        if rdr.numPages < num_pages_min:
                            passed = False # the document does not contain enough pages
                            self.log.info('file %r: not enough pages; %d<%d', f, rdr.numPages, num_pages_min)
                    except PdfReadError:
                        # the file cannot be opened as a PDF file => fail
                        passed = False
                        self.log.info('file %r: could not open as PDF', f)
            else:
                # the file is not a PDF file => fail if all files need to be PDF
                passed = passed and not only_pdf
                other_files.append(f)
                self.log.info('file %r: not specified to be PDF', f)
        
        # Check if the number of files was correct
        self.log.info('# files: %d', len(pdf_files))
        passed = passed and num_files_min <= len(pdf_files) <= num_files_max
        self.log.info('pass: %r', passed)
        
        return {
            'outcome': 'pass' if passed else 'fail',
            'files': pdf_files,
        }

Download this file

The PdfFilter class is the checklet itself. It extends the StepBase class, which should be extended by all checklets.

A checklet should contain a run() method that returns the result of the checklet. Particularly, it should be a dictionary with at least a value for the outcome key. Additional information can be returned as well. More details can be found in the documentation for mtrchk.org.momotor.base.stepbase.StepBase.run().
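The shape of the return value can be shown with a much smaller checklet. In the sketch below, StepBase is a hypothetical stand-in defined locally so that the example runs on its own; the real base class lives in mtrchk.org.momotor.base.stepbase and has a different constructor. The 'num-files' key is likewise just an example of additional information.

```python
class StepBase(object):
    """Hypothetical stand-in for the real StepBase, for illustration only."""

    def __init__(self, filenames):
        self.filenames = list(filenames)


class NonEmptyProduct(StepBase):
    """Example checklet that passes when at least one file is present."""

    def run(self):
        passed = len(self.filenames) > 0
        return {
            'outcome': 'pass' if passed else 'fail',  # required key
            'num-files': len(self.filenames),         # optional extra information
        }


result = NonEmptyProduct(['test.pdf']).run()
print(result['outcome'])
```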

Within a checklet, the checklet_options variable specifies the options that can be used by the checklet. Note that these same options are provided by the recipe or a separate config.xml. The options property of the checklet provides a dictionary containing all the options that were set. These can be retrieved and used in the checklet. Each option has a name that is also used in the recipe, as well as some optional parameters, such as a default value and a description.

A checklet has access to a copy of a result for each of the checklets that have finished before, as well as the product. These and more helper methods are defined in the base checklet.

See also

Attribute mtrchk.org.momotor.base.stepbase.StepBase.checklet_options
Documentation of the checklet options.
Module mtrchk.org.momotor.base.stepbase
Documentation of the base checklet.

3.2.4. Products

A product contains the data that the recipe validates. Typically, this will be a few files, along with an XML document describing it. A product for the PDF filter looks as follows:

<?xml version="1.0" encoding="UTF-8"?>
<p:product xmlns:p="http://peach3.nl/daemon/product" 
	xmlns:xml="http://www.w3.org/XML/1998/namespace" 
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
	xsi:schemaLocation="http://peach3.nl/daemon/product http://peach3.nl/daemon/schema/product-1.0.xsd"
	id="test2">
 <files basesrc="files">
  <file name="test.pdf" src="test.pdf" type="application/pdf" />
  <file name="test.txt" src="test.txt" type="text/plain" />
 </files>
</p:product>

Download this product

Note that this product is simply one of the test cases for the recipe. Each file is given a name under which it will be provided to the checklet, a src that specifies its physical location and a type. The type should be an internet media type. A list of internet media types is maintained by the IANA.
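As an illustration of these three attributes, the sketch below reads a shortened copy of the product and lists each file with its resolved location and declared type. It also shows how Python's standard mimetypes module can suggest an internet media type from a file name when writing a product by hand. The helper list_files is hypothetical, and joining basesrc and src with a forward slash is an assumption about how the location is resolved.

```python
import mimetypes
import posixpath
import xml.etree.ElementTree as ET

# A shortened copy of the product above.
PRODUCT = """<p:product xmlns:p="http://peach3.nl/daemon/product" id="test2">
  <files basesrc="files">
    <file name="test.pdf" src="test.pdf" type="application/pdf"/>
    <file name="test.txt" src="test.txt" type="text/plain"/>
  </files>
</p:product>
"""


def list_files(product_xml):
    """Return (name, location, type) for every file in the product.

    Hypothetical helper; the location is basesrc joined with src.
    """
    root = ET.fromstring(product_xml)
    entries = []
    for files in root.findall("files"):
        base = files.get("basesrc", "")
        for f in files.findall("file"):
            location = posixpath.join(base, f.get("src"))
            entries.append((f.get("name"), location, f.get("type")))
    return entries


entries = list_files(PRODUCT)
for name, location, declared_type in entries:
    # guess_type looks only at the file name, not at the file contents.
    guessed_type = mimetypes.guess_type(name)[0]
    print(name, location, declared_type, guessed_type)
```

A type guessed from the file name is only a hint; the checklet above additionally tries to open each declared PDF file to confirm that it really is one.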

3.2.5. XSLT files

Note that any XML file that is read by Momotor can be an embedded XSLT stylesheet. This means that XSLT tags, such as <xsl:if> and <xsl:for-each>, can be used. After XSLT processing, the result should be a valid XML file.

More details can be found in the W3C XSLT documentation.