Xtables2 User Documentation
Jan Engelhardt
Xtables2, Architecture and API Draft 9 (“A9”), December 2012

Copyright © 2010--2012 Jan Engelhardt. This work is made available under the Creative Commons Attribution-Noncommercial-Sharealike 3.0 (CC-BY-NC-SA) license. See http://creativecommons.org/licenses/by-nc-sa/3.0/ for details. (Alternate arrangements can be made with the copyright holder(s).)

1 Summary

Xtables is a packet filter for the Linux kernel. Its origins trace back to ip_tables and its siblings ip6_tables, arp_tables and ebtables, which were originally authored from 1999 onwards and have been worked on since by many contributors. Together, ip-ip6-arp-ebtables is retroactively referred to as Xtables(1), based upon the name of the kernel module for the shared code portions, x_tables.ko.

Xtables2 is an effort to gradually improve and modernize the packet filter, with input from many sources, especially the user community itself. You will find, so I hope, no need to relearn an entire system, at any level. Xtables2 intends to displace the classic iptables setsockopt ABI (both ipt and xt2 can be run concurrently) and to morph the userspace tool into one that uses the new Netlink-based interface.

2 Retained Features

Atomic replace of rules
ip6tables-restore and its IPv4 sibling are popular tools to reload one table (using instructions from stdin) such that, at any point in time during the switchover period, a CPU executing the ruleset sees either the old one or the new one, and never an intermediate, incomplete update. I have made sure that this is also possible in Xtables2. See also the subsection on granularity under “New Features”.

Network namespace support
Network namespaces are used to provide isolated, containerized environments for operating system-level virtualization implementations like LXC and OpenVZ. Each network namespace has its own set of packet filter rules, just like a virtual machine with a separate kernel would.
Continuing to use xt extension modules
Xtables2 reuses the existing extension modules. Currently supported are all those that [...]. Some outstanding (as in: to-be-done) modifications have yet to be made to Xtables2 so that it can also use ipt_ and ip6t_.

Same syntax for the command-line utility
Care has been taken that the syntax used for iptables(8) is retained in the Netlink-using utility. Some options have become useless as there is no equivalent for them anymore, such as -t for the table name. See the “New Features” section below.

3 New Features

Netlink as transport protocol for the packet filter ruleset
The traditional user<->kernel communication happened over a side channel provided by the getsockopt(2) and setsockopt(2) system calls, which were usable on any raw IP socket. This can widely be considered an ugly API misuse, because the packet filter ruleset is not a property of a socket, but of the network namespace one is in. Netlink is considered the modern communication channel for network-related configuration and some other subsystems of the operating system, and it is basically a standard prerequisite. (“Something without Netlink is a hard sell.”) The Netlink interface used by Xtables2 is described in more detail in the Xtables2 Developer Documentation.

Protocol-independent table
Xtables2 rules have, in empty form, no mandatory protocol-specific parts attached anymore, such as struct ip6t_ip6 used to be in ip6tables rules. This allows a single given table to be called from various packet type handlers (such as IPv6, BRIDGE, etc.). User rulesets often do not care about the network protocol being used, and with the protocol-independent table, -p tcp -m tcp --dport 22 will match TCP/22 independent of the network protocol in use, which saves you from replicating the rule between iptables and ip6tables.
Checking explicitly for network protocols is of course still possible by means of -m ipv4, -m ipv6, and -m arp for the network protocols of the same name. The respective replacement for ebtables rules is a mixture of -m physdev to check for an Ethernet bridge device, and -m eth to test for Ethernet packets. The side effect of protocol-agnostic tables is that the ip_tables, ip6_tables, arp_tables and ebtables kernel modules all become obsolete in one fell swoop, ending a painful legacy of copy-and-paste.

Improved C library for low-level access
The iptables package offered a userspace library, libiptc, that allowed modifying the in-kernel rulesets. The library set of libip6tc and libiptc, as well as libarptc and libebtc in their respective sibling packages (yes, duplication on this level existed as well), exposed a great deal of the internal implementation and was arguably not too popular with either users or developers. With Xtables2, a new package, libnetfilter_xtables, is introduced that is more appealing to developers. (This is pretty much guaranteed given libiptc's state.) The new library provides access to the in-kernel table directly and, like libiptc, uses local copies to store results of dump operations and to build tables for bulk replace.

C library for high-level access [work-in-progress]
Programmers have a natural aversion to spawning new processes from their program to access other tools, also because process management is not always under one's control, such as when the user of iptables or its successors is not the final program, but a library or a language binding. Xtables2 will provide a library that offers an interface based upon the known and well-established textual representation. The API will be described in more detail in the Xtables2 Developer Documentation.

Singular table
In Xtables2, each network namespace has only a single table (internally referred to as “master”).
What previously were 40 base chains in 13 tables per network namespace over four protocol/subsystem domains (IPv4, IPv6, ARP, bridge) is now a comfortable 40 base chains in a single table per network namespace. This has the benefit that user-defined chains need not be reproduced across different tables, as was previously necessary. Whereas xt1 only allowed replacing one table at a time atomically, it is thus possible in Xtables2 to replace the entire ruleset, which may have spanned multiple tables in xt1, at once. The names of base chains are now freely selectable. If you cannot think of anything, I recommend “ipv6/filter/INPUT” as a name for what was previously the INPUT chain in the IPv6 filter table.

Initial absence of base chains
Initially, base chains will not exist and, from a kernel point of view, need to be created first. A userspace component can transparently take care of that for the user, just as iptables(8) autoloaded table modules and thereby made base chains available. A respective Netfilter hook will be installed on base chain creation and removed again on deletion, so that non-existing base chains do not delay packet processing. This is similar to not having loaded a table in xt1. In xt1, however, if you only used, for example, a single INPUT rule in the mangle table, you would still have added four hooks for the other base chains of the mangle table.

Arbitrary new base chains
The administrator is free to create base chains using arbitrary Netfilter hook priorities (corresponding to “raw”, “filter”, etc.) and hook numbers (corresponding to prerouting, input, etc.). This removes the need to have kernel modules for each table.

Absent chain policy
Chain policies used to be a hidden rule at the end of a base chain that was jumped to if the user issued a RETURN from the base chain (“underflow”). Running off the end of a chain also caused the hidden rule to be executed.
Because the Internet has become a much more hostile place since its inception, Xtables2 uses a strict drop policy for underflows and run-offs, but you should know the following. A system where no packet filter hooks are registered will simply pass on, i.e. accept, packets. When loading the Xtables2 kernel module, there will be no chains and no rules. Only when you “promote” a (normal) chain to a base chain (one that is active and receives packets from the networking stack) will the implicit drop become active. It is therefore important to have the desired rules in your chains before promoting them.

Absent default counters
Rules no longer have byte and packet counters attached to them by default, to increase processing speed for users who do not in fact need the counters. Counters, if desired, need to be explicitly specified when creating a rule. For them to have their original behavior, that is, to only rise when all match conditions of the rule have been met, -m counter is to be used as the last action of a rule. Counters may also be added anywhere else, as in:

• -m tcp -m time -m counter: counts after both xt_tcp and xt_time matched successfully.
• -m tcp -m counter -m time: counts as soon as xt_tcp has matched successfully.
• -m counter -m tcp -m time: always counts.
• -m tcp -m counter -m time -m counter: also possible, with two counter objects.

The Xtables1 translator will of course generate xt2 rules that always have counters. This behavior can be changed by tuning the sysfs variable /sys/module/xt1_support/parameters/xlat_counters.

A9/2012: Counters are implemented by -m quota (downwards fashion only), -m quota2 (up/down, in xtables-addons), or -m nfacct (new since Dec-23-2011!). You can pick any of these modules.

Action types
(This is actually more of an internal item.) Yielding verdicts like ACCEPT, DROP, etc. and jumping/goto were, from an object-oriented perspective, implemented through targets.
The special target called “standard” (and its parameter block struct xt_standard_target) was used in Xtables1 to terminate processing in a table and yield a verdict, and jumps/gotos were implemented by having an empty target name in combination with a struct xt_standard_target that then carried the jump offset. For Xtables2, a more general approach has been chosen. Instead of making verdicts, jumps and gotos special cases (OO: “subclass”) of targets, they are separate entities in their own right, termed “actions”. There is a total of five action types defined to date: verdicts, matches, targets, jumps and gotos. There is no intention to change the userspace command-line syntax. `-j` continues to be used for verdicts, targets, and jumps.

Arbitrary mix of multiple matches and targets [user-request]
Xtables1's ruleset encoding allowed for one group of (zero or more) consecutively-executed matches, and one group of (zero or more) consecutively-executed targets. Neither the kernel nor the userspace parts ever made use of multiple targets. Xtables2 uses Netlink, and neither there nor in its internal encoding are there such groups any longer. Instead, each “action” entity specifies its type; in other words, you can have zero or more groups of one action each, giving significant freedom.

In Xtables2, the multiple targets feature that is highly desired in the user community is therefore supported. Targets are executed in the given order, provided the previous target was not terminating. This makes commands such as `-j LOG -j DROP` possible. `-j DROP -j LOG` will of course not log anything, since the packet is dropped first, leaving nothing to log. Furthermore, there are no longer restrictions on which action types are executed when. You can thus use `-m foo -j bar -m baz -j quux`.
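
The in-order evaluation of mixed actions can be illustrated with a small model. The following Python sketch is not Xtables2 code: the names `Match`, `Target`, `run_rule` and `make_counter`, the dictionary used as a packet, and the string verdicts are all hypothetical and chosen for illustration only. It assumes that a failing match ends the rule, that a terminating target ends processing with its verdict, and that a counter behaves like a match that always succeeds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union

Packet = Dict[str, object]  # hypothetical stand-in for a real packet


@dataclass
class Match:
    """An action that tests the packet; a failed test ends the rule."""
    name: str
    test: Callable[[Packet], bool]


@dataclass
class Target:
    """An action that acts on the packet; a verdict makes it terminating."""
    name: str
    verdict: Optional[str] = None  # None means non-terminating (e.g. LOG)


Action = Union[Match, Target]


def run_rule(actions: List[Action], pkt: Packet, trace: List[str]) -> str:
    """Walk the actions of one rule in the order given (illustrative model)."""
    for act in actions:
        if isinstance(act, Match):
            if not act.test(pkt):
                return "CONTINUE"      # rule does not apply, go to next rule
            trace.append("match:" + act.name)
        else:
            trace.append("target:" + act.name)
            if act.verdict is not None:
                return act.verdict     # terminating target ends processing
    return "CONTINUE"                  # ran off the end of the rule


def make_counter() -> Tuple[Match, Dict[str, int]]:
    """Model a counter as a match that counts and always succeeds."""
    state = {"packets": 0}

    def test(_pkt: Packet) -> bool:
        state["packets"] += 1
        return True

    return Match("counter", test), state


if __name__ == "__main__":
    log, drop = Target("LOG"), Target("DROP", verdict="DROP")
    for rule in ([log, drop], [drop, log]):
        trace: List[str] = []
        print(run_rule(rule, {}, trace), trace)
```

In this model, the list `[log, drop]` records both targets before returning the drop verdict, while `[drop, log]` never reaches the log target. The same loop handles a mixed rule such as `-m foo -j bar -m baz -j quux` without special cases.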
Such a rule is loosely equivalent to two separate rules, `-m foo -j bar` and `-m foo -m baz -j quux`, with the difference that “foo” is only executed once instead of twice, which is important to keep in mind if “foo” has some changing internal state.

Classic iptables encoded rules without a target as having an implicit XT_CONTINUE, i.e. a jump to the next rule. This is no longer needed in Xtables2, where rules default to XT_CONTINUE, so the 40 bytes that it took to encode CONTINUE are saved. (Not that rules without a target make up a significant portion of any real-world ruleset...)

Packing of rules
Tests have shown that rulesets stored in “free-hanging” data structures, such as linked lists, suffer from increased memory usage and severe performance degradation when executing the ruleset. Processing time and memory expansion of up to 2.8x have been observed. The time expansion is believed to be a result of increased D-cache or TLB misses due to the “fragmentation” of the ruleset's objects. The memory usage increase is due to the natural housekeeping cost of the allocator. Xtables1 packs all rules in a table together, which is good for locality, but has implications for the difficulty, specifically the time cost, of ruleset manipulation. As a result of the findings about time and memory, Xtables2 (starting with snapshot A4) packs rules together at the chain level, based upon the following assumption:

• Jumps can lead to anywhere in the table. The jump may transfer control to a “nearby” (byte offset-wise) chain, but it may equally jump to somewhere outside the cached region. No tests have been made with regard to the cost.

Higher degree of freedom in modifications (granularity)
The capabilities of an implementation can, among other things, be characterized by one or more of the following “granularities”. From coarse to fine-grained:

• Ruleset level: Exchange of a ruleset (allows manipulation of multiple tables).
• Table level: Exchange of a table (multi-chain manipulation within a single table).
• Chain level: Exchange of a chain (multi-rule manipulation within a single chain).
• Rule level: Exchange of a rule.
• Subrule level: Exchange of extensions, rule parameters.

Additionally, the atomic guarantees of a level apply to lower ones, so that table atomicity implies chain and rule atomicity. Subrule-level control also seems to have little practical value; it is only listed here for completeness.

The ip_tables kernel interface only offers table-level granularity. The iptables(8) userspace program retrieves an entire table at a time from the kernel, unpacks it, applies the desired user modification, repacks it, and then submits it back to the kernel. Even for small changes, this means that more data than really needed is transferred back and forth. Also, since a new table is constructed, the kernel has to redo all checks. This problem is amplified by users wrongly calling iptables repeatedly instead of doing a highly-recommended bulk replace.

The Xtables2 kernel interface offers modification at two levels, table and chain (three when counting the ruleset level). Since Xtables2 conveniently has just a single table, ruleset granularity is already achieved by table granularity in xt2's implementation [footnote: But it would not be so in Xtables1, of course.]. Xtables2 always provides at least chain atomicity even for single rule updates, as a result of the implementation of rule splicing. Cost: per splice operation.

Other minor issues fixed
As a result of making the table replace operation a single atomic operation (previously, it was split into two, whereby first the table was exchanged, and then the counters), “Resource temporarily unavailable” is no longer possible in xt2.