diff --git a/Documentation/networking/unicache.txt b/Documentation/networking/unicache.txt
new file mode 100644
index 0000000..715f9af
--- /dev/null
+++ b/Documentation/networking/unicache.txt
@@ -0,0 +1,145 @@
+intro
+-----
+
+The unicache is an attempt to do full flow lookup based on the trash/trie
+data structure. Trash builds very flat trees even with a very large number of
+entries (millions). The key holds src/dst/sprt/dprt/proto. Data nodes are
+called leaves, as in LC-trie, and each holds a list of struct rtable, just as
+a hash chain does. This means we can reuse all available ipv4 matching code.
+
+current take
+------------
+
+New Documentation/networking/unicache.txt |  128 ++++
+    include/linux/in_route.h              |    5 +
+    include/linux/netlink.h               |    1 +
+New include/linux/trie_core.h             |  167 ++++
+New include/linux/unicache.h              |  130 ++++
+    include/net/route.h                   |   35 +-
+    net/core/Makefile                     |    2 +-
+New net/core/trie_core.c                  | 1348 +++++++++++++++++++++++++++++++++
+New net/core/unicache.c                   |  959 +++++++++++++++++++++++
+    net/ipv4/af_inet.c                    |   13 +
+    net/ipv4/icmp.c                       |   41 +-
+    net/ipv4/route.c                      |  590 +++++++++------
+
+
+Garbage collection
+------------------
+Garbage collection is one of the most crucial processes. It is divided into
+passive, timer and active GC. The main focus is on the trash/trie size, the
+number of leaves. At insertion the size is checked against gc_thresh. If
+gc_thresh is reached, rt-entries are removed until gc_goal is reached. Also at
+insertion, the leaf's chain length is pruned when gc_elasticity is reached. This
+is as before with hash chains. As the trash/trie size is the most important
+parameter, the "legacy" route parameters are calculated from the trie/trash size:
+
+static int ip_rt_gc_elasticity = 3;
+
+void ip_rt_new_size(struct trie *t)
+{
+	ipv4_dst_ops.gc_thresh = t->gc_thresh * ip_rt_gc_elasticity;
+	ip_rt_max_size = t->gc_thresh * (ip_rt_gc_elasticity + 1);
+}
+
+TGC
+---
+Timer-based GC is slightly different from before. See rt_may_expire.
+
+
+Tuning
+------
+The GC process is now very simple and in most cases controlled by only one
+variable, the trash/trie size. The unicache can be monitored via the unicache
+user app.
+
+# routing w/o route cache
+ unicache --set_gc_thresh 2000
+
+# default
+ unicache --set_gc_thresh 100000
+
+The GC process described above we call Passive GC (PGC), as we do this GC
+afterwards and put a lot of work into not throwing away active and valuable
+entries. This is a rather expensive process.
+
+It's a good idea to keep gc_goal relatively small to keep the resizing of
+trie/trash at a minimum. The default should be a good start.
+
+
+AGC
+---
+
+The full flow lookup opens up new ways of doing GC. We can monitor state changes
+and termination of active flows. We call this active garbage collection, or AGC.
+For TCP we can monitor session termination by snooping for FIN or RST and
+directly remove stale entries from trash. This is very effective. During
+development pktgen (UDP) was instrumented to signal end-of-flow to stress the
+implementation.
+
+Monitoring
+----------
+
+/proc/net/unicache_stat
+
+Basic info: size of leaf: 36 bytes, size of tnode: 44 bytes.
+trie:
+	Aver depth: 1.31
+	Max depth: 4
+	Leaves: 198375
+	Internal nodes: 29440
+	  1: 26745  2: 2675  3: 19  19: 1
+	Pointers: 588630
+Null ptrs: 359801
+Total size: 10539 kB
+
+/proc/net/unicache_flows holds "active flows"
+
+pkts/trash/src/dst/sprt/dprt/proto/ifidx
+
+00000000 00000004 08d44b2f 0101a8c0 99a9b105 09000900 00110003
+00000000 00000004 08d46e23 0101a8c0 893c8505 09000900 00110003
+00000000 00000004 08d486eb 010a0a0a 248a4a0b 09000900 00110001
+00000000 00000004 08d4ce67 010a0a0a bbc46b0b 09000900 00110001
+
+And of course rtstat can be used to see hits on warm cache entries and fib
+lookups, as well as other related parameters.
+
+equilibrium
+-----------
+Equilibrium resizing is not yet added. If needed, ip_rt_new_size can be called
+on a periodic basis to adjust the goal. Recall that the dynamic trie/trash data
+structure has no preallocated hash table and grows with new entries up to
+gc_thresh, and timer-based GC will remove stale entries.
+
+locking
+-------
+
+The current implementation takes a "better safe than sorry" approach: RCU-BH
+protection is used, and trie writers are serialized via the trie_write_lock.
+It should be possible to do more fine-grained locking to support many concurrent
+writers.
+
+flowlogging
+-----------
+Current code can log finished flows via netlink. Logged info has the flow
+information src/dst/sprt/dprt/proto/if and the packet count.
+
+unsupported
+-----------
+CONFIG_IP_ROUTE_MULTIPATH_CACHED is not (yet) supported.
+
+future work
+-----------
+It's possible to extend the key to ipv6 at very little cost. It's also possible
+to store e.g. a struct socket in the leaf to get a unified lookup. The full key
+should give opportunities for e.g. connection tracking.
+
+
+testing
+-------
+Code has been tested for weeks with rDoS attacks but of course there are bugs.
+
+
+performance comparison
+----------------------
\ No newline at end of file