diff --git a/doc/dsc-manual.tex b/doc/dsc-manual.tex new file mode 100644 index 0000000..501d34a --- /dev/null +++ b/doc/dsc-manual.tex @@ -0,0 +1,1863 @@ +\documentclass{report} +\usepackage{epsfig} +\usepackage{path} +\usepackage{fancyvrb} + +\def\dsc{{\sc dsc}} + +\DefineVerbatimEnvironment% + {MyVerbatim}{Verbatim} + {frame=lines,framerule=0.8mm,fontsize=\small} + +\renewcommand{\abstractname}{} + +\begin{document} + +\begin{titlepage} +\title{DSC Manual} +\author{Duane Wessels, Measurement Factory\\ +Ken Keys, CAIDA\\ +\\ +http://dns.measurement-factory.com/tools/dsc/} +\date{\today} +\end{titlepage} + +\maketitle + +\begin{abstract} +\setlength{\parskip}{1ex} +\section{Copyright} + +The DNS Statistics Collector (dsc) + +Copyright 2003-2007 by The Measurement Factory, Inc., 2007-2008 by Internet +Systems Consortium, Inc., 2008-2019 by OARC, Inc. + +{\em info@measurement-factory.com\/}, {\em info@isc.org\/} + +\section{License} + +{\dsc} is licensed under the terms of the BSD license: + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + +Redistributions of source code must retain the above copyright +notice, this list of conditions and the following disclaimer. +Redistributions in binary form must reproduce the above copyright +notice, this list of conditions and the following disclaimer in the +documentation and/or other materials provided with the distribution. +Neither the name of The Measurement Factory nor the names of its +contributors may be used to endorse or promote products derived +from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS +FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE +COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, +INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, +BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN +ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +\section{Contributors} +\begin{itemize} +\item Duane Wessels, Measurement Factory +\item Ken Keys, Cooperative Association for Internet Data Analysis +\item Sebastian Castro, New Zealand Registry Services +\end{itemize} +\end{abstract} + + +\tableofcontents + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\chapter{Introduction} + +{\dsc} is a system for collecting and presenting statistics from +a busy DNS server. + +\section{Components} + +{\dsc} consists of the following components: +\begin{itemize} +\item A data collector +\item A data presenter, where data is archived and rendered +\item A method for securely transferring data from the collector + to the presenter +\item Utilities and scripts that parse XML and archive files from the collector +\item Utilities and scripts that generate graphs and HTML pages +\end{itemize} + +\subsection{The Collector} + +The collector is a binary program, named {\tt dsc\/}, which snoops +on DNS messages. It is written in C and uses {\em libpcap\/} for +packet capture. 
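+
+As a minimal, illustrative sketch (not code from the {\dsc} sources),
+this is roughly how a libpcap-based collector opens an interface and
+installs a packet filter; the interface name {\tt em0\/} and the
+filter string are examples only:
+
+\begin{MyVerbatim}
+/* illustrative sketch only -- not the dsc source */
+#include <pcap.h>
+#include <stdio.h>
+
+int main(void)
+{
+    char errbuf[PCAP_ERRBUF_SIZE];
+    struct bpf_program fp;
+    pcap_t *p = pcap_open_live("em0", 65535, 1, 1000, errbuf);
+    if (NULL == p) {
+        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
+        return 1;
+    }
+    /* dsc looks at all IP packets by default; "udp port 53" is
+     * merely an example of a BPF filter string */
+    if (pcap_compile(p, &fp, "udp port 53", 1, 0) < 0 ||
+        pcap_setfilter(p, &fp) < 0) {
+        fprintf(stderr, "filter: %s\n", pcap_geterr(p));
+        return 1;
+    }
+    /* a real collector would now call pcap_dispatch() and hand
+     * each captured DNS message to its dataset counters */
+    pcap_close(p);
+    return 0;
+}
+\end{MyVerbatim}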
+ +{\tt dsc\/} uses a relatively simple configuration file called {\em +dsc.conf\/} to define certain parameters and options. The configuration +file also determines the {\em datasets\/} that {\tt dsc\/} collects. + +A Dataset is a 2-D array of counters of IP/DNS message properties. +You can define each dimension of the array independently. For +example you might define a dataset categorized by DNS query type +along one dimension and TLD along the other. +{\tt dsc\/} dumps the datasets from memory to XML files every 60 seconds. + +\subsection{XML Data Transfer} + +You may run the {\dsc} collector on a remote machine. That +is, the collector may run on a different machine than where the +data is archived and displayed. {\dsc} includes some Perl and {\tt /bin/sh} +scripts to move XML files from collector to presenter. One +technique uses X.509 certificates and a secure HTTP server. The other +uses {\em rsync\/}, presumably over {\em ssh\/}. + +\subsubsection{X.509/SSL} + +To make this work, Apache/mod\_ssl should run on the machine where data +is archived and presented. +Data transfer is authenticated via SSL X.509 certificates. A Perl +CGI script handles all PUT requests on the server. If the client +certificate is allowed, XML files are stored in the appropriate +directory. + +A shell script runs on the collector to upload the XML files. It +uses {\tt curl\/}\footnote{http://curl.haxx.se} to establish an +HTTPS connection. XML files are bundled together with {\tt tar\/} +before transfer to eliminate per-connection delays. +You could use {\tt scp\/} or {\tt rsync\/} instead of +{\tt curl\/} if you like. + +\path|put-file.pl| is the script that accepts PUT requests on the +HTTP server. The HTTP server validates the client's X.509 certificate. +If the certificate is invalid, the PUT request is denied. This +script reads environment variables to get X.509 parameters. The +uploaded-data is stored in a directory based on the X.509 Organizational +Unit (server) and Common Name fields (node). + +\subsubsection{rsync/ssh} + +This technique uses the {\em rsync\/} utility to transfer files. +You'll probably want to use {\em ssh\/} as the underlying transport, +although you can still use the less-secure {\em rsh\/} or native +rsync server transports if you like. + +If you use {\em ssh\/} then you'll need to create passphrase-less +SSH keys so that the transfer can occur automatically. You may +want to create special {\em dsc\/} userids on both ends as well. + +\subsection{The Extractor} + +The XML extractor is a Perl script that reads the XML files from +{\tt dsc\/}. The extractor essentially converts the XML-structured +data to a format that is easier (faster) for the graphing tools to +parse. Currently the extracted data files are line-based ASCII +text files. Support for SQL databases is planned for the future. + +\subsection{The Grapher} + +{\dsc} uses {\em Ploticus\/}\footnote{http://ploticus.sourceforge.net/} +as the graphing engine. A Perl module and CGI script read extracted +data files and generate Ploticus scriptfiles to generate plots. Plots +are always generated on demand via the CGI application. + +\path|dsc-grapher.pl| is the script that displays graphs from the +archived data. + + +\section{Architecture} + +Figure~\ref{fig-architecture} shows the {\dsc} architecture. 
+ +\begin{figure} +\centerline{\psfig{figure=dsc-arch.eps,width=3.5in}} +\caption{\label{fig-architecture}The {\dsc} architecture.} +\end{figure} + +Note that {\dsc} utilizes the concept of {\em servers\/} and {\em +nodes\/}. A server is generally a logical service, which may +actually consist of multiple nodes. Figure~\ref{fig-architecture} +shows six collectors (the circles) and two servers (the rounded +rectangles). For a real-world example, consider a DNS root server. +IP Anycast allows a DNS root server to have geographically distributed +nodes that share a single IP address. We call each instance a +{\em node\/} and all nodes sharing the single IP address belong +to the same {\em server\/}. + +The {\dsc} collector program runs on or near\footnote{by +``near'' we mean that packets may be sniffed remotely via Ethernet taps, switch +port mirroring, or a SPAN port.} the remote nodes. Its XML output +is transferred to the presentation machine via HTTPS PUTs (or something simpler +if you prefer). + +The presentation machine includes an HTTP(S) server. The extractor looks +for XML files PUT there by the collectors. A CGI script also runs on +the HTTP server to display graphs and other information. + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\chapter{Installing the Presenter} + +You'll probably want to get the Presenter working before the Collector. +If you're using the secure XML data transfer, you'll need to +generate both client- and server-side X.509 certificates. + +Installing the Presenter involves the following steps: +\begin{itemize} +\setlength{\itemsep}{0ex plus 0.5ex minus 0.0ex} +\item + Install Perl dependencies +\item + Install {\dsc} software +\item + Create X.509 certificates (optional) +\item + Set up a secure HTTP server (e.g., Apache and mod\_ssl) +\item + Add some cron jobs +\end{itemize} + + +\section{Install Perl Dependencies} + +{\dsc} uses Perl for the extractor and grapher components. Chances are +that you'll need Perl-5.8, or maybe only Perl-5.6. You'll also need +these readily available third-party Perl modules, which you +can find via CPAN: + +\begin{itemize} +\setlength{\itemsep}{0ex plus 0.5ex minus 0.0ex} + \item CGI-Untaint (CGI::Untaint) + \item CGI.pm (CGI) + \item Digest-MD5 (Digest::MD5) + \item File-Flock (File::Flock) + \item File-Spec (File::Spec) + \item File-Temp (File::Temp) + \item Geography-Countries (Geography::Countries) + \item Hash-Merge (Hash::Merge) + \item IP-Country (IP::Country) + \item MIME-Base64 (MIME::Base64) + \item Math-Calc-Units (Math::Calc::Units) + \item Scalar-List-Utils (List::Util) + \item Text-Template (Text::Template) + \item URI (URI::Escape) + \item XML-Simple (XML::Simple) + \item Net-DNS-Resolver (Net::DNS::Resolver) + +\end{itemize} + +\noindent +Also note that XML::Simple requires XML::Parser, which in +turn requires the {\em expat\/} package. + +\section{Install Ploticus} + +{\dsc} uses Ploticus to generate plots and graphs. You can find +this software at \verb|http://ploticus.sourceforge.net|. The {\em +Download\/} page has links to some pre-compiled binaries and packages. +FreeBSD and NetBSD users can find Ploticus in the ports/packages +collection. + + +\section{Install {\dsc} Software} + +All of the extractor and grapher tools are Perl or {\tt /bin/sh} +scripts, so there is no need to compile anything. 
Still, +you should run {\tt make} first: + +\begin{MyVerbatim} +% cd presenter +% make +\end{MyVerbatim} + +If you see errors about missing Perl prerequisites, you may want +to correct those before continuing. + +The next step is to install the files. Recall that +\path|/usr/local/dsc| is the hard-coded installation prefix. +You must create it manually: + +\begin{MyVerbatim} +% mkdir /usr/local/dsc +% make install +\end{MyVerbatim} + +Note that {\dsc}'s Perl modules are installed in the +``site\_perl'' directory. You'll probably need {\em root\/} +privileges to install files there. + +\section{CGI Symbolic Links} + +{\dsc} has a couple of CGI scripts that are installed +into \path|/usr/local/dsc/libexec|. You should add symbolic +links from your HTTP server's \path|cgi-bin| directory to +these scripts. + +Both of these scripts have been designed to be mod\_perl-friendly. + +\begin{MyVerbatim} +% cd /usr/local/apache/cgi-bin +% ln -s /usr/local/dsc/libexec/put-file.pl +% ln -s /usr/local/dsc/libexec/dsc-grapher.pl +\end{MyVerbatim} + +You can skip the \path|put-file.pl| link if you plan to use +{\em rsync\/} to transfer XML files. +If you cannot create symbolic links, you'll need to manually +copy the scripts to the appropriate directory. + + +\section{/usr/local/dsc/data} + +\subsection{X.509 method} + +This directory is where \path|put-file.pl| writes incoming XML +files. It should have been created when you ran {\em make install\/} earlier. +XML files are actually placed in {\em server\/} and {\em +node\/} subdirectories based on the authorized client X.509 certificate +parameters. If you want \path|put-file.pl| to automatically create +the subdirectories, the \path|data| directory must be writable by +the process owner: + +\begin{MyVerbatim} +% chgrp nobody /usr/local/dsc/data/ +% chmod 2775 /usr/local/dsc/data/ +\end{MyVerbatim} + +Alternatively, you can create {\em server\/} and {\em node\/} directories +in advance and make those writable. + +\begin{MyVerbatim} +% mkdir /usr/local/dsc/data/x-root/ +% mkdir /usr/local/dsc/data/x-root/blah/ +% mkdir /usr/local/dsc/data/x-root/blah/incoming/ +% chgrp nobody /usr/local/dsc/data/x-root/blah/ +% chmod 2775 /usr/local/dsc/data/x-root/blah/incoming/ +\end{MyVerbatim} + +Make sure that \path|/usr/local/dsc/data/| is on a large partition with +plenty of free space. You can make it a symbolic link to another +partition if necessary. Note that a typical {\dsc} installation +for a large DNS root server requires about 4GB to hold a year's worth +of data. + +\subsection{rsync Method} + +The directory structure is the same as above (for X.509). The only +differences are that: +\begin{itemize} +\item + The {\em server\/}, {\em node\/}, and {\em incoming\/} + directories must be made in advance. +\item + The directories should be writable by the userid associated + with the {\em rsync}/{\em ssh\/} connection. You may want + to create a dedicated {\em dsc\/} userid for this. +\end{itemize} + + +\section{/usr/local/dsc/var/log} + +The \path|put-file.pl| script logs its activity to +\path|put-file.log| in this directory. It should have been +created when you ran {\em make install\/} earlier. The directory +should be writable by the HTTP server userid (usually {\em nobody\/} +or {\em www\/}). 
Unfortunately the installation isn't fancy enough +to determine that userid yet, so you must change the ownership manually: + +\begin{MyVerbatim} +% chgrp nobody /usr/local/dsc/var/log/ +\end{MyVerbatim} + +Furthermore, you probably want to make sure the log file does not +grow indefinitely. For example, on FreeBSD we add this line to \path|/etc/newsyslog.conf|: + +\begin{MyVerbatim} +/usr/local/dsc/var/log/put-file.log nobody:wheel 644 10 * @T00 BN +\end{MyVerbatim} + +You need not worry about this directory if you are using the +{\em rsync\/} upload method. + +\section{/usr/local/dsc/cache} + +This directory, also created by {\em make install\/} above, holds cached +plot images. It also must be writable by the HTTP userid: + +\begin{MyVerbatim} +% chgrp nobody /usr/local/dsc/cache/ +\end{MyVerbatim} + +\section{Cron Jobs} + +{\dsc} requires two cron jobs on the Presenter. The first +is the one that processes incoming XML files. It is called +\path|refile-and-grok.sh|. We recommend running it every +minute. You also may want to run the jobs at a lowerer priority +with {\tt nice\/}. Here is the cron job that we use: + +\begin{MyVerbatim} +* * * * * /usr/bin/nice -10 /usr/local/dsc/libexec/refile-and-grok.sh +\end{MyVerbatim} + +The other useful cron script is \path|remove-xmls.pl|. It removes +XML files older than a specified number of days. Since most of the +information in the XML files is archived into easier-to-parse +data files, you can remove the XML files after a few days. This is +the job that we use: + +\begin{MyVerbatim} +@midnight find /usr/local/dsc/data/ | /usr/local/dsc/libexec/remove-xmls.pl 7 +\end{MyVerbatim} + +\section{Data URIs} + +{\dsc} uses ``Data URIs'' by default. This is a URI where the +content is base-64 encoded into the URI string. It allows us +to include images directly in HTML output, such that the browser +does not have to make additional HTTP requests for the images. +Data URIs may not work with some browsers. + +To disable Data URIs, edit {\em presenter/perllib/DSC/grapher.pm\/} +and change this line: + +\begin{verbatim} + $use_data_uri = 1; +\end{verbatim} + +to + +\begin{verbatim} + $use_data_uri = 0; +\end{verbatim} + +Also make this symbolic link from your HTTP servers ``htdocs'' directory: + +\begin{verbatim} +# cd htdocs +# ln -s /usr/local/dsc/share/html dsc +\end{verbatim} + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\chapter{Configuring the {\dsc} Presenter} + +This chapter describes how to create X.509 certificates and configure +Apache/mod\_ssl. If you plan on using a different upload +technique (such as scp or rsync) you can skip these instructions. + +\section{Generating X.509 Certificates} + +We use X.509 certificates to authenticate both sides +of an SSL connection when uploading XML data files from +the collector to the presenter. + +Certificate generation is a tricky thing. We use three different +types of certificates: +\begin{enumerate} +\item A self-signed root CA certificate +\item A server certificate +\item Client certificates for each collector node +\end{enumerate} + +In the client certificates +we use X.509 fields to store the collector's server and node name. +The Organizational Unit Name (OU) becomes the server name and +the Common Name (CN) becomes the node name. + +The {\dsc} source code distribution includes some shell scripts +that we have +used to create X.509 certificates. You can find them in the +\path|presenter/certs| directory. 
Note these are not installed +into \path|/usr/local/dsc|. You should edit \path|openssl.conf| +and enter the relevant information for your organization. + +\subsection{Certificate Authority} + +You may need to create a self-signed certificate authority if you +don't already have one. The CA signs client and server certificates. +You will need to distribute the CA and client certificates to +collector sites. Here is how to use our \path|create-ca-cert.sh| +script: + +\begin{MyVerbatim} +% sh create-ca-cert.sh +CREATING CA CERT +Generating a 2048 bit RSA private key +.............................................................................. +............+++ +......+++ +writing new private key to './private/cakey.pem' +Enter PEM pass phrase: +Verifying - Enter PEM pass phrase: +----- +\end{MyVerbatim} + + +\subsection{Server Certificate} + +The server certificate is used by the HTTP server (Apache/mod\_ssl). +The clients will have a copy of the CA certificate so they +can validate the server's certificate when uploading XML files. +Use the \path|create-srv-cert.sh| script to create a server +certificate: + +\begin{MyVerbatim} +% sh create-srv-cert.sh +CREATING SERVER REQUEST +Generating a 1024 bit RSA private key +..........................++++++ +.....................................++++++ +writing new private key to 'server/server.key' +Enter PEM pass phrase: +Verifying - Enter PEM pass phrase: +----- +You are about to be asked to enter information that will be incorporated +into your certificate request. +What you are about to enter is what is called a Distinguished Name or a DN. +There are quite a few fields but you can leave some blank +For some fields there will be a default value, +If you enter '.', the field will be left blank. +----- +Country Name (2 letter code) [AU]:US +State or Province Name (full name) [Some-State]:Colorado +Locality Name (eg, city) []:Boulder +Organization Name (eg, company) [Internet Widgits Pty Ltd]:The Measurement Factory, Inc +Organizational Unit Name (eg, section) []:DNS +Common Name (eg, YOUR name) []:dns.measurement-factory.com +Email Address []:wessels@measurement-factory.com + +Please enter the following 'extra' attributes +to be sent with your certificate request +A challenge password []: +An optional company name []: +Enter pass phrase for server/server.key: +writing RSA key +CREATING SERVER CERT +Using configuration from ./openssl.conf +Enter pass phrase for ./private/cakey.pem: +Check that the request matches the signature +Signature ok +The Subject's Distinguished Name is as follows +countryName :PRINTABLE:'US' +stateOrProvinceName :PRINTABLE:'Colorado' +localityName :PRINTABLE:'Boulder' +organizationName :PRINTABLE:'The Measurement Factory, Inc' +organizationalUnitName:PRINTABLE:'DNS' +commonName :PRINTABLE:'dns.measurement-factory.com' +emailAddress :IA5STRING:'wessels@measurement-factory.com' +Certificate is to be certified until Jun 3 20:06:17 2013 GMT (3000 days) +Sign the certificate? [y/n]:y + + +1 out of 1 certificate requests certified, commit? [y/n]y +Write out database with 1 new entries +Data Base Updated +\end{MyVerbatim} + +Note that the Common Name must match the hostname of the HTTP +server that receives XML files. + +Note that the \path|create-srv-cert.sh| script rewrites the +server key file without the RSA password. This allows your +HTTP server to start automatically without prompting for +the password. + +The script leaves the server certificate and key in the \path|server| +directory. 
You'll need to copy these over to the HTTP server config +directory as described later in this chapter. + +\section{Client Certificates} + +Generating client certificates is similar. Remember that +the Organizational Unit Name and Common Name correspond to the +collector's {\em server\/} and {\em node\/} names. For example: + +\begin{MyVerbatim} +% sh create-clt-cert.sh +CREATING CLIENT REQUEST +Generating a 1024 bit RSA private key +................................++++++ +..............++++++ +writing new private key to 'client/client.key' +Enter PEM pass phrase: +Verifying - Enter PEM pass phrase: +----- +You are about to be asked to enter information that will be incorporated +into your certificate request. +What you are about to enter is what is called a Distinguished Name or a DN. +There are quite a few fields but you can leave some blank +For some fields there will be a default value, +If you enter '.', the field will be left blank. +----- +Country Name (2 letter code) [AU]:US +State or Province Name (full name) [Some-State]:California +Locality Name (eg, city) []:Los Angeles +Organization Name (eg, company) [Internet Widgits Pty Ltd]:Some DNS Server +Organizational Unit Name (eg, section) []:x-root +Common Name (eg, YOUR name) []:LAX +Email Address []:noc@example.com + +Please enter the following 'extra' attributes +to be sent with your certificate request +A challenge password []: +An optional company name []: +CREATING CLIENT CERT +Using configuration from ./openssl.conf +Enter pass phrase for ./private/cakey.pem: +Check that the request matches the signature +Signature ok +The Subject's Distinguished Name is as follows +countryName :PRINTABLE:'US' +stateOrProvinceName :PRINTABLE:'California' +localityName :PRINTABLE:'Los Angeles' +organizationName :PRINTABLE:'Some DNS Server' +organizationalUnitName:PRINTABLE:'x-root ' +commonName :PRINTABLE:'LAX' +emailAddress :IA5STRING:'noc@example.com' +Certificate is to be certified until Jun 3 20:17:24 2013 GMT (3000 days) +Sign the certificate? [y/n]:y + + +1 out of 1 certificate requests certified, commit? [y/n]y +Write out database with 1 new entries +Data Base Updated +Enter pass phrase for client/client.key: +writing RSA key +writing RSA key +\end{MyVerbatim} + +The client's key and certificate will be placed in a directory +based on the server and node names. For example: + +\begin{MyVerbatim} +% ls -l client/x-root/LAX +total 10 +-rw-r--r-- 1 wessels wessels 3311 Mar 17 13:17 client.crt +-rw-r--r-- 1 wessels wessels 712 Mar 17 13:17 client.csr +-r-------- 1 wessels wessels 887 Mar 17 13:17 client.key +-rw-r--r-- 1 wessels wessels 1953 Mar 17 13:17 client.pem +\end{MyVerbatim} + +The \path|client.pem| (and \path|cacert.pem|) files should be copied +to the collector machine. + +\section{Apache Configuration} + +\noindent +You need to configure Apache for SSL. 
Here is what our configuration +looks like: + +\begin{MyVerbatim} +SSLRandomSeed startup builtin +SSLRandomSeed startup file:/dev/random +SSLRandomSeed startup file:/dev/urandom 1024 +SSLRandomSeed connect builtin +SSLRandomSeed connect file:/dev/random +SSLRandomSeed connect file:/dev/urandom 1024 + +<VirtualHost _default_:443> +DocumentRoot "/httpd/htdocs-ssl" +SSLEngine on +SSLCertificateFile /httpd/conf/SSL/server/server.crt +SSLCertificateKeyFile /httpd/conf/SSL/server/server.key +SSLCertificateChainFile /httpd/conf/SSL/cacert.pem + +# For client-validation +SSLCACertificateFile /httpd/conf/SSL/cacert.pem +SSLVerifyClient require + +SSLOptions +CompatEnvVars +Script PUT /cgi-bin/put-file.pl +</VirtualHost> +\end{MyVerbatim} + +\noindent +Note the last line of the configuration specifies the CGI script +that accepts PUT requests. The {\em SSLOptions\/} +line is necessary so that the CGI script receives certain HTTP +headers as environment variables. Those headers/variables convey +the X.509 information to the script so it knows where to store +received XML files. + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\chapter{Collector Installation} + + +A collector machine needs only the {\em dsc\/} binary, a configuration +file, and a couple of cron job scripts. + +At this point, {\dsc} lacks certain niceties such as a \path|./configure| +script. The installation prefix, \path|/usr/local/dsc| is currently +hard-coded. + + +\section{Prerequisites} + +You'll need a C/C++ compiler to compile the {\tt dsc\/} source code. + +If the collector and archiver are different systems, you'll need a +way to transfer data files. We recommend that you use the {\tt +curl\/} HTTP/SSL client You may use another technique, such as {\tt +scp\/} or {\tt rsync\/} if you prefer. + +\section{\tt Installation} + +You can compile {\tt dsc\/} from the {\tt collector\/} directory: + +\begin{MyVerbatim} +% cd collector +% make +\end{MyVerbatim} + +Assuming there are no errors or problems during compilation, install +the {\tt dsc\/} binary and other scripts with: + +\begin{MyVerbatim} +% make install +\end{MyVerbatim} + +This installs five files: +\begin{Verbatim} +/usr/local/dsc/bin/dsc +/usr/local/dsc/etc/dsc.conf.sample +/usr/local/dsc/libexec/upload-prep.pl +/usr/local/dsc/libexec/upload-rsync.sh +/usr/local/dsc/libexec/upload-x509.sh +\end{Verbatim} + +Of course, if you don't want to use the default installation +prefix, you can manually copy these files to a location +of your choosing. If you do that, you'll also need to +edit the cron scripts to match your choice of pathnames, etc. + +\section{Uploading XML Files} +\label{sec-install-collector-cron} + +This section describes how XML files are transferred from +the collector to one or more Presenter systems. + +As we'll see in the next chapter, each {\tt dsc} process +has its own {\em run directory\/}. This is the directory +where {\tt dsc} leaves its XML files. It usually has a +name like \path|/usr/local/dsc/run/NODENAME|\@. XML files +are removed after they are successfully transferred. If the +Presenter is unreachable, XML files accumulate here until +they can be transferred. Make sure that you have +enough disk space to queue a lot of XML files in the +event of an outage. + +In general we want to be able to upload XML files to multiple +presenters. This is the reason behind the {\tt upload-prep.pl} +script. 
This script runs every 60 seconds from cron: + +\begin{MyVerbatim} +* * * * * /usr/local/dsc/libexec/upload-prep.pl +\end{MyVerbatim} + +{\tt upload-prep.pl} looks for \path|dsc.conf| files in +\path|/usr/local/dsc/etc| by default. For each config file +found, it cd's to the {\em run\_dir\/} and links\footnote{as in +``hard link'' made with \path|/bin/ln|.} +XML files to one or more upload directories. The upload directories +are named \path|upload/dest1|, \path|upload/dest2|, and so on. + +In order for all this to work, you must create the directories +in advance. For example, if you are collecting stats on +your nameserver named {\em ns0\/}, and want to send the XML files +to two presenters (named oarc and archive), the directory structure +might look like: + +\begin{MyVerbatim} +% set prefix=/usr/local/dsc +% mkdir $prefix/run +% mkdir $prefix/run/ns0 +% mkdir $prefix/run/ns0/upload +% mkdir $prefix/run/ns0/upload/oarc +% mkdir $prefix/run/ns0/upload/archive +\end{MyVerbatim} + +With that directory structure, the {\tt upload-prep.pl} script moves +XML files from the \path|ns0| directory to the two +upload directories, \path|oarc| and \path|archive|. + +To actually transfer files to the presenter, use either +\path|upload-x509.sh| or \path|upload-rsync.sh|. + +\subsection{upload-x509.sh} + +This cron script is responsible for +actually transferring XML files from the upload directories +to the remote server. It creates a {\em tar\/} archive +of XML files and then uploads it to the remote server with +{\tt curl}. The script takes three commandline arguments: + +\begin{MyVerbatim} +% upload-x509.sh NODE DEST URI +\end{MyVerbatim} + +{\em NODE\/} must match the name of a directory under +\path|/usr/local/dsc/run|. Similarly, {\em DEST\/} must match the +name of a directory under \path|/usr/local/dsc/run/NODE/upload|. +{\em URI\/} is the URL/URI that the data is uploaded to. Usually +it is just an HTTPS URL with the name of the destination server. +We also recommend running this from cron every 60 seconds. For +example: + +\begin{MyVerbatim} +* * * * * /usr/local/dsc/libexec/upload-x509.sh ns0 oarc \ + https://collect.oarc.isc.org/ +* * * * * /usr/local/dsc/libexec/upload-x509.sh ns0 archive \ + https://archive.example.com/ +\end{MyVerbatim} + +\path|upload-x509.sh| looks for X.509 certificates in +\path|/usr/local/dsc/certs|. The client certificate should be named +\path|/usr/local/dsc/certs/DEST/NODE.pem| and the CA certificate +should be named +\path|/usr/local/dsc/certs/DEST/cacert.pem|. Note that {\em DEST\/} +and {\em NODE\/} must match the \path|upload-x509.sh| +command line arguments. + +\subsection{upload-rsync.sh} + +This script can be used to transfer XML files files from the upload +directories to the remote server. It uses {\em rsync\/} and +assumes that {\em rsync\/} will use {\em ssh\/} for transport. +This script also takes three arguments: + +\begin{MyVerbatim} +% upload-rsync.sh NODE DEST RSYNC-DEST +\end{MyVerbatim} + +Note that {\em DEST\/} is the name of the local ``upload'' directory +and {\em RSYNC-DEST\/} is an {\em rsync\/} destination (i.e., hostname and remote directory). 
+Here is how you might use it in a crontab: + +\begin{MyVerbatim} +* * * * * /usr/local/dsc/libexec/upload-rsync.sh ns0 oarc \ + dsc@collect.oarc.isc.org:/usr/local/dsc/data/Server/ns0 +* * * * * /usr/local/dsc/libexec/upload-rsync.sh ns0 archive \ + dsc@archive.oarc.isc.org:/usr/local/dsc/data/Server/ns0 +\end{MyVerbatim} + +Also note that \path|upload-rsync.sh| will actually store the remote +XML files in \path|incoming/YYYY-MM-DD| subdirectories. That is, +if your {\em RSYNC-DEST\/} is \path|host:/usr/local/dsc/data/Server/ns0| +then files will actually be written to +\path|/usr/local/dsc/data/Server/ns0/incoming/YYYY-MM-DD| on {\em host}, +where \path|YYYY-MM-DD| is replaced by the year, month, and date of the +XML files. These subdirectories reduce filesystem pressure in the event +of backlogs. + +{\em rsync\/} over {\em ssh\/} requires you to use RSA or DSA public keys +that do not have a passphrase. If you do not want to use one of +{\em ssh\/}'s default identity files, you can create one specifically +for this script. It should be named \path|dsc_uploader_id| (and +\path|dsc_uploader_id.pub|) in the \$HOME/.ssh directory of the user +that will be running the script. For example, you can create it +with this command: + +\begin{MyVerbatim} +% ssh-keygen -t dsa -C dsc-uploader -f $HOME/.ssh/dsc_uploader_id +\end{MyVerbatim} + +Then add \path|dsc_uploader_id.pub| to the \path|authorized_keys| +file of the receiving userid on the presenter system. + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\chapter{Configuring and Running the {\dsc} Collector} + +\section{dsc.conf} + +Before running {\tt dsc\/} you need to create a configuration file. +Note that configuration directive lines are terminated with a semi-colon. +The configuration file currently understands the following directives: + +\begin{description} + +\item[local\_address] + + Specifies the DNS server's local IP address. It is used + to determine the ``direction'' of an IP packet: sending, + receiving, or other. You may specify multiple local addresses + by repeating the {\em local\_address} line any number of times. + + Example: {\tt local\_address 172.16.0.1;\/} + Example: {\tt local\_address 2001:4f8:0:2::13;\/} + +\item[run\_dir] + + A directory that should become {\tt dsc\/}'s current directory + after it starts. XML files will be written here, as will + any core dumps. + + Example: {\tt run\_dir "/var/run/dsc";\/} + +\item[minfree\_bytes] + + If the filesystem where {\tt dsc\/} writes its XML files + does not have at least this much free space, then + {\tt dsc\/} will not write the XML files. This prevents + {\tt dsc\/} from filling up the filesystem. The XML + files that would have been written are simply lost and + cannot be receovered. {\tt dsc\/} will begin writing + XML files again when the filesystem has the necessary + free space. + +\item[bpf\_program] + + A Berkeley Packet Filter program string. Normally you + should leave this unset. You may use this to further + restrict the traffic seen by {\tt dsc\/}. Note that {\tt + dsc\/} currently has one indexer that looks at all IP + packets. If you specify something like {\em udp port 53\/} + that indexer will not work. + + However, if you want to monitor multiple DNS servers with + separate {\dsc} instances on one collector box, then you + may need to use {\em bpf\_program} to make sure that each + {\tt dsc} process sees only the traffic it should see. 
+ + Note that this directive must go before the {\em interface\/} + directive because {\tt dsc\/} makes only one pass through + the configuration file and the BPF filter is set when the + interface is initialized. + + Example: {\tt bpf\_program "dst host 192.168.1.1";\/} + +\item[interface] + + The interface name to sniff packets from or a pcap file to + read packets from. You may specify multiple interfaces. + + Example: + {\tt interface fxp0;\/} + {\tt interface /path/to/dump.pcap;\/} + +\item[bpf\_vlan\_tag\_byte\_order] + + {\tt dsc\/} knows about VLAN tags. Some operating systems + (FreeBSD-4.x) have a bug whereby the VLAN tag id is + byte-swapped. Valid values for this directive are {\tt + host\/} and {\tt net\/} (the default). Set this to {\tt + host\/} if you suspect your operating system has the VLAN + tag byte order bug. + + Example: {\tt bpf\_vlan\_tag\_byte\_order host;\/} + +\item[match\_vlan] + + A list of VLAN identifiers (integers). If set, only the + packets belonging to these VLANs are counted. + + Example: {\tt match\_vlan 101 102;\/} + +\item[qname\_filter] + + This directive allows you to define custom filters + to match query names in DNS messages. Please see + Section~\ref{sec-qname-filter} for more information. + +\item[dataset] + + This directive is the heart of {\dsc}. However, it is also + the most complex. + To save time we recommend that you copy interesting-looking + dataset definitions from \path|dsc.conf.sample|. Comment + out any that you feel are irrelevant or uninteresting. + Later, as you become more familiar with {\dsc}, you may + want to read the next chapter and add your own custom + datasets. + +\item[output\_format] + + Specify the output format, can be give multiple times to output in more then + one format. Default output format is XML. 
+ + Available formats are: + - XML + - JSON + + Example: {\tt output\_format JSON} +\end{description} + + +\section{A Complete Sample dsc.conf} + +Here's how your entire {\em dsc.conf\/} file might look: + +\begin{MyVerbatim} +#bpf_program +interface em0; + +local_address 192.5.5.241; + +run_dir "/usr/local/dsc/run/foo"; + +dataset qtype dns All:null Qtype:qtype queries-only; +dataset rcode dns All:null Rcode:rcode replies-only; +dataset opcode dns All:null Opcode:opcode queries-only; +dataset rcode_vs_replylen dns Rcode:rcode ReplyLen:msglen replies-only; +dataset client_subnet dns All:null ClientSubnet:client_subnet queries-only + max-cells=200; +dataset qtype_vs_qnamelen dns Qtype:qtype QnameLen:qnamelen queries-only; +dataset qtype_vs_tld dns Qtype:qtype TLD:tld queries-only,popular-qtypes + max-cells=200; +dataset certain_qnames_vs_qtype dns CertainQnames:certain_qnames + Qtype:qtype queries-only; +dataset client_subnet2 dns Class:query_classification + ClientSubnet:client_subnet queries-only max-cells=200; +dataset client_addr_vs_rcode dns Rcode:rcode ClientAddr:client + replies-only max-cells=50; +dataset chaos_types_and_names dns Qtype:qtype Qname:qname + chaos-class,queries-only; +dataset idn_qname dns All:null IDNQname:idn_qname queries-only; +dataset edns_version dns All:null EDNSVersion:edns_version queries-only; +dataset do_bit dns All:null D0:do_bit queries-only; +dataset rd_bit dns All:null RD:rd_bit queries-only; +dataset tc_bit dns All:null TC:tc_bit replies-only; +dataset idn_vs_tld dns All:null TLD:tld queries-only,idn-only; +dataset ipv6_rsn_abusers dns All:null ClientAddr:client + queries-only,aaaa-or-a6-only,root-servers-n et-only max-cells=50; +dataset transport_vs_qtype dns Transport:transport Qtype:qtype queries-only; + +dataset direction_vs_ipproto ip Direction:ip_direction IPProto:ip_proto + any; +\end{MyVerbatim} + +\section{Running {\tt dsc}} + +{\tt dsc\/} accepts a single command line argument, which is +the name of the configuration file. For example: + +\begin{MyVerbatim} +% cd /usr/local/dsc +% bin/dsc etc/foo.conf +\end{MyVerbatim} + +If you run {\tt ps} when {\tt dsc} is running, you'll see two processes: + +\begin{MyVerbatim} +60494 ?? S 0:00.36 bin/dsc etc/foo.conf +69453 ?? Ss 0:10.65 bin/dsc etc/foo.conf +\end{MyVerbatim} + +The first process simply forks off child processes every +60 seconds. The child processes do the work of analyzing +and tabulating DNS messages. + +Please use NTP or another technique to keep the collector's +clock synchronized to the correct time. + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\chapter{Viewing {\dsc} Graphs} + +To view {\dsc} data in a web browser, simply enter the +URL to the \path|dsc-grapher.pl| CGI. But before you +do that, you'll need to create a grapher configuration file. + +\path|dsc-grapher.pl| uses a simple configuration file to set certain +menu options. This configuration file is +\path|/usr/local/dsc/etc/dsc-grapher.cfg|. You should find +a sample version in the same directory. 
For example: + +\begin{MyVerbatim} +server f-root pao1 sfo2 +server isc senna+piquet +server tmf hq sc lgh +trace_windows 1hour 4hour 1day 1week 1month +accum_windows 1day 2days 3days 1week +timezone Asia/Tokyo +domain_list isc_tlds br nl ca cz il pt cl +domain_list isc_tlds sk ph hr ae bg is si za +valid_domains isc isc_tlds + +\end{MyVerbatim} + +\begin{figure} +\centerline{\psfig{figure=screenshot1.eps,width=6.5in}} +\caption{\label{fig-screenshot1}A sample graph} +\end{figure} + +Refer to Figure~\ref{fig-screenshot1} to see how +the directives affect the visual display. +The following three directives should always be set in +the configuration file: + +\begin{description} +\item[server] + This directive tells \path|dsc-grapher.pl| to list + the given server and its associated nodes in the + ``Servers/Nodes'' section of its navigation menu. + You can repeat this directive for each server that + the Presenter has. +\item[trace\_windows] + Specifies the ``Time Scale'' menu options for + trace-based plots. +\item[accum\_windows] + Specifies the ``Time Scale'' menu options for + ``cumulative'' plots, such as the Classification plot. +\end{description} + +Note that the \path|dsc-grapher.cfg| only affects what +may appear in the navigation window. It does NOT prevent users +from entering other values in the URL parameters. For example, +if you have data for a server/node in your +\path|/usr/local/dsc/data/| directory that is not listed in +\path|dsc-grapher.cfg|, a user may still be able to view that +data by manually setting the URL query parameters. + +The configuration file accepts a number of optional directives +as well. You may set these if you like, but they are not +required: + +\begin{description} +\item[timezone] + Sets the time zone for dates and times displayed in the + graphs. + You can use this if you want to override the system + time zone. + The value for this directive should be the name + of a timezone entry in your system database (usually found + in {\path|/usr/share/zoneinfo|}. + For example, if your system time zone is set + to UTC but you want the times displayed for the + London timezone, you can set this directive to + {\tt Europe/London\/}. +\item[domain\_list] + This directive, along with {\em valid\_domains\/}, tell the + presenter which domains a nameserver is authoritative for. + That information is used in the TLDs subgraphs to differentiate + requests for ``valid'' and ``invalid'' domains. + + The {\em domain\_list\/} creates a named list of domains. + The first token is a name for the list, and the remaining + tokens are domain names. The directive may be repeated with + the same list name, as shown in the above example. +\item[valid\_domains] + This directive glues servers and domain\_lists together. The + first token is the name of a {\em server\/} and the second token is + the name of a {\em domain\_list\/}. +\item[embargo] + The {\em embargo\/} directive may be used to delay the + availability of data via the presenter. For example, you + may have one instance of {\em dsc-grapher.pl\/} for internal + use only (password protected, etc). You may also have a + second instance for third-parties where data is delayed by + some amount of time, such as hours, days, or weeks. The value + of the {\em embargo\/} directive is the number of seconds which + data availability should be delayed. For example, if you set + it to 604800, then viewers will not be able to see any data + less than one week old. 
+\item[anonymize\_ip] + When the {\em anonymize\_ip\/} directive is given, IP addresses + in the display will be anonymized. The anonymization algorithm + is currently hard-coded and designed only for IPv4 addresses. + It masks off the lower 24 bits and leaves only the first octet + in place. +\item[hide\_nodes] + When the {\em hide\_nodes\/} directive is given, the presenter + will not display the list node names underneath the current + server. This might be useful if you have a number of nodes + but only want viewers to see the server as a whole, without + exposing the particular nodes in the cluster. Note, however, + that if someone already knows the name of a node they can + hand-craft query terms in the URL to display the data for + only that node. In other words, the {\em hide\_nodes\/} + only provides ``security through obscurity.'' +\end{description} + + +The first few times you try \path|dsc-grapher.pl|, be sure to run +{\tt tail -f} on the HTTP server error.log file. + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\chapter{{\dsc} Datasets} + +A {\em dataset\/} is a 2-D array of counters. For example, you +might have a dataset with ``Query Type'' along one dimension and +``Query Name Length'' on the other. The result is a table that +shows the distribution of query name lengths for each query type. +For example: + +\vspace{1ex} +\begin{center} +\begin{tabular}{l|rrrrrr} +Len & A & AAAA & A6 & PTR & NS & SOA \\ +\hline +$\cdots$ & & & & & \\ +11 & 14 & 8 & 7 & 11 & 2 & 0 \\ +12 & 19 & 2 & 3 & 19 & 4 & 1 \\ +$\cdots$ & & & & & & \\ +255 & 0 & 0 & 0 & 0 & 0 & 0 \\ +\hline +\end{tabular} +\end{center} +\vspace{1ex} + +\noindent +A dataset is defined by the following parameters: +\begin{itemize} +\setlength{\itemsep}{0ex plus 0.5ex minus 0.0ex} +\item A name +\item A protocol layer (IP or DNS) +\item An indexer for the first dimension +\item An indexer for the second dimension +\item One or more filters +\item Zero or more options and parameters +\end{itemize} + +\noindent +The {\em dataset\/} definition syntax in \path|dsc.conf| is: + +{\tt dataset\/} +{\em name\/} +{\em protocol\/} +{\em Label1:Indexer1\/} +{\em Label2:Indexer2\/} +{\em filter\/} +{\em [parameters]\/}; +\vspace{2ex} + +\section{Dataset Name} + +The dataset name is used in the filename for {\tt dsc\/}'s XML +files. Although this is an opaque string in theory, the Presenter's +XML extractor routines must recognize the dataset name to properly +parse it. The source code file +\path|presenter/perllib/DSC/extractor/config.pm| contains an entry +for each known dataset name. + +\section{Protocol} + +{\dsc} currently knows about two protocol layers: IP and DNS. +On the {\tt dataset\/} line they are written as {\tt ip\/} and {\tt dns\/}. + + +\section{Indexers} + +An {\em indexer\/} is simply a function that transforms the attributes +of an IP/DNS message into an array index. For some attributes the +transformation is straightforward. For example, the ``Query Type'' +indexer simply extracts the query type value from a DNS message and +uses this 16-bit value as the array index. + +Other attributes are slightly more complicated. For example, the +``TLD'' indexer extracts the TLD of the QNAME field of a DNS message +and maps it to an integer. The indexer maintains a simple internal +table of TLD-to-integer mappings. The actual integer values are +unimportant because the TLD strings, not the integers, appear in +the resulting XML data. 
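+
+As a rough sketch (again, not the actual {\dsc} source), an indexer
+amounts to a function from a message attribute to a small integer;
+a string-valued attribute such as the TLD goes through a private
+lookup table. Hypothetical names and sizes are used here:
+
+\begin{MyVerbatim}
+#include <string.h>
+
+/* a numeric attribute indexes directly */
+int qtype_indexer(unsigned short qtype)
+{
+    return (int) qtype;
+}
+
+/* a string attribute maps through an internal table; the integer
+ * itself is unimportant because the TLD string, not the index,
+ * appears in the XML output */
+#define MAX_TLDS 1024
+static char *tld_table[MAX_TLDS];
+static int tld_count = 0;
+
+int tld_indexer(const char *tld)
+{
+    int i;
+    for (i = 0; i < tld_count; i++)
+        if (0 == strcmp(tld_table[i], tld))
+            return i;
+    if (MAX_TLDS == tld_count)
+        return MAX_TLDS - 1;    /* table full */
+    tld_table[tld_count] = strdup(tld);
+    return tld_count++;
+}
+\end{MyVerbatim}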
+ +When you specify an indexer on a {\tt dataset\/} line, you must +provide both the name of the indexer and a label. The Label appears +as an attribute in the XML output. For example, +Figure~\ref{fig-sample-xml} shows the XML corresponding to this +{\em dataset\/} line: + +\begin{MyVerbatim} +dataset the_dataset dns Foo:foo Bar:bar queries-only; +\end{MyVerbatim} + +\begin{figure} +\begin{MyVerbatim} +<array name="the_dataset" dimensions="2" start_time="1091663940" ... + <dimension number="1" type="Foo"/> + <dimension number="2" type="Bar"/> + <data> + <Foo val="1"> + <Bar val="0" count="4"/> + ... + <Bar val="100" count="41"/> + </Foo> + <Foo val="2"> + ... + </Foo> + </data> +</array> +\end{MyVerbatim} +\caption{\label{fig-sample-xml}Sample XML output} +\end{figure} + +In theory you are free to choose any label that you like, however, +the XML extractors look for specific labels. Please use the labels +given for the indexers in Tables~\ref{tbl-dns-indexers} +and~\ref{tbl-ip-indexers}. + +\subsection{IP Indexers} + +\begin{table} +\begin{center} +\begin{tabular}{|lll|} +\hline +Indexer & Label & Description \\ +\hline +ip\_direction & Direction & one of sent, recv, or other \\ +ip\_proto & IPProto & IP protocol (icmp, tcp, udp) \\ +ip\_version & IP version number (4, 6) \\ +\hline +\end{tabular} +\caption{\label{tbl-ip-indexers}IP packet indexers} +\end{center} +\end{table} + +{\dsc} includes only minimal support for collecting IP-layer +stats. Mostly we are interested in finding out the mix of +IP protocols received by the DNS server. It can also show us +if/when the DNS server is the subject of denial-of-service +attack. +Table~\ref{tbl-ip-indexers} shows the indexers for IP packets. +Here are their longer descriptions: + +\begin{description} +\item[ip\_direction] + One of three values: sent, recv, or else. Direction is determined + based on the setting for {\em local\_address\/} in the configuration file. +\item[ip\_proto] + The IP protocol type, e.g.: tcp, udp, icmp. + Note that the {\em bpf\_program\/} setting affects all traffic + seen by {\dsc}. If the program contains the word ``udp'' + then you won't see any counts for non-UDP traffic. +\item[ip\_version] + The IP version number, e.g.: 4 or 6. Can be used to compare how much + traffic comes in via IPv6 compared to IPV4. +\end{description} + +\subsection{IP Filters} + +Currently there is only one IP protocol filter: {\tt any\/}. +It includes all received packets. 
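+
+For example, the IP-layer dataset from the sample \path|dsc.conf|
+combines two of the indexers above with the {\tt any\/} filter:
+
+\begin{MyVerbatim}
+dataset direction_vs_ipproto ip Direction:ip_direction IPProto:ip_proto
+    any;
+\end{MyVerbatim}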
+ + +\subsection{DNS Indexers} + +\begin{table} +\begin{center} +\begin{tabular}{|lll|} +\hline +Indexer & Label & Description \\ +\hline +certain\_qnames & CertainQnames & Popular query names seen at roots \\ +client\_subnet & ClientSubnet & The client's IP subnet (/24 for IPv4, /96 for IPv6) \\ +client & ClientAddr & The client's IP address \\ +do\_bit & DO & Whether the DO bit is on \\ +edns\_version & EDNSVersion & The EDNS version number \\ +idn\_qname & IDNQname & If the QNAME is in IDN format \\ +msglen & MsgLen & The DNS message length \\ +null & All & A ``no-op'' indexer \\ +opcode & Opcode & DNS message opcode \\ +qclass & - & Query class \\ +qname & Qname & Full query name \\ +qnamelen & QnameLen & Length of the query name \\ +qtype & Qtype & DNS query type \\ +query\_classification & Class & A classification for bogus queries \\ +rcode & Rcode & DNS response code \\ +rd\_bit & RD & Check if Recursion Desired bit set \\ +tc\_bit & TC & Check if Truncated bit set \\ +tld & TLD & TLD of the query name \\ +transport & Transport & Transport protocol for the DNS message (UDP or TCP) \\ +dns\_ip\_version & IPVersion & IP version of the packet carrying the DNS message \\ +\hline +\end{tabular} +\caption{\label{tbl-dns-indexers}DNS message indexers} +\end{center} +\end{table} + +Table~\ref{tbl-dns-indexers} shows the currently-defined indexers +for DNS messages, and here are their descriptions: + +\begin{description} +\item[certain\_qnames] + This indexer isolates the two most popular query names seen + by DNS root servers: {\em localhost\/} and {\em + [a--m].root-servers.net\/}. +\item[client\_subnet] + Groups DNS messages together by the subnet of the + client's IP address. The subnet is maked by /24 for IPv4 + and by /96 for IPv6. We use this to make datasets with + large, diverse client populations more manageable and to + provide a small amount of privacy and anonymization. +\item[client] + The IP (v4 and v6) address of the DNS client. +\item[do\_bit] + This indexer has only two values: 0 or 1. It indicates + whether or not the ``DO'' bit is set in a DNS query. According to + RFC 2335: {\em Setting the DO bit to one in a query indicates + to the server that the resolver is able to accept DNSSEC + security RRs.} +\item[edns\_version] + The EDNS version number, if any, in a DNS query. EDNS + Version 0 is documented in RFC 2671. +\item[idn\_qname] + This indexer has only two values: 0 or 1. It returns 1 + when the first QNAME in the DNS message question section + is an internationalized domain name (i.e., containing + non-ASCII characters). Such QNAMEs begin with the string + {\tt xn--\/}. This convention is documented in RFC 3490. +\item[msglen] + The overall length (size) of the DNS message. +\item[null] + A ``no-op'' indexer that always returns the same value. + This can be used to effectively turn the 2-D table into a + 1-D array. +\item[opcode] + The DNS message opcode is a four-bit field. QUERY is the + most common opcode. Additional currently defined opcodes + include: IQUERY, STATUS, NOTIFY, and UPDATE. +\item[qclass] + The DNS message query class (QCLASS) is a 16-bit value. IN + is the most common query class. Additional currently defined + query class values include: CHAOS, HS, NONE, and ANY. +\item[qname] + The full QNAME string from the first (and usually only) + QNAME in the question section of a DNS message. +\item[qnamelen] + The length of the first (and usually only) QNAME in a DNS + message question section. 
Note this is the ``expanded'' + length if the message happens to take advantage of DNS + message ``compression.'' +\item[qtype] + The query type (QTYPE) for the first QNAME in the DNS message + question section. Well-known query types include: A, AAAA, + A6, CNAME, PTR, MX, NS, SOA, and ANY. +\item[query\_classification] + A stateless classification of ``bogus'' queries: + \begin{itemize} + \setlength{\itemsep}{0ex plus 0.5ex minus 0.0ex} + \item non-auth-tld: when the TLD is not one of the IANA-approved TLDs. + \item root-servers.net: a query for a root server IP address. + \item localhost: a query for the localhost IP address. + \item a-for-root: an A query for the DNS root (.). + \item a-for-a: an A query for an IPv4 address. + \item rfc1918-ptr: a PTR query for an RFC 1918 address. + \item funny-class: a query with an unknown/undefined query class. + \item funny-qtype: a query with an unknown/undefined query type. + \item src-port-zero: when the UDP message's source port equals zero. + \item malformed: a malformed DNS message that could not be entirely parsed. + \end{itemize} +\item[rcode] + The RCODE value in a DNS response. The most common response + codes are 0 (NO ERROR) and 3 (NXDOMAIN). +\item[rd\_bit] + This indexer returns 1 if the RD (recursion desired) bit is + set in the query. Usually only stub resolvers set the RD bit. + Usually authoritative servers do not offer recursion to their + clients. +\item[tc\_bit] + This indexer returns 1 if the TC (truncated) bit is + set (in a response). An authoritative server sets the TC bit + when the entire response won't fit into a UDP message. +\item[tld] + the TLD of the first QNAME in a DNS message's question section. +\item[transport] + Indicates whether the DNS message is carried via UDP or TCP\@. +\item[dns\_ip\_version] + The IP version number that carried the DNS message. +\end{description} + +\subsection{DNS Filters} + +You must specify one or more of the following filters (separated by commas) on +the {\tt dataset\/} line: + +\begin{description} +\item[any] + The no-op filter, counts all messages. +\item[queries-only] + Count only DNS query messages. A query is a DNS message + where the QR bit is set to 0. +\item[replies-only] + Count only DNS response messages. A query is a DNS message + where the QR bit is set to 1. +\item[popular-qtypes] + Count only DNS messages where the query type is one of: + A, NS, CNAME, SOA, PTR, MX, AAAA, A6, ANY. +\item[idn-only] + Count only DNS messages where the query name is in the + internationalized domain name format. +\item[aaaa-or-a6-only] + Count only DNS Messages where the query type is AAAA or A6. +\item[root-servers-net-only] + Count only DNS messages where the query name is within + the {\em root-servers.net\/} domain. +\item[chaos-class] + Counts only DNS messages where QCLASS is equal to + CHAOS (3). The CHAOS class is generally used + for only the special {\em hostname.bind\/} and + {\em version.bind\/} queries. +\end{description} + +\noindent +Note that multiple filters are ANDed together. That is, they +narrow the input stream, rather than broaden it. + +In addition to these pre-defined filters, you can add your own +custom filters. + +\subsubsection{qname\_filter} +\label{sec-qname-filter} + +The {\em qname\_filter} directive defines a new +filter that uses regular expression matching on the QNAME field of +a DNS message. This may be useful if you have a server that is +authoritative for a number of zones, but you want to limit +your measurements to a small subset. 
The {\em qname\_filter} directive +takes two arguments: a name for the filter and a regular expression. +For example: + +\begin{MyVerbatim} +qname_filter MyFilterName example\.(com|net|org)$ ; +\end{MyVerbatim} + +This filter matches queries (and responses) for names ending with +{\em example.com\/}, {\em example.net\/}, and {\em example.org\/}. +You can reference the named filter in the filters part of a {\em +dataset\/} line. For example: + +\begin{MyVerbatim} +dataset qtype dns All:null Qtype:qtype queries-only,MyFilterName; +\end{MyVerbatim} + +\subsection{Parameters} +\label{sec-dataset-params} + +\noindent +{\tt dsc\/} currently supports the following optional parameters: + +\begin{description} +\item[min-count={\em NN\/}] + Cells with counts less than {\em NN\/} are not included in + the output. Instead, they are aggregated into the special + values {\tt -:SKIPPED:-\/} and {\tt -:SKIPPED\_SUM:-\/}. + This helps reduce the size of datasets with a large number + of small counts. +\item[max-cells={\em NN\/}] + A different, perhaps better, way of limiting the size + of a dataset. Instead of trying to determine an appropriate + {\em min-count\/} value in advance, {\em max-cells\/} + allows you put a limit on the number of cells to + include for the second dataset dimension. If the dataset + has 9 possible first-dimension values, and you specify + a {\em max-cell\/} count of 100, then the dataset will not + have more than 900 total values. The cell values are sorted + and the top {\em max-cell\/} values are output. Values + that fall below the limit are aggregated into the special + {\tt -:SKIPPED:-\/} and {\tt -:SKIPPED\_SUM:-\/} entries. +\end{description} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\chapter{Data Storage} + +\section{XML Structure} + +A dataset XML file has the following structure: + +\begin{MyVerbatim} +<array name="dataset-name" dimensions="2" start_time="unix-seconds" + stop_time="unix-seconds"> + <dimension number="1" type="Label1"/> + <dimension number="2" type="Label2"/> + <data> + <Label1 val="D1-V1"> + <Label2 val="D2-V1" count="N1"/> + <Label2 val="D2-V2" count="N2"/> + <Label2 val="D2-V3" count="N3"/> + </Label1> + <Label1 val="D1-V2"> + <Label2 val="D2-V1" count="N1"/> + <Label2 val="D2-V2" count="N2"/> + <Label2 val="D2-V3" count="N3"/> + </Label1> + </data> +</array> +\end{MyVerbatim} + +\noindent +{\em dataset-name\/}, +{\em Label1\/}, and +{\em Label2\/} come from the dataset definition in {\em dsc.conf\/}. + +The {\em start\_time\/} and {\em stop\_time\/} attributes +are given in Unix seconds. They are normally 60-seconds apart. +{\tt dsc} usually starts a new measurement interval on 60 second +boundaries. That is: + +\begin{equation} +stop\_time \bmod{60} == 0 +\end{equation} + +The LABEL1 VAL attributes ({\em D1-V1\/}, {\em D1-V2\/}, etc) are +values for the first dimension indexer. +Similarly, the LABEL2 VAL attributes ({\em D2-V1\/}, {\em D2-V2\/}, +{\em D2-V3\/}) are values for the second dimension indexer. +For some indexers these +values are numeric, for others they are strings. If the value +contains certain non-printable characters, the string is base64-encoded +and the optional BASE64 attribute is set to 1. + +There are two special VALs that help keep large datasets down +to a reasonable size: {\tt -:SKIPPED:-\/} and {\tt -:SKIPPED\_SUM:-\/}. +These may be present on datasets that use the {\em min-count\/} +and {\em max-cells\/} parameters (see Section~\ref{sec-dataset-params}). 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Data Storage}

\section{XML Structure}

A dataset XML file has the following structure:

\begin{MyVerbatim}
<array name="dataset-name" dimensions="2" start_time="unix-seconds"
    stop_time="unix-seconds">
  <dimension number="1" type="Label1"/>
  <dimension number="2" type="Label2"/>
  <data>
    <Label1 val="D1-V1">
      <Label2 val="D2-V1" count="N1"/>
      <Label2 val="D2-V2" count="N2"/>
      <Label2 val="D2-V3" count="N3"/>
    </Label1>
    <Label1 val="D1-V2">
      <Label2 val="D2-V1" count="N1"/>
      <Label2 val="D2-V2" count="N2"/>
      <Label2 val="D2-V3" count="N3"/>
    </Label1>
  </data>
</array>
\end{MyVerbatim}

\noindent
{\em dataset-name\/},
{\em Label1\/}, and
{\em Label2\/} come from the dataset definition in {\em dsc.conf\/}.

The {\em start\_time\/} and {\em stop\_time\/} attributes
are given in Unix seconds.  They are normally 60 seconds apart.
{\tt dsc} usually starts a new measurement interval on 60-second
boundaries.  That is:

\begin{equation}
stop\_time \bmod 60 = 0
\end{equation}

The LABEL1 VAL attributes ({\em D1-V1\/}, {\em D1-V2\/}, etc.) are
values for the first-dimension indexer.
Similarly, the LABEL2 VAL attributes ({\em D2-V1\/}, {\em D2-V2\/},
{\em D2-V3\/}) are values for the second-dimension indexer.
For some indexers these
values are numeric; for others they are strings.  If a value
contains certain non-printable characters, the string is base64-encoded
and the optional BASE64 attribute is set to 1.

There are two special VALs that help keep large datasets down
to a reasonable size: {\tt -:SKIPPED:-\/} and {\tt -:SKIPPED\_SUM:-\/}.
These may be present in datasets that use the {\em min-count\/}
and {\em max-cells\/} parameters (see Section~\ref{sec-dataset-params}).
{\tt -:SKIPPED:-\/} is the number of cells that were not included
in the XML output.  {\tt -:SKIPPED\_SUM:-\/}, on the other hand, is the
sum of the counts for all the skipped cells.

Note that ``one-dimensional datasets'' still use two dimensions in
the XML file.  The first dimension type and value will be ``All'',
as shown in the example below.

The {\em count\/} values are always integers.  If the count for
a particular tuple is zero, it should not be included in the
XML file.

Note that the XML file does not indicate where it came from;
the server and node it was collected on are not named anywhere in
the file.  Instead, DSC relies on the presenter to store XML files
in a directory hierarchy with the server and node as directory names.

\noindent
Here is a short sample XML file with real content:
\begin{MyVerbatim}
<array name="rcode" dimensions="2" start_time="1154649600"
    stop_time="1154649660">
  <dimension number="1" type="All"/>
  <dimension number="2" type="Rcode"/>
  <data>
    <All val="ALL">
      <Rcode val="0" count="70945"/>
      <Rcode val="3" count="50586"/>
      <Rcode val="4" count="121"/>
      <Rcode val="1" count="56"/>
      <Rcode val="5" count="44"/>
    </All>
  </data>
</array>
\end{MyVerbatim}

\noindent
Please see
\path|http://dns.measurement-factory.com/tools/dsc/sample-xml/|
for more sample XML files.

The XML is not very strict and might cause XML purists to cringe.
{\tt dsc} writes the XML files the old-fashioned way (with printf())
and reads them with Perl's XML::Simple module.
Here is a possibly-valid DTD for the dataset XML format.
Note, however, that the {\em LABEL1\/}
and {\em LABEL2\/} strings are different
for each dataset:

\begin{MyVerbatim}
<!DOCTYPE ARRAY [

<!ELEMENT ARRAY (DIMENSION+, DATA)>
<!ELEMENT DIMENSION EMPTY>
<!ELEMENT DATA (LABEL1+)>
<!ELEMENT LABEL1 (LABEL2+)>
<!ELEMENT LABEL2 EMPTY>

<!ATTLIST ARRAY NAME CDATA #REQUIRED>
<!ATTLIST ARRAY DIMENSIONS CDATA #REQUIRED>
<!ATTLIST ARRAY START_TIME CDATA #REQUIRED>
<!ATTLIST ARRAY STOP_TIME CDATA #REQUIRED>
<!ATTLIST DIMENSION NUMBER CDATA #REQUIRED>
<!ATTLIST DIMENSION TYPE CDATA #REQUIRED>
<!ATTLIST LABEL1 VAL CDATA #REQUIRED>
<!ATTLIST LABEL2 VAL CDATA #REQUIRED>
<!ATTLIST LABEL2 COUNT CDATA #REQUIRED>

]>
\end{MyVerbatim}

\subsection{XML File Naming Conventions}

{\tt dsc\/} relies on certain file naming conventions for XML files.
The file name should be of the form:

\begin{quote}
{\em timestamp\/}.dscdata.xml
\end{quote}

\noindent
For example:

\begin{quote}
1154649660.dscdata.xml
\end{quote}

NOTE: Versions of DSC prior to 2008-01-30 used a different naming
convention.  Instead of ``dscdata,'' the XML file was named after
the dataset that generated the data.  The current XML extraction
code still supports the older naming convention for backward
compatibility.  If the second component of the XML file name is not
``dscdata,'' the extractor assumes it is a dataset name.

\noindent
Dataset names come from {\em dsc.conf\/}, and should match the NAME
attribute of the ARRAY tag inside the XML file.  The timestamp is in
Unix epoch seconds and is usually the same as the {\em stop\_time\/}
value.
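To illustrate the XML structure described above, here is a short Python
sketch (it is not part of {\dsc}; the presenter's own extractors are
Perl scripts) that walks one dataset XML file and prints every cell.
It assumes the optional base64 attribute is written as {\tt base64="1"}:

\begin{MyVerbatim}
#!/usr/bin/env python3
# Illustrative sketch: read one dsc dataset XML file and print its cells.
import base64
import sys
import xml.etree.ElementTree as ET

def value_of(element):
    # VAL attributes flagged with base64="1" are base64-encoded strings.
    val = element.get("val")
    if element.get("base64") == "1":
        val = base64.b64decode(val).decode("utf-8", "replace")
    return val

array = ET.parse(sys.argv[1]).getroot()       # the <array> element
print(array.get("name"), array.get("start_time"), array.get("stop_time"))
for d1 in array.find("data"):                 # first-dimension elements
    for d2 in d1:                             # second-dimension elements
        print(value_of(d1), value_of(d2), int(d2.get("count")))
\end{MyVerbatim}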
\section{JSON Structure}

The JSON structure mirrors the XML structure, so the elements are
the same:

\begin{MyVerbatim}
{
  "name": "dataset-name",
  "start_time": unix-seconds,
  "stop_time": unix-seconds,
  "dimensions": [ "Label1", "Label2" ],
  "data": [
    {
      "Label1": "D1-V1",
      "Label2": [
        { "val": "D2-V1", "count": N1 },
        { "val": "D2-V2", "count": N2 },
        { "val": "D2-V3", "count": N3 }
      ]
    },
    {
      "Label1": "D1-V1-base64",
      "base64": true,
      "Label2": [
        { "val": "D2-V1", "count": N1 },
        { "val": "D2-V2-base64", "base64": true, "count": N2 },
        { "val": "D2-V3", "count": N3 }
      ]
    }
  ]
}
\end{MyVerbatim}


\section{Archived Data Format}

{\dsc} actually uses four different file formats for archived
datasets.  These are all text-based and designed to be quickly
read and written by Perl scripts.

\subsection{Format 1}

\noindent
\begin{tt}time $k1$ $N_{k1}$ $k2$ $N_{k2}$ $k3$ $N_{k3}$ ...
\end{tt}

\vspace{1ex}\noindent
This is a one-dimensional time-series format.\footnote{Which means
it can only be used for datasets where one of the indexers is set
to the Null indexer.}  The first column is a timestamp (Unix seconds).
The remaining space-separated fields are key-value pairs.  For
example:

\begin{MyVerbatim}
1093219980 root-servers.net 122 rfc1918-ptr 112 a-for-a 926 funny-qclass 16
1093220040 root-servers.net 121 rfc1918-ptr 104 a-for-a 905 funny-qclass 15
1093220100 root-servers.net 137 rfc1918-ptr 116 a-for-a 871 funny-qclass 12
\end{MyVerbatim}

\subsection{Format 2}

\noindent
\begin{tt}time $j1$ $k1$:$N_{j1,k1}$:$k2$:$N_{j1,k2}$:... $j2$ $k1$:$N_{j2,k1}$:$k2$:$N_{j2,k2}$:... ...
\end{tt}

\vspace{1ex}\noindent
This is a two-dimensional time-series format.  In the above,
$j$ represents the first dimension indexer and $k$ represents
the second.  Key-value pairs for the second dimension are
separated by colons, rather than by spaces.  For example:

\begin{MyVerbatim}
1093220160 recv icmp:2397:udp:136712:tcp:428 sent icmp:819:udp:119191:tcp:323
1093220220 recv icmp:2229:udp:124708:tcp:495 sent icmp:716:udp:107652:tcp:350
1093220280 recv udp:138212:icmp:2342:tcp:499 sent udp:120788:icmp:819:tcp:364
1093220340 recv icmp:2285:udp:137107:tcp:468 sent icmp:733:udp:118522:tcp:341
\end{MyVerbatim}

\subsection{Format 3}

\noindent
\begin{tt}$k$ $N_{k}$
\end{tt}

\vspace{1ex}\noindent
This format is used for one-dimensional datasets where the key space
is (potentially) very large.  That is, putting all the key-value pairs
on a single line would result in a very long line in the datafile.
Furthermore, for these larger datasets, it is prohibitive to
store the data as a time series.  Instead, the counters are incremented
over time.  For example:

\begin{MyVerbatim}
10.0.160.0 3024
10.0.20.0 92
10.0.244.0 5934
\end{MyVerbatim}

\subsection{Format 4}

\noindent
\begin{tt}$j$ $k$ $N_{j,k}$
\end{tt}

\vspace{1ex}\noindent
This format is used for two-dimensional datasets where one or both
key spaces are very large.  Again, the counters are incremented over
time rather than stored as a time series.
For example:

\begin{MyVerbatim}
10.0.0.0 non-auth-tld 105
10.0.0.0 ok 37383
10.0.0.0 rfc1918-ptr 5941
10.0.0.0 root-servers.net 1872
10.0.1.0 a-for-a 6
10.0.1.0 non-auth-tld 363
10.0.1.0 ok 144
\end{MyVerbatim}
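All four formats are easy to parse with a few lines of code.  As an
illustration, here is a minimal sketch (in Python rather than the Perl
used by the {\dsc} presenter) that parses Format 2 lines, the most
involved of the four, into a timestamp and a nested table of counters:

\begin{MyVerbatim}
#!/usr/bin/env python3
# Illustrative sketch: parse "Format 2" archive lines, e.g.
#   1093220160 recv icmp:2397:udp:136712:tcp:428 sent icmp:819:...
import sys

def parse_format2(line):
    fields = line.split()
    timestamp = int(fields[0])
    table = {}
    # After the timestamp, fields alternate: a first-dimension key,
    # then its colon-separated key:count pairs for the second dimension.
    for j, pairs in zip(fields[1::2], fields[2::2]):
        parts = pairs.split(":")
        table[j] = {k: int(n) for k, n in zip(parts[0::2], parts[1::2])}
    return timestamp, table

for line in sys.stdin:
    print(parse_format2(line))
\end{MyVerbatim}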
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Bugs}

\begin{itemize}

\item
	It seems too confusing to have an opaque name for indexers on the
	dsc.conf dataset line.  The names are pre-determined anyway,
	since they must match what the XML extractors look for.

\item
	It is also redundant to have indexer names and a separate ``Label''
	for the XML file.

\item
	The {\dsc} Perl modules are installed in the ``site\_perl'' directory,
	but they should probably be installed under /usr/local/dsc.

\item
	The {\dsc} collector silently drops UDP fragments.

\end{itemize}

\end{document}