cql - A Flat File Database Query Language

                        Glenn Fowler

                    gsf@research.att.com

                   AT&T Bell Laboratories
               Murray Hill, New Jersey 07974

                          Abstract

cql is a UNIX*  system  tool  that  applies  C  style  query
expressions  to flat file databases.  In some respects it is
yet another addition to the  toolbox  of  programmable  file
filters:  grep [Hume88], sh [Bour78][BK89], awk [AKW88], and
perl [Wall].  However, by restricting  its  problem  domain,
cql  takes advantage of optimizations not available to these
more general purpose tools.

This paper describes the  cql  data  description  and  query
language, query optimizations, and provides comparisons with
other tools.


1.  Introduction

Flat file databases are common in UNIX system  environments.
They  consist  of  newline  terminated records with a single
character that delimits fields  within  each  record.   Well
known  examples  are  /etc/passwd  and  /etc/group, and more
recently the sablime [CF88] MR databases  and  cia  [Chen89]
abstraction databases.

There are two basic flat file database operations:

  update - delete, add or modify records

  query - scan for records based on field selection function

For the most part UNIX system tools make a clear distinction
between these operations.  Update is usually done by special
purpose tools to avoid problems that arise from concurrency.
Some  of  these  tools  are  admittedly low-tech: vipw write
locks the /etc/passwd file and runs the vi editor on it; any
other user running vipw concurrently will be locked out.  On
the other hand query tools usually  assume  that  the  input
files  are  readonly  or  that they at least will not change
during query access.  cql falls into this  category:  it  is
strictly  for  queries  and  supports  no update operations.
Despite  this  restriction  cql  adequately  fills  the  gap
between awk and full featured database management systems.

In the simplest case a flat file database query is a pattern
match  that is applied to one or more fields in each record.
The output is normally a list of all matched records.  grep,
sh,  awk, and perl are well suited for such queries on small
databases.  These commands scan the database from  the  top,
one record at a time, and apply the match expression to each
record.  Unfortunately, as the number of records and queries
increases, the repeated linear scans required by these tools
soon become an intolerable bottleneck.  The  bottleneck  can
be  diminished  by examining the queries to limit the number
of records that must be  scanned,  but  this  requires  some
modifications,  either  to  the  database or to the scanning
tools.

Some applications, such as sablime, ease the  bottleneck  by
partitioning  the  database into several flat files based on
one or more of the record fields.  This  speeds  up  queries
that key on the partitioned fields, but hinders queries that
must  span  the   partition.    Besides   complicating   the
application  query implementation, partitioning also imposes
complexity on database updates and backup.

The perl solution (actually, one of  the  perl  solutions  -
perl  is  the  UNIX  system swiss army knife) is to base the
queries on dbm [BSD86] hashed files rather than flat  files.
Linear  scans are then avoided by accessing the dbm files as
associative arrays.  A problem with this is that a dbm  file
contains  the  hashed  field  name  and record data for each
database record, so its file size is always larger than  the
original  flat  file.  This method also generates a separate
dbm file for each hashed field, making it  unacceptable  for
use with large databases.

Other applications, such as cia, preprocess the database  by
generating  B-tree  or  hash  index files [Park91] for quick
random access.  Specialized scanning tools are then used  to
process the queries.  The advantage here is that no database
changes are required to speed up the queries.  In  addition,
hash  index  files only store pointers into the database, so
their size is usually smaller than  the  original  database.
The speedup, though, is not without cost.  Some of the tools
may be so specialized as to work for only a small  class  of
possible queries; new query classes may require new tools.

Along with sufficient access speed, another challenge is  to
provide   reasonable   syntax   and   semantics   for  query
expressions.  For maximal transparency and  portability  the
database  fields  should  be  accessed  by  name rather than
number or position.  Otherwise queries would become outdated
as the database changes.

cql addresses these issues by providing a fast,  interpreted
symbolic  interface at the user level, with automatic record
hash indexing and query optimization at  the  implementation
level.   Query  expressions  are  modeled  on C, including a
struct construct for defining database record schemas.


2.  Background

As opposed to the UNIX  system  database  tools  like  unity
[Felt82],  cql  traces  its  roots to the C language and the
grep and awk tools.  As such  cql  is  limited  to  readonly
database access.

An example will clarify the differences between the  various
tools.   The example database is /etc/passwd with the record
schema:

    name:passwd:uid:gid:info:home:shell

where : is the field delimiter,  uid  and  gid  are  numeric
fields,  and  the remaining fields are strings.  The example
query selects all records with  uid  less  than  10  and  no
passwd.

Example solutions may not be optimal for each tool, but they
are  a  fair  representation of what can be derived from the
manuals and documentation.   The  author  has  a  few  years
experience  with  grep and sh, some exposure to awk, but had
to resort to a netnews request for perl.

2.1  grep

    grep '^[^:]*::[0-9]:' /etc/passwd

grep associates records with lines and has no implicit field
support,  so  the select expression must explicitly list all
fields.  As it  turns  out  the  expression  uid<10  can  be
matched   by   a   regular   expression;   more  complicated
expressions would  require  extra  tool  plumbing,  possibly
using  the  cut  and  expr  commands.  grep differs from the
other tools in that  a  single  regular  expression  pattern
describes both the schema and query.  This works fine at the
implementation level but is cumbersome as a general  purpose
user interface.

2.2  awk

    awk '
        BEGIN { FS = ":" }
        { if ($3 < 10 && $2 == "") print }
    ' /etc/passwd

Lines are the default awk record and FS specifies the  field
separator  character.   Numeric  expressions are as in C and
string comparison may also use  the  ==  and  !=  operators.
Unfortunately  the  fields  are named by number (starting at
1).  If the database format changes then all  references  to
$number must be changed accordingly.  An advantage over grep
is that fields are accessed as separate entities rather than
being a part of the matching pattern.

2.3  shell

    ifs=$IFS
    IFS=:
    while read name passwd uid gid info home shell junk
    do    if   (( $uid < 10 )) && [[ $passwd == "" ]]
          then IFS=$ifs
               print "$name:$passwd:$uid:$gid:$info:$home:$shell"
               IFS=:
          fi
    done < /etc/passwd

The shell  (ksh[BK89])  version  uses  the  field  splitting
effects  of IFS and read to blast the input records.  A nice
side effect is that read also  names  the  fields.   If  the
database  changes then only the field name arguments to read
must change.  Notice, however, that the shell has  different
syntax  for  numeric  and  string  comparisons.  Also, older
shells [Bour78]  have  no  built-in  expressions  and  would
require  a  separate  program  like  expr  to  do the record
selection.

2.4  perl

    perl -e '
        open (PASSWD, "< /etc/passwd") || die "cannot open /etc/passwd: $!";
        while (<PASSWD>) {
            ($name, $passwd, $uid, $gid, $info, $home, $shell) = split(":");
            if ($uid < 10 && $passwd eq "") {
                print "$name:$passwd:$uid:$info:$home:$shell";
            }
        }
    '

The perl example [Chri92] is similar to shell,  except  that
shell  combines  the  record read and field split operations
into a single read operation.  As with shell string equality
requires   special  syntax  and  $  must  prefix  expression
identifiers.

2.5  cql

    cql -d "
        passwd {
            char*    name;
            char*    passwd;
            int      uid, gid;
            char*    info;
            char*    home, shell;
        }
        passwd.delimiter = ':';
    " -e "uid < 10 && passwd == ''" /etc/passwd

cql queries are  split  into  two  parts.   The  declaration
section  (-d) describes the record schema and the expression
section (-e) provides the matching  query.   Using  cql  for
this query is overkill, but it provides a basis for the more
complex examples that follow.

2.6  Performance

Figure 1 shows the timing in user+sys seconds for the  above
examples,  ordered  from  best  to  worst.   The  times were
averaged over 5 runs on a lightly loaded 20 mip  workstation
with  2  cpus  on an input file consisting of 19,847 records
(1,525,549 bytes) in a local file system.  cat  is  included
as a lower bound.


                             cat      0.31
                             grep     1.77
                             cql      3.29
                             awkcc    3.37
                             awk      7.73
                             perl     9.09
                             ksh     19.98

                 Figure 2.  Example timings

Although the compiled awkcc example runs more than twice  as
fast  as  the awk script it suffers by having a fixed select
expression.   Any  change  in  the  expression  would  force
recompilation  of  the  awk script to make a new executable.
The timings also show that performance for the example query
seems to be inversely proportional to tool functionality.


3.  Optimization

Queries that check fields for equality  are  candidates  for
optimization.   For  example,  most  /etc/passwd queries are
lookups for a particular name, uid  or  gid.   As  mentioned
before,  perl supports an associative array interface to dbm
hash files, but converting to use this  would  require  more
than four times the file space of /etc/passwd itself and the
query syntax would need to change to use the array notation.
cql  offers  an  alternative  that  only  changes the schema
declaration:

        passwd {
            register char*  name;
            char*           passwd;
            register int    uid, gid;
            char*           info;
            char*           home, shell;
        }

As with  C  the  register  keyword  is  a  hint  that  marks
variables that may be frequently accessed.  For cql register
marks fields that may be frequently  checked  for  equality.
cql  generates  a  hash  index  file for each register field
during the first database query.  Subsequent queries use the
index files to prune the scan to only those records with the
same  hash  value  as  the  register  fields  in  the  query
expression.   The  index files are connected to a particular
database; if the database file changes then the index  files
are  regenerated  by doing a full database scan.  Because of
index file  generation  the  first  query  on  schemas  with
register fields is always slower than subsequent queries.

The hash index file algorithm is due to David Korn  and  has
been  implemented  as  a library (hix) by the author.  A hix
file stores only hash codes and database file  offsets,  and
its  size  ranges  from 10% to 50% of the original database.
The /etc/passwd example above has one record with  the  name
bozo.   The timings for the query name=="bozo" are listed in
Figure 2.


                    no register fields              2.95
                    first register query            6.52
                    subsequent register queries     0.54
                    grep                            1.64
                    awk                             7.13
                    perl                            7.56

             Figure 4.  Register query timings

The hix file generation slowed the first  query  by  over  2
times but the subsequent queries were about 10 times faster.
Even with hix file generation cql is still  slightly  faster
than awk and perl.  For the example /etc/passwd file size of
1,525,644 bytes the 3 hix files  were  a  total  of  788,952
bytes, or approximately 50% of the original database size.


4.  Sub-schemas

Fields often contain data that  can  be  viewed  as  another
database  record.   cql  supports  this  by  allowing schema
fields within  schemas.   The  sub-schema  fields  are  then
accessed  using  the  familiar  C  `.'  notation.  Our local
/etc/passwd file formats the info field as:

        info {
            char*  name, address, office, home;
        }
        info.delimiter = ",";

where the info sub-schema delimiter is  `,'.   An  important
difference  with  C declaration syntax is that the cql char*
is a basic type.  This means that all of the fields in  this
example  have  type char*, whereas in C only the first field
would be char*.

Adding a second schema declaration introduces  an  ambiguity
as  to  which  schema applies to the main database file.  By
default the main  schema  is  first  schema  from  the  top.
schema=schema-name;  can  be  used  to override the default.
The complete declaration now becomes:

        passwd {
            register char*  name;
            char*           passwd;
            register int    uid, gid;
            info            info;
            char*           home, shell;
        }
        info {
            char*  name, address, office, home;
        }
        passwd.delimiter = ":";
        info.delimiter = ",";
        schema = passwd;

and the following queries are possible:

        info.name=="Bozo T. Clown"
        info.address=="* MH *"

where the second query illustrates ksh pattern  matching  on
the address field.

Fields that refer to sub-schema data in different files  are
also  possible.   In  this case the sub-schema field data is
actually a key that corresponds  to  a  field  (usually  the
first)  in  the  sub-schema data file.  cia uses this format
for its reference and symbol schemas.   Sub-schema  pointers
are  also used in shadow password implementations that split
the /etc/passwd file into /etc/passwd that  contains  public
information  (no  encrypted  passwords) and /etc/shadow that
contains privileged information (the  encrypted  passwords).
An example /etc/passwd record might look like:

        bozo:*shadow*:123:123:Bozo T. Clown, Big Top, 123-456::

with a corresponding /etc/shadow record:

        bozo:abcdef.FEDCBA:Aug 11, 1993

In this case the name field doubles as the  shadow  key  and
the   main   schema   passwd  field  is  ignored.   The  cql
declaration for these shadow passwords is:

        passwd {
            shadow*         name;
            char*           passwd;
            register int    uid, gid;
            info            info;
            char*           home, shell;
        }
        info {
            char*  name, address, office, home;
        }
        shadow {
            char*           name;
            char*           passwd;
            date_t          expire;
        }
        delimiter = ":";
        info.delimiter = ",";
        schema = passwd;
        passwd.input = "/etc/passwd";
        shadow.input = "/etc/shadow";

C pointer notation is used to declare the shadow  sub-schema
reference  and  the predefined date_t type (described below)
is used for the shadow password expiration field.  The  sub-
schema  reference  also  requires  an input file that can be
assigned  by  a  schema="pathname";  statement.   A   shadow
equivalent of the original example query is:

        uid<10 && name.passwd==""

Notice that `.'  is  also  used  to  dereference  sub-schema
pointers  (as opposed to `->' in C).  Additionally, register
is inferred for all sub-schema pointers.  This allows cql to
optimize  queries  by  transforming  equality expressions on
sub-schema fields into hash index offset  expressions.   The
extern  keyword denotes the sub-schema key field in the sub-
schema declaration; it may be omitted if the  key  field  is
the first sub-schema field.

Sub-schema data can also be  specified  in  the  declaration
section:

        info {
            char*    name;
            map*     type;
        }
        map {
            char*    name;
            char*    value;
        }
        info.type.input = { /* each "string" is a record */
            ";_ERROR_",   "g;global",   "t;typedef",
            "e;extern",   "s;static",   "l;libsym"
        };

This is convenient for expanding database encodings  in  the
user  interface.   For example, type.value=="extern" is more
descriptive than type=="e".


5.  Language_Description

The declaration language has already been introduced in  the
previous  examples.  Schemas are declared using the C struct
style:

        schema-name
        {
            type-specifier    field-name [ , field-name ... ] ;
            ...
        }

Schema,  type  and  field  names  must  match  the   regular
expression  [a-zA-Z_][a-zA-Z_0-9]*.   The  C  reserved words
break, case,  continue,  default,  else,  for,  if,  return,
switch,  while,  are  also  reserved  in  cql,  as  are  the
following predefined types:

  char*        A variable length string.

  date_t       A  date  represented  internally  as  seconds
               since the epoch.  The data representation can
               be in any of the ``standard'' forms.   date_t
               fields  can  be used in date comparisons such
               as mtime<"yesterday".

  double       A double floating point constant.

  elapsed_t    Scaled elapsed  time  showing  the  two  most
               significant time components.  Examples are:

                 1.03s one and three one hundreth seconds

                 2m20s two minutes and 20 seconds

                 3w11d three weeks and 11 days

  float        Equivalent to double.

  int          Equivalent to long.

  long         A long integer constant.

  void         The return type of user defined actions.

A type-specifier may be schema name.  schema-name*  declares
a  sub-schema field whose data is in a separate file whereas
schema-name declares a sub-schema whose data  is  the  field
data  itself.   type-specifier  field-name[size] declares an
array field  whose  values  are  separated  by  a  sub-field
delimiter.

The schema name cql is also reserved.  The  fields  in  this
schema  are predefined by cql and provide access to run-time
information.  The fields are:

  elapsed_t clock   The elapsed time since the start of this
                    cql in hundredths of a second.

  date_t date       The time this cql started.

  int errors        The non-fatal error count.

  char* getenv(char* name) Returns   the   value   of    the
                    environment variable name.  For example,
                    cql.getenv("PATH") is the value  of  the
                    PATH environment variable.

  int line          The current  declaration  or  expression
                    input line.

  int offset        The current record offset  in  the  main
                    schema input file starting at 0.

  char* path(char* name, int length) Returns  value  of  the
                    file  pathname name truncated to length.
                    The truncation attempts to preserve  the
                    significant parts of the pathname.

  int record        The current record number  in  the  main
                    schema input file starting at 1.

  int select        The number of records  selected  by  the
                    select expression.

  int size          The size in bytes of the  current  input
                    record.

  date_t time       The current time.

Schema attributes  are  also  assigned  in  the  declaration
section.

        schema = schema-name;

sets the main schema  name.   If  omitted  the  main  schema
defaults  to  the  first  declared  schema  (from  the top).
Schema delimiters are defined by:

        qualified-schema-name.delimiter = "delimiter";

The default main schema delimiter is : and the default  sub-
schema delimiter is ;.  Schema input is defined by:

        qualified-schema-name.delimiter = "path";

where path is the pathname of the schema data file, or by:

        qualified-schema-name.input = {
             "record-1",
             "record-n",
             };

where record-i are the schema records.  Input  data  may  be
shared by:

        a.input = b.input = ...

The  default  main  schema  input  is  the  standard  input;
indirect sub-schema inputs must always be defined.

The expression language is basically C  with  the  exception
that  char*   operands  are  allowed for the ==,!=,<,<=,>=,>
operators.  The expression may be labeled by:

        label: expression;
        ...

or equivalently by:

        void label() { expression };
        ...

the  former  being  more   convenient   for   command   line
expressions.   The  value of a labelled expression is either
the value of a return statement evaluated in the  expression
or  the  value  of  the  last  statement  evaluated  in  the
expression.  The default expression label  is  select.   The
select expression is applied to each record to determine the
matching records.   The  default  select  is  1,  i.e.,  all
records  match.   The  action  expression is then applied to
each record matched by select.  The default action lists the
matching   records   on  the  standard  output.   The  begin
expression, if specified, is evaluated before  the  database
is   scanned  and  the  end  expression,  if  specified,  is
evaluated after all records have been scanned.

It is important to  note  that  cql  is  not  a  complete  C
interpreter.   Just  enough  is  borrowed  from C to get the
query  job  done.   The  implementation  is  based  on  a  C
expression  library  that  was originally written for the tw
[FKV89] replacement for  the  find  command.   This  library
constitutes  a  large  portion  of the work; in fact the cql
prototype was written in one day (although the  addition  of
hash  indexing  and  query  optimization  took  considerably
longer).


6.  Reports

A  query  language  is  incomplete  without  some  form   of
reporting.  cql provides the printf function to output field
values.  printf is most often used in action expressions  to
output portions of selected fields.  For example,

      select: uid < 10 && name.expire < "in 2 days";
      action: printf("%s will expire on %s\n", name, name.expire);

Format specifications are type checked and string to integer
conversions are supported.  For example, name.expire printed
as %d lists the seconds since the epoch, but printed  as  %s
lists the date in the standard date format.

Headers and footers are done by using printf  in  the  begin
and  end  expressions.   Fancier operations like pagination,
text filling and font highlights are left to tools  designed
specifically for that job.

Selected records may be sorted by specifying the sort order:

      sort = { field, ... };


7.  Alternate_Database_Formats

Three alternate database formats are supported.  The formats
are  orthogonal  and  may  appear  in  any combination.  The
first, virtual database files,  Figure  3,  allows  multiple
schemas  to  be  defined  in  a single file.  This format is
similar to UNIX system archive file format but  is  designed
for quick access to the individual schemas.  In the past cia
generated  two  database  files   with   the   fixed   names
reference.db  and symbol.db.  This made it difficult to copy
and  backup  abstraction  databases.   By  using  a  virtual
database  cia  now combines the two files into a single file
with an application specific name, e.g., ksh.db for the  ksh
abstraction  database.   A virtual database file acts like a
directory in the cql interface.   For  example,  the  symbol
schema   data   in   the   ksh.db   virtual  file  is  named
ksh.db/symbol.  A virtual database  file  is  also  used  to
store  the  index  files for each input database.  The index
file is named by inverting the  case  of  the  corresponding
database  file  name  suffix.   For example, the indices for
ksh.db would be ksh.DB.

    [see the USENIX conference proceedings, Winter 1994]
             Figure 6.  Virtual database format

The second format, partitioned  database  files,  Figure  4,
allows  a single database schema to span more than one file.
The implementation is complicated by the fact that  separate
index   files   may   be  associated  with  each  partition.
Partitions are named as a single file pathname by separating
the   partition   component   pathnames   with  `:',  as  in
"ksh.db:libast.db:libdl.db".

    [see the USENIX conference proceedings, Winter 1994]
            Figure 8.  Partition database format

Finally, a union database  file,  Figure  4,  allows  schema
records  to be intermixed in a single file.  The first union
record field is used to identify the schema for each record.
This  format  is  convenient  for applications that generate
streams of different record types.

    [see the USENIX conference proceedings, Winter 1994]
             Figure 10.  Union database format


8.  Future_Work

It should be possible to  access  binary  and  fixed  length
fields with no delimiters.  These are not supported yet.

A transitive closure operation would also be useful.


9.  Conclusion

cql is an attractive scripting tool  for  database  queries.
Because  it  is  an  interpreted  query language it supports
rapid prototyping; because it is fast it  allows  prototypes
to  become  production  code.   Its  interpreted nature also
makes it easy for database users  (as  opposed  to  database
providers) to write and test new queries.


10.  Acknowledgements

Thanks go to  my  colleagues  in  the  Software  Engineering
Research   Department:  Phong  Vo  for  the  excellent  sfio
implementation that pepped up cql I/O; David  Korn  for  the
hash  index  file  algorithm;  Robin  Chen  for  translating
database lingo and concepts into terms a shell  and  C  hack
could understand.


References

[AKW88] A. V. Aho, B. W. Kernighan, P.  J.  Weinberger,  The
      Awk Programming Language, Addison-Wesley, 1988.

[BK89] Morris  Bolsky  and  David  G.  Korn,  The  KornShell
      Command and Programming Language, Prentice Hall, 1989.

[Bour78] S.  R.  Bourne,   The   UNIX   Shell,   AT&T   Bell
      Laboratories  Technical Journal, Vol. 57 No. 6 Part 2,
      pp. 1971-1990, July-August 1978.

[BSD86] The UNIX Programmer's Reference Manual: 4.3 Berkeley
      Software Distribution, UC Berkeley, California, 1986.

[CF88] Steve  Cichinski  and  Glenn   S.   Fowler,   Product
      Administration  through  Sable  and  Nmake,  AT&T Bell
      Laboratories Technical Journal, Vol.  67  No.  4,  pp.
      59-70, July-August 1988.

[Chen89] Yih-Farn Chen,  The  C  Program  Database  and  Its
      Applications, Proc. of Summer 1989 USENIX Conf.

[Chri92] Tom Christiansen, private correspondence.

[Felt82] S. Felts, The Unity DBMS,  AT&T  Bell  Laboratories
      Technical Memorandum, TM82-59312-1, 1992.

[FKV89] Glenn S. Fowler, David G. Korn and Kiem-Phong Vo, An
      Efficient  File Hierarchy Walker, Proc. of Summer 1989
      USENIX Conf.

[Hume88] Andrew   Hume,   Grep   Wars,    EUUG    Conference
      Proceedings, London, England, Spring 1988.

[Park91]  Off The  shelf:  B-tree  data-file  managers,  Tim
      Parker,  UNIX  Review,  Vol. 9 No. 3, pp. 55-58, March
      1991.

[Wall] Larry Wall, The Nutshell perl Book.

Glenn Fowler is a distinguished member of technical staff in
the  Software  Engineering  Research Department at AT&T Bell
Laboratories in Murray Hill, New Jersey.   He  is  currently
involved  with  research  on  configuration  management  and
software  portability,  and  is  the  author  of  nmake,   a
configurable  ANSI  C  preprocessor library, and the coshell
network execution service.  Glenn has been  with  Bell  Labs
since  1979  and  has  a  B.S.E.E., M.S.E.E., and a Ph.D. in
electrical engineering, all from Virginia  Tech,  Blacksburg
Virginia.

__________

  * UNIX is a registered trademark of USL.