warc ls

List WARC file contents

Synopsis

List information about the records in one or more WARC files.

Output options:

--delimiter accepts a string to be used as the output field delimiter.
--fields specifies which fields to include in output. Field specification letters are mostly the same as the fields in
       the CDX file specification (https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/).
       The following fields are supported:
         a - original URL
         b - date in 14 digit format
         B - date in RFC3339 format
         e - IP
         g - file name
         h - original host
         i - record id
         k - checksum
         m - document mime type
         s - HTTP response code
         S - record size in WARC file
         T - record type
         V - offset in WARC file
       A number after the field letter restricts the field length. Prefixing the number with a + or - sign also pads
       the field to exactly that length: + right-aligns the value and - left-aligns it.

warc ls <files/dirs> [flags]
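
The field-length rule described under --fields can be illustrated with a small sketch. This is not the tool's implementation: format_field and its spec argument are hypothetical names, and the choice to cut overlong values from the right is an assumption.

```python
def format_field(value: str, spec: str = "") -> str:
    """Apply a length spec such as "", "10", "+10" or "-10" to one field value.

    Hypothetical helper: it only mirrors the rule stated in the help text.
    """
    if not spec:
        return value  # no number given: the value is printed unchanged
    width = int(spec.lstrip("+-"))
    clipped = value[:width]  # assumption: overlong values are cut from the right
    if spec[0] == "+":
        return clipped.rjust(width)  # '+': padded to exact length, right aligned
    if spec[0] == "-":
        return clipped.ljust(width)  # '-': padded to exact length, left aligned
    return clipped  # plain number: length is restricted, no padding


print(repr(format_field("response", "4")))
print(repr(format_field("200", "+5")))
print(repr(format_field("200", "-5")))
```

Under these assumptions a plain number only shortens a value, while a signed number always yields a column of exactly the requested width.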

Options

  -c, --concurrency int            number of input files to process simultaneously. The default value is 1.5 x <number of CPU cores> (default 1)
  -d, --delimiter string           use string instead of SPACE for field delimiter (default " ")
  -f, --fields string              which fields to include. See 'warc help ls' for a description
  -h, --help                       help for ls
      --id stringArray             filter by record ID. For more than one, repeat the flag or use a comma-separated list.
  -m, --mime-type strings          filter records with the given MIME types. For more than one, repeat the flag or use a comma-separated list.
  -o, --offset int                 record offset (default -1)
  -n, --record-count int           the maximum number of records to show
  -t, --record-type strings        filter by record type. For more than one, repeat the flag or use a comma-separated list.
                                   Legal values: warcinfo,request,response,metadata,revisit,resource,continuation,conversion
  -r, --recursive                  walk directories recursively
  -S, --response-code string       filter records with the given HTTP response codes. Format is 'from-to', where from is inclusive and to is exclusive.
                                   Examples:
                                   '200': only records with a 200 response
                                   '200-300': all records with a response code from 200 (inclusive) to 300 (exclusive)
                                   '-400': all response codes below 400
                                   '500-': all response codes from 500 and above
      --source-file-list string    a file containing a list of files to process, one file per line
      --source-filesystem string   the source filesystem to use for input files. Default is to use OS file system. Legal values:
                                     ftp://user/pass@host:port
                                     tar://path/to/archive.tar
                                     tgz://path/to/archive.tar.gz
                                   
      --strict                     strict parsing
      --suffixes strings           filter files by suffixes (default [.warc,.warc.gz])
  -s, --symlinks                   follow symlinks
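
The 'from-to' format accepted by --response-code can be read as a half-open interval. The sketch below is illustrative only and is not taken from the tool's source; parse_code_range and matches are hypothetical names.

```python
from typing import Optional, Tuple


def parse_code_range(spec: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (low, high): low is inclusive, high is exclusive, None means open-ended."""
    if "-" not in spec:
        code = int(spec)
        return code, code + 1  # a single code matches only itself
    low, high = spec.split("-", 1)
    return (int(low) if low else None, int(high) if high else None)


def matches(code: int, spec: str) -> bool:
    """Check whether an HTTP response code falls inside the range given by spec."""
    low, high = parse_code_range(spec)
    return (low is None or code >= low) and (high is None or code < high)


for spec in ("200", "200-300", "-400", "500-"):
    print(spec, [c for c in (200, 299, 300, 399, 404, 500) if matches(c, spec)])
```

Treating the upper bound as exclusive means '200-300' covers every 2xx code without also matching 300.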

Options inherited from parent commands

      --config string           config file. If not set, /etc/xdg/warc, ~/.config/warc and the current directory will be searched for a file named 'config.yaml'
      --log-console strings     the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
      --log-file strings        the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
  -L, --log-file-name string    a file to write log output. Empty for no log file
      --max-buffer-mem string   the maximum bytes of memory allowed for each buffer before overflowing to disk (default "1MB")
      --tmpdir string           directory to use for temporary files (default "/tmp")

SEE ALSO

  • warc - A tool for handling WARC files