warc validate
Validate WARC files
warc validate FILE/DIR ... [flags]
Options
--calculate-hash string calculate hash of output file. The hash is made available to the close output file hook as WARC_HASH. Valid values: md5, sha1, sha256, sha512
--close-input-file-hook string a command to run after closing each input file. The command has access to data as environment variables.
WARC_COMMAND contains the subcommand name
WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
WARC_FILE_NAME contains the file name of the input file
WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
--close-output-file-hook string a command to run after closing each output file. The command has access to data as environment variables.
WARC_COMMAND contains the subcommand name
WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
WARC_FILE_NAME contains the file name of the output file
WARC_SIZE contains the size of the output file
WARC_INFO_ID contains the ID of the output file's WARCInfo-record if created
WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
WARC_HASH contains the hash of the output file if computed
WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
-c, --concurrency int number of input files to process simultaneously. (default 6)
--ftp-pool-size int32 size of the ftp pool (default 1)
-h, --help help for validate
--id strings filter record ID's. For more than one, repeat flag or comma separated list.
--index-dir string directory to store indexes (default "/home/runner/.cache/validate")
-i, --input-file string input file (system). Default is to use OS file system.
Legal values:
/path/to/archive.( tar | tar.gz | tgz | zip | wacz )
ftp://user/pass@host:port
-k, --keep-index true to keep index on disk so that the next run will continue where the previous run left off
-m, --mime-type strings filter records with given mime-types. For more than one, repeat flag or comma separated list.
-K, --new-index true to start from a fresh index, deleting eventual index from last run
--open-input-file-hook string a command to run before opening each input file. The command has access to data as environment variables.
WARC_COMMAND contains the subcommand name
WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
WARC_FILE_NAME contains the file name of the input file
--open-output-file-hook string a command to run before opening each output file. The command has access to data as environment variables.
WARC_COMMAND contains the subcommand name
WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
WARC_FILE_NAME contains the file name of the output file
WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
-o, --output-dir string output directory for validated warc files. If not empty this enables copying of input file. Directory must exist.
-t, --record-type strings filter records by type. For more than one, repeat the flag or use a comma separated list.
Legal values:
warcinfo, request, response, metadata, revisit, resource, continuation and conversion
-r, --recursive walk directories recursively
-S, --response-code string filter records by http response code
Example:
200 - only records with a 200 response
200-300 - records with response codes between 200 (inclusive) and 300 (exclusive)
500- - response codes from 500 and above
-400 - all response codes below 400
--source-file-list string a file containing a list of files to process, one file per line
--suffixes strings filter files by suffix (default [.warc,.warc.gz])
-s, --symlinks follow symlinks
--tmpdir string directory to use for temporary files (default "/tmp")
Options inherited from parent commands
--config string config file. If not set, $XDG_CONFIG_DIRS, /etc/xdg/warc $XDG_CONFIG_HOME/warc and the current directory will be searched for a file named 'config.yaml'
-O, --log-file string log to file (default "-")
--log-format string log format. Valid values: text, json (default "text")
--log-level string log level. Valid values: debug, info, warn, error (default "info")
SEE ALSO
- warc - A tool for handling warc files