warc validate
Validate warc files
warc validate <files/dirs> [flags]
Options
--calculate-hash string calculate a hash of each output file. The hash is made available to the close-output-file hook as WARC_HASH. Valid values: md5, sha1, sha256, sha512
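To cross-check a value reported in WARC_HASH, a digest can be recomputed independently. The following is a minimal Python sketch, not part of the warc tool: the function name file_digest and the chunk size are illustrative, and it assumes the hash is reported as a hexadecimal string, which this page does not state.

```python
import hashlib

# The four names accepted by --calculate-hash, per the option text.
VALID_ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def file_digest(stream, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a binary stream, read in fixed-size chunks."""
    if algorithm not in VALID_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm)
    # Read until the stream returns an empty bytes object (end of file).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

Opening an output file in binary mode and passing the handle to file_digest yields a value that can be compared with WARC_HASH, provided the hexadecimal assumption holds.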
--close-input-file-hook string a command to run after closing each input file. The command has access to data as environment variables.
  WARC_COMMAND contains the subcommand name
  WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
  WARC_FILE_NAME contains the file name of the input file
  WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
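A minimal sketch of a command that could be passed to --close-input-file-hook, written in Python. The function name and the report format are illustrative assumptions; only the WARC_* variable names come from the list above. Because WARC_ERROR_COUNT is described as set only when validation failed, the sketch treats a missing value as "no errors reported".

```python
#!/usr/bin/env python3
"""Hypothetical close-input-file hook: report on each processed file."""
import os
import sys


def describe_hook(env):
    """Build a one-line report from the WARC_* variables found in env."""
    command = env.get("WARC_COMMAND", "?")
    hook_type = env.get("WARC_HOOK_TYPE", "?")
    file_name = env.get("WARC_FILE_NAME", "?")
    # Documented as present only when validation failed, so a missing
    # (or empty) value is reported as "no errors reported".
    error_count = env.get("WARC_ERROR_COUNT")
    status = f"{error_count} error(s)" if error_count else "no errors reported"
    return f"[{command}/{hook_type}] {file_name}: {status}"


if __name__ == "__main__":
    # Write to stderr so the report does not mix with the tool's own output.
    print(describe_hook(os.environ), file=sys.stderr)
```

Saved as an executable file under a name of your choosing, the script could be given as the value of --close-input-file-hook. How the hook string is resolved to a command is not specified on this page.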
--close-output-file-hook string a command to run after closing each output file. The command has access to data as environment variables.
  WARC_COMMAND contains the subcommand name
  WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
  WARC_FILE_NAME contains the file name of the output file
  WARC_SIZE contains the size of the output file
  WARC_INFO_ID contains the ID of the output file's WARCInfo-record if created
  WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
  WARC_HASH contains the hash of the output file if computed
  WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
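As an illustration of how a close-output-file hook might use these variables, the Python sketch below emits one CSV row per output file. The column order and the choice of CSV are assumptions made for the example; variables that are not set (for instance WARC_HASH when --calculate-hash is not given) become empty fields.

```python
#!/usr/bin/env python3
"""Hypothetical close-output-file hook: print one CSV manifest row."""
import csv
import os
import sys

# Variables documented for the close-output-file hook, in manifest order.
FIELDS = [
    "WARC_FILE_NAME",
    "WARC_SRC_FILE_NAME",
    "WARC_SIZE",
    "WARC_INFO_ID",
    "WARC_HASH",
    "WARC_ERROR_COUNT",
]


def manifest_row(env):
    """Return the documented variables as a list, empty string if unset."""
    return [env.get(name, "") for name in FIELDS]


if __name__ == "__main__":
    csv.writer(sys.stdout).writerow(manifest_row(os.environ))
```

Note that with --concurrency greater than 1 several hooks may run at the same time, so a real hook that appends to a shared file would likely need some form of locking; that is left out of this sketch.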
-c, --concurrency int number of input files to process simultaneously. The default is 1.5 times the number of CPU cores (default 24)
-h, --help help for validate
-i, --index-dir string directory to store indexes (default "/home/johnh/.cache/warc")
-k, --keep-index true to keep index on disk so that the next run will continue where the previous run left off
-K, --new-index true to start from a fresh index, deleting any existing index from the previous run
--open-input-file-hook string a command to run before opening each input file. The command has access to data as environment variables.
  WARC_COMMAND contains the subcommand name
  WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
  WARC_FILE_NAME contains the file name of the input file
--open-output-file-hook string a command to run before opening each output file. The command has access to data as environment variables.
  WARC_COMMAND contains the subcommand name
  WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
  WARC_FILE_NAME contains the file name of the output file
  WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
-r, --recursive walk directories recursively
--source-file-list string a file containing a list of files to process, one file per line
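A file list in the one-file-per-line layout described for --source-file-list can be produced with a short script. The Python sketch below is one possible approach, not part of the warc tool: the function names are illustrative, the default suffixes mirror those documented for --suffixes, and the case-sensitive suffix match is an assumption.

```python
import os

# Default suffixes as documented for --suffixes.
DEFAULT_SUFFIXES = (".warc", ".warc.gz")


def collect_warc_paths(root, suffixes=DEFAULT_SUFFIXES):
    """Return a sorted list of files under root whose names end with a suffix."""
    matches = []
    for directory, _subdirs, file_names in os.walk(root):
        for name in file_names:
            if name.endswith(tuple(suffixes)):
                matches.append(os.path.join(directory, name))
    return sorted(matches)


def write_file_list(paths, list_path):
    """Write one path per line to list_path."""
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
```

The resulting file can then be given as the value of --source-file-list. Whether relative paths in the list are resolved against the current directory is not stated on this page, so absolute paths may be the safer choice.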
--source-filesystem string the source filesystem to use for input files. Defaults to the OS file system. Valid values:
  ftp://user/pass@host:port
  tar://path/to/archive.tar
  tgz://path/to/archive.tar.gz
--suffixes strings filter files by suffixes (default [.warc,.warc.gz])
-s, --symlinks follow symlinks
--warc-dir string output directory for validated warc files. If set, each input file is copied to this directory. The directory must already exist.
Options inherited from parent commands
--config string config file. If not set, /etc/xdg/warc, /home/johnh/.config/warc and the current directory will be searched for a file named 'config.yaml'
--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output to. Leave empty for no log file
--max-buffer-mem string the maximum bytes of memory allowed for each buffer before overflowing to disk (default "1MB")
--tmpdir string directory to use for temporary files (default "/tmp")
SEE ALSO
- warc - A tool for handling warc files