Field descriptions
Below you will find explanations of the most crucial fields. From the search result, you can also click “See all fields” to inspect other possible fields.
Field | Multivalued | Type | Description |
author | yes | string | From Author meta-data (typical for word, html, pdf, image etc.) |
description | no | string | From description meta-data (word, html, pdf, image etc.) |
content_language | no | string | Language of text (recognised by Tika) |
content_length | no | int | Content length of the payload from the server |
content_text_length | no | int | Content length of the extracted text |
content_type_norm | no | string | Content type determined by Tika. Possible values: html,image,pdf,audio,video,word,powerpoint,excel,text,other |
crawl_date | no | date | When the URL was crawled. Additional similar fields: crawl_year_month_day, crawl_year_month, crawl_year |
domain | no | string | Domain of the URL. Example: nb.dk |
host | no | string | Host of the URL, including any subdomain. Example: nettarkivet.nb.no |
id | no | string | The index identifier, unique for each indexed resource. |
image_size | no | long | The size of the image in pixels. There are also the similar fields image_height and image_width |
links_images | yes | string | Links from all image tags on an HTML page. |
links | yes | string | Links to other pages found in this HTML. |
links_norm | yes | string | Same as the links field except values are normalized |
public_suffix | no | string | The public suffix of the URL. Example: no, org, co.uk |
resourcename | no | string | Last part of the URL, after the final ‘/’, including query parameters. E.g. index.html or cats.jpg&size=100 |
server | yes | string | Value of the Server field in the HTTP header |
status_code | no | int | The HTTP status code from the HTTP header. 200=ok, 301=redirect, 403=forbidden |
source_file_path | no | string | Full path to the warc-file where the resource is from. The field source_file_offset gives the offset for the resource in that warc-file |
source_file | no | string | The filename of the WARC-file without the absolute file path. The value is case sensitive |
title | no | string | From title meta-data |
type | no | string | Almost the same as content_type_norm, but with more human-readable names and fewer values: Web Page, Image, Other, Document, Audio, Video, Presentation, Data |
url | no | string | The exact url seen from the harvest client that created the warc-file |
url_norm | no | string | A normalized version of the url field. It is lowercased and https is converted to http. It also maps various encodings to a single unique representation and removes some predefined parameter names such as session-id. This field is very important for playback in SolrWayback. |
warc_ip | no | string | IP-address of the server. Taken from the metadata field WARC-IP-Address in the warc-header. |
warc_key_id | yes | string | The unique identifier (URN) of the resource. Used for persistent referencing. |
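
To illustrate how the fields above can be combined in a search, the following is a minimal sketch that builds the parameters for a standard Solr select request. It assumes the index is queried through the usual Solr parameters (q, fq, fl, rows); the function name build_solr_params and the example values are hypothetical, and the base URL of the Solr server is installation-specific and therefore not shown.

```python
from urllib.parse import urlencode


def build_solr_params(domain, content_type, crawl_year, rows=10):
    """Build Solr select parameters from fields in the table (hypothetical helper)."""
    return {
        # Main query: restrict results to one domain (field: domain).
        "q": f'domain:"{domain}"',
        # Filter queries narrow the result set without affecting ranking.
        "fq": [
            f"content_type_norm:{content_type}",  # e.g. html, pdf, image
            f"crawl_year:{crawl_year}",           # companion field of crawl_date
            "status_code:200",                    # successful responses only
        ],
        # Return only the fields needed for a result listing.
        "fl": "id,url,title,crawl_date",
        "rows": rows,
    }


# Example values are assumptions chosen to match the table's examples.
params = build_solr_params("nb.dk", "pdf", 2019)

# doseq=True emits one fq parameter per filter in the list.
query_string = urlencode(params, doseq=True)
print(query_string)
```

The resulting query string can be appended to the select endpoint of the Solr core that holds the index. Filters are kept in fq rather than folded into q so that each restriction stays separate from the main query.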