This article is the first in a 3-part series on CLF and ELF web log data, and introduces these file formats. The next article covers IRI solutions for processing web log data, and the last demonstrates web log data masking to protect visitor identities and destinations.
The NCSA Common Log Format (CLF) is a standardized text file format used by web servers when generating server log files. The format is standardized so that analytic programs can more conveniently make use of the information contained within them, though other proprietary log formats exist.
CLF logs are in a fixed (non-customizable) ASCII format, and record basic information about user requests. For example, a CLF record might contain:
18.104.22.168 user-identifier sjones [10/Oct/2011:13:55:36 -0700] "GET /examp_alt.png HTTP/1.0" 200 10801
- A “-” in a field indicates missing data
- 18.104.22.168 is the IP address of the client (remote host) that made the request to the server
- user-identifier is the RFC 1413 identity (ident) of the client
- sjones is the userid of the person requesting the document
- [10/Oct/2011:13:55:36 -0700] is the date, time, and time zone when the server finished processing the request
- “GET /examp_alt.png HTTP/1.0” is the request line from the client: GET is the method, /examp_alt.png is the resource requested, and HTTP/1.0 is the HTTP protocol version
- 200 is the HTTP status code returned to the client
- 10801 is the size of the object returned to the client, measured in bytes
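The field layout described above can be sketched as a small parser. This is a minimal illustration rather than a reference implementation: the regular expression, the function name `parse_clf_line`, and the dictionary keys are all assumptions made for this example, and the sample line is the record shown above with user-identifier written as a single token.

```python
import re
from datetime import datetime

# One named group per CLF field, in the order the fields appear on the line.
# (Pattern and group names are illustrative, not part of any standard API.)
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    # A "-" in a field indicates missing data; represent it as None.
    record = {k: (None if v == "-" else v) for k, v in match.groupdict().items()}
    # Timestamp format: day/month/year:hour:minute:second zone.
    # Note: %b (month abbreviation) depends on the current locale.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    if record["size"] is not None:
        record["size"] = int(record["size"])
    # Split the request line into method, resource, and protocol when possible.
    parts = (record["request"] or "").split()
    if len(parts) == 3:
        record["method"], record["resource"], record["protocol"] = parts
    return record

sample = ('18.104.22.168 user-identifier sjones [10/Oct/2011:13:55:36 -0700] '
          '"GET /examp_alt.png HTTP/1.0" 200 10801')
print(parse_clf_line(sample))
```

Returning None for non-matching lines lets a caller count or skip malformed records instead of aborting on the first bad line.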
The W3C Extended Log Format (ELF) is a customizable ASCII format, with a variety of different fields, that web servers use when generating log files. ELF files provide more information and flexibility than CLF files.
With ELF, you can include the fields that matter to you and limit log file size by omitting unwanted ones. Fields are separated by spaces, and times are recorded in UTC (Greenwich Mean Time). For example, an ELF record might contain:
2010-05-02 15:42:15 - 126.96.36.199 188.8.131.52 80 GET /default.htm 200 - HTTP/1.0 Mozilla/4.0 (compatible: MSIE+5.5+Windows+2000+Server)
In this case, the format is:
date, time, c-ip, cs-username (-), s-ip, s-port, cs-method, cs-uri-stem, sc-status, cs(User-Agent)
Each line can contain either a directive or an entry. Entries consist of a sequence of fields relating to a single HTTP transaction. Fields are separated by spaces. A “-” in a field indicates missing data.
Directives record information about the logging process itself. Lines beginning with the # character are directives.
These directives are defined as follows:
- Version – version of the extended log file format used
- Fields – list of the fields recorded in the log
- Software – program that generated the log
- Start-Date – date/time when the log was started
- End-Date – date/time when the log was finished
- Date – date/time when the entry was added
- Remark – specific comment information (data recorded in this field should be ignored by analysis tools)
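To make the directive/entry distinction concrete, here is a minimal sketch of a reader that uses the Fields directive to name the values in each entry. The function name `parse_elf`, the sample log text, and the 192.0.2.10 address are made up for illustration. The sketch also assumes that no field value contains a space, so quoted strings with embedded spaces are not handled.

```python
def parse_elf(lines):
    """Split ELF lines into a dict of directives and a list of entry dicts."""
    directives = {}
    fields = []
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line, e.g. "#Fields: date time c-ip".
            name, _, value = line[1:].partition(":")
            name, value = name.strip(), value.strip()
            directives[name] = value
            if name == "Fields":
                fields = value.split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # entry does not match the declared fields; skip it
        # A "-" in a field indicates missing data; represent it as None.
        entries.append({f: (None if v == "-" else v)
                        for f, v in zip(fields, values)})
    return directives, entries

# Hypothetical sample log, invented for this example.
SAMPLE_LOG = """\
#Version: 1.0
#Date: 2010-05-02 15:42:15
#Fields: date time c-ip cs-username cs-method cs-uri-stem sc-status
2010-05-02 15:42:15 192.0.2.10 - GET /default.htm 200
"""

directives, entries = parse_elf(SAMPLE_LOG.splitlines())
print(directives["Fields"])
print(entries[0])
```

Because the field names are read from the log itself, the same function works for any field selection, which is the practical consequence of ELF being customizable.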
See the next article, on CLF and ELF web log data processing, which introduces IRI solutions for transforming, migrating, protecting, reporting from, and prototyping huge web logs.