NCSA Common and W3C Extended Log Formats (CLF and ELF, respectively) are two popular structures for logging of clickstream activity on web sites. The visitor information recorded may include the IP address, timestamp, page URL, entry and exit pages, and so on.
As these logs grow in size, they can take a long time to process. They may not be in formats that your applications recognize or readily support, and they may contain personally identifiable information (PII) that needs to be masked or encrypted for compliance or other reasons.
IRI has technologies for web-log-file format conversion, as well as web log data transformation, conversion, protection, and reporting for clickstream analytics and data webhouse operations.
IRI software can address data in CLF, ELF, and other structured (sequential) web log formats, and IRI has done additional work to make CLF and ELF file handling easier. For clickstream data in unstructured text file formats, IRI provides a data restructuring facility that extracts and buckets searched strings into structured data repositories, which can then feed the same activities described below.
The conversion-only solution lies in the free (lite) or affordable (database) edition of IRI NextForm. NextForm can convert CLF and ELF files into CSV and other flat-file formats for free.
The upgrade edition, which connects to databases, populates web log data directly into relational tables. NextForm also converts LDIF files to and from other formats such as CSV, XML, and text, and it supports data type conversion at the field level and the remapping of record layouts.
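To make the idea of CLF-to-CSV conversion concrete, here is a minimal Python sketch. It is not NextForm syntax and does not reflect how NextForm works internally; it only illustrates the record layout involved. The regular expression, the field names, and the `clf_to_csv` helper are assumptions chosen for this illustration.

```python
import csv
import io
import re

# NCSA Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)
# Illustrative column names (an assumption, not a NextForm layout).
FIELDS = ["host", "ident", "authuser", "timestamp", "request", "status", "size"]


def clf_to_csv(lines, out):
    """Write parseable CLF lines to `out` as CSV; return the number of skipped lines."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    skipped = 0
    for line in lines:
        match = CLF_PATTERN.match(line.strip())
        if match is None:
            skipped += 1  # malformed or non-CLF line
            continue
        writer.writerow(match.groupdict())
    return skipped


sample = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    'not a log line',
]
buffer = io.StringIO()
skipped = clf_to_csv(sample, buffer)
print(buffer.getvalue())
```

In practice the output would go to a file opened with `newline=""` rather than an in-memory buffer; the buffer is used here only to keep the sketch self-contained.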
Beyond format conversion, the CoSort SortCL program can process (filter, sort, join, aggregate, scrub, reformat, etc.), protect, and report from these huge log files into CSV, LDIF, XML, text, index, and other structured file formats and DB tables.
Using a simple 4GL to define the layout and manipulation of your log files, you can perform and combine data:
- transformation (scrub, sort, join, group, calculate, re-map, etc.)
- conversion (data types, record layouts, files)
- masking (field-level encryption, de-ID, etc.)
- reporting (custom detail, delta and summaries)
plus validation, find/replace, custom transforms, etc. in the same job script and I/O pass. With SortCL, you can map one or more sources in one or more formats to one or more detail or summary reports, and/or hand off filtered subsets to specialized clickstream analysis tools. See this blog article for an example.
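As a rough illustration of the filter-and-aggregate style of summary report described above, the following Python sketch counts successful requests per page. It is not SortCL's 4GL; the `top_pages` function and the dictionary-based record shape are hypothetical and stand in for whatever parsed log records you already have.

```python
from collections import Counter


def top_pages(records, limit=3):
    """Count successful (2xx) requests per URL path, most requested first."""
    hits = Counter()
    for rec in records:
        if not rec["status"].startswith("2"):
            continue  # filter: keep only successful responses
        parts = rec["request"].split()
        if len(parts) >= 2:
            hits[parts[1]] += 1  # the path is the second token of the request line
    # Sort by descending count, then by path for a stable order.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


records = [
    {"request": "GET /index.html HTTP/1.0", "status": "200"},
    {"request": "GET /about.html HTTP/1.0", "status": "200"},
    {"request": "GET /index.html HTTP/1.0", "status": "200"},
    {"request": "GET /missing HTTP/1.0", "status": "404"},
]
report = top_pages(records)
for path, count in report:
    print(f"{path}\t{count}")
```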
To protect PII in web log files, use SortCL or IRI FieldShield. FieldShield offers the compliance industry's broadest array of field-level security functions for data at rest. Mask, encrypt, pseudonymize, randomize, or otherwise obfuscate and de-identify email and IP addresses, and other items subject to data privacy law protection. See this blog article for an example.
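For readers who want a feel for what pseudonymizing an IP address in a log line can look like, here is a minimal Python sketch that uses a keyed hash (HMAC-SHA256). This is a generic technique, not FieldShield's implementation or any of its functions; the `pseudonymize` and `mask_host` helpers, the key, and the 16-character token length are all assumptions made for the example.

```python
import hashlib
import hmac


def pseudonymize(value, key, length=16):
    """Return a stable, keyed token for `value`; the same input and key give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_host(line, key):
    """Replace the leading host field of a CLF-style line with its token."""
    host, sep, rest = line.partition(" ")
    return pseudonymize(host, key) + sep + rest


secret = b"example-key-keep-out-of-source-control"  # hypothetical key for illustration
line = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
masked = mask_host(line, secret)
print(masked)
```

Because the token is deterministic for a given key, visits from the same address still group together in downstream analysis, while the original address is not exposed in the output.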
Do you need test data in log file formats? IRI RowGen supports the random generation and selection of data at the field level to help generate custom-formatted files. RowGen uses the same layout metadata as NextForm, CoSort (SortCL), and FieldShield, so you can easily move between test data generation and real data processing.
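The idea of generating field-level random test data in a log layout can be sketched in a few lines of Python. This is not RowGen and does not use its metadata; the `synthetic_clf_lines` function, the page list, and the value ranges are hypothetical, and the addresses come from the 192.0.2.0/24 range reserved for documentation.

```python
import random
from datetime import datetime, timedelta

PATHS = ["/index.html", "/products.html", "/contact.html"]  # hypothetical pages
STATUSES = [200, 200, 200, 304, 404]  # weighted toward successful responses


def synthetic_clf_lines(count, seed=0):
    """Generate `count` CLF-style lines; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    start = datetime(2000, 1, 1)
    lines = []
    for _ in range(count):
        host = f"192.0.2.{rng.randint(1, 254)}"
        when = start + timedelta(seconds=rng.randint(0, 86399))
        stamp = when.strftime("%d/%b/%Y:%H:%M:%S") + " +0000"
        path = rng.choice(PATHS)
        status = rng.choice(STATUSES)
        size = rng.randint(200, 5000)
        lines.append(f'{host} - - [{stamp}] "GET {path} HTTP/1.0" {status} {size}')
    return lines


lines = synthetic_clf_lines(5, seed=42)
for entry in lines:
    print(entry)
```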