Selected Questions and Answers
Important note: The FAQs below are not a comprehensive resource, and only address a fraction of the available capabilities in IRI software or questions people ask.
Please visit the IRI solutions and products sections to learn more. Also, do not hesitate to contact us if you have any questions, or need details on specific features or options applicable to your use case(s).
The company is IRI, Inc. IRI stands for Innovative Routines International. IRI was founded in New York in 1978 as Information Resources, Inc. and changed its name during the relocation to Florida in 1995 (where the name Information Resources was already in use in Florida).
We are not affiliated with another IRI -- the Chicago firm best known for retail market research called Information Resources, Inc., but ironically, they are an IRI CoSort licensee of record.
CoSort is IRI's best known software product for data management and manipulation. The primary operational facility and user interface in the CoSort package is the Sort Control Language (SortCL) program.
SortCL refers to both the executable and the 4GL syntax for job scripts that contain both data definition and manipulation statements. The SortCL program has given rise to several IRI spin-off products that use the same syntax for data definition, but support a lower-cost, targeted functional subset of manipulation commands. Such SortCL-compatible, fit-for-purpose IRI products include: NextForm for data migration, FieldShield for sensitive data masking, and RowGen for test data generation.
Voracity is IRI's new, "total data management" platform that includes CoSort and all the SortCL spin-off product capabilities, as well as new data discovery (profiling) tools, visual job design features for DW/BI architects, DBAs and business users, plus: Hadoop engine options, cloud and big data connectors, and data quality and MDM wizards, and multiple analytic frameworks.
IRI Workbench is the free, common Eclipse GUI for designing, deploying, and managing all SortCL-compatbile software jobs as well as all the additional functions of Voracity above.
Until recently, IRI's primary focus and product line has served back-end systems that not many talk about. Doing the heavy lifting for other people's products and operations (including Accenture, CGI, Cincom, Dell, Epsilon, NTT Data, Sabre, and Sungard) has made IRI more of a silent partner than the total solution provider that the company's software stack proves it to be.
In actuality, IRI is also a prominent enterprise software vendor and well known to many large companies worldwide (like American Airlines, Bank of America, Comcast, Disney, EDS, and Fidelity Investments), as well as the consultancies that serve them. Both customers and analysts following IRI note that the company is not venture-backed and invests far more in R&D than it does in marketing, providing far more value to customers.
Gartner is following IRI software in their data integration, data masking, legacy migration, test data, and business intelligence research. IRI is also repeatedly recognized in the areas of big data (with and without Hadoop), data governance, analytics, quality, metadata management, master data management, and unstructured data.
Yes, and yes.
From IRI's support staff, its international representatives, and expert independent consultants. Call or email us before, during, or after your evaluation to get directed to the right resource. Because it is in IRI's best interests for your business and technical goals to be met through the use of IRI software, we will collaborate with you in the development and maintenance of successful solutions.
We will. So will more than 40 international support offices. And there are many third-party consultancies familiar with IRI software who can help you implement, optimize, and support applications around the tools.
More importantly, you will find it surprisingly easy to support yourself. One of the most compelling aspects of IRI software is its simplicity. All the data and job-related metadata are shared, exposed, self-documenting, and easy to modify and extend. There is also a familiar Eclipse GUI automating script creation, integration, execution, and management.
Because they are lower. See the answer to the question above about why IRI is not as well known as those competitors.
Being part of so many other applications has not required the marketing overhead of our competitors. IRI has also chosen not to be a public company, and has not been servicing external investors or debt. Our customers continue to benefit from these savings, and we from being overwhelmed with demand that would tax our quality or service.
Directly from IRI or through an approved IRI VAR outside the Untied States who can also service our software. Call or email us for a price quote -- or use this form -- and specify the the product(s) you want to license after successful evaluation. Prices are based on what and where you run (see the pricing FAQ). IRI accepts purchase orders and credit cards.
Yes, it is available. There is normally an upcharge to standard annual support to provide 24/7 support worldwide.
Maintenance is free in the first year after licensing. For those users covered under maintenance, minor new releases are provided free upon request or if needed for support. Major new releases are also optional, but usually chargeable upgrades. The cost of the upgrades depends on your maintenance level.
Because you are buying a perpetual-use license, once. Support is optional and you are not forced to upgrade (even though you should eventually). If you are using Voracity on a subscription basis, however, you can lock in the license and support cost for five years through a discounted up-front payment.
Per-copy prices, which include the SortCL program, typically range from the mid 4 to 5 figures (USD) and are based on hardware (e.g. RAM, model number) and the number of cores in use. CoSort is also available by subscription within the IRI Voracity platform which does not take machine sizing into account. Contact your IRI representative for a cost estimate for your environment.
In the perpetual use model, there are discounts for successive licenses ordered at the same time, limited or expiring usage, runtime integration, and GSA schedule procurements. Your company may also have an umbrella (site), or distribution (royalty) agreement in place with pre-negotiated discounts.
Specific and final price quotes are provided under NDA.
The price of IRI FieldShield is based only on the number of hostnames where static data masking jobs run; i.e., wherever the executable/engine is installed. The price does not vary by the number of cores, users, data sources, rows, functions, features, etc. The same considerations below apply to IRI Voracity (data management platform) licenses which include FieldShield, CellShield EE, and DarkShield.
Base FieldShield one-time license fees cover perpetual use, documentation, and the first year of technical support in standard use cases. There are discounts on multiple and runtime (see below) licenses, and FieldShield is also available by subscription in Voracity (see below).
At least one FieldShield license is required at each site. Most sites procure more than one masking engine (hostname license) at the same time, receiving a 40% discount on those additional licenses in the process. The number of licenses they take factors in both performance and convenience considerations.
In performance terms, the number of licenses may match the number of (major) database servers in use ... particularly where tables volumes are large. While it is not a one-for-one requirement, masking jobs run faster if the masking engines are installed/licensed on each large DB source or target system (to avoid both network and I/O contention). The best way to know your tolerable degradation point on a given install is through testing.
In convenience terms, it's very common to also license the engine for local use on one or more IRI Workbench client dev/test PCs, where masking jobs are first designed and debugged, as well as on disaster recovery/failover systems. Regardless of the number of licenses however, IRI Workbench is always free to distribute.
Again, additional perpetual use FieldShield licenses procured simultaneously (along with the first one at full price) are discounted 40%. Annual maintenance renewals are offered at 20% of the paid-up license fee(s), and include upgrades and limited license transfer and disaster recovery rights.
Alternatively, five FieldShield hostname licenses and support for them are included within the base tier, annual IRI Voracity subscription. Voracity also provides for DB subsetting, IRI RowGen for test data synthesis, CoSort-powered data integration and migration, data quality, analytics, and much more.
For dynamic data masking applications through the FieldShield 'Sandkey' API library (SDK) -- or for runtime decryption sites using the static FieldShield engine without IRI Workbench, documentation, or support -- there is more deeply discounted, volume-tiered, pricing. That is offered in a royalty covenant attached to a primary static data masking development license (which covers IRI support and the other features above). Dynamic data masking performed on the same systems/s) (hostname/s) on which the static data masking license is already installed does not require additional licensing, however.
IRI is also working in 2020 to provide a proxy-based dynamic data masking system for relational database applications that use a special JDBC driver. As this is partner technology, pricing is different, but discounted with the rest of FieldShield for data classification, search, and static data masking.
- IRI Workbench, the graphical IDE built on Eclipse, is currently free on Windows and Linux platforms. A MacOS version is also available on request and is free only to licensees of one of the products below.
- IRI Voracity is typically a subscription (one or five years) based only the total number of SortCL-executing hostname licenses, which are usually database or ETL servers. There are additional charges for Hadoop and other integrated (premium feature) options available through our partners. Perpetual use pricing is also available on request.
- IRI FACT (VLDB unload) and IRI RowGen (test data generation) are priced per hostname according to the number of licensed CPU cores (threads) licensed for use on each.
- IRI NextForm has multiple editions and prices. See below.
- IRI FieldShield (static data masking) is based on the number of executing hostname licenses only, with higher costs on Linux and outside the US through IRI VARs. Runtime API calls to FieldShield SDK (dynamic data masking) functions are at various percentages of the full package price depending on volume.
- IRI CellShield and IRI DarkShield pricing is based on the number of spreadsheets / documents that need protecting.
- IRI Data Masking as a Service (DMaaS) is performed at daily, hourly, or per-project rates.
Refer to the licensing information in each product section on this site for price ranges. Contact your IRI representative for more information and an NDA-confidential quotation for the use of any IRI software product, or for an IRI Professional Services engagement estimate.
It may be included in your license fee in the first year, and then become an annual option at a percentage of the license fee that reflects the desired level of support and upgrades.
Annual support is typically renewed at 20% of the license fee basis for the hardware for direct US customers only. This level entitles you to product updates and limited license transfer rights.
If support lapses, back support is owed and major new release may be required and offered at a discount.
24/7 support is an additional premium that only some sites request or require.
The software does not expire. IRI's default business model allows you to keep running forever, with or without support. So, you do not have to pay us all over again at the end of some multi-year term, unless you have specifically requested and are paying us to lease instead.
Voracity is a modern one-stop-shop for rapidly managing and leveraging enterprise data. It is also a standalone ETL and data life cycle management platform product that also packages, protects, and provisions all forms of "big data."
Voracity saves money on software, hardware and consulting resources, while expanding your enterprise information management (EIM) capabilities in support of digital business initiatives -- all from one Eclipse pane of glass.
All the features/functions listed below are supported in the IRI Voracity data management platform and constituent IRI Data Manager and IRI Data Protector suite products (excpect FACT and Chakra Max, respectively). Refer to the Components here.
GUI refers to the IRI Workbench Graphical User Interface. IRI Workbench is a free Integrated Development Environment (IDE), built on Eclipse,™ for integrating and transforming data with the SortCL program in Voracity, IRI CoSort, and all other IRI software.
DTP refers to the Data Tools Plugin (and Data Source Explorer) in the IRI Workbench. DDF refers to Data Definition Files, the metadata for source and target data layouts.
We understand that, and have been accelerating jobs in legacy ETL tools (especially Informatica and DataStage transforms) for years. To accelerate third-party ETL and BI/analytic tools, as well as DB operations, use IRI's scriptable, batchable transform engine(s) alongside -- and amplify the return on your investment in -- these platforms:
- ETL Tools
- ETI Solution
- IBM DataStage
- Informatica PowerCenter
- Microsoft SSIS
- Oracle Data Integrator
- Pentaho Data Integrator
- BI Tools
- Analytic Tools
- SQL Server
Run "SortCL" program jobs in the IRI CoSort package or IRI Voracity platform from your tool's command-line (shell) option to prepare big data faster, and populate the DB tables or file formats your tool can directly ingest.
Use the same high-performance data movement engines that Voracity can: IRI FACT for extraction, IRI CoSort (or Hadoop) for data transformation, IRI NextForm for data/DB migration and replication, IRI FieldShield for data masking, and/or IRI RowGen for generating test data.
Yes, now you can. Voracity is API-integrated with AnalytiX DS metadata hub technology so you can convert from legacy ETL products more or less automatically and affordably.
Contact your IRI or AnalytiX DS representative and ask about CatFX templates for Voracity from your current ETL tool, along with any LiteSpeed Conversion services you need to help port and test the more complex mappings.
Whether you are switching ETL platforms or just starting out in data integration, use Voracity to shrink time to deployment and information delivery.
Some of the things that Voracity offers that legacy and open source ETL (much less ELT) tools do not are:
- Built-in data profiling tools for flat-files, databases, and dark data (unstructured) document sources
- Raw power and scalability with or without Hadoop; i.e., built-in performance in volume, but also seamless support for Hadoop!
- A negligible learning curve: simple, explicit, accessible, and open text metadata you can easily use, modify and share
- The ability to deploy jobs outside the GUI, running them via command line, batch, or any program via system or API call
- An open source GUI you already know (Eclipse) that front-ends proven, robust manipulations and reports on big structured data
- Advanced aggregation functionality like lead/lag, ranking and running, multiplication and expressions
- Multiple nested layers for both conditions and derived fields with support for PCREs, fuzzy matching, C (math/trig) functions, locale and 'conversion specifiers', etc.
- Composite data value definition for both production data (format masking) migration and test data generation
- Built-in: data and DB profiling, migration, replication and administration
- 12 field protection (static data masking), DB subsetting, and synthetic referentially correct test data generation
- Data-centric change data capture, slowly changing dimension and detail and summary reporting, plus trend (predictive analytics), and web log (clickstream analytics) reporting
- Seamless metadata integration with Fast Extract (FACT) for major RDBs, plus Hadoop, AnalytiX DS and MIMB-embedding platforms
- Superior price-performance, fast ROI, and immediate access to US-developer support
Another way to consider the differences is by looking at what Voracity's does not require, and why:
With Voracity, there is no need for:
|separate transforms or transform stages||can combine filter, sort, join, aggregate, pivot, remap, custom and other transforms in the same job script and I/O pass, though it can represent and run them separately in separate task blocks|
|partitioning, manual or otherwise||automatically multi-threads and uses other system resources only your resource controls limit, and does not push transformations into the database layer where there are inherently less efficient|
|manual metadata definition||provides automatic metadata discovery and format conversion tools, and is supported by AnalytiX DS Mapping Manager and CATfx templates, as well as MITI's MIMB platform|
|separate BI (reporting) tools||can produce custom-formatted details and summary reports in the same job script and I/O pass with all the transforms, and/or hand off data to files, tables, or ODA streams in Eclipse for BIRT|
|separate data masking tools||includes every single function in FieldShield, the most robust data masking and encryption tool available.|
|separate test data tools||all the functions of RowGen, which can generate safe (no need for production data), intelligent (realistic and referentially correct) test data for DB, file, and report targets|
|long-term consulting||uses an already familiar Eclipse GUI and metadata defining both data and ETL processes|
|separate MDM hubs or data quality tools||has a wizard for MDM, plus support for: composite data type definitions, master data value lookups, joins, tables and set files suitable for production or test data|
|a new team sharing or version control paradigm||metadata repositories and job scripts work with any source code and metadata version control system, including AnalytiX DS and GIT, CVS or SVN in Eclipse|
|concerns about open source or support||is backed by IRI, a stable 38-year-old company with more than 40 international offices|
|a huge budget now, or a lease renewal headache later||is sold at affordable prices for perpetual or subscription use|
Yes, CoSort is a data transformation, and thus, ETL engine. It is not a traditional ETL package however, but the IRI Voracity that leverages CoSort for data transformation is. Voracity can use CoSort as well as Hadoop for transformations, and of course FACT for extraction and pre-sorted bulk loads into auto-config'd DB load utilities. Refer to:
When you use Voracity or CoSort, you benefit from high-performance (I/O-consolidated, multi-threaded) Transformations like:
In addition, CoSort -- and in particular, its Sort Control Language (SortCL) program -- can also handle:
- slowly changing dimensions
- fuzzy logic lookup tables
- pivoting (normalization and denormalization)
- running, ranking, and windowed aggregates
- bulk/batch change data capture
In Voracity, much of the above is exposed in graphical wizards, specification dialogs, workflow and transform mapping diagrams so you don't need to learn how to script those jobs. The jobs can be previewed and then run (scheduled) from the GUI, or on the command line (and thus in batch scripts and other applications, as well as third-party ETL tools that need a boost). Metadata can be change-tracked, shared, secured, and version-controlled in repositories like EGit on-premise or in the cloud.
Yes, both. As far back as 1999, industry experts have been touting CoSort as an ETL engine for its high-performance data staging and integration capabilities. CoSort - and it's SortCL program in particular - performs the heavy lifting of selection, transformation, reporting, and pre-load sorting against sequential files in an ODS, DW staging area, or on extracted tables in suspense.
CoSort's SortCL is a push-down optimization option for Informatica PowerCenter, and in the sequential file stage of IBM DataStage, to perform faster, combined (single-pass) sort, join, and aggregation operations. Click here to read the press release about CoSort's 6x improvement of Informatica speed.
Besides proven integrations with, and plug 'n play sort replacements for, DataStage and Informatica, CoSort also links to Kalido, ETI, SAS and TeraStream ETL packages. CoSort's SortCL programs can be called as an executable from any tool allowing that as well, which would also mean Ab Initio, Pentaho, JasperETL, Pervasive DataRush, Hummingbird, and others, to further consolidate and optimize data transformation performance via the file system.
Yes. Either through the DataStage sequential file stage or Before-Job Subroutine. With or without the larger Voracity ETL and data management platform subscription supporting CoSort, you would use CoSort as an external data transformation hub, combining large sort, join, aggregate, reformatting, protection, and cleansing functions in a single job script and I/O pass in the file system. Voracity adds the visual ETL design environment and Hadoop execution options around CoSort.
For more information, please see:
You can also leverage AnalytiX DS technology to automatically convert most ETL jobs currently in DataStage to Voracity:
Yes. With or without the larger IRI Voracity ETL and data management platform subscription supporting IRI CoSort, you can use CoSort as an external data transformation hub, combining large sort, join, aggregate, reformatting, protection, and cleansing functions in a single job script and I/O pass in the file system, and just call those jobs into existing Informatica job flows as a command line operation. Voracity adds the visual ETL design environment and Hadoop execution options around CoSort.
For more information, see:
You can also leverage AnalytiX DS technology to automatically convert most ETL jobs currently in PowerCenter to Voracity:
We have hundreds of tables and files, and thousands of fields already defined for processing in our ETL tool(s). How can we exploit CoSort / Voracity on our flat files or other sources and leverage existing field layouts (i.e. not re-define them manually)?
There are two third-party technologies designed for, and tested to be compatible with your existing tools and ours, which help:
1) AnalytiX DS Mapping Manager or LiteSpeed Conversion processes migrate legacy ETL jobs to equivalent Voracity ETL jobs. Voracity ETL jobs are powered by the CoSort SortCL program or interchangeable Hadoop engines mapped the same way. This includes both source and target data layout specifications, and most transform mappings. Manual translation and testing services fill in the gaps for complex mappings that do not auto-convert.
2) For only converting the data layouts in another ETl (BI, modeling, and DB) tool to SortCL layouts, you could also use the Meta Integration Model Bridge (MIMB) to convert Informatica .xml, DataStage .DSX and other third-party tool metadata to CoSort SortCL DDF.
Yes, to a degree.
Voracity's core data manipulation features -- transformation, mapping, masking, and embedded report formatting -- are all built into the SortCL program you can already script and run with your CoSort license to do those things. And, the GUI for CoSort (IRI Workbench, built on Eclipse), is the same GUI Voracity and all subset IRI products in the IRI Data Manager (CoSort, FACT, NextForm) and IRI Data Protector (FieldShield, CellShield EE, RowGen) suites use! With that GUI comes access to a lot of Voracity features through the toolbar menu, like data discovery, ETL flow diagrams, MDM wizards, FieldShield data masking and RowGen test data generation wizards. So you could run a lot of those jobs with your existing SortCL license.
As a CoSort licensee, however, you are only entitled to IRI support for the feature-functions documented in the CoSort manual (and in particular, the SortCL language reference chapter), and for GUI operations based on jobs created from wizards in the "CoSort" (stopwatch icon) and "IRI" menus in the IRI Workbench toolbar. That is, you would normally be confined to the materials for CoSort users in the Welcome section and support content for CoSort in the help menu.
Voracity subscribers, on the other hand, are entitled to support from IRI for all the feature-functions and menu items exposed in the IRI Workbench GUI, including data profiling, masking, test data, ETL, MDM, CDC, slowly changing dimensions, metadata management, and BIRT integration. Voracity users can also get IRI support on optional upgrades with Voracity-compatible software like AnalytIX DS Mapping Manager, IRI FACT, running SortCL jobs in Hadoop, predictive analytic reports using SortCL and BIRT, and the Paques self-service, shard-powered BI module.
For a complete list of the available features and support provided with all IRI software, please refer to this comparison page.
CoSort's SortCL program has built-in filtering and selection logic to reduce bulk, segment, and scrub data during or after processing.
For more advanced data cleansing and quality operations, SortCL allows you to plug in your own function libraries to perform custom transformations at the field level, before or after sort, join, merge, or report processing. SortCL ships with a sample template: a Melissa Data address standardization object that cleanses the address field as records are output.
For more information, see:
CoSort SortCL, IRI Voracity, and other IRI software programs (including FACT, NextForm, FieldShield, and RowGen) can all run from the command line and thus be scheduled into batch streams with cron, Stonebranch Universal Controller (UAC), Cisco TES, CA Autosys, ASCI ActiveBatch and similar applications. PoCs were done with Oracle DBMS_Scheduler and UAC and Full 360 metaController. See:
From within the IRI Workbench -- the free Eclipse GUI supporting all IRI software -- a Task Launch Scheduler is built in. See:
IRI Voracity provides analytic capabilities in five ways, with two more pending in 2019:
1) Embedded reporting and analysis - via CoSort SortCL programs that write custom detail, summary, and trend reports in 2D formats complete with cross-calculation, and other incorporated data transformation, remapping, masking and formatting features. The reports can be descriptive, or through more fuzzy logic and functions like standard deviation -- and Boost-driven statistical and BIRT-driven linear regression graphs -- predictive.
2) Integration with BIRT in Eclipse - where at reporting time, BIRT charts and graphs you design get populated with 'IRI Data Sources' via ODA support for Voracity/CoSort SortCL output. What's interesting about this in-memory transfer of SortCL data and metadata to BIRT that the data integration/preparation get run when the report is requested, saving time as well as resources by having data prepared outside the BI layer (with CoSort or Hadoop engines)
3) Data preparation (franchising) that accelerates time-to-visualization for 10 third-party BI and analytic vendor platforms. This section of the IRI blog site features benchmarks run when SortCL (available to Voracity or CoSort users) alone runs ahead of BOBJ, Cognos, Microstrategy, QlikView, Splunk, Spotfire, R, and Tableau.
5) Integration with a cloud dashboard from DWDigest for interactive business intelligence you can customize and view in any browser, including the internal web browser in the IRI Workbench GUI for Voracity, built on Eclipse.
6) Streaming anlalytics (pending) through JupiterOne, where Voracity can be a source of Kafka-fed data streams, or a target for further procressing of live sentiment analysis data.
7) Colocation and integration with the KNIME Analytic Platform in the IRI Workbench IDE for Voracity, built on Eclipse, to allow citizen data scients to punch above their weight with machine learning, artificial intelligence, neural networks, unstructured data, and other advanced data mining nodes and projects.
It can be. CoSort offers a range of solutions for generating meaningful reports from huge volumes of data. You can use CoSort's SortCL (4GL) program as a standalone report generator, or as a staging tool to digest and hand-off large volumes of data.
Not only does SortCL transform and protect huge volumes of disparate data from a variety of RDBMS sources, and sequential or index files, plus web and various device logs (including ASN.1 TAP3 CDRs). It can also join, aggregate, calculate, and display that data in custom detail and summary report formats, complete with special variables and tags for web pages.
By creating output in .CSV and .XML formats, CoSort's SortCL tool can directly populate spreadsheets like Excel, databases, ETL tools, and BI tools. See the next question in this section, or:
Any and all tools that can import CSV and flat XML files, or RDBMS tables ...
IRI software's job in these cases is to prepare or "wrangle" big, structured data in a centralized place. CoSort or Voracity users create Sort Control Language (SortCL) programs in 4GL text scripts -- or through wizards in the free IRI Workbench GUI, built on Eclipse™ -- that transform (filter, sort, join, aggregate, mask/encrypt, pivot, pre-calc, etc.) data in more than 125 sources prior to the hand-off.
SortCL programs or Voracity targets can directly populate spreadsheets like Excel, databases (relational and NoSQL), ETL tools, and BI tools, including:
- SAP Business Objects
- IBM Cognos
- DI-Diver from Dimensional Insight
- DWDigiest from NextCoder
- iDashboards from iViz
- Oracle DV / OBIEE
as well as newer analytic platforms, such as:
- Power BI
In addition, ODA driver support in the IRI Workbench provides seamless data and metadata flows between CoSort/SortCL data preparation and BIRT presentation. There is also a Splunk Add-on for Voracity, plus support for the DWDigest cloud dashboard app from NextCoder, and upcoming Voracity feeds through Kafka to streaming analytic platforms like JupiterOne and integration with the KNIME Analytic Platform in the same pane of glass.
For more information, please see:
By applying the same masking function to the same plaintext each time, automatically and globally. This is done through rules associated with pattern-matched column names, or more reliably, through integrated data classes tied to identified data. Classified data is discovered / validated through robust built-in value search methods like RegEx pattern matches with custom-set accuracy thresholds, lookup value matches, fuzzy-match algorithms, named entity and facial recognition models, and JSON/XML path filters.
Right, as there are more use cases and invocation methods than we can currently publish. However there are a couple demo videos linked on our self-learning page in the unstructured sources section (scroll down). And, and we can demonstrate specific solutions via https://www.iri.com/products/live-demo.
Yes, per an external command line interface (CLI) call described with examples in this document. A REST API call is also in development for DarkShield.
- Expression (calculation/function) logic
- Substring and byte shifting
- Data type conversion
- Custom function
The decision criteria for which protection function to use for each datum are:
- Security - how strong (uncrackable) must the function be
- Reversability - whether that which was concealed must later be revealed
- Performance - how much computational overhead is associated with the algorithm
- Appearance - whether the data must retain its original format after being protected
IRI is happy to help you assess which functions best apply to your data.
Note also that you can protect one or more fields with the same or different functions, or protect one or more records entirely ("wholerec"). In each case, the condition criteria and targets/layout parameters can also be customized, and combined with data transformation and reporting in the same job. And, in fit-for-purpose multi-table wizards, or through global data classification, DBAs and data stewards can apply these protections as rules to preserve consistency and referential integrity database or enterprise wide.
Both! FieldShield and CoSort's SortCL program give you the ability to protect both kinds of data sources simultaneously with one or more field-level security functions. These IRI products can address either source type in bulk (static data masking) or surgically (dynamic data masking) through filter command or customized stored procedure calls.
Of course, some databases have built-in column encryption. But their approach may be cumbersome or limiting for a variety of reasons, such as:
- You need to protect multiple databases other sources or data in motion, like flat files and a single platform or method will not address, or be compatible with, enterprise needs
- Built-in DB encryption libraries may also be too slow, costly, or complex to implement
- They are limited to a single encryption methodology that may or not conform to security or appearance requirements
- You may also need to leave your data as-is while it's in the database, but protect it while it's moving into or out of the database. That's where flat files come in. Data is often in a flat file format as it goes in or out of your databases.
Other encryption products protect an entire file, database, disk, computer, or network to protect sensitive data moving through your systems. However, encrypting more than the fields that matter can take a long time, and cut off your access to the non-sensitive data that still needs to be accessed and processed. FieldShield and CoSort's SortCL can encrypt (or otherwise protect) only those fields/columns that need it, and can do it in the same job scripts and I/O passes with big data transformation, migration, and reporting.
After you find and classify PII in the free IRI Workbench IDE (built on Eclipse) for FieldShield, Voracity, etc., you can declare ad hoc and/or rule-assign specific field-level protection functions in FieldShield or other SortCL-compatible jobs. These static data masking functions provide field-protected views of sensitive data (like social security or phone numbers, salaries, medical codes) in ODBC-connected database tables and sequential files, through some 14 different categories of techniques, which include:
- field filtering (removal)
- string manipulation or masking (redaction)
- quasi-identifier generalization
- secure encryption and decryption
- reversible and non-reversible pseudonymization
- ASCII de/re-ID, encoding, and big shifting
and it is through the consistent application of these functions (on the basis of data classes or pattern-matched column name rules) that preserves referential integrity. Other ways to protect data and preserve referential integrity in Voracity in addition to persistent data masking is through the built-in database subsetting wizard (where masking functions can also be applied) or through structurally and referentially correct synthetic test data generation through the (IRI RowGen) DB test data job wizard. See the links here.
Yes. You can also use FieldShield to remove (redact, omit, or delete), or randomize (random generation or value selection to replace) PII at the column or row level as options, instead of obfuscating it (via encryption or blurring, for example). The target of that job can be new tables with the same structure in another schema you can create, build, and load from IRI Workbench, which is as much a cross-platform DB administration environment as much as it is the IRI tooling environment.
Yes, by declaring the target to be the source. We recommend doing that only after testing the output (e.g., via a small test file or stdout), to make sure it's what you want in terms of format/appearance and functionality (e.g., reversibility via decryption) if you have no backup.
Does your provided pseudonym option guarantee consistency; e.g., that all Peters will become Pauls? Or do we need to specify your option to "Use your own pseudonym list” and create a special crosswalk between the names in our database and fake names, and “Use original field as a look-up into pseudonym list”?
Yes, simultaneously. In fact, the IRI CoSort product (via its SortCL program) or the IRI Voracity (big) data management platform (via SortCL or interchangeble Hadoop engines) can enforce field-level security in the course of data integration, data quality, and reporting jobs. In other words, you can in the same product, program, and I/O pass: mask/redact, encrypt, psedonymize or otherwise de-identify PII values while transforming, cleansing, and otherwise remapping and reformatting the data from heterogeneous data sources.
Legacy ETL and BI tools cannot do this as efficiently or as affordably. In fact in Voracity -- which supports and consolidates data discovery, integration, migration, governance, and analytics -- you can process (integrate, cleanse, etc.), protect (mask), and present (report/analyze) or prepare (blend/munge/wrangle) data all at once.
Alternatively, you can run IRI data masking programs on static data sources (or call our API functions dynamically) just to protect certain fields that your existing platform will then transform or visualize. In this way, you can:
- use your existing code
- protect only the fields needing security
- keep both protected and unprotected data available to the routines and systems that need to access it.
There are several, but start with the latest described in this article:
It depends on the version of Excel (not your O/S) that is running.
How can I tell whether I'm running a 32-bit or 64-bit version of Microsoft Office?
No data is transmitted from the CellShield plugin. And no connection is made or needed by CellShield or any other IRI PII discovery or static data masking product.
FieldShield, CellShield, and DarkShield (as well as CoSort) and thus Voracity, ship with multiple 128 and 256-bit encryption libraries using proven, compliant 3DES, AES, GPG and OpenSSL algorithms. For each PII item or sub-string, you can use the same or different built-in encryption routine, or link to your own encryption library and specify it as a custom, field-level transformation function in a job script. You can also use the same algorithm(s) and a different encryption key for each field as well.
Encryption key management is supported through passphrases in job script, secure files, and environment variables, as well as in third-party vaults like Azure Key Vault and Townsend Alliance Key Manager.
All FieldShield and CoSort/SortCL job scripts and field-level functions can be recorded in XML audit logs that you can secure, and query with your preferred XML reporting tool. You can also use SortCL scripts (n.b. samples are provided, where /INFILE=$path/auditlog.xml /PROCESS=XML, etc.) against these audit logs for reporting.
Yes, to Splunk for example, in multiple ways. See articles like these:
- Revealing Data Profiling Secrets in Splunk
- Automatically Forward Target or Log Data into Splunk
- Shedding Light on Dark Data with Splunk ES
Yes. In the context of IRI Voracity, CoSort, FACT, NextForm, or IRI FieldShield and RowGen operations, data and metadata can be role-segregated via:
- DBA-defined source/target data accesses (configuration details for which are stored in manageable DSN files or Workbench Data Connection Registry settings subject to workspace access permissions controlled at the file (or O/S log-in) level
- Active Directory or LDAP (O/S)- and/or repository (e.g., Git)-controlled access to data files, workspaces, projects, data and job metadata, executables, logs, etc.
- Script-integrated IAM (and ultimately AD-compatible) policies exposed in IRI Workbench data classes for script, source, and field-level asset access, and/or execution/modification-specific authority (pending)
- project-level controls on IRI assets in the metadata/API-integrated AnalytiX DS data governance (now called erwin EDGE) platform (premium option).
Yes, Multiple roles/permissions for IRI FieldShield or Voracity (etc.) metadata assets (ddf, job scripts, flows) etc. can be assigned through system administators assigning policy-driven ActiveDirectory or LDAP objects to those assets. Other options include the AnalytiX DS data governance platform (premium option) or Eclipse code control hubs like Git; see http://www.iri.com/blog/iri/iri-workbench/introduction-metadata-management-hub/ for an example.
The IRI Chakra Max (DB firewall) manager program supports the designation and deletion of Administrator, Viewer, and Manager roles. The highest level administrator can also delegate security policy approval permissions to other administrators. Such role permissions are set at very fine-grained individual or group levels controlling DB log-in, execution authority, data access, and audit log query/report/restore privileges.
Yes, In the context of IRI Voracity (for profiling, ETL, DQ, MDM, BI, etc.), CoSort (for data transformation), FieldShield (for PII discovery and masking) , NextForm (for data migration, remappping and replication), and RowGen (for TDM and subsetting), access to specific data source (and targets, down to the column level) can be controlled through DBA-granted or file-level permissions (managed in DSN files and the IRI Workbench data connection registry), as well as through field-level revelation authorizations controlled in (securable) job scripts and decryption keys.
In the context of IRI Chakra Max (for DB DAM/DAP and DDM), individual and group roles can be matched to granually-defined security policies that will determine the right to connect to specific DB instances, execute certain SQL statements, have columns dynamically masked (or not), etc.
Yes, in multiple ways, including access permissions to job metadata, data sources (and targets), and decryption keys, and via differential data class function/rule assignments. See the other questions in this section and contact firstname.lastname@example.org or email@example.com for assistance setting up your controls.
Does Voracity (FieldShield, or other IRI software) independently check who the end-user is attempting to access protected data, or does it rely on the underlying database or application access controls?
Both, inasmuch as data, metadata, and/or job script access -- plus execution permission -- are associated with ActiveDirectory object-defined, DBA login, and/or file-system controls imposed on a policy basis for authentication purposes.
IRI also offers optional DAM/DAP technology (Chakra Max), a market-leading database firewall system that also dynamically masks queried data pursuant to specific policies affecting individual users, user roles, or user groups.
Calling applications can also impose additional layers of user authentication to optionally fill a gap or protect additional data touchpoints.
Through client computer or ActiveDirectory/LDAP imposed access controls and file-level permissions. Beyond that, either the erwin (AnalytiX DS) governance platform or any Eclipse-compatible SCCS like Git for metadata assets -- where permissions by role are configurable -- can lock down specific projects, jobs, and other metadata assets.
CoSort is a robust, commercial-grade software package for efficiently manipulating and managing high volumes of data. More specifically, it is a sorting, data transformation, migration and reporting package that addresses a very wide range of enterprise and development-related challenges in data integration, data masking, business intelligence, and ancillary disciplines. Please review the product description and solution sections of this web site for details.
For the purpose of this simple sort/merge question, CoSort stands for Co-routine Sort, first released for commercial use
CoSort exploits parallel processing, advanced memory management and I/O techniques, as well as task consolidation and superior algorithms, to optimize data movement and manipulation performance in existing file systems. No paradigm shifts to database engines, NoSQL, Hadoop, or appliances are necessary. Maybe a little more RAM, but that's usually enough ...
Very. Performance varies by source sizes and formats, data and job orientation, hardware configuration and resource allocation, concurrent activity and application tuning. The best benchmarks (e.g. 1GB in 12 seconds, 50GB in 2 minutes) run in memory on fast, multiple CPU Unix servers.
When you perceive a bottleneck, which could be starting at 500K to 50M rows depending on your hardware. CoSort sorts routinely in the terabyte range -- and scales linearly in volume without Hadoop. Input files in the dozens or hundreds of gigabytes are now common. Any number of input and output files -- and structured file formats -- are simultaneously supported, including line, record and variable sequential, blocked, CSV, I-SAM, LDIF, XML (flat), and Vision. For a list of supported data sources, click here.
CoSort is now also the default engine in the IRI Voracity ETL, BI/analytic data preparation, and data management platform which can also run many CoSort data transformation and masking jobs (scripted in SortCL 4GL syntax and/or represented graphically in the IRI Workbench GUI) seamlessly in Hadoop. Thus the question may be at what volume point would I need to run such jobs via a Hadoop engine instead. See our article, When to Use Hadoop.
Usually through a CoSort Resource Control (cosortrc) text file, which can be global, user, and/or job-specific. On Windows, default registry settings also set up at installation time and can be overridden by an rc file. You can specify a ceiling and floor on CPU/core threads and memory, I/O buffers, and allocate/compress disk space for sort overflow. There are several other documented job controls also specified at setup, and easily modified (or secured) later.
Typically the single most important factor in file sort performance is the speed of the I/O channels. For small files, which may be sorted entirely in memory, this means optimizing the reading of the source files, and the writing of the target files. For large files, which will exceed the available memory, The throughput to the work area files will also be critical. Many times, the source and target file locations are fixed, and little can be done to improve their I/O performance. However, the local work areas where the temporary merge files are stored can often be optimized. For example, having more than one fast SSD type local drives, on separate controllers, can make overflow sorts almost as fast as in-memory sort jobs.
The location of the overflow files is specified by the WORK_AREAS tuning settings. Multiple locations can be specified, but there is no advantage to using multiple locations unless they are on separate devices and I/O channels. At a minimum, there should be at least one fast local drive that is not shared with either the source or target files.
After addressing the performance of the I/O channels, the next most important tuning settings are the ones related to memory usage. Fortunately, CoSort 10 has some new and powerful techniques to self-tume the memory allocation. We recommend starting with the CoSort 10 default settings for memory settings. This is best accomplished by having only the single setting of:
in the $COSORT_HOME/etc/cosort.rc file. Other memory related settings, like BLOCKSIZE and COMPRESS_WORKFILES will then be controlled by the intelligent algorithms in CoSort 10. Here is an example tuning file using the recommended settings. Of course you will want to set THREAD_MAX to a value up to the number of CPU cores that your license allows. In this example, the two work areas would be on local solid-state drives on two separate dedicated controllers.
THREAD_MAX 6 # Maximum number of sort threads
THREAD_MIN 1 # Minimum number of sort threads
MEMORY_MAX AUTO # Internal sort memory ceiling
WORK_AREAS /mnt/ssd1/work # Overflow (temp) file paths
WORK_AREAS /mnt/ssd2/work # Overflow (temp) file paths
MONITOR_LEVEL 0 # Runtime monitor level
MINIMUM_YEAR 70 # Century window
ON_EMPTY_INPUT PROCESS_WITH_ZEROS # Output file option
OUTPUT_TERMINATOR INFILE # Output terminator
Please email to firstname.lastname@example.org with any additional questions or concerns.
An external sort is defined as a sort operation which is too big to fit into memory, and must use temporary work files (LWF). In this case, there will be one physical work file created for each thread and work area combination. The amount of memory used while READING the input will be approximated by the formula:
THREAD_MAX * AIO_BUFFERS * BLOCKSIZE * number of WORKAREAS
However, during the merge phase, when WRITING data, much more memory is required, because each physical file can contain multiple logical work files (LWF). As the volume of data increases, the amount of merge memory required will increase. If there is not enough memory available from the MEMORY_MAX setting, and insufficient merge memory error (error 2) will be raised. The remedy is to increase MEMORY_MAX, or decrease any of the other settings involved in the memory usage calculation.
CoSort will begin using multiple threads when the sort (input file/table) volume is at least two times larger than the BLOCKSIZE specified in your CoSort Resource Control (cosortrc) file, which is typically 1 to 2 MB in an auto-tuned job.
CoSort does not make any distinction between physical CPU chips, CPU cores, or hyper-threading. We do not attempt to micromanage which device a thread is created on. The operating system is in the best position to determine where to schedule new threads. CoSort just creates the sort threads, up to the maximum number specified in the tuning file, which cannot exceed the number of threads supported by the license. Cores are usually the best indicator of what's possible to expect in terms of peak performance before the point of diminishing return, subject to resource contention and Amdahl's Law.
Note also that each CoSort sortcl process is independent of all others. There is no inter-process communication or synchronization taking place, so in a concurrent multi-job environment, specifying only 1 or 2 max threads will likely be more efficient, even when each job running simultaneously may be high volume. Memory, however, is self tuning when on MEMORY_MAX is set to AUTO.
For testing different numbers of threads, we recommend that you request a license key through the normal CoSort installation and registration process (as instructed in the installation guide) for a temporary period, wherein you specify the total number of physical cores on the host machine. Then once you get the license keys to allow that number (up to 64), run jobs with different THREAD_MAX values (from 1 to the max) in your cosortrc file. You can also experiment with memory and overflow-related settings to see what works best.
Please advise email@example.com if you need more details, and consult Section D of the Appendix chapter of the CoSort manual for technical specifics, and a way to automate the benchmarking process. Remember that your runtime results are logged so you can analyze performance off line later.
There is no way to specify which CPU core threads will use. This is entirely up to the operating system. If your operating systems supports it, you can try using system tuning to assign sortcl processes to certain chips or cores.
CoSort has no inter-process communication; each instance stands alone. So if MEMORY_MAX is set to 10%, and you run two simultaneous jobs, you would be using 20% of the system memory.
Yes. A higher BLOCKSIZE will help read/write performance if the source and work file locations are on the same device, otherwise, a lower BLOCKSIZE is better. Test with various block sizes from 100K up to 16M. Just remember that on large, external sorts, insufficient merge memory may result from using too big of a BLOCKSIZE.
The AIO_BUFFERS parameter is the number of buffers used for reading and writing. CoSort uses overlapped I/O, which means that while one buffer is being processed, another is being read or written. A slight increase in speed is possible by increasing the number of buffers, but will come at the expense of needing more merge memory. Use any value greater than the default with caution. Test first with the largest input size that will occur. Check to see if the performance gain is measurable.
CoSort users can display runtime information before, during, and after execution through:
• optional on-screen display levels
• self-appending and replacing log files
• application-specific statistical files
• and, a full audit trail for various compliance and forensic requirements
More than 120 now, and counting. These includes single and multi-byte character sets, Unicode, C, COBOL, and mainframe numerics. Contact IRI to help obtain a definition if you are not sure what you have. Moreover, CoSort supports the (simultaneous) collation, conversion, and creation of more than two dozen file formats.
I am using the translation tool mvs2scl and I have mvs scripts that are similar to this:
OMIT COND=((63,1,CH,NE,C’ ‘))
Your mvs2scl utility produces this output from the conversion:
/FIELD=(field_1, POSITION=1, SIZE=20)
/FIELD=(field_2, POSITION=40, SIZE=3)
/FIELD=(field_0, POSITION=63, SIZE=1, EBCDIC)
/CONDITION=(cond_0, TEST=(field_0 != ” “))
/FIELD=(field_1, POSITION=1, SIZE=20)
/FIELD=(field_2, POSITION=21, SIZE=3)
This is fine, except that I do not want the lines that have $SORTIN or $SORTOUT.
Q. How can I suppress these lines?
A. There are no specific mvs2scl options that will do this. But there are options with the grep command that can be used while executing the translation. Here is what you should execute on the command line.
mvs2scl job1.mvs | grep -v ‘$SORT’ > job1.scl
job1.mvs is the mvs script that is to be translated. job1.scl is the translated script for sortcl without the $SORTIN or $SORTOUT lines. The -v option with grep says to only output lines that do not contain the expression $SORT.
SAS documents the CoSort option in the v7 and 8 system for Unix, which is reflected in the SAS and CoSort user manuals. The use of CoSort has been proven to dramatically and affordably accelerate native SAS PROC performance off the mainframe! In SAS, CoSort performance is tuned automatically, or through a resource control file that is configured at set up time by the system administrator. This text file can be modified at any time for reference at a global, user or job level.
For SAS 9 and later however, please contact SAS, as they yet updated their 'sort appendage' for CoSort. IRI has made repeated requests to, and offered to help, SAS update their CoSort connection and give their users a better option. But SAS told us that you must make your CoSort request known directly to them. When you do, please notify us through this form at the same time so that we can follow up with SAS for you, too. Thank you!
They differ, and are based on collaboration and feedback with partners and customers who own their data and job definition metadata.
Installation of IRI products is typically documented in their user manuals, or in the case of CoSort, in version-specific installation guides available from your representative.
IRI Workbench is the Eclipse front-end (graphical job design client) piece, which runs on Windows, Linux, and MacOS (Sierra). The back-end SortCL program in the CoSort package/Voracity platform -- which supports FieldShield data masking and many other data management functions -- runs on all the above, as well as all flavors of Unix including AIX, Solaris, HP-UX, and z/i/pSeries Linux.
The requirements for Windows and Linux are similar as far as hardware is concerned. An absolute minimum configuration for Workbench would be 2GB of RAM, and 1GB of free disk space. Remember that Workbench needs a 1.8 JRE. Workbench and CoSort are tested and supported back to XP on Windows. We also test with various major Linux distributions of both Debian and Red Hat package management standards.
We recommend where possible co-location of the licensed back-end on the database source or target server for performance reasons, particularly if there are known network bottlenecks.
If you are going to also run CoSort on the same PC as Workbench, then you will have to allow extra capacity to run jobs. The requirements for CoSort increase with the size of the data that you intend to process at any one time. A recommended hardware platform for a PC running Workbench and CoSort would be 8GB of RAM, and 10GB available disk space, plus additional disk space for temporary files equal to 1.5 times the largest data set to be processed. A general guiding principal for hardware is that the more RAM, the better the performance.
Many 'big data' CoSort and Voracity users license and cosortrc-tune the product on very large multi-core Unix systems to leverage hundreds of GB of RAM to out-perform Hadoop for example. If you run the Hadoop edition of Voracity, load balancing shoud be automatic.
Yes. This is no different than installing CoSort on a Unix server for multiple users. Note also that installing Cygwin and allowing users ssh access would work well too. The problem with this approach is the access to the input/output data streams — which would need to be network resources — possibly slowing runtimes considerably.
In Windows 10, make sure you have edit (not just write) permission for the C:\IRI\CoSort100\etc\cosort.lic file. Without that, the setup program cannot update its 3-string license key value with what the keys you tried to input through the command line or IRI Workbench setup and registration program (Update License option). You may even have to insert those instead by hand, replacing the default 0.0.0 values.
For Windows 2003, first, double check that the license keys you entered were provided for the current (and correct) private key in your RegForm.txt file. If so, the registry was not properly updated. To rectify this, open the main directory where you installed CoSort (C:Program FilesIRICoSort82) and double click on global_config.reg and license.reg to re-set these Windows registry settings. To verify that they have been set-up correctly in the registry, click Start, Run, and enter the command REGEDIT. Click OK and to open the registry and navigate to:
HKEY_LOCAL_MACHINES->Software->Innovative Routines Intl -> CoSORT82
Check for these 2 folders: Global Configuration and License. If the folders are there, please run the C:Progra~1IRICoSORT82/setup program again re-enter your license keys.
Check the LOG file entry in $COSORT_HOME/etc/cosortrc. If you use v8.2.1, make sure that it does not have a %s in the path. You (and all CoSort users) must have read and write permissions for the path/LOG file (default location should be $COSORT_HOME/etc/cosort.log) and the sort overflow (WORK_AREA) directories.
Apart from the many data warehouse architects who use CoSort's SortCL tool and find its flat-file approach faster than SQL procedures and ETL tool steps, many experts also acknowledge the efficiency of flat-files in high volume data staging:
"The first system for which the data warehouse is responsible is the data staging area, where production data from many sources is brought in, cleansed, conformed, combined, and ultimately delivered to the data warehouse presentation systems ... The two dominant data structures in the data staging area are the flat file and the entity/relationship schema, which are directly extracted or derived from the production systems."
from "The Foundations for Modern Data Warehousing" by Ralph Kimball, appearing in Intelligent Enterprise Magazine's Data Warehouse Designer Section.
By using flat files, the SortCL program -- which is the primary utility in the IRI CoSort product and default engine of the IRI Voracity data management (and integration) platform -- bypasses the normal overhead of DB connectivity and transformations. SortCL even flattens FDB tables connected through ODBC and the data in semi-structured sources like MongoDB and JSON files. This helps move the engine through rows of data sequentially and consistently, while simultaneously exploiting the asynchronous I/O, advanced memeory management, and multiple-threads available from the O/S; i.e., it doesn't use Java programs, either.
The flat structure it's processing also allows the tool to rapidly combine data transformation, reporting, protection and prototyping functions in the same job and I/O pass. Thus with SortCL, you can simultaneously filter, cleanse, standardize, transform, mask, and report on massive collections of data. Compare this kind of task consolidation potential with how you get work done now, particuarly against multi-step ETL tools which are not only inefficient by virtue of the multiple steps they have to configure, but also run, one memory-constrained, separate I/O step at a time, and sometimes also requiring partitioning with SortCL never would.
Put another way, because of SortCL, if you use CoSort or Voracity or other SortCL-compatible products from IRI (like NextForm, FieldShield or RowGen), you can move through your data sources rapidly, and get a lot more done. So it's "one product, one place, one pass" vs. the complexity, cost and time it takes competing alternatives (multiple products, places, passes and prices).
Through it's Sort Control Language (SortCL) program, users of the IRI CoSort product or IRI Voracity platform can leverage the resources of their existing file systems to perform these kinds of jobs without the overhead and administrative constraints of databases and SQL procedures -- not to mention the cost of megavendor ETL tools and ELT appliances, in-memory DBs, or complex Apache projects.
Through SortCL, you can perform and combine many of these activities simultaneously against multiple data sources of any size:
- Data Transformation > select, sort, merge, join, aggregate, re-map, pivot, cross-calc, etc.
- Data Cleansing > enrich, evaluate, filter, reformat, and validate data across disparate sources
- Data Governance > manage and mask master data, manage metadata, improve data quality
- Data Migration > remap file formats, data types, endian states, record formats
- Data Replication > copy, shift, enrich, and re-purpose data from one or formats/platforms into others
- Data Federation > virtualize ad hoc mash-up views and formatted reports, or feed direct BIRT displays via ODA
- Data Masking > de-ID, encrypt, hash, pseudonymize, randomize, redact, tokenize and otherwise obfuscate fields
- Data Presentation > get 2D BI via detail, delta, and summary reports in custom formats, even with embedded HTML
- Data Franchising > filter, pivot, transform, and segment data into CSV, XML and ODBC hand-offs for BI tools
- Data Staging > scrub and prepare bulk data for other ETL tools, databases, data and spreadmarts, and analytic platforms
- Data Prototyping > generate safe, intelligent, and referentially correct DB, file, and custom-report-formatted test data
So, to process, present, protect, and prototype big data, SortCL and flat files are still the fastest, and most cost-effective approach to consolidating information lifecycle management (ILM) activities. And if you keep your data in HDFS, many of the core data transformation, masking, and test data generation functions designed in SortCL can run in MapReduce 2, Spark, Spark Stream, Storm or Tez through Voracity's Hadoop gateway (called "VGrid") without re-coding anything!
If you are executing Voracity jobs in Windows, Linux, or other Unix file system using the default CoSort/SortCL (or subset) program, that executable can send event (job status) messages to the (CLI or GUI) console at various verbosity level (most basic shown below).
If you are executing Voracity jobs in Hadoop MR2, Spark, Storm, or Tez, the HDFS job inventory indicates status:
We believe that Hadoop provides for this automatically. Default CoSort/SortCL-based executions only allow for pause/resume in the event of insufficient soft overflow (temp) file space.
There are several logs generated, including SortCL app stats, error logs, and a self-appending runtime performance file. There is also an optional XML audit log file generated from each run showing the contents of the script and environment details:
The CoSort package comes with all the tools, conversion utilities, and callable libraries, as well as full .pdf documentation for the executables and APIs.
When a CoSort package is installed, default system tuning parameters are automatically configured to exploit available RAM and CPU resources. Users can manually adjust their tuning parameters (including the assignment of sort overflow disks, and audit logs) in a simple text file.
Usually through system-specific license keys. Call us to describe your situation.
That depends on what component(s) you choose to apply, and the extent of functionality you need. For a basic sort, it may take as long to choose an interface as to implement one. Implementing custom user input, compare, or output routines, or write custom, field-level transforms (i.e. adding your own cleansing or statistical functions for individual fields) will take longer.
From USD$150 to $15K per copy, or site, when embedded and redistributed within your application. It depends on what piece you integrate, how many and how quickly you distribute them, plus the (average) size of end-user hardware and cost of your application. We aim for fair licensing within your business model.
Actually, it can be any (combination) of the above.
You can choose to invisibly embed, re-license, or refer a full or partial CoSort package. Partial options include:
• Standalone interfaces: sorti, sortcl
• Third-Party Sort PlugIns: e.g., Unix sort, SAS, etc.
• Runtime libraries: cosort_r(), sortcl_routine()
That depends on your current methodology, hardware, kernel and CoSort tuning, data volume, and job specs. You can measure it during a free, confidential trial.
If so, less than you would for production. Call IRI to discuss your situation.
IRI Voracity's new Data Unification wizard follows the consolidation style of MDM, and allows you to compare, reconcile, and bucket new master values using fuzzy matching logic.
Alternatively, you can manually move the master data to a central place and metadata repository where it is easier to clean, manage risk, and support a SOA chain. Define the files and fields in the CoSort SortCL tool readable format and scan for naming and attribute discrepancies. Create SortCL jobs to integrate and standardize these files, and leverage custom, field-level transformation functions to cleanse the data according to your own business rules. Once prepared in the file system via rapid CoSort SortCL operations, you can then re-populate master data tables as needed. SortCL join operations can also filter and report on changed master data that are in files.
As we move to a centralized MDM model, all our critical metadata may sit in one vulnerable basket. Our trusted internal systems and processes touch this data. But we also need to make it available to outsourcers and for testing. What can we do to balance need-to-know and access/user requirements?
CoSort's SortCL tool can apply the necessary security filters to help you enforce your need-to-know rules, and to encrypt and anonymize specific file data at risk before it is exposed. You must mask your master data at the field level if you need to test or outsource this data -- or provide views to different departments -- and leave less-sensitive data still accessible.
All CoSort (SortCL) application scripts (whether for data transformation, reporting, protection or prototyping) are parsed and saved with runtime information in audit logs that you can secure and query with your preferred XML application or SortCL itself.
The SortCL program in IRI CoSort can flatten and process some hierarchical data, including mainframe index files, to prepare for DB loads and business intelligence tools. SortCL can also join product data with customer transactions (as well as other files with common keys) and output custom reports. These reports can include aggregation and cross-calculation of the detail and summary data produced. SortCL can also perform field-level lookups to find discrete values, and populate newly formatted files.
Dumping the data to a flat file or pipe into CoSort without qualifiers encumbering the unload can help. So, save the order by, group by, distinct and join work for CoSort -- which can handle them all (at once) faster in the file system, and give you formatted reports in the same IO.
CoSort's SortCL supports all SQL aggregate functions, including sum, average, count, maximum and minimum. But SortCL is more efficient in that it can sort on multiple keys and produce aggregate results for one or more output files in the same pass through the data, off-line.
On an ia64 hp server rx5670 with four 1GHz Itanium2 CPUs and 32GB of RAM, Oracle 9i's SQL*Plus joined two 1G tables in 48 minutes. Unloading the same tables with the IRI Fast Extract (FACT) tool, piping these to flat stream sorts and joins in the CoSort SortCL program, and then piping the result into SQL*Loader built the same joined table in 18 minutes (or, ~1/3 the time of the on-line method).
Direct path, pre-sorted loads are the fastest way to build new tables since this loading method bypasses the overhead of Oracle's index sort. For bulk loads, CoSort the data first on the primary index key; the create index will bypass the sort step. For regular insertion loads, CoSort the data on the clustered index as the key.
Through database-specific APIs and parallel unloading techniques that produce portable flat files. For interface details, and to request a brochure, white paper, webinar or trial, go to Products > FACT
CoSort (or Voracity) SortCL job scripts can do many of the same jobs much faster, and with far less coding. SortCL runs outside the database on flat file inputs, using the same relational logic and functions; i.e. SELECT WHERE, DISTINCT, ENCRYPT, ORDER BY, GROUP BY, JOIN. For example, CoSORT's SortCL uses conditional /INCLUDE and /OMIT statements to select from sequential input sources and output targets. DISTINCT is similar to /NODUPLICATES, ORDER BY is a /KEY, GROUP BY may be a /SUM, /AVERAGE, etc.
We have hundreds of tables and thousands of columns defined in tables, though these are also represented as external files. How can we leverage SortCL without having to re-define these layouts by hand?
Like FACT, both the AnalytiX DS Mapping Manager or Meta Integration Model Bridge (MIMB) from Meta Integration Technology, Inc. (MITI) can automatically convert the file layout metadata used in your relational or ETL tool into CoSort/SortCL data definition file (DDF) format.
Yes, all of the above, and more. See the data sources and targets listed generally under:
Note that RowGen can produce all of these targets simultaneously in fact, from the same source of synthetically generated (or masked) data, as well as detail and summary reports with customer formatting in each.
RowGen v3 pricing starts well below $10K for perpetual use, and increases proportionally with additional CPU cores that you may wish to exploit. Maintenance is free in the first year and then 15- 20% of your license fee, depending on the level of upgrades.
It reads popular data model and file layout metadata to formats the rows of test data into precise flat file, reporting, and database table structures for DB populations, and preserves the referential integrity of those tables. RowGen is a complete solution for test data, as it allows you to be simultaneously:
- Generating Huge Volumes in Parallel
- Applying Custom Rules and Ranges
- Using DB Data Models and Metadata
- Preserving Realism and Relationships
- Transforming, Segmenting, and Reporting
- Formatting in Custom File Layouts
- Auditing Jobs to Verify Compliance
You can work with an IRI engineer directly by completing the Support Request Form, sending a descriptive email to firstname.lastname@example.org, or by calling 1-321-777-8889, main menu option 3 (for Technical Support).
A list of frequently asked technical questions pertinent to installation and configuration for customers under maintenance agreements is posted and answered in the protected support section of this web site. Nonetheless you can still find many technical and how-to answers available from email@example.com and the self-learning page here (even if you are not under contract).
Yes, as long as your XML files describe structured, sequential records. The output can be another flat XML file, or any another file type that SortCL supports.
Yes, if the XML file contains structured, sequential records, and the output process type is documented in SortCL.
Yes, if those other process types are documented in SortCL. Any valid output targets can also be custom-formatted and protected for reporting, hand-offs, and outsourcing.
Yes, subject to the format limitations defined above.
No, unless you run out of disk space.
At the field-level, you cannot link custom transformations that require file-level information, such as statistics. However, SortCL already supports file and record-level aggregation, plus cross-calculation, for detail and summary reporting.
SortCL sends the field data to the routine, which then sends it back in the transformed state, formatted per the other attributes in the output field description.
Write or license a library and specify it at the top of a SortCL job script (which can also be invoked in a thread-safe API call). In the /INREC or /OUTFILE phase of the script, define the library as a field attribute.