The following is the procedure we adopt at Meridian Litigation Analytics for processing electronic documents in preparation for eDiscovery.
When a client delivers a large volume of electronically stored information (ESI), for example on CDs or removable hard drives, the ESI must be carefully examined before it is loaded into our eDiscovery application, Summation. I use a script to build log files of all the files on each storage device, recording folder location, file name, file extension, file size, etc. I then process the ESI outside Summation as follows:
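The logging step above can be sketched roughly as follows. This is a minimal illustration only, not the script actually used: the function name `build_inventory`, the CSV layout and the column names are all assumptions made for the example.

```python
import csv
import os
from pathlib import Path


def build_inventory(root, log_path):
    """Walk `root` and write one CSV row per file; return the number of files logged.

    Hypothetical helper: the column names are illustrative only.
    Keep `log_path` outside `root`, otherwise the log would list itself.
    """
    count = 0
    with open(log_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["folder", "file_name", "extension", "size_bytes"])
        for folder, _subdirs, files in os.walk(root):
            for name in sorted(files):
                path = Path(folder) / name
                extension = path.suffix.lower().lstrip(".")
                writer.writerow([folder, name, extension, path.stat().st_size])
                count += 1
    return count
```

Running it once per storage device, with a separate log file for each, gives a record of what was received before any file is moved or converted.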
Compressed files such as ZIP files are decompressed to individual files. Care must be taken to check for further ZIP files inside them, continuing until every ZIP file, including embedded ones, has been unzipped. All decompressed files are then subjected to the same analysis as normal electronic files, as described below.
Large files containing more than one single or composite document, typically PDF files – these are split into single or composite documents and OCRed where necessary.
PDF files are generally problematic. They are badly named, they contain batched documents, in most cases they originate from scanned paper documents, and their quality is poor on average. In the majority of cases they have not been OCRed. Naming, batching, page orientation, completeness and OCR status all have to be checked and corrected where necessary.
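The repeated unzipping described above could be sketched as below. It is an assumption-laden illustration, not the procedure's actual tooling: `extract_all` and the `_unzipped` folder suffix are hypothetical names, embedded archives are recognised by the `.zip` extension only, and password-protected or corrupt archives are not handled.

```python
import zipfile
from pathlib import Path


def extract_all(archive, dest):
    """Unzip `archive` into `dest`, then keep unzipping embedded ZIPs until none remain.

    Returns the number of archives extracted. Embedded archives are detected by
    their .zip extension rather than by content, because formats such as DOCX
    and XLSX are ZIP containers internally and must stay intact.
    """
    queue = [(Path(archive), Path(dest))]
    seen = set()
    extracted = 0
    while queue:
        zip_path, target = queue.pop()
        if zip_path.resolve() in seen:
            continue  # already unzipped via another route
        seen.add(zip_path.resolve())
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(target)
        extracted += 1
        # Queue every embedded ZIP; each gets its own sibling folder (assumed naming).
        for inner in sorted(target.rglob("*")):
            if inner.is_file() and inner.suffix.lower() == ".zip":
                queue.append((inner, inner.with_name(inner.name + "_unzipped")))
    return extracted
```

The embedded ZIP files themselves are left in place in this sketch, so they still have to be set aside once their contents have been verified.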
Outlook email Personal Storage Table (PST) files – these are removed and saved to a separate folder just for PST files. Summation loads PST files in a specific and very structured way for maximum access to all components of email messages and their metadata.
Outlook individual message (MSG) files – these are removed and saved to a separate folder just for MSG files. Using Outlook, MSG files are combined into one or more PST files and moved to the PST folder.
Other email formats are investigated and, where possible, converted to Outlook PST format and moved to the PST folder. These formats could be: Microsoft Exchange Database (EDB), Microsoft Outlook Express (EML), Microsoft Outlook Forms Template (OFT), Microsoft Outlook PST (MAC), Outlook Express (DBX), Netscape RFC 822, EML (Microsoft Internet Mail, Earthlink, Thunderbird, Quickmail, etc), IBM Lotus Notes Domino XML Language (DXL), Eudora and encoded mail messages.
Image files – these typically have file extensions of: JPG, BMP, PNG, GIF, TIF, etc. All image files are viewed to see whether they contain text. Image files containing text need to be converted to PDF format, have their page orientations checked and corrected, and then be OCRed.
Group all audio and music files (MP3, WMA, WAV, etc) into one folder.
Group all video files (MPG, AVI, MOV, MP4, WMV, etc) into one folder.
Files that are human generated are identified in most cases by examining their file extensions. The list I use is: csv; dbx; doc; docx; eml; htm; html; mht; msg; pdf; pps; ppsx; ppt; pptx; pst; rtf; txt; wk1; wk4; wks; wp; wpd; xls; xlsx; xml; zip. This covers more than 95% of files. The balance has to be checked together with the remnants left after de-NISTing, which is explained in the following paragraph.
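The extension-based grouping described above might be sketched as follows. The category labels and the function name `classify` are hypothetical, chosen only for illustration. The extension sets are taken from the lists above, except that WAV is treated as audio here, since it is an audio format.

```python
from pathlib import Path

# Extension sets taken from the lists in the text; the labels are illustrative only.
IMAGE = {"jpg", "bmp", "png", "gif", "tif"}
AUDIO = {"mp3", "wma", "wav"}  # WAV is an audio format, so it is grouped here
VIDEO = {"mpg", "avi", "mov", "mp4", "wmv"}
HUMAN_GENERATED = set(
    "csv dbx doc docx eml htm html mht msg pdf pps ppsx ppt pptx pst rtf txt "
    "wk1 wk4 wks wp wpd xls xlsx xml zip".split()
)


def classify(file_name):
    """Return a processing group for a file, judged by its extension only."""
    ext = Path(file_name).suffix.lower().lstrip(".")
    if ext == "pst":
        return "pst"  # goes to the PST folder
    if ext == "msg":
        return "msg"  # goes to the MSG folder, later combined into PSTs
    if ext in IMAGE:
        return "image"
    if ext in AUDIO:
        return "audio"
    if ext in VIDEO:
        return "video"
    if ext in HUMAN_GENERATED:
        return "human_generated"
    return "needs_review"  # the balance that has to be checked by hand
```

For example, `classify("Report.DOCX")` returns `"human_generated"`, while a file with an unlisted or missing extension falls into `"needs_review"`. An extension is only a hint, so the result still needs a manual check.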
All files that are non-human generated – everyone has this digital detritus on their systems: things like Windows screen saver images, document templates, clip art, system sound files and so forth. These are files that come straight off installation disks, and they are just noise to a document review. The process of identifying these files is called ‘de-NISTing’. The noise files are identified by matching their hash values (i.e., digital fingerprints) against a huge list of software hash values maintained and published by a branch of the US National Institute of Standards and Technology (NIST). The NIST list is free to download, and pretty much everyone who processes data for e-discovery and computer forensic examination uses it. I use it with caution, as I have found non-human generated files that are not on the NIST list.
Usually, an iterative procedure is needed: the electronic files are processed repeatedly until all conditions are met for every file type, so that no compressed files or large batched files remain and the document types are clearly identified for group processing and tagging.
Once all of the above has been done, the electronic files are loaded into iBlaze, and the data gathered from the above processes is loaded via DII files to tag each document record.
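The hash matching behind de-NISTing can be illustrated with the sketch below. It assumes the NIST list has already been loaded into a Python set of uppercase hexadecimal SHA-1 strings (called `known_hashes` here). Parsing the downloaded list is not shown, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Return the uppercase hex SHA-1 of a file, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def denist(root, known_hashes):
    """Split the files under `root` into (kept, removed) lists of paths.

    `known_hashes` is assumed to be a set of uppercase SHA-1 strings taken
    from the NIST list; a file whose hash is in the set is treated as noise.
    """
    kept, removed = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            if sha1_of(path) in known_hashes:
                removed.append(path)
            else:
                kept.append(path)
    return kept, removed
```

Because a match requires an identical hash, a system file that is absent from the list ends up in `kept`, which is why the remnants still need to be reviewed.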
Anton van Dorsten – November 2014