File Creation | Social History Portal

Applying best practices at the beginning of the digitization process helps to increase the overall quality of the content. Limiting the range of file formats used for archival storage enables a trusted repository to provide predictable delivery services. Good file names applied as part of a robust digitization workflow can serve to uniquely identify a file and to preserve aspects of its provenance. The following section specifies the object types, file storage formats, and delivery formats that should be supported by a social history repository. It closes with a discussion on the recommended form and application of file names and directory structures.

In general a repository should aim to provide different file versions for each original item in the collection; this would include at least one master file and multiple derivatives. Master files and derivatives are defined as:

''Master file:'' The result of the digitization process; a high quality digital object or digital file from which all other versions or derivatives (e.g. compressed versions for accessing via the Web) can be derived. The idea behind the master file is to have a digital copy which resembles the original as much as possible. Thus, the master is usually created at the highest suitable resolution and bit depth that is both affordable and practical. Often a distinction is made between a production master file and an archival master file. The archival master file is an unaltered version of the digitized object, while the production master can have some adjustments to the object.
''Derivatives:'' Different versions derived from the master or from the born-digital original. Derivatives are generally used for online access to and distribution of digital content. Derivatives may include thumbnails, previews, high- and low-resolution copies, and OCRed text versions.

HOPE distinguishes between three types of derivatives:

''Derivative 1'' is a high-resolution derivative for reproduction and publication (online/print) purposes. It might alternatively be considered the production master as defined above (the master could then be referred to as the archival master). In other words, the derivative 1 file may serve as the base file for use and reuse. It can be used as the basis of other derivatives and likewise for tasks which call for a high quality copy, such as illustrating a publication.
''Derivative 2'' is a medium to low-resolution derivative that enables easy online consultation. It should therefore be optimized so that the file is not too heavy and can be easily downloaded or viewed online but still provides an adequate viewing or listening experience (e.g. not hindering legibility).
''Derivative 3'' is a preview quality derivative for display in search results or on individual records. It should provide a sample or suggestion of the content of the original, though it is not even necessary that it be legible. The only real requirement is easy transfer over the internet.

Separate guidelines should be used when creating master files and the different types of derivatives. A basic rule of thumb when digitizing material should be that—as the digitization process is labor intensive—the master should have a lifetime of at least fifty years. While compression for the master file is not explicitly forbidden, it is discouraged as an added barrier to the use of the file over the long term. Compression can be 'lossless', which means that no information is lost and the original file can be reconstructed pixel by pixel, or 'lossy', which means that some information is lost in the compression process. When compression is used for masters, lossless compression should always be employed. Lossy compression should only be employed for derivatives. Another general rule, whether you are digitizing internally or outsourcing, is to do a random check of the created files for signs of faults (e.g. lines) or, if compression has been employed, distortions known as 'compression artifacts'. In general, open standard formats should be preferred to proprietary formats. Formats selected should be widely used standards and supported by browsers, players, and viewers.
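The defining property of lossless compression, that the original can be reconstructed exactly, can be checked with a round trip. The following is a minimal sketch using Python's standard zlib module (DEFLATE) as a stand-in for image-specific lossless schemes; the sample data is hypothetical and does not come from a real scan.

```python
import zlib

# Hypothetical stand-in for raw image data: a repetitive byte pattern.
original = bytes(range(256)) * 64

compressed = zlib.compress(original, 9)
restored = zlib.decompress(compressed)

# Lossless: the round trip reproduces every byte of the input.
is_lossless = restored == original
saved = 1 - len(compressed) / len(original)
print(f"identical after round trip: {is_lossless}, space saved: {saved:.0%}")
```

A lossy codec would fail the equality check by design, which is why the text reserves lossy compression for derivatives.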

Master Files

Should be scanned at full size.
Should show the smallest relevant details of the original. An appropriate resolution should be selected.
Should exhibit the full range of relevant colors and tones. This does not mean that every image has to be scanned in 24-bit color, but only that if an original contains colors, these should be visible; e.g. a newspaper that contains no color images may be scanned in 8-bit grayscale or even 1-bit black and white, but one that contains color images should be scanned at a bit depth which shows the full range of colors and tones.
Should be saved in a standardized and well-documented file format which will be supported over the long term. The official standards are TIFF, JPEG, JPEG2000 and PDF/A. Other file formats which have become standards because of widespread use and support, such as GIF, PNG and PDF, are ''de facto'' standards. It is advisable to use only the official standards.
Should be uncompressed. A compression algorithm is just another obstacle to long-term access as it can become obsolete independently of the file format. Thus, compression should be avoided for masters.
Should undergo no unnecessary tinkering with the image quality. No sharpening, resizing, de-skewing, de-speckle, etc. should be undertaken unless deemed absolutely necessary to make the image intelligible. If the image is unintelligible, rescanning should be considered.

Derivative 1 Files

Should show the smallest relevant details of the original. (see Master Files above)
Should exhibit the full range of relevant colors and tones. (see Master Files above)
Do not require a standardized or well-documented file format, although it is still advisable to employ one. File formats should be chosen with an eye to convenience.
Can be compressed. It is advised to use a lossless, standardized and well-documented compression algorithm, such as LZW or JPEG2000.
Should undergo no tinkering with image quality. (see Master Files above)

Derivative 2 Files

Should have a file size which is as small as possible without compromising legibility.
Should not be downsized.
Can be of lower resolution, provided this does not compromise legibility.
Do not have to exhibit the full range of colors and tones, as long as legibility is not compromised.
Can use any convenient file format. Preference should be given to formats which are universally supported, such as JPEG.
Can employ any kind of lossless or lossy compression. Preference should be given to standardized and well-documented algorithms, such as JPEG.
Tinkering, such as sharpening, de-skewing or de-speckling may be permitted, provided it enhances legibility.

Derivative 3 Files

Should have a small file size to facilitate downloading.
Can be adapted to fit any need. Resizing to smaller measurements is allowed.
Do not have to contain all the pages of the original.
Can have a very low resolution (e.g. 75 dpi).
Do not have to exhibit the full range of colors and tones.
Can use any convenient file format. Preference should be given to a format which is universally supported, such as JPEG.
Can employ any kind of lossless or lossy compression. Preference should be given to standardized and well-documented algorithms, such as JPEG.

The sections below detail the main formats used for digitizing material for each type of media: visual and text documents, moving images, and sound, giving recommendations on formats for both preservation and access purposes and on quality characteristics relevant to each media type.

Raster Based Images for Visual and Text Documents

Computer graphics can be either raster graphics (also known as 'bitmap images') or vector graphics. The former are composed of pixels in the form of a grid; the latter represent images with a geometry of points, lines, curves, and polygons. Raster graphics produce more faithful reproductions of photographs and photo-realistic images (as appropriate to scanned reproductions), while vector graphics often serve better for typesetting or for graphic design. The following treats raster images as used in the digital reproduction of analog or born-digital material.

The quality of a digital image depends on its resolution, the number of pixels per inch (PPI). Simply put, the higher the number of pixels per inch, the higher the quality of the image. However, as the resolution increases the image not only gets sharper but the file also gets bigger. Thus, when images are scanned at a very high resolution, compression should be considered. Lossless compression schemes can help save up to fifty percent in storage space, while lossy algorithms can help save up to eighty percent. When the compression rate is not too high, lossy compression can sometimes result in a file of acceptable quality. Much depends on the compression technique used and, of course, the intended use of the file.
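The relationship between resolution and file size can be made concrete with a rough calculation. This is a sketch only: the function name and the 10 x 8 inch original are illustrative assumptions, and file headers and compression are ignored.

```python
def uncompressed_image_bytes(width_in, height_in, ppi, bits_per_pixel):
    """Approximate size of an uncompressed raster image, ignoring file headers."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bits_per_pixel // 8

# Hypothetical 10 x 8 inch original scanned in 24-bit color.
for ppi in (150, 300, 600):
    size = uncompressed_image_bytes(10, 8, ppi, 24)
    print(f"{ppi} ppi: {size / 1_000_000:.1f} MB")
```

Because both dimensions scale with the resolution, doubling the PPI roughly quadruples the uncompressed size.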

Other factors which play a role in image quality are color depth and color space. Color depth, or bit depth, is the number of bits used to represent the color of a single pixel in a bit mapped image or video frame buffer. This concept is also known as bits per pixel (bpp), particularly when specified along with the number of bits used. A higher bit depth gives a broader range of distinct colors but also leads to a larger file. For thumbnails a low bit depth is allowed, but other derivatives and master files should strive for very high bit depth (a minimum of 24 bit: TrueColor is recommended). When scanning images or documents the color space should also be taken into account. Color space is a physical way of describing colors. Some color spaces are able to represent more colors than others. Not all file formats support all color spaces. It is best practice to use the European Color Initiative RGB (ECIRGB).
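As a quick illustration of why bit depth matters, the number of distinct values a pixel can take doubles with every added bit. The helper name below is hypothetical.

```python
def distinct_values(bits_per_pixel):
    """Number of distinct colors or tones a pixel can represent."""
    return 2 ** bits_per_pixel

for bits, label in ((1, "black and white"), (8, "grayscale"), (24, "TrueColor")):
    print(f"{bits:>2}-bit {label}: {distinct_values(bits):,} values")
```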

Social history collections may include a range of visual and textual document types, ''inter alia'' continuous and halftone images (pictures, photographs); hand-written texts; simple line drawings and lower quality printed or typed text; and higher quality printed text with OCR full text option available. In most digital repositories visual and textual materials are handled in the same way. Like visual materials, text materials (serials, correspondence, reports, grey literature, etc.) are scanned (or in rare cases photographed) and saved in a bitmap format. This means that for master files the same file formats can be used for both visual and textual source material. For derivative files, however, different solutions may be appropriate. To preserve legibility, derivatives of textual master files should be created in higher resolutions and even in different formats (e.g. PDF) than derivatives of visual master files.

Below is a fairly comprehensive list of widely-supported and well-documented file formats which may be considered by social history repositories for different document types:

TIFF (Tagged Image File Format)
PDF (Portable Document Format)
JPEG (Joint Photographic Experts Group)
PNG (Portable Network Graphics)
GIF (Graphics Interchange Format)

(See File Formats, section on Image File Formats for more detailed information.)

The preservation master should capture the full qualities of the original, including size, detail, and color tone and be serviceable as the basis of further reproductions. Thus, it should be created with the highest resolution and bit depth possible without undue use of resources. HOPE recommends uncompressed baseline TIFF 6.0 for masters of all visual and textual material. HOPE offers JPEG2000 as a strong alternative master format for text documents and line drawings. However, the JPEG2000 standard definition is unclear concerning color space and resolution, making it a less viable solution for halftone images, such as photographs and illustrations. TIFF is still recommended as a more robust all-purpose master format. TIFF with lossless compression may also be used as a derivative 1 (high quality copy or production master) format, though HOPE prefers PNG 1.2 as an all-purpose derivative 1 format. For printed text documents or line drawings, HOPE also recommends PDF/A with lossless compression for use as a derivative 1 format; alternatively, PDF 1.4 with lossless compression or JPEG2000 may be used for this type of material. If necessary, simple JPEG with minimal compression may be used as a derivative 1 format for images.

Derivative 2 and 3 files are created primarily for the purposes of online access and distribution. These do not have to preserve all the qualities of the original document, but they should relay the basic content; derivative 2 files in particular should preserve legibility. Both types of derivatives should be optimized to provide a decent viewing experience while facilitating transfer over the internet. HOPE recommends the use of simple JPEG as a derivative 2 and 3 format for continuous and halftone images and PDF/A with lossy compression as a format for text documents or line drawings. Alternatively, JPEG or PDF 1.4 with lossy compression may also be used for text documents or line drawings. TIFF with lossy compression and PNG 1.2 may also be used as general all-purpose derivative 2 and 3 formats. Importantly, GIF 89a is not recommended as a master or derivative format for any document type and should not be used.

(As born digital documents remain outside the scope of the HOPE project, text formats such as Open Document Format (ODF), Rich Text Format (RTF), and others have not been analyzed.)

File Formats for Video

At present, archives and libraries are compelled to preserve and make available an increasing number of audiovisual documents. Digitizing and giving access to this growing body of material is not an easy task, requiring both infrastructure and technical know-how. There is a motto in the digital video community: "Quality, size, speed. Pick two!" In fact, encoding, storing, and making available digital video demands a large-scale investment of resources. To store high quality digital video material requires either extensive disk space or a slow and computationally expensive encoding/decoding process. Compression of digital video is almost a necessity. One hour of uncompressed PAL video recorded at standard definition (720 x 576 @ 25fps) will require 93 GB, while one hour of HDTV uncompressed video (1280 x 720p @ 60fps) will require a shocking 742 GB! Thus it remains crucial to choose the most appropriate formats when digitizing audiovisual collections.
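Storage figures of this kind follow from frame size, frame rate, and bits per pixel. The text does not state which sampling was assumed; the sketch below assumes 20 bits per pixel (10-bit 4:2:2 sampling), which roughly reproduces the standard-definition PAL figure. Other sampling assumptions give different totals, so the numbers should be read as estimates.

```python
def uncompressed_video_gb_per_hour(width, height, fps, bits_per_pixel):
    """Storage for one hour of uncompressed video, in gigabytes (10**9 bytes)."""
    bytes_per_second = width * height * fps * bits_per_pixel / 8
    return bytes_per_second * 3600 / 1e9

# Assumption: 20 bits per pixel (10-bit 4:2:2); not stated in the text.
pal_sd = uncompressed_video_gb_per_hour(720, 576, 25, 20)
print(f"PAL SD, one hour: {pal_sd:.1f} GB")
```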

The digital video arena is continuously evolving. Today formats that have been used for over fifteen years stand side by side with others that have emerged only very recently. Economic interests inform these dynamics, making the selection process even more difficult. For this reason, social history organizations must use discretion in selecting formats, taking into account the intended use of the file and more specifically the required longevity. The following is a comprehensive list of the most widespread digital video formats both for preservation and access:

MPEG family (Moving Picture Experts Group family)
MJ2 (Motion JPEG 2000)
Theora
Dirac
VP8

(See File Formats, section on Video File Formats for more detailed information.)

For preservation masters the format should preserve the original characteristics of the audiovisual document unchanged. So there are basically two choices: storing the audiovisual document uncompressed (with the size implications noted above) or using a lossless compression format. In the latter case, HOPE recommends as best practice to use Motion JPEG 2000 (MJ2), which is an open standard—if not royalty free—offering lossless compression. MJ2 has been adopted as the format of choice for Digital Cinema, which suggests that use of and support for the format will only increase. If an institution cannot afford the cost of hardware compression and relatively large file sizes produced by MJ2, and if source material is not of high quality (e.g. stored on VHS, U-Matic, Betacam SP tapes, Video CD, DVDs, etc.), MPEG-2 or MPEG-4 AVC/H.264 may also be used, with bitrates as appropriate for the quality of the original material.

For derivative 2 files, created primarily for the purposes of online access and distribution, HOPE recommends the VP8/WebM format, an open and free format that is well suited to low bitrate streaming. The format promises to be widely supported due to Google's decision to use it in its video streaming portal YouTube, and along with this to phase out the previously used H.264 format. The H.264 format is a good alternative with solid performance and wide support; however, its licensing costs threaten to hamper its adoption over the long term. Theora is another open and widely supported format, but its performance is not up to that of VP8 or H.264.

File Formats for Audio

Digital audio is of course not as storage demanding as digital video, but even so compression algorithms can facilitate the storage of high-volume digital audio collections. The distinction between lossy and lossless compression algorithms should again be kept in mind.

The following is a comprehensive list of the most widespread formats for digital audio both for preservation and access:

LPCM (Linear Pulse-Code Modulation)
FLAC (Free Lossless Audio Codec)
MP3 (MPEG-1 or MPEG-2 Audio Layer 3)
AAC (Advanced Audio Coding)
Vorbis

(See File Formats, section on Audio File Formats for more detailed information.)

For long term digital preservation, as noted above, the file should preserve the original characteristics of the audio document unchanged. So again two choices present themselves: storing the audio document uncompressed using the WAVE-LPCM format, or, if storage space is an issue, using a lossless compression format. In the latter case, HOPE recommends FLAC. FLAC is an open and royalty free format which offers lossless compression and is increasingly supported. Adopting FLAC instead of WAVE-LPCM may save up to fifty percent in storage space. Still, it should be noted that WAVE-LPCM encoding/decoding is faster than that for FLAC; this has consequences for audio digitization workflows.
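To put the storage trade-off in numbers, uncompressed LPCM size follows directly from sample rate, bit depth, and channel count. The sketch below assumes CD-quality parameters (44.1 kHz, 16-bit, stereo), which are an illustrative choice rather than a HOPE recommendation, and treats the "up to fifty percent" saving as a best case.

```python
def lpcm_bytes(sample_rate_hz, bit_depth, channels, seconds):
    """Size of uncompressed LPCM audio data, ignoring the WAVE header."""
    bytes_per_sample = bit_depth // 8
    return sample_rate_hz * bytes_per_sample * channels * seconds

# Assumed parameters: 44.1 kHz, 16-bit, stereo, one hour.
one_hour = lpcm_bytes(44_100, 16, 2, 3600)
best_case_flac = one_hour // 2  # "up to fifty percent" saving, best case only
print(f"WAVE-LPCM: {one_hour / 1e6:.0f} MB, FLAC at best: {best_case_flac / 1e6:.0f} MB")
```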

For derivative 2 files, created for online access and distribution, more compressed formats are appropriate, such as those produced by the lossy codecs MP3, AAC, or Vorbis. It is important to note that the audio quality of these lossy codecs above 128 kilobits per second is quite similar. Still, listening tests carried out at different bitrates have suggested slight quality differences. Listening tests are normally carried out as ABX tests, i.e. the listener has to identify an unknown sample X as being A or B, with A (the original) and B (the encoded version) available for reference. The outcome of a test must be statistically significant. This setup ensures that the listener is not biased by their expectations and that the outcome is very unlikely to be the result of chance. If sample X can be identified reliably, the listener can assign a score as a subjective judgement of the quality. Otherwise, the encoded version is considered to be transparent.
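The requirement of statistical significance in an ABX test can be illustrated with a one-sided binomial calculation: how likely is it that a listener who is purely guessing identifies X correctly at least a given number of times? This is a sketch under that assumption; the function name and the 16-trial example are hypothetical and not taken from the tests cited below.

```python
from math import comb

def abx_p_value(correct, trials):
    """Probability of at least `correct` right answers in `trials` by pure guessing."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

# Hypothetical session: 12 correct identifications in 16 trials.
p = abx_p_value(12, 16)
print(f"p = {p:.4f}")
```

A small value (conventionally below 0.05) suggests the listener really can hear a difference; otherwise the encoded version may be treated as transparent for that listener and sample.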

The following test results suggest that Vorbis performs better than MP3 and AAC, though differences are minimal at high bitrates:

In July 2005, a trial was made of AAC, MP3, Vorbis, and Windows Media Audio (WMA) at 80 kbit/s. Findings were that aoTuV beta 4 (Vorbis) is the best encoder for either classical or other music at this bitrate, and that its quality is comparable to that of LAME ABR (MP3) at 128 kbit/s.

In August 2005, a trial was made of AAC, MP3, Vorbis, and WMA at 96 kbit/s. Findings were that aoTuV beta 4 (Vorbis) and AAC performed equally strongly as encoders for classical music at this bitrate, while aoTuV beta 4 (Vorbis) was the best encoder for pop music, outperforming LAME (MP3) at 128 kbit/s.

In August 2005, a trial was also made of AAC, MP3, Vorbis, and this time Musepack (MPC) at 180 kbit/s. An audiophile listening test found that for classical music aoTuV beta 4 (Vorbis) and MPC were the best encoders.

Given these results, HOPE recommends the Ogg/Vorbis format for access and distribution. The format is open and free and well suited to low bitrate streaming. Moreover, in the wake of Google's decision to use Vorbis as the audio codec in WebM, the format also promises to be widely adopted and supported over the medium to long term. AAC and MP3 are safe alternatives due to their wide diffusion and support.

File Naming

A file naming convention is a set of agreed-upon rules used to assign identifiers to digital objects in a collection. A naming convention ensures that files can be consistently and uniquely identified within the repository system and is thus essential to data integrity and internal workflows. The focus of the following recommendations is digitized analog material. In the case of born-digital objects, an institutional file naming convention may also be applied, but in this case the original file name must be preserved as part of the provenance metadata.

A good naming convention should:

Be standardized, stable, and applicable to all collections and projects in the institution;
Preclude the possibility of identical file names, which could lead to accidental overwriting and loss of files;
Enforce unambiguous distinction between files to allow files to be easily identified (directly through the name itself or indirectly through a metadata record);
Provide the means to easily distinguish among the different instances of a file (format, quality, etc.);
Support compound digital objects, i.e. objects comprised of two or more content files having a physical and/or logical relationship to one another;
Facilitate the retrieval and processing of materials from creation onwards.

All file names should comply with the following minimum requirements. Under normal conditions, current operating systems generally support file names of up to 255 characters. It is, however, advisable to restrict file names to about 30 characters, including the period '.' and extension, as some operating systems are unable to handle very long paths, a fact which can lead to copying errors. Characters should be in lower case, and only alphanumeric characters should be used with the exception of hyphen '-', underscore '_', and period '.' (for the file extension). Spaces should not be used.
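These minimum requirements lend themselves to an automated check. The following is a minimal sketch, assuming that "about 30 characters" is enforced as a hard limit and that exactly one period is allowed; both the pattern and the function name are illustrative.

```python
import re

# Lower-case alphanumerics, hyphen and underscore, then one period and an extension.
NAME_PATTERN = re.compile(r"[a-z0-9_-]+\.[a-z0-9]+")
MAX_LENGTH = 30  # assumption: "about 30 characters" treated as a hard limit

def is_valid_file_name(name):
    """Check a file name against the minimum character and length rules."""
    return len(name) <= MAX_LENGTH and NAME_PATTERN.fullmatch(name) is not None

for candidate in ("hu-osa_mss64_001_h_e.tiff", "HU-OSA_mss64.tiff", "my file.tiff"):
    print(candidate, is_valid_file_name(candidate))
```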

Every file name is comprised of a few basic elements. Some are mandatory, while others are optional:

''Institutional prefix:'' The prefix should be a unique identifier designating the institution that created or has custody of the digital object. If possible the identifier should include a formal country code specified according to ISO 3166 and a national repository code or other unique institutional identifier. The institutional prefix is particularly helpful if material will be exchanged or aggregated with material from other institutions.
Example: hu-osa
''Root file name:'' The root file name is a name given to the file to distinguish it from other files created or stored in the same institution. The name may be 'descriptive', incorporating some characteristic of the content, such as its predominant content or its call number, or the name may be 'non-descriptive', completely arbitrary and devoid of any reference to characteristics of the file’s content.
Example: hu-osa_mss64
''Sequence designator:'' Files belonging to the same compound digital object (e.g. the digitized pages of a diary) should have the same root file name. In such cases, to distinguish one file from another and to indicate the relative position of one file in the sequence of files, a sequence designator should be used. The sequence designator aids in expressing the structural relationship of the files so that the digital object can be displayed in the proper sequence to an end user. The value of the sequence designator should be a number between 1 and n, with 1 designating the first file in the sequence and n the last. It is important to pad the numbers with leading zeros (e.g. 001) to facilitate automatic sorting.
Example: hu-osa_mss64_001
''Quality suffix:'' A quality suffix should be used only to distinguish different levels of quality for files of the same file format to prevent reuse of an identical file name. In this context, quality is used to indicate the richness of a file (e.g. 'h' for high quality) or the use to which the file will be put (e.g. 'm' for master).
Example: hu-osa_mss64_001_h
''Processing suffix:'' If a file has been edited, and needs to be distinguished from an unedited version of the same file, this should be indicated in the file name by a lowercase 'e'. For example, an original file may be edited to modify the content of the file in some way, such as to delete unwanted artifacts or confidential text or to insert content.
Example: hu-osa_mss64_001_h_e
''File extension:'' The file extension is a three- or four-letter string designating the file format (e.g. *.html, *.sgml, *.tiff, *.jpg, *.gif, *.mpeg, etc.). File extensions are usually generated by the software application used to create the content file.
Example: hu-osa_mss64_001_h_e.tiff

Only the root file name and file extension elements are mandatory for every instance of a file name. The composition of the file name may vary, even when using the same naming convention, depending on the material being named. An underscore '_' should be used to separate any of the first five elements. A period should be used to separate the file extension from the other elements.
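The composition rules above can be sketched as a small helper that joins whichever elements are present. The function and parameter names are hypothetical, and the three-digit sequence width is an assumption based on the examples.

```python
def build_file_name(root, extension, prefix=None, sequence=None,
                    quality=None, edited=False, seq_width=3):
    """Assemble a file name; only root and extension are mandatory."""
    parts = []
    if prefix:
        parts.append(prefix)
    parts.append(root)
    if sequence is not None:
        # Zero-pad so that names sort in page order.
        parts.append(f"{sequence:0{seq_width}d}")
    if quality:
        parts.append(quality)
    if edited:
        parts.append("e")
    return "_".join(parts) + "." + extension

print(build_file_name("mss64", "tiff", prefix="hu-osa", sequence=1,
                      quality="h", edited=True))
```

Called with only the mandatory elements, e.g. `build_file_name("mss64", "tiff")`, the helper returns `mss64.tiff`.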

The choice of a descriptive or non-descriptive root file name is at the discretion of the organization. Descriptive root file names contain words, numbers, or abbreviations that describe in some way the file they pertain to. They may be composed of a title, the name of the creator, the accession number of the physical item, collection or media designation, or some other descriptive identifier. Meaningful root names make it easier to identify and manage the digital files and require less dependence on the collection management system or repository; this reduces the impact if something goes awry with the system. Descriptive names may also facilitate end user access to and use of material. On the downside, meaningful file names are often specific to particular collections and should be conceived for each project, so they are only feasible for medium to small collections. Furthermore, there is the added possibility that the name’s meaning will be lost or change connotation over time or that the convention will not scale well as collections grow and change.

Non-descriptive root file names express no relationship to the item and are usually sequential numbers. Non-descriptive root names work well for medium to large collections and are easy to assign and apply automatically. Non-descriptive root file names provide no identifying information; thus the files are harder to manage and workflows center on the database that contains the associated metadata. The decision to use descriptive or non-descriptive file names should be based on the collection’s characteristics, current and future repository requirements, and available resources.
Example of descriptive root file name: hu-osa_mss64_001_h_e.tiff
Example of non-descriptive root file name: hu-osa_12345678_001_h_e.tiff

Many of the rules for file names also apply for directory names. Often, the file naming is integrated with the directory structure rules, the file name replicating to some degree the structure. In this case, it is important that the file name does not depend on its location in the structure for its uniqueness but that it can function independently as a file identifier. Other than this, folder names should be restricted to thirty characters and the number of nested sub-folders should be restricted to five, not including the root folder.
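The folder limits can be checked in the same spirit. This sketch assumes the path is given relative to the root folder, so the root itself is not counted; the names used are illustrative.

```python
from pathlib import PurePosixPath

MAX_FOLDER_NAME = 30  # characters per folder name
MAX_DEPTH = 5         # nested sub-folders below the root folder

def check_directory(relative_path):
    """Return a list of problems for a folder path relative to the root folder."""
    parts = PurePosixPath(relative_path).parts
    problems = [f"folder name too long: {p}" for p in parts
                if len(p) > MAX_FOLDER_NAME]
    if len(parts) > MAX_DEPTH:
        problems.append(f"nesting depth {len(parts)} exceeds {MAX_DEPTH}")
    return problems

print(check_directory("hu-osa/mss64/masters"))
print(check_directory("a/b/c/d/e/f"))
```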

When digitizing materials, the three possible file naming procedures are: 1) automatically producing file names with scanning software; 2) manually editing after scanning; and 3) running a script that batch renames files according to custom rules. The choice largely rests on the broader digitization and digital curation workflows, e.g. when and how files are created; when and how quality control is undertaken; when and how files are packaged into objects; when and how file names are stored in the local system; when and how derivatives are created; when and how objects are stored in the file server or on storage media; etc. As a rule, manual editing is discouraged as it is labor intensive and prone to human error. In any case, naming conventions should be agreed upon and documented in advance of digitization—and not applied retrospectively. Policies should be set indicating whether naming conventions are project or collection based or institution wide. The latter is preferable, if only because it is more scalable, reducing the risk of confusion in the long term.
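The third procedure, batch renaming by script, might look like the sketch below. It only plans the renames and does not touch any files; the scanner output names, the 'm' quality suffix, and the function name are hypothetical, and every input name is assumed to carry an extension.

```python
def plan_renames(scanner_names, prefix, root, quality="m"):
    """Map scanner output names to convention names, numbered in sorted order."""
    width = max(3, len(str(len(scanner_names))))
    plan = {}
    for number, old in enumerate(sorted(scanner_names), start=1):
        extension = old.rsplit(".", 1)[-1].lower()
        plan[old] = f"{prefix}_{root}_{number:0{width}d}_{quality}.{extension}"
    return plan

plan = plan_renames(["scan0002.TIF", "scan0001.TIF"], "hu-osa", "mss64")
for old, new in plan.items():
    print(old, "->", new)
```

Keeping the plan as a mapping makes it easy to review before applying it and to retain the old names afterwards.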

HOPE has not taken it upon itself to recommend a single naming convention or workflow but instead recommends that organizations develop and use a file and directory naming scheme that is logical, consistent, and stable (i.e. not based on values which are subject to modification); does not duplicate names or values; supports complex objects and multiple derivative formats; and complies with the minimal character and length guidelines. HOPE recommends the use of non-descriptive numbers or codes as root file names only for medium or large collections or institutions with robust repository infrastructures. For smaller-scale collections supported by less developed infrastructure, it is advisable to use descriptive root file names, based on call numbers, local identifiers, or archival reference numbers or some combination of elements representing the intellectual structure of institutional holdings. If, whether by fault or by design, file names are changed, it is recommended to store the old file names as provenance information. Finally, organizations should avoid using file names and directory structures as their sole structural metadata but should instead attempt to store structural metadata in a more robust manner. HOPE recommends the use of the Metadata Encoding and Transmission Standard (METS) to capture the structural metadata on digital objects. Beyond the minimal technical requirements, file naming conventions should be developed to suit local needs, and file naming procedures should be integrated into digitization, transformation, processing, and storage workflows.
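As an illustration of recording page order in METS rather than relying on file names alone, the sketch below builds a minimal file section and structural map with Python's standard library. It is an assumption-laden outline: the TYPE and USE values and the file IDs are illustrative, and the output has not been validated against the METS schema, which real records should be.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)

def mets_for_pages(file_names):
    """Build a minimal METS document listing files in page order."""
    mets = ET.Element(f"{{{METS}}}mets")
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="master")
    struct_map = ET.SubElement(mets, f"{{{METS}}}structMap", TYPE="physical")
    item = ET.SubElement(struct_map, f"{{{METS}}}div", TYPE="item")
    for order, name in enumerate(file_names, start=1):
        file_id = f"FILE{order:03d}"  # hypothetical ID scheme
        file_el = ET.SubElement(file_grp, f"{{{METS}}}file", ID=file_id)
        ET.SubElement(file_el, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": name})
        # Each page div records its position and points at its file.
        page = ET.SubElement(item, f"{{{METS}}}div", TYPE="page", ORDER=str(order))
        ET.SubElement(page, f"{{{METS}}}fptr", FILEID=file_id)
    return ET.tostring(mets, encoding="unicode")

xml_text = mets_for_pages(["hu-osa_mss64_001_m.tiff", "hu-osa_mss64_002_m.tiff"])
print(xml_text)
```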

In general, it is good practice to document institutional file format policies, including settings used and metadata embedded during digitization, and file naming conventions. This helps ensure that the same rules are used with every digitization project. Importantly, rules should not be applied retroactively to existing content but should rather serve as the basis for future digitization or migration projects.

Related Resources

State Library of Queensland. ''Directory & File Naming Conventions for Digital Objects, v1.06''. 2012. (http://www.slq.qld.gov.au/__data/assets/pdf_file/0011/93377/DirectoryFi…)

UCSD Libraries Digital Library Program. ''A Naming Protocol for Digital Content Files''. 2003. (http://libraries.ucsd.edu/artsnet/fvlnet/filename_conventions.pdf)

University Library, University of Illinois at Urbana-Champaign. ''Library Digital Content Creation: Best Practices for File Naming''. 2010.
(http://www.library.illinois.edu/dcc/bestpractices/chapter_02_filenaming…)