Information collected by DROID |
Welcome to DROID |
DROID collects a variety of information about your files and folders, including:
Type | File name | File name extension |
Extension mismatch warning | Location | File size |
Last modified date | Number of format identifications | File formats |
Identification method | Content hash | Status |
Type | top |
DROID categorises the files and folders it profiles as being one of three types:
- Files have format identifications, but do not have other files or folders inside them.
- Folders do not have any format identifications or sizes, but can contain other folders, files and archival files inside them.
- Archival files are like folders, in that they can contain other folders, files and archival files inside them, but they are also files, so they have format identifications and a file size.

In this version, DROID can look inside zip, tar and gzip archival files. Archival files may have other archival files nested inside them. DROID will also profile inside these, and in any further nested archival files.
File name | top |
The name of a file, folder or archival file is recorded on its own, independent of its location on a disk or inside an archival file. The file name extension (if any) is part of the name. DROID treats all file names as case sensitive. For example, 'MYDOCUMENT.DOC' and 'mydocument.doc' are regarded as different file names.
File name extension | top |
File extensions are a convention to indicate the broad type of a file (or archival file) by appending a short string to a file name, separated by a full stop. On Microsoft Windows, the file name extension tells the operating system which application to run when the file is double-clicked. Other operating systems do not use the file name extension to determine which application to use. However, file name extensions have become a de facto standard for indicating the broad type of a file format, and are usually appended to file names, even when a file is created on other platforms.
DROID extracts the file extension (if any) from a file name or archival file name and stores it separately, to facilitate reporting, sorting and filtering on the extension alone.
File names which begin with a full stop and have no other full stops in them are not regarded as having an extension. For example, a file called '.myfile' has a file name of '.myfile' and a blank extension, whereas '.myfile.doc' has a file name of '.myfile.doc' and an extension of '.doc'. This is because file names starting with a full stop are hidden files in Unix file systems, and also because it is not likely that a file name would be entirely composed of a file extension, with no name before it.
DROID treats file extensions as case sensitive. However, it converts all file extensions to lower case to facilitate filtering and reporting.
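The extension rules above can be sketched in a few lines of Python. This is only an illustration, not DROID's own code: the function name `extract_extension` is hypothetical, and returning the extension without its leading full stop is an assumption made here for simplicity.

```python
def extract_extension(file_name: str) -> str:
    """Return the lower-cased extension of file_name, or '' if it has none."""
    dot = file_name.rfind(".")
    # dot == -1: there is no full stop at all.
    # dot == 0: the only full stop is the first character (e.g. '.myfile'),
    # which is treated as a hidden file with a blank extension.
    if dot <= 0:
        return ""
    # Assumption: the extension is returned without the leading full stop.
    return file_name[dot + 1:].lower()


print(extract_extension(".myfile"))         # prints an empty line (blank extension)
print(extract_extension(".myfile.doc"))     # doc
print(extract_extension("MYDOCUMENT.DOC"))  # doc
```

Lower-casing only the extracted extension leaves the file name itself untouched, so 'MYDOCUMENT.DOC' and 'mydocument.doc' remain different names that share one extension value for filtering.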
Extension mismatch warning | top |
Sometimes file extensions are incorrect for the type of the file, or are missing where there should be one. If DROID detects that the file extension for a file name does not match the formats it has identified, it will issue a file extension mismatch warning. For example, if a file called 'myfile.doc' is identified as a spreadsheet, then a file extension mismatch warning will be issued.
In the graphical user interface, extension mismatch warnings appear as a warning symbol against the file extension itself. When a profile is exported to a CSV file, the warning appears as a True or False value in its own column.
Location | top |
DROID records the location of every file and folder it profiles. It records location in two ways, using a file Uniform Resource Identifier (URI), and a file path where one exists. Like file names and extensions, DROID treats file paths and URIs as case sensitive.
There are two ways of recording location because not all files and folders have a file path, although this is the usual method of identifying location in a file system. Any file, folder or archival file which is inside another archival file does not have a defined file path, as it is inside the archival file, not directly in the file system.
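For example, suppose a folder, C:\Folder, contains a document (Document.doc) and a zip file (Archive.zip), and the zip file contains a spreadsheet (Spreadsheet.xls) and a folder (Another folder) holding a picture (Large picture.jpg).
Then we have the following file paths and URIs: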
No. | File path | Uniform Resource Identifier (URI) |
1 | C:\Folder | file:/C:/Folder/ |
2 | C:\Folder\Document.doc | file:/C:/Folder/Document.doc |
3 | C:\Folder\Archive.zip | file:/C:/Folder/Archive.zip |
4 | | zip:/file:/C:/Folder/Archive.zip!Spreadsheet.xls |
5 | | zip:/file:/C:/Folder/Archive.zip!Another%20folder/ |
6 | | zip:/file:/C:/Folder/Archive.zip!Another%20folder/Large%20picture.jpg |
Only files, archival files or folders which are directly accessible in the file system have a file path. Those files and folders which are inside the zip file do not have a file path, but do have a URI, which tells you that they are inside the zip file, where they can be found in it, and where the zip file they are inside is to be found.
The prefixes of a URI tell you what sort of resource is being described by the URI, and the exclamation marks indicate where one type of resource is contained by another. For example, for 'Spreadsheet.xls', we can see that there is a file, C:/Folder/Archive.zip, with the prefix file:/. The exclamation mark (!) tells us that the spreadsheet is contained by the Archive.zip file, and the first prefix zip:/ tells us the type of the containment is a zip file. Note that spaces in URIs are encoded by '%20', and folder separators are always forward slashes. If zip files are contained inside zip files, inside zip files, more prefixes and exclamation marks are added as needed.
URIs mean that all resources profiled by DROID have a unique reference which tells you where the resource is, even if it is inside an archival file, inside another archival file, and so on. This is something that file paths cannot do. However, both are provided, as working with file paths is easier, where they exist for a resource.
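As a rough illustration of how such URIs are put together, the following Python sketch builds the values shown in the table from a Windows file path and a location inside a zip file. The function names `file_uri` and `member_uri` are hypothetical and this is not DROID's implementation; it only reproduces the pattern described above (a prefix, an exclamation mark for containment, '%20' for spaces and forward slashes as separators).

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def file_uri(path: str, is_folder: bool = False) -> str:
    """Build a 'file:/' URI from a Windows file path."""
    # Folder separators are always forward slashes in the URI.
    posix = PureWindowsPath(path).as_posix()
    uri = "file:/" + quote(posix, safe="/:")
    return uri + "/" if is_folder else uri


def member_uri(archive_uri: str, member: str, scheme: str = "zip") -> str:
    """Build the URI of a resource held inside an archival file."""
    # The exclamation mark separates the container from the contained resource.
    return f"{scheme}:/{archive_uri}!{quote(member, safe='/')}"


archive = file_uri(r"C:\Folder\Archive.zip")
print(archive)
print(member_uri(archive, "Another folder/Large picture.jpg"))
```

For a nested archival file, the sketch assumes the same pattern simply repeats, with the inner archival file's URI passed as `archive_uri` in a further call.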
File size | top |
The size of a file or archival file is recorded as the number of bytes used by the file. Files can have a size of zero (no content, just a record in the file system). Folders do not have a size.
The size of an archival file is the size of the archival file itself, not the sum of the sizes of its contents. For example, zip files compress their contents, so the sum of the sizes of the files inside a zip file will usually be bigger than the size of the archival file itself.
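The difference between the two sizes can be seen with Python's standard `zipfile` module. This is a minimal sketch for illustration only: the helper names and the sample content are made up, and the archive is built in memory rather than read from disk.

```python
import io
import zipfile


def build_sample_zip() -> bytes:
    """Create a small zip file in memory holding one highly compressible file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("report.txt", "DROID profiles files. " * 5000)
    return buffer.getvalue()


def archive_and_content_sizes(archive_bytes: bytes) -> tuple:
    """Return (size of the zip file itself, sum of uncompressed member sizes)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        content_size = sum(info.file_size for info in archive.infolist())
    return len(archive_bytes), content_size


archive_size, content_size = archive_and_content_sizes(build_sample_zip())
print(archive_size, content_size)
```

Here the first number corresponds to the file size recorded for the archival file, and the second to the total of the sizes recorded for the files inside it.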
Last modified date and time | top |
Most files, folders and archival files record the date and time on which they were last modified. This is not the same as the date a file was originally created, or the date on which a file was last read. Unfortunately, due to limitations in Java 6, DROID can only acquire the last modified date, even though the other dates may be present on the file system.
It is possible that not every file, folder or archival file will have a last modified date. For example, in some cases, resources inside archival files may not record this date.
It is important to note that last-modified dates can be changed when files are copied from one server to another, so this date may not reflect the last date a user actively modified the content of a file. Also, the content of a file (the data within it) may actually be older than the file itself if a file was copied, or simply typed up manually from an older piece of content.
Some files may have noticeably inaccurate dates, e.g. 1 Jan 1970. In this case, the files will be newer than indicated. This error will likely be caused by the battery failing on the internal clock of the computer from which the document was uploaded, or some other error which caused the date to be set incorrectly.
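If you want to flag such dates yourself, a check along the following lines may help. It is a sketch under stated assumptions, not part of DROID: 1 January 1970 is the zero point of Unix time, so the hypothetical helper `looks_like_reset_clock` treats any date within an arbitrarily chosen one-day tolerance of it as suspect.

```python
import os
from datetime import datetime, timedelta, timezone

# The zero point of Unix time.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_modified(path: str) -> datetime:
    """Read the last modified date and time of a file or folder, in UTC."""
    return datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)


def looks_like_reset_clock(modified: datetime,
                           tolerance: timedelta = timedelta(days=1)) -> bool:
    """Return True if the date is suspiciously close to 1 Jan 1970."""
    # Assumption: one day either side covers time zone offsets.
    return abs(modified - UNIX_EPOCH) <= tolerance


print(looks_like_reset_clock(datetime(1970, 1, 1, 0, 5, tzinfo=timezone.utc)))  # True
print(looks_like_reset_clock(datetime(2010, 6, 15, tzinfo=timezone.utc)))       # False
```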
Number of format identifications | top |
DROID attempts to identify the format of files, including archival files, but not folders. The number of identifications DROID records for a file can vary: a file can have no identifications, exactly one, or several possible identifications.
In the user interface, the number in brackets indicates the number of possible format identifications made. Clicking on the link will bring up a window showing all the identifications in a table. Multiple possible identifications can happen for three reasons.
File formats | top |
When DROID identifies a file format, it records four pieces of information:
The format name is simply a human-readable name given to a file format or family of file formats, for example, 'Microsoft Word'. The format version is the version of the format, for example '97-2003'. The PUID is a globally unique, persistent identifier for a file format and version, assigned by the National Archives through its PRONOM file format registry. For example, the PUID for the 'Microsoft Word 97-2003' file format is 'fmt/40'.
PUIDs are guaranteed never to change, although new PUIDs may be defined. Clicking on a PUID in DROID will take you to the relevant page for that file format on the National Archives PRONOM website. The website will also help you with some file format names that you may be unfamiliar with. In particular, you may see files identified as 'OLE2 Compound Document Format' (PUID fmt/111), which you can interpret as 'Microsoft Office generic'. In these cases, the file is a Microsoft Office file which DROID could not identify any more closely, but the file extension may indicate the format more precisely.
Finally, the mime-type is another scheme for identifying broad types of files in use on the internet. Mime-types are assigned by a body called the Internet Assigned Numbers Authority. They are quite broad classifications, so many different file formats will have the same mime-type. For example, the mime-type for 'fmt/40' is 'application/msword', which is shared by all other binary Microsoft Word formats.
Identification method | top |
DROID has three different methods of identifying file formats:
An 'extension' identification means that a format was identified purely on the basis of its file extension. Such an identification may not be reliable, as files can be named in any way, and extensions do not identify formats down to the version level, so such identifications can be quite broad, and may result in multiple identifications.
A 'signature' identification means that a format was identified by finding signature patterns inside the file which are known to occur in particular file formats and versions. This method is quite reliable, as it is fairly unlikely that by chance a file will happen to have a pattern belonging to a different file format than its own.
A 'container' identification means that a format was identified by finding embedded files (possibly with signatures of their own) inside the main file. For example, Microsoft Office 2007 word processing files are actually zip files containing xml files, images or other resources used in the document. A container identification would identify the main file as a Microsoft Office 2007 file, not a zip file. This method is very reliable, as not only does the broad type of container have to be identified (e.g. zip), but the zip file must then be opened, and files inside scanned for further identifications to be made. The original zip identification is removed, and replaced by the Office 2007 identification, on the basis of the files discovered within it.
Note that this is not the same as profiling files inside archival files, even though container-format files may be based on an archival format like zip. A container format is a single file format, whose specification relies on specific files being inside it to define the overarching format. An archival file format is a format whose only purpose is to contain other files, and the particular files inside it have no effect on its identification as an archival format.
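The three methods can be illustrated with a deliberately simplified Python sketch. It is not how DROID is implemented and does not use real PRONOM signatures: the byte patterns checked (the 'PK' bytes at the start of most zip files and '%PDF-' at the start of PDF files), the embedded file names used for the container step, the `EXTENSION_HINTS` table, the order of the checks and all function names are assumptions chosen for illustration.

```python
import io
import zipfile

# Illustrative values only: real signature data is far more detailed.
EXTENSION_HINTS = {"txt": "Plain text", "csv": "Comma-separated values"}


def identify(data: bytes, extension: str = "") -> tuple:
    """Return (identification method, format description) for file content."""
    if data.startswith(b"PK\x03\x04"):  # pattern at the start of most zip files
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            names = set()
        # Container step: specific embedded files replace the zip identification.
        if {"[Content_Types].xml", "word/document.xml"} <= names:
            return ("container", "Microsoft Word 2007 document")
        return ("signature", "ZIP archive")
    if data.startswith(b"%PDF-"):
        return ("signature", "PDF document")
    # Least reliable: rely on the file extension alone.
    if extension.lower() in EXTENSION_HINTS:
        return ("extension", EXTENSION_HINTS[extension.lower()])
    return ("none", "Unidentified")


def make_zip(member_names) -> bytes:
    """Build an in-memory zip file with placeholder members, for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, "<placeholder/>")
    return buffer.getvalue()


print(identify(make_zip(["[Content_Types].xml", "word/document.xml"])))
print(identify(make_zip(["notes.txt"])))
print(identify(b"some text", "txt"))
```

The first call reports a container identification rather than a zip identification, mirroring the replacement described above; the second remains a plain zip file because nothing inside it defines a more specific format.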
Content hash | top |
DROID can optionally generate a content hash of the contents of each file and archival file, using the industry standard 'MD5', 'SHA1' or 'SHA256' algorithms. A content hash is a short signature that can be used to identify the content of the file. It is extremely unlikely that two different files will have the same content hash (although this is a remote possibility).
Content hashes can be used to detect files with duplicate content, or can be linked to forensic hash databases to find or exclude files which are widely used (and therefore not unique to your organisation) or which contain illegal content. See "Detecting duplicate files" for more information.
Content hashing is turned off by default, as producing a hash requires reading the entire file, which will slow down DROID significantly.
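To show why hashing means reading the whole file, and how hashes reveal duplicates, here is a minimal sketch using Python's standard `hashlib` module. The function names and sample data are hypothetical, and the code works on in-memory streams for brevity; it is not DROID's implementation.

```python
import hashlib
import io
from collections import defaultdict


def content_hash(stream, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Hash a binary stream by reading it from start to end in chunks."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(hashes_by_name: dict) -> list:
    """Group names that share a content hash; return groups of two or more."""
    groups = defaultdict(list)
    for name, digest in hashes_by_name.items():
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical sample data standing in for files on disk.
samples = {
    "report.doc": b"quarterly figures",
    "copy of report.doc": b"quarterly figures",
    "notes.txt": b"meeting notes",
}
hashes = {name: content_hash(io.BytesIO(data)) for name, data in samples.items()}
print(find_duplicates(hashes))  # [['copy of report.doc', 'report.doc']]
```

To hash a real file, the same function can be given a file opened in binary mode, for example `open(path, "rb")`.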
Status | top |
As DROID profiles your files and folders, it records whether the profiling was successful or not. There are four different statuses which a file or folder can have:
Done | The file or folder was read successfully and any results found recorded. |
(icon) | The file or folder was moved or deleted before it could be profiled. |
(icon) | The operating system refused read access to DROID. You will have to grant read permission to those files or folders if you want DROID to profile them. |
(icon) | An error occurred while trying to read the file. You may be able to determine the cause of the error by examining DROID's log files. |
In the user interface, these status icons are overlaid on the files, folders and archival files as needed.
Case Sensitivity | top |
All text collected by DROID is treated case sensitively, so upper case and lower case text is regarded as different. This is due to limitations of the underlying database, which must either be entirely case sensitive, or entirely case insensitive. DROID requires some fields in its database to be case sensitive in order to operate properly, which means we cannot make only some information case insensitive, even where it might be more useful to do so.