Detecting duplicate files
It is very common to find that files are duplicated in different areas of your filing systems. Some estimates show that around 30% of all file storage consists of duplicate files. This can happen because many users save the same files from email attachments, or they take a backup copy of files while they are working on them, but don't end up changing most of them.
One method of duplicate detection is to use content hashes. If two files have the same hash value, then they are overwhelmingly likely to have identical content. The odds of two arbitrary files having the same hash value by accident are less than 1 in 18,000,000,000,000,000,000, which is extremely low. (These odds are for MD5; with SHA1, and then SHA256, the odds become orders of magnitude lower still.) DROID can generate content hashes for your files, but note that DROID will not locate files with the same hash value for you; it only generates the hashes in the first place. If you export your profiles to a CSV file and import them into software like Excel or Access, you can query for files which have the same hash.
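The same query can also be run with a short script instead of Excel or Access. The following is a minimal sketch, not part of DROID: it reads an exported CSV file and groups file paths by hash value. The column names `MD5_HASH` and `FILE_PATH` are assumptions about the export header, so check the first row of your own export and pass different names if they do not match.

```python
import csv
from collections import defaultdict


def find_duplicates(csv_path, hash_column="MD5_HASH", path_column="FILE_PATH"):
    """Group file paths by hash value and keep only groups of two or more.

    hash_column and path_column are assumed column names; adjust them
    to match the header row of your own CSV export.
    """
    groups = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            digest = (row.get(hash_column) or "").strip().lower()
            if digest:  # rows without a hash (for example folders) are skipped
                groups[digest].append(row.get(path_column) or "")
    # A hash shared by more than one path indicates likely duplicate content.
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates("export.csv")`, where `export.csv` is a hypothetical file name, returns a dictionary mapping each shared hash value to the list of paths that carry it.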
Another method of locating duplicate files (without using hashes) is to search for folder names containing words like 'backup', 'temp' and 'old', as users frequently name folders or files with these words if they intend them to be temporary copies. Another, more time-consuming, method is to examine the names of files and folders. If there are areas with very similar (or identical) names, then you may have duplicate information within them. However, both of these methods can only give you an indication that there may be duplication, and a high degree of manual review will still be required to assure yourself that the file contents are really duplicated.
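The folder name search can be sketched as a short script. This is an illustration only, with hypothetical function names; it walks a directory tree and lists folders whose names contain one of the three words. The words are matched only when they are not surrounded by other letters, so that names such as 'folder' or 'template' are not flagged. That matching rule is an assumption and also means plurals such as 'backups' are missed.

```python
import os
import re

# Match 'backup', 'temp' or 'old' only when not embedded in a longer word,
# so "old_reports" and "Backup 2019" match but "folder" and "template" do not.
SUSPECT_NAME = re.compile(r"(?<![a-z])(?:backup|temp|old)(?![a-z])", re.IGNORECASE)


def looks_temporary(name):
    """Return True if a file or folder name contains one of the suspect words."""
    return SUSPECT_NAME.search(name) is not None


def suspect_folders(root):
    """Walk the tree under root and return the paths of suspect folders."""
    hits = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for dirname in dirnames:
            if looks_temporary(dirname):
                hits.append(os.path.join(dirpath, dirname))
    return sorted(hits)
```

The result is only a list of candidates for manual review; a matching name says nothing about whether the contents are actually duplicated.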
If you do find duplicate files, you must decide how to deal with them. Clearly you will need to keep at least one of them, but you will have to decide which, if any, can be safely removed. There are risks to digital continuity in deleting files, so you should take into account several considerations before deleting duplicates:
You can mitigate some of the risks related to loss of context by leaving shortcuts (or symbolic links in a UNIX file system) to the 'master file' when you delete a duplicate.
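On a UNIX file system, leaving a symbolic link behind could look like the sketch below. It is an illustration under stated assumptions, not a DROID feature: the function name is hypothetical, and it refuses to act unless a byte-by-byte comparison confirms that the two files are identical. Creating symbolic links on Windows may require additional privileges.

```python
import filecmp
import os


def replace_with_symlink(duplicate, master):
    """Replace `duplicate` with a symbolic link pointing at `master`.

    The duplicate is deleted, so the contents are compared first and
    nothing is changed if they differ.
    """
    if os.path.samefile(duplicate, master):
        raise ValueError("duplicate and master are the same file")
    # shallow=False forces a full comparison of the file contents.
    if not filecmp.cmp(duplicate, master, shallow=False):
        raise ValueError("file contents differ; refusing to replace")
    os.remove(duplicate)
    os.symlink(os.path.abspath(master), duplicate)
```

Because the duplicate is removed, the considerations discussed above should be worked through before running anything like this on real data.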
What are hashes?
Hashes are long numbers, often represented as hexadecimal text, which can be used as a signature to identify the content of a file. DROID can generate hashes called "MD5", "SHA1", or "SHA256", which are fairly fast to produce (relative to other hashing algorithms).
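To make the idea concrete, the sketch below computes the three kinds of hash for a single file using Python's standard `hashlib` module. It is not DROID's own implementation, and the function name is hypothetical; because the algorithms are standardised, the hexadecimal values should be comparable with those produced by other tools.

```python
import hashlib


def file_hashes(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1024 * 1024):
    """Return {algorithm name: hexadecimal hash} for the file at `path`."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read in chunks so that large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

The hexadecimal text grows with the strength of the algorithm: 32 characters for MD5, 40 for SHA1 and 64 for SHA256.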
Hashes are useful to locate duplicates in the files you profile, and to match with common files which have published hash values. However, MD5 hashes are not resistant to malicious attack - an attacker can create files which have the same hash but with different content. The goal of hashing in DROID is not to provide a cryptographic assurance of uniqueness, only to locate likely duplicates and to link to forensic hash databases (most of which use MD5). SHA1 and SHA256 are more recent and more secure than MD5, but should still not be taken to provide an absolute guarantee of uniqueness.
Forensic hash databases are published databases of hash values for files which are widely found. These can allow you to detect whether the files in your systems are common, well known files (such as Windows system files), whether they contain known illegal content, and in some cases, malware such as viruses. Knowing which files are well known outside your organisation can support information policy and decision making. For example, you may discover that a lot of storage space is being taken up with multiple copies of files which are easily replaced from install CDs. Files which are not well known probably contain unique content, and would be hard to replace if deleted.
There are a variety of content hash databases available. One such database is:
Note that DROID does not link your files to these hash databases. It merely generates a compatible hash for each of your files, which you can then use to link with them. You will require additional technical assistance to perform these links.
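As a rough idea of what such a link involves, the sketch below splits a set of profiled files into those whose hash appears in a list of known hash values and those whose hash does not. Everything here is an assumption made for illustration: the function name, the input shapes, and the idea that the known hashes have already been extracted from a database into plain hexadecimal strings. Real hash databases have their own file formats, which are not covered here.

```python
def split_by_known_hashes(hashes_by_path, known_hashes):
    """Partition files into well-known and not-known lists.

    hashes_by_path: dict mapping a file path to its hexadecimal hash.
    known_hashes:   iterable of hexadecimal hashes taken from a hash database
                    (must use the same algorithm as hashes_by_path).
    """
    # Normalise case and whitespace so that "ABC" and "abc" compare equal.
    known = {value.strip().lower() for value in known_hashes}
    well_known, not_known = [], []
    for path, digest in hashes_by_path.items():
        if digest.strip().lower() in known:
            well_known.append(path)
        else:
            not_known.append(path)
    return sorted(well_known), sorted(not_known)
```

Files in the second list are the ones more likely to contain unique content.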