PowerShell v3 Check File Headers

The following function is a quick way to validate a set of file signatures. The files were provided to us in a project with a single, incorrect extension. With about 3.8 million files to process, there was no way I wanted to check them manually. So I narrowed down the headers to the usual suspects for a project like this (.pdf and .tif) and wrote this function:
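function Check-Header
{
       param(
             [string] $path
       )

       # Hexadecimal signatures for expected files, stored as strings so they
       # can be compared with the hex string built from the file header
       $pdf    = '25504446'   # "%PDF"
       $TIFF_1 = '492049'     # "I I" (3-byte signature)
       $TIFF_2 = '49492A00'   # "II*" + NUL, little-endian TIFF
       $TIFF_3 = '4D4D002A'   # "MM" + NUL + "*", big-endian TIFF
       $TIFF_4 = '4D4D002B'   # "MM" + NUL + "+", big-endian BigTIFF

       # Get content of each file (up to 4 bytes) for analysis.
       # -Encoding Byte is the Windows PowerShell form; PowerShell 6+ uses -AsByteStream.
       $HeaderAsHexString = ''
       ([Byte[]] $fileheader = Get-Content -Path $path -TotalCount 4 -Encoding Byte) |
       ForEach-Object {
             if(("{0:X}" -f $_).length -eq 1)
             {
                   # Pad single-digit values so every byte yields two hex digits
                   $HeaderAsHexString += "0{0:X}" -f $_
             }
             else
             {
                   $HeaderAsHexString += "{0:X}" -f $_
             }
       }

       # Validate file header: exact match for the 4-byte signatures,
       # prefix match for the 3-byte one
       (@($pdf, $TIFF_2, $TIFF_3, $TIFF_4) -contains $HeaderAsHexString) -or
       ($HeaderAsHexString -like "$TIFF_1*")
}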
This function does a few things:

  1. Takes a file path argument
  2. Declares five known signatures (these are the headers we want files to have)
  3. Reads the first 4 bytes of the file into a [Byte[]] array
  4. Passes this byte array to a simple if/else statement that converts each byte to a two-digit hexadecimal string
  5. Checks whether the converted file signature matches any of the known good signatures
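The byte-to-hex conversion in step 4 can also be written more compactly with the "X2" format specifier, which always emits two hex digits per byte. The following is only a sketch of that alternative; ConvertTo-HexString is a hypothetical helper name and is not part of the function above.

```powershell
# Hypothetical helper, not part of the original function:
# formats each byte as exactly two uppercase hex digits and joins them.
function ConvertTo-HexString
{
    param(
        [Byte[]] $Bytes
    )

    # "X2" pads single-digit values with a leading zero, e.g. 0 -> "00"
    ($Bytes | ForEach-Object { '{0:X2}' -f $_ }) -join ''
}

# The first four bytes of a PDF file are the ASCII characters "%PDF"
ConvertTo-HexString -Bytes ([Byte[]] (0x25, 0x50, 0x44, 0x46))   # 25504446

# Big-endian TIFF header, including a zero byte that needs padding
ConvertTo-HexString -Bytes ([Byte[]] (0x4D, 0x4D, 0x00, 0x2A))   # 4D4D002A
```

The padding matters because a zero byte would otherwise contribute a single "0" and shift every digit after it, so the string would no longer line up with an eight-character signature.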
If the file's header matches one of the known signatures, Check-Header returns true; otherwise it returns false. On a directory of 1024 files this took just over 3.9 seconds on my test server. With a straight run, I anticipate my 3.8 million file collection taking just a shade more than 4 hours. I will be doing some other manipulation, so it will be considerably slower, but cases like this go to show there is no substitute for a good automated solution.
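For anyone who wants to check the arithmetic behind that estimate, here is a rough sketch. Get-RunEstimate is a hypothetical helper of my own naming, and it assumes the per-file cost stays constant as the collection grows, which a real run may not honour.

```powershell
# Hypothetical helper: scales a measured sample run up to a larger collection,
# assuming the per-file cost stays constant.
function Get-RunEstimate
{
    param(
        [double] $SampleSeconds,
        [int]    $SampleFiles,
        [long]   $TotalFiles
    )

    $secondsPerFile = $SampleSeconds / $SampleFiles
    [TimeSpan]::FromSeconds($secondsPerFile * $TotalFiles)
}

# Figures from the test run described above
$estimate = Get-RunEstimate -SampleSeconds 3.9 -SampleFiles 1024 -TotalFiles 3800000
$estimate.TotalHours   # a little over 4

# To measure a sample of your own (the folder path is a placeholder):
# $elapsed = Measure-Command {
#     Get-ChildItem -Path 'C:\Samples' -File |
#         ForEach-Object { Check-Header -path $_.FullName }
# }
# Get-RunEstimate -SampleSeconds $elapsed.TotalSeconds -SampleFiles 1024 -TotalFiles 3800000
```

At 3.9 seconds per 1024 files, each file costs roughly 3.8 milliseconds, which is where the four-hour figure comes from.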
