PowerShell v1 v2 v3 Weirdness with File Formats

One of the systems we use at work requires a specific file format to work properly. By file format I mean PC versus Mac versus Unix: each convention marks the end of a line differently, so a new line in a Mac-format file is different from a new line in a Windows or Unix one. In a recent project this actually caused problems. Some reports I had modified were not displaying in our custom application, so I had to dig around and figure out what the deal was. In the end, I ran a test using TextPad's Save As feature, which lets you choose the File Format from a list of Mac, PC and Unix. To make analysis easier, I added the word test to each file twice, separated by a line break. After saving the files, I ran the following to see what the actual bytes were:
# Start afresh
Clear-Host

# Create storage array
[PSObject[]] $bytes = @()

# Get files
Get-ChildItem -Path 'C:\Data\Documents\Powershell\test\File Format' |
ForEach-Object {
      # Add new PSObject to array
      $bytes += New-Object -TypeName PSObject |
     
      # Add new field with file name
      Add-Member -Name Name `
            -MemberType NoteProperty `
            -Value $_.Name `
            -PassThru |
           
      # Add new field with content as bytes
      Add-Member -Name Bytes `
            -MemberType NoteProperty `
            -Value ([System.IO.File]::ReadAllBytes($_.FullName) -join ' ') `
            -PassThru |
     
      # Add new fields with content as hex
      Add-Member -Name Hex `
            -MemberType NoteProperty `
            -Value (([System.IO.File]::ReadAllBytes($_.FullName) | ForEach-Object { "{0:X2}" -f $_ } ) -join ' ') `
            -PassThru
}

# Return resulting array
$bytes | Format-List
When I run this script, it produces the following (the files differ only in their line-break bytes):
Name  : mac.txt
Bytes : 116 101 115 116 13 116 101 115 116 13
Hex   : 74 65 73 74 0D 74 65 73 74 0D

Name  : pc.txt
Bytes : 116 101 115 116 13 10 116 101 115 116
Hex   : 74 65 73 74 0D 0A 74 65 73 74

Name  : unix.txt
Bytes : 116 101 115 116 10 116 101 115 116 10
Hex   : 74 65 73 74 0A 74 65 73 74 0A
If you look closely at the hex values you can see the distinct differences.

  • The first format, Mac, uses 0x0D (ASCII 13, commonly referred to as CR or carriage return) as its end-of-line character. So anywhere there is a line break in a Mac-format file, you get a single 0x0D. This is the classic Mac OS convention.
  • The PC file shows one instance of 0x0D 0x0A between the two words, but nothing at the end. So Windows files use two characters per line break: 0x0D followed by 0x0A. The 0x0A paired with the 0x0D is ASCII 10, also known as LF or line feed.
  • Finally, Unix uses a single character, like the Mac, but it is 0x0A. The LF (line feed, ASCII 10) control character is the only character a Unix environment needs to start a new line.
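The byte patterns above are enough to classify a file mechanically. Here is a minimal sketch of that idea; Get-LineEndingStyle is a hypothetical helper name, not a built-in cmdlet, and the labels it returns are just my own choice:

```powershell
# Hypothetical helper: classify line endings by counting CR (13) and LF (10) bytes.
function Get-LineEndingStyle {
    param([byte[]] $Bytes)

    $crlf = 0; $cr = 0; $lf = 0
    for ($i = 0; $i -lt $Bytes.Length; $i++) {
        if ($Bytes[$i] -eq 13) {
            # A CR immediately followed by LF counts as one Windows line break
            if (($i + 1) -lt $Bytes.Length -and $Bytes[$i + 1] -eq 10) {
                $crlf++
                $i++   # skip the LF that belongs to this pair
            }
            else { $cr++ }
        }
        elseif ($Bytes[$i] -eq 10) { $lf++ }
    }

    if (($crlf + $cr + $lf) -eq 0)          { 'None' }
    elseif ($cr -eq 0 -and $lf -eq 0)       { 'PC (CRLF)' }
    elseif ($crlf -eq 0 -and $lf -eq 0)     { 'Mac (CR)' }
    elseif ($crlf -eq 0 -and $cr -eq 0)     { 'Unix (LF)' }
    else                                    { 'Mixed' }
}

# Usage with the bytes of pc.txt from the listing above
Get-LineEndingStyle -Bytes ([byte[]](116,101,115,116,13,10,116,101,115,116))
```

For a file on disk, the same function could be fed from [System.IO.File]::ReadAllBytes, as in the script earlier in the post.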
The interesting thing about PowerShell is that these particular characters have escape sequences you can use explicitly in scripts (which is what I ended up doing). The PowerShell escape sequences are:
  • CR:  `r
  • LF:  `n
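A quick way to confirm what these two sequences produce is to encode a string that contains them and print the bytes. This is only a small sketch, and the variable names are just for illustration:

```powershell
# Build a string with an explicit CR+LF between the two words
$sample = "test`r`ntest"

# Encode it as ASCII and format each byte as two-digit hex
$hex = [System.Text.Encoding]::ASCII.GetBytes($sample) |
    ForEach-Object { '{0:X2}' -f $_ }

$hex -join ' '
# 74 65 73 74 0D 0A 74 65 73 74
```

That is the same byte sequence the script reported for pc.txt.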
In my case, I was manually constructing some headers with special formatting, and I ended each line with `n:
$line = "Data Date: {0,-10} Report {1,-30}`n"
$line += "Customer: {2,-35} Output: {3,-15}`n"
$line -f (Get-Date -Format yyyy/MM/dd), "Daily analysis", "Bank of Banks", "PDF"
What I needed to do was end each line with `r`n instead of the `n I had used. Granted, this is a trivial detail, but it is important to note, particularly when you work with various systems and when cmdlets might handle details like this for you automatically. If, in the future, you run into some weird issues with encoding and file formats, remember that PowerShell's escape sequences can offer a lot, but you need to be sure you understand exactly what you need and what PowerShell gives you.
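If the strings are already built, another option is to normalize the line breaks afterwards rather than editing every literal. The following is a rough sketch under that assumption; ConvertTo-CrLf is a hypothetical helper name I made up for this example, not something PowerShell ships with:

```powershell
# Hypothetical helper: rewrite every line break (CRLF, lone CR, or lone LF) as CR+LF.
function ConvertTo-CrLf {
    param([string] $Text)

    # CRLF is listed first so an existing pair is matched once and not doubled.
    $Text -replace "`r`n|`r|`n", "`r`n"
}

# Same header template as above, still written with `n
$line  = "Data Date: {0,-10} Report {1,-30}`n"
$line += "Customer: {2,-35} Output: {3,-15}`n"
$formatted = $line -f (Get-Date -Format yyyy/MM/dd), "Daily analysis", "Bank of Banks", "PDF"

# Every line of the header now ends in CR+LF
$header = ConvertTo-CrLf -Text $formatted
```

Listing the two-character pair first in the pattern matters: with the alternatives in another order, an existing CR+LF would be treated as two separate breaks and come out doubled.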
