Information Blog

Powershell v1 v2 v3 Weirdness with File Formats

Posted by gilogo at 4:31 AM

One of the systems we use at work requires a specific file format to work properly. By file format I mean PC versus Mac versus Unix: different operating systems mark the end of a line differently, so a new line on a Mac is different from a new line on a Windows or a Unix machine. In a recent project this actually caused problems. Some reports I had modified were not displaying in our custom application, so I had to dig around and figure out what the deal was. In the end, I ran a test using TextPad's Save As feature, which lets you choose the File Format from a list of Mac, PC and Unix. To make analysis easier, I added the word test to each file twice, separated by a line break. After saving my files I ran the following to see what the actual bytes were:

# Start afresh
Clear-Host

# Create storage array
[PSObject[]] $bytes = @()

# Get files
Get-ChildItem -Path 'C:\Data\Documents\Powershell\test\File Format' |
ForEach-Object {
      # Add new PSObject to array
      $bytes += New-Object -TypeName PSObject |

      # Add new field with file name
      Add-Member -Name Name `
            -MemberType NoteProperty `
            -Value $_.Name `
            -PassThru |

      # Add new field with content as space-separated decimal byte values
      Add-Member -Name Bytes `
            -MemberType NoteProperty `
            -Value ([System.IO.File]::ReadAllBytes($_.FullName) -join ' ') `
            -PassThru |

      # Add new field with content as space-separated two-digit hex values
      Add-Member -Name Hex `
            -MemberType NoteProperty `
            -Value (([System.IO.File]::ReadAllBytes($_.FullName) | ForEach-Object { "{0:X2}" -f $_ }) -join ' ') `
            -PassThru
}

# Return resulting array
$bytes | Format-List

When I run this script it produces the following (the files differ only in their line-ending bytes):

Name  : mac.txt
Bytes : 116 101 115 116 13 116 101 115 116 13
Hex   : 74 65 73 74 0D 74 65 73 74 0D

Name  : pc.txt
Bytes : 116 101 115 116 13 10 116 101 115 116
Hex   : 74 65 73 74 0D 0A 74 65 73 74

Name  : unix.txt
Bytes : 116 101 115 116 10 116 101 115 116 10
Hex   : 74 65 73 74 0A 74 65 73 74 0A

If you look closely at the hex values you can see the distinct differences.

The first format, Mac, uses 0x0D (ASCII 13, the carriage return or CR) as its end-of-line character, so every line break in the file is a single 0x0D. This is the classic Mac OS convention; modern macOS uses the Unix line feed instead.
The PC file shows one instance of 0x0D 0x0A between the two words, but none at the end. Windows files use two characters per line break: 0x0D (CR) followed by 0x0A (ASCII 10, the line feed or LF). So any time you need a line break on Windows, you need both characters, in that order.
Finally, Unix, like the Mac, uses a single character, but a different one: 0x0A (ASCII 10, LF). The line feed alone is all a Unix environment needs to start a new line.
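
The three conventions can be told apart by counting bytes. Below is a minimal sketch of that idea; Get-LineEndingStyle is a hypothetical helper name I made up for illustration (it is not part of the script above), and the sample arrays are simply the byte values from the output listing.

```powershell
# Hypothetical helper, not part of the original script.
# Counts CR LF pairs, lone CRs and lone LFs in a byte array and names the style.
function Get-LineEndingStyle {
    param([byte[]] $Bytes)

    $crlf = 0
    $cr = 0
    $lf = 0

    for ($i = 0; $i -lt $Bytes.Length; $i++) {
        if ($Bytes[$i] -eq 13) {
            # A CR followed by an LF is the Windows pair; skip the LF so it is not counted twice
            if (($i + 1 -lt $Bytes.Length) -and ($Bytes[$i + 1] -eq 10)) {
                $crlf++
                $i++
            }
            else {
                $cr++
            }
        }
        elseif ($Bytes[$i] -eq 10) {
            $lf++
        }
    }

    if (($crlf + $cr + $lf) -eq 0) { return 'None' }
    if (($cr -eq 0) -and ($lf -eq 0)) { return 'PC' }
    if (($crlf -eq 0) -and ($lf -eq 0)) { return 'Mac' }
    if (($crlf -eq 0) -and ($cr -eq 0)) { return 'Unix' }
    return 'Mixed'
}

# Byte values copied from the listing above
$mac  = [byte[]] (116, 101, 115, 116, 13, 116, 101, 115, 116, 13)
$pc   = [byte[]] (116, 101, 115, 116, 13, 10, 116, 101, 115, 116)
$unix = [byte[]] (116, 101, 115, 116, 10, 116, 101, 115, 116, 10)

Get-LineEndingStyle -Bytes $mac
Get-LineEndingStyle -Bytes $pc
Get-LineEndingStyle -Bytes $unix
```

To check a real file, the same function could be fed the result of [System.IO.File]::ReadAllBytes for that file. The 'Mixed' result is there because files edited on several systems often end up with more than one convention.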

The interesting thing about PowerShell is that it has dedicated escape sequences for these particular characters, written with a backtick, which you can use explicitly in scripts (which is what I ended up doing). The PowerShell codes are:

CR: `r
LF: `n

In my case, I was manually constructing some headers and used this to end the line because I was doing some special formatting:

$line = "Data Date: {0,-10} Report {1,-30}`n"
$line += "Customer: {2,-35} Output: {3,-15}`n"
$line -f (Get-Date -Format 'yyyy/MM/dd'), 'Daily analysis', 'Bank of Banks', 'PDF'

What I needed to do was end each line with `r`n instead of the `n I had used. Granted, this is a trivial detail, but it is important to note, particularly when you work with various systems and some cmdlets handle details like this automatically. If, in the future, you run into weird issues with encoding and file formats, remember that PowerShell's escape characters can offer a lot, but you need to be sure you understand exactly what you need and what PowerShell gives you.
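
For illustration, here is a rough sketch of the corrected header, plus one way to normalize text that already contains other line endings. The variable names and the ConvertTo-CrLf helper are my own assumptions for this example, not something the target system or PowerShell provides.

```powershell
# Keep the line ending in one place so every header line uses the same one
$crlf = "`r`n"

$line  = "Data Date: {0,-10} Report {1,-30}" + $crlf
$line += "Customer: {2,-35} Output: {3,-15}" + $crlf
$header = $line -f (Get-Date -Format 'yyyy/MM/dd'), 'Daily analysis', 'Bank of Banks', 'PDF'

# Hypothetical helper: rewrite CR LF, lone CR or lone LF as CR LF.
# The alternation tries the two-character pair first so it is not doubled.
function ConvertTo-CrLf {
    param([string] $Text)
    $Text -replace '\r\n|\r|\n', "`r`n"
}

# Show the last two bytes of the header as hex to confirm how it ends
$headerBytes = [System.Text.Encoding]::ASCII.GetBytes($header)
($headerBytes[-2..-1] | ForEach-Object { "{0:X2}" -f $_ }) -join ' '
```

Checking the bytes rather than the displayed text is the point here: a console shows the same line break for all three conventions, so only the byte view tells you what was actually written.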

High Definition Scary Halloween Desktop Wallpaper

Posted by gilogo at 2:09 PM Labels: computer, definition, desktop, halloween, high, scary, wallpaper

TYBSc IT Sem 5 Exam Time Table 2010 Regular

Posted by gilogo at 1:25 PM Labels: 2010, 5, computer, exam, it, regular, sem, table, time, tybsc


A PPT Presentation on Web 2 0

Posted by gilogo at 11:18 PM Labels: 0, 2, a, computer, on, ppt, presentation, web

A few points covered in this PPT are as follows:

Intro to Web
Terminology
What is Web 2.0
History of Web 2.0
Need for Web 2.0
What makes the Difference?
Then & Now
Current Scenario
Characteristics
Web-based applications and desktops
The Future of Close (web3.0)

Submitted by: Elton Jain
College: KC College, Mumbai

Team Members:

Elton Jain
Nayab Shaikh
Rishabh

Download the ppt file here:

web2.0 ppt

Making Blockly Universally Accessible

Posted by gilogo at 2:46 PM Labels: accessible, blockly, computer, making, universally

Posted by Neil Fraser, Chief Interplanetary Liaison

We work hard to make our products accessible to people everywhere, in every culture. Today we’re expanding our outreach efforts to support a traditionally underserved community -- those who call themselves "tlhIngan."

Google's Blockly programming environment is used in K-12 classrooms around the world to teach programming. But the world is not enough. Students on Qo'noS have had difficulty learning to code because most of the teaching tools aren't available in their native language. Additionally, many existing tools are too fragile for their pedagogical approach. As a result, Klingons have found it challenging to enter computer science. This is reflected in the fact that less than 2% of Google engineers are Klingon.

Today we launch a full translation of Blockly in Klingon. It incorporates Klingon cultural norms to facilitate learning in this unique population:

Blockly has no syntax errors. This reduces frustration, and reduces the number of computers thrown through bulkheads.
Variables are untyped. Type errors can too easily be perceived as a challenge to the honor of a student's family (and we’ve seen where that ends).
Debugging and bug reports have been omitted; our research indicates that in the event of a bug, they prefer the entire program to just blow up.

Get a little keyboard dirt under your fingernails. Learn that although ghargh is delicious, code structure should not resemble it. And above all, be proud that tlhIngan maH. Qapla'!

You can try out the demo here or get involved here.

Another Spam email won the sum of £860 000 00 GBP UK NATIONAL LOTTERY INC

Posted by gilogo at 12:04 PM Labels: 00, 000, £860, another, computer, email, gbp, inc, lottery, national, of, spam, sum, the, uk, won

After I replied to the Damba Barous email, I got a second spam email, which I would like to share with you all.
The email was as follows:

Congratulation!!!You have won £860,000.00 GBP?
From: Telecap Stewart (info@lottery.co.uk)
Sent: March 2009 17:16PM
To:
------------------------------------
This electronic mail is to inform you that you have won the sum of £860,000.00 GBP [EIGHT HUNDRED AND SIXTY THOUSAND POUNDS STERLING] in the just concluded UK National Lottery Official On-Line Draw held on the first of January 2009 in London.The result of our mputer draw (#078) selected your email address attached to:

E -ticket Number: 50075611546 109
Batch Number: 074/05/ZY369
Reference Number: UK/9420X/68

Which subsequently won you the lottery as the 2nd prize winner in the 2nd category i.e. match 5 plus bonus. You have therefore been approved to claim a total sum of £860,000.00 GBP [EIGHT HUNDRED AND SIXTY THOUSAND POUNDS STERLING] in cash credited to file KTU/0023118308/08.
Contact your claims officer below to process/forward your prize to you.
***************************************
Name: Mr. Mark Gerald
Contact E-mail: mak_gerald00@yahoo.com.hk
Telephone: +44-704-573-1869
***************************************

Please provide him with the below information for Verification:
E -ticket Number: 50075611546 109
=========================
Full Names:
Address:
Date of Birth:
Telephone/Fax number:
Nationality:
Marital Status:
Age:
Occupation:
CONGRATULATIONS!!!

Mrs. Telecap Stewart,
Online Co-ordinator,
UK NATIONAL LOTTERY INC.
Copyright © 2009 UK National Lottery Award.

--

This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.

Teaching machines to read between the lines and a new corpus with entity salience annotations

Posted by gilogo at 12:02 PM Labels: a, and, annotations, between, computer, corpus, entity, lines, machines, new, read, salience, teaching, the, to, with

Posted by Dan Gillick, Research Scientist, and Dave Orr, Product Manager

Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.

Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.

We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people -- we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.

One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a 581-word article, and compare that to the usual frequency of “coach” -- more like 5 in 330,000 words -- you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea behind the famous TF-IDF, long used to index web pages.
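
As a worked version of the “coach” arithmetic, here is a small sketch. Get-TermRatio is a hypothetical helper written for this example; it computes a plain frequency ratio in the spirit of the comparison above, not the standard TF-IDF formula (which typically uses a logarithm of inverse document frequency).

```powershell
# Hypothetical helper: how many times more frequent a term is in one document
# than in ordinary text, given raw counts and word totals for each.
function Get-TermRatio {
    param(
        [double] $CountInDocument,
        [double] $DocumentLength,
        [double] $BackgroundCount,
        [double] $BackgroundLength
    )

    $documentRate   = $CountInDocument / $DocumentLength
    $backgroundRate = $BackgroundCount / $BackgroundLength
    $documentRate / $backgroundRate
}

# "coach": 5 times in a 581-word article versus about 5 in 330,000 words normally
$ratio = Get-TermRatio -CountInDocument 5 -DocumentLength 581 -BackgroundCount 5 -BackgroundLength 330000
[math]::Round($ratio)
```

With these numbers the ratio comes out to roughly 568, so “coach” appears several hundred times more often in the article than in ordinary text.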

Congratulations to Becky Hammon, first female NBA coach! Image via Wikipedia.

Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world which we already know quite a bit about.

Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once.

To encourage research on leveraging background information, we are releasing a large dataset of annotations to accompany the New York Times Annotated Corpus, including resolved Freebase entity IDs and labels indicating which entities are salient. The salience annotations are determined by automatically aligning entities in the document with entities in accompanying human-written abstracts. Details of the salience annotations and some baseline results are described in our recent paper: A New Entity Salience Task with Millions of Training Examples (Jesse Dunietz and Dan Gillick).

Since our entity resolver works better for named entities like WNBA than for nominals like “coach” (this is the notoriously difficult word sense disambiguation problem, which we’ve previously touched on), the annotations are limited to names.

Below is sample output for a document. The first line contains the NYT document ID and the headline; each subsequent line includes an entity index, an indicator for salience, the mention count for this entity in the document as determined by our coreference system, the text of the first mention of the entity, the byte offsets (start and end) for the first mention of the entity, and the resolved Freebase MID.
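
To make the per-entity field layout concrete, here is a rough parsing sketch. Several things in it are assumptions rather than facts about the released files: that the fields are tab-separated, that they appear in exactly the order listed above, and that a salience indicator of 1 means salient. ConvertFrom-SalienceLine is a hypothetical helper name, and the sample line is an invented placeholder, not real data from the corpus.

```powershell
# Hypothetical helper. Assumes one entity per line with 7 tab-separated fields:
# index, salience flag, mention count, first mention text, start offset, end offset, Freebase MID.
function ConvertFrom-SalienceLine {
    param([string] $Line)

    $fields = $Line -split "`t"
    if ($fields.Count -ne 7) {
        throw "Expected 7 tab-separated fields but found $($fields.Count)."
    }

    New-Object -TypeName PSObject -Property @{
        EntityIndex  = [int] $fields[0]
        IsSalient    = ($fields[1] -eq '1')   # assumption: 1 marks a salient entity
        MentionCount = [int] $fields[2]
        FirstMention = $fields[3]
        StartOffset  = [int] $fields[4]
        EndOffset    = [int] $fields[5]
        FreebaseMid  = $fields[6]
    }
}

# Invented placeholder line, not taken from the dataset
$sample = "0`t1`t3`tExample Entity`t10`t24`t/m/placeholder"
$entity = ConvertFrom-SalienceLine -Line $sample
$entity.FirstMention
```

The field-count check is deliberate: if the real files use a different separator or layout, the function fails loudly instead of silently producing misaligned values.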

Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect a system that uses background information to expose what’s implicit could give better results.

Download the data directly from Google Drive, or visit the project home page with more information at our Google Code site. We look forward to seeing what you come up with!
