Programming Challenge

Given the first 1024 bytes of a file, how would you determine whether the file is binary or text? If you have a good answer, please respond with a comment.


( 13 comments — Leave a comment )
Jul. 26th, 2007 03:41 am (UTC)
Break it up into char-sized pieces and see if a suspicious number of them are non-alphabetic, non-punctuation and non-whitespace? Or better yet, non-printable?

It's not a great answer, but all I could think of off the top of my head.
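A minimal Python sketch of the heuristic described in this comment, assuming single-byte, ASCII-like input. The name `looks_binary`, the set of "texty" bytes, and the 30% threshold are illustrative assumptions, not something given in the thread:

```python
# Bytes counted as text: printable ASCII plus a few common control codes.
TEXT_BYTES = bytes(range(0x20, 0x7F)) + b"\t\n\r\f\b"

def looks_binary(chunk: bytes, threshold: float = 0.30) -> bool:
    """Guess whether chunk (e.g. the first 1024 bytes of a file) is binary."""
    if not chunk:
        return False  # assumption: an empty file is treated as text
    # translate(None, delete) drops every byte that appears in `delete`,
    # leaving only the suspicious (non-text) bytes.
    suspicious = chunk.translate(None, TEXT_BYTES)
    return len(suspicious) / len(chunk) > threshold
```

Because every byte above 0x7E counts as suspicious here, text in a non-ASCII encoding can easily be misclassified, which is the weakness the replies go on to discuss.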
Jul. 26th, 2007 08:27 pm (UTC)
How do you determine what char-sized pieces are, especially in light of UTF-8 (or UTF-16 or UTF-32)?
Jul. 26th, 2007 08:41 pm (UTC)
Sounds like you also don't know what system the file was created on. Are you trying to build something like a tool to analyze attachments?
Jul. 26th, 2007 08:49 pm (UTC)
The file is a random file which could have been created by anything. I cannot actually tell you where the files come from, but email attachments fit the bill. :)
Jul. 26th, 2007 01:05 pm (UTC)
Yeah, what raaga123 said. Look for bytes like \0 and for the high-order bit being set. If it is some Unicode encoding rather than ASCII, then it gets more complicated... you could check for invalid byte sequences, or for non-printables as with ASCII. If the text encoding could be any possible encoding known to man, then it is impossible.
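One way the "no \0, no invalid byte sequences" idea might look in Python, assuming the only encodings of interest are ASCII and UTF-8. The function name `is_plausible_utf8_text` is hypothetical. An incremental decoder is used so that a multi-byte character cut off at the end of the 1024-byte window is not reported as an error:

```python
import codecs

def is_plausible_utf8_text(chunk: bytes) -> bool:
    """Return True if chunk has no zero byte and is valid (possibly truncated) UTF-8."""
    if b"\x00" in chunk:
        return False  # a zero byte is treated as a sign of binary data
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False tolerates an incomplete sequence at the very end of the chunk
        decoder.decode(chunk, final=False)
    except UnicodeDecodeError:
        return False  # invalid byte sequence: not UTF-8 (and so not ASCII either)
    return True
```

This sketch does not look at non-printable control characters, and it rejects UTF-16 text outright, so it only covers the ASCII and UTF-8 case.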
Jul. 26th, 2007 08:29 pm (UTC)
With a pure ASCII file, looking for non-printables usually does the trick. But Unicode makes it really complicated. I was hoping my friends had already solved this. ;)

My first solution was simply to look for '\0', but as you pointed out, Unicode defeats that immediately. Oh well...
Jul. 26th, 2007 09:18 pm (UTC)
For the Unicode encodings I am familiar with, the only character that encodes to a series of bytes containing 0x00 is U+0000, which should appear in any reasonable text document.

That doesn't mean there isn't some non-Unicode character encoding out there that would allow 0x00... like perhaps a hypothetical zipped ASCII text encoding.

So, what is this program you are writing that can accept any arbitrary text encoding? Typically, programs that accept files in multiple text encodings require something particular in the file to identify the encoding (e.g. web browsers that require a content type tag), so you could look for that particular thing that identifies the encoding, and if you don't find it, then it isn't text.
Jul. 26th, 2007 09:19 pm (UTC)
Err, I mean "should not appear in any reasonable text document".
Jul. 26th, 2007 09:20 pm (UTC)
And actually, I'm not so sure about what I said above anyway. It is true for UTF-8, but not for UCS-2 (what Windows uses for wchar_t data).
Jul. 26th, 2007 09:22 pm (UTC)
But if your program is going to accept UTF-8, ASCII, UCS-2, etc... then the program has to have some way to identify the encoding.
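One common in-file marker of this kind is a Unicode byte order mark (BOM) at the start of the file. The sketch below is only an assumption about how such a check could look; the name `sniff_bom` is hypothetical, and many text files carry no BOM at all, so a result of `None` does not by itself mean the file is binary:

```python
import codecs
from typing import Optional

# UTF-32 marks are checked before UTF-16, because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(chunk: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if chunk.startswith(bom):
            return name
    return None
```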
Jul. 27th, 2007 03:51 pm (UTC)
I thought that with UCS-2 and UTF-16 (which I could be wrong about) the leading byte for standard ASCII characters was 0. I will have to look into that, as it may be a bad assumption.
Jul. 27th, 2007 04:15 pm (UTC)
You are correct.
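A quick Python check of that point. The sample string "Hi" is arbitrary; the zero byte comes first in the big-endian form and second in the little-endian form, while UTF-8 has no zero bytes for the same text:

```python
text = "Hi"

utf8 = text.encode("utf-8")          # b'Hi'
utf16_be = text.encode("utf-16-be")  # b'\x00H\x00i'
utf16_le = text.encode("utf-16-le")  # b'H\x00i\x00'

# A naive "any zero byte means binary" test passes UTF-8 text
# but rejects the same text stored as UTF-16.
for name, data in [("utf-8", utf8), ("utf-16-be", utf16_be), ("utf-16-le", utf16_le)]:
    print(name, "contains a zero byte:", b"\x00" in data)
```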
Jul. 27th, 2007 03:49 pm (UTC)
When I said '\0' I meant one byte which equals zero.