Monday, 30 April 2012

Practical Unicode–Part 1

This post is not intended to be an exhaustive essay on Unicode, its standards and how it works; it is the result of working with multiple systems that transfer data in various formats, and of trying to get that data into SQL Server without corruption.
Environment Background
As previously mentioned, this is based on my experience of working with multiple source systems and tools in a Windows environment.
Overview of Unicode and Why We Have It
When we first started storing text (ignoring EBCDIC) the general convention was that one byte (8 bits) represented a single character: the alphabet and common symbols were mapped to values 32 to 127 (see the ASCII table), and values 128 to 255 were considered the extended character set. To represent different alphabets and commonly used accented characters we ended up with code pages, meaning that value x could represent character y in one code page but character z in another. This was a pain with websites, as unless the code page was declared you could not be confident that you were displaying the characters correctly.
As a result we now have Unicode, where a single character may be represented by multiple bytes (a simplification: glyphs and combining characters are not covered here). As with all things standards-based, Unicode isn't just a monolithic entity; there are various flavours and things to note.
Common ‘Flavours’
UTF-8
This is pretty much the de facto standard for storing text in a manner which can be read by almost everyone. Where a character can be represented by ASCII it is held as a single byte; other characters take 2, 3 or 4 bytes.
While there is only one correct way to read the bytes, and thus no need for a Byte Order Mark (BOM), it is common to find the bytes 0xEF 0xBB 0xBF at the beginning of some files. The BOM can be removed using the following binmay command:
binmay -s "EF BB BF" -r ""
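
If binmay isn't to hand, a few lines of C# will do the same job. This is a minimal sketch which assumes the file is small enough to read into memory in one go; the path is just an example.

using System;
using System.IO;

class StripUtf8Bom
{
    static void Main(string[] args)
    {
        // Example path only - pass your own file as the first argument.
        string path = args.Length > 0 ? args[0] : "input.txt";
        byte[] bytes = File.ReadAllBytes(path);

        // A UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        {
            byte[] withoutBom = new byte[bytes.Length - 3];
            Array.Copy(bytes, 3, withoutBom, 0, withoutBom.Length);
            File.WriteAllBytes(path, withoutBom);
        }
    }
}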
UCS2 Little Endian
This is the only Unicode format SQL Server can use for input via BCP or Bulk Insert; any data exported using BCP where there are NVARCHAR fields will be in this format.
Unlike UTF-16, which stores a character as either one or two 16-bit words, UCS2 always stores a character as a single 16-bit word.
A UCS2 Little Endian file will have a two byte Byte Order Mark consisting of 0xFF 0xFE; if it were a UCS2 Big Endian file then the BOM would be 0xFE 0xFF.
UTF-16 is therefore a variable-length encoding, while UCS2 is fixed at two bytes per character.
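
As a sketch of how a UTF-8 extract can be converted into the format BCP expects, the following reads a UTF-8 file and rewrites it as little endian UTF-16 using the standard .NET encodings (Encoding.Unicode is little endian UTF-16 and writes the FF FE BOM by default); the file names are examples only.

using System.IO;
using System.Text;

class Utf8ToUtf16Le
{
    static void Main()
    {
        // Example paths - substitute your own input and output files.
        using (var reader = new StreamReader("extract_utf8.txt", Encoding.UTF8))
        using (var writer = new StreamWriter("extract_ucs2le.txt", false, Encoding.Unicode))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line);
            }
        }
    }
}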
UTF-32
In UTF-32 every character is stored as a single 32-bit (four byte) value; this means it is not space efficient, but it is very simple to parse. At the moment very few systems support it, and with Python you have to choose the option at compile time.
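
To make the space trade-off concrete, here is a quick comparison of how many bytes the same short string needs in each encoding; the exact counts depend on the characters involved.

using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string text = "Héllo";  // one accented character, four ASCII characters

        Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 6  - the accented character takes 2 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 10 - 2 bytes per character (UTF-16 LE)
        Console.WriteLine(Encoding.UTF32.GetByteCount(text));   // 20 - 4 bytes per character
    }
}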

Thursday, 22 December 2011

Timestamps in Filenames when Using DOS Batch


To get a nicely formatted timestamp in a filename, e.g. ftpTransfer_20111214@0134.txt, when using DOS batch scripts to control execution, you can use the following.

@ECHO OFF
::Get Date - assumes DATE /T returns dd/mm/yyyy
for /f "tokens=1-3 delims=/" %%a in ('DATE/T') do (
set dd=%%a
set mm=%%b
set yyyy=%%c
)

Call :TRIMYEAR %yyyy%
::odate forms the date part of the filename, e.g. 20111214
SET odate=%yyyy%%mm%%dd%
SET WFLWDATE=%odate%

::Get Time - assumes TIME /T returns a 24 hour hh:mm
::(%mm% is reused for minutes, which is why odate is built first)
for /f "tokens=1-3 delims=:" %%a in ('time/t') do (
  set hh=%%a
  set mm=%%b
  set ss=%%c
)

SET WFLWTIME=%hh%%mm%

::Create TimeStamp, e.g. 20111214@0134
SET TMSTMP=%odate%@%hh%%mm%

::Do your stuff here, e.g. write to ftpTransfer_%TMSTMP%.txt

GOTO :EOF

::Strips any stray spaces from the year token
:TRIMYEAR
set yyyy=%*
GOTO :EOF

Thursday, 8 December 2011

Excel Output Libraries

In previous posts I've discussed using the CarlosAg Excel Writer library, which outputs in XML format, and have found it to be very usable. However, a few issues were bugging me: largely the difficulty of working with large datasets, due to the file size, and the fact that the library looks like it is no longer under active development.

I then looked at ExcelLibrary, which had some recommendations but has almost no documentation, and the defects page gives the impression that it is no longer being actively developed. This isn't as much of a problem as it was with the CarlosAg library, but it was worth looking further afield.

A post on StackOverflow linked to ClosedXML, which had plenty of documentation and source available for modification; even better, it appeared to be under active development.

To do a basic test of the library I wrote a small program to do something I've been meaning to do for a while.

Given a Datalink file and a file containing a list of SQL statements, generate an Excel file with a worksheet for each statement. I also used this as an opportunity to improve how I handle command line arguments, by using the CommandLineParser utility for XSD2DB.

The full code of the program I wrote to test this can be found on GitHub and a zip of the binaries can be downloaded.

At this point I've gone for the very simplest model: a new Worksheet is created using a DataTable as the source, and formatting is left as a manual operation.
As you can see, it's incredibly simple:


    using System.Data;
    using System.IO;
    using ClosedXML.Excel;

    class DataTableToExcel
    {
        private DataTable dt;
        private XLWorkbook wb;
        private string filename;

        public DataTableToExcel(DataTable dt, string filename)
        {
            this.dt = dt;
            this.filename = filename;
            // Append to an existing workbook if there is one, otherwise start a new one
            if (File.Exists(filename))
            {
                wb = new XLWorkbook(filename);
            }
            else
            {
                wb = new XLWorkbook();
            }
        }

        public void Process()
        {
            // Add the DataTable as a new worksheet, name it after its position and save
            var ws = wb.Worksheets.Add(dt);
            ws.Name = "Sheet " + wb.Worksheets.Count;
            wb.SaveAs(filename);
        }
    }
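
As a rough usage sketch (the connection string, query and file name are placeholders, not the values the real program uses), filling a DataTable from SQL Server and handing it to the class above looks something like this:

    using System.Data;
    using System.Data.SqlClient;

    class Example
    {
        static void Main()
        {
            // Placeholder connection string and query - substitute your own.
            using (var connection = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true"))
            using (var adapter = new SqlDataAdapter("SELECT TOP 100 * FROM dbo.MyTable", connection))
            {
                var dt = new DataTable("Results");
                adapter.Fill(dt);

                var writer = new DataTableToExcel(dt, "output.xlsx");
                writer.Process();
            }
        }
    }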

Planned Updates
In time I want to be able to add the following features by having a more complex configuration file.

  • Associating a name with a query so that sheets are correctly named
  • Running the queries in parallel and passing the datatables to the generator when necessary
  • Better error handling of bad statements
  • Support for auto inserting date and timestamp in the output file name

Wednesday, 7 December 2011

Saving Flash Video & Audio Using JCopia


Background
One of the most common ways to demonstrate software is to record the screen and capture it into a Flash video. This technique is used in both the commercial and educational spheres, but it does mean you may come across the following situations.

* The video is no longer being hosted when you return to it at a later date.
* The video is inaccessible from your classroom or when giving a presentation.

Over the years I've tried various browser-based extensions, which haven't always worked, or whose user interfaces have been so poor that I could not recommend them to non-technical friends.

While searching for a more mature solution I came across JCopia by Jiteco, a competitively priced application which can save both Flash video files and MP3s from websites.

JCopia
The application installer is a standard MSI and, as expected, there were no problems installing it. Upon launch you are presented with an intuitive interface which differentiates between music and video and also allows you to track completed and pending downloads.

Using the application is incredibly simple, as it starts off in monitoring mode; as soon as I opened my browser, went to YouTube and selected a video, it immediately started downloading it. The browser can be closed at this point and the download will continue; it can also be paused and restarted, which is very handy for large videos.

More advanced users of the application can add their own MIME types and extensions, which means that it's a moderately simple task to download new formats as they are added and to adjust to other websites. The other advantage of this approach is that, unlike many browser-based solutions, this software is not tied to particular sites and will allow you to save Flash files from many sources.

JCopia is made by Jiteco and can be found here.

Monday, 28 November 2011

Converting Bitmaps to Excel

Introduction
After having dinner I was browsing the internet and noticed someone making a joke about wasting time in an Excel course by using the grid to draw an image. As I'm interested in learning more about Excel automation, I thought this would be an ideal learning exercise, since working with bitmaps is already simple.

Excel 2010 Addin
My initial approach was to write an add-in, driven by a button on the ribbon, which would allow the user to select a file, convert it to an array and set the colour for each cell. I succeeded in this goal but came to the following conclusions

  1. I am missing something quite fundamental about COM interoperability with Excel, or it really is needlessly irritating.
  2. It performed terribly so I'm clearly doing something wrong.
  3. There's no decent documentation anywhere.
  4. It restricts you to working with certain versions of Excel.
However, I had proved to myself that the logic was sound, and as there wasn't anything on telly I decided to revert to writing a command line application.

Command Line Application
I already knew that you can generate an XML file that Excel will load and display, having come across this when maintaining a COBOL program that built such a file using only string concatenation. As I was well aware of the horror of that approach, my first instinct was to find a library to do the work (why reinvent the wheel?).

I found an excellent and easy-to-use library, the CarlosAg Excel Xml Writer Library (which was handy, as I couldn't get the CHM help file to open). This meant that I could concentrate on just getting a quick and dirty implementation up and running.

I knew I had the following constraints:
  • Only 255 columns
  • Only 4000 unique colours
This meant that the process should be simple:
  1. Resize the image if necessary
  2. Identify every distinct colour
  3. Generate a style for each colour
  4. For each pixel in the image create a cell with the appropriate style.
Resizing the image is very easy with .Net


        private void Resize(int width, int height)
        {
            // Draw the existing image onto a new bitmap of the target size
            Bitmap b = new Bitmap(width, height);
            Graphics g = Graphics.FromImage((Image)b);
            g.InterpolationMode = InterpolationMode.HighQualityBicubic;
            g.DrawImage(image, 0, 0, width, height);
            g.Dispose();

            // Keep the resized bitmap so the later steps read from it
            image = b;
        }


Identifying the unique colours was even simpler with LINQ as you only need the following command to get all distinct values.

    // k holds the colour of every pixel; s ends up with each distinct colour once
    s = k.Distinct().ToArray();

Generating the styles was just a case of iterating through the distinct array and adding each style to the workbook.
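
Putting those two steps together, a rough sketch is below. The property and method names on the Workbook, WorksheetStyle and WorksheetRow objects come from my recollection of the CarlosAg library's samples, so treat them as assumptions rather than the exact calls in my program; the bitmap side uses standard System.Drawing calls.

    using System.Collections.Generic;
    using System.Drawing;
    using System.Linq;
    using CarlosAg.ExcelXmlWriter;

    class BitmapSheetSketch
    {
        static void WriteSheet(Bitmap image, string filename)
        {
            var book = new Workbook();
            Worksheet sheet = book.Worksheets.Add("Image");

            // Step 3: one style per distinct colour, keyed by its hex value
            foreach (Color colour in DistinctColours(image))
            {
                string hex = string.Format("#{0:X2}{1:X2}{2:X2}", colour.R, colour.G, colour.B);
                WorksheetStyle style = book.Styles.Add(hex);
                style.Interior.Color = hex;
                style.Interior.Pattern = StyleInteriorPattern.Solid;
            }

            // Step 4: one cell per pixel, each referencing the style for its colour
            for (int y = 0; y < image.Height; y++)
            {
                WorksheetRow row = sheet.Table.Rows.Add();
                for (int x = 0; x < image.Width; x++)
                {
                    Color c = image.GetPixel(x, y);
                    string hex = string.Format("#{0:X2}{1:X2}{2:X2}", c.R, c.G, c.B);
                    row.Cells.Add(new WorksheetCell(string.Empty, hex));
                }
            }

            book.Save(filename);
        }

        static IEnumerable<Color> DistinctColours(Bitmap image)
        {
            var colours = new List<Color>();
            for (int y = 0; y < image.Height; y++)
                for (int x = 0; x < image.Width; x++)
                    colours.Add(image.GetPixel(x, y));
            return colours.Distinct();
        }
    }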

The Application
The application can be downloaded from here and is very simple to use.

From the commandline it's just

BitmapToExcel <filename>

This will create an XML file with the same name as your image, which you may need to rename to .xls before Excel will open it.

Friday, 25 November 2011

Meego Linux - A Beautiful Failure

I have a Dell notebook (Dell Mini 10V) which was the perfect device for a particular problem we had; that need has now passed, and I'm left with a device which isn't powerful enough to replace a laptop but still seems too good not to use.

It was dual booting Windows 7 (Starter edition) and Ubuntu 10 (Netbook Remix) without a huge amount of success. I felt that Windows 7 was too slow on the device, and Ubuntu was rather poor compared to how well it works on the desktop.

It was time to find a new distro.

I initially tried Xubuntu (alternate) but the installer was very iffy, and I didn't try to download the normal Xubuntu ISO. I then moved to Meego, which upon initial installation was absolutely amazing: it was quick to install, looked great and was responsive, with a really nice user interface. I thought I'd found the holy grail of lightweight Linux distributions, as it really seemed to understand what people use a netbook for, and everything was so well integrated that you could really see where the developers had learnt from mobile device development. Unfortunately Meego has a massive failing: it is completely single user, something which may make sense on a mobile device but takes real computers back to the mid 90s.

A computer which integrates your social media accounts and email so tightly with the user interface can only make sense as a single user platform on something like a mobile phone; netbooks and laptops are commonly used by multiple people, and multi-user support is a very basic requirement.

I've now switched to Easy Peasy Linux which, while based on Ubuntu, is a lot snappier and has (I think) a much better user interface.

Friday, 18 November 2011

Excel (2010) Automation with C# (Continued)

In my last post I left off with the problem of how you know that a cell value received as a double needs to be treated as a date. I haven't yet found a complete solution, but I have found one sufficient for my problem.

Parameters in my commands are identified by curly braces surrounding a type and a column reference:

e.g. {S2} refers to a string in column 2, while {N4} refers to a number in column 4 (the row is the same as that of the command).

As I know what I'd like to be formatted as a date, it was just a case of adding a new type identifier, D for date, so that I now have {D3} to refer to a date in column 3. I can then use DateTime.FromOADate() to parse the double and return a DateTime, which can then be output as a yyyy/mm/dd string in the parameter substitution:

// r.Value2 holds the cell contents as a double (an OLE Automation date)
DateTime dt = DateTime.FromOADate(r.Value2);
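
As a rough sketch of the substitution itself: the SubstituteDates helper and the getCellValue delegate below are hypothetical names standing in for however the add-in reads a numeric cell; the regular expression and DateTime.FromOADate are the point.

using System;
using System.Text.RegularExpressions;

class DateParameterSketch
{
    // getCellValue stands in for however column N of the current row is read
    // (e.g. via Range.Value2 in the add-in); it should return the raw double.
    static string SubstituteDates(string command, Func<int, double> getCellValue)
    {
        return Regex.Replace(command, @"\{D(\d+)\}", match =>
        {
            int column = int.Parse(match.Groups[1].Value);
            DateTime dt = DateTime.FromOADate(getCellValue(column));
            return dt.ToString("yyyy/MM/dd");
        });
    }
}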

Future Changes
Currently the command is processed as soon as you leave the cell; what I'd like to do is write the commands, then select a range of cells and choose to convert them at that point. That way I could make sure I was happy with the entire thing before losing all my parameter markers.

The other sensible change may be to leave the initial string alone and have the fully substituted string returned in the next column across. This way you don't lose your original command.