Archive for March, 2005

Crazy Busy

This may be my last post until April 9 - the day I take the GREs. I’ve done a couple practice math tests, and - without publicly humiliating myself by revealing my scores - let’s just say I’m not doing very well so far. It’s frustrating since I did well on the GREs way back when in 1992. And that time I didn’t prepare at all… aside from being fresh out of 4 years of college ;-). Let’s hope I can relearn this stuff over the next week, and that I haven’t just grown stupider with time.

But it’s not just that. Kai and Maria were away in Denver last week, so I worked on the house in the evenings, and then put in a couple of 15-hour days over the weekend. I’m trying hard to wrap up all my re-wiring and related tasks, as the drywall guys are supposed to start putting up the new walls next week, and we don’t want to delay any further. We originally thought we’d have the new drywall installed already, but that was before I went beyond our original goals and got really ambitious. I’ve gone even further since then - while Kai and Maria were away I re-wired two bedrooms and knocked down a couple of walls (with help from friends Chris and Ed). Once I’m done with the GREs I’ll post some pictures of the recent house work (and of Kai in Denver too). I’m pushing myself to get all the house work done because we want to have it all wrapped up before the baby comes. After the drywall is up I still have to put in the trim, paint, fix up the bathrooms, move furniture, etc., so there’s still plenty more to do. August may seem far away, but it’s not.

But for the next 10 days it’s all about the GREs. Wish me luck, and I’ll be back when it’s over!

Back to School, and, What’s Wrong with My Brain?

I’ve started the application process for Penn’s Master of Computer and Information Technology (MCIT) program. I’ve been doing this whole Internet thing for 8 years now, so I guess it’s my career ;-). I should get a degree that’s more relevant to it than political science. And as a Penn employee, I can take the classes for free, so why not? With the baby coming, I’ll probably do just one class per semester. So it’ll take a long time for me to finish the program, but I’m not in any rush.

But first I have to get into the program. One hoop I have to jump through is the GRE test. I’m taking it on April 9. I’ve started studying, and I’m amazed that I remember almost none of my math education. Going through the first practice test I was dumbstruck by most of the algebra, trig, and geometry questions. I got A’s in my advanced statistics classes in grad school, and an A in calculus in college. But that was a long time ago - right now I couldn’t even tell you what calculus is if you asked me. Maria’s been helping me - somehow she remembers all her high school math. I asked her how she could possibly remember it all, and she said she has a good memory for general concepts and principles, even though she’ll often forget particular facts and details. I wish my mind worked that way - instead it is a vast wasteland crammed full of useless trivia. So even though I’ve forgotten the concept of factoring (despite the fact that I used it regularly throughout at least 4 years of my education) I can tell you right off the top of my head that 80s pop artist Falco, before the song Rock Me Amadeus made him a super-star, actually scored his first hit years earlier with Der Kommissar, the song the band After the Fire later covered in English. I never even liked Falco, yet that useless tidbit of information lingers in my mind, an unwelcome squatter occupying valuable mental real estate. Hopefully between now and April 9 I can clear away enough mental debris to make space for the math I need to learn.

Converting Web Applications to UTF-8

UPDATE: I expanded this to a full length article, which was published in the May 2005 issue of php|architect. My apologies for not responding to earlier comments - I had a newborn baby at the time.

An Overview of UTF-8 in PHP, Smarty, Oracle, and Apache, with data exports to PDF, RTF, email, and text

Here at the Penn Med School we recently switched our web and database applications from Western/ISO encoding to Unicode/UTF-8. We did this so we can provide better support for international character sets (Greek, Japanese, etc.). As sometimes happens with projects that involve computers, it grew into a big, hairy beast that was way beyond anything we initially anticipated. I was partly responsible for managing the transition, and since I found no comprehensive guide to help us through it, I thought I’d write one now that we’re done. We’re using two-thirds of the open source PHP-Apache-MySQL trinity, with Oracle instead of MySQL. Even if you have a different mix of applications, the concepts I’ll describe are probably applicable to your situation, even if the semantics are different.

Getting Started

First, if you need some orientation in understanding character sets, start with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). It’s actually quite readable, even if you’re not a techie.

Second, you need to read the Oracle document An Overview on Globalizing Oracle PHP Applications. It’s an excellent starting point, but unfortunately it doesn’t always explain the reasons behind its recommendations, which means you’ll get stuck if things don’t happen to work after you follow their instructions. I’ll try to fill those gaps here.

Persuading Apache and Oracle to talk to each other in UTF-8

PHP web applications are run under the Apache web server, which itself is running in a user account (assuming you’re in a Unix environment). So the first step is to set the environment of that account correctly, so it will know how to “speak” UTF-8 to Oracle. You do this by setting the NLS_LANG environment variable in the Apache configuration. The Oracle Overview document says to set it to .AL32UTF8, but doesn’t explain why. So when this didn’t do the trick for me, I had to do some more research. I found the Oracle Character Set descriptions, and found that .AL32UTF8 corresponds to Unicode 3.1. After talking with our DBA I learned that our Oracle database is set to Unicode 3.0, which meant I needed to set NLS_LANG=.UTF8 (we ultimately switched to .AL32UTF8, since it is Oracle’s recommended standard). The key point here is that NLS_LANG must exactly match the character set you’re using in Oracle.
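To confirm the setting actually reached PHP, you can inspect the environment from a script. What follows is only a minimal sketch, not code from our applications: the helper name nls_lang_charset() is made up for illustration, the expected value is a placeholder, and it assumes NLS_LANG follows Oracle's LANGUAGE_TERRITORY.CHARSET layout, where the language and territory may be left out (as in .AL32UTF8).

```php
<?php
/**
 * Extract the character set portion of an NLS_LANG value.
 * Hypothetical helper. Assumes the form [LANGUAGE[_TERRITORY]].CHARSET,
 * e.g. "AMERICAN_AMERICA.AL32UTF8" or just ".UTF8".
 * Returns null when no character set is present.
 */
function nls_lang_charset($nlsLang)
{
    if (!is_string($nlsLang)) {
        return null; // getenv() returns false when the variable is unset
    }
    $dot = strrpos($nlsLang, '.');
    if ($dot === false || $dot === strlen($nlsLang) - 1) {
        return null; // no dot, or nothing after the dot
    }
    return strtoupper(substr($nlsLang, $dot + 1));
}

// Compare what the web server handed to PHP with what the DBA reports.
$expected = 'AL32UTF8'; // placeholder: use your database's character set
$actual   = nls_lang_charset(getenv('NLS_LANG'));

if ($actual !== $expected) {
    error_log('NLS_LANG charset is ' . var_export($actual, true) . ", expected $expected");
}
```

A mismatch only gets logged here. The real fix is still to change NLS_LANG or the database character set so the two agree.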

Serving your web pages to users in UTF-8

There are a few different aspects to this:

  1. If you want all the documents on your server to default to UTF-8, then set the AddDefaultCharset directive in the Apache configuration to UTF-8. You should do either #2 or #3 below in addition to this (see the Apache documentation for the reason).
  2. If you want all your PHP documents served in UTF-8, but not necessarily other document types, set default_charset=UTF-8 in your php.ini file. It’s OK if the PHP charset is different from the Apache charset: the PHP charset will apply to PHP files, and the Apache charset will apply to all other types (this goes for #3 below as well).
  3. If you only want certain PHP documents in UTF-8, specify UTF-8 in the Content-type header of those documents. It’s important to point out here that, if you haven’t done #1 or #2 above, then you must set this header with the PHP header() function. If you try to set it with an HTML Meta tag, the charset defined in Apache will override your Meta tag.
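For option #3, the header call might look something like the sketch below. The function name content_type_header() is hypothetical; header() and headers_sent() are standard PHP.

```php
<?php
/**
 * Build a Content-Type header line with an explicit charset.
 * Hypothetical helper, shown only to keep the string in one place.
 */
function content_type_header($mimeType = 'text/html', $charset = 'UTF-8')
{
    return 'Content-Type: ' . $mimeType . '; charset=' . $charset;
}

// This has to run before any output is sent; once the body has started,
// PHP can no longer change the response headers.
if (!headers_sent()) {
    header(content_type_header());
}
```

Put the call at the very top of the script (or in a shared include), ahead of any HTML or whitespace.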

UTF-8 in form submissions

In Windows 95 and 98, Microsoft used the Windows ANSI character set. If you ever copy-and-pasted text from Microsoft Word into a web form under Windows 9x, chances are any upper ASCII characters, such as ©, turned into something like ä in the web form. This is because the web page was probably Western ISO 8859-1 encoded, and that character set organizes the upper ASCII range differently from Windows ANSI. So the web page thought it was receiving a different character than what you intended. Windows NT, 2000, and XP use Unicode, so you won’t have this problem under the newer versions of Windows. Macs and most other modern OSs use either Western ISO 8859-1 or Unicode. The 256 characters of Western ISO 8859-1 are the same as the first 256 characters of Unicode. So your Unicode encoded web form should correctly interpret upper ASCII text provided by anyone not using Windows 9x (or a completely foreign, non-Unicode character set).
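If you want to guard against stray non-UTF-8 submissions anyway, one defensive approach is to check each value and convert the ones that fail. This is a sketch under a big assumption: that anything which isn't valid UTF-8 arrived as Windows ANSI (Windows-1252). The function name normalize_form_input() is hypothetical, and the code needs the mbstring extension.

```php
<?php
/**
 * Return $value as UTF-8.
 * Assumption: input that is not valid UTF-8 is Windows-1252.
 * That guess will be wrong for other legacy encodings.
 */
function normalize_form_input($value)
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8 (plain ASCII included)
    }
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

// Word-style curly quotes as Windows-1252 bytes 0x93 and 0x94
$clean = normalize_form_input("\x93Hi\x94");
```

Here $clean holds the same text with the quotes re-encoded as the three-byte UTF-8 sequences for U+201C and U+201D.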

Additional PHP and Oracle configurations

You will want to enable multi-byte character support in PHP. Compile PHP with the --enable-mbstring option, and set mbstring.internal_encoding=UTF-8 in your php.ini file. Also, you should definitely look over the PHP documentation for multi-byte string functions. Note that if you haven’t upgraded to PHP 5 yet, the html_entity_decode() function will fail hard if you pass it a UTF-8 string. This was the only UTF-8 incompatibility we found in PHP 4.3.

You may want to implement PHP’s function overloading. An example will illustrate why this is important: in UTF-8, a string that is 4 characters long could occupy anywhere from 4 to 16 bytes, depending on the multi-byte characters in it. The mb_strlen() function will correctly tell you the number of characters in such a string, but the regular strlen() function won’t (it’ll tell you the number of bytes). Enabling function overloading will cause PHP to automatically assume it’s handling multi-byte strings, so, in this example, it will execute mb_strlen() when you call strlen(). If you’re making a wholesale conversion to UTF-8, and you don’t want to tweak all your existing code, implementing function overloading makes sense. But there is one exception: you may not want to do function overloading on mail() - I’ll get to that in a minute.
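Here is the difference in a few lines. This sketch calls mb_strlen() explicitly and passes the encoding, so it doesn't depend on function overloading or on mbstring.internal_encoding being set. Explicit calls also keep working on newer PHP versions, where function overloading is no longer available. The sample string is written as raw bytes so the file's own encoding doesn't matter.

```php
<?php
// Four characters (Greek alpha, beta, gamma, delta), two UTF-8 bytes each
$greek = "\xCE\xB1\xCE\xB2\xCE\xB3\xCE\xB4";

$bytes = strlen($greek);             // counts bytes
$chars = mb_strlen($greek, 'UTF-8'); // counts characters

echo "$bytes bytes, $chars characters\n"; // prints "8 bytes, 4 characters"
```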

Related to this, in Oracle 9, you can set NLS_LENGTH_SEMANTICS to use either character length or byte length semantics for the tables you create. That is, you can use it to indicate whether, for example, a varchar(10) column is 10 characters, or 10 bytes.

Smarty

If you’re using Smarty with PHP, you’ll need to override the escape() function. It calls the PHP htmlentities() and htmlspecialchars() functions, but it doesn’t provide them with the necessary charset argument so they’ll work with UTF-8. Make a copy of the escape() modifier and tweak it to pass along a charset argument to PHP, and then use it to override the original.
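A stripped-down version of such a replacement might look like this. It is a sketch, not Smarty's actual code: the name smarty_modifier_escape_utf8 is hypothetical (it only follows Smarty's smarty_modifier_NAME plugin naming), and it covers just the html and htmlall escape types, so a real override would need to carry over the other types from the original modifier.

```php
<?php
/**
 * UTF-8 aware stand-in for the HTML cases of Smarty's escape modifier.
 * Hypothetical name. Only 'html' and 'htmlall' are handled here.
 */
function smarty_modifier_escape_utf8($string, $escType = 'html', $charset = 'UTF-8')
{
    switch ($escType) {
        case 'htmlall':
            // Converts every character that has a named HTML entity
            return htmlentities($string, ENT_QUOTES, $charset);
        case 'html':
        default:
            // Converts only &, <, >, and quotes
            return htmlspecialchars($string, ENT_QUOTES, $charset);
    }
}
```

The important part is the third argument. Without it, older PHP versions treat the string as ISO 8859-1 and mangle multi-byte characters.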

Exporting to other formats

As you’ll see below, it may not always be wise to do data exports in UTF-8. Sometimes you need to change the character set before performing the export. Take a look at PHP’s utf8_decode() and iconv functions to learn about converting UTF-8 to single-byte encoding. Note that utf8_decode(), while easy to use, is limited to the Latin character set (see the user contributed notes on the PHP utf8_decode() page for tips on dealing with other character sets).
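As a rough illustration of the conversion step, here is a sketch that turns one row of UTF-8 data into an ISO 8859-1 line of text. It uses mb_convert_encoding() rather than utf8_decode(), since it states both encodings explicitly; utf8_decode() performs the same UTF-8 to ISO 8859-1 conversion. The function name utf8_to_latin1() and the sample data are made up, and the mbstring extension is assumed.

```php
<?php
/**
 * Convert UTF-8 text to ISO 8859-1 (hypothetical helper).
 * Characters with no ISO 8859-1 equivalent come out as mbstring's
 * substitute character, which is "?" unless configured otherwise.
 */
function utf8_to_latin1($text)
{
    return mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
}

// Sample row: "Café" and "Zürich", written as UTF-8 bytes
$row  = array("Caf\xC3\xA9", "Z\xC3\xBCrich");
$line = implode(',', array_map('utf8_to_latin1', $row)) . "\r\n";
```

This skips quoting and escaping of field values, which a real .csv export would also need.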

  • PDF: we use PDFlib on our web server to create PDF documents on the fly. For it to work with UTF-8 data, you need to use it with a UTF-8 compatible font. The standard Arial font supports Greek and Cyrillic in UTF-8, which is generally sufficient (don’t confuse standard Arial with Microsoft’s Arial Unicode MS font - while it can print just about any UTF-8 character, it’s 32MB, so you probably don’t want to load it on your web server!). Also, Gentium is a very nice UTF-8 compatible serif font that supports Greek and Cyrillic.
  • RTF: we are moving away from RTF, but we still have some applications that generate RTF files. RTF does not provide good UTF-8 support. Our solution is to do a utf8_decode() on our data before generating RTF files (we can get away with this since none of the data going into our RTF files contain non-Latin characters - hopefully we’ll get rid of RTF before non-Latin characters start showing up).
  • Text: we also do data exports to text files, mainly in .csv format for use in spreadsheets. Surprisingly, Microsoft Excel does not support importing UTF-8 encoded text files. Again, our solution is to perform a utf8_decode() before generating these text files.
  • Email: I recommend not doing function overloading on PHP mail(). The reason has to do with line breaks. In Unix, a line break is represented by a line feed (LF) character. On Macs, it’s represented by a carriage return (CR) character. And on Windows, by a CR+LF. For email to work between platforms, an email standard was agreed upon in the early days of the Internet, which is CR+LF. So, for example, on Unix, sendmail will add a CR as needed to each LF it finds in the body of an email message. But when an email is UTF-8, mailers don’t try to wade through the multi-byte encoding, and they don’t “fix” the line breaks. We found that the line breaks in UTF-8 emails (generated on Unix) were interpreted as desired in Mac and Unix mail readers, and by Microsoft Outlook on Windows, but not by Eudora 6.2 (and previous versions) on Windows. In Eudora, the messages displayed with no line breaks. You can’t say it’s a Eudora bug, since the line breaks weren’t meeting the standard. At this time, the emails we generate only contain basic Latin characters, so sticking with the standard mail() function meets our needs for now.
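If you do end up sending UTF-8 email, one option is to normalize the line breaks yourself before handing the body to the mail function. We have not deployed this, so treat it as an untested sketch: normalize_crlf() is a hypothetical name, and since some Unix mailers convert LF to CR+LF on their own, you should check the result against your own mail setup before relying on it.

```php
<?php
/**
 * Rewrite every line break (CR+LF, lone CR, or lone LF) as CR+LF.
 * Hypothetical helper. CR+LF is matched first so existing pairs
 * are left as a single break instead of being doubled.
 */
function normalize_crlf($body)
{
    return preg_replace('/\r\n|\r|\n/', "\r\n", $body);
}

$body = normalize_crlf("Line one\nLine two\r\nLine three\rLine four");
```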

Thoughts on Coppermine, and Integrating It with WordPress

I mentioned in an earlier post that I installed Gallery for managing my photos. Gallery turned out to be a train wreck: the features are nice, but the programming behind it reminds me of 80s style spaghetti code, making it almost impossible to customize. For example, I burned a few hours trying to figure out how to display a random image on a page other than “the random image page” before I concluded it wouldn’t be possible without a massive rewrite. So I dumped it in favor of Coppermine. Here are the photo albums I’ve created so far.

Coppermine allows you to create a custom theme, which consists of a style sheet and a couple of template files (they contain most of the HTML widgets that are used in building the pages). The implementation is hardly perfect though: there is a fair amount of hardcoded HTML inside the PHP functions, so you have to do some detective work if you want to change certain things (e.g. the code it has for embedding video files only works in IE, so I had to track it down and tweak it to support Firefox/Mozilla). Also, it’s filled with a sloppy mish-mash of HTML and XHTML tags, so coaxing it to generate a valid document has required me to touch a lot of code. But those are my only complaints: the features suit my needs and the management interface is nice. A real plus is that you can integrate it with the Windows XP “publish to web” feature, so you can publish images with just a few mouse clicks - no more FTP’ing!

Integrating Coppermine with WordPress has been an adventure. I’m using psnGallery2, which gives you custom WordPress tags and PHP functions for embedding Coppermine photos in WordPress. First I installed the latest stable release, but couldn’t get it working at all. So I installed the current alpha release, and with some hacking, got it to work. I described the problems and my solutions in this WordPress forum post.

I also wanted an easy way to link Coppermine pictures to their related blog entries. I did this by creating a “burl” tag in Coppermine. In the title or description of a photo I can type, for example, [burl=29]some link text[/burl] and it’ll link to WordPress entry number 29. It was easy to do. In bb_decode() - located in include/functions.inc.php - I added:

$text = preg_replace("/\[burl=([0-9]+)\]/", '<a href="/blog/index.php?p=$1">', $text);
$text = str_replace("[/burl]", '</a>', $text);

It’s all set up, so now I just have to slog through all my photos and get them in Coppermine :-(. Don’t expect it to happen overnight, but I will try to at least fix all the images that are now broken in my old blog entries as soon as I can ;-).

John Rocks Out

My younger brother John just sent me a link to photos from his band’s latest show: the Neo-Prophets at Providence, RI’s AS220 club. John’s the drummer (he’s been playing the drums since he was 8 years old). Clearly my playing the Dead Kennedys for him when he was 12 has damaged his brain. Here are more pictures from another Neo-Prophets show last November.

More WordPress Hacking

The clever little URL rewrite I mentioned earlier is now out the window. I noticed the URLs it generated weren’t technically correct (with a slash between index.php and the URL argument) but they worked, so I let it be. That was a few weeks ago, and in that time, not unexpectedly, all my old Movable Type URLs disappeared from Google. But they haven’t been replaced by the new ones - except for the top page, my blog is not in Google anymore. I’m guessing Googlebot didn’t like how I was doing the URL rewriting. So I’ve shuffled things around, and now my blog really is in the “blog” directory, and hopefully Googlebot will like that better. Unfortunately, that means I can’t use WordPress to manage pages outside the blog directory. But I’ve realized that doesn’t matter much, as I’ll eventually fold the Kai and wedding pages into Coppermine, and the Route 50 pages into the blog, so there won’t be many static pages left.

I’ve also become active in the WordPress forums - I explained how to get past the bugs that have been plaguing a lot of folks trying to use psnGallery (it’s a plugin that gives you easy access to Coppermine photos from within WordPress) and how to fix a bug with sorting the WordPress archive listings (the WP folks have since released their own bug fix).

Breaking News: Kai Gives Up His Astronaut Career

This morning Kai had a meltdown after he woke up. I wasn’t there, but Maria told me he just kept saying over and over again through his tears that “I don’t want to be an astronaut anymore.” Once he calmed down enough to answer a few questions, he explained that, on the space shuttle, he didn’t want to be “clipped onto the toilet.” The night before we read one of his new books about space travel that he got for his birthday, and it explained that astronauts need to strap themselves to the toilet when it’s time to go in zero-g. Apparently this was a deal-breaker for him, so there goes his future at NASA.

He now says he wants to be an archaeologist, so he can “study dinosaur brains.”

A Scanner Darkly is Coming!

The film’s release date hasn’t been announced yet, but the trailer for Richard Linklater’s adaptation of Philip K Dick’s A Scanner Darkly is now online: here’s the Apple trailer. I knew the film was in the works, but I had no idea they were doing a rotoscoped animation style. Check out the trailer - it’s real eye candy.

Dick’s daughters have said this “will be the very first faithful adaptation of a Philip K. Dick story.” Other film adaptations of his books (Blade Runner, Minority Report, and Total Recall) were good films, but took tremendous liberties with the storylines. Judging by the trailer, I’m impressed. I read the book over Christmas break, so it’s still fairly fresh in my mind - I recognized some of the lines, and the voiceover is taken directly from the portion of the book that explains the meaning of the title. So my expectations are now raised - can’t wait to see it.

The Robots Are Coming

I knew the Japanese had been making a lot of progress with industrial robots and toy/pet robots, but I didn’t realize just how far they had come with humanoid robots until I saw the Humanoids with Attitude article in the Washington Post (registration required). The article describes the receptionist robot Saya (that’s her picture on your right) dealing with someone insulting her:

“You’re so stupid!” said the professor, Hiroshi Kobayashi, towering over her desk. “Eh?” she responded, her face wrinkling into a scowl. “I tell you, I am not stupid!” Truth is, Saya isn’t even human. But in a country where robots are changing the way people live, work, play and even love, that doesn’t stop Saya the cyber-receptionist from defending herself from men who are out of line. With voice recognition technology allowing 700 verbal responses and an almost infinite number of facial expressions from joy to despair, surprise to rage, Saya may not be biological — but she is nobody’s fool.

The article also points out the differences between the US and Japanese approaches to R&D in robotics and AI:

In the quest for artificial intelligence, the United States is perhaps just as advanced as Japan. But analysts stress that the focus in the United States has been largely on military applications. By contrast, the Japanese government, academic institutions and major corporations are investing billions of dollars on consumer robots aimed at altering everyday life, leading to an earlier dawn of what many here call the “age of the robot.”

I’m fascinated by the cultural factors that influence where different countries choose to focus their technological research efforts. For example, in the US, we’ve taken a no-holds-barred approach to the genetic manipulation of fruits, vegetables, grains, and livestock, even though we don’t yet know what the long-term repercussions might be. In contrast, the Europeans have been very cautious in this area. And while the US has put a straitjacket on research involving human fetal stem cells, the Europeans haven’t. While there are plenty of Americans who would prefer a more European approach to these issues, I think these differences are indicative of some real cultural distinctions, mainly derived from differing perspectives on Christianity and man’s place in the world.

Getting back to robotics, the Washington Post article explains: “Rather than the monstrous Terminators of American movies, robots here [in Japan] are instead seen as gentle, even idealistic creatures.” While many Americans would have a hard time accepting a robot like Saya, the Japanese don’t have a problem with it. The article also points out the economic motivation behind Japan’s focus on robotics: “Confronting a major depopulation problem due to a record low birthrate and its status as the nation with the longest lifespan on Earth, Japanese are fretting about who will staff the factory floors of the world’s second-largest economy in the years ahead.” What the article fails to mention is that the US and much of Europe don’t have to worry too much about declining birthrates because they allow immigration. But allowing mass immigration is not a politically viable option in Japan: the Japanese would prefer to see their future workforce dominated by robots than by non-Japanese.

Setting aside the thorny issue of immigration for a moment, the Japanese predicament arguably would be an ideal situation if the rest of the world were in the same boat. If the global human population were declining, and robots could replace people in the workforce at roughly the same rate, prosperity would be maintained and the negative environmental effects of human population pressures on the globe would be reduced (so long as the human population stabilized at some point - we wouldn’t want to disappear altogether!).

But the Japanese situation is the exception, not the rule. In most of the rest of the world - including the US - the human population is growing due to either high birthrates or immigration, and all those folks need jobs. What will be interesting to see, 10 or 15 years down the road, is what will happen in the US with humanoid robots. As they become commonplace in Japan, competitive pressures will force the US to react. Whatever cultural resistance the US may have to the widespread presence of robots will give way, as robots will save companies a lot of money: robots do not require salaries, vacation time, or health benefits. Will cultural discomfort or an altruistic drive to maintain human employment keep the robots out? I doubt it. Unfortunately, I think Marshall Brain’s Robotic Nation provides an accurate prediction of what will happen. I scribbled some thoughts along these lines in a post last year, More Robot Stories - continue reading there if you want my prognosis.

Happy Birthday Kai!

Kai turned 4 today. He had a fun birthday party at school (Maria was there to shoot the video), and when he got home I gave him my set of space legos from when I was little. It’s one of the few things I’ve kept since childhood. I think the old 70s legos are better than the new ones. The new ones look more slick, but the parts are so specialized that all you can do with them is follow the instructions to build the one thing they’re designed for. Sure, the older sets looked more like blocks, but you could reconfigure them in a wide variety of ways. And Kai’s already started to make a variety of “improvements” to the original designs.