<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[Acko.net]]></title>
  <link href="http://acko.net/atom.xml" rel="self"/>
  <link href="http://acko.net/"/>
  <updated>2012-05-17T01:12:29-07:00</updated>
  <id>http://acko.net/</id>
  <author>
    <name><![CDATA[Steven Wittens]]></name>
    
  </author>

  
  <entry>
    <title type="html"><![CDATA[Safe String Theory for the Web]]></title>
    <link href="http://acko.net/blog/safe-string-theory-for-the-web/"/>
    <updated>2008-04-03T00:00:00-07:00</updated>
    <id>http://acko.net/blog/safe-string-theory-for-the-web</id>
    <content type="html"><![CDATA[<div class='g8 i2 first'><div class='pad'><h1>Safe String Theory for the Web</h1><p>One of the major things that really bugs me about the web is how poorly the average web programmer handles strings. Here we are, changing the way the world works on top of text-based protocols and languages like HTTP, MIME, JavaScript and CSS, yet some of the biggest issues that <a href='http://drupal.org/security'>still plague us</a> are cross-site scripting and mangled text due to aggressive filtering, mismatched encodings or overzealous escaping.
</p>

<p>
Almost <a href='http://acko.net/blog/xss-friends-text-handling-in-php-applications'>two years ago</a> I said I'd write down some formal notes on how to avoid issues like XSS, but I never actually posted anything. See, once I sat down to actually try and untangle the do's and don'ts, I found it extremely hard to build up a big coherent picture.
</p>

<p>
But here we are now, and I'm going to try anyway. The text is aimed at people who have had to deal with these issues, who are looking for a bit of formalism to frame their own solutions in.
</p>

<style type='text/css'>
p.string {
  text-align: center;
  margin-top: 0.5em;
  margin-bottom: 0.5em;
}
span.letter {
  padding: 0.125em 0.25em;
  border: 1px solid #ccc;
  margin-right: -1px;
  background: #eee;
}
span.marked {
  background: #ffc;
}
div.context {
  font-size: 0.8em;
  margin-top: -0.5em;
  margin-bottom: 0.625em;
}
span.ugc {
  color: #34a;
}
div.code {
  padding: 0.5em;
  border: 1px solid #ccc;
  background: #eee;
}
table {
  margin: 0.5em auto;
}
</style>

<h2>The problem</h2>

<p>
At the most fundamental level, all the issues mentioned above come down to this: you are building a string for output to a client of some sort, and one or more pieces of the data you are using trigger unintended effects, because they have an unexpected meaning to the client.
</p>

<p>
For example this little PHP snippet, repeated in variations across the web:
</p>

<p class='codeblock'>
<code>&lt;?php&nbsp;print&nbsp;$user-&gt;name&nbsp;?&gt;'s&nbsp;profile</code>
</p>

<p>
If <code>$user-&gt;name</code> contains JavaScript, your users are screwed.
</p>

<p>
What this really comes down to is concatenation of data, or more literally, strings. So with that in mind, let's take a closer look at...
</p>

<h2>The humble string</h2>

<p>
What exactly is a string? It seems like a trivial question and I'm sure I'll come across as slightly nutty and overly analytical, but I think a lot of people don't really know a good answer to this. Here's mine:
</p>

<blockquote><p>
A string is an <strong>arbitrary sequence</strong> (1) of characters composed from a <strong>given character set</strong> (2), which acquires meaning when placed in an <strong>appropriate context</strong> (3).
</p></blockquote>

<p>
This definition covers three important aspects of strings:
</p>

<ol>
  <li>They have no intrinsic restrictions on their content.</li>
  <li>They are useless blobs of data unless you know which symbols they represent.</li>
  <li>The represented symbols are meaningless unless you know the context to interpret them in.</li>
</ol>

<p>
This is a much more high-level concept than what you encounter in e.g. C++, where the definition is more akin to:
</p>

<blockquote><p>
A string is an arbitrary sequence of bytes/words/dwords, in most cases terminated by a null byte/word/dword.
</p></blockquote>

<p>
I think this latter definition is mostly useless for learning how to deal with strings, because it only describes their form, not their function.
</p>

<p>
So let's take a closer look at the three points above.
</p>

<h2>1. Representation of Symbols</h2>

<blockquote>
<p>
They are useless blobs of data unless you know which symbols they represent.
</p>
</blockquote>

<p>
This issue is relatively well known these days and is commonly described as <em>encodings</em> and <em>character sets</em>. A character set is simply a huge, numbered list of characters to draw from. An encoding is a mechanism for turning characters into sequences of bits. Theoretically they are independent of each other, but in practice, they are coupled together and the two terms are used interchangeably to describe a particular encoding/character set pair.
</p>

<p>
You can't say much about them these days without delving into Unicode. Fortunately, Joel Spolsky has already written up a great <a href='http://www.joelonsoftware.com/articles/Unicode.html'>crash course on Unicode</a>, which explains it much better than I could.
</p>

<p>
For the purposes of security though, encodings and character sets are mostly irrelevant, as the problems occur regardless of which you use. All you need to do is be consistent, making sure your code can't get confused about which encoding it's working with. So below, we'll talk about strings above the encoding level, as sequences of known characters. Like so:
</p>

<p class='string'><span class='letter'>S</span><span class='letter'>t</span><span class='letter'>r</span><span class='letter'>i</span><span class='letter'>n</span><span class='letter'>g</span><span class='letter'>&nbsp;</span><span class='letter'>T</span><span class='letter'>h</span><span class='letter'>e</span><span class='letter'>o</span><span class='letter'>r</span><span class='letter'>y</span>
</p>

<h2>2. Arbitrary content</h2>

<blockquote><p>
They have no intrinsic restrictions on their content.
</p></blockquote>

<p>
This point seems self-evident, but can be rephrased into an important mantra for coding practices: there are no restrictions on a string's contents except those you enforce yourself. This makes strings fast and efficient, but also a possible carrier of unexpected data.
</p>

<p>
The typical response to this danger is to apply strict filtering to any textual input your program receives, before doing anything else with the data. The idea is to remove anything that may be interpreted later as unwanted mark-up or dangerous code. On the web, this usually means stripping out anything that looks like an HTML tag, doing funky things with ampersands and getting rid of quotes. While this is an approach that is often advocated as an effective and bulletproof solution, it is rather short-sighted, inflexible and restricted in scope, and I strongly oppose it.
</p>

<p>
This is of course very different from regular input validation, like ensuring a selected value is one of a given list of options, or checking if a given input is numeric and in the accepted range. These are different from regular textual inputs, because the desired result is in fact not a string, but either a more restricted data type (like an integer) or a more abstract reference to an existing, internal object.
</p>
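
<p>
To make the distinction concrete, here is a rough PHP sketch of this kind of validation. The <code>$input</code> array and its <code>order</code> and <code>page</code> keys are invented for illustration and stand in for real request data, and the range of 1 to 100 is likewise an arbitrary choice:
</p>

<p class='codeblock'>
<code>&lt;?php<br />
// Hypothetical request parameters, named for illustration only.<br />
$input = array('order' =&gt; 'desc', 'page' =&gt; '3');<br />
// A choice from a fixed list: anything unknown falls back to a default.<br />
$allowed = array('asc', 'desc');<br />
$order = in_array($input['order'], $allowed, true) ? $input['order'] : 'asc';<br />
// A number in an accepted range: the result is an integer, not a string.<br />
$page = max(1, min(100, (int) $input['page']));
</code>
</p>

<p>
In both cases the validated result is no longer free-form text, which is why nothing is lost by rejecting everything else.
</p>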

<p>
To understand why textual strings are such poor candidates for input validation, we need to look at the third point.
</p>

<h2>3. Different contexts</h2>

<blockquote><p>
The represented symbols are meaningless unless you know the context to interpret them in.
</p></blockquote>

<p>
Context, or the lack of it, is essentially the cause of issues such as SQL injection, XSS and HTTP hijacking. And, I think it is exactly because it is so essential to processing strings, that it is often taken as self-evident and forgotten.
</p>

<p>
Let's go back to our example string:
</p>

<p class='string'><span class='letter'>S</span><span class='letter'>t</span><span class='letter'>r</span><span class='letter'>i</span><span class='letter'>n</span><span class='letter'>g</span><span class='letter'>&nbsp;</span><span class='letter'>T</span><span class='letter'>h</span><span class='letter'>e</span><span class='letter'>o</span><span class='letter'>r</span><span class='letter'>y</span>
</p>

<p>
Everyone will see this string represents two English words. That's because people are great at deriving context from free-floating pieces of data. However, even with natural languages, confusion can arise. Take for example this string:
</p>

<p class='string'><span class='letter'>B</span><span class='letter'>o</span><span class='letter'>n</span><span class='letter'>j</span><span class='letter'>o</span><span class='letter'>u</span><span class='letter'>r</span>
</p>

<p>
Is it a French greeting? Sure. But it is also the name used by Apple for its zero-configuration network stack. We can only know which one is meant, by knowing more about the <em>context it is used in</em>.
</p>

<p>
Now why bother with this trivial exercise? Because the web is all about textual protocols and languages. While people are great at deriving contexts automatically, computers aren't, and generally rely on strict semantics.
</p>

<p>
Imagine a discussion forum, and people post topics with the following subjects:
</p>

<p class='string'><span class='letter'>&lt;</span><span class='letter'>b</span><span class='letter'>&gt;</span><span class='letter'>O</span><span class='letter'>M</span><span class='letter'>G</span><span class='letter'>!</span><span class='letter'>!</span><span class='letter'>!</span><span class='letter'>&lt;</span><span class='letter'>/</span><span class='letter'>b</span><span class='letter'>&gt;</span>
</p>

<p class='string'><span class='letter'>&lt;</span><span class='letter'>b</span><span class='letter'>&gt;</span><span class='letter'>&nbsp;</span><span class='letter'>i</span><span class='letter'>s</span><span class='letter'>&nbsp;</span><span class='letter'>d</span><span class='letter'>e</span><span class='letter'>p</span><span class='letter'>r</span><span class='letter'>e</span><span class='letter'>c</span><span class='letter'>a</span><span class='letter'>t</span><span class='letter'>e</span><span class='letter'>d</span>
</p>

<p>
Each string contains the character <span class='letter'>&lt;</span> in a slightly different context. The first uses it as part of intended bold tags. The second seems to use the same bold tag, but is actually just talking <em>about</em> the tag instead of using it for markup. More formally, we can say the first string is written in an <em>HTML context</em>, the second in a <em>plain-text context</em>.
</p>

<p>
If we were to try and display these strings in the wrong context, we'd see tags printed when they should be interpreted, or text marked up when it should be shown as is.
</p>

<h3>Context conversion</h3>

<p>
To unify the two strings above, we can convert the plain-text string to HTML without loss of meaning, like so:
</p>

<p class='string'><span class='letter marked'>&lt;</span><span class='letter'>b</span><span class='letter marked'>&gt;</span><span class='letter'>&nbsp;</span><span class='letter'>i</span><span class='letter'>s</span><span class='letter'>&nbsp;</span><span class='letter'>d</span><span class='letter'>e</span><span class='letter'>p</span><span class='letter'>r</span><span class='letter'>e</span><span class='letter'>c</span><span class='letter'>a</span><span class='letter'>t</span><span class='letter'>e</span><span class='letter'>d</span>
</p>

<p class='string'><span class='letter marked'>&amp;</span><span class='letter marked'>l</span><span class='letter marked'>t</span><span class='letter marked'>;</span><span class='letter'>b</span><span class='letter marked'>&amp;</span><span class='letter marked'>g</span><span class='letter marked'>t</span><span class='letter marked'>;</span><span class='letter'>&nbsp;</span><span class='letter'>i</span><span class='letter'>s</span><span class='letter'>&nbsp;</span><span class='letter'>d</span><span class='letter'>e</span><span class='letter'>p</span><span class='letter'>r</span><span class='letter'>e</span><span class='letter'>c</span><span class='letter'>a</span><span class='letter'>t</span><span class='letter'>e</span><span class='letter'>d</span>
</p>

<p>
This kind of context conversion is commonly known as <em>escaping</em>. In this case, it replaces any character that has a special meaning in HTML with its escaped equivalent. This ensures the resulting string still means the same thing in the new context.
</p>
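
<p>
As a minimal sketch of this conversion in PHP, assuming the subject is stored as UTF-8 plain text (the variable name <code>$subject</code> is made up for illustration), the built-in <code>htmlspecialchars()</code> performs the plain-text to HTML step:
</p>

<p class='codeblock'>
<code>&lt;?php<br />
// Hypothetical forum subject, written in a plain-text context.<br />
$subject = '&lt;b&gt; is deprecated';<br />
// Convert from plain text to HTML at the moment of output.<br />
print htmlspecialchars($subject, ENT_QUOTES, 'UTF-8');<br />
// Prints: &amp;lt;b&amp;gt; is deprecated
</code>
</p>

<p>
A browser renders the printed result as the literal text the poster typed, rather than treating it as a bold tag.
</p>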

<h2>On the Web, Contexts happen</h2>

<p>
Usually, the lesson above of escaping input to HTML-safe text is where the discussion about XSS ends. However, armed with only the knowledge that <em>HTML-special characters must be escaped</em> to be safe, it can be hard to see why in fact you should not just filter all your data on input to ensure it contains none of these pesky characters in the first place. After all, how many people really need to use angle brackets and ampersands anyway?
</p>

<p>
Well, first of all, I think that's underestimating certain users. The following subject might not be so rare on a message board, yet would be mangled by typical aggressive character stripping:
</p>

<p class='string'><span class='letter'>&lt;</span><span class='letter'>_</span><span class='letter'>&lt;</span><span class='letter'>&nbsp;</span><span class='letter'>s</span><span class='letter'>o</span><span class='letter'>&nbsp;</span><span class='letter'>s</span><span class='letter'>a</span><span class='letter'>d</span>
</p>

<p>
More fundamentally though, it implies that there is only one kind of string context used on the web. Nothing could be further from the truth. Let's look at three different, common contexts.
</p>

<h3>HTML</h3>

<p>
We take a simple snippet of HTML by itself with some assumed user-generated text in it:
</p>

<p class='codeblock'>
<code>  &lt;span title=&quot;<span class='ugc'>attribute text</span>&quot;&gt;<span class='ugc'>inline text</span>&lt;/span&gt;
</code>
</p>

<p>
We look at some different segments of the snippet, and look at what 'forbidden characters' would break or change the semantics of each.
</p>

<p>
<table>
  <thead>
    <tr>
      <th>Snippet</th>
      <th>Forbidden</th>
      <th>Escaped as</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><span class='ugc'>attribute text</span></td>
      <td><span class='letter'>"</span><span class='letter'>&amp;</span></td>
      <td>&amp;quot; &amp;amp;</td>
    </tr>
    <tr>
      <td><span class='ugc'>inline text</span></td>
      <td><span class='letter'>&lt;</span><span class='letter'>&amp;</span></td>
      <td>&amp;lt; &amp;amp;</td>
    </tr>
  </tbody>
</table>
</p>

<p>
For example, quotes are disallowed in attribute text values, because otherwise a string with a quote could alter the meaning of the HTML snippet considerably:
</p>

<p class='codeblock'>
<code>  &lt;span title=&quot;<span class='ugc'>attribute with injected&quot; property=&quot;doEvil() </span>&quot;&gt;<span class='ugc'>inline text</span>&lt;/span&gt;
</code>
</p>

<p>
Note:
</p>

<ul>
<li>All ampersands need to be escaped (including those in URLs) for it to validate. HTML's stricter cousin XML will refuse to parse unescaped ampersands as well, and also requires apostrophes to be escaped inside single-quoted attribute values.</li>
<li>XSS attacks do not necessarily involve angle-brackets. In the attribute context, all you need is a <span class='letter'>"</span> to wreak havoc.</li>
</ul>
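
<p>
The attribute case can be sketched in PHP as follows. This is only an illustration: <code>$title</code> is a hypothetical user-supplied value, and <code>ENT_QUOTES</code> is passed so that both double and single quotes are escaped, whichever kind delimits the attribute:
</p>

<p class='codeblock'>
<code>&lt;?php<br />
// Hypothetical user-supplied title that tries to break out of the attribute.<br />
$title = 'attribute with injected&quot; property=&quot;doEvil()';<br />
// Plain text to HTML attribute: the quote is escaped, so it stays inside the value.<br />
print '&lt;span title=&quot;' . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . '&quot;&gt;inline text&lt;/span&gt;';<br />
// Prints: &lt;span title=&quot;attribute with injected&amp;quot; property=&amp;quot;doEvil()&quot;&gt;inline text&lt;/span&gt;
</code>
</p>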

<h3>URLs</h3>

<p>
The situation is more complicated with URLs. The common HTTP URL for example:
</p>

<p class='codeblock'>
<code>  http://<span class='ugc'>user</span>:<span class='ugc'>password</span>@<span class='ugc'>host.com</span>/<span class='ugc'>path/</span>?<span class='ugc'>variable</span>=<span class='ugc'>value</span>&amp;<span class='ugc'>foo</span>=<span class='ugc'>bar</span>#
  </code>
</p>

<p>
<table>
  <thead>
    <tr>
      <th>Snippet</th>
      <th>Forbidden</th>
      <th>Escaped as</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>(all)</td>
      <td><span class='letter'>&lt;</span><span class='letter'>&gt;</span><span class='letter'>"</span><span class='letter'>#</span><span class='letter'>%</span><span class='letter'>{</span><span class='letter'>}</span><span class='letter'>|</span><span class='letter'>\</span><span class='letter'>^</span><span class='letter'>~</span><span class='letter'>[</span><span class='letter'>]</span><span class='letter'>`</span><span class='letter'> </span> (and non-printables)</td>
      <td>%3C %3E %22 %23 ...</td>
    </tr>
    <tr>
      <td><span class='ugc'>user</span></td>
      <td><span class='letter'>@</span><span class='letter'>:</span></td>
      <td>%40 %3A</td>
    </tr>
    <tr>
      <td><span class='ugc'>password</span></td>
      <td><span class='letter'>:</span></td>
      <td>%3A</td>
    </tr>
    <tr>
      <td><span class='ugc'>host.com</span></td>
      <td><span class='letter'>/</span><span class='letter'>@</span></td>
      <td>disallowed</td>
    </tr>
    <tr>
      <td><span class='ugc'>path</span></td>
      <td><span class='letter'>?</span></td>
      <td>%3F</td>
    </tr>
    <tr>
      <td><span class='ugc'>variable</span></td>
      <td><span class='letter'>&amp;</span><span class='letter'>=</span></td>
      <td>%26 %3D</td>
    </tr>
    <tr>
      <td><span class='ugc'>value</span></td>
      <td><span class='letter'>&amp;</span><span class='letter'>+</span></td>
      <td>%26 %2B</td>
    </tr>
  </tbody>
</table>
</p>

<p>
Note:
</p>

<ul>
  <li>Many forget that a <span class='letter'>+</span> in a query value actually means a space, not a plus.</li>
  <li>Even completely valid URLs can still be malicious, through the <code>javascript:</code> scheme.</li>
  <li>Defined by <a href='http://www.faqs.org/rfcs/rfc1738.html'>RFC 1738</a> and <a href='http://www.faqs.org/rfcs/rfc2616.html'>RFC 2616</a>.</li>
</ul>
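
<p>
URLs usually end up inside HTML, so two conversions stack. A hedged PHP sketch, in which the host <code>example.com</code>, the <code>q</code> and <code>page</code> parameters and the search phrase are all invented for illustration:
</p>

<p class='codeblock'>
<code>&lt;?php<br />
// Hypothetical search phrase supplied by a user.<br />
$value = 'Q&amp;A: 1+1=2';<br />
// Plain text to URL query value: urlencode() percent-encodes &amp; : + = and turns the space into +.<br />
$url = 'http://example.com/search?q=' . urlencode($value) . '&amp;page=2';<br />
// $url is now http://example.com/search?q=Q%26A%3A+1%2B1%3D2&amp;page=2<br />
// URL to HTML attribute: the remaining &amp; separator must become &amp;amp;.<br />
print '&lt;a href=&quot;' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '&quot;&gt;results&lt;/a&gt;';
</code>
</p>

<p>
Neither conversion looks at the scheme, so a URL supplied in full by a user still needs a separate check against <code>javascript:</code> links.
</p>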

<h3>MIME headers</h3>

<p>
Several protocols such as HTTP and SMTP employ the same mechanism of providing metadata for pieces of content. This includes data such as e-mail subjects, senders, cookie headers or HTTP redirects, likely to contain user-generated data.
</p>

<p class='codeblock'>
<code>  Subject: <span class='ugc'>message subject</span><br />
  Content-Type: text/html; charset=utf-8
  </code>
</p>

<p>
<table>
  <thead>
    <tr>
      <th>Snippet</th>
      <th>Forbidden</th>
      <th>Escaped as</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><span class='ugc'>message subject</span></td>
      <td><span class='letter'><em style='opacity: 0.7'>CR</em></span><span class='letter'><em style='opacity: 0.7'>LF</em></span> (if not followed by space), <span class='letter'>(</span><span class='letter'>)</span><span class='letter'>&lt;</span><span class='letter'>&gt;</span><span class='letter'>@</span><span class='letter'>,</span><span class='letter'>;</span><span class='letter'>:</span><span class='letter'>\</span><span class='letter'>"</span><span class='letter'>/</span><span class='letter'>[</span><span class='letter'>]</span><span class='letter'>?</span><span class='letter'>=</span>
       + any non-printable</td>
      <td><span style='white-space: nowrap'>=?UTF-8?B?...?=</span></td>
    </tr>
  </tbody>
</table>
</p>

<p>Note:</p>

<ul>
  <li>CRLF sequences without trailing space start a new field and can be used for header injection.</li>
  <li>Long lines should be folded with CRLF + space; RFC 2822 recommends at most 78 characters per line.</li>
  <li>Defined by <a href='http://www.faqs.org/rfcs/rfc2045.html'>RFC 2045</a>; the <code>=?UTF-8?B?...?=</code> encoded-word syntax is specified in RFC 2047.</li>
</ul>
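
<p>
A minimal PHP sketch of the subject conversion, assuming the subject is UTF-8 and short enough to fit in a single encoded-word (the address and variable names are hypothetical):
</p>

<p class='codeblock'>
<code>&lt;?php<br />
// Hypothetical user-supplied subject carrying a header injection attempt.<br />
$subject = &quot;Hello\r\nBcc: victim@example.com&quot;;<br />
// Drop CR and LF so the value can never start a new header field.<br />
$clean = str_replace(array(&quot;\r&quot;, &quot;\n&quot;), '', $subject);<br />
// Plain text to MIME header: wrap the bytes in a base64 encoded-word.<br />
$header = 'Subject: =?UTF-8?B?' . base64_encode($clean) . '?=';
</code>
</p>

<p>
A real mailer would also have to split long subjects across several encoded-words and fold the lines. Where the Mbstring extension is available, <code>mb_encode_mimeheader()</code> is meant to handle that.
</p>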

<h2>Lolwut?</h2>

<p>
If the above three tables seem complicated and confusing, that's normal. It should be obvious that each of the three contexts is unique and has its own special range of 'forbidden characters' for user input (and even some sub-contexts). From this perspective, it would be impossible to define a safe <em>input</em> filtering mechanism for text on the web that didn't destroy almost all legitimate content.
</p>

<p>
You would have to filter or escape only for a single context, which would create a situation where the exact same approach to a problem can be safe in some cases, but unsafe in others, thus promoting bad coding practices.
</p>

<p>
With the selection above, I also ignored other important contexts (notably JS/JSON or SQL). However, the fact that I was able to make my point using only old school Web 1.0 techniques should show how this problem becomes even hairier in today's Web 2.0.
</p>

<h2>So what then?</h2>

<p>
The right way around string incompatibilities is to use appropriate conversions to change content from one context to another without changing its meaning, and do so when <em>outputting</em> text in a particular instance. We already did it above for the plain-text example, but similar conversions can be made in almost every other instance. Most web languages (like PHP) contain pre-built and tested functions for doing this.
</p>

<p>
Whenever you put strings together, you need to ask yourself what context the strings are in. If they are not the same, an appropriate conversion needs to be made, or you can run into bugs or worse, exploits.
</p>

<p>
For the snippet at the very beginning, the appropriate fix is:
</p>

<p class='codeblock'>
<code>&lt;?php&nbsp;print&nbsp;htmlspecialchars($user-&gt;name)&nbsp;?&gt;'s&nbsp;profile</code>
</p>

<p>
The trick is in understanding why that call goes <em>there</em>, and not somewhere else.
</p>

<p>
<em>Update: Google's DocType wiki has an excellent section with instructions for <a href='http://code.google.com/p/doctype-mirror/wiki/ArticleXSS'>escaping for various contexts</a></em>.
</p>

</div></div>]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Vancouver PHP Conference]]></title>
    <link href="http://acko.net/blog/vancouver-php-conference/"/>
    <updated>2007-02-12T00:00:00-08:00</updated>
    <id>http://acko.net/blog/vancouver-php-conference</id>
    <content type="html"><![CDATA[<div class='g8 i2 first'><div class='pad'><h1>Vancouver PHP Conference</h1><p>Ahoy from the <a href='http://vancouver.php.net/'>Vancouver PHP</a> conference. I gave a talk titled "A Closer Look at Drupal 5" earlier. Overall response was positive, although according to Boris I wouldn't have managed to squeeze everything into 1 hour if I hadn't put on my zippy fast presentation speaking voice, so there might have been some information overload at times.
</p>

<p>
Oh well.. I figure anyone generally only remembers at most 50% of a talk, so I might as well blast you with a bunch of things and hope some of it sticks ;).
</p>

<p>
Thanks to <a href='http://buytaert.net/'>Dries</a> and <a href='http://walkah.net'>James</a> for letting me use their earlier presentations as a base.
</p>

<p>
<em>The slides are no longer available at Dries' request, as he has had problems with people stealing slides without permission before. Sorry.</em></p></div></div>]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[XSS & friends: Text Handling in PHP applications]]></title>
    <link href="http://acko.net/blog/xss-and-friends-text-handling-in-php-applications/"/>
    <updated>2006-06-26T00:00:00-07:00</updated>
    <id>http://acko.net/blog/xss-and-friends-text-handling-in-php-applications</id>
    <content type="html"><![CDATA[<div class='g8 i2 first'><div class='pad'><h1>XSS &amp; friends: Text Handling in PHP applications</h1>
  
<p><em>Update: I jotted down some initial theory in my <a href='/blog/safe-string-theory-for-the-web'>Safe String Theory for the Web</a> post.</em>
</p>

<p>
For a while now, a lot of talk has been going on about XSS, aka <em>Cross Site Scripting</em>. In October 2005, an XSS worm nearly took down MySpace. Most XSS attacks however are not as benign as that. They can be used to steal passwords and other sensitive information, perform distributed Denial-of-Service attacks on sites or generate fraudulent advertisement income.
</p>

<p>
XSS problems are still rampant in many web applications today though, with PHP applications being especially vulnerable. This has caused some to conclude that XSS problems are even impossible to avoid or at least impractical to completely audit for. However, from a purely technical standpoint, XSS problems are not unique at all. They belong to a wider class of security problems which stem from incorrect handling of user-supplied data (e.g. SQL command injection or e-mail header injection).
</p>

<p>
So, what makes the web so tricky to secure? Is it because web programmers are inherently 'stupid' and can't 'code properly'? I don't think so.
</p>

<p>
However, I do think that most web languages (such as PHP) tend to promote a bad approach to coding and by extension, to security. By letting the programmer jump in directly, learning as they go, most people never build up a complete overview of the programming environment, but simply tweak code 'until it works'. The same applies to security issues: when a bug is found, those people will just tweak a particular line of code until the problem goes away. They won't see the big picture and will make similar mistakes later.
</p>

<p>
Another serious problem in my opinion is that there is no well-defined vocabulary for the tools used to solve these problems. Umbrella words such as 'filtering' are all too often used and stand in the way of a more precise description. With only vague notions about 'validation', 'special characters' and 'escaping', you cannot understand what's <em>really</em> going on. Such a lack of insight also prevents people from seeing beyond individual issues.
</p>

<p>
So I've decided I want to build up a more formalized explanation of text handling. Expect one or more blog posts about this in the future. At least the next time people "lock up" on me, I can point them somewhere.</p></div></div>]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Summer of Code - Ajax Functionality for Drupal]]></title>
    <link href="http://acko.net/blog/summer-of-code-ajax-functionality-for-drupal/"/>
    <updated>2005-09-01T00:00:00-07:00</updated>
    <id>http://acko.net/blog/summer-of-code-ajax-functionality-for-drupal</id>
    <content type="html"><![CDATA[<div class='g8 i2 first'><div class='pad'><h1>Summer of Code - Ajax Functionality for Drupal</h1><aside class='r m1'><img alt='' src='/files/soc2006/soc.png' /></aside><p>
This last summer I was sponsored by Google as part of their <a href='http://code.google.com/summerofcode.html'>Summer of Code</a> program to work on Drupal. My goal was to introduce various <a href='http://en.wikipedia.org/wiki/AJAX' title='Asynchronous JavaScript and XML'>AJAX</a> functionalities to <a href='http://www.drupal.org/'>Drupal</a>.
</p>

<p>
The official project description was:
<blockquote><em>"Drupal has recently begun to find meaningful ways to introduce AJAX functionality with the goal of improving the user experience. Work with Drupal's usability experts to identify the next steps and help implement new dynamic functions based on interaction with the XMLHttpRequest object."</em></blockquote>
</p>

<p>
I focused on the following Ajax-powered features:
<ul>
<li><strong>Inline Editing of posts</strong>: Though I built a working prototype module, I decided not to develop this feature further because it is not flexible enough to work as a generic Drupal module. It would break on too many configurations and has limited usefulness anyhow.</li>
<li><a href='http://drupal.org/node/28483'>Uploading of files</a>: allows you to attach files to Drupal nodes (with upload.module) without having to reload the page.</li>
<li><a href='http://drupal.org/node/30150'>Sorting tables inside a page</a>: this changes the sort order of a table without reloading the entire page. It is not client-side sorting as you'd expect at first sight: because most tables in Drupal are spread across multiple pages, client-side sorting is not very useful.</li>
<li><strong>Switching between multiple pages</strong>: this was implemented on top of the sorting functionality, and only works on paged tables (this covers most of the useful pagers though).</li>
<li><a href='http://acko.net/yay-progress'>Progressbar widget</a>: a typical progressbar that fetches the status from the server through Ajax.</li>
</ul>
</p>

<p>
The resulting code can be found in <a href='http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/unconed/soc/'>my sandbox</a> in the Drupal contributions repository. Note however that most of the code is in patches against the (rapidly changing) Drupal HEAD, so they are likely to go out of date soon.
</p>

<p>
The file uploader is now already part of the Drupal HEAD, and at least the tablesorter is sitting in the patch queue being reviewed. I will try and keep them up to date.
</p>

<p>
A big thanks goes to Google for organising the Summer of Code!</p></div></div>]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Proposal for Implementing Unicode in PHP]]></title>
    <link href="http://acko.net/blog/proposal-for-implementing-unicode-in-php/"/>
    <updated>2005-06-03T00:00:00-07:00</updated>
    <id>http://acko.net/blog/proposal-for-implementing-unicode-in-php</id>
    <content type="html"><![CDATA[<div class='g8 i2 first'><div class='pad'><h1>Proposal for Implementing Unicode in PHP</h1>
  
<p>On the <a href='http://www.drupal.org/'>Drupal team</a>, I am known as an <em>encoding nut</em>: whenever there's an encoding issue or a question about Unicode, people tend to knock on my door. Usually any fix or answer from me is accompanied by a lot of cursing to the unfortunate inquirer about how "PHP is horrible when it comes to string handling" and how it seems that "the entire PHP dev team has its head planted firmly into the ground when it comes to Unicode".
</p>

<p>
To which the reply is more often than not: "Why don't you fix it yourself?".
</p>

<p>
Well, I'm not a PHP language developer. To be honest I have no interest or time for becoming one. But I do know a lot about encodings and Unicode, so I decided to write this article describing the problem and possible solutions. That way, maybe others can take some of these ideas and put them into practice. At the very least, it should answer a lot of questions that people have about Unicode and PHP.
</p>

<p>
Right now, the message from the PHP developers seems to be that "PHP supports Unicode, but some assembly is required". In fact, it is a lot worse. Please, read on.
</p>

<h2><span>About encodings and Unicode</span></h2>

<p>
First, I recommend that anyone reading this article read <a href='http://www.joelonsoftware.com/articles/Unicode.html'>The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a> by Joel Spolsky. It is an excellent introduction to Unicode and encodings in general. Note also that the article was written in 2003 and specifically mentions PHP's Unicode support being hopeless. We are now two years later and the situation has not changed much.
</p>

<p>
The only important thing about Unicode which isn't explained in Joel's article is that Unicode is in fact more than just a big table which maps characters to numbers: it is also a set of character properties, recommendations and algorithms on how those characters should be used. And this is why Unicode needs (and deserves!) much more attention than any other character set.
</p>

<h2><span>What is the current situation?</span></h2>

<p>
As far as PHP is concerned at the moment, a character consists of 8 bits and a string is a series of characters. This is good enough for legacy 8-bit encodings (like the common ISO-8859-1 or Latin-1 encoding used in Western Europe), but does not cater to more complicated encodings.
</p>

<p>
To accommodate those, the <a href='http://php.belnet.be/mbstring'>multibyte string extension</a> (Mbstring) can be used. This extension was originally developed for handling Japanese encodings, but it has now been extended to support many more encodings, including the Unicode Transformation Formats (like the popular UTF-8). Mbstring provides encoding-aware versions of many of PHP's string functions (<code>substr()</code>, <code>strlen()</code>, <code>ereg()</code>, ...). Through a feature called overloading, you can tell PHP to always use the Mbstring version of a function if there is one.
</p>

<p>
Aside from Mbstring, there are a few other libraries and extensions which may be used to provide encoding- and Unicode-related services, like Imap, Iconv or GNU Recode.
</p>

<h2 id='mbstring-sucks'><span>What problems are there with the current approach?</span></h2>

<ol>
<li>
<p><strong>PHP itself still doesn't know anything about encodings or Unicode.</strong>
Aside from function calls, there are other ways of interacting with strings in PHP. For example, there is the <code>{}</code> operator for selecting characters from strings, as if they were arrays. And like in most programming languages, you can define strings in code with the familiar quote syntax. But all of these methods work with literal bytes, not with actual encoded characters.
</p>

<p>
PHP source code itself must be encoded in an ASCII-compatible encoding and there is no way to use Unicode codepoints directly. If you want to store a character in a variable, you either have to use a short string of bytes (the encoded representation of the character) or an integer representing the character's Unicode codepoint. But converting between a codepoint and its encoded representation requires ugly work-arounds and wrappers, as PHP itself provides no easy mechanism for doing this.
</p></li>


<li><p><strong>PHP does not guarantee anything about the local setup as far as encoding support goes.</strong>
All the actual encoding functionality is located in libraries or extensions which may not be present on the average PHP install or which may be outdated. This makes it very difficult to make Unicode-compatible PHP programs work everywhere. One of PHP's assets is its large install base, yet the large majority of those installs are completely unsuited for Unicode work. At the time of writing this article, the latest PHP (5.0.4) still does not enable the Mbstring extension by default.
</p>

<p>
A trickier example: in Drupal 4.6.0 we depend on the Perl-compatible Regular Expression Library's support for Unicode and UTF-8. This was <a href='http://php.belnet.be/manual/en/reference.pcre.pattern.modifiers.php'>supposedly present</a> since PHP 4.1 (exception: since PHP 4.2.3 on Windows), but actual testing shows that it took until PHP 4.3.3 for this library to deal correctly with UTF-8 and the full Unicode range. Even now, PHP can still use the system-provided PCRE library, which may be compiled without UTF-8 support. This can result in unsupported installs even for those using the latest PHP version.
</p>
</li>


<li><p><strong>When you use Mbstring overloading, you can no longer easily work with strings of binary data.</strong>
Mbstring overloading sounds nice in theory, as it gives you smarter string functions for free without having to adapt your code. However, this feature denies a basic fact: <em>text strings are fundamentally different from binary data</em>. If this sounds strange to you, consider this:
</p>

 <ul>
  <li>Binary data requires no meta-information about its encoding and can be passed around freely. Operations on two byte arrays are guaranteed to work. Text, on the other hand, is always encoded in a particular way. Text operations can only work if the encoding is known and verified to be the same for all operands involved.</li>
  <li>Binary data can contain arbitrary bits, while most text encodings have a much more limited syntax. Take a look at <a href='http://php.belnet.be/utf8_encode'>UTF-8's bit patterns</a> for example. However, even plain US-ASCII text has historically had the limitation that it may not contain the NUL character.</li>
  <li>Binary data has no intrinsic semantic meaning, while text does. Many operations (like case conversion) only make sense on text, while other operations become much more complicated (e.g. text sorting needs to take local conventions into account). Specifically, there are a lot of Unicode algorithms for advanced text processing (e.g. the Bidirectional Algorithm for handling text with mixed writing directions).</li>
 </ul>

<p>
Due to the fact that text has been 8-bit encoded for a long time, a lot of programmers don't think twice about using text functions for dealing with binary data and vice-versa. But this assumption is no longer valid today.
</p>

<p>
If Mbstring overloading is enabled and a PHP programmer wants to perform operations on binary data, (s)he has to temporarily trick PHP into using a simple 8-bit encoding (like ISO-8859-1). Quite possibly, locale settings have to be changed back and forth as well. This results in bloated, complicated code.
</p>
</li>


<li><p><strong>PHP's string functions don't form a clean, consistent API.</strong>
There is no consistent naming convention (e.g. <code>substr()</code>, <code>str_replace()</code>, <code>convert_cyr_string()</code>, <code>parse_str()</code>, <code>sprintf()</code>, ...).
</p>

<p>
There are also a bunch of hodge-podge functions which are only useful in very specific situations and/or which are tied to a particular encoding (e.g. <code>utf8_encode()</code>) or locale (e.g. <code>ucfirst()</code>).
</p>

<p>
Finally, though some functions take an encoding argument to allow for some encoding support, this is rare and inconsistent. For example, while the <code>htmlentities()</code> function supports several encodings, the utility function <code>get_html_translation_table()</code> which fetches its translation table does not.
</p>
</li>


<li><p><strong>PHP's locale mechanism is completely platform-dependent and offers no guarantees.</strong>
The locale identifiers passed to <code>setlocale()</code> <a href='http://php.belnet.be/set_locale'>differ completely</a> between Windows and Unix platforms, but even between similar Unix platforms there is no guarantee of which locales are available. The dependency of PHP on system locales also means that you are restricted to whatever encodings the system locales are available in.
</p>
</li>


<li><p><strong>PHP's XML parser is notorious for violating the specifications when it comes to encodings.</strong>
In today's web, XML is everywhere in the form of XHTML, RSS feeds, OPML, etc. Being able to parse XML correctly is essential to any PHP application. A significant portion of the XML specification talks about encodings and how to deal with them, but PHP does not implement them correctly.
</p>

<p>
For example, if an XML document starts with a UTF-8 signature (in the form of the byte-order mark), PHP5's parser will die if it is told the document is in UTF-8 encoding. Similar simple, but critical bugs have had to be worked around by PHP programmers in the past. Before PHP5, absolutely no encoding autodetection was present in the XML parser: this had to be done by the code invoking the parser.
</p>
</li>


<li><p><strong>Mbstring is a pragmatic library, not a fully featured Unicode solution.</strong>
Example limitations include not being able to specify characters beyond U+FFFF for some functions (e.g. <code>mb_substitute_character()</code>) or the way <code>mb_strwidth()</code> seems to be <a href='http://be.php.net/manual/en/function.mb-strwidth.php'>hardcoded for Japanese only</a> (there are no zero widths for combining accents?).</p></li>
</ol>
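
<p>
To make the first point concrete, here is a minimal sketch of the byte-oriented behaviour and of the kind of work-around wrapper it forces on you. The helper name <code>codepoint_to_utf8()</code> is made up for this illustration and is not part of PHP. It assumes it is handed a valid codepoint and does no checking for surrogates or out-of-range values.
</p>

<p class='codeblock'><code>&lt;?php<br />
&nbsp;&nbsp;//&nbsp;&quot;café&quot;&nbsp;in&nbsp;UTF-8:&nbsp;the&nbsp;é&nbsp;takes&nbsp;two&nbsp;bytes&nbsp;(0xC3&nbsp;0xA9).<br />
&nbsp;&nbsp;$text&nbsp;=&nbsp;&quot;caf\xC3\xA9&quot;;<br />
&nbsp;&nbsp;//&nbsp;strlen()&nbsp;counts&nbsp;bytes,&nbsp;not&nbsp;characters:&nbsp;5&nbsp;rather&nbsp;than&nbsp;4.<br />
&nbsp;&nbsp;$bytes&nbsp;=&nbsp;strlen($text);<br />
&nbsp;&nbsp;//&nbsp;Taking&nbsp;one&nbsp;&quot;character&quot;&nbsp;at&nbsp;offset&nbsp;3&nbsp;yields&nbsp;only&nbsp;the&nbsp;lead&nbsp;byte&nbsp;of&nbsp;é.<br />
&nbsp;&nbsp;$half&nbsp;=&nbsp;bin2hex(substr($text,&nbsp;3,&nbsp;1));<br />
<br />
&nbsp;&nbsp;//&nbsp;Hypothetical&nbsp;helper:&nbsp;encode&nbsp;one&nbsp;Unicode&nbsp;codepoint&nbsp;as&nbsp;UTF-8&nbsp;bytes.<br />
&nbsp;&nbsp;function&nbsp;codepoint_to_utf8($cp)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;if&nbsp;($cp&nbsp;&lt;&nbsp;0x80)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;chr($cp);<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;if&nbsp;($cp&nbsp;&lt;&nbsp;0x800)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;chr(0xC0&nbsp;|&nbsp;($cp&nbsp;&gt;&gt;&nbsp;6))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.&nbsp;chr(0x80&nbsp;|&nbsp;($cp&nbsp;&amp;&nbsp;0x3F));<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;if&nbsp;($cp&nbsp;&lt;&nbsp;0x10000)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;chr(0xE0&nbsp;|&nbsp;($cp&nbsp;&gt;&gt;&nbsp;12))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.&nbsp;chr(0x80&nbsp;|&nbsp;(($cp&nbsp;&gt;&gt;&nbsp;6)&nbsp;&amp;&nbsp;0x3F))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.&nbsp;chr(0x80&nbsp;|&nbsp;($cp&nbsp;&amp;&nbsp;0x3F));<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;chr(0xF0&nbsp;|&nbsp;($cp&nbsp;&gt;&gt;&nbsp;18))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.&nbsp;chr(0x80&nbsp;|&nbsp;(($cp&nbsp;&gt;&gt;&nbsp;12)&nbsp;&amp;&nbsp;0x3F))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.&nbsp;chr(0x80&nbsp;|&nbsp;(($cp&nbsp;&gt;&gt;&nbsp;6)&nbsp;&amp;&nbsp;0x3F))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.&nbsp;chr(0x80&nbsp;|&nbsp;($cp&nbsp;&amp;&nbsp;0x3F));<br />
&nbsp;&nbsp;}<br />
?&gt;</code></p>

<p>
With this in place, <code>codepoint_to_utf8(0xE9)</code> gives the same two bytes that were written out by hand in <code>$text</code>. None of it is hard, but every application ends up carrying its own copy of such code.
</p>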

<p>
All of these problems together mean that it is very hard at the moment to write PHP software which can support encodings and Unicode. Even worse, if this software has to run on a typical PHP install, then you can forget about implementing anything more than simple pass-through behaviour as far as text is concerned.
</p>

<h2><span>Proposed solution</span></h2>

<p>
Unfortunately, PHP is very hot on backwards compatibility, so significant changes to the existing string API are pretty much out of the question. New types and APIs need to be introduced which offer a complete, consistent and flexible solution for dealing with encodings and Unicode.
</p>


<ol>
<li><p><strong>PHP needs a new Unicode text string type which is separate from the classic byte string.</strong>
This type, let's call it <em>ustring</em>, would represent a string of Unicode text.
</p>

<p>
Internally, it would be stored using one of the UTF's. In the interests of internal processing efficiency, UTF-16 is probably the best choice, but UTF-8 can be considered as well as it is the most popular UTF on the web today. In that case, outputting UTF-8 could be done without any conversion. On the other hand, the complicated bit patterns and variability of UTF-8 mean that it is harder to find character boundaries and such. Looking at how languages like Perl and Python approach this is a good idea. After all, they've had Unicode strings for quite some time.
</p>

<p>
To distinguish <em>ustrings</em> from plain strings when defined, a syntax similar to C could be introduced, for example <code>U&quot;This&nbsp;is&nbsp;a&nbsp;Unicode&nbsp;string&quot;</code>. This syntax would support <code>\u####</code>, <code>\U########</code> and <code>\x{#..}</code> notation for defining characters by codepoint inside the string.
</p>

<p>
Using the <code>{}</code> operator on a <em>ustring</em> would return ints, not chars. To reduce confusion, perhaps a <em>uchar</em> type could be introduced specifically for handling Unicode codepoints. As the Unicode codespace is only 21 bits wide, there would be subtle differences between <em>uchar</em> and <em>int</em>, though both would probably be stored as 32-bit.
</p>

<p>
For backwards compatibility, plain quoted strings would continue to denote byte strings, although it might be interesting to define a <code>B&quot;This&nbsp;is&nbsp;a&nbsp;byte&nbsp;string&quot;</code> notation, while providing a configurable option for choosing which type of string is assumed when there is no prefix. As Unicode usage would become more widespread, it would be nice to not have to litter your code with U's everywhere.
</p>

<p>
Though the internal encoding would be fixed to one of the UTF's, the external encoding might vary (and would be configurable through an API). When casting a <em>ustring</em> to a <em>string</em>, a conversion would take place from the internal encoding to the external one, and vice-versa. It remains to be seen which type takes precedence when both are mixed together (e.g. <code>$string&nbsp;=&nbsp;U&quot;Unicode&quot;&nbsp;.&nbsp;&quot;Bytes&quot;</code>).
</p>
</li>


<li><p><strong>PHP needs a new Unicode string API.</strong>
This API would contain a selection of functions from both the plain String API as well as the Mbstring API, but would have a simpler and more logical naming convention. For example, all ustring functions could start with <code>ustr_</code>. Each of these would accept a <em>ustring</em> where the current ones accept a plain <em>string</em>.
</p>

<p>
External APIs, like the PCRE library, could choose whether to accept <em>string</em>, <em>ustring</em> or both. For example for PCRE, it makes sense to replace the PHP-proprietary <code>/u</code> modifier with a simple string type check instead.
</p>
</li>


<li><p><strong>PHP needs to ensure that a baseline set of encoding-related functions is always available.</strong>
I believe the Iconv extension has been standard since PHP5, but things like complete UTF-8 support in PCRE are important too. This allows programmers to write their code in a straightforward fashion without having to check for a gazillion exceptions or exotic configurations.
</p>
</li>

<li><p><strong>PHP needs an independent locale library across all platforms.</strong>
This ensures consistent handling of locales and no longer limits PHP to what the platform supports. The <a href='http://www-306.ibm.com/software/globalization/icu/index.jsp'>International Components for Unicode</a> (ICU) are an excellent candidate.</p>
</li>
</ol>


<p>
The choice to limit this new string functionality to Unicode strings might seem elitist: after all, the idea of Unicode is <em>not</em> to get rid of other encodings, but merely to ensure compatibility. Non-Unicode encodings will keep fulfilling an important role in the years to come. On the other hand, as Unicode is guaranteed to be a perfect intermediate format, it makes sense to use it for internal string handling. It limits the functionality that has to be dealt with and creates a common baseline to work with.
</p>

<p>
Finally, as the original String and Mbstring APIs would not be altered by these changes, programmers would be free to use the 'old school' way of dealing with strings. They would simply not be able to take advantage of the cleaner API and consistent locales.
</p>

</div></div>]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[PHP, Unicode and ostriches.]]></title>
    <link href="http://acko.net/blog/php-unicode-and-ostriches/"/>
    <updated>2005-03-25T00:00:00-08:00</updated>
    <id>http://acko.net/blog/php-unicode-and-ostriches</id>
    <content type="html"><![CDATA[<div class='g8 i2 first'><div class='pad'><h1>PHP, Unicode and ostriches.</h1><p><em>Update: I've written a <a href='/blog/unicode-in-php'>follow-up post</a> that describes how I would like PHP's encoding support to be.</em>
</p>

<p>
As the resident encoding geek on the <a href='http://www.drupal.org/'>Drupal</a> team, it's usually my job to make sure Drupal handles encodings and Unicode correctly. I don't mind doing this, but PHP doesn't exactly make it easy. With the new search.module for Drupal 4.6 being Unicode-aware, this has become very obvious, as we've had to bump up the minimum required version of PHP to 4.3.3. The UTF-8 support in the Perl-compatible regular expressions in PHP 4.3.2 and earlier is completely broken. And now I've had a bug report about someone on PHP 4.3.8 who still had problems getting it to work.
</p>

<p>
I don't know why exactly, but as far as encodings go PHP is still in the Stone Age. This is odd, as you'd expect a web-oriented scripting language to have excellent support for sharing and exchanging textual information. There is a multi-byte string extension, but it's not available on 90% of PHP hosts out there, and it's more of a black-box library anyway: it does not present your strings to you as Unicode character codepoints, but still as an array of bytes. Furthermore, if you actually enable the mbstring overrides, you lose the ability to work with bytes at will. Apparently, the PHP team still hasn't figured out that bytes and characters are not the same. The other extensions which deal with encodings (iconv, recode) are also unavailable on the majority of PHP installs out there.
</p>

<p>
This means that if you want to make a PHP application which supports any language <em>and</em> runs on the average PHP host out there, there's only one option: use UTF-8 internally, and write your own functions for string truncation, email header encoding, validation, etc. Using UTF-8 ensures that you only have one encoding to worry about and because it's Unicode it is guaranteed to be able to represent any language. Of course, you will no longer be able to do something as simple as upper/lowercasing a string, as these PHP functions don't handle UTF-8 at all.
</p>
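
<p>
As an illustration of such a home-grown function, a byte-limited truncation that never cuts a character in half could look roughly like this. The name <code>truncate_utf8_bytes()</code> is hypothetical, and the sketch assumes the input is already valid UTF-8.
</p>

<p class='codeblock'><code>&lt;?php<br />
&nbsp;&nbsp;//&nbsp;Hypothetical&nbsp;helper:&nbsp;cut&nbsp;$string&nbsp;to&nbsp;at&nbsp;most&nbsp;$len&nbsp;bytes&nbsp;without<br />
&nbsp;&nbsp;//&nbsp;splitting&nbsp;a&nbsp;multi-byte&nbsp;character.&nbsp;Assumes&nbsp;valid&nbsp;UTF-8&nbsp;input.<br />
&nbsp;&nbsp;function&nbsp;truncate_utf8_bytes($string,&nbsp;$len)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;if&nbsp;(strlen($string)&nbsp;&lt;=&nbsp;$len)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;$string;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;//&nbsp;Continuation&nbsp;bytes&nbsp;look&nbsp;like&nbsp;10xxxxxx.&nbsp;If&nbsp;the&nbsp;first&nbsp;byte&nbsp;we&nbsp;would<br />
&nbsp;&nbsp;&nbsp;&nbsp;//&nbsp;drop&nbsp;is&nbsp;one,&nbsp;the&nbsp;cut&nbsp;falls&nbsp;inside&nbsp;a&nbsp;character,&nbsp;so&nbsp;back&nbsp;up.<br />
&nbsp;&nbsp;&nbsp;&nbsp;while&nbsp;($len&nbsp;&gt;&nbsp;0&nbsp;&amp;&amp;&nbsp;(ord($string[$len])&nbsp;&amp;&nbsp;0xC0)&nbsp;==&nbsp;0x80)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$len--;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;substr($string,&nbsp;0,&nbsp;$len);<br />
&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;//&nbsp;&quot;cafés&quot;&nbsp;is&nbsp;6&nbsp;bytes;&nbsp;a&nbsp;4-byte&nbsp;cut&nbsp;would&nbsp;split&nbsp;the&nbsp;é,&nbsp;so&nbsp;we&nbsp;get&nbsp;&quot;caf&quot;.<br />
&nbsp;&nbsp;$short&nbsp;=&nbsp;truncate_utf8_bytes(&quot;caf\xC3\xA9s&quot;,&nbsp;4);<br />
?&gt;</code></p>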

<p>
What PHP needs is Unicode string support in the core, along with a good library of useful functions for handling the very large Unicode character range efficiently. ASP, Perl, Python, Java all have it... for me, it's the only thing that would've made PHP5 worth upgrading to.
</p>

<p>
It's as if the entire PHP team has stuck their head in the ground, hoping that all this Unicode stuff will somehow blow over. It won't.
</p></div></div>]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[UFPDF: Unicode/UTF-8 extension for FPDF]]></title>
    <link href="http://acko.net/blog/ufpdf-unicode-utf-8-extension-for-fpdf/"/>
    <updated>2004-09-01T00:00:00-07:00</updated>
    <id>http://acko.net/blog/ufpdf-unicode-utf-8-extension-for-fpdf</id>
    <content type="html"><![CDATA[<div class='g8 i2 first'><div class='pad'><h1>UFPDF: Unicode/UTF-8 extension for FPDF</h1><p><b>Note: I wrote UFPDF as an experiment, not as a finished product. If you have problems using it, don't bug me for support. Patches are welcome though, but I don't have much time to maintain this.</b>
</p>

<p>
<a href='http://www.fpdf.org'>FPDF</a> is a PHP class for generating PDF files on-the-fly. Unfortunately it does not support Unicode. So I've coded UFPDF, an extension of FPDF which accepts input in UTF-8.
</p>

<p>
Only TrueType fonts are supported for now. To embed .TTF files, you need to extract the font metrics and build the required tables using the provided utilities (see README.txt). Included is a modified version of <a href='http://ttf2pt1.sourceforge.net/'>TTF2PT1</a> which extracts the Unicode glyph info.
</p>

<p>
UFPDF works the same as FPDF, except that all text is in UTF-8, so consult the <a href='http://www.fpdf.org/en/doc/index.php'>FPDF documentation</a> for usage.
</p>

<p>
<a href='/files/ufpdf/ufpdf.zip'>Download UFPDF</a>
<a href='/files/ufpdf/unicode.pdf'>Example PDF</a></p></div></div>]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[PHP 5 references fun: clone for PHP4.]]></title>
    <link href="http://acko.net/blog/php-5-references-fun-clone-for-php4/"/>
    <updated>2004-08-25T00:00:00-07:00</updated>
    <id>http://acko.net/blog/php-5-references-fun-clone-for-php4</id>
    <content type="html"><![CDATA[<div class='g8 i2 first'><div class='pad'><h1>PHP 5 references fun: clone for PHP4.</h1><p>An issue that's popped up recently in <a href='http://www.drupal.org/'>Drupal</a> is PHP5 compatibility. At first, this looks like a no-brainer. Drupal does not use any advanced <acronym title='object oriented'>OO</acronym> features, so most code runs on both PHP4 and PHP5.
</p>

<p>
There is, however, a nasty change in PHP5: objects are now always passed by reference. Variables hold a handle to the object rather than the object itself. This brings PHP more in line with other OO languages (like Java) and removes some of the ugly from PHP OO code, but it also means that objects are treated differently from all the other types. Old code that depends on having objects copied when not passed by reference will break.
</p>

<p>
Fixing this is tricky. PHP5 gives you the <code>clone</code> keyword to copy objects with:
</p>

<p class='codeblock'><code>&lt;?php<br />
&nbsp;&nbsp;$copy&nbsp;=&nbsp;clone&nbsp;$object;<br />
?&gt;</code></p>

<p>
And surprise, surprise, PHP4 does not consider this to be valid code. It doesn't even parse, so you couldn't enclose this with a version check. To get around this, you need a rather ugly hack.
</p>

<p>
The following code works the same as the above, in PHP5:
</p>

<p class='codeblock'><code>&lt;?php<br />
&nbsp;&nbsp;$copy&nbsp;=&nbsp;clone($object);<br />
?&gt;</code></p>

<p>
PHP 4 on the other hand will think <code>clone()</code> is a function. The obvious next step is to conditionally declare this function if PHP4 is running. The only problem there is that the function definition will not parse in PHP5 because <code>clone</code> is a special keyword. To get around that, we have to use <code>eval()</code> to postpone parsing. Here's the finished hack:
</p>

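<p class='codeblock'><code>&lt;?php<br />
&nbsp;&nbsp;//&nbsp;Only&nbsp;PHP4&nbsp;needs&nbsp;the&nbsp;shim;&nbsp;eval()&nbsp;keeps&nbsp;PHP5&nbsp;from&nbsp;ever&nbsp;parsing&nbsp;it.<br />
&nbsp;&nbsp;if&nbsp;(version_compare(phpversion(),&nbsp;'5.0')&nbsp;&lt;&nbsp;0)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;eval('<br />
&nbsp;&nbsp;&nbsp;&nbsp;function&nbsp;clone($object)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;//&nbsp;$object&nbsp;arrived&nbsp;by&nbsp;value,&nbsp;so&nbsp;it&nbsp;is&nbsp;already&nbsp;a&nbsp;copy.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;$object;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;');<br />
&nbsp;&nbsp;}<br />
?&gt;</code></p>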

<p>
In PHP5, the native <code>clone</code> keyword will clone the object, while in PHP4 the cloning will happen when the object is passed by value to <code>clone()</code>.
</p>
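
<p>
As a quick check, the following sketch modifies the copy and leaves the original untouched under either version. The <code>Node</code> class is made up for this example, and the shim from above is repeated in compact form so the snippet stands alone.
</p>

<p class='codeblock'><code>&lt;?php<br />
&nbsp;&nbsp;//&nbsp;The&nbsp;shim&nbsp;from&nbsp;above,&nbsp;repeated&nbsp;so&nbsp;this&nbsp;snippet&nbsp;runs&nbsp;on&nbsp;its&nbsp;own.<br />
&nbsp;&nbsp;if&nbsp;(version_compare(phpversion(),&nbsp;'5.0')&nbsp;&lt;&nbsp;0)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;eval('function&nbsp;clone($object)&nbsp;{&nbsp;return&nbsp;$object;&nbsp;}');<br />
&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;//&nbsp;Node&nbsp;is&nbsp;a&nbsp;throwaway&nbsp;class&nbsp;for&nbsp;this&nbsp;example&nbsp;only.<br />
&nbsp;&nbsp;class&nbsp;Node&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;var&nbsp;$title&nbsp;=&nbsp;'original';<br />
&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;$object&nbsp;=&nbsp;new&nbsp;Node();<br />
&nbsp;&nbsp;$copy&nbsp;=&nbsp;clone($object);<br />
&nbsp;&nbsp;$copy-&gt;title&nbsp;=&nbsp;'changed';<br />
&nbsp;&nbsp;//&nbsp;$object-&gt;title&nbsp;is&nbsp;still&nbsp;'original':&nbsp;the&nbsp;copy&nbsp;is&nbsp;independent.<br />
?&gt;</code></p>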

<p>
We still need to go over the Drupal code and check for reference problems, but at least now we can clone objects consistently.</p>

</div></div>]]></content>
  </entry>
  
</feed>

