Mod_rewrite doesn’t look at arguments? D’oh!

When I converted my old, kludgy Web site to WordPress last year, one of the reasons was that my restaurant review script was no longer supported and had some ginormous security holes. It was open to all kinds of cross-site script injection, and keeping it running had become a one-to-two-hour-a-day chore.

The new solution (WordPress) is fabulous, but now I have a couple hundred inbound links that are broken. Trying to get all of the links updated by the other sites was a losing proposition, so my next step was to put some 301 redirects into my friendly neighborhood .htaccess file and transfer all the link love to the new pages.

Easy (normal) 301 redirects via .htaccess
Normally, a 301 redirect in your .htaccess file would look like this:
Redirect 301 /old/url.shtml http://domain.com/new/filename
or if we’re being fancy…
RewriteEngine on
RewriteBase /
RewriteRule ^old/url.shtml$ new/filename [R=301,L]

The examples above are perfect for redirecting incoming links from old, defunct pages to the new, correct location of the file. The problem I ran into is that my pages were dynamically generated; i.e., they all shared the same script URL, and which review was presented to the user was determined by an argument (the query string) passed in the URL after the file extension.
http://www.domain.com/cgi-bin/script.cgi?review=bobs_shrimp_hut

To address this, I figured I would just include the argument in the pattern that the RewriteRule tries to match, and make sure that I remembered to escape the question mark after the file extension since question marks are special characters in regex.
RewriteEngine on
RewriteBase /
RewriteRule ^cgi-bin/script.cgi\?review=bobs_shrimp_hut$ restaurants/bobs-shrimp-hut [R=301,L]

Looks right… but nothing happened. Like, nothing-nothing. It didn’t return a 500 Server Error, which would have let me know I screwed something up, and it also didn’t even attempt to rewrite the URL. When I tried the original (old) address, I got a 404. 🙁

Since my .htaccess file already had a number of other RewriteRules that were executing correctly, I assumed the file was being read. To test that assumption, I removed the argument portion from the rule (the question mark and everything after it).
RewriteRule ^cgi-bin/script.cgi$ restaurants/bobs-shrimp-hut [R=301,L]

Okay, that worked. We’ve confirmed the .htaccess file is being read and things are working correctly. The only problem with this solution is that this rule will fire every time any of the reviews are accessed, regardless of the argument, and will always rewrite as “restaurants/bobs-shrimp-hut”. Great exposure for Bob’s Shrimp Hut, but not cool if you were expecting to find Chez Francois or Chuck’s Cheese Chalet.

Let’s RTFM!
After re-reading the Apache documentation for the mod_rewrite module, I found this in the RewriteRule Directive section:

What is matched?

The Pattern will initially be matched against the part of the URL after the hostname and port, and before the query string. If you wish to match against the hostname, port, or query string, use a RewriteCond with the %{HTTP_HOST}, %{SERVER_PORT}, or %{QUERY_STRING} variables respectively.

Crap. The part of the URL after the hostname and before the query string is identical for all 200 restaurants. The query string is what I HAVE to match.

Alrighty, so the manual says I need a RewriteCond if I want to match anything against the QUERY_STRING (argument). So I tried the following:
RewriteEngine on
RewriteBase /
RewriteCond %{QUERY_STRING} ^(.*)bobs_shrimp_hut$
RewriteRule ^cgi-bin/script.cgi$ restaurants/bobs-shrimp-hut [R=301,L]

And it worked! Mostly. The rewritten URL looked like this:
http://www.domain.com/restaurants/bobs-shrimp-hut/?review=bobs_shrimp_hut

So we’re 98% of the way there at this point. Realistically, this is still a success as far as the user is concerned because the link does redirect to the new page successfully. It’s just ugly, and it will cause some canonicalization issues for the search engines, since the same page is now reachable at more than one URL.

So I RTFMd some more, and a little further down the page, I found this:

Modifying the Query String

By default, the query string is passed through unchanged. You can, however, create URLs in the substitution string containing a query string part. Simply use a question mark inside the substitution string to indicate that the following text should be re-injected into the query string. When you want to erase an existing query string, end the substitution string with just a question mark. To combine new and old query strings, use the [QSA] flag.

Eureka! So now my .htaccess looks like this:
RewriteEngine on
RewriteBase /
RewriteCond %{QUERY_STRING} ^(.*)bobs_shrimp_hut$
RewriteRule ^cgi-bin/script.cgi$ restaurants/bobs-shrimp-hut? [R=301,L]

which gives us
http://www.domain.com/restaurants/bobs-shrimp-hut
YAY!
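
One caveat about the condition above: ^(.*)bobs_shrimp_hut$ matches any query string that merely ends in bobs_shrimp_hut, and the unescaped dot in script.cgi matches any character, not just a literal period. A slightly stricter sketch might look like the following. It assumes review is the only parameter the old links ever carried, which may not hold for every site. The commented-out alternative uses the [QSD] flag in place of the trailing question mark; that flag only exists in Apache 2.4 and later.

```apache
RewriteEngine on
RewriteBase /

# Match the whole query string exactly, not just its ending
# (assumes "review" is the only parameter in the old links).
RewriteCond %{QUERY_STRING} ^review=bobs_shrimp_hut$
# Escape the dot so it matches a literal "." only.
# The trailing "?" discards the old query string, as before.
RewriteRule ^cgi-bin/script\.cgi$ restaurants/bobs-shrimp-hut? [R=301,L]

# Apache 2.4+ alternative: let the QSD flag discard the query string.
# RewriteCond %{QUERY_STRING} ^review=bobs_shrimp_hut$
# RewriteRule ^cgi-bin/script\.cgi$ restaurants/bobs-shrimp-hut [R=301,L,QSD]
```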

The only downside to this method of redirection is that there has to be a separate RewriteCond/RewriteRule pair for each argument I need to redirect. This is only because I failed to use a consistent naming scheme: the old arguments don’t match the new page names (bobs_shrimp_hut became bobs-shrimp-hut, for example). If I had thought ahead when I set up the original script, I probably would have named the files differently and wouldn’t be having this problem now.
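
For anyone with access to the main server or virtual host configuration, a lookup table is one possible way to avoid writing a couple hundred pairs by hand. RewriteMap cannot be declared in .htaccess itself, so this won’t help on most shared hosting. Treat the following as a sketch only: the map name reviewmap, the file path, and the second map entry are made up for illustration, and it is untested against the setup described here.

```apache
# --- Server or virtual host config (not .htaccess) ---
# "reviewmap" and the path are hypothetical names for this example.
RewriteMap reviewmap "txt:/path/to/review-map.txt"

# --- Contents of review-map.txt: old argument, then new page name ---
#   bobs_shrimp_hut      bobs-shrimp-hut
#   chez_francois        chez-francois

# --- .htaccess ---
RewriteEngine on
RewriteBase /
# Capture the old review argument as %1.
RewriteCond %{QUERY_STRING} ^review=([^&]+)$
# Look it up in the map. An unknown key yields an empty string, so this
# condition fails and the rule is skipped. On success, %1 now holds the
# new page name captured by this second condition.
RewriteCond ${reviewmap:%1} ^(.+)$
RewriteRule ^cgi-bin/script\.cgi$ restaurants/%1? [R=301,L]
```

The second RewriteCond does double duty: it guards against arguments that aren’t in the map, and it replaces %1 with the looked-up value so the RewriteRule can use it directly.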

If the names were consistent between the old script and the new solution, I would have been able to do something like this:
RewriteEngine on
RewriteBase /
RewriteCond %{QUERY_STRING} ^review=(.*)$
RewriteRule ^cgi-bin/script.cgi$ restaurants/%1? [R=301,L]

%1 retrieves the first captured group from the last matched RewriteCond. $1 (more commonly seen) retrieves the first captured group from the pattern of the RewriteRule itself. This single Condition/Rule pair would have handled all of the dynamic URLs from the old script and redirected them to their new homes.
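
To see the two kinds of backreference side by side, here is a hypothetical variation. It assumes the old site had several scripts under cgi-bin (mine didn’t necessarily) and that each one maps to a directory of the same name on the new site; hotels.cgi is an invented example.

```apache
RewriteEngine on
RewriteBase /
# %1 = whatever follows "review=" in the query string (from the RewriteCond).
RewriteCond %{QUERY_STRING} ^review=([^&]+)$
# $1 = the script name captured from the URL path (from the RewriteRule).
# e.g. /cgi-bin/hotels.cgi?review=foo  ->  /hotels/foo
RewriteRule ^cgi-bin/([a-z]+)\.cgi$ $1/%1? [R=301,L]
```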

Okay, that’s it for the .htaccess tonight. My head feels like it’s going to explode.