How to reduce unwanted traffic to your web site
The problem
PlusNet, as part of your standard account provision, allows your web site visitors to collectively download up to 250MB of data from your web space (500MB for business accounts) every day. If this amount is exceeded, you will be automatically sent a warning. Continue to exceed this allowance, and you are in danger of having your web site withdrawn and archived, without further notice.
It is possible that your web site is just too popular to be hosted on an inclusive ISP web hosting service. If this is the case, you need to consider commercial alternatives: dedicated web site hosting companies which provide high-traffic hosting, service level agreements and attractive value-added services.
However, you might not need a dedicated web host. You might run a web site which generates a modest amount of traffic, but may also be wasting a significant amount of your remaining traffic allowance on unwanted "noise". This noise may be enough to push your site traffic over the edge.
The causes
There are several ways in which your web site can draw unnecessary traffic:
- Your web site's files are too big or there are too many of them. Your site may be heavy on the binaries, or your HTML may be a bloated mess. Maybe you can produce more efficient code, use fewer binaries or use smaller binaries.
- You host avatars or files for use in public forums. You may have underestimated how many people actually access these files, and may take a significant penalty from these hits.
- Other net users may link to your files for their own purposes. They may decide that your image looks good in their own web site or they might use a small image as their own avatar. But, rather than copy your file, they link directly to it.
- Your entire site is archived by a visitor. Some people use software which can download entire web sites in one go, so they can be viewed offline at a later time. If your site is quite large, this can have a big impact on any one particular day.
- Your entire site is spidered by a search engine or "spambot". If you want to be found by other net users, your site has to be indexed, but not all spiders are welcome guests. Some may be after your email address, and image spiders, in particular, can generate a huge number of hits.
- A large number of web-surfers come upon your site by accident. There may be content in your web site which is attracting a large number of visitors using search engines. This is usually a good thing, but your site may not be what your visitors are actually looking for, in which case you are getting wasted hits.
The solution?
Reducing unwanted traffic to your web site requires several different skills and some free software. For the purposes of this tutorial we shall be using Webalizer (www.mrunix.net/webalizer) to generate web site statistics ("webstats") because it is already installed on PlusNet's web server and is therefore available to all, with very little effort.
You must first activate your PlusNet-generated webstats by visiting the PlusNet portal's member centre, clicking on "Website settings", then "Advanced Webstats" and then "Activate". You could also install Webalizer (or another log file analyser) on your CGI web space and configure it exactly how you like, to give you more detail on the type of traffic you are getting, but this is beyond the scope of this tutorial (see other tutorials for tips on how to do this).
Once you have collected a few days' worth of webstats, you should start to see a usage pattern emerging. You can access your webstats via the member centre, or create a bookmarked link using the format: portal.plus.net/stats/webroot/ (where webroot is the URL of the root of your web site, e.g. www.mydomain.com). Webalizer's reports are fairly straightforward and you should be able to spot a few problems without any help.
But, before diving straight in with log-file analysis, let's get back to basics…
Keep in Trim
You might not like to admit it, but your web site may be a bit flabby – in fact, it might even be clinically obese. No matter what clever trickery you employ to reduce unwanted site traffic, if your site is overweight to begin with, you're already wasting valuable bandwidth needlessly. If your pages are leaner, they consume less bandwidth. It's as simple as that.
Like in the human world, your web site will benefit from a healthy diet and a bit of regular exercise.
Feed it with optimised graphics and media files. Images should be no bigger than absolutely necessary and use the correct format for their content (e.g. jpeg for photos and gif for graphics). Learn how to use your image editing software properly to reduce the file size of your images using different compression ratios and colour depths. The same principle applies to audio, video and Flash files.
Take your web site to the HTML gym. Teach it HTML 4 or XHTML 1 and CSS, and it will soon be shedding pounds, looking slimmer and younger, without all those unsightly font tags and bulky nested tables. Maybe it's time you tried a different web authoring program, which enables you to do this easily.
Look at your webstats to find out which files contribute most to your total kBytes. See if you can make them smaller, or maybe even get rid of them altogether.
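If you also have access to raw server logs, the same question can be answered with a few lines of Python. This is only a rough sketch: it assumes the logs are in Apache's common or combined log format (this tutorial does not confirm what PlusNet provides), and the sample lines and the function name bytes_per_path are made up for illustration.

```python
import re
from collections import Counter

# Picks out the requested path, status code and byte count from a log line
# in Apache's common/combined format (an assumption about your log format).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def bytes_per_path(lines):
    """Return a Counter mapping each requested path to the total bytes served."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        # A byte count of "-" means no body was sent (e.g. a 304 response).
        if match and match.group("bytes") != "-":
            totals[match.group("path")] += int(match.group("bytes"))
    return totals


# Hypothetical log lines, invented purely for this example.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Jan/2005:10:00:00 +0000] "GET /images/banner.jpg HTTP/1.1" 200 120000 "-" "Mozilla/4.0"',
    '10.0.0.2 - - [01/Jan/2005:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 8000 "-" "Mozilla/4.0"',
    '10.0.0.3 - - [01/Jan/2005:10:00:09 +0000] "GET /images/banner.jpg HTTP/1.1" 200 120000 "http://example.org/" "Mozilla/4.0"',
    '10.0.0.4 - - [01/Jan/2005:10:00:12 +0000] "GET /index.html HTTP/1.1" 304 - "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    # List the heaviest files first.
    for path, total in bytes_per_path(SAMPLE_LOG).most_common():
        print(f"{total / 1024:10.1f} KB  {path}")
```

The files at the top of the list are the first candidates for slimming down or removing.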
Putting it about a bit
It's obvious if you think about it: if you put stuff in your web space and then link to it from lots of different places, you're going to generate a lot of traffic. What's not so obvious is how big an impact this might have on your traffic allowance.
Even the lowly avatar (a small image you might use to represent you in an online forum) can clock up an unbelievable amount of traffic, if it is accessed by a lot of people – not difficult, if you are a regular in a busy forum.
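To get a feel for the numbers, here is a small back-of-the-envelope calculation in Python. The avatar size and the number of views are invented for illustration, the function name is hypothetical, and 1MB is taken to be 1024KB, which may not be exactly how PlusNet counts it.

```python
def share_of_daily_allowance(file_kb, hits_per_day, allowance_mb=250):
    """Return the fraction of the daily allowance used by one file.

    file_kb and hits_per_day are whatever you estimate for your own file;
    allowance_mb defaults to the 250MB standard allowance described earlier.
    """
    total_mb = file_kb * hits_per_day / 1024  # assumes 1MB = 1024KB
    return total_mb / allowance_mb


if __name__ == "__main__":
    # Hypothetical case: a 15KB avatar displayed 5,000 times in a day.
    share = share_of_daily_allowance(15, 5000)
    print(f"{share:.1%} of the daily allowance")  # prints "29.3% of the daily allowance"
```

In this made-up case a single small image uses well over a quarter of a standard account's daily allowance before anyone has visited the web site itself.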
You might want to think twice about publishing the URL of a page or binary file on your web site. If it is hit enough times, or becomes famous, it may prove to be an unnecessary drain on your resources. At the very least, it is wise to make these files as small as possible. Alternatively, there might be another place they can be hosted and therefore not count towards your web site allocation.
It may seem like a good idea to host a forum or photo gallery on your CGI web space, but ask yourself if you can afford the extra traffic generated by these things. Maybe it would be safer to use a third-party service.
Daylight robbery
If you have been looking closely at your webstats, you may have noticed that some of your binary files are getting a disproportionately large number of hits. If this is not intentional, you may be looking at image theft (or "hotlinking", as it is known).
Believe it or not: not only do people copy images from your web site, but they sometimes don't even bother to host them on their own servers – they just link back to your site. So visitors to their web sites consume your traffic allowance instead of theirs. In short, they let you foot the bill for their plagiarism.
There is a relatively simple way to stop bandwidth theft, using a technical trick. Judicious use of a web site file called .htaccess will instruct the web server not to allow this sort of behaviour. Full details are given in the example at the end of the tutorial.
Saving it for later
You can get software that will download an entire web site to your computer in one go. You can then look at the web site offline or browse the site using your own software in your own time. Cool, eh? Well, no, actually.
If your web site is quite large, just one person could make a huge dent on your daily allowance, which could mean the difference between staying within your limit, or getting canned. This is not a good situation to be in! You could stick a notice on your site, asking for people not to do this, but do you think this will stop them?
Your webstats show a list of "User Agents" – browsers used by your web site visitors. Site archiving software produces a different signature to ordinary browsers, identifying itself with a trade name. This name can be used to exclude the software from your site traffic. Fortunately, .htaccess comes to the rescue again. The example at the end of the tutorial shows how .htaccess stops offline archiving software from grabbing your site.
Stand still, while I take your picture
One of the ways your web site becomes known is through a process called "spidering" or "web-crawling", where a "robot" server gathers information about your web site. Generally, this is a good thing – how else do you expect to get indexed by the likes of Google, Yahoo! or MSN?
But, did you also know that there are a lot of less desirable robots on the prowl? Spammers like to hunt through your web site for e-mail addresses, and hackers like to look for vulnerabilities to exploit in your code. Even reputable search engines like to download all your images for indexing, which might not be desirable (see "Saving it for later" above). All this reduces your available traffic allowance, with no benefit to yourself.
A widely-used system has been devised where you can create a file called robots.txt on your web site, providing instructions on what you will not allow to be spidered. An example is given at the end of the tutorial. You should use this file to tell reputable search engines what they may not index, or that they may not index your site at all. Less reputable organisations may ignore robots.txt altogether, in which case our old friend .htaccess comes in useful again. Robots also leave their own "User Agent" signature in your webstats, and this can be used to exclude them from your site. The .htaccess example at the end of the tutorial makes this all clear.
Sorry, I thought you were someone else
Whilst it's very nice to have a popular web site, it's not so nice to get your site canned because 10,000 visitors thought you were giving away free television sets. As unlikely as this may seem, it is occasionally possible to word a page in such a way as to cause search engines to rate it highly for a totally unrelated query. This can produce a lot of brief visits from people with no interest in your site whatsoever. If the pages in question are long, or contain large images, this can be a big drain on your traffic allowance.
Your webstats show your top search engine queries. You might be surprised what some of them are. Try the unlikely-looking phrases in Google and see which pages are returned (use the Advanced Search, restricted to your site, if you can't find your page). The Top Entry & Exit Pages may help with this. Maybe some of these pages need work to make them less misleading.
.htaccess
A lot has been said already about .htaccess, but what is it? It is a text file containing a list of rules specific to your web site. Servers running the popular open-source Apache web server software look for this file in the directory containing a requested file, and in each directory above it, and follow the relevant rules if it is found. It is a big subject in itself, and is covered elsewhere in the tutorials. All we are concerned with here are rules which limit unnecessary traffic to your web site.
The file can be created in any text editor and must be uploaded as an ASCII file to your web space, usually in the root directory of the web site (but additional files can be added to subdirectories, if required). Here is a sample .htaccess file, set up purely to keep undesirables at bay. Each section will be discussed in detail below:
| RewriteEngine on |
| # forbid any requests from banned browsers or robots |
| RewriteCond %{HTTP_USER_AGENT} DodgyBot [OR] |
| RewriteCond %{HTTP_USER_AGENT} spammerzbot [NC,OR] |
| RewriteCond %{HTTP_USER_AGENT} GreedyGrabber [OR] |
| RewriteCond %{HTTP_USER_AGENT} Site\ Slurper |
| RewriteRule .* - [F,L] |
| # forbid any requests for images except directly, |
| # from inside the site or from friendly servers |
| RewriteCond %{HTTP_REFERER} !^$ |
| RewriteCond %{HTTP_REFERER} !^http://(www\.)?mywebsite\.co\.uk/.*$ [NC] |
| RewriteCond %{HTTP_REFERER} !^http://(search\.)?friendly\.com/.*$ [NC] |
| RewriteRule \.(jpe?g|gif|ico|swf|sit|zip)$ - [F,L] |
This all looks a bit scary because it uses Apache Rewrite rules and Perl Regular Expressions. You should be scared. But, if you take the time to follow the explanation, you will see that it is not as difficult as it looks…
| RewriteEngine on |
This rule tells the server to process the following Rewrite rules. It won't work without it.
| RewriteCond %{HTTP_USER_AGENT} DodgyBot [OR] |
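This tells the server to look for the text "DodgyBot" (case-sensitive) in the User-Agent header of the request, which is the bit where the client has to tell the server what browser it is. The [OR] flag means that the rule applies if either this condition or the next one matches.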
| RewriteCond %{HTTP_USER_AGENT} spammerzbot [NC,OR] |
This is similar to the last condition, except it is not case-sensitive. Add as many conditions as necessary…
| RewriteCond %{HTTP_USER_AGENT} Site\ Slurper |
As there is a space in the name "Site Slurper", this condition contains a backslash, which allows the space to be treated as a character rather than a gap in the condition. The same applies for most other punctuation. There is no [OR] flag here, because this is the last condition in the list.
| RewriteRule .* - [F,L] |
This rule tells the web server that any request from one of these browsers should return a "403 Forbidden" message (the F flag) and that no further rewrite rules should be processed for it (the L flag, which stands for Last).
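If you would like to try out your list of banned agents before uploading anything, the following Python sketch mimics the four conditions above. It is an approximation only: Python's re module is not Apache's regular expression engine, although the two agree on simple patterns like these, and the function name is_forbidden_agent is invented for this example.

```python
import re

# (pattern, case_insensitive) pairs mirroring the four RewriteCond lines.
# The second value is True where the condition carries the [NC] flag.
BANNED_AGENTS = [
    (r"DodgyBot", False),
    (r"spammerzbot", True),
    (r"GreedyGrabber", False),
    (r"Site\ Slurper", False),
]


def is_forbidden_agent(user_agent):
    """Return True if any banned pattern occurs anywhere in the User-Agent string.

    any() plays the part of the [OR] flags: one match is enough.
    """
    return any(
        re.search(pattern, user_agent, re.IGNORECASE if nocase else 0)
        for pattern, nocase in BANNED_AGENTS
    )


if __name__ == "__main__":
    # Hypothetical User-Agent strings, for illustration only.
    for agent in ["Mozilla/5.0 (compatible; DodgyBot/1.0)", "dodgybot", "SPAMMERZBOT/2"]:
        print(agent, "->", "forbidden" if is_forbidden_agent(agent) else "allowed")
```

Note that "dodgybot" in lower case slips through, because the first condition has no [NC] flag. Paste in agent names from your own webstats to check that your patterns catch them.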
| RewriteCond %{HTTP_REFERER} !^$ |
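This condition is met when the request carries a referrer, that is, when it came via a link or an embedded image on some page. The pattern ^$ matches an empty referrer and the ! reverses the test. A request with no referrer at all (for example, a URL entered directly into the browser) fails this condition, so it is never blocked by the rule below.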
| RewriteCond %{HTTP_REFERER} !^http://(www\.)?mywebsite\.co\.uk/.*$ [NC] |
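This condition is met when the request did not come from a page in the host web site (the ! reverses the test). Again, backslashes indicate literal punctuation. The www. prefix is optional, because the (www\.)? group may or may not be present.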
| RewriteCond %{HTTP_REFERER} !^http://(search\.)?friendly\.com/.*$ [NC] |
This condition checks if the request came from a privileged web site (in this case, at friendly.com).
| RewriteRule \.(jpe?g|gif|ico|swf|sit|zip)$ - [F,L] |
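If all three conditions are met (the request has a referrer, and that referrer is neither your own site nor the friendly one), requests for files with any of these extensions will be Forbidden, and the server will not process any further rewrite rules.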
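The logic of the three referrer conditions is easy to get backwards, so here is a Python sketch that imitates it. As before, this is an approximation using Python's re module rather than Apache itself, and the function name and test URLs are made up.

```python
import re

# The same patterns as in the .htaccess example; [NC] becomes re.IGNORECASE.
OWN_SITE = re.compile(r"^http://(www\.)?mywebsite\.co\.uk/.*$", re.IGNORECASE)
FRIENDLY = re.compile(r"^http://(search\.)?friendly\.com/.*$", re.IGNORECASE)
# The RewriteRule pattern has no [NC] flag, so it stays case-sensitive.
PROTECTED = re.compile(r"\.(jpe?g|gif|ico|swf|sit|zip)$")


def is_hotlink_blocked(path, referer):
    """Return True if a request for path with this referrer would be forbidden."""
    if not PROTECTED.search(path):
        return False  # the rule only covers the listed file extensions
    if referer == "":
        return False  # no referrer: direct request, allowed
    if OWN_SITE.search(referer) or FRIENDLY.search(referer):
        return False  # linked from your own site or a friendly one
    return True       # linked from anywhere else: forbidden


if __name__ == "__main__":
    # Hypothetical requests, for illustration only.
    print(is_hotlink_blocked("/images/logo.gif", "http://www.mywebsite.co.uk/index.html"))
    print(is_hotlink_blocked("/images/logo.gif", "http://forum.example.net/thread/42"))
```

Two things are worth noticing when you experiment with it. A file named LOGO.GIF is not protected, because the rule's pattern is case-sensitive. Also, a referrer starting with https:// does not match either site pattern as written, so it is treated as a stranger.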
If you want to know more about the syntax of these statements, then you can refer to the Apache web site (httpd.apache.org/docs/mod/mod_rewrite.html), and if you're feeling really strong you can try Perl's description of Regular Expressions (www.perl.com/doc/manual/html/pod/perlre.html).
robots.txt
It's always best to be polite, and for those spiders/web-crawlers which play by the rules, you should let them know where they stand using a robots.txt file in the root directory of your web site. Again, this is a text file, uploaded as ASCII. Here is an example file, with explanations…
| User-agent: Googlebot-Image |
| Disallow: / |
| User-agent: Googlebot |
| Disallow: /images/ |
| Disallow: /downloads/ |
This is nowhere near as scary-looking as .htaccess. The User-agent value is a partial match for the requesting browser. Each agent is listed together with a list of disallowed files.
In this example, Google Images (identifying itself as Googlebot-Image) is denied access completely, whereas Google is just denied access to the images and downloads subdirectories.
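You can check how a well-behaved robot might read this file using the robots.txt parser in Python's standard library. This is only a sketch: real search engines use their own parsers, which may interpret edge cases differently, and the helper name allowed, the blank line between the two records and the test paths are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, with a blank line separating the two records.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: Googlebot
Disallow: /images/
Disallow: /downloads/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if this parser would let the named robot fetch the path."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for agent, path in [
        ("Googlebot-Image", "/index.html"),
        ("Googlebot", "/images/pic.jpg"),
        ("Googlebot", "/index.html"),
    ]:
        print(agent, path, "->", "allowed" if allowed(agent, path) else "disallowed")
```

Robots that are not named in the file are allowed everywhere, because the example contains no catch-all record.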
Conclusion
Reducing unwanted web site traffic is a three-stage process:
- Identify the type of resource drain.
- Implement an appropriate course of action.
- Assess the impact of your action.
Use webstats to identify both high-bandwidth files and unwanted sources of wasted traffic.
Provided by: Keith Nuttall
Original Article by: acarr - Edited by: acarr