Tuesday, June 5, 2012

The Cob-Web

While doing some web scraping for another project, I collected some ancillary data that I think you'll find interesting.  I wrote a program to collect a comprehensive list of registered domains under the .com top-level domain, and I wondered:  how many .com domains are there?  How long is the average domain name?  What is the longest?  Roughly what proportion of them are parked, typical websites, or pornography?

The answer, I thought, was actually pretty sad.  The web that I had been led to believe was predominantly pornography and parking was actually mostly abandoned.  In my simple random sample, there were more parked domains than all other non-404 websites combined.  I'll get to that in a bit.  Please note that this study only concerns .com domains, not .net, .org, .gov, .edu, or anything else.

Let's start by taking a look at the distribution of domain name lengths.  Since my data file included the ".com" part, but not the "http://" part, I could get the length of each domain name proper by subtracting the four characters of ".com", using a simple one-liner in bash:

awk '{print length($1)-4}' domainList | sort -n | uniq -c | awk '{print $2, $1}'

Figure 1:  Two views of the number of registered .com domains versus the length of the domain name.  Note the logarithmic scale on the left.

Remarkably, there were only three registered one-letter domain names:  q.com, x.com and z.com (these must be grandfathered in), and z.com returned a 404 error.  There were still unregistered 2-letter domain names, and there were at least a million domains of each length from 5 to 21.  The most common domain name length was 11, and there were 547 .com domains with the maximum length of 63 characters.  In total, I counted 54,816,663 registered dot-com domains on April 30, 2012.
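For readers who would rather not chain awk, sort and uniq, here is a minimal Python sketch of the same length count.  It is not the code I ran; the function name and the tiny sample list are made up for illustration, and it assumes one domain per line with the ".com" suffix included, as in my data file.

```python
from collections import Counter

SUFFIX = ".com"  # every entry is assumed to end in ".com"


def length_histogram(lines):
    """Count domains by the length of the name without the '.com' suffix."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        domain = fields[0]  # first whitespace-separated field, like awk's $1
        counts[len(domain) - len(SUFFIX)] += 1
    return counts


# Illustrative input only: the three one-letter domains plus one made-up entry.
sample = ["q.com", "x.com", "z.com", "example.com"]
hist = length_histogram(sample)
for length in sorted(hist):
    print(length, hist[length])
```

To run it on a real list, pass an open file handle instead of the sample list, since iterating over a file yields one line at a time.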

Now I wanted to get a handle on what the landscape of the web looked like.  I used another simple one-liner to generate a list of 100 randomly selected domains from my collection:

awk '{print rand(), $1}' domainList | sort -n -k 1 | head -n 100 | awk '{print $2}'
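The one-liner tags every line with a random key and sorts all of them, which is simple but does a full sort of the list.  One caveat: in some awk implementations (gawk, for example) rand() produces the same sequence on every run unless srand() is called first.  As an alternative, here is a rough Python sketch using reservoir sampling, a different technique that draws a simple random sample in one pass without sorting.  The function name and the made-up domain list are hypothetical, not what I actually ran.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return k items chosen uniformly at random from an iterable (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Illustrative input only: 1000 made-up domain names.
domains = [f"site{i}.com" for i in range(1000)]
for domain in reservoir_sample(domains, 5):
    print(domain)
```

Because it only ever holds k items in memory, the same function works on an open file handle for a list with tens of millions of lines.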

I acted as my own Mechanical Turk, classifying each of the 100 domains in my simple random sample as either 404, parked, XXX, website placeholder, or website.  Website placeholders were domains that had been purchased but never developed, and that displayed a message from the registrar.  A 95% confidence interval is given for each type.

Type of .com        Proportion (95% CI)
404                 48 ± 9.8%
Parked              28 ± 8.8%
'Normal' Website    19 ± 7.7%
Placeholder         4 ± 3.8%
'XXX' Website       1 ± 1.9%
Table 1:  Types of website, and proportion in my n=100 random sample.
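The post does not say how the intervals were computed, so treat the following as an assumption: the usual normal-approximation (Wald) margin, 1.96 * sqrt(p(1-p)/n), gives values close to those in Table 1.  A minimal sketch, with a hypothetical function name:

```python
import math

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def wald_margin(count, n, z=Z_95):
    """Half-width of the normal-approximation interval for a proportion."""
    p = count / n
    return z * math.sqrt(p * (1 - p) / n)


# Counts taken from Table 1 (n = 100).
table = [
    ("404", 48),
    ("Parked", 28),
    ("'Normal' Website", 19),
    ("Placeholder", 4),
    ("'XXX' Website", 1),
]
for label, count in table:
    margin = wald_margin(count, 100)
    print(f"{label}: {count} ± {100 * margin:.1f}%")
```

This approximation is known to behave poorly when a count is very small, so the intervals for the placeholder and XXX rows should be read loosely.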

Table 1 gives a surprising view of our web:  about half of it returns a 404 error.  About a quarter of it is parked by people seeking to make an easy buck by re-selling the domain.  About a fifth is what we would consider a website.  The remainder is either a placeholder, where someone will ostensibly develop a new website, or pornography, which is apparently quite rare compared to typical personal or corporate web pages.

Though the error bars are quite large (admittedly, I tired of being a Mechanical Turk), this exercise challenged my stereotype of the web.  It suggests that there is a huge amount of turnover, with domains being relinquished and picked up again, mostly by domain parking speculators but also by legitimate people and businesses.  It also raises another concern:  if a website can be considered part of the culture of the information age, what is happening to the enormous amount of information and data that is being turned over?  The Internet Archive has put forth a valiant effort, but nonetheless websites disappear like the great cities of the past, except without any ruins that future historians could ever hope to dig up and analyze.

It may prove that the information age has a very short memory!
