The answer, I thought, was actually pretty sad. The web, which I had been led to believe was predominantly pornography and parked domains, turned out to be mostly abandoned. In my simple random sample, there were more parked domains than all other non-404 websites combined. I'll get to that in a bit. Please note that this study only concerns .com domains, not .net, .org, .gov, .edu, or anything else.
Let's start by taking a look at the distribution of domain name lengths. Since my data file included the ".com" part, but not the "http://" part, subtracting four characters from each entry gives the length of the domain name proper, and a simple one-liner in bash tallies how many domains there are of each length:
awk '{print length($1)-4}' domainList | sort -n | uniq -c | awk '{print $2, $1}'
Figure 1: Two views of the number of registered .com domains versus the length of the domain name. Note the logarithmic scale on the left.
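To make the output of that tally concrete, here is a minimal sketch that wraps the same pipeline in a function and runs it on a tiny made-up list. The function name `length_histogram` and the four sample domains are assumptions for illustration only; they are not part of my data set.

```shell
# length_histogram: read one domain per line on stdin and print
# "length count" pairs, shortest names first.
length_histogram() {
  # length($1)-4 drops the 4-character ".com" suffix from the count;
  # uniq -c prints "count length", so the last awk swaps the columns.
  awk '{print length($1)-4}' | sort -n | uniq -c | awk '{print $2, $1}'
}

# Hypothetical input standing in for the real domainList file.
printf '%s\n' example.com abc.com xyz.com a-much-longer-name.com | length_histogram
# prints:
# 3 2
# 7 1
# 18 1
```

Each output line is a name length followed by the number of domains with that length, which is the form a plotting tool would need for a figure like the one above.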
Now I wanted to get a handle on what the landscape of the web looked like. I used another simple one-liner to generate a list of 100 randomly selected domains from my collection:
awk '{print rand(), $1}' domainList | sort -n -k 1 | head -n 100 | awk '{print $2}'
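One caveat worth knowing about this approach: awk's `rand()` is seeded through `srand()`, and without that call many awk implementations return the same sequence on every run. Below is a minimal sketch of a seeded variant, so a sample can be either fresh or deliberately reproducible. The function name `sample_domains` and its optional seed argument are my own illustrative choices, not something I used for the sample reported here.

```shell
# sample_domains N [SEED]: print N lines drawn at random from stdin.
# With no SEED, the current Unix time is used, so runs differ.
sample_domains() {
  LC_ALL=C awk -v seed="${2:-$(date +%s)}" '
    BEGIN { srand(seed) }
    # Fixed-point output avoids exponent notation, which sort -n
    # would not order correctly for very small keys.
    { printf "%.17f\t%s\n", rand(), $1 }
  ' | LC_ALL=C sort -n -k 1,1 | head -n "$1" | cut -f 2
}

# Hypothetical five-line input; pick 2 names with a fixed seed.
printf '%s\n' alpha.com beta.com gamma.com delta.com epsilon.com | sample_domains 2 42
```

Passing the same seed twice should return the same sample, which is handy if you want someone else to be able to check a classification against the exact same list.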
I acted as my own Mechanical Turk, classifying each of the 100 domains in my simple random sample as either 404, parked, XXX, website placeholder, or website. Website placeholders are domains that were purchased but never developed, and that display a message from the registrar. A 95% confidence interval is given for each type.
Type of .com | Proportion (95% CI) |
---|---|
404 | 48% ± 9.8% |
Parked | 28% ± 8.8% |
'Normal' Website | 19% ± 7.7% |
Placeholder | 4% ± 3.8% |
'XXX' Website | 1% ± 1.9% |
Table 1: Classification of the 100 randomly sampled .com domains.
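For anyone who wants to check the error bars, they are consistent with the usual normal-approximation interval for a proportion, p ± 1.96·sqrt(p(1−p)/n), with n = 100. The sketch below assumes that formula; the function name `wald_ci` is made up for illustration. It prints two decimals, so the last row comes out as 1.95 where the table shows 1.9.

```shell
# wald_ci COUNT N: print the sample percentage and the half-width of a
# 95% normal-approximation interval, both in percentage points.
wald_ci() {
  LC_ALL=C awk -v k="$1" -v n="$2" 'BEGIN {
    p = k / n
    half = 1.96 * sqrt(p * (1 - p) / n)   # z = 1.96 for 95%
    printf "%.0f%% +/- %.2f\n", 100 * k / n, 100 * half
  }'
}

# Counts from the table, out of 100 sampled domains.
for k in 48 28 19 4 1; do wald_ci "$k" 100; done
# prints:
# 48% +/- 9.79
# 28% +/- 8.80
# 19% +/- 7.69
# 4% +/- 3.84
# 1% +/- 1.95
```

The normal approximation is rough for the small categories: with only 4 or 1 hits out of 100, the interval nearly reaches zero, so those two rows should be read as "rare" rather than as precise estimates.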
Table 1 gives a surprising view of our web: about half of it returns a 404 error. Roughly a quarter is parked by people seeking to make an easy buck by reselling the domain. About a fifth is what we would consider a website. The remainder is either a placeholder, where someone will ostensibly develop a new website, or pornography, which in this sample is quite rare compared to typical personal or corporate web pages.
Though the error bars are quite large (admittedly, I tired of being a Mechanical Turk), this exercise challenged my stereotype of the web. It suggests that there is a huge amount of turnover, with domains being relinquished and picked up again, mostly by domain parking speculators but also by ordinary people and businesses. It also raises another concern: if a website can be considered part of the culture of the information age, what is happening to the enormous amount of information and data that is being turned over? The Internet Archive has put forth a valiant effort, but websites nonetheless disappear like the great cities of the past, except without leaving any ruins that future historians could hope to dig up and analyze.
It may prove that the information age has a very short memory!