May 14, 2002

Mapping
Posted by Teresa at 12:15 AM *

This is a project for someone who has a few skills I lack.

Most weblogs keep a list of other weblogs they think are swell. I think someone ought to write a script that’ll collect up the lists and generate a map of who links to whom, perhaps representing particular blogs as larger or smaller depending on number of links from other blogs.
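A minimal sketch of the counting half of that idea, assuming the link lists have already been collected somehow. The blog names and the `blogrolls` dictionary are made-up sample data, and `inbound_counts` and `edges` are hypothetical helper names; drawing the map itself is left to whatever graphing tool is handy.

```python
from collections import Counter

# Hypothetical sample data: each blog maps to the blogs on its link list.
blogrolls = {
    "alpha": ["beta", "gamma"],
    "beta": ["gamma"],
    "gamma": ["alpha"],
    "delta": ["gamma", "alpha"],
}

def inbound_counts(blogrolls):
    """Count how many distinct blogs link to each blog."""
    counts = Counter()
    for source, targets in blogrolls.items():
        for target in set(targets):   # ignore duplicates within one list
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts

def edges(blogrolls):
    """Return sorted (source, target) pairs: who links to whom."""
    return sorted(
        (source, target)
        for source, targets in blogrolls.items()
        for target in set(targets)
        if target != source
    )

counts = inbound_counts(blogrolls)
links = edges(blogrolls)
```

The inbound count is one plausible stand-in for "larger or smaller" on the map: a blog with more incoming links would be drawn bigger.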

It would be more complicated, perhaps impossible, to write a script that would figure out who quotes whom, and how often, and at what length; but if the data could be collected and represented in some useful and easily-grasped graphical fashion, how interesting it would be!

Comments on Mapping:
#1 ::: Glenn Hauman ::: (view all by) ::: May 14, 2002, 02:03 AM:

You're in luck-- it's already been done for you, at least partially.

http://www.metastatic.org/wlm/

#2 ::: Teresa Nielsen Hayden ::: (view all by) ::: May 14, 2002, 03:20 PM:

Interesting site, but it's just a way to arrange known data. The positions of weblogs within Casey Marshall's (admittedly clever) spiral pattern are arbitrary. The arrangement doesn't tell me much I didn't already know, or help me perceive patterns in the data I hadn't been able to see before. It does make it easy to see the proportions of incoming, outgoing, and mutual links for a given blog; but that's about it.

There's more that can be done. For instance, I have a sense (as I suspect we all do) that there are clusters within blogdom. It would be interesting if there were a way to give more weight to a mutual link between blogs A and B if both of them also have mutual links to blog C, and more weight still if they both have mutual links to blogs C and D, which are themselves mutually linked.

This could get elaborate.
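One possible reading of that weighting, as a rough sketch: a mutual link starts at weight 1, gains 1 for every blog that is mutually linked to both ends, and gains 1 more for every pair of those shared blogs that are mutually linked to each other. The scoring rule, the function names, and the sample data are all assumptions made for illustration.

```python
def mutual_links(blogrolls):
    """Map each blog to the set of blogs it is mutually linked with."""
    mutual = {blog: set() for blog in blogrolls}
    for a, targets in blogrolls.items():
        for b in targets:
            if b != a and a in blogrolls.get(b, ()):
                mutual[a].add(b)
    return mutual

def cluster_weight(a, b, mutual):
    """Weight of the mutual link a<->b; 0 if the link is not mutual."""
    if b not in mutual.get(a, set()):
        return 0
    # Blogs mutually linked to both a and b.
    shared = sorted((mutual[a] & mutual[b]) - {a, b})
    weight = 1 + len(shared)
    # Extra credit when two shared blogs are mutually linked to each other.
    for i, c in enumerate(shared):
        for d in shared[i + 1:]:
            if d in mutual[c]:
                weight += 1
    return weight

# Hypothetical sample: A, B, C, D all link to one another; E links to A
# without being linked back; F and G link only to each other.
sample = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C"],
    "E": ["A"],
    "F": ["G"],
    "G": ["F"],
}
mutual = mutual_links(sample)
```

With this rule, the A-B link inside the tight cluster scores higher than the isolated F-G link, which is the effect being asked for.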

#3 ::: Bob Webber ::: (view all by) ::: May 14, 2002, 08:22 PM:

No concrete and immediate help to offer, but check out www.searchtools.com for some ideas on how a data collection spider could be made, and k. claffy's interesting Internet connectivity and load graphics, via her CAIDA staff homepage at .

Best regards,
bob

#4 ::: Steve Cook ::: (view all by) ::: May 14, 2002, 10:03 PM:

Are you imagining simply stripping out all links from the front page of each blog? Or some logic that figures out what the bookmarks/favorites/daily reads list is?

#5 ::: Laurel Krahn ::: (view all by) ::: May 14, 2002, 10:35 PM:

Weblog Madness ( http://www.larkfarm.com/weblog_madness.htm ) is a great place to look for pages about weblogs. Seems like every so often someone will try to put together something that's at least a little bit like what you're talking about.

There used to be a cool fan favorites ( http://jim.roepcke.com/fan-faves ) webpage that used the data at weblogs.com to track which weblogs were most read or linked to among those webloggers with accounts there (and back in the day, a large percentage of webloggers participated).

Oh-- and if you haven't seen blogdex ( http://blogdex.media.mit.edu/ ), I find it pretty darn cool.

#6 ::: Teresa Nielsen Hayden ::: (view all by) ::: May 14, 2002, 10:46 PM:

Thanks, Bob, I'll have a look.

Steve, I was imagining something fairly simple, like "look for a whole bunch of links and not much else, all jammed into their own section, =n= percent of which display =n= or more of the following list of common characteristics of weblogs."

If you wanted to get slightly fancier, you could periodically run a search for "lots of links to this one from other blogs' lists, but no list of its own turning up during regular searches," to improve your chances of spotting anomalous formats like annotated lists.
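That second pass could look something like the sketch below, assuming the first pass has produced a dictionary of detected link lists. The threshold `min_inbound`, the function name, and the sample data are hypothetical.

```python
def missing_list_candidates(blogrolls, min_inbound=2):
    """Blogs that many lists point at, but whose own list was not found."""
    inbound = {}
    for source, targets in blogrolls.items():
        for target in set(targets):
            if target != source:
                inbound[target] = inbound.get(target, 0) + 1
    return sorted(
        blog
        for blog, count in inbound.items()
        # An empty or absent entry means no list was detected for that blog.
        if count >= min_inbound and not blogrolls.get(blog)
    )

# Hypothetical sample: "x" is widely linked but no list was detected for it.
found_lists = {
    "a": ["x", "b"],
    "b": ["x", "a"],
    "c": ["x"],
    "x": [],
}
candidates = missing_list_candidates(found_lists, min_inbound=3)
```

The blogs it returns are the ones worth checking by hand for an annotated list or some other unusual format.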

#7 ::: Steve Cook ::: (view all by) ::: May 15, 2002, 09:37 PM:

I was imagining something fairly simple, like look for a whole bunch of links and not much else, all jammed into their own section, =n= percent of which display =n= or more of the following list of common characteristics of weblogs.

That's probably doable; a better method might be to look at the front page of a weblog once or twice a week for a month or so, then see what links stayed constant over the period. (The problem is that any person who serves out a dynamic favorites list or changes in the middle will show up as having no favorites at all.)
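A small sketch of the constant-links idea, assuming each visit to a front page yields a set of link URLs. The `min_fraction` knob is an added assumption: lowering it below 1.0 is one way to soften the problem of lists that change partway through the period.

```python
def stable_links(snapshots, min_fraction=1.0):
    """Return links present in at least min_fraction of the snapshots."""
    if not snapshots:
        return set()
    counts = {}
    for links in snapshots:
        for link in set(links):
            counts[link] = counts.get(link, 0) + 1
    needed = min_fraction * len(snapshots)
    return {link for link, seen in counts.items() if seen >= needed}

# Hypothetical sample: three visits to the same front page.
visits = [
    {"blog-a", "blog-b", "story-1"},
    {"blog-a", "blog-b", "story-2"},
    {"blog-a", "story-3"},
]
always = stable_links(visits)
mostly = stable_links(visits, min_fraction=0.6)
```

Links inside individual posts come and go between visits, so they drop out, while the standing list survives.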

I'm also fairly sure that you could do this by scraping Blogdex, although you'd be relying on people having registered.

#8 ::: Christine ::: (view all by) ::: May 15, 2002, 10:32 PM:

Isn't that sort of similar to what Blogdex does? Although it's not a map - but I look at who links to me, and then I can tell who links to them, and meanwhile I can tell my "ranking" overall and so forth.

#9 ::: Teresa Nielsen Hayden ::: (view all by) ::: May 16, 2002, 11:10 AM:

Laurel, both those sites are interesting, and data's always good, but what I want to see is the relationships within the data, like the crystalline structure in a rock.

Christine, a single weblog is doable, but my buffers would overflow if I tried to do that for more than one weblog at a time. Also -- correct me if I'm wrong -- it sounds like your method is at least partially dependent on knowing who's who in the blog world. That may be the only truly reliable sorting mechanism -- the members of our species who can do that do it very well indeed -- but I wouldn't mind some computerized help.

Steve, that's a good one. I hadn't thought about sorting by constant vs. changeable links. ... I suspect that the most sophisticated automated search processes currently available would at most reduce the total number of weblogs you actually had to go and look at.

To clarify, not that I really think you need it, but just to expand on the idea: The data I want is that common weblog feature, the little list of blogs-linked-to, which among other things functions as a statement of identity: "This is the sort of weblog that links to bOINGbOING, James Lileks, Pigs and Fishes, Honeyguide, Diary de la Vex, and Arts and Letters Daily."

(Another interesting statistic, assuming you could reliably isolate that data, would be the frequency with which a given weblog actually discusses material appearing in the weblogs that it lists. What we could legitimately infer from this information is of course another question entirely.)

#10 ::: Graydon ::: (view all by) ::: May 16, 2002, 03:50 PM:

This -- wanting a weighted topology for an arbitrary set of links -- is a hard problem; it's assigning weight to a graph traversal without prior knowledge of the graph's cyclic nature. (I suppose there is a theoretical set of bloggers who don't reference the people who reference them, but I doubt they'll be encountered in practice. At that point, the spider has to be able to unwind the loops. This is surprisingly difficult in the general case.)
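For the traversal step alone, the usual way to keep a spider from going around a loop forever is to remember what it has already visited. The sketch below shows only that much and does not address the harder weighting question; `crawl`, `get_links`, and the sample graph are hypothetical names and data.

```python
from collections import deque

def crawl(start, get_links, limit=1000):
    """Breadth-first walk that visits each blog once, even around loops."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < limit:
        blog = queue.popleft()
        order.append(blog)
        for target in get_links(blog):
            if target not in seen:   # skip anything already queued or visited
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical sample graph with a loop: a -> b -> a, and b -> c -> a.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
visited = crawl("a", lambda blog: graph.get(blog, []))
```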

Even once you have the nicely weighted graph, depicting it sensibly (for values of 'sensibly' mostly equivalent to 'humans can tell this from a plumbing diagram for a large public structure without particular effort or concentration') is also a hard problem. Consider how useless the visualization tools for large databases can prove to be, if one is trying to get a sense of the whole of the database.

#11 ::: Bob ::: (view all by) ::: May 17, 2002, 12:05 AM:

Dear Teresa:

If at heart you're still the simple tootsie who put together the detailed statistics of AZAPA mailings one hot, damp week in East Lansing to celebrate her new relationship with a certain Patrick, I expect the only thing that will really help is a good, long soaking in the data.

I originally wrote a comment much longer than that appearing above, and in it I referred to a Scientific American article which described and illustrated a scheme to identify the "hot spots" in the WWW by analyzing the "referred-to-ness" of websites. This was back in the days when a lot of web pages were collections of pointers to technical data that engineers had found useful. Operations like the standards development group at PictureTel would publish directories of the publicly accessible finished standards and works in progress.

I tried fishing for this in the index on the SciAm website, but didn't have much luck. If I only had it on microfiche, something I have a clue about searching. A cumulative index, maybe a "KWIC" (Key Word In Context, a permuted index of each significant noun or verb with a few words of context on each end) would probably find this article quickly, and the illustrations might help give you some shapes to hang your thoughts on.

Regarding mapping the textual and mark-up content of web pages to actual weblogs and ultimately their owners, I think this is probably best conducted as part of soaking in the data. Back in the Dark Ages of the early 1980s there was a significant amount of activity in applied statistics w.r.t. clustering analysis. There was something, I think it might have been a multivariate analysis, that coughed up something called a "Pearson Correlation Coefficient." This was some aspect of an overall analysis project: data fitting to hypothesized trade routes around the Mediterranean, with clustering analysis applied to the trace element content of clays used in (presumed) locally manufactured pottery.

(Neutron Activation Analysis, thanks for asking: throw the sample in the Slowpoke subcritical reactor for a while, fish it out, stick it in front of a gamma-ray counter. The nuclei emit gamma rays with energy distributions (spectra) as the nucleons decay from the excited state they enter as they interact with the neutron flux in the reactor core. It's kind of like the way substances' electrons emit characteristic wavelengths when bombarded with X-Rays or Hard Ultraviolet in Fluorescence Spectroscopy, only slower because electrons are so much more likely to find it convenient to hop back down in energy and toss a photon. But that's neither here nor there.)

Anyway, it seems to me that you are trying to define a kind of correlation parameter: you're not asking questions about what it means, but you're trying to find a way to map affinities to something that can be fixed in your field of vision and used as an external reference, a Langdon Chart, if you will, for visualizing relationships in the absence of faces in places at times.

So any typical search spider or even wget can probably do the work of traversing weblog space. Your first problem is finding a way to coarsely filter out all the stuff that clearly isn't what you're looking for. This is a job for awk/sed/Perl -- I recommend awk, it's elegant and will do all you need. Perl is far more powerful, but definitely got an extra whack from the ugly stick.

This script has to take the output of wget (say), read it line by line, and only pass through the lines that look like they're in the blocks you're interested in. Unfortunately, you have to start out going through it by hand. Awk or Perl will make it easy to do the actual edit automatically, and allow you to use a bunch of different patterns as delimiters, so either would be well-suited to building up a set of ad hoc editing filters.

As part of this process, you'll also set up patterns to trip counters when the URLs of certain weblogs appear. Either Perl or awk will let you print out a tabulation of these data at the end of a round of processing. For small data sets, that's probably enough to get started.
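The filter-and-count step might be sketched as follows. This uses Python rather than awk or Perl, purely for illustration. The start and end markers, the sample page, and the function names are all assumptions; real pages would need the hand-built patterns described above.

```python
import re
from collections import Counter

# Matches absolute http/https URLs inside double-quoted href attributes.
HREF = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

def links_in_block(lines, start_pattern, end_pattern):
    """Yield URLs on lines strictly between a start and an end marker."""
    start = re.compile(start_pattern)
    end = re.compile(end_pattern)
    inside = False
    for line in lines:
        if not inside:
            if start.search(line):
                inside = True    # the marker line itself is not scanned
            continue
        if end.search(line):
            inside = False
            continue
        yield from HREF.findall(line)

def tally(pages, start_pattern, end_pattern):
    """Count, across pages, how many pages list each URL in the block."""
    counts = Counter()
    for lines in pages:
        counts.update(set(links_in_block(lines, start_pattern, end_pattern)))
    return counts

# Hypothetical fetched page, one string per line.
page = [
    '<a href="http://example.com/post1">a post</a>',
    '<!-- blogroll start -->',
    '<a href="http://example.org/a">A</a>',
    '<a href="http://example.net/b">B</a>',
    '<!-- blogroll end -->',
    '<a href="http://example.com/about">about</a>',
]
found = list(links_in_block(page, r"blogroll start", r"blogroll end"))
totals = tally([page, page[:3] + page[4:]], r"blogroll start", r"blogroll end")
```

Each new site format would mean adding another pair of marker patterns, which is the "set of ad hoc editing filters" in practice.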

On the whole, I think that awk would be easier for an experienced Hypercard programmer to learn than Perl.

For more complex datasets you'll need a statistical software package and some statistical expertise, but a correlation coefficient ought to work as well with a link count from Electrolite as a gamma ray count in a given energy range. Way back when, the pottery analysis took a university-sized IBM mainframe. These days you have exclusive use of a lot more CPU than that in your apartment, but it seems you might need to work on formulating your questions more precisely?
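For reference, the correlation coefficient itself is short enough to compute by hand. The sketch below assumes each weblog is reduced to a row of numbers, here 1 or 0 for whether it links to each blog in some fixed list; that encoding, the sample rows, and the function name are illustrative assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)

# Hypothetical rows: does each weblog link to blogs 1..5 of a fixed list?
blog_one = [1, 1, 0, 0, 1]
blog_two = [1, 1, 0, 0, 0]
similarity = pearson(blog_one, blog_two)
```

A value near 1 would suggest two weblogs keep similar lists, a value near -1 that their lists are close to opposites.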

Best regards,
bob

#12 ::: Clark Myers ::: (view all by) ::: May 23, 2002, 02:56 AM:

Granted, the ultimate product will be better with a custom from-scratch application built with the best tools, possibly done by the folks at Illinois who proved the 4 color mapping conjecture (who would develop nice lemmas as they go), but as a practical matter, starting with Visio wizards for a spider and using MS tools with a little rough massaging will get adequate results, including visual-with-a-small-v output, much much sooner for a small number of iterations.

#13 ::: Teresa Nielsen Hayden ::: (view all by) ::: May 23, 2002, 07:48 AM:

I'm still thinking.

Graydon, I grant that in theory it's a difficult problem, but might it not be possible to start at an arbitrary point in the middle of the knot by identifying some rule-of-thumb characteristics and characteristic relationships? It's imperfect. It would bias the data. It wouldn't be good for finding the wholly unintuitive patterns of relationship that otherwise are visible only to alien intelligences. But it might be manageable.

Bob, definitely still thinking. Langdon chart, yes. That one we both remember was a crude instrument. But did I ever tell you that it turned up a discontinuity? The lines went every which way except into the Deep South. There was either an absence of qualifying activity across that line, or the gossip wasn't getting through. The former, most likely.

#14 ::: Bob Webber ::: (view all by) ::: June 05, 2002, 12:23 PM:

I assume that volunteers flang themselves into the breach, quickly erasing this natural monument shortly after it was mapped?

Dire legal notice
Making Light copyright 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 by Patrick & Teresa Nielsen Hayden. All rights reserved.