Making Light: An Hour In Sp@m

February 23, 2012

An Hour In Sp@m
Posted by Jim Macdonald at 06:31 PM *

Varney opens the treasure-house of his knowledge….

—Varney the Vampyre; or, The Feast of Blood

Would you like to look behind the scenes, to peer in upon the doings of the gnomes high in the glass-and-steel headquarters of Making Light? Come then, with me, to view their doings. Come to view An Hour In Spam.

The example that you see on the left (click it to see it at a larger, readable level) does not come from a particularly active hour. (The hours from three to six in the morning Eastern time can see many times this many spam attempts.) Let me go over what you’ll see.

On the left, the little square check box allows the gnomes to indicate which post or posts are to be operated upon. Choices include Publish, Unpublish, Delete, and Mark As Spam. Moving to the right, the little orange triangle marks these comments as Unpublished; Held For Moderation. (A green triangle marks a Published Comment, while a purple one marks Known Spam.) All that we are looking at here is the Moderation Queue.

Next, to the right, comes a block of text: the comment itself. Links aren’t shown here (nor are paragraph breaks, italics, blockquotes, and so on). Oftentimes the gnomes can tell just by inspection whether a comment is spam or ham. They can check the box and do a group action.

Next line down on each comment, we see four columns. The first, on the left, is Edit. When the gnomes click there, they move to the editing screen (example to the right).

The second link is the Commenter. Sometimes it’s obvious that this is a spammer: Few people go around with names like Cheap Viagra No Prescription or Auto Scratch Remover. Sometimes, however, it’s a human-sounding name, like Caleb Hutchcraft or Marina Gordon. A click there brings up the Show All By screen.

The third column shows the name of the thread where the comment appeared. That link goes to the editing screen for each particular post.

The fourth column shows how long ago the comment was posted. That isn’t a clickable link.

The last column, on the far right, shows the IP address whence the comment was posted. This is a live link to a Show All posted from that IP. This is less useful than it could be: Nearly all spam is posted from compromised addresses.

After we’ve checked the boards for posts that are labeled Spam by the commentariat, and those spammish posts are Unpublished, the next thing that the gnomes do is go to the moderation queue and start reading the comments one-by-one with the Edit link.

The gnomes have a pretty good memory for prose, and a decent eye for patterns. As each post is examined, they look for patterns in the e-mail addresses, in the commenter’s names, in the URLs of the websites being advertised, and in the text of the comment itself.

When a real comment from a real person appears, the gnomes instantly publish it, after checking to see which filter was triggered that moved the post to moderation. Sometimes, it’s a filter that they’re not going to remove (e.g. malformed URLs) because, even though those filters occasionally hold up a real person, they tend to stop dozens if not hundreds of spam posts every day. Other filters which less-often stop spam are removed.

The other posts — the ones where the filters didn’t stop the spam — the gnomes use to build new filters.

The way that works, the gnomes find what look like key phrases. They Google those phrases, to see if they mostly show up in spam comment posts (e.g. “center to heart”). Then they look at the phrases immediately before and after the key phrase.

Spammers have gone to mad-libs style comments. I suspect that the word-and-phrase lists they use are either comma-delimited or single-quote delimited, from the bizarre ways in which commas and apostrophes are used in many spam comments. A comment with no space after a comma, or one with no apostrophe in a standard contraction, is, more than nine times out of ten, a spam comment.
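Those two tells, a comma with no space after it and a contraction missing its apostrophe, are simple enough to sketch in a few lines of Python. This is only an illustration, not the gnomes' actual filter: the names `COMMA_NO_SPACE`, `BARE_CONTRACTIONS` and `spam_signals` are invented here, and the contraction list is a small hypothetical sample.

```python
import re

# A comma immediately followed by a letter, e.g. "useful,informative".
# Digits are excluded so that numbers such as "9,695" are not flagged.
COMMA_NO_SPACE = re.compile(r",(?=[A-Za-z])")

# A short, illustrative list of contractions written without an apostrophe.
# Ambiguous spellings ("wont", "cant", "its", "were") are deliberately left
# out because they are also ordinary words.
BARE_CONTRACTIONS = re.compile(
    r"\b(dont|doesnt|didnt|isnt|arent|wasnt|couldnt|wouldnt|shouldnt|youre|theyre)\b",
    re.IGNORECASE,
)

def spam_signals(comment):
    """Return the names of the punctuation tells found in a comment."""
    signals = []
    if COMMA_NO_SPACE.search(comment):
        signals.append("comma-without-space")
    if BARE_CONTRACTIONS.search(comment):
        signals.append("missing-apostrophe")
    return signals
```

A heuristic like this would presumably hold a comment for moderation rather than delete it outright, since real people mistype too.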

Let me show you what a filter looks like:

/a (useful|informative|helpful|educational|benificial|beneficial|) (and|along with) (funny|interesting|amusing) (publication|write.?up|essay|post|article|submitting|submission|script)/i

One of the words in each set of parentheses, separated by vertical bars, is substituted into each slot. Thus, that filter will stop “A useful and funny publication” or “a helpful along with amusing article” or any other combination you can build out of that list. The “/” character tells the filter where the phrase starts and stops; the small-letter i after the second slash means that both small letters and capital letters will trip the filter. And the .? mark means that any one character will match: write-up or write up or writeup will trip the filter.
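For anyone who wants to try the filter, here is a minimal sketch in Python. The pattern is copied verbatim from above; the assumptions are that the filter runs as an ordinary regular expression search over the comment text, and the name `is_gnomed` is made up for this example.

```python
import re

# The filter from the post, copied verbatim. The trailing /i of the
# slash-delimited form becomes the re.IGNORECASE flag in Python.
FILTER = re.compile(
    r"a (useful|informative|helpful|educational|benificial|beneficial|) "
    r"(and|along with) (funny|interesting|amusing) "
    r"(publication|write.?up|essay|post|article|submitting|submission|script)",
    re.IGNORECASE,
)

def is_gnomed(comment):
    """Return True if the filter phrase appears anywhere in the comment."""
    return FILTER.search(comment) is not None
```

Both sample phrases from the paragraph above trip it, and so do "write-up", "write up" and "writeup". One detail worth noticing: the empty alternative after the last vertical bar in the first group means the adjective can be missing altogether, as long as both surrounding spaces are still there.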

I regret that comments on the sorts of things that the gnomes gnome is likely to get that comment gnomed. But … the gnomes will release those comments soon enough.

So ends our tour of one of the floors in the landmark glass-and-steel tower. Please stop by the gift shop on your way out.

Comments on An Hour In Sp@m:

#1 ::: Jacque ::: (view all by) ::: February 23, 2012, 06:41 PM:

Moshe Yudkowsky, when setting up his shiny new server, commented that plugging into the unfiltered Internet was just like tapping an active sewer line....

#2 ::: Jacque ::: (view all by) ::: February 23, 2012, 06:48 PM:

Teh clickey is no work ... ?

#3 ::: Henning Makholm ::: (view all by) ::: February 23, 2012, 06:49 PM:

There's a stray "l" in the link to the larger-and-readable image.

#4 ::: Xopher HalfTongue ::: (view all by) ::: February 23, 2012, 06:49 PM:

Fascinating stuff. Answers questions I'd had but hadn't articulated.

I get an error page when I click the spam list, though.

#5 ::: Tom Whitmore ::: (view all by) ::: February 23, 2012, 06:52 PM:

Interesting and sfl material. If we talked regularly about Joanna Russ' "*s*f*l Phrases for the Tourist", we'd have more pseudospam!

#6 ::: Stefan Jones ::: (view all by) ::: February 23, 2012, 06:52 PM:

Caleb Hutchcraft. I like that. May I use that as a minor character name?

#7 ::: Xopher HalfTongue ::: (view all by) ::: February 23, 2012, 06:53 PM:

Ah, well spotted, Henning. Yeah, probably isn't called "spaml.jpg," huh?

#8 ::: Xopher HalfTongue ::: (view all by) ::: February 23, 2012, 06:55 PM:

Spam-El is one of Mon-El's brothers. (Along with Tue-El, Wed-El, etc.)

#9 ::: Lee ::: (view all by) ::: February 23, 2012, 07:11 PM:

I am vastly amused by the robo-comment repeated over and over again by "photo mounting board / matte board" (the name varies). Obviously the script plugs in the title of the thread on which it's being posted, with sometimes-hilarious results.

"Making Light: $9,695 New Age sweat lodge session kills 2, injures 19 is kinda vanilla." Really?

#10 ::: Lee has been unsurprisingly gnomed ::: (view all by) ::: February 23, 2012, 07:13 PM:

Quoting part of the pattern from a gnomed comment had the more-or-less expected result.

#11 ::: thanate ::: (view all by) ::: February 23, 2012, 07:17 PM:

I suspect the correlation I make between brain.?dead commenters and posting all my social media contacts is not the one the spammers wish me to make.

Interesting. I continue to be delighted that other people are willing to deal with this moderation thing. (Thank you, gnomes!)

#12 ::: Henning Makholm ::: (view all by) ::: February 23, 2012, 07:18 PM:

I'm intrigued by the claim that most spam is sent from spoofed IP addresses, that is, addresses different from the address of the machine that actually sent you the spam.

I would expect that most spam is sent by malware-infected zombie machines, but labeled with the zombie's actual IP address. That would indeed make the logged IP more or less useless for recognizing spam, so I'm tentatively assuming that this is what "spoofed addresses" was intended to mean here.

True spoofing of sender addresses is easy enough for individual IP packets. But for the spoof to survive a three-way TCP handshake and get through to the HTTP layer is supposed to require either phenomenal luck (if the webhost OS uses pseudorandom SYN cookies as it should) or some rather dramatic BGP-level subversion of the entire network, which I'd assume was outside the capabilities of your average spamhaus.

However, my knowledge of these things is purely theoretical, and I'd love to be set right by somebody with actual combat experience.

#13 ::: Jim Macdonald ::: (view all by) ::: February 23, 2012, 07:24 PM:

I may well have spoken inexactly about "spoofed" addresses. Whatever they are, the addresses are useless for the purpose of filtering: When you see the same exact text, including the URL being advertised, come in six times in as many seconds, each time from a different apparent IP, well... it really drives home how we have to find something else to use for our filters.

#14 ::: Tony Zbaraschuk ::: (view all by) ::: February 23, 2012, 07:25 PM:

Very much thank you. Lots of effort goes into making this site the wonderful place to read that it is.

Along another vein,

DIE SPAMMERS DIE!!!!!

#15 ::: Henning Makholm ::: (view all by) ::: February 23, 2012, 08:14 PM:

Jim @13: Sorry for going pedantic on you, then.

There's probably an interesting point about language, communication and culture to be made here. If only I were quite sure what it is ... something about IT people having a tendency to assign technical meanings to everyday terms (such as "spoofing"), with the consequence that laypeople who're not trying to be technically precise run a large risk of sounding to the initiated like they don't know what they are talking about. Which is only half right: indeed they don't know, but the thing they don't know is not what they're talking about.

My impression is that IT is especially bad at this. Other technical fields, such as chemistry or biology, seem to prefer Latinate roots for their technical terms, with some Greek mixed in for good measure.

The really interesting bit is why this happens. Some of the trend-setters in the IT culture are self-taught and perhaps shouldn't be expected to have a classical vocabulary -- but they are also highly curious and intelligent and probably pick it up anyway just for the heck of it. Instead, I think that the coöpting of everyday terms for technical meanings stems from a more or less conscious ambition to be inclusive and accessible. If so, it's rather ironic that it ends up having the elitist effect of making it harder for outsiders to avoid speaking nonsense.

Or am I overthinking this?

#16 ::: David Harmon ::: (view all by) ::: February 23, 2012, 08:35 PM:

Henning Makholm #15: Well, in most fields, one way to recognize the knowledgeable from the poseurs is whether they can use the terms correctly. (Cargo-cult imitation, e.g. the pseudo-med crazies, can only take you so far.) With IT jargon, it's a little easier for imitators to "sound right" for the marks, and maybe even think they're actually getting it right.

#17 ::: Henning Makholm ::: (view all by) ::: February 23, 2012, 08:54 PM:

David, my point is that the bad choice of technical terms tends to make perfectly innocent outsiders (who're not even trying to sound like they have any technical knowledge) look like poseurs, because they actually picked a word that had a specific technical meaning.

Comparing to medicine/pseudo-med, you don't risk accidentally saying "telomere" when you were only trying to speak informally about, for example, the part of a head hair right next to where it was cut with scissors. If you meet somebody who claims that his special shampoo prevents telomere damage, you can be quite certain that he's a deliberate quack.

But if you see somebody claiming that his moderation queue is full of spoofed IP addresses, is he then attempting to imitate network security experts, or just honestly attempting to express without any specialized technical vocabulary his lay observation that the addresses do not identify the actual human spammer?

#18 ::: Doug Burbidge ::: (view all by) ::: February 23, 2012, 09:05 PM:

I added a not-very-annoying-at-all CAPTCHA to a blog I administer, and saw spam drop precipitously. (Of course since I have very little traffic compared to ML, I'm a much less attractive target.)

The one I use is by an outfit called BestWebSoft, which despite sounding like "we are bottom-feeding scum" is apparently one of the most popular CAPTCHA plugins for WordPress.

#19 ::: lorax ::: (view all by) ::: February 23, 2012, 09:08 PM:

I particularly like the noun-list from Falconsyk, which is quite clearly generated from several specialized lists stirred together -- for one, there are way too many types of birds on that list for it just to be random from the dictionary. But what's the point? Is that just supposed to be something vaguely textlike that doesn't trip too many filters, because it doesn't have any individual words or patterns that are identifiable from one instance to the next? I understand the "content-free flattery" approach to get around filters, but this one's just weird.

#20 ::: James D. Macdonald ::: (view all by) ::: February 23, 2012, 10:36 PM:

The word "spoof" predates IT by about a century. I come to the term via EW/ECM, where its meaning is pretty much the way I used it: A false designation of origin.

#21 ::: Marty in Boise ::: (view all by) ::: February 23, 2012, 10:41 PM:

I sort of feel...not sorry for, but maybe perplexed by spam operators that just generate word-salad ads like these. How on earth do these operations make any money at all? Even assuming that a fair percentage of their messages make it into insufficiently guarded comment threads, I guess I just can't imagine how they get enough clicks, even accidentally, to sell enough crap to support what seems like massive effort, no matter how much of it is automated.

Then again, when I was living in Japan, I was surprised by a news story about how the yakuza were engaged in a scheme that involved massive counterfeiting on a very small scale--they would drill holes in fairly low-value Korean coins to make them weigh the same as a 500-Yen coin, then send low-level gang members to spend the coins in vending machines, buying a soda for about 100 yen and pocketing the legitimate change. It just struck me as a ridiculous amount of work for a fairly small amount of ill-gotten gain per transaction, plus a lot of soda that I suppose was easily resold...

#22 ::: James D. Macdonald ::: (view all by) ::: February 23, 2012, 11:05 PM:

Let's say there's a flower shop in Brooklyn that wants to increase its web-based business. They hire an Internet Advertising Expert to make it happen. That guy sends out spam.

The spammer doesn't know, or care, whether the flower shop gets more business or not. He's got his paycheck. He really sends the spam, because not doing so is fraud. But as to effective? What of it?

Other spam isn't meant to be read. It's just there to put up a great many links to one site so that Google will rank that site higher than other similar sites.

Another common trick is to put in a link to a Google search, with one unique letter combination so the search will only hit one page -- figuring that no-one will block links to Google.

Spam probes often have no links at all, but they frequently have one very badly misspelled word, which will show up later on a Google search; anywhere that's a Google hit on that word is an unguarded site, and the flood comes shortly afterward.

#23 ::: abi ::: (view all by) ::: February 24, 2012, 02:36 AM:

By the way, although I do bits of this work when I have time and headspace, the guy in the lionskin diverting the river is usually Jim.

Thank you, Jim, for the work you do. We wouldn't have a viable community without it.

#24 ::: dcb ::: (view all by) ::: February 24, 2012, 03:34 AM:

My thanks as well!

Henning Makholm: Except - I have to be careful using the medical term "acute" which means "of sudden onset/short duration" (contrasting with "chronic") because many lay people (including e.g. insurance companies) think it means "severe" - so they want to know how severe, and whether it was permanently crippling, and...

IF ML decides to go "Captcha", please use Recaptcha - that way we'd be helping to decipher old texts with every post, which would be appropriate, I think.

#25 ::: Jacque ::: (view all by) ::: February 24, 2012, 04:23 AM:

James D. Macdonald @22: Let's say there's a flower shop in Brooklyn that wants to increase its web-based business.

Coincidentally, just yesterday, I got an offer in paper mail for a search engine placement service. The mind boggles....

#26 ::: Jim Macdonald ::: (view all by) ::: February 24, 2012, 05:11 AM:

I really don't like Captchas. Half the time I can't do 'em, and I hear they're really rough on people with visual difficulties.

#27 ::: FaultyMemory ::: (view all by) ::: February 24, 2012, 05:59 AM:

Besides which, per Abi's link of three days ago, CAPTCHAs are breakable via distributed human processing, for a price of about a tenth of a cent each.

#28 ::: Alex ::: (view all by) ::: February 24, 2012, 06:39 AM:

IP filtering does have some value, I think. I administer a company MT install that sees significant amounts of spam. Despite using Akismet, the mod queue grows a couple of hundred items a week.

We used to experience a problem when an unusually heavy spam run hit us (typically, we get spam in identifiable runs, a few dozen comments in less than a minute) where the instances of mt-comments.cgi waiting for the Akismet web service to respond would trip our vhost's policing and knock us offline.

I deployed the MT-Blacklist plugin, which mines the spam trap for IP addresses (that appear more than a configurable threshold level) and adds them to the web server's .htaccess file. It's layered defence - submissions from spammy machines get shot down before the MT comments script gets executed or the Akismet API called, so they don't use any resources. We've not had the problem since.
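The threshold idea can be sketched roughly as follows. This is not the plugin's code, only a guess at the general shape: the function name `htaccess_deny_lines` and the default threshold are invented for illustration, and the output uses the Apache 2.2-style `Deny from` directive.

```python
from collections import Counter

def htaccess_deny_lines(spam_ips, threshold=3):
    """Build 'Deny from' lines for IPs seen more than `threshold` times.

    `spam_ips` is a list with one IP address string per trapped spam comment.
    """
    counts = Counter(spam_ips)
    repeat_offenders = sorted(ip for ip, n in counts.items() if n > threshold)
    return ["Deny from {}".format(ip) for ip in repeat_offenders]
```

The resulting lines would go into the `.htaccess` file, so the web server refuses those addresses before any comment script runs.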

#29 ::: Nonentity ::: (view all by) ::: February 24, 2012, 09:02 AM:

James D. Macdonald @20: I come to the term via EW/ECM, where its meaning is pretty much the way I used it: A false designation of origin.

That's the way the term is used in the network field as well, but for the reasons Henning Makholm mentioned it gets more technical with IP addresses. Spoofing an IP for purposes of doing something that requires replies from the destination is nearly impossible unless you happen to be on the same physical network as the IP you are spoofing (or the same physical network as the target) so that you can see the replies.

The origins are probably true, but belong to compromised systems. They can also change around a lot due to dynamic IPs (such as in residential ISPs where sometimes rebooting your modem will give a new public IP).

The difference is that spoofing is stuffing a lot of mail for you with different reply addresses into the nearest mailbox, while this is probably that someone has broken into a large number of homes and is using their addresses to carry on conversations with you. As you mentioned, though, the end result - that you can't solidly rely on the source, unless you do some heavy historical tracking and research on the source networks - is pretty much the same.

I apologize if that's nitpicky... I deal with different types of network abuse on a daily basis, so "spoofed IP" is a bit of a trigger phrase.

#30 ::: Fragano Ledgister ::: (view all by) ::: February 24, 2012, 09:14 AM:

Compared to this job, cleaning the Augean Stables must have been a doddle. I speak here, by the bye, with the authority of actual experience at stable cleaning (of goats) with bag and shovel (of the manuring of fields with fork and hoe that followed I also had much experience but that is nothing to the point). Old Heracles would have taken one look, shrugged his shoulders and run off.

#31 ::: rea ::: (view all by) ::: February 24, 2012, 11:23 AM:

Well, this was certainly a useful,informative, helpful, educational and beneficial post :)

#32 ::: Ingvar M ::: (view all by) ::: February 24, 2012, 11:31 AM:

James D Macdonald @ #20:

I started writing up a (now-zapped) description of "spoofed IP" as a term-of-art in IT, but Nonentity @ #29 beat me to it.

It's one of those things that are small, almost no one cares (and on good days, I care not very much, certainly not to the point that it pulls me out of a good yarn), but does yank one out of the reading mode out into "what just happened there" mode.

I do, however, salute you and the other gnomes. It is an important, thankless and never-ending task, wading through the mod queue.

#33 ::: dcb ::: (view all by) ::: February 24, 2012, 11:49 AM:

Jim Macdonald @ 26: Oh I wasn't wanting you to start using Captcha or ReCaptcha! Far from it. And your point about people with visual difficulties is well taken.

#34 ::: Heather Rose Jones ::: (view all by) ::: February 24, 2012, 12:19 PM:

Count me in as one of those people who is bewildered by some of the alleged "business models" now making our informational lives miserable. I suspect my bewilderment comes from three parts: 1) I'm not really their target audience, given the ways I use info-tech -- I'm just collateral damage; 2) some of the business models are convoluted enough (e.g., as detailed in #22) that you can't deduce them from superficial effects; and 3) I'm probably trying to make sensible patterns out of data that contains a lot of mere noise.

Illogical business models fascinate me. Some time I should tell the story of how I frightened my office co-workers when dealing with a series of phone-spam calls (repeated sales calls to someone not-me at my number from various clearly bogus call centers) when I decided to let rip to the human beings at the other end. My rant focused very much on the incoherence of their business model, given that they were spending so much actual human time turning a non-customer to an active and vindictive enemy.

#35 ::: pedantic peasant ::: (view all by) ::: February 24, 2012, 12:34 PM:

Jim Macdonald @ 0P

And the .? mark means that any one character will match: write-up or write up or writeup will trip the filter.

So, am I reading this correctly that "w r i t e d u p" would also get gnomed?

#36 ::: pedantic peasant gnomed himself ::: (view all by) ::: February 24, 2012, 12:35 PM:

Sorry,

Could you please free my query?

On the plus side, I suppose the question is answered.

#37 ::: James D. Macdonald ::: (view all by) ::: February 24, 2012, 12:39 PM:

I'm always amenable to learning something new, and I hate to have our visitors brought up short by incorrect terminology.

I'd be happy to change the term from "spoofed" to something else. What would work? False? Faked? Unreliable?

When you're doing EW, a spoofed transmission would be one where the bad guys come up on your frequency, and say "This is W7M ..." and you start talking to them even though they aren't W7M; W7M is miles away and has no idea what's going on. I suppose that using a botnet is the equivalent of some armed people taking over the genuine W7M and forcing the radio operator there to come up on your freq and say "This is W7M..." and you start talking with them. That might be a funkspiel, or maybe not. You'd say that W7M was compromised. So, would "compromised IPs" work?

Meanwhile, you do sometimes find interesting things in the moderation queue. Observe.

#38 ::: Mary Aileen ::: (view all by) ::: February 24, 2012, 12:49 PM:

Just curious: Does having the word 'spam' in the post title make it harder to find spam-spotting posts in the Recent Comments list?

#39 ::: Dave Bell ::: (view all by) ::: February 24, 2012, 12:52 PM:

I wonder what the WW2 Operation Fortitude wireless deception would be classed as. Spoofing I suppose, with the fake radio traffic to present an image of the First US Army Group. But the double agents could definitely, from the German point of view, be described as compromised.

#40 ::: James D. Macdonald ::: (view all by) ::: February 24, 2012, 12:55 PM:

pedantic peasant 35/36:

No, "w r i t e u p" didn't get gnomed; one of the other words in the quote is what tripped the trigger.

If spammers started using "w r i t e u p" it would get into the filters pretty quickly, though.

I don't like to talk much about the makeup of the filters and exactly what trips them, lest the spammers get information on how to circumvent me.

Incidentally, while IP numbers are worthless for fighting spammers, they're quite useful for fighting trolls and sockpuppets.

#41 ::: James D. Macdonald ::: (view all by) ::: February 24, 2012, 12:58 PM:

Mary Aileen: #38

Yes, it does. I've just changed the post title.

#42 ::: wrw ::: (view all by) ::: February 24, 2012, 01:00 PM:

James Macdonald @ 37: "Compromised" would be the accurate term of art in network security as well.

#43 ::: OtterB ::: (view all by) ::: February 24, 2012, 01:13 PM:

James D Macdonald @41 I've just changed the post title.

I'm glad to hear that. I just stopped by, noticed comments on this post, and thought "I didn't notice the @ this morning when I started reading that thread." Or, I suppose, th@ thread

And "compromised" makes perfect sense to me too; I consider myself a moderately-informed layperson on the subject.

#44 ::: Nonentity ::: (view all by) ::: February 24, 2012, 01:17 PM:

@37: That highlights one of the differences about the communications in networking: it's easy to send out something saying you're at an IP you're not at*, but in order to do anything other than just flood someone with tiny junk packets you need to actually be at or near one of the termination points to see responses. Usually you're working either with physical wired networks or with very short-distance wireless communications (and with encryption as well).

"Unreliable" or "compromised" might be a better description in this case. The sources are valid, but the method hits you with two main problems:

1) The attacker probably has extremely large numbers of valid sources to proxy their spam through. In your EW example, there's no way to know how many of the network of friendly sources have been compromised, and those friendly operators probably don't even know they're compromised and sending the enemy's data.

2) The compromised systems themselves can change IPs within their network's allotment, so blocking 10.0.0.5 today doesn't prevent the same compromised computer from showing up at 10.0.0.6 tomorrow. Probably no EW analogy there, since it depends on not even knowing which sources are friendly. You can decide to treat entire blocks of sources as unreliable, but then you're catching friendly sources as well.

* Actually, if your ISP is doing their job properly, it's not possible to say you're any IP other than one that is at least isolated to that provider's IPs. But that's a whole different topic.
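The trade-off in point 2 can be made concrete with Python's standard `ipaddress` module. The addresses are the example ones used above; the two helper functions are made up for illustration.

```python
import ipaddress

def blocked_by_address(ip, blocked_addresses):
    """Exact-match block list: only catches addresses already seen."""
    return ipaddress.ip_address(ip) in blocked_addresses

def blocked_by_network(ip, blocked_networks):
    """Block-level list: catches every address inside a listed network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in blocked_networks)

blocked_addresses = {ipaddress.ip_address("10.0.0.5")}
blocked_networks = [ipaddress.ip_network("10.0.0.0/24")]
```

Blocking 10.0.0.5 alone misses the same machine when it reappears at 10.0.0.6. Blocking the whole 10.0.0.0/24 range catches it, along with every well-behaved neighbour such as 10.0.0.200.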

#45 ::: James D. Macdonald ::: (view all by) ::: February 24, 2012, 01:23 PM:

I've just changed "spoofed" to "compromised" in the main post.

#46 ::: Debra Doyle ::: (view all by) ::: February 24, 2012, 01:30 PM:

On the things that can happen when the same word or acronym is used as a term of art in two different disciplines:

As you know, Bob, POV in the writing biz is short for Point Of View. But on US military bases, at least back during the days when I was spending a good bit of my time on same, POV stood for Privately Owned Vehicle.

You can imagine, then, my momentary (but memorable) befuddlement upon encountering, at the Corozal Exchange, a sign reading NO POV BEYOND THIS POINT.

#47 ::: Nonentity ::: (view all by) ::: February 24, 2012, 01:42 PM:

It's too bad these people aren't doing anything that's illegal or infrastructure threatening that would make the ISPs of the compromised systems react promptly to an abuse report. I'm sure I'm not the only person who delights in shutting down the resources of internet scum, but it's rather difficult to get people's attention when the abuse is "comment spam".

#48 ::: John A Arkansawyer ::: (view all by) ::: February 24, 2012, 01:44 PM:

It would be nice if POS and POC unambiguously meant Point Of Sale and Person Of Color.

#49 ::: James D. Macdonald ::: (view all by) ::: February 24, 2012, 01:51 PM:

Of allied interest: MIJI (Meaconing, Intrusion, Jamming, Interference) from the Navy rate training manual for Information Systems Technician (which I suppose is the new name for Radioman).

The recent capture of the US drone by Iran, which is being described in the popular press as having been done by "spoofing" GPS signals, was probably actually meaconing (and I don't see any way to stop anyone who wants to do it to repeat the trick against any drone anywhere, any time they please; meaconing is really, really effective).

Similar is the range-gate pull off, used to keep radar-guided missiles from hitting you.

#50 ::: Linkmeister ::: (view all by) ::: February 24, 2012, 02:03 PM:

Jim @ #49, yes. RM is now IT. It still uses the old RM rating badge, though. Per Wikipedia:

The Radioman (RM) and Data Processing Technician (DP) ratings were merged in November 1998, keeping the Radioman name. In November 1999 the rating was redesignated Information Systems Technician. Both Radiomen and Data Processing Technicians in the Navy had to undergo general rate training and take a computer-based exam in order to be designated under the new IT rating (the Data Processing Technicians found themselves bound to a serious learning curve because they had to learn every single aspect from the Radioman rating) whereas to the Radiomen, most of them already had a general knowledge in basic computer fundamentals and maintenance. In 1996 the Submarine force merged Radioman with Electronics Technicians/ Electronic Warfare Specialist.

The Coast Guard rating was renamed Telecommunications Specialist (TC) in 1995, which split in 2003 to make up the Information System Technician (IT) and Operation Systems Specialist (OS) ratings.

I guess I can't claim responsibility for the change. It was in 1974 that I went from RM3 to civilian.

#51 ::: Walter Hawn ::: (view all by) ::: February 24, 2012, 02:08 PM:

A very logical article! Well-thought out! But wouldn't you rather have a burning, scintelating headline that makes readers's eye's pop?

Only 29.99 for "headline's that Mke Readers's Eyes' POP!" Order one today!

#52 ::: Rob Hansen ::: (view all by) ::: February 24, 2012, 02:22 PM:

Debra Doyle@#46: In old time SF fandom BNF stood for Big Name Fan. Then along came British Nuclear Fuels, which led to double-takes over some newspaper headlines.

#53 ::: Dave Crisp ::: (view all by) ::: February 24, 2012, 02:32 PM:

Rob Hansen @ 52: Personally, I parse that acronym (when devoid of context, natch) as Backus-Naur Form. Which I guess shows the kind of circles I hang out in more than anything.

#54 ::: pedantic peasant ::: (view all by) ::: February 24, 2012, 02:41 PM:

James D. Macdonald @ 40:

Thanks. I understand not wanting to give spammers a lesson in how to ghost by your filters.

But, at the risk of living up to my nom de photon, I was attempting to ask if any character truly meant any character, so not just the space, hyphen, or no space options, but even a d making write past tense and merging with up making a single compound word might trip the filters.

I understand, however, if any reply would need to be, um, redacted. :)

#55 ::: James D. Macdonald ::: (view all by) ::: February 24, 2012, 02:50 PM:

Yes, writeaup, writebup, ... writedup, ... writezup, write0up, ... write9up will all be gnomed.
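A quick way to check this for yourself, as a sketch in Python that looks only at the `write.?up` piece of the filter (the name `trips` is invented here):

```python
import re

# Just the "write.?up" fragment: a dot matches any single character,
# and the question mark makes that character optional.
WRITE_UP = re.compile(r"write.?up", re.IGNORECASE)

def trips(text):
    """Return True if the fragment matches anywhere in the text."""
    return WRITE_UP.search(text) is not None
```

At most one character may sit between "write" and "up", so "write--up" with two hyphens gets through, and so does a version with a space between every letter.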

#56 ::: Lee ::: (view all by) ::: February 24, 2012, 02:52 PM:

Marty, #21: In addition to what Jim said, remember that legitimate advertising considers a 3% return to be outstanding. Spamming doesn't cost the spammers anything at all (because it's all done by means of stolen services), so somebody running a phishing scam or selling fake V***** (IOW, things which can't be advertised any other way) is going to consider a 0.001% return to be well worth the effort.

Jim, #37: My partner suggests that either "compromised IPs" or "botnet IPs" would work in this context.

John, #48: There have been times when I felt that the ambiguity in POS was perfectly appropriate.

#57 ::: Xopher HalfTongue ::: (view all by) ::: February 24, 2012, 03:57 PM:

I used to delight in telling coworkers who were unfamiliar with acronyms that I used that 'POS' stood for "Piece Of Junk" and that 'PITA' stood for "Pain In The Neck."

Got a lot of "Wait, how could that...oh." Happy me!

#58 ::: Lee ::: (view all by) ::: February 24, 2012, 04:23 PM:

Xopher, #57: One of our bumper stickers reads, in large letters, RTFMA. Translation is provided underneath: "Read the furnished materials, sir."

(And I just nearly gnomed myself. Fortunately, I caught the comma-and-missing-space in proofreading.)

#59 ::: James D. Macdonald ::: (view all by) ::: February 24, 2012, 04:39 PM:

Back to MIJI for a moment (because it takes me back to the days of the Whirly-One and the Slick-32): the format for a MIJI report, and the classifications given to the various paragraphs in a MIJI report.

S is Secret
C is Confidential
U is Unclassified.

OADR is Originating Agency's Determination Required; that is, before you declassify that paragraph, you have to check with the guys who classified it in the first place to see if declassification is a smart idea.

Classified information is limited to persons who have both a) the appropriate clearance(s), and b) a need to know.

Secret material is that which, if it became public knowledge, would cause serious damage to national security. Confidential material is that which, if it became public knowledge, would cause damage or be prejudicial to national security. Unclassified material can be disseminated to persons who have neither a clearance nor need to know.

Every paragraph in a classified document must be marked with that paragraph's classification. The entire document is classified at the level of the highest level of any paragraph.
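The "highest paragraph wins" rule is easy to state in code. The following is a minimal sketch covering only the three markings listed above; the names `LEVELS` and `document_classification` are invented for the illustration and do not come from any real system.

```python
# Ordering of the three markings discussed above, lowest to highest.
LEVELS = {"U": 0, "C": 1, "S": 2}


def document_classification(paragraph_markings):
    """Return the marking of the most highly classified paragraph.

    Expects a non-empty list of "U", "C", or "S" markings, one per paragraph.
    """
    return max(paragraph_markings, key=LEVELS.__getitem__)
```

So a report whose paragraphs are marked U, C, and U is Confidential as a whole, and a single S paragraph makes the entire document Secret.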

#60 ::: Jim Macdonald ::: (view all by) ::: February 24, 2012, 05:03 PM:

In the past hour we've received more than 75 copies of a comment spam that reads, "black hat seo forum with a twist, man i love SEO"

Every single iteration was stopped.

Not too good at black hat SEO, are ya, guys?

BTW, I loathe SEO.

#61 ::: dcb ::: (view all by) ::: February 24, 2012, 06:18 PM:

But the BNF is the British National Formulary - everyone knows that!

(Context. It's all about context.)

#62 ::: David Harmon ::: (view all by) ::: February 24, 2012, 07:41 PM:

Henning Makholm #17: Good point, but I'm not sure it translates to a real problem. Jim himself admitted to speaking casually in this case, and it would have been a "reasonable" misstatement even for an expert. In any case, much of IT terminology comes from analogy, metaphor, or simile to everyday things, exactly because we need those comparisons to make sense of what's going on.

#63 ::: Erik Nelson ::: (view all by) ::: February 24, 2012, 08:37 PM:

Xopher at #8: Spam-El was a resident of the planet Krypton, who sent 10,000 of his offspring to Earth when his home planet exploded.

#64 ::: P J Evans ::: (view all by) ::: February 24, 2012, 09:06 PM:

34
It's interesting (for certain values of interesting) getting spam in Cyrillic, or from Pakistan, or for something claiming to be a Catholic men's group (all of those were seen this week). Not being in any of those markets myself, I send them off to the e-mail host's spam filter.

#65 ::: Erik Nelson ::: (view all by) ::: February 24, 2012, 09:10 PM:

Lorax at #19:
It's designed to foil a Bayesian spam filter, which is a filter that spots combinations of words in conjunction to test probability of spam.

It does this by throwing lots of random words at it.
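To make that concrete, here is a toy naive Bayes scorer. It is a sketch under loud assumptions: the training comments are invented, the priors are taken as equal, and real filters are far more elaborate. The point is only that padding a spam with words the filter has learned from legitimate comments can drag the score back toward "ham".

```python
import math
from collections import Counter


def train(messages):
    """Count word occurrences across a list of training messages."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts


def spam_log_odds(text, spam_counts, ham_counts):
    """Sum of per-word log-likelihood ratios; positive leans spam, negative leans ham."""
    vocab = set(spam_counts) | set(ham_counts)
    # Add-one (Laplace) smoothing so unseen-in-one-class words do not divide by zero.
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    score = 0.0
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence here
        p_spam = (spam_counts[word] + 1) / spam_total
        p_ham = (ham_counts[word] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score


# Invented training data, purely for illustration.
spam_counts = train(["cheap pills online", "cheap watches online"])
ham_counts = train(["the gnomes read the thread", "knitting thread and the gnomes"])

plain = spam_log_odds("cheap pills", spam_counts, ham_counts)
stuffed = spam_log_odds("cheap pills the gnomes thread knitting", spam_counts, ham_counts)
```

With this toy data, `plain` comes out positive (spammy) while `stuffed` comes out negative, even though the sales pitch is identical in both.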

#66 ::: P J Evans ::: (view all by) ::: February 24, 2012, 09:10 PM:

59
OADR is Originating Agency's Determination Required; that is, before you declassify that paragraph, you have to check with the guys who classified it in the first place to see if declassification is a smart idea.

In practice, these days, it appears that the guy in the White House (or his designated agent) can declassify stuff, or reclassify something that was formally declassified, because he feels like it, without even bothering with the OADR requirement.

#67 ::: Older ::: (view all by) ::: February 24, 2012, 11:39 PM:

Xopher (57) -- In my work I deal with many instances of a document my boss refers to as "POS" and *every time* I think ...

#68 ::: Daniel Martin ::: (view all by) ::: February 25, 2012, 02:23 AM:

Going back to the comparison Henning Makholm was making many comments ago about the source for jargon terms in various fields:

When creating a term of art, I see there as being basically two choices: the practitioners of a field can take a word from the surrounding language and repurpose it, or a new word can be invented. (In the second case, it could be a matter of combining words or word fragments from a language other than the surrounding language - e.g., what medicine is known to do - or it could mean naming something after an inventor or discoverer - e.g., Hertz as a unit of measure.)

As noted, CS/IT seems to take the first choice with great frequency. I'll point out that this is the natural behavior for a discipline at the intersection of mathematics and business: although some older mathematical terms derive from non-English languages, mathematics is notorious for taking a common term ("group", "ring", "field", "sheaf", "pencil", "category", "class") and giving it a technical meaning you can't even describe properly to a complete non-specialist. Business types are also somewhat notorious for repurposing common words, generating Dilbert cartoons in their wake.

I'm not entirely sold on the idea that repurposing existing words is a clearly inferior strategy for jargon formation. Doing so makes the new jargon term easier to spell. I also have my suspicions that a tendency to use latinate words for jargon makes one prone to inventing jargon when none is needed, and thus creating an artificial barrier to entry into the field. At this point, though, I'll need a sociolinguist to talk more about jargon's significance as a social phenomenon.

This being ML, there's probably a sociolinguist reading this comment.

#69 ::: Rob Rusick ::: (view all by) ::: February 25, 2012, 06:30 AM:

Erik Nelson @63: Spam-El was a resident of the planet Krypton, who sent 10,000 of his offspring to Earth when his home planet exploded.

There was a PKD story of shape-shifting alien invaders who didn't grasp human individuality and scale.

Hence, Provo Utah would be besieged by hundreds of 2ft tall insurance salesmen, who were agog that humans were so quick to see through their clever disguise.

#70 ::: David Harmon ::: (view all by) ::: February 25, 2012, 08:35 AM:

There's a game wiki I work on that got effectively DOSed by a spam flood since last June or so. I prevailed on the mostly gafiated owner to install reCAPTCHA, which didn't do squat. This month he came back and not only upgraded the CAPTCHAs, but installed an e-mail confirmation for registration. That has mostly worked, but the "mostly" part is a bit worrying. We're getting a trickle of suspicious-looking user regs, none of which are e-mail-confirmed. Maybe 20%-30% of those have actually managed to drop spam payloads, and they shouldn't be able to.

#71 ::: John Mark Ockerbloom ::: (view all by) ::: February 25, 2012, 09:19 AM:

Some years ago if I saw "BNF" I first thought "Backus-Naur Form". Now I first think "Bibliotheque Nationale de France". Which indirectly gives you an idea of the arc of my career...

#72 ::: lorax ::: (view all by) ::: February 25, 2012, 09:54 AM:

Erik @65:

I understand that it's designed to get through the filter. What I don't understand is what the point is -- is it just for the googlejuice that they'd get if not for the nofollow being set?

#73 ::: Jules ::: (view all by) ::: February 25, 2012, 11:02 AM:

Jim @49: Having recently read a paper on GPS interference detection techniques, I can tell you that there are actually a few approaches that can apparently be taken that would be likely to work. The most reliable in the case you describe would probably be monitoring of signal strength -- outdoor GPS signal strength is apparently very steady and can be predicted to a high degree of reliability once you have a copy of the satellites' ephemerides. As the rebroadcast signal would necessarily be more powerful than the original in order to deceive a receiver into decoding it instead of the original, you should be able to detect the unexpectedly high signal power and label the broadcast as untrustworthy. Falling back to an inertial system shouldn't be too problematic at this point, as such a system is likely to get you beyond range of any plausible interference safely.

For other forms of interference, keeping close track of timestamps is apparently enough. Starting to receive the inauthentic signal is usually accompanied by a noticeable jump in timestamp value, which should be enough to show conclusively that the signal has been tampered with.
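The two checks described above might be sketched roughly as follows. This is not taken from the paper: the function names, the 3 dB power margin, and the one-millisecond skew limit are all hypothetical placeholders chosen only to show the shape of the logic.

```python
def signal_suspicious(measured_dbm, expected_dbm, tolerance_db=3.0):
    """Flag a signal that is stronger than predicted by more than the tolerance.

    expected_dbm would come from a prediction based on satellite position;
    tolerance_db is a made-up margin, not a published figure.
    """
    return measured_dbm - expected_dbm > tolerance_db


def timestamp_jump(prev_ts, new_ts, elapsed_local, max_skew=0.001):
    """Flag a GPS timestamp that disagrees with the receiver's own clock.

    prev_ts and new_ts are consecutive GPS timestamps in seconds;
    elapsed_local is the time between them as measured locally.
    max_skew (seconds) is a hypothetical threshold.
    """
    return abs((new_ts - prev_ts) - elapsed_local) > max_skew
```

A receiver that trips either check could then stop trusting GPS and fall back to inertial navigation, as suggested above.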

It makes me wonder why, precisely, these very expensive drones don't implement such systems (the authors of the paper were convinced that between the two of them, they would work in almost all situations).

#74 ::: Erik Nelson ::: (view all by) ::: February 25, 2012, 11:34 AM:

Daniel Martin at 68, etc:
"When I use a word, it means exactly what I choose it to mean, neither more nor less."
-Humpty Dumpty

#75 ::: Debra Doyle ::: (view all by) ::: February 25, 2012, 12:14 PM:

Daniel Martin@68: At this point, though, I'll need a sociolinguist to talk more about jargon's significance as a social phenomenon.

This being ML, there's probably a sociolinguist reading this comment.

I'm not a proper sociolinguist, but I am a certified (thank you, UPenn!) word nut who's done some reading in the field. Jargon, as a general rule, serves two purposes: one, to allow people in a particular field to discuss concepts and objects peculiar to that field in an economical fashion ("Mary Sue" is a lot faster to say than "ill-conceived and poorly-executed authorial self-insert", for example); and two, to draw a line between those inside the field, who can speak its private language, and those outside the field, who can't.

#76 ::: Dave Bell ::: (view all by) ::: February 25, 2012, 12:21 PM:

Jules @73

Jim has mentioned range gate pull off, which is a trick suggesting a way of defeating that sort of countermeasure.

Let me see if I'm getting this right.

A range gate is a feature of a radar system designed to reduce the effectiveness of some sorts of jamming. You can, for instance, if you re-broadcast the incoming radar pulse, create several false echoes, at different ranges. Because the "beam" isn't hard-edged, a high-power fake echo can also look like a normal echo on a different bearing. But if you can spot the real target before the jamming starts, you know the real echo can only be heard at around a particular time, because a 'plane can only move so fast. The range gate is when the radar only listens at the times it expects an echo.

But the radar system locks on to the strongest echo. So if you slowly increase the power of a false echo, and slowly change the timing to separate it from the real echo, you can move the range gate away from the real target. Stop the fake echo pulses, and the radar is looking in the wrong place.
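A toy model may help show why the pull-off has to be slow. Everything here is an assumption made for illustration: ranges are counted in whole "bins", the power figures are arbitrary, and a real tracking radar is far more complicated than "follow the strongest echo inside the gate".

```python
def simulate_pull_off(steps, pull_per_step, gate_half_width=3, real_delay=100):
    """Return where the range gate is centred after `steps` radar pulses.

    The jammer starts with a weak fake echo on top of the real one, raises
    its power by one unit per pulse, and once it is the stronger echo it
    moves the fake echo `pull_per_step` bins further out on every pulse.
    """
    real_power = 10
    fake_power = 5
    fake_delay = real_delay
    gate_center = real_delay
    for _ in range(steps):
        fake_power += 1
        if fake_power > real_power:
            fake_delay += pull_per_step
        echoes = [(real_power, real_delay), (fake_power, fake_delay)]
        # The radar only listens inside the gate around its last estimate.
        in_gate = [e for e in echoes if abs(e[1] - gate_center) <= gate_half_width]
        if in_gate:
            # Lock on to the strongest echo it can hear.
            gate_center = max(in_gate)[1]
    return gate_center


slow = simulate_pull_off(steps=30, pull_per_step=1)   # gate is walked away
fast = simulate_pull_off(steps=30, pull_per_step=5)   # fake echo jumps out of the gate
```

In this sketch the slow pull drags the gate from bin 100 out to bin 125, so the real target is no longer inside it, while the fast pull leaves the gate sitting on the real target at bin 100 because the fake echo falls outside the listening window on its first move.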

GPS is more complicated, but a slow increase in power and shift in timing could get past a countermeasure guarding against sudden change. On the other hand, because GPS has to know something about where the satellite is, the changes can be predicted, and a small enough change to get past the safeguards might not be of any value.

Still, Michelle Yeoh in a James Bond movie: who cares whether the GPS deception in the story can work or not?

#77 ::: paxed ::: (view all by) ::: February 25, 2012, 12:36 PM:

David@#70, I admin a game wiki, and we don't have a spam problem. Admittedly, it's not a very big wiki, nor a very well known game outside certain circles.

When I started admining the wiki, I installed a captcha that asks questions anyone who knows anything about the game can answer, as can anyone who spends some time looking through the wiki. Domain-specific knowledge questions seem to be the best non-reCAPTCHA method for avoiding spam.

#78 ::: Bruce Cohen (Speaker to Managers) ::: (view all by) ::: February 25, 2012, 06:32 PM:

Jules @ 73:

It makes me wonder why, precisely, these very expensive drones don't implement such systems (the authors of the paper were convinced that between the two of them, they would work in almost all situations).

Maybe for the same reason that for several years the drones flown in Iraq were using non-encrypted communications with their controllers. I assume that reason was underestimation of the opposition.

#79 ::: Jim Macdonald ::: (view all by) ::: February 25, 2012, 06:58 PM:

I assume that reason was underestimation of the opposition.

No, I suspect it falls under one of Murphy's Laws of Combat: Remember, your weapon was made by the lowest bidder.

#80 ::: Lenny Bailes ::: (view all by) ::: February 25, 2012, 08:34 PM:

Many of the comment-spam examples that Jim posts are similar to what I get on my low-traffic LiveJournal page -- direct product ads or robotic statements of praise for the blog followed by the product ads. But there's a type in his list that I don't usually see: the ones that start out with insults to the readership followed by questions about the blog (or suggestions for improvements) that aren't directly related to the spammer's product placement. I'm guessing that, with those, the product placement links are just contained in the sig.

But my curiosity is stirred, slightly, about comment-spam algorithms. I'm wondering what triggers the insult/question type. Do they statistically weigh the amount of comment traffic and/or attempt some rudimentary analysis of content? I'm speculating that I don't get the insult/question type of comment spam because the typical number of comments per post is too low to trigger it.

Do the spammers have low-rent content analyzers that sample a blog to determine the form that the comment spam should take, or is it just a game to get past the junk filters built into blogging tools?

#81 ::: David Harmon ::: (view all by) ::: February 25, 2012, 09:32 PM:

paxed #77: Ooh, domain specific CAPTCHAs sounds like a useful idea.... I'll note that until last year, spam wasn't an issue, but over a month or two it ramped up to, iirc, a dozen per day or more.

#82 ::: E. Liddell ::: (view all by) ::: February 26, 2012, 09:23 AM:

The enthusiasm shown here for captchas has me wondering (not for the first time) how they stack up in terms of percentage of bot-spam prevented against less visible methods, like honeypot form fields and header filtering for bogus User-Agents. Given that the other methods don't seem to be used very much, I doubt a study has ever been done, but you never know . . .

#83 ::: James D. Macdonald ::: (view all by) ::: February 26, 2012, 12:17 PM:

#66 P J Evans In practice, these days, it appears that the guy in the White House (or his designated agent) can declassify stuff, or reclassify something that was formally declassified, because he feels like it, without even bothering with the OADR requirement.

Given that the President is ultimately in control of all of those agencies, and their power to classify anything at all emanates from him, I don't see this as a problem.

#84 ::: P J Evans ::: (view all by) ::: February 26, 2012, 01:33 PM:

83
It's the doing it without consulting with the classifying agency, and even, apparently, without bothering to let them know he's about to do it, that bothers me. (It also bothers me when it's a cabinet officer doing it - shouldn't the laws and regulations apply to them also?)

#85 ::: Sylvia Sotomayor ::: (view all by) ::: February 26, 2012, 01:39 PM:

Regarding spam filters, I have some questions. I don't particularly like captchas, so I've been using other strategies.

I have a WordPress blog that was active for about three years with maybe half a dozen readers. Every day I would go and empty out the spam trap, which generally held 20-50 messages. Then I found Spam Free WordPress and I got 1 spam message in the next six months. This seems to me almost too good to be true. Does anyone else here have experience with this type of spam filtering? It seems all silver lining and no cloud.

Another type of form-based spam filtering that I have implemented that appears to be working well is the trick of having an input field that is hidden through CSS. The server is then configured to drop any responses that have a value in that field. Does anyone have experience with this? Are there drawbacks that I am not considering? Would someone using a text-reader see that field and use it? (I do have text next to the field that says "Please leave this field empty.")
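The server-side half of that hidden-field trick can be sketched in a few lines. This is an assumption-laden illustration, not the commenter's actual configuration: the field name `website_url` and the function `should_drop` are invented, and the submitted form is modelled as a plain dictionary of field names to strings.

```python
# Hypothetical name for the CSS-hidden field; a real form would pick its own.
HONEYPOT_FIELD = "website_url"


def should_drop(form: dict) -> bool:
    """Return True if the hidden field came back with anything in it.

    A human who never sees the field leaves it empty; a bot that fills in
    every input it finds gives itself away.
    """
    return bool(form.get(HONEYPOT_FIELD, "").strip())
```

The check says nothing about the text-reader question, which depends on how the field is hidden and labelled rather than on the server logic.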

#86 ::: Elliott Mason ::: (view all by) ::: February 28, 2012, 09:36 AM:

Thena @571: Just because I can recite what songs have what intervals (and once I've run through that lookup table in my head, sing the bits) doesn't mean I can reflexively sing a fourth when asked to; it's a multi-stage lookup.

Part of the problem for me with the sheet music is after one or two lines/gaps, they kind of blur in the part of my brain that 'knows' numbers at a glance and it becomes 'many'. I have to actually count them in my head (or did, when taking that piano class) by silently saying the numbers. Relates to my dyscalculia, I presume. I bet I could get my spatial-thinking/kinesthetic circuitry trained to do the job for me if I spent enough time drilling it, but I haven't got that set yet.

I have the damnedest time memorizing 'abstract' information (though what counts to my brain as 'abstract' doesn't always match up with what seems 'abstract' to other people). I memorize well if it's context-linked to other things I know or find of interest, but things like the pattern of an octave (how many piano keys, regardless of color, between each note in a major scale) slide right off my brain's surface if I'm not using them five times a day, even if people give me a mnemonic for them.

If I had the time, energy, and a motivated teacher to immerse myself in the skills for weeks (and then kept them up and active for at least six months afterwards), I might be able to properly learn to sight-read. However, I have a toddler, and I therefore will not have that much time to focus on music for at least 2-4 more years. :->

As it is, I can improvise harmony very well on the fly; the only thing I do semi-regularly that would be made simpler by being properly 'paper-trained' is notating multi-harmony arrangements that I've composed so other people (who are paper-trained; which doesn't include most of the people in my 'band') can also sing them without me teaching them first.

#87 ::: Elliott Mason ::: (view all by) ::: February 28, 2012, 09:37 AM:

Ok, that was totally the wrong thread. Putting it on the right one.

#88 ::: Mycroft W ::: (view all by) ::: February 28, 2012, 05:33 PM:

However, given the propensity of thread-drift on ML, there is a measurably significant probability of being an on-topic answer in 500 or so comments.

So, good luck with your Precognition, ~~Mrs. Cake~~ Mr. Mason!

#89 ::: David Harmon ::: (view all by) ::: March 17, 2012, 06:27 PM:

Are you folks seeing a surge in Greek-charset spam? The game wiki I work on just had three in a row get through the passive defenses. (I'm the active defense. ;-) )

#90 ::: Jim Macdonald ::: (view all by) ::: April 13, 2013, 03:48 PM:

"Today, this attack is happening at a global level and WordPress instances across hosting providers are being targeted. Since the attack is highly distributed in nature (most of the IPs used are spoofed), it is making it difficult for us to block all malicious data."

Huge attack on WordPress sites could spawn never-before-seen super botnet

Yet, we were told above that IP addresses cannot be spoofed.

So, which is accurate?
