Spam Analysis

Email is, at heart, a communication tool. It was designed as a fast and efficient way to communicate between two points. Email made data sharing easy, and it also helped us archive shared information. But everything that shines has its dark side, and for email, spamming is one of them. The top ESPs are constantly fighting spammers these days. Many efficient and intelligent filters have been written worldwide just to separate spam from ham.

We can classify spam filters by their source as follows:

Filters written by major ESPs

To fight spammers more effectively, ESPs do not reveal the metrics and concepts behind their email filtering. One can make assumptions, but it is very hard to pinpoint an ESP's filtering algorithms. ESPs use learning-based filters that adapt to spammer behavior, which makes those key points even harder to guess.

Filters written by open source projects

Besides the ESPs, some major open source projects are putting down deep roots in spam filtering. Though a spammer has no face, we can sort spammers on a technical basis. Many open source projects, like SpamAssassin, assume some key traits of spammers, such as:

  • a spammer does not bother to set up a proper SMTP server,
  • a spammer will try to use as few resources as possible to send a high volume of emails,
  • a spammer's sending history will get his IPs blacklisted,
  • a spammer does not spend enough time designing a proper layout,
  • a spammer will not have sufficient information about the user,
  • the variation among mails sent by a spammer will be low, etc.

Of course, in today's world the points mentioned above are not sufficient to filter out spammers, which is why ESPs spend a lot of money and resources to filter them more efficiently. Still, these points do help filter out a majority of candidates. ESPs are open to every possible solution for filtering spam, so even they take advantage of these open source projects.

Filters written by standalone companies

ESPs do a lot to filter spam, but we don't know exactly how they implement it. Open source projects help us filter spam, but only to some extent. ESPs like Gmail understand their users better than anyone. They know

  • how their users respond to a particular type of mail,
  • what type of content could interest their users,
  • who is in a user's contact list,
  • how a user has manually marked his interest in a particular mail, etc.

As mentioned above, ESPs use learning-based algorithms to filter spam. A mail could be ham for one user and spam for another, depending on the individual's interests.

So, as an email marketing company, it is pointless for us to label an email as spam or ham as a whole on the basis of individual interest. Instead we have to focus on the person who is the source of these mails: the sender. We need to classify senders into two categories:

  • Spammer
  • Hammer

This creates the need for tools that can filter out mails that could be spam.

Why do we need the Spam Test Feature?

As mentioned above, it is impractical to tag a mail as spam on the basis of individual interest, because interests vary from person to person. So we have to focus on the source of the mails, i.e., the person who sends them: the sender.

On the basis of the sender's behavior, mails can be classified as spam or ham. As an email marketing company, we can analyze

  • the history of the sender,
  • responses to his past campaigns,
  • the route used by him,
  • the IPs and domains used by him,
  • the quality of the database used by him,
  • technical key points,
  • the content of the mails, etc.

But what if a sender has mistakenly chosen a wrong route, a blacklisted IP, a poor text-image ratio, or a broken HTML design? We assume that he has done this without any ill intent, so he should get a chance to correct it. This is where the Spam Test Feature comes in: he can run a test before scheduling the campaign and make the necessary corrections.
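
As an illustration of the kind of check a sender could run before scheduling a campaign, here is a minimal Python sketch that looks at the HTML body only. It is not Sarv's actual implementation: the names `BodyStats` and `preflight`, and the threshold of 100 text characters per image, are assumptions made up for this example.

```python
from html.parser import HTMLParser


class BodyStats(HTMLParser):
    """Counts images, iframes and visible text characters in an HTML body."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.iframes = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag == "iframe":
            self.iframes += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script>/<style> is not visible to the reader.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def preflight(html: str, min_chars_per_image: int = 100) -> list:
    """Return human-readable warnings; an empty list means nothing was found.

    min_chars_per_image is an assumed threshold, not a documented rule.
    """
    stats = BodyStats()
    stats.feed(html)
    stats.close()
    warnings = []
    if stats.iframes:
        warnings.append("body contains an iframe tag")
    if stats.images and stats.text_chars < stats.images * min_chars_per_image:
        warnings.append("too little text for the number of images")
    return warnings


if __name__ == "__main__":
    print(preflight('<p>Hello</p><img src="a.png"><iframe src="x"></iframe>'))
```

A sender who sees a non-empty list can fix the template and run the check again before the campaign goes out.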

What do we do?

Here at Sarv, we have incorporated the benefits of open source projects and designed our own tools to separate spam from ham. The system runs a vast array of tests and assigns an individual score to each. In the end, the sum of these scores defines the final category of the email.

An individual test may give a negative or a positive score.

  • A negative score indicates that the mailer has failed the test.
  • A positive score indicates that the mailer has passed the test.

[Figure: Spam Analysis. Test 1, Test 2, ... n produce Score 1, Score 2, ... n, which combine into the Final Score. Outcome labels: Ham, Spam, Forward to Spam Engine.]
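
The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SpamTest` structure, the two example tests, their weights, and the threshold of 0 are hypothetical and are not Sarv's real tests or values.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SpamTest:
    """One hypothetical test with the score it gives on a pass and on a fail."""
    name: str
    passes: Callable[[str], bool]  # True when the mailer passes the test
    pass_score: float              # positive score awarded on a pass
    fail_score: float              # negative score awarded on a failure


def score_message(message: str, tests: List[SpamTest]) -> Tuple[float, List[Tuple[str, float]]]:
    """Run every test and return (final score, list of per-test scores)."""
    details = []
    for test in tests:
        score = test.pass_score if test.passes(message) else test.fail_score
        details.append((test.name, score))
    total = sum(score for _, score in details)
    return total, details


def classify(total: float, threshold: float = 0.0) -> str:
    """Assumed rule: a final score below the threshold means spam."""
    return "spam" if total < threshold else "ham"


# Illustrative tests only; the names and weights are made up for this sketch.
EXAMPLE_TESTS = [
    SpamTest("no_iframe", lambda m: "<iframe" not in m.lower(), 0.5, -2.0),
    SpamTest("has_unsubscribe", lambda m: "unsubscribe" in m.lower(), 1.0, -1.0),
]

if __name__ == "__main__":
    total, details = score_message('<p>Hi</p><a href="#">Unsubscribe</a>', EXAMPLE_TESTS)
    print(details, total, classify(total))
```

Keeping the per-test scores alongside the total lets a sender see which tests pulled the final score down, not just the final category.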

We analyze the design, content, route, IPs, domains and other related information of the email. Below we have listed some of the key tests that we run on campaigns:

  • Header Tests
    • Sender email is freemail
    • From: empty name, localpart has series of non-vowel letters, localpart has long hexadecimal sequence, too many raw illegal characters, starts with many numbers
    • Reply-To freemail username ends in digit
    • Invalid Date: header
    • Character set doesn't exist
    • Illegal IP address in Received header
    • Multiple Content-Type headers found
    • Message headers are very long
    • Bulk email fingerprint like Gecko faked, Received PF, envfrom etc. found
    • SPF: sender matches SPF record
    • SPF: sender does not match SPF record
    • From: address is in the default/user's DKIM whitelist
    • From: address is in the default/user's SPF whitelist
    • X-mailer pattern common to anal porn site spam
  • Body Tests
    • Tests on HTML content, such as: message includes an 'HTML' tag, HTML contains unnecessary closing tags, font size is very large, background color and font color are similar
    • Mailer body has a high image-to-text ratio, which is not recommended
    • Body contains iframe tag
    • Misleading content, such as guarantees about something, phishing content, many mentions of money, weight loss, money-back guarantees, investment advice
    • Porn message
    • Using numeric IP address in URL
    • Using non-standard port number
    • Long hexadecimal sequence in URI hostname
    • Suspicious unsubscribe link
    • CGI in .info or .biz TLD other than third-level "www"
    • Bayes spam probability
    • Contains a URL listed in the SBL/SC SURBL/WS SURBL/PH SURBL/OB SURBL/AB SURBL/JP SURBL/URIBL blocklist
    • Unclear URI
    • Multipart message mostly text/html MIME
    • Message only has text/html MIME parts
    • Extra blank lines in base64 encoding
    • Weird repeated double-quotation marks
    • Body contains a ROT13-encoded email address
    • Incorporates a tracking ID number
    • Generic Test for Unsolicited Bulk Email
    • Different HTML and text parts
    • Message body has 80-90% blank lines
    • Character set indicates a foreign language
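
To make a few of the listed tests concrete, here is a rough Python sketch of some header and URL checks using only the standard library. It is not how SpamAssassin or Sarv implement these rules: the function names, the short freemail list, the 12-character hexadecimal threshold, the four-digit prefix rule, and the choice of 80 and 443 as standard ports are all assumptions for illustration.

```python
import ipaddress
import re
from email.utils import parseaddr
from urllib.parse import urlsplit

# Illustrative list only; a real freemail list is far longer.
FREEMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}
# Assumed threshold: 12 or more consecutive hex characters counts as "long".
LONG_HEX = re.compile(r"[0-9a-f]{12,}", re.IGNORECASE)


def check_from_header(from_value: str) -> list:
    """Apply a few From: header checks and return the findings."""
    name, addr = parseaddr(from_value)
    localpart, _, domain = addr.rpartition("@")
    findings = []
    if not name:
        findings.append("From: empty name")
    if domain.lower() in FREEMAIL_DOMAINS:
        findings.append("sender email is freemail")
    if LONG_HEX.search(localpart):
        findings.append("localpart has long hexadecimal sequence")
    if re.match(r"\d{4,}", localpart):  # assumed: four or more leading digits
        findings.append("localpart starts with many numbers")
    return findings


def check_url(url: str) -> list:
    """Apply a few URL checks from the body test list and return the findings."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    findings = []
    try:
        ipaddress.ip_address(host)
        findings.append("numeric IP address in URL")
    except ValueError:
        pass  # host is a name, not an IP literal
    try:
        port = parts.port
    except ValueError:
        port = None  # malformed port; ignored in this sketch
    if port is not None and port not in (80, 443):
        findings.append("non-standard port number")
    if LONG_HEX.search(host):
        findings.append("long hexadecimal sequence in URI hostname")
    return findings


if __name__ == "__main__":
    print(check_from_header("Offers <1234567abc@gmail.com>"))
    print(check_url("http://192.168.10.5:8080/offer"))
```

Each function returns a list of findings rather than a score, so the results could be fed into whatever scoring step follows.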