Tim's IT-Blog | How to find Chinese or Russian spam encoded in UTF-8 with SpamAssassin (Chinese and Russian spam filter rules)

Normally SpamAssassin does not support UTF-8, for performance reasons:

If it contains “utf8”, then that’s probably the problem. Change it so it does not contain “utf8” …, and the performance issues will clear up.

Perl 5.8 uses Unicode character sets internally in this situation, and unfortunately, this greatly hurts performance of all Perl code which operates heavily on strings (like SpamAssassin).

[Source: https://wiki.apache.org/spamassassin/Utf8Performance]

Chinese or Russian spam usually has Chinese or Russian characters in the Subject, encoded in UTF-8 and Base64.
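To make the encoding concrete, here is a minimal Python sketch of such a Subject. The sample text and variable names are made up for illustration. It builds a Base64 encoded-word, checks it with the same kind of pattern the raw Subject rules use, and decodes it again with the standard library.

```python
import base64
import re
from email.header import decode_header

# Hypothetical sample subject, built here so the Base64 payload is valid.
text = "Привет"  # Russian for "hello"
payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
raw_subject = "=?UTF-8?B?" + payload + "?="

# Raw-header check: literal "=?utf-8?", one encoding letter, then "?".
is_utf8_encoded = re.search(r"=\?utf-8\?.\?", raw_subject, re.IGNORECASE) is not None

# decode_header returns (bytes, charset) pairs for encoded words.
decoded_bytes, charset = decode_header(raw_subject)[0]
decoded_text = decoded_bytes.decode(charset)

print(raw_subject)
print(is_utf8_encoded, charset, decoded_text)
```

The question marks in the pattern are escaped because an unescaped `?` in a regular expression means "optional" and would not match the literal `=?utf-8?` prefix.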

To find Chinese or Russian spam encoded in UTF-8 (Unicode), you have to search for the UTF-8 byte sequences of the Chinese or Russian characters.
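The byte sequences used in the rules can be derived from the first and last code point of each Unicode block. A small Python sketch follows. The helper name `utf8_hex` is my own, and `bytes.hex` with a separator needs Python 3.8 or newer.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as space-separated hex bytes."""
    return chr(code_point).encode("utf-8").hex(" ")

# Endpoints of the Cyrillic block (U+0400 to U+04FF)
print(utf8_hex(0x0400))  # d0 80
print(utf8_hex(0x04FF))  # d3 bf

# Endpoints of the CJK Unified Ideographs block (U+4E00 to U+9FFF)
print(utf8_hex(0x4E00))  # e4 b8 80
print(utf8_hex(0x9FFF))  # e9 bf bf
```

The first and last byte sequence of a block give the ranges for the character classes in a rule, for example `d0` to `d3` for the lead byte of a Cyrillic character.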



#

# UTF-8 character search in the Subject

#
meta         CHARSET_UTF8_SUBJ_CYRIL (__CHARSET_SUBJECT_UTF8_ENCODED && (__CHARSET_UTF8_SUBJ_CRY1 || __CHARSET_UTF8_SUBJ_CRY2 || __CHARSET_UTF8_SUBJ_CRY3 || __CHARSET_UTF8_SUBJ_CRY4))

describe     CHARSET_UTF8_SUBJ_CYRIL Cyrillic UTF-8 Character in Subject

score        CHARSET_UTF8_SUBJ_CYRIL 1.1
meta         CHARSET_UTF8_SUBJ_CJK (__CHARSET_SUBJECT_UTF8_ENCODED && (__CHARSET__UTF8_SUBJ_CJK1 || __CHARSET__UTF8_SUBJ_CJK2 || __CHARSET__UTF8_SUBJ_CJK3))

describe     CHARSET_UTF8_SUBJ_CJK Chinese (CJK) UTF-8 Character in Subject

score        CHARSET_UTF8_SUBJ_CJK 1.1
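# The "?" characters must be escaped to match the literal "=?utf-8?X?" prefix of an encoded word
header       __CHARSET_SUBJECT_UTF8_ENCODED Subject:raw =~ /=\?utf-8\?.\?/i

header       __CHARSET_SUBJECT_UTF8_B_ENCODED Subject:raw =~ /=\?utf-8\?b\?/i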
# Unicode CJK Ideograph 4E00.9FFF

# U+4E00 [UTF-8 Bytecode e4 b8 80] ... U+9FFF [UTF-8 Bytecode e9 bf bf]

header       __CHARSET__UTF8_SUBJ_CJK1 Subject =~ /(?:[\xe4][\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])/
# Unicode - CJK Compatibility Ideographs F900.FAFF

# U+F900 [UTF-8 Bytecode ef a4 80] ... U+FAFF [UTF-8 Bytecode ef ab bf]

header       __CHARSET__UTF8_SUBJ_CJK2 Subject =~ /[\xef][\xa4-\xab][\x80-\xbf]/
# Unicode - CJK Unified Ideographs Extension A 3400.4DBF

# U+3400 [UTF-8 Bytecode e3 90 80] ... U+4DBF [UTF-8 Bytecode e4 b6 bf]

header       __CHARSET__UTF8_SUBJ_CJK3 Subject =~ /(?:[\xe3][\x90-\xbf][\x80-\xbf]|[\xe4][\x80-\xb6][\x80-\xbf])/
## TODO CJK # Unicode - CJK Radicals Supplement 2E80.2EFF

## TODO CJK # Unicode - CJK Symbols and Punctuation 3000.303F

## TODO CJK # Unicode - CJK Strokes 31C0.31EF

## TODO CJK # Unicode - Enclosed CJK Letters and Months 3200.32FF

## TODO CJK # Unicode - CJK Compatibility 3300.33FF
# Unicode - Cyrillic 0400.04FF

# U+0400 [UTF-8 Bytecode d0 80] ... U+04FF [UTF-8 Bytecode d3 bf]

header       __CHARSET_UTF8_SUBJ_CRY1  Subject =~ /[\xd0-\xd3][\x80-\xbf]/
# Unicode - Cyrillic Supplement 0500.052F

# U+0500 [UTF-8 Bytecode d4 80] ... U+052F [UTF-8 Bytecode d4 af]

header       __CHARSET_UTF8_SUBJ_CRY2  Subject =~ /[\xd4][\x80-\xaf]/
# Unicode - Cyrillic Extended-A 2DE0.2DFF

# U+2DE0 [UTF-8 Bytecode e2 b7 a0] ... U+2DFF [UTF-8 Bytecode e2 b7 bf]

header       __CHARSET_UTF8_SUBJ_CRY3  Subject =~ /[\xe2][\xb7][\xa0-\xbf]/
# Unicode - Cyrillic Extended-B A640.A69F

# U+A640 [UTF-8 Bytecode ea 99 80] ... U+A69F [UTF-8 Bytecode ea 9a 9f]

header       __CHARSET_UTF8_SUBJ_CRY4  Subject =~ /(?:[\xea][\x99][\x80-\xbf]|[\xea][\x9a][\x80-\x9f])/
#

# Non Latin UTF8 Character

#

# Unicode - Basic Latin

# U+0000 [UTF-8 Bytecode 00] ... U+007F  [UTF-8 Bytecode 7f]

# Unicode - Latin-1 Supplement

# U+0080 [UTF-8 Bytecode c2 80] ... U+00FF [UTF-8 Bytecode c3 bf]

#

# Ergo:

# get all U+0100 ... U+FFFF

header       __CHARSET_UTF8_SUBJ_NON_LATIN Subject =~ /(?:[\xc4-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf])/
#

# Only Latin UTF8 Character

meta         CHARSET_UTF8_B_SUBJ_LATIN (__CHARSET_SUBJECT_UTF8_B_ENCODED && !__CHARSET_UTF8_SUBJ_NON_LATIN)

describe     CHARSET_UTF8_B_SUBJ_LATIN Only LATIN UTF-8 Character in Base64-Encoded Subject (good)

score        CHARSET_UTF8_B_SUBJ_LATIN -0.1
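
The byte ranges can be checked outside SpamAssassin before the rules are deployed. The following Python sketch mirrors three of the patterns as byte-level regular expressions and runs them against made-up sample subjects. It assumes the decoded Subject is matched as raw UTF-8 bytes; how SpamAssassin actually presents the header can depend on version and configuration. The pattern labels and the `classify` helper are names chosen for this sketch only.

```python
import re

# Byte-level copies of the Subject patterns (labels are hypothetical).
PATTERNS = {
    "CYRILLIC": re.compile(rb"[\xd0-\xd3][\x80-\xbf]"),
    "CJK": re.compile(
        rb"(?:[\xe4][\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])"
    ),
    "NON_LATIN": re.compile(
        rb"(?:[\xc4-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf])"
    ),
}

def classify(subject: str) -> list:
    """Return the sorted labels of all patterns that match the UTF-8 bytes."""
    data = subject.encode("utf-8")
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(data))

print(classify("Привет"))  # ['CYRILLIC', 'NON_LATIN']
print(classify("你好"))     # ['CJK', 'NON_LATIN']
print(classify("Grüße"))   # []
print(classify("Hello"))   # []
```

"Grüße" matches nothing because ü and ß lie in the Latin-1 Supplement block (lead byte `c3`), which the non-Latin pattern deliberately skips.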

For more information about UTF-8 and its byte sequences, read the following pages:
UTF-8 character byte sequences: https://en.wikipedia.org/wiki/UTF-8
Unicode plane and block map: https://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
Online tool to look up the UTF-8 bytes of a character: https://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024&htmlent=1