Tim's IT-Blog

Just a blog about IT and IT-Problems…

How to find Chinese or Russian spam encoded in UTF-8 with SpamAssassin (Chinese/Russian spam filter rules)

by admin_import on 03/06/2010

Normally SpamAssassin does not work with UTF-8, for performance reasons. The SpamAssassin wiki explains:

If it contains “utf8”, then that’s probably the problem. Change it so it does not contain “utf8” …, and the performance issues will clear up.

Perl 5.8 uses Unicode character sets internally in this situation, and unfortunately, this greatly hurts performance of all Perl code which operates heavily on strings (like SpamAssassin).

[Source: https://wiki.apache.org/spamassassin/Utf8Performance]

Chinese or Russian spam usually has Chinese or Russian characters in the Subject, encoded in UTF-8 and Base64.
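To make the "UTF-8 and Base64" part concrete, here is a minimal Python sketch (not part of the SpamAssassin rules). It assumes a made-up sample subject, "Привет", builds the kind of encoded Subject value such a mail would carry, and decodes it again with the standard library:

```python
import base64
from email.header import decode_header

# Hypothetical sample subject, chosen only for illustration.
subject_text = "Привет"

# Build the encoded form: =?charset?B?base64-of-the-utf8-bytes?=
payload = base64.b64encode(subject_text.encode("utf-8")).decode("ascii")
encoded = "=?UTF-8?B?" + payload + "?="
print(encoded)

# Decode it again; decode_header returns (bytes, charset) pairs.
raw_bytes, charset = decode_header(encoded)[0]
print(charset, raw_bytes.hex())
```

The hex output is the byte sequence that the rules further down look for in the Subject.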

To find Chinese or Russian spam encoded in UTF-8 (Unicode), you have to search for the UTF-8 byte sequences of the Chinese or Russian characters.
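If you do not want to look the byte sequences up by hand, they can be computed. The following Python sketch prints the UTF-8 bytes of the first and last code point of a few Unicode blocks; the helper name utf8_hex and the choice of blocks are only for illustration:

```python
# Unicode blocks as (first code point, last code point).
blocks = {
    "Cyrillic": (0x0400, 0x04FF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "CJK Compatibility Ideographs": (0xF900, 0xFAFF),
}

def utf8_hex(code_point):
    """Return the UTF-8 encoding of one code point as a hex string."""
    return chr(code_point).encode("utf-8").hex()

for name, (first, last) in blocks.items():
    print(f"{name}: U+{first:04X} = {utf8_hex(first)} ... "
          f"U+{last:04X} = {utf8_hex(last)}")
```

The first and last byte sequence of a block give the ranges for the character classes in the regular expressions.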

#
# UTF-8 character search in the Subject
#

meta CHARSET_UTF8_SUBJ_CYRIL (__CHARSET_SUBJECT_UTF8_ENCODED && (__CHARSET_UTF8_SUBJ_CRY1 || __CHARSET_UTF8_SUBJ_CRY2 || __CHARSET_UTF8_SUBJ_CRY3 || __CHARSET_UTF8_SUBJ_CRY4))
describe CHARSET_UTF8_SUBJ_CYRIL Cyrillic UTF-8 Character in Subject
score CHARSET_UTF8_SUBJ_CYRIL 1.1

meta CHARSET_UTF8_SUBJ_CJK (__CHARSET_SUBJECT_UTF8_ENCODED && (__CHARSET__UTF8_SUBJ_CJK1 || __CHARSET__UTF8_SUBJ_CJK2 || __CHARSET__UTF8_SUBJ_CJK3))
describe CHARSET_UTF8_SUBJ_CJK Chinese (CJK) UTF-8 Character in Subject
score CHARSET_UTF8_SUBJ_CJK 1.1

header __CHARSET_SUBJECT_UTF8_ENCODED Subject:raw =~ /=\?utf-8\?.\?/i
header __CHARSET_SUBJECT_UTF8_B_ENCODED Subject:raw =~ /=\?utf-8\?b\?/i

# Unicode CJK Ideograph 4E00.9FFF
# U+4E00 [UTF-8 Bytecode e4 b8 80] ... U+9FFF [UTF-8 Bytecode e9 bf bf]
header __CHARSET__UTF8_SUBJ_CJK1 Subject =~ /(?:[\xe4][\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])/

# Unicode - CJK Compatibility Ideographs F900.FAFF
# U+F900 [UTF-8 Bytecode ef a4 80] ... U+FAFF [UTF-8 Bytecode ef ab bf]
header __CHARSET__UTF8_SUBJ_CJK2 Subject =~ /[\xef][\xa4-\xab][\x80-\xbf]/

# Unicode - CJK Unified Ideographs Extension A 3400.4DBF
# U+3400 [UTF-8 Bytecode e3 90 80] ... U+4DBF [UTF-8 Bytecode e4 b6 bf]
header __CHARSET__UTF8_SUBJ_CJK3 Subject =~ /(?:[\xe3][\x90-\xbf][\x80-\xbf]|[\xe4][\x80-\xb6][\x80-\xbf])/

## TODO CJK # Unicode - CJK Radicals Supplement 2E80.2EFF
## TODO CJK # Unicode - CJK Symbols and Punctuation 3000.303F
## TODO CJK # Unicode - CJK Strokes 31C0.31EF
## TODO CJK # Unicode - Enclosed CJK Letters and Months 3200.32FF
## TODO CJK # Unicode - CJK Compatibility 3300.33FF

# Unicode - Cyrillic 0400.04FF
# U+0400 [UTF-8 Bytecode d0 80] ... U+04FF [UTF-8 Bytecode d3 bf]
header __CHARSET_UTF8_SUBJ_CRY1 Subject =~ /[\xd0-\xd3][\x80-\xbf]/

# Unicode - Cyrillic Supplement 0500.052F
# U+0500 [UTF-8 Bytecode d4 80] ... U+052F [UTF-8 Bytecode d4 af]
header __CHARSET_UTF8_SUBJ_CRY2 Subject =~ /[\xd4][\x80-\xaf]/

# Unicode - Cyrillic Extended-A 2DE0.2DFF
# U+2DE0 [UTF-8 Bytecode e2 b7 a0] ... U+2DFF [UTF-8 Bytecode e2 b7 bf]
header __CHARSET_UTF8_SUBJ_CRY3 Subject =~ /[\xe2][\xb7][\xa0-\xbf]/

# Unicode - Cyrillic Extended-B A640.A69F
# U+A640 [UTF-8 Bytecode ea 99 80] ... U+A69F [UTF-8 Bytecode ea 9a 9f]
header __CHARSET_UTF8_SUBJ_CRY4 Subject =~ /(?:[\xea][\x99][\x80-\xbf]|[\xea][\x9a][\x80-\x9f])/

#
# Non Latin UTF8 Character
#
# Unicode - Basic Latin
# U+0000 [UTF-8 Bytecode 00] ... U+007F [UTF-8 Bytecode 7f]
# Unicode - Latin-1 Supplement
# U+0080 [UTF-8 Bytecode c2 80] ... U+00FF [UTF-8 Bytecode c3 bf]
#
# Ergo:
# get all U+0100 ... U+FFFF
header __CHARSET_UTF8_SUBJ_NON_LATIN Subject =~ /(?:[\xc4-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf])/

#
# Only Latin UTF8 Character
meta CHARSET_UTF8_B_SUBJ_LATIN (__CHARSET_SUBJECT_UTF8_B_ENCODED && !__CHARSET_UTF8_SUBJ_NON_LATIN)
describe CHARSET_UTF8_B_SUBJ_LATIN Only LATIN UTF-8 Character in Base64-Encoded Subject (good)
score CHARSET_UTF8_B_SUBJ_LATIN -0.1
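
Before loading the rules into SpamAssassin, the byte patterns can be tried out on a few sample subjects. The Python sketch below is only an approximation: it uses Python's re module on bytes as a stand-in for the Perl regular expressions, copies three of the patterns with the \x escapes written out, and the function name classify and the sample subjects are made up for this test.

```python
import re

# Byte patterns corresponding to the CRY1, CJK1 and NON_LATIN rules.
CYRILLIC = re.compile(rb"[\xd0-\xd3][\x80-\xbf]")
CJK = re.compile(
    rb"(?:[\xe4][\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])"
)
NON_LATIN = re.compile(
    rb"(?:[\xc4-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf])"
)

def classify(subject):
    """Encode the subject as UTF-8 and report which patterns match."""
    data = subject.encode("utf-8")
    return {
        "cyrillic": bool(CYRILLIC.search(data)),
        "cjk": bool(CJK.search(data)),
        "non_latin": bool(NON_LATIN.search(data)),
    }

# Hypothetical sample subjects: Russian, Chinese, German.
for sample in ["Привет", "你好", "Grüße"]:
    print(sample, classify(sample))
```

The German sample contains only Latin-1 Supplement characters (UTF-8 lead byte c3), so none of the three patterns match it, which is the case the CHARSET_UTF8_B_SUBJ_LATIN rule is meant to catch.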

For more information about UTF-8 and its byte sequences, read the following pages:
UTF-8 Character-ByteCode: https://en.wikipedia.org/wiki/UTF-8
UTF-8 Country-Map: https://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
Online-Tool to get ByteCode for UTF-8 Character: https://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024&htmlent=1