{"id":29,"date":"2010-06-03T17:10:28","date_gmt":"2010-06-03T15:10:28","guid":{"rendered":"https:\/\/it-blog.timk.de\/?p=29"},"modified":"2010-06-03T17:10:28","modified_gmt":"2010-06-03T15:10:28","slug":"howto-find-chinese-or-russian-spam-encoded-in-utf-8-with-spamassassin","status":"publish","type":"post","link":"https:\/\/www.timk.de\/it-blog\/howto-find-chinese-or-russian-spam-encoded-in-utf-8-with-spamassassin\/","title":{"rendered":"HowTo find Chinese or Russian Spam encoded in UTF-8 with SpamAssassin (Chinese Russian Spam filter rules)"},"content":{"rendered":"<p>SpamAssassin normally does not support UTF-8, for performance reasons:<\/p>\n<blockquote><p>If it contains &#8220;utf8&#8221;, then that&#8217;s probably the problem. Change it so it does not contain &#8220;utf8&#8221; &#8230;, and the performance issues will clear up.<\/p>\n<p>Perl 5.8 uses Unicode character sets internally in this situation, and unfortunately, this greatly hurts performance of all Perl code which operates heavily on strings (like SpamAssassin). 
<\/p><\/blockquote>\n<p>[Source: <a href=\"https:\/\/wiki.apache.org\/spamassassin\/Utf8Performance\">https:\/\/wiki.apache.org\/spamassassin\/Utf8Performance<\/a>]<\/p>\n<p>Most Chinese or Russian spam carries Chinese or Cyrillic characters in the Subject header, encoded as UTF-8 and Base64.<\/p>\n<p>To find Chinese or Russian spam encoded in UTF-8 (Unicode), you have to match the UTF-8 byte sequences of the Chinese or Cyrillic characters.<\/p>\n<p><!--more--><code><\/p>\n<blockquote><p>\n#<br \/>\n# UTF-8 character search in the Subject<br \/>\n#<\/p>\n<p>meta         CHARSET_UTF8_SUBJ_CYRIL (__CHARSET_SUBJECT_UTF8_ENCODED && (__CHARSET_UTF8_SUBJ_CRY1 || __CHARSET_UTF8_SUBJ_CRY2 || __CHARSET_UTF8_SUBJ_CRY3 || __CHARSET_UTF8_SUBJ_CRY4))<br \/>\ndescribe     CHARSET_UTF8_SUBJ_CYRIL Cyrillic UTF-8 Character in Subject<br \/>\nscore        CHARSET_UTF8_SUBJ_CYRIL 1.1<\/p>\n<p>meta         CHARSET_UTF8_SUBJ_CJK (__CHARSET_SUBJECT_UTF8_ENCODED && (__CHARSET__UTF8_SUBJ_CJK1 || __CHARSET__UTF8_SUBJ_CJK2 || __CHARSET__UTF8_SUBJ_CJK3))<br \/>\ndescribe     CHARSET_UTF8_SUBJ_CJK Chinese (CJK) UTF-8 Character in Subject<br \/>\nscore        CHARSET_UTF8_SUBJ_CJK 1.1<\/p>\n<p>header       __CHARSET_SUBJECT_UTF8_ENCODED Subject:raw =~ \/=\\?utf-8\\?.\\?\/i<br \/>\nheader       __CHARSET_SUBJECT_UTF8_B_ENCODED Subject:raw =~ \/=\\?utf-8\\?b\\?\/i<\/p>\n<p># Unicode - CJK Unified Ideographs 4E00.9FFF<br \/>\n# U+4E00 [UTF-8 Bytecode e4 b8 80] ... U+9FFF [UTF-8 Bytecode e9 bf bf]<br \/>\nheader       __CHARSET__UTF8_SUBJ_CJK1 Subject =~ \/(?:[\\xe4][\\xb8-\\xbf][\\x80-\\xbf]|[\\xe5-\\xe9][\\x80-\\xbf][\\x80-\\xbf])\/<\/p>\n<p># Unicode - CJK Compatibility Ideographs F900.FAFF<br \/>\n# U+F900 [UTF-8 Bytecode ef a4 80] ... U+FAFF [UTF-8 Bytecode ef ab bf]<br \/>\nheader       __CHARSET__UTF8_SUBJ_CJK2 Subject =~ \/[\\xef][\\xa4-\\xab][\\x80-\\xbf]\/<\/p>\n<p># Unicode - CJK Unified Ideographs Extension A 3400.4DBF<br \/>\n# U+3400 [UTF-8 Bytecode e3 90 80] ... 
U+4DBF [UTF-8 Bytecode e4 b6 bf]<br \/>\nheader       __CHARSET__UTF8_SUBJ_CJK3 Subject =~ \/(?:[\\xe3][\\x90-\\xbf][\\x80-\\xbf]|[\\xe4][\\x80-\\xb6][\\x80-\\xbf])\/<\/p>\n<p>## TODO CJK # Unicode - CJK Radicals Supplement 2E80.2EFF<br \/>\n## TODO CJK # Unicode - CJK Symbols and Punctuation 3000.303F<br \/>\n## TODO CJK # Unicode - CJK Strokes 31C0.31EF<br \/>\n## TODO CJK # Unicode - Enclosed CJK Letters and Months 3200.32FF<br \/>\n## TODO CJK # Unicode - CJK Compatibility 3300.33FF<\/p>\n<p># Unicode - Cyrillic 0400.04FF<br \/>\n# U+0400 [UTF-8 Bytecode d0 80] ... U+04FF [UTF-8 Bytecode d3 bf]<br \/>\nheader       __CHARSET_UTF8_SUBJ_CRY1  Subject =~ \/[\\xd0-\\xd3][\\x80-\\xbf]\/<\/p>\n<p># Unicode - Cyrillic Supplement 0500.052F<br \/>\n# U+0500 [UTF-8 Bytecode d4 80] ... U+052F [UTF-8 Bytecode d4 af]<br \/>\nheader       __CHARSET_UTF8_SUBJ_CRY2  Subject =~ \/[\\xd4][\\x80-\\xaf]\/<\/p>\n<p># Unicode - Cyrillic Extended-A 2DE0.2DFF<br \/>\n# U+2DE0 [UTF-8 Bytecode e2 b7 a0] ... U+2DFF [UTF-8 Bytecode e2 b7 bf]<br \/>\nheader       __CHARSET_UTF8_SUBJ_CRY3  Subject =~ \/[\\xe2][\\xb7][\\xa0-\\xbf]\/<\/p>\n<p># Unicode - Cyrillic Extended-B A640.A69F<br \/>\n# U+A640 [UTF-8 Bytecode ea 99 80] ... U+A69F [UTF-8 Bytecode ea 9a 9f]<br \/>\nheader       __CHARSET_UTF8_SUBJ_CRY4  Subject =~ \/(?:[\\xea][\\x99][\\x80-\\xbf]|[\\xea][\\x9a][\\x80-\\x9f])\/<\/p>\n<p>#<br \/>\n# Non-Latin UTF-8 characters<br \/>\n#<br \/>\n# Unicode - Basic Latin<br \/>\n# U+0000 [UTF-8 Bytecode 00] ... U+007F  [UTF-8 Bytecode 7f]<br \/>\n# Unicode - Latin-1 Supplement<br \/>\n# U+0080 [UTF-8 Bytecode c2 80] ... U+00FF [UTF-8 Bytecode c3 bf]<br \/>\n#<br \/>\n# Ergo:<br \/>\n# get all U+0100 ... 
U+FFFF<br \/>\nheader       __CHARSET_UTF8_SUBJ_NON_LATIN Subject =~ \/(?:[\\xc4-\\xdf][\\x80-\\xbf]|[\\xe0-\\xef][\\x80-\\xbf][\\x80-\\xbf])\/<\/p>\n<p>#<br \/>\n# Only Latin UTF-8 characters<br \/>\nmeta         CHARSET_UTF8_B_SUBJ_LATIN (__CHARSET_SUBJECT_UTF8_B_ENCODED && !__CHARSET_UTF8_SUBJ_NON_LATIN)<br \/>\ndescribe     CHARSET_UTF8_B_SUBJ_LATIN Only LATIN UTF-8 Character in Base64-Encoded Subject (good)<br \/>\nscore        CHARSET_UTF8_B_SUBJ_LATIN -0.1\n<\/p><\/blockquote>\n<p><\/code><\/p>\n<p>For more information about UTF-8 and its byte sequences, see the following pages:<br \/>\nUTF-8 byte encoding: <a href=\"https:\/\/en.wikipedia.org\/wiki\/UTF-8\">https:\/\/en.wikipedia.org\/wiki\/UTF-8<\/a><br \/>\nUnicode block map: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Mapping_of_Unicode_character_planes\">https:\/\/en.wikipedia.org\/wiki\/Mapping_of_Unicode_character_planes<\/a><br \/>\nOnline tool to look up the UTF-8 bytes of a character: <a href=\"https:\/\/www.utf8-zeichentabelle.de\/unicode-utf8-table.pl?number=1024&#038;htmlent=1\">https:\/\/www.utf8-zeichentabelle.de\/unicode-utf8-table.pl?number=1024&#038;htmlent=1<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>SpamAssassin normally does not support UTF-8, for performance reasons: If it contains &#8220;utf8&#8221;, then that&#8217;s probably the problem. Change it so it does not contain &#8220;utf8&#8221; &#8230;, and the performance issues will clear up. 
Perl 5.8 uses Unicode character sets internally in this situation, and unfortunately, this greatly hurts performance of all Perl code which [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,11],"tags":[16,49,32,53,41,42],"class_list":["post-29","post","type-post","status-publish","format-standard","hentry","category-perl","category-spamassassin","tag-chinese-spam","tag-perl","tag-russian-spam","tag-spamassassin","tag-unicode","tag-utf-8"],"_links":{"self":[{"href":"https:\/\/www.timk.de\/it-blog\/wp-json\/wp\/v2\/posts\/29","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.timk.de\/it-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.timk.de\/it-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.timk.de\/it-blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.timk.de\/it-blog\/wp-json\/wp\/v2\/comments?post=29"}],"version-history":[{"count":0,"href":"https:\/\/www.timk.de\/it-blog\/wp-json\/wp\/v2\/posts\/29\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.timk.de\/it-blog\/wp-json\/wp\/v2\/media?parent=29"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.timk.de\/it-blog\/wp-json\/wp\/v2\/categories?post=29"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.timk.de\/it-blog\/wp-json\/wp\/v2\/tags?post=29"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}