Gmail gets biggest spam filter upgrade in years

Google recently published a Security Blog post detailing what it calls one of the biggest defense upgrades to Gmail’s spam filter in recent years: a new text classification system called Resilient and Efficient Text Vectorizer (RETVec). Google says it helps the filter understand adversarial text manipulations, meaning emails filled with special characters, emojis, misspellings, and other junk that humans can still read but machines could not easily understand. Until now, spam messages stuffed with such characters easily bypassed Gmail’s defenses.

While any spam filter can catch an email that says, “Congratulations! A $1,000 balance is available for your jackpot account,” the job gets much harder when the vast majority of the letters in the email are swapped for lookalikes drawn from the endless depths of the Unicode standard, where spammers can find characters that look like they are part of the regular Latin alphabet.
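To make the lookalike trick concrete, here is a minimal Python sketch of why a plain keyword filter misses such a message. The `HOMOGLYPHS` map, the `disguise` helper, and the `naive_filter` function are hypothetical names invented for this illustration; they are not part of Gmail or RETVec.

```python
import unicodedata

# Hypothetical lookalike map: Latin letter -> visually similar Cyrillic letter.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}

def disguise(text):
    """Swap each mapped Latin letter for its lookalike; leave the rest alone."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def naive_filter(message, blocked=("jackpot",)):
    """Flag a message only if it contains a blocked word, compared by code point."""
    lowered = message.lower()
    return any(word in lowered for word in blocked)

plain = "A $1,000 balance is available for your jackpot account"
spoofed = disguise(plain)

plain_flagged = naive_filter(plain)      # the ordinary text is caught
spoofed_flagged = naive_filter(spoofed)  # the lookalike text slips through

# The two strings look alike on screen but are made of different characters.
lookalike_name = unicodedata.name(HOMOGLYPHS["a"])
```

Both strings have the same length and render almost identically, yet the substring check only matches the first one, because the comparison is done on code points rather than on appearance.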

Google says RETVec is trained to be resilient to character-level manipulations including insertions, deletions, misspellings, homoglyphs, LEET substitutions, and more. The RETVec model is built on a new character encoder that can efficiently encode all UTF-8 characters and words. As a result, RETVec works out of the box across more than 100 languages without requiring lookup tables or a fixed vocabulary size.
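The idea of encoding text without a vocabulary table can be sketched in a few lines of Python. This is a simplified illustration, not RETVec's actual encoder: the function names, the 24-bit width, and the 16-character word length are assumptions chosen for the example, not figures from the article.

```python
def encode_char(ch, bits=24):
    """Turn one character into a fixed-width list of bits (least significant first).

    Every Unicode code point fits in 21 bits, so an assumed width of 24 bits
    covers any character without a lookup table.
    """
    code_point = ord(ch)
    return [(code_point >> i) & 1 for i in range(bits)]

def encode_word(word, max_chars=16, bits=24):
    """Encode a word as a max_chars x bits grid, truncating or zero-padding."""
    rows = [encode_char(ch, bits) for ch in word[:max_chars]]
    rows += [[0] * bits for _ in range(max_chars - len(rows))]
    return rows

def decode_char(bit_list):
    """Inverse of encode_char, used here only to check that nothing is lost."""
    return chr(sum(bit << i for i, bit in enumerate(bit_list)))

# Latin, Cyrillic, and emoji characters all go through the same code path.
grid = encode_word("j\u0430ckpot\U0001F600")
```

Because the encoding is derived directly from each character's code point, unseen words, unusual scripts, and emoji all produce a valid input of the same shape, which is what makes a fixed vocabulary unnecessary in this sketch.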

Thanks to RETVec, Gmail can now better recognize and filter spam

Google says the performance difference is dramatic. Methods that use fixed vocabulary sizes or lookup tables of homoglyphs are resource-intensive. RETVec, on the other hand, has only 200,000 parameters instead of millions, so although Google has ample cloud capacity for spam filtering, the model is small enough to run on a local device. RETVec is open source, and Google hopes it will put an end to homoglyph attacks.

RETVec is a TensorFlow machine learning model that works much as a human reader would, using visual similarity to determine the meaning of words rather than their actual character content. This approach has led to big improvements: Google says that replacing Gmail's spam classifier with RETVec improved the spam detection rate over the baseline by 38% and reduced false positives by 19.4%. Using RETVec also reduced the model's TPU usage by 83%, making the RETVec rollout one of the biggest upgrades in recent years. The company has been testing RETVec internally for the past year and has now rolled it out to all Gmail users.
