The Internet grows daily and, like any other transport network, anybody can use it. Yet every day more and more people are annoyed by unwanted email, known as spam. In general, a spammer who obtains an email address published on a web site can send junk mail to it indefinitely. For years, researchers have studied how spam is produced. Various methods have been proposed to eliminate unwanted email, but to date none has had a long-term effect.
We are working on a solution that parses the entire text of each message, including headers, embedded HTML and script code, in two corpora (spam and valid, each containing around 8000 messages) for spam detection. The headers also carry information about the relay servers, which we believe can be used to build a network topology for spam source identification. This solution alone, however, is not enough to differentiate effectively between spam and regular messages. We are currently developing a filter that combines the results derived from the inferential knowledge of a user with a content-based filter to classify messages. Based on the past history, trust, reputation and inferential knowledge of the user, a source can be recognized as spam or genuine. The filter is adaptive: it uses Bayesian learning algorithms, learns over time, and dynamically updates its database with the source user information. Further work includes using this filter to detect spam in real-time traffic environments such as VoIP.
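As a rough illustration of how a content-based score and a source-based score might be combined, the following Python sketch trains a small naive Bayes model on whole-message tokens and keeps a per-relay count taken from the Received headers. It is not the filter described above: naive Bayes with add-one smoothing stands in for the unspecified Bayesian learning algorithm, the relay count stands in for the trust and reputation knowledge, and the names AdaptiveFilter, tokenize and relay_hosts are hypothetical. The Received parsing is deliberately simplistic and ignores the fact that such headers can be forged.

```python
import math
import re
from collections import Counter
from email import message_from_string

TOKEN_RE = re.compile(r"[a-z0-9$']+")


def tokenize(raw):
    """Lower-cased tokens from the whole raw message (headers, HTML, script)."""
    return TOKEN_RE.findall(raw.lower())


def relay_hosts(raw):
    """Host named after 'from' in each Received header, in header order."""
    msg = message_from_string(raw)
    hosts = []
    for value in msg.get_all("Received", []):
        match = re.match(r"\s*from\s+(\S+)", value)
        if match:
            hosts.append(match.group(1).lower())
    return hosts


class AdaptiveFilter:
    """Hypothetical sketch: naive Bayes on content plus a relay history score."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "valid": Counter()}
        self.sources = {"spam": Counter(), "valid": Counter()}
        self.messages = Counter()

    def learn(self, raw, label):
        """Update token and relay counts; label is 'spam' or 'valid'."""
        self.tokens[label].update(tokenize(raw))
        self.sources[label].update(relay_hosts(raw))
        self.messages[label] += 1

    def _content_log_odds(self, raw):
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["valid"])) or 1
        n_spam = sum(self.tokens["spam"].values())
        n_valid = sum(self.tokens["valid"].values())
        score = 0.0
        for tok in tokenize(raw):
            # Add-one (Laplace) smoothing so unseen tokens do not zero out.
            p_spam = (self.tokens["spam"][tok] + 1) / (n_spam + vocab)
            p_valid = (self.tokens["valid"][tok] + 1) / (n_valid + vocab)
            score += math.log(p_spam / p_valid)
        return score

    def _source_log_odds(self, raw):
        score = 0.0
        for host in relay_hosts(raw):
            seen_spam = self.sources["spam"][host] + 1
            seen_valid = self.sources["valid"][host] + 1
            score += math.log(seen_spam / seen_valid)
        return score

    def spam_probability(self, raw):
        """Combine prior, content and source log-odds into a probability."""
        prior = math.log((self.messages["spam"] + 1) / (self.messages["valid"] + 1))
        z = prior + self._content_log_odds(raw) + self._source_log_odds(raw)
        z = max(-50.0, min(50.0, z))  # clamp to keep exp() in range
        return 1.0 / (1.0 + math.exp(-z))
```

Calling learn on every message whose label becomes known is what makes the sketch adaptive: both the token counts and the relay counts shift over time. A message would be flagged when spam_probability exceeds a threshold chosen by the operator.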