SPONGY (SPam ONtoloGY): Spam mail filtering using an adaptive ontology by Seongwook Youn

Almost any modern computer can be used to send spam; the only other ingredient a spammer needs is a list of addresses to target. Spammers obtain email addresses by a number of means: harvesting addresses from Usenet postings, DNS listings, or Web pages; guessing common names at known domains (known as a dictionary attack); and "e-pending", or searching for email addresses corresponding to specific persons, such as residents in an area. Many spammers use programs called web spiders to find email addresses on web pages, although it is possible to fool a web spider by replacing the "@" symbol with another symbol, for example "#", when posting an email address. Once their addresses have been collected, users have to waste valuable time deleting spam emails. Moreover, because spam emails can quickly fill up the storage space of a file server, they can cause a severe problem for websites with thousands of users.
Much work on spam email filtering has been done using techniques such as decision trees, Naive Bayesian classifiers, and neural networks. To address the growing volume of unsolicited email, many different filtering methods are being deployed in commercial products. We constructed a framework for efficient email filtering using an ontology. Ontologies provide machine-understandable semantics for data, so an ontology can be used in any system. Sharing information between systems is important for more effective spam filtering. Thus, it is necessary to build an ontology and a framework for efficient email filtering. Using an ontology that is specially designed to filter spam, much unsolicited bulk email can be filtered out on the system. We used the Waikato Environment for Knowledge Analysis (Weka) Explorer and Jena to build the ontology from a sample dataset.
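The "@" substitution trick mentioned above can be illustrated with a small sketch. It assumes a spider that relies on a simple regular expression to recognize addresses; the pattern, the function names, and the sample address are all hypothetical, and real harvesters may well be more tolerant of such substitutions.

```python
import re

# Deliberately simple address pattern of the kind a naive spider might use.
# This pattern is an assumption for illustration; real harvesters vary.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest_addresses(page_text):
    """Return every substring of page_text that matches EMAIL_PATTERN."""
    return EMAIL_PATTERN.findall(page_text)


def obfuscate(address, replacement="#"):
    """Replace the "@" symbol so the simple pattern above no longer matches."""
    return address.replace("@", replacement)


plain = "Contact: alice@example.org"
hidden = "Contact: " + obfuscate("alice@example.org")

print(harvest_addresses(plain))   # ['alice@example.org']
print(harvest_addresses(hidden))  # []
```

The obfuscated form stays readable to a person but carries no "@" for the pattern to anchor on.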

Figure 1. SPONGY Architecture

Emails can be classified using different methods. Different people or email agents may maintain their own personal email classifiers and rules. Spam filtering is not a new problem, and a dozen different approaches to it have already been implemented. The problem has been studied mainly in areas such as artificial intelligence and machine learning. These implementations have different trade-offs, different performance metrics, and different classification efficiencies; techniques such as decision trees, Naive Bayesian classifiers, and neural networks each classify with different efficiency.
Figure 1 shows our framework for filtering spam. The training dataset is the set of emails from which the classification result is derived. The test data is the email that runs through our system, which we check to see whether it is correctly classified as spam or not. This is an ongoing process, so the test data is not finite: because of the learning procedure, test data is sometimes merged into the training data. The training dataset was used as input to J48 classification. To do that, the training dataset first had to be converted to a compatible input format. The J48 classification procedure then produced the classification result.
To query a test email in Jena, an ontology has to be created from the classification result. Creating an ontology requires an ontology language; we used RDF. The classification result was written out as an RDF file and loaded into Jena, which deployed it to create the ontology. This ontology, expressed in the RDF data model, is the base against which incoming mail is checked for legitimacy. Depending on the assertions that can be concluded from Jena's output, the email is classified as spam or not. The test email is supplied in the format that Jena takes in (i.e. CSV) and is run through the ontology, which yields a result of spam or not spam.
The SPONGY system periodically updates the dataset with the emails classified as spam whenever a user submits a spam report. The modified training dataset is then input to Weka to obtain a new classification result. From that classification result we obtain a new ontology, which can be used as a second spam filter. Through this procedure, the number of ontologies increases. Eventually, the spam filtering ontology is customized for each user. User-customized ontology filters differ from one another depending on each user’s background, preferences, hobbies, etc. That means an email might be spam for person A but not for person B. The SPONGY system provides an evolving spam filter based on the user’s preferences, so the user gets better spam filtering results.
The inputs to the system are the training dataset and then the test email. The test emails are the first set of emails that the system classifies and learns from; after a certain time, the system takes a variety of emails as input to be filtered as spam or not. The training dataset we used contains values for the features on which the decision tree classifies, and it is given to the classifier first to obtain the classification result. The classification result then needs to be converted to an ontology. The decision result that we obtained from J48 classification was mapped into an RDF file. This file was given as input to Jena, which then built the ontology for us. The ontology enabled us to determine how the different headers and the data inside an email are linked, based on the frequencies of words or characters in the dataset. The mapping also enabled us to obtain assertions about whether emails are legitimate. The next part was using this ontology to decide whether a new email is spam. This required querying the ontology, which was again done through Jena. The output of the query was the decision as to whether the new email is spam or not.
The primary way for the user to let the system know whether an email is spam would be through a GUI or a command-line input with a simple ‘yes’ or ‘no’. This would be part of a full-fledged working system, as opposed to our prototype, which is a basic research model.
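The conversion to a compatible input format can be sketched as follows. The sketch assumes the target is ARFF, the text format that Weka reads natively; the text above does not name the format, and the relation name, feature names, and sample rows are hypothetical.

```python
def to_arff(relation, feature_names, rows, class_values=("ham", "spam")):
    """Serialize numeric feature rows plus a class label into ARFF text.

    rows is a list of (feature_values, label) pairs, where feature_values
    has one number per entry in feature_names.
    """
    lines = ["@relation " + relation, ""]
    for name in feature_names:
        lines.append("@attribute " + name + " numeric")
    # The class attribute is nominal: its allowed values are listed in braces.
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines.append("")
    lines.append("@data")
    for values, label in rows:
        if len(values) != len(feature_names):
            raise ValueError("row length does not match the attribute list")
        if label not in class_values:
            raise ValueError("unknown class label: " + repr(label))
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical two-feature training set.
training_rows = [
    ([0.0, 0.0], "ham"),
    ([1.2, 0.4], "spam"),
]
print(to_arff("spam_training", ["word_freq_free", "char_freq_dollar"], training_rows))
```

The resulting text can be saved with an .arff extension and opened in the Weka Explorer, where J48 is available as a tree classifier.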
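One way to picture the step from classification result to queryable ontology is the sketch below. It is not the SPONGY implementation: it uses plain Python in place of Jena, assumes the J48 result is a binary threshold tree, and invents a namespace and vocabulary (concludes, hasCondition, onFeature, operator, threshold) purely for illustration. The triples are serialized as N-Triples, a line-based RDF syntax; the RDF syntax actually used is not stated above.

```python
NS = "http://example.org/spongy#"  # hypothetical namespace, not from the project
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"

# Hypothetical tree: (feature, threshold, subtree_if_le, subtree_if_gt);
# a leaf is just the class label.
TREE = ("word_freq_free", 0.5,
        ("char_freq_dollar", 0.1, "ham", "spam"),
        "spam")


def tree_to_rules(tree, path=()):
    """Flatten the tree into (conditions, label) pairs, one per leaf."""
    if isinstance(tree, str):
        return [(path, tree)]
    feature, threshold, low, high = tree
    return (tree_to_rules(low, path + ((feature, "le", threshold),)) +
            tree_to_rules(high, path + ((feature, "gt", threshold),)))


def rules_to_triples(rules):
    """Describe every rule and its conditions as (subject, predicate, object)."""
    triples = []
    for i, (conditions, label) in enumerate(rules):
        rule = NS + "rule" + str(i)
        triples.append((rule, NS + "concludes", NS + label))
        for j, (feature, op, threshold) in enumerate(conditions):
            cond = rule + "_cond" + str(j)
            triples.append((rule, NS + "hasCondition", cond))
            triples.append((cond, NS + "onFeature", NS + feature))
            triples.append((cond, NS + "operator", NS + op))
            triples.append((cond, NS + "threshold", threshold))
    return triples


def to_ntriples(triples):
    """Serialize as N-Triples; numeric objects become xsd:double literals."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, str):
            obj = "<" + o + ">"
        else:
            obj = '"' + repr(float(o)) + '"^^<' + XSD_DOUBLE + ">"
        lines.append("<" + s + "> <" + p + "> " + obj + " .")
    return "\n".join(lines)


def classify(triples, features):
    """Query the triples: label of the first rule whose conditions all hold."""
    props = {}
    for s, p, o in triples:
        props.setdefault(s, {}).setdefault(p, []).append(o)
    for po in props.values():
        if NS + "concludes" not in po:
            continue  # this subject is a condition node, not a rule
        holds = True
        for cond in po.get(NS + "hasCondition", []):
            c = props[cond]
            value = features[c[NS + "onFeature"][0][len(NS):]]
            threshold = c[NS + "threshold"][0]
            if c[NS + "operator"][0] == NS + "le":
                holds = holds and value <= threshold
            else:
                holds = holds and value > threshold
        if holds:
            return po[NS + "concludes"][0][len(NS):]
    return None


triples = rules_to_triples(tree_to_rules(TREE))
print(to_ntriples(triples))
print(classify(triples, {"word_freq_free": 0.0, "char_freq_dollar": 0.3}))  # spam
```

Because each leaf of the tree becomes one rule, the rules are mutually exclusive and at most one of them matches a given email.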
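The update cycle described above can be sketched in a few lines. Several parts are assumptions made for illustration: a one-split decision stump stands in for the Weka J48 step, each trained filter stands in for an ontology, an email is flagged when any filter in the chain flags it, and the class, function, and feature names are hypothetical.

```python
def train_stump(dataset):
    """Stand-in learner (NOT J48): choose the single "feature > threshold"
    test with the fewest training errors.

    dataset is a list of (features_dict, is_spam) pairs, is_spam a bool.
    """
    best = None
    for name in sorted(dataset[0][0]):
        for features, _ in dataset:
            threshold = features[name]
            errors = sum((f[name] > threshold) != is_spam
                         for f, is_spam in dataset)
            if best is None or errors < best[0]:
                best = (errors, name, threshold)
    _, name, threshold = best
    return lambda features: features[name] > threshold


class UserFilter:
    """Per-user chain of filters that grows as the user reports spam."""

    def __init__(self, training_data):
        self.training_data = list(training_data)
        self.filters = [train_stump(self.training_data)]

    def is_spam(self, features):
        # Assumed combination rule: flagged if any filter in the chain flags it.
        return any(f(features) for f in self.filters)

    def report_spam(self, features):
        """Merge a user-reported email into the dataset and add a new filter."""
        self.training_data.append((features, True))
        self.filters.append(train_stump(self.training_data))


initial = [
    ({"word_freq_free": 2.0, "word_freq_offer": 1.0}, True),
    ({"word_freq_free": 0.0, "word_freq_offer": 0.0}, False),
    ({"word_freq_free": 0.2, "word_freq_offer": 0.0}, False),
]
user_filter = UserFilter(initial)
missed = {"word_freq_free": 0.0, "word_freq_offer": 1.5}

print(user_filter.is_spam(missed))  # False
user_filter.report_spam(missed)
print(user_filter.is_spam(missed))  # True
```

Each user holds a separate UserFilter, so two users who report different emails end up with different filter chains.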
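The word and character frequencies that the features rely on could be computed along these lines. The exact feature definition is not given above, so this sketch assumes each feature is the percentage of an email's words (or characters) that match a given word (or character); the vocabulary and the sample text are hypothetical.

```python
import re


def word_frequencies(email_text, vocabulary):
    """Percentage of the email's words that equal each vocabulary word."""
    words = re.findall(r"[a-z0-9]+", email_text.lower())
    total = len(words)
    return {
        "word_freq_" + w: (100.0 * words.count(w) / total if total else 0.0)
        for w in vocabulary
    }


def char_frequencies(email_text, characters):
    """Percentage of the email's characters that equal each given character."""
    total = len(email_text)
    return {
        "char_freq_" + c: (100.0 * email_text.count(c) / total if total else 0.0)
        for c in characters
    }


text = "Free offer: claim your free prize right now"
print(word_frequencies(text, ["free", "offer", "meeting"]))
# {'word_freq_free': 25.0, 'word_freq_offer': 12.5, 'word_freq_meeting': 0.0}
```

Feature dictionaries of this shape are what a decision tree's threshold tests would be evaluated against.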



© 2000-2013 Semantic Information Research Laboratory. All Rights Reserved.