Artificial intelligence - BTC AI solutions

Website classification – which method is more effective?

Machine learning and deep learning are the most popular methods used to classify websites.

Which one achieves better results, and how does the choice affect the effectiveness of the solution?

This article answers that question.

Website classification

A solution for categorizing and classifying websites supports effective IT management. Web address classification has many applications:

  • The tool allows monitoring of the content viewed by users, which can help increase productivity.
  • The classifier indicates which employee activities are unrelated to their work.
  • Automatic monitoring of viewed content enables faster detection of potential threats.
  • The classifier can automatically block sites that the IT administrator deems unsafe.

Comparison of website classification methods based on the BTC AI Website Classification solution

We use machine learning and deep learning to classify URLs. The machine learning approach analyzes pages using a dictionary method. The deep learning approach uses neural networks to analyze the entire content of a page.

Machine Learning – dictionary method

URL classification using machine learning involves analyzing a page with a dictionary method. Tens to thousands of words found on the page are retrieved. The algorithm is probabilistic and returns the most likely category for the analyzed content. If the retrieved data is ambiguous, or a particular word is overrepresented on the page, the classifier may assign the wrong category.
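The dictionary approach described above can be sketched as a small word-probability scorer. This is only an illustration, not the BTC AI implementation: the category names, the word probabilities in `WORD_PROBS`, the smoothing value `UNKNOWN_PROB`, and the function names are all hypothetical, and the scoring follows a naive Bayes style that the article does not confirm.

```python
import math
import re
from collections import Counter

# Hypothetical per-category word probabilities (a tiny "dictionary").
# A real dictionary would hold thousands of words learned from labeled pages.
WORD_PROBS = {
    "sports": {"match": 0.30, "team": 0.30, "score": 0.25, "bet": 0.05},
    "gambling": {"bet": 0.40, "casino": 0.30, "odds": 0.20, "score": 0.05},
}
UNKNOWN_PROB = 0.01  # assumed smoothing value for words missing from a category


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text):
    """Return (best_category, probabilities) for the words found in text."""
    counts = Counter(tokenize(text))
    log_scores = {}
    for category, probs in WORD_PROBS.items():
        # Sum log-probabilities; every repeated occurrence counts again.
        log_scores[category] = sum(
            n * math.log(probs.get(word, UNKNOWN_PROB))
            for word, n in counts.items()
        )
    # Turn log-scores into probabilities that sum to 1.
    top = max(log_scores.values())
    exps = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exps.values())
    probabilities = {c: v / total for c, v in exps.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


if __name__ == "__main__":
    print(classify("The team won the match with a high score"))
    # A sports page that repeats one gambling word many times.
    print(classify("team match report: bet bet bet bet"))
```

With these made-up numbers, the first page is scored as sports, while the second is scored as gambling even though it mentions a team and a match. One repeated word outweighs the rest of the vocabulary, which is the kind of miscategorization the dictionary method is prone to.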

Deep Learning – the use of neural networks

In addition to analyzing individual words, deep learning uses neural networks to understand the context of the entire content, which is important for analyzing a website correctly. Deep learning achieves better classification results when the algorithm is given a very large amount of data. In the case of the BTC AI Website Classification solution, the classifier is fed with a database of 8.9 million web pages (data as of 19.03.2021). The data acquisition process that feeds the classifier model is fully automated, so the results keep improving as more data is collected.

Deep learning is more effective at classifying web pages

Moving from dictionary-based machine learning to analyzing the entire page together with the context of its content can produce better results. A deep learning network is a neural network with a large number of hidden layers. Thanks to these deep connections, the algorithm is able to capture the context of the analyzed content. Using deep learning to classify URLs avoids the problem of an overrepresented word and the miscategorization that results from it: the algorithm can determine on its own that a word is out of context and leave it out of the categorization process. In addition, a deep neural network reports how confident it is in its classification of the analyzed page. The page may contain text from multiple categories, or too few words. In such a case, the algorithm returns a category score of only 10% or 20%, which tells the user that it is not sure of its categorization.
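The confidence score mentioned above can be illustrated with a short sketch. It assumes the network ends in a softmax layer that turns raw output values (logits) into per-category probabilities. The article does not state this, and the labels, the example logits, and the `CONFIDENCE_THRESHOLD` value are hypothetical.

```python
import math

CONFIDENCE_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the article


def softmax(logits):
    """Convert raw network outputs into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorize(logits, labels, threshold=CONFIDENCE_THRESHOLD):
    """Pick the most probable label and report how confident the pick is."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "category": labels[best],
        "confidence": probs[best],
        "is_confident": probs[best] >= threshold,
    }


if __name__ == "__main__":
    labels = ["news", "sports", "gambling", "shopping", "technology"]
    # One category clearly dominates: high confidence.
    print(categorize([4.0, 0.5, 0.2, 0.1, 0.3], labels))
    # Mixed or sparse content: all categories score about the same.
    print(categorize([0.30, 0.20, 0.25, 0.10, 0.20], labels))
```

When the logits are nearly equal, as in the second call, the top probability stays close to one divided by the number of categories. That corresponds to the low score reported for pages with mixed or insufficient text.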

BTC AI Website Classification

The BTC AI solution for URL classification uses both methods simultaneously and returns the machine learning category and the deep learning category to the user together. The goal of the classifier is automatic threat detection and more effective management through monitoring of employee activity. Advanced algorithms provide IT administrators with useful information about system users, so that they can manage the IT infrastructure effectively and ensure data security. Web pages are analyzed in detail in terms of structure and categories.

What does the website classifier tell you?

  • Page language: The parameter indicates whether the language of the page has been correctly detected.
  • SSL certificate: The system checks whether the site is secured with an SSL certificate.
  • List of gambling sites: The system checks whether the site appears in the Ministry of Finance database of gambling sites. The Ministry of Finance Sites Register (https://hazard.mf.gov.pl/) is a database of sites used to offer gambling in violation of the law.
  • Secure structure: The system analyzes which tags are present on the page and, based on this, assesses whether the page structure is secure.
  • Safe category: The system checks whether the website falls into categories considered safe. Pornography and gambling are considered unsafe categories.
  • CERT list: The system checks whether the site appears in the CERT database. The CERT list (https://www.cert.pl/) is a database of sites considered dangerous.
  • List of sites containing malware: The system checks whether the site appears in the URLhaus database. The URLhaus list (https://urlhaus.abuse.ch/) is a database of sites containing malware.
  • Redirects: The system checks whether the page contains redirects. Pages containing redirects are considered suspicious.
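Several of the checks listed above can be sketched as one small report function. This is a simplified illustration under stated assumptions, not the way the BTC AI product works. The function names, the report keys, and the in-memory `blocklists` of hostnames are hypothetical. The HTTPS test only inspects the URL scheme and does not validate a certificate. Only `<meta http-equiv="refresh">` redirects are detected.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Unsafe categories named in the article.
UNSAFE_CATEGORIES = {"pornography", "gambling"}


class MetaRefreshFinder(HTMLParser):
    """Detects <meta http-equiv="refresh"> tags, one common form of redirect."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = {name: (value or "") for name, value in attrs}
            if attr_map.get("http-equiv", "").lower() == "refresh":
                self.found = True


def has_meta_redirect(html):
    """Return True if the HTML contains a meta refresh redirect."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.found


def assess_page(url, html, category, blocklists):
    """Build a report of simple safety flags for one page.

    blocklists maps a list name to a set of hostnames (hypothetical
    in-memory stand-ins for the external databases).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    report = {
        # Scheme check only; real certificate validation needs a connection.
        "uses_https": parts.scheme == "https",
        "safe_category": category.lower() not in UNSAFE_CATEGORIES,
        "has_redirect": has_meta_redirect(html),
    }
    for name, hosts in blocklists.items():
        report[f"on_{name}_list"] = host in hosts
    return report


if __name__ == "__main__":
    page = '<html><head><meta http-equiv="refresh" content="0; url=/next"></head></html>'
    lists = {"cert": {"bad.example"}, "urlhaus": set()}
    print(assess_page("http://bad.example/start", page, "Gambling", lists))
```

Keeping each check as a separate flag, rather than collapsing them into a single verdict, mirrors the per-parameter output described in the list and leaves the final decision to the administrator.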