We use machine learning and deep learning to automatically classify websites.
Traditional methods of classifying websites are based on the subjective judgment of the operator, URL rules or ready-made pattern databases, which makes them ineffective in a dynamically changing Internet.
The BTC Website Classification solution eliminates these limitations, allowing instant and accurate classification of any site. What’s more, each site is reanalyzed every month, which is especially important in situations where the owner or content changes.
Our technology stands out not only for its speed but also for the quality of its results. The classification process is based on three independent algorithms whose evaluation mechanisms are continuously improved. This makes the solution well suited to professional IT management and network security systems.
Unlike many foreign classifiers, BTC Website Classification effectively analyzes content in Polish and 51 other languages, making it one of the most comprehensive tools of its kind. In addition, each classification describes not only the subject matter of a site but also its impact on productivity and its potential threat to the user. The system effectively detects phishing sites, government-blocked sites (such as gambling sites) and other dangerous resources.
Our solution also offers API access, which makes it easy to integrate with other systems, for example to automatically block sites that belong to specific categories such as pornography or phishing. In addition, BTC Website Classification is not limited to analyzing the homepage: when the homepage provides insufficient data, it also examines linked subpages, and it follows redirects, which significantly increases the effectiveness of the classification.
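As a rough illustration of the category-based blocking mentioned above, the sketch below decides whether a site should be blocked once a classification result is available. The result shape (a dictionary with `url` and `category` keys), the category names and the function names are assumptions made for this example only; the actual API response format is not described here and may differ.

```python
# Hypothetical blocklist; category names are illustrative, not the official list.
BLOCKED_CATEGORIES = {"pornography", "phishing"}


def should_block(result, blocked=BLOCKED_CATEGORIES):
    """Return True when the result's category is on the blocklist.

    `result` is assumed to be an already parsed classification result,
    e.g. {"url": "example.com", "category": "phishing"}.
    """
    category = str(result.get("category", "")).strip().lower()
    return category in blocked


def filter_allowed(results, blocked=BLOCKED_CATEGORIES):
    """Return the URLs of all results that are not blocked."""
    return [r["url"] for r in results if not should_block(r, blocked)]
```

A firewall or proxy integration could call `should_block` for each classified address and keep the blocklist itself in configuration, so that categories can be added or removed without code changes.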
The addresses are sent to the classifier
The site code is downloaded for later analysis
The code of the page is cleaned of unnecessary data, such as repeated words and HTML tags
After the code is cleaned of unnecessary components, only the words (keywords) that define the nature of the site remain
The data is processed to improve the effectiveness of the deep learning model
Repeated keywords are assigned to categories based on the dictionary, and the number (saturation) of words within each category is determined
The entire context of the page is taken into account during analysis, which makes it possible to classify multi-topic pages more effectively
The site is assigned to the category that has been identified as the most probable
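The dictionary-based part of the steps above can be sketched roughly as follows. This is a simplified illustration using only the Python standard library, not the production pipeline: it omits the download step and the deep learning model, and the category dictionary, its keywords and all function names are assumptions invented for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical category dictionary; the real dictionary is not published.
CATEGORY_KEYWORDS = {
    "gambling": {"casino", "poker", "bet", "jackpot"},
    "news": {"news", "report", "politics", "headline"},
}


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_words(html):
    """Strip HTML tags and return the lowercase words of the page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    return re.findall(r"[^\W\d_]+", text)


def category_saturation(words):
    """Count how many words on the page belong to each category."""
    counts = Counter(words)
    return {
        category: sum(counts[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }


def classify(html):
    """Return the category with the highest saturation, or None if no keyword matched."""
    saturation = category_saturation(extract_words(html))
    best = max(saturation, key=saturation.get)
    return best if saturation[best] > 0 else None
```

Counting matches per category is only a stand-in for the "most probable category" decision; a real classifier would combine such counts with the model's view of the whole page context.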