Algorithms, performance and accuracy explained
This essential blog compares the approach used by 51Degrees and DeviceAtlas. It describes the high-level operation of each product, areas of difference and associated intellectual property. In simple terms, it will help readers match products to common use cases and formulate tests for evaluation.
51Degrees is a free open source code + data device detection solution permissively licenced under the Mozilla Public Licence 2 (MPL2). MPL2 is the same licence used by the Firefox web browser. 51Degrees has 3 granted patents in the US and Europe, with 3 pending. The MPL2 licence enables others to use and incorporate much of the intellectual property and data into their solutions only needing to acknowledge 51Degrees in their licences. Users of 51Degrees APIs have the option of automatically sharing usage back with 51Degrees growing the ground truth source. 51Degrees generate revenue via a subscription to weekly and daily updated databases containing a wider range of characteristics and device combinations.
DeviceAtlas is a product operated by dotMobi, a small subsidiary within Afilias Technologies Plc. Afilias’ core business of Top Level Domain (TLD) name management and DeviceAtlas appears to be a small unrelated product.
DeviceAtlas is a closed source device detection solution with proprietary licencing making it impossible for 51Degrees to assess the implementation. Over the years Afilias Technologies have published various patents which provide some insight into the techniques used. Afilias have also published publicly available blogs and documentation. Many of the patents are very narrow, however patent EP2245836B1 granted in Europe during July 2017 after 8 years of examination is the one used when summarising DeviceAtlas here. It is believed to describe the current implementation. The patent text should be referred to for the full details.
What is Device Detection?
Finding information concerning visitors to a web site so that the user experience can be optimised, targeting content to maximise revenue, or adding information to analytics is termed device detection. The type of information available is quite varied. This article focuses on three components;
- the device the visitor is using;
- the operating system on the device; and
- the browser or application.
A way to visualise this is as an ever-expanding cube with components as dimensions. Ever expanding, since as time moves on new devices come into the market and new versions of operating system and browser come into use.
Each component has a set of associated properties and values. The device’s physical screen size, when it was released and the retail price are just a small set of examples. Keeping this specification data current is a significant task and the correctness of the values is a key measure of any device detection solution. At 51Degrees an in-house editorial team is used to ensure independent collection and accuracy. No data is scraped from web sites like GSM Arena. At the time of writing over 170 new device models are added every week from vendor only data sources.
Access to this information grows revenue. The performance and design of the web site can be efficiently altered by the web server. For example; rich hover menus can be used on desktops and laptops whilst accordion menus work better on smartphone, bandwidth friendly images sent to low data speed mobiles, additional information added to analytics packages, or advertising positioned effectively for viewability and engagement.
Device detection can be used for various purposes each requiring different considerations. Some use cases may be sensitive to performance and others to eliminating false positives. There’s no one size fits all, which is why 51Degrees offers several product variants and ways of deploying.
The range and accuracy of the properties is one measure. Another is how good it is at matching the device, operating system and browser information to a particular visitor. This process relies heavily on User-Agent strings.
What is a User-Agent?
Every time a web site is visited a User-Agent is transmitted along with the request containing information used to determine the client device’s properties. Although this is not the only information used for device detection, it’s a very important part of the detection process and the focus of this article.
User-Agent header values - usually referred to as “User-Agent strings” - adhere loosely to a number of conventions but don’t obey any strict rules. Finding a way of using the information they contain to yield accurate values for component properties that is fast and accurate is a key differentiator when choosing a device detection solution.
User-Agents contain segments that can be used for identification purposes. The characters "GT-P5210" starting at cell (5,1) in the above example is the model code used by Samsung for a Galaxy Tab 3 tablet. The characters “Android 4.4.2” at cell (23,0) indicate the operating system is Android and part of a major series called Jelly Bean.
Assuming that the device detection solution has catalogued the Galaxy Tab 3 it can tell that the display is 10.1 inches with 1280x800 pixels the very first time a visitor from the device accesses a web site or an image is requested. This information can be used to optimise the layout of the web page, provide optimised images, and deliver a user interface in the most suitable form. Hover menus that would work on a laptop with a touchpad can be removed for the touchscreen tablet for example.
Seems simple, right? Well, not so simple. Let’s take another User-Agent string, this time from the Lumia 640.
Looking at this in the same light as the Galaxy Tab 3 we might assume that it’s an Android device, or possibly that it’s an iPhone or that it’s running Apple OS X. In fact none of those are true, those elements of the User-Agent string are put there deliberately so that simple minded device detection solutions don’t cause the Lumia 640 (a Microsoft device) to be treated as an “unknown” device and therefore receive a downgraded experience.
These simple minded detection solutions look for key words in the User-Agent string in a process called “browser sniffing”. Browser sniffing has a bad name, and rightly so, typically it’s slow, inaccurate and gives web site visitors a terrible experience.
Solving the Complexity
All good device detection solutions will collect lots and lots of User-Agent strings and classify them according to the various device, operating system and browser they relate to.
51Degrees is one such solution. If a User-Agent string or associated component version or device hasn’t been seen before a team of experts allocates the User-Agent string to the indicated component in the central 51Degrees database, adding a new verifiable component entry (called a component profile) if necessary. 51Degrees have been doing this since 2011, when a fresh database was started from scratch. Over 807,000 device combinations (combinations of device, operating system and browser) have been collected, and 10s of millions of unique User-Agents processed.
Collections of User-Agent strings assigned to components form a ground truth, representing some portion of all the possible User-Agent strings there are. However, no matter how many such strings are collected no one can know all of them. Even if they could all be collected they would go out of date immediately as new ones appear all the time.
Two Big Questions
At this point there are two questions.
- Retrieval - “If you’ve seen the exact User-Agent before, do you report the components accurately?”
- Prediction - “If you haven’t seen the exact User-Agent before, how accurate is your algorithm at determining the correct components, as judged by how that User-Agent is eventually added to the database?”
Compromises need to be made depending on how quickly the result is needed or the techniques used to predict the best result.
Question One - ReTRIEval
There are a number of methods for retrieving an exact match. The exact string can be stored which would be very expensive in storage and perform very slowly.
A better way is to observe that there is some degree of similarity between the strings. Reading from the left many start with “Mozilla/” and of those many follow with “5.0”. If the strings are stored in a tree, with a branch at each point where one string diverges from another then the need to store possibly many millions of copies of “Mozilla/” is reduced to a single copy. Simple.
What’s more when reading a User-Agent string to find out if it has been seen before a simple and relatively fast algorithm would be to check “is the next character one of the branches of the last character”? At the point at which the answer to that question becomes “no” there are either no more characters to examine, or a leaf of the tree has been reached. In the latter case an exact match has been found. Bingo! In the former case there isn’t and that’s where things go wrong for this type of algorithm. Whoops.
Patricia Trees and Modified Patricia Trees
The process described previously could be considered simple and obvious. Labelling a tree structure with successive characters of known strings is first described in computing literature in 1968 by Donald Morrison who named the approach PATRICIA. The more romantically minded of us would like to think that might have been the name of his wife or sweetheart, and indeed it might be, all we know, though is that it’s an acronym standing for “Practical Algorithm To Retrieve Information Coded In Alphanumeric”.
There’s a considerable amount of literature about Patricia Trees (or Tries - so named because of the letters TRIE forming part of the word “reTRIEval”) and enhancements in certain ways such as extracting information of interest without going to the leaf of the tree, or how nodes can be can merged where there are no branches. Most of these work best when the most significant information is at the left end of the string and are evaluated on a character by character basis from the start (left hand end) of the string.
DeviceAtlas Patent and Opposition
Afilias’ patent EP2245836B1 describes a Patricia Tree all of which is well known and widely deployed computer science. 51Degrees are opposing that patent precisely because it seeks to monopolise those techniques. More information on the opposition can be read here
Back when the Afilias patent was originally filed Nokia and Blackberry ruled the mobile device industry. It was a reasonable assumption with User-Agent strings to assume that the most significant information is contained at the start of the string. Eight years later this is not true.
If most User-Agent strings start with “Mozilla/5.0” it’s merely a waste of processing power to evaluate the tree for each of those characters. Worse still is what happens when processing a User-Agent string that has not been seen in that exact form before.
Say, in the Galaxy Tab User-Agent string earlier, instead of “Build/JDQ39" the User-Agent contains “Build/JDQ40” - a human would probably say “oh, just a bug fix, most likely doesn’t make any difference”. When using a strict Patricia Tree the algorithm stops and no useful answer is returned.
The Afilias patent is not prescriptive, so 51Degrees don’t know what exactly they do in this type of situation. Perhaps they fail. More likely they return what they think they do know - Galaxy Tab 3 and possibly things that they guess - through defaults or by somehow trying to skip over the inconvenient 40 instead of 39.
All of this means that their claim to give you exactly the right answer is not correct since they’re having to guess in the presence of data they have not seen before, or that they do indeed stop and don’t give you the answer they should have.
51Degrees = Innovation
51Degrees received their Pattern algorithm patent EP2871816 in 2016 less than 3 years after filing. Afilias took nearly 8 years for their device detection patent to be granted and have filed nothing relevant since 2009.
51Degrees Pattern algorithm is modern, fit for purpose and provides the kind of predictive power needed in a rapidly changing world of indefinite varieties of User-Agent strings.
Recognising that there is no such thing as perfect, 51Degrees invented Hash trie, the first device detection algorithm to work with hashes rather than character sequences to offer unrivalled performance and good predictive support. 51Degrees’ 2017 patent applications for Hash trie encapsulate many man years worth of reflection and considered research.
Only 51Degrees continue to innovate in the device detection space bringing new techniques and approaches to what is a complex computing problem.
How does 51Degrees work?
The 51Degrees designers recognised multiple algorithms are needed for device detection depending on the intended use case and offer two core algorithms.
- Hash Trie - extremely fast due to use of hashes (explanation follows)
- Pattern - uses character sequences and slightly slower than hash trie
Both algorithms only consider the relevant parts of the User-Agent and process the User-Agent string in ways very different to a Patricia Tree. They offer some insight into the prediction methods used to provide the result.
51Degrees Hash Trie
A hash function maps data of any size to a fixed size, typically an integer number. This integer number is called a hash value. If a simple addition function were used on the numeric values of the characters forming segments a series of hash values is returned.
A hash function with more complexity than simple addition is needed. A good hash function will ensure there are no duplicate values across a range of User-Agent string segments. 51Degrees currently use the Rabin-Karp hash algorithm checking for segments where different input generates the same result.
Boundary characters cannot be used to identify segments. Instead hash values are stored using a range of characters where the number of characters considered can vary for each segment.
|Mozilla/5.0 (Linux; Android 6.0.1; GT||77+111+122+105+108+108+97+47+53+46+48+32+40+76+105+110+117+120
51Degrees Hash Trie stores the hash values only and NOT the sequence of characters that created them. Further only the hash values and their character positions within the User-Agent strings which are needed to identify the device are stored. Unneeded portions of the User-Agent strings are discarded.
The hash values are arranged in a type of tree termed a directed acyclic graph. There is no correlation between the level of the tree and the position of the hash in the User-Agent string. The Egyptian Ptolemaic dynasty family tree is an example of such a data structure.
The properties associated with the User-Agent string are known when a leaf is identified. The number of iterations needed to reach a leaf or edge have the most significant impact on the performance of the implementation. At the time of writing an average of six iterations are needed in live operation.
The segments are evaluated in an order that achieves the largest possible reduction in outcomes per CPU operation. This feature is one of the techniques used to achieve extremely fast performance and differs extensively to the Patricia Tree which is constrained to left to right, or right to left operation.
Because the hash values are always the same size irrespective of the number of characters used in the segment which generated them a phenomenal space saving is achieved. At the time of writing a full set of 201 properties with 800,000 device combinations would require 140MB of data uncompressed, compared to at least 3 to 4 times that with a Patricia Tree trained with similar data.
Hash Trie Answers the Two Big Questions
Hash trie is extremely good at rapidly identifying exact User-Agent strings and has configuration options to strengthen identification of non-exact strings. It runs at over 1.1 million device detections per second on a $30 Raspberry Pi.
For a fuller explanation of Hash Trie and to start an evaluation see this documentation.
In Pattern each of the segments of importance within the User-Agent string map to a unique segment Id. All the possible segments of relevance are included in the database. The following example shows such a mapping for a User-Agent string.
Patricia Tree data structures are used to store segment characters. They are the most efficient computing method for storing the required characters and quickly determining the unique segment Id. This is a very different use of Patricia Tree to the one described previously.
A sequence of unique segment Ids forms a signature which in turn relates to the components. When all the expected unique Ids are found the algorithm is very quick.
When a precise match is not found further techniques can be used. The signatures associated with the segments found form a set that can be used to find one with the closest matching segments. Such techniques might include comparing version numbers numerically and not by character value, finding segments with a small offset to the expected position, accepting some segments are missing, or comparing character values. This additional optional processing takes more time and can lead to a wider range of detection times over random User-Agent strings.
Pattern also provides metrics associated with the result found such as the character differences between the provided User-Agent string and the values returned. Such information can be used to find out when 51Degrees is less confident in the result.
Pattern Answers the Two Big Questions
Pattern is quick at finding exact results and provides more options than Hash Trie to predict the best result.
A complete explanation of Pattern is available in our documentation.
It’s nonsensical to say, as DeviceAtlas does, there doesn’t have to be a compromise between accuracy and speed, since it’s inherent in the nature of the task of Device Detection that such compromises have to be made. How much control can be exerted over the compromises is extremely important, and at 51Degrees rather than pretending that one size can fit all control is in the hands of the implementer.
When only the highest performance is needed 51Degrees supports this. When more processing power is available and device detection can take a little longer then 51Degrees can do that too. Perhaps when processing offline log files, and speed is not of the essence but eliminating suspect results is. Or where data files can’t be refreshed frequently and better predictive strength is needed.
In fact not allowing the choice makes no sense at all. Sports cars and tractors are made of very similar things, an engine, 4 wheels and so on, but are used for very different purposes. A muddy field would not be ploughed with a sports car. Only 51Degrees recognises this and provides you with the choice.
Anyone with access to DeviceAtlas can evaluate the two solutions quickly using a comparison program in Java published by 51Degrees on GitHub. Get the source code here.