Running a privacy-focused business means we are always looking for ways to improve user privacy whilst providing useful software. We recently made some changes to further pursue this objective. Our goal with this change was to keep providing users with unique visits & total page views without interfering with the visitor experience. Nobody wants those annoying popups on their website unless they have to have them.
So let’s dive into how we built this in a privacy-focused way:
First idea: complex hash for each user
Our first attempt was to create a complex, unique hash for the visitor, and tie the hash to each of their page views in our database. Doing this meant we could update their previous page view, establish whether the page view / site visit is unique, produce a duration and change very little in our codebase.
The data we’d put into the hash would consist of:
- Random SHA256 String (regenerated daily at midnight UTC)
- IP Address
- User Agent
- Site ID
- Day of the year
So the starting data may look something like this:
5c5fe02e320dd2af45a29d8cdf5c99ef947edea2c067cea501ee695bb46a2652 + 184.108.40.206 + Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36 + ABCDEF + 250
This would then be hashed to:
But doing this meant that we’d be able to see every single page the user visited on a certain website for the day… which would be an absolute disaster. And although it’s anonymous, we’d still be building a complete browsing history for users, which would allow “singling out”, which we would never want.
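To make the problem concrete, here is a minimal Python sketch of that first idea. It is an illustration only, not our production code: the function name `visitor_hash`, the `|` separator and the example IP address (taken from the reserved documentation range) are all assumptions.

```python
import hashlib


def visitor_hash(salt: str, ip: str, user_agent: str, site_id: str, day_of_year: int) -> str:
    """Join the five inputs listed above and return their SHA-256 hex digest."""
    # Assumption: fields are joined with "|"; the article does not specify a separator.
    raw = "|".join([salt, ip, user_agent, site_id, str(day_of_year)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Hypothetical example values.
SALT = "5c5fe02e320dd2af45a29d8cdf5c99ef947edea2c067cea501ee695bb46a2652"
UA = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)"

hash_on_home_page = visitor_hash(SALT, "203.0.113.7", UA, "ABCDEF", 250)
hash_on_about_page = visitor_hash(SALT, "203.0.113.7", UA, "ABCDEF", 250)

# The pathname is not part of the hash, so every page view by this visitor
# on this site today carries the same value and can be linked together.
same_visitor_linked = hash_on_home_page == hash_on_about_page
```

Because the hash is identical for every request the visitor makes that day, storing it next to each page view is exactly what would let someone reassemble a per-visitor browsing history.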
Second idea & solution: we keep multiple, unrelated complex hashes
Remember, one of the things Fathom Analytics needs to accomplish is to be able to track visits to the site and to individual pages. Tracking page views alone, without visits, is completely useless and means that you won’t have insight into how many people visit your site / pages each day. We can do better than that, and here’s how:
- For tracking unique page views: on each request, we generate a hash similar to the one discussed earlier in the article, but we add the Pathname in. We’ll call this hash the PageRequestSignature. We then check this PageRequestSignature against a table of signature hashes from today. If it exists, the current request is marked as “not unique”; if it doesn’t exist, we insert it and mark the current request as “unique”.
- For tracking unique site views: on each request, we generate a hash similar to the PageRequestSignature, except we don’t include the Pathname, since we’re tracking whether the visitor is new to the site, not to the specific page. We also add a new array item to make sure it’s different from the signature detailed below.
- For tracking previous requests (and marking them as finished, not a bounce, and recording the duration between requests): this is the most challenging bit, because we need to tie the user’s previous pageview to them without keeping a history of their page views. To accomplish this, we clear out the previous view’s user_signature field in our pageviews table to make way for the new entry (which keeps the user_signature). This means that, at any one time, we only have 1 pageview per user signature, so we can’t put together an anonymous, individual user’s browsing habits. Remember, the user signature is completely anonymous, and there’s only ever 1 page view tied to it.
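The three steps above can be sketched in Python as follows. This is a simplified, in-memory illustration under stated assumptions, not our actual implementation: the names `Tracker`, `Request` and `Pageview`, the `"page"` / `"site"` / `"user"` labels used to keep the three hashes distinct, and the use of Python sets in place of database tables are all hypothetical. Duration and bounce handling are left out.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Set


def _sha256(*parts: str) -> str:
    # Assumption: parts are joined with "|" before hashing.
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


@dataclass
class Request:
    ip: str
    user_agent: str
    site_id: str
    pathname: str
    day_of_year: int


@dataclass
class Pageview:
    pathname: str
    is_unique_page: bool
    is_unique_site: bool
    user_signature: Optional[str]  # cleared once a newer pageview arrives
    finished: bool = False


class Tracker:
    def __init__(self, salt: str) -> None:
        self.salt = salt
        self.page_signatures: Set[str] = set()  # stands in for today's signature table
        self.site_signatures: Set[str] = set()
        self.pageviews: List[Pageview] = []     # stands in for the pageviews table

    def track(self, req: Request) -> Pageview:
        base = (self.salt, req.ip, req.user_agent, req.site_id, str(req.day_of_year))
        # The extra label plays the role of the "new array item" that keeps
        # each hash different from the others.
        page_sig = _sha256("page", *base, req.pathname)
        site_sig = _sha256("site", *base)
        user_sig = _sha256("user", *base)

        is_unique_page = page_sig not in self.page_signatures
        self.page_signatures.add(page_sig)
        is_unique_site = site_sig not in self.site_signatures
        self.site_signatures.add(site_sig)

        # Mark the previous pageview as finished and clear its user_signature,
        # so only the newest pageview is ever tied to this signature.
        for previous in self.pageviews:
            if previous.user_signature == user_sig:
                previous.finished = True
                previous.user_signature = None

        current = Pageview(req.pathname, is_unique_page, is_unique_site, user_sig)
        self.pageviews.append(current)
        return current


tracker = Tracker(salt="example-daily-salt")
first = tracker.track(Request("203.0.113.7", "ExampleBrowser/1.0", "ABCDEF", "/", 250))
second = tracker.track(Request("203.0.113.7", "ExampleBrowser/1.0", "ABCDEF", "/about", 250))
```

After the second call, `first.user_signature` has been cleared, so the two stored pageviews can no longer be linked to the same visitor, while the unique page and unique site flags have already been recorded.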
The hashes we generate are impossible for us to “de-hash” (we’ll explain later in this post). It’s important to note that hashing is not the same as encryption. With encryption, you have the ability to decrypt, whereas hashing is one-way.
Pageviews are also processed & removed from our temporary table as soon as they’re considered to be finished (within 30 minutes or when the user has visited another page). In addition, we recycle our SHA256 salt string every day at midnight UTC. This isn’t strictly needed, but it adds complexity against future computing power & rainbow tables. Additionally, we perform all of this hashing on our collection endpoint before the data is put into our Redis queue. We do this because Redis is a data store, and we don’t want to be storing plaintext IP addresses, user agents, etc. in there.
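A minimal Python sketch of the daily salt rotation and of hashing before anything is queued might look like the following. It is an assumption-laden illustration rather than our real code: `DailySalt`, `build_queue_payload` and the payload field names are hypothetical, and the Redis queue itself is not shown.

```python
import hashlib
import secrets
from datetime import date, datetime, timezone
from typing import Dict, Optional


class DailySalt:
    """Keeps one random salt per UTC day and replaces it when the date changes."""

    def __init__(self) -> None:
        self._day: Optional[date] = None
        self._salt = ""

    def current(self, now: Optional[datetime] = None) -> str:
        now = now or datetime.now(timezone.utc)
        today = now.astimezone(timezone.utc).date()
        if today != self._day:
            # First request after midnight UTC: discard the old salt for good.
            self._day = today
            self._salt = hashlib.sha256(secrets.token_bytes(32)).hexdigest()
        return self._salt


def build_queue_payload(salt: str, ip: str, user_agent: str,
                        site_id: str, pathname: str) -> Dict[str, str]:
    """Hash on the collection endpoint so the queue never sees the IP or user agent."""
    raw = "|".join([salt, ip, user_agent, site_id])  # "|" separator is an assumption
    return {
        "site_id": site_id,
        "pathname": pathname,
        "user_signature": hashlib.sha256(raw.encode("utf-8")).hexdigest(),
    }


salts = DailySalt()
before_midnight = salts.current(datetime(2019, 7, 21, 23, 59, tzinfo=timezone.utc))
after_midnight = salts.current(datetime(2019, 7, 22, 0, 0, tzinfo=timezone.utc))
payload = build_queue_payload(after_midnight, "203.0.113.7", "ExampleBrowser/1.0", "ABCDEF", "/")
```

Once the old salt is discarded, yesterday’s signatures cannot be recomputed from an IP address and user agent, even by us.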
The final problem to address was whether we could take a user signature (e.g. f2d9be5d11064121fd5ec822be4a0453ba01bc303fe6f9c1d7c4f10e52655ae1) and search over our query logs to find each of that anonymous user’s pageviews on a website. The good news is that we are secure on this front because we only log data definition statements (e.g. CREATE, ALTER and DROP) in our query log, not inserts / updates etc.
Anonymization
The purpose of the hashes is to make sure that there is no realistic way for us to ever identify an individual from the data we track. This is paramount in Fathom Analytics being truly “privacy-focused” and is what, frankly, makes us stand out in this space for website analytics. This is also important for GDPR compliance / being exempt from GDPR.
Here are the key points from GDPR's Recital 26 and our comments:

| Recital 26 | Our comment |
| --- | --- |
| To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. | A natural person / anonymous user can’t be singled out due to everything being hashed. The only piece that can be connected is when we update the user’s previous pageview, but that is all done in a single database transaction, and we don’t keep query logs. |
| To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. | Brute forcing a 256 bit hash would cost 10^44 times the Gross World Product (GWP). 2019 GWP is US$88.08 trillion ($88,080,000,000,000) so we're at least a few dollars short of brute forcing a 256 bit hash. |
| The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. | We have rendered the data anonymous to the point where we could not identify a natural person from the hash. |
| This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes. | It's possible that GDPR does not apply to Fathom since data is made completely anonymous. Even if GDPR did still apply, we reiterate the stance that there is legitimate business interest to understand how your website is performing. |
There are other analytics platforms that are already cookie-free, but they don’t provide their users with total visits, only pageviews. Anyone who needs website analytics to make money from their online business could tell you why total visits is such an important metric, and it is one we’ve worked hard to deliver in a privacy-focused manner with our Fathom Analytics software.
It’s important to note that these changes only apply to our hosted version, and the changes will arrive in our community edition later this year when we open-source our new codebase (we cannot wait!).
We are incredibly open to any ideas, comments or concerns. This is a big step up from what we had but there’s always room for improvement.
Published on July 21, 2019. Written by Jack Ellis and Paul Jarvis.