Time to Clean Up The Web

The web used to be a beautiful place. Those rolling pastures of fast loading simple HTML, the fresh smell of web 2.0 gradients, the spotless fields of minimal libraries. Sure, our navigation wasn’t as advanced as today, but the waters were clear and easy to surf.

Over the years, however, an ecological disaster of petaproportions has been unfolding. I’m talking about the wave of toxic trackers, social media plugin spills, poisonous popups, chemical cookie banners, and all the other garbage that has slowly turned the clear web waters murky. With January being the time for new resolutions and new ideas, let’s make this the year we finally start to take action, clean up, and make the web usable again.

There have been plenty of cool trends as well of course, but some have been one step forward and two steps back. Sure we got rid of some pop-up window ad dialogs, but only to find them replaced with exit intent popups, mouse cursor heatmap trackers and being tracked on every page by FacedIn and friends. It’s making the web unreadable and we should fix it.

Let’s look at some examples of web pollution. Some are so hilarious it looks like parody; how I wish that was true (nothing’s photoshopped here!)

Cookie bar

This is an example from the Dutch public broadcaster. It doesn’t matter what it says (yes it fills the entire screen, you need to click all buttons, and it looks like a website to configure the internet, doesn’t it?). Our Crapbar Rating ™: 4/5.

This is the plastic bag of web pollution. It’s simply everywhere. While it has done nothing to stop data collection (obviously cookies are just one of the many tools for tracking), it sure did make it more annoying. Some types of cookies don’t even require a pop-up (a privacy policy in the footer is often fine), and regulators certainly didn’t specify a minimum in obnoxiousness. That doesn’t seem to stop many websites from showing a humongous crapbar covering the whole page. It’s even more pointless because many sites load the cookies anyway and the crapbar just informs you they’re doing so, while the whole point was asking consent.

GDPR

Death by crapbar

Like the cookie crapbar, its friend the GDPR crapbar pops up in most places these days. Don’t get me wrong, the idea behind GDPR is great: respect people’s privacy. In practice, however, that often means don’t change anything but just show even larger and more obnoxious dialogs. I also like how they give you all these options of trackers and ads to “opt-in” to. Who in the history of the internets has ever voluntarily clicked on any of those? If you know the answer already, just don’t ask. Ironically all these pop-ups have also made browsing in Incognito mode or Firefox containers, handy tools to prevent tracking sessions, completely unworkable, with more tracking as a result.

The Cookie + GDPR double-whammy

This GDPR crapbar is so large, that the cookie crapbar on top hides the underlying GDPR consent button, which you would have to click to continue (but can’t). Good Times. Crapbar score: 5/5.

Visual noise

Pardon the interruption, 3 crapbars per paragraph.

A polluting trend that can’t be recycled quickly enough as far as I’m concerned, is adding all kinds of sticky bars and things that pop-up while reading and scrolling. In the example above there’s not a whole lot of room left for the actual text between the slats. You might think I’m vertically challenged and should just get a larger screen, but people also read on laptops and mobile devices.

I also don’t get the trend of having to override selection behavior. Yes, I’m a selection reader, I select stuff while reading. And no, just like the rest of humanity, I have never ever clicked on the “Tweet this selection” button.

Another related web pollution trend is the exit intent popup dialog. There you are, casually surfing an online store for some new running gear, scrolling thr.. and boom, intent crapbar. Close tab. Or throw mobile, where it’s even worse.

Don’t miss out

I could go on and on. There are many more examples (“install our app to continue!”, the “ding ding, our chat is offline” popup crapbar), but I think you get the idea!

Trackers

This one is not really visible, except maybe from the very slowly loading progress bars, especially on mobile devices. Let’s call it the noise pollution of the internet. There are even startups now to help track all the trackers you’ve installed. Clearly we need more trackers. Good thing you have that huge cookie banner so it’s all good.

I’m not sure how people get to these designs. Perhaps it starts with a brainstorming meeting where everyone shouts out their favorite analytics, A/B testing, or social media platform, and then they compromise by including all of them? Is it because everyone else includes them? Because Facebook needs more data? There surely must be a trade-off where having more trackers installed causes less conversion than the additional data could help you optimize for. My bet is this tipping point is closer to zero and definitely lower than 161 requests (!), like the example from a tech news site below.

A/B test yourself out of this one

As an industry, let’s just stop this madness! I don’t think anyone seriously looks at this and thinks: this makes things better for users, good to go, ship it. So let’s clean up the web in 2019 and make it fast and readable again. Then, hopefully someday, when our grandchildren surf the clear azure (/self-hosted) waters of the internet, the only proof of internet pollution that remains is the stories they read from the fast-loading websites that still tell about it. Without crapbars.

GDPR for Cloud Software Providers

Introduction

In 2016 the European Union created the General Data Protection Regulation (GDPR), and a few months from now, on May 25, 2018 to be precise, it will go into full effect. This regulation has sweeping implications for any EU business that deals with personal data and for any business outside of the EU that stores or processes data of EU citizens. Did you suddenly get emails from webapps that are going to delete your account after years of non-use? That’s right, that’s GDPR at work.

As a cloud intranet software provider the GDPR regulation affects everything we do, so we knew we had to start our preparations for compliance way in advance of the May 2018 deadline. In the past year we’ve been very busy and in this document we describe the measures and precautions we’ve had to take, and we reflect on some of the lessons we’ve learned.

We’ve created this document in the hope that it is of value to others that are grappling with the implications of GDPR. We found that it is easy to overlook complete areas that are affected by GDPR. And we got it relatively easy: we are a 100% digital cloud-based company. We don’t have filing cabinets filled with documents, or old computers stuck in dusty closets. The only thing we had to worry about is the software we wrote ourselves. And yet, becoming GDPR compliant was still a ton of work. So if we had to go through all this trouble as a small company with only a couple of products, we cannot imagine how big the impact of GDPR will be for larger companies that have many employees and products that go back decades. Suffice to say, GDPR is no joke.

Unlike most of our blogposts, this one is going to be a little more technical. With that said, let’s dive in!

Data Storage

The first and most critical thing to figure out is what kinds of data we store, and in which systems. We’ve been in business for 10 years which means that we have accumulated a lot of cruft. Old services, like backup systems, that have been replaced by newer services, old database tables, old diagnostics data, and so on. So first we had to make an exhaustive list of all places where data is stored, and then we had to figure out an appropriate GDPR strategy for each one. Following the YAGNI mantra, deletion was our solution of choice.

Our primary product is a drag&drop intranet SaaS webapp called Papyrs. Users can create pages, add file attachments, image galleries, comments, forms, and so on. Because we respect the privacy of our customers we don’t know if any of the data stored on our service by our customers is sensitive data protected by GDPR. Consequently we take the most conservative assumption: everything stored by our customers is critically confidential data, or PII (personally identifiable information) in GDPR parlance. This means that when a customer closes their account all their data has to be permanently incinerated.

That’s easier said than done. Papyrs has several types of data. We’ll briefly cover each type.

File uploads

We don’t use a storage service like Amazon S3 so we don’t have to worry about an orphaned bucket floating around somewhere. All files customers upload to Papyrs end up in the customer’s directory on our own CDN (mirrored using rsync). The metadata on each file (e.g. permissions, statistics, ownership) is stored in the database: every file has a corresponding database record. The database is transactional, but the filesystem is not. When an upload fails the database transaction is rolled back, but a partially uploaded or broken file may remain. In an ideal world the database and the filesystem would be guaranteed to be in sync, but we have to make do with the abstractions we have. Partial uploads still get deleted eventually and they cannot lead to errors visible to the end user, but we still have to be mindful of the subtleties of dealing with the filesystem.

Papyrs dynamically resizes images as needed (to improve load times and to save bandwidth for mobile devices), which means that for every image uploaded to Papyrs there can be many cached thumbnails. The story here is similar to that of regular file uploads, except that we have to be extra careful to make sure all the right thumbnails get deleted.

The lesson here is to make sure you keep track of stray files, and that you don’t delete a row from a database until you’re absolutely sure all the corresponding files on the filesystem are gone. Otherwise it may be too late to identify which files have to be deleted!
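To make that ordering concrete, here is a minimal Python sketch. It is not how Papyrs actually does it: the `uploads` table, its `path` column, and the thumbnail naming convention are all hypothetical and only serve to illustrate "files first, row last".

```python
import sqlite3
from pathlib import Path


def delete_upload(conn: sqlite3.Connection, file_id: int) -> bool:
    """Delete an upload's files first, and its database row only afterwards.

    Returns True once both the files and the row are gone, False if there
    was no such row to begin with.
    """
    row = conn.execute(
        "SELECT path FROM uploads WHERE id = ?", (file_id,)
    ).fetchone()
    if row is None:
        return False

    original = Path(row[0])
    # Hypothetical convention: thumbnails sit next to the original and are
    # named "<stem>.thumb-<size><suffix>".
    thumbnails = original.parent.glob(f"{original.stem}.thumb-*{original.suffix}")

    for path in [*thumbnails, original]:
        # missing_ok makes the whole operation safe to retry after a crash.
        path.unlink(missing_ok=True)

    # Only now is it safe to forget where the files were. If unlink() raised
    # above, we never get here and the row survives for a later retry.
    with conn:
        conn.execute("DELETE FROM uploads WHERE id = ?", (file_id,))
    return True
```

Because the row is the only record of where the files live, it has to be the last thing to go. If a file cannot be removed the function raises, the row stays, and the deletion can simply be retried.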


Fishermen don’t have to worry about GDPR

Log files

All systems that make up the Papyrs service keep log files that contain information about system health, performance, errors and warnings, and other diagnostic information. Although typically these log files contain nothing even remotely sensitive or confidential, some information in log files (like IP addresses) may be considered PII under GDPR. Figuring out exactly which lines in which log files relate to which user on Papyrs is difficult and exactly the kind of privacy-invading work we don’t want to do. So instead we fall back on our pessimistic strategy: we treat all data, except for the data we know for sure is innocuous, as confidential. Our solution: automatically delete (rotate) all log files older than a few weeks. In practice we rarely needed diagnostic information older than that, anyway.

Here Python’s RotatingFileHandler and TimedRotatingFileHandler classes (from the standard logging.handlers module) are a great solution. Of course we also rely on Debian’s logrotate whenever possible.
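As an illustration, a handler that starts a new file every day and throws away anything older than roughly three weeks could be set up like this. The logger name and the 21-day retention are assumptions made for the example, not the actual Papyrs configuration.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def make_logger(path: str, keep_days: int = 21) -> logging.Logger:
    """Return a logger that rotates daily and keeps at most keep_days old files."""
    logger = logging.getLogger("papyrs.example")  # hypothetical logger name
    handler = TimedRotatingFileHandler(
        path,
        when="midnight",        # roll over to a fresh file once a day
        backupCount=keep_days,  # older rotated files are deleted on rollover
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Note that with `backupCount` left at its default of 0 nothing is ever deleted, so the retention has to be set explicitly.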

But what to do with services that don’t listen to the SIGHUP signal sent by logrotate, and that don’t have any in-built way to rotate their own log files? Well, there is an easy trick:

$ tail -n 1000 logfile.log | sponge logfile.log

This truncates the file and puts the last 1000 lines back into the logfile. You can’t beat the simplicity, but if you want to cull the beginning of your log files this way you have to be mindful of one thing: while most logging systems with multiple writers use a mutex to synchronize writing to the same file, this bash hack doesn’t. So you may end up with a malformed line or two in your log file as two processes write to the same logfile simultaneously. If that’s unacceptable, you can instead create your own lazy log rotation by sticking this in a cron file:

$ cp logfile.log logfile.log.1 && echo "" > logfile.log

The copy here is deliberate: you can’t move the file. This is because the services that are writing to the logfile will just keep writing to their open file handle, blithely unaware that the file has been moved, renamed, or even deleted. The logrotate option copytruncate works like this too.
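For completeness, the same copy-then-truncate idea can be written in Python. This is only a sketch with a hypothetical helper name. Like the shell version it can lose lines written between the copy and the truncate, and it assumes the writing service opened its log in append mode.

```python
import shutil
from pathlib import Path


def copy_truncate(log_path: str) -> Path:
    """Copy the log to '<name>.1', then empty the original in place."""
    source = Path(log_path)
    backup = source.with_name(source.name + ".1")
    shutil.copy2(source, backup)     # snapshot of the current contents
    with open(source, "r+b") as fh:  # reopen the same file, don't replace it
        fh.truncate(0)               # writers keep their handle to this file
    return backup
```

Because the original file is emptied rather than replaced, a service holding an open handle keeps writing to the same file and its new lines show up in the now-empty log.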

Cache files and other temporary data

Memcached, message queues, and other systems for transient storage can easily end up containing confidential data. Luckily, we didn’t have to do anything here. Everything we store in memcached expires quickly, so we don’t have to worry about stale data residing in an in-memory store. We made that decision a long time ago for performance reasons. Too many web applications end up relying on memcached for performance, so much so that when memcached resets, the database and workers can no longer keep up with all the traffic. We didn’t want to worry about that, so our memcached data can be purged at any time. Everything in our message queue has a deterministic lifetime as well.

This is one of those cases where good architectural design choices end up paying off in unexpected ways! Always a nice surprise.

Full Text Search

We use Sphinx for full text search. Sphinx is terrific. Performance is great and the quality and ranking of the search results is superior. One problem is that we have one Sphinx database for all our user data, and a full text search database like Sphinx isn’t designed for quickly deleting a range of records. Re-creating the entire full text index isn’t an option, as that would take too long.

The only kind of update Sphinx knows is “merge & swap”. A main index and a delta index are merged together into a new index and then the new index is swapped with the old one. The delta index can contain delete indicators. Then during the merge phase those rows with delete indicators get dropped from the main index.
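The idea behind merge & swap can be shown with a toy model in plain Python. To be clear, this is not Sphinx’s API or its on-disk format: the dictionaries stand in for indexes and the `deleted` set stands in for the delete indicators.

```python
from typing import Dict, Set


def merge_and_swap(
    main: Dict[int, str], delta: Dict[int, str], deleted: Set[int]
) -> Dict[int, str]:
    """Build a new index from main + delta, dropping documents flagged as deleted."""
    # Start from the main index, minus everything carrying a delete indicator.
    merged = {
        doc_id: text for doc_id, text in main.items() if doc_id not in deleted
    }
    # Rows in the delta index override main rows with the same id.
    merged.update(
        (doc_id, text) for doc_id, text in delta.items() if doc_id not in deleted
    )
    return merged


# The "swap" is the caller replacing its reference to the old index:
index = {1: "old page", 2: "customer page", 3: "another page"}
index = merge_and_swap(index, delta={1: "edited page"}, deleted={2})
```

The old index is never modified in place. Deleted rows are simply left out of the new index, which then takes over.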

Relational Database

Deleting user data from a database is either trivial or practically impossible, depending on what it means to delete. Files on a filesystem can be securely deleted if needed. Or you can simply create an encrypted directory and throw away the key to get rid of the data securely. But when rows are deleted from a database like postgres or mysql the data isn’t guaranteed to be gone from the hard drive. The records may persist in the binary logs used for database mirroring, or the data may simply persist as junk data inside the database, even after issuing a regular VACUUM.

How far do we want to go here? Do we have to issue a VACUUM FULL (which re-creates the entire database table) to securely delete data? Does that really provide additional guarantees when data may still persist on sectors of hard drives? We don’t know. The GDPR informs us that we have to “erase” data (including copies and replications thereof), but we have no idea what it actually means in the context of software. We think deletion to the point where we cannot recover the data — even if we wanted to — qualifies as real erasure.

I’m sure the fine folks who wrote this regulation thought carefully about what it really means to erase data. They’re just sparing us the technical details so we can experience the joy of figuring it out for ourselves.

Backups

Backups and GDPR are, on a fundamental level, diametrically opposed to each other. With a good backup solution you know that even if disaster strikes no user data is lost and the service can be entirely restored. To comply with GDPR a user has the “right to be forgotten” which means their data cannot be recovered and is lost forever. So how can we satisfy these conflicting goals? Clearly we cannot just destroy our backups, but we can’t keep customer data in perpetuity either.

Our solution: have backups rotate on a strict schedule, and delete old backups, including remote and offsite backups. The only difficulty is that this adds a delay to irrevocable data deletion. If, somehow, data ends up in a backup set that should already have been deleted, then it could take another 6 months after we fix the mistake for the data to be entirely gone from all backups.
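A strict rotation schedule boils down to something like the sketch below, run on every machine that holds backups. The directory layout, the `*.tar.gz` naming, and the retention period are assumptions for the sake of the example.

```python
import time
from pathlib import Path
from typing import List, Optional


def prune_backups(
    backup_dir: str, max_age_days: int, now: Optional[float] = None
) -> List[Path]:
    """Delete backup archives older than max_age_days and return what was removed."""
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400  # seconds per day
    removed = []
    for path in sorted(Path(backup_dir).glob("*.tar.gz")):
        if path.stat().st_mtime < cutoff:  # judged by last-modified time
            path.unlink()
            removed.append(path)
    return removed
```

A call like `prune_backups("/srv/backups", max_age_days=180)` (hypothetical path) would then cap how long deleted customer data can linger. Remote and offsite copies need the same treatment, otherwise the schedule is only as strict as the most forgotten copy.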

There is no perfectly satisfactory solution here. For future products we’ll probably create backups strictly on a per-customer basis. Then we can easily backup/restore an individual customer’s data as needed, and if a customer exercises their GDPR “right to be forgotten” then we can immediately purge all their data from our systems.

Third party services

We don’t share any customer data with 3rd party services. We don’t share data with advertisers. We don’t use any software for sales lead generation. We like to build everything ourselves, and that has served us well so far. The only 3rd party service we use is Pingdom, and they don’t have access to any confidential data of ours.

Data Export

GDPR gives customers the “right to data portability”. We already had a Papyrs Backup system in place, so we didn’t have to do anything here.

PCs, Laptops and Phones

Here we just have to take some common sense precautions. There shouldn’t be any sensitive customer data on our work devices, but old devices get securely wiped just in case. We use full disk encryption on all machines (macOS FileVault and TrueCrypt), and we use iPhones because they are encrypted. If somehow an old device ends up in the trash we can be confident no data leak will happen as a consequence.

Employee records

We don’t just have GDPR obligations toward our customers, but to employees and contractors as well. They, too, have a right to know what data we have on them, and they have a right to be forgotten after their employment has ended. There’s not much to say here, except that having a central place where all data is stored makes deletion a whole lot easier.

Customer Rights

So far we’ve mostly talked about the technical implications of GDPR. GDPR also specifies that our customers have a number of concrete rights: We have an obligation to tell customers what data we have on them and how we use it. We have an obligation to notify our customers when data leaks occur. Customers have a right to tell us not to process their data.

We can imagine these rights lead to a lot of difficulty for many businesses, but not for us. Thankfully we don’t need to collect any personal data from our customers except for a name and address for billing purposes, so we don’t. Our business model is really straightforward and we don’t need to engage in any shenanigans. Customers are free to come and go as they please, and they can take their data with them.


Appreciation of the GDPR landscape

Conclusion

The main thing we learned in the past year while preparing for GDPR is that it’s much easier to design a new service with GDPR in mind than to retrofit GDPR compliance onto systems that were created many years ago. We are really lucky that we already stored 90% of our data in a structured way where every piece of data had a known owner (so we can easily figure out what to delete). However, we had to make sure that all data on our servers was accounted for, not 90%. If you have a pile of mystery data somewhere — even when nobody has access to it — that’s still a GDPR liability. As the saying goes, when you’re 90% done you still have 90% of the work ahead of you.

We hear that companies are only now starting with their GDPR preparations, but time is running out. GDPR compliance isn’t something that can be outsourced or something a small task force can do by itself, because GDPR affects customer support, sales, operations, marketing, HR, and every other part of your organization. GDPR affects all data, and data is everywhere.

If your organization has to be GDPR-ready in a few months and you haven’t started yet, I wish you luck :)

Thanks for reading!

Email as a platform

Google just announced Actions in the Inbox.

By adding some markup (Microdata or JSON-LD) to the HTML emails you send out, Gmail can now display quick action buttons next to your email.
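To give an idea of what such markup looks like, here is a small Python sketch that builds a JSON-LD snippet for a “view” button. The property names follow the schema.org vocabulary as we understand it. The helper name and example values are made up, and the exact fields Gmail accepts should be checked against Google’s own documentation.

```python
import json


def action_markup(name: str, target_url: str, description: str) -> str:
    """Return a JSON-LD <script> block to embed in the HTML body of an email."""
    data = {
        "@context": "http://schema.org",
        "@type": "EmailMessage",
        "description": description,
        "potentialAction": {
            "@type": "ViewAction",
            "name": name,          # label shown on the button
            "target": target_url,  # where the button takes the reader
        },
    }
    # Escape "</" so a value can never close the <script> tag early;
    # "\/" is a valid JSON escape for "/".
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


snippet = action_markup(
    name="View invoice",
    target_url="https://example.com/invoices/123",  # placeholder URL
    description="Your invoice is ready",
)
```

Mail clients that don’t understand the markup ignore the script block, so the email still reads as a normal message.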

Some might see this as simply another random extension, or a Gmail gimmick à la Gmail Labs extensions. We rather see it as part of a trend in which communication channels and devices are changed into platforms supporting an entire “ecosystem of apps”. Mobile phones used to be for calling, emails used to be for sending someone a text message (and I guess sunglasses used to be for just protecting your eyes :). Not anymore. All these things have turned into platforms with many apps running on top. All this added functionality makes it much more interesting for consumers, and more exciting for developers to hack new apps on top of. The companies behind them of course know that they can win the real battles by having the better ecosystem (would you switch to Windows Phone if it doesn’t have your favorite YouTube app, or to Hotmail if it doesn’t have your Movies Info app).

Many people would agree that email as it works today just isn’t good enough anymore. People try to use it for much more than it was originally intended. It’s no longer just a way to send people a message, it’s also a todo list, a CRM application, a help desk, a way to organize events, split bills, and much more. Because of this, the way email works desperately needs to be changed, at least extended.

One approach is to completely reinvent email as we know it, Email 2.0. We think this is a step in the wrong direction. One famous example of a very cool but failed attempt is Google Wave. It’s exciting to work on these projects, and of course developers love to completely rewrite and reinvent things, but let’s not forget email is already very popular and it’s hard to make everybody switch. It’s also not needed, which brings us to the other alternative: there are many ways email can be extended with new (open!) standards. It’s called Actions now, but it might look more like complete “email apps” in the future. We already know people want to use their email more effectively, and in fact using some clever hacks, some “apps” were already built and proven to be popular (Rapportive, for example). Plus, as a platform, it has already been used as a “human API” to build services on (like sending an email to trigger some action like creating a blog post, an invoice, or – like with a weekend project of ours – creating an online form).

One of the reasons email is so popular is that it just works, everywhere. So of course we need to deal with things like mobile views and graceful degradation to make sure that remains the case. But a player like Gmail is big enough to take the first step into allowing developers to build more apps around email, using new, open standards. And I believe that, like us, many other developers would be happy to write them, and make email much more powerful and effective than it is today.