"Move fast and break things" was updated to add "...with stable infra"
How about not breaking things?
- Lots of time spent on end-to-end testing in order to release with confidence
- Meta doesn't have anyone dedicated to testing, a cultural choice
- They still didn't run ALL the tests.
- Flakiness makes things worse
- The number of tests required grows polynomially with the number of repos.
- TestPilot is a universal test runner for tests regardless of language. (API access)
- Results exported to Hive for storage and analysis
- 17,000 results per second on the busiest day
Most of your tests don't matter, but you don't know which ones do.
- Run only the tests related to the change you made: follow reverse dependencies; test commits in batches (this did not work and was abandoned within a year because there were too many points of failure); eliminate low-value tests, i.e. exclude tests that never catch any bugs (predictive test selection).
- Flaky tests are not always your fault: flakiness can come from the code or be inherent in end-to-end tests for distributed systems, and either way it undermines confidence. About half of the tests for the Facebook app were disabled due to flakiness. Flakiness could stop a release candidate and caused engineers to go through hundreds of stack traces a week. It turned out that most of the flakiness was due to a single race condition related to the app loading things remotely. Flakiness is directly correlated with product quality.
- What to do with bugs that slip through the cracks? Run high-value tests in CI, then run all tests continuously in non-blocking runs and notify the relevant teams as soon as possible. Then use automated bisection to root-cause failures.
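
The automated bisect step can be sketched as a binary search over the commit history. This is only an illustration, not Meta's tooling: `first_bad_commit` and the `test_passes` callback are hypothetical names, and the sketch assumes the test is deterministic (passes on every commit before the regression and fails on every commit from it onward). A flaky test breaks that assumption and can send the search to the wrong commit.

```python
from typing import Callable, Optional, Sequence


def first_bad_commit(
    commits: Sequence[str],
    test_passes: Callable[[str], bool],
) -> Optional[str]:
    """Return the earliest commit at which the test fails, or None.

    Assumes `commits` is ordered oldest-first and that the test result
    is monotonic: passing up to some commit, failing from then on.
    `test_passes` is a hypothetical hook that checks out a commit and
    runs the failing test against it.
    """
    if not commits or test_passes(commits[-1]):
        return None  # nothing to bisect: the newest commit is healthy

    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is known to fail
    while lo < hi:
        mid = (lo + hi) // 2
        if test_passes(commits[mid]):
            lo = mid + 1  # regression landed after mid
        else:
            hi = mid  # regression landed at mid or earlier
    return commits[lo]
```

For n commits this runs the test roughly log2(n) times instead of n, which is what makes it practical to merge first and root-cause afterwards.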
Is your test naughty or nice?
- probabilistic flakiness score
- difficult to tell whether a test is broken or flaky
- broken tests should not block new changes
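
One way to picture a probabilistic flakiness score is as an estimate of how likely a test is to fail on code that is believed to be good. The sketch below is not the model described in the talk. It is a simple smoothed failure rate (the posterior mean under a uniform Beta(1, 1) prior), and the names `RunHistory`, `flakiness_score`, `classify`, and all thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunHistory:
    """Pass/fail counts for one test on revisions believed to be good."""
    passes: int
    failures: int


def flakiness_score(history: RunHistory) -> float:
    """Estimated probability that the test fails on good code.

    Uses add-one smoothing (Beta(1, 1) prior), so a test with few runs
    is pulled towards 0.5 rather than reported as exactly 0 or 1.
    """
    runs = history.passes + history.failures
    return (history.failures + 1) / (runs + 2)


def classify(
    history: RunHistory,
    min_runs: int = 10,         # hypothetical: below this, do not judge
    flaky_above: float = 0.05,  # hypothetical threshold
    broken_above: float = 0.95, # hypothetical threshold
) -> str:
    """Label a test as 'unknown', 'healthy', 'flaky', or 'broken'."""
    if history.passes + history.failures < min_runs:
        return "unknown"
    score = flakiness_score(history)
    if score >= broken_above:
        return "broken"
    if score >= flaky_above:
        return "flaky"
    return "healthy"
```

A test that fails nearly every time is treated as broken rather than flaky, which matches the distinction in the notes: the two cases call for different responses, and neither should block unrelated changes.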
Shipping with confidence without sacrificing speed?
- batch commits, but this is error-prone
- limit tests you run
- merge optimistically and deal with regressions
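
Limiting the tests you run by following reverse dependencies can be sketched as a graph walk: start from the targets a change touched and collect every test that transitively depends on them. This is a minimal sketch under assumptions: the `deps` mapping, the target names, and `select_tests` are hypothetical, and a real build system would supply the dependency graph.

```python
from collections import defaultdict, deque
from typing import Dict, Set


def select_tests(
    deps: Dict[str, Set[str]],
    changed: Set[str],
    tests: Set[str],
) -> Set[str]:
    """Return the tests that transitively depend on any changed target.

    `deps` maps each target to the targets it depends on directly.
    """
    # Invert the graph: for each target, who depends on it?
    rdeps: Dict[str, Set[str]] = defaultdict(set)
    for target, needs in deps.items():
        for dep in needs:
            rdeps[dep].add(target)

    # Breadth-first walk over reverse dependencies from the changed targets.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for dependent in rdeps.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)

    return seen & tests
```

With a hypothetical graph where `test_app` depends on `app`, and `app` depends on `lib_a` and `lib_b`, a change to `lib_b` selects `test_app` and any direct tests of `lib_b`, but skips tests that only reach `lib_a`.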