1. 17 Feb, 2021 1 commit
    • Fix bug with major upgrade after clone (#557) · 7ac1688b
      Alexander Kukushkin authored
      While working on the code that removes bg_mon from shared_preload_libraries before executing pg_upgrade, a small bug was introduced: shared_preload_libraries for the old version was overwritten by the value taken from the new version. As a result, the old cluster failed to start due to missing (non-existent) libraries and the upgrade failed.
      
      This commit fixes the wrong behavior and improves the tests to catch similar issues in the future.
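      The fix can be sketched as follows (a minimal illustration, not the actual Spilo code; all names are made up):

```python
# Illustrative sketch of the fix, not the actual Spilo patch: before
# running pg_upgrade, bg_mon has to be stripped from the
# shared_preload_libraries of BOTH clusters, each starting from its own
# configured value. The bug overwrote the old cluster's value with the
# one computed for the new cluster, so the old cluster failed to start.
def strip_bg_mon(shared_preload_libraries):
    """Remove bg_mon from a comma-separated shared_preload_libraries value."""
    libs = [lib.strip() for lib in shared_preload_libraries.split(',') if lib.strip()]
    return ','.join(lib for lib in libs if lib != 'bg_mon')

old_value = 'bg_mon,pg_stat_statements,set_user'     # old cluster config (example)
new_value = 'bg_mon,pg_stat_statements,timescaledb'  # new cluster config (example)

# Correct behavior: each cluster keeps its own filtered value
old_spl = strip_bg_mon(old_value)  # 'pg_stat_statements,set_user'
new_spl = strip_bg_mon(new_value)  # 'pg_stat_statements,timescaledb'
```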
  2. 16 Feb, 2021 2 commits
    • Various improvements (#553) · 849313c8
      Alexander Kukushkin authored
      1. Fix in-place major upgrade failing due to the amcheck_next extension
      2. Make sure bg_mon is not in the shared_preload_libraries during pg_upgrade.
      3. Bump bg_mon commitid
      4. Run renice as root (when possible)
      5. Change self-signed certificate domain from dummy.org to example.org
      6. Bump Timescaledb version to 2.0.1
      7. Silence annoying NOTICE messages about deprecated set_user GUCs
      8. Fully revert the wal-e [commit](https://github.com/wal-e/wal-e/commit/485d834a18c9b0d97115d95f89e16bdc564e9a18) that caused performance issues with S3
      9. Analyze all databases after promote
      10. Patroni 2.0.1
      11. Last, but not least - new PostgreSQL minor releases
    • Reverted uid/gid detection to pwd rather than home directory ownership (see #551) (#555) · 581a73c7
      Django authored
      Fixes an issue (see #551) with Spilo running in OpenShift when the UID is a random number picked by OpenShift. This PR reverts the method of determining what uid/gid to use back to checking the `/etc/passwd` file rather than the ownership of the `PGHOME` directory.
      
      Verification that it works:
      * Tests all pass
      * `docker build -t spilo . && docker run --user=1500080000:0 spilo` is able to start the database with no issues
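      The reverted approach can be sketched like this (hypothetical names, assuming a `postgres` entry exists in `/etc/passwd`; not the actual Spilo script):

```python
# Sketch of uid/gid resolution via the pwd database (/etc/passwd) instead
# of stat()ing the PGHOME directory. Under OpenShift the container runs
# with a random high UID that owns nothing predictable, so directory
# ownership is meaningless while the passwd entry stays stable.
import os
import pwd

def resolve_uid_gid(username='postgres'):
    try:
        entry = pwd.getpwnam(username)  # look up the passwd database
        return entry.pw_uid, entry.pw_gid
    except KeyError:
        # no passwd entry at all: fall back to the current effective ids
        return os.geteuid(), os.getegid()
```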
  3. 04 Feb, 2021 1 commit
    • Bump the Wal-G version to 0.2.19 and add WALG SSH support (#549) · d0b763a6
      D. Spindel authored
      Bump the Wal-G version to 0.2.19 and add variables for the WALG_SSH_PREFIX for backups
      
      
      SSH_PASSWORD is not added, as password authentication is generally not recommended.
      
      The SSH transport works over sftp://, meaning that the server side can
      be locked down completely in sshd_config:
      
          Match User spilo
              ChrootDirectory /srv/backup/spilo
              ForceCommand internal-sftp
              AuthenticationMethods publickey
              PermitTunnel no
              AllowAgentForwarding no
              AllowTcpForwarding no
              X11Forwarding no
      
      The matching setup in the Spilo container could be:
      
          SSH_PRIVATE_KEY_PATH=/etc/patroni/ssh_key
          SSH_USER=patroni
          WALG_SSH_PREFIX=sftp://backup.example.com/spilo
      
      Fixes: zalando/spilo#548
  4. 27 Jan, 2021 2 commits
    • Run CHECKPOINT on replicas in parallel with pg_upgrade (#545) · 49b31b40
      Alexander Kukushkin authored
      Major upgrade with replica rsync requires all replicas to be up to date with the primary and shut down cleanly. It is possible to make the shutdown process faster by executing CHECKPOINT immediately before running `pg_ctl stop -m fast`.
      
      Before this commit, the CHECKPOINT and shutdown on replicas were performed only after the pg_upgrade was successfully executed.
      
      The time for CHECKPOINT on replicas, and therefore the shutdown time, is significant and comparable with the time required to execute pg_upgrade.
      By running these processes (pg_upgrade and CHECKPOINT on replicas) in parallel we reduce the downtime required for pg_upgrade + rsync.
      
      The old diagram:
      ```
                                      +-> r1 CHECKPOINT +   +-> r1 rsync +
      SHUTDOWN PRIMARY -> pg_upgrade -|-> r2 CHECKPOINT |->-|-> r2 rsync |->
                                      +-> r3 CHECKPOINT +   +-> r3 rsync +
      ```
      
      The new diagram:
      ```
                        +-> pg_upgrade    +
                        |-> r1 CHECKPOINT |   +-> r1 rsync +
      SHUTDOWN PRIMARY -|-> r2 CHECKPOINT |->-|-> r2 rsync |->
                        +-> r3 CHECKPOINT +   +-> r3 rsync +
      
      ```
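      The new scheduling could be sketched like this (hypothetical helper callables, not the actual implementation):

```python
# Sketch of the parallelization: once the primary is shut down, start
# pg_upgrade and the replica CHECKPOINTs concurrently instead of waiting
# for pg_upgrade to finish first. run_pg_upgrade and checkpoint_replica
# are placeholders for the real operations.
from concurrent.futures import ThreadPoolExecutor

def upgrade_in_parallel(run_pg_upgrade, checkpoint_replica, replicas):
    with ThreadPoolExecutor(max_workers=len(replicas) + 1) as pool:
        upgrade = pool.submit(run_pg_upgrade)
        checkpoints = [pool.submit(checkpoint_replica, r) for r in replicas]
        ok = upgrade.result()      # outcome of pg_upgrade on the primary
        for cp in checkpoints:
            cp.result()            # surface any replica-side errors
    return ok
```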
      
      In addition, this fixes a corner case with upgrading a single-node cluster.
      
      Also, `registry.opensource.zalan.do/acid/spilo-cdp-13` and `registry.opensource.zalan.do/acid/spilo-13` are now based on `registry.opensource.zalan.do/library/ubuntu-18.04`. In case anyone is concerned, the latter is mostly a mirror of ubuntu:18.04 from Docker Hub with a couple of layers on top: one updates packages and another adds a zalando-marker file containing some "random" string. The FROM image is overridden in delivery.yaml; if the image is built without build args supplied, the default ubuntu:18.04 is used.
    • Set AWS_REGION whenever USE_WALG_BACKUP is set to true (#546) · 246c3fe9
      François Van Ingelgom authored
      Since wal-g supports having both AWS_ENDPOINT (or WALE_S3_ENDPOINT) and AWS_REGION, and the region cannot be determined from the endpoint in the case of an on-premise deployment, this PR allows setting AWS_REGION even if AWS_ENDPOINT (or WALE_S3_ENDPOINT) is specified.
      
      This makes it possible to use an on-premise deployment with a custom region (e.g. http://someminioinstance:9000/ and my-place) or an endpoint that does not match the region-detection regex (e.g. https://somebucket.s3.nl-ams.scw.cloud).
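      The resulting rule could be sketched as follows (illustrative only, not the actual configure_spilo code):

```python
# Sketch: AWS_REGION is exported for wal-g whenever it is provided, even
# when a custom endpoint (AWS_ENDPOINT / WALE_S3_ENDPOINT) is also set,
# since for on-premise S3 (e.g. minio) the region cannot be derived from
# the endpoint URL.
def walg_s3_env(aws_region=None, aws_endpoint=None):
    env = {}
    if aws_endpoint:
        env['AWS_ENDPOINT'] = aws_endpoint
    if aws_region:
        # previously this was skipped whenever an endpoint was given
        env['AWS_REGION'] = aws_region
    return env
```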
      
      Close: https://github.com/zalando/spilo/issues/539
  5. 20 Jan, 2021 1 commit
  6. 11 Jan, 2021 1 commit
  7. 04 Jan, 2021 1 commit
    • Postgis 3.1, timescale 2.0 and a few optimizations in relinking of extensions/contribs (#535) · ae5efea5
      Alexander Kukushkin authored
      1. Fix build broken by postgis 3.1 release
         - use postgis 3.1 as main version
         - use postgis 3.0 as legacy version (postgres 9.5)
         - fix relinking code to support legacy postgis and optimize it to support relinking between all major versions, not only the current one.
      2. Bump version of timescaledb to 2.0
         - change cmake requirement from 3.11 to 3.10 in CMakeLists.txt. The old version available in ubuntu 18.04 works perfectly.
         - call bootstrap with -DWARNINGS_AS_ERRORS=OFF
         - 9.6 and 10 will continue using 1.7.4, so the optimized relinking code from the postgis commit becomes very handy.
      3. A bit unrelated, but since this PR already does a lot of housekeeping, optimize the code defining/cleaning locales in order to reduce the number of repetitions.
      
      Close: #528, #529, #534, #533
  8. 18 Dec, 2020 1 commit
    • Fix a few minor issues (#524) · 656d46f8
      Alexander Kukushkin authored
      1. Restore `ETCD_DISCOVERY_DOMAIN` -> `etcd.discovery_srv` mapping
      2. Compatibility with https://github.com/zalando/patroni/pull/1788
      3. `python3-requests`, `postgresql-${version}-pg-stat-kcache`, `postgresql-${version}-cron`, and `postgresql-${version}-pgq3` are required for DEMO spilo to work.
      4. Don't install `pv`, `lzop`, `etcd`, and `etcdctl` in DEMO mode
      5. Don't remove the `pg_recvlogical` binary; it might be useful for playing with logical replication/decoding.
  9. 16 Dec, 2020 1 commit
  10. 15 Dec, 2020 1 commit
  11. 14 Dec, 2020 2 commits
    • Add PostgreSQL 13 support and in-place major upgrade (#520) · bb6ab228
      Alexander Kukushkin authored
      How to trigger upgrade? This is a two-step process:
      1. Update configuration (version) and rotate all pods. On start, configure_spilo will notice the version mismatch and start the old version.
      2. When all pods are rotated, exec into the master container and call `python3 /scripts/inplace_upgrade.py N`, where N is the capacity of the PostgreSQL cluster.
      
      What `inplace_upgrade.py` does:
      1. Safety checks:
        * the new version must be higher than the old one
        * the current node must be running as a master with the leader lock
        * the current number of members must match `N`
        * the cluster must not be running in maintenance mode
        * all replicas must be streaming from the master with a small lag
      2. Prepare `data_new` by running `initdb` with matching parameters
      3. Run `pg_upgrade --check`. If it fails - abort and do a cleanup.
      4. Drop objects from the database which could be incompatible with the new version (e.g. pg_stat_statements wrapper, postgres_log fdw)
      5. Enable maintenance mode (patronictl pause --wait)
      6. Do a clean shutdown of the postgres
      7. Get the latest checkpoint location from pg_controldata
      8. Wait for replicas to receive/apply latest checkpoint location
      9. Start rsyncd, listening on port 5432 (we know that it is exposed!)
      10. If all previous steps succeeded call `pg_upgrade -k`
      11. If pg_upgrade succeeded, we have reached the point of no return!
            If it failed, we need to roll back the previous steps.
      12. Rename data directories `data -> data_old` and `data_new -> data`
      13. Update configuration files (postgres.yml and wal-e envdir).
      14. Call CHECKPOINT on replicas (predictable shutdown time).
      15. Trigger rsync on replicas (COPY (SELECT) TO PROGRAM)
      16. Wait for replicas rsync to complete. Feedback status is generated by `post-xfer exec` script. Wait timeout 300 seconds.
      17. Stop rsyncd
      18. Remove the initialize key from DCS (it contains old sysid)
      19. Restart Patroni on the master with the new configuration
      20. Start the local postgres up as the master by calling REST API `POST /restart`
      21. Memorize and reset custom statistics targets.
      22. Start the ANALYZE in stages in a separate thread.
      23. Wait until Patroni on replicas is restarted.
      24. Disable maintenance mode (patronictl resume)
      25. Wait until analyze in stages finishes.
      26. Restore custom statistics targets and analyze these tables
      27. Call post_bootstrap script (restore dropped objects)
      28. Remove `data_old`
      29. Trigger creation of the new backup
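      The safety checks from step 1 could be sketched like this (illustrative names and thresholds, not the actual inplace_upgrade.py code):

```python
# Sketch of the pre-upgrade safety checks: return an error message when a
# pre-condition fails, or None when the upgrade may proceed. The lag
# limit is an assumed illustrative value.
def sanity_check(old_version, new_version, role, members, expected_n,
                 paused, max_replica_lag_bytes, lag_limit=16 * 1024 * 1024):
    if new_version <= old_version:
        return 'new version must be higher than the old one'
    if role != 'master':
        return 'must run on the master holding the leader lock'
    if members != expected_n:
        return 'number of cluster members does not match N'
    if paused:
        return 'cluster must not be in maintenance mode'
    if max_replica_lag_bytes > lag_limit:
        return 'replicas are lagging too far behind the master'
    return None  # all checks passed
```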
      
      Rollback:
      1. Stop rsyncd if it is running
      2. Disable maintenance mode (patronictl resume)
      3. Remove `data_new` if it exists
      
      Replicas upgrade with rsync
      ---------------------------
      
      There are many options on how to call the script:
      1. Start a separate REST API for such maintenance tasks (requires opening a new port and some changes in infrastructure)
      2. Allow `pod/exec` (works only on K8s, not desirable)
      3. Use COPY TO PROGRAM "hack"
      
      The `COPY TO PROGRAM` seems to be low-hanging fruit. It requires only postgres to be up and running, which is in turn already one of the requirements for the in-place upgrade to start. When being started, the script does some sanity checks based on input parameters.
      
      There are three parameters required: new_version, primary_ip, and PID.
      * new_version - the version we are upgrading to
      * primary_ip - where to rsync from
      * PID - the pid of the postgres backend that executed COPY TO PROGRAM.
      The script must wait until this backend exits before continuing. The script must also check that its parent (maybe grandparent?) process has the PID matching the argument.
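      Waiting for the spawning backend to exit could look roughly like this (a sketch, not the actual script):

```python
# Sketch: the replica-side script received the pid of the postgres
# backend that ran COPY TO PROGRAM; it polls until that process is gone
# before shutting postgres down, since the backend holds a connection.
import os
import time

def wait_for_backend_exit(pid, timeout=60, interval=0.5):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)          # signal 0: existence check only
        except ProcessLookupError:
            return True              # backend is gone, safe to continue
        except PermissionError:
            pass                     # process exists, owned by another user
        time.sleep(interval)
    return False                     # still alive when the timeout expired
```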
      
      There are some problems with the `COPY TO PROGRAM` approach. The Patroni, and therefore the PostgreSQL, environment is cleared before start. As a result, the script started by the postgres backend will not see, for example, `$KUBERNETES_SERVICE_HOST` and won't be able to work with the DCS in all cases.
      
      Once it has made sure that the client backend is gone, the script will:
      1. Remember the old sysid
      2. Do a clean shutdown of the postgres
      3. Rename data directory `data -> data_old`
      4. Update configuration file (postgres.yaml and wal-e envdir). We do it before rsync because the initialize key could be cleaned up right after rsync completes, and Patroni would exit!
      5. Call rsync. If it fails, rename the data directory back.
      6. Wait until the initialize key is removed from the DCS. Since we know this happens before postgres on the master is started, we try to connect to the master via the replication protocol and check the sysid.
      7. Restart Patroni.
      8. Remove `data_old`
      
      In addition to that, implement integration tests. Mostly they are testing happy-case scenarios, like:
      1. Successful in-place upgrade from 9.6 to 10
      2. Successful in-place upgrade from 10 to 12
      3. Major upgrade after the custom bootstrap with wal-g
      4. Major upgrade after the custom bootstrap with pg_basebackup
      5. Bootstrap of a new replica with wal-g
      
      The tests also cover a few unhappy cases, e.g. the in-place upgrade doesn't start if preconditions are not met.
    • Support for additional locales (#513) · 3ed0a9b7
      Marcin Frankiewicz authored
      * Support for additional locales
      
      * Fixes after review
  12. 26 Nov, 2020 1 commit
  13. 23 Nov, 2020 1 commit
  14. 26 Oct, 2020 1 commit
  15. 01 Oct, 2020 1 commit
  16. 28 Sep, 2020 1 commit
  17. 25 Sep, 2020 2 commits
  18. 24 Sep, 2020 1 commit
    • Disable prefetch when restore_command is called for pg_rewind (#495) · 3651d94c
      Alexander Kukushkin authored
      We use the fact that the `%p` parameter contains either `RECOVERYHISTORY` or `RECOVERYXLOG`, while in the case of pg_rewind the file is restored directly to `pg_wal/%f`.
      When this is detected, the following is done:
      * Export the `WALG_DOWNLOAD_CONCURRENCY=1` environment variable for wal-g
      * Use the `-p 1` option for wal-e
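      The detection could be sketched like this (a sketch under stated assumptions, not Spilo's actual restore script; the concurrency values are illustrative):

```python
# Sketch: decide the wal-g download concurrency from the %p path that
# postgres passes to restore_command. During normal recovery %p ends in
# RECOVERYXLOG or RECOVERYHISTORY; pg_rewind instead restores straight
# into pg_wal/%f, in which case prefetch must be disabled.
def download_concurrency(wal_dest):
    if wal_dest.endswith(('RECOVERYXLOG', 'RECOVERYHISTORY')):
        return 10  # normal recovery: prefetching is fine (example value)
    return 1       # pg_rewind detected: disable prefetch

# usage idea inside the wrapper, with wal_dest being the %p argument:
# os.environ['WALG_DOWNLOAD_CONCURRENCY'] = str(download_concurrency(wal_dest))
```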
      
      In addition, update README.rst and bump some dependencies.
  19. 08 Sep, 2020 1 commit
    • Housekeeping (#489) · 5f09fef7
      Alexander Kukushkin authored
      * remove useless callback_endpoint.py
      * add exponential back-off to callback_role.py
      * enable a linting job with GitHub Actions:
        1. check shell scripts with shellcheck
        2. check python code with flake8
      * run integration tests on GitHub Actions and CDP
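      The exponential back-off could be sketched like this (illustrative defaults, not the actual callback_role.py code):

```python
# Sketch: retry a flaky call (e.g. an API request made by the callback)
# with exponentially growing, capped sleeps between attempts.
import time

def retry_with_backoff(func, retries=5, base=1.0, cap=30.0):
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise                 # out of attempts: re-raise
            time.sleep(min(cap, base * 2 ** attempt))
```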
  20. 07 Sep, 2020 1 commit
  21. 02 Sep, 2020 1 commit
  22. 01 Sep, 2020 1 commit
  23. 14 Aug, 2020 2 commits
  24. 07 Aug, 2020 1 commit
  25. 05 Aug, 2020 1 commit
  26. 04 Aug, 2020 1 commit
  27. 20 Jul, 2020 1 commit
  28. 17 Jul, 2020 1 commit
    • Update dependencies (#464) · 9978b671
      Alexander Kukushkin authored
      * timescaledb 1.7.2
      * pg_mon cc028fdae8542ec3f5df3bf4c66f0895d87c127d
      * plpgsql_check 1.11.0
      * plantuner 800d81bc85da64ff3ef66e12aed1d4e1e54fc006 (pg13 support)
  29. 26 Jun, 2020 1 commit
  30. 11 Jun, 2020 2 commits
  31. 25 May, 2020 1 commit
  32. 19 May, 2020 1 commit
  33. 15 May, 2020 2 commits