- 17 Feb, 2021 1 commit
-
-
Alexander Kukushkin authored
While working on the code that removes bg_mon from shared_preload_libraries before executing pg_upgrade, a small bug was introduced: shared_preload_libraries for the old version was overwritten by the value taken from the new version. As a result, the old cluster failed to start due to missing (non-existing) libraries and the upgrade failed. This commit fixes the wrong behavior and improves tests to catch similar issues in the future.
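The fix boils down to stripping the library from each cluster's own setting instead of reusing one value for both versions. A minimal sketch of the idea (the helper name `strip_preload_library` is illustrative, not the actual Spilo code):

```python
def strip_preload_library(value, name):
    """Remove a single library from a shared_preload_libraries value."""
    libs = [lib.strip() for lib in value.split(',') if lib.strip()]
    return ','.join(lib for lib in libs if lib != name)

# The bug: the old cluster was started with the value computed for the
# new cluster. The fix keeps a separate, independently stripped value
# per cluster version (example values are hypothetical):
old_value = strip_preload_library('bg_mon,pg_stat_statements', 'bg_mon')
new_value = strip_preload_library('pg_stat_statements,bg_mon', 'bg_mon')
```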
-
- 16 Feb, 2021 2 commits
-
-
Alexander Kukushkin authored
1. Fix failing in-place major upgrade due to the amcheck_next
2. Make sure bg_mon is not in the shared_preload_libraries during pg_upgrade
3. Bump bg_mon commit id
4. Run renice as root (when possible)
5. Change self-signed certificate domain from dummy.org to example.org
6. Bump TimescaleDB version to 2.0.1
7. Silence annoying NOTICE messages about deprecated set_user GUCs
8. Fully revert the wal-e [commit](https://github.com/wal-e/wal-e/commit/485d834a18c9b0d97115d95f89e16bdc564e9a18) causing performance issues with S3
9. Analyze all databases after promote
10. Patroni 2.0.1
11. Last, but not least: new PostgreSQL minor releases
-
Django authored
Fixes an issue (see #551) with Spilo running in OpenShift when the UID is a random number picked by OpenShift. This PR reverts the method of determining which uid/gid to use back to checking the `/etc/passwd` file rather than the ownership of the `PGHOME` directory. Verification that it works:
* All tests pass
* `docker build -t spilo . && docker run --user=1500080000:0 spilo` is able to start the database with no issues
-
- 04 Feb, 2021 1 commit
-
-
D. Spindel authored
Bump the WAL-G version to 0.2.19 and add variables for the WALG_SSH_PREFIX for backups. SSH_PASSWORD is not added, as password authentication is generally not recommended. The SSH transport functions using sftp://, meaning that the server side could be set up in a completely restricted manner in sshd_config:

```
Match User spilo
    ChrootDirectory /srv/backup/spilo
    ForceCommand internal-sftp
    AuthenticationMethods publickey
    PermitTunnel no
    AllowAgentForwarding no
    AllowTcpForwarding no
    X11Forwarding no
```

With the matching setup in the Spilo container:

```
SSH_PRIVATE_KEY_PATH=/etc/patroni/ssh_key
SSH_USER=patroni
WALG_SSH_PREFIX=sftp://backup.example.com/spilo
```

Fixes: zalando/spilo#548
-
- 27 Jan, 2021 2 commits
-
-
Alexander Kukushkin authored
Major upgrade with replica rsync requires all replicas to be up-to-date with the primary and shut down cleanly. The shutdown can be made faster by executing CHECKPOINT immediately before running `pg_ctl stop -m fast`. Before this commit, the CHECKPOINT and shutdown on replicas were performed only after pg_upgrade had successfully finished. The time for CHECKPOINT on replicas, and therefore the time for the shutdown, is quite measurable and comparable with the time required to execute pg_upgrade. By running these processes (pg_upgrade and CHECKPOINT on replicas) in parallel we reduce the downtime required for pg_upgrade + rsync.

The old diagram:
```
                               +-> r1 CHECKPOINT +    +-> r1 rsync +
SHUTDOWN PRIMARY -> pg_upgrade -|-> r2 CHECKPOINT |->-|-> r2 rsync |->
                               +-> r3 CHECKPOINT +    +-> r3 rsync +
```

The new diagram:
```
                  +-> pg_upgrade    +
                  |-> r1 CHECKPOINT |    +-> r1 rsync +
SHUTDOWN PRIMARY -|-> r2 CHECKPOINT |->-|-> r2 rsync |->
                  +-> r3 CHECKPOINT +    +-> r3 rsync +
```

In addition, fix a corner case with upgrading a single-node cluster. Also, `registry.opensource.zalan.do/acid/spilo-cdp-13` and `registry.opensource.zalan.do/acid/spilo-13` become based on `registry.opensource.zalan.do/library/ubuntu-18.04`. If anyone is concerned, the latter is mostly a mirror of ubuntu:18.04 from Docker Hub with a couple of layers on top: one updates packages and another adds a zalando-marker file containing some "random" string. The FROM image is overridden in the delivery.yaml. If someone builds the image without build args supplied, the default ubuntu:18.04 is used.
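The parallelization shown in the new diagram can be sketched with a thread pool. The function name and the injected callables below are hypothetical, not the actual inplace_upgrade.py code:

```python
from concurrent.futures import ThreadPoolExecutor

def upgrade_in_parallel(replica_ips, run_pg_upgrade, checkpoint_replica):
    """Run pg_upgrade on the primary while replicas CHECKPOINT in parallel.

    run_pg_upgrade: callable executing pg_upgrade locally (returns success).
    checkpoint_replica: callable issuing CHECKPOINT on one replica.
    """
    with ThreadPoolExecutor() as pool:
        # Kick off CHECKPOINT on every replica in background threads...
        futures = [pool.submit(checkpoint_replica, ip) for ip in replica_ips]
        # ...and run pg_upgrade at the same time instead of afterwards.
        ok = run_pg_upgrade()
        for f in futures:   # wait for all replica checkpoints to finish
            f.result()
    return ok
```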
-
François Van Ingelgom authored
Since wal-g supports having both AWS_ENDPOINT (or WALE_S3_ENDPOINT) and AWS_REGION, and since we cannot determine the region from the endpoint in the case of an on-premise deployment, this PR allows setting AWS_REGION even if AWS_ENDPOINT (or WALE_S3_ENDPOINT) is specified. This makes it possible to use an on-premise deployment with a custom region (e.g. http://someminioinstance:9000/ with my-place) or an endpoint that does not match the region-detection regex (e.g. https://somebucket.s3.nl-ams.scw.cloud). Close: https://github.com/zalando/spilo/issues/539
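A sketch of the intended precedence, using a hypothetical region-detection regex in the spirit of the one the commit mentions (the pattern and helper name are illustrative):

```python
import re

# Illustrative pattern: extract the region from an *.amazonaws.com endpoint.
ENDPOINT_REGION = re.compile(r'\.([a-z0-9-]+)\.amazonaws\.com$')

def resolve_region(endpoint, configured_region=None):
    # An explicitly configured AWS_REGION always wins, so on-premise
    # endpoints (MinIO, Scaleway, ...) can still specify a region.
    if configured_region:
        return configured_region
    match = ENDPOINT_REGION.search(endpoint)
    return match.group(1) if match else None
```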
-
- 20 Jan, 2021 1 commit
-
-
Kieran Evans authored
Close #542
-
- 11 Jan, 2021 1 commit
-
-
Max Rosin authored
Close #525
-
- 04 Jan, 2021 1 commit
-
-
Alexander Kukushkin authored
1. Fix the build broken by the PostGIS 3.1 release:
   - use PostGIS 3.1 as the main version
   - use PostGIS 3.0 as the legacy version (PostgreSQL 9.5)
   - fix the relinking code to support legacy PostGIS and optimize it to support relinking between all major versions, not only the current one.
2. Bump the version of TimescaleDB to 2.0:
   - change the cmake requirement from 3.11 to 3.10 in CMakeLists.txt; the old version available in Ubuntu 18.04 works perfectly.
   - call bootstrap with -DWARNINGS_AS_ERRORS=OFF
   - 9.6 and 10 will continue using 1.7.4, therefore the optimized relinking code from the PostGIS change becomes very handy.
3. A bit unrelated, but since this PR already does a lot of housekeeping: optimize the code defining/cleaning locales in order to reduce the number of repetitions.

Close: #528, #529, #534, #533
-
- 18 Dec, 2020 1 commit
-
-
Alexander Kukushkin authored
1. Restore the `ETCD_DISCOVERY_DOMAIN` -> `etcd.discovery_srv` mapping
2. Compatibility with https://github.com/zalando/patroni/pull/1788
3. `python3-requests`, `postgresql-${version}-pg-stat-kcache`, `postgresql-${version}-cron`, and `postgresql-${version}-pgq3` are required for the DEMO Spilo to work.
4. Don't install `pv`, `lzop`, `etcd`, and `etcdctl` in DEMO mode
5. Don't remove the `pg_recvlogical` binary; it might be useful for playing with logical replication/decoding.
-
- 16 Dec, 2020 1 commit
-
-
Kiryl authored
-
- 15 Dec, 2020 1 commit
-
-
Oleksii Kliukin authored
* Use an IP address for Google instance metadata. Makes Spilo start even if external DNS queries from within the container are not working. Do not continue with the startup when /etc/service is missing. Per https://github.com/zalando/spilo/issues/511
* Follow Python guidelines on in-line comments
* Extra space in the shell comment for readability

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
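The metadata-by-IP idea can be sketched as follows. `metadata_request` and `fetch_metadata` are illustrative helpers, not the actual Spilo code; the fixed link-local IP and the `Metadata-Flavor: Google` header are the documented way to reach the service without resolving metadata.google.internal:

```python
import urllib.request

# Fixed IP of the metadata service: no DNS lookup needed at startup.
METADATA_URL = 'http://169.254.169.254/computeMetadata/v1/instance/'

def metadata_request(path):
    # Google's metadata service rejects requests without this header.
    return urllib.request.Request(METADATA_URL + path,
                                  headers={'Metadata-Flavor': 'Google'})

def fetch_metadata(path, timeout=2):
    with urllib.request.urlopen(metadata_request(path), timeout=timeout) as resp:
        return resp.read().decode()
```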
-
- 14 Dec, 2020 2 commits
-
-
Alexander Kukushkin authored
How to trigger the upgrade? This is a two-step process:

1. Update the configuration (version) and rotate all pods. On start, configure_spilo will notice the version mismatch and start the old version.
2. When all pods are rotated, exec into the master container and call `python3 /scripts/inplace_upgrade.py N`, where N is the capacity of the PostgreSQL cluster.

What `inplace_upgrade.py` does:

1. Safety checks:
   * the new version must be bigger than the old one
   * the current node must be running as a master with the leader lock
   * the current number of members must match `N`
   * the cluster must not be running in maintenance mode
   * all replicas must be streaming from the master with a small lag
2. Prepare `data_new` by running `initdb` with matching parameters
3. Run `pg_upgrade --check`. If it fails, abort and clean up.
4. Drop objects from the database which could be incompatible with the new version (e.g. the pg_stat_statements wrapper, the postgres_log fdw)
5. Enable maintenance mode (patronictl pause --wait)
6. Do a clean shutdown of postgres
7. Get the latest checkpoint location from pg_controldata
8. Wait for replicas to receive/apply the latest checkpoint location
9. Start rsyncd, listening on port 5432 (we know that it is exposed!)
10. If all previous steps succeeded, call `pg_upgrade -k`
11. If pg_upgrade succeeded we have reached the point of no return! If it failed we need to roll back the previous steps.
12. Rename data directories: `data -> data_old` and `data_new -> data`
13. Update configuration files (postgres.yml and the wal-e envdir).
14. Call CHECKPOINT on replicas (predictable shutdown time).
15. Trigger rsync on replicas (COPY (SELECT) TO PROGRAM)
16. Wait for the replicas' rsync to complete. Feedback status is generated by the `post-xfer exec` script. Wait timeout: 300 seconds.
17. Stop rsyncd
18. Remove the initialize key from DCS (it contains the old sysid)
19. Restart Patroni on the master with the new configuration
20. Start the local postgres up as the master by calling the REST API (`POST /restart`)
21. Memorize and reset custom statistics targets.
22. Start ANALYZE in stages in a separate thread.
23. Wait until Patroni on the replicas is restarted.
24. Disable maintenance mode (patronictl resume)
25. Wait until the analyze in stages finishes.
26. Restore custom statistics targets and analyze these tables
27. Call the post_bootstrap script (restore dropped objects)
28. Remove `data_old`
29. Trigger creation of a new backup

Rollback:

1. Stop rsyncd if it is running
2. Disable maintenance mode (patronictl resume)
3. Remove `data_new` if it exists

Replicas upgrade with rsync
---------------------------

There are many options for how to call the script:

1. Start a separate REST API for such maintenance tasks (requires opening a new port and some changes in infrastructure)
2. Allow `pod/exec` (works only on K8s, not desirable)
3. Use the COPY TO PROGRAM "hack"

The `COPY TO PROGRAM` approach seems to be the low-hanging fruit. It only requires postgres to be up and running, which is in turn already one of the requirements for the in-place upgrade to start. When started, the script does some sanity checks based on its input parameters. Three parameters are required: new_version, primary_ip, and PID.

* new_version - the version we are upgrading to
* primary_ip - where to rsync from
* PID - the pid of the postgres backend that executed COPY TO PROGRAM. The script must wait until this backend exits before continuing. The script must also check that its parent (maybe grandparent?) process has the PID matching the argument.

There are some problems with the `COPY TO PROGRAM` approach. The Patroni, and therefore PostgreSQL, environment is cleared before start. As a result, the script started by the postgres backend will not see for example `$KUBERNETES_SERVICE_HOST` and won't be able to work with the DCS in all cases.

Once it has made sure that the client backend is gone, the script will:

1. Remember the old sysid
2. Do a clean shutdown of postgres
3. Rename the data directory: `data -> data_old`
4. Update the configuration file (postgres.yaml and the wal-e envdir). We do it before rsync because the initialize key could be cleaned up right after rsync completes, and Patroni would exit!
5. Call rsync. If it failed, rename the data directory back.
6. Now we need to wait until the initialize key is removed from DCS. Since we know that this happens before postgres on the master is started, we try to connect to the master via the replication protocol and check the sysid.
7. Restart Patroni.
8. Remove `data_old`

In addition, implement integration tests. Mostly they test happy-case scenarios, like:

1. Successful in-place upgrade from 9.6 to 10
2. Successful in-place upgrade from 10 to 12
3. Major upgrade after a custom bootstrap with wal-g
4. Major upgrade after a custom bootstrap with pg_basebackup
5. Bootstrap of a new replica with wal-g

The tests also cover a few unhappy cases, like: the in-place upgrade doesn't start if pre-conditions are not met.
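The primary-side trigger (step 15, `COPY (SELECT) TO PROGRAM`) could look roughly like this. The SQL builder, the script path, and the argument order are illustrative, not the exact Spilo implementation:

```python
def build_trigger_sql(new_version, primary_ip, backend_pid):
    """Build the SQL that makes a replica's postgres launch the upgrade script.

    COPY ... TO PROGRAM runs the command as the postgres OS user on the
    replica; the script then waits for the issuing backend (backend_pid)
    to exit before it shuts postgres down and starts rsync.
    """
    cmd = 'nohup python3 /scripts/inplace_upgrade.py {0} {1} {2} &'.format(
        new_version, primary_ip, backend_pid)
    return "COPY (SELECT 1) TO PROGRAM '{0}'".format(cmd)
```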
-
Marcin Frankiewicz authored
* Support for additional locales * Fixes after review
-
- 26 Nov, 2020 1 commit
-
-
Steve Singer authored
Close https://github.com/zalando/spilo/issues/383
-
- 23 Nov, 2020 1 commit
-
-
Alexander Kukushkin authored
https://github.com/wal-e/wal-e/commit/485d834a18c9b0d97115d95f89e16bdc564e9a18 introduced `gevent.monkey.patch_thread()`, which makes S3 performance ridiculously slow.
-
- 26 Oct, 2020 1 commit
-
-
Alexander Kukushkin authored
Close https://github.com/zalando/spilo/issues/508
-
- 01 Oct, 2020 1 commit
-
-
SanjeevChoubey authored
* Added the pg_permission extension
-
- 28 Sep, 2020 1 commit
-
-
Alexander Kukushkin authored
Close https://github.com/zalando/spilo/issues/501
-
- 25 Sep, 2020 2 commits
-
-
SanjeevChoubey authored
Pin the commit id (tags are inconsistent)

Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
-
Alexander Kukushkin authored
* Don't install/purge libpq with a pinned version. It was required only for building pg_rewind on 9.3 and 9.4.
* Fix the Patroni install
* Clean up post_init.sh
-
- 24 Sep, 2020 1 commit
-
-
Alexander Kukushkin authored
We use the fact that the `%p` parameter contains either `RECOVERYHISTORY` or `RECOVERYXLOG`, but in the case of pg_rewind the file is restored directly to `pg_wal/%f`. If this is detected, the following is done:
* export the WALG_DOWNLOAD_CONCURRENCY=1 environment variable for wal-g
* use the -p 1 option for wal-e

In addition, update README.rst and bump some dependencies.
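The `%p`-based detection could be sketched like this (a hypothetical helper, not the actual wrapper script): during normal recovery the restore target is `pg_wal/RECOVERYXLOG` or `pg_wal/RECOVERYHISTORY`, while pg_rewind fetches straight to `pg_wal/<walfile>`.

```python
import os

def is_pg_rewind_fetch(p_path):
    """Guess from restore_command's %p whether pg_rewind issued the fetch.

    Regular recovery restores into the fixed temporary names; anything
    else means the file goes directly to pg_wal/%f, i.e. pg_rewind.
    """
    return os.path.basename(p_path) not in ('RECOVERYXLOG', 'RECOVERYHISTORY')
```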
-
- 08 Sep, 2020 1 commit
-
-
Alexander Kukushkin authored
* remove the useless callback_endpoint.py
* add exponential back-off to callback_role.py
* enable a linting job with GitHub Actions:
  1. check shell scripts with shellcheck
  2. check python code with flake8
* run integration tests on GitHub Actions and CDP
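A minimal exponential back-off sketch, with an injectable sleep function for testability; the names are illustrative rather than the actual callback_role.py code:

```python
import time

def retry_with_backoff(func, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry func() with doubling delays: base, 2*base, 4*base, ..."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise              # out of attempts: propagate the error
            sleep(base_delay * (2 ** attempt))
```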
-
- 07 Sep, 2020 1 commit
-
-
Alexander Kukushkin authored
* wal-e 1.1.1
* wal-g 0.2.17
* timescaledb 1.7.3
* refactor DCS configuration (close #468)
-
- 02 Sep, 2020 1 commit
-
-
Christopher Bayliss authored
Splits AWS and OpenStack metadata fetching. Adds OpenStack-specific fetching from http://169.254.169.254/openstack/latest/meta_data.json. Like GCP, OpenStack's metadata does not expose IPv4 addresses, so the auto-discovered one is used. Close https://github.com/zalando/spilo/issues/485
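The OpenStack fetch can be sketched as follows; the injectable `opener` parameter is a testing convenience, not part of the actual script:

```python
import json
import urllib.request

# OpenStack serves instance metadata as one JSON document at a fixed
# link-local address (same IP as AWS/GCP, different path).
OPENSTACK_URL = 'http://169.254.169.254/openstack/latest/meta_data.json'

def fetch_openstack_metadata(opener=urllib.request.urlopen, timeout=2):
    with opener(OPENSTACK_URL, timeout=timeout) as resp:
        return json.loads(resp.read().decode())
```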
-
- 01 Sep, 2020 1 commit
-
-
Igor Yanchenko authored
We want to be able to overwrite the default configuration.
-
- 14 Aug, 2020 2 commits
-
-
Alexander Kukushkin authored
and install plpgsql-check from pgdg packages
-
Marco Giovannini authored
Close #479
-
- 07 Aug, 2020 1 commit
-
-
sweetbadger authored
Close #477
-
- 05 Aug, 2020 1 commit
-
-
Dmitry Dolgov authored
To facilitate troubleshooting, enable coredumps. Unfortunately, by default a coredump takes a snapshot of the whole shared memory of a process, which in the case of PostgreSQL could mean gigabytes of shared buffers. To prevent that, also configure coredump_filter to exclude shared memory from the dump. See [1] for more details. [1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html#proc-pid-coredump-filter-core-dump-filtering-settings
-
- 04 Aug, 2020 1 commit
-
-
Andras Vaczi authored
-
- 20 Jul, 2020 1 commit
-
-
Alexander Kukushkin authored
it is used by post_init.sh script
-
- 17 Jul, 2020 1 commit
-
-
Alexander Kukushkin authored
* timescaledb 1.7.2 * pg_mon cc028fdae8542ec3f5df3bf4c66f0895d87c127d * plpgsql_check 1.11.0 * plantuner 800d81bc85da64ff3ef66e12aed1d4e1e54fc006 (pg13 support)
-
- 26 Jun, 2020 1 commit
-
-
Alexander Kukushkin authored
We want to make it possible for members of the admin role to grant the pgq_* roles.
-
- 11 Jun, 2020 2 commits
-
-
Armin Nesiren authored
-
stoetti authored
Close https://github.com/zalando/postgres-operator/issues/676
-
- 25 May, 2020 1 commit
-
-
Alexander Kukushkin authored
Refactor the Dockerfile a little bit (pglogical-ticker is available for 12). Close https://github.com/zalando/postgres-operator/issues/926
-
- 19 May, 2020 1 commit
-
-
Alexander Kukushkin authored
In addition, bump pam-oauth2 to v1.0.1; there are no real changes in the code, but it will solve https://github.com/zalando/spilo/issues/433. And last, fix the build with `--build-arg PGOLDVERSIONS=""`. Close https://github.com/zalando/spilo/issues/413
-
- 15 May, 2020 2 commits
-
-
Alexander Kukushkin authored
and configure it to keep 2h of aggregated metrics in memory
-
Armin Nesiren authored
Close https://github.com/zalando/spilo/pull/404
-