1. 17 Feb, 2021 1 commit
    • Fix bug with major upgrade after clone (#557) · 7ac1688b
      Alexander Kukushkin authored
      While working on the code that removes bg_mon from shared_preload_libraries before executing pg_upgrade, a small bug was introduced: shared_preload_libraries for the old version was overwritten by the value taken from the new version. As a result, the old cluster failed to start due to missing (non-existent) libraries and the upgrade failed.
      
      This commit fixes the wrong behavior and improves the tests to catch similar issues in the future.
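      The fix can be sketched as follows (a minimal illustration, not the actual Spilo code; all names are made up):

```python
# Illustrative sketch of the fix, not the actual Spilo patch: before
# running pg_upgrade, bg_mon has to be stripped from the
# shared_preload_libraries of BOTH clusters, each starting from its own
# configured value. The bug overwrote the old cluster's value with the
# one computed for the new cluster, so the old cluster failed to start.
def strip_bg_mon(shared_preload_libraries):
    """Remove bg_mon from a comma-separated shared_preload_libraries value."""
    libs = [lib.strip() for lib in shared_preload_libraries.split(',') if lib.strip()]
    return ','.join(lib for lib in libs if lib != 'bg_mon')

old_value = 'bg_mon,pg_stat_statements,set_user'     # old cluster config (example)
new_value = 'bg_mon,pg_stat_statements,timescaledb'  # new cluster config (example)

# Correct behavior: each cluster keeps its own filtered value
old_spl = strip_bg_mon(old_value)  # 'pg_stat_statements,set_user'
new_spl = strip_bg_mon(new_value)  # 'pg_stat_statements,timescaledb'
```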
  2. 16 Feb, 2021 2 commits
    • Various improvements (#553) · 849313c8
      Alexander Kukushkin authored
      1. Fix in-place major upgrade failing due to the amcheck_next extension
      2. Make sure bg_mon is not in the shared_preload_libraries during pg_upgrade.
      3. Bump bg_mon commitid
      4. Run renice as root (when possible)
      5. Change self-signed certificate domain from dummy.org to example.org
      6. Bump Timescaledb version to 2.0.1
      7. Silence annoying NOTICE messages about deprecated set_user GUCs
      8. Fully revert the wal-e [commit](https://github.com/wal-e/wal-e/commit/485d834a18c9b0d97115d95f89e16bdc564e9a18) that caused performance issues with S3
      9. Analyze all databases after promote
      10. Patroni 2.0.1
      11. Last, but not least - new PostgreSQL minor releases
    • Reverted uid/gid detection to pwd rather than home directory ownership (see #551) (#555) · 581a73c7
      Django authored
      Fixes an issue (see #551) with Spilo running in OpenShift when the UID is a random number picked by OpenShift. This PR reverts the method of determining what uid/gid to use back to checking the `/etc/passwd` file rather than the ownership of the `PGHOME` directory.
      
      Verification that it works:
      * Tests all pass
      * `docker build -t spilo . && docker run --user=1500080000:0 spilo` is able to start the database with no issues
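      The reverted approach can be sketched like this (hypothetical names, assuming a `postgres` entry exists in `/etc/passwd`; not the actual Spilo script):

```python
# Sketch of uid/gid resolution via the pwd database (/etc/passwd) instead
# of stat()ing the PGHOME directory. Under OpenShift the container runs
# with a random high UID that owns nothing predictable, so directory
# ownership is meaningless while the passwd entry stays stable.
import os
import pwd

def resolve_uid_gid(username='postgres'):
    try:
        entry = pwd.getpwnam(username)  # look up the passwd database
        return entry.pw_uid, entry.pw_gid
    except KeyError:
        # no passwd entry at all: fall back to the current effective ids
        return os.geteuid(), os.getegid()
```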
  3. 04 Feb, 2021 1 commit
    • Bump the Wal-G version to 0.2.19 and add WALG SSH support (#549) · d0b763a6
      D. Spindel authored
      Bump the Wal-G version to 0.2.19 and add variables for the WALG_SSH_PREFIX for backups
      
      
      SSH_PASSWORD is not added, as password authentication is generally not recommended.
      
      The SSH transport works over sftp://, meaning that the server side can
      be locked down completely in sshd_config:
      
          Match User spilo
              ChrootDirectory /srv/backup/spilo
              ForceCommand internal-sftp
              AuthenticationMethods publickey
              PermitTunnel no
              AllowAgentForwarding no
              AllowTcpForwarding no
              X11Forwarding no
      
      The matching setup in the Spilo container could be:
      
          SSH_PRIVATE_KEY_PATH=/etc/patroni/ssh_key
          SSH_USER=patroni
          WALG_SSH_PREFIX=sftp://backup.example.com/spilo
      
      Fixes: zalando/spilo#548
  4. 27 Jan, 2021 2 commits
    • Run CHECKPOINT on replicas in parallel with pg_upgrade (#545) · 49b31b40
      Alexander Kukushkin authored
      Major upgrade with replica rsync requires all replicas to be up to date with the primary and shut down cleanly. It is possible to make the shutdown process faster by executing CHECKPOINT immediately before running `pg_ctl stop -m fast`.
      
      Before this commit, the CHECKPOINT and shutdown on replicas were performed only after the pg_upgrade was successfully executed.
      
      The time for CHECKPOINT on replicas, and therefore the shutdown time, is significant and comparable with the time required to execute pg_upgrade.
      By running these processes (pg_upgrade and CHECKPOINT on replicas) in parallel we reduce the downtime required for pg_upgrade + rsync.
      
      The old diagram:
      ```
                                      +-> r1 CHECKPOINT +   +-> r1 rsync +
      SHUTDOWN PRIMARY -> pg_upgrade -|-> r2 CHECKPOINT |->-|-> r2 rsync |->
                                      +-> r3 CHECKPOINT +   +-> r3 rsync +
      ```
      
      The new diagram:
      ```
                        +-> pg_upgrade    +
                        |-> r1 CHECKPOINT |   +-> r1 rsync +
      SHUTDOWN PRIMARY -|-> r2 CHECKPOINT |->-|-> r2 rsync |->
                        +-> r3 CHECKPOINT +   +-> r3 rsync +
      
      ```
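      The new scheduling could be sketched like this (hypothetical helper callables, not the actual implementation):

```python
# Sketch of the parallelization: once the primary is shut down, start
# pg_upgrade and the replica CHECKPOINTs concurrently instead of waiting
# for pg_upgrade to finish first. run_pg_upgrade and checkpoint_replica
# are placeholders for the real operations.
from concurrent.futures import ThreadPoolExecutor

def upgrade_in_parallel(run_pg_upgrade, checkpoint_replica, replicas):
    with ThreadPoolExecutor(max_workers=len(replicas) + 1) as pool:
        upgrade = pool.submit(run_pg_upgrade)
        checkpoints = [pool.submit(checkpoint_replica, r) for r in replicas]
        ok = upgrade.result()      # outcome of pg_upgrade on the primary
        for cp in checkpoints:
            cp.result()            # surface any replica-side errors
    return ok
```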
      
      In addition, this fixes a corner case with upgrading a single-node cluster.
      
      Also, `registry.opensource.zalan.do/acid/spilo-cdp-13` and `registry.opensource.zalan.do/acid/spilo-13` are now based on `registry.opensource.zalan.do/library/ubuntu-18.04`. In case anyone is concerned, the latter is mostly a mirror of ubuntu:18.04 from Docker Hub with a couple of layers on top: one updates packages and another adds a zalando-marker file containing some "random" string. The FROM image is overridden in delivery.yaml; if the image is built without build args supplied, the default ubuntu:18.04 is used.
    • Set AWS_REGION whenever USE_WALG_BACKUP is set to true (#546) · 246c3fe9
      François Van Ingelgom authored
      Since wal-g supports having both AWS_ENDPOINT (or WALE_S3_ENDPOINT) and AWS_REGION, and the region cannot be determined from the endpoint in the case of an on-premise deployment, this PR allows setting AWS_REGION even if AWS_ENDPOINT (or WALE_S3_ENDPOINT) is specified.
      
      This makes it possible to use an on-premise deployment with a custom region (e.g. http://someminioinstance:9000/ and my-place) or an endpoint that does not match the region-detection regex (e.g. https://somebucket.s3.nl-ams.scw.cloud).
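      The resulting rule could be sketched as follows (illustrative only, not the actual configure_spilo code):

```python
# Sketch: AWS_REGION is exported for wal-g whenever it is provided, even
# when a custom endpoint (AWS_ENDPOINT / WALE_S3_ENDPOINT) is also set,
# since for on-premise S3 (e.g. minio) the region cannot be derived from
# the endpoint URL.
def walg_s3_env(aws_region=None, aws_endpoint=None):
    env = {}
    if aws_endpoint:
        env['AWS_ENDPOINT'] = aws_endpoint
    if aws_region:
        # previously this was skipped whenever an endpoint was given
        env['AWS_REGION'] = aws_region
    return env
```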
      
      Close: https://github.com/zalando/spilo/issues/539
  5. 20 Jan, 2021 1 commit
  6. 11 Jan, 2021 1 commit
  7. 04 Jan, 2021 1 commit
    • Postgis 3.1, timescale 2.0 and a few optimizations in relinking of extensions/contribs (#535) · ae5efea5
      Alexander Kukushkin authored
      1. Fix build broken by postgis 3.1 release
         - use postgis 3.1 as main version
         - use postgis 3.0 as legacy version (postgres 9.5)
         - fix relinking code to support legacy postgis and optimize it to support relinking between all major versions, not only the current one.
      2. Bump version of timescaledb to 2.0
         - change cmake requirement from 3.11 to 3.10 in CMakeLists.txt. The old version available in ubuntu 18.04 works perfectly.
         - call bootstrap with -DWARNINGS_AS_ERRORS=OFF
         - 9.6 and 10 will continue using 1.7.4, so the optimized relinking code from the postgis commit becomes very handy.
      3. A bit unrelated, but since this PR already does a lot of housekeeping, optimize the code defining/cleaning locales in order to reduce the number of repetitions.
      
      Close: #528, #529, #534, #533
  8. 18 Dec, 2020 1 commit
    • Fix a few minor issues (#524) · 656d46f8
      Alexander Kukushkin authored
      1. Restore `ETCD_DISCOVERY_DOMAIN` -> `etcd.discovery_srv` mapping
      2. Compatibility with https://github.com/zalando/patroni/pull/1788
      3. `python3-requests`, `postgresql-${version}-pg-stat-kcache`, `postgresql-${version}-cron`, and `postgresql-${version}-pgq3` are required for DEMO spilo to work.
      4. Don't install `pv`, `lzop`, `etcd`, and `etcdctl` in DEMO mode
      5. Don't remove the `pg_recvlogical` binary; it might be useful for playing with logical replication/decoding.
  9. 16 Dec, 2020 1 commit
  10. 15 Dec, 2020 1 commit
  11. 14 Dec, 2020 2 commits
    • Add PostgreSQL 13 support and in-place major upgrade (#520) · bb6ab228
      Alexander Kukushkin authored
      How to trigger upgrade? This is a two-step process:
      1. Update configuration (version) and rotate all pods. On start, configure_spilo will notice the version mismatch and start the old version.
      2. When all pods are rotated, exec into the master container and call `python3 /scripts/inplace_upgrade.py N`, where N is the capacity of the PostgreSQL cluster.
      
      What `inplace_upgrade.py` does:
      1. Safety checks:
        * the new version must be higher than the old one
        * the current node must be running as a master with the leader lock
        * the current number of members must match `N`
        * the cluster must not be running in maintenance mode
        * all replicas must be streaming from the master with a small lag
      2. Prepare `data_new` by running `initdb` with matching parameters
      3. Run `pg_upgrade --check`. If it fails - abort and do a cleanup.
      4. Drop objects from the database which could be incompatible with the new version (e.g. pg_stat_statements wrapper, postgres_log fdw)
      5. Enable maintenance mode (patronictl pause --wait)
      6. Do a clean shutdown of the postgres
      7. Get the latest checkpoint location from pg_controldata
      8. Wait for replicas to receive/apply latest checkpoint location
      9. Start rsyncd, listening on port 5432 (we know that it is exposed!)
      10. If all previous steps succeeded call `pg_upgrade -k`
      11. If pg_upgrade succeeded, we have reached the point of no return!
            If it failed, we need to roll back the previous steps.
      12. Rename data directories `data -> data_old` and `data_new -> data`
      13. Update configuration files (postgres.yml and wal-e envdir).
      14. Call CHECKPOINT on replicas (predictable shutdown time).
      15. Trigger rsync on replicas (COPY (SELECT) TO PROGRAM)
      16. Wait for replicas rsync to complete. Feedback status is generated by `post-xfer exec` script. Wait timeout 300 seconds.
      17. Stop rsyncd
      18. Remove the initialize key from DCS (it contains old sysid)
      19. Restart Patroni on the master with the new configuration
      20. Start the local postgres up as the master by calling REST API `POST /restart`
      21. Memorize and reset custom statistics targets.
      22. Start the ANALYZE in stages in a separate thread.
      23. Wait until Patroni on replicas is restarted.
      24. Disable maintenance mode (patronictl resume)
      25. Wait until analyze in stages finishes.
      26. Restore custom statistics targets and analyze these tables
      27. Call post_bootstrap script (restore dropped objects)
      28. Remove `data_old`
      29. Trigger creation of the new backup
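      The safety checks from step 1 could be sketched like this (illustrative names and thresholds, not the actual inplace_upgrade.py code):

```python
# Sketch of the pre-upgrade safety checks: return an error message when a
# pre-condition fails, or None when the upgrade may proceed. The lag
# limit is an assumed illustrative value.
def sanity_check(old_version, new_version, role, members, expected_n,
                 paused, max_replica_lag_bytes, lag_limit=16 * 1024 * 1024):
    if new_version <= old_version:
        return 'new version must be higher than the old one'
    if role != 'master':
        return 'must run on the master holding the leader lock'
    if members != expected_n:
        return 'number of cluster members does not match N'
    if paused:
        return 'cluster must not be in maintenance mode'
    if max_replica_lag_bytes > lag_limit:
        return 'replicas are lagging too far behind the master'
    return None  # all checks passed
```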
      
      Rollback:
      1. Stop rsyncd if it is running
      2. Disable maintenance mode (patronictl resume)
      3. Remove `data_new` if it exists
      
      Replicas upgrade with rsync
      ---------------------------
      
      There are many options on how to call the script:
      1. Start a separate REST API for such maintenance tasks (requires opening a new port and some changes in infrastructure)
      2. Allow `pod/exec` (works only on K8s, not desirable)
      3. Use COPY TO PROGRAM "hack"
      
      The `COPY TO PROGRAM` seems to be low-hanging fruit. It requires only postgres to be up and running, which is in turn already one of the requirements for the in-place upgrade to start. When being started, the script does some sanity checks based on input parameters.
      
      There are three parameters required: new_version, primary_ip, and PID.
      * new_version - the version we are upgrading to
      * primary_ip - where to rsync from
      * PID - the pid of the postgres backend that executed COPY TO PROGRAM.
      The script must wait until this backend exits before continuing. The script must also check that its parent (maybe grandparent?) process has the PID matching the argument.
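      Waiting for the spawning backend to exit could look roughly like this (a sketch, not the actual script):

```python
# Sketch: the replica-side script received the pid of the postgres
# backend that ran COPY TO PROGRAM; it polls until that process is gone
# before shutting postgres down, since the backend holds a connection.
import os
import time

def wait_for_backend_exit(pid, timeout=60, interval=0.5):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)          # signal 0: existence check only
        except ProcessLookupError:
            return True              # backend is gone, safe to continue
        except PermissionError:
            pass                     # process exists, owned by another user
        time.sleep(interval)
    return False                     # still alive when the timeout expired
```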
      
      There are some problems with the `COPY TO PROGRAM` approach. The Patroni, and therefore the PostgreSQL, environment is cleared before start. As a result, the script started by the postgres backend will not see, for example, `$KUBERNETES_SERVICE_HOST` and won't be able to work with the DCS in all cases.
      
      Once it has made sure that the client backend is gone, the script will:
      1. Remember the old sysid
      2. Do a clean shutdown of the postgres
      3. Rename data directory `data -> data_old`
      4. Update configuration file (postgres.yaml and wal-e envdir). We do it before rsync because the initialize key could be cleaned up right after rsync completes, and Patroni would exit!
      5. Call rsync. If it fails, rename the data directory back.
      6. Wait until the initialize key is removed from the DCS. Since we know this happens before postgres on the master is started, we try to connect to the master via the replication protocol and check the sysid.
      7. Restart Patroni.
      8. Remove `data_old`
      
      In addition to that, implement integration tests. Mostly they are testing happy-case scenarios, like:
      1. Successful in-place upgrade from 9.6 to 10
      2. Successful in-place upgrade from 10 to 12
      3. Major upgrade after the custom bootstrap with wal-g
      4. Major upgrade after the custom bootstrap with pg_basebackup
      5. Bootstrap of a new replica with wal-g
      
      The tests also cover a few unhappy cases, e.g. the in-place upgrade doesn't start if preconditions are not met.
    • Support for additional locales (#513) · 3ed0a9b7
      Marcin Frankiewicz authored
      * Support for additional locales
      
      * Fixes after review
  12. 26 Nov, 2020 1 commit
  13. 23 Nov, 2020 1 commit
  14. 26 Oct, 2020 1 commit
  15. 01 Oct, 2020 1 commit
  16. 28 Sep, 2020 1 commit
  17. 25 Sep, 2020 2 commits
  18. 24 Sep, 2020 1 commit
    • Disable prefetch when restore_command is called for pg_rewind (#495) · 3651d94c
      Alexander Kukushkin authored
      We use the fact that the `%p` parameter contains either `RECOVERYHISTORY` or `RECOVERYXLOG`, while in the case of pg_rewind the file is restored directly to `pg_wal/%f`.
      When this is detected, the following is done:
      * Export the `WALG_DOWNLOAD_CONCURRENCY=1` environment variable for wal-g
      * Use the `-p 1` option for wal-e
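      The detection could be sketched like this (a sketch under stated assumptions, not Spilo's actual restore script; the concurrency values are illustrative):

```python
# Sketch: decide the wal-g download concurrency from the %p path that
# postgres passes to restore_command. During normal recovery %p ends in
# RECOVERYXLOG or RECOVERYHISTORY; pg_rewind instead restores straight
# into pg_wal/%f, in which case prefetch must be disabled.
def download_concurrency(wal_dest):
    if wal_dest.endswith(('RECOVERYXLOG', 'RECOVERYHISTORY')):
        return 10  # normal recovery: prefetching is fine (example value)
    return 1       # pg_rewind detected: disable prefetch

# usage idea inside the wrapper, with wal_dest being the %p argument:
# os.environ['WALG_DOWNLOAD_CONCURRENCY'] = str(download_concurrency(wal_dest))
```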
      
      In addition, update README.rst and bump some dependencies.
  19. 08 Sep, 2020 1 commit
    • Housekeeping (#489) · 5f09fef7
      Alexander Kukushkin authored
      * remove useless callback_endpoint.py
      * add exponential back-off to callback_role.py
      * enable a linting job with GitHub Actions:
        1. check shell scripts with shellcheck
        2. check python code with flake8
      * run integration tests on GitHub Actions and CDP
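      The exponential back-off could be sketched like this (illustrative defaults, not the actual callback_role.py code):

```python
# Sketch: retry a flaky call (e.g. an API request made by the callback)
# with exponentially growing, capped sleeps between attempts.
import time

def retry_with_backoff(func, retries=5, base=1.0, cap=30.0):
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise                 # out of attempts: re-raise
            time.sleep(min(cap, base * 2 ** attempt))
```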
  20. 07 Sep, 2020 1 commit
  21. 02 Sep, 2020 1 commit
  22. 01 Sep, 2020 1 commit
  23. 14 Aug, 2020 2 commits
  24. 07 Aug, 2020 1 commit
  25. 05 Aug, 2020 1 commit
  26. 04 Aug, 2020 1 commit
  27. 20 Jul, 2020 1 commit
  28. 17 Jul, 2020 1 commit
    • Update dependencies (#464) · 9978b671
      Alexander Kukushkin authored
      * timescaledb 1.7.2
      * pg_mon cc028fdae8542ec3f5df3bf4c66f0895d87c127d
      * plpgsql_check 1.11.0
      * plantuner 800d81bc85da64ff3ef66e12aed1d4e1e54fc006 (pg13 support)
  29. 26 Jun, 2020 1 commit
  30. 11 Jun, 2020 2 commits
  31. 25 May, 2020 1 commit
  32. 19 May, 2020 1 commit
  33. 15 May, 2020 2 commits