---
Title: PostgreSQL Monitoring for Application Developers: The DBA Fundamentals
Author: Jonathan S. Katz
Tags: readwise, articles
date: 2024-01-30
---
# PostgreSQL Monitoring for Application Developers: The DBA Fundamentals

URL:: https://blog.crunchydata.com/blog/postgresql-monitoring-for-application-developers-dba-stats
Author:: Jonathan S. Katz
## AI-Generated Summary
What are some PostgreSQL monitoring stats that are typically used to monitor the health of your databases?
## Highlights
> Connection Statistics ([View Highlight](https://read.readwise.io/read/01fe9rs56s61ezq6av9fpbbreh))
> Total Connections ([View Highlight](https://read.readwise.io/read/01fe9rs6g5ta2sa2nzrjmswwyk))
> One of the required configuration parameters for PostgreSQL is [max_connections](https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-MAX-CONNECTIONS): any connections beyond this number will cause the client connection to fail. This is the first fundamental reason why you need to monitor the number of connections to your database: too many connections may cause the appearance of downtime events in your application. ([View Highlight](https://read.readwise.io/read/01fe9rsahaxa7y3mpjsttq5fby))
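
The article doesn't include queries for these metrics, so here is a minimal sketch for watching total connections, assuming you can query `pg_stat_activity` directly (it also counts background workers, so the count runs slightly above the number of client sessions):

```sql
-- Current connections vs. the configured ceiling
SELECT count(*) AS total_connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
```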
> Idle in Transaction ([View Highlight](https://read.readwise.io/read/01fe9rst1ydy8md9tcjqg68ttj))
> There is a bit more to the connection story, and this story involves connections that are "idle in transaction." A connection that is idle in transaction is where a transaction is in process (e.g. there was a [BEGIN](https://www.postgresql.org/docs/current/sql-begin.html) at some point) but has not yet been committed ([COMMIT](https://www.postgresql.org/docs/current/sql-commit.html)) or rolled back ([ROLLBACK](https://www.postgresql.org/docs/current/sql-rollback.html)). If you have too many connections that are idle in transaction, you can end up causing overall performance issues for your system and potentially maintenance issues. ([View Highlight](https://read.readwise.io/read/01fe9rsv68d5nhnfaw5bc5hvpv))
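
A sketch for spotting these sessions, again via `pg_stat_activity` (the state names are the standard ones; what counts as "too long" is a threshold you choose):

```sql
-- Sessions sitting in an open transaction, oldest first
SELECT pid, usename, now() - state_change AS idle_for, query
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted)')
ORDER BY state_change;
```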
> Transaction Rate ([View Highlight](https://read.readwise.io/read/01fe9rswkj9zp8mjhnx6gjs48w))
> Transaction rate, whether it's measured in transactions per second (TPS) or transactions per minute (TPM), measures the overall throughput of your system. Usually, this metric by itself does not tell you much: transaction rate can vary based on time-of-day, day-of-week, etc. You can use transaction rate to help determine if there is a load spike, or couple it with another metric such as [network throughput](https://blog.crunchydata.com/blog/postgresql-monitoring-application-developers-os-stats) to determine if your performance is being affected. ([View Highlight](https://read.readwise.io/read/01fe9rsz6m60drmrap1wa6twx3))
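
The counters behind this live in `pg_stat_database` and are cumulative, so a single read isn't a rate: sample twice and divide the delta by the interval to get TPS. A sketch:

```sql
-- Cumulative commit/rollback counters for the current database;
-- diff two samples over time to compute transactions per second.
SELECT datname,
       xact_commit,
       xact_rollback,
       xact_commit + xact_rollback AS total_xacts
FROM pg_stat_database
WHERE datname = current_database();
```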
> Row Activity ([View Highlight](https://read.readwise.io/read/01fe9rt2rtdbkgbh9pnj1fzxwx))
> Row activity can indicate the type of workload that you are managing, whether it's read-heavy (SELECT), write-heavy (INSERT/UPDATE/DELETE), or balanced between the two. Knowing your workload type, in conjunction with other metrics, lets you tune your database to maximize your application's performance. ([View Highlight](https://read.readwise.io/read/01fe9rt6bsvqdavy0waccf57cw))
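
The read/write mix can be approximated from the per-database tuple counters (also cumulative, so diff samples over time); a sketch:

```sql
-- Rough read vs. write row activity for the current database
SELECT tup_returned + tup_fetched AS rows_read,
       tup_inserted + tup_updated + tup_deleted AS rows_written
FROM pg_stat_database
WHERE datname = current_database();
```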
> Disk Usage ([View Highlight](https://read.readwise.io/read/01fe9rt8m38ewcz74p6gze5r9y))
> It's quite simple: if PostgreSQL can't write to disk, you're going to have downtime. It's important to [keep track of disk size](https://info.crunchydata.com/blog/postgresql-monitoring-application-developers-os-stats), but it's also helpful to know which items are taking up disk. ([View Highlight](https://read.readwise.io/read/01fe9rtabnyjj9305b4evrsr69))
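
For the "which items are taking up disk" part, a sketch using the built-in size functions (limited here to user tables; indexes and TOAST are included in the per-table total):

```sql
-- Overall database size
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;

-- Ten largest user tables, including their indexes and TOAST data
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```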
> Cache Hit Ratio ([View Highlight](https://read.readwise.io/read/01fe9rthbmpmgzxmmh90gaayct))
> Cache hit ratio is a measurement of how much of your "working data set" resides in memory. If you make a request for data that is not in memory, PostgreSQL will have to go and fetch it from disk, which is a much more costly operation. Your goal, then, is to try to have your cache hit ratio be as close to 100% as possible. ([View Highlight](https://read.readwise.io/read/01fe9rv5k58th1gve2mt9gy8pf))
> So, what is your "working data set?" Simply, your working data set is the data that your application is accessing over a given period of time. For example, let's say I have a simple scheduling application that only looks at appointments that are either today or in the future. Therefore, the maximum size of my working data set would be all scheduled items that are from today and into the future. My actual working data set is probably smaller: the application is more likely to access appointments that are closer to today rather than farther, so we can narrow our window down to that. ([View Highlight](https://read.readwise.io/read/01fe9rvj7hzga9ss6wq9yka7ht))
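
The usual way to compute this from `pg_stat_database` (a sketch; note it only sees PostgreSQL's shared buffers, so reads served by the OS page cache count as misses and the "true" ratio is usually a bit higher than reported):

```sql
-- Shared-buffer cache hit ratio for the current database
SELECT round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
       AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();
```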
> Locks ([View Highlight](https://read.readwise.io/read/01fe9rvxrvm7f19vn90a6h0f31))
> Locking is a natural part of PostgreSQL as it is one of the fundamental pieces of [multi-version concurrency control](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) (MVCC), so it is perfectly normal to have locks. The key bit is the above: ensuring that you can avoid errors due to locks. The most common locking error I would run into was "could not acquire lock," typically because my system was oversaturated at that point. I usually triggered deadlocks due to bad code (I admit it: I did write some very complex chained triggers, which would usually do the trick). ([View Highlight](https://read.readwise.io/read/01fe9rw0pepcv0z2v18n856xjv))
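
Since having locks is normal, the interesting signal is lock requests that are *waiting*; a sketch joining `pg_locks` to `pg_stat_activity` to see the blocked queries:

```sql
-- Lock requests currently waiting, with the query that is stuck
SELECT l.locktype, l.relation::regclass, l.mode, l.pid, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;
```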
> Replication Lag ([View Highlight](https://read.readwise.io/read/01fe9rweezzs4pxx8g7wsdass7))
> In short, replication lag is the amount of time it would take for a replica, or a copy of your database, to be up-to-date with the primary. Typically, you want this value to be small: if there is a scenario that causes a primary to be unavailable, you want to cut over to a replica with as little data loss as possible. ([View Highlight](https://read.readwise.io/read/01fe9rwhrns5k9rt7ykwq1g6w7))
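
On the primary, `pg_stat_replication` exposes per-replica lag directly (the time-based lag columns assume PostgreSQL 10 or later); a sketch:

```sql
-- Per-replica replication lag as seen from the primary (PG 10+)
SELECT client_addr, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```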
> Backups ([View Highlight](https://read.readwise.io/read/01fe9rwmjwqphdstyy4wvtv75k))
> I would be remiss if I didn't get a word in about backups. In almost every presentation I give, I try to sneak in a word about the importance of taking [regular backups of your database](https://access.crunchydata.com/documentation/postgres-operator/latest/tutorial/disaster-recovery/#schedule-backups) with a backup management tool like [pgBackRest](https://pgbackrest.org/). If your system has not completed a backup successfully within the period you prescribed, you should definitely fix that. Not only are backups critical in the event of a disaster, they can be used to [help efficiently bootstrap other clusters](https://access.crunchydata.com/documentation/postgres-operator/latest/tutorial/disaster-recovery/#restores). ([View Highlight](https://read.readwise.io/read/01fe9rws6ypmga5gbm6cjytmmy))