本文共 9726 字,大约阅读时间需要 32 分钟。
由于客户提到,运行某些大运算量批处理时,PostgreSQL突发性使用大量资源的问题,
进行了如下的调查,发现确实发生了崩溃。PostgreSQL 需要资源控制方案啊。
现在正在考虑是否必须要用 os 级别的限制方案:
过程如下:
测试环境:
内存:大约1024MB
postgresql.conf 设定:
使用缺省值:checkpoint_segments =3 shard_buffers=32MB
这些是有意为之,就是想看看数据量大、shared_buffer小的状况:
建立表(一条记录大约1024字节):
postgres=# create table test01(id integer, val char(1024));
向表中插入大量数据(总共插入2400MB)
postgres=# insert into test01 values(generate_series(1,2457600),repeat( chr(int4(random()*26)+65),1024));
插入操作会花费一些时间,此时用ps 命令看PostgreSQL个进程对内存使用状况:‘
[root@server ~]# ps aux | grep postroot 3180 0.0 0.0 105296 712 pts/1 S 16:31 0:00 su - postgrespostgres 3181 0.0 0.0 70304 676 pts/1 S+ 16:31 0:00 -bashpostgres 3219 0.0 0.2 113644 2864 pts/1 S 16:32 0:00 /usr/local/pgsql/bin/postgres -D /gao/datapostgres 3221 0.4 3.0 113724 35252 ? Ss 16:32 0:01 postgres: writer process postgres 3222 0.2 0.1 113644 1616 ? Ds 16:32 0:00 postgres: wal writer process postgres 3223 0.0 0.0 114380 1148 ? Ss 16:32 0:00 postgres: autovacuum launcher process postgres 3224 0.0 0.0 73332 472 ? Ss 16:32 0:00 postgres: stats collector process root 3252 0.0 0.0 105296 712 pts/2 S 16:32 0:00 su - postgrespostgres 3253 0.0 0.0 70304 676 pts/2 S 16:32 0:00 -bashpostgres 3285 0.0 0.0 83488 740 pts/2 S+ 16:32 0:00 ./psqlpostgres 3286 14.8 80.2 2598332 924308 ? Ds 16:32 0:35 postgres: postgres postgres [local] INSERTroot 3333 0.0 0.0 65424 812 pts/3 S+ 16:36 0:00 grep post
可以发现,INSERT操作正在消耗80%以上的内存。
再等片刻,发现psql端传来反馈:
WARNING: terminating connection because of crash of another server processDETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.HINT: In a moment you should be able to reconnect to the database and repeat your command.The connection to the server was lost. Attempting reset: Failed.!>
此时看看Log,可以看到:background writer (3321) 已经被干掉了,所有连接被重置。
LOG: autovacuum launcher startedLOG: database system is ready to accept connectionsLOG: checkpoints are occurring too frequently (2 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (1 second apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (2 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (1 second apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (2 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (1 second apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (2 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (4 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (10 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (9 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (7 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (9 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (7 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (5 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (5 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (10 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (11 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (15 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (23 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (7 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (8 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (13 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (7 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (8 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (8 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (8 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (4 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (5 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: checkpoints are occurring too frequently (9 seconds apart)HINT: Consider increasing the configuration parameter "checkpoint_segments".LOG: background writer process (PID 3221) was terminated by signal 9: KilledLOG: terminating any other active server processesWARNING: terminating connection because of crash of another server processDETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.HINT: In a moment you should be able to reconnect to the database and repeat your command.WARNING: terminating connection because of crash of another server processDETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.HINT: In a moment you should be able to reconnect to the database and repeat your command.WARNING: terminating connection because of crash of another server processDETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.HINT: In a moment you should be able to reconnect to the database and repeat your command.WARNING: terminating connection because of crash of another server processDETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.HINT: In a moment you should be able to reconnect to the database and repeat your command.LOG: all server processes terminated; reinitializingFATAL: the database system is in recovery modeLOG: database system was interrupted; last known up at 2013-08-30 16:36:42 CSTLOG: database system was not properly shut down; automatic recovery in progressLOG: consistent recovery state reached at 0/B7657BD0LOG: redo starts at 0/B60FE2B8LOG: unexpected pageaddr 0/B044C000 in log file 0, segment 184, offset 4505600LOG: redo done at 0/B844B940LOG: autovacuum launcher startedLOG: database system is ready to accept connections
各个进程都重新生成了:
[root@server ~]# ps aux | grep postroot 3180 0.0 0.0 105296 660 pts/1 S 16:31 0:00 su - postgrespostgres 3181 0.0 0.0 70304 628 pts/1 S+ 16:31 0:00 -bashpostgres 3219 0.0 0.5 113644 6276 pts/1 S 16:32 0:00 /usr/local/pgsql/bin/postgres -D /gao/dataroot 3252 0.0 0.0 105296 660 pts/2 S 16:32 0:00 su - postgrespostgres 3253 0.0 0.0 70304 628 pts/2 S 16:32 0:00 -bashpostgres 3285 0.0 0.0 83488 1144 pts/2 S+ 16:32 0:00 ./psqlpostgres 3348 0.0 0.0 113644 984 ? Ss 16:40 0:00 postgres: writer process postgres 3349 0.0 0.0 113644 732 ? Ss 16:40 0:00 postgres: wal writer process postgres 3350 0.0 0.1 114336 1756 ? Ss 16:40 0:00 postgres: autovacuum launcher process postgres 3351 0.0 0.0 73332 580 ? Ss 16:40 0:00 postgres: stats collector process root 3361 0.0 0.0 65420 796 pts/3 R+ 16:44 0:00 grep post[root@server ~]#
回到psql端再看,发现连接已经失效了:
!> \Invalid command \. Try \? for help.!> \dt;You are currently not connected to a database.!>
根据向社区提问,据说是因为OS级别的OOM错误发生,所以杀死了Postmaster进程。
总之,此种情况表明,如果没有对资源消费总量的控制,那么突发性的用户访问会导致崩溃的。
本文转自健哥的数据花园博客园博客,原文链接:http://www.cnblogs.com/gaojian/p/3291938.html,如需转载请自行联系原作者