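exachk -- Oracle's best practices and recommended configuration values for Exadata
ILOM -- collects Exadata hardware fault information
sundiag -- mainly collects hardware information, including RAID, HCA, and InfiniBand
awr -- database performance reports
alert log -- logs from the compute nodes and storage nodes (dcli -g cell_group -l root "cellcli -e list alerthistory")
osw -- operating system information
crs -- diagcollection.sh; starting with 11.2.0.4, TFA can also be used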
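Exachk is the health check tool for the Exadata Database Machine. Run it regularly to collect system information from the machine and generate a health check report. Checked against Oracle's best practices and recommended configuration values for Exadata, the report helps you find potential problems early and remove hidden risks, keeping the machine stable so that it can deliver its full performance. Exachk is updated and improved regularly, so download the latest version each time you use it; the latest version is available from Note 1070954.1. For more information, see Doc 757552.1 for the current Oracle best practice recommendations and Doc 888828.1 for the versions currently supported on Exadata.
What information does exachk collect?
Exachk is comprehensive: the components it checks include the database servers, the storage servers, and the InfiniBand and Ethernet networks. The tool is simple to use and removes the tedious steps of manual collection. Everything it collects relates to the availability and stability of Exadata and to the security of the database architecture; it does not collect the data stored in the database. Running it has almost no impact on the system, so it is safe to use. When collection finishes, you can assess the overall health of the system, covering software, hardware, firmware versions, and configuration.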
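When should you run exachk?
A. After the initial installation completes successfully
B. Before and after changing the system configuration (for example, replacing a hard disk or flash card)
C. Before and after system maintenance (for example, upgrades and patching)
D. As a regular health check (every 2 months)
How do you get the latest version of exachk?
Oracle Exadata Database Machine exachk or HealthCheck (Doc ID 1070954.1)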
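How do you run exachk?
The latest exachk version at the time of writing is 12.1.0.2.5_20151023. A brief introduction and the basic steps follow:
1. Upload exachk.zip to one of the database server nodes; the recommended path is /opt/oracle.SupportTools/exachk
2. Unzip exachk.zip
3. For 11g databases you can run it as the oracle user (consider using VNC to guard against network interruptions); starting with version 12.1.0.2.2, Oracle recommends running exachk as root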
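Steps 1 and 2 could be scripted roughly as follows. This is a minimal sketch: the function name stage_exachk and the /tmp/exachk.zip upload location are assumptions made for illustration; only the default target directory comes from the recommendation above.

```shell
# Unpack an uploaded exachk.zip into the target directory.
# stage_exachk is a hypothetical helper name, not part of exachk.
stage_exachk() {
  zip_file=$1
  target_dir=${2:-/opt/oracle.SupportTools/exachk}

  # Refuse to continue (and create nothing) if the upload is missing.
  if [ ! -f "$zip_file" ]; then
    echo "zip file not found: $zip_file" >&2
    return 1
  fi

  mkdir -p "$target_dir" || return 1
  # -o overwrites files left over from an older exachk release
  unzip -o -q "$zip_file" -d "$target_dir"
}

# Usage (assumed upload location):
#   stage_exachk /tmp/exachk.zip
```

Unpacking each new release into the same directory keeps the path stable for anything that refers to it, at the cost of overwriting the previous release.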
Note: if you mistype the root password, you have to wait 10 minutes.
How do you collect ILOM information?
You can collect the information in two ways: through the web interface or from the command line.
A. Using the web interface to log in to ILOM
- Open the web ILOM interface at: http://<hostname of switch>
- Go to the 'Maintenance' tab
- Go to the 'Snapshot' tab
- Select Data Set=normal and choose the preferred Transfer Method
- Select 'Run'
The ILOM interface differs somewhat between versions:
Maintenance-->Snapshot-->Data Set(select Normal)-->Transfer Output File(Browser)-->Run(Click on it)-->Save the file to Desktop
ILOM Administration-->Maintenance-->Snapshot-->Data Set(select Normal)-->Transfer Output File(Browser)-->Run(Click on it)-->Save the file to Desktop
B. Using the command line
# ssh db01-ilom
Password:
Sun(TM) Integrated Lights Out Manager
Version 3.0.9.19.a r55943
Copyright 2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
->
-> cd /SP/diag
-> set snapshot dataset=normal
-> set snapshot dump_uri=sftp://root:welcome1@192.168.16.4/tmp
-> show snapshot result

When snapshot data is fully transferred to the specified location, this command output will indicate the status as "completed".
Note: dump_uri=sftp://root:<password>@<ip of host>/<dir to upload snapshot to>
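Because the dump_uri is easy to mistype, it can help to generate the ILOM commands from their parts and paste them into the ILOM session. The helper below is a sketch only: ilom_snapshot_cmds is a hypothetical name, and it assumes the password contains no characters that would need escaping inside a URI.

```shell
# Print the ILOM CLI commands for a snapshot upload over SFTP.
# ilom_snapshot_cmds is a hypothetical helper, not an ILOM command.
# Arguments: <user> <password> <host> <dir>, where <dir> starts with "/".
ilom_snapshot_cmds() {
  if [ "$#" -ne 4 ]; then
    echo "usage: ilom_snapshot_cmds <user> <password> <host> <dir>" >&2
    return 1
  fi
  printf 'cd /SP/diag\n'
  printf 'set snapshot dataset=normal\n'
  # Same layout as the note above: sftp://<user>:<password>@<host>/<dir>
  printf 'set snapshot dump_uri=sftp://%s:%s@%s%s\n' "$1" "$2" "$3" "$4"
  printf 'show snapshot result\n'
}

# Usage, with the host and directory from the session above and a
# placeholder password:
#   ilom_snapshot_cmds root 'secret' 192.168.16.4 /tmp
```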
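Running exachk interactively
When exachk runs, it prompts for Yes or No answers to confirm that you want to collect data from the system, offers some options, and asks for passwords (exachk does not save passwords to a file). The script then starts working: it collects the raw data and analyzes it at the end. The raw data and the analysis results are stored in a directory structure organized by date. For details, see the exachk user guide included with the documentation files. Exachk has a watchdog process that monitors the execution state of exachk; it sets a default "timeout" value to keep exachk from hanging. On a busy system, a check that does not respond within the default time is terminated. You can extend the default "timeout" values by setting environment variables (RAT_TIMEOUT and RAT_ROOT_TIMEOUT).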
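For example, the two variables could be exported in the shell that will launch exachk. The numbers below are assumptions chosen for illustration and are believed to be in seconds; confirm the units and the defaults in the exachk user guide before relying on them.

```shell
# Assumed values; pick numbers that suit how busy the system is.
export RAT_TIMEOUT=120        # watchdog timeout named in the text above
export RAT_ROOT_TIMEOUT=600   # presumably applies to the root privileged collections

# Show what exachk will inherit when started from this shell.
echo "RAT_TIMEOUT=$RAT_TIMEOUT RAT_ROOT_TIMEOUT=$RAT_ROOT_TIMEOUT"
```

Start exachk from the same shell session so that it inherits both variables.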
Log in to a compute node as the oracle user, change to the exachk directory, run the following command, and answer the prompts:
./exachk
If the environment variable "CRS_HOME" is not set, the first message and prompt is:

CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0/grid?[y/n][y]

Type "y" and press the return key, or just press the return key. exachk checks for SSH configuration. If the environment has SSH for the "oracle" userid configured to the other database servers in the cluster, the next message is:

Checking ssh user equivalency settings on all nodes in cluster
Node randomdb02 is configured for ssh user equivalency for oracle user

If SSH is not configured for the "oracle" userid, the messaging is different, and you will be prompted later for how you wish to proceed. exachk next determines the list of OCR registered databases and displays this message and prompt:

Searching for running databases . . . . .
. .
List of running databases registered in OCR
1. dbm
2. dss
3. All of above
4. None of above
Select databases from list for checking best practices. For multiple databases, select 3 for All or comma separated number like 1,2 etc [1-4][3].1

In most cases, you will want to simply press the return key to evaluate all discovered databases. This example enters "1" to select only the "dbm" database. exachk next queries the state of the Oracle software stack and reports its findings:

Searching out ORACLE_HOME for selected databases.
. . . .
Checking Status of Oracle Software Stack - Clusterware, ASM, RDBMS
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
------------------------------------------------------------------------------------------------
                                 Oracle Stack Status
------------------------------------------------------------------------------------------------
Host Name   CRS Installed  ASM HOME  RDBMS Installed  CRS UP  ASM UP  RDBMS UP  DB Instance Name
------------------------------------------------------------------------------------------------
randomdb01  Yes            Yes       Yes              Yes     Yes     Yes       dbm1
randomdb02  Yes            Yes       Yes              Yes     Yes     Yes       dbm2
------------------------------------------------------------------------------------------------

Execution continues without user intervention, and the following is typically presented:

root user equivalence is not setup between randomdb01 and STORAGE SERVER randomcel01.
1. Enter 1 if you will enter root password for each STORAGE SERVER when prompted.
2. Enter 2 to exit and configure root user equivalence manually and re-run exachk.
3. Enter 3 to skip checking best practices on STORAGE SERVER.
Please indicate your selection from one of the above options[1-3][1]:-

In most environments, SSH is not configured for the "root" userid on the storage servers from the "oracle" userid on the database servers. There are several options here. The most common answer is to select "1", and exachk will prompt for the "root" userid password to store only in memory for the duration of the run and use with the expect command to log on to the storage servers. The second option is to enter "2" to exit exachk at this point, configure SSH for the "root" userid on the storage servers from the "oracle" userid on the database server, and restart exachk. The third option is to enter "3" to skip the storage server checks and continue the exachk run. This will result in the storage server findings being absent from the exachk report. This example presses the return key, which enters "1" by default. The next prompt is:

Is root password same on all STORAGE SERVER[y/n][y]

In most cases, enter "y" and press the return key, or simply press the return key. If each storage server has a unique "root" userid password, answer this prompt "n", and you will be prompted for the password for each individual storage server. The next prompt is:

Enter root password for STORAGE SERVER :-

Type in the common "root" userid password for the storage servers and press the return key. exachk next verifies this password on the storage servers. Beginning with Exadata Storage Server Software version 11.2.3.1.0, if the password does not verify, exachk will exit after posting the message shown below:

The password entered for the root userid did not validate on 192.168.32.19
This userid may now be subject to a login delay on the specified node.
Please review the pam utility configuration, and allow the specified amount of login delay time to elapse before retrying exachk.
Please also check your pam failed login counts for this userid against the permitted total, and clear if required.
exachk is exiting.

If the password verifies, you will see the following prompt for the database servers:

101 of the included audit checks require root privileged data collection on DATABASE SERVER. If sudo is not configured or the root password is not available, audit checks which require root privileged data collection can be skipped.
1. Enter 1 if you will enter root password for each on DATABASE SERVER host when prompted
2. Enter 2 if you have sudo configured for oracle user to execute root_exachk.sh script on DATABASE SERVER
3. Enter 3 to skip the root privileged collections on DATABASE SERVER
4. Enter 4 to exit and work with the SA to configure sudo on DATABASE SERVER or to arrange for root access and run the tool later.
Please indicate your selection from one of the above options[1-4][1]:-

The most common option is to select "1" and provide the "root" userid password for the database servers, similar to the method discussed earlier for the storage servers. If you already have sudo access from the "oracle" userid on the database servers to the "root" userid on the database servers, enter "2". The third option is to enter "3" to skip the database server checks and continue the exachk run. This will result in the database server findings being absent from the exachk report. Enter "4" if you wish to exit the exachk run, and get sudo configured from the "oracle" userid on the database servers to the "root" userid on the database servers so that you can restart exachk and select option "2" at this prompt later. This example presses the return key for the default action of "1". The next prompt is:

Is root password same on all compute nodes?[y/n][y]

In most cases, enter "y" and press the return key, or simply press the return key. If each database server has a unique "root" userid password, answer this prompt "n", and you will be prompted for the password for each individual database server. The next prompt is:

Enter root password on DATABASE SERVER:-

Type in the common "root" userid password for the database servers and press the return key. exachk next verifies this password on the database servers. Beginning with Exadata Storage Server Software version 11.2.3.1.0, if the password does not verify, exachk will exit after posting the message shown below:

The password entered for the root userid did not validate on 192.168.32.19
This userid may now be subject to a login delay on the specified node.
Please review the pam utility configuration, and allow the specified amount of login delay time to elapse before retrying exachk.
Please also check your pam failed login counts for this userid against the permitted total, and clear if required.
exachk is exiting.

If the password verifies, you will see the following prompt for the InfiniBand switches:

9 of the included audit checks require nm2user privileged data collection on INFINIBAND SWITCH.
1. Enter 1 if you will enter nm2user password for each INFINIBAND SWITCH when prompted
2. Enter 2 to exit and to arrange for nm2user access and run the exachk later.
3. Enter 3 to skip checking best practices on INFINIBAND SWITCH
Please indicate your selection from one of the above options[1-3][1]:-

The most common option is to select "1" and provide the "nm2user" userid password for the InfiniBand switches, similar to the method discussed earlier for the storage servers. Enter "2" if you wish to exit the exachk run, and acquire the "nm2user" userid password for the InfiniBand switches. Enter "3" to skip the InfiniBand switch checks and continue the exachk run. This will result in the InfiniBand switch findings being absent from the exachk report. This example presses the return key for the default action of "1". The next prompt is:

Is nm2user password same on all INFINIBAND SWITCH ?[y/n][y]

In most cases, enter "y" and press the return key, or simply press the return key. If each InfiniBand switch has a unique "nm2user" userid password, answer this prompt "n", and you will be prompted for the password for each individual InfiniBand switch. The next prompt is:

Enter nm2user password for INFINIBAND SWITCH:-

Type in the common "nm2user" userid password for the InfiniBand switches and press the return key. exachk next verifies this password on the InfiniBand switches. If the password does not verify, the following message will be posted:

nm2user password for randomsw-ib1.us.oracle.com was incorrect. 2 retries remaining.
Enter nm2user password for randomsw-ib1.us.oracle.com :-

Reenter the password, and exachk will try again. exachk will try a total of three times before it asks if you want to skip the InfiniBand switches and proceed, or exit the execution.

NOTE: exachk version 2.2.1 uses the "nm2user" userid for InfiniBand switch validation. If you wish to have exachk use the "root" userid for IB switch validation (the original pre-version 2.2.1 behavior), set the RAT_IBSWITCH_USER environment variable. For example:

export RAT_IBSWITCH_USER=root

If the password verifies, you will see the data collection process begin as shown below:

*** Checking Best Practice Recommendations (PASS/WARNING/FAIL) ***
Log file for collections and audit checks are at
/home/oracle/exachk_215/20120524/exachk_053012_102825/exachk.log
=============================================================
Node name - randomdb01
=============================================================
Collecting - ASM Diskgroup Attributes
Collecting - ASM initialization parameters
Collecting - Database Parameters for dbm database
Collecting - Database Undocumented Parameters for dbm database
Collecting - Clusterware and RDBMS software version
<output truncated for brevity>

Data collection will continue across the components of the machine, identifying each component by name and echoing back to the screen the checks that are being performed. The collection process occurs on the first database server, and then it moves to the first storage server, as shown below:

Preparing to run root privileged commands on STORAGE SERVER randomcel01
root@192.168.32.19's password:
Collecting - Ambient Temperature on storage server
Collecting - Exadata software version on storage server
Collecting - Exadata software version on storage servers
Collecting - Exadata storage server system model number
<output truncated for brevity>

When data collection is complete for all storage servers, exachk moves on to the InfiniBand switches, as shown below:

Preparing to run root privileged commands on INFINIBAND SWITCH randomsw-ib1.us.oracle.com.
root@randomsw-ib1.us.oracle.com's password:
Collecting - Hostname in /etc/hosts
Collecting - Infiniband Switch NTP configuration
Collecting - Infiniband subnet manager status
Collecting - Infiniband switch HCA status
<output truncated for brevity>

When data collection is complete for InfiniBand switches, exachk performs its analysis phase on the local database server from which it was launched, as shown below:

Data collections completed. Checking best practices on randomdb01.
--------------------------------------------------------------------------------------
FAIL => A minimum of two controlfiles are not stored in high redundancy diskgroups for dbm
INFO => Number of SCAN listeners is NOT equal to the recommended number of 3.
<output truncated for brevity>

At this time, exachk also performs analysis on the storage servers and InfiniBand switches, databases, and MAA Scorecard. When this activity is complete, exachk next gathers the data from the remaining database servers in the cluster:

=============================================================
Node name - randomdb02
=============================================================
Collecting - Clusterware and RDBMS software version
Collecting - Compute node PCI bus slot speed for infiniband HCAs
Collecting - Kernel parameters
<output truncated for brevity>

The data collection is followed by the analysis for each remaining database server, as shown below:

Data collections completed. Checking best practices on randomdb02.
--------------------------------------------------------------------------------------
FAIL => A minimum of two controlfiles are not stored in high redundancy diskgroups for dbm
INFO => Number of SCAN listeners is NOT equal to the recommended number of 3.
<output truncated for brevity>

When all of the remaining database servers have been processed, exachk performs clusterwide checks and analysis, as shown below:

---------------------------------------------------------------------------------
CLUSTERWIDE CHECKS
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------

Typically, there is very little or no screen output following the clusterwide checks banner. The last screen output to appear is the file reference information, as shown below:

Detailed report (html) - /home/oracle/exachk_215/20120524/exachk_dbm_053012_102825/exachk_dbm_053012_102825.html
UPLOAD(if required) - /home/oracle/exachk_215/20120524/exachk_dbm_053012_102825.zip
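Since every run writes its report into a new dated directory, a small helper can locate the most recent HTML report. This is a sketch under assumptions: latest_exachk_report is a hypothetical name, the file pattern is inferred from the report path shown above, and file modification time is taken as a proxy for "most recent run".

```shell
# Print the most recently modified exachk HTML report under a directory.
# latest_exachk_report is a hypothetical helper, not part of exachk.
latest_exachk_report() {
  base_dir=${1:-.}
  # ls -t lists newest first; head keeps only the first path.
  find "$base_dir" -type f -name 'exachk_*.html' -exec ls -t {} + 2>/dev/null | head -n 1
}

# Usage, with the output directory from the example run above:
#   latest_exachk_report /home/oracle/exachk_215
```

The function prints nothing when no report is found, so callers can test for an empty result.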
exachk help
$ ./exachk -h
Usage : ./exachk [-abvhpfmsuSo:c:t:]
    -h      Prints this page.
    -a      All (Perform best practice check and recommended patch check)
    -b      Best Practice check only. No recommended patch check
    -v      Show version
    -p      Patch check only
    -m      exclude checks for Maximum Availability Architecture (MAA) scorecards(see user guide for more details)
    -u      Run exachk to check pre-upgrade or post-upgrade best practices for 11.2.0.3 and above
            -o pre or -o post is mandatory with -u option like ./exachk -u -o pre
    -f      Run Offline.Checks will be performed on data already collected from the system
    -o      Argument to an option. if -o is followed by v,V,Verbose,VERBOSE or Verbose, it will print checks which passs on the screen
            if -o option is not specified,it will print only failures on screen. for eg: exachk -a -o v
    -clusternodes
            Pass comma separated node names to run exachk only on subset of nodes.
    -dbnames
            Pass comma separated database names to run exachk only on subset of databases
    -localonly
            Run exachk only on local node.
    -debug
            Run exachk in debug mode. Debug log will be generated.
            eg:- ./exachk -debug
    -dbnone
            Do not prompt database selection and skip all database related checks.
    -dball
            Do not prompt database selection and run database related checks on all databases discovered on system.
    -c      Used only under the guidance of Oracle support or development to override default components
    -upgrade
            Used to force upgrade the version of exachk being run.

    Report Options:
    -nopass
            Skip PASS'ed check to print in exachk report and upload to database.
    -noscore
            Do not print healthscore in HTML report.
    -diff <Old Report> <New Report> [-outfile <Output HTML>]
            Diff two exachk reports. Pass directory name or zip file or html report file as <Old Report> & <New Report>
    -exadiff <Exalogic collection1> <Exalogic collection2>
            Compare two different Exalogic rack and see if both are from the same release.Pass directory name or zip file as <Exalogic collection1> & <Exalogic collection2> (applicable for Exalogic only)
    -merge
            Pass comma separated collection names(directory or zip files) to merge collections and prepare single report.
            eg:- ./exachk -merge \
                 exachk_hostname1_db1_120213_163405.zip, \
                 exachk_hostname2_db2_120213_164826.zip

    Auto Restart Options:
    -auto_restart
            -h: Prints help for this option
            -<initsetup|initrmsetup|initcheck|initpresetup>
            initsetup    : Setup auto restart. Auto restart functionality automatically brings up exachk daemon when node starts
            initrmsetup  : Remove auto restart functionality
            initcheck    : Check if auto restart functionality is setup or not
            initpresetup : Sets root user equivalency for COMPUTE, STORAGE and IBSWITCHES.(root equivalency for COMPUTE nodes is mandatory for setting up auto restart functionality)

    Daemon Options:
    -d <start|start_debug|stop|status|info|stop_client|nextautorun|-h>
            start        : Start the exachk daemon
            start_debug  : Start the exachk daemon in debug mode
            stop         : Stop the exachk daemon
            status       : Check if the exachk daemon is running
            info         : Print information about running exachk daemon
            stop_client  : Stop the exachk daemon client
            nextautorun [-id <ID>] : print the next auto run time
                if '-id <ID>' is specified, it will print the next auto run time for specified autorun schedule ID
            -h           : Prints help for this option
    -daemon
            run exachk only if daemon is running
    -nodaemon
            Dont use daemon to run exachk
    [-id <ID>] -set
            configure exachk daemon parameter like 'param1=value1;param2=value2... '
            if '-id <ID>' is specified, it will configure exachk daemon parameter(s) for specified autorun schedule ID

            Supported parameters are:-

            (Deprecated) - AUTORUN_INTERVAL <n[d|h]> :- Automatic rerun interval in daemon mode. Set it zero to disable automatic rerun which is zero.

            AUTORUN_SCHEDULE * * * *  :- Automatic run at specific time in daemon mode.
                             - - - -
                             | | | |
                             | | | +----- day of week (0 - 6) (0 to 6 are
                             | | |        Sunday to Saturday)
                             | | +---------- month (1 - 12)
                             | +--------------- day of month (1 - 31)
                             +-------------------- hour (0 - 23)

            example: exachk -set 'AUTORUN_SCHEDULE=8,20 * * 2,5' will schedule runs on tuesday and friday at 8 and 20 hour.

            AUTORUN_FLAGS <flags> : exachk flags to use for auto runs.

            example: exachk -set 'AUTORUN_INTERVAL=12h;AUTORUN_FLAGS= -profile sysadmin' to run sysadmin profile every 12 hours
                     exachk -set 'AUTORUN_INTERVAL=2d;AUTORUN_FLAGS=-profile dba' to run dba profile once every 2 days.

            NOTIFICATION_EMAIL : Comma separated list of email addresses used for notifications by daemon if mail server is configured.

            PASSWORD_CHECK_INTERVAL <number of hours> : Interval to verify passwords in daemon mode

            COLLECTION_RETENTION <number of days> : Purge exachk collection directories and zip files older than specified days.
    [-id <ID>] -unset <parameter | all>
            unset the parameter
            if '-id <ID>' is specified, it will unset the parameter for specified autorun schedule ID
            example: exachk -unset AUTORUN_SCHEDULE
    [-id <ID>] -get <parameter | all>
            Print the value of parameter.
            if '-id <ID>' is specified, it will print the value of parameter for specified autorun schedule ID
    -vmguest
            Pass comma separated filenames containing exalogic guest VM list(applicable for Exalogic only)
    -hybrid [-phy]
            phy :Pass comma separated physical compute nodes (applicable for Exalogic only)
            eg:- ./exachk -hybrid -phy phy_node1,phy_node2

    Profile Run Options:
    -profile
            Pass specific profile. With -h prints help.
            List of supported profiles:

            asm             asm Checks
            clusterware     Oracle clusterware checks
            compute_node    Compute Node checks (Exalogic only)
            control_VM      Checks only for Control VM(ec1-vm, ovmm, db, pc1, pc2). No cross node checks
            dba             dba Checks
            ebs             Oracle E-Business Suite checks
            el_extensive    Extensive EL checks
            el_lite         Exalogic-Lite Checks(Exalogic Only)
            el_rackcompare  Data Collection for Exalogic Rack Comparison Tool(Exalogic Only)
            goldengate      Oracle GoldenGate checks
            maa             Maximum Availability Architecture Checks
            obiee           obiee Checks(Exalytics Only)
            preinstall      Pre-installation checks
            storage         Oracle Storage Server Checks
            switch          Infiniband switch checks
            sysadmin        sysadmin checks
            timesten        timesten Checks(Exalytics Only)
            virtual_infra   OVS, Control VM, NTP-related and stale VNICs check (Exalogic Only)
            zfs             ZFS storage appliances checks (Exalogic Only)
    -excludeprofile
            Pass specific profile.
            List of supported profiles is same as for -profile.
    -cells
            Pass comma separated storage server names to run exachk only on selected storage servers.
    -ibswitches
            Pass comma separated infiniband switch names to run exachk only on selected infiniband switches.
    -zfsnodes
            Pass comma separated ZFS storage appliance names to run exachk only on selected storage appliances.
    -dbserial
            Run SQL, SQL_COLLECT and OS Checks in serial
    -dbparallel [n]
            Run SQL, SQL_COLLECT and OS Checks in parallel.
            n : Specified number of Child processes. Default is 25% of CPUs.

NOTE: exachk 2.2.5 introduced second level help for improved readability:

$ ./exachk -d -h
    -d <start|start_debug|stop|status|info|stop_client|nextautorun|-h>
            start        : Start the exachk daemon
            start_debug  : Start the exachk daemon in debug mode
            stop         : Stop the exachk daemon
            status       : Check if the exachk daemon is running
            info         : Print information about running exachk daemon
            stop_client  : Stop the exachk daemon client
            nextautorun [-id <ID>] : print the next auto run time
                if '-id <ID>' is specified, it will print the next auto run time for specified autorun schedule ID
            -h           : Prints help for this option
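The -set option in the help output takes several parameters joined by semicolons, and AUTORUN_SCHEDULE has four fields (hour, day of month, month, day of week). The sketch below assembles such an argument and checks the field count before printing the command. The helper names exachk_set_arg and autorun_schedule_ok, and the address dba@example.com, are assumptions made for illustration; nothing here calls exachk itself.

```shell
# Join 'name=value' pairs into the single string that -set expects.
# Runs in a subshell so the IFS change does not leak to the caller.
exachk_set_arg() (
  IFS=';'
  printf '%s\n' "$*"
)

# Succeed only if an AUTORUN_SCHEDULE value has exactly four fields
# (hour, day of month, month, day of week).
autorun_schedule_ok() (
  set -f            # keep '*' literal while splitting on whitespace
  set -- $1
  [ "$#" -eq 4 ]
)

# 8 and 20 hour on Tuesday and Friday, taken from the help text.
schedule='8,20 * * 2,5'

if autorun_schedule_ok "$schedule"; then
  arg=$(exachk_set_arg "AUTORUN_SCHEDULE=$schedule" 'NOTIFICATION_EMAIL=dba@example.com')
  # Print the command rather than run it, so it can be reviewed first.
  echo "./exachk -set '$arg'"
fi
```

Quoting the whole argument in single quotes, as the help examples do, keeps the shell from expanding the asterisks in the schedule.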
Oracle Exadata Best Practices (Doc ID 757552.1)
Oracle Exadata Database Machine exachk or HealthCheck (Doc ID 1070954.1)
Database Machine and Exadata Storage Server 11g Release 2 (11.2) Supported Versions (Doc ID 888828.1)