Friday, March 16, 2012

AIX GENERAL TROUBLESHOOTING


A.  AIX GENERAL TROUBLESHOOTING


1)  File System Space Usage

Check for disk space problems. 

# df -I   (Shows total, used, and free space in 512-byte blocks; no inode columns)
Filesystem    512-blocks      Used      Free %Used Mounted on
/dev/hd4        17301504   5926488  11375016   35% /
/dev/hd2        10485760   4583816   5901944   44% /usr

# df -k   (Checks for disk space usage in 1K blocks)
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4          8650752   5687508   35%    39729     2% /
/dev/hd2          5242880   2950972   44%    35227     3% /usr

# df -g   (Checks for disk space usage in gigabyte blocks)
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4           8.25      5.42   35%    39729     2% /
/dev/hd2           5.00      2.81   44%    35227     3% /usr

# df -gP   (POSIX view with different heading names)
Filesystem    GB blocks   Used Available Capacity Mounted on
/dev/hd4           8.25   2.83      5.42      35% /
/dev/hd2           5.00   2.19      2.81      44% /usr

Note that df -k and df -g list both the disk space usage (%Used) and the inode usage (%Iused).

Pay close attention and do not confuse the two when checking file system space: a file system can run out of inodes while it still has free blocks.
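Both percentages can be checked in one pass. The following is a minimal sketch, assuming the AIX df -k column layout shown above (%Used in column 4, %Iused in column 6, mount point in column 7). The function name fs_over_threshold and the limit value are illustrative and not part of AIX.

```shell
#!/bin/sh
# fs_over_threshold LIMIT
# Reads AIX-style `df -k` output on stdin and prints one line for each
# file system whose block usage (%Used) or inode usage (%Iused) is at or
# above LIMIT percent. Column positions are assumed from the sample above.
fs_over_threshold() {
    awk -v limit="$1" '
        NR == 1 { next }                     # skip the header line
        {
            used = $4; iused = $6
            sub(/%/, "", used); sub(/%/, "", iused)
            if (used + 0 >= limit + 0)
                printf "%s space %d%%\n", $7, used
            if (iused + 0 >= limit + 0)
                printf "%s inodes %d%%\n", $7, iused
        }'
}

# Run against the sample output from this section (limit of 40 percent).
# On a live system this would be: df -k | fs_over_threshold 90
fs_over_threshold 40 <<'EOF'
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4          8650752   5687508   35%    39729     2% /
/dev/hd2          5242880   2950972   44%    35227     3% /usr
EOF
```

With the sample data only /usr is reported, for space, because its %Used of 44 exceeds the limit of 40 while both %Iused values are far below it.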




Use lsps to check paging/swap space usage:

The lsps command displays the characteristics of paging spaces, such as paging space name, physical volume name, volume group name, size, percentage of the paging space used, status of space, and it shows if the paging space is set to automatic.

# lsps -a   (Note that this system is paging quite a bit)
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
paging00        hdisk0            rootvg       10752MB    45     yes   yes    lv
hd6             hdisk1            rootvg        2560MB    45     yes   yes    lv
hd6             hdisk2            rootvg        8192MB    45     yes   yes    lv

or

# swap -s
allocated = 5505024 blocks used = 2458677 blocks free = 3046347 blocks
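The swap -s figures can be turned into the same units that lsps reports. The sketch below assumes the counts are 4 KB blocks, which agrees with the sample data (5505024 blocks x 4 KB = 21504 MB, the sum of the three paging spaces listed by lsps -a). The function name swap_summary is illustrative.

```shell
#!/bin/sh
# swap_summary
# Reads one line of `swap -s` output on stdin and prints the total paging
# space in MB and the percentage in use. Assumes 4 KB blocks and the
# "allocated = N blocks used = N blocks free = N blocks" layout shown above.
swap_summary() {
    awk '{
        alloc = $3; used = $7
        if (alloc > 0)
            printf "total=%dMB used=%.1f%%\n", alloc * 4 / 1024, used * 100 / alloc
    }'
}

# Sample line from this section.
# On a live system this would be: swap -s | swap_summary
echo "allocated = 5505024 blocks used = 2458677 blocks free = 3046347 blocks" | swap_summary
```

For the sample line this reports a total of 21504 MB with roughly 45 percent in use, in line with the %Used column from lsps -a.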

2)   Load Average


# uptime
11:14AM   up 10 days,  21:02,  2 users,  load average: 0.05, 0.05, 0.03

*Note: The load average numbers give the average number of jobs/processes in the run queue over the last 1, 5, and 15 minutes.  The lowest possible load average is zero.  What counts as high depends on the number of CPUs: a load average of one or two is typical on a small system, and a sustained value well above the number of logical CPUs could indicate a critical issue on the system.
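Because the run queue is shared across processors, a load figure is easier to judge once it is divided by the number of logical CPUs. The following is a rough sketch, assuming the uptime line format shown above. The function name load_per_cpu is illustrative, and the CPU count has to be supplied by the caller (for example from the lcpu= value in the vmstat header).

```shell
#!/bin/sh
# load_per_cpu NCPU
# Reads one line of `uptime` output on stdin and prints the 1-minute load
# average divided by NCPU, the number of logical CPUs.
load_per_cpu() {
    sed 's/.*load average: *//' |
        awk -F', *' -v n="$1" 'n > 0 { printf "%.2f\n", $1 / n }'
}

# Example with a made-up busy system: a 1-minute load of 6.00 on 8 CPUs.
# On a live system this would be: uptime | load_per_cpu 8
echo "11:14AM   up 10 days,  21:02,  2 users,  load average: 6.00, 5.00, 4.00" | load_per_cpu 8
```

A result near or above 1.00 means there is, on average, at least one runnable job per logical CPU.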


B.  SYSTEM PERFORMANCE


1)  CPU and Memory Usage

 The vmstat command reports statistics about kernel threads, virtual memory, disks, traps, and CPU activity.
*us = user time, sy = system time, id = CPU idle time, wa = CPU idle time during which the system had outstanding disk I/O requests (I/O wait).

# vmstat 5 5
System Configuration: lcpu=8 mem=16384MB
kthr     memory             page              faults        cpu    
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 5  1 4818381 24300   0   2   2 636  859   0 2048 321280 9460 24 18 53  5
 6  1 4817085 25591   0   0   0   0    0   0 1838 593223 4798 53 21 23  4
 7  1 4811637 31031   0   0   0   0    0   0 1975 265643 4706 30 13 49  8
 2  1 4813001 29650   0   0   0   0    0   0 1814 95041 7491  8 10 76  6
 4  1 4818874 23769   0   0   0   0    0   0 1864 53014 4428  5  7 81  7
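When several samples are collected, it can help to average the CPU columns rather than read them line by line. This is a minimal sketch that assumes the layout shown above, where us, sy, id and wa are the last four fields of each sample line; if a system prints extra columns after wa, the offsets would need adjusting. The function name vmstat_cpu_avg is illustrative.

```shell
#!/bin/sh
# vmstat_cpu_avg
# Reads `vmstat INTERVAL COUNT` output on stdin and prints the average of
# the us, sy, id and wa columns, taken as the last four fields of every
# sample line. Header lines are skipped because they do not start with a
# number.
vmstat_cpu_avg() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 17 {
            us += $(NF-3); sy += $(NF-2); id += $(NF-1); wa += $NF; n++
        }
        END {
            if (n > 0)
                printf "us=%.1f sy=%.1f id=%.1f wa=%.1f\n", us/n, sy/n, id/n, wa/n
        }'
}

# Run against the five sample lines from this section.
# On a live system this would be: vmstat 5 5 | vmstat_cpu_avg
vmstat_cpu_avg <<'EOF'
System Configuration: lcpu=8 mem=16384MB
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 5  1 4818381 24300   0   2   2 636  859   0 2048 321280 9460 24 18 53  5
 6  1 4817085 25591   0   0   0   0    0   0 1838 593223 4798 53 21 23  4
 7  1 4811637 31031   0   0   0   0    0   0 1975 265643 4706 30 13 49  8
 2  1 4813001 29650   0   0   0   0    0   0 1814 95041 7491  8 10 76  6
 4  1 4818874 23769   0   0   0   0    0   0 1864 53014 4428  5  7 81  7
EOF
```

For the sample above the averages are us=24.0, sy=13.8, id=56.4 and wa=6.0.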

An I/O-oriented view is available with the -I option:

# vmstat -I 5 5
System Configuration: lcpu=8 mem=16384MB
  kthr      memory             page              faults        cpu    
-------- ----------- ------------------------ ------------ -----------
 r  b  p   avm   fre  fi  fo  pi  po  fr  sr   in   sy  cs us sy id wa
 5  1  0 4809912 45680 574 203   2   2 636  860 2048 321270 9459 24 18 53  5
 1  0  0 4820163 35346  12 152   0   0   0    0 2034 410525 5435 10 20 67  2
 2  0  0 4816092 39388   4  57   0   0   0    0 1726 566771 62167 13 20 65  2
 2  1  0 4821609 33799  11 216   0   0   0    0 2024 529518 21680 13 27 56  4
 6  1  0 4815588 39806   1  43   0   0   0    0 1668 481025 4853 12 18 69  1


The iostat command reports CPU and I/O statistics.

# iostat (On large systems this output could be quite large)

System configuration: lcpu=2 disk=3

tty:      tin         tout    avg-cpu: % user % sys % idle % iowait
          0.0          0.5                0.3   0.2   99.3      0.2

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk1           0.6        8.2      1.2    2030462   5660599
hdisk0           0.7        7.2      1.1    1116762   5660603
cd0              0.0        0.0      0.0          0         0

Note: %user shows the percentage of CPU utilization at the user level and %sys shows the percentage of the CPU utilization at the system level.


# sar 5 5

AIX jrspa22t 2 5 00283EDD4C00    07/26/06

System Configuration: lcpu=8

10:12:49    %usr    %sys    %wio   %idle
10:12:54      22      13       2      64
10:12:59      53       4       1      42
10:13:04      52       9       1      38
10:13:09      52       3       1      44
10:13:14      39       8       2      52

Average       44       7       1      48

To monitor the usage of each CPU individually with sar:

# sar -P ALL 5 10

The topas command displays statistics of system activities and CPU usage.  The refresh interval, in seconds, is set with the -i flag.  To ensure the output is readable, set your terminal emulation to vt220 both before connecting to the system and after logging on to it.

# topas -i5
The report from the topas command lists the CPU usage of the kernel, user, wait time, and system idle time.  Below, it also lists processes, along with the PID, CPU usage, and owner that are currently running on the system.




To monitor the busiest processes on a system using topas:

# topas -Pi5   (checks at a 5 second interval)
Topas Monitor for host:    jrspa22t    Interval:   5    Wed Jul 26 10:15:47 2006

                                 DATA  TEXT  PAGE                PGFAULTS
USER         PID    PPID PRI NI   RES   RES SPACE    TIME CPU%  I/O   OTH COMMAND
root      258066       1  60 20    88     1   160 1966:11  2.9    0    25 syncd
patrol   7778462       1  75 30 14933   674 17910 1423:17  1.1    0 27708 PatrolAg
root     8769704       1  62 20  6313   835  6313    9:29  1.1    6   400 bgsagent
root      172116       0  16 41    17     0    17 1090:36  0.5    0     0 wlmsched
bsomqp02 2642060 2191530  60 20 15194    11 24748   26:38  0.4    0     0 java
root     7958554 2969668  58 41  2790    19  2790    0:01  0.4    0   202 topas
root     1417340       1   1 41   400   245   838 1349:10  0.4    0     0 seosd
ncmsqp04 2674916 2822388  60 20 25443    17 40486   47:57  0.3    0     0 java
patrol   6553618       1  70 30  2754   674  4157  242:53  0.2    0  1721 PatrolAg
ncmsqp02 3493972 3690670  60 20 17462    11 27864   29:37  0.2    0     0 java

Find the top 15 processes using memory on a system:

# svmon -Pt15 | perl -e 'while(<>){print if($.==2||$&&&!$s++);$.=0 if(/^-+$/)}'
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd LPage
 1589482 oracle          247739     5402    55835   109827      Y     N     N
 2039974 oracle          221077     5402    56167   110311      Y     N     N
 2129990 oracle          220953     5402    56091   110111      Y     N     N
 1982638 oracle          220808     5402    55824   109858      Y     N     N
 1396820 oracle          219414     5402    55839   109946      Y     N     N
 2670812 oracle          219319     5402    55990   109938      Y     N     N
 6779124 oracle          219285     5402    56034   109932      Y     N     N
 2216084 oracle          219245     5402    55979   109899      Y     N     N
 2912464 oracle          219239     5402    55926   109873      Y     N     N
 2470110 oracle          219232     5402    55953   109874      Y     N     N
 2572518 oracle          219002     5402    56018   109846      Y     N     N
 2584744 oracle          218920     5402    56173   109915      Y     N     N
 2211846 oracle          218883     5402    56245   109948      Y     N     N
 6979770 oracle          200825     5402    56144   109830      Y     N     N
 1790028 java            187476     5727    57630   198578      N     Y     N 

Finding the size of a process by its PID using ps:

# ps v 3375240
     PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
 3375240      - A    42:25 10859 157132 106180    xx    39    44  0.0  1.0 /pac/nc

2)  Where to obtain PERFPMR to collect performance data


If a server has a performance problem, IBM may request that you install perfpmr and collect performance data during a peak load period.  IBM normally provides instructions on how to install.

You can obtain a copy of the perfpmr scripts from the following location:


You will need to get this while you are logged onto the server with the problem.

The IBM performance team has suggested the following changes be made to the script once it is downloaded and installed:
Please change the following lines in each of the stanzas in perfpmr.cfg:
trace.sh:
logsize = 402653184
kbufsize = 201326592
filemon.sh:
filemon_kbufsize = 201326592
filemon_time_seconds = 60
space_required = 83886080

3)  LAN Status


The netstat command shows network status for each protocol or routing table.  The -i flag shows per-interface packet counts, input and output errors, and collisions.

# netstat -i
Name  Mtu    Network   Address            Ipkts Ierrs    Opkts Oerrs  Coll
en0   1500   link#2    0.9.6b.3e.57.61   424536     0   239376     0     0
en0   1500   89.10.12  prl28284          424536     0   239376     0     0
en2   1500   link#3    0.9.6b.ce.54.cb  4297312     0   140332     2     0
en2   1500   55.10.32  breac01t-55      4297312     0   140332     2     0
lo0   16896  link#1                        5254     0     6076     0     0
lo0   16896  127       loopback            5254     0     6076     0     0
lo0   16896  ::1                           5254     0     6076     0     0
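On a system with many interfaces the error columns are easy to overlook. The sketch below prints only the interfaces that show errors or collisions, assuming the netstat -i layout above. It reads the counters from the end of each line because the Address column can be empty, and it looks only at the link# rows so each interface is reported once. The function name if_errors is illustrative.

```shell
#!/bin/sh
# if_errors
# Reads `netstat -i` output on stdin and prints each interface whose
# link-level row shows a non-zero Ierrs, Oerrs or Coll count.
if_errors() {
    awk '
        $3 ~ /^link#/ {
            ierrs = $(NF-3); oerrs = $(NF-1); coll = $NF
            if (ierrs + oerrs + coll > 0)
                printf "%s ierrs=%d oerrs=%d coll=%d\n", $1, ierrs, oerrs, coll
        }'
}

# Run against part of the sample output from this section.
# On a live system this would be: netstat -i | if_errors
if_errors <<'EOF'
Name  Mtu    Network   Address            Ipkts Ierrs    Opkts Oerrs  Coll
en0   1500   link#2    0.9.6b.3e.57.61   424536     0   239376     0     0
en2   1500   link#3    0.9.6b.ce.54.cb  4297312     0   140332     2     0
lo0   16896  link#1                        5254     0     6076     0     0
EOF
```

For the sample data only en2 is reported, with two output errors.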

Check the routing tables, with addresses shown numerically:
# netstat -rn

Routing tables
Destination      Gateway           Flags   Refs     Use  If   PMTU Exp Groups

Route Tree for Protocol Family 2 (Internet):
default          89.10.12.254      UGc       0        0  en0     -   -
55.10.32.0       55.10.34.184      UHSb      0        0  en2     -   -  =>
55.10.32/22      55.10.34.184      U         0   138677  en2     -   -
55.10.34.184     127.0.0.1         UGHS      0        1  lo0     -   -
55.10.35.255     55.10.34.184      UHSb      0        4  en2     -   -
89.10.5.135      89.10.12.254      UGHW      1     2163  en0     -   -

# ifconfig -a

en0:flags=5e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN> inet 89.10.12.31 netmask 0xffffff00 broadcast 89.10.12.255
en2:flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN> inet 55.10.34.184 netmask 0xfffffc00 broadcast 55.10.35.255
lo0:flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT> inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
        inet6 ::1/0  tcp_sendspace 65536 tcp_recvspace 65536


4)  How to check interface card speed, auto negotiation info.


# entstat -d ent4|more
-------------------------------------------------------------
ETHERNET STATISTICS (ent4) :
Device Type: Gigabit Ethernet-SX PCI-X Adapter (14106802)
Hardware Address: 00:02:55:33:77:63
Elapsed Time: 11 days 9 hours 58 minutes 37 seconds

Transmit Statistics:                          Receive Statistics:
--------------------                          -------------------
Packets: 17299124                             Packets: 166277808
Bytes: 486040591195                           Bytes: 38982878854
Interrupts: 0                                 Interrupts: 153893117
Transmit Errors: 0                            Receive Errors: 0
Packets Dropped: 0                            Packets Dropped: 0
                                              Bad Packets: 0
Max Packets on S/W Transmit Queue: 51       
S/W Transmit Queue Overflow: 0
Current S/W+H/W Transmit Queue Length: 0

Broadcast Packets: 60                         Broadcast Packets: 97825101
Multicast Packets: 1                          Multicast Packets: 95415
No Carrier Sense: 0                           CRC Errors: 0
DMA Underrun: 0                               DMA Overrun: 0
Lost CTS Errors: 0                            Alignment Errors: 0
Max Collision Errors: 0                       No Resource Errors: 0
Late Collision Errors: 0                      Receive Collision Errors: 0
Deferred: 0                                   Packet Too Short Errors: 0
SQE Test: 0                                   Packet Too Long Errors: 0
Timeout Errors: 0                             Packets Discarded by Adapter: 0
Single Collision Count: 0                     Receiver Start Count: 0
Multiple Collision Count: 0
Current HW Transmit Queue Length: 0

General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 2000
Driver Flags: Up Broadcast Running
        Simplex 64BitSupport ChecksumOffload
        PrivateSegment LargeSend DataRateSet

Gigabit Ethernet-SX PCI-X Adapter (14106802) Specific Statistics:
--------------------------------------------------------------------
Link Status : Up
Media Speed Selected: Auto negotiation
Media Speed Running: 1000 Mbps Full Duplex

PCI Mode: PCI-X (100-133)
PCI Bus Width: 64-bit
Latency Timer: 144
Cache Line Size: 128
Jumbo Frames: Disabled
TCP Segmentation Offload: Enabled
TCP Segmentation Offload Packets Transmitted: 14265351
TCP Segmentation Offload Packet Errors: 0
Transmit and Receive Flow Control Status: Enabled
XON Flow Control Packets Transmitted: 0
XON Flow Control Packets Received: 0
XOFF Flow Control Packets Transmitted: 0
XOFF Flow Control Packets Received: 0
Transmit and Receive Flow Control Threshold (High): 45056
Transmit and Receive Flow Control Threshold (Low): 24576
Transmit and Receive Storage Allocation (TX/RX): 16/48

C.  Last Reboot, Run Level, Boot Log, Console Log


Check to see if the box has rebooted recently by running:
# who -b

A recent system reboot could explain alarms on the system. The reboot may have been scheduled or may have been caused by a system panic, hardware failure, or power failure. Further investigation should be done. Check the CAMCS logs to see if a system panic occurred or check cron to see if a reboot script was executed.

Check the system's current run level.  AIX normally operates at run level 2; other run levels are available but are rarely used.
# who -r


To check for any configuration errors after a system reboot, run the following command to see the bootlog:

# alog -o -f /var/adm/ras/bootlog | more

The console log can be viewed using this command:

# alog -o -f /var/adm/ras/conslog | more

D.  DISK DRIVE REPLACEMENT


Disk Drive Procedures


The following commands are used to display devices on the system and their characteristics. 

1) Hardware Devices

lsdev displays information about devices in the device configuration database.
Flags:  -C   lists information about a device that is in the Customized Devices object class.
        -c   specifies a device class name.
        -H   displays headers above the column output.

To list the disks in the Customized Devices object class (the status column shows whether each one is Available):
# lsdev -CH -c disk 
name   status    location     description
hdisk0 Available 1S-08-00-5,0 16 Bit LVD SCSI Disk Drive
hdisk1 Available 1S-08-00-8,0 16 Bit LVD SCSI Disk Drive

To list all devices:
# lsdev -C -H | pg 
name       status    location      description
L2cache0   Available               L2 Cache
aio0       Defined                 Asynchronous I/O (Legacy)
cd0        Available 1G-19-00      IDE DVD-ROM Drive
en0        Available 1L-08         Standard Ethernet Network Interface
en1        Defined   1c-08         Standard Ethernet Network Interface
en2        Available 1j-08         Standard Ethernet Network Interface
en3        Defined   1n-08         Standard Ethernet Network Interface
ent0       Available 1L-08         10/100 Mbps Ethernet PCI Adapter II (1410ff01)
ent1       Available 1c-08         10/100 Mbps Ethernet PCI Adapter II (1410ff01)

lspv lists the physical volumes known to the system, showing each disk's name, physical volume identifier (PVID), volume group, and state.

# lspv
hdisk0          000c8edc02dccea9          rootvg          active
hdisk1          000c8edc851ee972          rootvg          active


# lspv hdisk0
PHYSICAL VOLUME:    hdisk0                   VOLUME GROUP:     rootvg
PV IDENTIFIER:      000c8edc02dccea9 VG IDENTIFIER     000c8edc00004c00000000fc851ef361
PV STATE:           active                                    
STALE PARTITIONS:   0                        ALLOCATABLE:      yes
PP SIZE:            64 megabyte(s)           LOGICAL VOLUMES:  7
TOTAL PPs:          542 (34688 megabytes)    VG DESCRIPTORS:   1
FREE PPs:           86 (5504 megabytes)      HOT SPARE:        no
USED PPs:           456 (29184 megabytes)                     
FREE DISTRIBUTION:  25..60..00..00..01                        
USED DISTRIBUTION:  84..48..108..108..108    

The -p flag lists all physical partitions of physical volume hdisk0.
# lspv -p hdisk0
hdisk0:
PP RANGE  STATE   REGION        LV NAME             TYPE       MOUNT POINT
  1-4     used    outer edge    hd5                 boot       N/A
  5-29    free    outer edge
 30-109   used    outer edge    hd9var              jfs        /var
110-141   used    outer middle  hd6                 paging     N/A
142-201   free    outer middle
202-217   used    outer middle  hd3                 jfs        /tmp
218-221   used    center        hd8                 jfslog     N/A
222-325   used    center        hd4                 jfs        /
326-381   used    inner middle  hd4                 jfs        /
382-433   used    inner middle  hd2                 jfs        /usr
434-541   used    inner edge    hd2                 jfs        /usr
542-542   free    inner edge

Example of a problem on hdisk0.  Note the stale partitions in the output: the disk is bad.

# lspv hdisk0
PHYSICAL VOLUME:    hdisk0                   VOLUME GROUP:     rootvg
PV IDENTIFIER:      000c8edc001363a5 VG IDENTIFIER     000c8edc00004c00000000fc851ef361
PV STATE:           active
STALE PARTITIONS:   6                        ALLOCATABLE:      yes
PP SIZE:            64 megabyte(s)           LOGICAL VOLUMES:  7
TOTAL PPs:          542 (34688 megabytes)    VG DESCRIPTORS:   1
FREE PPs:           86 (5504 megabytes)      HOT SPARE:        no
USED PPs:           456 (29184 megabytes)
FREE DISTRIBUTION:  25..60..00..00..01
USED DISTRIBUTION:  84..48..108..108..108

The distribution lines add up to the totals above: FREE PPs = 86 (25+60+0+0+1) and USED PPs = 456 (84+48+108+108+108).

# lspv -p hdisk0
hdisk0:
PP RANGE  STATE   REGION        LV NAME             TYPE       MOUNT POINT
  1-4     used    outer edge    hd5                 boot       N/A
  5-29    free    outer edge
 30-109   used    outer edge    hd9var              jfs        /var
110-141   used    outer middle  hd6                 paging     N/A
142-201   free    outer middle
202-217   used    outer middle  hd3                 jfs        /tmp
218-218   *stale  center        hd8                 jfslog     N/A
219-221   used    center        hd8                 jfslog     N/A
222-222   *stale  center        hd4                 jfs        /
223-231   used    center        hd4                 jfs        /
232-232   *stale  center        hd4                 jfs        /
233-240   used    center        hd4                 jfs        /
241-241   *stale  center        hd4                 jfs        /
242-325   used    center        hd4                 jfs        /
326-381   used    inner middle  hd4                 jfs        /
382-382   *stale  inner middle  hd2                 jfs        /usr
383-400   used    inner middle  hd2                 jfs        /usr
401-401   *stale  inner middle  hd2                 jfs        /usr
402-433   used    inner middle  hd2                 jfs        /usr
434-541   used    inner edge    hd2                 jfs        /usr
542-542   free    inner edge
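The stale ranges in the listing above can be totalled per logical volume to see which file systems are affected. This is a minimal sketch, assuming the lspv -p column layout shown. The LV name is read as the third field from the end because the REGION column may be one or two words. The function name stale_by_lv is illustrative.

```shell
#!/bin/sh
# stale_by_lv
# Reads `lspv -p hdiskN` output on stdin and prints the number of stale
# physical partitions for each logical volume, sorted by LV name.
stale_by_lv() {
    awk '
        $2 ~ /stale/ {
            split($1, r, "-")                    # PP RANGE is "first-last"
            count[$(NF-2)] += r[2] - r[1] + 1
        }
        END { for (lv in count) print lv, count[lv] }' | sort
}

# Run against the stale lines from the sample output above.
# On a live system this would be: lspv -p hdisk0 | stale_by_lv
stale_by_lv <<'EOF'
PP RANGE  STATE   REGION        LV NAME             TYPE       MOUNT POINT
218-218   *stale  center        hd8                 jfslog     N/A
219-221   used    center        hd8                 jfslog     N/A
222-222   *stale  center        hd4                 jfs        /
232-232   *stale  center        hd4                 jfs        /
241-241   *stale  center        hd4                 jfs        /
382-382   *stale  inner middle  hd2                 jfs        /usr
401-401   *stale  inner middle  hd2                 jfs        /usr
EOF
```

For the sample this gives 2 stale partitions for hd2, 3 for hd4 and 1 for hd8, which adds up to the STALE PARTITIONS value of 6 reported for the disk.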
                

2) Volume Groups

To list volume groups that are currently active on your system:

# lsvg -o
rootvg

List detailed information and status about the volume group.
# lsvg rootvg
VOLUME GROUP:   rootvg                   VG IDENTIFIER:  000c8edc00004c00000000fc851ef361
VG STATE:       active                   PP SIZE:        64 megabyte(s)
VG PERMISSION:  read/write               TOTAL PPs:      1084 (69376 megabytes)
MAX LVs:        256                      FREE PPs:       108 (6912 megabytes)
LVs:            9                        USED PPs:       976 (62464 megabytes)
OPEN LVs:       8                        QUORUM:         1
TOTAL PVs:      2                        VG DESCRIPTORS: 3
STALE PVs:      0                        STALE PPs:      0
ACTIVE PVs:     2                        AUTO ON:        yes
MAX PPs per PV: 1016                     MAX PVs:        32
LTG size:       128 kilobyte(s)          AUTO SYNC:      no
HOT SPARE:      no                       BB POLICY:      relocatable

List the logical volumes in a volume group.
# lsvg -l rootvg
rootvg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
hd5                 boot       1     2     2    closed/syncd  N/A
hd6                 paging     42    84    3    open/syncd    N/A
hd8                 jfslog     1     2     2    open/syncd    N/A
hd4                 jfs        33    66    2    open/syncd    /
hd2                 jfs        20    40    2    open/syncd    /usr
hd9var              jfs        20    40    2    open/syncd    /var
hd3                 jfs        4     8     2    open/syncd    /tmp
pac_lv1             jfs        1     2     2    open/syncd    /pac
lvbto               jfs        72    144   2    open/syncd    /bto/sys
hd7                 sysdump    18    18    1    open/syncd    N/A
hd71                sysdump    18    18    1    open/syncd    N/A
paging00            paging     42    84    2    open/syncd    N/A

List the physical volume status within a volume group.

# lsvg -p rootvg
rootvg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk2            active            135         5           01..00..00..00..04
hdisk3            active            135         0           00..00..00..00..00
hdisk0            active            135         6           00..00..00..00..06
hdisk1            active            135         21          00..00..10..00..11

List attributes about a physical volume (disk):

# lsattr -El hdisk2
PCM             PCM/friend/scsiscsd              Path Control Module           False
algorithm       fail_over                        Algorithm                     True
dist_err_pcnt   0                                Distributed Error Percentage  True
dist_tw_width   50                               Distributed Error Sample Time True
hcheck_interval 0                                Health Check Interval         True
hcheck_mode     nonactive                        Health Check Mode             True
max_transfer    0x40000                          Maximum TRANSFER Size         True
pvid            00283edd26fdf5680000000000000000 Physical volume identifier    False
queue_depth     3                                Queue DEPTH                   False
reserve_policy  single_path                      Reserve Policy                True
size_in_mb      36400                            Size in Megabytes             False

 

 

E.  Running SNAP


Note:  You must have an open PMR with pSeries Support (IBM) before continuing.  All references to the PMR number below will be in the format of “xxxxx.YYY”, where “xxxxx” is the problem number and “YYY” is the branch number.

1)  Call IBM




To find the 4-digit machine type:
# uname -M
IBM,7029-6C3


Search the report for General Info and view the HW_MODEL field.
====================================================================
GENERAL INFO: senthil : 0x590a0c1f : Fri 03-04-11 14:04:31 CST : 80.1
====================================================================
HOSTNAME: senthil
HOSTID: 0x590a0c1f
PRIM_IP_ADDRESS: x.x.x.x
HW_VENDOR: IBM
HW_MODEL: IBM,7029-6C3
OS_LEVEL: AIX 5.2
SYSTEM_MEMORY: 2048 Mb
DDSABLE: TRUE
DOMAIN: none 


Follow the steps below to run “snap” and ftp the output to IBM:

2)  How to run SNAP command:

Using the "snap" command to gather information:
This is a powerful command to gather lots of data on all types of machines.  The following are some caveats with this command:

-- The "-b" flag gathers SSA information
-- The "-t" flag gathers the TCPIP information
-- The file created from the output is /tmp/ibmsupt/snap.pax.Z





To gather basic information on a machine, such as error logs, configuration, and AIX driver levels, run:
# snap -r   (this removes any prior snap data)
# snap -gc

NOTE: Depending on the number of SSA drives, this could take anywhere from a few minutes to 2 hours, so be careful.

To gather the SSA info, use:  # snap -gbc
To gather the SSA and TCPIP info, use:  # snap -gtbc
To gather all system configuration information:  # snap -ac

Example of output:
bos62833[root]: snap -r
Nothing to clean up
bos62833[root]: snap -gbc
Checking space requirement for general information.......................................................................................................................................................................................................................................................................................................................................................... done.
..Checking space requirement for ssa information.......... done.
Checking for enough free space in filesystem... done.

********Checking and initializing directory structure
Creating /tmp/ibmsupt directory tree... done.
Creating /tmp/ibmsupt/ssa directory tree... done.
Creating /tmp/ibmsupt/general directory tree... done.
Creating /tmp/ibmsupt/general/diagnostics directory tree... done.
Creating /tmp/ibmsupt/testcase directory tree... done.
Creating /tmp/ibmsupt/other directory tree... done.
********Finished setting up directory /tmp/ibmsupt

Gathering general system information.......................................................................................................................................................................................................................................................................................................................................................... done.
Gathering scanout information..done.
Gathering ssa system information.......... done.

Creating compressed pax file...
Starting pax/compress process... Please wait... done.
-rw-------   1 0        0            834911 Feb  8 00:08 snap.pax.Z

Note: additional flags to be used for specific data.

IBM support may request additional options to be executed with the snap command. From “man snap”, these are the different Flags:

-a Gathers all system configuration information. This option requires approximately 8MB of temporary disk space.

-A Gathers asynchronous (TTY) information.

-b Gathers SSA information.

-c Creates a compressed pax image (snap.pax.Z file) of all files in the /tmp/ibmsupt directory tree or other named output directory.

-D Gathers dump and /unix information. The primary dump device is used.

Notes:
* If bosboot -k was used to specify the running kernel to be other than /unix, the incorrect kernel is gathered. Make sure that /unix is, or is linked to, the kernel in use when the dump was taken.
* If the dump file is copied to the host machine, the snap command does not collect the dump image in the /tmp/ibmsupt/dump directory. Instead, it creates a link in the dump directory to the actual dump image.

-d Dir Identifies the optional snap command output directory (/tmp/ibmsupt is the default).

-f Gathers file system information.

-g Gathers the output of the lslpp -hBc command, which is required to recreate exact operating system environments. Writes output to the /tmp/ibmsupt/general/lslpp.hBc file.

Also collects general system information and writes the output to the /tmp/ibmsupt/general/general.snap file.

-G Includes predefined Object Data Manager (ODM) files in general information collected with the -g flag.

-i Gathers installation debug vital product data (VPD) information.


-k Gathers kernel information

-l Gathers programming language information.

-L Gathers LVM information.

-n Gathers Network File System (NFS) information.

-N Suppresses the check for free space.

-o OutputDevice Copies the compressed image onto diskette or tape.

-p Gathers printer information.

-r Removes snap command output from the /tmp/ibmsupt directory.


-s Gathers Systems Network Architecture (SNA) information.

-S Includes security files in general information collected with the -g flag.

-t Gathers Transmission Control Protocol/Internet Protocol (TCP/IP) information.

-T Gathers all the log files for a multicpu trace. Only the base file, trcfile, is captured with the -g flag.

-v Component Displays the output of the commands executed by the snap command. Use this flag to view the specified name or group of files.
Note: Press the Ctrl-C key sequence to interrupt the snap command. A prompt will return with the following options: press the Enter key to return to current operation; press the S key to stop the current operation; press the Q key to quit the snap command completely.

-w Gathers WLM information

3)  Check the current maintenance level of your system:



To determine the highest recommended maintenance level reached for the current version of AIX on the system, type:
# oslevel -r
5200-03


Beginning in 2006, IBM AIX changed from “Maintenance Level (ML)” to “Technology Level (TL)” and “Service Pack (SP)” terminology.  The command below will provide you with TL and SP information:

# oslevel -s
5200-08-01

This can be broken down as follows:
AIX Version:       5.2
Technology Level:  8
Service Pack:      1
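The same breakdown can be done in a script. This is a minimal sketch, assuming the first two digits of the leading field are the version and release, as in the 5200 example above. Any fields after the service pack are ignored. The function name parse_oslevel is illustrative.

```shell
#!/bin/sh
# parse_oslevel STRING
# Splits an `oslevel -s` string such as 5200-08-01 into the AIX version,
# technology level and service pack.
parse_oslevel() {
    echo "$1" | awk -F- '{
        printf "AIX %s.%s TL %d SP %d\n", substr($1, 1, 1), substr($1, 2, 1), $2, $3
    }'
}

# Sample value from this section.
# On a live system this would be: parse_oslevel "$(oslevel -s)"
parse_oslevel 5200-08-01
```

For 5200-08-01 this prints AIX 5.2 TL 8 SP 1, matching the breakdown above.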

For more detailed information on these topics, please refer to The IBM AIX 5L Service Strategy and Best Practices document.

4)  Check dump size


Identify the dump device settings. Note that a dump is written to either the primary or the secondary device; it does not spill over to the secondary if the primary fills:

# sysdumpdev -l
primary              /dev/hd7
secondary            /dev/hd71
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE
dump compression     OFF

Display statistical info about the most recent dump:

# sysdumpdev -L


Estimate the size of the dump (in bytes) for the currently running system:

# sysdumpdev -e
0453-041 Estimated dump size in bytes: 4280287232


To identify how much space is allocated to the dump device:

# lslv hd7
LOGICAL VOLUME:     hd7                    VOLUME GROUP:   rootvg
LV IDENTIFIER:      00283edd00004c00000001024cb1a4c3.10 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               sysdump                WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        256 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                18                     PPs:            18
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    32
MOUNT POINT:        N/A                    LABEL:          None
MIRROR WRITE CONSISTENCY: on/ACTIVE                             
EACH LP COPY ON A SEPARATE PV ?: yes                                    
Serialize IO ?:     NO         

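Dump Space Size (hd7) = PPs x PP SIZE
Dump Space Size (hd7) = 18 x 256 megabytes = 4608 megabytes

The estimated dump size of 4280287232 bytes is about 4082 megabytes, so it fits on this 4608-megabyte dump device.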
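
To check that the estimate from sysdumpdev -e fits in the space computed from lslv, the comparison can be scripted. This is only a sketch: the function name dump_fits is invented here, the three values are passed in by hand, and awk is used for the byte conversion on the assumption that the shell's own arithmetic may not handle values this large.

```shell
# Report whether an estimated dump fits on the dump device.
# $1 = estimated dump size in bytes (from sysdumpdev -e)
# $2 = number of PPs, $3 = PP size in megabytes (both from lslv)
dump_fits() {
    # Round the estimate up to whole megabytes
    est_mb=$(awk -v b="$1" 'BEGIN { printf "%d", (b + 1048575) / 1048576 }')
    dev_mb=$(( $2 * $3 ))
    if [ "$est_mb" -le "$dev_mb" ]; then
        echo "OK: estimate ${est_mb} MB fits in ${dev_mb} MB"
    else
        echo "TOO SMALL: estimate ${est_mb} MB exceeds ${dev_mb} MB"
    fi
}

# Values taken from the sysdumpdev -e and lslv hd7 output above
dump_fits 4280287232 18 256    # prints: OK: estimate 4082 MB fits in 4608 MB
```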

5)  Create a dump file





      

F.  SHUTDOWN


The shutdown command halts the operating system. Only a user with root user authority can run this command.  Do not attempt to restart the system or turn off the system before the shutdown completion message is displayed; otherwise, file system damage can result.

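Make sure you are on the correct server prior to entering the shutdown command:

# hostname

To shut down and restart the system (-F performs a fast shutdown without sending warning messages to users; -r restarts the system after shutdown):

# shutdown -Fr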

Other flags that can be used with the shutdown command are:
-h   Halts the operating system completely.
-m   Brings the system down to maintenance (single-user) mode.
-d   Brings the system down from a distributed mode to a multiuser mode.
-i   Interactive mode.  Displays interactive messages to guide the user through the shutdown.
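
The "check the hostname first" step can be turned into a small guard that runs before shutdown. This is a sketch under stated assumptions: check_host is a made-up helper name, and myserver is a placeholder for the host you intend to shut down.

```shell
# Succeed only if the current host matches the expected name.
# $1 = expected host name
# $2 = actual host name (optional; defaults to the output of hostname)
check_host() {
    expected=$1
    actual=${2:-$(hostname)}
    if [ "$actual" = "$expected" ]; then
        echo "host check passed: $actual"
        return 0
    fi
    echo "host check FAILED: on $actual, expected $expected" >&2
    return 1
}

# Demonstration with both names given explicitly (placeholder values)
check_host myserver myserver
```

In practice it would be chained in front of the real command, for example: check_host myserver && shutdown -Fr. The shutdown then runs only when the names match.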

The last command can be used to help determine when the system was last shut down.

# last shutdown
shutdown  tty0                                Feb 11 14:05
shutdown  tty0                                Feb 10 20:23
shutdown  pts/1                               Feb 04 07:08

G.  HARDWARE ASSISTANCE

How to run Diagnostics


The diag command is menu driven and is used to run diagnostics for a suspected problem.

# diag
Press <Enter> to advance past the information screen.
Select Diagnostic Routines.
Select Problem Determination.

This instructs the diag command to test the system and analyze the error log.

You may run diagnostics on a particular device by using the -d flag.
# diag -d (device name)


Display previous diagnostic results.
# cd /usr/lpp/diagnostics/bin
# ./diagrpt  -o


Display all diagnostic result files logged since the date specified (mmddyy).
# /usr/lpp/diagnostics/bin/diagrpt -s 030705

This will list results logged since March 7, 2005.

Diagnostic result files are stored in the /etc/lpp/diagnostics/data directory.

Finding system configuration information


Total physical memory in system (reported in KB)
# bootinfo -r

Total number of processors in system
# lsdev -Cc processor (this will list each processor)

Display configuration, diagnostic, and vital product data about the system
# lscfg -vp | more
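
Since bootinfo -r prints memory in kilobytes, a small conversion makes the number easier to read. The sketch below assumes the value is passed in as an argument; kb_to_gb is an illustrative name and the sample value is hypothetical, not taken from a real system.

```shell
# Convert a memory size in KB (as printed by bootinfo -r) to GB.
kb_to_gb() {
    awk -v kb="$1" 'BEGIN { printf "%.2f GB\n", kb / 1048576 }'
}

kb_to_gb 8388608    # hypothetical value; prints: 8.00 GB
```

On a live system it could be called as kb_to_gb "$(bootinfo -r)".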

 

H.  LOGS


The first place you should go when troubleshooting problems in AIX is the error report (errpt).

First run errpt without any options to get an overview of current errors:

# errpt | more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
B6048838   0725140606 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6048838   0725133506 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6048838   0725122506 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6048838   0724140106 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6048838   0721033906 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6267342   0721032506 P H hdisk1356      DISK OPERATION ERROR
B6267342   0721032506 P H hdisk1356      DISK OPERATION ERROR
B6267342   0721032506 P H hdisk1355      DISK OPERATION ERROR
B6267342   0721032506 P H hdisk1355      DISK OPERATION ERROR
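
When the summary listing is long, counting entries per identifier and resource shows at a glance which errors repeat. The following is a rough sketch that reads errpt summary lines on standard input; errpt_summary is a made-up name, and the column positions are assumed to match the listing above.

```shell
# Count errpt summary entries per identifier and resource name.
# Reads errpt summary output on stdin; the header line is skipped.
errpt_summary() {
    awk 'NF >= 5 && $1 != "IDENTIFIER" { n[$1 " " $5]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Demonstration using a few lines copied from the listing above
errpt_summary <<'EOF'
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
B6048838   0725140606 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6048838   0725133506 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6267342   0721032506 P H hdisk1356      DISK OPERATION ERROR
EOF
```

On a live system it could be called as errpt | errpt_summary, with the most frequent identifier and resource pair listed first.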

To get the specifics associated with the IDENTIFIER:

# errpt -aj B6048838 | more
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     B6048838

Date/Time:       Tue Jul 25 14:06:04 EDT
Sequence Number: 113629
Machine Id:      00283E9D4C00
Node Id:         jrspa13t
Class:           S
Type:            PERM
Resource Name:   SYSPROC        

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
           6
USER'S PROCESS ID:
               7540756
FILE SYSTEM SERIAL NUMBER
          44
INODE NUMBER
     1474687
PROCESSOR ID
          16
CORE FILE NAME
/pac/brsmdp07/bea/app/user_projects/domains/collections/core
PROGRAM NAME
java
ADDITIONAL INFORMATION
abort E8
??

Symptom Data
REPORTABLE

You can display errors that were encountered during the last day by specifying a date in your search.

# date
Wed Feb 23 14:57:39 CST 2005

# errpt -a -s 0222145605 | more
-a   display information in a detailed format
-s   display all records posted after the StartDate

The StartDate format is mmddhhmmyy (month, day, hour, minute, year). To see the last day of errors, use the current date and time minus 24 hours, as in the example above.
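
The mmddhhmmyy layout is easy to misread, so it can help to decode a value before using it. This is a small sketch with an invented function name, decode_errpt_ts, that only splits the string into its fields; it does not check that the date itself is valid.

```shell
# Decode an errpt timestamp (mmddhhmmyy) into labeled fields.
decode_errpt_ts() {
    ts=$1
    case $ts in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "expected 10 digits (mmddhhmmyy): $ts" >&2; return 1 ;;
    esac
    mm=$(echo "$ts" | cut -c1-2)     # month
    dd=$(echo "$ts" | cut -c3-4)     # day
    hh=$(echo "$ts" | cut -c5-6)     # hour
    mi=$(echo "$ts" | cut -c7-8)     # minute
    yy=$(echo "$ts" | cut -c9-10)    # two-digit year
    echo "month=$mm day=$dd time=$hh:$mi year=$yy"
}

# TIMESTAMP value from the first entry of the errpt listing earlier in this section
decode_errpt_ts 0725140606    # prints: month=07 day=25 time=14:06 year=06
```

The decoded value agrees with the Date/Time line of the detailed report for that entry (Tue Jul 25 14:06).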

I.  INSTALLED SOFTWARE INFORMATION


How to determine the maintenance level of software:

# lslpp -l | more (This will list every fileset on the system)

# lslpp -l <Fileset> (Lists the state of a fileset)

# lslpp -L | grep <Fileset> (Easy way to get basic version info)

# lslpp -h <Fileset> (Displays when a fileset was installed)
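
For scripting, lslpp can also print colon-separated output with the -c flag, which is easier to parse than the padded columns. The sketch below assumes that in "lslpp -Lc" output the second field is the fileset name and the third is its level; check this against the header line on your own system. The function name lslpp_levels and the sample input are illustrative only, not captured from a real system.

```shell
# Print "fileset level" pairs from colon-separated lslpp output on stdin.
# Assumption: field 2 is the fileset and field 3 its level, as in
# "lslpp -Lc" output; lines starting with "#" are headers and are skipped.
lslpp_levels() {
    awk -F: '!/^#/ && NF >= 3 { print $2, $3 }'
}

# Illustrative input only (not captured from a real system)
lslpp_levels <<'EOF'
#Package Name:Fileset:Level:State:PTF Id:Fix State:Type:Description
bos:bos.rte:5.2.0.85: : :C: :Base Operating System Runtime
EOF
```

On a live system it could be called as lslpp -Lc | lslpp_levels.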

 
























