Available monitoring for Z-Series
Partition
- The partition name as defined on the hardware console.
- The partition identified by the name PHYSICAL is not a configured partition. Data reported in this line includes all of the uncaptured time which was used by LPAR but could not be attributed to a specific logical partition.
- The summary lines like *CP or *ICF show the totals for the displayed CPU type.
Capping Option
- The capping option of the partition (YES, MIX or NO). This field indicates whether the operator has set 'capped=yes' in the logical partition controls for the partition.
- MIX is set by RMF if either a non-IBM processor that is not managed by the logical partition controls belongs to this partition, or the capping status is currently changing.
Jobname
- The name of the job, the TSO user ID, the name of the started task, or the name of a USS address space.
ASID
- The address space id of a Job, TSO Userid, started task or USS address space. Unless otherwise indicated RMF
displays the ASID number in decimal and not in hexadecimal notation.
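Since the ASID is shown in decimal here, a small conversion helper can be useful when comparing against a source that shows the same ASID in hexadecimal. The following is only a sketch: the function names are hypothetical and not part of RMF, and the 4-digit width assumes the ASID fits in 16 bits.

```python
def asid_to_hex(asid: int) -> str:
    """Format a decimal ASID as a 4-digit hexadecimal string (assumed width)."""
    if not 0 <= asid <= 0xFFFF:
        raise ValueError("ASID is assumed to fit in 16 bits")
    return f"{asid:04X}"


def asid_from_hex(text: str) -> int:
    """Parse a hexadecimal ASID string back into its decimal value."""
    return int(text, 16)


print(asid_to_hex(58))        # 003A
print(asid_from_hex("003A"))  # 58
```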
Type
- The type of address space:
- B - Batch job (JES)
- T - TSO user
- S - Started task
- O - USS address space (OMVS)
Service Class
- The WLM service class that has been assigned to this address space.
Period
- The WLM service class period that has been assigned to this address space.
Group Name
- The name of the XCF group.
Member Name
- The name of the XCF member or *ALL for the group summary.
Status
- The status of the XCF member:
- A - Active
- C - Created
- M - Missing
- Q - Quiesced
- F - Failed
- R - Monitor Removed
- T - Sys Termination
Status Checking Interval
- The number of seconds that can elapse before the user status routine is scheduled. It is specified by the joining
member by means of the STATEXIT and INTERVAL parameters.
System Name
- The system name where this XCF member resides.
Job Name
- JOB, STC, MOUNT, or LOGON name that joined this member.
Line Type
- G - Group summary line
- M - Member line
System Pair
- For outbound requests:
- The first name is the system from which the signals are sent. The second name is the system
on which the signals are received.
- For inbound requests:
- The first name is the system on which the signals are received. The second name is the system
from which the signals are sent.
System(1)
- For outbound requests:
- The name of the system from which the signals are sent.
- For inbound requests:
- The name of the system on which the signals are received.
System(2)
- For outbound requests:
- The name of the system on which the signals are received.
- For inbound requests:
- The name of the system from which the signals are sent.
CFStructure or CTC Device Pair
- CTC - the device number pair being used as path
- STR - the coupling facility structure name
- LST - the coupling facility structure name and list number
Path Type
- CTC - Channel to Channel
- STR - Coupling Facility Structure
- LST - List within Coupling Facility Structure
Transport Class
- The name of the transport class XCF uses for message transfer.
Status
- The status of the signalling path:
- ST - Starting
- RS - Restarting
- WR - Working
- PP - Stopped
- WC - WaitingForComp
- NO - NotOperational
- FL - Failed
- RB - Rebuilding
- QG - Quiescing
- QD - Quiesced
Path direction
- O - Outbound
- I - Inbound
Direction
- L - Local
- O - Outbound
- I - Inbound
System Name
- The name of the z/OS image in the Sysplex.
SMF Id
- The SMF system Id.
Partition Name
- The name of the partition where this system runs.
System Level
- The z/OS release level running on this system.
Monitoring Interval
- Length of time in hundredths of seconds it takes XCF to detect a failure in the Sysplex (as specified by the INTERVAL
parameter in the COUPLExx parmlib member).
Operator Interval
- Length of time in hundredths of seconds it takes XCF to notify the operator of a failure in the Sysplex (as specified by the OPNOTIFY parameter in the COUPLExx parmlib member).
Status
- The status of the z/OS image:
- A - Active
- R - Removed
- M - Missing
- L - Local
- C - Cleanup
- U - Unknown
RMF Master
- The indication whether this system is the RMF master.
Storage Group Name
- Name of the storage group connected to the system.
- The line showing *ALL in this column presents the accumulated values or average percentage values for all storage groups.
Total Capacity (MB)
- Total amount of disk space (in megabytes) on all online volumes in the storage group.
Free Space (MB)
- Total amount of free disk space (in megabytes) on all online volumes in the storage group.
Free Space %
- Percentage of free disk space in the storage group.
Number of Volumes
- Number of volumes in the storage group.
Unallocated Volumes
- If at least one volume in the storage group did not return any space information, this is indicated by an *.
Volume
- Name of the volume.
Total Capacity (MB)
- Total amount of disk space (in megabytes) on the volume.
Free Space (MB)
- Total amount of free disk space (in megabytes) on the volume.
Free Space %
- Percentage of free disk space on the volume.
Largest Block (MB)
- Largest block (extent) in megabytes of unallocated disk space available on the volume.
Storage Group Name
- Name of the storage group to which the volume belongs.
Aggregate Name
- Name of the zFS aggregate which is the name of the VSAM Linear Data Set (VSAM LDS) that contains one or more file
systems.
File System Name
- Name of the file system or USS file system name.
File System Values
- All file system information concatenated into one single string. Please consult the other table column helps for more
information.
Indicator
- Indicator for file system ("F") or mount point ("M").
Mode
- Mount mode of the file system:
- R/W - Mounted in read-write mode.
- R/O - Mounted in read-only mode.
- N/M - Not mounted.
- QSC - Not available because the aggregate is quiesced.
Quota Limit
- Maximum logical size of the file system.
Quota Usage
- Percentage of the quota currently used by the file system.
Operation Rate
- Number of vnode operations per second on this file system.
Mount Point
- Mount point of the file system.
Aggregate Size
- Size of the zFS aggregate.
%Aggregate Use
- Percentage of space used in the aggregate.
Aggregate Mode
- There are two types of aggregates: A compatibility mode aggregate contains a single zFS file system. A multi-file system aggregate contains more than one zFS file system. An aggregate can have one of the following modes:
- R/O CP - Compatibility mode aggregate attached read-only.
- R/W CP - Compatibility mode aggregate attached read-write.
- R/O MS - Multi-file system aggregate attached read-only. All file systems in this aggregate are read-only and can only be mounted read-only.
- R/W MS - Multi-file system aggregate attached read-write. The file systems in this aggregate can be mounted read-only and read-write.
Number of File Systems
- Number of file systems in the aggregate.
Aggregate Read Rate
- Data transfer read rate in bytes/second for the aggregate.
Aggregate Write Rate
- Data transfer write rate in bytes/second for the aggregate.
CHPID
- Hexadecimal number of the channel path identifier (CHPID).
Type
- Type of channel path.
- You may issue the console command D M=CHP(xx) to see an explanation of the channel path type. If the field is blank, RMF encountered an error collecting data. Check the operator console for messages.
Shared
- Indicates whether the channel path is defined as shared between one or more logical partitions. Y indicates that the channel path is shared.
LPAR MSGRate
- Rate of messages sent by the partition.
LPAR MSGSize
- Average size of messages sent by the partition (in bytes).
Total MSGRate
- Rate of messages sent by the entire system.
Total Receive Fail
- Rate of messages (received by the entire system) that failed due to unavailable buffers.
Total MSGSize
- Average size of messages sent by the entire system (in bytes).
FICON Operation Rate
- Number of native FICON operations per second.
FICON Operations Active
- Average number of native FICON operations that are concurrently active.
FICON Deferred Operation Rate
- Number of deferred FICON operations per second. This is the number of operations that could not be initiated by the channel due to the lack of available resources.
zHPF Operation Rate
- Number of native zHPF (High Performance FICON) operations per second.
zHPF Operations Active
- Average number of native zHPF (High Performance FICON) operations that are concurrently active.
zHPF Deferred Operation Rate
- Number of deferred zHPF (High Performance FICON) operations per second. This is the number of operations that could not be initiated by the channel due to the lack of available resources.
Resource
- The resource name of the lock.
Jobname
- The name of the address space, which is spinning due to the lock request.
Type
- The indication, whether the lock is held exclusive or shared.
ASID
- The decimal address space identifier of the spinning job.
CPUID
- The identifier of the logical CPU holding the lock.
Address
- The address of the instruction which obtained the lock.
%Held
- The percentage of samples where the address space held the lock during the report interval.
%Spin
- The percentage of samples where the requesting address space has been found spinning due to the unavailable lock.
Lock Type
- The type of the suspend lock:
- L - Local Suspend Lock
- LX - CrossMemory Local (CML) Suspend Lock
- G - Global CMS Suspend Lock
Jobname
- The name of the job/address space which holds the lock.
ASID
- The decimal address space identifier of the job/address space which holds the lock.
%Interrupted
- The percentage of samples where the address space was interrupted while holding the lock.
%Dispatchable
- The percentage of samples where the address space was dispatchable while holding the lock.
%Suspended
- The percentage of samples where the address space was suspended while holding the lock.
%channel path partition utilization
- The channel path utilization percentage for an individual logical partition. RMF uses the values provided by CPMF (Channel Path Measurement Facility).
- In LPAR mode, the calculation is: % partition utilization = (CBT / CET) * 100
- CBT - Cumulative channel path busy time
- CET - Cumulative channel path elapsed time
- In BASIC mode, no data are shown.
%channel path total utilization
- The channel path utilization percentage for the entire system during an interval.
- For shared channels in LPAR mode, or for all channels in BASIC mode with CPMF not available, the calculation is: % total utilization = (SCB / N) * 100
- SCB - Number of SRM observations of channel path busy
- N - Number of SRM samples
- For unshared channels in LPAR mode, the value for total utilization is the same as partition utilization.
- For all channels in BASIC mode with CPMF available, the calculation is: % total utilization = (CBT / CET) * 100
- CBT - Cumulative channel path busy time
- CET - Cumulative channel path elapsed time
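The two channel path utilization calculations described above can be sketched as follows. This is an illustration of the arithmetic only, not RMF code; the function names and the sample values are assumptions.

```python
def cpmf_utilization(cbt: float, cet: float) -> float:
    """(CBT / CET) * 100, from cumulative busy and elapsed time (CPMF)."""
    if cet <= 0:
        raise ValueError("cumulative elapsed time must be positive")
    return 100.0 * cbt / cet


def sampled_utilization(scb: int, n: int) -> float:
    """(SCB / N) * 100, from SRM busy observations and SRM samples."""
    if n <= 0:
        raise ValueError("number of samples must be positive")
    return 100.0 * scb / n


# Hypothetical interval: 15 s busy out of 60 s elapsed.
print(cpmf_utilization(15.0, 60.0))   # 25.0
# Hypothetical sampling: 30 busy observations out of 120 samples.
print(sampled_utilization(30, 120))   # 25.0
```

Both functions return a percentage. Which one applies depends on the mode and on CPMF availability, as listed in the bullets above.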
%enqueue delay
- The percentage of time during the report interval that the system or job was waiting to use a serially reusable resource that another system or job was using.
%HSM delay
- The percentage of time during the report interval that the system or job was waiting for services from the Hierarchical Storage Manager (HSM).
- A high HSM delay value might be caused by one or more of the following:
- HSM address spaces delayed
- Delay on HSM volumes (Check HSM device volumes)
- HSM doing its housekeeping during prime time
- Not enough primary or level one space
- HSM dispatching priority too low.
%JES delay
- The percentage of time during the report interval that the system or job was waiting for services from the Job Entry Subsystem (JES).
- A high JES delay value might be caused by one or more of the following:
- JES address spaces delayed
- Delay on JES volumes (Check JES device volumes)
- JES dispatching priority too low.
%operator delay
- The percentage of time during the report interval that the system or job was waiting for the operator to reply to a message or mount a tape, or the address space was quiesced by the operator.
%processor delay
- The percentage of time during the report interval that the system or job or enclave was waiting for a processor.
- A high processor using value might be caused by one or more of the following:
- looping user
- high dispatching priority for a processor-bound job (in compatibility mode) or high importance for the service class of a processor-bound job (in goal mode)
- small block size I/O
- excessive use of expensive supervisor service
- A high processor delay value might be caused by one or more of the following:
- ineffective choice of dispatching priorities in either the SRM IPS (compatibility mode) or ineffective choice of importances in the active service policy (goal mode)
- high priority work using an excessive amount of CPU
- ineffective mean-time-to-wait usage
%storage delay
- The percentage of time during the report interval that the system or job was waiting for a COMM, LOCL (both include shared pages), SWAP, or VIO page, was on the out/ready queue, or was a result of a cross-memory address space or standard hiperspace paging delay.
- For enclaves, only COMM, cross-memory, and shared page delays apply.
- A high storage delay value can be associated with common storage paging (COMM), local storage paging (LOCL), swap-in delay (SWAP), swapped out and ready delay (OUTR), and other delays (OTHR), which include virtual I/O paging and paging delays from cross-memory address spaces and standard hiperspaces.
- A high storage delay associated with common storage paging might be caused by one or more of the following:
- insufficient page data sets
- not enough central storage
- poorly tuned paging configuration
- too many address spaces in storage
- too many "logical swap" address spaces in storage
- excessive storage isolation of address spaces
- too many extremely large address spaces resident
- paging data set on shared device
- high use of user I/O on paging volume
- "common I/O" contends with "swap I/O"
- common data set on wrong device
- A high storage delay associated with local storage paging might be caused by one or more of the following:
- insufficient page data sets
- not enough central storage
- address space is under isolated (causing trim) or over isolated (causing others to page/swap)
- poorly tuned paging configuration
- too many address spaces in storage
- too few (artificially low) address spaces in storage
- too many "logical swap" address spaces in storage
- paging data set on shared device
- high use of user I/O on paging volume
- too much swapping
- page-ins are from trimming at swap-out
- "local I/O" contends with "swap I/O"
- program pages in each address space rather than in PLPA
- too many extremely large address spaces resident
- A high storage delay associated with virtual I/O might be caused by one or more of the following:
- insufficient page data sets
- poorly tuned paging configuration
- paging data set on shared device
- high use of user I/O on paging volume
- virtual I/O contending with swap I/O
- A high storage delay associated with swap-in activity might be caused by one or more of the following:
- too much swapping
- workload too heavy
- insufficient page/swap data sets
- misplaced page/swap data sets
- swap data sets on slow devices
- too few (artificially low) address spaces in storage
- paging data set on shared device
- high use of user I/O on paging volume
- swapped pages moved to backing store on cached device
- not enough central storage
- A high delay value for address spaces that are swapped out and ready might be caused by one or more of the following:
- too few (artificially low) address spaces in storage
- workload too heavy
- unbalanced workload
- not enough central storage
- poorly tuned paging configuration
- insufficient page/swap data sets
- too many address spaces in storage
- too many or too few logical swap address spaces
- paging/swapping too slow
- exchange swap rate too high
- too many detected wait swaps
- improper use of storage isolation
- Other storage delays might be caused by one or more of the following:
- paging delays from cross-memory address spaces
- paging delays from standard hiperspaces (but not ESO hiperspaces)
% subsystem delay
- The percentage of time during the report interval that the system or job was waiting for services from:
- Job Entry Subsystem (JES)
- Hierarchical Storage Manager (HSM)
- Cross-System Coupling Facility (XCF)
% XCF delay
- The percentage of time during the report interval that the system or job was waiting for services from the Cross-System Coupling Facility (XCF).
- A high XCF delay value might be caused by one or more of the following:
- Path capacity exceeded.
- Other applications are tying up the path.
- XCF delays on the receiving system.
- Some data paths are unavailable or offline.
% total delay
- The percentage of time during the report interval that the job was not using any resources and was delayed for at least one of the following resources:
- processor - the job had ready work on the dispatching queue.
- storage - the job was delayed by paging, swapping or virtual input/output (VIO) activity, or was on the out/ready queue.
- device - the job was waiting for a DASD or tape.
- Job Entry Subsystem (JES)
- Hierarchical Storage Manager (HSM)
- Cross-System Coupling Facility (XCF)
- OPER - the job was waiting for the operator to reply to a message or to mount a tape, or the address space was quiesced by the operator.
- ENQ - the job was waiting to use a serially reusable resource that another job was using.
Note: If a job with several tasks is simultaneously delayed for more than one resource, RMF counts this job only once as delayed when it calculates delay percentage.
% idle
- The percentage of time during the report interval that the system or job was idle.
- RMF considers a job idle if it is in terminal wait, timer wait, or is waiting to be selected by JES, and it is not using or waiting for any resource that RMF monitors.
% using
- The percentage of time during the report interval that the system or job was using one or more processors or devices.
Note: If a job with more than one task is simultaneously using and delayed for the same resource, RMF counts the job once as using and once as delayed (regardless of how many times it is found using and delayed). If a job is delayed for more than one resource, it is counted once for the overall delay and once for each resource causing a delay.
% workflow
- Workflow percentage is the speed at which a job is moving through the system in relation to the maximum speed at which it could move through the system.
- A low workflow percentage indicates that the job has few of the resources it needs and is contending with other jobs for system resources. A high workflow percentage indicates that the job has the resources it needs to execute and is moving through the system at a relatively high speed.
- For example, a job that could execute in one minute if all the resources it needed were available, would have a workflow of 25 percent if it took four minutes to execute.
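The example above can be reproduced with a one-line calculation. This sketch only mirrors the elapsed-time illustration in the text; it does not claim to show how RMF derives workflow internally, and the function name is hypothetical.

```python
def workflow_percent(unconstrained_seconds: float, actual_seconds: float) -> float:
    """Speed of a job relative to its maximum possible speed, as a percentage."""
    if actual_seconds <= 0:
        raise ValueError("actual elapsed time must be positive")
    return 100.0 * unconstrained_seconds / actual_seconds


# A job that needs one minute with all resources available
# but takes four minutes to execute:
print(workflow_percent(60, 240))  # 25.0
```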
% unknown
- RMF considers the system or jobs that are not delayed for a monitored resource, not using a monitored resource, or not in a monitored idle state to be in an unknown state.
- The value represents the percentage of time during the report interval that the job was in the system, but not in any monitored state.
- Examples of address spaces in an unknown state include those waiting for devices other than DASD or tape and those that are waiting for work (idle) using a method that RMF does not recognize. Started tasks (STCs) are usually found in this category.
% connect time
- The sum of the percentages of time during the report interval that devices used by the job were connected to channel path(s) to transfer data between the devices and central storage.
- Because a job can be connected to more than one device at a time, the value in connect time percentage can be greater than 100%.
Note: This can include devices other than DASD and tape; for example, graphic displays.
% using
- The percentage of time during the report interval that one job or all jobs in a group or in the system were using one or more devices.
- RMF considers a job to be using a device as soon as the job's I/O request is queued in the channel for the device.
- Therefore, the using percentage for a device includes both active time on the device and queuing delay in the channel.
i/o activity rate
- The rate per second that I/O instructions (SSCH, RSCH, and HSCH) to a device completed successfully.
IOS queue time
- The average number of milliseconds an I/O request must wait on an IOS queue before an SSCH instruction can be issued. A delay occurs when a previous request to the same subchannel is in progress.
response time
- The average response time (in milliseconds) that the device required to complete an I/O request.
i/o intensity
- The product of the number of users and the average time spent waiting for a DASD device for one of the following reasons:
- The path and device are busy
- The SIO is pending
- The device is busy
- The SIO is queued
Note: There is no common name for I/O intensity in the literature. Other programs might use different names. The following terms are equivalent to I/O intensity: DASD MPL, Response Time Volume.
% active time
- The percentage of time during the report interval that the device was active.
Note: active time = connect time + disconnect time + pending time
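As a small illustration of the note above, the following sketch adds the three components and reports which one dominates. The function name and the sample percentages are hypothetical.

```python
def active_time(connect: float, disconnect: float, pending: float):
    """Return (% active time, name of the largest component)."""
    parts = {"connect": connect, "disconnect": disconnect, "pending": pending}
    total = sum(parts.values())
    dominant = max(parts, key=parts.get)
    return total, dominant


# Hypothetical device: 12% connect, 20% disconnect, 8% pending.
total, dominant = active_time(12.0, 20.0, 8.0)
print(total, dominant)  # 40.0 disconnect
```

Knowing the dominant component points to the matching list of likely causes (connect, disconnect, or pending) elsewhere in this help.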
% connect time
- The percentage of time during the report interval that the device was connected to a channel path.
% disconnect time
- The percentage of time during the report interval that the device had an active channel program, but was not connected to the channel.
Note: Disconnect time includes seek time, normal rotational delay time, and extra rotational delay time because the channel was busy.
% pending time
- The percentage of time during the range period that I/O requests were waiting in a channel queue before a path was available.
Note: Pending time includes the time spent waiting for a device, a control unit, a head of string, or a channel.
% I/O delay
- The percentage of time during the report interval that the job is waiting for any DASD or tape, or has an I/O request queued in the channel for a device, but not transmitting data (for example, is being disconnected to seek).
- A high device delay value for a job usually means that another job has a high using value for the same device. Use the Device Delay report to determine what volume a job is waiting for; then use the Device Resource Delay report to determine how the job using that volume is spending its time.
- General reasons for a high device using value might include:
- Unnecessary I/O (such as using DASD instead of VIO for temporary data sets).
- Data sets on a slow device.
- Using time for a volume will approximately equal connect time (time that the device was connected to a channel path).
- Using time does not include disconnect time (time that the device had an active channel program but was not connected to the channel) and pending time (time that I/O requests were waiting in a channel queue before a path was available).
- A high connect percentage (CON %) might be caused by one or more of the following:
- programs not resident
- inappropriate application parameters
- inefficient use of device by application(s)
- not enough in-storage buffering
- heavy BLDL activity
- high VTOC activity
- A high disconnect percentage (DSC %) might be caused by one or more of the following:
- small block size I/O
- multiple revolutions per I/O due to missing channel connects or reconnects
- long seeks because of data set placement or multiple extents on high use data sets
- heavy BLDL activity
- high miss ratio for cached device
- misplaced VTOC or CATALOG or both
- channel, control unit, or head of string contention
- A high pending percentage (PND %) might be caused by one or more of the following:
- shared DASD contention
- device not responding
- channel, control unit, or head of string contention
- poorly balanced I/O
- PND time of 100 % usually means another system had the device reserved
% delay device busy
- The percentage of time during the range period when there was an I/O request delay because the device was busy.
Note: Device busy might mean that another system is using the volume, another system reserved the volume, or a head of string busy condition caused the contention.
% control unit busy
- The percentage of time during the range period when there is an I/O request delay because the control unit was busy. If the device is shared at the control unit level, a sharing system might be using the device. If the device is not shared at the control unit level, the contention is the result of other activity to different devices over an alternate path serviced by this control unit.
% director port busy
- The percentage of time during the range period when there is an I/O request delay because the ES Connection Director port was busy.
% using
- The percentage of time during the report interval that the job was using the volume.
Note: RMF considers a job to be using a device as soon as the job's I/O request is queued in the channel for the device. Therefore, the using percentage for a device includes both active time on the device and queuing delay in the channel.
% all channel paths busy
- The percentage of time during the measurement interval when all channel paths belonging to the LCU were busy at the same time.
- Only channel paths that are both online to the system and connected to a device are included in the calculation:
- % all channel paths busy = CHPID0 * CHPID1 * CHPID2 * CHPID3
- CHPIDn = Percentage busy of each channel path involved
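The product above can be sketched as follows. The text does not say how the individual percentages are scaled, so this example assumes each CHPID value is converted to a fraction before multiplying and the result is converted back to a percentage; that scaling and the function name are assumptions. A plain product like this also treats the paths as being busy independently of each other.

```python
from math import prod


def all_paths_busy_percent(chpid_busy_percents):
    """Multiply the busy fractions of all paths of the LCU (assumed scaling)."""
    if not chpid_busy_percents:
        raise ValueError("at least one channel path is required")
    fractions = [p / 100.0 for p in chpid_busy_percents]
    return 100.0 * prod(fractions)


# Hypothetical LCU with four paths, each 50% busy:
print(all_paths_busy_percent([50.0, 50.0, 50.0, 50.0]))  # 6.25
```

Per the text, only paths that are online and connected to a device should be passed in.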
% control unit busy
- This value shows for each channel path of the LCU the relationship between requests deferred due to control unit busy and total successful requests serviced by that path.
- Each CHPID of the LCU measures the distribution of control unit contention.
- The calculation is: % control unit busy = (CUB / (DPB + CUB + SUC)) * 100
- DPB = Number of deferred I/O requests due to director port busy
- CUB = Number of deferred I/O requests due to control unit busy
- SUC = Number of successful I/O requests on that path
% director port busy
- This field indicates director port contention.
- It is the number of times an I/O request was deferred because the director port was busy during the measurement interval.
- The calculation is: % director port busy = (DPB / (DPB + CUB + SUC)) * 100
- DPB = Number of deferred I/O requests due to director port busy
- CUB = Number of deferred I/O requests due to control unit busy
- SUC = Number of successful I/O requests on that path
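The control unit busy and director port busy percentages share the same denominator, so they can be computed together. A minimal sketch, with a hypothetical function name and made-up counts; returning zero when a path saw no requests at all is a choice made for this example.

```python
def deferred_percentages(dpb: int, cub: int, suc: int):
    """Return (% control unit busy, % director port busy) for one path."""
    total = dpb + cub + suc
    if total == 0:
        return 0.0, 0.0  # no requests on this path during the interval
    return 100.0 * cub / total, 100.0 * dpb / total


# Hypothetical path: 10 deferred for director port busy,
# 15 deferred for control unit busy, 75 successful requests.
cu_busy, dp_busy = deferred_percentages(dpb=10, cub=15, suc=75)
print(cu_busy, dp_busy)  # 15.0 10.0
```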
% CHPID taken
- The rate at which I/O requests to devices of this LCU are satisfied by each CHPID during the interval.
- By reviewing the rate at which each channel path of the LCU satisfies I/O requests, you can see how evenly the work requests are distributed among the available paths and how effectively those paths are arranged for the LCU.
- The calculation is: % CHPID taken = (TO / SI) * 100
- TO - Total number of I/O operations accepted on that path
- SI - Number of seconds in the interval
# delayed i/o requests
- The average number of delayed requests on the control unit header (CUHDR).
- Each time a request is enqueued from the CUHDR, RMF counts the number of requests remaining on the queue and adds that number to the accumulator.
- The calculation is: # delayed i/o requests = (AL - ER) / ER
- AL - Accumulated queue length
- ER - Total number of enqueued requests
delayed i/o request rate
- The rate per second at which the IOP places delayed I/O requests on the CUHDR for this LCU. This is done when all paths to the subchannel are busy and at least one path to the control unit is busy.
- For devices with only one path, or for devices where multiple paths exist and the busy condition is immediately resolved, the IOP does not count the condition.
- The calculation is: delayed i/o request rate = ER / SI
- ER - Total number of enqueued requests
- SI - Number of seconds in the interval
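The two control unit header calculations can be sketched together. The rate follows the formula as given (ER / SI). For the average queue length, this example assumes the intended formula is (AL - ER) / ER; that reading, the function names, and the sample counts are assumptions.

```python
def average_delayed_requests(al: int, er: int) -> float:
    """Average CUHDR queue length, assuming (AL - ER) / ER."""
    if er <= 0:
        raise ValueError("no enqueued requests in the interval")
    return (al - er) / er


def delayed_request_rate(er: int, si: float) -> float:
    """Delayed I/O requests placed on the CUHDR per second (ER / SI)."""
    if si <= 0:
        raise ValueError("interval length must be positive")
    return er / si


# Hypothetical interval: accumulated queue length 150,
# 100 enqueued requests, 50 seconds.
print(average_delayed_requests(150, 100))  # 0.5
print(delayed_request_rate(100, 50))       # 2.0
```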
% delay by volume
- The percentage of delay caused because the job was waiting to use the named volume.
% using
- The percentage of time during the report interval that one job or all jobs in a group or in the system were using one or more processors.
% appl (TCB + SRB) by job
- The percentage of processor time used by the job during the report interval.
- This metric does not include:
- enclave CPU time - see % eappl if you want enclave CPU time included.
- AAP processor time - see % AAP if you want to monitor AAP processor time.
Note: This metric is NOT adjusted (divided) by the number of processors.
working set
- The working set represents the (central or expanded) storage the user has when a job is actually running. Shared page counts are not included in the working set.
% delay for SWAP
- The percentage that swap-in delays contributed to the delay of a job.
% delay for COMM
- The percentage that common storage (common service area (CSA) or link pack area (LPA)), including shared pages, contributed to the delay of a job.
% delay for LOCL
- The percentage that local (private) storage paging, including shared pages, contributed to the delay of a job.
% delay for OTHR
- The percentage that various types of delays contributed to the delay of a job.
- This is the sum of:
- VIO (virtual I/O)
- Paging delays from cross-memory address spaces
- For example, if the DB2 address space does not have sufficient central/expanded storage, CICS could be delayed by cross-memory page-in in the DB2 address space. This would show up as a cross-memory delay for CICS.
- Paging delays from standard hiperspaces (but not ESO hiperspaces).
- This delay could be caused by a job running DFSORT with hipersorting if the DFSORT hiperspace's pages were migrated from expanded to auxiliary storage.
% delay for OUTR
- The percentage that swapped-out-and-ready delays contributed to the delay of a job.
% available
- The percentage of common storage (CSA, ECSA, SQA, or ESQA) available for allocation at the end of the specified range period.
% not released
- The percentage of allocated common storage (CSA, ECSA, SQA, or ESQA) that was not released when a job ended.
% utilization
- The percentage of common storage (CSA, ECSA, SQA, or ESQA) used during the specified range period.
# frames not released
- The amount of allocated common storage (CSA, ECSA, SQA, or ESQA) that was not released when a job ended.
# frames used
- The amount of common storage (CSA, ECSA, SQA, or ESQA) used during the specified range period.
# frames defined
- The amount of common storage (CSA, ECSA, SQA, or ESQA) defined to the system at IPL.
# frames idle
- The average number of frames held by a job while it was idle.
# frames total
- The sum of the active and idle frames.
# frames active
- The average number of frames held by a job while it was active.
# frames fixed
- The average number of fixed frames a job was using during the report interval including frames both above and below the 16 megabyte line.
Note: A fixed frame is a frame that cannot be paged out of central storage.
# frames DIV
- The DIV frame count represents the number of data-in-virtual frames in relation to the number of data-in-virtual samples.
# slots
- The total number of the auxiliary storage slots a job used, averaged over the report interval.
es rate per residency time
- The value is the rate of page moves from expanded storage to central storage per active second. This count is the total page-move count divided by the time the user was swapped in.
Note: It includes single and blocked pages, but does not include shared, hiperspace or VIO pages.
pgin rate
- The rate at which pages are being read into central storage.
- It is calculated by dividing the total page-in count (for the group) by the resident time.
Note: The address-space-related shared storage page-ins are included in the value.
migration age
- Migration age is the average number of seconds a page resides on expanded storage before it migrates to auxiliary storage.
unreferenced interval count
- The average high unreferenced interval count (UIC) is an indicator of central storage contention. A low UIC count indicates that storage contention is high and you might experience storage problems.
% frames active
- The percentage of storage allocated to jobs that are active.
% frames available
- The percentage of available storage.
% frames idle
- The percentage of storage allocated to jobs that are idle.
% frames CSA
- The percentage of storage allocated to the common storage area (CSA).
% frames LPA
- The percentage of storage allocated to the link pack area (LPA).
% frames NUC
- The percentage of storage allocated to the nucleus (NUC).
% frames SQA
- The percentage of storage allocated to the system queue area (SQA).
# delayed jobs for COMM
- The average number of jobs in each group that are delayed for common storage (common service area (CSA) or link pack area (LPA)), including shared pages.
# delayed jobs
- The average number of jobs in each group that are delayed for any of the storage reasons COMM, LOCL, SWAP, OUTR, or OTHR.
# delayed jobs for OTHR
- The average number of jobs in each group that are delayed for various types of delays.
- This is the sum of:
- VIO (virtual I/O)
- Paging delays from cross-memory address spaces.
- For example, if the DB2 address space does not have sufficient central/expanded storage, CICS could be delayed by cross-memory page-in in the DB2 address space. This would show up as a cross-memory delay for CICS.
- Paging delays from standard hiperspaces (but not ESO hiperspaces).
- This delay could be caused by a job running DFSORT with hipersorting if the DFSORT hiperspace's pages were migrated from expanded to auxiliary storage.
# delayed jobs for OUTR
- The average number of jobs in each group with swapped-out-and-ready delays.
# delayed jobs for LOCL
- The average number of jobs in each group that are delayed for local (private) storage paging, including shared pages.
# frames online
- Central storage
- Number of central storage frames, excluding read-only frames.
- Nucleus frames are included in this metric.
- Expanded storage
- Number of usable expanded storage frames.
# delayed jobs for SWAP
- The average number of jobs in each group with swap-in delays.
pgin rate per residency time
- The average number of page-ins per second for an address space.
Note: The calculation is the total number of non-swap page-ins (including VIO page-ins, hiperspace page-ins, page-ins caused by page faults, and shared storage page-ins) during the range period divided by the total time an address space was swapped-in (residency time).
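As a rough illustration of the calculation described in the note, the following Python sketch divides a non-swap page-in count by the residency time. The function and parameter names are hypothetical (they are not RMF field names), and returning 0.0 for a zero residency time is an assumption made only for this example.

```python
def pgin_rate_per_residency(non_swap_page_ins: int, residency_seconds: float) -> float:
    """Page-ins per second of residency (swapped-in) time."""
    if residency_seconds <= 0:
        # Assumption: report 0.0 when the address space was never swapped in.
        return 0.0
    return non_swap_page_ins / residency_seconds


# 600 non-swap page-ins while swapped in for 120 seconds of the range period
print(pgin_rate_per_residency(600, 120.0))  # 5.0
```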
pagein rate
- The rate of total system pages per second read into central storage.
- The rate excludes swap-in, VIO, and hiperspace page-ins.
# of frames available
- Average number of frames on the available frame queue during the range period.
# of slots available
- Average number of free slots in the auxiliary storage during the range period.
# of frames and slots available
- Average number of available frames and free slots during the range period.
- The sum of frames and slots can be considered the amount of virtual memory.
# average number of user region pages below 16M
- The average number of user region pages (Subpools 0-127, 251, 252) that are allocated for the master address space below 16M.
# average number of LSQA/SWA/UKYSP pages below 16M
- The average number of LSQA (Local System Queue Area, Subpools 253-255), SWA (Scheduler Work Area, Subpools 236, 237), UKYSP (User Key Space, Subpools 229, 230) pages that are allocated for the master address space below 16M.
# average number of user region pages above 16M
- The average number of user region pages (Subpools 0-127, 251, 252) that are allocated for the master address space above 16M.
# average number of ELSQA/ESWA/EUKYSP pages above 16M
- The average number of ELSQA (Extended Local System Queue Area, Subpools 203-205, 213-215, 223-225, 253-255), ESWA (Extended Scheduler Work Area, Subpools 236, 237), EUKYSP (Extended User Key Space, Subpools 229, 230) pages that are allocated for the master address space above 16M.
Memory Objects Large
- The average number of large memory objects allocated by the job.
Note: A memory object is a contiguous range of virtual addresses in units of megabytes on a megabyte boundary.
Memory Objects Frames
- The average number of 1 MB frames backed in real storage.
Total MemObjs
- Average number of memory objects allocated.
Common MemObjs
- Average number of 64-bit common memory objects allocated.
- Average number of shared memory objects allocated.
Private MemObjs
- Average number of private memory objects allocated.
Large MemObjs
- The average number of large memory objects allocated.
1 MB Frames
- The average number of 1 MB frames backed in real storage.
Total Bytes
- Average amount of storage allocated by memory objects in 64-bit high virtual memory.
Common Bytes
- Average amount of 64-bit common storage allocated.
- Average amount of shared storage allocated from 64-bit virtual storage by memory objects.
Private Bytes
- Average amount of 64-bit private storage allocated.
Common HWM
- High water mark for the amount of 64-bit common storage allocated.
Memory Limit
- Address space memory limit in MB.
# of common frames backed in real
- Average number of 64-bit common memory frames backed in real storage.
# of common frames fixed in real
- Average number of 64-bit common memory frames fixed in real storage.
# of common AUX slots
- Average number of 64-bit common memory auxiliary storage slots.
- Average number of 64-bit shared memory frames backed in real storage.
% common frames used
- Percentage of area used by frames backed in real storage related to the 64-bit common area size in the system.
- Percentage of area used by shared frames backed in real storage related to the shared area size of real storage in the system.
% 1 MB frames used
- Percentage of 1 MB frames backed in real storage related to the large frames area size in the system.
execution velocity
- The execution velocity of the MVS system, workload group, service class or service class period being reported on. This value is calculated independent of a specified goal.
- The value for execution velocity is calculated as CPU using, divided by the sum of CPU using and total delays gathered by WLM.
- A high value indicates little workload contention while a low value indicates that the requests for system resources are delayed.
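The calculation above can be sketched in Python as follows. The function name and the sample-count inputs are hypothetical, expressing the result as a percentage (multiplying by 100) is an assumption about the reporting scale, and the zero-sample handling is added only to keep the example safe to run.

```python
def execution_velocity(cpu_using_samples: int, total_delay_samples: int) -> float:
    """CPU using divided by (CPU using + total delays), as a percentage."""
    total = cpu_using_samples + total_delay_samples
    if total == 0:
        # Assumption: no using or delay samples means nothing to report.
        return 0.0
    return 100.0 * cpu_using_samples / total


# 30 using samples and 70 delay samples: requests are delayed most of the time
print(execution_velocity(30, 70))  # 30.0
```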
response time
- The average response time (in seconds) for all transactions of a WLM workload, or WLM service or report class that ended during the range period.
- The response time value is the sum of the queued time and the active time for an average ended transaction.
transaction rate
- The number of transactions per second for a WLM workload, or WLM service or report class during the range period.
% average MVS utilization
- MVS view of CPU utilization.
- For example, if an MVS partition has 5% of the processor capacity and the physical CPU utilization reported by RMF for the partition is 5%, this indicates an MVS view of 100% CPU utilization.
Note: This metric is available in LPAR mode only, because in Basic mode (non-LPAR mode) this value is shown in the % total utilization metric.
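The example above (5% physical utilization on a partition that owns 5% of the capacity gives an MVS view of 100%) can be reproduced with a small Python sketch. Generalizing the example to a ratio of physical utilization to partition share is an assumption, and the function and parameter names are hypothetical.

```python
def mvs_view_utilization(physical_util_pct: float, partition_share_pct: float) -> float:
    """Physical CPU utilization relative to the partition's share of capacity."""
    if partition_share_pct <= 0:
        raise ValueError("partition share must be positive")
    return 100.0 * physical_util_pct / partition_share_pct


# Partition owns 5% of the processor capacity and uses all of it
print(mvs_view_utilization(5.0, 5.0))  # 100.0
```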
% workflow
- The average speed at which the jobs in the group are moving through the system in relation to the maximum speed at which they could move through the system.
- A low workflow percentage indicates that jobs in the group have few of the resources they need and are contending with other jobs for system resources.
- A high workflow percentage indicates that jobs in the group have the resources they need and are moving through the system at a relatively high speed.
- For example, jobs in a group that could process in four minutes if all the resources that they needed were available, would have a workflow of 25% if they took sixteen minutes to process.
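The four-minute versus sixteen-minute example can be checked with a short Python sketch. It treats workflow as the ratio of the best-case elapsed time to the observed elapsed time, which is an interpretation of the example rather than a documented RMF formula; the function name is hypothetical.

```python
def workflow_from_times(ideal_minutes: float, actual_minutes: float) -> float:
    """Workflow percentage from best-case and observed elapsed time."""
    if actual_minutes <= 0:
        raise ValueError("actual elapsed time must be positive")
    return 100.0 * ideal_minutes / actual_minutes


# Could finish in 4 minutes with all resources available, took 16 minutes
print(workflow_from_times(4, 16))  # 25.0
```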
% average CPU utilization
- The average utilization percentage for all processors during the report interval.
# active users
- The average number of active users in the system or in a group of address spaces.
- Active users include those using a monitored resource, those delayed for a monitored resource, and those doing activities that RMF does not measure.
Note: Each system user is either active, idle or unknown during a report interval.
% SRB
- The percentage of SRB time used by all work in the system, or by WLM class.
- This metric does not include AAP processor time; see % AAP if you want to monitor AAP processor time.
Note: This metric is adjusted (divided) by the number of processors.
% TCB
- The percentage of TCB time used by all work in the system, or by WLM class.
- This metric does not include AAP processor time; see % AAP if you want to monitor AAP processor time.
Note: This metric is adjusted (divided) by the number of processors.
% appl (TCB + SRB)
- The percentage of processor time used by all work in the system, or by WLM class.
- This metric does not include:
- Enclave CPU time; see % eappl if you want enclave CPU time included.
- AAP processor time; see % AAP if you want to monitor AAP processor time.
Note: This metric is adjusted (divided) by the number of processors.
# users
- The average number of total users in the system or in a group of address spaces.
# using jobs
- Average number of users using devices.
# using jobs
- Average number of users using the processor.
% workflow
- Workflow percentage with respect to the processor is the speed at which one job or all jobs in a group or in the system are using the processor(s) in relation to the maximum speed at which they could do this.
- The calculation for this value is: %workflow = (%using / (%using + %delay)) * 100
Note: In this formula, the values of %using and %delay refer to the processor.
% workflow
- Workflow percentage with respect to devices is the speed at which one job or all jobs in a group or in the system are using the devices in relation to the maximum speed at which they could do this.
- The calculation for this value is: %workflow = (%using / (%using + %delay)) * 100
Note: In this formula, the values of %using and %delay refer to devices.
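The workflow formula is the same for the processor and for devices; only the source of the %using and %delay values differs. A minimal Python sketch follows, with a hypothetical function name and an assumed result of 0.0 when both inputs are zero.

```python
def workflow_pct(using_pct: float, delay_pct: float) -> float:
    """%workflow = (%using / (%using + %delay)) * 100"""
    total = using_pct + delay_pct
    if total == 0:
        # Assumption: with neither using nor delay samples, report 0.0.
        return 0.0
    return (using_pct / total) * 100


# A job found using the resource in 20% of samples and delayed in 60%
print(workflow_pct(20.0, 60.0))  # 25.0
```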
# using jobs
- The average number of jobs using either the processor or devices during the report interval.
# delayed jobs
- The average number of jobs that are delayed during the report interval because of at least one of the following reasons:
- Waiting for a processor
- Waiting for a device
- Waiting for storage
- Waiting for a subsystem (JES, HSM, XCF)
- Waiting for the operator
- Waiting for serially reusable resource (enqueue)
# AAP processors online
- The number of AAP processors online during the range period.
% AAP
- The percentage of processor time on AAP processors used by all work in the system, or by WLM class.
Note: This metric is adjusted (divided) by the number of AAP processors.
% AAP by job
- The percentage of processor time on AAP processors used by the job during the report interval.
Note: This metric is NOT adjusted (divided) by the number of AAP processors.
% AAP on CP
- The percentage of processor time for AAP eligible work executed on CPs used by all work in the system, or by WLM class.
Note: This metric is adjusted (divided) by the number of AAP processors.
% AAP on CP by job
- The percentage of processor time for AAP eligible work executed on CPs used by the job during the report interval.
Note: This metric is NOT adjusted (divided) by the number of AAP processors.
# IIP processors online
- The number of IIP processors online during the range period.
% IIP
- The percentage of processor time on IIP processors used by all work in the system, or by WLM class.
Note: This metric is adjusted (divided) by the number of IIP processors.
% IIP by job
- The percentage of processor time on IIP processors used by the job during the report interval.
Note: This metric is NOT adjusted (divided) by the number of IIP processors.
% IIP on CP
- The percentage of processor time for IIP eligible work executed on CPs used by all work in the system, or by WLM class.
Note: This metric is adjusted (divided) by the number of IIP processors.
% IIP on CP by job
- The percentage of processor time for IIP eligible work executed on CPs used by the job during the report interval.
Note: This metric is NOT adjusted (divided) by the number of IIP processors.
% of total delay samples
- The percentage of samples where a WLM class has been found delayed.
% of standard CP delay samples
- The percentage of samples where a WLM class has been found delayed for the general purpose processor.
% of AAP delay samples
- The percentage of samples where a WLM class has been found delayed for an AAP processor.
% of IIP delay samples
- The percentage of samples where a WLM class has been found delayed for an IIP processor.
% of RG capping delay samples
- The percentage of samples where a WLM class has been found delayed because of CPU capping (WLM resource group maximum is being enforced).
CPU time at promoted DP
- The CPU time in seconds where transactions of the WLM class were running at promoted dispatching priority to help blocked workloads.
# dedicated CPs
- The number of processors in a CEC that are assigned to one or more LPARs as dedicated.
- The number of processors in a CEC that are available to the shared processor pool.
# processors defined
- The number of processors defined to the partition.
Note: This metric is available for partitions running general purpose processors (CP) as well as for partitions running special purpose processors (e.g. ICF).
# processors online (partition)
- The number of logical processors online to the partition.
Note: This metric is available for partitions running general purpose processors (CP) as well as for partitions running special purpose processors (e.g. ICF).
# processors dedicated (partition)
- The number of processors assigned exclusively to the partition.
Note: This metric is available for partitions running general purpose processors (CP) as well as for partitions running application assist processors (AAP) or integrated information processors (IIP).
# processors dedicated (CPC)
- The number of processors assigned exclusively to all partitions running in the CPC.
Note: This metric is available for partitions running general purpose processors (CP) as well as for partitions running application assist processors (AAP) or integrated information processors (IIP).
# delayed jobs for enqueue
- The average number of jobs for each group that are waiting to use a serially reusable resource that another system or job was using.
# delayed jobs for HSM
- The average number of jobs for each group that are waiting for services from the Hierarchical Storage Manager (HSM).
- A high HSM delay value might be caused by one or more of the following:
- HSM address spaces delayed
- Delay on HSM volumes
- HSM doing its housekeeping during prime time
- Not enough primary or level one space
- HSM dispatching priority too low
# delayed jobs for JES
- The average number of jobs for each group that are waiting for services from the Job Entry Subsystem (JES).
- A high JES delay value might be caused by one or more of the following:
- JES address spaces delayed
- Delay on JES volumes
- JES dispatching priority too low
# delayed jobs for operator
- The average number of jobs for each group that are waiting for the operator to reply to a message or mount a tape, or whose address space was quiesced by the operator.
# delayed jobs for subsystem
- The average number of jobs for each group that are waiting for services from:
- Job Entry Subsystem (JES)
- Hierarchical Storage Manager (HSM)
- Cross-System Coupling Facility (XCF)
# delayed jobs for XCF
- The average number of jobs for each group that are waiting for services from the Cross-System Coupling Facility (XCF).
- A high XCF delay value might be caused by one or more of the following:
- Path capacity exceeded.
- Other applications are tying up the path.
- XCF delays on the receiving system.
- Some data paths are unavailable or offline.
# delayed jobs for I/O
- The average number of jobs for each group that are waiting for any DASD or tape, or that have an I/O request queued in the channel for a device but are not transmitting data (for example, while the device is disconnected to seek).
- A high device delay value for a job usually means that another job has a high using value for the same device.
- General reasons for a high device using value might include:
- unnecessary I/O (such as using DASD instead of VIO for temporary data sets)
- data sets on a slow device
- Using time for a volume will approximately equal connect time (time that the device was connected to a channel path).
- Using time does not include disconnect time (time that the device had an active channel program but was not connected to the channel) and pending time (time that I/O requests were waiting in a channel queue before a path was available).
- A high connect percentage (CON %) might be caused by one or more of the following:
- programs not resident
- inappropriate application parameters
- inefficient use of device by application(s)
- not enough in-storage buffering
- heavy BLDL activity
- high VTOC activity
- A high disconnect percentage (DSC %) might be caused by one or more of the following:
- small block size I/O
- multiple revolutions per I/O due to missing channel connects or reconnects
- long seeks because of data set placement or multiple extents on high use data sets
- heavy BLDL activity
- high miss ratio for cached device
- misplaced VTOC or CATALOG or both
- channel, control unit, or head of string contention
- A high pending percentage (PND %) might be caused by one or more of the following:
- shared DASD contention
- device not responding
- channel, control unit, or head of string contention
- poorly balanced I/O
- PND time of 100% usually means another system had the device reserved
# delayed jobs for processor
- The average number of jobs for each group that are waiting for a processor.
- A high processor using value might be caused by one or more of the following:
- looping user
- high dispatching priority for a processor-bound job (in compatibility mode) or high importance for the service class of a processor-bound job (in goal mode)
- small block size I/O
- excessive use of expensive supervisor service
- A high processor delay value might be caused by one or more of the following:
- ineffective choice of dispatching priorities in either the SRM IPS (compatibility mode) or ineffective choice of importances in the active service policy (goal mode)
- high priority work using an excessive amount of CPU
- ineffective mean-time-to-wait usage
# delayed jobs for storage
- The average number of jobs for each group that are waiting for a COMM, LOCL (both include shared pages), SWAP, or VIO page, are on the out/ready queue, or are delayed by cross-memory address space or standard hiperspace paging.
- For enclaves, only COMM, cross-memory, and shared page delays apply.
- A high storage delay value can be associated with common storage paging (COMM), local storage paging (LOCL), swap-in delay (SWAP), swapped out and ready delay (OUTR), and other delays (OTHR), which include virtual I/O paging and paging delays from cross-memory address spaces and standard hiperspaces.
- A high storage delay associated with common storage paging might be caused by one or more of the following:
- insufficient page data sets
- not enough central storage
- poorly tuned paging configuration
- too many address spaces in storage
- too many "logical swap" address spaces in storage
- excessive storage isolation of address spaces
- too many extremely large address spaces resident
- paging data set on shared device
- high use of user I/O on paging volume
- "common I/O" contends with "swap I/O"
- common data set on wrong device
- A high storage delay associated with local storage paging might be caused by one or more of the following:
- insufficient page data sets
- not enough central storage
- address space is under isolated (causing trim) or over isolated (causing others to page/swap)
- poorly tuned paging configuration
- too many address spaces in storage
- too few (artificially low) address spaces in storage
- too many "logical swap" address spaces in storage
- paging data set on shared device
- high use of user I/O on paging volume
- too much swapping
- page-ins are from trimming at swapout
- "local I/O" contends with "swap I/O"
- program pages in each address space rather than in PLPA
- too many extremely large address spaces resident
- A high storage delay associated with virtual I/O might be caused by one or more of the following:
- insufficient page data sets
- poorly tuned paging configuration
- paging data set on shared device
- high use of user I/O on paging volume
- virtual I/O contending with swap I/O
- A high storage delay associated with swap-in activity might be caused by one or more of the following:
- too much swapping
- workload too heavy
- insufficient page/swap data sets
- misplaced page/swap data sets
- swap data sets on slow devices
- too few (artificially low) address spaces in storage
- paging data set on shared device
- high use of user I/O on paging volume
- swapped pages moved to backing store on cached device
- not enough central storage
- A high delay value for address spaces that are swapped out and ready might be caused by one or more of the following:
- too few (artificially low) address spaces in storage
- workload too heavy
- unbalanced workload
- not enough central storage
- poorly tuned paging configuration
- insufficient page/swap data sets
- too many address spaces in storage
- too many or too few logical swap address spaces
- paging/swapping too slow
- exchange swap rate too high
- too many detected wait swaps
- improper use of storage isolation
- Other storage delays might be caused by one or more of the following:
- paging delays from cross-memory address spaces
- paging delays from standard hiperspaces (but not ESO hiperspaces)
execution velocity goal
- The target execution velocity for ended transactions that has been in effect for the service class period during the reported range.
performance index
- This index helps to compare goals. If, for example, several execution velocity goals with the same importance are not met, this index helps you decide which group was impacted the most.
- RMF calculates the performance index depending on the type of goal:
- Execution velocity goal: perf index = goal% / actual%
- Average response time goal: perf index = actual(sec) / goal(sec)
- Response time goal with percentile: perf index = actual(sec) / goal(sec)
- In this context "actual" means the maximal response time that actually was reached for the percentage of the goal. To calculate this, perform the following 3 steps:
- Calculate the number of transactions N that correspond to the goal: N = (sum of all transactions * goal%) / 100
- Add up all transactions until a bucket M is reached where the sum is greater than N.
- The "actual" response time in the formula for the performance index shown above is the response time value belonging to the bucket M.
Note: Due to this methodology, the maximal value of the performance index for this goal type is 4.
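The three performance index calculations can be sketched in Python as shown below. This is an illustration under stated assumptions, not RMF's implementation: the function names are hypothetical, the bucket layout (a list of response-time value and transaction-count pairs in ascending order) and the sample bucket boundaries are invented for the example, and falling back to the last bucket when the running sum never exceeds N is an assumption.

```python
def velocity_pi(goal_pct: float, actual_pct: float) -> float:
    """Execution velocity goal: perf index = goal% / actual%"""
    return goal_pct / actual_pct


def avg_response_pi(actual_sec: float, goal_sec: float) -> float:
    """Average response time goal: perf index = actual(sec) / goal(sec)"""
    return actual_sec / goal_sec


def percentile_pi(buckets, goal_pct: float, goal_sec: float) -> float:
    """Response time goal with percentile, following the three steps.

    buckets: list of (response_time_sec, transaction_count), ascending.
    """
    # Step 1: number of transactions N that correspond to the goal
    total = sum(count for _, count in buckets)
    n = total * goal_pct / 100
    # Step 2: add up transactions until the running sum is greater than N
    running = 0
    for response_time_sec, count in buckets:
        running += count
        if running > n:
            # Step 3: "actual" is the response time of bucket M
            return response_time_sec / goal_sec
    # Assumption: if the sum never exceeds N, use the last bucket.
    return buckets[-1][0] / goal_sec


# Hypothetical distribution for a goal of "90% within 1.0 second"
sample = [(0.5, 50), (1.0, 30), (2.0, 15), (4.0, 5)]
print(velocity_pi(50, 40))              # 1.25
print(avg_response_pi(1.5, 1.0))        # 1.5
print(percentile_pi(sample, 90, 1.0))   # 2.0
```

In the sample, N is 90 and the running sum first exceeds it in the 2.0-second bucket (50 + 30 + 15 = 95), so the goal is missed with a performance index of 2.0.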
important service units (capacity) / transaction
- Actual service rate (in unweighted CPU service units per second) as consumed per transaction in a resource group with a high importance (1 or 2).
percentile achieving response time goal
- The percentage of transactions that actually ended within the time specified in the goal.
response time
- Average response time for all transactions as reported by the CICS TOR or IMS CTL region. However, for subsystem data, it is possible that active time is greater than total time.
Note: All of these response times are for ended transactions only. Thus, if there is a problem where transactions are completely locked out, either while queued or running, the problem will not be seen until the locked-out transactions end.
queue time
- Queue time is the difference between total and active time.
- For CICS, this may be the queue time for transactions within the TOR, AOR, and other regions, and also processing time within the TOR.
- For IMS, this may be the queue time for transactions within the MPR and also processing time within the CTL region.
- In all other cases, this is the average time that transactions spent waiting on a JES or APPC queue.
Note: Queue time may not always be meaningful, depending on how you schedule work. For example, if jobs are submitted in hold status and left until they are ready to be run, all of the held time counts as queued time. This time may or may not represent a delay to the job.
transaction ended rate
- The number of transactions ended per second.
active time
- For CICS transactions, active time is the execution time in AOR, only for routed transactions.
- For IMS transactions, active time is the execution time within the MPR.
- For Batch, TSO, etc., active time is the average time that transactions spent in execution.
service units (capacity) / transaction
- Actual service rate (in unweighted CPU service units per second) as consumed per transaction.
response time goal
- The goal that has been in effect for the service class period during the reported range:
- The average target response time for all ended transactions.
response time goal percentile
- The goal that has been in effect for the service class period during the reported range:
- The percentage of transactions that should complete within the time specified in the goal.
service rate
- The actual service rate, in unweighted CPU service units per second.
cf processor utilization
- Average value of processor utilizations within the coupling facility.
- In case of a stand-alone coupling facility, the utilization of the individual CPs should be approximately the same. In a PR/SM environment where this CP is shared with other partitions, the utilization is the logical utilization of the CP (that is, only the utilization by the coupling facility).
- If the average utilization is high, you can take the following actions:
- In a PR/SM environment, you can dedicate the CP to the integrated coupling facility or assign additional CPs to the partition.
- Move structures to a coupling facility with lower utilization.
- Consider additional or larger coupling facilities.
# effective logical processors
- Number of effective available logical processors in a shared environment. This value is only useful in a CFCC environment. CFCC measures the time of real command execution as well as the time waiting for work. The reported value shows the ratio between the LPAR dispatch time (CFCC execute and wait time) and the RMF Mintime length.
- For example, if a CFCC CEC contains 6 LPs, and the measured CF LPAR has two logical processors and is limited to 5%, the number of effective LPs is 0.3.
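The ratio and the worked example can be sketched in Python. The first function follows the stated definition (LPAR dispatch time divided by the Mintime length); the second reproduces the arithmetic of the example (6 LPs limited at 5% gives 0.3), which is a reading of that example rather than a documented formula. All names and the sample durations are hypothetical.

```python
def effective_lps_from_dispatch(dispatch_seconds: float, mintime_seconds: float) -> float:
    """Ratio of LPAR dispatch time (CFCC execute and wait) to Mintime length."""
    if mintime_seconds <= 0:
        raise ValueError("Mintime length must be positive")
    return dispatch_seconds / mintime_seconds


def effective_lps_from_limit(cec_lps: int, limit_pct: float) -> float:
    """Arithmetic of the example: processors in the CEC times the limit."""
    return cec_lps * limit_pct / 100


# 30 seconds of dispatch time observed in a 100-second Mintime
print(effective_lps_from_dispatch(30.0, 100.0))  # 0.3
# 6 LPs in the CEC, CF LPAR limited at 5%
print(effective_lps_from_limit(6, 5))            # 0.3
```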
total request rate
- The sum of synchronous and asynchronous requests completed against any structure within this coupling facility per second. This includes requests that changed from synchronous to asynchronous.
# frames installed
- The total amount of storage in the coupling facility, including both allocated and available space.
# frames available
- The amount of coupling facility space that is not allocated to any structure and not allocated as dump space.
sync request rate (CF structure)
- Number of hardware operations per second that started and completed synchronously to the coupling facility on behalf of connectors to the structure.
async request rate (CF structure)
- Number of hardware operations per second that started and completed asynchronously to the coupling facility on behalf of connectors to the structure.
sync service time (CF structure)
- Average time in microseconds required to satisfy a synchronous coupling facility request for this structure.
async service time (CF structure)
- Average time in microseconds required to satisfy an asynchronous coupling facility request for this structure. This value also includes operations that started synchronously but completed asynchronously.
% subchannel delay
- The percentage of all coupling facility requests MVS had to delay because it found all coupling facility subchannels busy.
- If this percentage is high, you should first ensure that sufficient subchannels are defined.
- If there are sufficient subchannels and this percentage is still high, it indicates either a coupling facility path constraint or internal coupling facility contention.
% path delay
- The percentage of all coupling facility requests that were rejected because all paths to the coupling facility were busy.
- A high percentage results in elongated service times which is a reduction of the capacity of the sending processor. If coupling facility channels are being shared among PR/SM partitions, the contention could be coming from a remote partition.
- Identifying Path Contention: There can be path contention even when this count is low. In fact, in a non-PR/SM environment where the subchannels are properly configured, Subchannel Busy, not Path Busy, is the indicator for path contention. If Path Busy is low but Subchannel Busy is high, it means MVS is delaying the coupling facility requests and in effect gating the workload before it reaches the physical paths. Before concluding you have a capacity problem, however, be sure to check that the correct number of subchannels is defined in the I/O generation.
- LPAR Environment: If coupling facility channels are being shared among PR/SM partitions, Path Busy behaves differently. Potentially, you have many MVS subchannels mapped to only a few coupling facility command buffers. You could have a case where the subchannels were properly configured (or even underconfigured), Subchannel Busy is low, but Path Busy is high. This means the contention is due to activity from a remote partition.
- Possible actions: Dedicate the coupling facility links on the sending processor or add additional links.
CF sync request rate (view from MVS image)
- Number of hardware operations per second that started and completed synchronously to the coupling facility on behalf of connectors from this system.
CF async request rate (view from MVS image)
- Number of hardware operations per second that started and completed asynchronously to the coupling facility on behalf of connectors from this system.
CF sync service time (view from MVS image)
- Average time in microseconds required to satisfy a synchronous coupling facility request.
CF async service time (view from MVS image)
- Average time in microseconds required to satisfy an asynchronous coupling facility request.
- This value also includes operations that started synchronously but completed asynchronously.
% subchannel busy (view from MVS image)
- The percentage of time where the subchannel was in use by synchronous or asynchronous operations.
% CPU utilization (CF structure)
- The percentage of CPU time for a structure compared to the total CPU time for the coupling facility.
- The sum for all structures does not add up to 100%, since not all CPU work can be attributed to structures.
% using for a dataset
- Percentage of time when a job has had an I/O request accepted by the channel for the volume on which the data set resides, but the request is not yet complete.
% delay for a dataset
- Percentage of time when a job was waiting to use the data set because of contention for the volume where the data set resides.
i/o rate
- Rate of I/O requests. The i/o rate is measured at the hardware level and is the sum of the i/o activity of all systems attached to the volume or SSID.
% cache hits
- Percentage of I/Os that were processed within the cache (cache hits), based on the total number of I/Os.
- % cache READ hits is the percentage for READ operations.
- % cache WRITE hits is the percentage for WRITE operations.
- % cache DFW hits is the percentage for DASD FAST WRITE operations.
- % cache CFW hits is the percentage for WRITE and READ-AFTER-WRITE operations.
% cache misses
- Percentage of I/Os that were NOT processed within the cache, based on the total number of I/Os.
- Definition: % cache read misses = 100 - % cache read hits
- % cache READ misses is the percentage of misses for READ operations
- % cache WRITE misses is the percentage of misses for WRITE operations
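The relationship between the hit and miss percentages can be illustrated with a short calculation. This is a sketch under assumptions: the function name and its raw-count parameters are hypothetical, and the READ and WRITE percentages are assumed to be based on the READ and WRITE request counts respectively, which is what the definition "misses = 100 - hits" suggests.

```python
def cache_percentages(read_hits, reads, write_hits, writes):
    """Derive hit, miss and read-share percentages from raw request counts."""
    def pct(part, whole):
        # Avoid division by zero when no requests of a kind were seen.
        return 100.0 * part / whole if whole else 0.0

    total = reads + writes
    read_hit = pct(read_hits, reads)
    write_hit = pct(write_hits, writes)
    return {
        "hits": pct(read_hits + write_hits, total),   # % cache hits
        "read_hits": read_hit,                        # % cache READ hits
        "read_misses": 100.0 - read_hit,              # % cache READ misses
        "write_hits": write_hit,                      # % cache WRITE hits
        "write_misses": 100.0 - write_hit,            # % cache WRITE misses
        "reads_of_total": pct(reads, total),          # % of read operations
    }
```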
% of read operations
- Percentage of READ requests based on all READ and WRITE requests.
noncache dasd i/o rate
- I/O rate of all requests that accessed DASD. This is the sum of Stage rates (normal or sequential I/O requests that accessed DASD) and other request rates (inhibit cache load, DFW BYPASS, CFW BYPASS, DFW INHIBIT).
importance by WLM service class period
- Importance describes the level of importance assigned to a service class period.
- 1: Highest - describes highest priority service class period for most important work
- 2: High
- 3: Medium
- 4: Low
- 5: Lowest
- D: Discretionary
CPC capacity (MSU/h)
- Processor capacity available to the Central Processor Complex (CPC). The data is in Millions of unweighted CPU service units per hour.
image capacity (MSU/h)
- Defined MSU capacity limit for the partition. No data are available, if the partition is not under control of the License Manager. The data is in Millions of unweighted CPU service units per hour.
% weight of max
- Average weighting factor in relation to the maximum defined weighting factor for this partition.
% WLM capping
- Percentage of time when WLM capped the partition because the four-hour average MSU value exceeds the defined capacity limit.
four hour MSU average
- The average CPU consumption of the partition over the last four hours measured in millions of unweighted CPU service units per hour (MSU/h).
four hour MSU maximum
- The maximum CPU consumption of the partition over the last four hours measured in millions of unweighted CPU service units per hour (MSU/h). This value can be greater than the defined capacity.
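The difference between the four-hour average and the four-hour maximum can be illustrated with a rolling window over MSU samples. This is a minimal sketch, not the way RMF or WLM computes the values: the class name, the sampling interval and the comparison helper are assumptions made for the example.

```python
from collections import deque


class FourHourMsu:
    """Rolling window of MSU/h samples covering the last four hours."""

    def __init__(self, interval_seconds=300, window_seconds=4 * 3600):
        # Keep only as many samples as fit into the window.
        self.samples = deque(maxlen=window_seconds // interval_seconds)

    def add(self, msu):
        self.samples.append(msu)

    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def maximum(self):
        return max(self.samples) if self.samples else 0.0

    def average_exceeds(self, defined_capacity):
        # Capping is tied to the average, so the maximum alone may exceed the limit.
        return self.average() > defined_capacity
```

With a short spike in the window, `maximum()` can be well above the defined capacity while `average_exceeds()` still returns False.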
actual MSU
- Actual MSU consumption of the image running in the specified partition. Data is in millions of unweighted CPU service units per hour.
# Logical Processors online
- The number of logical processors online during the range period.
% effective logical utilization (partition)
- The effective utilization of the logical processors by the partition.
- This data is based on the total online time of all logical processors and does not include LPAR management time.
- This metric is available for partitions running general purpose processors (CP) as well as for partitions running special purpose processors (e.g. ICF).
% total logical utilization (partition)
- The total utilization of the logical processors by the partition.
- This data is based on the total online time of all logical processors and includes LPAR management time.
- This metric is available for partitions running general purpose processors (CP) as well as for partitions running special purpose processors (e.g. ICF).
% effective physical utilization (CPC)
- The effective utilization of the physical processors by all partitions running in the CPC.
- This data is based on the total interval time of all physical processors and does not include LPAR management time.
- This metric is available as sum for all partitions running general purpose processors (CP) as well as for all partitions running special purpose processors (e.g. ICF).
% effective physical utilization (partition)
- The effective utilization of the physical processors by the partition.
- This data is based on the total interval time of all physical processors and does not include LPAR management time.
- This metric is available for partitions running general purpose processors (CP) as well as for partitions running special purpose processors (e.g. ICF).
% total physical utilization (CPC)
- The total utilization of the physical processors by all partitions running in the CPC.
- This data is based on the total interval time of all physical processors and includes LPAR management time.
- This metric is available as sum for all partitions running general purpose processors (CP) as well as for all partitions running special purpose processors (e.g. ICF).
% total physical utilization (partition)
- The total utilization of the physical processors by the partition.
- This data is based on the total interval time of all physical processors and includes LPAR management time.
- This metric is available for partitions running general purpose processors (CP) as well as for partitions running special purpose processors (e.g. ICF).
% LPAR management time (partition)
- The LPAR management time percentage for the partition. This is the time which was used by LPAR and could be attributed to a specific logical partition.
- This metric is available for general purpose processors (CP) as well as for special purpose processors (e.g. ICF).
remaining time until capping in seconds (by partition)
- The projected time until WLM soft capping will start. WLM soft capping takes place to prevent you from using more than the defined capacity over a long period of time. This is under the assumption you continue to use your system as you have done in the immediate past. The maximum number RMF reports is 14400 seconds or 4 hours. If RMF reports 14K, it means the remaining time until capping is at least 14K seconds.
average channel subsystem delay time (in mSec)
- The average number of milliseconds of delay an I/O request encountered after the acceptance of the start or resume function at the subchannel for the LCU until the channel subsystem first attempts to initiate the operation.
% delay due to device command response time
- The average number of milliseconds a successfully initiated start or resume function needs until the first command is indicated by the device as accepted.
average CU busy time (in mSec)
- The average number of milliseconds of delay an I/O request encountered for the channel path because the control unit was busy.
average command response time (in mSec)
- The average number of milliseconds a successfully initiated start or resume function needs until the first command is indicated by the device as accepted.
% eappl
- % EAppl is the percentage of CPU time as sum of TCB time, global and local SRB time, preemptible or client SRB time, and enclave CPU time consumed by the total system or by WLM service class/report class/workload.
- This metric does not include AAP processor time; see % AAP if you want to monitor AAP processor time.
Note: This metric is adjusted (divided) by the number of processors.
% eappl (total) by job
- % EAppl is the percentage of CPU time as sum of TCB time, global and local SRB time, preemptible or client SRB time, and enclave CPU time consumed by individual address spaces.
- This metric also includes AAP processor time; see % AAP if you want to monitor AAP processor time separately.
Note: This metric is NOT adjusted (divided) by the number of processors.
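The effect of the adjustment by the number of processors can be shown with a small calculation. This is only a sketch: the function name and parameters are hypothetical, and the CPU time is assumed to be already summed over TCB, SRB and enclave time.

```python
def eappl_percent(cpu_seconds, interval_seconds, processors=None):
    """Percentage of CPU time consumed during an interval.

    When `processors` is given, the value is divided by the number of
    processors (the adjusted, system-wide form); otherwise it is left
    unadjusted (the per-job form), so it can exceed 100.
    """
    pct = 100.0 * cpu_seconds / interval_seconds
    if processors:
        pct /= processors
    return pct
```

A job that consumed 120 CPU seconds in a 60-second interval shows 200 % unadjusted, while the same consumption on a 4-way system is 50 % when adjusted.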
total time
- Total number of milliseconds during the current interval where the processor was dispatched.
- The value must match the sum of captured time and uncaptured time.
Note: This metric is NOT adjusted (divided) by the number of processors.
uncaptured time
- Number of milliseconds during the current interval where the processor was dispatched for work that cannot be counted to an address space or workload.
- The sum of captured time and uncaptured time must match the total processor time.
Note: This metric is NOT adjusted (divided) by the number of processors.
captured time
- Number of milliseconds during the current interval where the processor was dispatched for work that can be counted to an address space or workload.
- The sum of captured time and uncaptured time must match the total processor time.
Note: This metric is NOT adjusted (divided) by the number of processors.
eappl time by job
- Total CPU time in milliseconds consumed by individual address spaces.
- It includes TCB time, global and local SRB time, preemptible or client SRB time and enclave CPU time.
Note: This metric is NOT adjusted (divided) by the number of processors.
load average
- Average number of dispatchable units in the ready queue. Currently running units are counted as part of the ready queue.
- With regard to the number of processors defined to the operating system, this value can be a useful indicator for processor contention:
- For an LPAR with 16 logical processors, a load average of 32 is reasonable, while the same value for an LPAR with 1 processor is way too high.
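The comparison above amounts to relating the load average to the number of logical processors. A minimal sketch, with a hypothetical function name:

```python
def load_per_processor(load_average, logical_processors):
    """Average number of ready dispatchable units per logical processor."""
    if logical_processors <= 0:
        raise ValueError("logical_processors must be positive")
    return load_average / logical_processors
```

A load average of 32 gives 2.0 per processor on a 16-way LPAR, but 32.0 on an LPAR with a single processor.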
% capacity used
- Percentage of MSUs (millions of service units) per hour actually consumed compared to the capacity limit for the LPAR.
- The capacity limit is either the defined capacity limit for WLM LPAR CPU management or it is derived from the logical processor configuration for the LPAR.
% LPAR management time by PHYSICAL
- The LPAR management time percentage for the partition named PHYSICAL. This is the time which was used by LPAR but could not be attributed to a specific logical partition.
Note: This metric is available for general purpose processors (CP) as well as for special purpose processors (e.g. ICF).
% LPAR management time (CPC)
- The LPAR management time of the physical processors by all partitions running in the CPC. This is the time which was used by LPAR and could be attributed to a specific logical partition.
Note: This metric is available for general purpose processors (CP) as well as for special purpose processors (e.g. ICF).
% bus utilization
- Percentage of bus cycles during which the bus was found busy for this channel, in relation to the theoretical limit. For OSA Express Gigabit Ethernet the value reflects the PCI bus utilization.
Note: This information is not available for parallel and ESCON channels; it is only supported for channels in channel path measurement group 2 (such as FICON, FICON bridges, OSA Express Gigabit Ethernet).
partition bytes read/sec
- Data transfer rate (bytes read/second) from the control unit to the channel for the individual partition in which this system is running.
Note: This information is not available for parallel and ESCON channels; it is only supported for channels in channel path measurement group 2 (such as FICON, FICON bridges, OSA Express Gigabit Ethernet).
total bytes read/sec
- Data transfer rate (bytes read/second) from the control unit to the channel for the entire system (sum for all partitions sharing the channel path).
Note: This information is not available for parallel and ESCON channels; it is only supported for channels in channel path measurement group 2 (such as FICON, FICON bridges, OSA Express Gigabit Ethernet).
partition bytes written/sec
- Data transfer rate (bytes written/second) from the channel to the control unit for the individual partition in which this system is running.
Note: This information is not available for parallel and ESCON channels; it is only supported for channels in channel path measurement group 2 (such as FICON, FICON bridges, OSA Express Gigabit Ethernet).
total bytes written/sec
- Data transfer rate (bytes written/second) from the channel to the control unit for the entire system (sum for all partitions sharing the channel path).
Note: This information is not available for parallel and ESCON channels; it is only supported for channels in channel path measurement group 2 (such as FICON, FICON bridges, OSA Express Gigabit Ethernet).
send fail/sec
- Rate of messages (sent by this partition) that failed for reasons other than an unavailable buffer in the receiving partition.
Note: This information is not available for parallel and ESCON channels; it is only supported for channels in channel path measurement group 2 (such as FICON, FICON bridges, OSA Express Gigabit Ethernet).
receive fail/sec
- Rate of messages (received by this partition) that failed due to an unavailable buffer. The value could indicate that more receive buffers are required.
Note: This information is not available for parallel and ESCON channels; it is only supported for channels in channel path measurement group 2 (such as FICON, FICON bridges, OSA Express Gigabit Ethernet).
capacity
- Total amount of space (in megabytes) on the volume.
Note: For new volumes, where no allocation or unallocation has been performed so far, asterisks are displayed.
freespace
- Total amount of free space (in megabytes) on the volume.
Note: For new volumes, where no allocation or unallocation has been performed so far, asterisks are displayed.
% freespace
- Percentage of free space on the volume.
Note: For new volumes, where no allocation or unallocation has been performed so far, asterisks are displayed.
largest extent
- Largest extent (in megabytes) of unallocated space available on the volume.
Note: For new volumes, where no allocation or unallocation has been performed so far, asterisks are displayed.
capacity
- Total amount of space (in megabytes) for all volumes of the monitored SMS storage groups.
Note: New volumes, where no allocation or unallocation has been performed so far, are not considered in this calculation.
freespace
- Total amount of free space (in megabytes) for all volumes of the monitored SMS storage groups.
Note: New volumes, where no allocation or unallocation has been performed so far, are not considered in this calculation.
% freespace
- Percentage of free space for all volumes of the monitored SMS storage groups.
Note: New volumes, where no allocation or unallocation has been performed so far, are not considered in this calculation.
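The storage group totals above can be sketched as a sum over the member volumes that skips new volumes without data. This is an illustrative sketch: the function name and the tuple layout are assumptions, and `None` stands in for the asterisks shown for a new volume.

```python
def storage_group_freespace(volumes):
    """Aggregate (capacity_mb, free_mb) pairs for a storage group.

    Volumes with no data yet (None values) are left out of the calculation.
    Returns (capacity_mb, free_mb, free_percent).
    """
    known = [(c, f) for c, f in volumes if c is not None and f is not None]
    capacity = sum(c for c, _ in known)
    free = sum(f for _, f in known)
    free_pct = 100.0 * free / capacity if capacity else None
    return capacity, free, free_pct
```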
% appl by uss pid and jobname
- Percentage of TCB and local/global SRB time that the process consumed.
total cpu seconds by uss pid and jobname
- Total computing time in seconds since the process has been started.
Note: This metric is available for general purpose processors (CP) as well as for special purpose processors (e.g. ICF).
Current LPAR Weight (partition)
- The current weighting of the shared processor resources for the partition.
Note: This metric is available for general purpose processors (CP) as well as for special purpose processors (e.g. ICF).
LPAR Weight (CPC)
- The sum of weightings of the shared processor resources for all partitions running in the CPC.
Note: This metric is available for general purpose processors (CP) as well as for special purpose processors (e.g. ICF).
% total physical utilization of shared resources (CPC)
- The total utilization of the shared physical processors by all partitions running in the CPC. That is, the utilization of dedicated processors is not included.
- This data is based on the total interval time of all shared physical processors and includes LPAR management time.
Note: This metric is available for general purpose processors (CP) as well as for special purpose processors (e.g. ICF).
Logical Processor Share % (partition)
- The percentage of the physical processor that a logical processor of the LPAR is entitled to use.
- If HiperDispatch is enabled, this is the percentage of logical processors with medium entitlement of a physical processor.
- Without HiperDispatch, all logical processors have the same share because the LPAR weight is evenly distributed.
HiperDispatch: # High (partition)
- If HiperDispatch is enabled, the number of logical processors of the LPAR with a high entitlement (100% share) of a physical processor.
HiperDispatch: # Medium (partition)
- If HiperDispatch is enabled, the number of logical processors of the LPAR with a medium entitlement of a physical processor.
- For the percentage, refer to the Logical Processor Share value.
HiperDispatch: # Low (partition)
- If HiperDispatch is enabled, the number of logical processors of the LPAR with a low entitlement (0% share or very close to it) of a physical processor.
User Partition ID
- The user partition identifier.
Operating System Name
- Name of the operating system instance.
Central Storage Online in MB (partition)
- Amount of central storage, in megabytes, currently online to this logical partition.
LPAR Cluster Name
- Sysplex name associated with the partition. All partitions that have the same cluster name are grouped together.
Initial Weight (partition)
- Defined initial weighting of the shared processor resources.
Minimum Weight (partition)
- Minimum weighting of the shared processor resources.
Note: A value of zero indicates that the partition is under control of LPAR CPU management, but no minimum has been specified.
Maximum Weight (partition)
- Maximum weighting of the shared processor resources.
Note: A value of zero indicates that the partition is under control of LPAR CPU management, but no maximum has been specified.
Group Capacity Name
- Name of the capacity group.
Group Capacity Limit
- MSU limit defined for the capacity group.
Group Capacity Minimum Entitlement
- The guaranteed MSU share the partition gets if necessary (even if all other partitions within the capacity group are running high workload).
Group Capacity Maximum Entitlement
- The maximum MSU share a partition can get if all other partitions in the capacity group are running without workload.
average response time (zFS total)
- Average number of milliseconds to complete a zFS request.
Note: This value represents the average for all zFS aggregates.
% i/o wait (zFS total)
- Percentage of time a zFS request has to wait for I/O completion.
Note: This value represents the average for all zFS aggregates.
Line Type
- The first character is the CPU type description:
- C - General purpose CPU (CP)
- A - Application Assist Processor (zAAP)
- U - Integrated Information Processor (zIIP)
- F - Internal Coupling Facility (ICF)
- L - Integrated Facility for Linux (IFL)
- I - CPUs known as ICF/IFL/zAAP on processors before z9
- The second character describes whether the line displays the totals for a certain CPU type (S), the information for a logical partition (P), or the partition named PHYSICAL (Y).
Note: Starting with z9, IFLs and zAAPs are reported separately and no longer together with the ICFs. However, the ICFPOOL is still displayed for compatibility purposes.
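The two-character line type can be decoded with a pair of lookup tables built from the codes listed above. This is a small sketch; the function and table names are hypothetical.

```python
CPU_TYPES = {
    "C": "General purpose CPU (CP)",
    "A": "Application Assist Processor (zAAP)",
    "U": "Integrated Information Processor (zIIP)",
    "F": "Internal Coupling Facility (ICF)",
    "L": "Integrated Facility for Linux (IFL)",
    "I": "ICF/IFL/zAAP on processors before z9",
}

LINE_KINDS = {
    "S": "totals for the CPU type",
    "P": "logical partition",
    "Y": "partition named PHYSICAL",
}


def decode_line_type(code):
    """Split a two-character line type into (CPU type, kind of line)."""
    if len(code) != 2 or code[0] not in CPU_TYPES or code[1] not in LINE_KINDS:
        raise ValueError(f"unknown line type: {code!r}")
    return CPU_TYPES[code[0]], LINE_KINDS[code[1]]
```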
% lock wait (zFS total)
- Percentage of time a zFS request has to wait for locks.
Note: This value represents the average for all zFS aggregates.
% sleep wait (zFS total)
- Percentage of time a zFS request has to wait for events.
Note: This value represents the average for all zFS aggregates.
cache request rate (user cache zFS total)
- Number of requests per second made to the user file cache.
Note: This value represents the average for all zFS aggregates.
% cache hit (user cache zFS total)
- Percentage of requests to the user file cache that completed without accessing the DASD.
Note: This value represents the average for all zFS aggregates.
% cache read (user cache zFS total)
- Percentage of read requests to the user file cache, based on the sum of read and write requests.
Note: This value represents the average for all zFS aggregates.
% cache delay (user cache zFS total)
- Percentage of requests to the user file cache that were delayed.
Note: This value represents the average for all zFS aggregates.
% quota usage
- Percentage of the quota currently used by the file system.
Note: The quota denotes the maximum size of the file system.
request rate
- Number of requests per second made to the file system.
% used space
- Percentage of space currently used by the aggregate.
read rate
- Read data transfer rate (in bytes per second) for the aggregate.
write rate
- Write data transfer rate (in bytes per second) for the aggregate.
% using
- Percentage of using samples (processor and i/o), based on the total number of state samples for the interval.
% delay
- Percentage of delay samples (processor, i/o, storage, capping, queueing), based on the total number of state samples for the interval.
% idle
- Percentage of idle samples, based on the total number of state samples for the interval.
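The three sample-based percentages share the same denominator, the total number of state samples for the interval. A minimal sketch, assuming hypothetical raw sample counts as input (the `other` parameter covers samples that are neither using, delay nor idle):

```python
def state_percentages(using, delay, idle, other=0):
    """Convert state sample counts into % using, % delay and % idle."""
    total = using + delay + idle + other
    if total == 0:
        return {"using": 0.0, "delay": 0.0, "idle": 0.0}
    return {
        "using": 100.0 * using / total,
        "delay": 100.0 * delay / total,
        "idle": 100.0 * idle / total,
    }
```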
% CP
- Percentage of time where the enclave was dispatched on a standard CP.
Note: This value is related to the current interval.
total CP seconds by enclave
- Number of seconds where the enclave was dispatched on a standard CP.
Note: This value is related to the entire lifetime of the enclave.
delta CP seconds by enclave
- Number of seconds where the enclave was dispatched on a standard CP.
Note: This value is related to the current interval.
% AAP by enclave
- Percentage of time where the enclave was dispatched on AAP processors.
Note: This value is related to the current interval.
total AAP seconds by enclave
- Number of seconds where the enclave was dispatched on AAP processors.
Note: This value is related to the entire lifetime of the enclave.
delta AAP seconds by enclave
- Number of seconds where the enclave was dispatched on AAP processors.
Note: This value is related to the current interval.
% AAP on CP
- Percentage of time where AAP eligible work of the enclave was dispatched on standard CPs while all AAP processors were busy.
Note: This value is related to the current interval.
total AAP on CP seconds by enclave
- Number of seconds where AAP eligible work of the enclave was dispatched on standard CPs while all AAP processors were busy.
Note: This value is related to the entire lifetime of the enclave.
delta AAP on CP seconds by enclave
- Number of seconds where AAP eligible work of the enclave was dispatched on standard CPs while all AAP processors were busy.
Note: This value is related to the current interval.
% CP using
- Percentage of using samples for standard CPs, based on the total number of state samples for the interval.
% CP delay
- Percentage of delay samples for standard CPs, based on the total number of state samples for the interval.
% AAP using
- Percentage of using samples for AAP processors, based on the total number of state samples for the interval.
% AAP delay
- Percentage of delay samples for AAP processors, based on the total number of state samples for the interval.
% AAP using on CP
- Percentage of using samples for AAP eligible work on standard CPs, based on the number of state samples for the interval.
% capping delay
- Percentage of CPU capping samples, based on the total number of state samples for the interval.
% using
- Percentage of i/o using samples, based on the total number of state samples for the interval.
% delay
- Percentage of i/o delay samples, based on the total number of state samples for the interval.
% delay
- Percentage of storage delay samples, based on the total number of state samples for the interval. This includes:
- Waiting for paging i/o from common.
- Waiting for cross memory page fault.
- Waiting for shared paging.
- Server private paging delay.
- Server VIO paging delay.
- Server hiperspace paging delay.
- Server MPL paging delay.
- Server swap-in delay.
% delay
- Percentage of queue delay samples, based on the total number of state samples for the interval.
% IIP
- Percentage of time where the enclave was dispatched on IIP processors.
Note: This value is related to the current interval.
total IIP seconds by enclave
- Number of seconds where the enclave was dispatched on IIP processors.
Note: This value is related to the entire lifetime of the enclave.
delta IIP seconds by enclave
- Number of seconds where the enclave was dispatched on IIP processors.
Note: This value is related to the current interval.
% IIP on CP
- Percentage of time where IIP eligible work of the enclave was dispatched on standard CPs while all IIP processors were busy.
Note: This value is related to the current interval.
total IIP on CP seconds by enclave
- Number of seconds where IIP eligible work of the enclave was dispatched on standard CPs while all IIP processors were busy.
Note: This value is related to the entire lifetime of the enclave.
delta IIP on CP seconds by enclave
- Number of seconds where IIP eligible work of the enclave was dispatched on standard CPs while all IIP processors were busy.
Note: This value is related to the current interval.
% IIP using
- Percentage of using samples for IIP processors, based on the total number of state samples for the interval.
% IIP delay
- Percentage of delay samples for IIP processors, based on the total number of state samples for the interval.
% IIP using on CP
- Percentage of using samples for IIP eligible work on standard CPs, based on the number of state samples for the interval.
% IIP by enclave
- Percentage of time where the enclave was dispatched on IIP processors.
Note: This value is related to the current interval.
% IIP delay
- Percentage of delay samples for IIP processors, based on the total number of state samples for the interval.
% IIP on CP
- Percentage of time where IIP eligible work of the enclave was dispatched on standard CPs while all IIP processors were busy.
Note: This value is related to the current interval.
% IIP using
- Percentage of using samples for IIP processors, based on the total number of state samples for the interval.
% IIP using on CP
- Percentage of using samples for IIP eligible work on standard CPs, based on the number of state samples for the interval.
XCF signals sent
- The total number of outbound signals.
- This metric is available for the following XCF entities:
- Group
- Member
- Path
- System
XCF signals received
- The total number of inbound signals.
- This metric is available for the following XCF entities:
- Group
- Member
- Path
- System
% retry
- The percentage of retry attempts compared to the retry limit.
retry limit
- The limit for retries as defined by the RETRY parameter in the COUPLExx parmlib member.
message limit
- The limit for messages as defined by the MAXMSG parameter in the COUPLExx parmlib member.
times path busy
- Number of outbound signal requests satisfied by this path while active.
signals pending transfer
- The current number of signals pending transfer on the path.
storage in use
- The current number of 1K byte blocks of message buffer space.
restart count
- Cumulative number of restarts.
- Number of inbound signals refused due to maximum message limit.
- Total number of times a no path condition occurred.
- Total number of times a no buffer condition occurred.
buffer length
- Length of longest message that fits the buffer size that supports the defined transport class length.
% fit
- The percentage of messages sent whose length fit the buffer size that supports the defined transport class length.
% small
- The percentage of messages sent whose length was smaller than the buffer size that supports the defined transport class length.
% large
- The percentage of messages sent whose length exceeds the buffer size that supports the defined transport class length.
% degraded
- The percentage of messages sent whose length exceeds the buffer size for which the signalling service was optimized.
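The % small, % fit and % large values partition the messages sent by how their length compares with the buffer length. The sketch below illustrates that partition only. It is not the XCF algorithm: the function name is hypothetical, and the boundary between "small" and "fit" is assumed to be a `next_smaller_length` parameter (the length of the next smaller buffer size), which the text above does not define.

```python
def size_distribution(message_lengths, buffer_length, next_smaller_length):
    """Return % small, % fit and % large for a list of message lengths."""
    counts = {"small": 0, "fit": 0, "large": 0}
    for length in message_lengths:
        if length > buffer_length:
            counts["large"] += 1   # does not fit the buffer
        elif length <= next_smaller_length:
            counts["small"] += 1   # assumed: a smaller buffer would have sufficed
        else:
            counts["fit"] += 1     # fits this buffer size
    total = len(message_lengths)
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}
```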
I/O transfer time
- I/O transfer time in milliseconds for the most recently received signals.
Note: This metric is only available for inbound paths.
LINUX_SYSTEM
- rate of context switches (per second)
- This metric measures the number of context switches per second. A context switch happens when the operating system's scheduler stops (pauses) execution of one process and begins (continues) execution of another process.
- rate of processes created (per second)
- This metric measures the number of processes created per second. If this number is high, then a large number of processes are being started. Each time a process is created, there is some amount of overhead associated with this creation; this overhead can become a performance problem if the rate of process creation becomes large.
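Both rates can be derived from two readings of cumulative counters. The sketch below assumes the counters come from the `ctxt` and `processes` lines of `/proc/stat` on Linux; whether the monitoring product uses this source is an assumption, and the function names are hypothetical.

```python
def parse_proc_stat(text):
    """Extract the cumulative 'ctxt' and 'processes' counters from /proc/stat text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("ctxt", "processes"):
            counters[fields[0]] = int(fields[1])
    return counters


def rates_per_second(before, after, seconds):
    """Per-second rates between two counter snapshots taken `seconds` apart."""
    return {name: (after[name] - before[name]) / seconds for name in before}
```

In practice the two snapshots would be taken by reading `/proc/stat` at the start and at the end of the interval.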
LINUX_MEMORY
- % swap space used
- The percentage of a machine's swap space used. If swap space is being heavily used, then the operating system is increasingly having to read and write from the disk to execute memory reading and writing routines. This can have a significant impact on performance.
- cache memory in MB
- This metric measures the memory used for cache in MB.
- free swap space in MB
- This metric is related to virtual memory. To achieve the best performance, we would like to have all of the processes' memory segments in physical memory (RAM) and not swapped to or from disk. Therefore, a large value here is usually good in terms of machine performance.
- major page fault rate by process
- This metric gives us the number of major page faults enumerated by process per second. We define a major page fault to simply be any page fault where disk access is involved. Again, we would like to have as little disk access as possible. A large value here usually corresponds to poor performance.
- major page fault rate process (including children)
- Similar to the major page fault rate by process metric, but also including the major faults of all child processes. Again, a large value usually corresponds to poor performance.
- memory used for buffers in MB
- The Linux kernel maintains a disk cache designed to relieve processes from waiting on relatively slow disk access. When free memory becomes low, buffer frames are released. Ideally, we would like the amount of free memory to stay high. Therefore, a small value here corresponds to buffer frames being released. Perhaps there isn't a performance problem yet, but swapping to/from virtual memory could start soon.
- memory used in MB
- This metric measures the number of MB of memory used (including the buffer caches). If the amount of memory used is near the total amount of memory (RAM) available on the machine, the machine may have poor performance.
- minor page fault rate by process
- The number of page faults without disk access per second by process.
- number of pages swapped in per second
- This metric reports the number of pages of virtual memory swapped in from disk per second. Generally, any nonzero value here indicates poor performance.
- number of pages swapped out per second
- This metric reports the number of pages of virtual memory swapped out to disk per second. Generally, any nonzero value here indicates poor performance.
- resident set size (RSS) in MB by process
- The resident set size refers to the amount of memory used by the process. This includes the code, data, and shared libraries. If a process's RSS is growing rapidly, this could be problematic: the machine may run out of physical memory and begin using swap space.
- shared memory in MB
- This metric measures the amount of memory in MB that can be used by more than one process. On its own, this is usually not a good indicator of poor performance; more information is needed. For instance, a database server may have several processes running with one very large shared memory area. If the total amount of memory collectively used by the processes and this shared memory area is larger than the physical memory size (RAM), then there may be a performance problem.
- size of swap space in MB
- This simply measures the size of the swap space/swap file on the machine.
- total memory size in MB
- Total amount of memory available (RAM) on the machine.
- used swap space in MB
- This metric reports the amount of swap space used at the end of each cycle time. This is usually a very good indicator of performance. A high value means that a large amount of swap space is in use, which implies many disk accesses and thus poor performance.
- virtual memory size by process
- Size of the virtual memory in bytes by process at the end of a time cycle. This is normally a very big number, but most parts of this virtual memory area are often left unused (not even paged in). Therefore, this metric is usually not a good performance indicator (on its own).
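The relationship between several of the memory metrics above can be illustrated with a minimal sketch that derives them from text in the format of /proc/meminfo. This is an illustration only: the function names and the mapping from kernel fields to metric names are assumptions, not a description of how the data gatherer actually computes its values.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])
    return values


def memory_summary_mb(text):
    """Derive MB figures resembling the metrics above (assumed mapping)."""
    kb = parse_meminfo(text)

    def to_mb(value_kb):
        return value_kb / 1024

    return {
        "total_mb": to_mb(kb["MemTotal"]),
        # "memory used" here includes buffers and cache, as in the text above
        "used_mb": to_mb(kb["MemTotal"] - kb["MemFree"]),
        "buffers_mb": to_mb(kb["Buffers"]),
        "swap_size_mb": to_mb(kb["SwapTotal"]),
        "swap_used_mb": to_mb(kb["SwapTotal"] - kb["SwapFree"]),
    }


# Hypothetical sample data, not taken from a real system
SAMPLE_MEMINFO = """\
MemTotal:        2048000 kB
MemFree:          512000 kB
Buffers:          102400 kB
Cached:           409600 kB
SwapTotal:       1024000 kB
SwapFree:        1024000 kB
"""

if __name__ == "__main__":
    for name, value in memory_summary_mb(SAMPLE_MEMINFO).items():
        print(f"{name}: {value:.1f}")
```

On a live Linux host, the same function could be fed the contents of /proc/meminfo instead of the sample string.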
LINUX_NETWORK
- bytes received per second
- The number of bytes received per second by the host. This is the sum of all received bytes on all network devices per second.
- bytes received per second by network device
- This metric is an enumeration of the network devices and the number of bytes received per second for each individual network device.
- bytes transmitted per second
- The number of bytes transmitted per second by the host. This is the sum of all transmitted bytes on all network devices per second.
- bytes transmitted per second by network device
- This metric is an enumeration of the network devices and the number of bytes transmitted per second for each individual network device.
- packets received per second
- This metric is the total of all packets received by the machine (through all network devices) per second. Although the size of a packet is not constant, this can give you an idea of the network processor and CPU utilization caused by network traffic. For instance, if you see many small network packets, the limiting factor is more likely network processor speed than bandwidth.
- packets received per second by network device
- This metric is similar to packets received per second, except that it is enumerated by network device.
- packets transmitted per second
- This metric is the total of all packets transmitted by the machine (through all network devices) per second. Although the size of a packet is not constant, this can give you an idea of the network processor and CPU utilization caused by network traffic. For instance, if you see many small network packets, the limiting factor is more likely network processor speed than bandwidth.
- packets transmitted per second by network device
- This metric is similar to packets transmitted per second, except that it is enumerated by network device.
- receive errors per second
- Number of receive errors per second for the entire machine (over all connected network devices).
- receive errors per second by network device
- This metric is similar to receive errors per second except that the number of errors is broken down by network device.
- transmit errors per second
- Number of transmit errors per second for the entire machine (over all connected network devices).
- transmit errors per second by network device
- This metric is similar to transmit errors per second except that the number of errors is broken down by network device.
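All of the network metrics above are rates, so they can be thought of as the difference between two counter readings divided by the elapsed time. The following sketch shows this for text in the format of /proc/net/dev. The function names, the sample data, and the choice of /proc/net/dev as the source are assumptions made for illustration.

```python
# Column positions in a /proc/net/dev device line (after the "name:" prefix)
FIELDS = {
    "rx_bytes": 0, "rx_packets": 1, "rx_errors": 2,
    "tx_bytes": 8, "tx_packets": 9, "tx_errors": 10,
}


def parse_net_dev(text):
    """Return {device: {counter_name: value}} from /proc/net/dev-style text."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # header lines carry no device name
        device, _, rest = line.partition(":")
        numbers = [int(x) for x in rest.split()]
        counters[device.strip()] = {k: numbers[i] for k, i in FIELDS.items()}
    return counters


def rates_per_second(before, after, seconds):
    """Return (per-device rates, host totals) for two counter snapshots."""
    per_device = {}
    for device, now in after.items():
        if device in before:
            per_device[device] = {
                k: (now[k] - before[device][k]) / seconds for k in FIELDS
            }
    totals = {k: sum(r[k] for r in per_device.values()) for k in FIELDS}
    return per_device, totals


# Hypothetical snapshots taken 5 seconds apart
HEADER = "Inter-|   Receive   |  Transmit\n face |bytes packets errs ...\n"
T0 = HEADER + (
    "  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n"
    "    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0\n"
)
T1 = HEADER + (
    "  eth0: 6000 60 1 0 0 0 0 0 4000 40 0 0 0 0 0 0\n"
    "    lo: 1500 15 0 0 0 0 0 0 1500 15 0 0 0 0 0 0\n"
)

if __name__ == "__main__":
    by_device, host = rates_per_second(parse_net_dev(T0), parse_net_dev(T1), 5)
    print(by_device)
    print(host)
```

The per-device dictionary corresponds to the "by network device" variants, and the totals correspond to the host-wide metrics.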
LINUX_CPU
- % CPU idle time
- This metric measures the percentage of time that the CPU is idle (averaged over all CPUs). It can be used as an effective performance metric. If this metric has a high value, the machine is not being fully utilized. If the value is very low, the CPUs are being fully utilized. In some cases, one should also look at disk access and swap space usage statistics during periods of high CPU utilization, because disk access can lead to high CPU usage.
- % CPU idle time by processor
- This metric is simply the % cpu idle time enumerated by CPUs.
- % CPU time in kernel mode
- Percentage of CPU time spent in kernel mode (averaged over all CPUs). When a machine is in kernel mode, it is executing code inside the kernel (system calls, scheduler routines, etc.).
- % CPU time in kernel mode by process
- This metric is simply the % CPU time in kernel mode enumerated by process.
- % CPU time in nice mode
- This metric measures the amount of time the machine spends in "nice" mode (averaged over all CPUs). We say a process is "nice" if its scheduling priority is lower than normal. If the superuser has increased the scheduling priority of some processes to values higher than normal, those processes are no longer classified as running in "nice" mode.
- % CPU time in user mode
- This metric measures the amount of time spent executing code of userspace processes (averaged over all CPUs).
- % CPU time in user mode by process
- This metric is simply the % CPU time in user mode enumerated by process.
- % CPU time in user mode by processor
- This metric is simply the % CPU time in user mode enumerated by CPUs.
- % CPU time total by process
- This metric measures the amount of CPU time spent in user, nice, and kernel modes by process.
- % CPU total active time
- Amount of time the CPU spent not idling.
- % CPU total active time by processor
- This metric is simply the % CPU total active time enumerated by CPUs.
- accumulated CPU time in kernel mode by process
- The accumulated CPU time per process while the process was executing in kernel mode.
- accumulated CPU time in user mode by process
- The accumulated CPU time per process while the process was executing in user mode.
- load average
- This metric measures the average length of the run queue in the kernel, that is, the processes that are running or waiting for CPU time. It is the number of processes that could be executed concurrently if enough CPUs were available. This metric is related to % CPU idle time.
- nice value by process
- This metric measures the nice value (priority) of each process on the machine. The nice value can be set manually with the "nice" command. It may be any integer from -20 (highest priority) to 19 (lowest priority), inclusive.
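The idle, user, nice, kernel, and total active percentages above are related: over any interval they are shares of the same total CPU time, and total active time is simply 100 minus the idle percentage. The sketch below illustrates this with two snapshots in the format of /proc/stat. It is an assumption-laden illustration: it reads only the four classic columns (user, nice, system, idle) and ignores later columns such as iowait, and the function names and sample values are hypothetical.

```python
def parse_cpu_lines(text):
    """Return {cpu_name: {mode: ticks}} from /proc/stat-style text."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            user, nice, system, idle = (int(x) for x in fields[1:5])
            samples[fields[0]] = {
                "user": user, "nice": nice, "kernel": system, "idle": idle,
            }
    return samples


def cpu_percentages(before, after):
    """Return per-CPU percentages for the interval between two snapshots."""
    result = {}
    for cpu, now in after.items():
        delta = {mode: now[mode] - before[cpu][mode] for mode in now}
        total = sum(delta.values())
        if total == 0:
            continue  # no ticks elapsed, nothing to report
        pct = {mode: 100.0 * ticks / total for mode, ticks in delta.items()}
        pct["active"] = 100.0 - pct["idle"]  # total active = not idle
        result[cpu] = pct
    return result


# Hypothetical snapshots; "cpu" is the aggregate over all processors
T0 = "cpu  100 0 50 850\ncpu0 60 0 30 410\ncpu1 40 0 20 440\nintr 12345\n"
T1 = "cpu  300 20 130 1550\ncpu0 160 10 70 760\ncpu1 140 10 60 790\nintr 23456\n"

if __name__ == "__main__":
    for cpu, pct in cpu_percentages(parse_cpu_lines(T0), parse_cpu_lines(T1)).items():
        print(cpu, {mode: round(value, 1) for mode, value in pct.items()})
```

The "cpu" entry corresponds to the metrics averaged over all CPUs, and the "cpu0", "cpu1" entries correspond to the "by processor" variants.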
LINUX_FILESYSTEM
- % free by file system
- Percentage of disk space free, enumerated by file system.
- % used by file system
- Percentage of disk space used, enumerated by file system.
- % of space free
- This metric measures the percentage of free space across all mounted file systems taken together.
- % of space used
- This metric measures the percentage of used space across all mounted file systems taken together.
- available (in 1 MB blocks) by file system
- Similar to % free by file system, but the measurements are presented in MB instead of percentages.
- size (in 1 MB blocks) by file system
- This metric simply lists the size of each file system. Its values will not change often.
- total size of all file systems (in 1 MB blocks)
- This metric is simply the sum of size by file system over all mounted file systems. Its values should rarely change.
- total space available (in 1 MB blocks)
- Total space available (free space) on all mounted file systems.
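The per-file-system and total metrics above are tied together by simple arithmetic, which the following sketch makes explicit. The function name, the input layout (mount point mapped to size and available space in MB), and the sample values are assumptions for illustration. Treating used space as size minus available space is a simplification that ignores blocks reserved for the superuser.

```python
def filesystem_report(filesystems):
    """Compute per-file-system and overall figures.

    filesystems maps a mount point to (size_mb, available_mb).
    """
    per_fs = {}
    total_size = 0
    total_available = 0
    for mount, (size_mb, available_mb) in filesystems.items():
        pct_free = 100.0 * available_mb / size_mb if size_mb else 0.0
        per_fs[mount] = {"pct_free": pct_free, "pct_used": 100.0 - pct_free}
        total_size += size_mb
        total_available += available_mb

    overall_free = 100.0 * total_available / total_size if total_size else 0.0
    overall = {
        "total_size_mb": total_size,
        "total_available_mb": total_available,
        "pct_free": overall_free,
        "pct_used": 100.0 - overall_free,
    }
    return per_fs, overall


# Hypothetical mount points and sizes
SAMPLE = {"/": (10000, 2500), "/home": (30000, 17500)}

if __name__ == "__main__":
    per_fs, overall = filesystem_report(SAMPLE)
    print(per_fs)
    print(overall)
```

Note that the overall percentage is weighted by file system size, so it is not the plain average of the per-file-system percentages. On a live host, the input could be built from `shutil.disk_usage(path)`, which reports total, used, and free space in bytes.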