Sunday, January 2, 2011

Exadata V2 Storage Server (Cell node) Architecture and Management:


Architecture Overview:

Storage Servers:
The Exadata Storage Server is a storage device built specifically for Oracle Database use. Each holds 12 SAS or SATA disks (7.2 TB raw with SAS or 24 TB raw with SATA per cell),
dual Xeon CPUs, dual InfiniBand ports, and 384 GB of flash memory. Smart Scan support reduces the data that must travel over the InfiniBand network,
while Hybrid Columnar Compression reduces the data footprint by 4 to 50 times, depending on the data and compression mode.

- Each Exadata storage cell has 12 disks.
- One physical disk is presented as one LUN, and a cell disk is created on top of the LUN.
- A grid disk is part of a cell disk.
- A disk group is made of many grid disks.
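
This hierarchy is visible in the way a cell is provisioned: cell disks are created on the LUNs, grid disks are carved out of the cell disks, and ASM builds disk groups from the grid disks. A minimal sketch (the DATA/RECO prefixes and the 300G size are illustrative values, not taken from this setup):

cellcli -e create celldisk all harddisk
cellcli -e create griddisk all harddisk prefix=DATA, size=300G
cellcli -e create griddisk all harddisk prefix=RECO
cellcli -e list griddisk attributes name, cellDisk, size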

- On the first two disks, the first four partitions (about 29 GB) are reserved for the system software; these two disks hold mirrored copies of the system area.
- On the other 10 disks, the matching 29 GB region is not used for the system area; it can be utilized, for example by creating a DBFS file system.
- So, Exadata creates a disk group of about 290 GB per cell, called SYSTEMDG, for the full rack cluster.
- SYSTEMDG contains the OCR and voting disk information.

Exadata Cell Disk Storage Capacity:
. 12 x 600 GB SAS 15K rpm disks (7.2 TB/cell, about 100 TB total) - but the actual usable size is about 558 GB per disk; a little space (~48 MB per disk) is also set aside for cell disk metadata.
. 12 x 2 TB SATA disks (24 TB/cell, about 336 TB total)
. 4 x 96 GB Sun Flash cards (384 GB/cell, about 5 TB total) - but the cellcli command 'list flashcache' reports about 365.25 GB,
  because a little space is set aside for cell disk metadata on these cards as well.
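
The usable sizes quoted above can be confirmed on the cell itself, for example (attribute names as reported by 'list celldisk detail' and 'list flashcache detail'):

cellcli -e list celldisk attributes name, deviceName, size
cellcli -e list flashcache attributes name, size, status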


Exadata Disk groups:
DATA (Data)
RECO (Redo logs, Archive logs and Flash Recovery Area)
SYSTEMDG (OCR and Votedisk)
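
On the ASM side these disk groups are created from the grid disks exposed by all the cells. A sketch of the SQL (the redundancy level and attribute values vary by installation):

SQL> create diskgroup DATA normal redundancy
       disk 'o/*/DATA*'
       attribute 'compatible.rdbms'        = '11.2.0.0.0',
                 'compatible.asm'          = '11.2.0.0.0',
                 'cell.smart_scan_capable' = 'TRUE',
                 'au_size'                 = '4M';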


Cell node unique features:
Smart Scans
Hybrid Columnar Compression
Storage Indexes
Flash Cache
Exadata I/O Resource Management in Multi-Database Environment
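
As an example of one of these features, Hybrid Columnar Compression is chosen per table or partition at creation time, and the level chosen drives the 4x to 50x footprint reduction mentioned earlier. A sketch (table names are illustrative):

SQL> create table sales_hcc compress for query high
     as select * from sales;

SQL> alter table sales_hist move compress for archive high;

Query compression suits warehouse-style access; archive compression trades more CPU for the highest ratios on rarely accessed data.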


Background Processes in the Exadata Cell Environment on the database server:
The background processes for the database and Oracle ASM instances in an Exadata
Cell environment are the same as in other environments, except for the following background processes:

- diskmon Process - The diskmon process is a fundamental component of Exadata Cell, and is responsible for implementing I/O fencing.

- XDMG Process (Exadata Automation Manager)
Its primary task is to watch for inaccessible disks and cells, and to detect when the disks and cells become accessible.

- XDWK Process (Exadata Automation Worker)
The XDWK process begins when asynchronous actions, such as ONLINE, DROP, or ADD of an Oracle ASM disk, are requested by the XDMG process.
The XDWK process will stop after 5 minutes of inactivity.

Output:
> ps -ef | egrep "diskmon|xdmg|xdwk"
oracle 4684 4206 0 06:42 pts/1 00:00:00 egrep diskmon|xdmg|xdwk
oracle 10321 1 0 2010 ? 00:38:15 /u01/app/11.2.0/grid/bin/diskmon.bin -d -f
oracle 10858 1 0 2010 ? 00:00:18 asm_xdmg_+ASM1


As a departure from ASM storage technology, which uses a process architecture borrowed from database instances,
the storage servers have a brand-new set of processes to manage disk I/O (a sample process check follows the list below). They are:

- RS, the restart service. Performing a similar role to SMON, RS monitors other processes, and automatically restarts them if they fail unexpectedly.
RS also handles planned restarts in conjunction with software updates.
The main cellrssrm process spawns several helper processes, including cellrsbmt, cellrsbkm, cellrsomt, and cellrsssmt.

- MS, the management service. MS is the back-end process that processes configuration and monitoring commands. It communicates with cellcli, described in the next section.
MS is written in Java, unlike the other background processes which are distributed in binary form and are likely written in C.

- CELLSRV, the cell service. CELLSRV handles the actual I/O processing of the storage server.
It is not uncommon to see heavy usage from CELLSRV process threads during periods of heavy load.
Among other things, CELLSRV provides:
. Communication with database nodes using the iDB/RDS protocols over the InfiniBand network
. Disk I/O with the underlying cell disks
. Offload of SQL processing from database nodes
. I/O resource management, prioritizing I/O requests based on a defined policy

- IORM, the I/O Resource Manager. IORM enables multiple databases and workloads to share the storage grid by prioritizing I/Os to ensure predictable performance.
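
A quick way to confirm these services on a cell is a simple process listing; 'service celld status' (as root) and 'cellcli -e list cell detail' report the same status information:

> ps -ef | egrep "cellrs|cellsrv|java" | grep -v egrep

Expect the cellrssrm restart server with its helper processes, a cellsrv process, and a Java process for MS.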



Cell node Management Overview:
DBAs log in as the OS user "celladmin" to manage cell nodes.
Cell nodes do not run an ASM instance; the cell disks are managed by the cell services themselves, which is why you cannot see an ASM pmon process on a cell node.

Cell administration tools: cellcli and dcli.
Cell monitoring tools: OSWatcher, ORION (I/O performance benchmarking tool) and ADRCI.
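
cellcli operates on the local cell, while dcli fans the same command out to a set of cells. A minimal sketch (the ~/cell_group file listing the cell host names is an assumption of this example):

cellcli -e list cell detail
dcli -g ~/cell_group -l celladmin "cellcli -e list cell attributes name, cellsrvStatus, msStatus, rsStatus"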


Cell Nodes Logs and Traces:
$ADR_BASE/diag/asm/cell/`hostname`/trace/alert.log
$ADR_BASE/diag/asm/cell/`hostname`/trace/ms-odl.*
$ADR_BASE/diag/asm/cell/`hostname`/trace/svtrc__0.trc -- ps -ef | grep "cellsrv 100"
$ADR_BASE/diag/asm/cell/`hostname`/incident/*

/var/log/messages*, dmesg
/var/log/sa/*
/var/log/cellos/*

cellcli -e list alerthistory

$OSSCONF/cellinit.ora -- #CELL Initialization Parameters
$OSSCONF/cell_disk_config.xml
$OSSCONF/griddisk.owners.dat
$OSSCONF/cell_bootstrap.ora

/opt/oracle/cell/cellsrv/deploy/log/cellcli.lst*

$OSSCONF/alerts.xml
$OSSCONF/metrics/*
oswatcher data

df -h -> check whether the /opt/oracle file system is full (/opt/oracle is only about 2 GB in size on a cell node!)

Where:
$OSSCONF is: /opt/oracle/cell11.2.1.3.1_LINUX.X64_100818.1/cellsrv/deploy/config
$ADR_BASE is: /opt/oracle/cell11.2.1.3.1_LINUX.X64_100818.1/log


Cell Check and shutdown/startup commands:
Note: For full list of commands use: cellcli -e help

cellcli -e alter cell shutdown services all
cellcli -e alter cell startup services all
cellcli -e alter cell shutdown services cellsrv
cellcli -e alter cell restart services cellsrv
cellcli -e list lun detail
cellcli -e list griddisk detail
cellcli -e list celldisk detail
cellcli -e list physicaldisk detail
cellcli -e list flashcache detail
cellcli -e list physicaldisk attributes name, diskType, luns, status
cellcli -e list physicaldisk where disktype=harddisk attributes physicalfirmware
cellcli -e list lun attributes name, diskType, isSystemLun, status

imagehistory (root/sudo)
imageinfo (root/sudo)
service celld status (root/sudo)
lsscsi | grep MARVELL
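
Before a planned shutdown of all services on a cell, it is common practice to confirm that ASM can tolerate the grid disks going offline and then deactivate them. A sketch of that sequence (proceed only when asmDeactivationOutcome shows Yes for every grid disk):

cellcli -e list griddisk attributes name, asmmodestatus, asmdeactivationoutcome
cellcli -e alter griddisk all inactive
cellcli -e alter cell shutdown services all

After maintenance:
cellcli -e alter cell startup services all
cellcli -e alter griddisk all active
cellcli -e list griddisk attributes name, asmmodestatus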


Smart scan layers:
Smart scan involves multiple layers of code:
KDS/KTR/KCBL - data layers in the RDBMS
KCFIS - smart scan layer in the RDBMS
Predicate Disk - smart scan layer in cellsrv
Storage index - I/O avoidance optimization in cellsrv
Flash IO - I/O layer in cellsrv that fetches data from the flash cache
Block IO - I/O layer in cellsrv that fetches data from the hard disks
FPLIB - filtering library in cellsrv
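
From the database side, whether these layers are actually being used shows up in the cell-related statistics. A sketch (statistic names as of 11.2; check v$statname on your version):

SQL> select name, value
     from   v$sysstat
     where  name in ('cell physical IO bytes eligible for predicate offload',
                     'cell physical IO interconnect bytes returned by smart scan',
                     'cell physical IO bytes saved by storage index',
                     'cell flash cache read hits');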


How to Isolate Whether or Not an Issue is Exadata Related:
The issue can be:
Wrong results when running the query on an Exadata database (I have personally faced this issue).
The query is slower when running on an Exadata database.

Is it a smart scan issue?
Set cell_offload_processing=false (default true).
If the problem no longer occurs, it is a smart scan issue.

Is it an FPLIB issue?
Set _kcfis_cell_passthru_enabled=true (default false).
If the problem no longer occurs, it is an FPLIB issue.

Is it a storage index issue?
Set _kcfis_storageidx_disabled=true (default false).
If the problem still occurs, it is not a storage index issue.

Is it a flash cache issue?
For 11.2.0.2, set _kcfis_keep_in_cellfc_enabled=false (default true) so the flash cache is not used.
For 11.2.0.1, set _kcfis_control1=1 (default 0).
If the problem still occurs, it is not a flash cache problem.
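
These switches are usually toggled at the session level while re-running the problem statement; hidden (underscore) parameters must be double-quoted. A sketch:

SQL> alter session set cell_offload_processing = false;
SQL> alter session set "_kcfis_cell_passthru_enabled" = true;
SQL> alter session set "_kcfis_storageidx_disabled" = true;
SQL> alter session set "_kcfis_keep_in_cellfc_enabled" = false;   -- 11.2.0.2 only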

Cell-related database views:
select * from sys.GV_$CELL_STATE;
select * from sys.GV_$CELL;
select * from sys.GV_$CELL_THREAD_HISTORY;
select * from sys.GV_$CELL_REQUEST_TOTALS;
select * from sys.GV_$CELL_CONFIG;


Bloom filter in Exadata:
The concept of bloom filtering was introduced in Oracle 10g.
When two tables are joined via a hash join, the first table (typically the smaller table) is scanned and the rows that satisfy the ‘where’ clause predicates (for that table) are used to create a hash table.
During the hash table creation a bit vector or bloom filter is also created based on the join column.
The bit vector is then sent as an additional predicate to the second table scan.
After the ‘where’ clause predicates have been applied to the second table scan, the resulting rows will have their join column hashed and it will be compared to values in the bit vector.
If a match is found in the bit vector that row will be sent to the hash join. If no match is found then the row will be disregarded.
On Exadata, the bloom filter or bit vector is passed as an additional predicate, so it is offloaded to the storage cells, making bloom filtering very efficient.

How to Identify a Bloom Filter in an Execution plan:
You can identify a bloom filter in a plan when you see :BF0000 in the Name column of the execution plan.
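
A trimmed, illustrative plan fragment (object names are made up; in a parallel plan the operations appear as PX JOIN FILTER CREATE/USE):

|   4 |     PX JOIN FILTER CREATE      | :BF0000 |
|   5 |      TABLE ACCESS STORAGE FULL | DIM     |
...
|   8 |     PX JOIN FILTER USE         | :BF0000 |
|*  9 |      TABLE ACCESS STORAGE FULL | FACT    |

   9 - storage(SYS_OP_BLOOM_FILTER(:BF0000,"FACT"."DIM_ID"))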

To disable the feature, the initialization parameter _bloom_pruning_enabled must be set to FALSE.



A few words about the other Exadata components:

The Sun Oracle Exadata Database Machine hardware consists of preconfigured Oracle Database servers connected to Sun Oracle Exadata Storage Servers
by an InfiniBand fabric. Each of these has been configured to take advantage of the latest advances in Oracle database technology.

Database Servers:
Industry-standard Oracle Database 11gR2 servers feature advanced software that makes the extreme performance of Exadata possible.
Automatic Storage Management (ASM) provides advanced storage management capabilities; the Database Resource Manager (DBRM) lets users
prioritize the resources available to each database; and the Intelligent Database protocol (IDB) allows Smart Scan offloading of database queries
to the Storage Servers, greatly reducing network overhead.

InfiniBand switch:
At 40 Gbit/s on each port, the InfiniBand network linking the Exadata database servers to the Storage Servers is ten times as fast as Fibre Channel,
with lower latency. Multipathing protects against network failures.

Management switch:
A single Cisco Catalyst 4948 48-port gigabit Ethernet switch handles management traffic.

KVM and rack:
One 32-port Avocent KVM switch with associated keyboard/mouse drawer provides console access to database servers and storage cells.
The switch is IP-enabled, meaning remote console access is available either via the individual system ILOM ports or the KVM switch.
All the components are housed in a 42U Sun 1242E rack with integrated zero-U power distribution units.