Oracle SQL and PL/SQL: Algorithm ... Blog with numerous tips and tricks by Thomas Uhren

With this post I would like to start a series on data mining; specifically, on implementing algorithms with the database's built-in tools, along with the particularities that have to be taken into account in a database environment.

The example is a classification algorithm based on decision trees: the ID3 algorithm. Before we turn to the implementation, let us look at the result. First, the algorithm is initialized with the procedure init:

exec dm_id3.init(1,10);

The first argument selects the project, while the second sets the minimum number of observations per node; in this case, every node must contain at least 10 observations.

The project refers to a table containing the data of the "Chess (King-Rook vs. King)" data set.

After successful initialization, the table function gain is called. It returns the gain (or gain ratio), i.e. how well an attribute is suited for further splitting:

select * from table(dm_id3.gain(0)) order by gain desc;

NODE   ATTRIBUTE          INFO      SPLIT_INFO GAIN     
------ ------------------ --------- ---------- ---------- 
0      BLACK_KING_RANK    3.5041597 3.1896893  0.31446969 
0      WHITE_KING_RANK    3.5041597 3.2139022  0.29025673 
0      BLACK_KING_FILE    3.5041597 3.3190435  0.18511551 
0      WHITE_KING_FILE    3.5041597 3.3388416  0.16531733 
0      WHITE_ROOK_FILE    3.5041597 3.4540855  0.05007348 
0      WHITE_ROOK_RANK    3.5041597 3.4579962  0.04616276

You can see that the attribute "BLACK_KING_RANK" is best suited for splitting the root node.
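
As a rough illustration of where such numbers could come from: in the output above, GAIN equals INFO minus SPLIT_INFO up to rounding (3.5041597 - 3.1896893 = 0.3144704). The following query is only a sketch of that computation for a single attribute, not the dm_id3 implementation. The table name krk and the class column outcome are assumptions; INFO is taken to be the entropy of the class in the node and SPLIT_INFO the weighted entropy after splitting on the attribute.

```sql
-- Sketch only: krk and outcome are hypothetical names, not taken from dm_id3.
with class_counts as (
  -- number of observations per attribute value and class
  select black_king_rank as attr_value,
         outcome,
         count(*) as n
  from krk
  group by black_king_rank, outcome
),
totals as (
  select attr_value,
         outcome,
         n,
         sum(n) over (partition by attr_value) as n_value, -- rows with this value
         sum(n) over () as n_all                           -- rows in the node
  from class_counts
),
info as (
  -- entropy of the class over the whole node
  select -sum(p * log(2, p)) as info
  from (
    select sum(n) / max(n_all) as p
    from totals
    group by outcome
  )
),
split_info as (
  -- entropy of the class within each attribute value, weighted by its share
  select -sum((n / n_all) * log(2, n / n_value)) as split_info
  from totals
)
select i.info,
       s.split_info,
       i.info - s.split_info as gain
from info i cross join split_info s;
```

Running the same query once per attribute and sorting by gain would give a ranking like the one shown above, assuming the package uses these definitions.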

To build the decision tree, the procedure make_tree is called:

exec dm_id3.make_tree(0);

Now let us take a look at the generated decision tree:

select * from table(dm_id3.show_tree(1));

NODE                              SUPPORT  PREDICTION   CONFIDENCE DEPTH
--------------------------------- -------- ------------ ---------- ------
0: 1 = 1                          28056    fourteen     4553       1
   1: BLACK_KING_RANK = 1         3664     eight        591        2
      9: WHITE_KING_FILE = a      372      eleven       89         3
         49: WHITE_ROOK_RANK = 1  36       draw         10         4
         50: WHITE_ROOK_RANK = 2  48       eight        19         4
         51: WHITE_ROOK_RANK = 3  48       ten          25         4
         52: WHITE_ROOK_RANK = 4  48       twelve       22         4
         53: WHITE_ROOK_RANK = 5  48       eleven       13         4
         54: WHITE_ROOK_RANK = 6  48       fourteen     13         4
         55: WHITE_ROOK_RANK = 7  48       fourteen     13         4
         56: WHITE_ROOK_RANK = 8  48       eleven       12         4
      10: WHITE_KING_FILE = b     620      ten          163        3
...

Among other things, you can see that the attribute "BLACK_KING_RANK" was indeed used to split the root node.

That's all for today... the upcoming posts will cover the actual implementation of the functions shown here.

Edsger W. Dijkstra's algorithm finds the shortest paths from a start node to all other nodes in a graph.

The graph is represented as a simple table with the columns ID, SOURCE, DESTINATION and DISTANCE.

As an example, consider the graph described by the following table:

        ID     SOURCE DESTINATION DISTANCE
---------- ---------- ----------- --------
         1          1           2        4
         1          1           3        6
         1          1           4        8
         1          2           5        7
         1          2           3        1
         1          3           4        2
         1          3           5        5
         1          3           6        4
         1          4           6        5
         1          5           7        6
         1          6           5        1
         1          6           7        8

On this basis, the shortest paths starting from node 1 are to be found, although any node can serve as the start.
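
To make the goal concrete, the shortest distances can also be obtained by brute force: enumerate every path from node 1 and keep the minimum per node. The query below is only a sketch for cross-checking small graphs, not Dijkstra's algorithm; it assumes Oracle 11g Release 2 or later (recursive subquery factoring) and the table graph with graph ID 1 as described above.

```sql
-- Brute-force cross-check: enumerate all paths starting at vertex 1.
with paths (vertex, distance, path) as (
  -- anchor: the start vertex with distance 0
  select 1, 0, cast('1' as varchar2(4000))
  from dual
  union all
  -- recursive step: extend every known path by one outgoing edge
  select g.destination,
         p.distance + g.distance,
         p.path || ' -> ' || g.destination
  from paths p, graph g
  where g.id = 1
    and g.source = p.vertex
)
-- stop expanding a path as soon as it revisits a vertex
cycle vertex set is_cycle to 'Y' default 'N'
select vertex, min(distance) as distance
from paths
group by vertex
order by vertex;
```

The number of paths can grow exponentially with the size of the graph, which is why an algorithm such as Dijkstra's is preferable for anything beyond a toy example.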

The PL/SQL table function is called like this:

select * from table(dijkstra(1,1)) order by vertex;

    VERTEX DISTANCE PREDECESSOR PATH
---------- -------- ----------- -----------------------
         1        0             1
         2        4           1 1 -> 2
         3        5           2 1 -> 2 -> 3
         4        7           3 1 -> 2 -> 3 -> 4
         5       10           3 1 -> 2 -> 3 -> 5
         6        9           3 1 -> 2 -> 3 -> 6
         7       16           5 1 -> 2 -> 3 -> 5 -> 7

The DISTANCE column gives the minimum distance to the respective node, the PREDECESSOR column contains the direct predecessor, and the PATH column lists the nodes traversed along the shortest path.
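
Because the result is an ordinary row source, it can be filtered like any table. A small usage sketch, assuming the function dijkstra and the sample data from this post are installed:

```sql
-- Shortest path from vertex 1 to vertex 7 in graph 1 only.
select distance, predecessor, path
from table(dijkstra(1, 1))
where vertex = 7;
```

According to the result listed above, this should return the single row with distance 16 and the path 1 -> 2 -> 3 -> 5 -> 7.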

After this short demonstration of how it works, here is the corresponding code:

-- Table

drop table graph;

create table graph
(
 id integer,
 source integer,
 destination integer,
 distance number(38,2) not null
);

alter table graph
add constraint pk_graph
primary key (id, source, destination);

alter table graph
add constraint c_graph_source
check (source > 0);

alter table graph
add constraint c_graph_destination
check (destination > 0);

alter table graph
add constraint c_graph_distance
check (distance >= 0);

-- Data

insert into graph(id,source,destination,distance)
values(1,1,2,4);
insert into graph(id,source,destination,distance)
values(1,1,3,6);
insert into graph(id,source,destination,distance)
values(1,1,4,8);
insert into graph(id,source,destination,distance)
values(1,2,5,7);
insert into graph(id,source,destination,distance)
values(1,2,3,1);
insert into graph(id,source,destination,distance)
values(1,3,4,2);
insert into graph(id,source,destination,distance)
values(1,3,5,5);
insert into graph(id,source,destination,distance)
values(1,3,6,4);
insert into graph(id,source,destination,distance)
values(1,4,6,5);
insert into graph(id,source,destination,distance)
values(1,5,7,6);
insert into graph(id,source,destination,distance)
values(1,6,5,1);
insert into graph(id,source,destination,distance)
values(1,6,7,8);
commit;

-- Type

drop type dijkstra_tab;

drop type dijkstra_t;

create type dijkstra_t as object
(
 vertex integer,
 distance binary_double,
 predecessor integer,
 path varchar2(4000)
);
/

create type dijkstra_tab as table of dijkstra_t;
/

-- Table function

create or replace
function dijkstra(p_graph_in in binary_integer, p_vertex_in in binary_integer) return dijkstra_tab pipelined
is
 
 graph_not_found exception;
 vertex_not_found exception;
 
 type unchecked_tab is table of binary_integer index by binary_integer;
 type predecessor_tab is table of binary_integer index by binary_integer;
 type distance_tab is table of binary_double index by binary_integer;
 
 cursor init_cur is
  select source vertex
  from graph
  where id = p_graph_in
   union
  select destination vertex
  from graph
  where id = p_graph_in;

 cursor distance_cur(pc_vertex_in in binary_integer) is
  select destination, distance
  from graph
  where id = p_graph_in and source = pc_vertex_in;
 
 i binary_integer;
 v_dummy varchar2(10);
 v_unchecked unchecked_tab;
 v_predecessor predecessor_tab;
 v_distance distance_tab;
 v_minimum binary_integer;
 v_alternative binary_double;
 v_path varchar2(4000);
 
begin

 begin
  select 'TRUE' into v_dummy
  from dual
  where exists
  (
   select *
   from graph
   where id = p_graph_in
  );
 exception
  when no_data_found then
   raise graph_not_found;
 end;
 
 begin
  select 'TRUE' into v_dummy
  from dual
  where exists
  (
   select *
   from graph
   where id = p_graph_in and (source = p_vertex_in or destination = p_vertex_in)
  );
 exception
  when no_data_found then
   raise vertex_not_found;
 end;
 
 -- Initialization: every vertex is unchecked, has no predecessor and an
 -- infinite distance; only the start vertex gets distance 0.
 begin
  for init_rec in init_cur loop
   v_unchecked(init_rec.vertex) := null; -- only the key matters (set of unchecked vertices)
   v_predecessor(init_rec.vertex) := null;
   v_distance(init_rec.vertex) := binary_double_infinity;
  end loop;
  v_distance(p_vertex_in) := 0;
 end;

 begin
  while (v_unchecked.count > 0) loop
   -- Pick the unchecked vertex with the smallest known distance.
   v_minimum := null;
   i := v_unchecked.first;
   while (i is not null) loop
    if (v_minimum is null) then
     v_minimum := i;
    else
     if (v_distance(i) < v_distance(v_minimum)) then
      v_minimum := i;
     end if;
    end if;
    i := v_unchecked.next(i);
   end loop;
   v_unchecked.delete(v_minimum);
   -- Relax all outgoing edges of that vertex.
   for distance_rec in distance_cur(v_minimum) loop
    if (v_unchecked.exists(distance_rec.destination)) then
     v_alternative := v_distance(v_minimum) + distance_rec.distance;
     if (v_alternative < v_distance(distance_rec.destination)) then
      v_distance(distance_rec.destination) := v_alternative;
      v_predecessor(distance_rec.destination) := v_minimum;
     end if;
    end if;
   end loop;
   -- Build the path by walking the predecessors back to the start vertex;
   -- unreachable vertices keep an empty path.
   if (v_distance(v_minimum) = binary_double_infinity) then
    v_path := '';
   else
    v_path := v_minimum;
   end if;
   i := v_predecessor(v_minimum);
   while (i is not null) loop
    v_path := i || ' -> ' || v_path;
    i := v_predecessor(i);
   end loop;
   -- The distance of v_minimum is final at this point, so emit the row.
   pipe row( dijkstra_t (
    v_minimum,
    v_distance(v_minimum),
    v_predecessor(v_minimum),
    v_path ) );
  end loop;
 end;

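exception
 when no_data_needed then
  -- raised when the caller stops fetching early; must not be turned into an error
  raise;
 when graph_not_found then
  raise_application_error(-20010, 'DIJKSTRA: The graph was not found.');
 when vertex_not_found then
  raise_application_error(-20011, 'DIJKSTRA: The vertex to start the algorithm was not found.');
 when others then
  -- substr takes the string first, then position and length
  raise_application_error(-20012, 'DIJKSTRA: Unexpected error: ' || substr(SQLERRM, 1, 200));
end;
/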
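
The two validation checks at the start of the function can be exercised with calls like the following. This is only a usage sketch against the sample data, in which neither graph 99 nor vertex 42 exists:

```sql
-- Unknown graph ID: expected to fail with ORA-20010
-- ("DIJKSTRA: The graph was not found.").
select * from table(dijkstra(99, 1));

-- Known graph, unknown start vertex: expected to fail with ORA-20011
-- ("DIJKSTRA: The vertex to start the algorithm was not found.").
select * from table(dijkstra(1, 42));
```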

Saturday, July 21, 2012

Data Mining with Built-in Tools - Part 1

Thursday, July 28, 2011

Dijkstra's Algorithm with PL/SQL