SAS数据集中重复记录问题

SAS程序猿/媛在处理数据的时候，经常会遇到要处理有关重复记录的问题，其中有些重复记录是我们需要的，而有的则是多余的。如果是多余的直接去重：

PROC SORT，其中有两个选项NODUPKEY、NODUPRECS（NODUP），第一个是按照BY变量来去重，第二是比较整条记录来去重，重复的记录可以用DUPOUT=来保留。程序如下：
```
proc sort data=sashelp.class out=unq nodupkey dupout=dup;
    by WEIGHT;
run;
```

HASH，程序如下：

data _null_;
    if 0 then set sashelp.class;
    if _n_=1 then do;
        declare hash h(dataset: 'sashelp.class', ordered: 'y');
        h.definekey('WEIGHT');
        h.definedata(all:'y');
        h.definedone();
    end;
    h.output(dataset: 'uni');
    stop;
run;

如果重复记录是需要保留以备后用则可以用下面几种方法：

DATA步，程序如下：

proc sort data=sashelp.class out=class;
    by WEIGHT;
run;

data uni dup;
    set class;
    by WEIGHT;
    if first.WEIGHT and last.WEIGHT then output uni;
    else output dup;
run;

PROC SQL，程序如下：

proc sql;
    create table uni as
        select * 
        from sashelp.class
        group by WEIGHT
        having count(*) = 1
        ;

    create table dup as
        select * 
        from sashelp.class
        group by WEIGHT
        having count(*) > 1
        ;
quit;

HASH，程序（SAS9.2+）如下：

data uni(drop=rc: i);
    if _n_=1 then do;
        if 0 then set sashelp.class;
        dcl hash h1(dataset: 'sashelp.class', multidata:'y');
        h1.definekey('WEIGHT');
        h1.definedata(all: 'yes');
        h1.definedone();

        dcl hash h2(dataset: 'sashelp.class');
        dcl hiter hi('h2');
        h2.definekey('WEIGHT');
        h2.definedone();
    end;
    rc1=hi.first();
    do while(rc1=0);
        rc2= h1.find();
        i=0;
        do while(rc2=0 and i < 2);
            i+1;
            rc2=h1.find_next();
        end;
        if i < 2 then do;
            output;
            if i < 2 then h1.remove();
        end;
        rc1=hi.next();
    end;
    h1.output(dataset: 'dup');
run;

不管是去重还是保留重复的记录，上面几种方法中HASH行数都是最多的，但是这种方法在去重之前不用排序，故当处理的数据集较大时建议使用此方法以提高效率。

曾宪华 / 2015-12-05
本文采用署名-非商业性使用-相同方式共享 3.0许可协议。属于程序人生分类，被贴了 Hash Object NODUPKEY 去重书签

上一篇批量改变SAS数据集字符型变量的长度
下一篇正则表达式模式修饰词