SAS删除字符串中的重复项

SAS程序猿/媛有时候会碰到去除字符串中重复值的问题,用常用的字符函数如SCAN,SUBSTR可能会很费劲,用正则表达式来处理就简单了。示例程序如下:

data _null_;
    infile cards truncover;
    input STRING $32767.;
    REX1=prxparse('s/([a-z].+?\.\s+)(.*?)(\1+)/\2\3/i');
    REX2=prxparse('/([a-z].+?\.\s+)(.*?)(\1+)/i');
    do i=1 to 100;
        STRING=prxchange(REX1, -1, compbl(STRING));
        if not prxmatch(REX2, compbl(STRING)) then leave;
    end;
    put STRING=;
cards;
a. The cow jumps over the moon.
a. The cow jumps over the moon. b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog. a. The cow jumps over the moon. 
b. The chicken crossed the road. a. The cow jumps over the moon. b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog.
a. The cow jumps over the moon. a. The cow jumps over the moon. b. The chicken crossed the road. b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog. c. The quick brown fox jumped over the lazy dog.
a. The cows jump over the moon. a. The cows jump over the moon. b. The chickens crossed the road. b. The chickens crossed the road. c. The quick brown foxes jumped over the lazy dog. c. The quick brown foxes jumped over the lazy dog.
a. The cow jumps over the moon. b. The chicken crossed the road.  c. The quick brown fox jumped over the lazy dog. a. The cow jumps over the moon.  b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog.
;
run;

可以看到上面的重复项是一整个句子,如果重复项是单词,上面的表达式就要改了:

data _null_;
    STRING='cow chicken fox cow chicken fox cows chickens foxes';
    REX1=prxparse('s/(\b\w+\b)(.*?)(\b\1+\b)/\2\3/i');
    REX2=prxparse('/(\b\w+\b)(.*?)(\b\1+\b)/i');
    do i=1 to 100;
        STRING=prxchange(REX1, -1, compbl(STRING));
        if not prxmatch(REX2, compbl(STRING)) then leave;
    end;
    put STRING=;
run;

注意上面的表达式中第一个括号中的\b是用来限定只匹配单词而不是单个字母。第三个括号中的\b表示精确匹配,即匹配一模一样的单词。

曾宪华 /
本文采用 署名-非商业性使用-相同方式共享 3.0许可协议 属于 程序人生 分类, 被贴了 Regular Expression 正则表达式 PRXCHANGE 书签

上一篇 一道小学生的趣味数学题
下一篇 儿子,欢迎你来到这个世界