曾宪华

Converting a Python Script to an exe File

曾宪华 — 2018-06-24T15:59:00+00:00

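The previous post introduced a Python script that automatically updates a .docx file. After staying up all night coding it (I wrote it while watching Portugal vs Spain), I wondered how to share it with the whole department. Company computers have no Python environment (and we have no permission to install one), so I looked for a way to run the script on a machine without Python installed. A Google search turned up py2exe and PyInstaller, both of which can package a Python script as a Windows executable (PyInstaller also supports other platforms). After comparing them I found PyInstaller easier to install and use (see the figure below), so I chose PyInstaller. This post records the conversion process.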

First, installation: enter the command pip install pyinstaller in the console and press Enter. A successful installation looks like the figure below:

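Next, usage: in the directory containing the script, enter the command pyinstaller Checklist.py and press Enter. A successful conversion looks like the figure below: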

Open the script's directory and you will see three new folders and one new file, as in the screenshot below:

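According to the official documentation, the exe file is saved in the dist folder (see the figure below), so we only need to take that one folder with us to run the Python script on a machine without a Python environment.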

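You may feel that a whole folder is not tidy enough. Can we carry just a single exe file? Of course: adding the -F option during conversion produces a single exe file, as in the screenshot below: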

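The exe files produced by the two methods differ greatly in size (1.52 MB for the first, 6.99 MB for the second), but in my tests there was no noticeable difference in startup time, probably because my script is simple. For a complex Python script, however, the exe built with -F will be much larger than the one built without it and will also start noticeably more slowly, because a one-file exe has to unpack its bundled files to a temporary folder every time it runs. So when converting a complex Python script, I recommend leaving out -F to keep the exe starting quickly.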
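
The two conversion modes described above can also be driven from Python. The following is only a minimal sketch: the helper names build_pyinstaller_command and package are made up for illustration, and it assumes PyInstaller is installed for the interpreter that runs it. It relies on running PyInstaller as a module (python -m PyInstaller) and on --onefile, the long form of -F.

```python
import subprocess
import sys


def build_pyinstaller_command(script, onefile=False):
    """Build the command line for packaging `script` (hypothetical helper)."""
    command = [sys.executable, "-m", "PyInstaller"]
    if onefile:
        # --onefile is the long form of -F: bundle everything into a single exe.
        command.append("--onefile")
    command.append(script)
    return command


def package(script, onefile=False):
    """Run PyInstaller; raises CalledProcessError if the build fails."""
    subprocess.run(build_pyinstaller_command(script, onefile), check=True)


# Only print the commands here; call package("Checklist.py") to actually build.
print(build_pyinstaller_command("Checklist.py"))
print(build_pyinstaller_command("Checklist.py", onefile=True))
```

Keeping the command construction separate from the call that runs it makes it easy to check which options will be used before starting a build.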


Controlling Word with Python

曾宪华 — 2018-06-18T21:01:00+00:00

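At the end of April, I took a paper I had cobbled together and traveled from Shenzhen to Seattle in the US to attend the 2018 annual conference of a pharmaceutical-industry software users group (PharmaSUG 2018). I heard a number of talks and learned a lot. Among them, one titled Why SAS Programmers Should Learn Python Too was rather interesting. In my view, though, the examples in the paper did not really show off Python's strengths, because they are just as easy to implement with a Linux shell script. Admittedly, if you want to pick a language for getting started with programming, Python is definitely the first choice! But for SAS programmers, I don't think there is much need to learn Python at this stage: given the nature of the industry, Python is of almost no use in a SAS programmer's day-to-day programming work. The exception is if, like me, you enjoy tinkering with code, or you want to switch industries and become a full-time developer. In that case Python is a must, because it has all kinds of powerful libraries. Let's get a feel for the power of the python-docx library!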

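As we know, a SAS programmer who leads a project needs to prepare a timestamp file at delivery (let's assume this file is used across the industry) to show that each task was carried out in order, as in the figure below (note: since this is an internal company document, the cell contents have been trimmed):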

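Without a program, every delivery means updating this file by copying and pasting one cell at a time. There are not many cells to update, but doing it by hand is still a bit time-consuming. I imagine doing this in SAS (which I don't know how to do, sigh) would be more trouble than in Python, so I used Python. Briefly, the idea is as follows: first, find the positions of the cells in the column to the left of the cells that need updating. The code is as follows: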

# coding=utf-8

from docx import Document

chklst = Document('C:\\Users\\Xianhua\\Documents\\Python\\Checklist.docx')

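table = chklst.tables[2] # the third table

for i in range(1,len(table.rows)): # start reading from the second row of the table
    for j in range(1,2): # read only the second column of the table
        # print the position of the cell
        print(i, j)
        # print the content of the cell
        print(table.rows[i].cells[j].text)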

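Of course, you can also get the positions by simply opening the document and looking. For example, the cell in the first row and second column in the figure above has coordinates (1, 1); the indices are zero-based, and the loop above skips row 0. The output of the code is shown in the figure below: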

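Then assign values to the column to the right of the positions found. The code below assumes that the timestamps have already been collected and saved in a TXT file (the latest timestamps can be obtained with FILENAME PIPE; see the linked example), as in the figure below: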

The code that updates the timestamps is as follows:

# coding=utf-8

from docx import Document
import re
from datetime import datetime
from docx.shared import Pt
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT

timestamp = open('C:\\Users\\Xianhua\\Documents\\Python\\Checklist.txt', 'r')

# Convert the TXT file into a dictionary
mydic = dict() 

for line in timestamp:
    matchObj = re.match( r'(\w+?)\s+(.+)', line)
    if matchObj:
        mydic[matchObj.group(1).rstrip()] = matchObj.group(2)
    
timestamp.close()

chklist = Document('C:\\Users\\Xianhua\\Documents\\Python\\Checklist.docx')

table = chklist.tables[0] # the first table

for i in range(1, len(table.rows)): # start looping from the second row of the table
    # Transfer date
    if i == 3:
        table.rows[i].cells[2].text = datetime.now().date().strftime('%d %b %Y')

table = chklist.tables[2] # the third table

for i in range(1, 6):
    # Raw data
    if i == 1:
        table.rows[i].cells[2].text = mydic['raw']
        
    # Vendor data
    if i == 2:
        table.rows[i].cells[2].text = mydic['edat']
        
    # Transfer datasets
    if i == 3:
        table.rows[i].cells[2].text = mydic['transfer']
        
    # Logcheck
    if i == 4:
        table.rows[i].cells[2].text = mydic['logcheck']
        
    # Spec and Define
    if i == 5:
        table.rows[i].cells[2].text = mydic['spec'] + '\n' + mydic['define']
        
    # Change the font
    run = table.rows[i].cells[2].paragraphs[0].runs
    font = run[0].font
    font.name = 'Courier New'

    # Center the cell text
    table.rows[i].cells[2].paragraphs[0].alignment = WD_PARAGRAPH_ALIGNMENT.CENTER

chklist.save('C:\\Users\\Xianhua\\Documents\\Python\\Checklist '+ datetime.now().date().strftime('%Y%m%d')+'.docx')

A screenshot of the updated file is shown below:
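
The step that turns the TXT file into a dictionary can be tried on its own, without python-docx. The sketch below reuses the regular expression from the script above; the sample lines and the function name parse_timestamps are assumptions made for illustration, since the actual layout of Checklist.txt is only shown in a screenshot.

```python
import re

# Hypothetical sample content: a key, some whitespace, then the timestamp.
SAMPLE = """\
raw       2018-06-15 10:02:11
edat      2018-06-15 10:05:43
transfer  2018-06-16 09:30:00
"""


def parse_timestamps(text):
    """Map the first word of each line to the rest of that line."""
    pattern = re.compile(r'(\w+?)\s+(.+)')
    result = {}
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            # group(1) is the key, group(2) is everything after the whitespace.
            result[match.group(1)] = match.group(2).strip()
    return result


timestamps = parse_timestamps(SAMPLE)
print(timestamps['raw'])
```

Lines that do not contain a key followed by whitespace and a value are skipped, which is the same behaviour as the if matchObj check in the script.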

Counting Letter Frequencies in a Text with SAS

曾宪华 — 2018-02-17T10:01:00+00:00

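Today I stumbled on an old forum thread: count the occurrences and frequency of each letter in a text. Let's start with counting words. The most direct approach is to split the text into one word per row and then use PROC FREQ to get the counts and frequencies. The program is as follows: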

data;
    TEXT="It is Teacher's Day today. On this special occasion I would like to extend my heartfelt congratulations to all teachers, Happy Teacher's Day! Of all teachers who have taught me since my early childhood, the most unforgettable one is my first English teacher in college, Ms. Zhang. It is she who has aroused my keen interest in the learning of English and helped me realize the importance of self-reliance. Born into a poor farmer's family in a mountainous area and educated in relatively primitive surroundings, I found myself lagging far behind in the first class in college, which happened to be Ms. Zhang's English class. I was really discouraged and frustrated, so I decided to drop out. Ms. Zhang was so keenly insightful that she had noticed my embarrassment in class. After class, she called me into the Teacher's Room and discussed the situation with me, earnestly and kindly, citing the example of Robinson Crusoe to motivate me to go ahead in spite of all kinds of difficulties. Be a man and rely on yourself, she nudged me. The next time we met, she brought me a simplified version of Robinson Crusoe and recommended that I finish reading it in a week and write a book report. Under her consistent and patient guidance, not only has my English been greatly improved, but my confidence and courage enhanced considerably. Rely on yourself and be a man, Ms. Zhang's inspiring words have been echoing in my mind. I will work harder and try my utmost to lay a solid foundation for my future career. Only by so doing can I repay Ms. Zhang's kindness and live up to her expectations of me, that is, to become a useful person and contribute to society.";
    i=1;
    do until(scan(TEXT, i)='');
        WORD=scan(TEXT, i);
        output;
        i+1;
    end;
run;

proc freq;
    tables WORD / noprint out=counts;
run;

The result is as follows:

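The method above could also be used to count letter frequencies, but it is rather crude: the longer the text, the more rows it produces. The following introduces an approach using CALL PRXNEXT: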

data demo;
    TEXT="It is Teacher's Day today. On this special occasion I would like to extend my heartfelt congratulations to all teachers, Happy Teacher's Day! Of all teachers who have taught me since my early childhood, the most unforgettable one is my first English teacher in college, Ms. Zhang. It is she who has aroused my keen interest in the learning of English and helped me realize the importance of self-reliance. Born into a poor farmer's family in a mountainous area and educated in relatively primitive surroundings, I found myself lagging far behind in the first class in college, which happened to be Ms. Zhang's English class. I was really discouraged and frustrated, so I decided to drop out. Ms. Zhang was so keenly insightful that she had noticed my embarrassment in class. After class, she called me into the Teacher's Room and discussed the situation with me, earnestly and kindly, citing the example of Robinson Crusoe to motivate me to go ahead in spite of all kinds of difficulties. Be a man and rely on yourself, she nudged me. The next time we met, she brought me a simplified version of Robinson Crusoe and recommended that I finish reading it in a week and write a book report. Under her consistent and patient guidance, not only has my English been greatly improved, but my confidence and courage enhanced considerably. Rely on yourself and be a man, Ms. Zhang's inspiring words have been echoing in my mind. I will work harder and try my utmost to lay a solid foundation for my future career. Only by so doing can I repay Ms. Zhang's kindness and live up to her expectations of me, that is, to become a useful person and contribute to society.";
    TEXT_TEMP=TEXT;
    if _N_=1 then do;
        RE1=prxparse('s/(\b.+?\b)(\s.*?)(\b\1+\b)/\2\3/i');
        RE2=prxparse('/(\b.+?\b)(\s.*?)(\b\1+\b)/i');
    end;
    /*Remove repeated values*/
    do i=1 to 1000;
        TEXT=prxchange(RE1, -1, cats(TEXT));
        if not prxmatch(RE2, cats(TEXT)) then leave;
    end;
    do i=1 to countw(TEXT);
        WORD=scan(TEXT, i);
        COUNT=0;
        RE=prxparse('/\b'||cats(WORD)||'\b/i');
        START=1;
        STOP=length(TEXT_TEMP);
        call prxnext(RE, START, STOP, TEXT_TEMP, POSITION, LENGTH);
        do while(POSITION>0);
            COUNT+1;
            call prxnext(RE, START, STOP, TEXT_TEMP, POSITION, LENGTH);
        end;
        FREQ=COUNT/countw(TEXT_TEMP)*100;
        keep WORD COUNT FREQ;
        output;
    end;
run;

Note that the first method is case-sensitive: for example, it counts 'Be' and 'be' separately (see the figure below).
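
The effect of case on word counts is easy to reproduce outside SAS. The sketch below is a Python illustration, not the SAS program: it counts the words of two fragments taken from the sample text, once as they are and once after lowercasing. The tokenizing pattern (letters and apostrophes) is an assumption and does not match SCAN's default delimiters exactly.

```python
import re
from collections import Counter

# Two fragments taken from the sample text used in this post.
text = ("Be a man and rely on yourself, she nudged me. "
        "Rely on yourself and be a man.")

# Treat runs of letters and apostrophes as words (an illustrative choice).
words = re.findall(r"[A-Za-z']+", text)

case_sensitive = Counter(words)
case_folded = Counter(word.lower() for word in words)

print(case_sensitive["Be"], case_sensitive["be"])
print(case_folded["be"])
print(case_folded["be"] / len(words) * 100)  # frequency in percent
```

Lowercasing before counting merges 'Be' with 'be' and 'Rely' with 'rely', so each pair is reported as one word.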

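Of course, we can normalize the case before running PROC FREQ. The second method uses a regular expression to remove duplicates, so it is somewhat slow; alternatively, the duplicates can be removed at the end with PROC SORT. The second method can also be used to count letters. The program is as follows: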

data demo;
    TEXT="It is Teacher's Day today. On this special occasion I would like to extend my heartfelt congratulations to all teachers, Happy Teacher's Day! Of all teachers who have taught me since my early childhood, the most unforgettable one is my first English teacher in college, Ms. Zhang. It is she who has aroused my keen interest in the learning of English and helped me realize the importance of self-reliance. Born into a poor farmer's family in a mountainous area and educated in relatively primitive surroundings, I found myself lagging far behind in the first class in college, which happened to be Ms. Zhang's English class. I was really discouraged and frustrated, so I decided to drop out. Ms. Zhang was so keenly insightful that she had noticed my embarrassment in class. After class, she called me into the Teacher's Room and discussed the situation with me, earnestly and kindly, citing the example of Robinson Crusoe to motivate me to go ahead in spite of all kinds of difficulties. Be a man and rely on yourself, she nudged me. The next time we met, she brought me a simplified version of Robinson Crusoe and recommended that I finish reading it in a week and write a book report. Under her consistent and patient guidance, not only has my English been greatly improved, but my confidence and courage enhanced considerably. Rely on yourself and be a man, Ms. Zhang's inspiring words have been echoing in my mind. I will work harder and try my utmost to lay a solid foundation for my future career. Only by so doing can I repay Ms. Zhang's kindness and live up to her expectations of me, that is, to become a useful person and contribute to society.";
    do i=1 to 26;
        CHAR=byte(i+64);
        COUNT=0;
        RE=prxparse('/'||CHAR||'/i');
        START=1;
        STOP=length(TEXT);
        call prxnext(RE, START, STOP, TEXT, POSITION, LENGTH);
        do while(POSITION>0);
            COUNT+1;
            call prxnext(RE, START, STOP, TEXT, POSITION, LENGTH);
        end;
        if COUNT>0 then do;
            FREQ=COUNT/length(compress(prxchange('s/\W//', -1, TEXT)))*100;
            output;
        end;
    end;
    keep CHAR COUNT FREQ;
run;

The result is as follows:

Of course, SAS has a ready-made function, COUNTC, that can be used to count letter frequencies. The program is as follows:

data demo;
    TEXT="It is Teacher's Day today. On this special occasion I would like to extend my heartfelt congratulations to all teachers, Happy Teacher's Day! Of all teachers who have taught me since my early childhood, the most unforgettable one is my first English teacher in college, Ms. Zhang. It is she who has aroused my keen interest in the learning of English and helped me realize the importance of self-reliance. Born into a poor farmer's family in a mountainous area and educated in relatively primitive surroundings, I found myself lagging far behind in the first class in college, which happened to be Ms. Zhang's English class. I was really discouraged and frustrated, so I decided to drop out. Ms. Zhang was so keenly insightful that she had noticed my embarrassment in class. After class, she called me into the Teacher's Room and discussed the situation with me, earnestly and kindly, citing the example of Robinson Crusoe to motivate me to go ahead in spite of all kinds of difficulties. Be a man and rely on yourself, she nudged me. The next time we met, she brought me a simplified version of Robinson Crusoe and recommended that I finish reading it in a week and write a book report. Under her consistent and patient guidance, not only has my English been greatly improved, but my confidence and courage enhanced considerably. Rely on yourself and be a man, Ms. Zhang's inspiring words have been echoing in my mind. I will work harder and try my utmost to lay a solid foundation for my future career. Only by so doing can I repay Ms. Zhang's kindness and live up to her expectations of me, that is, to become a useful person and contribute to society.";
    do i=1 to 26;
        CHAR=byte(i+64);
        COUNT=countc(TEXT, CHAR, 'i');
        if COUNT>0 then do;
            FREQ=COUNT/length(compress(prxchange('s/\W//', -1, TEXT)))*100;
            output;
        end;
    end;
    keep CHAR COUNT FREQ;
run;
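
For comparison, the same letter count can be sketched in Python. This is an illustration rather than a translation of the SAS program: the function name letter_frequencies is made up, and the denominator is the number of ASCII letters, whereas the SAS code divides by the length of the text after removing \W characters (the two agree only when the text has no digits or underscores).

```python
from collections import Counter


def letter_frequencies(text):
    """Return {letter: (count, percent)} for the ASCII letters, ignoring case."""
    letters = [ch.upper() for ch in text if ch.isascii() and ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: (n, n / total * 100) for ch, n in sorted(counts.items())}


# A short phrase taken from the sample text.
freq = letter_frequencies("Happy Teacher's Day!")
for letter, (count, percent) in freq.items():
    print(letter, count, round(percent, 2))
```

Like the SAS programs, the sketch reports only the letters that actually occur.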

Reshaping a Matrix in SAS

曾宪华 — 2018-01-04T22:02:00+00:00

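Recently a member of a QQ group (group 144839730) asked a question: transpose the dataset named HAVE in the figure above into the dataset named WANT. There are several ways to do this; the easiest to understand is probably TRANSPOSE. Below are a few other methods: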

FILENAME:

data have;
    a_t1=1; b_t1=2; a_t2=3; b_t2=4; a_t3=5; b_t3=6; a_t4=7; b_t4=8;
run;

filename code temp;
data _null_;
    file code;
    set have;
    array vlst{*} _numeric_;
    do i=1 to dim(vlst) BY 2;
        N1=vname(vlst{i});
        N2=vname(vlst{i+1});
        N3=prxchange('s/(\w+?)_(\w+)/\1_\2=\1/', -1, catx(' ', N1, N2));
        N4=scan(N1, 2, '_');
        put ' SET have(keep=' N1 N2' rename=(' N3 '));' @@;
        put ' NAME="' N4 '"; output;'; 
    end;   
run;

data want;
    length NAME $32;
    %inc code;
run;

CALL EXECUTE:

data temp;
    set have;
    array vlst{*} _numeric_;
    do i=1 to dim(vlst) BY 2;
        N1=vname(vlst{i});
        N2=vname(vlst{i+1});
        N3=prxchange('s/(\w+?)_(\w+)/\1_\2=\1/', -1, catx(' ', N1, N2));
        N4=scan(N1, 2, '_');
        keep N:;
        output;
    end;   
run;

data want;
    set temp end=last;
    if _n_=1 then call execute('data want; length NAME $32;');
    call execute('SET have(keep='||catx(' ', N1, N2)||' rename=('||cats(N3)||')); NAME="' ||cats(N4)||'"; output;');
    if last then call execute('run;');
run;

You may feel that both methods above take quite a few lines of code. In that case, see the following SAS/IML approach:

proc iml;
    use have;
    read all var _NUM_ into M1[c=VARNAMES];
    close;
    NAME1=scan(VARNAMES, 1, '_');
    NAME2=scan(VARNAMES, -1, '_');
    ROW=unique(NAME1);
    NAME=unique(NAME2);
    M2=shape(M1, 0, 2);
    create want from M2[c=ROW r=NAME];
    append from M2[r=NAME];
    close;
quit;

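Note that I set the number of rows in the SHAPE function above to 0, so the actual number of rows is determined by the number of columns: reshaping a 1-by-8 matrix into 2 columns can only give 4 rows. When there are many rows and columns, setting the row or column count to 0 is therefore simpler, because you don't have to work it out yourself.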
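
The "0 means work it out for me" rule can be sketched in a few lines of Python. The function below is a hypothetical stand-in, not SAS/IML's SHAPE: it fills the result row by row and infers whichever dimension is given as 0, but as a simplification it raises an error when the sizes do not fit.

```python
def shape(values, nrow, ncol):
    """Reshape a flat list row by row; a dimension given as 0 is inferred."""
    n = len(values)
    if nrow == 0 and ncol == 0:
        raise ValueError("at most one dimension may be 0")
    if nrow == 0:
        nrow, remainder = divmod(n, ncol)
    elif ncol == 0:
        ncol, remainder = divmod(n, nrow)
    else:
        remainder = n - nrow * ncol
    if remainder != 0:
        raise ValueError("the values do not fill the requested shape")
    # Row r takes the slice of ncol consecutive values starting at r * ncol.
    return [values[r * ncol:(r + 1) * ncol] for r in range(nrow)]


# The 1-by-8 row from the HAVE dataset, reshaped into 2 columns.
m2 = shape([1, 2, 3, 4, 5, 6, 7, 8], 0, 2)
print(m2)
```

With 8 values and 2 columns the only possible row count is 4, which is exactly the value the function infers.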

Finding All Subsets in SAS

曾宪华 — 2017-11-26T23:01:00+00:00

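A few days ago I saw a question in a WeChat group: find the subsets of an array. SAS offers several ways to enumerate such combinations; the easiest to understand are probably PROC SUMMARY and CALL ALLCOMB (code for both is in the linked post). Below is a method that does it all in a single DATA step: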

data subsets;
    array set1[*] $ a b c d e ('a', 'b', 'c', 'd', 'e');
    array set2[*] bin1-bin5;
    array set3[*] $ ele1-ele5;
    do i=0 to dim(set1)-1;
        set2(i+1)=2**(dim(set1)-(i+1));
    end;
    do j=0 to 2**dim(set1)-1;
        call missing(of ele:);
        do i=1 to dim(set1);
            if band(j, set2(i)) then set3(i)=set1(i);
        end;
        SUBSETS=cats(of ele:);
        keep SUBSETS;
        output;
    end;
run;

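A brief explanation of the idea: a set with n elements has 2 to the power n subsets, because each element is either present or absent. First, map the array elements to binary values: 'abcde' corresponds to '00011111', and the individual elements correspond to 2^4 = 16, 2^3 = 8, 2^2 = 4, 2^1 = 2 and 2^0 = 1. Then use the BAND function to do a bitwise AND between each number from 0 to 31 (0 represents the empty set) and each element's value. If the result is true (nonzero), the element's value is assigned to the new array, and finally the new array is concatenated to form the subset.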
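
The same bitmask idea can be sketched in Python, where the & operator plays the role of BAND. The function name all_subsets is made up for illustration; the weights and the loop over 0 to 2^n - 1 follow the description above.

```python
def all_subsets(elements):
    """Enumerate every subset of `elements` with one bitmask per subset."""
    n = len(elements)
    # weights[i] is the bit assigned to elements[i]: 16, 8, 4, 2, 1 for n = 5.
    weights = [2 ** (n - 1 - i) for i in range(n)]
    subsets = []
    for mask in range(2 ** n):
        # Keep an element when its bit is set in the mask (bitwise AND).
        chosen = [e for e, w in zip(elements, weights) if mask & w]
        subsets.append("".join(chosen))
    return subsets


subsets = all_subsets(["a", "b", "c", "d", "e"])
print(len(subsets))
print(subsets[:4])
```

Mask 0 gives the empty set and mask 31 (binary 11111) gives 'abcde', so the list has 32 entries in total.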

Implementing a LEAD Function in SAS

曾宪华 — 2017-10-02T11:57:00+00:00

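SAS programmers all know that SAS has a lag function, LAG. Is there a lead function that does the opposite? The answer is no, but there are workarounds. The simplest is to create a sort variable equal to _N_, sort in descending order, apply LAG, and then sort back in ascending order. The method is simple and clear, but it takes several PROC and DATA steps and is very inefficient on large data. Below are two other methods: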

Double SET:

data demo;
    input X @@;
cards;
1 2 3 4 5 6
;
run;

data lead;
    set demo end=eod;
    LAG=lag(X);
    if not eod then do;
        VAR_TEMP=_N_+1;
        set demo(keep=X rename=X=LEAD) point=VAR_TEMP;
    end;
    else LEAD=.;
    keep X LAG LEAD;
run;

HASH:

data lead;
    retain X;
    if _N_=1 then do;
        dcl hash h(ordered: 'a') ;
        h.definekey('LEAD_SEQ');
        h.definedata('LEAD_SEQ', 'LEAD');
        h.definedone();
        dcl hiter hi('h');
        
        do  until(eof);
            set demo(rename=X=LEAD) end=eof;
            LEAD_SEQ+1;
            h.add();
        end;
    end;
    set demo;
    LAG=lag(X);
    hi.setcur(key: _N_); /*Specifies a starting key item for iteration*/
    rc=hi.next();
    if rc^=0 then LEAD=.;
    drop LEAD_SEQ RC;
run;

The first method has fewer lines of code, but it performs two SET operations, so for larger datasets I recommend the second method for efficiency.
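
What LAG and LEAD are expected to return for the demo data can be sketched in Python with list slicing. This only illustrates the target values, not either SAS technique; the function names lag and lead are made up, and None stands in for a SAS missing value.

```python
def lag(values):
    """Previous value for each position; None where there is no previous row."""
    if not values:
        return []
    return [None] + values[:-1]


def lead(values):
    """Next value for each position; None where there is no next row."""
    if not values:
        return []
    return values[1:] + [None]


x = [1, 2, 3, 4, 5, 6]
for row in zip(x, lag(x), lead(x)):
    print(row)
```

The first row has no LAG and the last row has no LEAD, which matches the else LEAD=. branch in the double SET program.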

An Idiom Chain Game (成语接龙) in SAS

曾宪华 — 2017-10-01T22:57:00+00:00

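I have no travel plans for this year's National Day holiday, but I can tour the world through my friends' WeChat Moments. Meanwhile I finally have some time to weed the blog: three quarters of 2017 have passed and I have written only one post so far (sigh). Today I happened across a forum thread I replied to three years ago: playing the idiom chain game in SAS. The idea is as follows: import a list of idioms, extract the first and last characters of each, and load all the idioms into a hash table. Then look up the last character of the current idiom in the hash table; if a match is found, use the last character of the matched idiom as the next KEY, and so on until the whole hash table has been traversed. The updated code (SAS 9.2 for Windows) is as follows: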

/*Import the idiom list*/
proc import datafile="D:\Demo\成语大全.txt"
    out=idiom_list
    replace;
    getnames=no;
    guessingrows=32767;
run;

/*Extract the first and last characters*/
data idiom_list;
    set idiom_list(rename=VAR1=IDIOM);
    length FIRST_C END_C $2.;
    FIRST_C=prxchange('s/^(.{2}).+/\1/', 1, cats(IDIOM));
    END_C=prxchange('s/.+(.{2})$/\1/', 1, cats(IDIOM));
run;

/*Starting idiom*/
%let start_idiom=胸有成竹;

/*Lookup*/
data _null_;
    if _n_=1 then do;
        if 0 then set idiom_list;
        dcl hash h(multidata: 'Y');
        h.definekey('FIRST_C');
        h.definedata('IDIOM', 'FIRST_C', 'END_C');
        h.definedone();
    end;
    do until(last);
        set idiom_list idiom_list end=last;
        h.add();
    end;
    set idiom_list(where=(IDIOM="&start_idiom") rename=END_C=FIRST_C keep=IDIOM END_C);
    put IDIOM=;
    if h.find(key: FIRST_C)=0 then put IDIOM=;
    do i=1 to 100;
        if h.find(key: END_C)=0 then put IDIOM=;
        if h.find(key: END_C)=0 then do;            
            rc=h.find_next(key: END_C);         
            put IDIOM=;
        end;
    end;
run;

The result is as follows: 胸有成竹, 竹苞松茂, 茂林修竹, 竹报平安, 安安稳稳, 稳操胜券.
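
The lookup idea can be sketched in Python with a dictionary keyed by first character. This is a simplified greedy walk, not a translation of the SAS step: the function name idiom_chain is made up, the six idioms from the result above stand in for the full idiom list, and each idiom is used at most once. Python strings are Unicode, so a single index picks out one Chinese character.

```python
def idiom_chain(idioms, start, max_steps=100):
    """Follow last character -> first character links, never reusing an idiom."""
    by_first = {}
    for idiom in idioms:
        by_first.setdefault(idiom[0], []).append(idiom)

    chain = [start]
    used = {start}
    current = start
    for _ in range(max_steps):
        # Candidates are unused idioms starting with the current last character.
        candidates = [c for c in by_first.get(current[-1], []) if c not in used]
        if not candidates:
            break
        current = candidates[0]
        chain.append(current)
        used.add(current)
    return chain


idioms = ["胸有成竹", "竹苞松茂", "茂林修竹", "竹报平安", "安安稳稳", "稳操胜券"]
chain = idiom_chain(idioms, "胸有成竹")
print("、".join(chain))
```

Always taking the first unused candidate keeps the sketch short; a full depth-first search would backtrack and try the other candidates when a branch ends early.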

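The problem above is actually a bit like depth-first search (DFS). Besides the hash table approach, it can also be solved with a double SET plus the KEY= option, as in the linked thread. The dataset is shown in the figure below: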

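The poster's question was to find the top level: in the figure above, the next level up from ID 2 is 5, from 5 it is 9, from 9 it is 102, and 102 has no next level, so the top level for 2 is 102. The idea is similar to the HASH method above: use the current KONZERNID as the index ID to look up a match, and repeat until no match is found. The updated code is as follows: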

data konzern;
    input ID KONZERNID;
cards;
1 18
2  5
3 18
4 24
5 9
6 9
7 15
8 12
9 102
;
run;

/*Create index*/
data konzern(index=(ID));
    set konzern;
run;

/*Lookup*/
data highestid;
    set konzern;
    ID_INIT=ID;
    KONZERNID_INIT=KONZERNID;
    HIGHESTID=ID;
    do i=1 to 100;
        ID=KONZERNID;
        HIGHESTID=ID;
        set konzern key=ID/unique;
        if _IORC_^=0 then leave;
    end;
    keep ID_INIT KONZERNID_INIT HIGHESTID;
run;

The result is as follows:

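Each of the two methods above has pros and cons. Because a hash table is held in memory, large data may cause it to run out of memory. The second method performs many SET operations, so its efficiency drops sharply on large data. In practice, the choice should depend on the specific situation.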
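
The "follow the link until the lookup fails" idea from the KONZERN example can be sketched in Python with a plain dictionary. The function name highest_id is made up for illustration; the step limit mirrors the do i=1 to 100 loop and also guards against a cycle in the data.

```python
def highest_id(parent_of, start, max_steps=100):
    """Follow ID -> KONZERNID links until an ID has no entry of its own."""
    current = start
    for _ in range(max_steps):
        if current not in parent_of:
            break  # no further level: this is the top
        current = parent_of[current]
    return current


# The KONZERN data from the post: ID -> KONZERNID.
konzern = {1: 18, 2: 5, 3: 18, 4: 24, 5: 9, 6: 9, 7: 15, 8: 12, 9: 102}

top_levels = {id_: highest_id(konzern, id_) for id_ in konzern}
print(top_levels)
```

For ID 2 the walk goes 5, 9, 102 and stops there because 102 has no entry, which is the answer described in the post.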

Comparing One Row Against Many Rows in a SAS Dataset

曾宪华 — 2017-09-16T10:02:00+00:00

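A few days ago a member of a QQ group (group 144839730) asked a question: for each Y value in the figure above, count the X values that are less than or equal to it. For example, the first Y is 0, and none of the 5 X values is less than or equal to 0, so the count is 0. There are several ways to do this; the easiest to understand is probably transposing plus an array. Below are two other methods: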

Double SET:

data have;
    input ID X Y;
cards;
1 1000 0
2 2000 0
3 3000 3000
4 4000 3500
5 5000 4000
;

data want;
    set have nobs=totobs;
    NUM=0;
    do i=1 to totobs;
        set have(keep=X rename=X=X_) point=i;
        if X_ <= Y then NUM=NUM+1; 
    end ;
    drop X_;
    output;
run;

HASH (the program requires SAS 9.2+):

data have;
    set have;
    BYVAR=1;
run;

data want;
    if _n_=1 then do;
        dcl hash h(dataset:'have(keep=BYVAR X rename=X=X_)', multidata: 'y');
        h.definekey('BYVAR');
        h.definedata(all:'y');
        h.definedone();
        call missing(X_);
    end;
    set have;
    NUM=0;
    rc=h.find();
    do while(rc=0);
        if X_ <= Y then NUM=NUM+1; 
        rc=h.find_next();
    end;
    drop BYVAR X_ RC;
run;

The first method has fewer lines of code, but it performs many SET operations, so for larger datasets I recommend the second method for efficiency.
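
The expected counts for the HAVE data can be checked with a short Python sketch. It uses a different technique from both SAS programs, namely sorting X once and then doing a binary search for each Y; the function name count_less_or_equal is made up for illustration.

```python
from bisect import bisect_right


def count_less_or_equal(xs, ys):
    """For each y, count how many values in xs are <= y."""
    sorted_xs = sorted(xs)
    # bisect_right returns the number of sorted values that are <= y.
    return [bisect_right(sorted_xs, y) for y in ys]


# The X and Y columns of the HAVE dataset.
x = [1000, 2000, 3000, 4000, 5000]
y = [0, 0, 3000, 3500, 4000]

num = count_less_or_equal(x, y)
print(num)
```

Sorting once avoids comparing every row with every other row, which is what both nested loops in the SAS versions do.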

Welcome to the World, Son

曾宪华 — 2016-12-20T22:15:00+00:00

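At 23:11 on December 14, 2016, my son Yaoyao (尧尧) was born, weighing 4300 g and measuring 53 cm. This post is to mark the occasion. Thank you, my dear wife, for all you went through! And thank you to the doctors and nurses of Shenzhen Nanshan District People's Hospital!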

Small eyes, just as cool and handsome as mine, haha!

Removing Duplicates from a String in SAS

曾宪华 — 2016-11-26T15:54:00+00:00

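SAS programmers sometimes need to remove duplicate values from a string. Doing this with common character functions such as SCAN and SUBSTR can be laborious, but regular expressions make it simple. A sample program follows: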

data _null_;
    infile cards truncover;
    input STRING $32767.;
    REX1=prxparse('s/([a-z].+?\.\s+)(.*?)(\1+)/\2\3/i');
    REX2=prxparse('/([a-z].+?\.\s+)(.*?)(\1+)/i');
    do i=1 to 100;
        STRING=prxchange(REX1, -1, compbl(STRING));
        if not prxmatch(REX2, compbl(STRING)) then leave;
    end;
    put STRING=;
cards;
a. The cow jumps over the moon.
a. The cow jumps over the moon. b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog. a. The cow jumps over the moon. 
b. The chicken crossed the road. a. The cow jumps over the moon. b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog.
a. The cow jumps over the moon. a. The cow jumps over the moon. b. The chicken crossed the road. b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog. c. The quick brown fox jumped over the lazy dog.
a. The cows jump over the moon. a. The cows jump over the moon. b. The chickens crossed the road. b. The chickens crossed the road. c. The quick brown foxes jumped over the lazy dog. c. The quick brown foxes jumped over the lazy dog.
a. The cow jumps over the moon. b. The chicken crossed the road.  c. The quick brown fox jumped over the lazy dog. a. The cow jumps over the moon.  b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog.
;
run;

The duplicates above are whole sentences. If the duplicates are words, the expression has to change:

data _null_;
    STRING='cow chicken fox cow chicken fox cows chickens foxes';
    REX1=prxparse('s/(\b\w+\b)(.*?)(\b\1+\b)/\2\3/i');
    REX2=prxparse('/(\b\w+\b)(.*?)(\b\1+\b)/i');
    do i=1 to 100;
        STRING=prxchange(REX1, -1, compbl(STRING));
        if not prxmatch(REX2, compbl(STRING)) then leave;
    end;
    put STRING=;
run;

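Note that the \b in the first pair of parentheses restricts the match to whole words rather than single letters. The \b in the third pair of parentheses enforces an exact match, that is, it matches only an identical word.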
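
The word-level expression can be tried in Python, whose re module supports the same \b and backreference syntax. The sketch below is an illustration under that assumption: the function name remove_repeated_words is made up, and whitespace is collapsed with split and join in place of COMPBL.

```python
import re


def remove_repeated_words(text, max_passes=100):
    """Drop earlier copies of a word when an identical word appears later."""
    pattern = re.compile(r"(\b\w+\b)(.*?)(\b\1+\b)", re.IGNORECASE)
    for _ in range(max_passes):
        if not pattern.search(text):
            break
        # Keep the text in between (\2) and the later copy (\3); drop \1.
        text = pattern.sub(r"\2\3", text)
        # Collapse the whitespace left behind by the removed word.
        text = " ".join(text.split())
    return text


cleaned = remove_repeated_words(
    "cow chicken fox cow chicken fox cows chickens foxes")
print(cleaned)
```

Because of the \b anchors, 'cow' is not treated as a duplicate of 'cows', so the plural forms survive.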