Xianhua Zeng

Removing Duplicate Texts in a String

Xianhua Zeng — 2016-11-26T00:00:00+00:00

A recent question on SAS-L asked how to remove duplicate text in a string. Several solutions were offered, each solving the problem differently. From my point of view, solving this with traditional functions like SCAN and SUBSTR can be confusing and laborious. Here is a regular expression solution.

data _null_;
    infile cards truncover;
    input STRING $32767.;
    REX1=prxparse('s/([a-z].+?\.\s+)(.*?)(\1+)/\2\3/i');
    REX2=prxparse('/([a-z].+?\.\s+)(.*?)(\1+)/i');
    do i=1 to 100;
        STRING=prxchange(REX1, -1, compbl(STRING));
        if not prxmatch(REX2, compbl(STRING)) then leave;
    end;
    put STRING=;
cards;
a. The cow jumps over the moon. b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog. a. The cow jumps over the moon. 
b. The chicken crossed the road. a. The cow jumps over the moon. b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog.
a. The cow jumps over the moon. a. The cow jumps over the moon. b. The chicken crossed the road. b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog. c. The quick brown fox jumped over the lazy dog.
a. The cows jump over the moon. a. The cows jump over the moon. b. The chickens crossed the road. b. The chickens crossed the road. c. The quick brown foxes jumped over the lazy dog. c. The quick brown foxes jumped over the lazy dog.
a. The cow jumps over the moon. b. The chicken crossed the road.  c. The quick brown fox jumped over the lazy dog. a. The cow jumps over the moon.  b. The chicken crossed the road. c. The quick brown fox jumped over the lazy dog.
;
run;

Regular expression visualization by Regexper:

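Here’s a brief explanation of the regular expression. “[a-z]” matches a single letter (upper or lower case, because of the “i” modifier). “.+?” matches one or more of any character, as few times as possible. “\.” matches a literal period. “\s+” matches one or more whitespace characters, as many times as possible. “.*?” matches zero or more of any character, as few times as possible. “\1+” matches one or more consecutive repetitions of the text captured by the first group. The replacement “\2\3” drops the first copy and keeps everything after it, so the last occurrence of a repeated sentence is the one that survives.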
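To experiment with the pattern outside SAS, the sketch below replays the same substitution loop with Python's `re` module. It is a stand-in under stated assumptions, not the SAS implementation: the function name `remove_duplicate_sentences` is hypothetical, `compbl` is imitated by collapsing runs of spaces, and a single trailing blank is appended so that the last sentence ends in whitespace, as it would in a padded SAS character variable.

```python
import re

# Same pattern as REX1/REX2 above; re.IGNORECASE plays the role of the /i modifier.
DUP_SENTENCE = re.compile(r'([a-z].+?\.\s+)(.*?)(\1+)', re.IGNORECASE)


def remove_duplicate_sentences(text, max_passes=100):
    """Drop earlier copies of a repeated sentence, keeping the last one."""
    # Assumption: imitate compbl() and the trailing blank of a SAS variable.
    s = re.sub(r' {2,}', ' ', text).rstrip() + ' '
    for _ in range(max_passes):
        # \2\3 keeps the text in between plus the later copy; group 1 is dropped.
        s = re.sub(r' {2,}', ' ', DUP_SENTENCE.sub(r'\2\3', s))
        if not DUP_SENTENCE.search(s):
            break
    return s.strip()


line = ('a. The cow jumps over the moon. b. The chicken crossed the road. '
        'c. The quick brown fox jumped over the lazy dog. '
        'a. The cow jumps over the moon.')
print(remove_duplicate_sentences(line))
```

For this input, the first copy of sentence a is removed and the copy at the end is kept, so the result starts with sentence b.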

Note that a single pass may not remove every duplicate, so the DO loop repeats the substitution until nothing more matches. If more than 100 passes are needed, increase the stopping value of the DO loop accordingly; I think this scenario rarely happens. If you want to remove duplicate words instead of sentences, you need to adjust the expression. For example:

data _null_;
    STRING='cow chicken fox cow chicken fox cows chickens foxes';
    REX1=prxparse('s/(\b\w+\b)(.*?)(\b\1+\b)/\2\3/i');
    REX2=prxparse('/(\b\w+\b)(.*?)(\b\1+\b)/i');
    do i=1 to 100;
        STRING=prxchange(REX1, -1, compbl(STRING));
        if not prxmatch(REX2, compbl(STRING)) then leave;
    end;
    put STRING=;
run;

Regular expression visualization by Regexper:

“\b” matches a word boundary. “\w+” matches one or more word characters (equivalent to [a-zA-Z0-9_]), as many times as possible. Because of the word boundaries, “cow” is not treated as a duplicate of “cows”.
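The word-level pattern can be tried the same way in Python. Again this is only a sketch: `remove_duplicate_words` is a hypothetical name, and collapsing runs of spaces is an assumption standing in for `compbl`.

```python
import re

# Same pattern as the word-level REX1/REX2 above.
DUP_WORD = re.compile(r'(\b\w+\b)(.*?)(\b\1+\b)', re.IGNORECASE)


def remove_duplicate_words(text, max_passes=100):
    """Drop earlier copies of a repeated word, keeping the last one."""
    s = text
    for _ in range(max_passes):
        # Remove the first copy of each matched word, then squeeze blanks.
        s = re.sub(r' {2,}', ' ', DUP_WORD.sub(r'\2\3', s))
        if not DUP_WORD.search(s):
            break
    return s.strip()


print(remove_duplicate_words(
    'cow chicken fox cow chicken fox cows chickens foxes'))
```

On this sample, one pass is not enough: each pass removes one duplicated word, which is why the substitution sits inside a loop.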

Automagically Opening Dataset and Copying Variable Value

Xianhua Zeng — 2016-10-16T17:36:00+00:00

I attended PharmaSUG China 2016 in Beijing last month, along with my line manager. There were a large number of presentations this year, and one of them made a deep impression on me. The presenter shared some useful tips in the presentation, such as automagically opening a dataset and copying a variable value. The source code is not available, so I created three small macros to accomplish these common tasks.

%markdsn, automagically opens the dataset selected.

%macro markdsn();
gsubmit "
dm 'wcopy';

filename clip clipbrd;

data _null_;
    infile clip;
    input;
    call execute('dm ""vt '||_INFILE_||';"" continue ;');
run;

filename clip clear;";
%mend markdsn;

%markcode, runs the selected code and automagically opens the last created dataset.

%macro markcode();
gsubmit "
dm 'wcopy';

filename clip clipbrd;

data _null_;
    infile clip end=eof;
    input;
    call execute(_INFILE_);
    if eof then call execute('%nrstr(dm ''vt &syslast;'' continue ;)');
run;

filename clip clear;";
%mend markcode;

%vvalue, automagically copies variable value.

%macro vvalue();
gsubmit '
dm "wcopy";

filename clip clipbrd;

data _null_;
    infile clip;
    input;
    call symputx("var", _INFILE_);
run;

filename clip clear;

proc sql noprint;
    select distinct &var into :varlst separated by "@"
    from &syslast
    ;
quit;

data _null_;
    if not symexist("increment") then call symputx("increment", 1, "g");
    else call symputx("increment", 1 + input(symget("increment"), best.), "g"); 
run;

filename clip clipbrd;

data _null_;
    file clip;
    length value $32767;
    if &increment <= countw("&varlst", "@") then value=scan("&varlst", &increment, "@");
    else value=scan("&varlst", countw("&varlst", "@"), "@");
    put value;
run;

filename clip clear;';
%mend vvalue;

Prerequisites:

Store the macros in an autocall library
On the command line, enter the commands below to assign keys that invoke these macros
```
keydef 'F9' '%markdsn'
keydef 'F10' '%markcode' 
keydef 'F11' '%vvalue'
```

Usage:

Select dataset name and then press F9
Mark some code and then press F10
Select variable name and then press F11; repeat this step until you get the desired value

Splitting Data Set Based on a Variable

Xianhua Zeng — 2015-11-14T20:57:00+00:00

SAS programmers sometimes need to split a data set into multiple data sets, depending on the unique values of a variable. You can usually achieve this by applying a WHERE= option or an IF statement for each value, but these aren’t the most efficient or elegant methods. Suppose that you need to break SASHELP.CLASS into different tables based on the value of SEX. Here are three methods I know:

CALL EXECUTE:

proc sql;
    create table sex as
        select distinct SEX
        from sashelp.class
        ;
quit;

data _null_;
    set sex;
    call execute('data sex_'||cats(SEX)||'(where=(SEX='||quote(cats(SEX))||')); set sashelp.class; run;');
run;

FILENAME:

proc sql;
    create table sex as
        select distinct SEX
        from sashelp.class
        ;
quit;

filename code temp;
data _null_;
    file code;
    set sex;
    put ' sex_' SEX '(where=(SEX="' SEX '"))' @@;
run;

data %inc code;;
    set sashelp.class;
run;

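HASH (SAS 9.2+):

proc sort data=sashelp.class out=class;
    by SEX;
run;

data _null_;
    /* A new hash instance is created for each BY group. */
    /* dataset: with obs=0 only supplies the variable list for definedata(all:'y'); */
    /* without a definedata call, only the key variable SEX would be written out. */
    dcl hash h(dataset:'class(obs=0)', multidata:'y');
    h.definekey('SEX');
    h.definedata(all:'y');
    h.definedone();
    do until(last.SEX);
        set class;
        by SEX;
        h.add();
    end;
    h.output(dataset:cats('sex_', SEX));
run;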

Note that the second method is the most efficient for splitting a huge data set, since it reads the source data set only once.

SAS Display Manager Commands

Xianhua Zeng — 2015-11-08T23:38:00+00:00

SAS programmers often use time-consuming point-and-click methods to accomplish common tasks. For example, when a program finishes running, you need to open a specific dataset to check the desired variable or observation. Have you ever wished that these common tasks could be done automatically, or at least almost automagically? DM commands come to the rescue. DM stands for Display Manager; the DM statement submits SAS Program Editor, Log, Procedure Output, or text editor commands as SAS statements. DM commands are very powerful but don’t get as much attention from SAS programmers as they deserve. In this post, I’ll introduce two DM commands you might not have heard of before.

Usually when you are working with a dataset, you want to see certain columns and not all of them. The following command will show only column A.
```
gsub "dm _last_ 'show A;' continue;"
```
Usually when you are working with a huge dataset, you want to retrieve values in a certain observation. The following command will scroll downward 1116 observations.
```
gsub "dm _last_ 'forward 1116;' continue;"
```

If you don’t want to type commands, here is a hotkey-driven solution. Save the following program (e.g., as /user1/zenga/tool.sas):

%let line=;
%let name=;
%window Tool irow = 10 rows = 15 icolumn = 10 columns = 90 color=white
#3 @18 'To show the desired variable or desired line.' color=blue
#5 @18 'Enter line number:' color=blue
#7 @18 'Enter variable name:' color=blue
#9 @18 'Note: variable name should be separated by a single space.' color=blue
#5 @37 line 15 attr=underline
#7 @39 name 15 attr=underline;
%display Tool;

%macro tool;
%if &line^= %then %do;
    dm _last_ "top" continue;
    dm _last_ "forward %eval(&line-1)" continue;
%end;
%if &name^= %then %do;
    dm _last_ "show ""&name""" continue;
%end;
%mend tool;

%tool

Open a dataset, then type the command below on the command line to assign a VT key that runs the code.

keydef "F9" "gsubmit '%inc ""/user1/zenga/tool.sas"";'"

When you press the assigned key, you will be asked to enter a line number or variable name; the desired observation or variable is then displayed.

Parsing Comments from aCRF with Perl Regular Expression

Xianhua Zeng — 2015-11-06T15:45:00+00:00

As clinical SAS programmers, we sometimes need to import and parse annotations contained in the Annotated Case Report Form (aCRF) for creating or validating Define.xml. When parsing the imported comments from the aCRF, our ultimate goal is to identify the variable. Then we can get the corresponding information, such as the CRF page. In this post, I’ll introduce a method using a Perl regular expression. Syntax: PRXCHANGE (regular-expression-id|perl-regular-expression, times, source). Example:

COMMENTS=prxchange("s/.*?(\b(?:LBCAT|LBTEST|LBTESTCD)\b)?/\1 /o", -1, cats(COMMENTS));

Regular expression visualization by Regexper:

Here’s a brief explanation of the regular expression used in the example above. The “.” matches any single character except newline. The “*?” is a lazy quantifier: it matches zero or more occurrences of the preceding “.”, as few times as possible. The first “(” and its matching “)” enclose a pattern and create a capture buffer for the match. The last “?” is a greedy quantifier that makes this capturing group optional: the group is matched once if possible, otherwise not at all. The “\b” matches a word boundary. Since we want to write “\b” only once on each side of the alternatives, the second pair of parentheses is needed. The “(?:…)” is a non-capturing group; the “?:” is not strictly necessary in this example, but since no capture buffer has to be stored for that group, the expression may run faster. In the replacement, “\1” inserts the contents of capture buffer 1.
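The effect of the optional capture group can be checked with a small Python sketch (Python 3.7 or later is assumed). It is an illustration only: `extract_variables` is a hypothetical helper, the SAS-specific “o” option is dropped, and Python's handling of the empty matches this pattern produces is not guaranteed to mirror SAS in every detail, which is why the result is reduced to a list of names rather than compared character by character.

```python
import re

# Pattern from the PRXCHANGE call above, without the SAS-only /o option.
VAR_PATTERN = re.compile(r'.*?(\b(?:LBCAT|LBTEST|LBTESTCD)\b)?')


def extract_variables(comment):
    """Return the variable names found in an annotation, in order."""
    # Group 1 is empty unless a listed name is matched, so everything
    # except the names is replaced by blanks.
    kept = VAR_PATTERN.sub(r'\1 ', comment.strip())
    return kept.split()


print(extract_variables('LBCAT = HEMATOLOGY when LBTESTCD in (WBC, RBC)'))
```

The word boundaries matter here: `LBTESTCD` is returned whole rather than as `LBTEST`, and a longer name that merely contains `LBCAT` is not picked up.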

Creating a SAS Format from a Data Set

Xianhua Zeng — 2015-05-08T21:01:00+00:00

SAS formats are useful to the SAS programmer. They are usually used to map one value to another, and the most common way to create one is to use PROC FORMAT. A format can also be created from a data set. The picture below is an example of a data set with two columns, analysis visit (AVISIT) and analysis visit number (AVISITN).

This post will illustrate four different methods to create a format called visit from this data set.

CALL EXECUTE

data _null_;
    set demo end=eof;
    if _n_=1 then call execute('proc format; value visit');
    call execute(cats(AVISITN)||' = '||quote(cats(AVISIT)));
    if eof then call execute('; run;');
run;

Macro variable

proc sql noprint;
    select catx(' = ', cats(AVISITN), quote(cats(AVISIT))) into :fmtlst separated by ' '
        from demo
        order by AVISITN;
quit;

proc format;
    value visit
    &fmtlst;
run;

CNTLIN= option

proc sql;
    create table fmt as
        select distinct 'visit' as FMTNAME
             , AVISITN as START
             , cats(AVISIT) as label
        from demo
        order by AVISITN
        ;
quit;

proc format library=work cntlin=fmt;
run;

FILENAME

proc sql;
    create table fmt as
        select distinct AVISITN
             , quote(cats(AVISIT)) as AVISIT
        from demo
        order by AVISITN
        ;
quit;

/*Write the generated code to a temporary file*/
filename code temp;
data _null_;
    file code;
    set fmt;
    if _n_=1 then put +4 'value visit';
    put +14 AVISITN ' = ' AVISIT;
run;

proc format;
    %inc code / source2;
    ;
run;

Splitting a String Using Perl Regular Expression

Xianhua Zeng — 2015-05-01T20:59:00+00:00

In SDTM domains, all character variables are limited to a maximum of 200 characters because the FDA requires datasets in SAS v5 transport format. Text longer than 200 characters should be stored as additional records in the SUPP-- dataset. To improve readability, the text should be split between words rather than simply cut into 200-character pieces. In this post, I’ll introduce a method using a regular expression. Syntax: PRXCHANGE (regular-expression-id|perl-regular-expression, times, source). Example:

VAR=prxchange('s/(.{1,200})([\s]|$)/\1~/', -1, cats(VAR));

Regular expression visualization by Regexper:

Here’s a brief explanation. The expression looks at character 201: if it’s a space, the split character is inserted there. Otherwise, it locates the rightmost breaking character within the first 200 characters and inserts the split character at that position. The process is repeated on the remaining characters until the end of the variable VAR. Finally, we can use the SCAN function to extract the individual chunks from the variable based on the delimiter (~), and assign each chunk to a new variable.
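The splitting behaviour is easier to see with a short width. The following Python sketch applies the same pattern shape with the length as a parameter; `split_text` and its arguments are hypothetical names, and the list comprehension is an assumption standing in for the way SCAN skips the empty piece after the final delimiter.

```python
import re


def split_text(text, width=200, delimiter='~'):
    """Mark whitespace break points about every `width` characters and split there."""
    # Same shape as the PRXCHANGE pattern above: up to `width` characters
    # followed by a whitespace character or the end of the string.
    pattern = re.compile(r'(.{1,%d})(\s|$)' % width)
    marked = pattern.sub(lambda m: m.group(1) + delimiter, text.strip())
    # Drop the empty piece left after the trailing delimiter.
    return [chunk for chunk in marked.split(delimiter) if chunk]


print(split_text('the quick brown fox jumps', width=10))
print(split_text('abcdefghijklmno p', width=10))
```

The second call shows a limitation worth keeping in mind: a single word longer than the width contains no break point, so its chunk exceeds the limit. The approach also assumes the delimiter character does not already occur in the text.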

Clinical SAS Programming

Xianhua Zeng — 2015-04-19T21:02:00+00:00

Clinical SAS Programming: a blog about clinical trial programming in SAS. My inspiration for writing this blog is to explore and share my knowledge of SAS.