Wednesday, February 9, 2011


1. What is the difference between Merge and Proc SQL ?
The resultant dataset depends on the input datasets. In case of one to one and one to many  both work
similarly,i.e. the resultant dataset is same. But differs in case of many to many and non matching datasets:
many to many:
ex:                                     (merge on x)         (proc sql)
X  Y              X    Z           X  Y  Z                X  Y  Z    
-----             ------            -------                 -------
1  A              1    F            1  A  F                 1  A  F
1  C              1    R --->      1  C  R                 1  A  R
2  B              2    G            2  B  G                1  C  F
                                                                   1  C  R
                                                                    2  B  G


NON MATCHING DATA:

ex:                                       (merge on x)    (proc sql)
X  Y              X    Z              X  Y  Z          X  Y  Z    
-----             ------                -------           -------
1  A              1    F            1  A  F           1  A  F
2  B              3    T    --->    2  B  .            3  C  T
3  C              4    G            3  C  T    
                                         4  .  G

2. Difference between Proc Means & Proc Summary
Ans:
a)  Proc Summary defaults to NOPRINT
     Proc Means defaults to PRINT

b) If you omit the VAR statement, then PROC SUMMARY produces a simple count of observations, whereas PROC MEANS tries to analyze all the numeric variables that are not listed in the other statements

c) If you specify statistics on the PROC SUMMARY statement and the VAR statement is omitted, then PROC SUMMARY stops processing and an error message is written to the SAS log.

d) If you omit the VAR statement, then PROC MEANS analyzes all numeric variables that are not listed in the other statements. When all variables are character variables, PROC MEANS produces a simple count of observations.

3. Check the below available options:

1) dash (-): Using a dash between two variable names will create an ordered list of variables.   
   So KEEP VAR1-VAR4;
   is equivalent to KEEP VAR1 VAR2 VAR3 VAR4; 

Note, if one of the variables expected in the ordered list does not exist a WARNING will be issued to the log.
However, if using the dash when defining an array, the missing variable will be created.

2) double-dash (--): Using a double dash between two variable names will create a list of variables as they are ordered in the dataset. 
    So if I have the following variables in a dataset in this order - INTN, SSN, NAME, ADDRESS, CITY
    then KEEP INTN -- ADDRESS;
    is equivalent to KEEP INTN SSN NAME ADDRESS;

3) colon (:): Using a colon after a string of letters will create a list of all variables beginning with those letter.
   That is var: will create a list of all variables in the dataset that begin with those 3 letters.
   If I have a dataset with variables named VARSSN VARINTN VARNAME VARDATE, then

   KEEP VAR:;
   is equivalent to KEEP VARSSN VARINTN VARNAME VARDATE;

4) _ALL_: Using the _ALL_ will create a list of all the variables in the dataset.
    So if I have a dataset with variables VAR1 VAR2 VAR3 and VAR4.  Then

    KEEP _ALL_;
    is equivalent to KEEP VAR1 VAR2 VAR3 VAR4;

    This one can be most useful with a proc sort nodupkey to remove records which are exact matchs on all variables.
    PROC SORT DATA=TEST NODUPKEY;
     BY _ALL_;
    RUN;

5) _CHARACTER_: Using _CHARACTER_ creates a list of all the character variables in the dataset. 
    If I have a dataset with two character variables CVAR1 and CVAR2 and two numeric variables NVAR3 NVAR4, then

    KEEP _CHARACTER_;
    is equivalent to KEEP CVAR1 CVAR2;

6) _NUMERIC_: Using _NUMERIC_ creates a list of all the numeric variables in the ataset.
    So in the same dateset from #5, then
  
    KEEP _NUMERIC_;
    is equivalent to KEEP NVAR1 NVAR2;

    Both _CHARACTER_ and _NUMERIC_ can be useful when creating arrays, since all the variables in the array must be the same type. 



4. what are system variables, automated variables and Dataset Variables.

5. How will you write a code to create a sas dataset with no data in it/ Or How to validate a dataset without actual creating it.
Ans:
1)Use OPTIONS OBS = 0

2) Creating a data set by using the like clause.
 ex: proc sql;
create table latha.emp like oracle.emp;
quit;
In this the like clause triggers the existing table structure to be copied to the new table.using this method 
result in the creation of an empty table.
3)  data check;
           set _null_;
     run;
4) data test;
        delete;
    run;
check the above answers.. I have not verified if it will work..


6. There is a date in the infile as 01-01-49 . We need to read the date as 2049 instead of 1949
Ans: Set  Options yearcuoff=2000

7. Difference between Merge and Set with example

8. Difference b/w left join,right join, full outer join

9. Do you use PROC REPORT or PROC TABULATE? Which do you prefer? Explain.
I prefer Proc report as it is higly customizable and flexible where I can define each column in what ever way I 
want to and even make use of SAS functions, logic processing, and assignment statements and create new 
variables for report making use of the compute block of proc report. I referred proc report because it is more efficient tool than tabulate becasuse with report we can do means frequency and tabulate also

10. proc Report and options available in Proc Report

11. ODS

12. how will you read an input file into sas
Ans: Use the statements INFILE to point to the file reference(should be defined using Filename) / to the file path. Use INPUT statement to read the data into the sas dataset

13. How would you remove a format that has been permanently associated with a variable? 
Ans: we can remove the format by using proc datasets:
Proc datasets;
modify <data set name>;
format <variable name>(which variable format needs to 
modify>;
run;
quit;

14. how to display duplicated observations in a data using base sas
There are two ways to display duplicate observations.
1)In data step, using first.var and last.var
2)Using proc sort with option dupout option


15. In the flow of DATA step processing, what is the first action in a typical DATA Step?
Ans:
1)Compilation Phase:
When you submit a DATA step,it reads the input statements it creates an input buffer and brings the variables and observations.it is a logical memory area and pdv brings the observations at a time from input buffer 
and checks the errors. PDV contains 2 automatic variables _n_ & _error_ ,these checks the errors in observations.
_n_ : indicates the no of obs.
_error_: 1 if error occured
             0 if no error
After that it assigns the datavalues to appropriate variable and builds a sas dataset.

2)Execution Phase

16.  Does SAS ?Translate? (compile) or does it ?Interpret?? Explain.
A typical SAS program could contain DATA steps, PROC steps and macros. Macros are preprocessed. DATA steps are just in time compiled. PROC steps are interpreted in the order they appear in program. So when we submit a SAS program consisting of all these three components, the macro is compiled and executed first. If a DATA step is encountered, then it is compiled and executed. Note that the DATA step 
will not be executed if there is an error in the compilation. If a PROC step is encountered, it is interpreted and executed line by line. However i am not certain on this PROC step behavior.
  
17. At compile time when a SAS data set is read, what items are created?
Ans: Only PDV is generated at compilation time, where as the automatic variables _N_ , _ERROR_ are generated at execution time only.

18. how do u validate sas program
Ans: When a sas code is submitted, SAS performs syntactical checks before executing the program/ code. In that case, one of the ways could be 
- at the beginning of the code, write OPTIONS OBS=0 in addition to other options and then RUN it. 
This way data will not be processed and the log shows error messages/ warnings, if any. 
- If you are executing the SAS code on PC SAS, the highlighted colors itself shows the syntactical errors, if any.

19. Name and describe three SAS functions that you have used, if any?
 1) Compress - sas function which is used to remove spaces in string value and concatenate two values 
with out spaces.
2) Input is another function of sas, it is one of conversion function in sas. It converts numeric into char.
3) Put is another conversion function in sas. It converts char to numeric.
4) SAS functions can be used to convert data and manipulate character variable values.
different types of functions:1)TRIM
                                        2)SUBSTR
                                        3)ABS
                                        4)SCAN

TRIM : Removing the trailing blanks from character expressions. 
               syntax=trim(argument)
SUBSTR : substr extracts the substring from an argument
               syntax=substr(argument,position<,n>)
abs : Returns the absolute of the argument
                syntax=abs(argument)
  
The most common functions that would be used are-
Conversion functions - Input / Put / int / ceil / floor
Character functions - Scan / substr / index / Left / trim / compress / cat / catx / upcase,lowcase
Arithmetic functions - Sum / abs /
Attribute info functions – Attrn / length
Dataset – open / close / exist
Directory  - dexist / dopen / dclose / dcreate / dinfo
File functions – fexist / fopen/ filename / fileref
SQL functions – coalesce / count / sum/ mean
Date functions – date / today / datdif / datepart / datetime / intck / mdy
Array functions – dim


20. When looking for data contained in a character string of 150 bytes, which function is the best to locate that data: scan, index, or indexc?
INDEX: Searches a character expression for a string of characters, and returns the position of the string's first character for the first occurrence of the string.
INDEX (source, excerpt)
it returns the position where the 2nd field is in the source


Eg:  
     str1 = 'Hi i am fine here. how are u there ? .....';
     str2 = index(str1,'how'); --> str2 = 20

SCAN is to get a substring upto mentioned character. Scan function the best for locating the particular word specified in a argument,scan function default length is 200bytes.
data k;
r='ganesh kumar';
u=scan(r,2);
proc print;
run;

result:  kumar

INDEXC to locate only for mentioned one or more single character
        INDEXC(character-value, 'char1','char2','char3', ..)

FIND: Searches for a specific substring of characters within a character string
 FIND(string,substring<,modifiers><,startpos>)
it returns the position the substring is in

FINDC: To locate a character that appears or does not appear within a string.used to search for any one in a list of character 
values

INDEXW: Searches for the substring as a word that follows a space in the provided sentence
eg;string1= "there is a the here" ;
INDEXW(STRING1,"the")  

result: 12 (the word "the") 

21. If you have a data set that contains 100 variables, but you need only five of those, what is the code to force SAS to use only those variable?
Ans: Use KEEP option on a dataset to only select few variables from 100 variables. We can use KEEP option either on set statement or data step statement. If we use on SET statement then only the 
five variables are created on pdv and only these variables are sent to the output dataset. If we use KEEP option on data step statement then all the variables are copied into pdv and after any maniuplation only the selected variables on data step statement are processed and sent to output dataset.

data abc;
  set xyz (keep= ab cd ef gh);
run;

data abc(keep= ab cd ef gh ij);
  set xyz;
 ij=ab+jk;
run;

22. What is the purpose of the trailing and How would you use them?
If the data is continuosly in data set SAS would read the first words only from each line in the `datalines' block and it will ignore the rest of the line. if we use Trailing @@'it will read completly.and another type of trailing is using single @ this is a line hold specifier.
Trailing @ is used to hold the record in input buffer to execute another input statement on the same datalines.
Trailing @@ is used to hold the record in input buffer to execute same input statement on same datalines intil eof record
The trailing @ or more technically, line hold specifiers are used to hold the pointer in the same record for 
multiple iterations.  The two tyoes of line hold specifiers are single trailing(@) and double trailing(@@).
The single trailing hold the record until it encounters either another input statement or end of the datastep.  
They are used for the records such as 
001F38   H
002 F 40 G
To read these values to the datastep
Data example;
  input @10 type $ @;
  if type='H' then
    input @1 id 3. @4 gender $1. @5 age2.;
  else if type='G' then
    input @1 id3. @5 gender $1. @7 age 2.;
  end;
cards;
001F38   H
002 F 40 G
;
run;

The double trailing holds the until the end of the record.
Data example2;
  input id age @@;
cards;
001 23 002 43 003 65 004 32 005 54
;
run;
  

23. What is the significance of the ‘OF’ in X=SUM (OF a1-a4, a6, a9);? 
Ans: It is use to tell sas to consider a set of values to be processed. In the above example, SUM(OF a1-a4,a6,a9) resolves to SUM(A1,A2,A3,A4,A6,A9).
If we dont use 'OF' then it would be treated as a minus sign.. A1(Minus)-A4 and that is not what we are trying to accomplish. 

24. How do you debug and test your SAS programs? 
put statement and debug option is used for debug and test the programs.

25. How do you test for missing values? 

NMISS OPTION is used for missing values

26. How would you create multiple observations from a single observation? 
line pointer is used for multiple lines per observation
  @@ is used for multiple observations per line

27. What are some good SAS programming practices for processing very large data sets? 
ARRAYS IS USED FOR PROCESSING FOR LARGE DATA SET

28. Briefly describe 5 ways to do a "table lookup" in SAS. 

29. Why is SAS considered self-documenting? 
WHEN DATA SET IS CREATED SAS CREATE DISCRIPTOR PORTION 
AND DATA PORTION .THAT MEANS SAS STORES THE INFOMATION LIKE 
VARIABLE NAME ,LENGTH,TYPE etc.

30. How does SAS handle missing values in: assignment statements, functions, a merge, an update, sort order, formats, PROCs? 


When you check for ordinary missing numeric values, you can use code that is similar to the following: 
if numvar=. then do;
If your data contains special missing values, you can check for either an ordinary or special missing value with a statement that is similar to the following: 
if numvar<=.z then do;
To check for a missing character value, you can use a statement that is similar to the following: 
if charvar=' ' then do;
The MISSING function enables you to check for either a character or numeric missing value, as in: 
if missing(var) then do;

31. How do you generate random samples? ---ranuni
question will be asked like: I have a dataset that creates a value.. I need some random values to be generated each time i execute that code for creating that dataset, how will you generate the random values..

Use the function RANUNI

32. What is the Program Data Vector (PDV)? What are its functions?
When SAS processes a data step it has two phases. Compilation phase and execution phase.
During the compilation phase the input buffer is created to hold a record from external file. After input buffer is
created the PDV is created. The PDV is the area of memory where sas builds dataset, one observation at a time. The PDV contains two automatic variables _N_ and _ERROR_.
  
To the above answers I like to add a bit of information that PDV has even more automated variables such as
FIRST.<by-group-variable>, LAST.<by-group-variable>, _END_, ...etc, along with _ERROR_ and _N_ where each has got its own significance to have a control over the execution of a SAS Program.
  Source: SAS book
The PDV contains all the variables in the input data set, the variables created in DATA step statements, and the two variables, _N_ and _ERROR_, that are automatically generated for every DATA step. The _N_ variable represents the number of times the DATA step has iterated. The _ERROR_ variable acts like a binary switch whose value is 0 if no errors exist in the DATA step, or 1 if one or more errors exist

Variables that are created by the INPUT and the Sum statements  are set to missing initially. Note that in this representation, numeric variables are initialized with a period and character variables are initialized with blanks. The automatic variable _N_ is set to 1; the automatic variable _ERROR_ is set to 0. 

The variable TeamName is marked Drop in the PDV because of the DROP= data set option in the DATA statement. Dropped variables are not written to the SAS data set. The _N_ and _ERROR_ variables are dropped because automatic variables created by the DATA step are not written to a SAS data set

33. If reading an external file to produce an external file, what is the shortcut to write that record without coding every single variable on the record?
A PUT Statement using a Variable _infile_  can be used to achieve this task. where we wont have to mention all the variable names for output using put statement.

_all_ - Write all var present in Dataset including computed variables also.
_input_- write only those variables read from input file
_infile_ - reads in all the content from an external input file as a single block. It can be used to write out if no changes are needed.
  
34. If reading a variable length file with fixed input, how would you prevent SAS from reading the next record if the last variable didn?t have a value?
Use the MISSOVER keyword; it forces SAS to put either a blank or a period for the missing variable before continuing to the next record.

35. what is the diff b/w verification validation in sas
verification :
   is the value is correct ot not
validation :
   is the value is existable or not

EX:   0 <= p(x) <= 1 
validation:  value of p(x) must be lies between 0 and 1 other wise it's not validate result
verification: check again the calculation process to obtain the valid value for p(x).......
p(x) means probability of x(Success)

36. What is the one statement to set the criteria of data that can be coded in any step?
OPTIONS

37. What is the purpose of using the N=PS option?
The N=PS option creates a buffer in memory which is large enough to store PAGESIZE (PS) lines and enables a page to be formatted randomly prior to it being printed.

38. Which date advances a date, time or date/time value by a given interval? 
Here is the detailed usage of intnx function
date1=intnx('month','01jan95'd,5,'beginning');
put date1 / date1 date7.;

date2=intnx('month','01jan95'd,5,'middle');
put date2 / date2 date7.;

date3=intnx('month','01jan95'd,5,'end');
put date3 / date3 date7.;

date4=intnx('month','01jan95'd,5,'sameday');
put date4 / date4 date7.;

date5=intnx('month','15mar2000'd,5,'same');
put date5 / date5 date9.;

interval='month';
date='1sep2001'd;
align='m';
date4=intnx(interval,date,2,align)
;put date4 / date4 date7.;

39. If you need the value of a variable rather than the variable itself what would you use to load the value to a 
macro variable?
If we need a value of a macro variable then we must define it in such terms so that we can call them everywhere in the program. Define it as Global. There are different ways of assigning a global variable. Simplest method is %LET.
Ex: A, is macro variable. 
Use following statement to assign the value of a rather than the variable itself 
%Let A=xyz x="&A";
This will assign "xyz" to x, not the variable xyz to x.

40. How do i read multiple spaces in datasets?
1) While reading external data if there are any consecutive blanks we use DSD.
2) While reading free format extrenal data, if  two words have to be read for a sing variable we use &.

41. 
data abc;
input s w k g o t a m;
cards;
1 2 3 4 5 6 7 8 
2 3 4 5 6 7 8 9
;
run;

i want the output to be in the sorted order(only variables).observations should not be changed..

Use options formdlim='.';

**** create a dummy variable by assigning the values of any 
one of ur variable;
**** here I have taken the values of S variable of your ABC 
data set;
data setabc;
set abc;
l=s;
run;
proc print;run;
* transpose the variables into observations of a transposed 
data set by using the ID statement; 
proc transpose data=setabc out=T1_abc let;
id l;
run;
proc print;run;
**sort the transposed data set by using _name_ variable 
inorder to get varibales in a ascending order;
proc sort data=T1_abc;
by _name_;
run;
proc print;run;
** once again transpose the sorted data set with the _name_ 
variable in ID statement;
proc transpose data=t1_abc out=t2_abc(drop=_name_) let;
id _name_;
run;
proc print noobs;
title 'sorting the variables in Ascending order';
run;

42  WHAT IS DEBUGGING? HOW TO TEST THE DEBUGGING IN SAS?
Debugging is the process of testing the logic of the program. 
To debug the sas Datastep: use the Debug clause after '/' in the data statement and execute. If needed set watch points,and keep on pressing enter key while Observing  the values in the debugger log window.Finally exit from it.
To debug the Macro programs, use the system options like Symbolgen,Mlogic, Mprint.
Debugging is also supported by the log window by notes,warnings, errors and messeges of  and Error macro
variables.

43. How to import the Zip files into SAS? If it is possible in SAS? If it is posible write the code...

PROC IMPORT OUT= WORK.ALL 
   DATAFILE= "C:\Users\MyDocuments\xxxx\file1.zip" 
            DBMS=CSV REPLACE;
     GETNAMES=YES;
     DATAROW=2; 
RUN;

44. Difference Between Proc Means and Proc Summary:
proc means:It will give descreptive statitstics. By default it will give output in output window.
limited statistics will give like n,mean,median,min and max.If we need additional statistics we need to add options.
Proc summary:same thing but it will not give output as default.we need give an option print then only it will give the output.
  
1) Proc Means generate a default output but Proc Summary require Print option to generate output similar to proc means.
2) To get similar outputs Proc summary should compulsory have atleast one variable in VAR statement.
If you want a proof for the above please run the below code:
proc means data=sasuser.admit;
run;
proc summary data=sasuser.admit print;
run;

The outputs will be different.

45. SCANOVER, MISSOVER, TRUNCOVER, FLOWOVER, STOPOVER...

46. explain nway in Proc summary