Tips for Analyzing Data in Pipe-Delimited Files Using SAS

The text files can be read using in SAS using PROC IMPORT or with a DATA STEP. The DATA STEP is recommended because it allows the user more control over how variables are input (e.g., variable lengths and formats). However PROC IMPORT can provide a useful first pass, and generates explicit DATA STEP code in the SAS log file that can be modified for a second pass run.

1

Start by Reading a Small File

      

/* first try reading file with PROC IMPORT */ proc import datafile = "id_information.txt" out = id_information ; delimiter = '|'; getnames = YES; guessingrows = 100; run;

Notes:

  • The code assumes that the file ‘id_information.txt’ is located in the current working directory. Substitute appropriate path and file information as needed.
  • When guessingrows=100, SAS will use the first 100 records to guess appropriate input formatting for variables. A larger value can be set, however, a very large value may be computationally intensive.
  • PROC IMPORT determines that the data file contains four variables in the order: ID, NCT_ID, ID_TYPE, ID_VALUE. It guesses that the ID variable is numeric, and that the three other variables are character with lengths of at least 11, 12, and 28 respectively.
  • PROC IMPORT determined that the maximum record length was 60.

Next, re-read the file using an explicit DATA STEP. To avoid truncating variables check variable lengths in the Data Dictionary. The ‘sanity_checks.txt’ file also provides information about number of records, frequently encountered variable values, and maximum variable length for each data file.

      

/* An explicit DATA STEP is recommended for reading the file */ filename infl "id_information.txt" encoding='latin1' ; data id_information; infile infl delimiter = '|' termstr=lf firstobs=2 lrecl=100 missover dsd ; informat id best32. ; informat nct_id $11.; informat id_type $50.; informat id_value $50.; format id best32. ; format nct_id $11.; format id_type $50.; format id_value $50.; input id nct_id $ id_type $ id_value $ ; run;

Notes:

  • The FILENAME statement specifies the input file name and options (e.g., indicates that Latin1 character encoding is used).
  • The DELIMITER=’|’ options specifies that variables are delimited with pipes (‘|’).
  • The TERMSTR=LF option indicates that linefeed characters indicate the end of a record.
  • FIRSTOBS=2 indicates that data should be read starting at row 2 (the first row contains variable names)
  • LRECL=100 is set based on what was observed from the PROC IMPORT trial run. The maximum possible value of LRECL=32767. Setting LRECL too short will result in truncation of data records. Setting LRECL to the maximum will reduce computational performance.
  • When the end of a record is encountered, the MISSOVER option will cause SAS to set any unassigned values to missing rather than filling in missing values using values from the next record.
  • The DSD option has several effects: it removes quotation marks that enclose character values (i.e., doesn't read them as part of the character data); it treats two consecutive delimiters as a missing value; and when a data value is enclosed within quotes, delimiters within that data value will be treated as character data rather than as a delimiter.
  • The INFORMAT and FORMAT statements are used to set the appropriate variable type, length, and format for each variable.
  • In the INPUT statement, variable names must be listed in the order that they are encountered in the data file. Character variables are indicated with a $ following the variable name.
2

The following code can be used to estimate the approximate memory needed to read the full dataset

Once you have successfully read a small file, try reading the file ‘designs.txt’. This file is larger but contains structured data elements without any free text. Reading this file may help you test memory requirements on your system, and is a good file on which to develop strategies for reading larger files.

(What code?)

3

Once you have successfully read the two files mentioned above, you are ready to try reading some other files. The file ‘studies.txt’ contains one record per study (NCT_ID is the unique identifier), and contains many of the study protocol data elements.

      

/* Read the studies.txt using an explicit DATA STEP */ filename infl "studies.txt" encoding='latin1' ; data &filen._dta; data designs; infile infl delimiter = '|' termstr=lf firstobs=2 lrecl=2500 missover dsd ; informat nct_id $11. ; informat study_first_submitted_date yymmdd10. ; informat last_update_submitted_date yymmdd10. ; informat results_first_submitted_date yymmdd10. ; informat completion_date_type $20. ; informat primary_completion_date_type $20. ; informat study_type $50. ; informat overall_status $50. ; informat phase $20. ; informat target_duration $20. ; informat enrollment best32. ; informat enrollment_type $20. ; informat source $10. ; informat biospec_retention $20. ; informat limitations_and_caveats $10. ; informat description $10. ; informat acronym $20. ; informat number_of_arms best32. ; informat number_of_groups best32. ; informat why_stopped $200. ; vinformat has_expanded_access $10. ; informat has_dmc $10. ; informat is_fda_regulated_drug $10. ; informat brief_title $350. ; informat official_title $700. ; informat biospec_description $1200. ; informat created_at $25. ; informat updated_at $25. ; informat plan_to_share_ipd $20. ; informat disposition_first_submitted_date yymmdd10. ; informat nlm_download_date_description $60. ; informat start_month_year $20. ; informat verification_month_year $20. ; informat completion_month_year $20. ; informat primary_completion_month_year $20. ; informat plan_to_share_ipd_description $1000. ; format nct_id $11. ; format study_first_submitted_date yymmdd10. ; format last_changed_date yymmdd10. ; format results_first_submitted_date yymmdd10. ; format completion_date_type $20. ; format primary_completion_date_type $20. ; format study_type $50. ; format overall_status $50. ; format phase $20. ; format target_duration $20. ; format enrollment best32. ; format enrollment_type $20. ; format source $10. ; format biospec_retention $20. ; format limitations_and_caveats $10. ; format description $10. ; format acronym $20. ; format number_of_arms best32. ; format number_of_groups best32. ; format why_stopped $200. ; format has_expanded_access $10. ; format has_dmc $10. ; format is_fda_regulated_drug $10. ; format brief_title $350. ; format official_title $700. ; format biospec_description $1200. ; format created_at $25. ; format updated_at $25. ; format plan_to_share_ipd $20. ; format disposition_first_submitted_date yymmdd10. ; format nlm_download_date_description $60. ; format start_month_year $20. ; format verification_month_year $20. ; format completion_month_year $20. ; format primary_completion_month_year $20. ; format plan_to_share_ipd_description $1000. ; input nct_id $ study_first_submitted_date last_changed_date results_first_submitted_date completion_date_type $ primary_completion_date_type $ study_type $ overall_status $ phase $ target_duration $ enrollment enrollment_type $ source $ biospec_retention $ limitations_and_caveats $ description $ acronym $ number_of_arms number_of_groups why_stopped $ has_expanded_access $ has_dmc $ is_fda_regulated_drug $ brief_title $ official_title $ biospec_description $ created_at $ updated_at $ plan_to_share_ipd $ nlm_download_date_description $ start_month_year $ verification_month_year $ completion_month_year $ primary_completion_month_year $ plan_to_share_ipd_description $ ; run;

Example SAS code:

      

/* Read the designs.txt using an explicit DATA STEP */ filename infl "designs.txt" encoding='utf8' ; data designs; infile infl delimiter = '|' termstr=lf firstobs=2 lrecl=500 missover dsd ; informat id best32. ; informat description $255. ; informat masking $50. ; informat masked_roles $60. ; informat primary_purpose $50. ; informat intervention_model $50. ; informat allocation $50. ; informat time_perspective $50. ; informat observational_model $50. ; informat nct_id $11. ; informat subject_masked $10. ; informat caregiver_masked $10. ; informat investigator_masked $10. ; informat outcomes_assessor_masked $10. ; format id best32. ; format description $255. ; format masking $50. ; format masked_roles $60. ; format primary_purpose $50. ; format intervention_model $50. ; format allocation $50. ; format time_perspective $50. ; format observational_model $50. ; format nct_id $11. ; format subject_masked $10. ; format caregiver_masked $10. ; format investigator_masked $10. ; format outcomes_assessor_masked $10. ; input id description $ masking $ masked_roles $ primary_purpose $ intervention_model $ allocation $ time_perspective $ observational_model $ nct_id $ subject_masked $ caregiver_masked $ investigator_masked $ outcomes_assessor_masked $ ; run;