Tips for Analyzing Data in Pipe-Delimited Files Using SAS

The text files can be read using in SAS using PROC IMPORT or with a DATA STEP. The DATA STEP is recommended because it allows the user more control over how variables are input (e.g., variable lengths and formats). However PROC IMPORT can provide a useful first pass, and generates explicit DATA STEP code in the SAS log file that can be modified for a second pass run.

1

Start by Reading a Small File

      

/* first try reading file with PROC IMPORT */ proc import datafile = "id_information.txt" out = id_information ; delimiter = '|'; getnames = YES; guessingrows = 100; run;

Notes:

Next, re-read the file using an explicit DATA STEP. To avoid truncating variables check variable lengths in the Data Dictionary. The ‘sanity_checks.txt’ file also provides information about number of records, frequently encountered variable values, and maximum variable length for each data file.

      

/* An explicit DATA STEP is recommended for reading the file */ filename infl "id_information.txt" encoding='latin1' ; data id_information; infile infl delimiter = '|' termstr=lf firstobs=2 lrecl=100 missover dsd ; informat id best32. ; informat nct_id $11.; informat id_type $50.; informat id_value $50.; format id best32. ; format nct_id $11.; format id_type $50.; format id_value $50.; input id nct_id $ id_type $ id_value $ ; run;

Notes:

2

The following code can be used to estimate the approximate memory needed to read the full dataset

Once you have successfully read a small file, try reading the file ‘designs.txt’. This file is larger but contains structured data elements without any free text. Reading this file may help you test memory requirements on your system, and is a good file on which to develop strategies for reading larger files.

(What code?)

3

Once you have successfully read the two files mentioned above, you are ready to try reading some other files. The file ‘studies.txt’ contains one record per study (NCT_ID is the unique identifier), and contains many of the study protocol data elements.

      

/* Read the studies.txt using an explicit DATA STEP */ filename infl "studies.txt" encoding='latin1' ; data &filen._dta; data designs; infile infl delimiter = '|' termstr=lf firstobs=2 lrecl=2500 missover dsd ; informat nct_id $11. ; informat study_first_submitted_date yymmdd10. ; informat last_update_submitted_date yymmdd10. ; informat results_first_submitted_date yymmdd10. ; informat completion_date_type $20. ; informat primary_completion_date_type $20. ; informat study_type $50. ; informat overall_status $50. ; informat phase $20. ; informat target_duration $20. ; informat enrollment best32. ; informat enrollment_type $20. ; informat source $10. ; informat biospec_retention $20. ; informat limitations_and_caveats $10. ; informat description $10. ; informat acronym $20. ; informat number_of_arms best32. ; informat number_of_groups best32. ; informat why_stopped $200. ; vinformat has_expanded_access $10. ; informat has_dmc $10. ; informat is_fda_regulated_drug $10. ; informat brief_title $350. ; informat official_title $700. ; informat biospec_description $1200. ; informat created_at $25. ; informat updated_at $25. ; informat plan_to_share_ipd $20. ; informat disposition_first_submitted_date yymmdd10. ; informat nlm_download_date_description $60. ; informat start_month_year $20. ; informat verification_month_year $20. ; informat completion_month_year $20. ; informat primary_completion_month_year $20. ; informat plan_to_share_ipd_description $1000. ; format nct_id $11. ; format study_first_submitted_date yymmdd10. ; format last_changed_date yymmdd10. ; format results_first_submitted_date yymmdd10. ; format completion_date_type $20. ; format primary_completion_date_type $20. ; format study_type $50. ; format overall_status $50. ; format phase $20. ; format target_duration $20. ; format enrollment best32. ; format enrollment_type $20. ; format source $10. ; format biospec_retention $20. ; format limitations_and_caveats $10. ; format description $10. ; format acronym $20. ; format number_of_arms best32. ; format number_of_groups best32. ; format why_stopped $200. ; format has_expanded_access $10. ; format has_dmc $10. ; format is_fda_regulated_drug $10. ; format brief_title $350. ; format official_title $700. ; format biospec_description $1200. ; format created_at $25. ; format updated_at $25. ; format plan_to_share_ipd $20. ; format disposition_first_submitted_date yymmdd10. ; format nlm_download_date_description $60. ; format start_month_year $20. ; format verification_month_year $20. ; format completion_month_year $20. ; format primary_completion_month_year $20. ; format plan_to_share_ipd_description $1000. ; input nct_id $ study_first_submitted_date last_changed_date results_first_submitted_date completion_date_type $ primary_completion_date_type $ study_type $ overall_status $ phase $ target_duration $ enrollment enrollment_type $ source $ biospec_retention $ limitations_and_caveats $ description $ acronym $ number_of_arms number_of_groups why_stopped $ has_expanded_access $ has_dmc $ is_fda_regulated_drug $ brief_title $ official_title $ biospec_description $ created_at $ updated_at $ plan_to_share_ipd $ nlm_download_date_description $ start_month_year $ verification_month_year $ completion_month_year $ primary_completion_month_year $ plan_to_share_ipd_description $ ; run;

Example SAS code:

      

/* Read the designs.txt using an explicit DATA STEP */ filename infl "designs.txt" encoding='utf8' ; data designs; infile infl delimiter = '|' termstr=lf firstobs=2 lrecl=500 missover dsd ; informat id best32. ; informat description $255. ; informat masking $50. ; informat masked_roles $60. ; informat primary_purpose $50. ; informat intervention_model $50. ; informat allocation $50. ; informat time_perspective $50. ; informat observational_model $50. ; informat nct_id $11. ; informat subject_masked $10. ; informat caregiver_masked $10. ; informat investigator_masked $10. ; informat outcomes_assessor_masked $10. ; format id best32. ; format description $255. ; format masking $50. ; format masked_roles $60. ; format primary_purpose $50. ; format intervention_model $50. ; format allocation $50. ; format time_perspective $50. ; format observational_model $50. ; format nct_id $11. ; format subject_masked $10. ; format caregiver_masked $10. ; format investigator_masked $10. ; format outcomes_assessor_masked $10. ; input id description $ masking $ masked_roles $ primary_purpose $ intervention_model $ allocation $ time_perspective $ observational_model $ nct_id $ subject_masked $ caregiver_masked $ investigator_masked $ outcomes_assessor_masked $ ; run;