The text files can be read using in SAS using PROC IMPORT or with a DATA STEP. The DATA STEP is recommended because it allows the user more control over how variables are input (e.g., variable lengths and formats). However PROC IMPORT can provide a useful first pass, and generates explicit DATA STEP code in the SAS log file that can be modified for a second pass run.
/* first try reading file with PROC IMPORT */
proc import
datafile = "id_information.txt"
out = id_information
;
delimiter = '|';
getnames = YES;
guessingrows = 100;
run;
Notes:
Next, re-read the file using an explicit DATA STEP. To avoid truncating variables check variable lengths in the Data Dictionary. The ‘sanity_checks.txt’ file also provides information about number of records, frequently encountered variable values, and maximum variable length for each data file.
/* An explicit DATA STEP is recommended for reading the file */
filename infl "id_information.txt" encoding='latin1' ;
data id_information;
infile infl
delimiter = '|'
termstr=lf
firstobs=2
lrecl=100
missover
dsd
;
informat id best32. ;
informat nct_id $11.;
informat id_type $50.;
informat id_value $50.;
format id best32. ;
format nct_id $11.;
format id_type $50.;
format id_value $50.;
input
id
nct_id $
id_type $
id_value $
;
run;
Notes:
The following code can be used to estimate the approximate memory needed to read the full dataset
Once you have successfully read a small file, try reading the file ‘designs.txt’. This file is larger but contains structured data elements without any free text. Reading this file may help you test memory requirements on your system, and is a good file on which to develop strategies for reading larger files.
(What code?)
Once you have successfully read the two files mentioned above, you are ready to try reading some other files. The file ‘studies.txt’ contains one record per study (NCT_ID is the unique identifier), and contains many of the study protocol data elements.
/* Read the studies.txt using an explicit DATA STEP */
filename infl "studies.txt" encoding='latin1' ;
data &filen._dta;
data designs;
infile infl
delimiter = '|'
termstr=lf
firstobs=2
lrecl=2500
missover
dsd
;
informat nct_id $11. ;
informat study_first_submitted_date yymmdd10. ;
informat last_update_submitted_date yymmdd10. ;
informat results_first_submitted_date yymmdd10. ;
informat completion_date_type $20. ;
informat primary_completion_date_type $20. ;
informat study_type $50. ;
informat overall_status $50. ;
informat phase $20. ;
informat target_duration $20. ;
informat enrollment best32. ;
informat enrollment_type $20. ;
informat source $10. ;
informat biospec_retention $20. ;
informat limitations_and_caveats $10. ;
informat description $10. ;
informat acronym $20. ;
informat number_of_arms best32. ;
informat number_of_groups best32. ;
informat why_stopped $200. ;
vinformat has_expanded_access $10. ;
informat has_dmc $10. ;
informat is_fda_regulated_drug $10. ;
informat brief_title $350. ;
informat official_title $700. ;
informat biospec_description $1200. ;
informat created_at $25. ;
informat updated_at $25. ;
informat plan_to_share_ipd $20. ;
informat disposition_first_submitted_date yymmdd10. ;
informat nlm_download_date_description $60. ;
informat start_month_year $20. ;
informat verification_month_year $20. ;
informat completion_month_year $20. ;
informat primary_completion_month_year $20. ;
informat plan_to_share_ipd_description $1000. ;
format nct_id $11. ;
format study_first_submitted_date yymmdd10. ;
format last_changed_date yymmdd10. ;
format results_first_submitted_date yymmdd10. ;
format completion_date_type $20. ;
format primary_completion_date_type $20. ;
format study_type $50. ;
format overall_status $50. ;
format phase $20. ;
format target_duration $20. ;
format enrollment best32. ;
format enrollment_type $20. ;
format source $10. ;
format biospec_retention $20. ;
format limitations_and_caveats $10. ;
format description $10. ;
format acronym $20. ;
format number_of_arms best32. ;
format number_of_groups best32. ;
format why_stopped $200. ;
format has_expanded_access $10. ;
format has_dmc $10. ;
format is_fda_regulated_drug $10. ;
format brief_title $350. ;
format official_title $700. ;
format biospec_description $1200. ;
format created_at $25. ;
format updated_at $25. ;
format plan_to_share_ipd $20. ;
format disposition_first_submitted_date yymmdd10. ;
format nlm_download_date_description $60. ;
format start_month_year $20. ;
format verification_month_year $20. ;
format completion_month_year $20. ;
format primary_completion_month_year $20. ;
format plan_to_share_ipd_description $1000. ;
input
nct_id $
study_first_submitted_date
last_changed_date
results_first_submitted_date
completion_date_type $
primary_completion_date_type $
study_type $
overall_status $
phase $
target_duration $
enrollment
enrollment_type $
source $
biospec_retention $
limitations_and_caveats $
description $
acronym $
number_of_arms
number_of_groups
why_stopped $
has_expanded_access $
has_dmc $
is_fda_regulated_drug $
brief_title $
official_title $
biospec_description $
created_at $
updated_at $
plan_to_share_ipd $
nlm_download_date_description $
start_month_year $
verification_month_year $
completion_month_year $
primary_completion_month_year $
plan_to_share_ipd_description $
;
run;
Example SAS code:
/* Read the designs.txt using an explicit DATA STEP */
filename infl "designs.txt" encoding='utf8' ;
data designs;
infile infl
delimiter = '|'
termstr=lf
firstobs=2
lrecl=500
missover
dsd
;
informat id best32. ;
informat description $255. ;
informat masking $50. ;
informat masked_roles $60. ;
informat primary_purpose $50. ;
informat intervention_model $50. ;
informat allocation $50. ;
informat time_perspective $50. ;
informat observational_model $50. ;
informat nct_id $11. ;
informat subject_masked $10. ;
informat caregiver_masked $10. ;
informat investigator_masked $10. ;
informat outcomes_assessor_masked $10. ;
format id best32. ;
format description $255. ;
format masking $50. ;
format masked_roles $60. ;
format primary_purpose $50. ;
format intervention_model $50. ;
format allocation $50. ;
format time_perspective $50. ;
format observational_model $50. ;
format nct_id $11. ;
format subject_masked $10. ;
format caregiver_masked $10. ;
format investigator_masked $10. ;
format outcomes_assessor_masked $10. ;
input
id
description $
masking $
masked_roles $
primary_purpose $
intervention_model $
allocation $
time_perspective $
observational_model $
nct_id $
subject_masked $
caregiver_masked $
investigator_masked $
outcomes_assessor_masked $
;
run;