Volume 8, Number 36 -- October 22, 2008

SQL Quickly and Dirtily Extracts a Field from a CSV File

Published: October 22, 2008

by Ted Holt

A colleague and I were recently tracking the way data progresses through a poorly documented information system. We were trying to determine at which point in the process a certain field's decimal positions were being discarded. We began at the beginning, which happened to be a CSV file from a PC-based system. How could we quickly determine whether or not the CSV file had values in the decimal positions of the seventh field?

We used SQL. I'll provide a simple illustration. Here are the commands, in case you want to try this yourself.

Let's first create a CSV file we can play with. Run this command from the CL command line:

crtpf qtemp/custcsv rcdlen(80)

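Next, load some data into the CSV file by running this SQL statement. The library/file qualifiers mean the session must use the system naming convention.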

INSERT INTO QTEMP/CUSTCSV
SELECT CUSNUM||',"'||trim(LSTNAM)||'","'||INIT||'","'||
       trim(STREET)||'","'||trim(CITY)||'","'||STATE||'",'||ZIPCOD||
       ','||trim(char(CDTLMT))||','||CHGCOD||','||
       trim(char(BALDUE))||','||trim(char(CDTDUE))
  FROM qiws/qcustcdt

The example CSV file, QTEMP/CUSTCSV, looks like this:

938472,"Henning","G K","4859 Elm Ave","Dallas","TX",75217,5000,3,37.00,.00 
839283,"Jones","B D","21B NW 135 St","Clay","NY",13041,400,1,100.00,.00    
392859,"Vine","S S","PO Box 79","Broton","VT",5046,700,1,439.00,.00        
938485,"Johnson","J A","3 Alpine Way","Helen","GA",30545,9999,2,3987.50,33.50
397267,"Tyron","W E","13 Myrtle Dr","Hector","NY",14841,1000,1,.00,.00     
389572,"Stevens","K L","208 Snow Pass","Denver","CO",80226,400,1,58.75,1.50
846283,"Alison","J S","787 Lake Dr","Isle","MN",56342,5000,3,10.00,.00     
475938,"Doe","J W","59 Archer Rd","Sutter","CA",95685,700,2,250.00,100.00  
693829,"Thomas","A N","3 Dove Circle","Casper","WY",82609,9999,2,.00,.00   
593029,"Williams","E D","485 SE 2 Ave","Dallas","TX",75218,200,1,25.00,.00 
192837,"Lee","F L","5963 Oak St","Hector","NY",14841,700,2,489.50,.50      
583990,"Abraham","M T","392 Mill St","Isle","MN",56342,9999,3,500.00,.00   

Now, find the records that have a lowercase "e" in the second position of the city field, which is the fifth field.

with t1 as (select rrn(custcsv) as rrrn,
            substr(custcsv,locate(',',custcsv)+1) as x
            from qtemp/custcsv),
t2 as (select rrrn, substr(x,locate(',',x)+1) as x from t1),
t3 as (select rrrn, substr(x,locate(',',x)+1) as x from t2),
t4 as (select rrrn, substr(x,locate(',',x)+1) as x from t3),
t5 as (select rrrn, substr(x,1,locate(',',x)-1) as city from t4)
select * from t5
where city like '"_e%'

Here's the result set.

RRRN   CITY    
   4   "Helen" 
   5   "Hector"
   6   "Denver"
  11   "Hector"

So, how does it work?

Common table expression T1 drops the first field by extracting everything that follows the first comma, that is, the second field onward. It also retrieves each record's relative record number from the original file.

with t1 as (select rrn(custcsv) as rrrn, 
            substr(custcsv,locate(',',custcsv)+1) as x
            from qtemp/custcsv), 

Common table expression T2 carries along the relative record number and extracts everything that follows the first comma in column X of T1, that is, the third field onward:

t2 as (select rrrn, substr(x,locate(',',x)+1) as x from t1), 

Common table expressions T3 and T4 work like T2, carrying along the relative record number of the original file and each peeling one more field from the front of the CSV data, so that column X of T4 begins with the fifth field.
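To see what column X holds after the four peeling steps, a statement along these lines could be run against the same QTEMP/CUSTCSV file. This is only an illustrative sketch that reuses the expressions from the full statement above; drop the comment lines if your SQL interface does not accept them.

```sql
-- Show the data that remains after peeling four fields from the front.
-- Assumes QTEMP/CUSTCSV was created and loaded as described earlier.
with t1 as (select rrn(custcsv) as rrrn,
            substr(custcsv,locate(',',custcsv)+1) as x
            from qtemp/custcsv),
t2 as (select rrrn, substr(x,locate(',',x)+1) as x from t1),
t3 as (select rrrn, substr(x,locate(',',x)+1) as x from t2),
t4 as (select rrrn, substr(x,locate(',',x)+1) as x from t3)
-- X now starts at the fifth field (the city) and runs to the end of the record.
select rrrn, x from t4
```

For the first sample record, X should begin with "Dallas","TX",75217 and continue through the rest of the record.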

Common table expression T5 carries along the relative record number and extracts the first field remaining in T4 (everything before its first comma), which is the fifth field of CUSTCSV.

t5 as (select rrrn, substr(x,1,locate(',',x)-1) as city from t4)

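All that remains is to select the desired records. Because the extracted value still includes its surrounding quotation marks, the LIKE pattern begins with a quotation mark, followed by an underscore that matches any first letter.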

select * from t5 
where city like '"_e%' 
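The same pattern can be pointed at a numeric field to check its decimal positions, which is the kind of question we were really asking. The following is a sketch only, written against the sample file rather than our real one: it peels nine fields to reach the tenth field (BALDUE in the sample data) and keeps the records whose value does not end in .00. The column name BALDUE and the NOT LIKE test are my choices for this illustration, and the sketch assumes that no field contains a comma.

```sql
-- Sketch: find records whose tenth field has nonzero decimal positions.
-- Assumes QTEMP/CUSTCSV as loaded above and no commas inside any field.
with t1 as (select rrn(custcsv) as rrrn,
            substr(custcsv,locate(',',custcsv)+1) as x
            from qtemp/custcsv),
t2 as (select rrrn, substr(x,locate(',',x)+1) as x from t1),
t3 as (select rrrn, substr(x,locate(',',x)+1) as x from t2),
t4 as (select rrrn, substr(x,locate(',',x)+1) as x from t3),
t5 as (select rrrn, substr(x,locate(',',x)+1) as x from t4),
t6 as (select rrrn, substr(x,locate(',',x)+1) as x from t5),
t7 as (select rrrn, substr(x,locate(',',x)+1) as x from t6),
t8 as (select rrrn, substr(x,locate(',',x)+1) as x from t7),
t9 as (select rrrn, substr(x,locate(',',x)+1) as x from t8),
-- After nine peels, X starts at the tenth field; take it up to the next comma.
t10 as (select rrrn, substr(x,1,locate(',',x)-1) as baldue from t9)
select * from t10
where baldue not like '%.00'
```

Against the sample data, this should return relative record numbers 4, 6, and 11 (3987.50, 58.75, and 489.50). Note that the last field of a record has no comma after it, so LOCATE would return 0 there and the final SUBSTR would be given a length of -1. To read the last field, use the trimmed remainder of the record instead of searching for another comma.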

I'd hesitate to use this technique in production (for one thing, it miscounts fields if a quoted value contains a comma), but for quick-and-dirty data analysis, it worked great for us.

My colleague and I determined that the decimal positions were in the CSV file, and were able to continue with our analysis.











 







 

Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.
Guild Companies, Inc., 50 Park Terrace East, Suite 8F, New York, NY 10034
