Using OpenCSV to parse CSV files in Java
My Information Systems degree path has so far required me to take my programming classes in Java, and that includes Data Structures. Due to the nature of the course, the majority of our assignments involve parsing data in whatever manner we judge appropriate for the data we are working with and the task assigned. The first 2 assignments involved taking the Top 200 Streaming Songs list off Spotify and building first a simple linked list that prints out, and then a playlist.
Spotifycharts.com makes it easy to export their data into a CSV spreadsheet; there is a lovely button in the top right corner. The challenge that came with this first assignment is that CSV is short for comma-separated values. If you download the CSV for, let's say, the week of 1-17-2020 to 1-24-2020 and open it, Windows defaults CSV files to Excel and the data is nice and organized with no commas present. Right click and open it in Notepad and you'll see the actual raw data, with the column values separated by commas on every line. So this is essentially where we had to decide how we wanted to read the data. Rather than fiddle with Excel, which I am not fond of, I chose to just read the raw CSV data.
According to the official Java docs, the String class has a method called split, where you can specify the regular expression you want the string split by. For CSV the obvious choice is a comma. For simple data this would normally be enough, and you might not need anything else to parse a CSV file, but with music, extra commas aren't uncommon, whether in song titles or in the list of featured artists on a track. Here is a snippet from the main method of my first version of our first assignment.
TopStreamingArtists artistNames = new TopStreamingArtists();
String csvFile = "src/regional-us-weekly-2020-01-17--2020-01-24.csv";
String inputLine;
try {
    Scanner inputStream = new Scanner(new FileReader(csvFile));
    // read and discard the first line of the file
    inputLine = inputStream.nextLine();
    while (inputStream.hasNextLine()) {
        inputLine = inputStream.nextLine();
        // naive split: breaks at every comma, including commas inside a field
        String[] tempArray = inputLine.split(",");
        int tempPosition = Integer.parseInt(tempArray[0]);
        String tempTrack = tempArray[1];
        String tempArtist = tempArray[2];
        int tempStreams = Integer.parseInt(tempArray[3]);
        String tempUrl = tempArray[4];
        artistNames.insertLast(tempPosition, tempTrack, tempArtist, tempStreams, tempUrl);
    }
    inputStream.close();
} catch (FileNotFoundException e) {
    // minimal handler added here so the snippet compiles on its own
    e.printStackTrace();
}
So the goal was to read my file with a Scanner object, one line at a time, save each line as a string in the variable inputLine, and then split it at every comma into a String array called tempArray, so that I'd have a bunch of substrings of that line that I could place in temp variables and add to the linked list. The first issue that came up was an error, because the very top of the file has 2 lines that aren't data we want to parse, just information that's useful if you're reading the file with the human eye. This can be solved by calling "inputLine = inputStream.nextLine();" twice before the while loop, but that's not really effective if you try to reuse this program later for some other data. What if you get a CSV file with 30 lines of text before the data even appears? Adding "inputLine = inputStream.nextLine();" 30 times is super messy.
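One way to keep that from getting messy, even without a library, is to put the skipping in a small loop. This is only a rough sketch and not what I submitted: the class name HeaderSkipper, the method skipLines, and the sample rows are all made up for illustration.

```java
import java.util.Scanner;

class HeaderSkipper {
    // Hypothetical helper: advances the scanner past up to linesToSkip lines
    // and returns how many lines were actually skipped.
    static int skipLines(Scanner in, int linesToSkip) {
        int skipped = 0;
        while (skipped < linesToSkip && in.hasNextLine()) {
            in.nextLine();
            skipped++;
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Made-up sample: 2 non-data lines followed by one data row
        String sample = "some note about the chart\n"
                + "Position,Track Name,Artist,Streams,URL\n"
                + "1,Some Track,Some Artist,100,https://example.com/1\n";
        try (Scanner in = new Scanner(sample)) {
            skipLines(in, 2);
            // the next line read is the first real data row
            System.out.println(in.nextLine());
        }
    }
}
```

The number of lines to skip is just a parameter here, so 2 or 30 is the same amount of code.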
If you ignore that, or modify the CSV file to take those lines out, the second issue comes up around 85 lines in: this code will literally split at every comma, as it's supposed to. So when the program got to Eminem's Yah Yah (feat. Royce Da 5'9", Black Thought, Q-Tip & Denaun), there were 2 extra commas in the track field alone. The tempArray then has more values in it than it did for the previous lines, and the positions are all off. When Integer.parseInt is called on the 4th value in the array, instead of reading the number of streams it reads Q-Tip & Denaun and throws an error, as this cannot be made into an integer value. The first assignment was flexible and we could manipulate the CSV file, so I just submitted it with only the first 65 lines for grading. But considering everything else builds on this, I wanted to make sure the second assignment was done properly, as it's literally using the same sets of data we used for the first one.
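To make the problem concrete, here is a small sketch of what split does to a row like that. The rows are made up: I'm assuming the export wraps the track field in quotes and doubles the quote mark inside it, which is the usual CSV convention, and the stream counts and URLs are placeholders.

```java
class NaiveSplitDemo {
    // Same approach as the first version: break the line at every comma
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // Hypothetical rows; the streams and URLs are not real data
        String plain = "1,Some Track,Some Artist,1000,https://example.com/a";
        String quoted = "85,\"Yah Yah (feat. Royce Da 5'9\"\", Black Thought, Q-Tip & Denaun)\","
                + "Eminem,1000,https://example.com/b";

        System.out.println(naiveSplit(plain).length);  // 5 fields, as expected

        String[] parts = naiveSplit(quoted);
        System.out.println(parts.length);              // 7 pieces instead of 5
        System.out.println(parts[3]);                  // part of the track name, not the streams

        try {
            Integer.parseInt(parts[3]);
        } catch (NumberFormatException e) {
            System.out.println("parseInt failed: " + e.getMessage());
        }
    }
}
```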
As our professor said either before or after the first assignment (I don't remember at this point, it was roughly 2 months ago), there's no point in trying to reinvent the wheel when it comes to something like reading CSV files. It's something that has been done for a LONG time at this point, and there are many available libraries that we can import into our Java projects to assist with it. After some time spent googling I came upon OpenCSV. I downloaded the opencsv-5.1.jar file to get the library, and since it requires the Apache Commons Lang library as a dependency I also downloaded commons-lang3-3.9.jar. IntelliJ has a GUI for importing libraries, but follow your IDE's documentation if yours is different. After doing that I made a separate program from my first assignment that would use OpenCSV, and added the following 2 imports up top to get the library loaded:
import com.opencsv.CSVReader;
import com.opencsv.exceptions.CsvValidationException;
So to see how that changes the code, here's a snippet of the updated version of the first assignment:
SortedArtists artistNames = new SortedArtists();
String csvFile = "src/regional-us-weekly-2020-01-17--2020-01-24.csv";
try {
    CSVReader reader = new CSVReader(new FileReader(csvFile));
    String[] inputLine;
    // skip the 2 non-data lines at the top of the file
    reader.skip(2);
    while ((inputLine = reader.readNext()) != null) {
        int tempPosition = Integer.parseInt(inputLine[0]);
        String tempTrack = inputLine[1];
        String tempArtist = inputLine[2];
        int tempStreams = Integer.parseInt(inputLine[3]);
        String tempUrl = inputLine[4];
        artistNames.insertLast(tempPosition, tempTrack, tempArtist, tempStreams, tempUrl);
    }
    reader.close(); // release the file once the while loop is done
} catch (IOException | CsvValidationException e) {
    // minimal handler added here so the snippet compiles on its own
    e.printStackTrace();
}
CSVReader is the class in OpenCSV that is primarily used for this, **full java doc here** to see all the methods. The first important addition is the skip method. By default CSVReader, just like the first version, won't skip any lines and will read from the beginning of the file, which gives the same error. By calling skip with the number of lines to skip, those lines are passed over automatically and the code stays clean. I know that spotifycharts.com will always have the first 2 lines populated with information that I don't want to read, but if I wanted to reuse this code somewhere else I could change the 2 to whatever number was required, replace it with a variable, or, if the program is going to be a multipurpose CSV reader, even prompt the user for how many lines to skip, save the answer to a variable, and pass that into the skip method.
The second important addition, which isn't immediately clear unless you open and read the CSVReader class, is that the readNext() method does all the parsing of the line and even removes the quotes by default. It accounts for things like extra commas inside a quoted field and splits the line correctly, so running this version of the program gives a much cleaner output when displaying the linked list. It also made parsing the data for the second assignment easier, since that used essentially the same data sets.
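To give a feel for what that kind of parsing involves, here is a stripped-down sketch of a quote-aware split. This is not OpenCSV's actual code, just my own illustration of the idea: the class QuoteAwareSplit and the method parseLine are made up, it only handles commas and doubled quotes inside a quoted field, and the sample row uses placeholder values.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplit {
    // Splits one CSV line on commas, but ignores commas inside double quotes.
    // A doubled quote ("") inside a quoted field becomes a single literal quote.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside the field
                        i++;
                    } else {
                        inQuotes = false;    // closing quote, not kept in the output
                    }
                } else {
                    current.append(c);       // commas land here and stay in the field
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote, not kept in the output
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical row; the streams and URL are placeholders
        String row = "85,\"Yah Yah (feat. Royce Da 5'9\"\", Black Thought, Q-Tip & Denaun)\","
                + "Eminem,1000,https://example.com/b";
        List<String> fields = parseLine(row);
        System.out.println(fields.size()); // 5
        System.out.println(fields.get(1)); // the whole track title, commas included
    }
}
```

A real library covers far more cases than this (different separators, escape characters, fields that span multiple lines), which is exactly why I'd rather use one.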
Going forward, I essentially learned to look for a library that handles whatever is giving me trouble, because when it comes to Data Structures someone else has probably had to deal with the same issue before and there's a library that takes care of it. Both versions of this program can be seen on my personal git repository for comparison right here. Master version is what I submitted first, and version 2 branch is the updated one that uses OpenCSV to parse. The included CSV file is the edited one that doesn't have the first 2 lines, so there is no reader.skip(2) code in version2 branch as I had placed here as an example, but that was used in the following program.