Every data scientist I know spends a lot of time handling data that originates in CSV files. You can quickly end up with a mess of CSV files located in your Documents, Downloads, Desktop, and other random folders on your hard drive.
I greatly simplified my workflow the moment I started organizing all my CSV files in my Cloud account. Now I always know where my files are and I can read them directly from the Cloud using JupyterLab (the new Jupyter UI) or my Python scripts.
This article will teach you how to read your CSV files hosted on the Cloud in Python as well as how to write files to that same Cloud account.
I’ll use IBM Cloud Object Storage, an affordable, reliable, and secure Cloud storage solution. (Since I work at IBM, I’ll also let you in on a secret of how to get 10 Terabytes for a whole year, entirely for free.) This article will help you get started with IBM Cloud Object Storage and make the most of this offer. It is composed of three parts:
- How to use IBM Cloud Object Storage to store your files;
- Reading CSV files in Python from Object Storage;
- Writing CSV files to Object Storage (also in Python of course).
The best way to follow along with this article is to go through the accompanying Jupyter notebook, either on Cognitive Class Labs (our free JupyterLab Cloud environment) or by downloading the notebook from GitHub and running it yourself. If you opt for Cognitive Class Labs, once you sign in, you will be able to select the IBM Cloud Object Storage Tutorial as shown in the image below.
What is Object Storage and why should you use it?
The “Storage” part of object storage is pretty straightforward, but what exactly is an object and why would you want to store one? An object is basically any conceivable data. It could be a text file, a song, or a picture. For the purposes of this tutorial, our objects will all be CSV files.
Unlike a typical filesystem (like the one used by the device you’re reading this article on) where files are grouped in hierarchies of directories/folders, object storage has a flat structure. All objects are stored in groups called buckets. This structure allows for better performance, massive scalability, and cost-effectiveness.
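One way to picture this flat structure is as one dictionary per bucket: each object is found by its key alone, with no directories to walk. The toy model below is only an illustration under that assumption (the bucket and key names are made up), not a description of how IBM Cloud Object Storage is implemented.

```python
# Toy model: each bucket maps object keys directly to object contents.
# Bucket and key names here are hypothetical.
buckets = {
    "cc-tutorial": {
        "world_cup_predictions.csv": b"team,win_probability\n",
        "airline_safety.csv": b"airline,incidents\n",
    }
}


def get_object(bucket_name, key):
    """Look up an object by bucket name and key; there is no folder hierarchy to traverse."""
    return buckets[bucket_name][key]


print(get_object("cc-tutorial", "airline_safety.csv"))
```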
By the end of this article, you will know how to store your files on IBM Cloud Object Storage and easily access them using Python.
Provisioning an Object Storage Instance on IBM Cloud
Visit the IBM Cloud Catalog and search for “object storage”. Click the Object Storage option that pops up. Here you’ll be able to choose your pricing plan. Feel free to use the Lite plan, which is free and allows you to store up to 25 GB per month.
Sign up (it’s free) or log in with your IBM Cloud account, and then click the Create button to provision your Object Storage instance. You can customize the Service Name if you wish, or just leave it as the default. You can also leave the resource group at its default. Resource groups are useful for organizing your resources on IBM Cloud, particularly when you have many of them running.
Working with Buckets
Since you just created the instance, you’ll now be presented with options to create a bucket. You can always find your Object Storage instance by selecting it from your IBM Cloud Dashboard.
There’s a limit of 100 buckets per Object Storage instance, but each bucket can hold billions of objects. In practice, how many buckets you need will be dictated by your availability and resilience needs.
For the purposes of this tutorial, a single bucket will do just fine.
Creating your First Bucket
Click the Create Bucket button and you’ll be shown a window like the one below, where you can customize some details of your Bucket. All these options may seem overwhelming at the moment, but don’t worry, we’ll explain them in a moment. They are part of what makes this service so customizable, should you have the need later on.
If you don’t care about the nuances of bucket configuration, you can type in any unique name you like and press the Create button, leaving all other options at their defaults. You can then skip to the Putting Objects in Buckets section below. If you would like to learn what these options mean, read on.
Configuring your bucket
|Resiliency||How your data is stored||Characteristics|
|Cross Region||Your data is stored across three geographic regions within your selected location||High availability and very high durability|
|Regional||Your data is stored across three different data centers within a single geographic region||High availability and durability, very low latency for regional users|
|Single Data Center||Your data is stored across multiple devices within a single data center||Data locality|
Storage Class Options
|Frequency of Data Access||IBM Cloud Object Storage Class|
|Weekly or monthly||Vault|
|Less than once a month||Cold Vault|
Feel free to experiment with different configurations, but I recommend choosing “Standard” for your storage class for this tutorial’s purposes. Any resilience option will do.
Putting Objects in Buckets
After you’ve created your bucket, store its name in a Python variable (replace cc-tutorial with the name of your bucket), either in your Jupyter notebook or in a Python script.
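A minimal version of that cell, using the tutorial’s example bucket name, might look like this:

```python
# Replace cc-tutorial with the name of your own bucket
bucket_name = "cc-tutorial"
```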
There are many ways to add objects to your bucket, but we’ll start with something simple. Add a CSV file of your choice to your newly created bucket, either by clicking the Add objects button, or dragging and dropping your CSV file into the IBM Cloud window.
If you don’t have an interesting CSV file handy, I recommend downloading FiveThirtyEight’s 2018 World Cup predictions.
Whatever CSV file you decide to add to your bucket, assign the name of the file to a variable called filename so that you can easily refer to it later.
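For example, assuming the file you uploaded is called world_cup_predictions.csv (a hypothetical name; use whatever your object is actually called in the bucket):

```python
# Hypothetical filename; use the exact name of the CSV file you uploaded
filename = "world_cup_predictions.csv"
```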
We’ve placed our first object in our first bucket. Now let’s see how we can access it. To access your IBM Cloud Object Storage instance from anywhere other than the web interface, you will need to create credentials. Click the New credential button under the Service credentials section to get started.
In the next window, you can leave all fields at their defaults and click the Add button to continue. You’ll now be able to click on View credentials to obtain the JSON object containing the credentials you just created. You’ll want to store everything you see in a credentials variable (a Python dictionary holding your own values).
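As a rough sketch, the variable could look like the dictionary below. Every value is a placeholder, and the field names are an assumption based on what a service credential typically contains; the JSON you see under View credentials is the authority, so paste that in as-is.

```python
# Placeholder values only; paste in the JSON shown under View credentials.
# Field names are assumed from a typical service credential and may differ in yours.
credentials = {
    "apikey": "your-api-key",
    "iam_apikey_description": "your-api-key-description",
    "iam_apikey_name": "your-api-key-name",
    "iam_role_crn": "your-role-crn",
    "iam_serviceid_crn": "your-service-id-crn",
    "resource_instance_id": "your-resource-instance-id",
}
```

Of these, apikey and resource_instance_id are the two fields the connection sketches in this article rely on.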
Note: If you’re following along within a notebook, be careful not to share this notebook after adding your credentials!
Reading CSV files from Object Storage using Python
The recommended way to access IBM Cloud Object Storage with Python is to use the ibm_boto3 library, which we’ll import first.
The primary way to interact with IBM Cloud Object Storage through ibm_boto3 is by using an ibm_boto3.resource object. This resource-based interface abstracts away the low-level REST interface between you and your Object Storage instance.
Next, create a resource Python object using the IBM Cloud Object Storage credentials you filled in above.
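Here is one way this step might look, as a sketch rather than the notebook’s exact cell. It assumes the SDK is installed (the package is published as ibm-cos-sdk), that your credentials contain apikey and resource_instance_id fields, and that your bucket uses the US cross-region endpoint shown; the correct endpoint depends on where you created your bucket. The helper names cos_connection_args and create_cos_resource are made up for this example.

```python
def cos_connection_args(credentials, endpoint_url):
    """Map the service credentials onto ibm_boto3.resource keyword arguments."""
    return {
        "ibm_api_key_id": credentials["apikey"],
        "ibm_service_instance_id": credentials["resource_instance_id"],
        "endpoint_url": endpoint_url,
    }


def create_cos_resource(credentials, endpoint_url):
    """Create the resource object used to talk to Object Storage."""
    # Imported inside the function so this cell loads even before the SDK
    # is installed (pip install ibm-cos-sdk).
    import ibm_boto3
    from ibm_botocore.client import Config

    return ibm_boto3.resource(
        "s3",
        config=Config(signature_version="oauth"),
        **cos_connection_args(credentials, endpoint_url),
    )


# Assumed endpoint for a US cross-region bucket; use the one listed for your bucket.
service_endpoint = "https://s3.us.cloud-object-storage.appdomain.cloud"

# With the SDK installed and real credentials filled in:
# cos = create_cos_resource(credentials, service_endpoint)
```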
After creating a resource object, we can easily access any of our Cloud objects by passing a bucket name and a key (in our case, the key is a filename) to our resource.Object method and calling the get method on the result. To get the object into a useful format, we’ll do some processing to turn it into a pandas DataFrame.
We’ll make this into a function so we can easily use it later:
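A possible shape for that function is sketched below. The names fetch_object_bytes and get_item are illustrative, and passing the resource object in as the cos argument is a choice made for this sketch so the function does not depend on a global variable.

```python
import io


def fetch_object_bytes(cos, bucket_name, key):
    """Download one object and return its raw bytes."""
    response = cos.Object(bucket_name, key).get()
    # The payload arrives as a streaming body under the "Body" key
    return response["Body"].read()


def get_item(cos, bucket_name, key):
    """Read a CSV object from a bucket into a pandas DataFrame."""
    import pandas as pd  # only this last step needs pandas

    return pd.read_csv(io.BytesIO(fetch_object_bytes(cos, bucket_name, key)))


# With a live resource object:
# df = get_item(cos, bucket_name, filename)
# df.head()
```

Wrapping the bytes in io.BytesIO gives pandas a file-like object to parse, so nothing has to be written to disk first.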
Adding files to IBM Cloud Object Storage with Python
IBM Cloud Object Storage’s web interface makes it easy to add new objects to your buckets, but at some point you will probably want to create objects programmatically from Python. The put_object method allows you to do this.
In order to use it you will need:
- The name of the bucket you want to add the object to;
- A unique name (Key) for the new object;
- A bytes-like object, which you can get from Python’s built-in open function in binary mode.
To demonstrate, let’s add another CSV file to our bucket. This time we’ll use FiveThirtyEight’s airline safety dataset.
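The upload could be sketched as follows, assuming the dataset has already been saved next to your notebook or script. The helper name upload_file and the local filename are hypothetical, and calling put_object through the resource’s Bucket object is one of several ways to reach that method.

```python
import os


def upload_file(cos, bucket_name, local_path, key=None):
    """Upload a local file to a bucket; the key defaults to the file's name."""
    if key is None:
        key = os.path.basename(local_path)
    # Binary mode yields the bytes-like content put_object expects
    with open(local_path, "rb") as f:
        return cos.Bucket(bucket_name).put_object(Key=key, Body=f)


# With a live resource object and the dataset saved locally (hypothetical filename):
# upload_file(cos, bucket_name, "airline-safety.csv")
```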
You can now easily access your newly created object using the function we defined above in the Reading CSV files from Object Storage using Python section.
Get 10 Terabytes of IBM Cloud Object Storage for free
You now know how to read from and write to IBM Cloud Object Storage using Python! Well done. The ability to programmatically read and write files to the Cloud will be quite handy when working from scripts and Jupyter notebooks.
If you build applications or do data science, we also have a great offer for you. You can apply to become an IBM Partner at no cost to you and receive 10 Terabytes of space to play and build applications with.
You can sign up by simply filling out the embedded form below. If you are unable to fill out the form, you can click here to open it in a new window.
Just make sure that you apply with a business email (even your own domain name if you are a freelancer) as free email accounts like Gmail, Hotmail, and Yahoo are automatically rejected.