
Exploring The File Import



Another post starts with you beautiful people!
Today we will explore various file import options in Python, which I learned from a great learning site- DataCamp.
In order to import data into Python, we should first have an idea of which files are in our working directory. We will work through step-by-step examples as given below.
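A quick way to do this (a minimal sketch using only the standard library os module) is:

# Check where we are and what files live there
import os

print(os.getcwd())      # path of the current working directory
print(os.listdir('.'))  # names of the files and folders inside it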

Importing entire text files-
In this exercise, we'll be working with the file mobydick.txt.
It is a text file that contains the opening sentences of Moby Dick, one of the great American novels!
Here you'll get experience opening a text file, printing its contents to the shell and, finally, closing it-

# Open a file: file
file = open('mobydick.txt', mode='r')

# Print it
print(file.read())

# Check whether file is closed (prints False here)
print(file.closed)

# Close file
file.close()

# Check whether file is closed again (prints True now)
print(file.closed)


Importing text files line by line-
For large files, we may not want to print all of their content to the shell. We may wish to print only the first few lines.
Enter the readline() method, which allows us to do this. When a file called file is open,
we can print out the first line by executing file.readline().
If we execute the same command again, the second line will print, and so on.
We can bind a variable file by using a context manager construct:
with open('huck_finn.txt') as file:
While still within this construct, the variable file will be bound to the file object returned by open('huck_finn.txt');

thus, to print the file to the shell, all the code we need to execute is:

with open('huck_finn.txt') as file:
    print(file.read())

# Read & print the first 3 lines of mobydick.txt
with open('mobydick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
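For a truly large file, we could also loop over the file object itself, which reads one line at a time instead of loading everything into memory. A small sketch, again assuming mobydick.txt:

# Print only the first 3 lines by iterating over the file object
with open('mobydick.txt') as file:
    for i, line in enumerate(file):
        if i == 3:
            break
        print(line, end='')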


You're now well-versed in importing text files, and you're about to become a whiz at importing flat files.

Using NumPy to import flat files-
In this exercise, we're now going to load the MNIST digit recognition dataset using the numpy function loadtxt() and see just how easy it can be:

The first argument will be the filename.
The second will be the delimiter which, in this case, is a comma.
You can find more information about the MNIST dataset on the webpage of Yann LeCun, who is currently Director of AI Research at Facebook and Founding Director of the NYU Center for Data Science, among many other things.

# Import package
import numpy as np

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits
print(type(digits))

# Select row 21 (skipping its first column) and reshape it
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot the reshaped data as a 28x28 image
import matplotlib.pyplot as plt
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

Customizing your NumPy import-
What if there are rows, such as a header, that you don't want to import?
What if your file has a delimiter other than a comma? What if you only wish to import particular columns?

There are a number of arguments that np.loadtxt() takes that we'll find useful:
delimiter changes the delimiter that loadtxt() is expecting; for example, we can use ',' and '\t' for comma-delimited and tab-delimited files respectively.
skiprows allows us to specify how many rows (not indices) we wish to skip.
usecols takes a list of the indices of the columns we wish to keep.

The file that we'll be importing, digits_header.txt, has a header and is tab-delimited.
# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])

# Print data
print(data)


Importing different datatypes-
The file seaslug.txt has a text header consisting of strings, and is tab-delimited.
The data consist of percentages of sea slug larvae that had metamorphosed in a given time period.

Due to the header, if we tried to import it as-is using np.loadtxt(), Python would throw a ValueError and tell us that it could not convert string to float.
There are two ways to deal with this:
Firstly, we can set the data type argument dtype equal to str (for string).
Alternatively, we can skip the first row as we have seen before, using the skiprows argument.

# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th element of data_float
print(data_float[9])

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()


Working with mixed datatypes (1)-
Much of the time we will need to import datasets which have different datatypes in different columns; one column may contain strings and another floats, for example. The function np.loadtxt() will freak out at this.
There is another function, np.genfromtxt(), which can handle such structures. If we pass dtype=None to it, it will figure out what types each column should be.

Import 'titanic.csv' using the function np.genfromtxt() as follows:
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

Here, the first argument is the filename, the second specifies the delimiter, and the third argument, names, tells the function that there is a header. Because the data are of different types, data is an object called a structured array.
Because numpy arrays have to contain elements that are all the same type, the structured array solves this by being a 1D array,
where each element of the array is a row of the flat file imported. We can test this by checking the array's shape in the shell with np.shape(data).

Accessing rows and columns of structured arrays is super-intuitive: to get the ith row, merely execute data[i], and to get the column with name 'Fare', execute data['Fare'].
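Putting this together, here is a minimal sketch, assuming titanic.csv sits in the working directory and contains a 'Fare' column as in the exercise:

# Load titanic.csv into a structured array and inspect it
import numpy as np

# encoding=None reads text as str (newer NumPy versions warn otherwise)
data = np.genfromtxt('titanic.csv', delimiter=',', names=True,
                     dtype=None, encoding=None)

print(np.shape(data))  # 1D: one element per row of the file
print(data[0])         # the first row
print(data['Fare'])    # the column named 'Fare'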

Working with mixed datatypes (2)-
We have just used np.genfromtxt() to import data containing mixed datatypes.
There is also another function np.recfromcsv() that behaves similarly to np.genfromtxt(),
except that its default dtype is None. In this exercise, we'll practice using this to achieve the same result.
# Assign the filename: file
file = 'titanic.csv'

# Import file using np.recfromcsv: d
d = np.recfromcsv(file)

# Print out first three entries of d
print(d[:3])
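Note that np.recfromcsv() may be deprecated in newer NumPy versions. A roughly equivalent call using np.genfromtxt() (a sketch only, since recfromcsv additionally lower-cases the field names and returns a record array) would be:

# Approximating np.recfromcsv(file) with np.genfromtxt()
d = np.genfromtxt(file, delimiter=',', names=True, dtype=None, encoding=None)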


Using pandas to import flat files as DataFrames (1)-
In the last exercise, we were able to import flat files containing columns with different datatypes as numpy arrays.
However, the DataFrame object in pandas is a more appropriate structure in which to store such data and,
thankfully, we can easily import files of mixed data types as DataFrames using the pandas functions read_csv() and read_table().

# Import pandas
import pandas as pd

# Assign the filename: file
file = 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame
print(df.head())


Using pandas to import flat files as DataFrames (2)-
In the last exercise, we were able to import flat files into a pandas DataFrame.
As a bonus, it is then straightforward to retrieve the corresponding numpy array using the attribute values.
We'll now have a chance to do this using the MNIST dataset, which is available as digits.csv.
# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = data.values

# Print the datatype of data_array to the shell
print(type(data_array))


Customizing your pandas import-
The pandas package is also great at dealing with many of the issues we will encounter when importing data as a data scientist,
such as comments occurring in flat files, empty lines and missing values. Note that missing values are also commonly referred to as NA or NaN.
To wrap up this post, we're now going to import a slightly corrupted copy of the Titanic dataset, titanic_corrupt.txt, which contains comments after the character '#', is tab-delimited, and represents missing values with the string 'Nothing'.

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

That's it, guys, for today. I have shared the code snippets so that you can use them directly in your notebook.
Please try them out and share your thoughts!

