
Exploring File Import Options in Python



Another post starts with you beautiful people!
Today we will explore various file import options in Python, which I learned from a great learning site, DataCamp.
To import data into Python, we should first have an idea of what files are in our working directory. We will work through the examples step by step, as given below:
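
A quick way to see what is in the working directory from Python itself is the built-in os module. This is a minimal sketch (not part of the DataCamp exercises), assuming you run it from your project folder:

# Import the built-in os module
import os

# Print the current working directory
print(os.getcwd())

# List the files and folders in it
print(os.listdir('.'))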

Importing entire text files-
In this exercise, we'll be working with the file mobydick.txt.
It is a text file that contains the opening sentences of Moby Dick, one of the great American novels!
Here you'll get experience opening a text file, printing its contents to the shell and, finally, closing it-

# Open a file: file
file = open('mobydick.txt', mode='r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)


Importing text files line by line-
For large files, we may not want to print all of their content to the shell. We may wish to print only the first few lines.
Enter the readline() method, which allows us to do exactly that. When a file called file is open,
we can print the first line by executing file.readline().
If we execute the same command again, the second line will print, and so on.
We can bind a variable file by using a context manager construct:

with open('huck_finn.txt') as file:

While still within this construct, the variable file will be bound to the file object returned by open('huck_finn.txt');

thus, to print the file to the shell, all the code we need to execute is:

with open('huck_finn.txt') as file:
    print(file.read())

# Read & print the first 3 lines
with open('mobydick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())


You're now well-versed in importing text files and you're about to become a wiz at importing flat files.

Using NumPy to import flat files-
In this exercise, we're now going to load the MNIST digit recognition dataset using the numpy function loadtxt() and see just how easy it can be:

The first argument will be the filename.
The second will be the delimiter which, in this case, is a comma.
You can find more information about the MNIST dataset on the webpage of Yann LeCun, who is currently Director of AI Research at Facebook and Founding Director of the NYU Center for Data Science, among many other things.

# Import package
import numpy as np

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits
print(type(digits))

# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

Customizing your NumPy import-
What if there are rows, such as a header, that you don't want to import?
What if your file has a delimiter other than a comma? What if you only wish to import particular columns?

There are a number of arguments that np.loadtxt() takes that we'll find useful:
delimiter changes the delimiter that loadtxt() is expecting; for example, we can use ',' for comma-delimited and '\t' for tab-delimited files.
skiprows allows us to specify how many rows (not indices) we wish to skip.
usecols takes a list of the indices of the columns we wish to keep.

The file that we'll be importing, digits_header.txt, has a header and is tab-delimited.

# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])

# Print data
print(data)


Importing different datatypes-
The file seaslug.txt has a text header consisting of strings, and is tab-delimited.
The data consist of the percentage of sea slug larvae that had metamorphosed in a given time period.

Due to the header, if we tried to import it as-is using np.loadtxt(), Python would throw us a ValueError and tell us that it could not convert string to float.
There are two ways to deal with this:
Firstly, we can set the data type argument dtype equal to str (for string).
Alternatively, we can skip the first row as we have seen before, using the skiprows argument.

# Import packages (NumPy and matplotlib, in case they are not already loaded)
import numpy as np
import matplotlib.pyplot as plt

# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th element of data_float
print(data_float[9])

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()


Working with mixed datatypes (1)-
Much of the time we will need to import datasets which have different datatypes in different columns; one column may contain strings and another floats, for example. The function np.loadtxt() will freak at this.
There is another function, np.genfromtxt(), which can handle such structures. If we pass dtype=None to it, it will figure out what types each column should be.

Import 'titanic.csv' using the function np.genfromtxt() as follows:
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

Here, the first argument is the filename, the second specifies the delimiter, and the third argument, names, tells us there is a header. Because the data are of different types, data is an object called a structured array.
Because numpy arrays have to contain elements that are all of the same type, the structured array solves this by being a 1D array,
where each element of the array is a row of the flat file imported. We can test this by checking out the array's shape in the shell with np.shape(data).

Accessing rows and columns of structured arrays is super-intuitive: to get the ith row, merely execute data[i], and to get the column with name 'Fare', execute data['Fare'].
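
Putting those operations together, a minimal sketch (assuming the np.genfromtxt() call above has already been run) looks like this:

# Check the shape: a 1D array with one element per row of the file
print(np.shape(data))

# Get the first row of the structured array
print(data[0])

# Get the column with name 'Fare'
print(data['Fare'])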

Working with mixed datatypes (2)-
We have just used np.genfromtxt() to import data containing mixed datatypes.
There is also another function np.recfromcsv() that behaves similarly to np.genfromtxt(),
except that its default dtype is None. In this exercise, we'll practice using this to achieve the same result.

# Assign the filename: file
file = 'titanic.csv'

# Import file using np.recfromcsv: d
d = np.recfromcsv(file)

# Print out first three entries of d
print(d[:3])


Using pandas to import flat files as DataFrames (1)-
In the last exercise, we were able to import flat files containing columns with different datatypes as numpy arrays.
However, the DataFrame object in pandas is a more appropriate structure in which to store such data and,
thankfully, we can easily import files of mixed data types as DataFrames using the pandas functions read_csv() and read_table().

# Import pandas
import pandas as pd

# Assign the filename: file
file = 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame
print(df.head())
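
Since read_table() was mentioned above, here is a minimal sketch of the equivalent call; its sep argument simply defaults to '\t' instead of ',', so for this comma-delimited file we pass sep=',' explicitly:

# read_table() with an explicit comma separator gives the same DataFrame
df_alt = pd.read_table(file, sep=',')
print(df_alt.head())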


Using pandas to import flat files as DataFrames (2)-
In the last exercise, we were able to import flat files into a pandas DataFrame.
As a bonus, it is then straightforward to retrieve the corresponding numpy array using the attribute values.
We'll now have a chance to do this using the MNIST dataset, which is available as digits.csv.
# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = data.values

# Print the datatype of data_array to the shell
print(type(data_array))


Customizing your pandas import-
The pandas package is also great at dealing with many of the issues we will encounter when importing data as a data scientist,
such as comments occurring in flat files, empty lines and missing values. Note that missing values are also commonly referred to as NA or NaN.
To wrap up this post, we're now going to import a slightly corrupted copy of the Titanic dataset titanic_corrupt.txt, which contains comments after the character '#' and is tab-delimited.

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

That's it for today, guys! I have shared the code snippets so that you can use them directly in your notebook.
Please try them out and share your thoughts!
