2_Data Manipulation and Exploration with Numpy and Pandas - Part 2
Written By: Deborah Dormah Kanubala, Twitter
This article continues my earlier article on NumPy. If you are already familiar with NumPy, you can jump right into this tutorial. Otherwise, I strongly encourage you to go through that article before coming back to this one.
Pandas Installation
Pandas is a Python library used for working with datasets. It provides useful functions and methods for analyzing, cleaning, exploring, and manipulating data.
To install pandas from the terminal, run: pip install pandas
Why Use Pandas
Pandas allows us to analyze large datasets easily.
Pandas makes it easy to clean messy datasets and make them readable and relevant.
Pandas also provides useful methods for merging and concatenating DataFrames.
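As a quick illustration of the last point, here is a minimal sketch of merging and concatenating. The two small tables and their column names (group, ounces, price) are made up for this example.

```python
import pandas as pd

# Hypothetical data, made up for this illustration.
left = pd.DataFrame({"group": ["a", "b", "c"], "ounces": [4, 6, 3]})
right = pd.DataFrame({"group": ["a", "b", "d"], "price": [10, 20, 40]})

# merge joins on a shared column; the default is an inner join,
# so only groups present in both frames ("a" and "b") are kept.
merged = pd.merge(left, right, on="group")

# concat stacks frames on top of each other; ignore_index renumbers the rows.
stacked = pd.concat([left, left], ignore_index=True)

print(merged)
print(stacked.shape)
```

Passing how="left", how="right" or how="outer" to pd.merge keeps the unmatched rows instead of dropping them.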
Import Pandas
import pandas as pd
Once we have installed and imported pandas, we can refer to it as pd. The code below creates a pandas DataFrame using pd.DataFrame():
data = pd.DataFrame({'group':['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
data.describe() # computes summary statistics for the numeric columns
# Let's sort the data frame by ounces. inplace=False returns a sorted copy; inplace=True would sort data itself
data.sort_values(by=['ounces'],ascending=True,inplace=False)
#sort the data by multiple columns.
data.sort_values(by=['group','ounces'],ascending=[True,False],inplace=False)
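The effect of the inplace argument can be checked directly. The sketch below reuses the same data frame and compares its first row before and after an in-place sort; the variable names sorted_copy, first_before, result and first_after are only for illustration.

```python
import pandas as pd

data = pd.DataFrame({"group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
                     "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

# inplace=False (the default) leaves `data` untouched and returns a sorted copy.
sorted_copy = data.sort_values(by="ounces")
first_before = data["ounces"].iloc[0]   # still the original first row

# inplace=True sorts `data` itself and returns None.
result = data.sort_values(by="ounces", inplace=True)
first_after = data["ounces"].iloc[0]    # now the smallest value

print(first_before, first_after, result)
```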
Pandas DataFrames
Datasets in pandas are stored as two-dimensional tables called DataFrames. Each column of a DataFrame is a Series.
We can create a data frame with two columns from a dictionary of lists using the code below:
series= {"calories": [420, 380, 390, 210, 671, 111, 321, 214, 566, 222],
"duration": [50, 40, 45, 35, 60, 25, 62, 95, 55, 45]}
df = pd.DataFrame(series)
df
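The same kind of table can also be built from explicit pd.Series objects. This is a small sketch using only the first three values of each column; the name df_small is chosen here for illustration.

```python
import pandas as pd

# Two Series built from the first three values of each column above.
calories = pd.Series([420, 380, 390], name="calories")
duration = pd.Series([50, 40, 45], name="duration")

# Each dictionary key becomes a column name; the Series are aligned on their index.
df_small = pd.DataFrame({"calories": calories, "duration": duration})

print(df_small.shape)          # (rows, columns)
print(list(df_small.columns))
```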
Pandas Indexing and Data Selection
There are a lot of different ways to select elements, rows, and columns from a data frame. In the next few lines of code, we will look at different ways to index/select:
dataframe.loc[]: used for label-based selection
dataframe.iloc[]: used for position-based (integer) selection
dataframe.ix[]: used for both label-based and integer-based selection (NB: this was deprecated and has been removed in recent pandas versions)
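The difference between labels and positions is easiest to see when the index is not the default 0, 1, 2, and so on. The frame below is a hypothetical example with index labels 10, 20 and 30.

```python
import pandas as pd

# Hypothetical frame with a non-default integer index, so labels and positions differ.
df_demo = pd.DataFrame({"calories": [420, 380, 390]}, index=[10, 20, 30])

by_label = df_demo.loc[20, "calories"]   # row whose label is 20
by_position = df_demo.iloc[0, 0]         # first row, first column

print(by_label, by_position)
```

With the default index the two happen to coincide, which is why loc[0] and iloc[0] return the same row in the examples that follow.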
series= {"calories": [420, 380, 390, 210, 671, 111, 321, 214, 566, 222, 610, 882],
"month":["January", "February", "March", "April", "May", "June", "July",
"August", "September", "October", "November", "December"],
"duration": [50, 40, 45, 35, 60, 25, 62, 95, 55, 45, 11, 23],
"names": ["Kwame", "Ama", "Kwasi", "Kwadwo", "Ebo", "Abena", "Akua",
"Asi", "Abena", "Yaw", "Akosua", "Afiba"],
"location":["Tamale", "Buipe", "Damango", "Dungu", "Savelugu",
"Bolgatanga", "Yendi", "Sawla", "Tuna", "Yapei", "Walewale", "Navrongo"],
"prices":[1400, 3200, 1200, 4560, 7761, 9900, 4955, 8722, 9111, 6093, 7021,2531]}
df = pd.DataFrame(series)
We can then perform the following operations on the created data frame:
#column names
df.columns
#select a single column
df["calories"]
#select multiple columns
df[["calories", "month", "location"]]
dataframe.loc[]
# first row
df.loc[0]
#select multiple rows
df.loc[[0, 1, 3]]
#select rows(first, second and sixth rows) and columns(calories, month, names) using loc
df.loc[[0, 1, 5], ["calories", "month", "names"]]
#select all rows and specific columns
df.loc[:, ["calories", "month"]]
#select the rows labeled 1 to 3 (the second to fourth rows; loc includes the end label) and all columns
df.loc[1:3,:]
dataframe.iloc[]
#select a single row
df.iloc[3]
#select all rows and some columns
df.iloc[:, [1, 2]]
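One detail worth checking when slicing: loc includes the end of the slice, while iloc excludes it, as ordinary Python slicing does. A minimal sketch on a five-row frame (df_demo is a shortened, illustrative version of the data above):

```python
import pandas as pd

df_demo = pd.DataFrame({"calories": [420, 380, 390, 210, 671],
                        "duration": [50, 40, 45, 35, 60]})

label_slice = df_demo.loc[1:3]      # labels 1, 2 and 3: the end label is included
position_slice = df_demo.iloc[1:3]  # positions 1 and 2: the end position is excluded

print(len(label_slice), len(position_slice))
```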
Pandas Methods
Function | Description/Use |
dataframe.head() | Return the top n rows of a data frame. |
dataframe.tail() | Return bottom n rows of a data frame. |
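dataframe.shape | Return the number of rows and columns of the data as a tuple (an attribute, so no parentheses). |
dataframe.groupby() | Group rows by the values of one or more columns so that each group can be aggregated. |
dataframe.isnull().sum() | Return the number of missing values in each column. |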
dataframe.insert() | Insert column into DataFrame at the specified location. |
dataframe.copy() | Makes a copy of the original data frame |
dataframe.value_counts() | Return the counts of unique values. |
dataframe.drop_duplicates() | Return a pandas DataFrame with duplicate rows removed. |
dataframe.memory_usage() | Return memory usage of each column (in bytes) in a Pandas DataFrame |
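A short sketch of several of these calls on a tiny frame. The data is hypothetical and deliberately contains one missing value and one duplicated row; drop_duplicates() is the call used here to remove repeated rows.

```python
import pandas as pd

# Hypothetical data with one missing value and one duplicated row.
df_demo = pd.DataFrame({"group": ["a", "a", "b", "b"],
                        "ounces": [4.0, 4.0, None, 8.0]})

n_rows, n_cols = df_demo.shape                      # attribute, no parentheses
missing = df_demo.isnull().sum()                    # missing values per column
counts = df_demo["group"].value_counts()            # occurrences of each value
means = df_demo.groupby("group")["ounces"].mean()   # mean ounces within each group
deduped = df_demo.drop_duplicates()                 # drops the repeated ("a", 4.0) row

backup = df_demo.copy()                             # independent copy
backup.insert(0, "id", [1, 2, 3, 4])                # new column at position 0

print(n_rows, n_cols)
print(missing)
print(means)
```

Because backup is a copy, inserting the id column does not change df_demo.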
Load Data Using Pandas
One of the best parts of pandas is that it can load datasets stored in many different file formats. In this section, we will briefly look at how to load a few of them.
Reading csv files
df_csv = pd.read_csv('data.csv')
Reading JSON files
df_json = pd.read_json('data.json')
Reading Excel files
df_excel = pd.read_excel('data.xlsx')
Reading HTML files
# read_html returns a list of DataFrames, one per table found in the page
df_html_tables = pd.read_html('data.html')
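The file names above are placeholders. To try the readers without any file on disk, here is a small sketch that feeds in-memory text through io.StringIO instead; the two-row dataset is made up for the example.

```python
import io
import pandas as pd

# In-memory text stands in for a file on disk, so the example runs without data.csv.
csv_text = "calories,duration\n420,50\n380,40\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Round trip through JSON, again using an in-memory buffer instead of data.json.
json_text = df_csv.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_csv)
print(df_json)
```

With real files, the path string simply replaces the io.StringIO(...) argument.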
Working with the Titanic dataset
You will now have to work in break-out rooms. All instructions should be carefully followed.
Each room will have one volunteer.
One participant will voluntarily or randomly be chosen to share their screen with their notebook.
The participant sharing their screen should not be the one speaking; the other participants should direct what is written.
Download the Titanic data from the shared Google Drive folder.
Now, follow the instructions in the notebook.
Final tips:
Continue practicing daily. Learning programming == learning a new language like EN, DE, FR, etc.
Use StackOverflow if you are not sure of how to do something.
Instead of copying and pasting code, write the code out by yourself.
Read other people's code as much as possible.
Don't freak out about errors. Read them and try to understand what they say.
Copy the errors and search on Google. HIGH CHANCE: it is already a solved problem.
If preparing for technical interviews, make sure to use resources such as HackerRank, LeetCode, and Codewars.
References
Pandas: https://pandas.pydata.org/docs/user_guide/10min.html
Numpy: https://numpy.org/doc/stable/user/absolute_beginners.html
Indexing: https://www.geeksforgeeks.org/indexing-and-selecting-data-with-pandas/
Exercises: https://favtutor.com/blogs/numpy-exercises-python
Numpy Documentation: https://numpy.org/doc/
Pandas Documentation: https://pandas.pydata.org/docs/