Pandas is an open-source Python library for data analysis and manipulation. It is built on top of NumPy. In this tutorial, I will cover Pandas, its main features, and how to use it. First, let us look at some of its features.
Features
- Provides an efficient way to explore data.
- Supports multiple file formats.
- Ability to handle missing data.
- Ability to extract data, and run transformations on it.
- Reshape, slice, and index.
- Merge and join datasets.
- Perform mathematical operations on data.
- Time series functionality.
- Visualize data.
Installation
Installing Pandas is pretty simple. There are two ways to install Pandas:
- Using Anaconda. When you install Anaconda on your machine, Pandas and some other libraries get installed along with it. Click here to install Anaconda.
- Using ‘pip’. If you have Python already installed, run the command pip install pandas.
Alternatively, visit this website to install Pandas.
If you are a Linux user, the installation command may vary depending on your distribution. Refer to this site for proper installation guidance.
Data Types
A data type is used by a programming language to understand how to store and manipulate data. The table below summarizes the different data types in Pandas.
Data type | Use |
int | Integer number, eg: 10, 12 |
float | Floating point number, eg: 100.2, 3.1415 |
bool | True/False value |
object | Text, or a mix of numeric and non-numeric values, eg: Apple |
datetime64 | Date and time values |
category | A finite list of values |
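To see these types in practice, here is a minimal sketch that builds a small DataFrame and prints the type Pandas assigns to each column. The column names and values are made up for illustration. The names Pandas reports are usually more specific than the ones in the table (for example, an integer column typically shows up as int64).

```python
import pandas as pd

# Hypothetical sample data: one column per data type from the table above
df = pd.DataFrame({
    "count": [10, 12],
    "price": [100.2, 3.1415],
    "in_stock": [True, False],
    "fruit": ["Apple", "Mango"],
    "added": pd.to_datetime(["2021-01-01", "2021-06-15"]),
    "size": pd.Categorical(["small", "large"]),
})

# One line per column, showing the data type Pandas chose
print(df.dtypes)
```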
Pandas Data Structures
There are two main data structures associated with Pandas, Series and DataFrame.
Series
You can think of Pandas Series like an array, or a list, capable of holding any data type. It is 1 dimensional. In simple language, you can think of Series like a column in an Excel sheet. It helps in storing data.
DataFrame
Pandas DataFrame is a 2-dimensional structure. The data is stored in a tabular format, containing rows and columns. You can think of a DataFrame as a collection of different Pandas Series. You can also create a single column DataFrame. Although it looks like a Pandas Series, since it is defined as a DataFrame, it will act as one. Also, a key thing to note is that even though a DataFrame looks like a SQL table or an Excel sheet, it is completely different from them.
How to create Pandas Series and DataFrame?
Pandas Series
Using Numpy Array:
To create a Pandas Series from a NumPy array, first I will define a NumPy array, and then I will call this array inside my Series initialization function.
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['apple','mango','guava','grapes','banana','strawberry'])
ser = pd.Series(data)
ser.head()
Output
0 apple
1 mango
2 guava
3 grapes
4 banana
dtype: object
Using Python List:
Similar to creating a Pandas Series from a NumPy array, first I will define a list, and then I will call this list inside my Series initialization function.
list1 = [1,2,3,4,5,6,7,8,9,10]
# create series from a list
ser = pd.Series(list1)
ser.head()
Output
0 1
1 2
2 3
3 4
4 5
dtype: int64
Using the Python dictionary:
Similar to creating a Pandas Series from a NumPy array or a list, first I will define a dictionary, and then I will call this dictionary inside my Series initialization function.
# create a dictionary
dictionary1 = {1 : 100, 2 : 200, 3 : 300}
# create a series
ser = pd.Series(dictionary1)
ser.head()
Output
1 100
2 200
3 300
dtype: int64
Pandas DataFrame
Using Numpy Array:
To create a Pandas DataFrame from a NumPy array, first I will define a NumPy array, and then I will call this array inside my DataFrame initialization function.
In order to view the data better, in the second part of the code, I am taking a transpose of it.
import pandas as pd
import numpy as np
# 2-D NumPy array; the mixed values are all stored as strings
arr = np.array([['Pandas', 'Dataframe', 'example', 'using', 'lists'],
                [1,2,3,4,5],
                ['apple','mango','guava','grapes','banana']])
# Calling DataFrame constructor on numpy array
df = pd.DataFrame(arr)
df.head()
Output:
0 1 2 3 4
0 Pandas Dataframe example using lists
1 1 2 3 4 5
2 apple mango guava grapes banana
We can change the alignment of above data by taking a transpose
arr = np.array([['Pandas', 'Dataframe', 'example', 'using', 'lists'],
                [1,2,3,4,5],
                ['apple','mango','guava','grapes','banana']])
arr = arr.T
# Calling DataFrame constructor on numpy array
df = pd.DataFrame(arr)
df.head()
Output
0 1 2
0 Pandas 1 apple
1 Dataframe 2 mango
2 example 3 guava
3 using 4 grapes
4 lists 5 banana
Using Python List:
Similar to creating a Pandas DataFrame from a NumPy array, first I will define a list, and then I will call this list inside my DataFrame initialization function.
import pandas as pd
# list of strings
list1 = ['Pandas', 'Dataframe', 'example', 'using', 'lists']
# Calling DataFrame constructor on list
df = pd.DataFrame(list1)
df.head()
Output:
0
0 Pandas
1 Dataframe
2 example
3 using
4 lists
Using the Python dictionary:
Similar to creating a Pandas DataFrame from a NumPy array or a list, first I will define a dictionary, and then I will call this dictionary inside my DataFrame initialization function.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor'],
        'Rating':[100, 80, 84, 90]}
# Create DataFrame
df = pd.DataFrame(data)
df.head()
Output
Name Rating
0 Captain America 100
1 Iron Man 80
2 Hulk 84
3 Thor 90
Series basic functions
Accessing data using position or index:
Elements/ data in a Pandas Series can be accessed in a similar manner to that of a NumPy ndarray. We can use the position or the index to access the data. We use the indexing operator ‘[ ]’ to access the data. To obtain multiple data we use slicing. Slicing is done in the following manner: [start index: end index].
In the below code I am slicing to obtain the first three elements of the Series.
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['apple','mango','guava','grapes','banana','strawberry'])
ser = pd.Series(data)
# print first 3 data from the series
print(ser[:3])
Output:
0 apple
1 mango
2 guava
dtype: object
Indexing:
Indexing is selecting particular rows from the Series. Using Indexing you can select all rows or a small subset.
You can do this by using the square bracket ‘[ ]’, or by using ‘.loc[ ]’ and ‘.iloc[ ]’ operators.
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['apple','mango','guava','grapes','banana','strawberry','orange','pineapple','kiwi'])
ser = pd.Series(data)
# using indexing operation
print(ser[6:9])
Output:
6 orange
7 pineapple
8 kiwi
dtype: object
.loc[]:
This indexer selects data by the label of the rows. Unlike positional slicing, a label slice includes its end label.
In the code below I have selected the index labels from 4 to 8.
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['apple','mango','guava','grapes','banana','strawberry','orange','pineapple','kiwi'])
ser = pd.Series(data)
# using the .loc indexer
print(ser.loc[4:8])
Output:
4 banana
5 strawberry
6 orange
7 pineapple
8 kiwi
dtype: object
.iloc[]:
This function allows us to select rows based on their position.
In the code below I have selected the first 4 rows.
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['apple','mango','guava','grapes','banana','strawberry','orange','pineapple','kiwi'])
ser = pd.Series(data)
# using the .iloc indexer
print(ser.iloc[:4])
Output:
0 apple
1 mango
2 guava
3 grapes
dtype: object
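The difference between .loc[] and .iloc[] is easier to see when the index labels do not match the positions. The sketch below uses a made-up Series with the labels 10, 20, 30, 40 to show both slices side by side.

```python
import pandas as pd

# Integer labels that deliberately do not match the positions 0..3
ser = pd.Series(["apple", "mango", "guava", "grapes"], index=[10, 20, 30, 40])

# Label slice: the end label 30 is included
by_label = ser.loc[10:30]

# Position slice: the end position 2 is excluded
by_position = ser.iloc[0:2]

print(by_label)
print(by_position)
```

Here the label slice returns three rows while the position slice returns two, which is a common source of off-by-one surprises.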
Changing index
To change the index of the Pandas Series to a custom index of your choice, pass in the argument ‘index’ while initializing the Pandas Series.
Example – pd.Series(data, index=['a', 'b', 'c'])
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['apple','mango','guava','grapes','banana','strawberry','orange','pineapple','kiwi'])
# changing index
ser = pd.Series(data, index=['a','b','c','d','e','f','g','h','i'])
ser.head()
Output:
a apple
b mango
c guava
d grapes
e banana
dtype: object
Arithmetic operations
On a Pandas Series, many arithmetic operations can be done. Here I am showing you only two, addition and subtraction.
Sum:
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data1 = np.array([10,22,3,-9,5,500])
data2 = np.array([22,4,900,7,15,-20])
# create the Series
ser1 = pd.Series(data1)
ser2 = pd.Series(data2)
print('ser1\n', ser1.head(), '\n')
print('ser2\n', ser2.head(), '\n')
print('ser1 + ser2\n', ser1.add(ser2))
Output:
ser1
0 10
1 22
2 3
3 -9
4 5
dtype: int64
ser2
0 22
1 4
2 900
3 7
4 15
dtype: int64
ser1 + ser2
0 32
1 26
2 903
3 -2
4 20
5 480
dtype: int64
Subtract:
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data1 = np.array([10,22,3,-9,5,500])
data2 = np.array([22,4,900,7,15,-20])
# create the Series
ser1 = pd.Series(data1)
ser2 = pd.Series(data2)
print('ser1\n', ser1.head(), '\n')
print('ser2\n', ser2.head(), '\n')
print('ser1 - ser2\n', ser1.sub(ser2))
Output:
ser1
0 10
1 22
2 3
3 -9
4 5
dtype: int64
ser2
0 22
1 4
2 900
3 7
4 15
dtype: int64
ser1 - ser2
0 -12
1 18
2 -897
3 -16
4 -10
5 520
dtype: int64
Data type conversion:
To convert the data type of Pandas Series we use the ‘.astype()’ function. Pass in the data type in the function to convert the Series data type.
Example – ser.astype('float')
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array([10,22,3,-9,5,500])
# create the Series
ser = pd.Series(data)
print("Before conversion")
print(ser.dtype)
ser = ser.astype(float)
print("After conversion")
print(ser.dtype)
Output:
Before conversion
int64
After conversion
float64
Arithmetic operations:
The table below lists some common arithmetic and aggregation functions that can be performed on a Series.
Function | Description |
add() | Used to add series of the same length |
sub() | Used to subtract series of the same length |
mul() | Used to multiply series of the same length |
div() | Used to divide the series of the same length |
sum() | Returns sum of values for the requested axis |
prod() | Returns product of values for the requested axis |
mean() | Returns the mean of values for the requested axis |
abs() | Used to calculate the absolute value of each element in the series |
cov() | Used to find covariance of two series |
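One detail worth knowing about add(), sub(), mul() and div(): they match values by index label, not by position. The sketch below uses two made-up Series whose labels only partly overlap, and shows the fill_value parameter as one way to avoid NaN results.

```python
import pandas as pd

ser1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
ser2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Labels present on only one side ("a" and "d") produce NaN
plain = ser1.add(ser2)

# fill_value=0 treats the missing side as 0 instead
filled = ser1.add(ser2, fill_value=0)

print(plain)
print(filled)
```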
Pandas Series methods:
In the below table you can find different Series methods.
Function | Description |
head() | Returns a specified number of rows from the beginning of the Series. The default value is 5. |
tail() | Returns a specified number of rows from the end of the Series. The default value is 5. |
count() | Returns the number of non-NA or null values in the Series |
size | Attribute (used without parentheses). Returns the number of elements in the Series |
is_unique | Attribute (used without parentheses). Returns True if all values in the Series are unique |
idxmax() | Returns the index label of the highest value in the Series |
idxmin() | Returns the index label of the lowest value in the Series |
sort_values() | Sorts values in either ascending or descending order in the Series |
sort_index() | Sorts values by index |
value_counts() | Returns number of times each unique value is found in the Series |
get() | Used to extract values from the Series. This is an alternative to bracket syntax. |
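A few of the entries above are shown in the short sketch below, on a made-up Series. Note that size and is_unique are attributes, so they are written without parentheses.

```python
import pandas as pd

ser = pd.Series([3, 7, 7, 1], index=["a", "b", "c", "d"])

print(ser.size)                  # number of elements
print(ser.is_unique)             # False here, because 7 appears twice
print(ser.idxmax())              # label of the first highest value
print(ser.idxmin())              # label of the lowest value
print(ser.value_counts())        # how often each value occurs
print(ser.get("z", "missing"))   # default is returned for an unknown label
```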
DataFrame basic functions
Indexing columns and rows:
Indexing is selecting particular rows and columns from the DataFrame. Using Indexing you can select all rows and columns, or a small subset.
You can do this by using the indexing operator ‘[ ]’, or by using ‘.loc[ ]’ and ‘.iloc[ ]’ operators.
Columns
In order to select a column in the DataFrame, simply put the name of the column in square brackets
Eg: df['Name'], df[['Name', 'Place']]
In the below code I am selecting a single column.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther','Spiderman'],
        'Rating':[100, 80, 84, 93, 90, 70],
        'Place':['USA','USA','USA','Asgard','Wakanda','USA']}
# Create DataFrame
df = pd.DataFrame(data)
# retrieve the first column
first = df['Name']
print(first)
Output
0 Captain America
1 Iron Man
2 Hulk
3 Thor
4 Black Panther
5 Spiderman
Name: Name, dtype: object
In the below code I am selecting multiple columns.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther','Spiderman'],
        'Rating':[100, 80, 84, 93, 90, 70],
        'Place':['USA','USA','USA','Asgard','Wakanda','USA']}
# Create DataFrame
df = pd.DataFrame(data)
# retrieve the name and place column
cols = df[['Name','Place']]
print(cols)
Output:
Name Place
0 Captain America USA
1 Iron Man USA
2 Hulk USA
3 Thor Asgard
4 Black Panther Wakanda
5 Spiderman USA
Rows
We can select rows either using .loc[], or .iloc[] operators.
.loc[]
This function selects data by the label of the rows, and returns the value of row/rows if they exist.
In the code below I am extracting the row with index ‘a’.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther','Spiderman'],
        'Rating':[100, 80, 84, 93, 90, 70],
        'Place':['USA','USA','USA','Asgard','Wakanda','USA']}
# Create DataFrame
df = pd.DataFrame(data, index=['a','b','c','d','e','f'])
# retrieve the row labelled 'a'
rows = df.loc['a']
print(rows)
Output
Name Captain America
Rating 100
Place USA
Name: a, dtype: object
.iloc[]
This function allows us to select rows based on their position.
If the index labels are not numbers, or if you don’t know the index labels, you can use .iloc[] instead.
In the below code I am extracting the first row using its position.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther','Spiderman'],
        'Rating':[100, 80, 84, 93, 90, 70],
        'Place':['USA','USA','USA','Asgard','Wakanda','USA']}
# Create DataFrame
df = pd.DataFrame(data, index=['a','b','c','d','e','f'])
# retrieve the first row
rows = df.iloc[0]
print(rows)
Output
Name Captain America
Rating 100
Place USA
Name: a, dtype: object
Changing index
If you want custom index values for your DataFrame, you can specify it during the initialization of DataFrame. Default index values are numbers starting from 0.
To change the index of the Pandas DataFrame to a custom index of your choice, pass in the argument ‘index’ while initializing the Pandas DataFrame.
Example – pd.DataFrame(data, index=['a', 'b', 'c'])
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther','Spiderman'],
        'Rating':[100, 80, 84, 93, 90, 70],
        'Place':['USA','USA','USA','Asgard','Wakanda','USA']}
# Create DataFrame
df = pd.DataFrame(data, index=['a','b','c','d','e','f'])
df.head()
Output
Name Rating Place
a Captain America 100 USA
b Iron Man 80 USA
c Hulk 84 USA
d Thor 93 Asgard
e Black Panther 90 Wakanda
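The index can also be changed after the DataFrame has been created. As a small sketch (with a shortened, made-up version of the data above), set_index() promotes an existing column to the index and reset_index() goes back to the default numbering.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Hulk", "Thor"], "Rating": [84, 93]})

# The Name column becomes the index, so rows can be looked up by name
by_name = df.set_index("Name")
print(by_name.loc["Thor", "Rating"])

# Back to the default 0, 1, ... index, with Name as a regular column again
restored = by_name.reset_index()
print(restored)
```

Both methods return a new DataFrame and leave the original untouched unless you assign the result back.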
Missing Data
Let’s face it, the null value can be troublesome especially when you are doing some important calculations. Pandas has a few methods that can help identify and rectify the missing values.
Checking missing data
We can use .isnull() or .notnull() functions to check for missing values. These functions can also be used in Pandas Series to find null values. The output of ‘isnull()’ function is boolean, indicating if that particular element is null or not.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
data = {'First':[np.nan, 90, np.nan, 95],
        'Second': [30, 45, 56, np.nan],
        'Third':[np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(data)
# using isnull() function
df.isnull()
Output
First Second Third
0 True False True
1 False False False
2 True False False
3 False True False
Filling missing data
We can use fillna() function to replace NaN values with our specified value.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
data = {'First':[np.nan, 90, np.nan, 95],
        'Second': [30, 45, 56, np.nan],
        'Third':[np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(data)
# filling missing value using fillna()
df.fillna(0)
Output
First Second Third
0 0.0 30.0 0.0
1 90.0 45.0 40.0
2 0.0 56.0 80.0
3 95.0 0.0 98.0
Dropping missing data
We can use the dropna() function to drop rows or columns that contain missing data. Specifying axis=0 drops every row that has at least one null value, while axis=1 drops every column that has at least one null value.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
data = {'First':[np.nan, 90, np.nan, 95],
        'Second': [30, 45, 56, 900],
        'Third':[np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(data)
# drop null values from rows
print(df.dropna(axis=0))
print('\n')
# drop null values from columns
print(df.dropna(axis=1))
Output
First Second Third
1 90.0 45 40.0
3 95.0 900 98.0
Second
0 30
1 45
2 56
3 900
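Dropping every row or column with a single null can throw away a lot of data. The sketch below shows two gentler options, reusing the data from the earlier missing-data examples: the thresh parameter of dropna(), and filling each gap with the mean of its own column. Whether either is appropriate depends on your data, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First": [np.nan, 90, np.nan, 95],
                   "Second": [30, 45, 56, np.nan],
                   "Third": [np.nan, 40, 80, 98]})

# Keep only the rows that have at least two non-missing values
kept = df.dropna(thresh=2)

# Replace each missing value with the mean of its own column
filled = df.fillna(df.mean())

print(kept)
print(filled)
```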
Iteration
We can use the iterrows(), items(), and itertuples() functions to iterate over a DataFrame. Note that iteritems() was removed in Pandas 2.0; items() does the same job.
Iterrows():
This function returns each index value along with the data in each row.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA','USA','USA','Asgard','Wakanda']}
# Create DataFrame
df = pd.DataFrame(data, index=['a','b','c','d','e'])
# iterating over rows using iterrows() function
for i, j in df.iterrows():
    print(i, j)
    print("\n")
Output
a Name Captain America
Rating 100
Place USA
Name: a, dtype: object
b Name Iron Man
Rating 80
Place USA
Name: b, dtype: object
c Name Hulk
Rating 84
Place USA
Name: c, dtype: object
d Name Thor
Rating 93
Place Asgard
Name: d, dtype: object
e Name Black Panther
Rating 90
Place Wakanda
Name: e, dtype: object
Items():
This function iterates over each column as a key, value pair, with the column name as key and its data as values. In Pandas versions before 2.0 it was also available as iteritems().
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA','USA','USA','Asgard','Wakanda']}
# Create DataFrame
df = pd.DataFrame(data, index=['a','b','c','d','e'])
# iterating using items() function (iteritems() was removed in Pandas 2.0)
for key, value in df.items():
    print("Key:", key)
    print("Values\n", value)
    print("\n")
Output
Key: Name
Values
a Captain America
b Iron Man
c Hulk
d Thor
e Black Panther
Name: Name, dtype: object
Key: Rating
Values
a 100
b 80
c 84
d 93
e 90
Name: Rating, dtype: int64
Key: Place
Values
a USA
b USA
c USA
d Asgard
e Wakanda
Name: Place, dtype: object
Itertuples():
This function returns a tuple for each row in the DataFrame.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA','USA','USA','Asgard','Wakanda']}
# Create DataFrame
df = pd.DataFrame(data, index=['a','b','c','d','e'])
# iterating using itertuples() function
for row in df.itertuples():
    print(row)
Output
Pandas(Index='a', Name='Captain America', Rating=100, Place='USA')
Pandas(Index='b', Name='Iron Man', Rating=80, Place='USA')
Pandas(Index='c', Name='Hulk', Rating=84, Place='USA')
Pandas(Index='d', Name='Thor', Rating=93, Place='Asgard')
Pandas(Index='e', Name='Black Panther', Rating=90, Place='Wakanda')
Data type conversion
To convert the data type of Pandas DataFrame we use the ‘.astype()’ function. Pass in the data type in the function to convert the DataFrame data type.
Example – df['Rating'].astype('float')
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA','USA','USA','Asgard','Wakanda']}
# Create DataFrame
df = pd.DataFrame(data, index=['a','b','c','d','e'])
print("Before conversion")
print(df['Rating'].dtype)
# Changing data type of selected column
df['Rating'] = df['Rating'].astype(float)
print("After conversion")
print(df['Rating'].dtype)
Output
Before conversion
int64
After conversion
float64
Pandas DataFrame methods
In the below table you can find different DataFrame methods.
Function | Description |
add() | Used to add Dataframes of the same length or Dataframe with a number |
sub() | Used to subtract DataFrames of the same length or Dataframe with a number |
mul() | Used to multiply DataFrames of the same length or Dataframe with a number |
div() | Used to find floating-point division for Dataframes of the same length or Dataframe with a number |
T | Transpose rows and columns |
head() | Returns a specified number of rows from the beginning of the DataFrame. The default value is 5. |
tail() | Returns a specified number of rows from the end of the DataFrame. The default value is 5. |
insert() | Inserts a column in the DataFrame |
index | Attribute (used without parentheses). Returns the index (row labels) of the DataFrame |
unique() | Series method. Returns the unique values in a column, eg: df['Name'].unique() |
nunique() | Returns the count of unique values in each column |
value_counts() | Returns the number of times each unique row is found in the DataFrame |
columns | Attribute (used without parentheses). Returns the column labels of the DataFrame |
isnull() | Returns a boolean DataFrame that marks which values are null |
dtypes | Attribute (used without parentheses). Returns the data type of each column |
astype() | Converts the data type of the DataFrame or of selected columns |
sort_values() | Sorts DataFrame values in either ascending or descending order |
sort_index() | Sorts value by index |
.loc[] | Retrieves rows based on row labels |
.iloc[] | Retrieves rows based on the index position |
drop() | Used to delete rows or columns |
shape | Returns a tuple containing the dimensions of the DataFrame |
fillna() | Replaces NaN values with the value defined by the user |
copy() | Creates an independent copy |
set_index() | Sets index using one or more existing column |
reset_index() | Resets the index values starting from 0 to the length of DataFrame |
Axis
A DataFrame is a 2D object. Different Series combine together to form a DataFrame.
A DataFrame has two axes; axis ‘0’ and axis ‘1’.
Axis 0 corresponds to the rows, while axis 1 is for columns
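The axis argument is easiest to understand with a tiny example. In the sketch below (made-up numbers), axis=0 works down the rows and gives one result per column, while axis=1 works across the columns and gives one result per row.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# axis=0: add up the rows, one total per column
col_totals = df.sum(axis=0)

# axis=1: add up the columns, one total per row
row_totals = df.sum(axis=1)

print(col_totals)
print(row_totals)
```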
Statistics
Pandas can also help in calculating some complex statistical operations. It can do all that in a single line of code. I have discussed some of the commonly used statistical functions.
Mean
Returns the average value
Calculating mean with axis = 0. First, the sum of all values in a column is calculated, then that value is divided by the total no of elements/data in that column.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
# numeric_only=True skips the text column 'Name'
df.mean(axis=0, numeric_only=True)
Output
Salary 63250.0
Age 45.0
dtype: float64
Calculating mean with axis=1. First, the sum of all values in a row is calculated, then that value is divided by the total no of elements/data in that row.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
# numeric_only=True skips the text column 'Name'
df.mean(axis=1, numeric_only=True)
Output
0 516.5
1 40025.0
2 39522.5
3 46526.0
dtype: float64
Standard Deviation
Returns the sample standard deviation, which uses Bessel's correction (dividing by n - 1 instead of n)
With axis=0,
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
# numeric_only=True skips the text column 'Name'
df.std(axis=0, numeric_only=True)
Output
Salary 41987.101194
Age 8.524475
dtype: float64
With axis=1,
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
# numeric_only=True skips the text column 'Name'
df.std(axis=1, numeric_only=True)
Output
0 683.772257
1 56533.187156
2 55829.615909
3 65724.161098
dtype: float64
Summarizing the statistics of the DataFrame
We can use the .describe() function to summarize the statistics of the DataFrame.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
df.describe()
Output
Salary Age
count 4.000000 4.000000
mean 63250.000000 45.000000
std 41987.101194 8.524475
min 1000.000000 33.000000
25% 59500.000000 42.000000
50% 79500.000000 47.500000
75% 83250.000000 50.500000
max 93000.000000 52.000000
All statistical functions
Function | Description |
count() | Returns the number of non-null values |
sum() | Returns sum of all values |
mean() | Returns the average of all values |
median() | Returns the median of all values |
mode() | Returns the mode |
std() | Returns the standard deviation |
min() | Returns the minimum of all values |
max() | Returns the maximum of all values |
abs() | Returns the absolute value |
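Here is a short sketch of a few functions from the table that were not demonstrated above, using a made-up Series of ages. It also shows the ddof parameter of std(), which switches between the sample standard deviation (the default, ddof=1) and the population standard deviation (ddof=0).

```python
import pandas as pd

ages = pd.Series([33, 50, 45, 52, 50])

print(ages.median())      # middle value of the sorted data
print(ages.mode())        # returned as a Series, since there can be ties
print(ages.std())         # sample standard deviation (ddof=1)
print(ages.std(ddof=0))   # population standard deviation
```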
Input and Output
Often, you won’t be creating data but will be having it in some form, and you would want to import it to run your analysis on it. Fortunately, Pandas allows you to do this. Not only does it help in importing data, but you can also save your data in your desired format using Pandas.
The table below shows the formats supported by Pandas, along with the functions used to read and write each of them.
Input type | Reader | Writer |
CSV | read_csv | to_csv |
JSON | read_json | to_json |
HTML | read_html | to_html |
Excel | read_excel | to_excel |
SAS | read_sas | – |
Python Pickle Format | read_pickle | to_pickle |
SQL | read_sql | to_sql |
Google Big Query | read_gbq | to_gbq |
In the below example, I have shown how to read a CSV file.
import pandas as pd
import numpy as np
#Read input file
df = pd.read_csv('/content/player_data.csv')
df.head()
Output
name year_start year_end position height weight birth_date college
0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University
1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University
2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles
3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University
4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University
The example below shows how to save a DataFrame to a CSV file.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA','USA','USA','Asgard','Wakanda']}
# Create DataFrame
df = pd.DataFrame(data, index=['a','b','c','d','e'])
# Saving to CSV
df.to_csv("avengers.csv")
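By default to_csv() writes the index as the first column, so reading the file back needs index_col=0 to restore it. The sketch below demonstrates the round trip on made-up data. It writes to an in-memory io.StringIO buffer instead of a file on disk, which is my own choice to keep the example self-contained.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Hulk", "Thor"], "Rating": [84, 93]},
                  index=["a", "b"])

# Write the CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 tells read_csv to use the first column as the index again
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

If you do not want the index in the file at all, pass index=False to to_csv().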
Aggregation
An aggregation function can be applied to one or more columns. You can either apply the same aggregate functions to every column or different aggregate functions to different columns.
Commonly used aggregate functions: sum, min, max, mean.
Example: Same aggregate function on all columns.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
df.aggregate(['sum','min','max','mean'])
Output
Name Salary Age
sum jennifer LawrenceBrad PittChris Hemsworth Dwayn… 253000.0 180.0
min Brad Pitt 1000.0 33.0
max jennifer Lawrence 93000.0 52.0
mean NaN 63250.0 45.0
Example: Different aggregate functions for different columns.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
df.aggregate({'Salary':['sum','mean'],
              'Age':['min','max']})
Output
Salary Age
max NaN 52.0
mean 63250.0 NaN
min NaN 33.0
sum 253000.0 NaN
Groupby
Pandas groupby function is used to split the DataFrame into groups based on some criteria.
First, we will import the dataset, and explore it.
import pandas as pd
import numpy as np
#Read input file
df = pd.read_csv('/content/player_data.csv')
df.head()
Output:
name year_start year_end position height weight birth_date college
0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University
1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University
2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles
3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University
4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University
Let’s group the players by their college names.
# group the data on college
gd = df.groupby('college')
gd.first()
Output:
name year_start year_end position height weight birth_date
college
Acadia University Brian Heaney 1970 1970 G 6-2 180.0 September 3, 1946
Alabama – Huntsville Josh Magette 2018 2018 G 6-1 160.0 November 28, 1989
Alabama A&M University Mickell Gladness 2012 2012 C 6-11 220.0 July 26, 1986
Alabama State University Kevin Loder 1982 1984 F-G 6-6 205.0 March 15, 1959
Albany State University Mack Daughtry 1971 1971 G 6-3 175.0 August 4, 1950
… … … … … … … …
Xavier University Torraye Braggs 2004 2005 F 6-8 245.0 May 15, 1976
Xavier University of Louisiana Nat Clifton 1951 1958 C-F 6-6 220.0 October 13, 1922
Yale University Chris Dudley 1988 2003 C 6-11 235.0 February 22, 1965
Yankton College Chuck Lloyd 1971 1971 C-F 6-8 220.0 May 22, 1947
Youngstown State University Leo Mogus 1947 1951 F-C 6-4 190.0 April 13, 1921
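Let’s print the values in any one of the groups. Here the data is grouped on position and name, so a group is identified by a (position, name) tuple.
gd = df.groupby(['position','name'])
gd.get_group(('C','A.J. Bramlett'))
Output
year_start year_end height weight birth_date college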
435 2000 2000 6-10 227.0 January 10, 1977 University of Arizona
Let’s create groups based on more than one category
# group the data on position and name
gd = df.groupby(['position','name'])
gd.first()
Output
year_start year_end height weight birth_date college
position name
C A.J. Bramlett 2000 2000 6-10 227.0 January 10, 1977 University of Arizona
A.J. Hammons 2017 2017 7-0 260.0 August 27, 1992 Purdue University
Aaron Gray 2008 2014 7-0 270.0 December 7, 1984 University of Pittsburgh
Adonal Foyle 1998 2009 6-10 250.0 March 9, 1975 Colgate University
Al Beard 1968 1968 6-9 200.0 April 27, 1942 Norfolk State University
… … … … … … … …
G-F Win Wilfong 1958 1961 6-2 185.0 March 18, 1933 University of Memphis
Winford Boynes 1979 1981 6-6 185.0 May 17, 1957 University of San Francisco
Wyndol Gray 1947 1948 6-1 175.0 March 20, 1922 Harvard University
Yakhouba Diawara 2007 2010 6-7 225.0 August 29, 1982 Pepperdine University
Zoran Dragic 2015 2015 6-5 200.0 June 22, 1989 NaN
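Grouping is most useful when it is followed by a calculation on each group. Since the player dataset is not included here, the sketch below uses a tiny made-up DataFrame with hypothetical player names to show the mean weight per position, the size of each group, and one group pulled out with get_group().

```python
import pandas as pd

# Hypothetical mini version of the player data
df = pd.DataFrame({
    "name": ["P1", "P2", "P3", "P4"],
    "position": ["C", "G", "C", "G"],
    "weight": [240.0, 180.0, 260.0, 200.0],
})

gd = df.groupby("position")

mean_weight = gd["weight"].mean()   # one value per position
sizes = gd.size()                   # number of rows in each group
centers = gd.get_group("C")         # all rows whose position is "C"

print(mean_weight)
print(sizes)
print(centers)
```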
Merging, Joining and Concatenation
Before I start with the Pandas join and merge functions, let me introduce you to four different types of joins: inner join, left join, right join, and outer join.
- Full outer join: Combines rows from both DataFrames. The result keeps every row from DataFrame A and DataFrame B, and fills in NaN where a row has no match in the other DataFrame.
- Inner join: Only those rows which are present in both DataFrame A and DataFrame B will be present in the output.
- Right join: Right join uses all records from DataFrame B and matching records from DataFrame A.
- Left join: Left join uses all records from DataFrame A and matching records from DataFrame B.
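The four join types can be compared side by side with the how parameter of pd.merge(). The sketch below uses two small made-up DataFrames that share only the keys K1 and K2, and records which keys survive each kind of join.

```python
import pandas as pd

# DataFrame A has keys K0..K2, DataFrame B has keys K1..K3
df_a = pd.DataFrame({"key": ["K0", "K1", "K2"],
                     "Name": ["Mercy", "Prince", "John"]})
df_b = pd.DataFrame({"key": ["K1", "K2", "K3"],
                     "Address": ["UK", "India", "USA"]})

results = {}
for how in ["inner", "left", "right", "outer"]:
    res = pd.merge(df_a, df_b, on="key", how=how)
    results[how] = list(res["key"])
    print(how, results[how])
```

The inner join keeps only K1 and K2, the left join adds K0, the right join adds K3, and the outer join keeps all four keys.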
Merging
Merging a Dataframe with one unique key.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head())
print("\n")
print(df2.head())
res = pd.merge(df1, df2, on='key')
res
Output
key Name Age
0 K0 Mercy 27
1 K1 Prince 24
2 K2 John 22
3 K3 Cena 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Age Address Qualification
0 K0 Mercy 27 Canada Btech
1 K1 Prince 24 UK B.A
2 K2 John 22 India MS
3 K3 Cena 32 USA Phd
Merging DataFrames using multiple keys. A row appears in the result only when both 'key' and 'Address' match.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1.head())
print("\n")
print(df2.head())
# Merge on both 'key' and 'Address'
res = pd.merge(df1, df2, on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K2 John India 22 MS
Left merge
In pd.merge() I pass the argument how='left' to perform a left merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
# Keep every row of df1; unmatched rows get NaN in df2's columns
res = pd.merge(df1, df2, how='left', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K1 Prince Australia 24 NaN
2 K2 John India 22 MS
3 K3 Cena Japan 32 NaN
Right merge
In pd.merge() I pass the argument how='right' to perform a right merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
# Keep every row of df2; unmatched rows get NaN in df1's columns
res = pd.merge(df1, df2, how='right', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27.0 Btech
1 K1 NaN UK NaN B.A
2 K2 John India 22.0 MS
3 K3 NaN USA NaN Phd
Outer Merge
In pd.merge(), I pass the argument how='outer' to perform an outer merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
# Keep every row from both DataFrames
res = pd.merge(df1, df2, how='outer', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27.0 Btech
1 K1 Prince Australia 24.0 NaN
2 K2 John India 22.0 MS
3 K3 Cena Japan 32.0 NaN
4 K1 NaN UK NaN B.A
5 K3 NaN USA NaN Phd
Inner Merge
In pd.merge(), I pass the argument how='inner' to perform an inner merge. This is also the default when 'how' is omitted.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
# Keep only rows whose 'key' and 'Address' appear in both DataFrames
res = pd.merge(df1, df2, how='inner', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K2 John India 22 MS
Join
Join combines the columns of two DataFrames by aligning their rows on the index, rather than on a key column.
Example
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames (both get the default index 0 to 3)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
# Join df2's columns onto df1, matching rows by index
res = df1.join(df2)
res
Output
Name Age
0 Mercy 27
1 Prince 24
2 John 22
3 Cena 32
Address Qualification
0 Canada Btech
1 UK B.A
2 India MS
3 USA Phd
Name Age Address Qualification
0 Mercy 27 Canada Btech
1 Prince 24 UK B.A
2 John 22 India MS
3 Cena 32 USA Phd
Performing a join with the 'how' parameter. Accepted values include 'inner', 'outer', 'left', and 'right'; the default is 'left'.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
# Keep only index values present in both DataFrames
res = df1.join(df2, how='inner')
res
Output
Name Age
0 Mercy 27
1 Prince 24
2 John 22
3 Cena 32
Address Qualification
0 Canada Btech
1 UK B.A
2 India MS
3 USA Phd
Name Age Address Qualification
0 Mercy 27 Canada Btech
1 Prince 24 UK B.A
2 John 22 India MS
3 Cena 32 USA Phd
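Because both DataFrames above share the same index, every value of 'how' gives the same result. The difference only shows when the indexes partly overlap. The sketch below uses two made-up frames ('people' and 'places', names of my own) to illustrate this.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
people = pd.DataFrame({"Name": ["Mercy", "Prince", "John"]},
                      index=["K0", "K1", "K2"])
places = pd.DataFrame({"Address": ["Canada", "India", "USA"]},
                      index=["K0", "K2", "K3"])

# Default how='left': keep every row of 'people'; K1 has no address.
left_join = people.join(places)
print(left_join, "\n")

# how='inner': keep only index values found in both frames.
inner_join = people.join(places, how="inner")
print(inner_join, "\n")

# how='outer': keep index values found in either frame.
outer_join = people.join(places, how="outer")
print(outer_join)
```

The left join leaves NaN in 'Address' for K1, the inner join drops K1 and K3, and the outer join keeps all four index values.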
Concatenation
Concatenating using the 'pd.concat()' function
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames with a shared index
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
# Stack the frames vertically (axis=0 is the default)
res = pd.concat(frames)
res
Output
Name Age Address Qualification
K0 Mercy 27.0 NaN NaN
K1 Prince 24.0 NaN NaN
K2 John 22.0 NaN NaN
K3 Cena 32.0 NaN NaN
K0 NaN NaN Canada Btech
K1 NaN NaN UK B.A
K2 NaN NaN India MS
K3 NaN NaN USA Phd
The resultant DataFrame has a repeated index. If you want the new DataFrame to have its own index, set 'ignore_index' to True.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames with a shared index
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
# Discard the original labels and number the rows 0 to 7
res = pd.concat(frames, ignore_index=True)
res
Output
Name Age Address Qualification
0 Mercy 27.0 NaN NaN
1 Prince 24.0 NaN NaN
2 John 22.0 NaN NaN
3 Cena 32.0 NaN NaN
4 NaN NaN Canada Btech
5 NaN NaN UK B.A
6 NaN NaN India MS
7 NaN NaN USA Phd
The second DataFrame is concatenated below the first one, so the resultant DataFrame has new rows. If you want the second DataFrame to be added as columns, pass the argument axis=1.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames with a shared index
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
# With axis=1, ignore_index=True replaces the column names with 0, 1, 2, 3;
# omit it if you want to keep the original column names
res = pd.concat(frames, axis=1, ignore_index=True)
res
Output
0 1 2 3
K0 Mercy 27 Canada Btech
K1 Prince 24 UK B.A
K2 John 22 India MS
K3 Cena 32 USA Phd
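The numbered column headers in the output above come from 'ignore_index=True', which applies to the axis being concatenated. As a small sketch, leaving that argument out keeps the original column names:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Mercy", "Prince", "John", "Cena"],
                    "Age": [27, 24, 22, 32]},
                   index=["K0", "K1", "K2", "K3"])
df2 = pd.DataFrame({"Address": ["Canada", "UK", "India", "USA"],
                    "Qualification": ["Btech", "B.A", "MS", "Phd"]},
                   index=["K0", "K1", "K2", "K3"])

# Concatenate side by side; rows are aligned on the shared index.
res = pd.concat([df1, df2], axis=1)
print(res)
```

Because the rows are aligned on the index, 'Age' stays an integer column and no NaN values are introduced.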
Concatenating using the '.append()' function
The append function concatenates along axis=0 only. It can take multiple objects as input. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the code below only runs on older versions; on current versions, use pd.concat([df1, df2]) instead.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}
# Define a dictionary containing address and qualification data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionaries into DataFrames with a shared index
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
# Works on pandas versions before 2.0 only
df1.append(df2)
Output
Name Age Address Qualification
K0 Mercy 27.0 NaN NaN
K1 Prince 24.0 NaN NaN
K2 John 22.0 NaN NaN
K3 Cena 32.0 NaN NaN
K0 NaN NaN Canada Btech
K1 NaN NaN UK B.A
K2 NaN NaN India MS
K3 NaN NaN USA Phd
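On pandas 2.0 and later, where DataFrame.append() is no longer available, the same row-wise result can be produced with pd.concat(). A minimal sketch with a shortened version of the data above:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Mercy", "Prince"], "Age": [27, 24]},
                   index=["K0", "K1"])
df2 = pd.DataFrame({"Address": ["Canada", "UK"],
                    "Qualification": ["Btech", "B.A"]},
                   index=["K0", "K1"])

# Equivalent of the older df1.append(df2): stack df2 below df1.
res = pd.concat([df1, df2])
print(res)
```

As with append, columns missing from one frame are filled with NaN and the index labels repeat.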
Date Time
You will often encounter time data. Pandas is a very useful tool when working with time series data.
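Generating a range of datetimes
In the code below I generate a sequence of evenly spaced datetimes. The values are not random: they start at the given date and advance by a fixed frequency.
import pandas as pd
# Create a DatetimeIndex of 5 hourly timestamps.
# 'H' means hourly; pandas 2.2 and later prefer the lowercase alias 'h'.
date = pd.date_range('10/28/2011', periods=5, freq='H')
date
Output
DatetimeIndex(['2011-10-28 00:00:00', '2011-10-28 01:00:00',
               '2011-10-28 02:00:00', '2011-10-28 03:00:00',
               '2011-10-28 04:00:00'],
              dtype='datetime64[ns]', freq='H')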
In the code below I generate datetimes over a range defined by a start value, an end value, and 'periods', which specifies how many evenly spaced samples I want.
import pandas as pd
# 10 evenly spaced timestamps between the two dates, endpoints included
date = pd.date_range(start='9/28/2018', end='10/28/2018', periods=10)
date
Output
DatetimeIndex(['2018-09-28 00:00:00', '2018-10-01 08:00:00',
               '2018-10-04 16:00:00', '2018-10-08 00:00:00',
               '2018-10-11 08:00:00', '2018-10-14 16:00:00',
               '2018-10-18 00:00:00', '2018-10-21 08:00:00',
               '2018-10-24 16:00:00', '2018-10-28 00:00:00'],
              dtype='datetime64[ns]', freq=None)
To convert the resulting DatetimeIndex into a Pandas Series or a DataFrame, pass it to the pd.Series() or pd.DataFrame() constructor.
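A short sketch of that conversion, using the same range as above. The variable names and the 'date' column name are my own choices.

```python
import pandas as pd

dates = pd.date_range(start="2018-09-28", end="2018-10-28", periods=10)

# Datetimes as the values of a Series (the index is 0 to 9).
as_series = pd.Series(dates)

# Datetimes as a named column of a DataFrame.
as_frame = pd.DataFrame({"date": dates})

# Datetimes as the index of a Series, the usual layout for time series data.
as_index = pd.Series(range(10), index=dates)

print(as_series.head(3))
print(as_frame.shape)
print(as_index.head(3))
```

Using the datetimes as the index lets you select rows by date later on.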
Converting to timestamps
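You can use the 'to_datetime' function to convert a Pandas Series or list-like object to datetimes. When passed a Series, it returns a Series. If you pass a single string, it returns a Timestamp.
import pandas as pd
# The two strings use different formats. On pandas 2.0 and later, mixed
# formats like these may need format='mixed' to parse.
date = pd.to_datetime(pd.Series(['Jul 04, 2020', '2020-10-28']))
date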
Output
0 2020-07-04
1 2020-10-28
dtype: datetime64[ns]
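In the code below I have specified the format of my input datetime. This speeds up the processing and removes any ambiguity between day-first and month-first dates.
import pandas as pd
# '%d/%m/%Y' reads the string as day/month/year, so this is 4 July 1994
date = pd.to_datetime('4/7/1994', format='%d/%m/%Y')
date
Output
Timestamp('1994-07-04 00:00:00')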
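Real data often contains values that do not fit the expected format. As a hedged sketch with made-up input strings, passing errors='coerce' to 'to_datetime' turns unparseable entries into NaT (the datetime equivalent of NaN) instead of raising an error:

```python
import pandas as pd

# Made-up input: two valid day/month/year strings and one invalid entry.
raw = pd.Series(["4/7/1994", "31/12/1999", "not a date"])

# Parse with an explicit format; entries that do not match become NaT.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
print(parsed)

# Count how many entries failed to parse.
n_failed = int(parsed.isna().sum())
print(n_failed)
```

Checking the NaT count afterwards is a quick way to see how many rows need cleaning.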
Dividing datetime into its features
A datetime column can be divided into its components using the '.dt' accessor:
- pandas.Series.dt.year returns the year.
- pandas.Series.dt.month returns the month.
- pandas.Series.dt.day returns the day.
- pandas.Series.dt.hour returns the hour.
- pandas.Series.dt.minute returns the minute.
import pandas as pd
# Create a DataFrame with a column of hourly datetimes
date = pd.DataFrame()
date['date'] = pd.date_range('10/28/2020', periods=10, freq='H')
# Create features for year, month, day, hour, and minute
date['year'] = date['date'].dt.year
date['month'] = date['date'].dt.month
date['day'] = date['date'].dt.day
date['hour'] = date['date'].dt.hour
date['minute'] = date['date'].dt.minute
# Print the dates divided into features
date.head()
Output
date year month day hour minute
0 2020-10-28 00:00:00 2020 10 28 0 0
1 2020-10-28 01:00:00 2020 10 28 1 0
2 2020-10-28 02:00:00 2020 10 28 2 0
3 2020-10-28 03:00:00 2020 10 28 3 0
4 2020-10-28 04:00:00 2020 10 28 4 0
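The '.dt' accessor offers more than the numeric components. As a small sketch on daily data starting from the same date, 'dt.day_name()' returns the weekday name and 'dt.dayofweek' returns its number (Monday is 0), which is handy for flagging weekends. The column names 'weekday' and 'is_weekend' are my own.

```python
import pandas as pd

# Four consecutive days starting on 28 October 2020.
date = pd.DataFrame({"date": pd.date_range("2020-10-28", periods=4, freq="D")})

# Weekday name for each date.
date["weekday"] = date["date"].dt.day_name()

# dayofweek counts from Monday=0, so 5 and 6 are Saturday and Sunday.
date["is_weekend"] = date["date"].dt.dayofweek >= 5

print(date)
```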
Visualization
Pandas can also be used to visualize data. Its plotting methods are built on Matplotlib, so Matplotlib needs to be installed.
Line plot
In the code below I am generating a line plot. I am using random normal values generated by NumPy as input.
import pandas as pd
import numpy as np
# 10 days of random values in four columns; each column becomes one line
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('10/28/2020', periods=10),
                  columns=list('ABCD'))
df.plot()
Bar/Horizontal Bar plot
A bar plot can be made by using '.plot.bar()'. Pass the argument 'stacked=True' if you want stacked bars.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['a', 'b', 'c', 'd'])
# Grouped bars, one group per row
df.plot.bar()
# Using stacked bars
df.plot.bar(stacked=True)
To generate a horizontal bar graph, use '.plot.barh()'. You can also pass the argument 'stacked=True' if you want the bars to be stacked.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])
# Using stacked bars
df.plot.barh(stacked=True)
Histograms
To generate a histogram, use 'DataFrame.plot.hist()'. Pass the argument 'bins' to specify how many bins you want.
Example: df.plot.hist()
import pandas as pd
import numpy as np
# Four normally distributed columns, each shifted to a different mean
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
# All columns are drawn on the same axes
df.plot.hist(bins=20)
To plot a separate histogram for each column, call '.hist()' directly on the DataFrame. Pass the argument 'bins' to specify how many bins you want.
Example: df.hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
# One subplot per column
df.hist(bins=20)
To plot a histogram for a single column, select the column with square brackets and call '.hist()' on it.
Example: df['A'].hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
# Histogram of column 'A' only
df['A'].hist(bins=20)
Scatter plot
A scatter plot can be created using the DataFrame.plot.scatter() method. Pass the column names to use for the x and y axes.
Example: df.plot.scatter()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
# Column 'A' on the x-axis against column 'B' on the y-axis
df.plot.scatter(x='A', y='B')
Pie chart
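To generate a pie chart, use '.plot.pie()'.
Example: df.plot.pie()
import pandas as pd
import numpy as np
# Five random non-negative values; the index labels become the slice labels
df = pd.DataFrame(np.random.rand(5), index=['A', 'B', 'C', 'D', 'E'])
# subplots=True draws one pie per column (here there is a single column)
df.plot.pie(subplots=True)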