Pandas: data analysis with Python [part 2].

Pandas is a Python library that allows us to analyze data from a variety of sources. Among the most useful features we surely find several functions to clean our data and extract some statistics about the distribution of values of various attributes. In addition, we can create aggregations with different logics and graph the data to extract more information. Let's find out how to do all this with just a few lines of code!

Reading time: 11 minutes

In the previous article Pandas: data analysis with Python [part 1]., we introduced Pandas and its data structures. We then focused on how to read data and manipulate it from different data sources. Of course, the Python library is designed to analyze data. In this paper we will look at some aspects of data analysis that the library provides us with. In particular, we will discuss how to clean the data, extract statistics, create pivot tables, and graph the results.

The examples we will look at will use the data we have already downloaded and loaded in article Pandas: data analysis with Python [part 1].. You can go read it if you need to reset your work environment and/or load the necessary data from Kaggle. Once you have uploaded the data you can get started!

Data cleaning

In data science, we used to extract information from huge volumes of raw data. This raw data may contain “dirty” and inconsistent values, leading to inaccurate analysis and unhelpful insights. In most cases, the initial stages of data acquisition and cleaning can make up 80 percent of the work; therefore, if you intend to enter this field, you must learn how to handle “dirty” data.

Data cleaning focuses primarily on removing erroneous, corrupt, improperly formatted, duplicate, or incomplete data within a dataset.

In this section, we will address the following:

removal of unnecessary columns from the dataset
removal of duplicates
finding and replacing missing values in the DataFrame
modifying the index of the DataFrame
renaming columns in a DataFrame.

We use the Melbourne housing market dataset imported from Kaggle. First, we shuffle the DataFrame to get rows with different indexes. To reproduce the same shuffled data each time, we use a random_state.

				
					shuffled_melbourne_data = melbourne_data.sample(frac=1, random_state=1) # produce the same shuffled data each time
shuffled_melbourne_data

Suppose that in the DataFrame you want to get information about:

What are the best suburbs in which to buy?
The suburbs with good value for money?
What is the expensive side of town?
Where to buy a two-bedroom unit?

We may need all but the following columns:

Method
SellerG
Date
Bathroom
Car
Landsize
BuildingArea
YearBuilt

Pandas has a drop() method to remove these columns.

				
					# Define a list names of all the columns we want to drop
columns_to_drop = ['Method',
                   'SellerG',
                   'Date',
                   'Bathroom',
                   'Car',
                   'Landsize',
                   'BuildingArea',
                   'YearBuilt']
shuffled_melbourne_data.drop(columns = columns_to_drop, inplace=True, axis=1)
shuffled_melbourne_data

Attention!

The option (inplace = True) ensures that the method does not return a new DataFrame, but looks for columns to remove from the given DataFrame. True means that the change is permanent.

Colonne eliminate!

Finding and removing duplicates

The duplicated() method returns boolean values in column format. False values mean that no data has been duplicated.

				
					shuffled_melbourne_data.duplicated()
'''
Suburb
Bentleigh East    False
                  ...  
Balwyn            False
Length: 34857, dtype: bool
'''

This duplicate check can be performed for small DataFrames. For larger DataFrames, it is impossible to check every single value. Therefore, we will concatenate the any() method to duplicated().

				
					shuffle_melbourne_data.duplicated().any()
# True

If the any() method returns True, it means that there are duplicates in the DataFrame. To remove them, we will use the drop_duplicates() method.

				
					shuffled_melbourne_data.drop_duplicates(inplace=True) # remove dupicates
shuffled_melbourne_data.duplicated().any() # Checking returns False
# False

Find and replace missing values in the DataFrame

Four methods are available to check for missing values, i.e., taking the value null. They are:

isnull(): returns a data set with boolean values.

				
					shuffled_melbourne_data.isna()

isna(): similar to isnull()
isna().any(): returns a column format of boolean values

				
					shuffled_melbourne_data.isna().any()
'''
Address          False
Rooms            False
Type             False
Price             True
Distance          True
Postcode          True
Bedroom2          True
CouncilArea       True
Lattitude         True
Longtitude        True
Regionname        True
Propertycount     True
dtype: bool
'''

isna().sum(): returns a column total of all available nulls.

				
					shuffled_melbourne_data.isna().sum()
'''
Address             0
Rooms               0
Type                0
Price            7580
Distance            1
Postcode            1
Bedroom2         8202
CouncilArea         3
Lattitude        7961
Longtitude       7961
Regionname          3
Propertycount       3
dtype: int64
'''

Missing values can be handled in two ways:

By removing the rows that contain the missing values.

In this case we use the dropna() method to remove the rows. For the following examples we will not yet remove rows in our dataset.

				
					shuffled_melbourne_data.dropna(inplace=True) # removes all rows with null values

Replacing empty cells with new values

We can decide to replace all null values with a value using the fillna() method. The syntax is DataFrame.fillna(value, method, axis, inplace, limit, downcast) where the value can be a dictionary that takes the column names as the key.

				
					shuffled_melbourne_data.fillna({'Price':1435000, 'Distance':13, 'Postcode':3067, 'Bedroom2': 2, 'CouncilArea': 'Yarra City Council'}, inplace=True)
print(shuffled_melbourne_data.isna().sum())
'''
Address             0
Rooms               0
Type                0
Price               0
Distance            0
Postcode            0
Bedroom2            0
CouncilArea         0
Lattitude        7961
Longtitude       7961
Regionname          3
Propertycount       3
dtype: int64
'''

We have successfully replaced the specified empty values of the dictionary columns within the fillna() method.

Changing the index of the DataFrame

When working with data, you usually want a field with a unique value as an index to the data. For example, in our cars.json dataset, you can assume that when a person searches for a record of a car, they will probably search for it by its model (values in the Model column).

La verifica se la colonna del modello ha valori univoci restituisce True, quindi sostituiamo l’indice esistente con questa colonna usando set_index:

				
					cars = pd.read_json('/content/cars.json', orient=’index’)

print(cars['model'].is_unique)
# True
cars.set_index('model', inplace=True)

Now each record can be accessed directly with loc[].

				
					cars.loc['M3']
'''
price                                  60500
wiki     http://en.wikipedia.org/wiki/Bmw_m3
img                     250px-BMW_M3_E92.jpg
Name: M3, dtype: object
'''

Renaming columns in a DataFrame

Sometimes you may want to rename data columns for better interpretation, perhaps because some names are not easy to understand. To do this, you can use the rename() method of the DataFrame and pass a dictionary where the key is the current column name and the value is the new name.

Let’s use the original Melbourne DataFrame, in which no column has been removed. We may want to rename:

Room to No_ofRooms.
Type to HousingType.
Method to SaleMethod
SellerG to RealEstateAgent.
Bedroom2 to No_ofBedrooms.

				
					# Use initial Melbourne DataFrame
#Create a dictionary of columns to rename. Value is the new name
columns_to_rname = {
    'Rooms': 'No_ofRooms',
    'Type': 'HousingType',
    'Method': 'SaleMethod',
    'SellerG': 'RealEstateAgent',
    'Bedroom2': 'No_ofBedrooms'
}
shuffled_melbourne_data.rename(columns=columns_to_rename, inplace=True)
shuffled_melbourne_data.head()

Finding correlations

Pandas has a corr() method that allows us to find relationships between columns in our data set.

The method returns a table representing the relationship between two columns. The values range from -1 to 1, where -1 indicates a negative correlation and 1 a perfect correlation. The method automatically ignores null values and nonnumeric values in the data set.

				
					shuffled_melbourne_data.corr()

Descriptive statistics

There are several methods in Pandas DataFrames and Series to perform summary statistics. These methods include: mean, median, max, std, sum and min.

For example, we can find the mean of the price in our DataFrame.

				
					shuffled_melbourne_data['Price'].mean()
# 1134144.8576355954
# Or for all columns
shuffled_melbourne_data.mean()
No_ofRooms       3.031071e+00
Price            1.134145e+06
				 ...
Propertycount    7.570273e+03
dtype: float64

If we want to get an overview of basic statistics, which include those mentioned above, for all the attributes of our data we can use the describe() method.

				
					shuffled_melbourne_data.describe()

Groupby operation to split, aggregate and transform data

The Pandas groupby() function allows you to reorganize data. It simply divides the data into groups classified according to some criteria and returns a set of DataFrame objects.

Divide data into groups

Let us divide the Melbourne data into groups based on the number of bedrooms. To display the groups, we use the groups attribute.

				
					grouped_melb_data = shuffled_melbourne_data.groupby('No_ofBedrooms')
shuffled_grouped_melb_data.groups
'''
{1.0: ['Hawthorn', 'St Kilda'], 2.0: ['Bentleigh East', 'Yarraville', 'Richmond', 'Mill Park', 'Southbank', 'Surrey Hills', 'Caulfield North', 'Reservoir', 'Mulgrave', 'Altona', 'Ormond', 'Bonbeach', 'St Kilda', 'South Yarra', 'Malvern', 'Essendon'], 
... 
30.0: ['Burwood']}}
'''

Since the result is a dict and the data is huge, the result is difficult to read. To overcome this problem, we can use the keys() method to get only the keys representing the groups.

				
					grouped_melb_data.groups.keys()
# dict_keys([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 16.0, 20.0, 30.0])

We can also group multiple columns as follows.

				
					grouped_melb_data = shuffled_melbourne_data.head(20).groupby(['No_ofBedrooms', 'Price'])
grouped_melb_data.groups.keys()
'''
dict_keys([(1.0, 365000.0), (1.0, 385000.0), (2.0, 380000.0), (2.0, 438000.0), ... (4.0, 1435000.0)])

If we want to select a group of our interest, we will use the get_group() method.

				
					grouped_melb_data = melbourne_data.groupby('No_ofBedrooms')
grouped_melb_data.get_group(2) # select 2 bedrooms group

Aggregated data

An aggregate function returns a single collective value for each group. For example, we wish to calculate the average price for each group. We will use the agg function.

				
					print(grouped_melb_data['Price'].agg([np.mean]))
'''
                       mean
No_ofBedrooms              
0.0            1.054029e+06
1.0            6.367457e+05
			   ...
30.0           1.435000e+06
'''

Transformation

Pandas has a transformation operation that is used with the groupby() function. This allows the data to be summarized effectively. When we transform a group, we get an indexed object of the same size as the one being grouped. For example, in our data set, we can get the average prices for each No_ofBedrooms group and combine the results in our original data set for other calculations.

The first approach would be to try to group the data in a new DataFrame and combine them in a multi-step process, and then combine the results in the original DataFrame. We would create a new DataFrame with average prices and merge it back with the original.

				
					price_means = shuffled_melbourne_data.groupby('No_ofBedrooms')['Price'].mean().rename('MeanPrice').reset_index()
df_1 = shuffled_melbourne_data.merge(price_means)
df_1

Complex? We can then use the transform operator to perform the same operation in one step.

				
					shuffled_melbourne_data['MeanPrice'] = shuffled_melbourne_data.groupby('No_ofBedrooms')['Price'].transform('mean')
shuffled_melbourne_data

Pivot table

A Pandas pivot table is similar to the groupby operation. It is a common operation for programs that use tabular data such as spreadsheets. The difference between the groupby operation and the pivot table is that a pivot table takes simple columnar data as input and groups the entries into a two-dimensional table that provides a multidimensional summary of the data. Its syntax is:

				
					pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc=’mean’, fill_value=None, margins=False, dropna=True, margins_name=’All’, observed=False)

Let us now create an example pivot table.

				
					pivot_tbl = shuffled_melbourne_data.pivot_table(index='HousingType')
pivot_tbl

What did we get? The result is a new DataFrame, called pivot_tbl, which is grouped by HousingType. Since no other parameters were specified in the function, the data was aggregated by the average for each column with numeric data, since by default the aggfunc parameter returns the average.

Aggregation of data by specific columns

It is also possible to aggregate data by specific columns. In the following example, we aggregate for Price and No_ofRooms columns.

				
					pivot_tbl = shuffled_melbourne_data.pivot_table(index='HousingType', values=['Price', 'No_ofRooms'])
pivot_tbl

Using aggregation methods in the pivot table

You can change the way the data are aggregated in the pivot table. Aggregation methods can thus be used to perform complex analyses. To use them, you must specify the aggregation function in the aggfunc parameter.

For example, if we want to find the sum of all property counts for each region ( Regionname attribute), we would use the following code.

				
					pivot_tbl_sum = shuffled_melbourne_data.pivot_table(index='Regionname', aggfunc='sum', values='Propertycount')
pivot_tbl_sum

Multiple aggregation methods

To apply multiple aggregation methods, a list of aggregation methods must be specified in the aggfunc parameter. For example , to find the sum and maximum number of properties for each region the code would be as follows.

				
					pivot_tbl_sum_max = shuffled_melbourne_data.pivot_table(index='Regionname', aggfunc=['sum', 'max'], values='Propertycount')
pivot_tbl_sum_max

Specify the aggregation method for each column

In the previous examples, we applied the same aggregation methods to one or more columns. To specify the aggregation function of interest for each column, we can use a dictionary containing the column as the key and the aggregation function as the value. Here is an example.

				
					pivot_tbl = shuffled_melbourne_data.pivot_table(index='Regionname', values=['Price', 'Propertycount'], aggfunc={'Price':'sum', 'Propertycount':'mean'})
pivot_tbl

Split data in pivot table by column with columns

A column can be added to the pivot table with the columns parameter. This divides the data horizontally, while the index parameter specifies the column to divide the data vertically. In the following example, the data are divided according to the type of dwelling.

				
					pivot_tbl = shuffled_melbourne_data.pivot_table(
				index='Regionname',
                values=['Price', 'Propertycount'],
                aggfunc={'Price':'sum', 'Propertycount':'mean'}, 
                columns='HousingType'
                )
pivot_tbl

Adding totals to each group

Sometimes it is necessary to add totals after aggregating. This is done by using the margins parameter. By default, it is set to False and its label is set to ‘All’. To rename it, we need to use the margin_name parameter. Let’s see how we can get the totals of all property counts for each region.

				
					pivot_tbl = shuffled_melbourne_data.pivot_table(
					index='Regionname',
                    values='Propertycount',
                    margins=True,
                    margins_name='Totals',
                    aggfunc='sum',
                    columns='HousingType')
pivot_tbl

As you may have noticed, the pivot table code is simpler and more readable than the groupby operation.

Sort DataFrames

Pandas provides a sort_values() method for sorting a DataFrame. The types of sorting that can be performed are as follows:

ascending order
descending order
sorting by multiple columns

Increasing order

We sort the Melbourne data by price. By default, data are sorted in ascending order.

				
					shuffled_melbourne_data.sort_values(by=['Price'], inplace=True)
shuffled_melbourne_data.head(3)

Descending order

To sort in descending order, simply add the condition ascending=False to the sort_values() method. Let’s sort the Melbourne data by price.

				
					shuffled_melbourne_data.sort_values(by=['Price'], inplace=True, ascending=False)
shuffled_melbourne_data.head(3)

Sorting by multiple columns

To sort over multiple columns you must pass a list in the by parameter. The order of the list elements is important. In fact, if the value of the first element in the list is equal, we sort by the second attribute and so on. In the next example we sort by No_ofRooms and Price.

				
					shuffled_melbourne_data.sort_values(by=['No_ofRooms','Price'], inplace=True)
shuffled_melbourne_data.head(3)

Plot data

To create graphs, Pandas uses the plot() method. Since this method integrates with the Matplotlib library, we can use the Pyplot module for these visualizations.

First, we import the matplotlib.pyplot module and give it plt as an alias, then define some parameters of the figures that will be generated.

				
					import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 25, 'figure.figsize': (50, 8)}) # set font and plot size to be larger

There are different types of graphs for displaying information of interest. Below we see some examples that are most commonly used.

Scatter plot

The main purpose of a scatter plot is to check whether two variables are related. In this example we try to compare No_ofRooms and Price.

				
					shuffled_melbourne_data.plot(kind='scatter', x='No_ofRooms', y='Price', title='Price vs Number of Rooms')
plt.show()

A scatter plot is not a definitive proof of a correlation. To get an overview of the correlations between different columns, one can use the corr() method seen earlier.

Histograms

Histograms help visualize the distribution of values in a data set. They provide a visual interpretation of numerical data by showing the number of objects that fall within a specified range of values (called “bin”).

				
					shuffled_melbourne_data['Price'].plot(kind='hist', title='Pricing')
plt.show()

Pie charts

Pie charts show the percentages of data that fall into a particular group compared to the entire dataset.

				
					price_count = shuffled_melbourne_data.groupby('Address')['Price'].sum()
print(price_count) #price totals each group
'''
Address
1 Abercrombie St    1435000.0
1 Aberfeldie Wy      680000.0
                      ...    
9b Stewart St       1160000.0
Name: Price, Length: 34009, dtype: float64
'''
small_groups = price_count[price_count < 1500000]
large_groups = price_count[price_count > 1500000]

small_sums = pd.Series([small_groups.sum()], index=["Other"])
large_groups = large_groups.append(small_sums)

large_groups.iloc[0:10].plot(kind="pie", label="")
plt.show()

Conversion to CSV, JSON, Excel or SQL

After working with the data, you may decide to convert the data from one format to another. Pandas has methods for making these conversions as easily as we read the data.

Convert JSON to CSV

To convert to CSV, we use the df.to_csv() method. So, to convert our cars.json file, we should proceed like this.

				
					cars = pd.read_json('cars.json', orient='index')
cars_csv = cars.to_csv(r'New path to where new CSV file will be stored\New File Name.csv', index=False) # disable index as we do not need it in csv format.

Convert CSV to JSON

We now convert to JSON using the pd.to_json() method.

				
					melbourne_data = pd.read_csv('Melbourne_housing_FULL.csv, orient-'index'
melbourne_data.to_json(r'New path to where new JSON file will be stored\New File Name.json')

Export a SQL table to CSV

We can also convert the SQL table to a CSV file:

				
					cars_df = pd.read_sql_query("SELECT * FROM vehicle_data", conn)
cars_df.to_csv(r'New path to where new csv file will be stored\New File Name.csv', index=False)

Convert CSV to Excel

CSV files can also be converted to Excel:

				
					melbourne_data = pd.read_csv('Melbourne_housing_FULL.csv, orient-'index')
melbourne_data.to_excel(r'New path to where new excel file will be stored\New File Name.xlsx', index=None, header=True)

Convert Excel to CSV

Similarly, Excel files can be converted to CSV:

				
					stud_data = pd.read_excel('/content/students.xlsx')
stud_data.to_csv(r'New path to where new csv file will be stored\New File Name.csv', index=None, header=True)

Convert DataFrame to SQL table

Pandas has the df.to_sql() method to convert a DataFrame to an SQL table.

				
					cars = pd.read_json('cars.json', orient='index')

from sqlalchemy import create_engine #pip install sqlalchemy
engine = create_engine('sqlite://', echo=False)

cars.to_sql(name='Cars', con=engine) # name='Cars' is the name of the SQL table

More To Explore

Artificial intelligence

AI: prompt engineering to reduce hallucinations [part 1]

Prompt engineering techniques allow us to improve the reasoning and responses provided by LLMs , such as ChatGPT. However, are we sure that the responses received are correct? In some cases no! When this happens, the model is said to have hallucinated. Let’s find out what this is and what are the techniques to reduce the probability of getting them.

Alessandro Fiori 17 June 2024

Python language