Data analysis in Python
My days with MATLAB are definitely not over, but there is a clear trend: more and more of the people I work with use open source software tools for data analysis and visualization. This has motivated me to test alternative tools and compare them with MATLAB with respect to functionality and performance.
Recently, I've been working with correlation of data from gas sensors. For the SINTSENSE concept, I collect data from the sensors in an InfluxDB 1.x database and plot the data using Grafana. I found no functionality for plotting the correlation between sensors, so I resolved this by extracting data from the database every hour, calculating the correlation coefficients and then pushing them back into the database. These coefficients were then plotted in Grafana alongside the sensor responses.
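The calculation step of such an hourly job can be sketched with pandas. This is a minimal illustration on synthetic data, not the SINTSENSE code: the sensor names and the helper `last_hour_correlation` are hypothetical, and the database read and write are left out entirely.

```python
import numpy as np
import pandas as pd


def last_hour_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlation of all columns over the last hour of data."""
    cutoff = df.index.max() - pd.Timedelta(hours=1)
    recent = df.loc[df.index > cutoff]
    return recent.corr()


# Synthetic readings, one sample per minute (hypothetical sensor names)
rng = np.random.default_rng(0)
index = pd.date_range("2021-01-12 00:00", periods=180, freq="min")
base = rng.normal(size=180)
readings = pd.DataFrame(
    {
        "sensor_a": base,
        "sensor_b": 2.0 * base + 1.0,  # linear function of sensor_a
        "sensor_c": -base,             # mirror image of sensor_a
    },
    index=index,
)

corr_matrix = last_hour_correlation(readings)
print(corr_matrix.round(3))
```

Each off-diagonal element of the matrix is one coefficient that could be written back to the database as its own series.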
The work with SINTSENSE made me want to explore the possibilities of doing correlation analysis on time series data. Calculating the correlation itself is straightforward, but in order to calculate a dynamic correlation, the data must be divided into time segments on which the calculation is performed.
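As an illustration of the segmenting idea, the sketch below splits two synthetic signals into fixed one-day blocks and computes one coefficient per block. The data, the column names `x` and `y`, and the one-day segment length are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute data over four days (288 samples per day)
index = pd.date_range("2021-01-01", periods=4 * 288, freq="5min")
rng = np.random.default_rng(1)
x = pd.Series(rng.normal(size=len(index)), index=index)

# y follows x for the first two days and mirrors it afterwards
sign = np.where(index < pd.Timestamp("2021-01-03"), 1.0, -1.0)
y = x * sign

df = pd.DataFrame({"x": x, "y": y})

# One correlation coefficient per calendar day
daily_corr = df.groupby(pd.Grouper(freq="D")).apply(
    lambda segment: segment["x"].corr(segment["y"])
)
print(daily_corr)
```

Fixed blocks give one value per segment; a rolling window, as used further down, gives a value for every sample instead.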
To test how this could be done in Python, I decided to grab some meteorological data from my Netatmo stations. For outdoor temperature and pressure I would expect to see a correlation.
To plot the temperature data, I used pandas to import the data as data frames:
ot = pd.read_csv('c:/temp/netatmo_ot.csv', sep=';', header=2)
Plotting the data with matplotlib is straightforward. It was very slow at first, though, and manually converting the timestamp to datetime sped up plotting considerably:
plt.plot(pd.to_datetime(ot['Timestamp'], unit='s'), ot['temperature'], color='lightgray', marker='o')
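Since the same conversion is repeated for every curve, one option is to convert once and reuse the result as the index. This is a small sketch with made-up values standing in for the Netatmo file; the column names follow the ones used above, and the index name `time` is my own choice.

```python
import pandas as pd

# Made-up sample standing in for the CSV: Unix timestamps in seconds, 5 minutes apart
ot = pd.DataFrame(
    {
        "Timestamp": [1610000000, 1610000300, 1610000600],
        "temperature": [-3.1, -3.0, -2.8],
    }
)

# Convert once and reuse the result as the index
ot.index = pd.to_datetime(ot["Timestamp"], unit="s")
ot.index.name = "time"

print(ot.index[0])
```

With a DatetimeIndex in place, `plt.plot(ot.index, ot['temperature'])` uses the time axis directly.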
Smoothing of the data was done quite elegantly with the rolling functionality. Specifying the center of the window as reference (center=True) was necessary in order to avoid an offset:
plt.plot(pd.to_datetime(ot['Timestamp'], unit='s'), ot['temperature'].rolling(10, center=True).mean(), 'r-')
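The offset is easy to see on a toy signal. The sketch below smooths a single spike with a trailing window and with a centered one; the signal and the window length of 3 are chosen only to keep the example small.

```python
import pandas as pd

# A single spike at position 5 in an otherwise flat signal
signal = pd.Series([0.0] * 11)
signal.iloc[5] = 3.0

trailing = signal.rolling(3).mean()               # window ends at the current point
centered = signal.rolling(3, center=True).mean()  # window is centered on the current point

# Positions where the smoothed signal is non-zero
trailing_pos = trailing[trailing > 0].index.tolist()
centered_pos = centered[centered > 0].index.tolist()
print("trailing window:", trailing_pos)
print("centered window:", centered_pos)
```

The trailing window smears the spike towards later samples, while the centered window keeps the smoothed bump symmetric around the original position.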
Pressure data was treated in the same way.
In order to calculate the correlation, the rolling function was combined with the corr function in a very elegant way:
ptcorr = ot['temperature'].rolling(1000, center=True).corr(ip['pressure'])
The rolling function leaves about 1000/2 points at each end of the series undefined (NaN). The comparison was performed between dataframes of length 8955 and 8942 and resulted in a series of length 8955, and therefore the datetime column of the longest dataframe was used. With 5 min time resolution, 13 points amount to a time offset of about an hour (13 × 5 min = 65 min). Over 30 days, this is acceptable for this purpose.
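Because `corr` pairs the two series by row index rather than by time, one possible way to avoid the offset is to match rows on the timestamp first. The sketch below uses `pd.merge_asof` on synthetic data. The 150 s tolerance (half a sampling interval) and the short window of 5 points are assumptions for the example, not values from the analysis above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two CSV files: 5-minute samples, Unix time in seconds
t = 1610000000 + 300 * np.arange(50)
ot = pd.DataFrame({"Timestamp": t, "temperature": np.linspace(-5.0, 5.0, 50)})
ip = pd.DataFrame({"Timestamp": t, "pressure": np.linspace(1000.0, 1010.0, 50)})

# Drop a few pressure rows to mimic missing samples in one of the files
ip = ip.drop(index=[10, 11, 12]).reset_index(drop=True)

# Match each temperature row with the nearest pressure row in time,
# leaving NaN where no sample exists within 150 s (assumed tolerance)
merged = pd.merge_asof(ot, ip, on="Timestamp", direction="nearest", tolerance=150)

# Rolling correlation on columns that now share the same time base
ptcorr = merged["temperature"].rolling(5, center=True).corr(merged["pressure"])
print(merged["pressure"].isna().sum(), len(ptcorr))
```

Rows without a matching pressure sample stay NaN instead of shifting all later samples in time, and the result has the same length as the temperature frame.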
In order to get a sound time resolution, the rolling window was increased to 1000 points, which at 5 min resolution corresponds to roughly 3.5 days. The result was plotted as before.
Code:
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 12 19:54:54 2021
@author: taarhaug
"""
import pandas as pd
import matplotlib.pyplot as plt
# read data from csv files with ';' separator; the first two lines are skipped
# and the third line (header=2) provides the column names
ot = pd.read_csv('c:/temp/netatmo_ot.csv', sep=';', header=2)
ip = pd.read_csv('c:/temp/netatmo_ip.csv', sep=';', header=2)
#plot temperature data
# pyplot very slow, convert to datetime manually
plot1 = plt.figure(1)
plt.plot(pd.to_datetime(ot['Timestamp'], unit='s'), ot['temperature'], color='lightgray', marker='o')
plt.plot(pd.to_datetime(ot['Timestamp'], unit='s'), ot['temperature'].rolling(10, center=True).mean(), 'r-')
plt.plot(pd.to_datetime(ot['Timestamp'], unit='s'), ot['temperature'].rolling(100, center=True).mean(), 'k-')
plt.grid(True)
plt.title('Netatmo temperature data')
plt.ylabel('Temperature /°C')
plt.show()
# pressure plot
plot2 = plt.figure(2)
plt.plot(pd.to_datetime(ip['Timestamp'], unit='s'), ip['pressure'], color='lightgray', marker='o')
plt.plot(pd.to_datetime(ip['Timestamp'], unit='s'), ip['pressure'].rolling(10, center=True).mean(), 'r-')
plt.plot(pd.to_datetime(ip['Timestamp'], unit='s'), ip['pressure'].rolling(100, center=True).mean(), 'k-')
plt.grid(True)
plt.title('Netatmo pressure data')
plt.ylabel('Pressure /mbar')
plt.show()
#correlation calculation
ptcorr = ot['temperature'].rolling(1000, center=True).corr(ip['pressure'])
#plot corrcoeff
plot3 = plt.figure(3)
plt.plot(pd.to_datetime(ip['Timestamp'], unit='s'), ptcorr)
plt.grid(True)
plt.ylabel('Corrcoef')
plt.show()