1263 lines
34 KiB
Plaintext
1263 lines
34 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1a44722c-9586-4418-ad7f-8a3c964a9db8",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Daten analysieren\n",
|
||
"\n",
|
||
"Zufallszahlen und Sinuswerte zu visualisieren ist ganz nett, in der Realtität müssen aber Ergebnisse aus Umfragen oder Messwerte von Experimenten ausgewertet werden.\n",
|
||
"\n",
|
||
"Standard für die Datenanlyse mit Python ist das Package `pandas`."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "a5292e13-c487-4e3a-8085-21d80469a734",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import pandas"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8a02c3e1-8ea1-475b-90f0-b3ea6141e4ba",
|
||
"metadata": {},
|
||
"source": [
|
||
"Um zu demonstrieren, wie `pandas` funktioniert, brauchen wir ein paar Daten. Die finden wir in der Datei `car.csv`, in der die Kosten eines Autos über mehrere Jahre hinweg erfasst wurden.\n",
|
||
"\n",
|
||
"CSV (Comma Separated Values) gibt es in verschiedenen Ausprägungen. In unserer Datei sind die Daten nicht mit Komma sondern einem Tabulator voneinander getrennt. Das muss man beim Einlesen mit `sep=\"\\t\"` angeben."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "dc89267a-2993-4fb6-b225-99a233476f83",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Hinweis\n",
|
||
"\n",
|
||
"Bei den folgenden Beispielen wird die Datei `car.csv` benötigt.\n",
|
||
"Diese findet sich [hier](https://gist.githubusercontent.com/cgiess/58f995bf88cd0e2269b634c7970eb479/raw/2de29b036ead1eabc402301e934b82e8f674aaf4/car.csv). Diese Datei muss auf den Jupyter-Rechner liegen. In der Regel kann man dies mittels `Upload Files` erreichen.\n",
|
||
"\n",
|
||
"![upload_file.png](upload_file.png)\n",
|
||
"\n",
|
||
"\n",
|
||
"Das `Upload Files` funktioniert aber bei https://jupyter.org/try-jupyter/lab/ nicht. Hier muss die Datei wie folgt bereitsgestellt werden (Stand Mai 2022):\n",
|
||
"\n",
|
||
"\n",
|
||
"```\n",
|
||
"from js import fetch\n",
|
||
"res = await fetch('https://gist.githubusercontent.com/cgiess/58f995bf88cd0e2269b634c7970eb479/raw/2de29b036ead1eabc402301e934b82e8f674aaf4/car.csv')\n",
|
||
"text = await res.text()\n",
|
||
"with open('car.csv', 'w') as f:\n",
|
||
" f.write(text)\n",
|
||
"\n",
|
||
"import os\n",
|
||
"os.listdir()\n",
|
||
"```\n",
|
||
"\n",
|
||
"Als Ergebnis sollte eine Liste mit Dateienamen angezeigt werden in der `car.csv` enthalten ist, z.B.:\n",
|
||
"\n",
|
||
"`['.matplotlib', '.ipython', 'car.csv']`"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f9c2b389-370c-4c43-b0f4-f05ed3910d88",
|
||
"metadata": {},
|
||
"source": [
|
||
"## CSV Dateien mit Pandas lesen"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"id": "bd02ed34-11f1-40ed-b3f0-7217a0acccef",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Datum</th>\n",
|
||
" <th>Typ</th>\n",
|
||
" <th>Beschreibung</th>\n",
|
||
" <th>Preis</th>\n",
|
||
" <th>km</th>\n",
|
||
" <th>Liter</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>2012-07-07</td>\n",
|
||
" <td>Kauf</td>\n",
|
||
" <td>Autohaus</td>\n",
|
||
" <td>13800.00</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2012-07-10</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>ESSO</td>\n",
|
||
" <td>57.01</td>\n",
|
||
" <td>199</td>\n",
|
||
" <td>34.89</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>2012-07-11</td>\n",
|
||
" <td>Versich</td>\n",
|
||
" <td>Haftpfl.</td>\n",
|
||
" <td>104.30</td>\n",
|
||
" <td>400</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2012-07-23</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>Kaufland</td>\n",
|
||
" <td>55.03</td>\n",
|
||
" <td>828</td>\n",
|
||
" <td>34.20</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2012-08-10</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>Kaufland</td>\n",
|
||
" <td>56.72</td>\n",
|
||
" <td>1444</td>\n",
|
||
" <td>35.47</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>227</th>\n",
|
||
" <td>2021-08-28</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>AVIA</td>\n",
|
||
" <td>47.10</td>\n",
|
||
" <td>104552</td>\n",
|
||
" <td>29.27</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>228</th>\n",
|
||
" <td>2021-10-09</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>Kaufland</td>\n",
|
||
" <td>55.10</td>\n",
|
||
" <td>105147</td>\n",
|
||
" <td>33.97</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>229</th>\n",
|
||
" <td>2021-10-19</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>JET</td>\n",
|
||
" <td>40.10</td>\n",
|
||
" <td>105623</td>\n",
|
||
" <td>24.32</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>230</th>\n",
|
||
" <td>2021-12-04</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>JET</td>\n",
|
||
" <td>53.30</td>\n",
|
||
" <td>106186</td>\n",
|
||
" <td>34.19</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>231</th>\n",
|
||
" <td>2021-12-25</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>AVIA</td>\n",
|
||
" <td>51.42</td>\n",
|
||
" <td>106727</td>\n",
|
||
" <td>31.96</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>232 rows × 6 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Datum Typ Beschreibung Preis km Liter\n",
|
||
"0 2012-07-07 Kauf Autohaus 13800.00 30 NaN\n",
|
||
"1 2012-07-10 Benzin ESSO 57.01 199 34.89\n",
|
||
"2 2012-07-11 Versich Haftpfl. 104.30 400 NaN\n",
|
||
"3 2012-07-23 Benzin Kaufland 55.03 828 34.20\n",
|
||
"4 2012-08-10 Benzin Kaufland 56.72 1444 35.47\n",
|
||
".. ... ... ... ... ... ...\n",
|
||
"227 2021-08-28 Benzin AVIA 47.10 104552 29.27\n",
|
||
"228 2021-10-09 Benzin Kaufland 55.10 105147 33.97\n",
|
||
"229 2021-10-19 Benzin JET 40.10 105623 24.32\n",
|
||
"230 2021-12-04 Benzin JET 53.30 106186 34.19\n",
|
||
"231 2021-12-25 Benzin AVIA 51.42 106727 31.96\n",
|
||
"\n",
|
||
"[232 rows x 6 columns]"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d = pandas.read_csv(\"file:car.csv\", sep=\"\\t\")\n",
|
||
"d"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0f10f423-f74d-4de6-b4cc-bfe45664d40c",
|
||
"metadata": {},
|
||
"source": [
|
||
"Das scheint funktioniert zu haben. Eine kleine Änderung werden wir noch vornehmen. Die erste Spalte enthält ein Datum. Da dies verschieden geschrieben werden kann (31.12.2020, 12/31/2020, ...) müssen wir bei der Erkennung ein bisschen nachhelfen. In unserem Fall ist es ausreichend zu sagen, welche Spalten ein Datum enthalten."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"id": "49844097-ddff-4331-831a-433cad03cda6",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"d = pandas.read_csv('file:car.csv', sep='\\t', parse_dates=['Datum'])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "de418fbd-024f-4205-be0b-5357c6667ccb",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Auf einzelne Daten zugreifen\n",
|
||
"\n",
|
||
"Der Rückgabewert von `read_csv` ist ein **Data Frame**.\n",
|
||
"Das sieht aus wie eine Tabelle. Was kann man damit nun machen?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"id": "0440feaf-1eff-4206-a349-ac2b3186ecce",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0 13800.00\n",
|
||
"1 57.01\n",
|
||
"2 104.30\n",
|
||
"3 55.03\n",
|
||
"4 56.72\n",
|
||
" ... \n",
|
||
"227 47.10\n",
|
||
"228 55.10\n",
|
||
"229 40.10\n",
|
||
"230 53.30\n",
|
||
"231 51.42\n",
|
||
"Name: Preis, Length: 232, dtype: float64"
|
||
]
|
||
},
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d[\"Preis\"]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "60b9d3f1-5665-42c3-9bba-f9e2293b2de3",
|
||
"metadata": {},
|
||
"source": [
|
||
"Einzelne Spalten kann man über den Spaltentitel adressieren. So eine Spalte nennt Pandas **Series**.\n",
|
||
"Möchte man einen Wert aus so einer Series haben, so muss man zusätzlich dessen Zeilennummer angeben."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"id": "8be6a487-c75a-4ff2-b39d-7a4b2451391c",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"13800.0"
|
||
]
|
||
},
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d[\"Preis\"][0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c0d070ae-beaa-474b-ad12-3878169f638b",
|
||
"metadata": {},
|
||
"source": [
|
||
"Das Ganze geht auch umgekehrt.\n",
|
||
"Eine Zeile bekommt man über den Array `iloc`:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"id": "745dcce7-0d8b-4e7b-bb30-43bb7e78b493",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Datum 2012-07-07 00:00:00\n",
|
||
"Typ Kauf\n",
|
||
"Beschreibung Autohaus\n",
|
||
"Preis 13800.0\n",
|
||
"km 30\n",
|
||
"Liter NaN\n",
|
||
"Name: 0, dtype: object"
|
||
]
|
||
},
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d.iloc[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4cb8b853-3151-44cc-9d11-d6656a19bc22",
|
||
"metadata": {},
|
||
"source": [
|
||
"... und einen einzelnen Wert daraus über den Spaltennamen."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"id": "d779293b-4e75-4814-972d-138df30cf8a4",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"30"
|
||
]
|
||
},
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d.iloc[0][\"km\"]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b710be5b-7be3-454b-8089-402719845f75",
|
||
"metadata": {},
|
||
"source": [
|
||
"Mit `head()` und `tail()` kann man zudem die ersten bzw. letzten N Spalten auswählen."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"id": "9f9d880a-e119-45ef-8136-70866faf3236",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Datum</th>\n",
|
||
" <th>Typ</th>\n",
|
||
" <th>Beschreibung</th>\n",
|
||
" <th>Preis</th>\n",
|
||
" <th>km</th>\n",
|
||
" <th>Liter</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2012-07-10</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>ESSO</td>\n",
|
||
" <td>57.01</td>\n",
|
||
" <td>199</td>\n",
|
||
" <td>34.89</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Datum Typ Beschreibung Preis km Liter\n",
|
||
"1 2012-07-10 Benzin ESSO 57.01 199 34.89"
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d.head(2).tail(1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4cd9e9a2-a2da-4a75-8dbd-18206d99744b",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Aufgabe\n",
|
||
"Wie bekommt man die 4. bis 6. Zeile?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ccd76426-4859-4a79-8882-c3cd72dcfb5a",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Lösung\n",
|
||
"Achtung: die 4. Zeile hat die Nummer 3 weil die Nummerierung mit 0 beginnt!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"id": "15e79e26-3e9f-4674-9fb6-fbf3fe3a1303",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Datum</th>\n",
|
||
" <th>Typ</th>\n",
|
||
" <th>Beschreibung</th>\n",
|
||
" <th>Preis</th>\n",
|
||
" <th>km</th>\n",
|
||
" <th>Liter</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2012-07-23</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>Kaufland</td>\n",
|
||
" <td>55.03</td>\n",
|
||
" <td>828</td>\n",
|
||
" <td>34.20</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2012-08-10</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>Kaufland</td>\n",
|
||
" <td>56.72</td>\n",
|
||
" <td>1444</td>\n",
|
||
" <td>35.47</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5</th>\n",
|
||
" <td>2012-08-23</td>\n",
|
||
" <td>Steuern</td>\n",
|
||
" <td>Kfz-Steuer</td>\n",
|
||
" <td>50.00</td>\n",
|
||
" <td>1500</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Datum Typ Beschreibung Preis km Liter\n",
|
||
"3 2012-07-23 Benzin Kaufland 55.03 828 34.20\n",
|
||
"4 2012-08-10 Benzin Kaufland 56.72 1444 35.47\n",
|
||
"5 2012-08-23 Steuern Kfz-Steuer 50.00 1500 NaN"
|
||
]
|
||
},
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d.head(6).tail(3)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bc38357a-65bd-4f59-969a-ca3bd8c2fc92",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Was wissen die Daten über sich selbst?\n",
|
||
"Der Data Frame kann über sich selbst etwas sagen:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"id": "100cc290-195c-4b74-a231-1c4bcc8c5d24",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Datum datetime64[ns]\n",
|
||
"Typ object\n",
|
||
"Beschreibung object\n",
|
||
"Preis float64\n",
|
||
"km int64\n",
|
||
"Liter float64\n",
|
||
"dtype: object"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d.dtypes"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4c75bf50-b650-462a-b48d-e00950065e4c",
|
||
"metadata": {},
|
||
"source": [
|
||
"Was bedeutet dies?\n",
|
||
"Für jede Spalte wird angegeben, welchen Typ die darin enthaltenen Daten besitzen.\n",
|
||
"* datetime64\\[ns] - ist ein Zeitstempel bestehend aus Datum und Uhrzeit wobei letztere eine Genauigkeit von Nanosekunden hat\n",
|
||
"* object - das sind Texte\n",
|
||
"* float64 - Zahl mit Nachkommastellen.\n",
|
||
"* int64 - Zahl ohne Nachkommastellen\n",
|
||
"\n",
|
||
"Über die Verteilung der Zahlen gibt die Funktion `describe()` Auskunft."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"id": "676f6f12-173c-44ac-a5bc-34063a0c2ae0",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Preis</th>\n",
|
||
" <th>km</th>\n",
|
||
" <th>Liter</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>count</th>\n",
|
||
" <td>232.000000</td>\n",
|
||
" <td>232.000000</td>\n",
|
||
" <td>201.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>mean</th>\n",
|
||
" <td>117.583578</td>\n",
|
||
" <td>53910.508621</td>\n",
|
||
" <td>31.144726</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>std</th>\n",
|
||
" <td>904.538671</td>\n",
|
||
" <td>31374.136420</td>\n",
|
||
" <td>4.261039</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>min</th>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>30.000000</td>\n",
|
||
" <td>10.140000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>25%</th>\n",
|
||
" <td>39.840000</td>\n",
|
||
" <td>27386.750000</td>\n",
|
||
" <td>29.890000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>50%</th>\n",
|
||
" <td>45.160000</td>\n",
|
||
" <td>53140.000000</td>\n",
|
||
" <td>32.640000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>75%</th>\n",
|
||
" <td>50.000000</td>\n",
|
||
" <td>81382.000000</td>\n",
|
||
" <td>33.650000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>max</th>\n",
|
||
" <td>13800.000000</td>\n",
|
||
" <td>106727.000000</td>\n",
|
||
" <td>37.930000</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Preis km Liter\n",
|
||
"count 232.000000 232.000000 201.000000\n",
|
||
"mean 117.583578 53910.508621 31.144726\n",
|
||
"std 904.538671 31374.136420 4.261039\n",
|
||
"min 0.000000 30.000000 10.140000\n",
|
||
"25% 39.840000 27386.750000 29.890000\n",
|
||
"50% 45.160000 53140.000000 32.640000\n",
|
||
"75% 50.000000 81382.000000 33.650000\n",
|
||
"max 13800.000000 106727.000000 37.930000"
|
||
]
|
||
},
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d.describe()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "02ec9241-c8e2-4ccb-b7ba-ef2062e6227e",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Aufgabe"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1e083c1e-4efe-4f34-ac5a-027c5e64a977",
|
||
"metadata": {},
|
||
"source": [
|
||
"1. Welche Bedeutung haben diese Zahlen?\n",
|
||
"1. Welche davon helfen beim Verständnis der Daten?\n",
|
||
"1. Gibt es Datensätze, wo diese Funktion noch viel hilfreicher ist?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e2d48322-849d-4a4b-aeba-51d588504ad3",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Lösung\n",
|
||
"Die besprechen wir im Kurs"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "878dec12-c334-4d2c-9d10-4500f38af752",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Zeilen zählen"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"id": "cb3dc7a5-a4f2-4c75-a251-2862fde3f324",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Datum 232\n",
|
||
"Typ 232\n",
|
||
"Beschreibung 232\n",
|
||
"Preis 232\n",
|
||
"km 232\n",
|
||
"Liter 201\n",
|
||
"dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d.count()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5e773974-53d8-4a16-849b-708e429ef38d",
|
||
"metadata": {},
|
||
"source": [
|
||
"In der Spalte **Liter** fehlen einige Einträge. Darum liefert `count()` für diese einen kleineren Wert.\n",
|
||
"Nicht vorhandene Werte werden als **NaN** (Not a Number) angezeigt."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5af33628-90d0-490d-a413-f2083fe40a5a",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Werte zählen"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"id": "b7930759-6e4f-4a82-aa63-830a15dd8d8e",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Benzin 201\n",
|
||
"Versich 10\n",
|
||
"Steuern 10\n",
|
||
"Werkst 9\n",
|
||
"Kauf 2\n",
|
||
"Name: Typ, dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d[\"Typ\"].value_counts()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "657dc390-cefc-4345-a2c8-86cd4da01b03",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Rechnen\n",
|
||
"\n",
|
||
"Welche Kosten sind insgesamt angefallen? Dazu muss man alle Einträge der Spalte `Preis` aufsummieren."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"id": "3e68e583-7427-40fa-86bf-f87571d83a63",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"27279.390000000003"
|
||
]
|
||
},
|
||
"execution_count": 17,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d[\"Preis\"].sum()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9b631276-87e4-449b-9d07-52896b7fbc16",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Aufgabe\n",
|
||
"\n",
|
||
"Was ist der Durchschnittsverbrauch des Autos über die gesamte erfasste Zeit?\n",
|
||
"\n",
|
||
"**Hinweise**\n",
|
||
"* Gesamtverbrauch, d.h. wieviele Liter wurden insgesamt verbraucht\n",
|
||
"* Fahrstrecke, d.h. km-Stand am Ende - km-Stand am Anfang\n",
|
||
"* Verbrauch wird i.d.R. in l/100km angegeben"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e2cfce0c-9eb6-4ac3-a2bb-43d085d20466",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Lösung"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"id": "081d24f8-cdc7-4273-80fb-c9e20a42f09b",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"5.867165899697274"
|
||
]
|
||
},
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"l_total = d[\"Liter\"].sum()\n",
|
||
"\n",
|
||
"km_start = d[\"km\"].min() # oder d[\"km\"][0] -> km-Stand aus der 0ten Zeile\n",
|
||
"km_end = d[\"km\"].max() # oder d[\"km\"].iloc[-1] -> km-Stand aus der letzten Zeile,\n",
|
||
" # geht man von 0 eins zurück fängt man am Ende wieder an\n",
|
||
"\n",
|
||
"km_total = km_end - km_start\n",
|
||
"\n",
|
||
"fuel_avg = 100 * l_total / km_total\n",
|
||
"fuel_avg"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9b125258-956b-4a6e-8074-6f4956563bd0",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Daten filtern\n",
|
||
"Wenn man nur an einem Teil der Daten interessiert ist kann man sich diesen selektieren."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"id": "e15a92cf-d913-436e-8a51-440dfcbde914",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Datum</th>\n",
|
||
" <th>Typ</th>\n",
|
||
" <th>Beschreibung</th>\n",
|
||
" <th>Preis</th>\n",
|
||
" <th>km</th>\n",
|
||
" <th>Liter</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2012-07-10</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>ESSO</td>\n",
|
||
" <td>57.01</td>\n",
|
||
" <td>199</td>\n",
|
||
" <td>34.89</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2012-07-23</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>Kaufland</td>\n",
|
||
" <td>55.03</td>\n",
|
||
" <td>828</td>\n",
|
||
" <td>34.20</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Datum Typ Beschreibung Preis km Liter\n",
|
||
"1 2012-07-10 Benzin ESSO 57.01 199 34.89\n",
|
||
"3 2012-07-23 Benzin Kaufland 55.03 828 34.20"
|
||
]
|
||
},
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"fuel_only = d[d[\"Typ\"] == \"Benzin\"]\n",
|
||
"fuel_only.head(2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "561575e8-6193-44e1-956e-59a6e618d4d7",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Aufgabe\n",
|
||
"1. Selektiere die Zeilen, bei denen der Preis kleiner als 100 ist.\n",
|
||
"1. Selektiere die Zeilen, bei denen der Preis gleich 50 ist.\n",
|
||
"1. Selektiere die Zeilen, bei denen Liter größer 36 und kleiner 38 sind."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c01cb1b4-dd7a-49f2-ae25-a13bd99de5fb",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Lösung"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"id": "bbe1a6eb-d467-4656-8c17-ab6f9e4ceeb2",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Datum</th>\n",
|
||
" <th>Typ</th>\n",
|
||
" <th>Beschreibung</th>\n",
|
||
" <th>Preis</th>\n",
|
||
" <th>km</th>\n",
|
||
" <th>Liter</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>6</th>\n",
|
||
" <td>2012-09-10</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>AVIA</td>\n",
|
||
" <td>60.80</td>\n",
|
||
" <td>2061</td>\n",
|
||
" <td>36.87</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>2012-09-14</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>OMV</td>\n",
|
||
" <td>61.10</td>\n",
|
||
" <td>2710</td>\n",
|
||
" <td>36.83</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>106</th>\n",
|
||
" <td>2016-09-15</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>Kaufland</td>\n",
|
||
" <td>48.51</td>\n",
|
||
" <td>48872</td>\n",
|
||
" <td>37.93</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>124</th>\n",
|
||
" <td>2017-05-30</td>\n",
|
||
" <td>Benzin</td>\n",
|
||
" <td>Real</td>\n",
|
||
" <td>46.78</td>\n",
|
||
" <td>57868</td>\n",
|
||
" <td>36.01</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Datum Typ Beschreibung Preis km Liter\n",
|
||
"6 2012-09-10 Benzin AVIA 60.80 2061 36.87\n",
|
||
"7 2012-09-14 Benzin OMV 61.10 2710 36.83\n",
|
||
"106 2016-09-15 Benzin Kaufland 48.51 48872 37.93\n",
|
||
"124 2017-05-30 Benzin Real 46.78 57868 36.01"
|
||
]
|
||
},
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d[d[\"Preis\"] < 100]\n",
|
||
"d[d[\"Preis\"] == 50]\n",
|
||
"# d.described() sagte, dass der größte Wert 37.93 ist\n",
|
||
"d[d[\"Liter\"] > 36]\n",
|
||
"d[d[\"Liter\"].between(36,38)]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8f1056bd-f6be-48a8-98ca-4f332989e45a",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Daten Gruppieren\n",
|
||
"\n",
|
||
"Die Kosten sollen nach Spalte `Typ` bzw `Beschreibung` aufsummiert werden."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"id": "3ecb95b7-de9d-4383-9162-ca1e5b5bf669",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Typ\n",
|
||
"Benzin 8704.15\n",
|
||
"Kauf 14296.68\n",
|
||
"Steuern 500.00\n",
|
||
"Versich 2596.02\n",
|
||
"Werkst 1182.54\n",
|
||
"Name: Preis, dtype: float64"
|
||
]
|
||
},
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"d.groupby(\"Typ\")[\"Preis\"].sum()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"id": "9acedd39-f003-4a98-82ae-8082a2685d1c",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Beschreibung\n",
|
||
"AGIP 32.80\n",
|
||
"ARAL 795.33\n",
|
||
"AVIA 671.59\n",
|
||
"Autohof 45.30\n",
|
||
"Avanti 64.10\n",
|
||
"BFT 88.66\n",
|
||
"BP 46.05\n",
|
||
"Bavaria 44.00\n",
|
||
"ESSO 1320.52\n",
|
||
"Elf 45.00\n",
|
||
"Globus 798.94\n",
|
||
"HEM 27.70\n",
|
||
"JET 1074.80\n",
|
||
"KK 183.34\n",
|
||
"Kaufland 1080.48\n",
|
||
"OMV 95.38\n",
|
||
"Oil 261.71\n",
|
||
"Real 222.15\n",
|
||
"SHELL 539.30\n",
|
||
"Star 99.80\n",
|
||
"Tango 56.85\n",
|
||
"Tankcenter 413.43\n",
|
||
"Total 481.25\n",
|
||
"UNO-X 43.36\n",
|
||
"Unbekannt 172.31\n",
|
||
"Name: Preis, dtype: float64"
|
||
]
|
||
},
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"fuel_only.groupby(\"Beschreibung\")[\"Preis\"].sum()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d38f01a2-ffb8-48a5-a646-517989735b3b",
|
||
"metadata": {},
|
||
"source": [
|
||
"Das ist doch nett. Noch schöner wäre aber, wenn die Daten nach der zweiten Spalte absteigend sortiert wären.\n",
|
||
"Zudem ist man oft nur an den größten Ergebnissen interessiert."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"id": "3152d46a-d38e-4e1c-8465-b13082c72638",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Beschreibung\n",
|
||
"ESSO 1320.52\n",
|
||
"Kaufland 1080.48\n",
|
||
"JET 1074.80\n",
|
||
"Globus 798.94\n",
|
||
"ARAL 795.33\n",
|
||
"AVIA 671.59\n",
|
||
"SHELL 539.30\n",
|
||
"Total 481.25\n",
|
||
"Tankcenter 413.43\n",
|
||
"Oil 261.71\n",
|
||
"Name: Preis, dtype: float64"
|
||
]
|
||
},
|
||
"execution_count": 29,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"fuel_only.groupby(\"Beschreibung\")[\"Preis\"].sum().sort_values(ascending=False).head(10)"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.9.7"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|