sq_02_python_data_science/jupyter_book/04_pandas.ipynb

1244 lines
33 KiB
Plaintext
Raw Permalink Blame History

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

{
"cells": [
{
"cell_type": "markdown",
"id": "1a44722c-9586-4418-ad7f-8a3c964a9db8",
"metadata": {},
"source": [
"# Daten analysieren\n",
"\n",
"Zufallszahlen und Sinuswerte zu visualisieren ist ganz nett, in der Realtität müssen aber Ergebnisse aus Umfragen oder Messwerte von Experimenten ausgewertet werden.\n",
"\n",
"Standard für die Datenanlyse mit Python ist das Package `pandas`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a5292e13-c487-4e3a-8085-21d80469a734",
"metadata": {},
"outputs": [],
"source": [
"import pandas"
]
},
{
"cell_type": "markdown",
"id": "8a02c3e1-8ea1-475b-90f0-b3ea6141e4ba",
"metadata": {},
"source": [
"Um zu demonstrieren, wie `pandas` funktioniert, brauchen wir ein paar Daten. Die finden wir in der Datei `car.csv`, in der die Kosten eines Autos über mehrere Jahre hinweg erfasst wurden.\n",
"\n",
"CSV (Comma Separated Values) gibt es in verschiedenen Ausprägungen. In unserer Datei sind die Daten nicht mit Komma sondern einem Tabulator voneinander getrennt. Das muss man beim Einlesen mit `sep=\"\\t\"` angeben."
]
},
{
"cell_type": "markdown",
"id": "dc89267a-2993-4fb6-b225-99a233476f83",
"metadata": {},
"source": [
"## Hinweis\n",
"\n",
"Bei den folgenden Beispielen wird die Datei `car.csv` benötigt.\n",
"Diese findet sich [in Moodle](https://moodle.hs-mannheim.de/course/view.php?id=4533). Diese Datei muss auf den Jupyter-Rechner liegen. Dazu muss die Datei mittels `Upload Files` hochgeladen werden.\n",
"\n",
"![upload_file.png](upload_file.png)\n"
]
},
{
"cell_type": "markdown",
"id": "f9c2b389-370c-4c43-b0f4-f05ed3910d88",
"metadata": {},
"source": [
"## CSV Dateien mit Pandas lesen"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "bd02ed34-11f1-40ed-b3f0-7217a0acccef",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Datum</th>\n",
" <th>Typ</th>\n",
" <th>Beschreibung</th>\n",
" <th>Preis</th>\n",
" <th>km</th>\n",
" <th>Liter</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2012-07-07</td>\n",
" <td>Kauf</td>\n",
" <td>Autohaus</td>\n",
" <td>13800.00</td>\n",
" <td>30</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2012-07-10</td>\n",
" <td>Benzin</td>\n",
" <td>ESSO</td>\n",
" <td>57.01</td>\n",
" <td>199</td>\n",
" <td>34.89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2012-07-11</td>\n",
" <td>Versich</td>\n",
" <td>Haftpfl.</td>\n",
" <td>104.30</td>\n",
" <td>400</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2012-07-23</td>\n",
" <td>Benzin</td>\n",
" <td>Kaufland</td>\n",
" <td>55.03</td>\n",
" <td>828</td>\n",
" <td>34.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2012-08-10</td>\n",
" <td>Benzin</td>\n",
" <td>Kaufland</td>\n",
" <td>56.72</td>\n",
" <td>1444</td>\n",
" <td>35.47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>227</th>\n",
" <td>2021-08-28</td>\n",
" <td>Benzin</td>\n",
" <td>AVIA</td>\n",
" <td>47.10</td>\n",
" <td>104552</td>\n",
" <td>29.27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>228</th>\n",
" <td>2021-10-09</td>\n",
" <td>Benzin</td>\n",
" <td>Kaufland</td>\n",
" <td>55.10</td>\n",
" <td>105147</td>\n",
" <td>33.97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>229</th>\n",
" <td>2021-10-19</td>\n",
" <td>Benzin</td>\n",
" <td>JET</td>\n",
" <td>40.10</td>\n",
" <td>105623</td>\n",
" <td>24.32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>230</th>\n",
" <td>2021-12-04</td>\n",
" <td>Benzin</td>\n",
" <td>JET</td>\n",
" <td>53.30</td>\n",
" <td>106186</td>\n",
" <td>34.19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>231</th>\n",
" <td>2021-12-25</td>\n",
" <td>Benzin</td>\n",
" <td>AVIA</td>\n",
" <td>51.42</td>\n",
" <td>106727</td>\n",
" <td>31.96</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>232 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" Datum Typ Beschreibung Preis km Liter\n",
"0 2012-07-07 Kauf Autohaus 13800.00 30 NaN\n",
"1 2012-07-10 Benzin ESSO 57.01 199 34.89\n",
"2 2012-07-11 Versich Haftpfl. 104.30 400 NaN\n",
"3 2012-07-23 Benzin Kaufland 55.03 828 34.20\n",
"4 2012-08-10 Benzin Kaufland 56.72 1444 35.47\n",
".. ... ... ... ... ... ...\n",
"227 2021-08-28 Benzin AVIA 47.10 104552 29.27\n",
"228 2021-10-09 Benzin Kaufland 55.10 105147 33.97\n",
"229 2021-10-19 Benzin JET 40.10 105623 24.32\n",
"230 2021-12-04 Benzin JET 53.30 106186 34.19\n",
"231 2021-12-25 Benzin AVIA 51.42 106727 31.96\n",
"\n",
"[232 rows x 6 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = pandas.read_csv(\"car.csv\", sep=\"\\t\")\n",
"d"
]
},
{
"cell_type": "markdown",
"id": "0f10f423-f74d-4de6-b4cc-bfe45664d40c",
"metadata": {},
"source": [
"Das scheint funktioniert zu haben. Eine kleine Änderung werden wir noch vornehmen. Die erste Spalte enthält ein Datum. Da dies verschieden geschrieben werden kann (31.12.2020, 12/31/2020, ...) müssen wir bei der Erkennung ein bisschen nachhelfen. In unserem Fall ist es ausreichend zu sagen, welche Spalten ein Datum enthalten."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "49844097-ddff-4331-831a-433cad03cda6",
"metadata": {},
"outputs": [],
"source": [
"d = pandas.read_csv('car.csv', sep='\\t', parse_dates=['Datum'])"
]
},
{
"cell_type": "markdown",
"id": "de418fbd-024f-4205-be0b-5357c6667ccb",
"metadata": {},
"source": [
"## Auf einzelne Daten zugreifen\n",
"\n",
"Der Rückgabewert von `read_csv` ist ein **Data Frame**.\n",
"Das sieht aus wie eine Tabelle. Was kann man damit nun machen?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0440feaf-1eff-4206-a349-ac2b3186ecce",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 13800.00\n",
"1 57.01\n",
"2 104.30\n",
"3 55.03\n",
"4 56.72\n",
" ... \n",
"227 47.10\n",
"228 55.10\n",
"229 40.10\n",
"230 53.30\n",
"231 51.42\n",
"Name: Preis, Length: 232, dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[\"Preis\"]"
]
},
{
"cell_type": "markdown",
"id": "60b9d3f1-5665-42c3-9bba-f9e2293b2de3",
"metadata": {},
"source": [
"Einzelne Spalten kann man über den Spaltentitel adressieren. So eine Spalte nennt Pandas **Series**.\n",
"Möchte man einen Wert aus so einer Series haben, so muss man zusätzlich dessen Zeilennummer angeben."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8be6a487-c75a-4ff2-b39d-7a4b2451391c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13800.0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[\"Preis\"][0]"
]
},
{
"cell_type": "markdown",
"id": "c0d070ae-beaa-474b-ad12-3878169f638b",
"metadata": {},
"source": [
"Das Ganze geht auch umgekehrt.\n",
"Eine Zeile bekommt man über den Array `iloc`:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "745dcce7-0d8b-4e7b-bb30-43bb7e78b493",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Datum 2012-07-07 00:00:00\n",
"Typ Kauf\n",
"Beschreibung Autohaus\n",
"Preis 13800.0\n",
"km 30\n",
"Liter NaN\n",
"Name: 0, dtype: object"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.iloc[0]"
]
},
{
"cell_type": "markdown",
"id": "4cb8b853-3151-44cc-9d11-d6656a19bc22",
"metadata": {},
"source": [
"... und einen einzelnen Wert daraus über den Spaltennamen."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "d779293b-4e75-4814-972d-138df30cf8a4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"30"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.iloc[0][\"km\"]"
]
},
{
"cell_type": "markdown",
"id": "b710be5b-7be3-454b-8089-402719845f75",
"metadata": {},
"source": [
"Mit `head()` und `tail()` kann man zudem die ersten bzw. letzten N Spalten auswählen."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "9f9d880a-e119-45ef-8136-70866faf3236",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Datum</th>\n",
" <th>Typ</th>\n",
" <th>Beschreibung</th>\n",
" <th>Preis</th>\n",
" <th>km</th>\n",
" <th>Liter</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2012-07-10</td>\n",
" <td>Benzin</td>\n",
" <td>ESSO</td>\n",
" <td>57.01</td>\n",
" <td>199</td>\n",
" <td>34.89</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Datum Typ Beschreibung Preis km Liter\n",
"1 2012-07-10 Benzin ESSO 57.01 199 34.89"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.head(2).tail(1)"
]
},
{
"cell_type": "markdown",
"id": "4cd9e9a2-a2da-4a75-8dbd-18206d99744b",
"metadata": {},
"source": [
"## Aufgabe\n",
"Wie bekommt man die 4. bis 6. Zeile?"
]
},
{
"cell_type": "markdown",
"id": "ccd76426-4859-4a79-8882-c3cd72dcfb5a",
"metadata": {},
"source": [
"## Lösung\n",
"Achtung: die 4. Zeile hat die Nummer 3 weil die Nummerierung mit 0 beginnt!"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "15e79e26-3e9f-4674-9fb6-fbf3fe3a1303",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Datum</th>\n",
" <th>Typ</th>\n",
" <th>Beschreibung</th>\n",
" <th>Preis</th>\n",
" <th>km</th>\n",
" <th>Liter</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2012-07-23</td>\n",
" <td>Benzin</td>\n",
" <td>Kaufland</td>\n",
" <td>55.03</td>\n",
" <td>828</td>\n",
" <td>34.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2012-08-10</td>\n",
" <td>Benzin</td>\n",
" <td>Kaufland</td>\n",
" <td>56.72</td>\n",
" <td>1444</td>\n",
" <td>35.47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2012-08-23</td>\n",
" <td>Steuern</td>\n",
" <td>Kfz-Steuer</td>\n",
" <td>50.00</td>\n",
" <td>1500</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Datum Typ Beschreibung Preis km Liter\n",
"3 2012-07-23 Benzin Kaufland 55.03 828 34.20\n",
"4 2012-08-10 Benzin Kaufland 56.72 1444 35.47\n",
"5 2012-08-23 Steuern Kfz-Steuer 50.00 1500 NaN"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.head(6).tail(3)"
]
},
{
"cell_type": "markdown",
"id": "bc38357a-65bd-4f59-969a-ca3bd8c2fc92",
"metadata": {},
"source": [
"## Was wissen die Daten über sich selbst?\n",
"Der Data Frame kann über sich selbst etwas sagen:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "100cc290-195c-4b74-a231-1c4bcc8c5d24",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Datum datetime64[ns]\n",
"Typ object\n",
"Beschreibung object\n",
"Preis float64\n",
"km int64\n",
"Liter float64\n",
"dtype: object"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.dtypes"
]
},
{
"cell_type": "markdown",
"id": "4c75bf50-b650-462a-b48d-e00950065e4c",
"metadata": {},
"source": [
"Was bedeutet dies?\n",
"Für jede Spalte wird angegeben, welchen Typ die darin enthaltenen Daten besitzen.\n",
"* datetime64\\[ns] - ist ein Zeitstempel bestehend aus Datum und Uhrzeit wobei letztere eine Genauigkeit von Nanosekunden hat\n",
"* object - das sind Texte\n",
"* float64 - Zahl mit Nachkommastellen.\n",
"* int64 - Zahl ohne Nachkommastellen\n",
"\n",
"Über die Verteilung der Zahlen gibt die Funktion `describe()` Auskunft."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "676f6f12-173c-44ac-a5bc-34063a0c2ae0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Preis</th>\n",
" <th>km</th>\n",
" <th>Liter</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>232.000000</td>\n",
" <td>232.000000</td>\n",
" <td>201.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>117.583578</td>\n",
" <td>53910.508621</td>\n",
" <td>31.144726</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>904.538671</td>\n",
" <td>31374.136420</td>\n",
" <td>4.261039</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>30.000000</td>\n",
" <td>10.140000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>39.840000</td>\n",
" <td>27386.750000</td>\n",
" <td>29.890000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>45.160000</td>\n",
" <td>53140.000000</td>\n",
" <td>32.640000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>50.000000</td>\n",
" <td>81382.000000</td>\n",
" <td>33.650000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>13800.000000</td>\n",
" <td>106727.000000</td>\n",
" <td>37.930000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Preis km Liter\n",
"count 232.000000 232.000000 201.000000\n",
"mean 117.583578 53910.508621 31.144726\n",
"std 904.538671 31374.136420 4.261039\n",
"min 0.000000 30.000000 10.140000\n",
"25% 39.840000 27386.750000 29.890000\n",
"50% 45.160000 53140.000000 32.640000\n",
"75% 50.000000 81382.000000 33.650000\n",
"max 13800.000000 106727.000000 37.930000"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.describe()"
]
},
{
"cell_type": "markdown",
"id": "02ec9241-c8e2-4ccb-b7ba-ef2062e6227e",
"metadata": {},
"source": [
"## Aufgabe"
]
},
{
"cell_type": "markdown",
"id": "1e083c1e-4efe-4f34-ac5a-027c5e64a977",
"metadata": {},
"source": [
"1. Welche Bedeutung haben diese Zahlen?\n",
"1. Welche davon helfen beim Verständnis der Daten?\n",
"1. Gibt es Datensätze, wo diese Funktion noch viel hilfreicher ist?"
]
},
{
"cell_type": "markdown",
"id": "e2d48322-849d-4a4b-aeba-51d588504ad3",
"metadata": {},
"source": [
"## Lösung\n",
"Die besprechen wir im Kurs"
]
},
{
"cell_type": "markdown",
"id": "878dec12-c334-4d2c-9d10-4500f38af752",
"metadata": {},
"source": [
"## Zeilen zählen"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "cb3dc7a5-a4f2-4c75-a251-2862fde3f324",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Datum 232\n",
"Typ 232\n",
"Beschreibung 232\n",
"Preis 232\n",
"km 232\n",
"Liter 201\n",
"dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.count()"
]
},
{
"cell_type": "markdown",
"id": "5e773974-53d8-4a16-849b-708e429ef38d",
"metadata": {},
"source": [
"In der Spalte **Liter** fehlen einige Einträge. Darum liefert `count()` für diese einen kleineren Wert.\n",
"Nicht vorhandene Werte werden als **NaN** (Not a Number) angezeigt."
]
},
{
"cell_type": "markdown",
"id": "5af33628-90d0-490d-a413-f2083fe40a5a",
"metadata": {},
"source": [
"## Werte zählen"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b7930759-6e4f-4a82-aa63-830a15dd8d8e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Benzin 201\n",
"Versich 10\n",
"Steuern 10\n",
"Werkst 9\n",
"Kauf 2\n",
"Name: Typ, dtype: int64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[\"Typ\"].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "657dc390-cefc-4345-a2c8-86cd4da01b03",
"metadata": {},
"source": [
"## Rechnen\n",
"\n",
"Welche Kosten sind insgesamt angefallen? Dazu muss man alle Einträge der Spalte `Preis` aufsummieren."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "3e68e583-7427-40fa-86bf-f87571d83a63",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"27279.390000000003"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[\"Preis\"].sum()"
]
},
{
"cell_type": "markdown",
"id": "9b631276-87e4-449b-9d07-52896b7fbc16",
"metadata": {},
"source": [
"## Aufgabe\n",
"\n",
"Was ist der Durchschnittsverbrauch des Autos über die gesamte erfasste Zeit?\n",
"\n",
"**Hinweise**\n",
"* Gesamtverbrauch, d.h. wieviele Liter wurden insgesamt verbraucht\n",
"* Fahrstrecke, d.h. km-Stand am Ende - km-Stand am Anfang\n",
"* Verbrauch wird i.d.R. in l/100km angegeben"
]
},
{
"cell_type": "markdown",
"id": "e2cfce0c-9eb6-4ac3-a2bb-43d085d20466",
"metadata": {},
"source": [
"## Lösung"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "081d24f8-cdc7-4273-80fb-c9e20a42f09b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5.867165899697274"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"l_total = d[\"Liter\"].sum()\n",
"\n",
"km_start = d[\"km\"].min() # oder d[\"km\"][0] -> km-Stand aus der 0ten Zeile\n",
"km_end = d[\"km\"].max() # oder d[\"km\"].iloc[-1] -> km-Stand aus der letzten Zeile,\n",
" # geht man von 0 eins zurück fängt man am Ende wieder an\n",
"\n",
"km_total = km_end - km_start\n",
"\n",
"fuel_avg = 100 * l_total / km_total\n",
"fuel_avg"
]
},
{
"cell_type": "markdown",
"id": "9b125258-956b-4a6e-8074-6f4956563bd0",
"metadata": {},
"source": [
"## Daten filtern\n",
"Wenn man nur an einem Teil der Daten interessiert ist kann man sich diesen selektieren."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "e15a92cf-d913-436e-8a51-440dfcbde914",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Datum</th>\n",
" <th>Typ</th>\n",
" <th>Beschreibung</th>\n",
" <th>Preis</th>\n",
" <th>km</th>\n",
" <th>Liter</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2012-07-10</td>\n",
" <td>Benzin</td>\n",
" <td>ESSO</td>\n",
" <td>57.01</td>\n",
" <td>199</td>\n",
" <td>34.89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2012-07-23</td>\n",
" <td>Benzin</td>\n",
" <td>Kaufland</td>\n",
" <td>55.03</td>\n",
" <td>828</td>\n",
" <td>34.20</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Datum Typ Beschreibung Preis km Liter\n",
"1 2012-07-10 Benzin ESSO 57.01 199 34.89\n",
"3 2012-07-23 Benzin Kaufland 55.03 828 34.20"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fuel_only = d[d[\"Typ\"] == \"Benzin\"]\n",
"fuel_only.head(2)"
]
},
{
"cell_type": "markdown",
"id": "561575e8-6193-44e1-956e-59a6e618d4d7",
"metadata": {},
"source": [
"## Aufgabe\n",
"1. Selektiere die Zeilen, bei denen der Preis kleiner als 100 ist.\n",
"1. Selektiere die Zeilen, bei denen der Preis gleich 50 ist.\n",
"1. Selektiere die Zeilen, bei denen Liter größer 36 und kleiner 38 sind."
]
},
{
"cell_type": "markdown",
"id": "c01cb1b4-dd7a-49f2-ae25-a13bd99de5fb",
"metadata": {},
"source": [
"## Lösung"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "bbe1a6eb-d467-4656-8c17-ab6f9e4ceeb2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Datum</th>\n",
" <th>Typ</th>\n",
" <th>Beschreibung</th>\n",
" <th>Preis</th>\n",
" <th>km</th>\n",
" <th>Liter</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2012-09-10</td>\n",
" <td>Benzin</td>\n",
" <td>AVIA</td>\n",
" <td>60.80</td>\n",
" <td>2061</td>\n",
" <td>36.87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>2012-09-14</td>\n",
" <td>Benzin</td>\n",
" <td>OMV</td>\n",
" <td>61.10</td>\n",
" <td>2710</td>\n",
" <td>36.83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106</th>\n",
" <td>2016-09-15</td>\n",
" <td>Benzin</td>\n",
" <td>Kaufland</td>\n",
" <td>48.51</td>\n",
" <td>48872</td>\n",
" <td>37.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124</th>\n",
" <td>2017-05-30</td>\n",
" <td>Benzin</td>\n",
" <td>Real</td>\n",
" <td>46.78</td>\n",
" <td>57868</td>\n",
" <td>36.01</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Datum Typ Beschreibung Preis km Liter\n",
"6 2012-09-10 Benzin AVIA 60.80 2061 36.87\n",
"7 2012-09-14 Benzin OMV 61.10 2710 36.83\n",
"106 2016-09-15 Benzin Kaufland 48.51 48872 37.93\n",
"124 2017-05-30 Benzin Real 46.78 57868 36.01"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[d[\"Preis\"] < 100]\n",
"d[d[\"Preis\"] == 50]\n",
"# d.described() sagte, dass der größte Wert 37.93 ist\n",
"d[d[\"Liter\"] > 36]\n",
"d[d[\"Liter\"].between(36,38)]"
]
},
{
"cell_type": "markdown",
"id": "8f1056bd-f6be-48a8-98ca-4f332989e45a",
"metadata": {},
"source": [
"## Daten Gruppieren\n",
"\n",
"Die Kosten sollen nach Spalte `Typ` bzw `Beschreibung` aufsummiert werden."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "3ecb95b7-de9d-4383-9162-ca1e5b5bf669",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Typ\n",
"Benzin 8704.15\n",
"Kauf 14296.68\n",
"Steuern 500.00\n",
"Versich 2596.02\n",
"Werkst 1182.54\n",
"Name: Preis, dtype: float64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.groupby(\"Typ\")[\"Preis\"].sum()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "9acedd39-f003-4a98-82ae-8082a2685d1c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Beschreibung\n",
"AGIP 32.80\n",
"ARAL 795.33\n",
"AVIA 671.59\n",
"Autohof 45.30\n",
"Avanti 64.10\n",
"BFT 88.66\n",
"BP 46.05\n",
"Bavaria 44.00\n",
"ESSO 1320.52\n",
"Elf 45.00\n",
"Globus 798.94\n",
"HEM 27.70\n",
"JET 1074.80\n",
"KK 183.34\n",
"Kaufland 1080.48\n",
"OMV 95.38\n",
"Oil 261.71\n",
"Real 222.15\n",
"SHELL 539.30\n",
"Star 99.80\n",
"Tango 56.85\n",
"Tankcenter 413.43\n",
"Total 481.25\n",
"UNO-X 43.36\n",
"Unbekannt 172.31\n",
"Name: Preis, dtype: float64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fuel_only.groupby(\"Beschreibung\")[\"Preis\"].sum()"
]
},
{
"cell_type": "markdown",
"id": "d38f01a2-ffb8-48a5-a646-517989735b3b",
"metadata": {},
"source": [
"Das ist doch nett. Noch schöner wäre aber, wenn die Daten nach der zweiten Spalte absteigend sortiert wären.\n",
"Zudem ist man oft nur an den größten Ergebnissen interessiert."
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "3152d46a-d38e-4e1c-8465-b13082c72638",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Beschreibung\n",
"ESSO 1320.52\n",
"Kaufland 1080.48\n",
"JET 1074.80\n",
"Globus 798.94\n",
"ARAL 795.33\n",
"AVIA 671.59\n",
"SHELL 539.30\n",
"Total 481.25\n",
"Tankcenter 413.43\n",
"Oil 261.71\n",
"Name: Preis, dtype: float64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fuel_only.groupby(\"Beschreibung\")[\"Preis\"].sum().sort_values(ascending=False).head(10)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}