{ "cells": [ { "cell_type": "markdown", "id": "d1ae5dfb", "metadata": {}, "source": [ "# Taller de manejo de Datos en Pandas" ] }, { "cell_type": "markdown", "id": "a92b73fc", "metadata": { "id": "--yl3E28USV0" }, "source": [ "## ¿Qué es?" ] }, { "cell_type": "markdown", "id": "f671da06", "metadata": { "id": "I3BqaExXUXqJ" }, "source": [ "Pandas es una librería de código abierto muy popular dentro del ámbito de Data Science y Machine Learning puesto que facilita el tratamiento y manipulación de datos.\n", "\n", "Pandas nació de la necesidad de tener todas las funcionalidades necesarias para un analista de datos en una sola librería: \n" ] }, { "cell_type": "markdown", "id": "a0528f6b-917b-4f7f-9338-b52d6c692932", "metadata": {}, "source": [ "\n", "
\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "id": "66c03d7b", "metadata": { "id": "ETNODbT9UnIV" }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "fecec799", "metadata": { "id": "jpUyTyMxUi-t" }, "source": [ "### ¿Cómo importar datos?" ] }, { "cell_type": "markdown", "id": "81a349b1", "metadata": { "id": "yDzCmP7cYGzE" }, "source": [ "Para leer los datos la estructura básica es *pd.read_tipo-archivo*. Si estamos en colab y queremos utilizar algunos datos que están en nuestro google drive podemos utilizar:\n", "\n", "\n", "```\n", " from google.colab import drive\n", " drive.mount('/content/gdrive')\n", "```\n", "[Primeros pasos Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)" ] }, { "cell_type": "code", "execution_count": 2, "id": "9cbc291d", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "w7qw_WJwVDtn", "outputId": "a65d6486-9088-4ae3-c35d-f9fdc87afd07" }, "outputs": [], "source": [ "#from google.colab import ##drive correr solo en colab\n", "#drive.mount('/content/gdrive')\n", "counties = pd.read_excel(\"https://github.com/AprendizajeProfundo/Libro_Fundamentos_Programacion/blob/main/Python/Datos/counties.xlsx?raw=true\")#cambiar dirección" ] }, { "cell_type": "markdown", "id": "c70ba762", "metadata": { "id": "Qaq5QnlUYvk7" }, "source": [ "Para tener una idea de qué variables y datos tengo en lo base de datos, sólo debemos llamarla." ] }, { "cell_type": "code", "execution_count": 3, "id": "a0544f2d", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 419 }, "id": "ke38Y4qqU6lS", "outputId": "ff9f5711-e549-4e52-fca6-cf78b88b35b6" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codestatecodecountycountypopulationarea
011001Auta#%&()uga54571.0594.436000
111003Baldwin#%&() ?182265.01589.784000
211005Barbour27457.0884.876000
311007Bi#%&()bb22915.0622.582000
411009Blount ?57322.0644.776000
..................
32297272151Yabucoa37941.055.215000
32307272153Ya_uco42043.068.192000
32317878010; St. Croix ?50601.083.345868
32327878020St. John4170.019.689867
32337878030St. Thomas51634.031.313503
\n", "

3234 rows × 5 columns

\n", "
" ], "text/plain": [ " codestate codecounty county population area\n", "0 1 1001 Auta#%&()uga 54571.0 594.436000\n", "1 1 1003 Baldwin#%&() ? 182265.0 1589.784000\n", "2 1 1005 Barbour 27457.0 884.876000\n", "3 1 1007 Bi#%&()bb 22915.0 622.582000\n", "4 1 1009 Blount ? 57322.0 644.776000\n", "... ... ... ... ... ...\n", "3229 72 72151 Yabucoa 37941.0 55.215000\n", "3230 72 72153 Ya_uco 42043.0 68.192000\n", "3231 78 78010 ; St. Croix ? 50601.0 83.345868\n", "3232 78 78020 St. John 4170.0 19.689867\n", "3233 78 78030 St. Thomas 51634.0 31.313503\n", "\n", "[3234 rows x 5 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties" ] }, { "cell_type": "markdown", "id": "316cbdf7", "metadata": { "id": "QC4MKt1VY32I" }, "source": [ "En lo anterior veiamos todas las variables con las 5 primeras y 5 últimas filas. Pero, podemos ver una cantidad determinada de registros tanto del inicio de la tabla como del final. Utlizando *.head()* y *.tail()* respectivamente." ] }, { "cell_type": "code", "execution_count": 4, "id": "7c8c42c4-ab99-4a3b-8071-bde23db2d72d", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "urd2HbgEWXv5", "outputId": "6041d85e-7113-4697-a236-345824a6f0ab" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codestatecodecountycountypopulationarea
011001Auta#%&()uga54571.0594.436
111003Baldwin#%&() ?182265.01589.784
211005Barbour27457.0884.876
\n", "
" ], "text/plain": [ " codestate codecounty county population area\n", "0 1 1001 Auta#%&()uga 54571.0 594.436\n", "1 1 1003 Baldwin#%&() ? 182265.0 1589.784\n", "2 1 1005 Barbour 27457.0 884.876" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties.head(3) #por defecto salen 5 primeros" ] }, { "cell_type": "code", "execution_count": 5, "id": "7375bbc1-826a-4802-bc83-e7ae1c9ebe70", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codestatecodecountycountypopulationarea
32327878020St. John4170.019.689867
32337878030St. Thomas51634.031.313503
\n", "
" ], "text/plain": [ " codestate codecounty county population area\n", "3232 78 78020 St. John 4170.0 19.689867\n", "3233 78 78030 St. Thomas 51634.0 31.313503" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties.tail(2)" ] }, { "cell_type": "markdown", "id": "31f4e22a", "metadata": { "id": "ZexVZppTZeIY" }, "source": [ "Si aparte de ver solamente la tabla, queremos más información de la base de datos, por ejemplo sus dimensiones y tipo de variables podemos utilizar lo siguiente:" ] }, { "cell_type": "code", "execution_count": 6, "id": "9aecfb01", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "pb6R6Sk1WRS1", "outputId": "8c1b1843-3418-4cc2-f4c8-fc1223ef68d7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3234 entries, 0 to 3233\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 codestate 3234 non-null int64 \n", " 1 codecounty 3234 non-null int64 \n", " 2 county 3234 non-null object \n", " 3 population 3232 non-null float64\n", " 4 area 3234 non-null float64\n", "dtypes: float64(2), int64(2), object(1)\n", "memory usage: 126.5+ KB\n", "\n", "\n", "La forma de la base de datos es: (3234, 5)\n" ] } ], "source": [ "counties.info()\n", "print(\"\\n\")\n", "print(\"La forma de la base de datos es:\",counties.shape) #shape:forma de un array" ] }, { "cell_type": "markdown", "id": "dbdb2f51-bbc2-4497-967b-5de59aff49af", "metadata": {}, "source": [ "[dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html?highlight=dtypes#pandas.DataFrame.dtypes)\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "12c3d4c1", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "14k3sgunWqWk", "outputId": "aac57289-e15c-47f0-b8e8-ca12a626d8bd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "codestate int64\n", "codecounty int64\n", "county object\n", "population float64\n", "area float64\n", "dtype: object\n", "\n", "\n", " codestate codecounty population area\n", "count 3234.000000 3234.000000 3.232000e+03 3234.000000\n", "mean 31.441868 31544.737786 9.679656e+04 1093.361817\n", "std 16.411236 16425.545223 3.088044e+05 3564.706999\n", "min 1.000000 1001.000000 1.700000e+01 0.031696\n", "25% 19.000000 19039.500000 1.129700e+04 416.360000\n", "50% 30.000000 30038.000000 2.607550e+04 602.977500\n", "75% 46.000000 46128.500000 6.566050e+04 913.884500\n", "max 78.000000 78030.000000 9.818605e+06 145504.789000\n", "\n", "\n", " codestate codecounty county population area\n", "count 3234.000000 3234.000000 3234 3.232000e+03 3234.000000\n", "unique NaN NaN 2625 NaN NaN\n", "top NaN NaN Lincoln NaN NaN\n", "freq NaN NaN 14 NaN NaN\n", "mean 31.441868 31544.737786 NaN 9.679656e+04 1093.361817\n", "std 16.411236 16425.545223 NaN 3.088044e+05 3564.706999\n", "min 1.000000 1001.000000 NaN 1.700000e+01 0.031696\n", "25% 19.000000 19039.500000 NaN 1.129700e+04 416.360000\n", "50% 30.000000 30038.000000 NaN 2.607550e+04 602.977500\n", "75% 46.000000 46128.500000 NaN 6.566050e+04 913.884500\n", "max 78.000000 78030.000000 NaN 9.818605e+06 145504.789000\n" ] } ], "source": [ "print(counties.dtypes) #tipo de objeto en cada columna\n", "print(\"\\n\")\n", "print(counties.describe())\n", "print(\"\\n\")\n", "print(counties.describe(include=\"all\")) #summary, si no se pone el include solo aparece de las variables numéricas" ] }, { "cell_type": "markdown", "id": "51e7c24f", "metadata": { "id": "_zvcDFcHXLOX" }, "source": [ "### Seleccionar subconjuntos de una base de datos\n", "\n", "\n", "[Documentación](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#min-tut-03-subset)" ] }, { "cell_type": "markdown", "id": "4a414ead", "metadata": { "id": "wAfaMuO1Z0yS" }, "source": [ "### Escoger columnas\n" ] }, { "cell_type": "markdown", "id": "c52d2d28", "metadata": { "id": "kl8bD-0YZx9X" }, "source": [ "Para ver el nombre de las columnas que existen, podemos utilizar *.columns*." ] }, { "cell_type": "code", "execution_count": 8, "id": "f7fe5d8c", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0LZ_7Nb-XVxK", "outputId": "165d5b6b-522b-4ed6-ea01-0b94b0e8de1c" }, "outputs": [ { "data": { "text/plain": [ "Index(['codestate', 'codecounty', 'county', 'population', 'area'], dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties.columns" ] }, { "cell_type": "markdown", "id": "3a095965", "metadata": { "id": "ZQ9QlKdFX2ak" }, "source": [ "Escoger una columna específica o un conjunto de ellas" ] }, { "cell_type": "code", "execution_count": 9, "id": "53d96bb5", "metadata": { "id": "E7qXeiNSXm-a" }, "outputs": [ { "data": { "text/plain": [ "0 Auta#%&()uga \n", "1 Baldwin#%&() ? \n", "2 Barbour \n", "3 Bi#%&()bb \n", "4 Blount ? \n", " ... \n", "3229 Yabucoa \n", "3230 Ya_uco \n", "3231 ; St. Croix ? \n", "3232 St. John\n", "3233 St. Thomas\n", "Name: county, Length: 3234, dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "condado = counties[\"county\"]\n", "condado" ] }, { "cell_type": "code", "execution_count": 10, "id": "45f90107", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3hfa1pljYd06", "outputId": "3444be49-3507-40e3-99df-a2033d4cf963" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3234,)\n", "(3234,)\n", "\n" ] } ], "source": [ "print(condado.shape) #array unidimensional con 3234 elementos\n", "print(counties[\"county\"].shape)\n", "print(type(condado))" ] }, { "cell_type": "code", "execution_count": 11, "id": "f1e32016", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vhvtzm4yaMlc", "outputId": "41c6784a-fa99-40fd-fe12-74b704377a55" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " codecounty county\n", "0 1001 Auta#%&()uga \n", "1 1003 Baldwin#%&() ? \n", "2 1005 Barbour \n", "3 1007 Bi#%&()bb \n", "4 1009 Blount ? \n", "... ... ...\n", "3229 72151 Yabucoa \n", "3230 72153 Ya_uco \n", "3231 78010 ; St. Croix ? \n", "3232 78020 St. John\n", "3233 78030 St. Thomas\n", "\n", "[3234 rows x 2 columns]\n", "\n" ] } ], "source": [ "codi_condado=counties[[\"codecounty\",\"county\"]]\n", "print(codi_condado)\n", "print(type(codi_condado))" ] }, { "cell_type": "markdown", "id": "4116b0db", "metadata": { "id": "iluJhOUgZ48R" }, "source": [ "## Escoger Filas" ] }, { "cell_type": "markdown", "id": "9e01112d", "metadata": { "id": "F-eAWEusaWmY" }, "source": [ "Podemos escoger ciertas filas a través de su posición. utilizando *.iloc*" ] }, { "cell_type": "code", "execution_count": 12, "id": "883a707b", "metadata": { "id": "aFOooZB5bYsv" }, "outputs": [], "source": [ "from seaborn import load_dataset" ] }, { "cell_type": "code", "execution_count": 13, "id": "7abbd0ca", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vEfryMcNZ6Rc", "outputId": "65108665-8aea-44b2-9885-4e9beaa24aff" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " codestate codecounty county population area\n", "0 1 1001 Auta#%&()uga 54571.0 594.436\n", "\n" ] } ], "source": [ "print(counties.iloc[0:1,])\n", "print(type(counties.iloc[0:1,]))" ] }, { "cell_type": "code", "execution_count": 14, "id": "c2808379", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "n4hayCXQbppm", "outputId": "fec81298-ea75-49c7-b2e9-8d9db0fdcfc3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " codecounty county\n", "8 1017 ; Chambers ? \n", "9 1019 Cherokee \n", "10 1021 ; Chilton \n", "11 1023 Choctaw \n" ] } ], "source": [ "print(counties.iloc[8:12,1:3])" ] }, { "cell_type": "markdown", "id": "e2bab2e6", "metadata": { "id": "kWzWN3Dtb8Yl" }, "source": [ "O también escoger filas con base a alguna condicion o característica específica." ] }, { "cell_type": "code", "execution_count": 15, "id": "a3b4ccde", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "2KLtpDRdcAxO", "outputId": "2c918fe7-caed-4de5-a774-7d764e82aa8d" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codestatecodecountycountypopulationarea
32297272151Yabucoa37941.055.215000
32307272153Ya_uco42043.068.192000
32317878010; St. Croix ?50601.083.345868
32327878020St. John4170.019.689867
32337878030St. Thomas51634.031.313503
\n", "
" ], "text/plain": [ " codestate codecounty county population area\n", "3229 72 72151 Yabucoa 37941.0 55.215000\n", "3230 72 72153 Ya_uco 42043.0 68.192000\n", "3231 78 78010 ; St. Croix ? 50601.0 83.345868\n", "3232 78 78020 St. John 4170.0 19.689867\n", "3233 78 78030 St. Thomas 51634.0 31.313503" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties.tail()" ] }, { "cell_type": "code", "execution_count": 16, "id": "e173670e", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MCWLpBvBeEO9", "outputId": "5ccdca98-4412-4ae9-8de1-bd446d2afe89" }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "3229 False\n", "3230 False\n", "3231 True\n", "3232 True\n", "3233 True\n", "Name: codestate, Length: 3234, dtype: bool" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties[\"codestate\"] > 72 #Verdadero o falso si cumple la condición" ] }, { "cell_type": "code", "execution_count": 17, "id": "56143fde", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HuP0xHljcuuk", "outputId": "3321f270-0a43-46a1-b098-c1515a0e2a4c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 3 entries, 3231 to 3233\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 codestate 3 non-null int64 \n", " 1 codecounty 3 non-null int64 \n", " 2 county 3 non-null object \n", " 3 population 3 non-null float64\n", " 4 area 3 non-null float64\n", "dtypes: float64(2), int64(2), object(1)\n", "memory usage: 144.0+ bytes\n" ] } ], "source": [ "codsat72 = counties[counties[\"codestate\"] > 72] # Llama aquellos cuyo valor es verdadero\n", "codsat72.info()" ] }, { "cell_type": "markdown", "id": "2f3725a1", "metadata": { "id": "WfqV78NkkkmP" }, "source": [ "Filas que no tengan valores faltantes en determinada columna" ] }, { "cell_type": "code", "execution_count": 18, "id": "e2a9ab43", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "h2tC0pQoktGq", "outputId": "885f67fb-8177-4485-a11a-c867b22e94e6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3232, 5)\n", "(3234, 5)\n" ] } ], "source": [ "no_na = counties[counties[\"population\"].notna()]\n", "\n", "print(no_na.shape)\n", "print(counties.shape)" ] }, { "cell_type": "markdown", "id": "b1224530", "metadata": { "id": "LCogEUWCfzRm" }, "source": [ "### Filas con valores específicos" ] }, { "cell_type": "markdown", "id": "2e5fea92", "metadata": { "id": "V40Kr5aNgtLC" }, "source": [ "Existen varias maneras de realizar estas búsquedas, algunas de ellas son utilizando *isin*, *loc* de la librería seaborn y *or*." ] }, { "cell_type": "code", "execution_count": 19, "id": "621f4271", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "bOrF8pDReLDC", "outputId": "df019a70-fd16-43bf-9d4e-79c4c7f68360" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codestatecodecountycountypopulationarea
32297272151Yabucoa37941.055.215000
32317878010; St. Croix ?50601.083.345868
\n", "
" ], "text/plain": [ " codestate codecounty county population area\n", "3229 72 72151 Yabucoa 37941.0 55.215000\n", "3231 78 78010 ; St. Croix ? 50601.0 83.345868" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "codcou1 = counties[counties[\"codecounty\"].isin([78010, 72151])]\n", "codcou1" ] }, { "cell_type": "code", "execution_count": 20, "id": "b430d94a", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "RSPwtxJOg6Bt", "outputId": "06f7ae61-7b26-4240-fc7f-f3a21eadee5e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codestatecodecountycountypopulationarea
32317878010; St. Croix ?50601.083.345868
\n", "
" ], "text/plain": [ " codestate codecounty county population area\n", "3231 78 78010 ; St. Croix ? 50601.0 83.345868" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "codcou2 = counties.loc[:, 'codecounty'] == 78010\n", "codcou2_ = counties.loc[codcou2]\n", "codcou2_" ] }, { "cell_type": "code", "execution_count": 21, "id": "261a22da", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "8oITWBdNh_AP", "outputId": "677d3e91-4272-4b01-c74c-03d3751952cb" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codestatecodecountycountypopulationarea
32297272151Yabucoa37941.055.215000
32317878010; St. Croix ?50601.083.345868
\n", "
" ], "text/plain": [ " codestate codecounty county population area\n", "3229 72 72151 Yabucoa 37941.0 55.215000\n", "3231 78 78010 ; St. Croix ? 50601.0 83.345868" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties[(counties[\"codecounty\"] == 78010) | (counties[\"codecounty\"] == 72151)]" ] }, { "cell_type": "markdown", "id": "162b866a", "metadata": { "id": "trezEwQCjlY-" }, "source": [ "Filas con más de una característica específica." ] }, { "cell_type": "code", "execution_count": 22, "id": "f5d9f523", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lEkdqc9Ujk3S", "outputId": "ba18d3e8-9ea2-4ad2-d73f-6f36d25fa0ea" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 16 entries, 3153 to 3230\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 codestate 16 non-null int64 \n", " 1 codecounty 16 non-null int64 \n", " 2 county 16 non-null object \n", " 3 population 16 non-null float64\n", " 4 area 16 non-null float64\n", "dtypes: float64(2), int64(2), object(1)\n", "memory usage: 768.0+ bytes\n" ] } ], "source": [ "counties[(counties[\"codestate\"] == 72) & (counties[\"area\"] >= 60)].info()" ] }, { "cell_type": "markdown", "id": "20107ed1", "metadata": { "id": "IIX73mph6Pxe" }, "source": [ "### Eliminar valores omitidos\n" ] }, { "cell_type": "code", "execution_count": 23, "id": "b9068d29", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 419 }, "id": "f3aCl-HD6SFT", "outputId": "1499c177-302d-4f46-e883-ee0b7a23dc60" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codestatecodecountycountypopulationarea
011001Auta#%&()uga54571.0594.436000
111003Baldwin#%&() ?182265.01589.784000
211005Barbour27457.0884.876000
311007Bi#%&()bb22915.0622.582000
411009Blount ?57322.0644.776000
..................
32297272151Yabucoa37941.055.215000
32307272153Ya_uco42043.068.192000
32317878010; St. Croix ?50601.083.345868
32327878020St. John4170.019.689867
32337878030St. Thomas51634.031.313503
\n", "

3232 rows × 5 columns

\n", "
" ], "text/plain": [ " codestate codecounty county population area\n", "0 1 1001 Auta#%&()uga 54571.0 594.436000\n", "1 1 1003 Baldwin#%&() ? 182265.0 1589.784000\n", "2 1 1005 Barbour 27457.0 884.876000\n", "3 1 1007 Bi#%&()bb 22915.0 622.582000\n", "4 1 1009 Blount ? 57322.0 644.776000\n", "... ... ... ... ... ...\n", "3229 72 72151 Yabucoa 37941.0 55.215000\n", "3230 72 72153 Ya_uco 42043.0 68.192000\n", "3231 78 78010 ; St. Croix ? 50601.0 83.345868\n", "3232 78 78020 St. John 4170.0 19.689867\n", "3233 78 78030 St. Thomas 51634.0 31.313503\n", "\n", "[3232 rows x 5 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Eliminar valores omitidos: dropna\n", "# axis= 0 -> Eliminar fila completa\n", "# axis=1 -> Eliminar columna completa\n", "counties.dropna(subset=[\"population\"], axis=0, inplace=False) #inplace:modifique directamente la BD" ] }, { "cell_type": "markdown", "id": "bfb060c4", "metadata": { "id": "Y6Atc2zvy6HP" }, "source": [ "## Manipulación de datos textuales (Limpieza)\n" ] }, { "cell_type": "markdown", "id": "17f56c2c", "metadata": { "id": "r_KAz4gxzNHD" }, "source": [ "Para realizarle limpieza a columna *county* de la base de datos counties, podemos utilizar lo siguiente:\n", "\n", "* **lower:**Poner en minúscula todo el texto.\n", "* **replace:**Remplazar ciertos valores por otros.\n", "* **strip:**Eliminar los espacios al principio y al final de la cadena.\n", "* **title:**Poner primera letra de cada palabra en mayúscula." ] }, { "cell_type": "code", "execution_count": 24, "id": "038f9f4c", "metadata": { "id": "aQjkSm5zzgWp" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_9972\\82757700.py:1: FutureWarning: The default value of regex will change from True to False in a future version.\n", " counties[\"county\"] = (counties[\"county\"]\n" ] } ], "source": [ "counties[\"county\"] = (counties[\"county\"]\n", " .str.lower()\n", " .str.replace(\"[^a-záéíóúüñ ]\",\"\")\n", " .str.replace(\" +\",\" \")\n", " .str.strip()\n", " .str.title()\n", " ) " ] }, { "cell_type": "code", "execution_count": 25, "id": "18aa7c59", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cQZQGImM1cRM", "outputId": "e44b1ec1-0316-4e6f-e1e4-1daeab6b5b22" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " codestate codecounty county population area\n", "0 1 1001 Autauga 54571.0 594.436000\n", "1 1 1003 Baldwin 182265.0 1589.784000\n", "2 1 1005 Barbour 27457.0 884.876000\n", "3 1 1007 Bibb 22915.0 622.582000\n", "4 1 1009 Blount 57322.0 644.776000\n", "... ... ... ... ... ...\n", "3229 72 72151 Yabucoa 37941.0 55.215000\n", "3230 72 72153 Yauco 42043.0 68.192000\n", "3231 78 78010 St Croix 50601.0 83.345868\n", "3232 78 78020 St John 4170.0 19.689867\n", "3233 78 78030 St Thomas 51634.0 31.313503\n", "\n", "[3234 rows x 5 columns]\n" ] } ], "source": [ "print(counties)" ] }, { "cell_type": "markdown", "id": "94b0b276", "metadata": { "id": "tIMVgKLV2yRB" }, "source": [ "## Creación de columnas a partir de otras" ] }, { "cell_type": "code", "execution_count": 26, "id": "f8163a7a", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9PPLxl1s24YT", "outputId": "30ed1f63-e00f-4a32-fe55-4dc6202f7aaf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " densidad population area\n", "0 91.802986 54571.0 594.436000\n", "1 114.647650 182265.0 1589.784000\n", "2 31.029206 27457.0 884.876000\n", "3 36.806397 22915.0 622.582000\n", "4 88.902192 57322.0 644.776000\n", "... ... ... ...\n", "3229 687.150231 37941.0 55.215000\n", "3230 616.538597 42043.0 68.192000\n", "3231 607.120676 50601.0 83.345868\n", "3232 211.784062 4170.0 19.689867\n", "3233 1648.937201 51634.0 31.313503\n", "\n", "[3234 rows x 3 columns]\n" ] } ], "source": [ "counties[\"densidad\"]= counties[\"population\"]/counties[\"area\"]\n", "print(counties[[\"densidad\",\"population\",\"area\"]])" ] }, { "cell_type": "markdown", "id": "fabc4c0f", "metadata": { "id": "8AL49oG6dmFC" }, "source": [ "### Renombrar columnas\n" ] }, { "cell_type": "code", "execution_count": 27, "id": "43eff214", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ewrd1B4w4uNu", "outputId": "8c9ce259-9174-4d20-a239-cb97c8ad0e6b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['codestate', 'codecounty', 'county', 'population', 'area', 'densidad'], dtype='object')\n", "Index(['codestate', 'codecounty', 'county', 'population', 'area',\n", " 'densidad_pob'],\n", " dtype='object')\n" ] } ], "source": [ "print(counties.columns)\n", "counties.rename(columns={\"densidad\":\"densidad_pob\"},inplace=True) #renombrar columna\n", "print(counties.columns) #Verificación" ] }, { "cell_type": "markdown", "id": "a32106a1", "metadata": { "id": "4UHdmkDo5Io2" }, "source": [ "### Modificar tipo de dato de una columna" ] }, { "cell_type": "code", "execution_count": 28, "id": "1ad486e6", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2lnOEwUI5AYo", "outputId": "858856ef-06bc-42d6-e169-f94fd80ff419" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3234 entries, 0 to 3233\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 codestate 3234 non-null int64 \n", " 1 codecounty 3234 non-null int64 \n", " 2 county 3234 non-null object \n", " 3 population 3232 non-null float64\n", " 4 area 3234 non-null float64\n", " 5 densidad_pob 3232 non-null float64\n", "dtypes: float64(3), int64(2), object(1)\n", "memory usage: 151.7+ KB\n", "None\n", "\n", "\n", "\n", "RangeIndex: 3234 entries, 0 to 3233\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 codestate 3234 non-null float64\n", " 1 codecounty 3234 non-null int64 \n", " 2 county 3234 non-null object \n", " 3 population 3232 non-null float64\n", " 4 area 3234 non-null float64\n", " 5 densidad_pob 3232 non-null float64\n", "dtypes: float64(4), int64(1), object(1)\n", "memory usage: 151.7+ KB\n", "None\n" ] } ], "source": [ "print(counties.info())\n", "print(\"\\n\")\n", "counties[\"codestate\"]=counties[\"codestate\"].astype(\"float\")\n", "print(counties.info())" ] }, { "cell_type": "markdown", "id": "1cbb5206", "metadata": { "id": "ZfVG-6vg5wb6" }, "source": [ "### Eliminar filas y columnas\n" ] }, { "cell_type": "markdown", "id": "5e011d29", "metadata": { "id": "pFzdWnnkfW2D" }, "source": [ "Eliminación filas, puede ser por posición o que cumpla una característica." ] }, { "cell_type": "code", "execution_count": 29, "id": "0742e9a6", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qk8rcvr67BaZ", "outputId": "a13ec9be-d4c4-4cee-8c16-8a3310bf03a3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3234, 6)\n", "(3231, 6)\n" ] } ], "source": [ "print(counties.shape)\n", "counties = counties.drop(counties.iloc[0:3,].index)\n", "counties = counties.drop(counties[counties['codecounty']==70].index)\n", "print(counties.shape)" ] }, { "cell_type": "code", "execution_count": 30, "id": "5e9427d3", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "FYZTnrdE7esV", "outputId": "284a10c7-ccdf-46d1-e98f-a7bc3b4c3590" }, "outputs": [ { "data": { "text/plain": [ "Index(['codestate', 'codecounty', 'county', 'population', 'area'], dtype='object')" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties.drop(['densidad_pob'], axis=1,inplace=True) #inplace=True, para que me modifique la BD\n", "counties.columns" ] }, { "cell_type": "markdown", "id": "37a01825", "metadata": { "id": "hdms92yv8UuT" }, "source": [ "### Conteos\n" ] }, { "cell_type": "markdown", "id": "a8abc044", "metadata": {}, "source": [ "La función `groupby` devuelve un objeto `DataFrameGroupBy` que contiene la información agrupada según las especificaciones que indiquemos." ] }, { "cell_type": "code", "execution_count": 31, "id": "1f5153cd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "group=counties.groupby(\"codestate\")\n", "print(type(group))" ] }, { "cell_type": "markdown", "id": "2bbcf491", "metadata": {}, "source": [ "Podemos obtener detalles del anterior dataframe" ] }, { "cell_type": "code", "execution_count": 32, "id": "9a572e7c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " codecounty county population area\n", "codestate \n", "1.0 1007 Bibb 22915.0 622.582\n", "2.0 2013 Aleutians East 3141.0 6981.943\n", "4.0 4001 Apache 71518.0 11197.521\n", "5.0 5001 Arkansas 19019.0 988.770\n", "6.0 6001 Alameda 1510271.0 739.017\n", "8.0 8001 Adams 441603.0 1167.653\n", "9.0 9001 Fairfield 916829.0 624.890\n", "10.0 10001 Kent 162310.0 586.179\n", "11.0 11001 District Of Columbia 601723.0 61.048\n", "12.0 12001 Alachua 247336.0 875.020\n", "\n", "\n", " codestate codecounty county population area\n", "3231 78.0 78010 St Croix 50601.0 83.345868\n", "3232 78.0 78020 St John 4170.0 19.689867\n", "3233 78.0 78030 St Thomas 51634.0 31.313503\n" ] } ], "source": [ "print(group.first().head(10)) # Nos muestra el primer elemento de cada grupo según \"codestate\"\n", "print('\\n')\n", "print(group.get_group(78)) #Nos muestra los elementos cuyo codestate es 78" ] }, { "cell_type": "code", "execution_count": 33, "id": "fceb996d", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "k6Mo-11A8YET", "outputId": "7bbdb72b-c607-4a41-9406-008eb4bd65de" }, "outputs": [ { "data": { "text/plain": [ "codestate\n", "1.0 64\n", "2.0 29\n", "4.0 15\n", "5.0 75\n", "6.0 58\n", "8.0 64\n", "9.0 8\n", "10.0 3\n", "11.0 1\n", "12.0 67\n", "Name: codecounty, dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties.groupby(\"codestate\")[\"codecounty\"].count().head(10)" ] }, { "cell_type": "markdown", "id": "4e2262b7", "metadata": {}, "source": [ "## Modificar Tablas\n", "\n", "\n", "es posible hacer cambios de orden y estructura en las tablas de pandas." ] }, { "cell_type": "code", "execution_count": 34, "id": "fa3de7fc", "metadata": {}, "outputs": [], "source": [ "folder_path=\"https://github.com/AprendizajeProfundo/Libro_Fundamentos_Programacion/blob/main/Python/Datos/\"" ] }, { "cell_type": "code", "execution_count": 35, "id": "8fa1311b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardemocratrepublicothercodecounty
020004942119932731001
12000139974087216111003
22000518850961111005
32000271042731181007
420004977126673291009
\n", "
" ], "text/plain": [ " year democrat republic other codecounty\n", "0 2000 4942 11993 273 1001\n", "1 2000 13997 40872 1611 1003\n", "2 2000 5188 5096 111 1005\n", "3 2000 2710 4273 118 1007\n", "4 2000 4977 12667 329 1009" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections=pd.read_excel(folder_path+'elections.xlsx?raw=true')\n", "elections.head(5)" ] }, { "cell_type": "markdown", "id": "8c5e43fc", "metadata": {}, "source": [ "organicemos los datos descendentemente depende de la cantidad de votos democratas" ] }, { "cell_type": "code", "execution_count": 36, "id": "837cf338", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardemocratrepublicothercodecounty
12626201624643647697432002016037
640020082295853956425659706037
951320122216903885333788316037
3287200419077361076225376866037
175200017105058719301127196037
..................
119812012954148301
885220088151448269
1196520125139148269
1507820165149548269
150942016458348301
\n", "

15563 rows × 5 columns

\n", "
" ], "text/plain": [ " year democrat republic other codecounty\n", "12626 2016 2464364 769743 200201 6037\n", "6400 2008 2295853 956425 65970 6037\n", "9513 2012 2216903 885333 78831 6037\n", "3287 2004 1907736 1076225 37686 6037\n", "175 2000 1710505 871930 112719 6037\n", "... ... ... ... ... ...\n", "11981 2012 9 54 1 48301\n", "8852 2008 8 151 4 48269\n", "11965 2012 5 139 1 48269\n", "15078 2016 5 149 5 48269\n", "15094 2016 4 58 3 48301\n", "\n", "[15563 rows x 5 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections.sort_values([\"democrat\"], ascending=False)" ] }, { "cell_type": "markdown", "id": "9588e30a", "metadata": {}, "source": [ "usando `groupby()` podemos juntar los datos por año y sumarlos para tener el total de votos por año" ] }, { "cell_type": "code", "execution_count": 37, "id": "5255ae0d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
democratrepublic
year
20005092296450290498
20045861407361603653
20086920458259703000
20126550896160460609
20166562639862790289
\n", "
" ], "text/plain": [ " democrat republic\n", "year \n", "2000 50922964 50290498\n", "2004 58614073 61603653\n", "2008 69204582 59703000\n", "2012 65508961 60460609\n", "2016 65626398 62790289" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections.groupby('year')[['democrat', 'republic']].sum()" ] }, { "cell_type": "markdown", "id": "f6a0fd1f", "metadata": {}, "source": [ "Podemos aplicar ambos métodos en un mismo código" ] }, { "cell_type": "code", "execution_count": 38, "id": "cfde894b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
democratrepublic
year
20086920458259703000
20166562639862790289
20126550896160460609
20045861407361603653
20005092296450290498
\n", "
" ], "text/plain": [ " democrat republic\n", "year \n", "2008 69204582 59703000\n", "2016 65626398 62790289\n", "2012 65508961 60460609\n", "2004 58614073 61603653\n", "2000 50922964 50290498" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "elections_sort=(elections.groupby('year')[['democrat', 'republic']]\n", " .sum()\n", " .sort_values('democrat', ascending=False))\n", "display(elections_sort)" ] }, { "cell_type": "markdown", "id": "198b44d8", "metadata": {}, "source": [ "Para reestructurar los datos usamos la función `pivot()`." ] }, { "cell_type": "code", "execution_count": 39, "id": "83d44484", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
democratrepublic
year2000200420082012201620002004200820122016
codecounty
10014942.04758.06093.06363.05936.011993.015196.017403.017379.018172.0
100313997.015599.019386.018424.018458.040872.052971.061271.066016.072883.0
10055188.04832.05697.05912.04871.05096.05899.05866.05550.05454.0
10072710.02089.02299.02202.01874.04273.05472.06262.06132.06738.0
10094977.03938.03522.02970.02156.012667.017386.020389.020757.022859.0
\n", "
" ], "text/plain": [ " democrat republic \\\n", "year 2000 2004 2008 2012 2016 2000 2004 \n", "codecounty \n", "1001 4942.0 4758.0 6093.0 6363.0 5936.0 11993.0 15196.0 \n", "1003 13997.0 15599.0 19386.0 18424.0 18458.0 40872.0 52971.0 \n", "1005 5188.0 4832.0 5697.0 5912.0 4871.0 5096.0 5899.0 \n", "1007 2710.0 2089.0 2299.0 2202.0 1874.0 4273.0 5472.0 \n", "1009 4977.0 3938.0 3522.0 2970.0 2156.0 12667.0 17386.0 \n", "\n", " \n", "year 2008 2012 2016 \n", "codecounty \n", "1001 17403.0 17379.0 18172.0 \n", "1003 61271.0 66016.0 72883.0 \n", "1005 5866.0 5550.0 5454.0 \n", "1007 6262.0 6132.0 6738.0 \n", "1009 20389.0 20757.0 22859.0 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pivot_elections=elections.pivot(index='codecounty',\n", " columns='year',\n", " values=['democrat','republic'])\n", "pivot_elections.head()" ] }, { "cell_type": "markdown", "id": "348efd66", "metadata": {}, "source": [ "## Múltiples índices\n" ] }, { "cell_type": "markdown", "id": "9b69a16b", "metadata": {}, "source": [ "Se pueden elegir múltiples variables como índice del dataframe. Esto es util para facilitar la extracción de información en ciertos casos" ] }, { "cell_type": "code", "execution_count": 40, "id": "5b49ea0d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countypopulationarea
codestatecodecounty
11001Auta#%&()uga54571.0594.436000
1003Baldwin#%&() ?182265.01589.784000
1005Barbour27457.0884.876000
1007Bi#%&()bb22915.0622.582000
1009Blount ?57322.0644.776000
...............
7272151Yabucoa37941.055.215000
72153Ya_uco42043.068.192000
7878010; St. Croix ?50601.083.345868
78020St. John4170.019.689867
78030St. Thomas51634.031.313503
\n", "

3234 rows × 3 columns

\n", "
" ], "text/plain": [ " county population area\n", "codestate codecounty \n", "1 1001 Auta#%&()uga 54571.0 594.436000\n", " 1003 Baldwin#%&() ? 182265.0 1589.784000\n", " 1005 Barbour 27457.0 884.876000\n", " 1007 Bi#%&()bb 22915.0 622.582000\n", " 1009 Blount ? 57322.0 644.776000\n", "... ... ... ...\n", "72 72151 Yabucoa 37941.0 55.215000\n", " 72153 Ya_uco 42043.0 68.192000\n", "78 78010 ; St. Croix ? 50601.0 83.345868\n", " 78020 St. John 4170.0 19.689867\n", " 78030 St. Thomas 51634.0 31.313503\n", "\n", "[3234 rows x 3 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "counties_multi=pd.read_excel(folder_path+\"counties.xlsx?raw=true\", index_col=[0, 1])\n", "display(counties_multi)" ] }, { "cell_type": "markdown", "id": "f9664669", "metadata": {}, "source": [ "Podemos obtener la suma de población de cada estado" ] }, { "cell_type": "code", "execution_count": 41, "id": "f0fb3462", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_9972\\18339036.py:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().\n", " counties_multi['population'].sum(level='codestate').head(5)\n" ] }, { "data": { "text/plain": [ "codestate\n", "1 4779736.0\n", "2 710231.0\n", "4 6392017.0\n", "5 2915918.0\n", "6 37253956.0\n", "Name: population, dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counties_multi['population'].sum(level='codestate').head(5)" ] }, { "cell_type": "markdown", "id": "c274935c", "metadata": {}, "source": [ "## Concatenar y unir\n", "\n", "Es posible unir varias tablas tanto vertical como horizontalmente.\n", "\n", "\n", "En ambos casos, podemos usar `concat()`, y se juntarán las bases con bases a los nombres de columnas o los indices de las filas" ] }, { "cell_type": "code", "execution_count": 42, "id": "e60fcdfc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardemocratrepublicothercodecounty
020004942119932731001
12000139974087216111003
22000518850961111005
32000271042731181007
420004977126673291009
..................
310720005521942574556037
310820004019545495856039
310920001650546929956041
31102000806313810756043
3111200044925219056045
\n", "

3112 rows × 5 columns

\n", "
" ], "text/plain": [ " year democrat republic other codecounty\n", "0 2000 4942 11993 273 1001\n", "1 2000 13997 40872 1611 1003\n", "2 2000 5188 5096 111 1005\n", "3 2000 2710 4273 118 1007\n", "4 2000 4977 12667 329 1009\n", "... ... ... ... ... ...\n", "3107 2000 5521 9425 745 56037\n", "3108 2000 4019 5454 958 56039\n", "3109 2000 1650 5469 299 56041\n", "3110 2000 806 3138 107 56043\n", "3111 2000 449 2521 90 56045\n", "\n", "[3112 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "elections_2000=elections[elections['year']==2000]\n", "display(elections_2000) #display:ejecuta el método dunder apropiado para obtener los datos apropiados para mostrar" ] }, { "cell_type": "code", "execution_count": 43, "id": "b4191d40", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_9972\\3665125724.py:2: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " elections_2004['dummy']=0\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardemocratrepublicothercodecountydummy
3112200447581519612710010
31132004155995297175010030
31142004483258994610050
31152004208954723910070
3116200439381738618010090
.....................
62202004520810653411560370
6221200459725124263560390
6222200418156081185560410
62232004855320059560430
62242004580273973560450
\n", "

3113 rows × 6 columns

\n", "
" ], "text/plain": [ " year democrat republic other codecounty dummy\n", "3112 2004 4758 15196 127 1001 0\n", "3113 2004 15599 52971 750 1003 0\n", "3114 2004 4832 5899 46 1005 0\n", "3115 2004 2089 5472 39 1007 0\n", "3116 2004 3938 17386 180 1009 0\n", "... ... ... ... ... ... ...\n", "6220 2004 5208 10653 411 56037 0\n", "6221 2004 5972 5124 263 56039 0\n", "6222 2004 1815 6081 185 56041 0\n", "6223 2004 855 3200 59 56043 0\n", "6224 2004 580 2739 73 56045 0\n", "\n", "[3113 rows x 6 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "elections_2004=elections[elections['year']==2004]\n", "elections_2004['dummy']=0\n", "display(elections_2004)" ] }, { "cell_type": "markdown", "id": "505bc134", "metadata": {}, "source": [ "Si se hace la concatenación sin más, se tomarán todas las columnas y se agregarán NaN" ] }, { "cell_type": "code", "execution_count": 44, "id": "d575d94a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardemocratrepublicothercodecountydummy
020004942119932731001NaN
12000139974087216111003NaN
22000518850961111005NaN
32000271042731181007NaN
420004977126673291009NaN
.....................
62202004520810653411560370.0
6221200459725124263560390.0
6222200418156081185560410.0
62232004855320059560430.0
62242004580273973560450.0
\n", "

6225 rows × 6 columns

\n", "
" ], "text/plain": [ " year democrat republic other codecounty dummy\n", "0 2000 4942 11993 273 1001 NaN\n", "1 2000 13997 40872 1611 1003 NaN\n", "2 2000 5188 5096 111 1005 NaN\n", "3 2000 2710 4273 118 1007 NaN\n", "4 2000 4977 12667 329 1009 NaN\n", "... ... ... ... ... ... ...\n", "6220 2004 5208 10653 411 56037 0.0\n", "6221 2004 5972 5124 263 56039 0.0\n", "6222 2004 1815 6081 185 56041 0.0\n", "6223 2004 855 3200 59 56043 0.0\n", "6224 2004 580 2739 73 56045 0.0\n", "\n", "[6225 rows x 6 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "elections_00_04=pd.concat([elections_2000, elections_2004])\n", "display(elections_00_04)" ] }, { "cell_type": "markdown", "id": "f248c759", "metadata": {}, "source": [ "Ahora, usando `merge()`, podemos hacer uniones de las tablas de forma horizontal que compartan una columna/indice en común" ] }, { "cell_type": "markdown", "id": "34425212", "metadata": {}, "source": [ "
\n", "
\n", "\n", "
\n", "
\n", "

Tipos de Joins\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 45, "id": "59163972", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codestatecodecountycountypopulationarea
31.01007Bibb22915.0622.582000
41.01009Blount57322.0644.776000
51.01011Bullock10914.0622.805000
61.01013Butler20947.0776.829000
71.01015Calhoun118572.0605.868000
..................
322972.072151Yabucoa37941.055.215000
323072.072153Yauco42043.068.192000
323178.078010St Croix50601.083.345868
323278.078020St John4170.019.689867
323378.078030St Thomas51634.031.313503
\n", "

3231 rows × 5 columns

\n", "
" ], "text/plain": [ " codestate codecounty county population area\n", "3 1.0 1007 Bibb 22915.0 622.582000\n", "4 1.0 1009 Blount 57322.0 644.776000\n", "5 1.0 1011 Bullock 10914.0 622.805000\n", "6 1.0 1013 Butler 20947.0 776.829000\n", "7 1.0 1015 Calhoun 118572.0 605.868000\n", "... ... ... ... ... ...\n", "3229 72.0 72151 Yabucoa 37941.0 55.215000\n", "3230 72.0 72153 Yauco 42043.0 68.192000\n", "3231 78.0 78010 St Croix 50601.0 83.345868\n", "3232 78.0 78020 St John 4170.0 19.689867\n", "3233 78.0 78030 St Thomas 51634.0 31.313503\n", "\n", "[3231 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(counties)" ] }, { "cell_type": "code", "execution_count": 46, "id": "eb2b5eca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardemocratrepublicothercodecountycodestatecountypopulationarea
020002710427311810071.0Bibb22915.0622.582
12004208954723910071.0Bibb22915.0622.582
22008229962628310071.0Bibb22915.0622.582
32012220261328610071.0Bibb22915.0622.582
420161874673820710071.0Bibb22915.0622.582
5200049771266732910091.0Blount57322.0644.776
6200439381738618010091.0Blount57322.0644.776
7200835222038935610091.0Blount57322.0644.776
8201229702075727910091.0Blount57322.0644.776
9201621562285957310091.0Blount57322.0644.776
\n", "
" ], "text/plain": [ " year democrat republic other codecounty codestate county population \\\n", "0 2000 2710 4273 118 1007 1.0 Bibb 22915.0 \n", "1 2004 2089 5472 39 1007 1.0 Bibb 22915.0 \n", "2 2008 2299 6262 83 1007 1.0 Bibb 22915.0 \n", "3 2012 2202 6132 86 1007 1.0 Bibb 22915.0 \n", "4 2016 1874 6738 207 1007 1.0 Bibb 22915.0 \n", "5 2000 4977 12667 329 1009 1.0 Blount 57322.0 \n", "6 2004 3938 17386 180 1009 1.0 Blount 57322.0 \n", "7 2008 3522 20389 356 1009 1.0 Blount 57322.0 \n", "8 2012 2970 20757 279 1009 1.0 Blount 57322.0 \n", "9 2016 2156 22859 573 1009 1.0 Blount 57322.0 \n", "\n", " area \n", "0 622.582 \n", "1 622.582 \n", "2 622.582 \n", "3 622.582 \n", "4 622.582 \n", "5 644.776 \n", "6 644.776 \n", "7 644.776 \n", "8 644.776 \n", "9 644.776 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inner_joined=pd.merge(elections, counties)\n", "inner_joined.head(10)" ] }, { "cell_type": "code", "execution_count": 47, "id": "630e9319", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardemocratrepublicothercodecountycodestatecountypopulationarea
02000.04942.011993.0273.01001NaNNaNNaNNaN
12004.04758.015196.0127.01001NaNNaNNaNNaN
22008.06093.017403.0145.01001NaNNaNNaNNaN
32012.06363.017379.0190.01001NaNNaNNaNNaN
42016.05936.018172.0865.01001NaNNaNNaNNaN
52000.013997.040872.01611.01003NaNNaNNaNNaN
62004.015599.052971.0750.01003NaNNaNNaNNaN
72008.019386.061271.0756.01003NaNNaNNaNNaN
82012.018424.066016.0898.01003NaNNaNNaNNaN
92016.018458.072883.03874.01003NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " year democrat republic other codecounty codestate county \\\n", "0 2000.0 4942.0 11993.0 273.0 1001 NaN NaN \n", "1 2004.0 4758.0 15196.0 127.0 1001 NaN NaN \n", "2 2008.0 6093.0 17403.0 145.0 1001 NaN NaN \n", "3 2012.0 6363.0 17379.0 190.0 1001 NaN NaN \n", "4 2016.0 5936.0 18172.0 865.0 1001 NaN NaN \n", "5 2000.0 13997.0 40872.0 1611.0 1003 NaN NaN \n", "6 2004.0 15599.0 52971.0 750.0 1003 NaN NaN \n", "7 2008.0 19386.0 61271.0 756.0 1003 NaN NaN \n", "8 2012.0 18424.0 66016.0 898.0 1003 NaN NaN \n", "9 2016.0 18458.0 72883.0 3874.0 1003 NaN NaN \n", "\n", " population area \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "5 NaN NaN \n", "6 NaN NaN \n", "7 NaN NaN \n", "8 NaN NaN \n", "9 NaN NaN " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outer_joined=pd.merge(elections, counties, how='outer')\n", "outer_joined.head(10)" ] }, { "cell_type": "code", "execution_count": 48, "id": "53403f5f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardemocratrepublicothercodecountycodestatecountypopulationarea
020004942119932731001NaNNaNNaNNaN
12000139974087216111003NaNNaNNaNNaN
22000518850961111005NaNNaNNaNNaN
320002710427311810071.0Bibb22915.0622.582
4200049771266732910091.0Blount57322.0644.776
52000339514337610111.0Bullock10914.0622.805
62000360641277010131.0Butler20947.0776.829
72000157812230682210151.0Calhoun118572.0605.868
820005616603718110171.0Chambers34215.0596.531
920003497415417210191.0Cherokee25989.0553.700
\n", "
" ], "text/plain": [ " year democrat republic other codecounty codestate county \\\n", "0 2000 4942 11993 273 1001 NaN NaN \n", "1 2000 13997 40872 1611 1003 NaN NaN \n", "2 2000 5188 5096 111 1005 NaN NaN \n", "3 2000 2710 4273 118 1007 1.0 Bibb \n", "4 2000 4977 12667 329 1009 1.0 Blount \n", "5 2000 3395 1433 76 1011 1.0 Bullock \n", "6 2000 3606 4127 70 1013 1.0 Butler \n", "7 2000 15781 22306 822 1015 1.0 Calhoun \n", "8 2000 5616 6037 181 1017 1.0 Chambers \n", "9 2000 3497 4154 172 1019 1.0 Cherokee \n", "\n", " population area \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 22915.0 622.582 \n", "4 57322.0 644.776 \n", "5 10914.0 622.805 \n", "6 20947.0 776.829 \n", "7 118572.0 605.868 \n", "8 34215.0 596.531 \n", "9 25989.0 553.700 " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left_joined=pd.merge(elections, counties, how='left')\n", "left_joined.head(10)" ] }, { "cell_type": "code", "execution_count": 49, "id": "a37b5382", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardemocratrepublicothercodecountycodestatecountypopulationarea
02000.02710.04273.0118.010071.0Bibb22915.0622.582
12004.02089.05472.039.010071.0Bibb22915.0622.582
22008.02299.06262.083.010071.0Bibb22915.0622.582
32012.02202.06132.086.010071.0Bibb22915.0622.582
42016.01874.06738.0207.010071.0Bibb22915.0622.582
52000.04977.012667.0329.010091.0Blount57322.0644.776
62004.03938.017386.0180.010091.0Blount57322.0644.776
72008.03522.020389.0356.010091.0Blount57322.0644.776
82012.02970.020757.0279.010091.0Blount57322.0644.776
92016.02156.022859.0573.010091.0Blount57322.0644.776
\n", "
" ], "text/plain": [ " year democrat republic other codecounty codestate county \\\n", "0 2000.0 2710.0 4273.0 118.0 1007 1.0 Bibb \n", "1 2004.0 2089.0 5472.0 39.0 1007 1.0 Bibb \n", "2 2008.0 2299.0 6262.0 83.0 1007 1.0 Bibb \n", "3 2012.0 2202.0 6132.0 86.0 1007 1.0 Bibb \n", "4 2016.0 1874.0 6738.0 207.0 1007 1.0 Bibb \n", "5 2000.0 4977.0 12667.0 329.0 1009 1.0 Blount \n", "6 2004.0 3938.0 17386.0 180.0 1009 1.0 Blount \n", "7 2008.0 3522.0 20389.0 356.0 1009 1.0 Blount \n", "8 2012.0 2970.0 20757.0 279.0 1009 1.0 Blount \n", "9 2016.0 2156.0 22859.0 573.0 1009 1.0 Blount \n", "\n", " population area \n", "0 22915.0 622.582 \n", "1 22915.0 622.582 \n", "2 22915.0 622.582 \n", "3 22915.0 622.582 \n", "4 22915.0 622.582 \n", "5 57322.0 644.776 \n", "6 57322.0 644.776 \n", "7 57322.0 644.776 \n", "8 57322.0 644.776 \n", "9 57322.0 644.776 " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "right_joined=pd.merge(elections, counties, how='right')\n", "right_joined.head(10)" ] }, { "cell_type": "markdown", "id": "a049daf6", "metadata": {}, "source": [ "## Series de tiempo y time stamps\n" ] }, { "cell_type": "markdown", "id": "eb89cc20", "metadata": {}, "source": [ "Existe un tipo de datos llamado \"timestamp\" que se usa para medir registros de tiempo. Es posible pasar strings a este tipo de dato en diferentes formatos y trabajar con este usándolo como indice" ] }, { "cell_type": "code", "execution_count": 50, "id": "11f753ec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeR1R2R3
0Jan-7926.6415.5011.15
1Feb-7927.2016.5810.62
2Mar-7927.8718.859.02
3Apr-7923.2217.235.99
4May-7923.2716.047.23
\n", "
" ], "text/plain": [ " Time R1 R2 R3\n", "0 Jan-79 26.64 15.50 11.15\n", "1 Feb-79 27.20 16.58 10.62\n", "2 Mar-79 27.87 18.85 9.02\n", "3 Apr-79 23.22 17.23 5.99\n", "4 May-79 23.27 16.04 7.23" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airrpm=pd.read_csv(folder_path+'airrpm.txt?raw=true', header=None, delimiter= '\\s+', decimal=\",\")\n", "airrpm.columns=[\"Time\", \"R1\", \"R2\", \"R3\"]\n", "airrpm.head()" ] }, { "cell_type": "markdown", "id": "4ddc8116", "metadata": {}, "source": [ "La variable \"Time\" en este momento es un string. Pero podemos transformarlo usando `to_datetime()`. Para obtener más información de qué se podría poner en `format`, pueden revisar este [link](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)" ] }, { "cell_type": "code", "execution_count": 51, "id": "40e9ed94", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
R1R2R3
Time
1979-01-0126.6415.5011.15
1979-02-0127.2016.5810.62
1979-03-0127.8718.859.02
1979-04-0123.2217.235.99
1979-05-0123.2716.047.23
\n", "
" ], "text/plain": [ " R1 R2 R3\n", "Time \n", "1979-01-01 26.64 15.50 11.15\n", "1979-02-01 27.20 16.58 10.62\n", "1979-03-01 27.87 18.85 9.02\n", "1979-04-01 23.22 17.23 5.99\n", "1979-05-01 23.27 16.04 7.23" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airrpm['Time']=pd.to_datetime(airrpm['Time'], format='%b-%y')\n", "airrpm.index=airrpm['Time']\n", "airrpm=airrpm.drop(['Time'], axis=1)\n", "airrpm.head()" ] }, { "cell_type": "markdown", "id": "39bafd28", "metadata": {}, "source": [ "Definiendo la fecha como índice, podemos extraer información con respecto al año y respecto al mes" ] }, { "cell_type": "code", "execution_count": 52, "id": "8130d4ca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
Time
126.043.7365389.77840226.6436.657544.8251.8659.36
225.043.6808009.98369627.2035.540044.7452.7359.74
325.044.51440010.36333627.8735.700045.0153.2660.10
425.044.28720010.48033123.2236.610045.6352.7960.49
525.043.89160010.52481423.2734.600045.1152.5760.36
625.045.65560010.58695127.3035.420047.2954.5561.88
725.046.47840010.33391029.2838.270047.7954.9463.02
825.046.77360010.61316927.2738.860048.0655.3963.71
925.044.2024009.73124627.1337.260046.3951.8359.54
1025.044.43320010.09383226.9737.850045.8951.5660.26
1125.043.8412009.92371926.1937.330045.5051.0259.21
1225.044.6004009.57327727.3238.750045.3351.9857.65
\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "Time \n", "1 26.0 43.736538 9.778402 26.64 36.6575 44.82 51.86 59.36\n", "2 25.0 43.680800 9.983696 27.20 35.5400 44.74 52.73 59.74\n", "3 25.0 44.514400 10.363336 27.87 35.7000 45.01 53.26 60.10\n", "4 25.0 44.287200 10.480331 23.22 36.6100 45.63 52.79 60.49\n", "5 25.0 43.891600 10.524814 23.27 34.6000 45.11 52.57 60.36\n", "6 25.0 45.655600 10.586951 27.30 35.4200 47.29 54.55 61.88\n", "7 25.0 46.478400 10.333910 29.28 38.2700 47.79 54.94 63.02\n", "8 25.0 46.773600 10.613169 27.27 38.8600 48.06 55.39 63.71\n", "9 25.0 44.202400 9.731246 27.13 37.2600 46.39 51.83 59.54\n", "10 25.0 44.433200 10.093832 26.97 37.8500 45.89 51.56 60.26\n", "11 25.0 43.841200 9.923719 26.19 37.3300 45.50 51.02 59.21\n", "12 25.0 44.600400 9.573277 27.32 38.7500 45.33 51.98 57.65" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airrpm.groupby([airrpm.index.month])['R1'].describe()" ] }, { "cell_type": "code", "execution_count": 53, "id": "e3893d47", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
R1R2R3
Time
1979328.83210.49118.34
1980340.95200.92140.05
1981341.96198.53143.45
1982355.17210.23144.91
1983374.03226.76147.26
1984415.88243.71172.18
1985440.43271.80168.61
1986492.37302.20190.18
1987522.07324.58197.52
1988530.25328.69201.57
\n", "
" ], "text/plain": [ " R1 R2 R3\n", "Time \n", "1979 328.83 210.49 118.34\n", "1980 340.95 200.92 140.05\n", "1981 341.96 198.53 143.45\n", "1982 355.17 210.23 144.91\n", "1983 374.03 226.76 147.26\n", "1984 415.88 243.71 172.18\n", "1985 440.43 271.80 168.61\n", "1986 492.37 302.20 190.18\n", "1987 522.07 324.58 197.52\n", "1988 530.25 328.69 201.57" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airrpm.groupby([airrpm.index.year]).sum().head(10)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.7 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "vscode": { "interpreter": { "hash": "f08154012ddadd8e950e6e9e035c7a7b32c136e7647e9b7c77e02eb723a8bedb" } } }, "nbformat": 4, "nbformat_minor": 5 }