Multiply two columns of Census data and groupby

I have census data that looks like this

    State    County   TotalPop  Hispanic  White  Black  Native  Asian  Pacific
    Alabama  Autauga      1948       0.9   87.4    7.7     0.3    0.6      0.0
    Alabama  Autauga      2156       0.8   40.4   53.3     0.0    2.3      0.0
    Alabama  Autauga      2968       0.0   74.5   18.6     0.5    1.4      0.3
    ...

Two things to note: (1) there can be multiple rows per county, and (2) the racial data is given as percentages, but sometimes I want the actual population counts.

Getting the total population for one race translates to (in pseudo-pandas):

(census.TotalPop * census.Hispanic / 100).groupby("State").sum()

But this raises KeyError: 'State', because the product of TotalPop and Hispanic is a pandas Series rather than the original DataFrame, so it has no State column to group on.
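One possible way around this, sketched here on a tiny made-up frame (the three rows and the name `hispanic_pop` are illustrative and not taken from the real file): a Series can be grouped by another Series that shares its index, so the product can be grouped by the State column directly without adding anything to the DataFrame.

```python
import pandas as pd

# Tiny illustrative stand-in for the census data (values are made up).
census = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "TotalPop": [1000, 2000, 500],
    "Hispanic": [10.0, 5.0, 20.0],
})

# The product is a Series; pass the State column itself (not its name) as the
# grouping key. pandas aligns the two on their shared row index.
hispanic_pop = (
    (census["TotalPop"] * census["Hispanic"] / 100)
    .groupby(census["State"])
    .sum()
)
print(hispanic_pop)
```

The only change from the failing expression is that the grouping key is the column `census["State"]` rather than the string `"State"`.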

As suggested by this Stack Overflow question, I can create a new column for each race:

census["HispanicPop"] = census.TotalPop * census.Hispanic / 100

This works but feels messy: it adds six columns that I only need for a single plot. Here is the data (I'm using "acs2015_census_tract_data.csv") and here is my implementation:

Working Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%matplotlib inline

census = pd.read_csv("data/acs2015_census_tract_data.csv")

races = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']

# Creating a total population column for each race
# FIXME: this feels inefficient.  Does Pandas have another option?
for race in races:
    census[race + "_pop"] = (census[race] * census.TotalPop) / 100

# current racial population being plotted
race = races[0]

# Sum the populations in each state
race_pops = census.groupby("State")[race + "_pop"].sum().sort_values(ascending=False)

# Plotting the results for each state
fig, axarr = plt.subplots(2, 2, figsize=(18, 12))
fig.suptitle("{} population in all 52 states".format(race), fontsize=18)

# Splitting the plot into 4 subplots so I can fit all 52 States
data = race_pops.head(13)
sns.barplot(x=data.values, y=data.index, ax=axarr[0][0])

data = race_pops.iloc[13:26]
sns.barplot(x=data.values, y=data.index, ax=axarr[0][1]).set(ylabel="")

data = race_pops.iloc[26:39]
sns.barplot(x=data.values, y=data.index, ax=axarr[1][0])

data = race_pops.tail(13)
_ = sns.barplot(x=data.values, y=data.index, ax=axarr[1][1]).set(ylabel="")
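If the extra `_pop` columns are the main concern, one alternative worth considering is to scale all the percentage columns in a single step and aggregate the result, leaving `census` untouched. The sketch below is only an illustration on made-up numbers with two race columns; `race_pops_by_state` is a hypothetical name, and the real file would use the full `races` list.

```python
import pandas as pd

# Made-up stand-in for the census data, with only two race columns.
census = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "TotalPop": [1000, 2000, 500],
    "Hispanic": [10.0, 5.0, 20.0],
    "White": [50.0, 25.0, 40.0],
})
races = ["Hispanic", "White"]

# Multiply every percentage column by TotalPop row by row (axis=0 aligns on
# the row index), convert from percent, then sum per state.
race_pops_by_state = (
    census[races]
    .mul(census["TotalPop"], axis=0)
    .div(100)
    .groupby(census["State"])
    .sum()
)

# Pick one race for plotting, sorted as in the code above.
race = races[0]
race_pops = race_pops_by_state[race].sort_values(ascending=False)
print(race_pops)
```

The intermediate frame holds one column per race, so switching the plotted race only means selecting a different column rather than recomputing anything.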

edited May 16 at 23:02

asked May 16 at 4:51

Patrick Stetz

245

Hmm... are you going to keep it here? it appears you posted working code...
– Sᴀᴍ Onᴇᴌᴀ
May 16 at 16:45

Yeah I think this question belongs here, what do you think? My working code wasn't so obvious the first time. Sorry for the confusion
– Patrick Stetz
May 16 at 16:48

@Graipher In its current state, the code seems to work and he's asking for a better approach. This seems sufficiently on-topic to me.
– scnerd
May 16 at 17:07

Hi Sam, the variable census is the data I'm looking at and can be found here. I'll edit my question to include this too
– Patrick Stetz
May 16 at 17:22

@scnerd I agree, the current question is on-topic.
– Graipher
May 16 at 18:29

python python-3.x csv pandas

1 Answer

Since you only need the total population values for these plots, it is not worth adding the columns to your census DataFrame. I would package the plotting into a function that builds a temporary DataFrame, uses it, and lets it be discarded once the plotting is complete.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%matplotlib inline


def plot_populations(census, race):
    # Group the data
    race_pops = pd.DataFrame(data={
        'State': census['State'],
        'Pop': census[race] * census['TotalPop'] / 100
    }).groupby('State')['Pop'].sum().sort_values(ascending=False)

    # Plot the results
    fig, axarr = plt.subplots(2, 2, figsize=(18, 12))
    fig.suptitle("{} population in all 52 states".format(race), fontsize=18)
    for ix, ax in enumerate(axarr.reshape(-1)):
        data = race_pops.iloc[ix*len(race_pops)//4:(ix+1)*len(race_pops)//4]
        sns.barplot(x=data.values, y=data.index, ax=ax)
        if ix % 2 != 0:
            ax.set_ylabel('')


census = pd.read_csv("acs2015_census_tract_data.csv")

races = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']
# current racial population being plotted
race = races[0]

plot_populations(census, race)
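The slicing expression inside the loop replaces the hard-coded 13-row chunks, so it should keep working if the number of rows is not a multiple of four. As a small check of that arithmetic, here is a sketch with a hypothetical helper `split_into_chunks` and ten made-up values; neither is part of the function above.

```python
import pandas as pd


def split_into_chunks(series, n_chunks):
    """Split a Series into n_chunks consecutive slices using integer arithmetic."""
    n = len(series)
    return [
        series.iloc[i * n // n_chunks:(i + 1) * n // n_chunks]
        for i in range(n_chunks)
    ]


# Ten made-up values, deliberately not divisible by 4.
values = pd.Series(range(10), index=[f"region_{i}" for i in range(10)])
chunks = split_into_chunks(values, 4)

# Each row lands in exactly one chunk; chunk sizes differ by at most one.
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [2, 3, 2, 3]
```

Because each chunk's end index is computed with the same formula as the next chunk's start index, no row is skipped or repeated.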

answered May 17 at 8:56

JahKnows

1011
