Munging select rows in CSV files using Pandas
up vote
4
down vote
favorite
I'm learning Pandas. I have a project where I need to munge some CSV files and resave them as CSV. I can use dictionaries and CSV module, but decided to use DataFrames to get more exposure and practice with Pandas.
The task is to sum()
values for some keys while others are not summed. My solution was to add a column called "Line" where items with the same line number would be summed.
sales = [{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 2, "KEY": "Oranges", "AMOUNT": 5.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 4, "KEY": "Carrots", "AMOUNT": .99},
{"Line": 5, "KEY": "Carrots", "AMOUNT": .99},
]
df = pd.DataFrame(sales)
Now, Pandas groupby
might seem like a perfect solution, but since I need to export the munged file as CSV, this seemed like something I could do only to have to undo.
# Find duplicate Line entries
# Subset df into just duplicate `Line` values
df_tmp = df[df.duplicated(subset="Line", keep=False)]
# Save a list of Line numbers to sum
line_dups = df_tmp['Line'].unique()
for x in line_dups:
# Sum every line in the DF; one value sum is unchanged
# asum = df.loc[df['Line'] == x, 'AMOUNT'].sum()
# or
# Subset the subset
df_tmp2 = df_tmp[df_tmp["Line"] == x]
# sum the sub-subset
asum = df_tmp2['AMOUNT'].sum()
# set the value of all keys with the same Line value
df.loc[df['Line'] == x, 'AMOUNT'] = asum
# take the inverse of the duplicate subset on the original df
# Keep only the first duplicate line
df2 = df[~df.duplicated(subset="Line", keep='first')]
The solution is effective, df2
:
AMOUNT KEY Line
0 7.98 Apples 1
2 5.99 Oranges 2
3 5.98 Pears 3
5 0.99 Carrots 4
6 0.99 Carrots 5
And yet, this solution is sort of hairy with all of the temp dataframes. Maybe a more seasoned Pandas user will have something to add which might help me better understand Pandas?
python csv pandas
bumped to the homepage by Community♦ 2 days ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
add a comment |
up vote
4
down vote
favorite
I'm learning Pandas. I have a project where I need to munge some CSV files and resave them as CSV. I can use dictionaries and CSV module, but decided to use DataFrames to get more exposure and practice with Pandas.
The task is to sum()
values for some keys while others are not summed. My solution was to add a column called "Line" where items with the same line number would be summed.
sales = [{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 2, "KEY": "Oranges", "AMOUNT": 5.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 4, "KEY": "Carrots", "AMOUNT": .99},
{"Line": 5, "KEY": "Carrots", "AMOUNT": .99},
]
df = pd.DataFrame(sales)
Now, Pandas groupby
might seem like a perfect solution, but since I need to export the munged file as CSV, this seemed like something I could do only to have to undo.
# Find duplicate Line entries
# Subset df into just duplicate `Line` values
df_tmp = df[df.duplicated(subset="Line", keep=False)]
# Save a list of Line numbers to sum
line_dups = df_tmp['Line'].unique()
for x in line_dups:
# Sum every line in the DF; one value sum is unchanged
# asum = df.loc[df['Line'] == x, 'AMOUNT'].sum()
# or
# Subset the subset
df_tmp2 = df_tmp[df_tmp["Line"] == x]
# sum the sub-subset
asum = df_tmp2['AMOUNT'].sum()
# set the value of all keys with the same Line value
df.loc[df['Line'] == x, 'AMOUNT'] = asum
# take the inverse of the duplicate subset on the original df
# Keep only the first duplicate line
df2 = df[~df.duplicated(subset="Line", keep='first')]
The solution is effective, df2
:
AMOUNT KEY Line
0 7.98 Apples 1
2 5.99 Oranges 2
3 5.98 Pears 3
5 0.99 Carrots 4
6 0.99 Carrots 5
And yet, this solution is sort of hairy with all of the temp dataframes. Maybe a more seasoned Pandas user will have something to add which might help me better understand Pandas?
python csv pandas
bumped to the homepage by Community♦ 2 days ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
What do you mean that you need to undo the groupby to save your data to csv?
– mochi
Jun 4 at 4:58
I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
– xtian
Jun 6 at 9:04
Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
– mochi
Jun 7 at 1:42
add a comment |
up vote
4
down vote
favorite
up vote
4
down vote
favorite
I'm learning Pandas. I have a project where I need to munge some CSV files and resave them as CSV. I can use dictionaries and CSV module, but decided to use DataFrames to get more exposure and practice with Pandas.
The task is to sum()
values for some keys while others are not summed. My solution was to add a column called "Line" where items with the same line number would be summed.
sales = [{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 2, "KEY": "Oranges", "AMOUNT": 5.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 4, "KEY": "Carrots", "AMOUNT": .99},
{"Line": 5, "KEY": "Carrots", "AMOUNT": .99},
]
df = pd.DataFrame(sales)
Now, Pandas groupby
might seem like a perfect solution, but since I need to export the munged file as CSV, this seemed like something I could do only to have to undo.
# Find duplicate Line entries
# Subset df into just duplicate `Line` values
df_tmp = df[df.duplicated(subset="Line", keep=False)]
# Save a list of Line numbers to sum
line_dups = df_tmp['Line'].unique()
for x in line_dups:
# Sum every line in the DF; one value sum is unchanged
# asum = df.loc[df['Line'] == x, 'AMOUNT'].sum()
# or
# Subset the subset
df_tmp2 = df_tmp[df_tmp["Line"] == x]
# sum the sub-subset
asum = df_tmp2['AMOUNT'].sum()
# set the value of all keys with the same Line value
df.loc[df['Line'] == x, 'AMOUNT'] = asum
# take the inverse of the duplicate subset on the original df
# Keep only the first duplicate line
df2 = df[~df.duplicated(subset="Line", keep='first')]
The solution is effective, df2
:
AMOUNT KEY Line
0 7.98 Apples 1
2 5.99 Oranges 2
3 5.98 Pears 3
5 0.99 Carrots 4
6 0.99 Carrots 5
And yet, this solution is sort of hairy with all of the temp dataframes. Maybe a more seasoned Pandas user will have something to add which might help me better understand Pandas?
python csv pandas
I'm learning Pandas. I have a project where I need to munge some CSV files and resave them as CSV. I can use dictionaries and CSV module, but decided to use DataFrames to get more exposure and practice with Pandas.
The task is to sum()
values for some keys while others are not summed. My solution was to add a column called "Line" where items with the same line number would be summed.
sales = [{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 2, "KEY": "Oranges", "AMOUNT": 5.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 4, "KEY": "Carrots", "AMOUNT": .99},
{"Line": 5, "KEY": "Carrots", "AMOUNT": .99},
]
df = pd.DataFrame(sales)
Now, Pandas groupby
might seem like a perfect solution, but since I need to export the munged file as CSV, this seemed like something I could do only to have to undo.
# Find duplicate Line entries
# Subset df into just duplicate `Line` values
df_tmp = df[df.duplicated(subset="Line", keep=False)]
# Save a list of Line numbers to sum
line_dups = df_tmp['Line'].unique()
for x in line_dups:
# Sum every line in the DF; one value sum is unchanged
# asum = df.loc[df['Line'] == x, 'AMOUNT'].sum()
# or
# Subset the subset
df_tmp2 = df_tmp[df_tmp["Line"] == x]
# sum the sub-subset
asum = df_tmp2['AMOUNT'].sum()
# set the value of all keys with the same Line value
df.loc[df['Line'] == x, 'AMOUNT'] = asum
# take the inverse of the duplicate subset on the original df
# Keep only the first duplicate line
df2 = df[~df.duplicated(subset="Line", keep='first')]
The solution is effective, df2
:
AMOUNT KEY Line
0 7.98 Apples 1
2 5.99 Oranges 2
3 5.98 Pears 3
5 0.99 Carrots 4
6 0.99 Carrots 5
And yet, this solution is sort of hairy with all of the temp dataframes. Maybe a more seasoned Pandas user will have something to add which might help me better understand Pandas?
python csv pandas
python csv pandas
edited Apr 1 at 20:34
asked Apr 1 at 18:34
xtian
1686
1686
bumped to the homepage by Community♦ 2 days ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
bumped to the homepage by Community♦ 2 days ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
What do you mean that you need to undo the groupby to save your data to csv?
– mochi
Jun 4 at 4:58
I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
– xtian
Jun 6 at 9:04
Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
– mochi
Jun 7 at 1:42
add a comment |
What do you mean that you need to undo the groupby to save your data to csv?
– mochi
Jun 4 at 4:58
I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
– xtian
Jun 6 at 9:04
Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
– mochi
Jun 7 at 1:42
What do you mean that you need to undo the groupby to save your data to csv?
– mochi
Jun 4 at 4:58
What do you mean that you need to undo the groupby to save your data to csv?
– mochi
Jun 4 at 4:58
I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
– xtian
Jun 6 at 9:04
I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
– xtian
Jun 6 at 9:04
Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
– mochi
Jun 7 at 1:42
Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
– mochi
Jun 7 at 1:42
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
If you have the Line
column in your CSV itself, you can replace what you have with:
df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
If you have the Line
column in your CSV itself, you can replace what you have with:
df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()
add a comment |
up vote
0
down vote
If you have the Line
column in your CSV itself, you can replace what you have with:
df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()
add a comment |
up vote
0
down vote
up vote
0
down vote
If you have the Line
column in your CSV itself, you can replace what you have with:
df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()
If you have the Line
column in your CSV itself, you can replace what you have with:
df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()
answered Apr 4 at 18:59
RCA
1735
1735
add a comment |
add a comment |
Thanks for contributing an answer to Code Review Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f191013%2fmunging-select-rows-in-csv-files-using-pandas%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
What do you mean that you need to undo the groupby to save your data to csv?
– mochi
Jun 4 at 4:58
I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
– xtian
Jun 6 at 9:04
Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
– mochi
Jun 7 at 1:42