Munging select rows in CSV files using Pandas











up vote
4
down vote

favorite












I'm learning Pandas. I have a project where I need to munge some CSV files and resave them as CSV. I can use dictionaries and CSV module, but decided to use DataFrames to get more exposure and practice with Pandas.



The task is to sum() values for some keys while others are not summed. My solution was to add a column called "Line" where items with the same line number would be summed.



sales = [{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 2, "KEY": "Oranges", "AMOUNT": 5.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 4, "KEY": "Carrots", "AMOUNT": .99},
{"Line": 5, "KEY": "Carrots", "AMOUNT": .99},
]

df = pd.DataFrame(sales)


Now, Pandas groupby might seem like a perfect solution, but since I need to export the munged file as CSV, this seemed like something I could do only to have to undo.



# Find duplicate Line entries
# Subset df into just duplicate `Line` values
df_tmp = df[df.duplicated(subset="Line", keep=False)]

# Save a list of Line numbers to sum
line_dups = df_tmp['Line'].unique()

for x in line_dups:
# Sum every line in the DF; one value sum is unchanged
# asum = df.loc[df['Line'] == x, 'AMOUNT'].sum()
# or
# Subset the subset
df_tmp2 = df_tmp[df_tmp["Line"] == x]
# sum the sub-subset
asum = df_tmp2['AMOUNT'].sum()
# set the value of all keys with the same Line value
df.loc[df['Line'] == x, 'AMOUNT'] = asum
# take the inverse of the duplicate subset on the original df
# Keep only the first duplicate line
df2 = df[~df.duplicated(subset="Line", keep='first')]


The solution is effective, df2:





  AMOUNT    KEY     Line
0 7.98 Apples 1
2 5.99 Oranges 2
3 5.98 Pears 3
5 0.99 Carrots 4
6 0.99 Carrots 5


And yet, this solution is sort of hairy with all of the temp dataframes. Maybe a more seasoned Pandas user will have something to add which might help me better understand Pandas?










share|improve this question
















bumped to the homepage by Community 2 days ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.















  • What do you mean that you need to undo the groupby to save your data to csv?
    – mochi
    Jun 4 at 4:58










  • I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
    – xtian
    Jun 6 at 9:04










  • Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
    – mochi
    Jun 7 at 1:42















up vote
4
down vote

favorite












I'm learning Pandas. I have a project where I need to munge some CSV files and resave them as CSV. I can use dictionaries and CSV module, but decided to use DataFrames to get more exposure and practice with Pandas.



The task is to sum() values for some keys while others are not summed. My solution was to add a column called "Line" where items with the same line number would be summed.



sales = [{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 2, "KEY": "Oranges", "AMOUNT": 5.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 4, "KEY": "Carrots", "AMOUNT": .99},
{"Line": 5, "KEY": "Carrots", "AMOUNT": .99},
]

df = pd.DataFrame(sales)


Now, Pandas groupby might seem like a perfect solution, but since I need to export the munged file as CSV, this seemed like something I could do only to have to undo.



# Find duplicate Line entries
# Subset df into just duplicate `Line` values
df_tmp = df[df.duplicated(subset="Line", keep=False)]

# Save a list of Line numbers to sum
line_dups = df_tmp['Line'].unique()

for x in line_dups:
# Sum every line in the DF; one value sum is unchanged
# asum = df.loc[df['Line'] == x, 'AMOUNT'].sum()
# or
# Subset the subset
df_tmp2 = df_tmp[df_tmp["Line"] == x]
# sum the sub-subset
asum = df_tmp2['AMOUNT'].sum()
# set the value of all keys with the same Line value
df.loc[df['Line'] == x, 'AMOUNT'] = asum
# take the inverse of the duplicate subset on the original df
# Keep only the first duplicate line
df2 = df[~df.duplicated(subset="Line", keep='first')]


The solution is effective, df2:





  AMOUNT    KEY     Line
0 7.98 Apples 1
2 5.99 Oranges 2
3 5.98 Pears 3
5 0.99 Carrots 4
6 0.99 Carrots 5


And yet, this solution is sort of hairy with all of the temp dataframes. Maybe a more seasoned Pandas user will have something to add which might help me better understand Pandas?










share|improve this question
















bumped to the homepage by Community 2 days ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.















  • What do you mean that you need to undo the groupby to save your data to csv?
    – mochi
    Jun 4 at 4:58










  • I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
    – xtian
    Jun 6 at 9:04










  • Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
    – mochi
    Jun 7 at 1:42













up vote
4
down vote

favorite









up vote
4
down vote

favorite











I'm learning Pandas. I have a project where I need to munge some CSV files and resave them as CSV. I can use dictionaries and CSV module, but decided to use DataFrames to get more exposure and practice with Pandas.



The task is to sum() values for some keys while others are not summed. My solution was to add a column called "Line" where items with the same line number would be summed.



sales = [{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 2, "KEY": "Oranges", "AMOUNT": 5.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 4, "KEY": "Carrots", "AMOUNT": .99},
{"Line": 5, "KEY": "Carrots", "AMOUNT": .99},
]

df = pd.DataFrame(sales)


Now, Pandas groupby might seem like a perfect solution, but since I need to export the munged file as CSV, this seemed like something I could do only to have to undo.



# Find duplicate Line entries
# Subset df into just duplicate `Line` values
df_tmp = df[df.duplicated(subset="Line", keep=False)]

# Save a list of Line numbers to sum
line_dups = df_tmp['Line'].unique()

for x in line_dups:
# Sum every line in the DF; one value sum is unchanged
# asum = df.loc[df['Line'] == x, 'AMOUNT'].sum()
# or
# Subset the subset
df_tmp2 = df_tmp[df_tmp["Line"] == x]
# sum the sub-subset
asum = df_tmp2['AMOUNT'].sum()
# set the value of all keys with the same Line value
df.loc[df['Line'] == x, 'AMOUNT'] = asum
# take the inverse of the duplicate subset on the original df
# Keep only the first duplicate line
df2 = df[~df.duplicated(subset="Line", keep='first')]


The solution is effective, df2:





  AMOUNT    KEY     Line
0 7.98 Apples 1
2 5.99 Oranges 2
3 5.98 Pears 3
5 0.99 Carrots 4
6 0.99 Carrots 5


And yet, this solution is sort of hairy with all of the temp dataframes. Maybe a more seasoned Pandas user will have something to add which might help me better understand Pandas?










share|improve this question















I'm learning Pandas. I have a project where I need to munge some CSV files and resave them as CSV. I can use dictionaries and CSV module, but decided to use DataFrames to get more exposure and practice with Pandas.



The task is to sum() values for some keys while others are not summed. My solution was to add a column called "Line" where items with the same line number would be summed.



sales = [{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
{"Line": 2, "KEY": "Oranges", "AMOUNT": 5.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
{"Line": 4, "KEY": "Carrots", "AMOUNT": .99},
{"Line": 5, "KEY": "Carrots", "AMOUNT": .99},
]

df = pd.DataFrame(sales)


Now, Pandas groupby might seem like a perfect solution, but since I need to export the munged file as CSV, this seemed like something I could do only to have to undo.



# Find duplicate Line entries
# Subset df into just duplicate `Line` values
df_tmp = df[df.duplicated(subset="Line", keep=False)]

# Save a list of Line numbers to sum
line_dups = df_tmp['Line'].unique()

for x in line_dups:
# Sum every line in the DF; one value sum is unchanged
# asum = df.loc[df['Line'] == x, 'AMOUNT'].sum()
# or
# Subset the subset
df_tmp2 = df_tmp[df_tmp["Line"] == x]
# sum the sub-subset
asum = df_tmp2['AMOUNT'].sum()
# set the value of all keys with the same Line value
df.loc[df['Line'] == x, 'AMOUNT'] = asum
# take the inverse of the duplicate subset on the original df
# Keep only the first duplicate line
df2 = df[~df.duplicated(subset="Line", keep='first')]


The solution is effective, df2:





  AMOUNT    KEY     Line
0 7.98 Apples 1
2 5.99 Oranges 2
3 5.98 Pears 3
5 0.99 Carrots 4
6 0.99 Carrots 5


And yet, this solution is sort of hairy with all of the temp dataframes. Maybe a more seasoned Pandas user will have something to add which might help me better understand Pandas?







python csv pandas






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Apr 1 at 20:34

























asked Apr 1 at 18:34









xtian

1686




1686





bumped to the homepage by Community 2 days ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.







bumped to the homepage by Community 2 days ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.














  • What do you mean that you need to undo the groupby to save your data to csv?
    – mochi
    Jun 4 at 4:58










  • I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
    – xtian
    Jun 6 at 9:04










  • Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
    – mochi
    Jun 7 at 1:42


















  • What do you mean that you need to undo the groupby to save your data to csv?
    – mochi
    Jun 4 at 4:58










  • I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
    – xtian
    Jun 6 at 9:04










  • Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
    – mochi
    Jun 7 at 1:42
















What do you mean that you need to undo the groupby to save your data to csv?
– mochi
Jun 4 at 4:58




What do you mean that you need to undo the groupby to save your data to csv?
– mochi
Jun 4 at 4:58












I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
– xtian
Jun 6 at 9:04




I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained.
– xtian
Jun 6 at 9:04












Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
– mochi
Jun 7 at 1:42




Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work.
– mochi
Jun 7 at 1:42










1 Answer
1






active

oldest

votes

















up vote
0
down vote













If you have the Line column in your CSV itself, you can replace what you have with:



df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()





share|improve this answer





















    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "196"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f191013%2fmunging-select-rows-in-csv-files-using-pandas%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    0
    down vote













    If you have the Line column in your CSV itself, you can replace what you have with:



    df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()





    share|improve this answer

























      up vote
      0
      down vote













      If you have the Line column in your CSV itself, you can replace what you have with:



      df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()





      share|improve this answer























        up vote
        0
        down vote










        up vote
        0
        down vote









        If you have the Line column in your CSV itself, you can replace what you have with:



        df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()





        share|improve this answer












        If you have the Line column in your CSV itself, you can replace what you have with:



        df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Apr 4 at 18:59









        RCA

        1735




        1735






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Code Review Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f191013%2fmunging-select-rows-in-csv-files-using-pandas%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Ellipse (mathématiques)

            Quarter-circle Tiles

            Mont Emei