[ Python ] Speech Text Analysis Program

곽수진 2021. 9. 4. 18:50

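Let's write a program that analyzes the total number of words and the frequency of each word in former President Abraham Lincoln's Gettysburg Address.

The following is the original text of the Gettysburg Address, which Abraham Lincoln delivered in 1863.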

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that the nation might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we can not dedicate — we can not consecrate — we can not hallow — this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us — that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion — that we here highly resolve that these dead shall not have died in vain — that this nation, under God, shall have a new birth of freedom — and that government of the people, by the people, for the people, shall not perish from the earth.

▶ Copy the speech into Notepad, tidy it so that no blank lines remain between paragraphs, and save it as 'input.txt'

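# Open the input file for reading and the output file for writing
infile = open('input.txt', 'r', encoding='UTF8')
outfile = open('output.txt', 'w', encoding='UTF8')

word_dic = {}      # maps each word to its frequency
total_count = 0    # total number of words seen

for line in infile:
    line = line.rstrip()        # drop the trailing newline
    word_list = line.split()    # split the paragraph on whitespace

    for word in word_list:
        # Preprocess: lowercase, then remove attached commas and periods
        word = word.lower()
        word = word.strip(',')
        word = word.strip('.')

        if word in word_dic:
            word_dic[word] += 1
            total_count += 1

        else:
            word_dic[word] = 1
            total_count += 1

# Write one "word count" line per word, in alphabetical order
result = ''
for key in sorted(word_dic.keys()):
    result = key + ' ' + str(word_dic[key]) + '\n'
    outfile.write(result)

print('Total word count =', total_count)

outfile.close()
infile.close()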

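▶ infile = open('input.txt', 'r') : opens input.txt in read mode

→ outfile = open('output.txt', 'w') : opens output.txt in write mode

▶ word_dic = {} : creates an empty dictionary named word_dic that stores each word extracted from the input file together with its frequency

★ Without encoding='UTF8', Python uses the system default codec (cp949 on Korean Windows), so reading a UTF-8 file that contains non-ASCII characters can raise an error ★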
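The effect of the encoding argument can be checked with a small experiment. The sketch below is an illustration, not part of the original program: it writes a line containing a long dash (as the speech does) to a temporary file as UTF-8, then reads it back once with encoding='UTF8' and once with encoding='cp949' to imitate the default on a Korean Windows system. The helper name try_read and the sample text are assumptions made for this example.

```python
import os
import tempfile

# Sample line containing a long dash (U+2014), like the speech text
text = 'we can not dedicate \u2014 we can not consecrate'

def try_read(path, encoding):
    """Return the file contents, or None if decoding fails."""
    try:
        with open(path, 'r', encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError:
        return None

# Write the sample as UTF-8 to a temporary file
with tempfile.NamedTemporaryFile('w', encoding='UTF8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

utf8_result = try_read(path, 'UTF8')
cp949_result = try_read(path, 'cp949')
os.remove(path)

print(utf8_result == text)   # the UTF-8 read gives back the original text
print(cp949_result == text)  # the cp949 read does not
```

Passing the encoding explicitly to both open() calls keeps the program independent of the operating system's default.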

for line in infile:
    line = line.rstrip()
    word_list = line.split()

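▶ Reads the speech from the input file one paragraph at a time (up to each line break) and splits it into words

→ The for loop fetches one line per iteration, rstrip() removes the trailing newline, and split() breaks the paragraph into a list of words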
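To see what the two calls do on their own, here is a minimal sketch using a hard-coded sample line instead of the file. The sample string is an assumption for illustration; only rstrip() and split() come from the program.

```python
# One line as it would come out of the file, including the trailing newline
line = 'Four score and seven years ago\n'

stripped = line.rstrip()      # removes the trailing newline and any trailing spaces
word_list = stripped.split()  # splits on runs of whitespace

print(repr(stripped))  # 'Four score and seven years ago'
print(word_list)       # ['Four', 'score', 'and', 'seven', 'years', 'ago']
```

Since split() with no argument already ignores leading and trailing whitespace, the rstrip() call is not strictly required here, but it makes the intent explicit.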

for word in word_list:
    word = word.lower()
    word = word.strip(',')
    word = word.strip('.')

▶ Preprocesses the data: converts every word to lowercase and removes any comma (,) or period (.) attached to the word
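The preprocessing step can be tried on a few tokens in isolation. In the sketch below, normalize is a hypothetical helper name, and it combines the two strip() calls into one strip(',.'), which behaves the same for the tokens in this speech. Note that the long dashes in the speech are surrounded by spaces, so split() returns them as separate tokens and the program counts them as words; the last two lines show one possible way to drop such tokens, which is an optional extra and not something the original program does.

```python
def normalize(word):
    """Lowercase a token and strip commas and periods from both ends."""
    return word.lower().strip(',.')

# A few tokens as split() would return them; '\u2014' is a standalone long dash
tokens = ['Liberty,', 'equal.', 'But,', 'battle-field', '\u2014']
cleaned = [normalize(t) for t in tokens]

print(cleaned[:4])  # ['liberty', 'equal', 'but', 'battle-field']
# The dash token is unchanged, because strip(',.') only removes commas and periods

# Optional extra step (assumption): keep only tokens containing a letter or digit
words_only = [w for w in cleaned if any(ch.isalnum() for ch in w)]
print(words_only)   # ['liberty', 'equal', 'but', 'battle-field']
```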

if word in word_dic:
    word_dic[word] += 1
    total_count += 1

else:
    word_dic[word] = 1
    total_count += 1

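▶ Counts the frequency of each word

→ If the word is not yet in the word_dic dictionary, it is added as a new key with the value 1

→ If the word is already in word_dic, the value for that key is increased by 1

→ total_count, which tracks the total number of words, is increased by 1 in both cases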
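The same tally can be written more compactly. The following sketch, using a short hard-coded word list as an assumed input, replaces the if/else with dict.get() and then compares the result against collections.Counter from the standard library. It is an alternative formulation, not the code used in this post.

```python
from collections import Counter

# Already-preprocessed words from the closing phrase of the speech
words = ['of', 'the', 'people', 'by', 'the', 'people', 'for', 'the', 'people']

# Same logic as the if/else version: start from 0 when the key is missing
word_dic = {}
for word in words:
    word_dic[word] = word_dic.get(word, 0) + 1
total_count = len(words)  # every word is counted exactly once

print(word_dic)     # {'of': 1, 'the': 3, 'people': 3, 'by': 1, 'for': 1}
print(total_count)  # 9

# Counter produces the same word-to-frequency mapping
print(Counter(words) == word_dic)  # True
```

Because total_count goes up by 1 in both branches, it always equals the number of words processed, which is why len(words) gives the same value here.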

result = ''
for key in sorted(word_dic.keys()):
    # A space separates the word from its count in the output file
    result = key + ' ' + str(word_dic[key]) + '\n'
    outfile.write(result)

print('Total word count =', total_count)

▶ Writes the contents of the word_dic dictionary to output.txt in alphabetical order of the words, then prints the total word count to the screen
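For a frequency analysis it is often more useful to list the most common words first. The sketch below is one possible variation, not the post's method: it sorts by count in descending order and breaks ties alphabetically. The small word_dic and the io.StringIO object standing in for the output file are assumptions so that the example runs without any files.

```python
import io

word_dic = {'the': 3, 'people': 3, 'of': 1, 'by': 1, 'for': 1}

outfile = io.StringIO()  # stand-in for open('output.txt', 'w', encoding='UTF8')

# Most frequent first; words with equal counts are ordered alphabetically
for key, count in sorted(word_dic.items(), key=lambda item: (-item[1], item[0])):
    outfile.write(f'{key} {count}\n')

report = outfile.getvalue()
print(report)
# people 3
# the 3
# by 1
# for 1
# of 1
```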

▶ outfile.close() : closes the output file

→ infile.close() : closes the input file
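The explicit close() calls can be avoided with a with statement, which closes both files automatically even if an error occurs partway through. The sketch below wraps the whole procedure in a hypothetical function count_words and runs it on a one-sentence sample written to a temporary directory; the function name, the paths, and the sample text are assumptions for illustration.

```python
import os
import tempfile

def count_words(in_path, out_path):
    """Count words in in_path, write 'word count' lines to out_path, return the total."""
    word_dic = {}
    total_count = 0
    with open(in_path, 'r', encoding='UTF8') as infile, \
         open(out_path, 'w', encoding='UTF8') as outfile:
        for line in infile:
            for word in line.split():
                word = word.lower().strip(',.')
                word_dic[word] = word_dic.get(word, 0) + 1
                total_count += 1
        for key in sorted(word_dic):
            outfile.write(f'{key} {word_dic[key]}\n')
    # Both files are closed here, without calling close() explicitly
    return total_count

# Self-contained demo using a temporary directory instead of the real input.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    in_path = os.path.join(tmp_dir, 'input.txt')
    out_path = os.path.join(tmp_dir, 'output.txt')
    with open(in_path, 'w', encoding='UTF8') as f:
        f.write('We are met on a great battle-field of that war.\n')
    total = count_words(in_path, out_path)
    with open(out_path, 'r', encoding='UTF8') as f:
        output = f.read()

print(total)  # 10
```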
